# Project: Wrangle and Analyse Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** each piece of data is gathered with a different method.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [1]:
# Import the modules I require
import subprocess
import sys
import pandas as pd
import json

#### Some modules are not imported, yet
##### Why?
* they are third-party packages, not part of the Python standard library
    * so they may not be installed in every environment

##### Which modules, exactly?
* `tweepy`
* `requests`

##### What then?
* import said modules inside a `try`-`except` block
* a function, `install`, will fire when the `try` block fails
* the flow is as follows:

    ```
    try:
        import module
    except ImportError:
        install(module)
        import module
    else:
        # do something only if `try` succeeds
    finally:
        # do something whether or not an error was raised
    ```
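
The flow above can be wrapped in one reusable helper. The following is a minimal sketch, not the notebook's own method: the names `pip_install` and `ensure_module` are hypothetical, and it assumes the pip package name matches the import name (true for `requests` and `tweepy`, but not for every package).

```python
import importlib
import subprocess
import sys


def pip_install(name):
    """Install `name` with pip, using the interpreter that runs this code."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", name])


def ensure_module(name, installer=pip_install):
    """Import `name`, installing it first if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        installer(name)
        # make sure the import system notices the freshly installed package
        importlib.invalidate_caches()
        return importlib.import_module(name)


# `json` ships with Python, so no installation is triggered here
json_module = ensure_module("json")
print(json_module.__name__)
```

With such a helper, `requests = ensure_module("requests")` would replace a whole `try`-`except` cell.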

In [2]:
def install(mod):
    """Install a module using `pip`.

    Accepts one argument, `mod` (type `str`): the name
    of the module to install.

    Returns: None
    """
    # run pip with the current interpreter so the module is installed
    # into the environment this notebook is using
    subprocess.check_call([sys.executable, "-m", "pip", "install", mod])




In [3]:
#import requests

try:
    import requests
except ImportError:
    install('requests')
    import requests

# reaching this line means the import succeeded
print('Successfully imported `requests` module...')

Successfully imported `requests` module...


In [4]:
#import tweepy

try:
    import tweepy
except ImportError:
    install('tweepy')
    import tweepy

# reaching this line means the import succeeded
print('Successfully imported `tweepy` module...')

Successfully imported `tweepy` module...


In [5]:
#Constants
with open('./keys/twtr_api_keys.json', 'r') as f:
    data = json.load(f)
CONSUMER_KEY = data['api_key']
CONSUMER_SECRET = data['api_key_secret']
ACCESS_TOKEN = data['access_key']
ACCESS_SECRET = data['access_token']

In [6]:
#download twtr archive data then upload it to workspace
twtr_archive = pd.read_csv('twitter_archive_enhanced.csv', sep=',')

In [7]:
#view the file
twtr_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [None]:
#a way to download twtr archive data, write and save it to file
# w/o physically downloading then uploading it

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv'
r = requests.get(url)
with open('twitter_archive_enhanced.csv', 'wb') as f:
    f.write(r.content)

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [8]:
#use `requests` to download tweet image prediction data & save to file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
im = requests.get(url)

with open('image_predictions.tsv', 'wb') as f:
    f.write(im.content)
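
The download cell above writes whatever the server returns, including an error page if the request fails. A hedged sketch of a more defensive variant follows; the function name `download` and its `get` parameter are hypothetical, introduced only so the function can be exercised without network access.

```python
def download(url, path, get=None, timeout=30):
    """Download `url` to `path` and return the number of bytes written."""
    if get is None:
        # imported lazily so the function can be exercised with a stand-in `get`
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # raise an exception on 4xx/5xx instead of saving an error page to disk
    response.raise_for_status()
    with open(path, "wb") as f:
        return f.write(response.content)


# usage in this notebook would look like:
# download(url, 'image_predictions.tsv')
```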

In [9]:
#view the file
img_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
img_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [10]:
#set up the api object
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


In [12]:
#query additional data via twtr api
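# NOTE: mode 'a' appends, so re-running this cell adds the same tweets again
with open('tweet_json.txt', 'a', encoding='utf8') as f:
    for tweet_id in twtr_archive['tweet_id']:
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            # one JSON document per line
            json.dump(tweet._json, f)
            f.write('\n')
        except Exception:
            # skip tweets that cannot be retrieved (e.g. deleted tweets)
            continue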

Rate limit reached. Sleeping for: 736
Rate limit reached. Sleeping for: 736
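
The query loop silently skips every failure, so it is hard to tell afterwards which tweets are missing and why. One possible alternative, sketched under assumptions: the helper `fetch_tweets` and its `fetch` parameter are hypothetical, and the file is opened in write mode so that a re-run replaces the previous results instead of appending to them.

```python
import json


def fetch_tweets(tweet_ids, fetch, out_path):
    """Write one JSON document per line; return {tweet_id: error} for failures."""
    failed = {}
    # 'w' truncates the file, so repeated runs do not accumulate duplicates
    with open(out_path, "w", encoding="utf8") as f:
        for tweet_id in tweet_ids:
            try:
                payload = fetch(tweet_id)
            except Exception as exc:
                # remember why this tweet could not be retrieved
                failed[tweet_id] = repr(exc)
                continue
            f.write(json.dumps(payload) + "\n")
    return failed


# usage in this notebook would look like:
# failed = fetch_tweets(
#     twtr_archive['tweet_id'],
#     lambda i: api.get_status(i, tweet_mode='extended')._json,
#     'tweet_json.txt',
# )
```

The returned dictionary can then be inspected to see how many tweets were deleted or otherwise unavailable.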


In [49]:
#append each tweet into a list
with open('tweet_json.txt', 'r', encoding='utf8') as f:
    # iterate over the file directly: calling f.readlines() first would
    # exhaust the file and leave the comprehension with nothing to read
    tweet_list = [json.loads(line) for line in f]



In [50]:
# alternative way to append each tweet into a list
tweet_list = []

with open('tweet_json.txt', 'r') as f:
    for i in f:
        try:
            tweet = json.loads(i)
            tweet_list.append(tweet)
        except json.JSONDecodeError:
            # skip lines that are not valid JSON
            continue
        

In [52]:
#check length of `tweet_list`
len(tweet_list) #6378 obs

6378

In [53]:
#create tweets to df
df_tweets = pd.DataFrame()

#add variables and values to df
df_tweets['tweet_id'] = list(map(lambda tweet: tweet['id'], tweet_list))
df_tweets['retweet_count'] = list(map(lambda tweet: tweet['retweet_count'], tweet_list))
df_tweets['favorite_count'] = list(map(lambda tweet: tweet['favorite_count'], tweet_list))

In [56]:
#check if df_tweets exists and that it is not empty
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6378 entries, 0 to 6377
Data columns (total 3 columns):
tweet_id          6378 non-null int64
retweet_count     6378 non-null int64
favorite_count    6378 non-null int64
dtypes: int64(3)
memory usage: 149.6 KB


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
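
The first key point (original ratings that have images) can be expressed as a boolean filter. This is a sketch on hypothetical toy data, assuming that a non-null `retweeted_status_id` marks a retweet and that a tweet has an image when its `tweet_id` appears in the image predictions table.

```python
import pandas as pd

# hypothetical toy data using the column names of the real tables
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],
})
images = pd.DataFrame({"tweet_id": [1, 2]})

# a retweet carries the ID of the tweet it repeats; originals have a null there
is_original = archive["retweeted_status_id"].isna()
# keep only tweets for which an image prediction exists
has_image = archive["tweet_id"].isin(images["tweet_id"])

originals = archive[is_original & has_image]
print(originals["tweet_id"].tolist())
```

Here tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image.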



### Method

* will assess data sets one by one
* will perform the following on each data set:
    * under the heading `Structure`:
        * visual assessment
        * summary of programmatic assessment
        * list of properties and/or attributes
    * list of quality and tidiness issues

#### Dataset 1: Dataframe `twtr_archive` that contains archive data

In [57]:
twtr_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [59]:
twtr_archive.shape

(2356, 17)

In [60]:
twtr_archive.duplicated().value_counts()

False    2356
dtype: int64

In [61]:
twtr_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Structure

##### Overall

* 2356 observations
    
* 17 variables:
    * 4 of type `float`
    * 3 of type `int`
    * 10 of type `str`
* time stamps are in `str` format:
    * `timestamp`
    * `retweeted_status_timestamp`
* 5 unique IDs:
    * `tweet_id`
    * `in_reply_to_status_id`
    * `in_reply_to_user_id`
    * `retweeted_status_id`
    * `retweeted_status_user_id`
* the developmental stage of a dog is presented qualitatively using the variables:
    * `pupper` -> young dog
    * `puppo` -> adolescent dog, between `pupper` and `doggo`
    * `doggo` -> adult dog
* whether a dog has notably abundant fur is recorded under the `floofer` variable
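
Because the dog's age stage is spread over several columns, one candidate tidiness fix is to collapse them into a single variable. The sketch below uses hypothetical toy rows and assumes that an absent stage is stored as the string `'None'`; the new column name `stage` is also an assumption. `floofer` is left untouched here because the notes above treat it as a separate attribute.

```python
import pandas as pd

stage_cols = ["doggo", "pupper", "puppo"]

# hypothetical toy rows; 'None' is assumed to mean "stage not mentioned"
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo": ["doggo", "None", "None", "doggo"],
    "pupper": ["None", "pupper", "None", "pupper"],
    "puppo": ["None", "None", "None", "None"],
})


def join_stages(row):
    """Combine the stages mentioned in one tweet into a single value."""
    found = [value for value in row if value != "None"]
    return ",".join(found) if found else None


df["stage"] = df[stage_cols].apply(join_stages, axis=1)
tidy = df.drop(columns=stage_cols)
print(tidy)
```

Tweets that mention two stages keep both, joined by a comma, so that no information is lost before a decision is made on how to handle them.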
    

##### Missing and null values
* 6 out of 17 variables have null values
* Case(s) in point:
    * 2278 values missing in the `in_reply_to_status_id` and `in_reply_to_user_id` variables
    * 2175 values missing in the `retweeted_status_id`, `retweeted_status_user_id`  and `retweeted_status_timestamp` variables 
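
The counts above follow from subtracting the non-null counts reported by `info()` from the 2356 observations (for example, 2356 - 78 = 2278). They can also be computed directly; the following sketch uses a hypothetical four-row sample rather than the real table.

```python
import pandas as pd

# hypothetical sample with the same kind of gaps as the archive table
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [None, 7.0, None, None],
    "expanded_urls": ["a", None, "c", "d"],
})

# count nulls per column, then keep only the columns that have any
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `twtr_archive` would list every variable with null values alongside its count.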

##### Duplicated observations
* `twtr_archive` has no duplicated observations

##### Multiple values for a variable
* observations in `twtr_archive` have a single value per variable
     

##### Quality and tidiness

###### Quality
* 
* 
* 
* 
* 
* 
* 
* 

###### Tidiness
* 
* 

#### Dataset 2: Dataframe `img_predictions` that contains image data

In [63]:
img_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [64]:
img_predictions.shape

(2075, 12)

In [65]:
img_predictions.duplicated().value_counts()

False    2075
dtype: int64

In [66]:
img_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Structure

##### Overall

* 2075 observations   
* 12 variables:
    * 3 of type `float`
    * 2 of type `int`
    * 4 of type `str`
    * 3 of type `bool`
* 1 unique ID: `tweet_id`
* the predicted breed (or object) in a photo is presented using the variable `pN`, where `N` (1, 2 or 3) is the rank of the prediction, from most to least confident
    * example: `p1` -> the top prediction
* the confidence that the prediction in `pN` is correct is presented using the variable `pN_conf`
* whether or not the prediction in `pN` is a breed of dog is presented under the variable `pN_dog`
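
The three `pN` / `pN_conf` / `pN_dog` groups can be reduced to a single breed per tweet. This is a minimal sketch, assuming `p1` to `p3` are ordered from most to least confident (the sample rows are consistent with that); the function name `best_breed` and the second row are hypothetical.

```python
def best_breed(row):
    """Return (breed, confidence) for the highest-ranked prediction that is a dog."""
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return row[f"p{n}"], row[f"p{n}_conf"]
    # none of the three predictions is a dog breed
    return None, None


rows = [
    # taken from the sample output above
    {"p1": "redbone", "p1_conf": 0.506826, "p1_dog": True,
     "p2": "miniature_pinscher", "p2_conf": 0.074192, "p2_dog": True,
     "p3": "Rhodesian_ridgeback", "p3_conf": 0.07201, "p3_dog": True},
    # hypothetical row: the top prediction is not a dog, the second one is
    {"p1": "paper_towel", "p1_conf": 0.6, "p1_dog": False,
     "p2": "collie", "p2_conf": 0.2, "p2_dog": True,
     "p3": "bath_towel", "p3_conf": 0.1, "p3_dog": False},
]

results = [best_breed(row) for row in rows]
print(results)
```

Because the function only indexes by column name, it should also work row-wise on the DataFrame, e.g. `img_predictions.apply(best_breed, axis=1)`.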

##### Missing and null values
* `img_predictions` has no missing and/or null values 

##### Duplicated observations
* `img_predictions` has no duplicated observations

##### Multiple values for a variable
* observations in `img_predictions` have a single value per variable
     

##### Quality and tidiness

###### Quality
* 
* 
* 
* 
* 
* 
* 
* 

###### Tidiness
* 
* 

#### Dataset 3: Dataframe `df_tweets` that contains twitter API data

In [68]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6378 entries, 0 to 6377
Data columns (total 3 columns):
tweet_id          6378 non-null int64
retweet_count     6378 non-null int64
favorite_count    6378 non-null int64
dtypes: int64(3)
memory usage: 149.6 KB


In [69]:
df_tweets.shape

(6378, 3)

In [71]:
df_tweets.duplicated().value_counts()

False    3991
True     2387
dtype: int64

In [72]:
df_tweets.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6981,33730
1,892177421306343426,5281,29261
2,891815181378084864,3468,21992
3,891689557279858688,7202,36833
4,891327558926688256,7724,35219


#### Structure

##### Overall

* 6378 observations
    
* 3 variables, all of type `int`
* 1 unique ID: `tweet_id`
* `retweet_count` shows the number of re-tweets
* `favorite_count` shows the number of times the tweet under `tweet_id` has been marked as favourite
    

##### Missing and null values
* `df_tweets` has no missing and/or null values 

##### Duplicated observations
* `df_tweets` has 2387 duplicated observations
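
Note that 3991 distinct rows remain after removing exact duplicates, which is still more than the 2356 tweets in the archive. This suggests that some tweets appear several times with different counts, probably because they were fetched in more than one run. A sketch of de-duplicating on the ID alone, using hypothetical toy rows and assuming the last fetched row is the one to keep:

```python
import pandas as pd

# hypothetical rows: tweet 1 was fetched twice, with counts that changed in between
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 1],
    "retweet_count": [5, 7, 6],
    "favorite_count": [10, 20, 11],
})

# keep one row per tweet_id; keep='last' retains the most recently fetched counts
deduped = tweets.drop_duplicates(subset="tweet_id", keep="last").reset_index(drop=True)
print(deduped)
```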

##### Multiple values for a variable
* observations in `df_tweets` have a single value per variable
     

##### Quality and tidiness

###### Quality
* 
* 
* 
* 
* 
* 
* 
* 

###### Tidiness
* 
* 

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
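
The copy-then-merge step described in the note could look like the sketch below. The three toy frames stand in for `twtr_archive`, `img_predictions` and `df_tweets`; the `_clean` names and the choice of inner joins are assumptions, made here so that only tweets present in all three tables survive.

```python
import pandas as pd

# hypothetical stand-ins for the three gathered tables
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
images = pd.DataFrame({"tweet_id": [1, 2], "p1": ["redbone", "collie"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 6, 7]})

# work on copies so the originals stay available for comparison
archive_clean = archive.copy()
images_clean = images.copy()
counts_clean = counts.copy()

# one row per tweet, joined on the shared key
master = (
    archive_clean
    .merge(images_clean, on="tweet_id", how="inner")
    .merge(counts_clean, on="tweet_id", how="inner")
)
print(master)
```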

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
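
A minimal sketch of the storing step, using a hypothetical two-row `master` frame and a temporary directory in place of the notebook's working directory:

```python
import os
import tempfile

import pandas as pd

# hypothetical stand-in for the cleaned master DataFrame
master = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "rating_numerator": [13, 13],
})

path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")

# index=False keeps the row index out of the file, so reloading does not
# produce an extra "Unnamed: 0" column
master.to_csv(path, index=False)

restored = pd.read_csv(path)
print(restored.columns.tolist())
```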

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization