# Project: Wrangle and Analyse Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [25]:
# Import the modules I require
import subprocess
import sys
import pandas as pd
import json
from random import randint
import re

#### Some modules are not imported, yet
##### Why?
* they are third-party packages and do not ship with Python
    * a fresh environment may therefore lack them

##### Which modules, exactly?
* `tweepy`
* `requests`

##### What then?
* import said modules inside a `try`-`except` block
* a function, `install`, will fire when the `try` block fails
* the flow will be as follows:

    ```
    try:
        import module
    except ImportError:
        install(module)
        import module
    else:
        # do something if and only if `try` succeeds
    finally:
        # do something whether or not an error was thrown
    ```
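The flow above can be wrapped in a single helper so that each optional module needs only one call. This is a minimal sketch rather than the notebook's own code: `ensure_module` is a hypothetical name, and it assumes `pip` is available for the running interpreter and that the package name matches the import name.

```python
import importlib
import subprocess
import sys


def ensure_module(name):
    """Import `name`, installing it with pip first if the import fails.

    `ensure_module` is a hypothetical helper used only for illustration.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # Install with the same interpreter that runs the notebook.
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        # Make sure the import system notices the freshly installed package.
        importlib.invalidate_caches()
        return importlib.import_module(name)


# `json` ships with Python, so this call never reaches pip.
json_module = ensure_module("json")
print(json_module.__name__)
```

Returning the module object lets the caller bind it to any name, for example `requests = ensure_module("requests")`.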

In [3]:
def install(mod):
    """Install a module with `pip`.

    Accepts one argument of type `str`: the name of the module to install.

    Returns None.
    """
    # run pip with the same interpreter that runs this notebook
    subprocess.check_call([sys.executable, "-m", "pip", "install", mod])




In [4]:
#import requests

try:
    import requests
except ImportError:
    install('requests')
    import requests

if requests: 
    print(f'Successfully imported `requests` module...')

Successfully imported `requests` module...


In [5]:
#import tweepy

try:
    import tweepy
except ImportError:
    install('tweepy')
    import tweepy

if tweepy: 
    print(f'Successfully imported `tweepy` module...')

Successfully imported `tweepy` module...


In [5]:
#Constants
with open('./keys/twtr_api_keys.json', 'r') as f:
    data = json.load(f)
CONSUMER_KEY = data['api_key']
CONSUMER_SECRET = data['api_key_secret']
ACCESS_TOKEN = data['access_key']
ACCESS_SECRET = data['access_token']

In [13]:
#see if a df holds data
def confirm_exists(df):
    """Report whether a dataframe holds any data.

    Takes one argument: the dataframe itself (pass the variable, not its
    name as a string). An undefined name raises `NameError` before this
    function runs, so the check here is only for emptiness.

    Returns None.
    """
    if not df.empty:
        print('This dataframe exists')
        return
    print('This dataframe does not exist')
    

In [2]:
#download twtr archive data then upload it to workspace
twtr_archive = pd.read_csv('twitter_archive_enhanced.csv', sep=',')

In [14]:
#view the file
confirm_exists(twtr_archive)

This dataframe exists


In [None]:
#a way to download twtr archive data, write and save it to file
# w/o physically downloading then uploading it

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv'
r = requests.get(url)
with open('twitter_archive_enhanced.csv', 'wb') as f:
    f.write(r.content)

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [5]:
#use `requests` to download tweet image prediction data & save to file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
im = requests.get(url)

with open('image_predictions.tsv', 'wb') as f:
    f.write(im.content)

In [15]:
#view the file
img_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
confirm_exists(img_predictions)

This dataframe exists


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [10]:
#set up the api object
#note: `wait_on_rate_limit_notify` exists in Tweepy 3.x only; Tweepy 4 removed it
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


In [12]:
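#query additional data via twtr api
#note: mode 'a' appends, so re-running this cell writes the same tweets again
with open('tweet_json.txt', 'a', encoding='utf8') as f:
    for i in twtr_archive['tweet_id']:
        try:
            tweet = api.get_status(i, tweet_mode='extended')
            json.dump(tweet._json, f)
            f.write('\n')
        except Exception:
            #skip tweets that can no longer be fetched (deleted or protected, for example)
            continue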

Rate limit reached. Sleeping for: 736
Rate limit reached. Sleeping for: 736


In [9]:
#append each tweet into a list
with open('tweet_json.txt', 'r', encoding='utf8') as f:
    #iterate over the file directly; calling `f.readlines()` first would
    #exhaust the file and leave nothing for the comprehension to read
    tweet_list = [json.loads(line) for line in f]



In [4]:
# alternative way to append each tweet into a list
tweet_list = []

with open('tweet_json.txt', 'r') as f:
    for i in f:
        try:
            tweet = json.loads(i)
            tweet_list.append(tweet)
        except json.JSONDecodeError:
            continue
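Because `tweet_json.txt` is opened in append mode by the query cell, running that cell more than once can write the same tweets again. The sketch below shows one way to keep a single record per tweet while loading. It is an illustration only: `load_unique_tweets` is a hypothetical helper, the sample records are made up, and it assumes every record carries an `id` key.

```python
import io
import json


def load_unique_tweets(lines):
    """Parse JSON lines and keep the first record seen for each tweet id."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON instead of hiding every error.
            continue
        # setdefault keeps the first record and ignores later repeats.
        seen.setdefault(tweet["id"], tweet)
    return list(seen.values())


# Made-up records standing in for the contents of `tweet_json.txt`.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
    'not json\n'
    '{"id": 1, "retweet_count": 5, "favorite_count": 9}\n'
)
unique_tweets = load_unique_tweets(sample)
print(len(unique_tweets))
```

With the real file, the same function could be called as `load_unique_tweets(f)` inside the `with open(...)` block.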
        

In [8]:
#check length of `tweet_list`
len(tweet_list) #6378 obs

6378

In [5]:
#create a df from the tweets
df_tweets = pd.DataFrame()

#add variables and values to df
df_tweets['tweet_id'] = [tweet['id'] for tweet in tweet_list]
df_tweets['retweet_count'] = [tweet['retweet_count'] for tweet in tweet_list]
df_tweets['favorite_count'] = [tweet['favorite_count'] for tweet in tweet_list]

In [10]:
#save `df_tweets` as a `.csv` file for ease of reference
df_tweets.to_csv('twitter_api_data.csv', index=False)

In [4]:
#re-load `df-tweets` from `twitter_api_data.csv`
df_tweets = pd.read_csv('twitter_api_data.csv', sep=',')

In [16]:
#check if `df_tweets` exists and that it is not empty
confirm_exists(df_tweets)

This dataframe exists


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Method

* assess data sets one by one
* perform the following on each data set:
    * under  heading `Structure`:
        * visual and programmatic assessment
    * list of quality and tidiness issues
* summary list of quality and tidiness issues in all data sets
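The programmatic half of this method repeats the same few calls for every data set (`shape`, null counts, `duplicated()`). As a rough sketch, those checks could be collected in one helper. `summarise` is a hypothetical name and the frame below is made up for illustration.

```python
import pandas as pd


def summarise(df):
    """Return the basic facts used in a programmatic assessment."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "nulls_per_column": df.isna().sum().to_dict(),
        "duplicated_rows": int(df.duplicated().sum()),
    }


# Made-up frame used only to exercise the helper.
demo = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 3],
        "retweet_count": [10.0, 20.0, 20.0, None],
    }
)
summary = summarise(demo)
print(summary)
```

The helper complements, rather than replaces, a visual scan with `head()` and `sample()`.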

#### Dataset 1: Dataframe `twtr_archive` that contains archive data

In [13]:
twtr_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [14]:
twtr_archive.shape

(2356, 17)

In [15]:
twtr_archive.duplicated().value_counts()

False    2356
dtype: int64

In [16]:
twtr_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [17]:
twtr_archive.tweet_id.nunique()

2356

In [18]:
twtr_archive.sample(randint(5, 15))

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2100,670704688707301377,,,2015-11-28 20:43:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Danny. He's too good to look at the road ...,,,,https://twitter.com/dog_rates/status/670704688...,6,10,Danny,,,,
1921,674262580978937856,,,2015-12-08 16:21:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Gus. He's super stoked about being an ...,,,,https://twitter.com/dog_rates/status/674262580...,9,10,Gus,,,pupper,
2245,667885044254572545,,,2015-11-21 01:59:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Stu. Stu has stacks on stacks and an eye ...,,,,https://twitter.com/dog_rates/status/667885044...,10,10,Stu,,,,
2129,670290420111441920,,,2015-11-27 17:17:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sandra. She's going skydiving. Nice ad...,,,,https://twitter.com/dog_rates/status/670290420...,11,10,Sandra,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4296832000.0,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,
2278,667435689202614272,,,2015-11-19 20:14:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ermergerd 12/10 https://t.co/PQni2sjPsm,,,,https://twitter.com/dog_rates/status/667435689...,12,10,,,,,
1862,675432746517426176,,,2015-12-11 21:51:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy Friday. Here's some golden puppers. 12/1...,,,,https://twitter.com/dog_rates/status/675432746...,12,10,,,,,


In [None]:
twtr_archive

#### Structure

##### Overall

* 2356 observations
    
* 17 variables:
    * 4 of type `float`
    * 3 of type `int`
    * 10 of type `str`
* time stamps are in `str` format:
    * `timestamp`
    * `retweeted_status_timestamp`
* 5 unique IDs:
    * `tweet_id`
    * `in_reply_to_status_id`
    * `in_reply_to_user_id`
    * `retweeted_status_id`
    * `retweeted_status_user_id`
* the age range of a dog is presented qualitatively using the variables:
    * `pupper` -> young dog
    * `puppo` -> adolescent dog
    * `doggo` -> adult dog
* the `floofer` variable might represent the fur type and/or fur texture of a dog
    

##### Missing and null values
* 6 out of 17 variables have null values
* case(s) in point:
    * 2278 values missing in the `in_reply_to_status_id` and `in_reply_to_user_id` variables
    * 2175 values missing in the `retweeted_status_id`, `retweeted_status_user_id`  and `retweeted_status_timestamp` variables 

##### Duplicated observations
* `twtr_archive` has no duplicated observations

##### Multiple values for a variable
* observations in `twtr_archive` have a single value per variable
     

#### Quality and tidiness

###### Quality
* missing values for observations
* some values under the `name` variable do not fit in
    * examples: 
        * value `a` in observation `2354`
        * value `such` in observation `22`
* the case of values under the `name` variable is not consistent
    * some values are all lower case
    * the first letter of some is upper case
    * all values in lowercase do not fit in as spelt out above
* some values under the `rating_denominator` variable are not `10`
    * the value of observation `2335`, for example, is `2`
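The two checks behind the last quality issues (lowercase names and denominators other than `10`) can be run programmatically. The following is a minimal sketch on made-up rows that mimic the relevant columns of `twtr_archive`. It is not the notebook's own code.

```python
import pandas as pd

# Made-up rows that mimic two columns of `twtr_archive`.
demo = pd.DataFrame(
    {
        "name": ["Phineas", "a", "such", "None"],
        "rating_denominator": [10, 10, 2, 10],
    }
)

# Names written entirely in lowercase are unlikely to be real dog names.
lowercase_names = demo.loc[demo["name"].str.islower(), "name"].tolist()

# Denominators other than 10 deserve a closer look.
odd_denominators = demo.loc[demo["rating_denominator"] != 10]

print(lowercase_names)
print(len(odd_denominators))
```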


###### Tidiness
* too many variables for a dog's age range
    * it can be represented by one variable, say, `age_range`, whose values will be `puppo`, `pupper` or `doggo`
* too many unique IDs

#### Dataset 2: Dataframe `img_predictions` that contains image data

In [19]:
img_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [20]:
img_predictions.shape

(2075, 12)

In [21]:
img_predictions.duplicated().value_counts()

False    2075
dtype: int64

In [22]:
img_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [23]:
img_predictions.tweet_id.nunique()

2075

In [24]:
img_predictions.sample(randint(5, 15))

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
682,683834909291606017,https://pbs.twimg.com/ext_tw_video_thumb/68383...,1,Maltese_dog,0.738449,True,toy_poodle,0.102992,True,Samoyed,0.023247,True
1714,818627210458333184,https://pbs.twimg.com/media/C1xZGkzWIAA8vh4.jpg,1,Labrador_retriever,0.384188,True,beagle,0.255917,True,grocery_store,0.079799,False
1161,734787690684657664,https://pbs.twimg.com/media/CjJ9gQ1WgAAXQtJ.jpg,4,golden_retriever,0.883991,True,chow,0.023542,True,Labrador_retriever,0.016056,True
683,683849932751646720,https://pbs.twimg.com/media/CX2F4qNUQAAR6Cm.jpg,1,hog,0.458855,False,Mexican_hairless,0.164906,True,wild_boar,0.1117,False
79,667453023279554560,https://pbs.twimg.com/media/CUNE_OSUwAAdHhX.jpg,1,Labrador_retriever,0.82567,True,French_bulldog,0.056639,True,Staffordshire_bullterrier,0.054018,True
1735,821765923262631936,https://pbs.twimg.com/media/C2d_vnHWEAE9phX.jpg,1,golden_retriever,0.980071,True,Labrador_retriever,0.008758,True,Saluki,0.001806,True


In [None]:
img_predictions

#### Structure

##### Overall

* 2075 observations   
* 12 variables:
    * 3 of type `float`
    * 2 of type `int`
    * 4 of type `str`
    * 3 of type `bool`
* 1 unique ID: `tweet_id`
* the predicted breed of a dog in a photo is presented using the variable `pN`, where `N` is 1, 2 or 3
    * example: `p1` -> prediction 1
* the likelihood that the breed of a dog is the value of variable `pN` is presented using the variable `pN_conf`
* whether or not prediction `pN` is a breed of dog is presented under the variable `pN_dog`

##### Missing and null values
* `img_predictions` has no missing and/or null values 

##### Duplicated observations
* `img_predictions` has no duplicated observations

##### Multiple values for a variable
* observations in `img_predictions` have a single value per variable
     

#### Quality and tidiness

###### Quality
* values under variables `pN` (where `N` is 1, 2 or 3) do not have a consistent case
    * examples:
        * value `redbone` in observation `1` under variable `p1`
        * value `Ibizan_hound` in observation `12` under variable `p3`
* some values under variables `pN` do not fit in; that is to say, they are not breeds of dogs
    * examples:
        * value `three-toed_sloth` in observation `21` under variable `p1`
        * value `skunk` in observation `25` under variable `p2`
* some observations do not fit in
    * examples:
        * observation `17` appears to be about birds
        * observation `18` appears to be about library or office equipment
        * observation `25` appears to be about small mammals
    * however, the values of said observations under variables `pN_dog` are `False`. The classifier identified that the respective images in question were not of dogs

* some variable names are not descriptive
    * example: variables `p1`, `p2` and `p3`
       * saves a few keystrokes but is difficult for anyone that assesses the data set. `pred_N` or `prediction_N` are more descriptive 

###### Tidiness
* `None`, so far

#### Dataset 3: Dataframe `df_tweets` that contains twitter API data

In [25]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6378 entries, 0 to 6377
Data columns (total 3 columns):
tweet_id          6378 non-null int64
retweet_count     6378 non-null int64
favorite_count    6378 non-null int64
dtypes: int64(3)
memory usage: 149.6 KB


In [26]:
df_tweets.shape

(6378, 3)

In [27]:
df_tweets.duplicated().value_counts()

False    3991
True     2387
dtype: int64

In [28]:
df_tweets.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6981,33730
1,892177421306343426,5281,29261
2,891815181378084864,3468,21992
3,891689557279858688,7202,36833
4,891327558926688256,7724,35219


In [29]:
df_tweets.tweet_id.nunique()

2327

In [19]:
#note: `x != 'True'` compares a bool with the string 'True', so every value comes out True
df_tweets.duplicated().apply(lambda x: x != 'True').sample(randint(5, 15))

259     True
3734    True
1561    True
5744    True
348     True
2247    True
3792    True
5184    True
5830    True
dtype: bool

In [20]:
#note: without parentheses this displays the bound method instead of calling `duplicated()`
df_tweets.duplicated

<bound method DataFrame.duplicated of                 tweet_id  retweet_count  favorite_count
0     892420643555336193           6981           33730
1     892177421306343426           5281           29261
2     891815181378084864           3468           21992
3     891689557279858688           7202           36833
4     891327558926688256           7724           35219
5     891087950875897856           2590           17764
6     890971913173991426           1649           10342
7     890729181411237888          15699           56712
8     890609185150312448           3606           24460
9     890240255349198849           6084           27881
10    890006608113172480           6123           26976
11    889880896479866881           4145           24505
12    889665388333682689           8314           41923
13    889638837579907072           3702           23605
14    889531135344209921           1875           13317
15    889278841981685760           4428           22054
16    8889

In [None]:
df_tweets

#### Structure

##### Overall

* 6378 observations
    
* 3 variables; all of type `int`
* 1 unique ID: `tweet_id`
* `retweet_count` shows the number of re-tweets
* `favorite_count` shows the number of times the tweet under `tweet_id` has been marked as favourite
    

##### Missing and null values
* `df_tweets` has no missing and/or null values 

##### Duplicated observations
* `df_tweets` has 2387 duplicated observations
* variable `tweet_id` has only 2327 unique values 
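The two counts above come from `duplicated()` and `nunique()`. As a hedged sketch of how the repeats could later be removed, the example below keeps the first row per `tweet_id`. The rows are made up, and keeping the first occurrence is an assumption rather than a choice stated in this notebook.

```python
import pandas as pd

# Made-up rows that mimic `df_tweets`.
demo = pd.DataFrame(
    {
        "tweet_id": [1, 2, 1, 3, 2],
        "retweet_count": [5, 7, 5, 1, 7],
        "favorite_count": [9, 8, 9, 2, 8],
    }
)

# Count rows that repeat an earlier row exactly.
n_duplicates = int(demo.duplicated().sum())

# Keep the first row for each `tweet_id` and renumber the index.
deduped = demo.drop_duplicates(subset="tweet_id").reset_index(drop=True)

print(n_duplicates, len(deduped))
```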

##### Multiple values for a variable
* observations in `df_tweets` have a single value per variable
     

#### Quality and tidiness

###### Quality
* 2387 of the 6378 observations (about 37%) are duplicates
* values are of type `int` (`int64` in the `info()` output above), not `float`
    * this suits the data: `tweet_id`, `retweet_count` and `favorite_count` cannot be reasonably expected to have fractional units
        * case in point: `retweet_count` cannot have a value of, say, 3.142

###### Tidiness
* there is little to no context
    * the variables in the data set do not, by themselves, present a big picture view of the data in question
    * the "big picture" (or context, if you like) will come from either `twtr_archive` or `img_predictions`; a merger is imminent

#### Summary of issues

##### Quality

* missing values for observations
    * example from `twtr_archive`: 2278 values missing in the `in_reply_to_status_id` variable

* some values do not fit in
    * example from `twtr_archive`: value `such` in observation `22`

* `twtr_archive`: the values under  variable `name` that are in all lowercase are not names of dogs
    * value `a` in observation `2354`

* the case of values under some variables is not consistent
    * example from `img_predictions`: value `Ibizan_hound` in observation `12` under variable `p3`

* some values under the `rating_denominator` variable in `twtr_archive` are not `10`
    * the value of observation `2335`, for example, is `2`

* some variable names in `img_predictions` are not descriptive
    * "`pN` (where `N` is 1, 2 or 3)" is used to refer to `p1`, `p2` and `p3`

* `img_predictions`: some values under variables `pN` do not fit in; that is to say, they are not breeds of dogs
    * example: value `three-toed_sloth` in observation `21` under variable `p1`

* some observations do not fit in
    * example from `img_predictions`: observation `17` appears to be about birds
    * the values of said observations under variables `pN_dog`, however, are `False`. The classifier identified that the respective images in question were not of dogs

* `df_tweets`: 2387 of the 6378 observations (about 37%) are duplicates

* `df_tweets`: values are of type `int` (`int64` in the `info()` output), not `float`
    * this suits the data: the values of `tweet_id`, `retweet_count` and `favorite_count` cannot be reasonably expected to have fractional units

##### Tidiness

* `twtr_archive`: too many variables for a dog's age range
    * variables `puppo`, `pupper` and `doggo` can be values under variable `age_range`

* `twtr_archive`: too many unique IDs

* `df_tweets`: the variables in the data set do not, by themselves, present a big picture view of the data in question
    * the "big picture" will emerge from a merger with either `twtr_archive` or `img_predictions`

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [7]:
# Make copies of original pieces of data
twtr_archive_copy = twtr_archive.copy()
img_predictions_copy = img_predictions.copy()
df_tweets_copy = df_tweets.copy()

In [17]:
confirm_exists(twtr_archive_copy)

This dataframe exists


In [18]:
confirm_exists(img_predictions_copy)

This dataframe exists


In [19]:
confirm_exists(df_tweets_copy)

This dataframe exists


### Issue #1:

#### Define

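* Missing values for observations
    * in `twtr_archive_copy`, a missing dog stage is stored as the string `None` under `doggo`, `pupper` and `puppo`; replace it with an empty string and collapse the three variables into `age_range`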

#### Code

In [20]:
#`twtr_archive_copy`: replace all the `None` with empty str

age_range = ['doggo', 'pupper', 'puppo']
for i in age_range:
    twtr_archive_copy[i] = twtr_archive_copy[i].replace('None','')

In [21]:
#collapse the variables into one called `age_range`
twtr_archive_copy['age_range'] = twtr_archive_copy['doggo'].str.cat(twtr_archive_copy[['pupper', 'puppo']], sep=',')

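#rows with no stage come out as ',,' (two separators), so this `,,,` replacement matches nothing;
#the `strip` in the next cell is what turns them into empty strings
twtr_archive_copy['age_range'] = twtr_archive_copy['age_range'].replace(',,,', 'None')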



In [22]:
#remove trailing commas in all values
twtr_archive_copy['age_range'] = twtr_archive_copy['age_range'].str.strip(',').astype(str)

#remove multiple commas 
twtr_archive_copy['age_range'] = twtr_archive_copy['age_range'].str.replace(',,,', ',')
twtr_archive_copy['age_range'] = twtr_archive_copy['age_range'].str.replace(',,', ',')

In [23]:
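#remove `doggo`, `pupper` and `puppo`
#note: `drop` returns a new frame; without assignment or `inplace=True`, `twtr_archive_copy` keeps the three columns
twtr_archive_copy.drop(['doggo', 'pupper', 'puppo'], axis=1)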

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,floofer,age_range
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,,doggo


#### Test

In [24]:
twtr_archive_copy.age_range.value_counts()

                1985
pupper           245
doggo             84
puppo             29
doggo,pupper      12
doggo,puppo        1
Name: age_range, dtype: int64
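The same collapse can be written without the comma clean-up steps by joining only the stages that are present. This is an alternative sketch on made-up rows, not the code that produced the counts above. It assumes the original encoding, in which the string `'None'` marks an absent stage.

```python
import pandas as pd

stages = ["doggo", "pupper", "puppo"]

# Made-up rows; the string 'None' marks an absent stage, as in the archive.
demo = pd.DataFrame(
    {
        "doggo": ["None", "doggo", "doggo", "None"],
        "pupper": ["None", "None", "pupper", "pupper"],
        "puppo": ["None", "None", "None", "None"],
    }
)

# Join the stages that are present; rows with no stage become ''.
demo["age_range"] = demo[stages].apply(
    lambda row: ",".join(value for value in row if value != "None"), axis=1
)

print(demo["age_range"].tolist())
```

Because absent stages are skipped before joining, no stray separators are produced.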

### Issue #2:

#### Define

* some values do not fit in
    * values such as `a` or `such` under the `name` variable in `twtr_archive_copy`

#### Code

In [32]:
#replace lowercase names
#case 1: name is after the words `name is`
li = list(twtr_archive_copy.loc[(twtr_archive_copy['name'].str.islower()) &
            (twtr_archive_copy['text'].str.contains('name is'))]['text'])

for i in li:
    mask = twtr_archive_copy.text == i
    #`re.search` yields one match (or None), so a single str is assigned rather than a list
    match = re.search(r"name is\s(\w+)", i)
    if match:
        twtr_archive_copy.loc[mask, 'name'] = match.group(1)
    

[]

In [33]:
#case 2: name is after the word `named`
li = list(twtr_archive_copy.loc[(twtr_archive_copy['name'].str.islower()) & 
        (twtr_archive_copy['text'].str.contains('named'))]['text'])

for i in li:
    mask = twtr_archive_copy.text == i
    #`re.search` yields one match (or None), so a single str is assigned rather than a list
    match = re.search(r"named\s(\w+)", i)
    if match:
        twtr_archive_copy.loc[mask, 'name'] = match.group(1)
    

In [34]:
#case 3: no name w/i text 
li = list(twtr_archive_copy.loc[(twtr_archive_copy['name'].str.islower())]['text'])

for i in li:
    mask = twtr_archive_copy.text == i
    twtr_archive_copy.loc[mask, 'name'] = "None"
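The three cases can also be expressed as one function that tries each pattern in turn and falls back to `"None"`. The sketch below uses only the standard library. `extract_name` and `NAME_PATTERNS` are hypothetical names, and the tweet texts are made up.

```python
import re

# Patterns taken from the two cases above: "name is X" and "named X".
NAME_PATTERNS = [re.compile(r"name is\s(\w+)"), re.compile(r"named\s(\w+)")]


def extract_name(text):
    """Return the first name found in `text`, or 'None' if there is none."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return "None"


# Made-up tweet texts used only for illustration.
print(extract_name("This is a pupper. His name is Rex. 12/10"))
print(extract_name("Here is a doggo named Bella. 11/10"))
print(extract_name("Such a good dog. 13/10"))
```

Applied to the dataframe, such a function would only be used on rows whose `name` is lowercase, for example through `Series.apply` on the `text` column of those rows.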

#### Test

In [37]:
#see if changes worked
twtr_archive_copy.loc[(twtr_archive_copy.name.str.islower())].info() #zero obs, 18 variables

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 18 columns):
tweet_id                      0 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     0 non-null object
source                        0 non-null object
text                          0 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 0 non-null object
rating_numerator              0 non-null int64
rating_denominator            0 non-null int64
name                          0 non-null object
doggo                         0 non-null object
floofer                       0 non-null object
pupper                        0 non-null object
puppo                         0 non-null object
age_range                     0 non-null object
dtypes: float64(4), int64(3), object(11)
m

### Issue #3:

#### Define

* `img_predictions_copy`: some observations are not about dogs
    * remove observations for which any of the variables `pN_dog` (`N` is 1, 2 or 3) is `False`

#### Code

In [39]:
# keep only the rows where every `pN_dog` is `True`
li = ['p1_dog', 'p2_dog', 'p3_dog']

img_predictions_copy = img_predictions_copy[img_predictions_copy[li].all(axis=1)]

#### Test

In [40]:
img_predictions_copy.p1_dog.value_counts()

True    1243
Name: p1_dog, dtype: int64

In [41]:
img_predictions_copy.p2_dog.value_counts()

True    1243
Name: p2_dog, dtype: int64

In [42]:
img_predictions_copy.p3_dog.value_counts()

True    1243
Name: p3_dog, dtype: int64

In [44]:
img_predictions_copy[['p1_dog', 'p2_dog', 'p3_dog']].sample(randint(5, 15))

Unnamed: 0,p1_dog,p2_dog,p3_dog
1908,True,True,True
10,True,True,True
1285,True,True,True
621,True,True,True
306,True,True,True
225,True,True,True
925,True,True,True
1685,True,True,True
1424,True,True,True
1427,True,True,True


### Issue #4

#### Define

* some values under the `rating_denominator` variable in `twtr_archive` are not 10
    * change said values to `10`

#### Code

In [59]:
#every denominator becomes 10, whatever its current value
twtr_archive_copy['rating_denominator'] = 10

#### Test

In [60]:
twtr_archive_copy.rating_denominator.value_counts()

10    2356
Name: rating_denominator, dtype: int64

### Issue #5

#### Define
* some variable names in `img_predictions` are not descriptive
    * change `pN` to `prediction_N`
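A possible way to carry out this rename is sketched below on a made-up row. The source only states that `pN` becomes `prediction_N`. Renaming the `_conf` and `_dog` columns in the same pattern is an assumption made here for consistency.

```python
import pandas as pd

# Made-up single row with the column layout of `img_predictions`.
demo = pd.DataFrame(
    {
        "tweet_id": [1],
        "p1": ["redbone"], "p1_conf": [0.5], "p1_dog": [True],
        "p2": ["collie"], "p2_conf": [0.3], "p2_dog": [True],
        "p3": ["skunk"], "p3_conf": [0.1], "p3_dog": [False],
    }
)

# Map p1 -> prediction_1, p1_conf -> prediction_1_conf, and so on.
new_names = {}
for n in (1, 2, 3):
    new_names[f"p{n}"] = f"prediction_{n}"
    new_names[f"p{n}_conf"] = f"prediction_{n}_conf"
    new_names[f"p{n}_dog"] = f"prediction_{n}_dog"

renamed = demo.rename(columns=new_names)
print(list(renamed.columns))
```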

### Issue #6

#### Define
* `img_predictions`: some values under variables `pN` do not fit in; that is to say, they are not breeds of dogs
    * drop observations that fit the criteria above

### Issue #7

#### Define

* `null` values under the `rating_denominator` variable in `twtr_archive`
    * drop `null` values

### Issue #8

#### Define

* re-tweets are not required
    * remove re-tweets
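In the archive, re-tweets are the 181 rows with a non-null `retweeted_status_id`, so one way to remove them is to keep the rows where that variable is null. The sketch below shows the idea on made-up rows. It is not code from this notebook.

```python
import pandas as pd

# Made-up rows that mimic three columns of `twtr_archive`.
demo = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "text": [
            "This is Phineas. 13/10",
            "RT @dogratingrating: Exceptional talent",
            "Meet Stu. 10/10",
        ],
        "retweeted_status_id": [None, 6.675487e17, None],
    }
)

# An original tweet has no `retweeted_status_id`.
originals = demo[demo["retweeted_status_id"].isna()]
print(originals["tweet_id"].tolist())
```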

### Issue #9

#### Define

* combine data sets
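All three data sets share `tweet_id`, which makes it the natural merge key. The following sketch joins three made-up frames. The choice of inner joins (keeping only tweets present in every table) is an assumption, and the frame names are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned data sets.
archive = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "name": ["Phineas", "Tilly", "Archie"]}
)
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["redbone", "collie"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 7, 1]})

# Inner joins keep only tweets present in all three tables.
master = archive.merge(predictions, on="tweet_id", how="inner").merge(
    counts, on="tweet_id", how="inner"
)
print(master.shape)
```

An inner join suits the project note that only tweets with images are wanted, since tweets without image predictions drop out of the result.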

### Issue #10

#### Define

* remove variables that will not be used for analysis and visualisation
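Dropping unused variables could look like the sketch below. Which columns count as unused is not stated in this notebook, so the `unused` list is purely a hypothetical choice on made-up data.

```python
import pandas as pd

# Made-up rows with a few of the archive's columns.
demo = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "in_reply_to_status_id": [None, None],
        "source": ["iphone", "web"],
        "rating_numerator": [13, 12],
    }
)

# Hypothetical choice of columns that the analysis will not use.
unused = ["in_reply_to_status_id", "source"]

# errors="ignore" skips any listed column that is already absent.
trimmed = demo.drop(columns=unused, errors="ignore")
print(list(trimmed.columns))
```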

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
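A minimal sketch of the save step follows. The frame is made up, and a temporary directory is used so that the example does not touch the project files. In the notebook itself the master dataframe would be written straight to `twitter_archive_master.csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master dataframe.
master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})

# A temporary directory keeps this example from touching the project files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "twitter_archive_master.csv")
    # index=False avoids writing the row index as an extra column.
    master.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```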

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization