# Wrangle and Analyze Data

## Table of Contents
- [Introduction](#intro)
- [Gathering Data](#gather)
- [Assessing Data](#assess)
  - [Quality](#quality)
  - [Tidiness](#tidy)
- [Cleaning Data](#clean)
- [Storing Data](#store)
- [Analyzing Data](#analyze)
- [Visualizing Data](#visualize)
- [Conclusion](#conclusion)
- [Sources](#source)

<a id='intro'></a>
### Introduction
This project illustrates the data wrangling process. The dataset being wrangled comes from the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. These ratings almost always have a denominator of 10 and a numerator that is almost always greater than 10. With this data, I will try to create interesting and trustworthy analyses and visualizations.

<a id='gather'></a>
### Gathering Data

There are 3 pieces of data required for this project. The first one is the WeRateDogs Twitter archive. This file is given to us, so it is simply read from a local CSV file.

In [354]:
# import libraries
import datetime
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import time
import tweepy

from functools import reduce

%matplotlib inline

In [38]:
# open the csv file
twitter_archive = pd.read_csv('twitter-archive-enhanced-2.csv')

In [294]:
# i'm setting the column widths to be bigger so it displays more information
pd.set_option('display.max_colwidth', 200)

In [295]:
# look at the data
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


The second piece of data is the tweet image predictions file. It is hosted on Udacity's servers and will be downloaded programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [40]:
# this is the URL given to us to download from
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# get the website and store it in response
response = requests.get(url)
response
#response 200 means that our request was successful

<Response [200]>

In [41]:
# we can check the content with this line of code, just uncomment it if you want to check
#response.content

In [42]:
# we will open a file called image_predictions.tsv
# we then write the contents of the response variable to it
open('image_predictions.tsv', 'wb').write(response.content)

335079

In [43]:
# we now read the tsv file we just created
image_preds = pd.read_csv('image_predictions.tsv', sep = '\t')

In [265]:
# view the data
image_preds.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


The third piece of data will be queried from Twitter's API. Using the tweet IDs in the WeRateDogs Twitter archive, I query the Twitter API for each tweet's JSON data with Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. I then read this .txt file into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

In [45]:
# set up Twitter api
# the keys are intentionally left blank, you need your own keys to run this code
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

# create a OAuth instance and pass in our keys
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# set access token
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, parser = tweepy.parsers.JSONParser())

In [46]:
# query Twitter's API for WeRateDogs JSON data
# match tweet ID from archive to WeRateDogs tweets

# list of tweets that were not retrieved successfully
error_list = []
# list of tweets that were retrieved successfully
success_list = []
# start the timer to see how long it will take to retrieve all the tweets
start = time.time()

for tweet_id in twitter_archive['tweet_id']:
    try:
        # get the tweet using ID
        data = api.get_status(tweet_id, tweet_mode = 'extended',
                             wait_on_rate_limit = True,
                             wait_on_rate_limit_notify = True)
        success_list.append(data)
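    # catch Exception rather than using a bare except, so Ctrl+C can still interrupt the loop
    except Exception: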
        error_list.append(tweet_id)
        # I'm printing out the tweet id's that failed
        print('F', tweet_id)
        
end = time.time()
print('Time it took to retrieve tweets is ', '{:.4f}'.format(end - start), ' in seconds')

F 888202515573088257
F 873697596434513921
F 872668790621863937
F 872261713294495745
F 869988702071779329
F 866816280283807744
F 861769973181624320
F 856602993587888130
F 851953902622658560
F 845459076796616705
F 844704788403113984
F 842892208864923648
F 837366284874571778
F 837012587749474308
F 829374341691346946
F 827228250799742977
F 812747805718642688
F 802247111496568832
F 779123168116150273
F 775096608509886464
F 771004394259247104
F 770743923962707968
F 759566828574212096


Rate limit reached. Sleeping for: 674


F 754011816964026368
F 680055455951884288


Rate limit reached. Sleeping for: 685


Time it took to retrieve tweets is  1959.0298  in seconds


In [51]:
# format is hours:minutes:seconds.microseconds
str(datetime.timedelta(seconds=(end-start)))

'0:32:39.029822'

In [232]:
# i'm making a second attempt to get the Tweets that failed
retry_list = []
retry_error_list = []
start = time.time()

for tweet_id in error_list:
    try:
        # get the tweet using ID
        data = api.get_status(tweet_id, tweet_mode = 'extended',
                             wait_on_rate_limit = True,
                             wait_on_rate_limit_notify = True)
        retry_list.append(data)
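    # catch Exception rather than using a bare except, so Ctrl+C can still interrupt the loop
    except Exception: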
        retry_error_list.append(tweet_id)
        # I'm printing out the tweet id's that failed
        print('F', tweet_id)
        
end = time.time()
print('Time it took to retrieve tweets is ', '{:.4f}'.format(end - start), ' in seconds')

F 888202515573088257
F 873697596434513921
F 872668790621863937
F 872261713294495745
F 869988702071779329
F 866816280283807744
F 861769973181624320
F 856602993587888130
F 851953902622658560
F 845459076796616705
F 844704788403113984
F 842892208864923648
F 837366284874571778
F 837012587749474308
F 829374341691346946
F 827228250799742977
F 812747805718642688
F 802247111496568832
F 779123168116150273
F 775096608509886464
F 771004394259247104
F 770743923962707968
F 759566828574212096
F 754011816964026368
F 680055455951884288
Time it took to retrieve tweets is  5.9481  in seconds


In [222]:
# store the data into tweet_json.txt
with open('tweet_json.txt', mode = 'w') as file:
    json.dump(success_list, file)
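
The instructions in this section ask for one JSON object per line, while `json.dump(success_list, file)` writes the whole list as a single JSON array (which `pd.read_json` can still read). As a rough sketch of the line-per-tweet layout, the code below uses two made-up tweet dictionaries and an in-memory buffer in place of `tweet_json.txt`. The helper names `write_json_lines` and `read_json_lines` are hypothetical and not part of this notebook.

```python
import io
import json

def write_json_lines(tweets, file):
    """Write each tweet dict as one JSON document per line."""
    for tweet in tweets:
        file.write(json.dumps(tweet) + '\n')

def read_json_lines(file, fields=('id', 'retweet_count', 'favorite_count')):
    """Read one JSON document per line, keeping only the requested fields."""
    rows = []
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({field: tweet.get(field) for field in fields})
    return rows

# made-up sample records, shaped like the fields used in this notebook
sample_tweets = [
    {'id': 1, 'retweet_count': 10, 'favorite_count': 25, 'full_text': 'a'},
    {'id': 2, 'retweet_count': 3, 'favorite_count': 7, 'full_text': 'b'},
]

buffer = io.StringIO()  # stands in for open('tweet_json.txt', 'w')
write_json_lines(sample_tweets, buffer)
buffer.seek(0)
rows = read_json_lines(buffer)
```

Passing `rows` to `pd.DataFrame` would give a frame with the three columns kept here. For a file written this way, `pd.read_json(path, lines=True)` is another option.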

In [334]:
# open the JSON text and put it into a DataFrame
query_data = pd.read_json('tweet_json.txt')

In [335]:
query_data.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336192,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media...","{'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'media_url_https': 'https://pbs.twimg.com...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,...,False,False,0.0,0.0,en,,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343424,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/medi...","{'media': [{'id': 892177413194625024, 'id_str': '892177413194625024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg', 'media_url_https': 'https://pbs.twimg.co...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,...,False,False,0.0,0.0,en,,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'indices': [122, 145], 'media_url': 'http://pbs.twimg.com/medi...","{'media': [{'id': 891815175371796480, 'id_str': '891815175371796480', 'indices': [122, 145], 'media_url': 'http://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg', 'media_url_https': 'https://pbs.twimg.co...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,...,False,False,0.0,0.0,en,,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'indices': [80, 103], 'media_url': 'http://pbs.twimg.com/media...","{'media': [{'id': 891689552724799489, 'id_str': '891689552724799489', 'indices': [80, 103], 'media_url': 'http://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg', 'media_url_https': 'https://pbs.twimg.com...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,...,False,False,0.0,0.0,en,,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': [129, 138]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 891327551943041024, 'id_str': '891327551943041024', 'indices': [139, 16...","{'media': [{'id': 891327551943041024, 'id_str': '891327551943041024', 'indices': [139, 162], 'media_url': 'http://pbs.twimg.com/media/DF6hr6AVYAAZ8G8.jpg', 'media_url_https': 'https://pbs.twimg.co...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,...,False,False,0.0,0.0,en,,,,,


In [270]:
query_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 32 columns):
created_at                       2331 non-null datetime64[ns, UTC]
id                               2331 non-null int64
id_str                           2331 non-null int64
full_text                        2331 non-null object
truncated                        2331 non-null bool
display_text_range               2331 non-null object
entities                         2331 non-null object
extended_entities                2059 non-null object
source                           2331 non-null object
in_reply_to_status_id            77 non-null float64
in_reply_to_status_id_str        77 non-null float64
in_reply_to_user_id              77 non-null float64
in_reply_to_user_id_str          77 non-null float64
in_reply_to_screen_name          77 non-null object
user                             2331 non-null object
geo                              0 non-null float64
coordinates                 

In [336]:
query_data['tweet_id'] = query_data['id']
query_data = query_data[['tweet_id', 'retweet_count', 'favorite_count']]
query_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7718,36251
1,892177421306343426,5704,31259
2,891815181378084864,3781,23536
3,891689557279858688,7870,39531
4,891327558926688256,8488,37743


<a id='assess'></a>
### Assessing Data
Detect and document at least eight quality issues and two tidiness issues.

I'll use pandas to visually assess the three dataframes.  The better option would be to look at them in a spreadsheet program since pandas collapses rows and columns.  It is also less convenient to scroll around in pandas.

In [91]:
# look at a sample of twitter_archive
# to view the entire dataset, a csv viewer would be a better option
twitter_archive.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1918,674271431610523648,,,2015-12-08 16:56:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""AT DAWN, WE RIDE""\n10/10 for both dogs https:...",,,,https://twitter.com/dog_rates/status/674271431...,10,10,,,,,
874,761292947749015552,,,2016-08-04 20:09:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Bonaparte. He's pupset because it's cloud...,,,,https://twitter.com/dog_rates/status/761292947...,11,10,Bonaparte,,,,
1644,683852578183077888,,,2016-01-04 03:28:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Tiger. He's a penbroke (little do...,,,,https://twitter.com/dog_rates/status/683852578...,10,10,Tiger,,,,
2019,672125275208069120,,,2015-12-02 18:48:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is just impressive I have nothing else to...,,,,https://twitter.com/dog_rates/status/672125275...,11,10,just,,,,
2162,669393256313184256,,,2015-11-25 05:52:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Ronduh. She's a Finnish Checkered Blitzkr...,,,,https://twitter.com/dog_rates/status/669393256...,10,10,Ronduh,,,,


In [63]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [99]:
# investigate what the object data types actually are
type(twitter_archive.timestamp[0]), type(twitter_archive.source[0]), type(twitter_archive.text[0])

(str, str, str)

In [100]:
type(twitter_archive.retweeted_status_timestamp[0]), type(twitter_archive.expanded_urls[0])

(float, str)

In [101]:
type(twitter_archive.name[0]), type(twitter_archive.doggo[0])

(str, str)

In [102]:
type(twitter_archive.floofer[0]), type(twitter_archive.pupper[0]), type(twitter_archive.puppo[0])

(str, str, str)

In [104]:
# checking what values are in numerator and denominator
twitter_archive.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [288]:
twitter_archive.query('rating_numerator < 10').head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
45,883482846933004288,,,2017-07-08 00:28:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948",,,,"https://twitter.com/dog_rates/status/883482846933004288/photo/1,https://twitter.com/dog_rates/status/883482846933004288/photo/1",5,10,Bella,,,,
229,848212111729840128,,,2017-04-01 16:35:01 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Jerry. He's doing a distinguished tongue slip. Slightly patronizing tbh. You think you're better than us, Jerry? 6/10 hold me back https://t.co/DkOBbwulw1",,,,https://twitter.com/dog_rates/status/848212111729840128/photo/1,6,10,Jerry,,,,
315,835152434251116546,,,2017-02-24 15:40:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag,,,,"https://twitter.com/dog_rates/status/835152434251116546/photo/1,https://twitter.com/dog_rates/status/835152434251116546/photo/1,https://twitter.com/dog_rates/status/835152434251116546/photo/1",0,10,,,,,
387,826598799820865537,8.265984e+17,4196984000.0,2017-02-01 01:11:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","I was going to do 007/10, but the joke wasn't worth the &lt;10 rating",,,,,7,10,,,,,
462,817502432452313088,,,2017-01-06 22:45:43 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Meet Herschel. He's slightly bigger than ur average pupper. Looks lonely. Could probably ride 7/10 would totally pet https:/…,6.924173e+17,4196984000.0,2016-01-27 18:42:06 +0000,https://twitter.com/dog_rates/status/692417313023332352/photo/1,7,10,Herschel,,,pupper,
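
The first row above reads 13.5/10 in the tweet text but has 5 in `rating_numerator`, which suggests the decimal part was lost when the rating was extracted. The sketch below shows one way a rating could be re-extracted from the text while keeping decimals. The function name `extract_rating` and the regular expression are my own assumptions, and taking the first `x/y` match is a simplification that can pick up things that are not ratings.

```python
import re

# digits with an optional decimal part, a slash, then digits
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) from the first 'x/y' in text, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

# tweet text copied from the first row of the output above
bella = ("This is Bella. She hopes her smile made you smile. If not, she is "
         "also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948")
rating = extract_rating(bella)
```

Any row where the re-extracted numerator disagrees with `rating_numerator` would still be worth reading by eye before overwriting it.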


In [122]:
twitter_archive.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
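
Ratings with different denominators cannot be compared directly. One possible approach, which is an assumption on my part and not a step this notebook performs, is to rescale every rating to a denominator of 10 and to treat a denominator of 0 as missing. The helper `scale_rating` is hypothetical, and the pairs below are illustrative values, not rows taken from the archive.

```python
def scale_rating(numerator, denominator, base=10):
    """Rescale numerator/denominator to the given base; None if denominator is 0."""
    if denominator == 0:
        return None
    return numerator * base / denominator

# illustrative (numerator, denominator) pairs
pairs = [(13, 10), (84, 70), (9, 20), (5, 0)]
scaled = [scale_rating(n, d) for n, d in pairs]
```

Whether rescaling is appropriate depends on why the denominator differs, so each of the unusual rows should be read before applying a rule like this.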

In [111]:
# checking if there are duplicate tweet_ids present in dataset
sum(twitter_archive.tweet_id.duplicated())

0

In [219]:
# checking validity of dog names, scrolling through the data i found some names such as: a, an, None
twitter_archive.name.unique()

array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'such',
       'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey',
       'Lilly', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella',
       'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey',
       'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey',
       'Duddles', 'Jack', 'Emmy', 'Steven', 'Beau', 'Snoopy', 'Shadow',
       'Terrance', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict',
       'Venti', 'Goose', 'Nugget', 'Cash', 'Coco', 'Jed', 'Sebastian',
       'Walter', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover',
       'Napolean', 'Dawn', 'Boomer', 'Cody', 'Rumble', 'Clifford',
       'quite', 'Dewey', 'Scout', 'Gizmo', 'Cooper', 'Harold', 'Shikha',
       'Jamesy', 'Lili', 'Sammy', 'Meatball', 'Paisley', 'Albus',
       'Nept

In [215]:
# these wouldn't be real names, so i'm checking names less than length 3
twitter_archive.name = twitter_archive.name.astype('str')
print(type(twitter_archive.name))
test = twitter_archive[twitter_archive.name.str.len() < 3]
test.name.value_counts()

<class 'pandas.core.series.Series'>


a     55
Bo     9
an     7
by     1
Ed     1
my     1
Al     1
O      1
Jo     1
JD     1
Mo     1
Name: name, dtype: int64

In [164]:
# checks names with length = 3
test2 = twitter_archive[twitter_archive.name.str.len() == 3]
test2.name.value_counts()

the    8
Jax    6
Leo    6
Gus    5
one    4
Max    3
Mia    3
Ted    3
Tyr    2
Eve    2
Doc    2
Ken    2
Ava    2
Moe    2
Lou    2
mad    2
Sam    2
not    2
Eli    2
Bob    2
Ash    2
old    1
Jim    1
Edd    1
Dug    1
Stu    1
Pip    1
Ole    1
Obi    1
Rey    1
Tug    1
Mac    1
Evy    1
Mya    1
Jay    1
Jed    1
Gin    1
Jeb    1
Taz    1
Aja    1
all    1
Cal    1
Dot    1
Ben    1
his    1
Zoe    1
Blu    1
Ace    1
Tom    1
Ito    1
Ron    1
Amy    1
Dex    1
Ike    1
Sky    1
Sid    1
Alf    1
Name: name, dtype: int64

In [212]:
# this one is filtering by the first character of name being lowercase
test3 = twitter_archive[twitter_archive.name.str[0].str.islower() == True]
test3.name.value_counts()

a               55
the              8
an               7
very             5
just             4
quite            4
one              4
actually         2
mad              2
getting          2
not              2
his              1
infuriating      1
space            1
such             1
my               1
all              1
by               1
this             1
incredibly       1
unacceptable     1
old              1
life             1
light            1
officially       1
Name: name, dtype: int64

In [213]:
# count how many names start with a lowercase letter
test3.name.count()

109
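
The names that start with a lowercase letter all look like ordinary words, not dog names, and the string `None` is used where no name was found. A minimal sketch of one possible cleaning rule follows. The function `clean_name` is hypothetical, and the rule itself (lowercase first letter or the string `None` means "no name") is an assumption based on the counts above.

```python
def clean_name(name):
    """Return the name, or None when it looks like a placeholder or a common word."""
    if not name or name == 'None':
        return None
    if name[0].islower():
        return None
    return name

# a few values that appear in the name column
names = ['Phineas', 'a', 'None', 'quite', 'Bo', 'O']
cleaned = [clean_name(name) for name in names]
```

Short capitalized values such as `O` pass this rule, so they would still need a manual check against the tweet text.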

In [123]:
# look at a sample from image_preds
image_preds.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
502,675870721063669760,https://pbs.twimg.com/media/CWEs1b-WEAEhq82.jpg,1,golden_retriever,0.263892,True,Welsh_springer_spaniel,0.184193,True,beagle,0.182241,True
547,677331501395156992,https://pbs.twimg.com/media/CWZdaGxXAAAjGjb.jpg,1,beagle,0.313464,True,boxer,0.218503,True,French_bulldog,0.106462,True
518,676470639084101634,https://pbs.twimg.com/media/CWNOdIpWoAAWid2.jpg,1,golden_retriever,0.790386,True,borzoi,0.022885,True,dingo,0.015343,False
1017,709918798883774466,https://pbs.twimg.com/media/CdojYQmW8AApv4h.jpg,2,Pembroke,0.956222,True,Cardigan,0.020727,True,Chihuahua,0.007912,True
1536,790581949425475584,https://pbs.twimg.com/media/Cvi2FiKWgAAif1u.jpg,2,refrigerator,0.998886,False,malinois,0.000153,True,kelpie,0.000131,True


In [64]:
image_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [165]:
# investigate what those objects really are
type(image_preds.jpg_url[0]), type(image_preds.p1[0]), type(image_preds.p2[0]), type(image_preds.p3[0])

(str, str, str, str)

In [214]:
# check if there are any duplicate ids or duplicate jpg_url
sum(image_preds.tweet_id.duplicated()) , sum(image_preds.jpg_url.duplicated())

(0, 66)

In [337]:
# look at a sample of query_data
query_data.sample(5)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1886,674410619106390016,447,1164
2279,666983947667116034,928,2444
622,793241302385262592,3379,10738
1025,743222593470234624,1901,6214
2080,670474236058800128,710,1453


In [338]:
query_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
tweet_id          2331 non-null int64
retweet_count     2331 non-null int64
favorite_count    2331 non-null int64
dtypes: int64(3)
memory usage: 54.8 KB


In [339]:
sum(query_data.tweet_id.duplicated())

0

In [340]:
# i'm making a csv file for query_data so it can be viewed in Google sheets
query_data.to_csv('query_data.csv', index = False)

<a id='quality'></a>
**Quality**

Completeness: Are we missing data?  
Validity: Does the data conform to realistic values? (ex. A person can't have negative height.)  
Accuracy: Is the data right? (It can be valid and still wrong.)  
Consistency: Is the data in a standard format?  

twitter_archive:  
- missing data in columns:  
in_reply_to_status_id,  
in_reply_to_user_id,  
retweeted_status_id,  
retweeted_status_user_id,  
retweeted_status_timestamp,  
expanded_urls
- wrong data types:  
in_reply_to_status_id should be type int  
in_reply_to_user_id should be type int  
timestamp should be type datetime  
retweeted_status_id should be type int  
retweeted_status_user_id should be type int  
retweeted_status_timestamp should be type datetime  
- there are values in rating_numerator that are below 10 or that lost their decimal part (for example, 13.5/10 in the text is stored as 5)
- there are values in rating_denominator that aren't 10
- there are invalid names such as: a, an, by, my, O, the, one, mad, not, old, all, his, etc.
- there are retweets (rows where retweeted_status_id has a value instead of NaN)
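
For the data type issue on `timestamp`, a small sketch of the conversion with `pd.to_datetime` is shown below. The two strings are copied from the archive sample shown earlier; the explicit `format` string is my own choice, made to avoid relying on automatic format inference.

```python
import pandas as pd

# two timestamps in the same layout as the archive's timestamp column
timestamps = pd.Series(['2017-08-01 16:23:56 +0000', '2017-07-31 00:18:03 +0000'])

# %z parses the +0000 offset, so the result is timezone-aware
parsed = pd.to_datetime(timestamps, format='%Y-%m-%d %H:%M:%S %z')
```

Once the column is a datetime type, accessors such as `parsed.dt.year` or `parsed.dt.hour` become available for later analysis.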


image_preds
- missing data: image_preds holds image predictions for the tweets in twitter_archive, but twitter_archive has 2356 entries while image_preds has only 2075
- inconsistent capitalization in p1, p2, and p3 columns
- there are duplicate values in jpg_url
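
The capitalization issue in p1, p2, and p3 can be seen in labels such as `Welsh_springer_spaniel` next to `redbone`. One way to make them consistent, sketched here with a hypothetical helper `normalize_label`, is to lowercase each label and replace underscores with spaces. Choosing lowercase over title case is an arbitrary decision on my part.

```python
def normalize_label(label):
    """Lowercase a prediction label and replace underscores with spaces."""
    return label.replace('_', ' ').lower()

# labels taken from the image_preds output shown earlier
labels = ['Welsh_springer_spaniel', 'redbone', 'German_shepherd', 'miniature_pinscher']
normalized = [normalize_label(label) for label in labels]
```

On a DataFrame column, the same idea could be written with the string accessor, for example `image_preds.p1.str.replace('_', ' ').str.lower()`.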

query_data
- missing data, twitter_archive has 2356 entries, while query_data has 2331 entries
- there are also retweets in this dataset

<a id='tidy'></a>
**Tidiness**

Each variable forms a column.  
Each observation forms a row.  
Each type of observational unit forms a table.

twitter_archive
- doggo, floofer, pupper, puppo belong to one variable, they are all a 'stage' of dog

image_preds
- the data describes the same tweets as twitter_archive, so it belongs in the same table

query_data
- the data describes the same tweets as twitter_archive, so it belongs in the same table
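
To illustrate the dog stage issue, the sketch below collapses the four stage columns into a single `stage` column. The rows are made up, and it assumes that a missing stage is stored as the string `None`, as with the name column. The helper `combine_stages` is hypothetical; it joins the stages with a comma when a tweet mentions more than one.

```python
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# made-up rows that mimic the archive layout
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_columns if row[col] != 'None']
    return ', '.join(found) if found else None

sample['stage'] = sample[stage_columns].apply(combine_stages, axis=1)
tidy = sample.drop(columns=stage_columns)
```

Keeping multi-stage tweets as a joined string is only one option; they could also be flagged for manual review.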

<a id='clean'></a>
### Cleaning Data

In [342]:
# make copies of the dataframe before doing any cleaning
twitter_archive_clean = twitter_archive.copy()
image_preds_clean = image_preds.copy()
query_data_clean = query_data.copy()

**Define**  
<sup>T1</sup> Merge all 3 datasets together.

**Code and Test**

In [352]:
# the list of dataframes I want to merge
dfs = [twitter_archive_clean, image_preds_clean, query_data_clean]

In [366]:
# merge them all together, default is inner merge
master_df = reduce(lambda left, right: pd.merge(left,right, on = 'tweet_id'), dfs)
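
Because the merge is an inner join, only tweet IDs present in all three tables survive, which is why the merged table is smaller than twitter_archive. The toy example below, with made-up frames, shows the same `reduce` pattern. The `how` and `validate` arguments are additions of mine: `how='inner'` makes the default explicit, and `validate='one_to_one'` makes pandas raise an error if a tweet_id repeats on either side.

```python
import pandas as pd
from functools import reduce

# made-up frames standing in for the three datasets
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
preds = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['redbone', 'collie']})
counts = pd.DataFrame({'tweet_id': [2, 3], 'retweet_count': [5, 8]})

def merge_on_id(left, right):
    # inner join on tweet_id; validate checks that the key is unique on both sides
    return pd.merge(left, right, on='tweet_id', how='inner', validate='one_to_one')

merged = reduce(merge_on_id, [archive, preds, counts])
```

Here only tweet_id 2 appears in all three frames, so `merged` has a single row.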

In [367]:
# check if merged successfully
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2059 entries, 0 to 2058
Data columns (total 30 columns):
tweet_id                      2059 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2059 non-null object
source                        2059 non-null object
text                          2059 non-null object
retweeted_status_id           72 non-null float64
retweeted_status_user_id      72 non-null float64
retweeted_status_timestamp    72 non-null object
expanded_urls                 2059 non-null object
rating_numerator              2059 non-null int64
rating_denominator            2059 non-null int64
name                          2059 non-null object
doggo                         2059 non-null object
floofer                       2059 non-null object
pupper                        2059 non-null object
puppo                         2059 non-null object
jpg_url                       2059 

**Define**  
<sup>Q1</sup> Delete retweets in the master_df dataframe.

**Code and Test**

In [368]:
# retweets are rows where retweeted_status_id has a value that is not null
# i'm getting the index labels of these rows
retweet_indices = master_df[master_df.retweeted_status_id.notnull()].index

In [369]:
# i'm dropping the rows by their index labels
master_df.drop(retweet_indices, inplace = True)

In [371]:
# making sure the rows are gone
master_df[master_df.retweeted_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count


**Define**  
<sup>Q2</sup> Remove columns retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.

**Code and Test**

In [372]:
master_df.drop(columns = ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace = True)

In [376]:
# check if the columns are actually dropped
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 2058
Data columns (total 25 columns):
tweet_id              1987 non-null int64
timestamp             1987 non-null object
source                1987 non-null object
text                  1987 non-null object
expanded_urls         1987 non-null object
rating_numerator      1987 non-null int64
rating_denominator    1987 non-null int64
name                  1987 non-null object
doggo                 1987 non-null object
floofer               1987 non-null object
pupper                1987 non-null object
puppo                 1987 non-null object
jpg_url               1987 non-null object
img_num               1987 non-null int64
p1                    1987 non-null object
p1_conf               1987 non-null float64
p1_dog                1987 non-null bool
p2                    1987 non-null object
p2_conf               1987 non-null float64
p2_dog                1987 non-null bool
p3                    1987 non-null obj

**Define**  
<sup>Q3</sup> Remove columns in_reply_to_status_id and in_reply_to_user_id.

**Code and Test**

In [374]:
# these two columns were missing data, but they're not relevant to any analysis i'm planning to make
# so i'm just dropping them
master_df.drop(columns = ['in_reply_to_status_id', 'in_reply_to_user_id'], inplace = True)

In [375]:
# check if the columns dropped properly
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1987 entries, 0 to 2058
Data columns (total 25 columns):
tweet_id              1987 non-null int64
timestamp             1987 non-null object
source                1987 non-null object
text                  1987 non-null object
expanded_urls         1987 non-null object
rating_numerator      1987 non-null int64
rating_denominator    1987 non-null int64
name                  1987 non-null object
doggo                 1987 non-null object
floofer               1987 non-null object
pupper                1987 non-null object
puppo                 1987 non-null object
jpg_url               1987 non-null object
img_num               1987 non-null int64
p1                    1987 non-null object
p1_conf               1987 non-null float64
p1_dog                1987 non-null bool
p2                    1987 non-null object
p2_conf               1987 non-null float64
p2_dog                1987 non-null bool
p3                    1987 non-null obj

**Define**  


**Code and Test**

<a id='store'></a>
### Storing Data
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv.
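
A minimal sketch of the storing step is shown below. It uses a made-up two-row frame and a temporary directory so that it does not touch the project files. In the notebook, the frame to store would presumably be `master_df`.

```python
import os
import tempfile
import pandas as pd

# made-up frame standing in for the cleaned master DataFrame
sample = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426],
                       'rating_numerator': [13, 13]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'twitter_archive_master.csv')
    # index=False keeps the row index out of the file,
    # so reading it back does not add an unnamed column
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)
```

Reading the file back right after writing it is a cheap check that the 18-digit tweet IDs survive the round trip as integers.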

<a id='analyze'></a>
### Analyzing Data
Make at least 3 insights.

<a id='visualize'></a>
### Visualizing Data
Make at least 1 visual.

<a id='conclusion'></a>
### Conclusion

<a id='source'></a>
### Sources
All of the links I used as references are listed below.
- https://twitter.com/dog_rates