Key Points

Key points to keep in mind when data wrangling for this project:

   - You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

   - Assessing and cleaning the entire dataset completely would require a lot of time and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, this project only requires you to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.

   - Cleaning includes merging individual pieces of data according to the rules of tidy data.

   - The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.

   - You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used. (Done)
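
The first key point can be sketched as a boolean filter. This is a minimal illustration on a toy frame, assuming that a non-null `retweeted_status_id` marks a retweet and that the image-predictions table carries a `tweet_id` column (an assumption; that table is not displayed in this notebook). The names `archive`, `images`, and `originals` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the tweet archive and the image-predictions table.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 9.0, None, None],  # non-null means retweet
})
images = pd.DataFrame({"tweet_id": [1, 2, 4]})  # tweets that have an image

# Keep rows that are not retweets and that appear in the image table.
is_original = archive["retweeted_status_id"].isna()
has_image = archive["tweet_id"].isin(images["tweet_id"])
originals = archive[is_original & has_image].reset_index(drop=True)
print(originals["tweet_id"].tolist())
```

Tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image, so only tweets 1 and 4 remain.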


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy
from tweepy import OAuthHandler
import json
import csv
import sys
import os
import time

In [15]:
# Importing the given Data

In [37]:
df_tweet = pd.read_csv("twitter-archive-enhanced.csv")
df_tweet.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


In [64]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [74]:
df_tweet.isna().sum()/len(df_tweet)

tweet_id                      0.000000
in_reply_to_status_id         0.966893
in_reply_to_user_id           0.966893
timestamp                     0.000000
source                        0.000000
text                          0.000000
retweeted_status_id           0.923175
retweeted_status_user_id      0.923175
retweeted_status_timestamp    0.923175
expanded_urls                 0.025042
rating_numerator              0.000000
rating_denominator            0.000000
name                          0.000000
doggo                         0.000000
floofer                       0.000000
pupper                        0.000000
puppo                         0.000000
dtype: float64

In [77]:
len(df_tweet)

2356

In [115]:
df_tweet["timestamp"] = pd.to_datetime(df_tweet["timestamp"])

In [116]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: datetime64[ns](1

In [125]:
df_tweet[df_tweet["timestamp"] > '2017-08-02 00:00:00 +0000']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [123]:
# No tweet in the archive is dated after August 1st, 2017

In [126]:
df_tweet.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [82]:
df_tweet.rating_denominator.value_counts()
# The denominator is almost always 10; a few tweets have other values.

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [72]:
# Most rows have no retweet data (the retweeted_status_* columns are about 92% null)

In [85]:
df_tweet.loc[df_tweet.rating_denominator == 11, 'text']

784     RT @dog_rates: After so many requests, this is...
1068    After so many requests, this is Bretagne. She ...
1662    This is Darrel. He just robbed a 7/11 and is i...
Name: text, dtype: object

In [87]:
# Indexing df_tweet['text'] with the text values themselves treats them as
# index labels and yields NaN, so print the full text of the matching rows instead.
for text in df_tweet.loc[df_tweet.rating_denominator == 11, 'text']:
    print(text)

RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…
After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5

In [89]:
# Print the full text of the tweets whose parsed denominator is 50
for text in df_tweet.loc[df_tweet.rating_denominator == 50, 'text']:
    print(text)

This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa

In [124]:
# Print the full text of the tweets whose parsed denominator is 80
for text in df_tweet.loc[df_tweet.rating_denominator == 80, 'text']:
    print(text)

Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw

In [92]:
# Print the full text of the tweets whose parsed denominator is 20
for text in df_tweet.loc[df_tweet.rating_denominator == 20, 'text']:
    print(text)

Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating
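
The denominator checks above suggest that the stored rating is sometimes the first fraction in the tweet text (9/11, 7/11, 50/50, 4/20) rather than the actual rating. One possible way to re-extract it is to take the last fraction in the text. This is only a sketch: the helper name `last_fraction` is hypothetical, the sample strings are shortened from the tweets shown above, and the "last fraction" rule is a heuristic that happens to fit these examples, not a general rule for the whole archive.

```python
import re

# Matches integer fractions such as 14/10; decimal ratings are ignored here.
FRACTION = re.compile(r"(\d+)/(\d+)")

def last_fraction(text):
    """Return the last (numerator, denominator) pair in text, or None."""
    matches = FRACTION.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    return int(numerator), int(denominator)

# Shortened versions of tweets displayed earlier in this notebook.
samples = [
    "She was the last surviving 9/11 search dog, and our second ever 14/10. RIP",
    "He just robbed a 7/11 and is in a high speed police chase. 10/10",
    "Happy 4/20 from the squad! 13/10 for all",
    "Total score: 88/80 would pet all at once",
]
for text in samples:
    print(last_fraction(text), "<-", text)
```

A tweet that contains only a non-rating fraction (such as the second 4/20 tweet) would still return that fraction, so rows like that need a manual check.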

In [17]:
# Downloading the required file from the internet

In [3]:
import requests

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the raw bytes of the TSV file in the working directory
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)


In [4]:
# Open the TSV file saved by the download cell above
images = pd.read_csv('image_predictions.tsv', sep='\t')

In [None]:
# Additional Data via the Twitter API

In [5]:
"""
# authentication pieces
consumer_key = ""
consumer_secret = ""
access_token = ""
access_secret = ""


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth_handler=auth, 
                 wait_on_rate_limit=True, 
                 wait_on_rate_limit_notify=True)

"""


In [6]:
# Query the Twitter API for each tweet_id in df_tweet.
"""
tweet_ids = list(df_tweet.tweet_id)

tweet_data = {}
for _id in tweet_ids:
    try:
        # Rate-limit handling is already configured on the API object above
        tweet_status = api.get_status(_id)
        tweet_data[str(_id)] = tweet_status._json
    except Exception as error:
        # Some ids can no longer be fetched; report the id and move on
        print("Error for: " + str(_id) + " (" + str(error) + ")")
"""

In [7]:
"""
with open('tweet_json.txt', 'w') as file:
    file.write(json.dumps(tweet_data))
"""

In [8]:
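# Read the whole file once; `tweets` is parsed with json.loads in a later cell.
with open('tweet_json.txt', 'r') as tweet_json:
    tweets = tweet_json.read()

# The original loop iterated over the file handle after read() had already
# consumed it, so it never ran; iterate over the text that was read instead.
tweet_line = []
for line in tweets.splitlines():
    try:
        tweet_line.append(json.loads(line))
    except json.JSONDecodeError:
        continue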
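
The reading cell above parses the file line by line, while the (disabled) writing cell stores everything as one JSON object. If a line-delimited layout is wanted, it could look like the sketch below. The helper names `write_json_lines` and `read_json_lines`, the file name, and the toy `tweet_data` payload are hypothetical; the counts are the ones shown for the first two tweets in this notebook.

```python
import json
import os
import tempfile

def write_json_lines(records, path):
    """Write each record as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

def read_json_lines(path):
    """Parse one JSON document per non-empty line."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Toy payload keyed by tweet id string, like tweet_data in the query cell.
tweet_data = {
    "892420643555336193": {"retweet_count": 8315, "favorite_count": 38018},
    "892177421306343426": {"retweet_count": 6142, "favorite_count": 32639},
}
path = os.path.join(tempfile.mkdtemp(), "tweet_json_lines.txt")
write_json_lines(
    [{"id_str": key, **value} for key, value in tweet_data.items()], path
)
records = read_json_lines(path)
print(len(records), records[0]["id_str"])
```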

In [18]:
# Converting the Twitter API data to a pandas DataFrame to combine with the existing DataFrame

In [20]:
# json is already imported in the first cell
obj = json.loads(tweets)    # obj now contains a dict of the data, keyed by tweet id

In [21]:
obj['892420643555336193'][ 'text']
obj['892420643555336193']['retweet_count']
obj['892420643555336193']['favorite_count']

38018

In [93]:
loaded_r = json.loads(tweets)

In [94]:
data = json.loads(tweets)

In [98]:
data = json.loads(tweets)
tweet_id, retweet_count, favorite_count = [], [], []
for key, tweet in data.items():
    tweet_id.append(key)
    retweet_count.append(tweet['retweet_count'])
    favorite_count.append(tweet['favorite_count'])
# Transposing a list of lists leaves every column with dtype object
df_api = pd.DataFrame([tweet_id, retweet_count, favorite_count]).T

In [99]:
df_api.columns = ["tweet_id","retweet_count","favorite_count"]
df_api.head(1)

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8315,38018


In [100]:
# Complete API Data

In [101]:
data["892420643555336193"]

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 540, 'h': 528, 'resize': 'fit'}}}]},


In [102]:
# Checking the variables in df_api DataFrame

In [103]:
df_api.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8315,38018
1,892177421306343426,6142,32639
2,891815181378084864,4067,24561
3,891689557279858688,8451,41368
4,891327558926688256,9157,39551


In [104]:
df_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 3 columns):
tweet_id          2340 non-null object
retweet_count     2340 non-null object
favorite_count    2340 non-null object
dtypes: object(3)
memory usage: 54.9+ KB


In [105]:
# All three columns are of dtype object. That shouldn't be the case.

In [106]:
df_api["tweet_id"] = df_api.tweet_id.astype(np.uint64)

In [107]:
df_api["retweet_count"] = df_api.retweet_count.astype(np.uint64)

In [108]:
df_api["favorite_count"] = df_api.favorite_count.astype(np.uint64)

In [109]:
df_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 3 columns):
tweet_id          2340 non-null uint64
retweet_count     2340 non-null uint64
favorite_count    2340 non-null uint64
dtypes: uint64(3)
memory usage: 54.9 KB
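
The `astype` calls above are needed because the frame was built by transposing a list of lists, which leaves every column as `object`. A possible alternative is to build the frame from one dictionary per row so that integer dtypes are inferred directly. This is a sketch on toy data taken from the first two rows shown above; `rows` and `df_api_typed` are hypothetical names, and converting the key with `int()` assumes a signed integer id, matching the archive's `int64` `tweet_id`, is what is wanted.

```python
import pandas as pd

# Toy version of the parsed JSON: tweet id string -> tweet payload.
data = {
    "892420643555336193": {"retweet_count": 8315, "favorite_count": 38018},
    "892177421306343426": {"retweet_count": 6142, "favorite_count": 32639},
}

# One dictionary per row; the string key is converted to an integer id.
rows = [
    {
        "tweet_id": int(key),
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    }
    for key, tweet in data.items()
]
df_api_typed = pd.DataFrame(rows)
print(df_api_typed.dtypes)
```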


In [110]:
df_api.isna().sum()/len(df_api)

tweet_id          0.0
retweet_count     0.0
favorite_count    0.0
dtype: float64

In [111]:
len(df_api)

2340

In [112]:
df_merge = pd.merge(df_tweet,df_api,on="tweet_id")

In [113]:
df_merge.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,8315,38018
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,6142,32639
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,4067,24561
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,8451,41368
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,9157,39551
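
The archive has 2356 rows and the API data has 2340, so the default inner merge drops the tweets that have no API record without reporting them. A left merge with `indicator=True` is one way to list those tweets before deciding what to do with them. The sketch below uses toy frames; `df_left`, `df_right`, `checked`, and `missing` are hypothetical names.

```python
import pandas as pd

# Toy frames: tweet 2 has no matching API record.
df_left = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["a", "b", "c"]})
df_right = pd.DataFrame({"tweet_id": [1, 3], "retweet_count": [10, 30]})

# how="left" keeps every archive row; the _merge column records where each row came from.
checked = pd.merge(df_left, df_right, on="tweet_id", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()
print(missing)
```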
