# Project: Wrangle and Analyse WeRateDogs Twitter Data to Create Analyses and Visualisations


## 1. Introduction


This project wrangles data from the WeRateDogs Twitter account using Python and is documented in a Jupyter Notebook (wrangle_act.ipynb). WeRateDogs rates people's dogs with humorous commentary. The rating denominator is usually 10, but the numerators are usually greater than 10 because, in the account's words, "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for us to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

The goal of this project is to wrangle the WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The challenge is that the Twitter archive, although useful, contains only very basic tweet information; the additional data comes from Twitter's API in JSON format. I needed to gather, assess, and clean the Twitter data to support a worthwhile analysis and visualization.

We gather 3 pieces of data:

### a. Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." We manually downloaded the twitter-archive-enhanced.csv file from the following [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv).
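As a rough illustration of how a rating, a name, and a stage could be pulled out of tweet text, the sketch below applies Python's `re` module to a made-up tweet. The patterns, the sample text, and the helper name `parse_tweet_text` are assumptions for illustration only; they are not the method used to build the enhanced archive.

```python
import re

STAGES = ("doggo", "floofer", "pupper", "puppo")

def parse_tweet_text(text):
    """Return (numerator, denominator, name, stage); parts that are not found are None."""
    rating = re.search(r"(\d+)/(\d+)", text)          # first "N/M" pattern in the text
    name = re.search(r"This is ([A-Z][a-z]+)", text)  # assumes the "This is <Name>" phrasing
    stage = next(
        (s for s in STAGES if re.search(rf"\b{s}\b", text, re.IGNORECASE)),
        None,
    )
    return (
        int(rating.group(1)) if rating else None,
        int(rating.group(2)) if rating else None,
        name.group(1) if name else None,
        stage,
    )

# Made-up example text in the style of the account
sample = "This is Phineas. He's a mystical pupper. 13/10"
print(parse_tweet_text(sample))
```

Simple patterns like these will misread some tweets (for example, text that contains a date such as 9/11 before the rating), which is one reason the extracted columns still need assessing and cleaning.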

### b. Image Predictions File

The tweet image predictions record what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image-predictions.tsv) is hosted on Udacity's servers, and we downloaded it programmatically using the Python Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

### c. Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But we, because we have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+.
In this project, I'll be using Tweepy to query Twitter's API for data included in the WeRateDogs Twitter archive. This data will include retweet count and favorite count.

### Key Points

Before we start, here are a few points to keep in mind when data wrangling for this project:

1) We only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

2) Fully assessing and cleaning the entire dataset requires exceptional effort, so only a subset of its issues (eight (8) quality issues and two (2) tidiness issues at minimum) need to be assessed and cleaned.

3) Cleaning includes merging individual pieces of data according to the rules of tidy data.

4) The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
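The first key point can be sketched with pandas as follows. The two small DataFrames are toy stand-ins with made-up values; only the column names `tweet_id`, `retweeted_status_id`, and `jpg_url` mirror the real tables. This is one possible approach, not necessarily the cleaning code used later in the notebook.

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up)
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [None, "99", None],  # a non-null value marks a retweet
    "rating_numerator": [13, 12, 11],
})
images = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})

# Keep original tweets only (no retweets)
originals = archive[archive["retweeted_status_id"].isna()]

# Keep only tweets that have an image: an inner merge drops IDs missing from `images`
rated_with_images = originals.merge(images, on="tweet_id", how="inner")

print(rated_with_images["tweet_id"].tolist())
```

Here tweet "2" is dropped because it is a retweet, and tweet "3" is dropped because it has no image prediction.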

<br><br>

In [14]:

import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import tweepy
import json
from timeit import default_timer as timer
from tweepy import OAuthHandler

## 2. Gathering Data

<br><br>
#### Loading the twitter-archive-enhanced.csv [WeRateDogs Twitter archive] into a DataFrame

In [26]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [27]:
twitter_archive.head(5)

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.92421e+17 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 8.92177e+17 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 8.91815e+17 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 8.9169e+17 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 8.91328e+17 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


In [28]:
twitter_archive.shape

(2356, 17)

In [46]:
print(twitter_archive['tweet_id'][:3])

0    892421000000000000
1    892177000000000000
2    891815000000000000
Name: tweet_id, dtype: object


In [29]:
twitter_archive.dtypes

tweet_id                      float64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object
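The `float64` dtype for `tweet_id` matters: a 64-bit float carries only 53 bits of precision, so it cannot represent every 18-digit tweet ID exactly, and a rounded ID no longer identifies a real tweet. The sketch below shows the effect with a tweet ID from the image predictions table and one possible way to avoid it, by reading IDs as strings. The in-memory CSV (including its `rating_numerator` value) is a made-up stand-in for the real file.

```python
import io
import pandas as pd

tweet_id = 666020888022790149          # an 18-digit tweet ID

# Round-tripping through a 64-bit float changes the trailing digits
as_float = float(tweet_id)
print(int(as_float) == tweet_id)       # False

# Reading the column as a string keeps every digit intact
csv_text = "tweet_id,rating_numerator\n666020888022790149,10\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"tweet_id": str})
print(df["tweet_id"].iloc[0])
```

Treating tweet IDs as strings also fits their role: they are identifiers, not quantities to do arithmetic on.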

In [59]:
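# Cast explicitly to a 64-bit integer; plain `int` can map to a 32-bit integer on some platforms.
# Note: the cast cannot restore digits that were already lost while the column was float64.
twitter_archive['tweet_id'] = twitter_archive['tweet_id'].astype('int64')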

#twitter_archive['tweet_id'] = pd.to_numeric(twitter_archive['tweet_id'])

# additional_data_clean["tweet_id"] = pd.to_numeric(additional_data_clean["tweet_id"])

In [60]:
twitter_archive.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

In [61]:
twitter_archive.head(5)

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892421000000000000 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177000000000000 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 891815000000000000 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 891690000000000000 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 891328000000000000 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


<br><br>
#### Importing tweet image predictions

In [62]:
# import tweet image predictions:
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
r.status_code
r.headers['content-type']

'text/tab-separated-values; charset=utf-8'

In [63]:
r.encoding     # In UTF-8, every code point from 0-127 is stored in a single byte. Code points 128 and above are stored using 2 to 4 bytes.

'utf-8'
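The variable-width nature of UTF-8 is easy to check directly. The short sketch below encodes four sample characters (chosen here purely for illustration) and records how many bytes each one needs.

```python
# Sample characters: ASCII letter, accented letter, euro sign, dog emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F436"]

# Number of bytes each character occupies once encoded as UTF-8
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(repr(ch), n)
```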

In [64]:
# Save the downloaded TSV content to a local file (binary mode, so the bytes are written unchanged):

with open('image-predictions.tsv' , mode='wb') as file:
    file.write(r.content)

In [65]:
image_df = pd.read_csv('image-predictions.tsv', sep='\t', encoding = 'utf-8')
image_df.head()

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


In [66]:
image_df.shape

(2075, 12)

In [67]:
image_df.dtypes

tweet_id      int64
jpg_url      object
img_num       int64
p1           object
p1_conf     float64
p1_dog         bool
p2           object
p2_conf     float64
p2_dog         bool
p3           object
p3_conf     float64
p3_dog         bool
dtype: object

<br><br>
#### Loading retweet count and favorite count data from the Twitter API via Tweepy

In [68]:
import os
import tweepy

# Credentials are read from environment variables rather than hard-coded,
# so that secrets are never published with the notebook.
# The variable names below are placeholders; use whichever names you have set.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# create API connection to Twitter:
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

# NOTE: wait_on_rate_limit: whether or not to automatically wait for rate limits to replenish
# NOTE: wait_on_rate_limit_notify: whether or not to print a notification when Tweepy is waiting for rate limits to replenish (Tweepy 3.x only; removed in Tweepy 4.0)

In [69]:
tweet_ids = twitter_archive.tweet_id.values
print('We will have to query the following number of tweet IDs in the Twitter archive:', len(tweet_ids))

We will have to query the following number of tweet IDs in the Twitter archive: 2356


In [51]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:  # Tweepy 3.x exception class (renamed TweepyException in Tweepy 4.0)
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

1: 892421000000000000
Fail
2: 892177000000000000
Fail
3: 891815000000000000
Fail
4: 891690000000000000
Fail
5: 891328000000000000
Fail
6: 891088000000000000
Fail
7: 890972000000000000
Fail
8: 890729000000000000
Fail
9: 890609000000000000
Fail
10: 890240000000000000
Fail
11: 890007000000000000
Fail
12: 889881000000000000
Fail
13: 889665000000000000
Fail
14: 889639000000000000
Fail
15: 889531000000000000
Fail
16: 889279000000000000
Fail
17: 888917000000000000
Fail
18: 888805000000000000
Fail
19: 888555000000000000
Fail
20: 888203000000000000
Fail
21: 888078000000000000
Fail
22: 887705000000000000
Fail
23: 887517000000000000
Fail
24: 887474000000000000
Fail
25: 887343000000000000
Fail
26: 887101000000000000
Fail
27: 886983000000000000
Fail
28: 886737000000000000
Fail
29: 886680000000000000
Fail
30: 886366000000000000
Fail
31: 886267000000000000
Fail
32: 886258000000000000
Fail
33: 886054000000000000
Fail
34: 885985000000000000
Fail
35: 885529000000000000
Fail
36: 885519000000000000
Fail
[output truncated]

Rate limit reached. Sleeping for: 622

[output truncated]
902: 758475000000000000
Fail
903: 758467000000000000
Fail
904: 758406000000000000
Fail
905: 758355000000000000
Fail
906: 758100000000000000
Fail
907: 758041000000000000
Fail
908: 757742000000000000
Fail
909: 757729000000000000
Fail
910: 757726000000000000
Fail
911: 757612000000000000
Fail
912: 757598000000000000
Fail
913: 757596000000000000
Fail
914: 757400000000000000
Fail
915: 757393000000000000
Fail
916: 757355000000000000
Fail
917: 756998000000000000
Fail
918: 756939000000000000
Fail
919: 756652000000000000
Fail
920: 756526000000000000
Fail
921: 756303000000000000
Fail
922: 756289000000000000
Fail
923: 756276000000000000
Fail
924: 755956000000000000
Fail
925: 755207000000000000
Fail
926: 755111000000000000
Fail
927: 754875000000000000
Fail
928: 754857000000000000
Fail
929: 754747000000000000
Fail
930: 754482000000000000
Fail
931: 754450000000000000
Fail
932: 754120000000000000
Fail
933: 754012000000000000
Fail
934: 753656000000000000
Fail
935: 753421000000000000
Fail
[output truncated]

Rate limit reached. Sleeping for: 626

[output truncated]
1802: 676958000000000000
Fail
1803: 676950000000000000
Fail
1804: 676948000000000000
Fail
1805: 676947000000000000
Fail
1806: 676942000000000000
Fail
1807: 676937000000000000
Fail
1808: 676917000000000000
Fail
1809: 676898000000000000
Fail
1810: 676865000000000000
Fail
1811: 676822000000000000
Fail
1812: 676820000000000000
Fail
1813: 676812000000000000
Fail
1814: 676776000000000000
Fail
1815: 676618000000000000
Fail
1816: 676614000000000000
Fail
1817: 676607000000000000
Fail
1818: 676603000000000000
Fail
1819: 676593000000000000
Fail
1820: 676591000000000000
Fail
1821: 676588000000000000
Fail
1822: 676583000000000000
Fail
1823: 676576000000000000
Fail
1824: 676534000000000000
Fail
1825: 676496000000000000
Fail
1826: 676471000000000000
Fail
1827: 676440000000000000
Fail
1828: 676431000000000000
Fail
1829: 676264000000000000
Fail
1830: 676237000000000000
Fail
1831: 676220000000000000
Fail
1832: 676216000000000000
Fail
1833: 676192000000000000
Fail
1834: 676146000000000000
Fail
[output truncated]
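The loop above stores each tweet as one JSON document per line. A minimal sketch of that write-then-read pattern, using only the standard library, is shown below. The records and the file name `tweet_json_demo.txt` are made up for illustration and stand in for the real `tweet._json` payloads.

```python
import json
import os
import tempfile

# Made-up records standing in for tweet._json payloads
records = [
    {"id_str": "1", "retweet_count": 5, "favorite_count": 10},
    {"id_str": "2", "retweet_count": 0, "favorite_count": 3},
]

path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")

# Write one JSON document per line
with open(path, "w") as outfile:
    for record in records:
        json.dump(record, outfile)
        outfile.write("\n")

# Read the file back, parsing each line independently
with open(path) as infile:
    loaded = [json.loads(line) for line in infile]

print(loaded == records)
```

Because every line is a complete JSON document, a partially written file can still be read up to the last complete line.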

In [188]:
# Read the line-delimited JSON file and collect one row per tweet.
# Rows are gathered in a list first: DataFrame.append was removed in pandas 2.0,
# and appending inside a loop copies the whole DataFrame on every iteration.
rows = []
with open('tweet_json.txt') as data_file:
    for line in data_file:
        tweet = json.loads(line)
        rows.append([tweet['id_str'], tweet['retweet_count'], tweet['favorite_count']])

# dtype=object reproduces the column types produced by the original append-based loop
df_tweet_json = pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'], dtype=object)

In [190]:
df_tweet_json.describe()

| | tweet_id | retweet_count | favorite_count |
|---|---|---|---|
| count | 2059 | 2059 | 2059 |
| unique | 2059 | 1570 | 1820 |
| top | 772117678702071809 | 521 | 0 |
| freq | 1 | 6 | 72 |
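Because every column here has the `object` dtype, `describe()` reports count, unique, top, and freq rather than numeric statistics. One possible fix, sketched below on two sample rows (the IDs and counts are taken from the sample output at the end of this section), is to cast the count columns to integers while leaving `tweet_id` as a string identifier.

```python
import pandas as pd

# Two rows in the same shape as df_tweet_json, with object dtype throughout
df = pd.DataFrame(
    [["666020888022790149", 462, 2412], ["666029285002620928", 42, 121]],
    columns=["tweet_id", "retweet_count", "favorite_count"],
    dtype=object,
)

# Cast only the count columns; tweet_id stays a string
df = df.astype({"retweet_count": "int64", "favorite_count": "int64"})

print(df.dtypes)
print(df[["retweet_count", "favorite_count"]].describe())
```

After the cast, `describe()` returns the mean, standard deviation, and quartiles for the two count columns.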


In [None]:
# Note: alternative approach that directly builds a list with tweet_id, favorite_count and retweet_count:
# Query Twitter's API for JSON data for each tweet ID in the image predictions table (siddhi789)

start = timer()
df_list = []
errors = []
for tweet_id in image_df['tweet_id']:    # named tweet_id to avoid shadowing the built-in id()
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')        # returns a Tweepy Status object; the raw JSON dictionary is available as tweet._json
        # print(tweet.full_text)    # with tweet_mode='extended' the text is stored in full_text

        df_list.append({'tweet_id': str(tweet.id),
                        'favorite_count': int(tweet.favorite_count),
                        'retweet_count': int(tweet.retweet_count)})
    except Exception as e:
        print(str(tweet_id) + " : " + str(e))
        errors.append(tweet_id)
end = timer()

df_list

[{'tweet_id': '666020888022790149',
  'favorite_count': 2412,
  'retweet_count': 462},
 {'tweet_id': '666029285002620928',
  'favorite_count': 121,
  'retweet_count': 42},
  
  ]
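A list of dictionaries like this converts directly into a DataFrame, which can then be merged with the other tables on `tweet_id`. The sketch below uses the two records shown above and a two-row, two-column slice of the image predictions; the variable names `counts`, `images`, and `merged` are chosen here for illustration. Note that the key columns must share a dtype before merging, so the integer IDs are cast to strings.

```python
import pandas as pd

df_list = [
    {"tweet_id": "666020888022790149", "favorite_count": 2412, "retweet_count": 462},
    {"tweet_id": "666029285002620928", "favorite_count": 121, "retweet_count": 42},
]
counts = pd.DataFrame(df_list, columns=["tweet_id", "favorite_count", "retweet_count"])

# Slice of the image predictions; tweet_id is int64 there, so cast it to str to match
images = pd.DataFrame({
    "tweet_id": [666020888022790149, 666029285002620928],
    "p1": ["Welsh_springer_spaniel", "redbone"],
})
images["tweet_id"] = images["tweet_id"].astype(str)

# Left merge keeps every image prediction, even if a tweet's counts are missing
merged = images.merge(counts, on="tweet_id", how="left")
print(merged)
```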