# WeRateDogs Twitter Feed

This project looks at various data sources for Tweets from the [WeRateDogs](https://twitter.com/dog_rates) Twitter account, specifically:

1. the `twitter-archive-enhanced.csv` file, which contains the tweet text and various related identifiers
1. the Twitter API, used to access the original tweets and retrieve missing fields such as the retweet and favorite counts
1. an image prediction file containing the top 3 predictions for each of the (up to 4) dog pictures in the tweet


The `twitter-archive-enhanced.csv` file is the master source for the set of tweets included in this project. This dataset will first be cleaned, then enriched with additional attributes obtained through the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), and finally enriched with data from the breed predictions file, which holds the top 3 predictions based on the dogs' images. The data will be mapped into Pandas DataFrames following the Tidy Data principles.

---

## Gather data

The data assets used are listed below. Some are local files, downloaded into the `data` directory, some are remote files on web servers, and some are sourced via an API. Where there is enrichment of existing data, I will add the new attributes to the base DataFrame. Standalone data such as the breed predictions I will load into a separate DataFrame, so as to later reconcile back to the tweets.

In [1]:
!ls ./data

image-predictions.tsv        tweet-json.zip
tweet-json copy              twitter-archive-enhanced.csv


In [2]:
WE_RATE_DOGS_TWEETS = 'data/twitter-archive-enhanced.csv'
DOG_BREED_PREDICTIONS_SNAPSHOT = 'data/image-predictions.tsv'
DOG_BREED_PREDICTIONS_SOURCE_URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
TWEET_STATISTICS_SNAPSHOT = 'data/tweet_json.txt'
TWEET_STATISTICS_API_URL = ''

It is worth noting that the Pandas `read_csv()` function is quite versatile when loading data, and can be configured to handle different date formats, numeric data types, not available (NA) value markers, etc. But to benefit from such functionality we need to eyeball the raw data first.
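
As a rough illustration of those options, here is a minimal sketch on a made-up three-row sample (not the real archive). The column subset and the choice to treat the string `None` as missing are assumptions made for the example, not steps taken in this notebook.

```python
import io
import pandas as pd

# Made-up sample in the same general shape as the archive (not real data).
sample_csv = io.StringIO(
    "tweet_id,timestamp,rating_numerator,name\n"
    "1001,2017-08-01 16:23:56 +0000,13,Phineas\n"
    "1002,2017-08-01 00:17:27 +0000,13,None\n"
    "1003,2016-12-25 22:00:04 +0000,12,Tilly\n"
)

sample_df = pd.read_csv(
    sample_csv,
    dtype={"tweet_id": "int64", "rating_numerator": "int32"},  # explicit numeric types
    parse_dates=["timestamp"],  # parse this column as datetimes
    na_values=["None"],         # treat the literal string "None" as missing
)

print(sample_df.dtypes)
print(sample_df["name"].isna().tolist())
```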

Let's look at the first 8 records, and 8 records from near the end of the file:-

In [3]:
!head -n 11 ./data/twitter-archive-enhanced.csv

tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,None,None,None,None
892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,None,None,None,None
891815181378084864,,,20

In [4]:
!tail -n 1940 ./data/twitter-archive-enhanced.csv | head -n 8

813142292504645637,,,2016-12-25 22:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Everybody stop what you're doing and look at this dog with her tiny Santa hat. 13/10 https://t.co/KK4XQK9SPi,,,,"https://twitter.com/dog_rates/status/813142292504645637/photo/1,https://twitter.com/dog_rates/status/813142292504645637/photo/1,https://twitter.com/dog_rates/status/813142292504645637/photo/1",13,10,None,None,None,None,None
813130366689148928,8.13127251579564e+17,4196983835.0,2016-12-25 21:12:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I've been informed by multiple sources that this is actually a dog elf who's tired from helping Santa all night. Pupgraded to 12/10,,,,,12,10,None,None,None,None,None
813127251579564032,,,2016-12-25 21:00:18 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's an anonymous doggo that appears to be very done

A few observations:

1. the 1st line lists the column names
1. tweet Ids are very large integers
1. some Ids appear as floating point numbers (e.g. `in_reply_to_status_id`, `in_reply_to_user_id`), because those columns are mostly empty and missing values are represented as NaN, which forces a float type
1. time stamps are almost in ISO 8601 format (they are missing the 'T' separator) and are in GMT (+00:00 offset)
1. strings are wrapped in double quotes (") and embedded quotes appear as two contiguous quotes ("")
1. some strings, such as the tweets with links to a GoFundMe page, have 2 embedded new lines within the double quotes
1. dog names and stages (last 5 columns) are extracted where found, otherwise the value _None_ is placed in those columns
1. whilst the dog name is variable, the stage is predefined, so putting the stage name into a column named after the stage is redundant information
1. the 'source' column for all rows is an HTML anchor with a link to http://twitter.com/download/iphone, which adds no value, so we can drop this column
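
The quoting and time stamp observations can be checked with the standard library alone. The sketch below uses a made-up record written in the archive's style; the field values are invented for illustration.

```python
import csv
import io
from datetime import datetime

# Made-up record: doubled quotes inside a quoted field, plus a newline
# embedded in the tweet text.
raw = (
    'tweet_id,timestamp,source,text\n'
    '1001,2017-08-01 16:23:56 +0000,'
    '"<a href=""http://twitter.com/download/iphone"">Twitter for iPhone</a>",'
    '"First line\nsecond line 13/10"\n'
)

rows = list(csv.DictReader(io.StringIO(raw)))
record = rows[0]

# The time stamp uses a space where ISO 8601 expects 'T'; %z reads the +0000 offset.
stamp = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S %z")

print(len(rows))          # 1: the embedded newline does not split the record
print(record["source"])   # doubled quotes collapse to single quotes
print(stamp.isoformat())  # 2017-08-01T16:23:56+00:00
```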


Actions taken to address above observations:

* convert floating point tweet Ids to 64-bit integers, retaining the Not Available representation
* specifically tell Pandas which columns are dates
* drop the 'source' column
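
The redundancy noted for the stage columns is not among the actions above. One possible way to tidy it later is sketched here on made-up rows; the `stage` column name and the `collapse_stages` helper are hypothetical, and joining multiple stages with a comma is just one option.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows mimicking the archive: each stage column holds either its
# own name or the string 'None'.
sample = pd.DataFrame(
    {
        "doggo": ["None", "doggo", "None", "doggo"],
        "floofer": ["None", "None", "None", "None"],
        "pupper": ["None", "None", "pupper", "pupper"],
        "puppo": ["None", "None", "None", "None"],
    },
    index=pd.Index([1001, 1002, 1003, 1004], name="tweet_id"),
)

def collapse_stages(row):
    """Join the stages present in a row; return None when there are none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ",".join(found) if found else None

# Replace the four flag-like columns with a single 'stage' column.
sample["stage"] = sample[STAGES].apply(collapse_stages, axis=1)
tidy = sample.drop(columns=STAGES)
print(tidy)
```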

In [26]:
import yaml
import tweepy
import json

In [6]:
import numpy as np
import pandas as pd

In [7]:
from util.config import read_creds

Load the enhanced Twitter archive, using explicit data types for fields instead of letting Pandas infer them. The [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) defines the data types for the Twitter fields, to which I add the data types for the "enhanced" fields.

By default, `read_csv()` interprets nullable numeric fields as floats, so that NaN can represent null or Not Available. To get around this, I am mapping the optional tweet Ids to the Pandas nullable integer data type (`Int64`).
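
A minimal sketch of the difference, on a made-up sample rather than the real archive. It also shows why floats are a poor fit for tweet Ids: they exceed 2**53, beyond which a 64-bit float cannot represent every integer.

```python
import io
import pandas as pd

# Made-up sample: the second column is empty for most rows, as in the archive.
sample_csv = (
    "tweet_id,in_reply_to_user_id\n"
    "1001,\n"
    "1002,4196983835\n"
    "1003,\n"
)

inferred = pd.read_csv(io.StringIO(sample_csv))
explicit = pd.read_csv(io.StringIO(sample_csv), dtype={"in_reply_to_user_id": "Int64"})

print(inferred["in_reply_to_user_id"].dtype)  # float64, with NaN for the gaps
print(explicit["in_reply_to_user_id"].dtype)  # Int64, with <NA> for the gaps

# A tweet Id from the archive does not survive a round trip through float.
tweet_id = 892420643555336193
print(int(float(tweet_id)) == tweet_id)  # False
```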

In [51]:
feed_data_types = {
    'tweet_id': np.int64,
    'in_reply_to_status_id': 'Int64',
    'in_reply_to_user_id': 'Int64',
    'retweeted_status_id': 'Int64',
    'retweeted_status_user_id': 'Int64',
    'text': str,
    'expanded_urls': str,
    'rating_numerator': np.int32,
    'rating_denominator': np.int32,
    'name': str,
    'doggo': str,
    'floofer': str,
    'pupper': str,
    'puppo': str 
}

feed_date_cols = [
    'timestamp', 
    'retweeted_status_timestamp'
]

In [52]:
tweets_df = pd.read_csv(WE_RATE_DOGS_TWEETS,
                        index_col=['tweet_id'],
                        dtype=feed_data_types,
                        parse_dates=feed_date_cols)

In [53]:
tweets_df = tweets_df.drop(columns=['source'])

In [54]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Now we can double-check the column data types.

In [56]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 892420643555336193 to 666020888022790149
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   in_reply_to_status_id       78 non-null     Int64              
 1   in_reply_to_user_id         78 non-null     Int64              
 2   timestamp                   2356 non-null   datetime64[ns, UTC]
 3   text                        2356 non-null   object             
 4   retweeted_status_id         181 non-null    Int64              
 5   retweeted_status_user_id    181 non-null    Int64              
 6   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 7   expanded_urls               2297 non-null   object             
 8   rating_numerator            2356 non-null   int32              
 9   rating_denominator          2356 non-null   int32              
 10  name                        2

Get Twitter credentials and authenticate:-

In [57]:
creds = read_creds('./config/private/creds.yaml')

In [58]:
consumer_key = creds['consumer_api']['key']
consumer_secret = creds['consumer_api']['secret']

In [59]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

In [60]:
access_token = creds['access_token']['token']
access_secret = creds['access_token']['secret']

In [61]:
auth.set_access_token(access_token, access_secret)
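
The layout of `creds.yaml` is not shown here, but the lookups above imply a nested mapping with `consumer_api` and `access_token` sections. The sketch below checks a parsed credentials dictionary for that shape before authenticating. The `missing_credentials` helper and the placeholder values are hypothetical and not part of `util.config`.

```python
# Hypothetical parsed contents of creds.yaml; the values are placeholders.
example_creds = {
    "consumer_api": {"key": "CONSUMER-KEY", "secret": "CONSUMER-SECRET"},
    "access_token": {"token": "ACCESS-TOKEN", "secret": "ACCESS-SECRET"},
}

# Sections and fields read by the cells above.
REQUIRED_KEYS = {
    "consumer_api": ("key", "secret"),
    "access_token": ("token", "secret"),
}

def missing_credentials(creds):
    """Return dotted paths of required entries that are absent or empty."""
    missing = []
    for section, fields in REQUIRED_KEYS.items():
        values = creds.get(section) or {}
        for field in fields:
            if not values.get(field):
                missing.append(f"{section}.{field}")
    return missing

print(missing_credentials(example_creds))  # []
print(missing_credentials({"consumer_api": {"key": "CONSUMER-KEY"}}))
```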

Set up the Twitter API and load the enrichment data in batches for better performance, as API invocations are subject to significant network latency. Twitter applies rate limiting to its APIs, so it is necessary to throttle the rate at which we make requests, and to retry requests rejected for exceeding the limit. This can be handled automatically by the Tweepy library: setting the `wait_on_rate_limit` flag when setting up the API connection makes it sleep until the limit resets and then retry, and `wait_on_rate_limit_notify` prints a notice while it waits.

Although we are enriching the core enhanced tweets archive, we will initially load the API data into a separate data frame, clean it up as necessary, and then merge from there into the main table.

In [63]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [64]:
batch_size = 100
num_tweets = len(tweets_df.index)
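
To make the batching concrete: with the 2356 tweets reported by `info()` above and a batch size of 100, the index ranges work out as in this small sketch. The `batch_bounds` helper is hypothetical and only illustrates the arithmetic.

```python
def batch_bounds(total, size):
    """Yield (start, end) index pairs covering range(total) in chunks of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)

bounds = list(batch_bounds(2356, 100))
print(len(bounds))  # 24
print(bounds[0])    # (0, 100)
print(bounds[-1])   # (2300, 2356)
```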

In [113]:
def process_batch(batch):
    idxs = []
    retweet_counts = []
    favorite_counts = []
    for status in batch:
        tweet = status._json
        idxs.append(tweet['id'])
        retweet_counts.append(tweet['retweet_count'])
        favorite_counts.append(tweet['favorite_count'])
    return np.array(idxs, dtype=np.int64), np.array([retweet_counts, favorite_counts], dtype=np.int64).T

In [145]:
indices = np.empty((0), dtype=np.int64)
rows = np.empty((0, 2), dtype=np.int64)

In [146]:
for batch_start in range(0, num_tweets, batch_size):
    batch_end = min(batch_start + batch_size, num_tweets)
    batch_tweet_ids = tweets_df.iloc[batch_start:batch_end].index.to_numpy().tolist()
    statuses = api.statuses_lookup(batch_tweet_ids, include_entities=False, map_=False)
    b_indices, b_rows = process_batch(statuses)
    indices = np.concatenate((indices, b_indices), axis=0)
    rows = np.concatenate((rows, b_rows), axis=0)
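
An alternative to concatenating NumPy arrays on every iteration is to collect plain records and build the data frame once at the end. This is only a sketch and not the approach used in this notebook. The `SimpleNamespace` objects are stand-ins for Tweepy statuses, the `to_records` helper is hypothetical, and the counts are made up.

```python
from types import SimpleNamespace

import pandas as pd

def to_records(statuses):
    """Extract the Id and the two counts from each status-like object."""
    return [
        {
            "tweet_id": status._json["id"],
            "retweet_counts": status._json["retweet_count"],
            "favorite_counts": status._json["favorite_count"],
        }
        for status in statuses
    ]

# Stand-ins for two API batches (made-up values).
fake_batches = [
    [
        SimpleNamespace(_json={"id": 1001, "retweet_count": 5, "favorite_count": 50}),
        SimpleNamespace(_json={"id": 1002, "retweet_count": 7, "favorite_count": 70}),
    ],
    [SimpleNamespace(_json={"id": 1003, "retweet_count": 9, "favorite_count": 90})],
]

records = []
for batch in fake_batches:
    records.extend(to_records(batch))

# Build the frame once, indexed by tweet Id.
counts_df = pd.DataFrame.from_records(records, index="tweet_id")
print(counts_df)
```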

In [148]:
enrich_df = pd.DataFrame(index=indices, data=rows, columns=['retweet_counts', 'favorite_counts'], dtype='Int32')
enrich_df.index.name = 'tweet_id'

In [149]:
enrich_tweets_df = tweets_df.merge(enrich_df, how='left', on='tweet_id')
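
A left merge keeps every tweet from the archive, even when the API returned nothing for it. In that case the count columns are simply missing. The sketch below shows this on made-up frames. It merges on the index via `left_index`/`right_index`, which is an illustrative variation on the `on='tweet_id'` call above.

```python
import pandas as pd

# Made-up archive rows, indexed by tweet Id.
tweets = pd.DataFrame(
    {"text": ["first", "second", "third"]},
    index=pd.Index([1001, 1002, 1003], name="tweet_id"),
)

# Made-up API counts; tweet 1002 is absent, as if the API did not return it.
counts = pd.DataFrame(
    {"retweet_counts": [5, 7], "favorite_counts": [50, 70]},
    index=pd.Index([1001, 1003], name="tweet_id"),
).astype("Int32")

merged = tweets.merge(counts, how="left", left_index=True, right_index=True)
print(merged)
print(merged["retweet_counts"].isna().tolist())  # [False, True, False]
```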