# WeRateDogs Twitter Feed

This project looks at various data sources for Tweets from the [WeRateDogs](https://twitter.com/dog_rates) Twitter account, specifically:

1. the `twitter-archive-enhanced.csv` file, which contains the tweet text and is the core data set
1. the Twitter API, which is used to access the original tweets and retrieve missing fields such as the retweet and favorite counts
1. an image prediction file containing the top 3 predictions for each of the (up to 4) dog pictures in the tweet

Having gathered the data, we assess, clean and analyse it.


---

## GATHER DATA

We use a number of data assets including remote files on web servers, and JSON payloads returned by the Twitter API.

In [1]:
WE_RATE_DOGS_TWEETS_PATH = 'data/twitter-archive-enhanced.csv'
DOG_BREED_PREDICTIONS_SOURCE_URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
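
A remote file such as the image predictions TSV can be fetched with the standard library alone. The following is a minimal sketch, not necessarily how this notebook downloads the file; the helper name `download_file` and the destination path in the usage note are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Stream the resource at `url` into `dest` and return `dest`.

    `download_file` is a hypothetical helper, not part of the original notebook.
    """
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response body to disk in chunks rather than reading it all at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

It could then be called as `download_file(DOG_BREED_PREDICTIONS_SOURCE_URL, Path('data/image-predictions.tsv'))`, where the local path is an assumed location alongside the archive CSV.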

### Read the CSV data

The Pandas `read_csv()` function is quite versatile when loading data: it can be configured to handle different date formats, numeric data types, not available (NA) markers, etc. Getting this right upfront can save time, but it requires eyeballing the raw data files first. For this we can use command line tools like `head` and `tail`, or alternatively Excel, which allows column headings to be frozen, data to be sorted and searched, etc.
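If the command line tools are not to hand, the same quick look at the first and last rows can be done from Python. This is a small sketch using only the standard library; the function names `head_lines` and `tail_lines` are made up for this example.

```python
from collections import deque
from itertools import islice


def head_lines(path, n=5):
    """Return the first `n` lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def tail_lines(path, n=5):
    """Return the last `n` lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # A bounded deque keeps only the most recent `n` lines while iterating.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

For example, `head_lines('data/twitter-archive-enhanced.csv', 3)` would show the header row and the first two records. Note that a quoted field containing a newline spans several physical lines, so this is only a rough preview.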

Having looked at the raw data, we make the following observations:

1. tweet Ids are large integers, so we need to select an appropriate integer data type so that no precision is lost
1. some tweet Id columns, e.g. `in_reply_to_status_id` and `in_reply_to_user_id`, are stored as floats, with NaN used as the Not Available marker; these need to be converted to integers
1. timestamps are close to ISO 8601 format, and are in GMT
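
The concern about precision can be checked directly. A 64-bit float can only represent every integer exactly up to 2**53, and tweet Ids are larger than that. The sketch below uses an 18-digit Id purely for illustration.

```python
# An 18-digit tweet Id, used here only as an illustrative value.
tweet_id = 892420643555336193

# Largest magnitude up to which a 64-bit float can represent every integer exactly.
FLOAT_EXACT_LIMIT = 2 ** 53

# Convert to float and back, as happens when a column is inferred as float64.
round_trip = int(float(tweet_id))

exceeds_float_limit = tweet_id > FLOAT_EXACT_LIMIT
survives_round_trip = round_trip == tweet_id
fits_in_int64 = tweet_id < 2 ** 63

print(exceeds_float_limit)  # True
print(survives_round_trip)  # False: the float conversion changed the Id
print(fits_in_int64)        # True: a 64-bit integer holds it exactly
```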

Actions taken to address above observations:

* convert floating point tweet Ids to 64-bit integers, retaining the Not Available representation
* specifically tell Pandas which columns are dates
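
To illustrate what "close to ISO 8601" can mean, the sketch below parses a timestamp written with a space instead of the `T` separator and a `+0000` offset. Both the sample value and the format string are assumptions about the file's layout, not taken from the data shown here.

```python
from datetime import datetime, timedelta

# Assumed layout: date, space, time, space, UTC offset without a colon.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S %z"

# Illustrative timestamp string in the assumed layout.
raw = "2017-08-01 16:23:56 +0000"

parsed = datetime.strptime(raw, ASSUMED_FORMAT)

print(parsed.isoformat())                 # 2017-08-01T16:23:56+00:00
print(parsed.utcoffset() == timedelta(0))  # True: the value is timezone-aware, at UTC
```

Listing the date columns explicitly for `read_csv()` asks Pandas to do this parsing at load time rather than leaving the values as strings.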

In [2]:
import yaml
import tweepy
import json

In [3]:
import numpy as np
import pandas as pd

In [4]:
pd.__version__

'1.1.3'

Load the enhanced Twitter archive, using explicit data types for fields instead of letting Pandas infer them. The [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) defines the data types for the Twitter fields, to which I add the data types for the "enhanced" fields.

To get around the fact that `read_csv()` interprets nullable numeric fields as floats (so that NaN can represent null), I map the optional tweet Ids to the Pandas nullable integer data type (`Int64`).
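The difference between the inferred and the explicit data type can be seen on a tiny in-memory CSV. This is a sketch with made-up, deliberately small Ids; it assumes a Pandas version that supports the `Int64` extension type.

```python
import io

import pandas as pd

# Two made-up rows: the first tweet is not a reply, so its reply Id is empty.
SAMPLE_CSV = (
    "tweet_id,in_reply_to_status_id\n"
    "1001,\n"
    "1002,1001\n"
)

# Let Pandas infer the types: the column with a missing value becomes float64.
inferred_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Ask for the nullable integer type instead: missing values become <NA>.
explicit_df = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    dtype={"tweet_id": "int64", "in_reply_to_status_id": "Int64"},
)

print(inferred_df["in_reply_to_status_id"].dtype)  # float64
print(explicit_df["in_reply_to_status_id"].dtype)  # Int64
print(explicit_df["in_reply_to_status_id"].isna().tolist())  # [True, False]
```

With real, full-size tweet Ids it is worth checking a few loaded values against the raw file, since how large integers are handled alongside missing values may vary between Pandas versions.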

In [None]:
feed_data_types = {
    'tweet_id': np.int64,
    'in_reply_to_status_id': 'Int64',
    'in_reply_to_user_id': 'Int64',
    'retweeted_status_id': 'Int64',
    'retweeted_status_user_id': 'Int64',
    'text': 'string',
    'expanded_urls': 'string',
    'rating_numerator': np.int32,
    'rating_denominator': np.int32,
    'name': 'string',
    'doggo': 'string',
    'floofer': 'string',
    'pupper': 'string',
    'puppo': 'string'
}

feed_date_cols = [
    'timestamp', 
    'retweeted_status_timestamp'
]

In [None]:
tweets_df = pd.read_csv(WE_RATE_DOGS_TWEETS_PATH,
                        index_col=['tweet_id'],
                        dtype=feed_data_types,
                        parse_dates=feed_date_cols)
tweets_df.shape
