# WeRateDogs Twitter Feed

This project looks at various data sources for Tweets from the [WeRateDogs](https://twitter.com/dog_rates) Twitter account, specifically:

1. the `twitter-archive-enhanced.csv` file, which contains the tweet text and various related identifiers
1. the Twitter API, used to access the original tweets and retrieve missing fields such as the retweet and favorite counts
1. an image prediction file containing the top 3 predictions for each of the (up to 4) dog pictures in the tweet


The `twitter-archive-enhanced.csv` file is the master source for the set of tweets included in this project. This dataset first needs to be cleaned. It will then be enriched with additional attributes obtained through the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), and finally with data from the breed predictions file, which includes the top 3 predictions based on the dog's images. The data will be mapped into Pandas DataFrames following the Tidy Data principles.
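
As one illustration of what Tidy Data could mean here, the four dog stage columns (`doggo`, `floofer`, `pupper`, `puppo`) can be folded into a single `stage` variable. This is only a sketch on made-up rows: it assumes a matched stage is stored as the stage name itself, and the names `wide`, `long` and `tidy` are hypothetical.

```python
import pandas as pd

# Made-up rows in the shape of the archive's stage columns.
# Assumption: a matched stage holds the stage name, otherwise the string 'None'.
wide = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Unpivot the four columns, then keep only the cells where a stage was found.
long = wide.melt(id_vars='tweet_id', value_vars=stage_cols,
                 var_name='stage_col', value_name='stage')
long = long[long['stage'] != 'None']

# Join back so tweets without any stage are kept, with a missing value.
tidy = wide[['tweet_id']].merge(long[['tweet_id', 'stage']],
                                on='tweet_id', how='left')
```

A tweet that mentions two stages would yield two rows in `tidy`, so such cases would need a decision during cleaning.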

---

## Gather data

The data assets used are listed below. Some are local files, downloaded into the `data` directory, some are remote files on web servers, and some are sourced via an API. Where there is enrichment of existing data, I will add the new attributes to the base DataFrame. Standalone data such as the breed predictions I will load into a separate DataFrame, so as to later reconcile back to the tweets.
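
The enrichment and reconciliation described above might look roughly like the following. It is a minimal sketch on made-up frames, assuming `tweet_id` is the join key; the frame names (`tweets`, `stats`, `predictions`) and the `p1` column are illustrative only.

```python
import pandas as pd

# Made-up base tweets and API statistics; tweet 2 has no statistics.
tweets = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
stats = pd.DataFrame({'tweet_id': [1, 3],
                      'retweet_count': [10, 30],
                      'favorite_count': [100, 300]})

# Enrichment: a left merge keeps every base tweet and adds the new attributes.
# validate= raises if either side unexpectedly has duplicate tweet ids.
enriched = tweets.merge(stats, on='tweet_id', how='left',
                        validate='one_to_one')

# Standalone data stays in its own frame and is reconciled by id later.
predictions = pd.DataFrame({'tweet_id': [1, 4], 'p1': ['pug', 'collie']})
unmatched = predictions[~predictions['tweet_id'].isin(tweets['tweet_id'])]
```

Tweets with no matching statistics end up with missing values in the new columns rather than being dropped.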

In [None]:
!ls ./data

In [None]:
WE_RATE_DOGS_TWEETS = 'data/twitter-archive-enhanced.csv'
DOG_BREED_PREDICTIONS_SNAPSHOT = 'data/image_predictions.tsv'
DOG_BREED_PREDICTIONS_SOURCE_URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
TWEET_STATISTICS_SNAPSHOT = 'data/tweet_json.txt'
TWEET_STATISTICS_API_URL = ''

It is worth noting that the Pandas `read_csv()` function is quite versatile when loading data, and can be configured to handle different date formats, numeric data types, not available (NA) value markers, etc. But to benefit from such functionality we need to eyeball the raw data first.

Let's look at the first 8 records, and 8 records from near the end of the file:
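
To make those options concrete, here is a small sketch that reads an in-memory CSV with explicit data types, a parsed date column and a custom NA marker. The rows are made up for illustration (they only mimic the archive's layout), and treating the string `None` as missing is an assumption.

```python
import io
import pandas as pd

# Made-up rows that mimic the archive's layout.
csv_text = (
    "tweet_id,timestamp,name,rating_numerator\n"
    "892420643555336193,2017-08-01 16:23:56 +0000,Rex,13\n"
    "892177421306343426,2017-08-01 00:17:27 +0000,None,12\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'tweet_id': 'int64', 'rating_numerator': 'int64'},  # explicit types
    parse_dates=['timestamp'],                                  # parse as datetimes
    na_values=['None'],                                         # extra NA marker
)
first_timestamp = df['timestamp'].iloc[0]
```

Passing `na_values` explicitly avoids depending on which markers a given Pandas version treats as missing by default.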

In [None]:
!head -n 11 ./data/twitter-archive-enhanced.csv

In [None]:
!tail -n 1940 ./data/twitter-archive-enhanced.csv | head -n 8

A few observations:

1. the 1st line lists the column names
1. tweet Ids are very large integers
1. some Id columns appear as floating point numbers, e.g. `in_reply_to_status_id` and `in_reply_to_user_id`, as many of their values are missing (NaN)
1. time stamps are close to ISO 8601 (they are missing the 'T' separator) and are GMT (+00:00 offset)
1. strings are wrapped in double quotes (") and embedded quotes appear as two contiguous quotes ("")
1. some strings, such as the tweets with links to a GoFundMe page, have 2 embedded new lines within the double quotes
1. dog names and stages (last 5 columns) are extracted where found, otherwise the value _None_ is placed in those columns
1. the `source` column in every row shown is an HTML anchor linking to http://twitter.com/download/iphone, which adds little value for analysis
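
The floating point observation matters because a 64-bit float cannot represent every integer of this size exactly. The sketch below, on made-up rows, shows the digits changing under default type inference. It also shows one possible workaround, reading the Id as text; that is an assumption on my part, not necessarily the approach taken later.

```python
import io
import pandas as pd

# Made-up rows: one reply Id missing, one present.
csv_text = (
    "tweet_id,in_reply_to_status_id\n"
    "892420643555336193,\n"
    "892177421306343426,892420643555336193\n"
)

# Default inference: the blank cell forces the whole column to float64.
inferred = pd.read_csv(io.StringIO(csv_text))
as_float = inferred['in_reply_to_status_id'].iloc[1]
round_tripped = int(as_float)  # no longer the Id that was in the file

# Reading the Id as text keeps every digit; the blank stays missing.
as_text = pd.read_csv(io.StringIO(csv_text),
                      dtype={'in_reply_to_status_id': str})
exact = as_text['in_reply_to_status_id'].iloc[1]
```

Comparing `round_tripped` with the value in the file shows that the low-order digits are lost, which would break any join on that Id.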


In [None]:
import yaml
import tweepy

In [None]:
import numpy as np
import pandas as pd

In [None]:
from util.config import read_creds

Load the enhanced Twitter archive, using explicit data types for fields instead of letting Pandas infer them. The [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) data dictionary defines the data types for the Twitter fields. To those I will add data types for the "enhanced" fields.

In [None]:
feed_data_types = {
    'tweet_id': np.int64,
#     'in_reply_to_status_id': np.int64,
#     'in_reply_to_user_id': np.int64,
#     'source': str,
#     'text': str,
#     'retweeted_status_id': np.int64,
#     'retweeted_status_user_id': np.int64,
#     'expanded_urls': str,
#     'rating_numerator': int,
#     'rating_denominator': int,
#     'name': str,
#     'doggo': str,
#     'floofer': str,
#     'pupper': str,
#     'puppo': str 
}
feed_date_cols = ['timestamp', 'retweeted_status_timestamp']

In [None]:
tweets_df = pd.read_csv(WE_RATE_DOGS_TWEETS,
                       dtype=feed_data_types,
                       parse_dates=feed_date_cols)
# tweets_df = pd.read_csv(WE_RATE_DOGS_TWEETS)

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
display(tweets_df)

In [None]:
tweets_df.describe()

Get Twitter credentials and authenticate:

In [None]:
creds = read_creds('./config/private/creds.yaml')

In [None]:
consumer_key = creds['consumer_api']['key']
consumer_secret = creds['consumer_api']['secret']

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

In [None]:
access_token = creds['access_token']['token']
access_secret = creds['access_token']['secret']

In [None]:
auth.set_access_token(access_token, access_secret)
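
Once authenticated, the retweet and favorite counts could be collected along the following lines. This is a hedged sketch, not the project's actual retrieval code: `fetch_counts` and `FakeApi` are hypothetical names, and the stub stands in for a real Tweepy client so the example runs offline. Catching a broad `Exception` is a simplification, since the exception class name differs between Tweepy releases.

```python
from types import SimpleNamespace


def fetch_counts(api, tweet_ids):
    """Return (records, failed_ids) for the given tweet ids.

    `api` is any object with a Tweepy-style get_status(tweet_id) method.
    """
    records, failed = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id)
        except Exception:  # simplification: a real run would catch Tweepy's error class
            failed.append(tweet_id)
            continue
        records.append({
            'tweet_id': tweet_id,
            'retweet_count': status.retweet_count,
            'favorite_count': status.favorite_count,
        })
    return records, failed


class FakeApi:
    """Hypothetical offline stand-in for a Tweepy API client."""

    def __init__(self, statuses):
        self._statuses = statuses

    def get_status(self, tweet_id):
        # A KeyError here mimics a tweet that can no longer be retrieved.
        return self._statuses[tweet_id]


fake = FakeApi({1: SimpleNamespace(retweet_count=10, favorite_count=100)})
records, failed = fetch_counts(fake, [1, 2])
```

With real credentials, `tweepy.API(auth)` would take the place of `FakeApi`, and `records` could be loaded into a DataFrame for merging with the tweets. Keeping the failed ids separate makes it easy to see which tweets could not be enriched.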