# WeRateDogs Twitter Feed

This project looks at various data sources for Tweets from the [WeRateDogs](https://twitter.com/dog_rates) Twitter account, specifically:

1. the `twitter-archive-enhanced.csv` file, which contains the tweet text and is the core data set
1. the Twitter API is used to access the original tweets to retrieve missing fields such as the retweet and favorite counts
1. an image prediction file containing the top 3 predictions for each of the (up to 4) dog pictures in the tweet


---

## GATHER DATA

We use a number of data assets including local files, remote files on web servers, and JSON payloads returned by the Twitter API.

In [1]:
!ls ./data



In [2]:
WE_RATE_DOGS_TWEETS_PATH = 'data/twitter-archive-enhanced.csv'
DOG_BREED_PREDICTIONS_SOURCE_URL = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

### Read the CSV data

The Pandas `read_csv()` function is quite versatile when loading data, and can be configured to handle different date formats, numeric data types, not available (NA) markers, etc. Getting this right upfront can save time, but requires the raw data in the files to be eyeballed first. For this we can use command line tools like `head` and `tail`, or Excel, which allows column headings to be frozen, data to be sorted and searched, etc.

In [3]:
!head -n 11 {WE_RATE_DOGS_TWEETS_PATH}



In [4]:
!tail -n 1940 {WE_RATE_DOGS_TWEETS_PATH} | head -n 8



Having looked at the raw data, we make the following observations:

1. tweet Ids are large integers, so we need to select an appropriate integer data type so that no precision is lost
1. some tweet Id columns, e.g. `in_reply_to_status_id` and `in_reply_to_user_id`, are stored as floats, with NaN used as the Not Available marker; these need to be converted to integers
1. time stamps are close to ISO 8601 format, and are in GMT
1. some strings have embedded new line characters; `read_csv()` handles this, but it throws off commands like `wc -l`

Actions taken to address above observations:

* convert floating point tweet Ids to 64-bit integers, retaining the Not Available representation
* specifically tell Pandas which columns are dates
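To see why the float representation matters, here is a minimal sketch that round-trips one 18-digit value through a 64-bit float. The Id used is purely illustrative (an assumption, not read from the archive file).

```python
import numpy as np

# Illustrative 18-digit Id (an assumption, not taken from the archive file)
tweet_id = 892420643555336193

# A float64 has a 53-bit mantissa, so it cannot hold every integer of this size
as_float = np.float64(tweet_id)
round_trip = int(as_float)

# A 64-bit integer stores the same value exactly
as_int = np.int64(tweet_id)

print(round_trip == tweet_id)   # compares the float round trip with the original
print(int(as_int) == tweet_id)  # compares the integer round trip with the original
```

The float round trip changes the value because an odd integer of this magnitude falls between two representable floats.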

In [5]:
import yaml
import tweepy
import json


In [None]:
import numpy as np
import pandas as pd

Load the enhanced Twitter archive, using explicit data types for fields, instead of letting Pandas infer them. The [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) defines the data types for the Twitter fields, to which I add the data types for the "enhanced" fields.

By default, `read_csv()` interprets nullable numeric fields as floats (so that NaN can represent null or Not Available). To get around this, I am mapping optional tweet Ids to the Pandas nullable integer data type.
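The effect of the explicit mapping can be sketched on a toy CSV. The Ids below are small made-up values and the column subset is an assumption chosen for brevity; only the time stamp layout imitates the archive.

```python
import io
import pandas as pd

# Toy CSV with small, made-up Ids; the second row has no reply Id
raw = (
    "tweet_id,in_reply_to_status_id,timestamp\n"
    "1001,2001,2017-08-01 16:23:56 +0000\n"
    "1002,,2017-08-01 00:17:27 +0000\n"
)

# Default inference: the missing value forces the column to float64
inferred_df = pd.read_csv(io.StringIO(raw))

# Explicit mapping: nullable Int64 keeps integers and marks the gap as <NA>
typed_df = pd.read_csv(io.StringIO(raw),
                       dtype={'tweet_id': 'int64',
                              'in_reply_to_status_id': 'Int64'},
                       parse_dates=['timestamp'])

print(inferred_df['in_reply_to_status_id'].dtype)
print(typed_df.dtypes)
```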

In [None]:
feed_data_types = {
    'tweet_id': np.int64,
    'in_reply_to_status_id': 'Int64',
    'in_reply_to_user_id': 'Int64',
    'retweeted_status_id': 'Int64',
    'retweeted_status_user_id': 'Int64',
    'text': 'string',
    'expanded_urls': 'string',
    'rating_numerator': np.int32,
    'rating_denominator': np.int32,
    'name': 'string',
    'doggo': 'string',
    'floofer': 'string',
    'pupper': 'string',
    'puppo': 'string'
}

feed_date_cols = [
    'timestamp', 
    'retweeted_status_timestamp'
]

In [None]:
tweets_df = pd.read_csv(WE_RATE_DOGS_TWEETS_PATH,
                        index_col=['tweet_id'],
                        dtype=feed_data_types,
                        parse_dates=feed_date_cols)
tweets_df.shape

### Examine the core data set

The first discrepancy we note is that, according to the project motivation document, the main "archive contains basic tweet data for all 5000+ of their tweets". That is clearly not the case: having loaded the archive, the number of tweets is less than half that. As this is the master data set we have been provided with, and it has already been enhanced, this is the data we will go with.

Just to double check this row count, we will run a line count on the input file, which should roughly match the number of rows in the data frame (plus one line for the header). Any remaining discrepancy between the counts is due to the embedded new line (NL) characters in the tweet text, as previously mentioned.
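A minimal sketch of why the two counts can differ, using a made-up two-row CSV in which one quoted text field contains a new line:

```python
import io
import pandas as pd

# Made-up CSV: the first tweet text spans two physical lines inside quotes
raw = 'tweet_id,text\n1,"Good dog\nwould pet"\n2,"Also good"\n'

toy_df = pd.read_csv(io.StringIO(raw))

# Counting new line characters is what `wc -l` does
line_count = raw.count('\n')
row_count = len(toy_df)

print(line_count, row_count)
```

Here the line count includes the header and the extra physical line, so it is larger than the number of parsed rows.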

In [None]:
!wc -l {WE_RATE_DOGS_TWEETS_PATH}

In [None]:
# When displaying Pandas data frames in the Jupyter notebook, raise or remove any limits on the number of rows and columns displayed
# Care is needed here, especially with row counts, as very large data sets may get transferred into the browser

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Now we can double check the column data types, against the data type mapping provided to `read_csv()`.

In [None]:
tweets_df.info()

### Use Twitter API to enrich the core data set

Next we want to use the Twitter API to retrieve the original tweets, so that we can enrich our enhanced tweets data with the missing attributes previously identified (`retweet_counts`, `favorite_counts`).

Having registered with Twitter as a developer, and obtained credentials and keys, we stored these in a private project directory and configuration file (which are excluded from our git repo, and thus won't be visible online on [GitHub](https://github.com/benvens-udacity/wrangle-and-analyze-data/blob/main/wrangle_act.ipynb)).

We now use those credentials to authenticate with Twitter for API access.

In [None]:
def read_creds(conf_path):
    with open(conf_path, 'r') as cf:
        config = yaml.load(cf, Loader=yaml.FullLoader)
        return config

In [None]:
creds = read_creds('./config/private/creds.yaml')
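For orientation, here is a sketch of the layout such a credentials file could have. The key names mirror the ones this notebook reads from `creds`; the values are placeholders, and `yaml.safe_load` is swapped in here because it only builds plain Python objects.

```python
import yaml

# Hypothetical file contents: key names follow this notebook, values are placeholders
example_yaml = """
consumer_api:
  key: PLACEHOLDER_KEY
  secret: PLACEHOLDER_SECRET
access_token:
  token: PLACEHOLDER_TOKEN
  secret: PLACEHOLDER_TOKEN_SECRET
"""

# safe_load parses the text into nested dictionaries of strings
example_creds = yaml.safe_load(example_yaml)

print(example_creds['consumer_api']['key'])
print(sorted(example_creds['access_token']))
```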

In [None]:
consumer_key = creds['consumer_api']['key']
consumer_secret = creds['consumer_api']['secret']

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

In [None]:
access_token = creds['access_token']['token']
access_secret = creds['access_token']['secret']

In [None]:
auth.set_access_token(access_token, access_secret)

Next we will load the enrichment data in batches, for better performance, as API invocations are subject to significant network latency. Twitter also applies rate limiting to their APIs, so it is necessary to throttle the rate at which we make requests, and to retry any failed requests. Luckily, the Tweepy library can handle the waiting automatically when the `wait_on_rate_limit` flag is set on the API connection; the companion `wait_on_rate_limit_notify` flag only prints a message while Tweepy waits.

Although we are enriching the core enhanced tweets archive, we will initially load the API data into a separate data frame, clean up as necessary, and then merge into the main table.
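The batching idea can be sketched without touching the network. `iter_batches` is a hypothetical helper introduced only for this illustration, and the Ids are made up.

```python
def iter_batches(ids, batch_size):
    """Yield consecutive slices holding at most batch_size Ids each."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

# 250 made-up Ids split into batches of 100
example_ids = list(range(250))
batch_lengths = [len(batch) for batch in iter_batches(example_ids, 100)]

print(batch_lengths)
```

The last batch is simply shorter, so no Id is dropped or requested twice.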

In [None]:
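# wait_on_rate_limit makes Tweepy (3.x) sleep until the rate limit resets;
# wait_on_rate_limit_notify only prints a message while it waits
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)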

In [None]:
def process_batch(batch):
    idxs = []
    retweet_counts = []
    favorite_counts = []
    for status in batch:
        tweet = status._json
        idxs.append(tweet['id'])
        retweet_counts.append(tweet['retweet_count'])
        favorite_counts.append(tweet['favorite_count'])
    return np.array(idxs, dtype=np.int64), np.array([retweet_counts, favorite_counts], dtype=np.int64).T
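To check what `process_batch` returns without calling the API, it can be fed stand-in objects. The `SimpleNamespace` instances below are an assumption that imitates only the `_json` attribute used above, with made-up counts.

```python
from types import SimpleNamespace

import numpy as np

def process_batch(batch):
    """Collect Ids and (retweet, favorite) counts from a batch of statuses."""
    idxs = []
    retweet_counts = []
    favorite_counts = []
    for status in batch:
        tweet = status._json
        idxs.append(tweet['id'])
        retweet_counts.append(tweet['retweet_count'])
        favorite_counts.append(tweet['favorite_count'])
    return (np.array(idxs, dtype=np.int64),
            np.array([retweet_counts, favorite_counts], dtype=np.int64).T)

# Stand-ins for status objects, carrying only the fields that are read
fake_batch = [
    SimpleNamespace(_json={'id': 11, 'retweet_count': 3, 'favorite_count': 7}),
    SimpleNamespace(_json={'id': 12, 'retweet_count': 0, 'favorite_count': 1}),
]

fake_idxs, fake_counts = process_batch(fake_batch)
print(fake_idxs.shape, fake_counts.shape)
```

Each row of the second array holds the retweet count followed by the favorite count for the Id at the same position in the first array.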

In [None]:
indices = np.empty((0), dtype=np.int64)
rows = np.empty((0, 2), dtype=np.int64)

In [None]:
batch_size = 100
num_tweets = len(tweets_df.index)

In [None]:
%%time 
for batch_start in range(0, num_tweets, batch_size):
    batch_end = min(batch_start + batch_size, num_tweets)
    batch_tweet_ids = tweets_df.iloc[batch_start:batch_end].index.to_numpy().tolist()
    statuses = api.statuses_lookup(batch_tweet_ids, include_entities=False, map_=False)
    b_indices, b_rows = process_batch(statuses)
    indices = np.concatenate((indices, b_indices), axis=0)
    rows = np.concatenate((rows, b_rows), axis=0)

In [None]:
tweet_counts_df = pd.DataFrame(index=indices, data=rows, 
                               columns=['retweet_counts', 'favorite_counts'], 
                               dtype='Int32').sort_index()
tweet_counts_df.index.name = 'tweet_id'
tweet_counts_df.shape

In [None]:
# Left join: keep every archive row, even if the API returned no status for it
enriched_tweets_df = tweets_df.merge(tweet_counts_df, how='left', on='tweet_id')
enriched_tweets_df.shape

Next we review the structure of the merged data frame. Because this is a left join, the number of rows should be unchanged, and the 2 additional columns are appended at the end. Any NA values in these columns most likely mark tweets for which the API returned no status (for example, tweets that have since been deleted); a tweet that was never retweeted or favorited would be reported with a count of 0 instead.
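How a left join and an inner join differ in row count can be sketched on toy frames. All names and values below are made up for the illustration.

```python
import pandas as pd

# Three toy tweets, but counts are available for only two of them
toy_tweets = pd.DataFrame({'text': ['a', 'b', 'c']},
                          index=pd.Index([1, 2, 3], name='tweet_id'))
toy_counts = pd.DataFrame({'retweet_counts': [5, 9]},
                          index=pd.Index([1, 3], name='tweet_id'),
                          dtype='Int32')

# Left join keeps every tweet and fills the gap with NA
left_df = toy_tweets.merge(toy_counts, how='left', on='tweet_id')

# Inner join keeps only tweets present in both frames
inner_df = toy_tweets.merge(toy_counts, how='inner', on='tweet_id')

print(len(left_df), len(inner_df))
print(left_df['retweet_counts'].isna().sum())
```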

In [None]:
enriched_tweets_df.info()

### Read in and normalise image data set

Finally we need to load the image predictions data, so we can tidy it. We will read this data from the CloudFront URL, as opposed to a local file, to ensure we get the most up-to-date version.

In [None]:
img_preds_data_types = {
    'tweet_id': 'Int64',
    'jpg_url': 'string',
    'img_num': np.int32,
    'p1': 'string',
    'p1_conf': np.float32,
    'p1_dog': bool,
    'p2': 'string',
    'p2_conf': np.float32,
    'p2_dog': bool,
    'p3': 'string',
    'p3_conf': np.float32,
    'p3_dog': bool
}

In [None]:
# Load the TSV (not CSV) records, and tell read_csv() to use a tab as the field separator

img_preds_df = pd.read_csv(DOG_BREED_PREDICTIONS_SOURCE_URL,
                           index_col=['tweet_id'],
                           sep='\t', 
                           dtype=img_preds_data_types)
img_preds_df.shape

We can now briefly review the structure of this data frame: 

1. each row refers to an image
1. each image is numbered, as it is one of up to 4 images that may be associated with each tweet
1. we then have the top 3 predictions for the image

Each prediction consists of the following information:

1. a predicted label or class (e.g.: the dog breed) that describes that image
1. a confidence factor associated with the previous prediction, in the range 0.0 -> 1.0
1. a boolean indicator confirming if the predicted label is a dog breed, or some other object

Looking at the confidence factors for predictions p1 - p3, they appear to be listed in most confident to least confident order. Therefore we will use the column name numeric suffix to generate a ranking column, which we can later sort by (to preserve this decreasing confidence order).

The boolean indicator confirms that the image classifier used to generate these predictions was trained on a broad set of images, only a subset of which are dog images labelled with their corresponding dog breed. As a result, the classifier may occasionally interpret a dog image as an object other than a dog.

In [None]:
img_preds_df.info()

---

## ASSESS DATA

We have already assessed the core tweet data, and fixed some issues at load time. Let's now look at data quality and structural issues.

### Remedy data quality issues

Some of the issues we now want to remedy are: 

1. dog names and stages are extracted when found, otherwise the value _None_ is used
1. dog stages are predefined, so storing the stage name in the relevant stage column is redundant information; all we require is a binary marker
1. the 'source' column is an HTML anchor with a link to http://twitter.com/download/iphone which is repeated for all rows, therefore we will drop the `source` column

So next we convert those dog stage columns into a boolean data type.

In [None]:
# Convert dog stage columns into a boolean data type

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
enriched_tweets_df[stage_cols] = enriched_tweets_df[stage_cols].apply(lambda c: c.to_numpy() != 'None')
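A toy illustration of the same idea, comparing each cell against the `'None'` marker. The two-column frame and its values are made up, and `DataFrame.ne()` is used here as an equivalent element-wise comparison.

```python
import pandas as pd

# Made-up frame using the same 'None' marker convention as the archive
toy_df = pd.DataFrame({'doggo': ['doggo', 'None'],
                       'pupper': ['None', 'None']})

# True wherever a stage name was extracted, False for the 'None' marker
toy_flags = toy_df[['doggo', 'pupper']].ne('None')

print(toy_flags)
```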

In [None]:
# Confirm that the relevant dog stage columns are booleans (not strings)

enriched_tweets_df.info()

In [None]:
# Drop uninformative 'source' column

enriched_tweets_df = enriched_tweets_df.drop(columns=['source'])
enriched_tweets_df.shape

### Remedy structural issues

The `expanded_urls` column, which stores the full length URL for shortened URLs that appear in the tweet, has a couple of issues, both quality and structure:

1. it can store multiple URLs per cell, as a comma separated string
1. the same URL can appear multiple times

Looking at the tweet text, it is not obvious why the same URL can appear more than once, so instead of storing a repetition count we will just drop any duplicates. Since several URLs can belong to the same tweet (a many-to-one relationship), we will store them in a separate data frame.

In [None]:
# Pull out rows containing one or more expanded URLs, as some rows have none

expanded_urls_ser = enriched_tweets_df.loc[enriched_tweets_df['expanded_urls'].notna(), 'expanded_urls']

In [None]:
# Nested list comprehension to split multiple URL strings on comma separator, then create [tweet Id, URL] tuples

expanded_url_tuples = [(ix, url) for ix, urls in expanded_urls_ser.items() for url in urls.split(',')]

In [None]:
expanded_url_df = pd.DataFrame(expanded_url_tuples, columns=['tweet_id', 'expanded_url'])

In [None]:
# Now drop duplicates and make 'tweet_id' the index for consistency with other data frames

expanded_url_df = expanded_url_df.drop_duplicates().set_index('tweet_id')

In [None]:
# Finally drop the original expanded_urls column

enriched_tweets_df = enriched_tweets_df.drop(columns='expanded_urls')

In [None]:
# Note that the index can contain duplicate entries (where a tweet has more than one URL)
# We compare duplicate and non-duplicate counts below

expanded_url_df.info()

In [None]:
len(expanded_url_df.index), len(expanded_url_df.index.unique())
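As an alternative sketch of the URL normalisation above, the split and de-duplication can also be expressed with `Series.str.split()` and `Series.explode()`. The toy series and its example URLs are made up.

```python
import pandas as pd

# Made-up series: one tweet repeats a URL, one has two URLs, one has none
toy_urls = pd.Series(['http://a.example/1,http://a.example/1',
                      'http://b.example/1,http://b.example/2',
                      None],
                     index=pd.Index([1, 2, 3], name='tweet_id'),
                     name='expanded_urls')

toy_url_df = (toy_urls.dropna()
              .str.split(',')      # one list of URLs per tweet
              .explode()           # one row per URL, index repeated
              .reset_index()       # make tweet_id a column so it counts in de-duplication
              .drop_duplicates()
              .set_index('tweet_id'))

print(toy_url_df)
```

Resetting the index before `drop_duplicates()` matters, because duplicates are detected on column values only; otherwise identical URLs from different tweets would be collapsed.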

Finally we will tidy up the image predictions data. By applying the Tidy Data principles, we will remove variables (the prediction number) from the relevant column names. 

We are repeating the same image URL and image number, on each of the predictions, just for simplicity (the alternative is to split the image data into 2 data frames). 

In [None]:
def slice_pred_cols(df, numeric):
    # Build one long-format slice for prediction rank `numeric`, reading from
    # the data frame passed in (rather than from a global variable)
    preds_df = df[['jpg_url', 'img_num']]
    preds_df = preds_df.assign(pred_rank=numeric,
                               pred_class=df[f'p{numeric}'],
                               pred_confidence=df[f'p{numeric}_conf'],
                               pred_is_dog=df[f'p{numeric}_dog'])
    return preds_df

In [None]:
len(img_preds_df.index)

In [None]:
preds1_df = slice_pred_cols(img_preds_df, 1)
preds2_df = slice_pred_cols(img_preds_df, 2)
preds3_df = slice_pred_cols(img_preds_df, 3)
image_predictions_df = pd.concat([preds1_df, preds2_df, preds3_df]).sort_values(by=['tweet_id', 'pred_rank'])

As we are removing variables from the column names, we will end up with more rows. Since all top 3 predictions are always generated, we will have exactly 3 times the number of rows we started with, as reflected by the row counts above and below.
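The row count claim can be checked on a toy wide frame that follows the same column naming. All values below are made up; the reshaping mirrors the slice-and-concatenate approach used in this section.

```python
import pandas as pd

# Made-up wide frame: one row per image, three predictions spread across columns
wide_df = pd.DataFrame({
    'jpg_url': ['u1.jpg', 'u2.jpg'],
    'img_num': [1, 2],
    'p1': ['pug', 'seat_belt'], 'p1_conf': [0.90, 0.60], 'p1_dog': [True, False],
    'p2': ['bull_mastiff', 'pug'], 'p2_conf': [0.05, 0.30], 'p2_dog': [True, True],
    'p3': ['boxer', 'web_site'], 'p3_conf': [0.01, 0.05], 'p3_dog': [True, False],
}, index=pd.Index([10, 20], name='tweet_id'))

# One slice per prediction rank, with the rank moved out of the column names
parts = []
for rank in (1, 2, 3):
    parts.append(wide_df[['jpg_url', 'img_num']].assign(
        pred_rank=rank,
        pred_class=wide_df[f'p{rank}'],
        pred_confidence=wide_df[f'p{rank}_conf'],
        pred_is_dog=wide_df[f'p{rank}_dog']))

long_df = pd.concat(parts).sort_values(by=['tweet_id', 'pred_rank'])

print(len(wide_df), len(long_df))
```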

In [None]:
image_predictions_df.info()

---

## CLEAN DATA

We have already performed a few cleaning tasks, including correcting data types and eliminating redundant data.

### Address specific data cleaning requirements

Next we will clean up the data as specified under the section **Key Points**, in the Project Motivation page, specifically:

1. We drop retweets; [by definition](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) these are tweets where the `retweeted_status` fields are populated
2. Since none of the remaining rows are retweets, the retweet-related columns will hold only NA values and are now redundant, so we drop these columns
3. We drop tweets without images, i.e.: rows where the `tweet_id` does not appear in the images data frame

In [None]:
# Delete retweets based on the presence of a retweet time, then drop all redundant retweet related columns

enriched_tweets_df = enriched_tweets_df.loc[enriched_tweets_df['retweeted_status_timestamp'].isna()]
enriched_tweets_df = enriched_tweets_df.drop(columns=['retweeted_status_id', 
                                                      'retweeted_status_user_id', 
                                                      'retweeted_status_timestamp'])
enriched_tweets_df.shape

In [None]:
# Delete tweets without images

enriched_tweets_df = enriched_tweets_df.loc[enriched_tweets_df.index.intersection(img_preds_df.index, sort=None)]
enriched_tweets_df.shape
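Both filters can be sketched on toy frames: a null check on the retweet time stamp, followed by an index intersection with the image data. The frames and values below are made up.

```python
import pandas as pd

# Made-up tweets: Id 2 is a retweet, Id 3 has no image
toy_tweets = pd.DataFrame(
    {'retweeted_status_timestamp': [pd.NaT, pd.Timestamp('2017-01-01'), pd.NaT]},
    index=pd.Index([1, 2, 3], name='tweet_id'))
toy_images = pd.DataFrame({'jpg_url': ['u1.jpg', 'u2.jpg']},
                          index=pd.Index([1, 2], name='tweet_id'))

# Keep original tweets only (no retweet time stamp)
originals = toy_tweets.loc[toy_tweets['retweeted_status_timestamp'].isna()]

# Keep only tweets whose Id also appears in the image frame
with_images = originals.loc[originals.index.intersection(toy_images.index)]

print(originals.index.tolist(), with_images.index.tolist())
```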

### Keep persistent copy of wrangled clean data

As there appear to be no tweets beyond August 1st, 2017 (most likely since we dropped tweets without images) we are now done with cleaning. We end the wrangling process by writing the clean data to a new set of clean data files.

In [None]:
# We now write the cleaned data frames into files, in a directory called 'clean'
# (the directory is created first if it does not yet exist)

import os
os.makedirs('clean', exist_ok=True)

enriched_tweets_df.to_csv('clean/twitter_archive_master.csv')
expanded_url_df.to_csv('clean/twitter_archive_urls.csv')
image_predictions_df.to_csv('clean/image_predictions.csv')

---

## Data insights and visualisation

In this section we look at the data and query it to obtain some insights. Specifically, we are interested in:

1. Finding the number of tweets with a score above 10/10, versus tweets with a score of 10/10 or below
2. Identifying the tweets where more than one dog stage appears
3. Finding the number of top breed predictions from the image classifier with a prediction confidence below 0.5

**Count number of scores above, and at or below, 10/10**

In [None]:
((enriched_tweets_df['rating_numerator'] / enriched_tweets_df['rating_denominator']) > 1.0).sum(), \
((enriched_tweets_df['rating_numerator'] / enriched_tweets_df['rating_denominator']) <= 1.0).sum()

**Show tweets with more than one dog stage in the tweet text**

In [None]:
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
enriched_tweets_df.loc[enriched_tweets_df[stage_cols].sum(axis=1) > 1][['text'] + stage_cols]

**Count tweets where the top scoring breed prediction is below 0.5** 

In [None]:
dog_preds = image_predictions_df.loc[(image_predictions_df['pred_rank'] == 1) \
                                     & image_predictions_df['pred_is_dog']]

In [None]:
len(dog_preds[dog_preds['pred_confidence'] < 0.5].index)

Now we are going to generate some visualisations:

1. First, based on the top image prediction, look at the frequency distribution for the top 10 breeds only, based on number of tweets
2. Now look at the frequency distribution for the top 10 breeds only, based on aggregate number of favorites

**Distribution by number of tweets**

In [None]:
dog_preds['pred_class'].value_counts(sort=True)[0:10].plot.pie()

**Distribution by number of favorites**

In [None]:
# Select the favorites column before summing, so that text columns are not aggregated
dog_preds.join(enriched_tweets_df['favorite_counts']).groupby('pred_class')['favorite_counts'] \
    .sum().sort_values(ascending=False)[0:10].plot.pie()
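The aggregation behind this chart can be sketched on toy data, leaving out the plotting step. The breeds and favorite counts below are made up.

```python
import pandas as pd

# Made-up top predictions and favorite counts, both indexed by tweet Id
toy_preds = pd.DataFrame({'pred_class': ['pug', 'pug', 'corgi']},
                         index=pd.Index([1, 2, 3], name='tweet_id'))
toy_favs = pd.Series([10, 5, 40],
                     index=pd.Index([1, 2, 3], name='tweet_id'),
                     name='favorite_counts')

# Join on the index, total the favorites per breed, keep the 10 largest totals
top_breeds = (toy_preds.join(toy_favs)
              .groupby('pred_class')['favorite_counts']
              .sum()
              .sort_values(ascending=False)
              .head(10))

print(top_breeds)
```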

### Generate internal report

Having cleaned the data, and generated data insights, we can now generate the internal documentation from this notebook's markdown cells.

(Before running the next cell, you probably want to clear all output that precedes the data insights section, and then SAVE the notebook.)

In [None]:
!jupyter nbconvert --no-input --to pdf wrangle_act.ipynb
!mv wrangle_act.pdf wrangle_report.pdf

---

## Just for fun :)

In this section I will build a quick image classifier to try and predict dog breeds, and compare against the existing predictions.