# Wrangle Report: WeRateDogs Twitter Account
by Charles Dellinger

### Overview
The goal of this project is to perform all the steps in the data wrangling process using the tweets of the WeRateDogs Twitter account.

The general steps include:
* [Gather](#gather) tweet data from multiple sources
* [Assess](#assess) quality and tidiness issues in the data sets
* [Clean](#clean) quality and tidiness issues identified during assessment
* [Store](#store) cleaned data as a flat file or database

<a id='gather'></a>
### Gather
The information is present in three different sources:
* flat file: `twitter-archive-enhanced.csv`
    + provided as a manual download
    + contains historical Twitter record from user WeRateDogs
    + enhanced with extracted dog name and growth stage jargon
* flat file: `image-predictions.tsv`
    + provided as a manual download
    + contains machine learning image prediction of dog breed
* API data: `tweet_json.txt`
    + JSON strings for all the tweets still available (not deleted, etc.)
    + retrieved using Tweepy, a Python library for the Twitter API
    + contains duplicate information present in the historical Twitter record
    + contains additional data such as favorite count and retweet count
    
After gathering the data using the three methods above, the data was loaded into the following respective DataFrames:
* `tweet_archive_df`
* `image_predictions_df`
* `tweet_json_df`
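
The loading step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the in-memory sample rows, the column names in them, and the assumption that `tweet_json.txt` holds one JSON object per line are all stand-ins so the example runs on its own.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the three source files (illustrative data only).
archive_csv = io.StringIO(
    "tweet_id,timestamp,text\n"
    "1001,2017-01-01 00:00:00 +0000,This is Rex. 12/10\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tp1\tp1_conf\tp1_dog\n"
    "1001\tgolden_retriever\t0.9\tTrue\n"
)
tweet_json_txt = io.StringIO(
    json.dumps({"id": 1001, "retweet_count": 5, "favorite_count": 20}) + "\n"
)

# Comma-separated archive; in the project this would be the CSV file path.
tweet_archive_df = pd.read_csv(archive_csv)

# Tab-separated predictions need an explicit separator.
image_predictions_df = pd.read_csv(predictions_tsv, sep="\t")

# Assumed layout: one JSON object per line. Parse each line, then build a DataFrame.
tweet_json_df = pd.DataFrame(
    [json.loads(line) for line in tweet_json_txt if line.strip()]
)
```

With real files, the `io.StringIO` objects would be replaced by the file names listed above.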

<a id='assess'></a>
### Assess
Assessment began with a visual inspection using `.head()` and a check of variable types using `.info()`. Investigating one variable often exposed problems in others, and the cleaning process itself uncovered further areas to clean. What follows are the quality and tidiness issues that were discovered.

#### Quality Issues:
`tweet_archive_df`
* tweet archive includes retweets and replies, which are less curated data
* datetime column 'timestamp' is an object type (string)
* source contains HTML fragments, but the column isn't needed
* DataFrame contains variables that are extraneous to the analysis
* numerator extraction mishandled decimals: the digits after the '.' were extracted as the whole-number rating
* numerator/denominator pairs with a denominator above 10 are for multiple dogs
* denominators that aren't divisible by 10 are invalid
* 'name' column contains non-dog names
* manual fixes:
    + tweet_id = 855862651834028034 Snoop Dogg is not a real dog
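
The decimal numerator issue can be reproduced with a short sketch. The tweet text and both regular expressions below are illustrative assumptions, not the patterns used to build the original archive; they only show how a digits-only pattern ends up starting after the decimal point.

```python
import re

# Hypothetical tweet text with a decimal rating.
text = "This is Bella. 13.5/10 would pet again"

# A digits-only numerator cannot span the '.', so the match starts after it.
naive = re.search(r"(\d+)/(\d+)", text)

# Allowing an optional decimal part captures the full numerator.
fixed = re.search(r"(\d+(?:\.\d+)?)/(\d+)", text)

naive_numerator = naive.group(1)  # "5"
fixed_numerator = fixed.group(1)  # "13.5"
```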

`image_predictions_df`
* tweeted pictures aren't always of dogs

`tweet_json_df`
* entries are missing from the API data compared with the IDs present in the tweet archive
* too much information; many fields are not needed for the analysis

all files
* tweet_id has an inconsistent data type

#### Tidy Issues:
`tweet_archive_df`
* doggo, floofer, pupper, and puppo columns should be one variable column
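
To make the target shape concrete, here is a minimal sketch of collapsing the four stage columns into a single `dog_stage` column with `melt`. The sample rows are made up, and the choice to keep the first stage for tweets that name several is an assumption for illustration; such tweets are flagged so they can be reviewed by hand.

```python
import pandas as pd

# Made-up rows; 'None' strings mark a missing stage, as in the archive.
df = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "doggo": ["doggo", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "pupper", "None", "pupper"],
    "puppo": ["None", "None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Flag tweets that name more than one stage; these need a manual look.
stage_count = (df[stage_cols] != "None").sum(axis=1)
multi_stage_ids = df.loc[stage_count > 1, "tweet_id"].tolist()

# Melt to long form and keep only rows that carry a real stage value.
long_df = df.melt(id_vars="tweet_id", value_vars=stage_cols, value_name="dog_stage")
long_df = long_df[long_df["dog_stage"] != "None"]

# One stage per tweet (first match wins in this sketch).
stages = long_df.drop_duplicates("tweet_id").set_index("tweet_id")["dog_stage"]

# Map back so tweets without any stage end up as NaN.
tidy = df[["tweet_id"]].copy()
tidy["dog_stage"] = tidy["tweet_id"].map(stages)
```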

`image_predictions_df`
* many columns are needed to arrive at the most probable dog breed

`tweet_json_df`
* several columns hold multiple pieces of data
    + JSON tree includes compound fields with another dictionary tree
* contains data fields that duplicate those in `tweet_archive_df`

all files
* data should be consolidated into one DataFrame

<a id='clean'></a>
### Clean

All DataFrames were copied to new tables for editing where the \*\_df format changed to \*\_clean.

The following are the actions that were taken to clean up the issues discovered in the [Assess](#assess) phase.

`tweet_archive_df`
* Drop the retweets and replies to eliminate tweets that aren't primary rating pictures from WeRateDogs
* Change the 'timestamp' variable from string text to `datetime` objects
* Delete extraneous variables not needed for analysis
    + 'source' - contains device/interface that was used to post
        * deleting source will also fix the residual HTML tags
    + 'expanded_urls' - contains url of tweet
* Change doggo, floofer, pupper, and puppo from separate columns into one column.
    * check for entries with multiple stages
    * store 'tweet_id's
    * wipe existing fields
    * manually update from text
    * melt 'doggo', 'floofer', 'pupper', and 'puppo' into 'dog_stage'
    * change all 'None' values to 'NaN'
* Fix numerators and denominators:
    * Re-extract all matches of the numerator and denominator
        * Correct numerator regex extraction for decimals
        * Correct denominator regex extraction to only get denominators divisible by 10
    * Make numerators and denominators consistent:
        * Normalize numerators/denominator pairs to a denominator of 10
        * Average multiple ratings in a tweet to create a single rating per tweet
* Fix dog names column to contain dog names
    + `pos_tag` from NLTK was tried; the invalid names turned out to be all lowercase
    + Remove lowercase entries
    + Remove 'None' entries
* Manual fixes:
    + Delete
        * tweet_id = 855862651834028034 (Snoop Dogg is not a real dog)
    + 'name'
        * "O" should be "O'Malley"
        * "Gin" should be "Gin & Tonic"
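
The rating fix described above (re-extract, normalize to a denominator of 10, average) might look something like the sketch below. The regular expression and the helper name `rating_out_of_ten` are assumptions made for illustration and are not taken from the project notebook.

```python
import re

# Optional decimal numerator; denominator must end in 0 (10, 20, 70, ...).
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d*0)\b")


def rating_out_of_ten(text):
    """Average every rating found in `text`, each rescaled to a denominator of 10."""
    scaled = []
    for num, den in RATING_RE.findall(text):
        den = int(den)
        if den == 0:  # skip a literal "/0" or "/00"
            continue
        scaled.append(float(num) * 10 / den)
    # None signals that no valid rating was found.
    return sum(scaled) / len(scaled) if scaled else None
```

For example, a tweet rating seven dogs as `84/70` rescales to 12.0, and a string such as `9/11` is skipped because its denominator is not divisible by 10.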

`image_predictions_df`
* dropped tweets where the image recognition didn't find a dog
* consolidated all prediction variables to a variable called 'dog_breed' with the most probable breed 
* dropped all variables except 'tweet_id' and 'dog_breed'
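
One way to carry out this consolidation is sketched below. It assumes the predictions are stored in ranked columns named `p1`, `p2`, `p3` with matching boolean flags `p1_dog`, `p2_dog`, `p3_dog`, and it interprets "most probable breed" as the highest-ranked prediction flagged as a dog. Both the column names and that interpretation are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up predictions; column names are assumed, not confirmed by this report.
preds = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "pug", "bagel"],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chow", "banana"],
    "p3_dog": [True, True, False],
})

# Keep p1 where it is a dog, otherwise fall back to p2, then p3.
dog_breed = preds["p1"].where(preds["p1_dog"])
dog_breed = dog_breed.fillna(preds["p2"].where(preds["p2_dog"]))
dog_breed = dog_breed.fillna(preds["p3"].where(preds["p3_dog"]))

# Drop tweets with no dog prediction and keep only the two needed columns.
image_predictions_clean = (
    preds.assign(dog_breed=dog_breed)
    .dropna(subset=["dog_breed"])[["tweet_id", "dog_breed"]]
)
```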

`tweet_json_df`
* missing tweet data was determined to be deleted tweets
    + all returned a [response code](https://developer.twitter.com/en/docs/basics/response-codes.html) of 144, which most likely means they were deleted
* dropped all variables except 'id', 'retweet_count', 'favorite_count', <s>'users'/'followers_count'</s>
    + 'followers_count' ended up being the number of followers at the time of API access
* 'id' was changed to 'tweet_id' to facilitate the merge

all files:
* all tables were merged into one table
    + the merge was performed using a left outer join to `tweet_archive_df`
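
A minimal sketch of the merge follows, using made-up rows and a hypothetical `rating` column. It also shows the key being renamed and cast to a string first, since `tweet_id` had inconsistent types across the sources.

```python
import pandas as pd

# Made-up cleaned tables; 'rating' is a hypothetical column name.
tweet_archive_clean = pd.DataFrame(
    {"tweet_id": ["1", "2", "3"], "rating": [12.0, 13.5, 10.0]}
)
image_predictions_clean = pd.DataFrame(
    {"tweet_id": ["1", "2"], "dog_breed": ["pug", "chow"]}
)
tweet_json_clean = pd.DataFrame(
    {"id": [1, 3], "retweet_count": [5, 7], "favorite_count": [20, 30]}
)

# Align the key name and type before merging.
tweet_json_clean = tweet_json_clean.rename(columns={"id": "tweet_id"})
tweet_json_clean["tweet_id"] = tweet_json_clean["tweet_id"].astype(str)

# Left joins keep every archive tweet, even without predictions or API data.
master = (
    tweet_archive_clean
    .merge(image_predictions_clean, on="tweet_id", how="left")
    .merge(tweet_json_clean, on="tweet_id", how="left")
)
```

Tweets missing from either of the other tables keep their archive row, with NaN in the columns that could not be filled.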

<a id='store'></a>
### Store
Since the data was merged into one table, the cleaned data set was stored as a single flat file called `twitter_archive_master.csv`.
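
As a rough sketch of the storage step, the example below writes a one-row stand-in table to a temporary directory and reads it back. The temporary path and the sample row are assumptions for illustration; reading `tweet_id` back as a string is a suggested precaution so the long IDs are treated as identifiers rather than numbers.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the merged table.
master = pd.DataFrame(
    {"tweet_id": ["855862651834028034"], "dog_stage": ["pupper"]}
)

# Written to a temporary directory here instead of the project folder.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Read tweet_id back as text so it is not parsed as an integer.
restored = pd.read_csv(out_path, dtype={"tweet_id": str})
```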