## Purpose
This report describes how data associated with the [WeRateDogs](https://twitter.com/dogrates) Twitter feed was wrangled to make it suitable for analysis.

## Gather
The WeRateDogs Twitter archive was provided as the flat file `twitter-archive-enhanced.csv`.

The image prediction file `image-predictions.tsv` was downloaded programmatically from Udacity using `Requests`.
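The download step might look roughly like the sketch below. The source URL is not given in this report, so the one in the usage comment is a placeholder. `download_file` is a hypothetical helper name, and its `get` parameter is an assumption added only so the function can be exercised without network access.

```python
from pathlib import Path

import requests


def download_file(url, dest, get=requests.get):
    """Fetch `url` and write the response body to `dest`; return the path."""
    response = get(url, timeout=30)
    # Fail loudly on 4xx/5xx instead of saving an error page as data.
    response.raise_for_status()
    path = Path(dest)
    path.write_bytes(response.content)
    return path


# Usage (placeholder URL, not the real download location):
# download_file("https://example.com/image-predictions.tsv",
#               "image-predictions.tsv")
```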

Extended tweet information for the Twitter archive, including favorite and retweet counts, was obtained from the Twitter API via the `Tweepy` interface.

## Assess

#### Quality
##### `twitter_arch` table (from provided file)
- contains retweets
- some of the tweets are about non-dog items ('guys, we only rate dogs')
- `rating_numerator` and `rating_denominator` contain values that are not ratings
- `rating_numerator` and `rating_denominator` contain some ratings for multiple dogs
- `name` field contains several non-name entries (e.g., a, an, the, very, just, quite)
- `timestamp` and `retweeted_status_timestamp` are strings, not datetime
- `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_user_id` are floats, not ints
- some tweets have more than one dog stage assigned (each tweet should have at most one)
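Several of the issues listed above can be flagged programmatically. The following is a minimal sketch, assuming the archive has been loaded into a pandas DataFrame with the column names used in this report. The helper name `assess_archive` and the sample rows are made up for illustration, and treating any denominator other than 10 as suspicious is an assumption, not a rule stated in the report.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def assess_archive(df):
    """Return boolean masks that flag rows needing a closer look."""
    # Comparing each stage column with its own name works whether a missing
    # stage is stored as the string "None" or as NaN.
    stage_flags = pd.DataFrame({s: df[s].eq(s) for s in STAGES})
    return {
        "is_retweet": df["retweeted_status_id"].notna(),
        "odd_denominator": df["rating_denominator"] != 10,
        "lowercase_name": df["name"].fillna("None").str.islower(),
        "multiple_stages": stage_flags.sum(axis=1) > 1,
    }


# Made-up sample rows, for illustration only.
nan = float("nan")
sample = pd.DataFrame({
    "retweeted_status_id": [nan, 1.0, nan],
    "rating_denominator": [10, 10, 50],
    "name": ["Charlie", "a", "None"],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "None", "None"],
})

masks = assess_archive(sample)
for label, mask in masks.items():
    print(label, int(mask.sum()))
```

Each mask can then be used to pull the flagged rows for visual inspection, e.g. `sample[masks["odd_denominator"]]`.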

##### `image_preds` table (from downloaded file)
- inconsistent capitalization of descriptors in `p1`, `p2`, `p3`
- tweets without images (N in this table < N in Twitter archive)

##### `ext_tweet` table (via Twitter API)
- some tweets no longer accessible (N of this table < N of twitter archive table)

#### Tidiness
##### `twitter_arch` table 
- Dog stage is four fields, each of which contains "None" or the column name

##### `image_preds` table
- Image prediction table should be combined with the main archive table since there is one entry per tweet

##### `ext_tweet` table 
- Extended tweet info (retweet & favorite counts) should be part of the main archive table


## Clean

### Quality

#### `twitter_arch`
1. Contains retweets: delete retweet entries 
2. Contains tweets that are not about dogs: delete non-dog entries. These are assumed to be any tweets containing the phrase 'we only rate dogs'.
3. Ratings contain non-rating values: replace these with the correct rating where possible; otherwise delete the entry. Many of these are parsing errors in which some other `x/y`-type string was taken as the rating. Manually reading the tweet text showed that some of these tweets also contain the actual rating.
4. Ratings contain ratings for multiple dogs: delete these entries to standardize the dataset on single-dog ratings. Their denominators are multiples of 10 (e.g., 110, 120), which is inconsistent with the bulk of the ratings.
5. Field `name` contains non-name entries: change them to 'None'. They are all lower-case, whereas actual names are capitalized.
6. Fields `timestamp` and `retweeted_status_timestamp` are strings, not datetime: convert `timestamp` to datetime. `retweeted_status_timestamp` was deleted during step 1.
7. Fields `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_user_id` are floats, not ints: convert `in_reply_to_status_id` and `in_reply_to_user_id` to int. `retweeted_status_id` and `retweeted_user_id` were deleted in step 1.
8. Some entries contain multiple stage entries: remove these. They appear to be for multiple dogs or non-dogs. The dataset should be limited to single dogs, and the stage entry should be unique.
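Steps 1, 2, 5, 6 and 7 above could be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the exact code used for this report: `clean_archive` is a hypothetical helper, the sample rows are made up, the phrase match is made case-insensitive, and the nullable `Int64` dtype is used for the reply IDs because a plain integer column cannot hold missing values.

```python
import pandas as pd


def clean_archive(df):
    """Apply quality steps 1, 2, 5, 6 and 7 to a copy of the archive."""
    # Step 1: keep original tweets only, then drop the retweet-only columns.
    df = df[df["retweeted_status_id"].isna()]
    retweet_cols = [c for c in df.columns if c.startswith("retweeted_")]
    df = df.drop(columns=retweet_cols)

    # Step 2: drop tweets containing "we only rate dogs" (case-insensitive
    # matching is an assumption of this sketch).
    not_dog = df["text"].str.lower().str.contains("we only rate dogs", regex=False)
    df = df[~not_dog].copy()

    # Step 5: entries that are all lowercase are not real names.
    lowercase = df["name"].fillna("None").str.islower()
    df.loc[lowercase, "name"] = "None"

    # Step 6: parse timestamps into timezone-aware datetimes.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # Step 7: float IDs to nullable integers (most tweets are not replies).
    for col in ["in_reply_to_status_id", "in_reply_to_user_id"]:
        df[col] = df[col].astype("Int64")
    return df


# Made-up sample rows, for illustration only.
nan = float("nan")
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "text": [
        "This is Charlie. 13/10",
        "RT @dog_rates: This is Charlie. 13/10",
        "Guys, we only rate dogs. This is a goat. 9/10",
        "This is a very good dog. 12/10",
    ],
    "name": ["Charlie", "Charlie", "None", "a"],
    "in_reply_to_status_id": [nan, nan, nan, 123.0],
    "in_reply_to_user_id": [nan, nan, nan, 456.0],
    "retweeted_status_id": [nan, 99.0, nan, nan],
    "retweeted_user_id": [nan, 7.0, nan, nan],
    "retweeted_status_timestamp": [None, "2017-07-31 00:00:00 +0000", None, None],
})

cleaned = clean_archive(sample)
print(cleaned[["tweet_id", "name", "in_reply_to_status_id"]])
```

Steps 3, 4 and 8 are left out of the sketch because they relied on manual inspection of the tweet text.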

#### `image_preds`
1. Inconsistent capitalization of descriptors in `p1`, `p2`, `p3`: make all entries lowercase
2. Tweets without images (N in this table < N in Twitter archive): this will be addressed in "Tidiness" when the tables are merged.
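The capitalization fix in step 1 is a one-liner per column. A minimal sketch, assuming `image_preds` is a pandas DataFrame; `lowercase_predictions` is a hypothetical helper and the sample values are made up:

```python
import pandas as pd


def lowercase_predictions(df, cols=("p1", "p2", "p3")):
    """Return a copy of `df` with the prediction labels lowercased."""
    df = df.copy()
    for col in cols:
        df[col] = df[col].str.lower()
    return df


# Made-up sample rows, for illustration only.
preds = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["Golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "Bull_mastiff"],
    "p3": ["kuvasz", "French_bulldog"],
})

print(lowercase_predictions(preds))
```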

#### `ext_tweet` 
1. Some tweets no longer accessible (N of this table < N of twitter archive table): this will be addressed in "Tidiness" when the tables are merged.

### Tidiness
#### `twitter_arch`
1. Dog stage is four fields, each of which contains "None" or the column name: merge these into single column `stage` with a categorical entry for each stage or "None".
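One way to collapse the four stage columns is sketched below. It is an illustration rather than the report's exact code: `merge_stage_columns` is a hypothetical helper, the sample rows are made up, and any row that still carries more than one stage is dropped, on the assumption that such rows are meant to be removed anyway (Quality step 8).

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def merge_stage_columns(df):
    """Replace the four stage columns with one categorical `stage` column."""
    # True where a stage column holds its own name (i.e. the stage is set).
    flags = pd.DataFrame({s: df[s].eq(s) for s in STAGES})
    # Keep rows with at most one stage so the merged value is unambiguous.
    single = df[flags.sum(axis=1) <= 1].copy()

    stage = pd.Series("None", index=single.index, dtype="object")
    for s in STAGES:
        stage = stage.mask(single[s].eq(s), s)

    single["stage"] = pd.Categorical(stage, categories=STAGES + ["None"])
    return single.drop(columns=STAGES)


# Made-up sample rows, for illustration only.
sample_stages = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo": ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "None", "pupper", "None"],
    "puppo": ["None", "None", "None", "puppo"],
})

print(merge_stage_columns(sample_stages))
```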

#### `image_preds`
1. Image prediction table should be combined with the main archive table: join it with `twitter_arch`. An inner join will eliminate the tweets without images.

#### `ext_tweet`
1. Extended tweet info (retweet & favorite counts) should be part of the main archive table: join it with `twitter_arch`. An inner join will eliminate the tweets that were no longer available when the API was accessed.
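The two inner joins can be sketched as below. The key column `tweet_id` and the count column names are assumptions (the API data may need its ID column renamed first), `combine_tables` is a hypothetical helper, and the sample rows are made up.

```python
import pandas as pd


def combine_tables(twitter_arch, image_preds, ext_tweet):
    """Inner-join the three tables on `tweet_id` (assumed shared key)."""
    # validate="one_to_one" raises if a tweet_id is duplicated on either side.
    merged = twitter_arch.merge(
        image_preds, on="tweet_id", how="inner", validate="one_to_one"
    )
    merged = merged.merge(
        ext_tweet, on="tweet_id", how="inner", validate="one_to_one"
    )
    return merged


# Made-up sample rows, for illustration only.
arch = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Charlie", "None", "Bo"]})
preds = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chow"]})
ext = pd.DataFrame({
    "tweet_id": [2, 3],
    "retweet_count": [10, 20],
    "favorite_count": [100, 200],
})

combined = combine_tables(arch, preds, ext)
print(combined)
```

Only tweet 2 survives in this sample: tweet 1 has no extended info and tweet 3 has no image prediction, which mirrors how the inner joins resolve the two row-count issues noted under "Quality".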