# Data Wrangling of WeRateDogs Tweet Archive

## Gathering

The tweet archive was provided as a CSV file, but some desired attributes were missing. Extended tweet information was downloaded with the tweepy library, using the tweet ids from the archive, and saved to JSON files. These files were then read into two datasets: extended tweets and user information.
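
Reading the saved JSON back into a dataset can be sketched roughly as follows. This is only an illustration: it assumes one JSON object per line, and the helper name `read_tweet_json`, the selected fields, and the sample records are made up for the example rather than taken from the project.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line.

    Hypothetical helper: the file layout and the chosen fields are assumptions.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet.get("retweet_count"),
            "favorite_count": tweet.get("favorite_count"),
            "user_id": tweet.get("user", {}).get("id"),
        })
    return pd.DataFrame(rows)


# Two made-up records standing in for a saved file.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10, "user": {"id": 42}}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "user": {"id": 42}}\n'
)
extended_tweets = read_tweet_json(sample)
```

With a real file, an open file handle can be passed in place of `sample`.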

Image predictions were not provided directly, but a URL to download them was given. They were downloaded using the requests library.

## Assessing

Visual and programmatic assessment revealed the following problems:

- Tidiness:
    - In the tweets archive, dog "stages" are spread across separate columns (`doggo`, `floofer`, `pupper`, `puppo`), but should be values of a single categorical variable;
    - the tweets archive, extended tweets and image predictions should be a single dataset;
    - (alternatively, Twitter-related information from the tweets archive and extended tweets could go into one dataset, and dog-related information from the tweets archive and image predictions into another, as these could be seen as two separate concerns. In our case I think all of this is part of the same observation, so I decided against splitting the data into two datasets);

- Quality:
    - tweets archive:
        - `retweeted_status_id` and `retweeted_status_user_id` are in scientific notation (also float64);
        - False dog names (captured in `not_names`);
        - Columns `name`, `doggo`, `floofer`, `pupper`, `puppo` are polluted with the string 'None' instead of np.nan;
        - `source` would be fine as just inner text of anchor tag;
        - `source` could be categorical;
        - `timestamp` and `retweeted_status_timestamp` are objects, but represent datetimes;
        - `rating_denominator` is not always 10;
    - image_predictions:
        - In column names p1, p2, p3 should be prediction1, prediction2, prediction3;
        - In column names conf should be confidence;
        - Predicted breed capitalization varies;
        - Predicted breeds use underscores instead of spaces;
    - extended_tweets:
        - `in_reply_to_status_id` and `in_reply_to_user_id` should be integers (or at least be represented in non-scientific notation);
        - `source` has same issues as in tweets archive;
        - `user_id` has a single unique value;
    - users:
        - Turns out all tweets came from one user. This dataset is useless;
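
A few of the programmatic checks behind this list can be sketched as below. The toy `archive` frame and its values are invented for illustration; only the column names come from the issues listed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tweets archive (values are made up).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.0e17, np.nan],
    "name": ["Charlie", "None", "Bella"],
    "pupper": ["None", "pupper", "None"],
    "rating_denominator": [10, 10, 50],
})

# Identifier columns stored as floats, which print in scientific notation.
float_id_cols = [
    col for col in archive.columns
    if col.endswith("_id") and archive[col].dtype == "float64"
]

# Count of literal 'None' strings per column.
none_counts = (archive[["name", "pupper"]] == "None").sum()

# Rows whose rating denominator is not 10.
odd_denominators = archive[archive["rating_denominator"] != 10]
```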

## Cleaning

Actions to address the problems found:
- In image prediction column names replace p1, p2, p3 with prediction1, prediction2, prediction3;
- In image prediction column names replace conf with confidence;
- In columns `name`, `doggo`, `floofer`, `pupper`, `puppo` replace 'None' with np.nan;
- Melt `doggo`, `floofer`, `pupper`, `puppo` columns into a single `stage` column;
- Merge tweet archive, extended tweets and image predictions into single dataset;
- Calculate `rating_with_denominator_10` column;
- Replace false dog names (captured in `not_names`) with np.nan;
- Extract inner text of anchor tag in `source` and replace the value with it;
- Replace underscores in breed predictions with spaces and lowercase them;
- Convert `source` to category;
- Convert `timestamp` and `retweeted_status_timestamp` to datetime;
- Convert `retweeted_status_id`, `retweeted_status_user_id`, `in_reply_to_status_id` and `in_reply_to_user_id` to strings (I'd rather have them as integers, but we [can't have NAs in integer series](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na), and these are identifier values, so representation is important);
- Drop `user_id`;
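
Several of these steps can be sketched in pandas as shown below. This is a minimal illustration on made-up data, not the exact code used in the project: the toy frames, the prediction column names `p1` and `p1_conf`, the regular expressions, and the helper `id_to_str` are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real datasets (values are made up).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "name": ["Charlie", "None", "Bella"],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "None", "None"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    ],
    "retweeted_status_id": [np.nan, 8.0e17, np.nan],
})
predictions = pd.DataFrame({"p1": ["Golden_retriever"], "p1_conf": [0.9]})

# Rename prediction columns: p1 -> prediction1, conf -> confidence.
predictions.columns = (
    predictions.columns
    .str.replace(r"^p(\d)", r"prediction\1", regex=True)
    .str.replace("conf", "confidence", regex=False)
)

# Replace underscores with spaces and lowercase the predicted breeds.
predictions["prediction1"] = (
    predictions["prediction1"].str.replace("_", " ", regex=False).str.lower()
)

# Turn the literal string 'None' into missing values (mask fills with NaN).
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
cols = ["name"] + stage_cols
archive[cols] = archive[cols].mask(archive[cols] == "None")

# Melt the four stage columns into a single `stage` column.
# Note: a tweet with two stages would yield two rows after the merge.
stages = (
    archive.melt(id_vars="tweet_id", value_vars=stage_cols, value_name="stage")
    .dropna(subset=["stage"])
    .drop(columns="variable")
)
archive = archive.drop(columns=stage_cols).merge(stages, on="tweet_id", how="left")
archive["stage"] = archive["stage"].astype("category")

# Keep only the inner text of the anchor tag in `source`, as a category.
archive["source"] = (
    archive["source"].str.extract(r">([^<]+)<", expand=False).astype("category")
)


def id_to_str(value):
    """Format a float identifier as a plain digit string, keeping NaN as NaN."""
    return np.nan if pd.isna(value) else str(int(value))


# float64 cannot hold every 18-digit id exactly, so reading ids as strings
# at load time is safer where that is possible.
archive["retweeted_status_id"] = archive["retweeted_status_id"].map(id_to_str)
```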
