# WeRateDogs Twitter Data Wrangling
The following sections list the data quality and tidiness issues found in the WeRateDogs Twitter data, along with the cleaning done for each issue.

The original datasets and their dataframe names are:
- `twitter_df`: WeRateDogs Twitter archive
- `image_pred_df`: Dog breed classifications of the images in each tweet
- `tweet_infos_df`: Retweet and favorite counts of each tweet, obtained from the Twitter API

The following dataframes are created during data wrangling:
- `dog_ratings_df`: contains only the dog ratings found in each tweet.
- `tweet_dog_info_df`: contains only the dog stage and dog name detected in each tweet.

## Data Assessment

### Quality Issues
The following data quality issues were found in the data:
- Invalid column data types in `twitter_df`:
    - `tweet_id` is of type integer, while it should be string because it's an ID.
    - `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` are of type float, while they should be string because they are IDs.
    - `timestamp` and `retweeted_status_timestamp` are of type string, while they should be datetime.
- Invalid column data types in `image_pred_df`:
    - `tweet_id` is integer, while it should be string since it is an ID.
- Invalid value "None" in dog stage columns in `twitter_df`:
    - The `doggo`, `floofer`, `pupper`, and `puppo` columns contain the string "None" instead of a true missing value, which is misleading and complicates programmatic analysis.
- Invalid column data type for the dog stage columns in `twitter_df`:
    - The `doggo`, `floofer`, `pupper`, and `puppo` columns indicate whether the corresponding word appears in the tweet text, so they should be boolean rather than string: `True` if the dog stage appears in the tweet text and `False` otherwise.
- Multiple dog stages in a row (although there is only one dog) in `twitter_df`:
    - Some tweets are about a single dog, but the text mentions several dog stage words as part of the author's way of expression. These rows end up with a `True` value in more than one of the `doggo`, `floofer`, `pupper`, and `puppo` columns, which is inaccurate and misleading.
- Invalid values for `rating_numerator` in `twitter_df`:
    - Decimal numerators were not extracted correctly from the tweet text, e.g. 13.5/10 was extracted as 5/10. The original rating extraction apparently did not allow for a decimal point in the numerator.
    - Among the ratings with a valid denominator, some have a `rating_numerator` less than 10 (possibly due to a misunderstanding of the rating system). This makes comparisons during analysis unfair.
- Invalid values for `rating_denominator` in `twitter_df`:
    - Some `rating_denominator` values are not 10. Inspecting those rows showed that some are non-rating ratios and phrases, e.g. "24/7" or "7/11" (for 24 hours, 7 days and Seven-Eleven), or date-like values such as 13/14/19, that were mistakenly extracted as dog ratings.
    - Other non-10 denominators are genuine dog ratings where the user intentionally used a different denominator, e.g. a rating of 143/130.
- Invalid values for the dog name in `twitter_df`:
    - The `name` column contains invalid names such as "None", "a", and "actually". Every value that starts with a lowercase letter is invalid, presumably due to mistakes in the original name detection algorithm.
- Many `tweet_dog_info_df` rows are completely empty:
    - Many rows in the dataframe carry no information at all, i.e. the dog name is null and all the dog stage columns are `False`, which makes those rows useless.
- Some columns need renaming:
    - `jpg_url` and `img_num` in `image_pred_df` could be given clearer names.
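
The two rating issues above can be reproduced with a minimal sketch. The integer-only pattern below is only an assumption about how the original extraction behaved, and the sample texts are made up for illustration.

```python
import re

# Assumed shape of the original extraction: integers only.
NAIVE_RATING = re.compile(r"(\d+)/(\d+)")
# Decimal-aware alternative: optional fractional part in the numerator.
DECIMAL_RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

text = "This is Bella. She deserves 13.5/10"  # made-up tweet text

naive = NAIVE_RATING.findall(text)      # [('5', '10')]: the "13." is lost
decimal = DECIMAL_RATING.findall(text)  # [('13.5', '10')]

# Ratios that are not ratings still match, so the denominator must be checked.
mixed = DECIMAL_RATING.findall("Open 24/7 and still a 12/10 pup")
suspicious = [(num, den) for num, den in mixed if den != "10"]  # [('24', '7')]
```

A single tweet can also yield more than one match, which is why every match is collected instead of only the first.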

### Tidiness Issues
The following data tidiness issues were found in the data:
- `tweet_infos_df` is separated from `twitter_df`:
    - `tweet_infos_df` should be combined with `twitter_df`, because the `retweet_count` and `favorite_count` columns are tweet-related details and therefore belong in `twitter_df`.
- Index of `twitter_df` and `image_pred_df` is not `tweet_id`:
    - The index of `twitter_df` and `image_pred_df` should be changed to `tweet_id` to ease programmatic analysis.
- Dog stage and dog name columns do not belong in `twitter_df`:
    - The dog stage columns (`doggo`, `floofer`, `pupper`, and `puppo`) and the dog name column `name` should be in a separate table/dataframe dedicated to information about the dog detected in the tweet text. `twitter_df` should only contain tweet-related details.
- Dog rating columns do not belong in `twitter_df`:
    - The dog rating columns (`rating_numerator` and `rating_denominator`) should be in a separate table/dataframe dedicated to the dog rating information detected in the tweet text. `twitter_df` should only contain tweet-related details.
    - Each tweet may also contain multiple dog ratings, which is one more reason for the rating information to have its own table/dataframe separate from `twitter_df`.
- Prediction columns in `image_pred_df` are spread across three sets of columns:
    - The predicted class (`p1`, `p2`, `p3`), prediction confidence (`p1_conf`, `p2_conf`, `p3_conf`), and "is dog" flag (`p1_dog`, `p2_dog`, `p3_dog`) are each a single variable that should occupy one column, but each is split into three columns, one per prediction level.

## Data Cleaning
The following cleaning was done for each data quality issue:
- Invalid column data types in `twitter_df`:
    - Converted `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, and `retweeted_status_user_id` columns to type string.
    - Converted `timestamp` and `retweeted_status_timestamp` columns to type datetime.
- Invalid column data types in `image_pred_df`:
    - Converted `tweet_id` column to type string.
- Invalid value "None" in dog stage columns in `twitter_df`:
    - Replaced the string "None" with actual `NaN` values in the dog stage columns `doggo`, `floofer`, `pupper`, and `puppo`.
- Invalid column data type for the dog stage columns in `twitter_df`:
    - Converted the dog stage columns `doggo`, `floofer`, `pupper`, and `puppo` to boolean: `True` if the value is not null and `False` if it is null (after the replacement described above).
- Multiple dog stages in a row (although there is only one dog) in `twitter_df`:
    - Since only a few tweets have multiple dog stages, I fixed them manually. After reading each affected tweet, I set the dog stage columns to the single correct stage based on my own judgement. In total, 6 tweets were corrected to have a single valid dog stage.
- Invalid values for `rating_numerator` in `twitter_df`:
    - Re-extracted the ratings to account for decimal points and for the possibility that a tweet contains multiple ratings.
    - Stored the ratings in a separate table named `dog_ratings_df`.
    - Removed every rating whose numerator is greater than 17, because the numerators above 17 are all oddly high values (182 and above).
    - Added 10 to the numerator of every rating whose numerator is less than or equal to 10. This assumes that those users did not understand the WeRateDogs rating rule and rated the dog on a conventional scale with a value <= 10.
- Invalid values for `rating_denominator` in `twitter_df`:
    - Removed every rating whose denominator is not 10 from `dog_ratings_df`. This removes the ratios that are not ratings at all. Genuine ratings with a denominator other than 10 are removed as well, to keep comparisons fair in the analysis.
- Invalid values for the dog name in `twitter_df`:
    - Replaced the name "None" and every name that starts with a lowercase letter (all of which are non-name words) with a null value.
- Many `tweet_dog_info_df` rows are completely empty:
    - Dropped the rows that are completely empty, i.e. rows whose `name` is null and whose dog stage columns are all `False`.
- Column renames:
    - Renamed `jpg_url` to `img_url` and `img_num` to `img_idx` in `image_pred_df`.
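
Several of the quality fixes above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the exact code used: the sample rows are made up, the helper `id_to_str` and the `RATING` pattern are hypothetical, and only one of the ID columns is shown.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows that mimic the archive's layout (not real tweets).
twitter_df = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "in_reply_to_status_id": [np.nan, 1001.0, np.nan],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-08-02 10:00:00 +0000",
                  "2017-08-03 09:30:00 +0000"],
    "text": ["This is Bella. 13.5/10 would pet",
             "Open 24/7 doggo, still 12/10",
             "Here is a pupper. 9/10 and 182/10"],
    "name": ["Bella", "None", "a"],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

def id_to_str(value):
    """Hypothetical helper: render an ID as digits, keep missing values missing."""
    return None if pd.isna(value) else str(int(value))

# Data types: IDs become strings, timestamps become datetimes.
for col in ["tweet_id", "in_reply_to_status_id"]:
    twitter_df[col] = twitter_df[col].map(id_to_str)
twitter_df["timestamp"] = pd.to_datetime(twitter_df["timestamp"], utc=True)

# Dog stages: "None" -> NaN, then not-null -> True / null -> False.
twitter_df[STAGES] = twitter_df[STAGES].replace("None", np.nan).notna()

# Names: "None" and anything starting with a lowercase letter become null.
bad_name = twitter_df["name"].eq("None") | twitter_df["name"].str.match(r"[a-z]")
twitter_df["name"] = twitter_df["name"].mask(bad_name)

# Ratings: one row per match, so a tweet can contribute several ratings.
RATING = r"(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)"
dog_ratings_df = (
    twitter_df.set_index("tweet_id")["text"].str.extractall(RATING).astype(float)
)
keep = dog_ratings_df["denominator"].eq(10) & dog_ratings_df["numerator"].le(17)
dog_ratings_df = dog_ratings_df[keep].copy()
low = dog_ratings_df["numerator"].le(10)
dog_ratings_df.loc[low, "numerator"] += 10
```

One caveat with this approach: an 18-digit ID that was already loaded as a float may have lost precision before the conversion, so reading the ID columns as strings at load time is the safer option.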

The following cleaning was done for each data tidiness issue:
- `tweet_infos_df` is separated from `twitter_df`:
    - Merged `tweet_infos_df` into `twitter_df` on the `tweet_id` column.
- Index of `twitter_df` and `image_pred_df` is not `tweet_id`:
    - Set the index of `twitter_df` and `image_pred_df` to `tweet_id`.
- Dog stage and dog name columns do not belong in `twitter_df`:
    - Separated the `name`, `doggo`, `floofer`, `pupper`, and `puppo` columns from `twitter_df` into a new dataframe `tweet_dog_info_df`.
    - Dropped the columns mentioned above from `twitter_df`.
- Dog rating columns do not belong in `twitter_df`:
    - Dropped the rating columns from `twitter_df`. From the data cleaning on the quality issues above, the rating columns were moved to a separate table named `dog_ratings_df`.
- Prediction columns in `image_pred_df` are spread across three sets of columns (predicted class `p1`, `p2`, `p3`; prediction confidence `p1_conf`, `p2_conf`, `p3_conf`; "is dog" flag `p1_dog`, `p2_dog`, `p3_dog`):
    - Melted the columns mentioned above into the following columns: `pred_level` (whose value is 1, 2, or 3), `pred_confidence`, `pred_class`, and `is_dog`.
    - Afterward, set the index of `image_pred_df` to `tweet_id`, `img_idx`, and `pred_level`.
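
The tidiness steps above could look roughly like the sketch below. The sample rows are made up, the left join is an assumption (the join type is not stated above), and the reshaping concatenates one frame per prediction level instead of calling `melt` directly, which gives the same long layout.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]
DOG_COLS = ["name"] + STAGES

# Made-up rows for illustration only.
twitter_df = pd.DataFrame({
    "tweet_id": ["1001", "1002"],
    "text": ["This is Bella, a pupper. 13/10", "No dog here. 12/10"],
    "name": ["Bella", None],
    "doggo": [False, False], "floofer": [False, False],
    "pupper": [True, False], "puppo": [False, False],
})
tweet_infos_df = pd.DataFrame({
    "tweet_id": ["1001", "1002"],
    "retweet_count": [10, 20],
    "favorite_count": [100, 200],
})
image_pred_df = pd.DataFrame({
    "tweet_id": ["1001"], "img_url": ["https://example.com/a.jpg"], "img_idx": [1],
    "p1": ["golden_retriever"], "p1_conf": [0.80], "p1_dog": [True],
    "p2": ["labrador"], "p2_conf": [0.15], "p2_dog": [True],
    "p3": ["tennis_ball"], "p3_conf": [0.05], "p3_dog": [False],
})

# Merge the counts into the tweet table, then index by tweet_id.
# how="left" is an assumption: it keeps tweets that have no counts.
twitter_df = (
    twitter_df.merge(tweet_infos_df, on="tweet_id", how="left")
    .set_index("tweet_id")
)

# Move dog name and stage columns into their own dataframe.
tweet_dog_info_df = twitter_df[DOG_COLS].copy()
twitter_df = twitter_df.drop(columns=DOG_COLS)

# Drop rows with no name and no stage at all.
is_empty = tweet_dog_info_df["name"].isna() & ~tweet_dog_info_df[STAGES].any(axis=1)
tweet_dog_info_df = tweet_dog_info_df[~is_empty]

# Reshape wide prediction columns into one row per prediction level.
parts = []
for level in (1, 2, 3):
    part = image_pred_df[["tweet_id", "img_url", "img_idx"]].copy()
    part["pred_level"] = level
    part["pred_class"] = image_pred_df[f"p{level}"]
    part["pred_confidence"] = image_pred_df[f"p{level}_conf"]
    part["is_dog"] = image_pred_df[f"p{level}_dog"]
    parts.append(part)
image_pred_df = (
    pd.concat(parts, ignore_index=True)
    .set_index(["tweet_id", "img_idx", "pred_level"])
    .sort_index()
)
```

With the three-level index, a single prediction can be looked up directly, for example `image_pred_df.loc[("1001", 1, 3), "pred_class"]`.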

## Feature Engineering
The following feature engineering was done during the data wrangling:
- Converted retweet and reply information columns in `twitter_df` into `is_retweet` and `is_reply`:
    - The columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` indicate whether a tweet is a retweet or a reply to another tweet. I created `is_retweet`, which is true when `retweeted_status_id` is not null, and `is_reply`, which is true when `in_reply_to_status_id` is not null; both are false otherwise.
    - The original columns are not important for the analysis, so they were dropped.
- Converted `source` column values in `twitter_df` into categorical values:
    - The `source` column has only 4 unique values, each representing the source of the tweet. They were converted into a categorical column with the values 'iphone', 'webclient', 'vine', and 'tweetdeck'.
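
A minimal sketch of these two features is shown below. The raw `source` strings and the `SOURCE_LABELS` mapping are assumptions made for illustration, since the raw values are not listed above, and `label_source` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows; the raw source strings are assumed, not taken from the data.
twitter_df = pd.DataFrame({
    "in_reply_to_status_id": [None, "1001", None],
    "in_reply_to_user_id": [None, "42", None],
    "retweeted_status_id": [None, None, "1002"],
    "retweeted_status_user_id": [None, None, "43"],
    "retweeted_status_timestamp": [pd.NaT, pd.NaT, pd.Timestamp("2017-01-01")],
    "source": ["Twitter for iPhone", "Twitter Web Client", "Vine - Make a Scene"],
})

# A tweet is a retweet/reply when the corresponding status ID is present.
twitter_df["is_retweet"] = twitter_df["retweeted_status_id"].notna()
twitter_df["is_reply"] = twitter_df["in_reply_to_status_id"].notna()
twitter_df = twitter_df.drop(columns=[
    "in_reply_to_status_id", "in_reply_to_user_id",
    "retweeted_status_id", "retweeted_status_user_id",
    "retweeted_status_timestamp",
])

# Hypothetical mapping from a substring of the raw value to a short label.
SOURCE_LABELS = {
    "Twitter for iPhone": "iphone",
    "Twitter Web Client": "webclient",
    "Vine": "vine",
    "TweetDeck": "tweetdeck",
}

def label_source(raw):
    """Return the short label whose key occurs in the raw source string."""
    for needle, label in SOURCE_LABELS.items():
        if needle in raw:
            return label
    return None  # unknown source stays missing

twitter_df["source"] = twitter_df["source"].map(label_source).astype("category")
```

Matching on a substring keeps the mapping robust if the raw value carries extra markup around the client name.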

## Conclusion
After the cleaning described above, the final cleaned dataframes are:
- `twitter_df`: contains tweet-level information.
- `image_pred_df`: contains the dog breed classifications of the image in each tweet.
- `dog_ratings_df`: contains the dog ratings found in each tweet.
- `tweet_dog_info_df`: contains the dog stage and dog name detected in each tweet.

The dataframes are saved as the following CSV files, respectively:
- `twitter_archive_master.csv`
- `image_predictions_master.csv`
- `tweet_dog_ratings_master.csv`
- `tweet_dog_infos_master.csv`