# Wrangle Report

## Introduction

The dataset that we wrangled is the tweet archive of Twitter user `@dog_rates`, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people’s dogs with a humorous comment about the dog. These ratings generally have a denominator of 10, and the numerators are almost always greater than 10. The tweet archive used in this project contains basic tweet data (tweet ID, timestamp, text, etc.) for all 2356 of their tweets as they stood on August 1, 2017.

## Installing Libraries

We installed the following libraries using the Anaconda PowerShell Prompt: Requests, Tweepy, and JSON. We then loaded the rest with standard `import` statements.

## Gathering Data for this Project

### 1. Twitter Archive

The first source is the WeRateDogs Twitter archive. We downloaded this file manually and saved it as `twitter_archive_enhanced.csv`.

**URL:** https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv

### 2. Image Predictions

The second source is the set of tweet image predictions. We downloaded this file programmatically with the Requests library and saved it as `image_predictions.tsv`.

**URL:** https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
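
A programmatic download of this kind can be sketched as follows. This is a minimal illustration rather than the exact code we ran: the helper name `download_file` and its injectable `get` parameter are hypothetical, and the parameter exists only so the function can be exercised without a network connection.

```python
from pathlib import Path

import requests

# URL quoted in this report.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, dest, get=requests.get):
    """Fetch `url` and write the raw response bytes to `dest`.

    `get` defaults to `requests.get`; passing another callable is an
    assumption made here to allow offline testing.
    """
    response = get(url, timeout=30)
    # Raise on HTTP errors instead of silently saving an error page.
    response.raise_for_status()
    dest = Path(dest)
    dest.write_bytes(response.content)
    return dest


# Example call (performs a real HTTP request):
# download_file(IMAGE_PREDICTIONS_URL, "image_predictions.tsv")
```

Writing `response.content` (bytes) rather than `response.text` keeps the file byte-for-byte identical to what the server sent.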

### 3. Twitter API

We queried the Twitter API to obtain each tweet's retweet count and favorite ("like") count at minimum, plus any additional data we found interesting, and saved the results as `tweet_json.txt`. We accessed the API with Tweepy and removed our generated keys from the code before sharing it.
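
One way to keep keys out of shared code is to read them from environment variables. The sketch below is an assumption about how this could be done, not a record of our setup: the variable names and both helper functions are hypothetical, and `build_api` assumes a Tweepy version that provides `OAuthHandler` and the `wait_on_rate_limit` option.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four OAuth values, failing loudly if any is missing."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError(f"missing credentials: {', '.join(missing)}")
    return {name: env[name] for name in CREDENTIAL_NAMES}


def build_api(creds):
    """Create a Tweepy API client from the loaded credentials."""
    # Imported here so load_credentials works even without Tweepy installed.
    import tweepy

    auth = tweepy.OAuthHandler(
        creds["TWITTER_CONSUMER_KEY"], creds["TWITTER_CONSUMER_SECRET"]
    )
    auth.set_access_token(
        creds["TWITTER_ACCESS_TOKEN"], creds["TWITTER_ACCESS_SECRET"]
    )
    # wait_on_rate_limit makes Tweepy pause when the rate limit is reached.
    return tweepy.API(auth, wait_on_rate_limit=True)
```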

We created a loop to download the tweets, timestamping our requests and tracking them with a counter. We also added exception handling so that a failed request did not stop the loop. The exceptions are stored in a dictionary called `dict_keys` for future reference.
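
The loop described above can be sketched like this. It is a simplified illustration under stated assumptions: `fetch_tweets` is a hypothetical helper, `api` is assumed to behave like a `tweepy.API` object (its `get_status` returns an object whose `_json` attribute holds the raw tweet), and the broad `except Exception` stands in for Tweepy's own error class, whose name differs between versions.

```python
import json
import time


def fetch_tweets(api, tweet_ids, out_path="tweet_json.txt"):
    """Write one JSON object per line for every tweet that can be fetched.

    Returns `dict_keys`, a mapping of tweet ID to the exception raised
    for that tweet.
    """
    dict_keys = {}
    start = time.time()
    with open(out_path, "w", encoding="utf-8") as outfile:
        for count, tweet_id in enumerate(tweet_ids, start=1):
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as error:  # assumption: catch-all for API errors
                dict_keys[tweet_id] = error
                continue
            json.dump(status._json, outfile)
            outfile.write("\n")
            elapsed = time.time() - start
            print(f"{count}: {tweet_id} ({elapsed:.1f}s elapsed)")
    return dict_keys
```

Storing one JSON object per line means the file can later be read back line by line with `json.loads` and turned into a dataframe.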

## Assessing Data

We performed both visual and programmatic assessments on the `twitter_archive_en`, `img_pred` and `api_data` dataframes and found these issues:
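
A programmatic assessment typically combines a few pandas calls. The sketch below uses a tiny made-up stand-in for `twitter_archive_en`, so the values are illustrative only, and the lowercase-name check is a heuristic we assume here rather than a rule from the dataset.

```python
import pandas as pd

# Tiny stand-in for twitter_archive_en; the values are made up.
twitter_archive_en = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [None, None, 10.0, None],
    "name": ["Charlie", "None", "a", "the"],
})

# Datatypes and non-null counts for every column.
twitter_archive_en.info()

# Number of missing values per column.
missing_per_column = twitter_archive_en.isna().sum()

# Number of repeated tweet IDs.
duplicate_ids = twitter_archive_en["tweet_id"].duplicated().sum()

# Heuristic: entries in `name` that are all lowercase are likely
# ordinary words rather than dog names.
lowercase_names = twitter_archive_en.loc[
    twitter_archive_en["name"].str.islower(), "name"
].value_counts()

print(missing_per_column, duplicate_ids, lowercase_names, sep="\n")
```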

### 1. Issues with `twitter_archive_en`

**Quality Issues:**

1. Missing data such as `NaN` in the `in_reply_to_status_id` column
2. Missing data such as `NaN` in the `in_reply_to_user_id` column
3. Missing data such as `NaN` in the `retweeted_status_id` column
4. Missing data such as `NaN` in the `retweeted_status_user_id` column
5. Missing data such as `NaN` in the `retweeted_status_timestamp` column
6. Missing URLs in the `expanded_urls` column, used for images
7. Missing names such as 'None' in the `name` column
8. Incorrect names such as 'a' or 'the' in the `name` column
9. Incorrect datatype `int64` in the `tweet_id` column
10. Dog stages stored in individual columns with repeated and missing values

**Tidiness Issues:**

1. Duplicate data such as `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` in the `source` column

### 2. Issues with `img_pred`

**Quality Issues:**

1. Incorrect datatype `int64` in the `tweet_id` column
2. `2330` entries vs `2356` in `twitter_archive_en` means that some posts will not have predictions when we join the tables

**Tidiness Issues:**

1. Columns `p1`, `p2` and `p3` all contain similar types of data for dog breed predictions

### 3. Issues with `api_data`

**Quality Issues:**

1. Incorrect datatype `int64` in the `tweet_id` column
2. `2075` entries vs `2356` in `twitter_archive_en` means that some posts will not have images when we join the tables

**Tidiness Issues:**

1. This dataframe is not joined with the other tweet data

## Cleaning Data

After we had assessed the data, we began to clean it. Before making any changes, we made a copy of each dataframe to preserve the originals.
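
Copying before cleaning can be illustrated with a short sketch. The dataframe contents here are made up; only the names `twitter_archive_en` and `archive_clean` come from this report.

```python
import pandas as pd

# Made-up stand-in for the original dataframe.
twitter_archive_en = pd.DataFrame({"tweet_id": [1, 2], "name": ["Charlie", "None"]})

# .copy() is a deep copy by default, so edits to the copy
# leave the original untouched.
archive_clean = twitter_archive_en.copy()
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)

print(twitter_archive_en["tweet_id"].tolist())
print(archive_clean["tweet_id"].tolist())
```

Plain assignment (`archive_clean = twitter_archive_en`) would only create a second name for the same object, so changes would affect both.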

First, we dropped the columns we did not find useful using `.drop`. We also marked null values as `NaN`. The missing and incorrect names in the `name` column were fixed by searching for entries that were not real names, such as 'None', 'a' or 'the'.
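
These two steps might look like the sketch below. The sample data is made up, the choice of column to drop is only illustrative, and treating every all-lowercase entry as "not a name" is an assumed heuristic rather than the exact rule we applied.

```python
import numpy as np
import pandas as pd

# Made-up sample of archive_clean.
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, np.nan, 5.0, np.nan],
    "name": ["Charlie", "None", "a", "the"],
})

# Drop a column that is not needed (illustrative choice).
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])

# Flag the placeholder 'None' and all-lowercase words as missing names.
not_a_name = (archive_clean["name"] == "None") | archive_clean["name"].str.islower()
archive_clean.loc[not_a_name, "name"] = np.nan

print(archive_clean)
```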

Next, we changed the `int64` datatypes to string (`object`) datatypes and merged the dog stages into a single column to clean up the dataframe. In the `img_pred_clean` dataframe, we removed posts without images by using `.isin` to compare its `tweet_id` values against the other dataframes and dropping the rows that had no match.
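
The datatype change and the stage merge can be sketched as follows. The stage column names (`doggo`, `pupper`), the `'None'` placeholder, and the new `dog_stage` column name are assumptions made for this example; the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample; stage columns use the string 'None' for "no stage".
archive_clean = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "doggo": ["doggo", "None", "doggo"],
    "pupper": ["None", "None", "pupper"],
})
stage_columns = ["doggo", "pupper"]

# IDs are labels, not quantities, so store them as strings.
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)

# Join the non-empty stage labels in each row; rows with no stage
# end up as missing values.
stages = archive_clean[stage_columns].replace("None", "")
archive_clean["dog_stage"] = stages.apply(
    lambda row: ", ".join(label for label in row if label), axis=1
).replace("", np.nan)
archive_clean = archive_clean.drop(columns=stage_columns)

print(archive_clean)
```

Joining the labels keeps rows that carry more than one stage (for example `doggo, pupper`) instead of silently picking one.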

Next, we matched the `tweet_id` entries of `archive_clean`, `api_data_clean` and `img_pred_clean` using `.isin`, so that all three dataframes contain the same tweets.
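
One way to do this matching is to compute the IDs shared by all three dataframes and filter each one with `.isin`. The sketch below uses made-up IDs, and the intermediate `common_ids` set is a hypothetical name introduced for the example.

```python
import pandas as pd

# Made-up IDs; each dataframe is missing some tweets the others have.
archive_clean = pd.DataFrame({"tweet_id": ["1", "2", "3", "4"]})
api_data_clean = pd.DataFrame({"tweet_id": ["1", "2", "3"]})
img_pred_clean = pd.DataFrame({"tweet_id": ["2", "3", "4", "5"]})

# IDs present in all three dataframes.
common_ids = (
    set(archive_clean["tweet_id"])
    & set(api_data_clean["tweet_id"])
    & set(img_pred_clean["tweet_id"])
)

# Keep only the rows whose tweet_id appears in every dataframe.
archive_clean = archive_clean[archive_clean["tweet_id"].isin(common_ids)]
api_data_clean = api_data_clean[api_data_clean["tweet_id"].isin(common_ids)]
img_pred_clean = img_pred_clean[img_pred_clean["tweet_id"].isin(common_ids)]

print(len(archive_clean), len(api_data_clean), len(img_pred_clean))
```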

Finally, because columns `p1`, `p2` and `p3` all contain similar types of data for dog breed predictions, we renamed the columns and used `pd.wide_to_long()` to reshape the `img_pred_clean` dataframe so that similar data is stacked in the same columns.
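
The rename-then-reshape step can be sketched as below. The confidence columns (`p1_conf` and so on), the new stub names `breed` and `conf`, and the `prediction_rank` column are assumptions made for this example, and the sample predictions are made up. `pd.wide_to_long()` expects each column to be a stub name followed by a suffix, which is why the rename comes first.

```python
import pandas as pd

# Made-up sample with three predictions per tweet.
img_pred_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "p1": ["pug", "beagle"], "p1_conf": [0.90, 0.60],
    "p2": ["boxer", "basset"], "p2_conf": [0.05, 0.30],
    "p3": ["bulldog", "foxhound"], "p3_conf": [0.01, 0.05],
})

# Rename p1 -> breed_1, p1_conf -> conf_1, and so on, so that every
# column is "<stub>_<rank>".
renamed = img_pred_clean.rename(columns={
    f"p{n}{suffix}": f"{stub}_{n}"
    for n in (1, 2, 3)
    for suffix, stub in (("", "breed"), ("_conf", "conf"))
})

# One row per (tweet_id, prediction_rank) instead of one row per tweet.
long_pred = pd.wide_to_long(
    renamed,
    stubnames=["breed", "conf"],
    i="tweet_id",
    j="prediction_rank",
    sep="_",
).reset_index()

print(long_pred.sort_values(["tweet_id", "prediction_rank"]))
```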

To complete our data wrangling, we used `pd.merge()` to join the `api_data_clean` dataframe to the right of `archive_clean`.
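
A merge of this shape might look like the sketch below. The sample rows are made up, and the choice of `how="left"` (keep every row of `archive_clean`) is an assumption, since the join type is not stated in this report.

```python
import pandas as pd

# Made-up samples sharing the tweet_id key.
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "text": ["This is Charlie.", "Meet Luna."],
})
api_data_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "retweet_count": [5, 7],
    "favorite_count": [10, 20],
})

# api_data_clean columns are appended to the right of archive_clean.
archive_clean = pd.merge(archive_clean, api_data_clean, on="tweet_id", how="left")

print(archive_clean)
```

Both `tweet_id` columns must share a datatype for the keys to match, which is one reason for converting them to strings beforehand.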

Finishing touches included removing shortlinks and ratings from the `text` column in `archive_clean`. We used `re` to remove the shortlinks first, then repeated a similar process to remove the ratings. Finally, we saved the cleaned dataframes as `.csv` files.
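
The text clean-up can be sketched with two regular expressions. Both patterns, the `strip_text` helper, and the output filename in the final comment are assumptions made for illustration; the sample tweet is made up.

```python
import re

import pandas as pd

# Made-up sample tweet.
archive_clean = pd.DataFrame({
    "text": ["This is Charlie. He is a good boy. 13/10 https://t.co/abc123"],
})

# Assumed patterns: t.co shortlinks and ratings such as 13/10 or 9.5/10.
SHORTLINK = re.compile(r"https?://t\.co/\w+")
RATING = re.compile(r"\d+(?:\.\d+)?/\d+")


def strip_text(text):
    """Remove shortlinks, then ratings, then tidy leftover whitespace."""
    text = SHORTLINK.sub("", text)
    text = RATING.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


archive_clean["text"] = archive_clean["text"].apply(strip_text)
print(archive_clean["text"].iloc[0])

# Saving the result (hypothetical filename):
# archive_clean.to_csv("twitter_archive_master.csv", index=False)
```

Passing `index=False` to `to_csv` avoids writing the dataframe index as an extra unnamed column.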