# Wrangle Report

## Table of Contents

<ul>
<li><a href="#intro">1. Introduction</a></li>
<li><a href="#data-gathering">2. Data Gathering</a></li>
<li><a href="#assessing-data">3. Assessing Data</a></li>
<li><a href="#cleaning-data">4. Cleaning Data</a></li>
</ul>

<a id='intro'></a>

## 1. Introduction

This report describes the data wrangling process that was used to gather, assess and clean WeRateDogs data.

<a id='data-gathering'></a>

## 2. Data Gathering


#### The WeRateDogs Twitter archive
We downloaded this file, `twitter_archive_enhanced.csv`, manually. We then uploaded it and read the data into a pandas DataFrame.

#### The tweet image predictions
This file (`image_predictions.tsv`) contains, for each tweet, what is present in the tweet's image according to a neural network. It is hosted on Udacity's servers and was downloaded programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
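
A minimal sketch of such a programmatic download, assuming the Requests library is installed. The helper name `download_file`, its optional `session` parameter, and the local file name in the usage comment are illustrative and not taken from the original notebook.

```python
from pathlib import Path

import requests

URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, destination, session=None):
    """Download `url` and write the response body to `destination`."""
    # `session` (hypothetical parameter) may be a requests.Session or any
    # object with a compatible get() method; it defaults to the module itself.
    client = session if session is not None else requests
    response = client.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
    path = Path(destination)
    path.write_bytes(response.content)
    return path


# Usage (requires network access):
# download_file(URL, "image_predictions.tsv")
```

Writing `response.content` as bytes keeps the file identical to what the server sent, so it can later be read with a tab separator.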

#### Additional data from the Twitter API
We gathered each tweet's retweet count and favorite ("like") count. Using the tweet IDs in the WeRateDogs Twitter archive, we queried the Twitter API for each tweet's JSON data using Python's Tweepy library and stored each tweet's entire set of JSON data in a file called `tweet_json.txt`.

Each tweet's JSON data was written to its own line. We then read this .txt file line by line into a pandas DataFrame containing the tweet ID, retweet count, and favorite count.
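
The write-then-read step can be sketched as follows. The Tweepy query itself is omitted because it needs API credentials. The sample tweets are made up, the helper names are hypothetical, and the sketch assumes each stored JSON object carries the `id_str`, `retweet_count`, and `favorite_count` fields.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-ins for the JSON payloads returned by the Twitter API.
sample_tweets = [
    {"id_str": "1001", "retweet_count": 12, "favorite_count": 40, "lang": "en"},
    {"id_str": "1002", "retweet_count": 3, "favorite_count": 9, "lang": "en"},
]

FIELDS = ["id_str", "retweet_count", "favorite_count"]


def write_json_lines(tweets, path):
    """Write each tweet's JSON data to its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def read_tweet_counts(path):
    """Read the file line by line, keeping only the fields of interest."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({field: tweet[field] for field in FIELDS})
    return pd.DataFrame(rows, columns=FIELDS)


# Round trip through a temporary file so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "tweet_json.txt"
    write_json_lines(sample_tweets, json_path)
    df_json = read_tweet_counts(json_path)
```

Storing one JSON object per line means a single malformed record affects only its own line, and the file can be processed without loading everything at once.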



<a id='assessing-data'></a>

## 3. Assessing Data

Using both visual and programmatic assessment, we identified the following issues:

### Quality Issues
- Unwanted records that represent retweets or replies.
- Entries in the `name` column that are not dog names, such as 'a', 'o', 'actually', 'all', 'an', 'by', 'getting', 'his', 'incredibly', 'infuriating', 'just', 'life', 'light', 'mad', 'my', 'not', 'officially', 'old', 'one', 'quite', 'space', 'such', 'the', 'this', 'unacceptable', 'very'.
- In the archive dataset, `tweet_id` has the wrong datatype (int64).
- `timestamp` has the wrong datatype (object).
- Some records have missing values in the `expanded_urls` column.
- The types of dogs in columns `p1`, `p2`, and `p3` are formatted inconsistently: some are lowercase, others are title case.
- In the image predictions dataset, `tweet_id` has the wrong datatype (int64).
- The image predictions dataset has 2075 records while the archive dataset has 2356, so 281 records are missing.
- In the additional JSON dataset, `id_str` has the wrong datatype (int64).
- The additional JSON dataset has 2325 records, so 31 records are missing.
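
Several of these checks can be done programmatically. The sketch below uses small made-up frames in place of the real datasets. Flagging all-lowercase names is a heuristic assumed here for illustration, since the report does not say how the invalid names were found.

```python
import pandas as pd

# Tiny made-up frames standing in for the archive and image-prediction data.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "name": ["Charlie", "a", "the", "Bella"],
        "expanded_urls": [
            "https://example.com/1",
            None,
            "https://example.com/3",
            "https://example.com/4",
        ],
    }
)
predictions = pd.DataFrame({"tweet_id": [1, 3], "p1": ["golden_retriever", "Pug"]})

# Names that are entirely lowercase are likely ordinary words, not dog names.
suspect_names = sorted(archive.loc[archive["name"].str.islower(), "name"].unique())

# Number of records with no expanded URL.
missing_urls = int(archive["expanded_urls"].isna().sum())

# Archive tweets that have no matching image prediction.
missing_predictions = int((~archive["tweet_id"].isin(predictions["tweet_id"])).sum())

# Predictions whose label is not already all lowercase.
mixed_case = int((predictions["p1"] != predictions["p1"].str.lower()).sum())
```

On the real data, `archive.info()` and `archive.dtypes` would additionally show the column datatypes and non-null counts.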

### Tidiness Issues
- The columns `doggo`, `floofer`, `pupper` and `puppo` are unnecessary. Dog names should be stored in a single column, `name`.
- Unnecessary columns: `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `in_reply_to_status_id`, `in_reply_to_user_id`, `rating_denominator`, and `source`.
- Instead of keeping all three predictions in `p1`, `p2`, and `p3`, we only need the one with the highest probability.
- Records in which the image is predicted to be a non-dog (i.e., `p1_dog`, `p2_dog` and `p3_dog` are all False) are not needed.
- The `img_num` column is not needed.
- The `id_str` column should be renamed to `tweet_id` so that we can merge the three datasets.
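
To illustrate the prediction-related items above, the following sketch collapses three ranked predictions into one breed column. It assumes `p1`, `p2`, and `p3` are ordered from most to least confident, and it picks the first one flagged as a dog; the report does not spell out the exact rule. The data, the `best_dog_breed` helper, and the `breed` column name are made up.

```python
import pandas as pd

# Made-up rows mimicking the image-prediction columns.
predictions = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "p1": ["Golden_retriever", "paper_towel", "seat_belt"],
        "p1_dog": [True, False, False],
        "p2": ["Labrador_retriever", "Pug", "desk"],
        "p2_dog": [True, True, False],
        "p3": ["kuvasz", "toy_poodle", "mouse"],
        "p3_dog": [True, True, False],
    }
)


def best_dog_breed(row):
    """Return the highest-ranked prediction flagged as a dog, or None."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"].lower()  # normalise breed names to lowercase
    return None


tidy = predictions.assign(breed=predictions.apply(best_dog_breed, axis=1))
# Rows where no prediction is a dog have breed None and are dropped here.
tidy = tidy.dropna(subset=["breed"])[["tweet_id", "breed"]]
```

In this sample, the third row is removed because none of its three predictions is a dog.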

<a id='cleaning-data'></a>

## 4. Cleaning Data

After making copies of the three datasets, we resolved the issues mentioned in the previous section using the following methods:
- Dropped unwanted columns.
- Dropped records that represent replies or retweets.
- Dropped records that have unusual dog names.
- Converted column datatypes to the appropriate types using the `df.apply` and `pd.to_datetime` methods.
- Dropped records with missing values in the `expanded_urls` column.
- Converted all dog breeds to lowercase.
- Stored dog names in the `name` column and dropped the following columns: `doggo`, `floofer`, `pupper` and `puppo`.
- Kept only the dog breed prediction with the highest confidence.
- Dropped records that the neural network predictions identify as non-dogs.
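
A few of the archive-related steps above can be sketched on made-up data as follows. Converting `tweet_id` to a string is an assumption, since the report does not name the target datatype, and `INVALID_NAMES` is only a subset of the names listed in Section 3.

```python
import pandas as pd

# Made-up rows mimicking the relevant archive columns.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4, 5, 6],
        "timestamp": ["2017-08-01 16:23:56 +0000"] * 6,
        "retweeted_status_id": [None, 99.0, None, None, None, None],
        "in_reply_to_status_id": [None, None, None, 77.0, None, None],
        "expanded_urls": ["u1", "u2", "u3", "u4", None, "u6"],
        "name": ["Charlie", "Bella", "a", "Max", "Luna", "Daisy"],
    }
)

INVALID_NAMES = {"a", "an", "the", "such", "quite"}

clean = archive.copy()  # work on a copy, leaving the original untouched

# Keep original tweets only, with a usable name and a non-missing URL.
keep = (
    clean["retweeted_status_id"].isna()
    & clean["in_reply_to_status_id"].isna()
    & ~clean["name"].isin(INVALID_NAMES)
    & clean["expanded_urls"].notna()
)
clean = clean.loc[keep].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)

# Fix datatypes: identifiers as strings, timestamps as datetimes.
clean = clean.assign(
    tweet_id=clean["tweet_id"].apply(str),
    timestamp=pd.to_datetime(clean["timestamp"]),
)
```

Only tweets 1 and 6 survive in this sample: tweet 2 is a retweet, tweet 3 has an invalid name, tweet 4 is a reply, and tweet 5 has no URL.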

Finally, we combined the three datasets into a single dataset, `df_master`, and saved it.
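
A minimal sketch of the combining step, using made-up cleaned frames. The inner joins and the output file name in the final comment are assumptions, since the report does not specify the join type or the saved file's name.

```python
import pandas as pd

# Made-up cleaned frames standing in for the three datasets.
archive_clean = pd.DataFrame(
    {"tweet_id": ["1", "2", "3"], "rating_numerator": [12, 13, 11]}
)
predictions_clean = pd.DataFrame(
    {"tweet_id": ["1", "2"], "breed": ["pug", "golden_retriever"]}
)
json_clean = pd.DataFrame(
    {
        "id_str": ["1", "2", "3"],
        "retweet_count": [5, 8, 2],
        "favorite_count": [20, 30, 10],
    }
)

# Rename id_str so that all three frames share the same key column.
json_clean = json_clean.rename(columns={"id_str": "tweet_id"})

# Inner joins keep only tweets that appear in all three datasets.
df_master = archive_clean.merge(predictions_clean, on="tweet_id", how="inner").merge(
    json_clean, on="tweet_id", how="inner"
)

# Serialise to CSV text; passing a path instead would write a file, e.g.
# df_master.to_csv("twitter_archive_master.csv", index=False)
csv_text = df_master.to_csv(index=False)
```

The key columns must share a datatype before merging, which is one reason the identifier columns are converted during cleaning.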