# Wrangle report
### By Eduardo Rossel

This report summarizes the data wrangling of the WeRateDogs tweet archive. The account describes itself as "Your Only Source For Professional Dog Ratings". Each tweet includes a brief description of the dog, its name, its "stage", a rating and an image of the dog. We describe the main steps taken to wrangle the data so that it can yield interesting insights about this archive. The full detail of the process can be found in _wrangle_act.ipynb_.

## Gathering

We are provided with an "enhanced" Twitter archive for the WeRateDogs account. It comes as a .csv file, from which we create our first dataframe, called **archive**.

We use Twitter's API to access the complete information for each tweet and retrieve additional fields such as retweet count, favorite count and hashtags used. To do so, we use the _Tweepy_ library and store each tweet's entire set of JSON data in a file called tweet_json.txt, with each tweet's JSON data written on its own line. We later read this .txt file to create a dataframe called **api_archive**.
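As a minimal sketch of the reading step, assuming tweet_json.txt holds one JSON object per line and that each object follows the standard tweet layout (`id`, `created_at`, `retweet_count`, `favorite_count`, `entities.hashtags`), the file could be turned into a dataframe like this. The function name `read_tweet_json` and the selected columns are illustrative, not necessarily what the notebook uses.

```python
import json

import pandas as pd


def read_tweet_json(path="tweet_json.txt"):
    """Build a dataframe from a file holding one tweet's JSON per line."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # Assumed field names; missing fields become None or an empty list.
            hashtags = tweet.get("entities", {}).get("hashtags", [])
            rows.append({
                "tweet_id": tweet["id"],
                "created_at": tweet.get("created_at"),
                "retweet_count": tweet.get("retweet_count"),
                "favorite_count": tweet.get("favorite_count"),
                "hashtags": [tag["text"] for tag in hashtags],
            })
    return pd.DataFrame(rows)
```

Reading line by line keeps memory use low and lets a single malformed line be located easily if `json.loads` raises.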

Finally, we are provided with a table of image predictions, available at a given URL. We download the file with the _requests_ library and then read the .tsv file into our third dataframe, **image_predictions**.
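The download and load step might look roughly like the sketch below. The URL is not reproduced in this report, so it is left as a parameter; the helper names `download_file` and `read_predictions` are hypothetical.

```python
import io

import pandas as pd
import requests


def download_file(url, dest_path):
    """Download `url` and save the raw bytes to `dest_path`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(dest_path, "wb") as fh:
        fh.write(response.content)
    return dest_path


def read_predictions(path_or_buffer):
    """Read the tab-separated predictions file into a dataframe."""
    return pd.read_csv(path_or_buffer, sep="\t")
```

A call such as `read_predictions(download_file(url, "image_predictions.tsv"))` would then produce the dataframe, where `url` is the address provided with the project and the local file name is an assumption.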

## Assessing

We assessed three dataframes: _archive_, _api_archive_ and _image_predictions_. For the programmatic assessment we used several pandas methods, such as info, describe, duplicated and value_counts, among others. The following issues were identified:

**Quality**
* *Archive's dataframe:* 
    * Maximum rating denominator is 170
    * Maximum rating numerator is 1776
    * Rating numerator may not be accurate
    * Dog stages have missing values and may not be accurate
    * Dataset contains replies and retweets
    * Timestamp is of object type
* *Image predictions:*
    * Prediction names have inconsistent formats
* *Api_archive's dataframe:*
    * Tweet creation date is of object type; it must be datetime

**Tidiness**
* *Archive's dataframe:*
    * All types of dog should be in a single column.
* *Image Predictions:*
    * p, p_conf and p_dog should each be in a single column.
* *Api_archive's dataframe:*
    * Data from api_archive and archive should be in the same dataframe.
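
The programmatic checks behind the issues listed above can be sketched as follows. The toy dataframe and its column names (`rating_numerator`, `rating_denominator`, and so on) are assumptions standing in for the real archive; only the pandas calls themselves mirror the methods named in this section.

```python
import pandas as pd

# Toy stand-in for the archive dataframe; values are illustrative only.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "timestamp": ["2017-01-01 00:00:00 +0000"] * 4,
    "rating_numerator": [12, 1776, 13, 13],
    "rating_denominator": [10, 10, 170, 170],
    "name": ["Rex", "None", "a", "a"],
})

archive.info()                       # dtypes and non-null counts per column
summary = archive.describe()         # min/max rows expose unusual ratings
n_duplicates = archive["tweet_id"].duplicated().sum()
denominators = archive["rating_denominator"].value_counts()

print(summary.loc["max", ["rating_numerator", "rating_denominator"]])
print(n_duplicates)
print(denominators)
```

Checks like these only surface candidates; whether an extreme value is an error still has to be confirmed by reading the tweet text.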
    
    
## Cleaning

Each of these issues was cleaned; the full detail can be found in the _wrangle_act_ Jupyter notebook. The cleaning steps can be briefly described as follows:

### Fixing Tidiness 

In this case two tables were kept: archive and our image predictions dataset. Even though we could join all tables into a single dataset, these two tables hold different types of observational units: the first refers to tweets and their content, and the second to predictions of the dogs' breed made by a neural network. The actions performed were:
* Join the archive and api_archive dataframes.
* Use the melt function to create a column with the type of dog.
* Transform predicted_breed, p_conf and p_dog into one column each by using the concat method.
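
The three tidiness steps above can be sketched on toy data as follows. The stage columns (`doggo`, `pupper`), the numbered prediction columns (`p1`, `p1_conf`, `p1_dog`, ...), the `"None"` placeholder and the output names `dog_type` and `prediction_rank` are assumptions made for illustration; the notebook may differ in the details.

```python
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["a", "b", "c"],
    "doggo": ["doggo", "None", "None"],
    "pupper": ["None", "pupper", "None"],
})
api_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})

# 1. Join the two tweet-level tables on the tweet id.
tweets = archive.merge(api_archive, on="tweet_id", how="left")

# 2. Melt the stage columns into a single dog_type column.
stage_cols = ["doggo", "pupper"]
melted = tweets.melt(
    id_vars=[c for c in tweets.columns if c not in stage_cols],
    value_vars=stage_cols,
    var_name="stage_column",
    value_name="dog_type",
)
# Keep one row per tweet, preferring a real stage over the "None" placeholder.
# Note: a tweet tagged with two stages would keep only one of them here.
melted["has_stage"] = melted["dog_type"] != "None"
tidy = (
    melted.sort_values("has_stage", ascending=False, kind="stable")
    .drop_duplicates("tweet_id")
    .drop(columns=["stage_column", "has_stage"])
    .sort_values("tweet_id")
    .reset_index(drop=True)
)

# 3. Stack the numbered prediction columns with concat.
predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "pug"],
    "p1_conf": [0.9, 0.6],
    "p1_dog": [True, True],
    "p2": ["Labrador_retriever", "bath_towel"],
    "p2_conf": [0.05, 0.3],
    "p2_dog": [True, False],
})
parts = []
for rank in (1, 2):
    cols = ["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]
    part = predictions[cols].copy()
    part.columns = ["tweet_id", "predicted_breed", "p_conf", "p_dog"]
    part["prediction_rank"] = rank
    parts.append(part)
long_predictions = pd.concat(parts, ignore_index=True)
```

After this, each tweet occupies one row in `tidy`, and each prediction occupies one row in `long_predictions`, which matches the two observational units described above.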

### Fixing Quality

* Recalculate the rating numerator for all the dogs, using a regular expression to extract each numerator, and check and enforce that the denominator is 10.
* Drop the rows where the rating numerator seems off.
* Change the data type of timestamp to datetime.
* Extract the dog stages from the text.
* Extract the names of the pets from the text.
* Delete retweets and replies from our Twitter archive.
* Change the format of the predicted breeds so that they are consistent.
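
A minimal sketch of these quality fixes on toy data is shown below. The regular expressions, the column names (`retweeted_status_id`, `in_reply_to_status_id`, `dog_type`), the rule of dropping tweets without an `N/10` rating, and the title-case breed format are all assumptions for illustration, not necessarily the exact rules used in the notebook.

```python
import pandas as pd

tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-01-01 10:00:00 +0000"] * 4,
    "text": [
        "This is Rex. He is a good pupper. 13/10",
        "Meet Luna. 9.5/10 would pet",
        "RT @someone: This is Bo. 12/10",
        "No rating here, just a doggo",
    ],
    "retweeted_status_id": [None, None, 123.0, None],
    "in_reply_to_status_id": [None, None, None, None],
})

# Delete retweets and replies: keep rows where both id columns are empty.
is_original = (
    tweets["retweeted_status_id"].isna()
    & tweets["in_reply_to_status_id"].isna()
)
clean = tweets[is_original].copy()

# Re-extract the numerator (decimals allowed) from "<numerator>/10".
numerator = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/10(?!\d)")[0]
clean["rating_numerator"] = pd.to_numeric(numerator)
clean["rating_denominator"] = 10
# Drop tweets where no rating out of 10 could be found.
clean = clean.dropna(subset=["rating_numerator"])

# Convert the timestamp strings to datetime.
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Extract the dog stage and the pet name from the text.
stage_pattern = r"\b(doggo|floofer|pupper|puppo)\b"
clean["dog_type"] = clean["text"].str.lower().str.extract(stage_pattern)[0]
clean["name"] = clean["text"].str.extract(r"(?:This is|Meet) ([A-Z][a-z]+)")[0]


def format_breed(breeds):
    """Normalize breed names, e.g. 'Labrador_retriever' -> 'Labrador Retriever'."""
    return breeds.str.replace("_", " ", regex=False).str.title()


breeds = format_breed(pd.Series(["golden_retriever", "Labrador_retriever"]))
```

Filtering retweets and replies first avoids extracting ratings from text that belongs to another account's tweet.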

The final results of our data wrangling efforts were stored in two files: **twitter_archive_master.csv** and **image_prediction_clean.csv**.