# Wrangle Report

# Wrangling Data @dog_rates aka. WeRateDogs

# Introduction
Real-world data rarely comes clean. Using Python and its libraries, we will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. We will document our wrangling efforts in a Jupyter Notebook, and showcase them through analyses and visualizations using Python and its libraries.

The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs, Brent". WeRateDogs has over 4 million followers and has received international media coverage.

**Software that will be used**  
Since we work in a local environment, the following libraries should be installed:
* pandas
* NumPy
* requests
* tweepy
* json (part of the Python standard library, so no separate installation is needed)

**Context**  
Goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. 

**The Data**  
* Enhanced Twitter Archive
>The WeRateDogs Twitter archive contains basic tweet data for all 2356 of their tweets. One column the archive does contain is each tweet's text, from which the Udacity team extracted the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced".

* Additional Data via the Twitter API
>Retweet count and favorite count are two notable column omissions from the archive. Fortunately, this additional data can be gathered by anyone from Twitter's API. Using this API, we can extract the needed data to make our dataset more complete.

* Image Predictions File
> The Udacity team has run every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The result is a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction.


**Project Details** 

* Data wrangling, which consists of:
> Gathering data    
> Assessing data  
> Cleaning data
* Storing, analyzing, and visualizing your wrangled data
* Reporting on:
> 1) your data wrangling efforts and   
> 2) your data analyses and visualizations

## Gather Data

* The WeRateDogs Twitter archive.
> The archive data is downloaded manually from the Udacity lesson's page, then loaded into a DataFrame using the pandas library.
* The tweet image predictions.
> This data is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv.
* Each tweet's retweet count and favorite ("like") count at minimum, and any additional data that may be interesting.
> For this data we will use the Twitter API and the Tweepy library. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.
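
The store-then-read step can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: `fetch_tweet_json` is a hypothetical callable standing in for the Tweepy query (it takes a tweet ID and returns that tweet's JSON as a dict), the helper names `save_tweets` and `load_tweets` are invented here, and the JSON keys `id`, `retweet_count`, and `favorite_count` are assumed to be present in each tweet's JSON.

```python
import json

import pandas as pd


def save_tweets(tweet_ids, fetch_tweet_json, path="tweet_json.txt"):
    """Write each tweet's JSON to its own line; return the IDs that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as fh:
        for tweet_id in tweet_ids:
            try:
                # fetch_tweet_json is a hypothetical wrapper around the API call.
                tweet = fetch_tweet_json(tweet_id)
            except Exception:  # e.g. the tweet no longer exists
                failed.append(tweet_id)
                continue
            fh.write(json.dumps(tweet) + "\n")
    return failed


def load_tweets(path="tweet_json.txt"):
    """Read the file line by line into a DataFrame with the three key fields."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tweet = json.loads(line)
            rows.append({
                "id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows, columns=["id", "retweet_count", "favorite_count"])
```

Keeping the list of failed IDs makes it easy to see which archived tweets could not be retrieved.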

## Assessing Data
In this step, we assess the three datasets both visually and programmatically for quality and tidiness issues. We rely heavily on pandas methods, e.g.:
* `.describe()` to see the summary statistics
* `.info()` to see the data type of each column and detect missing data
* `.duplicated()` to see if there are any duplicated rows
* we also use some loops to inspect the unusual ratings in the archive dataframe
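
A short sketch of these programmatic checks on a toy dataframe. The sample rows are made up for illustration and are not taken from the archive; `value_counts()` is used here as one possible way to spot unusual ratings.

```python
import pandas as pd

# Toy data standing in for the archive dataframe (values are illustrative only).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "rating_numerator": [12, 13, 13, 75],
    "rating_denominator": [10, 10, 10, 10],
    "name": ["Rex", "None", "None", "Logan"],
})

archive.info()               # dtypes and non-null counts per column
print(archive.describe())    # summary statistics for the numeric columns

# duplicated() flags every row that repeats an earlier row.
n_duplicates = archive.duplicated().sum()
print(n_duplicates)

# Counting each numerator makes unusual values stand out.
print(archive["rating_numerator"].value_counts())
```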


**Key Points**  
Key points in the data wrangling process for this project:

* We want original ratings (no retweets) that have images. 
* Cleaning includes merging individual pieces of data according to the rules of tidy data.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.

From the assessment process above, the results are divided into two kinds: quality issues and tidiness issues.

### Quality
Quality: issues with content. Low-quality data is also known as dirty data.

#### `archive` dataframe:
* contains retweets; only original tweets should be kept
* some columns are not useful for analysis, i.e. in_reply_to_status_id, in_reply_to_user_id, source, expanded_urls, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
* tweet_id has int64 dtype
* timestamp has object dtype rather than datetime
* wrong numerator (decimal value or false detection) at index 516, 1712, 1202, and 763
* wrong denominator at index 2335, 342, and 516
* 'None' values instead of NaN in the name and dog stage columns
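
One way to surface rating problems like these programmatically is to search the tweet text with a regular expression and to check the denominators. The sketch below assumes the archive has `text`, `rating_numerator`, and `rating_denominator` columns; the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows; the real archive text differs.
archive = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10 would pet",
        "Meet Sam. She smiles 24/7. 11/10",
        "Good boy. 12/10",
    ],
    "rating_numerator": [5, 24, 12],
    "rating_denominator": [10, 7, 10],
})

# Decimal ratings such as 13.5/10, which an integer-only pattern would truncate.
has_decimal = archive["text"].str.contains(r"\d+\.\d+/\d+")

# Denominators other than 10 may signal a false detection (e.g. "24/7").
odd_denominator = archive["rating_denominator"] != 10

suspicious = archive[has_decimal | odd_denominator]
print(suspicious[["text", "rating_numerator", "rating_denominator"]])
```

Rows flagged this way still need a visual check, since a denominator other than 10 is not necessarily an error.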


#### `image` dataframe:
* duplicated images
* tweet_id has int64 dtype
* some columns are not useful for analysis

#### `tweepy` dataframe:
* contains non-original tweets
* the id column name does not match the tweet_id column in the other dataframes
* id has int64 dtype
* some columns are not useful for analysis, i.e. id_str, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, lang, quoted_status_id, and quoted_status_id_str

### Tidiness
Tidiness: issues with a structure that prevents easy analysis. Untidy data is also known as messy data.

#### `archive` dataframe
* the dog stage is spread across four columns (doggo, floofer, pupper, and puppo) instead of being stored in a single column

#### `image` dataframe:
* the predictions are spread across nine columns (p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog), although only one prediction per image is needed

#### `tweepy` dataframe:
* no tidiness issues identified

#### All dataframes
* the three dataframes should be combined into one master dataframe

## Cleaning Data

The programmatic data cleaning process:

* Define
* Code
* Test

As always, we copy each dataframe before starting any cleaning, so that we can refer back to the originals.
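
A small sketch of why the copy matters, using a toy dataframe with invented values. The name `archive_clean` is an assumption made for this example.

```python
import pandas as pd

archive = pd.DataFrame({"tweet_id": [1, 2], "name": ["Rex", "None"]})

# .copy() creates an independent dataframe; plain assignment would only
# create a second name for the same object.
archive_clean = archive.copy()
archive_clean.loc[1, "name"] = "Max"

print(archive["name"].tolist())        # the original is left untouched
print(archive_clean["name"].tolist())  # only the copy was modified
```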

### Archive Dataframe

For this dataframe, we will:
* remove retweet rows by filtering
* remove the columns that are not useful for analysis using the `.drop()` method
* change the tweet_id datatype to 'object' using the `.astype()` method
* change the timestamp datatype to datetime using the `.astype()` method
* using some loops, fix
    * the numerators at index 516, 1712, 1202, and 763
    * the denominators at index 2335, 342, and 516
* change 'None' to NaN in the name and dog stage columns using NumPy
* create a single dog_stage column, then delete the four original stage columns
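
Most of these steps can be sketched on a toy dataframe as shown below. The sample values are invented, only two of the unused columns are included, and the rating fixes are left out because they depend on the individual tweets. Two choices here are assumptions rather than the notebook's method: the timestamp is converted with `pd.to_datetime` instead of `.astype()`, and tweets with more than one stage get a comma-joined value.

```python
import numpy as np
import pandas as pd

# Invented sample rows with a subset of the archive's columns.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "retweeted_status_id": [np.nan, 8.0e17, np.nan],
    "source": ["web", "web", "web"],
    "name": ["Phineas", "None", "None"],
    "doggo": ["None", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "pupper"],
    "puppo": ["None", "None", "None"],
})
clean = archive.copy()

# Keep original tweets only: retweets have a non-null retweeted_status_id.
clean = clean[clean["retweeted_status_id"].isna()]

# Drop columns that are not needed for analysis.
clean = clean.drop(columns=["retweeted_status_id", "source"])

# Identifiers become strings; timestamps become datetimes.
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Combine the four stage columns into one, joining multiple stages with a comma.
stages = ["doggo", "floofer", "pupper", "puppo"]
clean["dog_stage"] = clean[stages].apply(
    lambda row: ",".join(s for s in row if s != "None"), axis=1
)
clean = clean.drop(columns=stages)

# Replace the 'None' placeholder (and empty stage strings) with NaN.
clean["name"] = clean["name"].replace("None", np.nan)
clean["dog_stage"] = clean["dog_stage"].replace("", np.nan)

print(clean)
```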

### Image Dataframe

For this dataframe, we will:
* remove duplicated image rows
* change tweet_id to the object datatype
* remove all columns that are not useful for analysis
* select one prediction from p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog
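
One possible way to carry out these steps is sketched below. Several things are assumptions for illustration: the `jpg_url` column name, treating rows with the same image URL as duplicates, p1 being the most confident prediction and p3 the least, and the rule of keeping the first prediction flagged as a dog. The helper `best_dog_prediction` and the sample values are invented.

```python
import pandas as pd

# Invented sample rows; the second row repeats the first image.
image = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "jpg_url": ["a.jpg", "a.jpg", "b.jpg"],
    "p1": ["golden_retriever", "golden_retriever", "paper_towel"],
    "p1_conf": [0.9, 0.9, 0.6], "p1_dog": [True, True, False],
    "p2": ["Labrador_retriever", "Labrador_retriever", "pug"],
    "p2_conf": [0.05, 0.05, 0.3], "p2_dog": [True, True, True],
    "p3": ["kuvasz", "kuvasz", "toy_poodle"],
    "p3_conf": [0.01, 0.01, 0.05], "p3_dog": [True, True, True],
})

# Keep the first row for each image URL, then store the ID as a string.
clean = image.drop_duplicates(subset="jpg_url").copy()
clean["tweet_id"] = clean["tweet_id"].astype(str)


def best_dog_prediction(row):
    """Return the breed and confidence of the first prediction flagged as a dog."""
    for i in (1, 2, 3):  # assumed order: p1 most confident, p3 least
        if row[f"p{i}_dog"]:
            return pd.Series({"breed": row[f"p{i}"], "confidence": row[f"p{i}_conf"]})
    return pd.Series({"breed": None, "confidence": None})


predictions = clean.apply(best_dog_prediction, axis=1)

# Keep only the identifying columns plus the selected prediction.
clean = pd.concat([clean[["tweet_id", "jpg_url"]], predictions], axis=1)
print(clean)
```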

### Tweepy Dataframe

For this dataframe, we will:
* remove non-original tweets
* change the id column name to tweet_id, then change the datatype to 'object'
* remove the columns that are not useful for analysis, i.e. id_str, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, lang, quoted_status_id, and quoted_status_id_str
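
The renaming, type change, and column removal can be sketched as follows, on a toy dataframe with invented values. The variable name `tweepy_df` is an assumption, and removing non-original tweets is not shown because it depends on which field marks them in the gathered JSON.

```python
import pandas as pd

# Invented sample rows with only a few of the gathered columns.
tweepy_df = pd.DataFrame({
    "id": [1, 2],
    "id_str": ["1", "2"],
    "lang": ["en", "en"],
    "retweet_count": [100, 250],
    "favorite_count": [900, 1200],
})

clean = tweepy_df.copy()

# Match the key name used by the other dataframes and store it as a string.
clean = clean.rename(columns={"id": "tweet_id"})
clean["tweet_id"] = clean["tweet_id"].astype(str)

# errors="ignore" skips any listed column that is absent from this toy frame.
unused = ["id_str", "in_reply_to_status_id", "in_reply_to_status_id_str",
          "in_reply_to_user_id", "in_reply_to_user_id_str", "lang",
          "quoted_status_id", "quoted_status_id_str"]
clean = clean.drop(columns=unused, errors="ignore")

print(clean.dtypes)
```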

### Join and Store All Three Dataframes
All three dataframes will be merged on tweet_id as the key, using inner joins, so that the final dataframe contains only the tweets present in every source.
Then, after a final check, we will save the dataframe to a CSV file named 'twitter_archive_master.csv'.
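
A minimal sketch of the merge and store step, using toy dataframes with invented values. The names `archive_clean`, `image_clean`, and `tweepy_clean` are assumptions for this example.

```python
import pandas as pd

archive_clean = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                              "rating_numerator": [12, 13, 11]})
image_clean = pd.DataFrame({"tweet_id": ["1", "2"],
                            "breed": ["pug", "kuvasz"]})
tweepy_clean = pd.DataFrame({"tweet_id": ["2", "3"],
                             "retweet_count": [250, 40]})

# Inner joins keep only the tweets present in all three dataframes.
master = (
    archive_clean
    .merge(image_clean, on="tweet_id", how="inner")
    .merge(tweepy_clean, on="tweet_id", how="inner")
)

# index=False avoids writing the row index as an extra column.
master.to_csv("twitter_archive_master.csv", index=False)
print(master)
```

When reading the file back with `pd.read_csv`, passing `dtype={"tweet_id": str}` keeps the IDs as strings instead of letting pandas infer integers.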