# Wrangle Report  


## Introduction
Twitter user `@dog_rates`, also known as __WeRateDogs__, rates people's dogs and adds a humorous comment about each dog. The dataset wrangled in this project is the tweet archive of that account.


This report briefly describes my data wrangling efforts. The tasks for project WeRateDogs Twitter are as follows:
- Gathering Data
- Assessing Data
- Cleaning Data
- - -
## Gathering Data

My wrangling efforts for the WeRateDogs Twitter project included gathering data from the
following sources:

- The WeRateDogs __Twitter archive__. The `twitter_archive_enhanced.csv` file was provided by Udacity to students. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. 
- The __tweet image predictions__, i.e. what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. The `image_predictions.tsv` file is hosted by Udacity and had to be downloaded programmatically using the Requests library and a URL supplied for the project.
- The __Twitter API__ and __Python's Tweepy library__, used to gather additional data for each tweet ID in the WeRateDogs Twitter archive. I stored the data returned by the API in `tweet_json.txt`, then read the file line by line and kept a selection of attributes (`favorite_count`, `retweet_count`, `followers_count`, `favourites_count`, `friends_count` and `date_time`) in a DataFrame for further analysis.
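
The line-by-line read of `tweet_json.txt` can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: it assumes each line holds one JSON object shaped like a Twitter API status (with the follower counts nested under `user` and the timestamp in `created_at`), and the helper name `read_tweet_json` and the two sample lines are hypothetical.

```python
import io
import json

import pandas as pd

# Two made-up lines standing in for tweet_json.txt (one JSON object per line).
sample_file = io.StringIO(
    '{"id": 1, "created_at": "Tue Aug 01 16:23:56 +0000 2017", '
    '"favorite_count": 100, "retweet_count": 20, '
    '"user": {"followers_count": 5000, "favourites_count": 7, "friends_count": 3}}\n'
    '{"id": 2, "created_at": "Mon Jul 31 00:18:03 +0000 2017", '
    '"favorite_count": 250, "retweet_count": 40, '
    '"user": {"followers_count": 5000, "favourites_count": 7, "friends_count": 3}}\n'
)


def read_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON lines (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        user = tweet.get("user", {})
        rows.append({
            "tweet_id": str(tweet["id"]),
            "favorite_count": tweet.get("favorite_count"),
            "retweet_count": tweet.get("retweet_count"),
            "followers_count": user.get("followers_count"),
            "favourites_count": user.get("favourites_count"),
            "friends_count": user.get("friends_count"),
            "date_time": tweet.get("created_at"),  # assumed source of 'date_time'
        })
    return pd.DataFrame(rows)


tweet_df = read_tweet_json(sample_file)
```

With the real file, the same helper could be called as `read_tweet_json(open("tweet_json.txt", encoding="utf-8"))`.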

- - - 
## Assessing Data
Once all the data had been gathered successfully, I assessed it both visually and programmatically for *__quality__* and *__tidiness__* issues.

### Quality Issues
The quality issues I found in the data are listed below, grouped by file.

- __`Twitter Archive Enhanced Data:`__ The `source` column holds raw HTML anchor tags, which makes it hard to read.
- __`Twitter Archive Enhanced Data:`__ These attributes are stored as `float` but should be `integers/strings`:
    - in_reply_to_status_id
    - in_reply_to_user_id
    - retweeted_status_id
    - retweeted_status_user_id
- __`Twitter Archive Enhanced Data:`__ These columns should be `datetime` instead of `object` (string):
    - retweeted_status_timestamp
    - timestamp
- __`Twitter Archive Enhanced Data:`__ These columns could be changed to `string`, because no arithmetic is performed on them:
    - tweet_id
    - in_reply_to_status_id   
    - in_reply_to_user_id 
    - retweeted_status_id 
    - retweeted_status_user_id
- __`Twitter Archive Enhanced Data:`__ The columns `rating_numerator` and `rating_denominator` contain invalid values.
- __`Twitter Archive Enhanced Data:`__ The `name` column contains invalid names such as `a` and `an`.
- __`Twitter Archive Enhanced Data:`__ Null values should be converted to the `None` type.
- __`Image Predictions Data:`__ `tweet_id` should be `object` instead of `integer`, as no calculation is needed.
- __`Image Predictions Data:`__ Records are missing from the _`image_predictions`_ dataset: it has __2075__ rows, whereas the _`twitter_archive_enhanced`_ dataset has __2356__ records.
- __`Image Predictions Data:`__ Some tweets appear under two different `tweet_id` values because one of them is actually a retweet.
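
A few of these checks can be reproduced programmatically. The sketch below uses a small made-up stand-in for the archive rather than the real file, and the rule "an all-lowercase name is invalid" is an assumed heuristic for illustration, not necessarily the rule used in the project.

```python
import pandas as pd

# Made-up stand-in for a few columns of twitter_archive_enhanced.csv.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [float("nan"), 10.0, float("nan"), float("nan")],
    "name": ["Charlie", "a", "an", "None"],
    "rating_numerator": [13, 12, 5, 24],
    "rating_denominator": [10, 10, 10, 7],
})

# ID columns containing missing values are read as float64 by pandas.
float_id_cols = [
    col for col in archive.columns
    if col.endswith("_id") and archive[col].dtype == "float64"
]

# Assumed heuristic: real dog names are capitalised, so lowercase "names" are suspect.
invalid_names = archive.loc[archive["name"].str.islower(), "name"].tolist()

# Ratings are normally out of 10, so other denominators deserve a closer look.
odd_denominators = archive[archive["rating_denominator"] != 10]
```

On the real data, `archive.info()` and `archive["name"].value_counts()` give a quick overview of the same issues.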


### Tidiness Issues
The tidiness issues I found in the data are listed below, grouped by file.

- __`Twitter Archive Enhanced Data:`__ The dog stage is spread across four separate columns and should be a single column:
    - 'doggo'
    - 'floofer'
    - 'pupper'
    - 'puppo'
-  __`Image Predictions Data:`__  _`Image Predictions Data`_ should be joined to _`Twitter Archive Enhanced Data`_
-  __`Tweets (API) JSON Data:`__ Merge _`Tweets (API) JSON Data`_ with the _`Twitter Archive Enhanced Data`_
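
Both tidiness fixes can be sketched on toy data. The example assumes the stage columns hold either the stage name or the string `"None"`, that `tweet_id` has already been converted to a string in all three tables, and that inner joins are wanted; the sample rows and the helper name `combine_stages` are hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up stand-ins for the three datasets.
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
images = pd.DataFrame({"tweet_id": ["1", "2"], "jpg_url": ["a.jpg", "b.jpg"]})
api = pd.DataFrame({"tweet_id": ["1", "2", "3"], "retweet_count": [20, 40, 5]})


def combine_stages(row):
    """Collapse the four stage columns of one row into a single value."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


archive["dog_stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Inner joins (an assumption) keep only tweets present in all three tables.
merged = (
    archive.merge(images, on="tweet_id", how="inner")
           .merge(api, on="tweet_id", how="inner")
)
```

Tweets tagged with more than one stage end up with a combined value such as `doggo, pupper`, which can then be reviewed by hand.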


- - - 
## Cleaning Data
After the assessment, I cleaned the data following the *__Define, Code and Test__* pattern for each of the issues identified in the assessment section. Some of the cleaning steps are listed below.

- Create a copy of each original dataframe.
- Merge the copied version of archive, images, and tweet_json dataframes
- Correct the dog types.
- Create one column for the various dog types: doggo, floofer, pupper, puppo 
- Delete retweets.
- Remove columns no longer needed: 
    - 'retweeted_status_id'
    - 'retweeted_status_user_id'
    - 'retweeted_status_timestamp'
    - 'date_time'
    - 'friends_count'
    - 'img_num'
    - 'p1'
    - 'p1_conf'
    - 'p1_dog'
    - 'p2'
    - 'p2_conf'
    - 'p2_dog'
    - 'p3'
    - 'p3_conf'
    - 'p3_dog'
    - 'in_reply_to_status_id'
    - 'in_reply_to_user_id'
    - 'favourites_count'
- Make the `source` column easy to read.
- Fix the invalid values in the `rating_numerator` and `rating_denominator` columns.
- Clean decimal values in rating numerators.
- Convert the null values to the `None` type.
- Correct the data types:
    - 'tweet_id' as type str
    - 'timestamp' as type datetime
    - 'source' as type category
    - 'rating_numerator' as type float
    - 'rating_denominator' as type float
    - 'favorite_count' as type int
    - 'retweet_count' as type int
    - 'followers_count' as type int
    - 'dog_stage' as type category
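
Several of these steps are sketched below on toy data. The sample rows are made up, and the regular expressions are one plausible way to extract the readable source and the decimal ratings; they are assumptions for illustration, not necessarily the patterns used in the project.

```python
import pandas as pd

# Made-up rows shaped like the archive; the third row is a retweet.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    ],
    "text": ["This is Bella. 13.5/10", "Meet Max. 12/10", "RT: Meet Max. 12/10"],
    "retweeted_status_id": [float("nan"), float("nan"), 2.0],
    "rating_numerator": [5, 12, 12],
    "rating_denominator": [10, 10, 10],
})

clean = df.copy()  # work on a copy of the original dataframe

# Delete retweets, then drop the column that is no longer needed.
clean = clean[clean["retweeted_status_id"].isna()].drop(columns=["retweeted_status_id"])

# Keep only the readable text between the anchor tags.
clean["source"] = clean["source"].str.extract(r">([^<]+)<", expand=False)

# Re-extract ratings from the text so decimal numerators such as 13.5 survive.
ratings = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean["rating_numerator"] = ratings[0].astype(float)
clean["rating_denominator"] = ratings[1].astype(float)

# Correct the data types.
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
clean["source"] = clean["source"].astype("category")
```

In the Define, Code and Test pattern, each step would be followed by a check such as `clean.info()` or an assertion on the affected column.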

---
## Conclusion

Data wrangling is a core skill that anyone who handles data should be familiar with.
I used the Python programming language and several of its packages. Python is used by many data scientists (including at Google, Amazon and Facebook) and has several advantages over tools such as Excel or Sheets:

- It can handle a wide variety of data, both structured (for example from ERP or SQL databases) and unstructured (such as JSON or NoSQL databases).
- It copes well with big data (much better than Excel or Sheets).
- Several packages help with gathering data, whether by scraping the web, by collecting data through APIs (Tweepy for Twitter), or by communicating with SQL databases.
- It is easy to document every step and, if needed, re-run it, which leaves a clear audit trail.
- Analyses can be re-run automatically at regular intervals.
- Data can be handled, assessed, cleaned and visualized programmatically.