# wrangle_report
#### by Yves Deutschmann

## Introduction

This report gives a brief description of my efforts wrangling the WeRateDogs dataset for Udacity's Nanodegree program. The report is structured as follows:

1. Gathering the data
2. Assessment of the dataset
3. Cleaning of the dataset

## 1. Gathering the data  

For this project we have 3 sources of data:
- the WeRateDogs Twitter archive
- image predictions for each tweet
- retweet count and favorite count for each tweet

### WeRateDogs Twitter archive:
The Twitter archive is provided by Udacity on their website and was downloaded manually. The file was stored as `twitter-archive-enhanced.csv` and read with the pandas `pd.read_csv()` function into the `archive` DataFrame.  

### image predictions:
The image predictions file contains a table with predictions of what is shown in the image of each tweet in the Twitter archive. These predictions are the output of a neural network. Udacity hosts this file on their servers, and we downloaded it programmatically with the `requests` library. We then read the `image-predictions.tsv` file with the same `read_csv()` function as above, this time specifying the separator as `\t` (tab-separated), which yields the `images` DataFrame.
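
The download and the tab-separated read can be sketched roughly as follows. This is a minimal illustration, not the exact project code: the `download_file` helper, its `url` parameter, and the sample column names are assumptions, and a small in-memory string stands in for the downloaded file so the parsing step runs without network access.

```python
import io

import pandas as pd


def download_file(url: str, path: str) -> None:
    """Download `url` and save the response body to `path` (hypothetical helper)."""
    import requests  # imported here so the parsing demo below runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as file:
        file.write(response.content)


# Toy stand-in for the downloaded file; the column names are assumptions.
sample_tsv = "tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.95\n2\tpug\t0.80\n"

# A tab-separated file needs sep="\t" (backslash-t, not "/t").
images = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(images.shape)
```

In the real workflow, `download_file` would be called once with the URL provided by Udacity, and `pd.read_csv("image-predictions.tsv", sep="\t")` would replace the in-memory sample.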

### retweet and favorite count:
This information was acquired through the official Twitter API. We set up a Twitter developer account and configured the Python `tweepy` library with its authentication credentials. For every `tweet_id` in the Twitter archive we requested the original tweet information in JSON format and wrote each JSON object to its own line of a new text file named `tweet_json.txt`. This text file was then processed with the `pd.read_csv()` method to form the `tweets` DataFrame.
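
The one-JSON-object-per-line layout can be sketched as below. This is an illustration under stated assumptions: `write_json_lines`, `read_counts`, and `fake_fetch` are hypothetical names, the API call is replaced by a stand-in that returns made-up counts, and the file is parsed line by line with the standard `json` module, which is a swapped-in technique rather than the pandas call mentioned above.

```python
import json
import os
import tempfile

import pandas as pd


def write_json_lines(tweet_ids, fetch, path):
    """Write one JSON object per line for each tweet id."""
    with open(path, "w", encoding="utf-8") as file:
        for tweet_id in tweet_ids:
            # With tweepy, `fetch` could wrap something like
            # api.get_status(tweet_id, tweet_mode="extended")._json
            file.write(json.dumps(fetch(tweet_id)) + "\n")


def read_counts(path):
    """Collect id, retweet count and favorite count from each line."""
    rows = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows)


def fake_fetch(tweet_id):
    """Hypothetical stand-in for the API call, returning made-up counts."""
    return {"id": tweet_id, "retweet_count": 10, "favorite_count": 25}


# Write to a temporary directory so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
write_json_lines([101, 102], fake_fetch, path)
tweets = read_counts(path)
print(tweets)
```

Passing the fetch step in as a function keeps the file handling independent of the API client, so the same code works with a real `tweepy` call or with test data.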

### Gathering - Conclusion:
We now have three DataFrames that are ready for assessment in the next step.

## 2. Assessment of the dataset

Now that we have gathered all of our data, we assess our DataFrames for quality and tidiness issues. Udacity provides the following **Key Points** to keep in mind:

>- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
>- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
>- Cleaning includes merging individual pieces of data according to the rules of tidy data.
>- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
>- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

The following issues were found in our tables:

`archive`:
#### Tidiness:
- the dog stage should be a single variable (*doggo, floofer, pupper, puppo*)
- `numerator` and `denominator` should be one variable
- the retweet columns are no longer needed once the retweeted rows are removed

#### Quality: 
- `tweet_id`s are `int` instead of `string`
- Erroneous datatype for the timestamp
- Erroneous datatype for the dog stage
- some ratings don't have a related image
- data includes rows from retweets
- wrong ratings
- wrong and missing names
- data in `source` is HTML code

`images`:
#### Tidiness:
- table should be part of `archive`

#### Quality:
- `tweet_id` is `int` instead of `string`

`tweets`:
#### Tidiness:
- should be part of `archive`
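
Some of the findings above can be checked programmatically. The following is a minimal sketch on made-up rows; the column names `timestamp`, `retweeted_status_id`, and `name` are assumptions about the archive layout, and treating all-lowercase names as suspicious is only an illustrative heuristic.

```python
import pandas as pd

# Toy rows standing in for the archive; all values are made up.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "retweeted_status_id": [None, 5.0, None],
    "name": ["Phineas", "a", None],
})

# Datatype checks: ids stored as integers, timestamps stored as plain text.
ids_are_int = pd.api.types.is_integer_dtype(archive["tweet_id"])
timestamps_are_datetime = pd.api.types.is_datetime64_any_dtype(archive["timestamp"])

# Rows with a retweeted_status_id are retweets.
n_retweets = int(archive["retweeted_status_id"].notna().sum())

# Missing names, and names that are all lowercase (assumed heuristic).
n_missing_names = int(archive["name"].isna().sum())
names = archive["name"].dropna()
suspicious_names = names[names.str.islower()]

print(ids_are_int, timestamps_are_datetime, n_retweets, n_missing_names)
print(suspicious_names.tolist())
```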

## 3. Cleaning the dataset

Before manipulating our data, we made a copy of each DataFrame to avoid losing data. The next step of the wrangling process is to clean the issues found during assessment. For each issue in our findings list above, we took the following steps:
- *Define* the steps needed to tackle the issue
- Transform the described steps into *Code*
- *Test* whether the actions led to the desired result
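
One pass through this Define, Code, Test cycle might look like the sketch below. It uses made-up rows, and the column name `retweeted_status_id` is an assumption about how retweets are marked in the archive.

```python
import pandas as pd

# Toy rows standing in for the archive; values are made up.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 5.0, None],
})

# Work on a copy so the original DataFrame stays untouched.
archive_clean = archive.copy()

# Define: keep only original tweets, drop the retweet column,
# and store tweet_id as a string.

# Code:
is_original = archive_clean["retweeted_status_id"].isna()
archive_clean = archive_clean[is_original].copy()
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)

# Test:
assert "retweeted_status_id" not in archive_clean.columns
assert archive_clean["tweet_id"].tolist() == ["1", "3"]
assert len(archive) == 3  # the original is unchanged
```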

### Cleaning - Conclusion
For this project, we didn't address every quality and tidiness issue in this dataset, as this would have taken too much time. We did, however, address all Key Points mentioned above. The result is a single DataFrame that we stored in the file `twitter_archive_master.csv`. With this file, we are ready to analyze the dataset in an efficient and structured way.
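
Combining the three tables into one master file could be sketched like this. The toy frames, the `rating` and `p1` columns, and the choice of inner joins are assumptions for illustration; an inner join on `tweet_id` is one way to keep only tweets that have an image prediction.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned tables; all values are made up.
archive_clean = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating": [1.2, 1.3, 1.0]})
images_clean = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["pug", "chow"]})
tweets_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweet_count": [10, 20, 30],
    "favorite_count": [25, 50, 75],
})

# Inner joins drop tweets that are missing from any of the tables.
master = (
    archive_clean
    .merge(images_clean, on="tweet_id", how="inner")
    .merge(tweets_clean, on="tweet_id", how="inner")
)

# Save without the index; a temporary directory is used for this sketch.
path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(path, index=False)

# Read tweet_id back as a string so the ids are not converted to integers.
reloaded = pd.read_csv(path, dtype={"tweet_id": str})
print(reloaded)
```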

## Conclusion

Over the course of this project, we've seen that real-world data is rarely clean. If you receive clean data, chances are that somebody else has already taken the time to wrangle it. But the effort is worth it. With our actions, we were able to normalize the ratings to a consistent scale that can quickly be processed for aggregation and visualization. Furthermore, data wrangling helps to avoid mistakes that could lead to false assumptions and models.  
So, in conclusion, you should always wrangle your data before you start analyzing it, so that you don't have a hard time working around issues in your data later on.