# Wrangle Report on Wrangling and Analyzing WeRateDogs Data

<a id='intro'></a>
## Introduction
This project is about wrangling and analyzing the WeRateDogs Twitter data. This report briefly describes my wrangling efforts: how I gathered the data from different sources, assessed it, and finally cleaned it so that it can be used for analysis. The actual work described here can be found in the notebook wrangle_act.ipynb.

<a id='gathering'></a>
## Gathering Data

This project requires three different kinds of data, and each kind was gathered with a different method.

### Gathering the WeRateDogs Twitter archive by reading a CSV file

The first data file (twitter_archive_enhanced.csv) was manually downloaded and saved in the same folder as the Jupyter notebook.

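A minimal sketch of this step, assuming pandas is used to read the file. The helper name `load_archive` and the two sample rows are made up for illustration, so the snippet runs without the real archive.

```python
import io

import pandas as pd


def load_archive(source):
    """Read the archive CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# Hypothetical two-row stand-in for the real file (not actual archive data)
sample_csv = io.StringIO(
    "tweet_id,timestamp,text,name\n"
    "1001,2017-08-01 16:23:56 +0000,This is Rex,Rex\n"
    "1002,2017-08-01 00:17:27 +0000,This is Bella,Bella\n"
)

df_twitter_archive = load_archive(sample_csv)
print(df_twitter_archive.dtypes)  # tweet_id is inferred as an integer column
```

With the real file in the notebook folder, the call would be `load_archive("twitter_archive_enhanced.csv")`.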
### Gathering image data by downloading a file using the Requests library

Afterward, I downloaded the corresponding image prediction data from a Udacity server.
The Requests library was used to download the file programmatically from the following URL: <br>
https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

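The download can be sketched as follows. The helper names `filename_from_url` and `download_file`, the target folder, and the 30-second timeout are assumptions for illustration and are not taken from the notebook.

```python
import os

import requests

URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return url.rstrip("/").split("/")[-1]


def download_file(url, folder=".", timeout=30):
    """Download `url` into `folder` and return the local file path."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page
    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

Calling `download_file(URL)` would save `image-predictions.tsv` next to the notebook. Since the file is tab-separated, it can then be read with `pd.read_csv(path, sep="\t")`.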
### Gathering JSON data from the Twitter API using the Tweepy library

The last part of this section deals with API handling: I needed to download the JSON data of the tweets in the WeRateDogs archive. Python's Tweepy library was used to download the entire set of JSON data and store it in a file named tweet_json.txt.

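One possible shape of this step is sketched below. It assumes `api` is an already authenticated `tweepy.API` object and that each tweet is written as one JSON object per line. The function names are hypothetical, and a broad `except` is used only because the name of Tweepy's error class differs between library versions.

```python
import json


def save_tweet_json(api, tweet_ids, out_path="tweet_json.txt"):
    """Write one JSON object per line for every tweet that can be fetched.

    `api` is assumed to be an authenticated tweepy.API instance.
    Returns the list of tweet IDs that could not be fetched.
    """
    failed = []
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception:  # e.g. deleted tweets; error class varies by Tweepy version
                failed.append(tweet_id)
                continue
            json.dump(status._json, f)
            f.write("\n")
    return failed


def load_tweet_counts(path="tweet_json.txt"):
    """Read the line-delimited JSON file back into a list of small dicts."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            rows.append(
                {
                    "tweet_id": tweet["id_str"],
                    "retweet_count": tweet["retweet_count"],
                    "favorite_count": tweet["favorite_count"],
                }
            )
    return rows
```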
<a id='assessing'></a>
## Assessing Data
After gathering the data, I assessed it to identify quality and tidiness issues.
The following tables summarize the issues found: <br>

### Summarizing quality issues

| Issue Number | Table name | Issue description | Comment |
|----|----|----|----|
| Q1 | df_twitter_archive | 181 rows are retweets and should be deleted, because only original tweets are relevant |  |
| Q2 | df_twitter_archive | 78 rows are replies and should be deleted, because only original tweets are relevant |  |
| Q3 | df_twitter_archive | tweet_id should be a string, not an int |  |
| Q4 | df_twitter_archive | timestamp should be datetime, not a string |  |
| Q5 | df_twitter_archive | missing or incorrect dog names -> 'None', 'a', 'an', 'the', 'my' |  |
| Q6 | df_twitter_archive | Nearly 84% of the dogs have no stage (all 4 stages are "None") | Will not be cleaned, because it would remove ~ 84% of the rows |
| Q7 | df_image_pred | tweet_id should be a string, not an int |  |
| Q8 | df_image_pred | columns should be renamed for an easier understanding of the content |  |
| Q9 | df_image_pred | 324 entries predict no dog for all three predictions. |  |
| Q10 | df_twitter_add_data | retweet_count should be an int, not a string |  |
| Q11 | df_twitter_add_data | favorite_count should be an int, not a string |  |

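Several of these checks can be expressed in a few lines of pandas. The sketch below runs on a hypothetical four-row frame that only mimics the archive's column layout, so the numbers it prints are not the counts from the table above.

```python
import pandas as pd

# Hypothetical toy rows mimicking the archive's columns (not real data)
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "retweeted_status_id": [None, 11.0, None, None],
        "in_reply_to_status_id": [None, None, 22.0, None],
        "name": ["Rex", "a", "None", "Bella"],
        "doggo": ["None", "None", "None", "doggo"],
        "floofer": ["None"] * 4,
        "pupper": ["None"] * 4,
        "puppo": ["None"] * 4,
    }
)
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Q1/Q2: a non-null status ID marks a retweet or a reply
n_retweets = int(df["retweeted_status_id"].notna().sum())
n_replies = int(df["in_reply_to_status_id"].notna().sum())

# Q5: names that are placeholders rather than real dog names
bad_names = df["name"].isin(["None", "a", "an", "the", "my"])

# Q6: share of rows in which all four stage columns hold the string 'None'
share_no_stage = float(df[stage_cols].eq("None").all(axis=1).mean())

print(n_retweets, n_replies, int(bad_names.sum()), share_no_stage)
```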
### Summarizing tidiness issues

| Issue Number | Table name | Issue description | Comment |
|----|----|----|----|
| T1 | df_twitter_archive | dog stages should be merged into one column | - |
| T2 | df_twitter_archive | retweeted columns (retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp) can be deleted, because they are not relevant for original tweets | - |
| T3 | df_twitter_archive | reply columns (in_reply_to_status_id, in_reply_to_user_id) can be deleted, because they are not relevant for original tweets | - |
| T4 | general | all 3 data frames should be merged into one | - |

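As an illustration of T1, the four stage columns can be collapsed into a single column. This is a sketch on hypothetical rows. The column name `dog_stage` and the choice to join multiple stages with a comma are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical toy rows; 'None' is the string placeholder used in the archive
df = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "doggo": ["None", "doggo", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join all stages present in a row; return None when no stage is set."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


df["dog_stage"] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)  # the four original columns are now redundant
print(df)
```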

<a id='cleaning'></a>
## Cleaning Data
After identifying the quality and tidiness issues, I cleaned the data, resolving the issues one by one. For this I relied on my Python knowledge and on what I have learned so far in this Nanodegree course. For some tricky parts I needed help from Stack Overflow or the pandas documentation. All references used are linked in the reference section of the working document wrangle_act.ipynb. <br>

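The sketch below shows how a few of the listed issues (Q1 to Q4, Q7, Q10, Q11, T2 to T4) might be resolved with pandas. The three small frames are hypothetical stand-ins that carry only a subset of the real columns, and the inner join used for the final merge is an assumption.

```python
import pandas as pd

# Hypothetical toy frames with a subset of the real columns
df_twitter_archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
        ],
        "retweeted_status_id": [None, 99.0, None],
        "in_reply_to_status_id": [None, None, None],
    }
)
df_image_pred = pd.DataFrame({"tweet_id": [1, 3], "p1": ["golden_retriever", "pug"]})
df_twitter_add_data = pd.DataFrame(
    {"tweet_id": ["1", "3"], "retweet_count": ["10", "20"], "favorite_count": ["100", "200"]}
)

# Q1/Q2: keep only original tweets (neither retweet nor reply)
is_original = (
    df_twitter_archive["retweeted_status_id"].isna()
    & df_twitter_archive["in_reply_to_status_id"].isna()
)
archive = df_twitter_archive[is_original].copy()
# T2/T3: the status ID columns are empty now and can be dropped
archive = archive.drop(columns=["retweeted_status_id", "in_reply_to_status_id"])
# Q3/Q4: fix the data types
archive["tweet_id"] = archive["tweet_id"].astype(str)
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Q7: tweet_id as string in the image predictions
images = df_image_pred.copy()
images["tweet_id"] = images["tweet_id"].astype(str)

# Q10/Q11: counts as integers
counts = df_twitter_add_data.copy()
count_cols = ["retweet_count", "favorite_count"]
counts[count_cols] = counts[count_cols].astype(int)

# T4: merge all three frames on tweet_id (inner join assumed)
df_master = archive.merge(images, on="tweet_id").merge(counts, on="tweet_id")
print(df_master)
```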

After all these steps, the data was ready for analysis. The results of the analysis, including visualizations, are shown in the report act_report.html.

<a id='references'></a>
## References
Markdown Tables: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables <br>