# WeRateDogs Twitter Data Wrangling

This project gathers data from [WeRateDogs (@dog_rates)](https://twitter.com/dog_rates), then assesses and cleans it for further exploratory data analysis and for training a recurrent neural network to identify dog breeds.

The project has the following requirements:
1. Keep only the original ratings from WeRateDogs, not retweets
2. Assess and clean at least 8 quality issues and 2 tidiness issues
3. Merge the data tables as part of meeting the tidiness target
4. Note that the numerator is sometimes larger than the denominator in this dataset; that is how WeRateDogs rates dogs
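
Requirements 1 and 4 can be sketched on a toy table. The column names follow the archive loaded later in this notebook, but the rows are made up, and the `above_denominator` flag is a hypothetical helper column rather than part of the dataset:

```python
import pandas as pd

# Toy rows using the archive's column names; the values are made up.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [float("nan"), 42.0, float("nan")],
    "rating_numerator": [13, 12, 1776],
    "rating_denominator": [10, 10, 10],
})

# Original tweets have no retweeted_status_id, so keep only those rows.
originals = archive[archive["retweeted_status_id"].isna()].copy()

# Ratings above the denominator are expected, so flag them instead of dropping them.
originals["above_denominator"] = (
    originals["rating_numerator"] > originals["rating_denominator"]
)
```

Flagging rather than filtering keeps the unusual ratings available for later analysis.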

The project proceeds as follows:
1. Data wrangling: gathering, assessing, and cleaning the data programmatically
2. Store, analyze, and visualize the processed data
3. Summarize the first two steps in a report

In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json

## Data Gathering

Three files need to be gathered:
1. Download the WeRateDogs Twitter archive from the [GitHub repo](https://github.com/udacity/new-dand-advanced-china/tree/master/数据清洗/WeRateDogs项目)
2. Programmatically download the image recognition results for dog breeds from the same repo
3. Gather `retweet_count` and `favorite_count` via the `Tweepy` API and save them to a txt file in JSON format, one record per line, with at least the `tweet ID`, `retweet_count`, and `favorite_count` fields

After gathering, import each file into a separate `pandas.DataFrame`.

In [2]:
##### 1. Import the existing dataset, which includes tweet IDs #####
twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [3]:
##### 2. Programmatically download the image recognition results ######
image_predictions = pd.read_csv('https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv',
                               sep='\t')
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [None]:
##### 3. Info from Twitter API via tweepy #####
### Create tweepy instance for Twitter API ###
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
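
The keys in the cell above are left blank. One way to avoid pasting secrets into a notebook is to read them from environment variables. This is a minimal sketch: the variable names and the `load_credentials` helper are hypothetical, not part of Tweepy or of this project.

```python
import os

# Hypothetical environment variable names; any consistent naming scheme works.
KEY_NAMES = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]


def load_credentials(env=None):
    """Return the four keys from env, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in KEY_NAMES}


# Demo with a plain dict standing in for the real environment.
demo_env = {name: "placeholder" for name in KEY_NAMES}
creds = load_credentials(demo_env)
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` exactly as in the cell above.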

In [None]:
### Loop over all tweet IDs, download each tweet, and store the results in a list ###
tweet_json = []
tweet_error = {}
for tweet_id in twitter_archive_enhanced.tweet_id:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweet_json.append(tweet._json)
    except Exception as e:
        tweet_error[tweet_id] = e

In [38]:
### Retry the tweet IDs that failed the first time ###
tweet_error_2nd = {}
for tweet_id in tweet_error.keys():
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweet_json.append(tweet._json)
    except Exception as e:
        tweet_error_2nd[tweet_id] = e

Rate limit reached. Sleeping for: 343


In [41]:
tweet_error_2nd

{704761120771465216: tweepy.error.TweepError("Failed to send request: HTTPSConnectionPool(host='api.twitter.com', port=443): Read timed out. (read timeout=60)"),
 775096608509886464: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 802247111496568832: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 827228250799742977: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 837012587749474308: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 842892208864923648: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 845459076796616705: tweepy.error.TweepError([{'code': 144,
                           'message': 'No status found with that ID.'}]),
 861769973181624320: tweepy.error.

In [40]:
### From the output above, 11 tweet IDs have no corresponding info ###
### Retry the remaining one, which failed with a timeout ###
tweet = api.get_status('704761120771465216', tweet_mode='extended')
tweet_json.append(tweet._json)
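
The download, retry, and manual re-download cells above follow one pattern, which could be folded into a single helper. This is a sketch under assumptions: `fetch_all` and its `max_rounds` parameter are hypothetical, and `fetch` stands in for a call such as `lambda i: api.get_status(i, tweet_mode='extended')._json`. A fake fetcher is used here so the sketch runs without network access.

```python
def fetch_all(ids, fetch, max_rounds=2):
    """Call fetch(id) for every id, retrying the failed ids up to max_rounds times."""
    results, pending, errors = [], list(ids), {}
    for _ in range(max_rounds):
        errors = {}
        for item_id in pending:
            try:
                results.append(fetch(item_id))
            except Exception as exc:  # broad on purpose, like the cells above
                errors[item_id] = exc
        pending = list(errors)
        if not pending:
            break
    return results, errors


# Fake fetcher: id 2 times out once, id 3 never exists.
calls = {}


def fake_fetch(item_id):
    calls[item_id] = calls.get(item_id, 0) + 1
    if item_id == 3:
        raise KeyError("No status found with that ID.")
    if item_id == 2 and calls[item_id] == 1:
        raise TimeoutError("Read timed out.")
    return {"id": item_id}


results, errors = fetch_all([1, 2, 3], fake_fetch)
```

After the call, `errors` holds only the ids that failed in the last round, which mirrors what `tweet_error_2nd` records.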

In [46]:
### Store result in txt file ###
with open('tweet_json_YY.txt', 'w') as output:
    for tweet in tweet_json:
        json.dump(tweet, output)
        output.write('\n')   # one JSON record per line

In [83]:
### Import tweet_json.txt to pandas.DataFrame ###
json_str = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    for i in range(2352):   # the txt file has 2352 rows, counted by visual inspection
        tweet = json_file.readline()
        json_str.append(tweet)

In [90]:
# Parse each line read from the txt file into a dict, and collect the dicts in a list to create the DataFrame #
json_list = [json.loads(line) for line in json_str]

tweet_json = pd.DataFrame(json_list)
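
As an alternative to counting rows by hand, a file with one JSON record per line can be read by iterating over it. The sketch below uses an in-memory `io.StringIO` with two made-up records in place of `tweet_json.txt`, and keeps only the three fields the project requires; the rename to `tweet_id` is an assumption made to match the other tables.

```python
import io
import json

import pandas as pd

# Two made-up records in the one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20, "full_text": "a"}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 30, "full_text": "b"}\n'
)

# Iterating over a file yields one line at a time, so no row count is needed.
records = [json.loads(line) for line in sample if line.strip()]

# Keep only the required fields and align the id column name with the archive.
counts = pd.DataFrame(records)[["id", "retweet_count", "favorite_count"]]
counts = counts.rename(columns={"id": "tweet_id"})
```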

In [93]:
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2352 non-null object
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null object
id                               2352 non-null int64
id_str                           2352 non-null object
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null object
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null obj

### Summary of Data Collection

Corresponding to the three targets:
1. The existing file was loaded from a local path via `pandas.read_csv` and stored as the `twitter_archive_enhanced` dataframe
2. The image recognition results were downloaded from a URL via `pandas.read_csv` and stored as the `image_predictions` dataframe
3. The tweet info downloaded via tweepy was stored in `tweet_json_YY.txt` to distinguish it from the provided `tweet_json.txt`. However, to make project review easier, `tweet_json.txt` was still used to generate the `tweet_json` dataframe for the rest of the project.

## Data Assessing

Identify at least 8 quality issues and 2 tidiness issues.
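
A programmatic assessment might start with a handful of counts like the ones below. The rows are made up to imitate problems one could look for; they are not findings from the real archive, and the check names are hypothetical.

```python
import pandas as pd

# Made-up rows that imitate possible problems; not taken from the real data.
archive = pd.DataFrame({
    "tweet_id": [10, 11, 11, 12],
    "rating_denominator": [10, 10, 10, 7],
    "name": ["Charlie", "a", "a", "None"],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
})

checks = {
    "duplicated_tweet_ids": int(archive["tweet_id"].duplicated().sum()),
    "denominator_not_10": int((archive["rating_denominator"] != 10).sum()),
    # An all-lowercase "name" is more likely an ordinary word than a dog's name.
    "lowercase_names": int(archive["name"].str.islower().sum()),
    # The string "None" standing in for a missing value.
    "name_is_none_string": int((archive["name"] == "None").sum()),
    # Timestamps stored as text rather than as a datetime type.
    "timestamp_is_datetime": pd.api.types.is_datetime64_any_dtype(archive["timestamp"]),
}
```

Each nonzero count (or `False` for the dtype check) points at a candidate quality issue to document before cleaning.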

## Data Cleaning

Clean at least 8 quality issues and 2 tidiness issues.
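
Two tidiness fixes can be sketched on toy data: collapsing the `doggo`, `floofer`, `pupper`, and `puppo` columns into one column, and merging the counts into the archive. The sketch assumes that absent stages are stored as the string `"None"`; the `stage` column name and the sample rows are hypothetical.

```python
import pandas as pd

# Toy tables; assumes absent stages are stored as the string "None".
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 7, 9],
    "favorite_count": [20, 30, 40],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Join whichever stage words are present in a row; rows with none become missing.
stages = archive[stage_cols].replace("None", "")
archive["stage"] = stages.apply(lambda row: "-".join(v for v in row if v), axis=1)
archive["stage"] = archive["stage"].where(archive["stage"] != "")
archive = archive.drop(columns=stage_cols)

# One row per tweet, with the engagement counts attached.
master = archive.merge(counts, on="tweet_id", how="inner")
```

Joining with a hyphen keeps tweets that mention two stages distinguishable instead of silently picking one.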

## Data Storage and Visualization

1. Store the processed data as `twitter_archive_master.csv`
2. Present at least 3 insights from the data analysis and 1 visualization
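
Storing and re-reading the processed table might look like the sketch below. The rows are made up, and a temporary directory is used so that the example does not overwrite a real `twitter_archive_master.csv`; the correlation at the end is only an illustration of the kind of question the stored data can answer.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the cleaned, merged table.
master = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [13, 12, 11],
    "favorite_count": [40, 30, 20],
})

# Write to a temporary directory to avoid overwriting a real file.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Reload and compute one simple statistic from the stored data.
reloaded = pd.read_csv(out_path)
corr = reloaded["rating_numerator"].corr(reloaded["favorite_count"])
```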

## Summary Report

1. A 300-to-600-word report for internal use describing how the whole project was done, saved as `wrangle_report.pdf`
2. A ~250-word report for external audiences, such as a blog post, saved as `act_report.pdf`