## Introduction
Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

## Project Motivation
Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

## The Data
### Enhanced Twitter Archive
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

### Additional Data via the Twitter API
Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.
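
A minimal sketch of that gathering step, under stated assumptions: `fetch_status` is a hypothetical callable (not part of any library) that wraps whichever Twitter API client you use, returns a tweet's data as a dict, and raises an exception when the tweet is unavailable. The sketch writes one JSON object per line to `tweet-json.txt` and keeps a record of the IDs that could not be fetched.

```python
import json


def gather_tweets(tweet_ids, fetch_status, out_path="tweet-json.txt"):
    """Write one JSON object per line for every tweet that can be fetched.

    fetch_status is a hypothetical callable (an assumption of this sketch):
    it takes a tweet ID and returns the tweet's data as a dict, raising an
    exception when the tweet is unavailable (for example, deleted).
    """
    failed = {}
    with open(out_path, "w", encoding="utf-8") as out_file:
        for tweet_id in tweet_ids:
            try:
                status = fetch_status(tweet_id)
            except Exception as error:  # keep going; note what failed and why
                failed[tweet_id] = str(error)
                continue
            out_file.write(json.dumps(status) + "\n")
    return failed
```

Storing one object per line means the file can later be read back line by line with `json.loads`, and the returned `failed` dict shows which tweet IDs will have no retweet or favorite counts.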

### Image Predictions File
One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs*. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). The columns are explained below:<br>
tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921<br>
p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever<br>
p1_conf is how confident the algorithm is in its #1 prediction → 95%<br>
p1_dog is whether or not the #1 prediction is a breed of dog → TRUE<br>
p2 is the algorithm's second most likely prediction → Labrador retriever<br>
p2_conf is how confident the algorithm is in its #2 prediction → 1%<br>
p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
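
One way these columns might be used is sketched below: for each image, take the highest-ranked prediction that is flagged as a dog breed. The helper name `first_dog_breed` and the `breed` column are illustrative, the sample rows are made up, and the sketch assumes `p3` and `p3_dog` follow the same pattern as the `p1` and `p2` columns described above.

```python
import pandas as pd


def first_dog_breed(row):
    """Return the highest-ranked prediction that is a dog breed, else None."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"]
    return None


# Tiny made-up sample with the same column layout as the predictions file.
sample = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel"],
    "p1_dog": [True, False],
    "p2": ["Labrador_retriever", "Chihuahua"],
    "p2_dog": [True, True],
    "p3": ["kuvasz", "toy_poodle"],
    "p3_dog": [True, True],
})
sample["breed"] = sample.apply(first_dog_breed, axis=1)
```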

## Key Points
Key points to keep in mind when data wrangling for this project:

You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.<br>
Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.<br>
Cleaning includes merging individual pieces of data according to the rules of tidy data.<br>
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.<br>
You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.<br>
*Fun fact: creating this 

### Assessing Data for this Project
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points section above) must be assessed.
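
A programmatic assessment could start from a small summary like the sketch below. The helper name `assess` and the made-up `demo` frame are illustrative only; the idea is to see, per column, the dtype, the number of missing values, and the number of distinct values, and separately to count duplicated keys.

```python
import pandas as pd


def assess(df):
    """Summarize dtype, missing values and distinct values for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Made-up frame standing in for one of the project's tables.
demo = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "name": ["Phineas", None, "a"],
})
summary = assess(demo)
duplicate_ids = demo["tweet_id"].duplicated().sum()
```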


In [None]:
#Import all packages needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import json

In [None]:
# read tweet archive enhanced
df_t_archive = pd.read_csv('twitter-archive-enhanced-2.csv')

In [None]:
# Read TSV file
image_prediction = pd.read_csv('image-predictions-3.tsv', sep='\t')
image_prediction.info()

In [299]:
# Read json file as a dataframe
tweet_json = []
with open('tweet-json.txt') as file:
    for line in file:
        tweet_json.append(json.loads(line))


df_tweet_json = pd.DataFrame(tweet_json)
df_tweet_json.head(1)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,False,False,en,,,,


In [None]:
# DataFrame with (at minimum) tweet ID, retweet count, and favorite count
df_tweet_json = df_tweet_json[['id', 'retweet_count', 'favorite_count', 'retweeted']].copy()
df_tweet_json

In [None]:
df_tweet_json['retweeted'].value_counts()

### Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.


### Quality Issues
Completeness: is any data missing?<br>
Validity: does the data make sense?<br>
Accuracy: is the data inaccurate? (Wrong data can still show up as valid.)<br>
Consistency: is the data standardized?
#### df_t_archive
1. Need to remove retweets.<br>
2. Need to delete columns that will not be used for analysis.<br>
3. Need to change timestamp to datetime, and separate it into date and hour.<br>
4. Need to standardize dog names; some are None, a, or an.<br>
5. Need to correct rating_numerator.<br>
#### image_prediction
1. Need to remove duplicate jpg_url values.<br>

#### df_tweet_json
1. Need to rename id to tweet_id.<br>


In [None]:
# 1.Need to remove retweets.
# check retweets number
df_t_archive['retweeted_status_id'].notnull().sum()

In [None]:
# remove retweets; copy so that later column assignments act on an independent DataFrame
df_t_archive = df_t_archive[df_t_archive.retweeted_status_id.isna()].copy()
df_t_archive.info()

In [None]:
# 2. Need to delete columns that will not be used for analysis.
df_t_archive = df_t_archive.drop(['retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp'], axis = 1)
df_t_archive.columns

In [None]:
# 3. Need to change timestamp to datetime, and separate it into date and hour.
df_t_archive['timestamp'] = pd.to_datetime(df_t_archive['timestamp'])
df_t_archive['timestamp_date'] = df_t_archive['timestamp'].dt.strftime('%Y%m%d')
df_t_archive['timestamp_hour'] = df_t_archive['timestamp'].dt.strftime('%H')

In [None]:
df_t_archive.sample(2)

In [None]:
# 4. Need to standardize dog names; treat None, a, and an as missing.
df_t_archive['name'] = df_t_archive['name'].replace(['None', 'a', 'an'], np.nan)
df_t_archive['name'].value_counts()
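
The `value_counts()` output can also be narrowed to suspicious entries. The sketch below assumes that real dog names are capitalized and that extraction mistakes such as "a" or "an" are entirely lowercase; that is an assumption about this dataset, not something verified here. The sample names are made up.

```python
import numpy as np
import pandas as pd

# Made-up names standing in for df_t_archive['name'].
names = pd.Series(["Phineas", "a", "an", "None", "Tilly", "the"])

# Treat the literal string 'None' as missing first.
names = names.replace("None", np.nan)

# Flag entries that are entirely lowercase; missing values compare as False.
is_lower = names.str.islower().eq(True)

# Entries worth inspecting by hand.
suspect = names[is_lower]

# Blank them out rather than guessing the real name.
cleaned = names.mask(is_lower)
```

Masking rather than replacing a fixed list catches other lowercase words (such as "the" in the sample) without having to enumerate them.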

In [None]:
#5 create rating column = rating_numerator/rating_denominator
df_t_archive['rating'] = df_t_archive['rating_numerator'] / df_t_archive['rating_denominator']
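
Since the ratings were originally extracted from each tweet's text, one possible way to correct `rating_numerator` is to re-extract it. The sketch below is illustrative only: it assumes some tweets carry decimal numerators (for example "13.5/10"), and it guards against a zero denominator before dividing. The names `RATING_PATTERN`, `extract_rating`, and `rating_ratio` are hypothetical.

```python
import re

# Numerator may contain a decimal point; denominator is a whole number.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) from the first rating-like match, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def rating_ratio(text):
    """Return numerator / denominator, or None when it cannot be computed."""
    rating = extract_rating(text)
    if rating is None or rating[1] == 0:
        return None
    return rating[0] / rating[1]
```

The first match in a tweet is not necessarily the rating (a date written as "9/11" would also match), so results from a pattern like this still need a visual check.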

#### image_prediction

In [None]:
# 1 remove duplicate jpg_url values
image_prediction = image_prediction.drop_duplicates(subset=['jpg_url'], keep='first')
# check the result of removing duplicates
image_prediction['jpg_url'].duplicated().sum()

In [298]:
image_prediction.sample(2)

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1044 | 712717840512598017 | https://pbs.twimg.com/media/CeQVF1eVIAAJaTv.jpg | 1 | Great_Pyrenees | 0.732043 | True | kuvasz | 0.121375 | True | Irish_wolfhound | 0.049524 | True |
| 1955 | 864279568663928832 | https://pbs.twimg.com/media/C_6JrWZVwAAHhCD.jpg | 1 | bull_mastiff | 0.668613 | True | French_bulldog | 0.180562 | True | Staffordshire_bullterrier | 0.052237 | True |


#### df_tweet_json

### Tidiness Issues
1. Dog stage is a single variable spread across 4 columns (doggo, floofer, pupper, and puppo); it should be one column.<br>

2. The key used to join the 3 datasets should have the same name in each of them.


In [None]:
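# 1 Combine the 4 stage columns (doggo, floofer, pupper, and puppo) into one column.
# Relies on 'None' sorting before the lowercase stage names, so max() returns the stage;
# a row tagged with several stages keeps only the alphabetically last one.
df_t_archive['stage'] = df_t_archive[['doggo', 'floofer', 'pupper', 'puppo']].max(axis=1)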

In [None]:
#1.1 drop doggo, floofer, pupper, and puppo
df_t_archive.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)

In [None]:
#1.2 check stage column
df_t_archive.stage.value_counts()
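
If rows tagged with more than one stage matter for the analysis, an alternative is to join every stage that is present instead of taking a single value. The sketch below uses made-up rows and assumes each stage column holds either the stage name or the string 'None'; `combine_stages` is a hypothetical helper.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; assumes each stage column holds the stage name or 'None'.
demo = pd.DataFrame({
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    present = [stage for stage in STAGE_COLUMNS if row[stage] == stage]
    return ", ".join(present) if present else None


demo["stage"] = demo.apply(combine_stages, axis=1)
```

With this approach the third sample row ends up as "doggo, pupper", and rows with no stage are left missing rather than holding the string 'None'.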

In [None]:
df_t_archive.info()

In [None]:
# 2 Need to rename id to tweet_id.
df_tweet_json = df_tweet_json.rename(columns={'id': 'tweet_id'})
# Check duplicate tweetid
df_tweet_json['tweet_id'].duplicated().sum()

In [None]:
# merge the tweet archive with the image predictions (first of two merges)
df_merge = pd.merge(df_t_archive, image_prediction, how = 'left', on = ['tweet_id'])

In [None]:
# merge the result with the retweet and favorite counts (second of two merges)
df_merge = pd.merge(df_merge, df_tweet_json, how = 'left', on = ['tweet_id'])
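
The Key Points ask for ratings that have images, while a left merge keeps archive rows with no matching prediction. A sketch of one way to handle that, using made-up stand-in tables: `validate='one_to_one'` makes pandas raise an error if `tweet_id` is duplicated on either side, and rows without a `jpg_url` are dropped afterwards.

```python
import pandas as pd

# Made-up stand-ins for the three tables, all keyed on tweet_id.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating": [1.2, 1.3, 1.1]})
images = pd.DataFrame({"tweet_id": [1, 3], "jpg_url": ["a.jpg", "c.jpg"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 6, 7]})

# validate raises MergeError if tweet_id is duplicated on either side.
merged = pd.merge(archive, images, how="left", on="tweet_id", validate="one_to_one")
merged = pd.merge(merged, counts, how="left", on="tweet_id", validate="one_to_one")

# Keep only tweets that actually have an image prediction.
with_images = merged.dropna(subset=["jpg_url"]).reset_index(drop=True)
```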

In [None]:
# save it to csv
df_merge.to_csv('twitter_archive_master.csv', index=False)