# Data Wrangling Template

### Table Of Contents 
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gather</a></li>
<li><a href="#assess">Assess</a></li>
<li><a href="#clean">Clean</a></li>
<li><a href="#store">Storing, Analyzing, and Visualizing Data</a></li>
</ul>

<a id='intro'></a>
## Introduction

Real-world data rarely comes clean. In this project, I use Python and its libraries to gather data from a variety of sources and formats, assess its quality and tidiness, and then clean it. This process is called data wrangling.

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user **@dog_rates**, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

In [None]:
#import packages 
import pandas as pd 
import numpy as np
import os
import requests 
import json
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

<a id='gather'></a>
## Gather

### 1. The WeRateDogs Twitter archive  


In [None]:
#read WeRateDogs twitter data from a csv file  
twitter_archive_data = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive_data.head()

### 2. Image Predictions File 

In [None]:
#download the image predictions file programmatically using the requests library 
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.status_code

In [None]:
#read Image Predictions file 
image_pred = pd.read_csv(url, delimiter='\t')
image_pred.tail()
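
The cells above check the status code and then read the table straight from the URL, so the file is fetched twice and never stored locally. A minimal sketch of saving the downloaded bytes to disk first is shown below. The helper `save_download` and the sample bytes are hypothetical and only stand in for `response.content`.

```python
import os
import tempfile

def save_download(content: bytes, folder: str, filename: str) -> str:
    """Write downloaded bytes to folder/filename and return the full path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, 'wb') as file:
        file.write(content)
    return path

# Stand-in for response.content; in the notebook this would be the bytes
# returned by requests.get(url) once the status code has been checked.
sample_content = b"tweet_id\tp1\tp1_dog\n1\tgolden_retriever\tTrue\n"
saved_path = save_download(sample_content, tempfile.mkdtemp(), 'image-predictions.tsv')

# Read the file back to confirm that it was written unchanged
with open(saved_path, 'rb') as file:
    round_trip = file.read()
```

With the file on disk, `pd.read_csv(saved_path, delimiter='\t')` would read it without a second request.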

### 3. Twitter API data 

In [None]:
#I do not have access to the Twitter API, so I read the provided tweet-json.txt file instead. The commented-out code below shows how the Twitter API would be queried.

#import tweepy
#from tweepy import OAuthHandler
#import json
#from timeit import default_timer as timer 
#Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
#These are hidden to comply with Twitter's API terms and conditions
#consumer_key = 'HIDDEN'
#consumer_secret = 'HIDDEN'
#access_token = 'HIDDEN'
#access_secret = 'HIDDEN'
#auth = OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_secret)
#api = tweepy.API(auth, wait_on_rate_limit=True)
#tweet_ids = twitter_archive_data.tweet_id.values
#len(tweet_ids)
#Query Twitter's API for JSON data for each tweet ID in the Twitter archive
#count = 0
#fails_dict = {}
#start = timer()
#Save each tweet's returned JSON as a new line in a .txt file
#with open('tweet_json.txt', 'w') as outfile:
 #This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    #for tweet_id in tweet_ids:
        #count += 1
        #print(str(count) + ": " + str(tweet_id))
       # try:
        #    tweet = api.get_status(tweet_id, tweet_mode='extended')
        #    print("Success")
       #     json.dump(tweet._json, outfile)
       #     outfile.write('\n')
       # except tweepy.TweepError as e:
      #      print("Fail")
      #      fails_dict[tweet_id] = e
      #      pass
#end = timer()
#print(end - start)
#print(fails_dict)
tweet = pd.read_json(r'tweet-json.txt', lines=True)
tweet_json= pd.DataFrame(tweet, columns = ['id', 'favorite_count','retweet_count'])
tweet_json.sample()
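
`pd.read_json(..., lines=True)` treats every line of the file as one JSON object. The same idea can be sketched with the standard library alone, as below. The function name `parse_tweet_lines` and the two sample tweets are made up for illustration, and the sketch assumes every line carries the `id`, `favorite_count`, and `retweet_count` keys.

```python
import io
import json

def parse_tweet_lines(lines):
    """Return one dict per non-empty line with id, favorite_count, retweet_count."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            'id': tweet['id'],
            'favorite_count': tweet['favorite_count'],
            'retweet_count': tweet['retweet_count'],
        })
    return records

# Two made-up tweets standing in for the contents of tweet-json.txt
sample_file = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 3, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7, "lang": "en"}\n'
)
records = parse_tweet_lines(sample_file)
```

The resulting list of dicts can be passed to `pd.DataFrame(records)` to get the same three columns.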

<a id='assess'></a>
## Assess

### Visual Assessment 

In [None]:
twitter_archive_data

In [None]:
image_pred

In [None]:
tweet_json

### Programmatic Assessment 

#### 1. The WeRateDogs Twitter archive 

In [None]:
twitter_archive_data.info()

In [None]:
twitter_archive_data.sample(5)

In [None]:
twitter_archive_data.name.value_counts()

In [None]:
(twitter_archive_data.duplicated()).sum()

#### 2. Image Predictions File 

In [None]:
image_pred.info()

In [None]:
image_pred.sample(10)

#### 3. Twitter API data

In [None]:
tweet_json.info()

In [None]:
(tweet_json.duplicated()).sum()

In [None]:
tweet_json.shape[0]
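
`duplicated()` without arguments only flags rows that are identical in every column. A tweet that appears twice with different counts would go unnoticed, so it may be worth checking the ID column on its own. The sketch below uses a made-up `toy` table rather than the real data.

```python
import pandas as pd

# Made-up rows: tweet 2 appears twice with a different favorite_count
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'favorite_count': [10, 20, 21, 30],
    'retweet_count': [1, 2, 2, 3],
})

# Rows identical in every column
full_row_duplicates = int(toy.duplicated().sum())

# Rows that repeat an id already seen, whatever the other columns hold
id_duplicates = int(toy.duplicated(subset='id').sum())
```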

### Quality 
---
**Quality:** issues with content. Low quality data is also known as dirty data.

--- 

#### The WeRateDogs `twitter_archive_data`  table 

- `timestamp` is stored as a string (object), not as a datetime
- some dog names are incorrect: they start with a lowercase letter (e.g. a, an, by, quite)
- missing dog names are recorded as the string "None"
- `tweet_id` is stored as an integer, not as a string
- the archive contains retweets along with original tweets
- the `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns are not needed for the analysis

#### Image Predictions File  `image_pred`  table 
- inconsistent capitalization in the p1, p2, and p3 columns: some values are in title case, others in lowercase
- missing data: the Image Predictions File table has information for 2075 tweets

#### Twitter API  `tweet_json` table 
- missing data: the Twitter API table has information for 2354 tweets



### Tidiness
---
**Tidiness:** issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.



---

- the `twitter_archive_data`, `image_pred`, and `tweet_json` tables all describe the same observational unit (a tweet), so they should be combined into one table

#### The WeRateDogs `twitter_archive_data`  table 

- one variable (dog stage) is spread across four columns (doggo, floofer, pupper, puppo) instead of a single dog_stage column

#### Image Predictions File  `image_pred`  table  
- the p1, p2, and p3 column names are not clear 

#### Twitter API  `tweet_json` table  
- the id column needs to be renamed tweet_id to match the `twitter_archive_data` and `image_pred` tables 

<a id='clean'></a>
## Clean

In [None]:
#make copies of data 
twitter_archive_clean = twitter_archive_data.copy() 
image_pred_clean = image_pred.copy() 
tweet_json_clean = tweet_json.copy()  

In [None]:
twitter_archive_clean['rating'] = twitter_archive_clean['rating_numerator'] / twitter_archive_clean['rating_denominator'] 

`twitter_archive_clean` **timestamp is captured as a string object not datetime** 
#### Define
Convert the timestamp column to datetime using the pandas `to_datetime` function

#### Code

In [None]:
twitter_archive_clean.timestamp = pd.to_datetime(twitter_archive_clean.timestamp)

#### Test

In [None]:
twitter_archive_clean.info()

`twitter_archive_clean`  **incorrect dog names that start with lowercase letters such as** [by, quite, a, not, an, etc.]
#### Define
Drop the rows whose name is not in title case, using the `str.istitle()` method in pandas

#### Code 

In [None]:
twitter_archive_clean = twitter_archive_clean[twitter_archive_clean['name'].str.istitle()!= False]

#### Test 

In [None]:
twitter_archive_clean.name.str.istitle().value_counts()
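
One caveat with `istitle()` is that it is also False for a name with a capital letter in the middle, so such rows would be dropped together with the lowercase words. A possible alternative is to flag only values whose first character is lowercase. The sketch below compares both filters on a made-up list of names (the name `CeCe` is an illustrative assumption, not a value taken from the archive).

```python
import pandas as pd

names = pd.Series(['Charlie', 'a', 'CeCe', 'quite', 'None'])

# istitle() is False for lowercase words, but also for names with an inner capital
title_mask = names.str.istitle()

# Alternative: flag only values whose first character is a lowercase letter
lowercase_start = names.str.match(r'[a-z]')

kept_with_istitle = names[title_mask].tolist()
kept_with_regex = names[~lowercase_start].tolist()
```

Here the `istitle()` filter drops `CeCe`, while the regular-expression filter keeps it and still removes `a` and `quite`.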

`twitter_archive_clean`  **None values in dogs name**
#### Define
Drop rows that have the value "None" as a dog name, using `twitter_archive_clean['name']!= 'None'`

#### Code 

In [None]:
twitter_archive_clean = twitter_archive_clean[twitter_archive_clean['name']!= 'None']

#### Test

In [None]:
(twitter_archive_clean.name == 'None').sum()

In [None]:
twitter_archive_clean.shape

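`twitter_archive_clean`  **tweet_id is captured as an int not object string**  
#### Define 
Convert tweet_id to a string because the column is an identifier and will not be used for manipulation or calculation. The ID columns of `image_pred_clean` and `tweet_json_clean` are converted as well, so that the later merges on tweet_id compare matching types.

#### Code

In [None]:
#cast the ID columns to string in all three tables
twitter_archive_clean['tweet_id'] = twitter_archive_clean['tweet_id'].astype(str)
image_pred_clean['tweet_id'] = image_pred_clean['tweet_id'].astype(str)
tweet_json_clean['id'] = tweet_json_clean['id'].astype(str)  #renamed to tweet_id in a later step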

#### Test

In [None]:
twitter_archive_clean.info()

`twitter_archive_clean` **archive data contains retweets along with original tweets**
#### Define
Remove retweets from the twitter_archive_clean table using `str.match()` to find tweets whose text starts with the `RT @` pattern

#### Code

In [None]:
twitter_archive_clean.text.str.match('RT @').value_counts()

In [None]:
twitter_archive_clean = twitter_archive_clean[twitter_archive_clean.text.str.match('RT @')!= True]

#### Test 

In [None]:
(twitter_archive_clean.text.str.match('RT @')).sum()
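
Matching on the tweet text works, but it depends on how the retweet is worded. Another option, assuming that `retweeted_status_id` is filled in only for retweets, is to keep the rows where that column is null. This has to run before the column is dropped. The `toy_archive` table below is made up for illustration.

```python
import pandas as pd

# Made-up rows: only the second tweet is a retweet
toy_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'text': ['This is Charlie. 12/10',
             'RT @dog_rates: This is Lucy. 13/10',
             'Meet Max. 11/10'],
    'retweeted_status_id': [float('nan'), 8.0e17, float('nan')],
})

# Keep only rows without a retweeted_status_id, i.e. original tweets
originals = toy_archive[toy_archive['retweeted_status_id'].isna()]
original_ids = originals['tweet_id'].tolist()
```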

`twitter_archive_clean` **retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns are not needed for the analysis**
#### Define
Drop retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp using the `.drop()` method in pandas 

#### Code

In [None]:
#columns before .drop()
twitter_archive_clean.columns

In [None]:
twitter_archive_clean.drop(['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp'],axis=1,inplace=True) 

#### Test

In [None]:
#after .drop() 
twitter_archive_clean.columns

In [None]:
twitter_archive_clean.shape

`twitter_archive_clean`  **one variable (dog stage) is spread across four columns (doggo, floofer, pupper, puppo)**
#### Define
Create a dog_stage column, fill in its values from the doggo, floofer, pupper, and puppo columns, then drop those four columns 


#### Code

In [None]:
#fill in none values with nan 
twitter_archive_clean = twitter_archive_clean.replace('None', np.nan) 

In [None]:
#create dog_stage column and assign doggo, floofer, pupper and puppo values
twitter_archive_clean['dog_stage'] = twitter_archive_clean[['doggo','floofer','pupper','puppo']].fillna('').sum(1).replace('', np.nan)

In [None]:
twitter_archive_clean.dog_stage.value_counts()

In [None]:
twitter_archive_clean['dog_stage'] = twitter_archive_clean['dog_stage'].replace('doggopupper', 'doggo,pupper')

In [None]:
twitter_archive_clean.drop(['doggo','floofer','pupper','puppo'], axis=1,inplace=True)

#### Test

In [None]:
twitter_archive_clean.dog_stage.value_counts()

In [None]:
twitter_archive_clean.columns

In [None]:
twitter_archive_clean.shape
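
The cells above build `dog_stage` by concatenating the four columns and then patch the one combined value that shows up. A sketch of an alternative that inserts the separator while combining, so any combination of stages is handled in one pass, is given below. The helper `combine_stages` and the `toy` table are hypothetical.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows using the same 'None' placeholder as the archive
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row with a comma, or return NaN."""
    found = [value for value in row if value != 'None' and pd.notna(value)]
    return ','.join(found) if found else np.nan

toy['dog_stage'] = toy[stage_cols].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)
```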

`image_pred_clean` **inconsistent capitalization in the p1, p2, and p3 columns: some values are in title case, others in lowercase**
#### Define
Convert the values in the p1, p2, and p3 columns to title case using the `str.title()` method 

#### Code 

In [None]:
image_pred_clean.p1 = image_pred_clean.p1.str.title()
image_pred_clean.p2 = image_pred_clean.p2.str.title()
image_pred_clean.p3 = image_pred_clean.p3.str.title()

#### Test

In [None]:
image_pred_clean.head()

`image_pred_clean` **p1, p2, p3 column names are not clear**

#### Define 

Rename the p1, p2, and p3 columns to pred_1, pred_2, and pred_3 using `rename()` 


#### Code

In [None]:
image_pred_clean.rename(columns={'p1':'pred_1','p2':'pred_2','p3':'pred_3'}, inplace=True)

#### Test

In [None]:
image_pred_clean.columns

`tweet_json_clean` **id column needs to be renamed tweet_id to match the twitter_archive_data and image_pred tables**
#### Define 
Rename the id column to tweet_id using `rename()`

#### Code

In [None]:
tweet_json_clean.rename(columns={'id':'tweet_id'}, inplace=True)

#### Test

In [None]:
tweet_json_clean.columns

`image_pred_clean` **missing data, Image Predictions File table has information for 2075 tweets** 
#### Define 

Create a new data frame, `twitter_archive_master`, by left-merging `image_pred_clean` into `twitter_archive_clean` on tweet_id

#### Code

In [None]:
twitter_archive_master = pd.merge(twitter_archive_clean,image_pred_clean,
                            on=['tweet_id'], how='left')

In [None]:
twitter_archive_master.shape

In [None]:
twitter_archive_master.columns

In [None]:
twitter_archive_master.head()
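
A left merge keeps every archive row and fills the prediction columns with NaN where no match exists. To see how many rows went unmatched, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The two small tables below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Charlie', 'Lucy', 'Max']})
right = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['Pug', 'Chihuahua']})

# indicator=True adds a _merge column: 'both' for matches, 'left_only' otherwise
merged = pd.merge(left, right, on='tweet_id', how='left', indicator=True)

unmatched = int((merged['_merge'] == 'left_only').sum())
```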

In [None]:
tweet_json_clean.columns

`tweet_json_clean` **missing data, Twitter API table has information for 2354 tweets** 

#### Define
Left-merge `tweet_json_clean` into `twitter_archive_master` on tweet_id

#### Code

In [None]:
twitter_archive_master = pd.merge(twitter_archive_master,tweet_json_clean, on=['tweet_id'], how='left')

#### Test

In [None]:
twitter_archive_master

In [None]:
twitter_archive_master.to_csv(r'twitter_archive_master.csv', index = False)

<a id='store'></a>
## Storing, Analyzing, and Visualizing Data

In [None]:
twitter_data = pd.read_csv('twitter_archive_master.csv')
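
A CSV file stores no type information, so a datetime column comes back as plain text and a string identifier made of digits comes back as an integer. The sketch below shows the effect on a made-up one-row table and one way to restore the types with the `dtype` and `parse_dates` arguments of `read_csv`.

```python
import io
import pandas as pd

# Made-up row with a string ID and a datetime column
toy = pd.DataFrame({
    'tweet_id': ['892420643555336193'],
    'timestamp': pd.to_datetime(['2017-08-01 16:23:56']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the types are inferred again from the text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit read: keep the ID as a string and parse the timestamp
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'tweet_id': str}, parse_dates=['timestamp'])
```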

### Evaluating our model's performance

> How often is the model's **first prediction** a dog breed?

In [None]:
twitter_data.p1_dog.value_counts()

> How often is the model's **second prediction** a dog breed?

In [None]:
twitter_data.p2_dog.value_counts()

> How often is the model's **third prediction** a dog breed?  

In [None]:
twitter_data.p3_dog.value_counts()
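
Raw counts are easier to compare across the three predictions when expressed as a share. Since `p1_dog`, `p2_dog`, and `p3_dog` are boolean flags, the mean of each column is the fraction of True values. The `toy` table below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'p1_dog': [True, True, False, True],
    'p2_dog': [True, False, False, True],
})

# The mean of a boolean column is the fraction of True values
p1_share = toy['p1_dog'].mean()
p2_share = toy['p2_dog'].mean()
```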

> ### Most common words used in @dog_rates twitter account tweets 

In [None]:
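# Create the wordcloud object
# Join all tweets into one string (str() on a large NumPy array would abbreviate it with '...')
text = ' '.join(twitter_data['text'].astype(str))
wordcloud = WordCloud(width=480, height=480, margin=0).generate(text)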

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
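
A word cloud is a picture of word frequencies, so the underlying counts can also be inspected directly. The sketch below counts words with the standard library after removing links. The function `word_frequencies`, its small stopword set, and the two sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(tweets, stopwords=frozenset({'this', 'is', 'a', 'the', 'he', 'she'})):
    """Count lowercase words across tweets, ignoring links and stopwords."""
    counts = Counter()
    for tweet in tweets:
        no_links = re.sub(r'https?://\S+', '', tweet)  # drop URLs such as t.co links
        words = re.findall(r"[a-z']+", no_links.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts

sample_tweets = [
    'This is Charlie. He is a good pupper. 12/10 https://t.co/abc',
    'This is Lucy. She is a good doggo. 13/10 https://t.co/xyz',
]
frequencies = word_frequencies(sample_tweets)
top_word, top_count = frequencies.most_common(1)[0]
```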


In [None]:
dog_breeds = (twitter_data[twitter_data['p1_dog']==True]).pred_1

result = dog_breeds.value_counts()[:10]

result.plot(kind='barh',figsize=(10,8),color=(0.2, 0.4, 0.6, 0.6))
plt.xlabel("Number of Dogs")
plt.ylabel("Dog Breeds")
plt.title("Most Common Dog Breeds in @dog_rates Tweets")

In [None]:
# Template pie chart with placeholder data (the matplotlib gallery example), not the tweet data
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0)  # only "explode" the 2nd slice (i.e. 'Hogs')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show() 



> Dog Stage @dog_rates tweets 

In [None]:
pie_data = twitter_data.dog_stage.value_counts()[:5]
pie_data.plot.pie(autopct='%1.1f%%', startangle=140,figsize=(8,8),title="Dog Stage @dog_rates tweets", label="Dog Stage")

In [None]:
plt.scatter(twitter_data.favorite_count,twitter_data.retweet_count)
plt.title("Favorites vs. Retweets @dog_rates")
plt.xlabel("Favorites")
plt.ylabel("Retweets")
plt.show()
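
The scatter plot shows the relationship visually. A single number can complement it: the Pearson correlation between the two columns, computed with `Series.corr`. The `toy` values below are made up and do not come from the tweet data.

```python
import pandas as pd

toy = pd.DataFrame({
    'favorite_count': [100, 200, 300, 400],
    'retweet_count': [30, 55, 95, 120],
})

# Pearson correlation (the default method) between favorites and retweets
correlation = toy['favorite_count'].corr(toy['retweet_count'])
```

A value close to 1 means that tweets with more favorites also tend to have more retweets.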

### Resources 


- https://python-graph-gallery.com/wordcloud/  
- https://stackoverflow.com/questions/43606339/generate-word-cloud-from-single-column-pandas-dataframe 
