# <a name="top">WeRateDogs - Udacity Data Wrangling Project 03 </a>
---
## GATHER & ASSESS 3 datasets from 3 different sources:
1. [Gather/Assess Data #1](#gatherassess1) - Twitter archive, twitter-archive-enhanced.csv (local archive). format: CSV
2. [Gather/Assess Data #2](#gather2) - Tweet image predictions, image-predictions.tsv (created by an image recognition system), downloaded from file_url with the requests library. format: TSV
3. [Gather/Assess Data #3](#gather3) - Query the Twitter API for additional data (retweet and favorite counts), saved as one JSON object per line. format: TXT
 
## CLEAN (8) Quality Issues 
Quality issues, also known as dirty data, are content problems such as mislabeled, corrupted, duplicated, or inconsistent values.

### twitter-archive-enhanced.csv quality issues:

1. [Quality #1](#q1) - columns 'timestamp' & 'retweeted_status_timestamp' are objects (strings) rather than datetimes. Convert both to datetime.

2. [Quality #2](#q2) - twitterDF.name contains a lot of non-dog names, e.g. 'a', 'an', 'actually', etc; Replace with np.NaN
   
3. [Quality #3](#q3) - ratings with decimal values in the numerator incorrectly extracted (not including denominator)

4. [Quality #4](#q4) - remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck

5. [Quality #5](#q5) - retweeted_status_id is of type float; change to object(text). `in_reply_to_status_id` and `in_reply_to_user_id` are type float. Convert to string
 
6. [Quality #6](#q6) - 

### rt_tweets quality issues:

7. [Quality #7](#q7) - create new dataframe of columns needed

8. [Quality #8](#q8) - remove retweets


---
## CLEAN (2) Tidiness Issues
Messy data has structural issues. Tidy data requires that each variable forms a column, each observation forms a row, & each observational unit forms a table.

1. [Tidy #1](#t1) - Merge all three datasets to form one. Three similar datasets should form one observational unit.

2. [Tidy #2](#t2) - Form one column from the four that describe dog stages: doggo, floofer, pupper, puppo. Tidy data requires that each variable forms a column.

---
## Insights from data Analysis:

1. [BAR CHART 1](#vis1) - Horizontal Bar Chart (WeRateDogs Dog Breeds represented (top 10))
2. [BAR CHART 2](#vis2) - Horizontal Bar Chart (Top 15 Favorites (tweets), by probable name)
3. [Programmatic 1](#prog1) - Percentages, Value Counts, etc.
4. [Programmatic 2](#prog2) - Grouping of dataframe on the first predicted name for various mean data

---

## Saved new dataframe to file 
[Save the merged dataframe to twitter_archive_master.csv](#save1)


[BACK TO TOP](#top)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import requests

import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
%matplotlib inline

## <a name="gatherassess1">Gather/Assess Data #1 - Twitter Archive Enhanced</a>

In [2]:
# Read data into dataframe
twitterDF_orig = pd.read_csv("data/twitter-archive-enhanced.csv")

# Make copy of dataframe
twitterDF = twitterDF_orig.copy()

In [35]:
# Visually Assess Twitter Archive
twitterDF.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
837,767754930266464257,,,2016-08-22 16:06:54+00:00,iphone,This is Philbert. His toilet broke and he does...,https://twitter.com/dog_rates/status/767754930...,11.0,10.0,Philbert,,,,
1068,740373189193256964,,,2016-06-08 02:41:38+00:00,iphone,"After so many requests, this is Bretagne. She ...",https://twitter.com/dog_rates/status/740373189...,9.0,11.0,,,,,
2220,668484198282485761,,,2015-11-22 17:40:27+00:00,iphone,Good teamwork between these dogs. One is on lo...,https://twitter.com/dog_rates/status/668484198...,9.0,10.0,,,,,
1724,680085611152338944,,,2015-12-24 18:00:19+00:00,TweetDeck,This is by far the most coordinated series of ...,https://twitter.com/dog_rates/status/680085611...,12.0,10.0,by,,,,
299,836989968035819520,,,2017-03-01 17:22:13+00:00,iphone,This is Mookie. He really enjoys shopping but ...,https://twitter.com/dog_rates/status/836989968...,12.0,10.0,Mookie,,,,
2206,668631377374486528,,,2015-11-23 03:25:17+00:00,iphone,Meet Zeek. He is a grey Cumulonimbus. Zeek is ...,https://twitter.com/dog_rates/status/668631377...,5.0,10.0,Zeek,,,,
1678,682047327939461121,,,2015-12-30 03:55:29+00:00,iphone,We normally don't rate bears but this one seem...,https://twitter.com/dog_rates/status/682047327...,10.0,10.0,,,,,
1464,694356675654983680,6.706684e+17,4196984000.0,2016-02-02 03:08:26+00:00,iphone,This pupper only appears through the hole of a...,https://twitter.com/dog_rates/status/694356675...,10.0,10.0,,,,pupper,
80,877316821321428993,,,2017-06-21 00:06:44+00:00,iphone,Meet Dante. At first he wasn't a fan of his ne...,https://twitter.com/dog_rates/status/877316821...,13.0,10.0,Dante,,,,
1636,684200372118904832,,,2016-01-05 02:30:55+00:00,iphone,Gang of fearless hoofed puppers here. Straight...,https://twitter.com/dog_rates/status/684200372...,6.0,10.0,,,,,


In [4]:
# Programmatically Assess
# review data columns in DF, are Dtypes appropriate, etc.
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [5]:
# Programmatically Assess
# find all tweets where the retweeted_status_id is notnull
twitterDF[twitterDF.retweeted_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @twitter: @dog_rates Awesome Tweet! 12/10. ...,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,https://twitter.com/twitter/status/71199827977...,12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,


[BACK TO TOP](#top)

In [6]:
# Visually Assess
# review names of pups
twitterDF.name.value_counts()

None          745
a              55
Charlie        12
Lucy           11
Cooper         11
             ... 
Cedrick         1
JD              1
Christoper      1
Kuyu            1
Strider         1
Name: name, Length: 957, dtype: int64

In [7]:
# Programmatically Assess
# review dogtionary names; interesting to see id# 200 has 2 values, doggo & floofer
twitterDF[twitterDF['floofer'] != 'None'].head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
46,883360690899218434,,,2017-07-07 16:22:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Grizzwald. He may be the floofiest floofe...,,,,https://twitter.com/dog_rates/status/883360690...,13,10,Grizzwald,,floofer,,
200,854010172552949760,,,2017-04-17 16:34:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...","At first I thought this was a shy doggo, but i...",,,,https://twitter.com/dog_rates/status/854010172...,11,10,,doggo,floofer,,
582,800388270626521089,,,2016-11-20 17:20:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Doc. He takes time out of every day to...,,,,https://twitter.com/dog_rates/status/800388270...,12,10,Doc,,floofer,,


In [8]:
# it appears the stages are pulled from the tweeted text, 'doggo' & 'floofer' in text below

twitterDF.loc[200,'text']

"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk"

In [9]:
# Illustrating that pup designations are NOT mutually exclusive; one tweet can carry multiple stages (see row 200 above)
twitterDF[twitterDF['doggo'] != 'None'].sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1117,732375214819057664,,,2016-05-17 01:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kyle (pronounced 'Mitch'). He strives ...,,,,https://twitter.com/dog_rates/status/732375214...,11,10,Kyle,doggo,,,
724,782747134529531904,,,2016-10-03 01:00:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Deacon. He's the happiest almost dry d...,,,,https://twitter.com/dog_rates/status/782747134...,11,10,Deacon,doggo,,,
362,829449946868879360,,,2017-02-08 22:00:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a stressed doggo. Had a long day. Many ...,,,,https://twitter.com/dog_rates/status/829449946...,11,10,,doggo,,,
1075,739623569819336705,,,2016-06-06 01:02:55 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here's a doggo that don't need no human. 12/10...,,,,https://vine.co/v/iY9Fr1I31U6,12,10,,doggo,,,
501,813096984823349248,,,2016-12-25 19:00:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rocky. He got triple-doggo-dared. Stuc...,,,,https://twitter.com/dog_rates/status/813096984...,11,10,Rocky,doggo,,,
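
Because one tweet can carry more than one stage, the four stage columns can be collapsed into a single column without losing information. The following is a minimal sketch on a toy frame, not the notebook's own solution; the column name `dog_stage`, the helper `combine_stages`, and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the archive's four stage columns (values are illustrative).
stages_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["floofer", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(row):
    """Join every stage present in the row; return NaN when none is set."""
    found = [value for value in row if value != "None"]
    return ", ".join(found) if found else np.nan

# Hypothetical column name; one variable (the stage) now lives in one column.
stages_demo["dog_stage"] = stages_demo[stage_cols].apply(combine_stages, axis=1)
stages_demo = stages_demo.drop(columns=stage_cols)
print(stages_demo)
```

Joining with a separator keeps rows such as index 200 (doggo and floofer) intact instead of silently picking one stage.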


---
### Define

<a name="q1"> Q1 - Convert dtype of timestamp columns</a>

### Code

In [10]:
# Fixed 2 columns with incorrect datatypes, changed to datetime64
twitterDF.timestamp = pd.to_datetime(twitterDF.timestamp)
twitterDF.retweeted_status_timestamp = pd.to_datetime(twitterDF.retweeted_status_timestamp)

### Test

In [11]:
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           

---
<a name="q2">-</a>
### Define

Q2 - twitterDF.name contains a lot of non-dog names, e.g. 'a', 'an', 'actually', etc; Replace with np.NaN

### Code

In [12]:
# replace dog names that match 'a' with NaN
twitterDF.name = np.where(twitterDF.name == 'a', np.nan, twitterDF.name)

In [34]:
# Apparently all of the invalid dog names are lowercase. See here.
# .eq(True) maps NaN names to False; indexing with a mask that still holds NaN raises
# "ValueError: Cannot mask with non-boolean array containing NA / NaN values".

twitterDF[twitterDF.name.str.islower().eq(True)].name.value_counts()
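
Since the invalid names all appear to be lowercase, the replacement can be generalized beyond 'a'. The sketch below works on a toy frame and assumes that every lowercase entry is an extraction mistake; the frame `names_demo` and its values are illustrative, not taken from the archive.

```python
import numpy as np
import pandas as pd

# Toy names column; the NaN entry stands for a name that was already cleaned.
names_demo = pd.DataFrame({"name": ["Charlie", "a", np.nan, "Lucy", "actually"]})

# str.islower() yields NaN for missing names; .eq(True) turns those into False
# so the result is a clean boolean mask.
is_lowercase = names_demo["name"].str.islower().eq(True)

# Assumption: every all-lowercase entry is a non-name and becomes NaN.
names_demo.loc[is_lowercase, "name"] = np.nan
print(names_demo["name"].tolist())
```

On the real data the same mask could be applied to `twitterDF.name`, after reviewing the lowercase values to confirm none of them is a genuine name.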

### Test

In [13]:
# check to ensure all 'a' names were removed 
twitterDF[twitterDF.name == 'a']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


---
<a name="q3">-</a><br/>
### Define 
Q3 - Ratings with decimal values incorrectly extracted 

### Code

In [14]:
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           

In [15]:
# update Dtype of ratings to float to accept the updated values

twitterDF.rating_numerator = twitterDF.rating_numerator.astype(float)
twitterDF.rating_denominator = twitterDF.rating_denominator.astype(float)
twitterDF.info()
twitterDF.sample(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   float64         

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1239,712092745624633345,,,2016-03-22 01:45:15+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Steven. He's inverted af. Also very he...,,,NaT,https://twitter.com/dog_rates/status/712092745...,7.0,10.0,Steven,,,,
2251,667806454573760512,,,2015-11-20 20:47:20+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Filup. He is overcome with joy after f...,,,NaT,https://twitter.com/dog_rates/status/667806454...,10.0,10.0,Filup,,,,
1331,705591895322394625,,,2016-03-04 03:13:11+00:00,"<a href=""http://twitter.com/download/iphone"" r...","""Ma'am, for the last time, I'm not authorized ...",,,NaT,https://twitter.com/dog_rates/status/705591895...,11.0,10.0,,,,,


In [16]:
# extract ratings from text to include decimal values for the NUMERATOR only & SEE result

ratings = twitterDF.text.str.extract(r'((?:\d+\.)?\d+)\/(\d+)', expand=True)
ratings.sample(5)

Unnamed: 0,0,1
128,13,10
1630,12,10
2005,11,10
1072,12,10
842,10,10


In [17]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2356 non-null   object
 1   1       2356 non-null   object
dtypes: object(2)
memory usage: 36.9+ KB


In [18]:
ratings = ratings.rename(columns={0:'numerator',1:'denominator'})
# confirm the new column names
ratings

     numerator denominator
0           13          10
1           13          10
2           12          10
3           13          10
4           12          10
...        ...         ...
2351         5          10
2352         6          10
2353         9          10
2354         7          10
2355         8          10

[2356 rows x 2 columns]

In [19]:
ratings.numerator = ratings.numerator.astype(float)
ratings.denominator = ratings.denominator.astype(float)
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   numerator    2356 non-null   float64
 1   denominator  2356 non-null   float64
dtypes: float64(2)
memory usage: 36.9 KB


In [32]:
twitterDF.rating_numerator = ratings.numerator
twitterDF.rating_denominator = ratings.denominator
twitterDF.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1420,698262614669991936,,,2016-02-12 21:49:15+00:00,iphone,This is Franklin. He's a yoga master. Trying t...,https://twitter.com/dog_rates/status/698262614...,11.0,10.0,Franklin,,,,
443,819347104292290561,,,2017-01-12 00:55:47+00:00,iphone,Say hello to Anna and Elsa. They fall asleep i...,https://twitter.com/dog_rates/status/819347104...,12.0,10.0,Anna,,,,


### Test
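
One way to test the extraction is to run the same pattern on a few texts whose ratings are known in advance. This is a sketch with made-up tweet texts, not rows from the archive; it only checks that decimal numerators survive the regular expression used above.

```python
import pandas as pd

# Hypothetical tweet texts; two of them contain a decimal numerator.
texts = pd.Series([
    "This is Bella. 13.5/10 would pet",
    "Good dog. 12/10",
    "Pretty solid. 9.75/10 overall",
])

# Same pattern as in the Code section: optional "digits." prefix, digits, "/", digits.
pattern = r"((?:\d+\.)?\d+)\/(\d+)"
extracted = texts.str.extract(pattern, expand=True).astype(float)
extracted.columns = ["numerator", "denominator"]

assert extracted["numerator"].tolist() == [13.5, 12.0, 9.75]
assert extracted["denominator"].tolist() == [10.0, 10.0, 10.0]
```

On the cleaned dataframe, a comparable check would be to filter `twitterDF` to the rows whose text contains a decimal rating and compare `rating_numerator` against the text by eye.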

---
### Define
<a name="q4"> Q4 - remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck </a>

In [22]:
# review names of sources
twitterDF.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

### Code

In [23]:
twitterDF.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13.0,10.0,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13.0,10.0,Tilly,,,,


In [24]:
# function to categorize source column

def update_source(source_html):
    # Match on substrings of the raw <a ...> tag. The 'iphone' check must come
    # before 'Twitter', because "Twitter for iPhone" contains both.
    if 'iphone' in source_html:
        return 'iphone'
    elif 'vine' in source_html:
        return 'vine'
    elif 'Twitter' in source_html:
        return 'twitter web client'
    elif 'TweetDeck' in source_html:
        return 'TweetDeck'
    # any other source falls through and returns None

In [25]:
# apply update_source to every value to replace the source HTML with a shorter description
twitterDF.source = twitterDF.source.apply(update_source)

### Test

In [26]:
# check to ensure function replaced items as intended
twitterDF.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1478,693590843962331137,,,2016-01-31 00:25:18+00:00,iphone,Meet Phil. He's big af. Currently destroying t...,,,NaT,https://twitter.com/dog_rates/status/693590843...,3.0,10.0,Phil,,,pupper,
1349,704134088924532736,,,2016-02-29 02:40:23+00:00,vine,This sneezy pupper is just adorable af. 12/10 ...,,,NaT,https://vine.co/v/igW2OEwu9vg,12.0,10.0,,,,pupper,
1540,689659372465688576,,,2016-01-20 04:03:02+00:00,iphone,This is Ricky. He's being escorted out of the ...,,,NaT,https://twitter.com/dog_rates/status/689659372...,8.0,10.0,Ricky,,,,
678,789268448748703744,,,2016-10-21 00:53:56+00:00,iphone,This is Stella. She's happier than I will ever...,,,NaT,https://twitter.com/dog_rates/status/789268448...,10.0,10.0,Stella,,,,
863,762471784394268675,,,2016-08-08 02:13:34+00:00,iphone,Meet Glenn. Being in public scares him. Fright...,,,NaT,https://twitter.com/dog_rates/status/762471784...,12.0,10.0,Glenn,,,,
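
A random sample may miss the rarer sources, so a stricter test is to run the function on all four raw values listed by `value_counts()` earlier. The sketch below repeats the function so that it runs on its own; the four input strings are the ones shown in the assessment output.

```python
import pandas as pd

def update_source(source_html):
    """Map a raw source <a> tag to a short label (same logic as the cell above)."""
    if 'iphone' in source_html:
        return 'iphone'
    elif 'vine' in source_html:
        return 'vine'
    elif 'Twitter' in source_html:
        return 'twitter web client'
    elif 'TweetDeck' in source_html:
        return 'TweetDeck'

raw_sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

cleaned = raw_sources.apply(update_source)

# Every raw value maps to exactly one label and nothing falls through to None.
assert cleaned.tolist() == ['iphone', 'vine', 'twitter web client', 'TweetDeck']
assert cleaned.notna().all()
```

On the full dataframe, `twitterDF.source.value_counts(dropna=False)` should then list only these four labels.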


---
### Define
<a name="q5">Q5 - retweeted_status_id is of type float; change to object(text). `in_reply_to_status_id` and `in_reply_to_user_id` are type float. Convert to string</a>

### Code

In [27]:
# data exploration
# see sample of in_reply_to_status_id...
twitterDF[twitterDF.in_reply_to_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35+00:00,iphone,@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,NaT,,12.0,10.0,,,,,
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53+00:00,iphone,@roushfenway These are good dogs but 17/10 is ...,,,NaT,,17.0,10.0,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36+00:00,iphone,@RealKentMurphy 14/10 confirmed,,,NaT,,14.0,10.0,,,,,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25+00:00,iphone,@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,NaT,,10.0,10.0,,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35+00:00,iphone,@Jack_Septic_Eye I'd need a few more pics to p...,,,NaT,,12.0,10.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2038,671550332464455680,6.715449e+17,4.196984e+09,2015-12-01 04:44:10+00:00,iphone,After 22 minutes of careful deliberation this ...,,,NaT,,1.0,10.0,,,,,
2149,669684865554620416,6.693544e+17,4.196984e+09,2015-11-26 01:11:28+00:00,iphone,After countless hours of research and hundreds...,,,NaT,,11.0,10.0,,,,,
2169,669353438988365824,6.678065e+17,4.196984e+09,2015-11-25 03:14:30+00:00,iphone,This is Tessa. She is also very pleased after ...,,,NaT,https://twitter.com/dog_rates/status/669353438...,10.0,10.0,Tessa,,,,
2189,668967877119254528,6.689207e+17,2.143566e+07,2015-11-24 01:42:25+00:00,iphone,12/10 good shit Bubka\n@wane15,,,NaT,,12.0,10.0,,,,,


### Test
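
The ID columns were parsed as float64 because they contain missing values, and a float64 generally cannot hold an 18-digit ID exactly. One possible approach, sketched here on an in-memory CSV with made-up IDs, is to read those columns as text in the first place; whether this fits the notebook depends on re-reading the archive, which is an assumption of this example.

```python
import io
import pandas as pd

# Two-row stand-in for the archive CSV (IDs are invented for illustration).
csv_text = (
    "tweet_id,in_reply_to_status_id,in_reply_to_user_id\n"
    "1001,123456789012345678,4196983835\n"
    "1002,,\n"
)
id_cols = ["in_reply_to_status_id", "in_reply_to_user_id"]

# Default parsing: the blank cells force float64, which rounds the long ID.
as_float = pd.read_csv(io.StringIO(csv_text))

# Reading the ID columns as text keeps every digit; blanks stay NaN.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in id_cols})

print(as_float["in_reply_to_status_id"].dtype)
print(as_text.loc[0, "in_reply_to_status_id"])
```

Converting the existing float column with `astype(str)` would keep the rounded value, which is why the sketch changes the read step instead.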

---
### Define
<a name="q6">Q6 - </a>

---
### Define

<a name="q8"> Q8 - remove retweets & delete columns </a>

In [28]:
twitterDF.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1917,674291837063053312,,,2015-12-08 18:17:56+00:00,iphone,This is Kenny. He just wants to be included in...,,,NaT,https://twitter.com/dog_rates/status/674291837...,11.0,10.0,Kenny,,,,
1084,738402415918125056,,,2016-06-02 16:10:29+00:00,iphone,"""Don't talk to me or my son ever again"" ...10/...",,,NaT,https://twitter.com/dog_rates/status/738402415...,10.0,10.0,,,,,


### Code

In [29]:
# Get indices of rows to drop, in this case, any row whose retweeted_status_id is not NaN.
drop_these = twitterDF[twitterDF['retweeted_status_id'].notnull()].index
twitterDF.drop(drop_these,inplace=True)
twitterDF.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1575,687476254459715584,,,2016-01-14 03:28:06+00:00,iphone,This is Curtis. He's a fluffball. 11/10 would ...,,,NaT,https://twitter.com/dog_rates/status/687476254...,11.0,10.0,Curtis,,,pupper,
393,825876512159186944,,,2017-01-30 01:21:19+00:00,iphone,This is Mo. No one will push him around in the...,,,NaT,https://twitter.com/dog_rates/status/825876512...,11.0,10.0,Mo,,,,
374,828372645993398273,,,2017-02-05 22:40:03+00:00,iphone,This is Alexander Hamilpup. He was one of the ...,,,NaT,https://twitter.com/dog_rates/status/828372645...,12.0,10.0,Alexander,,,,


In [30]:
# get rid of 3 empty columns representing the retweeted tweets
drop_cols = ['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp']
twitterDF.drop(drop_cols,axis=1,inplace=True)
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   tweet_id               2175 non-null   int64              
 1   in_reply_to_status_id  78 non-null     float64            
 2   in_reply_to_user_id    78 non-null     float64            
 3   timestamp              2175 non-null   datetime64[ns, UTC]
 4   source                 2175 non-null   object             
 5   text                   2175 non-null   object             
 6   expanded_urls          2117 non-null   object             
 7   rating_numerator       2175 non-null   float64            
 8   rating_denominator     2175 non-null   float64            
 9   name                   2120 non-null   object             
 10  doggo                  2175 non-null   object             
 11  floofer                2175 non-null   object           

### Test

In [31]:
# check if any 'notnull' entries exist in retweeted_status_id; the column was dropped in the previous cell, so this lookup raises KeyError
twitterDF[twitterDF['retweeted_status_id'].notnull()]

KeyError: 'retweeted_status_id'

In [None]:
# check to ensure cols dropped
twitterDF.info()
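
The two checks can also be written so that neither depends on a KeyError. The sketch below uses a toy frame with invented values: it verifies the row filter while the column still exists, then verifies the column drop by membership.

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row plays the role of a retweet (values are illustrative).
demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.0e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.0e9, np.nan],
    "retweeted_status_timestamp": [None, "2017-07-19 00:47:34 +0000", None],
})
retweet_cols = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]

# 1. Keep only original tweets, and test before the columns are removed.
demo = demo[demo["retweeted_status_id"].isnull()]
assert demo["retweeted_status_id"].notnull().sum() == 0

# 2. Drop the now-empty columns and test by membership rather than by lookup.
demo = demo.drop(columns=retweet_cols)
assert not set(retweet_cols) & set(demo.columns)
```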

## <a name="gather2">Gather Data #2 - Tweet image predictions</a>

In [None]:
# Download data from file_url utilizing requests library & save it in the data/ folder
file_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
req = requests.get(file_url)
req.raise_for_status()  # stop here if the download failed
fname = os.path.basename(file_url)
with open(os.path.join("data", fname), 'wb') as tsv_file:
    tsv_file.write(req.content)

In [None]:
# Now read the downloaded file & view a sample to ensure read_csv worked. Also works as a visual assessment.
image_preds_orig = pd.read_csv("data/image-predictions.tsv", sep="\t")
image_preds = image_preds_orig.copy()

# visual assessment
image_preds.sample(5)

In [None]:
# programmatic assessment
image_preds.info()

[BACK TO TOP](#top)

## <a name="gather3">Gather Data #3 - Query Twitter API for additional data</a>
Query Twitter's API for JSON data for each tweet ID in the Twitter archive

 * retweet count
 * favorite count
 * any additional data found that's interesting
 * only tweets up to Aug 1st, 2017 (the range for which image predictions are present)

In [None]:
# define keys & API info 
# authenticate API using regenerated keys/tokens

consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
tweet_ids = twitterDF.tweet_id.values
len(tweet_ids)

In [None]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
'''
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
'''

### Start from here if data already obtained from Twitter                                                   

[BACK TO TOP](#top)

In [None]:
# Read tweet JSON into dataframe using pandas
# received "ValueError: Trailing data" when lines=True was omitted

rt_tweets_orig = pd.read_json("tweet.json", lines=True)
rt_tweets = rt_tweets_orig.copy()

# visual assessment as well as confirmation that read_json successful
rt_tweets.head(5)
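
`lines=True` is needed because the file holds one JSON object per line, so the file as a whole is not a single valid JSON document. An alternative, sketched below on an in-memory stand-in with made-up values, is to parse each line with `json.loads` and keep only the fields of interest; the output column names are assumptions for this example.

```python
import io
import json
import pandas as pd

# Stand-in for the tweet JSON file: one JSON object per line (values are invented).
fake_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)

records = []
for line in fake_file:
    tweet = json.loads(line)  # each line is a complete JSON document
    records.append({
        "tweet_id": tweet["id"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    })

counts = pd.DataFrame(records)
print(counts)
```

This keeps the dataframe small from the start, at the cost of having to decide up front which fields matter.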

In [None]:
# programmatic assessment
rt_tweets.info()

In [None]:
# programmatic assessment
# View retweeted tweets, first 5 of 163, these will be deleted

rt_tweets[rt_tweets.retweeted_status.notnull()].head(5)

In [None]:
# visual assessment
rt_tweets.user

In [None]:
# visual assessment
rt_tweets.columns

In [None]:
# visual assessment
# inspect the extended entities data
rt_tweets.loc[0,'extended_entities']

In [None]:
# visual assessment
# inspect the entities data
rt_tweets.loc[115,'entities']

In [None]:
# visual assessment
rt_tweets.loc[130,'user']

In [None]:
# visual assessment
rt_tweets.iloc[1:8,11:]

## <a name="t1">Tidy #1 - create new dataframe of columns needed</a>

In [None]:
# list the columns to keep; this list is used to create a new DF with only the columns we want
tweet_cols = ['created_at','id','full_text','display_text_range','retweet_count','favorite_count','user']

In [None]:
# create new DF with column defined above
rt_tweets_sub = rt_tweets.loc[:,tweet_cols]
rt_tweets_sub.head(10)

## <a name="t2">Tidy #2 - Merge 3 datasets</a>

1. twitterDF
2. rt_tweets_sub
3. image_preds

In [None]:
# data exploration
twitterDF.info()

In [None]:
# data exploration
rt_tweets_sub.info()

In [None]:
image_preds.info()

### Define 
### <a name="q7">Quality 7 - create new dataframe of columns needed</a> 

In [None]:
# dataframe has a different name for its shared column, id --> tweet_id
rt_tweets_sub = rt_tweets_sub.rename(columns={"id":"tweet_id"})
rt_tweets_sub.head(5)

In [None]:
# MERGE 2 dataframes!
new_tweets_df = pd.merge(rt_tweets_sub, twitterDF, on='tweet_id')
new_tweets_df.head(3)

In [None]:
# data exploration
new_tweets_df.info()

In [None]:
# MERGE newly merged dataframe and image_preds to get new_tweets_df2
new_tweets_df2 = pd.merge(new_tweets_df, image_preds, on='tweet_id')
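
`pd.merge` defaults to an inner join, so any `tweet_id` missing from one of the frames is dropped silently. The sketch below uses three invented frames (`tweets`, `archive`, `preds`) to show how the row count shrinks at each step; the optional `validate` argument is added here only as a safety check and raises if a key is unexpectedly duplicated.

```python
import pandas as pd

# Invented stand-ins for the three datasets; only tweet_id 1 is present in all of them
tweets = pd.DataFrame({"tweet_id": [1, 2, 3], "favorite_count": [10, 20, 30]})
archive = pd.DataFrame({"tweet_id": [1, 2, 4], "name": ["Rex", None, "Bo"]})
preds = pd.DataFrame({"tweet_id": [1, 3, 4], "p1": ["pug", "chow", "pug"]})

# Inner join keeps only the ids found in both frames
step1 = pd.merge(tweets, archive, on="tweet_id", how="inner", validate="one_to_one")
merged = pd.merge(step1, preds, on="tweet_id", how="inner", validate="one_to_one")

print(len(tweets), len(step1), len(merged))
```

Comparing the lengths before and after each merge is a quick way to see how many tweets were lost to the join.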

## <a name="save1">New Dataframe saved to file</a>

In [None]:
# write new dataframe to file
new_tweets_df2.to_csv("twitter_archive_master.csv")

[BACK TO TOP](#top)

In [None]:
# data exploration
new_tweets_df2.head(5)

In [None]:
# data exploration
# how many names are blank (null)
new_tweets_df2.name.isnull().sum()
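
When counting missing values, note that `.isnull().count()` returns the total number of rows, while `.isnull().sum()` returns the number of nulls. A short sketch on an invented series (`names` is illustrative, not the real column):

```python
import numpy as np
import pandas as pd

# Invented name column with three missing entries
names = pd.Series(["Rex", np.nan, "Bo", np.nan, np.nan])

# count() counts every non-null entry of the boolean mask, i.e. every row
total_rows = names.isnull().count()

# sum() adds up the True values, i.e. the number of missing names
missing = names.isnull().sum()

print(total_rows, missing)
```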

In [None]:
# data exploration
new_tweets_df2.loc[576,'expanded_urls']

In [None]:
# data exploration
new_tweets_df2.info()

In [None]:
# exploratory
# highest_accuracy = new_tweets_df2.query("p1_dog == true and ")

In [None]:
# count the number of times each label was the top image prediction (p1). The new series, count_by_name, is sorted by its index,
# which is alphabetical by default

count_by_name = new_tweets_df2.groupby('p1').size()
count_by_name

In [None]:
# see the 40 most frequently predicted labels
count_by_name.sort_values(ascending=False)[0:40]

In [None]:
# Investigate why 'seat_belt' is the 15th most predicted label for a dog picture. Select all tweets whose p1 value equals
# 'seat_belt' and group them by the 2nd predicted value, p2

new_tweets_df2.query("p1 == 'seat_belt'").groupby('p2').size()
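
Labels such as 'seat_belt' show up because `p1` holds whatever the classifier predicted, dog or not. One possible way to restrict the counts to dog breeds is to filter on the `p1_dog` flag first. This is a sketch on invented rows, assuming `p1_dog` is a boolean column as in the image predictions file:

```python
import pandas as pd

# Invented predictions: 'seat_belt' is flagged as not being a dog breed
preds = pd.DataFrame({
    "p1": ["pug", "seat_belt", "pug", "chow", "seat_belt"],
    "p1_dog": [True, False, True, True, False],
})

# Keep only rows where the top prediction is a dog breed, then count each breed
breed_counts = preds.loc[preds["p1_dog"], "p1"].value_counts()
print(breed_counts)
```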

In [None]:
# create new series of the 10 most frequently predicted labels

top10_names = count_by_name.sort_values(ascending=False).head(10)
top10_names
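
As an aside, `groupby(...).size()` followed by `sort_values(ascending=False)` gives the same counts as `value_counts()`, which already sorts from most to least frequent. A sketch on an invented `p1` series:

```python
import pandas as pd

# Invented top predictions
p1 = pd.Series(["pug", "pug", "chow", "pug", "chow", "seat_belt"])

# Two-step version: count per label, then sort descending
by_group = p1.groupby(p1).size().sort_values(ascending=False)

# One-step version: value_counts sorts descending by default
by_counts = p1.value_counts()

print(by_counts.head(10))
```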

In [None]:
top10_names.index.values

In [None]:
top10_val_array = top10_names.values
top10_val_array

## <a name="vis1"> Horizontal Bar Chart to visualize the top 10 breeds represented during the timeframe </a>

In [None]:
# Horizontal Bar Chart to visualize the top 10 breeds represented during the timeframe

plt.rcdefaults()
fig, ax = plt.subplots()

names = top10_names.index.values

y_pos = np.arange(len(names))

# number of tweets whose top prediction (p1) is each label
performance = top10_names.values

# no xerr argument: these are exact counts, so random error bars would be misleading
ax.barh(y_pos, performance, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Dog Breeds (predicted) Count')
ax.set_ylabel('Predicted Breeds')
ax.set_title('WeRateDogs Dog Breeds represented (top 10)')

plt.show()

[BACK TO TOP](#top)

In [None]:
# Data Exploration
new_tweets_df2.iloc[300:305,0:10]

In [None]:
# Data Exploration
new_tweets_df2.iloc[300:305,11:20]

## <a name="prog1">DATA ANALYSIS</a>

In [None]:
## Percentage of tweets in which the dog was categorized affectionately
## Means of doggo, floofer, pupper, & puppo. Essentially, how often each designation was used

## This means that 'doggo' was used to describe a pup 3.6% of the time

desig = ['doggo', 'floofer', 'pupper', 'puppo']

#new_tweets_df2.doggo.mean()

new_tweets_df2[desig].mean()
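
Taking the mean of these columns only yields a share if each column holds True/False (or 1/0) flags. A sketch on an invented frame, under the assumption that the stage columns have already been converted to booleans:

```python
import pandas as pd

# Invented stage flags for four tweets (assumed to be boolean columns)
stages = pd.DataFrame({
    "doggo": [True, False, False, False],
    "pupper": [False, True, True, False],
})

# The mean of a boolean column is the fraction of rows that are True
share = stages.mean()
print(share)
```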

In [None]:
## Data Exploration
## Names most used: the index is the dog's name and the value is the number of times it appears.
## There are a lot of missing values here

new_tweets_df2.name.value_counts()

[BACK TO TOP](#top)

In [None]:
top10_names_used = list(top10_names.index)

In [None]:
top10_names_used

[BACK TO TOP](#top)

### <a name="prog2">More DATA ANALYSIS</a> 

In [None]:
## Group the dataframe on the first predicted label, p1, & take the mean of specific columns

# This provides the appropriate columns, but the resulting dataframe is displayed in alphabetical order of p1,
# which is not a meaningful ranking

name_by_avgs = new_tweets_df2.groupby("p1")[['p1_conf','rating_numerator','rating_denominator','favorite_count',
                                             'retweet_count']].mean()
#Actually, you just need to pull out the rows you want, top10names, from the name_by_avgs. It's just sorted alphabetically
#name_by_avgs = new_tweets_df2.groupby(new_tweets_df2[newtop10])[['p1_conf','rating_numerator','rating_denominator','doggo','floofer',
#                                                 'pupper','puppo','favorite_count','retweet_count']].mean()


name_by_avgs.head(10)

In [None]:
# Get the highest average of retweets by predicted names.
p1_retweets = name_by_avgs.retweet_count.sort_values(ascending=False)
p1_retweets.head(10)

# The results indicate that tweets with pictures that are predicted as an "Arabian_camel" had an average retweet count of 17,424
# retweets. This insight says more about the neural network results and its accuracy than about the retweet specifics
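
A per-label mean can be dominated by a label that appears only once or twice. One possible safeguard is to report the group size next to the mean and rank only labels with enough tweets. The sketch below uses invented numbers, and the threshold of 3 is an arbitrary assumption:

```python
import pandas as pd

# Invented data: one viral tweet with a rare label, three ordinary tweets with a common one
toy = pd.DataFrame({
    "p1": ["Arabian_camel", "pug", "pug", "pug"],
    "retweet_count": [17000, 900, 1100, 1000],
})

# Mean retweets and number of tweets per predicted label
stats = toy.groupby("p1").agg(
    mean_retweets=("retweet_count", "mean"),
    n_tweets=("retweet_count", "count"),
)

# Rank only the labels backed by at least 3 tweets (arbitrary cut-off)
reliable = stats[stats["n_tweets"] >= 3].sort_values("mean_retweets", ascending=False)
print(reliable)
```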

In [None]:
# data exploration
#top10stats = name_by_avgs.loc[newtop10]
#top10stats.head(10)

In [None]:
#name_by_avgs.reset_index(inplace=True)

In [None]:
#name_by_avgs.rename(columns= {'p1':'probable_name', 'p1_conf':'probability'}, inplace=True)

In [None]:
# data exploration
#favorites_by_name = name_by_avgs.loc[:,['favorite_count']]
#favorites_by_name. 

In [None]:
#top15_favorites = favorites_by_name.iloc[0:15,:]
#top15_favorites.t.sort_values(ascending=False)

## <a name="vis2">Notable analysis from visual bar chart </a>

### None of the top 15 most-favorited 'dog' images were accurately identified as dogs

In [None]:
# create a subset with only the average favorite count, sorted from highest to lowest
favorites_by_name = name_by_avgs.loc[:, ['favorite_count']].sort_values(by='favorite_count', ascending=False)
# get top 15 of new subset to create visual from
top15_favorites = favorites_by_name.iloc[0:15,:]
group_names = top15_favorites.index
group_data = top15_favorites.favorite_count

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(group_names, group_data)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, horizontalalignment='right')
ax.set(xlim=[-10000, 70000], xlabel='Average favorite count', ylabel='Names (guessed by learning model)',
       title='Top 15 Favorites (tweets), by probable name')

plt.show()

[BACK TO TOP](#top)

In [None]:
name_by_avgs.query("rating_numerator >= 10").rating_numerator.sort_values(ascending=False)

In [None]:
name_by_avgs.rating_numerator.sort_values(ascending=False)