# Data Wrangling Template

In [17]:
import numpy as np
import pandas as pd
import requests
import os
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import matplotlib.pyplot as plt

## Gather

WeRateDogs Twitter archive

- WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

In [18]:
#Import csv file
tae = pd.read_csv('twitter-archive-enhanced.csv')

In [19]:
#Create directory 
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [20]:
#Download file from url
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
response
with open(os.path.join(folder_name, url.split('/')[-1]), mode ='wb') as file:
    file.write(response.content)

In [21]:
#Import tsv file
ip = pd.read_csv('image_predictions/image-predictions.tsv', sep='\t')

Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. 


Image Predictions File

- The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

- tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → TRUE



## Assess

### Assessment of Twitter Archive Enhanced

In [22]:
tae.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [23]:
tae.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1148,726887082820554753,,,2016-05-01 21:32:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Blitz. He's a new dad struggling to co...,,,,https://twitter.com/dog_rates/status/726887082...,10,10,Blitz,,,,
2078,670832455012716544,,,2015-11-29 05:11:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Amy. She is Queen Starburst. 10/10 une...,,,,https://twitter.com/dog_rates/status/670832455...,10,10,Amy,,,,
1383,700847567345688576,,,2016-02-20 01:00:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Crouton. He's a Galapagos Boonwiddle. Has...,,,,https://twitter.com/dog_rates/status/700847567...,10,10,Crouton,,,,
1941,673715861853720576,,,2015-12-07 04:09:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a heavily opinionated dog. Loves walls...,,,,https://twitter.com/dog_rates/status/673715861...,4,10,a,,,,
491,813800681631023104,,,2016-12-27 17:36:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sky. She's learning how to roll her R'...,,,,https://twitter.com/dog_rates/status/813800681...,12,10,Sky,,,,


In [24]:
tae.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [25]:
tae.info()

# tweet_id string?
# timestamp time
# source change text
# text analysis
# retweeted timestamp time
# name analysis
# doggo, floofer, pupper, puppo --> one column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [26]:
# access rows
tae.iloc[1224:1225]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1224,714214115368108032,,,2016-03-27 22:14:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy Easter from the squad! 🐇🐶 13/10 for all ...,,,,https://twitter.com/dog_rates/status/714214115...,13,10,,,,,


In [27]:
# access columns
tae[['tweet_id', 'timestamp']]

Unnamed: 0,tweet_id,timestamp
0,892420643555336193,2017-08-01 16:23:56 +0000
1,892177421306343426,2017-08-01 00:17:27 +0000
2,891815181378084864,2017-07-31 00:18:03 +0000
3,891689557279858688,2017-07-30 15:58:51 +0000
4,891327558926688256,2017-07-29 16:00:24 +0000
...,...,...
2351,666049248165822465,2015-11-16 00:24:50 +0000
2352,666044226329800704,2015-11-16 00:04:52 +0000
2353,666033412701032449,2015-11-15 23:21:54 +0000
2354,666029285002620928,2015-11-15 23:05:30 +0000


In [28]:
# build subset of table
tae[['tweet_id', 'timestamp']].iloc[1224:1228]

Unnamed: 0,tweet_id,timestamp
1224,714214115368108032,2016-03-27 22:14:49 +0000
1225,714141408463036416,2016-03-27 17:25:54 +0000
1226,713919462244790272,2016-03-27 02:43:58 +0000
1227,713909862279876608,2016-03-27 02:05:49 +0000


In [29]:
# build new dataframe
data = tae[['tweet_id', 'timestamp']].iloc[1224:1228]

In [30]:
# query dataframe
data.query('tweet_id == 714214115368108032')

Unnamed: 0,tweet_id,timestamp
1224,714214115368108032,2016-03-27 22:14:49 +0000


In [31]:
columns = tae.columns
columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [32]:
(len(tae)) - (tae.in_reply_to_status_id.isna().sum())

78

In [33]:
long_columns = []

for column in columns:
    length = len(column)
    if length > 10:
        long_columns.append(column)    

In [34]:
long_columns

['in_reply_to_status_id',
 'in_reply_to_user_id',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator']

In [35]:
id_columns = []

for column in columns:
    if 'id' in column:
        id_columns.append(column)    

In [36]:
id_columns

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'retweeted_status_id',
 'retweeted_status_user_id']

In [37]:
meet_list = []

for i in range(0, len(tae)):
    if 'meet' in tae.text.iloc[i]:
        meet = tae[['tweet_id', 'text']].iloc[i]
        meet_list.append(meet)

In [38]:
meet_list

[tweet_id                                   886680336477933568
 text        This is Derek. He's late for a dog meeting. 13...
 Name: 28, dtype: object,
 tweet_id                                   884162670584377345
 text        Meet Yogi. He doesn't have any important dog m...
 Name: 43, dtype: object,
 tweet_id                                   779123168116150273
 text        This is Reggie. He hugs everyone he meets. 12/...
 Name: 750, dtype: object,
 tweet_id                                   669353438988365824
 text        This is Tessa. She is also very pleased after ...
 Name: 2169, dtype: object,
 tweet_id                                   667806454573760512
 text        This is Filup. He is overcome with joy after f...
 Name: 2251, dtype: object]

In [39]:
meet_df = pd.DataFrame(meet_list, columns = ['tweet_id', 'text'])
meet_df

Unnamed: 0,tweet_id,text
28,886680336477933568,This is Derek. He's late for a dog meeting. 13...
43,884162670584377345,Meet Yogi. He doesn't have any important dog m...
750,779123168116150273,This is Reggie. He hugs everyone he meets. 12/...
2169,669353438988365824,This is Tessa. She is also very pleased after ...
2251,667806454573760512,This is Filup. He is overcome with joy after f...


In [40]:
for i in range(0, len(tae)):
    if 'meat' in tae.text.iloc[i]:
        print(tae.text.iloc[i])

"So... we meat again" (I'm so sorry for that pun I couldn't resist pls don't unfollow) 10/10 https://t.co/XFBrrqapZa


In [41]:
# new column needs to be created
tae.text_meat = tae.text.str.replace('meet', 'meat')

  tae.text_meat = tae.text.str.replace('meet', 'meat')


In [42]:
meat_list = []

for i in range(0, len(tae)):
    if 'meat' in tae.text_meat.iloc[i]:
        meat = tae[['tweet_id', 'text']].iloc[i]
        meat_list.append(meat)

In [43]:
meat_df = pd.DataFrame(meat_list, columns = ['tweet_id', 'text'])
meat_df

Unnamed: 0,tweet_id,text
28,886680336477933568,This is Derek. He's late for a dog meeting. 13...
43,884162670584377345,Meet Yogi. He doesn't have any important dog m...
750,779123168116150273,This is Reggie. He hugs everyone he meets. 12/...
995,748346686624440324,"""So... we meat again"" (I'm so sorry for that p..."
2169,669353438988365824,This is Tessa. She is also very pleased after ...
2251,667806454573760512,This is Filup. He is overcome with joy after f...


In [44]:
# rename columns
meat_df = meat_df.rename(columns={'tweet_id':'tweat_id'})
meat_df

Unnamed: 0,tweat_id,text
28,886680336477933568,This is Derek. He's late for a dog meeting. 13...
43,884162670584377345,Meet Yogi. He doesn't have any important dog m...
750,779123168116150273,This is Reggie. He hugs everyone he meets. 12/...
995,748346686624440324,"""So... we meat again"" (I'm so sorry for that p..."
2169,669353438988365824,This is Tessa. She is also very pleased after ...
2251,667806454573760512,This is Filup. He is overcome with joy after f...


In [45]:
# rename columns (same output)
tae.rename(columns={'timestamp':'time'}, inplace=True)
tae = tae.rename(columns={'timestamp':'time'})

In [46]:
# order items

In [47]:
# create dict
# access different parts of dict

In [48]:
# timestamp delete first part

In [49]:
# last columns replace none with nan
# use one part while concat

In [50]:
# put two columns together

In [51]:
# change order of columns

In [52]:
# order rows with criteria

In [53]:
tae.rating_sum = tae.rating_numerator/2

  tae.rating_sum = tae.rating_numerator/2


In [54]:
dog_columns = ['doggo', 'floofer', 'pupper', 'puppo']

for column in dog_columns:
    tae[column] = tae[column].str.replace('None','')

In [55]:
tae.dog_breed = tae.doggo + tae.floofer

  tae.dog_breed = tae.doggo + tae.floofer


In [56]:
for column in dog_columns:
    if '' in tae[column]:
        x=+1
        
# count columns that have a certain value

In [57]:
# replace NaN (as string or other method)

In [58]:
# divide and concat dataframes

In [64]:
meet_list

[tweet_id                                   886680336477933568
 text        This is Derek. He's late for a dog meeting. 13...
 Name: 28, dtype: object,
 tweet_id                                   884162670584377345
 text        Meet Yogi. He doesn't have any important dog m...
 Name: 43, dtype: object,
 tweet_id                                   779123168116150273
 text        This is Reggie. He hugs everyone he meets. 12/...
 Name: 750, dtype: object,
 tweet_id                                   669353438988365824
 text        This is Tessa. She is also very pleased after ...
 Name: 2169, dtype: object,
 tweet_id                                   667806454573760512
 text        This is Filup. He is overcome with joy after f...
 Name: 2251, dtype: object]

In [67]:
# create dataframe from list (column names need to match)
meet_df = pd.DataFrame(meet_list, columns=['tweet_id', 'text'])

In [68]:
meet_df

Unnamed: 0,tweet_id,text
28,886680336477933568,This is Derek. He's late for a dog meeting. 13...
43,884162670584377345,Meet Yogi. He doesn't have any important dog m...
750,779123168116150273,This is Reggie. He hugs everyone he meets. 12/...
2169,669353438988365824,This is Tessa. She is also very pleased after ...
2251,667806454573760512,This is Filup. He is overcome with joy after f...


In [72]:
meet_df = meet_df.rename(columns={'tweet_id':'tweat_id', 'text':'test'})
meet_df

In [86]:
# difference between loc (index) and iloc (new "index")?
tae.loc[0:4]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,time,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [93]:
tae.query('tweet_id <= 884162670584377345')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,time,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,,,,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,
44,883838122936631299,,,2017-07-09 00:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Noah. He can't believe someone made th...,,,,https://twitter.com/dog_rates/status/883838122...,12,10,Noah,,,,
45,883482846933004288,,,2017-07-08 00:28:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,,,,
46,883360690899218434,,,2017-07-07 16:22:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Grizzwald. He may be the floofiest floofe...,,,,https://twitter.com/dog_rates/status/883360690...,13,10,Grizzwald,,floofer,,
47,883117836046086144,,,2017-07-07 00:17:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Please only send dogs. We don't rate mechanics...,,,,https://twitter.com/dog_rates/status/883117836...,13,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [100]:
tae.query("name == 'Phineas'")

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,time,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
2104,670668383499735048,,,2015-11-28 18:19:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a magical dog. Only appe...,,,,https://twitter.com/dog_rates/status/670668383...,10,10,Phineas,,,,


In [125]:
list = []

for i in range(0, len(tae)):
    if 'meet' in (tae.text.iloc[i]):
        list.append(tae.text.iloc[i])

list

["This is Derek. He's late for a dog meeting. 13/10 pet...al to the metal https://t.co/BCoWue0abA",
 "Meet Yogi. He doesn't have any important dog meetings today he just enjoys looking his best at all times. 12/10 for dangerously dapper doggo https://t.co/YSI00BzTBZ",
 'This is Reggie. He hugs everyone he meets. 12/10 keep spreading the love Reggie https://t.co/uMfhduaate',
 'This is Tessa. She is also very pleased after finally meeting her biological father. 10/10 https://t.co/qDS1aCqppv',
 'This is Filup. He is overcome with joy after finally meeting his father. 10/10 https://t.co/TBmDJXJB75']

In [133]:
df = pd.DataFrame(list, columns=['text'])
df

Unnamed: 0,text
0,This is Derek. He's late for a dog meeting. 13...
1,Meet Yogi. He doesn't have any important dog m...
2,This is Reggie. He hugs everyone he meets. 12/...
3,This is Tessa. She is also very pleased after ...
4,This is Filup. He is overcome with joy after f...


### Assessment of Image Predictions

### Issues Twitter Archive Enhanced

#### Tidiness

#### Quality


### Issues Image Predictions

#### Tidiness

#### Quality


# Clean


## Clean Twitter Archive Enhanced

#### Define

#### Code

#### Test

# Store, Analyze, and Visualize