# WeRateDogs - Udacity Data Wrangling Project 03
---
## 8 Quality Issues 
Quality issues, also known as dirty data, are content problems: mislabeled, corrupted, duplicated, or inconsistent values.

### twitter-archive-enhanced.csv quality issues:

1. columns 'timestamp' & 'retweeted_status_timestamp' are objects (strings), not datetime type

2. numerous dog names are "a"; replace with np.nan
   
3. doggo, floofer, pupper, & puppo use the string 'None' for missing values; replace 'None' with 0, and the designation ('doggo', 'floofer', etc.) with 1

4. remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck

5. remove retweets


---
## 2 Tidiness Issues
Messy data has structural issues: it breaks the tidy-data rules that each variable forms a column, each observation forms a row, & each type of observational unit forms a table.


### all 3 datasets tidiness issues:

1. merge all 3 datasets; remove unwanted columns


### image-predictions.tsv tidiness issues:

2. Messy data - the prediction rank (1, 2, 3) is stored in the column names --> p1, p2, p3, p1_conf, p2_conf, p3_conf, etc. Pivot into 3 columns: prediction #, prediction name, prediction probability

### twitter-archive-enhanced.csv tidiness issues:

3. Messy data - one variable is spread across four columns --> doggo, floofer, pupper, puppo. Presumably a dog should have only 1 designation? If so, this issue can only be resolved imperfectly (which designation to keep when 2 or more are given). If not, and multiple designations are allowed, then the issue becomes moot.
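One possible way to handle the designation columns is to collapse them into a single column and join the values when a tweet mentions more than one. The sketch below is only an illustration on a toy frame, not the approach this notebook takes; the data and the column name `stage` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the archive's four designation columns (hypothetical data)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo":   ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "floofer"],
    "pupper":  ["None", "None", "None"],
    "puppo":   ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def collapse_stages(row):
    """Join every designation present in the row; return None when there is none."""
    found = [value for value in row if value != "None"]
    return ",".join(found) if found else None

# One 'stage' column replaces the four original columns
toy["stage"] = toy[stage_cols].apply(collapse_stages, axis=1)
toy = toy.drop(columns=stage_cols)
print(toy)
```

Joining with a comma keeps tweets that mention two designations intact, so no choice has to be made about which one to keep.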
            

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import requests

import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

## Gather Data #1 - Twitter archive

In [2]:
twitterDF = pd.read_csv("data/twitter-archive-enhanced.csv")
twitterDF.head(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
# review data columns in DF, are Dtypes appropriate, etc.
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [4]:
# review names of pups
twitterDF.name.value_counts()

None       745
a           55
Charlie     12
Cooper      11
Oliver      11
          ... 
Dwight       1
Edmund       1
Jonah        1
Ralph        1
this         1
Name: name, Length: 957, dtype: int64

In [5]:
# review dogtionary names; interesting to see id# 200 has 2 values, doggo & floofer
twitterDF[twitterDF['floofer'] != 'None'].head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
46,883360690899218434,,,2017-07-07 16:22:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Grizzwald. He may be the floofiest floofe...,,,,https://twitter.com/dog_rates/status/883360690...,13,10,Grizzwald,,floofer,,
200,854010172552949760,,,2017-04-17 16:34:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...","At first I thought this was a shy doggo, but i...",,,,https://twitter.com/dog_rates/status/854010172...,11,10,,doggo,floofer,,
582,800388270626521089,,,2016-11-20 17:20:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Doc. He takes time out of every day to...,,,,https://twitter.com/dog_rates/status/800388270...,12,10,Doc,,floofer,,


In [6]:
# it appears the designations were pulled from the tweeted text, 'doggo' & 'floofer' in text below
twitterDF.loc[200,'text']

"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk"

In [7]:
# Illustrating that pup designations are NOT singular: one tweet can carry multiple designations (e.g. doggo & pupper)
twitterDF[twitterDF['doggo'] != 'None'].sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
992,748692773788876800,,,2016-07-01 01:40:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",That is Quizno. This is his beach. He does not...,,,,https://twitter.com/dog_rates/status/748692773...,10,10,his,doggo,,,
857,763956972077010945,7.638652e+17,15846407.0,2016-08-12 04:35:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@TheEllenShow I'm not sure if you know this bu...,,,,,12,10,,doggo,,,
889,759793422261743616,,,2016-07-31 16:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...","Meet Maggie &amp; Lila. Maggie is the doggo, L...",,,,https://twitter.com/dog_rates/status/759793422...,12,10,Maggie,doggo,,pupper,
318,834574053763584002,,,2017-02-23 01:22:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a doggo fully pupared for a shower. H*c...,,,,https://twitter.com/dog_rates/status/834574053...,13,10,,doggo,,,
1176,719991154352222208,,,2016-04-12 20:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This doggo was initially thrilled when she saw...,,,,https://twitter.com/dog_rates/status/719991154...,10,10,,doggo,,,


## Q1 - Convert dtype of timestamp columns
Q1 = Quality Item #1

In [8]:
# Fixed 2 columns with incorrect datatypes, changed to datetime64
twitterDF.timestamp = pd.to_datetime(twitterDF.timestamp)
twitterDF.retweeted_status_timestamp = pd.to_datetime(twitterDF.retweeted_status_timestamp)
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           
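As a small illustration of the conversion, the sketch below parses one string in the archive's timestamp format together with a missing value. The sample series is hypothetical, and passing `utc=True` is an optional choice made here to make the timezone handling explicit.

```python
import pandas as pd

# Hypothetical sample: one archive-style timestamp and one missing value
raw = pd.Series(["2017-08-01 16:23:56 +0000", None])

# utc=True yields a timezone-aware column; missing entries become NaT
parsed = pd.to_datetime(raw, utc=True)
print(parsed.dtype)
print(parsed.isna().tolist())
```

Missing values turn into `NaT`, which is why the retweet timestamp column still reports only 181 non-null entries after conversion.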

## Q2 - dog names = 'a', replace with NaN

In [9]:
# replace dog names that match 'a' with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
twitterDF.name = np.where(twitterDF.name == 'a', np.nan, twitterDF.name)

In [10]:
# check to ensure all 'a' names were removed 
twitterDF[twitterDF.name == 'a']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
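The earlier `value_counts()` output also shows other placeholders such as 'None' and 'this'. A possible extension, which goes beyond what this notebook does, is to treat 'None' and any lowercase word as missing. The sample series and the lowercase rule below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the name column, including placeholders seen in value_counts()
names = pd.Series(["Phineas", "a", "None", "this", "his", "Tilly"])

# Assumption: real names start with an uppercase letter, and 'None' is a placeholder
is_placeholder = (names == "None") | names.str.match(r"^[a-z]")
cleaned = names.mask(is_placeholder, np.nan)
print(cleaned.tolist())
```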


## Q3 - doggo, floofer, pupper, & puppo use 'None'; replace with 0, & 1 where present

In [11]:
# replace 'None' with 0
# replace 'doggo' with 1
twitterDF.doggo = np.where(twitterDF.doggo == 'None', 0, twitterDF.doggo)
twitterDF.doggo = np.where(twitterDF.doggo == 'doggo', 1, twitterDF.doggo)

In [12]:
# replace 'None' with 0
# replace 'floofer' with 1
twitterDF.floofer = np.where(twitterDF.floofer == 'None', 0, twitterDF.floofer)
twitterDF.floofer = np.where(twitterDF.floofer == 'floofer', 1, twitterDF.floofer)

In [13]:
# replace 'None' with 0
# replace 'pupper' with 1
twitterDF.pupper = np.where(twitterDF.pupper == 'None', 0, twitterDF.pupper)
twitterDF.pupper = np.where(twitterDF.pupper == 'pupper', 1, twitterDF.pupper)

In [14]:
# replace 'None' with 0
# replace 'puppo' with 1
twitterDF.puppo = np.where(twitterDF.puppo == 'None', 0, twitterDF.puppo)
twitterDF.puppo = np.where(twitterDF.puppo == 'puppo', 1, twitterDF.puppo)

In [15]:
# check to ensure cleaning successful
twitterDF[twitterDF.puppo == 'None'].count()

tweet_id                      0
in_reply_to_status_id         0
in_reply_to_user_id           0
timestamp                     0
source                        0
text                          0
retweeted_status_id           0
retweeted_status_user_id      0
retweeted_status_timestamp    0
expanded_urls                 0
rating_numerator              0
rating_denominator            0
name                          0
doggo                         0
floofer                       0
pupper                        0
puppo                         0
dtype: int64

In [16]:
# check to ensure cleaning successful
twitterDF.query("floofer == 1")

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
46,883360690899218434,,,2017-07-07 16:22:55+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Meet Grizzwald. He may be the floofiest floofe...,,,NaT,https://twitter.com/dog_rates/status/883360690...,13,10,Grizzwald,0,1,0,0
200,854010172552949760,,,2017-04-17 16:34:26+00:00,"<a href=""http://twitter.com/download/iphone"" r...","At first I thought this was a shy doggo, but i...",,,NaT,https://twitter.com/dog_rates/status/854010172...,11,10,,1,1,0,0
582,800388270626521089,,,2016-11-20 17:20:08+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Doc. He takes time out of every day to...,,,NaT,https://twitter.com/dog_rates/status/800388270...,12,10,Doc,0,1,0,0
774,776218204058357768,,,2016-09-15 00:36:55+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Atlas rolled around in some chalk and now he's...,,,NaT,https://twitter.com/dog_rates/status/776218204...,13,10,,0,1,0,0
984,749317047558017024,,,2016-07-02 19:01:20+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Blu. He's a wild bush Floofer. I wish ...,,,NaT,https://twitter.com/dog_rates/status/749317047...,12,10,Blu,0,1,0,0
1022,746542875601690625,,,2016-06-25 03:17:46+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here's a golden floofer helping with the groce...,,,NaT,https://vine.co/v/5uZYwqmuDeT,11,10,,0,1,0,0
1091,737445876994609152,,,2016-05-31 00:49:32+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Just wanted to share this super rare Rainbow F...,,,NaT,https://twitter.com/dog_rates/status/737445876...,13,10,,0,1,0,0
1110,733822306246479872,,,2016-05-21 00:50:46+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Moose. He's a Polynesian Floofer. Dapp...,,,NaT,https://twitter.com/dog_rates/status/733822306...,10,10,Moose,0,1,0,0
1534,689993469801164801,,,2016-01-21 02:10:37+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here we are witnessing a rare High Stepping Al...,,,NaT,https://vine.co/v/ienexVMZgi5,12,10,,0,1,0,0
1614,685307451701334016,,,2016-01-08 03:50:03+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Petrick. He's an Altostratus Floo...,,,NaT,https://twitter.com/dog_rates/status/685307451...,11,10,Petrick,0,1,0,0
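The four pairs of `np.where` calls can also be written as one vectorized comparison, which additionally gives integer columns instead of object columns. This is a sketch on a toy frame under the assumption that every cell is either 'None' or the designation name; it is not the code the notebook ran.

```python
import pandas as pd

# Toy frame with the same placeholder convention as the archive (hypothetical data)
toy = pd.DataFrame({
    "doggo":   ["doggo", "None"],
    "floofer": ["None", "floofer"],
    "pupper":  ["None", "None"],
    "puppo":   ["None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# A cell becomes 1 when it holds anything other than the string 'None', else 0
toy[stage_cols] = (toy[stage_cols] != "None").astype(int)
print(toy.dtypes)
```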


## Q4 - remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck

In [17]:
# review names of sources
twitterDF.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [18]:
twitterDF.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,0,0,0,0
1,892177421306343426,,,2017-08-01 00:17:27+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,0,0,0,0


In [19]:
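def update_source(source_html):
    """Map the raw HTML anchor in 'source' to a short label."""
    # order matters: the iPhone anchor also contains 'Twitter', so test for it first
    if 'iphone' in source_html:
        return 'iphone'
    elif 'vine' in source_html:
        return 'vine'
    elif 'Twitter' in source_html:
        return 'twitter web client'
    elif 'TweetDeck' in source_html:
        return 'TweetDeck'
    # any unrecognized source falls through and returns None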

In [20]:
# run update_source function on every row to replace source text with shorter description of source
twitterDF.source = twitterDF.source.apply(update_source)

In [21]:
# check to ensure function replaced items as intended
twitterDF.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
950,752173152931807232,,,2016-07-10 16:10:29+00:00,iphone,This is Brody. He's a lifeguard. Always prepar...,,,NaT,https://twitter.com/dog_rates/status/752173152...,12,10,Brody,0,0,0,0
1814,676617503762681856,,,2015-12-15 04:19:18+00:00,iphone,I promise this wasn't meant to be a cuteness o...,,,NaT,https://twitter.com/dog_rates/status/676617503...,13,10,,0,0,1,0
1017,746872823977771008,,,2016-06-26 01:08:52+00:00,iphone,This is a carrot. We only rate dogs. Please on...,,,NaT,https://twitter.com/dog_rates/status/746872823...,11,10,,0,0,0,0
311,835297930240217089,,,2017-02-25 01:18:40+00:00,iphone,Meet Ash. He's a Benebop Cumberplop. Quite rar...,,,NaT,https://twitter.com/dog_rates/status/835297930...,12,10,Ash,0,0,0,0
1043,743835915802583040,,,2016-06-17 16:01:16+00:00,iphone,RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4196984000.0,2015-11-19 00:32:12+00:00,https://twitter.com/dog_rates/status/667138269...,10,10,,0,0,0,0
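An alternative to matching substrings is to pull the visible link text out of the HTML anchor with a regular expression. The sketch below uses the four source strings reported by `value_counts()` above; keeping the full link text as the label is a choice made here for illustration, not the notebook's labeling.

```python
import pandas as pd

# The four distinct source values shown in the value_counts() output
sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture the text between the opening tag's '>' and the closing '</a>'
labels = sources.str.extract(r">([^<]+)</a>", expand=False)
print(labels.tolist())
```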


## Q5 - remove retweets & delete columns

In [22]:
twitterDF.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
808,771770456517009408,,,2016-09-02 18:03:10+00:00,iphone,This is Davey. He'll have your daughter home b...,,,NaT,https://twitter.com/dog_rates/status/771770456...,11,10,Davey,0,0,0,0
302,836648853927522308,,,2017-02-28 18:46:45+00:00,iphone,RT @SchafeBacon2016: @dog_rates Slightly distu...,8.366481e+17,7.124572e+17,2017-02-28 18:43:57+00:00,https://twitter.com/SchafeBacon2016/status/836...,11,10,,0,0,0,0


In [23]:
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    datetime64[ns, UTC]
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           

In [24]:
# Get indices of rows to drop: any row whose retweeted_status_id is not NaN is a retweet.
drop_these = twitterDF[twitterDF['retweeted_status_id'].notnull()].index
twitterDF.drop(drop_these,inplace=True)
twitterDF.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
370,828409743546925057,,,2017-02-06 01:07:28+00:00,iphone,This is Mutt Ryan. He's quite confident at the...,,,NaT,https://twitter.com/dog_rates/status/828409743...,12,10,Mutt,0,0,0,0
190,855857698524602368,,,2017-04-22 18:55:51+00:00,iphone,"HE'S LIKE ""WAIT A MINUTE I'M AN ANIMAL THIS IS...",,,NaT,https://twitter.com/perfy/status/8558573181681...,13,10,,0,0,0,0
1041,743980027717509120,,,2016-06-18 01:33:55+00:00,iphone,This is Geno. He's a Wrinkled Baklavian Velvee...,,,NaT,https://twitter.com/dog_rates/status/743980027...,11,10,Geno,0,0,0,0


In [25]:
# check if any 'notnull' entries exist in retweeted_status_id
twitterDF[twitterDF['retweeted_status_id'].notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [26]:
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2175 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2175 non-null   datetime64[ns, UTC]
 4   source                      2175 non-null   object             
 5   text                        2175 non-null   object             
 6   retweeted_status_id         0 non-null      float64            
 7   retweeted_status_user_id    0 non-null      float64            
 8   retweeted_status_timestamp  0 non-null      datetime64[ns, UTC]
 9   expanded_urls               2117 non-null   object             
 10  rating_numerator            2175 non-null   int64           

In [27]:
drop_cols = ['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp']
twitterDF.drop(drop_cols,axis=1,inplace=True)
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   tweet_id               2175 non-null   int64              
 1   in_reply_to_status_id  78 non-null     float64            
 2   in_reply_to_user_id    78 non-null     float64            
 3   timestamp              2175 non-null   datetime64[ns, UTC]
 4   source                 2175 non-null   object             
 5   text                   2175 non-null   object             
 6   expanded_urls          2117 non-null   object             
 7   rating_numerator       2175 non-null   int64              
 8   rating_denominator     2175 non-null   int64              
 9   name                   2120 non-null   object             
 10  doggo                  2175 non-null   object             
 11  floofer                2175 non-null   object           
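The two steps of this section, dropping retweet rows and then dropping the retweet columns, can also be expressed as a single boolean filter followed by `drop(columns=...)`. The toy frame below is hypothetical and only mirrors the three retweet columns.

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row is a retweet (hypothetical data)
toy = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "retweeted_status_id": [np.nan, 6.67e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.19e9, np.nan],
    "retweeted_status_timestamp": pd.to_datetime(
        [None, "2015-11-19 00:32:12", None], utc=True
    ),
})

# Keep rows with no retweet id, then remove the now-empty retweet columns
retweet_cols = [c for c in toy.columns if c.startswith("retweeted_status")]
originals = toy[toy["retweeted_status_id"].isna()].drop(columns=retweet_cols)
print(originals)
```

Filtering returns a new frame instead of modifying the original in place, which makes it easier to re-run a cell without side effects.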

In [51]:
twitterDF[twitterDF.in_reply_to_status_id.notnull()]


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35+00:00,iphone,@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,12,10,,0,0,0,0
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53+00:00,iphone,@roushfenway These are good dogs but 17/10 is ...,,17,10,,0,0,0,0
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36+00:00,iphone,@RealKentMurphy 14/10 confirmed,,14,10,,0,0,0,0
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25+00:00,iphone,@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,10,10,,0,0,0,0
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35+00:00,iphone,@Jack_Septic_Eye I'd need a few more pics to p...,,12,10,,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2038,671550332464455680,6.715449e+17,4.196984e+09,2015-12-01 04:44:10+00:00,iphone,After 22 minutes of careful deliberation this ...,,1,10,,0,0,0,0
2149,669684865554620416,6.693544e+17,4.196984e+09,2015-11-26 01:11:28+00:00,iphone,After countless hours of research and hundreds...,,11,10,,0,0,0,0
2169,669353438988365824,6.678065e+17,4.196984e+09,2015-11-25 03:14:30+00:00,iphone,This is Tessa. She is also very pleased after ...,https://twitter.com/dog_rates/status/669353438...,10,10,Tessa,0,0,0,0
2189,668967877119254528,6.689207e+17,2.143566e+07,2015-11-24 01:42:25+00:00,iphone,12/10 good shit Bubka\n@wane15,,12,10,,0,0,0,0


## Gather Data #2 - Tweet image predictions

In [None]:
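file_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
req = requests.get(file_url)
req.raise_for_status()  # stop here if the download failed
fname = os.path.basename(file_url)
# write the raw bytes; the context manager closes the file afterwards
with open(os.path.join("data", fname), 'wb') as f:
    f.write(req.content)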

In [28]:
image_preds = pd.read_csv("data/image-predictions.tsv", sep="\t")
image_preds.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
951,704871453724954624,https://pbs.twimg.com/media/Ccg02LiWEAAJHw1.jpg,1,Norfolk_terrier,0.689504,True,soft-coated_wheaten_terrier,0.10148,True,Norwich_terrier,0.055779,True
1882,847157206088847362,https://pbs.twimg.com/media/C8G0_CMWsAAjjAY.jpg,2,Staffordshire_bullterrier,0.219609,True,American_Staffordshire_terrier,0.178671,True,pug,0.123271,True
1479,780800785462489090,https://pbs.twimg.com/media/CtX2Kr9XYAAuxrM.jpg,2,Siberian_husky,0.951963,True,Eskimo_dog,0.035346,True,Pembroke,0.008862,True
1702,817171292965273600,https://pbs.twimg.com/media/C1cs8uAWgAEwbXc.jpg,1,golden_retriever,0.295483,True,Irish_setter,0.144431,True,Chesapeake_Bay_retriever,0.077879,True
1235,746507379341139972,https://pbs.twimg.com/media/Clwgf4bWgAAB15c.jpg,1,toy_poodle,0.508292,True,Lakeland_terrier,0.234458,True,affenpinscher,0.084563,True


In [29]:
image_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
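The tidiness issue for this table (prediction rank stored in the column names) could be addressed by stacking the three prediction groups into long form. This is a sketch on a toy frame; the output column names `prediction_num`, `prediction`, `confidence`, and `is_dog` are hypothetical choices, and the data is made up.

```python
import pandas as pd

# Toy frame in the same wide layout as image-predictions.tsv (hypothetical values)
wide = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["pug", "toy_poodle"], "p1_conf": [0.7, 0.5], "p1_dog": [True, True],
    "p2": ["beagle", "banana"], "p2_conf": [0.2, 0.3], "p2_dog": [True, False],
    "p3": ["chow", "collie"], "p3_conf": [0.05, 0.1], "p3_dog": [True, True],
})

frames = []
for n in (1, 2, 3):
    # Select one prediction group and give it rank-independent column names
    part = wide[["tweet_id", f"p{n}", f"p{n}_conf", f"p{n}_dog"]].copy()
    part.columns = ["tweet_id", "prediction", "confidence", "is_dog"]
    part.insert(1, "prediction_num", n)
    frames.append(part)

# One row per (tweet, prediction rank)
long = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["tweet_id", "prediction_num"])
    .reset_index(drop=True)
)
print(long)
```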


## Gather Data #3 - Query Twitter API for additional data
Query Twitter's API for JSON data for each tweet ID in the Twitter archive

 * retweet count
 * favorite count
 * any additional data found that's interesting
 * only tweets up to Aug 1st, 2017 (image predictions present)

In [None]:
# authenticate API using regenerated keys/tokens

consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
tweet_ids = twitterDF.tweet_id.values
len(tweet_ids)

In [None]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        # TweepError is the Tweepy 3.x exception class; Tweepy 4+ renamed it to tweepy.errors.TweepyException
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
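Since the file holds one JSON object per line, it can also be read line by line with the standard `json` module, keeping only the fields of interest. The sketch below uses an in-memory stand-in for the file with two made-up records; the output column name `tweet_id` is a hypothetical choice.

```python
import io
import json
import pandas as pd

# Two hypothetical lines standing in for the saved API responses
fake_file = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)

records = []
for line in fake_file:
    tweet = json.loads(line)
    # Keep only the id and the two counts
    records.append({
        "tweet_id": tweet["id"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    })

counts = pd.DataFrame(records)
print(counts)
```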

### Pick up from here if data already obtained from Twitter

In [30]:
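# Read tweet JSON into dataframe using pandas
# lines=True is required: without it pandas raises "ValueError: Trailing data"
# note: the query cell above writes 'tweet_json.txt'; this cell reads 'tweet.json', presumably a renamed copy

rt_tweets = pd.read_json("tweet.json", lines=True)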
rt_tweets.head(5)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336192,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343424,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,


In [31]:
rt_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2330 entries, 0 to 2329
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype              
---  ------                         --------------  -----              
 0   created_at                     2330 non-null   datetime64[ns, UTC]
 1   id                             2330 non-null   int64              
 2   id_str                         2330 non-null   int64              
 3   full_text                      2330 non-null   object             
 4   truncated                      2330 non-null   bool               
 5   display_text_range             2330 non-null   object             
 6   entities                       2330 non-null   object             
 7   extended_entities              2058 non-null   object             
 8   source                         2330 non-null   object             
 9   in_reply_to_status_id          77 non-null     float64            
 10  in_reply_to_status_id_st

In [32]:
rt_tweets[rt_tweets.retweeted_status.notnull()].head(5)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
31,2017-07-15 02:45:48+00:00,886054160059072513,886054160059072512,RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,False,"[0, 50]","{'hashtags': [{'text': 'BATP', 'indices': [21,...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,und,{'created_at': 'Sat Jul 15 02:44:07 +0000 2017...,8.860534e+17,8.860534e+17,"{'url': 'https://t.co/WxwJmvjfxo', 'expanded':...",
35,2017-07-13 01:35:06+00:00,885311592912609280,885311592912609280,RT @dog_rates: This is Lilly. She just paralle...,False,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 830583314243268608, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,{'created_at': 'Sun Feb 12 01:04:29 +0000 2017...,,,,
67,2017-06-26 00:13:58+00:00,879130579576475649,879130579576475648,RT @dog_rates: This is Emmy. She was adopted t...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,,,en,{'created_at': 'Fri Jun 23 01:10:23 +0000 2017...,,,,
72,2017-06-24 00:09:53+00:00,878404777348136964,878404777348136960,RT @dog_rates: Meet Shadow. In an attempt to r...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,,,en,{'created_at': 'Fri Jun 23 16:00:04 +0000 2017...,,,,
73,2017-06-23 18:17:33+00:00,878316110768087041,878316110768087040,RT @dog_rates: Meet Terrance. He's being yelle...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,,,en,{'created_at': 'Tue Nov 24 03:51:38 +0000 2015...,,,,


In [33]:
rt_tweets.user

0       {'id': 4196983835, 'id_str': '4196983835', 'na...
1       {'id': 4196983835, 'id_str': '4196983835', 'na...
2       {'id': 4196983835, 'id_str': '4196983835', 'na...
3       {'id': 4196983835, 'id_str': '4196983835', 'na...
4       {'id': 4196983835, 'id_str': '4196983835', 'na...
                              ...                        
2325    {'id': 4196983835, 'id_str': '4196983835', 'na...
2326    {'id': 4196983835, 'id_str': '4196983835', 'na...
2327    {'id': 4196983835, 'id_str': '4196983835', 'na...
2328    {'id': 4196983835, 'id_str': '4196983835', 'na...
2329    {'id': 4196983835, 'id_str': '4196983835', 'na...
Name: user, Length: 2330, dtype: object

In [34]:
rt_tweets.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive',
       'possibly_sensitive_appealable', 'lang', 'retweeted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'quoted_status'],
      dtype='object')

In [35]:
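# `retweeted` reports whether the authenticated user retweeted the tweet,
# not whether the tweet is itself a retweet, so this filter returns no rows
rt_tweets[rt_tweets.retweeted]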

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
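
The filter above comes back empty even though several rows clearly start with "RT @". A minimal sketch of the distinction, using a hypothetical three-row stand-in (`toy`) rather than the real `rt_tweets`: the `retweeted` flag stays False, while a non-null `retweeted_status` marks the rows that are retweets.

```python
import pandas as pd

# Toy stand-in for rt_tweets (hypothetical data, not taken from the API dump):
# `retweeted` is False everywhere, `retweeted_status` holds a dict only for retweets.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "retweeted": [False, False, False],
    "retweeted_status": [None, {"id": 99}, None],
})

# Rows selected by the boolean flag
flag_hits = int(toy["retweeted"].sum())

# Rows that embed an original tweet, i.e. the retweets
status_hits = int(toy["retweeted_status"].notnull().sum())

# Original tweets only
originals = toy[toy["retweeted_status"].isnull()]

print(flag_hits, status_hits, len(originals))
```

On the real frame, `rt_tweets.retweeted_status.notnull().sum()` would give the number of retweets to expect before filtering.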


In [36]:
# inspect the extended entities data
rt_tweets.loc[0,'extended_entities']

{'media': [{'id': 892420639486877696,
   'id_str': '892420639486877696',
   'indices': [86, 109],
   'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
   'url': 'https://t.co/MgUWQ76dJU',
   'display_url': 'pic.twitter.com/MgUWQ76dJU',
   'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
   'type': 'photo',
   'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
    'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
    'small': {'w': 540, 'h': 528, 'resize': 'fit'},
    'large': {'w': 540, 'h': 528, 'resize': 'fit'}}}]}

In [37]:
# inspect the entities data
rt_tweets.loc[115,'entities']

{'hashtags': [],
 'symbols': [],
 'user_mentions': [],
 'urls': [],
 'media': [{'id': 869702951354474496,
   'id_str': '869702951354474496',
   'indices': [140, 163],
   'media_url': 'http://pbs.twimg.com/media/DBHOOfOXoAABKlU.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/DBHOOfOXoAABKlU.jpg',
   'url': 'https://t.co/vmCu3PFCQq',
   'display_url': 'pic.twitter.com/vmCu3PFCQq',
   'expanded_url': 'https://twitter.com/dog_rates/status/869702957897576449/photo/1',
   'type': 'photo',
   'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
    'large': {'w': 901, 'h': 1600, 'resize': 'fit'},
    'small': {'w': 383, 'h': 680, 'resize': 'fit'},
    'medium': {'w': 676, 'h': 1200, 'resize': 'fit'}}}]}
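
Both `entities` and `extended_entities` nest the image URL inside a `media` list, and some rows have no media at all (or a NaN instead of a dict). The helper below is a sketch of one way to pull the first image URL out safely; `first_media_url` is a hypothetical name, not something defined elsewhere in this notebook, and the sample dict is abbreviated from the output of cell 36.

```python
def first_media_url(entities):
    """Return the HTTPS URL of the first media item, or None if there is none."""
    # Rows without media hold NaN (a float) rather than a dict
    if not isinstance(entities, dict):
        return None
    media = entities.get("media") or []
    if not media:
        return None
    return media[0].get("media_url_https")


# Abbreviated copy of the extended_entities value inspected above
sample = {
    "media": [{
        "id": 892420639486877696,
        "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg",
        "type": "photo",
    }]
}

print(first_media_url(sample))
print(first_media_url(float("nan")))
```

Applied to the whole column, this would look like `rt_tweets["extended_entities"].apply(first_media_url)`.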

In [38]:
rt_tweets.loc[130,'user']

{'id': 4196983835,
 'id_str': '4196983835',
 'name': 'WeRateDogs®',
 'screen_name': 'dog_rates',
 'location': 'merch ➜',
 'description': 'Your Only Source For Professional Dog Ratings Instagram and Facebook ➜ WeRateDogs partnerships@weratedogs.com ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀',
 'url': 'https://t.co/N7sNNHSfPq',
 'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHSfPq',
     'expanded_url': 'http://weratedogs.com',
     'display_url': 'weratedogs.com',
     'indices': [0, 23]}]},
  'description': {'urls': []}},
 'protected': False,
 'followers_count': 8894598,
 'friends_count': 18,
 'listed_count': 6035,
 'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
 'favourites_count': 145946,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': True,
 'verified': True,
 'statuses_count': 13082,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'is_translation_enabled': False,
 'profile_background_color': '000000',
 'profile_background_image_url': 'http://abs.twimg.com/images/them
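
The `user` column stores one dict per tweet, which is awkward to filter or aggregate on. If a few of its fields are wanted as ordinary columns, `pd.json_normalize` can flatten them. This is only a sketch on a hypothetical two-row frame (`toy`); the field values are copied from the output above, and the `user_` prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for rt_tweets with a non-contiguous index,
# as happens after retweets are filtered out
toy = pd.DataFrame(
    {
        "id": [10, 11],
        "user": [
            {"id": 4196983835, "screen_name": "dog_rates", "followers_count": 8894598},
            {"id": 4196983835, "screen_name": "dog_rates", "followers_count": 8894598},
        ],
    },
    index=[0, 5],
)

# Flatten the dicts into columns, then realign with the original index
user_flat = pd.json_normalize(toy["user"].tolist())
user_flat.index = toy.index

# Keep only the fields of interest and prefix them to avoid name clashes
wanted = user_flat[["screen_name", "followers_count"]].add_prefix("user_")
flat = toy.drop(columns="user").join(wanted)

print(flat)
```

Since every tweet here comes from the same account, the flattened values are constant across rows. That is a reasonable argument for dropping `user` altogether once it has been inspected.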

In [45]:
rt_tweets.iloc[1:8,11:]

Unnamed: 0,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,user,geo,coordinates,place,contributors,is_quote_status,retweet_count,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
1,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,5549,...,False,False,0.0,0.0,en,,,,,
2,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,3671,...,False,False,0.0,0.0,en,,,,,
3,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,7649,...,False,False,0.0,0.0,en,,,,,
4,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,8249,...,False,False,0.0,0.0,en,,,,,
5,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,2759,...,False,False,0.0,0.0,en,,,,,
6,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,1791,...,False,False,0.0,0.0,en,,,,,
7,,,,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,,,False,16725,...,False,False,0.0,0.0,en,,,,,


In [41]:
# keep only original tweets: retweets carry a non-null retweeted_status,
# so filtering them out should leave 2167 rows
rt_tweets = rt_tweets[rt_tweets.retweeted_status.isnull()].copy()
rt_tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2167 entries, 0 to 2329
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype              
---  ------                         --------------  -----              
 0   created_at                     2167 non-null   datetime64[ns, UTC]
 1   id                             2167 non-null   int64              
 2   id_str                         2167 non-null   int64              
 3   full_text                      2167 non-null   object             
 4   truncated                      2167 non-null   bool               
 5   display_text_range             2167 non-null   object             
 6   entities                       2167 non-null   object             
 7   extended_entities              1986 non-null   object             
 8   source                         2167 non-null   object             
 9   in_reply_to_status_id          77 non-null     float64            
 10  in_reply_to_status_id_st

In [42]:
rt_tweets.sample(3)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
1664,2015-12-28 05:07:27+00:00,681340665377193984,681340665377193984,I've been told there's a slight possibility he...,False,"[0, 106]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",6.813394e+17,...,False,False,,,en,,,,,
1559,2016-01-13 02:17:20+00:00,687096057537363968,687096057537363968,This pupper's New Year's resolution was to bec...,False,"[0, 125]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 687096050230820864, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2032,2015-11-30 15:18:34+00:00,671347597085433856,671347597085433856,This is Lola. She was not fully prepared for t...,False,"[0, 90]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671347593046306816, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,


In [43]:
# list the columns to keep in the new, narrower DataFrame
tweet_cols = ['created_at','id','full_text','display_text_range','retweet_count','favorite_count','user']

In [49]:
# create a new DataFrame containing only the columns listed above
rt_tweets_sub = rt_tweets.loc[:,tweet_cols]
rt_tweets_sub.head(10)

Unnamed: 0,created_at,id,full_text,display_text_range,retweet_count,favorite_count,user
0,2017-08-01 16:23:56+00:00,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,"[0, 85]",7477,35388,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,2017-08-01 00:17:27+00:00,892177421306343426,This is Tilly. She's just checking pup on you....,"[0, 138]",5549,30638,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,2017-07-31 00:18:03+00:00,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,"[0, 121]",3671,23034,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,2017-07-30 15:58:51+00:00,891689557279858688,This is Darla. She commenced a snooze mid meal...,"[0, 79]",7649,38689,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,2017-07-29 16:00:24+00:00,891327558926688256,This is Franklin. He would like you to stop ca...,"[0, 138]",8249,36965,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
5,2017-07-29 00:08:17+00:00,891087950875897856,Here we have a majestic great white breaching ...,"[0, 138]",2759,18630,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
6,2017-07-28 16:27:12+00:00,890971913173991426,Meet Jax. He enjoys ice cream so much he gets ...,"[0, 140]",1791,10828,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
7,2017-07-28 00:22:40+00:00,890729181411237888,When you watch your owner call another dog a g...,"[0, 118]",16725,59634,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
8,2017-07-27 16:25:51+00:00,890609185150312448,This is Zoey. She doesn't want to be one of th...,"[0, 122]",3815,25645,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
9,2017-07-26 15:59:51+00:00,890240255349198849,This is Cassie. She is a college pup. Studying...,"[0, 133]",6490,29261,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [46]:
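# every remaining row is an original tweet, so retweeted_status is entirely null;
# reassigning avoids modifying a filtered slice in place
rt_tweets = rt_tweets.drop(columns='retweeted_status')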
rt_tweets.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive',
       'possibly_sensitive_appealable', 'lang', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status_permalink', 'quoted_status'],
      dtype='object')

In [47]:
rt_tweets[rt_tweets.]

SyntaxError: invalid syntax (<ipython-input-47-e5de16195877>, line 1)

## Merge datasets

### twitterDF, rt_tweets_sub, image_preds

In [46]:
twitterDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   tweet_id               2175 non-null   int64              
 1   in_reply_to_status_id  78 non-null     float64            
 2   in_reply_to_user_id    78 non-null     float64            
 3   timestamp              2175 non-null   datetime64[ns, UTC]
 4   source                 2175 non-null   object             
 5   text                   2175 non-null   object             
 6   expanded_urls          2117 non-null   object             
 7   rating_numerator       2175 non-null   int64              
 8   rating_denominator     2175 non-null   int64              
 9   name                   2120 non-null   object             
 10  doggo                  2175 non-null   object             
 11  floofer                2175 non-null   object           

In [50]:
rt_tweets_sub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2167 entries, 0 to 2329
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   created_at          2167 non-null   datetime64[ns, UTC]
 1   id                  2167 non-null   int64              
 2   full_text           2167 non-null   object             
 3   display_text_range  2167 non-null   object             
 4   retweet_count       2167 non-null   int64              
 5   favorite_count      2167 non-null   int64              
 6   user                2167 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(3), object(3)
memory usage: 135.4+ KB


In [48]:
image_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
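
The three frames have different row counts (2175, 2167 and 2075), so a join on the tweet id will not keep every row. A quick way to see what a join would keep or drop is to compare the key sets first. The sketch below uses two small hypothetical frames (`tweets`, `preds`) in place of the real ones.

```python
import pandas as pd

# Hypothetical stand-ins for two of the frames being merged
tweets = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
preds = pd.DataFrame({"tweet_id": [2, 3, 5]})

tweet_ids = set(tweets["tweet_id"])
pred_ids = set(preds["tweet_id"])

shared = tweet_ids & pred_ids        # rows an inner join keeps
only_tweets = tweet_ids - pred_ids   # tweets with no prediction
only_preds = pred_ids - tweet_ids    # predictions with no matching tweet

# Duplicate keys would multiply rows in a merge, so check uniqueness too
unique_keys = tweets["tweet_id"].is_unique and preds["tweet_id"].is_unique

print(len(shared), len(only_tweets), len(only_preds), unique_keys)
```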


In [52]:
# rename 'id' so the join key has the same name in all three frames
rt_tweets_sub = rt_tweets_sub.rename(columns={"id":"tweet_id"})
rt_tweets_sub.head()

Unnamed: 0,created_at,tweet_id,full_text,display_text_range,retweet_count,favorite_count,user
0,2017-08-01 16:23:56+00:00,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,"[0, 85]",7477,35388,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,2017-08-01 00:17:27+00:00,892177421306343426,This is Tilly. She's just checking pup on you....,"[0, 138]",5549,30638,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,2017-07-31 00:18:03+00:00,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,"[0, 121]",3671,23034,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [54]:
# inner join on tweet_id (the pd.merge default): only tweets present in both frames are kept
new_tweets_df = pd.merge(rt_tweets_sub, twitterDF, on='tweet_id')
new_tweets_df

Unnamed: 0,created_at,tweet_id,full_text,display_text_range,retweet_count,favorite_count,user,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,2017-08-01 16:23:56+00:00,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,"[0, 85]",7477,35388,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-08-01 16:23:56+00:00,iphone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,0,0,0,0
1,2017-08-01 00:17:27+00:00,892177421306343426,This is Tilly. She's just checking pup on you....,"[0, 138]",5549,30638,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-08-01 00:17:27+00:00,iphone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,0,0,0,0
2,2017-07-31 00:18:03+00:00,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,"[0, 121]",3671,23034,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-31 00:18:03+00:00,iphone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,0,0,0,0
3,2017-07-30 15:58:51+00:00,891689557279858688,This is Darla. She commenced a snooze mid meal...,"[0, 79]",7649,38689,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-30 15:58:51+00:00,iphone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,0,0,0,0
4,2017-07-29 16:00:24+00:00,891327558926688256,This is Franklin. He would like you to stop ca...,"[0, 138]",8249,36965,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-29 16:00:24+00:00,iphone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2162,2015-11-16 00:24:50+00:00,666049248165822465,Here we have a 1949 1st generation vulpix. Enj...,"[0, 120]",40,96,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2015-11-16 00:24:50+00:00,iphone,Here we have a 1949 1st generation vulpix. Enj...,https://twitter.com/dog_rates/status/666049248...,5,10,,0,0,0,0
2163,2015-11-16 00:04:52+00:00,666044226329800704,This is a purebred Piers Morgan. Loves to Netf...,"[0, 137]",124,265,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2015-11-16 00:04:52+00:00,iphone,This is a purebred Piers Morgan. Loves to Netf...,https://twitter.com/dog_rates/status/666044226...,6,10,,0,0,0,0
2164,2015-11-15 23:21:54+00:00,666033412701032449,Here is a very happy pup. Big fan of well-main...,"[0, 130]",39,109,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2015-11-15 23:21:54+00:00,iphone,Here is a very happy pup. Big fan of well-main...,https://twitter.com/dog_rates/status/666033412...,9,10,,0,0,0,0
2165,2015-11-15 23:05:30+00:00,666029285002620928,This is a western brown Mitsubishi terrier. Up...,"[0, 139]",41,119,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2015-11-15 23:05:30+00:00,iphone,This is a western brown Mitsubishi terrier. Up...,https://twitter.com/dog_rates/status/666029285...,7,10,,0,0,0,0
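
The merged frame now carries the same information twice: `created_at` next to `timestamp`, and `full_text` next to `text`. One possible cleanup, sketched here on hypothetical toy frames (`api`, `archive`), is to let `merge` verify that the key is one-to-one and then drop a duplicate column only after confirming the two really agree.

```python
import pandas as pd

# Hypothetical stand-ins for rt_tweets_sub and twitterDF
api = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "created_at": pd.to_datetime(["2017-08-01", "2017-07-31", "2017-07-30"], utc=True),
    "retweet_count": [7, 5, 3],
})
archive = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "timestamp": pd.to_datetime(["2017-08-01", "2017-07-31", "2017-07-29"], utc=True),
    "rating_numerator": [13, 13, 12],
})

# validate raises MergeError if tweet_id is duplicated on either side
merged = pd.merge(api, archive, on="tweet_id", how="inner", validate="one_to_one")

# Drop the redundant column only if the two timestamps agree on every row
if (merged["created_at"] == merged["timestamp"]).all():
    merged = merged.drop(columns="timestamp")

print(merged)
```

The same equality check could be run on `full_text` and `text` before choosing which one to keep.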


In [55]:
new_tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2167 entries, 0 to 2166
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   created_at             2167 non-null   datetime64[ns, UTC]
 1   tweet_id               2167 non-null   int64              
 2   full_text              2167 non-null   object             
 3   display_text_range     2167 non-null   object             
 4   retweet_count          2167 non-null   int64              
 5   favorite_count         2167 non-null   int64              
 6   user                   2167 non-null   object             
 7   in_reply_to_status_id  78 non-null     float64            
 8   in_reply_to_user_id    78 non-null     float64            
 9   timestamp              2167 non-null   datetime64[ns, UTC]
 10  source                 2167 non-null   object             
 11  text                   2167 non-null   object           

In [56]:
# inner join again: tweets without an image prediction are dropped (2167 -> 1986 rows)
new_tweets_df2 = pd.merge(new_tweets_df, image_preds, on='tweet_id')

In [57]:
new_tweets_df2.head(5)

Unnamed: 0,created_at,tweet_id,full_text,display_text_range,retweet_count,favorite_count,user,in_reply_to_status_id,in_reply_to_user_id,timestamp,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,2017-08-01 16:23:56+00:00,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,"[0, 85]",7477,35388,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-08-01 16:23:56+00:00,...,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False
1,2017-08-01 00:17:27+00:00,892177421306343426,This is Tilly. She's just checking pup on you....,"[0, 138]",5549,30638,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-08-01 00:17:27+00:00,...,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2,2017-07-31 00:18:03+00:00,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,"[0, 121]",3671,23034,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-31 00:18:03+00:00,...,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
3,2017-07-30 15:58:51+00:00,891689557279858688,This is Darla. She commenced a snooze mid meal...,"[0, 79]",7649,38689,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-30 15:58:51+00:00,...,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
4,2017-07-29 16:00:24+00:00,891327558926688256,This is Franklin. He would like you to stop ca...,"[0, 138]",8249,36965,"{'id': 4196983835, 'id_str': '4196983835', 'na...",,,2017-07-29 16:00:24+00:00,...,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
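
After this merge, the three predictions still sit in nine wide columns (`p1` … `p3_dog`), which is the tidiness issue noted at the top of the notebook. One way to reshape them into one row per prediction is sketched below. The two sample rows are copied from the output above; the new column names (`prediction_num`, `prediction`, `confidence`, `is_dog`) are assumptions chosen for illustration.

```python
import pandas as pd

# Two rows copied from the merged output above (prediction columns only)
wide = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "p1": ["orange", "Chihuahua"], "p1_conf": [0.097049, 0.323581], "p1_dog": [False, True],
    "p2": ["bagel", "Pekinese"], "p2_conf": [0.085851, 0.090647], "p2_dog": [False, True],
    "p3": ["banana", "papillon"], "p3_conf": [0.07611, 0.068957], "p3_dog": [False, True],
})

# Build one narrow frame per prediction number, then stack them
parts = []
for num in (1, 2, 3):
    part = wide[["tweet_id", f"p{num}", f"p{num}_conf", f"p{num}_dog"]].copy()
    part.columns = ["tweet_id", "prediction", "confidence", "is_dog"]
    part.insert(1, "prediction_num", num)
    parts.append(part)

long_preds = (
    pd.concat(parts, ignore_index=True)
    .sort_values(["tweet_id", "prediction_num"])
    .reset_index(drop=True)
)

print(long_preds)
```

The explicit loop is longer than `pd.wide_to_long` or `melt`, but it keeps the name, confidence and dog flag of each prediction together without any column-name parsing.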


In [58]:
new_tweets_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1986 entries, 0 to 1985
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   created_at             1986 non-null   datetime64[ns, UTC]
 1   tweet_id               1986 non-null   int64              
 2   full_text              1986 non-null   object             
 3   display_text_range     1986 non-null   object             
 4   retweet_count          1986 non-null   int64              
 5   favorite_count         1986 non-null   int64              
 6   user                   1986 non-null   object             
 7   in_reply_to_status_id  23 non-null     float64            
 8   in_reply_to_user_id    23 non-null     float64            
 9   timestamp              1986 non-null   datetime64[ns, UTC]
 10  source                 1986 non-null   object             
 11  text                   1986 non-null   object           