# Wrangling

The following cells gather the data needed for this project. The data comes from three sources:
 1. CSV that was handed to me
 2. Downloaded TSV file from online source
 3. JSON data from Twitter's API

In [1]:
#importing all of the necessary libraries
import pandas as pd
import requests
import json
import numpy as np
import tweepy
import os

### 1. Downloading and Loading the CSV File into a Dataframe
The CSV file was downloaded from Udacity and stored on my local machine in the same folder as my Jupyter Notebook. The file was then loaded into a dataframe using pandas, as shown in the cell below.

In [2]:
#loading the archive file to a dataframe
df_archive = pd.read_csv('./twitter-archive-enhanced.csv')
df_archive.head() #verify the dataframe loads properly

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### 2. Downloading the TSV File from the Internet
The next file had to be downloaded programmatically from the internet using Python. A URL was provided for the download. Using the os and Requests libraries, I created a folder on my machine and sent a request to the URL to download the file.

The file was saved under the name taken from the end of the URL. I then loaded the data into a new dataframe using pandas.

In [3]:
#created the folder to store the file that I needed to download
folder_name = 'image_predictions'
#create the folder only if the foldername does not exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
#store the URL of the file's location in the url variable
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
#store the request response in the response variable
response = requests.get(url)
#open the file path based on the foldername created in the first step
#create the file name based on the url, use all characters after the last '/', then write the file to the specified location
with open(os.path.join(folder_name,url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)
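
The naming and saving step can also be wrapped in a small helper. The sketch below is not the code used in this notebook: `filename_from_url` and `save_download` are hypothetical names, and the sketch assumes the URL may carry stray whitespace and that the downloaded bytes are passed in by the caller.

```python
import os


def filename_from_url(url):
    """Return the text after the last '/' of a URL, ignoring surrounding whitespace."""
    return url.strip().split('/')[-1]


def save_download(content, url, folder_name):
    """Write downloaded bytes into folder_name, named after the URL; return the path."""
    os.makedirs(folder_name, exist_ok=True)  # does nothing if the folder already exists
    path = os.path.join(folder_name, filename_from_url(url))
    with open(path, mode='wb') as file:
        file.write(content)
    return path
```

In this notebook it could be called as `save_download(response.content, url, folder_name)`, ideally after `response.raise_for_status()` so that a failed request is not written to disk as if it were the data.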

In [4]:
#since the file is tab-separated, read it with pandas read_csv using sep='\t' and create the dataframe
df_image_pred = pd.read_csv(os.path.join(folder_name, url.split('/')[-1]), sep='\t')
df_image_pred.head() #verify the dataframe created loads properly

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
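
To make the `sep='\t'` argument concrete, the sketch below parses a tiny tab-separated string held in memory. The two rows are copied from the output above and trimmed to three columns; `io.StringIO` only stands in for the downloaded file, and `df_sample` is a made-up name.

```python
import io

import pandas as pd

# Two rows from the image predictions file, trimmed to three columns for illustration
tsv_text = (
    "tweet_id\tp1\tp1_conf\n"
    "666020888022790149\tWelsh_springer_spaniel\t0.465074\n"
    "666029285002620928\tredbone\t0.506826\n"
)

# sep='\t' tells read_csv to split fields on tabs instead of commas
df_sample = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_sample.dtypes)
```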


### 3. Accessing JSON data from Twitter's API
The next data set had to be accessed using Twitter's API. After following the instructions in the class, I set up my own Twitter developer access and obtained my own authentication and token keys. I then followed the instructions on how to get data from the Twitter API and used the Tweepy API class to retrieve it.

First, I needed the Tweet IDs from the archive file (the first file that I loaded). After storing the Tweet IDs in a variable, I ran some simple tests with a single Tweet ID to understand how the data was returned. I made sure I understood how to use the Tweepy API, get the JSON data, and write the JSON data to a file for one Tweet ID before processing the entire list of Tweet IDs. The rest of the steps I took are described alongside the cells below.

In [5]:
#twitter API variables for authentication and access
#the keys are read from environment variables so that they are not stored in the notebook (the variable names below are placeholders)
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']
#authentication based on twitter documentation
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#use the tweepy API class to authenticate and also set wait on rate limit to True so that it waits when the limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
#the attributes to collect for each tweet are, at minimum, retweet count and favorite ("like") count

In [7]:
#perform a test on one tweet ID to ensure that the connection is successful, use tweet_mode extended based on recommendation
tweet = api.get_status(892420643555336193, tweet_mode='extended')
#store the response as a json
tweet_json = tweet._json
#verify that I can read the json response as a dict using retweet_count as an example 
tweet_json['retweet_count']
#retweet_count
#favorite_count

8308

In [8]:
#perform a simple test of creating a new file and writing all of the contents to the file
#commenting the below code out since I don't need it anymore
'''
with open('tweet_json.txt', 'w') as outfile:  
    json.dump(tweet_json, outfile)
'''

#### 3a. Downloading all of the JSON data into a text file
I stored all of the Tweet IDs from the archive dataframe in a variable to use when sending requests to the Twitter API. I created a simple script that loops through all of the Tweet IDs and writes the JSON data for each Tweet on its own line in the text file.

The script took a very long time to run (around 30 minutes). Since it stores the text file on my local machine, I didn't need to run it again, so I have commented out the script below.

In [9]:
tweet_id = df_archive['tweet_id'] #store all of the tweet IDs in df_archive in a pandas Series
tweet_errors = {} #creating a dictionary that will be used for the exceptions
''' 
### As stated in the above markdown cell, the rest of the download script is commented out so that I can rerun the entire notebook and keep it clean ###

import time #importing the time library as suggested to measure how long the script is running
start = time.time() #starting the timer
for tweetid in tweet_id: #starting the for loop for all Tweet IDs in the Series created above
    try: #creating a try-except block to catch all of the exceptions as some tweets were deleted
        tweet = api.get_status(tweetid, tweet_mode='extended') #use the get_status from the Tweepy API to get the data for each tweet ID
        tweet_json = tweet._json #get the raw JSON data as a dict--found the _json attribute online
        with open('tweet_json.txt', 'a') as outfile: #open the file tweet_json.txt in append mode so that each tweet is added to the file instead of overwriting it
            json.dump(tweet_json, outfile) #use the json.dump function to write the json data into the text file
            outfile.write("\n") #this creates a new line on the file for the next tweet
        end = time.time() #record the time after the latest tweet was appended to the file
        total_time = end-start #calculate the elapsed time from the start to the latest write
        print(tweetid, (total_time/60)) #print out the tweet ID and the elapsed time in minutes to see the progress
    except Exception as e: #handle all exceptions in this exception block
        print(str(tweetid) +": "+str(e)) #print the tweet ID along with the exception error message
        tweet_errors[tweetid]=str(e) #store the error message for this tweet ID in the tweet_errors dict that was created earlier
'''
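
The same loop can be written so that the output file is opened once and each failure is recorded with its own error message. This is a sketch under assumptions, not the script that produced `tweet_json.txt`: `download_tweets` is a hypothetical helper, `fetch_tweet_json` stands in for a call such as `api.get_status(tweet_id, tweet_mode='extended')._json`, and `fake_fetch` exists only so that the example runs without Twitter credentials. Unlike the script above, it opens the file in write mode, so a rerun starts from an empty file.

```python
import json
import os
import tempfile


def download_tweets(tweet_ids, fetch_tweet_json, path):
    """Write one JSON object per line to path; return {tweet_id: error message}."""
    errors = {}
    with open(path, 'w', encoding='utf-8') as outfile:  # open the file once, not once per tweet
        for tweet_id in tweet_ids:
            try:
                tweet_json = fetch_tweet_json(tweet_id)
            except Exception as e:  # e.g. the tweet was deleted
                errors[tweet_id] = str(e)
                continue
            json.dump(tweet_json, outfile)
            outfile.write('\n')  # one tweet per line
    return errors


def fake_fetch(tweet_id):
    """Hypothetical stand-in for the API call; fails for one made-up ID."""
    if tweet_id == 2:
        raise ValueError('No status found with that ID.')
    return {'id': tweet_id, 'retweet_count': 0, 'favorite_count': 0}


demo_path = os.path.join(tempfile.gettempdir(), 'tweet_json_demo.txt')
errors = download_tweets([1, 2, 3], fake_fetch, demo_path)
print(errors)
```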


#### 3b. Reading the text file line by line and loading the data into a dataframe
This next section took the longest time in the wrangling process. It took multiple iterations to extract the data correctly, and I had to look at the JSON data in the text file multiple times for various tweets to make sure that I was reading it properly.

I created the script below to open the text file, read each line, and store the attributes that I wanted in a dataframe. The minimum required attributes were Tweet ID, Retweet Count, and Favorite Count. I had to look through the text file to find the actual attribute names in the JSON data, using one Tweet ID as an example.

I also decided to pull Full Text, URL, and User Mentions just to see if I could. This proved to be more difficult than I thought. The Full Text was easy to pull because it sits at the same level as the minimum required fields. However, the URL and User Mentions are both nested in the JSON data tree, so I had to go multiple layers deep to access them.

I noticed that User Mentions were empty for a lot of tweets (more than 2,000), so I decided to remove that attribute from the script.

For the majority of tweets the URL was nested in the same location. However, some tweets stored the URL in a different part of the JSON data, which raised exceptions. This is why I added the try-except block to the first version of the dataframe script.

In [10]:
df_list = [] #create a list that the attributes will be appended to
df_errors = {} #create a dict for the exceptions
#open the file that was created in step 3a with encoding set to utf-8
with open('tweet_json.txt', 'r', encoding='utf-8') as f:
    #start the for loop to go through each line of the file
    for line in f:
        try: #start the try-except block because of the exceptions
            data = json.loads(line) #use json.loads to parse the line into a dict
            tweet_id = data['id'] #start getting the attributes that I'm interested in, starting with Tweet ID
            retweet_count = data['retweet_count'] #get the attribute retweet_count
            favorite_count = data['favorite_count'] #get the attribute favorite_count
            full_text = data['full_text'] #get the attribute full_text
            url = data['entities']['media'][0]['url'] #get the attribute url, which is nested under entities and media
            #append all of the attributes to df_list so that I can create a dataframe from the list
            df_list.append({'tweet_id':tweet_id,
                            'retweet_count':retweet_count,
                            'favorite_count':favorite_count,
                            'full_text':full_text,
                            'url':url,
                           })
        #start the exception block as I noticed that some URLs are not stored in the same location for some tweets
        except Exception as e:
            print(str(tweet_id)+': '+str(e)) #print the exception message with the tweet ID
            df_errors[tweet_id]=line #store the line with the tweet ID into the dictionary
#convert the list into a dataframe with the appropriate columns, once, after the loop has finished
df_tweepy = pd.DataFrame(df_list, columns = ['tweet_id',
                                            'retweet_count',
                                            'favorite_count',
                                            'full_text',
                                            'url'])

886267009285017600: 'media'
886054160059072513: 'media'
885518971528720385: 'media'
884247878851493888: 'media'
881633300179243008: 'media'
879674319642796034: 'media'
879130579576475649: 'media'
878604707211726852: 'media'
878404777348136964: 'media'
878316110768087041: 'media'
876537666061221889: 'media'
875097192612077568: 'media'
874434818259525634: 'media'
873337748698140672: 'media'
871166179821445120: 'media'
871102520638267392: 'media'
870726314365509632: 'media'
868639477480148993: 'media'
866720684873056260: 'media'
866094527597207552: 'media'
863471782782697472: 'media'
863427515083354112: 'media'
860981674716409858: 'media'
860177593139703809: 'media'
858860390427611136: 'media'
857214891891077121: 'media'
857062103051644929: 'media'
856602993587888130: 'media'
856330835276025856: 'media'
856288084350160898: 'media'
855862651834028034: 'media'
855860136149123072: 'media'
855857698524602368: 'media'
855818117272018944: 'media'
855245323840757760: 'media'
855138241867124737: 

In [11]:
len(df_errors.keys()) #find the number of exceptions in the dictionary 

273

#### 3c. Creating a new Dataframe Ignoring Exceptions
Now that I had all of the exceptions, I wanted to create a new dataframe. I took the list of all the tweet IDs that raised an exception and ran a modified version of the script from step 3b that avoids the exceptions.

Getting the URLs for these tweets proved challenging because the URL was nested in various places. I didn't have enough time to go through each exception or write code to handle every scenario. So, if a tweet was in the exception list, I left its URL empty.

In [12]:
tweet_errors = list(df_errors.keys()) #convert the Tweet IDs in the exception dict to a list
tweet_errors #verify the Tweet IDs

[886267009285017600,
 886054160059072513,
 885518971528720385,
 884247878851493888,
 881633300179243008,
 879674319642796034,
 879130579576475649,
 878604707211726852,
 878404777348136964,
 878316110768087041,
 876537666061221889,
 875097192612077568,
 874434818259525634,
 873337748698140672,
 871166179821445120,
 871102520638267392,
 870726314365509632,
 868639477480148993,
 866720684873056260,
 866094527597207552,
 863471782782697472,
 863427515083354112,
 860981674716409858,
 860177593139703809,
 858860390427611136,
 857214891891077121,
 857062103051644929,
 856602993587888130,
 856330835276025856,
 856288084350160898,
 855862651834028034,
 855860136149123072,
 855857698524602368,
 855818117272018944,
 855245323840757760,
 855138241867124737,
 852936405516943360,
 850333567704068097,
 849668094696017920,
 848213670039564288,
 847978865427394560,
 847617282490613760,
 846505985330044928,
 846139713627017216,
 845098359547420673,
 843981021012017153,
 841320156043304961,
 840761248237

In [13]:
#clearing the list and dataframe since I am going to re-run a modified version of the script in the below cell
df_list = [] 
df_tweepy = []

In [14]:
with open('tweet_json.txt', 'r', encoding='utf-8') as f: #open the file created in step 3a
    for line in f: #start the for loop
        data = json.loads(line) #use json.loads to parse the line into a dict
        tweet_id = data['id'] #start getting the attributes that I'm interested in, starting with Tweet ID
        retweet_count = data['retweet_count'] #get the attribute retweet_count
        favorite_count = data['favorite_count'] #get the attribute favorite_count
        full_text = data['full_text'] #get the attribute full_text
        if tweet_id not in tweet_errors: #if the Tweet ID does not exist in the Tweet ID exception list
            url = data['entities']['media'][0]['url'] #then use the following keys and store the value in the URL variable
        else: #if the Tweet ID does exist in the Tweet ID exception list
            url = None #then don't put anything in the URL attribute
        #append all of the attributes to df_list so that I can create a dataframe from the list
        df_list.append({'tweet_id':tweet_id,
                        'retweet_count':retweet_count,
                        'favorite_count':favorite_count,
                        'full_text':full_text,
                        'url':url,
                        })
#convert the list into a dataframe with the appropriate columns, once, after the loop has finished
df_tweepy = pd.DataFrame(df_list, columns = ['tweet_id',
                                            'retweet_count',
                                            'favorite_count',
                                            'full_text',
                                            'url'])

In [15]:
df_tweepy.count() #verify the counts, some URLs will be missing or null

tweet_id          2340
retweet_count     2340
favorite_count    2340
full_text         2340
url               2067
dtype: int64
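
The two passes in steps 3b and 3c could be merged into one by treating a missing `media` entry as a null URL. The sketch below assumes only what the error output above shows, namely that some tweets have no `media` key under `entities`. The helper name `first_media_url` and the two sample lines are made up for illustration.

```python
import json


def first_media_url(data):
    """Return the URL of the first media entity, or None when the tweet has none."""
    media = data.get('entities', {}).get('media') or []
    return media[0].get('url') if media else None


# Two made-up lines in the same shape as the lines of tweet_json.txt
lines = [
    '{"id": 1, "entities": {"media": [{"url": "https://t.co/abc"}]}}',
    '{"id": 2, "entities": {"urls": []}}',
]

rows = []
for line in lines:
    data = json.loads(line)
    rows.append({'tweet_id': data['id'], 'url': first_media_url(data)})

print(rows)
```

With this approach no exception list is needed, at the cost of no longer printing which tweets lacked a media URL.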

# Assessing

The goal of this section is to identify eight (8) quality issues and two (2) tidiness issues in the following tables:

- df_archive: archived tweets
- df_image_pred: image predictions
- df_tweepy: Twitter data for favorite count and retweets

### 1. Quality Issues

#### df_archive table
- Dog names don't seem to be correct (I visually noticed 'the' and 'a' as some dog names).
- timestamp and retweeted_status_timestamp are stored as objects and should be converted to datetime.
- All other IDs (i.e. in_reply_to_user_id, retweeted_status_user_id, in_reply_to_status_id, retweeted_status_id) are stored as floats; their data type should be integer.
- Expanded URLs seem to consist of duplicate Twitter URLs and also URLs to other sites (like GoFundMe), all jumbled into a single attribute.
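
Some of the df_archive issues listed above can also be surfaced programmatically rather than visually. The sketch below runs on a small made-up frame, not on the real archive: it flags names that are entirely lowercase (such as 'a' or 'the') and lists the ID columns that pandas stored as floats.

```python
import pandas as pd

# Made-up rows that mimic the archive columns involved
df_demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [float('nan'), 8.816070e+17, float('nan'), float('nan')],
    'name': ['Phineas', 'a', 'the', 'Tilly'],
})

# Dog names are expected to be capitalized, so all-lowercase values are suspicious
suspect_names = df_demo.loc[df_demo['name'].str.islower(), 'name'].tolist()

# ID columns that contain missing values end up as float64
float_id_cols = [col for col in df_demo.columns
                 if col.endswith('_id') and df_demo[col].dtype == 'float64']

print(suspect_names, float_id_cols)
```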

#### df_image_pred table
- 66 duplicate image URLs were found in the dataset, each pointing to a unique Tweet ID.
- Inconsistent capitalization and punctuation of the dog types in the p1, p2, and p3 columns (for example, Welsh_springer_spaniel versus redbone).
- Some records from the archive table seem to be missing.
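
The duplicate-URL and inconsistent-casing checks for df_image_pred can be sketched as follows. The frame is made up; only the column names `jpg_url` and `p1` come from the table.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': ['https://example.com/a.jpg',
                'https://example.com/a.jpg',   # same image, different tweet
                'https://example.com/b.jpg'],
    'p1': ['Welsh_springer_spaniel', 'redbone', 'German_shepherd'],
})

# duplicated() marks every repeat after the first occurrence
n_duplicate_urls = int(df_demo['jpg_url'].duplicated().sum())

# Count labels that start with an uppercase letter versus those that do not
starts_upper = df_demo['p1'].str[0].str.isupper()
n_upper = int(starts_upper.sum())
n_lower = int((~starts_upper).sum())

print(n_duplicate_urls, n_upper, n_lower)
```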

#### df_tweepy table
- There are a lot of zeros for favorite count. When I checked the tweet URL for one of them, the tweet did have a favorite count, so there must have been an issue with the script pulling that data from the API.
- Some records from the archive table seem to be missing (2,340 rows here versus 2,356 in df_archive).
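
The missing records can be identified by comparing tweet IDs between the two tables. A minimal sketch with made-up IDs, using `Series.isin`:

```python
import pandas as pd

df_archive_demo = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
df_tweepy_demo = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Archive rows whose tweet_id never appears in the API data
in_api = df_archive_demo['tweet_id'].isin(df_tweepy_demo['tweet_id'])
missing_ids = df_archive_demo.loc[~in_api, 'tweet_id'].tolist()

print(missing_ids)
```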

### 2. Tidiness Issues

#### df_archive table
- The dog type is spread across multiple attributes (i.e. doggo, floofer, pupper, puppo) that can be combined into one. The same may apply to the rating (numerator and denominator).

#### df_image_pred table 
- The table repeats the same three attributes (i.e. prediction, prediction confidence, and dog (true or false)) for each prediction number; these can each be represented in a single attribute.
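
As an illustration of the df_archive tidiness issue, the four dog-type columns could be collapsed into a single column. This is only a sketch on made-up rows: it assumes that a missing dog type is stored as the string 'None', and the column name `dog_stage` and the helper `combine_stages` are hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows; 'None' as the placeholder for a missing value is an assumption
df_demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join the dog types present in a row, or return None when there are none."""
    stages = [row[col] for col in stage_cols if row[col] != 'None']
    return ','.join(stages) if stages else None


df_demo['dog_stage'] = df_demo.apply(combine_stages, axis=1)
df_tidy = df_demo.drop(columns=stage_cols)
print(df_tidy)
```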

In [17]:
df_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [20]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [23]:
df_archive['in_reply_to_status_id'].value_counts()

6.671522e+17    2
8.562860e+17    1
8.131273e+17    1
6.754971e+17    1
6.827884e+17    1
8.265984e+17    1
6.780211e+17    1
6.689207e+17    1
6.658147e+17    1
6.737159e+17    1
7.590995e+17    1
8.862664e+17    1
7.384119e+17    1
7.727430e+17    1
7.468859e+17    1
8.634256e+17    1
6.693544e+17    1
6.914169e+17    1
6.920419e+17    1
6.753494e+17    1
7.291135e+17    1
8.406983e+17    1
6.747400e+17    1
7.501805e+17    1
6.744689e+17    1
7.638652e+17    1
6.747934e+17    1
8.503288e+17    1
6.747522e+17    1
8.816070e+17    1
               ..
8.380855e+17    1
8.211526e+17    1
8.558616e+17    1
8.558585e+17    1
7.032559e+17    1
6.678065e+17    1
8.018543e+17    1
7.667118e+17    1
6.855479e+17    1
6.717299e+17    1
6.715610e+17    1
6.758457e+17    1
6.924173e+17    1
7.476487e+17    1
8.381455e+17    1
6.903413e+17    1
8.476062e+17    1
8.352460e+17    1
6.813394e+17    1
8.795538e+17    1
6.860340e+17    1
8.571567e+17    1
6.765883e+17    1
7.044857e+17    1
8.707262e+

In [27]:
df_archive[df_archive['expanded_urls'].duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
75,878281511006478336,,,2017-06-23 16:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Shadow. In an attempt to reach maximum zo...,,,,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
76,878057613040115712,,,2017-06-23 01:10:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Emmy. She was adopted today. Massive r...,,,,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
98,873213775632977920,,,2017-06-09 16:22:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sierra. She's one precious pupper. Abs...,,,,https://www.gofundme.com/help-my-baby-sierra-g...,12,10,Sierra,,,pupper,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
126,868552278524837888,,,2017-05-27 19:39:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Cooper. His expression is the sam...,,,,"https://www.gofundme.com/3ti3nps,https://twitt...",12,10,Cooper,,,,
135,866450705531457537,,,2017-05-22 00:28:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jamesy. He gives a kiss to every other...,,,,https://twitter.com/dog_rates/status/866450705...,13,10,Jamesy,,,pupper,
136,866334964761202691,,,2017-05-21 16:48:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Coco. At first I thought she was a clo...,,,,https://twitter.com/dog_rates/status/866334964...,12,10,Coco,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,


In [28]:
df_archive['expanded_urls'].value_counts()

https://twitter.com/dog_rates/status/833124694597443584/photo/1,https://twitter.com/dog_rates/status/833124694597443584/photo/1,https://twitter.com/dog_rates/status/833124694597443584/photo/1                                                                                                              2
https://twitter.com/dog_rates/status/739238157791694849/video/1                                                                                                                                                                                                                                              2
https://twitter.com/dog_rates/status/667138269671505920/photo/1                                                                                                                                                                                                                                              2
https://twitter.com/dog_rates/status/837820167694528512/photo/1,https://twitter.com/dog_rat

In [36]:
df_archive['rating_denominator'].value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [37]:
df_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [43]:
df_archive.query('rating_numerator ==1776')['tweet_id']

979    749981277374128128
Name: tweet_id, dtype: int64

In [44]:
df_archive.sample(50)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
286,838831947270979586,,,2017-03-06 19:21:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Riley. His owner put a ...,7.8384e+17,4196984000.0,2016-10-06 01:23:05 +0000,https://twitter.com/dog_rates/status/783839966...,13,10,Riley,,,,
348,831670449226514432,,,2017-02-15 01:04:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Daisy. She has a heart on her butt. 13...,,,,https://twitter.com/dog_rates/status/831670449...,13,10,Daisy,,,,
2186,668981893510119424,,,2015-11-24 02:38:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Unique dog here. Oddly shaped tail. Long pink ...,,,,https://twitter.com/dog_rates/status/668981893...,4,10,,,,,
389,826476773533745153,,,2017-01-31 17:06:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pilot. He has mastered the synchronize...,,,,https://twitter.com/dog_rates/status/826476773...,12,10,Pilot,doggo,,,
1949,673689733134946305,,,2015-12-07 02:25:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're having a blast and remember tomorr...,,,,https://twitter.com/dog_rates/status/673689733...,11,10,,,,,
202,853639147608842240,,,2017-04-16 16:00:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",A photographer took pictures before and after ...,,,,https://twitter.com/dog_rates/status/853639147...,13,10,,,,,
968,750147208377409536,,,2016-07-05 02:00:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...","And finally, happy 4th of July from the squad ...",,,,https://twitter.com/dog_rates/status/750147208...,13,10,,,,,
1282,708738143638450176,,,2016-03-12 19:35:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Coco. She gets to stay on the Bachelor...,,,,https://twitter.com/dog_rates/status/708738143...,11,10,Coco,,,,
45,883482846933004288,,,2017-07-08 00:28:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,,,,
1936,673956914389192708,,,2015-12-07 20:07:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is one esteemed pupper. Just graduated co...,,,,https://twitter.com/dog_rates/status/673956914...,10,10,one,,,pupper,


In [18]:
df_image_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [29]:
df_image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [31]:
#list rows whose jpg_url repeats an earlier row
df_image_pred[df_image_pred['jpg_url'].duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/67535...,1,upright,0.303415,False,golden_retriever,0.181351,True,Brittany_spaniel,0.162084,True
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.251530,True,bath_towel,0.116806,False
1333,757729163776290825,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
1345,759159934323924993,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,Irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1349,759566828574212096,https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg,1,Labrador_retriever,0.967397,True,golden_retriever,0.016641,True,ice_bear,0.014858,False
1364,761371037149827077,https://pbs.twimg.com/tweet_video_thumb/CeBym7...,1,brown_bear,0.713293,False,Indian_elephant,0.172844,False,water_buffalo,0.038902,False
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,golden_retriever,0.586937,True,Labrador_retriever,0.398260,True,kuvasz,0.005410,True
1387,766078092750233600,https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg,1,toy_poodle,0.420463,True,miniature_poodle,0.132640,True,Chesapeake_Bay_retriever,0.121523,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1417,771171053431250945,https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg,3,Samoyed,0.978833,True,Pomeranian,0.012763,True,Eskimo_dog,0.001853,True


In [33]:
#inspect the tweets that share one duplicated image url
df_image_pred.query('jpg_url == "https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg"')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1785,829374341691346946,https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg,1,Staffordshire_bullterrier,0.757547,True,American_Staffordshire_terrier,0.14995,True,Chesapeake_Bay_retriever,0.047523,True
1903,851953902622658560,https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg,1,Staffordshire_bullterrier,0.757547,True,American_Staffordshire_terrier,0.14995,True,Chesapeake_Bay_retriever,0.047523,True
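
By default, `duplicated()` marks only the second and later occurrences, so a listing filtered that way shows one row per duplicated URL. Passing `keep=False` marks every copy, which puts each pair side by side without querying URLs one at a time. A minimal sketch, assuming a small stand-in frame `sample` built from three rows shown in this notebook:

```python
import pandas as pd

# Stand-in for df_image_pred: one unique url and one url shared by two tweets.
sample = pd.DataFrame({
    "tweet_id": [
        666020888022790149,
        829374341691346946,
        851953902622658560,
    ],
    "jpg_url": [
        "https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg",
        "https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg",
        "https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg",
    ],
})

# Default keep='first': only the later copy is flagged.
later_copies = sample[sample["jpg_url"].duplicated()]

# keep=False: every row that shares its url with another row is flagged.
all_copies = (
    sample[sample["jpg_url"].duplicated(keep=False)]
    .sort_values(["jpg_url", "tweet_id"])
)
print(all_copies)
```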


In [50]:
df_image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [51]:
#inspect the rows where img_num is at its maximum of 4
df_image_pred.query('img_num == 4')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
144,668623201287675904,https://pbs.twimg.com/media/CUdtP1xUYAIeBnE.jpg,4,Chihuahua,0.708163,True,Pomeranian,0.091372,True,titi,0.067325,False
779,689905486972461056,https://pbs.twimg.com/media/CZMJYCRVAAE35Wk.jpg,4,Pomeranian,0.943331,True,Shetland_sheepdog,0.023675,True,chow,0.007165,True
1024,710588934686908417,https://pbs.twimg.com/media/CdyE2x1W8AAe0TG.jpg,4,Pembroke,0.982004,True,Cardigan,0.008943,True,malamute,0.00755,True
1161,734787690684657664,https://pbs.twimg.com/media/CjJ9gQ1WgAAXQtJ.jpg,4,golden_retriever,0.883991,True,chow,0.023542,True,Labrador_retriever,0.016056,True
1286,750868782890057730,https://pbs.twimg.com/media/CmufLLsXYAAsU0r.jpg,4,toy_poodle,0.912648,True,miniature_poodle,0.035059,True,seat_belt,0.026376,False
1325,756998049151549440,https://pbs.twimg.com/media/CoFlsGAWgAA2YeV.jpg,4,golden_retriever,0.678555,True,Labrador_retriever,0.072632,True,Border_terrier,0.049033,True
1337,758405701903519748,https://pbs.twimg.com/media/CoZl9fXWgAMox0n.jpg,4,Chesapeake_Bay_retriever,0.702954,True,laptop,0.092277,False,notebook,0.032727,False
1342,758854675097526272,https://pbs.twimg.com/media/Cof-SuqVYAAs4kZ.jpg,4,barrow,0.974047,False,Old_English_sheepdog,0.023791,True,komondor,0.001246,True
1372,762464539388485633,https://pbs.twimg.com/media/CpTRc4DUEAAYTq6.jpg,4,chow,0.999953,True,Tibetan_mastiff,2.3e-05,True,dhole,3e-06,False
1437,773985732834758656,https://pbs.twimg.com/media/Cr2_6R8WAAAUMtc.jpg,4,giant_panda,0.451149,False,fur_coat,0.148001,False,pug,0.10957,True


In [55]:
#count how often each label appears as the third prediction
df_image_pred['p3'].value_counts()

Labrador_retriever                79
Chihuahua                         58
golden_retriever                  48
Eskimo_dog                        38
kelpie                            35
kuvasz                            34
Staffordshire_bullterrier         32
chow                              32
beagle                            31
cocker_spaniel                    31
Pekinese                          29
toy_poodle                        29
Pomeranian                        29
Great_Pyrenees                    27
Chesapeake_Bay_retriever          27
Pembroke                          27
French_bulldog                    26
malamute                          26
American_Staffordshire_terrier    24
Cardigan                          23
pug                               23
basenji                           21
toy_terrier                       20
bull_mastiff                      20
Siberian_husky                    19
Shetland_sheepdog                 17
Boston_bull                       17
...

In [19]:
df_tweepy

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text,url
0,892420643555336193,8311,38002,This is Phineas. He's a mystical boy. Only eve...,https://t.co/MgUWQ76dJU
1,892177421306343426,6139,32632,This is Tilly. She's just checking pup on you....,https://t.co/0Xxu71qeIV
2,891815181378084864,4064,24549,This is Archie. He is a rare Norwegian Pouncin...,https://t.co/wUnZnhtVJB
3,891689557279858688,8447,41351,This is Darla. She commenced a snooze mid meal...,https://t.co/tD36da7qLQ
4,891327558926688256,9154,39532,This is Franklin. He would like you to stop ca...,https://t.co/AtUZn91f7f
5,891087950875897856,3043,19862,Here we have a majestic great white breaching ...,https://t.co/kQ04fDDRmh
6,890971913173991426,2018,11606,Meet Jax. He enjoys ice cream so much he gets ...,https://t.co/tVJBRMnhxl
7,890729181411237888,18435,64126,When you watch your owner call another dog a g...,https://t.co/v0nONBcwxq
8,890609185150312448,4182,27277,This is Zoey. She doesn't want to be one of th...,https://t.co/9TwLuAGH0b
9,890240255349198849,7214,31302,This is Cassie. She is a college pup. Studying...,https://t.co/t1bfwz5S2A


In [34]:
df_tweepy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2340 non-null int64
retweet_count     2340 non-null int64
favorite_count    2340 non-null int64
full_text         2340 non-null object
url               2067 non-null object
dtypes: int64(3), object(2)
memory usage: 91.5+ KB
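
The `info()` output reports 2067 non-null `url` values out of 2340 rows, which leaves 2340 - 2067 = 273 rows without a URL. The same count can be read directly with `isna().sum()`. A minimal sketch, assuming a three-row stand-in frame `sample` built from rows shown in this notebook (the two empty `url` cells are represented as `NaN`):

```python
import numpy as np
import pandas as pd

# Stand-in for df_tweepy: one row with a url, two rows without.
sample = pd.DataFrame({
    "tweet_id": [
        892420643555336193,
        886054160059072513,
        879130579576475649,
    ],
    "url": ["https://t.co/MgUWQ76dJU", np.nan, np.nan],
})

# Number of missing values per column.
missing = sample.isna().sum()
print(missing)
```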


In [46]:
df_tweepy.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2340.0,2340.0,2340.0
mean,7.422176e+17,2926.80812,7955.481624
std,6.832564e+16,4930.476731,12321.53991
min,6.660209e+17,0.0,0.0
25%,6.783394e+17,587.75,1373.0
50%,7.186224e+17,1366.5,3458.5
75%,7.986954e+17,3410.25,9734.0
max,8.924206e+17,83567.0,164162.0


In [49]:
#inspect the tweets that have no favorites
df_tweepy.query('favorite_count == 0')

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text,url
31,886054160059072513,105,0,RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,
35,885311592912609280,18153,0,RT @dog_rates: This is Lilly. She just paralle...,https://t.co/SATN4If5H5
67,879130579576475649,6694,0,RT @dog_rates: This is Emmy. She was adopted t...,
72,878404777348136964,1267,0,RT @dog_rates: Meet Shadow. In an attempt to r...,
73,878316110768087041,6525,0,RT @dog_rates: Meet Terrance. He's being yelle...,
77,877611172832227328,79,0,RT @rachel2195: @dog_rates the boyfriend and h...,https://t.co/dJx4Gzc50G
90,874434818259525634,14496,0,RT @dog_rates: This is Coco. At first I though...,
95,873337748698140672,1564,0,RT @dog_rates: This is Sierra. She's one preci...,
106,871166179821445120,5662,0,RT @dog_rates: This is Dawn. She's just checki...,
120,868639477480148993,2097,0,RT @dog_rates: Say hello to Cooper. His expres...,
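
Every zero-favorite row displayed above has a `full_text` that begins with `RT @`, which suggests that these are retweets. One way to check that pattern across a frame is to flag retweets by prefix and compare the flag against the zero-favorite rows. A minimal sketch, assuming a stand-in frame `sample` built from three rows shown in this notebook (text truncated exactly as displayed) and a hypothetical helper column `is_retweet`:

```python
import pandas as pd

# Stand-in for df_tweepy: one original tweet and two zero-favorite rows.
sample = pd.DataFrame({
    "tweet_id": [
        892420643555336193,
        886054160059072513,
        885311592912609280,
    ],
    "favorite_count": [38002, 0, 0],
    "full_text": [
        "This is Phineas. He's a mystical boy. Only eve...",
        "RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...",
        "RT @dog_rates: This is Lilly. She just paralle...",
    ],
})

# Hypothetical helper column: True when the text starts with the retweet prefix.
sample["is_retweet"] = sample["full_text"].str.startswith("RT @")

# Check whether every zero-favorite row is flagged as a retweet.
zero_fav = sample[sample["favorite_count"] == 0]
all_zero_are_retweets = bool(zero_fav["is_retweet"].all())
print(all_zero_are_retweets)
```

The prefix test is only a heuristic; the `retweeted_status_id` column in the archive frame is a more direct signal where it is available.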
