# Wrangling

In the following cells I perform the necessary steps in order to gather the data from multiple sources. There are three data sources that I will be collecting data from: 
 1. CSV that was handed to me
 2. Downloaded TSV file from online source
 3. JSON data from Twitter's API

In [1185]:
#importing all of the ncessary libraries
import pandas as pd
import requests
import json
import numpy as np
import tweepy
import os

### 1. Downloading and Loading the CSV File into a Dataframe
The CSV file was downloaded from Udactiy and stored on my local machine in the same folder location as my Jupyter Notebook. The file was then loaded into a dataframe using pandas as described in the cells below.

In [1186]:
#loading the archive file to a dataframe
df_archive = pd.read_csv('./twitter-archive-enhanced.csv')
df_archive.head() #verify the dataframe loads properly

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


### 2. Downloading the TSV File from the Internet
The next file needed to be downloaded programmatically from the internet using python. A URL was provided where I could download the file. Using the OS and Requests library in python, I was able to create a folder on my machine and make a request to the URL to download the file. 

After downloading the file, I was able to name the file based on the URL. Then, I stored the data onto a new dataframe using pandas.

In [1187]:
#created the folder to store the file that I needed to download
folder_name = 'image_predictions'
#create the folder only if the foldername does not exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
#store the URL of the files location in url variable
url = ' https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
#store the request response in the response variable
response = requests.get(url)
#open the file path based on the foldername created in the first step
#create the file name based on the url, use all characters after the last '/', then write the file to the specified location
with open(os.path.join(folder_name,url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)

In [1188]:
#since the file is a tsv open the file using pandas read_csv but use sep as \t for tabs and create the dataframe
df_image_pred = pd.read_csv(folder_name + '/' + url.split('/')[-1], sep='\t')
df_image_pred.head() #verify the dataframe created loads properly

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### 3. Accessing JSON data from Twitter's API
The next data set would have to be accessed using Twitter's API. After following the instructions in the class, I was able to setup my own Twitter developer access and get my own authentication and token keys. I followed the instructions on how to access and get data from the Twitter API and used the Tweepy API class to get data. 

First, I needed to get the Tweet IDs from the Archive file (first file) that I loaded. After getting the Tweet IDs stored in a variable, I ran some simple tests to understand how the data was being returned using a single Tweet ID. I made sure I understood how to use the Tweepy API, get the JSON data, and download the JSON data for one Tweet ID before doing the entire list of tweet IDs. The rest of the steps I took are listed next to the cells below.

In [1189]:
#twitter API variables for authentication and access
#commenting out and removing keys and tokens
'''
consumer_key = 'xx'
consumer_secret = 'xx'
access_token = 'xx'
access_secret = 'xx'

#authentication based on twitter documentation
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#use the tweepy API class to authenticate and also set wait on rate limit to True so that it waits when the limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

'''

"\nconsumer_key = 'xx'\nconsumer_secret = 'xx'\naccess_token = 'xx'\naccess_secret = 'xx'\n\n#authentication based on twitter documentation\nauth = tweepy.OAuthHandler(consumer_key, consumer_secret)\nauth.set_access_token(access_token, access_secret)\n#use the tweepy API class to authenticate and also set wait on rate limit to True so that it waits when the limit is reached\napi = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)\n\n"

In [1190]:
#retweet count and favorite ("like") count at minimum,

In [1191]:
'''
#perform a test on one tweet ID to ensure that the connection is successful, use tweet_mode extended based on recommendation
tweet = api.get_status(892420643555336193, tweet_mode='extended')
#store the response as a json
tweet_json = tweet._json
#verify that I can read the json response as a dict using retweet_count as an example 
tweet_json['retweet_count']
#retweet_count
#favorite_count
'''

"\n#perform a test on one tweet ID to ensure that the connection is successful, use tweet_mode extended based on recommendation\ntweet = api.get_status(892420643555336193, tweet_mode='extended')\n#store the response as a json\ntweet_json = tweet._json\n#verify that I can read the json response as a dict using retweet_count as an example \ntweet_json['retweet_count']\n#retweet_count\n#favorite_count\n"

In [1192]:
#perform a simple test of creating a new file and writing all of the contents to the file
#commenting the below code out since I don't need it anymore
'''
with open('tweet_json.txt', 'w') as outfile:  
    json.dump(tweet_json, outfile)
'''

"\nwith open('tweet_json.txt', 'w') as outfile:  \n    json.dump(tweet_json, outfile)\n"

#### 3a. Downloading all of the JSON data into a text file
I stored all of the Tweet IDs from the archive data frame into a list which I would use when sending requests to the Twitter API. I created a simple script to help me loop through all of the Tweet IDs and store each JSON data for each Tweet in its own line in the text file. 

Because the script took a very long time to run (around 30 minutes), and since the script stores the text file on my local machine after it runs I didn't really need to run it again, I have commented out the script below. 

In [1193]:
tweet_id = df_archive['tweet_id'] #store all of the tweet IDs in df_archive into a list
tweet_errors = {} #creating a dictionary that will be used for the exceptions
''' 
### As stated in the above markdown cell, commenting out the rest of the download script as I like to rerun the entire the notebook and make it look clean ###

import time #importing the time library as suggessted to measure how long the script is running
start = time.time() #starting the timer
for tweetid in tweet_id: #starting the for loop for all Tweet IDs in the list created above
    try: #creating a try-except block to catch all of the exceptions as some tweets were deleted
        tweet = api.get_status(tweetid, tweet_mode='extended') #use the get_status from the Tweepy API to get the JSON data for each tweet ID
        tweet_json = tweet._json #convert the data into json--found the _json function online
        with open('tweet_json.txt', 'a') as outfile: #open a new file called tweet_json with the append action so that all of the data gets written to the file instead of overwriting it
            json.dump(tweet_json, outfile) #use the json.dump function to load the json data into the text file
            outfile.write("\n") #this creates a new line on the file for the next tweet
        end = time.time() #this stops the time for the last data that was appended to the file
        total_time = end-start #calculate the total time from start to finish of last load
        print(tweetid, (total_time/60)) #print out the tweet ID and the total time in minutes to see the progress
    except Exception as e: #for all exceptions store in this exception block
        print(str(tweetid) +": "+str(e)) #print the tweet ID along witht he exception error message
        tweet_errors[tweetid]=tweet_json #store the data in the tweet_errors dict that was created earlier
'''

' \n### As stated in the above markdown cell, commenting out the rest of the download script as I like to rerun the entire the notebook and make it look clean ###\n\nimport time #importing the time library as suggessted to measure how long the script is running\nstart = time.time() #starting the timer\nfor tweetid in tweet_id: #starting the for loop for all Tweet IDs in the list created above\n    try: #creating a try-except block to catch all of the exceptions as some tweets were deleted\n        tweet = api.get_status(tweetid, tweet_mode=\'extended\') #use the get_status from the Tweepy API to get the JSON data for each tweet ID\n        tweet_json = tweet._json #convert the data into json--found the _json function online\n        with open(\'tweet_json.txt\', \'a\') as outfile: #open a new file called tweet_json with the append action so that all of the data gets written to the file instead of overwriting it\n            json.dump(tweet_json, outfile) #use the json.dump function to 

#### 3b. Reading the text file line by line and loading the data into a dataframe
This next section took the longest time in the wrangling process. Multiple iterations were taken to get the data correctly and I had to look at the JSON data in the text file multiple times for various tweets to ensure that I was getting the data properly.

I created the below script to load the text file and then read each line and store the attributes that I wanted into a dataframe. The minimum required attributes were Tweet ID, Retweet Count, and Favorite Count. I had to look through the text file to get the actual attribute names in the JSON data using one Tweet ID as an example. 

I also decided that I wanted to pull Full Text, URL, and User Mentions just to see if I could. This would prove to be more difficult than I thought. The Full Text was easy to pull as it was just like the other minimum required fields. However, the URL and User Mentions were both nested in the JSON data tree. So, I had to go multiple layers to access both. 

I noticed that User Mentions were empty for a lot of data (more than 2,000) so I decided to remove that from the script. 

URL was nested for the majority of tweets in the same location. However, there were some exceptions that came up which is why I added the try-except blocks to the first part of the dataframe script. Some URLs were nested in different parts of the JSON data.

In [1194]:
#create a variable for the file that was created from step 3a with encoding set to utf-8
input_file = open('tweet_json.txt','r',encoding='utf-8')
df_list = [] #create a list that the attributes will be appended to
df_errors={} #create a dict for the exceptions
#start with the input_file variable
with input_file as f:
#start the for loop to go through each line of the input_file
    for line in f:
        try: #start the try-except block because of the exceptions
            data = json.loads(line) #use the json.load command to load the line as a json in the data variable
            tweet_id = data['id'] #start getting the attributes that I'm interested in starting with Tweet ID, since the data is in JSON format I can use the same functions as I would for a dict
            retweet_count = data['retweet_count'] #get the attribute retweet_count
            favorite_count = data['favorite_count'] #get the attribute favorite_count
            full_text = data['full_text'] #get the attribute full_text
            url = data['entities']['media'][0]['url'] #get the attribute url
            #append all of the attributes to df_list so that I can create a dataframe from the list
            df_list.append({'tweet_id':tweet_id, #create a key, value for tweet_id
                            'retweet_count':retweet_count, #create a key, value for retweet_count
                            'favorite_count':favorite_count, #create a key, value for favorite_count
                            'full_text':full_text, #create a key, value for full_text
                            'url':url, #create a key, value for url
                           })
            #use pandas DataFrame to convert the list into a dataframe with the appropriate columns
            df_tweepy = pd.DataFrame(df_list, columns = ['tweet_id',
                                                        'retweet_count',
                                                        'favorite_count',
                                                        'full_text',
                                                        'url'])
        #start the exception block as I noticed that some URLs are not stored in the same location of some tweets
        except Exception as e:
            print(str(tweet_id)+': '+str(e)) #print the exception message with the tweet ID
            df_errors[tweet_id]=line #store the line with the tweet ID into the dictionary
   

886267009285017600: 'media'
886054160059072513: 'media'
885518971528720385: 'media'
884247878851493888: 'media'
881633300179243008: 'media'
879674319642796034: 'media'
879130579576475649: 'media'
878604707211726852: 'media'
878404777348136964: 'media'
878316110768087041: 'media'
876537666061221889: 'media'
875097192612077568: 'media'
874434818259525634: 'media'
873337748698140672: 'media'
871166179821445120: 'media'
871102520638267392: 'media'
870726314365509632: 'media'
868639477480148993: 'media'
866720684873056260: 'media'
866094527597207552: 'media'
863471782782697472: 'media'
863427515083354112: 'media'
860981674716409858: 'media'
860177593139703809: 'media'
858860390427611136: 'media'
857214891891077121: 'media'
857062103051644929: 'media'
856602993587888130: 'media'
856330835276025856: 'media'
856288084350160898: 'media'
855862651834028034: 'media'
855860136149123072: 'media'
855857698524602368: 'media'
855818117272018944: 'media'
855245323840757760: 'media'
855138241867124737: 

In [1195]:
len(df_errors.keys()) #find the number of exceptions in the dictionary 

273

#### 3c. Creating a new Dataframe Ignoring Exceptions
Next, I wanted to create a new dataframe now that I have all of my exceptions. I wanted to get the list of all the tweet IDs that had the exception then run a modified version of the script I created in step 3b to avoid any exceptions. 

Getting the URLs for each of the exceptions proved to be challenging as the URL was nested in various places for these tweets. I didn't have enough time to go through each of the exceptions or write the code to handle all scenarios. So, if an exception existed, I just ignored the URL.

In [1196]:
tweet_errors = list(df_errors.keys()) #conver the Tweet IDs in the exception dict to a list
tweet_errors #verify the Tweet IDs

[886267009285017600,
 886054160059072513,
 885518971528720385,
 884247878851493888,
 881633300179243008,
 879674319642796034,
 879130579576475649,
 878604707211726852,
 878404777348136964,
 878316110768087041,
 876537666061221889,
 875097192612077568,
 874434818259525634,
 873337748698140672,
 871166179821445120,
 871102520638267392,
 870726314365509632,
 868639477480148993,
 866720684873056260,
 866094527597207552,
 863471782782697472,
 863427515083354112,
 860981674716409858,
 860177593139703809,
 858860390427611136,
 857214891891077121,
 857062103051644929,
 856602993587888130,
 856330835276025856,
 856288084350160898,
 855862651834028034,
 855860136149123072,
 855857698524602368,
 855818117272018944,
 855245323840757760,
 855138241867124737,
 852936405516943360,
 850333567704068097,
 849668094696017920,
 848213670039564288,
 847978865427394560,
 847617282490613760,
 846505985330044928,
 846139713627017216,
 845098359547420673,
 843981021012017153,
 841320156043304961,
 840761248237

In [1197]:
#clearing the list and datagrame since I am going to re-run a modified version of the script in the below cell
df_list = [] 
df_tweepy = []

In [1198]:
input_file = open('tweet_json.txt','r',encoding='utf-8') #store the file in a variable input_file
with input_file as f: #start by loading the file
    for line in f: #start the for loop
        data = json.loads(line) #use the json.load command to load the line as a json in the data variable
        tweet_id = data['id'] #start getting the attributes that I'm interested in starting with Tweet ID, since the data is in JSON format I can use the same functions as I would for a dict
        retweet_count = data['retweet_count'] #get the attribute retweet_count
        favorite_count = data['favorite_count'] #get the attribute favorite_count
        full_text = data['full_text'] #get the attribute full_text
        if  tweet_id not in tweet_errors: #if the Tweet ID does not exist in the Tweet ID exception list
            url = data['entities']['media'][0]['url'] #then use the following keys and store the value in the URL variable
        else: #if the Tweet ID does exist in the Tweet ID Exception list
            url=None #then don't put anything in the URL attribute
        #append all of the attributes to df_list so that I can create a dataframe from the list
        df_list.append({'tweet_id':tweet_id, #create a key, value for tweet_id
                        'retweet_count':retweet_count, #create a key, value for retweet_count
                        'favorite_count':favorite_count, #create a key, value for favorite_count
                        'full_text':full_text, #create a key, value for full_text
                        'url':url, #create a key, value for url
                        })
        #use pandas DataFrame to convert the list into a dataframe with the appropriate columns
        df_tweepy = pd.DataFrame(df_list, columns = ['tweet_id',
                                                    'retweet_count',
                                                    'favorite_count',
                                                    'full_text',
                                                    'url'])

In [1199]:
df_tweepy.count() #verify the counts, some URLs will be missing or null

tweet_id          2340
retweet_count     2340
favorite_count    2340
full_text         2340
url               2067
dtype: int64

# Assessing

Part of the assessment process was visually looking at all of the data set and then programmatically assessing the data to find data issues. In the cells below, I walk through all of the various data issues that I have identified and then I have aggregated all of my findings after my analysis.

The first step was to visually look at all of the data. I started with the archive table. I immediately noticed some tidiness issues with the dog classification (doggo, floofer, pupper, puppo). Separate attributes for these seemed unnecessary.

The rating number and rating denominator also seemed suspicious as it made no sense but that's just the way the twitter account rates the dogs so I left it as-is.

In [1200]:
df_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,,,,https://twitter.com/dog_rates/status/891087950875897856/photo/1,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl,,,,"https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq,,,,"https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1",13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b,,,,https://twitter.com/dog_rates/status/890609185150312448/photo/1,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A,,,,https://twitter.com/dog_rates/status/890240255349198849/photo/1,14,10,Cassie,doggo,,,


Next I checked the data types for all of the columns. I noticed that the IDs are in floats and the timestamp was a string. These were some things that I could address as quality issues.

I also noticed a lot of None values in the classification was being counted as actual values.

In [1201]:
#check data types
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

I started looking through various attributes trying to see if there were any duplicated. I did notice that some of the expanded URLs were duplcated. These might be due to retweets.

In [1202]:
#check for duplicates
df_archive[df_archive['expanded_urls'].duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
75,878281511006478336,,,2017-06-23 16:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps://t.co/245xJJElsY https://t.co/lUiQH219v6",,,,"https://www.gofundme.com/3yd6y1c,https://twitter.com/dog_rates/status/878281511006478336/photo/1",13,10,Shadow,,,,
76,878057613040115712,,,2017-06-23 01:10:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https://t.co/cwtWnHMVpe,,,,"https://twitter.com/dog_rates/status/878057613040115712/photo/1,https://twitter.com/dog_rates/status/878057613040115712/photo/1",14,10,Emmy,,,,
98,873213775632977920,,,2017-06-09 16:22:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps://t.co/Xp01EU3qyD https://t.co/V5lkvrGLdQ,,,,"https://www.gofundme.com/help-my-baby-sierra-get-better,https://twitter.com/dog_rates/status/873213775632977920/photo/1,https://twitter.com/dog_rates/status/873213775632977920/photo/1",12,10,Sierra,,,pupper,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is reserved for dogs,,,,,10,10,,,,,
126,868552278524837888,,,2017-05-27 19:39:34 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to Cooper. His expression is the same wet or dry. Absolute 12/10 but Coop desperately requests your help\n\nhttps://t.co/ZMTE4Mr69f https://t.co/7RyeXTYLNi,,,,"https://www.gofundme.com/3ti3nps,https://twitter.com/dog_rates/status/868552278524837888/photo/1,https://twitter.com/dog_rates/status/868552278524837888/photo/1",12,10,Cooper,,,,
135,866450705531457537,,,2017-05-22 00:28:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender https://t.co/wk7TfysWHr",,,,"https://twitter.com/dog_rates/status/866450705531457537/photo/1,https://twitter.com/dog_rates/status/866450705531457537/photo/1",13,10,Jamesy,,,pupper,
136,866334964761202691,,,2017-05-21 16:48:45 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Coco. At first I thought she was a cloud but clouds don't bork with such passion. 12/10 would hug softly https://t.co/W86h5dgR6c,,,,"https://twitter.com/dog_rates/status/866334964761202691/photo/1,https://twitter.com/dog_rates/status/866334964761202691/photo/1",12,10,Coco,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@Jack_Septic_Eye I'd need a few more pics to polish a full analysis, but based on the good boy content above I'm leaning towards 12/10",,,,,12,10,,,,,


Then I did a value count of all expanded URLs and noticed that most of the URLs were duplicated or some other websites other than twitter were referenced. 

In [1203]:
#check the value counts
df_archive['expanded_urls'].value_counts()

https://twitter.com/dog_rates/status/759923798737051648/photo/1                                                                                                                                                                                                                                        2
https://www.gofundme.com/my-puppys-double-cataract-surgery,https://twitter.com/dog_rates/status/825026590719483904/photo/1,https://twitter.com/dog_rates/status/825026590719483904/photo/1                                                                                                             2
https://twitter.com/dog_rates/status/753375668877008896/photo/1                                                                                                                                                                                                                                        2
https://twitter.com/dog_rates/status/667866724293877760/photo/1                                              

I used describe to check the min and max for the numerical values. I noticed that one denominator was 170 while all others were 10. So it might have been a typo. Then I checked all the values counts for denominator and found a small portion were not 10.

In [1204]:
#use describe to check the numerical values and see if they make sense
df_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [1205]:
#check to see which denominators had 170
df_archive.query('rating_denominator ==170')['tweet_id']

1120    731156023742988288
Name: tweet_id, dtype: int64

In [1206]:
#check all value counts
df_archive_clean['rating_denominator'].value_counts()

10    2279
Name: rating_denominator, dtype: int64

Then I took a sample from df_archive to visually assess the data again. I noticed that the URLs was in the text field. 

In [1207]:
df_archive.sample(50)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
946,752568224206688256,,,2016-07-11 18:20:21 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Here are three doggos completely misjudging an airborne stick. Decent efforts tho. All 9/10 https://t.co/HCXQL4fGVZ,,,,https://vine.co/v/5W0bdhEUUVT,9,10,,,,,
256,843981021012017153,,,2017-03-21 00:22:10 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",HE WAS DOING A SNOOZE NO SHAME IN A SNOOZE 13/10 https://t.co/Gu5wHx3CBd,,,,https://twitter.com/brianstack153/status/796796054100471809,13,10,,,,,
1766,678399652199309312,,,2015-12-20 02:20:55 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This made my day. 12/10 please enjoy https://t.co/VRTbo3aAcm,,,,https://twitter.com/dog_rates/status/678399652199309312/video/1,12,10,,,,,
1797,677269281705472000,,,2015-12-16 23:29:14 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is the happiest pupper I've ever seen. 10/10 would trade lives with https://t.co/ep8ATEJwRb,,,,https://twitter.com/dog_rates/status/677269281705472000/photo/1,10,10,the,,,pupper,
156,861383897657036800,,,2017-05-08 00:54:59 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Hobbes. He's never seen bubbles before. 13/10 deep breaths buddy https://t.co/QFRlbZw4Z1,,,,https://twitter.com/dog_rates/status/861383897657036800/photo/1,13,10,Hobbes,,,,
983,749395845976588288,,,2016-07-03 00:14:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is George. He just remembered that bees are dying globally at an alarming rate. Scary stuff George. 10/10 https://t.co/lznl6QGkYc,,,,"https://twitter.com/dog_rates/status/749395845976588288/photo/1,https://twitter.com/dog_rates/status/749395845976588288/photo/1",10,10,George,,,,
1958,673580926094458881,,,2015-12-06 19:13:01 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you ask your professor about extra credit on the last day of class. 8/10 https://t.co/H6rqZyE4NP,,,,https://twitter.com/dog_rates/status/673580926094458881/photo/1,8,10,,,,,
192,855818117272018944,,,2017-04-22 16:18:34 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I HEARD HE TIED HIS OWN BOWTIE MARK AND HE JUST WANTS TO SAY HI AND MAYBE A NOGGIN PAT SHOW SOME RESPECT 13/10 https://t.co/5BEjzT2Tth,,,,https://twitter.com/markhalperin/status/855656431005061120,13,10,,,,,
2123,670385711116361728,,,2015-11-27 23:36:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Larry. He's a Panoramic Benzoate. Can shoot lasers out of his eyes. Very neat. Stuck in that position tho. 8/10 https://t.co/MAZx8MPF0S,,,,https://twitter.com/dog_rates/status/670385711116361728/photo/1,8,10,Larry,,,,
617,796387464403357696,,,2016-11-09 16:22:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Snicku. He's having trouble reading because he's a dog. Glasses only helped a little. Nap preferred. 12/10 would snug well https://t.co/cVLUasbKA5,,,,"https://twitter.com/dog_rates/status/796387464403357696/photo/1,https://twitter.com/dog_rates/status/796387464403357696/photo/1",12,10,Snicku,,,,


Next, I visually assessed the df_image_pred table to see if I noticed anything. I immediately noticed that there were multiple predictions for the same image. I also noticed that the puncutation of dog names were different.

In [1208]:
df_image_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


I checked the data types for the table as well and everything seemed okay.

In [1209]:
#check the data types
df_image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


I checked for duplicated data of the image to see if some predictions were duplicates. Turns out that the same image was predicted multiple times. This was probably due to the retweets.

In [1210]:
#checjk duplicates for images
df_image_pred[df_image_pred['jpg_url'].duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg,1,upright,0.303415,False,golden_retriever,0.181351,True,Brittany_spaniel,0.162084,True
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.251530,True,bath_towel,0.116806,False
1333,757729163776290825,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
1345,759159934323924993,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,Irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1349,759566828574212096,https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg,1,Labrador_retriever,0.967397,True,golden_retriever,0.016641,True,ice_bear,0.014858,False
1364,761371037149827077,https://pbs.twimg.com/tweet_video_thumb/CeBym7oXEAEWbEg.jpg,1,brown_bear,0.713293,False,Indian_elephant,0.172844,False,water_buffalo,0.038902,False
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,golden_retriever,0.586937,True,Labrador_retriever,0.398260,True,kuvasz,0.005410,True
1387,766078092750233600,https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg,1,toy_poodle,0.420463,True,miniature_poodle,0.132640,True,Chesapeake_Bay_retriever,0.121523,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1417,771171053431250945,https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg,3,Samoyed,0.978833,True,Pomeranian,0.012763,True,Eskimo_dog,0.001853,True


In [1211]:
#check for a specific image url using query to check the duplicate
df_image_pred.query('jpg_url =="https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg"')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1785,829374341691346946,https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg,1,Staffordshire_bullterrier,0.757547,True,American_Staffordshire_terrier,0.14995,True,Chesapeake_Bay_retriever,0.047523,True
1903,851953902622658560,https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg,1,Staffordshire_bullterrier,0.757547,True,American_Staffordshire_terrier,0.14995,True,Chesapeake_Bay_retriever,0.047523,True


I used describe to see if any of the numerical values were suspect.

In [1212]:
df_image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [1213]:
#performed value count on dogs to see if there were any duplicates based on mis-punctuation
df_image_pred['p3'].value_counts()

Labrador_retriever                79
Chihuahua                         58
golden_retriever                  48
Eskimo_dog                        38
kelpie                            35
kuvasz                            34
Staffordshire_bullterrier         32
chow                              32
cocker_spaniel                    31
beagle                            31
toy_poodle                        29
Pomeranian                        29
Pekinese                          29
Great_Pyrenees                    27
Chesapeake_Bay_retriever          27
Pembroke                          27
French_bulldog                    26
malamute                          26
American_Staffordshire_terrier    24
Cardigan                          23
pug                               23
basenji                           21
toy_terrier                       20
bull_mastiff                      20
Siberian_husky                    19
Shetland_sheepdog                 17
Boston_bull                       17
L

The last table I assessed was the df_tweepy table to see if there were any quality or tidiness issues. Only two new attributes were present on this dataset and the others seemed like duplicate. This table seemed like a good candidate to merge with another table.

In [1214]:
df_tweepy

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text,url
0,892420643555336193,8311,38002,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,https://t.co/MgUWQ76dJU
1,892177421306343426,6139,32632,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",https://t.co/0Xxu71qeIV
2,891815181378084864,4064,24549,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,https://t.co/wUnZnhtVJB
3,891689557279858688,8447,41351,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,https://t.co/tD36da7qLQ
4,891327558926688256,9154,39532,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",https://t.co/AtUZn91f7f
5,891087950875897856,3043,19862,Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,https://t.co/kQ04fDDRmh
6,890971913173991426,2018,11606,Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl,https://t.co/tVJBRMnhxl
7,890729181411237888,18435,64126,When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq,https://t.co/v0nONBcwxq
8,890609185150312448,4182,27277,This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b,https://t.co/9TwLuAGH0b
9,890240255349198849,7214,31302,This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A,https://t.co/t1bfwz5S2A


In [1215]:
#check the data types
df_tweepy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2340 non-null int64
retweet_count     2340 non-null int64
favorite_count    2340 non-null int64
full_text         2340 non-null object
url               2067 non-null object
dtypes: int64(3), object(2)
memory usage: 91.5+ KB


In [1216]:
#check the numericalvalues distribution
df_tweepy.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2340.0,2340.0,2340.0
mean,7.422176e+17,2926.80812,7955.481624
std,6.832564e+16,4930.476731,12321.53991
min,6.660209e+17,0.0,0.0
25%,6.783394e+17,587.75,1373.0
50%,7.186224e+17,1366.5,3458.5
75%,7.986954e+17,3410.25,9734.0
max,8.924206e+17,83567.0,164162.0


In [1217]:
#check the favorite counts
df_tweepy.query('favorite_count == 0')

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text,url
31,886054160059072513,105,0,RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo,
35,885311592912609280,18153,0,RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5,https://t.co/SATN4If5H5
67,879130579576475649,6694,0,RT @dog_rates: This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https://…,
72,878404777348136964,1267,0,"RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps:/…",
73,878316110768087041,6525,0,RT @dog_rates: Meet Terrance. He's being yelled at because he stapled the wrong stuff together. 11/10 hang in there Terrance https://t.co/i…,
77,877611172832227328,79,0,RT @rachel2195: @dog_rates the boyfriend and his soaking wet pupper h*cking love his new hat 14/10 https://t.co/dJx4Gzc50G,https://t.co/dJx4Gzc50G
90,874434818259525634,14496,0,RT @dog_rates: This is Coco. At first I thought she was a cloud but clouds don't bork with such passion. 12/10 would hug softly https://t.c…,
95,873337748698140672,1564,0,RT @dog_rates: This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps:/…,
106,871166179821445120,5662,0,RT @dog_rates: This is Dawn. She's just checking pup on you. Making sure you're doing okay. 12/10 she's here if you need her https://t.co/X…,
120,868639477480148993,2097,0,RT @dog_rates: Say hello to Cooper. His expression is the same wet or dry. Absolute 12/10 but Coop desperately requests your help\n\nhttps://…,


# Cleaning

In [1218]:
# copy all of the dataframes for cleaning activities
df_archive_clean = df_archive.copy()
df_tweepy_clean = df_tweepy.copy()
df_image_pred_clean = df_image_pred.copy()

## Quality Issues

#### df_archive table
1. Dog names don't seem to be correct (visually noticed 'the' and 'a' as some dog names)
2. Timestamp and retweeted_status_timestamp is represented as an object and should be converted to a datetime.
3. Expanded URLs seem to consist of duplicate Twitter URLs and also URLs to other sites (like go fund me) all jumbled into a single attribute
4. None values are being counted when they should be NaN
5. Small portion of rating denominators are not 10 and are inconsistent.
6. Retweets seem like duplicate data in both archive and image_pred tables
7. IDs should be integers instead of float.

#### df_image_pred table
8. Inconsistent punctuation of the dog types in the p1, p2, and p3 columns.

## Tidyness Issues

#### df_archive table
1. Multiple attributes represented when it can be combined for dog type (i.e. doggo, floofer, pupper, puppo) even rating (numerator and denominator)
2. URLs are in the text attributes

#### df_tweepy table
3. Only has a two attributes that are new (retweet count and favorite count). This should be merged to the df_archive table
4. Multiple attributes on the table that are very similar with the prediction confidence score getting worse with more predictions (i.e. prediction, prediction number, and dog (true or false))

My approach was to handle all of the tidyness issues first. Tidyness issue 1 and Quality issue 4 needed to be cleaned up together to make things easier.

### Tidyness Issue 1 & Quality Issue 4
I'll handle both of these data issues first.
#### df_archive table
- Multiple attributes represented when it can be combined for dog type (i.e. doggo, floofer, pupper, puppo) even rating (numerator and denominator)
- None values are being counted when they should be NaN


##### Define
- Combine the attributes (doggo, floofer, puuper, puppo) into a single value called classification.
- Replace the None values with NaN so they are not counted as values

##### Code

In [1219]:
#Replace the None value with empty character. Create a variable for the column names with None as a value
ids =['name','doggo','floofer','pupper','puppo']

#create a function to replace the None value with an empty character
def replaceNone(df, id):
    df[id] = df[id].replace('None','')
#use the function for the column names on the dataframe
replaceNone(df_archive_clean, ids)
#merge the doggo, floofer, pupper, and puppo  column values
df_archive_clean['classification']=df_archive_clean['doggo']+df_archive_clean['floofer']+df_archive_clean['pupper']+df_archive_clean['puppo']
#drop the merged columns
df_archive_clean=df_archive_clean.drop(['doggo','floofer','pupper','puppo'],axis=1)
#replace the empty characters with np.nan so it doesn't get counted
df_archive_clean['name'].replace('',np.nan,inplace=True)
#replace the empty characters with np.nan so it doesn't get counted
df_archive_clean['classification'].replace('',np.nan,inplace=True)

In [1220]:
#change datatype to category
df_archive_clean['classification'] = df_archive_clean['classification'].astype('category')

##### Test

In [1221]:
#take a sample of the dataframe to ensure all of the columns were merged
df_archive_clean.sample(50)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification
2183,668989615043424256,,,2015-11-24 03:08:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bernie. He's taking his Halloween costume very seriously. Wants to be baked. 3/10 not a good idea Bernie smh https://t.co/1zBp1moFlX,,,,https://twitter.com/dog_rates/status/668989615043424256/photo/1,3,10,Bernie,
1106,734787690684657664,,,2016-05-23 16:46:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This dog is more successful than I will ever be. 13/10 absolute legend https://t.co/BPoaHySYwA,,,,"https://twitter.com/dog_rates/status/734787690684657664/photo/1,https://twitter.com/dog_rates/status/734787690684657664/photo/1,https://twitter.com/dog_rates/status/734787690684657664/photo/1,https://twitter.com/dog_rates/status/734787690684657664/photo/1",13,10,,
498,813130366689148928,8.131273e+17,4196984000.0,2016-12-25 21:12:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I've been informed by multiple sources that this is actually a dog elf who's tired from helping Santa all night. Pupgraded to 12/10,,,,,12,10,,
1661,683030066213818368,,,2016-01-01 21:00:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Lulu. She's contemplating all her unreached 2015 goals and daydreaming of a more efficient tomorrow. 10/10 https://t.co/h3ScYuz77J,,,,https://twitter.com/dog_rates/status/683030066213818368/photo/1,10,10,Lulu,
2297,667073648344346624,,,2015-11-18 20:15:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is Dave. He is actually just a skinny legged seal. Happy birthday Dave. 10/10 https://t.co/DPSamreuRq,,,,https://twitter.com/dog_rates/status/667073648344346624/photo/1,10,10,Dave,
1514,691090071332753408,,,2016-01-24 02:48:07 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy Saturday here's a dog in a mailbox. 12/10 https://t.co/MM7tb4HpEY,,,,https://twitter.com/dog_rates/status/691090071332753408/photo/1,12,10,,
620,796125600683540480,,,2016-11-08 23:01:49 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",#ImWithThor 13/10\nhttps://t.co/a18mzkhTf6,,,,https://twitter.com/king5seattle/status/796123679771897856,13,10,,
663,790946055508652032,,,2016-10-25 16:00:09 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Betty. She's assisting with the dishes. Such a good puppo. 12/10 h*ckin helpful af https://t.co/dgvTPZ9tgI,,,,https://twitter.com/dog_rates/status/790946055508652032/photo/1,12,10,Betty,puppo
501,813096984823349248,,,2016-12-25 19:00:02 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Rocky. He got triple-doggo-dared. Stuck af. 11/10 someone help him https://t.co/soNL00XWVu,,,,https://twitter.com/dog_rates/status/813096984823349248/photo/1,11,10,Rocky,doggo
366,828801551087042563,,,2017-02-07 03:04:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: This is Gus. He likes to be close to you, which is good because you want to be close to Gus. 12/10 would boop then pet https…",8.102541e+17,4196984000.0,2016-12-17 22:43:27 +0000,https://twitter.com/dog_rates/status/810254108431155201/photo/1,12,10,Gus,


In [1222]:
#check a specific tweet value against the old archive value
df_archive_clean.query('tweet_id ==803773340896923648')['classification'], df_archive.query('tweet_id ==803773340896923648')['puppo'] 

(554    puppo
 Name: classification, dtype: category
 Categories (7, object): [doggo, doggofloofer, doggopupper, doggopuppo, floofer, pupper, puppo],
 554    puppo
 Name: puppo, dtype: object)

### Tidyness Issue 2
#### df_archive table
- URLs are in the text attribute

##### Define
- Extract the only the text from text attribute and since a URL attribute already exists we don't need to extract the URL contained in the text attribute

##### Code

In [1223]:
#extract all characters using regex before the http and only look at text fields that have an http. store the clean values in a column text_clean
df_archive_clean['text_clean'] = df_archive_clean[df_archive_clean['text'].str.contains('http')].text.str.extract('(.[\s\S]+?(?=https))', expand=True) 

#fill all of the null values in text_clean with the values that didn't have http in the original text column
df_archive_clean['text_clean'].fillna(df_archive_clean['text'], inplace=True)

##### Test

In [1224]:
#check some tweets to ensure the link is removed
df_archive_clean.query('tweet_id in (679158373988876288, 754874841593970688)')['text_clean']

926     RT @dog_rates: This is Rubio. He has too much skin. 11/10 
1744                   This is Rubio. He has too much skin. 11/10 
Name: text_clean, dtype: object

In [1225]:
#check the number of rows
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 15 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          1611 non-null object
classification                380 non-null category
text_clean                    2356 non-null object
dtypes: category(1), float64(4), int64(3), object(7)
memory usage: 260.4+ KB


In [1226]:
#check to see if any string contains http
df_archive_clean[df_archive_clean['text_clean'].str.contains('http')]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean


In [1227]:
#dropping the text column after performing the tests since it is no longer required
df_archive_clean.drop('text',axis=1,inplace=True)

In [1228]:
#checking to make sure the column was dropped
df_archive_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10"
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek"


### Tidyness Issue 3
#### df_tweepy table
- Only has a two attributes that are new (retweet count and favorite count). This should be merged to the df_archive table

##### Define
- Merge with the df_archive_clean table

##### Code

In [1229]:
#merge archive and tweepy dataframes
df_archive_clean = df_archive_clean.merge(df_tweepy_clean, on='tweet_id')

In [1230]:
#drop full_text columns as it is a duplicate. URL is also a duplicate but I want to keep the shortened version
df_archive_clean.drop('full_text', axis=1,inplace=True)

##### Test

In [1231]:
#verify a with a sample set of data that the data has been merged
df_archive_clean.sample(50)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean,retweet_count,favorite_count,url
924,753039830821511168,,,2016-07-13 01:34:21 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",,,,https://vine.co/v/5W2Dg3XPX7a,13,10,,,So this just changed my life. 13/10 please enjoy,22566,39162,
1729,679148763231985668,,,2015-12-22 03:57:37 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/679148763231985668/photo/1,8,10,,,I know everyone's excited for Christmas but that doesn't mean you can send in reindeer. We only rate dogs... 8/10,1099,2885,https://t.co/eWjWgbOCYL
442,818614493328580609,,,2017-01-10 00:24:38 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,"https://twitter.com/dog_rates/status/818614493328580609/photo/1,https://twitter.com/dog_rates/status/818614493328580609/photo/1,https://twitter.com/dog_rates/status/818614493328580609/photo/1,https://twitter.com/dog_rates/status/818614493328580609/photo/1",12,10,Bear,,This is Bear. He's a passionate believer of the outdoors. Leaves excite him. 12/10 would hug softly,2817,10486,https://t.co/FOF0hBDxP8
1153,721001180231503872,,,2016-04-15 15:44:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/721001180231503872/photo/1,11,10,Oliver,pupper,This is Oliver. Bath time is upon him. His fear of the wetness postpones his ultimate pupper destiny. 11/10,654,2628,https://t.co/AFzzKqR4tT
274,839239871831150596,,,2017-03-07 22:22:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,"https://twitter.com/dog_rates/status/839239871831150596/photo/1,https://twitter.com/dog_rates/status/839239871831150596/photo/1,https://twitter.com/dog_rates/status/839239871831150596/photo/1",13,10,Odie,,This is Odie. He's big. 13/10 would attempt to ride,6990,28449,https://t.co/JEXB9RwBmm
1478,692752401762250755,,,2016-01-28 16:53:37 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/692752401762250755/photo/1,13,10,,pupper,"""Hello yes could I get one pupper to go please thank you""\nBoth 13/10",3868,7156,https://t.co/kYWcXbluUu
1764,677700003327029250,,,2015-12-18 04:00:46 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/677700003327029250/photo/1,10,10,Ralph,,This is Ralph. He's an interpretive dancer. 10/10,1529,3522,https://t.co/zoDdPyPFsa
1591,685667379192414208,,,2016-01-09 03:40:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/685667379192414208/photo/1,9,10,Marty,pupper,This is Marty. He has no idea what happened here. Never seen this stuff in his life. 9/10 very suspicious pupper,609,2452,https://t.co/u427woxFpJ
2161,669037058363662336,,,2015-11-24 06:17:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,,https://twitter.com/dog_rates/status/669037058363662336/photo/1,10,10,,,"Here we have Pancho and Peaches. Pancho is a Condoleezza Gryffindor, and Peaches is just an asshole. 10/10 &amp; 7/10",321,667,https://t.co/Lh1BsJrWPp
2196,668587383441514497,,,2015-11-23 00:30:28 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",,,,https://vine.co/v/ea0OwvPTx9l,13,10,the,,Never forget this vine. You will not stop watching for at least 15 minutes. This is the second coveted.. 13/10,1102,1688,


### Tidyness Issue 4
#### df_image_pred table 
- Multiple attributes on the table that are very similar with the prediction confidence score getting worse with more predictions (i.e. prediction, prediction number, and dog (true or false))

##### Define
- Drop the columns for predictions 2 and predictions 3 and rename prediction 1 columns

##### Code

In [1232]:
#create a variable to drop column headers 
drop_columns=['p2','p2_conf','p2_dog','p3','p3_conf','p3_dog']
#drop the column headers from the dataframe
df_image_pred_clean = df_image_pred_clean.drop(columns=drop_columns, axis=1)

In [1233]:
#rename the columns
df_image_pred_clean.rename(columns={'p1':'prediction','p1_conf':'confidence','p1_dog':'dog'},inplace=True)

##### Test

In [1234]:
#verify the columns are renamed and dropped
df_image_pred_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,prediction,confidence,dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True


### Quality Issues

#### df_archive table
1. Dog names don't seem to be correct (visually noticed 'the' and 'a' as some dog names)
2. Timestamp and retweeted_status_timestamp is represented as an object and should be converted to a datetime.
3. Expanded URLs seem to consist of duplicate Twitter URLs and also URLs to other sites (like go fund me) all jumbled into a single attribute
4. None values are being counted when they should be NaN
5. URLs are in the text attributes (tidiness issue)
6. Retweets seem like duplicate data in both archive and image_pred tables
7. IDs should be integers instead of float.

#### df_image_pred table
8. Inconsistent punctuation of the dog types in the p1, p2, and p3 columns.

### Quality Issue 1
Dog names don't seem to be correct (visually noticed 'the' and 'a' as some dog names)

##### Define
- Remove invalid names (a, an, the) and see if there is a way to extract the name

##### Code

In [1235]:
#create a variable to store the invalid names identified
invalid_names = ['a','an','the']
#replace the invalid names with NaN
df_archive_clean['name'].replace(invalid_names, np.nan, inplace=True)

##### Test

In [1236]:
#check to make sure the values are not showing up
df_archive_clean.query('name in @invalid_names')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean,retweet_count,favorite_count,url


### Quality Issue 2
Timestamp and retweeted_status_timestamp is represented as an object and should be converted to a datetime.

##### Define
- Change the data type to datetime

##### Code

In [1237]:
#change the columns to datetime data type
df_archive_clean['timestamp']=pd.to_datetime(df_archive_clean['timestamp'])
df_archive_clean['retweeted_status_timestamp']=pd.to_datetime(df_archive_clean['retweeted_status_timestamp'])

##### Test

In [1238]:
#test the dataframe info to ensure data type was changed
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2340 entries, 0 to 2339
Data columns (total 17 columns):
tweet_id                      2340 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2340 non-null datetime64[ns]
source                        2340 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null datetime64[ns]
expanded_urls                 2281 non-null object
rating_numerator              2340 non-null int64
rating_denominator            2340 non-null int64
name                          1532 non-null object
classification                378 non-null category
text_clean                    2340 non-null object
retweet_count                 2340 non-null int64
favorite_count                2340 non-null int64
url                           2067 non-null object
dtypes: category(

### Quality Issue 3
Expanded URLs seem to consist of duplicate Twitter URLs and also URLs to other sites (like go fund me) all jumbled into a single attribute

##### Define
- Extract twitter URLs from the attribute and remove duplicates

##### Code

In [1239]:
#use regex to only extract the twitter URLs
df_archive_clean['expanded_urls']=df_archive_clean['expanded_urls'].str.extract('(https?://twitter(.+?),)')


In [1240]:
#remove unncessary comma at the end of expanded urls
df_archive_clean['expanded_urls']=df_archive_clean['expanded_urls'].str[:-1]

##### Test

In [1241]:
#verify the expanded URLs
df_archive_clean['expanded_urls'].value_counts()

https://twitter.com/dog_rates/status/789530877013393408/photo/1    2
https://twitter.com/dog_rates/status/774314403806253056/photo/1    2
https://twitter.com/dog_rates/status/822489057087389700/photo/1    2
https://twitter.com/dog_rates/status/860563773140209665/photo/1    2
https://twitter.com/dog_rates/status/782305867769217024/photo/1    2
https://twitter.com/dog_rates/status/820314633777061888/photo/1    2
https://twitter.com/dog_rates/status/863062471531167744/photo/1    2
https://twitter.com/dog_rates/status/829501995190984704/photo/1    2
https://twitter.com/dog_rates/status/833124694597443584/photo/1    2
https://twitter.com/dog_rates/status/819004803107983360/photo/1    2
https://twitter.com/dog_rates/status/679062614270468097/photo/1    2
https://twitter.com/dog_rates/status/782969140009107456/photo/1    2
https://twitter.com/dog_rates/status/687317306314240000/photo/1    2
https://twitter.com/dog_rates/status/866334964761202691/photo/1    2
https://twitter.com/dog_rates/stat

In [1242]:
#verify row counts
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2340 entries, 0 to 2339
Data columns (total 17 columns):
tweet_id                      2340 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2340 non-null datetime64[ns]
source                        2340 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null datetime64[ns]
expanded_urls                 603 non-null object
rating_numerator              2340 non-null int64
rating_denominator            2340 non-null int64
name                          1532 non-null object
classification                378 non-null category
text_clean                    2340 non-null object
retweet_count                 2340 non-null int64
favorite_count                2340 non-null int64
url                           2067 non-null object
dtypes: category(1

### Quality Issue 5
#### df_archive
- Small portion of rating denominators are not 10 and are inconsistent.

##### Define
- Update all denominators to 10

##### Code

In [1243]:
#update all denominators to 10
df_archive_clean['rating_denominator']=10

##### Test

In [1244]:
#verify all denominators are 10
df_archive_clean['rating_denominator'].value_counts()

10    2340
Name: rating_denominator, dtype: int64

### Quality Issue 6
#### df_archive_clean table
- Retweets seem like duplicate data in both archive and image_pred tables

##### Define
- remove the retweet data from df_archive and df_tweepy tables

##### Code

In [1245]:
#identify duplicate images
df_archive_clean.query('url=="https://t.co/Bb3xnpsWBC"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean,retweet_count,favorite_count,url
934,752309394570878976,,,2016-07-11 01:11:51,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",6.753544e+17,4196984000.0,2015-12-11 16:40:19,https://twitter.com/dog_rates/status/675354435921575936/video/1,13,10,,,RT @dog_rates: Everyone needs to watch this. 13/10,17974,0,https://t.co/Bb3xnpsWBC
1849,675354435921575936,,,2015-12-11 16:40:19,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",,,NaT,,13,10,,,Everyone needs to watch this. 13/10,17973,33710,https://t.co/Bb3xnpsWBC


In [1246]:
#store all of the retweeted tweet IDs as a copy in the retweets variable
retweets = df_archive_clean[~df_archive_clean['retweeted_status_id'].isna()]['tweet_id'].copy()

In [1247]:
#copy all of the duplicated tweets and store in variable deleted_tweets
deleted_tweets = df_image_pred_clean[df_image_pred_clean['jpg_url'].duplicated()].query('tweet_id in @retweets')['tweet_id'].copy()

In [1248]:
#drop duplicated tweets from df_image_pred table
df_image_pred_clean = df_image_pred_clean.drop(df_image_pred_clean.query('tweet_id in @deleted_tweets').index)

In [1249]:
#drop all of the duplicate tweet IDs from df_archive_clean table 
df_archive_clean = df_archive_clean.drop(df_archive_clean.query('tweet_id in @deleted_tweets').index)

##### Test

In [1250]:
#Check one of the duplicated =images
df_image_pred_clean.query('jpg_url =="https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg"')

Unnamed: 0,tweet_id,jpg_url,img_num,prediction,confidence,dog
591,679158373988876288,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True


In [1251]:
#check to see if the deleted tweets were removed
df_archive_clean.query('tweet_id in @deleted_tweets')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,classification,text_clean,retweet_count,favorite_count,url


### Quality Issue 7
#### df_archive_clean table
- IDs should be integers instead of float.

##### Define
- Change the data type for the IDs to INT instead of float and replace all null values to 0.

##### Code

In [1252]:
#replace all of the NaN values for all of the ID fields with 0
df_archive_clean['retweeted_status_user_id'] = df_archive_clean['retweeted_status_user_id'].fillna(0)
df_archive_clean['retweeted_status_id'] = df_archive_clean['retweeted_status_id'].fillna(0)
df_archive_clean['in_reply_to_status_id'] = df_archive_clean['in_reply_to_status_id'].fillna(0)
df_archive_clean['in_reply_to_user_id'] = df_archive_clean['in_reply_to_user_id'].fillna(0)


In [1253]:
#change all of the ID fields using astype to integer
df_archive_clean['retweeted_status_user_id'] = df_archive_clean['retweeted_status_user_id'].astype(int)
df_archive_clean['retweeted_status_id'] = df_archive_clean['retweeted_status_id'].astype(int)
df_archive_clean['in_reply_to_status_id'] = df_archive_clean['in_reply_to_status_id'].astype(int)
df_archive_clean['in_reply_to_user_id'] = df_archive_clean['in_reply_to_user_id'].astype(int)


##### Test

In [1254]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2279 entries, 0 to 2339
Data columns (total 17 columns):
tweet_id                      2279 non-null int64
in_reply_to_status_id         2279 non-null int64
in_reply_to_user_id           2279 non-null int64
timestamp                     2279 non-null datetime64[ns]
source                        2279 non-null object
retweeted_status_id           2279 non-null int64
retweeted_status_user_id      2279 non-null int64
retweeted_status_timestamp    106 non-null datetime64[ns]
expanded_urls                 542 non-null object
rating_numerator              2279 non-null int64
rating_denominator            2279 non-null int64
name                          1487 non-null object
classification                367 non-null category
text_clean                    2279 non-null object
retweet_count                 2279 non-null int64
favorite_count                2279 non-null int64
url                           2006 non-null object
dtypes: category(1),

### Quality Issue 8
#### df_image_pred table
- Inconsistent punctuation of the dog types in the prediction column.

##### Define
- Change the punctuation to proper case for the prediction column

##### Code

In [1255]:
#replace all underscores with a space
df_image_pred_clean['prediction'] = df_image_pred_clean['prediction'].str.replace('_',' ')

In [1256]:
#use the title function to make all first letters in words uppercase
df_image_pred_clean['prediction']=df_image_pred_clean['prediction'].str.title()

##### Test

In [1257]:
df_image_pred_clean

Unnamed: 0,tweet_id,jpg_url,img_num,prediction,confidence,dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh Springer Spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,Redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German Shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian Ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,Miniature Pinscher,0.560311,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese Mountain Dog,0.651137,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,Box Turtle,0.933012,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,Chow,0.692517,True
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,Shopping Cart,0.962465,False
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,Miniature Poodle,0.201493,True


In [1259]:
#save all of the cleaned dataframes to one master csv
twitter_archive_master = df_archive_clean.merge(df_image_pred_clean, on='tweet_id')
twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)