# Data Wrangling Project
A project to practise data wrangling using the WeRateDogs Twitter account as the inspiration.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import tweepy
import re

## Gathering

First we need to gather the data as follows  
1) The WeRateDogs Twitter archive, which is already in the working folder (downloaded from the Udacity resources)  
2) The classification of the breed of dog in the tweets which we will download programmatically below  
3) Further data from the Twitter API. We will decide what to download once we've had a look at the options and a think about what might be useful. (Currently I have just downloaded the JSON from the Udacity website as the Twitter bots won't grant me developer access.)

### Twitter Archive

This is considered a "file on hand" so we have manually added it to the working directory.

In [2]:
twitter_arc = pd.read_csv("twitter-archive-enhanced.csv")
twitter_arc.loc[:,"source"].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

### Image Labelling
The following data was generated by the Udacity tutor, who ran the pictures attached to the tweets through an image labelling algorithm. Here we download it from the relevant URL.

In [3]:
#download and save image_predictions.tsv
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
r = requests.get(url)
with open("image-predictions.tsv", "wb") as f:
    f.write(r.content)

In [4]:
breed = pd.read_csv("image-predictions.tsv", sep="\t")
breed

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [5]:
#download and save tweet meta data
url = "https://video.udacity-data.com/topher/2018/November/5bf60cc8_tweet-json/tweet-json"
r = requests.get(url)
with open("tweet-json", "wb") as f:
    f.write(r.content)

### Tweet Meta Data

Tweet meta data should be downloaded via the Twitter API, but until the kind bots of Twitter give me access I will use the provided file. Here is the example code I would have used.

In [6]:
"""
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
"""


In [7]:
tweet_meta = pd.read_json("tweet-json", lines = True)
tweet_meta.sample(10)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
1989,2015-12-04 03:43:54+00:00,672622327801233409,672622327801233408,This lil pupper is sad because we haven't foun...,False,"[0, 128]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 672622321664811010, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,1377,False,False,0.0,0.0,en,,,,
161,2017-05-05 16:00:04+00:00,860524505164394496,860524505164394496,This is Carl. He likes to dance. Doesn't care ...,False,"[0, 97]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 860524497660776448, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,24678,False,False,0.0,0.0,en,,,,
1525,2016-01-22 02:28:52+00:00,690360449368465409,690360449368465408,Stop sending in lobsters. This is the final wa...,False,"[0, 113]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 690360446273110016, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,2925,False,False,0.0,0.0,en,,,,
1187,2016-04-08 01:19:36+00:00,718246886998687744,718246886998687744,This is Alexanderson. He's got a weird ass bir...,False,"[0, 132]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 718246851380711426, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,2115,False,False,0.0,0.0,en,,,,
1616,2016-01-07 05:28:35+00:00,684969860808454144,684969860808454144,"For those who claim this is a goat, u are wron...",False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",6.849598e+17,...,2374,False,False,,,en,,,,
1685,2015-12-28 20:57:50+00:00,681579835668455424,681579835668455424,This is Apollo. He thought you weren't coming ...,False,"[0, 130]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 681579814340456449, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,3893,False,False,0.0,0.0,en,,,,
2027,2015-12-02 00:58:41+00:00,671855973984772097,671855973984772096,Remarkable dog here. Walks on back legs really...,False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671855970818121728, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,977,False,False,0.0,0.0,en,,,,
1804,2015-12-16 01:27:03+00:00,676936541936185344,676936541936185344,Here we see a rare pouched pupper. Ample stora...,False,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 676936535535656961, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,13809,False,False,0.0,0.0,en,,,,
2121,2015-11-27 23:36:23+00:00,670385711116361728,670385711116361728,Meet Larry. He's a Panoramic Benzoate. Can sho...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 670385704808108032, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,593,False,False,0.0,0.0,en,,,,
655,2016-10-27 23:17:38+00:00,791780927877898241,791780927877898240,RT @dog_rates: This is Maddie. She gets some w...,False,"[0, 119]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,0,False,False,0.0,0.0,en,{'created_at': 'Sat Jun 25 17:31:25 +0000 2016...,,,


Now that we have gathered the three data sets and loaded them into dataframes, it is time to assess the data ready for cleaning.

## Assessing

In this section we will review the three data sets and note any problems we find with completeness, accuracy, and tidiness in the table at the bottom.

### Twitter Archive

In [8]:
twitter_arc.head(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


The data above contains retweets and replies which are not straightforward ratings of pictures of dogs.
The names are not all populated or in some cases don't appear to be realistic names.
There are columns containing retweet information which won't be relevant.
The source is part of a `<a>` tag.
Expanded urls often contain the tweet id, which is duplication.<br><br>
Let's see if the ratings "make sense"

In [9]:
twitter_arc["rating_numerator"].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [10]:
twitter_arc["rating_denominator"].value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

Hmmm. This is a whimsical rating system but there is definitely something wrong. The overwhelming majority of the ratings are out of 10, but roughly a score are not. There are some odd-looking numerators as well. Let's have a look at a few and see if we can spot what's wrong in the original text.

In [11]:
twitter_arc[twitter_arc["rating_denominator"] != 10][["text", "in_reply_to_user_id"]]

Unnamed: 0,text,in_reply_to_user_id
313,@jonnysun @Lin_Manuel ok jomny I know you're e...,26259580.0
342,@docmisterio account started on 11/15/15,30582080.0
433,The floofs have been released I repeat the flo...,
516,Meet Sam. She smiles 24/7 &amp; secretly aspir...,
784,"RT @dog_rates: After so many requests, this is...",
902,Why does this never happen at my front door......,
1068,"After so many requests, this is Bretagne. She ...",
1120,Say hello to this unbelievably well behaved sq...,
1165,Happy 4/20 from the squad! 13/10 for all https...,
1202,This is Bluebert. He just saw that both #Final...,


There are a few things going on above but it seems that the parser that strips the score out sometimes gets confused with dates or other apparent fractions. In some cases the ratings have been added up for multiple dogs.
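The confusion described above can be reproduced with a small sketch. It assumes that genuine ratings are always out of 10 and that the last `N/10` in a tweet is the rating; `extract_rating` is a hypothetical helper for illustration, not the parser that built the archive.

```python
import re

# Matches any "number/number" token, e.g. "9/11", "24/7", "14/10".
FRACTION = re.compile(r"(\d+)/(\d+)")

def extract_rating(text):
    """Return (numerator, denominator) for the last fraction out of 10, or None."""
    out_of_ten = [(int(n), int(d)) for n, d in FRACTION.findall(text) if int(d) == 10]
    return out_of_ten[-1] if out_of_ten else None

samples = [
    "She was the last surviving 9/11 search dog, and our second ever 14/10. RIP",
    "Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer.",
]

# What a "first fraction wins" parser would pick up.
first_fractions = [FRACTION.search(s).group(0) for s in samples]
# What the stricter rule returns for the same texts.
ratings = [extract_rating(s) for s in samples]
```

Taking the first fraction picks up the date and the "24/7" idiom, while restricting to denominators of 10 recovers the intended rating or reports that there is none.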

Let's have a look at the `name` column

In [12]:
twitter_arc["name"].head(25)

0      Phineas
1        Tilly
2       Archie
3        Darla
4     Franklin
5         None
6          Jax
7         None
8         Zoey
9       Cassie
10        Koda
11       Bruno
12        None
13         Ted
14      Stuart
15      Oliver
16         Jim
17        Zeke
18     Ralphus
19      Canela
20      Gerald
21     Jeffrey
22        such
23      Canela
24        None
Name: name, dtype: object

In [13]:
twitter_arc.loc[22,"text"]

"I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba"

It seems that the method of parsing the name relies on the phrase "this is", which isn't foolproof. <br><br>
Also worth noting from above that we have a link to the tweet at the end of the text of the tweet that we can remove or put in a different column (but only if it is useful as an alternative to `expanded_urls`).
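To illustrate the name issue, here is a minimal sketch of how a rule based on "This is" could produce `such` as a name. Both patterns are guesses for illustration; the actual extraction rule used for the archive is not known, and the capital-letter check is only one possible refinement.

```python
import re

# Naive rule: whatever word follows "This is" is taken as the name.
NAIVE = re.compile(r"This is (\w+)")
# Stricter rule (an assumption): a real name starts with a capital letter.
STRICT = re.compile(r"This is ([A-Z][a-z]+)")

def naive_name(text):
    match = NAIVE.search(text)
    return match.group(1) if match else None

def strict_name(text):
    match = STRICT.search(text)
    return match.group(1) if match else None

honor = "I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10"
tilly = "This is Tilly. She's just checking pup on you."
```

With these definitions `naive_name(honor)` gives `"such"`, whereas `strict_name(honor)` gives `None` and `strict_name(tilly)` still gives `"Tilly"`.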

Finally let's check the dtypes of the columns

In [14]:
twitter_arc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

### Image Labelling

In [15]:
breed

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


We have some decisions to make about how to use the image classifications. There are sometimes multiple photos. Each photo has the top three image matches and a column to tell us whether each match is a dog breed or not. This needs to be tidied into something usable. In cases where there is no match it will probably make sense to add a `NaN`.  
Let's check how many of the rows have a dog breed identified.

In [16]:
breed[["p1_dog","p2_dog","p3_dog"]].any(axis=1).sum(), breed.shape[0]

(1751, 2075)

In [17]:
breed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


We see again that `tweet_id` is an `int`. The rest of the data will likely be condensed into a single column, but we will decide that in the cleaning stage.
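One possible way to condense the predictions is sketched below, assuming we keep the highest-ranked prediction that is flagged as a dog and use `NaN` otherwise. The `best_breed` helper and the `breed` column name are hypothetical; the two rows are copied from the table above.

```python
import numpy as np
import pandas as pd

# Two rows in the shape of image-predictions.tsv (confidence columns omitted).
preds = pd.DataFrame({
    "tweet_id": [666020888022790149, 891689557279858688],
    "p1": ["Welsh_springer_spaniel", "paper_towel"], "p1_dog": [True, False],
    "p2": ["collie", "Labrador_retriever"], "p2_dog": [True, True],
    "p3": ["Shetland_sheepdog", "spatula"], "p3_dog": [True, False],
})

def best_breed(row):
    """Return the highest-ranked prediction flagged as a dog, else NaN."""
    for rank in ("p1", "p2", "p3"):
        if row[rank + "_dog"]:
            return row[rank]
    return np.nan

preds["breed"] = preds.apply(best_breed, axis=1)
```

The second row shows the fallback: `paper_towel` is skipped because `p1_dog` is `False`, so the second prediction is used instead.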

### Tweet Meta Data

In [18]:
tweet_meta

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336192,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,0.0,0.0,en,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343424,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,33819,False,False,0.0,0.0,en,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,25461,False,False,0.0,0.0,en,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,42908,False,False,0.0,0.0,en,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,41048,False,False,0.0,0.0,en,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,2015-11-16 00:24:50+00:00,666049248165822465,666049248165822464,Here we have a 1949 1st generation vulpix. Enj...,False,"[0, 120]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666049244999131136, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,111,False,False,0.0,0.0,en,,,,
2350,2015-11-16 00:04:52+00:00,666044226329800704,666044226329800704,This is a purebred Piers Morgan. Loves to Netf...,False,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666044217047650304, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,311,False,False,0.0,0.0,en,,,,
2351,2015-11-15 23:21:54+00:00,666033412701032449,666033412701032448,Here is a very happy pup. Big fan of well-main...,False,"[0, 130]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666033409081393153, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,128,False,False,0.0,0.0,en,,,,
2352,2015-11-15 23:05:30+00:00,666029285002620928,666029285002620928,This is a western brown Mitsubishi terrier. Up...,False,"[0, 139]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666029276303482880, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,132,False,False,0.0,0.0,en,,,,


There is lots of duplicate data here and some data we don't need, so let's have a look at a single row and try to see what's important.

In [19]:
tweet_meta.loc[0]

created_at                                               2017-08-01 16:23:56+00:00
id                                                              892420643555336193
id_str                                                          892420643555336192
full_text                        This is Phineas. He's a mystical boy. Only eve...
truncated                                                                    False
display_text_range                                                         [0, 85]
entities                         {'hashtags': [], 'symbols': [], 'user_mentions...
extended_entities                {'media': [{'id': 892420639486877696, 'id_str'...
source                           <a href="http://twitter.com/download/iphone" r...
in_reply_to_status_id                                                          NaN
in_reply_to_status_id_str                                                      NaN
in_reply_to_user_id                                                            NaN
in_r

In [21]:
twitter_arc[twitter_arc["tweet_id"] == tweet_meta.loc[0, "id_str"]]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [22]:
twitter_arc[twitter_arc["tweet_id"] == tweet_meta.loc[0, "id"]]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


From the above it appears that `id`, rather than `id_str`, is the column that matches `tweet_id` in the `twitter_arc` dataframe. The only two columns that seem of use are the `retweet_count` and `favorite_count` columns, so we can retain those (plus the `id` key column) and drop the rest.
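A likely reason the `id_str` lookup returns nothing is that the value was rounded on its way to an integer: the outputs above show `id_str` ending in `...192` where `id` ends in `...193`. The sketch below shows that passing the ID through a 64-bit float produces exactly that value. That this is what happened during parsing is an assumption, not something the notebook confirms.

```python
# The archive's tweet_id for the first row, as an exact Python int.
tweet_id = 892420643555336193

# float64 has a 53-bit significand, so integers this large are rounded
# to the nearest representable value when they pass through a float.
via_float = int(float(tweet_id))
```

Here `via_float` is `892420643555336192`, which no longer equals `tweet_id`. Storing IDs as strings avoids this kind of silent rounding.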

In [23]:
tweet_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
 #   Column                         Non-Null Count  Dtype              
---  ------                         --------------  -----              
 0   created_at                     2354 non-null   datetime64[ns, UTC]
 1   id                             2354 non-null   int64              
 2   id_str                         2354 non-null   int64              
 3   full_text                      2354 non-null   object             
 4   truncated                      2354 non-null   bool               
 5   display_text_range             2354 non-null   object             
 6   entities                       2354 non-null   object             
 7   extended_entities              2073 non-null   object             
 8   source                         2354 non-null   object             
 9   in_reply_to_status_id          78 non-null     float64            
 10  in_reply_to_status_id_st

Again, the `id` column will need to be turned to a `str`

### Summary of Issues

#### Twitter Archive

##### Data Quality
* Contains replies (`in_reply_to_status_id` is not `NaN`) and retweets (`retweeted_status_id` is not `NaN`)
* Contains columns with data about retweets `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp`
* The entries in the `source` column are obscured inside an `<a>` tag
* The end of the entry in the `expanded_urls` column is often the tweet id
* Ratings are not always pulled out of the tweets correctly and where they are they sometimes refer to multiple dogs so are not comparable
* Some of the names are not correctly extracted from the tweet (relying on "this is"?)
* `text` column contains the tweet url which is already available in expanded form in `expanded_urls`
* `tweet_id` should be a `str` (`object`)
* There are 59 `expanded_urls` entries missing
* Source, once stripped out, should be a `category` and `NaN` where not available
* `timestamp` should be dtype `datetime`
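The dtype issues in this list could be addressed along the following lines. This is a sketch on a one-row stand-in dataframe, not the cleaning code itself; `utc=True` is an assumption that all timestamps carry the `+0000` offset seen above.

```python
import pandas as pd

# One illustrative row with the same dtypes as the raw archive.
df = pd.DataFrame({
    "tweet_id": [892420643555336193],
    "timestamp": ["2017-08-01 16:23:56 +0000"],
    "source": ["Twitter for iPhone"],
})

# Treat the ID as an identifier rather than a quantity.
df["tweet_id"] = df["tweet_id"].astype(str)
# Parse the timestamp into a timezone-aware datetime.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
# The source takes only a handful of distinct values.
df["source"] = df["source"].astype("category")
```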

##### Tidiness
* The last four columns (`doggo`, `floofer`, `pupper`, and `puppo`) are a single category spread over multiple columns
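A minimal sketch of collapsing the four stage columns into one is shown below. It assumes that a missing stage is stored as the literal string `"None"` (as the `name` column output suggests) and that a tweet may carry more than one stage; the `stage` column name and the `combine_stages` helper are hypothetical, and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; "None" as a string is an assumption about the raw file.
df = pd.DataFrame({
    "doggo":   ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage present in the row; NaN when there is none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ",".join(found) if found else np.nan

df["stage"] = df.apply(combine_stages, axis=1)
```

Joining with a comma keeps rows that mention two stages visible instead of silently picking one of them.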


#### Image Labelling

##### Data Quality
* `tweet_id` should be a `str` (`object`)
* The breeds themselves are inconsistently capitalised and have underscores

##### Tidiness
* Breed classification is spread out over several columns and the data is not tidy

#### Tweet Meta Data

##### Data Quality
* `id` should be a `str` (`object`) and is the key column that corresponds to `tweet_id`

##### Tidiness
* `favorite_count` and `retweet_count` should be retained but the others dropped
* For the sake of tidiness it also seems sensible to amalgamate all this data into a single table since each row will be an observation about a tweet.
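The amalgamation suggested above might look like the following sketch. It assumes the key columns have already been converted to `str` and that a left join on the archive is wanted, so tweets without metadata or a breed prediction are kept with `NaN`. The tiny dataframes are stand-ins built from values shown earlier.

```python
import pandas as pd

# Tiny stand-ins for the three cleaned tables (values copied from rows shown above).
tweets = pd.DataFrame({"tweet_id": ["892420643555336193"], "rating_numerator": [13]})
breeds = pd.DataFrame({"tweet_id": ["892177421306343426"], "p1": ["Chihuahua"]})
meta = pd.DataFrame({"id": ["892420643555336193"], "favorite_count": [39467]})

# Align the key name, then left-join so every archive tweet is kept.
meta = meta.rename(columns={"id": "tweet_id"})
combined = (
    tweets
    .merge(meta, on="tweet_id", how="left")
    .merge(breeds, on="tweet_id", how="left")
)
```

In this example the tweet has metadata but no matching breed row, so `favorite_count` is filled and `p1` is `NaN`.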

## Cleaning

First copy the dataframes to create clean versions.

In [24]:
tweets_clean = twitter_arc.copy()
breeds_clean = breed.copy()
meta_clean = tweet_meta.copy()

We will now go through the Summary of Issues in order

### Twitter Archive
#### Drop the replies and retweets

In [25]:
tweets_clean.shape

(2356, 17)

In [26]:
tweets_clean.drop(tweets_clean[tweets_clean["retweeted_status_user_id"].notnull()].index, inplace = True)
tweets_clean.drop(tweets_clean[tweets_clean["in_reply_to_user_id"].notnull()].index, inplace = True)

In [27]:
tweets_clean.shape

(2097, 17)

#### Pull out source from tag

In [28]:
tweets_clean["source"].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1964
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       31
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

As there are only four distinct entries, we can put them in a dictionary

In [29]:
# Map each full <a> tag to a short, readable label
rep_dict = {
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>': "iPhone",
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>': "Vine",
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>': "Web",
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>': "TweetDeck",
}
# Assign the result back rather than replacing in place on a column selection
tweets_clean["source"] = tweets_clean["source"].replace(rep_dict)
tweets_clean["source"].value_counts()

iPhone       1964
Vine           91
Web            31
TweetDeck      11
Name: source, dtype: int64

#### Expanded URL

Although there is replication of data between the `tweet_id` and the end of the `expanded_urls` entry, it is convenient to leave the link in so that the tweet can be accessed quickly, if needed.

#### Ratings
It's less obvious what the right approach to take here is. I think it's safe to assume that all genuine ratings will be out of 10, so let's see what we've got.

In [30]:
tweets_clean["rating_denominator"].value_counts()

10     2080
50        3
11        2
80        2
7         1
170       1
150       1
120       1
110       1
90        1
70        1
40        1
20        1
2         1
Name: rating_denominator, dtype: int64

In [31]:
pd.set_option('display.max_colwidth', None)
tweets_clean[(tweets_clean["rating_denominator"] % 10 != 0)]["text"]

516     Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
1068                                 After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
1662                                  This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5
2335                                     This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
Name: text, dtype: object

Row 516 doesn't contain a rating at all, so we should set its `rating_numerator` and `rating_denominator` to `NaN`.

The remaining three do have ratings, but the rating is not the first fraction-like object in the text, so the wrong fraction was picked up.

In [32]:
indices = tweets_clean[(tweets_clean["rating_denominator"] % 10 != 0)].index
for ix in indices:
    # Look for the first fraction in the text that is out of 10
    rating = re.search(r"\d+/10", tweets_clean.loc[ix, "text"])
    if rating is not None:
        fraction = rating.group(0).split("/")
        tweets_clean.loc[ix, "rating_numerator"] = int(fraction[0])
        tweets_clean.loc[ix, "rating_denominator"] = int(fraction[1])
    else:
        # No rating out of 10 anywhere in the tweet
        tweets_clean.loc[ix, "rating_numerator"] = np.nan
        tweets_clean.loc[ix, "rating_denominator"] = np.nan

In [33]:
tweets_clean["rating_denominator"].value_counts()

10.0     2083
50.0        3
80.0        2
150.0       1
110.0       1
90.0        1
70.0        1
170.0       1
120.0       1
40.0        1
20.0        1
Name: rating_denominator, dtype: int64
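To see why searching for `\d+/10` fixes these rows, here is a small standalone sketch run on shortened versions of the four tweets (the texts are abbreviated for readability, and `rating_out_of_ten` is a hypothetical helper name):

```python
import re

# Abbreviated versions of the four tweets whose first fraction is not the rating.
texts = {
    516: "Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.",
    1068: "She was the last surviving 9/11 search dog, and our second ever 14/10. RIP",
    1662: "He just robbed a 7/11 and is in a high speed police chase. 10/10",
    2335: "This is an Albanian 3 1/2 legged  Episcopalian. 9/10",
}

def rating_out_of_ten(text):
    """Return the numerator of the first 'x/10' fraction, or None if there is none."""
    match = re.search(r"(\d+)/10", text)
    if match is None:
        return None
    return int(match.group(1))

ratings = {ix: rating_out_of_ten(text) for ix, text in texts.items()}
print(ratings)
```

The fractions `24/7`, `9/11`, `7/11` and `1/2` are skipped because their denominators are not 10, which leaves row 516 with no rating at all.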

In [34]:
mult_dogs = tweets_clean[tweets_clean["rating_denominator"] != 10].index
mult_dogs

Int64Index([433, 516, 902, 1120, 1165, 1202, 1228, 1254, 1274, 1351, 1433,
            1635, 1779, 1843],
           dtype='int64')

For the remaining rows, the odd-looking ratings are mostly scores for several dogs at once, so we can normalise them to ${x}/{10}$ by dividing through by the number of dogs.

In [35]:
for ix in mult_dogs:
    num_dogs = tweets_clean.loc[ix, "rating_denominator"] / 10
    tweets_clean.loc[ix, "rating_numerator"] = tweets_clean.loc[ix, "rating_numerator"] / num_dogs
    tweets_clean.loc[ix, "rating_denominator"] = tweets_clean.loc[ix, "rating_denominator"] / num_dogs
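As a worked example of the arithmetic, 84/70 corresponds to 70 / 10 = 7 dogs, so the per-dog rating is 84 / 7 = 12, i.e. 12/10. The same normalisation can be done without a loop; the sketch below uses four numerator and denominator pairs taken from the tweets in this group, placed in a small standalone frame:

```python
import pandas as pd

# Four of the multi-dog ratings: 84/70, 165/150, 204/170 and 45/50.
group = pd.DataFrame(
    {
        "rating_numerator": [84.0, 165.0, 204.0, 45.0],
        "rating_denominator": [70.0, 150.0, 170.0, 50.0],
    },
    index=[433, 902, 1120, 1274],
)

# One "dog" per 10 points of denominator.
num_dogs = group["rating_denominator"] / 10
group["rating_numerator"] = group["rating_numerator"] / num_dogs
group["rating_denominator"] = group["rating_denominator"] / num_dogs
print(group)
```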


In [36]:
tweets_clean.loc[mult_dogs]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
433,820690176645140481,,,2017-01-15 17:52:40 +0000,iPhone,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,,,,"https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1",12.0,10.0,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,iPhone,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,,,,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",,,Sam,,,,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,iPhone,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,,,,https://twitter.com/dog_rates/status/758467244762497024/video/1,11.0,10.0,,,,,
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,iPhone,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,,,,https://twitter.com/dog_rates/status/731156023742988288/photo/1,12.0,10.0,this,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,iPhone,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,,,,https://twitter.com/dog_rates/status/722974582966214656/photo/1,2.0,10.0,,,,,
1202,716439118184652801,,,2016-04-03 01:36:11 +0000,iPhone,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,,,,https://twitter.com/dog_rates/status/716439118184652801/photo/1,10.0,10.0,Bluebert,,,,
1228,713900603437621249,,,2016-03-27 01:29:02 +0000,iPhone,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,,,,https://twitter.com/dog_rates/status/713900603437621249/photo/1,11.0,10.0,,,,,
1254,710658690886586372,,,2016-03-18 02:46:49 +0000,iPhone,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,,,,https://twitter.com/dog_rates/status/710658690886586372/photo/1,10.0,10.0,,,,,
1274,709198395643068416,,,2016-03-14 02:04:08 +0000,iPhone,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",,,,https://twitter.com/dog_rates/status/709198395643068416/photo/1,9.0,10.0,,,,,
1351,704054845121142784,,,2016-02-28 21:25:30 +0000,iPhone,Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,,,,https://twitter.com/dog_rates/status/704054845121142784/photo/1,12.0,10.0,a,,,,


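Row 1165 still has a low numerator because the text contains a date, 20th April (4/20); the actual score later in the tweet is 13/10. Row 1202 has the same kind of problem: "50/50" was picked up, but the score at the end of the tweet is 11/10. We will set both numerators manually.

In [37]:
tweets_clean.loc[1165, "rating_numerator"] = 13
tweets_clean.loc[1202, "rating_numerator"] = 11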

Now for the remaining large numerators.

In [38]:
tweets_clean[tweets_clean["rating_numerator"] > 14]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
695,786709082849828864,,,2016-10-13 23:23:56 +0000,iPhone,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",,,,https://twitter.com/dog_rates/status/786709082849828864/photo/1,75.0,10.0,Logan,,,,
763,778027034220126208,,,2016-09-20 00:24:34 +0000,iPhone,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,,,,https://twitter.com/dog_rates/status/778027034220126208/photo/1,27.0,10.0,Sophie,,,pupper,
979,749981277374128128,,,2016-07-04 15:00:45 +0000,TweetDeck,This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh,,,,https://twitter.com/dog_rates/status/749981277374128128/photo/1,1776.0,10.0,Atticus,,,,
1712,680494726643068929,,,2015-12-25 21:06:00 +0000,iPhone,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,,,,https://twitter.com/dog_rates/status/680494726643068929/photo/1,26.0,10.0,,,,,
2074,670842764863651840,,,2015-11-29 05:52:33 +0000,iPhone,After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY,,,,https://twitter.com/dog_rates/status/670842764863651840/photo/1,420.0,10.0,,,,,


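There are only five, so I will manually correct the three that are wrong: each is a decimal rating (9.75, 11.27, 11.26) where only the digits after the decimal point were captured. The other two (1776 and 420) are correct in the sense that they are what the tweet says.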

In [39]:
tweets_clean.loc[695, "rating_numerator"] = 9.75
tweets_clean.loc[763, "rating_numerator"] = 11.27
tweets_clean.loc[1712, "rating_numerator"] = 11.26
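With more rows, a manual fix would not scale. A pattern that allows an optional decimal part in the numerator would capture these ratings directly. The sketch below runs on shortened versions of the tweet texts; the pattern is an assumption about how one might re-extract the ratings, not the method used to build the original archive.

```python
import re

# Shortened texts for the rows with large or decimal numerators.
texts = {
    695: "H*ckin magical af 9.75/10",
    763: "Appears at random just to smile at the locals. 11.27/10 would smile back",
    979: "He's quite simply America af. 1776/10",
    1712: "Average of 11.26/10",
}

# Numerator: digits with an optional decimal part; denominator: digits.
pattern = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

numerators = {}
for ix, text in texts.items():
    match = pattern.search(text)
    numerators[ix] = float(match.group(1))
print(numerators)
```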

#### Names

In [40]:
print(tweets_clean["name"].sample(50))

871        Sephie
1007    Bookstore
226          None
178         Oscar
1568         None
2150        Kloey
1548        Lucky
691       Lincoln
1250     Chuckles
1298         None
1570        Ember
1756          Mia
493          None
139         Sammy
2037          the
466         Tebow
1907         None
1831         Herm
2007         Kreg
2139         None
1077         None
575         Bones
766          None
653         Remus
495       Eleanor
689         Tonks
277          None
477          Jack
2312        Josep
1815          the
1192         None
1551         Trip
2019         just
2295         Gabe
150         Quinn
1273        Penny
269          None
898         Lilli
1125      Charles
974       Jameson
1904         None
1369         Rudy
1716         Toby
1981         Chet
1875         Zuzu
1088         Remy
2061         Herm
10           Koda
1736         Mike
1166    Schnitzel
Name: name, dtype: object


There are a lot of "names" such as `a`, `the` and `just`, because the function used to guess the name looks for phrases like `This is`. Let's revert everything that is not a word beginning with a capital letter to `None`.

In [41]:
for ix in tweets_clean["name"].index:
    not_name = re.search(r"^[A-Z][a-z]+", tweets_clean.loc[ix, "name"])
    if not_name is None:
        tweets_clean.loc[ix, "name"] = "None"
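The same rule can be applied to the whole column at once with pandas string methods. This is a sketch on a handful of values taken from the sample above, not a replacement for the cell itself:

```python
import pandas as pd

# A few entries from the sample output, including lowercase non-names.
names = pd.Series(["Sephie", "None", "the", "just", "a", "Koda"])

# str.match anchors at the start of the string, like the ^ in the loop's pattern.
looks_like_name = names.str.match(r"[A-Z][a-z]+")

# Keep entries that look like names; everything else becomes the string "None".
cleaned = names.where(looks_like_name, "None")
print(cleaned.tolist())
```

Note that the placeholder `"None"` itself starts with a capital letter, so it passes the test and is left unchanged.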


#### URL Duplication

Let's remove the URL from the end of the tweet `text` to avoid duplication.

In [42]:
tweets_clean["text"]

0                                                            This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1       This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2                        This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3                                                                  This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4       This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
                                                                                       ...                                                       

In [43]:
tweets_clean["text"] = tweets_clean["text"].replace(to_replace = "https.+", value = "" , regex = True).str.strip()
tweets_clean["text"]

0                                                            This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10
1       This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10
2                        This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10
3                                                                  This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us
4       This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek
                                                                           ...                                                                    
2351                                              Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox 
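The pattern `https.+` removes everything from the first `https` to the end of that line. A narrower pattern that targets only `t.co` short links is sketched below as an alternative; it assumes every trailing link uses the `https://t.co/` form seen in the output above.

```python
import pandas as pd

texts = pd.Series([
    "This is Phineas. He's a mystical boy. 13/10 https://t.co/MgUWQ76dJU",
    "Keep Sam smiling by clicking and sharing this link:\n"
    "https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx",
])

# Remove each t.co link together with the whitespace in front of it.
stripped = texts.str.replace(r"\s*https://t\.co/\S+", "", regex=True).str.strip()
print(stripped.tolist())
```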

#### Four Categories of Dog
There are four categories of dog each with its own column. These can be amalgamated into a single column, although very occasionally more than one is true. 

In [44]:
dog_classes = ["doggo", "floofer", "pupper", "puppo"]

There are 11 instances of multiple labels. This strongly implies that the labels shouldn't be condensed into a single column and should instead be kept as boolean `True` or `False` columns. However, since there aren't many, I will just add an extra label of `multiple`.

In [45]:
def dog(df):
    """Collapse the dog-stage columns into a single label per row."""
    dog_series = pd.Series(np.nan, index=df.index, dtype=object)
    for col in df.columns:
        for ix in df.index:
            # A stage column holds its own name when the tweet uses that word
            if df.loc[ix, col] == col:
                if pd.isna(dog_series[ix]):
                    dog_series[ix] = col
                else:
                    # A second stage was found for the same tweet
                    dog_series[ix] = "multiple"
    return dog_series


tweets_clean["dog_type"] = dog(tweets_clean[dog_classes])
tweets_clean["dog_type"].value_counts()    

pupper      221
doggo        72
puppo        23
multiple     11
floofer       9
Name: dog_type, dtype: int64
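For comparison, the same collapsing can be done without nested loops by counting how many stage columns are set in each row. The sketch below uses a made-up four-row frame and assumes, as in the function above, that a stage column holds its own name when the stage applies and some other value (here the string `"None"`) otherwise.

```python
import pandas as pd

dog_classes = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: no stage, one stage (twice), and two stages.
toy = pd.DataFrame(
    [
        ["None", "None", "None", "None"],
        ["doggo", "None", "None", "None"],
        ["None", "None", "pupper", "None"],
        ["doggo", "None", "pupper", "None"],
    ],
    columns=dog_classes,
)

# True where a column contains its own name.
flags = pd.DataFrame({col: toy[col] == col for col in dog_classes})
n_labels = flags.sum(axis=1)

# Name of the first flagged column, kept only for rows with exactly one flag.
first_label = flags.astype(int).idxmax(axis=1)
dog_type = first_label.where(n_labels == 1)

# Rows with more than one flag get the extra "multiple" label.
dog_type = dog_type.mask(n_labels > 1, "multiple")
print(dog_type.tolist())
```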

In [46]:
tweets_clean.drop(dog_classes, axis = 1, inplace = True)
tweets_clean

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_type
0,892420643555336193,,,2017-08-01 16:23:56 +0000,iPhone,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13.0,10.0,Phineas,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13.0,10.0,Tilly,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,iPhone,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12.0,10.0,Archie,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,iPhone,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13.0,10.0,Darla,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,iPhone,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12.0,10.0,Franklin,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,iPhone,Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5.0,10.0,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,iPhone,This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6.0,10.0,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,iPhone,Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9.0,10.0,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,iPhone,This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7.0,10.0,,


#### Tweet_id Dtype

In [47]:
tweets_clean["tweet_id"] = tweets_clean["tweet_id"].astype(str)

In [48]:
tweets_clean.dtypes

tweet_id                       object
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator              float64
rating_denominator            float64
name                           object
dog_type                       object
dtype: object

In [49]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2097 non-null   object 
 1   in_reply_to_status_id       0 non-null      float64
 2   in_reply_to_user_id         0 non-null      float64
 3   timestamp                   2097 non-null   object 
 4   source                      2097 non-null   object 
 5   text                        2097 non-null   object 
 6   retweeted_status_id         0 non-null      float64
 7   retweeted_status_user_id    0 non-null      float64
 8   retweeted_status_timestamp  0 non-null      object 
 9   expanded_urls               2094 non-null   object 
 10  rating_numerator            2096 non-null   float64
 11  rating_denominator          2096 non-null   float64
 12  name                        2097 non-null   object 
 13  dog_type                    336 n

#### Missing Expanded URLs

In [50]:
tweets_clean[tweets_clean["expanded_urls"].isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_type
375,828361771580813312,,,2017-02-05 21:56:51 +0000,Web,Beebop and Doobert should start a band 12/10 would listen,,,,,12.0,10.0,,
707,785515384317313025,,,2016-10-10 16:20:36 +0000,iPhone,"Today, 10/10, should be National Dog Rates Day",,,,,10.0,10.0,,
1445,696518437233913856,,,2016-02-08 02:18:30 +0000,iPhone,Oh my god 10/10 for every little hot dog pupper,,,,,10.0,10.0,,pupper


It's not obvious why these three remaining tweets are missing URLs, so I will leave them as they are.

#### Drop Unneeded Columns

In [51]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2097 non-null   object 
 1   in_reply_to_status_id       0 non-null      float64
 2   in_reply_to_user_id         0 non-null      float64
 3   timestamp                   2097 non-null   object 
 4   source                      2097 non-null   object 
 5   text                        2097 non-null   object 
 6   retweeted_status_id         0 non-null      float64
 7   retweeted_status_user_id    0 non-null      float64
 8   retweeted_status_timestamp  0 non-null      object 
 9   expanded_urls               2094 non-null   object 
 10  rating_numerator            2096 non-null   float64
 11  rating_denominator          2096 non-null   float64
 12  name                        2097 non-null   object 
 13  dog_type                    336 n

In [52]:
drop_cols = ["retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp",
 "in_reply_to_status_id", "in_reply_to_user_id"]
tweets_clean.drop(drop_cols, axis = 1, inplace = True)

In [53]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            2097 non-null   object 
 1   timestamp           2097 non-null   object 
 2   source              2097 non-null   object 
 3   text                2097 non-null   object 
 4   expanded_urls       2094 non-null   object 
 5   rating_numerator    2096 non-null   float64
 6   rating_denominator  2096 non-null   float64
 7   name                2097 non-null   object 
 8   dog_type            336 non-null    object 
dtypes: float64(2), object(7)
memory usage: 243.8+ KB


#### Fix Datatypes

In [54]:
tweets_clean["timestamp"] = pd.to_datetime(tweets_clean["timestamp"])
tweets_clean["source"] = tweets_clean["source"].astype("category")
tweets_clean["dog_type"] = tweets_clean["dog_type"].astype("category")

In [55]:
tweets_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2097 non-null   object             
 1   timestamp           2097 non-null   datetime64[ns, UTC]
 2   source              2097 non-null   category           
 3   text                2097 non-null   object             
 4   expanded_urls       2094 non-null   object             
 5   rating_numerator    2096 non-null   float64            
 6   rating_denominator  2096 non-null   float64            
 7   name                2097 non-null   object             
 8   dog_type            336 non-null    category           
dtypes: category(2), datetime64[ns, UTC](1), float64(2), object(4)
memory usage: 215.5+ KB


In [56]:
tweets_clean.sample(10)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type
1794,677328882937298944,2015-12-17 03:26:04+00:00,iPhone,This is Derek. All the dogs adore Derek. He's a great guy. 10/10 really solid pup,https://twitter.com/dog_rates/status/677328882937298944/photo/1,10.0,10.0,Derek,
1944,673708611235921920,2015-12-07 03:40:24+00:00,iPhone,This is Riley. She's just an adorable football fan. 12/10 I'd watch the sports with her,https://twitter.com/dog_rates/status/673708611235921920/photo/1,12.0,10.0,Riley,
2277,667437278097252352,2015-11-19 20:20:22+00:00,iPhone,Never seen this breed before. Very pointy pup. Hurts when you cuddle. Still cute tho. 10/10,https://twitter.com/dog_rates/status/667437278097252352/photo/1,10.0,10.0,,
238,847116187444137987,2017-03-29 16:00:12+00:00,iPhone,Unbelievable... We. Only. Rate. Dogs. Please stop sending in other things like this Blossoming Flop Kangaroo. Thank you... 11/10,https://twitter.com/dog_rates/status/847116187444137987/photo/1,11.0,10.0,,
1573,687494652870668288,2016-01-14 04:41:12+00:00,iPhone,This is Marq. He stole this car. 7/10 wtf Marq?,https://twitter.com/dog_rates/status/687494652870668288/photo/1,7.0,10.0,Marq,
1383,700847567345688576,2016-02-20 01:00:55+00:00,iPhone,Meet Crouton. He's a Galapagos Boonwiddle. Has a legendary tongue (most Boonwiddles do). Excellent stuff 10/10,https://twitter.com/dog_rates/status/700847567345688576/photo/1,10.0,10.0,Crouton,
249,845306882940190720,2017-03-24 16:10:40+00:00,iPhone,This is Pickles. She's a silly pupper. Thinks she's a dish. 12/10 would dry,https://twitter.com/dog_rates/status/845306882940190720/photo/1,12.0,10.0,Pickles,pupper
1001,747933425676525569,2016-06-28 23:23:19+00:00,iPhone,Pls don't send more sherks. I don't care how seemingly floofy they are. It does me so much frighten. Thank u. 11/10,"https://twitter.com/dog_rates/status/747933425676525569/photo/1,https://twitter.com/dog_rates/status/747933425676525569/photo/1,https://twitter.com/dog_rates/status/747933425676525569/photo/1",11.0,10.0,,
1979,672980819271634944,2015-12-05 03:28:25+00:00,iPhone,Extraordinary dog here. Looks large. Just a head. No body. Rather intrusive. 5/10 would still pet,https://twitter.com/dog_rates/status/672980819271634944/photo/1,5.0,10.0,,
951,751950017322246144,2016-07-10 01:23:49+00:00,Vine,This is Lola. She's a surfing pupper. 13/10 magical af,https://vine.co/v/5WrjaYAMvMO,13.0,10.0,Lola,pupper


In [57]:
tweets_clean.to_csv("tweets_clean.csv", index = False)

### Image Labelling

In [58]:
breeds_clean

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


#### Untidy Classification Data

Above we see that the algorithm has offered its three best guesses for each picture, its own estimate of the probability that each guess is correct, and whether each guess is a dog breed.  
It is much tidier to keep a single guess of the breed, or a null if none of the three guesses is a dog breed.  
It may also be sensible to put a null where the most likely dog breed falls below some confidence threshold; I will not enforce that during cleaning, but will keep the confidence so that a threshold can be applied later.  
Also note that for tweets with multiple images only one is classified; the `img_num` column appears to record which one (it is 2 in the example below, so it is not always the first).
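If a confidence threshold were wanted later, it could be applied in one step to a table holding one breed guess and its confidence per row. The following is a sketch only: the `preds` frame, its column names and the `threshold` value of 0.1 are illustrative assumptions, not choices made in this project.

```python
import numpy as np
import pandas as pd

# Illustrative single-guess table: a breed name and the model's confidence.
preds = pd.DataFrame({
    "breed": ["Chow", "Golden Retriever", "Miniature Poodle", None],
    "confidence": [0.692517, 0.007959, 0.201493, np.nan],
})

threshold = 0.1  # hypothetical cut-off

# Comparisons with NaN are False, so rows without a guess stay null.
keep = preds["confidence"] >= threshold
confident = pd.DataFrame({
    "breed": preds["breed"].where(keep),
    "confidence": preds["confidence"].where(keep),
})
print(confident)
```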

In [59]:
breeds_clean[breeds_clean["tweet_id"] == 891327558926688256]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True


I'll write a function to return the most likely dog breed plus the probability from the final nine columns, or null if none is a dog breed.

In [60]:
breed_col = {"first": ["p1", "p1_conf", "p1_dog"],
             "second": ["p2", "p2_conf", "p2_dog"],
             "third": ["p3", "p3_conf", "p3_dog"]
            }
breeds_cond = pd.DataFrame(np.nan, breeds_clean.index, ["breed", "confidence"])
# Work from the third guess to the first so that a higher-ranked dog breed
# overwrites a lower-ranked one
for key in ["third", "second", "first"]:
    name_col, conf_col, is_dog_col = breed_col[key]
    # Rows where this guess is flagged as a dog breed
    dog_rows = breeds_clean[breeds_clean[is_dog_col]]
    breeds_cond.loc[dog_rows.index, "breed"] = dog_rows[name_col]
    breeds_cond.loc[dog_rows.index, "confidence"] = dog_rows[conf_col]
breeds_cond.head(10)


Unnamed: 0,breed,confidence
0,Welsh_springer_spaniel,0.465074
1,redbone,0.506826
2,German_shepherd,0.596461
3,Rhodesian_ridgeback,0.408143
4,miniature_pinscher,0.560311
5,Bernese_mountain_dog,0.651137
6,,
7,chow,0.692517
8,golden_retriever,0.007959
9,miniature_poodle,0.201493
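The overwrite order is what makes this work: filling from `p3` up to `p1` means the best-ranked dog guess is the one left standing. Here is a small standalone sketch of the same idea using `Series.where`, on three rows copied from the predictions table (rows 6, 7 and 8); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 7, 8 and 6 of the predictions: dog first, dog only third, no dog at all.
preds = pd.DataFrame({
    "p1": ["chow", "shopping_cart", "box_turtle"],
    "p1_conf": [0.692517, 0.962465, 0.933012],
    "p1_dog": [True, False, False],
    "p2": ["Tibetan_mastiff", "shopping_basket", "mud_turtle"],
    "p2_conf": [0.058279, 0.014594, 0.045885],
    "p2_dog": [True, False, False],
    "p3": ["fur_coat", "golden_retriever", "terrapin"],
    "p3_conf": [0.054449, 0.007959, 0.017885],
    "p3_dog": [False, True, False],
})

breed = pd.Series(np.nan, index=preds.index, dtype=object)
confidence = pd.Series(np.nan, index=preds.index)

# Lowest rank first, so each better-ranked dog guess replaces the previous value.
for rank in ["p3", "p2", "p1"]:
    is_dog = preds[f"{rank}_dog"]
    breed = preds[rank].where(is_dog, breed)
    confidence = preds[f"{rank}_conf"].where(is_dog, confidence)

best = pd.DataFrame({"breed": breed, "confidence": confidence})
print(best)
```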


In [61]:
#check row 8 which has a dog recognised, but not in p1
breeds_clean.head(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [62]:
breeds_cond["breed"].value_counts()

golden_retriever      173
Labrador_retriever    113
Pembroke               96
Chihuahua              95
pug                    65
                     ... 
EntleBucher             1
Scotch_terrier          1
standard_schnauzer      1
clumber                 1
Japanese_spaniel        1
Name: breed, Length: 113, dtype: int64

Let's now standardise the appearance of the breed names, concatenate the dataframes and drop the unnecessary columns

In [63]:
breeds_cond["breed"] = breeds_cond["breed"].str.replace("_", " ").str.title()
breeds_cond["breed"]

0       Welsh Springer Spaniel
1                      Redbone
2              German Shepherd
3          Rhodesian Ridgeback
4           Miniature Pinscher
                 ...          
2070                    Basset
2071        Labrador Retriever
2072                 Chihuahua
2073                 Chihuahua
2074                       NaN
Name: breed, Length: 2075, dtype: object

In [64]:
lst = list(breed_col.values())
flat_lst = [val for sublist in lst for val in sublist]
breeds_clean.drop(flat_lst, axis = 1, inplace = True)
breeds_clean["tweet_id"] = breeds_clean["tweet_id"].astype(str)
breeds_clean

Unnamed: 0,tweet_id,jpg_url,img_num
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1
...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1


In [65]:
breeds_clean[["breed", "confidence"]] = breeds_cond[["breed", "confidence"]]
breeds_clean

Unnamed: 0,tweet_id,jpg_url,img_num,breed,confidence
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh Springer Spaniel,0.465074
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,Redbone,0.506826
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German Shepherd,0.596461
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian Ridgeback,0.408143
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,Miniature Pinscher,0.560311
...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,Basset,0.555712
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,Labrador Retriever,0.168086
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581


In [66]:
breeds_clean.to_csv("breeds_clean.csv", index = False)

### Tweet Meta Data  
Let's create a dataframe with just the tweet id and the number of favourites and retweets.

In [67]:
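# Keep only the id and the two engagement counts; .copy() avoids a SettingWithCopyWarning below
meta_clean = meta_clean[["id", "retweet_count", "favorite_count"]].copy()
# Treat the id as a string identifier and rename it to match the other tables
meta_clean["id"] = meta_clean["id"].astype(str)
meta_clean.rename({"id": "tweet_id"}, axis = 1, inplace = True)
meta_clean.info()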

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 55.3+ KB


In [68]:
meta_clean.to_csv("meta_clean.csv", index = False)

## Merging

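Lastly, I will merge these three tables into a single dataframe, keeping every tweet in `tweets_clean` and attaching the breed predictions and tweet metadata wherever they exist.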

In [69]:
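# Combine breed predictions and metadata, keeping ids that appear in either table
dogs_temp = breeds_clean.merge(meta_clean, on="tweet_id", how="outer")
# Attach that to the cleaned tweets; the left join keeps only tweets in tweets_clean
dogs = tweets_clean.merge(dogs_temp, on="tweet_id", how="left")
dogs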

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_type,jpg_url,img_num,breed,confidence,retweet_count,favorite_count
0,892420643555336193,2017-08-01 16:23:56+00:00,iPhone,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13.0,10.0,Phineas,,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1.0,,,8853.0,39467.0
1,892177421306343426,2017-08-01 00:17:27+00:00,iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10",https://twitter.com/dog_rates/status/892177421306343426/photo/1,13.0,10.0,Tilly,,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1.0,Chihuahua,0.323581,6514.0,33819.0
2,891815181378084864,2017-07-31 00:18:03+00:00,iPhone,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12.0,10.0,Archie,,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1.0,Chihuahua,0.716012,4328.0,25461.0
3,891689557279858688,2017-07-30 15:58:51+00:00,iPhone,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13.0,10.0,Darla,,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1.0,Labrador Retriever,0.168086,8964.0,42908.0
4,891327558926688256,2017-07-29 16:00:24+00:00,iPhone,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek","https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12.0,10.0,Franklin,,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2.0,Basset,0.555712,9774.0,41048.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2092,666049248165822465,2015-11-16 00:24:50+00:00,iPhone,Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5.0,10.0,,,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1.0,Miniature Pinscher,0.560311,41.0,111.0
2093,666044226329800704,2015-11-16 00:04:52+00:00,iPhone,This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6.0,10.0,,,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1.0,Rhodesian Ridgeback,0.408143,147.0,311.0
2094,666033412701032449,2015-11-15 23:21:54+00:00,iPhone,Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9.0,10.0,,,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1.0,German Shepherd,0.596461,47.0,128.0
2095,666029285002620928,2015-11-15 23:05:30+00:00,iPhone,This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7.0,10.0,,,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1.0,Redbone,0.506826,48.0,132.0
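
One way to check how many tweets found a match in a merge like this is the `indicator` and `validate` options of `DataFrame.merge`. The sketch below uses small hypothetical stand-ins (`tweets`, `breeds`, `meta`) rather than the notebook's real tables, and it assumes `tweet_id` is unique in each table:

```python
import pandas as pd

# Hypothetical toy stand-ins for tweets_clean, breeds_clean and meta_clean
tweets = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
breeds = pd.DataFrame({"tweet_id": ["1", "2"], "breed": ["Chihuahua", "Redbone"]})
meta = pd.DataFrame({"tweet_id": ["1", "2", "4"], "retweet_count": [10, 20, 40]})

# Same two-step merge as above; validate raises if an id is duplicated
extra = breeds.merge(meta, on="tweet_id", how="outer", validate="one_to_one")
merged = tweets.merge(extra, on="tweet_id", how="left",
                      validate="one_to_one", indicator=True)

# "_merge" is "both" for matched tweets and "left_only" for unmatched ones
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Tweet `"3"` has no metadata here, so its `retweet_count` is NaN and the whole column becomes float. That is the same mechanism that turns integer counts into values such as `8853.0` after a merge.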


In [70]:
dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2096
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2097 non-null   object             
 1   timestamp           2097 non-null   datetime64[ns, UTC]
 2   source              2097 non-null   category           
 3   text                2097 non-null   object             
 4   expanded_urls       2094 non-null   object             
 5   rating_numerator    2096 non-null   float64            
 6   rating_denominator  2096 non-null   float64            
 7   name                2097 non-null   object             
 8   dog_type            336 non-null    category           
 9   jpg_url             1971 non-null   object             
 10  img_num             1971 non-null   float64            
 11  breed               1666 non-null   object             
 12  confidence          1666 non-null 

In [71]:
# The merge left the counts as floats (e.g. 8853.0); cast them back to integers
cols = ["retweet_count", "favorite_count"]
dogs[cols] = dogs[cols].astype(int)
dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2096
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2097 non-null   object             
 1   timestamp           2097 non-null   datetime64[ns, UTC]
 2   source              2097 non-null   category           
 3   text                2097 non-null   object             
 4   expanded_urls       2094 non-null   object             
 5   rating_numerator    2096 non-null   float64            
 6   rating_denominator  2096 non-null   float64            
 7   name                2097 non-null   object             
 8   dog_type            336 non-null    category           
 9   jpg_url             1971 non-null   object             
 10  img_num             1971 non-null   float64            
 11  breed               1666 non-null   object             
 12  confidence          1666 non-null 
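
The cast to `int` only works when every tweet has a count. If some rows were missing metadata, pandas' nullable `Int64` dtype would be one alternative. This is a minimal sketch on a hypothetical frame (`counts`), not on `dogs` itself:

```python
import numpy as np
import pandas as pd

# Hypothetical counts with gaps, as a merge with unmatched rows would produce
counts = pd.DataFrame({
    "retweet_count": [8853.0, np.nan, 41.0],
    "favorite_count": [39467.0, 111.0, np.nan],
})
cols = ["retweet_count", "favorite_count"]

# A plain integer cast fails as soon as a NaN is present
try:
    counts[cols].astype(int)
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable Int64 dtype keeps whole numbers and stores gaps as <NA>
counts = counts.astype({col: "Int64" for col in cols})
print(counts.dtypes)
```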

In [72]:
dogs.to_csv("twitter_archive_master.csv", index = False)
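
CSV does not store dtypes, so the string ids and timezone-aware timestamps need to be declared again when the file is read back. The sketch below round-trips a hypothetical one-row frame through an in-memory buffer instead of the real `twitter_archive_master.csv`:

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same id and timestamp style as dogs
df = pd.DataFrame({
    "tweet_id": ["892420643555336193"],
    "timestamp": pd.to_datetime(["2017-08-01 16:23:56+00:00"]),
    "retweet_count": [8853],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: tweet_id is inferred as an integer, timestamp as plain text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the dtypes restores string ids and parsed timestamps
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"tweet_id": str}, parse_dates=["timestamp"])
print(restored.dtypes)
```

The `source` and `dog_type` columns are affected in the same way: they would need `dtype="category"` on reading to become categorical again.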