# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each dataset is different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [1]:
import pandas as pd
import numpy as np
df1 = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [2]:
# Import predicted breeds file
df2 = pd.read_csv("image_predictions.tsv", sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [3]:
import json
df3_list = []

# Each line of tweet_json.txt is one JSON-encoded tweet; the with-block closes the file when done
with open('tweet_json.txt', 'r') as f:
    for line in f:
        data = json.loads(line)
        df3_list.append({'tweet_id': data['id'], 'timestamp': data['created_at'], 'retweets': data['retweet_count'],
                         'favorites': data['favorite_count'], 'followers': data['user']['followers_count']})

In [4]:
df2.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [5]:
df3 = pd.DataFrame(df3_list)
df3.head()

Unnamed: 0,tweet_id,timestamp,retweets,favorites,followers
0,892420643555336193,Tue Aug 01 16:23:56 +0000 2017,7168,34467,9058333
1,892177421306343426,Tue Aug 01 00:17:27 +0000 2017,5388,29883,9058333
2,891815181378084864,Mon Jul 31 00:18:03 +0000 2017,3553,22497,9058333
3,891689557279858688,Sun Jul 30 15:58:51 +0000 2017,7383,37667,9058333
4,891327558926688256,Sat Jul 29 16:00:24 +0000 2017,7926,35979,9058333


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment
and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [6]:
# Make copies of original pieces of data
dfarchive = df1.copy()
dfimage = df2.copy()
dftweet = df3.copy()

#### Checking Twitter Archive DataFrame

In [7]:
dfarchive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


##### Quality Issue #1: Retweets and Replies

In [8]:
dfarchive.info()
#checking info to see null counts. retweeted_status_id is mostly null; its 181 non-null values
#indicate retweets, and the 78 non-null values in in_reply_to_status_id indicate replies

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 
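
The non-null counts in `info()` can also be computed directly. Below is a minimal sketch on a made-up four-row frame (the `sample` data and the variable names are assumptions for illustration, not the real archive); it counts retweets and replies and keeps only original tweets.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the archive: NaN means "not a retweet" / "not a reply".
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 8.35e17, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 8.86e17, np.nan],
})

# A non-null ID in either column marks the row as a retweet or a reply.
is_retweet = sample["retweeted_status_id"].notna()
is_reply = sample["in_reply_to_status_id"].notna()

n_retweets = int(is_retweet.sum())
n_replies = int(is_reply.sum())

# Original tweets are the rows that are neither retweets nor replies.
originals = sample[~is_retweet & ~is_reply]
print(n_retweets, n_replies, originals["tweet_id"].tolist())
```

On the real `dfarchive`, the same two masks should reproduce the 181 and 78 non-null counts reported by `info()`.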

##### Quality Issue #2: Checking Numerator and Denominator Columns

In [9]:
dfarchive.rating_denominator.describe()

count    2356.000000
mean       10.455433
std         6.745237
min         0.000000
25%        10.000000
50%        10.000000
75%        10.000000
max       170.000000
Name: rating_denominator, dtype: float64

In [10]:
#setting column width so that I can see all of the text
pd.set_option('display.max_colwidth', None)

In [11]:
#checking denominators less than 10

dfarchive.query('rating_denominator < 10').head()
#Index 313 is set to 960/0 but it's supposed to be 13/10
#Index 516 is set to 24/7 but there is actually no rating --> remove
#Index 2335 is set to 1/2 but should be 9/10

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259576.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",,,,,960,0,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,,,,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",24,7,Sam,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,,,,https://twitter.com/dog_rates/status/666287406224695296/photo/1,1,2,an,,,,


In [12]:
#I visually QC'ed denominators greater than 10 in Microsoft Excel, but here are the text extractions 
dfarchive.query('rating_denominator > 10').text

#11/15 is actually the date, not the rating
#4/20 is the date, not the rating
#20/16 may or may not be a real rating, unclear. 


342                                                                                                               @docmisterio account started on 11/15/15
433                                                    The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
784           RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…
902                                                                         Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
1068          After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
1120                             Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
1165                                                                  

In [13]:
dfarchive.rating_numerator.describe()

count    2356.000000
mean       13.126486
std        45.876648
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

In [14]:
#checking to see if the rating numerators of zero are real (they are real)
dfarchive.query('rating_numerator < 1').text

315           When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag
1016    PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment https://t.co/c7WXaB2nqX
Name: text, dtype: object

In [15]:
#I visually QC'ed numerators less than 10 in Microsoft Excel, but here are the text extractions 
dfarchive.query('rating_numerator < 10').text
#45 should be 13.5 not 5


45                                This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948
229     This is Jerry. He's doing a distinguished tongue slip. Slightly patronizing tbh. You think you're better than us, Jerry? 6/10 hold me back https://t.co/DkOBbwulw1
315                                                      When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag
387                                                                                                  I was going to do 007/10, but the joke wasn't worth the &lt;10 rating
462                           RT @dog_rates: Meet Herschel. He's slightly bigger than ur average pupper. Looks lonely. Could probably ride 7/10 would totally pet https:/…
                                                                                       ...                                                       
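
Several of the wrong ratings above come from the parser grabbing the first `number/number` pattern or dropping decimals. One possible fix, sketched here under the assumption that the real rating is usually the *last* such pattern in the tweet, is to re-extract ratings from the text with a regex that allows decimals. The sample texts are shortened from the outputs above; `last_rating` is a hypothetical helper, and this heuristic will not catch tweets with no rating at all (such as the 24/7 one).

```python
import pandas as pd

# Shortened versions of tweets shown above.
texts = pd.Series([
    "This is Bella. She hopes her smile made you smile. 13.5/10 https://t.co/qjrljjt948",
    "This is an Albanian 3 1/2 legged Episcopalian. 9/10 https://t.co/d9NcXFKwLv",
    "She was the last surviving 9/11 search dog, and our second ever 14/10. RIP",
])

# Numerator may carry a decimal part; denominator is a whole number.
pattern = r"(\d+(?:\.\d+)?)/(\d+)"

def last_rating(matches):
    """Return the last (numerator, denominator) pair as floats, or NaNs if none."""
    if not matches:
        return (float("nan"), float("nan"))
    num, den = matches[-1]
    return (float(num), float(den))

# findall returns every (numerator, denominator) pair found in each tweet.
parsed = texts.str.findall(pattern).apply(last_rating)
ratings = pd.DataFrame(parsed.tolist(),
                       columns=["numerator", "denominator"],
                       index=texts.index)
print(ratings)
```

Taking the last match is only a heuristic; anything it changes on the real data would still deserve a visual check against the tweet text.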

In [16]:
#I visually QC'ed numerators greater than 14 in Microsoft Excel, but here are the text extractions 
dfarchive.query('rating_numerator > 14').text
#1712 should be 11 not 26

55                                                                                    @roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s
188                                                                                        @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
189                                         @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
285                                                                               RT @KibaDva: I collected all the good dogs!! 15/10 @dog_rates #GoodDogs https://t.co/6UCGFczlOI
290                                                                                                                                                            @markhoppus 182/10
291                                                                                                           

##### Quality Issue #3: Blank or incorrect Dog Names

In [17]:
namelist = dfarchive.name.value_counts()

In [18]:
pd.set_option('display.max_rows', None)

namelist

#745 dogs have no name (listed as "None"), and 55 have a name called "a"

None              745
a                  55
Charlie            12
Oliver             11
Cooper             11
Lucy               11
Tucker             10
Lola               10
Penny              10
Bo                  9
Winston             9
Sadie               8
the                 8
an                  7
Bailey              7
Buddy               7
Toby                7
Daisy               7
Scout               6
Milo                6
Leo                 6
Rusty               6
Jack                6
Bella               6
Oscar               6
Stanley             6
Koda                6
Dave                6
Jax                 6
very                5
Gus                 5
George              5
Phil                5
Bentley             5
Alfie               5
Sunny               5
Louis               5
Oakley              5
Chester             5
Sammy               5
Finn                5
Larry               5
Walter              4
Winnie              4
Derek               4
Maggie    

In [19]:
dfarchive[dfarchive.name == 'a']
#a large number of these are not dogs. 
#Perhaps we should delete animals that aren't dogs using the twitter image archive,
# then come back and see if these dogs still exist in the dataframe

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
56,881536004380872706,,,2017-07-02 15:32:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF,,,,https://twitter.com/dog_rates/status/881536004380872706/video/1,14,10,a,,,pupper,
649,792913359805018113,,,2016-10-31 02:17:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq,,,,"https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1",13,10,a,,,,
801,772581559778025472,,,2016-09-04 23:46:12 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn,,,,"https://twitter.com/dog_rates/status/772581559778025472/photo/1,https://twitter.com/dog_rates/status/772581559778025472/photo/1,https://twitter.com/dog_rates/status/772581559778025472/photo/1",10,10,a,,,,
1002,747885874273214464,,,2016-06-28 20:14:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW,,,,"https://twitter.com/dog_rates/status/747885874273214464/photo/1,https://twitter.com/dog_rates/status/747885874273214464/photo/1",8,10,a,,,,
1004,747816857231626240,,,2016-06-28 15:40:07 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R,,,,https://twitter.com/dog_rates/status/747816857231626240/photo/1,4,10,a,,,,
1017,746872823977771008,,,2016-06-26 01:08:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2,,,,"https://twitter.com/dog_rates/status/746872823977771008/photo/1,https://twitter.com/dog_rates/status/746872823977771008/photo/1",11,10,a,,,,
1049,743222593470234624,,,2016-06-15 23:24:09 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a very rare Great Alaskan Bush Pupper. Hard to stumble upon without spooking. 12/10 would pet passionately https://t.co/xOBKCdpzaa,,,,https://twitter.com/dog_rates/status/743222593470234624/photo/1,12,10,a,,,pupper,
1193,717537687239008257,,,2016-04-06 02:21:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",People please. This is a Deadly Mediterranean Plop T-Rex. We only rate dogs. Only send in dogs. Thanks you... 11/10 https://t.co/2ATDsgHD4n,,,,https://twitter.com/dog_rates/status/717537687239008257/photo/1,11,10,a,,,,
1207,715733265223708672,,,2016-04-01 02:51:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a taco. We only rate dogs. Please only send in dogs. Dogs are what we rate. Not tacos. Thank you... 10/10 https://t.co/cxl6xGY8B9,,,,https://twitter.com/dog_rates/status/715733265223708672/photo/1,10,10,a,,,,
1340,704859558691414016,,,2016-03-02 02:43:09 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a heartbreaking scene of an incredible pupper being laid to rest. 10/10 RIP pupper https://t.co/81mvJ0rGRu,,,,https://twitter.com/dog_rates/status/704859558691414016/photo/1,10,10,a,,,pupper,
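
Besides "a", the value counts also show "the", "an" and "very", which suggests a pattern: the bogus names are all lowercase. A rough sketch of flagging them programmatically follows, assuming (and this is only an assumption) that every real dog name in the archive starts with a capital letter. The `names` series is a made-up sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing real names, the "None" placeholder, and parsing artifacts.
names = pd.Series(["Phineas", "a", "None", "the", "Tilly", "an", "very"])

# "None" is a string placeholder, not a missing value, so it must be matched explicitly.
placeholder = names.eq("None")

# Assumption: an all-lowercase token is a word picked up by mistake, not a name.
lowercase = names.str.islower()

# Replace both kinds of invalid names with NaN.
cleaned = names.mask(placeholder | lowercase, np.nan)
print(cleaned)
```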


In [20]:
dfarchive[dfarchive.name == 'None']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,,,,https://twitter.com/dog_rates/status/891087950875897856/photo/1,13,10,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq,,,,"https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1",13,10,,,,,
12,889665388333682689,,,2017-07-25 01:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm,,,,https://twitter.com/dog_rates/status/889665388333682689/photo/1,13,10,,,,,puppo
24,887343217045368832,,,2017-07-18 16:08:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV,,,,https://twitter.com/dog_rates/status/887343217045368832/video/1,13,10,,,,,
25,887101392804085760,,,2017-07-18 00:07:08 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp,,,,https://twitter.com/dog_rates/status/887101392804085760/photo/1,12,10,,,,,
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution,,,,,12,10,,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,"https://twitter.com/dog_rates/status/886053434075471873,https://twitter.com/dog_rates/status/886053434075471873",12,10,,,,,
35,885518971528720385,,,2017-07-13 15:19:09 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk,,,,https://twitter.com/4bonds2carbon/status/885517367337512960,14,10,,,,,
37,885167619883638784,,,2017-07-12 16:03:00 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a corgi undercover as a malamute. Pawbably doing important investigative work. Zero control over tongue happenings. 13/10 https://t.co/44ItaMubBf,,,,"https://twitter.com/dog_rates/status/885167619883638784/photo/1,https://twitter.com/dog_rates/status/885167619883638784/photo/1,https://twitter.com/dog_rates/status/885167619883638784/photo/1,https://twitter.com/dog_rates/status/885167619883638784/photo/1",13,10,,,,,
41,884441805382717440,,,2017-07-10 15:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","I present to you, Pup in Hat. Pup in Hat is great for all occasions. Extremely versatile. Compact as h*ck. 14/10 (IG: itselizabethgales) https://t.co/vvBOcC2VdC",,,,https://twitter.com/dog_rates/status/884441805382717440/photo/1,14,10,,,,,


##### Tidiness Issue #1: Doggo, Floofer, Pupper and Puppo are untidy columns that should be combined

In [21]:
dfarchive.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [22]:
dfarchive.doggo

0        None
1        None
2        None
3        None
4        None
5        None
6        None
7        None
8        None
9       doggo
10       None
11       None
12       None
13       None
14       None
15       None
16       None
17       None
18       None
19       None
20       None
21       None
22       None
23       None
24       None
25       None
26       None
27       None
28       None
29       None
30       None
31       None
32       None
33       None
34       None
35       None
36       None
37       None
38       None
39       None
40       None
41       None
42       None
43      doggo
44       None
45       None
46       None
47       None
48       None
49       None
50       None
51       None
52       None
53       None
54       None
55       None
56       None
57       None
58       None
59       None
60       None
61       None
62       None
63       None
64       None
65       None
66       None
67       None
68       None
69       None
70       None
71    

In [23]:
dfarchive.floofer

0          None
1          None
2          None
3          None
4          None
5          None
6          None
7          None
8          None
9          None
10         None
11         None
12         None
13         None
14         None
15         None
16         None
17         None
18         None
19         None
20         None
21         None
22         None
23         None
24         None
25         None
26         None
27         None
28         None
29         None
30         None
31         None
32         None
33         None
34         None
35         None
36         None
37         None
38         None
39         None
40         None
41         None
42         None
43         None
44         None
45         None
46      floofer
47         None
48         None
49         None
50         None
51         None
52         None
53         None
54         None
55         None
56         None
57         None
58         None
59         None
60         None
61         None
62      

In [24]:
dfarchive.pupper

0         None
1         None
2         None
3         None
4         None
5         None
6         None
7         None
8         None
9         None
10        None
11        None
12        None
13        None
14        None
15        None
16        None
17        None
18        None
19        None
20        None
21        None
22        None
23        None
24        None
25        None
26        None
27        None
28        None
29      pupper
30        None
31        None
32        None
33        None
34        None
35        None
36        None
37        None
38        None
39        None
40        None
41        None
42        None
43        None
44        None
45        None
46        None
47        None
48        None
49      pupper
50        None
51        None
52        None
53        None
54        None
55        None
56      pupper
57        None
58        None
59        None
60        None
61        None
62        None
63        None
64        None
65        None
66        

In [25]:
dfarchive.puppo

0        None
1        None
2        None
3        None
4        None
5        None
6        None
7        None
8        None
9        None
10       None
11       None
12      puppo
13       None
14      puppo
15       None
16       None
17       None
18       None
19       None
20       None
21       None
22       None
23       None
24       None
25       None
26       None
27       None
28       None
29       None
30       None
31       None
32       None
33       None
34       None
35       None
36       None
37       None
38       None
39       None
40       None
41       None
42       None
43       None
44       None
45       None
46       None
47       None
48       None
49       None
50       None
51       None
52       None
53       None
54       None
55       None
56       None
57       None
58       None
59       None
60       None
61       None
62       None
63       None
64       None
65       None
66       None
67       None
68       None
69       None
70       None
71    
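
The four columns above each hold either the stage name or the string "None", so they can be collapsed into one column. Here is one way that might look, on a made-up frame; the `stage` column name and the `combine_stages` helper are hypothetical, and joining multiple stages with a comma is my own choice for rows that mention more than one stage.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical sample in the same layout as the archive's stage columns.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo": ["None", "doggo", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "None", "pupper", "pupper"],
    "puppo": ["None", "None", "None", "None"],
})

def combine_stages(row):
    """Collect the stages present in a row; return None when there are none."""
    present = [s for s in stages if row[s] == s]
    return ",".join(present) if present else None

# Build the single stage column, then drop the four original columns.
sample["stage"] = sample[stages].apply(combine_stages, axis=1)
tidy = sample.drop(columns=stages)
print(tidy)
```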

#### Checking Image DataFrame

In [26]:
dfimage.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [27]:
dfimage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


##### Quality Issue #5: Identify images with false values 

In [28]:
#Here we see that the values for p1_dog are either True or False, no erroneous values here
dfimage.p1_dog.unique()

array([ True, False])

In [29]:
#p1_dog flags whether the top prediction is a dog breed, not whether the guess is correct. ~74% of top predictions are dog breeds
dfimage.p1_dog.sum()/dfimage.p1_dog.count()

0.7383132530120482

In [30]:
#There are only 324 entries in which all three guesses are False
dfimage.query('p1_dog == False and p2_dog == False and p3_dog == False').tweet_id.count()

324

In [31]:
dfimage.query('p1_dog == False and p2_dog == False and p3_dog == False')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.0458854,False,terrapin,0.0178853,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False,cock,0.0339194,False,partridge,5.20658e-05,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False,desk,0.0855474,False,bookcase,0.0794797,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False,otter,0.01525,False,great_grey_owl,0.0132072,False
25,666362758909284353,https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg,1,guinea_pig,0.996496,False,skunk,0.00240245,False,hamster,0.000460863,False
29,666411507551481857,https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg,1,coho,0.40464,False,barracouta,0.271485,False,gar,0.189945,False
45,666786068205871104,https://pbs.twimg.com/media/CUDmZIkWcAAIPPe.jpg,1,snail,0.999888,False,slug,5.51417e-05,False,acorn,2.6258e-05,False
50,666837028449972224,https://pbs.twimg.com/media/CUEUva1WsAA2jPb.jpg,1,triceratops,0.442113,False,armadillo,0.114071,False,common_iguana,0.0432553,False
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,swab,0.589446,False,chain_saw,0.190142,False,wig,0.0345097,False
53,667012601033924608,https://pbs.twimg.com/media/CUG0bC0U8AAw2su.jpg,1,hyena,0.98723,False,African_hunting_dog,0.0126008,False,coyote,5.73501e-05,False


#### Checking Twitter API DataFrame

In [32]:
dftweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2328 entries, 0 to 2327
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   2328 non-null   int64 
 1   timestamp  2328 non-null   object
 2   retweets   2328 non-null   int64 
 3   favorites  2328 non-null   int64 
 4   followers  2328 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 91.1+ KB


### Quality issues
1. Remove retweets and replies

2. Remove the retweet and reply columns, which are no longer needed once those rows are dropped

3. Remove invalid numerator and denominator values.

4. Change numerator and denominator of ratings into a single float value. 

5. Calculate the outer fences for the dog ratings and then remove rating values that are major outliers. 

6. Replace dog names listed as "None" with NaN

7. Remove rows from the image dataframe where none of the three predictions is a dog breed. 

7.5 Remove images that are not dogs? 

8. Remove erroneous dog descriptor columns from dfarchive
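
Items 4 and 5 can be sketched together. The example below assumes the usual Tukey definition of outer fences (Q1 - 3·IQR and Q3 + 3·IQR) and uses a made-up set of ratings; the `rating` column name is hypothetical. Rows with a zero denominator would need to be fixed or removed before the division.

```python
import pandas as pd

# Hypothetical ratings; 420/10 and 1776/10 stand in for the joke ratings in the archive.
sample = pd.DataFrame({
    "rating_numerator": [12, 13, 11, 10, 12, 13, 14, 9, 420, 1776],
    "rating_denominator": [10] * 10,
})

# Item 4: collapse numerator and denominator into a single float.
sample["rating"] = sample["rating_numerator"] / sample["rating_denominator"]

# Item 5: outer fences sit three interquartile ranges beyond the quartiles.
q1 = sample["rating"].quantile(0.25)
q3 = sample["rating"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 3 * iqr
upper_fence = q3 + 3 * iqr

# Keep only ratings inside the outer fences (major outliers are dropped).
kept = sample[sample["rating"].between(lower_fence, upper_fence)]
print(lower_fence, upper_fence, len(kept))
```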

### Tidiness issues
1. Values as column headers: doggo, floofer, pupper, and puppo are values of a single dog-stage variable. 

2. Combine the tables into one master dataset

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

### Quality Issue #1:

#### Define: Remove retweets and replies

There are 181 retweets and 78 replies, so I will drop every row with a non-null value in "retweeted_status_id" or "in_reply_to_status_id"

In [33]:
dfarchive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

#### Code

In [34]:
#keeping only the rows where retweeted_status_id and in_reply_to_status_id are null
dfarchive = dfarchive[dfarchive.retweeted_status_id.isnull()]
dfarchive = dfarchive[dfarchive.in_reply_to_status_id.isnull()]
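
The two filters above can also be written as a single boolean mask. The sketch below runs on a small made-up frame (the `toy` data is an assumption for illustration, not rows from the archive) and keeps only tweets that are neither a retweet nor a reply.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfarchive: one original tweet, one retweet, one reply.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 111.0, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 222.0],
})

# Keep rows that are neither a retweet nor a reply.
is_original = toy['retweeted_status_id'].isna() & toy['in_reply_to_status_id'].isna()
originals = toy[is_original].reset_index(drop=True)

print(originals['tweet_id'].tolist())
```

Resetting the index is optional; it avoids the gaps in the row labels that the filtered dataframe otherwise keeps.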

#### Test

In [35]:
#checking that the dataframe now has only null values in retweeted_status_id and in_reply_to_status_id
dfarchive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2097 non-null   int64  
 1   in_reply_to_status_id       0 non-null      float64
 2   in_reply_to_user_id         0 non-null      float64
 3   timestamp                   2097 non-null   object 
 4   source                      2097 non-null   object 
 5   text                        2097 non-null   object 
 6   retweeted_status_id         0 non-null      float64
 7   retweeted_status_user_id    0 non-null      float64
 8   retweeted_status_timestamp  0 non-null      object 
 9   expanded_urls               2094 non-null   object 
 10  rating_numerator            2097 non-null   int64  
 11  rating_denominator          2097 non-null   int64  
 12  name                        2097 non-null   object 
 13  doggo                       2097 

### Quality Issue #2:

#### Define: Remove erroneous columns from dataframe

In [36]:
#dropping all five retweet and reply columns in one call
dfarchive = dfarchive.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id',
                                    'retweeted_status_id', 'retweeted_status_user_id',
                                    'retweeted_status_timestamp'])

In [37]:
dfarchive.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo'],
      dtype='object')

#### Code

#### Test

### Quality Issue #3:

#### Define: Remove erroneous numerators and denominators
While assessing the data I determined that only a small handful (3) of ratings were extracted incorrectly. So, instead of re-extracting the ratings, I will simply remove the erroneous ones. 
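
For comparison, re-extracting a rating from the tweet text might look like the sketch below. The pattern and the helper name `extract_rating` are assumptions for illustration, not the approach this notebook takes. The pattern allows a decimal numerator, and taking the last `number/number` pair in the text is a guess about where the rating usually sits.

```python
import re

# Hypothetical pattern: optional decimal numerator, integer denominator.
RATING_RE = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) from the last match, or None."""
    matches = RATING_RE.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    return float(numerator), int(denominator)

print(extract_rating("He's a mystical boy. 13/10 https://t.co/MgUWQ76dJU"))
print(extract_rating("No rating in this text"))
```

With only a few bad rows, dropping them is less work than this, which is the trade-off made here.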


In [38]:
#I will drop rows with the following denominators: 0, 2, 7, 15, 20, 16 
#I'll remove numerators equal to 5 and 26 

#### Code

In [39]:
#First checking the unique numerator values
print(dfarchive.rating_numerator.unique())

[  13   12   14    5   11    6   10    0   84   24   75   27    3    7
    8    9    4  165 1776  204   50   99   80   45   60   44  121   26
    2  144   88    1  420]


In [40]:
nums = [5,26]
for num in nums:
    dfarchive = dfarchive[dfarchive['rating_numerator'] != num]

#### Test

In [41]:
#Number of unique numerators should be down by 2 (33 -> 31)
print(dfarchive.rating_numerator.nunique())

31


#### Code

In [42]:
#Now checking number of denominators
print(dfarchive.rating_denominator.nunique())

14


In [43]:
denoms = [0,2,7,15,20,16]

for denom in denoms:
    dfarchive = dfarchive[dfarchive['rating_denominator'] != denom]
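
The loops in this section filter one value at a time. The same result can be had in one step with `Series.isin`, as in this sketch on made-up values (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical ratings; denominators 0 and 7 are on the drop list below.
toy = pd.DataFrame({
    'rating_numerator': [13, 24, 12, 9],
    'rating_denominator': [10, 7, 10, 0],
})

bad_denominators = [0, 2, 7, 15, 20, 16]

# ~isin keeps every row whose denominator is not on the list.
kept = toy[~toy['rating_denominator'].isin(bad_denominators)]

print(kept['rating_denominator'].unique())
```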

#### Test

In [44]:
#Number of unique denominators drops by however many of the six listed values were still present (here 14 -> 11)
print(dfarchive.rating_denominator.nunique())

11


### Quality Issue #4:

#### Define: Combine numerator and denominator into a single float rating
Now that I've removed erroneous rating values, I will combine them into one float rating value so that the ratings can be ranked


#### Code

In [45]:
dfarchive['rating'] = (dfarchive.rating_numerator.astype(float) / dfarchive.rating_denominator.astype(float))

#### Test

In [46]:
dfarchive.rating.describe()

count    2059.000000
mean        1.179337
std         4.000923
min         0.000000
25%         1.000000
50%         1.100000
75%         1.200000
max       177.600000
Name: rating, dtype: float64

### Quality Issue #5:

#### Define: Remove major outliers from ratings
I have removed all invalid ratings, but some very high ratings remain that would skew the data, and I have been advised to remove all major outliers. I will therefore calculate the major-outlier cutoff and remove ratings at or above it


#### Code

In [47]:
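#A major outlier lies more than 3x the interquartile range (Q3 - Q1) above the third quartile

#the line below uses 1.2 - 1.1, which is Q3 minus the median; with Q3 - Q1 (1.2 - 1.0)
#the spread would be 0.6 and the cutoff 1.8 rather than 1.5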
(1.2 - 1.1)*3


0.2999999999999996

In [48]:
#calculating the appropriate maximum for the dataset
1.2 + .3

1.5

In [49]:
dfarchive = dfarchive[dfarchive.rating < 1.5]
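
The conventional outer fence is Q3 + 3 * IQR, with IQR = Q3 - Q1. Below is a minimal sketch of computing it from the data rather than typing the quartiles by hand. The `ratings` values are made up, and `upper_outer_fence` is a hypothetical helper name.

```python
import pandas as pd

def upper_outer_fence(values):
    """Return Q3 + 3 * IQR, where IQR = Q3 - Q1."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + 3 * (q3 - q1)

# Hypothetical ratings with one extreme value.
ratings = pd.Series([1.0, 1.0, 1.1, 1.2, 1.2, 177.6])

fence = upper_outer_fence(ratings)
trimmed = ratings[ratings <= fence]
print(fence, len(trimmed))
```

With the quartiles reported by `describe()` for the real ratings (Q1 = 1.0, Q3 = 1.2), this formula gives 1.2 + 3 * 0.2 = 1.8.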

#### Test

In [50]:
dfarchive.rating.describe()

count    2055.000000
mean        1.069808
std         0.204276
min         0.000000
25%         1.000000
50%         1.100000
75%         1.200000
max         1.400000
Name: rating, dtype: float64

### Quality Issue #6

#### Define: Replace dogs named "None" with NaN

#### Code

In [51]:
dfarchive.name.value_counts().head(5)

None       581
a           54
Charlie     11
Lucy        11
Cooper      10
Name: name, dtype: int64

In [52]:
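#assigning the result back instead of using inplace=True on a column accessed from the dataframe
dfarchive['name'] = dfarchive['name'].replace('None', np.nan)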

#### Test

In [53]:
dfarchive.name.value_counts().head(5)

a          54
Charlie    11
Lucy       11
Cooper     10
Oliver     10
Name: name, dtype: int64
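
The counts above still show `a` as the most common name, which suggests the name extraction also picked up ordinary words. One possible follow-up, sketched on made-up values, is to treat entries that do not start with a capital letter as missing. The rule itself is an assumption and might discard a real name.

```python
import numpy as np
import pandas as pd

# Hypothetical name column after 'None' has been replaced by NaN.
names = pd.Series(['Charlie', 'a', np.nan, 'just', 'Lucy'])

# na=False makes missing entries fail the test instead of propagating NaN.
looks_like_name = names.str.match(r'[A-Z]', na=False)
cleaned = names.where(looks_like_name)

print(cleaned.tolist())
```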

### Quality Issue #7

#### Define: Remove rows from the image dataframe where none of the three predictions is a dog

#### Code

In [54]:
#this is the number of values we want in our dataframe after the cleaning step:
dfimage.tweet_id.count() - dfimage.query('p1_dog == False and p2_dog == False and p3_dog == False').tweet_id.count()

1751

In [55]:
dfimage = dfimage.query('p1_dog == True or p2_dog == True or p3_dog == True ')

#### Test

In [56]:
dfimage.tweet_id.count()

1751
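
The `query` string above keeps rows where at least one prediction is a dog. An equivalent mask, sketched on a made-up frame, uses `any(axis=1)` over the three flag columns:

```python
import pandas as pd

# Hypothetical predictions: the middle row has no dog among its three guesses.
toy = pd.DataFrame({
    'tweet_id': [10, 11, 12],
    'p1_dog': [True, False, False],
    'p2_dog': [True, False, True],
    'p3_dog': [False, False, True],
})

flag_cols = ['p1_dog', 'p2_dog', 'p3_dog']

# True where at least one of the three predictions is flagged as a dog.
has_dog = toy[flag_cols].any(axis=1)
dogs_only = toy[has_dog]

print(dogs_only['tweet_id'].tolist())
```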

### Quality Issue #7.5

#### Define: Remove rows for images that are not dogs

#### Code

In [57]:
#First I'll look at P1 = true values for non-dogs
dftrue = dfimage[dfimage.p1_dog == True]

In [58]:
dftrue.p1.unique()
#looks like every first guess flagged as a dog is a dog breed

array(['Welsh_springer_spaniel', 'redbone', 'German_shepherd',
       'Rhodesian_ridgeback', 'miniature_pinscher',
       'Bernese_mountain_dog', 'chow', 'miniature_poodle',
       'golden_retriever', 'Gordon_setter', 'Walker_hound', 'pug',
       'bloodhound', 'Lhasa', 'English_setter', 'Italian_greyhound',
       'Maltese_dog', 'malamute', 'soft-coated_wheaten_terrier',
       'Chihuahua', 'black-and-tan_coonhound', 'toy_terrier',
       'Blenheim_spaniel', 'Pembroke', 'Chesapeake_Bay_retriever',
       'curly-coated_retriever', 'dalmatian', 'Ibizan_hound',
       'Border_collie', 'Labrador_retriever', 'miniature_schnauzer',
       'Airedale', 'West_Highland_white_terrier', 'toy_poodle',
       'giant_schnauzer', 'vizsla', 'Rottweiler', 'Siberian_husky',
       'papillon', 'Saint_Bernard', 'Tibetan_terrier', 'borzoi', 'beagle',
       'Yorkshire_terrier', 'Pomeranian', 'kuvasz',
       'flat-coated_retriever', 'Norwegian_elkhound', 'standard_poodle',
       'Staffordshire_bullterrier

In [59]:
#Now I'll look at P2 = true values for non-dogs
dftrue = dfimage[dfimage.p2_dog == True]

In [60]:
dftrue.p2.unique()
#looks like all second guesses that were true were dogs

array(['collie', 'miniature_pinscher', 'malinois', 'redbone',
       'Rottweiler', 'English_springer', 'Tibetan_mastiff', 'komondor',
       'Yorkshire_terrier', 'English_foxhound', 'bull_mastiff',
       'German_shepherd', 'Shih-Tzu', 'Newfoundland', 'toy_terrier',
       'toy_poodle', 'Chesapeake_Bay_retriever', 'Siberian_husky',
       'Afghan_hound', 'bloodhound', 'papillon', 'cocker_spaniel', 'chow',
       'Irish_terrier', 'beagle', 'giant_schnauzer', 'Labrador_retriever',
       'Pembroke', 'Chihuahua', 'Weimaraner', 'Brittany_spaniel',
       'standard_schnauzer', 'vizsla', 'pug', 'Italian_greyhound',
       'Samoyed', 'Pomeranian', 'miniature_poodle', 'Lakeland_terrier',
       'Irish_setter', 'malamute', 'Border_collie', 'Leonberg',
       'French_bulldog', 'golden_retriever', 'standard_poodle', 'kuvasz',
       'Cardigan', 'silky_terrier', 'English_setter', 'Pekinese', 'boxer',
       'basset', 'Bedlington_terrier', 'Shetland_sheepdog', 'Lhasa',
       'groenendael', 'Austra

In [61]:
#Now I'll look at P3 = true values for non-dogs
dftrue = dfimage[dfimage.p3_dog == True]

In [62]:
dftrue.p3.unique()
#looks like all third guesses that were true were dogs

array(['Shetland_sheepdog', 'Rhodesian_ridgeback', 'bloodhound',
       'miniature_pinscher', 'Doberman', 'Greater_Swiss_Mountain_dog',
       'golden_retriever', 'soft-coated_wheaten_terrier',
       'Labrador_retriever', 'Pekinese', 'Ibizan_hound', 'French_bulldog',
       'malinois', 'Dandie_Dinmont', 'borzoi', 'basenji',
       'miniature_poodle', 'groenendael', 'Eskimo_dog', 'briard',
       'papillon', 'flat-coated_retriever', 'Chihuahua', 'Shih-Tzu',
       'Pomeranian', 'Saluki', 'Great_Pyrenees',
       'West_Highland_white_terrier', 'collie', 'toy_poodle', 'vizsla',
       'giant_schnauzer', 'kelpie', 'Brabancon_griffon',
       'standard_poodle', 'beagle', 'Irish_water_spaniel', 'bluetick',
       'Weimaraner', 'Chesapeake_Bay_retriever',
       'black-and-tan_coonhound', 'kuvasz', 'Staffordshire_bullterrier',
       'Yorkshire_terrier', 'Lakeland_terrier', 'cocker_spaniel',
       'Australian_terrier', 'Great_Dane', 'curly-coated_retriever',
       'schipperke', 'Newfoundla

#### Test

In [63]:
#No code needed here: after dropping the rows with no dog prediction, every prediction flagged as a dog is a dog breed.
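
Rows that remain can still have a non-dog first or second guess, so a later breed analysis needs one breed per image. Below is a sketch of one way to derive it, taking the highest-ranked prediction flagged as a dog. The `breed` column name and the toy rows are assumptions.

```python
import pandas as pd

# Hypothetical rows: the second one has a non-dog first guess.
toy = pd.DataFrame({
    'p1': ['redbone', 'bath_towel'], 'p1_dog': [True, False],
    'p2': ['beagle', 'pug'],         'p2_dog': [True, True],
    'p3': ['vizsla', 'chow'],        'p3_dog': [True, True],
})

# where(cond, other) keeps the value when cond is True and falls back to other,
# so the chain tries p1 first, then p2, then p3.
third = toy['p3'].where(toy['p3_dog'])
second = toy['p2'].where(toy['p2_dog'], third)
toy['breed'] = toy['p1'].where(toy['p1_dog'], second)

print(toy['breed'].tolist())
```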

##### Quality Issue #8 is taken care of after tidiness issue #1

### Tidiness Issue #1

#### Define: Combine the doggo, floofer, pupper, and puppo columns into one descriptor column

#### Code: 

In [64]:
#combining the descriptors into one column
dfarchive['dog_descriptor'] = dfarchive.doggo + dfarchive.floofer +dfarchive.pupper + dfarchive.puppo

In [65]:
#looking at the column, lots of 'None's are copied
dfarchive.dog_descriptor

0           NoneNoneNoneNone
1           NoneNoneNoneNone
2           NoneNoneNoneNone
3           NoneNoneNoneNone
4           NoneNoneNoneNone
5           NoneNoneNoneNone
6           NoneNoneNoneNone
7           NoneNoneNoneNone
8           NoneNoneNoneNone
9          doggoNoneNoneNone
10          NoneNoneNoneNone
11          NoneNoneNoneNone
12         NoneNoneNonepuppo
13          NoneNoneNoneNone
14         NoneNoneNonepuppo
15          NoneNoneNoneNone
16          NoneNoneNoneNone
17          NoneNoneNoneNone
18          NoneNoneNoneNone
20          NoneNoneNoneNone
21          NoneNoneNoneNone
22          NoneNoneNoneNone
23          NoneNoneNoneNone
24          NoneNoneNoneNone
25          NoneNoneNoneNone
26          NoneNoneNoneNone
27          NoneNoneNoneNone
28          NoneNoneNoneNone
29        NoneNonepupperNone
31          NoneNoneNoneNone
33          NoneNoneNoneNone
34          NoneNoneNoneNone
35          NoneNoneNoneNone
37          NoneNoneNoneNone
38          No

In [66]:
#we can use extract to remove the 'None's and grab the descriptors, but a few rows have more than one descriptor
dfarchive.dog_descriptor.unique()

array(['NoneNoneNoneNone', 'doggoNoneNoneNone', 'NoneNoneNonepuppo',
       'NoneNonepupperNone', 'NoneflooferNoneNone', 'doggoNoneNonepuppo',
       'doggoflooferNoneNone', 'doggoNonepupperNone'], dtype=object)

In [67]:
dfarchive.query('dog_descriptor == "doggoNonepupperNone"')

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating,dog_descriptor
460,817777686764523521,2017-01-07 16:59:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Dido. She's playing the lead role in ""Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple."" 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7",https://twitter.com/dog_rates/status/817777686764523521/video/1,13,10,Dido,doggo,,pupper,,1.3,doggoNonepupperNone
531,808106460588765185,2016-12-12 00:29:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho,https://twitter.com/dog_rates/status/808106460588765185/photo/1,12,10,,doggo,,pupper,,1.2,doggoNonepupperNone
575,801115127852503040,2016-11-22 17:28:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,"https://twitter.com/dog_rates/status/801115127852503040/photo/1,https://twitter.com/dog_rates/status/801115127852503040/photo/1",12,10,Bones,doggo,,pupper,,1.2,doggoNonepupperNone
705,785639753186217984,2016-10-11 00:34:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,"https://twitter.com/dog_rates/status/785639753186217984/photo/1,https://twitter.com/dog_rates/status/785639753186217984/photo/1",10,10,Pinot,doggo,,pupper,,1.0,doggoNonepupperNone
733,781308096455073793,2016-09-29 01:42:20 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>","Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u",https://vine.co/v/5rgu2Law2ut,12,10,,doggo,,pupper,,1.2,doggoNonepupperNone
889,759793422261743616,2016-07-31 16:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll","https://twitter.com/dog_rates/status/759793422261743616/photo/1,https://twitter.com/dog_rates/status/759793422261743616/photo/1",12,10,Maggie,doggo,,pupper,,1.2,doggoNonepupperNone
1063,741067306818797568,2016-06-10 00:39:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,https://twitter.com/dog_rates/status/741067306818797568/photo/1,12,10,just,doggo,,pupper,,1.2,doggoNonepupperNone
1113,733109485275860992,2016-05-19 01:38:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda",https://twitter.com/dog_rates/status/733109485275860992/photo/1,12,10,,doggo,,pupper,,1.2,doggoNonepupperNone


In [68]:
#for descriptors that have 'None' between them, we should just manually remove the 'None'.
#Otherwise the extraction will pull only the leading 'doggo' and not both descriptors
dfarchive['dog_descriptor'] = dfarchive['dog_descriptor'].replace({'doggoNoneNonepuppo': 'doggopuppo',
                                                                   'doggoNonepupperNone': 'doggopupper'})

In [69]:
# now we are extracting the descriptors, both individual and combined, leaving all the 'None's behind
# expand=False returns a Series, which is what a single column expects
dfarchive['dog_descriptor'] = dfarchive['dog_descriptor'].str.extract('(doggopuppo|doggopupper|doggofloofer|doggo|floofer|pupper|puppo)', expand=False)

In [70]:
dfarchive.dog_descriptor.unique()

array([nan, 'doggo', 'puppo', 'pupper', 'floofer', 'doggopuppo',
       'doggofloofer', 'doggopupper'], dtype=object)

In [71]:
#now, change the combined descriptors to be more readable
dfarchive.loc[dfarchive.dog_descriptor == 'doggopupper', 'dog_descriptor'] = 'doggo, pupper'
dfarchive.loc[dfarchive.dog_descriptor == 'doggopuppo', 'dog_descriptor'] = 'doggo, puppo'
dfarchive.loc[dfarchive.dog_descriptor == 'doggofloofer', 'dog_descriptor'] = 'doggo, floofer'

#### Test:

In [72]:
dfarchive.dog_descriptor.unique()

array([nan, 'doggo', 'puppo', 'pupper', 'floofer', 'doggo, puppo',
       'doggo, floofer', 'doggo, pupper'], dtype=object)
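
The same single column can be built without concatenating the literal string `None`. In this sketch (the toy rows are an assumption), each row's stage words are joined with a comma and rows without any stage become NaN.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows in the archive's layout, where a missing stage is the string 'None'.
toy = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def join_stages(row):
    """Join the stage words present in one row, skipping 'None'."""
    return ', '.join(value for value in row if value != 'None')

joined = toy[stage_cols].apply(join_stages, axis=1)
# Rows with no stage produce an empty string; where() turns those into NaN.
toy['dog_descriptor'] = joined.where(joined != '')

print(toy['dog_descriptor'].tolist())
```

This handles any combination of stages without listing each combined value by hand.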

### Quality Issue #8

#### Define: Remove erroneous dog descriptor columns from table

#### Code:

In [73]:
#dropping all four stage columns in one call
dfarchive = dfarchive.drop(columns=['doggo', 'floofer', 'puppo', 'pupper'])

#### Test:

In [74]:
dfarchive.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'rating',
       'dog_descriptor'],
      dtype='object')

### Tidiness Issue #2

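This is where I need the most help. Many tweet IDs in the other two dataframes are missing from the image predictions (image_predictions.tsv, downloaded with the Requests library). A set difference only lists the missing IDs, though, so it cannot show that no IDs match; rows dropped from dfimage in Quality Issue #7 and tweets that never had an image prediction both end up in that list. 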


#### Define: Combining Tables into one table

In [75]:
dfarchive.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'rating',
       'dog_descriptor'],
      dtype='object')

In [76]:
dfimage.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

In [77]:
dftweet.columns

Index(['tweet_id', 'timestamp', 'retweets', 'favorites', 'followers'], dtype='object')

In [78]:
#checking which tweet ids in the original archive are missing from the tweet API data. 
#only 8 ids are missing
set(dfarchive.tweet_id) - set(dftweet.tweet_id)

{680055455951884288,
 754011816964026368,
 759923798737051648,
 779123168116150273,
 829374341691346946,
 837366284874571778,
 844704788403113984,
 872261713294495745}

In [79]:
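#checking which tweet ids in the original archive are missing from the image predictions. 
#a non-empty result only shows that some ids are missing, not that none match;
#ids such as 666362758909284353 are listed because their rows were dropped from dfimage in Quality Issue #7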
set(dfarchive.tweet_id) - set(dfimage.tweet_id)

{666051853826850816,
 666104133288665088,
 666268910803644416,
 666293911632134144,
 666362758909284353,
 666411507551481857,
 666786068205871104,
 666837028449972224,
 666983947667116034,
 667012601033924608,
 667065535570550784,
 667188689915760640,
 667369227918143488,
 667437278097252352,
 667443425659232256,
 667549055577362432,
 667724302356258817,
 667766675769573376,
 667782464991965184,
 667866724293877760,
 667873844930215936,
 667937095915278337,
 668142349051129856,
 668154635664932864,
 668226093875376128,
 668291999406125056,
 668466899341221888,
 668544745690562560,
 668587383441514497,
 668614819948453888,
 668620235289837568,
 668643542311546881,
 668645506898350081,
 668981893510119424,
 668988183816871936,
 668992363537309700,
 669015743032369152,
 669214165781868544,
 669351434509529089,
 669571471778410496,
 669583744538451968,
 669682095984410625,
 669749430875258880,
 669923323644657664,
 669972011175813120,
 670055038660800512,
 670079681849372674,
 670361874861

In [80]:
#checking which tweet ids from the tweet API data are missing from the image predictions. 
#again, this lists only the ids absent from dfimage, so it does not show that no values match
#the intersection of the two id sets would show how many ids the tables actually share
set(dftweet.tweet_id) - set(dfimage.tweet_id)

{666051853826850816,
 666104133288665088,
 666268910803644416,
 666293911632134144,
 666362758909284353,
 666411507551481857,
 666786068205871104,
 666837028449972224,
 666983947667116034,
 667012601033924608,
 667065535570550784,
 667070482143944705,
 667188689915760640,
 667369227918143488,
 667437278097252352,
 667443425659232256,
 667549055577362432,
 667550882905632768,
 667724302356258817,
 667766675769573376,
 667782464991965184,
 667866724293877760,
 667873844930215936,
 667911425562669056,
 667937095915278337,
 668142349051129856,
 668154635664932864,
 668226093875376128,
 668291999406125056,
 668466899341221888,
 668544745690562560,
 668587383441514497,
 668614819948453888,
 668620235289837568,
 668643542311546881,
 668645506898350081,
 668967877119254528,
 668981893510119424,
 668988183816871936,
 668992363537309700,
 669015743032369152,
 669214165781868544,
 669351434509529089,
 669571471778410496,
 669583744538451968,
 669661792646373376,
 669682095984410625,
 669684865554

In [82]:
#checking data types
#it appears all IDs have the same datatype
type(dfarchive.tweet_id[1])

numpy.int64

In [83]:
type(dftweet.tweet_id[1])

numpy.int64

In [84]:
type(dfimage.tweet_id[1])

numpy.int64
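
A set difference only lists IDs missing from one side, so a long result does not by itself mean that nothing matches. The sketch below counts the overlap directly with `merge(..., indicator=True)`. The two small frames stand in for dfarchive and dfimage and are assumptions.

```python
import pandas as pd

# Hypothetical ID columns standing in for dfarchive and dfimage.
archive_ids = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
image_ids = pd.DataFrame({'tweet_id': [3, 4, 5]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
overlap = archive_ids.merge(image_ids, on='tweet_id', how='outer', indicator=True)

matched = int((overlap['_merge'] == 'both').sum())
archive_only = int((overlap['_merge'] == 'left_only').sum())
image_only = int((overlap['_merge'] == 'right_only').sum())
print(matched, archive_only, image_only)
```

If `matched` is well above zero on the real tables, the IDs are fine and an ordinary merge on `tweet_id` should work.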

In [85]:
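#proceeding with the combination of the archive data and the Twitter API data, since I didn't have issues with those files
#how='left' keeps every archive row; the 8 ids missing from the API data get NaN in the API columns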
df_comb = pd.merge(dfarchive,dftweet, how = 'left', on = ['tweet_id'])

In [86]:
df_comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2055 entries, 0 to 2054
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            2055 non-null   int64  
 1   timestamp_x         2055 non-null   object 
 2   source              2055 non-null   object 
 3   text                2055 non-null   object 
 4   expanded_urls       2052 non-null   object 
 5   rating_numerator    2055 non-null   int64  
 6   rating_denominator  2055 non-null   int64  
 7   name                1474 non-null   object 
 8   rating              2055 non-null   float64
 9   dog_descriptor      333 non-null    object 
 10  timestamp_y         2047 non-null   object 
 11  retweets            2047 non-null   float64
 12  favorites           2047 non-null   float64
 13  followers           2047 non-null   float64
dtypes: float64(4), int64(3), object(7)
memory usage: 240.8+ KB


In [89]:
df_comb.head()

Unnamed: 0,tweet_id,timestamp_x,source,text,expanded_urls,rating_numerator,rating_denominator,name,rating,dog_descriptor,timestamp_y,retweets,favorites,followers
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,1.3,,Tue Aug 01 16:23:56 +0000 2017,7168.0,34467.0,9058333.0
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,1.3,,Tue Aug 01 00:17:27 +0000 2017,5388.0,29883.0,9058333.0
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,1.2,,Mon Jul 31 00:18:03 +0000 2017,3553.0,22497.0,9058333.0
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,1.3,,Sun Jul 30 15:58:51 +0000 2017,7383.0,37667.0,9058333.0
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f","https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,1.2,,Sat Jul 29 16:00:24 +0000 2017,7926.0,35979.0,9058333.0
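
The merged frame carries two timestamp columns (`timestamp_x`, `timestamp_y`) because both inputs had a `timestamp` column. Below is a sketch of one way to avoid that and to bring in the image predictions as well. The toy frames are assumptions, and the inner join on the image table is a design choice that keeps only tweets with an image prediction.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned tables.
archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'timestamp': ['t1', 't2', 't3'],
                        'rating': [1.3, 1.2, 1.1]})
tweets = pd.DataFrame({'tweet_id': [1, 2],
                       'timestamp': ['t1', 't2'],
                       'retweets': [7168, 5388]})
images = pd.DataFrame({'tweet_id': [1, 3],
                       'p1': ['Chihuahua', 'pug']})

# Drop the duplicate timestamp before merging so no _x/_y suffixes appear.
master = (archive
          .merge(tweets.drop(columns='timestamp'), on='tweet_id', how='left')
          .merge(images, on='tweet_id', how='inner'))

print(master.columns.tolist())
```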


#### Code: 

#### Test:

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
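
Here is a minimal sketch of the save step, assuming the merged frame is the one to store. It writes to a temporary directory so the example is self-contained; in the notebook the call would be `to_csv('twitter_archive_master.csv', index=False)` on the master frame. Passing `index=False` keeps the row index out of the file, so it does not come back as an `Unnamed: 0` column when the file is read again.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row master frame.
master = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426],
                       'rating': [1.3, 1.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'twitter_archive_master.csv')
    # index=False avoids writing the row index as an extra column.
    master.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```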

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1. Most favorited and retweeted tweets? 
2. Relationship between followers and retweets?
3. Most common dog breeds? 
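
These questions map onto a few pandas calls. The sketch below uses made-up numbers (all values and the `breed` column are assumptions, not results from the dataset). In the `df_comb.head()` rows shown earlier, `followers` has the same value in every row, so a followers-versus-retweets correlation may not be informative; the sketch correlates retweets with favorites instead.

```python
import pandas as pd

# Hypothetical master rows.
master = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweets': [100, 400, 250, 50],
    'favorites': [900, 3000, 2100, 300],
    'breed': ['pug', 'golden_retriever', 'pug', 'Chihuahua'],
})

# Insight 1: the most favorited tweet.
top_tweet = master.nlargest(1, 'favorites')['tweet_id'].iloc[0]

# Insight 2: linear correlation between retweets and favorites.
correlation = master['retweets'].corr(master['favorites'])

# Insight 3: breed frequencies, most common first.
breed_counts = master['breed'].value_counts()

print(top_tweet, round(correlation, 3), breed_counts.index[0])
```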

### Visualization