# We Rate Dogs Data Analysis 

This notebook analyzes tweets from the WeRateDogs Twitter account. The project is an exercise in thorough data wrangling; a second goal is to surface interesting findings and summarize them in a report.

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. 

## Table of Contents

1. [Introduction](#introduction)
2. [Data Wrangling](#data-wrangling)
    1. [Gathering data](#gathering-data)
    2. [Assessing data](#assessing-data)
    3. [Cleaning data](#cleaning-data) 
3. [Analysis and Visualization](#analysis-and-visualization)
4. [Reporting](#reporting)

## Introduction <a name="introduction"></a>

To get started, let's import our libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import time
import requests
import os.path
import re
%matplotlib inline

pd.set_option('display.max_rows', 2500)
pd.set_option('display.max_columns', 2000)

# None means "no limit"; passing -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

## Data Wrangling <a name="data-wrangling"></a>

### Gathering Data <a name="gathering-data"></a>
Read in the first data set: WeRateDogs Twitter archive provided by Udacity.

In [2]:
# read in twitter archive 
twitter_dogs_archive = pd.read_csv('twitter-archive-enhanced-2.csv')
twitter_dogs_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


Download and read in image predictions file provided by Udacity.

In [None]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# download file programmatically and stop early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# write the downloaded bytes; mode 'wb' creates the file if it does not exist
with open('image-predictions.tsv', 'wb') as file_image_predictions:
    file_image_predictions.write(response.content)
        

In [3]:
# load image predictions into data frame
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
image_predictions.head(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


### Twitter API request using tweepy 

In [None]:
# hide login details
with open('logins.json') as login_file:
    logins = json.load(login_file)

class ImproperlyConfigured(Exception):
    """Raised when a required login setting is missing."""

def get_secret(setting, logins=logins):
    """Get login setting or fail with ImproperlyConfigured"""
    try:
        return logins[setting]
    except KeyError:
        raise ImproperlyConfigured("Set the {} setting.".format(setting))

In [None]:
# retrieve Twitter login details 
consumer_key = get_secret('consumer_key')
consumer_secret = get_secret('consumer_secret')
access_token = get_secret('access_token')
access_secret = get_secret('access_secret')

In [None]:
# access Twitter API
import tweepy

# Redirect to Twitter to authorize
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# Get access token
auth.set_access_token(access_token, access_secret)

# api instance (written for tweepy 3.x; wait_on_rate_limit_notify was removed in tweepy 4.0)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


In [None]:
# get tweets from WeRateDogs Twitter timeline 
start = time.time()
error_ids = []
print("Start requesting WeRateDogs tweets.")
with open('tweet_json.txt', 'w', encoding='utf-8') as file:
    file.write("[\n")
    for index, tweet_id in enumerate(twitter_dogs_archive.tweet_id.values):
        ranking = index + 1 
        try:
            # Twitter API request using specific tweet_id
            status = api.get_status(tweet_id, tweet_mode='extended')
            status_json = json.dumps(status._json)

            # write json object
            file.write(status_json)
            if ranking < len(twitter_dogs_archive.tweet_id.values):
                file.write(",")
            file.write("\n")
            
            # This cell is slow so print ranking to gauge time remaining
            print(ranking, '-', tweet_id)

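        except tweepy.TweepError as e:
            # record ids of tweets that could not be retrieved
            error_ids.append(tweet_id)
            # e.response can be None for non-HTTP errors, so fall back to the exception itself
            print(e.response.text if e.response is not None else e)
    file.write("]")
end = time.time()
print("Process finished. Time elapsed: ", round((end-start) / 60, 2), "min." )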

In [4]:
tweets = []
with open('tweet_json.txt', 'r') as file:
    data = json.loads(file.read())
    for i in range(0, len(data)):
        record = {"id": data[i]["id"], "retweet_count": data[i]['retweet_count'], "favorite_count": data[i]["favorite_count"]}
        tweets.append(record)

tweets_df = pd.DataFrame(tweets)
tweets_df.head()


Unnamed: 0,favorite_count,id,retweet_count
0,37683,892420643555336193,8215
1,32373,892177421306343426,6076
2,24378,891815181378084864,4017
3,41004,891689557279858688,8370
4,39208,891327558926688256,9075


In [None]:
# save erroneous ids
errors_df= pd.DataFrame(error_ids)
errors_df.to_csv('errors.csv',index=False)

In [None]:
# load errors
errors_df = pd.read_csv('errors.csv')
errors_df.info()

The task for this step: query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.
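
The cell above writes the tweets as one JSON array, whereas the task asks for one JSON object per line. A minimal sketch of that line-per-tweet layout follows; the `sample_statuses` payloads are hypothetical stand-ins for `status._json`, and the file is written to a temporary directory so nothing real is overwritten.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the payloads returned by the API (status._json).
sample_statuses = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25},
    {"id": 2, "retweet_count": 3, "favorite_count": 7},
]

path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")

# Write one JSON object per line; no brackets or commas between records.
with open(path, "w", encoding="utf-8") as file:
    for status in sample_statuses:
        file.write(json.dumps(status) + "\n")

# Read the file back line by line and keep only the fields of interest.
records = []
with open(path, "r", encoding="utf-8") as file:
    for line in file:
        tweet = json.loads(line)
        records.append({
            "id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
```

Passing `records` to `pd.DataFrame` would give the same three columns as `tweets_df`. With this layout a failed request simply leaves no line behind, so the file stays parseable even when the last request fails.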

### Assessing Data <a name="assessing-data"></a>

#### Visual assessment

In [None]:
twitter_dogs_archive

In [None]:
tweets_df

In [None]:
image_predictions

#### Programmatic assessment

In [None]:
# Assess twitter dogs enhanced.
twitter_dogs_archive.info()

In [5]:
# show all retweets
retweets = twitter_dogs_archive[twitter_dogs_archive.retweeted_status_id.notna()]['retweeted_status_id'].values.astype(np.int64)
twitter_dogs_archive[twitter_dogs_archive.retweeted_status_id.notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Canela. She attempted some fancy porch pics. They were unsuccessful. 13/10 someone help her https://t.co/cLyzpcUcMX,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,"https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1",13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,"https://twitter.com/dog_rates/status/886053434075471873,https://twitter.com/dog_rates/status/886053434075471873",12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,"https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1",13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https://…,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,"https://twitter.com/dog_rates/status/878057613040115712/photo/1,https://twitter.com/dog_rates/status/878057613040115712/photo/1",14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps:/…",8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitter.com/dog_rates/status/878281511006478336/photo/1",13,10,Shadow,,,,
74,878316110768087041,,,2017-06-23 18:17:33 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Meet Terrance. He's being yelled at because he stapled the wrong stuff together. 11/10 hang in there Terrance https://t.co/i…,6.690004e+17,4196984000.0,2015-11-24 03:51:38 +0000,https://twitter.com/dog_rates/status/669000397445533696/photo/1,11,10,Terrance,,,,
78,877611172832227328,,,2017-06-21 19:36:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @rachel2195: @dog_rates the boyfriend and his soaking wet pupper h*cking love his new hat 14/10 https://t.co/dJx4Gzc50G,8.768508e+17,512804500.0,2017-06-19 17:14:49 +0000,"https://twitter.com/rachel2195/status/876850772322988033/photo/1,https://twitter.com/rachel2195/status/876850772322988033/photo/1,https://twitter.com/rachel2195/status/876850772322988033/photo/1,https://twitter.com/rachel2195/status/876850772322988033/photo/1",14,10,,,,pupper,
91,874434818259525634,,,2017-06-13 01:14:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Coco. At first I thought she was a cloud but clouds don't bork with such passion. 12/10 would hug softly https://t.c…,8.66335e+17,4196984000.0,2017-05-21 16:48:45 +0000,"https://twitter.com/dog_rates/status/866334964761202691/photo/1,https://twitter.com/dog_rates/status/866334964761202691/photo/1",12,10,Coco,,,,
95,873697596434513921,,,2017-06-11 00:25:14 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Walter. He won't start hydrotherapy without his favorite floatie. 14/10 keep it pup Walter https://t.co/r28jFx9uyF,8.688804e+17,4196984000.0,2017-05-28 17:23:24 +0000,"https://twitter.com/dog_rates/status/868880397819494401/photo/1,https://twitter.com/dog_rates/status/868880397819494401/photo/1",14,10,Walter,,,,
97,873337748698140672,,,2017-06-10 00:35:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps:/…,8.732138e+17,4196984000.0,2017-06-09 16:22:42 +0000,"https://www.gofundme.com/help-my-baby-sierra-get-better,https://twitter.com/dog_rates/status/873213775632977920/photo/1,https://twitter.com/dog_rates/status/873213775632977920/photo/1",12,10,Sierra,,,pupper,


In [None]:
# check if original tweets are in twitter archive
retweets_reduced = retweets
for retweet in retweets:
    if retweet in twitter_dogs_archive.tweet_id.values:
        index = np.argwhere(retweets_reduced == retweet)
        retweets_reduced = np.delete(retweets_reduced, index)
print(len(twitter_dogs_archive[twitter_dogs_archive.retweeted_status_id.notna()]) - len(retweets_reduced), "entries out of the retweets are contained in the twitter archive.\n")
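
One caveat about the check above: `retweeted_status_id` is displayed as a float (for example `8.87474e+17`), and a float64 cannot represent every 18-digit id exactly, so casting back to `np.int64` may yield an id that no longer matches any `tweet_id`. The sketch below illustrates the effect and one possible workaround, reading the id columns as strings. The two-row CSV is a hypothetical toy input; only the id itself is taken from the `expanded_urls` shown in the output above.

```python
import io

import pandas as pd

tweet_id = 887473957103951883   # id taken from the archive output above
as_float = float(tweet_id)      # what a float64 column stores
round_trip = int(as_float)      # what a cast back to int64 recovers

# Toy CSV with the same two columns; an empty field marks an original tweet.
csv_text = (
    "tweet_id,retweeted_status_id\n"
    "888202515573088257,887473957103951883\n"
    "887473957103951883,\n"
)

# Reading ids as strings keeps every digit; empty fields still become NaN.
archive = pd.read_csv(io.StringIO(csv_text), dtype=str)
retweet_ids = archive.loc[archive.retweeted_status_id.notna(), "retweeted_status_id"]

# True where the original of a retweet is itself part of the archive.
found = retweet_ids.isin(archive.tweet_id)
```

Here `round_trip` differs from `tweet_id`, while the string comparison in `found` still matches the original tweet.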

In [6]:
# Show all replies to assess if they are relevant for our research
replies = twitter_dogs_archive[twitter_dogs_archive.in_reply_to_status_id.notna()]['tweet_id'].values.astype(np.int64)
twitter_dogs_archive[twitter_dogs_archive.in_reply_to_status_id.notna()][['tweet_id','text']]

valid_replies = [863079547188785154, 856526610513747968, 847617282490613760, 802265048156610565, 786051337297522688, 
                 766714921925144576, 704871453724954624,675870721063669760, 675707330206547968, 669353438988365824]

In [None]:
# dogs_clean is only created in the cleaning section, so query the raw archive here
twitter_dogs_archive.query('tweet_id == 669353438988365824')

In [None]:
# assess names
twitter_dogs_archive.name.value_counts()

In [7]:
# after finding typical mistakes, I'm checking if there is a pattern to recover names  
determiners = ["a", "an", "the", "officially", "old", "just", "quite", "getting", "actually", "mad", "not", "by", "very", "one", "this", "life", "all", "None"]

# loop through the name column and print the tweet text whenever the name equals a determiner
for i, row in twitter_dogs_archive.iterrows():
    if row['name'] in determiners:
        print(i , "-", row['text'])

5 - Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh
7 - When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq
12 - Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm
24 - You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV
25 - This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp
30 - @NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution
32 - RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo
35 - I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk
37 - Here we have a corgi und

730 - Who keeps sending in pictures without dogs in them? This needs to stop. 5/10 for the mediocre road https://t.co/ELqelxWMrC
732 - Idk why this keeps happening. We only rate dogs. Not Bangladeshi Couch Chipmunks. Please only send dogs... 12/10 https://t.co/ya7bviQUUf
733 - Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u
735 - We normally don't rate lobsters, but this one appears to be a really good lobster. 10/10 would pet with caution https://t.co/YkHc7U7uUy
736 - I want to finally rate this iconic puppo who thinks the parade is all for him. 13/10 would absolutely attend https://t.co/5dUYOu4b8d
740 - Here's a perturbed super floof. 12/10 would snug so damn well https://t.co/VG095mi09Q
742 - RT @Patreon: Well. @dog_rates is on Patreon. 

12/10. 

https://t.co/rnKvzt6RJs https://t.co/v4e2ywe8iO
744 - We only rate dogs. Pls stop sending in non-canines like this Urban Floof Giraffe. I can't handle this. 11/10 https://t.co/zHIqpM5Gni
746 - Here's a doggo questioning his enti

1293 - Everybody stop what you're doing and watch this video. Frank is stuck in a loop. 13/10 (Vid by @klbmatty) https://t.co/5AJs8TIV1U
1295 - @serial @MrRoles OH MY GOD I listened to all of season 1 during a single road trip. I love you guys! I can confirm Bernie's 12/10 rating :)
1298 - When your roommate eats your leftover Chili's but you pretend it's no big deal cuz you fat anyway. 10/10 head up pup https://t.co/0nMgoue8IR
1299 - He's doing his best. 12/10 very impressive that he got his license in the first place  https://t.co/2vRmkkOLcN
1301 - We usually don't rate marshmallows but this one's having so much fun in the snow. 10/10 (vid by @kylejk24) https://t.co/NL2KwOioBh
1304 - "I shall trip the big pupper with leash. Big pupper will never see it coming. I am a genius." Both 11/10 https://t.co/uQsCJ8pf51
1306 - This dog just brutally murdered a snowman. Currently toying with its nutritious remains 9/10 would totally still pet https://t.co/iKThgKnW1j
1313 - Ever seen a dog pet a

1787 - Contortionist pup here. Inside pentagram. Clearly worships Satan. Known to slowly push fragile stuff off tables 6/10 https://t.co/EX9oR55VMe
1788 - Reckless pupper here. Not even looking at road. Absolute menace. No regard for fellow pupper lives. 10/10 still cute https://t.co/96IBkOYB7j
1789 - Not much to say here. I just think everyone needs to see this. 12/10 https://t.co/AGag0hFHpe
1791 - Downright inspiring 12/10 https://t.co/vSLtYBWHcQ
1792 - This dog gave up mid jump. 9/10 https://t.co/KmMv3Y2zI8
1797 - This is the happiest pupper I've ever seen. 10/10 would trade lives with https://t.co/ep8ATEJwRb
1799 - Here we see a Byzantine Rigatoni. Very aerodynamic. No eyes. Actually not windy here they just look like that. 9/10 https://t.co/gzI0m6wXRo
1801 - 10/10 I'd follow this dog into battle no questions asked https://t.co/ngTNXYQF0L
1804 - This pups goal was to get all four feet as close to each other as possible. Valiant effort 12/10 https://t.co/2mXALbgBTV
1805 - Who leaves

2212 - Never forget this vine. You will not stop watching for at least 15 minutes. This is the second coveted.. 13/10 https://t.co/roqIxCvEB3
2214 - It is an honor to rate this pup. He is a Snorklhuahua from Amarillo. A true renaissance dog. Also part Rudolph 10/10 https://t.co/ALNyYuGui7
2215 - There's a lot going on here but in my honest opinion every dog pictured is pretty fabulous. 10/10 for all. Good dogs https://t.co/VvYVbsi6c3
2218 - This is a Birmingham Quagmire named Chuk. Loves to relax and watch the game while sippin on that iced mocha. 10/10 https://t.co/HvNg9JWxFt
2220 - Good teamwork between these dogs. One is on lookout while other eats. Long necks. Nice big house. 9/10s good pups https://t.co/uXgmECGYEB
2222 - Here is a mother dog caring for her pups. Snazzy red mohawk. Doesn't wag tail. Pups look confused. Overall 4/10 https://t.co/YOHe6lf09m
2223 - 2 rare dogs. They waddle (v inefficient). Sometimes slide on bellies. Right one wants to be aircraft Marshall. 9/10s http

In [None]:
# Assess all observations with None values to doublecheck if names were missed
twitter_dogs_archive[twitter_dogs_archive.name == "None"][['tweet_id','text']]


In [None]:
# Assess whether names were extracted correctly.  
for i, row in twitter_dogs_archive.iterrows():
    if row['name'] != "None" and row['name'] not in row['text']:
        # row.name would return the index label, so use row['name'] for the column value
        print(row.tweet_id, row.text, row['name'], "\n")

In [None]:
# Assess if denominators are all 10. Multiple dogs in one picture/videos
for i,row in twitter_dogs_archive.iterrows():
    if not row['rating_denominator'] == 10:
        print(row.tweet_id, "\n", row.text)
        print("Numerator: ", row.rating_numerator, "\nDenominator: ", row.rating_denominator, "\n")

# irrelevant_tweets = [835246439529840640, 832088576586297345, 810984652412424192, 686035780142297088, 684225744407494656, 682808988178739200  ]
# multiple_dogs = [758467244762497024, 709198395643068416  ]
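
The denominator check above can also be written without a Python loop, and the same frame can carry a normalized rating so that group ratings become comparable with single-dog ratings. This is a sketch on hypothetical values, not rows from the archive; the column name `rating_normalized` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows: a single dog, a group of dogs, and another single dog.
ratings = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [13, 84, 12],
    "rating_denominator": [10, 70, 10],
})

# Rows whose denominator differs from 10, selected with a boolean mask.
unusual = ratings[ratings.rating_denominator != 10]

# Rating expressed per 10 points, so 84/70 counts the same as 12/10.
ratings["rating_normalized"] = (
    ratings.rating_numerator * 10 / ratings.rating_denominator
)
```

Tweets whose text contains a fraction that is not a rating at all would still need a manual look, as in the loop above.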

In [None]:
twitter_dogs_archive[twitter_dogs_archive.tweet_id == 679062614270468096]

In [None]:
twitter_dogs_archive.doggo.value_counts()

In [None]:
twitter_dogs_archive.floofer.value_counts()

In [None]:
twitter_dogs_archive.pupper.value_counts()

In [None]:
twitter_dogs_archive.puppo.value_counts()

In [None]:
# assess texts
twitter_dogs_archive.text

In [None]:
print("Number of duplicated tweet ids:", len(twitter_dogs_archive[twitter_dogs_archive.tweet_id.duplicated(keep=False)]))
twitter_dogs_archive[twitter_dogs_archive.tweet_id.duplicated()]

In [None]:
# Tweets containing "We only rate dogs" caught my attention; however, the phrase seems to be a joke about dogs that don't look like dogs.
# Print every row that contains "We only rate dogs" or "We. Only. Rate. Dogs."
pattern = re.compile(r'we.? only.? rate.? dogs', re.IGNORECASE)
# dogs_clean does not exist yet at this point, so iterate over the raw archive
for i, row in twitter_dogs_archive.iterrows():
    if pattern.search(row['text']):
        print(row['text'])

##### Assessing tweets data

In [21]:
# looking at errors
print("Number of non-existing tweet ids: ", len(errors_df))

NameError: name 'errors_df' is not defined

In [None]:
tweets_df[tweets_df['retweet_count'].isnull()]

##### Assessing image predictions data

In [29]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [27]:
image_predictions.tweet_id.duplicated().sum()

0

In [30]:
image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.0614285,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.0741917,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.0161992,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.0458854,False,terrapin,0.0178853,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.0582794,True,fur_coat,0.0544486,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.0145938,False,golden_retriever,0.00795896,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.0820861,True


In [None]:
# check if retweets are in image predictions -> no retweets in image predictions
for retweet in retweets_reduced:
    if retweet in image_predictions.tweet_id.values:
        print(retweet)

**Quality**  

_**twitter archive table**_
- irrelevant tweets
- irrelevant columns in_reply_to_status_id, in_reply_to_user_id
- Incorrect dog names, determiners like: a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all, None
- ratings are not normalized
- missing dog names
- incorrect null values in dog stages. None should be NaN.
- Erroneous data types (timestamp, source, dog_stage)
- Source value is wrapped in anchor tag
- Text contains mentions of users, e.g. @NonWhiteHat
- Contains tweets that are replies to other tweets --> remove if in_reply_to_status_id/in_reply_to_user_id is not NaN
- Some tweets contain a link using t.co, Twitter's URL shortener. These links are no longer active; the working URL is included in expanded_urls
- Rows 47, 59, 62 and 91 are not valid observations ("We only rate dogs" comments)
- Incorrectly extracted names or None as name, e.g. row 56; None should be NaN
- Incorrect denominators (not 10)
- Incomparable rating numerators.
- Tweets with missing photos
- Incorrect dog stages
- dog statuses should be NaN values instead of a string of None
- dogs with multiple dog stages in the following tweet_ids: 854010172552949760, 817777686764523521, 808106460588765185, 802265048156610565, 801115127852503040, 785639753186217984, 781308096455073793, 775898661951791106, 770093767776997377, 759793422261743616, 751583847268179968, 741067306818797568, 733109485275860992, 855851453814013952

_**image predictions table**_
- dog breeds in p1, p2, p3 contain underscores; some are capitalized, some are lowercase
- breed columns contain predictions that are not dog breeds, e.g. hen, cock, personal_computer, shopping_cart, box_turtle... --> the matching p1_dog, p2_dog or p3_dog value is False in this case
- contains 281 fewer entries than the twitter archive table

_**tweets table**_ 
- contains 17 fewer ids compared to twitter archive due to errors during twitter request 

**Tidiness**  

_**twitter archive table**_
- doggo, floofer, pupper, puppo should be one column
- 3 separate tables of the same purpose
- Multiple urls in expanded_urls (some are duplicates)
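
For the duplicated URLs in `expanded_urls`, one possible clean-up is to split the comma-separated string and keep each URL once, in its original order. The helper name `unique_urls` is hypothetical, and the sketch assumes that no URL itself contains a comma.

```python
import pandas as pd

# First value repeats the same URL twice, as in the archive; second is missing.
urls = pd.Series([
    "https://twitter.com/dog_rates/status/891327558926688256/photo/1,"
    "https://twitter.com/dog_rates/status/891327558926688256/photo/1",
    None,
])

def unique_urls(value):
    """Return the distinct URLs of a comma-separated string, in original order."""
    if pd.isna(value):
        return value
    # dict.fromkeys drops repeats while preserving insertion order
    return ",".join(dict.fromkeys(value.split(",")))

deduplicated = urls.apply(unique_urls)
```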


### Cleaning Data <a name="cleaning-data"></a>
#### Tidiness


In [9]:
# Create copies of data frames (plain assignment would only create a second name for the same object)
dogs_clean = twitter_dogs_archive.copy()
tweets_clean = tweets_df.copy()
images_clean = image_predictions.copy()

**dogs: _doggo, floofer, pupper, puppo should be one column_**

_**Define**_

Create a dog_stage column that assigns a dog stage to each tweet, using the four separate columns doggo, floofer, pupper and puppo. Mark the tweet as having no stage if all four columns contain "None".

_**Code**_

In [10]:
# identifier columns that stay fixed while the dog stage columns are melted
column_names = ['tweet_id','in_reply_to_status_id','in_reply_to_user_id', \
                'timestamp','source','text','retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp',\
                'expanded_urls','rating_numerator','rating_denominator','name']

# add a helper column that is "None" only when no dog stage is set
stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']
no_stage = (dogs_clean[stage_columns] == 'None').all(axis=1)
dogs_clean['None'] = np.where(no_stage, 'None', 'placeholder')

# melt dog stages into rows
dogs_clean = pd.melt(dogs_clean, id_vars=column_names, var_name='placeholder', value_name='dog_stage')

# keep only rows where the melted column name equals its value, e.g. ('pupper', 'pupper')
dogs_clean = dogs_clean[dogs_clean['placeholder'] == dogs_clean['dog_stage']]

dogs_clean = dogs_clean.drop(['placeholder'], axis=1).reset_index(drop=True)


_**Test**_

In [None]:
# Data frame must have a min of 2356 non-null values plus 14 tweets with multiple values for dog stages
dogs_clean.info()

In [None]:
# no changes in value counts of dog stages
dogs_clean.dog_stage.value_counts()
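
Because the melt keeps one row per stage, the 14 tweets with several stages appear more than once. If one row per tweet is preferred, an alternative is to join the stage names into a single value. The following is only a sketch on a hypothetical toy frame; `combine_stages` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical rows: one stage, no stage, and two stages at once.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage that is set; return NaN when none is."""
    stages = [value for value in row if value != "None"]
    return ",".join(stages) if stages else np.nan

toy["dog_stage"] = toy[stage_columns].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_columns)
```

The row count stays at three, the tweet without a stage gets NaN instead of the string "None", and the multi-stage tweet carries `doggo,pupper`.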

_**3 separate tables of the same purpose**_

_**Define**_

Join the dogs table and the tweets table on 'tweet_id' and 'id' respectively, removing non-matching tweet_ids. Then join the new dogs table and the image predictions table on their common tweet_id. Keep all entries without a matching breed prediction so that no tweets are lost.

_**Code**_

In [11]:
# merge dogs and tweets table 
dogs_clean = pd.merge(dogs_clean, tweets_clean, how='inner',  left_on='tweet_id', right_on='id').drop(['id'], axis=1)

# merge dogs and image predictions tables 
dogs_clean = pd.merge(dogs_clean, images_clean, how='left',  on='tweet_id')

_**Test**_

In [None]:
# should have all 2353 entries after dropping the missing tweets
# all columns are combined in one table 
dogs_clean.info()
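
To see how the inner and left joins behave before trusting the row counts, the two merges can be replayed on a small example. The frames below are hypothetical toy data with the same key columns; `validate` and `indicator` are optional `pd.merge` parameters that the notebook itself does not use.

```python
import pandas as pd

dogs = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Phineas", "Tilly", "Archie"]})
tweets = pd.DataFrame({"id": [1, 2], "retweet_count": [8215, 6076]})
images = pd.DataFrame({"tweet_id": [1], "p1": ["redbone"]})

# Inner join: tweet 3 has no API record and is dropped.
# validate raises a MergeError if a key is unexpectedly duplicated.
merged = pd.merge(
    dogs, tweets, how="inner", left_on="tweet_id", right_on="id",
    validate="one_to_one",
).drop(columns="id")

# Left join: tweet 2 has no prediction but is kept with NaN in p1.
# indicator adds a _merge column telling where each row came from.
merged = pd.merge(merged, images, how="left", on="tweet_id", indicator=True)
```

On the real tables the melted dogs frame repeats a tweet_id for the multi-stage tweets, so `validate="many_to_one"` may be the more fitting check there.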

_**Quality**_

_**Irrelevant tweets**_

_**Define**_  
Remove all retweets by keeping only rows containing a null value in 'retweeted_status_id'. Subsequently, remove the irrelevant columns 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.   
Remove all irrelevant replies to tweets. Replies contain a value in the 'in_reply_to' columns. We don't want to drop all of them, so take the valid replies out of the list before dropping the rest. 

_**Code**_

In [12]:
# We found that most replies are irrelevant to our research, except for the replies with the following tweet ids
valid_replies = [863079547188785154, 856526610513747968, 847617282490613760, 802265048156610565, 786051337297522688, 
                 766714921925144576, 704871453724954624,675870721063669760, 675707330206547968, 669353438988365824]

In [13]:
# remove valid replies from our list of replies
replies = list(replies)
[replies.remove(el) for el in valid_replies if el in replies]

[None, None, None, None, None, None, None, None, None, None]

In [14]:
# remove remaining replies
dogs_clean = dogs_clean[~dogs_clean.tweet_id.isin(replies)]

In [15]:
dogs_clean.drop(['retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp', 'in_reply_to_status_id', 'in_reply_to_user_id'], axis=1, inplace=True)
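
The cells above filter replies only. If retweets should be removed as described in the Define step, a filter on `retweeted_status_id` would have to run before that column is dropped. The sketch below shows both filters as boolean masks on a hypothetical toy frame; `valid_reply_ids` stands in for the `valid_replies` list.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: an original tweet, a retweet, a valid reply, another reply.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, 1.0, np.nan, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, 9.0, 8.0],
})
valid_reply_ids = [3]

# A retweet has a value in retweeted_status_id.
is_retweet = toy.retweeted_status_id.notna()

# A reply is unwanted unless its tweet_id is on the list of valid replies.
is_unwanted_reply = (
    toy.in_reply_to_status_id.notna() & ~toy.tweet_id.isin(valid_reply_ids)
)

kept = toy[~is_retweet & ~is_unwanted_reply]
```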

_**Test**_

In [15]:
# should be 68 replies to remove after deleting 10 valid replies
len(replies)

68

In [16]:
# After removing 169 retweets and 68 replies we would expect 2116 observations.
# The output shows 2285 (2353 - 68): only the replies were dropped, the retweet rows are still present.
dogs_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2285 entries, 0 to 2352
Data columns (total 22 columns):
tweet_id              2285 non-null int64
timestamp             2285 non-null object
source                2285 non-null object
text                  2285 non-null object
expanded_urls         2278 non-null object
rating_numerator      2285 non-null int64
rating_denominator    2285 non-null int64
name                  2285 non-null object
dog_stage             2285 non-null object
favorite_count        2285 non-null int64
retweet_count         2285 non-null int64
jpg_url               2063 non-null object
img_num               2063 non-null float64
p1                    2063 non-null object
p1_conf               2063 non-null float64
p1_dog                2063 non-null object
p2                    2063 non-null object
p2_conf               2063 non-null float64
p2_dog                2063 non-null object
p3                    2063 non-null object
p3_conf               2063 non-null

_**Incorrect dog names, determiners like: a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all, None**_

_**Define**_  
Scan through all placeholder names, i.e. every name recorded as a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all or None. Look for the dog's name in the text column using the patterns "named (something)" and "name is (something)". If one of the patterns matches, replace the placeholder with the extracted name. If neither matches, fill the cell with a null value. 

_**Code**_

In [16]:
pattern1 = re.compile(r'named [A-Za-z]+')
pattern2 = re.compile(r'name\.? is\.? [A-Za-z]+', re.IGNORECASE)

def extract_name(row):
    name_result1 = re.search(pattern1, row['text'])
    name_result2 = re.search(pattern2, row['text'])
    
    if name_result1:
        new_name = name_result1.group().split()[1]
    elif name_result2:
        new_name = name_result2.group().split()[2]
    else:
        new_name = np.nan
    return new_name
    
# loop through the names column and replace each determiner with the name extracted from the text (NaN if none is found)
for i, row in dogs_clean.iterrows():
    if row['name'] in determiners:
        new_name = extract_name(row)
        dogs_clean.at[i, 'name'] = new_name
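
The same two patterns can also be applied column-wise with `Series.str.extract`, which avoids the row loop. This is a sketch on made-up tweets; `placeholders` is a hypothetical three-item subset of the determiners list.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['a', 'the', 'Lucy', 'an'],
    'text': ['This is a pupper named Wylie. 10/10',
             'Here is the floof. His name is Zoey. 12/10',
             'This is Lucy. 13/10',
             'This is an odd one. 9/10'],
})
placeholders = ['a', 'an', 'the']  # hypothetical subset of the determiners list

# one capture group per pattern; rows without a match yield NaN
named = toy['text'].str.extract(r'named ([A-Za-z]+)', expand=False)
name_is = toy['text'].str.extract(r'(?i)name\.? is\.? ([A-Za-z]+)', expand=False)

# prefer the "named X" match, fall back to "name is X"
extracted = named.fillna(name_is)

# overwrite only the placeholder names; real names such as Lucy stay untouched
mask = toy['name'].isin(placeholders)
toy.loc[mask, 'name'] = extracted[mask]
print(toy['name'].tolist())
```

The last row has a placeholder name and no matching pattern, so it ends up as a null value, as described in the Define step.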

_**Test**_

In [None]:
dogs_clean.info()

In [17]:
# all determiners are gone, names were correctly extracted
dogs_clean.name.value_counts()


Lucy              11
Charlie           11
Oliver            11
Cooper            11
Tucker            10
Lola              10
Penny             10
Bo                9 
Winston           9 
Sadie             8 
Daisy             7 
Toby              7 
Bailey            7 
Buddy             7 
Rusty             6 
Oscar             6 
Dave              6 
Leo               6 
Scout             6 
Jack              6 
Koda              6 
Jax               6 
Stanley           6 
Milo              6 
Bella             6 
Louis             5 
Maggie            5 
Gus               5 
Alfie             5 
Sunny             5 
Bentley           5 
Chester           5 
Larry             5 
Oakley            5 
Finn              5 
George            5 
Sampson           4 
Luna              4 
Winnie            4 
Bruce             4 
Maddie            4 
Archie            4 
Moose             4 
Reginald          4 
Loki              4 
Beau              4 
Derek             4 
Duke         

_**Missing dog names**_

_**Define**_  
Replace null values in the name column with the dog's name that was manually extracted from the tweet's text.   

_**Code**_

In [18]:
missing_names = [(826204788643753985, "Dew"), (854120357044912130, "Cooper"), (778039087836069888, "Max"), 
                   (685547936038666240, "Jack"),(878604707211726852, "Martha"), (863079547188785154, "Pipsy"), 
                   (856526610513747968, "Charly"), (847617282490613760, "Cannon"), (844979544864018432, "Toby"),
                   (836001077879255040, "Atlas"), (831650051525054464, "Blue"), (811647686436880384, "Augie"), 
                   (778408200802557953, "Loki"), (758041019896193024, "Teagan"), (740373189193256964, "Bretagne"), 
                   (708026248782585858, "Frank"), (704871453724954624, "Pipsie"), (695064344191721472, "Charles"), 
                   (692142790915014657, "Teddy"), (685681090388975616, "Jack"), (685325112850124800, "Tristan"), 
                   (684538444857667585, "Pippa"), (678023323247357953, "Reese"), (677687604918272002, "Cindy"),
                   (676590572941893632, "Bubbles"), (675870721063669760, "Yoshi"), (669684865554620416, "Dug"), 
                   (668142349051129856, "Oliver")]


In [19]:
# assign each manually extracted name to the row with the matching tweet_id
for tweet_id, name in missing_names:
    dogs_clean.loc[dogs_clean['tweet_id'] == tweet_id, 'name'] = name
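
A list of (tweet_id, name) pairs can also be turned into a lookup table and applied in one step. The sketch below uses made-up ids and additionally reports which ids from the manual list matched no row, which helps to explain a name count that is lower than expected.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'tweet_id': [101, 102, 103],
                    'name': [np.nan, 'Lucy', np.nan]})
manual_names = [(101, 'Dew'), (999, 'Ghost')]  # made-up ids; 999 is absent from toy

lookup = dict(manual_names)

# map tweet_id -> manual name (NaN when the id is not in lookup),
# and keep the existing name wherever no manual name is available
toy['name'] = toy['tweet_id'].map(lookup).fillna(toy['name'])

# ids in the manual list that matched no row, e.g. because the row was dropped earlier
unmatched = sorted(set(lookup) - set(toy['tweet_id']))
print(toy, unmatched)
```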

_**Test**_

In [None]:
# should have 1431+28= 1459 names 
dogs_clean.info()

In [None]:
# test if new names are correct
dogs_clean.name.value_counts()

In [20]:
# assess dogs_clean
for el in dogs_clean[dogs_clean.name.isna()][['text', 'tweet_id']].values:
    print(el)
    
# irrelevant_tweets = [840696689258311684, 840698636975636481, 835246439529840640, 674606911342424069]

# two_in one = [('Burke', 'Dexter',808106460588765185), (Cletus, Jerome, Alejandro, Burp, & Titson,709198395643068416),
# (689599056876867584 = 33 dogs), (669037058363662336, "Pancho","Peaches"), (668221241640230912, "Bo", "Smittens"), (666835007768551424, "Cupit and Prencer" ")]

["Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR"
 872967104147763200]
['Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH' 871102520638267392]
["I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq"
 858843525470990336]
["Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel"
 855851453814013952]
["Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel"
 855851453814013952]
["At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk"
 854010172552949760]
["At

_**Incorrect null values in dog stages. None should be NaN.**_

_**Define**_  
Replace every "None" string in the dog_stage column with a NumPy null value (NaN). The "None" entries in the name column were already handled in the name-cleaning steps above. 

_**Code**_

In [21]:
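# assign the result back instead of calling replace(..., inplace=True) on a column selection
dogs_clean['dog_stage'] = dogs_clean['dog_stage'].replace("None", np.nan)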

_**Test**_

In [None]:
# dog stages should have 2184-1828=356 entries, since there were 1828 Nones  
# names should have 2184-687=1497 entries, since there were 687 Nones 
dogs_clean.info()


In [None]:
dogs_clean.query('tweet_id == 722974582966214656') 

_**Ratings are not normalized**_

_**Define**_   
First, correct the wrongly extracted ratings. Then, find all ratings whose denominator is not 10 and remove the corresponding rows. 

_**Code**_

In [22]:
# incorrectly extracted ratings and their corrected, normalized numerators
corrected_ratings = [{'tweet_id': 820690176645140481, 'rating_numerator': 12}, 
                     {'tweet_id': 722974582966214656, 'rating_numerator': 13},
                     {'tweet_id': 716439118184652801, 'rating_numerator': 11},
                     {'tweet_id': 713900603437621249, 'rating_numerator': 11},
                     {'tweet_id': 710658690886586372, 'rating_numerator': 10},
                     {'tweet_id': 704054845121142784, 'rating_numerator': 12},
                     {'tweet_id': 697463031882764288, 'rating_numerator': 11},
                     {'tweet_id': 684222868335505415, 'rating_numerator': 11},
                     {'tweet_id': 682962037429899265, 'rating_numerator': 6},
                     {'tweet_id': 677716515794329600, 'rating_numerator': 12},
                     {'tweet_id': 675853064436391936, 'rating_numerator': 11},
                     {'tweet_id': 666287406224695296, 'rating_numerator': 11},
                    ]

In [23]:
# replace the incorrect numerators with their corrected, normalized values and set the denominator to 10
for rating in corrected_ratings:
    dogs_clean.loc[dogs_clean['tweet_id'] == rating['tweet_id'], 'rating_numerator'] = rating['rating_numerator']
    dogs_clean.loc[dogs_clean['tweet_id'] == rating['tweet_id'], 'rating_denominator'] = 10

In [24]:
# remove all entries containing a denominator other than 10
dogs_clean = dogs_clean[dogs_clean.rating_denominator == 10]
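
The correct-then-filter order matters: a row with a wrongly extracted denominator would otherwise be dropped before it can be fixed. A small sketch with made-up ratings, where `fixes` plays the role of `corrected_ratings`:

```python
import pandas as pd

toy = pd.DataFrame({'tweet_id': [1, 2, 3],
                    'rating_numerator': [24, 13, 84],
                    'rating_denominator': [7, 10, 70]})
fixes = pd.DataFrame({'tweet_id': [1], 'rating_numerator': [12]})  # made-up correction

# index by tweet_id so the corrections can be addressed by label
fixed = toy.set_index('tweet_id')
fixed.loc[fixes['tweet_id'], 'rating_numerator'] = fixes['rating_numerator'].values
fixed.loc[fixes['tweet_id'], 'rating_denominator'] = 10

# keep only ratings on the common /10 scale
fixed = fixed[fixed['rating_denominator'] == 10].reset_index()
print(fixed)
```

Tweet 1 survives the filter only because its denominator was corrected first; tweet 3 (84/70) is removed.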

_**Test**_

In [26]:
# Test if ratings were replaced correctly
for rating in corrected_ratings:
    print(dogs_clean.loc[dogs_clean['tweet_id'] == rating['tweet_id']][['rating_numerator', 'rating_denominator']])

     rating_numerator  rating_denominator
747  12                10                
      rating_numerator  rating_denominator
1320  13                10                
      rating_numerator  rating_denominator
1350  11                10                
      rating_numerator  rating_denominator
1375  11                10                
      rating_numerator  rating_denominator
1398  10                10                
      rating_numerator  rating_denominator
1483  12                10                
      rating_numerator  rating_denominator
1551  11                10                
      rating_numerator  rating_denominator
1708  11                10                
      rating_numerator  rating_denominator
1727  6                 10                
      rating_numerator  rating_denominator
1821  12                10                
      rating_numerator  rating_denominator
1873  11                10                
      rating_numerator  rating_denominator
2332  11     

In [27]:
# all rating_denominators are 10
dogs_clean[dogs_clean.rating_denominator != 10]

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stage,favorite_count,retweet_count,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


_**Dog breeds in p1, p2, p3 contain underscores and are capitalized inconsistently**_

_**Define**_  
Replace all underscores in the breed predictions in the p1, p2 and p3 columns with whitespace. Then, make the first letter of each word uppercase. 

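_**Code**_

In [94]:
# `clean` is an alias for dogs_clean (not a copy), so the edits below also change dogs_clean
clean = dogs_clean

In [103]:
columns = ['p1', 'p2', 'p3']
for i, row in clean.iterrows():
    for column in columns:
        # skip NaN predictions (tweets without an image prediction)
        if isinstance(row[column], str):
            # e.g. "flat-coated_retriever" -> "Flat-coated Retriever"
            words = row[column].replace("_", " ").split()
            clean.at[i, column] = " ".join([word.capitalize() for word in words])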
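
The per-cell logic can also be wrapped in a small function and applied to a whole column with `Series.map`. This is a sketch on made-up labels; `normalize_breed` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def normalize_breed(label):
    """Replace underscores with spaces and capitalize each space-separated word."""
    if not isinstance(label, str):
        return label  # leave NaN (no prediction) untouched
    return " ".join(word.capitalize() for word in label.replace("_", " ").split())

toy = pd.DataFrame({'p1': ['golden_retriever', 'Shih-Tzu', np.nan]})
toy['p1'] = toy['p1'].map(normalize_breed)
print(toy['p1'].tolist())
```

`str.capitalize` lowercases everything after the first letter of a word, so "Shih-Tzu" becomes "Shih-tzu". `Series.str.title()` would give "Shih-Tzu" instead, which is why it is not a drop-in replacement here.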

_**Test**_

In [93]:
# check if any underscores left, returns 3 times False if there are no underscores 
[dogs_clean[column].str.contains('_').any() for column in columns]

[False, False, False]

In [104]:
# check if words were capitalized
clean.sample(10)[['p1', 'p2', 'p3']]

Unnamed: 0,p1,p2,p3
1107,Old English Sheepdog,Lhasa,Briard
837,Shield,Barrel,Sundial
8,Flat-coated Retriever,Labrador Retriever,Groenendael
1751,Frilled Lizard,Tailed Frog,Axolotl
1891,Chihuahua,West Highland White Terrier,Samoyed
1613,Lakeland Terrier,Irish Terrier,Airedale
1172,Shih-tzu,Lhasa,Pekinese
95,Barrow,Basset,Wok
1194,Samoyed,Eskimo Dog,Great Pyrenees
1752,Lhasa,Shih-tzu,Pomeranian


_**Erroneous data types (timestamp, source, dog stage)**_

_**Define**_  

Convert timestamp into datetime format. Convert source and dog stage into categorical data. 

_**Code**_

In [None]:
# To datetime
dogs_clean.timestamp = pd.to_datetime(dogs_clean.timestamp)

# To category
dogs_clean.source = dogs_clean.source.astype('category')
dogs_clean.dog_stage = dogs_clean.dog_stage.astype('category')
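
A quick way to see what these conversions buy us, sketched on a toy frame. The first timestamp follows the format shown in the archive; the second row is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-07-31 00:18:03 +0000'],
    'dog_stage': ['doggo', None],
})
toy['timestamp'] = pd.to_datetime(toy['timestamp'])
toy['dog_stage'] = toy['dog_stage'].astype('category')

# datetime columns expose the .dt accessor, e.g. for grouping tweets by month
months = toy['timestamp'].dt.month.tolist()

# missing stages stay missing; they do not become a category of their own
stage_categories = list(toy['dog_stage'].cat.categories)
print(months, stage_categories)
```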

_**Test**_

In [None]:
dogs_clean.info()

## Analysis and Visualization <a name="analysis-and-visualization"></a>
The paragraph text

- Most popular dog names and breeds vs. retweets and favorites
- Most popular dog content
- Rating statistics
- Popularity of the account over time
- Where are users from?
- Most popular hashtags
- Which breed is associated with which dog stage?

## Reporting <a name="reporting"></a>
The paragraph text