# We Rate Dogs Data Analysis 

This project analyzes tweets from the WeRateDogs Twitter account. Its main purpose is to practice thorough data wrangling; a secondary goal is to surface interesting findings and summarize them in a report.

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. 

## Table of Contents

1. [Introduction](#introduction)
2. [Data Wrangling](#data-wrangling)
    1. [Gathering data](#gathering-data)
    2. [Assessing data](#assessing-data)
    3. [Cleaning data](#cleaning-data) 
3. [Analysis and Visualization](#analysis-and-visualization)
4. [Reporting](#reporting)

## Introduction <a name="introduction"></a>

To get started, let's import our libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import time
import requests
import os.path
%matplotlib inline

## Data Wrangling <a name="data-wrangling"></a>

### Gathering Data <a name="gathering-data"></a>
Read in the first data set: WeRateDogs Twitter archive provided by Udacity.

In [2]:
# read in twitter archive 
twitter_dogs_archive = pd.read_csv('twitter-archive-enhanced-2.csv')
twitter_dogs_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Download and read in image predictions file provided by Udacity.

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# download file programmatically and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

# write the downloaded bytes to disk; mode 'wb' creates the file if it does not exist
with open('image-predictions.tsv', 'wb') as file_image_predictions:
    file_image_predictions.write(response.content)

In [3]:
# load image predictions into data frame
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Twitter API request using tweepy 

In [4]:
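# keep login details out of the notebook by loading them from a local JSON file
with open('logins.json') as login_file:
    logins = json.load(login_file)


class ImproperlyConfigured(Exception):
    """Raised when a required login setting is missing."""


def get_secret(setting, logins=logins):
    """Get login setting or fail with ImproperlyConfigured"""
    try:
        return logins[setting]
    except KeyError:
        raise ImproperlyConfigured("Set the {} setting.".format(setting))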

In [5]:
# retrieve Twitter login details 
consumer_key = get_secret('consumer_key')
consumer_secret = get_secret('consumer_secret')
access_token = get_secret('access_token')
access_secret = get_secret('access_secret')

In [6]:
# access Twitter API
import tweepy

# create the OAuth handler from the consumer key and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# attach the access token and secret
auth.set_access_token(access_token, access_secret)

# api instance
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


In [133]:
# get tweets from WeRateDogs Twitter timeline
start = time.time()
error_ids = []
tweet_ids = twitter_dogs_archive.tweet_id.values
first_record = True
print("Start requesting WeRateDogs tweets.")
with open('tweet_json.txt', 'w', encoding='utf-8') as file:
    file.write("[\n")
    for ranking, tweet_id in enumerate(tweet_ids, start=1):
        try:
            # Twitter API request using specific tweet_id
            status = api.get_status(tweet_id, tweet_mode='extended')
            status_json = json.dumps(status._json)

            # separate records with a comma; writing it before every record
            # except the first keeps the array valid even when requests fail
            if not first_record:
                file.write(",\n")
            file.write(status_json)
            first_record = False

            # This cell is slow so print ranking to gauge time remaining
            print(ranking, '-', tweet_id)

        except tweepy.TweepError as e:
            # collect IDs of tweets that could not be retrieved
            error_ids.append(tweet_id)
            print(e.response.text if e.response is not None else e)
    file.write("\n]")
end = time.time()
print("Process finished. Time elapsed: ", round((end-start) / 60, 2), "min." )

Start requesting WeRateDogs tweets.
1 - 892420643555336193
2 - 892177421306343426
3 - 891815181378084864
4 - 891689557279858688
5 - 891327558926688256
6 - 891087950875897856
7 - 890971913173991426
8 - 890729181411237888
9 - 890609185150312448
10 - 890240255349198849
11 - 890006608113172480
12 - 889880896479866881
13 - 889665388333682689
14 - 889638837579907072
15 - 889531135344209921
16 - 889278841981685760
17 - 888917238123831296
18 - 888804989199671297
19 - 888554962724278272
{"errors":[{"code":144,"message":"No status found with that ID."}]}
21 - 888078434458587136
22 - 887705289381826560
23 - 887517139158093824
24 - 887473957103951883
25 - 887343217045368832
26 - 887101392804085760
27 - 886983233522544640
28 - 886736880519319552
29 - 886680336477933568
30 - 886366144734445568
31 - 886267009285017600
32 - 886258384151887873
33 - 886054160059072513
34 - 885984800019947520
35 - 885528943205470208
36 - 885518971528720385
37 - 885311592912609280
38 - 885167619883638784
39 - 884925521741

Rate limit reached. Sleeping for: 39


85 - 876484053909872640
86 - 876120275196170240
87 - 875747767867523072
88 - 875144289856114688
89 - 875097192612077568
90 - 875021211251597312
91 - 874680097055178752
92 - 874434818259525634
93 - 874296783580663808
94 - 874057562936811520
95 - 874012996292530176
{"errors":[{"code":144,"message":"No status found with that ID."}]}
97 - 873580283840344065
98 - 873337748698140672
99 - 873213775632977920
100 - 872967104147763200
101 - 872820683541237760
{"errors":[{"code":144,"message":"No status found with that ID."}]}
103 - 872620804844003328
104 - 872486979161796608
{"errors":[{"code":144,"message":"No status found with that ID."}]}
106 - 872122724285648897
107 - 871879754684805121
108 - 871762521631449091
109 - 871515927908634625
110 - 871166179821445120
111 - 871102520638267392
112 - 871032628920680449
113 - 870804317367881728
114 - 870726314365509632
115 - 870656317836468226
116 - 870374049280663552
117 - 870308999962521604
118 - 870063196459192321
{"errors":[{"code":144,"message":"N

397 - 825147591692263424
398 - 825120256414846976
399 - 825026590719483904
400 - 824796380199809024
401 - 824775126675836928
402 - 824663926340194305
403 - 824325613288833024
404 - 824297048279236611
405 - 824025158776213504
406 - 823939628516474880
407 - 823719002937630720
408 - 823699002998870016
409 - 823581115634085888
410 - 823333489516937216
411 - 823322678127919110
412 - 823269594223824897
413 - 822975315408461824
414 - 822872901745569793
415 - 822859134160621569
416 - 822647212903690241
417 - 822610361945911296
418 - 822489057087389700
419 - 822462944365645825
420 - 822244816520155136
421 - 822163064745328640
422 - 821886076407029760
423 - 821813639212650496
424 - 821765923262631936
425 - 821522889702862852
426 - 821421320206483457
427 - 821407182352777218
428 - 821153421864615936
429 - 821149554670182400
430 - 821107785811234820
431 - 821044531881721856
432 - 820837357901512704
433 - 820749716845686786
434 - 820690176645140481
435 - 820494788566847489
436 - 820446719150292993


722 - 783334639985389568
723 - 783085703974514689
724 - 782969140009107456
725 - 782747134529531904
726 - 782722598790725632
727 - 782598640137187329
728 - 782305867769217024
729 - 782021823840026624
730 - 781955203444699136
731 - 781661882474196992
732 - 781655249211752448
733 - 781524693396357120
734 - 781308096455073793
735 - 781251288990355457
736 - 781163403222056960
737 - 780931614150983680
738 - 780858289093574656
739 - 780800785462489090
740 - 780601303617732608
741 - 780543529827336192
742 - 780496263422808064
743 - 780476555013349377
744 - 780459368902959104
745 - 780192070812196864
746 - 780092040432480260
747 - 780074436359819264
748 - 779834332596887552
749 - 779377524342161408
750 - 779124354206535695
751 - 779123168116150273
752 - 779056095788752897
753 - 778990705243029504
754 - 778774459159379968
755 - 778764940568104960
756 - 778748913645780993
757 - 778650543019483137
758 - 778624900596654080
759 - 778408200802557953
760 - 778396591732486144
761 - 778383385161035776


Rate limit reached. Sleeping for: 598


985 - 749317047558017024
986 - 749075273010798592
987 - 749064354620928000
988 - 749036806121881602
989 - 748977405889503236
990 - 748932637671223296
991 - 748705597323898880
992 - 748699167502000129
993 - 748692773788876800
994 - 748575535303884801
995 - 748568946752774144
996 - 748346686624440324
997 - 748337862848962560
998 - 748324050481647620
999 - 748307329658011649
1000 - 748220828303695873
1001 - 747963614829678593
1002 - 747933425676525569
1003 - 747885874273214464
1004 - 747844099428986880
1005 - 747816857231626240
1006 - 747651430853525504
1007 - 747648653817413632
1008 - 747600769478692864
1009 - 747594051852075008
1010 - 747512671126323200
1011 - 747461612269887489
1012 - 747439450712596480
1013 - 747242308580548608
1014 - 747219827526344708
1015 - 747204161125646336
1016 - 747103485104099331
1017 - 746906459439529985
1018 - 746872823977771008
1019 - 746818907684614144
1020 - 746790600704425984
1021 - 746757706116112384
1022 - 746726898085036033
1023 - 746542875601690625
1

1301 - 707693576495472641
1302 - 707629649552134146
1303 - 707610948723478529
1304 - 707420581654872064
1305 - 707411934438625280
1306 - 707387676719185920
1307 - 707377100785885184
1308 - 707315916783140866
1309 - 707297311098011648
1310 - 707059547140169728
1311 - 707038192327901184
1312 - 707021089608753152
1313 - 707014260413456384
1314 - 706904523814649856
1315 - 706901761596989440
1316 - 706681918348251136
1317 - 706644897839910912
1318 - 706593038911545345
1319 - 706538006853918722
1320 - 706516534877929472
1321 - 706346369204748288
1322 - 706310011488698368
1323 - 706291001778950144
1324 - 706265994973601792
1325 - 706169069255446529
1326 - 706166467411222528
1327 - 706153300320784384
1328 - 705975130514706432
1329 - 705970349788291072
1330 - 705898680587526145
1331 - 705786532653883392
1332 - 705591895322394625
1333 - 705475953783398401
1334 - 705442520700944385
1335 - 705428427625635840
1336 - 705239209544720384
1337 - 705223444686888960
1338 - 705102439679201280
1339 - 70506

1617 - 685198997565345792
1618 - 685169283572338688
1619 - 684969860808454144
1620 - 684959798585110529
1621 - 684940049151070208
1622 - 684926975086034944
1623 - 684914660081053696
1624 - 684902183876321280
1625 - 684880619965411328
1626 - 684830982659280897
1627 - 684800227459624960
1628 - 684594889858887680
1629 - 684588130326986752
1630 - 684567543613382656
1631 - 684538444857667585
1632 - 684481074559381504
1633 - 684460069371654144
1634 - 684241637099323392
1635 - 684225744407494656
1636 - 684222868335505415
1637 - 684200372118904832
1638 - 684195085588783105
1639 - 684188786104872960
1640 - 684177701129875456
1641 - 684147889187209216
1642 - 684122891630342144
1643 - 684097758874210310
1644 - 683857920510050305
1645 - 683852578183077888
1646 - 683849932751646720
1647 - 683834909291606017
1648 - 683828599284170753
1649 - 683773439333797890
1650 - 683742671509258241
1651 - 683515932363329536
1652 - 683498322573824003
1653 - 683481228088049664
1654 - 683462770029932544
1655 - 68344

Rate limit reached. Sleeping for: 605


1885 - 674800520222154752
1886 - 674793399141146624
1887 - 674790488185167872
1888 - 674788554665512960
1889 - 674781762103414784
1890 - 674774481756377088
1891 - 674767892831932416
1892 - 674764817387900928
1893 - 674754018082705410
1894 - 674752233200820224
1895 - 674743008475090944
1896 - 674742531037511680
1897 - 674739953134403584
1898 - 674737130913071104
1899 - 674690135443775488
1900 - 674670581682434048
1901 - 674664755118911488
1902 - 674646392044941312
1903 - 674644256330530816
1904 - 674638615994089473
1905 - 674632714662858753
1906 - 674606911342424069
1907 - 674468880899788800
1908 - 674447403907457024
1909 - 674436901579923456
1910 - 674422304705744896
1911 - 674416750885273600
1912 - 674410619106390016
1913 - 674394782723014656
1914 - 674372068062928900
1915 - 674330906434379776
1916 - 674318007229923329
1917 - 674307341513269249
1918 - 674291837063053312
1919 - 674271431610523648
1920 - 674269164442398721
1921 - 674265582246694913
1922 - 674262580978937856
1923 - 67425

2201 - 668655139528511488
2202 - 668645506898350081
2203 - 668643542311546881
2204 - 668641109086707712
2205 - 668636665813057536
2206 - 668633411083464705
2207 - 668631377374486528
2208 - 668627278264475648
2209 - 668625577880875008
2210 - 668623201287675904
2211 - 668620235289837568
2212 - 668614819948453888
2213 - 668587383441514497
2214 - 668567822092664832
2215 - 668544745690562560
2216 - 668542336805281792
2217 - 668537837512433665
2218 - 668528771708952576
2219 - 668507509523615744
2220 - 668496999348633600
2221 - 668484198282485761
2222 - 668480044826800133
2223 - 668466899341221888
2224 - 668297328638447616
2225 - 668291999406125056
2226 - 668286279830867968
2227 - 668274247790391296
2228 - 668268907921326080
2229 - 668256321989451776
2230 - 668248472370458624
2231 - 668237644992782336
2232 - 668226093875376128
2233 - 668221241640230912
2234 - 668204964695683073
2235 - 668190681446379520
2236 - 668171859951755264
2237 - 668154635664932864
2238 - 668142349051129856
2239 - 66811

In [136]:
# load the tweet JSON array and keep ID, retweet count and favorite count
tweets = []
with open('tweet_json.txt', 'r', encoding='utf-8') as file:
    data = json.load(file)

for tweet in data:
    record = {
        "id": tweet["id"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    }
    # hashtags could be added as well:
    # record["hashtags"] = [el["text"] for el in tweet["entities"]["hashtags"]]
    tweets.append(record)

tweets_df = pd.DataFrame(tweets)
tweets_df.head()


Unnamed: 0,favorite_count,id,retweet_count
0,37683,892420643555336193,8215
1,32373,892177421306343426,6076
2,24378,891815181378084864,4017
3,41004,891689557279858688,8370
4,39208,891327558926688256,9075


In [141]:
# save the IDs of tweets that could not be retrieved
errors_df = pd.DataFrame(error_ids, columns=['tweet_id'])
errors_df.to_csv('errors.csv', index=False)

Query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.
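
The one-object-per-line layout described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the cell that produced `tweet_json.txt`: `sample_statuses` stands in for the `status._json` dictionaries returned by the API, and the helper names `write_json_lines` and `read_tweet_counts` are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def write_json_lines(statuses, path):
    """Write each tweet's JSON data to its own line (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as file:
        for status in statuses:
            file.write(json.dumps(status) + '\n')


def read_tweet_counts(path):
    """Read the file line by line, keeping ID, retweet and favorite counts."""
    records = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            records.append({
                'id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return pd.DataFrame(records, columns=['id', 'retweet_count', 'favorite_count'])


# Stand-in data: values copied from the first two rows of tweets_df
sample_statuses = [
    {'id': 892420643555336193, 'retweet_count': 8215, 'favorite_count': 37683},
    {'id': 892177421306343426, 'retweet_count': 6076, 'favorite_count': 32373},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'tweet_json_sample.txt')
write_json_lines(sample_statuses, sample_path)
sample_df = read_tweet_counts(sample_path)
```

With this layout a failed request simply produces no line, so no comma or bracket bookkeeping is needed to keep the file parseable.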

### Assessing Data <a name="assessing-data"></a>

In [9]:
# Assess twitter dogs enhanced.
twitter_dogs_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [10]:
#twitter_dogs_archive[twitter_dogs_archive.rating_denominator == 121]
twitter_dogs_archive.puppo.unique()

array(['None', 'puppo'], dtype=object)

In [135]:
twitter_dogs_archive[twitter_dogs_archive.in_reply_to_status_id.isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


**Quality**  

_**twitter archive table**_
- Erroneous datatypes: tweet_id, in_reply_to_status_id and in_reply_to_user_id are stored as numbers --> convert to string in case Python can't handle them (the float64 columns cannot represent 18-digit IDs exactly)
- Erroneous datatypes: timestamp and retweeted_status_timestamp are stored as strings instead of datetimes; retweeted_status_id --> string
- Source value is wrapped in an anchor tag
- Contains tweets that are replies to other tweets --> remove rows where in_reply_to_status_id/in_reply_to_user_id is not NaN
- Retweets are contained in the data set (all entries having a retweeted_status_id) --> query the original tweet using "retweeted_status_id", because not all information is available on a retweet; e.g. while the original tweet may be geo-tagged, the retweet's "geo" and "place" objects will always be null
- Some tweets contain a link using t.co, Twitter's URL shortener. These links are no longer active; the working URL is included in expanded_urls
- Rows 47, 59, 62 and 91 are not valid observations ("We only rate dogs" comments)
- Incorrectly extracted names, or the string 'None' used as a name, e.g. row 56; None should be NaN
- Incorrect denominators (not 10)
- Incomparable rating numerators
- Tweets with missing photos
- Incorrect dog statuses
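
A few of the quality issues listed above (replies, ID and timestamp datatypes, the anchor tag around `source`, and `'None'` names) could be addressed roughly as sketched below. This is only an illustration under stated assumptions: `sample` is a tiny invented stand-in for the archive, the full `source` string is assumed (the notebook output truncates it), and retweets are deliberately left untouched.

```python
import numpy as np
import pandas as pd

# Tiny invented stand-in for the archive table (illustrative rows only)
anchor = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [np.nan, 10.0, np.nan],
    'timestamp': ['2017-08-01 16:23:56 +0000'] * 3,
    'source': [anchor] * 3,
    'name': ['Phineas', 'Tilly', 'None'],
})

# Work on a real copy so the raw data stays untouched
clean = sample.copy()

# Remove replies: keep rows where in_reply_to_status_id is NaN
clean = clean[clean['in_reply_to_status_id'].isnull()].copy()

# IDs are identifiers, not quantities, so store them as strings
clean['tweet_id'] = clean['tweet_id'].astype(str)

# Parse the timestamp strings into datetimes
clean['timestamp'] = pd.to_datetime(clean['timestamp'])

# Keep only the text between the opening and closing anchor tags
clean['source'] = clean['source'].str.extract(r'>([^<]+)</a>', expand=False)

# The string 'None' marks a missing name; turn it into NaN
clean['name'] = clean['name'].replace('None', np.nan)
```

Filtering before converting keeps each step independent, so individual steps can be tested or dropped without affecting the others.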


**Tidiness**  

_**twitter archive table**_
- Multiple URLs in expanded_urls (some are duplicates)
- Retweet and favorite counts should be part of the twitter archive table
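
Moving the retweet and favorite counts into the archive table amounts to a join on the tweet ID. The sketch below assumes the column names seen earlier (`tweet_id` in the archive, `id` in the counts table); the sample frames are stand-ins, and the third `tweet_id` is a made-up value representing a tweet the API could not return.

```python
import pandas as pd

# Stand-in for the archive; the last tweet_id is invented for illustration
archive_sample = pd.DataFrame({
    'tweet_id': [892420643555336193, 892177421306343426, 1],
    'rating_numerator': [13, 13, 12],
})

# Stand-in for tweets_df; values copied from its first two rows
counts_sample = pd.DataFrame({
    'id': [892420643555336193, 892177421306343426],
    'retweet_count': [8215, 6076],
    'favorite_count': [37683, 32373],
})

# Left join keeps every archive row; tweets without API data get NaN counts
merged = (
    archive_sample
    .merge(counts_sample, how='left', left_on='tweet_id', right_on='id')
    .drop(columns='id')
)
```

A left join is used here so that tweets missing from the API response stay visible; an inner join would silently drop them instead.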

### Cleaning Data <a name="cleaning-data"></a>

In [119]:
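# Create copies of data frames (.copy() is needed; plain assignment only
# creates a second name for the same object)
dogs_copy = twitter_dogs_archive.copy()

images_copy = image_predictions.copy()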

## Analysis and Visualization <a name="analysis-and-visualization"></a>

- Most popular dog names
- Most popular dog content
- Rating statistics
- Popularity of the account
- Where are users from?
- Most popular hashtags
- Which breed is associated with which dog type?
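
As a starting point for the first and third questions above, name counts and rating statistics can be computed along these lines. This is a rough sketch on stand-in data: `ratings_sample` copies names and ratings from the first seven archive rows displayed earlier, and the derived `rating` column is an assumption about how ratings might be made comparable.

```python
import pandas as pd

# Names and ratings copied from the first seven archive rows shown earlier
ratings_sample = pd.DataFrame({
    'name': ['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', None, 'Jax'],
    'rating_numerator': [13, 13, 12, 13, 12, 13, 13],
    'rating_denominator': [10, 10, 10, 10, 10, 10, 10],
})

# Most popular dog names: count occurrences, ignoring missing names
name_counts = ratings_sample['name'].value_counts()

# Rating statistics: divide by the denominator so ratings are comparable
ratings_sample['rating'] = (
    ratings_sample['rating_numerator'] / ratings_sample['rating_denominator']
)
rating_stats = ratings_sample['rating'].describe()
```

On the full, cleaned archive the same two calls would give the name ranking and the summary statistics for the report.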

## Reporting <a name="reporting"></a>