# WeRateDogs Twitter Data Wrangling Project
## for Udacity Data Analyst nanodegree data wrangling project

Rubric at https://review.udacity.com/#!/rubrics/1136/view

Student: Cherie Goodenough

Date: 01/20/2019

#### Table of Contents:
1. [Gather](#gather)
2. [Assess](#assess)
    1. [twitter-archive-enhanced data](#wrd-assess)
    1. [breed prediction data](#breed-assess)
    1. [likes and retweets counts data](#counts-assess)
3. [Clean](#clean)
    1. Quality
        1. [twitter-archive-enhanced data](#wrd-qual)
        1. [breed prediction data](#breed-qual)
        1. [likes and retweets counts data](#counts-qual)
    1. [Tidiness](#tidy)
        1. [twitter-archive-enhanced data](#wrd-tidy)
        1. [breed prediction data](#breed-tidy)
        1. [likes and retweets counts data](#counts-tidy)
1. [Analysis](#analysis)

<a id='gather'></a>
## Gather

There are three initial sources of data for this project:
1. Archive file twitter-archive-enhanced.csv. This file was provided to us by WeRateDogs via Udacity. It contains ratings, tweet texts and tweet IDs for 2,356 tweets. It is not up to date, and I will not attempt to add the latest tweets because they would not line up with the second source, which is:
2. A second data file, provided by the Udacity instructors, containing breed predictions made by an AI algorithm (not provided) for each rated dog, based on the image in the rating tweet. This is a .tsv file located at https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
3. I will use the Twitter API and the Tweepy library to get at least the like and retweet counts for each tweet. This information is not in the provided archive.

In [1]:
#load required libraries
import pandas as pd
import requests
import os
import tweepy

In [2]:
# load and quick look at archive file
wrd_df = pd.read_csv('twitter-archive-enhanced.csv')
wrd_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
# And the provided file with breed prediction
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
with open(url.split('/')[-1],mode='wb') as file: # Note to self, 'wb' means open writeable and binary
    file.write(r.content)
breed_df = pd.read_csv(url.split('/')[-1],sep='\t')
breed_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [4]:
# Finally, enhance with more data from twitter. First, get authenticated
# Read the credentials from environment variables (the variable names are my own choice)
# so that the secrets are not stored in the notebook
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

In [5]:
# create dataframe with tweet_id, url, favorite_count, and retweet_count for each id in wrd_df
# include a couple of print statements and a timer to keep track of progress and total run time

# I started out just putting all of the retrieved data for each tweet into a json file, but this was unwieldy and made accessing
# the data I wanted painful, and it did not make the tweet_id easy to grab. I looked around and was guided by Jonathan Allen's
# method on GitHub at https://github.com/JonathanAllen84/Wrangling-WeRateDogs/blob/master/wrangle_act.ipynb

from timeit import default_timer as timer
import json
not_found = {}
tweet_data = {}
start = timer()
for tid in wrd_df['tweet_id']:
    try:
        tweet = api.get_status(tid,tweet_mode='extended')
        print("Found id: ",tid)
        try:
            url = tweet.entities['media'][0]['expanded_url']
        except KeyError:
            print(tid, 'no url')
            url = ''  # tweet has no media entity, so there is no photo url
        tweet_data[tweet.id] = [{
            'url': url,
            'favorite_count': tweet.favorite_count,
            'retweet_count': tweet.retweet_count
        }]
    except tweepy.TweepError as e:
        print('Tweet not found:', tid)
        not_found[tid] = e
with open('tweet_json.txt','w') as myfile:
    json.dump(tweet_data,myfile)
print(timer() - start)

Found id:  892420643555336193
Found id:  892177421306343426
Found id:  891815181378084864
Found id:  891689557279858688
Found id:  891327558926688256
Found id:  891087950875897856
Found id:  890971913173991426
Found id:  890729181411237888
Found id:  890609185150312448
Found id:  890240255349198849
Found id:  890006608113172480
Found id:  889880896479866881
Found id:  889665388333682689
Found id:  889638837579907072
Found id:  889531135344209921
Found id:  889278841981685760
Found id:  888917238123831296
Found id:  888804989199671297
Found id:  888554962724278272
Tweet not found: 888202515573088257
Found id:  888078434458587136
Found id:  887705289381826560
Found id:  887517139158093824
Found id:  887473957103951883
Found id:  887343217045368832
Found id:  887101392804085760
Found id:  886983233522544640
Found id:  886736880519319552
Found id:  886680336477933568
Found id:  886366144734445568
Found id:  886267009285017600
886267009285017600 no url
Found id:  886258384151887873
Found id

Found id:  847251039262605312
Found id:  847157206088847362
Found id:  847116187444137987
Found id:  846874817362120707
Found id:  846514051647705089
Found id:  846505985330044928
846505985330044928 no url
Found id:  846153765933735936
Found id:  846139713627017216
846139713627017216 no url
Found id:  846042936437604353
Found id:  845812042753855489
Found id:  845677943972139009
Tweet not found: 845459076796616705
Found id:  845397057150107648
Found id:  845306882940190720
Found id:  845098359547420673
845098359547420673 no url
Found id:  844979544864018432
Found id:  844973813909606400
Found id:  844704788403113984
Found id:  844580511645339650
Found id:  844223788422217728
Found id:  843981021012017153
843981021012017153 no url
Found id:  843856843873095681
Found id:  843604394117681152
Found id:  843235543001513987
Tweet not found: 842892208864923648
Found id:  842846295480000512
Found id:  842765311967449089
Found id:  842535590457499648
Found id:  842163532590374912
Found id:  842

Found id:  817056546584727552
Found id:  816829038950027264
Found id:  816816676327063552
Found id:  816697700272001025
Found id:  816450570814898180
Found id:  816336735214911488
Found id:  816091915477250048
Found id:  816062466425819140
816062466425819140 no url
Found id:  816014286006976512
Found id:  815990720817401858
Found id:  815966073409433600
Found id:  815745968457060357
815745968457060357 no url
Found id:  815736392542261248
Found id:  815639385530101762
Found id:  815390420867969024
Found id:  814986499976527872
Found id:  814638523311648768
Found id:  814578408554463233
814578408554463233 no url
Found id:  814530161257443328
Found id:  814153002265309185
Found id:  813944609378369540
Found id:  813910438903693312
Found id:  813812741911748608
Found id:  813800681631023104
Found id:  813217897535406080
Found id:  813202720496779264
Found id:  813187593374461952
Found id:  813172488309972993
Found id:  813157409116065792
Found id:  813142292504645637
Found id:  81313036668

Found id:  785639753186217984
Found id:  785533386513321988
Found id:  785515384317313025
785515384317313025 no url
Found id:  785264754247995392
Found id:  785170936622350336
Found id:  784826020293709826
Found id:  784517518371221505
Found id:  784431430411685888
Found id:  784183165795655680
784183165795655680 no url
Found id:  784057939640352768
784057939640352768 no url
Found id:  783839966405230592
Found id:  783821107061198850
Found id:  783695101801398276
Found id:  783466772167098368
Found id:  783391753726550016
Found id:  783347506784731136
Found id:  783334639985389568
Found id:  783085703974514689
Found id:  782969140009107456
Found id:  782747134529531904
Found id:  782722598790725632
Found id:  782598640137187329
Found id:  782305867769217024
Found id:  782021823840026624
Found id:  781955203444699136
Found id:  781661882474196992
Found id:  781655249211752448
781655249211752448 no url
Found id:  781524693396357120
Found id:  781308096455073793
781308096455073793 no url


Rate limit reached. Sleeping for: 716


Found id:  758740312047005698
Found id:  758474966123810816
Found id:  758467244762497024
Found id:  758405701903519748
Found id:  758355060040593408
Found id:  758099635764359168
758099635764359168 no url
Found id:  758041019896193024
Found id:  757741869644341248
Found id:  757729163776290825
Found id:  757725642876129280
Found id:  757611664640446465
Found id:  757597904299253760
Found id:  757596066325864448
Found id:  757400162377592832
Found id:  757393109802180609
Found id:  757354760399941633
Found id:  756998049151549440
Found id:  756939218950160384
Found id:  756651752796094464
Found id:  756526248105566208
Found id:  756303284449767430
Found id:  756288534030475264
Found id:  756275833623502848
Found id:  755955933503782912
Found id:  755206590534418437
Found id:  755110668769038337
Found id:  754874841593970688
Found id:  754856583969079297
Found id:  754747087846248448
Found id:  754482103782404096
Found id:  754449512966619136
Found id:  754120377874386944
Tweet not foun

Found id:  726887082820554753
Found id:  726828223124897792
Found id:  726224900189511680
Found id:  725842289046749185
Found id:  725786712245440512
Found id:  725729321944506368
Found id:  725458796924002305
725458796924002305 no url
Found id:  724983749226668032
Found id:  724771698126512129
Found id:  724405726123311104
Found id:  724049859469295616
Found id:  724046343203856385
Found id:  724004602748780546
Found id:  723912936180330496
Found id:  723688335806480385
Found id:  723673163800948736
Found id:  723179728551723008
Found id:  722974582966214656
Found id:  722613351520608256
Found id:  721503162398597120
Found id:  721001180231503872
Found id:  720785406564900865
Found id:  720775346191278080
Found id:  720415127506415616
Found id:  720389942216527872
Found id:  720340705894408192
Found id:  720059472081784833
Found id:  720043174954147842
Found id:  719991154352222208
Found id:  719704490224398336
Found id:  719551379208073216
Found id:  719367763014393856
Found id:  719

Found id:  699691744225525762
Found id:  699446877801091073
Found id:  699434518667751424
Found id:  699423671849451520
Found id:  699413908797464576
Found id:  699370870310113280
Found id:  699323444782047232
Found id:  699088579889332224
Found id:  699079609774645248
Found id:  699072405256409088
Found id:  699060279947165696
699060279947165696 no url
Found id:  699036661657767936
Found id:  698989035503689728
Found id:  698953797952008193
Found id:  698907974262222848
Found id:  698710712454139905
Found id:  698703483621523456
Found id:  698635131305795584
Found id:  698549713696649216
Found id:  698355670425473025
Found id:  698342080612007937
Found id:  698262614669991936
Found id:  698195409219559425
Found id:  698178924120031232
Found id:  697995514407682048
Found id:  697990423684476929
Found id:  697943111201378304
Found id:  697881462549430272
Found id:  697630435728322560
697630435728322560 no url
Found id:  697616773278015490
Found id:  697596423848730625
Found id:  6975754

Found id:  683849932751646720
Found id:  683834909291606017
Found id:  683828599284170753
Found id:  683773439333797890
Found id:  683742671509258241
Found id:  683515932363329536
683515932363329536 no url
Found id:  683498322573824003
Found id:  683481228088049664
Found id:  683462770029932544
Found id:  683449695444799489
Found id:  683391852557561860
Found id:  683357973142474752
Found id:  683142553609318400
Found id:  683111407806746624
Found id:  683098815881154561
Found id:  683078886620553216
Found id:  683030066213818368
Found id:  682962037429899265
Found id:  682808988178739200
682808988178739200 no url
Found id:  682788441537560576
Found id:  682750546109968385
Found id:  682697186228989953
Found id:  682662431982772225
Found id:  682638830361513985
Found id:  682429480204398592
Found id:  682406705142087680
Found id:  682393905736888321
Found id:  682389078323662849
Found id:  682303737705140231
Found id:  682259524040966145
Found id:  682242692827447297
Found id:  6820880

Rate limit reached. Sleeping for: 715


Found id:  676975532580409345
Found id:  676957860086095872
Found id:  676949632774234114
Found id:  676948236477857792
Found id:  676946864479084545
Found id:  676942428000112642
Found id:  676936541936185344
Found id:  676916996760600576
676916996760600576 no url
Found id:  676897532954456065
Found id:  676864501615042560
Found id:  676821958043033607
Found id:  676819651066732545
Found id:  676811746707918848
Found id:  676776431406465024
Found id:  676617503762681856
Found id:  676613908052996102
Found id:  676606785097199616
Found id:  676603393314578432
Found id:  676593408224403456
676593408224403456 no url
Found id:  676590572941893632
676590572941893632 no url
Found id:  676588346097852417
Found id:  676582956622721024
Found id:  676575501977128964
Found id:  676533798876651520
Found id:  676496375194980353
Found id:  676470639084101634
Found id:  676440007570247681
Found id:  676430933382295552
Found id:  676263575653122048
Found id:  676237365392908289
Found id:  67621968703

Found id:  671154572044468225
Found id:  671151324042559489
Found id:  671147085991960577
Found id:  671141549288370177
Found id:  671138694582165504
Found id:  671134062904504320
Found id:  671122204919246848
Found id:  671115716440031232
Found id:  671109016219725825
Found id:  670995969505435648
Found id:  670842764863651840
Found id:  670840546554966016
Found id:  670838202509447168
Found id:  670833812859932673
Found id:  670832455012716544
Found id:  670826280409919488
Found id:  670823764196741120
Found id:  670822709593571328
Found id:  670815497391357952
Found id:  670811965569282048
Found id:  670807719151067136
Found id:  670804601705242624
Found id:  670803562457407488
Found id:  670797304698376195
Found id:  670792680469889025
Found id:  670789397210615808
Found id:  670786190031921152
Found id:  670783437142401025
Found id:  670782429121134593
Found id:  670780561024270336
Found id:  670778058496974848
Found id:  670764103623966721
Found id:  670755717859713024
Found id: 

Found id:  666293911632134144
Found id:  666287406224695296
Found id:  666273097616637952
Found id:  666268910803644416
Found id:  666104133288665088
Found id:  666102155909144576
Found id:  666099513787052032
Found id:  666094000022159362
Found id:  666082916733198337
Found id:  666073100786774016
Found id:  666071193221509120
Found id:  666063827256086533
Found id:  666058600524156928
Found id:  666057090499244032
Found id:  666055525042405380
Found id:  666051853826850816
Found id:  666050758794694657
Found id:  666049248165822465
Found id:  666044226329800704
Found id:  666033412701032449
Found id:  666029285002620928
Found id:  666020888022790149
1922.7781458000002


In [6]:
for x in not_found.keys():
    print( x, not_found[x])

888202515573088257 [{'code': 144, 'message': 'No status found with that ID.'}]
873697596434513921 [{'code': 144, 'message': 'No status found with that ID.'}]
872668790621863937 [{'code': 144, 'message': 'No status found with that ID.'}]
869988702071779329 [{'code': 144, 'message': 'No status found with that ID.'}]
866816280283807744 [{'code': 144, 'message': 'No status found with that ID.'}]
861769973181624320 [{'code': 144, 'message': 'No status found with that ID.'}]
845459076796616705 [{'code': 144, 'message': 'No status found with that ID.'}]
842892208864923648 [{'code': 144, 'message': 'No status found with that ID.'}]
837012587749474308 [{'code': 144, 'message': 'No status found with that ID.'}]
827228250799742977 [{'code': 144, 'message': 'No status found with that ID.'}]
812747805718642688 [{'code': 144, 'message': 'No status found with that ID.'}]
802247111496568832 [{'code': 144, 'message': 'No status found with that ID.'}]
775096608509886464 [{'code': 144, 'message': 'No sta

In [7]:
len(not_found)

16

In [8]:
with open ('tweet_json.txt') as myfile:
    json_tweets = json.load(myfile)
counts_df = pd.DataFrame(list(json_tweets.items()), columns=['tweet_id','temp_data'])
counts_df.head()

Unnamed: 0,tweet_id,temp_data
0,892420643555336193,[{'url': 'https://twitter.com/dog_rates/status...
1,892177421306343426,[{'url': 'https://twitter.com/dog_rates/status...
2,891815181378084864,[{'url': 'https://twitter.com/dog_rates/status...
3,891689557279858688,[{'url': 'https://twitter.com/dog_rates/status...
4,891327558926688256,[{'url': 'https://twitter.com/dog_rates/status...


In [9]:
counts_df['temp_data'][1]

[{'url': 'https://twitter.com/dog_rates/status/892177421306343426/photo/1',
  'favorite_count': 32593,
  'retweet_count': 6119}]

In [10]:
counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 2 columns):
tweet_id     2340 non-null object
temp_data    2340 non-null object
dtypes: object(2)
memory usage: 36.6+ KB
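
The `temp_data` column still holds a one-element list of dictionaries per tweet, and `tweet_id` comes back as a string because JSON object keys are always strings. A minimal sketch of one way to flatten this into ordinary columns is shown here. The `json_tweets` dictionary below is a small stand-in, not the real file: the first entry copies the output of `In [9]`, and the counts in the second entry are made-up placeholders. The name `counts_flat` is also just illustrative.

```python
import pandas as pd

# Stand-in for the dictionary loaded from tweet_json.txt.
# First entry mirrors the In [9] output; the counts in the second are placeholders.
json_tweets = {
    "892177421306343426": [{
        "url": "https://twitter.com/dog_rates/status/892177421306343426/photo/1",
        "favorite_count": 32593,
        "retweet_count": 6119,
    }],
    "886267009285017600": [{"url": "", "favorite_count": 0, "retweet_count": 0}],
}

# Unpack the single dictionary stored in each list and restore the integer id
records = [
    {"tweet_id": int(tid), **payload[0]}
    for tid, payload in json_tweets.items()
]
counts_flat = pd.DataFrame(
    records, columns=["tweet_id", "url", "favorite_count", "retweet_count"]
)
print(counts_flat.dtypes)
```

Casting `tweet_id` back to an integer here is only one option; the point is that it should end up with the same type as the `tweet_id` column in the other tables before any merge.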


<a id='assess'></a>
## Assess

**Data Quality** is evaluated on the following dimensions:

 - Completeness: Are there missing records or not? Are there specific rows, columns, or cells missing?
 
 - Validity: invalid records don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables). In this case, each record needs to be a rating of a dog, which as we will see is not always the case.
 
 - Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
 
 - Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

 
**Data Tidiness** is evaluated on the following dimensions:
 - Each variable has a column
 - Each observation is in a single row
 - Each observational unit has its own table
 
**Schema**
 - Only observations about dogs were included in the analysis
 - An observation was considered a dog if at least one of the three dog/not-dog predictions was True
 - Only original tweets are valid (no reply-to or retweets)
 - Missing data in the dog type (doggo, floofer, etc) is acceptable and that information is retained.
 - All denominators must be 10
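
The schema rules above can be expressed as boolean flags, which makes it easy to see which rule a record fails. The following is only a sketch on a tiny made-up table: the column names match the archive and prediction files, but the five rows and the names `archive`, `predictions`, and `merged` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: 1 = valid, 2 = reply, 3 = retweet, 4 = denominator of 170, 5 = no dog predicted
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "in_reply_to_status_id": [np.nan, 99.0, np.nan, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 98.0, np.nan, np.nan],
    "rating_denominator": [10, 10, 10, 170, 10],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "p1_dog": [True, True, True, True, False],
    "p2_dog": [False, True, False, True, False],
    "p3_dog": [False, False, True, True, False],
})

merged = archive.merge(predictions, on="tweet_id", how="inner")

# Original tweets have neither a reply-to id nor a retweeted-status id
is_original = merged["in_reply_to_status_id"].isna() & merged["retweeted_status_id"].isna()
# A dog observation needs at least one of the three predictions to be a dog
has_dog = merged[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
# The schema requires a denominator of exactly 10
denominator_ok = merged["rating_denominator"].eq(10)

merged["valid"] = is_original & has_dog & denominator_ok
print(merged[["tweet_id", "valid"]])
```

Keeping the three flags separate, rather than filtering immediately, leaves room to repair a record (for example by rescaling a rating) instead of dropping it.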

#### Assess: Twitter Archive data
<a id="wrd-assess"></a>

In [11]:
# Visual assessment of wrd_df was done in Excel, which is not happy importing the long integers used for ids.
# Nothing is obviously amiss, except that several columns are unlikely to be used. Drop those when clearly not useful.

# programmatic assessment here
wrd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

1. tweet_id is an integer (as are the other ids, but we won't use them)
1. timestamp is a string
1. Many rows have no reply-to or retweeted information, but that makes sense. It seems worth looking at expanded_urls to see if there is some formatting issue.
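
As a rough sketch of how the first two observations might be addressed, assuming the goal is to treat `tweet_id` as an identifier rather than a number and to have real datetimes, the conversion could look like this. The two rows are copied from the `head()` output above; the name `sample` is illustrative.

```python
import pandas as pd

# Two rows copied from the archive preview
sample = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# Ids are labels, not quantities, so store them as strings
sample["tweet_id"] = sample["tweet_id"].astype(str)

# Parse the timestamp strings, including the +0000 offset, into UTC datetimes
sample["timestamp"] = pd.to_datetime(
    sample["timestamp"], format="%Y-%m-%d %H:%M:%S %z", utc=True
)
print(sample.dtypes)
```

Whichever type is chosen for `tweet_id`, it needs to be the same in all three tables so that they can be joined later.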

In [12]:
wrd_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


Ok, so that max number is way out there.

In [13]:
high_rate_list = wrd_df[wrd_df.rating_numerator > 20].index.values.tolist()

In [14]:
for i in high_rate_list:
    print(i, wrd_df.loc[i,'text'],wrd_df.loc[i,'rating_numerator'])
    print( '\n')

188 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research 420


189 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10 666


290 @markhoppus 182/10 182


313 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho 960


340 RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu… 75


433 The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd 84


516 Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx 24


695 This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS 75


763 This is Sophie. She's a Jubilant Bush Puppe

 - Several ratings need to be converted to be out of ten
 - The Udacity rating_numerator extraction failed when decimal values were used (340 Logan, 695 Logan (again), 763 Sophie, 1712 this is a group)
 - 313 jomny: the numerator 960 should be 13, the denominator should be 10, and the name is not capitalized
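
One possible way to re-extract the ratings so that decimals survive is a regular expression that allows an optional fractional part in the numerator. This is a sketch, not the extraction Udacity used: the pattern, the helper name `extract_rating`, and the rule of skipping candidates with a zero denominator (which lets the jomny tweet resolve to 13/10) are all my assumptions.

```python
import re

# Numerator with an optional decimal part, a slash, then an integer denominator
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the first candidate with a nonzero denominator."""
    for numerator, denominator in RATING_PATTERN.findall(text):
        if float(denominator) != 0:
            return float(numerator), float(denominator)
    return None


logan = ("This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. "
         "H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS")
jomny = ("@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid "
         "rating, 13/10 is tho")

print(extract_rating(logan))
print(extract_rating(jomny))
```

A pattern like this still cannot tell a rating from other fractions in the text, so a tweet such as the one containing "24/7" would need to be handled separately.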

In [15]:
wrd_df[wrd_df.rating_numerator == 1776]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
979,749981277374128128,,,2016-07-04 15:00:45 +0000,"<a href=""https://about.twitter.com/products/tw...",This is Atticus. He's quite simply America af....,,,,https://twitter.com/dog_rates/status/749981277...,1776,10,Atticus,,,,


In [16]:
print(wrd_df.loc[979, 'text'])

This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh


In [17]:
# Got the code for embedding tweets here: 
# https://github.com/jupyter/notebook/issues/2790
# go to tweet and choose Embed Tweet, paste it into s (see below)

class Tweet(object):
    def __init__(self, embed_str=None):
        self.embed_str = embed_str

    def _repr_html_(self):
        return self.embed_str


In [18]:
s  = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is Atticus. He&#39;s quite simply America af. 1776/10 <a href="https://t.co/GRXwMxLBkh">pic.twitter.com/GRXwMxLBkh</a></p>&mdash; WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/749981277374128128?ref_src=twsrc%5Etfw">July 4, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

OK, it's "correct." I guess that's a risk of analyzing data from a humorous source.

In [19]:
# what about the denominator of 170
wrd_df[wrd_df.rating_denominator == 170]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to this unbelievably well behaved sq...,,,,https://twitter.com/dog_rates/status/731156023...,204,170,this,,,,


In [20]:
print(wrd_df.loc[1120, 'text'])

Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv


And another funny-but-correct value:

In [21]:
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once <a href="https://t.co/yGQI3He3xv">pic.twitter.com/yGQI3He3xv</a></p>&mdash; WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/731156023742988288?ref_src=twsrc%5Etfw">May 13, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

The moral of this story is that I'm not going to worry too much about "accuracy" for this dataset. "They're all good dogs, Brent." https://twitter.com/brant/status/775407594802335744
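
If group ratings such as 204/170 were to be brought in line with the schema rule that all denominators are 10, one simple option is to rescale the numerator proportionally. The sketch below uses three numerator/denominator pairs that appear in the outputs above (204/170, 84/70, 13/10); the column name `rating_out_of_10` and the frame name `ratings` are hypothetical.

```python
import pandas as pd

# Ratings taken from tweets shown earlier in this notebook
ratings = pd.DataFrame({
    "rating_numerator": [204, 84, 13],
    "rating_denominator": [170, 70, 10],
})

# Scale every rating so that it is expressed out of ten
ratings["rating_out_of_10"] = (
    ratings["rating_numerator"] * 10 / ratings["rating_denominator"]
)
print(ratings)
```

Rows with a denominator of zero (like the 960/00 reply) would produce infinite values here, so they would need to be corrected or excluded first.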

#### Assess: Breed prediction data
<a id="breed-assess"></a>

In [22]:
# visual assessment shows that there are several tweets with more than one photo and that not all of the predictions match

# Also, some of the predictions are wrong, if funny. 
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is a curly Ticonderoga named Pepe. No feet. Loves to jet ski. 11/10 would hug until forever <a href="https://t.co/cyDfaK8NBc">pic.twitter.com/cyDfaK8NBc</a></p>&mdash; WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/666983947667116034?ref_src=twsrc%5Etfw">November 18, 2015</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")

Tweet(s)

This guy is categorized as a swab, a chainsaw, and a wig! Hilarious! I don't think this is an issue of quality or tidiness, just something to be aware of - some of these are not well classified by the AI algorithm.

As for the multiple photos, this is a tidiness issue.

In [23]:
breed_df[breed_df.p1 == 'swab']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,swab,0.589446,False,chain_saw,0.190142,False,wig,0.03451,False


In [24]:
# Now for some programmatic assessment of the breed predictions
breed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


There's no missing data, which is amazing, and the data types are all appropriate.

In [25]:
breed_df.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


The ranges of the number of images and the prediction confidences are all reasonable, although a maximum confidence of 1 is pretty bold.

In [26]:
breed_df[breed_df.p1_conf == 1]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
106,667866724293877760,https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg,1,jigsaw_puzzle,1.0,False,prayer_rug,1.0113e-08,False,doormat,1.74017e-10,False


In [27]:
breed_df.iloc[106]

tweet_id                                 667866724293877760
jpg_url     https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg
img_num                                                   1
p1                                            jigsaw_puzzle
p1_conf                                                   1
p1_dog                                                False
p2                                               prayer_rug
p2_conf                                          1.0113e-08
p2_dog                                                False
p3                                                  doormat
p3_conf                                         1.74017e-10
p3_dog                                                False
Name: 106, dtype: object

This is an image with a dog looking at a jigsaw puzzle which is the main focus of the photo. This is accurate, but doesn't meet the schema.

In [28]:
breed_df[breed_df.tweet_id == breed_df.loc[106, 'tweet_id']]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
106,667866724293877760,https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg,1,jigsaw_puzzle,1.0,False,prayer_rug,1.0113e-08,False,doormat,1.74017e-10,False


In [29]:
# check for no dog results
breed_df[~breed_df.p1_dog & ~breed_df.p2_dog & ~breed_df.p3_dog]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,4.588540e-02,False,terrapin,1.788530e-02,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False,cock,3.391940e-02,False,partridge,5.206580e-05,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False,desk,8.554740e-02,False,bookcase,7.947970e-02,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False,otter,1.525000e-02,False,great_grey_owl,1.320720e-02,False
25,666362758909284353,https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg,1,guinea_pig,0.996496,False,skunk,2.402450e-03,False,hamster,4.608630e-04,False
29,666411507551481857,https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg,1,coho,0.404640,False,barracouta,2.714850e-01,False,gar,1.899450e-01,False
45,666786068205871104,https://pbs.twimg.com/media/CUDmZIkWcAAIPPe.jpg,1,snail,0.999888,False,slug,5.514170e-05,False,acorn,2.625800e-05,False
50,666837028449972224,https://pbs.twimg.com/media/CUEUva1WsAA2jPb.jpg,1,triceratops,0.442113,False,armadillo,1.140710e-01,False,common_iguana,4.325530e-02,False
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,swab,0.589446,False,chain_saw,1.901420e-01,False,wig,3.450970e-02,False
53,667012601033924608,https://pbs.twimg.com/media/CUG0bC0U8AAw2su.jpg,1,hyena,0.987230,False,African_hunting_dog,1.260080e-02,False,coyote,5.735010e-05,False


Let's check a few:

In [30]:
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Meet Elliot. He&#39;s a Canadian Forrest Pup. Unusual number of antlers for a dog. Sneaky tongue slip to celebrate <a href="https://twitter.com/hashtag/Canada150?src=hash&amp;ref_src=twsrc%5Etfw">#Canada150</a>. 12/10 would pet <a href="https://t.co/cgwJwowTMC">pic.twitter.com/cgwJwowTMC</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/881268444196462592?ref_src=twsrc%5Etfw">July 1, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")

Tweet(s)

In [31]:
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is Steven. He has trouble relating to other dogs. Quite shy. Neck longer than average. Tropical probably. 11/10 would still pet <a href="https://t.co/2mJCDEJWdD">pic.twitter.com/2mJCDEJWdD</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/879050749262655488?ref_src=twsrc%5Etfw">June 25, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

In [32]:
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Real funny guys. Sending in a pic without a dog in it. Hilarious. We&#39;ll rate the rug tho because it&#39;s giving off a very good vibe. 11/10 <a href="https://t.co/GCD1JccCyi">pic.twitter.com/GCD1JccCyi</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/870804317367881728?ref_src=twsrc%5Etfw">June 3, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

Hmmm. Most are right. The last one shown is a dog, but I wouldn't have the foggiest idea what breed it is, and I can't blame the AI for not seeing it.

<a id="counts-assess"></a>
##### Likes and Retweets Counts Data

In [33]:
counts_df.head()

Unnamed: 0,tweet_id,temp_data
0,892420643555336193,[{'url': 'https://twitter.com/dog_rates/status...
1,892177421306343426,[{'url': 'https://twitter.com/dog_rates/status...
2,891815181378084864,[{'url': 'https://twitter.com/dog_rates/status...
3,891689557279858688,[{'url': 'https://twitter.com/dog_rates/status...
4,891327558926688256,[{'url': 'https://twitter.com/dog_rates/status...


In [34]:
counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 2 columns):
tweet_id     2340 non-null object
temp_data    2340 non-null object
dtypes: object(2)
memory usage: 36.6+ KB


In [35]:
counts_df.describe()

Unnamed: 0,tweet_id,temp_data
count,2340,2340
unique,2340,2340
top,666691418707132416,[{'url': 'https://twitter.com/dog_rates/status...
freq,1,1


There are 16 missing entries, which is not bad out of 2356 ids. Also, this is a very popular account, so I find a minimum of 0 likes or retweets odd. To assess this, it will be a lot easier if we pull the data out of the temp_data column into their own columns. While this is sort of a cleaning step, it is useful now.

In [36]:
# there must be a better way!
counts_df[counts_df.tweet_id == '857989990357356544'].temp_data.values[0][0]['favorite_count']

16259

In [37]:
favorites = []
retweets = []
urlarr = []
for tid in counts_df.tweet_id:
    favorites.append(counts_df[counts_df.tweet_id == tid].temp_data.values[0][0]['favorite_count'])
    retweets.append(counts_df[counts_df.tweet_id == tid].temp_data.values[0][0]['retweet_count'])
    urlarr.append(counts_df[counts_df.tweet_id == tid].temp_data.values[0][0]['url'])
counts_df['favorite_count'] = favorites
counts_df['retweet_count'] = retweets
counts_df['url'] = urlarr
counts_df.head()

Unnamed: 0,tweet_id,temp_data,favorite_count,retweet_count,url
0,892420643555336193,[{'url': 'https://twitter.com/dog_rates/status...,37950,8287,https://twitter.com/dog_rates/status/892420643...
1,892177421306343426,[{'url': 'https://twitter.com/dog_rates/status...,32593,6119,https://twitter.com/dog_rates/status/892177421...
2,891815181378084864,[{'url': 'https://twitter.com/dog_rates/status...,24539,4054,https://twitter.com/dog_rates/status/891815181...
3,891689557279858688,[{'url': 'https://twitter.com/dog_rates/status...,41293,8416,https://twitter.com/dog_rates/status/891689557...
4,891327558926688256,[{'url': 'https://twitter.com/dog_rates/status...,39484,9129,https://twitter.com/dog_rates/status/891327558...
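
On the "there must be a better way" comment: one possible alternative is to unpack the dictionaries in a single pass instead of filtering the dataframe once per tweet. This is a minimal sketch, assuming each temp_data cell is a one-element list holding a dict with url, favorite_count and retweet_count keys (as the lookup above suggests). The `demo` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for counts_df; the values are invented for illustration.
demo = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'temp_data': [
        [{'url': 'https://example.com/1', 'favorite_count': 10, 'retweet_count': 3}],
        [{'url': 'https://example.com/2', 'favorite_count': 0, 'retweet_count': 7}],
    ],
})

# Take the single dict out of each list, then let pandas expand the keys into columns.
records = [cell[0] for cell in demo['temp_data']]
extracted = pd.DataFrame(records, index=demo.index)

# Keep only the three fields used in this notebook and attach them to the frame.
demo = demo.join(extracted[['favorite_count', 'retweet_count', 'url']])
print(demo[['tweet_id', 'favorite_count', 'retweet_count']])
```

This touches each row once, so it should scale better than the per-tweet filter on a few thousand rows.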


In [38]:
counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2340 non-null object
temp_data         2340 non-null object
favorite_count    2340 non-null int64
retweet_count     2340 non-null int64
url               2340 non-null object
dtypes: int64(2), object(3)
memory usage: 91.5+ KB


In [39]:
counts_df[counts_df.favorite_count == 0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167 entries, 31 to 2244
Data columns (total 5 columns):
tweet_id          167 non-null object
temp_data         167 non-null object
favorite_count    167 non-null int64
retweet_count     167 non-null int64
url               167 non-null object
dtypes: int64(2), object(3)
memory usage: 7.8+ KB


167 tweets with 0 likes, and several of them have several thousand retweets? Hmmmm

In [40]:
counts_df[counts_df.favorite_count == 0].tweet_id

31      886054160059072513
35      885311592912609280
67      879130579576475649
72      878404777348136964
73      878316110768087041
77      877611172832227328
90      874434818259525634
95      873337748698140672
106     871166179821445120
120     868639477480148993
126     867072653475098625
132     866094527597207552
141     863471782782697472
153     860981674716409858
154     860924035999428608
159     860177593139703809
165     858860390427611136
174     857062103051644929
176     856602993587888130
179     856330835276025856
188     855245323840757760
189     855138241867124737
198     852936405516943360
205     851953902622658560
206     851861385021730816
216     849668094696017920
224     847978865427394560
225     847971574464610304
243     845098359547420673
258     841833993020538882
               ...        
761     776249906839351296
766     775898661951791106
781     773336787167145985
787     772615324260794368
798     771171053431250945
802     771004394259247104
8

In [41]:
tweet = api.get_status(747242308580548608) # test one of them

In [42]:
tweet.text

'RT @dog_rates: This pupper killed this great white in an epic sea battle. Now wears it as a trophy. Such brave. Much fierce. 13/10 https://…'

In [43]:
wrd_df[wrd_df.tweet_id == 747242308580548608]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1012,747242308580548608,,,2016-06-27 01:37:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This pupper killed this great w...,7.047611e+17,4196984000.0,2016-03-01 20:11:59 +0000,https://twitter.com/dog_rates/status/704761120...,13,10,,,,pupper,


Ah! These are retweets, not original rating tweets. Let's check:

* First we need to convert tweet_id to integers to match the other dataframes. I should have done this when harvesting the tweets.

In [44]:
counts_df['tweet_id'] = pd.to_numeric(counts_df['tweet_id'],downcast='integer')
counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2340 non-null int64
temp_data         2340 non-null object
favorite_count    2340 non-null int64
retweet_count     2340 non-null int64
url               2340 non-null object
dtypes: int64(3), object(2)
memory usage: 91.5+ KB


In [45]:
no_likes = counts_df[counts_df.favorite_count == 0].tweet_id.tolist()
for i in no_likes:
    temp = wrd_df[wrd_df.tweet_id == i].retweeted_status_id.item()
    print(temp)

        

8.860537344211026e+17
8.305833205850685e+17
8.780576130401157e+17
8.782815110064783e+17
6.690003974455337e+17
8.76850772322988e+17
8.663349647612027e+17
8.732137756329779e+17
8.41077006473257e+17
8.685522785248379e+17
8.650134204453683e+17
8.378201676945285e+17
8.630624715311677e+17
8.605637731402097e+17
8.609144852504699e+17
7.616729943768064e+17
8.395493263596708e+17
8.57061112319234e+17
8.44704788403114e+17
8.563301587682181e+17
8.421635325903749e+17
8.551225332674601e+17
8.316500515250545e+17
8.29374341691347e+17
8.482893821761004e+17
8.331246945974436e+17
8.323698773316936e+17
8.47971000004354e+17
7.733088242540298e+17
8.174238601360835e+17
8.406323370628628e+17
6.671521640794235e+17
8.392899192982241e+17
8.3890598062882e+17
7.838399664052306e+17
8.207497168456868e+17
8.366481490034852e+17
8.178278394877379e+17
7.869630643735347e+17
8.35264098648617e+17
7.530398308215112e+17
8.295019951909847e+17
8.324343582922097e+17
8.327663821985669e+17
7.867090828498289e+17
7.932864763017994e+

Yes, that's it. Also check where retweet_count is 0.

In [46]:
counts_df[counts_df.retweet_count == 0].tweet_id.item()

838085839343206401

In [47]:
wrd_df[wrd_df.tweet_id == 838085839343206401]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
291,838085839343206401,8.380855e+17,2894131000.0,2017-03-04 17:56:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@bragg6of8 @Andy_Pace_ we are still looking fo...,,,,,15,10,,,,,


Only one tweet has a retweet_count of zero, and it is in fact a reply (its in_reply_to_status_id is set).
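
Both checks above (zero likes means retweet, zero retweets means reply) can also be expressed as a merge followed by a boolean test, rather than printing values and reading them by eye. This is only a sketch: the `archive` and `counts` frames below are small invented stand-ins for wrd_df and counts_df, and it assumes tweet_id has the same dtype in both frames.

```python
import pandas as pd

nan = float('nan')

# Invented stand-ins for wrd_df and counts_df (three tweets only).
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [nan, 7.0e17, nan],    # tweet 2 is a retweet
    'in_reply_to_status_id': [nan, nan, 8.0e17],  # tweet 3 is a reply
})
counts = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'favorite_count': [100, 0, 5],
    'retweet_count': [20, 3000, 0],
})

# Bring the archive columns next to the counts so one filter answers the question.
merged = counts.merge(archive, on='tweet_id', how='left')

zero_likes = merged[merged.favorite_count == 0]
all_zero_likes_are_retweets = zero_likes.retweeted_status_id.notnull().all()

zero_retweets = merged[merged.retweet_count == 0]
all_zero_retweets_are_replies = zero_retweets.in_reply_to_status_id.notnull().all()

print(all_zero_likes_are_retweets, all_zero_retweets_are_replies)
```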


In [48]:
counts_df.head()

Unnamed: 0,tweet_id,temp_data,favorite_count,retweet_count,url
0,892420643555336193,[{'url': 'https://twitter.com/dog_rates/status...,37950,8287,https://twitter.com/dog_rates/status/892420643...
1,892177421306343426,[{'url': 'https://twitter.com/dog_rates/status...,32593,6119,https://twitter.com/dog_rates/status/892177421...
2,891815181378084864,[{'url': 'https://twitter.com/dog_rates/status...,24539,4054,https://twitter.com/dog_rates/status/891815181...
3,891689557279858688,[{'url': 'https://twitter.com/dog_rates/status...,41293,8416,https://twitter.com/dog_rates/status/891689557...
4,891327558926688256,[{'url': 'https://twitter.com/dog_rates/status...,39484,9129,https://twitter.com/dog_rates/status/891327558...


In [49]:
counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2340 non-null int64
temp_data         2340 non-null object
favorite_count    2340 non-null int64
retweet_count     2340 non-null int64
url               2340 non-null object
dtypes: int64(3), object(2)
memory usage: 91.5+ KB


Note that the absence of null values in url is misleading, since I put in an empty string wherever a url was not found. This will not affect the intended analysis, but it would be irritating to someone planning to use the urls.

### Quality

#### wrd_df
1. <font color='green'>Do not want retweeted/reply_to - just original dog_rates tweets</font>
1. <font color='green'>Unused columns: retweet columns, reply-to columns, source, expanded_urls</font>
1. <font color='green'>tweet_id is int64 but should be non-numeric (we don't perform arithmetic on them, or shouldn't, at least)</font>
1. <font color='green'>Timestamp is str not datetime</font>
1. <font color='green'>Missing names due to typos</font>
1. <font color= 'green'>Udacity's rating_numerator extraction failed when decimal values were used (340 Logan, 695 Logan (again), 793 Sophie, 1712 is a group)</font>
1. <font color= 'green'>Many denominators are not 10</font>
1. <font color='green'>Columns doggo, floofer, etc are not used once data is extracted (which is a tidiness issue, so this will happen at the end)</font>

#### breed_df
1. <font color='green'>Tweet_id is integer, not string.</font>
1. <font color='green'>Pepe's breed is wrong (tweet_id = 666983947667116034)</font>
1. <font color='green'>tweet_id 667866724293877760 misidentified - I can't identify the breed, so instead we'll drop any row with no prediction identified as a dog.</font>
1. <font color='green'>There are 324 tweets not identified as dogs</font>

#### counts_df
1. <font color='green'>Tweet_id is integer</font>
1. <font color='green'>16 of the tweet_ids from wrd_df are missing (dealt with in wrd_df clean step)</font>
1. <font color='green'>favorite_count and retweet_count == 0 are retweets</font>

### Tidiness

#### wrd_df
1. <font color='green'>Floofer, doggo, puppo and pupper are different columns.</font>

#### Tidiness for analysis
1. Makes sense to pull data for analysis into one dataframe. Dataframe should have tweet_id, breed prediction p1 (best guess), rating (calculated numerator over denominator), favorite_count, retweet_count.


<a id='clean'></a>
## Clean

<a id='wrd-qual'></a>
### wrd_df  - Quality

#### Define: Do not want retweeted/reply_to - just original dog_rates tweets
Drop rows with values in retweeted/reply_to columns

#### Code

In [153]:
# copy dataframe for cleaning
wrd_clean = wrd_df.copy()
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [154]:
# filter the copy made above rather than wrd_df, so wrd_clean is not a view of the original
wrd_clean = wrd_clean[wrd_clean.in_reply_to_user_id.isnull()]
wrd_clean = wrd_clean[wrd_clean.retweeted_status_id.isnull()]

#### Test

In [155]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2097 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     2097 non-null object
source                        2097 non-null object
text                          2097 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2094 non-null object
rating_numerator              2097 non-null int64
rating_denominator            2097 non-null int64
name                          2097 non-null object
doggo                         2097 non-null object
floofer                       2097 non-null object
pupper                        2097 non-null object
puppo                         2097 non-null object
dtypes: float64(4), int64(3), object(10)

In [156]:
len(wrd_df)

2356

In [157]:
2356-2097

259

Dropped 259 rows.
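
The count is consistent with the info() output: 78 rows had an in_reply_to value and 181 had a retweeted_status_id, and 78 + 181 = 259, so no row was both a reply and a retweet. The same filter and check can be written with a single boolean mask and assertions. This is a sketch on an invented four-row frame, not the notebook's actual data.

```python
import pandas as pd

nan = float('nan')

# Invented stand-in for wrd_df: one reply, one retweet, two originals.
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [nan, 5.0, nan, nan],
    'retweeted_status_id': [nan, nan, 6.0, nan],
})

is_reply = demo.in_reply_to_status_id.notnull()
is_retweet = demo.retweeted_status_id.notnull()

# Keep only rows that are neither replies nor retweets.
originals = demo[~is_reply & ~is_retweet].reset_index(drop=True)

dropped = len(demo) - len(originals)
# Holds always: every dropped row is a reply or a retweet.
assert dropped == (is_reply | is_retweet).sum()
# Holds only when no row is both, which matches 78 + 181 = 259 above.
assert dropped == is_reply.sum() + is_retweet.sum()
print(dropped)
```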

In [158]:
wrd_clean.reset_index(inplace=True,drop=True)

In [159]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 17 columns):
tweet_id                      2097 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     2097 non-null object
source                        2097 non-null object
text                          2097 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2094 non-null object
rating_numerator              2097 non-null int64
rating_denominator            2097 non-null int64
name                          2097 non-null object
doggo                         2097 non-null object
floofer                       2097 non-null object
pupper                        2097 non-null object
puppo                         2097 non-null object
dtypes: float64(4), int64(3), object(10)

#### Define: drop unused columns retweets, reply-to, source, expanded urls

#### Code:

In [160]:
wrd_clean.columns.values.tolist()

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

In [161]:
drop_list = ['in_reply_to_status_id',
 'in_reply_to_user_id',
 'source',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls']
wrd_clean.drop(columns=drop_list,inplace=True)
# TEST
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 10 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null object
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(3), object(7)
memory usage: 163.9+ KB


#### Define: Change tweet_id type from integer to str

#### Code:

In [162]:
wrd_clean.tweet_id = wrd_clean.tweet_id.astype(str)

# TEST

wrd_clean.info()
wrd_clean.tweet_id[0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 10 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null object
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(2), object(8)
memory usage: 163.9+ KB


'892420643555336193'
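
One practical reason to keep ids as strings: they are too large to survive a trip through float64, which is what happens whenever a column picks up a NaN (the retweeted_status_id values printed earlier as 8.860537344211026e+17 are an example). A small illustration using the id shown above:

```python
# The first tweet_id from the archive, as an exact Python integer.
tweet_id = 892420643555336193

# float64 has a 53-bit mantissa, so an odd integer this large cannot be represented exactly.
round_trip = int(float(tweet_id))

print(round_trip == tweet_id)   # False: the low digits were lost
print(str(tweet_id))            # the string form keeps every digit
```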

#### Define: Convert timestamp from str to datetime

#### Code:

In [163]:
wrd_clean.timestamp = pd.to_datetime(wrd_clean.timestamp)

# Test
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 10 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 163.9+ KB
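
The raw timestamps carry a +0000 offset (for example '2016-06-27 01:37:04 +0000'). The output above shows a timezone-naive column, but depending on the pandas version the same call may return a timezone-aware one. A hedged sketch of making that explicit, using two timestamps that appear earlier in this notebook:

```python
import pandas as pd

raw = pd.Series(['2016-06-27 01:37:04 +0000', '2017-03-04 17:56:49 +0000'])

# utc=True makes the result timezone-aware (UTC) regardless of how offsets are inferred.
ts = pd.to_datetime(raw, utc=True)

# Dropping the timezone gives a naive column like the one shown in the info() output.
naive = ts.dt.tz_localize(None)

print(ts.dt.year.tolist())
print(naive.dt.tz)
```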


#### Define: Re-extract rating_numerator and rating_denominator from text to capture decimals correctly

#### Code:

In [164]:
import re

# optional decimal numerator, a slash, then an integer denominator
rating_pattern = re.compile(r'((?:\d+\.)?\d+)/(\d+)')
for x in wrd_clean.index:
    m = rating_pattern.search(wrd_clean.loc[x, 'text'])  # first rating-like match in the text
    # numerators are rounded to the nearest integer to match the int64 column
    wrd_clean.loc[x, 'rating_numerator'] = round(float(m.group(1)))
    wrd_clean.loc[x, 'rating_denominator'] = int(m.group(2))
# regex was suggested by reviewer and worked well

#### Test:

In [165]:
wrd_clean[wrd_clean.name == 'Sophie'].text.values[1]

"This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq"

In [166]:
wrd_clean[wrd_clean.name == 'Sophie'].rating_numerator.values[1]

11

In [167]:
wrd_clean[wrd_clean.name == 'Logan'].text.values[0]

"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS"

In [168]:
wrd_clean[wrd_clean.name == 'Logan'].rating_numerator.values[0] # numerators are rounded to the nearest integer

10

#### Define: Denominators are not all 10. Either normalize all to a denominator of 10 or compute the ratio into a rating column.

#### Code:

In [169]:
wrd_clean[wrd_clean.rating_denominator != 10]

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
339,820690176645140481,2017-01-15 17:52:40,The floofs have been released I repeat the flo...,84,70,,,,,
403,810984652412424192,2016-12-19 23:06:23,Meet Sam. She smiles 24/7 &amp; secretly aspir...,24,7,Sam,,,,
700,758467244762497024,2016-07-28 01:00:57,Why does this never happen at my front door......,165,150,,,,,
853,740373189193256964,2016-06-08 02:41:38,"After so many requests, this is Bretagne. She ...",9,11,,,,,
904,731156023742988288,2016-05-13 16:15:54,Say hello to this unbelievably well behaved sq...,204,170,this,,,,
948,722974582966214656,2016-04-21 02:25:47,Happy 4/20 from the squad! 13/10 for all https...,4,20,,,,,
985,716439118184652801,2016-04-03 01:36:11,This is Bluebert. He just saw that both #Final...,50,50,Bluebert,,,,
1011,713900603437621249,2016-03-27 01:29:02,Happy Saturday here's 9 puppers on a bench. 99...,99,90,,,,,
1036,710658690886586372,2016-03-18 02:46:49,Here's a brigade of puppers. All look very pre...,80,80,,,,,
1056,709198395643068416,2016-03-14 02:04:08,"From left to right:\nCletus, Jerome, Alejandro...",45,50,,,,,


In [170]:
wrd_clean.loc[2076,'text'] # another misread rating. Fix this first

'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv'

In [171]:
wrd_clean.loc[2076,'rating_numerator'] = 9
wrd_clean.loc[2076,'rating_denominator'] = 10
wrd_clean[wrd_clean.tweet_id == '666287406224695296']

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2076,666287406224695296,2015-11-16 16:11:11,This is an Albanian 3 1/2 legged Episcopalian...,9,10,an,,,,
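
The misread happens because the extraction takes the first number/number pattern in the text, and "1/2" comes before "9/10". The sketch below shows what the regex finds in two tweets from this notebook (the second text is abbreviated). Taking the last match instead is shown only as a hypothetical alternative: it is not what this notebook does, and it could misfire on tweets that end with some other fraction.

```python
import re

# Same pattern as the extraction cell: optional decimal numerator, slash, integer denominator.
rating_pattern = re.compile(r'((?:\d+\.)?\d+)/(\d+)')

texts = [
    "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. "
    "H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
    "This is an Albanian 3 1/2 legged  Episcopalian. "
    "Loves well-polished hardwood flooring. 9/10 https://t.co/d9NcXFKwLv",
]

first_matches = []
last_matches = []
for text in texts:
    matches = rating_pattern.findall(text)  # list of (numerator, denominator) strings
    first_matches.append(matches[0])
    last_matches.append(matches[-1])

print(first_matches)
print(last_matches)
```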


In [172]:
wrd_clean[wrd_clean.rating_denominator != 10]

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
339,820690176645140481,2017-01-15 17:52:40,The floofs have been released I repeat the flo...,84,70,,,,,
403,810984652412424192,2016-12-19 23:06:23,Meet Sam. She smiles 24/7 &amp; secretly aspir...,24,7,Sam,,,,
700,758467244762497024,2016-07-28 01:00:57,Why does this never happen at my front door......,165,150,,,,,
853,740373189193256964,2016-06-08 02:41:38,"After so many requests, this is Bretagne. She ...",9,11,,,,,
904,731156023742988288,2016-05-13 16:15:54,Say hello to this unbelievably well behaved sq...,204,170,this,,,,
948,722974582966214656,2016-04-21 02:25:47,Happy 4/20 from the squad! 13/10 for all https...,4,20,,,,,
985,716439118184652801,2016-04-03 01:36:11,This is Bluebert. He just saw that both #Final...,50,50,Bluebert,,,,
1011,713900603437621249,2016-03-27 01:29:02,Happy Saturday here's 9 puppers on a bench. 99...,99,90,,,,,
1036,710658690886586372,2016-03-18 02:46:49,Here's a brigade of puppers. All look very pre...,80,80,,,,,
1056,709198395643068416,2016-03-14 02:04:08,"From left to right:\nCletus, Jerome, Alejandro...",45,50,,,,,


The easiest thing is to compute numerator/denominator for every rating and multiply by 10. I will round the result to an integer.

In [173]:
# scale every rating to a denominator of 10, then round to the nearest integer
wrd_clean['rating10'] = (wrd_clean.rating_numerator/wrd_clean.rating_denominator)*10
wrd_clean['rating10'] = wrd_clean['rating10'].round().astype(int)
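
As a worked check of the normalization, here are four of the rows from the table above run through the same formula. The sketch rounds before casting, on the assumption that rounding is the intent stated above; a bare astype(int) truncates toward zero instead.

```python
import pandas as pd

# Numerator/denominator pairs taken from rows 339, 403, 700 and 853 above.
demo = pd.DataFrame({
    'rating_numerator': [84, 24, 165, 9],
    'rating_denominator': [70, 7, 150, 11],
})

# Scale to a common denominator of 10, then round to the nearest integer.
scaled = demo.rating_numerator / demo.rating_denominator * 10
demo['rating10'] = scaled.round().astype(int)

print(demo)
```

So 84/70 becomes 12, 24/7 becomes 34, 165/150 becomes 11 and 9/11 becomes 8.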

#### Test:

In [174]:
wrd_clean.head()

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating10
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas,,,,,13
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,13,10,Tilly,,,,,13
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,,12
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,,13
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,,12


In [175]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 11 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
rating10              2097 non-null int32
dtypes: datetime64[ns](1), int32(1), int64(2), object(7)
memory usage: 172.1+ KB


In [176]:
wrd_clean.head()

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating10
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas,,,,,13
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,13,10,Tilly,,,,,13
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,,12
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,,13
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,,12


<a id='breed-qual'></a>
### Breed identification - Quality

#### Define: change tweet_id type to string (non-numeric)

#### Code:

In [177]:
breed_clean = breed_df.copy()
breed_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [178]:
breed_clean.tweet_id = breed_clean.tweet_id.astype(str)

# TEST

breed_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null object
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


#### Define: Pepe's breed is wrong. Should be Ticonderoga per the tweet text
Look up the text for tweet_id 666983947667116034. Overwrite p1 with the correct breed, set the confidence to 1, and set p1_dog to True.

In [179]:
wrd_clean[wrd_clean.tweet_id == '666983947667116034']

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating10
2045,666983947667116034,2015-11-18 14:18:59,This is a curly Ticonderoga named Pepe. No fee...,11,10,a,,,,,11


In [180]:
breed_clean[breed_clean.tweet_id == '666983947667116034']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,swab,0.589446,False,chain_saw,0.190142,False,wig,0.03451,False


In [181]:
breed_clean.loc[51,'p1'] = 'Ticonderoga'
breed_clean[breed_clean.tweet_id == '666983947667116034']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,Ticonderoga,0.589446,False,chain_saw,0.190142,False,wig,0.03451,False


In [182]:
breed_clean.loc[51,'p1_conf'] = 1.0
breed_clean.loc[51,'p1_dog'] = True

#Test
breed_clean[breed_clean.tweet_id == '666983947667116034']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
51,666983947667116034,https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg,1,Ticonderoga,1.0,True,chain_saw,0.190142,False,wig,0.03451,False
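
The fix above addresses the row by its index label (51). A possible variation is to select by tweet_id and set the three columns in one assignment, which keeps working if the index changes. This is a sketch on a two-row stand-in built from values shown in this notebook, not a replacement for the cells above.

```python
import pandas as pd

# Two-row stand-in for breed_clean, using values that appear in this notebook.
demo = pd.DataFrame({
    'tweet_id': ['666983947667116034', '667866724293877760'],
    'p1': ['swab', 'jigsaw_puzzle'],
    'p1_conf': [0.589446, 1.0],
    'p1_dog': [False, False],
})

# Select Pepe's row by id instead of by a hard-coded index label.
is_pepe = demo.tweet_id == '666983947667116034'

# One value per listed column, applied to the selected row.
demo.loc[is_pepe, ['p1', 'p1_conf', 'p1_dog']] = ['Ticonderoga', 1.0, True]

print(demo)
```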


#### Define: tweet_id 667866724293877760 misidentified.

In [183]:
breed_clean[breed_clean.tweet_id == '667866724293877760']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
106,667866724293877760,https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg,1,jigsaw_puzzle,1.0,False,prayer_rug,1.0113e-08,False,doormat,1.74017e-10,False


In [184]:
s  = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is Shaggy. He knows exactly how to solve the puzzle but can&#39;t talk. All he wants to do is help. 10/10 great guy <a href="https://t.co/SBmWbfAg6X">pic.twitter.com/SBmWbfAg6X</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/667866724293877760?ref_src=twsrc%5Etfw">November 21, 2015</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

My best guess on this one is that he is a cockapoo, but I'm going to leave him out because, what do I know?

#### Define: Drop rows that are not identified as dogs in px.
Some of these will be real dogs like Shaggy (above), but we don't have access to the AI to try for a better guess (by cropping the photos, etc.), so we'll have to drop the "non-dogs": a row needs a dog breed to fit our schema. First we'll take the breed from the first prediction with px_dog == True, checking p1, then p2, then p3, and store it in a new column, breed.

Create a new column 'breed' and fill it with the p value of the first px_dog == True for each row. After that, drop the rows in which no breed was identified.
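
As an aside, the same p1, then p2, then p3 rule can probably be written without a row-by-row loop. The sketch below is only an illustration on a made-up frame (`demo` and its values are assumptions, not rows from the real breed data); it relies on `Series.where` to blank out non-dog predictions and `fillna` to fall through to the next prediction.

```python
import pandas as pd

# Toy stand-in for breed_clean; the rows are invented for illustration.
demo = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'p1': ['golden_retriever', 'jigsaw_puzzle', 'wig', 'barbell'],
    'p1_dog': [True, False, False, False],
    'p2': ['Labrador_retriever', 'pug', 'doormat', 'dumbbell'],
    'p2_dog': [True, True, False, False],
    'p3': ['kuvasz', 'chow', 'beagle', 'marmot'],
    'p3_dog': [True, True, True, False],
})

# Keep p1 only where p1_dog is True, then fall back to p2, then p3.
breed = demo.p1.where(demo.p1_dog)
breed = breed.fillna(demo.p2.where(demo.p2_dog))
breed = breed.fillna(demo.p3.where(demo.p3_dog))
demo['breed'] = breed

# Tweets with no dog prediction at all, and the frame without them.
no_breed = demo.loc[demo.breed.isnull(), 'tweet_id'].tolist()
with_breed = demo[demo.breed.notnull()].reset_index(drop=True)
```

On the toy frame this leaves one tweet without a breed. The loop in the notebook should give the same result; it is just slower on larger frames because it re-filters the frame for every tweet_id.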

#### Code

In [185]:
breed_clean['breed'] = None

In [186]:
no_breed = []
for tid in breed_clean.tweet_id:
    if breed_clean[breed_clean.tweet_id == tid].p1_dog.values[0]:
        breed_clean.loc[breed_clean[breed_clean.tweet_id == tid].index[0],'breed'] = breed_clean[breed_clean.tweet_id == tid].p1.values[0]
    elif breed_clean[breed_clean.tweet_id == tid].p2_dog.values[0]:
        breed_clean.loc[breed_clean[breed_clean.tweet_id == tid].index[0],'breed'] = breed_clean[breed_clean.tweet_id == tid].p2.values[0]
    elif breed_clean[breed_clean.tweet_id == tid].p3_dog.values[0]:
        breed_clean.loc[breed_clean[breed_clean.tweet_id == tid].index[0],'breed'] = breed_clean[breed_clean.tweet_id == tid].p3.values[0]
    else:
        no_breed.append(tid)
no_breed

['666051853826850816',
 '666104133288665088',
 '666268910803644416',
 '666293911632134144',
 '666362758909284353',
 '666411507551481857',
 '666786068205871104',
 '666837028449972224',
 '667012601033924608',
 '667065535570550784',
 '667188689915760640',
 '667369227918143488',
 '667437278097252352',
 '667443425659232256',
 '667549055577362432',
 '667550882905632768',
 '667724302356258817',
 '667766675769573376',
 '667782464991965184',
 '667866724293877760',
 '667873844930215936',
 '667911425562669056',
 '667937095915278337',
 '668142349051129856',
 '668154635664932864',
 '668226093875376128',
 '668291999406125056',
 '668466899341221888',
 '668544745690562560',
 '668614819948453888',
 '668620235289837568',
 '668643542311546881',
 '668645506898350081',
 '668981893510119424',
 '668988183816871936',
 '668992363537309700',
 '669015743032369152',
 '669214165781868544',
 '669351434509529089',
 '669571471778410496',
 '669583744538451968',
 '669661792646373376',
 '669682095984410625',
 '669749430

In [187]:
breed_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 13 columns):
tweet_id    2075 non-null object
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
breed       1752 non-null object
dtypes: bool(3), float64(3), int64(1), object(6)
memory usage: 168.3+ KB


In [188]:
# Just for fun: this is a barbell
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is Wiggles. She would like you to spot her. Probably won&#39;t need your help but just in case. 13/10 powerful as h*ck <a href="https://t.co/2d370P0OEg">pic.twitter.com/2d370P0OEg</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/852311364735569921?ref_src=twsrc%5Etfw">April 13, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

In [189]:
# And this is a meerkat or a marmot ("Nice marmot.")
s = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is Albus. He&#39;s quite impressive at hide and seek. Knows he&#39;s been found this time. 13/10 usually elusive as h*ck <a href="https://t.co/ht47njyZ64">pic.twitter.com/ht47njyZ64</a></p>&mdash; 💝 WeRateDogs™ (@dog_rates) <a href="https://twitter.com/dog_rates/status/863907417377173506?ref_src=twsrc%5Etfw">May 15, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Tweet(s)

In [190]:
len(no_breed)

323

In [191]:
len(breed_clean)

2075

#### Redefine: Instead of dropping, I'll just keep the rows where breed is not null

#### Code:

In [192]:
# .copy() makes this an independent frame so later in-place edits don't act on a slice
breed_clean = breed_clean[breed_clean.breed.notnull()].copy()
breed_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1752 entries, 0 to 2073
Data columns (total 13 columns):
tweet_id    1752 non-null object
jpg_url     1752 non-null object
img_num     1752 non-null int64
p1          1752 non-null object
p1_conf     1752 non-null float64
p1_dog      1752 non-null bool
p2          1752 non-null object
p2_conf     1752 non-null float64
p2_dog      1752 non-null bool
p3          1752 non-null object
p3_conf     1752 non-null float64
p3_dog      1752 non-null bool
breed       1752 non-null object
dtypes: bool(3), float64(3), int64(1), object(6)
memory usage: 155.7+ KB


In [193]:
breed_clean.reset_index(inplace=True,drop=True)
breed_clean.index.values[:40]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39], dtype=int64)

<a id='counts-qual'></a>
### Counts - Quality

#### Define: remove entries where retweets or favorites equal zero. These are retweets or replies themselves.

#### Code:

In [194]:
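# keep rows where both counts are non-zero; .copy() avoids chained-assignment warnings later
counts_clean = counts_df[(counts_df.retweet_count != 0) & (counts_df.favorite_count != 0)].copy()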

In [195]:
counts_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2172 entries, 0 to 2339
Data columns (total 5 columns):
tweet_id          2172 non-null int64
temp_data         2172 non-null object
favorite_count    2172 non-null int64
retweet_count     2172 non-null int64
url               2172 non-null object
dtypes: int64(3), object(2)
memory usage: 101.8+ KB


In [196]:
counts_clean.reset_index(inplace=True,drop=True)
counts_clean.index.values[:40]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39], dtype=int64)

In [197]:
counts_clean.tweet_id = counts_clean.tweet_id.astype(str)
counts_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2172 entries, 0 to 2171
Data columns (total 5 columns):
tweet_id          2172 non-null object
temp_data         2172 non-null object
favorite_count    2172 non-null int64
retweet_count     2172 non-null int64
url               2172 non-null object
dtypes: int64(2), object(3)
memory usage: 84.9+ KB


#### Test: check that rows with favorite_count == 0 or retweet_count == 0 have been removed

#### Code:

In [198]:
counts_clean.describe()

Unnamed: 0,favorite_count,retweet_count
count,2172.0,2172.0
mean,8557.424954,2632.53407
std,12562.385729,4667.654874
min,51.0,2.0
25%,1824.0,569.0
50%,3864.0,1258.0
75%,10637.25,3011.75
max,163883.0,83323.0


In [199]:
counts_clean[counts_clean.favorite_count == 0]

Unnamed: 0,tweet_id,temp_data,favorite_count,retweet_count,url


In [200]:
counts_clean[counts_clean.retweet_count == 0]

Unnamed: 0,tweet_id,temp_data,favorite_count,retweet_count,url


#### Define: drop temp_data column since all info has been extracted into columns

#### Code:

In [201]:
counts_clean.drop(columns='temp_data', inplace=True)

# Test

counts_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2172 entries, 0 to 2171
Data columns (total 4 columns):
tweet_id          2172 non-null object
favorite_count    2172 non-null int64
retweet_count     2172 non-null int64
url               2172 non-null object
dtypes: int64(2), object(2)
memory usage: 68.0+ KB


<a id='wrd-tidy'></a>
### wrd_clean - Tidiness

#### Define: Drop unused columns

#### Code:

In [202]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 11 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
rating10              2097 non-null int32
dtypes: datetime64[ns](1), int32(1), int64(2), object(7)
memory usage: 172.1+ KB


#### Define: Floofer, doggo, puppo, and pupper need to become one categorical column
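Combine the doggo, floofer, pupper, and puppo columns into one column, type. When a tweet carries more than one label, the first match in the order doggo, floofer, pupper, puppo wins.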
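
For reference, here is a minimal sketch of how the four stage columns could be collapsed without looping over tweet_id. It assumes the 'None' strings have already been turned into real missing values; `demo`, `stages`, and `multi_stage` are hypothetical names used only here, and the rows are invented.

```python
import pandas as pd

# Priority order: the first non-missing label in this order is kept.
stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for wrd_clean; the rows are invented for illustration.
demo = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'doggo':   ['doggo', None, 'doggo', None],
    'floofer': [None, None, 'floofer', None],
    'pupper':  [None, 'pupper', None, None],
    'puppo':   [None, None, None, None],
})

# Start from the first stage column and fill gaps from each later one.
kind = demo[stages[0]]
for stage in stages[1:]:
    kind = kind.where(kind.notnull(), demo[stage])
demo['type'] = kind

# Rows that mention more than one stage, worth a manual look.
multi_stage = demo[demo[stages].notnull().sum(axis=1) > 1]
```

Checking `multi_stage` before dropping the original columns shows how many tweets lose a second label under the first-match rule.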

#### Code:

In [203]:
wrd_clean.columns.values.tolist()

['tweet_id',
 'timestamp',
 'text',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo',
 'rating10']

In [204]:
# turn the string 'None' into a real missing value (NaN)
wrd_clean.replace('None', float('nan'), inplace=True)

In [205]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 11 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1494 non-null object
doggo                 83 non-null object
floofer               10 non-null object
pupper                230 non-null object
puppo                 24 non-null object
rating10              2097 non-null int64
dtypes: datetime64[ns](1), int64(3), object(7)
memory usage: 180.3+ KB


In [206]:
for tid in wrd_clean.tweet_id:
    if (wrd_clean[wrd_clean.tweet_id == tid].doggo == 'doggo').values[0]:
        wrd_clean.loc[wrd_clean[wrd_clean.tweet_id==tid].index,'type'] = 'doggo'
    elif (wrd_clean[wrd_clean.tweet_id == tid].floofer == 'floofer').values[0]:
        wrd_clean.loc[wrd_clean[wrd_clean.tweet_id==tid].index,'type'] = 'floofer'
    elif (wrd_clean[wrd_clean.tweet_id == tid].pupper == 'pupper').values[0]:
        wrd_clean.loc[wrd_clean[wrd_clean.tweet_id==tid].index,'type'] = 'pupper'
    elif (wrd_clean[wrd_clean.tweet_id == tid].puppo == 'puppo').values[0]:
        wrd_clean.loc[wrd_clean[wrd_clean.tweet_id==tid].index,'type'] = 'puppo'
wrd_clean[wrd_clean.type == 'floofer']

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating10,type
42,883360690899218434,2017-07-07 16:22:55,Meet Grizzwald. He may be the floofiest floofe...,13,10,Grizzwald,,floofer,,,13,floofer
450,800388270626521089,2016-11-20 17:20:08,This is Doc. He takes time out of every day to...,12,10,Doc,,floofer,,,12,floofer
593,776218204058357768,2016-09-15 00:36:55,Atlas rolled around in some chalk and now he's...,13,10,,,floofer,,,13,floofer
775,749317047558017024,2016-07-02 19:01:20,This is Blu. He's a wild bush Floofer. I wish ...,12,10,Blu,,floofer,,,12,floofer
809,746542875601690625,2016-06-25 03:17:46,Here's a golden floofer helping with the groce...,11,10,,,floofer,,,11,floofer
875,737445876994609152,2016-05-31 00:49:32,Just wanted to share this super rare Rainbow F...,13,10,,,floofer,,,13,floofer
894,733822306246479872,2016-05-21 00:50:46,This is Moose. He's a Polynesian Floofer. Dapp...,10,10,Moose,,floofer,,,10,floofer
1303,689993469801164801,2016-01-21 02:10:37,Here we are witnessing a rare High Stepping Al...,12,10,,,floofer,,,12,floofer
1381,685307451701334016,2016-01-08 03:50:03,Say hello to Petrick. He's an Altostratus Floo...,11,10,Petrick,,floofer,,,11,floofer


In [207]:
wrd_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
text                  2097 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1494 non-null object
doggo                 83 non-null object
floofer               10 non-null object
pupper                230 non-null object
puppo                 24 non-null object
rating10              2097 non-null int64
type                  336 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 196.7+ KB


In [208]:
wrd_clean[wrd_clean.type == 'doggo']

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating10,type
9,890240255349198849,2017-07-26 15:59:51,This is Cassie. She is a college pup. Studying...,14,10,Cassie,doggo,,,,14,doggo
39,884162670584377345,2017-07-09 21:29:42,Meet Yogi. He doesn't have any important dog m...,12,10,Yogi,doggo,,,,12,doggo
86,872967104147763200,2017-06-09 00:02:31,Here's a very large dog. He has a date later. ...,12,10,,doggo,,,,12,doggo
94,871515927908634625,2017-06-04 23:56:03,This is Napolean. He's a Raggedy East Nicaragu...,12,10,Napolean,doggo,,,,12,doggo
95,871102520638267392,2017-06-03 20:33:19,Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,14,10,,doggo,,,,14,doggo
104,869596645499047938,2017-05-30 16:49:31,This is Scout. He just graduated. Officially a...,12,10,Scout,doggo,,,,12,doggo
143,858843525470990336,2017-05-01 00:40:27,I have stumbled puppon a doggo painting party....,13,10,,doggo,,,,13,doggo
154,855851453814013952,2017-04-22 18:31:02,Here's a puppo participating in the #ScienceMa...,13,10,,doggo,,,puppo,13,doggo
161,854010172552949760,2017-04-17 16:34:26,"At first I thought this was a shy doggo, but i...",11,10,,doggo,floofer,,,11,doggo
192,846514051647705089,2017-03-28 00:07:32,This is Barney. He's an elder doggo. Hitches a...,13,10,Barney,doggo,,,,13,doggo


### wrd_clean - Quality

#### Define: drop unused doggo, floofer, puppo and pupper columns once dog type info is combined into one column

#### Code:

In [209]:
wrd_clean.drop(columns=['doggo','floofer','puppo','pupper'],inplace=True)
wrd_clean[wrd_clean.type == 'doggo']

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,rating10,type
9,890240255349198849,2017-07-26 15:59:51,This is Cassie. She is a college pup. Studying...,14,10,Cassie,14,doggo
39,884162670584377345,2017-07-09 21:29:42,Meet Yogi. He doesn't have any important dog m...,12,10,Yogi,12,doggo
86,872967104147763200,2017-06-09 00:02:31,Here's a very large dog. He has a date later. ...,12,10,,12,doggo
94,871515927908634625,2017-06-04 23:56:03,This is Napolean. He's a Raggedy East Nicaragu...,12,10,Napolean,12,doggo
95,871102520638267392,2017-06-03 20:33:19,Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,14,10,,14,doggo
104,869596645499047938,2017-05-30 16:49:31,This is Scout. He just graduated. Officially a...,12,10,Scout,12,doggo
143,858843525470990336,2017-05-01 00:40:27,I have stumbled puppon a doggo painting party....,13,10,,13,doggo
154,855851453814013952,2017-04-22 18:31:02,Here's a puppo participating in the #ScienceMa...,13,10,,13,doggo
161,854010172552949760,2017-04-17 16:34:26,"At first I thought this was a shy doggo, but i...",11,10,,11,doggo
192,846514051647705089,2017-03-28 00:07:32,This is Barney. He's an elder doggo. Hitches a...,13,10,Barney,13,doggo


### breed_clean - quality

#### Define: This is an afterthought: drop the px columns since we won't use them and we captured the best guess in column 'breed'

#### Code:

In [210]:
breed_clean.drop(columns=['p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf','p3_dog'],inplace=True)
breed_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1752 entries, 0 to 1751
Data columns (total 4 columns):
tweet_id    1752 non-null object
jpg_url     1752 non-null object
img_num     1752 non-null int64
breed       1752 non-null object
dtypes: int64(1), object(3)
memory usage: 54.8+ KB


### Overall Tidiness: Combine into one dataframe and save.

#### Define
I will merge these dataframes on tweet_id to get one big dataset where each row is a dog rates tweet and has the best breed prediction, retweets, and favorite counts, as well as other data.

#### Code

In [211]:
len(wrd_clean)

2097

In [212]:
len(breed_clean)

1752

In [213]:
len(counts_clean)

2172

I'll left join on breed_clean since it has the fewest rows.
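
A left join keeps every breed row even when the other frame has no matching tweet_id, so some columns can come back with missing values (and integer columns turn into floats). The sketch below, on invented frames named `breeds` and `archive`, shows one way to count such rows with the `indicator` option of `pd.merge` and compares the result with an inner join.

```python
import pandas as pd

# Toy frames; tweet '2' has a breed but no archive row.
breeds = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                       'breed': ['pug', 'chow', 'beagle']})
archive = pd.DataFrame({'tweet_id': ['1', '3', '4'],
                        'rating10': [12, 10, 13]})

# indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(breeds, archive, on='tweet_id', how='left', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()

# An inner join keeps only tweets present in both frames.
inner = pd.merge(breeds, archive, on='tweet_id', how='inner')
```

Whether to keep the unmatched rows is a judgment call: a left join preserves every breed prediction, while an inner join gives a frame with no gaps in the rating columns.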

In [214]:
wrd_merge = pd.merge(breed_clean,wrd_clean,on='tweet_id',how='left')


In [215]:
wrd_merge = pd.merge(wrd_merge,counts_clean,on='tweet_id',how='left')
wrd_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1752 entries, 0 to 1751
Data columns (total 14 columns):
tweet_id              1752 non-null object
jpg_url               1752 non-null object
img_num               1752 non-null int64
breed                 1752 non-null object
timestamp             1667 non-null datetime64[ns]
text                  1667 non-null object
rating_numerator      1667 non-null float64
rating_denominator    1667 non-null float64
name                  1267 non-null object
rating10              1667 non-null float64
type                  257 non-null object
favorite_count        1685 non-null float64
retweet_count         1685 non-null float64
url                   1685 non-null object
dtypes: datetime64[ns](1), float64(5), int64(1), object(7)
memory usage: 205.3+ KB


In [216]:
wrd_merge.to_csv('twitter_archive_master.csv',index=False)

The dataset is complete. Now for a little analysis!!!

<a id="analysis"></a>
## Analysis

First I'm going to look at favorite counts vs. rating to see if ratings reflect the judgement of the public.

In [2]:
%%html 
<div class='tableauPlaceholder' id='viz1549858722999' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDLikesandRetweets&#47;Story3&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='UdacityWRDLikesandRetweets&#47;Story3' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDLikesandRetweets&#47;Story3&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1549858722999');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1016px';vizElement.style.height='991px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

The above Tableau code doesn't seem to create the visualization when reviewers open it, so I'll insert images as well. I want to keep the Tableau code so the visualizations can be interactive.

<img src="Story 3.png">

Now I want to see what the most popular and highest rated dog breeds are. For that I need to normalize by number of times the breed is in the dataset.

This means adding two columns to the dataset: breed_count and normalized_fav. I'll just do simple division: favorite_count divided by the number of tweets for that breed.
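
As a rough illustration of the normalization, the sketch below computes a per-row breed count with `groupby(...).transform('count')` and then divides. The frame `demo` and its numbers are made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataset; values are invented.
demo = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'breed': ['pug', 'pug', 'chow', 'pug'],
    'favorite_count': [300.0, 600.0, 500.0, 900.0],
})

# transform('count') returns one value per row: the size of that row's breed group.
demo['breed_count'] = demo.groupby('breed')['tweet_id'].transform('count')
demo['normalized_fav'] = demo.favorite_count / demo.breed_count

# Summing normalized_fav per breed gives the mean favorite count per breed
# (as long as no favorite_count values are missing).
per_breed_sum = demo.groupby('breed')['normalized_fav'].sum()
per_breed_mean = demo.groupby('breed')['favorite_count'].mean()
```

One caveat: if some tweets have no favorite_count, the summed normalized_fav divides by all tweets of the breed, while a plain mean divides only by the tweets that have a count, so the two can differ.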

In [218]:
breed_counts = wrd_merge.groupby( ['breed'] ).count().reset_index()
breed_counts.head()

Unnamed: 0,breed,tweet_id,jpg_url,img_num,timestamp,text,rating_numerator,rating_denominator,name,rating10,type,favorite_count,retweet_count,url
0,Afghan_hound,4,4,4,3,3,3,3,3,3,0,3,3,3
1,Airedale,12,12,12,12,12,12,12,9,12,1,12,12,12
2,American_Staffordshire_terrier,16,16,16,16,16,16,16,11,16,3,16,16,16
3,Appenzeller,2,2,2,2,2,2,2,1,2,0,2,2,2
4,Australian_terrier,2,2,2,2,2,2,2,2,2,0,2,2,2


In [219]:
breed_counts[breed_counts.breed == 'Airedale'].tweet_id.values[0]

12

In [220]:
wrd_merge[wrd_merge.tweet_id == '666020888022790149']

Unnamed: 0,tweet_id,jpg_url,img_num,breed,timestamp,text,rating_numerator,rating_denominator,name,rating10,type,favorite_count,retweet_count,url
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,2015-11-15 22:32:08,Here we have a Japanese Irish Setter. Lost eye...,8.0,10.0,,8.0,,2532.0,498.0,https://twitter.com/dog_rates/status/666020888...


In [221]:
norm_dict = {}
for tid in wrd_merge.tweet_id:
    breed = wrd_merge[wrd_merge.tweet_id == tid].breed.values[0]
    nbreed = breed_counts[breed_counts.breed == breed].tweet_id.values[0]
    norm_dict[tid] = nbreed
norm_dict

{'666020888022790149': 4,
 '666029285002620928': 6,
 '666033412701032449': 21,
 '666044226329800704': 4,
 '666049248165822465': 26,
 '666050758794694657': 11,
 '666055525042405380': 51,
 '666057090499244032': 173,
 '666058600524156928': 8,
 '666063827256086533': 173,
 '666071193221509120': 4,
 '666073100786774016': 5,
 '666082916733198337': 65,
 '666094000022159362': 7,
 '666099513787052032': 5,
 '666102155909144576': 9,
 '666273097616637952': 17,
 '666287406224695296': 19,
 '666337882303524864': 7,
 '666345417576210432': 173,
 '666353288456101888': 34,
 '666373753744588802': 15,
 '666396247373291520': 95,
 '666407126856765440': 2,
 '666418789513326592': 3,
 '666421158376562688': 11,
 '666428276349472768': 96,
 '666430724426358785': 7,
 '666435652385423360': 31,
 '666437273139982337': 95,
 '666447344410484738': 3,
 '666454714377183233': 13,
 '666644823164719104': 5,
 '666649482315059201': 12,
 '666691418707132416': 21,
 '666701168228331520': 113,
 '666739327293083650': 8,
 '66677690848

Now add the breed_count and normalized favorites to my dataset

In [222]:
norm_df = pd.DataFrame(list(norm_dict.items()),columns=['tweet_id','breed_count'])

In [223]:
norm_df.head()

Unnamed: 0,tweet_id,breed_count
0,666020888022790149,4
1,666029285002620928,6
2,666033412701032449,21
3,666044226329800704,4
4,666049248165822465,26


In [224]:
wrd_merge = pd.merge(wrd_merge,norm_df,on='tweet_id',how='left')
wrd_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1752 entries, 0 to 1751
Data columns (total 15 columns):
tweet_id              1752 non-null object
jpg_url               1752 non-null object
img_num               1752 non-null int64
breed                 1752 non-null object
timestamp             1667 non-null datetime64[ns]
text                  1667 non-null object
rating_numerator      1667 non-null float64
rating_denominator    1667 non-null float64
name                  1267 non-null object
rating10              1667 non-null float64
type                  257 non-null object
favorite_count        1685 non-null float64
retweet_count         1685 non-null float64
url                   1685 non-null object
breed_count           1752 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(2), object(7)
memory usage: 219.0+ KB


In [225]:
wrd_merge['normalized_fav'] = wrd_merge.favorite_count/wrd_merge.breed_count

In [226]:
wrd_merge.normalized_fav

0        633.000000
1         20.833333
2          5.857143
3         73.250000
4          4.076923
5         11.818182
6          8.411765
7          1.676301
8         13.625000
9          2.693642
10        35.750000
11        63.000000
12         1.753846
13        23.142857
14        30.200000
15         8.666667
16        10.058824
17         7.578947
18        27.285714
19         1.653179
20         6.323529
21        12.133333
22         1.705263
23        52.500000
24        40.666667
25        28.454545
26         1.697917
27        44.285714
28         5.258065
29         1.294737
           ...     
1722     542.230769
1723    2907.454545
1724     421.969231
1725     648.937500
1726     622.052632
1727     362.452632
1728     651.260870
1729    4714.000000
1730     704.156250
1731    1739.235294
1732     666.093750
1733            NaN
1734     971.050000
1735     144.815029
1736     164.919075
1737    2249.000000
1738      85.601156
1739     831.343750
1740     491.104167


In [227]:
wrd_merge.to_csv('twitter_archive_master.csv', index=False)

In [3]:
%%html
<div class='tableauPlaceholder' id='viz1549842926324' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDBestBreeds&#47;Story1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='UdacityWRDBestBreeds&#47;Story1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDBestBreeds&#47;Story1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1549842926324');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1016px';vizElement.style.height='991px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

And the image for the above Tableau html that doesn't load for reviewers:

<img src="Story 1.png">

In [4]:
%%html
<div class='tableauPlaceholder' id='viz1549857350556' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDRatingsOverTime&#47;Story2&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='UdacityWRDRatingsOverTime&#47;Story2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ud&#47;UdacityWRDRatingsOverTime&#47;Story2&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1549857350556');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1016px';vizElement.style.height='991px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

And the last visualization image from the Tableau story code above:

<img src="Story 2.png">

In [230]:
wrd_merge[wrd_merge.breed == 'Brabancon_griffon'].url.values

array(['https://twitter.com/dog_rates/status/669371483794317312/photo/1',
       'https://twitter.com/dog_rates/status/673887867907739649/photo/1',
       'https://twitter.com/dog_rates/status/674447403907457024/photo/1'],
      dtype=object)

Totally underrated. This dataset is an outrage! No pitbulls at all, either!

In [231]:
wrd_merge[wrd_merge.breed == 'pitbull']

Unnamed: 0,tweet_id,jpg_url,img_num,breed,timestamp,text,rating_numerator,rating_denominator,name,rating10,type,favorite_count,retweet_count,url,breed_count,normalized_fav


In [232]:
wrd_merge[wrd_merge['rating10'] == 14].count()

tweet_id              22
jpg_url               22
img_num               22
breed                 22
timestamp             22
text                  22
rating_numerator      22
rating_denominator    22
name                  15
rating10              22
type                  10
favorite_count        22
retweet_count         22
url                   22
breed_count           22
normalized_fav        22
dtype: int64

In [233]:
wrd_merge[wrd_merge['rating10'] == 13].count()

tweet_id              221
jpg_url               221
img_num               221
breed                 221
timestamp             221
text                  221
rating_numerator      221
rating_denominator    221
name                  163
rating10              221
type                   42
favorite_count        221
retweet_count         221
url                   221
breed_count           221
normalized_fav        221
dtype: int64

In [234]:
wrd_merge.groupby('breed').count()['tweet_id'].max()

173

In [235]:
wrd_grouped = wrd_merge.groupby('breed').count()

In [236]:
wrd_grouped[wrd_grouped.tweet_id == 173]

breed,tweet_id,jpg_url,img_num,timestamp,text,rating_numerator,rating_denominator,name,rating10,type,favorite_count,retweet_count,url,breed_count,normalized_fav
golden_retriever,173,173,173,156,156,156,156,111,156,34,158,158,158,173,158


In [237]:
wrd_grouped

breed,tweet_id,jpg_url,img_num,timestamp,text,rating_numerator,rating_denominator,name,rating10,type,favorite_count,retweet_count,url,breed_count,normalized_fav
Afghan_hound,4,4,4,3,3,3,3,3,3,0,3,3,3,4,3
Airedale,12,12,12,12,12,12,12,9,12,1,12,12,12,12,12
American_Staffordshire_terrier,16,16,16,16,16,16,16,11,16,3,16,16,16,16,16
Appenzeller,2,2,2,2,2,2,2,1,2,0,2,2,2,2,2
Australian_terrier,2,2,2,2,2,2,2,2,2,0,2,2,2,2,2
Bedlington_terrier,6,6,6,6,6,6,6,3,6,2,6,6,6,6,6
Bernese_mountain_dog,11,11,11,11,11,11,11,11,11,2,11,11,11,11,11
Blenheim_spaniel,11,11,11,10,10,10,10,9,10,2,10,10,10,11,10
Border_collie,12,12,12,12,12,12,12,10,12,4,12,12,12,12,12
Border_terrier,7,7,7,7,7,7,7,6,7,0,7,7,7,7,7
