
### TO-DO
  - Grab the retweet and favorite count for each tweet in the dataset by using
  the provided tweet ID
    - Print out each twitter ID after it's called and use a code timer to ensure that rate limits are met.
    - Also
  - Filter out all the retweets, as we want only original ratings that have
  images attached to them.
  - Assess and clean 8 *quality issues*
    - Issues that satisfy the Project Motivation must be satisfied
        Other issues:
            - IDs are listed as integers or floats but should be ints because they'll never be used in calculation
            - Retweets and favorites should be integers and not floats, as there will never be half of one.
        - Twitter Archive Table
            1. Retweets exist in the table
            2. Dogs both don't have names listed, or are words from the text of the tweet.
            3. Rating denominator is not always 10
            4. Extremely large numerators
            5. Columns have NaN in place of null
            6. Nulls in expanded_urls columns when it should have scripts
        - Image Predictions Table
            7. Tweets with images not of dogs. 
            8. Prediction names columns are not consistent casing
  - Assess and clean 2 *tidiness issues*
    - Merge the individual pieces of data according to the rules of *tidy data*
        3. "Each type of observational unit forms a table" -> all tweets can be merged into one table. 
        - Twitter Archive Table
            1. Dog stages are separate columns when they should be one column
            2. Timestamp is a string when it should be a datetime type
        
  - Download the image predictions file using the Requests library using the url https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
  - Store each tweet's data into a tweet_json.txt file, using the tweepy library.
  - Produce *3 insights* and *1 visualization*
  - *300-600 word written report that describes wrangling efforts*
  - *250 word written report that communicates the insights and visualizations from the wrangled data*

- Access Token
  356969562-XV8Hnta7upyC9Wm3kqB96LXu0CTgsq88j0iEGoEz
- Access Token Secret
  mok6g5nFfJXQ5CtQZGZMS0N1zhhedEG8PFmVjearYSose


- Doggo Stages
  - Doggo
  - Pupper
  - Puppo
  - Blep
  - Snoot
  - Floof
  
- Tidiness requirements
    - Each variable forms a column
    - Each observation forms a row
    - Each type of observational unit forms a table

# Wrangle Report

"What, if any, are the characteristics of a good boy?" 

A poignant question with no easy answers. In our effort to answer this, we'll turn to the WeRateDogs twitter account, a world renowned resource for finding the goodest boys amongst us. However, before we can find the best doggos of them all (and they are all good dogs, Brent), we need to make that data viable for use internally. In order to do that, we'll proceed through the usual three step process for wrangling our data; gathering, assessing and cleaning. Our data will come from three sources: the WeRateDogs twitter archive, contained in the csv `twitter-archive-enhanced.csv`, the image predictions we can pull down remotely from the file `image-predictions.tsv`, and external data from the Twitter API. So let's go about gathering those three sources, and then we can assess their tidiness, cleanliness, and then get them sorted so we can appreciate every pupper they contain.

### Part 1: Gathering Our Data
#### File 1: The Twitter Archive
This will actually turn out to be the easiest of the bunch, from a "gathering" standpoint. As the file was provided to us by Udacity, we can simply place it in our local directory, and import it directly into a Pandas dataframe, where it will be ready for later use:

In [108]:
# import the we rate dogs twitter archive
import pandas as pd
import numpy as np

dogs = pd.read_csv('twitter-archive-enhanced.csv')
dogs.shape[0]


2356

#### File 2: The Image Predictions
Now, we have to do a little more legwork. We can remotely 'boop' the Udacity servers using the Request library to download our second data source. From there, it's just a matter making sure the text is read properly, that Pandas recognizes the tab separated data format, and voila, we'll have our predictions.

In [96]:
import requests
import io

# pull the data down from the Udacity servers
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

# Use the encoded raw text, identify the separator to pandas, and load the dataframe,
# preds here is shorthand for 'predictions'
preds = pd.read_csv(io.StringIO(r.text), sep='\t')
preds.shape[0]

2075

#### File 3: Additonal Twitter Details
Here's where things get a little more... ruff. We're going to have to go ahead and use the `tweepy` library to programatically access twitter for all the additional information about the tweets we were interested in from above. 

Which tweets should we grab? Well we have the ids from the Twitter archive above. While this goes back further than we need, we can pare down our results later, once we have all the possible information assembled. For now, let's grab those tweets from the `dogs` df above, write the JSON we get back from the api into a file labeled `tweet_json.txt`, and read that into a Pandas dataframe:

In [125]:
# grab the twitter data for each tweet and store it as a json array in "tweet_json.txt"
import tweepy
import json

# Remember to load your own API key information from the Twitter Developers portal.
api_key = 'eJ4KQYnbqZx8WvJAQQqEZKauw'
api_secret = 'h3HuBqJG2MFbmZe8MgEzM6AZ8hbxletwxEDoDFvksELeO3hCrv'
access_token = '356969562-XV8Hnta7upyC9Wm3kqB96LXu0CTgsq88j0iEGoEz'
access_token_secret = 'mok6g5nFfJXQ5CtQZGZMS0N1zhhedEG8PFmVjearYSose'

# Set up our tweepy scraper, ensuring that we can use the right amount of request and not
# violate our rate limit:
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Set up an array to catch all the tweets that have since been deleted off Twitter
deleted_tweets = []

# Grab the IDs we need
tweet_ids = df['tweet_id']
# create our file if it doesn't exist
with open('tweet_json.txt', 'w+') as f:
    for tweet_id in tweet_ids:
        # set up an exception handler in the event that we run into a tweet that no longer
        # exists, and append that tweets ID into our array above
        try:
            print('Grabbing tweet_id {}'.format(tweet_id))
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, f)
            # write a newline to ensure that each tweet gets its own separate line in the text file:
            f.write('\n')
        # document errors when they occur:
        except tweepy.TweepError:
            print('Tweet not found: {}'.format(tweet_id))
            deleted_tweets.append(tweet_id)
            pass

# create a blank dataframe to load our JSON data into:
cols = ['tweet_id', 'retweet_count', 'favorite_count']
api_data = pd.DataFrame(columns=cols)
with open('tweet_json.txt') as json_file:
    for line in json_file:
        # read through the file line by line, using the keys in the JSON file
        # to pull out the appropriate data we need for our dataframe
        status = json.loads(line)
        tweet_id = status['id_str']
        retweet_count = status['retweet_count']
        favorite_count = status['favorite_count']
        
        # append each line to our dataframe
        api_data = api_data.append(pd.DataFrame([[tweet_id, retweet_count, favorite_count]], columns=cols))

Grabbing tweet_id 666020888022790149
Grabbing tweet_id 666029285002620928
Grabbing tweet_id 666033412701032449
Grabbing tweet_id 666044226329800704
Grabbing tweet_id 666049248165822465
Grabbing tweet_id 666050758794694657
Grabbing tweet_id 666051853826850816
Grabbing tweet_id 666055525042405380
Grabbing tweet_id 666057090499244032
Grabbing tweet_id 666058600524156928
Grabbing tweet_id 666063827256086533
Grabbing tweet_id 666071193221509120
Grabbing tweet_id 666073100786774016
Grabbing tweet_id 666082916733198337
Grabbing tweet_id 666094000022159362
Grabbing tweet_id 666099513787052032
Grabbing tweet_id 666102155909144576
Grabbing tweet_id 666104133288665088
Grabbing tweet_id 666268910803644416
Grabbing tweet_id 666273097616637952
Grabbing tweet_id 666287406224695296
Grabbing tweet_id 666293911632134144
Grabbing tweet_id 666337882303524864
Grabbing tweet_id 666345417576210432
Grabbing tweet_id 666353288456101888
Grabbing tweet_id 666362758909284353
Grabbing tweet_id 666373753744588802
G

Grabbing tweet_id 670290420111441920
Grabbing tweet_id 670303360680108032
Grabbing tweet_id 670319130621435904
Grabbing tweet_id 670338931251150849
Grabbing tweet_id 670361874861563904
Grabbing tweet_id 670374371102445568
Grabbing tweet_id 670385711116361728
Grabbing tweet_id 670403879788544000
Grabbing tweet_id 670408998013820928
Grabbing tweet_id 670411370698022913
Grabbing tweet_id 670417414769758208
Grabbing tweet_id 670420569653809152
Grabbing tweet_id 670421925039075328
Grabbing tweet_id 670427002554466305
Grabbing tweet_id 670428280563085312
Grabbing tweet_id 670433248821026816
Grabbing tweet_id 670434127938719744
Grabbing tweet_id 670435821946826752
Grabbing tweet_id 670442337873600512
Grabbing tweet_id 670444955656130560
Grabbing tweet_id 670449342516494336
Grabbing tweet_id 670452855871037440
Grabbing tweet_id 670465786746662913
Grabbing tweet_id 670468609693655041
Grabbing tweet_id 670474236058800128
Grabbing tweet_id 670668383499735048
Grabbing tweet_id 670676092097810432
G

Grabbing tweet_id 674644256330530816
Grabbing tweet_id 674646392044941312
Grabbing tweet_id 674664755118911488
Grabbing tweet_id 674670581682434048
Grabbing tweet_id 674690135443775488
Grabbing tweet_id 674737130913071104
Grabbing tweet_id 674739953134403584
Grabbing tweet_id 674743008475090944
Grabbing tweet_id 674752233200820224
Grabbing tweet_id 674754018082705410
Grabbing tweet_id 674764817387900928
Grabbing tweet_id 674767892831932416
Grabbing tweet_id 674774481756377088
Grabbing tweet_id 674781762103414784
Grabbing tweet_id 674788554665512960
Grabbing tweet_id 674790488185167872
Grabbing tweet_id 674793399141146624
Grabbing tweet_id 674800520222154752
Grabbing tweet_id 674805413498527744
Grabbing tweet_id 674999807681908736
Grabbing tweet_id 675003128568291329
Grabbing tweet_id 675006312288268288
Grabbing tweet_id 675015141583413248
Grabbing tweet_id 675047298674663426
Grabbing tweet_id 675109292475830276
Grabbing tweet_id 675111688094527488
Grabbing tweet_id 675113801096802304
G

Grabbing tweet_id 682962037429899265
Grabbing tweet_id 683030066213818368
Grabbing tweet_id 683078886620553216
Grabbing tweet_id 683098815881154561
Grabbing tweet_id 683111407806746624
Grabbing tweet_id 683142553609318400
Grabbing tweet_id 683357973142474752
Grabbing tweet_id 683391852557561860
Grabbing tweet_id 683449695444799489
Grabbing tweet_id 683462770029932544
Grabbing tweet_id 683481228088049664
Grabbing tweet_id 683498322573824003
Grabbing tweet_id 683742671509258241
Grabbing tweet_id 683773439333797890
Grabbing tweet_id 683828599284170753
Grabbing tweet_id 683834909291606017
Grabbing tweet_id 683849932751646720
Grabbing tweet_id 683852578183077888
Grabbing tweet_id 683857920510050305
Grabbing tweet_id 684097758874210310
Grabbing tweet_id 684122891630342144
Grabbing tweet_id 684177701129875456
Grabbing tweet_id 684188786104872960
Grabbing tweet_id 684195085588783105
Grabbing tweet_id 684200372118904832
Grabbing tweet_id 684222868335505415
Grabbing tweet_id 684225744407494656
G

Grabbing tweet_id 699088579889332224
Grabbing tweet_id 699323444782047232
Grabbing tweet_id 699370870310113280
Grabbing tweet_id 699413908797464576
Grabbing tweet_id 699423671849451520
Grabbing tweet_id 699434518667751424
Grabbing tweet_id 699446877801091073
Grabbing tweet_id 699691744225525762
Grabbing tweet_id 699775878809702401
Grabbing tweet_id 699779630832685056


Rate limit reached. Sleeping for: 512


Grabbing tweet_id 699788877217865730
Grabbing tweet_id 699801817392291840
Grabbing tweet_id 700002074055016451
Grabbing tweet_id 700029284593901568
Grabbing tweet_id 700062718104104960
Grabbing tweet_id 700143752053182464
Grabbing tweet_id 700151421916807169
Grabbing tweet_id 700167517596164096
Grabbing tweet_id 700462010979500032
Grabbing tweet_id 700505138482569216
Grabbing tweet_id 700518061187723268
Grabbing tweet_id 700747788515020802
Grabbing tweet_id 700796979434098688
Grabbing tweet_id 700847567345688576
Grabbing tweet_id 700864154249383937
Grabbing tweet_id 700890391244103680
Grabbing tweet_id 701214700881756160
Grabbing tweet_id 701545186879471618
Grabbing tweet_id 701570477911896070
Grabbing tweet_id 701601587219795968
Grabbing tweet_id 701889187134500865
Grabbing tweet_id 701952816642965504
Grabbing tweet_id 701981390485725185
Grabbing tweet_id 702217446468493312
Grabbing tweet_id 702276748847800320
Grabbing tweet_id 702321140488925184
Grabbing tweet_id 702539513671897089
G

Grabbing tweet_id 726887082820554753
Grabbing tweet_id 726935089318363137
Grabbing tweet_id 727175381690781696
Grabbing tweet_id 727286334147182592
Grabbing tweet_id 727314416056803329
Grabbing tweet_id 727524757080539137
Grabbing tweet_id 727644517743104000
Grabbing tweet_id 727685679342333952
Grabbing tweet_id 728015554473250816
Grabbing tweet_id 728035342121635841
Grabbing tweet_id 728046963732717569
Grabbing tweet_id 728387165835677696
Grabbing tweet_id 728409960103686147
Grabbing tweet_id 728653952833728512
Grabbing tweet_id 728751179681943552
Grabbing tweet_id 728760639972315136
Grabbing tweet_id 728986383096946689
Grabbing tweet_id 729113531270991872
Grabbing tweet_id 729463711119904772
Grabbing tweet_id 729823566028484608
Grabbing tweet_id 729838605770891264
Grabbing tweet_id 729854734790754305
Grabbing tweet_id 730196704625098752
Grabbing tweet_id 730211855403241472
Grabbing tweet_id 730427201120833536
Grabbing tweet_id 730573383004487680
Grabbing tweet_id 730924654643314689
G

Grabbing tweet_id 759099523532779520
Grabbing tweet_id 759159934323924993
Grabbing tweet_id 759197388317847553
Grabbing tweet_id 759447681597108224
Grabbing tweet_id 759557299618865152
Grabbing tweet_id 759793422261743616
Grabbing tweet_id 759846353224826880
Grabbing tweet_id 759923798737051648
Grabbing tweet_id 760190180481531904
Grabbing tweet_id 760252756032651264
Grabbing tweet_id 760290219849637889
Grabbing tweet_id 760539183865880579
Grabbing tweet_id 760641137271070720
Grabbing tweet_id 760656994973933572
Grabbing tweet_id 760893934457552897
Grabbing tweet_id 761004547850530816
Grabbing tweet_id 761227390836215808
Grabbing tweet_id 761292947749015552
Grabbing tweet_id 761334018830917632
Grabbing tweet_id 761371037149827077
Grabbing tweet_id 761599872357261312
Grabbing tweet_id 761672994376806400
Grabbing tweet_id 761745352076779520
Grabbing tweet_id 761750502866649088
Grabbing tweet_id 761976711479193600
Grabbing tweet_id 762035686371364864
Grabbing tweet_id 762316489655476224
G

Grabbing tweet_id 794355576146903043
Grabbing tweet_id 794926597468000259
Grabbing tweet_id 794983741416415232
Grabbing tweet_id 795076730285391872
Grabbing tweet_id 795400264262053889
Grabbing tweet_id 795464331001561088
Grabbing tweet_id 796031486298386433
Grabbing tweet_id 796080075804475393
Grabbing tweet_id 796116448414461957
Grabbing tweet_id 796149749086875649
Grabbing tweet_id 796177847564038144
Grabbing tweet_id 796387464403357696
Grabbing tweet_id 796484825502875648
Grabbing tweet_id 796759840936919040
Grabbing tweet_id 796865951799083009
Grabbing tweet_id 797236660651966464
Grabbing tweet_id 797545162159308800
Grabbing tweet_id 797971864723324932
Grabbing tweet_id 798209839306514432
Grabbing tweet_id 798340744599797760
Grabbing tweet_id 798628517273620480
Grabbing tweet_id 798644042770751489
Grabbing tweet_id 798665375516884993
Grabbing tweet_id 798673117451325440
Grabbing tweet_id 798694562394996736
Grabbing tweet_id 798697898615730177
Grabbing tweet_id 798925684722855936
G

Grabbing tweet_id 831262627380748289
Grabbing tweet_id 831309418084069378
Grabbing tweet_id 831315979191906304
Grabbing tweet_id 831322785565769729
Grabbing tweet_id 831552930092285952
Grabbing tweet_id 831650051525054464
Grabbing tweet_id 831670449226514432
Grabbing tweet_id 831911600680497154
Grabbing tweet_id 831939777352105988
Grabbing tweet_id 832032802820481025
Grabbing tweet_id 832040443403784192


Rate limit reached. Sleeping for: 661


Grabbing tweet_id 832215726631055365
Grabbing tweet_id 832273440279240704
Grabbing tweet_id 832369877331693569
Grabbing tweet_id 832397543355072512
Grabbing tweet_id 832636094638288896
Grabbing tweet_id 832757312314028032
Grabbing tweet_id 832769181346996225
Grabbing tweet_id 832998151111966721
Grabbing tweet_id 833124694597443584
Grabbing tweet_id 833479644947025920
Grabbing tweet_id 833722901757046785
Grabbing tweet_id 833826103416520705
Grabbing tweet_id 833863086058651648
Grabbing tweet_id 834086379323871233
Grabbing tweet_id 834167344700198914
Grabbing tweet_id 834209720923721728
Grabbing tweet_id 834458053273591808
Grabbing tweet_id 834574053763584002
Grabbing tweet_id 834786237630337024
Grabbing tweet_id 834931633769889797
Grabbing tweet_id 835152434251116546
Grabbing tweet_id 835172783151792128
Grabbing tweet_id 835264098648616962
Grabbing tweet_id 835297930240217089
Grabbing tweet_id 835574547218894849
Grabbing tweet_id 836001077879255040
Grabbing tweet_id 836260088725786625
G

Grabbing tweet_id 883838122936631299
Grabbing tweet_id 884162670584377345
Grabbing tweet_id 884441805382717440
Grabbing tweet_id 884562892145688576
Grabbing tweet_id 884876753390489601
Grabbing tweet_id 884925521741709313
Grabbing tweet_id 885167619883638784
Grabbing tweet_id 885311592912609280
Grabbing tweet_id 885528943205470208
Grabbing tweet_id 885984800019947520
Grabbing tweet_id 886258384151887873
Grabbing tweet_id 886366144734445568
Grabbing tweet_id 886680336477933568
Grabbing tweet_id 886736880519319552
Grabbing tweet_id 886983233522544640
Grabbing tweet_id 887101392804085760
Grabbing tweet_id 887343217045368832
Grabbing tweet_id 887473957103951883
Grabbing tweet_id 887517139158093824
Grabbing tweet_id 887705289381826560
Grabbing tweet_id 888078434458587136
Grabbing tweet_id 888554962724278272
Grabbing tweet_id 888804989199671297
Grabbing tweet_id 888917238123831296
Grabbing tweet_id 889278841981685760
Grabbing tweet_id 889531135344209921
Grabbing tweet_id 889638837579907072
G

In [124]:
# confirm our data is loaded as expected
api_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,666020888022790149,462,2417


### Part 2: Assessing / Cleaning Our Data
While normally, you can iterate through the data wrangling process by first identifying your issues as a whole, then cleaning them, for simplicity's sake, I will be identifying the issues with my data and cleaning them as I go. These issues will come in two main forms: data quality, and data tidiness. While data quality lacks a neat and *tidy* definition, tidiness does. In particular, [as defined by Hadley Wickham](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), the three major characteristics of tidy data are:
- Each variable forms a column
- Each observation forms a row
- Each observational unit forms a table
Working backwards, we can identify data quality issues as those that deal with the actual data itself. Missing values, improper data types, improperly enterted values all enter the realm of "quality" rather than tidy. 

Bearing these definitions in mind, I'll first lay out an outline of issues I plan to address, and then we can work down the list one by one, expounding on them in more detail and addressing the issues as they occur.
##### Data Quality Issues
- Twitter Archive Table
    1. Retweets exist in the table
    3. Columns have NaN in place of null
    4. Ratings have incorrect denominators
    5. IDs are not all listed as strings
    6. Improperly formatted dog names
    7. Timestamps are formatted as strings 
- Image Predictions Table
    7. Prediction name columns are inconsistenly formatted.
- Twitter API Data Table
    8. Retweets and favorites should be listed as floats
        
##### Data Tidiness Issues
- Twitter Archive Table
    1. Dog types are listed as separate columns
- All Tables
    2. Tables can be merged to provide related information

In [126]:
dogs

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [72]:

# merge the two dataframes        
df3 = df3.reset_index(drop=True)
df3 = df3.astype(int)
df = df.merge(df3, on='tweet_id')
df

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True,462,2417
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True,42,121
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True,41,112
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True,132,272
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True,39,96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2054,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True,8495,37789
2055,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7877,39566
2056,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,3783,23546
2057,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True,5709,31279


In [None]:
# remove retweets by pulling out tweets with a retweeted_status_id, indicating that it originates from another account
dogs = dogs[dogs['retweeted_status_id'].isnull()]