
### TO-DO
  - Grab the retweet and favorite count for each tweet in the dataset by using
  the provided tweet ID
    - Print each tweet ID after it is fetched, and use a code timer to confirm that rate limits are respected.
    - Also
  - Filter out all the retweets, as we want only original ratings that have
  images attached to them.
  - Assess and clean 8 *quality issues*
    - Issues relevant to the Project Motivation must be addressed
        Other issues:
            - IDs are listed as integers or floats but should be strings because they'll never be used in calculations
            - Retweets and favorites should be integers and not floats, as there will never be half of one.
        - Twitter Archive Table
            1. Retweets exist in the table
            2. Some dogs have no name listed, or have a "name" that is just a word from the text of the tweet.
            3. Rating denominator is not always 10
            4. Extremely large numerators
            5. Columns have NaN in place of null
            6. Nulls in the expanded_urls column where there should be strings
        - Image Predictions Table
            7. Tweets with images that are not of dogs.
            8. Prediction name columns use inconsistent casing
  - Assess and clean 2 *tidiness issues*
    - Merge the individual pieces of data according to the rules of *tidy data*
        3. "Each type of observational unit forms a table" -> all tweets can be merged into one table. 
        - Twitter Archive Table
            1. Dog stages are separate columns when they should be one column
            2. Timestamp is a string when it should be a datetime type
        
  - Download the image predictions file with the Requests library from the URL https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
  - Store each tweet's data into a tweet_json.txt file, using the tweepy library.
  - Produce *3 insights* and *1 visualization*
  - *300-600 word written report that describes wrangling efforts*
  - *250 word written report that communicates the insights and visualizations from the wrangled data*

- Access Token
  [redacted]
- Access Token Secret
  [redacted]


- Doggo Stages
  - Doggo
  - Pupper
  - Puppo
  - Blep
  - Snoot
  - Floof
  
- Tidiness requirements
    - Each variable forms a column
    - Each observation forms a row
    - Each type of observational unit forms a table
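
As a minimal sketch of the first rule, separate dog-stage columns can be collapsed into a single `stage` column. The toy frame below, and the assumption that a missing stage is stored as the string `'None'`, are illustrative only and not taken from the archive itself.

```python
import pandas as pd

# Toy data: one column per stage, with the string 'None' marking "no stage" (an assumption).
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})

stage_cols = ['doggo', 'pupper', 'puppo']


def collapse_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[c] for c in stage_cols if row[c] != 'None']
    return ', '.join(found) if found else None


# Add the combined column, then drop the per-stage columns it replaces.
tidy = toy.assign(stage=toy.apply(collapse_stages, axis=1)).drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention more than one stage instead of silently picking one.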

# Wrangle Report

"What, if any, are the characteristics of a good boy?" 

A poignant question with no easy answers. In our effort to answer it, we'll turn to the WeRateDogs Twitter account, a world-renowned resource for finding the goodest boys amongst us. However, before we can find the best doggos of them all (and they are all good dogs, Brent), we need to make that data usable. To do that, we'll proceed through the usual three-step process for wrangling data: gathering, assessing, and cleaning. Our data will come from three sources: the WeRateDogs Twitter archive, contained in the CSV file `twitter-archive-enhanced.csv`; the image predictions, which we can pull down remotely from the file `image-predictions.tsv`; and additional data from the Twitter API. So let's gather those three sources, assess their quality and tidiness, and then get them sorted so we can appreciate every pupper they contain.

### Part 1: Gathering Our Data
#### File 1: The Twitter Archive
This will actually turn out to be the easiest of the bunch, from a "gathering" standpoint. As the file was provided to us by Udacity, we can simply place it in our local directory, and import it directly into a Pandas dataframe, where it will be ready for later use:

In [108]:
# import the we rate dogs twitter archive
import pandas as pd
import numpy as np

dogs = pd.read_csv('twitter-archive-enhanced.csv')
dogs.shape[0]


2356

#### File 2: The Image Predictions
Now, we have to do a little more legwork. We can remotely 'boop' the Udacity servers using the Requests library to download our second data source. From there, it's just a matter of making sure the text is read properly and that Pandas recognizes the tab-separated format, and voila, we'll have our predictions.

In [96]:
import requests
import io

# pull the data down from the Udacity servers
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

# Use the encoded raw text, identify the separator to pandas, and load the dataframe,
# preds here is shorthand for 'predictions'
preds = pd.read_csv(io.StringIO(r.text), sep='\t')
preds.shape[0]

2075
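
A slightly more defensive variant of the download, as a sketch: it fails loudly on HTTP errors and keeps the parsing step separate so it can be checked without a network connection. The helper names `download_text` and `parse_tsv` are hypothetical, and the sample values are made up.

```python
import io

import pandas as pd
import requests


def download_text(url, timeout=30):
    """Fetch a URL and return its body as text, raising on HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_tsv(text):
    """Parse tab-separated text into a dataframe."""
    return pd.read_csv(io.StringIO(text), sep='\t')


# Offline check with a two-row sample shaped like a predictions file (invented values).
sample = "tweet_id\tp1\tp1_conf\n1\tcollie\t0.5\n2\tredbone\t0.25\n"
sample_preds = parse_tsv(sample)
```

With a connection available, `parse_tsv(download_text(url))` would replace the sample step.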

#### File 3: Additional Twitter Details
Here's where things get a little more... ruff. We'll use the `tweepy` library to programmatically access Twitter for additional information about the tweets in the archive above.

Which tweets should we grab? Well, we have the IDs from the Twitter archive above. While this goes back further than we need, we can pare down our results later, once we have all the possible information assembled. For now, let's grab those tweets from the `dogs` df above, write the JSON we get back from the API into a file named `tweet_json.txt`, and read that into a Pandas dataframe:

In [None]:
# grab the Twitter data for each tweet and store it as one JSON object per line in "tweet_json.txt"
import tweepy
import json

# Remember to load your own API key information from the Twitter Developers portal.
# The values below are placeholders; never commit real credentials to a notebook.
api_key = 'YOUR_API_KEY'
api_secret = 'YOUR_API_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Set up our tweepy client so that it waits when we hit the rate limit instead of
# violating it (wait_on_rate_limit_notify is only available in tweepy 3.x):
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Set up an array to catch all the tweets that have since been deleted off Twitter
deleted_tweets = []

# Grab the IDs we need
tweet_ids = dogs['tweet_id']
# create our file if it doesn't exist
with open('tweet_json.txt', 'w+') as f:
    for tweet_id in tweet_ids:
        # set up an exception handler in the event that we run into a tweet that no longer
        # exists, and append that tweet's ID to our array above
        try:
            print('Grabbing tweet_id {}'.format(tweet_id))
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, f)
            # write a newline to ensure that each tweet gets its own separate line in the text file:
            f.write('\n')
        # document errors when they occur (tweepy 4+ renames TweepError to TweepyException):
        except tweepy.TweepError:
            print('Tweet not found: {}'.format(tweet_id))
            deleted_tweets.append(tweet_id)

# collect one row per tweet, then build the dataframe in a single step:
cols = ['tweet_id', 'retweet_count', 'favorite_count']
rows = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        # read through the file line by line, using the keys in the JSON
        # to pull out the data we need for our dataframe
        status = json.loads(line)
        rows.append([status['id_str'], status['retweet_count'], status['favorite_count']])

# DataFrame.append was removed in pandas 2.0, so build the frame from the list instead
api_data = pd.DataFrame(rows, columns=cols)

Grabbing tweet_id 666020888022790149
Grabbing tweet_id 666029285002620928
Grabbing tweet_id 666033412701032449
Grabbing tweet_id 666044226329800704
Grabbing tweet_id 666049248165822465
Grabbing tweet_id 666050758794694657
Grabbing tweet_id 666051853826850816
Grabbing tweet_id 666055525042405380
Grabbing tweet_id 666057090499244032
Grabbing tweet_id 666058600524156928
Grabbing tweet_id 666063827256086533
Grabbing tweet_id 666071193221509120
Grabbing tweet_id 666073100786774016
Grabbing tweet_id 666082916733198337
Grabbing tweet_id 666094000022159362
Grabbing tweet_id 666099513787052032
Grabbing tweet_id 666102155909144576
Grabbing tweet_id 666104133288665088
Grabbing tweet_id 666268910803644416
Grabbing tweet_id 666273097616637952
Grabbing tweet_id 666287406224695296
Grabbing tweet_id 666293911632134144
Grabbing tweet_id 666337882303524864
Grabbing tweet_id 666345417576210432
Grabbing tweet_id 666353288456101888
Grabbing tweet_id 666362758909284353
Grabbing tweet_id 666373753744588802
...

Grabbing tweet_id 670290420111441920
Grabbing tweet_id 670303360680108032
Grabbing tweet_id 670319130621435904
Grabbing tweet_id 670338931251150849
Grabbing tweet_id 670361874861563904
Grabbing tweet_id 670374371102445568
Grabbing tweet_id 670385711116361728
Grabbing tweet_id 670403879788544000
Grabbing tweet_id 670408998013820928
Grabbing tweet_id 670411370698022913
Grabbing tweet_id 670417414769758208
Grabbing tweet_id 670420569653809152
Grabbing tweet_id 670421925039075328
Grabbing tweet_id 670427002554466305
Grabbing tweet_id 670428280563085312
Grabbing tweet_id 670433248821026816
Grabbing tweet_id 670434127938719744
Grabbing tweet_id 670435821946826752
Grabbing tweet_id 670442337873600512
Grabbing tweet_id 670444955656130560
Grabbing tweet_id 670449342516494336
Grabbing tweet_id 670452855871037440
Grabbing tweet_id 670465786746662913
Grabbing tweet_id 670468609693655041
Grabbing tweet_id 670474236058800128
Grabbing tweet_id 670668383499735048
Grabbing tweet_id 670676092097810432
...

Grabbing tweet_id 674644256330530816
Grabbing tweet_id 674646392044941312
Grabbing tweet_id 674664755118911488
Grabbing tweet_id 674670581682434048
Grabbing tweet_id 674690135443775488
Grabbing tweet_id 674737130913071104
Grabbing tweet_id 674739953134403584
Grabbing tweet_id 674743008475090944
Grabbing tweet_id 674752233200820224
Grabbing tweet_id 674754018082705410
Grabbing tweet_id 674764817387900928
Grabbing tweet_id 674767892831932416
Grabbing tweet_id 674774481756377088
Grabbing tweet_id 674781762103414784
Grabbing tweet_id 674788554665512960
Grabbing tweet_id 674790488185167872
Grabbing tweet_id 674793399141146624
Grabbing tweet_id 674800520222154752
Grabbing tweet_id 674805413498527744
Grabbing tweet_id 674999807681908736
Grabbing tweet_id 675003128568291329
Grabbing tweet_id 675006312288268288
Grabbing tweet_id 675015141583413248
Grabbing tweet_id 675047298674663426
Grabbing tweet_id 675109292475830276
Grabbing tweet_id 675111688094527488
Grabbing tweet_id 675113801096802304
...

Grabbing tweet_id 682962037429899265
Grabbing tweet_id 683030066213818368
Grabbing tweet_id 683078886620553216
Grabbing tweet_id 683098815881154561
Grabbing tweet_id 683111407806746624
Grabbing tweet_id 683142553609318400
Grabbing tweet_id 683357973142474752
Grabbing tweet_id 683391852557561860
Grabbing tweet_id 683449695444799489
Grabbing tweet_id 683462770029932544
Grabbing tweet_id 683481228088049664
Grabbing tweet_id 683498322573824003
Grabbing tweet_id 683742671509258241
Grabbing tweet_id 683773439333797890
Grabbing tweet_id 683828599284170753
Grabbing tweet_id 683834909291606017
Grabbing tweet_id 683849932751646720
Grabbing tweet_id 683852578183077888
Grabbing tweet_id 683857920510050305
Grabbing tweet_id 684097758874210310
Grabbing tweet_id 684122891630342144
Grabbing tweet_id 684177701129875456
Grabbing tweet_id 684188786104872960
Grabbing tweet_id 684195085588783105
Grabbing tweet_id 684200372118904832
Grabbing tweet_id 684222868335505415
Grabbing tweet_id 684225744407494656
...

Grabbing tweet_id 699088579889332224
Grabbing tweet_id 699323444782047232
Grabbing tweet_id 699370870310113280
Grabbing tweet_id 699413908797464576
Grabbing tweet_id 699423671849451520
Grabbing tweet_id 699434518667751424
Grabbing tweet_id 699446877801091073
Grabbing tweet_id 699691744225525762
Grabbing tweet_id 699775878809702401
Grabbing tweet_id 699779630832685056


Rate limit reached. Sleeping for: 512
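
To keep an eye on how long a loop like this takes, a code timer can be wrapped around it. A minimal sketch with `time.perf_counter`; `fetch` is a hypothetical stand-in for the real API call so that the example runs offline.

```python
import time


def fetch(tweet_id):
    """Hypothetical stand-in for an API call; it only echoes the ID back."""
    return {'id_str': str(tweet_id)}


tweet_ids = [666020888022790149, 666029285002620928]
results = []

start = time.perf_counter()
for tweet_id in tweet_ids:
    results.append(fetch(tweet_id))
    # Report the time elapsed since the loop started.
    elapsed = time.perf_counter() - start
    print('Fetched {} after {:.3f}s'.format(tweet_id, elapsed))

total_seconds = time.perf_counter() - start
```

`perf_counter` is a monotonic clock, so the measurement is not thrown off by system clock adjustments during a long run.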


In [124]:
# confirm our data is loaded as expected
api_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,666020888022790149,462,2417


### Part 2: Assessing / Cleaning Our Data
Normally, you would work through the data wrangling process by first identifying all of your issues and then cleaning them. For simplicity's sake, I will instead identify the issues with my data and clean them as I go. These issues come in two main forms: data quality and data tidiness.

In [None]:
dogs
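
Beyond eyeballing the frame, a few quality issues can be surfaced programmatically. The sketch below runs on a small made-up frame; the column names `name`, `rating_numerator` and `rating_denominator` are assumed to match the archive, and the numerator threshold is an arbitrary choice.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Charlie', 'a', 'None', 'Bella'],
    'rating_numerator': [12, 10, 1776, 13],
    'rating_denominator': [10, 10, 10, 50],
})

# Names that are missing ('None') or entirely lowercase are likely
# words picked up from the tweet text rather than real names.
suspect_names = toy[(toy['name'] == 'None') | toy['name'].str.islower()]

# Ratings whose denominator is not 10.
odd_denominators = toy[toy['rating_denominator'] != 10]

# Unusually large numerators; the threshold of 20 is arbitrary.
large_numerators = toy[toy['rating_numerator'] > 20]
```

Each result is a filtered view of the rows worth inspecting by hand before deciding how to clean them.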

In [72]:

# merge the image predictions with the API data on tweet_id
api_data = api_data.reset_index(drop=True)
api_data = api_data.astype(int)
df = preds.merge(api_data, on='tweet_id')
df

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True,462,2417
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True,42,121
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True,41,112
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True,132,272
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True,39,96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2054,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True,8495,37789
2055,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,7877,39566
2056,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,3783,23546
2057,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True,5709,31279
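
Before relying on an inner merge, it can help to see which rows fail to match. A sketch using `merge` with `indicator=True` on made-up frames; the frame names `left` and `right` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'tweet_id': [1, 2, 3], 'p1': ['collie', 'redbone', 'basset']})
right = pd.DataFrame({'tweet_id': [1, 3, 4], 'retweet_count': [462, 41, 7]})

# An outer merge keeps every row; the _merge column records where each row came from.
check = left.merge(right, on='tweet_id', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

# The inner merge (the default) keeps only tweets present in both tables.
merged = left.merge(right, on='tweet_id')
```

Rows marked `left_only` or `right_only` are the ones an inner merge would drop.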


In [None]:
# remove retweets by pulling out tweets with a retweeted_status_id, indicating that it originates from another account
dogs = dogs[dogs['retweeted_status_id'].isnull()]
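
The same boolean-mask pattern on a toy frame, followed by two type fixes: IDs stored as strings, since they are never used in calculations, and timestamps parsed as datetimes. The sample values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'tweet_id': [101, 102, 103],
    'retweeted_status_id': [None, 55.0, None],
    'timestamp': [
        '2017-08-01 16:23:56 +0000',
        '2017-07-31 00:18:03 +0000',
        '2017-07-30 15:58:51 +0000',
    ],
})

# Keep only rows without a retweeted_status_id, i.e. original tweets.
originals = toy[toy['retweeted_status_id'].isnull()].copy()

# IDs are labels, not quantities, so store them as strings.
originals['tweet_id'] = originals['tweet_id'].astype(str)

# Parse the timestamp strings into timezone-aware datetimes.
originals['timestamp'] = pd.to_datetime(originals['timestamp'])
```

Taking a `.copy()` after filtering avoids pandas warnings about assigning to a slice of the original frame.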