# [WeRateDogs](https://twitter.com/dog_rates) Wrangling and Analysis 


<a id='intro'></a>
# Introduction
As part of Udacity's Data Analyst Nanodegree program, data from the WeRateDogs Twitter account is wrangled and analyzed for trends.


### Table of Contents
- [Gather](#gather)
    * [Twitter Archive](#g_archive)
    * [Image Predictions](#g_preds)
    * [Retweet/Favorite Data](#g_api)
- [Assess](#assess)
    * [`archive.csv`](#a_archive)
    * [`image_preds.csv`](#a_preds)
    * [`api_data.csv`](#a_api)
- [Clean](#clean)
    1. [Missing Data](#missing)
    2. [Tidiness](#tidy)
    3. [Quality](#quality)
- [Wrangling References](#ref1)
- [Analyze](#analyze)
- [Summary](#summary)


In [1]:
# import libraries
import pandas as pd
import numpy as np
import random 
import matplotlib.pyplot as plt
import requests
import os
import tweepy # API for twitter
import re
import json
import time
from collections import Counter
import itertools as it
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated in pandas >= 1.0)


<a id='gather'></a>
# Gather

<a id='g_archive'></a>
### Twitter Archive

In [2]:
# Read in archived tweet file and check
archive = pd.read_csv('twitter-archive-enhanced.csv')
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

<a id='g_preds'></a>
### Image Predictions

In [3]:
# Make directory if it doesn't already exist
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [4]:
# Request data
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/\
599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response # if [200], successful


<Response [200]>

In [5]:
# Create file, and access body of response using response.content (is in bytes)
with open(os.path.join(folder_name,
                       url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [6]:
# Check that we have downloaded file
os.listdir(folder_name)

['image-predictions.tsv']

In [7]:
# Read in image prediction TSV file and check importation success
image_preds = pd.read_csv('image_predictions/image-predictions.tsv', sep='\t')
image_preds.head()


| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


<a id='g_api'></a>
### API Data

**Plan:** Use tweet IDs in WeRateDogs Twitter archive to query the Twitter API for each tweet's JSON data using Python's Tweepy library. I'll do this for one tweet first, followed by the rest. Store each tweet's set of JSON data in a file called tweet_json.txt. Each tweet's JSON data gets written to its own line. Read the .txt file line by line into a pandas DataFrame with desired data.
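The storage half of this plan (one JSON object per line, read back line by line) can be sketched without touching the API. This is a minimal illustration only: the `statuses` list and the `tweet_json_demo.txt` file name are hypothetical stand-ins for the real API responses and for `tweet_json.txt`.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the status._json dictionaries returned by the API
statuses = [
    {"id_str": "1", "retweet_count": 3, "favorite_count": 10,
     "display_text_range": [0, 40]},
    {"id_str": "2", "retweet_count": 0, "favorite_count": 2,
     "display_text_range": [0, 85]},
]

# Write one JSON object per line to a throwaway file
path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")
with open(path, "w") as outfile:
    for status in statuses:
        json.dump(status, outfile)
        outfile.write("\n")

# Read the file back line by line, keeping only the fields of interest
rows = []
with open(path) as infile:
    for line in infile:
        status = json.loads(line)
        rows.append({
            "tweet_id": status["id_str"],
            "retweet_count": status["retweet_count"],
            "favorite_count": status["favorite_count"],
            "tweet_length": status["display_text_range"][1],
        })
```

Because each line is a complete JSON document, a partially written file can still be parsed up to the last complete line, which is convenient for a slow, interruptible download.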

In [100]:
# Define keys and tokens obtained from Twitter's App application (hidden cell below)
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_token_secret = 'HIDDEN'

In [9]:
# OAuth process, using keys and tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Creation of the actual interface, using authentication & invoking rate limit instructions 
# Automatically wait for rate limits to replenish and print notification when Tweepy is waiting for rate limits to replenish
api = tweepy.API(auth_handler=auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


In [10]:
# Query twitter API for sample tweet ID from archive
# Store status object (contains tweet text and other data)
tweet = api.get_status(735648611367784448, tweet_mode='extended')

# Access and store status object's JSON serializable response data using ._json property 
info_json = tweet._json
info_json

{'created_at': 'Thu May 26 01:47:51 +0000 2016',
 'id': 735648611367784448,
 'id_str': '735648611367784448',
 'full_text': '*faints* 12/10 perfection in pupper form https://t.co/t6TxTwTLEK',
 'truncated': False,
 'display_text_range': [0, 40],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 735648574982246401,
    'id_str': '735648574982246401',
    'indices': [41, 64],
    'media_url': 'http://pbs.twimg.com/media/CjWMezdW0AErwU3.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/CjWMezdW0AErwU3.jpg',
    'url': 'https://t.co/t6TxTwTLEK',
    'display_url': 'pic.twitter.com/t6TxTwTLEK',
    'expanded_url': 'https://twitter.com/dog_rates/status/735648611367784448/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 680, 'h': 680, 'resize': 'fit'},
     'medium': {'w': 1024, 'h': 1024, 'resize': 'fit'},
     'large': {'w': 1024, 'h': 1024, 'resize': 'fit'}}}]},
 'ext

In [11]:
# Extract full_text, created_at, retweet_count, favorite_count, and display_text_range for sample tweet
# We will store all the JSON data in the .txt file for all tweets and extract the stuff we want to fill the DataFrame
print("Tweet ID: {}".format(info_json['id']))
print("Tweet text: \"{}\".".format(info_json['full_text'].split('https')[0]))
print("Tweeted on {}.".format(info_json['created_at']))
print("Retweeted {} times.".format(info_json['retweet_count']))
print("Favorited by {} Twitterers.".format(info_json['favorite_count']))
print("Tweet length: {} characters.".format(info_json['display_text_range'][1]))

Tweet ID: 735648611367784448
Tweet text: "*faints* 12/10 perfection in pupper form ".
Tweeted on Thu May 26 01:47:51 +0000 2016.
Retweeted 1161 times.
Favorited by 4152 Twitterers.
Tweet length: 40 characters.


In [12]:
# Create an empty tweet_json.txt for JSON data storage if it does not already exist
f = 'tweet_json.txt'
if not os.path.isfile(f):
    open(f, mode='w').close()
    

In [13]:
# Store tweet_id series, we will iterate through this to collect all data
tweet_ids = archive.tweet_id

In [14]:
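# Create a counter and empty errors dictionary
# The file already exists at this point (created above), so query the API only while it is still empty
if os.path.getsize(f) == 0: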
    counter = 0
    fails_dict = {} # Will be a dictionary of errors {index_tweetID: Error message}
    # Iterate through tweet_ids to extract JSON data and dump in .txt file
    with open(f, 'w') as outfile:
        for tweet_id in tweet_ids:
            # Slow cell, print Tweet counter to gauge time remaining
            counter += 1
            print(str(counter) + ": " + str(tweet_id))
            try:
                # Retrieve status objects with Tweepy API
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                print("Success")
                # Access/write JSON serializable response data to tweet_json.txt
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except tweepy.TweepError as e:
                # Print text for each exception + the tweet's id
                print("Fail " + str(tweet_id) + ": " + str(e))
                # Store the exception in fails_dict, keyed by the tweet's ID
                fails_dict[tweet_id] = e


In [15]:
# Create dataframe to hold tweet data
api_data = pd.DataFrame(columns=['tweet_id', 'retweet_count', 'favorite_count', 'tweet_length'])

# Load/assign data from txt file and append it to the dataframe
with open('tweet_json.txt') as f:
    for line in f:
        status  = json.loads(line)
        tweet_id = status['id_str']
        retweet_count = status['retweet_count']
        favorite_count = status['favorite_count']
        tweet_length = status['display_text_range'][1]
        api_data = api_data.append(pd.DataFrame([[tweet_id, retweet_count, favorite_count, tweet_length]],
                                        columns=['tweet_id', 'retweet_count', 'favorite_count', 'tweet_length']))

# Reset index so not all indices are 0
api_data = api_data.reset_index(drop=True)
api_data.head()

| | tweet_id | retweet_count | favorite_count | tweet_length |
|---|---|---|---|---|
| 0 | 892420643555336193 | 8231 | 37782 | 85 |
| 1 | 892177421306343426 | 6083 | 32454 | 138 |
| 2 | 891815181378084864 | 4026 | 24436 | 121 |
| 3 | 891689557279858688 | 8389 | 41114 | 79 |
| 4 | 891327558926688256 | 9088 | 39322 | 138 |

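`DataFrame.append` was removed in pandas 2.0, and appending inside a loop copies the frame on every iteration. A possible alternative, sketched here under the assumption that the file has the same one-object-per-line layout, is to collect plain dictionaries and build the frame once. The `demo_file` and `api_demo` names are hypothetical; the two sample records reuse values from the output above.

```python
import io
import json

import pandas as pd

# Hypothetical two-line stand-in for tweet_json.txt
demo_file = io.StringIO(
    '{"id_str": "892420643555336193", "retweet_count": 8231, '
    '"favorite_count": 37782, "display_text_range": [0, 85]}\n'
    '{"id_str": "892177421306343426", "retweet_count": 6083, '
    '"favorite_count": 32454, "display_text_range": [0, 138]}\n'
)

# Collect one dictionary per tweet
rows = []
for line in demo_file:
    status = json.loads(line)
    rows.append({
        'tweet_id': status['id_str'],
        'retweet_count': status['retweet_count'],
        'favorite_count': status['favorite_count'],
        'tweet_length': status['display_text_range'][1],
    })

# Build the dataframe in a single step; the index is already 0..n-1
api_demo = pd.DataFrame(
    rows,
    columns=['tweet_id', 'retweet_count', 'favorite_count', 'tweet_length'],
)
```

One side effect to be aware of: built this way, the count columns come out as integers rather than `object`, so any later datatype cleanup would differ from what the loop-and-append version needs.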


In [16]:
len(api_data.tweet_id[0])

18

<a id='assess'></a>
# Assess

<a id='a_archive'></a>
### `archive` table

In [17]:
# Visually assess
archive.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1979,672980819271634944,,,2015-12-05 03:28:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Extraordinary dog here. Looks large. Just a head. No body. Rather intrusive. 5/10 would still pet https://t.co/ufHWUFA9Pu,,,,https://twitter.com/dog_rates/status/672980819271634944/photo/1,5,10,,,,,
674,789599242079838210,,,2016-10-21 22:48:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Brownie. She's wearing a Halloween themed onesie. 12/10 festive af https://t.co/0R4meWXFOx,,,,"https://twitter.com/dog_rates/status/789599242079838210/photo/1,https://twitter.com/dog_rates/status/789599242079838210/photo/1",12,10,Brownie,,,,
622,796080075804475393,,,2016-11-08 20:00:55 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Yogi. He's 98% floof. Snuggable af. 12/10 https://t.co/opoXKxmfFm,,,,https://twitter.com/dog_rates/status/796080075804475393/photo/1,12,10,Yogi,,,,
772,776477788987613185,,,2016-09-15 17:48:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Huck. He's addicted to caffeine. Hope it's not too latte to seek help. 11/10 stay strong pupper https://t.co/iJE3F0VozW,,,,"https://twitter.com/dog_rates/status/776477788987613185/photo/1,https://twitter.com/dog_rates/status/776477788987613185/photo/1",11,10,Huck,,,pupper,
1398,699775878809702401,,,2016-02-17 02:02:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Fillup. Spaghetti is his main weakness. Also pissed because he's rewarded with cat treats 11/10 it'll be ok pup https://t.co/TEHu55ZQKD,,,,https://twitter.com/dog_rates/status/699775878809702401/photo/1,11,10,Fillup,,,,


In [18]:
# Is there missing data/are the datatypes appropriate? 
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

**RATINGS**

In [19]:
# Are all ratings reasonable (0 <= numerator <= 20 and denominator = 10)?
weird_rates = archive[(archive['rating_numerator'] < 0) | \
                        (archive['rating_numerator'] > 20) | \
                        (archive['rating_denominator'] != 10) ]
weird_rates[['text', 'rating_numerator', 'rating_denominator']].sample(5)

Unnamed: 0,text,rating_numerator,rating_denominator
2335,This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,1,2
1712,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,26,10
1068,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1274,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",45,50


>There are ratings that don't "make sense," but the absurdity of the system is part of the appeal of the account (for example, 1776 in the numerator for Atticus, who is "America af"). I'll leave these alone.

**NAMES**

In [20]:
# Are there any incorrect name recognitions? Most names observed are capitalized.
low_names = []
for pup in archive['name']:
    if pup.islower():
        low_names.append(pup)
wrong_names = Counter(low_names) 
wrong_names

Counter({'such': 1,
         'a': 55,
         'quite': 4,
         'not': 2,
         'one': 4,
         'incredibly': 1,
         'mad': 2,
         'an': 7,
         'very': 5,
         'just': 4,
         'my': 1,
         'his': 1,
         'actually': 2,
         'getting': 2,
         'this': 1,
         'unacceptable': 1,
         'all': 1,
         'old': 1,
         'infuriating': 1,
         'the': 8,
         'by': 1,
         'officially': 1,
         'life': 1,
         'light': 1,
         'space': 1})

In [21]:
# Display examples of name misrecognitions 
bad_names_df = archive[archive['name'].isin(wrong_names)]
bad_names_df[['text', 'name']].sample(5)

Unnamed: 0,text,name
2348,Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt,a
1004,Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R,a
2345,This is the happiest dog you will ever see. Very committed owner. Nice couch. 10/10 https://t.co/RhUEAloehK,the
1878,This is a fluffy albino Bacardi Columbia mix. Excellent at the tweets. 11/10 would hug gently https://t.co/diboDRUuEI,a
1049,This is a very rare Great Alaskan Bush Pupper. Hard to stumble upon without spooking. 12/10 would pet passionately https://t.co/xOBKCdpzaa,a


In [22]:
# Are there any tweets where no name was extracted?
len(archive[archive['name'] == "None"])

745

In [23]:
# Display examples of missing names.
no_names = archive[archive['name'] == 'None']
no_names[['text', 'name']].sample(5)

Unnamed: 0,text,name
2131,"""Hi yes this is dog. I can't help with that s- sir please... the manager isn't in right n- well that was rude""\n10/10 https://t.co/DuQXATW27f",
2298,After much debate this dog is being upgraded to 10/10. I repeat 10/10,
189,"@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10",
1600,This pupper has a magical eye. 11/10 I can't stop looking at it https://t.co/heAGpKTpPW,
2260,RT @dogratingrating: Unoriginal idea. Blatant plagiarism. Curious grammar. -5/10 https://t.co/r7XzeQZWzb,


>Common formats for the appearance of names in text (as observed in visual assessment) are 
* "This is _name_".
* "Meet _name_".
* "Say hello to _name_".
* "Here is _name_".
* "...named _name_".
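The five formats above can be folded into a single regular expression. The sketch below is one possible way to do that, not the method used to build the archive; `NAME_PATTERN` and `extract_name` are hypothetical names, and the pattern assumes a name is a capital letter followed by lowercase ASCII letters.

```python
import re

# Assumed pattern: one of the five lead-in phrases, then a capitalized word
NAME_PATTERN = re.compile(
    r"(?:This is|Meet|Say hello to|Here is|named)\s+([A-Z][a-z]+)"
)


def extract_name(text):
    """Return the first capitalized word after a lead-in phrase, or None."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None


# Tweet openings taken from the samples shown in this notebook
examples = [
    "This is Brownie. She's wearing a Halloween themed onesie.",
    "Meet Fillup. Spaghetti is his main weakness.",
    "This is a Birmingham Quagmire named Chuk.",
    "This is the happiest dog you will ever see.",
]
names = [extract_name(t) for t in examples]
```

Because the word after the phrase must start with a capital letter, "This is a ..." and "This is the ..." are skipped rather than read as names. Names with accents or internal capitals would be cut short by this simple character class.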


In [24]:
# Look for instances of the five main phrases appearing in the tweet while the
# extracted name is one of the lowercase words in wrong_names.
# Non-capturing groups (?:...) avoid the pandas warning about match groups in str.contains.
n1 = archive[(archive['text'].str.contains('This is (?:[A-Z][a-z]+).')) & (archive['name'].isin(wrong_names))]
n2 = archive[(archive['text'].str.contains('Meet (?:[A-Z][a-z]+).')) & (archive['name'].isin(wrong_names))]
n3 = archive[(archive['text'].str.contains('Say hello to (?:[A-Z][a-z]+).')) & (archive['name'].isin(wrong_names))]
n4 = archive[(archive['text'].str.contains('Here is (?:[A-Z][a-z]+).')) & (archive['name'].isin(wrong_names))]
n5 = archive[(archive['text'].str.contains('named (?:[A-Z][a-z]+).')) & (archive['name'].isin(wrong_names))]
# Print length of these dataframes to see where our problems lie (if we have them)
len(n1), len(n2), len(n3), len(n4), len(n5)




(0, 0, 0, 0, 20)

In [25]:
# The phrase that seems to be slipping through the cracks is "named ______."
# Display dataframe of those tweets
nn = archive[(archive['text'].str.contains('named (?:[A-Z][a-z]+).'))]
nn[['text', 'name']].sample(5)



Unnamed: 0,text,name
2269,This a Norwegian Pewterschmidt named Tickles. Ears for days. 12/10 I care deeply for Tickles https://t.co/0aDF62KVP7,
2218,This is a Birmingham Quagmire named Chuk. Loves to relax and watch the game while sippin on that iced mocha. 10/10 https://t.co/HvNg9JWxFt,a
2227,Here we have an Azerbaijani Buttermilk named Guss. He sees a demon baby Hitler behind his owner. 10/10 stays alert https://t.co/aeZykWwiJN,
2235,This is a Trans Siberian Kellogg named Alfonso. Huge ass eyeballs. Actually Dobby from Harry Potter. 7/10 https://t.co/XpseHBlAAb,a
2204,This is an Irish Rigatoni terrier named Berta. Completely made of rope. No eyes. Quite large. Loves to dance. 10/10 https://t.co/EM5fDykrJg,an


In [26]:
n_more2 = archive[(archive['name'].isin(wrong_names)) & (~archive.index.isin(nn.index))]
n_more2[['text', 'name']].sample(5)

Unnamed: 0,text,name
1063,This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,just
1207,This is a taco. We only rate dogs. Please only send in dogs. Dogs are what we rate. Not tacos. Thank you... 10/10 https://t.co/cxl6xGY8B9,a
1916,This is life-changing. 12/10 https://t.co/SroTpI6psB,life
1499,This is a rare Arctic Wubberfloof. Unamused by the happenings. No longer has the appetites. 12/10 would totally hug https://t.co/krvbacIX0N,a
1724,This is by far the most coordinated series of pictures I was sent. Downright impressive in every way. 12/10 for all https://t.co/etzLo3sdZE,by


**STAGES**

In [27]:
# Are the dog stages always recognized correctly from the text? 
df_cat = {'stage': [], 'in_tweet': [], 'recognized': []}

for cat in ['doggo', 'floofer', 'pupper', 'puppo']:
    t = archive.text.str.contains(cat).sum()
    r = (archive[cat] != 'None').sum()
    df_cat['stage'].append(cat)
    df_cat['in_tweet'].append(t)
    df_cat['recognized'].append(r)
    
df_cat = pd.DataFrame(data=df_cat)
df_cat['errors'] = df_cat['in_tweet'] - df_cat['recognized']
df_cat

| | stage | in_tweet | recognized | errors |
|---|---|---|---|---|
| 0 | doggo | 98 | 97 | 1 |
| 1 | floofer | 4 | 10 | -6 |
| 2 | pupper | 272 | 257 | 15 |
| 3 | puppo | 37 | 30 | 7 |

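The negative error for `floofer` means the stage was recognized in more tweets than the lowercase substring search finds. One possible cause, not verified against the archive here, is capitalization: `str.contains` is case-sensitive by default, so "Floofer" would be missed. The sketch below compares the two modes on hypothetical tweets.

```python
import pandas as pd

# Hypothetical tweets for illustration; not taken from the archive
texts = pd.Series([
    "Meet a Floofer of the highest order. 12/10",
    "This floofer is majestic. 13/10",
    "Doggo on patrol. 11/10",
    "No stage mentioned here. 10/10",
])

# (case-sensitive count, case-insensitive count) per stage
counts = {}
for stage in ['doggo', 'floofer', 'pupper', 'puppo']:
    sensitive = int(texts.str.contains(stage).sum())
    insensitive = int(texts.str.contains(stage, case=False).sum())
    counts[stage] = (sensitive, insensitive)
```

If the case-insensitive counts on the real archive line up more closely with the `recognized` column, capitalization is a likely explanation; otherwise the cause lies elsewhere.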


In [28]:
# Are dogs ever classified as more than one type of dog?

In [29]:
cats = ['doggo', 'floofer', 'pupper', 'puppo']
pairs = list(it.combinations(cats, 2))
pairs

[('doggo', 'floofer'),
 ('doggo', 'pupper'),
 ('doggo', 'puppo'),
 ('floofer', 'pupper'),
 ('floofer', 'puppo'),
 ('pupper', 'puppo')]

In [30]:
# Create list of dataframes with each type of double classification
frames = []
for first, second in pairs:
    both = archive[(archive[first] == first) & (archive[second] == second)]
    frames.append(both)

doubles = pd.concat(frames)
len(doubles)

14
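An alternative to checking every pair is to count how many stage columns are filled in per row. This is a sketch on a hypothetical three-row frame (`demo`) that mimics the archive's layout, where the string `'None'` marks an absent stage.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows in the archive's layout
demo = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

# Number of stage columns filled in for each row
n_stages = (demo[stage_cols] != 'None').sum(axis=1)

# Rows classified as more than one stage
doubles_demo = demo[n_stages > 1]
```

Unlike concatenating one frame per pair, this counts a row with three or more stages only once.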

<a id='a_preds'></a>
### `image_preds` table

In [31]:
# Visually assess
image_preds.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
792,690728923253055490,https://pbs.twimg.com/media/CZX2SxaXEAEcnR6.jpg,1,kuvasz,0.422806,True,golden_retriever,0.291586,True,Great_Pyrenees,0.076189,True
1149,731285275100512256,https://pbs.twimg.com/media/CiYME3tVAAENz99.jpg,1,Pembroke,0.967103,True,Cardigan,0.021126,True,Chihuahua,0.002231,True
365,672898206762672129,https://pbs.twimg.com/media/CVadWcCXIAAL4Sh.jpg,1,motor_scooter,0.835819,False,bobsled,0.035856,False,moped,0.033079,False
828,693590843962331137,https://pbs.twimg.com/media/CaAhMb1XEAAB6Bz.jpg,1,dining_table,0.383448,False,grey_fox,0.103191,False,Siamese_cat,0.098256,False
1767,826958653328592898,https://pbs.twimg.com/media/C3nygbBWQAAjwcW.jpg,1,golden_retriever,0.617389,True,Labrador_retriever,0.337053,True,tennis_ball,0.008554,False


In [32]:
# Is there missing data/are the datatypes appropriate? 
image_preds.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [33]:
# Check for prediction confidences outside the interval [0, 1]
cols = ['p1_conf', 'p2_conf', 'p3_conf']
for col in cols:
    print(len(image_preds[(image_preds[col] < 0) | (image_preds[col] > 1)]))


0
0
0


<a id='a_api'></a>
### `api_data` table

In [34]:
# Visually assess
api_data.sample(5)

| | tweet_id | retweet_count | favorite_count | tweet_length |
|---|---|---|---|---|
| 224 | 847971574464610304 | 444 | 0 | 81 |
| 1696 | 680494726643068929 | 510 | 1787 | 106 |
| 2259 | 667443425659232256 | 580 | 777 | 138 |
| 150 | 861288531465048066 | 4177 | 17156 | 138 |
| 849 | 762464539388485633 | 4497 | 10843 | 113 |


In [35]:
# Is there missing data/are the datatypes appropriate? 
api_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 4 columns):
tweet_id          2339 non-null object
retweet_count     2339 non-null object
favorite_count    2339 non-null object
tweet_length      2339 non-null object
dtypes: object(4)
memory usage: 73.2+ KB


In [36]:
# How many tweets are missing api_data?
len(archive) - len(api_data)

17
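To see which tweets are missing rather than only how many, the two ID columns can be compared directly. A minimal sketch on hypothetical miniature tables follows; note that `archive.tweet_id` holds integers while `api_data.tweet_id` holds strings, so the comparison needs a common type.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
archive_ids = pd.DataFrame({'tweet_id': [101, 102, 103, 104]})   # integers, as in archive
api_ids = pd.DataFrame({'tweet_id': ['101', '103']})             # strings, as in api_data

# Cast to a common type first; integer IDs never match string IDs
in_api = archive_ids['tweet_id'].astype(str).isin(api_ids['tweet_id'])

# IDs present in the archive but absent from the API data
missing = archive_ids.loc[~in_api, 'tweet_id']
```

On the real tables, the resulting IDs could be cross-checked against the keys of `fails_dict` from the download loop.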

### Assessment Summary:

#### Quality
##### `archive` 
- ~~181 retweets and 78 replies included in dataset~~
- Some expanded_urls are missing **(not addressing)**
- ~~Tweet text contains url~~
- ~~Erroneous datatypes: tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and all dog stage columns~~
- ~~Missing names & incorrect name recognitions~~
- Some tweets are about two dogs; the second dog's name is missing **(not addressing)**
- ~~Missing classifications for `doggo`, `floofer`, `pupper`, or `puppo`~~
- ~~Presence of more than one dog category in tweet~~
- ~~Series for "We only rate dogs" reprimand unavailable~~

##### `image_preds` 
- ~~Many predictions are lowercase and contain underscores~~
- ~~There are fewer image_preds entries than there are tweets in the archive (perhaps a result of retweets, perhaps not)~~
- ~~Erroneous datatype (tweet_id is integer, not string)~~

##### `api_data` 
- ~~Erroneous datatypes (retweet_count, favorite_count, and tweet_length are object, not integer)~~
- ~~There are 17 fewer entries in api_data than there are in archive~~

#### Tidiness
##### `archive` 
- ~~doggo, floofer, pupper, puppo violate "Multiple columns shouldn't contain the same type of data"~~

##### `image_preds` 
- ~~p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog violate "Multiple columns shouldn't contain the same type of data"~~

##### `api_data` 
- ~~This dataset is separate from the rest of the data~~




<a id='clean'></a>
# Clean

#### Data Inclusion Criteria
From Udacity - Exclude retweets and tweets without corresponding image predictions from analysis. 
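Both criteria can be expressed as boolean masks. The sketch below shows the idea on hypothetical miniature tables (`arch_demo`, `preds_demo`); it assumes, as in the archive, that `retweeted_status_id` is null for original tweets.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables
arch_demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 8.0e17, np.nan],  # non-null marks a retweet
})
preds_demo = pd.DataFrame({'tweet_id': [1, 2]})

# Criterion 1: keep original tweets only
is_original = arch_demo['retweeted_status_id'].isna()

# Criterion 2: keep tweets that have an image prediction
has_prediction = arch_demo['tweet_id'].isin(preds_demo['tweet_id'])

kept = arch_demo[is_original & has_prediction]
```

Here tweet 2 is dropped as a retweet and tweet 3 is dropped for lacking a prediction, leaving only tweet 1.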

In [37]:
# Make copies of the three datasets
arch_clean = archive.copy()
preds_clean = image_preds.copy()
api_clean = api_data.copy()

<a id='missing'></a>
## 1) Missing Data
1. Some expanded_urls are missing **(not addressing)**
2. Missing classifications for `doggo`, `floofer`, `pupper`, or `puppo`
3. Presence of more than one dog stage in tweet
4. Missing names & incorrect name recognitions
5. Some tweets are about two or more dogs, extra names missing **(not addressing)**

### `archive`: Missing classifications for `doggo`, `floofer`, `pupper`, or `puppo`
I will also address the issue: Presence of more than one dog stage in tweet


##### Define
Create list to hold accurate stages, then append identified stages to list using for loop to iterate through rows. The tweets containing more than one stage will read as "multi."

##### Code

In [38]:
cats

['doggo', 'floofer', 'pupper', 'puppo']

In [39]:
# Create empty list for final stages
stage_list = []

# For loop to fill above list
for row in arch_clean['text']:
    # List for each row, will fill with all instances of category
    new_row = []
    for cat in cats:
        if cat in row:
            new_row.append(cat)
    if len(new_row) > 1:
        new_row = 'multi'
    elif not new_row:
        new_row = np.nan
    else: 
        new_row = new_row[0]

    stage_list.append(new_row)

# Add dog_stage series to dataframe from stage_list
arch_clean['dog_stage'] = stage_list
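The loop above uses a case-sensitive substring test. A hypothetical variant, sketched below, matches whole words (optionally plural) and ignores case. `find_stage` and `STAGES` are illustrative names, and the function returns `None` rather than `np.nan` for tweets without a stage.

```python
import re

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']


def find_stage(text):
    """Return the single stage named in text, 'multi' for several, or None."""
    found = [
        stage for stage in STAGES
        # \b keeps the match to whole words; s? also accepts the plural
        if re.search(r'\b' + stage + r's?\b', text, flags=re.IGNORECASE)
    ]
    if len(found) > 1:
        return 'multi'
    return found[0] if found else None


# Tweet texts taken from the samples shown in this notebook
results = [
    find_stage("Like father (doggo), like son (pupper). Both 12/10"),
    find_stage("This pupper has a magical eye. 11/10 I can't stop looking at it"),
    find_stage("Here we are witnessing the touchdown of a pupnado."),
    find_stage("Here we have uncovered an entire battalion of holiday puppers."),
]
```

Applied with `arch_clean['text'].apply(find_stage)`, this would fill the same `dog_stage` column, though the counts may differ from the substring version because of the case and word-boundary rules.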

##### Test

In [40]:
arch_clean[['text', 'dog_stage']].sample(10)

Unnamed: 0,text,dog_stage
975,This is Beau. He's trying to keep his daddy from packing to leave for Annual Training. 13/10 and now I'm crying https://t.co/7JeDfQvzzI,
1111,"""Ello this is dog how may I assist"" ...10/10 https://t.co/jeAENpjH7L",
588,This is Longfellow (prolly sophisticated). He's a North Appalachian Oatzenjammer. Concerned about wrinkled feets. 12/10 would hug softly https://t.co/bpLuQuxzHZ,
1809,Meet Ash. He's just a head now. Lost his body during the Third Crusade. Still in good spirits. 10/10 would pet well https://t.co/NJj2uP0atK,
1113,"Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda",multi
1481,This is Sadie and her 2 pups Shebang &amp; Ruffalo. Sadie says single parenting is challenging but rewarding. All 10/10 https://t.co/UzbhwXcLne,
2162,Meet Ronduh. She's a Finnish Checkered Blitzkrieg. Ears look fake. Shoes on point. 10/10 would pet extra well https://t.co/juktj5qiaD,
80,"Meet Dante. At first he wasn't a fan of his new raincoat, then he saw his reflection. H*ckin handsome. 13/10 for water resistant good boy https://t.co/SHRTIo5pxc",
680,"This is Lucy. She destroyed not one, but two remotes trying to turn off the debate. 11/10 relatable af https://t.co/3BXh073tDm",
1663,"I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible",


### `archive`: Missing names & incorrect name recognitions
I will not address tweets with more than one dog here; that is left for another project/challenge.

Things to keep in mind when planning resolution of issue:
- Some names are presented in the context, "...named ______."
- All observations of names are capitalized.
- The set of tweets containing "This is _name_," "Meet _name_," "Say hello to _name_," and "Here is _name_," and the set of tweets with names in `wrong_names` are mutually exclusive.


##### Define
Write a function that accepts a tweet's `text` and returns the correct name. The function will use a regular expression to extract the dog name from all tweets matching the pattern "...named _name_." Then replace all names in `wrong_names` with NaN.

##### Code

In [41]:
# Function accepts a tweet's text (a string), returns name from tweets with "...named ___."
def named_fix(tweety):
    # Return extracted name if "named __" in tweet
    if 'named ' in tweety:
        m = re.search(r'(?<=named\s)(\w+)', tweety)
        if m:
            return m.group(0)
    # Otherwise return the original name; look the row up by its text
    # (get_loc returns a single position only when the text is unique)
    ind = pd.Index(arch_clean.text).get_loc(tweety)
    return arch_clean.name.iloc[ind]

In [42]:
# Apply named_fix to arch_clean.text to repopulate arch_clean.name
arch_clean.name = arch_clean.text.apply(named_fix)
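The same repair can be done without a per-row index lookup by extracting the name for the whole column at once and falling back to the existing value. This is a sketch on a hypothetical two-row frame (`demo`) whose texts are shortened from tweets shown earlier.

```python
import pandas as pd

# Hypothetical two-row frame in the archive's layout
demo = pd.DataFrame({
    'text': [
        "This is a Birmingham Quagmire named Chuk. Loves to relax. 10/10",
        "This is Yogi. He's 98% floof. Snuggable af. 12/10",
    ],
    'name': ['a', 'Yogi'],
})

# Word following "named ", or NaN where the phrase is absent
extracted = demo['text'].str.extract(r'named\s(\w+)', expand=False)

# Prefer the extracted name; fall back to the existing one
demo['name'] = extracted.fillna(demo['name'])
```

Because the fallback is aligned by index rather than by text, duplicate tweet texts pose no problem with this approach.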

In [43]:
# Function accepts a single name, returns np.nan if it is a key of wrong_names or 'None'
def wrong_names_fix(n):
    # If the name is a key of wrong_names or was entered as 'None', return NaN
    if n in wrong_names or n == 'None':
        return np.nan
    # Otherwise return the input unchanged
    return n

In [44]:
arch_clean.name = arch_clean.name.apply(wrong_names_fix)

##### Test

In [45]:
# Check to see if all "...named ____" instances are caught and corrected
named_check = arch_clean[(arch_clean['text'].str.contains('named '))]
named_check[['text', 'name']]

Unnamed: 0,text,name
603,RT @dog_rates: This a Norwegian Pewterschmidt named Tickles. Ears for days. 12/10 I care deeply for Tickles https://t.co/0aDF62KVP7,Tickles
1853,This is a Sizzlin Menorah spaniel from Brooklyn named Wylie. Lovable eyes. Chiller as hell. 10/10 and I'm out.. poof https://t.co/7E0AiJXPmI,Wylie
1955,This is a Lofted Aphrodisiac Terrier named Kip. Big fan of bed n breakfasts. Fits perfectly. 10/10 would pet firmly https://t.co/gKlLpNzIl3,Kip
2034,This is a Tuscaloosa Alcatraz named Jacob (Yacōb). Loves to sit in swing. Stellar tongue. 11/10 look at his feet https://t.co/2IslQ8ZSc7,Jacob
2066,This is a Helvetica Listerine named Rufus. This time Rufus will be ready for the UPS guy. He'll never expect it 9/10 https://t.co/34OhVhMkVr,Rufus
2116,This is a Deciduous Trimester mix named Spork. Only 1 ear works. No seat belt. Incredibly reckless. 9/10 still cute https://t.co/CtuJoLHiDo,Spork
2125,This is a Rich Mahogany Seltzer named Cherokee. Just got destroyed by a snowball. Isn't very happy about it. 9/10 https://t.co/98ZBi6o4dj,Cherokee
2128,This is a Speckled Cauliflower Yosemite named Hemry. He's terrified of intruder dog. Not one bit comfortable. 9/10 https://t.co/yV3Qgjh8iN,Hemry
2146,This is a spotted Lipitor Rumpelstiltskin named Alphred. He can't wait for the Turkey. 10/10 would pet really well https://t.co/6GUGO7azNX,Alphred
2161,This is a Coriander Baton Rouge named Alfredo. Loves to cuddle with smaller well-dressed dog. 10/10 would hug lots https://t.co/eCRdwouKCl,Alfredo


In [46]:
# Check to see if all wrong names have been replaced by np.NaN 
wn_check = arch_clean[arch_clean['name'].isin(wrong_names)]
wn_check[['text', 'name']]

Unnamed: 0,text,name


In [47]:
# Look at sample of name == np.NaN for instances of not catching the name
nulls = arch_clean[arch_clean.name.isna()]
nulls[['text', 'name']].sample(10)

Unnamed: 0,text,name
193,"Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective https://t.co/Xc7uj1C64x",
843,His name is Charley and he already has a new set of wheels thanks to donations. I heard his top speed was also increased. 13/10 for Charley,
1958,When you ask your professor about extra credit on the last day of class. 8/10 https://t.co/H6rqZyE4NP,
1729,"""Dammit hooman I'm jus trynna lik the fler"" 11/10 https://t.co/eRZRI8OTj7",
1858,"I shall call him squishy and he shall be mine, and he shall be my squishy. 13/10 https://t.co/WId5lxNdPH",
615,RT @dog_rates: I want to finally rate this iconic puppo who thinks the parade is all for him. 13/10 would absolutely attend https://t.co/5d…,
1782,This was Cindy's face when she heard Susan forgot the snacks for after the kid's soccer game. 11/10 https://t.co/gzkuVGRgAD,
1065,Here we are witnessing the touchdown of a pupnado. It's not funny it's actually very deadly. 9/10 might still pet https://t.co/CmLoKMbOHv,
1552,This pupper just wants to say hello. 11/10 would knock down fence for https://t.co/A8X8fwS78x,
189,"@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10",


>The sample reveals another name-introduction pattern, "... name is ...", along with a few one-off phrasings. I will fix the "name is" cases here and return to the one-off instances in a later iteration of the project.

##### Code

In [48]:
# Create dataframe for all tweet rows containing the phrase "name is"
df = arch_clean[(arch_clean['text'].str.contains('name is ' + '([A-Z][a-z]+)'))]
df[['text', 'name']]

  


Unnamed: 0,text,name
35,I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk,
168,Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's 13/10 https://t.co/BArWupFAn0,
843,His name is Charley and he already has a new set of wheels thanks to donations. I heard his top speed was also increased. 13/10 for Charley,
852,This is my dog. Her name is Zoey. She knows I've been rating other dogs. She's not happy. 13/10 no bias at all https://t.co/ep1NkYoiwB,
1678,We normally don't rate bears but this one seems nice. Her name is Thea. Appears rather fluffy. 10/10 good bear https://t.co/fZc7MixeeT,
1734,This pup's name is Sabertooth (parents must be cool). Ears for days. Jumps unannounced. 9/10 would pet diligently https://t.co/iazoiNUviP,
2267,Another topnotch dog. His name is Big Jumpy Rat. Massive ass feet. Superior tail. Jumps high af. 12/10 great pup https://t.co/seESNzgsdm,
2287,This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW,
2313,This is Lugan. He is a Bohemian Rhapsody. Very confused dog. Thinks his name is Rocky. Not amused by the snows 10/10 https://t.co/tI3uFLDHBI,Lugan


In [49]:
# Code to fix names with "name is" in text
# Names are listed in the same order as the rows of df shown above
name_is_ind = df.index.tolist()
name_is_names = ['Howard', 'Zoey', 'Charley', 'Zoey', 'Thea', 'Sabertooth', 'Big Jumpy Rat', 'Daryl', 'Lugan']

# Pair each row index with its corrected name and write it back
for i, dog_name in zip(name_is_ind, name_is_names):
    arch_clean.at[i, 'name'] = dog_name
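
The hand-typed list above works for nine rows, but the names could also be pulled from the text directly. The following is a minimal sketch on made-up toy rows (shortened from the tweets above), not the method used in this notebook. It assumes a name is one or more capitalized words after "name is", and it only fills rows whose `name` is still missing, so that "Lugan" is not overwritten by "Rocky".

```python
import pandas as pd

# Toy rows standing in for arch_clean (texts shortened from the tweets above)
toy = pd.DataFrame({
    'text': ['I have a new hero and his name is Howard. 14/10',
             'His name is Big Jumpy Rat. Massive feet. 12/10',
             'This is Lugan. Thinks his name is Rocky. 10/10',
             'This pupper just wants to say hello. 11/10'],
    'name': [None, None, 'Lugan', None]})

# Capture one or more capitalized words following "name is "
pattern = r'name is ((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)'
extracted = toy['text'].str.extract(pattern, expand=False)

# Fill only the rows that have no name yet; existing names are kept
toy['name'] = toy['name'].fillna(extracted)
print(toy[['text', 'name']])
```

Rows without the phrase simply stay null, so the one-off cases still need manual review.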

##### Test

In [50]:
# Test "name is" dogs
df1 = arch_clean[(arch_clean['text'].str.contains('name is ' + '([A-Z][a-z]+)'))]
df1[['text', 'name']]

  


Unnamed: 0,text,name
35,I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk,Howard
168,Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's 13/10 https://t.co/BArWupFAn0,Zoey
843,His name is Charley and he already has a new set of wheels thanks to donations. I heard his top speed was also increased. 13/10 for Charley,Charley
852,This is my dog. Her name is Zoey. She knows I've been rating other dogs. She's not happy. 13/10 no bias at all https://t.co/ep1NkYoiwB,Zoey
1678,We normally don't rate bears but this one seems nice. Her name is Thea. Appears rather fluffy. 10/10 good bear https://t.co/fZc7MixeeT,Thea
1734,This pup's name is Sabertooth (parents must be cool). Ears for days. Jumps unannounced. 9/10 would pet diligently https://t.co/iazoiNUviP,Sabertooth
2267,Another topnotch dog. His name is Big Jumpy Rat. Massive ass feet. Superior tail. Jumps high af. 12/10 great pup https://t.co/seESNzgsdm,Big Jumpy Rat
2287,This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW,Daryl
2313,This is Lugan. He is a Bohemian Rhapsody. Very confused dog. Thinks his name is Rocky. Not amused by the snows 10/10 https://t.co/tI3uFLDHBI,Lugan


<a id='tidy'></a>
## 2) Tidiness
1. in `archive` doggo, floofer, pupper, puppo violate "Multiple columns shouldn't contain the same type of data"
2. `image_preds` violates "Multiple columns shouldn't contain the same type of data"
3. `api_data` is separate from rest of data

### `archive`: doggo, floofer, pupper, puppo violate "Multiple columns shouldn't contain the same type of data"

##### Define
We created a new column for dog stage in 1) Missing, so the only action necessary is to drop the four columns that are no longer needed.

##### Code

In [51]:
arch_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True);

##### Test

In [52]:
set(['doggo', 'floofer', 'pupper', 'puppo']).issubset(arch_clean.columns)

False
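
Note that `issubset` returns `False` as soon as any one of the four columns is missing, so it would also return `False` after a partial drop. A stricter check is sketched below on a made-up empty frame (the frame and variable names are illustrative only): `isdisjoint` is `True` only when none of the stage columns remain.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical frame with the same stage columns as the archive
toy = pd.DataFrame(columns=['tweet_id', 'text'] + stage_cols)
partially_dropped = toy.drop(columns=['doggo'])
fully_dropped = toy.drop(columns=stage_cols)

# issubset is already False when a single column is gone
subset_partial = set(stage_cols).issubset(partially_dropped.columns)

# isdisjoint distinguishes a partial drop from a complete one
none_left_partial = set(stage_cols).isdisjoint(partially_dropped.columns)
none_left_full = set(stage_cols).isdisjoint(fully_dropped.columns)
print(subset_partial, none_left_partial, none_left_full)
```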

### `image_preds`: p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog violate "Multiple columns shouldn't contain the same type of data"

In [53]:
preds_clean.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


##### Define
* melt `p1`, `p2`, `p3` into a `prediction` column
* melt `p1_conf`, `p2_conf`, `p3_conf` into a `confidence` column
* melt `p1_dog`, `p2_dog`, `p3_dog` into an `accuracy` column (`True` when the prediction is a dog breed)

Accomplish the melting using `pd.wide_to_long` after renaming the columns to the `stubname_number` pattern it expects.

##### Code

In [54]:
# Update column names
cols_new = ['tweet_id', 'jpg_url', 'img_num', 
             'prediction_1', 'confidence_1', 'accuracy_1', 
             'prediction_2', 'confidence_2', 'accuracy_2', 
             'prediction_3', 'confidence_3', 'accuracy_3']
preds_clean.columns = cols_new

In [55]:
# Apply pd.wide_to_long
preds_clean = (pd.wide_to_long(preds_clean,
                               stubnames=['prediction', 'confidence', 'accuracy'],
                               i=['tweet_id', 'jpg_url', 'img_num'],
                               j='order', sep='_', suffix='.')
               .reset_index()
               .sort_index())
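
To make the reshaping easier to follow, here is a small sketch of `pd.wide_to_long` on a made-up two-tweet frame. The data and the variable names `wide` and `long_df` are illustrative assumptions. It uses the default-style numeric suffix `r'\d+'` rather than `'.'`, which should behave the same for the suffixes 1 to 3.

```python
import pandas as pd

# Made-up wide frame: one row per tweet, numbered prediction/confidence columns
wide = pd.DataFrame({
    'tweet_id': [1, 2],
    'prediction_1': ['collie', 'redbone'],
    'confidence_1': [0.5, 0.6],
    'prediction_2': ['pug', 'beagle'],
    'confidence_2': [0.2, 0.1]})

# Each stub becomes one column; the numeric suffix moves into the 'order' column
long_df = (pd.wide_to_long(wide,
                           stubnames=['prediction', 'confidence'],
                           i='tweet_id', j='order',
                           sep='_', suffix=r'\d+')
           .reset_index()
           .sort_values(['tweet_id', 'order'])
           .reset_index(drop=True))
print(long_df)
```

The result has one row per tweet and prediction rank, which is the shape `preds_clean` takes after the cell above.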

##### Test

In [56]:
preds_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,order,prediction,confidence,accuracy
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,Welsh_springer_spaniel,0.465074,True
1,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,2,collie,0.156665,True
2,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,3,Shetland_sheepdog,0.061428,True
3,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
4,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,2,miniature_pinscher,0.074192,True


In [57]:
preds_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6225 entries, 0 to 6224
Data columns (total 7 columns):
tweet_id      6225 non-null int64
jpg_url       6225 non-null object
img_num       6225 non-null int64
order         6225 non-null int64
prediction    6225 non-null object
confidence    6225 non-null float64
accuracy      6225 non-null bool
dtypes: bool(1), float64(1), int64(3), object(2)
memory usage: 298.0+ KB


In [58]:
len(preds_clean)/3

2075.0

In [59]:
len(arch_clean)

2356

>Comparing the two counts (2075 tweets with predictions versus 2356 tweets in `archive`) shows that not all tweets in `archive` have image predictions. We will drop those that do not.

### `api_data`: This dataset is separate from the rest of the data

In [60]:
len(arch_clean)-len(api_clean)

17

##### Define
Merge `api_data` with `archive` using `tweet_id` as key (change `tweet_id` to object datatype in both dataframes first). Use `how='inner'` so only common tweets are retained.

##### Code

In [61]:
# Cast tweet_id columns in arch_clean and api_clean as object datatypes
arch_clean.tweet_id = arch_clean.tweet_id.astype(str)
api_clean.tweet_id = api_clean.tweet_id.astype(str)

In [62]:
# Check that there are no duplicate tweet_ids in either table
arch_clean.tweet_id.duplicated().sum(), api_clean.tweet_id.duplicated().sum()

(0, 0)

In [63]:
# Check that all api_clean tweets map to arch_clean tweet_ids
len(arch_clean[arch_clean['tweet_id'].isin(api_clean['tweet_id'].tolist())]), len(api_clean)

(2339, 2339)

In [64]:
# Merge the API and archive tables, only keeping tweet data from tweet_ids in both tables
arch_clean = pd.merge(left=arch_clean, right=api_clean, how='inner', on='tweet_id')
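
Before an inner merge it can help to see exactly which rows will be lost. The sketch below uses two made-up frames (`archive_toy` and `api_toy` are hypothetical stand-ins) and the `indicator` and `validate` options of `DataFrame.merge`; it is one possible audit, not the check performed in this notebook.

```python
import pandas as pd

# Made-up stand-ins for arch_clean and api_clean
archive_toy = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                            'text': ['a', 'b', 'c']})
api_toy = pd.DataFrame({'tweet_id': ['1', '3'],
                        'retweet_count': [10, 30]})

# A left merge with indicator=True labels rows that have no API match
audit = archive_toy.merge(api_toy, on='tweet_id', how='left', indicator=True)
lost = audit.loc[audit['_merge'] == 'left_only', 'tweet_id'].tolist()

# The inner merge keeps only matched tweets; validate guards against duplicate keys
merged = archive_toy.merge(api_toy, on='tweet_id', how='inner',
                           validate='one_to_one')
print(lost, len(merged))
```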

##### Test

In [65]:
# Look at random sample of tweets in arch_clean
arch_clean.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage,retweet_count,favorite_count,tweet_length
808,770069151037685760,,,2016-08-29 01:22:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to Carbon. This is his first time swimming. He's having a h*ckin blast. 10/10 we should all be this happy https://t.co/mADHGenzFS,,,,https://twitter.com/dog_rates/status/770069151037685760/photo/1,10,10,Carbon,,2475,7986,115
1308,706169069255446529,,,2016-03-05 17:26:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",He was doing his best. 12/10 I'll be his lawyer\nhttps://t.co/WN4C6miCzR,,,,https://twitter.com/wgnnews/status/706165920809492480,12,10,,,2346,4051,71
569,800388270626521089,,,2016-11-20 17:20:08 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Doc. He takes time out of every day to worship our plant overlords. 12/10 quite the floofer https://t.co/azMneS6Ly5,,,,"https://twitter.com/dog_rates/status/800388270626521089/photo/1,https://twitter.com/dog_rates/status/800388270626521089/photo/1,https://twitter.com/dog_rates/status/800388270626521089/photo/1",12,10,Doc,floofer,3056,11856,99
451,817502432452313088,,,2017-01-06 22:45:43 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Meet Herschel. He's slightly bigger than ur average pupper. Looks lonely. Could probably ride 7/10 would totally pet https:/…,6.924173e+17,4196984000.0,2016-01-27 18:42:06 +0000,https://twitter.com/dog_rates/status/692417313023332352/photo/1,7,10,Herschel,pupper,3675,0,140
753,777684233540206592,,,2016-09-19 01:42:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","""Yep... just as I suspected. You're not flossing."" 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ",,,,https://twitter.com/dog_rates/status/777684233540206592/photo/1,12,10,,,3180,11876,91


In [66]:
# Check summary of arch_clean for correct columns and NaNs
arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 17 columns):
tweet_id                      2339 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          1525 non-null object
dog_stage                     397 non-null object
retweet_count                 2339 non-null object
favorite_count                2339 non-null object
tweet_length                  2339 non-null object
dtypes: float64(4), int64(2), ob

In [67]:
2356 - len(arch_clean)

17

>We lost 17 tweets from `archive` because they had no API data.

<a id='quality'></a>
## 3) Quality

I'll be addressing the following quality issues in this section:
1. Replies and retweets still in `archive`
2. `archive` and `image_preds` have erroneous datatypes
3. Series for "We only rate dogs" reprimand unavailable in `archive`
4. Tweet url part of tweet text in `archive`
5. Many predictions are lowercase and contain underscores in `image_preds`
6. Fewer `image_preds` entries than tweets in `archive`

### `archive`: 78 replies/167 retweets included in dataset

##### Define
Check for overlap/examine differences between replies and retweets. Drop all rows where `retweeted_status_id` is not null or `in_reply_to_status_id` is not null.

##### Code

In [68]:
# Check for overlap between replies and retweets
replies = arch_clean[arch_clean['in_reply_to_status_id'].notnull()].index.tolist()
retweets = arch_clean[arch_clean['retweeted_status_id'].notnull()].index.tolist()
set(replies).isdisjoint(retweets)

True

Replies and retweets do not overlap.

In [69]:
# How long should new dataframe be after dropping replies and retweets?
new_length = len(arch_clean) - len(replies) - len(retweets)
new_length

2094

In [70]:
# Create and sort list of indices to use with df.drop
all_drops = replies + retweets
all_drops.sort()

In [71]:
# Drop all rows with indices in all_drops
arch_clean.drop(all_drops, axis=0, inplace=True)
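
The same rows can be removed without building index lists by filtering on a boolean mask. This is a minimal sketch on a made-up four-row frame; the names `is_reply`, `is_retweet`, and `originals` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 2 is a reply, row 3 is a retweet
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'in_reply_to_status_id': [np.nan, 5.0, np.nan, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 7.0, np.nan]})

is_reply = toy['in_reply_to_status_id'].notna()
is_retweet = toy['retweeted_status_id'].notna()

# Keep rows that are neither a reply nor a retweet
originals = toy[~(is_reply | is_retweet)].copy()
print(originals['tweet_id'].tolist())
```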

##### Test

In [72]:
# See if any tweets are left with non-null values in relevant columns
arch_clean[(arch_clean['in_reply_to_status_id'].notnull()) \
          | (arch_clean['retweeted_status_id'].notnull())]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage,retweet_count,favorite_count,tweet_length


In [73]:
# Verify length of adjusted data set, should be 2094
len(arch_clean)


2094

### `archive` and `image_preds`: Erroneous datatypes - tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and all dog stage columns

##### Define
* Drop `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` since they are not being used anymore.
* Check the datatype of the `dog_stage` column (should be object); fix if not.
* Use `pd.to_datetime()` to cast `timestamp` to datetime and `.astype()` to cast `tweet_id` in `image_preds` to string.
* Cast all columns merged from `api_data` to int64; they are currently strings.

##### Code

In [74]:
# Drop all columns related to retweets and replies since they are all NaN values now
arch_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 
                         'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)

In [75]:
# Confirm that dog_stage is object/string
arch_clean.dog_stage.dtype

dtype('O')

In [76]:
# Cast erroneous columns to correct dtypes
arch_clean.timestamp = pd.to_datetime(arch_clean.timestamp)
arch_clean[['retweet_count', 'favorite_count', 'tweet_length']] = arch_clean[['retweet_count', 'favorite_count', 'tweet_length']].astype(int)
preds_clean.tweet_id = preds_clean.tweet_id.astype(str)
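
As a small illustration of these casts, the sketch below converts a made-up one-row frame. Passing `utc=True` is an assumption added here to make the timezone handling explicit (the archive timestamps carry a `+0000` offset); the cell above does not pass it.

```python
import pandas as pd

# Made-up row with the string-typed columns described above
toy = pd.DataFrame({'timestamp': ['2016-08-29 01:22:47 +0000'],
                    'retweet_count': ['2475'],
                    'favorite_count': ['7986']})

# Parse the timestamp; utc=True yields a timezone-aware column
toy['timestamp'] = pd.to_datetime(toy['timestamp'], utc=True)

# Cast the count columns from strings to 64-bit integers in one call
toy = toy.astype({'retweet_count': 'int64', 'favorite_count': 'int64'})
print(toy.dtypes)
```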


##### Test

In [77]:
# Are columns dropped? Yes.
arch_clean.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'dog_stage',
       'retweet_count', 'favorite_count', 'tweet_length'],
      dtype='object')

In [78]:
# Are datatypes correct?
arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2094 entries, 0 to 2338
Data columns (total 12 columns):
tweet_id              2094 non-null object
timestamp             2094 non-null datetime64[ns]
source                2094 non-null object
text                  2094 non-null object
expanded_urls         2091 non-null object
rating_numerator      2094 non-null int64
rating_denominator    2094 non-null int64
name                  1417 non-null object
dog_stage             353 non-null object
retweet_count         2094 non-null int64
favorite_count        2094 non-null int64
tweet_length          2094 non-null int64
dtypes: datetime64[ns](1), int64(5), object(6)
memory usage: 212.7+ KB


In [79]:
preds_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6225 entries, 0 to 6224
Data columns (total 7 columns):
tweet_id      6225 non-null object
jpg_url       6225 non-null object
img_num       6225 non-null int64
order         6225 non-null int64
prediction    6225 non-null object
confidence    6225 non-null float64
accuracy      6225 non-null bool
dtypes: bool(1), float64(1), int64(2), object(3)
memory usage: 298.0+ KB


In [80]:
len(preds_clean)/3

2075.0

### `archive`: Series for "We only rate dogs" reprimand unavailable

##### Define
Many tweets sarcastically reprimand submissions for not being pictures of dogs. Although sometimes the photos are of dogs, I want to create a column called WORD which captures whether or not the tweet text contains the phrase "we only rate dogs" in some manner. Sometimes the phrase is capitalized, sometimes not. 

I'll write a function to apply to each tweet's text: it returns True if the tweet contains some form of the phrase "we only rate dogs" and False otherwise. The result populates the WORD column.

##### Code

In [81]:
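# Function accepts a tweet's text (a string), returns True or False depending on whether the phrase is in the tweet.
def WORD_detector(tweety):
    # \w stands in for the first letter of each word so either case matches;
    # note that it also matches any other word character in that position
    match = re.search(r'\we\.? \wnly\.? \wate\.? \wogs\.?', tweety)
    # search() returns None when there is no match
    return bool(match)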
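
The pattern above is fairly loose: `\w` accepts any word character as the first letter, and the remaining letters must be lowercase. A stricter, case-insensitive alternative is sketched below on made-up sample texts. It is an assumption about what "some form of the phrase" should cover, and applying it to the real data may give a count different from the one reported below.

```python
import re
import pandas as pd

# Made-up sample texts, including an all-caps variant and a near miss
texts = pd.Series(['Guys.. we only rate dogs. Pls only send dogs.',
                   'I thought I made this very clear. We only rate dogs.',
                   'WE ONLY RATE DOGS',
                   'He only hate dogs, apparently',
                   'This is Maddie. 11/10 nimble af'])

# Loose pattern used in the notebook
loose = texts.apply(
    lambda t: bool(re.search(r'\we\.? \wnly\.? \wate\.? \wogs\.?', t)))

# Stricter sketch: the exact phrase, any capitalization
strict = texts.str.contains(r'\bwe only rate dogs\b', case=False, regex=True)

print(pd.DataFrame({'loose': loose, 'strict': strict}))
```

On these samples the loose pattern flags the near miss and skips the all-caps tweet, while the strict one does the opposite.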


In [82]:
# Create WORD series in arch_clean w/ boolean indicating presence of "we only rate dogs"
arch_clean['WORD'] = arch_clean.text.apply(WORD_detector)

##### Test

In [83]:
# How many tweets contain some version of this phrase?
arch_clean.WORD.value_counts()

False    2040
True     54  
Name: WORD, dtype: int64

In [84]:
# Examine sample of arch_clean
samp = arch_clean[['text', 'WORD']].sample(10)
samp

Unnamed: 0,text,WORD
1659,This is Samson. He patrols his waters on the back of his massive shielded battle dog. 11/10 https://t.co/f8dVgDYDFf,False
2144,This is a Coriander Baton Rouge named Alfredo. Loves to cuddle with smaller well-dressed dog. 10/10 would hug lots https://t.co/eCRdwouKCl,False
313,DOGGO ON THE LOOSE I REPEAT DOGGO ON THE LOOSE 10/10 https://t.co/ffIH2WxwF0,False
1662,We normally don't rate bears but this one seems nice. Her name is Thea. Appears rather fluffy. 10/10 good bear https://t.co/fZc7MixeeT,False
1570,This pupper forgot how to walk. 12/10 happens to all of us (vid by @bbuckley96) https://t.co/KFTrkSOuu3,False
278,This is Walter. His owner has been watching all the Iditarod coverage and is convinced Walter can be a sled dog. 13/10 Walter isn't so sure https://t.co/0av1PEehFI,False
1577,Say hello to Crimson. He's a Speckled Winnebago. Main passions are air hockey &amp; parkour. 11/10 would pet thoroughly https://t.co/J5aI7SjzDc,False
1004,This is Maddie. She gets some wicked air time. Hardcore barkour. 11/10 nimble af https://t.co/bROYbceZ1u,False
580,This is Shadow. He's a firm believer that they're all good dogs. H*ckin passionate about it too. 11/10 I stand with Shadow https://t.co/8yvpacwBcu,False
440,Meet Wafer. He represents every fiber of my being. 13/10 very good dog https://t.co/I7bkhxBxUG,False


### `archive`: Tweet text contains URL

##### Define:
Write a function that accepts the tweet text, appends the URL to a list (later added as a column in `arch_clean`), and returns the tweet without the URL. Use `str.split()`.

##### Code:

In [85]:
# Empty list to hold urls and np.nan for tweets without urls
url_list = []
def url_fix(tweety):
    # Search for http in the text
    if 'http' in tweety:
        # Split the tweet at http
        x = tweety.split('http')
        # Concatenate http with the first url fragment and append to list
        url_list.append('http' + x[1])
        # Return all the tweet text to the left of the first http
        return x[0]
    # If no url in tweet
    else:
        # Append a null to the list (np.nan; the np.NaN alias was removed in NumPy 2.0)
        url_list.append(np.nan)
        # Return the tweet unchanged.
        return tweety


In [86]:
# Apply function to text series
arch_clean.text = arch_clean.text.apply(url_fix)

In [87]:
# Create new series from url_list
arch_clean['tweet_url'] = url_list
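
Because `url_fix` appends to a global list, re-running the cell would lengthen `url_list` and break the assignment above. A side-effect-free alternative using pandas string methods is sketched below on made-up texts. The pattern `https?://\S+` is an assumption about what counts as a URL, and the variable names are illustrative.

```python
import pandas as pd

# Made-up texts: one URL, no URL, and two URLs
texts = pd.Series(['Meet Striker. 11/10 https://t.co/B3xxSLjQSH',
                   '13/10 for Charley',
                   'Two links https://t.co/aaa https://t.co/bbb'])

url_pattern = r'https?://\S+'

# First URL per tweet (NaN when there is none)
first_url = texts.str.extract(r'(https?://\S+)', expand=False)

# All URLs per tweet, as a list
all_urls = texts.str.findall(url_pattern)

# Tweet text with every URL removed and surrounding whitespace trimmed
clean_text = texts.str.replace(url_pattern, '', regex=True).str.strip()

print(pd.DataFrame({'text': clean_text, 'tweet_url': first_url}))
```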

##### Test

In [88]:
arch_clean[['text', 'tweet_url']].sample(10)

Unnamed: 0,text,tweet_url
113,Meet Clifford. He's quite large. Also red. Good w kids. Somehow never steps on them. Massive poops very inconvenient. Still 14/10 would ride,https://t.co/apVOyDgOju
1683,I thought I made this very clear. We only rate dogs. Stop sending other things like this shark. Thank you... 9/10,https://t.co/CXSJZ4Stk3
348,"Meet Samson. He's absolute fluffy perfection. Easily 13/10, but he needs your help. Click the link to find out more\n\n",https://t.co/z82hCtwhpn
967,This is George. He just remembered that bees are dying globally at an alarming rate. Scary stuff George. 10/10,https://t.co/lznl6QGkYc
1213,This is Piper. She would really like that tennis ball core. Super sneaky tongue slip. 12/10 precious af,https://t.co/QP6GHi5az9
854,Guys.. we only rate dogs. Pls don't send any more pics of the Loch Ness Monster. Only send in dogs. Thank you. 11/10,https://t.co/obH5vMbm1j
482,Here's a doggo who has concluded that Christmas is entirely too bright. Requests you tone it down a notch. 11/10,https://t.co/cD967DjnIn
1588,This is Olive. He's stuck in a sleeve. 9/10 damn it Olive,https://t.co/NnLjg6BgyF
1852,What kind of person sends in a picture without a dog in it? 1/10 just because that's a nice table,https://t.co/RDXCfk8hK0
1818,Meet Striker. He's ready for Christmas. 11/10,https://t.co/B3xxSLjQSH


In [89]:
len(arch_clean)

2094

### `image_preds`: Many predictions are lowercase and contain underscores

In [90]:
preds_clean.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,order,prediction,confidence,accuracy
2847,704847917308362754,https://pbs.twimg.com/media/CcgfcANW4AA9hzr.jpg,1,1,golden_retriever,0.85724,True
1614,676949632774234114,https://pbs.twimg.com/media/CWUCGMtWEAAjXnS.jpg,1,1,Welsh_springer_spaniel,0.206479,True
2348,690005060500217858,https://pbs.twimg.com/media/CZNj8N-WQAMXASZ.jpg,1,3,teddy,0.072475,False
5969,872967104147763200,https://pbs.twimg.com/media/DB1m871XUAAw5vZ.jpg,2,3,German_short-haired_pointer,0.092861,True
4435,780601303617732608,https://pbs.twimg.com/media/CtVAvX-WIAAcGTf.jpg,1,2,Cardigan,0.003044,True


##### Define
Write a function to apply to the `prediction` column: use `str.replace()` and `str.title()` to replace underscores with spaces and title-case all predictions.

##### Code

In [91]:
# Function
def fix_preds(pred):
    return pred.replace('_', ' ').title()

In [92]:
# Apply function to prediction series
preds_clean.prediction = preds_clean.prediction.apply(fix_preds)
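
The same cleanup can be written with the vectorized `.str` accessor instead of `apply`. This sketch runs on three sample labels taken from the outputs above; `preds` and `fixed` are illustrative names. Note that `title()` also capitalizes the letter after a hyphen.

```python
import pandas as pd

# Sample labels in the raw format shown above
preds = pd.Series(['golden_retriever',
                   'German_short-haired_pointer',
                   'bath_towel'])

# Replace underscores with spaces, then title-case each word
fixed = preds.str.replace('_', ' ', regex=False).str.title()
print(fixed.tolist())
```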

##### Test

In [93]:
# Are there any lowercase predictions left in column
preds_clean.prediction.str.islower().sum()

0

In [94]:
preds_clean.prediction.sample(10)

2494    Bluetick        
5575    Samoyed         
5099    Chow            
6191    Muzzle          
4659    Golden Retriever
4865    Golden Retriever
907     Bath Towel      
4350    Samoyed         
2554    Arctic Fox      
5776    Siberian Husky  
Name: prediction, dtype: object

### `image_preds`:  There are fewer image_preds entries than there are tweets in the archive (perhaps a result of retweets, perhaps not)

##### Define
* Examine intersection of `preds_clean` and `arch_clean`.
* Get list of `tweet_id`s in intersection of above dataframes.
* Create sub-dataframe for each dataframe keeping only tweet data for tweet ids in both.
* Will need to account for duplicate `tweet_id`s in `preds_clean` (three rows per tweet) using `set()`


##### Code

In [95]:
# What is the overlap like for the set of tweets in both dataframes?
p = list(set(preds_clean.tweet_id.tolist()))
a = arch_clean.tweet_id.tolist()
set(p).issubset(a), len(a) > len(p)

(False, True)

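The first result shows that some `tweet_id`s in `preds_clean` are not in `arch_clean`. Since `arch_clean` also has more unique ids (`len(a)` > `len(p)`, with no duplicates in either list), some of its ids must be missing from `preds_clean` as well.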

In [96]:
# Create a set that is the intersection of tweet_ids in a and p
inter = set(p) & set(a)
len(p), len(a), len(inter)

(2075, 2094, 1968)

There are 1968 tweets in common between the dataframes. We will keep those for both.

In [97]:
# Redefine preds_clean and arch_clean with only tweet info in both dataframes
preds_clean = preds_clean[preds_clean['tweet_id'].isin(inter)]
arch_clean = arch_clean[arch_clean['tweet_id'].isin(inter)]
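
The filtering step can be illustrated on two made-up frames (`preds_toy` and `arch_toy` are hypothetical). The sketch also compares the two resulting id sets directly, which is a slightly stronger check than testing each against the intersection.

```python
import pandas as pd

# Made-up frames: preds has three rows per tweet, archive has one
preds_toy = pd.DataFrame({'tweet_id': ['1', '1', '1', '2', '2', '2'],
                          'order': [1, 2, 3, 1, 2, 3]})
arch_toy = pd.DataFrame({'tweet_id': ['2', '3']})

# Ids present in both frames
common = set(preds_toy['tweet_id']) & set(arch_toy['tweet_id'])

# Keep only rows whose id is in the intersection; copy to avoid view warnings
preds_kept = preds_toy[preds_toy['tweet_id'].isin(common)].copy()
arch_kept = arch_toy[arch_toy['tweet_id'].isin(common)].copy()

# Direct equality of the two id sets
same_ids = set(preds_kept['tweet_id']) == set(arch_kept['tweet_id'])
print(common, len(preds_kept), len(arch_kept), same_ids)
```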

##### Test

In [98]:
# Are both sets of tweet_ids contained in the intersection? Together with the matching lengths checked next, this implies they are the same set
set(arch_clean['tweet_id']).issubset(inter), set(preds_clean['tweet_id']).issubset(inter)

(True, True)

In [99]:
# Check length of both dataframes
len(arch_clean), len(preds_clean)/3

(1968, 1968.0)

## Save Cleaned Data

In [101]:
arch_clean.to_csv('twitter_archive_master.csv', index=False)
preds_clean.to_csv('image_predictions_master.csv', index=False)
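
One thing to keep in mind when loading these files later: CSV does not store dtypes, so `tweet_id` would be read back as an integer and `timestamp` as a string unless told otherwise. The sketch below round-trips a made-up one-row frame through an in-memory buffer (the timestamp value is invented for illustration) and restores both types on read.

```python
import io
import pandas as pd

# Made-up row with a string id and a datetime column
toy = pd.DataFrame({'tweet_id': ['666020888022790149'],
                    'timestamp': pd.to_datetime(['2015-11-15 22:32:08'])})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended dtypes explicitly when reading back
restored = pd.read_csv(buffer, dtype={'tweet_id': str},
                       parse_dates=['timestamp'])
print(restored.dtypes)
```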

<a id='ref1'></a>

# Wrangling References

- [Tweepy for Python intro](https://www.pythoncentral.io/introduction-to-tweepy-twitter-for-python/)
- [Create a twitter application](https://docs.inboundnow.com/guide/create-twitter-application/)
- [Tweepy API documentation](http://docs.tweepy.org/en/3.7.0/api.html)
- [Convert tweepy status object to JSON](https://stackoverflow.com/questions/27900451/convert-tweepy-status-object-into-json)
- [Python documentation for open()](https://docs.python.org/3/library/functions.html#open)
- [Create, then close file to free up resources](https://stackoverflow.com/questions/12654772/create-empty-file-using-python/12654798)
- [Difference between .exists and .isfile](https://stackoverflow.com/questions/17752078/difference-between-os-path-exists-and-os-path-isfile-in-python)
- [Write JSON data to txt file](https://stackoverflow.com/questions/12309269/how-do-i-write-json-data-to-a-file)
- [JSON data structure overview](https://www.w3resource.com/JSON/structures.php)
- [Add new keys to dictionary](https://stackoverflow.com/questions/1024847/add-new-keys-to-a-dictionary)
- [What are Twitter rate limits?](https://support.onelouder.com/hc/en-us/articles/203931090-Why-am-I-getting-a-rate-limit-)
- [DataFrame indexing](https://brohrer.github.io/dataframe_indexing.html)
- [Counting occurrences in a list](https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item)
- [List of Indices](https://stackoverflow.com/questions/55270808/append-index-element-to-list-if-condition-is-met-pandas-and-python/55270831#55270831)
- [pd.wide_to_long Stack Overflow](https://stackoverflow.com/questions/50103549/melt-group-of-columns-into-target-columns-by-name)
- [pd.wide_to_long Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html)

# Analysis