# Project 7: Twitter Wrangle Project
## Table of Contents
<ul>
<li><a href="#import1">Import Libraries and APIs</a></li>
<li><a href="#import2">Import Data</a></li>
<li><a href="#assess">Assessing the Data</a></li>
<li><a href="#cleaning">Cleaning the Three Tables</a></li>
<li><a href="#Merge">Conversion of the Dataframes to df_combined</a></li>
<li><a href="#analysis">Analysis of the Master Dataset</a></li>
<li><a href="#insights">The Five Insights</a></li>    
<li><a href="#sources">Sources</a></li>
</ul>

<a id='import1'></a>
## Import Libraries and APIs

In [12]:
# This imports the necessary libraries and APIs for the analysis.
import pandas as pd
import numpy as np
import tweepy 
from tweepy import OAuthHandler 
import json
import requests as rq
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
import csv
import re
import statsmodels.api as sm

<a id='import2'></a>
## Import Data

### Udacity-provided Twitter Archive CSV

In [13]:
# This imports the twitter CSV file provided by Udacity.
df_twitter = pd.read_csv('/home/workspace/twitter-archive-enhanced.csv')

# This previews the first 10 rows to confirm that df_twitter loaded.
df_twitter.head(10)


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [14]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

>**Results**: The dataset was successfully loaded.

### Udacity-provided Image Prediction TSV from a URL

In [15]:
# I create pt1 and pt2 of the URL to avoid a long line of code.
# I then assign url to the concatenation of pt1 and pt2.
pt1 = "https://d17h27t6h515a5.cloudfront.net/topher/"
pt2 = "2017/August/599fd2ad_image-predictions/image-predictions.tsv"
url = pt1 + pt2

# This uses the requests.get() method to download the file at the URL.
response = rq.get(url)

# This tests whether the request succeeded.
# If successful, the output should be <Response [200]>
response

<Response [200]>

In [16]:
# This stores the current working directory
# in a variable.
current_folder = str(os.getcwd())
current_folder

'/home/workspace'

> **Results**: The request succeeded, and the working directory is '/home/workspace'.

In [17]:
# This writes the downloaded content to a file in the working
# directory, named after the last segment of the URL.
file_path = os.path.join(current_folder, url.split('/')[-1])
with open(file_path, mode='wb') as file:
    file.write(response.content)

os.listdir(current_folder)

['twitter-archive-master.csv',
 '.ipynb_checkpoints',
 'wrangle_act.ipynb',
 'df_wrangled_copy.csv',
 'act_report.html',
 'twitter-archive-enhanced.csv',
 'wrangle_report.pdf',
 'images-predictions-csv.csv',
 'df_twitter_copy.csv',
 'image-predictions.tsv',
 'df_images_copy.csv',
 'tweet_json_csv.csv',
 'tweet_json.txt']

> **Results**: The TSV was saved as 'image-predictions.tsv'.
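
Splitting the URL on `/` works here because the URL has no query string. A slightly more defensive sketch, using only the standard library, parses the URL first. `filename_from_url` is a hypothetical helper and is not part of the original notebook.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)


# The same URL that the notebook builds from pt1 and pt2.
url = ("https://d17h27t6h515a5.cloudfront.net/topher/"
       "2017/August/599fd2ad_image-predictions/image-predictions.tsv")
filename = filename_from_url(url)
print(filename)
```

The result can be passed to `os.path.join` in the same way as `url.split('/')[-1]`.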

In [18]:
# Now that the file is saved in the working directory,
# we can load 'image-predictions.tsv'.
# This code comes from Source 1 in the Sources Section.
df_predictions = pd.read_csv('/home/workspace/image-predictions.tsv', sep='\t')

# This samples 10 rows to confirm that the file loaded.
df_predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
956,705428427625635840,https://pbs.twimg.com/media/CcovaMUXIAApFDl.jpg,1,Chihuahua,0.774792,True,quilt,0.073079,False,Pembroke,0.022365,True
1578,796149749086875649,https://pbs.twimg.com/media/Cwx99rpW8AMk_Ie.jpg,1,golden_retriever,0.600276,True,Labrador_retriever,0.140798,True,seat_belt,0.087355,False
399,673686845050527744,https://pbs.twimg.com/media/CVlqi_AXIAASlcD.jpg,1,Pekinese,0.185903,True,guinea_pig,0.172951,False,pug,0.166183,True
1577,796116448414461957,https://pbs.twimg.com/media/CwxfrguUUAA1cbl.jpg,1,Cardigan,0.700182,True,Pembroke,0.260738,True,papillon,0.01711,True
2009,878281511006478336,https://pbs.twimg.com/media/DDBIX9QVYAAohGa.jpg,1,basset,0.32042,True,collie,0.215975,True,Appenzeller,0.128507,True
292,671186162933985280,https://pbs.twimg.com/media/CVCIQX7UkAEzqh_.jpg,1,Chihuahua,0.319106,True,whippet,0.169134,True,toy_terrier,0.125815,True
2004,877316821321428993,https://pbs.twimg.com/media/DCza_vtXkAQXGpC.jpg,1,Saluki,0.509967,True,Italian_greyhound,0.090497,True,golden_retriever,0.079406,True
1093,719704490224398336,https://pbs.twimg.com/media/CfznaXuUsAAH-py.jpg,1,home_theater,0.059033,False,window_shade,0.038299,False,bathtub,0.035528,False
582,678798276842360832,https://pbs.twimg.com/media/CWuTbAKUsAAvZHh.jpg,1,Airedale,0.583122,True,silky_terrier,0.129567,True,Lakeland_terrier,0.094727,True
1298,752334515931054080,https://pbs.twimg.com/ext_tw_video_thumb/75233...,1,Bedlington_terrier,0.399163,True,standard_poodle,0.086425,True,wire-haired_fox_terrier,0.075231,True


In [19]:
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [11]:
# This converts the TSV file to a csv.
# This code is adapted from Source 8 of the Sources Section.
# The file handle gets its own name so that the df_predictions
# dataframe is not overwritten, and both files are closed
# automatically by the with statements.
with open('/home/workspace/image-predictions.tsv', 'r') as tsv_file:
    fileContent = tsv_file.read()
fileContent = re.sub("\t", ",", fileContent) # convert from tab to comma
with open("images-predictions-csv.csv", "w") as csv_file:
    csv_file.write(fileContent)

df_images = pd.read_csv('images-predictions-csv.csv')

# This determines whether the conversion to a csv was successful.
df_images.sample(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1254,748324050481647620,https://pbs.twimg.com/media/CmKUwImXIAA58f5.jpg,1,Shetland_sheepdog,0.880499,True,collie,0.107901,True,Pembroke,0.003607,True
84,667502640335572993,https://pbs.twimg.com/media/CUNyHTMUYAAQVch.jpg,1,Labrador_retriever,0.996709,True,golden_retriever,0.001688,True,beagle,0.000712,True
388,673355879178194945,https://pbs.twimg.com/media/CVg9mTYWIAAu7J6.jpg,1,Rottweiler,0.529248,True,miniature_pinscher,0.168296,True,Appenzeller,0.100452,True
2041,885311592912609280,https://pbs.twimg.com/media/C4bTH6nWMAAX_bJ.jpg,1,Labrador_retriever,0.908703,True,seat_belt,0.057091,False,pug,0.011933,True
658,682389078323662849,https://pbs.twimg.com/media/CXhVKtvW8AAyiyK.jpg,1,curly-coated_retriever,0.482288,True,flat-coated_retriever,0.315286,True,Great_Dane,0.062179,True
766,689154315265683456,https://pbs.twimg.com/media/CZBeMMVUwAEdVqI.jpg,1,cocker_spaniel,0.816044,True,golden_retriever,0.054135,True,Airedale,0.030648,True
1843,838561493054533637,https://pbs.twimg.com/media/C6MrOsEXQAENOds.jpg,1,kelpie,0.216562,True,doormat,0.139994,False,dalmatian,0.13282,True
829,693622659251335168,https://pbs.twimg.com/media/CaA-IR9VIAAqg5l.jpg,1,malamute,0.449298,True,Siberian_husky,0.385075,True,Eskimo_dog,0.163485,True
1281,750147208377409536,https://pbs.twimg.com/media/CmkO57iXgAEOxX9.jpg,1,pug,0.977765,True,Boston_bull,0.004794,True,French_bulldog,0.004573,True
1955,864279568663928832,https://pbs.twimg.com/media/C_6JrWZVwAAHhCD.jpg,1,bull_mastiff,0.668613,True,French_bulldog,0.180562,True,Staffordshire_bullterrier,0.052237,True
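
Replacing every tab with a comma works for this file because none of the sampled fields contain commas. As a hedged alternative, the sketch below uses the standard `csv` module, which quotes any field that does contain one. `tsv_to_csv` is a hypothetical helper, and the sample text is made up for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text with proper quoting."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Made-up sample: the last field contains a comma and must be quoted.
sample_tsv = "tweet_id\tp1\n1\tgolden_retriever\n2\tdog, probably\n"
csv_text = tsv_to_csv(sample_tsv)
print(csv_text)
```

With pandas already loaded, writing the dataframe with `to_csv` would be another option. The point here is only that a CSV writer handles quoting, which a plain substitution does not.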


### Twitter Data: Wrangled with Tweepy

In [27]:
# This sets up the Tweepy API.
# This comes from Sources 4 & 6 of the Sources Section.
consumer_key = 'Hidden'
consumer_secret = 'Hidden'
access_token = 'Hidden'
access_secret = 'Hidden'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth_handler=auth,
            wait_on_rate_limit = True,
            wait_on_rate_limit_notify = True)
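
The credentials above are replaced with 'Hidden' for publication. One way to avoid pasting secrets into a notebook at all is to read them from environment variables. This is a minimal sketch: the variable names and the `load_credentials` helper are hypothetical and do not come from the original project.

```python
import os

# Hypothetical variable names; use whatever fits your environment.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials from a mapping, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {key: "placeholder" for key in REQUIRED_KEYS}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the literal strings.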

In [None]:
# I use the twitter ids from the list of ids in df_twitter.
# This code comes from Sources 5 & 6 in the sources section.

# This creates the list of tweet ids from df_twitter.
tweet_ids = list(df_twitter.tweet_id)

# This creates the wrangled data dictionary.
wrangled_data = {}

# This creates a dictionary for the ids that triggered errors.
error_ids = {}

# This tracks the time it takes to run the loop that wrangles the
# tweets.
start_time = datetime.now()

# This is the for loop that extracts the data and stores it in the
# wrangled_data dictionary. If an error occurs, the id is stored in the
# error_ids dictionary and an error message is printed.
# Catching Exception instead of using a bare except lets a
# KeyboardInterrupt stop the loop.
for tweet in tweet_ids:
    tw_id = str(tweet)
    try:
        tweet_status = api.get_status(tweet)
        wrangled_data[tw_id] = tweet_status._json
    except Exception:
        error_ids[tw_id] = tw_id
        print("Error for tweet_id: " + tw_id + "\n")

# This prints how long it took to wrangle the data.
endtime = datetime.now() - start_time
print("Total time to wrangle data (hh:mm:ss.ms): {}".format(endtime))

# This writes the text file for the wrangled tweets.
with open('tweet_json.txt', 'w') as df_w_tweets:
    json.dump(wrangled_data, df_w_tweets)

Error for tweet_id: 888202515573088257

Error for tweet_id: 883838122936631299



From cffi callback <function _verify_callback at 0x7f6aecca4400>:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/OpenSSL/SSL.py", line 221, in wrapper
    @wraps(callback)
KeyboardInterrupt


Error for tweet_id: 879492040517615616

Error for tweet_id: 876838120628539392

Error for tweet_id: 875097192612077568

Error for tweet_id: 873697596434513921

Error for tweet_id: 872668790621863937

Error for tweet_id: 869988702071779329

Error for tweet_id: 866816280283807744

Error for tweet_id: 866720684873056260

Error for tweet_id: 865718153858494464

Error for tweet_id: 864873206498414592

Error for tweet_id: 863907417377173506

Error for tweet_id: 863471782782697472

Error for tweet_id: 863432100342583297

Error for tweet_id: 863079547188785154

Error for tweet_id: 863062471531167744

Error for tweet_id: 862831371563274240

Error for tweet_id: 862722525377298433

Error for tweet_id: 862457590147678208

Error for tweet_id: 862096992088072192

Error for tweet_id: 861769973181624320

Error for tweet_id: 861383897657036800

Error for tweet_id: 861288531465048066

Error for tweet_id: 861005113778896900

Error for tweet_id: 860981674716409858

Error for tweet_id: 856543823941562368



In [21]:
# I check the tweet_json.txt file to determine whether
# it was successfully written.
# This code is from Source 5 in the Sources Section.
wrangled_txt = pd.read_json('tweet_json.txt', orient = 'index', encoding = 'utf-8')
wrangled_txt.head()

Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,id_str,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,truncated,user
1991-02-08 13:48:08.022790149,,,2015-11-15 22:32:08,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666020881337073664, 'id_str'...",2551,0,,666020888022790144,666020888022790144,...,,,,508,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1991-02-08 16:08:05.002620928,,,2015-11-15 23:05:30,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666029276303482880, 'id_str'...",129,0,,666029285002620928,666029285002620928,...,,,,47,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1991-02-08 17:16:52.701032449,,,2015-11-15 23:21:54,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666033409081393153, 'id_str'...",125,0,,666033412701032448,666033412701032448,...,,,,44,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1991-02-08 20:17:06.329800704,,,2015-11-16 00:04:52,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666044217047650304, 'id_str'...",295,0,,666044226329800704,666044226329800704,...,,,,139,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1991-02-08 21:40:48.165822465,,,2015-11-16 00:24:50,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666049244999131136, 'id_str'...",107,0,,666049248165822464,666049248165822464,...,,,,40,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


>**Results**: The wrangled tweets were serialized to JSON and written to the tweet_json.txt file, which loads successfully into a dataframe.
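
The `id` values in the preview end in different digits from the ones in df_twitter, for example 666020888022790144 here versus 666020888022790149 in the archive. One plausible cause, offered as an assumption rather than a diagnosis of pandas internals, is that the 18-digit ids pass through a 64-bit float, which cannot represent every integer of that size. The sketch below reproduces the rounding in plain Python and shows that the standard `json` module keeps the integers exact. The sample JSON string is made up in the shape of `tweet_json.txt`.

```python
import json

archive_id = 666020888022790149

# A float64 has a 53-bit significand, so an 18-digit integer may be rounded.
rounded_id = int(float(archive_id))
print(archive_id, rounded_id)

# Loading with the json module keeps integers exact.
raw = '{"666020888022790149": {"id": 666020888022790149, "retweet_count": 508}}'
records = json.loads(raw)

# Keep the id as a string key so later steps cannot round it.
rows = [
    {"tweet_id": key, "retweet_count": value["retweet_count"]}
    for key, value in records.items()
]
print(rows)
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame` directly, which keeps `tweet_id` as text.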

In [22]:
# This converts the wrangled_txt dataframe to a
# csv file.
# This code comes from Source 7 in the Sources Section.
csv_conversion = wrangled_txt.to_csv("tweet_json_csv.csv", encoding = 'utf-8')

# This defines the df_wrangled dataframe from the csv file.
df_wrangled = pd.read_csv('tweet_json_csv.csv')

# This determines whether the df_wrangled properly converted to a csv.
df_wrangled.head()

Unnamed: 0.1,Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,truncated,user
0,1991-02-08 13:48:08.022790149,,,2015-11-15 22:32:08,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666020881337073664, 'id_str'...",2551,0,,666020888022790144,...,,,,508,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,1991-02-08 16:08:05.002620928,,,2015-11-15 23:05:30,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666029276303482880, 'id_str'...",129,0,,666029285002620928,...,,,,47,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,1991-02-08 17:16:52.701032449,,,2015-11-15 23:21:54,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666033409081393153, 'id_str'...",125,0,,666033412701032448,...,,,,44,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,1991-02-08 20:17:06.329800704,,,2015-11-16 00:04:52,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666044217047650304, 'id_str'...",295,0,,666044226329800704,...,,,,139,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,1991-02-08 21:40:48.165822465,,,2015-11-16 00:24:50,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666049244999131136, 'id_str'...",107,0,,666049248165822464,...,,,,40,0,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [23]:
# This summarizes the data.
df_wrangled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2341 entries, 0 to 2340
Data columns (total 31 columns):
Unnamed: 0                       2341 non-null object
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2341 non-null object
entities                         2341 non-null object
extended_entities                1822 non-null object
favorite_count                   2341 non-null int64
favorited                        2341 non-null int64
geo                              0 non-null float64
id                               2341 non-null int64
id_str                           2341 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null float64
is_quote_status                  2341 non-nul

> **Results**: The text file was successfully converted to a csv file and was defined as "df_wrangled".
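
The `Unnamed: 0` column in the summary is the dataframe index, which `to_csv` writes out by default and `read_csv` then reads back as an ordinary column. The small sketch below, built on a made-up two-row dataframe, shows the effect of passing `index=False`.

```python
import io

import pandas as pd

# Made-up sample data; only the column names mirror the notebook.
frame = pd.DataFrame({"id": [1, 2], "retweet_count": [508, 47]})

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to drop the index depends on whether it carries information. In this notebook the index holds the tweet ids parsed as dates, so it may be worth fixing that before discarding it.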

<a id='assess'></a>
## Assessing the Data

### Df_twitter Table Assessment

In [24]:
# This samples 15 rows from the df_twitter table to allow me
# to visually assess the data for quality and tidiness errors.
df_twitter.sample(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2039,671547767500775424,,,2015-12-01 04:33:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Marley. She chews shoes then feels ext...,,,,https://twitter.com/dog_rates/status/671547767...,10,10,Marley,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,
899,758828659922702336,,,2016-07-29 00:57:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This doggo is just waiting for someone to be p...,,,,https://twitter.com/dog_rates/status/758828659...,13,10,,doggo,,,
2337,666268910803644416,,,2015-11-16 14:57:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Very concerned about fellow dog trapped in com...,,,,https://twitter.com/dog_rates/status/666268910...,10,10,,,,,
1759,678740035362037760,,,2015-12-21 00:53:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Tango. He's a large dog. Doesn't care muc...,,,,https://twitter.com/dog_rates/status/678740035...,6,10,Tango,,,,
518,810657578271330305,,,2016-12-19 01:26:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pavlov. His floatation device has fail...,,,,https://twitter.com/dog_rates/status/810657578...,11,10,Pavlov,,,,
829,768909767477751808,,,2016-08-25 20:35:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: When it's Janet from accounting...,7.001438e+17,4196984000.0,2016-02-18 02:24:13 +0000,https://twitter.com/dog_rates/status/700143752...,10,10,,,,pupper,
1913,674372068062928900,,,2015-12-08 23:36:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Chesney. On the outside he stays calm &am...,,,,https://twitter.com/dog_rates/status/674372068...,10,10,Chesney,,,,
943,752701944171524096,,,2016-07-12 03:11:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: HEY PUP WHAT'S THE PART OF THE ...,6.835159e+17,4196984000.0,2016-01-03 05:11:12 +0000,"https://vine.co/v/ibvnzrauFuV,https://vine.co/...",11,10,,,,,
665,790698755171364864,,,2016-10-24 23:37:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Mosby. He appears to be rather h*ckin ...,,,,https://twitter.com/dog_rates/status/790698755...,12,10,Mosby,,,,


In [25]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

> **Results**: The table has 2,356 rows, and several columns contain null values. In the name column, missing values are stored as the string "None" when they should be NaN.

In [26]:
# I check for duplicate tweets by running a duplicated method
# in the tweet_id column.

df_twitter[df_twitter['tweet_id'].duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


> **Results**: There are no duplicate 'tweet_id' values.

In [27]:
# I check for erroneous 'tweet_id' values.
df_twitter.tweet_id.sort_values()

2355    666020888022790149
2354    666029285002620928
2353    666033412701032449
2352    666044226329800704
2351    666049248165822465
2350    666050758794694657
2349    666051853826850816
2348    666055525042405380
2347    666057090499244032
2346    666058600524156928
2345    666063827256086533
2344    666071193221509120
2343    666073100786774016
2342    666082916733198337
2341    666094000022159362
2340    666099513787052032
2339    666102155909144576
2338    666104133288665088
2337    666268910803644416
2336    666273097616637952
2335    666287406224695296
2334    666293911632134144
2333    666337882303524864
2332    666345417576210432
2331    666353288456101888
2330    666362758909284353
2329    666373753744588802
2328    666396247373291520
2327    666407126856765440
2326    666411507551481857
               ...        
29      886366144734445568
28      886680336477933568
27      886736880519319552
26      886983233522544640
25      887101392804085760
24      887343217045368832
2

> **Results**: All the tweet_ids shown have the same length (18 digits), which suggests that none of them were input incorrectly.
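
Because the sorted output is truncated, reading lengths off the display covers only part of the column. A direct check is to collect the set of string lengths. The sketch below uses three ids copied from the outputs above. On the full table, `df_twitter.tweet_id.astype(str).str.len().value_counts()` should give the same kind of summary.

```python
# Three ids copied from the outputs above.
tweet_ids = [666020888022790149, 666029285002620928, 892420643555336193]

# A single value in the set means every id has the same number of digits.
lengths = {len(str(tweet_id)) for tweet_id in tweet_ids}
print(lengths)
```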

In [28]:
# I check for erroneous name values.
df_twitter['name'].value_counts()

None        745
a            55
Charlie      12
Lucy         11
Oliver       11
Cooper       11
Tucker       10
Penny        10
Lola         10
Winston       9
Bo            9
Sadie         8
the           8
Buddy         7
Toby          7
Bailey        7
Daisy         7
an            7
Jack          6
Jax           6
Rusty         6
Oscar         6
Koda          6
Milo          6
Scout         6
Bella         6
Dave          6
Stanley       6
Leo           6
Finn          5
           ... 
Arnold        1
Major         1
Pepper        1
Iggy          1
Julius        1
Schnozz       1
Zeus          1
Marq          1
Chadrick      1
Bert          1
Leonidas      1
Jed           1
Ralpher       1
Sailer        1
Damon         1
Ralf          1
Reptar        1
Wesley        1
Rizzo         1
Eleanor       1
Anthony       1
Corey         1
Sailor        1
Comet         1
Chef          1
Cheesy        1
BeBe          1
Divine        1
Bobble        1
Jeffri        1
Name: name, Length: 957,

> **Results**: Null values in the name column are stored incorrectly as "None". Also, "a", "an", and "the" should be converted to null values, and the name "life" needs to be capitalized.
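
The fixes listed in this result can be sketched on a small made-up Series. This is an illustration of one possible approach under those assumptions, not the cleaning code used later in the project.

```python
import pandas as pd

# Made-up sample that mixes real names with the placeholder values.
names = pd.Series(["Charlie", "None", "a", "the", "Lucy", "an", "life"])

# Values that are not real names become NaN.
placeholders = ["None", "a", "an", "the"]
cleaned = names.mask(names.isin(placeholders))

# Capitalize the one lowercase name called out above.
cleaned = cleaned.replace({"life": "Life"})
print(cleaned)
```

An explicit list of placeholders is used instead of a blanket lowercase test, because a lowercase test would also discard "life".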

### Df_images Table Assessment

In [29]:
# This allows me to visually assess the data for 
# quality and tidiness errors.
df_images.sample(15)


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
193,669571471778410496,https://pbs.twimg.com/media/CUrLsI-UsAALfUL.jpg,1,minivan,0.873488,False,pickup,0.041259,False,beach_wagon,0.0154,False
1886,847962785489326080,https://pbs.twimg.com/media/C8SRpHNUIAARB3j.jpg,1,sea_lion,0.882654,False,mink,0.06688,False,otter,0.025679,False
2059,889278841981685760,https://pbs.twimg.com/ext_tw_video_thumb/88927...,1,whippet,0.626152,True,borzoi,0.194742,True,Saluki,0.027351,True
622,680497766108381184,https://pbs.twimg.com/media/CXGdG0aWcAEbOO1.jpg,1,Chihuahua,0.538354,True,muzzle,0.084289,False,ski_mask,0.07669,False
611,680115823365742593,https://pbs.twimg.com/media/CXBBurSWMAELewi.jpg,1,pug,0.999365,True,French_bulldog,0.000544,True,Boston_bull,2.8e-05,True
1381,765371061932261376,https://pbs.twimg.com/media/Cp8k6oRWcAUL78U.jpg,2,golden_retriever,0.829456,True,Labrador_retriever,0.089371,True,kuvasz,0.017028,True
1589,798628517273620480,https://pbs.twimg.com/media/CUN4Or5UAAAa5K4.jpg,1,beagle,0.636169,True,Labrador_retriever,0.119256,True,golden_retriever,0.082549,True
598,679722016581222400,https://pbs.twimg.com/media/CW7bkW6WQAAksgB.jpg,1,boxer,0.459604,True,Boston_bull,0.197913,True,French_bulldog,0.087023,True
2067,890729181411237888,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,2,Pomeranian,0.566142,True,Eskimo_dog,0.178406,True,Pembroke,0.076507,True
1023,710283270106132480,https://pbs.twimg.com/media/Cdtu3WRUkAAsRVx.jpg,2,Shih-Tzu,0.932401,True,Lhasa,0.030806,True,Tibetan_terrier,0.008974,True


>**Results**: The p1-p3 columns have inconsistent formatting in their values.

In [30]:
# This confirms that the p1_dog column only 
# has true/false values.
df_images.p1_dog.unique()

array([ True, False], dtype=bool)

In [31]:
# This confirms that the p2_dog column only 
# has true/false values.
df_images.p2_dog.unique()

array([ True, False], dtype=bool)

In [32]:
# This confirms that the p3_dog column only 
# has true/false values.
df_images.p3_dog.unique()

array([ True, False], dtype=bool)
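
The three checks above repeat the same call for each column. A compact sketch, run here on a made-up dataframe with the same column names, loops over the columns and tests the dtype directly.

```python
import pandas as pd

# Made-up sample with the same flag columns as df_images.
sample = pd.DataFrame({
    "p1_dog": [True, False, True],
    "p2_dog": [True, True, False],
    "p3_dog": [False, False, True],
})

flag_columns = ["p1_dog", "p2_dog", "p3_dog"]

# A boolean dtype guarantees that every value is True or False.
all_boolean = all(
    pd.api.types.is_bool_dtype(sample[col]) for col in flag_columns
)
print(all_boolean)
```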

In [33]:
# This describes the df_images dataframe.
df_images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


>**Results**: There are no null values in the data. Tweet_id and img_num need to be converted to strings.

In [34]:
# This determines whether there are tweet_id duplicates.
df_images[df_images['tweet_id'].duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


>**Results**: There are no duplicate values for tweet_id.

In [35]:
# This determines whether there are erroneous .jpg values.
jpg_list = list(df_images.jpg_url)

count = 0
for i in jpg_list:
    if ".jpg" not in i:
        count = count + 1
        print(i, '\n The total number of errors is {}.\n'.format(count))

https://pbs.twimg.com/tweet_video_thumb/CVKtH-4WIAAmiQ5.png 
 The total number of errors is 1.

https://pbs.twimg.com/tweet_video_thumb/CZ0mhduWkAICSGe.png 
 The total number of errors is 2.



>**Results**: Two jpg_url values end in .png rather than .jpg. Both are video thumbnail images, so they appear to be valid URLs rather than erroneous values.
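
The loop above flags any URL that does not contain `.jpg`. To see which extensions occur and how often, a small sketch with the standard library can count them. The three URLs are copied from the outputs above.

```python
import os
from collections import Counter
from urllib.parse import urlparse

# Three URLs copied from the outputs above.
urls = [
    "https://pbs.twimg.com/media/CUrLsI-UsAALfUL.jpg",
    "https://pbs.twimg.com/media/C8SRpHNUIAARB3j.jpg",
    "https://pbs.twimg.com/tweet_video_thumb/CVKtH-4WIAAmiQ5.png",
]

# Take the extension from the URL path so a query string cannot interfere.
extension_counts = Counter(
    os.path.splitext(urlparse(u).path)[1].lower() for u in urls
)
print(extension_counts)
```

Applied to `df_images.jpg_url`, the same expression would summarize the whole column in one pass.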

### Df_wrangled Table Assessment

In [36]:
# This returns a visual assessment of df_wrangled.
df_wrangled.sample(15)

Unnamed: 0.1,Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,truncated,user
2035,1996-06-19 08:26:23.151792128,,,2017-02-24 17:01:22,"{'hashtags': [], 'symbols': [], 'user_mentions...",,27626,0,,835172783151792128,...,,,,6226,0,,"<a href=""http://twitter.com/download/iphone"" r...",We only rate dogs. Please don't send in any no...,1,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
947,1992-02-26 02:46:45.256409088,,,2016-02-15 03:27:04,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 699072391083880449, 'id_str'...",3173,0,,699072405256409088,...,,,,1273,0,,"<a href=""http://twitter.com/download/iphone"" r...",ERMAHGERD 12/10 please enjoy https://t.co/7WrA...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2324,1998-03-03 09:27:18.123831296,,,2017-07-23 00:22:39,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 888917229776945152, 'id_str'...",28726,0,,888917238123831296,...,,,,4444,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Jim. He found a fren. Taught him how t...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
245,1991-03-31 18:42:35.656130560,,,2015-11-28 03:31:48,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 670444949847023616, 'id_str'...",6873,0,,670444955656130560,...,,,,2050,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Paull. He just stubbed his toe. 10/10 ...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
732,1991-09-15 02:36:23.876321280,,,2016-01-07 00:59:40,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 684902175449939968, 'id_str'...",1989,0,,684902183876321280,...,,,,571,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Perry. He's an Augustus Gloopster. Ver...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
324,1991-04-16 02:18:01.401958400,,,2015-12-01 19:10:13,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671768277677441024, 'id_str'...",1222,0,,671768281401958400,...,,,,544,0,,"<a href=""http://twitter.com/download/iphone"" r...",When you try to recreate the scene from Lady &...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2025,1996-06-06 18:39:39.323871233,,,2017-02-21 17:04:24,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 834086371694362625, 'id_str'...",13827,0,,834086379323871232,...,,,,2407,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Lipton. He's a West Romanian Snuggle P...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
364,1991-04-25 23:32:07.801233409,,,2015-12-04 03:43:54,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 672622321664811010, 'id_str'...",1323,0,,672622327801233408,...,,,,514,0,,"<a href=""http://twitter.com/download/iphone"" r...",This lil pupper is sad because we haven't foun...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
298,1991-04-11 07:44:17.343524864,,,2015-11-30 15:51:24,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671355851936940033, 'id_str'...",490,0,,671355857343524864,...,,,,117,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Lou. He's a Petrarch Sunni Pinto. Well...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
331,1991-04-17 12:35:28.106971137,,,2015-12-02 03:20:45,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671891721769385984, 'id_str'...",1359,0,,671891728106971136,...,,,,589,0,,"<a href=""http://twitter.com/download/iphone"" r...",This is Mojo. Apparently he's too cute for a s...,0,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [37]:
# This describes df_wrangled.
df_wrangled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2341 entries, 0 to 2340
Data columns (total 31 columns):
Unnamed: 0                       2341 non-null object
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2341 non-null object
entities                         2341 non-null object
extended_entities                1822 non-null object
favorite_count                   2341 non-null int64
favorited                        2341 non-null int64
geo                              0 non-null float64
id                               2341 non-null int64
id_str                           2341 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null float64
is_quote_status                  2341 non-nul

>**Results**: Id_str needs to be converted into a string and renamed so that it can merge with the other dataframes.

In [38]:
# This determines whether there are erroneous negative counts.
df_wrangled.favorite_count.sort_values()

1113         0
1754         0
1753         0
1752         0
1751         0
2115         0
2116         0
1343         0
2124         0
1750         0
2134         0
1749         0
1748         0
1747         0
1746         0
2234         0
1745         0
1739         0
1736         0
1733         0
1724         0
2135         0
1722         0
2142         0
2151         0
1717         0
1755         0
1756         0
2097         0
2082         0
         ...  
1694     48725
2285     48865
2287     49966
1242     51789
2209     52805
1484     52896
1727     53805
1955     55486
1891     55610
1984     56319
2333     64637
2221     65630
2318     68251
1993     71028
2299     72170
2191     75377
2303     76549
525      78004
2264     78947
1813     80963
589      81974
2228     82392
2177     91310
1901     92504
2275    104702
1816    121333
1276    122170
2210    122728
1937    141395
1316    165063
Name: favorite_count, Length: 2341, dtype: int64

In [39]:
# This checks for id_str duplicate values.
df_wrangled[df_wrangled['id_str'].duplicated()]

Unnamed: 0.1,Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,truncated,user


### Df_twitter Table Issues

> **Quality Issues**
>> 1. Tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id should be strings.
>> 2. The name "life" needs to be capitalized in the name column.
>> 3. "A," "an," and "the" in the name column should be null values.
>> 4. The null values are recorded incorrectly in the name column. Also, there are incorrect names in the name column that need to be corrected.

> **Tidiness Issues**
>> 1. Doggo, Floofer, Pupper, and Puppo need to be converted to categorical values of a single column, "stage."

### Df_images Table Issues

> **Quality Issues**
>> 5. The p1-p3 values have inconsistent case formatting.
>> 6. Tweet_id and img_num need to be changed to strings.
>> 7. There are png values in the jpg_url column.


> **Tidiness Issues**
>> 2. The p1-p3 columns need to be renamed. Their current labels tell us nothing.


### Df_wrangled Table Issues

> **Quality Issues**
>> 8. id_str needs to be converted to a string.

> **Tidiness Issues**
>> 3. id_str needs to be renamed in order to merge it with the other dataframes.
>> 4. All columns except id_str, in_reply_to_status_id, in_reply_to_user_id, and retweet_count need to be dropped prior to the conversion of the dataframes.

<a id='cleaning'></a>
## Cleaning the Three Tables

### Df_twitter Table Definitions

#### Define

> **Quality Solutions**
>> 1. Convert tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id to strings.
>> 2. Capitalize the name "life" in the name column.
>> 3. Convert "a," "an," and "the" in the name column to "NaN".
>> 4. Convert "None" and erroneous values in the name column to "NaN".

> **Tidiness Solutions**
>> 1. Convert doggo, floofer, pupper, and puppo to categorical values of a single column, "stage."

### Df_twitter Table Cleaning

#### Code

In [40]:
# Convert tweet_id, in_reply_to_status_id, in_reply_to_user_id,
# retweeted_status_id, and retweeted_status_user_id to strings.

# This copies df_twitter to clean_twitter so that the original
# dataframe stays intact.
clean_twitter = df_twitter.copy()

# This converts the column data to string data. The name column is
# included so that the string comparisons below work on every row.
# Note that astype(str) turns missing values into the string 'nan'.
str_columns = ['tweet_id', 'in_reply_to_status_id', 'name',
               'in_reply_to_user_id', 'retweeted_status_id',
               'retweeted_status_user_id']
for col in str_columns:
    clean_twitter[col] = clean_twitter[col].astype(str)



#### Test

In [41]:
# This determines whether the column data was converted to string
# data.
clean_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 18 columns):
Unnamed: 0                    2356 non-null int64
tweet_id                      2356 non-null object
in_reply_to_status_id         2356 non-null object
in_reply_to_user_id           2356 non-null object
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           2356 non-null object
retweeted_status_user_id      2356 non-null object
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         

>**Results**: Quality Definition 1 was successfully implemented. Tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id were converted to strings.
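
The reason for storing IDs as strings can be sketched as follows. This is an illustration under stated assumptions, not part of the cleaning itself: it uses the first tweet_id from the archive, and the two small series are hypothetical. A column that contains NaN is read as float64, which cannot hold every 18-digit integer exactly.

```python
import numpy as np
import pandas as pd

tweet_id = 892420643555336193  # first tweet_id in the archive

# Round-tripping the ID through a float changes its value, because
# float64 has only 53 bits of integer precision.
round_trip = int(float(tweet_id))
print(round_trip == tweet_id)

# astype(str) keeps an integer ID intact, but it turns a missing
# value into the string 'nan' rather than a true null.
ids = pd.Series([tweet_id]).astype(str)
replies = pd.Series([np.nan]).astype(str)
print(ids[0], replies[0])
```

This is also why `info()` reports 2356 non-null values for the reply and retweet columns after the conversion: the missing values are now the string 'nan'.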

#### Code

In [42]:
# Capitalize the name "life" from the name column.

# This turns the lower-cased name "life" into the
# capitalized "Life". Series.replace matches whole values
# only, so names that merely contain "life" are untouched.
clean_twitter['name'] = clean_twitter['name'].replace("life", "Life")

#### Test

In [43]:
# This tests whether lower-cased "life" is still
# in the name column.
if (clean_twitter.name == 'life').any():
    print('Lower-cased \"life\" is still in here.')
else:
    print('The correction was made.')

The correction was made.


> **Results**: Quality Definition 2 was successfully implemented. Lower-cased 'life' was converted to capitalized 'Life'.

#### Code

In [45]:
# Convert "None" and other erroneous values in the name column to "NaN".

# This converts all the lower-cased (erroneous) name values,
# such as "a", "an", and "the", to null values.
# This code comes from Source 18 in the Sources Section.
mask = clean_twitter.name.str.islower()
column_name = 'name'
clean_twitter.loc[mask, column_name] = np.nan

# This returns the value counts to ensure that the correction
# was made.
clean_twitter.name.value_counts()

None        745
Charlie      12
Oliver       11
Lucy         11
Cooper       11
Tucker       10
Penny        10
Lola         10
Bo            9
Winston       9
Sadie         8
Bailey        7
Daisy         7
Buddy         7
Toby          7
Oscar         6
Jack          6
Bella         6
Rusty         6
Milo          6
Scout         6
Stanley       6
Jax           6
Dave          6
Leo           6
Koda          6
George        5
Gus           5
Finn          5
Chester       5
           ... 
Arnold        1
Major         1
Pepper        1
Millie        1
Iggy          1
Julius        1
Henry         1
Winifred      1
Suki          1
Schnozz       1
Sailer        1
Damon         1
Ralf          1
Reptar        1
Wesley        1
Chadrick      1
Rizzo         1
Anthony       1
Corey         1
Sailor        1
Comet         1
Chef          1
Cheesy        1
BeBe          1
Divine        1
Bobble        1
Eleanor       1
Zeus          1
Yoda          1
Jeffri        1
Name: name, Length: 933,
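
The counts above still list the string "None", because `str.islower()` is False for a capitalized value. A minimal sketch of converting it as well, assuming a toy `names` series (hypothetical) rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the name column.
names = pd.Series(['Phineas', 'None', 'a', 'the', 'Tilly'])

# Lower-cased entries such as "a" and "the" are not real names.
names[names.str.islower()] = np.nan

# Series.replace matches whole values, so only the exact
# string "None" is converted to a null value.
names = names.replace('None', np.nan)

print(names.isna().sum())
print(names.dropna().tolist())
```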

#### Test

In [None]:
# Although the visual assessment is reassuring, I check
# programmatically whether any "None" values remain.
# Null values are skipped, because NaN is a float and
# has no string methods.
names = clean_twitter.name.dropna()
if names.isin(["None", "none"]).any():
    print("None is still in here.\n")
else:
    print('All \"None\" values were eliminated.\n')

# I then identify any remaining lower-cased name values.
lower_cased_names = names[names.str.islower()]
if lower_cased_names.empty:
    print('The lower-cased values were eliminated.')
else:
    print('Lower Case Alert!', list(lower_cased_names.unique()))


> **Results**: Quality Definitions 3 and 4 were implemented. The lower-cased, erroneous name values were converted to null values, and the test above reports whether any "None" values remain.

#### Code

In [None]:
# Convert doggo, floofer, pupper, and puppo to
# categorical values of the column "stage."

# This creates the "stage" column and fills it with
# the placeholder string 'nan'.
clean_twitter['stage'] = str(np.nan)

clean_twitter.stage.unique()

#### Code Continued

In [None]:
# This determines the unique values for doggo,
# floofer, pupper, and puppo. We need these values
# in order to create the "stage" list and column.
print(clean_twitter.doggo.unique())
print(clean_twitter.floofer.unique())
print(clean_twitter.pupper.unique())
print(clean_twitter.puppo.unique())


#### Code Continued

In [None]:
# I convert the "None" values in the four stage columns to the
# placeholder string 'nan', the same value that fills the stage
# column. Series.replace matches whole values only.
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    clean_twitter[col] = clean_twitter[col].replace("None", str(np.nan))

# These lists are used below to build the stage column.
doggo = list(clean_twitter.doggo)
floofer = list(clean_twitter.floofer)
pupper = list(clean_twitter.pupper)
puppo = list(clean_twitter.puppo)

# This checks to see if the "None" values were converted to 'nan'.
clean_twitter.sample(15)

#### Code Continued

In [None]:
# This converts the stage column to the puppo values.
clean_twitter.stage = puppo

# This drops the puppo column.
clean_twitter = clean_twitter.drop(labels = ['puppo'], axis = 1)


In [None]:
# This confirms that the puppo value was dropped.
clean_twitter.sample(5)

#### Code Continued

In [None]:
# This confirms that the only values for the
# stage column are puppo and nan.
clean_twitter.stage.unique()

#### Code Continued

In [None]:
# This is the loop that adds pupper to the
# "stage_to_upload" list. Rows that are already
# "puppo" keep that value.
stage_to_upload = list(clean_twitter.stage)
for i in range(len(stage_to_upload)):
    if stage_to_upload[i] != "puppo":
        stage_to_upload[i] = pupper[i]

# This code confirms that the only three values
# in stage_to_upload are pupper, puppo, and nan.
stage_check1 = []
for i in stage_to_upload:
    if i not in stage_check1:
        stage_check1.append(i)

for i in stage_check1:
    print(i,'\n')

#### Code Continued

In [None]:
# This is the loop that adds doggo to the
# "stage_to_upload" list. Only rows that are still
# 'nan' are filled, so puppo and pupper keep priority.
for i in range(len(stage_to_upload)):
    if stage_to_upload[i] == "nan":
        stage_to_upload[i] = doggo[i]

# This code confirms that the only four values
# in stage_to_upload are doggo, pupper, puppo,
# and nan.
stage_check2 = []
for i in stage_to_upload:
    if i not in stage_check2:
        stage_check2.append(i)

for i in stage_check2:
    print(i,'\n')

#### Code Continued

In [None]:
# This is the loop that adds floofer to the
# "stage_to_upload" list. Only rows that are still
# 'nan' are filled, so puppo, pupper, and doggo keep priority.
for i in range(len(stage_to_upload)):
    if stage_to_upload[i] == "nan":
        stage_to_upload[i] = floofer[i]

# This code confirms that the only five values
# in stage_to_upload are doggo, floofer, pupper,
# puppo, and nan.
stage_check3 = []
for i in stage_to_upload:
    if i not in stage_check3:
        stage_check3.append(i)

for i in stage_check3:
    print(i,'\n')

#### Code Continued

In [None]:
# This converts the old stage column to the stage_to_upload
# list.
clean_twitter.stage = stage_to_upload

# This drops pupper, doggo, and floofer.
clean_twitter = clean_twitter.drop(labels = ['pupper', 'doggo', 'floofer'], axis = 1)

#### Test

In [None]:
# This visually checks clean_twitter and the new
# column stage.
clean_twitter.sample(10)

#### Test Continued

In [None]:
# This programatically checks the new stage column.
clean_twitter.stage.value_counts()

In [None]:
# This gives a summary of the clean_twitter data.
clean_twitter.info()

> **Results**: Tidiness Definition 1 was successfully implemented; "pupper," "doggo," "puppo," and "floofer" were merged into the "stage" column.
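
The loops above can also be expressed as one vectorized step. This is a sketch under stated assumptions, not the code used in this notebook: the `toy` dataframe is hypothetical, and `np.select` is swapped in for the loops while keeping the same priority (puppo, then pupper, then doggo, then floofer).

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the archive's layout: "None" marks a missing stage.
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'floofer', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None', 'None'],
})

# np.select takes the first condition that is True for each row,
# so the order of this list sets the priority.
order = ['puppo', 'pupper', 'doggo', 'floofer']
conditions = [toy[col] == col for col in order]
toy['stage'] = np.select(conditions, order, default='nan')

print(toy['stage'].tolist())
```

Note that a row with two stages, such as the fourth row here, keeps only the higher-priority one. The loops behave the same way.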

### Df_images Table Definitions

#### Define

> **Quality Solutions**
>> 5. Capitalize the p1-p3 values so that their case is consistent.
>> 6. Convert tweet_id and img_num to strings.
>> 7. Relabel jpg_url as picture_url.

> **Tidiness Solutions**
>> 2. Rename the p1-p3 columns. Their current labels tell us nothing.

#### Code

In [None]:
# Capitalize the p1-p3 values.

# This creates clean_images.
clean_images = df_images.copy()

# This capitalizes the p1, p2, and p3 columns and replaces
# the old columns with the capitalized values.
for col in ['p1', 'p2', 'p3']:
    clean_images[col] = clean_images[col].str.capitalize()

#### Test

In [None]:
# This checks that the changes were made.
clean_images.sample(10)

>**Results**: Quality Definition 5 was implemented successfully. The p1, p2, and p3 columns all have capitalized values.

#### Code

In [None]:
# Convert tweet_id and img_num to strings.

# This converts tweet_id to a string.
clean_images.tweet_id = clean_images.tweet_id.astype(str)

# This converts img_num to a string.
clean_images.img_num = clean_images.img_num.astype(str)

#### Test

In [None]:
# This checks if tweet_id and img_num were successfully
# converted to strings.
clean_images.info()

>**Results**: Quality Definition 6 was successfully implemented. The values in the tweet_id and img_num columns were converted to strings.

#### Code

In [None]:
# Relabel jpg_url as picture_url.

# Currently, jpg_url is a misleading label for the column;
# there are .png files in the column. I rename the jpg_url
# column to picture_url.
# This code comes from Source 10 in the Sources Section.
clean_images = clean_images.rename(columns={'jpg_url': 'picture_url'})

#### Test

In [None]:
# This confirms that the jpg_url column was renamed to
# picture_url.
clean_images.info()

> **Results**: Quality Definition 7 was successfully implemented; jpg_url was renamed to picture_url.

#### Code

In [None]:
# This renames all the columns p1 - p3. The column names
# tell us nothing and are therefore untidy.
# This comes from Source 10 in the Sources Section.
clean_images = clean_images.rename(columns={'p1' : '1st_picture_prediction_(p1)',
                                            'p2' : '2nd_picture_prediction_(p2)',
                                            'p3' : '3rd_picture_prediction_(p3)',
                                            'p1_conf' : 'p1_%_confidence',
                                            'p2_conf' : 'p2_%_confidence',
                                            'p3_conf' : 'p3_%_confidence',
                                            'p1_dog' : 'p1_dog:True/False',
                                            'p2_dog' : 'p2_dog:True/False',
                                            'p3_dog' : 'p3_dog:True/False'})

#### Test

In [None]:
# This confirms that the p1-p3 columns were renamed.
clean_images.info()

> **Results**: Tidiness Definition 2 was successfully implemented. The p1-p3 columns were relabeled with more descriptive names.

### Df_wrangled Table Definitions

#### Define

> **Quality Solutions**
>> 8. Convert id_str to a string.

> **Tidiness Solutions**
>> 3. Rename id_str to tweet_id.
>> 4. Drop all columns except tweet_id, in_reply_to_status_id, in_reply_to_user_id, and retweet_count in order to make the conversion of the three dataframes easier.

#### Code

In [None]:
# Convert id_str to a string.

# This creates the clean_wrangled dataframe.
clean_wrangled = df_wrangled.copy()

# This converts id_str to string data.
clean_wrangled.id_str = clean_wrangled.id_str.astype(str)

#### Test

In [None]:
# This checks whether id_str was converted
# to string data.
clean_wrangled.info()

> **Results**: Quality Definition 8 was successfully implemented. The values in id_str were converted to strings.

#### Code

In [None]:
# Rename id_str to tweet_id.

# In order to merge clean_wrangled with the other two tables,
# the column id_str must be renamed to tweet_id.
# This comes from Source 10 in the Sources Section.
clean_wrangled = clean_wrangled.rename(columns={'id_str' : 'tweet_id'})

#### Test

In [None]:
# This checks whether id_str was successfully renamed to tweet_id.
clean_wrangled.info()

> **Results**: Tidiness Definition 3 was successfully implemented. The id_str column was renamed to tweet_id.

#### Code

In [None]:
# Drop all columns except tweet_id, in_reply_to_status_id,
# in_reply_to_user_id, and retweet_count.

# This drops the columns that are not necessary for
# the conversion of the three data tables.
clean_wrangled = clean_wrangled.drop(labels = ['contributors',
                                                'coordinates',
                                                'created_at',
                                                'entities',
                                                'extended_entities',
                                                'favorite_count',
                                                'favorited',
                                                'geo',
                                                'id',
                                                'in_reply_to_screen_name',
                                                'in_reply_to_status_id_str',
                                                'in_reply_to_user_id_str',
                                                'is_quote_status',
                                                'lang',
                                                'place',
                                                'possibly_sensitive',
                                                'possibly_sensitive_appealable',
                                                'quoted_status',
                                                'quoted_status_id_str',
                                                'retweeted',
                                                'retweeted_status',
                                                'source',
                                                'text',
                                                'truncated',
                                                'user'], axis = 1)

#### Code Continued

In [None]:
# This drops the quoted_status_id column.
clean_wrangled = clean_wrangled.drop(labels = ['quoted_status_id'],
                                    axis = 1)

# This drops the Unnamed column.
# This comes from Source 11 in the Sources Section.
clean_wrangled = clean_wrangled.drop(clean_wrangled.columns[0], axis=1)

#### Test

In [None]:
# This confirms that the Unnamed and quoted_status_id columns were dropped.
clean_wrangled.info()

> **Results**: Tidiness Solution 4 was successful. All the columns that are unnecessary for merging the three data tables were dropped.
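
An alternative to listing every column to drop is to select the columns to keep. This is a minimal sketch, assuming a hypothetical `toy_wrangled` dataframe with only a few of the API columns: the IDs and counts are taken from the sample shown earlier, and the `lang` values are made up.

```python
import pandas as pd

# Hypothetical slice of the API table with a few of its columns.
toy_wrangled = pd.DataFrame({
    'tweet_id': ['835172783151792128', '699072405256409088'],
    'in_reply_to_status_id': ['nan', 'nan'],
    'in_reply_to_user_id': ['nan', 'nan'],
    'retweet_count': [6226, 1273],
    'favorite_count': [27626, 3173],
    'lang': ['en', 'en'],  # made-up values
})

# Selecting the wanted columns drops everything else in one step.
keep = ['tweet_id', 'in_reply_to_status_id',
        'in_reply_to_user_id', 'retweet_count']
toy_trimmed = toy_wrangled[keep].copy()

print(toy_trimmed.columns.tolist())
```

The selection does not need to change if the API adds columns later, whereas a drop list would.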

#### Extracurricular Error Correction: Code

In [None]:
# This converts in_reply_to_status_id,
# in_reply_to_user_id, and tweet_id to strings.
clean_wrangled.in_reply_to_status_id = (
    clean_wrangled.in_reply_to_status_id.astype(str))
clean_wrangled.in_reply_to_user_id = (
    clean_wrangled.in_reply_to_user_id.astype(str))
clean_wrangled.tweet_id = (
    clean_wrangled.tweet_id.astype(str))

In [None]:
# This gives us a sample of 2 from clean_wrangled.
clean_wrangled.sample(2)

#### Extracurricular Error Correction: Test

In [None]:
# This confirms whether in_reply_to_status_id,
# in_reply_to_user_id, and tweet_id
# were converted to strings.
clean_wrangled.info()

> **Results**: This conversion to strings aids the merging of the three data tables.

<a id='Merge'></a>
## Conversion of Dataframes to df_combined

### Initial Merge of clean_twitter and clean_wrangled

In [None]:
# This comes from Source 11 in the Sources Section.
# These are the columns that the data will be merged on.
# They need to be lower-case strings so the data can
# merge smoothly.

# This lower-cases the tweet_id column in all three tables.
clean_twitter.tweet_id = clean_twitter.tweet_id.str.lower()
clean_wrangled.tweet_id = clean_wrangled.tweet_id.str.lower()
clean_images.tweet_id = clean_images.tweet_id.str.lower()

In [None]:
# This lower-cases the in_reply_to_status_id and
# in_reply_to_user_id columns in both tables before the merge.
for col in ['in_reply_to_status_id', 'in_reply_to_user_id']:
    clean_twitter[col] = clean_twitter[col].str.lower()
    clean_wrangled[col] = clean_wrangled[col].str.lower()

# This is the initial merge of clean_twitter with clean_wrangled.
# This comes from Source 11 in the Sources Section.
clean_twitter = pd.merge(clean_twitter, clean_wrangled,
                        on = ['tweet_id', 'in_reply_to_status_id',
                              'in_reply_to_user_id'])

# This checks to confirm that the initial merge was successful.
clean_twitter.info()

In [None]:
# This gives us a visual of the new clean_twitter dataframe.
clean_twitter.head()

> **Results**: The initial merge was successful.

In [None]:
# This is the second and final merge, of clean_twitter with
# clean_images.
clean_twitter = pd.merge(clean_twitter, clean_images,
                          on = ['tweet_id'])

# This determines whether the merge was successful.
clean_twitter.info()

In [None]:
# This describes the retweet_count column of the merged clean_twitter.
clean_twitter.retweet_count.describe()

> **Results**: The second merge was successful.
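
Both merges use the default inner join, which silently discards tweets that are missing from either table. The sketch below shows one way to see how many rows match before committing to the merge. The `left` and `right` dataframes and their values are hypothetical; `how`, `indicator`, and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Hypothetical tables that share only two tweet_id values.
left = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                     'rating_numerator': [13, 12, 10]})
right = pd.DataFrame({'tweet_id': ['2', '3', '4'],
                      'retweet_count': [571, 117, 544]})

# how='outer' keeps unmatched rows, indicator=True records where each
# row came from, and validate raises an error if tweet_id is duplicated.
checked = pd.merge(left, right, on='tweet_id', how='outer',
                   indicator=True, validate='one_to_one')
print(checked['_merge'].value_counts())

# The default inner merge keeps only the rows found in both tables.
merged = pd.merge(left, right, on='tweet_id')
print(len(merged))
```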

In [None]:
# This stores clean_twitter as df_aggregate and saves it
# to the master CSV file.
df_aggregate = clean_twitter
df_aggregate.to_csv('twitter-archive-master.csv')

<a id='analysis'></a>
## Analysis of the Master Dataset

### Summary:
> I want to know what factors influence the ratings. I believe that stage and retweet count have a causal relationship with the ratings. I first create the ratings_score variable, my dependent variable, by dividing the ratings numerator by its denominator. From there, I take the descriptive statistics of the ratings_score. I then create a correlation matrix to ensure that the stage dummy variables and retweet count, my independent variables, are not strongly related to each other. I then conduct bivariate analyses to determine whether relationships exist between the dependent and independent variables. Finally, I fit a multiple linear regression to test whether these relationships hold between my dependent and independent variables.

### Hypotheses:

> H0(stage) = 0

> H1(stage) != 0

> H0(retweet count) = 0

> H1(retweet count) != 0

In [None]:
# This creates the ratings_score variable.
df_aggregate['ratings_score'] = (
df_aggregate.rating_numerator / (
df_aggregate.rating_denominator))

# This confirms that the ratings_score column was created.
df_aggregate.info()

In [None]:
# This code takes some of the descriptive statistics
# of the ratings_score variable.

df_aggregate.ratings_score.describe()

In [None]:
# This finds the mode of the ratings scores.
df_aggregate.ratings_score.mode()

In [None]:
# This finds the most common stage, other than 'nan',
# for a dog to have his/her picture taken.
df_aggregate.stage.value_counts()

In [None]:
# This finds the top score rating.
df_top_dog = df_aggregate[df_aggregate['ratings_score'] == 
                          (df_aggregate['ratings_score'].max())]

# This is the score.
top_ratings_score = str(df_top_dog.ratings_score)
print(top_ratings_score)

> **Results**: I identified the descriptive statistics for the ratings scores. The average ratings score is 1.22188. The median is 1.1, suggesting that the data distribution is slightly skewed to the right. The mode score is 1. The interquartile range is from 1.0 to 1.2. The highest score rating for any dog is 177.6. The lowest rating score is 0.1. The most popular dog stage was the pupper stage. The least popular stage was puppo.
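
A single extreme rating is enough to pull the mean above the median. The sketch below illustrates this with made-up numbers: the `toy` dataframe is hypothetical, and the numerator 1776 is chosen only so that the top score matches the 177.6 reported above.

```python
import pandas as pd

# Hypothetical ratings; the last row is an extreme value.
toy = pd.DataFrame({'rating_numerator': [13, 12, 10, 11, 1776],
                    'rating_denominator': [10, 10, 10, 10, 10]})

# Same definition as ratings_score above.
toy['ratings_score'] = toy.rating_numerator / toy.rating_denominator

mean_score = toy.ratings_score.mean()
median_score = toy.ratings_score.median()
print(mean_score, median_score)
```

The median stays close to the typical rating, while the mean is dominated by the outlier. This is why the median and interquartile range are the more informative summaries here.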

In [None]:
# Before I can make my correlation and residual
# plots, I create the dummy variables doggo, floofer,
# pupper, and puppo for stage.
# This comes from Source 12 in the Sources Section.
stage_dummies = pd.get_dummies(df_aggregate['stage'])

# I join the dummy variables to the new dataframe
# df_aggregate_1.
df_aggregate_1 = df_aggregate.join(stage_dummies)

# This determines whether the dummies were successfully
# added.
df_aggregate_1.info()

In [None]:
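# This creates the dataframe I use for the
# correlations, bi-variate plots, and regression analysis.
# The .copy() makes df_analysis independent of df_aggregate_1,
# so adding columns later does not trigger a SettingWithCopyWarning.
df_analysis = df_aggregate_1[['ratings_score',
                              'retweet_count',
                              'doggo',
                              'floofer',
                              'nan',
                              'pupper',
                              'puppo',
                              'stage']].copy()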

df_analysis.info()

In [None]:
# This creates the correlation matrix for df_analysis.
# This comes from Source 13 in the Sources Section.
df_analysis.corr()


> **Results**: The correlation matrix shows that the only strong correlation among the independent variables is between pupper and nan, at -0.831288. This correlation is irrelevant because nan is the baseline dummy variable that will be left out of the regression analysis. Among the variables that will be regressed against the ratings score, the pairwise correlations show no sign of multicollinearity.
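
> A pairwise correlation matrix can miss collinearity that involves more than two predictors at once. A common supplementary check, which the original analysis does not use, is the variance inflation factor (VIF). The sketch below computes it by hand with NumPy on simulated data. The helper `vif_table` and the frame `toy` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def vif_table(predictors: pd.DataFrame) -> pd.Series:
    """Return 1 / (1 - R^2) for each column regressed on all the others."""
    vifs = {}
    for col in predictors.columns:
        y = predictors[col].to_numpy(dtype=float)
        others = predictors.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column so R^2 is measured around the mean of y.
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r_squared = 1 - residuals.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)


# Simulated stand-ins for the notebook's predictors (not the real data).
rng = np.random.default_rng(0)
n = 500
stage = rng.choice(["doggo", "pupper", "nan"], size=n, p=[0.1, 0.3, 0.6])
toy = pd.DataFrame({
    "retweet_count": rng.integers(1, 5000, size=n),
    "doggo": (stage == "doggo").astype(int),
    "pupper": (stage == "pupper").astype(int),
})

vifs = vif_table(toy)
print(vifs.round(2))
```

> Values near 1 indicate little collinearity. Rules of thumb often treat values above roughly 5 to 10 as a warning sign, although the exact cutoff is a convention.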

In [None]:
# This determines the feasibility of assessing the dummy variables.
df_analysis.stage.value_counts()

> **Results**: There are not enough puppo and floofer observations to regress against ratings scores. A common rule of thumb is a minimum of 30 observations per group, and the puppo and floofer dummies are not close to that minimum. The only stage dummy variables I will regress against the ratings score are pupper and doggo.
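
> The 30-observation screen described above can also be applied programmatically rather than by reading the counts. This is a minimal sketch with invented counts. The series `toy_stage` and the constant `MIN_OBSERVATIONS` are hypothetical, and the string 'nan' is assumed to be the missing-stage label, as in the dummy columns above.

```python
import pandas as pd

MIN_OBSERVATIONS = 30  # rule-of-thumb threshold quoted in the text

# Hypothetical stage labels; these counts are made up for illustration.
toy_stage = pd.Series(
    ["pupper"] * 40 + ["doggo"] * 35 + ["puppo"] * 5
    + ["floofer"] * 2 + ["nan"] * 100
)

counts = toy_stage.value_counts()

# Keep real stages (not the 'nan' placeholder) that meet the threshold.
keep = (counts >= MIN_OBSERVATIONS) & (counts.index != "nan")
usable_stages = sorted(counts[keep].index.tolist())

print(usable_stages)
```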

In [None]:
# This regresses pupper against rating scores.
# This comes from Source 14 in the Sources Section.
sns.set(color_codes=True)
ratings = "ratings_score"
sns.regplot(x="pupper", y=ratings, data=df_analysis);
plt.ylim(0,2)

> **Results**: There appears to be a slightly negative linear correlation between pupper and rating scores.

In [None]:
# This regresses doggo against rating scores.
# This comes from Source 14 in the Sources Section.
ratings = "ratings_score"
sns.regplot(x="doggo", y=ratings, data=df_analysis);
plt.ylim(0,2)

> **Results**: There appears to be a slightly negative linear correlation between doggo and rating scores.

In [None]:
# This regresses retweet_count against rating scores.
# This comes from Source 14 in the Sources Section.
ratings = "ratings_score"
sns.regplot(x="retweet_count", y=ratings, data=df_analysis);
plt.ylim(0,2)

> **Results**: While it is possible that a positive linear relationship exists between ratings score and retweet_count, it is also possible that a non-linear relationship exists. In my next plot, I regress ratings_score against the log of retweet_count.

In [None]:
# This regresses the log of retweet_count against rating scores.
# This comes from Source 14 in the Sources Section.
ratings = "ratings_score"
sns.regplot(x="retweet_count", y=ratings, data=df_analysis,
            logx=True);
plt.ylim(0,2)

> **Results**: From the plot in which I regress ratings_score against the log of retweet_count, it appears that a positive but non-linear (logarithmic) relationship exists between x and y. The next step is to construct the multilinear regression model.
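
> One practical detail of a log transform is that the log of zero is negative infinity, which an ordinary least squares fit cannot use. Whether this dataset contains tweets with zero retweets is not shown here, so the following is only a precautionary sketch on a toy series. `np.log1p`, which computes log(1 + x), is a swapped-in alternative and not the transform used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical retweet counts, including a zero to show the edge case.
toy_retweets = pd.Series([0, 1, 10, 100, 1000])

# Plain natural log: the zero entry becomes -inf (warning silenced here).
with np.errstate(divide="ignore"):
    plain_log = np.log(toy_retweets)

# log(1 + x) stays finite at zero and is close to log(x) for large x.
safe_log = np.log1p(toy_retweets)

print("rows with non-finite plain log:", int((~np.isfinite(plain_log)).sum()))
print("rows with non-finite log1p:", int((~np.isfinite(safe_log)).sum()))
```

> If zeros are present, the other common option is to drop those rows before taking the plain log. Either choice changes the interpretation of the coefficient slightly and should be stated alongside the results.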

In [None]:
# Before I can begin my multilinear regression analysis,
# I must create the log of retweet_count.
# This code comes from Source 15 in the Sources Section.
df_analysis["log_retweet_count"] = df_analysis.retweet_count.apply(np.log)

# I also must create the intercept.
df_analysis['intercept'] = 1

# This confirms that the log_retweet_count variable
# and intercept were successfully created.
df_analysis.info()

In [None]:
# Pupper and Doggo need to be converted to integers.
df_analysis.doggo = df_analysis.doggo.astype(int)
df_analysis.pupper = df_analysis.pupper.astype(int)

# This confirms whether the conversion to integers
# was successful.
df_analysis.info()

In [None]:
# This creates the multilinear regression analysis.
# This code comes from Source 12 in the Sources Section.
reg_model = sm.OLS(df_analysis['ratings_score'], (
                      df_analysis[['intercept', 'log_retweet_count',
                                   'retweet_count', 'doggo', 'pupper']]))
results_of_model = reg_model.fit()
results_of_model.summary()

### Multilinear Regression Results

> **Summary**: I fail to reject both null hypotheses: neither retweet count nor the stage dummy variables has a regression coefficient that differs significantly from 0. Even if the coefficients had been statistically significant, they would have had little practical significance, because the model explains only 0.2 percent of the variance in ratings_score.

<a id='insights'></a>
## The Five Insights

> 1. The mode ratings score (ratings numerator / ratings denominator) is 1.

> 2. The most popular stage is the pupper stage.

> 3. The least popular stage is floofer.

> 4. The lowest ratings score was 0.1.

> 5. The highest score rating of any dog was 177.6.

<a id='sources'></a>
## Sources

> 1. DAND Semester 2, Data Wrangling, Lesson 2

> 2. https://stackoverflow.com/questions/5137497/find-current-directory-and-files-directory

> 3. https://stackoverflow.com/questions/1810743/how-to-set-the-current-working-directory/1810760

> 4. https://media.readthedocs.org/pdf/tweepy/latest/tweepy.pdf

> 5. https://github.com/mainkoon81/U008-project-Python-Special-TWITTER-DOGGIE/blob/master/3.wrangle_act_project_doggy-original.ipynb

> 6. https://github.com/RedRock42/Udacity-Nanodegree-Portfolio/blob/master/P4.Wrangling%20%26%20Analyzing%20Twitter%20Data/Wrangle%20Act.ipynb

> 7. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

> 8. https://stackoverflow.com/questions/29759305/how-do-i-convert-a-tsv-to-csv

> 9. https://stackoverflow.com/questions/22100130/pandas-replace-multiple-values-one-column

> 10. https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas

> 11. DAND Semester 2, Data Wrangling, Lesson 4

> 12. DAND Semester 1, A/B Testing, Brian Campbell's Project 4 Submission

> 13. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

> 14. https://seaborn.pydata.org/tutorial/regression.html

> 15. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html

> 16. https://www.statsmodels.org/stable/regression.html

> 17. DAND Semester 2, Project 7, Review # 4

> 18. DAND Semester 2, Project 7, Review # 5