# Data Wrangling

by Alina Bolat

In [1]:
# Libraries
import pandas as pd
import numpy as np
import datetime as dt
import requests
import tweepy
import json
import time
import re
import os
# Set the maximum column width to unlimited so tweet text is displayed in full
# (None replaces -1, which was deprecated in pandas 1.0)
pd.set_option('display.max_colwidth', None)

## Gather
The gathering process covers the following three data sets:  
1. **twitter-archive-enhanced.csv** is available for manual download.
2. **image-predictions.tsv** is available through a link for programmatic download from the Udacity servers.
3. **tweet_json.txt** is to be gathered through the Twitter API using tweepy.
***
### Twitter Archive

In [2]:
# Import the csv file and store it in a dataframe
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv', encoding = 'utf-8')
# Check the outcome
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

***
### Image Predictions

In [3]:
###
# # Download the file using the Requests library
# url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
# response = requests.get(url)
# # Save the downloaded file
# with open(url.split('/')[-1], mode = 'wb') as outfile:
#     outfile.write(response.content)
###

In [4]:
# Import the tsv file and store it in a dataframe
image_predictions = pd.read_csv('image-predictions.tsv', sep = '\t', encoding = 'utf-8')
# Check the result
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


***
### Twitter Data

In [5]:
###
# # Twitter API Authorisation - CONFIDENTIAL INFORMATION REMOVED
# consumer_key = ''
# consumer_secret = ''
# access_token = ''
# access_secret = ''
# 
# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)
# 
# api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
###

In [6]:
###
# # Record the start time to measure execution time
# start = time.time()
# # A separate list for catching errors
# errors = []
# 
# with open('tweet_json.txt', 'w') as file:
#     for tweet_id in twitter_archive['tweet_id']:
#         try:
#             tweet = api.get_status(tweet_id, tweet_mode = 'extended') # For every tweet_id in the list
#             file.write(json.dumps(tweet._json) + '\n') # write full JSON status and create a new line
#         except Exception as e:
#             print(str(tweet_id) + " " + str(e)) # Print the missing Tweet ID and the error text
#             errors.append(tweet_id)
# 
# # Calculate the execution time
# end = time.time()
# print(end - start)
###

In [7]:
###
# # List of errors
# errors_df = pd.DataFrame(twitter_archive.loc[twitter_archive['tweet_id'].isin(errors),:])
# # Save the list of errors for Assessment
# errors_df.to_csv('tweet_json_errors.csv', index=False, encoding = 'utf-8')
###

In [8]:
# Lists of variables of interest
tweet_id = []
favorite_count = []
retweet_count = []
with open('tweet_json.txt', mode = 'r') as f:
    # Each line of the file holds the full JSON of one tweet
    for line in f:
        tweet = json.loads(line)
        tweet_id.append(tweet['id'])
        favorite_count.append(tweet['favorite_count'])
        retweet_count.append(tweet['retweet_count'])

# Store each variable in an identically named column of a new dataframe
tweet_json = pd.DataFrame({'tweet_id' : tweet_id, 
                           'favorite_count' : favorite_count, 
                           'retweet_count' : retweet_count})
# Check outcome
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
favorite_count    2345 non-null int64
retweet_count     2345 non-null int64
tweet_id          2345 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB


***
## Assess

**Twitter archive** is a dataframe with 17 columns and 2356 observations. It is immediately apparent that there are missing values and other flaws in the dataset that will need to be addressed.  
**Image Predictions** dataframe consists of 12 columns and 2075 observations. There appear to be no missing values, but further investigation is required to determine which columns matter.  
**Twitter Json** is a dataframe gathered programmatically from the Twitter servers. It has 2345 observations and 3 variables and will serve as additional data for the final dataframe.  

In this part of the project the datasets will be assessed for quality and tidiness. To narrow down the data wrangling scope, it always helps me to write down areas of interest for the EDA and ask preliminary questions. These questions will be changed and refined as the process goes on.  

* Is the rating subjective or dependent on certain other traits?
* Which dog stage is posted most and/or earns the highest scores?
* Which dog breed is the most popular?
* Which dog breed is the most liked/has the highest scores?
* What are the main sources of tweets for the We Rate Dogs account?
* What are the most viral tweets in the sample (most retweets and quotes)?
* Are there any trends in time or season when retweets/favourites are more prevalent?

In [9]:
# Visual assessment of the top and the bottom of the dataset.
# I like to print out the whole dataset, as Jupyter Notebook makes it very easy
# to see the head and the tail of a dataset by abbreviating the dataframe display.
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,,,,https://twitter.com/dog_rates/status/891087950875897856/photo/1,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl,,,,"https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq,,,,"https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1",13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b,,,,https://twitter.com/dog_rates/status/890609185150312448/photo/1,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A,,,,https://twitter.com/dog_rates/status/890240255349198849/photo/1,14,10,Cassie,doggo,,,


In [10]:
# Random sample of 5 twitter entries
twitter_archive.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
509,812466873996607488,,,2016-12-24 01:16:12 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Mary. She's desperately trying to recreate her Coachella experience. 12/10 downright h*ckin adorable https://t.co/BAJrfPvtux,,,,https://twitter.com/dog_rates/status/812466873996607488/photo/1,12,10,Mary,,,,
1656,683357973142474752,,,2016-01-02 18:43:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","""Have a seat, son. There are some things we need to discuss"" 10/10 https://t.co/g4G5tvfTVd",,,,https://twitter.com/dog_rates/status/683357973142474752/photo/1,10,10,,,,,
1187,718460005985447936,,,2016-04-08 15:26:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Bowie. He's listening for underground squirrels. Smart af. Left eye is considerably magical. 9/10 would so pet https://t.co/JyNmyjy3fe,,,,https://twitter.com/dog_rates/status/718460005985447936/photo/1,9,10,Bowie,,,,
1598,686035780142297088,6.86034e+17,4196984000.0,2016-01-10 04:04:10 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating",,,,,4,20,,,,,
2116,670427002554466305,,,2015-11-28 02:20:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a Deciduous Trimester mix named Spork. Only 1 ear works. No seat belt. Incredibly reckless. 9/10 still cute https://t.co/CtuJoLHiDo,,,,https://twitter.com/dog_rates/status/670427002554466305/photo/1,9,10,a,,,,


In [11]:
# List sources
twitter_archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                        91  
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                     33  
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>    11  
Name: source, dtype: int64

In [12]:
# Some tweets mention multiple dogs
twitter_archive.loc[531,'text']

'Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho'

In [13]:
# List all the numerators
twitter_archive.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7       55 
14      54 
5       37 
6       32 
3       19 
4       17 
1       9  
2       9  
420     2  
0       2  
15      2  
75      2  
80      1  
20      1  
24      1  
26      1  
44      1  
50      1  
60      1  
165     1  
84      1  
88      1  
144     1  
182     1  
143     1  
666     1  
960     1  
1776    1  
17      1  
27      1  
45      1  
99      1  
121     1  
204     1  
Name: rating_numerator, dtype: int64

In [14]:
# There are very high ratings; these might need to be addressed manually, as they will skew the data
twitter_archive.loc[twitter_archive['rating_numerator'] == 1776,['tweet_id','text']]

Unnamed: 0,tweet_id,text
979,749981277374128128,This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh


In [15]:
# There is also Snoop Dogg, who has to go
twitter_archive.loc[twitter_archive['rating_numerator'] == 420,['tweet_id','text']]

Unnamed: 0,tweet_id,text
188,855862651834028034,@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
2074,670842764863651840,After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY


In [16]:
# List all the denominators
twitter_archive.rating_denominator.value_counts()

10     2333
11     3   
50     3   
80     2   
20     2   
2      1   
16     1   
40     1   
70     1   
15     1   
90     1   
110    1   
120    1   
130    1   
150    1   
170    1   
7      1   
0      1   
Name: rating_denominator, dtype: int64

In [17]:
# View tweets which do not have rating denominator of 10
print('Total number of instances: ', len(twitter_archive.loc[twitter_archive.rating_denominator!=10,
                              ['tweet_id','text','rating_numerator','rating_denominator']]))
twitter_archive.loc[twitter_archive.rating_denominator!=10,
                              ['tweet_id','text','rating_numerator','rating_denominator']]

Total number of instances:  23


Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator
313,835246439529840640,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0
342,832088576586297345,@docmisterio account started on 11/15/15,11,15
433,820690176645140481,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
516,810984652412424192,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7
784,775096608509886464,"RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…",9,11
902,758467244762497024,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1068,740373189193256964,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
1120,731156023742988288,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
1165,722974582966214656,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,716439118184652801,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50


In [18]:
# Check for duplicates
twitter_archive[twitter_archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [19]:
# Count the number of NaNs in each column
twitter_archive.isnull().sum()

tweet_id                      0   
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                     0   
source                        0   
text                          0   
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                 59  
rating_numerator              0   
rating_denominator            0   
name                          0   
doggo                         0   
floofer                       0   
pupper                        0   
puppo                         0   
dtype: int64
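
The null counts above also show how retweets and replies can be recognised: a row is an original tweet only when both `retweeted_status_id` and `in_reply_to_status_id` are missing. A minimal sketch on a made-up three-row frame follows; the `sample` data and the `is_original` name are illustrative assumptions, not part of the archive.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for twitter_archive: one original tweet, one reply, one retweet
sample = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'in_reply_to_status_id': [np.nan, 6.86034e+17, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 8.87474e+17],
})

# An original tweet has neither a reply target nor a retweeted status
is_original = (sample['in_reply_to_status_id'].isnull()
               & sample['retweeted_status_id'].isnull())
originals = sample.loc[is_original]

print(originals['tweet_id'].tolist())
```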

In [20]:
# There were some tweet ids which we could not scrape using the API
errors_df = pd.read_csv('tweet_json_errors.csv')
errors_df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Canela. She attempted some fancy porch pics. They were unsuccessful. 13/10 someone help her https://t.co/cLyzpcUcMX,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,"https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1",13,10,Canela,,,,
1,873697596434513921,,,2017-06-11 00:25:14 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Walter. He won't start hydrotherapy without his favorite floatie. 14/10 keep it pup Walter https://t.co/r28jFx9uyF,8.688804e+17,4196984000.0,2017-05-28 17:23:24 +0000,"https://twitter.com/dog_rates/status/868880397819494401/photo/1,https://twitter.com/dog_rates/status/868880397819494401/photo/1",14,10,Walter,,,,
2,869988702071779329,,,2017-05-31 18:47:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10…,8.59197e+17,4196984000.0,2017-05-02 00:04:57 +0000,https://twitter.com/dog_rates/status/859196978902773760/video/1,12,10,quite,,,,
3,866816280283807744,,,2017-05-23 00:41:20 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender https://t.co/wk7T…",8.664507e+17,4196984000.0,2017-05-22 00:28:40 +0000,"https://twitter.com/dog_rates/status/866450705531457537/photo/1,https://twitter.com/dog_rates/status/866450705531457537/photo/1",13,10,Jamesy,,,pupper,
4,861769973181624320,,,2017-05-09 02:29:07 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: ""Good afternoon class today we're going to learn what makes a good boy so good"" 13/10 https://t.co/f1h2Fsalv9",8.066291e+17,4196984000.0,2016-12-07 22:38:52 +0000,"https://twitter.com/dog_rates/status/806629075125202948/photo/1,https://twitter.com/dog_rates/status/806629075125202948/photo/1,https://twitter.com/dog_rates/status/806629075125202948/photo/1,https://twitter.com/dog_rates/status/806629075125202948/photo/1",13,10,,,,,
5,845459076796616705,,,2017-03-25 02:15:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Here's a heartwarming scene of a single father raising his two pups. Downright awe-inspiring af. 12/10 for everyone https://…,7.562885e+17,4196984000.0,2016-07-22 00:43:32 +0000,"https://twitter.com/dog_rates/status/756288534030475264/photo/1,https://twitter.com/dog_rates/status/756288534030475264/photo/1,https://twitter.com/dog_rates/status/756288534030475264/photo/1,https://twitter.com/dog_rates/status/756288534030475264/photo/1",12,10,,,,,
6,842892208864923648,,,2017-03-18 00:15:37 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Stephan. He just wants to help. 13/10 such a good boy https://t.co/DkBYaCAg2d,8.071068e+17,4196984000.0,2016-12-09 06:17:20 +0000,"https://twitter.com/dog_rates/status/807106840509214720/video/1,https://twitter.com/dog_rates/status/807106840509214720/video/1",13,10,Stephan,,,,
7,837012587749474308,,,2017-03-01 18:52:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @KennyFromDaBlok: 14/10 h*ckin good hats. will wear daily @dog_rates https://t.co/rHLoU5gS30,8.370113e+17,726634700.0,2017-03-01 18:47:10 +0000,"https://twitter.com/KennyFromDaBlok/status/837011344666812416/photo/1,https://twitter.com/KennyFromDaBlok/status/837011344666812416/photo/1",14,10,,,,,
8,827228250799742977,,,2017-02-02 18:52:38 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Phil. He's an important dog. Can control the seasons. Magical as hell. 12/10 would let him sign my forehead https://…,6.946697e+17,4196984000.0,2016-02-02 23:52:22 +0000,"https://twitter.com/dog_rates/status/694669722378485760/photo/1,https://twitter.com/dog_rates/status/694669722378485760/photo/1",12,10,Phil,,,,
9,802247111496568832,,,2016-11-25 20:26:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Everybody drop what you're doing and look at this dog. 13/10 must be super h*ckin rare https://t.co/I1bJUzUEW5,7.790561e+17,4196984000.0,2016-09-22 20:33:42 +0000,"https://twitter.com/dog_rates/status/779056095788752897/photo/1,https://twitter.com/dog_rates/status/779056095788752897/photo/1,https://twitter.com/dog_rates/status/779056095788752897/photo/1,https://twitter.com/dog_rates/status/779056095788752897/photo/1",13,10,,,,,


In [21]:
image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [22]:
# Check the prediction confidence values
image_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [23]:
# Looking for duplicated Tweet IDs
image_predictions[image_predictions.tweet_id.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [24]:
tweet_json

Unnamed: 0,favorite_count,retweet_count,tweet_id
0,38824,8591,892420643555336193
1,33253,6312,892177421306343426
2,25044,4190,891815181378084864
3,42188,8706,891689557279858688
4,40347,9474,891327558926688256
5,20236,3137,891087950875897856
6,11861,2088,890971913173991426
7,65572,19045,890729181411237888
8,27786,4300,890609185150312448
9,31956,7474,890240255349198849


In [25]:
tweet_json.describe()

Unnamed: 0,favorite_count,retweet_count,tweet_id
count,2345.0,2345.0,2345.0
mean,8070.775267,3026.485714,7.42294e+17
std,12143.921062,5034.152804,6.833642e+16
min,0.0,0.0,6.660209e+17
25%,1404.0,607.0,6.783802e+17
50%,3539.0,1414.0,7.189392e+17
75%,9979.0,3523.0,7.986979e+17
max,143466.0,77385.0,8.924206e+17


##### Tidiness Observations
* Relevant data from **twitter_archive**, **image_predictions** and **tweet_json** should be combined into a single dataframe.
* From **twitter_archive**, melt `doggo`, `floofer`, `pupper`, `puppo` into a single categorical variable `dog_stage`.
* From **image_predictions**, create a new column `dog_breed` in the final dataframe and populate it with the `p(n)` value for which `p(n)_conf` is highest and `p(n)_dog` is True.

##### Quality Observations
* In **twitter_archive** `tweet_id` as well as `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` all need to be strings, as per [Twitter best practices](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).
* Only original tweets are to be used; retweets and replies are to be excluded.
* In **twitter_archive** `timestamp` and `retweeted_status_timestamp` need to be datetime for ease of analysis.
* In **twitter_archive** `source`: keep only the text content of the `<a>` tag.
* In **twitter_archive** `rating_numerator` and `rating_denominator` are to be cleaned and standardised. There are several instances where ratings were not gathered correctly ('24/7' or 'Happy 4/20'), and where there are multiple ratings.
* In **image_predictions** there are only 2075 observations compared to 2356 tweet IDs in the archive, suggesting some tweets are missing images; we may want to get rid of them.
* In **tweet_json** additional information for 11 tweets could not be obtained (2345 records were retrieved for 2356 tweet IDs), probably because they were deleted from the Twitter servers.
* In **image_predictions** dog breeds use inconsistent case and spacing styles; convert all to lower case and replace '_' with regular spaces.
* Ensure all datatypes are appropriate.
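
Three of the column-level observations above (`timestamp`, `source` and the breed names) can be sketched on a small made-up frame. The values are copied from rows shown earlier in this notebook; the `sample` and `clean` names are illustrative assumptions, and this is one possible approach rather than the cleaning code used later.

```python
import pandas as pd

sample = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2015-11-28 02:20:27 +0000'],
    'source': ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
               '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'],
    'p1': ['Welsh_springer_spaniel', 'German_shepherd'],
})

clean = sample.copy()

# Parse the timestamp strings into timezone-aware datetimes
clean['timestamp'] = pd.to_datetime(clean['timestamp'], utc=True)

# Keep only the text between the opening and closing <a> tags
clean['source'] = clean['source'].str.extract(r'>([^<]+)<', expand=False)

# Normalise breed names: underscores to spaces, everything lower case
clean['p1'] = clean['p1'].str.replace('_', ' ', regex=False).str.lower()

print(clean)
```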
***
## Clean

In [26]:
# First and foremost - make copies
twitter_archive_copy = twitter_archive.copy()
image_predictions_copy = image_predictions.copy()
tweet_json_copy = tweet_json.copy()

### Tweet IDs to string format
In order to merge all three datasets, we must first ensure that all tweet IDs are strings, as per [Twitter best practices](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).  
*Note:* we do not need to convert `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, as these will only be used to identify retweets and replies and will then be deleted.
#### Code

In [27]:
twitter_archive_copy.tweet_id = twitter_archive_copy.tweet_id.astype('str')
image_predictions_copy.tweet_id = image_predictions_copy.tweet_id.astype('str')
tweet_json_copy.tweet_id = tweet_json_copy.tweet_id.astype('str')

# Check the outcome
print(twitter_archive_copy.tweet_id.dtype)
print(image_predictions_copy.tweet_id.dtype)
print(tweet_json_copy.tweet_id.dtype)

object
object
object
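
One reason to keep IDs as strings is that tweet IDs are larger than the biggest integer a 64-bit float can represent exactly, so an ID column that ends up as float64 (as `in_reply_to_status_id` did, shown as `6.86034e+17` earlier) no longer holds the exact ID. A small sketch using the first tweet ID from the archive:

```python
tweet_id = 892420643555336193  # first tweet ID shown in the archive

# Simulate what happens when the ID is stored in a float64 column
as_float = float(tweet_id)
round_trip = int(as_float)

print(round_trip == tweet_id)   # the float cannot hold this ID exactly
print(str(tweet_id))            # a string keeps every digit
```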


### Combine all dataframes
All the dataframes should be combined into a single dataframe **dog_ratings**.
#### Code

In [28]:
dog_ratings = pd.merge(twitter_archive_copy, image_predictions_copy, how = 'left', on = ['tweet_id'] )
dog_ratings = pd.merge(dog_ratings, tweet_json_copy, how = 'left', on = ['tweet_id'])
# Check the outcome
dog_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 30 columns):
tweet_id                      2356 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
jpg_url                       2
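
Because both merges are left joins, archive rows without a match keep NaN in the added columns. A minimal sketch of how the `indicator` and `validate` parameters of `pd.merge` can make this visible is shown here on made-up frames; the `archive` and `predictions` names and values are illustrative assumptions.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'rating_numerator': [13, 12, 10]})
predictions = pd.DataFrame({'tweet_id': ['1', '3'],
                            'p1': ['redbone', 'chow']})

# indicator=True adds a _merge column; validate raises if tweet_id is not unique on both sides
merged = pd.merge(archive, predictions, how='left', on='tweet_id',
                  indicator=True, validate='one_to_one')

# Rows present only in the archive have no image prediction
unmatched = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()

print(merged['_merge'].value_counts())
print(unmatched)
```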

### Dog stage
Create a single categorical variable `dog_stage` and melt `doggo`, `floofer`, `pupper`, `puppo`.
#### Code

In [29]:
# List of variables to keep after melt
id_vars_list = ['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source', 'text', 
                'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 
                'rating_numerator', 'rating_denominator', 'name', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 
                'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'favorite_count', 'retweet_count']

# Melt the floofer, doggo, pupper and puppo variables into a new column;
# this creates a long dataframe with four rows per tweet
dog_ratings = pd.melt(dog_ratings, id_vars = id_vars_list, var_name = 'dog_stage_temp', value_name = 'dog_stage')

# Remove the unnecessary column of column names
dog_ratings = dog_ratings.drop('dog_stage_temp', axis = 1)

# Remove duplicated tweet ids and keep the last instance after sorting: 'None' sorts before
# the lower-case stage names, so a stage is kept whenever one exists
# (a tweet with several stages keeps only the alphabetically last one)
dog_ratings = dog_ratings.sort_values('dog_stage').drop_duplicates('tweet_id', keep = 'last')

# Convert 'None' to NaN (assign the result back rather than replacing in place on a column)
dog_ratings['dog_stage'] = dog_ratings['dog_stage'].replace('None', np.nan)

# Assign appropriate data type
dog_ratings.dog_stage = dog_ratings.dog_stage.astype('category')

# Check the outcome
dog_ratings.dog_stage.value_counts(dropna = False)

NaN        1976
pupper     257 
doggo      83  
puppo      30  
floofer    10  
Name: dog_stage, dtype: int64
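
The melt-and-deduplicate step can be easier to follow on a tiny frame. The sketch below uses hypothetical tweet ids and stage values (not taken from the archive) and mirrors the logic of the cell above. One caveat worth noting: because only one row per tweet is kept, a tweet tagged with two stages would retain only the alphabetically last one.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a stage column holds the stage name, otherwise 'None'
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':    ['doggo', 'None', 'None'],
    'floofer':  ['None', 'None', 'None'],
    'pupper':   ['None', 'pupper', 'None'],
    'puppo':    ['None', 'None', 'None'],
})

# Melt the four stage columns into long form: four rows per tweet
melted = pd.melt(toy, id_vars=['tweet_id'],
                 var_name='stage_col', value_name='dog_stage')

# 'None' sorts before the lowercase stage names, so the last row
# per tweet is a real stage whenever the tweet has one
melted = melted.sort_values('dog_stage').drop_duplicates('tweet_id', keep='last')

toy_stage = (melted.drop(columns='stage_col')
                   .sort_values('tweet_id')
                   .reset_index(drop=True))
toy_stage['dog_stage'] = toy_stage['dog_stage'].replace('None', np.nan)
toy_stage['dog_stage'] = toy_stage['dog_stage'].astype('category')
print(toy_stage)
```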

### Dog Breeds variable
Create a new column called `dog_breed` and populate it with the `p(n)` value whose `p(n)_conf` is highest among the predictions where `p(n)_dog` is True. Thanks to Udacity wizardry on the image predictions, the confidence variables are listed in descending order, meaning that `p1_conf` is higher than `p2_conf` or `p3_conf`, so the first prediction flagged as a dog is also the most confident one. This makes the logic of choosing the dog breed from the predictions very simple.
#### Code

In [30]:
# New list of dog breeds
dog_breed_list = []
# A much shorter and more elegant way of iterating through several columns than a nested for loop.
def breed_to_list(df):
    """
    This function checks the values in p(n)_dog;
    if any of them is True, it appends the value of the
    corresponding dog breed prediction to the list.
    If all predictions are set to False, it appends a
    NaN value to the list.
    """
    if df['p1_dog'] == True:
        dog_breed_list.append(df['p1'])
    elif df['p2_dog'] == True:
        dog_breed_list.append(df['p2'])
    elif df['p3_dog'] == True:
        dog_breed_list.append(df['p3'])
    else:
        dog_breed_list.append(np.nan)
    return dog_breed_list

dog_ratings.apply(breed_to_list, axis=1)

# Incorporate the list into the dataframe's new column
dog_ratings['dog_breed'] = dog_breed_list

# Check the outcome
dog_ratings.dog_breed.value_counts()

golden_retriever                  173
Labrador_retriever                113
Pembroke                          96 
Chihuahua                         95 
pug                               65 
toy_poodle                        52 
chow                              51 
Samoyed                           46 
Pomeranian                        42 
cocker_spaniel                    34 
malamute                          34 
French_bulldog                    32 
Chesapeake_Bay_retriever          31 
miniature_pinscher                26 
Cardigan                          23 
Staffordshire_bullterrier         22 
Eskimo_dog                        22 
beagle                            21 
German_shepherd                   21 
Siberian_husky                    20 
Shih-Tzu                          20 
kuvasz                            19 
Rottweiler                        19 
Maltese_dog                       19 
Lakeland_terrier                  19 
Shetland_sheepdog                 19 
Italian_grey
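
As an alternative to appending to a list inside `apply`, the same "first prediction flagged as a dog" rule can be written without a Python-level loop. This is only a sketch on hypothetical prediction values; the frame `preds` and its contents are assumptions for illustration, and only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy predictions (hypothetical values) in the same p1/p2/p3 layout
preds = pd.DataFrame({
    'p1': ['golden_retriever', 'seat_belt', 'web_site'],
    'p1_dog': [True, False, False],
    'p2': ['Labrador_retriever', 'pug', 'envelope'],
    'p2_dog': [True, True, False],
    'p3': ['kuvasz', 'chow', 'menu'],
    'p3_dog': [True, True, False],
})

# Boolean matrix of dog flags; .eq(True) also treats missing flags as False
flags = preds[['p1_dog', 'p2_dog', 'p3_dog']].eq(True).to_numpy()
names = preds[['p1', 'p2', 'p3']].to_numpy(dtype=object)

# argmax returns the position of the first True in each row
first_dog = flags.argmax(axis=1)
picked = names[np.arange(len(preds)), first_dog]

# Rows with no dog prediction at all become NaN
preds['dog_breed'] = pd.Series(picked, index=preds.index).where(flags.any(axis=1))
print(preds['dog_breed'])
```

Because the predictions are ordered by confidence, taking the first flagged column is equivalent to taking the most confident dog prediction.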

### No Retweets
Only original tweets are to be used; retweets and replies are excluded. Retweets and replies are identified by `retweeted_status_id` and `in_reply_to_status_id` respectively, which are populated only if the tweet is a retweet or a reply.
#### Code

In [31]:
# Keep only the observations that have no retweeted_status_id and no in_reply_to_status_id
dog_ratings = dog_ratings[pd.isnull(dog_ratings.retweeted_status_id)]
dog_ratings = dog_ratings[pd.isnull(dog_ratings.in_reply_to_status_id)]

# Remove unnecessary columns
columns_to_drop = ['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp',
                  'in_reply_to_status_id', 'in_reply_to_user_id']
dog_ratings.drop(columns_to_drop, axis = 1, inplace = True)

# Check the outcome
dog_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 2261 to 7236
Data columns (total 23 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null object
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
jpg_url               1971 non-null object
img_num               1971 non-null float64
p1                    1971 non-null object
p1_conf               1971 non-null float64
p1_dog                1971 non-null object
p2                    1971 non-null object
p2_conf               1971 non-null float64
p2_dog                1971 non-null object
p3                    1971 non-null object
p3_conf               1971 non-null float64
p3_dog                1971 non-null object
favorite_count        2097 non-null float64
retweet_count         2097 
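
The filter can also be expressed as a single boolean mask, which makes the "original tweet" condition explicit. A minimal sketch on a hypothetical four-row frame (the ids and status values are made up):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 2 is a retweet, row 3 is a reply
tweets = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'retweeted_status_id': [np.nan, 8.0e17, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 7.0e17, np.nan],
})

# An original tweet has neither a retweeted status nor a reply target
is_original = (tweets['retweeted_status_id'].isna()
               & tweets['in_reply_to_status_id'].isna())

originals = (tweets.loc[is_original]
                   .drop(columns=['retweeted_status_id', 'in_reply_to_status_id']))
print(originals['tweet_id'].tolist())
```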

### Remove tweets without images 
Now that all datasets are combined, we can remove all entries that have no content in the `jpg_url` column. Ideally, the number of remaining entries should match the size of **image_predictions**.
#### Code

In [32]:
# Remove observations where there are NaN values in 'jpg_url' column
dog_ratings = dog_ratings.dropna(subset=['jpg_url'])
# Check the outcome
dog_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1971 entries, 2261 to 7236
Data columns (total 23 columns):
tweet_id              1971 non-null object
timestamp             1971 non-null object
source                1971 non-null object
text                  1971 non-null object
expanded_urls         1971 non-null object
rating_numerator      1971 non-null int64
rating_denominator    1971 non-null int64
name                  1971 non-null object
jpg_url               1971 non-null object
img_num               1971 non-null float64
p1                    1971 non-null object
p1_conf               1971 non-null float64
p1_dog                1971 non-null object
p2                    1971 non-null object
p2_conf               1971 non-null float64
p2_dog                1971 non-null object
p3                    1971 non-null object
p3_conf               1971 non-null float64
p3_dog                1971 non-null object
favorite_count        1971 non-null float64
retweet_count         1971 

### Tweet source
The `source` variable is difficult to read because it is an HTML anchor tag; extract only the text inside the tag.
#### Code

In [33]:
# Check which values to extract
dog_ratings.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1932
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                     28  
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>    11  
Name: source, dtype: int64

In [34]:
# Use regular expressions to extract the anchor text from source and replace it
dog_ratings['source'] = dog_ratings.source.str.extract(r'<a[^>]*>([^<]*)</a>', expand=False)
# Assign appropriate data type
dog_ratings.source = dog_ratings.source.astype('category')
# Check the outcome
dog_ratings.source.value_counts()

Twitter for iPhone    1932
Twitter Web Client    28  
TweetDeck             11  
Name: source, dtype: int64
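
To see what the pattern captures, here is a small sketch that runs it on the three source strings listed above. The second pattern, which pulls out the `href` value instead of the anchor text, is an addition for comparison and is not used in the notebook.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture the text between the opening and closing anchor tags
labels = source.str.extract(r'<a[^>]*>([^<]*)</a>', expand=False)

# For comparison: capture the value of the href attribute
hrefs = source.str.extract(r'href="([^"]*)"', expand=False)

print(labels.tolist())
print(hrefs.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.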

### Invalid rating
There are a number of rating numerators that go into the hundreds (out of 10). Denominators are mostly 10, with a few exceptions where multiple dogs are involved in a tweet. Overall there are only 18 ratings not conforming to the standard.
The following wrangle is done in several steps, as there are a lot of things to tackle.
1. There are 5 instances where the rating was confused with another fraction in the text, such as 7/11, 9/11, 4/20, 24/7 etc. Reading through the tweet text in the Assessment part of this report confirmed that these fractions are not ratings and that the actual rating appears further down the tweet, so these are corrected manually.
2. Remove the observations with tweet_ids *670842764863651840 and 749981277374128128*; these are extremely high ratings which would skew the data. Also remove the observation with tweet_id *810984652412424192*, as there is no rating involved.
3. Come up with a new rating system, perhaps keeping only the rating numerator, since the denominators are all 10 or a multiple of 10.
#### Code

In [35]:
# 1 # 
# Separate instances where the denominator is not equal to 10
print('Total number of instances: ', len(dog_ratings.loc[dog_ratings.rating_denominator!=10,
                              ['tweet_id','text','rating_numerator','rating_denominator']]))
dog_ratings.loc[dog_ratings.rating_denominator!=10,
                              ['tweet_id','text','rating_numerator','rating_denominator']]

Total number of instances:  17


Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator
2335,666287406224695296,This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,1,2
3789,697463031882764288,Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ,44,40
3707,704054845121142784,Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,60,50
3991,684222868335505415,Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55,121,110
3521,722974582966214656,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
3558,716439118184652801,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
3424,740373189193256964,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
3476,731156023742988288,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
3584,713900603437621249,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,99,90
3630,709198395643068416,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",45,50


In [36]:
# 1 # Locate tweet_ids with confused ratings and correct them, 5 in total.
dog_ratings.loc[dog_ratings.tweet_id == '666287406224695296','rating_numerator'] = 9
dog_ratings.loc[dog_ratings.tweet_id == '722974582966214656','rating_numerator'] = 13
dog_ratings.loc[dog_ratings.tweet_id == '716439118184652801','rating_numerator'] = 11
dog_ratings.loc[dog_ratings.tweet_id == '740373189193256964','rating_numerator'] = 14
dog_ratings.loc[dog_ratings.tweet_id == '682962037429899265','rating_numerator'] = 10
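
The manual corrections could also be driven by a lookup table, which keeps the ids and values in one place. The sketch below is illustrative only: it uses a toy frame with three rows from the table above, and it additionally resets the denominator to 10, which is an assumption that goes beyond what the notebook cell does.

```python
import pandas as pd

# Toy frame with three rows copied from the table above
ratings = pd.DataFrame({
    'tweet_id': ['666287406224695296', '722974582966214656', '697463031882764288'],
    'rating_numerator': [1, 4, 44],
    'rating_denominator': [2, 20, 40],
})

# tweet_id -> corrected numerator, as read from the tweet text
corrections = {
    '666287406224695296': 9,
    '722974582966214656': 13,
}

mask = ratings['tweet_id'].isin(list(corrections))
ratings.loc[mask, 'rating_numerator'] = ratings.loc[mask, 'tweet_id'].map(corrections)
# Assumption: the corrected ratings are all out of 10
ratings.loc[mask, 'rating_denominator'] = 10
print(ratings)
```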

In [37]:
# 2 # Remove three observations: two outlier ratings and one tweet without a rating
dog_ratings = dog_ratings[dog_ratings.tweet_id!='810984652412424192']
dog_ratings = dog_ratings[dog_ratings.tweet_id!='670842764863651840']
dog_ratings = dog_ratings[dog_ratings.tweet_id!='749981277374128128']

In [38]:
# Check the outcome
dog_ratings.rating_numerator.value_counts()

12     446
10     418
11     393
13     254
9      150
8      95 
7      51 
14     34 
5      33 
6      32 
3      19 
4      15 
2      9  
1      4  
204    1  
165    1  
26     1  
27     1  
44     1  
45     1  
60     1  
75     1  
80     1  
84     1  
88     1  
99     1  
121    1  
144    1  
0      1  
Name: rating_numerator, dtype: int64
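
The large numerators that remain (44, 60, 121, 204 and so on) come from group ratings with denominators that are multiples of 10. One possible option for the new rating system in step 3, shown here only as a sketch and not as what this notebook does, is to rescale each rating to a 10-point scale. The column name `rating_per_10` is hypothetical; the first six rows are taken from the table above and the last row is a made-up ordinary rating.

```python
import pandas as pd

group_ratings = pd.DataFrame({
    'rating_numerator':   [44, 60, 121, 204, 99, 45, 12],
    'rating_denominator': [40, 50, 110, 170, 90, 50, 10],
})

# Rescale so that every rating is expressed out of 10
group_ratings['rating_per_10'] = (
    group_ratings['rating_numerator'] * 10 / group_ratings['rating_denominator']
).round(1)
print(group_ratings)
```

Keeping the numerator alone, as the notebook does, is simpler but leaves these group ratings looking like outliers in later analysis.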

### Make dog breeds more readable
Make `dog_breed` more uniform by replacing underscores with spaces and converting all values to lower case.

In [39]:
dog_ratings.dog_breed = dog_ratings.dog_breed.str.replace('_',' ')
dog_ratings.dog_breed = dog_ratings.dog_breed.str.lower()

### Final Clean-up and data type corrections
The final clean dataframe **dog_ratings** is to include the following columns:
* `tweet_id` as obj
* `source` as cat
* `timestamp` as datetime
* `rating_numerator` as int
* `rating_denominator` as int
* `jpg_url` as obj
* `dog_breed` as cat
* `dog_stage` as cat
* `favorite_count` as int
* `retweet_count` as int
#### Code

In [40]:
dog_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1968 entries, 2261 to 7236
Data columns (total 23 columns):
tweet_id              1968 non-null object
timestamp             1968 non-null object
source                1968 non-null category
text                  1968 non-null object
expanded_urls         1968 non-null object
rating_numerator      1968 non-null int64
rating_denominator    1968 non-null int64
name                  1968 non-null object
jpg_url               1968 non-null object
img_num               1968 non-null float64
p1                    1968 non-null object
p1_conf               1968 non-null float64
p1_dog                1968 non-null object
p2                    1968 non-null object
p2_conf               1968 non-null float64
p2_dog                1968 non-null object
p3                    1968 non-null object
p3_conf               1968 non-null float64
p3_dog                1968 non-null object
favorite_count        1968 non-null float64
retweet_count         196

In [41]:
# Dropping unnecessary columns
columns_to_drop = ['text','expanded_urls','rating_denominator','name','img_num',
                   'p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf','p3_dog']
dog_ratings.drop(columns_to_drop, axis = 1, inplace = True)
# Check the outcome
dog_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1968 entries, 2261 to 7236
Data columns (total 9 columns):
tweet_id            1968 non-null object
timestamp           1968 non-null object
source              1968 non-null category
rating_numerator    1968 non-null int64
jpg_url             1968 non-null object
favorite_count      1968 non-null float64
retweet_count       1968 non-null float64
dog_stage           303 non-null category
dog_breed           1665 non-null object
dtypes: category(2), float64(2), int64(1), object(4)
memory usage: 127.1+ KB


In [42]:
# Assign shorter column names
dog_ratings = dog_ratings.rename(columns={'dog_breed' : 'breed',
                                          'dog_stage' : 'stage',
                                          'rating_numerator' : 'rating'})
# Fixing data types
dog_ratings.timestamp = pd.to_datetime(dog_ratings.timestamp)
dog_ratings.source = dog_ratings.source.astype('category')
dog_ratings.favorite_count = dog_ratings.favorite_count.astype(int)
dog_ratings.retweet_count = dog_ratings.retweet_count.astype(int)
dog_ratings.breed = dog_ratings.breed.astype('category')
dog_ratings.rating = dog_ratings.rating.astype(int)

# Check the outcome
dog_ratings.dtypes

tweet_id          object        
timestamp         datetime64[ns]
source            category      
rating            int32         
jpg_url           object        
favorite_count    int32         
retweet_count     int32         
stage             category      
breed             category      
dtype: object

## Saving the data

In [43]:
# Final look at the clean dataset
dog_ratings

Unnamed: 0,tweet_id,timestamp,source,rating,jpg_url,favorite_count,retweet_count,stage,breed
2261,667549055577362432,2015-11-20 03:44:31,Twitter Web Client,1,https://pbs.twimg.com/media/CUOcVCwWsAERUKY.jpg,5977,2393,,
2262,667546741521195010,2015-11-20 03:35:20,Twitter Web Client,9,https://pbs.twimg.com/media/CUOaOWXWcAA0_Jy.jpg,342,132,,toy poodle
2263,667544320556335104,2015-11-20 03:25:43,Twitter Web Client,10,https://pbs.twimg.com/media/CUOYBbbWIAAXQGU.jpg,895,549,,pomeranian
2264,667538891197542400,2015-11-20 03:04:08,Twitter Web Client,9,https://pbs.twimg.com/media/CUOTFZOW4AABsfW.jpg,209,70,,yorkshire terrier
2258,667724302356258817,2015-11-20 15:20:54,Twitter Web Client,7,https://pbs.twimg.com/media/CUQ7tv3W4AA3KlI.jpg,503,332,,
2265,667534815156183040,2015-11-20 02:47:56,Twitter Web Client,8,https://pbs.twimg.com/media/CUOPYI5UcAAj_nO.jpg,844,561,,pembroke
2267,667524857454854144,2015-11-20 02:08:22,Twitter Web Client,12,https://pbs.twimg.com/media/CUOGUfJW4AA_eni.jpg,1751,1167,,chesapeake bay retriever
2268,667517642048163840,2015-11-20 01:39:42,Twitter Web Client,8,https://pbs.twimg.com/media/CUN_wiBUkAAakT0.jpg,378,198,,italian greyhound
2269,667509364010450944,2015-11-20 01:06:48,Twitter Web Client,12,https://pbs.twimg.com/media/CUN4Or5UAAAa5K4.jpg,6986,2215,,beagle
2270,667502640335572993,2015-11-20 00:40:05,Twitter Web Client,11,https://pbs.twimg.com/media/CUNyHTMUYAAQVch.jpg,545,227,,labrador retriever


In [44]:
# Save the dataframe to csv
dog_ratings.to_csv('twitter_archive_master.csv', index=False, encoding = 'utf-8')
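
CSV does not store data types, so the conversions made above are lost when the file is read back unless they are requested again. The following sketch shows one way to restore them on load. It uses an in-memory buffer and a two-row toy frame (the `stage` value is hypothetical) instead of the real file.

```python
import io
import pandas as pd

# Toy frame with the same kinds of columns as the clean dataset
clean = pd.DataFrame({
    'tweet_id': ['667549055577362432', '667546741521195010'],
    'timestamp': pd.to_datetime(['2015-11-20 03:44:31', '2015-11-20 03:35:20']),
    'stage': pd.Categorical([None, 'pupper']),
    'rating': [1, 9],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Ask for the intended dtypes explicitly when reading back
restored = pd.read_csv(
    buffer,
    dtype={'tweet_id': str, 'stage': 'category'},
    parse_dates=['timestamp'],
)
print(restored.dtypes)
```

Reading `tweet_id` as a string avoids it being parsed back into an integer.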

## Resources
[Tweepy Documentation](https://media.readthedocs.org/pdf/tweepy/latest/tweepy.pdf)  
[Structure of Tweet JSON](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json)  
[Pandas Melt](https://www.youtube.com/watch?v=oY62o-tBHF4)  
All code is based on course examples provided by Udacity. One helpful suggestion was taken from the Udacity Student Forum community regarding importing the additional Twitter data via the API.
#### Report created and compiled by Alina Bolat for Udacity Data Analytics Nano Degree