# Wrangling and Analyzing Data from We Rate Dogs
## @dog_rates on Twitter

## Steps to Accomplish ##

- Download pre-pulled files from Udacity *DONE*
- Pull the file image-predictions.tsv from Udacity's server using the Requests library *DONE*
- Query Twitter's API for the retweet count and favorite count (at minimum) and add them to the data
    - Using the tweet IDs in the twitter archive file, query and store each tweet's JSON data using Tweepy *DONE*
    - Store the entire set of JSON data in a file called tweet_json.txt *DONE*
    - Each tweet's data should be written to its own line *DONE*
    - Read this data line by line in a pandas DataFrame with tweet ID, retweet count, and favorite count at minimum *DONE*
    - Merge this information into the main dataframe *DONE*
- After gathering all of the data, assess them visually and programmatically and find 8 quality and 2 tidiness issues *DONE*
- Make sure they satisfy the Project Motivation to meet specifications *DONE*
- Clean all 10 of those issues and document here
- Store the clean DataFrame(s) in a CSV file named twitter_archive_master.csv
- Analyze and visualize the data here
    - There must be at least 3 insights and 1 visualization
- Create a 300-600 word written report called 'wrangle_report.pdf' or 'wrangle_report.html'
    - Use this to describe my wrangling efforts
- Create a 250 word minimum written report called 'act_report.pdf' or 'act_report.html'
    - Use this to communicate the insights and display the visualization from the data

For data cleaning purposes, I need to clean:
- at least 8 quality issues
- at least 2 tidiness issues

And finally:
- REMOVE my Twitter API keys and token from the files before posting

## Gathering the Data ##

### Tweepy Notes ###

Tweet data is stored in JSON format by Twitter. Getting tweet JSON data via tweet ID using Tweepy is described well in this StackOverflow answer. Note that setting the tweet_mode parameter to 'extended' in the get_status call, i.e., api.get_status(tweet_id, tweet_mode='extended'), can be useful.

Also, note that the tweets corresponding to a few tweet IDs in the archive may have been deleted. Try-except blocks may come in handy here.

Twitter's API has a rate limit. Rate limiting is used to control the rate of traffic sent or received by a server. As per Twitter's rate limiting info page:

> Rate limits are divided into 15 minute intervals

To query all of the tweet IDs in the WeRateDogs Twitter archive, 20-30 minutes of running time can be expected. Printing out each tweet ID after it is queried and using a code timer are both helpful as sanity checks. Setting the wait_on_rate_limit and wait_on_rate_limit_notify parameters to True in the tweepy.API class is useful as well (wait_on_rate_limit_notify exists only in Tweepy 3.x).
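
The notes above boil down to a loop that tolerates unavailable tweets and writes one JSON object per line. Below is a minimal sketch of that pattern, not the code used in this notebook. `save_tweets` is a hypothetical helper name, and `fetch` is a hypothetical stand-in for a call such as `api.get_status(tweet_id, tweet_mode='extended')._json`; rate-limit handling is assumed to live inside whatever client `fetch` wraps.

```python
import json


def save_tweets(tweet_ids, fetch, out_path):
    """Write one JSON object per line; return the IDs that could not be fetched.

    `fetch` is a hypothetical callable that takes a tweet ID and returns a dict,
    raising an exception when the tweet is unavailable (for example, deleted).
    """
    failed = []
    with open(out_path, mode='w', encoding='utf-8') as out_file:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch(tweet_id)
            except Exception as error:
                # Keep the reason so deleted tweets can be told apart from other errors
                failed.append((tweet_id, str(error)))
                continue
            out_file.write(json.dumps(tweet) + '\n')
    return failed
```

Returning the failures instead of only printing them makes it easy to check after the run how many IDs are missing and why.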

In [2]:
import pandas as pd
import numpy as np
import requests
import os
import tweepy
import json

image_predictions.tsv URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [3]:
# Establishing the first table: archive

archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Downloading image-predictions.tsv from Udacity's server

folder_name = "C:/Users/chris/OneDrive/Documents/Coding/Udacity_Data_Analyst/Wrangle_Data_Project"
file_name = 'image-predictions.tsv'
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
with open(os.path.join(folder_name, file_name), mode='wb') as file:
    file.write(response.content)

In [4]:
# Establishing the second table: image_predictions

image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')

In [5]:
# Setting up the Twitter API

consumer_key = 'REMOVED'
consumer_secret = 'REMOVED'
access_token = 'REMOVED'
access_secret = 'REMOVED'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

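# wait_on_rate_limit makes Tweepy pause until the rate-limit window resets instead of raising an error
api = tweepy.API(auth, wait_on_rate_limit=True)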

In [6]:
# This variable is just to generate a percentage complete while the below loop runs since it takes a while.
number_of_tweets = archive.tweet_id.count()

tweet_json_file = 'tweet_json.txt'
current_record = 0

# Querying Tweepy to pull all available information for the tweets listed in the archive Udacity provided
for tweet_id in archive.tweet_id:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweet_data = json.dumps(tweet._json)
        # Appending one JSON object per line so the file can be read back line by line
        with open(os.path.join(folder_name, tweet_json_file), mode='a') as file:
            file.write(tweet_data + '\n')
        print(str(tweet_id) + ' success')
    except Exception as error:
        # Deleted tweets and rate-limit errors both land here; printing the reason tells them apart
        print(str(tweet_id) + ' fail: ' + str(error))
    current_record += 1
    print(str((current_record / number_of_tweets) * 100) + '% complete')


In [6]:
# Reading the txt file created above line by line into a list and just pulling out
# the id, retweet_count, and favorite_count

tweet_counts = []

with open('tweet_json.txt', 'r') as file:
    for line in file:
        this_tweet = json.loads(line)
        this_tweet_data = [this_tweet['id'], this_tweet['retweet_count'], this_tweet['favorite_count']]
        tweet_counts.append(this_tweet_data)
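
As a self-contained illustration of the line-by-line read above, here is a small sketch that runs on made-up records rather than the real file. `read_counts` is a hypothetical helper, and the sample values are invented; it also skips blank lines, which is an added assumption about how the file might end.

```python
import io
import json


def read_counts(lines):
    """Return [tweet_id, retweet_count, favorite_count] for each non-blank JSON line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        tweet = json.loads(line)
        rows.append([tweet['id'], tweet['retweet_count'], tweet['favorite_count']])
    return rows


# Two made-up records in the same one-object-per-line layout as tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 0}\n'
    '\n'
)
rows = read_counts(sample)
print(rows)  # [[1, 10, 25], [2, 3, 0]]
```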

In [7]:
tweet_counts_df = pd.DataFrame(tweet_counts, columns=['tweet_id', 'retweet_count', 'favorite_count'])
tweet_counts_df

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7462,35353
1,892177421306343426,5538,30593
2,891815181378084864,3665,23022
3,891689557279858688,7638,38650
4,891327558926688256,8230,36918
...,...,...,...
872,759159934323924993,1129,0
873,759099523532779520,4061,14249
874,759047813560868866,1964,6375
875,758854675097526272,880,3490


In [8]:
tweet_data = archive.merge(tweet_counts_df, how='left', on='tweet_id')
tweet_data

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,7462.0,35353.0
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,5538.0,30593.0
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,3665.0,23022.0
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,7638.0,38650.0
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,8230.0,36918.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,,,


## Assessing the Data ##

In [36]:
tweet_data

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,7462.0,35353.0
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,5538.0,30593.0
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,3665.0,23022.0
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,7638.0,38650.0
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,8230.0,36918.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,,,


In [12]:
tweet_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [13]:
tweet_data[['rating_numerator', 'retweet_count', 'favorite_count']].describe()

Unnamed: 0,rating_numerator,retweet_count,favorite_count
count,2356.0,877.0,877.0
mean,13.126486,4603.010262,13557.190422
std,45.876648,5185.956061,14243.386598
min,0.0,1.0,0.0
25%,10.0,1918.0,5115.0
50%,11.0,3153.0,10345.0
75%,12.0,5406.0,19036.0
max,1776.0,54951.0,129978.0


In [14]:
tweet_data['rating_denominator'].describe()

count    2356.000000
mean       10.455433
std         6.745237
min         0.000000
25%        10.000000
50%        10.000000
75%        10.000000
max       170.000000
Name: rating_denominator, dtype: float64

In [15]:
tweet_data[tweet_data['rating_denominator'] > 10][['tweet_id', 'rating_denominator']]

Unnamed: 0,tweet_id,rating_denominator
342,832088576586297345,15
433,820690176645140481,70
784,775096608509886464,11
902,758467244762497024,150
1068,740373189193256964,11
1120,731156023742988288,170
1165,722974582966214656,20
1202,716439118184652801,50
1228,713900603437621249,90
1254,710658690886586372,80


In [37]:
tweet_data[tweet_data['rating_denominator'] > 10][['tweet_id', 'rating_denominator']].count()

tweet_id              20
rating_denominator    20
dtype: int64
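
Large denominators can come from group ratings, but some may come from the extraction picking up the first fraction in the text, such as a date. The following is a minimal sketch of a more careful extraction, assuming ratings are written as `number/number` and that a genuine rating usually has a denominator of 10. `extract_rating` is a hypothetical helper and the sample sentences are made up.

```python
import re

# Matches fractions such as 13/10 or 9.75/10 that are not part of a longer number
FRACTION = re.compile(r'(?<![\d.])(\d+(?:\.\d+)?)/(\d+)(?!\d)')


def extract_rating(text):
    """Return (numerator, denominator), preferring a fraction with denominator 10."""
    fractions = [(float(n), int(d)) for n, d in FRACTION.findall(text)]
    if not fractions:
        return None
    for numerator, denominator in fractions:
        if denominator == 10:
            return numerator, denominator
    return fractions[0]  # fall back to the first fraction found


print(extract_rating('Posted on 4/20, but the dog earns 13/10'))  # (13.0, 10)
print(extract_rating('Nine floofs in one basket, 99/90 overall'))  # (99.0, 90)
print(extract_rating('No rating here'))  # None
```

Group ratings such as 99/90 still pass through unchanged, so they would need a separate decision (for example, scaling to a denominator of 10 or dropping them).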

In [88]:
tweet_data.name.sample(20)

1676        None
2200    Winifred
92           Jed
474        Comet
197         Arya
487        Oshie
1616       Alfie
731        Combo
2217      Gòrdón
2343        None
422     Hercules
1360        None
2067        None
2005      Pippin
794         Fizz
1446        None
2342        None
317         Tobi
1445        None
1657      Brandy
Name: name, dtype: object

In [93]:
tweet_data.groupby('name').count()

name,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,doggo,floofer,pupper,puppo,retweet_count,favorite_count
Abby,2,0,0,2,2,2,0,0,0,2,2,2,2,2,2,2,0,0
Ace,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Acro,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Adele,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Aiden,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
such,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1
the,8,0,0,8,8,8,0,0,0,8,8,8,8,8,8,8,0,0
this,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
unacceptable,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0


In [94]:
tweet_data.groupby('name').count().sort_values(by='tweet_id')

name,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,doggo,floofer,pupper,puppo,retweet_count,favorite_count
Kingsley,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Julius,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Sora,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Kaia,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
Kaiya,1,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Lucy,11,0,0,11,11,11,0,0,0,11,11,11,11,11,11,11,3,3
Oliver,11,0,0,11,11,11,1,1,1,11,11,11,11,11,11,11,4,4
Charlie,12,0,0,12,12,12,1,1,1,12,12,12,12,12,12,12,7,7
a,55,0,0,55,55,55,0,0,0,55,55,55,55,55,55,55,3,3
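
The non-name entries visible in the two listings above ('a', 'such', 'the', 'this', 'unacceptable') all start with a lowercase letter, while the real names are capitalized. Here is a small sketch that uses this as a heuristic, assuming the pattern holds across the whole column. `is_probable_name` is a hypothetical helper, and the string 'None' is treated as the archive's placeholder for a missing name.

```python
def is_probable_name(value):
    """Heuristic: real names are capitalized; the string 'None' marks a missing name."""
    return value != 'None' and value[:1].isupper()


names = ['Charlie', 'a', 'the', 'None', 'Gòrdón', 'unacceptable']
flagged = [name for name in names if not is_probable_name(name)]
print(flagged)  # ['a', 'the', 'None', 'unacceptable']
```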


In [70]:
# Running a loop to check how many tweet texts contain mention of the 
# dog stages (pupper, puppo, doggo, floofer) as compared to the information in the columns

def dog_counts(string):
    appearances = 0
    for entry in tweet_data.text:
        if string in entry:
            appearances += 1
    
    number_in_archive = tweet_data.groupby(string)['text'].count()[string]
    print("Difference in text and archive of the appearance of " + string + ":")
    print(number_in_archive - appearances)

In [71]:
dog_counts('pupper')

Difference in text and archive of the appearance of pupper:
-15


In [72]:
dog_counts('doggo')

Difference in text and archive of the appearance of doggo:
-1


In [73]:
dog_counts('floofer')

Difference in text and archive of the appearance of floofer:
6


In [74]:
dog_counts('puppo')

Difference in text and archive of the appearance of puppo:
-7
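
The net differences above can hide offsetting errors (a missed mention and a spurious flag cancel each other out), and the `string in entry` test is case-sensitive. Below is a sketch that lists the individual disagreeing rows instead, under the assumption that each stage column holds the stage name when flagged and the string 'None' otherwise. `stage_mismatches` is a hypothetical helper and the rows are made up.

```python
import re


def stage_mismatches(rows, stage):
    """Return rows where the text mention of `stage` disagrees with the stage column.

    Each row is a dict with a 'text' key and a key named after the stage.
    """
    pattern = re.compile(re.escape(stage), flags=re.IGNORECASE)
    mismatches = []
    for row in rows:
        in_text = bool(pattern.search(row['text']))
        in_column = row[stage] == stage
        if in_text != in_column:
            mismatches.append(row)
    return mismatches


# Made-up rows for illustration
rows = [
    {'text': 'Here is a pupper.', 'pupper': 'pupper'},
    {'text': 'Pupper alert!', 'pupper': 'None'},
    {'text': 'Just a good dog.', 'pupper': 'None'},
]
print([row['text'] for row in stage_mismatches(rows, 'pupper')])  # ['Pupper alert!']
```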


In [16]:
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [17]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [80]:
image_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


### Dataset Issues ###

- Quality Issues
    1. There are 181 retweets
    2. Only 877 tweets successfully pulled retweet and favorite count data
    3. There are 20 suspect ratings (denominator greater than 10)
        - some grabbed a date or another number instead of the rating
        - many are ratings based on multiple dogs in the same post
        - Individual issues:
            - id 832088576586297345 is not an actual rating; it is a note about the account starting
            - id 740373189193256964 grabbed 9/11 in the text instead of 14/10
            - id 722974582966214656 grabbed 4/20 instead of 13/10
            - id 682962037429899265 should be 10/10
            - id 682808988178739200 is a retweet and does not include a dog rating
            - id 749981277374128128 is a patriotic pupper with a rating of 1776
    4. Some of the names are incorrect
        - 55 records list 'a'
        - some records list random words
    5. The appearance of the dog stages is not accurately recorded in those columns
        - pupper is off by 15 records
        - doggo off by 1 record
        - floofer off by 6 records
        - puppo off by 7 records
    6. 'retweet_count' and 'favorite_count' are stored as floats (the left merge filled tweets without API data with NaN, which forces a float dtype)
    7. Some expanded URLs are missing
    8. Timestamp is saved as a string


- Tidiness Issues
    1. Should be one dataset
        - want to be able to compare dog breeds to favorite and retweet counts
    2. The retweet information columns are unnecessary since none should be retweets
    3. Dog stage is a single variable recorded in four columns
        - data should be categorical
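
For the third tidiness issue, here is a minimal sketch of collapsing the four stage columns into a single value. It assumes each column holds either its own stage name or the string 'None', and that a tweet can carry more than one stage. `combine_stages` and `STAGES` are hypothetical names.

```python
STAGES = ('doggo', 'floofer', 'pupper', 'puppo')


def combine_stages(row):
    """Collapse the four stage columns into one value, or None when no stage is set."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ', '.join(found) if found else None


print(combine_stages({'doggo': 'None', 'floofer': 'None', 'pupper': 'pupper', 'puppo': 'None'}))  # pupper
print(combine_stages({'doggo': 'doggo', 'floofer': 'None', 'pupper': 'pupper', 'puppo': 'None'}))  # doggo, pupper
print(combine_stages({'doggo': 'None', 'floofer': 'None', 'pupper': 'None', 'puppo': 'None'}))  # None
```

With pandas, the same function could plausibly be applied row by row, for example `tweet_data[list(STAGES)].apply(combine_stages, axis=1)`, and the result cast with `astype('category')`.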

#### Sites I used: ####

Converting JSON dictionary object to string: https://www.geeksforgeeks.org/python-convert-dictionary-object-into-string/

Reading text file line by line: https://tutorial.eyehunts.com/python/python-read-file-line-by-line-readlines/

Converting JSON strings into objects: https://www.freecodecamp.org/news/python-read-json-file-how-to-load-json-from-a-file-and-parse-dumps/