# Gather

- Twitter Archive Enhanced downloaded from Resources, saved as twitter-archive-enhanced.csv

- Image Predictions downloaded programmatically from Udacity using the Requests library, saved as image-predictions.tsv

- JSON data for every tweet in the archive downloaded programmatically using the Tweepy library and Twitter API, saved as tweet_json.txt

# Assess

- Visually

- Programmatically

- Detect and document at least eight quality issues and at least two tidiness issues (these must satisfy the Project Motivation key points below)

## Key Points:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

# Clean

- Clean the data and store the result in a pandas DataFrame
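
The key points above ask for the individual pieces of data to be merged according to the rules of tidy data. A minimal sketch of one way to do that, assuming all three tables share a tweet_id column; the tiny frames and the jpg_url and retweet_count columns are illustrative stand-ins, not values taken from the real files:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the three gathered tables.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 10, 13]})
images = pd.DataFrame({"tweet_id": [1, 3], "jpg_url": ["a.jpg", "c.jpg"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 7, 9]})

# Inner joins keep only tweets present in every table,
# so tweets without an image prediction drop out as a side effect.
master = (
    archive
    .merge(images, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)

print(master)
```

An inner join is a design choice here: it matches the requirement to keep only ratings that have images, at the cost of silently discarding unmatched rows.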

# Storing, Analyzing, and Visualizing Data

- Store the clean dataframes in CSV file(s), with the main file named twitter_archive_master.csv

- Optionally, store the data in a SQLite database (submit if it exists)

- Analyze and Visualize data in the wrangle_act notebook
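
The storage step might look like the sketch below. It writes to a temporary directory so it can be run safely; the database file name, the table name master, and the two-row frame are assumptions made for illustration only.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

# Hypothetical cleaned frame standing in for the real master table.
master = pd.DataFrame({
    "tweet_id": [840761248237133825, 673317986296586240],
    "rating_numerator": [12, 10],
})

out_dir = tempfile.mkdtemp()

# Main deliverable: the CSV file, written without the pandas index.
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
master.to_csv(csv_path, index=False)

# Optional deliverable: the same table in a SQLite database.
db_path = os.path.join(out_dir, "twitter_archive_master.db")
with closing(sqlite3.connect(db_path)) as conn:
    master.to_sql("master", conn, index=False, if_exists="replace")
    stored = pd.read_sql("SELECT * FROM master", conn)

print(stored)
```

Passing index=False keeps the automatic row index out of both outputs, so reloading the file does not produce an extra unnamed column.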

In [4]:
'''Import required libraries'''

import json
import os
import pandas as pd
import requests
import seaborn as sb
import tweepy

In [5]:
'''Use the Requests library to download the image-predictions TSV file from Udacity. Uncomment to run'''

# r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
# print(r.status_code)

# Write the raw bytes so the file is saved exactly as served
# with open('image-predictions.tsv', 'wb') as outfile:
#     outfile.write(r.content)


In [6]:
'''Use the Tweepy library and Twitter API to download extended data for the tweets in the archive. Uncomment to run'''

# CONSUMER_KEY = os.environ.get('CONSUMER_KEY')
# CONSUMER_SECRET = os.environ.get('CONSUMER_SECRET')
# ACCESS_TOKEN = os.environ.get('ACCESS_TOKEN')
# ACCESS_SECRET = os.environ.get('ACCESS_SECRET')

# auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
# auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# tweet_df = pd.read_csv('twitter-archive-enhanced.csv')
# tweets_ids = list(tweet_df.tweet_id)

# def get_tweets(tweet_ids):
#     '''Function to take tweet IDs in list form and create a dictionary to pass as JSON to a .txt file'''

#     nonexistent = list()
#     tweets = dict()
#     for tweet_id in tweet_ids:
#         print(tweet_id)
#         try:
#             twt = api.get_status(tweet_id, tweet_mode='extended')
#             tweets[tweet_id] = twt._json
#         except tweepy.TweepError:
#             nonexistent.append(tweet_id)
#     print(nonexistent)
#     return tweets

# t = get_tweets(tweets_ids)

# with open('tweet_json.txt', 'w') as outfile:
#     json.dump(t, outfile)


In [7]:
df = pd.read_csv('twitter-archive-enhanced.csv')
df_image = pd.read_csv('image-predictions.tsv', sep='\t')
df_tweet_info = pd.read_json('tweet_json.txt')
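
Because the download cell stores one dictionary keyed by tweet ID, pd.read_json with its default orientation would likely treat each tweet ID as a column rather than a row. A hedged alternative is to load the file with the json module and build one row per tweet. The sketch below uses a tiny stand-in file in a temporary directory; the counts are made up, and it assumes retweet_count and favorite_count are the fields of interest.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up stand-in for tweet_json.txt: {tweet_id: tweet JSON}.
# json.dump writes dictionary keys as strings, so the IDs are strings here.
sample = {
    "840761248237133825": {"retweet_count": 3, "favorite_count": 10},
    "673317986296586240": {"retweet_count": 1, "favorite_count": 4},
}
path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
with open(path, "w") as outfile:
    json.dump(sample, outfile)

with open(path) as infile:
    raw = json.load(infile)

# One row per tweet, keeping only the fields needed for analysis.
rows = [
    {
        "tweet_id": int(tweet_id),
        "retweet_count": tweet.get("retweet_count"),
        "favorite_count": tweet.get("favorite_count"),
    }
    for tweet_id, tweet in raw.items()
]
tweet_info = pd.DataFrame(rows)

print(tweet_info)
```

Converting the key back to int keeps tweet_id the same type as in the archive table, which matters for a later merge.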

# Twitter Archive Enhanced

In [12]:
df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
272,840761248237133825,,,2017-03-12 03:07:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Say hello to Maddie and Gunner....,8.406323e+17,4196984000.0,2017-03-11 18:35:42 +0000,"https://www.gofundme.com/3hgsuu0,https://twitt...",12,10,Maddie,,,,
1969,673317986296586240,,,2015-12-06 01:48:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Take a moment and appreciate how these two dog...,,,,https://twitter.com/dog_rates/status/673317986...,10,10,,,,,
1619,684959798585110529,,,2016-01-07 04:48:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jerry. He's a neat dog. No legs (tragi...,,,,https://twitter.com/dog_rates/status/684959798...,5,10,Jerry,,,,
1620,684940049151070208,,,2016-01-07 03:30:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oreo. She's a photographer and a model...,,,,https://twitter.com/dog_rates/status/684940049...,12,10,Oreo,,,,
1511,691416866452082688,,,2016-01-25 00:26:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I present to you... Dog Jesus. 13/10 (he could...,,,,https://twitter.com/dog_rates/status/691416866...,13,10,,,,,


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [17]:
df.retweeted_status_id.value_counts()

6.816941e+17    1
8.688804e+17    1
8.071068e+17    1
8.099208e+17    1
7.932865e+17    1
               ..
7.902771e+17    1
6.671522e+17    1
7.638376e+17    1
8.083449e+17    1
6.675487e+17    1
Name: retweeted_status_id, Length: 181, dtype: int64

In [23]:
df.name.duplicated().sum()

1399

# ASSESSMENT

- Some tweets in the archive are retweets rather than original ratings: retweeted_status_id is non-null for 181 of the 2356 rows
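
One way to act on this finding during cleaning is to keep only the rows where retweeted_status_id is missing and then drop the retweet columns, which carry no information afterwards. This is a sketch on a three-row stand-in frame built from the sample output above, not the notebook's actual cleaning code:

```python
import numpy as np
import pandas as pd

# Three-row stand-in for the archive; the first row is a retweet.
archive = pd.DataFrame({
    "tweet_id": [840761248237133825, 673317986296586240, 684959798585110529],
    "retweeted_status_id": [8.406323e+17, np.nan, np.nan],
    "retweeted_status_user_id": [4196984000.0, np.nan, np.nan],
    "retweeted_status_timestamp": ["2017-03-11 18:35:42 +0000", None, None],
    "rating_numerator": [12, 10, 5],
})

# Original tweets are the rows with no retweeted_status_id.
originals = archive[archive["retweeted_status_id"].isna()].copy()

# The retweet columns are now entirely empty, so remove them.
retweet_cols = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]
originals = originals.drop(columns=retweet_cols)

print(originals)
```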