# Gather

- Twitter Archive Enhanced downloaded from Resources, saved as twitter-archive-enhanced.csv

- Image Predictions downloaded programmatically from Udacity using the Requests library, saved as image-predictions.tsv

- JSON data for every tweet in the archive downloaded programmatically using the Tweepy library and Twitter API, saved as tweet_json.txt

# Assess

- Visually

- Programmatically

- Detect and document at least eight quality issues and at least two tidiness issues (these must satisfy the Project Motivation Key Points)
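
A programmatic assessment could be sketched roughly as follows. The toy DataFrame, its column names, and the `assess` helper are all assumptions for illustration; they stand in for the real archive, which would be loaded from twitter-archive-enhanced.csv.

```python
import pandas as pd

# Toy stand-in for the archive; these column names are assumed, not taken from the file.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Charlie", "a", "a", None],
    "rating_denominator": [10, 10, 10, 0],
})


def assess(df):
    """Hypothetical helper: collect a few basic quality checks in one dict."""
    return {
        "n_rows": len(df),
        # Count of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows whose tweet_id repeats an earlier row
        "duplicate_ids": int(df["tweet_id"].duplicated().sum()),
        # Column types, to spot IDs or dates stored with the wrong dtype
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


report = assess(archive)
print(report)
```

Checks like these complement the visual pass: they surface missing values, duplicates, and suspicious dtypes that are easy to miss when scrolling through rows.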

## Key Points:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
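
The first and third key points can be illustrated with a minimal sketch, assuming the archive marks retweets with a non-null `retweeted_status_id` column and that both tables share a `tweet_id` column. The toy data and column names below are assumptions, not values from the real files.

```python
import pandas as pd

# Toy stand-ins; the column names are assumed for illustration.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],  # non-null marks a retweet
})
images = pd.DataFrame({
    "tweet_id": [1, 2],
    "jpg_url": ["a.jpg", "b.jpg"],
})

# Keep only original tweets (no retweets)
originals = archive[archive["retweeted_status_id"].isna()]

# An inner merge keeps only the tweets that also have an image prediction
master = originals.merge(images, on="tweet_id", how="inner")
print(master)
```

In this toy data, tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image, so only tweet 1 remains.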

# Clean

- Clean the data and place it in a pandas DataFrame

# Storing, Analyzing, and Visualizing Data

- Store the clean DataFrame(s) in CSV file(s), with the main file named twitter_archive_master.csv

- Optionally, store the data in a SQLite database (submit if it exists)

- Analyze and Visualize data in the wrangle_act notebook
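
The storage step might look something like the sketch below. It writes to a temporary directory so it can be run safely; the toy DataFrame, the database file name, and the `tweets` table name are assumptions for illustration.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master DataFrame
master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})

# Temporary directory used here only to keep the sketch side-effect free
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # assumed file name

# Store as CSV without the pandas index column
master.to_csv(csv_path, index=False)

# Optionally store the same data in a SQLite database
conn = sqlite3.connect(db_path)
master.to_sql("tweets", conn, if_exists="replace", index=False)
restored = pd.read_sql("SELECT * FROM tweets", conn)
conn.close()

print(restored)
```

Reading the table back is a quick sanity check that the round trip preserved the rows.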

In [10]:
'''Import required libraries'''

import json
import os
import pandas as pd
import requests
import seaborn as sb
import tweepy

In [None]:
'''Use the Requests library to download the image-predictions TSV file from Udacity'''

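url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed instead of saving an error page
print(r.status_code)

# Write the raw bytes so the file is saved exactly as served
with open('image-predictions.tsv', 'wb') as outfile:
    outfile.write(r.content)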

In [None]:
'''Use the Tweepy library and Twitter API to download extended data for the tweets in the archive'''

CONSUMER_KEY = os.environ.get('CONSUMER_KEY')
CONSUMER_SECRET = os.environ.get('CONSUMER_SECRET')
ACCESS_TOKEN = os.environ.get('ACCESS_TOKEN')
ACCESS_SECRET = os.environ.get('ACCESS_SECRET')

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

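# Note: wait_on_rate_limit_notify and tweepy.TweepError (used below) belong to Tweepy 3.x;
# Tweepy 4 removed the former and renamed the latter to tweepy.errors.TweepyException.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)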

tweet_df = pd.read_csv('twitter-archive-enhanced.csv')
tweets_ids = list(tweet_df.tweet_id)

def get_tweets(tweet_ids):
    '''Fetch extended data for each tweet ID and return a dict mapping tweet ID to that tweet's JSON'''

    nonexistent = list()  # IDs for which the API raised an error
    tweets = dict()
    for tweet_id in tweet_ids:
        print(tweet_id)  # progress indicator
        try:
            twt = api.get_status(tweet_id, tweet_mode='extended')
            tweets[tweet_id] = twt._json
        except tweepy.TweepError:
            nonexistent.append(tweet_id)
    print(nonexistent)
    return tweets

t = get_tweets(tweets_ids)

with open('tweet_json.txt', 'w') as outfile:
    json.dump(t, outfile)
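
Because the file stores a single JSON object keyed by tweet ID, it can later be read back and flattened into a DataFrame. The sketch below uses an in-memory string as a stand-in for the contents of tweet_json.txt; the choice of `retweet_count` and `favorite_count` as the fields to keep is an assumption about what the analysis needs.

```python
import json

import pandas as pd

# Toy stand-in for the text of tweet_json.txt (JSON object keys are always strings)
raw = json.dumps({
    "1": {"id": 1, "retweet_count": 5, "favorite_count": 20},
    "2": {"id": 2, "retweet_count": 0, "favorite_count": 3},
})

tweets = json.loads(raw)

# Keep one row per tweet with only the fields of interest
rows = [
    {
        "tweet_id": int(tid),  # convert the string key back to an integer ID
        "retweet_count": t["retweet_count"],
        "favorite_count": t["favorite_count"],
    }
    for tid, t in tweets.items()
]
counts = pd.DataFrame(rows)
print(counts)
```

With the real file, `raw` would instead come from opening tweet_json.txt and reading its contents.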

# ASSESSMENT

- Some tweets in the archive are actually retweets