# Table of Contents

* [Introduction](#introduction)
* [Data Wrangling](#data-wrangling)
    * [Gathering](#gathering)
    * [Assessing](#assessing)
    * [Cleaning](#cleaning)
* [Storing](#storing)
* [Exploratory Data Analysis](#exploratory-data-analysis)
* [Conclusion](#conclusion)
* [References](#references)

## Introduction

Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, then cleaned it. 

### Software needed:
> The following packages (libraries) are required. The third-party ones can be installed via conda or pip; `json` ships with the Python standard library.

* pandas
* NumPy
* matplotlib
* seaborn
* requests
* tweepy
* json

> A text editor, like VS Code or Atom.
>
> A terminal application (Terminal on Mac and Linux or Cygwin on Windows).

### Dataset

> The dataset that I wrangled (analyzed and visualized) was the tweet archive of Twitter user @dog_rates, also known as WeRateDogs.  

> These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

> This archive contains basic tweet data ('tweet ID', 'timestamp', 'text', etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

> My goal was to wrangle this Twitter data in order to create interesting and trustworthy analyses and visualizations. The archive alone was not enough, however: I had to gather additional data ("retweet count" and "favorite count") to support the analysis and visualizations.

### Let's get started!
> Before building the project, we set up the import statements for all of the packages we plan to use.

In [None]:
# import library for data manipulation and analysis
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sn
# import library for making HTTP requests
import requests
import json
# import os(operating system) library to store file
import os

# 'magic word' so that your visualizations are plotted
%matplotlib inline
print("Set up complete")

## Gathering the Data
Data was gathered from three data sources.
1. Download a CSV file manually and load it into a DataFrame.
2. Download the tweet image predictions programmatically and load them into a DataFrame.
3. Query the Twitter API using Tweepy.

### 1. Download and load csv file data manually into DataFrame.
> The WeRateDogs Twitter archive was provided by Udacity. The file `twitter-archive-enhanced.csv` was downloaded manually and then loaded into the pandas DataFrame `arch_df`.

In [2]:
# load in the twitter archived data for WeRateDogs into dataframe
arch_df = pd.read_csv('twitter-archive-enhanced.csv')
arch_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
# raw dataset summary showing the non-null count (and so the missing values) in each column
arch_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

### 2. Download and load tweet image predictions programmatically into DataFrame.
> The file `image-predictions.tsv` is hosted on Udacity's servers and was downloaded programmatically using the `requests` library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [4]:
# `requests` module allows you to send HTTP requests
import requests

# url of image prediction tsv file
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# send an HTTP GET request to the URL and save the server's response in `r`
r = requests.get(url)
# stop early if the download failed (e.g. a 404 or 500 status)
r.raise_for_status()
# the file name is the last part of the URL: image-predictions.tsv
with open(url.split('/')[-1], mode='wb') as file:
    file.write(r.content)

# import data from the tsv file into the pandas DataFrame `img_df`
img_df = pd.read_csv('image-predictions.tsv', sep='\t')
# display the first five rows
img_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [5]:
img_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


Output: a table full of image predictions alongside each tweet ID, image URL, and the image number.

#### Column Descriptions for the tweet image prediction table
* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921

* p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
* p1_conf is how confident the algorithm is in its #1 prediction → 95%
* p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
* p2 is the algorithm's second most likely prediction → Labrador retriever
* p2_conf is how confident the algorithm is in its #2 prediction → 1%
* p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
etc.
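
As a rough illustration of how these columns can be used together, the sketch below picks, for each tweet, the highest-ranked prediction that is actually a dog breed. It assumes `p1` to `p3` are ordered from most to least confident, as the descriptions above suggest. The helper name `best_dog_prediction` and the `breed` / `breed_conf` columns are hypothetical, and the three rows are copied from the `head()` and `tail()` outputs shown earlier.

```python
import pandas as pd

# One row from img_df.head() and two rows from img_df.tail(), copied from above.
sample = pd.DataFrame({
    "tweet_id": [666020888022790149, 891689557279858688, 892420643555336193],
    "p1": ["Welsh_springer_spaniel", "paper_towel", "orange"],
    "p1_conf": [0.465074, 0.170278, 0.097049],
    "p1_dog": [True, False, False],
    "p2": ["collie", "Labrador_retriever", "bagel"],
    "p2_conf": [0.156665, 0.168086, 0.085851],
    "p2_dog": [True, True, False],
    "p3": ["Shetland_sheepdog", "spatula", "banana"],
    "p3_conf": [0.061428, 0.040836, 0.07611],
    "p3_dog": [True, False, False],
})


def best_dog_prediction(row):
    """Return (breed, confidence) for the highest-ranked prediction that is a dog."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    # None of the three predictions is a dog breed.
    return None, None


results = [best_dog_prediction(row) for _, row in sample.iterrows()]
sample["breed"] = [breed for breed, _ in results]
sample["breed_conf"] = [conf for _, conf in results]
print(sample[["tweet_id", "breed", "breed_conf"]])
```

Tweets where none of the three predictions is a dog (the last row here) end up with a missing breed, which makes them easy to filter out later.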


### 3. Querying the Twitter API using Tweepy
* Query Twitter API for each tweet's JSON data using the tweet IDs in the WeRateDogs Twitter archive.
* Query Twitter API for each tweet's "retweet count" and "favorite ("like") count" at minimum, and any additional data.
* Store each tweet's entire set of JSON data in a file called `tweet_json.txt`.
* Each tweet's JSON data should be written to its own line.
* Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. 
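
The write-then-read steps above can be sketched as follows. This is a minimal example under stated assumptions, not the notebook's exact code: it assumes each tweet is already available as a plain dictionary, the counts in `tweets` are made up for illustration, and a temporary directory is used so the sketch does not overwrite a real `tweet_json.txt`.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the dictionaries returned by the API;
# the IDs come from the archive above, the counts are invented.
tweets = [
    {"id": 892420643555336193, "retweet_count": 10, "favorite_count": 50},
    {"id": 892177421306343426, "retweet_count": 7, "favorite_count": 40},
]

path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")

# Read the file back line by line, keeping only the fields of interest.
rows = []
with open(path, encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        rows.append({
            "tweet_id": data["id"],
            "retweet_count": data["retweet_count"],
            "favorite_count": data["favorite_count"],
        })

tweet_counts = pd.DataFrame(rows)
print(tweet_counts)
```

Keeping one object per line means a partially written file can still be read up to the last complete line, which is useful when a long download is interrupted.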

In [None]:
# used to download from twitter's API
import os
import tweepy
from tweepy import OAuthHandler
import json

# https://github.com/tweepy/tweepy/blob/master/examples/oauth.py
# == OAuth Authentication ==
#
# This mode of authentication is the new preferred way
# of authenticating with Twitter.

# read the keys from environment variables rather than hard-coding them in the notebook
# (the variable names below are placeholders; use whatever names you export)
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

# `OAuthHandler` instance into which we pass `consumer_key` and `consumer_secret`
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# pass `access_token` and `access_secret` to the handler to authorize requests
auth.set_access_token(access_token, access_secret)

# Construct the API instance
# `wait_on_rate_limit=True` makes Tweepy wait when the rate limit is reached;
# `wait_on_rate_limit_notify=True` prints a notification while it waits
# (this parameter exists in Tweepy 3.x and was removed in Tweepy 4)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)


In [None]:
# If the authentication was successful, you should
# see the name of the account print out
print(api.me().name)

* Error handling in Python is done through exceptions, which are raised inside `try` blocks and handled in `except` blocks.
* If an error is raised, execution of the `try` block stops and control transfers to the matching `except` block.
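
A small, self-contained illustration of that control flow is shown below. The function `parse_count` is a hypothetical helper written only for this example and is not part of the project code.

```python
def parse_count(value):
    """Convert a value to int, returning None when the conversion fails."""
    try:
        # If int() raises, the rest of the try block is skipped.
        return int(value)
    except (TypeError, ValueError) as error:
        # Control jumps here; report the problem and carry on.
        print(f"Could not parse {value!r}: {error}")
        return None


parsed = [parse_count(v) for v in ["12", "13", "n/a", None]]
print(parsed)
```

Catching only the exception types you expect (here `TypeError` and `ValueError`) keeps unrelated bugs visible instead of silently swallowing them.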

In [None]:
# https://stackoverflow.com/questions/47612822/how-to-create-pandas-dataframe-from-twitter-search-api
# Download Tweepy status object based on Tweet ID and store in list
list_of_tweets = []
# Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []

for tweet_id in arch_df['tweet_id']:   
    # try block to test block of code for errors
    try:
        list_of_tweets.append(api.get_status(tweet_id))
    # `except` block to handle the error
    except Exception as e:
        cant_find_tweets_for_those_ids.append(tweet_id)
   

In [None]:
# Number of tweets retrieved and number of tweet IDs that could not be found
print("The list of tweets", len(list_of_tweets))
print("The list of tweets not found", len(cant_find_tweets_for_those_ids))

In [None]:
# isolate the JSON part (a plain dictionary) of each Tweepy status object
my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
    my_list_of_dicts.append(each_json_tweet._json)

In [None]:
# write the list of dictionaries to a text file as JSON
with open('tweet_json.txt', 'w') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))

In [None]:
# create a DataFrame from the tweet_json.txt file
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:  
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        whole_tweet = each_dictionary['text']
        only_url = whole_tweet[whole_tweet.find('https'):]
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        created_at = each_dictionary['created_at']
        whole_source = each_dictionary['source']
        only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
        source = only_device
        # retweets carry a 'retweeted_status' key; original tweets do not
        retweeted_status = each_dictionary.get('retweeted_status', 'Original tweet')
        if retweeted_status == 'Original tweet':
            url = only_url
        else:
            retweeted_status = 'This is a retweet'
            url = 'This is a retweet'

        my_demo_list.append({'tweet_id': str(tweet_id),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'url': url,
                             'created_at': created_at,
                             'source': source,
                             'retweeted_status': retweeted_status,
                            })

# build the DataFrame once, after every tweet has been processed
tweet_json = pd.DataFrame(my_demo_list, columns=['tweet_id', 'favorite_count',
                                                 'retweet_count', 'created_at',
                                                 'source', 'retweeted_status', 'url'])

In [None]:
tweet_json.head()

In [None]:
tweet_json.info()


## Assessing Data for this Project
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
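
A few programmatic checks of the kind described above might look like the sketch below. The `sample` frame is a small hypothetical stand-in shaped like `arch_df`; its values are invented so that each check has something to find.

```python
import pandas as pd

# Hypothetical sample shaped like arch_df; values are made up for illustration.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "retweeted_status_id": [None, None, None, 99.0],
    "rating_numerator": [13, 12, 12, 24],
    "rating_denominator": [10, 10, 10, 7],
    "name": ["Phineas", "Tilly", "Tilly", "a"],
})

# Completeness: missing values per column.
missing = sample.isnull().sum()

# Uniqueness: tweet IDs that appear more than once.
duplicate_ids = sample.loc[sample["tweet_id"].duplicated(), "tweet_id"].tolist()

# Validity: ratings whose denominator is not the usual 10.
odd_denominators = sample[sample["rating_denominator"] != 10]

# Accuracy: all-lowercase "names" are unlikely to be real dog names.
suspect_names = sample.loc[sample["name"].str.islower(), "name"].tolist()

print(missing)
print(duplicate_ids, len(odd_denominators), suspect_names)
```

Each finding would then be written down as a quality or tidiness issue before any cleaning starts.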

## Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
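
As one possible shape for a cleaning step, the sketch below follows a define, code, test pattern on a hypothetical sample. It assumes, based on the `info()` output shown earlier, that retweets are the rows with a non-null `retweeted_status_id` and that `timestamp` is stored as a string. The issues chosen here are examples, not the full list for the project.

```python
import pandas as pd

# Hypothetical sample shaped like arch_df; the last row stands in for a retweet.
raw = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426, 1],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-01 12:00:00 +0000",
    ],
    "retweeted_status_id": [None, None, 5.0],
})

# Work on a copy so the raw data stays available for comparison.
clean = raw.copy()

# Define: keep original tweets only, then drop the now-empty retweet column.
clean = clean[clean["retweeted_status_id"].isnull()].drop(
    columns=["retweeted_status_id"]
)

# Define: tweet_id is an identifier, not a quantity, so store it as a string.
clean["tweet_id"] = clean["tweet_id"].astype(str)

# Define: timestamp should be a datetime rather than a plain string.
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Test: confirm the changes took effect.
print(clean.dtypes)
```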

## Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
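
A minimal sketch of both storage options follows. The `master` frame is a hypothetical stand-in for the cleaned data, the database file name and the table name `master` are assumptions, and a temporary directory is used here so that the example does not touch the project folder.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned data standing in for the master DataFrame.
master = pd.DataFrame({
    "tweet_id": ["892420643555336193", "892177421306343426"],
    "rating_numerator": [13, 13],
    "rating_denominator": [10, 10],
})

out_dir = tempfile.mkdtemp()  # swap for the project folder in practice
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # assumed file name

# CSV: index=False avoids writing the row index as an extra unnamed column.
master.to_csv(csv_path, index=False)

# SQLite: pandas can write through a standard sqlite3 connection.
conn = sqlite3.connect(db_path)
master.to_sql("master", conn, index=False, if_exists="replace")

# Read both back; dtype=str on tweet_id keeps the long IDs as text.
from_csv = pd.read_csv(csv_path, dtype={"tweet_id": str})
from_db = pd.read_sql("SELECT * FROM master", conn)
conn.close()

print(from_csv)
print(from_db)
```

Reading `tweet_id` back as a string matters because the IDs are 18 digits long and are easier to join on as text than as numbers.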

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

## Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files. You might prefer to use a word processor like Google Docs or Microsoft Word, however.

# References

https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/twitter-data-in-python/

https://github.com/tweepy/tweepy/blob/master/examples/oauth.py

http://docs.tweepy.org/en/latest/getting_started.html

https://stackoverflow.com/questions/21308762/avoid-twitter-api-limitation-with-tweepy

https://stackoverflow.com/questions/47612822/how-to-create-pandas-dataframe-from-twitter-search-api

https://www.pythonforbeginners.com/error-handling/python-try-and-except

https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/

http://docs.tweepy.org/en/v3.5.0/getting_started.html#introduction

https://medium.com/ub-women-data-scholars/let-the-robot-do-your-work-web-scraping-with-python-9c147fb7690f