<a id='intro'></a>
## Introduction

In this project I gather and analyze data about the Twitter account <a href = "https://twitter.com/dog_rates">"WeRateDogs"</a>. The data is obtained in three different ways: a manual download, a programmatic download, and an API. I then assess the data, define the issues found during the assessment, and clean them to produce a single master dataframe. This dataframe is then analyzed to draw some useful insights.

<a id='sources'></a>
## Data Sources


1. **Source:** WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)
    - Origin: <a href = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv">Udacity</a>
    - Version: Latest (Downloaded 03/05/2020)
    - Method of gathering: Manual download


2. **Source:** Tweet image predictions (image_predictions.tsv)
    - Origin: <a href="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv">Udacity</a>     
    - Version: Latest (Downloaded 03/05/2020)
    - Method of gathering: Programmatic download via Requests


3. **Source:** Additional Twitter data (tweet_json.txt)
    - Origin: <a href = "https://twitter.com/dog_rates">WeRateDogs</a>   
    - Version: Latest (Collected 03/05/2020)
    - Method of gathering: API via Tweepy

In [1]:
import requests
import pandas as pd
import tweepy
import json
import re

### 1. WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)

Since we already have the file, let's verify it by loading its contents directly into a dataframe with pandas and viewing the first rows.

In [None]:
df_twitter = pd.read_csv("./data/raw/twitter-archive-enhanced.csv")

df_twitter.head(3)

### 2. Tweet image predictions (image_predictions.tsv)

To gather this data, we define the file URL, request it, and write the content of the response to a local file.

In [None]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# get response
response = requests.get(url)

# write the response body to a local TSV file
with open("./data/raw/image_predictions.tsv", mode="wb") as file:
    file.write(response.content)
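
The download step above writes whatever the server returns, including an error page if the request fails. A minimal sketch of a stricter variant follows, assuming a hypothetical helper named `download_file` (not part of the original notebook) and an arbitrary 30-second timeout. It checks the HTTP status before anything is written to disk.

```python
import requests


def download_file(url, path, timeout=30):
    """Download url to path and return the number of bytes written.

    download_file is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of saving them
    response.raise_for_status()
    with open(path, mode="wb") as file:
        file.write(response.content)
    return len(response.content)
```

Calling `download_file(url, "./data/raw/image_predictions.tsv")` would then stand in for the request and write steps shown above.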

In [None]:
df_predict = pd.read_csv("./data/raw/image_predictions.tsv", sep='\t')

df_predict.head(3)

### 3. Additional Twitter data (tweet_json.txt)

To gather the data from the Twitter API, I created a Twitter developer account and queried the API with Tweepy. The result is a new file called "tweet_json.txt".

In [None]:
from timeit import default_timer as timer
consumer_key = '<your key>'
consumer_secret = '<your key>'
access_token = '<your key>'
access_secret = '<your key>'

def scrape_twitter_timeline():
    # access the API
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Tweepy 3.x signature; wait_on_rate_limit_notify was removed in Tweepy 4.0
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    # get all the twitter ids in the df
    twitter_ids = list(df_twitter.tweet_id.unique())

    # save the gathered data to a file
    start = timer()
    with open("./data/raw/tweet_json.txt", "w") as file:
        for ids in twitter_ids:
            print(f"Gather id: {ids}")
            try:
                # get all the twitter status - extended mode gives us additional data
                tweet = api.get_status(ids, tweet_mode="extended")
                # dump the json data to our file
                json.dump(tweet._json, file)
                # add a linebreak after each dump
                file.write('\n')
            except Exception as e:
                print(f"Error - id: {ids}: {e}")
    end = timer()
    print(end - start)
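
The four credentials above are placeholders that have to be filled in before the function can run. One way to keep real keys out of the notebook is to read them from environment variables. The sketch below is only an illustration: `load_credentials` and the `TWITTER_*` variable names are hypothetical and not part of the original project.

```python
import os

# Hypothetical environment variable names; use whatever your setup defines.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be assigned to `consumer_key`, `consumer_secret`, `access_token` and `access_secret` before calling `scrape_twitter_timeline()`.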

Now we can read the necessary fields from each line of the file into a list of dictionaries and create a dataframe from it.

In [None]:
api_data = []

scrape_twitter_timeline()

# read the created file
with open("./data/raw/tweet_json.txt", "r") as f:
    for line in f:
        try:
            tweet = json.loads(line)
            # append a dictionary to the created list
            api_data.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
                "retweeted": tweet["retweeted"],
                "display_text_range": tweet["display_text_range"]
            })

            # tweet["entities"]["media"][0]["media_url"]
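        except (json.JSONDecodeError, KeyError) as e:
            # skip malformed lines and tweets that lack an expected field
            print(f"Error: {e}")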

df_api = pd.DataFrame(api_data, columns=[
                      "tweet_id", "retweet_count", "favorite_count", "retweeted", "display_text_range"])
df_api.head()
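
Because the file holds one JSON object per line, pandas can also read it directly. The following is a minimal sketch of that alternative, not the approach used in this notebook. It assumes a hypothetical helper named `load_tweet_json` and assumes that every line contains all five fields.

```python
import pandas as pd

# Fields kept in df_api, under the names used in the raw tweet JSON.
API_COLUMNS = ["id", "retweet_count", "favorite_count",
               "retweeted", "display_text_range"]


def load_tweet_json(path):
    """Read a line-delimited JSON file and keep the columns used in df_api.

    load_tweet_json is a hypothetical helper, shown only as an alternative.
    """
    # lines=True tells pandas to parse each line as a separate JSON object
    raw = pd.read_json(path, lines=True)
    return raw[API_COLUMNS].rename(columns={"id": "tweet_id"})
```

Unlike the loop above, this version does not skip individual bad lines: a malformed line or a missing field raises an error for the whole file.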

Finally, let's check the first row of each dataframe.

In [None]:
df_twitter.head(1)

In [None]:
df_predict.head(1)

In [None]:
df_api.head(1)
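
Looking at single rows confirms the columns but not how much data each source delivered. As a rough complement, the sketch below summarizes the size of each dataframe. The helper `summarize_frames` is hypothetical, and it assumes every dataframe has a `tweet_id` column, as the three dataframes above do.

```python
import pandas as pd


def summarize_frames(frames, key="tweet_id"):
    """Return one row per dataframe: row count, column count, unique keys.

    summarize_frames is a hypothetical helper; frames maps a name to a dataframe.
    """
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "rows": len(df),
            "columns": df.shape[1],
            "unique_ids": df[key].nunique(),
        })
    return pd.DataFrame(rows).set_index("name")
```

It could be called as `summarize_frames({"df_twitter": df_twitter, "df_predict": df_predict, "df_api": df_api})`. Different counts of unique ids would show that the three sources do not cover the same tweets.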