<a id='intro'></a>
## Introduction

In this project I gather and analyze data about the Twitter account <a href = "https://twitter.com/dog_rates">"WeRateDogs"</a>. The data is obtained using three different methods: manual download, programmatic download, and a query against the Twitter API. I then assess the data, define the issues found during the assessment, and clean them to produce a master dataframe. The cleaned data is then analyzed to draw some useful insights.

<a id='sources'></a>
## Data Sources


1. **Source:** WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)
    - Origin: <a href = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv">Udacity</a>
    - Version: Latest (Downloaded 03/05/2020)
    - Method of gathering: Manual download


2. **Source:** Tweet image predictions (image_predictions.tsv)
    - Origin: <a href="https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv">Udacity</a>
    - Version: Latest (Downloaded 03/05/2020)
    - Method of gathering: Programmatic download via Requests


3. **Source:** Additional Twitter data (tweet_json.txt)
    - Origin: <a href = "https://twitter.com/dog_rates">WeRateDogs</a>   
    - Version: Latest (Collected 03/05/2020)
    - Method of gathering: API via Tweepy

In [1]:
import requests
import pandas as pd
import tweepy
import json
import re

### 1. WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)

Since we already have the file, let's verify and view it by loading its contents directly into a dataframe with pandas.

In [6]:
df_twitter = pd.read_csv("../data/raw/twitter-archive-enhanced.csv")

df_twitter.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


### 2. Tweet image predictions (image_predictions.tsv)

To gather this data we define the file URL, request it, and write the content of the response to a separate file.

In [3]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# get response
response = requests.get(url)

# write the response body to a local file
with open("../data/raw/image_predictions.tsv", mode="wb") as file:
    file.write(response.content)
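
The download step above can be made a little more defensive. The following is a minimal sketch, not the code this notebook ran: the helper name `download_file`, its `timeout` default and the injectable `get` argument are assumptions made for illustration. It fails early on an HTTP error instead of writing an error page to disk, and it creates the target folder if it is missing.

```python
from pathlib import Path


def download_file(url, dest, timeout=30, get=None):
    """Download `url` and write the raw response body to `dest`.

    `get` defaults to requests.get; it can be swapped for a stub in tests.
    """
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline
        get = requests.get

    response = get(url, timeout=timeout)
    # raise an exception for 4xx/5xx answers instead of saving an error page
    response.raise_for_status()

    dest = Path(dest)
    # create the target folder if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(response.content)
    return dest
```

With the names used in this notebook, the call would look like `download_file(url, "../data/raw/image_predictions.tsv")`.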

In [5]:
df_predict = pd.read_csv("../data/raw/image_predictions.tsv", sep='\t')

df_predict.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


### 3. Additional Twitter data (tweet_json.txt)

To gather the data from the Twitter API, I created a Twitter developer account and queried each tweet via Tweepy. The results are stored in a new file called "tweet_json.txt".

In [9]:
from timeit import default_timer as timer

# Twitter API credentials: fill in your own values and keep them out of shared notebooks
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'

def scrape_twitter_timeline():
    # access the API
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4.0
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    # get all the twitter ids in the df
    twitter_ids = list(df_twitter.tweet_id.unique())

    # save the gathered data to a file
    start = timer()
    with open("../data/raw/tweet_json.txt", "w") as file:
        for ids in twitter_ids:
            print(f"Gather id: {ids}")
            try:
                # get all the twitter status - extended mode gives us additional data
                tweet = api.get_status(ids, tweet_mode="extended")
                # dump the json data to our file
                json.dump(tweet._json, file)
                # add a linebreak after each dump
                file.write('\n')
            except Exception as e:
                print(f"Error - id: {ids}: {e}")
    end = timer()
    print(end - start)
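
Ids that cannot be fetched (for example because a tweet no longer exists) are only printed by the loop above. A minimal sketch of the same loop that also keeps the failed ids is shown here. The function name `dump_statuses` and the `fetch` callable are hypothetical; with Tweepy, `fetch` could be something like `lambda i: api.get_status(i, tweet_mode="extended")._json`.

```python
import io
import json


def dump_statuses(tweet_ids, fetch, out_file):
    """Write one JSON document per line and return {tweet_id: error message}.

    `fetch` is a hypothetical callable that takes a tweet id and returns a
    JSON-serialisable dict, or raises if the tweet cannot be retrieved.
    """
    failed = {}
    for tweet_id in tweet_ids:
        try:
            status = fetch(tweet_id)
        except Exception as exc:  # broad on purpose: API errors vary by library version
            failed[tweet_id] = str(exc)
            continue
        out_file.write(json.dumps(status) + "\n")
    return failed


# offline demonstration with a stub instead of the Twitter API
def fake_fetch(tweet_id):
    if tweet_id == 2:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


buffer = io.StringIO()
failed_ids = dump_statuses([1, 2, 3], fake_fetch, buffer)
print(failed_ids)                           # {2: 'No status found with that ID.'}
print(len(buffer.getvalue().splitlines()))  # 2
```

Keeping the failed ids in a dictionary makes it possible to retry them later or to report how many tweets are missing from the API data.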

Now we can read the necessary fields from the file into a list of dictionaries and create a dataframe from it.

In [10]:
api_data = []

# scrape_twitter_timeline()

# read the created file
with open("../data/raw/tweet_json.txt", "r") as f:
    for line in f:
        try:
            tweet = json.loads(line)
            # append a dictionary to the created list
            api_data.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
                "retweeted": tweet["retweeted"],
                "display_text_range": tweet["display_text_range"]
            })

            # tweet["entities"]["media"][0]["media_url"]
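        except (json.JSONDecodeError, KeyError) as e:
            # skip lines that are not valid JSON or lack one of the fields
            print(f"Error: {e}")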

df_api = pd.DataFrame(api_data, columns=[
                      "tweet_id", "retweet_count", "favorite_count", "retweeted", "display_text_range"])
df_api.head()


Gather id: 681297372102656000
Gather id: 681281657291280384
Gather id: 681261549936340994
Gather id: 681242418453299201
Gather id: 681231109724700672
Gather id: 681193455364796417
Gather id: 680970795137544192
Gather id: 680959110691590145
Gather id: 680940246314430465
Gather id: 680934982542561280
Gather id: 680913438424612864
Gather id: 680889648562991104
Gather id: 680836378243002368
Gather id: 680805554198020098
Gather id: 680801747103793152
Gather id: 680798457301471234
Gather id: 680609293079592961
Gather id: 680583894916304897
Gather id: 680497766108381184
Gather id: 680494726643068929
Gather id: 680473011644985345
Gather id: 680440374763077632
Gather id: 680221482581123072
Gather id: 680206703334408192
Gather id: 680191257256136705
Gather id: 680176173301628928
Gather id: 680161097740095489
Gather id: 680145970311643136
Gather id: 680130881361686529
Gather id: 680115823365742593
Gather id: 680100725817409536
Gather id: 680085611152338944
Gather id: 680070545539371008
Gather id

Unnamed: 0,tweet_id,retweet_count,favorite_count,retweeted,display_text_range
0,892420643555336193,7728,36324,False,"[0, 85]"
1,892177421306343426,5721,31299,False,"[0, 138]"
2,891815181378084864,3786,23580,False,"[0, 121]"
3,891689557279858688,7884,39619,False,"[0, 79]"
4,891327558926688256,8512,37814,False,"[0, 138]"
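
Reading the file line by line works because each line holds one complete JSON document. The sketch below shows the same parsing step on two made-up lines held in memory; the helper name `parse_tweet_lines` and the sample values are assumptions for illustration, not data from the archive. Lines that are malformed or lack a field are counted, so it is easy to see how many records were dropped.

```python
import io
import json

FIELDS = ["id", "retweet_count", "favorite_count", "retweeted", "display_text_range"]


def parse_tweet_lines(lines, fields=FIELDS):
    """Return (rows, skipped) for an iterable of JSON lines."""
    rows, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        try:
            tweet = json.loads(line)
            row = {name: tweet[name] for name in fields}
        except (json.JSONDecodeError, KeyError):
            skipped += 1
            continue
        # rename the API field "id" to the column name used in the other tables
        row["tweet_id"] = row.pop("id")
        rows.append(row)
    return rows, skipped


# two made-up lines: the second one lacks most fields and is skipped
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, '
    '"retweeted": false, "display_text_range": [0, 85]}\n'
    '{"id": 2, "retweet_count": 3}\n'
)
rows, skipped = parse_tweet_lines(sample)
print(rows[0]["tweet_id"], skipped)  # 1 1
```

The resulting list can be passed to `pd.DataFrame(rows)` in the same way as `api_data` above.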


Let's do a final check on the three dataframes:

In [11]:
df_twitter.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


In [12]:
df_predict.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


In [13]:
df_api.head(1)

Unnamed: 0,tweet_id,retweet_count,favorite_count,retweeted,display_text_range
0,892420643555336193,7728,36324,False,"[0, 85]"
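
Beyond looking at the first row, it can help to check how the three tables relate to each other through `tweet_id`. This is a minimal sketch with a hypothetical helper `summarize_id_overlap`; it accepts any iterable of ids, so the toy lists below could be replaced by `df_twitter.tweet_id`, `df_predict.tweet_id` and `df_api.tweet_id`.

```python
def summarize_id_overlap(archive_ids, prediction_ids, api_ids):
    """Count unique ids per source and how many ids the sources share."""
    archive = set(archive_ids)
    predictions = set(prediction_ids)
    api = set(api_ids)
    return {
        "archive": len(archive),
        "predictions": len(predictions),
        "api": len(api),
        "in_all_three": len(archive & predictions & api),
        "archive_without_prediction": len(archive - predictions),
        "archive_without_api_data": len(archive - api),
    }


# toy ids, not real tweet ids
summary = summarize_id_overlap([1, 2, 3, 4], [1, 2, 5], [1, 3, 4])
print(summary["in_all_three"])                # 1
print(summary["archive_without_prediction"])  # 2
print(summary["archive_without_api_data"])    # 1
```

Ids that appear in the archive but not in the other two sources are candidates to look at during the assessment.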
