# Section 2: Data Ingestion

- [Section 2.1: Motivation](#motivation)
- [Section 2.2: Ingestion](#ingestion)
- [Section 2.3: Sample and Explanation](#sample)

## Section 2.1: Motivation <a class="anchor" id="motivation"></a>

<p>
<img src = https://kss.rs/eng_new/wp-content/uploads/2020/05/tokio.jpg style="width:175px;height:250px;margin:0px 10px;" ALIGN="right" />
    <p>A particular interest of the authors is the ongoing Summer Olympic Games. We enjoy watching the top athletes in the world compete in sports that are rarely televised, such as gymnastics, swimming, and track. A natural curiosity, then, was to see how other people view the Olympics.</p>
    <p>As social media has grown, Twitter has become a go-to platform for people around the world to express their views. Twitter allows people to express their feelings on just about any subject they desire (for better or for worse). It makes sense, then, to look at Twitter data to model the sentiment around the Olympics. Luckily, Twitter provides an API for querying historical tweet data.</p>
</p>

## Section 2.2: Ingestion <a class="anchor" id="ingestion"></a>

The Twitter API is relatively easy to use, although there are a few restrictions to consider, such as rate limits and the different API access types. For the purposes of this study, the API makes it easy to look at historical tweets containing specific words and hashtags. The following describes the aspects of the Twitter API that pertain to this project. For more information, see the [Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api).
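
The rate limits mentioned above are usually handled by waiting until the current window resets. The following is a minimal sketch, not part of the original notebook: it assumes a rate-limited response carries an `x-rate-limit-reset` header holding the reset time in Unix epoch seconds, and the helper name `seconds_until_reset` and the 60-second fallback are hypothetical choices.

```python
import time


def seconds_until_reset(headers, now=None, default_wait=60.0):
    """Return how long to sleep before retrying a rate-limited request.

    headers: mapping of response headers (e.g. response.headers from requests).
    now: current Unix time; defaults to time.time() (injectable for testing).
    default_wait: fallback wait (an assumption) when no reset header is present.
    """
    now = time.time() if now is None else now
    reset_at = headers.get('x-rate-limit-reset')
    if reset_at is None:
        return default_wait
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, float(reset_at) - now)
```

A caller could sleep for the returned number of seconds whenever a request comes back with status code 429, then retry the same request.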

The first step is to get authorization. To do this, we send our client key and secret to the API in a POST request made with the requests library. The API returns a response, which we must save because it contains the access token.

In [2]:
import numpy as np
import pandas as pd
import requests 
import base64
import time

# Save Authorization Info.
client_key = 'XXXXXXXXXXXXXXXXXXXXXXXXXX' 
client_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXX' 
bearer_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXX' 
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')
# Build API URL 
base_url = 'https://api.twitter.com/'
auth_endpoint = base_url + 'oauth2/token'
auth_headers = { 'Authorization': 'Basic {}'.format(b64_encoded_key),
                'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' }
auth_data = { 'grant_type': 'client_credentials' }
# Provide Authorization Info and Save Access Token
response = requests.post(auth_endpoint, headers=auth_headers, data=auth_data)
print("Response Status Code: ",response.status_code)

Response Status Code:  200


As expected, we get a response status code of "200", which tells us that the request succeeded. The next step is to get the *access_token* from our authorization response.

In [3]:
json_data =  response.json()
access_token = json_data['access_token']
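
Indexing the response directly raises a bare `KeyError` if authorization failed. As a minimal sketch of a more defensive variant (the helper name `extract_access_token` is hypothetical and not part of the original notebook), the parsed JSON can be checked before the token is used:

```python
def extract_access_token(payload):
    """Return the access token from a parsed token response, or raise ValueError."""
    token = payload.get('access_token')
    if not token:
        # Include the payload so the API's error message is visible to the caller.
        raise ValueError(f'No access_token in response: {payload}')
    return token


# Example with a stand-in payload shaped like a successful token response.
print(extract_access_token({'token_type': 'bearer', 'access_token': 'abc123'}))
```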

A utility function was then created. This function allows us to use the Twitter API to save tweets. It takes the user's saved access token and lets the user specify the number of tweets to pull. The most important parameter of this function is the query parameter, which filters the tweets you receive by specific hashtags, words, retweet status, etc. For this project, we are concerned with tweets containing the hashtag "#olympics" and the name of a specific sport.

In [None]:
# Recent search endpoint. This URL was not defined in the original cell;
# it is assumed here from the docstring's reference to the recent search API.
search_url = base_url + '2/tweets/search/recent'

def get_tweets(access_token, query, max_tweets=10, tweet_limit=10):
    """Retrieve tweets from the recent search API.

    Args
    ----
    access_token (str): A valid bearer token for making Twitter API requests.
    query (str): A valid Twitter query string for filtering the search tweets.
    max_tweets (int): The maximum number of tweets to collect in total.
    tweet_limit (int): The maximum number of tweets per API request.
    """
    page_token = None
    tweet_data = []
    search_headers = {
        'Authorization': 'Bearer {}'.format(access_token),
        'User-Agent': 'v2FullArchiveSearchPython',
    }
    # Divide max_tweets into the number of requests needed at tweet_limit each, rounding up.
    num_requests = -(-max_tweets // tweet_limit)
    for _ in range(num_requests):
        search_parameters = {
            'query': query,
            'max_results': tweet_limit,
            'tweet.fields': 'lang,created_at,referenced_tweets,source,conversation_id'
        }
        # From the 2nd page of results onward, add a next_token attribute to the search parameters
        if page_token is not None:
            search_parameters['next_token'] = page_token

        response = requests.get(search_url, headers=search_headers, params=search_parameters)
        if response.status_code != 200:
            print(f'\tError occurred: Status Code {response.status_code}: {response.text}')
        else:
            payload = response.json()
            # Check for a result count before doing anything further; a positive result_count means we have data
            if payload['meta']['result_count'] > 0:
                tweet_data.extend(payload['data'])

                # If a 'next_token' exists, update the page token to continue pagination;
                # otherwise this was the last page, so stop instead of re-requesting it
                page_token = payload['meta'].get('next_token')
                if page_token is None:
                    break
            else:
                print('\tNo data returned for query!')
                break
        print(f'\t{len(tweet_data)} total tweets gathered')
        time.sleep(1)
    return tweet_data[:max_tweets]
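
The pagination pattern used by this function can be tried out without network access. The following is a minimal sketch rather than the authors' code: `paginate` and `fake_page` are hypothetical names, and the stub imitates only the `data` and `meta` keys that the function above reads from the API response.

```python
def paginate(fetch_page, max_items, page_size):
    """Collect up to max_items by following next_token from page to page."""
    items, token = [], None
    while len(items) < max_items:
        page = fetch_page(token, page_size)
        items.extend(page.get('data', []))
        # A missing next_token means the last page has been reached.
        token = page.get('meta', {}).get('next_token')
        if token is None:
            break
    return items[:max_items]


def fake_page(token, size, total=25):
    """Stand-in for one API request over a pretend result set of `total` tweets."""
    start = int(token) if token else 0
    end = min(start + size, total)
    meta = {'result_count': end - start}
    if end < total:
        meta['next_token'] = str(end)
    return {'data': [{'id': str(i)} for i in range(start, end)], 'meta': meta}


print(len(paginate(fake_page, max_items=20, page_size=10)))   # 20: stops at max_items
print(len(paginate(fake_page, max_items=100, page_size=10)))  # 25: stops at the last page
```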

Using the above function we can sample tweets about Olympic gymnastics. Notice that the query specifies the hashtag "#olympics" and the word "gymnastics", and uses "-is:retweet" to exclude retweets. Finally, so that we do not have to keep requesting data from the Twitter API, we save the tweets in a pickle file. This way we can use them later in our analysis without running into Twitter's rate limits.

In [None]:
import pickle

# Make the request
q = '#olympics gymnastics -is:retweet'
olympic_tweets = get_tweets(access_token, q, max_tweets=5000, tweet_limit=100)
# Save the data
with open('../data/gymnastics-tweets.pkl', 'wb') as f:
    pickle.dump(olympic_tweets, f)

Using the "get_tweets" function we tried to pull 5000 tweets for each of the following sports: basketball, biking, diving, gymnastics, skateboarding, surfing, track, and volleyball. This was easily done by changing the query parameter to include the name of the sport.
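
Repeating the request for each sport could be scripted along the following lines. This is a minimal sketch under stated assumptions: `build_query` and `collect_all` are hypothetical helpers, the `{sport}-tweets.pkl` naming simply extends the gymnastics file name used above, and the fetch function is passed in so the loop can be tried without calling the API.

```python
import pickle
from pathlib import Path

SPORTS = ['basketball', 'biking', 'diving', 'gymnastics',
          'skateboarding', 'surfing', 'track', 'volleyball']


def build_query(sport, hashtag='#olympics'):
    """Build a query for the hashtag plus the sport name, excluding retweets."""
    return f'{hashtag} {sport} -is:retweet'


def collect_all(fetch, sports=SPORTS, out_dir='../data'):
    """Fetch tweets for each sport and pickle them; return tweet counts per sport."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for sport in sports:
        tweets = fetch(build_query(sport))
        with open(out_dir / f'{sport}-tweets.pkl', 'wb') as f:
            pickle.dump(tweets, f)
        counts[sport] = len(tweets)
    return counts


print(build_query('gymnastics'))  # #olympics gymnastics -is:retweet
```

With the function from Section 2.2, the call might look like `collect_all(lambda q: get_tweets(access_token, q, max_tweets=5000, tweet_limit=100))`.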

## Section 2.3: Sample and Explanation <a class="anchor" id="sample"></a>

Now that we have pulled in some tweets, let's have a look at a sample of them.

In [3]:
original_tweets_df = pd.DataFrame(pd.read_pickle('../data/gymnastics-tweets.pkl'))
original_tweets_df.sample(5)

| | text | id | created_at | conversation_id |
|---|---|---|---|---|
| 1754 | @Shawn_Shipp_ Thank You, Simone Biles - Headph... | 1423371158443941890 | 2021-08-05T19:51:43.000Z | 1423299151215931392 |
| 1349 | Unexpected fun fact: \nLiu Yang originally wa... | 1423471713312854017 | 2021-08-06T02:31:17.000Z | 1423471713312854017 |
| 389 | Rhythmic Gymnastics looks insane. #Olympics | 1424213988720717828 | 2021-08-08T03:40:50.000Z | 1424213988720717828 |
| 1055 | @kpkuehler Thank You, Simone Biles - Headphone... | 1423796825761456128 | 2021-08-07T00:03:10.000Z | 1423616706413572103 |
| 1004 | @Simone_Biles you are a wonderful human being ... | 1423812938889142273 | 2021-08-07T01:07:12.000Z | 1423812938889142273 |


There is a lot of information about tweets that one can receive from the API. We have narrowed it down to the fields that are important for this study:

- text - The text of the tweet sent out. This will include user handles, hashtags, emojis, and any retweet indicators
- id - The unique id given to the tweet, which lets Twitter store and reference it.
- created_at - When the tweet was posted, given in UTC.
- conversation_id - The unique id given to Twitter conversations. We can use this id to get all tweets in a conversation.
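
As a minimal sketch of how these fields might be used, the snippet below parses `created_at` into timezone-aware timestamps and flags tweets whose `id` equals their `conversation_id`. The two rows are copied from the sample above. Reading an equal `id` and `conversation_id` as "this tweet started the conversation" is an assumption about the API's semantics, and the column name `starts_conversation` is hypothetical.

```python
import pandas as pd

# Two rows taken from the sample output above (ids kept as strings).
sample = pd.DataFrame({
    'text': ['@Shawn_Shipp_ Thank You, Simone Biles - Headph...',
             'Rhythmic Gymnastics looks insane. #Olympics'],
    'id': ['1423371158443941890', '1424213988720717828'],
    'created_at': ['2021-08-05T19:51:43.000Z', '2021-08-08T03:40:50.000Z'],
    'conversation_id': ['1423299151215931392', '1424213988720717828'],
})

# Parse the ISO 8601 strings into timezone-aware UTC timestamps.
sample['created_at'] = pd.to_datetime(sample['created_at'], utc=True)

# Assumption: a tweet whose id matches its conversation_id opened that conversation.
sample['starts_conversation'] = sample['id'] == sample['conversation_id']

print(sample[['created_at', 'starts_conversation']])
```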

<div class="container">
   <div style="float:left;width:20%"><a href="./Intro.ipynb"><< Section 1: Introduction</a></div>
   <div style="float:right;width:20%"><a href="./Cleaning.ipynb">Section 3: Data Cleaning >></a></div>
   <div style="float:right;width:40%"><a href="../main.md">Table of Contents</a></div>
</div>