# Section 2: Data Ingestion

- [Section 2.1: Motivation](#motivation)
- [Section 2.2: Ingestion](#ingestion)
- [Section 2.3: Sample and Explanation](#sample)

## Section 2.1: Motivation <a class="anchor" id="motivation"></a>

<p>
<img src = https://kss.rs/eng_new/wp-content/uploads/2020/05/tokio.jpg style="width:175px;height:250px;margin:0px 10px;" ALIGN="right" />
    <p>The authors have a particular interest in the ongoing Summer Olympic Games. We enjoy watching the top athletes in the world compete in sports that are not on television very often, such as gymnastics, swimming, and track. A natural curiosity, then, was to see how other people view the Olympics.</p>
    <p>As social media has grown, Twitter has become the go-to platform for the world to express its views. Twitter allows people to express their feelings on just about any subject they desire (for better or for worse). It makes sense, then, to look at Twitter data to model the sentiment around the Olympics. Luckily, Twitter provides an API for us to query historical tweet data.</p>
</p>

## Section 2.2: Ingestion <a class="anchor" id="ingestion"></a>

The Twitter API is relatively easy to use. There are a few restrictions, such as rate limits and API access types, that you do have to consider. For the purposes of this study, the API makes it easy to look at historical tweets containing specific words and hashtags. The following details the aspects of the Twitter API that pertain to this project. For more information, see the [Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api).

The first step is to get authorization. 

In [None]:
import base64
import pickle
import time
import numpy as np
import pandas as pd
import requests
# Save Authorization Info.
client_key = 'k7IeaeoVVSVVRs2VMRyjWMGhB'
client_secret = 'QjWhmf3ylIKXTYn6zv26tTkPjobjbapTKlC4JDay74jd647mcl'
bearer_token = 'AAAAAAAAAAAAAAAAAAAAAO3%2BQgEAAAAAKO%2BcZLOP%2B1jh1SeWDdIMpCF4smc%3DiUFDhj7SZfb1fl177n5Fipd9OA2Elj9aVdezk2hhBUGbTrdLgY'
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')

# Build API URL 
base_url = 'https://api.twitter.com/'
auth_endpoint = base_url + 'oauth2/token'
auth_headers = { 'Authorization': 'Basic {}'.format(b64_encoded_key),
                'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' }
auth_data = { 'grant_type': 'client_credentials' }

# Provide Authorization Info and Save Access Token
response = requests.post(auth_endpoint, headers=auth_headers, data=auth_data)
print("Response Status Code: ",response.status_code)

As expected, we get a response status code of 200, which tells us that we successfully received our access token.
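The Basic authorization header built above is simply the Base64 encoding of `client_key:client_secret`. As a minimal sketch, the helper below builds the same header while reading the credentials from the environment rather than from the notebook source. The function name `build_basic_auth_header` and the environment variable names are hypothetical and not part of the original notebook.

```python
import base64
import os


def build_basic_auth_header(client_key, client_secret):
    """Return the value of a Basic Authorization header for the given credentials."""
    key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
    return 'Basic {}'.format(base64.b64encode(key_secret).decode('ascii'))


# Hypothetical variable names; fall back to empty strings if they are not set.
client_key = os.environ.get('TWITTER_CLIENT_KEY', '')
client_secret = os.environ.get('TWITTER_CLIENT_SECRET', '')

auth_headers = {
    'Authorization': build_basic_auth_header(client_key, client_secret),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
```

Keeping the key and secret out of the notebook means the file can be shared without exposing the credentials.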

Let's get the access token from our response and save it in the variable *access_token*.

In [None]:
json_data =  response.json()
access_token = json_data['access_token']
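Indexing `json_data['access_token']` directly raises a `KeyError` if the request failed. A slightly more defensive sketch is shown below. The helper name `extract_access_token` is hypothetical, and the check on `token_type` assumes the endpoint reports `"bearer"` for application-only authentication.

```python
def extract_access_token(json_data):
    """Return the access token from a token response, or raise ValueError."""
    if json_data.get('token_type') != 'bearer' or 'access_token' not in json_data:
        # Report only the keys so that no secret ends up in the error message.
        raise ValueError('Unexpected token response: {}'.format(sorted(json_data)))
    return json_data['access_token']


# Example with a made-up payload (not a real token).
example = {'token_type': 'bearer', 'access_token': 'EXAMPLE_TOKEN'}
print(extract_access_token(example))
```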

We then created a utility function that uses the Twitter API to collect tweets. The function needs the user's saved access token and lets the user specify the number of tweets to pull. Its most important parameter is the query parameter, which filters the tweets you receive by hashtags, words, retweet status, and so on. For this project we are concerned with tweets containing the hashtag "#olympics" and the name of a specific sport.

In [None]:
def get_tweets(access_token, query, max_tweets=10, tweet_limit=10):
    """Retrieve tweets from the recent search API.
        Args
        ----
        access_token (str): A valid bearer token for making Twitter API requests.
        query (str): A valid Twitter query string for filtering the search tweets.
        max_tweets (int): The maximum number of tweets to collect in total.
        tweet_limit (int): The number of maximum tweets per API request. 
    """

    page_token = None
    tweet_data = []

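    # Recent search endpoint of the v2 API (assumed here, based on the docstring above).
    search_url = 'https://api.twitter.com/2/tweets/search/recent'

    search_headers = {
        'Authorization': 'Bearer {}'.format(access_token),
        'User-Agent': 'v2FullArchiveSearchPython',
    }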

    # Divides the max tweets into the appropriate number of requests based on the tweet_limit (at least one request).
    for i in range(max(1, max_tweets // tweet_limit)):
        search_parameters = {
            'query': query,
            'max_results': tweet_limit,
            'tweet.fields': 'lang,created_at,referenced_tweets,source,conversation_id'
        }

        print(f'Request {i + 1}: {query}')
        
        # If we reach the 2nd page of results, add a next_token attribute to the search parameters
        if i > 0:
            search_parameters['next_token'] = page_token

        response = requests.get(search_url, headers=search_headers, params=search_parameters)
        if response.status_code != 200:
            print(f'\tError occurred: Status Code {response.status_code}: {response.text}')
        else:
            payload = response.json()
            # We need to check for a result count before doing anything further; if result_count > 0 we have data
            if payload['meta']['result_count'] > 0:
                tweet_data.extend(payload['data'])

                # If a 'next_token' exists, update the page token to continue pagination; otherwise stop
                if 'next_token' in payload['meta']:
                    page_token = payload['meta']['next_token']
                else:
                    print(f'\t{len(tweet_data)} total tweets gathered')
                    break
            else:
                print('\tNo data returned for query!')
                break
        
        print(f'\t{len(tweet_data)} total tweets gathered')

        time.sleep(1)

    return tweet_data
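The pagination pattern used by `get_tweets` can be tried without touching the network. The sketch below is an illustration rather than the notebook's implementation: `paginate` and `fake_pages` are hypothetical names, and the fake pages only imitate the `data` / `meta.next_token` shape of a search response.

```python
def paginate(fetch_page, max_items):
    """Collect up to max_items by following next_token until it runs out.

    fetch_page is any callable that takes a token (None for the first page)
    and returns a dict shaped like {'data': [...], 'meta': {...}}.
    """
    items = []
    token = None
    while len(items) < max_items:
        page = fetch_page(token)
        items.extend(page.get('data', []))
        token = page.get('meta', {}).get('next_token')
        if token is None:
            break  # no further pages to request
    return items[:max_items]


# Stand-in for the API: two fake pages keyed by the token that requests them.
fake_pages = {
    None: {'data': [1, 2, 3], 'meta': {'result_count': 3, 'next_token': 'p2'}},
    'p2': {'data': [4, 5], 'meta': {'result_count': 2}},
}

print(paginate(fake_pages.get, max_items=4))
print(paginate(fake_pages.get, max_items=10))
```

Stopping as soon as `next_token` is missing avoids requesting the same page twice, and slicing the result keeps the total at or below the requested maximum.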

Using the above function, we can sample original tweets about Olympic gymnastics. Notice that the query specifies the hashtag "#olympics" and the word "gymnastics", and excludes retweets with `-is:retweet`. Finally, so that we do not have to keep requesting data from the Twitter API, we save the tweets in a pickle file. This way we can use them later in our analysis (and Twitter will not get mad at us).

In [None]:
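# Make the request
q = '#olympics gymnastics -is:retweet'
max_tweets = 100  # illustrative value (assumption): total tweets to collect
tweet_limit = 10  # illustrative value (assumption): tweets per request
olympic_tweets = get_tweets(access_token, q, max_tweets, tweet_limit)

with open('../data/olympic-gynmastics.pkl', 'wb') as f:
    pickle.dump(olympic_tweets, f)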
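The query string used above combines a hashtag, a keyword, and a retweet filter, separated by spaces. As a small sketch of how the same query could be produced for several sports, consider the helper below. The name `build_query` is hypothetical, and the list of sports is taken from the motivation section.

```python
def build_query(hashtag, keyword, exclude_retweets=True):
    """Combine a hashtag, a keyword, and an optional retweet filter into one query string."""
    parts = ['#{}'.format(hashtag.lstrip('#')), keyword]
    if exclude_retweets:
        parts.append('-is:retweet')
    return ' '.join(parts)


sports = ['gymnastics', 'swimming', 'track']
queries = {sport: build_query('olympics', sport) for sport in sports}
print(queries['gymnastics'])
```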

We are also concerned with the conversation tweets (replies) that follow the original tweet. Twitter provides a conversation ID attribute with each tweet, which lets us reuse the Twitter API to get all the tweets in a conversation. Using the same function, we can get all of the replies by specifying this conversation ID in the query parameter. Once again using Olympic gymnastics as an example...

In [None]:
gynmastics_tweets = pd.read_pickle("../data/olympic-gynmastics.pkl")

conversation_ids = [ tweet['conversation_id'] for tweet in gynmastics_tweets ]
len(conversation_ids)

# Query all replies
replies = []

for conversation_id in conversation_ids[0:11]:
    q = 'conversation_id:{id}'.format(id=conversation_id)
    replies.extend(get_tweets(access_token, q, max_tweets, tweet_limit))

with open('../data/olympic-gynmastics-replies.pkl', 'wb') as f:
    pickle.dump(replies, f)
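Several collected tweets can belong to the same conversation, in which case the list of conversation IDs contains repeats and the same replies would be requested more than once. The sketch below removes duplicates while keeping the original order before building the queries. The helper name `unique_conversation_queries` is hypothetical and the tweets are made up for illustration.

```python
def unique_conversation_queries(tweets):
    """Build one 'conversation_id:<id>' query per distinct conversation, in first-seen order."""
    ids = [tweet['conversation_id'] for tweet in tweets if 'conversation_id' in tweet]
    unique_ids = list(dict.fromkeys(ids))  # dicts preserve insertion order
    return ['conversation_id:{}'.format(cid) for cid in unique_ids]


# Made-up tweets for illustration only.
sample_tweets = [
    {'id': '1', 'conversation_id': '100'},
    {'id': '2', 'conversation_id': '200'},
    {'id': '3', 'conversation_id': '100'},
]
print(unique_conversation_queries(sample_tweets))
```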

## Section 2.3: Sample and Explanation <a class="anchor" id="sample"></a>

All right, we have pulled in some tweets. Now let's have a look at some of the original tweets and the replies.

In [None]:
pd.set_option('expand_frame_repr', False)

original_tweets_df = pd.DataFrame(olympic_tweets)
original_tweets_df.sample(5)

response_tweet_df = pd.DataFrame(replies)
response_tweet_df.sample(5)

You receive a lot of information about each tweet. For us, the most important fields are:

-
-
-
-
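One field that takes a little unpacking is `referenced_tweets`, which was requested through `tweet.fields` in `get_tweets`. In the v2 API it is a list of small objects with a `type` and an `id`, and it is absent when a tweet references nothing. The sketch below pulls out the ID of the tweet being replied to. The helper name `replied_to_id` is hypothetical and the tweets are made up.

```python
def replied_to_id(tweet):
    """Return the ID of the tweet this one replies to, or None if it is not a reply."""
    for ref in tweet.get('referenced_tweets', []):
        if ref.get('type') == 'replied_to':
            return ref.get('id')
    return None


# Made-up tweets for illustration only.
sample = [
    {'id': '10', 'text': 'What a routine!', 'conversation_id': '10'},
    {'id': '11', 'text': 'Agreed', 'conversation_id': '10',
     'referenced_tweets': [{'type': 'replied_to', 'id': '10'}]},
]
print([replied_to_id(tweet) for tweet in sample])
```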


<div class="container">
   <div style="float:left;width:20%"><a href="./Intro.ipynb"><< Section 1: Introduction</a></div>
   <div style="float:right;width:20%"><a href="./Cleaning.ipynb">Section 3: Data Cleaning >></a></div>
   <div style="float:right;width:40%"><a href="../main.md">Table of Contents</a></div>
</div>