# Sentiment Analysis of COVID-19 Tweets: When did the Public Panic Set In?

    Notebook by Allison Kelly - allisonkelly42@gmail.com

# Introduction 

<i>This notebook is part one of my NLP project aiming to scrape and analyze tweets regarding the coronavirus pandemic. <br>View part two <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/text-processing/COVID-tweet-NLP.ipynb">here</a>.</i>

Love it or hate it, social media has gone from angsty teenagers posting poetry on LiVEJOURNAL to the leader of the free world <a href="https://twitter.com/realDonaldTrump/status/1213919480574812160?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1213919480574812160&ref_url=https%3A%2F%2Fmashable.com%2Farticle%2Ftrump-tweets-congress-war-powers-act%2F">waging war on Iran</a> in 280 characters (or fewer) in a matter of years. Social media provides analysts with <i><a href="https://seedscientific.com/how-much-data-is-created-every-day/">zettabytes</a></i> of data. A zettabyte is 1,000 to the seventh power bytes. Twitter users alone generate roughly 500 million tweets per day, and the content of those tweets holds invaluable public opinion data. 

As the world has been turned on its head during the COVID-19 global pandemic, an interesting question arises: how seriously is the public taking the pandemic? Millions have lost their jobs, deaths from the disease are in the hundreds each day, and the US meat supply chain is on the brink of failure, yet life-saving shelter-in-place orders are being defied as protesters rally all over the country in favor of opening the economy. 

The question I seek to answer in this project is whether public opinion in the United States about the seriousness of the outbreak changed once the World Health Organization declared a global health emergency. This is not meant to be academically rigorous. As I only have access to a Premium API Sandbox dev environment, I had to scale down my query to tweets containing either of two keywords, "COVID-19" or "coronavirus," with a limit of 5,000 tweets scraped per month. I chose the 24 hours before and after the declaration on January 30, 2020 (formally a Public Health Emergency of International Concern; the WHO did not use the word "pandemic" until March 11), though this gives a limited scope, as stay-at-home orders did not begin until mid-March. 

The following code will demonstrate how I scraped tweets using the <a href="https://github.com/twitterdev/search-tweets-python">searchtweets</a> and <a href="https://pypi.org/project/requests/2.7.0/">requests</a> packages once connected to the Twitter API. 

<b> More documentation to come.</b>

# Imports

In [2]:
import pandas as pd
import json
import requests

import searchtweets
from searchtweets import load_credentials
from string import Template

# OAUTH 2.0 Bearer Token

After following the Twitter Developer <a href="https://developer.twitter.com/en/docs/basics/getting-started">Getting Started</a> guide, I created a developer app, received API keys, and generated a bearer token from OAuth 2.0. My API keys are housed in a secret YAML document, separate from the repository used to house this notebook. 

In [11]:
credentials = load_credentials(filename="/Users/Allie/Documents/DS-Projects/API keys/twitter_keys.yaml",
                 yaml_key="search_tweets_api",
                 env_overwrite=False)

  search_creds = yaml.load(f)[yaml_key]
Grabbing bearer token from OAUTH


In [12]:
bearer_token = credentials['bearer_token']

# Initial Request

In order to test my search parameters, I'm making an initial request to the Twitter API Full Archive. The parameters I'm using for this project are the following:

   - <b>Keywords: </b> "coronavirus" and "COVID-19"<br>
   <br> I would love to include many variations such as "pandemic," "Wuhan virus," or other related terms, but the number of tweets would be too large for the access I have, so I chose the two most common terms that are COVID-19 specific. The WHO did not release the name COVID-19 until February 11, 2020, so once my requests reset, I will not be able to change this. <br><br>
   - <b>Date Range: </b> 00:00:00 30 Jan 2020 - 23:59:00  31 Jan 2020<br><br>
   The Twitter API toDate parameter is exclusive of the last minute. Again, once my requests are reset, I'll change this to reflect tweets the day before, the day of, and the day after the WHO's emergency declaration on 30 Jan 2020.
   <br><br>
   - <b>Location:</b> United States of America<br><br>
   By using the profile_country parameter, the selected tweets will be tweets or retweets of profiles that indicate the US is their geographical location, though they may have tweeted from abroad or retweeted non-Americans. I don't believe this should alter the intended sentiment. 
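To make the date handling concrete, here is a minimal sketch of building the request body from Python datetimes. The helper name `build_search_body` is hypothetical. It assumes the `YYYYMMDDHHmm` timestamp format used in this notebook's request and an exclusive `toDate`, so the end of the window is pushed forward by one minute to keep the final minute. The location filter is left out of this sketch.

```python
import json
from datetime import datetime, timedelta

def build_search_body(keywords, start, last_minute, next_token=None):
    """Return a JSON request body for a keyword search (hypothetical helper).

    start       -- datetime of the first minute to include
    last_minute -- datetime of the last minute to include; one minute is
                   added because toDate is assumed to be exclusive
    """
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHmm, matching the request in this notebook
    body = {
        "query": "(" + " OR ".join(keywords) + ")",
        "fromDate": start.strftime(fmt),
        "toDate": (last_minute + timedelta(minutes=1)).strftime(fmt),
    }
    if next_token is not None:
        body["next"] = next_token  # pagination token from a previous response
    return json.dumps(body)

body = build_search_body(
    ["coronavirus", "COVID-19"],
    start=datetime(2020, 1, 30, 0, 0),
    last_minute=datetime(2020, 1, 31, 23, 59),
)
print(body)
```

Serializing a dictionary with `json.dumps` guarantees the body is a single valid JSON object, which is easy to get wrong when the string is written by hand.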

In [7]:
# Full archive endpoint
endpoint = 'https://api.twitter.com/1.1/tweets/search/fullarchive/production.json'

headers = {"Authorization": "Bearer {}".format(bearer_token),
           "Content-Type": "application/json"}

# The request body must be a single JSON object, so it is built as a dict and
# serialized. The location filter is written as a query operator; this assumes
# the profile_country operator is available on this access tier.
data = json.dumps({"query": "(coronavirus OR COVID-19) profile_country:US",
                   "fromDate": "202001300000",
                   "toDate": "202001312359"})

# POST request instead of GET to avoid URL length restrictions of 2048 characters
response = requests.post(endpoint, data=data, headers=headers).json()

Next, we'll use json.dumps to serialize the response into a "pretty-printed" string and explore the results of the request to see if everything checks out. 

In [14]:
# Indenting to pretty-print
print(json.dumps(response, indent = 2))

{
  "error": {
    "message": "Request exceeds account\u2019s current package request limits. Please upgrade your package and retry or contact Twitter about enterprise access.",
    "sent": "2020-05-18T17:54:49+00:00",
    "transactionId": "001d6c5e00207d72"
  }
}


The first tweet returned by the request shows that the first tweet in this time period containing the keyword "coronavirus" wasn't until Fri Jan 31, 2020 at 23:55:39. I find it hard to believe no one in the US tweeted anything with "coronavirus" before this time. I'll investigate further once my request limits reset.

In [171]:
# Creating pandas dataframe from result

covid_tweets = pd.DataFrame.from_dict(response['results'])
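
When the quota is exhausted, the API answers with a JSON body that has an "error" key instead of a "results" list, as in the output above, so indexing `response['results']` directly would raise a `KeyError`. A minimal defensive sketch, assuming that error shape; `extract_results` is a hypothetical helper name and the sample dictionaries are stand-ins, not real API output:

```python
def extract_results(response):
    """Return the list of tweets in a response, or raise if the API reported an error."""
    if "error" in response:
        # Surface the API's own message instead of failing later with a KeyError
        raise RuntimeError(response["error"].get("message", "Unknown API error"))
    return response.get("results", [])

# Stand-in responses for illustration only
ok = {"results": [{"id": 1, "text": "example tweet"}], "next": "abc"}
limited = {"error": {"message": "Request exceeds package request limits."}}

print(len(extract_results(ok)))
try:
    extract_results(limited)
except RuntimeError as err:
    print("API error:", err)
```

The returned list can be handed to pandas in the same way as `response['results']` in the cell above.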

# Pagination

Rate limits for the free Sandbox dev environment include 100 tweets per request and 30 requests per minute. The response includes a "next" token that can be supplied to subsequent queries in order to paginate through the results until no tweets remain that fit your query. 

I've created a helper function to pull the unique next token per request and paginate through the results. Each JSON string will then be added to the dataframe defined by the initial API request. This function can be generalized by substituting the data and template variables with a properly formatted JSON query.  
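
The pagination pattern itself can be sketched independently of the network. The generator below is a hypothetical illustration, not the helper used in this notebook: `iter_pages` and its parameters are assumed names, and `fetch_page` stands for any callable that takes a next token (or `None` for the first page) and returns the decoded response. A two-second pause between calls stays within 30 requests per minute.

```python
import time

def iter_pages(fetch_page, max_requests=50, delay_seconds=2.0):
    """Yield response pages until the tokens run out or the request cap is hit.

    fetch_page    -- callable taking a next token (or None) and returning a dict
    max_requests  -- cap on the number of requests, e.g. a monthly quota
    delay_seconds -- pause between requests; 2 s stays within 30 requests/minute
    """
    token = None
    for request_number in range(max_requests):
        if request_number > 0:
            time.sleep(delay_seconds)
        page = fetch_page(token)
        yield page
        token = page.get("next")
        if not token:
            break  # the last page carries no 'next' token

# Fake three-page API used only to demonstrate the loop
fake_pages = {
    None: {"results": [1, 2], "next": "t1"},
    "t1": {"results": [3, 4], "next": "t2"},
    "t2": {"results": [5]},
}
pages = list(iter_pages(fake_pages.get, delay_seconds=0))
print([p["results"] for p in pages])
```

Passing the fetch function in as an argument keeps the loop testable without spending any of the monthly request quota.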

In [230]:
# 100 tweets per request
len(covid_tweets)

100

In [None]:
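def get_initial_next_token(endpoint, headers, query=data):
    # Send the query that was passed in (not the global data variable)
    response = requests.post(endpoint, data=query, headers=headers).json()
    
    # .get avoids a KeyError when the response carries no 'next' token
    if response.get('next'):
        return response['next']
    else:
        return 'Error'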

In [18]:
def paginate_covid_tweets(endpoint, headers, next_token):
    
    """
    This helper function, when combined with a loop, will pull the next_token 
    from each request and paginate through Twitter API requests. 
    
    Keyword arguments:
    
    endpoint -- URL string of Twitter API endpoint including dev environment name
    headers -- Dictionary of OAuth bearer token and content type
    next_token -- next_token from the previous request, from which to begin pagination
    """ 
    
    # The request body must be a single JSON object; the location filter is written
    # as a query operator (assumes profile_country is available on this access tier)
    t = Template('{"query":"(coronavirus OR COVID-19) profile_country:US", "fromDate": "202001300000", "toDate": "202001312359", "next":"${next_token}"}')
    data = t.substitute(next_token=next_token)
    response = requests.post(endpoint, data=data, headers=headers).json()
    
    # .get avoids a KeyError on the last page, which carries no 'next' token
    next_token = response.get('next')
    
    if next_token:
        return response, next_token
    
    else:
        return response, "End of Pagination"

In [None]:

# def paginate_covid_tweets(endpoint, headers, previous_response_token=None):
    
#     """
#     This helper function, when combined with a loop, will pull the next_token 
#     from each request and paginate through Twitter API requests. 
    
#     Keyword arguments:
    
#     endpoint -- URL string of Twitter API endpoint including dev environment name
#     headers -- Dictionary of OAuth bearer token and content type
#     previous_response_token -- None or next_token from previous 
#     request from which to begin pagination
#     """ 
    
#     global next_token
    
#     if previous_response_token:
#         t=Template('{"query":"(coronavirus OR COVID-19)",
#                    "fromDate": "202001300000", 
#                    "toDate": "202001312359",
#                    "next":"${next_token}"}, 
#                    {"place":{"profile_country":"United States of America"}}')
#         data = t.substitute(next_token=previous_response_token)
#         response = requests.post(endpoint, data=data, headers=headers).json()
#         next_token = response['next']
        
#         return response, next_token
    
#     elif previous_response_token == False: 
        
#         data = '{"query":"(coronavirus OR COVID-19)", 
#                    "fromDate": "202001300000", 
#                    "toDate": "202001312359",
#                    "next":"${next_token}"}, 
#                    {"place":{"profile_country":"United States of America"}}'
#         response = requests.post(endpoint, data=data, headers=headers).json()
#         next_token = response['next']
        
#         return response, next_token
    
#     else:
#         return "No next token."

Now that we have the helper function, we'll identify the next token from the initial request, then loop through each request with the pagination function, assigning a new token to each request, and add each page to a list until we've hit the monthly rate limit of 50 requests.

In [None]:
# Assigning next token to variable
next_token = get_initial_next_token(endpoint, headers, query=data)

# Each page of results is collected here
dfs = []

# Loop until the rate limit of 50 requests is met or the pages run out
for i in range(50):
    response, next_token = paginate_covid_tweets(endpoint, 
                                                 headers, 
                                                 next_token)
    dfs.append(response)
    
    # Stop early once the API reports no further pages
    if next_token == "End of Pagination":
        break

Below, we'll take just the tweet results, convert to dataframes, and add them to our initial covid_tweets dataframe.

In [None]:
for dictionary in dfs:
    df = pd.DataFrame.from_dict(dictionary['results'])
    covid_tweets = pd.concat([covid_tweets,df], axis=0)
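
Concatenating inside the loop copies the growing dataframe on every pass, and repeated runs during testing can leave the same tweet in the table more than once. A minimal sketch of an alternative, assuming each tweet dictionary carries a unique `id` field: flatten all pages into one list, drop repeats, and build the dataframe once at the end. `flatten_pages` is a hypothetical helper and the sample pages are stand-ins, not real API output.

```python
def flatten_pages(pages):
    """Merge the 'results' lists of several pages, keeping the first copy of each tweet id."""
    seen = set()
    tweets = []
    for page in pages:
        for tweet in page.get("results", []):  # error pages have no 'results'
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                tweets.append(tweet)
    return tweets

# Stand-in pages for illustration only
demo_pages = [
    {"results": [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]},
    {"results": [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]},
    {"error": {"message": "rate limited"}},
]
unique_tweets = flatten_pages(demo_pages)
print([t["id"] for t in unique_tweets])
# A single pd.DataFrame(unique_tweets) call would then build the full table.
```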

In [20]:
# Final token
latest_token = dfs[-1]['next']

'eyJtYXhJZCI6MTIyMzM5NDQ0MzQwMDc2NTQ0MX0='

After some trial and error, I've ended up with 3,700 tweets. This will become more robust, again, once the limit is reset.

In [22]:
len(covid_tweets)

3700

# Next Steps