# Sentiment Analysis of COVID-19 Tweets: When did the Public Panic Set In? Part 1: Scraping Tweets

    Notebook by Allison Kelly - allisonkelly42@gmail.com

# Introduction 

<i>This notebook is part one of my NLP project aiming to scrape and analyze tweets regarding the coronavirus pandemic. <br>View part two <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/text-processing/COVID-tweet-NLP.ipynb">here</a>.</i>

Love it or hate it, social media has gone from angsty teenagers posting poetry on LiVEJOURNAL to the leader of the free world <a href="https://twitter.com/realDonaldTrump/status/1213919480574812160?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1213919480574812160&ref_url=https%3A%2F%2Fmashable.com%2Farticle%2Ftrump-tweets-congress-war-powers-act%2F">waging war on Iran</a> in 265 words (or less) in a matter of years. Social media provides analysts with <i><a href="https://seedscientific.com/how-much-data-is-created-every-day/">zettabytes</a></i> of data every day. That's 1,000 bytes to the seventh power. Specifically, Twitter users generate 500 million tweets per day, of which the content of those Tweets contains invaluable public opinion data. 

As the world has been turned on its head during the COVID-19 global pandemic, an interesting question arises. How seriously is the public taking the pandemic? Millions have lost their jobs, deaths from the disease are in the hundreds each day, and the US meat supply chain is on the brink of failure, but life-saving shelter-in-place orders are being defied as protesters rally all over the country in favor of opening the ecomony. 

The question I seek to answer in this project is whether the public opinion of how serious the pandemic was changed in the United States once the World Health Organization declared a global pandemic. This is not meant to be academically rigorous. As I only have access to a Premium API Sandbox dev environment, I had to scale down my query to only include tweets containing two keywords, "COVID-19" and "coronavirus," and a limit of 5,000 tweets scraped per month. I chose the 24 hours before and after the declaration on January 31, 2020, though this will give a limited scope as stay-at-home orders did not begin until mid-March. 

The following code will demonstrate how I scraped tweets using the <a href="https://github.com/twitterdev/search-tweets-python">searchtweets</a> and <a href="https://pypi.org/project/requests/2.7.0/">requests</a> packages once connected to the Twitter API. 

<b> More documentation to come.</b>

# Imports

In [1]:
import pandas as pd
import json
import requests

import searchtweets
from searchtweets import load_credentials
from string import Template

# OAUTH 2.0 Bearer Token

After following the Twitter Developer <a href="https://developer.twitter.com/en/docs/basics/getting-started">Getting Started</a> guide, I created a developer app, received API keys, and generated a bearer token from OAuth 2.0. My API keys are housed in a secret YAML document, separate from the repository used to house this notebook. 

In [2]:
credentials = load_credentials(filename="/Users/Allie/Documents/DS-Projects/API keys/twitter_keys.yaml",
                 yaml_key="search_tweets_api",
                 env_overwrite=False)

Grabbing bearer token from OAUTH


In [3]:
bearer_token = credentials['bearer_token']

# Initial Request

In order to test my search parameters, I'm making an initial request to the Twitter API Full Archive. The parameters I'm using for this project are the following:

   - <b>Keywords: </b> "coronavirus OR Wuhan virus OR 2019-nCoV OR China flu"<br>
   <br> This keyword list will not be exhaustive of all tweets relating to COVID-19 during this time period but I believe will be enough to get an impression of the public response to the WHO's declaration of a pandemic. <br><br>
   - <b>Date Range: </b> 28 Jan 2020 -  03 Feb 2020<br><br>
   The pandemic was declared on January 31, 2020. The few days that bookend this date can offer a more concrete examination of public perception pre- and post-declaration
   <br><br>
   - <b>Location:</b> United States of America<br><br>
   By using the profile_country parameter, the selected tweets will be tweets or retweets of profiles that indicate the US is their geographical location, though they may have tweeted from abroad or retweeted non-Americans. I don't believe this should alter the intended sentiment. 

In [7]:
# Full archive endpoint
endpoint = 'https://api.twitter.com/1.1/tweets/search/fullarchive/production.json'

headers = {"Authorization":"Bearer {}".format(bearer_token), 
           "Content-Type": "application/json"}  

data = '{"query":"(coronavirus OR Wuhan virus OR 2019-nCoV OR China flu) lang:en", "fromDate": "202001280000", "toDate":"202002030000"}, {"place":{"profile_country":"United States of America"}}'

# POST request instead of GET to avoid URL length restrictions of 2048 characters
response = requests.post(endpoint,data=data,headers=headers).json()

Next, we'll use JSON to decode the response into a "pretty-printed" string to explore the results of the request to see if everything checks out. 

In [20]:
# Indenting to pretty-print
print(json.dumps(response, indent = 2))

{
  "results": [
    {
      "created_at": "Sun Feb 02 23:59:59 +0000 2020",
      "id": 1224120307717410816,
      "id_str": "1224120307717410816",
      "text": "RT @QuestForSense: Amazing Timelapse as China Completes First of Two Hospitals in Wuhan within 10 days having 1,000 beds and 1,400 medical\u2026",
      "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
      "truncated": false,
      "in_reply_to_status_id": null,
      "in_reply_to_status_id_str": null,
      "in_reply_to_user_id": null,
      "in_reply_to_user_id_str": null,
      "in_reply_to_screen_name": null,
      "user": {
        "id": 184207003,
        "id_str": "184207003",
        "name": "\u262e\ufe0fOpe",
        "screen_name": "The_Ope_",
        "location": "Third rock from the sun",
        "url": null,
        "description": "Melancholic thrill seeker | Reader | Mechanical Engineer | Amateur Artist",
        "translator_type": "none",
        "protected"

The first tweet is in Spanish though I meant to only include those in English, not to suppress non-American or non-English speaking voices, but because this project was not meant to be mutli-lingual. There are multilingual word embeddings by Facebook and Google that I could use once my initial exploration in English is complete. It would be facinating to look into different cultural responses to the pandemic.

The first tweet also includes an image that is instrumental to the understanding of the tweet. Taking a harder look, it seems as if this particular user draws comparison between the fictional Umbrella Corporation logo from the Resident Evil videogames and a defunct Shanghai biotech firm logo (Read an article about it <a href="https://www.snopes.com/fact-check/resident-evil-umbrella-coronavirus/">here</a>.) While conspiratorial and silly, it does bring up a limitation of this project: there will be no sentiment analysis of images or videos, only text. I'll need to identify and eliminate any tweet that has a video or image significant to the context of the tweet.

In [21]:
# Creating pandas dataframe from result
covid_tweets = pd.DataFrame.from_dict(response['results'])

# Pagination

Rate limits for the free Sandbox dev include 100 tweets per request and 30 requests per minute. The response includes a "next token" that can be supplied to subsequent queries in order to paginate through the results until there are no longer tweets that fit your query. 

I've created a helper function to pull the unique next token per request and paginate through the results. Each JSON string will then be added to the dataframe defined by the initial API request. This function can be generalized by substituting the data and template variables with a properly formatted JSON query.  

In [22]:
# 100 tweets per request
len(covid_tweets)

100

In [8]:
def get_initial_next_token(endpoint, headers, query=data):
    response = requests.post(endpoint,data=data,headers=headers).json()
    
    if response['next']:
        return response['next']
    else:
        return 'Error'

In [9]:
def paginate_covid_tweets(endpoint, headers, next_token):
    
    """
    This helper function when combined with a loop will pull the next_token 
    from each request and paginate through Twitter API requests. 
    
    Keyword arguments:
    
    endpoint -- URL string of Twitter API endpoint including dev environment name
    headers -- Dictonary of OAuth bearer token and content type
    next_token -- Next_token from previous 
    request from which to begin pagination
    """ 
    
    t=Template('{"query":"(coronavirus OR Wuhan virus OR 2019-nCoV OR China flu) lang:en", "fromDate": "202001300000", "toDate": "202001312359", "next":"${next_token}"}, {"place":{"profile_country":"United States of America"}}')
    data = t.substitute(next_token=next_token)
    response = requests.post(endpoint, data=data, headers=headers).json()
    
        
    if response['next']:
               
        next_token = response['next']
        return response, next_token
    
    else:
        return response, "End of Pagination"

Now that we have the helper function, we'll identify the next token from the initial request, then loop through each request with the pagination function, assigning a new token to each request, and add each page to a list until we've hit the monthly rate limit of 50 requests.

In [58]:
# Assigning next token to variable
next_token = get_initial_next_token(endpoint, headers, query=data)

next_token = "eyJtYXhJZCI6MTIyNDEyMDIzMjIzMjQ4NDg2NH0="
# Starting iteration at 0
i = 0
dfs = []

# Loop until rate limit of 50 is met
while i < 51:
    response, next_token = paginate_covid_tweets(endpoint, 
                                                 headers, 
                                                 next_token)
    dfs.append(response)
    i += 1
    

KeyError: 'next'

Below, we'll take just the tweet results, convert to dataframes, and add them to our initial covid_tweets dataframe.

In [65]:
for dictionary in dfs:
    df = pd.DataFrame.from_dict(dictionary['results'])
    covid_tweets = pd.concat([covid_tweets,df], axis=0)

In [66]:
len(covid_tweets)

6200

In [61]:
# Final token
latest_token = dfs[-1]['next']

IndexError: list index out of range

In [23]:
# Getting the rest of the tweets after the error above
next_token = latest_token
while i < 21:
    response, next_token = paginate_covid_tweets(endpoint, 
                                                 headers, 
                                                 next_token)
    dfs.append(response)
    i += 1

After my trial and error, I've ended up with 4,400 tweets. This will become more robust, again, once the limit is reset.

In [62]:
len(covid_tweets)

6200

In [67]:
covid_tweets.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,...,quoted_status_id_str,quoted_status_permalink,reply_count,retweet_count,retweeted,retweeted_status,source,text,truncated,user
0,,,Fri Jan 31 23:52:30 +0000 2020,,"{'hashtags': [], 'urls': [{'url': 'https://t.c...",,,0,False,low,...,,,0,0,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","Trump imposes travel restrictions, mandatory q...",False,"{'id': 43604215, 'id_str': '43604215', 'name':..."
1,,,Fri Jan 31 23:52:30 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,0,False,low,...,,,0,0,False,{'created_at': 'Fri Jan 31 23:47:14 +0000 2020...,"<a href=""http://twitter.com/download/android"" ...",RT @CityNews: Although lab-confirmed influenza...,False,"{'id': 108912964, 'id_str': '108912964', 'name..."
2,,,Fri Jan 31 23:52:30 +0000 2020,,"{'hashtags': [], 'urls': [{'url': 'https://t.c...",,{'full_text': 'I am averse to using “dunce” an...,1,False,low,...,,,0,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",I am averse to using “dunce” and “stupid” to d...,True,"{'id': 3640251315, 'id_str': '3640251315', 'na..."
3,,,Fri Jan 31 23:52:29 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,0,False,low,...,,,0,0,False,{'created_at': 'Fri Jan 31 23:36:23 +0000 2020...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @Bill_Nye_Tho: rt only if you dont got coro...,False,"{'id': 1168674623011074049, 'id_str': '1168674..."
4,,,Fri Jan 31 23:52:29 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,0,False,low,...,,,0,0,False,{'created_at': 'Fri Jan 31 21:55:15 +0000 2020...,"<a href=""http://twitter.com/download/iphone"" r...",RT @SolomonYue: Bravo POTUS! President Trump s...,False,"{'id': 1201135899582484480, 'id_str': '1201135..."


In [75]:
covid_tweets.to_csv('expanded_query_tweets_082720.csv',index=False)

# Next Steps

1. There was an issue with the next token on my first run through with the pagination function, but as my requests for this month have reached their limit, I won't be able to investigate until next month. The function did run again when the next token was initiated with the final token of the first run.
<br><br>
2. There needs to be a bit of cleanup with loops and functions.
<br><br>
3. Pagination needs to exhaust all tweets pertaining to this project.
<br><br>
4. Added filter for retweets, replies, and language on June 3. Will need to wait until limits reset to test.