### CUNY Data 620 - Web Analytics, Summer 2020  
**Final Project**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout 

### Instructions
Your project should incorporate one or both of the two main themes of this course: network analysis and text processing. You need to show all of your work in a coherent workflow, and in a reproducible format, such as an IPython Notebook or an R Markdown document. If you are building a model or models, explain how you evaluate the “goodness” of the chosen model and parameters. 

### Research Questions

* Is there a relationship between location-specific COVID-19 sentiment and the number of positive cases within that region? 
* Does positive sentiment precede spikes in positive cases?
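
One possible way to probe the second question is a lagged correlation: compare the sentiment on day `t` with the case count on day `t + lag` and see which lag gives the strongest relationship. The sketch below is only an illustration in plain Python. The helper names `pearson` and `lagged_correlation` are hypothetical, and the numbers are made-up toy values, not project data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def lagged_correlation(sentiment, cases, lag):
    """Correlate sentiment on day t with cases on day t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(sentiment, cases)
    return pearson(sentiment[:-lag], cases[lag:])

# Toy data (assumption): cases follow sentiment with a two-day delay
sentiment = [0.1, 0.5, 0.2, 0.8, 0.3, 0.6, 0.4]
cases = [30, 20, 10, 50, 20, 80, 30]

# Lag (in days) with the highest correlation among 0..3
best_lag = max(range(4), key=lambda k: lagged_correlation(sentiment, cases, k))
print(best_lag)
```

A high correlation at some lag would only suggest that sentiment leads cases; it would not establish a causal link.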

### The Data

We will use the Twitter API to collect tweet data, and the [Johns Hopkins COVID-19 Data](https://github.com/CSSEGISandData/COVID-19) and [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States) for the COVID-19 case numbers. 

### The Plan

1. Scrape Twitter data from 2 locations - perhaps NYC (severe initial wave) and New Orleans (experiencing something of a second wave)
2. Pull coronavirus case numbers for the 2 locations in question
3. Perform sentiment analysis on the tweets collected and aggregate them into an overall sentiment index for each day
4. Plot time series of the sentiment index vs. coronavirus case numbers
5. Indicate important moments on the timeline related to COVID-19 safety measures or announcements
6. Investigate potential relationships between the two sets and compare the relationships from one city to another
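
Step 3 of the plan could be sketched roughly as follows, assuming each scored tweet carries a timestamp and a VADER compound score. The `CREATED_AT` column name and the sample rows are hypothetical; `COMPOUND` matches the column name used in the sentiment analysis section of this notebook. Taking the daily mean is just one reasonable choice for the index.

```python
import pandas as pd

# Hypothetical scored tweets: one row per tweet
scored = pd.DataFrame({
    "CREATED_AT": pd.to_datetime([
        "2020-07-01 09:00", "2020-07-01 18:30",
        "2020-07-02 11:15", "2020-07-02 12:00", "2020-07-02 23:59",
    ]),
    "COMPOUND": [0.5, -0.1, 0.3, 0.6, 0.0],
})

# Daily index = mean compound score of all tweets posted on that calendar day;
# the tweet count is kept so that thinly sampled days can be spotted
daily_index = (
    scored.groupby(scored["CREATED_AT"].dt.date)["COMPOUND"]
          .agg(["mean", "count"])
          .rename(columns={"mean": "SENTIMENT_INDEX", "count": "N_TWEETS"})
)
print(daily_index)
```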

### Libraries

In [1]:
import pandas as pd
import numpy as np
from itertools import combinations
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
warnings.filterwarnings('ignore')

### Functions

In [2]:
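############### sentiment analysis
def sentiment_analyzer_scores(sentence):
    # relies on the global `analyser` (a SentimentIntensityAnalyzer),
    # which must be instantiated before this function is called
    score = analyser.polarity_scores(sentence)
    neg = score['neg']
    pos = score['pos']
    neu = score['neu']
    compound = score['compound']
    
    # order matches the column names used when building sentiments_df
    return [neg, pos, neu, compound, sentence]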

############### splitting hashtag groupings
def splitTags(x,y):
    return [(x,z) for z in y]

### Data
**TO DO: Explanation of twitter data pull**

In [3]:
# read in and replace nulls with empty strings
tweets = pd.read_csv('Covid_Twitter_City_Data.csv', delimiter=',')
tweets = tweets.fillna('')

### Sentiment Analysis

In [4]:
# instantiate sentiment analyzer, define function for sentiment output
analyser = SentimentIntensityAnalyzer()

# sentiments to dataframe
text = tweets['TEXT'].tolist()
sentiments = [sentiment_analyzer_scores(s) for s in text]
sentiments_df = pd.DataFrame(sentiments, columns = ['NEGATIVE_SCORE', 'POSITIVE_SCORE', 'NEUTRAL_SCORE', 'COMPOUND', 'SENTENCE'])

# final dataframe with sentiments
finalFrame = tweets.join(sentiments_df)
finalFrame = finalFrame.iloc[:,1:-1]
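
If a categorical label is needed later (for example, to color nodes by sentiment), the `COMPOUND` score can be bucketed. The sketch below uses the ±0.05 cutoffs suggested in the VADER documentation; the function name `label_sentiment` and the sample scores are hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration
labels = [label_sentiment(c) for c in (0.6249, 0.0, -0.4404, 0.03)]
print(labels)
```

With the dataframe above, this could be applied as `finalFrame['COMPOUND'].apply(label_sentiment)`.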

### Creating a Network
* **Nodes**: Cities, **TO DO: color: sentiment, size: portion of population affected by covid**
* **Edges**: Shared Hashtags, **TO DO: edge weights: number of shared hashtags**

In [5]:
# hashtags & coordinates for each record
hashtags = tweets['HASHTAGS'].tolist()
coords = tweets['COORDS'].tolist()
sepHash = [i.split() for i in hashtags]
sepHash[1:5]

[[], [], [], ['#CX', '#COVID', '#custserv']]

In [6]:
# set of all coordinates with individual hashtag
coordTag = [splitTags(i,j) for i,j in list(zip(coords,sepHash)) if len(j)> 0]
flattened = [val for sublist in coordTag for val in sublist]
finalHash = set(flattened)

# create a dictionary of each hashtag with the city coordinates
tempDict = {}
for i,j in finalHash:
    if j not in tempDict:
        tempDict[j]= [i]
    else:
        tempDict[j].append(i)
        
# remove covid hashtags from dictionary
tagsToRemove = ['#covid_19', '#COVID19', '#COVID2019', '#COVID_19', '#COVID__19', '#COVID', '#COVD19', '#Covid_19']

for k in tagsToRemove:
    tempDict.pop(k, None)

print('Example output from the hashtag #FollowTheScience:')
tempDict['#FollowTheScience']

Example output from the hashtag #FollowTheScience:


['34.7464809,-92.2895948',
 '39.7392358,-104.990251',
 '47.6062095,-122.3320708',
 '34.0522342,-118.2436849',
 '39.9525839,-75.1652215',
 '41.8781136,-87.6297982',
 '38.9071923,-77.0368707',
 '39.9611755,-82.9987942']

In [7]:
#### Final Edges
# combining all elements in the dictionary values into separate node connections
coordPairs = list(tempDict.values())
productList = []

for i in coordPairs:
    if len(i) >1:
        productList.append(list(combinations(i,2)))
    
finalPairs = [val for sublist in productList for val in sublist]

print('Example Edge:', finalPairs[1])

Example Edge: ('47.6062095,-122.3320708', '45.5230622,-122.6764816')


### Edge Weights
https://www.geeksforgeeks.org/python-program-to-count-duplicates-in-a-list-of-tuples/
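
A minimal sketch of how the edge weights might be computed, assuming the weight of an edge is the number of times the same pair of cities appears in an edge list shaped like `finalPairs`. Each pair is sorted first so that `(a, b)` and `(b, a)` count as the same undirected edge. The `pairs` list below is a toy example built from coordinates shown earlier in this notebook.

```python
from collections import Counter

# Toy edge list in the same shape as finalPairs: tuples of "lat,lon" strings
pairs = [
    ('47.6062095,-122.3320708', '45.5230622,-122.6764816'),
    ('45.5230622,-122.6764816', '47.6062095,-122.3320708'),  # same edge, reversed
    ('47.6062095,-122.3320708', '34.0522342,-118.2436849'),
]

# Normalize the order within each pair, then count duplicates
edge_weights = Counter(tuple(sorted(pair)) for pair in pairs)

# (node_a, node_b, weight) triples, a common input format for graph libraries
weighted_edges = [(a, b, w) for (a, b), w in edge_weights.items()]
print(weighted_edges)
```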

### Step 1: Scraping Twitter Data from New York City & New Orleans

As a first step, we decided to scrape tweets from two locations, New York City and New Orleans.

##### Reading in the tweets from NYC

In [None]:
tweets = pd.read_csv('covid_tweets.csv', delimiter='\t')
tweets['City'] = 'NYC'

In [None]:
tweets.head()

In [None]:
tweets.shape

### Step 2: Pulling coronavirus case numbers for both locations

In [None]:
covid_cases = pd.read_csv('confirmed_cases.csv')

##### Filtering for NYC cases

After locating the county FIPS code used for New York City (36061), we filtered the pandas dataframe down to that single row. We then transposed the row so that one column holds the date and another holds the cumulative number of confirmed cases on that date. Finally, we reset the index and converted the date column to a datetime type so that it plots correctly:

In [None]:
cases_filtered = covid_cases[covid_cases['FIPS'] == 36061]
df = cases_filtered.iloc[:, 11:186:1]

df = df.transpose().reset_index()
df = df.rename(columns={'index': 'Date', 1863: "Confirmed_Cases"})

nyc_time_series = pd.DataFrame(df, columns = ['Date','Confirmed_Cases'])
nyc_time_series['Date'] = pd.to_datetime(nyc_time_series['Date'], format='%m/%d/%y')

Here's a quick look at the filtered dataset with just NYC cases:

In [None]:
nyc_time_series.tail()

To find the number of new cases per day, we take the difference between each day's cumulative confirmed count and the previous day's. For the visualization, we also compute the 7-day average of new cases and plot it alongside the daily counts to get a better view of trends over time.

In [None]:
def add_newcases(df):
    # daily new cases = day-over-day difference of the cumulative count;
    # the first day has no previous value, so it is set to 0
    confirmed = pd.to_numeric(df['Confirmed_Cases'])
    df['New_Cases'] = confirmed.diff().fillna(0).astype(int)
    return df

In [None]:
def add_sevenday(df):
    # trailing 7-day mean of new cases (current day plus the previous six);
    # days without a full 7-day window are set to 0
    new_cases = df['New_Cases'].astype(float)
    df['Seven_Day_Avg'] = new_cases.rolling(window=7).mean().fillna(0)
    return df

In [None]:
df = add_newcases(nyc_time_series)
df = add_sevenday(df)

After creating the `New_Cases` and `Seven_Day_Avg` columns, we can plot the daily case counts in New York City:

In [None]:
def drawNewCases(df, title, fignum, var):
    # `var` is kept so that existing calls still work, but it is not used
    plt.figure(fignum, figsize=(16,8))
    # bars: daily new cases; line: 7-day average
    plt.bar(df['Date'], df['New_Cases'], color='indianred', alpha=0.4)
    plt.plot(df['Date'], df['Seven_Day_Avg'], c='indianred', linewidth=2)
    plt.title(title)
    plt.ylabel('Number of New Cases')
    # label the x-axis ticks with month names
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%B'))
    plt.show()

In [None]:
locator = mdates.MonthLocator()
fmt = mdates.DateFormatter('%B')

nyc_time_series = pd.DataFrame(df, columns = ['Date','Confirmed_Cases', 'New_Cases', 'Seven_Day_Avg'])
nyc_time_series['Date'] = pd.to_datetime(nyc_time_series['Date'], format='%m/%d/%y')

drawNewCases(nyc_time_series, 'Number of new COVID-19 cases in New York City (Daily)', 1, 'x')

##### Filtering for New Orleans Cases

In [None]:
cases_filtered_newo = covid_cases[covid_cases['FIPS'] == 22071]


df_newo = cases_filtered_newo.iloc[:, 11:186:1]

df_newo = df_newo.transpose().reset_index()
df_newo = df_newo.rename(columns={'index': 'Date', 1153: "Confirmed_Cases"})

newo_time_series = pd.DataFrame(df_newo, columns = ['Date','Confirmed_Cases'])
newo_time_series['Date'] = pd.to_datetime(newo_time_series['Date'], format='%m/%d/%y')

In [None]:
newo_time_series.tail()

In [None]:
df_newo = add_newcases(newo_time_series)
df_newo = add_sevenday(df_newo)

In [None]:
newo_time_series = pd.DataFrame(df_newo, columns = ['Date','Confirmed_Cases', 'New_Cases', 'Seven_Day_Avg'])
newo_time_series['Date'] = pd.to_datetime(newo_time_series['Date'], format='%m/%d/%y')

In [None]:
drawNewCases(newo_time_series, 'Number of new COVID-19 cases in New Orleans (Daily)', 2, 'y')

**Note from Zach**: Will remove this commented-out code later (see below), but thought I'd leave it just in case it'll be helpful for future visualizations:

In [None]:
# locator = mdates.MonthLocator()
# fmt = mdates.DateFormatter('%B')


# plt.plot(nyc_time_series['Date'], nyc_time_series['Confirmed_Cases'], c='indianred')
# plt.plot(legend=None)
# plt.title('Number of Confirmed COVID-19 Cases in New York City')
# plt.xlabel('Date')
# plt.ylabel('Number of Confirmed Cases')
# plt.gca().xaxis.set_major_formatter(fmt)
# plt.show()