
In [2]:
#Data
import requests
import csv

#EDA and Visualizations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#modeling
from nltk.tokenize import TweetTokenizer, word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, confusion_matrix, mean_squared_error, roc_curve, auc, roc_auc_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import psutil

import datetime



# Business Understanding

Data used in this project came from Kaggle, the CDC, Google Trends, and Twitter. The text was processed with basic NLP tooling from NLTK, including its TweetTokenizer.

For the analysis of the tweets, I started with a basic, mostly cleaned dataset from Kaggle to create text-based predictors. I also gathered my own Twitter data about vaccinations, which added roughly 11,000 rows. I cleaned the text with the Texthero library and then applied a trigram CountVectorizer. To create my target variable, I used data from Google Trends and the CDC: combining the percentage of new cases with search-trend relevance, I created a score that accounts for not only physical outbreaks but social ones as well.



Covid-19 is a problem that does not need much introduction or explanation. For this project, I wanted to find out whether there was any correlation between Twitter behavior and possible outbreaks. This matters because it could help hospitals better prepare their response to an outbreak. It could also help anybody who wants to avoid close contact with the disease by giving them advance warning, which they can use to make more informed decisions about their day-to-day lives.

# Data Preprocessing and Understanding

### Google Trends Data

In [3]:
google = pd.read_csv('data/covid_searches.csv')
google

Unnamed: 0,Category: All categories
Week,Symptoms of COVID-19: (United States)
2020-05-03,35
2020-05-10,34
2020-05-17,32
2020-05-24,30
...,...
2021-12-12,34
2021-12-19,56
2021-12-26,78
2022-01-02,81


In [4]:
google.reset_index(inplace=True)

google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   index                     90 non-null     object
 1   Category: All categories  90 non-null     object
dtypes: object(2)
memory usage: 1.5+ KB


In [5]:
google.drop(index=google.index[0], 
        axis=0, 
        inplace=True)

In [6]:
google['date'] = google['index']
google['relevance'] = google['Category: All categories']

google.drop(columns='index', inplace=True)
google.drop(columns='Category: All categories', inplace=True)

google

Unnamed: 0,date,relevance
1,2020-05-03,35
2,2020-05-10,34
3,2020-05-17,32
4,2020-05-24,30
5,2020-05-31,27
...,...,...
85,2021-12-12,34
86,2021-12-19,56
87,2021-12-26,78
88,2022-01-02,81


In [7]:
google['date'] = pd.to_datetime(google['date'])

google['relevance'] = google['relevance'].astype(int)

# Flag weeks whose search relevance exceeds 50 as outbreak weeks
google['outbreak'] = google['relevance'] > 50

### CDC Data

In [8]:
CDC = pd.read_csv('data/USA_covid_stats.csv')
CDC

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,03/11/2021,KS,297229,241035.0,56194.0,0,0.0,4851,,,0,0.0,03/12/2021 03:20:13 PM,Agree,
1,06/11/2021,TX,2965966,,,1463,355.0,51158,,,17,0.0,06/13/2021 12:00:00 AM,Not agree,Not agree
2,01/02/2022,AS,11,,,0,0.0,0,,,0,0.0,01/03/2022 03:18:16 PM,,
3,08/22/2020,AR,56199,,,547,0.0,674,,,11,0.0,08/23/2020 02:15:28 PM,Not agree,Not agree
4,07/17/2020,MP,37,37.0,0.0,1,0.0,2,2.0,0.0,0,0.0,07/19/2020 12:00:00 AM,Agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43135,05/28/2020,IA,18585,,,228,0.0,506,,,14,0.0,05/29/2020 02:19:55 PM,Not agree,Not agree
43136,06/07/2020,SD,5438,,,71,0.0,65,64.0,1.0,0,1.0,06/08/2020 02:55:08 PM,,Agree
43137,04/30/2021,SD,122660,,,128,17.0,1967,1601.0,366.0,5,1.0,05/01/2021 01:43:22 PM,,Agree
43138,03/10/2021,SD,113962,,,209,37.0,1904,1546.0,358.0,3,0.0,03/11/2021 03:36:21 PM,,Agree


In [9]:
CDC['date'] = CDC['submission_date']

CDC.drop(columns='submission_date', inplace=True)


col = ['state', 'conf_cases', 'prob_cases', 'pnew_case', 'conf_death', 'prob_death', 'pnew_death', 'created_at', 'consent_cases', 'consent_deaths']

CDC.drop(columns=col, inplace=True)

In [10]:
CDC.dropna(how='all', inplace=True)

CDC['date'] = pd.to_datetime(CDC['date']).dt.normalize()

In [11]:
CDC

Unnamed: 0,tot_cases,new_case,tot_death,new_death,date
0,297229,0,4851,0,2021-03-11
1,2965966,1463,51158,17,2021-06-11
2,11,0,0,0,2022-01-02
3,56199,547,674,11,2020-08-22
4,37,1,2,0,2020-07-17
...,...,...,...,...,...
43135,18585,228,506,14,2020-05-28
43136,5438,71,65,0,2020-06-07
43137,122660,128,1967,5,2021-04-30
43138,113962,209,1904,3,2021-03-10


### Twitter Data

In [None]:
tweets = pd.read_csv('data/covid_tweets.csv')
tweets.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False


In [None]:
column = ['user_name','user_location','user_description','user_created','user_followers','hashtags', 'user_favourites','user_verified','is_retweet', 'source', 'user_friends']

tweets.drop(columns=column, inplace=True)

> I dropped the remaining columns because the only fields I need from the Twitter datasets are the tweet text and the date, which lets me join the datasets on the same week later on.
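
As a minimal sketch of what joining on the same week can look like, the example below snaps made-up tweet timestamps to the Sunday that ends their week and then merges them with a made-up weekly table. The frames `toy_tweets` and `toy_google` and the column name `week` are assumptions for illustration only; they are not part of the project data.

```python
import pandas as pd

# Made-up tweets spread over two calendar weeks (illustrative only).
toy_tweets = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-20 09:15", "2020-07-25 12:27", "2020-07-28 18:02"]),
    "text": ["stay home", "wash your hands", "mask up"],
})

# Made-up weekly table keyed by week-ending Sunday.
toy_google = pd.DataFrame({
    "week": pd.to_datetime(["2020-07-26", "2020-08-02"]),
    "relevance": [40, 55],
})

# Map each timestamp to the Sunday that closes its Monday-to-Sunday week,
# then drop the time-of-day part so the keys match exactly.
toy_tweets["week"] = (
    toy_tweets["date"].dt.to_period("W-SUN").dt.end_time.dt.normalize()
)

# Every tweet now carries the relevance value of its own week.
joined = toy_tweets.merge(toy_google, on="week", how="inner")
print(joined[["text", "week", "relevance"]])
```

Both sides of the merge must use the same convention (week-ending versus week-starting dates), otherwise rows can be matched to the wrong week or not matched at all.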

In [None]:
tweets['date'] = pd.DatetimeIndex(tweets['date']).normalize()

In [None]:
tweets = tweets.groupby(['text', pd.Grouper(key='date', freq='W-SUN')]).sum().reset_index().sort_values('date')

In [12]:
#!pip install texthero

> Texthero is a python package to work with text data efficiently.
It empowers NLP developers with a tool to quickly understand any text-based dataset and
it provides a solid pipeline to clean and represent text data, from zero to hero.

> I found Texthero through my own research. It quickly and effectively cleans up text data so that it can be vectorized.
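
For readers who cannot install Texthero, the sketch below approximates the kind of cleaning it performs using only pandas. It is not Texthero's implementation: the function name `rough_clean` and the tiny `STOPWORDS` set are hypothetical, and the real library relies on a full English stopword list (the notebook output shows it downloading NLTK's stopwords corpus).

```python
import re
import string

import pandas as pd

# Tiny illustrative stopword set; an assumption, not the list Texthero uses.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in"}


def rough_clean(series: pd.Series) -> pd.Series:
    """Lowercase, strip digits and punctuation, drop stopwords, squeeze spaces."""
    s = series.fillna("").str.lower()
    s = s.str.replace(r"\d+", " ", regex=True)  # remove digits
    punct = f"[{re.escape(string.punctuation)}]"
    s = s.str.replace(punct, " ", regex=True)   # remove punctuation
    # Splitting and re-joining removes stopwords and collapses whitespace.
    return s.apply(lambda t: " ".join(w for w in t.split() if w not in STOPWORDS))


sample = pd.Series(["The vaccine is 95% effective!!", None, "Stay  home, stay SAFE"])
cleaned = rough_clean(sample)
print(cleaned.tolist())
```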

In [None]:
import texthero as hero

tweets['text'] = hero.clean(tweets['text'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
vax_tw = pd.read_csv('data/vaccination_tweets.csv')
vax_tw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11020 entries, 0 to 11019
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                11020 non-null  int64 
 1   user_name         11020 non-null  object
 2   user_location     8750 non-null   object
 3   user_description  10341 non-null  object
 4   user_created      11020 non-null  object
 5   user_followers    11020 non-null  int64 
 6   user_friends      11020 non-null  int64 
 7   user_favourites   11020 non-null  int64 
 8   user_verified     11020 non-null  bool  
 9   date              11020 non-null  object
 10  text              11020 non-null  object
 11  hashtags          8438 non-null   object
 12  source            11019 non-null  object
 13  retweets          11020 non-null  int64 
 14  favorites         11020 non-null  int64 
 15  is_retweet        11020 non-null  bool  
dtypes: bool(2), int64(6), object(8)
memory usage: 1.2+ MB


In [None]:
cols = ['id','user_name','user_location','user_description','user_created','user_followers','user_favourites',
'user_verified','hashtags','source','retweets','favorites','is_retweet', 'user_friends']

vax_tw.drop(columns=cols, inplace=True)

vax_tw['date'] = pd.DatetimeIndex(vax_tw['date']).normalize()

vax_tw = vax_tw.groupby(['text', pd.Grouper(key='date', freq='W-SUN')]).sum().reset_index().sort_values('date')

In [None]:
vax_tw['text'] = hero.clean(vax_tw['text'])

vax_tw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11014 entries, 7007 to 1927
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   text    11014 non-null  object        
 1   date    11014 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 258.1+ KB


> Here I combined two different Twitter datasets. Both are mostly about Covid-19 and vaccinations. I pulled from relevant subjects because I believed they would be the most useful for predicting outbreaks. Some ordinary tweets are embedded within, which serve almost as a baseline.

> To review, my data came from multiple sources. I started with a base, mostly cleaned Kaggle dataset of about 170k rows of tweets. For the sake of time and computational cost, I trimmed the dataset down so that I had more time for model tuning and other aspects of the project. I also gathered another 11k rows of tweets about vaccinations. Both datasets were used to create text-based predictors with a CountVectorizer, to see whether there was any direct correlation between the tweets and my target variable.

> The target variable was constructed from both Google and CDC data. I created a score using the percentage of new cases as well as the relevance of searches. I then scaled the score down so that outliers did not affect it as much. This accounts for not only physical outbreaks but the social aspect as well.
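
The notebook does not show the exact formula for this score, so the following is only a sketch of one way such a score could be built. The equal-weight average and the `log1p` scaling are assumptions chosen for illustration, and the numbers in `toy_google` and `toy_cdc` are made up.

```python
import numpy as np
import pandas as pd

weeks = pd.to_datetime(["2021-12-05", "2021-12-12", "2021-12-19", "2021-12-26"])

# Made-up weekly search relevance and weekly new-case totals.
toy_google = pd.DataFrame({"date": weeks, "relevance": [30, 34, 56, 78]})
toy_cdc = pd.DataFrame({"date": weeks, "new_case": [1000, 1100, 2200, 5500]})

weekly = toy_google.merge(toy_cdc, on="date")

# Week-over-week percentage change in new cases; the first week has no
# predecessor, so its change is set to 0.
weekly["case_pct"] = weekly["new_case"].pct_change().fillna(0) * 100

# Hypothetical combination: a plain average of the two signals.
weekly["raw_score"] = (weekly["case_pct"] + weekly["relevance"]) / 2

# log1p compresses large values so a single spike does not dominate.
weekly["score"] = np.log1p(weekly["raw_score"].clip(lower=0))
print(weekly[["date", "case_pct", "relevance", "raw_score", "score"]])
```

With real CDC data, the daily per-state rows would first need to be summed into weekly national totals before a step like this.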

## Combining Dataframes

In [None]:
tweet = tweets.sample(frac=0.5)

df = [tweet, google, vax_tw]

tash = tweet.merge(google, how='inner', on='date')

In [None]:
from functools import reduce

df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['date'],
                                            how='outer'), df)

In [None]:
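# Assign the result back; calling fillna(inplace=True) on a selected column may not update df_merged
df_merged['text_x'] = df_merged['text_x'].fillna(df_merged['text_y'])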

df_merged.drop(columns='text_y', inplace=True)

df_merged.dropna(inplace=True)

In [None]:
df_merged.info()

> After joining all of my dataframes, I found a major problem: the only join that worked was an outer join. An inner join on the date columns dropped every row and left only the column names. To work around this I used an outer join, but I had to sample a fraction of the larger Twitter dataframe to keep the computation manageable.
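
The cause of the empty inner join is not established in this notebook. As a general diagnostic sketch, the example below uses the `indicator` option of `DataFrame.merge` to show which keys exist on only one side. The frames `left` and `right` are made up, and the string-typed date column is just one common cause that is assumed here for illustration.

```python
import pandas as pd

# Made-up frames: the left key starts out as strings, the right as datetimes.
left = pd.DataFrame({"date": ["2020-07-26", "2020-08-02"], "text": ["a", "b"]})
right = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-26", "2020-08-09"]),
    "relevance": [40, 60],
})

# Give both keys the same dtype before merging.
left["date"] = pd.to_datetime(left["date"])

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
check = left.merge(right, on="date", how="outer", indicator=True)
counts = check["_merge"].astype(str).value_counts().to_dict()
print(check)
print(counts)
```

If nearly every row is `left_only` or `right_only`, the keys do not line up (different dtypes, time-of-day components, or week conventions), which would also explain an inner join that returns no rows.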

## Functions

In [None]:
def tokenize(tweet):
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, 
                           preserve_case=False)
    return tknzr.tokenize(tweet)

In [None]:
def classifiers(x_train, y_train):
    nb  = MultinomialNB()
    nb.fit(x_train, y_train)
    log = LogisticRegression()
    log.fit(x_train, y_train)
    forest = RandomForestClassifier(n_estimators=100, max_depth=5)
    forest.fit(x_train, y_train)
    gradboost = GradientBoostingClassifier(random_state=123, max_depth=5, 
                                          learning_rate = 0.01)
    gradboost.fit(x_train, y_train)
    adaboost = AdaBoostClassifier(n_estimators=100)
    adaboost.fit(x_train, y_train)
    svm = SVC(kernel='linear', probability=True)
    svm.fit(x_train, y_train)
    return [nb, log, forest, gradboost, adaboost, svm]


In [None]:
def classifier_performance(vectorizer, train_data, test_data, y_train, y_test):
    # Fit the vectorizer on the training text only, then reuse it on the test text
    x_train = vectorizer.fit_transform(train_data)
    x_test = vectorizer.transform(test_data)

    classifier_list = classifiers(x_train, y_train)
    # Names are listed in the same order that classifiers() returns the models
    classifiers_key = ['Naive Bayes', 'Logistic Regression', 'Random Forest',
                       'Gradient Boost', 'AdaBoost', 'Support Vector Machine']
    # Collect one test-set accuracy per fitted model
    accuracies = [accuracy_score(y_test, clf.predict(x_test))
                  for clf in classifier_list]

    return pd.DataFrame({'Model': classifiers_key, 'Accuracy': accuracies})

In [None]:
def model_performance(model, vectorizer):
  
    train_data = vectorizer.fit_transform(x_train)
    test_data = vectorizer.transform(x_test)

    model.fit(train_data, y_train)

    pred = model.predict(test_data)

    return f'{model}', accuracy_score(y_test, pred)

In [None]:
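def griddy(model, params, vectorizer):
  X_train = vectorizer.fit_transform(x_train)
  X_test = vectorizer.transform(x_test)

  grid = GridSearchCV(estimator=model, param_grid=params, n_jobs=-1, cv=6)
  grid.fit(X_train, y_train)

  # GridSearchCV has no accuracy_score/recall_score attributes, so score the
  # refit best estimator on the held-out test set (macro-averaged recall)
  preds = grid.predict(X_test)
  return [grid.best_params_, accuracy_score(y_test, preds),
          recall_score(y_test, preds, average='macro')]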

## Count Vectorization

> In this next section I created baseline models and ran them on different vectorizations of my data. I used a CountVectorizer and a TF-IDF vectorizer with different n-gram ranges to test the potential of the models.

> After taking an iterative modeling approach, I found that one model performed substantially better than the others. Using the Multinomial Naive Bayes model I was able to predict outbreaks with 40 percent accuracy. Part of the reason Naive Bayes worked better than the other models is that it assumes the features are independent of one another given the class, so each word or n-gram count contributes to the prediction on its own. This pairs well with a CountVectorizer, which represents each tweet purely by those counts.
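
To make the pairing of count features and Naive Bayes concrete, here is a small self-contained sketch on made-up sentences. The texts, labels, and the variable names `toy_texts`, `toy_labels`, and `pipe` are assumptions for illustration; the example says nothing about the accuracy reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up examples: 1 = outbreak-related wording, 0 = everyday chatter.
toy_texts = [
    "cases rising fast hospital full",
    "new outbreak cases rising",
    "hospital beds full outbreak",
    "lovely sunny day park",
    "coffee with friends today",
    "sunny walk in the park",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Unigram-to-trigram counts feed directly into Multinomial Naive Bayes.
pipe = make_pipeline(CountVectorizer(ngram_range=(1, 3)), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

preds = pipe.predict(["cases rising in hospital", "sunny day in the park"])
print(preds)
```

Each n-gram contributes its own per-class log-probability, and the model simply adds those contributions up, which is the independence assumption in practice.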

In [None]:
data = df_merged['text_x']
labels = df_merged['relevance']
x_train, x_test, y_train, y_test = train_test_split(data, labels)
tfidfvec = TfidfVectorizer(stop_words='english', tokenizer=tokenize)
tfidfvec2 = TfidfVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,2))
tfidfvec3 = TfidfVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,3))
countvec = CountVectorizer(stop_words='english', tokenizer=tokenize)
countvec2 = CountVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,2))
countvec3 = CountVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,3))

In [None]:
nb  = MultinomialNB()

In [None]:
model_performance(nb, countvec)

('MultinomialNB()', 0.29124860646599776)

In [None]:
forest = RandomForestClassifier(n_estimators=100, max_depth=5)


In [None]:
model_performance(forest, countvec)

In [None]:
#gradboost = GradientBoostingClassifier(random_state=123, max_depth=5, learning_rate = 0.01)
#model_performance(gradboost, countvec)

In [None]:
#adaboost = AdaBoostClassifier(n_estimators=100)
#model_performance(adaboost, countvec)

### Bigram

In [None]:
model_performance(nb, countvec2)

In [None]:
model_performance(forest, countvec2)

In [None]:
#model_performance(gradboost, countvec2)

In [None]:
#model_performance(adaboost, countvec2)

### Trigram

In [None]:
model_performance(nb, countvec3)

In [None]:
model_performance(forest, countvec3)

In [None]:
#model_performance(gradboost, countvec3)

In [None]:
#model_performance(adaboost, countvec3)

## TF-IDF Vectorization



### Unigram

In [None]:
model_performance(nb, tfidfvec)

In [None]:
model_performance(forest, tfidfvec)

In [None]:
#model_performance(gradboost, tfidfvec)

In [None]:
#model_performance(adaboost, tfidfvec)

### Bigram

In [None]:
model_performance(nb, tfidfvec2)

In [None]:
model_performance(forest, tfidfvec2)

In [None]:
#model_performance(gradboost, tfidfvec2)

In [None]:
#model_performance(adaboost, tfidfvec2)

### Trigram

In [None]:
model_performance(nb, tfidfvec3)

In [None]:
model_performance(forest, tfidfvec3)

In [None]:
#model_performance(gradboost, tfidfvec3)

In [None]:
#model_performance(adaboost, tfidfvec3)


## Improving models

> Here I try to improve the best-performing models from the previous cells. The two best performers were Random Forest and Naive Bayes, so I ran grid searches on them to see whether hyperparameter tuning could improve them further.
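
One possible variation, sketched below on made-up data, is to wrap the vectorizer and the model in a scikit-learn `Pipeline` so that the vocabulary is re-learned inside each cross-validation fold and the n-gram range can be tuned alongside `alpha`. The step names `vec` and `nb`, the toy texts, and the small grid are assumptions for illustration, not the settings used in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up examples: 1 = outbreak-related wording, 0 = everyday chatter.
toy_texts = [
    "cases rising fast hospital full",
    "new outbreak cases rising",
    "hospital beds full outbreak",
    "lovely sunny day park",
    "coffee with friends today",
    "sunny walk in the park",
]
toy_labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 3)],
    "nb__alpha": [0.1, 1, 10],
}

# cv=3 only because the toy data is tiny; no n_jobs, so it runs in-process.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best parameter combination, and `best_estimator_` is the pipeline refit on all of the training data.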

In [None]:
X_train = countvec3.fit_transform(x_train)
X_test = countvec3.transform(x_test)

param_grid = {'alpha': [1, 5, 10, 50, 100]}

clf = GridSearchCV(estimator=nb, param_grid=param_grid, n_jobs=-1, cv=6, scoring='accuracy')
clf.fit(X_train, y_train)




A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.



GridSearchCV(cv=6, estimator=MultinomialNB(), n_jobs=-1,
             param_grid={'alpha': [1, 5, 10, 50, 100]}, scoring='accuracy')

In [None]:
clf.best_params_

In [13]:
#params = {'criterion' : ['gini', 'entropy'],
          #'max_depth' : [3, 4, 5, 6, 7, 8],
          #'min_samples_split' : [2, 4, 6, 8, 10],
          #'min_samples_leaf' : [1, 2, 3, 4, 5],
          #'n_estimators' : [10, 25, 50, 100, 150, 200],
          #'bootstrap' : [True, False]}

#griddy(forest, params, countvec3)