# Sentiment Analysis of COVID-19 Tweets: When did the Public Panic Set In? Part 4: Supervised Classification Modeling

    Notebook by Allison Kelly - allisonkelly42@gmail.com
    
This notebook is preceded by parts <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/tweet-scraping/Twitter-API-Scraping.ipynb">1</a>, <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/text-processing/NLP-Text-Processing.ipynb">2</a> and <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/EDA/tweet-EDA.ipynb">3</a>. Part 4 focuses on modeling but is still very much in its infancy. Markdown cells and complete documentation are to come.

# Imports

In [1]:
import pandas as pd

from gensim.models import word2vec
from nltk import word_tokenize
from ast import literal_eval

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


# Data

In [2]:
train_tweets = pd.read_csv('Data/processed_train.csv', 
                       usecols=['polarity', 'processed_tweets'],
                       # Converting string to list
                       converters={"processed_tweets": literal_eval})
train_tweets.head()

|   | polarity | processed_tweets |
|---|----------|------------------|
| 0 | 0 | [switchfoot, httptwitpiccom, 2y1zl, awww, that... |
| 1 | 0 | [upset, cant, update, facebook, texting, might... |
| 2 | 0 | [kenichan, dived, many, time, ball, managed, s... |
| 3 | 0 | [whole, body, feel, itchy, like, fire] |
| 4 | 0 | [nationwideclass, behaving, im, mad, cant, see] |


In [3]:
from sklearn.model_selection import train_test_split

In [4]:
train_sample = train_tweets.sample(n=25000, random_state = 42)
train_sample.polarity.value_counts()

4    12529
0    12471
Name: polarity, dtype: int64

In [5]:
X_train, X_test, y_train, y_test = train_test_split(train_sample['processed_tweets'], 
                                                    train_sample['polarity'], 
                                                    test_size=.20, 
                                                    random_state=1)

In [6]:
all_words_list = [item for sublist in X_train for item in sublist]

In [7]:
total_vocab = set(all_words_list)

In [8]:
print(len(all_words_list))
len(total_vocab)

154984


31349

# Initial TF-IDF Vectorization

In [32]:
vectorizer = TfidfVectorizer()

In [33]:
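# Join tokens with a single space so TfidfVectorizer can split each tweet back into words.
# An empty-string join fuses every tweet into one long token; the outputs recorded
# in this notebook appear to come from that version.
train_tweet_list = X_train.apply(' '.join)
test_tweet_list = X_test.apply(' '.join)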

In [34]:
tfidf_train = vectorizer.fit_transform(train_tweet_list)
tfidf_test = vectorizer.transform(test_tweet_list)

In [28]:
tfidf_train.shape

(20000, 20269)
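
How the token lists are joined before vectorizing determines what `TfidfVectorizer` treats as a word. The following is a minimal sketch comparing an empty-string join with a single-space join under the default tokenizer. The `toy_tweets` data is hypothetical and only stands in for the processed token lists.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the processed token lists
toy_tweets = pd.Series([
    ["whole", "body", "feel", "itchy"],
    ["feel", "like", "fire"],
    ["cant", "update", "facebook"],
])

# Empty-string join: each tweet collapses into one long "word"
fused = toy_tweets.apply("".join)
# Single-space join: word boundaries survive for the tokenizer
spaced = toy_tweets.apply(" ".join)

fused_vocab = TfidfVectorizer().fit(fused).vocabulary_
spaced_vocab = TfidfVectorizer().fit(spaced).vocabulary_

print(sorted(fused_vocab))   # one feature per tweet
print(sorted(spaced_vocab))  # one feature per distinct word
```

A fused token almost never reappears in unseen tweets, so a feature count that tracks the document count (20,269 features for 20,000 tweets in the shape above) is a hint worth checking.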

# Naive Bayes

In [13]:
nb_classifier = MultinomialNB()
# rf_classifier = RandomForestClassifier(n_estimators=5)

In [14]:
nb_classifier.fit(tfidf_train, y_train)
nb_train_preds = nb_classifier.predict(tfidf_train)
nb_test_preds = nb_classifier.predict(tfidf_test)

In [15]:
nb_train_score = accuracy_score(y_train, nb_train_preds)
nb_test_score = accuracy_score(y_test, nb_test_preds)
print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.9993 		 Testing Accuracy: 0.4968
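
A test accuracy near 0.50 on a roughly balanced two-class sample is about what a constant prediction would score. One way to check this, sketched here on hypothetical toy labels rather than the notebook's data, is to compare against scikit-learn's `DummyClassifier` and inspect the confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels using the same 0 / 4 polarity coding
y_train_toy = np.array([0, 0, 0, 4, 4, 4, 4])
y_test_toy = np.array([0, 0, 4, 4])
# DummyClassifier ignores the features; zeros are placeholders
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_preds = baseline.predict(X_test_toy)

baseline_acc = accuracy_score(y_test_toy, baseline_preds)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test_toy, baseline_preds, labels=[0, 4])
print(baseline_acc)
print(cm)
```

If a trained model's confusion matrix has a single non-zero column like this baseline's, the model is predicting one class for every test tweet.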


# Logistic Regression

In [48]:
from sklearn.linear_model import LogisticRegression

solvers = ['sag','saga','lbfgs']
C = [.001, .01, .1, 1, 10, 100,1000]

for solver in solvers:
    for c in C:
        lr_classifier = LogisticRegression(verbose=0, solver=solver,C=c, random_state=0)
        lr_model = lr_classifier.fit(tfidf_train, y_train)
    
        lr_train_score = lr_model.score(tfidf_train, y_train)
        lr_test_score = lr_model.score(tfidf_test, y_test)
    
        print(f"Logistic Regression with {solver} Solver and C={c}")
        print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}\n".format(lr_train_score, lr_test_score))

Logistic Regression with sag Solver and C=0.001
Training Accuracy: 0.5038 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with sag Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106





Logistic Regression with sag Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106





Logistic Regression with sag Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=0.001
Training Accuracy: 0.5036 		 Testing Accuracy: 0.4964

Logistic Regression with saga Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with saga Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with saga Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106





Logistic Regression with saga Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106





Logistic Regression with saga Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=0.001
Training Accuracy: 0.5038 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression with lbfgs Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
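
The convergence warnings above suggest raising `max_iter`, and the nested loops select `C` by looking at test accuracy directly. A hedged alternative, sketched on a hypothetical toy corpus, wraps the vectorizer and classifier in a `Pipeline` and lets `GridSearchCV` choose `C` by cross-validation on the training data only. The corpus, the grid values, and `max_iter=1000` are illustrative assumptions, not settings from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with the 0 / 4 polarity coding
texts = ["feel sad upset", "so mad today", "cant sleep upset", "bad day sad",
         "love this day", "happy great fun", "great love happy", "fun day great"]
labels = [0, 0, 0, 0, 4, 4, 4, 4]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # max_iter raised above the default of 100 to give lbfgs room to converge
    ("lr", LogisticRegression(solver="lbfgs", max_iter=1000, random_state=0)),
])

# Step-name prefix "lr__" routes the parameter to the classifier
param_grid = {"lr__C": [0.01, 1, 100]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the held-out fold never leaks into the vocabulary.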


# Tensorflow/Keras

In [16]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [17]:
dense_matrix = tfidf_train.todense()
dense_list = dense_matrix.tolist()
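
Calling `.todense()` and then `.tolist()` on the full 20,000 × 20,269 matrix materializes roughly 405 million values, most of them zeros. One alternative, sketched under the assumption that downstream code can consume NumPy batches, is to densify a few rows at a time. `dense_batches` is a hypothetical helper and `toy_sparse` a random stand-in for `tfidf_train`.

```python
import numpy as np
from scipy import sparse

def dense_batches(matrix, batch_size):
    """Yield consecutive row blocks of a sparse matrix as dense float32 arrays."""
    for start in range(0, matrix.shape[0], batch_size):
        block = matrix[start:start + batch_size]
        yield block.toarray().astype(np.float32)

# Hypothetical stand-in for tfidf_train: 10 rows, 50 features, 10% non-zero
toy_sparse = sparse.random(10, 50, density=0.1, format="csr", random_state=0)

batches = list(dense_batches(toy_sparse, batch_size=4))
print([b.shape for b in batches])
```

Only one batch is dense at any moment, so peak memory scales with `batch_size` rather than with the number of tweets.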

In [16]:
# AUTOTUNE = tf.data.experimental.AUTOTUNE

# tfidf_train = tfidf_train.cache().prefetch(buffer_size=AUTOTUNE)

# tfidf_test = tfidf_test.cache().prefetch(buffer_size=AUTOTUNE)