# Sentiment Analysis of COVID-19 Tweets: When did the Public Panic Set In? Part 4: Supervised Classification Modeling

    Notebook by Allison Kelly - allisonkelly42@gmail.com
    
This notebook is preceded by parts <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/tweet-scraping/Twitter-API-Scraping.ipynb">1</a>, <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/text-processing/NLP-Text-Processing.ipynb">2</a> and <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/EDA/tweet-EDA.ipynb">3</a>. Part 4 focuses on the modeling portion, but is still very much in its infancy. Markdown cells and complete documentation to come.

# Imports

In [46]:
import pandas as pd

from gensim.models import Word2Vec
from nltk import word_tokenize
from ast import literal_eval

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


# Data

The training data comes from Sentiment140, which started as a class project at Stanford University. It is a set of 1,600,000 tweets categorized by polarity, where 0 is negative, 2 is neutral, and 4 is positive. The polarity of each tweet was determined by the emoticons it contained: 'happy' emoticons marked a tweet as positive and 'unhappy' emoticons marked it as negative. You can find more information and the data download <a href="http://help.sentiment140.com/for-students">here.</a>
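As a small illustration of the label scheme described above, the polarity codes can be mapped to readable names. This is only a sketch: the names `POLARITY_LABELS` and `label_name` are hypothetical and do not appear elsewhere in this notebook.

```python
# Hypothetical helper: the codes 0, 2 and 4 come from the dataset description above.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def label_name(polarity):
    """Return the readable name for a Sentiment140 polarity code."""
    return POLARITY_LABELS[polarity]


# Translate a few raw codes into names.
print([label_name(p) for p in (0, 4)])
```

A mapping like this can make value counts and plots easier to read than the raw codes.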

In [4]:
train_tweets = pd.read_csv('Data/processed_train.csv', 
                       usecols=['polarity', 'processed_tweets'],
                       # Converting string to list
                       converters={"processed_tweets": literal_eval})
train_tweets.head()

Unnamed: 0,polarity,processed_tweets
0,0,"[switchfoot, httptwitpiccom, 2y1zl, awww, that..."
1,0,"[upset, cant, update, facebook, texting, might..."
2,0,"[kenichan, dived, many, time, ball, managed, s..."
3,0,"[whole, body, feel, itchy, like, fire]"
4,0,"[nationwideclass, behaving, im, mad, cant, see]"


In [33]:
print(f"Number of tweets in dataset: {len(train_tweets)}\n")
print(f"Distribution of target classes: \n{train_tweets.polarity.value_counts()}")

Number of tweets in dataset: 1600000

Distribution of target classes: 
4    800000
0    800000
Name: polarity, dtype: int64


The dataset contains 1.6 million tweets, evenly distributed between negative and positive sentiment. There are none in the neutral class.

Now we'll split it into training and test sets. Modeling the full dataset would be too computationally expensive, so we'll take a random sample of 25,000 tweets before we do the split.

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
train_sample = train_tweets.sample(n=25000, random_state = 42)
train_sample.polarity.value_counts()

4    12529
0    12471
Name: polarity, dtype: int64

The distribution of classes is nearly equal, which is great: the model won't be skewed toward a majority class, and accuracy is a reasonable metric for evaluating it.
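A plain random sample lands close to balanced here, but not exactly. If exact balance were ever needed, one option is a stratified sample that draws the same number of rows from each class. The sketch below uses only the standard library on toy data; `stratified_sample` and its arguments are hypothetical names, not part of this notebook's pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, n_per_class, seed=42):
    """Draw n_per_class rows from every class, then shuffle the result."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for label in sorted(by_label):
        # rng.sample draws without replacement within each class
        sample.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(sample)
    return sample


# Toy data: (polarity, text) pairs with 50 rows per class.
rows = [(0, f"neg tweet {i}") for i in range(50)]
rows += [(4, f"pos tweet {i}") for i in range(50)]

sample = stratified_sample(rows, label_of=lambda r: r[0], n_per_class=10)
print(len(sample))
```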

In [7]:
X_train, X_test, y_train, y_test = train_test_split(train_sample['processed_tweets'], 
                                                    train_sample['polarity'], 
                                                    test_size=.20, 
                                                    random_state=1)

In [8]:
# Flattening the tokenized tweets into a single list of all words
all_words_list = [item for sublist in X_train for item in sublist]

In [9]:
# Getting unique words
total_vocab = set(all_words_list)

In [44]:
print(f"Total number of words in all tweets: {len(all_words_list)}")
print(f"Total unique words: {len(total_vocab)}")
print(f"Percentage of words that are unique: {round(len(total_vocab)/len(all_words_list),4)*100}")

Total number of words in all tweets: 154984
Total unique words: 31349
Percentage of words that are unique: 20.23
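The percentage above is the ratio of unique words to total words. The same idea can be sketched on a toy corpus with `collections.Counter`, which also makes it easy to see how many words occur only once. The toy tweets and variable names below are made up for illustration.

```python
from collections import Counter

# Toy tokenized tweets standing in for X_train
toy_tweets = [["love", "good", "day"], ["hate", "bad", "day"], ["good", "day"]]

# Flatten to one token list, then count occurrences of each word
tokens = [tok for tweet in toy_tweets for tok in tweet]
counts = Counter(tokens)

# Unique words divided by total words
unique_ratio = len(counts) / len(tokens)

# Words that appear exactly once in the corpus
singletons = sorted(word for word, c in counts.items() if c == 1)

print(f"{len(tokens)} tokens, {len(counts)} unique, ratio {unique_ratio:.3f}")
print(f"Words seen once: {singletons}")
```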


# Generating Word Embeddings with Word2Vec

We've got our tokenized tweets split into training and test sets, but they're still just lists of words. To model them we need numeric representations. Word2Vec trains a shallow neural network that maps each word to its own vector in a high-dimensional embedding space. Words that appear in similar contexts end up close together, so we can use the proximity between vectors to identify semantic relationships between words.

We'll instantiate the Word2Vec model with five parameters (descriptions from the gensim documentation):

* The training data (called 'sentences' in the documentation)
* size : Dimensionality of the word vectors (renamed vector_size in gensim 4.0 and later).
* window : Maximum distance between the current and predicted word within a sentence.
* min_count : Ignores all words with total frequency lower than this.
* workers : Use these many worker threads to train the model (=faster training with multicore machines).
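To make `window` and `min_count` more concrete, here is a simplified, standard-library sketch of how (center word, context word) pairs could be formed from tokenized tweets. It is an illustration only, not gensim's implementation: real training adds refinements that are not shown here, and `context_pairs` is a hypothetical name.

```python
from collections import Counter


def context_pairs(sentences, window=2, min_count=1):
    """List (center, context) pairs within `window` positions of each word."""
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Drop words rarer than min_count before pairing
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


toy = [["whole", "body", "feel", "itchy"]]
for pair in context_pairs(toy, window=1):
    print(pair)
```

A larger `window` pairs each word with more distant neighbors, while a higher `min_count` removes rare words from the pairing altogether.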

In [47]:
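# Passing the corpus to the constructor builds the vocabulary and runs an initial round of training
model = Word2Vec(X_train, size=100, window=5, min_count=1, workers=4)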

In [48]:
# Continue training on the same corpus for 10 more epochs
model.train(X_train, total_examples=model.corpus_count, epochs=10)

(1461889, 1549840)

# Initial TF-IDF Vectorization

In [11]:
vectorizer = TfidfVectorizer()

In [36]:
# TfidfVectorizer expects strings, so join each token list back into a single string
train_tweet_list = X_train.apply(' '.join)
test_tweet_list = X_test.apply(' '.join)

In [37]:
train_tweet_list

1485                      genmarie hope fix california least
366259           httptwitpiccom 6pa1n old pic cute miss hair
983258              spicyguy haha ah yes way experimentation
1435115            omgosh lt3 photo httpbitlyzwrjc julesanne
1144081                        finished watching icarly rock
                                 ...                        
1158917    katzy love good car boot sale loved buying old...
355437     jclima hear much fedora every time try im disa...
1147344                goodnight everyone god good love much
310681                     hate say im kind impressed window
182695                          aint aliveim breathing death
Name: processed_tweets, Length: 20000, dtype: object

In [38]:
tfidf_train = vectorizer.fit_transform(train_tweet_list)
tfidf_test = vectorizer.transform(test_tweet_list)

In [39]:
tfidf_train.shape

(20000, 31227)
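For intuition about what the vectorizer computes, here is a minimal standard-library sketch of TF-IDF weighting. It assumes the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = [["good", "day"], ["bad", "day"]]
vectors = tfidf(docs)
print(vectors[0])
```

In this toy corpus "day" appears in both documents, so it receives a lower weight than "good" or "bad", which each appear in only one.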

# Naive Bayes

In [15]:
nb_classifier = MultinomialNB()
# rf_classifier = RandomForestClassifier(n_estimators=5)

In [16]:
nb_classifier.fit(tfidf_train, y_train)
nb_train_preds = nb_classifier.predict(tfidf_train)
nb_test_preds = nb_classifier.predict(tfidf_test)

In [17]:
nb_train_score = accuracy_score(y_train, nb_train_preds)
nb_test_score = accuracy_score(y_test, nb_test_preds)
print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.9993 		 Testing Accuracy: 0.4968
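With two nearly balanced classes, always predicting a single class would already score about 50%, so a testing accuracy of 0.4968 is at chance level despite the near-perfect training accuracy. A quick way to keep that reference point in view is a majority-class baseline. The sketch below uses toy labels, and `majority_baseline_accuracy` is a hypothetical helper rather than part of the pipeline above.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    hits = sum(1 for y in y_test if y == majority_label)
    return hits / len(y_test)


# Toy labels using the same 0/4 polarity codes as the dataset
toy_y_train = [0, 0, 4, 4, 4]
toy_y_test = [0, 4, 4, 0]

baseline = majority_baseline_accuracy(toy_y_train, toy_y_test)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model is only learning something that generalizes if its testing accuracy clearly beats this baseline.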


# Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression

solvers = ['sag','saga','lbfgs']
C = [.001, .01, .1, 1, 10, 100,1000]

for solver in solvers:
    for c in C:
        lr_classifier = LogisticRegression(verbose=0, solver=solver,C=c, random_state=0)
        lr_model = lr_classifier.fit(tfidf_train, y_train)
    
        lr_train_score = lr_model.score(tfidf_train, y_train)
        lr_test_score = lr_model.score(tfidf_test, y_test)
    
        print(f"Logistic Regression with {solver} Solver and C={c}")
        print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}\n".format(lr_train_score, lr_test_score))

Logistic Regression with sag Solver and C=0.001
Training Accuracy: 0.5038 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with sag Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with sag Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with sag Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with sag Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=0.001
Training Accuracy: 0.5036 		 Testing Accuracy: 0.4964

Logistic Regression with saga Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with saga Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with saga Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with saga Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=0.001
Training Accuracy: 0.5038 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=0.01
Training Accuracy: 0.9988 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=0.1
Training Accuracy: 0.9989 		 Testing Accuracy: 0.4968

Logistic Regression with lbfgs Solver and C=1
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=10
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression with lbfgs Solver and C=100
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106

Logistic Regression with lbfgs Solver and C=1000
Training Accuracy: 0.9995 		 Testing Accuracy: 0.5106



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# TensorFlow/Keras

In [22]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.3.1-cp38-cp38-macosx_10_14_x86_64.whl (165.2 MB)

In [23]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [17]:
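# Converting the sparse TF-IDF matrix to dense form.
# At 20000 x 31227 float64 values, this needs roughly 5 GB of memory.
dense_matrix = tfidf_train.todense()
# Converting to nested Python lists needs considerably more memory still
dense_list = dense_matrix.tolist()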

In [16]:
# AUTOTUNE = tf.data.experimental.AUTOTUNE

# tfidf_train = tfidf_train.cache().prefetch(buffer_size=AUTOTUNE)

# tfidf_test = tfidf_test.cache().prefetch(buffer_size=AUTOTUNE)