This NLP project uses the following Kaggle dataset: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification/notebooks

It consists of tweets surrounding COVID-19. The data is already split into train and test sets with pre-labeled sentiments.

I will train a sentiment classifier on these tweets, using CountVectorizer to turn the text into features.
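To make the idea concrete, here is a minimal pure-Python sketch of what a count (bag-of-words) vectorizer computes. It is not scikit-learn's implementation: the two sample sentences, the whitespace tokenization, and the names `docs`, `vocab`, and `count_vector` are assumptions for illustration only, and CountVectorizer applies its own tokenization and lowercasing by default.

```python
from collections import Counter

# Two made-up documents (illustrative only, not taken from the dataset).
docs = ["stay safe stay home", "panic buying at the supermarket"]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({token for doc in docs for token in doc.split()})

def count_vector(doc):
    """Return one count per vocabulary term for a single document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

# One row per document, one column per vocabulary term.
matrix = [count_vector(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each tweet becomes a row of term counts, and the classifier is trained on those rows.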

In [1]:
import re
import nltk
import spacy
import sklearn
import warnings
import markovify
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

warnings.filterwarnings("ignore")
!python -m spacy download en
nltk.download('gutenberg')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✘ Couldn't link model to 'en'
Creating a symlink in spacy/data failed. Make sure you have the required
permissions and try re-running the command as admin, or use a virtualenv. You
can still import the model as a module and call its load() method, or create the
symlink manually.
C:\Users\c\anaconda3\envs\machinelearning\lib\site-packages\en_core_web_sm -->
C:\Users\c\anaconda3\envs\machinelearning\lib\site-packages\spacy\data\en
⚠ Download successful but linking failed
Creating a shortcut link for 'en' didn't work (maybe you don't have admin
permissions?), but you can still load the model via its full package name: nlp =
spacy.load('en_core_web_sm')
You do not have sufficient privilege to perform this operation.
[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\c\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

## Importing Tweet Data

In [2]:
train = pd.read_csv("Corona_NLP_train.csv", encoding="ISO-8859-1")
test = pd.read_csv("Corona_NLP_test.csv", encoding="ISO-8859-1")

In [3]:
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


Only *OriginalTweet* will be used to build the classifier.

In [4]:
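def preprocess(docs):
    """Drop stop words and non-alphabetic tokens, then lemmatize and stem."""
    lemmatizer = WordNetLemmatizer()
    stemmer = SnowballStemmer("english")
    # Build the stop-word set once; calling stopwords.words() for every token is very slow.
    stop_words = set(stopwords.words("english"))
    preprocessed = []

    for doc in docs:
        tokenized = word_tokenize(doc)
        cleaned = [
            stemmer.stem(lemmatizer.lemmatize(token.lower()))
            for token in tokenized
            if token.isalpha() and token.lower() not in stop_words
        ]

        # Rejoin the tokens so the vectorizer receives one string per tweet.
        preprocessed.append(" ".join(cleaned))

    return preprocessed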

## Using *preprocess*

In [171]:
train_processed = preprocess(train.OriginalTweet)
test_processed = preprocess(test.OriginalTweet)

['menyrbi', 'chrisitv', 'http', 'http', 'http']
['advic', 'talk', 'neighbour', 'famili', 'exchang', 'phone', 'number', 'creat', 'contact', 'list', 'phone', 'number', 'neighbour', 'school', 'employ', 'chemist', 'gp', 'set', 'onlin', 'shop', 'account', 'po', 'adequ', 'suppli', 'regular', 'med', 'order']
['coronavirus', 'australia', 'woolworth', 'give', 'elder', 'disabl', 'dedic', 'shop', 'hour', 'amid', 'outbreak', 'http']
['food', 'stock', 'one', 'empti', 'pleas', 'panic', 'enough', 'food', 'everyon', 'take', 'need', 'stay', 'calm', 'stay', 'safe', 'coronavirus', 'confin', 'confinementot', 'confinementgener', 'http']
['readi', 'go', 'supermarket', 'outbreak', 'paranoid', 'food', 'stock', 'litterali', 'empti', 'coronavirus', 'serious', 'thing', 'pleas', 'panic', 'caus', 'shortag', 'coronavirusfr', 'restezchezv', 'stayathom', 'confin', 'http']
['news', 'first', 'confirm', 'case', 'came', 'sullivan', 'counti', 'last', 'week', 'peopl', 'flock', 'area', 'store', 'purchas', 'clean', 'suppli',

KeyboardInterrupt: 

## Setting Train and Test sets
Before going further, I want to check the sizes of the processed data.

In [98]:
print(f"Train preprocessed: {len(train_processed)}")
print(f"Test preprocessed: {len(test_processed)}")

Train preprocessed: 41157
Test preprocessed: 3798


In [102]:
print(len(train.Sentiment))
print(len(test.Sentiment))

41157
3798


In [54]:
# training data
X_train = train_processed
y_train = train.Sentiment
# test data
X_test = test_processed
y_test = test.Sentiment

## Pipeline with *Count Vectorizer*, *TF-IDF Transformer*, and *Logistic Regression*

In [116]:
model = Pipeline(
    [
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegression(n_jobs=-1))
    ], verbose=1)

model.fit(X_train, y_train)

[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.5s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   4.3s


Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(n_jobs=-1))],
         verbose=1)

In [120]:
model.score(X_test, y_test)

0.5626645602948921

In [109]:
pred = model.predict(X_test)
print(classification_report(y_test, pred))

                    precision    recall  f1-score   support

Extremely Negative       0.63      0.47      0.54       592
Extremely Positive       0.68      0.57      0.62       599
          Negative       0.52      0.53      0.53      1041
           Neutral       0.61      0.66      0.63       619
          Positive       0.50      0.59      0.54       947

          accuracy                           0.56      3798
         macro avg       0.59      0.56      0.57      3798
      weighted avg       0.57      0.56      0.56      3798



## Using CV

In [112]:
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='f1_macro')
scores.mean()

0.5771650252839542
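
For reference, the `f1_macro` score is the unweighted mean of the per-class F1 scores. The sketch below computes it by hand in pure Python; the function name `macro_f1` and the toy labels are assumptions for illustration, not part of the notebook's pipeline. As a sanity check, it also averages the per-class F1 values from the classification report above.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (a class with no hits scores 0)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted this class, but it was wrong
            fn[truth] += 1   # missed the true class
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels (hypothetical, for illustration only).
toy_true = ["Positive", "Negative", "Neutral", "Positive"]
toy_pred = ["Positive", "Neutral", "Neutral", "Negative"]
toy_score = macro_f1(toy_true, toy_pred)
print(toy_score)

# Per-class F1 values copied from the classification report above.
report_f1 = [0.54, 0.62, 0.53, 0.63, 0.54]
report_macro = sum(report_f1) / len(report_f1)
print(round(report_macro, 2))
```

Because every class counts equally, a weak class such as Negative pulls the macro average down regardless of how many tweets it contains.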

With the basic preprocessing above, the model performs poorly: about 0.56 accuracy on the test set and about 0.58 macro F1 under 10-fold cross-validation. I suspect this is partly because the preprocessing does not handle Twitter-specific elements such as retweets, mentions, hashtags, and possibly emojis. To test that assumption, I will preprocess the tweets with those elements in mind. The current state of X_train supports the suspicion.

In [150]:
X_train[0:9]

['menyrbi chrisitv http http http',
 'advic talk neighbour famili exchang phone number creat contact list phone number neighbour school employ chemist gp set onlin shop account po adequ suppli regular med order',
 'coronavirus australia woolworth give elder disabl dedic shop hour amid outbreak http',
 'food stock one empti pleas panic enough food everyon take need stay calm stay safe coronavirus confin confinementot confinementgener http',
 'readi go supermarket outbreak paranoid food stock litterali empti coronavirus serious thing pleas panic caus shortag coronavirusfr restezchezv stayathom confin http',
 'news first confirm case came sullivan counti last week peopl flock area store purchas clean suppli hand sanit food toilet paper good report http',
 'cashier groceri store share insight prove credibl comment civic class know talk http',
 'supermarket today buy toilet paper rebel toiletpapercrisi http',
 'due retail store classroom atlanta open busi class next two week begin monday ma

Remnants of the mentions and hyperlinks are still there (for example, 'menyrbi', 'chrisitv', and 'http' in the first tweet). As for the hashtags, I want to preserve them as much as possible, since some of them relate to COVID-19.
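
One possible way to handle these elements, sketched here under assumptions, is to replace URLs and mentions with placeholder tokens while keeping the hashtag text. The function name `tag_tweet`, the placeholder words `urltoken` and `mentiontoken`, and the sample tweet are all hypothetical; substituting a plain space instead of a placeholder would drop the elements entirely.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def tag_tweet(text):
    """Replace URLs and mentions with placeholders; keep the hashtag word."""
    # Handle URLs first so that '@' or '#' inside a link is not matched later.
    text = URL_RE.sub(" urltoken ", text)
    text = MENTION_RE.sub(" mentiontoken ", text)
    # Keep the word after '#' and drop only the symbol.
    text = HASHTAG_RE.sub(r" \1 ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())

# Made-up tweet for illustration.
sample = "@MeNyrbie check https://t.co/abc #StayAtHome now"
print(tag_tweet(sample))
```

Placeholders let the vectorizer see that a tweet contained a link or a mention without adding thousands of one-off usernames and short links to the vocabulary.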

## A Twitter-focused Preprocessor

In [50]:
word_tokenize(train.OriginalTweet[0])

['@',
 'MeNyrbie',
 '@',
 'Phil_Gahan',
 '@',
 'Chrisitv',
 'https',
 ':',
 '//t.co/iFz9FAn2Pa',
 'and',
 'https',
 ':',
 '//t.co/xX6ghGFzCC',
 'and',
 'https',
 ':',
 '//t.co/I2NlzdxNo8']

*word_tokenize* does not work well with tweets: it splits each mention into '@' and the username, and it breaks every URL into several pieces. So I will write my own preprocessor with some regex.

In [51]:
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess_v2(s, lowercase=False):
    preprocessed = []
    for doc in s:
        tokens = tokenize(doc)

        if lowercase:
            tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]

        untokenized = " ".join(tokens)
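        # Raw string avoids invalid-escape warnings; drops mentions, URLs, and non-alphanumeric characters.
        filtered = ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", untokenized).split())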
        preprocessed.append(filtered)
        
    return preprocessed

In [52]:
n_X_train = preprocess_v2(train.OriginalTweet)
n_X_test = preprocess_v2(test.OriginalTweet)

## Rerunning the Pipeline Model

In [55]:
model.fit(n_X_train, y_train)
model.score(n_X_test, y_test)

[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.8s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   6.1s


0.5703001579778831

In [57]:
n_pred = model.predict(n_X_test)
print(classification_report(y_test, n_pred))

                    precision    recall  f1-score   support

Extremely Negative       0.64      0.46      0.53       592
Extremely Positive       0.67      0.51      0.58       599
          Negative       0.53      0.56      0.54      1041
           Neutral       0.65      0.68      0.66       619
          Positive       0.50      0.63      0.56       947

          accuracy                           0.57      3798
         macro avg       0.60      0.56      0.57      3798
      weighted avg       0.58      0.57      0.57      3798



In [58]:
n_scores = cross_val_score(model, n_X_train, y_train, cv=10, scoring='f1_macro')
n_scores.mean()

[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.3s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.1s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.1s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   5.1s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.7s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipel

0.574792015727683

The model trained on the filtered data scored slightly better on the test set (accuracy 0.570 vs. 0.563), but its cross-validated macro F1 is essentially unchanged (0.575 vs. 0.577), so the difference is too small to end the project here. I plan to tune the model further over the week. For now, the filtered data serves as the starting point.