## Statistical Model
This notebook contains the code for training the statistical models.

### Structure
- Package Setup
- Preprocessing for labels
- Tokenization
- Model Training
- Metrics

### Setup
Here, I set up the packages and import the data needed for model training.

**Downloaded Packages**
1. SpaCy English library
2. Contextual Spell Check


**Imported Packages**
1. Pandas
2. SpaCy
3. Scikit-learn
4. string
5. re

In [None]:
# Installation of packages and embedding
import sys
!python -m spacy download en_core_web_sm
!pip install contextualSpellCheck

In [None]:
# Import packages
import pandas as pd
import spacy

I imported the data here.

In [None]:
# Import dataset and pandas
raw_trainDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_train.csv")
raw_testDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_test.csv")
raw_trainDF.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [None]:
# Copy the data so the raw DataFrames stay untouched
trainDF = raw_trainDF.copy()
testDF = raw_testDF.copy()

### Reorganize the Data
Based on what I observed during the EDA process, I decided to concatenate the two datasets and re-split them.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split

# Concat the two datasets and split them
allDF = pd.concat((trainDF, testDF), ignore_index=True)

# Split the train, test, validation set
trainDF, testDF = train_test_split(allDF, test_size = 0.2)
testDF, validDF = train_test_split(testDF, test_size = 0.2)

# Print values
print("Train:",len(trainDF), "Test:", len(testDF),"Valid:", len(validDF))

Train: 35964 Test: 7192 Valid: 1799
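The printed sizes follow from holding out 20% twice: 20% of all rows leave the training set, and 20% of that hold-out becomes the validation set, which gives roughly an 80/16/4 split. Below is a minimal sketch of that arithmetic, assuming the held-out share is rounded up (this matches the counts above). `split_sizes` is a hypothetical helper and not part of the notebook.

```python
import math

def split_sizes(n_rows, holdout=0.2, valid_share=0.2):
    """Return (train, test, valid) row counts for two successive hold-out splits."""
    # First split: a share of all rows is held out, rounded up (assumption)
    n_holdout = math.ceil(holdout * n_rows)
    n_train = n_rows - n_holdout
    # Second split: a share of the hold-out becomes the validation set
    n_valid = math.ceil(valid_share * n_holdout)
    n_test = n_holdout - n_valid
    return n_train, n_test, n_valid

train, test, valid = split_sizes(44955)
print("Train:", train, "Test:", test, "Valid:", valid)
```

Passing a fixed `random_state` to `train_test_split` would additionally make the row assignment itself repeatable between runs.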


## Preprocessing
This is the first part of the preprocessing where we create all parts of the pipeline except for the models.

### Structure
- Label encode
- Contextual Spell Check
- Tokenizer
- Pipeline

In [None]:
# Install package for Contextual Spell Check
!pip install contextualSpellCheck
!pip install ipywidgets



### Encode the Labels
I used `scikit-learn`'s `LabelEncoder` to encode the five sentiment labels as integers from 0 to 4.

In [None]:
# Label Encoder for classes in sentiment
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training labels only, then reuse it for the test labels
encoder = LabelEncoder()
trainDF = trainDF.assign(encoded_sentiment=encoder.fit_transform(trainDF.Sentiment))
testDF = testDF.assign(encoded_sentiment=encoder.transform(testDF.Sentiment))

# Show the encoded test labels
testDF.encoded_sentiment.to_numpy()


array([3, 3, 3, ..., 2, 4, 4])
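`LabelEncoder` assigns integers by sorting the distinct labels, so the codes follow alphabetical order rather than sentiment strength. Here is a dependency-free sketch of the same mapping for the five sentiment labels. The helper name `encode_labels` and the sample list are made up for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes

sentiments = ["Neutral", "Positive", "Extremely Negative",
              "Negative", "Extremely Positive", "Positive"]
codes, classes = encode_labels(sentiments)
print(classes)
print(codes)
```

Anything that names the classes later on (for example, a classification report) should list them in this sorted order.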

### Spelling Correction
I use the `contextualSpellCheck` package for fuzzy matching and typo correction.

In [None]:
# Correct Spelling
import contextualSpellCheck

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

# Add contextual spellchecker to the pipeline
nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 5})

# Use the first raw tweet as a sample (trainDF was shuffled, so label 0 may not be in it)
sample_text = raw_trainDF.OriginalTweet[0]
doc = nlp(sample_text)

print(doc._.outcome_spellCheck)

@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2, and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8


### Tokenization Pipeline
This is the entire tokenization pipeline, including URL removal, punctuation removal, spelling correction, and lemmatization.

In [None]:
import string
import re

# Load the English library from SpaCy
nlp = spacy.load("en_core_web_sm")

# Add contextual spell check to pipeline
nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 5})    

# Create list of punctuation marks
punctuations = string.punctuation

# Create set of stopwords from spaCy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

# Remove URLs
def remove_urls(text):
    text = re.sub(r"\S*https?:\S*", "", text, flags=re.MULTILINE)
    return text

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create token object from spacy
    tokens = nlp(sentence)

    # Spelling correction is currently disabled; to enable it, re-parse the corrected text:
    # tokens = nlp(tokens._.outcome_spellCheck)

    # Lemmatize and lowercase each token; proper nouns are lowercased but not lemmatized
    tokens = [word.lemma_.lower().strip() if word.pos_ != "PROPN" else word.lower_ for word in tokens]
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stopwords and word not in punctuations]
    
    # Remove links
    tokens = [remove_urls(word) for word in tokens]
    
    # return preprocessed list of tokens
    return tokens

spacy_tokenizer(sample_text)



['@menyrbie', '@phil_gahan', '@chrisitv', '', 'pa', '', '']
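The empty strings in the output above appear because URLs are removed after tokenization, which leaves blank tokens behind. One possible alternative, sketched here with only the standard library, is to strip URLs from the raw text first and then collapse the leftover whitespace. `strip_urls` is a hypothetical helper that reuses the regular expression from `remove_urls`, and the sample tweet uses made-up links.

```python
import re

# Same pattern as in remove_urls: any whitespace-delimited chunk containing http: or https:
URL_PATTERN = re.compile(r"\S*https?:\S*")

def strip_urls(text):
    """Remove URL-like chunks from raw text and normalize whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

tweet = "@MeNyrbie @Phil_Gahan https://t.co/abc and https://t.co/def"
print(strip_urls(tweet))
```

The cleaned string could then be passed to `nlp(...)`, so that no empty tokens reach the vectorizer.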

### Bag-of-words Model
This is the code for the bag-of-words model using `scikit-learn`'s `CountVectorizer`.

In [None]:
# Bag-of-words data transformation
from sklearn.feature_extraction.text import CountVectorizer
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
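
To make the bag-of-words representation concrete: each document becomes a vector of token counts over a shared vocabulary. Below is a small standard-library sketch of that idea, with a whitespace tokenizer standing in for `spacy_tokenizer` and made-up example sentences. It illustrates the representation only and is not how `CountVectorizer` is implemented.

```python
from collections import Counter

def bag_of_words(documents, tokenizer=str.split):
    """Return (vocabulary, count_matrix) for a list of documents."""
    tokenized = [tokenizer(doc.lower()) for doc in documents]
    # Shared vocabulary, sorted so that the column order is deterministic
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["stock up on food", "food prices up up"]
vocab, counts = bag_of_words(docs)
print(vocab)
print(counts)
```

Each row of `counts` corresponds to one document and each column to one vocabulary term, which is the shape the classifiers below consume.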


### Pipeline

The following cell contains the entire preprocessing pipeline from tokenization to training models.

In [None]:
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

# Custom transformer class using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Implement clean_text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Strip surrounding whitespace and convert text to lowercase
    return text.strip().lower()

# Bag-of-words data transformation
from sklearn.feature_extraction.text import CountVectorizer
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

# Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Create pipeline
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

### Model Training
I first assign the train and test sets, and then train the statistical models (Naive Bayes, Logistic Regression, SVM). Only the first 5,000 training tweets are used for fitting.

In [None]:
X_train = trainDF.OriginalTweet[:5000]
X_test = testDF.OriginalTweet
y_train = trainDF.encoded_sentiment[:5000]
y_test = testDF.encoded_sentiment

In [None]:
# Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Create pipeline using Bag of Words
pipe_NB = Pipeline([('cleaner', predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe_NB.fit(X_train,y_train)

In [None]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier_log = LogisticRegression()

# Create pipeline using Bag of Words
pipe_log = Pipeline([('cleaner', predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier_log)])

# model generation
pipe_log.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f9e9258dd50>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f9e93ab1440>)),
                ('classifier', LogisticRegression())])

In [None]:
# SVM Classifier
from sklearn.svm import SVC
classifier_svm = SVC()

# Create pipeline using Bag of Words
pipe_svm = Pipeline([('cleaner', predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier_svm)])

# model generation
pipe_svm.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f9e8726d9d0>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f9e93ab1440>)),
                ('classifier', SVC())])

### Model Performance
I looked at the precision, recall, and F1 score of each class to measure the performance of each model.
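
For reference, the per-class numbers in the reports below come from simple counts: precision is the share of predictions for a class that are correct, recall is the share of true members of the class that are found, and F1 is their harmonic mean. Here is a minimal sketch with made-up labels. `class_scores` is a hypothetical helper and not part of `scikit-learn`.

```python
def class_scores(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(class_scores(y_true, y_pred, label=1))
```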

In [None]:
# Classification Report
from sklearn.metrics import classification_report

# Predict with a test dataset
predicted = pipe_NB.predict(X_test)

# Model performance (LabelEncoder sorts classes alphabetically, so take the names from the encoder)
print("Naive Bayes Model:\n")
print(classification_report(y_test, predicted, target_names = encoder.classes_))



                    precision    recall  f1-score   support

Extremely Negative       0.63      0.19      0.29       592
Extremely Positive       0.73      0.20      0.31       599
          Negative       0.41      0.54      0.47      1040
           Neutral       0.70      0.15      0.25       614
          Positive       0.35      0.72      0.47       945

          accuracy                           0.41      3790
         macro avg       0.56      0.36      0.36      3790
      weighted avg       0.53      0.41      0.38      3790



In [None]:
# Classification Report
from sklearn.metrics import classification_report

# Predict with a test dataset
predicted_log = pipe_log.predict(X_test)

# Model performance (LabelEncoder sorts classes alphabetically, so take the names from the encoder)
print("Logistic Regression Model:\n")
print(classification_report(y_test, predicted_log, target_names = encoder.classes_))



                    precision    recall  f1-score   support

Extremely Negative       0.64      0.57      0.60       592
Extremely Positive       0.68      0.58      0.63       599
          Negative       0.54      0.52      0.53      1040
           Neutral       0.60      0.68      0.64       614
          Positive       0.53      0.59      0.56       945

          accuracy                           0.58      3790
         macro avg       0.60      0.59      0.59      3790
      weighted avg       0.58      0.58      0.58      3790



In [None]:
# Classification Report
from sklearn.metrics import classification_report

# Predict with a test dataset
predicted_svm = pipe_svm.predict(X_test)

# Model performance (LabelEncoder sorts classes alphabetically, so take the names from the encoder)
print("SVM Model:\n")
print(classification_report(y_test, predicted_svm, target_names = encoder.classes_))

                    precision    recall  f1-score   support

Extremely Negative       0.68      0.44      0.53       592
Extremely Positive       0.73      0.48      0.58       599
          Negative       0.52      0.53      0.52      1040
           Neutral       0.58      0.70      0.63       614
          Positive       0.49      0.63      0.55       945

          accuracy                           0.56      3790
         macro avg       0.60      0.55      0.56      3790
      weighted avg       0.58      0.56      0.56      3790


