## Complete Guide to Perform Classification of Tweets with SpaCy
This notebook contains all the code needed to implement sentiment classification of tweets using SpaCy.

### Structure
- Package Setup
- Pandas Profiling
- EDA
- Sample Text
- Pipeline
- Statistical Language Models
- Neural Language Models
- Comparison
- Conclusion

### Setup
Here, I set up the packages and import the data needed for this part of the exploration.

**Downloaded Packages**
1. SpaCy English library
2. SpaCy English Roberta-based library
3. Pandas profiling
4. Contextual Spell Check
5. SpaCy-Transformer


**Imported Packages**
1. Pandas
2. Matplotlib.pyplot
3. SpaCy
4. Scikit-learn
5. string
6. re

In [None]:
# Installation of packages and pipeline
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf
!pip install pandas-profiling[notebook]
# The spaCy extra for transformer support is named "transformers"
!pip install spacy[transformers]


In [None]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import spacy



I imported the data here.

In [None]:
# Import dataset and pandas
raw_trainDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_train.csv")
raw_testDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_test.csv")
raw_trainDF.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [None]:
# Copy the data for further use so the raw dataframes stay unchanged
trainDF = raw_trainDF.copy()
testDF = raw_testDF.copy()

### Pandas Profiling visualization
I used the pandas profiling package to generate visualizations as a starting point for the EDA process.

In [None]:
# Generate a pandas profiling report
from pandas_profiling import ProfileReport
profile_train = ProfileReport(raw_trainDF, title="Pandas Profiling Report (Train)")
profile_train.to_file(output_file="/work/reports/NLP_COVID_EDA_train_Report.html")
profile_train




In [None]:
profile_test = ProfileReport(raw_testDF, title="Pandas Profiling Report (Test)")
profile_test.to_file(output_file="/work/reports/NLP_COVID_EDA_test_Report.html")
profile_test




### Visualization for the EDA
Visualization generated during the EDA process.

In [None]:
# Visualization of the TweetAt categories for train set
# Take labels and counts from the same grouped Series so each bar lines up with its date
sentiments = trainDF.groupby(by = ['TweetAt']).count()['Sentiment']
time = sentiments.index

fig = plt.figure(figsize=(15,7))
plt.xticks(rotation=70)
plt.bar(time,sentiments)
plt.xlabel("Date")
plt.ylabel("Count")
plt.title("Number of Tweets by Date (Train)")
plt.show()

In [None]:
# Visualization of the TweetAt categories for test set
# Take labels and counts from the same grouped Series so each bar lines up with its date
sentiments = testDF.groupby(by = ['TweetAt']).count()['Sentiment']
time = sentiments.index

fig = plt.figure(figsize=(15,7))
plt.xticks(rotation=70)
plt.bar(time,sentiments)
plt.xlabel("Date")
plt.ylabel("Count")
plt.title("Number of Tweets by Date (Test)")
plt.show()

In [None]:
# Create visualization of the sentiment category
sent_names = list(set(trainDF.Sentiment.values))
train_sent_cats = []
test_sent_cats = []

for name in sent_names:
    train_sent_cats.append(trainDF.Sentiment.value_counts()[name])
    test_sent_cats.append(testDF.Sentiment.value_counts()[name])

fig, ax = plt.subplots(1,2, figsize = (10,5))
ax[0].pie(train_sent_cats, labels = sent_names, autopct='%1.1f%%')
ax[0].set_title("Sentiment Categories in Train Data")
ax[1].pie(test_sent_cats, labels = sent_names, autopct='%1.1f%%')
ax[1].set_title("Sentiment Categories in Test Data")
plt.show()



### Reorganized the data
Based on the observations from the EDA process, I decided to concatenate the two datasets and re-split them into train, test, and validation sets.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split

# Concat the two datasets and split them
allDF = pd.concat((trainDF, testDF), ignore_index=True)

# Sample dataset due to the large size
allDF = allDF.sample(frac=0.5).reset_index(drop=True)

# Split the train, test, validation set
trainDF, testDF = train_test_split(allDF, test_size = 0.2)
testDF, validDF = train_test_split(testDF, test_size = 0.2)

# Print values
print("Train:",len(trainDF), "Test:", len(testDF),"Valid:", len(validDF))

Train: 35964 Test: 7192 Valid: 1799
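
The split above draws rows at random, so the class proportions and the exact rows can change between runs. The following is a minimal sketch of a reproducible, stratified variant of the same two-step split. The toy dataframe, the `random_state` value, and the `*_part` variable names are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the concatenated tweets (hypothetical data)
labels = ["Positive", "Negative", "Neutral", "Extremely Positive", "Extremely Negative"]
toy = pd.DataFrame({
    "OriginalTweet": [f"tweet {i}" for i in range(500)],
    "Sentiment": labels * 100,
})

# First split: 80% train, 20% held out, keeping class proportions fixed
train_part, held_out = train_test_split(
    toy, test_size=0.2, stratify=toy["Sentiment"], random_state=42
)

# Second split: divide the held-out rows into test (80%) and validation (20%)
test_part, valid_part = train_test_split(
    held_out, test_size=0.2, stratify=held_out["Sentiment"], random_state=42
)

print("Train:", len(train_part), "Test:", len(test_part), "Valid:", len(valid_part))
```

Stratifying keeps each sentiment class equally represented in every split, which can make the per-class scores of different models easier to compare.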


### Sample text using SpaCy
I tried out SpaCy's general-purpose functions on a few simple tasks to get familiar with the package.


In [None]:
# Print a sample text (taken from the raw data, whose index is not affected by the re-split)
sample_text = raw_trainDF.OriginalTweet[1]
print(sample_text)

advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order


In [None]:
# Load the English library from SpaCy
nlp = spacy.load("en_core_web_sm")

# Apply nlp() to all values in "OriginalTweet"
# (kept as a standalone Series: `testDF.nlp = ...` would set an attribute, not create a column)
test_docs = testDF.OriginalTweet.apply(nlp)

In [None]:
# Testing out spaCy with pandas
test_np = test_docs.apply(lambda doc: [chunk.text for chunk in doc.noun_chunks])
test_vb = test_docs.apply(lambda doc: [token.lemma_ for token in doc if token.pos_ == "VERB"])
print(test_np.head())
print(test_vb.head())

15827    [I, a super hero, a grocery store, everything,...
7477     [I, a new mousemat, Amazon, I, the delivery wo...
22647    [#PublicLands #FireSale - #OilGas companies, h...
40093    [Covid-19 Fallout, OPEC+, A Historic Oil Deal ...
11490    [Loads, photos, empty supermarket shelves, som...
Name: OriginalTweet, dtype: object
15827                                 [feel, work, go]
7477     [want, get, feel, have, travel, go, halt, be]
22647                                           [grab]
40093                                  [explain, push]
11490                                 [post, be, look]
Name: OriginalTweet, dtype: object
  


In [None]:
# Use spaCy for NP, V, and NER
nlp = spacy.load("en_core_web_sm")
doc = nlp(sample_text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['advice Talk', 'your neighbours family', 'phone numbers', 'contact list', 'phone numbers', 'neighbours schools employer chemist GP', 'online shopping accounts', 'regular meds', 'order']
Verbs: ['exchange', 'create', 'set']
GP ORG


## Preprocessing
This is the first part of the preprocessing where we create all parts of the pipeline except for the models.

### Structure
- Label encode
- Contextual Spell Check
- Tokenizer
- Pipeline

![Preprocess Image](../images/preprocess_tra.png)

In [None]:
# Install package for Contextual Spell Check
!pip install contextualSpellCheck
!pip install ipywidgets


### Encode the Labels
I used `scikit-learn` to encode the five sentiment labels as integers from 0 to 4.

In [None]:
# Label Encoder for classes in sentiment
from sklearn.preprocessing import LabelEncoder

# Fit the encoder once on the training labels so both sets share the same mapping
encoder = LabelEncoder()
trainDF["encoded_sentiment"] = encoder.fit_transform(trainDF.Sentiment)

# Reuse the fitted encoder for the test labels
testDF["encoded_sentiment"] = encoder.transform(testDF.Sentiment)
testDF["encoded_sentiment"].values


array([1, 0, 4, ..., 1, 4, 1])
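
`LabelEncoder` assigns codes in sorted (alphabetical) order of the class names, not in order of sentiment strength. The short sketch below shows the resulting mapping for the five labels in this dataset; the names `encoder_demo` and `code_to_label` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# The five sentiment labels, listed in arbitrary order
sentiments = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]

# classes_ holds the labels in sorted order; the position is the integer code
encoder_demo = LabelEncoder().fit(sentiments)
code_to_label = {code: str(label) for code, label in enumerate(encoder_demo.classes_)}

print(code_to_label)
```

This ordering matters whenever class names are passed by position, for example as `target_names` in a classification report.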

### Spelling Correction
I use the `contextualSpellCheck` package for fuzzy matching and typo correction.

In [None]:
# Correct Spelling
import contextualSpellCheck

# Load the spaCy English library
nlp = spacy.load('en_core_web_sm')

# Add contextual spellchecker to the pipeline

nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 5})    

# create token of text
doc = nlp(sample_text)

print(doc._.outcome_spellCheck)

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if has adequate supplies of regular meals but not over order


### Tokenization Pipeline
This is the entire pipeline for tokenization, including URL removal, punctuation and stopword removal, spelling correction, and lemmatization.

In [None]:
import string
import re

# Load the English library from SpaCy
nlp = spacy.load("en_core_web_sm")

# Add contextual spell check to pipeline
nlp.add_pipe("contextual spellchecker", config={"max_edit_dist": 5})    

# Create list of punctuation marks
punctuations = string.punctuation

# Create list of stopwords from spaCy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

# Remove URLs
def remove_urls(text):
    text = re.sub(r"\S*https?:\S*", "", text, flags=re.MULTILINE)
    return text

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Remove links first so that URL fragments never become tokens
    tokens = nlp(remove_urls(sentence))

    # Correct spelling
    # tokens = tokens._.outcome_spellCheck
    # tokens = nlp(tokens)

    # Lemmatize and lowercase each token; proper nouns keep their surface form (lowercased)
    tokens = [word.lemma_.lower().strip() if word.pos_ != "PROPN" else word.lower_ for word in tokens]

    # Remove stopwords, punctuation, and empty strings
    tokens = [word for word in tokens if word and word not in stopwords and word not in punctuations]

    # Return preprocessed list of tokens
    return tokens

spacy_tokenizer(sample_text)



['advice',
 'talk',
 'neighbours',
 'family',
 'exchange',
 'phone',
 'number',
 'create',
 'contact',
 'list',
 'phone',
 'number',
 'neighbour',
 'school',
 'employer',
 'chemist',
 'gp',
 'set',
 'online',
 'shopping',
 'account',
 'adequate',
 'supply',
 'regular',
 'meal',
 'order']
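
To make the URL-removal step concrete, here is a small standalone sketch of the same regular expression applied to a made-up tweet. The helper name `strip_urls` and the example links are hypothetical; the whitespace clean-up at the end is an addition that the notebook's `remove_urls` does not perform.

```python
import re

# Same pattern as in the notebook: any whitespace-delimited chunk containing http: or https:
URL_PATTERN = re.compile(r"\S*https?:\S*")

def strip_urls(text):
    """Drop every chunk that contains a link, then tidy the leftover spaces."""
    without_urls = URL_PATTERN.sub("", text)
    # Collapse the gaps left behind by the removed links
    return " ".join(without_urls.split())

example = "Stock up now https://t.co/abc123 and stay safe (via:http://example.com)"
print(strip_urls(example))
```

Because the pattern starts with `\S*`, any characters glued to the front of a link (such as `(via:`) are removed together with it.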

### Bag-of-words Model
This is the code for the bag-of-words model using `scikit-learn`'s `CountVectorizer`.

In [None]:
# Bag-of-words data transformation
from sklearn.feature_extraction.text import CountVectorizer
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
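
As a rough illustration of what the bag-of-words transformation produces, the sketch below fits a `CountVectorizer` on two made-up sentences. It uses scikit-learn's default tokenizer rather than `spacy_tokenizer` so that it runs without a spaCy model; the `toy_*` names are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up tweets standing in for the real data
toy_tweets = ["stock up on food", "food stock is empty"]

# Unigram counts only, matching ngram_range=(1, 1) in the notebook
toy_vectorizer = CountVectorizer(ngram_range=(1, 1))
counts = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each word to its column index in the count matrix
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(counts.toarray())
```

Each row is one tweet and each column is one vocabulary word, so the classifier only sees how often each word occurs, not the word order.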


### Pipeline

The following cell contains the entire preprocessing pipeline from tokenization to training models.

In [None]:
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

# Custom transformer class using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Implement clean_text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Remove leading/trailing spaces and convert text into lowercase
    return text.strip().lower()

# Bag-of-words data transformation
from sklearn.feature_extraction.text import CountVectorizer
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

# Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Create pipeline
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])
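
The sketch below shows how such a pipeline behaves end to end on a tiny made-up dataset. It is a simplified stand-in under stated assumptions: it omits the custom `predictors` cleaner and the spaCy tokenizer, and the texts, labels, and `toy_*` names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set
toy_texts = ["good great happy", "happy good day", "bad awful sad", "sad bad day"]
toy_labels = ["pos", "pos", "neg", "neg"]

# Vectorizer and classifier chained so that fit/predict accept raw strings
toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipe.fit(toy_texts, toy_labels)

toy_prediction = toy_pipe.predict(["great happy", "awful sad"])
print(toy_prediction)
```

Wrapping the steps in a `Pipeline` means the vocabulary is learned only from the training texts and is reused unchanged at prediction time.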

## Statistical Model Training
I first assign the train and test sets, and then train the statistical models (Naive Bayes, Logistic Regression, SVM).

In [None]:
X_train = trainDF.OriginalTweet
X_test = testDF.OriginalTweet
y_train = trainDF.encoded_sentiment
y_test = testDF.encoded_sentiment

In [None]:
# Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Create pipeline using Bag of Words
pipe_NB = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe_NB.fit(X_train,y_train)


In [None]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier_log = LogisticRegression()

# Create pipeline using Bag of Words
pipe_log = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier_log)])

# model generation
pipe_log.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f9e9258dd50>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f9e93ab1440>)),
                ('classifier', LogisticRegression())])

In [None]:
# SVM Classifier
from sklearn.svm import SVC
classifier_svm = SVC()

# Create pipeline using Bag of Words
pipe_svm = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier_svm)])

# model generation
pipe_svm.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7f9e8726d9d0>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7f9e93ab1440>)),
                ('classifier', SVC())])

### Model Performance
I looked at precision, recall, and F1 score to evaluate each model.

In [None]:
# Classification Report
from sklearn.metrics import classification_report

# Predict with a test dataset
predicted = pipe_NB.predict(X_test)

# Model Accuracy
# LabelEncoder assigns codes in alphabetical order, so target_names must follow that order
print("Naive Bayes Model:\n")
print(classification_report(y_test, predicted, target_names = ['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive']))



                    precision    recall  f1-score   support

Extremely Negative       0.63      0.19      0.29       592
Extremely Positive       0.73      0.20      0.31       599
          Negative       0.41      0.54      0.47      1040
           Neutral       0.70      0.15      0.25       614
          Positive       0.35      0.72      0.47       945

          accuracy                           0.41      3790
         macro avg       0.56      0.36      0.36      3790
      weighted avg       0.53      0.41      0.38      3790



In [None]:
# Classification Report
from sklearn.metrics import classification_report
# Predicting with a test dataset
predicted_log = pipe_log.predict(X_test)

# Model Accuracy
# LabelEncoder assigns codes in alphabetical order, so target_names must follow that order
print("Logistic Regression Model:\n")
print(classification_report(y_test, predicted_log, target_names = ['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive']))



                    precision    recall  f1-score   support

Extremely Negative       0.64      0.57      0.60       592
Extremely Positive       0.68      0.58      0.63       599
          Negative       0.54      0.52      0.53      1040
           Neutral       0.60      0.68      0.64       614
          Positive       0.53      0.59      0.56       945

          accuracy                           0.58      3790
         macro avg       0.60      0.59      0.59      3790
      weighted avg       0.58      0.58      0.58      3790



In [None]:
# Classification Report
from sklearn.metrics import classification_report
# Predicting with a test dataset
predicted_svm = pipe_svm.predict(X_test)

# Model Accuracy
# LabelEncoder assigns codes in alphabetical order, so target_names must follow that order
print("SVM Model:\n")
print(classification_report(y_test, predicted_svm, target_names = ['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive']))

                    precision    recall  f1-score   support

Extremely Negative       0.68      0.44      0.53       592
Extremely Positive       0.73      0.48      0.58       599
          Negative       0.52      0.53      0.52      1040
           Neutral       0.58      0.70      0.63       614
          Positive       0.49      0.63      0.55       945

          accuracy                           0.56      3790
         macro avg       0.60      0.55      0.56      3790
      weighted avg       0.58      0.56      0.56      3790
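
For readers who want to check how the `f1-score` column relates to the other two, the sketch below recomputes F1 as the harmonic mean of precision and recall, using the rounded values from the first row of the Naive Bayes report (0.63 and 0.19). The helper name `f1_score_from` is illustrative; scikit-learn works from unrounded values, so a hand calculation can differ in the last digit for other rows.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First row of the Naive Bayes report: precision 0.63, recall 0.19
nb_first_row = f1_score_from(0.63, 0.19)
print(round(nb_first_row, 2))
```

The low recall pulls the F1 score down sharply even though precision is moderate, which is why Naive Bayes scores so poorly on that class.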



## Neural Model Training
I trained two neural network models: one using the standard English pipeline from spaCy, and the other using a RoBERTa-based pipeline.

In [None]:
# Import packages
import spacy
import pandas as pd
import re
from spacy.tokens import DocBin
from tqdm import tqdm

### Preprocess the data

I then preprocess the data by removing URLs, one-hot encoding the categories, and running the text through a SpaCy pipeline. Finally, I save the data in the binary `.spacy` format for training. I run this preprocessing separately for the standard English pipeline and for the RoBERTa-based pipeline.

![Preprocess Image](../images/preprocess_nn.png)

In [None]:
def remove_url(text): 
    '''
    Remove urls from text.
    ---
    Input:
    text (str): a sentence

    Output:
    parsed_text (str): text that has url removed
    '''

    # Use regex to remove urls from the text
    parsed_text = re.sub(r"\S*https?:\S*", "", text, flags=re.MULTILINE)
    return parsed_text

def preprocess(df, embed):
    '''
    Preprocess the dataframe into spacy pipeline for later classification
    ---
    Input:
    df (DataFrame): Pandas dataframe containing the raw text and outputs.
    embed (str): Name of pipeline embedding used

    Output:
    df (DataFrame): Preprocessed input dataframe
    docs (list): SpaCy Doc objects that store the text data along with the classification
    '''

    # Remove urls from text
    df.OriginalTweet = df.OriginalTweet.apply(remove_url)

    # Store the data into tuples
    data = tuple(zip(df.OriginalTweet.tolist(), df.Sentiment.tolist())) 
    
    # Load English library from SpaCy
    nlp=spacy.load(embed)
    print(data[0])

    # Storage for docs
    docs = []

    # Map each dataset label to its category name
    label_to_cat = {
        'Extremely Positive': 'extremely_positive',
        'Positive': 'positive',
        'Neutral': 'neutral',
        'Negative': 'negative',
        'Extremely Negative': 'extremely_negative',
    }

    # One-hot encoding for the classifications
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total = len(data)):
        # Any unexpected label falls back to 'extremely_negative', as in the original else branch
        target = label_to_cat.get(label, 'extremely_negative')
        doc.cats = {cat: int(cat == target) for cat in label_to_cat.values()}

        docs.append(doc)
    return df, docs
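
The one-hot encoding itself can be checked without loading a spaCy pipeline. The following is a minimal standalone sketch; the helper `one_hot_cats` and the `CATEGORIES` list are illustrative names and, unlike `preprocess`, the helper raises an error on an unknown label instead of falling back to a default.

```python
CATEGORIES = ["extremely_positive", "positive", "neutral", "negative", "extremely_negative"]

def one_hot_cats(label):
    """Return a {category: 0 or 1} dict for a label such as 'Extremely Positive'."""
    target = label.lower().replace(" ", "_")
    if target not in CATEGORIES:
        raise ValueError(f"Unknown sentiment label: {label!r}")
    # Exactly one category gets 1, every other category gets 0
    return {category: int(category == target) for category in CATEGORIES}

print(one_hot_cats("Positive"))
```

Checking that exactly one category is set to 1 for every label is a cheap way to catch mapping mistakes before a long training run.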


### Config setup

I set up the training config using SpaCy's [quickstart widget](https://spacy.io/usage/training#quickstart). This creates a `base_config.cfg`, whose remaining settings are then filled in with defaults to produce `config.cfg`. This `config` file can then be used to train the model from the command line. The setup chosen in the quickstart widget is shown in the image below.

![config_setup](../images/config_setup.png)

In [None]:
# Initialize config files from base_config
!python -m spacy init fill-config ../config/base_config.cfg ../config/config.cfg 

In [None]:
# Debug config
!python -m spacy debug data ../config/config.cfg

## SpaCy's English Model
I first train the model with the standard English pipeline from spaCy.

In [None]:
# Convert the train and test dataframes to .spacy files for training

# Preprocess the dataframes for train data
train_data, train_docs = preprocess(trainDF,"en_core_web_sm")
# Save data and docs in a binary file to disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_train.spacy")

# Preprocess the dataframes for test data (used as the dev set during training)
test_data, test_docs = preprocess(testDF,"en_core_web_sm")
# Save data and docs in a binary file to disk
doc_bin = DocBin(docs=test_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_valid.spacy")
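
To see what a `DocBin` stores, the sketch below serializes a single labeled document to bytes and reads it back. It is a sketch under stated assumptions: it uses a blank English pipeline (`spacy.blank("en")`) so that no model download is needed, keeps the data in memory instead of writing a file, and the sentence and labels are made up.

```python
import spacy
from spacy.tokens import DocBin

# Blank pipeline: tokenizer only, no trained components required
blank_nlp = spacy.blank("en")

# One made-up document with one-hot category labels
doc = blank_nlp("Supermarket shelves are empty again")
doc.cats = {"negative": 1, "neutral": 0}

# Serialize to bytes and restore, mirroring to_disk / from_disk
payload = DocBin(docs=[doc]).to_bytes()
restored = list(DocBin().from_bytes(payload).get_docs(blank_nlp.vocab))[0]

print(restored.text, restored.cats)
```

The restored document carries both the text and its `cats` labels, which is what the training command reads from the `.spacy` files.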

### Model training

I first verify the `.spacy` files before running the command-line operation that trains the model. The model is an ensemble of a tok2vec model that uses attention under the transformer architecture, combined with a linear bag-of-words model. I trained the model for 11 epochs with `adam` as the optimizer, using accuracy as the evaluation score for selecting the best epoch (accuracy is a metric, not a loss function).

In [None]:
# View the entities in the train and test docs
train_loc = "/work/data/spacy_data/textcat_train.spacy"
dev_loc = "/work/data/spacy_data/textcat_valid.spacy"

# Load library and train data
nlp = spacy.load('en_core_web_sm')
doc_bin = DocBin().from_disk(train_loc)
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0

# Iterate through the docs
for doc in docs:
    entities += len(doc.ents)
print(f"TRAIN docs: {len(docs)} with {entities} entities")

# Load library and test data
doc_bin = DocBin().from_disk(dev_loc)
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0

# Iterate through the docs
for doc in docs:
    entities += len(doc.ents)
print(f"DEV docs: {len(docs)} with {entities} entities")

In [None]:
# Train model
!python -m spacy train ../config/config.cfg --verbose --output ../data/textcat_output --paths.train ../data/spacy_data/textcat_train.spacy --paths.dev ../data/spacy_data/textcat_valid.spacy

Each epoch takes a long time to train because of the large size of the data and the limited CPU available on Deepnote. The best epoch of the model reached an accuracy score of 0.63.

### Verification

After training the model, I chose the model from the best-performing epoch and ran it on a sample text.

In [None]:
# Verify model
nlp_model = spacy.load("../data/textcat_output/model-best")
test_text = test_data.OriginalTweet.tolist()
test_cats = test_data.Sentiment.tolist()
doc_test = nlp_model(test_text[20])
print("Text: "+ test_text[20])
print("Orig Cat: "+ test_cats[20])
print(" Predicted Cats:") 
print(doc_test.cats)

## Pre-trained BERT Model
Next, I train the model with a pre-trained, RoBERTa-based transformer pipeline (`en_core_web_trf`).

In [None]:
# Convert the train and test dataframes to .spacy files for training

# Preprocess the dataframes for train data
train_data_roberta, train_docs = preprocess(trainDF,"en_core_web_trf")
# Save data and docs in a binary file to disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_roberta_train.spacy")

# Preprocess the dataframes for test data (used as the dev set during training)
test_data_roberta, test_docs = preprocess(testDF,"en_core_web_trf")
# Save data and docs in a binary file to disk
doc_bin = DocBin(docs=test_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_roberta_valid.spacy")

### Model training

I trained the same model, but this time with the pre-trained, RoBERTa-based pipeline from the `spacy-transformers` package. I trained the model for 10 epochs with `adam` as the optimizer, using accuracy as the evaluation score for selecting the best epoch.

In [None]:
# Train model
!python -m spacy train ../config/config.cfg --verbose --output ../data/textcat_roberta_output --paths.train ../data/spacy_data/textcat_roberta_train.spacy --paths.dev ../data/spacy_data/textcat_roberta_valid.spacy

### Verification

After training the model, I chose the model from the best-performing epoch and ran it on a sample text.

In [None]:
# Verify model
nlp_model = spacy.load("../data/textcat_roberta_output/model-best")
test_text = test_data_roberta.OriginalTweet.tolist()
test_cats = test_data_roberta.Sentiment.tolist()
doc_test = nlp_model(test_text[20])
print("Text: "+ test_text[20])
print("Orig Cat: "+ test_cats[20])
print(" Predicted Cats:") 
print(doc_test.cats)

### Testing out with validation set

I tested the models on the validation set created during the train-test split. I first loaded the different models and preprocessed the data.

In [None]:
# Preprocess the validation dataframe with each pipeline

valid_data, valid_docs = preprocess(validDF,"en_core_web_sm")
# Use a separate name so the docs from the English pipeline are not overwritten
valid_data_roberta, valid_docs_roberta = preprocess(validDF,"en_core_web_trf")

I then verified the model by choosing a random tweet to check the distribution of the different labels and the original category. 

In [None]:
# Verify model for English model
nlp_model = spacy.load("../data/textcat_output/model-best")
valid_text = valid_data.OriginalTweet.tolist()
valid_cats = valid_data.Sentiment.tolist()
doc_valid = nlp_model(valid_text[50])
print("Text: "+ valid_text[50])
print("Orig Cat: "+ valid_cats[50])
print(" Predicted Cats:") 
print(doc_valid.cats)

I did the same with the pre-trained BERT embedding model.

In [None]:
nlp_model_bert = spacy.load("../data/textcat_roberta_output/model-best")
doc_valid_bert = nlp_model_bert(valid_text[50])
print("Text: "+ valid_text[50])
print("Orig Cat: "+ valid_cats[50])
print(" Predicted Cats:") 
print(doc_valid_bert.cats)
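
The printed `cats` dictionary holds one score per category, so comparing it with the original category means finding the highest-scoring label. A small sketch of that step follows; the helper `top_category` and the scores in `example_cats` are hypothetical and are not results from the trained models.

```python
def top_category(cats):
    """Return the (label, score) pair with the highest score in a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Hypothetical scores, shaped like the output of print(doc.cats)
example_cats = {
    "extremely_positive": 0.05,
    "positive": 0.62,
    "neutral": 0.20,
    "negative": 0.10,
    "extremely_negative": 0.03,
}

print(top_category(example_cats))
```

Applying the same helper to every validation tweet would give a predicted label per tweet, which could then be compared against `valid_cats` to compute accuracy.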
