## Neural Model
This notebook contains the code for training the neural models.

### Structure
- Package Setup
- Reorganize the data
- Config Setup
- Model Training
- Verification

### Setup
Here, I set up the packages and import the data needed for model training.

**Downloaded Packages**
1. SpaCy English library
2. SpaCy English Roberta-based library
3. SpaCy Transformer

**Imported Packages**
1. Pandas
2. SpaCy
3. Scikit-learn
4. re
5. tqdm

In [None]:
# Installation of packages and pipelines
# Import spaCy, load model
!pip install spaCy 
!pip install spacy[transformers]
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf



In [None]:
# Import packages
import spacy
import pandas as pd
import re
from spacy.tokens import DocBin
from tqdm import tqdm

I imported the data here.

In [None]:
# Import dataset and pandas
raw_trainDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_train.csv")
raw_testDF = pd.read_csv("/work/data/coronavirus_tweet_raw/Corona_NLP_test.csv")
raw_trainDF.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


In [None]:
# Copy the data so later steps do not modify the raw dataframes
trainDF = raw_trainDF.copy()
testDF = raw_testDF.copy()

### Reorganize the data
Based on what I observed during the EDA process, I decided to concatenate the two datasets and re-split them.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split

# Concat the two datasets and split them
allDF = pd.concat((trainDF, testDF), ignore_index=True)

# Sample dataset due to the large size
allDF = allDF.sample(frac=0.5).reset_index(drop=True)

# Split the train, test, validation set
trainDF, testDF = train_test_split(allDF, test_size = 0.2)
testDF, validDF = train_test_split(testDF, test_size = 0.2)

# Print values
print("Train:",len(trainDF), "Test:", len(testDF),"Valid:", len(validDF))

Train: 17982 Test: 3596 Valid: 900
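
The printed sizes follow from the two successive 80/20 splits. As a rough check, the arithmetic can be reproduced without the data. The helper `split_sizes` below is hypothetical and assumes that scikit-learn rounds the test share up when `test_size` is a float.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a float test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# Total rows after sampling, taken from the printed output above
n_sampled = 17982 + 3596 + 900

# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(n_sampled, 0.2)

# Second split: the held-out part is divided 80/20 into test and validation
n_test, n_valid = split_sizes(n_holdout, 0.2)

print("Train:", n_train, "Test:", n_test, "Valid:", n_valid)
```

Overall this is roughly an 80/16/4 split. Because neither `sample` nor `train_test_split` receives a `random_state`, the rows in each split change between runs; passing a fixed `random_state` to both would make the split repeatable.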


### Preprocess the data

I then preprocess the data by removing URLs, one-hot encoding the sentiment categories, and running the text through a spaCy pipeline. Finally, I save the resulting docs in binary `.spacy` format for training. I run the preprocessing separately for the standard English pipeline and the RoBERTa-based pipeline.

![Preprocess Image](../images/preprocess_nn.png)
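
To make the URL-removal step concrete, here is a small standalone sketch using the same regular expression as the notebook. The name `strip_urls`, the sample tweet, and the whitespace collapsing are illustrative assumptions, not part of the original pipeline.

```python
import re

# Same pattern as in the notebook: any whitespace-delimited token containing "http:" or "https:"
URL_PATTERN = re.compile(r"\S*https?:\S*")

def strip_urls(text):
    """Remove URL tokens, then collapse the double spaces they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

sample = "Stock up responsibly https://t.co/abc123 #COVID19"  # made-up tweet
print(strip_urls(sample))
```

Note that the pattern removes the whole token around the URL, so punctuation glued to a link (for example a closing parenthesis) disappears with it.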

In [None]:
def remove_url(text): 
    '''
    Remove urls from text.
    ---
    Input:
    text (str): a sentence

    Output:
    parsed_text (str): text that has url removed
    '''

    # Use a regex to strip URLs from the text
    parsed_text = re.sub(r"\S*https?:\S*", "", text, flags=re.MULTILINE)
    return parsed_text

def preprocess(df, embed):
    '''
    Preprocess the dataframe into spacy pipeline for later classification
    ---
    Input:
    df (DataFrame): Pandas dataframe containing the raw text and outputs.
    embed (str): Name of pipeline embedding used

    Output:
    df (DataFrame): Preprocessed input dataframe
    docs (doc): SpaCy doc object that stores text data along with classification
    '''

    # Remove urls from text
    df.OriginalTweet = df.OriginalTweet.apply(remove_url)

    # Store the data into tuples
    data = tuple(zip(df.OriginalTweet.tolist(), df.Sentiment.tolist())) 
    
    # Load English library from SpaCy
    nlp=spacy.load(embed)
    print(data[0])

    # Storage for docs
    docs = []

    # One-hot encoding for the classifications
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total = len(data)):
        
        if label=='Extremely Positive':
            doc.cats['extremely_positive'] = 1
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 0
        elif label=='Positive':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 1
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 0
        elif label=='Neutral':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 1
        elif label=='Negative':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 1
            doc.cats['neutral']  = 0
        else:
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 1
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 0
        # print(doc.cats)
        
        docs.append(doc)
    return df, docs
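
The branch-per-label encoding above is easy to get wrong when one flag is copied into the wrong branch. A more compact alternative, sketched here under the assumption that the five sentiment strings are the only valid labels, derives the one-hot dictionary from the label itself. The names `LABELS` and `label_to_cats` are hypothetical.

```python
# Category keys in the same order the notebook uses
LABELS = ["extremely_positive", "extremely_negative", "positive", "negative", "neutral"]

def label_to_cats(label):
    """Map a sentiment string such as 'Extremely Negative' to a one-hot dict."""
    key = label.strip().lower().replace(" ", "_")
    if key not in LABELS:
        # Fail loudly instead of silently assigning an unknown label to a class
        raise ValueError(f"Unexpected sentiment label: {label!r}")
    return {name: int(name == key) for name in LABELS}

print(label_to_cats("Positive"))
```

Inside the loop, the result could be applied with `doc.cats.update(label_to_cats(label))`. Unlike the final `else` branch above, this version raises an error for an unknown or missing label rather than treating it as extremely negative.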


### Config setup

I set up the training config using spaCy's [quickstart widget](https://spacy.io/usage/training#quickstart). This creates a `base_config.cfg`, which `spacy init fill-config` expands into a complete `config.cfg`. The resulting config file can then be used to train the model from the command line. The quickstart settings are shown in the image below.

![config_setup](../images/config_setup.png)

In [None]:
# Initialize config files from base_config
!python -m spacy init fill-config ../config/base_config.cfg ../config/config.cfg 

2021-10-31 23:46:58.031841: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-31 23:46:58.031884: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
✔ Auto-filled config with all values
✔ Saved config
../config/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
# Debug config
!python -m spacy debug data ../config/config.cfg

2021-10-31 23:47:07.092715: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-31 23:47:07.092753: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/shared-libs/python3.7/py/lib/python3.7/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/shared-libs/python3.7/py/lib/python3.7/site-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/shared-libs/python3.7/py/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main

## SpaCy's Standard Model
I first train the model using spaCy's standard English pipeline.

In [None]:
# Convert the train and test dataframes to .spacy files for training

# Preprocess the train dataframe
train_data, train_docs = preprocess(trainDF,"en_core_web_sm")
# Save the docs in a binary file on disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_train.spacy")

# Preprocess the test dataframe (used as the dev set during training)
test_data, test_docs = preprocess(testDF,"en_core_web_sm")
# Save the docs in a binary file on disk
doc_bin = DocBin(docs=test_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_valid.spacy")

('I work in a supermarket. I understand why social distancing &amp; self isolating is so important right now. However. I\x92m still going into work &amp; risking my own exposure ? HOW the hell do I help myself &amp; others ??! Confused #Covid_19', 'Extremely Negative')
100%|██████████| 17982/17982 [01:01<00:00, 290.76it/s]
('Dampf, store manager of the Paramus @StewLeonards: "We don\x92t have a lot of paper goods ... soap ... cleaning supplies. That\x92s not what our niche is. We\x92re a fresh food market: 80% of our items are fresh food. Only 20% is grocery." / #COVID19 #coronavirus ', 'Positive')
100%|██████████| 3596/3596 [00:13<00:00, 263.99it/s]


### Model training

I first verify the `.spacy` files before running the command-line training. The model is an ensemble of a linear bag-of-words model and a neural network that builds on a tok2vec layer and uses attention under the transformer architecture. I trained the model for 11 epochs, using `adam` as the optimizer and the categorization score as the evaluation metric, and then stopped the run manually.

In [None]:
# View the entities in the train and test docs
train_loc = "/work/data/spacy_data/textcat_train.spacy"
dev_loc = "/work/data/spacy_data/textcat_valid.spacy"

# Load library and train data
nlp = spacy.load('en_core_web_sm')
doc_bin = DocBin().from_disk(train_loc)
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0

# Iterate through the docs
for doc in docs:
    entities += len(doc.ents)
print(f"TRAIN docs: {len(docs)} with {entities} entities")

# Load library and test data
doc_bin = DocBin().from_disk(dev_loc)
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0

# Iterate through the docs
for doc in docs:
    entities += len(doc.ents)
print(f"DEV docs: {len(docs)} with {entities} entities")

TRAIN docs: 17982 with 42228 entities
DEV docs: 3596 with 8447 entities


In [None]:
# Train model
!python -m spacy train ../config/config.cfg --verbose --output ../data/textcat_output --paths.train ../data/spacy_data/textcat_train.spacy --paths.dev ../data/spacy_data/textcat_valid.spacy

2021-10-31 23:47:20.347614: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-31 23:47:20.347659: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
ℹ Saving to output directory: ../data/textcat_output
[2021-10-31 23:47:22,570] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Using CPU
[2021-10-31 23:47:23,329] [INFO] Set up nlp object from config
[2021-10-31 23:47:23,338] [DEBUG] Loading corpus from path: ../data/spacy_data/textcat_valid.spacy
[2021-10-31 23:47:23,339] [DEBUG] Loading corpus from path: ../data/spacy_data/textcat_train.spacy
[2021-10-31 23:47:23,339] [INFO] Pipeline: ['transformer', 'textcat']
[2021-10-31 23:47:23,343] [INFO] Created vocabulary
[2021-10-31 23:47:23,344] [INFO] Finished i

  9    3600           0.62         15.53       62.25    0.62
  9    3800           0.42          7.37       63.12    0.63
 10    4000           0.65          9.39       62.32    0.62
 10    4200           0.37          8.03       61.73    0.62
 11    4400           0.43          7.53       61.15    0.61
 11    4600           0.46          7.06       62.89    0.63
^C


Each epoch takes a long time because of the large dataset and the limited CPU available on Deepnote. The log also shows the score at each checkpoint; the best checkpoint shown reaches a score of 0.63.
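
As a small illustration of how the best checkpoint can be read off the log, the sketch below parses the six visible rows. The column header is missing from the captured output, so treating the first two fields as epoch and step and the fifth as the score on a 0 to 100 scale is an assumption.

```python
# The six rows visible in the training log above
log_rows = """\
  9    3600           0.62         15.53       62.25    0.62
  9    3800           0.42          7.37       63.12    0.63
 10    4000           0.65          9.39       62.32    0.62
 10    4200           0.37          8.03       61.73    0.62
 11    4400           0.43          7.53       61.15    0.61
 11    4600           0.46          7.06       62.89    0.63
"""

best = None
for line in log_rows.splitlines():
    fields = line.split()
    if len(fields) != 6:
        continue  # skip anything that is not a full metrics row
    epoch, step = int(fields[0]), int(fields[1])
    score = float(fields[4])  # assumed: fifth column is the score (0-100)
    if best is None or score > best[2]:
        best = (epoch, step, score)

print("Best visible checkpoint (epoch, step, score):", best)
```

Among the visible rows, the highest score occurs at step 3800; rows cut off from the captured output are not considered.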

### Verification

After training the model, I chose the model from the best-performing epoch and ran it on a sample text.

In [None]:
# Verify model
nlp_model = spacy.load("../data/textcat_output/model-best")
test_text = test_data.OriginalTweet.tolist()
test_cats = test_data.Sentiment.tolist()
doc_test = nlp_model(test_text[20])
print("Text: "+ test_text[20])
print("Orig Cat: "+ test_cats[20])
print(" Predicted Cats:") 
print(doc_test.cats)

Text: Widespread ramifications of the COVID-19 crisis have created a major demand for food assistance in our State. Help the @MDFoodBank with a donation to make sure no families have to go hungry during these challenging times.  #marylandcoronavirus
Orig Cat: Negative
 Predicted Cats:
{'extremely_positive': 0.0014480953104794025, 'extremely_negative': 0.24643553793430328, 'positive': 6.389307964127511e-05, 'negative': 0.6957257986068726, 'neutral': 0.056326642632484436}
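
The `cats` dictionary holds one score per category, so the predicted label is simply the key with the highest score. A minimal sketch using the scores printed above:

```python
# Scores copied from the prediction printed above
cats = {
    "extremely_positive": 0.0014480953104794025,
    "extremely_negative": 0.24643553793430328,
    "positive": 6.389307964127511e-05,
    "negative": 0.6957257986068726,
    "neutral": 0.056326642632484436,
}

# The predicted label is the category with the largest score
predicted = max(cats, key=cats.get)
print(predicted, round(cats[predicted], 3))
```

Here the top category agrees with the original label, `Negative`.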


## Pre-trained BERT Model
Now, I train the model with the pre-trained RoBERTa-based transformer pipeline (`en_core_web_trf`).

In [None]:
# Convert the train and test dataframes to .spacy files for training

# Preprocess the train dataframe
train_data_roberta, train_docs = preprocess(trainDF,"en_core_web_trf")
# Save the docs in a binary file on disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_roberta_train.spacy")

# Preprocess the test dataframe (used as the dev set during training)
test_data_roberta, test_docs = preprocess(testDF,"en_core_web_trf")
# Save the docs in a binary file on disk
doc_bin = DocBin(docs=test_docs)
doc_bin.to_disk("/work/data/spacy_data/textcat_roberta_valid.spacy")

NameError: name 'preprocess' is not defined

### Model training

I trained the same model, this time using the pre-trained RoBERTa-based pipeline from the `spacy-transformers` package. I trained it for 10 epochs, using `adam` as the optimizer and the categorization score as the evaluation metric.

In [None]:
# Train model
!python -m spacy train ../config/config_bert.cfg --verbose --output ../data/textcat_roberta_output --paths.train ../data/spacy_data/textcat_roberta_train.spacy --paths.dev ../data/spacy_data/textcat_roberta_valid.spacy

2021-11-01 23:09:24.167894: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-01 23:09:24.167934: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
ℹ Saving to output directory: ../data/textcat_roberta_output
[2021-11-01 23:09:27,289] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Using CPU
[2021-11-01 23:09:28,160] [INFO] Set up nlp object from config
[2021-11-01 23:09:28,170] [DEBUG] Loading corpus from path: ../data/spacy_data/textcat_roberta_valid.spacy
[2021-11-01 23:09:28,170] [DEBUG] Loading corpus from path: ../data/spacy_data/textcat_roberta_train.spacy
[2021-11-01 23:09:28,171] [INFO] Pipeline: ['transformer', 'textcat']
[2021-11-01 23:09:28,175] [INFO] Created vocabulary
[2021-11-01 23:09:2

✔ Saved pipeline to output directory
../data/textcat_roberta_output/model-last


### Verification

After training the model, I chose the model from the best-performing epoch and ran it on a sample text.

In [None]:
# Verify model
nlp_model_bert = spacy.load("../data/textcat_roberta_output/model-best")
test_text = test_data_roberta.OriginalTweet.tolist()
test_cats = test_data_roberta.Sentiment.tolist()
doc_test = nlp_model_bert(test_text[20])
print("Text: "+ test_text[20])
print("Orig Cat: "+ test_cats[20])
print(" Predicted Cats:") 
print(doc_test.cats)

NameError: name 'spacy' is not defined

### Testing out with validation set

I tested the models on the validation set created during the train/test split. I first loaded the different models and preprocessed the data.

In [None]:
# Preprocess the validation dataframe with each pipeline
valid_data, valid_docs = preprocess(validDF,"en_core_web_sm")
valid_data_roberta, valid_docs_roberta = preprocess(validDF,"en_core_web_trf")

('How easy do you think it is Piers Morgan to \x91man up\x92 &amp; \x91get a grip\x92 for those with anxiety, depression &amp;  no money????', 'Negative')
100%|██████████| 900/900 [00:03<00:00, 274.12it/s]
('How easy do you think it is Piers Morgan to \x91man up\x92 &amp; \x91get a grip\x92 for those with anxiety, depression &amp;  no money????', 'Negative')
100%|██████████| 900/900 [00:54<00:00, 16.37it/s]


I then verified the model by picking a sample tweet and comparing the predicted scores for each label with the original category.

In [None]:
# Verify model for English model
nlp_model = spacy.load("../data/textcat_output/model-best")
valid_text = valid_data.OriginalTweet.tolist()
valid_cats = valid_data.Sentiment.tolist()
doc_valid = nlp_model(valid_text[50])
print("Text: "+ valid_text[50])
print("Orig Cat: "+ valid_cats[50])
print(" Predicted Cats:") 
print(doc_valid.cats)

Text: Seriously, we are not running out of fresh produce. Also stop hoarding and making the prices go up and making it hard to shop for everyone else #coronavirusau #coronavirus #covid19australia
Orig Cat: Negative
 Predicted Cats:
{'extremely_positive': 0.00046054396079853177, 'extremely_negative': 0.18530161678791046, 'positive': 2.116534233209677e-05, 'negative': 0.8084510564804077, 'neutral': 0.005765631794929504}


I did the same with the pre-trained RoBERTa-based model.

In [None]:
nlp_model_bert = spacy.load("../data/textcat_roberta_output/model-best")
doc_valid_bert = nlp_model_bert(valid_text[50])
print("Text: "+ valid_text[50])
print("Orig Cat: "+ valid_cats[50])
print(" Predicted Cats:") 
print(doc_valid_bert.cats)

Text: Seriously, we are not running out of fresh produce. Also stop hoarding and making the prices go up and making it hard to shop for everyone else #coronavirusau #coronavirus #covid19australia
Orig Cat: Negative
 Predicted Cats:
{'extremely_positive': 0.0007743592723272741, 'extremely_negative': 0.047196418046951294, 'positive': 0.00013236790255177766, 'negative': 0.9406003952026367, 'neutral': 0.01129643153399229}
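
Checking single tweets only gives an impression of the models. To score a whole split, the same comparison can be repeated for every tweet. The sketch below is a hypothetical helper that works on plain label strings and `cats` dictionaries, so it runs without the trained models; the two example predictions are the standard model's outputs printed earlier in this notebook.

```python
def normalize(label):
    """Turn 'Extremely Negative' into the key format used in doc.cats."""
    return label.strip().lower().replace(" ", "_")

def top_label(cats):
    """Return the category with the highest score."""
    return max(cats, key=cats.get)

def accuracy(gold_labels, predicted_cats):
    """Share of items whose top-scoring category matches the gold label."""
    if len(gold_labels) != len(predicted_cats):
        raise ValueError("gold_labels and predicted_cats must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    hits = sum(normalize(g) == top_label(c) for g, c in zip(gold_labels, predicted_cats))
    return hits / len(gold_labels)

# Two predictions from the standard model, copied from the outputs above
gold = ["Negative", "Negative"]
preds = [
    {"extremely_positive": 0.0014480953104794025, "extremely_negative": 0.24643553793430328,
     "positive": 6.389307964127511e-05, "negative": 0.6957257986068726, "neutral": 0.056326642632484436},
    {"extremely_positive": 0.00046054396079853177, "extremely_negative": 0.18530161678791046,
     "positive": 2.116534233209677e-05, "negative": 0.8084510564804077, "neutral": 0.005765631794929504},
]
print(accuracy(gold, preds))
```

With the trained models loaded, the predictions for the full validation set could be collected with `[doc.cats for doc in nlp_model.pipe(valid_text)]` and passed in together with `valid_cats`.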

