## Neural Model
This notebook contains the code for training the neural models.

### Structure
- Package Setup
- Preprocessing for labels
- Tokenization
- Model Training
- Metrics

### Setup
Here, I set up the packages and imported the data needed for model training.

**Downloaded Packages**
1. SpaCy English library
2. Contextual Spell Check


**Imported Packages**
1. Pandas
2. SpaCy
3. Scikit-learn
4. string
5. re
6. tqdm

In [None]:
# Install spaCy and download the small English pipeline
import sys
# Use the notebook's own interpreter so the packages land in the active environment
!{sys.executable} -m pip install spacy==3.1.3
!{sys.executable} -m spacy download en_core_web_sm

In [None]:
# Import packages
import spacy
import pandas as pd
import string
import re
from spacy.tokens import DocBin
from tqdm import tqdm

I imported the data here.

In [None]:
# Load the raw train and test sets
raw_trainDF = pd.read_csv("/work/covid_sentiment/data/coronavirus_tweet_raw/Corona_NLP_train.csv")
raw_testDF = pd.read_csv("/work/covid_sentiment/data/coronavirus_tweet_raw/Corona_NLP_test.csv")
raw_trainDF.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


In [None]:
# Work on copies so that later edits do not modify the raw DataFrames
trainDF = raw_trainDF.copy()
testDF = raw_testDF.copy()

### Reorganizing the data
Based on observations made during the EDA process, I decided to concatenate the two datasets and re-split them into training, test, and validation sets.

In [None]:
# Re-split the combined data into train, test, and validation sets
from sklearn.model_selection import train_test_split

# Concatenate the original train and test sets into one DataFrame
allDF = pd.concat((trainDF, testDF), ignore_index=True)

# Hold out part of the data, then split the held-out part again
# into the test set and the validation set
trainDF, testDF = train_test_split(allDF, test_size=0.2)
testDF, validDF = train_test_split(testDF, test_size=0.2)

# Print the size of each split
print("Train:", len(trainDF), "Test:", len(testDF), "Valid:", len(validDF))
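
Because the second call splits the held-out portion again, the three sets end up with quite different sizes. The arithmetic can be sketched as follows. `split_fractions` is a hypothetical helper written for this illustration, not part of scikit-learn, and it ignores the rounding that `train_test_split` applies to actual row counts.

```python
def split_fractions(first_test_size, second_test_size):
    """Return the (train, test, valid) fractions produced by two chained splits.

    Hypothetical helper for illustration only.
    """
    train = 1.0 - first_test_size
    held_out = first_test_size
    # The "test" side of the second split is what the notebook names validDF
    valid = held_out * second_test_size
    test = held_out - valid
    return train, test, valid


train, test, valid = split_fractions(0.2, 0.2)
print(f"train={train:.2f} test={test:.2f} valid={valid:.2f}")
# prints: train=0.80 test=0.16 valid=0.04
```

If a repeatable split or balanced label proportions matter, `train_test_split` also accepts `random_state` and `stratify` arguments.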

In [None]:
trainDF['OriginalTweet'].to_numpy()

array(['@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8',
       'advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order',
       'Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P',
       ...,
       'You know it\x92s getting tough when @KameronWilds  is rationing toilet paper #coronavirus #toiletpaper @kroger martinsville, help us out!!',
       'Is it wrong that the smell of hand sanitizer is starting to turn me on?\r\r\n\r\r\n#coronavirus #COVID19 #coronavirus',
       "@TartiiCat Well new/used Rift S are going for $700.00 on Amazon rn although the normal market price is usually $400.00 . Prices are really crazy right now for vr headsets since HL Alex was an

In [None]:
def remove_emoji(text):
    """Strip emoji and related pictographic symbols from text."""
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # broad range; also covers CJK characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_url(text):
    """Strip http(s) URLs from text."""
    # Raw string so that \( and \) are not read as string escape sequences
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return url_pattern.sub(r'', text)
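
As a quick check of the two cleaning steps, the sketch below restates the same patterns so that it runs on its own and applies them to a made-up tweet. `clean_tweet` and the sample text are assumptions added for illustration; the whitespace collapsing is an extra step that the notebook itself does not perform.

```python
import re

# Same character ranges as remove_emoji
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)
# Same expression as remove_url, written as a raw string
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def clean_tweet(text):
    """Hypothetical helper: drop emoji and URLs, then collapse whitespace."""
    text = EMOJI_PATTERN.sub("", text)
    text = URL_PATTERN.sub("", text)
    return " ".join(text.split())


sample = "Stock up \U0001F600 https://t.co/abc123 now"  # made-up example
print(clean_tweet(sample))  # prints: Stock up now
```

One thing to keep in mind: the last range runs from U+24C2 to U+1F251, which also contains CJK characters, so any such text in a tweet would be removed along with the emoji.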

In [None]:
def preprocess(df):
    # Clean the tweet text before running it through spaCy
    df.OriginalTweet = df.OriginalTweet.apply(remove_emoji)
    df.OriginalTweet = df.OriginalTweet.apply(remove_url)

    # Pair each tweet with its label so nlp.pipe can pass the label through
    data = tuple(zip(df.OriginalTweet.tolist(), df.Sentiment.tolist()))
    docs = []
    nlp = spacy.load("en_core_web_sm")
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # One-hot encode the sentiment label into doc.cats
        if label == 'Extremely Positive':
            doc.cats['extremely_positive'] = 1
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 0
        elif label == 'Positive':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 1
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 0
        elif label == 'Neutral':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 1
        elif label == 'Negative':
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 0
            doc.cats['positive'] = 0
            doc.cats['negative'] = 1
            doc.cats['neutral'] = 0
        else:  # 'Extremely Negative'
            doc.cats['extremely_positive'] = 0
            doc.cats['extremely_negative'] = 1
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral'] = 0

        docs.append(doc)

    # Return only after every document has been processed
    return df, docs
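
The label branches in `preprocess` amount to a one-hot encoding over five categories. A more compact way to express that, sketched under the assumption that the `Sentiment` column only ever holds these five values, is a lookup table. `LABEL_TO_CAT` and `make_cats` are hypothetical names introduced for this example.

```python
# Hypothetical mapping from the dataset's Sentiment values to category names
LABEL_TO_CAT = {
    "Extremely Positive": "extremely_positive",
    "Positive": "positive",
    "Neutral": "neutral",
    "Negative": "negative",
    "Extremely Negative": "extremely_negative",
}


def make_cats(label):
    """Return a one-hot dict: 1 for the category matching label, 0 elsewhere."""
    target = LABEL_TO_CAT[label]  # raises KeyError on an unexpected label
    return {cat: int(cat == target) for cat in LABEL_TO_CAT.values()}


print(make_cats("Positive"))
```

Inside the loop, `doc.cats.update(make_cats(label))` could then stand in for the branch chain. Unlike a catch-all `else`, this version fails loudly on a label it does not recognize, which may or may not be the behavior wanted here.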


In [None]:
train_data, train_docs = preprocess(trainDF)
# Save the processed training docs to disk in spaCy's binary format
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("/work/covid_sentiment/data/spacy_data/textcat_train.spacy")

# Use the validation split as the dev set so the test set stays unseen during training
valid_data, valid_docs = preprocess(validDF)
# Save the processed validation docs to disk in spaCy's binary format
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("/work/covid_sentiment/data/spacy_data/textcat_valid.spacy")

  0%|          | 0/41157 [00:02<?, ?it/s]
  0%|          | 0/3798 [00:02<?, ?it/s]


In [None]:
!python -m spacy init fill-config ./textcat_base_config.cfg ./textcat_config.cfg


2021-10-04 17:49:45.553180: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-04 17:49:45.553259: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Usage: python -m spacy init fill-config [OPTIONS] BASE_PATH [OUTPUT_FILE]
Try 'python -m spacy init fill-config --help' for help.

Error: Invalid value for 'BASE_PATH': File './textcat_base_config.cfg' does not exist.


In [None]:
!python -m spacy train textcat_config.cfg --verbose --output ./textcat_output --paths.train /work/covid_sentiment/data/spacy_data/textcat_train.spacy --paths.dev /work/covid_sentiment/data/spacy_data/textcat_valid.spacy

