## Sentiment Prediction using spaCy

spaCy is written in Cython, a superset of Python that compiles to C and gives Python programs C-like performance, which makes the library fast. It provides a concise API for accessing methods and properties backed by trained machine learning (including deep learning) models.

In [50]:
# Importing libraries
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding

import matplotlib.pyplot as plt

In [51]:
df = pd.read_csv('COVID19 Tweets_clean2.csv', encoding = 'latin')
df.shape 

(40964, 2)

In [52]:
df.head()

Unnamed: 0,Tweet,Sentiment
0,advice Talk neighbour family exchange phone nu...,Positive
1,Coronavirus Australia Woolworths give elderly ...,Positive
2,food stock one empty PLEASE panic THERE WILL E...,Positive
3,ready supermarket #COVID outbreak Not paranoid...,Extremely Negative
4,news region first confirmed COVID case came Su...,Positive


In [53]:
# Import label encoder
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
df['Sentiment']= label_encoder.fit_transform(df['Sentiment'])
  
df['Sentiment'].unique()

array([4, 0, 3, 2, 1])
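
To see which integer stands for which sentiment, the sorted-order rule that `LabelEncoder` applies can be reproduced in plain Python. This is a minimal sketch, not the encoder itself. Only `Positive` and `Extremely Negative` are visible in the `df.head()` output above, so the other three class names are an assumption.

```python
# Assumed class names: only 'Positive' and 'Extremely Negative' are
# visible in df.head(); the remaining three are a guess for illustration.
labels = ["Positive", "Extremely Negative", "Neutral", "Negative", "Extremely Positive"]

# LabelEncoder numbers the distinct classes in sorted order.
classes = sorted(set(labels))
mapping = {name: index for index, name in enumerate(classes)}

for name, index in mapping.items():
    print(index, name)
```

Under that assumption, `Extremely Negative` maps to 0 and `Positive` to 4, which matches the encoded `df.head()` output. On the fitted encoder itself, `label_encoder.classes_` lists the actual classes in index order.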

In [54]:
df.head()

Unnamed: 0,Tweet,Sentiment
0,advice Talk neighbour family exchange phone nu...,4
1,Coronavirus Australia Woolworths give elderly ...,4
2,food stock one empty PLEASE panic THERE WILL E...,4
3,ready supermarket #COVID outbreak Not paranoid...,0
4,news region first confirmed COVID case came Su...,4


In [55]:
df = df[['Tweet','Sentiment']].dropna()
df.head()

Unnamed: 0,Tweet,Sentiment
0,advice Talk neighbour family exchange phone nu...,4
1,Coronavirus Australia Woolworths give elderly ...,4
2,food stock one empty PLEASE panic THERE WILL E...,4
3,ready supermarket #COVID outbreak Not paranoid...,0
4,news region first confirmed COVID case came Su...,4


In [56]:
# pip install sense2vec

In [57]:
# Converting df to a list of (tweet, sentiment) tuples
df['tuples'] = df.apply(lambda row: (row['Tweet'],row['Sentiment']), axis=1)
train = df['tuples'].tolist()
train[:5]

[('advice Talk neighbour family exchange phone number create contact list phone number neighbour school employer chemist set online shopping account po adequate supply regular med order',
  4),
 ('Coronavirus Australia Woolworths give elderly disabled dedicated shopping hour amid COVID outbreak',
  4),
 ('food stock one empty PLEASE panic THERE WILL ENOUGH FOOD FOR EVERYONE take need Stay calm stay safe #COVID france #COVID #COVID #coronavirus #confinement #Confinementotal #ConfinementGeneral',
  4),
 ('ready supermarket #COVID outbreak Not paranoid food stock litteraly empty The #coronavirus serious thing please panic cause shortage #CoronavirusFrance #restezchezvous #StayAtHome #confinement',
  0),
 ('news region first confirmed COVID case came Sullivan County last week people flocked area store purchase cleaning supply hand sanitizer food toilet paper good report',
  4)]

### Loading the data and model training

We used spaCy's pipeline approach to automate the steps involved in model building; a pipeline is created by loading a model. The package provides several types of models, which contain language information such as vocabularies, trained vectors, syntax and entities. We explored tokenization, POS tagging, dependency parsing, noun phrases and word-vector integration in our model.

In [58]:
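# Loading the data and defining the evaluation function
def load_data(limit=0, split=0.8):
    # Shuffle a copy so the global `train` list is left untouched
    train_data = list(train)
    np.random.shuffle(train_data)
    # Keep the last `limit` examples (limit=0 keeps everything, since [-0:] == [0:])
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    # NOTE: bool(y) is True for every encoded label except 0, so only class 0
    # ('Extremely Negative' in df.head()) is marked as not POSITIVE
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])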

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

# Number of texts to train from
# (you can increase this if you have more computational power)
n_texts = 30000

# Number of training iterations
n_iter = 2
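
The precision, recall and F-score arithmetic at the end of `evaluate` can be checked by hand. The sketch below uses hypothetical counts, not results from this notebook, and the helper name `precision_recall_f1` is introduced only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Same formulas as in evaluate(), without the 1e-8 starting values."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion counts chosen for easy arithmetic
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```

In `evaluate`, the counters start at `1e-8` rather than 0, so the divisions cannot fail with a `ZeroDivisionError` when a class is never predicted.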

In [59]:
from spacy.lang.en.examples import sentences 
# nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank('en')

In [60]:
# add the text classifier to the pipeline
# in spaCy v3, nlp.add_pipe takes the registered component name and returns the component

textcat = nlp.add_pipe('textcat')

# add label to text classifier
textcat.add_label('POSITIVE')

# load the dataset
print("Loading Covid Tweets data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
print("Using {} examples ({} training, {} evaluation)"
      .format(n_texts, len(train_texts), len(dev_texts)))
train_data = list(zip(train_texts, [{'cats': cats} for cats in train_cats]))

Loading Covid Tweets data...
Using 30000 examples (24000 training, 6000 evaluation)
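
The 24000/6000 figures follow from the slicing in `load_data`. Here is a minimal sketch of that logic on stand-in data; the helper name `tail_and_split` is hypothetical, and the row count is taken from the `df.shape` output (the real list may be shorter after `dropna()`).

```python
def tail_and_split(items, limit=0, split=0.8):
    # items[-0:] is the same as items[0:], so limit=0 keeps every item
    subset = items[-limit:]
    cut = int(len(subset) * split)
    return subset[:cut], subset[cut:]

rows = list(range(40964))  # stand-in for the (text, label) tuples

train_part, dev_part = tail_and_split(rows, limit=30000)
print(len(train_part), len(dev_part))  # 24000 6000

all_train, all_dev = tail_and_split(rows)  # default limit=0 uses all rows
print(len(all_train), len(all_dev))  # 32771 8193
```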


In [61]:
# [str(x[0]) for x in train_data if 'nan' in str(x[0]).lower() and len(str(x[0])) < 90]

In [62]:
# textcat.add_label('Positive')
textcat.add_label('NEGATIVE')

1

These pipelines expose a wide range of document properties, such as tokens, each token's index, part-of-speech tags, entities, vectors, sentiment and vocabulary. Next, we performed the steps required to build the model and used it to make predictions. For instance, our model was able to classify a tweet as either Positive or Negative.

In [None]:
from spacy.training.example import Example
import random

# convert the (text, annotations) pairs into spaCy v3 Example objects once
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in train_data]

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.select_pipes(disable=other_pipes):  # only train textcat
    # initialize the weights once; initialize() also returns the optimizer
    optimizer = nlp.initialize(lambda: examples)
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        for epoch in range(20):  # 20 passes over the data per iteration
            random.shuffle(examples)
            # batch up the examples using spaCy's minibatch
            for batch in minibatch(examples, size=compounding(4., 32., 1.001)):
                nlp.update(batch, sgd=optimizer, losses=losses)
        # evaluate on the dev data split off in load_data()
        scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'.format(
            losses['textcat'], scores['textcat_p'],
            scores['textcat_r'], scores['textcat_f']))
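
The `minibatch(examples, size=compounding(4., 32., 1.001))` call grows the batch size gradually from 4 towards 32. The following is a plain-Python reimplementation for illustration only, not spaCy's source; the names `compounding_sizes` and `make_batches` are hypothetical, and the growth factor is raised to 1.5 so the effect is visible on a short list.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def make_batches(items, sizes):
    """Cut items into consecutive batches whose lengths follow `sizes`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            break
        yield batch

# 100 dummy items; the notebook uses a factor of 1.001 instead of 1.5
batches = list(make_batches(range(100), compounding_sizes(4.0, 32.0, 1.5)))
print([len(b) for b in batches])  # [4, 6, 9, 13, 20, 30, 18]
```

With the notebook's factor of 1.001 the batch size grows far more slowly, so early updates use small batches and later ones approach the cap of 32.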

In [1]:
# testing the trained model and predicting the test tweets

# positive tweet
test_text1 = "Support everyone who's suffering... We are in this together!!"
# negative tweet
test_text2="COVID-19 sucks!!! Gov's scam to rip off money... China should be responsible for this shit.."
doc = nlp(test_text1)
print(test_text1, doc.cats)
print('\n\n')
doc = nlp(test_text2)
print(test_text2, doc.cats)

Support everyone who's suffering... We are in this together!! {'POSITIVE': 0.9988889098167419, 'NEGATIVE': 0.0011111637577414513}
COVID-19 sucks!!! Gov's scam to rip off money... China should be responsible for this shit.. {'POSITIVE': 0.0008598110871389508, 'NEGATIVE': 0.9991401433944702}
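
`doc.cats` is a dictionary of label scores, so a single predicted label has to be picked from it. One possible way is sketched below; the helper `predicted_label` and its `threshold` parameter are hypothetical and not part of spaCy, and the scores are copied from the first output line above.

```python
def predicted_label(cats, threshold=0.5):
    """Return the highest-scoring label, or None if it is below threshold."""
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None

# Scores copied from the first test tweet's output
cats = {"POSITIVE": 0.9988889098167419, "NEGATIVE": 0.0011111637577414513}
print(predicted_label(cats))  # POSITIVE
```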
