# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

Reference and guide: [Ref1](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification) and [Ref2](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128)


## Loading packages

In [1]:
import numpy as np
import torch
import random
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
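# transformers.AdamW is deprecated (and removed in recent releases); torch.optim.AdamW replaces it.
# Note: torch's default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup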

### BERT introduction

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

The original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

## Data Visualization and Preprocessing

SMILE Twitter dataset: [Link](https://figshare.com/articles/dataset/smile_annotations_final_csv/3187909/2)

This is not the ideal dataset for this task, but the same approach can be applied to other datasets.
Another notebook will use the same approach for IMDb rating prediction.


In [2]:
DataFrame = pd.read_csv('Data/smile-annotations-final.csv', names = ['id', 'text', 'label'])


Looking at the beginning of the dataset, which is loaded as a dataframe.

In [3]:
DataFrame.head()

| | id | text | label |
|---|---|---|---|
| 0 | 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 1 | 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 2 | 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 3 | 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 4 | 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


Using the `id` column as the index and checking the format of the raw dataframe again.

In [4]:
DataFrame.set_index('id', inplace = True)
DataFrame.head()

| id | text | label |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


## Checking all the categories in the raw data

We can see that the data is heavily imbalanced, which may hurt training: the number of samples in some categories is very small.

In [5]:
DataFrame.label.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: label, dtype: int64
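To put a number on the imbalance, here is a small sketch that turns the counts printed above into shares of the raw data. The counts are copied from the output; the variable names are only illustrative.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "nocode": 1572,
    "happy": 1137,
    "not-relevant": 214,
    "angry": 57,
    "surprise": 35,
    "sad": 32,
    "happy|surprise": 11,
    "happy|sad": 9,
    "disgust|angry": 7,
    "disgust": 6,
    "sad|disgust": 2,
    "sad|angry": 2,
    "sad|disgust|angry": 1,
}

total = sum(label_counts.values())

# Share of each label in the raw data, in the same order as the counts.
label_shares = {label: count / total for label, count in label_counts.items()}

for label, share in label_shares.items():
    print(f"{label:<20} {share:6.1%}")

# Ratio between the most and the least frequent single-emotion labels.
imbalance_ratio = label_counts["happy"] / label_counts["disgust"]
print(f"happy vs. disgust: {imbalance_ratio:.1f}x")
```

A ratio this large means that a classifier can score well overall while rarely predicting the small classes, which is worth keeping in mind when reading the metrics later.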

Removing the rows that carry more than one label (labels containing `|`) and the rows that have not been labeled (`nocode`).
Then creating a dictionary that assigns a number to each remaining label.

In [6]:
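# Drop tweets annotated with more than one emotion (labels joined by '|').
DataFrame = DataFrame[~DataFrame.label.str.contains(r'\|')]
# Drop tweets that were not assigned an emotion.
DataFrame = DataFrame[DataFrame.label != 'nocode']
unique_labels = DataFrame.label.unique()
# Map each label to an integer class id. Note: the name `dict` shadows the Python built-in.
dict = {}
for i, label in enumerate(unique_labels):
    dict[label] = i
dict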

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}

We create a new column, `new_label`, in the dataframe holding the numeric category assigned to each tweet (row), and drop the original `label` column. Then we check the dataframe again.

In [7]:
def apply_label(row):
    old_label = row['label']
    return dict[old_label]

DataFrame['new_label'] = DataFrame.apply(apply_label, axis=1)
# Passing the axis positionally is deprecated in pandas; name the column explicitly.
DataFrame = DataFrame.drop(columns='label')

DataFrame.head(10)


| id | text | new_label |
|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | 0 |
| 614499696015503361 | Lucky @FitzMuseum_UK! Good luck @MirandaStearn... | 0 |
| 613601881441570816 | Yr 9 art students are off to the @britishmuseu... | 0 |
| 613696526297210880 | @RAMMuseum Please vote for us as @sainsbury #s... | 1 |
| 610746718641102848 | #AskTheGallery Have you got plans to privatise... | 1 |
| 612648200588038144 | @BarbyWT @britishmuseum so beautiful | 0 |


## Split the data into Train and Test

We can use `train_test_split` from sklearn.
Based on its result, we add a `data_type` column to the dataframe that records whether each row was assigned to the train or the test set. Note that the split is random (here it is fixed by `random_state = 17`), and our final model metrics will depend on it because the categories are so unevenly populated. Passing `stratify` keeps the class proportions roughly the same in both sets.

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(DataFrame.index.values,
                                                    DataFrame.new_label,
                                                    test_size = 0.15,
                                                    random_state = 17,
                                                    stratify = DataFrame.new_label.values)

DataFrame['data_type'] = ['not_set'] * DataFrame.shape[0]
DataFrame.loc[X_train, 'data_type'] = 'train'
DataFrame.loc[X_test, 'data_type'] = 'test'
DataFrame.head(10)

| id | text | new_label | data_type |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | 0 | train |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | 0 | train |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | 0 | train |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | 0 | train |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | 0 | train |
| 614499696015503361 | Lucky @FitzMuseum_UK! Good luck @MirandaStearn... | 0 | train |
| 613601881441570816 | Yr 9 art students are off to the @britishmuseu... | 0 | train |
| 613696526297210880 | @RAMMuseum Please vote for us as @sainsbury #s... | 1 | train |
| 610746718641102848 | #AskTheGallery Have you got plans to privatise... | 1 | train |
| 612648200588038144 | @BarbyWT @britishmuseum so beautiful | 0 | train |
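To illustrate what stratification does, here is a hand-rolled sketch in plain Python. It is not sklearn's implementation: the function `stratified_split`, its rounding rule, and the toy label sizes are all assumptions made for the example. The idea is simply to sample each class separately so that rare classes show up in both sets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.15, seed=17):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)

    # Group the row positions by class label.
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    train_indices, test_indices = [], []
    for label, positions in by_class.items():
        rng.shuffle(positions)
        # Hypothetical rounding rule: keep at least one test sample per class.
        n_test = max(1, round(len(positions) * test_size))
        test_indices.extend(positions[:n_test])
        train_indices.extend(positions[n_test:])
    return train_indices, test_indices


# Toy labels with an imbalance similar in spirit to the tweets (made-up sizes).
toy_labels = [0] * 100 + [1] * 20 + [2] * 6
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With only a handful of samples in a class, the test set ends up with one or two examples of it, so per-class metrics for those classes are very noisy.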


Now we can check how many samples from each category were assigned to the train and test sets.

In [9]:
DataFrame.groupby(['new_label', 'data_type']).count()

| new_label | data_type | text |
|---|---|---|
| 0 | test | 171 |
| 0 | train | 966 |
| 1 | test | 32 |
| 1 | train | 182 |
| 2 | test | 9 |
| 2 | train | 48 |
| 3 | test | 1 |
| 3 | train | 5 |
| 4 | test | 5 |
| 4 | train | 27 |


## Tokenizing the data

We use `BertTokenizer` from the transformers library and load the vocabulary of the pre-trained `bert-base-uncased` model.

In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Checking the tokenizer on a couple of short sentences. The ID 101 is the special `[CLS]` token added at the start of every input, and 102 is the `[SEP]` token that marks the end of a sentence.

In [11]:
#tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'])
tokenizer.batch_encode_plus(['I am home', 'Where are you?'])

{'input_ids': [[101, 1045, 2572, 2188, 102], [101, 2073, 2024, 2017, 1029, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

After checking that it works, we use this tokenizer to process/tokenize our dataframe, and using the new column of data_type, we create tokenized train and test datasets.

In [12]:
tokenized_train = tokenizer.batch_encode_plus(
    DataFrame[DataFrame.data_type == 'train'].text.values.tolist(),
    add_special_tokens = True, return_attention_mask = True,
    padding = 'max_length',   # replaces the deprecated pad_to_max_length = True
    truncation = True,        # explicitly truncate tweets longer than max_length
    max_length = 256,
    return_tensors = 'pt')

tokenized_test = tokenizer.batch_encode_plus(
    DataFrame[DataFrame.data_type == 'test'].text.values.tolist(),
    add_special_tokens = True, return_attention_mask = True,
    padding = 'max_length',
    truncation = True,
    max_length = 256,
    return_tensors = 'pt')


From the tokenized datasets, we extract and store the input IDs, the attention masks (1 for real tokens, 0 for padding), and the labels.
Then, we create TensorDatasets from them.

In [13]:
input_ids_train = tokenized_train['input_ids']
attention_mask_train = tokenized_train['attention_mask']
labels_train = torch.tensor(
    DataFrame[DataFrame.data_type == 'train'].new_label.values)

input_ids_test = tokenized_test['input_ids']
attention_mask_test = tokenized_test['attention_mask']
labels_test = torch.tensor(
    DataFrame[DataFrame.data_type == 'test'].new_label.values)

dataset_train = TensorDataset(input_ids_train, attention_mask_train, labels_train)
dataset_test  = TensorDataset(input_ids_test , attention_mask_test , labels_test )
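As a rough illustration of how padding and the attention mask relate, here is a minimal sketch in plain Python. The helper `pad_with_mask` is hypothetical and much simpler than the real tokenizer (for instance, its naive truncation would cut off the closing `[SEP]` token, which the tokenizer preserves). It assumes the padding ID is 0, as in `bert-base-uncased`.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad one sequence and build the matching attention mask."""
    kept = token_ids[:max_length]          # naive truncation, for illustration only
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad    # pad on the right up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Token IDs for 'I am home', taken from the batch_encode_plus output above.
ids, mask = pad_with_mask([101, 1045, 2572, 2188, 102], max_length=8)
print(ids)   # [101, 1045, 2572, 2188, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask lets the model ignore the padded positions, so padding every tweet to the same length does not change what the model attends to.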

## Setting up BERT Pretrained Model

Since the task is to assign each sentence (sequence) to one of several classes, we load the pre-trained sequence classification model: https://huggingface.co/docs/transformers/model_doc/bert

The HuggingFace documentation lists other available models that can be used for other applications.

In [14]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = len(dict), output_attentions = False,
                                      output_hidden_states = False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

cuda


## Checking the model

In [15]:
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

## Creating Data Loaders

When training in batches, we need a data loader that groups samples into batches. We could write our own, but libraries already provide one. Here, we use `DataLoader` from `torch.utils.data`.

We also have to decide on the batch size.

In [16]:
batch_size = 4 #32
train_dataloader = DataLoader(dataset_train, sampler = RandomSampler(dataset_train), batch_size = batch_size)
# Evaluation does not need shuffling, so the test set is read in order.
test_dataloader =  DataLoader(dataset_test,  sampler = SequentialSampler(dataset_test), batch_size = batch_size)
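For intuition, this is roughly what a hand-written data loader boils down to. The sketch below is plain Python, not the `torch.utils.data` implementation; `iterate_batches` and the toy samples are hypothetical names used only here.

```python
import random


def iterate_batches(samples, batch_size, shuffle=False, seed=None):
    """Yield lists of at most batch_size samples, optionally in random order."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


toy_samples = list(range(10))

# Sequential order, as one would use for evaluation.
batches = list(iterate_batches(toy_samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Shuffled order, as one would use for training.
shuffled = list(iterate_batches(toy_samples, batch_size=4, shuffle=True, seed=17))
print(shuffled)
```

The last batch is smaller when the number of samples is not a multiple of the batch size, which is also how `DataLoader` behaves by default.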

## Selecting the optimizer and scheduler

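We choose the AdamW optimizer and set some of its hyper-parameters, such as the learning rate and epsilon. The scheduler then decays the learning rate linearly over all training steps, with no warmup.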

In [17]:
optimizer = AdamW(model.parameters(), lr = 1e-5, eps = 1e-8)
epochs = 5
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, 
                                            num_training_steps = len(train_dataloader) * epochs)
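To see what a linear schedule with warmup does to the learning rate, here is a sketch of the multiplier in plain Python. It follows the usual definition (linear ramp-up during warmup, then linear decay to zero) and is not copied from the library; `linear_schedule_factor` and the `steps_per_epoch` value are assumptions for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the last step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_span)


base_lr = 1e-5
steps_per_epoch = 315  # assumed number of training batches, for illustration
total_steps = steps_per_epoch * 5

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:>4}: lr = {base_lr * factor:.2e}")
```

With `num_warmup_steps = 0`, training starts at the full learning rate and ends at zero after the last step.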

## Defining our Performance Metrics

We define an `f1_score_func` function that wraps `f1_score` from sklearn.

$$
    F_{1} = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}
$$

$$
    Precision = \frac{TruePositive}{TruePositive + FalsePositive}
$$

$$
    Recall = \frac{TruePositive}{TruePositive + FalseNegative}
$$

Precision tells us what fraction of all predicted positives are actually positive.
Recall tells us what fraction of all actual positives were correctly predicted.
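As a worked example of these formulas in the multi-class setting, the sketch below computes the F1 score of each class (treating that class as "positive") and then averages the scores weighted by class support, which is the idea behind `average = 'weighted'`. It is plain Python written for illustration, not sklearn's code, and the labels are made up.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label in turn as the positive class."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

class_scores = per_class_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
print(class_scores)       # class 0: 0.75, class 1: 0.5
print(round(overall, 4))  # 0.6667
```

Because the average is weighted by support, a large class such as `happy` dominates the score, and poor results on small classes barely move it.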

In [18]:
def f1_score_func(preds, labels):
    preds_flat  = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

We also want to see how well the model does on each category, so we report the accuracy per class.

In [19]:
def accuracy_per_class(preds, labels):
    preds_flat  = np.argmax (preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    inverse_dict = {v:k for k, v in dict.items()}
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true  = labels_flat[labels_flat==label]
        print(f'Class: {inverse_dict[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}')

## Training

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [20]:
#seed_val = 17
#random.seed(seed_val)
#np.random.seed(seed_val)
#torch.manual_seed(seed_val)
#torch.cuda.manual_seed_all(seed_val)

Create an evaluate function. tqdm is used to visualize the progress during evaluation and training.
Explanation of how the model is called and of its parameters: [Link1](https://discuss.huggingface.co/t/new-model-output-types/195) [Link2](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128)

Note: `**inputs` unpacks the `inputs` dictionary so that each key/value pair is passed to the model as a keyword argument.

First the evaluation function is created, and then the training loop is built on the same principle.
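The dictionary unpacking used in the model call can be seen in isolation with a toy function. `toy_model` below is a hypothetical stand-in that only mimics the keyword names; it is not the BERT forward method.

```python
def toy_model(input_ids, attention_mask, labels=None):
    """Hypothetical stand-in that accepts the same keyword names as the model call."""
    return {"n_tokens": len(input_ids), "has_labels": labels is not None}


inputs = {
    "input_ids": [101, 2073, 102],
    "attention_mask": [1, 1, 1],
    "labels": [0],
}

# These two calls are equivalent: ** spreads the dictionary into keyword arguments.
unpacked = toy_model(**inputs)
explicit = toy_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    labels=inputs["labels"],
)
print(unpacked == explicit)  # True
```

Keeping the inputs in a dictionary makes it easy to add or drop an argument, such as `labels`, without changing the call itself.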

In [21]:
def evaluate(dataloader_val):

    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [22]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(train_dataloader, desc='Epoch {:1d}'.format(epoch),
                       leave = False, disable = False)
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids'       : batch[0],
                  'attention_mask'   : batch[1],
                  'labels'           : batch[2]
                 }
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss returned by the model is already averaged over the batch.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #model_to_save = model.module if hasattr(model, 'module') else model
    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    loss_train_avg = loss_train_total / len(train_dataloader)
    tqdm.write(f'Training Loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(test_dataloader)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 score: {val_f1}')


Epoch 1
Training Loss: 0.7482757588937169
Validation loss: 0.7389112422575376
F1 score: 0.7306529543722665

Epoch 2
Training Loss: 0.4725364663135556
Validation loss: 0.597670856505699
F1 score: 0.7758654900098663

Epoch 3
Training Loss: 0.3480159034494251
Validation loss: 0.6773433836767383
F1 score: 0.8406402477927365

Epoch 4
Training Loss: 0.24191346026602245
Validation loss: 0.6586198837467236
F1 score: 0.8419474617655489

Epoch 5
Training Loss: 0.18654966705981346
Validation loss: 0.6560705022961234
F1 score: 0.8441678668900734

## Evaluating the Model

Now, we apply the model to the test dataset and use our accuracy-per-class function to output the result.

In [23]:
_, predictions, true_values = evaluate(test_dataloader)

In [24]:
accuracy_per_class(predictions, true_values)

Class: happy
Accuracy: 160/171
Class: not-relevant
Accuracy: 21/32
Class: angry
Accuracy: 8/9
Class: disgust
Accuracy: 0/1
Class: sad
Accuracy: 0/5
Class: surprise
Accuracy: 2/5


## Evaluating on custom input

The nice thing is that we can now feed any sentence to the model and see which class it is assigned to!

In [25]:
inverse_dict = {v:k for k, v in dict.items()}

In [33]:
my_tweet = "I do not want to see you"
# Tokenize the sentence and move the tensors to the same device as the model.
inputs = tokenizer(my_tweet, return_tensors="pt").to(device)

# No labels are passed, so the model returns the logits without a loss.
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs["logits"]
logits = logits.detach().cpu().numpy()
category  = np.argmax(logits, axis = 1).flatten()

print("For input sentence:" + my_tweet + "\t Prediction is: " + inverse_dict[category[0]])

For input sentence:I do not want to see you	 Prediction is: angry


In [34]:
my_tweet = "Why is it not raining today!"
# Tokenize the sentence and move the tensors to the same device as the model.
inputs = tokenizer(my_tweet, return_tensors="pt").to(device)

# No labels are passed, so the model returns the logits without a loss.
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs["logits"]
logits = logits.detach().cpu().numpy()
category  = np.argmax(logits, axis = 1).flatten()

print("For input sentence:" + my_tweet + "\t Prediction is: " + inverse_dict[category[0]])

For input sentence:Why is it not raining today!	 Prediction is: surprise


In [32]:
my_tweet = "I am feeling much better today. Thank you."
# Tokenize the sentence and move the tensors to the same device as the model.
inputs = tokenizer(my_tweet, return_tensors="pt").to(device)

# No labels are passed, so the model returns the logits without a loss.
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs["logits"]
logits = logits.detach().cpu().numpy()
category  = np.argmax(logits, axis = 1).flatten()

print("For input sentence:" + my_tweet + "\t Prediction is: " + inverse_dict[category[0]])

For input sentence:I am feeling much better today. Thank you.	 Prediction is: happy
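The predictions above only report the top class. If one also wants to see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below uses NumPy on made-up logits (they were not produced by the model), and the `softmax` helper is written here for illustration.

```python
import numpy as np


def softmax(logits):
    """Convert a (batch, classes) array of logits into row-wise probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for one sentence and six classes (not produced by the model).
example_logits = np.array([[2.0, 0.5, -1.0, -2.0, -1.5, 0.0]])

probs = softmax(example_logits)
best = int(np.argmax(probs, axis=1)[0])

print(np.round(probs, 3))
print("most likely class id:", best)
```

The class with the largest probability is the same one `np.argmax` picks from the raw logits, since softmax preserves their order.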
