# Example of a functional sentiment analysis model

After testing and understanding how this model works, the next topic builds a method using a dataset of sentences and their respective emotions.

In [1]:
!pip install -q transformers

from transformers import pipeline




A pipeline loads a model and submits the input data for analysis, using the parameters specific to the chosen task.

In [2]:
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["i'm sad today"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'NEGATIVE', 'score': 0.9986425042152405}]

In [3]:

specific_model = pipeline(model="bhadresh-savani/distilbert-base-uncased-emotion")
specific_model(data)


[{'label': 'sadness', 'score': 0.9985848665237427}]
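
The `score` values in the outputs above are the softmax probability of the winning label. As a rough sketch of that last step, the snippet below turns a pair of logits into a pipeline-style result. The logit values are hypothetical and chosen only for illustration; the label order (NEGATIVE, POSITIVE) follows the default SST-2 model named in the warning above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence (not taken from a real model run)
logits = [3.3, -3.3]
labels = ["NEGATIVE", "POSITIVE"]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```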

# Building a model
The model of sentiment analysis uses Emotions Dataset for NLP: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

To run this notebook, set the Colab environment to GPU mode as follows:

Runtime --> Change Runtime Type --> Hardware Accelerator --> GPU

## 1-Importing Pandas and the Database



### 1-1 Importing pandas

In [4]:
import pandas as pd

### 1-2 Reading test database


In [5]:
# Reading the test database; the file has no header row, so the column names are passed explicitly
test = pd.read_table('sample_data/test.txt', sep=';', names=['sentence', 'sentiment'])
test

FileNotFoundError: ignored

### 1-3 Reading train database

In [None]:
# The file has no header row, so the column names are passed explicitly
train = pd.read_table('sample_data/train.txt', sep=';', names=['sentence', 'sentiment'])
train

### 1-4 Reading val database

In [None]:
# The file has no header row, so the column names are passed explicitly
val = pd.read_table('sample_data/val.txt', sep=';', names=['sentence', 'sentiment'])
val

## 2- Treating the data

### 2-1 First of all, concatenating all the databases

In [None]:
df = pd.concat([test,train,val], axis=0)
df

### 2-2 Checking for duplicated data

In [None]:
df.drop_duplicates(subset='sentence', inplace=True)
df

There were 52 duplicated rows

### 2-3 Checking for NaN values

In [None]:
df.info()

All rows are classified as non-null, so there aren't any NaN values




## 3- Text processing

### 3-1 Imports

In [None]:
!pip install transformers

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn.functional as F
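# Note: the AdamW class shipped with transformers is deprecated in recent releases; torch.optim.AdamW is the usual replacement
from transformers import BertTokenizer, BertConfig, AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup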

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score,matthews_corrcoef

from tqdm import tqdm, trange,tnrange,tqdm_notebook
import random
import os
import io
%matplotlib inline


### 3-2 GPU configuration


Setting a torch device on which tensors will be allocated during training

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

# Fix every random seed so that runs are reproducible
SEED = 19

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if device.type == "cuda":
    torch.cuda.manual_seed_all(SEED)


### 3-3 Encoding

Transforming each sentiment into an ID number using LabelEncoder


In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['sentiment_enc'] = labelencoder.fit_transform(df['sentiment'])
df
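
To see which ID each emotion receives, here is a minimal pure-Python sketch of the same idea: `LabelEncoder` numbers the classes in sorted (alphabetical) order. The short `labels` list is a hypothetical sample of the six emotions in the dataset, not real rows.

```python
# Hypothetical sample of labels; the real column comes from df['sentiment']
labels = ["sadness", "joy", "anger", "fear", "surprise", "love", "joy", "sadness"]

# IDs follow the sorted order of the unique class names
classes = sorted(set(labels))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

encoded = [label2id[name] for name in labels]
print(label2id)
print(encoded)
```

With the fitted encoder from the cell above, the same mapping can be read from `labelencoder.classes_`, and `labelencoder.inverse_transform` turns IDs back into names.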

Renaming the "sentiment" column to "sentiment_desc" to keep the label columns organized

In [None]:
df.rename(columns={'sentiment':'sentiment_desc'},inplace=True)
df

### 3-4 Tokenizer

In [None]:

sentences = df.sentence.values

# Check the distribution of the data across labels
print("Distribution of data based on labels: ", df.sentiment_enc.value_counts())

MAX_LEN = 256

# Load the BERT tokenizer, which converts text into token IDs from the BERT vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# padding/truncation replace the deprecated pad_to_max_length argument
input_ids = [tokenizer.encode(sent, add_special_tokens=True, max_length=MAX_LEN, padding='max_length', truncation=True) for sent in sentences]
labels = df.sentiment_enc.values

print("Sentence before and after tokenization:")
print("Actual sentence before tokenization: ", sentences[2])
print("Encoded Input from dataset: ", input_ids[2])


Creating attention masks to indicate which tokens the model will attend to and which are just padding tokens.

In the example below, the tokens to be attended to are marked as 1.0 and the padding tokens as 0.0.


In [None]:
# Create a mask of 1.0 for every real token and 0.0 for every padding token (ID 0)
attention_masks = [[float(i > 0) for i in seq] for seq in input_ids]


The number of 1.0 entries in an attention mask equals the number of non-zero token IDs in the tokenized sentence. This can be observed below, where both counts are 25:

In [None]:
print("Actual sentence before tokenization: ",sentences[2])
print("Encoded Input from dataset: ",input_ids[2])
print(attention_masks[2])
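
The relationship between padding and the mask can be checked on a tiny example. This is only a sketch: in the `bert-base-uncased` vocabulary 101 is `[CLS]`, 102 is `[SEP]` and 0 is `[PAD]`, while the IDs in between are made up for illustration and the sequence is far shorter than `MAX_LEN`.

```python
# Hypothetical padded sequence: [CLS] + three word tokens + [SEP] + three [PAD]
input_ids = [[101, 1045, 2514, 6517, 102, 0, 0, 0]]

# Same rule as in the notebook: 1.0 for real tokens, 0.0 for padding
attention_masks = [[float(token_id > 0) for token_id in seq] for seq in input_ids]

non_padding = sum(1 for token_id in input_ids[0] if token_id != 0)
print(attention_masks[0])
print(non_padding, sum(attention_masks[0]))
```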

## 4- Dataset Preparation

### 4-1 Splitting Train and Validation Datasets

In [None]:
# Split inputs, labels and masks in a single call so that the three stay aligned
train_inputs, validation_inputs, train_labels, validation_labels, train_masks, validation_masks = train_test_split(input_ids, labels, attention_masks, random_state=42, test_size=0.1)

### 4-2 Converting The Data Into Torch Tensors
The conversion is important because the model processes data of this type.

In [None]:

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

### 4-3 Selecting a Batch Size for Training

The dataset's author recommended a batch size of 16 or 32; in this case I chose 32 to increase the amount of data processed per iteration.

In [None]:
batch_size = 32

### 4-4 Iterating Data with Torch DataLoader
Managing the data with a DataLoader simplifies the pipeline because the entire dataset does not need to be loaded at once, which saves memory during processing.

In [None]:
train_data = TensorDataset(train_inputs,train_masks,train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data,sampler=train_sampler,batch_size=batch_size)

validation_data = TensorDataset(validation_inputs,validation_masks,validation_labels)
# Order does not matter for evaluation, so the validation set is read sequentially
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data,sampler=validation_sampler,batch_size=batch_size)

### 4-5 Final Prepared Data Sample

In [None]:
train_data[0]


## 5- Training Model

### 5-1 Loading BERT Model with its parameters

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6).to(device)


# learning rate
lr = 2e-5

# Adam epsilon is a small constant added to the denominator of the update for numerical stability
adam_epsilon = 1e-8

# Number of training epochs 
epochs = 3

num_warmup_steps = 0
num_training_steps = len(train_dataloader)*epochs


The optimizer and the scheduler are initialized separately in Transformers.

The optimizer takes the accumulated gradients and uses them to update the model parameters through the function optimizer.step().

The scheduler adjusts the learning rate over the course of training, based on the number of warmup steps and the total number of training steps; it is advanced with the scheduler.step() function.

In [None]:
optimizer = AdamW(model.parameters(), lr=lr,eps=adam_epsilon,correct_bias=False) 
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps) 
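
To make the schedule concrete, the sketch below reproduces the multiplier that a linear schedule with warmup applies to the base learning rate at each step. The function name `linear_schedule_factor` and the dataset size `num_examples` are assumptions introduced for illustration; `batch_size`, `epochs`, the base learning rate and the zero warmup steps match the values chosen in this notebook.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

num_examples = 1000  # hypothetical training set size
batch_size = 32
epochs = 3
base_lr = 2e-5

# len(train_dataloader) is the number of batches per epoch
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = steps_per_epoch * epochs

lrs = [base_lr * linear_schedule_factor(s, 0, num_training_steps)
       for s in range(num_training_steps + 1)]
print(steps_per_epoch, num_training_steps)
print(lrs[0], lrs[num_training_steps // 2], lrs[-1])
```

With no warmup, the learning rate starts at its full value and decays linearly to zero at the last training step.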

In [None]:
## Store our loss and accuracy for plotting
train_loss_set = []
learning_rate = []

# Gradients accumulate by default, so clear them before training starts
model.zero_grad()

# tnrange is a tqdm wrapper around the normal python range
for _ in tnrange(1,epochs+1,desc='Epoch'):
  print("<" + "="*22 + F" Epoch {_} "+ "="*22 + ">")
  # Calculate total loss for this epoch
  batch_loss = 0

  for step, batch in enumerate(train_dataloader):
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch

    # Forward pass
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    loss = outputs[0]
    
    # Backward pass
    loss.backward()
    
    # Clip the norm of the gradients to 1.0
    # Gradient clipping is not in AdamW anymore
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    # Update learning rate schedule
    scheduler.step()

    # Clear the previous accumulated gradients
    optimizer.zero_grad()
    
    # Update tracking variables
    batch_loss += loss.item()

  # Calculate the average loss over the training data.
  avg_train_loss = batch_loss / len(train_dataloader)

  #store the current learning rate
  for param_group in optimizer.param_groups:
    print("\n\tCurrent Learning rate: ",param_group['lr'])
    learning_rate.append(param_group['lr'])
    
  train_loss_set.append(avg_train_loss)
  print(F'\n\tAverage Training loss: {avg_train_loss}')
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_accuracy,eval_mcc_accuracy,nb_eval_steps = 0, 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    
    # Move logits and labels to CPU
    logits = logits[0].to('cpu').numpy()
    label_ids = b_labels.to('cpu').numpy()

    pred_flat = np.argmax(logits, axis=1).flatten()
    labels_flat = label_ids.flatten()
    
    # `_` is the current epoch from the outer loop
    df_metrics=pd.DataFrame({'Epoch':_,'Actual_class':labels_flat,'Predicted_class':pred_flat})
    
    tmp_eval_accuracy = accuracy_score(labels_flat,pred_flat)
    tmp_eval_mcc_accuracy = matthews_corrcoef(labels_flat, pred_flat)
    
    eval_accuracy += tmp_eval_accuracy
    eval_mcc_accuracy += tmp_eval_mcc_accuracy
    nb_eval_steps += 1

  print(F'\n\tValidation Accuracy: {eval_accuracy/nb_eval_steps}')
  print(F'\n\tValidation MCC Accuracy: {eval_mcc_accuracy/nb_eval_steps}')
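
The loop above reports the mean of the per-batch accuracies. When the last batch is smaller than the others, that mean can differ from the accuracy computed over all validation examples at once. The sketch below illustrates the difference with hypothetical labels and predictions; `accuracy` is a small helper written for this example and stands in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical (labels, predictions) per batch; the last batch is smaller
batches = [
    ([0, 1, 2, 3], [0, 1, 2, 3]),  # 4 of 4 correct
    ([4, 5, 0, 1], [4, 5, 0, 2]),  # 3 of 4 correct
    ([2], [3]),                    # 0 of 1 correct
]

# What the loop reports: the mean of the per-batch accuracies
mean_of_batches = sum(accuracy(t, p) for t, p in batches) / len(batches)

# Alternative: collect everything first, then compute the metric once
all_true = [label for t, _ in batches for label in t]
all_pred = [pred for _, p in batches for pred in p]
overall = accuracy(all_true, all_pred)

print(mean_of_batches, overall)
```

Collecting all predictions before scoring also gives a single MCC value for the whole validation set instead of an average of per-batch values.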

## 6- Testing the model with a pipeline
The pipeline works exactly like the example given at the beginning of the notebook, but uses the trained model instead.

In [None]:
from transformers import TextClassificationPipeline
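# Run the pipeline on the first GPU (device=0), where the trained model already lives, so that
# the model and the inputs are on the same device; fall back to CPU (device=-1) if no GPU exists
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device=0 if torch.cuda.is_available() else -1)

# LabelEncoder numbers the classes in alphabetical order
print("LABEL MAPPING \n anger: 0 \n fear: 1 \n joy: 2 \n love: 3 \n sadness: 4 \n surprise: 5")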
pipe("the rain makes me feel alone sometimes")

The model worked successfully. The response 'LABEL_4' refers to the encoded sentiments; comparing it with the label mapping printed above, we can see that the model identified SADNESS (4) in the sentence "the rain makes me feel alone sometimes" with a score of 99.9%.


## 7- Future Improvements

1- Tune the model further, because it becomes ambiguous with certain words

2- Post-process the pipeline response so that it returns the emotion name directly, instead of the encoded label number that has to be looked up in the mapping

3- Allow multiple sentences to be passed through the pipeline, returning one result for each
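
As a starting point for items 2 and 3, here is a minimal sketch that converts pipeline-style results into emotion names for several sentences at once. The helper `readable` and the `raw` list are hypothetical: `raw` only imitates the shape of the pipeline output (a list of `label`/`score` dictionaries) and does not come from a real run. The `ID2LABEL` mapping assumes the alphabetical order used by `LabelEncoder`.

```python
# Assumed mapping: LabelEncoder numbers the six emotions alphabetically
ID2LABEL = {0: "anger", 1: "fear", 2: "joy", 3: "love", 4: "sadness", 5: "surprise"}

def readable(results, id2label=ID2LABEL):
    """Replace 'LABEL_<n>' with the emotion name in each result dictionary."""
    converted = []
    for item in results:
        idx = int(item["label"].split("_")[-1])
        converted.append({"label": id2label[idx], "score": item["score"]})
    return converted

# Hypothetical output for two sentences (scores invented for illustration)
raw = [
    {"label": "LABEL_4", "score": 0.999},
    {"label": "LABEL_2", "score": 0.98},
]
print(readable(raw))
```

Hugging Face pipelines generally accept a list of strings, so `pipe([...])` with several sentences may already return one result per sentence. Setting `model.config.id2label` before building the pipeline is another option that could make it emit the names directly.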