# **Sentiment Analysis using Contextual Embedding**

### **BERT** stands for *Bidirectional Encoder Representations from Transformers*. It is a language model which looks at both left and right context when predicting current word. This is also called **Masked Language Modelling** (MLM). Additionally, BERT also use *"next sentence prediction"* task in addition to MLM during pre-training.

Proper language representation is key for general-purpose language understanding by machines. Context-free models such as **word2vec** or **GloVe** generate a single word embedding representation for each word in the vocabulary. For example, the word *"bank"* would have the same representation in *"bank deposit"* and in *"riverbank"*. Contextual models instead generate a representation of each word that is based on the other words in the sentence. BERT, as a contextual model, captures these relationships in a bidirectional way.

## 1. Importing the Required Libraries and Reading the Dataset

In [1]:
import pandas as pd
import numpy as np
import random
import re

import nltk
nltk.download('stopwords')
nltk.download('wordnet') 
nltk.download('punkt')
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

!pip install -qq transformers
import transformers
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

!pip install contractions
import contractions

import torch
from torch import nn, optim
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
df = pd.read_csv("Twitter.csv")
df

Unnamed: 0,clean_text,category,category_sentiment
0,i am not happy,-1,negative
1,i am not sad,1,positive
2,i'm fine,0,neutral
3,when modi promised “minimum government maximum...,-1,negative
4,talk all the nonsense and continue all the dra...,0,neutral
...,...,...,...
177971,'I'm not satisfied with The Hills finale. gon...,-1,negative
177972,this sucks,-1,negative
177973,this is bad,-1,negative
177974,I am not okay with this,-1,negative


## 2. Checking Class Distribution

In [3]:
df.iloc[:, -1:].value_counts()

category_sentiment
positive              72250
neutral               62712
negative              43014
dtype: int64

In [4]:
df = df.sample(frac = 1).reset_index(drop = True) # Shuffling of Tweets
df_new = df[df["category"] == -1][:40000]
df_new = df_new.append(df[df["category"] == 1][:40000])
df_new = df_new.append(df[df["category"] == 0][:40000])
df_new = df_new.reset_index(drop = True)

display(df_new["category"].value_counts())
df_new

-1    35000
 1    35000
 0    35000
Name: category, dtype: int64

Unnamed: 0,clean_text,category,category_sentiment
0,modi govt took credit for valour armed forces ...,-1,negative
1,opposition saying modi shouldn’ take credit dr...,-1,negative
2,thats the difference between average citizen a...,-1,negative
3,big twist case prosecution argues why bail sho...,-1,negative
4,dear modi govt you believe national interest t...,-1,negative
...,...,...,...
104995,this man consumed much blind hatred modi their...,0,neutral
104996,modi said centre and state governments are wor...,0,neutral
104997,blow modi,0,neutral
104998,modi must sent back now otherwise india peril,0,neutral


## 3. Creating a Function for Text Transformation

In [5]:
def text_transformation(text):
    text = " ".join(x.lower() for x in str(text).split())                             # Converting Text to Lowercase
    text = contractions.fix(text)                                                     # Fixes Contractions such as ("you're" to "you are" etc.)
    text = " ".join([re.sub("[^A-Za-z]+", "", x) for x in nltk.word_tokenize(text)])  # Removal of Punctuation, Numbers, and Special Characters                                                                      
    return text

In [6]:
df_new["category"] = df_new["category"].map({-1:0, 0:1, 1:2})    # Mapping '0' - negative, '1' - neutral, '2' - positive 

## 4. Checking for CUDA Availability

In [7]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f"There are {torch.cuda.device_count()} GPU(s) available.")
    print("Device Name:", torch.cuda.get_device_name(0))

else:
    print("No GPU available, using the CPU instead.")
    device = torch.device("cpu")

torch.cuda.empty_cache()  # Emptying the cache to prevent out of memory 

There are 1 GPU(s) available.
Device Name: Tesla T4


## 5. Applying Text Transformation to the Dataset

In [8]:
print("Original Text: ", df_new["clean_text"].iloc[0])
print("Processed Text: ", text_transformation(df_new["clean_text"].iloc[0]))

Original Text:  modi govt took credit for valour armed forces failed ensure their safety tdp 
Processed Text:  modi govt took credit for valour armed forces failed ensure their safety tdp


In [9]:
df_new["processed_text"] = df_new["clean_text"].apply(text_transformation)
df_new

Unnamed: 0,clean_text,category,category_sentiment,processed_text
0,modi govt took credit for valour armed forces ...,0,negative,modi govt took credit for valour armed forces ...
1,opposition saying modi shouldn’ take credit dr...,0,negative,opposition saying modi shouldn take credit dr...
2,thats the difference between average citizen a...,0,negative,that is the difference between average citizen...
3,big twist case prosecution argues why bail sho...,0,negative,big twist case prosecution argues why bail sho...
4,dear modi govt you believe national interest t...,0,negative,dear modi govt you believe national interest t...
...,...,...,...,...
104995,this man consumed much blind hatred modi their...,1,neutral,this man consumed much blind hatred modi their...
104996,modi said centre and state governments are wor...,1,neutral,modi said centre and state governments are wor...
104997,blow modi,1,neutral,blow modi
104998,modi must sent back now otherwise india peril,1,neutral,modi must sent back now otherwise india peril


## 6. Importing the BERT Pre-Trained Model and Tokenizer

In [10]:
PRE_TRAINED_MODEL_NAME = "distilbert-base-uncased"                       # Importing the BERT-based Pre-Trained Model               
tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)  # Loading the Pre-Trained BertTokenizer

## 7. Tokenization Process

In [11]:
sample_text = text_transformation(df_new["processed_text"][0])  # Get a sample text from the dataset

tokens = tokenizer.tokenize(sample_text)            # This will convert a sentence to a list of words
token_ids = tokenizer.convert_tokens_to_ids(tokens) # This will convert the list of words to a list of numbers based on tokenizer

print(f"Sentence: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

Sentence: modi govt took credit for valour armed forces failed ensure their safety tdp
Tokens: ['mod', '##i', 'govt', 'took', 'credit', 'for', 'val', '##our', 'armed', 'forces', 'failed', 'ensure', 'their', 'safety', 'td', '##p']
Token IDs: [16913, 2072, 22410, 2165, 4923, 2005, 11748, 8162, 4273, 2749, 3478, 5676, 2037, 3808, 14595, 2361]


## Special Tokens

**[SEP]** - marker for ending of a sentence </br>
**[CLS]** - add this token to the start of each sentence, so BERT knows it's a classification </br>
**[PAD]** - adds padding

In [12]:
encoding = tokenizer.encode_plus(
    sample_text,                    # Sample Text
    max_length = 64,                # Max Length of the Sentence
    truncation = True,              # Truncate to a maximum length specified with argument max_length
    add_special_tokens = True,      # Add '[CLS]', [PAD] and '[SEP]'
    return_token_type_ids = False,  # This case deals with only one sentence as opposed to two sentences in single training
    padding = "max_length",         # Pad to longest sequence as defined by max_length
    return_attention_mask = True,   # Attention mask indicated to the model which tokens should be attended to, and which should not
    return_tensors = "pt",            # Return PyTorch tensors
)

# Encoding which corresponds to the weight of each word
print(len(encoding["input_ids"][0]))
encoding["input_ids"][0]

64


tensor([  101, 16913,  2072, 22410,  2165,  4923,  2005, 11748,  8162,  4273,
         2749,  3478,  5676,  2037,  3808, 14595,  2361,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])

In [13]:
# Attention Mask also has same length with Encoding
print(len(encoding["attention_mask"][0]))
encoding["attention_mask"]

64


tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

## Creating a Class Implemeting the Processes Above

In [14]:
max_len = 64

class Dataset(Dataset):
    def __init__(self, text, category, tokenizer, max_len):
        self.text = text
        self.category = category
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, item):
        text = str(self.text[item])
        category = self.category[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens = True,
            max_length = self.max_len,
            truncation = True,
            return_token_type_ids = False,
            padding = "max_length",
            return_attention_mask = True,
            return_tensors = "pt"
        )

        return {
            "text": text,
            "input_ids": encoding["input_ids"].flatten(),         
            "attention_mask": encoding["attention_mask"].flatten(),
            "category": torch.tensor(category, dtype = torch.long)
        }

## 8. Train-Test Split

In [15]:
x_train, x_test, y_train, y_test = train_test_split(df_new["processed_text"], df_new["category"], test_size = 0.20, random_state = 1, stratify=df_new["category_sentiment"])

df_train = pd.concat([pd.DataFrame({"processed_text": x_train.values}), pd.DataFrame({"category": y_train.values})], axis = 1)
df_test = pd.concat([pd.DataFrame({"processed_text": x_test.values}), pd.DataFrame({"category": y_test.values})], axis = 1)


print(df_train.shape, df_test.shape)

(84000, 2) (21000, 2)


## 9. Creating the Dataloader

In [16]:
def create_data_loader(df, tokenizer, max_len, batch_size):

    ds = Dataset(
        text = df.processed_text.to_numpy(),
        category = df.category.to_numpy(),
        tokenizer = tokenizer,
        max_len = max_len
    )
  
    return DataLoader(
        ds,
        batch_size = batch_size,
        num_workers = 4  # Tells the data loader how many sub-processes to use for data loading
    )


batch_size = 32     # Bert recommendation

train_data_loader = create_data_loader(df_train, tokenizer, max_len, batch_size)
test_data_loader = create_data_loader(df_test, tokenizer, max_len, batch_size)



In [17]:
data = next(iter(train_data_loader))

print(data["input_ids"].shape)
print(data["attention_mask"].shape)

torch.Size([32, 64])
torch.Size([32, 64])


## 10. Building the Classifier

In [18]:
model = DistilBertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels = 3, return_dict = False)
model = model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

In [19]:
EPOCHS = 10

optimizer = AdamW(model.parameters(), lr = 5e-4)
total_steps = len(train_data_loader) * EPOCHS

In [20]:
def train_epoch(model, data_loader, optimizer, device, n_examples):

    model = model.train()
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["category"].to(device)

        loss, logits = model(
          input_ids = input_ids,
          attention_mask = attention_mask,
          labels = targets
        )
        
        logits = logits.detach().cpu().numpy()
        label_ids = targets.cpu().numpy()

        prediction = np.argmax(logits).flatten()
        target = label_ids.flatten()

        correct_predictions += np.sum(prediction == target)

        losses.append(loss.item())
        loss.backward()             # Performs backpropagation (computes derivates of loss w.r.t to parameters)
        nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)  # Clipping gradients so they dont explode
        optimizer.step()            # Makes the optimizer iterate over all parameters and update their gradient values
        optimizer.zero_grad()       # Clears old gradients from last step

    return correct_predictions / n_examples, np.mean(losses)

In [21]:
def eval_model(model, data_loader, device, n_examples):
  
    model = model.eval()
    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["category"].to(device)

            loss, logits = model(
              input_ids = input_ids,
              attention_mask = attention_mask,
              labels = targets
            )

            logits = logits.detach().cpu().numpy()
            label_ids = targets.cpu().numpy()

            prediction = np.argmax(logits).flatten()
            target = label_ids.flatten()

            correct_predictions += np.sum(prediction == target)
            losses.append(loss.item())

    return correct_predictions / n_examples, np.mean(losses)

In [22]:
# Used Accuracy as Performance Metric

best_acc = 0

for epoch in range(EPOCHS):

    print(f"Epoch {epoch + 1} / {EPOCHS}")
    print("-" * 10)

    train_acc, train_loss = train_epoch(model, train_data_loader, optimizer, device, len(df_train))

    print(f"Train Loss {train_loss} Accuracy {train_acc}")

    test_acc, test_loss = eval_model(model, test_data_loader, device, len(df_test))

    print(f"Test Loss {test_loss} Accuracy {test_acc}")
    print()

    if test_acc > best_acc:
      torch.save(model.state_dict(), "contextual_embedding_model.pth")
      best_acc = test_acc

# Storing the State of Best Model Indicated by Highest Validation Accuracy

Epoch 1 / 10
----------
Train Loss 1.0995535072145008 Accuracy 0.16016666666666668
Test Loss 1.0986175219580825 Accuracy 0.3333333333333333

Epoch 2 / 10
----------
Train Loss 1.0992739740553357 Accuracy 0.13238095238095238
Test Loss 1.0987241857853836 Accuracy 0.037476190476190475

Epoch 3 / 10
----------
Train Loss 1.0989082431339083 Accuracy 0.01217857142857143
Test Loss 1.0986224672384277 Accuracy 0.026333333333333334

Epoch 4 / 10
----------
Train Loss 1.0987532854988462 Accuracy 0.012142857142857143
Test Loss 1.0986170273393256 Accuracy 0.03395238095238095

Epoch 5 / 10
----------
Train Loss 1.0988344021751768 Accuracy 0.012047619047619048
Test Loss 1.098815594271075 Accuracy 0.0880952380952381

Epoch 6 / 10
----------
Train Loss 1.0932912505921863 Accuracy 0.010928571428571428
Test Loss 1.1362453171107323 Accuracy 0.05333333333333334

Epoch 7 / 10
----------
Train Loss 1.0862716469991776 Accuracy 0.01138095238095238
Test Loss 1.1785727146916556 Accuracy 0.12385714285714286

Epoc

In [23]:
# Load the Trained Model

path1 = "/content/contextual_embedding_model.pth"

model.load_state_dict(torch.load(path1))

model = model.to(device)    # Moving Model to Device

In [24]:
test_acc, _ = eval_model(model, test_data_loader, device, len(df_test))
test_acc.item()

0.3333333333333333