<a href="https://colab.research.google.com/github/akib1162100/ML_base/blob/main/BERT_sentiment_AMH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Install the required libraries
!pip install transformers
!pip install torch



In [None]:
#import the required libraries
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler


In [None]:
# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'  # You can use other BERT variations too
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#preparing the dataset
import pandas as pd
dataset = pd.read_csv("/content/BERT_data_sentiment_IMDB.csv")

In [None]:
dataset.columns.values

array(['Text ', 'Sentiment '], dtype=object)

In [None]:
dataset.head(5)

Unnamed: 0,Text,Sentiment
0,------ Spoilers----- Spoilers----- Spoilers---...,1
1,"""Bon Voyage"" has the fast pace that in some wa...",1
2,"""Eagle's Wing"" is a pleasant surprise of a mov...",1
3,"""Embarassing"" is the only word to describe thi...",0
4,"""So there's this bride, you see, and she gets ...",0


In [None]:
dataset.rename(columns = {'Text ':'text', 'Sentiment ':'sentiment'}, inplace = True)

In [None]:
dataset.columns.values

array(['text', 'sentiment'], dtype=object)

In [None]:
# Check the label distribution to see whether the classes are balanced
dataset['sentiment'].value_counts()

1    105
0    100
Name: sentiment, dtype: int64

In [None]:
# Preprocessing function
import numpy as np

# add_special_tokens=True adds the [CLS] (start of sequence) and [SEP] (separator) tokens, which are specific to the BERT model
# max_length is the maximum number of tokens the model accepts; longer inputs are truncated and only the tokens up to that limit are kept
data_lengths = [len(data) for data in dataset['text']]  # character length of each review, for reference only
max_length = 512  # BERT's token limit; max(data_lengths) would count characters, not tokens
def preprocess(text):
    tokens = tokenizer.encode(text, add_special_tokens=True, max_length=max_length, truncation=True)
    return tokens

# Apply preprocessing to text column in the DataFrame
dataset['token_ids'] = dataset['text'].apply(preprocess)
dataset['labels'] = dataset['sentiment']

# Attention masks: 1 for real tokens, 0 for padding positions
train_masks = torch.tensor([[1]*len(tokens) + [0]*(max_length - len(tokens)) for tokens in dataset['token_ids']])

# Pad every list in the 'token_ids' column with 0 (BERT's [PAD] token id) to a uniform length of max_length
padded_tokens = [tokens + [0]*(max_length - len(tokens)) for tokens in dataset['token_ids']]
token_array = np.array(padded_tokens)

# Create PyTorch tensors from the padded numpy array and from the labels
train_inputs = torch.tensor(token_array)
train_labels = torch.tensor(dataset['labels'].values)


In [None]:
#ensuring the size of the tensors
print(train_masks.size(), train_inputs.size(), train_labels.size())

torch.Size([205, 512]) torch.Size([205, 512]) torch.Size([205])
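
The padding and masking logic of the preprocessing cell is easier to follow on toy data. The sketch below is illustrative only: apart from 101 and 102 (BERT's [CLS] and [SEP] ids) the token ids are made up, `toy_max_length` is a hypothetical stand-in for `max_length = 512`, and `pad_and_mask` is a helper introduced here rather than part of the notebook.

```python
# Toy illustration of padding and attention masks (plain Python, no tokenizer needed).
PAD_ID = 0  # id of the [PAD] token in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Pad token_ids to max_length and build the matching attention mask.

    Assumes len(token_ids) <= max_length, which truncation=True guarantees.
    """
    n_pad = max_length - len(token_ids)
    padded = token_ids + [PAD_ID] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

toy_max_length = 8  # hypothetical; the notebook uses 512
toy_sequences = [
    [101, 2023, 3185, 102],  # made-up ids between [CLS]=101 and [SEP]=102
    [101, 2307, 102],
]

for seq in toy_sequences:
    padded, mask = pad_and_mask(seq, toy_max_length)
    print(padded, mask)
```

Every padded row has the same length, which is why the two tensors above share the shape `[205, 512]`.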


In [None]:
dataset

Unnamed: 0,text,sentiment,token_ids,labels
0,------ Spoilers----- Spoilers----- Spoilers---...,1,"[101, 1011, 1011, 1011, 1011, 1011, 1011, 2759...",1
1,"""Bon Voyage"" has the fast pace that in some wa...",1,"[101, 1000, 14753, 8774, 1000, 2038, 1996, 343...",1
2,"""Eagle's Wing"" is a pleasant surprise of a mov...",1,"[101, 1000, 6755, 1005, 1055, 3358, 1000, 2003...",1
3,"""Embarassing"" is the only word to describe thi...",0,"[101, 1000, 7861, 20709, 18965, 1000, 2003, 19...",0
4,"""So there's this bride, you see, and she gets ...",0,"[101, 1000, 2061, 2045, 1005, 1055, 2023, 8959...",0
...,...,...,...,...
200,Why has this not been released? I kind of thou...,1,"[101, 2339, 2038, 2023, 2025, 2042, 2207, 1029...",1
201,"Why would this film be so good, but only gross...",1,"[101, 2339, 2052, 2023, 2143, 2022, 2061, 2204...",1
202,Woody Allen's second movie set in London. Tha ...,1,"[101, 13703, 5297, 1005, 1055, 2117, 3185, 227...",1
203,"You don't review James Bond movies, you evalua...",1,"[101, 2017, 2123, 1005, 1056, 3319, 2508, 5416...",1


In [None]:
# Define training parameters
# For demonstration's sake, batch_size and epochs are kept small because of limited computation power
batch_size = 16
epochs = 3
learning_rate = 3e-5

In [None]:
# Prepare the DataLoader for the training set

# TensorDataset is a PyTorch class that wraps the individual tensors into one dataset
# whose items are tuples (train_inputs[i], train_masks[i], train_labels[i])
train_data = TensorDataset(train_inputs, train_masks, train_labels)

# RandomSampler draws the elements of the dataset in random order to avoid biased sampling
train_sampler = RandomSampler(train_data)

# Use the CUDA-enabled GPU of Google Colab when available; on a machine with no GPU, fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# DataLoader creates an iterable over the dataset; it yields batches of batch_size (16) samples
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Define optimizer and scheduler
#The optimizer's primary function is to adjust the model's weights and biases
#based on the gradients of the loss function, effectively steering the model towards better performance
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, eps=1e-8)

#adjusts the learning rate during training based on a step function with step size 1 (which means it changes the learning rate after 1 epoch)
#and gamma is the factor by which the learning rate will be multiplied after every step size.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)


# Training loop
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        optimizer.zero_grad() #clears previously calculated gradients
        outputs = model(**inputs)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()# computes gradient of the loss with respect to the model parameters
        optimizer.step()#updates the model's parameters using the computed gradients.

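    scheduler.step() #reduces the learning rate after each epoch
    avg_train_loss = total_loss / len(train_dataloader) #average loss per batch for this epoch

# Printed once, after the loop, so this reports the average loss of the final epoch only
print(f"  Training Loss: {avg_train_loss:.4f}")
# Save the trained model (writes config.json and model.safetensors to /content)
model.save_pretrained('/content')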



  Training Loss: 0.5443
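
With `step_size=1` and `gamma=0.1`, `StepLR` multiplies the learning rate by 0.1 after every epoch. The sketch below reproduces that arithmetic in plain Python to show the rate used in each of the three epochs. It does not call PyTorch, and `step_lr` is a helper name introduced here.

```python
# Plain-Python sketch of the StepLR rule: lr = initial_lr * gamma ** (epoch // step_size)
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate in effect during the given (0-based) epoch."""
    return initial_lr * gamma ** (epoch // step_size)

learning_rate = 3e-5  # values from the training-parameter cell
epochs = 3

for epoch in range(epochs):
    lr = step_lr(learning_rate, gamma=0.1, step_size=1, epoch=epoch)
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

The rate falls by a factor of ten per epoch, so the last epoch updates the weights far less than the first. A gentler decay may be worth trying if training looks unfinished.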


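Considering that the dataset has only 205 examples, an average training loss of 0.5443 in the final epoch is not too bad. It is a reasonable first figure for assessing the model's performance, although it is measured on the training data rather than on held-out data.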
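
One way to put the 0.5443 figure in context is to compare it with the cross-entropy of a binary classifier that assigns probability 0.5 to both classes, which is ln 2. The short check below is a sketch of that comparison; `chance_loss` and `final_epoch_loss` are names introduced here.

```python
import math

# Cross-entropy of a "coin flip" binary classifier: -ln(0.5) = ln(2)
chance_loss = -math.log(0.5)

final_epoch_loss = 0.5443  # average training loss reported above

print(f"chance-level loss: {chance_loss:.4f}")
print(f"final-epoch loss:  {final_epoch_loss:.4f}")
print("better than chance:", final_epoch_loss < chance_loss)
```

The reported loss sits below the chance level, but not by a wide margin, which fits a small dataset trained for only three epochs.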

In [None]:
outputs

SequenceClassifierOutput(loss=tensor(0.5726, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[ 0.4239, -0.0079],
        [ 0.4297, -0.0248],
        [-0.1197,  0.1052],
        [ 0.0425,  0.3947],
        [ 0.6305, -0.2122],
        [-0.2183,  0.2490],
        [ 0.0427,  0.0726],
        [ 0.0031,  0.1714],
        [ 0.4869,  0.1465],
        [-0.1285, -0.0031],
        [ 0.6390, -0.0666],
        [ 0.6116, -0.0351],
        [ 0.5829,  0.0706]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

0.5726 is the cross-entropy (negative log-likelihood) loss on the last batch of the final epoch, which holds the remaining 13 of the 205 reviews (205 = 12 × 16 + 13). A lower value indicates better performance. In this case the loss is moderate, suggesting that the model is making reasonably accurate predictions on this batch.

The logits, on the other hand, are not directly informative because they have not yet been turned into final predictions. Since the output is binary ("negative" or "positive"), each row has two columns: the first is the score for class 0 (negative) and the second is the score for class 1 (positive). The predicted class is the one with the larger logit, regardless of sign. Passing the logits through a softmax function converts them into probabilities for the two classes.
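
To make the softmax step concrete, the sketch below applies it to the first three logit rows printed above, using only the standard library. The `softmax` helper and the `labels` list are introduced here for illustration; inside the notebook the same probabilities could be obtained from `outputs.logits` with PyTorch's own softmax.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# First three logit rows from the output above: [class 0 (negative), class 1 (positive)]
logits = [
    [0.4239, -0.0079],
    [0.4297, -0.0248],
    [-0.1197, 0.1052],
]
labels = ["Negative", "Positive"]

for row in logits:
    probs = softmax(row)
    predicted = probs.index(max(probs))  # same index as the larger logit
    print([round(p, 3) for p in probs], labels[predicted])
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same class as taking the argmax of the probabilities.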

In [None]:
#testing model with a new statement
import torch
from transformers import BertTokenizer, BertForSequenceClassification

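# Reload the fine-tuned model from the directory that was passed to save_pretrained
model_path = "/content"  # Directory where your trained model (config.json, model.safetensors) is saved
model = BertForSequenceClassification.from_pretrained(model_path)
model.to(device)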


In [None]:
def sentiment_analyser(new_text):
  # Preprocess the new text; note that max_length=30 keeps only the first 30 tokens
  tokenized_text = tokenizer.encode_plus(new_text, add_special_tokens=True, return_tensors='pt', max_length=30, truncation=True)
  input_ids = tokenized_text['input_ids'].to(device)
  attention_mask = tokenized_text['attention_mask'].to(device)

  # Perform inference in evaluation mode (dropout disabled) without tracking gradients
  model.eval()
  with torch.no_grad():
      outputs = model(input_ids=input_ids, attention_mask=attention_mask)
      # Index of the larger logit: 0 = negative, 1 = positive
      predicted_class = torch.argmax(outputs.logits, dim=-1).item()

  # Decode the predicted class
  predicted_sentiment = "Negative" if predicted_class == 0 else "Positive"

  print(f"Text '{new_text}' \nSentiment: {predicted_sentiment}")


We will now test the model on completely new text.

In [None]:
new_text = "This is a horrible movie. I would not recommend it to anyone who wants to watch this with their family. It has too many violent scenes and there is no proper plot."
sentiment_analyser(new_text)

Text 'This is a horrible movie. I would not recommend it to anyone who wants to watch this with their family. It has too many violent scenes and there is no proper plot.' 
Sentiment: Negative


In [None]:
new_text = "The film seems to kick of as a thriller, and sets an excellent mood. Great Movie!!"
sentiment_analyser(new_text)

Text 'The film seems to kick of as a thriller, and sets an excellent mood. Great Movie!!' 
Sentiment: Positive


The results are good as long as the text is about movies. What happens if the text is not about movies?

In [None]:
new_text = "Sun is shining in the sky, there is no cloud in sight, it stopped raining everbody's in the play, and don't you know...it's a beautiful new day."
sentiment_analyser(new_text)

Text 'Sun is shining in the sky, there is no cloud in sight, it stopped raining everbody's in the play, and don't you know...it's a beautiful new day.' 
Sentiment: Positive


The above text is taken from song lyrics, and the model seems to capture its sentiment correctly.

In [None]:
new_text = "Is this the real life? Is this just fantasy? Caught in a landslide, no escape from reality. Open your eyes, look up to the skies and see. I'm just a poor boy, I need no sympathy"
sentiment_analyser(new_text)

Text 'Is this the real life? Is this just fantasy? Caught in a landslide, no escape from reality. Open your eyes, look up to the skies and see. I'm just a poor boy, I need no sympathy' 
Sentiment: Negative


Funnily enough, the model works well on these few examples of text it has never seen before, including text that is not about movies. This is probably the advantage of working with a pre-trained model like BERT, which already encodes information from a large corpus of text.
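
A handful of spot checks can be complemented by measuring accuracy on labelled reviews that were held out of training. The sketch below shows one possible way to set that up in plain Python. It is an assumption about how this could be done, not part of the notebook: `train_validation_split` and `accuracy` are hypothetical helpers, and the example labels are made up purely to show the calculation.

```python
import random

def train_validation_split(n_examples, validation_fraction=0.2, seed=0):
    """Return shuffled (train_indices, validation_indices) for n_examples rows."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_val = int(n_examples * validation_fraction)
    return indices[n_val:], indices[:n_val]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

train_idx, val_idx = train_validation_split(205)  # 205 reviews in this dataset
print(len(train_idx), len(val_idx))

# Made-up labels purely to show the calculation; not results from this model
example_predicted = [1, 0, 1, 1, 0]
example_actual = [1, 0, 0, 1, 0]
print(accuracy(example_predicted, example_actual))
```

The index lists could be used to select rows of the tensors before building the `TensorDataset`, so that the validation reviews never reach the training loop.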