# IMDB Sentiment Analysis Using a BERT Transformer in PyTorch

Loading packages:

In [1]:
import time
import warnings

import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torchtext.datasets import IMDB
from tqdm.notebook import tqdm
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import get_linear_schedule_with_warmup

warnings.filterwarnings('ignore')

## Loading the data
Data is available from torchdata [Link](https://pytorch.org/data/main/examples.html)

In [2]:
data = IMDB(split='train')

### Visualizing the data

Printing a few of the samples from the data.

In [3]:
count = 1
for label, text in data:
    print(f"Comment {count}: " + text)
    print(f"Label: {label}")
    count += 1
    if count == 4: break

Comment 1: Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn't really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I'd have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.
Label: pos
Comment 2: Zentropa is the most original movie I've seen in years. If you like unique thrillers that are influenced by film noir, then this is just the right cure for all of those Hollywood summer blockbusters clogging the theaters these days. Von T

## Data Preprocessing

First we tokenize all the comments in the dataset and transform the positive and negative labels to 1 and 0, respectively.

In [4]:
data = IMDB(split='train')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Pad or truncate every sample to exactly 256 tokens.
# Note: `data` yields (label, text) tuples, which the tokenizer encodes as sequence pairs.
tokenized_data = tokenizer.batch_encode_plus(
    data,
    add_special_tokens = True,
    return_attention_mask = True,
    padding = 'max_length',
    truncation = True,
    max_length = 256,
    return_tensors = 'pt')

# Map the string labels to integers: 'pos' -> 1, anything else -> 0.
labels = [1 if label == 'pos' else 0 for label, _ in data]


I ran into bugs and conda conflicts when trying to use sklearn or torch for splitting, so I split the data into train and test datasets manually, although this is probably not the most efficient way.

We are using a Hugging Face transformer-based classification model for sequence data [Link](https://huggingface.co/docs/transformers/model_doc/bert). The model expects the following inputs:
- input_ids: the token ids produced by the tokenizer
- attention_mask: 1 for real tokens and 0 for padding, since every sequence is padded to max_length = 256
- labels
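
To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch. The helper `pad_with_mask` and the token ids are made up for illustration; the real tokenizer builds both tensors itself. The sketch assumes 0 as the padding id (the `[PAD]` id in `bert-base-uncased`) and, unlike the real tokenizer, it does not re-append `[SEP]` when it truncates.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad              # 0 marks padding that attention should ignore
    return ids, mask


# Hypothetical ids: 101 and 102 are BERT's [CLS] and [SEP]; the others are arbitrary.
ids, mask = pad_with_mask([101, 2023, 3185, 102], max_length=8)
print(ids)
print(mask)
```

Both lists always have length `max_length`, which is what allows the samples to be stacked into a single tensor.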

First we create input_ids, attention_mask, and labels for all the data and then split them into train and test datasets.

In [5]:
input_ids = tokenized_data['input_ids']
attention_mask = tokenized_data['attention_mask']
labels = torch.tensor(labels)

# Shuffle all sample indices, then use the first 20000 for training and the rest for testing.
indices = torch.randperm(len(labels))
train_indices = indices[:20000]
test_indices  = indices[20000:]

input_ids_train = input_ids[train_indices]
attention_mask_train = attention_mask[train_indices]
labels_train = labels[train_indices]

input_ids_test = input_ids[test_indices]
attention_mask_test = attention_mask[test_indices]
labels_test = labels[test_indices]

dataset_train = TensorDataset(input_ids_train, attention_mask_train, labels_train)
dataset_test  = TensorDataset(input_ids_test , attention_mask_test , labels_test )
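
The split above draws a fresh permutation on every run, so the train and test sets change between runs. If a repeatable split is wanted, one option is a seeded generator. The following is a minimal sketch using only the standard library; the helper name `split_indices` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random


def split_indices(n_samples, n_train, seed=0):
    """Return disjoint (train, test) index lists that together cover range(n_samples)."""
    rng = random.Random(seed)        # local generator, so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(25000, 20000, seed=0)
print(len(train_idx), len(test_idx))
```

The resulting index lists can be used in the same way as `train_indices` and `test_indices`.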

The next step is to create batch dataloaders for both the train and test datasets.

In [6]:
batch_size = 32
train_dataloader = DataLoader(dataset_train, sampler = RandomSampler(dataset_train), batch_size = batch_size)
test_dataloader =  DataLoader(dataset_test,  sampler = RandomSampler(dataset_test ), batch_size = batch_size)

## Setting up BERT Pretrained Model

A Hugging Face BERT transformer-based model is used here, which may be overkill for this task.
We check whether CUDA is available and, if so, move the model to the GPU. Note that without a GPU it is impractical to run this model, as training would take more than 10 hours.

In [7]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2, output_attentions = False,
                                      output_hidden_states = False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

cuda


### Checking the model

In [8]:
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

## Defining Performance Metrics

Because of a conda conflict with sklearn that I could not resolve without reinstalling many packages, I wrote the F1 score evaluation function below by hand; it may not be the most efficient approach.

One function computes the F1 score, and the other reports the per-class accuracy.


In [9]:
def f1_score_func(preds, labels):
    preds_flat  = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    PC, NC = 0, 0
    TP, TN, FP, FN = 0, 0, 0, 0
    for pred, label in zip(preds_flat, labels_flat):
        if pred == 1 and label == 1:
            TP += 1
        elif pred == 1 and label == 0:
            FP += 1
        elif pred == 0 and label == 1:
            FN += 1
        elif pred == 0 and label == 0:
            TN += 1
        if label == 1:
            PC += 1
        else:
            NC += 1
    
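    # Guard against division by zero when a class is never predicted or never present.
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall    = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    print("TP:{}, TN:{}, FP:{}, FN:{}".format(TP, TN, FP, FN))
    print("Precision:{}, Recall:{}".format(precision, recall))
    print("Positive cases: {}, Negative cases: {}".format(PC, NC))
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)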
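
For reference, the quantities computed in `f1_score_func` follow the standard definitions, with the positive class (label 1) treated as the class of interest. The F1 score can equivalently be written directly in terms of the confusion counts:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}.
\]
```

The last form is undefined only when $TP = FP = FN = 0$, that is, when there are no positive labels and no positive predictions.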

In [10]:
def accuracy_per_class(preds, labels):
    preds_flat  = np.argmax (preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    inverse_dict = {1:'pos', 0:'neg'}
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true  = labels_flat[labels_flat==label]
        print(f'Class: {inverse_dict[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}')

The `evaluate(...)` function receives the dataloader of a dataset and returns the loss summed over all batches, together with the model's raw predictions (logits) and the true labels. Gradients are not computed, since we only need the forward pass.

In [11]:
def evaluate(dataloader_val):

    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_total, predictions, true_vals

### Checking the performance of untrained model

Since BERT is a very strong model for this application and we start from pre-trained weights and word embeddings, we first want to see how the model performs without any training on our dataset.

For this we send test dataset to the 'evaluate' function and output the metrics.

In [12]:
start = time.time()
val_loss, predictions, true_vals = evaluate(test_dataloader)
T = time.time() - start
val_f1 = f1_score_func(predictions, true_vals)
tqdm.write(f'Validation loss: {val_loss}')
tqdm.write(f'F1 score: {val_f1}')
tqdm.write(f'Evaluation took {T} seconds')

  0%|          | 0/157 [00:00<?, ?it/s]

TP:1214, TN:1272, FP:1219, FN:1295
Precision:0.4989724619810933, Recall:0.48385811080111596
Positive cases: 2509, Negative cases: 2491
Validation loss: 109.1338484287262
F1 score: 0.4912990692027519
Evaluation took 20.51658296585083 seconds


The test set is balanced, with nearly equal numbers of positive (2509) and negative (2491) labels. The untrained classification head performs at chance level: it predicts positive for 2433 of the 5000 samples and gets roughly half of each class right, giving an F1 score of about 0.49.
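
As a quick sanity check, the numbers printed above can be compared against what pure guessing would give on a balanced two-class problem. This sketch only re-derives quantities from the recorded output; the variable names are illustrative. Note that the reported validation loss is a sum over 157 batches, so it is divided by the batch count here, which only approximates the per-sample mean because the last batch is smaller.

```python
import math

# Counts and loss copied from the evaluation output above.
TP, TN, FP, FN = 1214, 1272, 1219, 1295
total_loss, n_batches = 109.1338484287262, 157

accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * TP / (2 * TP + FP + FN)
mean_batch_loss = total_loss / n_batches
chance_loss = math.log(2)   # cross-entropy of a uniform guess over 2 classes

print(f"accuracy        = {accuracy:.4f}")
print(f"F1              = {f1:.4f}")
print(f"mean batch loss = {mean_batch_loss:.4f}")
print(f"ln(2)           = {chance_loss:.4f}")
```

An accuracy close to 0.5 and a mean loss close to ln(2) are both what a randomly initialized classification head would be expected to produce.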

## Model Training

First we set up the optimizer and the learning rate scheduler. We train for only 1 epoch.

In [13]:
epochs = 1
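optimizer = AdamW(model.parameters(), lr = 1e-5, eps = 1e-8, weight_decay = 0.0)
# The scheduler is stepped once per batch, so the total step count is batches per epoch times epochs.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = len(train_dataloader) * epochs)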
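
The scheduler counts optimizer steps (batches), not samples. The sketch below is a plain-Python approximation of a linear warmup and decay schedule, intended only to show its shape; `linear_warmup_lr` is a hypothetical helper, not the library function. With 20000 training samples and a batch size of 32, one epoch is 625 steps. If `num_training_steps` were set to the number of samples (25000) instead, the learning rate would have decayed by only 2.5% by the end of the epoch.

```python
import math


def linear_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer updates under linear warmup and linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


n_train, batch_size, epochs = 20000, 32, 1
steps_per_epoch = math.ceil(n_train / batch_size)   # number of batches per epoch
total_steps = steps_per_epoch * epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_warmup_lr(step, 1e-5, 0, total_steps))

# Same step, but with the horizon set to the sample count instead of the step count.
print(linear_warmup_lr(total_steps, 1e-5, 0, 25000))
```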

The training loop is simple. For each batch, we take the input_ids, attention_mask, and labels defined earlier for the train dataloader, send them to the model, and accumulate the returned loss.

At the end of each epoch, after the loop has gone over all batches, we call `evaluate` on the test dataset.

In [14]:
start = time.time()
for epoch in range(1, epochs+1):
    
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(train_dataloader, desc='Epoch {:1d}'.format(epoch), leave = False, disable = False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': batch[0], 'attention_mask':batch[1], 'labels':batch[2]}
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The model already returns the mean loss over the batch, so no further division is needed.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    tqdm.write(f'\nEpoch {epoch}')
    tqdm.write(f'Training Loss: {loss_train_total}')
    
    print("Now evaluating the test set...")
    val_loss, predictions, true_vals = evaluate(test_dataloader)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 score: {val_f1}')

tqdm.write(f'Training took {(time.time() - start)/60} minutes.')
    

Epoch 1:   0%|          | 0/625 [00:00<?, ?it/s]


Epoch 1
Training Loss: 26.96146459034935
Now evaluating the test set...


  0%|          | 0/157 [00:00<?, ?it/s]

TP:2509, TN:2491, FP:0, FN:0
Precision:1.0, Recall:1.0
Positive cases: 2509, Negative cases: 2491
Validation loss: 0.02230093772232067
F1 score: 1.0
Training took 5.007192858060201 minutes.


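The model makes no mistakes on the test set. A perfect score after a single epoch should be treated with suspicion: it more likely means that label information is reaching the model input than that the task has been solved. Note that the tokenizer above was given the raw `(label, text)` tuples from the dataset rather than the review text alone.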
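
One concrete way label information can reach the input: Hugging Face tokenizers read a 2-tuple inside a batch as a sequence pair, so a `(label, text)` tuple would be encoded as `[CLS] label [SEP] text [SEP]`. The sketch below, using two made-up reviews, separates texts and labels before anything is tokenized. The sample strings and the `label_to_id` name are illustrative assumptions.

```python
# Two made-up samples in the (label, text) layout that the IMDB iterator yields.
samples = [
    ("pos", "A grim but intriguing film."),
    ("neg", "Too long and rather dull."),
]

label_to_id = {"pos": 1, "neg": 0}

# Keep the label out of anything that will be tokenized.
texts = [text for _, text in samples]
labels = [label_to_id[label] for label, _ in samples]

# Cheap guard: every item handed to the tokenizer must be a plain string,
# because a 2-tuple would be interpreted as a (first, second) sequence pair.
assert all(isinstance(t, str) for t in texts)

print(texts)
print(labels)
```

With the real dataset, `texts` would be passed to `tokenizer.batch_encode_plus(...)` in place of the raw iterator. A rerun on text-only inputs would likely give a lower but more trustworthy score.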

## Evaluating on custom input

Now we can feed any custom sentence to the trained model and evaluate its output. We define a function to classify a single comment.

In [15]:
inverse_dict = {1:'pos', 0:'neg'}
def evaluate_comment(comment):
    model.eval()  # make sure dropout is disabled for inference
    # Tokenize the comment and move the resulting tensors to the model's device.
    inputs = tokenizer(comment, truncation=True, max_length=256, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs["logits"].detach().cpu().numpy()
    category = np.argmax(logits, axis = 1).flatten()
    return inverse_dict[category[0]]


We call this function on the list of comments below, which can be modified. The model performs well on these examples.

In [21]:
comments = ["Too many movies like this",
            "Too long for such a move!",
            "Lucky that he is still making movies.",
            "A movie for everyone, no matter what age.",
           ]

for comment in comments:
    print("For the comment:" + comment + "\t Prediction is: " + evaluate_comment(comment))

For the comment:Too many movies like this	 Prediction is: neg
For the comment:Too long for such a move!	 Prediction is: neg
For the comment:Lucky that he is still making movies.	 Prediction is: pos
For the comment:A movie for everyone, no matter what age.	 Prediction is: pos
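
The function above returns only the predicted class. If a confidence value is also of interest, the two logits can be turned into probabilities with a softmax. The following is a minimal standard-library sketch; the logit values are hypothetical, and in the notebook they would come from `outputs["logits"]`, in the order `[neg, pos]` given by `inverse_dict`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [neg, pos].
probs = softmax([-1.2, 2.3])
print({"neg": round(probs[0], 3), "pos": round(probs[1], 3)})
```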
