# Sentiment Analysis on IMDB movie reviews using BERT

#### BERT 
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.
BERT is a large language model used in many NLP applications, including Google Search.
 > In this notebook we take a smaller, distilled version of BERT (distilbert-base-uncased) and fine-tune it on IMDB movie reviews which have been labelled as positive or negative sentiment. 
 > The fine-tuned model is then used to judge the sentiment (positive/negative) of new reviews. 


#### Imports
- Firstly we import Path and os. These libraries help us work with other files and folders in our directory. 

- Secondly we import torch (PyTorch) and some of its submodules.
    - The PyTorch DataLoader provides tools for iterating over batches of training and testing data. 
    - AdamW is an optimiser which helps avoid overfitting by applying weight decay. In brief, this discourages solutions with extreme parameter values. 

- Thirdly we import DistilBERT from a library called transformers. 
    - DistilBERT is a small, fast, cheap and light version of BERT. 
    - DistilBertForSequenceClassification adds a classification layer on top of the model. 
    - DistilBertTokenizerFast is a tokeniser, a tool which splits sentences into tokens (words or pieces of words). 

- Finally, tqdm is a helpful library which adds a progress bar to loops. We use this to see whether a loop is going to take an unfeasibly long time to run. 

In [5]:
from pathlib import Path
import os 

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

from tqdm import tqdm


#### Constants 
- We name the model: distilbert-base-uncased.
- We set the batch size, which is the number of training examples used in each iteration of the training loop. Batching reduces the cost of training. We've set it to 16, which is quite small. 
- The number of epochs is the number of times we pass over the whole training set. We've set it to 3, which again is quite small. 
- The learning rate determines the step size taken while moving towards a minimum of the loss function. We've set it to 5e-5, a small value which is quite standard. 


In [6]:
MODEL_NAME = 'distilbert-base-uncased'

BATCH_SIZE = 16

NUM_EPOCHS = 3

LEARNING_RATE = 5e-5
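
To make the batch size and epoch count concrete, here is a small arithmetic sketch of how many optimiser steps they imply. It assumes a training set of 25,000 reviews (the size reported later in this notebook) and that the last, smaller batch of each epoch is kept, which is the DataLoader default.

```python
import math

# Assumed values: batch size and epochs match the constants above;
# the training-set size is the 25,000 reviews reported later in this notebook.
NUM_TRAIN_EXAMPLES = 25_000
BATCH_SIZE = 16
NUM_EPOCHS = 3

# The final batch of an epoch may be smaller than BATCH_SIZE, hence the ceiling.
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

print(steps_per_epoch)  # 1563
print(total_steps)      # 4689
```

Each of those steps is a full forward and backward pass through the model, which is why training is slow without a GPU.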


Using those imports and constants we initialise our tokeniser and model.
When we load this model it helpfully tells us that the sequence classification top layer is randomly initialised and should probably be trained before it can be used on predictions. So let's do that! 

In [7]:
tokeniser = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

In order to train our model we need to access our data and prepare it for training and testing. 

The data we are using is an open source dataset called the Large Movie Review Dataset, or aclImdb (ACL stands for the Association for Computational Linguistics).

The text reviews are stored in folders called test and train, and then in sub-folders called neg and pos, which refer to the sentiment of the text. 
We want to capture the information encoded in the folder names and use it to label each review. 

The function below takes the directory where one split of the data is stored and returns a list of the texts and a parallel list of labels. 

The labels are set to 0 where the sentiment is negative, and 1 where the sentiment is positive.

In [8]:
def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            # Explicit encoding avoids platform-dependent defaults
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels
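
As a quick sanity check, the sketch below runs the same logic on a tiny made-up directory tree instead of the real dataset. The file names and review texts are invented for illustration, and `sorted()` and the explicit encoding are additions to keep the demo reproducible.

```python
import tempfile
from pathlib import Path


def read_imdb_split(split_dir):
    """Mirrors the function above: 0 = negative, 1 = positive.

    sorted() and the explicit encoding are additions for a reproducible demo.
    """
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels


# Build a tiny, made-up directory tree with the same pos/neg layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train"
    for name, review in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
        (root / name).mkdir(parents=True)
        (root / name / "0_1.txt").write_text(review, encoding="utf-8")

    texts, labels = read_imdb_split(root)

print(texts)   # ['Loved it.', 'Dull and slow.']
print(labels)  # [1, 0]
```

Because the "pos" folder is read first, all positive reviews come before all negative ones in the returned lists, which is one reason the training DataLoader shuffles the data.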

We then call the function on the train and test folders and assign the returned values to the corresponding variables. 

The text and label lists have the same length, as expected. 

In this case the lengths of the test and train sets are also equal, which isn't always the case. 

In [9]:
train_texts , train_labels = read_imdb_split("../aclimdb/train")
test_texts , test_labels = read_imdb_split("../aclimdb/test")
print(len(train_labels))
print(len(train_texts))
print(len(test_labels))
print(len(test_texts))

25000
25000
25000
25000


The lists of review texts are then passed through the tokeniser.

Truncation cuts any input that is longer than the maximum length the model accepts (512 tokens here). 

Padding adds padding tokens to the end of shorter inputs so that all inputs have the same length. 
The tokeniser also creates an attention mask which shows where the real tokens are within each item. Positions that contain only padding are marked as 0 in the attention mask. 

There are 25000 items in the training set, and after truncation and padding each of them is 512 tokens long. 

In [10]:
train_encodings = tokeniser(train_texts,  truncation=True, padding=True)
test_encodings = tokeniser(test_texts,  truncation=True, padding=True)
print(len(train_encodings["input_ids"]))
print(len(train_encodings["input_ids"][0]))


25000
512
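
The padding and attention-mask behaviour can be illustrated in plain Python without the tokeniser. This is a simplified sketch: the token ids are hypothetical, `MAX_LENGTH` of 4 stands in for the 512-token limit, and the helper `pad_and_truncate` is invented here (the real tokeniser also keeps its special end token when truncating, which this sketch ignores).

```python
# Hypothetical token ids for three short "reviews"; real ids come from the tokeniser.
sequences = [[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102], [101, 102]]
MAX_LENGTH = 4   # stand-in for the model's 512-token limit
PAD_ID = 0       # assumed padding id


def pad_and_truncate(seqs, max_length, pad_id):
    """Cut long sequences, then pad every sequence to the longest remaining length."""
    truncated = [s[:max_length] for s in seqs]
    target = max(len(s) for s in truncated)
    input_ids, attention_mask = [], []
    for s in truncated:
        n_pad = target - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_and_truncate(sequences, MAX_LENGTH, PAD_ID)

print(input_ids)       # [[101, 2023, 102, 0], [101, 2023, 3185, 2001], [101, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
```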


Next we need to build the dataset which the DataLoader will use to feed the training loop. 

This class stores the tokenised text and sentiment labels as attributes and defines two methods: one returns the number of items, and the other returns a single item (its encodings and label) as tensors.  

In [11]:
class ImdbDataset(torch.utils.data.Dataset):

    def __init__(self,encodings,labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self,index):
        item = {k: torch.tensor(v[index]) for k,v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[index])
        return  item 


Two instances of the class are created for test and train. 

The train_dataset is turned into a DataLoader object which is given the batch size we defined earlier. 

We also define the AdamW optimiser we imported earlier, initialised with the model parameters and the learning rate.

In [12]:
train_dataset = ImdbDataset(train_encodings,train_labels)
test_dataset = ImdbDataset(test_encodings,test_labels)
train_loader = DataLoader(train_dataset, batch_size= BATCH_SIZE,shuffle=True )
optim = AdamW(params=model.parameters(), lr = LEARNING_RATE)
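
To see what the DataLoader actually yields, here is a toy sketch with made-up encodings. `ToyDataset` is a renamed copy of the class above so that the example is self-contained; the ids, masks, labels and batch size of 2 are all invented for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Same structure as ImdbDataset above, rebuilt so the sketch is self-contained."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        item = {k: torch.tensor(v[index]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[index])
        return item


# Made-up encodings: five "reviews", each already padded to four tokens.
toy_encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0], [101, 102, 0, 0],
                  [101, 5, 6, 102], [101, 4, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0],
                       [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [1, 0, 0, 1, 1]

loader = DataLoader(ToyDataset(toy_encodings, toy_labels), batch_size=2, shuffle=False)
batch_shapes = [tuple(batch["input_ids"].shape) for batch in loader]
first_batch = next(iter(loader))

print(len(loader))            # 3
print(batch_shapes)           # [(2, 4), (2, 4), (1, 4)]
print(first_batch["labels"])  # tensor([1, 0])
```

Each batch is a dictionary of stacked tensors, and the last batch is smaller because five items do not divide evenly by two.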

One of the key features of PyTorch is the ability to utilise GPUs if available. This cell checks whether a CUDA device is available and, if so, transfers the model to the GPU. 

In [13]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(DEVICE)
print(DEVICE)

cpu


If we have a saved model we can load it from a file, skipping the training phase. Otherwise the training loop will run over the training set to adjust the weights of the model.

In [14]:
MODEL_PATH = f"models/imdb_{MODEL_NAME}.pth"

#### The training loop
If not running on a GPU this loop will take a long time to run. 

 - Firstly the model is placed in training mode. 

 - The if statement checks whether a saved model exists at the model path. If so, the loop is skipped and the model parameters are loaded from the file. 

 - For each epoch, the training loop runs through each batch in the training loader. 
     - The gradients held by the optimiser are set to zero.
     - The input ids, attention mask and labels are pulled from the batch. 
     - These tensors are passed to the model, which produces the outputs.
     - The loss is taken from the outputs and back-propagated.
     - The optimiser is stepped.



In [15]:
#Long running time 
model.train()
if not os.path.exists(MODEL_PATH):
    for epoch in range(NUM_EPOCHS):
        for batch in train_loader:
            #set gradients to zero
            optim.zero_grad()
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['labels'].to(DEVICE)

            outputs = model(input_ids,attention_mask=attention_mask,labels=labels)
            loss = outputs[0]
            #back propagation 
            loss.backward()

            #adamW descent function 
            optim.step()
    os.makedirs("models",exist_ok=True)
    torch.save(model.state_dict(), MODEL_PATH)
else:
    # Load the saved weights onto whichever device is in use
    model.load_state_dict(torch.load(MODEL_PATH, map_location=DEVICE))
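
The same four steps (zero the gradients, forward pass, back-propagate, step the optimiser) can be seen in isolation on a toy problem that runs in well under a second. This is only a sketch: the linear model, the made-up regression data, the learning rate of 0.05 and the 200 steps are assumptions chosen for illustration, not settings used by this notebook.

```python
import torch
from torch import nn
from torch.optim import AdamW

torch.manual_seed(0)

# Made-up regression data: targets are a fixed linear function of the inputs.
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]])

toy_model = nn.Linear(3, 1)
toy_optim = AdamW(toy_model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

toy_model.train()
losses = []
for step in range(200):
    toy_optim.zero_grad()                       # clear gradients from the previous step
    loss = loss_fn(toy_model(inputs), targets)  # forward pass and loss
    loss.backward()                             # back-propagate
    toy_optim.step()                            # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss should fall as the steps accumulate. In the notebook's loop the loss comes from the model itself because the labels are passed in with the inputs.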

The model is then placed in evaluation mode, which deactivates the dropout layers.

In [16]:
model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
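
The effect of evaluation mode on dropout can be checked on a standalone layer. In this sketch the dropout probability of 0.5 is an assumption chosen to make the effect obvious (the layers printed above use p=0.1).

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # about half the entries zeroed, the rest scaled by 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(x)    # identity in evaluation mode

print((train_out == 0).sum().item())  # roughly 500; the exact count depends on the seed
print(torch.equal(eval_out, x))       # True
```

Without the call to `eval()`, predictions on the same input would vary from run to run because different units would be dropped each time.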
       

#### Evaluation

To test the model we can use the test set to check whether the model is classifying reviews correctly. 

Similar to the training loop, we first create a DataLoader using the test data. 

Each batch is stepped through and the predictions are checked against the labels. 

The model's raw outputs (logits) are "softmaxed" to convert them to probabilities between 0 and 1, then "argmaxed" to pick the more probable class, 0 or 1. 

A "running correct" score counts the number of correct predictions across the dataset, and this is divided by the size of the dataset to give an accuracy. 


In [17]:
# Long running time
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)
running_correct = 0
# Gradients are not needed for evaluation; disabling them saves memory and time
with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)

        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(torch.softmax(outputs["logits"], 1), 1)
        running_correct += torch.sum(preds == labels)
accuracy = running_correct.double() / len(test_loader.dataset)
print(accuracy)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 3/1563 [00:44<6:29:28, 14.98s/it]


KeyboardInterrupt: 
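
Since the full evaluation was interrupted above, here is a small sketch of the softmax, argmax and accuracy steps on made-up logits. The four logit rows and labels are invented for illustration and say nothing about the real model's accuracy.

```python
import torch

# Made-up logits for four reviews (columns: negative, positive) and their true labels.
logits = torch.tensor([[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 1.0]])
labels = torch.tensor([0, 1, 0, 0])

probs = torch.softmax(logits, dim=1)  # each row sums to 1
preds = torch.argmax(probs, dim=1)    # index of the most probable class

# Softmax preserves ordering within a row, so argmax on the raw logits gives the same result.
same_as_logits = torch.equal(preds, torch.argmax(logits, dim=1))

accuracy = (preds == labels).sum().item() / len(labels)

print(preds)           # tensor([0, 1, 1, 0])
print(same_as_logits)  # True
print(accuracy)        # 0.75
```

The softmax is therefore only needed when the probabilities themselves are of interest, for example as a rough confidence score.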

#### Single record evaluation
If the above code takes too long to run, you could adjust the batch size, or the size of the test dataset. 

Alternatively you can test a single review and view the results below. 
Pick a number between 0 and 24999 to choose a review from the test set. 



In [23]:
test_idx = 0
test_sample = test_dataset[test_idx]
input_ids = test_sample["input_ids"].to(DEVICE).unsqueeze(0)
attention_mask = test_sample["attention_mask"].to(DEVICE).unsqueeze(0)

labels = test_sample["labels"].to(DEVICE)

output = model(input_ids,attention_mask=attention_mask)["logits"]
preds = torch.argmax(torch.softmax(output,1),1)
print("Predicted sentiment: ", preds.item())
print("Actual Sentiment: ",labels.item())
print("Review Text: ", test_texts[test_idx])

Predicted sentiment:  1
Actual Sentiment:  0
Review Text:  Well, this movie actually did have one redeeming quality. It made up the funniest season one episode of MST3K. I wish Rhino had released this one instead of "The Crawling Hand."


#### Custom Evaluation
We now have a sentiment analysis model which can take any sentence and judge its sentiment. 

We can therefore create a new custom string and run the classifier on it. 
To do this we first need to tokenise the string so it is in the same format as the training data. 

For this evaluation the predicted probability of each class is also printed, which gives some indication of the confidence of the binary prediction. 

In [25]:
test_string = "I hated this movie, it was awful"
# test_string = "I loved this movie, it was great"
# test_string = """Carpignano’s impressionistic plot and pseudo-naturalistic style also tends to boil down human emotions so as to only suggest rather than reveal complexity. The limiting style and characterizations in "A Chiara" are only so thoughtful."""


In [35]:
tokenised_text = tokeniser(test_string,truncation=True,padding=True)
input_ids = torch.tensor(tokenised_text["input_ids"],device = DEVICE).unsqueeze(0)
attention_mask = torch.tensor(tokenised_text["attention_mask"],device = DEVICE).unsqueeze(0)
output = model(input_ids,attention_mask=attention_mask)["logits"]
pred = torch.argmax(torch.softmax(output,1),1)
prob_pred = torch.softmax(output,1)
print(prob_pred)
print(pred)
print("Positive sentiment" if pred==1 else "Negative sentiment")


tensor([[9.9944e-01, 5.6132e-04]], grad_fn=<SoftmaxBackward0>)
tensor([0])
Negative sentiment
