<a href="https://colab.research.google.com/github/agrawalabr/deeplearning/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing movie reviews using transformers

This problem asks you to train a sentiment analysis model using BERT (Bidirectional Encoder Representations from Transformers), introduced [here](https://arxiv.org/abs/1810.04805). Specifically, we will parse movie reviews and classify their sentiment as positive or negative.

We will use the [Huggingface transformers library](https://github.com/huggingface/transformers) to load a pre-trained BERT model to compute text embeddings, and feed these embeddings into an RNN model that performs sentiment classification.

- Name: Abhishek Agrawal
- NetID: aa9360

## Data preparation

Before delving into the model training, let's first do some basic data processing. The first challenge in NLP is to encode text into vector-style representations. This is done by a process called *tokenization*.

In [None]:
import torch
import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Let us load the transformers library first.

In [None]:
!pip install transformers



Each transformer model is associated with a particular approach of tokenizing the input text.  We will use the `bert-base-uncased` model below, so let's examine its corresponding tokenizer.



In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



The `tokenizer` has a `vocab` attribute which contains the actual vocabulary we will be using. First, let us find out how many tokens this vocabulary contains by checking its length.

In [None]:
# Q1a: Print the size of the vocabulary of the above tokenizer.
tokenizer.vocab_size, len(tokenizer.vocab)

(30522, 30522)

Using the tokenizer is as simple as calling `tokenizer.tokenize` on a string. This will tokenize and lowercase the text in a way that is consistent with the pre-trained transformer model.

In [None]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']


We can numericalize tokens with our vocabulary by calling `tokenizer.convert_tokens_to_ids`.

In [None]:
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]
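
To make the two steps above concrete, here is a minimal sketch of WordPiece-style tokenization followed by an id lookup. It is an illustration under stated assumptions, not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `toy_tokenize` are hypothetical names, the vocabulary is made up (the real one has 30522 entries), and only `?` is treated as punctuation.

```python
# Toy vocabulary (hypothetical; not the real bert-base-uncased vocabulary).
TOY_VOCAB = {"[UNK]": 0, "hello": 1, "world": 2, "play": 3, "##ing": 4, "##ed": 5, "?": 6}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches here, so the whole word maps to [UNK]
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace (and '?'), then apply WordPiece per word."""
    tokens = []
    for word in text.lower().replace("?", " ? ").split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

tokens = toy_tokenize("Hello WORLD playing?", TOY_VOCAB)
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['hello', 'world', 'play', '##ing', '?']
print(ids)     # [1, 2, 3, 4, 6]
```

A word that is not in the vocabulary is split into known pieces where possible, and falls back to the unknown token otherwise.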


The transformer was also trained with special tokens to mark the beginning and end of the sentence, as well as a standard padding and unknown token.

Let us declare them.

In [None]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can call a function to find the indices of the special tokens.

In [None]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


We can also find the maximum input length that the model supports by checking the tokenizer's `model_max_length` attribute (for this model, it is 512 tokens).

In [None]:
max_input_length = tokenizer.model_max_length
max_input_length

512

Let us now define a function to tokenize any sentence and truncate it to 510 tokens (the 512-token limit minus one special `start` token and one special `end` token per sentence).

In [None]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    return tokens
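
As a rough sketch of why the cut is at 510, the snippet below builds a full model input from a list of token ids. The special-token ids (101, 102, 0) are the ones printed earlier; `build_input` and `pad_to` are hypothetical helper names, and the 600 ids in the long example are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids printed earlier for [CLS], [SEP], [PAD]
MAX_LEN = 512                         # maximum input length of the model

def build_input(token_ids, max_len=MAX_LEN):
    """Truncate to max_len - 2 ids, then wrap with the start and end tokens."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

def pad_to(seq, length):
    """Right-pad a sequence with the padding id up to the given length."""
    return seq + [PAD_ID] * (length - len(seq))

short = build_input([7592, 2088, 2129, 2024, 2017, 1029])  # 'hello world how are you ?'
long_ = build_input(list(range(1000, 1600)))               # 600 made-up ids

target = max(len(short), len(long_))
batch = [pad_to(seq, target) for seq in (short, long_)]

print(len(short), len(long_))      # 8 512
print([len(seq) for seq in batch]) # [512, 512]
```

Without the `- 2`, a long review would exceed 512 positions once the two special tokens are added.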

In [None]:
!pip install -q torchtext==0.6.0

Finally, we are ready to load our dataset. We will use the [IMDB Movie Reviews](https://huggingface.co/datasets/imdb) dataset. Let us also split the train dataset to form a small validation set (to keep track of the best model).

In [None]:
from torchtext import data, datasets

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)

print(TEXT, LABEL)

<torchtext.data.field.Field object at 0x7aabc0478690> <torchtext.data.field.LabelField object at 0x7aaba3308090>


In [None]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# random.seed() returns None, so seed first and pass the RNG state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())

Let us examine the size of the train, validation, and test dataset.

In [None]:
# Q1b. Print the number of data points in the train, test, and validation sets.
len(train_data), len(test_data), len(valid_data)

(17500, 25000, 7500)
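
These sizes can be checked with simple arithmetic. The sketch below assumes that `split` uses a 70/30 ratio when none is given, which is consistent with the numbers printed above; the variable names are illustrative.

```python
original_train = 25000  # IMDB ships 25,000 training and 25,000 test reviews
split_ratio = 0.7       # assumed default train fraction (matches the printed sizes)

n_train = round(original_train * split_ratio)
n_valid = original_train - n_train
n_test = 25000

print(n_train, n_test, n_valid)  # 17500 25000 7500
```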

We will build a vocabulary for the labels using `LABEL.build_vocab`, and inspect the resulting label-to-index mapping via `vocab.stoi`.

In [None]:
LABEL.build_vocab(train_data)

In [None]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})


Finally, we will set up the data-loader using a (large) batch size of 128. For text processing, we use the `BucketIterator` class.

In [None]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
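
The sketch below illustrates two consequences of this setup: how many batches each split produces at batch size 128, and why grouping reviews of similar length reduces padding. It is a simplified illustration of the idea behind length bucketing, not a reproduction of `BucketIterator` internals; `n_batches` and `padding_cost` are hypothetical helpers and the review lengths are randomly generated.

```python
import math
import random

def n_batches(n_examples, batch_size):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(n_examples / batch_size)

print([n_batches(n, 128) for n in (17500, 7500, 25000)])  # [137, 59, 196]

def padding_cost(lengths, batch_size):
    """Total pad tokens if every batch is padded to its longest member."""
    cost = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        cost += sum(max(chunk) - n for n in chunk)
    return cost

rng = random.Random(0)
lengths = [rng.randint(20, 512) for _ in range(1024)]  # made-up review lengths

unsorted_cost = padding_cost(lengths, 128)
bucketed_cost = padding_cost(sorted(lengths), 128)
print(unsorted_cost, bucketed_cost)  # bucketing by length needs far fewer pad tokens
```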

## Model preparation

We will now load our pretrained BERT model. (Keep in mind that we should use the same model as the tokenizer that we chose above).

In [None]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

As mentioned above, we will feed the BERT embeddings into a bidirectional GRU to perform the classification.

In [None]:
# torch.cuda.empty_cache()
# with torch.no_grad():
# embedded = bert(batch.text)
# embedded.last_hidden_state.shape

In [None]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self,bert,hidden_dim,output_dim,n_layers,bidirectional,dropout):

        super().__init__()

        self.bert = bert

        embedding_dim = bert.config.to_dict()['hidden_size']

        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)

        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        #text = [batch size, sent len]

        with torch.no_grad():
            embedded = self.bert(text)[0]

        #embedded = [batch size, sent len, emb dim]

        _, hidden = self.rnn(embedded)

        #hidden = [n layers * n directions, batch size, hid dim]

        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])

        #hidden = [batch size, hid dim * n directions]

        output = self.out(hidden)

        #output = [batch size, out dim]

        return output
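
To see the tensor shapes that the comments in `forward` describe, here is a small standalone sketch. It replaces the BERT output with a random tensor of the same width (768 for `bert-base-uncased`), so it runs without downloading a model; the batch size and sentence length are arbitrary.

```python
import torch
import torch.nn as nn

batch_size, sent_len, emb_dim, hid_dim, n_layers = 4, 12, 768, 256, 2

# Random stand-in for the BERT embeddings: [batch size, sent len, emb dim]
embedded = torch.randn(batch_size, sent_len, emb_dim)

rnn = nn.GRU(emb_dim, hid_dim, num_layers=n_layers,
             bidirectional=True, batch_first=True)
outputs, hidden = rnn(embedded)

print(outputs.shape)  # torch.Size([4, 12, 512]) -> [batch, sent len, hid dim * 2]
print(hidden.shape)   # torch.Size([4, 4, 256])  -> [n layers * 2, batch, hid dim]

# The last two slices are the top layer's forward and backward final states.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)    # torch.Size([4, 512])
```

Note that `batch_first=True` affects `outputs` but not `hidden`, whose first dimension always indexes layers and directions.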

Next, we'll define our actual model.

Our model will consist of

* the BERT embedding (whose weights are frozen)
* a bidirectional GRU with 2 layers, with hidden dim 256 and dropout=0.25.
* a linear layer on top which does binary sentiment classification.

Let us create an instance of this model.

In [None]:
# Q2a: Instantiate the above model by setting the right hyperparameters.

# insert code here

HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

We can check how many parameters the model has.

In [None]:
# Q2b: Print the number of trainable parameters in this model.

# insert code here.
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))

112241409


Oh no! If you did this correctly, you should see that the model contains about *112 million* parameters. Training all of them is impractical on a standard machine (or on Colab).

However, the majority of these parameters are from the BERT embedding, which we are not going to (re)train. In order to freeze certain parameters we can set their `requires_grad` attribute to `False`. To do this, we simply loop through all of the `named_parameters` in our model and if they're a part of the `bert` transformer model, we set `requires_grad = False`.

In [None]:
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False

In [None]:
# Q2c: After freezing the BERT weights/biases, print the number of remaining trainable parameters.
print(count_parameters(model))

2759169


We should now see that our model has under 3M trainable parameters. Still not trivial but manageable.
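
The count of 2,759,169 can be reproduced by hand. The sketch below assumes PyTorch's GRU parameter layout (three gates, each with input weights, hidden weights, and two bias vectors per direction) and the hyperparameters chosen above; `gru_direction_params` is a hypothetical helper.

```python
def gru_direction_params(input_dim, hidden_dim):
    """Parameters of one GRU layer in one direction (PyTorch layout)."""
    weights = 3 * hidden_dim * (input_dim + hidden_dim)  # W_ih and W_hh for 3 gates
    biases = 2 * 3 * hidden_dim                          # b_ih and b_hh for 3 gates
    return weights + biases

emb_dim, hid_dim = 768, 256

layer1 = 2 * gru_direction_params(emb_dim, hid_dim)      # input is the BERT embedding
layer2 = 2 * gru_direction_params(2 * hid_dim, hid_dim)  # input is both directions of layer 1
linear = 2 * hid_dim * 1 + 1                             # weights plus one bias

trainable = layer1 + layer2 + linear
print(layer1, layer2, linear)  # 1575936 1182720 513
print(trainable)               # 2759169
print(112241409 - trainable)   # 109482240 parameters belong to the frozen BERT model
```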

## Train the Model

All this is now largely standard.

We will use:
* the Binary Cross Entropy loss function: `nn.BCEWithLogitsLoss()`
* the Adam optimizer

and run it for 2 epochs (that should be enough to start getting meaningful results).

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [None]:
criterion = nn.BCEWithLogitsLoss()
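
For intuition, `nn.BCEWithLogitsLoss` combines a sigmoid with binary cross entropy and, by default, averages over the batch. The sketch below computes the per-example loss in plain Python in two ways: a naive form and a numerically stable form. The function names are hypothetical and the logits are made up.

```python
import math

def naive_bce(logit, target):
    """Binary cross entropy computed from an explicit sigmoid probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def stable_bce_with_logits(logit, target):
    """Equivalent form that avoids overflow: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

examples = [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0), (0.0, 1.0)]  # (logit, label) pairs
losses = [stable_bce_with_logits(x, y) for x, y in examples]
mean_loss = sum(losses) / len(losses)

for (x, y), loss in zip(examples, losses):
    print(f"logit={x:+.1f} label={y:.0f} loss={loss:.4f}")
print(f"mean loss={mean_loss:.4f}")
```

A logit of 0 corresponds to a probability of 0.5, so its loss is log 2 regardless of the label.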

In [None]:
model = model.to(device)
criterion = criterion.to(device)


Also, define functions for:
* calculating accuracy.
* training for a single epoch, and reporting loss/accuracy.
* performing an evaluation epoch, and reporting loss/accuracy.
* calculating running times.

In [None]:
def binary_accuracy(preds, y):
    # Q3a. Compute accuracy (as a number between 0 and 1)
    acc = (torch.round(torch.sigmoid(preds)) == y).to(torch.float).mean()
    return acc
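
A small worked example of the same accuracy computation in plain Python, with made-up logits and labels:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, -1.0, 0.3, -0.2]  # raw model outputs (made up)
labels = [1.0, 0.0, 0.0, 0.0]    # 1.0 = pos, 0.0 = neg

# Round each probability to 0 or 1, then compare with the label.
preds = [float(round(sigmoid(x))) for x in logits]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds)  # [1.0, 0.0, 1.0, 0.0]
print(acc)    # 0.75
```

Rounding the sigmoid output at 0.5 is equivalent to thresholding the logit at 0.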

In [None]:
from torch.cuda.amp import autocast, GradScaler
from tqdm.auto import tqdm

scaler = GradScaler()

def train(model, iterator, optimizer, criterion):

    # Q3b. Set up the training function
    progress_bar = tqdm(range(0, len(iterator)), initial=0, desc="Train Steps")
    epoch_loss = 0
    epoch_acc = 0
    model.train()

    for batch in iterator:
        text, labels = batch.text.to(device), batch.label.to(device)

        optimizer.zero_grad()
        with autocast():
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, labels.float())
            acc = binary_accuracy(predictions, labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
        progress_bar.update(1)
        progress_bar.set_postfix({"loss": loss.item(), "acc": acc.item()})

    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [None]:
def evaluate(model, iterator, criterion):

    # Q3c. Set up the evaluation function.
    progress_bar = tqdm(range(0, len(iterator)),initial=0,desc="Eval Steps")
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            text, labels = batch.text.to(device), batch.label.to(device)
            predictions = model(text).squeeze(1)

            loss = criterion(predictions, labels.float())
            acc = binary_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

            progress_bar.update(1)
            progress_bar.set_postfix({"loss": loss.item(), "acc": acc.item()})

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
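
As a quick check of the minutes/seconds conversion, the sketch below uses `divmod` to perform the same split for a non-negative duration; `split_minutes_seconds` is a hypothetical name and 268.9 is a made-up elapsed time.

```python
def split_minutes_seconds(elapsed_seconds):
    """Split a duration in seconds into whole minutes and whole seconds."""
    mins, secs = divmod(int(elapsed_seconds), 60)
    return mins, secs

print(split_minutes_seconds(268.9))  # (4, 28)
print(split_minutes_seconds(59.99))  # (0, 59)
```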

We are now ready to train our model.

**Statutory warning**: Training such models will take a very long time since this model is considerably larger than anything we have trained before. Even though we are not training any of the BERT parameters, we still have to make a forward pass. This will take time; each epoch may take upwards of 30 minutes on Colab.

Let us train for 2 epochs and print train loss/accuracy and validation loss/accuracy for each epoch. Let us also measure running time.

Saving intermediate model checkpoints using `torch.save(model.state_dict(),'model.pt')` may be helpful with such large models.

In [None]:
N_EPOCHS = 2
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        # Track the best validation loss so that only improved models overwrite the checkpoint
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')



Epoch: 01 | Epoch Time: 4m 28s
	Train Loss: 0.460 | Train Acc: 77.39%
	 Val. Loss: 0.269 | Val. Acc: 89.04%



Epoch: 02 | Epoch Time: 4m 43s
	Train Loss: 0.280 | Train Acc: 88.61%
	 Val. Loss: 0.237 | Val. Acc: 90.63%
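
The "keep the best checkpoint" bookkeeping can be sketched without any model. Below, the first two validation losses are the ones logged above, the third is made up to show an epoch that does not improve, and `saved_epochs` records where a call to `torch.save` would happen.

```python
valid_losses = [0.269, 0.237, 0.251]  # third value is hypothetical

best_valid_loss = float('inf')
saved_epochs = []

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss  # remember the best loss seen so far
        saved_epochs.append(epoch)    # the checkpoint would be overwritten here

print(saved_epochs)     # [1, 2]
print(best_valid_loss)  # 0.237
```

The comparison only works if the best loss is updated inside the `if`; otherwise every epoch would overwrite the checkpoint.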


Load the best model parameters (measured in terms of validation loss) and evaluate the loss/accuracy on the test set.

In [None]:
model.load_state_dict(torch.load('model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.228 | Test Acc: 91.11%


## Inference

We'll then use the model to test the sentiment of some fake movie reviews. We tokenize the input sentence, trim it down to length=510, add the special start and end tokens to either side, convert it to a `LongTensor`, add a fake batch dimension using `unsqueeze`, and perform inference using our model.

In [None]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]

    tensor = torch.LongTensor([init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = model(tensor)
    prediction = torch.sigmoid(prediction)
    return LABEL.vocab.itos[round(prediction.item())]
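
The last two lines of `predict_sentiment` turn a raw logit into a label. A minimal sketch of that mapping, assuming the label order `['neg', 'pos']` implied by the `stoi` mapping printed earlier; `label_from_logit` is a hypothetical helper and the logits are made up.

```python
import math

itos = ['neg', 'pos']  # index-to-label list, the inverse of {'neg': 0, 'pos': 1}

def label_from_logit(logit):
    """Apply a sigmoid, round to 0 or 1, and look up the label."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return itos[round(prob)], prob

for logit in (-2.3, 1.7):
    label, prob = label_from_logit(logit)
    print(f"logit={logit:+.1f} prob={prob:.3f} label={label}")
```

Python's built-in `round` rounds exactly 0.5 down to 0, so a logit of exactly 0 would be labelled `'neg'`.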

In [None]:
# Q4a. Perform sentiment analysis on the following two sentences.
predict_sentiment(model, tokenizer, "Justice League is terrible. I hated it.")

'neg'

In [None]:
predict_sentiment(model, tokenizer, "Avengers was great!!")

'pos'

Great! Try playing around with two other movie reviews (you can grab some off the internet or make up text yourselves), and see whether your sentiment classifier is correctly capturing the mood of the review.

In [None]:
# Q4b. Perform sentiment analysis on two other movie review fragments of your choice.
predict_sentiment(model, tokenizer, "I this one time watch is okay, how ever it's not worth the time.")

'neg'

In [None]:
predict_sentiment(model, tokenizer, "The movie was hilariously bad!! However, it is beautiful and highly recommended due to VFX!")

'pos'