Problem 5 - Giorgi Merabishvili


**1 Analyzing movie reviews using transformers**

This problem asks you to train a sentiment analysis model using the BERT (Bidirectional Encoder Representations from Transformers) model, introduced here. Specifically, we will parse movie reviews and classify their sentiment as positive or negative.

We will use the Huggingface transformers library to load a pre-trained BERT model to compute text embeddings, and feed these embeddings into an RNN model to perform sentiment classification.

**1.1 Data preparation**

Before delving into the model training, let’s first do some basic data processing. The first challenge in NLP is to encode text into vector-style representations. This is done by a process called tokenization.


In [1]:
import torch
import random
import numpy as np

# common seed value for reproducibility across numpy, random, and torch
SEED = 1234

# Seed the random number generators for reproducibility
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Let us load the transformers library first.

In [2]:
!pip install transformers



Each transformer model is associated with a particular approach of tokenizing the input text. We will use the bert-base-uncased model below, so let’s examine its corresponding tokenizer.

In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


The tokenizer has a vocab attribute which contains the actual vocabulary we will be using. First, let us discover how many tokens are in this language model by checking its length.

In [4]:
# Q1a: Print the size of the vocabulary of the above tokenizer.
print(f"Size of the vocabulary: {len(tokenizer.vocab)}")

Size of the vocabulary: 30522


Using the tokenizer is as simple as calling tokenizer.tokenize on a string. This will tokenize and lowercase the data in a way that is consistent with the pre-trained transformer model.

In [5]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)


['hello', 'world', 'how', 'are', 'you', '?']


We can convert tokens to their vocabulary indices using tokenizer.convert_tokens_to_ids.

In [6]:
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)


[7592, 2088, 2129, 2024, 2017, 1029]
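
To make the two steps above concrete, here is a minimal sketch of greedy longest-match subword tokenization followed by an id lookup. The tiny vocabulary `TOY_VOCAB` and the helper `toy_wordpiece` are hypothetical and only mimic the general idea; the real BertTokenizer also handles punctuation and uses its own 30522-entry vocabulary, so the ids below are not BERT's ids.

```python
# Hypothetical toy vocabulary; the ids are arbitrary and NOT BERT's real ids.
TOY_VOCAB = {"[UNK]": 0, "hello": 1, "world": 2, "play": 3, "##ing": 4, "##ed": 5}

def toy_wordpiece(text, vocab=TOY_VOCAB, unk="[UNK]"):
    """Lowercase, split on whitespace, then greedily match the longest subword."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end = len(word)
            piece = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # marks a word continuation
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = [unk]  # no match: the whole word becomes unknown
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

tokens = toy_wordpiece("Hello WORLD playing")
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)
print(ids)
```

A word that cannot be assembled from vocabulary pieces falls back to the unknown token, which is the role `[UNK]` plays in the real tokenizer as well.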


The transformer was also trained with special tokens to mark the beginning and end of the sentence, as well as a standard padding and unknown token.

Let us declare them.

In [7]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)


[CLS] [SEP] [PAD] [UNK]


We can call a function to find the indices of the special tokens.

In [8]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)


101 102 0 100


We can also find the maximum input length the model accepts by checking the max_model_input_sizes attribute (for this model, it is 512 tokens).

In [9]:
max_input_length = tokenizer.max_model_input_sizes['google-bert/bert-base-uncased']

Let us now define a function to tokenize any sentence and truncate it to 510 tokens (we need to leave room for one special start token and one end token per sentence).

In [10]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    return tokens
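
The length budget can be checked with a small sketch. It assumes the special-token ids 101 and 102 printed earlier and the 512-token limit; the helper name `add_special_tokens` is hypothetical (in the notebook itself, the torchtext Field adds the start and end tokens).

```python
CLS_ID, SEP_ID = 101, 102   # ids of [CLS] and [SEP] printed earlier in the notebook
MAX_LEN = 512               # maximum input length for bert-base-uncased

def add_special_tokens(token_ids, max_len=MAX_LEN):
    """Truncate to max_len - 2 ids, then wrap them as [CLS] ... [SEP]."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

short = add_special_tokens([7592, 2088])
long_ = add_special_tokens(list(range(1000, 2000)))  # 1000 made-up ids
print(short)
print(len(long_))
```

Truncating to 510 before adding the two special tokens guarantees that even a very long review ends up with exactly 512 ids.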


Finally, we are ready to load our dataset. We will use the IMDB Movie Reviews dataset. Let us also split the train dataset to form a small validation set (to keep track of the best model).

In [11]:
!pip install torchtext==0.6.0

from torchtext import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)



In [12]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

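# random.seed() returns None, so seed first and pass the generator state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())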




Let us examine the size of the train, validation, and test dataset.

In [13]:
# Q1b. Print the number of data points in the train, test, and validation sets.
print(f"Number of data points in the train set: {len(train_data)}")
print(f"Number of data points in the test set: {len(test_data)}")
print(f"Number of data points in the validation set: {len(valid_data)}")


Number of data points in the train set: 17500
Number of data points in the test set: 25000
Number of data points in the validation set: 7500
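
These sizes are consistent with a 70/30 split of the 25000 original training reviews. The sketch below reproduces only the arithmetic with a seeded shuffle; `split_indices` is a hypothetical helper, and the 0.7 ratio is assumed to be the default used by the `split` call above, so the actual examples chosen by torchtext will differ.

```python
import random

def split_indices(n, split_ratio=0.7, seed=1234):
    """Shuffle indices 0..n-1 with a fixed seed and split them by ratio."""
    rng = random.Random(seed)          # local generator, leaves global state alone
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(round(split_ratio * n))
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))
```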


We will build a vocabulary for the labels with LABEL.build_vocab and inspect the resulting label-to-index mapping through vocab.stoi.

In [14]:
LABEL.build_vocab(train_data)


In [15]:
print(LABEL.vocab.stoi)


defaultdict(None, {'neg': 0, 'pos': 1})


Finally, we will set up the data-loader using a (large) batch size of 128. For text processing, we use the BucketIterator class.

In [16]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Using device: {device}")

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)


Using device: cuda
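
The idea behind bucketing is to batch together examples of similar length so that little padding is needed. The following is a simplified sketch of that idea only; `bucket_batches` is a hypothetical helper, and the real BucketIterator does more (for example, it can shuffle batches).

```python
def bucket_batches(sequences, batch_size, pad_id=0):
    """Sort by length, cut into batches, and pad each batch to its own max length."""
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches

# Hypothetical token-id sequences of varying length.
seqs = [[5, 6, 7, 8, 9], [1], [2, 3], [4, 4, 4, 4], [9, 9], [7]]
batches = bucket_batches(seqs, batch_size=2)
for b in batches:
    print(b)
```

In this toy input only a single padding id is needed across all three batches, whereas padding every sequence to the global maximum length of 5 would need 15.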


**1.2 Model preparation**

We will now load our pretrained BERT model. Keep in mind that it must match the tokenizer we chose above.

In [17]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')



In [18]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers = n_layers, bidirectional = bidirectional, batch_first = True, dropout = 0 if n_layers < 2 else dropout)
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        with torch.no_grad():
            embedded = self.bert(text)[0]
        _, hidden = self.rnn(embedded)
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        output = self.out(hidden)
        return output


The class above defines our actual model.

Our model will consist of


*   the BERT embedding (whose weights are frozen)
*   a bidirectional GRU with 2 layers, with hidden dim 256 and dropout=0.25.
*   a linear layer on top which does binary sentiment classification.
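
The forward pass concatenates `hidden[-2]` and `hidden[-1]`. A short sketch of why these are the last layer's forward and backward states, assuming PyTorch's documented layout of the final hidden state (first dimension of size num_layers * num_directions, ordered by layer and then by direction); `hidden_index` is a hypothetical helper used only for this arithmetic.

```python
def hidden_index(layer, direction, num_directions=2):
    """Row of the final hidden state for (layer, direction); 0 = forward, 1 = backward."""
    return layer * num_directions + direction

N_LAYERS = 2
rows = N_LAYERS * 2                    # 4 rows for a 2-layer bidirectional GRU
fwd = hidden_index(N_LAYERS - 1, 0)    # last layer, forward direction
bwd = hidden_index(N_LAYERS - 1, 1)    # last layer, backward direction
print(fwd - rows, bwd - rows)          # the negative indices used in the model
```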

Let us create an instance of this model.

In [19]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)


We can check how many parameters the model has.

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):} trainable parameters')


The model has 112241409 trainable parameters


Oh no! If you did this correctly, you should see that the model contains about 112 million parameters. Standard machines (or Colab) cannot handle training such large models.

However, the majority of these parameters are from the BERT embedding, which we are not going to (re)train. In order to freeze certain parameters we can set their requires_grad attribute to False. To do this, we simply loop through all of the named_parameters in our model and if they’re a part of the bert transformer model, we set requires_grad = False.

In [21]:
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False


In [22]:
# Q2c: After freezing the BERT weights/biases, print the number of remaining trainable parameters.
print(f'The model has {count_parameters(model):} trainable parameters after freezing BERT parameters')


The model has 2759169 trainable parameters after freezing BERT parameters
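
The remaining count can be checked by hand. The sketch below assumes the usual GRU parameterization (three gates, each with input weights, hidden weights, and two bias vectors per direction) and BERT's hidden size of 768; `gru_params` is a hypothetical helper, not a PyTorch function.

```python
def gru_params(input_dim, hidden_dim, n_layers, bidirectional):
    """Count weights and biases of a multi-layer GRU."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated directions as input.
        in_dim = input_dim if layer == 0 else hidden_dim * directions
        # 3 gates, each with input weights, hidden weights, and two bias vectors.
        per_direction = 3 * hidden_dim * (in_dim + hidden_dim + 2)
        total += per_direction * directions
    return total

EMBEDDING_DIM = 768               # hidden size of bert-base-uncased
gru = gru_params(EMBEDDING_DIM, 256, n_layers=2, bidirectional=True)
linear = 256 * 2 * 1 + 1          # output layer weights plus one bias
print(gru, linear, gru + linear)
```

The total of 2759169 matches the count printed above, and subtracting it from 112241409 leaves 109482240 parameters for the frozen BERT model.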


**1.3 Train the Model**

All this is now largely standard.

We will use:

*   the Binary Cross Entropy loss function: nn.BCEWithLogitsLoss()
*   the Adam optimizer

and run it for 2 epochs (that should be enough to start getting meaningful results).

In [23]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())


In [24]:
criterion = nn.BCEWithLogitsLoss()
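
This loss takes raw logits (no sigmoid applied) and combines the sigmoid with binary cross-entropy. As a rough single-example illustration, the sketch below compares a numerically stable form with the explicit-sigmoid form; both function names are hypothetical, and the real loss additionally averages over the batch by default.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit (single example)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Same quantity via an explicit sigmoid; less robust for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Hypothetical (logit, label) pairs.
for x, y in [(2.0, 1.0), (-1.5, 0.0), (0.3, 0.0)]:
    print(round(bce_with_logits(x, y), 6), round(bce_naive(x, y), 6))
```

This is also why the model's output layer has no sigmoid and why predict_sentiment applies torch.sigmoid itself.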


In [25]:
model = model.to(device)
criterion = criterion.to(device)


Also, define functions for:

*   calculating accuracy.
*   training for a single epoch, and reporting loss/accuracy.
*   performing an evaluation epoch, and reporting loss/accuracy.
*   calculating running times.

In [26]:
def binary_accuracy(preds, y):

    # Q3a. Compute accuracy (as a number between 0 and 1)

    # Apply sigmoid to predictions and round to get the binary output
    rounded_preds = torch.round(torch.sigmoid(preds))

    # Calculate the number of correct predictions
    correct = (rounded_preds == y).float()  # Convert boolean values to floats for averaging

    # Compute the accuracy
    acc = correct.sum() / correct.size(0)  # Use .size(0) instead of len for consistency with PyTorch

    return acc

In [27]:
def train(model, iterator, optimizer, criterion):

    # Q3b. Set up the training function

    epoch_loss = 0
    epoch_acc = 0

    model.train()

    # Loop over batches in the data iterator.
    for batch in iterator:

        optimizer.zero_grad()

        predictions = model(batch.text).squeeze(1)

        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()

        optimizer.step()

        # Update running totals of loss and accuracy.
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    # Return average loss and accuracy for the epoch.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [28]:
def evaluate(model, iterator, criterion):
    # Q3c. Set up the evaluation function.

    epoch_loss = 0
    epoch_acc = 0

    # This will turn off features like dropout.
    model.eval()

    # Deactivate autograd for evaluation.
    with torch.no_grad():
        for batch in iterator:
            # Forward pass
            predictions = model(batch.text).squeeze(1)

            # Compute loss and accuracy.
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            # Accumulate the loss and accuracy.
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    # Compute the average loss and accuracy.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
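
Both functions above report the unweighted mean of per-batch accuracies. When the last batch is smaller than the others, this can differ slightly from the overall fraction of correct predictions. The sketch below illustrates the difference with made-up numbers; the helper names and the `stats` values are hypothetical.

```python
def mean_of_batch_means(batch_stats):
    """Unweighted average of per-batch accuracies (what the notebook computes)."""
    return sum(correct / size for correct, size in batch_stats) / len(batch_stats)

def sample_weighted_accuracy(batch_stats):
    """Total correct predictions divided by total number of examples."""
    total_correct = sum(correct for correct, _ in batch_stats)
    total_size = sum(size for _, size in batch_stats)
    return total_correct / total_size

# Hypothetical (correct, batch_size) pairs; the last batch is smaller.
stats = [(120, 128), (116, 128), (10, 20)]
print(mean_of_batch_means(stats), sample_weighted_accuracy(stats))
```

With 128-example batches and only one short final batch per epoch, the gap in this notebook is likely to be small, so the reported numbers remain a reasonable summary.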


In [29]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


We are now ready to train our model.

**Statutory warning**: Training such models will take a very long time since this model is considerably larger than anything we have trained before. Even though we are not training any of the BERT parameters, we still have to make a forward pass. This will take time; each epoch may take upwards of 30 minutes on Colab.

Let us train for 2 epochs and print train loss/accuracy and validation loss/accuracy for each epoch. Let us also measure running time.

Saving intermediate model checkpoints using

torch.save(model.state_dict(),'model.pt')

may be helpful with such large models.

In [30]:
N_EPOCHS = 2

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    # Q3d. Perform training/validation by using the functions you defined earlier.

    start_time = time.time()

    # Train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    # Evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # Measure the end time of the epoch and calculate time taken
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    # Update the best validation loss and save the model if improvement seen
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')  # Save model parameters

    # Print the results for this epoch.
    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\tVal. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc * 100:.2f}%')



We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Epoch: 01 | Epoch Time: 3m 44s
	Train Loss: 0.474 | Train Acc: 76.49%
	 Val. Loss: 0.269 | Val. Acc: 89.02%
Epoch: 02 | Epoch Time: 3m 43s
	Train Loss: 0.276 | Train Acc: 88.89%
	 Val. Loss: 0.237 | Val. Acc: 90.53%
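
The warning printed during training recommends passing an attention_mask because the batches are padded. The notebook's model does not do this; as a sketch of what such a mask looks like, assuming the padding index 0 printed earlier, one could mark real tokens with 1 and padding with 0. The helper `make_attention_mask` is hypothetical and works on plain lists for illustration.

```python
PAD_ID = 0  # index of [PAD] printed earlier in the notebook

def make_attention_mask(batch_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, one row per sequence."""
    return [[int(token != pad_id) for token in row] for row in batch_ids]

# Two sequences of the form [CLS] ... [SEP]; the first is padded with [PAD].
batch = [[101, 7592, 2088, 102, 0, 0],
         [101, 2129, 2024, 2017, 1029, 102]]
mask = make_attention_mask(batch)
print(mask)
```

On tensors, the analogous mask would be something like `(text != pad_token_idx).long()`, passed to the BERT call through its `attention_mask` argument. That change is not part of the original notebook and would likely alter the numbers reported above.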


In [31]:
model.load_state_dict(torch.load('model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.237 | Test Acc: 90.39%


**1.4 Inference**

We’ll then use the model to test the sentiment of some fake movie reviews. We tokenize the input sentence, trim it down to length=510, add the special start and end tokens to either side, convert it to a LongTensor, add a fake batch dimension using unsqueeze, and perform inference using our model.

In [32]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()


In [33]:
predict_sentiment(model, tokenizer, "Justice League is terrible. I hated it.")


0.06885521858930588

In [34]:
predict_sentiment(model, tokenizer, "Avengers was great!!")


0.9397968649864197
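
The returned value is a sigmoid output between 0 and 1. A minimal sketch of turning it into a label, assuming a 0.5 threshold and the neg/pos indices from LABEL.vocab.stoi shown earlier; `label_from_score` is a hypothetical helper.

```python
LABELS = {0: "neg", 1: "pos"}  # inverse of the LABEL.vocab.stoi mapping shown earlier

def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return LABELS[int(score > threshold)]

# Scores returned by predict_sentiment for the two example reviews above.
for score in (0.06885521858930588, 0.9397968649864197):
    print(f"{score:.3f} -> {label_from_score(score)}")
```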

1.4.1 Sentiment Analysis Results

For "Justice League is terrible. I hated it." the model outputs about 0.07, which is close to zero and indicates strongly negative sentiment. For "Avengers was great!!" the model outputs about 0.94, which is close to one and indicates strongly positive sentiment.

Excellent! Now, try two more movie reviews (you can find them online or write your own) and see if your sentiment analysis tool correctly identifies the tone of the critiques.

In [39]:
#From the internet
predict_sentiment(model, tokenizer, "It's not an easy film to totally digest, even with two viewings, because that ending has some mind-boggling revelations. Without having to resort to spoilers, let me just say the story is extremely interesting, the acting very good, the period pieces fun to view.")


0.97291100025177

In [41]:
#From the internet
predict_sentiment(model, tokenizer, " Illogical, tension-free, and filled with cut-rate special effects, Jaws: The Revenge is a sorry chapter in a once-proud franchise.")


0.037353284657001495

Q4b. Perform sentiment analysis on two other movie review fragments of your choice.

1.4.2 Sentiment Analysis Outputs

The first review ("It's not an easy film to totally digest, even with two viewings, because that ending has some mind-boggling revelations. Without having to resort to spoilers, let me just say the story is extremely interesting, the acting very good, the period pieces fun to view.") scores about 0.97, very close to 1, which indicates strongly positive sentiment. This is plausibly driven by phrases such as "extremely interesting" and "the acting very good". In contrast, the second review ("Illogical, tension-free, and filled with cut-rate special effects, Jaws: The Revenge is a sorry chapter in a once-proud franchise.") scores about 0.04, close to 0, which indicates strongly negative sentiment, plausibly due to words such as "illogical" and "tension-free".