# Part 2: Sentiment Analysis Fine-Tuning with BERT

In this part you will fine-tune a pre-trained encoder-only language model called BERT (originally trained and released by Google in 2018) for a sentiment analysis task. Unlike a causal GPT-style language model, BERT is bidirectional: it was trained to predict a masked word in the middle of a sequence using both the preceding and subsequent tokens. For example, BERT was trained on tasks like predicting the masked token in `The sweet black cat [MASK] by the window in the sun.`, considering both the preceding tokens `The sweet black cat` **and** the subsequent tokens `by the window in the sun.`

This kind of model is not used for autoregressively generating new text, but it is very useful when you want to understand an entire sequence of text as a whole, since every token can attend to both earlier and later tokens in the sequence. Sentiment analysis, wherein we want to classify an entire input sequence as either positive or negative in sentiment (in this assignment, we classify movie reviews as positive or negative), is a good example where this kind of understanding is important.
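The difference between causal and bidirectional attention can be sketched with a small visibility grid. This is a plain-Python illustration, not BERT's actual implementation; the helper `attention_pattern` and the shortened token list are hypothetical names chosen for this example.

```python
def attention_pattern(num_tokens, causal):
    """Return a grid where entry [row][col] is 1 if token `row` may attend to token `col`."""
    return [
        [1 if (not causal or col <= row) else 0 for col in range(num_tokens)]
        for row in range(num_tokens)
    ]

# Shortened version of the masked sentence from the text above.
tokens = ["the", "cat", "[MASK]", "by", "window"]
causal = attention_pattern(len(tokens), causal=True)
bidirectional = attention_pattern(len(tokens), causal=False)

# Which tokens can the [MASK] position see under each pattern?
mask_pos = tokens.index("[MASK]")
visible_causal = [t for t, ok in zip(tokens, causal[mask_pos]) if ok]
visible_bidir = [t for t, ok in zip(tokens, bidirectional[mask_pos]) if ok]
print("causal:       ", visible_causal)
print("bidirectional:", visible_bidir)
```

Under the causal pattern the masked position sees only itself and the tokens before it, while the bidirectional pattern also exposes `by` and `window`.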

In this part we will directly modify the `PyTorch` model and will conduct the fine-tuning directly in `PyTorch` as we have done with previous models.

**Learning objectives.** You will:
1. Examine an encoder-only BERT transformer model
2. Modify a BERT model for sentiment analysis
3. Fine-tune the model on movie review data for sentiment analysis

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate your training, consider using GPU resources such as `CUDA` through the CS department cluster. Alternatives include Google Colab or local GPU resources for those running on machines with GPU support.

First, ensure that you have the `transformers` and `datasets` modules installed. We will use these modules for importing tokenizers, pretrained models, and datasets. You can run the following cells to try to install them with `pip` if needed. If you are using ondemand, ideally you would simply include `module load transformers` and `module load datasets` when making your initial reservation.

In [23]:
pip install transformers


Note: you may need to restart the kernel to use updated packages.


In [24]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


Now the following code imports a *tokenizer* and demonstrates its use. 

Note how the sequence of words in the input string is replaced with a sequence of numbers in the `input_ids`: These are indices into the vocabulary of 30522 used by the tokenizer. Also note the `special_tokens`: an `[UNK]` is used for anything not in the vocabulary, and a `[PAD]` can be useful for padding out a sequence of tokens to a specified length.
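As a rough mental model, the mapping from words to `input_ids` can be sketched as a dictionary lookup with an `[UNK]` fallback. This is a simplification: the real tokenizer uses WordPiece, which first tries to split a word into known subword pieces. The function `toy_encode` is hypothetical, and the ids in the toy vocabulary are copied from the tokenizer output shown in this notebook.

```python
# Toy vocabulary; ids copied from the tokenizer output in this notebook.
vocab = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "the": 1996, "cow": 11190, "jumped": 5598, "over": 2058, "moon": 4231,
}

def toy_encode(text):
    """Split on whitespace, map words to ids ([UNK] if missing), and wrap in [CLS] ... [SEP]."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(toy_encode("the cow"))           # matches the first row of input_ids, minus padding
print(toy_encode("the zebra jumped"))  # "zebra" is absent from the toy vocabulary
```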

Given a sequence of strings, the tokenizer returns a dictionary containing not just the `input_ids` (what you will most often want to use) but also `token_type_ids` (which segment each token belongs to when the input is a pair of sentences; you will use this least often) and `attention_mask`. The `attention_mask` has the same dimensions as the `input_ids`, with a `1` in a given position if there is a non-padding token in that position and a `0` if that position holds a padding token. This is helpful when you are tokenizing a batch of multiple strings with potentially different lengths but want to create a single tensor. `padding='longest'` as shown pads all of the inputs to the same number of tokens as the longest input by adding `[PAD]` tokens to the end. The `attention_mask` is then passed so that you can ignore the extraneous padding tokens as needed.
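The relationship between `padding='longest'` and the `attention_mask` can be sketched in plain Python. This is only an illustration of the idea, with nested lists standing in for tensors; `pad_longest` is a hypothetical helper, not part of the `transformers` API, and the ids are taken from the tokenizer output in this notebook.

```python
PAD_ID = 0  # id of [PAD] in the tokenizer's vocabulary

def pad_longest(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_longest([
    [101, 1996, 11190, 102],              # "the cow"
    [101, 5598, 2058, 1996, 4231, 102],   # "jumped over the moon"
])
for key, rows in batch.items():
    print(key, rows)
```

The shorter sequence gains two `[PAD]` ids, and its mask ends in two zeros so that downstream code can ignore those positions.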

Also note the `return_tensors` parameter. Using `"pt"` as shown indicates that the results should be returned as PyTorch tensors. If you omit this parameter, the results will be returned as Python lists by default.

In [25]:
# run but you do not need to modify this code
import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased',
                                          clean_up_tokenization_spaces=True)
print(tokenizer)
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
print(tokenized)

BertTokenizer(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
{'input_ids': tensor([[  101,  1996, 11190,   102,     0,     0],
        [  101,  5598,  2058,  1996,  4231,   102]])

The tokenizer also has a `decode` method by which you can translate `input_ids` back into strings. You can optionally set `skip_special_tokens=True` if you want to ignore the special tokens like padding, unknown, etc.

In [26]:
# run but you do not need to modify this code
for tokens in tokenized["input_ids"]:
    print(tokenizer.decode(tokens, skip_special_tokens=True))

the cow
jumped over the moon


Now we import our language model, in this case a pretrained BERT model. This is an encoder-only transformer architecture previewed below. As you can see, the embedding expects a vocabulary of 30522 matching our tokenizer. The model embedding dimension is 768 and the output layer of the model also has 768 units.

In [27]:
# run but you do not need to modify this code
import torch
from torch import nn
from transformers import BertModel
pretrained_model = BertModel.from_pretrained("google-bert/bert-base-uncased")
pretrained_model.to(device)
print(pretrained_model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## Task 1

Our goal will be to modify a base BERT model for a sentiment analysis task. Specifically, we want to predict whether a given review text has a positive (1) or negative (0) sentiment. Define a model architecture that uses the pretrained BERT model but modifies it for classifying a sequence as positive or negative.

Before proceeding, create a model object and ensure you can run forward propagation on a small example such as that defined in the second code block below. Your values may not be interpretable prior to fine-tuning, but you should be able to generate outputs of the correct shape.

In [28]:
# todo: define a model architecture for sentiment analysis using BERT
class SentimentBert(nn.Module):
    def __init__(self, model_name="google-bert/bert-base-uncased"):
        super().__init__()
        # Pretrained encoder; its hidden size (768 for bert-base) sets the classifier input width.
        self.bert = BertModel.from_pretrained(model_name)
        # Single output unit: one logit per sequence for binary sentiment.
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the final hidden state of the [CLS] token (position 0) as the sequence summary.
        pooled = outputs.last_hidden_state[:, 0]
        logits = self.classifier(pooled)  # shape: (batch_size, 1)
        return logits

In [29]:
# todo: try inference with your architecture here
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
model = SentimentBert()
model.to(device)
input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)
logits = model(input_ids, attention_mask)
print(logits)

tensor([[-0.0725],
        [-0.1148]], device='mps:0', grad_fn=<LinearBackward0>)


## Task 2

Our dataset is drawn from several thousand reviews on the Rotten Tomatoes website. Below we download and preview some of the data. Note that each element of a dataset is a dictionary with a `text` field containing the review and a `label` field which is `1` for a positive review or `0` for a negative review.

In [30]:
# run but you do not need to modify this code
from datasets import load_dataset
train_data = load_dataset("rotten_tomatoes", split="train")
val_data = load_dataset("rotten_tomatoes", split="validation")

print(f"Training examples: {len(train_data)}, Validation examples: {len(val_data)}")
for i in range(1, 3):
    print(train_data[i])
    print(train_data[-i])

Training examples: 8530, Validation examples: 1066
{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'label': 1}
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .', 'label': 0}
{'text': 'effective but too-tepid biopic', 'label': 1}
{'text': 'interminably bleak , to say nothing of boring .', 'label': 0}


As you can see, the reviews are not all the same length. It is better not to pad the entire dataset to the same length, and instead just to perform padding per batch. We will want to have `DataLoader`s for easy iteration over batches of data as tokenized tensors. 

One way to do this is to supply a `collate_fn` to the `DataLoader` constructor. This is a function that takes as input a list of elements from the dataset (called `batch`), which in our case will be a list of dictionaries containing `text` and `label` values. The function should return the batch with tokenized strings padded to the same length along with the corresponding values.
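To see why per-batch padding is preferable, the following sketch counts how many `[PAD]` tokens each strategy would add. The helper `pad_tokens_needed` and the token counts are hypothetical and chosen only for illustration; one unusually long review dominates the global maximum.

```python
def pad_tokens_needed(lengths, batch_size, per_batch):
    """Count [PAD] tokens added when padding per batch or to the dataset-wide maximum."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Pad to the longest sequence in this batch, or to the longest in the dataset.
        target = max(chunk) if per_batch else global_max
        total += sum(target - n for n in chunk)
    return total

lengths = [5, 7, 6, 40, 8, 9, 7, 6]  # hypothetical token counts for eight reviews
global_padding = pad_tokens_needed(lengths, batch_size=4, per_batch=False)
batch_padding = pad_tokens_needed(lengths, batch_size=4, per_batch=True)
print(f"global padding: {global_padding} pad tokens")
print(f"per-batch padding: {batch_padding} pad tokens")
```

In this toy case only the batch containing the long review pays for its length, so per-batch padding adds fewer than half as many padding tokens.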

In [31]:
from torch.utils.data import DataLoader

def collate(batch):
    # todo: complete collate function
    # Reuse the module-level `tokenizer` loaded earlier instead of reloading it for every batch.
    texts = [example['text'] for example in batch]
    labels = torch.tensor([example['label'] for example in batch], dtype=torch.long)
    # Pad only to the longest review in this batch.
    tokenized = tokenizer(texts, padding='longest', return_tensors='pt')
    return {
        'input_ids': tokenized["input_ids"].to(device),
        'attention_mask': tokenized["attention_mask"].to(device),
        # `labels` is already a tensor; wrapping it in torch.tensor() again triggers a UserWarning.
        'labels': labels.to(device),
    }

train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate)

In [18]:
# check if DataLoader is as intended
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[  101,  1037,  4942,  1011,  5675,  2594, 14308,  1999,  1996,  2227,
          2000, 12348, 15138,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  8982,  2100,  5855,  7733,  2019,  4728, 26380,  4038,  1997,
         10697,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  8040,  5668,  6810,  2987,  1005,  1056,  2507,  2149,  1037,
          2839,  4276,  3228,  1037,  4365,  2055,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  1996,  2143,  1005,  1055,  2754, 10419,  2015,  2024, 10305,
          1998,  3568,  7782,  2121,  2084,  1996,  4728, 101



## Task 3

Fine-tune the model on the training dataset until you achieve at least 80% accuracy on the validation dataset. You are welcome to use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer. As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.

You should track and evaluate the training loss at least every hundred batches. Evaluate the validation loss and accuracy at least once every epoch of training. 
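As a reference for what the loss and accuracy measure, here is a plain-Python sketch. It assumes a design in which the model emits a single logit per review, which is one possible choice and not one the task mandates. The function names are hypothetical; in PyTorch, `nn.BCEWithLogitsLoss` computes the same quantity on tensors.

```python
import math

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for one logit x and a 0/1 label y.

    Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def evaluate(logits, labels):
    """Return (mean loss, accuracy) over paired logits and labels."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, labels)]
    # sigmoid(x) > 0.5 exactly when x > 0, so thresholding the logit at 0 is enough.
    predictions = [1 if x > 0 else 0 for x in logits]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return sum(losses) / len(losses), accuracy

# Hypothetical logits and labels for four reviews; the third is misclassified.
mean_loss, accuracy = evaluate([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0])
print(f"loss={mean_loss:.4f}, accuracy={accuracy:.2f}")
```

A logit of 0 corresponds to a predicted probability of 0.5 and a loss of log 2 (about 0.693), which is a useful baseline when reading early training losses.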

Note that you are working with a relatively large model and should expect a single epoch to take several minutes, even using GPU compute. This is one reason we direct you to evaluate the training loss at least every hundred batches to monitor progress. With well-chosen hyperparameters, you should only need a small number of epochs (such as 1-3) of fine-tuning; this should take minutes, not hours.

Make sure to use the `attention_mask`; otherwise the BERT model will encode unnecessary `[PAD]` tokens at the ends of sequences within a batch.

In [None]:
# todo: fine-tune / train the modified BERT model for sentiment analysis

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

num_epochs = 3
train_losses = []
val_losses = []
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    for i, batch in enumerate(train_dataloader):
        input_ids, attention_mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels'].float()
        
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask).squeeze(-1)  # (batch, 1) -> (batch,); safe even for a batch of one
        
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        predicted = (torch.sigmoid(outputs) > 0.5).long()
        correct += (predicted == labels.long()).sum().item()
        total += labels.size(0)

        if (i + 1) % 100 == 0:
            avg_loss = total_loss / 100
            train_losses.append(avg_loss)  # record the average loss over the last 100 batches
            print(f"Batch {i+1}, Loss: {avg_loss:.4f}, Train Acc: {correct/total:.4f}")
            total_loss = 0.0
    model.eval()
    val_loss, val_correct, val_total = 0.0, 0, 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids, attention_mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels'].float()
            outputs = model(input_ids, attention_mask).squeeze(-1)  # (batch, 1) -> (batch,)
            loss = loss_fn(outputs, labels)
            val_loss += loss.item()
            predicted = (torch.sigmoid(outputs) > 0.5).long()
            val_correct += (predicted == labels.long()).sum().item()
            val_total += labels.size(0)
            
    val_accuracy = val_correct / val_total
    val_losses.append(val_loss / len(val_dataloader))  # record the mean validation loss for this epoch
    print(f"Epoch {epoch+1} - Val Loss: {val_loss/len(val_dataloader):.4f}, Val Acc: {val_accuracy:.4f}")




Batch 100, Loss: 0.5324, Train Acc: 0.7212
Batch 200, Loss: 0.4203, Train Acc: 0.7694
Batch 300, Loss: 0.4024, Train Acc: 0.7867
Batch 400, Loss: 0.3641, Train Acc: 0.7997
Batch 500, Loss: 0.3901, Train Acc: 0.8055
Batch 600, Loss: 0.3376, Train Acc: 0.8163
Batch 700, Loss: 0.3389, Train Acc: 0.8234
Batch 800, Loss: 0.3238, Train Acc: 0.8286
Batch 900, Loss: 0.3373, Train Acc: 0.8318
Batch 1000, Loss: 0.3449, Train Acc: 0.8334
Epoch 1 - Val Loss: 0.3383, Val Acc: 0.8405
Batch 100, Loss: 0.1353, Train Acc: 0.9575
Batch 200, Loss: 0.1620, Train Acc: 0.9463
Batch 300, Loss: 0.1895, Train Acc: 0.9404
Batch 400, Loss: 0.1777, Train Acc: 0.9391
Batch 500, Loss: 0.1578, Train Acc: 0.9403
Batch 600, Loss: 0.1525, Train Acc: 0.9402
Batch 700, Loss: 0.1608, Train Acc: 0.9400
Batch 800, Loss: 0.1646, Train Acc: 0.9400
Batch 900, Loss: 0.1846, Train Acc: 0.9383
Batch 1000, Loss: 0.1825, Train Acc: 0.9367
Epoch 2 - Val Loss: 0.4062, Val Acc: 0.8602
Batch 100, Loss: 0.0420, Train Acc: 0.9862
Batch 2

## Task 4

Finally, retrieve five examples (your choice) from the validation dataset for which your fine-tuned model made incorrect predictions. Interpret the results on these five examples. Do you think the model is clearly incorrect or is there any ambiguity in whether the reviews are positive or negative?

In [None]:
# todo: code for task 4 here
incorrect_examples = []
model.eval()
with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels'].float()
        outputs = model(input_ids, attention_mask).squeeze(-1)  # (batch, 1) -> (batch,)
        predicted = (torch.sigmoid(outputs) > 0.5).long()
        for i in range(len(labels)):
            if predicted[i] != labels[i]:
                # makes an object that will be easier to manipulate later
                incorrect_examples.append({
                    "text": tokenizer.decode(input_ids[i], skip_special_tokens=True),
                    "true_label": labels[i].item(),
                    "predicted_label": predicted[i].item(),
                    "logit": outputs[i].item()
                })
                
        if len(incorrect_examples) >= 5:
            break

for ex in incorrect_examples:
    print(f"Text: {ex['text']}")
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    print(f"True Sentiment: {'Positive' if ex['true_label'] == 1 else 'Negative'}")
    print(f"Predicted Sentiment: {'Positive' if ex['predicted_label'] == 1 else 'Negative'}")




Text: made for teens and reviewed as such, this is recommended only for those under 20 years of age... and then only as a very mild rental.
True Sentiment: Positive
Predicted Sentiment: Negative
Text: those moviegoers who would automatically bypass a hip - hop documentary should give " scratch " a second look.
True Sentiment: Positive
Predicted Sentiment: Negative
Text: byler is too savvy a filmmaker to let this morph into a typical romantic triangle. instead, he focuses on the anguish that can develop when one mulls leaving the familiar to traverse uncharted ground.
True Sentiment: Positive
Predicted Sentiment: Negative
Text: there's absolutely no reason why blue crush, a late - summer surfer girl entry, should be as entertaining as it is
True Sentiment: Positive
Predicted Sentiment: Negative
Text: in capturing the understated comedic agony of an ever - ruminating, genteel yet decadent aristocracy that can no longer pay its bills, the film could just as well be addressing the turn of 

*briefly explain for task 4 here*

Some of these examples are more ambiguous than others. The wording of the first example is quite ambiguous; as a human reader, I am not sure I could honestly call this review positive. The second and third reviews are more clearly positive, but I can see how phrases like "automatically bypass" and "too savvy" could mislead the model. In the fourth review, phrases like "absolutely no reason" could lead to the negative prediction. Finally, the last review contains phrases like "comedic agony" and "can no longer pay its bills", which have negative connotations on their own. Overall, all of these reviews have some degree of ambiguity, but a human would be able to classify most of them correctly.