# Sentiment Analysis Fine-Tuning with BERT

In this course project, I fine-tuned a pre-trained encoder-only language model called BERT (originally trained and released by Google in 2018) for a sentiment analysis task.

Unlike a causal GPT-style language model, BERT is bidirectional: it was trained to predict a masked token in the middle of a sequence using both the preceding and the subsequent tokens. For example, BERT was trained on tasks like predicting the masked token in `The sweet black cat [MASK] by the window in the sun.` considering both the preceding tokens `The sweet black cat` **and** the subsequent tokens `by the window in the sun.`
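
To make the contrast concrete, here is a minimal sketch of which positions each kind of model may condition on when predicting the masked token. It is a toy illustration only: it splits on whitespace rather than using BERT's actual tokenizer, and `visible_positions` is a hypothetical helper, not part of any library.

```python
def visible_positions(num_tokens, position, bidirectional):
    """Indices a model may condition on when predicting the token at `position`."""
    if bidirectional:
        # Encoder-style: every other position, on both sides.
        return [i for i in range(num_tokens) if i != position]
    # Causal-style: only positions strictly to the left.
    return list(range(position))

# Whitespace split is an assumption for illustration, not BERT tokenization.
tokens = "The sweet black cat [MASK] by the window in the sun.".split()
mask_pos = tokens.index("[MASK]")

bidirectional_context = [tokens[i] for i in visible_positions(len(tokens), mask_pos, True)]
causal_context = [tokens[i] for i in visible_positions(len(tokens), mask_pos, False)]

print("bidirectional:", bidirectional_context)
print("causal:       ", causal_context)
```

The causal context stops at `cat`, while the bidirectional context also includes `by the window in the sun.`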

This kind of model is not used for autoregressively generating new text, but is very useful when you want to understand an entire sequence of text as a whole, since every token can attend to both earlier and later tokens in the sequence. Sentiment analysis, wherein we want to classify an entire input sequence as either positive or negative in sentiment (for example, in this project we classify movie reviews as either positive or negative), is a good example of a task where this kind of understanding is important.

In this part of the project, we will directly modify the `PyTorch` model and will conduct the fine-tuning directly in `PyTorch` as we have done with previous models.

**Learning objectives.** You will:
1. Examine an encoder-only BERT transformer model
2. Modify a BERT model for sentiment analysis
3. Fine-tune the model on movie review data for sentiment analysis

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate your training, consider using GPU resources such as `CUDA` through the CS department cluster. Alternatives include Google Colab or local GPU resources for those running on machines with GPU support.

First, ensure that you have the `transformers` and `datasets` modules installed. We will use these modules for importing tokenizers, pretrained models, and datasets. You can run the following cells to try to install them with `pip` if needed. If you are using ondemand, ideally you would simply include `module load transformers` and `module load datasets` when making your initial reservation.

In [None]:
pip install transformers



In [None]:
pip install datasets



Now the following code imports a *tokenizer* and demonstrates its use.

Note how the sequence of words in the input string is replaced with a sequence of numbers in the `input_ids`: These are indices into the vocabulary of 30522 used by the tokenizer. Also note the `special_tokens`: an `[UNK]` is used for anything not in the vocabulary, and a `[PAD]` can be useful for padding out a sequence of tokens to a specified length.

Given a list of strings, the tokenizer returns a dictionary containing not just the `input_ids` (what you will most often want to use) but also `token_type_ids` (which segment each token belongs to when a pair of sentences is encoded together; these are all zeros for single-sentence input, and you will use them least often) and `attention_mask`. The `attention_mask` has the same dimensions as the `input_ids` with a `1` in a given position if there is a non-padding token in that position and a `0` if that position is just a padding token. This is helpful when you are tokenizing a batch of multiple strings with potentially different lengths but want to create a single tensor. `padding='longest'` as shown pads all of the input to the same number of tokens as the longest input by adding `[PAD]` tokens to the end. The `attention_mask` is then passed so that you can ignore the extraneous padding tokens as needed.
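
As a rough illustration of what `padding='longest'` and the `attention_mask` amount to, the sketch below pads plain Python lists by hand. `pad_longest` is a hypothetical helper written for this example (the real tokenizer does this internally), and the token ids are copied from the tokenizer output shown below.

```python
PAD_ID = 0  # id of [PAD] according to the tokenizer printout

def pad_longest(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Ids for "the cow" and "jumped over the moon", including [CLS]=101 and [SEP]=102.
batch = [[101, 1996, 11190, 102], [101, 5598, 2058, 1996, 4231, 102]]
ids, mask = pad_longest(batch)
print(ids)
print(mask)
```

The shorter sequence gains two `[PAD]` ids, and its mask ends in two zeros at exactly those positions.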

Also note the `return_tensors` parameter. Using `"pt"` as shown indicates that the results should be returned as PyTorch tensors. If you omit this parameter then the results will be returned as Python lists by default.

In [None]:
# run but you do not need to modify this code
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased',
                                          clean_up_tokenization_spaces=True)
print(tokenizer)
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
print(tokenized)

BertTokenizer(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
{'input_ids': tensor([[  101,  1996, 11190,   102,     0,     0],
        [  101,  5598,  2058,  1996,  4231,   102]])

The tokenizer also has a `decode` method by which you can translate `input_ids` back into strings. You can optionally set `skip_special_tokens=True` if you want to ignore the special tokens like padding, unknown, etc.

In [None]:
# run but you do not need to modify this code
for tokens in tokenized["input_ids"]:
    print(tokenizer.decode(tokens, skip_special_tokens=True))

the cow
jumped over the moon


Now we import our language model, in this case a pretrained BERT model. This is an encoder-only transformer architecture previewed below. As you can see, the embedding expects a vocabulary of 30522 matching our tokenizer. The model embedding dimension is 768 and the output layer of the model also has 768 units.

In [None]:
# run but you do not need to modify this code
import torch
from torch import nn
from transformers import BertModel
pretrained_model = BertModel.from_pretrained("google-bert/bert-base-uncased")
print(pretrained_model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## Task 1

Our goal will be to modify a base BERT model for a sentiment analysis task. Specifically, we want to predict whether a given review text has a positive (1) or negative (0) sentiment. Define a model architecture that uses the pretrained BERT model but modifies it for classifying a sequence as positive or negative.

Before proceeding, create a model object and ensure you can run forward propagation on a small example such as that defined in the second code block below. Your values may not be interpretable yet prior to fine-tuning, but you should be able to generate outputs of the correct shape.

In [None]:
# todo: define a model architecture for sentiment analysis using BERT
class SentimentBert(nn.Module):
    """Pretrained BERT encoder followed by a linear layer that outputs two class logits."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("google-bert/bert-base-uncased")
        # Map the pooled output (768 units for this model) to logits for negative (0) / positive (1).
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is derived from the final hidden state of the [CLS] token.
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits

In [None]:
# todo: try inference with your architecture here

tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
model = SentimentBert()
logits = model(input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"])
print(logits)
print(logits.shape)

tensor([[-0.4364, -0.2077],
        [-0.2758, -0.1643]], grad_fn=<AddmmBackward0>)
torch.Size([2, 2])
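
To read these logits as predictions, one option is to apply a softmax and take the larger entry. The sketch below does this in plain Python on the logits printed above; the `softmax` helper and the `LABELS` mapping are written for this illustration. Since the classifier layer is still randomly initialized at this point, the resulting probabilities carry no meaning yet.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = {0: "negative", 1: "positive"}

# Logits copied from the untrained model's output above.
logits_batch = [[-0.4364, -0.2077], [-0.2758, -0.1643]]
for logits in logits_batch:
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    print([round(p, 3) for p in probs], LABELS[pred])
```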


## Task 2

Our dataset is drawn from several thousand reviews on the Rotten Tomatoes website. Below we download and preview some of the data. Note that each element of a dataset is a dictionary with a `text` containing the review and a `label` which is `1` for a positive review or `0` for a negative review.

In [None]:
# run but you do not need to modify this code
from datasets import load_dataset
train_data = load_dataset("rotten_tomatoes", split="train")
val_data = load_dataset("rotten_tomatoes", split="validation")

print(f"Training examples: {len(train_data)}, Validation examples: {len(val_data)}")
for i in range(1, 3):
    print(train_data[i])
    print(train_data[-i])

Training examples: 8530, Validation examples: 1066
{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'label': 1}
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .', 'label': 0}
{'text': 'effective but too-tepid biopic', 'label': 1}
{'text': 'interminably bleak , to say nothing of boring .', 'label': 0}


As you can see, the reviews are not all the same length. It is better not to pad the entire dataset to the same length, and instead just to perform padding per batch. We will want to have `DataLoader`s for easy iteration over batches of data as tokenized tensors.
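
A small back-of-the-envelope sketch of why per-batch padding helps: it counts how many token positions the model would process under each strategy. The token counts are hypothetical values chosen for illustration, and both helper functions are written for this example.

```python
def padded_tokens_global(lengths):
    """Positions processed if every sequence is padded to the dataset-wide maximum."""
    return max(lengths) * len(lengths)

def padded_tokens_per_batch(lengths, batch_size):
    """Positions processed if each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += max(chunk) * len(chunk)
    return total

lengths = [22, 13, 39, 8, 51, 17, 9, 30]  # hypothetical token counts per review

print("real tokens:      ", sum(lengths))
print("global padding:   ", padded_tokens_global(lengths))
print("per-batch padding:", padded_tokens_per_batch(lengths, batch_size=4))
```

Per-batch padding can never process more positions than global padding, and the gap grows when a few very long reviews would otherwise set the length for everything.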

One way to do this is to supply a `collate_fn` to the `DataLoader` constructor. This is a function that takes as input a list of elements from the dataset (called `batch`), which in our case will be a list of dictionaries containing `text` and `label` values. The function should return the batch with tokenized strings padded to the same length along with the corresponding labels.

In [None]:
from torch.utils.data import DataLoader

def collate(batch):
    # Reuse the tokenizer loaded earlier instead of reloading it for every batch.
    texts = [item['text'] for item in batch]
    labels = torch.tensor([item['label'] for item in batch])
    # Pad only to the longest review within this batch.
    tokenized = tokenizer(texts, padding='longest', return_tensors="pt")
    return {'input_ids': tokenized['input_ids'], 'attention_mask': tokenized['attention_mask'], 'labels': labels}

train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate)

In [None]:
# check if DataLoader is as intended
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[  101,  1000,  1996,  4251,  2137,  1000,  4269,  1999, 24001,  1999,
          3999,  1012,  2008,  1005,  1055,  2049,  2034,  3696,  1997,  4390,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1037,  2143,  2008, 17567,  2138,  1997,  2049,  2116,  9987,
          2229,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101, 17878,  2063, 15358,  2135,  1005,  1055, 17083,  1011,  4012,
          3640, 25285,  1005,  1055, 14166, 21642,  7140,  2007,  2178,  6904,
          8569,  2571,  5602,  4078,  7629,  1011,  1011,  1045,  1012,  1041,
          1012,  1010,  1037,  7221,  2389,  6259,  8795,  1012,   102,     0],
        [  101,  2009,  1005,  1055

## Task 3

Fine-tune the model on the training dataset until you achieve at least 80% accuracy on the validation dataset. You are welcome to use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer. As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.

You should track and evaluate the training loss at least every hundred batches. Evaluate the validation loss and accuracy at least once every epoch of training.
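
The arithmetic behind a validation pass can be sketched without any framework: cross-entropy is the negative log-probability of the true class, and both loss and accuracy are averaged over examples rather than over batches. In PyTorch, `nn.CrossEntropyLoss` with its default settings returns a per-batch mean, so when the last batch is smaller it is worth weighting each batch by its size. The helpers and the example logits below are hypothetical and written only for this illustration.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]

def evaluate(batches):
    """Return (mean loss, accuracy) over all examples.

    `batches` is an iterable of (logits_batch, labels_batch) pairs,
    standing in for model outputs on a validation DataLoader.
    """
    total_loss, correct, count = 0.0, 0, 0
    for logits_batch, labels_batch in batches:
        for logits, label in zip(logits_batch, labels_batch):
            total_loss += cross_entropy(logits, label)
            pred = max(range(len(logits)), key=logits.__getitem__)
            correct += int(pred == label)
            count += 1
    return total_loss / count, correct / count

# Two hypothetical batches of different sizes.
batches = [
    ([[2.0, 0.0], [0.0, 1.0]], [0, 1]),
    ([[0.5, 1.5]], [0]),
]
val_loss, val_acc = evaluate(batches)
print(f"validation loss {val_loss:.4f}, accuracy {val_acc:.4f}")
```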

Note that you are working with a relatively large model and should expect a single epoch to take several minutes, even using GPU compute. This is one reason we direct you to evaluate the training loss at least every hundred batches to monitor progress. With well-chosen hyperparameters, you should only need a small number of epochs of fine-tuning (such as 1-3); this should take minutes, not hours.
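
For a sense of scale, the number of batches per epoch follows directly from the dataset size and batch size. The sketch below uses the 8530 training examples and the batch size of 8 from the `DataLoader` defined above; the logging interval of 100 is the minimum the assignment asks for.

```python
import math

num_train = 8530   # training examples reported when the dataset was loaded
batch_size = 8     # batch size passed to train_dataloader
log_every = 100    # log the training loss every hundred batches

batches_per_epoch = math.ceil(num_train / batch_size)
# 1-based batch numbers at which a zero-based `i % log_every == 0` check fires.
logged_batches = [i + 1 for i in range(batches_per_epoch) if i % log_every == 0]

print(batches_per_epoch)
print(logged_batches)
```

Each epoch therefore involves just over a thousand optimizer steps, with roughly a dozen progress lines.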

Make sure to use the `attention_mask`; otherwise the BERT model will attend to unnecessary `[PAD]` tokens at the ends of sequences within a batch.
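
Conceptually, the mask removes padded positions from the attention softmax so they receive zero weight. The sketch below shows that idea on a single row of raw scores in plain Python; it is a simplified illustration rather than the library's actual implementation, the scores are hypothetical, and it assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over `scores`, giving zero weight wherever attention_mask is 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, attention_mask)]
    top = max(masked)  # finite as long as one position is unmasked
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # hypothetical scores; last two positions are [PAD]
mask = [1, 1, 1, 0, 0]

with_mask = masked_softmax(scores, mask)
without_mask = masked_softmax(scores, [1] * len(scores))
print([round(w, 3) for w in with_mask])
print([round(w, 3) for w in without_mask])
```

Without the mask, the padded positions here would absorb most of the attention weight, which is exactly the distortion the `attention_mask` prevents.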

In [None]:
# todo: fine-tune / train the modified BERT model for sentiment analysis

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentBert().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()  # create the loss once instead of once per batch
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for i, batch in enumerate(train_dataloader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        if i % 100 == 0:
            print(f"Epoch {epoch+1}, Batch {i+1}, Loss: {loss.item()}")
    # Measure accuracy on the validation set after each epoch.
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            logits = model(input_ids, attention_mask)
            preds = torch.argmax(logits, dim=1)
            total += labels.size(0)
            correct += (preds == labels).sum().item()

    accuracy = correct / total
    # total_loss accumulates training-batch losses, so this average is the mean training loss,
    # not a validation loss.
    print(f"Epoch {epoch+1}, Mean Training Loss: {total_loss/len(train_dataloader)}, Validation Accuracy: {accuracy}")
    if accuracy >= 0.8:
        print("Validation accuracy reached 80%, stopping training.")
        break

Epoch 1, Batch 1, Loss: 0.5792637467384338
Epoch 1, Batch 101, Loss: 0.6055983304977417
Epoch 1, Batch 201, Loss: 0.3139762878417969
Epoch 1, Batch 301, Loss: 0.3429502844810486
Epoch 1, Batch 401, Loss: 0.1320042908191681
Epoch 1, Batch 501, Loss: 0.8519556522369385
Epoch 1, Batch 601, Loss: 0.2650899291038513
Epoch 1, Batch 701, Loss: 0.18880772590637207
Epoch 1, Batch 801, Loss: 0.4599440097808838
Epoch 1, Batch 901, Loss: 0.13006779551506042
Epoch 1, Batch 1001, Loss: 0.3385148048400879
Epoch 1, Mean Training Loss: 0.38462158330413315, Validation Accuracy: 0.8611632270168855
Validation accuracy reached 80%, stopping training.


## Task 4

Finally, retrieve five examples (your choice) from the validation dataset for which your fine-tuned model made incorrect predictions. Interpret the results on these five examples. Do you think the model is clearly incorrect or is there any ambiguity in whether the reviews are positive or negative?
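
Since the choice of examples is open, one option is to collect errors of both kinds (positive reviews predicted negative, and the reverse) rather than only the first few encountered. The sketch below shows such a selection on hypothetical records; `pick_errors` and the record layout with `text`, `label`, and `pred` keys are assumptions made for this illustration.

```python
def pick_errors(records, per_type=2):
    """Collect up to `per_type` false negatives and false positives.

    Each record is a dict with 'text', 'label', and 'pred' keys,
    where 1 means positive and 0 means negative.
    """
    picked = {"false_negative": [], "false_positive": []}
    for rec in records:
        if rec["pred"] == rec["label"]:
            continue  # correct prediction, nothing to inspect
        kind = "false_negative" if rec["label"] == 1 else "false_positive"
        if len(picked[kind]) < per_type:
            picked[kind].append(rec)
    return picked

# Hypothetical predictions, not taken from the real validation set.
records = [
    {"text": "hypothetical review a", "label": 1, "pred": 0},
    {"text": "hypothetical review b", "label": 1, "pred": 1},
    {"text": "hypothetical review c", "label": 1, "pred": 0},
    {"text": "hypothetical review d", "label": 0, "pred": 1},
]
picked = pick_errors(records, per_type=1)
for kind, examples in picked.items():
    print(kind, [ex["text"] for ex in examples])
```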

In [None]:
# todo: code for task 4 here
misclassified = []
model.eval()

with torch.no_grad():
    for example in val_data:
        tokenized = tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt")
        tokenized = {key: val.to(device) for key, val in tokenized.items()}
        logits = model(input_ids=tokenized['input_ids'], attention_mask=tokenized['attention_mask'])
        pred = torch.argmax(logits, dim=1).item()
        if pred != example['label']:
            misclassified.append({'text': example['text'], 'label': example['label'], 'pred': pred})
        if len(misclassified) >= 5:
            break

for i, ex in enumerate(misclassified):
    print(f"Example {i+1}:")
    print(f"Text: {ex['text']}")
    print(f"Label: {ex['label']}")
    print(f"Predicted Label: {ex['pred']}")
    print("-----")

Example 1:
Text: made for teens and reviewed as such , this is recommended only for those under 20 years of age . . . and then only as a very mild rental .
Label: 1
Predicted Label: 0
-----
Example 2:
Text: those moviegoers who would automatically bypass a hip-hop documentary should give " scratch " a second look .
Label: 1
Predicted Label: 0
-----
Example 3:
Text: there's absolutely no reason why blue crush , a late-summer surfer girl entry , should be as entertaining as it is
Label: 1
Predicted Label: 0
-----
Example 4:
Text: the events of the film are just so weird that i honestly never knew what the hell was coming next .
Label: 1
Predicted Label: 0
-----
Example 5:
Text: full of bland hotels , highways , parking lots , with some glimpses of nature and family warmth , time out is a discreet moan of despair about entrapment in the maze of modern life .
Label: 1
Predicted Label: 0
-----


*briefly explain for task 4 here*

The model clearly struggles with nuanced and sarcastic language, but it is not really failing on clear-cut examples. All five errors are positive reviews that the model predicted as negative, and each involves some confusion or ambiguity that makes it hard to identify the correct sentiment. A lot of this comes from descriptive language, as in "full of bland hotels , highways , parking lots , with some glimpses of nature and family warmth , time out is a discreet moan of despair about entrapment in the maze of modern life ". Not even I can tell exactly what tone this review is going for, let alone an AI model.
