# Lab 9: Finetuning BERT-based models

In this lab, we will explore finetuning a BERT model for classification. Your task will be to classify movie reviews as positive or negative (i.e. the task from HW2). Supporting code will help you with the actual training. Your job is to focus on formatting the input and evaluating the accuracy of your final model.

Once you finish working on this lab, please download it as a .ipynb notebook and submit the notebook to Moodle.

#### **What should I do if I run out of RAM?**

The free GPUs that Colab assigns might not always be reliable. Sometimes your code will run without issues, and other times you might run into RAM errors. For this reason, try to train your models on as much data as possible, but do not worry if you are not able to train on all of the data. You can also try to run the models on your personal computer without using GPUs!


### Guiding Questions

1. How do we use pretrained transformer models?
1. How do we finetune a neural model of language for classification?

### Learning Objectives

1. Understand how to map classification to the context of transformers
1. Hands-on experience using pretrained models
1. Build a classifier on top of a pretrained model
1. Reason about your model and its abilities

### Rubric

| Question | Points |
| ------| ----- |
| load_data | 25 Points |
| Reflection | 75 Points |

### Deadline:

November 12, 11PM EST

### Submission format:
ipynb file saved after running your code cells and submitted to Moodle

## Overview

We will use HuggingFace throughout this lab. Finetuning involves six steps:

1. Loading a pretrained model
2. Loading our finetuning data
3. Searching for good hyperparameters for our model
4. Training our model
5. Testing our model
6. Saving our model

HuggingFace comes with tools for doing all of these. I will give very brief overviews of them below and point you to their code base. This will be useful for at least some of your final projects, so please look over these materials on your own.

### The model and the tokenizer

HuggingFace hosts a large number of pretrained models. You can find them [here](https://huggingface.co/). The main library for working with these models is transformers (documentation [link](https://huggingface.co/docs/transformers/index)). Each model comes with a tokenizer which maps from words to ids for the relevant model.
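To make "maps from words to ids" concrete, here is a minimal sketch of WordPiece-style tokenization, the subword scheme used by BERT-family tokenizers. It is a simplified illustration, not the HuggingFace implementation: the vocabulary `toy_vocab`, its ids, and the helper names `wordpiece_tokenize` and `encode` are all made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary with hypothetical ids (NOT DistilBERT's real vocabulary)
toy_vocab = {"[UNK]": 0, "the": 1, "film": 2, "was": 3,
             "un": 4, "##watch": 5, "##able": 6}


def encode(text, vocab):
    """Split on whitespace, then map every subword piece to its id."""
    ids = []
    for word in text.split():
        ids.extend(vocab[p] for p in wordpiece_tokenize(word, vocab))
    return ids


print(wordpiece_tokenize("unwatchable", toy_vocab))
print(encode("the film was unwatchable", toy_vocab))
```

A real tokenizer also adds special tokens and handles casing and punctuation, which this sketch leaves out.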

### The data format

In order to finetune our model with HuggingFace, we need our data formatted in a particular way. We will use their datasets library (documentation [link](https://huggingface.co/docs/datasets/index)). Consider the following (modified) sample from the movie review dataset we are using:

    [{'text': 'Note that I did not say that it is',
      'label': 0},
     {'text': 'In what is arguably the best outdoor adventure film of all time, '
              'four city guys confront nature\'s wrath, in a story of survival.',
      'label': 1}]

Notice that it is a list of dictionaries, each pairing a text with its label. You will write a data loader that does this step.
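Before handing records like these to a dataset library, it can help to check that every entry has the expected shape. The following is a small sketch of such a check; `check_records` is a hypothetical helper written for this example, not part of HuggingFace.

```python
from collections import Counter


def check_records(records):
    """Validate the list-of-dicts format and return the count of each label.

    Raises ValueError if a record is not of the form {'text': str, 'label': int}.
    """
    for i, rec in enumerate(records):
        if set(rec) != {"text", "label"}:
            raise ValueError(f"record {i} has keys {sorted(rec)}")
        if not isinstance(rec["text"], str) or not isinstance(rec["label"], int):
            raise ValueError(f"record {i} has wrong value types")
    return Counter(rec["label"] for rec in records)


sample = [
    {"text": "Note that I did not say that it is", "label": 0},
    {"text": "In what is arguably the best outdoor adventure film of all time, "
             "four city guys confront nature's wrath, in a story of survival.",
     "label": 1},
]
print(check_records(sample))
```

The label counts are also a quick way to spot a loader that only picked up one class.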

### Hyperparameter search with Trainer

Recall from HW2 that one thing we have to do is find good hyperparameters. Luckily, there exist libraries that facilitate this. We will use [optuna](https://optuna.org/) coupled with HuggingFace's libraries to find optimal hyperparameters for our model.

### Training with Trainer

To finetune our model, we need a way of training the model. HuggingFace has a utility called [Trainer](https://huggingface.co/docs/evaluate/main/en/transformers_integrations#trainer) that will handle this for us.

### Testing with Evaluate

Finally, we need to evaluate our model to see if it is any good. You've already computed your own accuracy, precision, recall, and F1-scores before. Here we will make use of HuggingFace's [Evaluate](https://huggingface.co/docs/evaluate/main/en/index) library to do the hard things for us.


## Setup

In [1]:
!pip install optuna transformers datasets accelerate evaluate



In [2]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch, evaluate, accelerate
from transformers import TrainingArguments, Trainer
import glob
import numpy as np
from datasets import Dataset, load_dataset
import optuna
import random
import os

In [3]:
# Set device depending on whether or not you have access to GPUs
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
device

'cuda'

## Load Model

In [4]:
modelname = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(modelname, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(modelname,
                                                            num_labels=2).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Load Data

In [5]:
!gdown 18iyGEGz4csxVDUee5gqUafbmvovUBzvr
!unzip -o sentiment_data.zip
!rm -rf __MACOSX/

Streaming output truncated to the last 5000 lines.
  inflating: sentiment_data/train/pos/10878_7.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._10878_7.txt  
  inflating: sentiment_data/train/pos/4962_7.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._4962_7.txt  
  inflating: sentiment_data/train/pos/3480_10.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._3480_10.txt  
  inflating: sentiment_data/train/pos/5810_9.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._5810_9.txt  
  inflating: sentiment_data/train/pos/7325_9.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._7325_9.txt  
  inflating: sentiment_data/train/pos/9518_10.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._9518_10.txt  
  inflating: sentiment_data/train/pos/3637_10.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._3637_10.txt  
  inflating: sentiment_data/train/pos/2300_10.txt  
  inflating: __MACOSX/sentiment_data/train/pos/._2300_10.txt  
  inflating: sentimen

In [6]:
#TODO: 25pts
def load_data(path: str, label2id: dict) -> list[dict]:
    """ Loads movie review data into a list of dictionaries
    Args:
        path (str): Path to sentiment data directory
                    (e.g., sentiment_data)
        label2id (dict): Dict mapping label (i.e. pos, neg)
                        to numbers (e.g., {'pos': 0, 'neg': 1})
    Returns:
        data (list[dict]): List of dictionaries, see the markdown block above
                            this cell.
    """
    return_list = []

    # Each label name (e.g., 'pos', 'neg') is a subdirectory of .txt reviews
    for label_name, label in label2id.items():
        directory = os.path.join(path, label_name)

        for filename in os.listdir(directory):
            # Skip anything that is not a review file
            if not filename.endswith('.txt'):
                continue
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                return_list.append({'text': file.read(), 'label': label})

    return return_list


In [None]:
load_data('sentiment_data/train', {'pos': 1, 'neg': 0})

[{'text': "Made by french brothers Jules and Giddeon Naudet, and narrated by Robert De Niro and Firefighter James Hanlon this is a compelling and heartbreaking tale of how New York's finest shone on it's darkest day. I first saw this when I was a young naive 12 year old, and at that age it still touched me. Knowing how serious 9/11 really was seeing this expanded the whole effect of 9/11. We were finding out who the heroes were, how there everyday lives were composed, and how they put their lives on the line in a situation where most people would just run and save their selves. These brave men put their lives on the line and watching this just increases my admiration for them. Watch if you can,this is the best documentary I have personally ever seen.",
  'label': 1},
 {'text': 'What a great Barbara Stanwyck film that I happened to see the other night. "Jeopardy" was fantastic. It was made in 1953 and probably for double bills but it kept me on the edge of my seat.<br /><br />Barbara St

In [7]:
def getDataset(path: str, label2id: dict,
               tokenizer:AutoTokenizer=None,
              tokenize:bool=True,
               percent:float = 0.25) -> Dataset:
    """ Return HuggingFace Dataset instance
    Args:
        path (str): path to directory
        label2id (dict): Dictionary mapping classification labels to id
        tokenizer (AutoTokenizer): A HuggingFace pre-trained tokenizer
        tokenize (bool): Whether to tokenize data. Default True
        percent (float): Fraction of the shuffled data to keep. Default 0.25
    Returns:
        (Dataset): HuggingFace Dataset instance
    """
    data = load_data(path, label2id)
    # Shuffle the data
    random.shuffle(data)
    data = data[:int(len(data)*percent)]
    data = Dataset.from_list(data)
    # Tokenize
    if tokenize:
        if tokenizer is None:
            # Fail loudly instead of silently returning None
            raise ValueError('tokenize=True requires a tokenizer')
        data = data.map(lambda examples: tokenizer(examples["text"],
                                                   return_tensors="pt",
                                                   padding=True, truncation=True),
                        batched=True).with_format("torch")
    return data
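Note that `getDataset` shuffles with the global `random` module and no fixed seed, so the 5% and 25% subsets can differ from run to run. If you want repeatable subsets, one option is a seeded local generator. The sketch below illustrates the idea on made-up records; `subsample` and its `seed` parameter are assumptions for this example, not part of the lab's supporting code.

```python
import random


def subsample(records, percent, seed=0):
    """Return a shuffled copy of `records` truncated to `percent` of its length."""
    rng = random.Random(seed)   # local RNG: leaves the global random state alone
    shuffled = list(records)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:int(len(shuffled) * percent)]


# Made-up records standing in for the output of load_data
records = [{"text": f"review {i}", "label": i % 2} for i in range(200)]
small = subsample(records, percent=0.25)
print(len(small))
print(small == subsample(records, percent=0.25))  # same seed, same subset
```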

In [8]:
train_small_dataset = getDataset("sentiment_data/train", {"pos": 0, "neg": 1}, tokenizer, percent=0.05)
train_dataset = getDataset("sentiment_data/train", {"pos": 0, "neg": 1}, tokenizer, percent=0.25)
eval_dataset = getDataset("sentiment_data/eval", {"pos": 0, "neg": 1}, tokenizer)

Map:   0%|          | 0/1224 [00:00<?, ? examples/s]

Map:   0%|          | 0/6124 [00:00<?, ? examples/s]

Map:   0%|          | 0/125 [00:00<?, ? examples/s]

## Hyperparameter search

Recall from HW2 that models involve hyperparameters. We often want to find optimal hyperparameters that will result in better models. We can automate this process using HuggingFace. To do so, we need a hyperparameter optimization library. We will use [optuna](https://optuna.org/).

At its core, hyperparameter optimization is about trying different configurations and comparing model performance. To facilitate this, we need a way of resetting our model so we can try new hyperparameters. The model_init function below does just that. Additionally, we will want to sample a smaller amount of data, since tuning can take a long time! The code below creates a setup that does this. Look over the code and make sure you understand its aims!

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(modelname, num_labels=2).to(device)

# Use accuracy as the metric at eval steps
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set some initial parameters
batch_size=20
args = TrainingArguments(
        f"{modelname}-finetuned-movie-reviews",
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=2,
        weight_decay=0.01
)

# Set up a trainer with less data
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_small_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Find the best hyperparameters over n_trials runs (2 here)
best_run = trainer.hyperparameter_search(n_trials=2, direction="maximize",
                                        backend="optuna")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2023-11-06 17:09:52,380] A new study created in memory with name: no-name-61e46d55-6288-4270-a71b-f8a3d320ee55
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a cal

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.690211,0.56


[I 2023-11-06 17:11:03,209] Trial 0 finished with value: 0.56 and parameters: {'learning_rate': 5.715134156753791e-06, 'num_train_epochs': 1, 'seed': 39, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.56.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[W 2023-11-06 17:11:05,875] Trial 1 failed with parameters: {'learning_rate': 3.803542533657919e-05, 'num_train_epochs': 2, 'seed': 32, 'per_device_train_batch_size': 64} because of the following error: OutOfMemoryError('CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 40.81 MiB is free. Process 9944 has 14.71 GiB memory in use. Of the allocated memory 13.76 GiB is allocated

OutOfMemoryError: ignored
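Trial 1 above ran out of GPU memory after sampling a batch size of 64. One possible workaround is to give the search a custom search space that caps the batch size. The sketch below assumes the optuna backend, where `Trainer.hyperparameter_search` accepts an `hp_space` function that receives an optuna trial. The function name `small_memory_hp_space` and the specific ranges are illustrative choices, not recommended values.

```python
def small_memory_hp_space(trial):
    """Search space for the optuna backend that avoids very large batches.

    `trial` is expected to be an optuna Trial; the ranges are illustrative.
    """
    return {
        # Sample the learning rate on a log scale
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        # Cap the batch size at 32 to reduce the chance of running out of memory
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }


# Intended use (requires the `trainer` defined in the cell above):
# best_run = trainer.hyperparameter_search(hp_space=small_memory_hp_space,
#                                          n_trials=2, direction="maximize",
#                                          backend="optuna")
```

Whether 32 fits in memory depends on the GPU you are assigned and on the sequence lengths, so you may need to lower the cap further.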

## Train

Now that we have some (hopefully) good hyperparameters, let's train our model on our full training data! This may take some time, so look over the reflection questions in the meantime!

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(modelname,
                                                            num_labels=2).to(device)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


OutOfMemoryError: ignored

## Test

Now that we've trained our model, we need to see if it's any good. Let's evaluate it on test data. First we load that data, then we make use of HuggingFace's evaluate library. We are interested in text classification here, so we use that task. In particular, we return the accuracy, precision, recall, and F1-score for our trained model on our test data.

If you ran into memory issues or want to evaluate a model trained for longer on more data, run the following code block.

In [9]:
!gdown 1n6M1LasX02kEe4KYMiNHbbH0-ZIH1Zio
!unzip -o distilbert-for-movie-reviews.zip
!rm -rf __MACOSX/

Downloading...
From: https://drive.google.com/uc?id=1n6M1LasX02kEe4KYMiNHbbH0-ZIH1Zio
To: /content/distilbert-for-movie-reviews.zip
100% 244M/244M [00:05<00:00, 43.1MB/s]
Archive:  distilbert-for-movie-reviews.zip
   creating: distilbert-for-movie-reviews/
  inflating: __MACOSX/._distilbert-for-movie-reviews  
  inflating: distilbert-for-movie-reviews/tokenizer_config.json  
  inflating: __MACOSX/distilbert-for-movie-reviews/._tokenizer_config.json  
  inflating: distilbert-for-movie-reviews/special_tokens_map.json  
  inflating: __MACOSX/distilbert-for-movie-reviews/._special_tokens_map.json  
  inflating: distilbert-for-movie-reviews/config.json  
  inflating: __MACOSX/distilbert-for-movie-reviews/._config.json  
  inflating: distilbert-for-movie-reviews/tokenizer.json  
  inflating: __MACOSX/distilbert-for-movie-reviews/._tokenizer.json  
  inflating: distilbert-for-movie-reviews/training_args.bin  
  inflating: __MACOSX/distilbert-for-movie-reviews/._training_args.bin  
  inflating

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-for-movie-reviews",
                                                           num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert-for-movie-reviews")

The following gets your test data and returns metrics for the model.

In [11]:
test_dataset = getDataset("sentiment_data/test", {"pos": 0, "neg": 1}, tokenizer, tokenize=False)

In [12]:
task_evaluator = evaluate.evaluator("text-classification")
model.eval()
model.to("cpu")
results = task_evaluator.compute(
    model_or_pipeline=model,
    tokenizer=tokenizer,
    data=test_dataset,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0})
print(results)

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

{'accuracy': 0.9607843137254902, 'recall': 0.9047619047619048, 'precision': 1.0, 'f1': 0.9500000000000001, 'total_time_in_seconds': 1.279657577999842, 'samples_per_second': 39.85441174014311, 'latency_in_seconds': 0.025091325058820432}
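To see how these four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are an inference from the printed results, not an output of the notebook: `total_time_in_seconds` times `samples_per_second` gives about 51 test examples, and 19 true positives, 0 false positives, 2 false negatives, and 30 true negatives reproduce the reported values. It also assumes that precision and recall treat label 1 as the positive class, which is "neg" under the mapping used above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Counts inferred from the reported metrics (an assumption, see the text above)
finetuned = classification_metrics(tp=19, fp=0, fn=2, tn=30)
print(finetuned)
```

Read this way, every review the model assigned to label 1 really had label 1 (precision 1.0), while 2 of the 21 reviews with label 1 were missed (recall about 0.905).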


## Save Model

Finally, we may want to save our final model. We can do that as below. I also show how we can load the saved model for later use.

In [15]:
# Save the model
trainer.save_model("distilbert-for-movie-reviews")

NameError: ignored

In [25]:
# Load a pretrained model

model = AutoModelForSequenceClassification.from_pretrained("distilbert-for-movie-reviews",
                                                           num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert-for-movie-reviews")

## Reflection

1. [25pts] Below, evaluate a non-finetuned version of DistilBERT on our task. Compare the accuracy of that model with your final model. Reflect on your precision, recall, and F1-score. What is your model doing better or worse at?

In [19]:
# Write your code here
modelname = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(modelname, use_fast=True)
base_model = AutoModelForSequenceClassification.from_pretrained(modelname,
                                                            num_labels=2).to(device)

task_evaluator = evaluate.evaluator("text-classification")
base_model.eval()
base_model.to("cpu")
results = task_evaluator.compute(
    model_or_pipeline=base_model,
    tokenizer=tokenizer,
    data=test_dataset,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0})
print(results)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'accuracy': 0.45098039215686275, 'recall': 0.8571428571428571, 'precision': 0.4186046511627907, 'f1': 0.5625000000000001, 'total_time_in_seconds': 0.710805758000788, 'samples_per_second': 71.74955946254934, 'latency_in_seconds': 0.01393736780393702}


| Metric | Non-finetuned | Finetuned |
| ------ | ----- | ----- |
| accuracy | 0.4510 | 0.9608 |
| recall | 0.8571 | 0.9048 |
| precision | 0.4186 | 1.0000 |
| f1 | 0.5625 | 0.9500 |

The finetuned version is much better on every metric! The non-finetuned model's high recall combined with its low precision suggests that it assigns label 1 to most reviews, which is not surprising given that its classification head is newly initialized (see the warning above).


2. [10pts] How does the model performance compare to your Naive Bayes classifier? What do you think might contribute to the differences between these two models?  

It's much better than the Naive Bayes classifier: that classifier never got an accuracy over 85%, while the finetuned model scores 96% accuracy.

3. [5pts] For hyperparameter tuning, what parameters in your model were hyperparameters?

From the HuggingFace DistilBERT documentation:

n_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer encoder.

n_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.

dim (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.

hidden_dim (int, optional, defaults to 3072) — The size of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.

activation (str or Callable, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

qa_dropout (float, optional, defaults to 0.1) — The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.

seq_classif_dropout (float, optional, defaults to 0.2) — The dropout probabilities used in the sequence classification and the multiple choice model DistilBertForSequenceClassification.

Note that the Optuna search in this lab varied training hyperparameters rather than these architecture settings: the trial log above lists learning_rate, num_train_epochs, seed, and per_device_train_batch_size.

4. [35pts] Run the following code snippet, which evaluates your model on a movie review I wrote. Try out some cases of your own. Does your model work well? Can you come up with cases that trick it?  

In [16]:
@torch.no_grad()
def getScore(review: str, model: AutoModelForSequenceClassification,
            tokenizer: AutoTokenizer) -> str:
    """ Returns the label of the movie review"""
    model.eval()
    model = model.to("cpu")
    # Tokenize the review, truncating it to the model's maximum length
    inputs = tokenizer(review, return_tensors="pt", padding=True,
                       truncation=True)
    logits = model(**inputs).logits
    # Pick the highest-scoring class: 0 is pos and 1 is neg, as in training
    pred = logits.argmax(dim=-1).item()
    return "Positive" if pred == 0 else "Negative"

review = "Sunset Boulevard is an eery movie that deeply upset me. I did love it though."
getScore(review, model, tokenizer)

'Positive'

It does very well when I add more explicitly positive or negative wording!

In [26]:
review = "this movie is so bad, only the dumbest person would give it a good rating"
getScore(review, model, tokenizer)

'Negative'

Tried to trick it with 'badass'

In [27]:
review = "Sunset Boulevard is super badass.... so good"
getScore(review, model, tokenizer)

'Positive'

It doesn't classify Gen Z slang correctly :)

In [28]:
review = "Sunset boulevard is hella lit"
getScore(review, model, tokenizer)

'Negative'

In [29]:
review = "The characters in this movie were straight bussin"
getScore(review, model, tokenizer)

'Negative'

IDK how it predicted this one right

In [36]:
review = "May god bless the ancient creatures who died, whose fossils were extracted into fuel, whose fuel powered the actress' mothers car who gave birth to her"
getScore(review, model, tokenizer)

'Positive'

Tricked it with obscure internet compliments

In [32]:
review = "I would drink the movie director's bath water"
getScore(review, model, tokenizer)

'Negative'

nice

In [33]:
review = "Sunset Boulevard made me cry I'd watch it over again"
getScore(review, model, tokenizer)

'Positive'

In [34]:
review = "I would only watch this movie again ironically"
getScore(review, model, tokenizer)

'Negative'

tricked it

In [37]:
review = "I'd jump off a building to meet the director of this movie"
getScore(review, model, tokenizer)

'Negative'

In [38]:
review = "I would write a book about this movie"
getScore(review, model, tokenizer)

'Positive'
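`getScore` only returns a label, so a near coin-flip and a confident prediction look the same. When hunting for cases that trick the model, it may help to turn the two logits into probabilities with a softmax. The sketch below shows the computation on made-up logits. The helper names and the example values are hypothetical, and the label order (index 0 is Positive) follows `getScore` above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, id2label=("Positive", "Negative")):
    """Return the predicted label and the probability assigned to it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits (not taken from the model above)
print(label_with_confidence([2.0, -1.0]))  # a clear-cut prediction
print(label_with_confidence([0.1, 0.3]))   # close to a coin flip
```

With the real model, the logits would come from `model(**inputs).logits[0].tolist()` inside a `torch.no_grad()` context.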