# Assignment 2: Text Classification with BERT

**Description:** This assignment notebook builds on the material from the
[lesson 4 notebook](https://github.com/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_4_BERT.ipynb), in which we fine-tuned a BERT model for the IMDB movie reviews sentiment classification task. In that notebook, we used the bert-base-cased model and applied traditional fine-tuning, with a brief class exercise at the end to try unfreezing different numbers of layers. In this assignment, we'll start with that exercise, and ask you to explore unfreezing more specific layers yourself. Then you'll search for and try different pre-trained BERT-style models.

This notebook should be run on Google Colab with a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Please note that the GPU is required for Section 3 but not for Sections 1 and 2.
Because Colab provides free access to a GPU, it places constraints on that access. You may therefore want to turn off GPU access (Edit -> Notebook Settings) until you get to Section 3. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1 hour, with the majority of that time spent in Section 3. If Colab tells you that you have reached your GPU limit, wait up to 24 hours and you should be able to access a GPU again.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/assignment/a2/Text_classification_BERT.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries and Helper Functions

  0.2 Data Acquisition

  0.3. Data Preparation


1. Classification with BERT

  1.1. BERT Basics

  1.2 CLS-Token-based Classification

  1.3 Averaging of BERT Outputs

  1.4. Adding a CNN on top of BERT



**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1.  Please do **not** remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.  We cannot award points if the output cells are empty.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* If you want to, you can run all of the cells in section 0 in bulk. This is setup work and no questions are in there. At the end of section 0 we will state all of the relevant variables that were defined and created in section 0.

* Finally, unless otherwise indicated, your validation accuracy should be 0.65 or higher if you have correctly implemented the model.



## 0. Setup

### 0.1. Libraries and Helper Functions

This notebook requires the Hugging Face datasets and other prerequisites that you must download.  

In [1]:
# Install uv, the fast package manager
!pip install uv --quiet

# Install your libs via uv (quiet)
!uv pip install transformers torchinfo datasets fsspec huggingface_hub evaluate --quiet

In [None]:
# This command forces the session to restart.
# Run this cell after your installations.
# It will cause a notification "your session crashed for an unknown reason". This is OK.
import os
os.kill(os.getpid(), 9)

Now we are ready to do the imports.

In [1]:
#@title Imports

import numpy as np

import transformers
import evaluate

from datasets import load_dataset
from torchinfo import summary

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

### 0.2. Data Acquisition


We will use the IMDB dataset from the Hugging Face `datasets` library, which comes split into training and test sets. For expedience, we will use a random 5,000-example subset of the test set as our dev set.

In [2]:
imdb_dataset = load_dataset("imdb")

imdb_train_dataset = imdb_dataset['train'].shuffle()
imdb_dev_dataset = imdb_dataset['test'].shuffle().select(range(5000)) # take first 5000 rows after shuffle

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



It is always highly recommended to look at the data. What do the records look like? Are they clean or do they contain a lot of cruft (potential noise)?

In [3]:
imdb_train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [4]:
# This looks like a sentiment classification dataset, with 0 as neg and 1 as pos
for i in range(4):
  example = imdb_train_dataset[i]   # index the row first so we don't load the whole column each time
  print(example['text'])
  print(example['label'])
  print()

After the unexpected accident that killed an inexperienced climber (Michelle Joyner). Eight months has passed... The Rocky Mountain Rescue receive a distress call set by a brilliant terrorist mastermind Eric Quaien (John Lithgow). Quaien has lost three large cases that has millions of dollars inside. Two experienced climbers Walker (Sylvester Stallone) and Tucker (Micheal Rooker) and a helicopter pilot (Janine Turner) are to the rescue but they are set by a trap by Quaien and his men. Now the two climbers and pilot are forced to play a deadly game of hide and seek. While Quaien is trying to find the millions of dollars and he kidnapped Tucker to find the money. Once Tucker finds the money, Tucker will be dead. Against explosive firepower, bitter cold and dizzying heights. Walker must outwit Quaien for survival.<br /><br />Directed by Renny Harlin (Driven, Mindhunters, A Nightmare on Elm Street 4:The Dream Master) made an entertaining non-stop action picture. This film is a spectacular,

In [5]:
imdb_train_dataset.features['label'].names # confirms our guess above

['neg', 'pos']

For convenience, in this assignment we will define a sequence length and truncate all records at that length. For records that are shorter than our defined sequence length, we will add padding tokens to ensure that our input shapes are consistent across all records.

In [6]:
MAX_SEQUENCE_LENGTH = 100
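
To make the truncate-and-pad idea concrete, here is a minimal pure-Python sketch. The helper `pad_or_truncate`, the small `MAX_LEN`, and the pad ID of 0 are illustrative assumptions, not part of the assignment code. A real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
MAX_LEN = 6   # small stand-in for MAX_SEQUENCE_LENGTH
PAD_ID = 0    # assumed pad id; the real value depends on the tokenizer

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncate long inputs
    n_pad = max_len - len(kept)                     # number of positions to fill
    input_ids = kept + [pad_id] * n_pad             # pad short inputs on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 1142, 3085, 102])
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102])
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every record ends up with the same length, and the attention mask records which positions hold real tokens.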

### 0.3. Data Preparation

We will need to tokenize the text into vocabulary IDs to pass into a BERT model. To do so, we'll need to use the specific tokenizer that goes with the model we're using. In this notebook, we will try several different BERT-style models. Let's first write a function that takes the text from our dataset and a tokenizer, and encodes the text using that tokenizer. Then we'll apply the function to our dataset for each tokenizer and model.

In [7]:
def preprocess_imdb(data, tokenizer):
    """Tokenize a batch of IMDB records into fixed-length model inputs.

    `data` is a dict of lists with keys such as 'text' and 'label', which is
    what `Dataset.map(..., batched=True)` passes in.
    """
    review_text = data['text']        # list of raw review strings

    # Calling the tokenizer directly is equivalent to the older batch_encode_plus().
    encoded = tokenizer(
            review_text,                     # A string or list of strings to tokenize.
            max_length=MAX_SEQUENCE_LENGTH,  # Cap sequence length at MAX_SEQUENCE_LENGTH (100).
            padding='max_length',            # Pad shorter sequences with the tokenizer's pad token (ID 0 for BERT).
            truncation=True,                 # Truncate longer sequences to MAX_SEQUENCE_LENGTH tokens.
            return_attention_mask=True,      # Return an attention mask (1 = real token, 0 = padding).
            return_token_type_ids=True,      # Return token type IDs (all 0 for single sentences).
            return_tensors="pt"              # Return results as PyTorch tensors.
        )

    # The result is a dict-like object with keys:
    #   - 'input_ids': shape (batch_size, MAX_SEQUENCE_LENGTH)
    #   - 'attention_mask': shape (batch_size, MAX_SEQUENCE_LENGTH)
    #   - 'token_type_ids': shape (batch_size, MAX_SEQUENCE_LENGTH)
    return encoded


## 1. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.  If you turn off GPU access while coding and debugging the setup steps, make sure you change the Notebook settings so you can access a GPU when you're ready to train the models.
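
Before training, it can help to confirm that the runtime actually sees a GPU. The following is a minimal sketch. The helper name `gpu_available` is our own, and it simply reports `False` when PyTorch is not installed.

```python
import importlib.util

def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False                  # PyTorch missing: treat as "no GPU"
    import torch
    return torch.cuda.is_available()

has_gpu = gpu_available()
print("GPU available:", has_gpu)
```

If this prints `False` when you reach the training cells, switch the runtime type in the Notebook Settings and re-run the setup cells.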


### 1.1. Basics

Let us first explore some basics of BERT. We'll start by loading the first pretrained BERT model and tokenizer that we'll use ('bert-base-cased').

To explore just the pre-trained portion of the model, we'll use the AutoModel class (equivalent to BertModel, but works for any architecture including BERT). This class gives us the pre-trained model layers up until the last hidden layer (but not any output layer).

In [8]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert_model = AutoModel.from_pretrained('bert-base-cased')


Let's look at a couple of example sentences:

In [9]:
test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Apply the BERT tokenizer to tokenize them:

In [10]:
tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='pt')

tokenized_input

{'input_ids': tensor([[ 101, 1142, 3085, 1110, 1804, 1113, 3625,  102,    0,    0,    0,    0],
        [ 101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

 **QUESTION:**

 1.a  Why do the attention_masks have 4 and 1 zeros, respectively?  Choose the correct one and enter it in the answers file.

  *  For the first example the last four tokens belong to a different segment. For the second one it is only the last token.

  *  For the first example 4 positions are padded while for the second one it is only one.

In [11]:
bert_output = bert_model(**tokenized_input)

bert_output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3945,  0.0420,  0.0648,  ...,  0.0505,  0.2236,  0.2424],
         [-0.0946,  0.0667, -0.0361,  ...,  0.2193, -0.0697,  0.7445],
         [ 0.0056,  0.3132, -0.1798,  ...,  0.1956, -0.1061,  0.4777],
         ...,
         [ 0.2227, -0.1156,  0.1585,  ...,  0.3003,  0.0163,  0.5133],
         [ 0.3164, -0.1099,  0.2366,  ...,  0.1092, -0.1434,  0.3284],
         [ 0.3483, -0.1008,  0.2690,  ...,  0.1271, -0.1843,  0.2618]],

        [[ 0.4451,  0.2226, -0.0997,  ..., -0.2374,  0.1272,  0.0778],
         [ 0.0741, -0.3181, -0.1192,  ..., -0.0668, -0.3062,  0.4692],
         [ 0.3146,  0.6266,  0.0061,  ..., -0.0370, -0.0846,  0.7268],
         ...,
         [ 0.6999, -0.1163,  0.0161,  ..., -0.4744,  0.0573,  0.2183],
         [ 0.5603,  0.0854, -0.9192,  ..., -0.3102, -0.0938,  0.3491],
         [-0.2686,  0.1133,  0.0756,  ...,  0.3738,  0.0074,  0.1668]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

In [12]:
bert_output.last_hidden_state.shape
print(f'The shape of last_hidden_state {bert_output.last_hidden_state.shape}')

The shape of last_hidden_state torch.Size([2, 12, 768])


In [13]:
# pooler_output = tanh( W @ cls_raw + b )
bert_output.pooler_output.shape
print(f'The shape of pooler_output {bert_output.pooler_output.shape}')

The shape of pooler_output torch.Size([2, 768])
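
The comment in the cell above summarizes the pooler as `tanh(W @ cls_raw + b)`, applied to the hidden state at position 0 (the CLS token). Here is a toy sketch of that computation in plain Python. The hidden size of 3, the identity weight matrix, and all numbers are made up for illustration; the real layer uses learned 768-dimensional weights.

```python
import math

def pooler(cls_vector, weight, bias):
    """Compute tanh(W @ h_cls + b) for one example using plain lists."""
    out = []
    for row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(row, cls_vector)) + b   # one row of W @ h + b
        out.append(math.tanh(z))                              # squash into (-1, 1)
    return out

# Toy hidden size of 3 instead of 768.
last_hidden_state = [        # shape [seq_len=2, hidden=3] for a single sentence
    [0.5, -1.0, 2.0],        # position 0: the CLS token
    [0.1, 0.2, 0.3],         # position 1: an ordinary token
]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity, for readability
b = [0.0, 0.0, 0.0]

pooled = pooler(last_hidden_state[0], W, b)
print(pooled)
```

This is why `pooler_output` has one vector per sentence (shape `[2, 768]`), while `last_hidden_state` has one vector per token (shape `[2, 12, 768]`).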


 **QUESTION:**

 1.b How many outputs are there?
 - 2

 Enter your code below.

In [14]:
for key in bert_output.keys():
  print(f'The elements in the bert_output: {key}')

The elements in the bert_output: last_hidden_state
The elements in the bert_output: pooler_output


**QUESTION:**

1.c Which output do we need to use to get token-level embeddings?

the first

Put your answer in the answers file.



**QUESTION:**

 1.d In the tokenized input, which input_id number (i.e. the vocabulary id) corresponds to 'bank' in the two sentences? (`bert_tokenizer.tokenize()` may come in handy, and don't forget the CLS token!)

  - 3085

**QUESTION:**

 1.e In the array of tokens, which position index number corresponds to 'bank' in the first sentence? (`bert_tokenizer.tokenize()` may come in handy, and don't forget the CLS token!)
  - 2

In [15]:
tokenized_input

{'input_ids': tensor([[ 101, 1142, 3085, 1110, 1804, 1113, 3625,  102,    0,    0,    0,    0],
        [ 101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

**QUESTION:**

1.f Which array position index number corresponds to 'bank' in the second sentence?
 - 4

**QUESTION:**

 1.g What is the cosine similarity between the BERT embeddings for the two occurrences of 'bank' in the two sentences?

In [16]:
import torch
import torch.nn.functional as F

tok = tokenized_input
outs = bert_output                 # from your forward pass
H = outs.last_hidden_state         # [B, L, 768]

# --- 'bank' in both sentences ---
bank_s1 = H[0, 2, :]               # sentence 1, index 2
bank_s2 = H[1, 4, :]               # sentence 2, index 4
cos_bank = F.cosine_similarity(bank_s1, bank_s2, dim=0)

print(f'The cosine similarity between the BERT embeddings for the two occurrences of bank in the two sentences is {cos_bank:.5f}')


The cosine similarity between the BERT embeddings for the two occurrences of bank in the two sentences is 0.74783
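
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(u, v) = u·v / (‖u‖ ‖v‖). The sketch below computes it by hand on made-up 3-dimensional vectors. The function and variable names are illustrative; `F.cosine_similarity` performs the same calculation on tensors, with a small epsilon added to guard against division by zero.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # a scaled copy of a
orthogonal = [3.0, 0.0, -1.0]      # dot product with a is 3 + 0 - 3 = 0

sim_same = cosine_similarity(a, same_direction)
sim_orth = cosine_similarity(a, orthogonal)
print(sim_same, sim_orth)
```

Values near 1 mean the two vectors point in almost the same direction, regardless of their lengths, and values near 0 mean they are nearly orthogonal.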


**QUESTION:**

1.h How does this relate to the cosine similarity of 'this' (in sentence 1) and the first 'the' (in sentence 2)? Compute their cosine similarity.


In [17]:
# --- 'this' (sent1) vs 'the' (sent2) ---
this_s1 = H[0, 1, :]               # 'this' at index 1 in sent 1
the_s2  = H[1, 1, :]               # 'the'  at index 1 in sent 2
cos_det = F.cosine_similarity(this_s1, the_s2, dim=0)

print(f'The cosine similarity between the BERT embeddings for "this" (sentence 1) and "the" (sentence 2) is {cos_det:.5f}')

The cosine similarity between the BERT embeddings for "this" (sentence 1) and "the" (sentence 2) is 0.81103


## 2. Testing Different Pre-Trained BERT Models

In the live session we discussed classification with the `bert-base-cased` model, using the Hugging Face class BertForSequenceClassification, which comes with a new output layer for our task that we need to train on our dataset.

We're going to try different pre-trained models now. Like in the lesson 4 notebook, we'll want to fine-tune each model on our IMDB reviews dataset and compare them with a metric like the validation accuracy. We'll use the model class AutoModelForSequenceClassification, which is equivalent to BertForSequenceClassification, but works for other similar architectures too.

Let's write the code we'll need as a function that takes the model and tokenizer as arguments, along with the raw train and dev data. The function will need to tokenize the inputs using the provided tokenizer, so that we can repeat the same code for different pre-trained models. Then the function should create the training args and trainer class, and call trainer.train().

The other hyperparameters you'll need are provided in the function definition, including batch_size and num_epochs. You should use the default values provided for those. Use the function provided below for compute_metrics.
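
As a quick sanity check on what these defaults imply, the arithmetic below estimates the number of optimizer steps. It is a sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. The helper name `optimizer_steps` is our own.

```python
import math

def optimizer_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# 25,000 training reviews, batch_size=16, num_epochs=2
per_epoch, total = optimizer_steps(25_000, 16, 2)

# 5,000 dev reviews are evaluated in batches of 16 as well
eval_batches = math.ceil(5_000 / 16)

print(per_epoch, total, eval_batches)
```

With `logging_steps=50`, that works out to roughly 31 logged training-loss values per epoch.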

For now, keep all layers of the pre-trained models you load unfrozen.

In [18]:
metric = evaluate.load('accuracy')

def compute_metrics(p):
    # Unpack EvalPrediction to raw arrays
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script: 0.00B [00:00, ?B/s]
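
To see what `compute_metrics` does with the model outputs, here is a pure-Python sketch of the same two steps: take the argmax over each row of logits, then compare with the labels. The function name and the toy logits are illustrative assumptions; the real function works on NumPy arrays and delegates the comparison to the `evaluate` metric.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row, then compute the fraction correct."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four examples, two classes (0 = neg, 1 = pos); the numbers are made up.
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Only the third example is misclassified (predicted 1, labeled 0), so three of the four predictions are correct.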

In [19]:
def fine_tune_classification_model(classification_model,
                                   tokenizer,
                                   train_data,
                                   dev_data,
                                   batch_size = 16,
                                   num_epochs = 2):

    preprocessed_train_data = train_data.map(
        preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer}
    )
    preprocessed_dev_data = dev_data.map(
        preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer}
    )

    # Ensure expected columns
    if 'label' in preprocessed_train_data.column_names:
        preprocessed_train_data = preprocessed_train_data.rename_column('label', 'labels')
    if 'label' in preprocessed_dev_data.column_names:
        preprocessed_dev_data = preprocessed_dev_data.rename_column('label', 'labels')

    cols = ['input_ids', 'attention_mask', 'labels']
    preprocessed_train_data = preprocessed_train_data.remove_columns(
        [c for c in preprocessed_train_data.column_names if c not in cols]
    )
    preprocessed_dev_data = preprocessed_dev_data.remove_columns(
        [c for c in preprocessed_dev_data.column_names if c not in cols]
    )

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    training_args = TrainingArguments(
        output_dir="./imdb-bert-out",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        logging_steps=50,
        report_to="none",
        seed=42,

        # Key changes for TPU/CPU safety:
        optim="adamw_torch",
        torch_compile=False,
    )

    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        processing_class=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    print("Validation metrics:", trainer.evaluate())
    return trainer

Let's try BERT-base-cased first, the same model that was used in the lesson 4 notebook.

In [20]:
"""
Show the output from training BERT-base-cased on the IMDB movie reviews dataset.
"""

model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_name, num_labels=2  # reuse the checkpoint name so tokenizer and model always match
)

fine_tune_classification_model(bert_classification_model, bert_tokenizer, imdb_train_dataset, imdb_dev_dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3689 | 0.337474 | 0.8594 |
| 2 | 0.2018 | 0.36117 | 0.8646 |


Validation metrics: {'eval_loss': 0.3611699640750885, 'eval_accuracy': 0.8646, 'eval_runtime': 7.3342, 'eval_samples_per_second': 681.735, 'eval_steps_per_second': 42.677, 'epoch': 2.0}


<transformers.trainer.Trainer at 0x794619dae030>

Often, one of the first choices you have is what pre-trained model you'll want to use. There are quite a few options, especially because other researchers and practitioners fine-tune their own versions of existing models and sometimes make theirs available for others to continue building on.

You can search through the models available on [Hugging Face at this website](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending). Some models were made by Hugging Face or other large companies and organizations; others may have been uploaded by individual users. Notice the search tags on the left: we've already selected the tag for "Text Classification" in the link above. You should see various versions of BERT-style models.

For our IMDB classification, we might want to try a model that has been trained on another dataset related to sentiment or emotions. We also want to find models that have a complete model card with documentation about the model architecture and how it was trained, and potentially a link to an associated research paper, and/or a good number of downloads and likes.

Take a look at this model: [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment). It's a RoBERTa model (similar to BERT with slightly different pre-training, and often popular for classification tasks) that has already been fine-tuned on the TweetEval benchmark set of tasks for sentiment analysis.

The model card indicates that there is an updated version of this model now available. Follow the link to the latest version of the model, and look at that most recent model's card to answer the following questions. Then load that most recent model to train on our task.

**QUESTION:**

 2.a What is the model checkpoint name for the most recent version of this Twitter Roberta-base sentiment analysis model? (Copy and paste the model checkpoint name into the answers file. It should be the full name that you put inside the quotes to load the file below.)

 **QUESTION:**

 2.b Approximately how many tweets was this latest model trained on? (Put the answer in the answers file. You can use the abbreviation for millions like in the model card, e.g. a number like 12M or 85M.)

 **QUESTION:**

 2.c What is the title of the published reference paper for this most recent model? (Copy the full title of the paper and paste it into the answers file.)

In [21]:
# Most recent Twitter RoBERTa sentiment model
model_checkpoint_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Tokenizer + Model
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
# For IMDB (binary), override the head to 2 labels:
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_name, num_labels=2, ignore_mismatched_sizes=True
)

# Train
imdb_roberta_trainer = fine_tune_classification_model(
    bert_classification_model, bert_tokenizer,
    imdb_train_dataset, imdb_dev_dataset
)


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpo


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3267 | 0.28868 | 0.8782 |
| 2 | 0.1802 | 0.305795 | 0.896 |


Validation metrics: {'eval_loss': 0.30579498410224915, 'eval_accuracy': 0.896, 'eval_runtime': 7.4561, 'eval_samples_per_second': 670.588, 'eval_steps_per_second': 41.979, 'epoch': 2.0}


In [22]:
# Print the val accuracy of the imdb_roberta_trainer.
# Evaluate once and reuse the result: each evaluate() call re-runs the full dev set.
roberta_metrics = imdb_roberta_trainer.evaluate()
print(f'The validation accuracy of the imdb_roberta_trainer is {roberta_metrics["eval_accuracy"]}')

print(f'Other metrics are {roberta_metrics}')

The validation accuracy of the imdb_roberta_trainer is 0.896
Other metrics are {'eval_loss': 0.30579498410224915, 'eval_accuracy': 0.896, 'eval_runtime': 7.7109, 'eval_samples_per_second': 648.433, 'eval_steps_per_second': 40.592, 'epoch': 2.0}


**QUESTION:**

2.d What is the final validation accuracy that you observed for the Twitter RoBERTa sentiment-trained model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.567 or 0.876. Use up to 5 significant digits, though fewer is fine if the output shown in the notebook only has 3 or 4. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

2.e Did the Twitter RoBERTa sentiment-trained model do better than, worse than, or the same as BERT-base?


**(Answer 2.f below but do NOT enter your sentences in the answers file)**

**QUESTION:**

2.f Why do you think that happened? (Put your two to three sentence answer in the cell below.)

Please answer 2.f in two to three sentences right here:

** BEGIN Q 2.f ANSWER HERE **

RoBERTa benefits from stronger pretraining (more data, no NSP, dynamic masking), and this checkpoint was already fine-tuned for sentiment on TweetEval, so its representations separate polarity cues more cleanly than those of a vanilla BERT-base starting point. After replacing the 3-class head with a 2-class head, fine-tuning quickly adapts those sentiment features to IMDB, overcoming most of the tweet→review domain shift.

** END Q 2.f ANSWER HERE. **


## 3. Unfreezing Different Pre-Trained Layers

In the lesson 4 notebook, we tested freezing most or all of the pre-trained BERT model layers. We used the .named_parameters() method, looking at the specific names of each set of model parameters.

As in the lesson notebook, we will always want to make sure we keep the classification layer parameters unfrozen, since those need to be trained for our specific task. We will also keep the pooler layer unfrozen, since it's next closest to the classification layer and was only pre-trained in standard BERT models with the next sentence prediction task.

For the remaining layers, what happens if we unfreeze lower transformer blocks and keep higher transformer blocks frozen (the opposite of what we did in the lesson notebook)? What if we instead try unfreezing specific types of layers within each transformer block, e.g. all of the self attention layers, or all of the dense layers?

Let's modify our fine-tuning function, to add an argument for the layers that we want to train. We'll make that argument a list of strings, and we'll set the default to just unfreeze the classification layer. You'll need to write the code to compare those strings to the names of the model parameters (after loading the specified model) and freeze all parameters that don't match (as in the lesson 4 notebook).
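
The substring comparison described above can be sketched without loading a model. The helper `split_by_patterns` and the shortened list of names are illustrative assumptions; the names follow the `roberta.encoder.layer.N...` pattern that `.named_parameters()` prints. Note the trailing dot in the second pattern list: without it, `"encoder.layer.1"` also matches layers 10 and 11.

```python
def split_by_patterns(param_names, layers_to_train):
    """Split names into (trainable, frozen) using substring matching."""
    trainable = [n for n in param_names if any(p in n for p in layers_to_train)]
    frozen = [n for n in param_names if n not in trainable]
    return trainable, frozen

names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.1.attention.self.query.weight",
    "roberta.encoder.layer.1.output.dense.weight",
    "roberta.encoder.layer.11.attention.self.query.weight",
    "classifier.dense.weight",
    "classifier.out_proj.weight",
]

# Loose pattern: "encoder.layer.1" is also a substring of "encoder.layer.11".
loose, _ = split_by_patterns(names, ["classifier.", "encoder.layer.1"])

# Strict pattern: the trailing dot restricts the match to block 1 only.
strict, strict_frozen = split_by_patterns(names, ["classifier.", "encoder.layer.1."])

print(len(loose), len(strict))
print(strict_frozen)
```

In the real function, the same test decides whether each parameter's `requires_grad` is set to `True` or `False`.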

In [23]:
# Refresh your memory on what the parameter names look like
for name, param in bert_classification_model.named_parameters():
    print(name)

roberta.embeddings.word_embeddings.weight
roberta.embeddings.position_embeddings.weight
roberta.embeddings.token_type_embeddings.weight
roberta.embeddings.LayerNorm.weight
roberta.embeddings.LayerNorm.bias
roberta.encoder.layer.0.attention.self.query.weight
roberta.encoder.layer.0.attention.self.query.bias
roberta.encoder.layer.0.attention.self.key.weight
roberta.encoder.layer.0.attention.self.key.bias
roberta.encoder.layer.0.attention.self.value.weight
roberta.encoder.layer.0.attention.self.value.bias
roberta.encoder.layer.0.attention.output.dense.weight
roberta.encoder.layer.0.attention.output.dense.bias
roberta.encoder.layer.0.attention.output.LayerNorm.weight
roberta.encoder.layer.0.attention.output.LayerNorm.bias
roberta.encoder.layer.0.intermediate.dense.weight
roberta.encoder.layer.0.intermediate.dense.bias
roberta.encoder.layer.0.output.dense.weight
roberta.encoder.layer.0.output.dense.bias
roberta.encoder.layer.0.output.LayerNorm.weight
roberta.encoder.layer.0.output.LayerNorm

In [24]:
def fine_tune_classif_model_freeze_layers(
    classification_model,
    tokenizer,
    train_data,
    dev_data,
    layers_to_train=("classifier.",),
    batch_size=16,
    num_epochs=2,
):
    """
    Freeze all params whose names DO NOT contain any substring in layers_to_train.
    Keep only inputs needed by the model; use dynamic padding and accuracy metric.
    """
    # Preprocess
    preprocessed_train_data = train_data.map(
        preprocess_imdb, batched=True, fn_kwargs={"tokenizer": tokenizer}
    )
    preprocessed_dev_data = dev_data.map(
        preprocess_imdb, batched=True, fn_kwargs={"tokenizer": tokenizer}
    )
    if "label" in preprocessed_train_data.column_names:
        preprocessed_train_data = preprocessed_train_data.rename_column("label", "labels")
    if "label" in preprocessed_dev_data.column_names:
        preprocessed_dev_data = preprocessed_dev_data.rename_column("label", "labels")

    keep_cols = ["input_ids", "attention_mask", "labels"]
    preprocessed_train_data = preprocessed_train_data.remove_columns(
        [c for c in preprocessed_train_data.column_names if c not in keep_cols]
    )
    preprocessed_dev_data = preprocessed_dev_data.remove_columns(
        [c for c in preprocessed_dev_data.column_names if c not in keep_cols]
    )

    # Freeze/unfreeze by substring match
    def _is_trainable(name: str, patterns) -> bool:
        return any(pat in name for pat in patterns)

    trainable_names, frozen_names = [], []
    for name, param in classification_model.named_parameters():
        if _is_trainable(name, layers_to_train):
            param.requires_grad = True
            trainable_names.append(name)
        else:
            param.requires_grad = False
            frozen_names.append(name)

    print("\nTrainable parameter name substrings:", layers_to_train)
    print(f"Trainable tensors: {len(trainable_names)} | Frozen tensors: {len(frozen_names)}")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    training_args = TrainingArguments(
        output_dir="./imdb-bert-freeze-out",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        eval_strategy="epoch",
        save_strategy="no",
        load_best_model_at_end=False,
        logging_steps=50,
        report_to="none",
        seed=42,
        optim="adamw_torch",
        torch_compile=False,
    )

    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        tokenizer=tokenizer,        # still accepted; recent transformers versions warn that it is deprecated
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    # ---- Train --------------------------------------------------------------
    trainer.train()

    # Ensure we have a final eval entry in log_history
    _ = trainer.evaluate()

    # ---- Extract from log_history ------------------------------------------
    logs = trainer.state.log_history

    # Last eval record (final epoch)
    last_eval = next((rec for rec in reversed(logs) if "eval_loss" in rec), {})
    final_val_loss = float(last_eval.get("eval_loss", float("nan")))
    final_val_acc  = float(last_eval.get("eval_accuracy", float("nan")))
    last_eval_step = last_eval.get("step", None)

    # Last training loss BEFORE (or at) that eval step
    final_train_loss = float("nan")
    for rec in reversed(logs):
        # training logs have 'loss' (and not 'eval_loss')
        if "loss" in rec and "eval_loss" not in rec:
            if last_eval_step is None or rec.get("step", 0) <= last_eval_step:
                final_train_loss = float(rec["loss"])
                break

    # NaN is the only value for which x == x is False, so this also guards against a missing eval loss
    ratio = (
        final_train_loss / final_val_loss
        if (final_val_loss == final_val_loss and final_val_loss != 0.0)
        else float("nan")
    )

    print("\n=== FINAL METRICS (from log_history) ===")
    print(f"Training loss (final epoch): {final_train_loss:.5f}")
    print(f"Validation loss (final eval): {final_val_loss:.5f}")
    print(f"Train/Val loss ratio:         {ratio:.5f}")
    print(f"Val accuracy (final eval):    {final_val_acc:.5f}")

    return trainer, {
        "train_loss": final_train_loss,
        "eval_loss": final_val_loss,
        "eval_accuracy": final_val_acc,
        "trainval_loss_ratio": ratio,
    }

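The last part of the function reads its final metrics back out of `trainer.state.log_history`. The sketch below repeats that lookup on a small hand-written list, so the selection logic can be checked without training anything. The loss and accuracy values are the ones printed by the first run in this notebook; the step numbers, the summary record, and the helper name `final_losses` are assumptions made for illustration.

```python
# Hypothetical records shaped like entries of Trainer.state.log_history.
# Losses/accuracies come from this notebook's first run; step numbers are made up.
logs = [
    {"loss": 0.4304, "step": 1550},
    {"eval_loss": 0.428129, "eval_accuracy": 0.8004, "step": 1563},
    {"loss": 0.3134, "step": 3100},
    {"eval_loss": 0.422927, "eval_accuracy": 0.808, "step": 3126},
    {"train_loss": 0.40, "train_runtime": 100.0, "step": 3126},     # end-of-training summary
    {"eval_loss": 0.422927, "eval_accuracy": 0.808, "step": 3126},  # extra trainer.evaluate()
]


def final_losses(logs):
    """Return (last training loss at or before the last eval, last eval loss)."""
    # Walk backwards to find the most recent evaluation record.
    last_eval = next((rec for rec in reversed(logs) if "eval_loss" in rec), {})
    eval_step = last_eval.get("step")

    train_loss = float("nan")
    for rec in reversed(logs):
        # Step-level training records carry "loss"; the summary record carries
        # "train_loss" instead, so it is skipped by this test.
        if "loss" in rec and "eval_loss" not in rec:
            if eval_step is None or rec.get("step", 0) <= eval_step:
                train_loss = rec["loss"]
                break
    return train_loss, last_eval.get("eval_loss", float("nan"))


train_loss, eval_loss = final_losses(logs)
print(f"train={train_loss:.5f} eval={eval_loss:.5f} ratio={train_loss / eval_loss:.5f}")
```

Note that the "training loss" picked up this way is the last value logged every `logging_steps` steps, not an average over the whole epoch.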

We'll go back to using bert-base-cased for this part. First, try freezing the parameters in transformer layers 1-11 (that is, all parameters with "layer.#" in the name for # from 1 to 11). That leaves three groups unfrozen: the initial embedding layers, the first transformer layer (numbered 0), and the classification layer.

Unfreezing the bottom transformer layer(s) rather than the top one(s) is uncommon, but it's always good to try to understand why. Since we're learning, we'll try doing it this way and see what happens. We've given you the code for this exercise, so that the way to specify layers_to_train is clear.
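Because layers are selected by substring, the exact spelling of each pattern matters. The sketch below rebuilds the list of parameter names for a 12-layer BERT classifier in plain Python and counts what a few patterns match. The names are written out by hand on the assumption that they follow the naming used by Hugging Face's `BertForSequenceClassification`; the helper names `bert_base_param_names` and `matched` are illustrative only.

```python
def bert_base_param_names(num_layers=12):
    """Parameter names assumed for a BERT-base sequence classifier (Hugging Face naming)."""
    names = [
        f"bert.embeddings.{n}"
        for n in (
            "word_embeddings.weight",
            "position_embeddings.weight",
            "token_type_embeddings.weight",
            "LayerNorm.weight",
            "LayerNorm.bias",
        )
    ]
    block_parts = [
        "attention.self.query", "attention.self.key", "attention.self.value",
        "attention.output.dense", "attention.output.LayerNorm",
        "intermediate.dense", "output.dense", "output.LayerNorm",
    ]
    for i in range(num_layers):
        for part in block_parts:
            for suffix in ("weight", "bias"):
                names.append(f"bert.encoder.layer.{i}.{part}.{suffix}")
    names += [
        "bert.pooler.dense.weight", "bert.pooler.dense.bias",
        "classifier.weight", "classifier.bias",
    ]
    return names


def matched(names, patterns):
    """Names containing at least one pattern as a substring (same rule as _is_trainable)."""
    return [n for n in names if any(p in n for p in patterns)]


names = bert_base_param_names()
print(len(names))                                                            # 201
print(len(matched(names, ["embeddings.", "encoder.layer.0.", "classifier."])))  # 23

# The trailing dot matters: without it, "layer.1" also matches layers 10 and 11.
print(len(matched(names, ["encoder.layer.1."])))  # 16 (layer 1 only)
print(len(matched(names, ["encoder.layer.1"])))   # 48 (layers 1, 10 and 11)
```

Under these naming assumptions the counts agree with the tensor counts that the training function prints.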

In [25]:
"""
Show the output from training a BERT-base-cased classification model, when unfreezing
only the parameters in the embedding layers, first transformer layer (layer 0), and classifier layer.
"""

model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_name, num_labels=2
)

# Unfreeze ONLY: embeddings, first encoder block (layer 0), and classifier
layers_to_train = ["embeddings.", "encoder.layer.0.", "classifier."]  # note BERT uses "encoder.layer", not "layer."

trainer_low_unfrozen, metrics = fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train,
    batch_size=16,
    num_epochs=2
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Trainable parameter name substrings: ['embeddings.', 'encoder.layer.0.', 'classifier.']
Trainable tensors: 23 | Frozen tensors: 178



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4304 | 0.428129 | 0.8004 |
| 2 | 0.3134 | 0.422927 | 0.808 |



=== FINAL METRICS (from log_history) ===
Training loss (final epoch): 0.31340
Validation loss (final eval): 0.42293
Train/Val loss ratio:         0.74103
Val accuracy (final eval):    0.80800


In [36]:
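# Print the val accuracy. Run this cell right after the training cell above: later cells
# reassign `metrics`, so a late run reports another model's accuracy (the run above logged 0.80800).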
print('bert_text_classification_3.1.a:')
print(f'The validation accuracy of the bert-base-cased trainer_low_unfrozen is {metrics["eval_accuracy"]:.5f}')

bert_text_classification_3.1.a:
The validation accuracy of the bert-base-cased trainer_low_unfrozen is 0.85900


 **QUESTION:**

3.a What is the final validation accuracy that you observed for this lowest level unfrozen version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)




Now try two more versions, this time choosing which layers to train yourself. Instead of focusing on the number of the transformer block (layer.#), focus on the type of layer within each block (the stuff that comes after layer.# in the name).

Keep the pooler and classification layers unfrozen in all model versions. Your options to also train include the initial embedding layers and the different components within the transformer blocks (e.g. self attention matrices, dense layers, layer norms).

Try to find one combination that does better than the version you just ran above (higher validation accuracy after 2 epochs), without much more overfitting (training_loss / eval_loss > 0.7). Also try to find one version that overfits a lot more after 2 epochs (training_loss / eval_loss < 0.5).
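To choose patterns by component type, it helps to see which tensors inside a single transformer block each substring selects. The sketch below lists the 16 parameter names of one block and counts the matches per candidate pattern. The names assume Hugging Face's naming for BERT, and the pattern labels are illustrative, not a recommended answer.

```python
# Components of one BERT encoder block (assumed Hugging Face parameter naming).
BLOCK_PARTS = [
    "attention.self.query", "attention.self.key", "attention.self.value",
    "attention.output.dense", "attention.output.LayerNorm",
    "intermediate.dense", "output.dense", "output.LayerNorm",
]
block_names = [
    f"bert.encoder.layer.0.{part}.{suffix}"
    for part in BLOCK_PARTS
    for suffix in ("weight", "bias")
]

# Candidate substrings that ignore the block number and target a component type.
candidates = {
    "self-attention projections": "attention.self.",
    "layer norms": "LayerNorm.",
    "feed-forward up-projection": "intermediate.dense.",
    "dense output layers": "output.dense.",
}

per_block = {
    label: sum(pattern in name for name in block_names)
    for label, pattern in candidates.items()
}

for label, count in per_block.items():
    # A pattern without a layer number applies to all 12 blocks.
    print(f"{label:28s} {count} tensors per block, {12 * count} across the encoder")

# Per the instructions, the pooler and classifier stay trainable in every version.
layers_to_train_example = [candidates["layer norms"], "pooler.", "classifier."]
print(layers_to_train_example)
```

Two details are easy to miss: `"output.dense."` matches both `attention.output.dense` and the feed-forward `output.dense`, and `"LayerNorm."` would also match the embedding layer norm, which sits outside the transformer blocks.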

In [27]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to perform better than the one in question 3.a, without overfitting much more.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

### YOUR CODE HERE

# Unfreeze the top two encoder blocks (closest to the task head); keep the embeddings and lower blocks frozen
layers_to_train = [
    "encoder.layer.10.",  # second-to-last block
    "encoder.layer.11.",  # last block
    "pooler.",            # kept trainable, as the instructions require
    "classifier."         # always train the head
]

### END YOUR CODE

trainer_top2_unfrozen, metrics = fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Trainable parameter name substrings: ['encoder.layer.10.', 'encoder.layer.11.', 'pooler.', 'classifier.']
Trainable tensors: 36 | Frozen tensors: 165


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3555 | 0.365575 | 0.8398 |
| 2 | 0.2826 | 0.328858 | 0.859 |



=== FINAL METRICS (from log_history) ===
Training loss (final epoch): 0.28260
Validation loss (final eval): 0.32886
Train/Val loss ratio:         0.85934
Val accuracy (final eval):    0.85900


In [35]:
# Print training loss, val loss, ratio of training/val loss, and val accuracy
print('bert_text_classification_3_2_3_3_b - e:')
print(f"  The training loss is {metrics['train_loss']:.5f}")
print(f"  The validation loss is {metrics['eval_loss']:.5f}")
print(f"  The loss ratio (train/val) is {metrics['trainval_loss_ratio']:.5f}")
print(f"  The validation accuracy is {metrics['eval_accuracy']:.5f}")


bert_text_classification_3_2_3_3_b - e:
  The training loss is 0.28260
  The validation loss is 0.32886
  The loss ratio (train/val) is 0.85934
  The validation accuracy is 0.85900


 **QUESTION:**

3.b What is the final training loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.c What is the final validation loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.d What is the ratio of your final training loss/final validation loss? For this better version the ratio must be greater than 0.7.

3.e What is the final validation accuracy that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

In [29]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to overfit.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

### YOUR CODE HERE

layers_to_train = [""]  # The empty string is a substring of every name, so every parameter stays trainable.

### END YOUR CODE


overfit_trainer, overfit_metrics = fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Trainable parameter name substrings: ['']
Trainable tensors: 201 | Frozen tensors: 0


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3549 | 0.331667 | 0.8598 |
| 2 | 0.1496 | 0.383825 | 0.8646 |



=== FINAL METRICS (from log_history) ===
Training loss (final epoch): 0.14960
Validation loss (final eval): 0.38383
Train/Val loss ratio:         0.38976
Val accuracy (final eval):    0.86460


In [34]:
print(f'3.f Training loss is {overfit_metrics["train_loss"]:.5f}')
print(f'3.g Validation loss is {overfit_metrics["eval_loss"]:.5f}')
print(f'3.h Train/Val loss ratio is {overfit_metrics["trainval_loss_ratio"]:.5f}')
print(f'3.i Validation accuracy is {overfit_metrics["eval_accuracy"]:.5f}')


3.f Training loss is 0.14960
3.g Validation loss is 0.38383
3.h Train/Val loss ratio is 0.38976
3.i Validation accuracy is 0.86460


In [31]:
overfit_trainer.state.log_history

[{'loss': 0.5931,
  'grad_norm': 4.214748382568359,
  'learning_rate': 4.921625079974408e-05,
  'epoch': 0.03198976327575176,
  'step': 50},
 {'loss': 0.5086,
  'grad_norm': 5.423033714294434,
  'learning_rate': 4.841650671785029e-05,
  'epoch': 0.06397952655150352,
  'step': 100},
 {'loss': 0.4685,
  'grad_norm': 9.648335456848145,
  'learning_rate': 4.7616762635956495e-05,
  'epoch': 0.09596928982725528,
  'step': 150},
 {'loss': 0.4863,
  'grad_norm': 8.309975624084473,
  'learning_rate': 4.68170185540627e-05,
  'epoch': 0.12795905310300704,
  'step': 200},
 {'loss': 0.4415,
  'grad_norm': 9.238158226013184,
  'learning_rate': 4.601727447216891e-05,
  'epoch': 0.1599488163787588,
  'step': 250},
 {'loss': 0.4217,
  'grad_norm': 14.247499465942383,
  'learning_rate': 4.5217530390275114e-05,
  'epoch': 0.19193857965451055,
  'step': 300},
 {'loss': 0.3936,
  'grad_norm': 6.219278812408447,
  'learning_rate': 4.441778630838132e-05,
  'epoch': 0.22392834293026231,
  'step': 350},
 {'los

 **QUESTION:**

3.f What is the final training loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.g What is the final validation loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.h What is the ratio of your final training loss/final validation loss? For this overfitting version the ratio must be less than 0.5.

3.i What is the final validation accuracy that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)
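The two ratio thresholds used in the questions (above 0.7 for the better model, below 0.5 for the overfitting one) can be checked mechanically. This is a minimal sketch using the losses printed by the three runs in this notebook; the helper name `overfit_label` and its labels are assumptions for illustration, not part of the assignment code.

```python
def overfit_label(train_loss, eval_loss, ok_above=0.7, overfit_below=0.5):
    """Return (train/eval loss ratio, label) using the assignment's two thresholds."""
    ratio = train_loss / eval_loss
    if ratio > ok_above:
        label = "little overfitting"
    elif ratio < overfit_below:
        label = "strong overfitting"
    else:
        label = "in between"
    return ratio, label


# (final training loss, final validation loss) as reported by the runs above
runs = {
    "embeddings + layer 0 + classifier": (0.3134, 0.422927),
    "layers 10-11 + pooler + classifier": (0.2826, 0.328858),
    "all parameters": (0.1496, 0.383825),
}

results = {name: overfit_label(*losses) for name, losses in runs.items()}
for name, (ratio, label) in results.items():
    print(f"{name:36s} ratio={ratio:.5f}  {label}")
```

A lower ratio means the training loss has dropped well below the validation loss, which is the sign of overfitting that the questions ask you to look for.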

## Congratulations... You are done!