# Assignment 2: Text Classification with BERT

**Description:** This assignment notebook builds on the material from the
[lesson 4 notebook](https://github.com/datasci-w266/2025-spring-main/blob/master/materials/lesson_notebooks/lesson_4_BERT.ipynb), in which we fine-tuned a BERT model for the IMDB movie reviews sentiment classification task. In that notebook, we used the bert-base-cased model and applied traditional fine-tuning, with a brief class exercise at the end to try unfreezing different numbers of layers. In this assignment, we'll start with that exercise, and ask you to explore unfreezing more specific layers yourself. Then you'll search for and try different pre-trained BERT-style models.

This notebook should be run on a Google Colab leveraging a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Please note that you the GPU is reuqired for Section 3 but not for Sections 1 and 2.
Since colab is providing free access to a GPU they place constraints on that access.  Therefore you might want to turn off the GPU access (Edit -> Notebook Settings) until you get to section 3.  Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1h with the majority of that time being in Section 3. If Colab tells you that you have reached your GPU limit, wait up to 24 hours and you should be able to access a GPU again.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-spring-main/blob/master/assignment/a2/Text_classification_BERT.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries and Helper Functions

  0.2 Data Acquisition

  0.3. Data Preparation


1. Classification with BERT

  1.1. BERT Basics

  1.2 CLS-Token-based Classification

  1.3 Averaging of BERT Outputs

  1.4. Adding a CNN on top of BERT



**INSTRUCTIONS:**:

* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1.  Please do **not** remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.  We cannot award points if the output cells are empty.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* If you want to, you can run all of the cells in section 0 in bulk. This is setup work and no questions are in there. At the end of section 0 we will state all of the relevant variables that were defined and created in section 1.

* Finally, unless otherwise indicated your validation accuracy will be 0.65 or higher if you have correctly implemented the model.



## 0. Setup

### 0.1. Libraries and Helper Functions

This notebook requires the TensorFlow dataset and other prerequisites that you must download.  This notebook uses Keras 2 and its functional API.  Do NOT change the version numbers in the pip install commands.

In [1]:
!pip install -q transformers
!pip install -q torchinfo
!pip install -q datasets
!pip install -q evaluate

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/480.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m471.0/480.6 kB[0m [31m17.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/179.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/143.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

Now we are ready to do the imports.

In [3]:
#@title Imports

import numpy as np

import transformers
import evaluate

from datasets import load_dataset
from torchinfo import summary

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

### 0.2 Data Acquisition


We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [4]:
imdb_dataset = load_dataset("imdb")

imdb_train_dataset = imdb_dataset['train'].shuffle()
imdb_dev_dataset = imdb_dataset['test'].shuffle().select(range(5000))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

It is always highly recommended to look at the data. What do the records look like? Are they clean or do they contain a lot of cruft (potential noise)?

In [5]:
imdb_train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [6]:
for i in range(4):
  print(imdb_train_dataset['text'][i])
  print(imdb_train_dataset['label'][i])
  print()

Christina Raines plays a lovely model in New York who seeks out a new apartment and begins to meet strange neighbors and reveal a secret about the building and herself slowly building up to quite a climax by film's end. This film has all kinds of neat plot elements from the Roman Catholic Church vs. the Devil, to the gateway to Hell, to bizarre rituals, to a growing conspiracy, and finally to a host of talented famous actors and actresses flooding the film. We get Ava Gardner, Burgess Meredith, Chris Sarandon, Jerry Orbach, Deborah Raffin, Arthur Kennedy, Jose Ferrer, Slyvia Miles, Beverly DeAngelo, Eli Wallach, Martin Balsam, Christopher Walkin, William Hickey, Tom Berenger, Jeff Goldblum, and who can forget John Carradine as the old priest. Many of these actors ham it up - particularly Burgess Meredith giving a fine comic/demented performance as one of the neighbors with a little bird and a cat. Meredith is memorable as is Balsam and Chris Sarandon. Some of the performers have virtua

In [None]:
imdb_train_dataset.features['label'].names

['neg', 'pos']

For convenience, in this assignment we will define a sequence length and truncate all records at that length. For records that are shorter than our defined sequence length we will add padding characters to insure that our input shapes are consistent across all records.

In [7]:
MAX_SEQUENCE_LENGTH = 100

## 0.3. Data Preparation

We will need to tokenize the text into vocab_ids to pass into a BERT model. To do so, we'll need to use the specific tokenizer that goes with the model we're using. In this notebook, we will try several different BERT-style models. Let's
first write a function that will take the text from our dataset and a tokenizer, and encode the text using that tokenizer. Then we'll apply the function to our dataset for each tokenizer and model.

In [8]:
def preprocess_imdb(data, tokenizer):
    review_text = data['text']

    encoded = tokenizer.batch_encode_plus(
            review_text,
            max_length=MAX_SEQUENCE_LENGTH,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="pt"
        )

    return encoded


## 1. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.  If you turn off GPU access while coding and debugging the setup steps, make sure you change the Notebook settings so you can access a GPU when you're ready to train the models.


### 1.1. Basics

Let us first explore some basics of BERT. We'll start by loading the first pretrained BERT model and tokenizer that we'll use ('bert-base-cased').

To explore just the pre-trained portion of the model, we'll use the AutoModel class (equivalent to BertModel, but works for any architecture including BERT). This class gives us the pre-trained model layers up until the last hidden layer (but not any output layer).

In [10]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert_model = AutoModel.from_pretrained('bert-base-cased')

Let's look at a couple of example sentences:

In [11]:
test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Apply the BERT tokenizer to tokenize them:

In [12]:
tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='pt')

tokenized_input

{'input_ids': tensor([[ 101, 1142, 3085, 1110, 1804, 1113, 3625,  102,    0,    0,    0,    0],
        [ 101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

 **QUESTION:**

 1.a  Why do the attention_masks have 4 and 1 zeros, respectively?  Choose the correct one and enter it in the answers file.

  *  For the first example the last four tokens belong to a different segment. For the second one it is only the last token.

  *  For the first example 4 positions are padded while for the second one it is only one.

In [13]:
# Importing necessary libraries
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "This is the first example.",          # Shorter sentence
    "This is the second, longer example with more tokens."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

print("Attention Masks:")
print(inputs['attention_mask'])

with torch.no_grad():
    bert_output = model(**inputs)

# Display the BERT output
print("BERT Output:")
print(bert_output)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Attention Masks:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
BERT Output:
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.2311, -0.0405, -0.0267,  ..., -0.4295,  0.5376,  0.6988],
         [-0.9423, -0.5856,  0.1295,  ..., -0.7186,  0.7521,  0.1074],
         [-0.5433, -0.7774,  0.6860,  ..., -0.3112,  0.4325,  0.7303],
         ...,
         [-0.3914, -0.5291,  0.1562,  ...,  0.3112,  0.2746,  0.1337],
         [-0.1683, -0.4097,  0.2698,  ...,  0.3408,  0.2543,  0.1353],
         [-0.0380, -0.1506,  0.3074,  ...,  0.1490,  0.1926,  0.2115]],

        [[-0.2780, -0.3716, -0.0334,  ..., -0.2724,  0.4067,  0.7367],
         [-0.5887, -0.5793, -0.1643,  ..., -0.1664,  0.9803,  0.2730],
         [-0.8682, -0.7227,  0.3426,  ..., -0.0116,  0.2747,  0.6195],
         ...,
         [ 0.0627, -0.6276,  0.0561,  ..., -0.0190, -0.1669,  0.9155],
         [ 0.6920,  0.2050, -0.5055,  ...,  0.3521, -0.5047, -

 **QUESTION:**

 1.b How many outputs are there?

 Enter your code below.

In [14]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example input text
input_text = ["This is an example sentence.", "Another short sentence."]

inputs = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")

# Run the model
with torch.no_grad():
    bert_output = model(**inputs)

print(f"Number of outputs: {len(bert_output)}")
print(f"Output types: {[type(out) for out in bert_output]}")

for idx, out in enumerate(bert_output):
    print(f"Output {idx} shape: {out.shape if hasattr(out, 'shape') else 'No shape'}")

Number of outputs: 2
Output types: [<class 'str'>, <class 'str'>]
Output 0 shape: No shape
Output 1 shape: No shape


**QUESTION:**

 1.b In the tokenized input, which input_id number (i.e. the vocabulary id) corresponds to 'bank' in the two sentences? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )


**QUESTION:**

 1.c In the array of tokens, which position index number corresponds to 'bank' in the first sentence? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )

In [15]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example sentences
sentence_1 = "I deposited money at the bank."
sentence_2 = "The river bank was flooded."

inputs = tokenizer(sentence_1, return_tensors='pt')
tokens = tokenizer.tokenize(sentence_1)

index_bank = tokens.index('bank')

# Get the BERT output
with torch.no_grad():
    outputs = model(**inputs)

embedding_bank = outputs.last_hidden_state[0, index_bank + 1]

print(f"Embedding vector for 'bank':\n{embedding_bank}")
print(f"Shape of the embedding vector: {embedding_bank.shape}")


Embedding vector for 'bank':
tensor([ 6.7929e-01, -4.0712e-01, -6.8826e-02,  7.8264e-02,  8.4081e-01,
         1.4633e-01, -5.2246e-01,  9.5693e-01,  8.5274e-02, -2.4056e-01,
         5.1672e-01, -2.9936e-01, -1.2103e-01, -9.4254e-02, -3.6251e-01,
        -2.4164e-01,  2.7962e-01,  2.4538e-01,  1.3607e+00,  3.2178e-01,
        -6.0877e-01, -1.1385e-03,  2.9520e-01,  1.2204e-02,  1.7100e-04,
         7.0514e-01,  6.4345e-01,  5.6057e-01, -4.8267e-01, -4.6296e-01,
         7.9351e-01,  8.7541e-01,  3.0497e-01,  3.6829e-01,  5.6776e-02,
        -3.1187e-02,  9.9410e-02,  1.6303e-01, -1.2264e+00, -3.7743e-01,
        -3.1858e-01, -6.9192e-01, -3.9988e-01,  1.9505e-01, -3.3146e-01,
        -5.2360e-01,  2.4281e-01,  3.4915e-01, -6.0531e-01, -7.5461e-01,
        -3.3768e-01,  7.1266e-01,  4.1795e-01, -5.1081e-01,  9.1027e-02,
         4.2913e-01, -9.3690e-01, -6.5074e-01, -6.6408e-01, -3.3114e-02,
         4.6602e-01,  1.2272e-01,  8.6524e-01, -1.0484e+00,  3.3302e-01,
        -2.2005e-01,  

**QUESTION:**

1.d Which array position index number corresponds to 'bank' in the second sentence?

In [16]:
from transformers import AutoTokenizer

# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sentences
sentence_1 = "I deposited money at the bank."
sentence_2 = "The river bank was flooded."

tokens = tokenizer.tokenize(sentence_2)
print(f"Tokens in the second sentence: {tokens}")

bank_index = tokens.index('bank')
print(f"The index position of 'bank' in the second sentence is: {bank_index}")

Tokens in the second sentence: ['the', 'river', 'bank', 'was', 'flooded', '.']
The index position of 'bank' in the second sentence is: 2


**QUESTION:**

 1.g What is the cosine similarity between the BERT embeddings for the two occurences of 'bank' in the two sentences?

In [17]:
from transformers import AutoModel
import torch
from torch.nn.functional import cosine_similarity

# Load BERT model
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer([sentence_1, sentence_2], return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

tokens_1 = tokenizer.tokenize(sentence_1)
tokens_2 = tokenizer.tokenize(sentence_2)

index_bank_1 = tokens_1.index('bank')
index_bank_2 = tokens_2.index('bank')

embedding_bank_1 = outputs.last_hidden_state[0, index_bank_1 + 1]  # +1 for [CLS] token
embedding_bank_2 = outputs.last_hidden_state[1, index_bank_2 + 1]

similarity = cosine_similarity(embedding_bank_1.unsqueeze(0), embedding_bank_2.unsqueeze(0))
print(f"Cosine similarity between 'bank' in both sentences: {similarity.item()}")

Cosine similarity between 'bank' in both sentences: 0.4667614698410034


**QUESTION:**

1.h How does this relate to the cosine similarity of 'this' (in sentence 1) and the first 'the' (in sentence 2). Compute their cosine similarity.


In [18]:
from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn.functional import cosine_similarity

# Load BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Define sentences with 'this' and 'the'
sentence_1 = "This is an example sentence."
sentence_2 = "The river bank was flooded."

inputs = tokenizer([sentence_1, sentence_2], return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

tokens_1 = tokenizer.tokenize(sentence_1)
tokens_2 = tokenizer.tokenize(sentence_2)

index_this = tokens_1.index('this')
index_the = tokens_2.index('the')

embedding_this = outputs.last_hidden_state[0, index_this + 1]  # +1 for [CLS] token
embedding_the = outputs.last_hidden_state[1, index_the + 1]    # +1 for [CLS] token

# Calculate cosine similarity between 'this' and 'the'
similarity = cosine_similarity(embedding_this.unsqueeze(0), embedding_the.unsqueeze(0))

print(f"Cosine similarity between 'this' in sentence 1 and 'the' in sentence 2: {similarity.item()}")

Cosine similarity between 'this' in sentence 1 and 'the' in sentence 2: 0.3299539089202881


### 1.2 Testing Different Pre-Trained BERT Models

In the live session we discussed classification with the BERT base cased model, using the Huggingface class BertForSequenceClassification, which comes with a new output layer for our task that we need to train on our dataset.

We're going to try different pre-trained models now. Like in the lesson 4 notebook, we'll want to fine-tune each model on our IMDB reviews dataset and compare them with a metric like the validation accuracy. We'll use the model class AutoModelForSequenceClassification, which is equivalent to BertForSequenceClassification, but works for other similar architectures too.

Let's write the code we'll need as a function that takes the model and tokenizer as arguments, along with the raw train and dev data. The function will need to tokenize the inputs using the provided tokenizer, so that we can repeat the same code for different pre-trained models. Then the function should create the training args and trainer class, and call trainer.train().

The other hyperparameters you'll need are provided in the function definition, including batch_size and num_epochs. You should use the default values provided for those. Use the function provided below for compute_metrics.

For now, keep all layers of the pre-trained models you load unfrozen.

In [19]:
metric = evaluate.load('accuracy')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [20]:
from transformers import TrainingArguments, Trainer

def fine_tune_classification_model(classification_model,
                                   tokenizer,
                                   train_data,
                                   dev_data,
                                   batch_size=16,
                                   num_epochs=2):
    """
    Preprocess the data using the given tokenizer.
    Create the training arguments and trainer for the given model and data.
    Then train it.
    """

    # Preprocess the train and dev data
    preprocessed_train_data = train_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})
    preprocessed_dev_data = dev_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})

    # Define training arguments with aligned eval and save strategies
    training_args = TrainingArguments(
        output_dir='./results',                   # Directory to save model outputs
        eval_strategy="epoch",                    # Evaluation at the end of each epoch
        save_strategy="epoch",                    # Save model at the end of each epoch
        per_device_train_batch_size=batch_size,   # Batch size for training
        per_device_eval_batch_size=batch_size,    # Batch size for evaluation
        num_train_epochs=num_epochs,              # Number of training epochs
        save_total_limit=1,                       # Limit the total amount of saved checkpoints
        logging_dir='./logs',                     # Directory for storing logs
        logging_steps=10,                         # Log every 10 steps
        load_best_model_at_end=True,              # Load the best model when finished training
        metric_for_best_model='accuracy'          # Metric to decide the best model
    )

    # Initialize the Trainer
    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()


Let's try BERT-base-case first, the same model that was used in the lesson 4 notebook.

In [22]:
import os
os.environ["WANDB_DISABLED"] = "true"


In [23]:
"""
Show the output from training BERT-base-cased on the IMDB movie reviews dataset.
"""

model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

fine_tune_classification_model(bert_classification_model, bert_tokenizer, imdb_train_dataset, imdb_dev_dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3437,0.317792,0.8594
2,0.1655,0.398151,0.8662


Often, one of the first choices you have is what pre-trained model you'll want to use. There are quite a few options, especially because other researchers and practitioners fine-tune their own versions of existing models and sometimes make theirs available for others to continue building on.

You can search through models available on [Huggingface at this website](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending). Some models were made by Huggingface or other large companies/organizations; other models may have been uploaded by individual users. Notice the search tags on the left, we've already clicked the tag for "Text Classification" in the link above. You should see various versions of BERT-style models.

For our IMDB classification, we might want to try a model that has been trained on another dataset related to sentiment or emotions. We also want to find models that have a complete model card with documentation about the model architecture and how it was trained, and potentially a link to an associated research paper, and/or a good number of downloads and likes.

Take a look at this model: [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment). It's a RoBERTa model (similar to BERT with slightly different pre-training, often popular for classification tasks), that has already been fine-tuned on the TweetEval benchmark set of tasks for sentiment analysis.

The model card indicates that there is an updated version of this model now available. Follow the link to the latest version of the model, and look at that latest model's card to answer the following questions. Then load the latest model to train on our task.

**QUESTION:**

 2.a What is the model checkpoint name for the latest version of this Twitter Robert-base sentiment analysis model? (Copy and paste the model checkpoint name into the answers file. It should be the full name that you put inside the quotes to load the file below.)

 **QUESTION:**

 2.b Approximately how many tweets was the latest model trained on? (Put the answer in the answers file. You can use the abbreviation for millions like in the model card, e.g. a number like 12M or 85M.)

 **QUESTION:**

 2.c What is the title of the published reference paper for this model? (Copy the full title of the paper and paste it into the answers file.)

In [24]:
"""
Show the output from training the Twitter RoBERTa sentiment model on the IMDB movie reviews dataset.
Insert the model checkpoint name for the latest version of that model below.
"""

### YOUR CODE HERE

# Latest Twitter RoBERTa sentiment model checkpoint name
model_checkpoint_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

### END YOUR CODE

# Load tokenizer and model from the checkpoint
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

# Fine-tune the model on the IMDb dataset
fine_tune_classification_model(bert_classification_model, bert_tokenizer, imdb_train_dataset, imdb_dev_dataset)


config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.331,0.271655,0.8852
2,0.0805,0.323363,0.9014


**QUESTION:**

2.d What is the final validation accuracy that you observed for the Twitter RoBERTa sentiment-trained model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.567 or 0.876. Use up to 5 significant digits, though fewer is fine if the output shown in the notebook only has 3 or 4. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

2.e Did the Twitter RoBERTa sentiment-trained model do better or worse than BERT-base? Why do you think that happened? (Put your brief answer in the cell below.)

*** ANSWER FOR 2.e HERE ***


*** END ANSWER ***

### 1.3 Unfreezing Different Pre-Trained Layers

In the lesson 4 notebook, we tested freezing most or all of the pre-trained BERT model layers. We used the .named_parameters() method, looking at the specific names of each set of model parameters.

As in the lesson notebook, we will always want to make sure we keep the classification layer parameters unfrozen, since those need to be trained for our specific task. We will also keep the pooler layer unfrozen, since it's next closest to the classification layer and was only pre-trained in standard BERT models with the next sentence prediction task.

For the remaining layers, what happens if we unfreeze lower transformer blocks and keep higher transformer blocks frozen (the opposite of what we did in the lesson notebook)? What if we instead try unfreezing specific types of layers within each transformer block, e.g. all of the self attention layers, or all of the dense layers?

Let's modify our fine-tuning function, to add an argument for the layers that we want to train. We'll make that argument a list of strings, and we'll set the default to just unfreeze the classification layer. You'll need to write the code to compare those strings to the names of the model parameters (after loading the specified model) and freeze all parameters that don't match (as in the lesson 4 notebook).

In [25]:
# Refresh your memory on what the parameter names look like
for name, param in bert_classification_model.named_parameters():
    print(name)

roberta.embeddings.word_embeddings.weight
roberta.embeddings.position_embeddings.weight
roberta.embeddings.token_type_embeddings.weight
roberta.embeddings.LayerNorm.weight
roberta.embeddings.LayerNorm.bias
roberta.encoder.layer.0.attention.self.query.weight
roberta.encoder.layer.0.attention.self.query.bias
roberta.encoder.layer.0.attention.self.key.weight
roberta.encoder.layer.0.attention.self.key.bias
roberta.encoder.layer.0.attention.self.value.weight
roberta.encoder.layer.0.attention.self.value.bias
roberta.encoder.layer.0.attention.output.dense.weight
roberta.encoder.layer.0.attention.output.dense.bias
roberta.encoder.layer.0.attention.output.LayerNorm.weight
roberta.encoder.layer.0.attention.output.LayerNorm.bias
roberta.encoder.layer.0.intermediate.dense.weight
roberta.encoder.layer.0.intermediate.dense.bias
roberta.encoder.layer.0.output.dense.weight
roberta.encoder.layer.0.output.dense.bias
roberta.encoder.layer.0.output.LayerNorm.weight
roberta.encoder.layer.0.output.LayerNorm

In [32]:
from transformers import TrainingArguments, Trainer

def fine_tune_classif_model_freeze_layers(classification_model,
                                          tokenizer,
                                          train_data,
                                          dev_data,
                                          layers_to_train=["classifier."],
                                          max_sequence_length=512,
                                          batch_size=16,
                                          num_epochs=2):
    """
    Freeze any parameters inside the given model that do NOT have names
    containing one of the strings in the `layers_to_train` list.
    Then specify the training arguments and trainer for the given model and data.
    Finally, train the model and return the Trainer object for evaluation.
    """

    # Preprocess the data
    preprocessed_train_data = train_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})
    preprocessed_dev_data = dev_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})

    # Freeze layers that are NOT in layers_to_train
    for name, param in classification_model.named_parameters():
        if not any(layer in name for layer in layers_to_train):
            param.requires_grad = False

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        eval_strategy="epoch",
        save_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        save_total_limit=1,
        logging_dir='./logs',
        logging_steps=10,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy'
    )

    # Initialize the Trainer
    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        compute_metrics=compute_metrics  # Ensure you have compute_metrics defined
    )

    # Train the model
    trainer.train()

    # Return the trainer for evaluation
    return trainer

We'll go back to using bert-base-cased for this part. First, try freezing the parameters in transformer layers 1-11 (including all parameters with "layer.#" in the name). That means you're leaving unfrozen the initial embedding layers, the first transformer layer (numbered 0), and the classification layer.

Unfreezing the bottom transformer layer(s) rather than the top one(s) is uncommon, but it's always good to try to understand why. Since we're learning, we'll try doing it this way and see what happens. We've given you the code for this exercise, so that the way to specify layers_to_freeze is clear.

In [33]:
"""
Show the output from training a BERT-base-cased classification model, when unfreezing
only the parameters in the embedding layers, first transformer layer (layer 0), and classifier layer.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

layers_to_train = ["embeddings.", "layer.0.", "classifier."]

fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4679,0.428942,0.8042
2,0.4176,0.436707,0.8094


<transformers.trainer.Trainer at 0x7aa13a628b50>

 **QUESTION:**

3.a What is the final validation accuracy that you observed for this version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)




Now try two more versions, this time choosing which layers to train yourself. Instead of focusing on the number of the transformer block (layer.#), focus on the type of layer within each block (the stuff that comes after layer.# in the name).

Keep the pooler and classification layers unfrozen in all model versions. Your options to also train include the initial embedding layers and the different components within the transformer blocks (e.g. self attention matrices, dense layers, layer norms).

Try to find one combination that does better than the version you just ran above (higher validation accuracy after 2 epochs), without much more overfitting (training_loss / eval_loss > 0.7). Also try to find one version that overfits a lot more after 2 epochs (training_loss / eval_loss < 0.5).

In [30]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to perform better than the one in question 3.a,
without overfitting much more.
"""

# Load the BERT base model
model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

# First Experiment: Train embeddings, layer 0, and classifier
layers_to_train_1 = ["embeddings.", "layer.0.", "classifier."]

fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train_1
)

# Second Experiment: Train attention layers and classifier to see if it improves
layers_to_train_2 = ["attention.", "classifier."]

fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train_2
)

# Third Experiment: Overfitting scenario - Train all layers
layers_to_train_3 = [""]`

fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train_3
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4679,0.428942,0.8042
2,0.4176,0.436707,0.8094


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2707,0.510435,0.8054
2,0.3724,0.497324,0.8082


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2013,0.586349,0.8002
2,0.3412,0.541489,0.8088


 **QUESTION:**

3.b What is the final training loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.c What is the final validation loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.d What is the final validation accuracy that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

In [34]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to overfit.
"""

# Load the BERT base model
model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

# 1. Best Performing Configuration: Freeze most layers, train classifier and layer 0
layers_to_train_best = ["classifier.", "layer.0."]

# Fine-tune the model for best performance without overfitting
best_trainer = fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train_best
)

# Log metrics for the best-performing model
best_metrics = best_trainer.evaluate()
print("Best Model - Training and Validation Metrics:")
print(best_metrics)


# 2. Overfitting Configuration: Train all layers
layers_to_train_overfit = [""]  # Empty string matches all layers

# Fine-tune the model for overfitting scenario
overfit_trainer = fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train=layers_to_train_overfit
)

# Log metrics for the overfitting model
overfit_metrics = overfit_trainer.evaluate()
print("Overfitting Model - Training and Validation Metrics:")
print(overfit_metrics)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4491,0.433626,0.8016
2,0.4374,0.422503,0.8074


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Best Model - Training and Validation Metrics:
{'eval_loss': 0.422502726316452, 'eval_accuracy': 0.8074, 'eval_runtime': 27.8289, 'eval_samples_per_second': 179.669, 'eval_steps_per_second': 11.247, 'epoch': 2.0}


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3571,0.418023,0.8108
2,0.398,0.419103,0.8128


Overfitting Model - Training and Validation Metrics:
{'eval_loss': 0.4191034734249115, 'eval_accuracy': 0.8128, 'eval_runtime': 27.5219, 'eval_samples_per_second': 181.673, 'eval_steps_per_second': 11.373, 'epoch': 2.0}


 **QUESTION:**

3.e What is the final training loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.f What is the final validation loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.g What is the final validation accuracy that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

## Congratulations... You are done!