# Assignment 2: Text Classification with BERT

**Description:** This assignment notebook builds on the material from the
[lesson 4 notebook](https://github.com/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_4_BERT.ipynb), in which we fine-tuned a BERT model for the IMDB movie reviews sentiment classification task. In that notebook, we used the bert-base-cased model and applied traditional fine-tuning, with a brief class exercise at the end to try unfreezing different numbers of layers. In this assignment, we'll start with that exercise, and ask you to explore unfreezing more specific layers yourself. Then you'll search for and try different pre-trained BERT-style models.

This notebook should be run on a Google Colab leveraging a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Please note that you the GPU is reuqired for Section 3 but not for Sections 1 and 2.
Since colab is providing free access to a GPU they place constraints on that access.  Therefore you might want to turn off the GPU access (Edit -> Notebook Settings) until you get to section 3.  Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1h with the majority of that time being in Section 3. If Colab tells you that you have reached your GPU limit, wait up to 24 hours and you should be able to access a GPU again.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/assignment/a2/Text_classification_BERT.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries and Helper Functions

  0.2 Data Acquisition

  0.3. Data Preparation


1. Classification with BERT

  1.1. BERT Basics

  1.2 CLS-Token-based Classification

  1.3 Averaging of BERT Outputs

  1.4. Adding a CNN on top of BERT



**INSTRUCTIONS:**:

* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1.  Please do **not** remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.  We cannot award points if the output cells are empty.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* If you want to, you can run all of the cells in section 0 in bulk. This is setup work and no questions are in there. At the end of section 0 we will state all of the relevant variables that were defined and created in section 1.

* Finally, unless otherwise indicated your validation accuracy will be 0.65 or higher if you have correctly implemented the model.



## 0. Setup

### 0.1. Libraries and Helper Functions

This notebook requires the Hugging Face datasets and other prerequisites that you must download.  

In [1]:
!pip install -q transformers
!pip install -q torchinfo
!pip install -U -q datasets fsspec huggingface_hub # Hugging Face's dataset library
!pip install -q evaluate

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/503.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━[0m [32m389.1/503.6 kB[0m [31m12.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m503.6/503.6 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.3/199.3 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m563.3/563.3 kB[0m [31m29.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x8

Now we are ready to do the imports.

In [2]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.1/40.1 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.56.2-py3-none-any.whl (11.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.6/11.6 MB[0m [31m44.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.1
    Uninstalling transformers-4.56.1:
      Successfully uninstalled transformers-4.56.1
Successfully installed transformers-4.56.2


In [3]:
#@title Imports

import numpy as np

import transformers
import evaluate
import torch

from datasets import load_dataset
from torchinfo import summary

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

### 0.2 Data Acquisition


We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [4]:
imdb_dataset = load_dataset("imdb")

imdb_train_dataset = imdb_dataset['train'].shuffle()
imdb_dev_dataset = imdb_dataset['test'].shuffle().select(range(5000))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

plain_text/unsupervised-00000-of-00001.p(…):   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

It is always highly recommended to look at the data. What do the records look like? Are they clean or do they contain a lot of cruft (potential noise)?

In [5]:
imdb_train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [6]:
for i in range(4):
  print(imdb_train_dataset['text'][i])
  print(imdb_train_dataset['label'][i])
  print()

This sci-fi masterpiece has too many flaws after the editors had butchered it after its opening in 1936. Visually it is a wonder to behold, but the script allows too many intellectual speeches about war and progress.This gets very corny when the actors are given to recite a lot of high minded messages at all times.Raymond Massey and Cedric Hardwicke,both great actors,come off as quite a pair of fanatics. Ralph Richardson is very good as the "The Boss" a megalomaniac warlord. The prediction of World War II was very eerie considering that the world was on the brink of the most devastating conflict in human history at the time. I'm sure glad that war didn't turn out as it did in the movie. There are some visually stunning montage sequences bridging the leaps of time between the movie's different episodes. Although its not as entertaining as I hoped it would be,this movie sticks in your mind long after you've seen it.
1

I'm glad that users (as of this date) who liked this movie are now co

In [7]:
imdb_train_dataset.features['label'].names

['neg', 'pos']

For convenience, in this assignment we will define a sequence length and truncate all records at that length. For records that are shorter than our defined sequence length we will add padding characters to insure that our input shapes are consistent across all records.

In [8]:
MAX_SEQUENCE_LENGTH = 100

## 0.3. Data Preparation

We will need to tokenize the text into vocab_ids to pass into a BERT model. To do so, we'll need to use the specific tokenizer that goes with the model we're using. In this notebook, we will try several different BERT-style models. Let's
first write a function that will take the text from our dataset and a tokenizer, and encode the text using that tokenizer. Then we'll apply the function to our dataset for each tokenizer and model.

In [9]:
def preprocess_imdb(data, tokenizer):
    review_text = data['text']

    encoded = tokenizer.batch_encode_plus(
            review_text,
            max_length=MAX_SEQUENCE_LENGTH,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="pt"
        )

    return encoded


## 1. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.  If you turn off GPU access while coding and debugging the setup steps, make sure you change the Notebook settings so you can access a GPU when you're ready to train the models.


### 1.1. Basics

Let us first explore some basics of BERT. We'll start by loading the first pretrained BERT model and tokenizer that we'll use ('bert-base-cased').

To explore just the pre-trained portion of the model, we'll use the AutoModel class (equivalent to BertModel, but works for any architecture including BERT). This class gives us the pre-trained model layers up until the last hidden layer (but not any output layer).

In [10]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert_model = AutoModel.from_pretrained('bert-base-cased')

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Let's look at a couple of example sentences:

In [11]:
test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Apply the BERT tokenizer to tokenize them:

In [12]:
tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='pt')

tokenized_input

{'input_ids': tensor([[ 101, 1142, 3085, 1110, 1804, 1113, 3625,  102,    0,    0,    0,    0],
        [ 101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

 **QUESTION:**

 1.a  Why do the attention_masks have 4 and 1 zeros, respectively?  Choose the correct one and enter it in the answers file.

  *  For the first example the last four tokens belong to a different segment. For the second one it is only the last token.

  *  For the first example 4 positions are padded while for the second one it is only one.

In [13]:
### YOUR CODE HERE

bert_output = bert_model(**tokenized_input)


### END YOUR CODE

 **QUESTION:**

 1.b How many outputs are there?

 Enter your code below.

In [14]:
### YOUR CODE HERE

print("Ouputs:", bert_output.keys())
print("Count:", len(bert_output.keys()))

### END YOUR CODE

Ouputs: odict_keys(['last_hidden_state', 'pooler_output'])
Count: 2


**QUESTION:**

1.c Which output do we need to use to get token-level embeddings?

the first

the second

Put your answer in the answers file.



**QUESTION:**

 1.d In the tokenized input, which input_id number (i.e. the vocabulary id) corresponds to 'bank' in the two sentences? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )


**QUESTION:**

 1.e In the array of tokens, which position index number corresponds to 'bank' in the first sentence? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )

In [15]:
### YOUR CODE HERE

# Tokenize each sentence to see the subword tokens
tokens_1 = bert_tokenizer.tokenize(test_input[0])

print("Tokens for sentence 1:", tokens_1)

# Convert 'bank' to input_id
bank_id = bert_tokenizer.convert_tokens_to_ids('bank')
print("Vocabulary ID for 'bank':", bank_id)

tokens_with_cls = ['[CLS]'] + tokens_1 + ['[SEP]']
print(tokens_with_cls)

### END YOUR CODE

Tokens for sentence 1: ['this', 'bank', 'is', 'closed', 'on', 'Sunday']
Vocabulary ID for 'bank': 3085
['[CLS]', 'this', 'bank', 'is', 'closed', 'on', 'Sunday', '[SEP]']


**QUESTION:**

1.f Which array position index number corresponds to 'bank' in the second sentence?

In [16]:
### YOUR CODE HERE

#f. -> Look at tokenization for the second example
tokens_2 = bert_tokenizer.tokenize(test_input[1])

print("Tokens for sentence 2:", tokens_2)

# Include CLS token manually
tokens_with_cls_2 = ['[CLS]'] + tokens_2 + ['[SEP]']
print(tokens_with_cls_2)

### END YOUR CODE

Tokens for sentence 2: ['the', 'steep', '##est', 'bank', 'of', 'the', 'river', 'is', 'dangerous']
['[CLS]', 'the', 'steep', '##est', 'bank', 'of', 'the', 'river', 'is', 'dangerous', '[SEP]']


**QUESTION:**

 1.g What is the cosine similarity between the BERT embeddings for the two occurences of 'bank' in the two sentences?

In [17]:
### YOUR CODE HERE

last_hidden_state = bert_output.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)

bank_pos_1 = tokens_with_cls.index('bank')
bank_pos_2 = tokens_with_cls_2.index('bank')

# Extract embeddings
bank_emb_1 = last_hidden_state[0, bank_pos_1, :]
bank_emb_2 = last_hidden_state[1, bank_pos_2, :]

# Cosine similarity manually
dot_product = torch.dot(bank_emb_1, bank_emb_2)
norm_1 = torch.norm(bank_emb_1)
norm_2 = torch.norm(bank_emb_2)
cos_sim = dot_product / (norm_1 * norm_2)

print("Cosine similarity between 'bank' embeddings:", cos_sim.item())
### END YOUR CODE

Cosine similarity between 'bank' embeddings: 0.7478304505348206


**QUESTION:**

1.h How does this relate to the cosine similarity of 'this' (in sentence 1) and the first 'the' (in sentence 2). Compute their cosine similarity.


In [18]:
### YOUR CODE HERE

#h.  -> get the vectors and calculate cosine similarity
this_pos = tokens_with_cls.index('this')
the_pos = tokens_with_cls_2.index('the')

# Embeddings
this_emb = last_hidden_state[0, this_pos, :]
the_emb = last_hidden_state[1, the_pos, :]

# Cosine similarity manually
dot_product_1 = torch.dot(this_emb, the_emb)
norm_2 = torch.norm(this_emb)
norm_3 = torch.norm(the_emb)
cos_sim_1 = dot_product_1 / (norm_2 * norm_3)

print("Cosine similarity between 'this' and 'the':", cos_sim_1.item())

### END YOUR CODE

Cosine similarity between 'this' and 'the': 0.8110270500183105


### 2. Testing Different Pre-Trained BERT Models

In the live session we discussed classification with the `bert-base-cased` model, using the Huggingface class BertForSequenceClassification, which comes with a new output layer for our task that we need to train on our dataset.

We're going to try different pre-trained models now. Like in the lesson 4 notebook, we'll want to fine-tune each model on our IMDB reviews dataset and compare them with a metric like the validation accuracy. We'll use the model class AutoModelForSequenceClassification, which is equivalent to BertForSequenceClassification, but works for other similar architectures too.

Let's write the code we'll need as a function that takes the model and tokenizer as arguments, along with the raw train and dev data. The function will need to tokenize the inputs using the provided tokenizer, so that we can repeat the same code for different pre-trained models. Then the function should create the training args and trainer class, and call trainer.train().

The other hyperparameters you'll need are provided in the function definition, including batch_size and num_epochs. You should use the default values provided for those. Use the function provided below for compute_metrics.

For now, keep all layers of the pre-trained models you load unfrozen.

In [19]:
metric = evaluate.load('accuracy')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script: 0.00B [00:00, ?B/s]

In [20]:
def fine_tune_classification_model(classification_model,
                                   tokenizer,
                                   train_data,
                                   dev_data,
                                   batch_size = 16,
                                   num_epochs = 2):
    """
    Preprocess the data using the given tokenizer.
    Create training arguments and Trainer for the given model and data.
    Then train it.
    """

    # Tokenize datasets
    preprocessed_train_data = train_data.map(
        preprocess_imdb,
        batched=True,
        fn_kwargs={'tokenizer': tokenizer}
    )
    preprocessed_dev_data = dev_data.map(
        preprocess_imdb,
        batched=True,
        fn_kwargs={'tokenizer': tokenizer}
    )

    # Training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_strategy="steps",
        logging_steps=50,
        metric_for_best_model="accuracy",
        save_total_limit=2,
        report_to=[]
    )

    # Trainer
    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    return trainer

Let's try BERT-base-case first, the same model that was used in the lesson 4 notebook.

In [21]:
"""
Show the output from training BERT-base-cased on the IMDB movie reviews dataset.
"""

model_checkpoint_name = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

fine_tune_classification_model(bert_classification_model, bert_tokenizer, imdb_train_dataset, imdb_dev_dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

  trainer = Trainer(


Step,Training Loss
50,0.6889
100,0.515
150,0.4582
200,0.4102
250,0.4219
300,0.3852
350,0.3995
400,0.4068
450,0.4106
500,0.3512


<transformers.trainer.Trainer at 0x79ad04b728d0>

Often, one of the first choices you have is what pre-trained model you'll want to use. There are quite a few options, especially because other researchers and practitioners fine-tune their own versions of existing models and sometimes make theirs available for others to continue building on.

You can search through models available on [Huggingface at this website](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending). Some models were made by Huggingface or other large companies/organizations; other models may have been uploaded by individual users. Notice the search tags on the left, we've already clicked the tag for "Text Classification" in the link above. You should see various versions of BERT-style models.

For our IMDB classification, we might want to try a model that has been trained on another dataset related to sentiment or emotions. We also want to find models that have a complete model card with documentation about the model architecture and how it was trained, and potentially a link to an associated research paper, and/or a good number of downloads and likes.

Take a look at this model: [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment). It's a RoBERTa model (similar to BERT with slightly different pre-training, often popular for classification tasks), that has already been fine-tuned on the TweetEval benchmark set of tasks for sentiment analysis.

The model card indicates that there is an updated version of this model now available. Follow the link to the latest version of the model, and look at that most recent model's card to answer the following questions. Then load that most recent model to train on our task.

**QUESTION:**

 2.a What is the model checkpoint name for the most recent version of this Twitter Roberta-base sentiment analysis model? (Copy and paste the model checkpoint name into the answers file. It should be the full name that you put inside the quotes to load the file below.)

 **QUESTION:**

 2.b Approximately how many tweets was this latest model trained on? (Put the answer in the answers file. You can use the abbreviation for millions like in the model card, e.g. a number like 12M or 85M.)

 **QUESTION:**

 2.c What is the title of the published reference paper for this most recent model? (Copy the full title of the paper and paste it into the answers file.)

In [22]:
"""
Show the output from training the most recent Twitter RoBERTa sentiment model on the IMDB movie reviews dataset.
Insert the model checkpoint name for the latest version of that model below.
"""

### YOUR CODE HERE

model_checkpoint_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"


### END YOUR CODE


bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

fine_tune_classification_model(bert_classification_model, bert_tokenizer, imdb_train_dataset, imdb_dev_dataset)

config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

  trainer = Trainer(


Step,Training Loss
50,0.4971
100,0.4457
150,0.4235
200,0.3799
250,0.3875
300,0.3497
350,0.3211
400,0.3634
450,0.3804
500,0.2807


<transformers.trainer.Trainer at 0x79ad078adb20>

**QUESTION:**

2.d What is the final validation accuracy that you observed for the Twitter RoBERTa sentiment-trained model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.567 or 0.876. Use up to 5 significant digits, though fewer is fine if the output shown in the notebook only has 3 or 4. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

2.e Did the Twitter RoBERTa sentiment-trained model do better or worse or the same as the BERT-base?


**(Answer 2.f below but do NOT enter your sentences in the answers file)**

**QUESTION:**

2.f Why do you think that happened? (Put your two to three sentence answer in the cell below.)

Please answer 2.f in two to three sentences right here:

** BEGIN Q 2.f ANSWER HERE **




** END Q 2.f ANSWER HERE. **


### 3. Unfreezing Different Pre-Trained Layers

In the lesson 4 notebook, we tested freezing most or all of the pre-trained BERT model layers. We used the .named_parameters() method, looking at the specific names of each set of model parameters.

As in the lesson notebook, we will always want to make sure we keep the classification layer parameters unfrozen, since those need to be trained for our specific task. We will also keep the pooler layer unfrozen, since it's next closest to the classification layer and was only pre-trained in standard BERT models with the next sentence prediction task.

For the remaining layers, what happens if we unfreeze lower transformer blocks and keep higher transformer blocks frozen (the opposite of what we did in the lesson notebook)? What if we instead try unfreezing specific types of layers within each transformer block, e.g. all of the self attention layers, or all of the dense layers?

Let's modify our fine-tuning function, to add an argument for the layers that we want to train. We'll make that argument a list of strings, and we'll set the default to just unfreeze the classification layer. You'll need to write the code to compare those strings to the names of the model parameters (after loading the specified model) and freeze all parameters that don't match (as in the lesson 4 notebook).

In [23]:
# Refresh your memory on what the parameter names look like
for name, param in bert_classification_model.named_parameters():
    print(name)

roberta.embeddings.word_embeddings.weight
roberta.embeddings.position_embeddings.weight
roberta.embeddings.token_type_embeddings.weight
roberta.embeddings.LayerNorm.weight
roberta.embeddings.LayerNorm.bias
roberta.encoder.layer.0.attention.self.query.weight
roberta.encoder.layer.0.attention.self.query.bias
roberta.encoder.layer.0.attention.self.key.weight
roberta.encoder.layer.0.attention.self.key.bias
roberta.encoder.layer.0.attention.self.value.weight
roberta.encoder.layer.0.attention.self.value.bias
roberta.encoder.layer.0.attention.output.dense.weight
roberta.encoder.layer.0.attention.output.dense.bias
roberta.encoder.layer.0.attention.output.LayerNorm.weight
roberta.encoder.layer.0.attention.output.LayerNorm.bias
roberta.encoder.layer.0.intermediate.dense.weight
roberta.encoder.layer.0.intermediate.dense.bias
roberta.encoder.layer.0.output.dense.weight
roberta.encoder.layer.0.output.dense.bias
roberta.encoder.layer.0.output.LayerNorm.weight
roberta.encoder.layer.0.output.LayerNorm

In [27]:
def fine_tune_classif_model_freeze_layers(classification_model,
                                          tokenizer,
                                          train_data,
                                          dev_data,
                                          layers_to_train = ["classifier."],
                                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                                          batch_size = 16,
                                          num_epochs = 2):
    """
    Freeze any parameters inside the given model that have a name containing one of the
    strings in the "layers_to_freeze" list.
    Then specify the training arguments and trainer for the given model and data.
    Then train it.
    """

    preprocessed_train_data = train_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})
    preprocessed_dev_data = dev_data.map(preprocess_imdb, batched=True, fn_kwargs={'tokenizer': tokenizer})

    ### YOUR CODE HERE
    # Freeze parameters not in layers_to_train
    for name, param in classification_model.named_parameters():
        if not any([layer_name in name for layer_name in layers_to_train]):
            param.requires_grad = False
        else:
            param.requires_grad = True

        print(name, param.requires_grad)

    # Base args shared across versions
    base_args = dict(
        output_dir="./results",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        learning_rate=2e-5,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_strategy="steps",
        logging_steps=50,
        metric_for_best_model="accuracy",
        save_total_limit=2,
        report_to=[]  # disable W&B / other reporting
    )

    # version doesn't accept it, fall back to eval_steps + save_strategy="steps".
    try:
        training_args = TrainingArguments(
            **base_args,
            evaluation_strategy="epoch",   # run evaluation every epoch
            save_strategy="epoch",         # make save strategy match eval strategy
            load_best_model_at_end=False   # set False to avoid needing matching strategies elsewhere
        )
    except TypeError:
        # Older transformers may not accept evaluation_strategy; use steps instead.
        # Choose a reasonable eval_steps (tune if needed). We use 200 as you had before.
        print("TrainingArguments rejected evaluation_strategy; falling back to eval_steps strategy.")
        training_args = TrainingArguments(
            **base_args,
            save_strategy="steps",
            eval_steps=200,
        )

    # Trainer (provide eval_dataset so evaluation runs)
    trainer = Trainer(
        model=classification_model,
        args=training_args,
        train_dataset=preprocessed_train_data,
        eval_dataset=preprocessed_dev_data,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    ### END YOUR CODE

    trainer.train()

    return trainer


We'll go back to using bert-base-cased for this part. First, try freezing the parameters in transformer layers 1-11 (including all parameters with "layer.#" in the name). That means you're leaving unfrozen the initial embedding layers, the first transformer layer (numbered 0), and the classification layer.

Unfreezing the bottom transformer layer(s) rather than the top one(s) is uncommon, but it's always good to try to understand why. Since we're learning, we'll try doing it this way and see what happens. We've given you the code for this exercise, so that the way to specify layers_to_freeze is clear.

In [28]:
"""
Show the output from training a BERT-base-cased classification model, when unfreezing
only the parameters in the embedding layers, first transformer layer (layer 0), and classifier layer.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

layers_to_train = ["embeddings.", "layer.0.", "classifier."]

fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


bert.embeddings.word_embeddings.weight True
bert.embeddings.position_embeddings.weight True
bert.embeddings.token_type_embeddings.weight True
bert.embeddings.LayerNorm.weight True
bert.embeddings.LayerNorm.bias True
bert.encoder.layer.0.attention.self.query.weight True
bert.encoder.layer.0.attention.self.query.bias True
bert.encoder.layer.0.attention.self.key.weight True
bert.encoder.layer.0.attention.self.key.bias True
bert.encoder.layer.0.attention.self.value.weight True
bert.encoder.layer.0.attention.self.value.bias True
bert.encoder.layer.0.attention.output.dense.weight True
bert.encoder.layer.0.attention.output.dense.bias True
bert.encoder.layer.0.attention.output.LayerNorm.weight True
bert.encoder.layer.0.attention.output.LayerNorm.bias True
bert.encoder.layer.0.intermediate.dense.weight True
bert.encoder.layer.0.intermediate.dense.bias True
bert.encoder.layer.0.output.dense.weight True
bert.encoder.layer.0.output.dense.bias True
bert.encoder.layer.0.output.LayerNorm.weight True


Step,Training Loss
50,0.6972
100,0.6729
150,0.6513
200,0.6509
250,0.6384
300,0.6011
350,0.5777
400,0.5621
450,0.5404
500,0.5477


<transformers.trainer.Trainer at 0x79ad0976b920>

 **QUESTION:**

3.a What is the final validation accuracy that you observed for this lowest level unfrozen version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)


Now try two more versions, this time choosing which layers to train yourself. Instead of focusing on the number of the transformer block (layer.#), focus on the type of layer within each block (the stuff that comes after layer.# in the name).

Keep the pooler and classification layers unfrozen in all model versions. Your options to also train include the initial embedding layers and the different components within the transformer blocks (e.g. self attention matrices, dense layers, layer norms).

Try to find one combination that does better than the version you just ran above (higher validation accuracy after 2 epochs), without much more overfitting (training_loss / eval_loss > 0.7). Also try to find one version that overfits a lot more after 2 epochs (training_loss / eval_loss < 0.5).

In [29]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to perform better than the one in question 3.a, without overfitting much more.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

### YOUR CODE HERE

# Train pooler, classifier, and final layer's attention + dense
layers_to_train = [
    "classifier.",
    "pooler.",
    "encoder.layer.11.attention.self.",
    "encoder.layer.11.output.dense."
]

### END YOUR CODE


fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


bert.embeddings.word_embeddings.weight False
bert.embeddings.position_embeddings.weight False
bert.embeddings.token_type_embeddings.weight False
bert.embeddings.LayerNorm.weight False
bert.embeddings.LayerNorm.bias False
bert.encoder.layer.0.attention.self.query.weight False
bert.encoder.layer.0.attention.self.query.bias False
bert.encoder.layer.0.attention.self.key.weight False
bert.encoder.layer.0.attention.self.key.bias False
bert.encoder.layer.0.attention.self.value.weight False
bert.encoder.layer.0.attention.self.value.bias False
bert.encoder.layer.0.attention.output.dense.weight False
bert.encoder.layer.0.attention.output.dense.bias False
bert.encoder.layer.0.attention.output.LayerNorm.weight False
bert.encoder.layer.0.attention.output.LayerNorm.bias False
bert.encoder.layer.0.intermediate.dense.weight False
bert.encoder.layer.0.intermediate.dense.bias False
bert.encoder.layer.0.output.dense.weight False
bert.encoder.layer.0.output.dense.bias False
bert.encoder.layer.0.output.Lay

Step,Training Loss
50,0.712
100,0.6826
150,0.6658
200,0.6429
250,0.6126
300,0.5666
350,0.5189
400,0.4816
450,0.4779
500,0.4409


<transformers.trainer.Trainer at 0x79ad09563e60>

 **QUESTION:**

3.b What is the final training loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.c What is the final validation loss that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.d What is the ratio of your final training loss/final validation loss? For this better version the ratio must be greater than 0.7.

3.e What is the final validation accuracy that you observed for this better performing version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

In [30]:
"""
Show the output from training a particular model on the IMDB movie reviews dataset.
Choose layers to train that lead the model to overfit.
"""

model_checkpoint_name = "bert-base-cased"

bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_name)
bert_classification_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_name)

### YOUR CODE HERE

# Train everything: embeddings + all attention + dense + pooler + classifier
layers_to_train = [
    "embeddings.",       # word embeddings
    "encoder.layer.",    # all transformer blocks
    "pooler.",           # pooler layer
    "classifier."        # classification layer
]

### END YOUR CODE


fine_tune_classif_model_freeze_layers(
    bert_classification_model,
    bert_tokenizer,
    imdb_train_dataset,
    imdb_dev_dataset,
    layers_to_train
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


bert.embeddings.word_embeddings.weight True
bert.embeddings.position_embeddings.weight True
bert.embeddings.token_type_embeddings.weight True
bert.embeddings.LayerNorm.weight True
bert.embeddings.LayerNorm.bias True
bert.encoder.layer.0.attention.self.query.weight True
bert.encoder.layer.0.attention.self.query.bias True
bert.encoder.layer.0.attention.self.key.weight True
bert.encoder.layer.0.attention.self.key.bias True
bert.encoder.layer.0.attention.self.value.weight True
bert.encoder.layer.0.attention.self.value.bias True
bert.encoder.layer.0.attention.output.dense.weight True
bert.encoder.layer.0.attention.output.dense.bias True
bert.encoder.layer.0.attention.output.LayerNorm.weight True
bert.encoder.layer.0.attention.output.LayerNorm.bias True
bert.encoder.layer.0.intermediate.dense.weight True
bert.encoder.layer.0.intermediate.dense.bias True
bert.encoder.layer.0.output.dense.weight True
bert.encoder.layer.0.output.dense.bias True
bert.encoder.layer.0.output.LayerNorm.weight True


Step,Training Loss
50,0.6939
100,0.539
150,0.4581
200,0.4135
250,0.4323
300,0.3891
350,0.3963
400,0.4002
450,0.4195
500,0.3495


<transformers.trainer.Trainer at 0x79acf80c0740>

 **QUESTION:**

3.f What is the final training loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.g What is the final validation loss that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

3.h What is the ratio of your final training loss/final validation loss? For this overfitting version the ratio must be less than 0.5.

3.i What is the final validation accuracy that you observed for this overfitting version of the BERT classification model after training for 2 epochs? (Copy and paste the decimal value into the answers file, as instructed in 2.b)

## Congratulations... You are done!