# INFO 159/259

# <center> Homework 3: Transformers and Masked Language Models </center>
<center> Due: <b>Monday</b>, February 23, 2026 @ 11:59pm </center>

In this homework, you will experiment with a BERT-style bidirectional transformer encoder that has been pretrained on the masked language modeling task.

In the first part, you will extract the _contextual_ embedding representations of words in the context of their sentences.

In the second part, you will explore how the embedding representation changes at different layers of the model.


Learning objectives:
- Understand the masked language modeling objective
- Be able to explain the difference between contextual and static representations
- Understand the transformer architecture

During the course of this project, you may need to inspect intermediate outputs to understand how they are structured (e.g., what exactly is in the output of the BERT model?). Feel free to create extra cells to do this investigation.

You should also be using a GPU instance on Google Colab to run your code. You can enable a GPU kernel by clicking `Runtime > Change runtime type` and selecting `T4 GPU`.

In [None]:
!wget https://github.com/dbamman/nlp-course/raw/refs/heads/main/HW/data/promotion.n.xml -O promotion.n.xml
!wget https://github.com/dbamman/nlp-course/raw/refs/heads/main/HW/hw3_transformers/hw3_utils.py -O hw3_utils.py

In [None]:
import torch
from transformers import BertModel, AutoTokenizer

from hw3_utils import load_data_file

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
items = load_data_file("promotion.n.xml")

In [None]:
# You may get a warning in the LOAD REPORT. This is ok!
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

## Tokenizing text

We will begin by implementing the `tokenize` and `collate` functions. Together, these two functions prepare your data to be fed into the `BertModel` (see the [Hugging Face documentation here](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertModel.forward)).

The `tokenize` function converts a sentence into the token IDs that are recognizable to the model; it also generates an attention mask. It takes an input batch (a dictionary of lists) and returns an output batch (also a dictionary of lists).

```
Input:
{
    "word": list[str]
    "sentence": list[list[str]]
}

Output:
{
    # These are generated by calling tokenizer()
    "input_ids": list[list[int]]
    "token_type_ids": list[list[int]]
    "attention_mask": list[list[int]]

    # Write the code to find the token indices
    "token_indices": list[int]
}
```

You will want to store the index of the first subword token that corresponds to the target word in `batch["word"]`. Here is an example:

```
Text:            I    said    hello    world    .

Tokens: [CLS]    I    said    hello    world    .    [SEP]
IDs:     101   1045   2056    7592     2088    1012   102
Index:    0      1     2        3        4      5      6
                                ^
```
The token `hello` has token index 3.
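
When a word is split into several subword pieces, every later token shifts to the right, so the word's position in the sentence no longer equals its token index. The sketch below illustrates this with a hypothetical toy splitter (`TOY_SUBWORDS` and `toy_tokenize` are made up for illustration and are not part of the `transformers` API); the real tokenizer learns its own splits, so treat the split shown here as an assumption.

```python
# Hypothetical toy subword table; a real WordPiece tokenizer learns its own splits.
TOY_SUBWORDS = {"nlp": ["nl", "##p"]}

def toy_tokenize(words):
    """Return (tokens, first_index), where first_index[i] is the token
    position of the first subword piece of words[i]."""
    tokens = ["[CLS]"]
    first_index = []
    for word in words:
        pieces = TOY_SUBWORDS.get(word.lower(), [word.lower()])
        first_index.append(len(tokens))  # where this word's first piece lands
        tokens.extend(pieces)
    tokens.append("[SEP]")
    return tokens, first_index

words = "I love NLP !".split()
tokens, first_index = toy_tokenize(words)
# words.index() finds the first occurrence of the target word.
target_index = first_index[words.index("NLP")]
print(tokens)        # ['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']
print(target_index)  # 3
```

Note that `!` sits at word position 3 but token position 5, because `[CLS]` and the extra piece `##p` both come before it.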

In [None]:
def tokenize(batch):
    # TODO: Implement me!
    ...
    
    assert not any(x is None for x in output["token_indices"]), "Target token not found in sentence!"
    assert len(output["token_indices"]) == len(batch["sentence"]), "Token indices is the wrong length!"

    return output

**Quick check**: This should be the output of the below cell. Your decoded token should also match the target word.
```
{'input_ids': [[101, 1045, 2056, 7592, 2088, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]], 'token_indices': [4]}

Selected token ID:  2088
Decoded token:  world
```

In [None]:
def _():
    tokens = tokenize({
        "word": ["world"],
        "sentence": [["I", "said", "hello", "world", "."]]
    })
    print(tokens)
    print()

    print("Selected token ID: ", tokens["input_ids"][0][tokens["token_indices"][0]])
    print("Decoded token: ", tokenizer.decode(tokens["input_ids"][0][tokens["token_indices"][0]]))

_()

The `collate` function converts a list of rows into a batched tensor that the model can process in parallel. It takes a list of rows (a list of dicts) and returns a dictionary of tensors that can be fed into the model. It should use the `tokenize` function you implemented.

```
Input:
[{"word": str, "sentence": list[str]}, ...]

Output:
{
    "input_ids": torch.tensor
    "token_type_ids": torch.tensor
    "attention_mask": torch.tensor
    "token_indices": torch.tensor
}
```

Because sentences may differ in length, you will need to pad the sequences to a common length before converting them to torch tensors. You might want to look into `torch.nn.utils.rnn.pad_sequence`.

Each of `input_ids`, `token_type_ids`, and `attention_mask` should have shape `(B, L)`, where `B` is the batch size and `L` is the maximum sequence length in the batch.
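
As a rough illustration of the padding step, here is a minimal sketch on toy ID lists (the values are arbitrary and this is not a full `collate` implementation). It assumes a padding value of 0, which matches the padded positions in the quick check below.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token-ID lists of different lengths (values are arbitrary).
sequences = [[101, 7, 8, 9, 102], [101, 7, 102]]

tensors = [torch.tensor(seq) for seq in sequences]
# batch_first=True gives shape (B, L); the default layout is (L, B).
padded = pad_sequence(tensors, batch_first=True, padding_value=0)

# A matching mask can be padded the same way: 1 for real tokens, 0 for padding.
mask = pad_sequence(
    [torch.ones(len(seq), dtype=torch.long) for seq in sequences],
    batch_first=True,
    padding_value=0,
)
print(padded.shape)   # torch.Size([2, 5])
print(mask.tolist())  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```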

In [None]:
from torch.nn.utils.rnn import pad_sequence

def collate(items, device="cpu"):
    # TODO: Implement me!
    ...
    return outputs


**Quick check**: This should be the output of the following cell.

```
{'input_ids': tensor(
    [[  101, 18558, 18914,  2003,  2019, 17953,  2361,  2607,  1012,   102],
     [  101,  1045,  2293, 17953,  2361,   999,   102,     0,     0,     0],
     [  101,  2054,  2515, 17953,  2361,  3233,  2005,  1029,  1029,   102]]),
 'token_type_ids': tensor(
    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor(
    [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'token_indices': tensor([5, 3, 3])}
```

In [None]:
def _():
    return collate([
        dict(word="NLP", sentence="INFO 159 is an NLP course .".split()),
        dict(word="NLP", sentence="I love NLP !".split()),
        dict(word="NLP", sentence="What does NLP stand for ? ?".split()),
    ])
_()

<!-- BEGIN QUESTION -->

**Answer these questions.** (You may execute arbitrary code if necessary.)
1. Why are there more `input_ids` in each sequence than there are (space-delimited) tokens?
2. List all of the extra (special) tokens that get added to the input. Write the decoded tokens (not the integer input IDs).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Extracting contextual embeddings

Here, you will implement `get_token_embedding` to get the contextual embedding of each target word. For the $i$th sentence in the batch, extract the hidden state at position `token_indices[i]` from the specified layer.
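
One way to pick a different position from each row of a batch is PyTorch's advanced indexing. The sketch below uses a made-up tensor as a stand-in for a single layer's hidden states (the sizes and values are assumptions for illustration, not real model output).

```python
import torch

B, L, D = 3, 10, 4  # toy sizes; BERT-base uses D = 768
# Stand-in for one layer's hidden states, shape (B, L, D).
hidden = torch.arange(B * L * D, dtype=torch.float32).reshape(B, L, D)
token_indices = torch.tensor([5, 3, 3])

# Pair row i with position token_indices[i] using advanced indexing.
rows = torch.arange(B)
batch_reps = hidden[rows, token_indices]  # shape (B, D)
print(batch_reps.shape)  # torch.Size([3, 4])
```

Indexing with two integer tensors of the same length selects one `(row, position)` pair per sentence, which avoids an explicit Python loop over the batch.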

In [None]:
def get_token_embedding(model_output, token_indices, layer=-1):
    # batch_reps should have shape (B, D)
    # where B is the batch size and D is the dimension of the hidden state
    # (for BERT, D = 768)

    # TODO: Implement me!
    ...
    return batch_reps.detach().cpu()


We implement the inference code for you, but read through it and make sure you understand what is going on! Calling `model(...)` runs the model's `.forward()` method along with the necessary pre- and post-processing steps (see the [HF documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertModel.forward) for the BertModel, and the note at the bottom about overriding the `__call__` method).

Setting up the code this way (with `iter_outputs` serving as a generator that yields model outputs) lets us easily iterate through model outputs and apply arbitrary functions to them (like our `get_token_embedding` function).

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def iter_outputs(data, model, batch_size=128):
    model.eval()  # eval mode disables dropout and other training-only behavior
    model.to(device)  # move the model to `device` (the GPU, if one is available)

    dataloader = DataLoader(
        data,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=collate  # we pass in our collate function here
    )

    # disable gradient calculation and storage for efficiency,
    # since we aren't backpropagating
    with torch.no_grad():
        for batch in tqdm(dataloader):
            output = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                token_type_ids=batch["token_type_ids"].to(device),
                # by default, this only returns the last layer hidden states
                # we want the flexibility to look at other layers, so we set
                # `output_hidden_states=True`
                output_hidden_states=True
            )
            yield batch, output
    

**Quick check**: You should get the following output:
```
tensor([[ 0.5100, -1.1154,  0.6605,  ..., -0.2650,  1.5101,  0.5656],
        [ 0.9360, -1.9150,  0.8224,  ...,  0.1231,  0.9546, -0.1965]])
```

In [None]:
def _():
    embeddings = []
    for batch, batch_output in iter_outputs([dict(word="NLP", sentence="INFO 159 is an NLP course .".split())], model):
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"]))
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"], layer=4))
    embeddings = torch.concat(embeddings, dim = 0)
    return embeddings

_()

In [None]:
def get_all_embeddings():
    embeddings = []
    for batch, batch_output in iter_outputs(items, model):
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"]))
    
    return torch.concat(embeddings, dim=0)

embeddings = get_all_embeddings()

## Exploring contextual embeddings

With contextual embeddings, the representations change depending on the context; let's see that in action by looking at the sentences where the target word embeddings have the greatest similarity.
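
For intuition about the similarity measure, here is a small sketch of cosine similarity on toy 2-dimensional vectors, using `torch.nn.functional.cosine_similarity` (the vectors are made up for illustration). It is an alternative way to write the computation, not the notebook's own helper.

```python
import torch
import torch.nn.functional as F

# Toy (N, D) embedding matrix; the values are arbitrary.
matrix = torch.tensor([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
vec = matrix[2]  # query vector, shape (D,)

# Broadcast the query against every row; the result has shape (N,).
cos_sim = F.cosine_similarity(vec.unsqueeze(0), matrix, dim=1)
order = torch.argsort(cos_sim, descending=True)
print(order[0].item())  # 2
```

The query is always its own nearest neighbor (similarity 1), which is why the top result is usually skipped when reading off neighbors.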

In [None]:
def nearest_neighbors(vec, matrix, k=10):
    cos_sim = ((vec @ matrix.T) / (vec.norm() * matrix.T.norm(dim=0, keepdim=True))).squeeze()
    inds = torch.argsort(-cos_sim)[:k]
    return inds, cos_sim[inds]

def show_nearest_neighbor_sentences(index, embeddings):
    print(f"QUERY (target word={items[index]['word']}):")
    print(" ".join(items[index]["sentence"]))

    print("NEIGHBORS:")
    inds, _ = nearest_neighbors(embeddings[index], embeddings, k=5)
    for ind in inds:
        print("-", " ".join(items[ind]["sentence"]))

<!-- BEGIN QUESTION -->

**Answer the following question**
1. Identify at least two different contexts in which the target word `promotion` appears. For each context, include the query sentence, one nearest neighbor (not the query itself), and the ID of the query sentence.

_Type your answer here, replacing this text._

In [None]:
show_nearest_neighbor_sentences(2, embeddings)

<!-- END QUESTION -->

## Attention masking

Let's take a closer look at how attention masks affect the output of the masked language model.

Consider embedding these two sentences, which differ in only one token. We modify the attention mask to ignore this token.
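
To make the mechanics of masking concrete, here is a toy sketch of a masked softmax over attention scores for a single query. It is a simplification under stated assumptions: `masked_softmax` is a hypothetical helper, the scores are made up, and it uses a literal `-inf` for masked scores and assumes at least one unmasked position. It is not BERT's actual implementation.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0.
    Assumes at least one position has mask == 1."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 1.5]  # toy scores from one query to four positions
mask = [1, 1, 0, 1]            # the third position is masked out
weights = masked_softmax(scores, mask)
print(weights[2])  # 0.0
```

The remaining weights are renormalized so they still sum to 1 over the unmasked positions.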

In [None]:
custom_items = [
    dict(word="cat", sentence="Have you fed your cat yet ?".split()),
    dict(word="dog", sentence="Have you fed your dog yet ?".split())
]

model_inputs = collate(custom_items, device=device)
# ignore the tokens located at token_indices for each data point
model_inputs["attention_mask"][[0, 1], model_inputs["token_indices"]] = 0
model_inputs

<!-- BEGIN QUESTION -->

**Answer this question**: (you can add cells and run code to arrive at the answer)

If we were to put these inputs through the BERT model, which of the following would be true? For each, explain why or why not.

1. The last layer hidden representation of the target tokens (`cat` and `dog`) would be the same.
2. The last layer hidden representation of the third non-special token (`fed` in both cases) would be the same.
3. The first layer hidden representation and the last layer hidden representation of each target token (e.g., the first and last layer representation of the `cat` token) would be the same.
4. If we mask out everything _but_ the target tokens, the first and last layer hidden representations of each target token would be the same.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Upload instructions

Upload your `.ipynb` file (with all of the cells executed so that the outputs are visible) to Gradescope.