# INFO 159/259

# <center> Homework 3: Transformers and Masked Language Models </center>
<center> Due: <b>Monday</b>, February 23, 2026 @ 11:59pm </center>

In this homework, you will experiment with a BERT-style bi-directional encoder transformer model that has been pretrained on the masked language modeling task.

In the first part, you will extract the _contextual_ embedding representations of words in the context of their sentences.

In the second part, you will explore how the embedding representation changes at different layers of the model.


Learning objectives:
- Understand the masked language modeling objective
- Be able to explain the difference between contextual and static representations
- Understand the transformer architecture

During the course of this project, you may need to inspect intermediate outputs to understand how they are structured (e.g., what exactly is in the output of the BERT model?). Feel free to create extra cells to do this investigation.

You should also be using a GPU instance on Google Colab to run your code. You can enable a GPU kernel by clicking `Runtime > Change runtime type` and selecting `T4 GPU`.

In [None]:
!wget https://github.com/dbamman/nlp-course/raw/refs/heads/main/HW/data/promotion.n.xml -O promotion.n.xml
!wget https://github.com/dbamman/nlp-course/raw/refs/heads/main/HW/hw3_transformers/hw3_utils.py -O hw3_utils.py

--2026-02-22 22:01:51--  https://github.com/dbamman/nlp-course/raw/refs/heads/main/HW/data/promotion.n.xml
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dbamman/nlp-course/refs/heads/main/HW/data/promotion.n.xml [following]
--2026-02-22 22:01:52--  https://raw.githubusercontent.com/dbamman/nlp-course/refs/heads/main/HW/data/promotion.n.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2843173 (2.7M) [text/plain]
Saving to: ‘promotion.n.xml’


2026-02-22 22:01:52 (12.0 MB/s) - ‘promotion.n.xml’ saved [2843173/2843173]

--2026-02-22 22:01:52--  https://github.com/dbamman/nlp-course/raw/refs/

In [None]:
import torch
from transformers import BertModel, AutoTokenizer

from hw3_utils import load_data_file

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
items = load_data_file("promotion.n.xml")

In [None]:
# You may get a warning in the LOAD REPORT. This is ok!
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## Tokenizing text

We will begin by implementing the `tokenize` and `collate` functions. Together, these two functions prepare your data to be fed into the BertModel (see the [Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertModel.forward)).

The `tokenize` function converts a sentence into the token IDs that are recognizable to the model; it also generates an attention mask. It takes an input batch (a dictionary of lists) and returns an output batch (also a dictionary of lists).

```
Input:
{
    "word": list[str]
    "sentence": list[list[str]]
}

Output:
{
    # These are generated by calling tokenizer()
    "input_ids": list[list[int]]
    "token_type_ids": list[list[int]]
    "attention_mask": list[list[int]]

    # Write the code to find the token indices
    "token_indices": list[int]
}
```

You will want to store the index of the first subword token that corresponds to the target word in `batch["word"]`. Here is an example:

```
Text:            I    said    hello    world    .

Tokens: [CLS]    I    said    hello    world    .    [SEP]
IDs:     101   1045   2056    7592     2088    1012   102
Index:    0      1     2        3        4      5      6
                                ^
```
The token `hello` has token index 3.
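When an earlier word is split into several subword tokens, the token index is no longer just the word position plus one. The sketch below illustrates this with a hypothetical toy splitter (`toy_split` and `TOY_SPLITS` are made up for illustration and are not the real BERT tokenizer); it records where each word's first subword lands in the `[CLS] ... [SEP]` sequence.

```python
# Hypothetical toy vocabulary: only "nlp" is split into two subwords.
TOY_SPLITS = {"nlp": ["nl", "##p"]}


def toy_split(word):
    """Stand-in for a subword tokenizer (an assumption, not BERT's algorithm)."""
    word = word.lower()
    return TOY_SPLITS.get(word, [word])


def first_subword_indices(words, split_word):
    """Return the token sequence and, for each word, the index of its first subword."""
    tokens = ["[CLS]"]
    first = []
    for word in words:
        first.append(len(tokens))        # the next token is this word's first subword
        tokens.extend(split_word(word))  # a word may contribute more than one token
    tokens.append("[SEP]")
    return tokens, first


tokens, first = first_subword_indices("I love NLP courses".split(), toy_split)
print(tokens)  # ['[CLS]', 'i', 'love', 'nl', '##p', 'courses', '[SEP]']
print(first)   # [1, 2, 3, 5]
```

Here `courses` is word 3 but token 5, because `NLP` occupies two token positions.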

In [None]:
tokenizer(["What is my name"])

{'input_ids': [[101, 2054, 2003, 2026, 2171, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}

In [None]:

a = ["1", "2", "2"]
a.index("2")

1

In [None]:
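def tokenize(batch):
    # `is_split_into_words=True` tells the tokenizer that each sentence is
    # already a list of words, so subword tokens can be traced back to words.
    tokenizer_output = tokenizer(batch["sentence"], is_split_into_words=True)

    output = dict()

    output['input_ids'] = tokenizer_output['input_ids']
    output['token_type_ids'] = tokenizer_output['token_type_ids']
    output['attention_mask'] = tokenizer_output['attention_mask']

    token_indices = []

    for i, word_ in enumerate(batch["word"]):
        # Position of the (first occurrence of the) target word in the sentence.
        word_index = batch['sentence'][i].index(word_)
        # word_ids() maps each token to the word it came from (None for special
        # tokens). The first match is the first subword of the target word, which
        # stays correct even when earlier words split into several subwords.
        word_ids = tokenizer_output.word_ids(batch_index=i)
        token_indices.append(word_ids.index(word_index))

    output['token_indices'] = token_indices

    assert not any(x is None for x in output["token_indices"]), "Target token not found in sentence!"
    assert len(token_indices) == len(batch["sentence"]), "Token indices is the wrong length!"

    return output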

**Quick check**: This should be the output of the below cell. Your decoded token should also match the target word.
```
{'input_ids': [[101, 1045, 2056, 7592, 2088, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]], 'token_indices': [4]}

Selected token ID:  2088
Decoded token:  world
```

In [None]:
def _():
    tokens = tokenize({
        "word": ["world"],
        "sentence": [["I", "said", "hello", "world", "."]]
    })
    print(tokens)
    print()

    print("Selected token ID: ", tokens["input_ids"][0][tokens["token_indices"][0]])
    print("Decoded token: ", tokenizer.decode(tokens["input_ids"][0][tokens["token_indices"][0]]))

_()

{'input_ids': [[101, 1045, 2056, 7592, 2088, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]], 'token_indices': [4]}

Selected token ID:  2088
Decoded token:  world


The `collate` function converts a list of rows into a batched tensor that the model can process in parallel. It takes a list of rows (a list of dicts) and returns a dictionary of tensors that can be fed into the model. It should use the `tokenize` function you implemented.

```
Input:
[{"word": str, "sentence": list[str]}, ...]

Output:
{
    "input_ids": torch.tensor
    "token_type_ids": torch.tensor
    "attention_mask": torch.tensor
    "token_indices": torch.tensor
}
```

Since sentences might be different lengths, you will want to `pad` the sequences before converting to torch tensors. You might want to look into `torch.nn.utils.rnn.pad_sequence`.

Each of `input_ids`, `token_type_ids`, and `attention_mask` should have shape `(B, L)`, where `B` is the batch size and `L` is the maximum sequence length in the batch.
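As a rough illustration of what padding to shape `(B, L)` means, here is a minimal pure-Python sketch. The helper name `pad_batch` is hypothetical, and the token IDs are taken from the quick-check output in this notebook. It right-pads every sequence with `0` and builds the matching attention mask.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded, mask = pad_batch([
    [101, 1045, 2293, 17953, 2361, 999, 102],
    [101, 2054, 2515, 17953, 2361, 3233, 2005, 1029, 1029, 102],
])
print(padded[0])  # [101, 1045, 2293, 17953, 2361, 999, 102, 0, 0, 0]
print(mask[0])    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

In the notebook itself, `pad_sequence` plays this role on tensors. The sketch only shows the resulting layout.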

In [None]:
from torch.nn.utils.rnn import pad_sequence

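def collate(items, device="cpu"):
    # Turn a list of rows into a dict of lists, then tokenize the whole batch.
    item_batch = {key: [item[key] for item in items] for key in items[0].keys()}
    tokenized_items = tokenize(item_batch)

    outputs = dict()
    for key in ['input_ids', 'token_type_ids', 'attention_mask']:
        # Pad with 0 (the [PAD] id, and "ignore" in the attention mask)
        # so that every tensor has shape (B, L).
        outputs[key] = pad_sequence(
            [torch.tensor(x) for x in tokenized_items[key]],
            batch_first=True,
            padding_value=0,
        ).to(device)

    outputs['token_indices'] = torch.tensor(tokenized_items['token_indices']).to(device)

    return outputs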


**Quick check**: This should be the output of the following cell.

```
{'input_ids': tensor(
    [[  101, 18558, 18914,  2003,  2019, 17953,  2361,  2607,  1012,   102],
     [  101,  1045,  2293, 17953,  2361,   999,   102,     0,     0,     0],
     [  101,  2054,  2515, 17953,  2361,  3233,  2005,  1029,  1029,   102]]),
 'token_type_ids': tensor(
    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor(
    [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'token_indices': tensor([5, 3, 3])}
```

In [None]:
def _():
    return collate([
        dict(word="NLP", sentence="INFO 159 is an NLP course .".split()),
        dict(word="NLP", sentence="I love NLP !".split()),
        dict(word="NLP", sentence="What does NLP stand for ? ?".split()),
    ])
_()

{'input_ids': tensor([[  101, 18558, 18914,  2003,  2019, 17953,  2361,  2607,  1012,   102],
         [  101,  1045,  2293, 17953,  2361,   999,   102,     0,     0,     0],
         [  101,  2054,  2515, 17953,  2361,  3233,  2005,  1029,  1029,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'token_indices': tensor([5, 3, 3])}

In [None]:
print(tokenizer.encode(['NLP']))
print(tokenizer.decode([17953]))
print(tokenizer.decode([2361]))
print(tokenizer.decode([17953, 2361]))
print(tokenizer.decode([101, 102, 0]))

[[101, 17953, 2361, 102]]
nl
##p
nlp
[CLS] [SEP] [PAD]


<!-- BEGIN QUESTION -->

**Answer these questions.** (You may execute arbitrary code if necessary.)
1. Why are there more `input_ids` in each sequence compared to the number of (space-delimited) tokens?
2. List all of the extra (special) tokens that get added to the input. Write the decoded tokens (not the integer input IDs).

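1. The tokenizer adds special tokens at the beginning and end of each sequence, and it splits words that are not in its vocabulary into subword pieces. For example, `NLP` is split into `nl` and `##p` (IDs 17953 and 2361).
2. `[CLS]` and `[SEP]` are added by the tokenizer, and `[PAD]` is added when shorter sequences are padded to the length of the longest sequence in the batch.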

<!-- END QUESTION -->

## Extracting contextual embedding

Here, you will implement `get_token_embedding` to get the contextual embedding. For the $i$th sentence in the batch, you want to extract the embedding representation for the $i$th token index in `token_indices` at the specified layer.
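The selection described above picks one vector per sentence: row `i` of the batch, at position `token_indices[i]`. A minimal sketch with nested lists standing in for a `(B, L, D)` tensor (the helper name `select_token_vectors` and the numbers are made up for illustration):

```python
def select_token_vectors(hidden, token_indices):
    """hidden is B x L x D (nested lists); return the B x D list of selected vectors."""
    return [hidden[i][idx] for i, idx in enumerate(token_indices)]


# B = 2 sentences, L = 3 tokens, D = 2 dimensions (toy values).
hidden = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[10.0, 10.1], [11.0, 11.1], [12.0, 12.1]],
]
selected = select_token_vectors(hidden, [2, 0])
print(selected)  # [[2.0, 2.1], [10.0, 10.1]]
```

With PyTorch tensors, the same selection is typically written with advanced indexing, `hidden[torch.arange(B), token_indices]`, which pairs each batch row with its own token index.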

In [None]:
model.embeddings(torch.tensor([[101, 102]])).shape

torch.Size([1, 2, 768])

In [None]:
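def get_token_embedding(model_output, token_indices, layer=-1):
    # batch_reps should have shape (B, D)
    # where B is the batch size and D is the dimension of the hidden state
    # (for BERT, D = 768)

    # Hidden states of the requested layer, shape (B, L, D).
    hidden = model_output['hidden_states'][layer]

    # Pair each sentence i with its own token index: hidden[i, token_indices[i]].
    token_indices = token_indices.to(hidden.device)
    batch_idx = torch.arange(hidden.shape[0], device=hidden.device)
    batch_reps = hidden[batch_idx, token_indices]

    return batch_reps.detach().cpu()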


We implement the inference code for you, but read through it and make sure you understand what is going on! Calling `model(...)` calls the `.forward()` method of the model as well as the necessary pre- and post-processing steps (see the [HF documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertModel.forward) for the BertModel, and the note at the bottom about overriding the `__call__` method).

Setting up the code this way (with `iter_outputs` serving as a generator that yields model outputs) lets us easily iterate through model outputs and apply arbitrary functions to them (like our `get_token_embedding` function).
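The generator pattern can be seen in isolation in the sketch below. Here `iter_batches` is a hypothetical stand-in in which a simple length computation replaces the model forward pass; the point is only that batches are produced lazily and the caller decides what to do with each output.

```python
def iter_batches(data, batch_size):
    """Yield (batch, output) pairs one batch at a time."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        output = [len(item) for item in batch]  # stand-in for a model forward pass
        yield batch, output


results = []
for batch, output in iter_batches(["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2):
    results.extend(output)  # any function of the output could be applied here

print(results)  # [1, 2, 3, 4, 5]
```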

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def iter_outputs(data, model, batch_size=128):
    model.eval()  # setting eval mode disables dropout (and other stuff)
    model.to(device)  # we put the model on the GPU

    dataloader = DataLoader(
        data,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=collate  # we pass in our collate function here
    )

    # disable gradient calculation and storage for efficiency,
    # since we aren't backpropagating
    with torch.no_grad():
        for batch in tqdm(dataloader):
            output = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                token_type_ids=batch["token_type_ids"].to(device),
                # by default, this only returns the last layer hidden states
                # we want the flexibility to look at other layers, so we set
                # `output_hidden_states=True`
                output_hidden_states=True
            )
            yield batch, output


**Quick check**: you should get the following output
```
tensor([[ 0.5100, -1.1154,  0.6605,  ..., -0.2650,  1.5101,  0.5656],
        [ 0.9360, -1.9150,  0.8224,  ...,  0.1231,  0.9546, -0.1965]])
```

In [None]:
def _():
    embeddings = []
    for batch, batch_output in iter_outputs([dict(word="NLP", sentence="INFO 159 is an NLP course .".split())], model):
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"]))
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"], layer=4))
    embeddings = torch.concat(embeddings, dim = 0)
    return embeddings

_()

100%|██████████| 1/1 [00:00<00:00,  8.50it/s]


tensor([[ 0.5100, -1.1154,  0.6605,  ..., -0.2650,  1.5101,  0.5655],
        [ 0.9360, -1.9150,  0.8224,  ...,  0.1231,  0.9546, -0.1965]])

In [None]:
def get_all_embeddings():
    embeddings = []
    for batch, batch_output in iter_outputs(items, model):
        embeddings.append(get_token_embedding(batch_output, batch["token_indices"]))

    return torch.concat(embeddings, dim=0)

embeddings = get_all_embeddings()

100%|██████████| 34/34 [42:18<00:00, 74.67s/it]


## Exploring contextual embeddings

With contextual embeddings, the representations change depending on the context; let's see that in action by looking at the sentences where the target word embeddings have the greatest similarity.

In [None]:
def nearest_neighbors(vec, matrix, k=10):
    cos_sim = ((vec @ matrix.T) / (vec.norm() * matrix.T.norm(dim=0, keepdim=True))).squeeze()
    inds = torch.argsort(-cos_sim)[:k]
    return inds, cos_sim[inds]

def show_nearest_neighbor_sentences(index, embeddings):
    print(f"QUERY (target word={items[index]['word']}):")
    print(" ".join(items[index]["sentence"]))

    print("NEIGHBORS:")
    inds, _ = nearest_neighbors(embeddings[index], embeddings, k=5)
    for ind in inds:
        print("-", " ".join(items[ind]["sentence"]))
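To make the ranking concrete, here is a small pure-Python sketch of cosine-similarity nearest neighbors on toy 2-dimensional vectors. The helpers `cosine` and `top_k` are illustrative names and are not part of the homework code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, rows, k):
    """Indices and similarities of the k rows most similar to the query."""
    sims = [cosine(query, row) for row in rows]
    order = sorted(range(len(rows)), key=lambda i: -sims[i])[:k]
    return order, [sims[i] for i in order]


rows = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [-1.0, 0.0]]
order, sims = top_k(rows[0], rows, k=3)
print(order)  # [0, 1, 2]
```

Because the query is itself a row of the matrix, it ranks first with similarity 1, so the first genuine neighbor is the second entry.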

<!-- BEGIN QUESTION -->

**Answer the following question**
1. Identify at least two different contexts in which the target word `promotion` appears. Include the query sentence, one nearest neighbor (not the query), and the ID of the query sentence.

Query ID: 436

Query Sentence: "Career ladders establish a pathway for career advancement leading to the full performance level of the position. After initial competition to enter the career ladder , successive **promotion** is dependent upon ( 1 ) the employee meeting legal and regulatory requirements ( time-in-grade restrictions ) , ( 2 ) the employee 's performance , and ( 3 ) the need for and availability of higher level work within the organization. Employees are not guaranteed promotion once selected for a position within an established career ladder. However , managers are encouraged to foster a work environment that affords individuals assigned to career ladders an equal opportunity to demonstrate their ability to perform at the full performance level ."

Nearest Neighbor: "Within the College , candidates for **promotion** and/or tenure are evaluated by their District Department Head or Program Department Head , Promotion and Tenure Committee , Associate Dean and Associate Director , and Dean and Chief Administrative Officer. At all levels of this evaluation , judgments must be made based on an individual 's responsibilities and performance. These judgments should recognize that each faculty member has a unique responsibility within the University. Likewise , the candidate must be aware that advancement through the academic ranks requires not only excellence in an academic discipline , but also evidence of developing the professional stature and maturity of view expected of those in the professorial ranks. Candidates for promotion and/or tenure are , therefore , responsible for providing the basis for appraisal of his/her performance , professional maturity , and likelihood of continued contributions. Consideration for issuance of a continuous contract ( tenure ) begins no later than spring of the fifth year and is completed no later than the sixth year of employment. University guidelines state clearly that " promotion to professor should not be considered to be forthcoming merely because of years of service to the university. " "

Query ID: 3232

Query Sentence: "You not only have to have marketing and **promotion** strategies for your business , you also need to be able to communicate them effectively and efficiently so that customers will be attracted to your business and what it offers. Micro and Home Based Businesses can benefit from the marketing and promotion opportunities offered to members by the Melbourne Chapter of Marketing Communications Executives International ( MCEI ) ."

Nearest Neighbor: "*Find free classified ads that could boost the **promotion** of your web site. These ads could be seen by other people who you are not targeting for , but may as well be interested in your services ."

In [None]:
show_nearest_neighbor_sentences(3232, embeddings)

QUERY (target word=promotion):
You not only have to have marketing and promotion strategies for your business , you also need to be able to communicate them effectively and efficiently so that customers will be attracted to your business and what it offers. Micro and Home Based Businesses can benefit from the marketing and promotion opportunities offered to members by the Melbourne Chapter of Marketing Communications Executives International ( MCEI ) . 
NEIGHBORS:
- 7. CDC. Perspectives in disease prevention and health promotion update : universal precautions for prevention of transmission of human immunodeficiency virus , hepatitis B virus , and other bloodborne pathogens in health-care settings. MMWR 1988; 38 : 377 -- 382 , 387 -- 8 . 
- *Find free classified ads that could boost the promotion of your web site. These ads could be seen by other people who you are not targeting for , but may as well be interested in your services . 
- You not only have to have marketing and promotion s

In [None]:
i = 3232
print(" ".join(items[i]['sentence']))

You not only have to have marketing and promotion strategies for your business , you also need to be able to communicate them effectively and efficiently so that customers will be attracted to your business and what it offers. Micro and Home Based Businesses can benefit from the marketing and promotion opportunities offered to members by the Melbourne Chapter of Marketing Communications Executives International ( MCEI ) . 


In [None]:
i = 436
print(" ".join(items[i]['sentence']))

Career ladders establish a pathway for career advancement leading to the full performance level of the position. After initial competition to enter the career ladder , successive promotion is dependent upon ( 1 ) the employee meeting legal and regulatory requirements ( time-in-grade restrictions ) , ( 2 ) the employee 's performance , and ( 3 ) the need for and availability of higher level work within the organization. Employees are not guaranteed promotion once selected for a position within an established career ladder. However , managers are encouraged to foster a work environment that affords individuals assigned to career ladders an equal opportunity to demonstrate their ability to perform at the full performance level . 


<!-- END QUESTION -->

## Attention masking

Let's take a closer look at how attention masks affect the output of the masked language model.

Consider embedding these two sentences which only differ in one token. We modify the attention mask to ignore this token.
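To build intuition for what an attention mask does, the sketch below runs a single attention head in pure Python. It makes strong simplifying assumptions that are not true of BERT: queries, keys, and values are the raw input vectors (no learned projections), and there is no residual connection, feed-forward layer, or layer normalization. The function `masked_attention` and the toy vectors are hypothetical. The mask hides a position from every query by setting its score to negative infinity before the softmax.

```python
import math


def masked_attention(queries, keys, values, key_mask):
    """Single-head dot-product attention; key_mask[j] == 0 hides position j."""
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
        # Masked keys get -inf, so their softmax weight is exactly 0.
        scores = [s if m else float("-inf") for s, m in zip(scores, key_mask)]
        top = max(scores)  # assumes at least one unmasked key
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs


# Two toy "sentences" that differ only at position 1, which is masked out.
x_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_b = [[1.0, 0.0], [5.0, -3.0], [1.0, 1.0]]
mask = [1, 0, 1]

out_a = masked_attention(x_a, x_a, x_a, mask)
out_b = masked_attention(x_b, x_b, x_b, mask)

print(out_a[0] == out_b[0], out_a[2] == out_b[2])  # unmasked positions agree
print(out_a[1], out_b[1])                          # the differing position does not
```

In this toy setting, the unmasked positions produce identical outputs in both inputs because they never see the masked position, while the masked position still acts as a query and produces different outputs.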

<!-- BEGIN QUESTION -->

**Answer this question**: (you can add cells and run code to arrive at the answer)

If we were to put these inputs through the BERT model, which of the following would be true? For each, explain why or why not.

1. The last layer hidden representation of the target tokens (`cat` and `dog`) would be the same.
2. The last layer hidden representation of the third non-special token (`fed` in both cases) would be the same.
3. The first layer hidden representation and the last layer hidden representation of each target token (e.g., the first and last layer representation of the `cat` token) would be the same.
4. If we mask out everything _but_ the target tokens, the first and last layer hidden representations of each target token would be the same.

1. False. `cat` and `dog` have different input IDs and therefore different input embeddings. The attention mask only stops tokens from attending to the masked position; the residual connections still carry each target token's own embedding forward, so the two representations differ.
2. True. `fed` attends only to the unmasked tokens, which are identical (and in the same positions) in both sentences, so it sees the same context and ends up with the same representation.
3. False. Every layer applies self-attention, a feed-forward network, residual connections, and layer normalization, so the representation changes from layer to layer.
4. False. As in 3, even if the target token attends to nothing but itself, each layer's projections, feed-forward network, and layer normalization still transform the output of the previous layer.

<!-- END QUESTION -->

## Upload instructions

Upload your `.ipynb` file (with all of the cells executed so that the outputs are visible) to Gradescope.