<a href="https://colab.research.google.com/github/d-noe/NLP_DH_PSL_Fall2025/blob/main/code/1_bert_training/Discover_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT: Bidirectional Encoder Representations from Transformers

![](https://sesameworkshop.org/wp-content/uploads/2023/03/presskit_ss_bio_bert.png)

Link to the original paper by [Devlin et al., 2019](https://aclanthology.org/N19-1423/).



## Set-up

Install and import necessary Python libraries and modules.

This notebook relies mainly on the [`transformers` Python library](https://huggingface.co/docs/transformers/installation) and, later on, uses [`bertviz`](https://github.com/jessevig/bertviz) to inspect attention mechanisms.

In [None]:
! pip install bertviz transformers

In [None]:
import torch

# Finds out if 'cuda' (i.e. GPU) is available
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model Overview

In [None]:
# from transformers import BertForMaskedLM, BertTokenizerFast

# tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# model = BertForMaskedLM.from_pretrained("bert-base-uncased")

In [None]:
# For BERT
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)  # move the weights to the selected device

Can you retrieve all the different components that we described earlier?

In [None]:
print(model)

In [None]:
tokenizer.tokenize("BERT means Bidirectional Encoder Representations from Transformers.")

What can you observe? What is the *'atomic unit'*, or *linguistic event*, for BERT? Can you think of why?
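
One way to build intuition for the question above: BERT's tokenizer splits a word that is not in its vocabulary into smaller known pieces. The sketch below mimics this with a greedy longest-match-first rule. `TOY_VOCAB` and `toy_wordpiece` are made up for illustration; the real tokenizer uses a much larger vocabulary and additional normalisation steps.

```python
# Toy greedy longest-match-first subword splitting, in the spirit of WordPiece.
# TOY_VOCAB is a hypothetical vocabulary, not BERT's real one.
TOY_VOCAB = {"bert", "means", "bid", "##ire", "##ction", "##al", "encoder", "[UNK]"}

def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(toy_wordpiece("bidirectional", TOY_VOCAB))
print(toy_wordpiece("encoder", TOY_VOCAB))
print(toy_wordpiece("zebra", TOY_VOCAB))
```

With a finite vocabulary of subword pieces, rare or new words can still be represented, at the cost of longer token sequences.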

In [None]:
print(f"Size of the vocabulary: {tokenizer.vocab_size}.")

In [None]:
tokenized_input = tokenizer("BERT means Bidirectional Encoder Representations from Transformers.")
print(tokenizer.decode(tokenized_input["input_ids"]))

Is there anything unusual in the tokenized / detokenized text?

# Masked Language Modeling

BERT uses a **“masked language model” (MLM)** pre-training objective, inspired by the Cloze task.

During training, some of the input tokens are randomly masked, and the objective is to **predict** each masked token **based only on its context**. (Remember: *You shall know a word by the company it keeps* (Firth, J. R. 1957:11)).

Unlike left-to-right language model pre-training (or causal language modeling objective), the MLM objective enables the representation to **fuse the left and the right context**, which allows us to pre-train a deep bidirectional Transformer.
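
To make the masking step concrete, here is a minimal sketch of how a training example could be corrupted. It follows the recipe described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The function `mask_tokens`, the toy vocabulary and the per-token coin flip are simplifying assumptions for illustration, not the original training code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using an 80/10/10 corruption rule.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a loss would only be computed where a prediction is requested.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:      # this position is selected for prediction
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")            # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                corrupted.append(token)               # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

toy_vocab = ["the", "cat", "dog", "mouse", "chased", "street"]  # hypothetical vocabulary
tokens = "the cat chased the mouse across the street".split()

# A high mask_prob is used here only so that the effect is visible on a short sentence
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5)
print(corrupted)
print(labels)
```

Because the model sees tokens on both sides of each selected position, it can use left and right context together when predicting.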

Now let's see how it does in practice! What will BERT predict?

In [None]:
# Load the model using `BertForMaskedLM`
# --> appends a prediction head to the architecture
# --> lets us perform MLM tasks (i.e. predict a missing word)

from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to(DEVICE)

In [None]:
# Can you spot the difference?
print(model)

In [None]:
# Quick helper function
# --> retrieve the IDs of the `n` most probable tokens based on model's logits

def get_n_most_likely(
    logits,
    mask_token_id:int,
    n:int=10,
):
  """
  Input:
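    logits: tensor of shape (batch_size, seq_len, vocab_size)
    mask_token_id: position of the [MASK] token in the sequence
      (despite its name, a sequence index, not a vocabulary id)
    n: number of most likely tokens to return
  Output:
    array of the n most likely token ids, most probable first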
  """
  return logits[0, mask_token_id].argsort()[-n:].cpu().numpy()[::-1]

def get_masked_position(
    token_ids,
):
  """
  Input
    token_ids: list of token ids
  Output
    position of the [MASK] token
  """
  # (boolean tensor: True at the [MASK] position, False elsewhere) --> numpy --> index of the first True
  return (token_ids == tokenizer.mask_token_id).cpu().numpy().argmax()

In [None]:
# Enter your sentence here
# - enter full `sentence` and `word_to_mask`
# - or, directly write a sentence with a [MASK]

sentence = "The cat chased the mouse."
word_to_mask = "cat"
masked_sentence = sentence.replace(word_to_mask, "[MASK]")

print(masked_sentence)


Now, let's pass our masked sentence into the model.

Remember, we first have to tokenize it to prepare the input. Then, we'll give our tokenized sentence to the model.

In [None]:
# tokenize + put everything on the same DEVICE
tokenized_inputs = tokenizer(masked_sentence, return_tensors="pt").to(DEVICE)
print(tokenized_inputs)

In [None]:
outputs = model(**tokenized_inputs)
print(outputs)

Now, let's retrieve the most probable tokens at the `[MASK]` position based on the model's output:

In [None]:
# We'll use the helper functions for this part

mask_position = get_masked_position(tokenized_inputs["input_ids"][0])

predicted_token_id = get_n_most_likely(
    logits = outputs.logits,
    mask_token_id=mask_position,
    n=1,
)

predicted_token = tokenizer.decode(predicted_token_id)

print(f"For the [MASK] in the sentence '{masked_sentence}',")
print(f"the model predicts the token: '{predicted_token}'.")
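
The helper above ranks raw logits, which is enough to find the most likely tokens. To read the scores as probabilities, a softmax is typically applied over the vocabulary dimension. The sketch below shows this on made-up numbers; `toy_vocab` and `toy_logits` are hypothetical stand-ins for the model's vocabulary and for `outputs.logits` at the `[MASK]` position.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(logits, id_to_token, n=3):
    """Return the n most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:n]]

# Made-up scores for a 5-token toy vocabulary at the [MASK] position
toy_vocab = ["cat", "dog", "mouse", "car", "the"]
toy_logits = [4.0, 3.5, 1.0, -2.0, 0.5]

for token, prob in top_n(toy_logits, toy_vocab):
    print(f"{token}: {prob:.3f}")
```

Softmax preserves the ranking of the logits, so the top tokens are the same either way; only the scale changes.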

In [None]:
# Now you can try different sentences in the single cell below:
masked_sentence = "Paris is the most [MASK] city in the world."
n_most_likely = 10

# We'll use the helper functions for this part

tokenized_inputs = tokenizer(masked_sentence, return_tensors="pt").to(DEVICE)
outputs = model(**tokenized_inputs)
mask_position = get_masked_position(tokenized_inputs["input_ids"][0])

predicted_token_id = get_n_most_likely(
    logits = outputs.logits,
    mask_token_id=mask_position,
    n=n_most_likely,
)

print(f"For the [MASK] in the sentence '{masked_sentence}',")
if n_most_likely == 1:
  predicted_tokens = tokenizer.decode(predicted_token_id)
  print(f"the model predicts the token: '{predicted_tokens}'.")
else:
  print("the model predicts the tokens:")
  for i, t_id in enumerate(predicted_token_id):
    print(f"{i+1} - {tokenizer.decode(t_id)}")


# 👁️ Exploring Attention with BERTviz

When you run the cell below, an **interactive attention visualization** will appear.  
This tool helps us **see how the model distributes its attention between tokens** in the sentence.

---

#### 🧩 Understanding What You See

When you open the BERTviz `head_view`, you’ll notice several key components:

1. **Layer Selection (Top Bar)**  
   - `bert-base-uncased` has **12 layers** (numbered 0–11).  
   - You can switch between layers to see how attention evolves.  
   - Lower layers tend to capture **local relations** (like syntax), while higher layers focus on **global meaning**.

2. **Head Selection (Colored Boxes)**  
   - Each layer has **12 attention heads** (labeled 0–11).  
   - Each head learns different patterns:
     - Some connect nearby words.
     - Some capture longer dependencies (e.g., subject → verb).
     - Others gather sentence-wide context around special tokens like `[CLS]` or `[SEP]`.  
   - Click the head boxes to toggle each head on or off.

3. **Token Lists (Left and Right Columns)**  
   - These columns show all the tokens (subword pieces) in the input sentence.  
   - BERT uses **WordPiece tokenization**, so some words appear split into parts (e.g., `"arith"`, `"##metical"`).  
   - The same sentence is displayed on both sides — attention connects each token **to every other token**.

4. **Attention Lines (Colored Arcs)**  
   - Each line connects a **query token** (on the left) to a **key token** (on the right).  
   - The color and thickness show **how strongly** the model attends from one token to another.  
   - When multiple heads are active, lines from each head are color-coded according to the boxes at the top.

---

#### 🖱️ Hovering Tokens and Asymmetry

- **Hover over a word on the left**: the lines show which **key tokens** on the right this **query token** attends to.  
- **Hover over a word on the right**: the lines show which **query tokens** on the left attend to this **key token**.  

⚠️ **Important:** Attention is **not necessarily symmetrical**.  
- A token A may strongly attend to token B, but token B might attend more to some other token C.  
- This is because attention is **directional**: each token has its own query vector and computes attention to keys independently.

So, hovering left vs. right gives complementary, but **different perspectives** on the model’s attention patterns.
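
The asymmetry described above can be checked numerically. Below is a minimal sketch of scaled dot-product attention weights for a single head, using random matrices `Q` and `K` as stand-ins for the learned query and key projections (an assumption for illustration; no real model weights are involved).

```python
import numpy as np

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: row-wise softmax(Q K^T / sqrt(d))."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_head = 4, 8
Q = rng.normal(size=(n_tokens, d_head))  # one query vector per token
K = rng.normal(size=(n_tokens, d_head))  # one key vector per token

A = attention_weights(Q, K)  # A[i, j]: how much token i attends to token j
print(A.round(2))
print("rows sum to 1:", np.allclose(A.sum(axis=1), 1.0))
print("symmetric:", np.allclose(A, A.T))
```

Each row is normalised on its own, which is why `A[i, j]` and `A[j, i]` generally differ.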

---

#### 🎓 How to Explore

- **Hover over a word** on the left or right to see attention patterns.  
- **Switch between layers** to see how the focus changes from local to global.  
- **Toggle heads** to isolate different types of relationships.  

Try this with an example sentence such as:

> *“modern electronic calculators contain a keyboard with buttons for digit and arithmetical operations.”*

Questions to guide your exploration:
- Does *digit* attend to words like *arithmetical*, *keyboard*, or *electronic* in early layers?  
- Do higher layers show attention concentrating on `[SEP]` or spreading evenly across tokens?  
- Which layer seems to highlight meaningful semantic connections?

---

#### 🧠 A Note on Interpretation

While BERTviz gives insight into model behavior, keep in mind:

- **Attention ≠ Explanation**  
  Attention shows *how* tokens interact, not necessarily *why* the model predicts something.  

- **Later layers often focus on `[SEP]` or `[CLS]`**  
  This is normal — these tokens serve as *summary anchors* for global information.

- **Not every head is interpretable**  
  Some heads track syntax or position rather than semantics.

Use this visualization as a **qualitative exploration tool** to understand patterns, not as a definitive explanation of the model’s reasoning.


In [None]:
# For visualisation of attention mechanisms
from bertviz import head_view, model_view

In [None]:
model = BertModel.from_pretrained(
    "bert-base-uncased",
    output_attentions=True,
    device_map=DEVICE,
)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
sentence = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt").to(DEVICE)
outputs = model(**inputs)

attentions = outputs.attentions  # Tuple of attention matrices, one per layer
print(f"Number of layers: {len(attentions)}")
print(f"Shape of each attention tensor: {attentions[0].shape}")  # (batch, num_heads, seq_len, seq_len)
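
To get comfortable with this nested structure before looking at the plots, here is a small sketch that indexes into a tuple of attention tensors with the same layout, `(batch, num_heads, seq_len, seq_len)`. The values in `fake_attentions` are random rows that sum to 1, a stand-in for real model output; the token list and the chosen layer and head are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads = 12, 12
tokens = ["[CLS]", "the", "animal", "was", "tired", "[SEP]"]
seq_len = len(tokens)

# One array per layer; every row is a probability distribution over key tokens
fake_attentions = tuple(
    rng.dirichlet(np.ones(seq_len), size=(1, n_heads, seq_len))
    for _ in range(n_layers)
)

layer, head = 9, 3                            # arbitrary choice
weights = fake_attentions[layer][0, head]     # (seq_len, seq_len): rows are queries
query_pos = tokens.index("tired")
key_pos = int(weights[query_pos].argmax())    # key token with the largest weight
print(f"'{tokens[query_pos]}' attends most to '{tokens[key_pos]}'")

# Averaging over heads gives a single (seq_len, seq_len) map for the layer
mean_over_heads = fake_attentions[layer][0].mean(axis=0)
print(mean_over_heads.shape)
```

The same indexing applies to the real `attentions` tuple once the tensors are moved to the CPU and converted to NumPy.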

In [None]:
# @title (Simulated) Masked Language Modeling

# Hint: look at layer 10!
sentence = "The capital of France is [MASK]."

inputs = tokenizer(sentence, return_tensors='pt').to(DEVICE)
outputs = model(**inputs)

# Convert token ids to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Use head_view for single-sentence visualization
head_view(attention=outputs.attentions, tokens=tokens)

## *The animal couldn't cross the street because **it** ...*

In [None]:
# @title ... *was too tired.*
sentence = "The animal couldn't cross the street because it was too tired."

inputs = tokenizer(sentence, return_tensors='pt').to(DEVICE)
outputs = model(**inputs)

# Convert token ids to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Use head_view for single-sentence visualization
head_view(attention=outputs.attentions, tokens=tokens)

In [None]:
# @title ... *was too wide.*

# Hint: have a look at layer 9
sentence = "The animal couldn't cross the street because it was too wide."

inputs = tokenizer(sentence, return_tensors='pt').to(DEVICE)
outputs = model(**inputs)

# Convert token ids to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Use head_view for single-sentence visualization
head_view(attention=outputs.attentions, tokens=tokens)

In [None]:
model_view(attention=outputs.attentions, tokens=tokens)

# BERT for Feature Extraction: Vector Representations

You can use the pre-trained BERT to create contextualized word embeddings and then feed these embeddings to your existing model. The paper shows that this process yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

![Extracted from https://jalammar.github.io/illustrated-bert/](https://jalammar.github.io/images/bert-contexualized-embeddings.png)

There are different ways to extract embeddings from a pre-trained model (remember `bertviz`: you can use different layers, combinations of layers, heads, etc.).

Here we'll stick with a common approach and use the `last_hidden_state`. These vectors are the output of the final encoder layer, i.e. the result of all layers' transformations and attention operations combined: they are the model's final representation of each token.
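
For comparison, the sketch below shows what some of the alternatives could look like in terms of array shapes. With `transformers`, per-layer outputs can be requested with `output_hidden_states=True`; here random arrays stand in for them (an assumption for illustration), with 13 entries for a base-size BERT: the embedding output plus 12 encoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 5, 768

# Stand-in for `outputs.hidden_states`: embedding output + 12 encoder layers
hidden_states = tuple(rng.normal(size=(batch, seq_len, hidden)) for _ in range(13))

last_layer = hidden_states[-1]                                   # corresponds to `last_hidden_state`
sum_last_four = np.sum(hidden_states[-4:], axis=0)               # element-wise sum of 4 layers
concat_last_four = np.concatenate(hidden_states[-4:], axis=-1)   # 4 layers side by side

print(last_layer.shape, sum_last_four.shape, concat_last_four.shape)
```

Summing keeps the vector size at 768, while concatenating four layers gives 3072 features per token, which matters for whatever model consumes the embeddings.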

In [None]:
# Load the model — no need for MLM here: we'll extract the embeddings
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)

In [None]:
# Toy example data: how does BERT understand the word 'field'?

data = {
    "Zinedine Zidane is the greatest on the field.": "Football Pitch",
    "On the field, Michel Platini was undeniably different.": "Football Pitch",
    "Jonah Lomu reigned on the field.": "Rugby Pitch",
    "Barry Bonds had no contender on the field.": "Baseball",
    "The crop is growing in the field.": "Agriculture Field",
    "This field was harvested by the farmer.": "Agriculture Field",
    "He hopes to find work in the informatics field.": "Domain",
    "Mary Frances Lyon is a pioneer in the field of genetic research.": "Domain",
}

sentences = list(data.keys())
senses = list(data.values())

In [None]:
# tokenize our sentences
tokenized_sentences = tokenizer(
  sentences,
  truncation=True,
  padding=True,
  return_tensors="pt"
).to(DEVICE)
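
Since the sentences have different lengths, `padding=True` pads the shorter ones so that the batch fits in one rectangular tensor, and the tokenizer returns an `attention_mask` marking real tokens versus padding. A minimal sketch of that idea, with a hypothetical helper `pad_batch` and toy ids (the real tokenizer handles this internally):

```python
PAD_ID = 0  # assumption: 0 is used as the padding id in this toy example

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # two toy sequences of different lengths
input_ids, attention_mask = pad_batch(toy_batch)
print(input_ids)       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask lets the model ignore padded positions when computing attention.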

In [None]:
for i, sentence in enumerate(sentences):
  print(f"Sentence: {sentence}")
  for tokenized_id in tokenized_sentences['input_ids'][i]:
    print(f"\t{tokenized_id} : {tokenizer.decode(tokenized_id)}")

What can you observe? What is the `token_id` of the word `'field'`?

In [None]:
tokenizer.vocab["field"]

In [None]:
outputs = model(**tokenized_sentences)

In [None]:
# Vector Representation
embeddings = outputs["last_hidden_state"]
embeddings.shape # N samples, Sequence length, Embedding dimensions

In [None]:
# Quick and dirty way to extract the embeddings at the position of the word 'field'
import numpy as np

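word_id = tokenizer.vocab["field"]  # same lookup as in the cell above (2492)

# Position of the first occurrence of 'field' in each sentence
# (caution: argmax returns 0 if the token is absent from a sentence)
field_token_pids = [np.argmax(t.cpu().numpy()==word_id) for t in tokenized_sentences["input_ids"]]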

field_vectors = np.array([
    o[p_id].detach().cpu().numpy()
    for o, p_id in zip(outputs["last_hidden_state"], field_token_pids)
])

field_vectors.shape # N samples, Embedding dimensions

In [None]:
#@title Embeddings PCA Visualisation:

from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances
from sklearn.decomposition import PCA
import altair as alt
import pandas as pd

#dim_reducer = MDS(n_components=2, dissimilarity="precomputed")
dim_reducer = PCA(n_components=2)

df_field = pd.DataFrame()
df_field["sentence"] = sentences
df_field["sense"] = senses

#reduced_vectors = dim_reducer.fit_transform(cosine_distances(field_vectors))
reduced_vectors = dim_reducer.fit_transform(field_vectors)

df_field["Dim_1"] = reduced_vectors[:,0]
df_field["Dim_2"] = reduced_vectors[:,1]

chart = alt.Chart(
    df_field,
    title="Field Embeddings"
).mark_circle(size=200).encode(
    alt.X('Dim_1',
        scale=alt.Scale(zero=False)
    ),
    y="Dim_2",
    color= "sense",
    tooltip=['sentence'],
    ).interactive().properties(
    width=500,
    height=500
)

chart

Bingo! The model produces vector representations that are closer in the embedding space when the word is used with a similar meaning!
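
The 2D plot is a projection, so it can help to check closeness in the original space as well. A common choice is cosine similarity. The sketch below computes a pairwise similarity matrix on made-up 3-dimensional vectors; `toy_vectors` and `toy_senses` are hypothetical stand-ins for `field_vectors` and `senses`.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T

# Made-up 3-d vectors standing in for `field_vectors`
toy_vectors = np.array([
    [0.9, 0.1, 0.0],   # "sports" sense
    [0.8, 0.2, 0.1],   # "sports" sense
    [0.0, 0.9, 0.3],   # "agriculture" sense
])
toy_senses = ["sports", "sports", "agriculture"]

sims = cosine_similarity_matrix(toy_vectors)
print(sims.round(2))
```

Applied to the real `field_vectors`, the same function would let you compare similarities within a sense to similarities across senses.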


Think back to `word2vec`: could you have obtained such results with it?
