<a href="https://colab.research.google.com/github/antalvdb/antalvdb.github.io/blob/main/INFOMTALC2025_Seminar_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: Applications in Language and Communication (INFOMTALC)

## Seminar 3: The Transformer Anatomy Lesson

The colab for this seminar consists of two parts:

- PART I: Writing up a transformer encoder
- PART II: Visualizing attention

The overall goal is to get a better intuition about the inner workings of a transformer model. PART I is a read-and-click-through sequence of cells, while PART II contains some exercises.

## PART I: A sketch of transformer encoder implementation

We will now implement a transformer encoder in PyTorch! Note that we will not train it. We will write out all the components and put them together, so that all the matrices and vectors have the right size and they interact in the correct way, but the actual numbers in all these vectors and components will be random (so that if we try to run the resulting module, the output will be garbage).

In this implementation, we closely follow Chapter 3 of the textbook. To dive deeper into implementation variants of the different transformer blocks and to try out actual training, we recommend following Andrej Karpathy's ["Let's build GPT" video](https://youtu.be/kCc8FmEb1nY) -- it would be too long for us to cover in a seminar, but there are at least three reasons to watch it:

1. Andrej Karpathy's educational materials are always amazing, and all his video tutorials and lectures are 100% worth watching.
2. Here, we implement just the transformer encoder, whereas Andrej Karpathy's video walks you through an implementation of the decoder. His implementation differs from ours at several points, so it is a good way to see how different implementation decisions can be made.
3. The video walks you through training the model as well, not just putting the components together.

But let's walk through our implementation.

Recall, during the lecture, we said that the transformer encoder consists of, first of all, a component that **embeds tokens** (that is, maps a sequence of token IDs to a sequence of vectors) and then a bunch of **encoder layers** that apply to these vectors sequentially to modify these embeddings.

> 💡 We will build our model directly in PyTorch, without using the `transformers` library by HuggingFace. However, the `transformers` library is actually only a wrapper that makes working with PyTorch (or the alternative, TensorFlow) easier, and for models that rely on PyTorch there is not much difference between a `transformers` model and a PyTorch model. In fact, a model that is retrieved using the `transformers` library is an instance of a subclass of PyTorch's `Module` class:
>
> ```python
> >>> from transformers import BertModel
> >>> from torch import nn
> >>> model = BertModel.from_pretrained("bert-base-uncased")
> >>> isinstance(model, nn.Module)
> True
> ```


### 1) Embedding

Let's first write a block that produces these embeddings given the token IDs. We will use the classic BERT model (``bert-base-uncased``) as our guide to the sizes of vectors, dimensions, number of stacked components etc. We can load the configuration of the model to use as a reference:

In [None]:
import torch
from torch import nn
from transformers import AutoConfig

model_ckpt = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_ckpt)

Let's just print out the config and see what is specified there:

In [None]:
config

For instance, since we want to define our embeddings, we need to decide on the size of the vocabulary (what is our range of token IDs going to be?), the max sequence length (what is our range of position IDs going to be for positional embeddings?) and the hidden size (what's the length of embedding vectors?). We will use the values from BERT:

In [None]:
vocab_size = config.vocab_size
hidden_size = config.hidden_size
max_position_embeddings = config.max_position_embeddings

vocab_size, hidden_size, max_position_embeddings

Here below is our Embeddings class. Let's go over what we are doing here. We will need regular token embeddings and positional embeddings (at the end, we add normalization to keep the values of the resulting embeddings under control). We use [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) as a look-up tool to retrieve an embedding vector given a token ID. Here, our embeddings (both regular token embeddings and positional embeddings) are simply random since we don't set them to any particular values when initializing.

In the forward pass, we check the length of the sequence to assign each token a position ID. We then map each token ID to the corresponding embedding from the embeddings table, map each position ID to its positional embedding, sum the two embedding vectors element-wise per token and normalize the result (``nn.LayerNorm`` subtracts the mean from each value and divides by the standard deviation; we ignore some additional details here).

In [None]:
class Embeddings(nn.Module):
  def __init__(self, vocab_size, hidden_size, max_position_embeddings):
    super().__init__()
    self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
    self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
    self.layer_norm = nn.LayerNorm(hidden_size)

  def forward(self, input_ids):
    # Create position IDs for input sequence
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
    # Create token and position embeddings
    token_embeddings = self.token_embeddings(input_ids)
    position_embeddings = self.position_embeddings(position_ids)
    # Combine token and position embeddings
    embeddings = token_embeddings + position_embeddings
    embeddings = self.layer_norm(embeddings)
    return embeddings
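
To make the normalization step concrete, here is a minimal pure-Python sketch of what ``nn.LayerNorm`` computes for a single embedding vector, leaving out the learnable scale and shift. The helper name `layer_norm_by_hand` and the toy vector are ours for illustration; `eps=1e-5` mirrors PyTorch's default.

```python
from math import sqrt
from statistics import fmean, pvariance

def layer_norm_by_hand(vector, eps=1e-5):
    # Hypothetical helper: shift one vector to zero mean and rescale it to (roughly) unit variance
    mean = fmean(vector)
    var = pvariance(vector, mu=mean)  # population variance (divides by N)
    return [(v - mean) / sqrt(var + eps) for v in vector]

toy_embedding = [2.0, 4.0, 6.0, 8.0]  # made-up values, not real BERT embeddings
normalized = layer_norm_by_hand(toy_embedding)
print(normalized)
```

The small `eps` only guards against division by zero when all values in a vector are identical.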

Now we can run a batch of tokenized sentences through this embedding layer and check that the output will have the right shape: it will be a tensor with dimensions ``batch size`` (= number of sentences) times ``sequence length`` times ``length of embedding vectors``.

(Note a simplifying assumption here that all sequences in a batch are the same length! We are not bothering with padding all batch sequences to the same length etc).

In [None]:
emb = Embeddings(vocab_size, hidden_size, max_position_embeddings)
emb(torch.tensor([[1, 2, 5], [1, 3, 8977]])).shape
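
If the look-up itself feels opaque, the following sketch mimics it with plain Python lists for a single sequence: one table row per token ID, one table row per position, summed element-wise. The tables, the toy sizes and the helper `embed_by_hand` are assumptions made for illustration, and the normalization step is left out.

```python
import random

random.seed(0)
toy_vocab_size, toy_hidden, toy_max_pos = 10, 4, 5  # toy sizes, not BERT's
token_table = [[random.random() for _ in range(toy_hidden)] for _ in range(toy_vocab_size)]
position_table = [[random.random() for _ in range(toy_hidden)] for _ in range(toy_max_pos)]

def embed_by_hand(token_ids):
    # Row look-up by token ID plus row look-up by position, summed element-wise
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

vectors = embed_by_hand([1, 2, 5])
print(len(vectors), len(vectors[0]))  # sequence length, hidden size
```

Note that the same token ID ends up with a different vector at a different position, which is exactly what the positional embeddings are for.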

Of course, if we simply tokenize some sentence with the BERT tokenizer and then run our Embeddings layer over the ``input_ids`` that the tokenizer outputs, we will get the expected result:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "time flies like an arrow"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids.shape, emb(inputs.input_ids).shape



> 💡 Note that in the second line the `emb` object is called as if it were a function. That is the way to call the `forward` method of our `Embeddings` module: behind the scenes, PyTorch calls `forward` for you. The `forward` method should not be called directly.
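
One practical reason to call the module rather than `forward`: calling the module goes through `nn.Module.__call__`, which also runs any registered hooks. A small sketch of the difference (the `Doubler` module and the `calls` list are made up for illustration):

```python
import torch
from torch import nn

class Doubler(nn.Module):
    # Hypothetical toy module: multiplies its input by 2
    def forward(self, x):
        return 2 * x

calls = []
doubler = Doubler()
# A forward hook receives (module, inputs, output) after forward has run
doubler.register_forward_hook(lambda module, inputs, output: calls.append("hook ran"))

x = torch.ones(3)
doubler(x)          # goes through __call__, so the hook fires
doubler.forward(x)  # bypasses __call__, so the hook is skipped
print(calls)
```

Only the first of the two calls leaves a trace in `calls`.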


### 2) Multi-head attention

Now that we have our embedder, let's work on the transformer encoder layers. We will need several of them stacked on top of each other. How many of them will we have? Let's check the config:

In [None]:
num_hidden_layers = config.num_hidden_layers
num_hidden_layers

BERT uses 12, we will do the same. Let's build one encoder layer and then stack them. One encoder layer consists of a **self-attention block**, a **feed-forward block** and a bunch of normalizations and residual connections.

**Self-attention**, in turn, consists of multiple attention heads. Let's again stick to what BERT does in terms of the number of attention heads per self-attention block:

In [None]:
num_heads = config.num_attention_heads
num_heads

Let's code one **attention head** and then put 12 of them together for a self-attention block.

We will start with the very core: **scaled dot product attention**. We implement it as a function that takes query vectors, key vectors and value vectors for all tokens in the sequence. It then calculates the dot products between all pairs of query and key vectors (here in a more efficient way, as a batched matrix product, see the [``torch.bmm``](https://pytorch.org/docs/stable/generated/torch.bmm.html) documentation) and divides them by the square root of the vector length, which is the "scaled" part. Then we apply ``softmax`` to these scores to get weights that range between 0 and 1 and sum to 1 per token. Finally, it computes a weighted average of the value vectors: each token is now assigned a vector that is a weighted average of all value vectors, but with different weights for each token.

In [None]:
from math import sqrt

import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
  dim_k = query.size(-1)
  # Scale the scores by sqrt(dim_k) so that they do not grow with the vector length
  scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
  weights = F.softmax(scores, dim=-1)
  result = torch.bmm(weights, value)
  return result
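
The standard formulation of scaled dot-product attention, which the function above is meant to implement, reads as follows. Here $Q$, $K$ and $V$ stack the query, key and value vectors row by row, and $d_k$ is the length of a key vector (`dim_k` in the code):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied row-wise, so each row of weights belongs to one query token.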

Let's now use this function on a toy example. Imagine our input sequence is 3 tokens long. Each of these tokens has a query vector, a key vector and a value vector of length 4 each. The result of applying the ``scaled_dot_product_attention`` function is going to be a sequence of 3 vectors of length 4 each:

In [None]:
from random import random

def rand(n):
  return [random() for _ in range(n)] # a list of n random numbers

query = torch.FloatTensor([[rand(4), rand(4), rand(4)]])
key = torch.FloatTensor([[rand(4), rand(4), rand(4)]])
value = torch.FloatTensor([[rand(4), rand(4), rand(4)]])

scaled_dot_product_attention(query, key, value)
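
To see the individual steps without any tensor machinery, here is a pure-Python sketch of the same computation for one sequence, including the scaling by the square root of the key length. The helpers `softmax` and `attention_by_hand` and the toy numbers are ours for illustration.

```python
from math import exp, sqrt

def softmax(scores):
    # Turn a list of scores into positive weights that sum to 1
    exps = [exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_by_hand(queries, keys, values):
    # Hypothetical helper: one sequence, vectors given as plain lists
    dim_k = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(dim_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy 2-token sequence with vectors of length 2 (made-up numbers)
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs = attention_by_hand(queries, keys, values)
print(outputs)
```

Each token's query matches its own key best, so its output leans towards its own value vector while still mixing in the other one.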

Each attention head starts by projecting the token embeddings to queries, keys and values. The query, key and value vectors are shorter than the embedding vector: transformer models typically have multiple attention heads in one attention layer, and the outputs of the heads are simply concatenated before being passed on to the next layer. So, if the embedding vectors are of size 768 and there are 12 attention heads, the query / key / value vectors are of size 64 (we concatenate 12 vectors of 64 numbers and get a vector of size 768 again). So, here is one **attention head**:

In [None]:
class AttentionHead(nn.Module):
  def __init__(self, embed_dim, head_dim):
      super().__init__()
      self.q = nn.Linear(embed_dim, head_dim)
      self.k = nn.Linear(embed_dim, head_dim)
      self.v = nn.Linear(embed_dim, head_dim)

  def forward(self, hidden_state):
    attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
    return attn_outputs

Now we put 12 attention heads together to create multi-head attention:

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, embed_dim, num_heads):
    super().__init__()
    head_dim = embed_dim // num_heads
    self.heads = nn.ModuleList([AttentionHead(embed_dim, head_dim) for _ in range(num_heads)])
    self.output_linear = nn.Linear(embed_dim, embed_dim)

  def forward(self, hidden_state):
    x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
    x = self.output_linear(x)
    return x

Let's check if it works!

In [None]:
multihead_attn = MultiHeadAttention(hidden_size, num_heads)
attn_output = multihead_attn(emb(inputs.input_ids))
attn_output.size()
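
As a back-of-the-envelope check on the sizes involved, the sketch below counts the parameters of one multi-head attention block with plain arithmetic, using the 768 and 12 quoted in the text. The helper `linear_params` and the variable names are ours for illustration.

```python
embed_dim, n_heads = 768, 12       # BERT-base values quoted in the text
head_dim = embed_dim // n_heads    # 64

def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of an nn.Linear(n_in, n_out)
    return n_in * n_out + n_out

per_head = 3 * linear_params(embed_dim, head_dim)  # q, k and v projections
output_proj = linear_params(embed_dim, embed_dim)  # output_linear
total = n_heads * per_head + output_proj
print(head_dim, per_head, total)  # 64 147648 2362368
```

If the classes are defined as shown above, `sum(p.numel() for p in multihead_attn.parameters())` should report the same total.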

### 3) Feed-forward block

We are done with the attention part of the encoder block! Now, let's put together the **feedforward block**, and we can then arrange these two components together with all the remaining details such as layer normalization and skip connections. The feedforward layer simply linearly projects embeddings to a bigger space and then squeezes them back to the same size as before, through a non-linear unit. Again, we use the same intermediate size as the BERT model we are using as our reference.

In [None]:
intermediate_size = config.intermediate_size
print(intermediate_size)

class FeedForward(nn.Module):
  def __init__(self, hidden_size, intermediate_size):
    super().__init__()
    self.linear_1 = nn.Linear(hidden_size, intermediate_size)
    self.linear_2 = nn.Linear(intermediate_size, hidden_size)
    self.gelu = nn.GELU()

  def forward(self, x):
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    return x

As usual, let's check if it works:

In [None]:
ff = FeedForward(hidden_size, intermediate_size)
ff_output = ff(emb(inputs.input_ids))
ff_output.size()
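
The non-linear unit used here, GELU, can be written out by hand. The sketch below uses the exact formulation based on the error function, which is the default behaviour of ``nn.GELU``; the helper name `gelu_by_hand` is ours.

```python
from math import erf, sqrt

def gelu_by_hand(x):
    # Exact GELU: x times the standard normal CDF evaluated at x
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu_by_hand(x), 4))
```

Large positive inputs pass through almost unchanged, large negative inputs are pushed towards zero, and the transition in between is smooth rather than a hard cut-off as in ReLU.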

### 4) The encoder

Now we have all the ingredients to put together a transformer encoder layer! We will include skip connections and layer normalization as well (refer to the textbook for the discussion of where to put layer normalization; there are different options):

In [None]:
class TransformerEncoderLayer(nn.Module):
  def __init__(self, hidden_size, num_heads, intermediate_size):
    super().__init__()
    self.layer_norm_1 = nn.LayerNorm(hidden_size)
    self.layer_norm_2 = nn.LayerNorm(hidden_size)
    self.attention = MultiHeadAttention(hidden_size, num_heads)
    self.feed_forward = FeedForward(hidden_size, intermediate_size)

  def forward(self, x):
    # Apply layer normalization and then copy input into query, key, value
    hidden_state = self.layer_norm_1(x)
    # Apply attention with a skip connection
    x = x + self.attention(hidden_state)
    # Apply feed-forward layer with a skip connection
    x = x + self.feed_forward(self.layer_norm_2(x))
    return x

Let's see if we can sequentially run the embedding and one encoder layer to produce the expected result:

In [None]:
tr = TransformerEncoderLayer(hidden_size, num_heads, intermediate_size)
tr_output = tr(emb(inputs.input_ids))
tr_output.size()

Yes!
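
The layer above applies layer normalization before each sublayer, inside the skip connection (often called pre-layer-normalization). The original Transformer paper instead normalizes after adding the residual (post-layer-normalization). Here is a minimal sketch contrasting the two orderings; the helper names `pre_ln_step` and `post_ln_step` and the toy `nn.Linear` sublayer are ours for illustration.

```python
import torch
from torch import nn

def pre_ln_step(x, sublayer, norm):
    # Pre-layer-norm: normalize first, then add the sublayer output to the untouched input
    return x + sublayer(norm(x))

def post_ln_step(x, sublayer, norm):
    # Post-layer-norm: add the residual first, then normalize the sum
    return norm(x + sublayer(x))

torch.manual_seed(0)
norm = nn.LayerNorm(8)
sublayer = nn.Linear(8, 8)   # stand-in for attention or feed-forward
x = torch.randn(1, 3, 8)     # (batch, sequence, hidden), toy sizes

pre_out = pre_ln_step(x, sublayer, norm)
post_out = post_ln_step(x, sublayer, norm)
print(pre_out.shape, post_out.shape)
```

Both variants preserve the shape of the input; they differ in whether the values that flow along the skip connection are normalized or not.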

Finally, here is our TRANSFORMER ENCODER assembled together. First, it embeds the tokens, then it runs them through 12 stacked encoder layers, each of which first runs self-attention and then passes the tokens through a feedforward network. The output is embeddings for each token, but modified along all these intermediate steps.

In [None]:
class TransformerEncoder(nn.Module):
  def __init__(self, vocab_size, hidden_size, max_position_embeddings, num_heads, intermediate_size, num_layers=num_hidden_layers):
    super().__init__()
    self.embeddings = Embeddings(vocab_size, hidden_size, max_position_embeddings)
    # num_layers defaults to the BERT value read from the config above
    self.layers = nn.ModuleList([TransformerEncoderLayer(hidden_size, num_heads, intermediate_size) for _ in range(num_layers)])

  def forward(self, x):
    x = self.embeddings(x)
    for layer in self.layers:
      x = layer(x)
    return x

Let's make sure it works. It does!

In [None]:
encoder = TransformerEncoder(vocab_size, hidden_size, max_position_embeddings, num_heads, intermediate_size)
encoder_output = encoder(inputs.input_ids)
encoder_output.size()

This encoder model is not trained -- so you cannot really use it to produce meaningful text representations. But you can use it to get a clearer intuition about the building blocks of the model and how information flows between the components of the transformer. We hope it helps!

## PART II: Visualizing attention

We are going to use the `bertviz` library to visualize internals of [bert-base-uncased](https://), a pretrained BERT model. We are going to input a short text into the model, activating all 12 attention layers:

> *I called Ian. I got his answering machine.*

In this text we find, among other linguistic phenomena, a co-referential relation between 'Ian' and 'his': the two words refer to the same person. Also, the combination 'answering machine' is a strong 2-word collocation. In terms of attention, one would expect both phenomena to be picked up somewhere in the attention layers.

In [None]:
! pip install bertviz

In [None]:
# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True, attn_implementation="eager")
tokenizer = BertTokenizer.from_pretrained(model_version)
sentence_a = "I called Ian."
sentence_b = "I got his answering machine."
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
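
With `output_attentions=True`, the model returns the attention weights as a tuple with one tensor per layer, each of shape `(batch, heads, sequence, sequence)`. As a complement to the visual exploration, here is a sketch of how one could rank heads by the weight between two token positions. The helper `strongest_heads` is hypothetical (it is not part of `bertviz` or `transformers`), and the demo runs on random stand-in weights rather than on the real `attention` object.

```python
import torch

def strongest_heads(attention, from_idx, to_idx, top_k=3):
    # Hypothetical helper: rank (layer, head) pairs by the weight from one token position to another
    stacked = torch.stack(attention)[:, 0]      # (layers, heads, seq, seq), batch index 0
    weights = stacked[:, :, from_idx, to_idx]   # (layers, heads)
    heads_per_layer = weights.size(1)
    top = torch.topk(weights.flatten(), top_k)
    return [(int(i) // heads_per_layer, int(i) % heads_per_layer, float(w))
            for w, i in zip(top.values, top.indices)]

# Synthetic stand-in with BERT-base-like shapes: 12 layers, 12 heads, 6 tokens
torch.manual_seed(0)
fake_attention = tuple(torch.softmax(torch.randn(1, 12, 6, 6), dim=-1) for _ in range(12))
ranking = strongest_heads(fake_attention, from_idx=4, to_idx=2)
print(ranking)  # list of (layer, head, weight), both indices zero-based
```

With the real model, you would pass `attention` together with the positions of the two tokens of interest in `tokens`.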

We are going to explore three views offered by `bertviz`: the 'head view', the 'model view', and the 'neuron view'. Although it is far from trivial to interpret what is going on in all the attention layers and their heads, we will nonetheless try to do so in this seminar's exercises.

# Exercise A

Explore the 'head view'. Given a selected attention layer, double clicking on the colored tiles for each attention head (selecting only the attention weights for that head), try to find strong attention weights between the token position for 'Ian' and for 'his', and between 'answering' and 'machine'. Once you find examples of both, also track them in the 'model' and 'neuron' view.

# Exercise B

Create a different text with the same types of phenomena: a co-referential relation between a male first name and "his", and a strong 2-word collocation. Are the same heads paying attention to the same phenomena again?

# Exercise C

Create short texts (from just a few words to longer multi-sentence texts) with specific linguistic phenomena, such as

*   Agreement over multi-word distances ("We never seem to agree", between "We" and "agree")
*   Ambiguity ("Time flies like an arrow")
*   Grammatical errors ("I sees a man")

And do Exercise A again.


# Head View
<b>The head view visualizes attention in one or more heads from a single Transformer layer.</b> Each line shows the attention from one token (left) to another (right). Line weight reflects the attention value (ranges from 0 to 1), while line color identifies the attention head. When multiple heads are selected (indicated by the colored tiles at the top), the corresponding  visualizations are overlaid onto one another.

## Usage
👉 **Hover** over any **token** on the left/right side of the visualization to filter attention from/to that token. <br/>
👉 **Double-click** on any of the **colored tiles** at the top to filter to the corresponding attention head.<br/>
👉 **Single-click** on any of the **colored tiles** to toggle selection of the corresponding attention head. <br/>
👉 **Click** on the **Layer** drop-down to change the model layer (zero-indexed).


In [None]:
head_view(attention, tokens, sentence_b_start)

# Model View
<b>The model view provides a bird's-eye view of attention throughout the entire model</b>. Each cell shows the attention weights for a particular head, indexed by layer (row) and head (column). The lines in each cell represent the attention from one token (left) to another (right), with line weight proportional to the attention value (ranges from 0 to 1).

## Usage
👉 **Click** on any **cell** for a detailed view of attention for the associated attention head (or to unselect that cell). <br/>
👉 Then **hover** over any **token** on the left side of detail view to filter the attention from that token.

In [None]:
model_view(attention, tokens, sentence_b_start)

# Neuron View
<b>The neuron view visualizes the intermediate representations (e.g. query and key vectors) that are used to compute attention.</b> In the collapsed view (initial state), the lines show the attention from each token (left) to every other token (right). In the expanded view, the tool traces the chain of computations that produce these attention weights.

## Usage
👉 **Hover** over any of the tokens on the left side of the visualization to filter attention from that token.<br/>
👉 Then **click** on the **plus** icon that is revealed when hovering. This exposes the query vectors, key vectors, and other intermediate representations used to compute the attention weights. Each color band represents a single neuron value, where color intensity indicates the magnitude and hue the sign (blue=positive, orange=negative).<br/>
👉 Once in the expanded view, **hover** over any other **token** on the left to see the associated attention computations.<br/>
👉 **Click** on the **Layer** or **Head** drop-downs to change the model layer or head (zero-indexed).


In [None]:
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model_type = 'bert'
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=4, head=3)