# Transformers

The goal of this module is to understand the **Transformer architecture** from its core component: the **Self-Attention mechanism**. Grasping this architecture is essential for comprehending state-of-the-art models in Natural Language Processing (NLP), including the Large Language Models (LLMs) driving recent advances.

### Text Processing with Tokens

Before a Transformer model can process any text, the raw string must be converted into a numerical sequence of inputs that the model understands. This process is called **tokenization**, and the resulting numerical units are **tokens**.

A **token** is the fundamental unit of text: it is the smallest block of text that is passed to a model. Tokens serve two purposes:
1.  **Semantic Unit:** They represent meaningful chunks of language, such as words, punctuation, or parts of a word.
2.  **Numerical ID:** Each unique token is mapped to a unique integer ID, which is then used to look up its corresponding **embedding vector** (the dense numerical representation fed into the Transformer).

#### How are Tokens Computed

Modern NLP uses sophisticated tokenization methods that strike a balance between vocabulary size and sequence length.

- **Word-Level Tokenization:** The simplest approach splits text by spaces, treating each word as a token.
    * *Problem:* Leads to a massive vocabulary (e.g., one token for "run," another for "running," another for "runs").
- **Character-Level Tokenization:** Splits text into individual characters.
    * *Problem:* Generates very long input sequences, making the Transformer less efficient.
- **Subword Tokenization (The Standard):** This approach, used by most LLMs (like GPT or BERT), solves both problems via methods like [**Byte Pair Encoding (BPE)**](https://en.wikipedia.org/wiki/Byte-pair_encoding) or [**WordPiece**](https://huggingface.co/learn/llm-course/en/chapter6/6):
    1.  **Splitting:** Common words (like "the," "is") are treated as single tokens.
    2.  **Breaking Down:** Rare or complex words are broken down into common subword units. For example, "unbelievable" might be tokenized as \["un", "believ", "able"\].

This subword approach ensures that every word can be represented (even misspelled or rare words) without the vocabulary size exploding, and it keeps sequence lengths manageable.
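The splitting behaviour described above can be sketched with a greedy longest-match tokenizer. This is only a simplified illustration: the vocabulary `toy_vocab` and the function `greedy_subword_tokenize` are hypothetical, and real BPE or WordPiece tokenizers learn their vocabulary from a corpus and mark word-internal pieces differently (e.g. the `##` prefix in WordPiece).

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: the whole word is mapped to the unknown token
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration
toy_vocab = {"un", "believ", "able", "the", "is", "run", "ning", "s"}

print(greedy_subword_tokenize("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(greedy_subword_tokenize("running", toy_vocab))       # ['run', 'ning']
print(greedy_subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Even with a tiny vocabulary, unseen words are still representable as long as their pieces are known, which is the property that keeps the vocabulary size bounded.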

#### Token to Embedding

Once the text is tokenized:

1.  The sequence of text tokens is converted into a sequence of integer IDs.
2.  These IDs are passed to the model's **Embedding Layer**, which acts as a lookup table mapping tokens to $\mathbf{d}_{model}$-dimensional vectors called embeddings.
3.  The layer returns the corresponding sequence of embeddings, which form the **Input Matrix ($\mathbf{X}$)** for the Self-Attention mechanism.
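As a minimal sketch of these three steps, the snippet below maps tokens to IDs and looks them up in an embedding table. The vocabulary `token_to_id` and the dimension `d_model = 8` are assumptions made for readability; a real model uses the vocabulary of its tokenizer and a much larger dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary mapping each token to an integer ID
token_to_id = {"[UNK]": 0, "un": 1, "believ": 2, "able": 3}
d_model = 8  # embedding dimension, kept tiny for readability

tokens = ["un", "believ", "able"]
ids = torch.tensor([token_to_id.get(t, token_to_id["[UNK]"]) for t in tokens])

# The embedding layer is a learnable lookup table of shape (vocab_size, d_model)
embedding = nn.Embedding(num_embeddings=len(token_to_id), embedding_dim=d_model)
X = embedding(ids)  # input matrix X of shape (n, d_model)

print(ids.tolist())  # [1, 2, 3]
print(X.shape)       # torch.Size([3, 8])
```

Row $i$ of $\mathbf{X}$ is simply the row of the embedding table indexed by the ID of token $i$.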

## Self-Attention Mechanism

**Self-Attention** is a mechanism that allows a model to weigh the importance and relevance of **all other elements** in an input sequence when encoding a single element. It is a powerful way for the model to capture context and dependencies regardless of the distance between tokens in the sequence.

##### Input representation

The input to the Transformer is a sequence of tokens (words, sub-words, etc.), which is then passed through the embedding layer to get the Input Matrix of shape $(n \times \mathbf{d}_{model})$.
For each output element $\mathbf{y}_i$, the model then calculates an **attention score** to determine how much information from $\mathbf{x}_j$ (where $j=1$ to $n$) should flow into $\mathbf{y}_i$.


##### Query, Key and Value vectors

For the attention calculation, the input matrix $\mathbf{X}$ is projected into three distinct, lower-dimensional representation spaces using three separate learned weight matrices: $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$, and $\mathbf{W}^{V}$.

- The **Query** matrix $\mathbf{Q} = \mathbf{X} \mathbf{W}^{Q}$ of shape $(n \times d_k)$, with $\mathbf{W}^{Q}$ of shape $(\mathbf{d}_{\text{model}} \times d_k)$, is used to query or seek context
- The **Key** matrix $\mathbf{K} = \mathbf{X} \mathbf{W}^{K}$ of shape $(n \times d_k)$ (same as the query), with $\mathbf{W}^{K}$ of shape $(\mathbf{d}_{\text{model}} \times d_k)$, is used to label or categorize context
- The **Value** matrix $\mathbf{V} = \mathbf{X} \mathbf{W}^{V}$ of shape $(n \times d_v)$, with $\mathbf{W}^{V}$ of shape $(\mathbf{d}_{\text{model}} \times d_v)$, contains the actual information to be passed through the network

##### Attention score computation

The first step in calculating attention is to determine the **relevance** between every query vector ($\mathbf{q}_i$) and every key vector ($\mathbf{k}_j$). This is done via a dot product.

The matrix of raw attention scores is computed as:

$$\mathbf{A}_{raw} = \mathbf{Q} \mathbf{K}^T$$

The scores are then divided by the square root of the key dimension, $\sqrt{d_k}$. This **scaling** prevents the dot products from growing too large, which would push the softmax function into regions where gradients are extremely small (saturation), hindering stable training.

$$\mathbf{A}_{scaled} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}$$
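To see why the scaling matters, here is a small numerical check. It assumes that the components of the query and key vectors are independent with zero mean and unit variance, which is an idealization: under that assumption the dot product has a standard deviation of about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import torch

torch.manual_seed(0)
d_k = 64
num_samples = 10_000

# Random queries and keys with zero mean and unit variance per component
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

raw = (q * k).sum(dim=-1)   # one dot product per (q, k) pair
scaled = raw / d_k ** 0.5   # divide by sqrt(d_k)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
print(f"std of raw scores:    {raw_std:.2f}")     # close to sqrt(d_k) = 8
print(f"std of scaled scores: {scaled_std:.2f}")  # close to 1
```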

The scaled attention scores are then normalized using the **Softmax** function along the sequence dimension (rows of $\mathbf{A}_{scaled}$). This converts the scores into a probability distribution, ensuring that the weights for any single token sum to 1.

$$\mathbf{A}_{weights} = \operatorname{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}\right)$$

The final output is computed by taking the weighted sum of the $\mathbf{V}$ vectors, using the normalized attention weights ($\mathbf{A}_{weights}$) as the coefficients.

$$\mathbf{Z} = \mathbf{A}_{weights} \mathbf{V}$$

The resulting matrix $\mathbf{Z}$ has the shape $(n \times d_v)$. Each row $\mathbf{z}_i$ is the new, context-aware representation of the original token $\mathbf{x}_i$.

Combining the steps, the basic **Scaled Dot-Product Attention** formula is:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_{k}}}\right) \mathbf{V}$$
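The formula translates almost line by line into code. The sketch below works on a single sequence, without batch dimension or masking, and uses random matrices `W_q`, `W_k`, `W_v` as stand-ins for the learned projections; the sizes are arbitrary choices for illustration.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (n, d_k), V: (n, d_v) -> Z: (n, d_v), weights: (n, n)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # scaled scores, shape (n, n)
    weights = torch.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
n, d_model, d_k, d_v = 4, 8, 3, 5
X = torch.randn(n, d_model)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Z, A = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(Z.shape, A.shape)  # torch.Size([4, 5]) torch.Size([4, 4])
print(A.sum(dim=-1))     # every entry is (numerically) 1
```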

## Multi-Head Attention

In practice, to allow the model to focus on different types of relationships simultaneously (e.g., syntactical dependency vs. semantic meaning), the attention calculation is parallelized across $h$ different **heads**.

1.  **Multiple Projections:** Instead of one set of $\mathbf{W}^{Q}, \mathbf{W}^{K}, \mathbf{W}^{V}$, we use $h$ sets: $\mathbf{W}^{Q}_i, \mathbf{W}^{K}_i, \mathbf{W}^{V}_i$ for $i=1$ to $h$.
2.  **Parallel Attention:** Each head $\text{head}_i$ computes its own attention output $\mathbf{Z}_i$.
3.  **Concatenation & Final Projection:** The outputs from all $h$ heads are concatenated back together and then passed through a final learned linear transformation ($\mathbf{W}^{O}$) to project the result back into the required output dimension.

The formula for the $i$-th head is:
$$\text{head}_i = \text{Attention}(\mathbf{X} \mathbf{W}_{i}^{Q}, \mathbf{X} \mathbf{W}_{i}^{K}, \mathbf{X} \mathbf{W}_{i}^{V})$$

The final **Multi-Head Attention** output is:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h) \mathbf{W}^{O}$$

<p style="text-align:center">
    <img src="https://wikidocs.net/images/page/159310/mha_img_original.png" alt="Multi-head attention mechanism"/>
</p>

## The Full Transformer Architecture: Encoder and Decoder

The complete Transformer architecture consists of stacked layers of Self-Attention mechanisms and Feed-Forward networks. These layers are organized into two main components: the **Encoder** and the **Decoder**.

### Encoder (Contextualization)

The **Encoder** is responsible for taking the input sequence (e.g., a sentence) and generating a rich, contextual representation for **every token**.
It consists of a stack of identical layers, each containing a **Multi-Head Self-Attention** sub-layer and a Feed-Forward sub-layer.

The Encoder processes the input text **bidirectionally**, allowing each token to attend to *all* other tokens (past and future) in the input sequence.

### Decoder (Generation)

The **Decoder** is responsible for taking the contextualized representation from the Encoder and generating the output sequence one token at a time.
Each Decoder layer has three sub-layers: a **Masked Multi-Head Self-Attention** sub-layer, a **Cross-Attention** sub-layer, and a Feed-Forward sub-layer.

The Masked Self-Attention ensures the Decoder is **unidirectional** (auto-regressive), meaning each position can only attend to tokens that have **already been generated** (or the start-of-sequence token). In practice, this is achieved by setting the scaled attention scores of future positions to $-\infty$ before applying the softmax, which forces their attention weights to $0$.
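A minimal sketch of this causal masking, using random numbers as a stand-in for the scaled scores of a sequence of 4 tokens:

```python
import torch

torch.manual_seed(0)
n = 4
scores = torch.randn(n, n)  # stand-in for the scaled scores Q K^T / sqrt(d_k)

# Lower-triangular matrix: position i may attend to positions j <= i only
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)  # entries above the diagonal are exactly 0
```

The first row has a single non-zero weight (the first token can only attend to itself), and every row still sums to 1.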

The **Cross-Attention** sub-layer allows the decoding process to selectively focus on the most relevant parts of the **entire input sequence** processed by the Encoder.

Cross-Attention uses the same core attention formula, $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$, but draws its three components from **two different sources**: the Query vector comes from the output of the previous Masked Self-Attention layer of the Decoder, whereas the Key and Value vectors come from the final output of the entire Encoder stack.
The output is a new context vector for the Decoder that is highly relevant to both the previously generated sequence and the original input sequence.
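The shapes involved are easier to follow on a small example. In the sketch below, the source and target lengths differ (`n_src = 6`, `n_tgt = 3`), the tensors are random stand-ins for the real Encoder and Decoder activations, and the projection matrices are random rather than learned.

```python
import torch

torch.manual_seed(0)
n_src, n_tgt, d_model, d_k, d_v = 6, 3, 8, 4, 4

encoder_output = torch.randn(n_src, d_model)  # final output of the Encoder stack
decoder_state = torch.randn(n_tgt, d_model)   # output of the Decoder's masked self-attention

W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q = decoder_state @ W_q   # (n_tgt, d_k): queries come from the Decoder
K = encoder_output @ W_k  # (n_src, d_k): keys come from the Encoder
V = encoder_output @ W_v  # (n_src, d_v): values come from the Encoder

weights = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (n_tgt, n_src)
context = weights @ V                                   # (n_tgt, d_v)

print(weights.shape, context.shape)  # torch.Size([3, 6]) torch.Size([3, 4])
```

Each of the 3 target positions gets its own distribution over the 6 source positions, so the attention matrix is rectangular rather than square.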

### Different Architectures for Different Tasks

The need for both an Encoder and a Decoder depends entirely on the task type:
- **Encoder-Only Models** (like BERT) are used when the goal is to fully understand a given input sequence (e.g., classifying a movie review as positive/negative). Since no new text is generated, the Decoder is unnecessary.
- **Decoder-Only Models** (like GPT) are used for generative tasks and form the basis of most **Large Language Models**. They are trained to predict the next token in a sequence based only on the preceding tokens, making a separate Encoder unnecessary. They rely entirely on their Masked Self-Attention to build context.
- **Encoder-Decoder Models** are reserved for tasks like Machine Translation, where a source sequence (Encoder) is read, and a distinct target sequence (Decoder) is generated.


### References
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
- [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)
- [Self-Attention in NLP](https://www.youtube.com/watch?v=5vcj8kSwBCY)

<p style="text-align:center">
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*BHzGVskWGS_3jEcYYi6miQ.png" alt="Multi-head attention mechanism"/>
</p>

# Attention Implementation

In [1]:
from typing import Optional

import torch
import torch.nn as nn

In [2]:
# For all the transformer exercises, you will use the following tensor as an example
dummy_tensor = torch.randn((16, 32, 64)) # A tensor of shape (batch_size, sequence_length, embedding_size)

# You can think of it as a batch of 16 sentences of 32 tokens, each token being represented by a 64-dimensional embedding vector

<div class="alert alert-block alert-info">
    
<b> Exercise 3.1: </b>
* Implement the self-attention mechanism
</div>

In [3]:
class SelfAttention(nn.Module): 
    """
    Implements a single head of the Scaled Dot-Product Self-Attention mechanism.
    """
    def __init__(self, input_dim: int, dim_head: int):
        super().__init__()
        self.input_dim = input_dim
        self.dim_head = dim_head
        self.scale_factor = dim_head ** -0.5  # Scaling factor: 1 / sqrt(d_k)

        # Q, K, V projections are linear layers (weights W^Q, W^K, W^V)
        self.query = nn.Linear(input_dim, dim_head, bias=False)
        self.key = nn.Linear(input_dim, dim_head, bias=False)
        self.value = nn.Linear(input_dim, dim_head, bias=False)

        # Softmax is applied on the last dimension (the key/sequence dimension)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sequence: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        sequence: tensor of shape (batch_size, sequence_length, input_dim)
        attention_mask: tensor of shape (batch_size, sequence_length)
        """
        # Compute Q, K and V
        Q = self.query(sequence)
        K = self.key(sequence)
        V = self.value(sequence)

        # Compute the scaled attention scores Q x K^T
        raw_score = torch.matmul(Q, K.transpose(-2, -1)) # K^T is K transposed on the last two dims (not on the batch dimension)
        scaled_score = raw_score * self.scale_factor
        
        # Apply Mask if provided
        if attention_mask is not None:
            # The mask needs to be broadcastable to the attention score tensor (B, N, N)
            # The input mask is (B, N), we need to expand it to (B, 1, N) to mask the last dim
            mask = attention_mask.unsqueeze(1).bool()
            mask_value = torch.finfo(scaled_score.dtype).min  # most negative finite value, acts as -inf
            scaled_score.masked_fill_(~mask, mask_value) 

        # Apply Softmax
        attention_score = self.softmax(scaled_score)

        # Compute head output
        head = torch.matmul(attention_score, V)

        return head, attention_score

In [4]:
self_attention_layer = SelfAttention(input_dim=64, dim_head=16)
head, attention_score = self_attention_layer(dummy_tensor)
print(head.shape, attention_score.shape)

torch.Size([16, 32, 16]) torch.Size([16, 32, 32])


<div class="alert alert-block alert-info">
    
<b> Exercise 3.2: </b>
* Implement the multihead attention mechanism
</div>

In [5]:
class MultiHeadAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, dim_head: int = 64):
        super().__init__()
        inner_dim = dim_head * heads  # total dimension after combining all heads
        self.heads = heads
        self.scale_factor = dim_head ** -0.5  # Scaling factor: 1 / sqrt(d_k)

        # Projections W^Q, W^K, W^V are combined into single layers of size (dim -> inner_dim)
        # We split the heads later via reshaping.
        self.query = nn.Linear(dim, inner_dim, bias=False)
        self.key = nn.Linear(dim, inner_dim, bias=False)
        self.value = nn.Linear(dim, inner_dim, bias=False)

        self.softmax = nn.Softmax(dim=-1)

        # Final output projection W^O
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None):
        h = self.heads
        B, N, D = x.shape  # Batch, Sequence length, Model dimension

        # Compute Q, K, V
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        
        # Split Heads and Transpose
        # Reshape to (B, N, heads, dim_head) -> Transpose to (B, heads, N, dim_head)
        # This makes the attention calculation treat the 'heads' dimension independently
        Q = Q.reshape(B, N, h, -1).transpose(1, 2)
        K = K.reshape(B, N, h, -1).transpose(1, 2)
        V = V.reshape(B, N, h, -1).transpose(1, 2)

        # Compute the scaled attention scores (Q @ K^T)
        scaled_score = torch.matmul(Q, K.transpose(-2, -1)) * self.scale_factor  # shape: (B, heads, N, N)

        # Apply Mask
        if mask is not None:
            # The mask must be broadcastable to (B, heads, N, N)
            # From a (B, N) mask, we unsqueeze twice to get (B, 1, 1, N)
            mask = mask.unsqueeze(1).unsqueeze(2).bool()
            mask_value = torch.finfo(scaled_score.dtype).min
            scaled_score.masked_fill_(~mask, mask_value)

        # Apply Softmax
        attention_score = self.softmax(scaled_score)

        # Compute head output
        out = torch.matmul(attention_score, V)  # shape: (B, heads, N, dim_head)

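        # Merge heads back: (B, heads, N, dim_head) -> (B, N, heads * dim_head),
        # then apply the final linear projection W^O
        out = self.to_out(out.transpose(1, 2).reshape(B, N, -1))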

        return out, attention_score

In [6]:
multihead_attention_layer = MultiHeadAttention(64, 8, 64)
out, attention_score = multihead_attention_layer(dummy_tensor)
print(out.shape, attention_score.shape)

torch.Size([16, 32, 64]) torch.Size([16, 8, 32, 32])
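A point that is easy to get wrong in multi-head attention is how the heads are merged back after the attention computation. The small check below (with arbitrary toy sizes) illustrates that splitting into heads and merging are only inverse operations if the head and sequence axes are transposed back before flattening.

```python
import torch

B, N, heads, dim_head = 2, 3, 4, 5
x = torch.arange(B * N * heads * dim_head, dtype=torch.float32).reshape(B, N, heads * dim_head)

# Split: (B, N, heads * dim_head) -> (B, heads, N, dim_head)
split = x.reshape(B, N, heads, dim_head).transpose(1, 2)

# Merge: transpose back to (B, N, heads, dim_head) before flattening the last two dims
merged = split.transpose(1, 2).reshape(B, N, heads * dim_head)
print(torch.equal(merged, x))  # True

# Reshaping without the transpose keeps the shape but mixes tokens and heads
wrong = split.reshape(B, N, heads * dim_head)
print(torch.equal(wrong, x))   # False
```

Both variants produce a tensor of the same shape, so a missing transpose raises no error and can only be detected by checking the values.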


<div class="alert alert-block alert-info">
    
<b> Exercise 3.3: </b>
* Implement the transformer encoder block
</div>

In [7]:
class TransformerEncoderBlock(nn.Module):
    """
    Implements a single block of the Transformer Encoder.
    """
    def __init__(self, dim: int, heads: int = 8, dim_head: int = 64):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        
        self.mha = MultiHeadAttention(dim, heads, dim_head)
        
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None):
        # Multi-Head Attention Sub-layer
        attention_output, _ = self.mha(self.ln1(x), mask=mask)
        x = x + attention_output  # Residual connection

        # Feed-Forward (MLP) Sub-layer
        mlp_output = self.mlp(self.ln2(x))
        x = x + mlp_output  # Residual connection
        
        return x

In [8]:
encoder_block = TransformerEncoderBlock(64, 8, 64)
out = encoder_block(dummy_tensor)
print(out.shape)

torch.Size([16, 32, 64])


<div class="alert alert-block alert-info">
    
<b> (Optional) Exercise 3.4: </b>
* Implement the transformer decoder block
* Combine the encoder and decoder to create a transformer
</div>

In [9]:
class CrossAttention(MultiHeadAttention):
    """
    Simulates the Cross-Attention layer by inheriting MHA structure.
    It takes the encoder output (z) to generate K and V, and the decoder 
    output (x) to generate Q.
    """
    def forward(self, x: torch.Tensor, z: torch.Tensor, mask: Optional[torch.Tensor] = None) :
        B, N_x, D = x.shape
        h = self.heads

        # Q is computed from the Decoder's output (x)
        Q = self.query(x)
        # K and V are computed from the Encoder's output (z)
        K = self.key(z)
        V = self.value(z)

        N_z = z.shape[1]
        Q = Q.reshape(B, N_x, h, -1).transpose(1, 2)
        K = K.reshape(B, N_z, h, -1).transpose(1, 2)
        V = V.reshape(B, N_z, h, -1).transpose(1, 2)

        scaled_score = torch.matmul(Q, K.transpose(-2, -1)) * self.scale_factor

        if mask is not None:
            mask = mask.unsqueeze(1).unsqueeze(2).bool()
            mask_value = torch.finfo(scaled_score.dtype).min
            scaled_score.masked_fill_(~mask, mask_value)
        
        attention_scores = self.softmax(scaled_score)
        out = torch.matmul(attention_scores, V)

        # Merge heads back: (B, heads, N_x, dim_head) -> (B, N_x, heads * dim_head)
        return self.to_out(out.transpose(1, 2).reshape(B, N_x, -1))


class TransformerDecoderBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8, dim_head: int = 64):
        super().__init__()
        
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)
        
        self.masked_mha = MultiHeadAttention(dim, heads, dim_head) 
        self.cross_attn = CrossAttention(dim, heads, dim_head) 
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None, cross_attn_mask: Optional[torch.Tensor] = None):
        """
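        x: Decoder input (output from previous decoder layer)
        z: Encoder output (output of the final encoder block)
        self_attn_mask: Mask for self-attention, shape (B, N_tgt). MultiHeadAttention expands it
            to (B, 1, 1, N_tgt), so a causal (N_tgt, N_tgt) look-ahead mask is not supported as-is
        cross_attn_mask: Mask for cross-attention, shape (B, N_src) (e.g., padding mask from encoder input)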
        """
        attention_output, _ = self.masked_mha(self.ln1(x), mask=self_attn_mask)
        x = x + attention_output
        
        cross_output = self.cross_attn(self.ln2(x), z, mask=cross_attn_mask)
        x = x + cross_output
        
        mlp_output = self.mlp(self.ln3(x))
        x = x + mlp_output
        
        return x

In [10]:
encoder_block = TransformerEncoderBlock(64, 8, 64)
z = encoder_block(dummy_tensor)
decoder_block = TransformerDecoderBlock(64, 8, 64)
out = decoder_block(dummy_tensor, z)
print(out.shape)

torch.Size([16, 32, 64])


In [11]:
class Transformer(nn.Module):
    """
    The full Encoder-Decoder Transformer architecture.
    
    This class stacks the Encoder and Decoder blocks to perform sequence-to-sequence tasks.
    It takes an input sequence (src) and a target sequence (tgt) and uses the Encoder
    output (z) as the K/V source for the Decoder's cross-attention.
    """
    def __init__(self, dim: int = 512, depth: int = 6, heads: int = 8, dim_head: int = 64):
        super().__init__()
        
        # Encoder
        self.encoder_stack = nn.ModuleList([
            TransformerEncoderBlock(dim=dim, heads=heads, dim_head=dim_head)
            for _ in range(depth)
        ])
        
        # Decoder
        self.decoder_stack = nn.ModuleList([
            TransformerDecoderBlock(dim=dim, heads=heads, dim_head=dim_head)
            for _ in range(depth)
        ])

        self.final_ln = nn.LayerNorm(dim)

    def forward(
        self, 
        src: torch.Tensor, 
        tgt: torch.Tensor, 
        src_mask: Optional[torch.Tensor] = None, 
        tgt_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Args:
            src: Source sequence (e.g., input sentence). Shape (B, N_src, D).
            tgt: Target sequence (e.g., partially generated output). Shape (B, N_tgt, D).
            src_mask: Padding mask for the source sequence (used in encoder and cross-attention).
            tgt_mask: Causal mask for the target sequence (used in decoder self-attention).

        Returns:
            The final output representation from the decoder. Shape (B, N_tgt, D).
        """
        
        # Encoder Forward pass
        z = src
        for encoder_block in self.encoder_stack:
            z = encoder_block(z, mask=src_mask)
            
        # Decoder Forward pass
        y = tgt
        for decoder_block in self.decoder_stack:
            y = decoder_block(x=y, z=z, self_attn_mask=tgt_mask, cross_attn_mask=src_mask)
        
        output = self.final_ln(y)
        return output

An LLM is essentially a deep stack of Transformer layers (encoder, decoder, or both, depending on the architecture), usually with a huge number of parameters and a wide feed-forward dimension.

# BERT (Bidirectional Encoder Representations from Transformers)

BERT marked a major turning point in NLP as it introduced a novel pre-training strategy that allowed language models to capture deep, bidirectional context, significantly boosting performance across nearly all downstream tasks.

##### What is the Core Purpose of BERT?

The purpose of BERT is to create a **general-purpose "language understanding" model** that is highly effective at encoding the meaning of words based on their full context. It achieves this by being **pre-trained** in an **unsupervised** manner on a massive, general-purpose text corpus (like Wikipedia and BooksCorpus).

The resulting model weights serve as a powerful foundation that can be quickly adapted to specialized tasks.

A **downstream task** refers to a specific, practical NLP application (e.g., question answering, sentiment analysis, named entity recognition...).

**Fine-Tuning** is the process of taking BERT's pre-trained weights and continuing to train the entire model (or just the final layers) on a small, labeled dataset specific to the downstream task. This adaptation allows the model to specialize its general language understanding for the target task, which is far more efficient than training a model from scratch.

### Why BERT Models Outperformed Previous Methods

BERT's superiority comes from its **deep bidirectionality**, a key architectural feature enabled by the Transformer Encoder.
Older models like OpenAI GPT-1 (using Transformer Decoders) or LSTMs were often **unidirectional** (left-to-right). Consider the sentence:   `I made a bank deposit`

In a unidirectional model, when calculating the representation for the word "**bank**," the model only looks at the preceding words ("I," "made," "a") but ignores the succeeding word "**deposit**." It misses the full context, which is crucial for determining the correct meaning of "bank" (e.g., financial institution vs. river bank).

BERT is composed entirely of the **Transformer Encoder** stack. Because Encoder layers use **non-masked Self-Attention**, every token's representation is computed using information from **all other tokens** in the input sequence, from the very first layer up to the last. This makes BERT **deeply bidirectional**.

### How BERT is Pre-Trained

BERT is trained on two **unsupervised** tasks simultaneously on its massive text corpus. The combined training loss is the sum of the loss from both tasks.

#### Task 1: Masked Language Modeling (MLM)

The goal of MLM is to force the model to capture the **bidirectional context** of a word by predicting words that have been intentionally hidden.

1.  **Masking:** Approximately **15%** of the tokens in the input sentence are selected as prediction targets.
2.  **Prediction:** The model must predict the identity of the masked tokens based on the surrounding, unmasked context.

```python
input = 'the man went to the [MASK1] . he bought a [MASK2] of milk'
label = {'[MASK1]': 'store', '[MASK2]': 'gallon'}
```

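**Masking Strategy (The Trick)**: Always replacing the selected tokens with [MASK] would create a mismatch between pre-training and fine-tuning, since the [MASK] token never appears in downstream data. To mitigate this, the $15\%$ of selected tokens are modified as follows: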

  * **80% of the time:** Replace the word with the special [MASK] token (the standard case).
  * **10% of the time:** Replace the word with a random word (forcing the model to decide if the word fits the context).
  * **10% of the time:** Keep the word unchanged (forcing the model to keep generating contextual representations for every token).
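The 80/10/10 rule above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing code: the function `mask_tokens` and the vocabulary `toy_vocab` are hypothetical, each token is selected independently with probability 15%, and special tokens are not treated separately.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is required."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected as a prediction target
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

rng = random.Random(0)  # seeded for reproducibility
sentence = "the man went to the store . he bought a gallon of milk".split()
toy_vocab = sorted(set(sentence))  # hypothetical vocabulary for the random replacement

corrupted, labels = mask_tokens(sentence, toy_vocab, rng)
print(corrupted)
print(labels)
```

The loss is only computed at positions where `labels[i]` is not `None`, whatever the corruption applied to the input at that position.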

#### Task 2: Next Sentence Prediction (NSP)

NSP is designed to give BERT an understanding of the **relationship between two distinct sentences**, which is essential for tasks like Question Answering or Document Classification.

1.  **Input Structure:** Two sentences, Sentence A and Sentence B, are concatenated and separated by the \[SEP\] token.
2.  **Task:** The model must classify whether Sentence B is the actual next sentence that follows Sentence A in the corpus (labeled "**IsNext**") or if it is a random, unrelated sentence (labeled "**NotNext**").

$$\text{[CLS] The rain stopped pouring [SEP] The sun started to shine [SEP]} \quad \Rightarrow \text{IsNext}$$
$$\text{[CLS] The capital of France is Paris [SEP] I am going to buy a cat [SEP]} \quad \Rightarrow \text{NotNext}$$
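The input layout for NSP can be sketched as follows. The helper `build_nsp_input` is hypothetical and splits on whitespace instead of using a real subword tokenizer. The segment IDs it returns are an addition to the description above: BERT uses them to tell the two sentences apart, and they should correspond to the `token_type_ids` field returned by Hugging Face tokenizers.

```python
def build_nsp_input(sentence_a, sentence_b):
    """Concatenate two tokenized sentences and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_nsp_input(
    "The rain stopped pouring".split(),
    "The sun started to shine".split(),
)
print(tokens)
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

The final representation of the `[CLS]` token is then used as input to the binary IsNext/NotNext classifier.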

BERT was trained on vast, document-level corpora:

  * BooksCorpus ($\approx 800\text{M}$ words)
  * English Wikipedia ($\approx 2{,}500\text{M}$ words)

It is important to use **document-level corpora** rather than collections of shuffled, unrelated sentences: consecutive sentences from the same document are needed to build the "IsNext" pairs for Next Sentence Prediction, and long contiguous passages teach the model about long-range dependencies.

# GPT

Whereas BERT is trained to understand natural language and contextually encode text, models like GPT (3, 4, 5) and LLaMA are trained to generate new tokens.

These models are composed of a Decoder-only stack and use masked Self-Attention layers to compute each token's representation from the previous tokens only (**unidirectionality**).

They're trained on a vast amount of public internet data on a **Causal Language Modeling (CLM)** task, i.e. predicting the next token of a sequence.
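In code, the CLM objective boils down to shifting the sequence by one position: the model reads tokens $1$ to $n-1$ and is asked to predict tokens $2$ to $n$. The sketch below uses random token IDs and random logits as stand-ins for real data and for the output of a decoder-only model; the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, batch_size, seq_len = 50, 2, 6

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

inputs = token_ids[:, :-1]   # all tokens except the last are fed to the model
targets = token_ids[:, 1:]   # each target is the token that follows the input

# Stand-in for the model output: one score per vocabulary entry at each position
logits = torch.randn(batch_size, seq_len - 1, vocab_size)

# Cross-entropy between the predicted distribution and the true next token
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(inputs.shape, targets.shape)  # torch.Size([2, 5]) torch.Size([2, 5])
print(loss.item())
```

Thanks to the causal mask, all positions of a sequence can be trained in parallel in a single forward pass, since the prediction at position $i$ never sees tokens after $i$.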


#### The Secret Ingredient for Modern GPT Models: RLHF

To give GPT-family models their human-like conversational ability, they are fine-tuned after pre-training _via_ a method called **Reinforcement Learning from Human Feedback (RLHF)**.

RLHF uses human preferences to train a separate "reward model," which then guides the GPT model to generate responses that are helpful, truthful, and harmless, making the output feel much more aligned with human dialogue and intent.

# Fine-tuning BERT for text classification

In this section, we'll fine-tune BERT on a simple text classification task: predicting the number of stars assigned to a Yelp review (labels 0 to 4, corresponding to 1 to 5 stars).

### Set up the dataset

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [13]:
from datasets import load_dataset

dataset = load_dataset("Yelp/yelp_review_full")


In [14]:
print(dataset["train"][100])

{'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I

### Tokenization

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [16]:
print("Nb of tokens:", tokenizer.vocab_size)
print({k:v for k, v in list(tokenizer.get_vocab().items())[:10]})

Nb of tokens: 28996
{'##air': 8341, 'quantum': 9539, 'striker': 13074, 'ghost': 7483, 'gig': 17799, 'ransom': 25057, '##believable': 26438, 'precinct': 26757, 'playwright': 11617, '##rting': 21811}


In [17]:
test_sentence = "Hello world! Welcome to the TSE Machine Learning course."
ids = tokenizer(test_sentence)["input_ids"]
print("Tokens:", ids)
print("//".join(map(tokenizer.decode, ids)))

Tokens: [101, 8667, 1362, 106, 12050, 1106, 1103, 157, 12649, 7792, 9681, 1736, 119, 102]
[CLS]//Hello//world//!//Welcome//to//the//T//##SE//Machine//Learning//course//.//[SEP]
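The split of "TSE" into `T` and `##SE` above is typical of WordPiece: a word missing from the vocabulary is broken into known pieces, and the `##` prefix marks a piece that continues a word. The sketch below illustrates the idea with a greedy longest-match-first rule. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical names introduced for illustration only; this is not the library's implementation, and the real tokenizer also handles casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece covers this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): a handful of entries, not the real BERT vocabulary
toy_vocab = {"Hello", "world", "T", "##SE", "##S", "##E", "un", "##believable"}

print(wordpiece_tokenize("TSE", toy_vocab))           # ['T', '##SE']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believable']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because longer matches are tried first, `##SE` is preferred over `##S` followed by `##E`, which keeps sequences short.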


In [18]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, return_tensors="pt")

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [19]:
tokenized_datasets.set_format("pt", columns=["input_ids", "attention_mask", "label"], output_all_columns=True)

In [20]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [21]:
print(small_train_dataset)
print(small_train_dataset[0])

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
{'label': tensor(4), 'input_ids': tensor([  101,   146, 27438,  1142,  4202,   119,   146,   112,  1396,  1151,
         1106,  3924,  8412,  1187,   146,  9981,  1106,  1129,   170, 13395,
         7589,  2288,  1107,  1413,   117,  6322,  8796,  5030,  7424,   117,
         1105,  1104,  1736,  1103,  9230,   112,   188,  2319,   119,  1109,
        20400,  1132,  1177,  1177,  7284, 10455,   119,  1109,  3172,  1110,
         7688,  4931,  1105,  1119,  2228,  1296,  7329,  1118,  1289,  1114,
         1126, 10965,  2971,  1104,  8188,   119,  1192, 13224,  3940,  1303,
         3713,   106,   106,   106,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,  
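The trailing zeros in `input_ids` above come from `padding="max_length"`: every sequence is extended with the padding ID until it reaches a fixed length, and `attention_mask` records which positions hold real tokens. A minimal sketch of that bookkeeping follows. The helper `pad_and_truncate` is hypothetical, and it is deliberately simplified: the real tokenizer keeps the final `[SEP]` token when it truncates, whereas this version just cuts the list.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it with pad_id."""
    ids = token_ids[:max_length]                 # simplified truncation
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


# Token IDs of the test sentence printed earlier (14 tokens)
ids = [101, 8667, 1362, 106, 12050, 1106, 1103, 157, 12649, 7792, 9681, 1736, 119, 102]

padded = pad_and_truncate(ids, max_length=16)
print(padded["input_ids"][-4:])       # [119, 102, 0, 0]
print(sum(padded["attention_mask"]))  # 14

truncated = pad_and_truncate(ids, max_length=8)
print(len(truncated["input_ids"]))    # 8
```

The model uses the mask to ignore padded positions in attention, so padding changes the tensor shape but should not change the prediction.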


<div class="alert alert-block alert-info">
    
<b> Exercise 3.5: </b>
* Using the [`transformers` library](https://huggingface.co/docs/transformers/index), download the uncased BERT model (`"bert-base-uncased"`) and set the number of labels to 5.
</div>

In [22]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)
print(model)


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

<div class="alert alert-block alert-info">
    
<b> Exercise 3.6: </b>
* Run inference with the model on a single example
</div>

In [23]:
import torch

# Extract the first example
example = small_train_dataset[0]

# Extract the input fields required by the model
input_ids = example["input_ids"]
attention_mask = example["attention_mask"]

# Convert to the format expected by the model i.e. (batch_size, sequence_length)
inputs = {
    "input_ids": input_ids.unsqueeze(0),
    "attention_mask": attention_mask.unsqueeze(0),
}


# Get predictions from the model
with torch.no_grad():
    outputs = model(**inputs)

# The outputs are logits, you can apply softmax to get probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=1)

print(probabilities)

tensor([[0.2057, 0.1267, 0.1413, 0.2602, 0.2661]])
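To read these numbers, it helps to turn them into a predicted class and a loss value. The sketch below uses only the standard library and the probabilities printed above; the true label `4` is the one shown for `small_train_dataset[0]`. The `softmax` helper is a plain-Python stand-in for `torch.nn.functional.softmax`, written here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Probabilities printed by the cell above, and the label of the same example
probabilities = [0.2057, 0.1267, 0.1413, 0.2602, 0.2661]
true_label = 4

# Predicted class = index of the largest probability
predicted_label = max(range(len(probabilities)), key=probabilities.__getitem__)

# Cross-entropy for this example, compared with a uniform guess over 5 classes
loss = -math.log(probabilities[true_label])
uniform_loss = math.log(len(probabilities))

print(predicted_label)         # 4
print(round(loss, 3))          # 1.324
print(round(uniform_loss, 3))  # 1.609
```

All five probabilities are close to 1/5 and the loss is close to the uniform baseline, which is what one would expect from a classification head that has not been trained yet.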


In [24]:
def evaluate_model(model, dataloader, loss_function, device):
    """Return the average batch loss and the accuracy over `dataloader`."""
    model.eval()  # disable dropout for evaluation
    total_eval_loss = 0.0
    correct_predictions = 0
    nb_iteration = 0
    total_data = 0
    for batch in dataloader:
        # Move the labels to the same device as the model
        y = batch["label"].to(device)

        # Forward pass without gradient tracking; only the tensor inputs go to the model
        input_data = {k: batch[k].to(device) for k in ("input_ids", "attention_mask")}
        with torch.no_grad():
            outputs = model(**input_data)
        logits = outputs.logits
        total_data += logits.size(0)  # number of examples in this batch
        loss = loss_function(logits, y)
        total_eval_loss += loss.item()

        # Count correct predictions as a plain Python int
        preds = torch.argmax(logits, dim=1)
        correct_predictions += (preds == y).sum().item()
        nb_iteration += 1

    avg_loss = total_eval_loss / nb_iteration
    accuracy = correct_predictions / total_data
    return avg_loss, accuracy


<div class="alert alert-block alert-info">
    
<b> Exercise 3.7: </b>
* Add the AdamW optimizer with a learning rate $\alpha = 5 \times 10^{-5}$
</div>

In [25]:
import torch
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss

# Ensure the model is on the correct device (GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Define the loss function
loss_function = CrossEntropyLoss()
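
To make the optimizer less of a black box, here is a sketch of a single AdamW update for one scalar parameter, in plain Python. The function `adamw_step` is a hypothetical helper written for illustration; the default values for `beta1`, `beta2`, `eps` and `weight_decay` are assumed to match the PyTorch defaults, and the real optimizer applies the same arithmetic to every tensor of parameters.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (theta, m, v)."""
    # Decoupled weight decay: shrink the parameter directly
    theta = theta - lr * weight_decay * theta
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialised averages (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step: gradient scaled by its running magnitude
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta)  # close to 0.9999495
```

On the first step the bias-corrected ratio is roughly the sign of the gradient, so the parameter moves by about one learning rate regardless of the gradient's scale.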

<div class="alert alert-block alert-info">
    
<b> Exercise 3.8: </b>
* Perform the training
* Why is the performance so bad?
</div>

In [None]:
from tqdm import tqdm  # for displaying a progress bar

# Define the number of training epochs
epochs = 3

# Start the training loop
for epoch in range(epochs):
    model.train()  # set the model to training mode
    total_loss = 0
    nb_iteration = 0

    train_iter = small_train_dataset.iter(batch_size=8)
    eval_iter = small_eval_dataset.iter(batch_size=8)

    for batch in tqdm(train_iter, desc=f"Epoch {epoch + 1}/{epochs}", unit="batch"):
        # Move batch to the same device as model
        y = batch.pop("label").to(device)
        text = batch.pop("text")
        
        # Forward pass: compute predicted outputs by passing inputs to the model
        input_data = {k: batch[k].to(device) for k in ("input_ids", "attention_mask")}
        outputs = model(**input_data)

        # Compute loss
        loss = loss_function(outputs.logits, y)

        # Backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()

        # Update parameters and zero the gradients
        optimizer.step()
        optimizer.zero_grad()

        nb_iteration += 1

        # Accumulate the training loss
        total_loss += loss.item()

    # Calculate average loss over an epoch
    avg_train_loss = total_loss / nb_iteration

    print(f"\nEpoch {epoch + 1} complete! Average Training Loss: {avg_train_loss:.4f}")
    eval_loss, eval_accuracy = evaluate_model(model, eval_iter, loss_function, device)
    print(f"Validation Loss: {eval_loss:.4f}, Validation Accuracy: {eval_accuracy:.4f}")

<div class="alert alert-block alert-info">
    
<b> Exercise 3.9: </b>
* Play with the training procedure to reach the best possible performance: add a learning rate scheduler, increase the training dataset size, etc.
</div>
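
As one possible starting point for the scheduler, the sketch below computes a linear schedule with optional warmup in plain Python. The function `linear_schedule_lr` and the choice of 37 warmup steps are assumptions made for illustration; the step count follows from the loop above (1000 examples, batches of 8, 3 epochs). In practice the same shape could be plugged into the optimizer through something like `torch.optim.lr_scheduler.LambdaLR`.

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 1000 training examples, batch size 8, 3 epochs
total_steps = 3 * (1000 // 8)
print(total_steps)  # 375

schedule = [linear_schedule_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[-1])  # starts at the base rate, ends at 0.0

# Same schedule with a short warmup (37 steps is an arbitrary choice)
print(linear_schedule_lr(0, total_steps, warmup_steps=37))   # 0.0
print(linear_schedule_lr(37, total_steps, warmup_steps=37))  # back at the base rate
```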