# NLP Lab, T4: **Transformer from Scratch** (Self-Attention), AI Master's

This notebook is **TUTORIAL 4 (T4)** of the NLP module.
After Seq2Seq (T1), the Bi-Encoder (T2), and Attention (T3), we introduce the **Transformer**,
an architecture built **solely on attention**, with no recurrence.

---
## 🎯 Learning objectives

By the end of this lab, the student will be able to:
- explain why the Transformer **removes recurrence**,
- understand **self-attention** (queries, keys, values),
- implement a simple **Multi-Head Attention**,
- understand the role of **positional encoding**,
- assemble a **working mini-Transformer**,
- compare Transformer and RNN + Attention conceptually.

⚠️ Goal: **understand the architecture**, not beat the state of the art.

---



## 🧠 Why Transformers? (critical recap)

RNN/LSTM models suffer from:
- **strict sequentiality** (time steps cannot be computed in parallel),
- difficulty with long-range dependencies,
- a number of sequential steps proportional to the sequence length.

The Transformer (Vaswani et al., 2017) rests on a radical idea:
> **“Attention is all you need.”**

➡️ All positions of a sequence interact **in parallel**.

---



## 🧩 Chosen teaching problem

We keep the **sequence reversal** problem:
```
[1, 5, 7, 3] → [3, 7, 5, 1]
```

Why?
- sequence-to-sequence structure,
- direct comparison with T1–T3,
- clear interpretability of self-attention.

---


In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
import random
import numpy as np
import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


## 1) Parameters and vocabulary

In [None]:

V = 20
MIN_LEN, MAX_LEN = 3, 12

TRAIN_SIZE = 8000
VALID_SIZE = 1000

BATCH_SIZE = 64
D_MODEL = 128
N_HEADS = 4
FF_DIM = 256

EPOCHS = 10
LR = 1e-3

PAD = 0
SOS = V + 1
EOS = V + 2
VOCAB_SIZE = V + 3


## 2) Dataset (sequence reversal)

In [None]:

def generate_pair():
    L = random.randint(MIN_LEN, MAX_LEN)
    src = [random.randint(1, V) for _ in range(L)]
    tgt = [SOS] + list(reversed(src)) + [EOS]
    return src, tgt

class ReverseDataset(Dataset):
    def __init__(self, n):
        self.data = [generate_pair() for _ in range(n)]
    def __len__(self): return len(self.data)
    def __getitem__(self, i): return self.data[i]

def pad(seqs):
    m = max(len(s) for s in seqs)
    return torch.tensor([s+[PAD]*(m-len(s)) for s in seqs], dtype=torch.long)

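def collate(batch):
    src = pad([b[0] for b in batch])
    tgt = pad([b[1] for b in batch])
    # Shifted pair: tgt[:, :-1] drops the last column (decoder-style input),
    # tgt[:, 1:] drops SOS (expected output, ending with EOS then PAD)
    return src, tgt[:,:-1], tgt[:,1:]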

train_loader = DataLoader(ReverseDataset(TRAIN_SIZE), batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(ReverseDataset(VALID_SIZE), batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate)
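
To make the shapes concrete, the sketch below replays the target construction of `generate_pair` and the shift done in `collate` on plain Python lists instead of tensors. The helper names `make_target`, `pad_to`, and `shifted_pair` are hypothetical and exist only for this illustration; the token ids match the constants defined in section 1.

```python
V = 20
PAD, SOS, EOS = 0, V + 1, V + 2   # same ids as in the notebook

def make_target(src):
    """Reversed source wrapped in SOS ... EOS, as generate_pair does."""
    return [SOS] + list(reversed(src)) + [EOS]

def pad_to(seq, length):
    """Right-pad a list with PAD (hypothetical helper for this sketch)."""
    return seq + [PAD] * (length - len(seq))

def shifted_pair(src, max_tgt_len):
    """Return (tgt_in, tgt_out): the padded target without its last / first column."""
    tgt = pad_to(make_target(src), max_tgt_len)
    return tgt[:-1], tgt[1:]

batch = [[1, 5, 7, 3], [9, 2]]
max_tgt_len = max(len(s) for s in batch) + 2   # +2 for SOS and EOS

for src in batch:
    tgt_in, tgt_out = shifted_pair(src, max_tgt_len)
    print("src    ", pad_to(src, max(len(s) for s in batch)))
    print("tgt_in ", tgt_in)
    print("tgt_out", tgt_out)
```

Note that `tgt_out` has one more column than `src`, because it ends with `EOS`.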



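## 3) Positional Encoding

Self-attention is **permutation-equivariant**: on its own, it treats the input as an unordered set of tokens.
We therefore add position information, where \(d\) is the model dimension (`d_model` in the code):
\[
PE(pos, 2i) = \sin\left(pos / 10000^{2i/d}\right)
\]
\[
PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d}\right)
\]

➡️ This injects a notion of order without recurrence.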
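
As a sanity check on the two formulas, here is a minimal dependency-free sketch that evaluates them entry by entry. The function name `sinusoidal_pe` is an assumption made for this example; it is not part of the notebook.

```python
import math

def sinusoidal_pe(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # index of the (sin, cos) pair that feature k belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # even features use sin, odd features use cos
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_pe(max_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Position 0 always encodes as alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Later positions rotate each (sin, cos) pair at its own frequency, so every position gets a distinct vector.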


In [None]:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=100):
        super().__init__()
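        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i/d_model) for each even feature index 2i, computed in log space
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)  # even feature indices
        pe[:, 1::2] = torch.cos(pos * div)  # odd feature indices
        # Buffer: follows the module across devices but is not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)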

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]



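## 4) Scaled Dot-Product Attention

For queries Q, keys K, and values V:
\[
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]

Dividing by \(\sqrt{d_k}\) keeps the scores at a scale that does not grow with the key dimension \(d_k\). This prevents the softmax from saturating and stabilizes the gradients.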
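
The formula can be traced by hand on a tiny example. The sketch below is a plain-Python illustration on nested lists, not the notebook's implementation; the function names `softmax` and `attention` are chosen for this example. Here `V` denotes the value matrix from the formula, not the vocabulary-size constant of section 1.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # one score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # output = attention-weighted average of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)

# Effect of the scaling: a raw dot product of 16 with d_k = 64 becomes 16 / 8 = 2
print(softmax([16.0, 0.0]))  # unscaled: almost all weight on the first key
print(softmax([2.0, 0.0]))   # scaled: a softer distribution
```

Each row of `weights` sums to 1, and each query puts more weight on the key it matches. The last two lines show why the scaling matters: without it, large dot products push the softmax towards a near one-hot output, where gradients are very small.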


In [None]:

class ScaledDotAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, V), attn



## 5) Multi-Head Attention

Q, K, and V are projected into **several subspaces**:
- each head can learn a different type of relation,
- the head outputs are concatenated, then projected again.
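
The split into heads is only a regrouping of the feature axis: with `D_MODEL = 128` and `N_HEADS = 4`, each head works on a contiguous slice of 32 features. The sketch below shows that bookkeeping on plain lists, for a single sequence with no batch axis; `split_heads` and `merge_heads` are hypothetical helper names used for illustration.

```python
def split_heads(x, n_heads):
    """x: T vectors of size d_model -> n_heads lists of T vectors of size d_k."""
    d_model = len(x[0])
    assert d_model % n_heads == 0
    d_k = d_model // n_heads
    # head h receives features h*d_k .. (h+1)*d_k - 1 of every token
    return [[tok[h * d_k:(h + 1) * d_k] for tok in x] for h in range(n_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the head slices of each token."""
    T = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(T)]

x = [list(range(8)), list(range(8, 16))]  # T = 2 tokens, d_model = 8
heads = split_heads(x, n_heads=4)         # 4 heads, d_k = 2
print(heads[0])   # slice seen by head 0
print(heads[3])   # slice seen by head 3
print(merge_heads(heads) == x)
```

Attention is then computed independently inside each head, and the merge step restores the original layout before the final linear projection.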


In [None]:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads

        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.attn = ScaledDotAttention()

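    def forward(self, x, mask=None):
        B, T, D = x.shape
        # Project, then split the feature axis into heads: (B, T, D) -> (B, n_heads, T, d_k)
        Q = self.Wq(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        K = self.Wk(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        V = self.Wv(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # out: (B, n_heads, T, d_k), attn: (B, n_heads, T, T)
        out, attn = self.attn(Q, K, V, mask)
        # Merge the heads back: (B, n_heads, T, d_k) -> (B, T, D)
        out = out.transpose(1,2).contiguous().view(B, T, D)
        return self.fc(out), attn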



## 6) Position-wise Feed-Forward Network


In [None]:

class FeedForward(nn.Module):
    def __init__(self, d_model, ff_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, d_model)
        )
    def forward(self, x):
        return self.net(x)



## 7) Transformer Block (Encoder)

Each block applies:
1. Multi-head self-attention
2. Add & Norm
3. Feed-forward
4. Add & Norm
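
"Add & Norm" means a residual connection followed by layer normalization, that is `LayerNorm(x + sublayer(x))`. The sketch below computes this for a single token vector in plain Python. It assumes the default behavior of `nn.LayerNorm` at initialization (gain 1, bias 0, `eps = 1e-5`), and the function names are illustrative.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)  # biased variance
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]              # input of the sublayer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # made-up attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print([round(v, 4) for v in y])
```

The residual path lets the input pass through unchanged when the sublayer output is small, while the normalization keeps the activations at a comparable scale from one sublayer to the next.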


In [None]:

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, ff_dim)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        attn_out, attn = self.attn(x, mask)
        x = self.norm1(x + attn_out)
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x, attn



## 8) Mini-Transformer Seq2Seq (encoder only)

To keep things simple:
- we use **a single Transformer block**,
- we predict the reversed sequence directly from the encoder output, one token per source position (no decoder).


In [None]:

class MiniTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD)
        self.pe = PositionalEncoding(D_MODEL)
        self.block = TransformerBlock(D_MODEL, N_HEADS, FF_DIM)
        self.fc = nn.Linear(D_MODEL, VOCAB_SIZE)

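    def forward(self, x):
        # Key padding mask, shape (B, 1, 1, T): broadcasts over heads and query positions
        mask = (x != PAD).unsqueeze(1).unsqueeze(2)
        x = self.pe(self.emb(x))
        out, attn = self.block(x, mask)
        return self.fc(out), attn  # logits: (B, T, VOCAB_SIZE)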
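
The mask built in `forward` hides padded keys from the softmax. The following plain-Python sketch shows the effect on one row of scores; the scores are made up and `masked_softmax` is a hypothetical helper. It mirrors the `masked_fill(mask == 0, -1e9)` step of `ScaledDotAttention`.

```python
import math

PAD = 0  # same padding id as in the notebook

def masked_softmax(scores, keep):
    """Softmax over scores; positions where keep is False are pushed to -1e9 first."""
    filled = [s if k else -1e9 for s, k in zip(scores, keep)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = [4, 9, 2, PAD, PAD]
keep = [t != PAD for t in tokens]      # True for real tokens, False for padding
scores = [0.3, 1.2, -0.5, 2.0, 2.0]    # made-up raw scores for one query
weights = masked_softmax(scores, keep)
print([round(w, 3) for w in weights])
```

Even though the padded positions had the largest raw scores, they end up with zero weight, and the remaining weights still sum to 1 over the real tokens.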


## 9) Training

In [None]:

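model = MiniTransformer().to(device)
optimizer = optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)  # padded positions do not contribute to the loss

def run_epoch(loader, train=True):
    model.train() if train else model.eval()
    total = 0.0
    for src, tgt_in, tgt_out in loader:  # tgt_in is unused: this encoder-only model has no decoder
        src, tgt_out = src.to(device), tgt_out.to(device)
        # The model emits one prediction per source position, while tgt_out carries one
        # extra column (EOS). Truncate to the source length and turn any remaining EOS
        # into PAD, so that the target is exactly the reversed sequence, padded.
        tgt_out = tgt_out[:, :src.size(1)]
        tgt_out = tgt_out.masked_fill(tgt_out == EOS, PAD)
        if train:
            optimizer.zero_grad()
        with torch.set_grad_enabled(train):  # no gradient tracking during validation
            logits, _ = model(src)
            B, T, C = logits.shape  # C = VOCAB_SIZE (avoids shadowing the global V)
            loss = criterion(logits.reshape(B * T, C), tgt_out.reshape(B * T))
        if train:
            loss.backward()
            optimizer.step()
        total += loss.item()
    return total / len(loader)

for e in range(1, EPOCHS+1):
    tr = run_epoch(train_loader, True)
    va = run_epoch(valid_loader, False)
    print(f"Epoch {e:02d} | train {tr:.4f} | valid {va:.4f}")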



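## 10) Visualizing self-attention

We inspect the relations learned between positions:
- for reversal, an anti-diagonal pattern is expected: output position `i` should attend mostly to source position `L-1-i`, although individual heads may deviate from it.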
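
As a reference for reading the heatmap, the sketch below builds the idealized pattern a head would show if it solved the reversal purely by position. This is an assumption made for illustration, not a guarantee about what training produces, and `ideal_reversal_attention` is a hypothetical helper name.

```python
def ideal_reversal_attention(length):
    """Row i (query / output position) puts all its weight on key length-1-i."""
    return [[1.0 if j == length - 1 - i else 0.0 for j in range(length)]
            for i in range(length)]

ideal = ideal_reversal_attention(4)
for row in ideal:
    # '#' marks the attended source position
    print("".join("#" if w == 1.0 else "." for w in row))
```

A trained head usually shows a blurrier version of this anti-diagonal, and some heads may encode other relations entirely.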


In [None]:

@torch.no_grad()
def show_self_attention(model, seq):
    model.eval()
    x = torch.tensor([seq], dtype=torch.long, device=device)
    _, attn = model(x)               # attn: (batch, heads, query, key)
    attn = attn[0,0].cpu().numpy()   # first example, head 0
    plt.imshow(attn)
    plt.colorbar()
    plt.title("Self-Attention (head 0)")
    plt.xlabel("Key position (attended source token)")
    plt.ylabel("Query position (output token)")
    plt.show()

show_self_attention(model, [1,5,7,3])
