# NLP Lab, T5: **Full Encoder–Decoder Transformer** (from scratch), Master's in AI

This notebook corresponds to **TUTORIAL 5 (T5)**.
It follows:
- T1: Seq2Seq RNN
- T2: Bidirectional encoder
- T3: Attention (Bahdanau)
- T4: *Encoder-only* Transformer

👉 We now implement a **full Encoder–Decoder Transformer**,  
the architecture used by the **original Transformer**, the **T5** model, **BART**, **Marian**, etc.

---
## 🎯 Learning objectives

By the end of this lab, students will be able to:
- explain the difference between **encoder-only / decoder-only / encoder–decoder** models,
- understand **masked self-attention** on the decoder side,
- implement:
  - a Transformer encoder,
  - a Transformer decoder,
  - **encoder–decoder** attention (cross-attention),
- train a Seq2Seq Transformer on a toy problem,
- relate it to **machine translation**.

⚠️ Teaching goal: a **clear architecture**, not maximum performance.

---



## 🧠 Conceptual recap: why an encoder–decoder?

Some tasks require:
- a **source sequence** that is fully available,
- a **target sequence generated step by step**.

Examples:
- Translation
- Summarization
- Generative question answering

👉 The encoder:
- reads the whole source,
- produces a contextual memory.

👉 The decoder:
- generates token by token,
- looks at:
  - its past (masked self-attention),
  - the source (cross-attention).
---
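
The three attention patterns described above can be sketched as 0/1 visibility grids. This is a plain-Python illustration, not part of the model; the helper name `visibility` is hypothetical, and row `i` lists which key positions query `i` may attend to.

```python
def visibility(n_queries, n_keys, causal=False):
    """Return a 0/1 grid: row i marks the key positions query i may attend to."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_keys)]
        for i in range(n_queries)
    ]

src_len, tgt_len = 4, 3

encoder_self = visibility(src_len, src_len)               # each source token sees the whole source
decoder_self = visibility(tgt_len, tgt_len, causal=True)  # target token i sees positions 0..i only
cross = visibility(tgt_len, src_len)                      # each target token sees the whole source

for name, grid in [("encoder self-attention", encoder_self),
                   ("decoder masked self-attention", decoder_self),
                   ("cross-attention", cross)]:
    print(name)
    for row in grid:
        print("  ", row)
```

Only the decoder's self-attention grid is triangular; the other two are full, which is why the source must be entirely available before decoding starts.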



## 🧩 Chosen teaching problem

We stay with a **simple, controlled** problem:
### 👉 Sequence reversal (Seq2Seq version)

Source:
```
[1, 5, 7, 3]
```
Target:
```
[3, 7, 5, 1]
```

Why this problem again?
- no linguistic complexity,
- lends itself to visualizing **self-attention** and **cross-attention**,
- direct comparison with T1–T4.
---


In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
import random
import numpy as np
import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


## 1) Parameters and vocabulary

In [None]:

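V = 20                      # number of "real" symbols: token ids 1..V
MIN_LEN, MAX_LEN = 3, 12    # source length range (inclusive)

TRAIN_SIZE = 8000
VALID_SIZE = 1000

BATCH_SIZE = 64
D_MODEL = 128               # embedding / hidden size
N_HEADS = 4                 # each head works on D_MODEL // N_HEADS = 32 dimensions
FF_DIM = 256                # inner size of the feed-forward network

EPOCHS = 10
LR = 1e-3

PAD = 0                     # padding id (ignored by the loss)
SOS = V + 1                 # start-of-sequence id
EOS = V + 2                 # end-of-sequence id
VOCAB_SIZE = V + 3          # PAD + V symbols + SOS + EOS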


## 2) Seq2Seq dataset

In [None]:

def generate_pair():
    L = random.randint(MIN_LEN, MAX_LEN)
    src = [random.randint(1, V) for _ in range(L)]
    tgt = [SOS] + list(reversed(src)) + [EOS]
    return src, tgt

class ReverseDataset(Dataset):
    def __init__(self, n):
        self.data = [generate_pair() for _ in range(n)]
    def __len__(self): return len(self.data)
    def __getitem__(self, i): return self.data[i]

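def pad(seqs):
    # Right-pad every sequence with PAD up to the longest one in the batch.
    m = max(len(s) for s in seqs)
    return torch.tensor([s+[PAD]*(m-len(s)) for s in seqs], dtype=torch.long)

def collate(batch):
    src = pad([b[0] for b in batch])
    tgt = pad([b[1] for b in batch])
    # Teacher forcing: decoder input = tgt without its last column,
    # expected output = tgt shifted left by one (SOS dropped).
    return src, tgt[:,:-1], tgt[:,1:]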

train_loader = DataLoader(ReverseDataset(TRAIN_SIZE), batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(ReverseDataset(VALID_SIZE), batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate)
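
To make the shift performed by `collate` concrete, here is a small plain-Python sketch on a batch of two pairs. It mirrors the padding and slicing above with lists instead of tensors; `pad_batch` is a hypothetical helper, and the ids `SOS = 21`, `EOS = 22` follow from `V = 20`.

```python
PAD, SOS, EOS = 0, 21, 22  # same ids as in the notebook with V = 20

def pad_batch(seqs, pad_id=PAD):
    """Right-pad lists of ints to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

sources = [[1, 5, 7, 3], [4, 9]]
targets = [[SOS] + s[::-1] + [EOS] for s in sources]

tgt = pad_batch(targets)
tgt_in = [row[:-1] for row in tgt]   # what the decoder reads
tgt_out = [row[1:] for row in tgt]   # what it must predict at each step

for row_in, row_out in zip(tgt_in, tgt_out):
    print(row_in, "->", row_out)
# [21, 3, 7, 5, 1] -> [3, 7, 5, 1, 22]
# [21, 9, 4, 22, 0] -> [9, 4, 22, 0, 0]
```

In the shorter row, the decoder input still contains `EOS` and a `PAD`, but the matching targets are `PAD`, which the loss ignores.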



## 3) Positional Encoding
Identical to T4.


In [None]:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=100):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
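
The buffer built above interleaves sines (even indices) and cosines (odd indices) whose frequencies decrease with the dimension index. As a hand-checkable sketch, the same formula can be written for a single position in plain Python; `sinusoidal_pe` is a hypothetical name, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Positional encoding vector for one position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):
        # Same frequency term as `div` in the class above: 10000^(-i / d_model)
        angle = pos * math.exp(-math.log(10000.0) * i / d_model)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

print(sinusoidal_pe(0, 4))  # position 0: every sine is 0 and every cosine is 1
print(sinusoidal_pe(1, 4))  # position 1: a fast pair (angle 1) and a slow pair (angle 0.01)
```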



## 4) Basic attention (scaled dot-product)


In [None]:

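class ScaledDotAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        # Q: (..., Tq, d_k); K, V: (..., Tk, d_k). Here V is the value tensor, not the vocabulary size.
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (..., Tq, Tk)
        if mask is not None:
            # Where mask == 0 the score becomes very negative, hence ~0 weight after softmax.
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = torch.softmax(scores, dim=-1)  # each query's weights sum to 1 over the keys
        return torch.matmul(attn, V), attn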
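
To see the effect of the mask on the weights, here is a minimal plain-Python sketch for a single query. It follows the same steps as the class above (dot product, division by `sqrt(d_k)`, `-1e9` fill, softmax); the function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, mask=None):
    """Weights of one query over a list of keys; mask[j] == 0 hides key j."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    if mask is not None:
        scores = [s if keep else -1e9 for s, keep in zip(scores, mask)]
    return softmax(scores)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(attention_weights(q, keys))                  # keys 0 and 2 match q and get equal, larger weights
print(attention_weights(q, keys, mask=[1, 1, 0]))  # key 2 is hidden: its weight underflows to 0.0
```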



## 5) Multi-Head Attention (reusable)


In [None]:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads

        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.attn = ScaledDotAttention()

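    def forward(self, Q, K, V, mask=None):
        # Q: (B, Tq, D); K, V: (B, Tk, D). Tq != Tk is allowed (cross-attention).
        B, Tq, D = Q.shape
        Tk = K.size(1)

        # Project, then split D into n_heads chunks of size d_k: (B, n_heads, T, d_k)
        Q = self.Wq(Q).view(B, Tq, self.n_heads, self.d_k).transpose(1,2)
        K = self.Wk(K).view(B, Tk, self.n_heads, self.d_k).transpose(1,2)
        V = self.Wv(V).view(B, Tk, self.n_heads, self.d_k).transpose(1,2)

        out, attn = self.attn(Q, K, V, mask)  # attn: (B, n_heads, Tq, Tk)
        # Merge the heads back into a single D-sized vector per position
        out = out.transpose(1,2).contiguous().view(B, Tq, D)
        return self.fc(out), attn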
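
The `view(...)` calls above cut each projected vector into `n_heads` contiguous chunks of size `d_k`, and the final `view(B, Tq, D)` concatenates them again. A plain-Python sketch of that split and merge on one vector, with hypothetical helper names:

```python
def split_heads(x, n_heads):
    """Split one d_model-sized vector into n_heads contiguous chunks of size d_k."""
    d_model = len(x)
    assert d_model % n_heads == 0
    d_k = d_model // n_heads
    return [x[h * d_k:(h + 1) * d_k] for h in range(n_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head chunks."""
    return [v for head in heads for v in head]

x = list(range(8))          # stands in for one projected token, d_model = 8
heads = split_heads(x, 4)   # 4 heads of size d_k = 2
print(heads)                # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(merge_heads(heads) == x)
```

Each head then runs scaled dot-product attention on its own chunk, so the heads can learn different attention patterns at no extra cost in total dimension.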



## 6) Feed-Forward + Encoder Block


In [None]:

class FeedForward(nn.Module):
    def __init__(self, d_model, ff_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, d_model)
        )
    def forward(self, x):
        return self.net(x)

class EncoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = MultiHeadAttention(D_MODEL, N_HEADS)
        self.ff = FeedForward(D_MODEL, FF_DIM)
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.norm2 = nn.LayerNorm(D_MODEL)

    def forward(self, x, src_mask):
        attn_out,_ = self.attn(x, x, x, src_mask)
        x = self.norm1(x + attn_out)
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x
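
Both sub-layers of the block follow the same "add, then normalize" pattern, `norm(x + sublayer(x))`. The sketch below reproduces that step on a single vector in plain Python. It assumes a layer norm without learned scale and shift, and the toy sublayer is purely illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
y = residual_block(x, lambda v: [0.5 * t for t in v])  # toy sublayer, for illustration only

print(y)
print(abs(sum(y)) < 1e-9)  # the mean is numerically zero after normalization
```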



## 7) Transformer Decoder Block

It contains **two attention layers**:
1. **Masked** self-attention (does not look at the future)
2. Encoder–decoder attention (cross-attention)


In [None]:

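def subsequent_mask(size):
    # Lower-triangular (1, size, size) mask: entry [0, i, j] is 1 when j <= i,
    # so position i can attend to itself and to earlier positions only.
    mask = torch.tril(torch.ones(size, size)).unsqueeze(0)
    return mask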

class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = MultiHeadAttention(D_MODEL, N_HEADS)
        self.cross_attn = MultiHeadAttention(D_MODEL, N_HEADS)
        self.ff = FeedForward(D_MODEL, FF_DIM)
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.norm2 = nn.LayerNorm(D_MODEL)
        self.norm3 = nn.LayerNorm(D_MODEL)

    def forward(self, x, memory, tgt_mask, src_mask):
        attn1,_ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + attn1)

        attn2,_ = self.cross_attn(x, memory, memory, src_mask)
        x = self.norm2(x + attn2)

        ff_out = self.ff(x)
        x = self.norm3(x + ff_out)
        return x
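
A useful sanity check for the masked self-attention is that changing a later token must not change the output at earlier positions. The plain-Python sketch below shows this with a single head and `Q = K = V`. It restricts the keys to positions `0..i`, which has the same effect as filling future scores with a very negative value; the function name is hypothetical.

```python
import math

def causal_attention(xs):
    """Single-head self-attention with Q = K = V = xs and a causal restriction."""
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]  # positions 0..i only
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[c] for e, v in zip(exps, visible)) for c in range(d)])
    return out

seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [5.0, -2.0]]  # only the last token differs

out_a, out_b = causal_attention(seq_a), causal_attention(seq_b)
print(out_a[:2] == out_b[:2])  # True: earlier positions are unaffected
print(out_a[2] == out_b[2])    # False: the last position does change
```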



## 8) Full Encoder–Decoder Transformer


In [None]:

class TransformerSeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD)
        self.pe = PositionalEncoding(D_MODEL)
        self.encoder = EncoderBlock()
        self.decoder = DecoderBlock()
        self.fc = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src, tgt_in):
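        # (B, 1, 1, S): hides PAD keys in the source, broadcast over heads and queries
        src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)
        # (1, T, T): causal mask, broadcast over batch and heads
        tgt_mask = subsequent_mask(tgt_in.size(1)).to(tgt_in.device)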

        src_emb = self.pe(self.emb(src))
        memory = self.encoder(src_emb, src_mask)

        tgt_emb = self.pe(self.emb(tgt_in))
        dec_out = self.decoder(tgt_emb, memory, tgt_mask, src_mask)

        return self.fc(dec_out)
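
At inference time there is no target to feed in, so the decoder is called in a loop and each predicted token is appended to its own input. The sketch below shows that loop in plain Python. `next_token` is a hypothetical callable standing in for the model (with the class above it could wrap something like `model(src, prefix)[:, -1].argmax(-1)`), and `oracle_next_token` is a stand-in for a perfectly trained model, used only so the example runs on its own.

```python
SOS, EOS = 21, 22  # same ids as in the notebook with V = 20

def greedy_decode(next_token, src, max_len=20):
    """Generate tokens one at a time until EOS is predicted or max_len is reached."""
    prefix = [SOS]
    for _ in range(max_len):
        tok = next_token(src, prefix)  # most likely next token given source and prefix
        if tok == EOS:
            break
        prefix.append(tok)
    return prefix[1:]  # drop SOS

def oracle_next_token(src, prefix):
    """Stand-in for a perfectly trained model on the reversal task."""
    step = len(prefix) - 1  # number of tokens generated so far
    return src[::-1][step] if step < len(src) else EOS

print(greedy_decode(oracle_next_token, [1, 5, 7, 3]))  # [3, 7, 5, 1]
```

The `max_len` cap matters with a real model, which may never emit `EOS` early in training.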


## 9) Training

In [None]:

model = TransformerSeq2Seq().to(device)
optimizer = optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

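def run_epoch(loader, train=True):
    model.train() if train else model.eval()
    total = 0
    for src, tgt_in, tgt_out in loader:
        src, tgt_in, tgt_out = src.to(device), tgt_in.to(device), tgt_out.to(device)
        if train:
            optimizer.zero_grad()
        # Track gradients only during training; validation needs none.
        with torch.set_grad_enabled(train):
            logits = model(src, tgt_in)
            B, T, C = logits.shape  # C = VOCAB_SIZE; avoids shadowing the global V
            # reshape (not view): tgt_out is a slice of the padded batch and may be non-contiguous
            loss = criterion(logits.reshape(B*T, C), tgt_out.reshape(B*T))
        if train:
            loss.backward()
            optimizer.step()
        total += loss.item()
    return total/len(loader)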

for e in range(1, EPOCHS+1):
    tr = run_epoch(train_loader, True)
    va = run_epoch(valid_loader, False)
    print(f"Epoch {e:02d} | train {tr:.4f} | valid {va:.4f}")
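
The loop above reports only the cross-entropy loss. A per-token accuracy that skips padding is often easier to interpret on this task. The sketch below is one possible way to compute it, on plain lists of token ids; the function name and the example predictions are illustrative, not results from the notebook.

```python
PAD = 0

def token_accuracy(pred_rows, target_rows, pad_id=PAD):
    """Fraction of non-PAD target positions where the prediction matches."""
    correct = total = 0
    for pred, target in zip(pred_rows, target_rows):
        for p, t in zip(pred, target):
            if t == pad_id:
                continue  # padded positions do not count, as in the loss
            total += 1
            correct += int(p == t)
    return correct / total if total else 0.0

# Made-up predictions for two sequences (the second contains one mistake)
preds = [[3, 7, 5, 1, 22], [9, 4, 22, 7, 7]]
targets = [[3, 7, 5, 1, 22], [9, 5, 22, 0, 0]]

print(token_accuracy(preds, targets))  # 7 correct out of 8 non-PAD positions = 0.875
```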



## 10) Teaching conclusion

### What students should take away:
- The decoder **sees its past**, but not its future
- The decoder **sees the source** through cross-attention
- The encoder–decoder is the basis of machine translation
- GPT = decoder only
- BERT = encoder only

👉 From here on, the whole family of LLMs can be understood.
