# 🧠 From A to B: Understanding Sequence-to-Sequence Models

Welcome to this notebook on **Sequence-to-Sequence (Seq2Seq)** models! This is a cornerstone architecture in modern NLP, forming the basis for everything from machine translation to text summarization.

In our previous notebooks, we learned how to preprocess text and create word embeddings. Now, we'll use those concepts to build a model that can "read" a sequence (like an English sentence) and "write" a new sequence (like a French sentence).

**Our Goal:** To build a simple English-to-French translator from scratch using a "vanilla" Encoder-Decoder model.

**Notebook Outline:**
1.  **Introduction:** What is a Seq2Seq model? (The big picture)
2.  **Setup:** Importing libraries and preparing our environment.
3.  **Data:** Loading, tokenizing, and preparing our parallel text data.
4.  **The Architecture:** Building the three key components:
    * **The Encoder:** Reads the input sentence.
    * **The Decoder:** Generates the output sentence.
    * **The Seq2Seq Model:** A wrapper that combines them.
5.  **Training:** Teaching the model to translate.
6.  **Inference:** Using our trained model to translate new sentences.
7.  **Next Steps:** Where to go from here (Attention!).

## 1. Introduction: The Big Picture

A Sequence-to-Sequence (Seq2Seq) model is designed for tasks where the input and output are both sequences (lists of items), but their lengths may differ.

* **Input:** A sequence, e.g., `[ "hello", "how", "are", "you", "?" ]`
* **Output:** A sequence, e.g., `[ "bonjour", "comment", "allez", "vous", "?" ]`



Think of it like a human translator. The translator first **reads** the entire English sentence (this is the **Encoder**). They build a mental "summary" or "understanding" of its meaning. This summary is what we call the **context vector** (or "thought vector").

Then, the translator **writes** the French sentence, word by word (this is the **Decoder**). At each step, they consult their mental summary (the context vector) and the word they just wrote to decide which word to write next.

Our model will mimic this process:
1.  **Encoder (an RNN):** Will "read" the input English sentence and compress all its information into a single vector (the final hidden state).
2.  **Decoder (another RNN):** Will take that context vector and generate the French sentence token by token.

### The Encoder-Decoder Paradigm

The Seq2Seq model consists of two main components:

```
Input Sequence → [ENCODER] → Context Vector → [DECODER] → Output Sequence
```

#### **Encoder**
- Takes the input sequence (e.g., English sentence)
- Processes it word by word using an RNN (LSTM/GRU)
- Compresses all information into a fixed-size **context vector** (also called **thought vector**)
- This vector captures the "meaning" of the entire input sequence

#### **Context Vector**
- A fixed-size vector (e.g., 256 or 512 dimensions)
- Acts as a bottleneck - all input information must pass through it
- Represents the semantic meaning of the input
- This is actually the **final hidden state** of the encoder

#### **Decoder**
- Takes the context vector as its initial hidden state
- Generates the output sequence word by word
- At each step, it predicts the next word based on:
  - Its current hidden state (which starts out as the context vector, so the context reaches later steps only through this state)
  - The previously generated word

## 2. Setup: Imports and Prerequisites

Let's import all the libraries we'll need. We'll be using **PyTorch** to build our models and **spaCy** for high-quality text tokenization.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.utils.data import DataLoader

# Using spacy for tokenization
!pip install -q spacy
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

import spacy

import random
import math
import time
from collections import Counter


### 2.1. Configure Device (GPU)

We'll set up our code to use the GPU when one is available (and fall back to the CPU otherwise). A GPU makes training *significantly* faster.

In [2]:
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set a seed for reproducible results
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Using device: cuda


## 3. Data Preparation

This is one of the most important parts. We need a "parallel corpus": a dataset of sentence pairs (e.g., an English sentence and its French translation). For this educational notebook we'll use the small, clean English-French pair file distributed by manythings.org, which is built from [Tatoeba](https://tatoeba.org/en/) sentences.

### 3.1. Download and Load Data

In [3]:
# Download a small, clean parallel corpus (eng-fra)
!wget -q https://www.manythings.org/anki/fra-eng.zip
!unzip -q fra-eng.zip

# We'll just use the fra.txt file, which contains pairs separated by a tab
DATA_PATH = 'fra.txt'

### 3.2. Tokenization

We need to break sentences into "tokens" (words or punctuation). We'll use spaCy's powerful tokenizers for English and French.

In [4]:
# Load spacy models
spacy_en = spacy.load('en_core_web_sm')
spacy_fr = spacy.load('fr_core_news_sm')

print("Loaded spaCy models")

# Example tokenization
en_text = "Hello, how are you?"
fr_text = "Bonjour, comment allez-vous ?"

print(f"EN tokens: {[token.text for token in spacy_en.tokenizer(en_text)]}")
print(f"FR tokens: {[token.text for token in spacy_fr.tokenizer(fr_text)]}")

Loaded spaCy models
EN tokens: ['Hello', ',', 'how', 'are', 'you', '?']
FR tokens: ['Bonjour', ',', 'comment', 'allez', '-vous', '?']


### 3.3. Building the Vocabulary

Our models don't understand words; they understand numbers. We need to create a "vocabulary" that maps every unique word in our dataset to a unique integer (index).

We'll add four special tokens:
* `<unk>`: **Unknown** word. For any word that isn't in our vocabulary: words never seen in the training data, and words that appear fewer than `min_freq` times.
* `<pad>`: **Padding**. We'll use this to make all sentences in a batch the same length.
* `<sos>`: **Start of Sentence**. A token to tell the decoder "it's time to start generating!"
* `<eos>`: **End of Sentence**. A token the decoder will learn to predict when it's finished.
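
To make the mapping concrete, here is a minimal sketch on a toy, pre-tokenized corpus. The corpus and the helper `encode` are made up for illustration and are not part of the notebook's `Vocabulary` class. The frequency threshold mirrors `min_freq=2`.

```python
from collections import Counter

# Toy corpus, already tokenized and lower-cased (illustrative data only).
corpus = [["i", "love", "cats"], ["i", "love", "dogs"], ["cats", "sleep"]]

# Reserve the first four indices for the special tokens.
stoi = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}

# Keep only words seen at least twice, as min_freq=2 would.
counts = Counter(word for sentence in corpus for word in sentence)
for word, count in counts.items():
    if count >= 2:
        stoi[word] = len(stoi)

def encode(tokens):
    """Map tokens to indices, wrapping the sentence in <sos> ... <eos>."""
    body = [stoi.get(tok, stoi["<unk>"]) for tok in tokens]
    return [stoi["<sos>"]] + body + [stoi["<eos>"]]

encoded = encode(["i", "love", "dogs"])
print(stoi)
print(encoded)  # "dogs" appears only once, so it maps to <unk> (index 3)
```

Note that rare words fall back to `<unk>` even though they occur in the data, which is the price of a smaller vocabulary.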

In [6]:
class Vocabulary:
    def __init__(self, tokenizer, min_freq=2):
        self.tokenizer = tokenizer
        self.min_freq = min_freq

        # Start with special tokens
        self.itos = {0: '<pad>', 1: '<sos>', 2: '<eos>', 3: '<unk>'}
        self.stoi = {v: k for k, v in self.itos.items()}

    def build_vocabulary(self, sentence_list):
        word_counts = Counter()
        for sentence in sentence_list:
            tokens = [token.text.lower() for token in self.tokenizer(sentence)]
            word_counts.update(tokens)

        # Filter words by minimum frequency
        words = [word for word, count in word_counts.items() if count >= self.min_freq]

        # Add filtered words to vocabulary
        idx = len(self.itos) # Start indexing from the end of special tokens
        for word in words:
            self.stoi[word] = idx
            self.itos[idx] = word
            idx += 1

    def numericalize(self, text):
        tokens = [token.text.lower() for token in self.tokenizer(text)]
        return [self.stoi.get(token, self.stoi['<unk>']) for token in tokens]

# 1. Load the data
pairs = []
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    for line in f:
        # File has tab-separated pairs: EN\tFR\t...
        parts = line.strip().split('\t')
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))

# For a small notebook, let's limit the dataset size to train faster
# Let's take 40,000 examples
pairs = pairs[:40000]
print(f"Loaded {len(pairs)} sentence pairs.")

# 2. Separate source (EN) and target (FR)
source_sentences = [pair[0] for pair in pairs]
target_sentences = [pair[1] for pair in pairs]

# 3. Build vocabularies
en_vocab = Vocabulary(spacy_en.tokenizer)
fr_vocab = Vocabulary(spacy_fr.tokenizer)

en_vocab.build_vocabulary(source_sentences)
fr_vocab.build_vocabulary(target_sentences)

print(f"English (Source) Vocab Size: {len(en_vocab.itos)}")
print(f"French (Target) Vocab Size: {len(fr_vocab.itos)}")

# Example numericalization
example_text = "I love this notebook."
print(f"Original: {example_text}")
numericalized = en_vocab.numericalize(example_text)
print(f"Numericalized: {numericalized}")
print(f"Reversed: {[en_vocab.itos[i] for i in numericalized]}")

Loaded 40000 sentence pairs.
English (Source) Vocab Size: 3637
French (Target) Vocab Size: 5613
Original: I love this notebook.
Numericalized: [22, 335, 166, 3242, 5]
Reversed: ['i', 'love', 'this', 'notebook', '.']


### 3.4. Creating the Dataset and DataLoader

Now we'll create a PyTorch `Dataset` and `DataLoader`. This will handle batching our data (feeding it to the model in small chunks) and, importantly, **padding**.

**Padding** is essential because RNNs in a batch need all sequences to be the same length. We'll add `<pad>` tokens to the end of shorter sentences.
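
As a quick illustration of what padding does, the sketch below pads two made-up index sequences with `pad_sequence`. The index values are arbitrary, and `PAD = 0` is assumed to match the position of `<pad>` in our vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # assumed padding index, matching '<pad>' at position 0

# Two numericalized "sentences" of different lengths (made-up indices).
short = torch.tensor([1, 7, 2])
long = torch.tensor([1, 5, 9, 4, 2])

# The shorter sequence is right-padded to the length of the longest one.
batch = pad_sequence([short, long], batch_first=True, padding_value=PAD)
print(batch.shape)  # torch.Size([2, 5])
print(batch)
```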

In [7]:
from torch.nn.utils.rnn import pad_sequence

# Get special token indices
PAD_IDX = en_vocab.stoi['<pad>']
SOS_IDX = en_vocab.stoi['<sos>']
EOS_IDX = en_vocab.stoi['<eos>']

class TranslationDataset(data.Dataset):
    def __init__(self, source_sentences, target_sentences, en_vocab, fr_vocab):
        self.source_data = []
        self.target_data = []

        for i in range(len(source_sentences)):
            # Numericalize, add SOS/EOS tokens
            src = [SOS_IDX] + en_vocab.numericalize(source_sentences[i]) + [EOS_IDX]
            trg = [SOS_IDX] + fr_vocab.numericalize(target_sentences[i]) + [EOS_IDX]

            self.source_data.append(torch.tensor(src))
            self.target_data.append(torch.tensor(trg))

    def __len__(self):
        return len(self.source_data)

    def __getitem__(self, idx):
        return self.source_data[idx], self.target_data[idx]

# This custom 'collate_fn' is the key to batching
def collate_batch(batch):
    src_batch, trg_batch = [], []
    for src_sample, trg_sample in batch:
        src_batch.append(src_sample)
        trg_batch.append(trg_sample)

    # Use pad_sequence to pad all items in a batch to the same length
    src_padded = pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    trg_padded = pad_sequence(trg_batch, batch_first=True, padding_value=PAD_IDX)

    return src_padded, trg_padded

# --- Setup Dataloaders ---

# Split data into train/validation
# (For a real project, use a dedicated test set)
total_size = len(pairs)
train_size = int(total_size * 0.9)
val_size = total_size - train_size

train_pairs, val_pairs = data.random_split(pairs, [train_size, val_size])

train_src = [pair[0] for pair in train_pairs]
train_trg = [pair[1] for pair in train_pairs]
val_src = [pair[0] for pair in val_pairs]
val_trg = [pair[1] for pair in val_pairs]

train_dataset = TranslationDataset(train_src, train_trg, en_vocab, fr_vocab)
val_dataset = TranslationDataset(val_src, val_trg, en_vocab, fr_vocab)

BATCH_SIZE = 128

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                        shuffle=False, collate_fn=collate_batch)

print(f"Created DataLoaders. {len(train_loader)} training batches.")

# Check a batch
src_batch, trg_batch = next(iter(train_loader))
print(f"Source batch shape: {src_batch.shape}") # (BATCH_SIZE, max_src_len)
print(f"Target batch shape: {trg_batch.shape}") # (BATCH_SIZE, max_trg_len)

Created DataLoaders. 282 training batches.
Source batch shape: torch.Size([128, 8])
Target batch shape: torch.Size([128, 12])


## 4. The Architecture

Time to build the model! We'll use a **GRU (Gated Recurrent Unit)**, which is a type of RNN similar to an LSTM but a bit simpler and faster.

### 4.1. The Encoder

The Encoder's job is to "read" the input sequence and output a single "context vector."

**Process:**
1.  **Embedding:** We convert the input word indices into dense vectors (this is what we learned about in the "embeddings" notebook).
2.  **RNN (GRU):** The embedded vectors are fed into the GRU one by one.
3.  **Output:** The GRU produces two things: `outputs` (the hidden state from *every* time step) and `hidden` (the *final* hidden state).
4.  **Context Vector:** This **final hidden state** (`hidden`) is our context vector. It's our "summary" of the entire input sentence.
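
The relationship between `outputs` and `hidden` can be checked directly. This is a small sketch with toy sizes and random stand-in embeddings, not the notebook's real hyperparameters. It assumes a single-layer, unidirectional GRU, which is what we use here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for readability, not the notebook's real hyperparameters.
batch_size, src_len, emb_dim, hid_dim = 2, 5, 4, 3

gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
embedded = torch.randn(batch_size, src_len, emb_dim)  # stand-in for embeddings

outputs, hidden = gru(embedded)
print(outputs.shape)  # [batch_size, src_len, hid_dim]
print(hidden.shape)   # [1, batch_size, hid_dim]

# For a single-layer, unidirectional GRU, the last time step of `outputs`
# holds the same values as the final hidden state.
same = torch.allclose(outputs[:, -1, :], hidden[0])
print(same)
```

So returning only `hidden` from the encoder discards the per-token states in `outputs`. That is exactly the information an attention mechanism would later reuse.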

In [8]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.input_dim = input_dim

        # 1. Embedding layer
        # input_dim = vocab size
        self.embedding = nn.Embedding(input_dim, emb_dim)

        # 2. GRU layer
        # We set batch_first=True to match our DataLoader
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src shape = [batch_size, src_len]

        # 1. Get embeddings
        embedded = self.dropout(self.embedding(src))
        # embedded shape = [batch_size, src_len, emb_dim]

        # 2. Pass through RNN
        # outputs = hidden state for *every* token
        # hidden = the *final* hidden state
        outputs, hidden = self.rnn(embedded)

        # outputs shape = [batch_size, src_len, hid_dim]
        # hidden shape = [1, batch_size, hid_dim] (1 for num_layers)

        # The 'hidden' state is our context vector.
        # This is what we will pass to the decoder.
        return hidden

### 4.2. The Decoder

The Decoder's job is to take the context vector and generate the output sequence, word by word.

**Process:**
1.  **Input:** It takes the context vector from the encoder (as its *initial* hidden state), the *previous* word it generated (or `<sos>` to start), and gets embeddings.
2.  **RNN (GRU):** It runs the GRU for *one step* using the embedding and the *previous* hidden state.
3.  **Output:** The GRU produces a new `output` and a new `hidden` state.
4.  **Prediction:** The `output` vector (which has size `hid_dim`) is passed through a **Linear layer** to transform it into a vector the size of our *target vocabulary*.
5.  **Probabilities:** This final vector represents the "scores" for every word in the target (French) vocab. A softmax can turn this into probabilities. The word with the highest score is our prediction.
6.  **Loop:** The new `hidden` state is saved, and the predicted word is fed back in as the *input* for the next time step.
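
Steps 4 and 5 can be sketched in isolation. The scores below are invented for a five-word vocabulary. The point is only that softmax preserves the ranking, so taking the argmax of the raw scores gives the same prediction as taking it over the probabilities.

```python
import torch

# Made-up scores (logits) for a 5-word vocabulary, batch of 2.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.0],
                       [0.3, 0.2, 3.1, -0.5, 1.0]])

probs = torch.softmax(logits, dim=1)      # each row now sums to 1
pred_from_probs = probs.argmax(dim=1)     # most probable word per example
pred_from_logits = logits.argmax(dim=1)   # same answer without the softmax

print(probs.sum(dim=1))
print(pred_from_probs, pred_from_logits)
```

This is why the decoder can return raw scores: greedy prediction does not need the softmax, and the loss function we use later applies it internally.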

In [9]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.output_dim = output_dim
        self.hid_dim = hid_dim

        # 1. Embedding layer
        self.embedding = nn.Embedding(output_dim, emb_dim)

        # 2. GRU layer
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

        # 3. Linear layer (to predict the next word)
        # It maps from hidden_dim to our target vocab size
        self.fc_out = nn.Linear(hid_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        # This decoder processes one token at a time.

        # input shape = [batch_size] (the current token)
        # hidden shape = [1, batch_size, hid_dim] (from encoder or last step)

        # We need to add a sequence length dimension to the input
        input = input.unsqueeze(1) # shape = [batch_size, 1]

        # 1. Get embeddings
        embedded = self.dropout(self.embedding(input))
        # embedded shape = [batch_size, 1, emb_dim]

        # 2. Pass through RNN
        output, hidden = self.rnn(embedded, hidden)
        # output shape = [batch_size, 1, hid_dim]
        # hidden shape = [1, batch_size, hid_dim]

        # 3. Pass through Linear layer
        # We remove the '1' dim from the output
        prediction = self.fc_out(output.squeeze(1))
        # prediction shape = [batch_size, output_dim]

        return prediction, hidden

### 4.3. The Seq2Seq Wrapper Model

Finally, we create a "wrapper" model that combines the Encoder and Decoder. This model will manage the overall process.

A key concept here is **Teacher Forcing**.
* **With Teacher Forcing (Training):** When training, instead of feeding the decoder's *own* prediction back in, we feed the **correct** word from the target data (e.g., the actual French sentence). This makes training much more stable and faster.
* **Without Teacher Forcing (Inference):** When *using* the model, we don't have the correct target data. So, we must feed the decoder's *own* prediction back in.

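Our model will use a `teacher_forcing_ratio` (e.g., 0.5): at each decoding step during training, it feeds in the correct word with that probability and its own prediction otherwise. The choice is made once per step for the whole batch.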
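
The per-step coin flip can be sketched in plain Python. The helper `next_decoder_input` is hypothetical and exists only for this illustration. The strings `"truth"` and `"guess"` stand in for the target token and the model's prediction.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick what the decoder sees at the next step."""
    use_target = rng.random() < teacher_forcing_ratio
    return target_token if use_target else predicted_token

# With a ratio of 0.5, roughly half of the steps use the ground truth.
rng = random.Random(0)
choices = [next_decoder_input("truth", "guess", 0.5, rng) for _ in range(1000)]
share_truth = choices.count("truth") / len(choices)
print(f"Share of teacher-forced steps: {share_truth:.2f}")
```

A ratio of `1.0` always feeds the ground truth, and a ratio of `0.0` always feeds the model's own prediction, which is the setting that matches inference.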

In [11]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [batch_size, src_len]
        # trg = [batch_size, trg_len]

        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder's outputs
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        # 1. Run the encoder
        # context shape = [1, batch_size, hid_dim]
        context = self.encoder(src)

        # 2. Run the decoder
        # The first input to the decoder is the <sos> token
        input = trg[:, 0] # Get all <sos> tokens from the batch

        # The 'context' from the encoder is the first 'hidden' state
        # for the decoder
        hidden = context

        # Loop from 1 (skip <sos>) to the end of the target sentence
        for t in range(1, trg_len):

            # Run the decoder for one step
            output, hidden = self.decoder(input, hidden)

            # Store the prediction
            outputs[:, t] = output

            # Decide whether to "teacher force"
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the top predicted word
            top1 = output.argmax(1)

            # If teacher forcing, use actual target word
            # Otherwise, use the decoder's own prediction
            input = trg[:, t] if teacher_force else top1

        return outputs

## 5. Training the Model

We're ready to train! We need to define our model, optimizer, and loss function.

**The Loss Function:** We'll use **Cross-Entropy Loss**.
* It's perfect for this "classification" task (at each step, we're "classifying" the next word from all possibilities in the vocabulary).
* **Crucially:** We must tell the loss function to **ignore** the `<pad>` tokens. We don't want to penalize the model for its predictions on padding.
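
To see what ignoring padding means in practice, here is a small sketch with random scores and a made-up target in which the last two positions are padding. It assumes `PAD = 0`, as in our vocabulary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0  # assumed padding index

logits = torch.randn(4, 6)                 # 4 positions, 6-word vocabulary
targets = torch.tensor([3, 5, PAD, PAD])   # last two positions are padding

# Padding positions contribute nothing to the averaged loss.
masked_loss = nn.CrossEntropyLoss(ignore_index=PAD)(logits, targets)

# Same value as averaging over the two real positions only.
real_only_loss = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(masked_loss.item(), real_only_loss.item())
```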

In [12]:
# --- Define Model Parameters ---
INPUT_DIM = len(en_vocab.itos)
OUTPUT_DIM = len(fr_vocab.itos)
EMB_DIM = 256
HID_DIM = 512
DROPOUT = 0.5

# --- Instantiate Models ---
enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, DROPOUT)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

# --- Optimizer ---
optimizer = optim.Adam(model.parameters())

# --- Loss Function ---
# Get the index of our <pad> token
PAD_IDX = fr_vocab.stoi['<pad>']

# Tell the loss function to ignore padding
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

### 5.1. The Training and Evaluation Loops

We'll create a function for our training loop and one for our evaluation loop.

**Important Note:** PyTorch's `CrossEntropyLoss` is simplest to use with a 2D input of shape `(num_examples, num_classes)` and a 1D target of shape `(num_examples,)`. Our outputs are `(batch_size, seq_len, vocab_size)` and our targets are `(batch_size, seq_len)`, so we `reshape` both before calculating the loss.
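
The reshape can be checked on toy tensors. This sketch uses random scores and fake targets. It also shows an alternative layout that `CrossEntropyLoss` accepts, with the class axis moved to position 1, to confirm that both give the same loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, trg_len, vocab_size = 2, 4, 7  # toy sizes

output = torch.randn(batch_size, trg_len, vocab_size)       # decoder scores
trg = torch.randint(1, vocab_size, (batch_size, trg_len))   # fake targets, no padding

criterion = nn.CrossEntropyLoss()

# Option used in this notebook: flatten batch and time into one axis.
output_flat = output.reshape(-1, vocab_size)   # [batch_size * trg_len, vocab_size]
trg_flat = trg.reshape(-1)                     # [batch_size * trg_len]
flat_loss = criterion(output_flat, trg_flat)

# Equivalent alternative: move the class axis to position 1.
permuted_loss = criterion(output.permute(0, 2, 1), trg)

print(flat_loss.item(), permuted_loss.item())
```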

In [13]:
def train_fn(model, loader, optimizer, criterion, clip):
    model.train() # Set model to training mode
    epoch_loss = 0

    for i, (src, trg) in enumerate(loader):
        src = src.to(device)
        trg = trg.to(device)

        optimizer.zero_grad()

        # 1. Forward pass
        output = model(src, trg) # teacher_forcing_ratio is 0.5 by default

        # output shape = [batch_size, trg_len, output_dim]
        # trg shape = [batch_size, trg_len]

        # 2. Reshape for loss function
        # We skip the <sos> token (index 0)
        output_dim = output.shape[-1]

        output_flat = output[:, 1:].reshape(-1, output_dim)
        trg_flat = trg[:, 1:].reshape(-1)

        # output_flat shape = [(trg_len - 1) * batch_size, output_dim]
        # trg_flat shape = [(trg_len - 1) * batch_size]

        # 3. Calculate loss
        loss = criterion(output_flat, trg_flat)

        # 4. Backward pass and optimization
        loss.backward()

        # Clip gradients to prevent them from exploding
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(loader)

def evaluate_fn(model, loader, criterion):
    model.eval() # Set model to evaluation mode
    epoch_loss = 0

    with torch.no_grad(): # No gradients needed for evaluation
        for i, (src, trg) in enumerate(loader):
            src = src.to(device)
            trg = trg.to(device)

            # Forward pass (turn off teacher forcing)
            output = model(src, trg, teacher_forcing_ratio=0)

            # Reshape for loss
            output_dim = output.shape[-1]
            output_flat = output[:, 1:].reshape(-1, output_dim)
            trg_flat = trg[:, 1:].reshape(-1)

            # Calculate loss
            loss = criterion(output_flat, trg_flat)

            epoch_loss += loss.item()

    return epoch_loss / len(loader)

### 5.2. Let's Train!

Now we run the main loop. We'll also time it and save the model with the best validation loss.
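
The loop also prints **perplexity (PPL)**, which is the exponential of the average cross-entropy loss. As a rough reference point, a model that guesses uniformly over a vocabulary of size V has a perplexity of V. The sketch below uses a hypothetical helper, `perplexity`, and the French vocabulary size printed earlier (5613).

```python
import math

def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the average per-token loss (in nats)."""
    return math.exp(avg_cross_entropy)

# A uniform guess over V words has loss ln(V), hence perplexity V.
uniform_ppl = perplexity(math.log(5613))
print(f"Uniform-guess perplexity: {uniform_ppl:.1f}")

# A perfect model has loss 0 and perplexity 1.
print(perplexity(0.0))
```

Lower is better: a perplexity near 10 means the model is, loosely speaking, about as uncertain as if it were choosing among 10 equally likely words at each step.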

In [14]:
N_EPOCHS = 30
CLIP = 1.0 # Gradient clipping value

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train_fn(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate_fn(model, val_loader, criterion)

    end_time = time.time()

    epoch_mins = int((end_time - start_time) / 60)
    epoch_secs = int((end_time - start_time) % 60)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt') # Save the best model

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 12s
	Train Loss: 3.893 | Train PPL:  49.081
	 Val. Loss: 3.494 |  Val. PPL:  32.910
Epoch: 02 | Time: 0m 13s
	Train Loss: 2.833 | Train PPL:  17.000
	 Val. Loss: 3.062 |  Val. PPL:  21.369
Epoch: 03 | Time: 0m 13s
	Train Loss: 2.434 | Train PPL:  11.400
	 Val. Loss: 2.831 |  Val. PPL:  16.959
Epoch: 04 | Time: 0m 13s
	Train Loss: 2.163 | Train PPL:   8.695
	 Val. Loss: 2.684 |  Val. PPL:  14.645
Epoch: 05 | Time: 0m 13s
	Train Loss: 1.971 | Train PPL:   7.179
	 Val. Loss: 2.527 |  Val. PPL:  12.520
Epoch: 06 | Time: 0m 13s
	Train Loss: 1.786 | Train PPL:   5.963
	 Val. Loss: 2.526 |  Val. PPL:  12.498
Epoch: 07 | Time: 0m 13s
	Train Loss: 1.670 | Train PPL:   5.312
	 Val. Loss: 2.431 |  Val. PPL:  11.368
Epoch: 08 | Time: 0m 13s
	Train Loss: 1.553 | Train PPL:   4.725
	 Val. Loss: 2.327 |  Val. PPL:  10.247
Epoch: 09 | Time: 0m 13s
	Train Loss: 1.454 | Train PPL:   4.278
	 Val. Loss: 2.327 |  Val. PPL:  10.248
Epoch: 10 | Time: 0m 13s
	Train Loss: 1.375 | Train PPL

## 6. Inference: Using Our Model

Training is done! Now for the fun part: using our model to translate.

For inference, we **cannot** use teacher forcing. We must feed the model's *own* predictions back into it. This process is called "greedy decoding" (we always pick the word with the single highest probability).

We'll write a function that:
1.  Takes a new English sentence.
2.  Tokenizes and numericalizes it.
3.  Feeds it to the **Encoder** to get a `context` vector.
4.  Starts the **Decoder** with the `<sos>` token.
5.  Loops, feeding the *last predicted word* back into the decoder.
6.  Stops when the decoder predicts `<eos>` or we hit a max length.
7.  Converts the output indices back into French words.
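
Before the real implementation, the greedy loop can be sketched without any neural network. Everything here is a stand-in: `greedy_decode`, `toy_step`, and the score table `TOY_SCORES` are hypothetical, with the table playing the role of the decoder.

```python
def greedy_decode(step_fn, sos, eos, max_len=50):
    """Repeatedly pick the highest-scoring next token until <eos> or max_len."""
    tokens = [sos]
    state = None
    for _ in range(max_len):
        scores, state = step_fn(tokens[-1], state)
        best = max(scores, key=scores.get)  # greedy: take the single best token
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]  # drop <sos>

# Hypothetical stand-in for a decoder step: a fixed table of next-token scores.
TOY_SCORES = {
    "<sos>": {"je": 0.9, "tu": 0.1},
    "je": {"suis": 0.8, "<eos>": 0.2},
    "suis": {"ici": 0.6, "<eos>": 0.4},
    "ici": {"<eos>": 0.7, "ici": 0.3},
}

def toy_step(prev_token, state):
    return TOY_SCORES[prev_token], state

result = greedy_decode(toy_step, "<sos>", "<eos>")
print(result)  # ['je', 'suis', 'ici', '<eos>']
```

Greedy decoding never revisits a choice, so one poor early pick affects every later step. Beam search is a common alternative that keeps several candidates alive.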

In [15]:
# Load our best saved model
# map_location lets the checkpoint load on CPU even if it was saved on a GPU
model.load_state_dict(torch.load('best-model.pt', map_location=device))

def translate_sentence(sentence, model, en_vocab, fr_vocab, device, max_len=50):
    model.eval() # Set to evaluation mode

    # 1. Tokenize and numericalize
    tokens = [token.text.lower() for token in spacy_en.tokenizer(sentence)]
    tokens = [SOS_IDX] + [en_vocab.stoi.get(t, en_vocab.stoi['<unk>']) for t in tokens] + [EOS_IDX]

    src_tensor = torch.LongTensor(tokens).unsqueeze(0).to(device) # [1, src_len]

    # 2. Get context vector from encoder
    with torch.no_grad():
        context = model.encoder(src_tensor) # [1, 1, hid_dim]

    # 3. Start decoder with <sos>
    trg_indices = [SOS_IDX]

    # The context is the first hidden state
    hidden = context

    for _ in range(max_len):
        # Get the last predicted word
        trg_tensor = torch.LongTensor([trg_indices[-1]]).to(device)

        # 4. Run decoder for one step (no gradients needed at inference)
        with torch.no_grad():
            output, hidden = model.decoder(trg_tensor, hidden)

        # 5. Get the top prediction
        pred_token = output.argmax(1).item()

        trg_indices.append(pred_token)

        # 6. Stop if <eos>
        if pred_token == EOS_IDX:
            break

    # 7. Convert indices back to words
    trg_tokens = [fr_vocab.itos[i] for i in trg_indices]

    # Return the translation (skipping <sos>)
    return " ".join(trg_tokens[1:])

# --- Let's try some examples! ---

# Pick a sentence from the validation set (fixed index, so the example is repeatable)
example_idx = 10
src_sentence = val_src[example_idx]
trg_sentence = val_trg[example_idx]

translation = translate_sentence(src_sentence, model, en_vocab, fr_vocab, device)

print(f"Source (EN): {src_sentence}")
print(f"Target (FR): {trg_sentence}")
print(f"Model (PRED): {translation}")
print("---")

# Try a custom sentence
custom_sentence = "A man is reading a book."
translation = translate_sentence(custom_sentence, model, en_vocab, fr_vocab, device)

print(f"Source (EN): {custom_sentence}")
print(f"Model (PRED): {translation}")

Source (EN): I'm right here.
Target (FR): Je suis juste ici.
Model (PRED): je suis ici ici . <eos>
---
Source (EN): A man is reading a book.
Model (PRED): un mouche est un . <eos>


## 7. Conclusion and Next Steps

Congratulations! You've just built a complete sequence-to-sequence model from scratch.

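You'll notice the translations are only partly right. The first example repeats a word ("ici ici"), and the second drifts away from the meaning of the source sentence. Repetition and grammatical awkwardness are typical for a "vanilla" Seq2Seq model trained on a small dataset.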

**The Bottleneck Problem:** Our model's main weakness is the **context vector**. The encoder has to compress the *entire* meaning of a 20-word sentence into one small vector (`hidden`). This is a huge information bottleneck!

**The Solution:** The next step is to implement an **Attention** mechanism.

**Attention** allows the decoder to "look back" at *all* of the encoder's outputs (not just the final hidden state) at every step of the generation process. It learns to "pay attention" to the most relevant input words when generating the next output word.
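
As a rough preview, here is a minimal sketch of one attention step using dot-product scoring. This is only one of several scoring variants, not necessarily the one a full attention model would use. The tensors are random stand-ins for the encoder's per-token outputs and the decoder's current hidden state.

```python
import torch

torch.manual_seed(0)
batch_size, src_len, hid_dim = 2, 5, 8  # toy sizes

encoder_outputs = torch.randn(batch_size, src_len, hid_dim)  # one vector per source token
decoder_hidden = torch.randn(batch_size, hid_dim)            # current decoder state

# Score each source position by its dot product with the decoder state.
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
# scores shape = [batch_size, src_len]

# Normalize the scores into attention weights (each row sums to 1).
weights = torch.softmax(scores, dim=1)

# Weighted average of encoder outputs: a fresh context vector for this step.
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
# context shape = [batch_size, hid_dim]

print(weights.shape, context.shape)
```

Unlike the single fixed context vector in this notebook, `context` here is recomputed at every decoding step, so the decoder is no longer limited to what fits in the encoder's final hidden state.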



This improvement gave Seq2Seq models a large boost in translation quality, especially on longer sentences. It is also a direct precursor to the Transformer, which drops recurrence entirely and is built around attention.

This notebook provides the perfect foundation for you to build an "Attention" model next!