# Homework 8: Seq2Seq Model for Machine Translation

In this assignment, you’ll build a sequence-to-sequence (Seq2Seq) model to translate text from German to English. While we’re focusing on language translation, this model can be adapted to any task that requires transforming one sequence into another, such as generating conversational responses in a chatbot or summarizing text (i.e., converting a sequence to a shorter sequence in the same language). This assignment is inspired by the foundational paper [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215).

To achieve machine translation, we will need to preprocess the dataset and build the necessary components of the Seq2Seq model. The main tasks are summarized below:
- Prepare data
  - Tokenization: Split sentences into individual tokens (words or subwords).
  - Build Vocabularies: Create vocabulary mappings for German and English tokens.
  - Convert Tokens to IDs: Map each token to a unique ID for model input.
  - Pad Sequences: Ensure all sequences in a batch have the same length by padding.
  - Create Data Loaders: Prepare batches of data for efficient training.
- Define the Seq2Seq Model: Implement the core model components:
  - Encoder: Processes the input sequence (German text).
  - Decoder: Generates the output sequence (English translation).
  - Seq2Seq: Combines the encoder and decoder to complete the translation task.
- Train the Model: Learn translation patterns from German to English.
- Evaluate the Model: Assess translation quality on validation data.


## 0 - Useful Libraries
First, let’s import all the necessary libraries for this assignment. The primary libraries we’ll use are:
- [PyTorch](https://pytorch.org/): For building and training neural network models.
- [Datasets](https://huggingface.co/docs/datasets/index): A library from Hugging Face for loading and processing datasets.
- [nltk](https://www.nltk.org/): A toolkit for natural language processing, including a word-level tokenizer.

Other libraries commonly used in NLP tasks are listed below. While they are not required for this assignment, they are helpful for more advanced data processing and evaluation:

- [TorchText](https://github.com/pytorch/text): Provides utilities and helper functions for NLP tasks in PyTorch.
- [spaCy](https://spacy.io/):  Assists in text tokenization and provides additional NLP tools.
- [Evaluate](https://huggingface.co/docs/evaluate/index): A library from Hugging Face for calculating various metrics.

## 1 - Preparing Data

In this assignment, we’ll begin by preparing the data before implementing the neural network models. This approach will give you a strong understanding of how to handle data for sequence-to-sequence tasks, such as machine translation.

If you don’t have the Datasets library installed, you can install it with `pip install datasets`. Once it is installed, import it:

In [1]:
import datasets

### 1.1 - Dataset

We will use the [Multi30k](https://github.com/multi30k/dataset) dataset, which was originally designed for multilingual translation and image captioning. Multi30k provides translations of image captions in multiple languages, making it a suitable choice for machine translation tasks with a manageable dataset size. This dataset includes around 30,000 parallel sentence pairs in English, German, and French, allowing models to be trained on translations between these languages.

For this assignment, we will focus on translating from German to English.



In [2]:
# Load the English-German subset
dataset = datasets.load_dataset("bentrevett/multi30k")

In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})


For convenience, we split `dataset` into `train_data`, `valid_data`, and `test_data`.

In [4]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)

In [5]:
train_data[3]

{'en': 'A man in a blue shirt is standing on a ladder cleaning a window.',
 'de': 'Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.'}

### 1.2 - Tokenizers [1/1]

Each sample in `train_data` consists of a pair of sentences as strings. A tokenizer is used to break down a sentence into a list of individual words or **tokens**. For example, the sentence "You get what you deserve, not what you desire, so desire what you deserve" would be tokenized as `['You', 'get', 'what', 'you', 'deserve', ',', 'not', 'what', 'you', 'desire', ',', 'so', 'desire', 'what', 'you', 'deserve']`.

We’ll use the predefined `word_tokenize` method from [nltk](https://www.nltk.org/), which efficiently handles tokenization. You can load this method with the following code:


In [6]:
import nltk
from nltk.tokenize import word_tokenize

# Download necessary data files (Punkt tokenizer models)
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\azeez\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
text_english = "You get what you deserve, not what you desire, so desire what you deserve"
tokens = word_tokenize(text_english)
print("Tokens:", tokens)

Tokens: ['You', 'get', 'what', 'you', 'deserve', ',', 'not', 'what', 'you', 'desire', ',', 'so', 'desire', 'what', 'you', 'deserve']


Next, we’ll use `word_tokenize` as a basic method to tokenize all examples in `train_data`, `valid_data`, and `test_data`. Each example in the dataset is a dictionary containing an English-German sentence pair, so we can extract each sentence and apply `word_tokenize` to convert the string into a list of tokens.

To streamline training, we’ll apply the following preprocessing steps:
- Lowercase Conversion: Convert all words to lowercase for consistency.
- Maximum Length: Limit the number of tokens to a fixed max_length. If a sentence exceeds max_length, we keep only the first max_length tokens.
- Special Tokens: Add `<sos>` (start of sequence) at the beginning and `<eos>` (end of sequence) at the end of each tokenized list.

The final output for each example will still be a dictionary representing the English-German pair, but with each sentence represented as a list of tokens instead of a single string.
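
The three preprocessing steps can be sketched on a single sentence as follows. This is only an illustration under stated assumptions, not the reference solution: `toy_tokenize` is a hypothetical helper, and it uses `str.split()` as a stand-in for `word_tokenize` so that it runs without the NLTK data files.

```python
def toy_tokenize(sentence, max_length=5, sos_token="<sos>", eos_token="<eos>"):
    # Lowercase, then split on whitespace (assumption: stand-in for word_tokenize)
    tokens = sentence.lower().split()
    # Keep only the first max_length tokens
    tokens = tokens[:max_length]
    # Wrap the sequence in the special start/end tokens
    return [sos_token] + tokens + [eos_token]


example_tokens = toy_tokenize("Ein Mann in einem blauen Hemd steht auf einer Leiter")
print(example_tokens)
```

Because the special tokens are added after truncation in this sketch, the result can contain up to `max_length + 2` tokens.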

**Exercise [1/1]**: Implement `tokenize_example()` to apply these transformations.

In [None]:
def tokenize_example(example, max_length=110, sos_token="<sos>", eos_token="<eos>"):
    tokenized = {}
    for lang_key in example.keys():
        ## Code Here ###

    return tokenized

In [9]:
tokenized = tokenize_example(train_data[5])
print("Tokenized English: ", tokenized["en"])
print("Tokenized German: ", tokenized["de"])

Tokenized English:  ['<sos>', 'a', 'man', 'in', 'green', 'holds', 'a', 'guitar', 'while', 'the', 'other', 'man', 'observes', 'his', 'shirt', '.', '<eos>']
Tokenized German:  ['<sos>', 'ein', 'mann', 'in', 'grün', 'hält', 'eine', 'gitarre', ',', 'während', 'der', 'andere', 'mann', 'sein', 'hemd', 'ansieht', '.', '<eos>']


Next, we’ll use the `map()` method to apply `tokenize_example` to every example in `train_data`, `valid_data`, and `test_data`. We’ll pass a dictionary called `fn_kwargs` whose keys are the *argument names* of `tokenize_example` (`max_length`, `sos_token`, and `eos_token`) and whose values are the settings to use, so that preprocessing is consistent across all datasets.
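
Conceptually, `map()` calls the function on each example and forwards the entries of `fn_kwargs` as keyword arguments. The pure-Python sketch below mimics that behaviour on a list of dictionaries. The names `toy_map` and `shout` are hypothetical, and the real `Dataset.map()` does considerably more (caching, batching, column bookkeeping).

```python
def toy_map(examples, function, fn_kwargs=None):
    fn_kwargs = fn_kwargs or {}
    mapped = []
    for example in examples:
        # Call the function with the example plus the forwarded keyword arguments
        update = function(example, **fn_kwargs)
        # Returned keys overwrite (or extend) the columns of the example
        mapped.append({**example, **update})
    return mapped


def shout(example, suffix="!"):
    # Toy transformation: uppercase the English sentence and append a suffix
    return {"en": example["en"].upper() + suffix}


toy_data = [{"en": "a dog", "de": "ein hund"}, {"en": "a cat", "de": "eine katze"}]
result = toy_map(toy_data, shout, fn_kwargs={"suffix": "?"})
print(result)
```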

In [10]:
# Parameters for tokenization
max_length = 110
sos_token = "<sos>"
eos_token = "<eos>"

fn_kwargs = {
    "max_length": max_length,
    "sos_token": sos_token,
    "eos_token": eos_token,
}

train_data_tokenized = train_data.map(tokenize_example, fn_kwargs=fn_kwargs)
valid_data_tokenized = valid_data.map(tokenize_example, fn_kwargs=fn_kwargs)
test_data_tokenized = test_data.map(tokenize_example, fn_kwargs=fn_kwargs)

In [11]:
print(train_data_tokenized[110]["en"])
print(train_data_tokenized[110]["de"])

['<sos>', 'a', 'man', 'wearing', 'a', 'black', 'shirt', 'smoothing', 'out', 'concrete', 'in', 'an', 'urban', 'area', '.', '<eos>']
['<sos>', 'ein', 'mann', ',', 'der', 'ein', 'schwarzes', 'hemd', 'trägt', 'und', 'in', 'einer', 'städtischen', 'umgebung', 'beton', 'glättet', '.', '<eos>']


### 1.3 - Vocabularies [1/1]

Next, we’ll build vocabularies for the source and target languages, specifically `de_vocab` and `en_vocab`. A vocabulary maps each unique token in our dataset to a unique index (an integer), allowing for efficient token handling. For example, in the vocabulary, "apple" = 1, "banana" = 2, "cat" = 3, etc.

We’ll create a separate vocabulary for each language from our training data (`train_data_tokenized`) using the `build_vocab` function. This function scans each `word` in the sequence `example[lang_key]` of every tokenized example, and if `word` is not already in `vocab`, it adds `word` with the current length of `vocab` as its index, i.e., `vocab[word] = len(vocab)`.

In principle, the vocabulary could cover all unique tokens in our dataset. However, some tokens might appear only in `test_data_tokenized` or `valid_data_tokenized` and not in `train_data_tokenized`. In these cases, we replace unknown tokens with `"<unk>"` (unknown token), which has its own index. For instance, if the tokens `"gilgamesh"` and `"enkidu"` are missing from our vocabulary, the phrase `"gilgamesh hates enkidu"` would become `[3, 24, 3]`, where `<unk>` is 3, `"hates"` is 24, and each unknown token maps to `<unk>`.

To efficiently input sentences into our model, we `batch` multiple sentences together. For batching, all sentences must be the **same** length. Since sentence lengths vary, we pad each sentence in a batch with the `<pad>` token until they match the longest sentence. For example, `"I love cat"` and `"I do not like mouse"` would be tokenized as `["i", "love", "cat"]` and `["i", "do", "not", "like", "mouse"]`. After padding, they would become `["i", "love", "cat", "<pad>", "<pad>"]` and `["i", "do", "not", "like", "mouse"]`.

To accommodate these requirements, the `specials` argument in `build_vocab()` allows us to add special tokens (`<pad>`, `<unk>`, `<sos>`, and `<eos>`) to the vocabulary, even if they don’t appear in our tokenized examples. These special tokens are added at the beginning of the vocabulary.
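
The indexing rule `vocab[word] = len(vocab)` can be illustrated on two toy sequences. This is a sketch under simplifying assumptions, not the required implementation: the hypothetical `toy_build_vocab` takes a plain list of token lists rather than a tokenized dataset and a language key.

```python
def toy_build_vocab(sequences, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    # Special tokens take the first indices
    vocab = {token: idx for idx, token in enumerate(specials)}
    for sequence in sequences:
        for word in sequence:
            if word not in vocab:
                # The next free index equals the current vocabulary size
                vocab[word] = len(vocab)
    return vocab


toy_sequences = [
    ["<sos>", "i", "love", "cat", "<eos>"],
    ["<sos>", "i", "do", "not", "like", "mouse", "<eos>"],
]
toy_vocab = toy_build_vocab(toy_sequences)
print(toy_vocab)
```

Since the special tokens are already present when the loop starts, `<sos>` and `<eos>` in the sequences do not receive new indices.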

**Exercise [1/1]**: Implement `build_vocab()`.


In [None]:
def build_vocab(tokenized, lang_key, specials=["<pad>", "<sos>", "<eos>", "<unk>"]):
    # Initialize vocab with special tokens
    vocab = {token: idx for idx, token in enumerate(specials)}

    ## Code Here ###

    return vocab

In [13]:
# Define specials
special_tokens = ["<pad>", "<sos>", "<eos>", "<unk>"]

# Build vocabularies
en_vocab = build_vocab(train_data_tokenized, "en", specials=special_tokens)
de_vocab = build_vocab(train_data_tokenized, "de", specials=special_tokens)

print("English Vocabulary size:", len(en_vocab))
print("German Vocabulary size:", len(de_vocab))

English Vocabulary size: 10218
German Vocabulary size: 18680


In [14]:
print("English Vocabulary:", dict(list(en_vocab.items())[:20]))
print("German Vocabulary:", dict(list(de_vocab.items())[:20]))

English Vocabulary: {'<pad>': 0, '<sos>': 1, '<eos>': 2, '<unk>': 3, 'two': 4, 'young': 5, ',': 6, 'white': 7, 'males': 8, 'are': 9, 'outside': 10, 'near': 11, 'many': 12, 'bushes': 13, '.': 14, 'several': 15, 'men': 16, 'in': 17, 'hard': 18, 'hats': 19}
German Vocabulary: {'<pad>': 0, '<sos>': 1, '<eos>': 2, '<unk>': 3, 'zwei': 4, 'junge': 5, 'weiße': 6, 'männer': 7, 'sind': 8, 'im': 9, 'freien': 10, 'in': 11, 'der': 12, 'nähe': 13, 'vieler': 14, 'büsche': 15, '.': 16, 'mehrere': 17, 'mit': 18, 'schutzhelmen': 19}


### 1.4 - Convert to IDs [2/2]

To facilitate training, we’ll convert each tokenized sequence into a list of indices by replacing every `token` in the `sequence` with its corresponding index in the vocabulary. This approach, combined with an *embedding* layer, allows us to efficiently retrieve the relevant row from the embedding matrix without performing an actual matrix-vector multiplication, thus speeding up training.
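
The claim that an index lookup replaces a matrix multiplication can be checked numerically. The sketch below uses a small random matrix `E` (a hypothetical stand-in for an embedding matrix, with one row per token) and compares row indexing against multiplication by one-hot vectors.

```python
import torch
import torch.nn.functional as F

vocab_size, emb_dim = 5, 3
E = torch.randn(vocab_size, emb_dim)   # one embedding row per token ID
token_ids = torch.tensor([2, 0, 4])

# Lookup: select the rows directly
looked_up = E[token_ids]

# Equivalent but more expensive: one-hot vectors times the embedding matrix
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
multiplied = one_hot @ E

print(torch.allclose(looked_up, multiplied))
```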

Specifically, we’ll implement a `convert_tokens_to_ids()` method that takes an example from the dataset along with the English and German vocabularies. Each `example` is a dictionary with language keys (`"en"` and `"de"`) and their associated tokenized sequences. Given a language key, we’ll extract the corresponding `sequence` from example, then iterate over each `token` in the `sequence` and replace it with its index from the appropriate vocabulary, e.g., `vocab[token]`.

Since the vocabularies are built from the *training dataset*, some tokens may appear in the test or validation sets but **not** in the training set. In these cases, `vocab[token]` will raise a `KeyError` because the token does not exist in the vocabulary. To handle this, we’ll assign the index for `<unk>` (unknown token) instead. The method `vocab.get(token, vocab["<unk>"])` retrieves the token’s index if it exists in `vocab`; otherwise, it returns `vocab["<unk>"]`.
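
A small sketch of the fallback, using a hypothetical five-entry vocabulary in which `"gilgamesh"` and `"enkidu"` are missing:

```python
toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "hates": 4}
tokens = ["<sos>", "gilgamesh", "hates", "enkidu", "<eos>"]

# Direct indexing fails on out-of-vocabulary tokens
try:
    toy_vocab["gilgamesh"]
    raised = False
except KeyError:
    raised = True

# dict.get falls back to the <unk> index instead
ids = [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]
print(raised, ids)
```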

**Exercise [1/1]**: Implement `convert_tokens_to_ids()`.

In [None]:
def convert_tokens_to_ids(example, en_vocab, de_vocab, max_length=12):
    data_ids = {}
    for lang_key in example.keys():
        ## Code Here ###

    return data_ids

Similar to `tokenize_example()`, we’ll apply the `convert_tokens_to_ids()` method to every example in `train_data_tokenized`, `valid_data_tokenized`, and `test_data_tokenized` using the `map()` function. We’ll specify argument names and values in the dictionary `fn_kwargs` for consistent application across all datasets.

**Exercise [1/1]**: Use the map() function to convert tokenized examples in the train, validation, and test datasets into lists of IDs.

In [16]:
fn_kwargs = {"en_vocab": en_vocab, "de_vocab": de_vocab, "max_length": max_length}

## Code Here ###
train_data_ids = train_data_tokenized.map(convert_tokens_to_ids, fn_kwargs=fn_kwargs)
valid_data_ids = valid_data_tokenized.map(convert_tokens_to_ids, fn_kwargs=fn_kwargs)
test_data_ids = test_data_tokenized.map(convert_tokens_to_ids, fn_kwargs=fn_kwargs)

In [17]:
print(train_data_tokenized[3])
print(train_data_ids[3])

{'en': ['<sos>', 'a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'on', 'a', 'ladder', 'cleaning', 'a', 'window', '.', '<eos>'], 'de': ['<sos>', 'ein', 'mann', 'in', 'einem', 'blauen', 'hemd', 'steht', 'auf', 'einer', 'leiter', 'und', 'putzt', 'ein', 'fenster', '.', '<eos>']}
{'en': [1, 21, 31, 17, 21, 32, 33, 34, 35, 36, 21, 37, 38, 21, 39, 14, 2], 'de': [1, 21, 29, 11, 30, 31, 32, 33, 34, 35, 36, 37, 38, 21, 39, 16, 2]}


In [18]:
valid_data_ids[3]

{'en': [1,
  4,
  16,
  215,
  350,
  21,
  32,
  380,
  761,
  1063,
  36,
  208,
  7019,
  265,
  951,
  2],
 'de': [1, 4, 7, 943, 48, 299, 3, 34, 30, 3, 1040, 34, 2]}

### 1.5 - Padding [1/1]

As discussed, we typically pass *multiple* sequences, known as a `batch`, into the model during training. For batching to work, each sequence in a batch must have the **same** length. Since sequences generally vary in length, we’ll use the special token `<pad>` to pad shorter sequences up to a fixed `max_length`. If any sequences exceed `max_length`, we’ll trim them to fit.

**Exercise [1/1]**: Similar to `convert_tokens_to_ids()` and `tokenize_example()`, implement `pad_example()` to pad each sequence in an English-German pair within an `example`. Use the pad token’s ID (0 in our case) for padding, and ensure every sequence is either padded or trimmed to match `max_length`.
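
The pad-or-trim rule for a single list of IDs might look like the sketch below. The helper `toy_pad` is hypothetical and operates on one sequence, whereas `pad_example()` must handle both languages in an example.

```python
def toy_pad(ids, pad_token_id=0, max_length=6):
    # Trim sequences that are too long
    ids = ids[:max_length]
    # Append pad IDs until the sequence reaches max_length
    return ids + [pad_token_id] * (max_length - len(ids))


short = toy_pad([1, 21, 31, 2])
long = toy_pad([1, 4, 5, 6, 7, 8, 9, 2])
print(short, long)
```

Note that plain trimming drops the trailing `<eos>` ID (2) from the longer sequence.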

In [None]:
def pad_example(example, pad_token_id=0, max_length=12):
    padded = {}
    for lang_key in example.keys():
        ## Code Here ###

    return padded


In [20]:
fn_kwargs = {"pad_token_id": 0, "max_length": max_length}

train_data_padded = train_data_ids.map(pad_example, fn_kwargs=fn_kwargs)
valid_data_padded = valid_data_ids.map(pad_example, fn_kwargs=fn_kwargs)
test_data_padded = test_data_ids.map(pad_example, fn_kwargs=fn_kwargs)

### 1.6 - Data Loaders

With the sequences processed and padded, we can now convert the datasets to PyTorch tensors with `set_format(type="torch")` and wrap them in PyTorch’s `DataLoader` to create data loaders for the train, validation, and test sets. These data loaders will enable efficient batch processing, shuffling, and iteration through the datasets during training and evaluation.
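
To illustrate what a data loader yields, the sketch below batches a few hand-written examples. The IDs are made up, and a plain Python list stands in for the Hugging Face dataset. Each batch is a dictionary whose values are stacked tensors of shape `(batch_size, max_length)`.

```python
import torch
from torch.utils.data import DataLoader

# Four toy examples, each already padded to length 3 (hypothetical IDs)
toy_examples = [
    {"de": torch.tensor([1, 5, 2]), "en": torch.tensor([1, 7, 2])},
    {"de": torch.tensor([1, 6, 2]), "en": torch.tensor([1, 8, 2])},
    {"de": torch.tensor([1, 9, 2]), "en": torch.tensor([1, 4, 2])},
    {"de": torch.tensor([1, 2, 0]), "en": torch.tensor([1, 2, 0])},
]

toy_loader = DataLoader(toy_examples, batch_size=2, shuffle=False)
first_batch = next(iter(toy_loader))
print(first_batch["de"].shape, first_batch["en"].shape)
```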

In [21]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert Padded Datasets to PyTorch Format

train_data_padded.set_format(type="torch")
valid_data_padded.set_format(type="torch")
test_data_padded.set_format(type="torch")

# Define a batch size
batch_size = 64

# Create DataLoaders
train_dataloader = DataLoader(train_data_padded, batch_size=batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_data_padded, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_data_padded, batch_size=batch_size, shuffle=False)

## 2 - Define the Machine Translation Model

We’ll build our model in three components:
- `Encoder`: Processes the input sequence (source language).
- `Decoder`: Generates the output sequence (target language).
- `Seq2Seq`: A wrapper model that combines the encoder and decoder, providing a unified interface for training and inference.

These components will work together to perform the translation task from German to English.

### 2.1 - Encoder [1/1]

Given an input sequence $\{x_t\}$, for each word or token in the sequence, we first project $x_t$ through an embedding layer:
$$
e_t = E x_t.
$$
The word embedding is then passed into the RNN unit to update the hidden state $h_t$. In the paper [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), the authors choose an LSTM as the RNN unit. Alongside the hidden state $h_t$, the LSTM also maintains a cell state $c_t$:
$$
(h_t, c_t) = \text{LSTM}(e_t, h_{t-1}, c_{t-1})
$$
Thus, after the encoder processes the whole input, the context vector consists of the final hidden state and cell state, $(h_T, c_T)$, where $T$ is the length of the input sequence.

Additionally, we can use a **deep** RNN to refine the sequence representation by passing the hidden state from a lower layer as input to a higher layer:
$$
\begin{align}
(h_t^{1}, c_t^{1}) &= \text{LSTM}(e_t, h^1_{t-1}, c^1_{t-1})\\
(h_t^{2}, c_t^{2}) &= \text{LSTM}(h^1_t, h^2_{t-1}, c^2_{t-1})\\
&\vdots
\end{align}
$$

In the Seq2Seq paper, the authors use four layers. For practical purposes, we will use only `2` layers here.

**Exercise [1/1]**: Implement the `Encoder` class to process the input sequence `src` in the `forward()` method and return `(hidden, cell)` as the context vector. The input arguments include:
- `input_dim`: the source vocabulary size
- `emb_dim`: the dimension of the embedding space
- `hidden_dim`: the number of neurons in each RNN unit
- `num_layers`: the depth of the RNN

In the `__init__()` method, first store these arguments as instance attributes. Then, use `nn.Embedding` and `nn.LSTM` to define the embedding layer and RNN unit. Create the LSTM with `batch_first=True` so that it accepts and returns batch-first tensors, matching the shapes used below. For reference, you can find documentation for these functions [here](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) and [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html).

In the `forward()` method:
1. Take the input sequence `src` (shape `(batch_size, src_len)`) and convert each token to an embedding vector using the `nn.Embedding` layer. The `nn.Embedding` layer directly maps the entire sequence to an `embedded` sequence of shape `(batch_size, src_len, emb_dim)`.
2. Pass the `embedded` sequence to the `nn.LSTM` layer to obtain `outputs`, `hidden`, and `cell`. Here:
   - `outputs` (shape `(batch_size, src_len, hidden_dim)`) represents the collection of all top hidden states.
   - `hidden` and `cell` are the final hidden/cell states of all layers, with `hidden` having shape `(num_layers, batch_size, hidden_dim)`.
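
The shapes listed above can be verified with standalone layers. The sketch below is a shape check with small, arbitrary dimensions rather than the `Encoder` class itself, and it assumes the LSTM is created with `batch_first=True`.

```python
import torch
import torch.nn as nn

batch_size, src_len = 4, 7
vocab_size, emb_dim, hidden_dim, num_layers = 50, 8, 16, 2

embedding = nn.Embedding(vocab_size, emb_dim)
# batch_first=True makes the LSTM accept (batch_size, src_len, emb_dim) inputs
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)

src = torch.randint(0, vocab_size, (batch_size, src_len))
embedded = embedding(src)                 # (batch_size, src_len, emb_dim)
outputs, (hidden, cell) = lstm(embedded)  # outputs: top-layer state at every step

print(embedded.shape, outputs.shape, hidden.shape, cell.shape)
```

Even with `batch_first=True`, `hidden` and `cell` keep the layout `(num_layers, batch_size, hidden_dim)`.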

In [None]:
# Define the Encoder in the Seq2Seq model
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, num_layers=1):
        super().__init__()
        ## Code Here ###

    def forward(self, src):
        # src: (batch_size, src_len)
        ## Code Here ###

        return hidden, cell

### 2.2 - Decoder [1/1]

The decoder is similar to the encoder, but at each time step, we include a fully connected (FC) layer to predict the next token in the target sequence. Specifically, in the `forward()` method, we take an `input` token along with `(hidden, cell)` states, which are the encoder’s context vector at the first step and the decoder’s own hidden/cell states from the previous time step afterwards.

1. The `input` token is projected into the embedding space using the `nn.Embedding` layer.
2. The `embedded` input is then passed into the decoder LSTM to compute the `output` (the top decoder hidden state) and updated `(hidden, cell)` states.
3. Finally, we pass `output` through the FC layer to produce the next token `prediction`:
$$
\begin{align}
e_t &= E y_t\\
(s_t^{1}, c_t^{1}) &= \text{LSTM}(e_t, s^1_{t-1}, c^1_{t-1})\\
(s_t^{2}, c_t^{2}) &= \text{LSTM}(s^1_t, s^2_{t-1}, c^2_{t-1})\\
\bar{y}_{t+1} &= \text{Linear}(s_t^2)
\end{align}
$$

**Note**: The LSTM layer (with `batch_first=True`) assumes the `embedded` token has the shape `(batch_size, seq_length, emb_dim)`. Since our `input` initially has the shape `(batch_size,)`, we need to use `unsqueeze(1)` to add an extra dimension. As a result, `input` becomes `(batch_size, 1)`, and after embedding, it transforms to `(batch_size, 1, emb_dim)`.
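
The shape changes described in the note can be traced with a standalone embedding layer. The dimensions below are arbitrary, and `token_ids` is a hypothetical name for the decoder’s `input`.

```python
import torch
import torch.nn as nn

batch_size, vocab_size, emb_dim = 4, 50, 8
embedding = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.randint(0, vocab_size, (batch_size,))  # one token per batch element
step = token_ids.unsqueeze(1)                            # add a sequence dimension of length 1
embedded = embedding(step)

print(token_ids.shape, step.shape, embedded.shape)
```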

**Exercise [1/1]**: Implement the `Decoder` class.

In [None]:
# Define the Decoder in the Seq2Seq model
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hidden_dim, num_layers=1):
        super().__init__()
        ## Code Here ###

    def forward(self, input, hidden, cell):
        # Input shape: (batch_size) - single token for each batch element
        # hidden, cell: (num_layers, batch_size, hidden_dim)
        input = input.unsqueeze(1)  # (batch_size, 1)

        ## Code Here ###

        return prediction, hidden, cell

### 2.3 - Seq2Seq [1/1]

With well-defined `Encoder` and `Decoder` components, we can now build the `Seq2Seq` model by combining them. In the `__init__` method of the `Seq2Seq` model, we will accept instances of `encoder` and `decoder` as inputs. Additionally, we’ll take `device` as an argument to leverage GPU parallelism for faster training.

The `forward()` method accepts the source sequence `src`, the target sequence `tgt`, and a `teacher_forcing_ratio` to control the use of **teacher forcing** during training, where the true previous token is sometimes used as input instead of the model’s prediction.

Steps in the `forward()` method:
1. Pass `src` to the encoder to obtain `(hidden, cell)` as the context vector.
2. Define `outputs` with shape `(batch_size, tgt_len, tgt_vocab_size)`, initialized with zeros to store the model’s predictions for the target sequence. Make sure `outputs` is on `self.device` to use GPUs if available.
3. For each time step in the decoder:
   - Pass the current token to the decoder to update `(hidden, cell)` states and generate the `prediction`.
   - Store the `prediction` in `outputs` for calculating the *cross-entropy loss* later.
   - Select the token with the highest probability (`top1`) as the candidate for the next token.
   - With probability `teacher_forcing_ratio`, use the ground-truth token `tgt[:, t]` as the next input token; otherwise, use `top1`.
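
The teacher-forcing decision in the last step can be sketched in isolation. The helper `next_decoder_input` is hypothetical and works on plain Python values; inside `forward()` the same comparison would select between tensors.

```python
import random


def next_decoder_input(ground_truth, top1, teacher_forcing_ratio, rng=random):
    # With probability teacher_forcing_ratio, feed the true token;
    # otherwise feed the model's own prediction
    use_teacher = rng.random() < teacher_forcing_ratio
    return ground_truth if use_teacher else top1


rng = random.Random(0)
choices = [next_decoder_input("truth", "pred", 0.5, rng) for _ in range(1000)]
print(choices.count("truth"), choices.count("pred"))
```

A ratio of `1.0` always feeds the ground truth, while `0.0` always feeds the model’s prediction.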

**Exercise [1/1]**: Implement the `Seq2Seq` class.

In [None]:
# Define the Seq2Seq model using the Encoder and Decoder
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        # src: (batch_size, src_len)
        # tgt: (batch_size, tgt_len)
        batch_size = src.size(0)
        tgt_len = tgt.size(1)
        tgt_vocab_size = self.decoder.output_dim

        # Encode the source sequence
        ### Code Here ###

        return outputs


In [25]:
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device", device)

# Initialize encoder, decoder, and Seq2Seq model
encoder = Encoder(input_dim=len(de_vocab), emb_dim=256, hidden_dim=512, num_layers=2)
decoder = Decoder(output_dim=len(en_vocab), emb_dim=256, hidden_dim=512, num_layers=2)
model = Seq2Seq(encoder, decoder, device).to(device)

Using device cuda


In [26]:
batch = next(iter(train_dataloader))
(src, tgt) = batch["de"], batch["en"]

print("Max index in src:", src.max().item())
print("Max index in tgt:", tgt.max().item())
print("Input_dim for encoder:", encoder.input_dim)
print("Output_dim for decoder:", decoder.output_dim)

Max index in src: 16845
Max index in tgt: 8934
Input_dim for encoder: 18680
Output_dim for decoder: 10218


In [27]:
batch = next(iter(train_dataloader))
(src, tgt) = batch["de"].to(model.device), batch["en"].to(model.device)
outputs = model(src, tgt)
print(outputs.size())
print(outputs[0, 10, :])

torch.Size([64, 110, 10218])
tensor([-0.0413,  0.0635, -0.0023,  ..., -0.0487,  0.0029, -0.0312],
       device='cuda:0', grad_fn=<SliceBackward0>)


## 3 - Training Seq2Seq Model [1/1]

Now that we have prepared the data and designed our Seq2Seq model, we can start training. Specifically, we’ll implement a `train()` function that takes `model`, `data_loader`, `optimizer`, and the loss `criterion` as inputs. For each `batch` in `data_loader`, we’ll extract the source sequence `src` and target sequence `tgt`, ensuring they are moved to `model.device` to leverage GPU acceleration if available.

For sequence generation tasks, we typically use *cross-entropy loss* to measure divergence between predictions and ground truth. Since `tgt` has shape `(batch_size, tgt_len)`, we’ll need to `reshape()` it to `(batch_size * tgt_len)` for comparison with the model’s output. Similarly, the `output` from the model has shape `(batch_size, tgt_len, tgt_vocab_size)`, so we’ll reshape it to `(batch_size * tgt_len, tgt_vocab_size)` to align with `tgt`.
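
The reshaping can be tried on random tensors. The sketch below uses small, arbitrary sizes and treats 0 as the `<pad>` ID, so padded positions are excluded from the loss through `ignore_index`.

```python
import torch
import torch.nn as nn

batch_size, tgt_len, tgt_vocab_size = 2, 4, 6
output = torch.randn(batch_size, tgt_len, tgt_vocab_size)
tgt = torch.tensor([[1, 4, 2, 0],
                    [1, 5, 3, 2]])  # 0 is the <pad> ID

flat_output = output.reshape(-1, tgt_vocab_size)  # (batch_size * tgt_len, tgt_vocab_size)
flat_tgt = tgt.reshape(-1)                        # (batch_size * tgt_len,)

criterion = nn.CrossEntropyLoss(ignore_index=0)   # padded positions do not contribute
loss = criterion(flat_output, flat_tgt)
print(flat_output.shape, flat_tgt.shape, loss.item())
```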

To monitor overfitting, we’ll also implement an `evaluate()` function that calculates the loss on the `valid_data`. Unlike `train()`, `evaluate()` doesn’t need an `optimizer` as we won’t be performing any gradient updates.

**Exercise [1/1]**: Implement `train()` and `evaluate()`.

In [None]:
def train(model, data_loader, optimizer, criterion):
    model.train()  # Set the model to training mode
    epoch_loss = 0

    for i, batch in enumerate(data_loader):
        src, tgt = batch["de"].to(model.device), batch["en"].to(model.device)

        optimizer.zero_grad()  # Zero gradients before each batch

        # Forward pass through the model
        output = model(src, tgt)

        # Reshape the output and target for calculating loss
        # output -> (batch_size * tgt_len, tgt_vocab_size)
        # tgt -> (batch_size * tgt_len)
        ### Code Here ###


        ### Code Here ###

        # Calculate the loss and perform backpropagation
        loss = criterion(output, tgt)
        loss.backward()

        # Update parameters
        optimizer.step()

        # Accumulate loss
        epoch_loss += loss.item()

    return epoch_loss / len(data_loader)

In [None]:
def evaluate(model, data_loader, criterion):
    model.eval()  # Set the model to evaluation mode
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(data_loader):
            ### Code Here ###

    return epoch_loss / len(data_loader)


Now, we can start training by using `Adam` and `CrossEntropyLoss`.

In [None]:
import torch.optim as optim

seed = 5
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device", device)

# Initialize encoder, decoder, and Seq2Seq model
encoder = Encoder(input_dim=len(de_vocab), emb_dim=256, hidden_dim=512, num_layers=2)
decoder = Decoder(output_dim=len(en_vocab), emb_dim=256, hidden_dim=512, num_layers=2)
model = Seq2Seq(encoder, decoder, device).to(device)

PAD_IDX = en_vocab["<pad>"]  # the loss is computed on the English targets
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = optim.Adam(model.parameters(), lr=0.01)

train_losses = []
valid_losses = []
for epoch in range(YOUR_DESIRED): # Minimum 10
    train_loss = train(model, train_dataloader, optimizer, criterion)
    train_losses.append(train_loss)
    valid_loss = evaluate(model, valid_dataloader, criterion)
    valid_losses.append(valid_loss)
    # if epoch % 10 == 0:
    print(f"Epoch: {epoch+1}, Train Loss: {train_loss:.4f}, Valid Loss: {valid_loss:.4f}")


## 4 - Evaluation and Translation [1/1]

As discussed in class, we can use the `BLEU` score to evaluate translation quality. While the `evaluate` library provides an implementation of `BLEU`, for this assignment we’ll focus on visualizing results by implementing a `translate_sentence()` function. This function will take a source `sentence`, a trained `model`, and the source and target vocabularies (`src_vocab` and `tgt_vocab`) to return a translated sentence. It is similar to the `forward()` method in the `Seq2Seq` class but does not use a target sequence `tgt` for teacher forcing. Instead, it generates tokens autoregressively, feeding each predicted token back in as the next input.

Steps in `translate_sentence()`:
1. **Tokenize and Convert to IDs**:
   - The input `sentence` is a string, so we need to tokenize it and convert each token to its corresponding ID in `src_vocab`. This can be achieved by iterating through the sentence and looking up each token in `src_vocab`.

2. **Convert to Tensor**:
   - To pass the tokenized sentence into `model.encoder()`, convert it to a `torch.tensor` and reshape it to `(batch_size=1, seq_len)` using `unsqueeze()`.

3. **Generate Target Sequence**:
   - Pass the encoded source sequence through the encoder to get the context vector `(hidden, cell)`.
   - Maintain a list `tgt_ids` to store each predicted token’s ID (`pred_token_id`).
   - For each prediction step:
     - Use the `model.decoder()` to get the `output` based on the current `hidden`, `cell`, and the last predicted token (the `<sos>` token at the first step).
     - Extract the predicted token ID (`pred_token_id`) from `output` by taking the index with the highest probability.
     - Append `pred_token_id` to `tgt_ids`.
     - If `pred_token_id` is `<eos>`, terminate the loop early.
   - Ensure each integer `pred_token_id` is converted to a tensor with shape `(1,)` before feeding it into the `decoder`.

4. **Convert IDs Back to Tokens**:
   - After generating the sequence, convert each ID in `tgt_ids` back to its corresponding token in `tgt_vocab` and return the final translated sentence.
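
The control flow of steps 3 and 4 can be sketched without a trained model. In the example below, `greedy_decode` and `toy_step` are hypothetical: `toy_step` stands in for the decoder and returns a list of scores over a six-entry toy vocabulary, so only the loop, the early stop on `<eos>`, and the ID-to-token conversion are illustrated.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=10):
    tgt_ids = []
    prev_id = sos_id  # decoding starts from <sos>
    for _ in range(max_len):
        scores = step_fn(prev_id)  # one score per target-vocabulary entry
        pred_token_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tgt_ids.append(pred_token_id)
        if pred_token_id == eos_id:
            break  # stop early once <eos> is produced
        prev_id = pred_token_id
    return tgt_ids


toy_tgt_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "a": 4, "man": 5}
id_to_token = {idx: token for token, idx in toy_tgt_vocab.items()}

# Stand-in for the decoder: always scores a fixed successor highest
successor = {1: 4, 4: 5, 5: 2}


def toy_step(prev_id):
    scores = [0.0] * len(toy_tgt_vocab)
    scores[successor[prev_id]] = 1.0
    return scores


ids = greedy_decode(toy_step, sos_id=1, eos_id=2)
translation = [id_to_token[i] for i in ids]
print(ids, translation)
```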

**Exercise [1/1]**: Implement `translate_sentence()` to perform machine translation using the trained model.

In [None]:
def translate_sentence(sentence, model, src_vocab, tgt_vocab, max_len=10):
    model.eval()
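    # Tokenize as in training (lowercase + word_tokenize, wrapped in <sos>/<eos>),
    # then convert to indices with <unk> fallback for missing words
    ### Code Here ###
    words = ["<sos>"] + word_tokenize(sentence.lower()) + ["<eos>"]
    tokens = [src_vocab.get(token, src_vocab["<unk>"]) for token in words]
    src_tensor = torch.tensor(tokens).unsqueeze(0).to(model.device)  # shape: (1, src_len)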

    # Encode the source sentence
    ### Code Here ###

    # Decode each token until <eos> or max length
    for _ in range(max_len):
        ### Code Here ###

    return translated_sentence

In [None]:
sentence = test_data[ADD_INDEX]["de"]
expected_translation = test_data[ADD_INDEX]["en"]  # same index as the source sentence

sentence, expected_translation

('Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.',
 'A man in an orange hat starring at something.')

In [None]:
translation = translate_sentence(sentence, model, de_vocab, en_vocab)
translation

['A', 'man', 'in', 'an', 'orange', 'hat', 'starring', 'at', 'something', '.']