In [None]:
%matplotlib inline

# NLP: Machine Translation with `nn.Transformer` and torchtext

This tutorial is adapted from [the PyTorch Tutorials](https://pytorch.org/tutorials/), a valuable reference for learning the fundamentals of Natural Language Processing in PyTorch. We recommend exploring the other tutorials there for deeper coverage and additional topics.

In this tutorial, you will learn how to:
- Train a translation model from scratch using the Transformer architecture.
- Use the torchtext library to load the [Multi30k](http://www.statmt.org/wmt16/multimodal-task.html#task1) dataset and train a model that translates from German to English.

## Data Sourcing and Processing

The [torchtext library](https://pytorch.org/text/stable/) provides utilities for creating datasets that are easy to iterate over when building a language translation model. In this tutorial, we use torchtext's built-in datasets and show how to tokenize a raw text sentence, build a vocabulary, and convert tokens into tensors. We use the [Multi30k dataset from the torchtext library](https://pytorch.org/text/stable/datasets.html#multi30k), which yields pairs of raw source and target sentences.

In [None]:
# Importing necessary modules
import sys  # Module for interacting with the Python interpreter

# Checking if the code is running in Google Colab
if 'google.colab' in sys.modules:
    print("Running in Google Colab ...")

    print("Installing dependencies ...")
    # Install dependencies:
    !pip install -U torchtext
    !pip install -U spacy
    !pip install -U portalocker
    !python -m spacy download en_core_web_sm
    !python -m spacy download de_core_news_sm
else:
    # Code is running in a local setup
    print("Running in Local Setup ...")

In [None]:
# Import necessary modules and functions from torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List

The code below does the following:
- The first two lines override torchtext's download URLs for the training and validation sets of the Multi30k dataset.
- `SRC_LANGUAGE` and `TGT_LANGUAGE` are assigned the strings 'de' and 'en', representing the source (German) and target (English) languages.
- `token_transform` and `vocab_transform` are initialized as empty dictionaries; they will later hold the tokenizer and the vocabulary for each language.

In [None]:
# Define URLs for the Multi30k dataset (training and validation sets)
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

# Specify source and target languages
SRC_LANGUAGE = 'de'  # German
TGT_LANGUAGE = 'en'  # English

# Placeholders for token and vocabulary transformations
token_transform = {}
vocab_transform = {}

Generate tokenizers for the source and target languages.

- `get_tokenizer` is a torchtext function that returns a tokenizer for the specified method ('spacy' in this case). For the source language (German), the 'de_core_news_sm' spaCy model is used as the tokenizer.

- For the target language (English), the 'en_core_web_sm' spaCy model is used as the tokenizer. These tokenizers are stored in the dictionary `token_transform` with keys representing the source and target languages.

In [None]:
# Assign source language tokenizer using the 'spacy' tokenizer with the German language model 
# 'de_core_news_sm'
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')

# Assign target language tokenizer using the 'spacy' tokenizer with the English language model 
# 'en_core_web_sm'
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

The helper function below works on (source, target) sentence pairs and yields the tokenized sentences for one of the two languages.

- The function `yield_tokens` is a helper function that takes a data iterator (`data_iter`) and the language to tokenize (`language`).
- It uses the `token_transform` dictionary to access the appropriate tokenizer for the specified language.
- The function iterates through the provided data iterator and yields a list of tokens for each data sample in the specified language.
- The `language_index` dictionary is used to map the source and target languages to their respective indices in the data samples.

In [None]:
# Define a helper function to yield a list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    """
    A helper generator that yields a list of tokens for each sample in a data iterator.

    Args:
        data_iter (Iterable): The data iterator containing (source, target) sentence pairs.
        language (str): The language whose sentences should be tokenized (source or target).

    Yields:
        List[str]: A list of tokens from each data sample.
    """
    # Define an index mapping for source and target languages
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    # Iterate through the data iterator
    for data_sample in data_iter:
        # Use the specified tokenizer for the given language to tokenize the data sample
        yield token_transform[language](data_sample[language_index[language]])
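
To see what `yield_tokens` produces without downloading Multi30k or the spaCy models, the following minimal sketch runs the same logic on two made-up sentence pairs. The whitespace tokenizer (`str.split`) and the names `toy_pairs`, `toy_token_transform`, `toy_language_index`, and `toy_yield_tokens` are assumptions for illustration only; they stand in for the spaCy tokenizers and the dataset iterator.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: two (German, English) pairs and a whitespace tokenizer.
toy_pairs: List[Tuple[str, str]] = [
    ("Ein Hund rennt .", "A dog runs ."),
    ("Zwei Kinder spielen .", "Two children play ."),
]
toy_token_transform: Dict[str, Callable[[str], List[str]]] = {
    "de": str.split,
    "en": str.split,
}
toy_language_index = {"de": 0, "en": 1}


def toy_yield_tokens(data_iter: Iterable[Tuple[str, str]], language: str) -> Iterator[List[str]]:
    """Yield one token list per sample for the requested side of each pair."""
    for data_sample in data_iter:
        yield toy_token_transform[language](data_sample[toy_language_index[language]])


german_tokens = list(toy_yield_tokens(toy_pairs, "de"))
english_tokens = list(toy_yield_tokens(toy_pairs, "en"))
print(german_tokens)
print(english_tokens)
```

Because the function is a generator, tokens are produced one sentence at a time, which lets a vocabulary builder consume the dataset without holding every token list in memory.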

These special symbols and indices are commonly used in natural language processing tasks, such as in this case for language translation, to handle unknown words, padding, and sentence boundaries.

- Four special symbols are defined: `<unk>` (unknown), `<pad>` (padding), `<bos>` (beginning of sequence), and `<eos>` (end of sequence).
- Each special symbol is assigned a unique index. These indices will be used when building the vocabulary to represent these special symbols.
- The variables UNK_IDX, PAD_IDX, BOS_IDX, and EOS_IDX store the indices for the respective special symbols.

In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3

# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

Building a vocabulary is a crucial step in natural language processing tasks, enabling the model to map words to unique indices.

- The code iterates over the source language (SRC_LANGUAGE) and target language (TGT_LANGUAGE).
- For each language, it creates a training data iterator (`train_iter`) for the Multi30k dataset with the specified language pair.
- The `build_vocab_from_iterator` function is then used to construct a vocabulary for the current language.
- The `yield_tokens` function is employed to tokenize and yield tokens from the training iterator.
- Parameters such as `min_freq` (minimum frequency for token inclusion), `specials` (special symbols), and `special_first` (placing special symbols at the beginning) are specified during vocabulary construction.
- The resulting vocabulary is stored in the `vocab_transform` dictionary, with keys representing the source and target languages.

In [None]:
# Iterate over source (SRC_LANGUAGE) and target (TGT_LANGUAGE) languages
for lang in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Create a training data iterator for the Multi30k dataset with the specified language pair
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    
    # Build a vocabulary using torchtext's Vocab object for the current language
    vocab_transform[lang] = build_vocab_from_iterator(
        yield_tokens(train_iter, lang),  # Tokenize and yield tokens from the training iterator
        min_freq=1,  # Set the minimum frequency threshold for including tokens in the vocabulary
        specials=special_symbols,  # Specify special symbols to be included in the vocabulary
        special_first=True  # Place special symbols at the beginning of the vocabulary
    )

Setting the default index to the index of the `<unk>` token is a common practice. It ensures that any word missing from the vocabulary is mapped to the unknown token instead of raising an error during lookup.

- The code iterates over the source language (SRC_LANGUAGE) and target language (TGT_LANGUAGE).
- For each language, it sets the default index of the corresponding vocabulary to the index of the `<unk>` (unknown) token.
- The default index is the index assigned to a token when it is not found in the vocabulary. In this case, it is set to the index of the `<unk>` token.

In [None]:
# Set `UNK_IDX` as the default index. This index is returned when the token is not found.
# If not set, it throws `RuntimeError` when the queried token is not found in the Vocabulary.
# Iterate over source (SRC_LANGUAGE) and target (TGT_LANGUAGE) languages
for lang in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Set the default index of the vocabulary for the current language to the index of the '<unk>' (unknown) token
    vocab_transform[lang].set_default_index(UNK_IDX)
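
The effect of placing the special symbols first and falling back to `UNK_IDX` can be sketched with a plain dictionary. This is not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, and the ordering of regular tokens (by descending frequency, then alphabetically) is an assumption chosen only to make the result deterministic.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Same special symbols and order as in the tutorial.
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
UNK_IDX = 0


def build_toy_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map each token to an index, with the special symbols occupying indices 0-3."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocab = {symbol: index for index, symbol in enumerate(special_symbols)}
    # Assumption: order by descending frequency, then alphabetically, for determinism.
    for token, freq in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if freq >= min_freq and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def lookup(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Numericalize tokens, falling back to UNK_IDX for out-of-vocabulary tokens."""
    return [vocab.get(token, UNK_IDX) for token in tokens]


toy_vocab = build_toy_vocab([["a", "dog", "runs", "."], ["a", "cat", "sleeps", "."]])
ids = lookup(toy_vocab, ["a", "bird", "runs", "."])
print(ids)
```

The word "bird" never appeared in the toy training data, so it maps to index 0, which plays the role of the default index set above.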

## Seq2Seq Network using Transformer

The Transformer, a Seq2Seq model introduced in the paper ["Attention is all you need"](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), is designed for addressing machine translation tasks. In the following, we will construct a Seq2Seq network that leverages the Transformer architecture. This network comprises three main components.

The first component is the embedding layer, responsible for converting a tensor of input indices into a corresponding tensor of input embeddings. These embeddings are then enriched with positional encodings, offering positional information about input tokens to the model.

The second component is the actual [Transformer model](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html). This core part of the network utilizes self-attention mechanisms to effectively capture relationships between different parts of the input sequence.

Finally, the output of the Transformer model is passed through a linear layer, which produces unnormalized scores (logits) for each token in the target-language vocabulary. These logits are the model's predictions for the translation task.

In [None]:
# Import necessary modules and functions from PyTorch
from torch import Tensor  # Tensor class for creating and manipulating tensors
import torch  # The main PyTorch module
import torch.nn as nn  # Neural network module for defining layers and models
from torch.nn import Transformer  # Transformer module for implementing the Transformer model
import math  # The math module for mathematical operations

In [None]:
# Determine the optimal device for computations, prioritizing GPUs and MPS:
DEVICE = torch.device(
    "cuda"  # Prioritize GPU if available
    if torch.cuda.is_available()
    else "mps"  # use MPS if available
    if torch.backends.mps.is_available()
    else "cpu"  # Fallback to CPU
)

# Print the selected device for clarity:
print(f"Torch Device: {DEVICE}")

Positional encoding is essential in Transformer models to provide information about the positions of tokens in a sequence, allowing the model to consider the order of words.

- The `PositionalEncoding` class is defined as a PyTorch module to add positional encoding to token embeddings.
- In the constructor (`__init__` method), the frequency term `den` used in the positional encoding formula is computed.
- Positions (`pos`) from 0 to `maxlen - 1` are generated, and an initial tensor for positional embeddings (`pos_embedding`) is initialized.
- Sine and cosine components of the positional encoding are computed and assigned to the even and odd indices of the `pos_embedding` tensor, respectively.
- A singleton dimension is added to the `pos_embedding` tensor, giving it the shape `(maxlen, 1, emb_size)`, so that it broadcasts across the batch dimension of the input token embeddings.
- The dropout layer is initialized, and the positional embeddings are registered as a buffer in the module.
- In the `forward` method, dropout is applied to the sum of the input token embeddings and the corresponding positional embeddings.

In [None]:
# Define a helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()

        # Compute the frequency term 1 / 10000^(2i / emb_size) for each pair of embedding dimensions
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)

        # Generate positions from 0 to maxlen
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)

        # Initialize positional embeddings tensor
        pos_embedding = torch.zeros((maxlen, emb_size))

        # Compute sine and cosine components of positional encoding
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)

        # Add a singleton dimension (shape: maxlen x 1 x emb_size) so the encoding broadcasts over the batch
        pos_embedding = pos_embedding.unsqueeze(-2)

        # Initialize a dropout layer
        self.dropout = nn.Dropout(dropout)

        # Register the positional embeddings as a buffer
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Apply dropout to the sum of token embeddings and positional embeddings
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])
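
The sinusoidal table built in the constructor can be reproduced for a tiny case in plain Python, which makes the even/odd layout easy to inspect. This is a minimal sketch, not the module itself: `sinusoidal_encoding` is a hypothetical helper, and it omits dropout and the extra broadcasting dimension.

```python
import math
from typing import List


def sinusoidal_encoding(maxlen: int, emb_size: int) -> List[List[float]]:
    """Return a maxlen x emb_size table: sine on even columns, cosine on odd columns."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, emb_size, 2):
            # Same term as `den` in the class above: exp(-i * ln(10000) / emb_size)
            den = math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(pos * den)
            if i + 1 < emb_size:
                table[pos][i + 1] = math.cos(pos * den)
    return table


pe = sinusoidal_encoding(maxlen=3, emb_size=4)
for row in pe:
    print([round(value, 4) for value in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and the later column pairs vary more slowly with position because their frequency term is smaller.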

Scaling the embeddings by the square root of the embedding size follows the original Transformer paper and is a common practice to stabilize the training of models that use embeddings.

- The `TokenEmbedding` class is defined as a PyTorch module to convert a tensor of input indices into the corresponding tensor of token embeddings.
- In the constructor (`__init__` method), an embedding layer (`nn.Embedding`) is initialized with a vocabulary size of `vocab_size` and an embedding size of `emb_size`.
- The embedding size is stored as an attribute (`self.emb_size`) for later use.
- In the `forward` method, the input tokens are passed through the embedding layer, and the result is scaled by the square root of the embedding size.
- The `.long()` method is used to ensure that input tokens are treated as long integers, as required by the embedding layer.

In [None]:
# Define a helper Module to convert a tensor of input indices into the corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super(TokenEmbedding, self).__init__()

        # Initialize an embedding layer with vocab_size vocabulary and emb_size embedding size
        self.embedding = nn.Embedding(vocab_size, emb_size)

        # Store the embedding size for later use
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # Apply the embedding layer to input tokens, and scale the result by the square root of the embedding size
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

This code represents a sequence-to-sequence model for machine translation tasks using the Transformer architecture.

- The `Seq2SeqTransformer` class is defined as a PyTorch module, representing a sequence-to-sequence model based on the Transformer architecture.
- The constructor (`__init__` method) initializes the Transformer model, linear layer, token embedding layers for source and target languages, and positional encoding layer.
- The `forward` method defines the forward pass of the model: it adds positional encoding to the token embeddings, passes them through the Transformer, and projects the result to output logits.
- The `encode` and `decode` methods separately encode the source sequence and decode the target sequence using the Transformer encoder and decoder, respectively.

In [None]:
# Define a Seq2Seq Network using the Transformer architecture
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()

        # Instantiate the Transformer model with specified parameters
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)

        # Linear layer that projects decoder outputs to logits over the target vocabulary
        self.generator = nn.Linear(emb_size, tgt_vocab_size)

        # Token embedding layer for the source language
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)

        # Token embedding layer for the target language
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)

        # Positional encoding layer for introducing word order information
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        # Apply positional encoding to token embeddings for source and target sequences
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))

        # Pass the source and target sequences through the Transformer model
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)

        # Generate output logits using the linear layer
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        # Encode the source sequence using the Transformer encoder
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        # Decode the target sequence using the Transformer decoder
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

During the training process, it is crucial to employ a subsequent word mask to prevent the model from peering into future words when generating predictions. Additionally, masks are necessary to conceal padding tokens in both the source and target sequences. Here, we will define a function to handle the creation of these masks.

The first helper builds the subsequent-word (causal) mask that hides future tokens from the decoder.

- The function `generate_square_subsequent_mask` creates a square subsequent mask for a given size `sz`.
- `torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1` creates a boolean upper-triangular matrix (including the diagonal) using the `torch.triu` function.
- `.transpose(0, 1)` turns it into a lower-triangular matrix, so that position `i` can attend only to positions `j <= i`.
- `mask.float()` converts the boolean mask to a float tensor.
- `mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))` replaces 0s with negative infinity and 1s with 0.0 using `masked_fill`.
- The resulting mask is returned.

In [None]:
def generate_square_subsequent_mask(sz):
    # Create an upper triangular matrix of ones using torch.triu
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)

    # Convert the boolean mask to a float tensor
    mask = mask.float()

    # Use masked_fill to replace 0s with negative infinity and 1s with 0.0
    mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))

    # Return the generated square subsequent mask
    return mask
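
For a small size, the resulting mask can be written out by hand. The sketch below builds the same pattern with nested lists instead of tensors; `causal_mask` is a hypothetical helper used only for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def causal_mask(sz: int) -> List[List[float]]:
    """Row i holds 0.0 for columns j <= i (visible) and -inf for j > i (hidden)."""
    return [[0.0 if j <= i else NEG_INF for j in range(sz)] for i in range(sz)]


toy_mask = causal_mask(3)
for row in toy_mask:
    print(row)
```

The mask is added to the attention scores before the softmax, so entries of negative infinity receive zero attention weight while entries of 0.0 leave the scores unchanged.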

These masks are used during the training process in sequence-to-sequence models to mask future words and handle padding tokens.

- The function `create_mask` takes the source sequence (`src`) and target sequence (`tgt`) as input.
- `src_seq_len` and `tgt_seq_len` are calculated as the lengths of the source and target sequences, respectively.
- `tgt_mask` is generated using the `generate_square_subsequent_mask` function for the target sequence.
- `src_mask` is an all-`False` boolean matrix, meaning that no source position is masked and the encoder can attend to the entire source sentence.
- `src_padding_mask` and `tgt_padding_mask` are created by checking which elements are equal to the padding index (`PAD_IDX`) and then transposing the result to the shape `(batch_size, seq_len)`.
- The function returns the source mask, target mask, source padding mask, and target padding mask.

In [None]:
def create_mask(src, tgt):
    # Get the sequence lengths of the source and target sequences
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    # Generate a square subsequent mask for the target sequence
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)

    # Create a zero mask for the source sequence
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)

    # Create padding masks for source and target sequences
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)

    # Return the source mask, target mask, source padding mask, and target padding mask
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
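
The padding masks are the part of `create_mask` that depends on the data, so a small worked case may help. The sketch below mirrors `(src == PAD_IDX).transpose(0, 1)` with nested lists; `toy_batch` and `padding_mask` are hypothetical names, and the token indices are made up.

```python
from typing import List

PAD_IDX = 1  # Same padding index as in the tutorial

# Hypothetical batch in (seq_len, batch_size) layout: 4 time steps, 2 sentences.
# Column 0 is [2, 7, 3, 1] (one pad at the end); column 1 is [2, 8, 9, 3] (no padding).
toy_batch: List[List[int]] = [
    [2, 2],
    [7, 8],
    [3, 9],
    [1, 3],
]


def padding_mask(batch: List[List[int]]) -> List[List[bool]]:
    """Mark pad positions with True and transpose to (batch_size, seq_len)."""
    is_pad = [[token == PAD_IDX for token in step] for step in batch]
    return [list(column) for column in zip(*is_pad)]


toy_padding_mask = padding_mask(toy_batch)
print(toy_padding_mask)
```

Each row of the result corresponds to one sentence, and `True` marks the positions that attention should ignore.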

Next, we define the model's hyperparameters and instantiate the model. We also define the loss function (cross-entropy) and the optimizer used during training.

The code below performs the following steps:

- `torch.manual_seed(0)`: Sets a manual seed to ensure reproducibility of the results.
- `SRC_VOCAB_SIZE` and `TGT_VOCAB_SIZE`: Obtain the vocabulary sizes for the source and target languages.
- Model hyperparameters: Set various hyperparameters for the Seq2Seq Transformer model.
- `transformer`: Instantiate the Seq2SeqTransformer model with the specified hyperparameters.
- Xavier initialization: Initialize the model parameters using Xavier (Glorot) initialization for better training stability.
- Move the model to the specified device (e.g., GPU) using `.to(DEVICE)`.
- `loss_fn`: Define the loss function as CrossEntropyLoss, with the padding index ignored during computation.
- `optimizer`: Define the optimizer as Adam with a learning rate of 0.0001, beta parameters (0.9, 0.98), and epsilon value 1e-9.

In [None]:
# Set a manual seed for reproducibility
torch.manual_seed(0)

# Get the vocabulary sizes for the source and target languages
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

# Define model hyperparameters
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

# Instantiate the Seq2Seq Transformer model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

# Initialize the model parameters using Xavier initialization
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Move the model to the specified device (e.g., GPU)
transformer = transformer.to(DEVICE)

# Define the loss function as CrossEntropyLoss with padding index ignored
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Define the optimizer as Adam with specified learning rate and beta parameters
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
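
Two quantities implied by these settings can be checked with simple arithmetic: the per-head dimension and the range used by Xavier uniform initialization, which draws weights from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below is a worked calculation only; `xavier_uniform_bound` is a hypothetical helper, and the square 512 x 512 weight matrix is an assumed example shape.

```python
import math

EMB_SIZE = 512
NHEAD = 8

# Each attention head works on an equal slice of the embedding.
assert EMB_SIZE % NHEAD == 0, "emb_size must be divisible by nhead"
head_dim = EMB_SIZE // NHEAD


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Bound a such that weights are drawn from U(-a, a)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# Assumed example: a square weight matrix mapping EMB_SIZE inputs to EMB_SIZE outputs.
bound = xavier_uniform_bound(EMB_SIZE, EMB_SIZE)
print(head_dim, round(bound, 4))
```

With these hyperparameters each head attends over 64 dimensions, and weights of a 512 x 512 matrix start within roughly plus or minus 0.077.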

## Collation

As demonstrated in the `Data Sourcing and Processing` section, our data iterator produces pairs of raw strings. To facilitate processing by our previously defined `Seq2Seq` network, we must convert these string pairs into batched tensors. Below, we establish our collate function, which transforms a batch of raw strings into tensors suitable for direct input into our model.

In [None]:
# Import the pad_sequence function from torch.nn.utils.rnn module
from torch.nn.utils.rnn import pad_sequence

This helper function is useful for creating a pipeline of text transformations where multiple operations need to be applied in a specific order.

- The `sequential_transforms` function is defined as a helper function to combine sequential transformations.
- The function takes a variable number of transformations (`*transforms`) as arguments, allowing flexibility in the number of transformations to be applied.
- Inside the function, a new function (`func`) is defined. This function takes a text input (`txt_input`) and applies each transformation in the provided list of transforms sequentially.
- The result of applying all sequential transformations is returned by the `func` function.
- Finally, the `sequential_transforms` function returns the `func` function, which can be used to apply the sequential transformations to text inputs.

In [None]:
# Define a helper function to combine sequential transformations
def sequential_transforms(*transforms):
    # The function takes a variable number of transformations as arguments
    def func(txt_input):
        # For each transformation in the provided list of transforms
        for transform in transforms:
            # Apply the transformation to the input
            txt_input = transform(txt_input)
        # Return the result of applying all sequential transformations
        return txt_input
    # Return the function that applies sequential transformations
    return func
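
A short usage example shows how the composed function behaves. The helper is repeated so the snippet runs on its own; the toy vocabulary, the `numericalize` function, and the choice of `str.strip`, `str.lower`, and `str.split` as transforms are assumptions for illustration rather than the transforms used in this tutorial.

```python
from typing import Dict, List


def sequential_transforms(*transforms):
    """Compose transforms left to right (same helper as above, repeated for self-containment)."""
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func


# Hypothetical toy vocabulary; index 0 plays the role of <unk>.
toy_vocab: Dict[str, int] = {"<unk>": 0, "a": 4, "dog": 5, "runs": 6}


def numericalize(tokens: List[str]) -> List[int]:
    """Map tokens to indices, using 0 for tokens missing from the toy vocabulary."""
    return [toy_vocab.get(token, 0) for token in tokens]


# Pipeline: clean -> lowercase -> tokenize -> numericalize
pipeline = sequential_transforms(str.strip, str.lower, str.split, numericalize)
token_ids = pipeline("  A dog RUNS fast \n")
print(token_ids)
```

The order of the arguments matters: each transform receives the output of the previous one, so the string operations must come before the functions that expect a token list.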

This code sets up language-specific text transformations that convert raw strings into tensors with BOS and EOS tokens, which is useful for preparing input sequences for a Seq2Seq model.

- `tensor_transform` is a function defined to add the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens to a list of token indices and create a tensor.
- It uses `torch.cat` to concatenate tensors for BOS, the input token indices, and EOS.
- An empty dictionary `text_transform` is initialized to store language-specific text transformations.
- The code iterates over source (SRC_LANGUAGE) and target (TGT_LANGUAGE) languages.
- For each language (`lang`), a language-specific text transformation is defined using the `sequential_transforms` helper function.
- The sequential transformations include tokenization (`token_transform[lang]`), numericalization (`vocab_transform[lang]`), and adding BOS/EOS tokens (`tensor_transform`).
- The resulting language-specific text transformations are stored in the `text_transform` dictionary.

In [None]:
# Define a function to add BOS/EOS and create a tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    # Concatenate tensors for BOS, token_ids, and EOS
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# Initialize an empty dictionary to store language-specific text transformations
text_transform = {}

# Iterate over source (SRC_LANGUAGE) and target (TGT_LANGUAGE) languages
for lang in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Define a language-specific text transformation using sequential transforms
    text_transform[lang] = sequential_transforms(token_transform[lang],  # Tokenization
                                               vocab_transform[lang],  # Numericalization
                                               tensor_transform)     # Add BOS/EOS and create tensor

This collation function is crucial for preparing batches of data during training, ensuring that sequences within a batch have the same length.

- `collate_fn` is a function defined to collate data samples into batch tensors, suitable for processing by a Seq2Seq model.
- Two empty lists, `src_batch` and `tgt_batch`, are initialized to store source and target batches, respectively.
- The function iterates over each (source, target) pair in the input batch.
- For each pair, language-specific text transformations (`text_transform`) are applied to both the source and target samples, and the results are appended to the respective batches.
- `pad_sequence` is then used to pad sequences in the batches to the length of the longest sequence using the specified padding value (`PAD_IDX`).
- The function returns the collated source and target batches.

In [None]:
# Define a function to collate data samples into batch tensors
def collate_fn(batch):
    # Initialize empty lists to store source and target batches
    src_batch, tgt_batch = [], []

    # Iterate over each (source, target) pair in the batch
    for src_sample, tgt_sample in batch:
        # Apply language-specific text transformations to source and target samples
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    # Use pad_sequence to pad sequences in the batch to the length of the longest sequence
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)

    # Return the collated source and target batches
    return src_batch, tgt_batch
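
The padding step can be illustrated with nested lists. The sketch below assumes the default time-major layout of `pad_sequence`, in which the first dimension is the sequence position and the second is the batch; `pad_time_major` is a hypothetical helper and the token indices are made up.

```python
from typing import List

PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3


def pad_time_major(sequences: List[List[int]], padding_value: int) -> List[List[int]]:
    """Right-pad every sequence to the longest length; return (max_len, batch_size) rows."""
    max_len = max(len(sequence) for sequence in sequences)
    padded = [sequence + [padding_value] * (max_len - len(sequence)) for sequence in sequences]
    # Transpose so that row t holds time step t of every sequence in the batch.
    return [list(step) for step in zip(*padded)]


# Two hypothetical numericalized sentences, already wrapped in BOS/EOS.
toy_sequences = [
    [BOS_IDX, 7, EOS_IDX],
    [BOS_IDX, 8, 9, 10, EOS_IDX],
]
toy_padded = pad_time_major(toy_sequences, PAD_IDX)
for step in toy_padded:
    print(step)
```

The shorter sentence is filled with `PAD_IDX` after its `<eos>` token, and these are the positions that the padding masks and the `ignore_index` of the loss function later exclude.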

Let's establish the training and evaluation loops, which will be invoked for each epoch.

The import statement makes the `DataLoader` class available for use in the current code or script.

- `torch.utils.data` is a module in PyTorch that provides utilities for working with data, including datasets and data loaders.
- `DataLoader` is a class within the `torch.utils.data` module that is used to load batches of data from a given dataset.

In [None]:
# Import the DataLoader class from the torch.utils.data module
from torch.utils.data import DataLoader

This function represents a single training epoch for a Seq2Seq model and is a fundamental part of the training process.

- `train_epoch` is a function designed to train one epoch (iteration over the entire training dataset) of the Seq2Seq model.
- The model is set to training mode using `model.train()`.
- The function initializes a variable (`losses`) to store the cumulative loss during the epoch.
- A DataLoader (`train_dataloader`) is created to iterate over batches in the training dataset.
- For each batch, the source and target sequences are moved to the specified device (e.g., GPU).
- The last token is removed from the target sequence to create the target input.
- Masks for source and target sequences are created using the `create_mask` function.
- A forward pass of the model on the source and target input sequences produces the logits.
- The gradients in the optimizer are zeroed using `optimizer.zero_grad()`.
- The first token is removed from the target sequence to create the target output.
- The loss is computed using the predicted logits and target output.
- Backward pass (`loss.backward()`) and optimization step (`optimizer.step()`) are performed.
- The loss is accumulated for monitoring.
- The average loss for the epoch is returned.

In [None]:
# Define a function for training one epoch of the Seq2Seq model
def train_epoch(model, optimizer):
    # Set the model to training mode
    model.train()

    # Initialize the cumulative loss and the number of batches seen during the epoch
    losses = 0
    num_batches = 0

    # Create a DataLoader for the training dataset
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # Iterate over batches in the training DataLoader
    for src, tgt in train_dataloader:
        # Move the source and target batches to the specified device (e.g., GPU)
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        # Remove the last token from the target sequence to create the target input
        tgt_input = tgt[:-1, :]

        # Create masks for source and target sequences
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass through the model
        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        # Zero out the gradients in the optimizer
        optimizer.zero_grad()

        # Remove the first token from the target sequence to create the target output
        tgt_out = tgt[1:, :]

        # Compute the loss using the predicted logits and target output
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        # Accumulate the loss and count the batch for monitoring
        losses += loss.item()
        num_batches += 1

    # Return the average loss per batch; counting batches in the loop avoids
    # iterating over the DataLoader a second time just to measure its length
    return losses / num_batches
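
The loss computation flattens the logits and the targets before comparing them. The sketch below reproduces that reshaping on random data. It assumes that `loss_fn` is a cross-entropy loss that ignores padding positions; `PAD_IDX` and the tensor sizes are illustrative values only.

```python
import torch
import torch.nn as nn

PAD_IDX = 1                    # assumed padding index
SEQ_LEN, BATCH, VOCAB = 4, 2, 6  # illustrative sizes

torch.manual_seed(0)
logits = torch.randn(SEQ_LEN, BATCH, VOCAB)  # model output: one score per vocabulary entry
tgt_out = torch.tensor([[4, 5],
                        [2, 3],
                        [3, PAD_IDX],
                        [PAD_IDX, PAD_IDX]])  # expected next tokens, with padding

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Flatten to (SEQ_LEN * BATCH, VOCAB) and (SEQ_LEN * BATCH,)
flat_logits = logits.reshape(-1, logits.shape[-1])
flat_targets = tgt_out.reshape(-1)
loss = loss_fn(flat_logits, flat_targets)

# Equivalent computation: drop the padded positions by hand, then average
keep = flat_targets != PAD_IDX
manual_loss = nn.functional.cross_entropy(flat_logits[keep], flat_targets[keep])
print(loss.item(), manual_loss.item())
```

With `ignore_index` set, padded positions contribute nothing to the loss, so both values agree.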

The `evaluate` function below measures how well the model performs on held-out data. It works as follows:

- `evaluate` computes the average loss of the Seq2Seq model on the validation set.
- The model is set to evaluation mode using `model.eval()`.
- The function initializes a variable (`losses`) to store the cumulative loss during evaluation.
- A DataLoader (`val_dataloader`) is created to iterate over batches in the validation dataset.
- For each batch, the source and target sequences are moved to the specified device (e.g., GPU).
- The last token is removed from the target sequence to create the target input.
- Masks for source and target sequences are created using the `create_mask` function.
- The model is forwarded with the source and target input sequences, and logits are obtained.
- The first token is removed from the target sequence to create the target output.
- The loss is computed using the predicted logits and target output.
- The loss is accumulated for monitoring.
- The average loss for the validation set is returned.
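
Switching to evaluation mode matters because some layers, such as dropout, behave differently during training. The sketch below uses a small stand-in network (`layer` is a hypothetical example, not the tutorial's model) to show this, together with `torch.no_grad()`, which is commonly paired with evaluation mode to skip gradient bookkeeping.

```python
import torch
import torch.nn as nn

# Stand-in network with dropout; not the tutorial's Transformer
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(3, 4)

# Training mode: dropout is active and outputs track gradients
layer.train()
train_out = layer(x)

# Evaluation mode: dropout is disabled, so repeated calls give identical results
layer.eval()
with torch.no_grad():
    out_a = layer(x)
    out_b = layer(x)

deterministic = torch.equal(out_a, out_b)
print(deterministic, train_out.requires_grad, out_a.requires_grad)
```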

In [None]:
# Define a function for evaluating the performance of the Seq2Seq model on the validation set
def evaluate(model):
    # Set the model to evaluation mode
    model.eval()

    # Initialize the cumulative loss and the number of batches seen during evaluation
    losses = 0
    num_batches = 0

    # Create a DataLoader for the validation dataset
    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # Disable gradient tracking: evaluation needs no backward pass
    with torch.no_grad():
        # Iterate over batches in the validation DataLoader
        for src, tgt in val_dataloader:
            # Move the source and target batches to the specified device (e.g., GPU)
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            # Remove the last token from the target sequence to create the target input
            tgt_input = tgt[:-1, :]

            # Create masks for source and target sequences
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            # Forward pass through the model
            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

            # Remove the first token from the target sequence to create the target output
            tgt_out = tgt[1:, :]

            # Compute the loss using the predicted logits and target output
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))

            # Accumulate the loss and count the batch for monitoring
            losses += loss.item()
            num_batches += 1

    # Return the average loss per batch for the validation set
    return losses / num_batches

With all the components in place, we can now train the model.

The code below runs the training loop. For each epoch, it trains the model, evaluates it on the validation set, and prints the training loss, the validation loss, and the time taken.

- `from timeit import default_timer as timer`: Import the `default_timer` function from the `timeit` module and alias it as `timer`.
- `NUM_EPOCHS = 18`: Set the number of training epochs to 18.
- `for epoch in range(1, NUM_EPOCHS+1)`: Iterate over the specified number of epochs.
- `start_time = timer()`: Record the start time of the epoch using the `timer` function.
- `train_loss = train_epoch(transformer, optimizer)`: Perform one training epoch using the `train_epoch` function and obtain the training loss.
- `end_time = timer()`: Record the end time of the epoch using the `timer` function.
- `val_loss = evaluate(transformer)`: Evaluate the model on the validation set using the `evaluate` function and obtain the validation loss.
- The final `print` call reports the epoch number, the training loss, the validation loss (both to three decimal places), and the epoch time in seconds. Because `end_time` is recorded before `evaluate` runs, the reported time covers training only.

In [None]:
# Import the default_timer function from the timeit module
from timeit import default_timer as timer

# Set the number of epochs for training
NUM_EPOCHS = 18

# Iterate over the specified number of epochs
for epoch in range(1, NUM_EPOCHS+1):
    # Record the start time of the epoch
    start_time = timer()

    # Perform one training epoch and obtain the training loss
    train_loss = train_epoch(transformer, optimizer)

    # Record the end time of the epoch
    end_time = timer()

    # Evaluate the model on the validation set and obtain the validation loss
    val_loss = evaluate(transformer)

    # Print the training and validation loss along with epoch information
    print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
          f"Epoch time = {(end_time - start_time):.3f}s")
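
The same loop structure can be tried without a model. The sketch below substitutes two hypothetical stand-ins, `fake_train_epoch` and `fake_evaluate`, which return made-up decreasing losses, and additionally keeps a per-epoch history so the epoch with the lowest validation loss can be looked up afterwards. The history is an illustrative extension, not part of the loop above.

```python
from timeit import default_timer as timer

# Hypothetical stand-ins that return made-up losses; no model is trained here
def fake_train_epoch(epoch):
    return 4.0 / epoch

def fake_evaluate(epoch):
    return 4.5 / epoch

history = []
for epoch in range(1, 4):
    start_time = timer()
    train_loss = fake_train_epoch(epoch)
    end_time = timer()  # stop the clock before evaluation, as in the loop above
    val_loss = fake_evaluate(epoch)

    history.append({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "train_time_s": end_time - start_time,
    })

# Pick the epoch with the lowest validation loss
best = min(history, key=lambda record: record["val_loss"])
print(f"Best epoch: {best['epoch']}, Val loss: {best['val_loss']:.3f}")
```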

The `greedy_decode` function below generates output sequences at inference time with the trained Seq2Seq model. It works as follows:

- `greedy_decode` builds the output one token at a time, always choosing the highest-scoring next token (greedy decoding).
- The source sequence (`src`) and source mask (`src_mask`) are moved to the specified device (e.g., GPU).
- The source sequence is encoded using the `encode` function of the model.
- The target sequence (`ys`) is initialized with the start symbol.
- The function iterates until the maximum length is reached or the end-of-sequence token is encountered.
- The target sequence is decoded using the `decode` function of the model.
- The output tensor is transposed, and the model's generator is applied to the last position to obtain a score for each candidate next word.
- The index of the highest-scoring word is determined.
- The next word is appended to the target sequence.
- If the end-of-sequence token is encountered, the loop is terminated.
- The function returns the generated target sequence.
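
The control flow of greedy decoding can be sketched in plain Python. In this toy version, the hypothetical table `NEXT_SCORES` plays the role of the model and scores the next token from the previous token alone, whereas the real decoder conditions on the whole prefix and on the encoded source. All token ids and scores are invented for illustration.

```python
BOS, EOS = 0, 1  # assumed special-token ids for this toy example

# Hypothetical scores over a 5-token vocabulary, keyed by the previous token
NEXT_SCORES = {
    0: [0.0, 0.1, 2.0, 0.3, 0.2],  # after <bos>, token 2 scores highest
    2: [0.0, 0.2, 0.1, 1.5, 0.3],  # after 2, token 3 scores highest
    3: [0.0, 0.4, 0.1, 0.2, 0.9],  # after 3, token 4 scores highest
    4: [0.0, 3.0, 0.1, 0.2, 0.3],  # after 4, <eos> scores highest
}

def toy_greedy_decode(max_len, start_symbol=BOS):
    ys = [start_symbol]
    # At most max_len - 1 new tokens, so the result never exceeds max_len
    for _ in range(max_len - 1):
        scores = NEXT_SCORES[ys[-1]]
        # Greedy choice: take the single highest-scoring token
        next_token = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_token)
        # Stop as soon as the end-of-sequence token is produced
        if next_token == EOS:
            break
    return ys

full_output = toy_greedy_decode(max_len=10)
truncated_output = toy_greedy_decode(max_len=3)
print(full_output, truncated_output)
```

The first call ends because the end-of-sequence token is produced; the second ends because the length limit is reached first.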

In [None]:
# Define a function for generating an output sequence using the greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    # Move source and source mask to the specified device (e.g., GPU)
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    # Encode the source sequence using the model
    memory = model.encode(src, src_mask)

    # Initialize the target sequence with the start symbol
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)

    # Iterate until the maximum length is reached or the end-of-sequence token is encountered
    for i in range(max_len-1):
        # Move memory to the specified device (e.g., GPU)
        memory = memory.to(DEVICE)

        # Generate a mask for the target sequence
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)

        # Decode the target sequence using the model
        out = model.decode(ys, memory, tgt_mask)

        # Transpose the output tensor so the batch dimension comes first
        out = out.transpose(0, 1)

        # Apply the model's generator to the last position to score each candidate next word
        prob = model.generator(out[:, -1])

        # Find the index of the highest-scoring word
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        # Append the next word to the target sequence
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)

        # If the end-of-sequence token is encountered, break the loop
        if next_word == EOS_IDX:
            break

    # Return the generated target sequence
    return ys
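
The target mask built inside the loop prevents each position from attending to later positions. The helper below, `causal_bool_mask`, is a hypothetical stand-in written for this sketch. It is not the notebook's `generate_square_subsequent_mask`, but it produces a boolean mask with the same meaning, where `True` marks a position that may not be attended to.

```python
import torch

def causal_bool_mask(size):
    # True above the diagonal: position i may not look at positions j > i
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

mask = causal_bool_mask(3)
print(mask)
```

The mask grows by one row and one column at every decoding step, matching the current length of `ys`.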

The `translate` function below translates an input sentence with the trained Seq2Seq model. It works as follows:

- `translate` takes a sentence in the source language and returns its translation in the target language.
- The model is set to evaluation mode using `model.eval()`.
- The source sentence is tokenized and numericalized using the `text_transform` for the source language.
- The number of tokens in the source sentence is obtained.
- A source mask is created with all values set to `False`, so no source position is hidden from any other.
- The `greedy_decode` function is used to generate the target tokens.
- The target tokens are converted to a string using the vocabulary transformation for the target language.
- The `<bos>` and `<eos>` tokens are removed from the generated string.
- The translated sentence is returned.
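
The tensor shapes involved in the preparation steps can be checked in isolation. In the sketch below, `token_ids` is a hypothetical stand-in for the output of the source-language text transform; the ids themselves are invented.

```python
import torch

# Hypothetical numericalized sentence: <bos>, two word ids, <eos>
token_ids = torch.tensor([2, 17, 9, 3])

# Reshape to (seq_len, batch_size) with a batch of one sentence
src = token_ids.view(-1, 1)
num_tokens = src.shape[0]

# All-False boolean mask: nothing is masked out in the source
src_mask = torch.zeros(num_tokens, num_tokens).type(torch.bool)

# Allow the translation to be up to five tokens longer than the source
max_len = num_tokens + 5
print(src.shape, src_mask.shape, max_len)
```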

In [None]:
# Define a function to translate an input sentence into the target language
def translate(model: torch.nn.Module, src_sentence: str):
    # Set the model to evaluation mode
    model.eval()

    # Tokenize and numericalize the source sentence using the text_transform
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)

    # Obtain the number of tokens in the source sentence
    num_tokens = src.shape[0]

    # Create a source mask with all values set to False
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)

    # Use the greedy_decode function to generate the target tokens
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()

    # Convert the target tokens to a string using the vocabulary transformation
    translated_sentence = " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy())))

    # Remove "<bos>" and "<eos>" tokens and the surrounding whitespace they leave behind
    translated_sentence = translated_sentence.replace("<bos>", "").replace("<eos>", "").strip()

    # Return the translated sentence
    return translated_sentence
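
An alternative to removing the special tokens from the joined string is to filter them out of the token list before joining. This is a sketch of that design choice, not the approach used above; `detokenize` and its default `specials` tuple are hypothetical names introduced here.

```python
def detokenize(tokens, specials=("<bos>", "<eos>", "<pad>")):
    # Drop special tokens first, then join the remaining tokens with spaces
    return " ".join(token for token in tokens if token not in specials)

example_tokens = ["<bos>", "A", "group", "of", "people", ".", "<eos>"]
sentence = detokenize(example_tokens)
print(repr(sentence))
```

Filtering before joining leaves no stray spaces at either end of the sentence.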

The following example translates a German sentence into English using the trained Seq2Seq model.

- `translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .")` passes the trained model (`transformer`) and a German source sentence to the `translate` function.
- The function runs the sentence through the model and returns the English translation.
- `print()` displays the translated sentence.

In [None]:
# Print the translation of the input sentence using the trained Seq2Seq model
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

## References

1. Vaswani et al., "Attention Is All You Need", 2017. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. "The Annotated Transformer". https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding

