<img src='data/images/section-notebook-header.png' />

**Disclaimer:** This notebook is adapted from the official PyTorch Tutorial [Language Translation with nn.Transformer and torchtext](https://pytorch.org/tutorials/beginner/translation_transformer.html). Overall, the code remains almost unchanged, but there are some minor modifications:

* The code has been re-organized to have all import statements at the beginning of the notebook.

* Some minor updates have been made to avoid some warning messages.

* Some variable names and values have been modified to be in line with previous course notebooks; the training and evaluation loops now come with progress bars to show that something is happening.

The main difference between this notebook and the official PyTorch tutorial is in the added explanations and discussion of the code. This should make it easier to understand and follow all the performed steps in this notebook.

# Language Translation with `nn.Transformer` and `torchtext`

In a previous notebook, we addressed the task of machine translation using an RNN-based encoder-decoder architecture. Now we will see how Transformers can be used for this task. Transformers have revolutionized the field of machine translation in NLP and offer several advantages over traditional approaches. Here are some key advantages of Transformers for machine translation:

* **Attention Mechanism:** Transformers employ a self-attention mechanism that allows them to capture dependencies between words in a sentence effectively. Unlike traditional RNNs that process input sequentially, Transformers can attend to all words simultaneously, enabling them to capture long-range dependencies more efficiently. This attention mechanism helps the model understand the context of each word in relation to the entire input sequence, which is crucial for accurate translation.

* **Parallelization:** Transformers are highly parallelizable: all positions of an input sequence are processed at once rather than one token after another. This is particularly advantageous during training, where it makes much better use of modern hardware such as GPUs. In contrast, recurrent models must process each sentence token by token, leading to slower computation times.

* **Contextual Representation:** Transformers excel at capturing contextual information. They generate contextualized word representations, allowing them to consider the meaning of a word within the context of the entire sentence. This contextual representation facilitates more accurate translations by enabling the model to understand subtle linguistic nuances, idiomatic expressions, and word sense disambiguation.

* **Bidirectional Encoding:** Transformers employ bidirectional encoding, which means they consider both the left and right context of a word during the encoding process. This bidirectional approach allows the model to capture dependencies between words in both directions, resulting in a more comprehensive understanding of the input sentence. It helps address the limitations of traditional models that process sentences only in one direction, such as RNNs.

* **Transfer Learning:** Transformers can be effectively pretrained on large-scale corpora and then fine-tuned on specific machine translation tasks. This transfer learning approach allows the model to leverage the knowledge acquired during pretraining, such as understanding syntax, semantics, and general language structure. Fine-tuning on translation data further enhances the model's ability to generate accurate and coherent translations.

* **Long-Term Dependency Handling:** Transformers mitigate the vanishing gradient problem that affects traditional recurrent models like RNNs. RNNs struggle to capture long-term dependencies due to the decay of gradients over time. Transformers, on the other hand, use self-attention to directly connect any two words in a sentence, allowing them to model long-range dependencies effectively and capture essential information from distant words.

These advantages of Transformers make them highly suitable for machine translation tasks in NLP, enabling them to achieve state-of-the-art performance and improve translation quality compared to previous approaches. Of course, this assumes training over very large datasets which is beyond the scope of this notebook.

## Setting up the Notebook

### Import Required Packages

In [1]:
import math
from tqdm import tqdm
from typing import Iterable, List
from timeit import default_timer as timer

We utilize some utility methods from PyTorch as well as Torchtext, so we need to import the `torch` and `torchtext` packages.

In [2]:
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer

from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k

### Checking/Setting the Computation Device

PyTorch allows you to train neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [3]:
use_cuda = torch.cuda.is_available()

# Use this line below to enforce the use of the CPU 
#use_cuda = False

DEVICE = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(DEVICE))

Available device: cuda:0


### Additional Requirements

Although we do not explicitly import spaCy, the `get_tokenizer()` method of `torchtext` will use the spaCy tokenizer. This means you need to have spaCy installed together with the language models used in this notebook. Below are the `pip` commands to install spaCy and the required language models, if needed. If you use `conda` or `mamba` as your Python package manager, the command for installing spaCy will differ:

```bash
pip install -U spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

We will also make use of a dataset provided by `torchtext`. Thus, you also need to install the `torchdata` package.

```bash
pip install -U torchdata
```

---

## Data Sourcing and Processing

The [`torchtext`](https://pytorch.org/text/stable/) library has utilities for creating datasets that can be easily iterated over when building a language translation model. In this example, we show how to use `torchtext`'s built-in datasets, tokenize a raw text sentence, build a vocabulary, and numericalize tokens into tensors. We will use the [Multi30k](https://pytorch.org/text/stable/datasets.html#multi30k) dataset from the `torchtext` library, which yields pairs of source-target raw sentences. The Multi30k dataset is a popular benchmark dataset used in the field of machine translation and multimodal research. It was designed for multimodal machine translation, that is, translating natural language descriptions of images, optionally with the help of the images themselves. The dataset therefore pairs sentences with the images whose content they describe. Here are the key features of the Multi30k dataset:

* **Size:** The dataset consists of approximately 30,000 sentence-image pairs. This is small compared to the corpora used to train production translation systems, but it is enough to train and evaluate a small model within a reasonable time.

* **Multilingual:** The Multi30k dataset includes parallel descriptions in multiple languages, making it useful for multilingual machine translation research. The original dataset contains English-German pairs, but it has been expanded to include other languages such as French and Czech.

* **Image Content:** The images in the dataset are sourced from the Flickr30k dataset; Multi30k covers 31,014 of them. In Flickr30k, each image comes with five English descriptions. For the translation task used in this notebook, each example is a German-English sentence pair describing one image. The descriptions vary in length and complexity, ranging from simple to more detailed and nuanced sentences.

* **Training, Validation, and Test Sets:** The Multi30k dataset is divided into training, validation, and test sets. The training set typically comprises the majority of the dataset and is used for training machine translation models. The validation set is used to tune hyperparameters and make early stopping decisions during training. The test set serves as an unbiased evaluation set to measure the performance of trained models.

* **Annotations:** The Multi30k dataset includes annotations for each sentence-image pair, providing additional information such as part-of-speech tags and parse trees. These annotations can be used to enhance the training process or explore linguistic aspects of the data.

The Multi30k dataset has been widely used by researchers to develop and evaluate machine translation models, especially those incorporating image and text modalities. It facilitates research in multimodal translation, where the goal is to generate accurate and meaningful translations by leveraging both visual and textual information. In this notebook, we will only use the textual information and ignore any images.

In [4]:
# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Placeholders
token_transform = {}
vocab_transform = {}

### Auxiliary Methods

We use the spaCy tokenizer to tokenize all sentences. For this, we use the `get_tokenizer()` method of `torchtext` to load the tokenizer for the source language (German) and the target language (English). We store both tokenizers in the dictionary `token_transform` and index the respective tokenizer using the language identifiers `'de'` and `'en'`. We also define a generator method `yield_tokens()` which takes an iterator over the sentence pairs as input and yields each sentence of the chosen language as a list of tokens by calling the respective tokenizer.

In [5]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# Helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])
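
To make the behaviour of `yield_tokens()` concrete, here is a minimal sketch that swaps the spaCy tokenizers for a plain whitespace split, so it runs without any language models installed. The names `toy_tokenizers`, `toy_yield_tokens` and `toy_pairs` are illustrative assumptions and not part of the notebook; a real spaCy tokenizer will split text differently (for example around punctuation).

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-ins for the spaCy tokenizers: split on whitespace only
toy_tokenizers = {"de": str.split, "en": str.split}


def toy_yield_tokens(pairs: Iterable[Tuple[str, str]], language: str) -> Iterator[List[str]]:
    # Position of each language within a (source, target) sentence pair
    language_index = {"de": 0, "en": 1}
    for pair in pairs:
        yield toy_tokenizers[language](pair[language_index[language]])


# Made-up sentence pairs in the same (German, English) layout as the dataset
toy_pairs = [("Ein Hund rennt .", "A dog runs ."),
             ("Zwei Kinder spielen .", "Two children play .")]

german_tokens = list(toy_yield_tokens(toy_pairs, "de"))
english_tokens = list(toy_yield_tokens(toy_pairs, "en"))
print(german_tokens)
print(english_tokens)
```

Because the function is a generator, the token lists are produced one sentence at a time, which is what allows a vocabulary builder to consume them without holding the whole tokenized corpus in memory.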

### Build Vocabularies

We can now build the vocabularies for the source and target language. As usual, we first need to define a set of special tokens. Note that here we manually specify the indices of all special tokens. Strictly speaking, this step is not required since we could always get these values later using, e.g., `vocab_transform['en']['<PAD>']`. However, this approach makes the code a bit cleaner.

**Important:** You need to ensure that the indices of the special tokens match the order of the tokens in the list `special_symbols`, so it is best not to edit the code cell below. This ensures that, for example, `vocab_transform['en']['<PAD>'] == PAD_IDX` is indeed true.

The `torchtext` dataset returns an iterator, making it quite easy to create the vocabularies using the built-in method `build_vocab_from_iterator()` -- in previous notebooks, we did this step more "manually" by using a `Counter` and `OrderedDict`, etc. If you check the code below, you will notice that we actually iterate twice over the same data iterator `train_iter`, once for each language. This is of course not very efficient, but since the dataset is rather small, we use the convenient implementation.

In [6]:
# Define special symbols and indices
PAD_IDX, UNK_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set `UNK_IDX` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
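
The following sketch illustrates, in plain Python, the two properties the cell above relies on: special symbols occupy the first indices in the order they are listed, and unknown tokens fall back to `UNK_IDX`. It is a conceptual stand-in and not how `torchtext` implements its `Vocab` object; `build_toy_vocab` and `lookup` are hypothetical helpers, and the ordering of the regular tokens is an assumption made for the example.

```python
from collections import Counter

special_symbols = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]
UNK_IDX = 1


def build_toy_vocab(token_lists, min_freq=1):
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Special symbols first, so their indices match their position in the list
    itos = list(special_symbols)
    # Then all regular tokens, most frequent first (ties broken alphabetically)
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(token)
    return {token: idx for idx, token in enumerate(itos)}


toy_vocab = build_toy_vocab([["a", "dog", "runs"], ["a", "cat", "sleeps"]])


def lookup(token):
    # Fall back to UNK_IDX for unseen tokens, mirroring set_default_index(UNK_IDX)
    return toy_vocab.get(token, UNK_IDX)


print(toy_vocab["<PAD>"], lookup("a"), lookup("zebra"))
```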

**Important:** In previous notebooks, we considered the building of the vocabularies and the vectorization a separate step -- that is, we built the vocabularies, vectorized the sentences and saved the result in files for later use. Note that here, we only build the two vocabularies. The reason is that in this notebook, we will vectorize our batches of sentence pairs "on the fly" during each epoch (we will define more auxiliary methods for this later). Vectorizing each batch during training will naturally add some overhead to the process which will result in some longer training times.

---

## Seq2Seq Network using Transformer

### Auxiliary Layers

In the notebook covering the basic Transformer architecture, we also discussed the importance of a **Positional Encoding** layer to preserve the information of the word order. While we implemented a `PositionalEncoding` class in the file `src/transformer.py`, we use the implementation of this class below to adhere to the original PyTorch tutorial, and to keep this notebook self-contained. However, you should compare both implementations to convince yourself that they essentially perform exactly the same steps.

The implementation of the model also uses the `TokenEmbedding` layer. If you check below, this layer merely wraps an `nn.Embedding` layer, which we have already used multiple times. The only addition is that the resulting embeddings are multiplied by the square root of the embedding size (`math.sqrt(self.emb_size)` in the code), as in the original Transformer paper. A commonly cited reason is that this brings the token embeddings to a magnitude comparable to the positional encodings added afterwards, so that the positional signal does not drown out the token information. More generally, controlling the scale of word embeddings is a technique commonly used in natural language processing (NLP) tasks, and the following benefits are often attributed to it:

* **Avoiding Bias Towards Certain Dimensions:** Word embeddings are typically represented as vectors in a high-dimensional space, where each dimension captures a different aspect of the word's meaning. However, the magnitudes of these dimensions can vary widely. By scaling the embeddings, you ensure that no dimension dominates over others, thereby avoiding bias towards specific dimensions. This prevents the model from assigning disproportionate importance to certain aspects of word meanings.

* **Gradient Stability:** Scaling word embeddings can help with gradient stability during training. When the magnitudes of different embedding dimensions differ significantly, it can lead to gradients with varying scales. Large gradients in some dimensions and small gradients in others can make the optimization process challenging and slow down training. Scaling the embeddings helps balance the gradient scales, making the optimization more stable and efficient.

* **Similarity Measures:** In various NLP tasks, such as measuring semantic similarity or calculating distances between word embeddings, scaling helps ensure consistent and meaningful comparisons across dimensions. Without scaling, the magnitude differences can distort the similarity measures, leading to inaccurate results. Scaling ensures that the distance or similarity between word embeddings is based on meaningful comparisons across all dimensions.

* **Regularization:** Scaling word embeddings can act as a form of regularization. By enforcing a consistent magnitude across embedding dimensions, it can prevent individual dimensions from having excessive influence on the overall embedding representation. This regularization can improve the generalization capability of the model, reducing overfitting and improving its ability to handle unseen data.

It's worth noting that scaling word embeddings with respect to the embedding size is just one scaling technique among several possibilities. Other scaling methods, such as unit norm scaling or custom scaling factors, may also be used depending on the specific requirements of the task or the characteristics of the embeddings. The goal is to ensure that the word embeddings are appropriately scaled to promote better training, more meaningful comparisons, and improved model performance.
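
As a small illustration of what the scaling in the `TokenEmbedding` layer (defined in the next code cell) does numerically, the sketch below applies the same factor to a made-up vector in plain Python. The embedding size and the vector values are assumptions chosen for readability.

```python
import math

emb_size = 4                         # toy embedding size, chosen for readability
embedding = [0.5, -0.25, 0.1, 0.0]   # made-up embedding vector for one token

scale = math.sqrt(emb_size)          # same factor as in TokenEmbedding.forward()
scaled = [x * scale for x in embedding]

norm_before = math.sqrt(sum(x * x for x in embedding))
norm_after = math.sqrt(sum(x * x for x in scaled))
norm_ratio = norm_after / norm_before

print(scaled)
print(norm_ratio)
```

Every component is multiplied by the same scalar, so the length of the vector grows by `sqrt(emb_size)` while the proportions between its dimensions stay unchanged.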

In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])


class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
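
To relate the `PositionalEncoding` class to the sinusoidal formula, the following sketch computes the encoding of a single position in plain Python. The function `positional_encoding` is a hypothetical helper for illustration; it omits dropout and the precomputed buffer of the class above.

```python
import math


def positional_encoding(pos: int, emb_size: int):
    # Returns the sinusoidal encoding vector for a single position
    encoding = [0.0] * emb_size
    for i in range(0, emb_size, 2):
        # Same quantity as `den` in the class above: 10000^(-i / emb_size),
        # with i running over the even dimension indices 0, 2, 4, ...
        den = math.exp(-i * math.log(10000) / emb_size)
        encoding[i] = math.sin(pos * den)          # even dimensions use sine
        if i + 1 < emb_size:
            encoding[i + 1] = math.cos(pos * den)  # odd dimensions use cosine
    return encoding


pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)
print(pe1)
```

Position 0 always yields alternating zeros and ones, since `sin(0) = 0` and `cos(0) = 1`; later positions differ in every dimension, with the higher dimensions changing more slowly.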

### Complete Architecture

We implemented the Transformer architecture from scratch in the previous notebook. However, for better performance, we will use an available Transformer implementation as the core component of our model. In PyTorch, the `nn.Transformer` module is an implementation of the Transformer model. The module provides a high-level interface for creating and training Transformer models in PyTorch. It encapsulates the core components of the Transformer architecture, including the encoder, decoder, and attention mechanisms. Here are the key components of the `nn.Transformer` module:

* **Encoder and Decoder:** The Transformer model consists of an encoder and a decoder. The encoder takes an input sequence and processes it to capture the contextual representations of the input tokens. The decoder takes the encoder's outputs and generates the output sequence step by step.

* **Attention Mechanism:** The Transformer's attention mechanism is a fundamental component that allows the model to capture dependencies between tokens efficiently. The `nn.Transformer` module incorporates multi-head self-attention, enabling the model to attend to different parts of the input sequence simultaneously.

* **Feed-Forward Networks:** The Transformer model includes feed-forward neural networks within the encoder and decoder. These networks provide non-linear transformations to the attention outputs, allowing the model to capture complex patterns and relationships between tokens.

* **Masking:** The `nn.Transformer` module supports masking capabilities, particularly for the decoder. Masking is crucial to prevent the model from attending to future tokens during training, ensuring the model's autoregressive property.

The code cell below implements the class `Seq2SeqTransformer` as our final model. Apart from the `nn.Transformer` layer, it naturally also includes the `PositionalEncoding` layer and `TokenEmbedding` layer from above.

In [8]:
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(self.tgt_tok_emb(tgt)), memory, tgt_mask)

### Masking

In the context of Transformers, masking refers to a technique used to control the flow of information during the self-attention mechanism within the model. Transformers rely on self-attention to capture the relationships between different words or tokens in a sequence. Masking is particularly important in tasks where the model processes sequences, such as machine translation, language modeling, or text classification.

There are two commonly used types of masks in Transformers: padding masks and lookahead masks.

* **Padding masks:** When processing a batch of sequences, they may have different lengths. To handle variable-length sequences efficiently, padding is often used to make all sequences in a batch the same length by adding special tokens like `<PAD>`. Padding masks are used to mask out the padded positions during self-attention calculations. This ensures that the model does not attend to the padded positions, which do not contain meaningful information.

* **Lookahead masks:** In language modeling or autoregressive tasks (incl. machine translation), where the model predicts the next token given the previous tokens, a lookahead mask is applied to ensure that each token can only attend to the previous tokens and not to the tokens that follow it. This prevents the model from cheating by looking ahead at tokens it should not have access to during training or generation.

On an implementation level, masks are matrices/tensors whose values indicate which tokens to attend to and which to ignore. These values are commonly Booleans (`True` or `False`) or floats (`0` or `-inf`), sometimes also integers (`0` or `1`). The choice of values depends on the exact implementation of the Transformer. For example, `nn.Transformer` supports Boolean masks, where `True` marks a position that must not be attended to, as well as float masks, whose values are added to the attention scores (so `-inf` blocks a position and `0` leaves it unchanged). Let's first look at an example for a padding mask using Boolean values. Suppose we have a batch of four sequences with varying lengths but padded with `0` to enforce the same length:
    
```
Sequence 1: [6, 4, 2, 1, 9]
Sequence 2: [5, 3, 4, 0, 0]
Sequence 3: [7, 8, 2, 3, 0]
Sequence 4: [1, 2, 0, 0, 0]
```

The corresponding padding mask will then look like:
    
    
```
[
  [False, False, False, False, False],
  [False, False, False, True, True],
  [False, False, False, False, True],
  [False, False, True, True, True]
]

```

In this example, the first sequence has a length of 5, so we have 5 `False` values in the first row. The second sequence has a length of 3, so we have 3 `False` values in the second row, and the last two positions are padded, represented by True; and so on for the other 2 sequences. This boolean padding mask can be used in the Transformer network to identify the padded positions and exclude them from attention calculations, ensuring that the model attends only to the relevant tokens in each sequence.
    
**Important:** In the example above, this will be the padding mask independent from whether the batch is the input of the encoder (i.e., source language) or for decoder (i.e., target language)
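
The padding mask for the example batch can be derived with a single comparison against the padding index. The sketch below does this with plain Python lists to keep the logic explicit; the variable names are illustrative.

```python
PAD_IDX = 0

batch = [
    [6, 4, 2, 1, 9],
    [5, 3, 4, 0, 0],
    [7, 8, 2, 3, 0],
    [1, 2, 0, 0, 0],
]

# True marks a padded position that attention should ignore
padding_mask = [[token == PAD_IDX for token in sequence] for sequence in batch]

for row in padding_mask:
    print(row)
```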
    
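Assuming that the batch is the input for the decoder and we are modeling an autoregressive task such as machine translation, we also need a lookahead mask. A lookahead mask is a square matrix whose size is the sequence length; it does not depend on the content of the batch. For sequences of length 6, for example, the lookahead mask looks like: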
    
```
[
  [False,  True,  True,  True,  True,  True],
  [False, False,  True,  True,  True,  True],
  [False, False, False,  True,  True,  True],
  [False, False, False, False,  True,  True],
  [False, False, False, False, False,  True],
  [False, False, False, False, False, False]
]
```
    
In this example, the first row represents the first token, which can attend only to itself, so we have 5 `True` values after the first `False`. The second row represents the second token, which can attend to the first token and itself, so we have 4 `True` values after the second `False`. The third row represents the third token, which can attend to the first and second token and itself, so we have 3 `True` values after the third `False`; and so on. The sixth row represents the last token, which can attend to all previous tokens and itself, so we have only `False` values. This boolean lookahead mask ensures that each token can only attend to itself and the previous tokens in the sequence and not to the tokens that follow it. It helps maintain the causality constraint in autoregressive tasks, such as language modeling or text generation, where the model predicts the next token based on the previous tokens.
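
The pattern of a lookahead mask follows one rule: entry `[i][j]` is `True` exactly when position `j` comes after position `i`. A minimal sketch in plain Python, with `lookahead_mask` as a hypothetical helper name:

```python
def lookahead_mask(size: int):
    # Entry [i][j] is True when position j lies in the future of position i
    return [[j > i for j in range(size)] for i in range(size)]


mask = lookahead_mask(6)
for row in mask:
    print(row)
```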
    
#### Auxiliary Methods    
    
In the code cell below, the method `create_mask()` -- utilizing the additional method `generate_square_subsequent_mask()` -- generates 4 masks required for the `nn.Transformer`.

In [9]:
def generate_square_subsequent_mask(sz):
    # Boolean lookahead mask of shape (sz, sz): True above the main diagonal
    # marks future positions that a token must not attend to
    return torch.triu(torch.ones((sz, sz), device=DEVICE), diagonal=1).bool()


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

#### Masking: Example

To illustrate the purpose of the methods `create_mask()` and `generate_square_subsequent_mask()`, let's go through an example. For this, we assume that the list of 4 example sequences `batch_src` represents the input of the encoder (i.e., source language), and `batch_tgt` represents the input of the decoder (i.e., target language). Note that the sequence length of `batch_src` can differ from the sequence length of `batch_tgt`.

**Important:** The implementations of the methods `create_mask()` and `generate_square_subsequent_mask()` assume that the input tensors have a shape of `(sequence_length, batch_size)`. We therefore need to transpose our 2 tensors using `.T`.

In [10]:
batch_src = torch.LongTensor([
    [6, 4, 2, 1, 9],
    [5, 3, 4, 0, 0],
    [7, 8, 2, 3, 0],
    [1, 2, 0, 0, 0]
]).T  # <-- Transpose!

batch_tgt = torch.LongTensor([
    [3, 6, 4, 5, 0, 0],
    [2, 3, 1, 4, 7, 6],
    [5, 4, 1, 2, 5, 3],
    [6, 7, 4, 5, 1, 0]
]).T  # <-- Transpose!

src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(batch_src, batch_tgt)

Let's first look at `src_mask`

In [11]:
print(src_mask)

tensor([[False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False]], device='cuda:0')


As you can see, `src_mask` has only `False` values. So in some sense, this mask is not needed here, as it represents a lookahead mask without any restriction. However, there are use cases beyond machine translation where this mask may have some `True` values. `nn.Transformer` accepts such a mask for the encoder, so for consistency we pass one with all `False` values.

The padding matrix is a bit more interesting...

In [12]:
print(src_padding_mask)

tensor([[False, False, False, False, False],
        [False, False, False,  True,  True],
        [False, False, False, False,  True],
        [False, False,  True,  True,  True]])


As already discussed above, here we can clearly see that, for example, the second sequence has two padding indices at the end, hence the 2 `True` values at the end of the second row. This tells the encoder of `nn.Transformer` not to attend to the last 2 tokens in the second sequence.

For the decoder, `tgt_mask` represents the lookahead mask as shown above.

In [13]:
print(tgt_mask)

tensor([[False,  True,  True,  True,  True,  True],
        [False, False,  True,  True,  True,  True],
        [False, False, False,  True,  True,  True],
        [False, False, False, False,  True,  True],
        [False, False, False, False, False,  True],
        [False, False, False, False, False, False]], device='cuda:0')


Again, the purpose is to tell the decoder that (a) the first token can only attend to itself, (b) the second token can only attend to the first token and to itself, (c) the third token can only attend to the first/second token and to itself, (d) the fourth token can only attend to the first/second/third token and to itself, (e) ...and so on.
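
To see why a masked position ends up with no influence, the sketch below applies a mask row to made-up attention scores before a softmax, using plain Python. This is the usual way such masks take effect and is meant as an illustration only, not as the internal implementation of `nn.Transformer`; `masked_softmax` and the score values are assumptions.

```python
import math


def masked_softmax(scores, mask_row):
    # Positions where mask_row is True receive a score of -inf, i.e. weight 0
    masked = [float("-inf") if blocked else s for s, blocked in zip(scores, mask_row)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]                    # made-up raw scores for one query token
second_token_mask = [False, False, True, True]   # second row of a 4x4 lookahead mask

weights = masked_softmax(scores, second_token_mask)
print(weights)
```

Although the fourth position has the highest raw score, it receives a weight of exactly zero, and the remaining weights still sum to one.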

Of course, if the sequences for the decoder include padding tokens, we also need the padding matrix.

In [14]:
print(tgt_padding_mask)

tensor([[False, False, False, False,  True,  True],
        [False, False, False, False, False, False],
        [False, False, False, False, False, False],
        [False, False, False, False, False,  True]])


The interpretation of `tgt_padding_mask` is the same as that of `src_padding_mask`.

**Side note:** A padding mask might also have `True` values within a sequence (not only at the end) to indicate which tokens not to attend to. While this is not the case here for the task of machine translation, masking proper tokens in a sequence is a common technique for Transformer-based language models such as [BERT](https://arxiv.org/abs/1810.04805).

### Collation


As seen in the Data Sourcing and Processing section, our data iterator yields a pair of raw strings. We need to convert these string pairs into the batched tensors that can be processed by our `Seq2SeqTransformer` model defined previously. Below we define the `collate_fn()` method that converts a batch of raw strings into batch tensors that can be fed directly into our model.

In the context of machine learning, data collation refers to the process of combining and organizing individual data samples or instances into a structured format that can be used for training, validation, or testing of machine learning models. Data collation typically involves gathering data samples from various sources, such as databases, files, or APIs, and transforming them into a suitable representation for model training. This process may include tasks such as data cleaning, preprocessing, feature extraction, and formatting. Key steps involved in data collation include:

* **Data collection:** Gathering the raw data from diverse sources, which could be in different formats, such as text, images, or numerical data.

* **Data cleaning:** Removing any inconsistencies, errors, or missing values from the collected data. This step ensures the quality and reliability of the data.

* **Data preprocessing:** Applying various transformations to the data, such as normalization, scaling, or one-hot encoding, to make it suitable for the machine learning model.

* **Feature extraction:** Extracting relevant features from the data to capture important patterns or characteristics that are informative for the learning task.

* **Data formatting:** Organizing the data into a structured format, such as matrices or tensors, that can be fed into the machine learning algorithm.

* **Splitting into training, validation, and test sets:** Dividing the collated data into separate subsets for model training, model evaluation during development (validation), and final model evaluation (testing).

Data collation is a crucial step in the machine learning pipeline, as the quality and structure of the training data can significantly impact the performance and generalization of the learned models. It requires careful consideration of data sources, appropriate preprocessing techniques, and maintaining data integrity throughout the process. Since we use a ready-made dataset and assume that all sentences are well-formed w.r.t. the source or target language, we only have to consider the following 3 main steps to prepare our data batches:

* **Tokenization:** convert raw string into list of tokens

* **Vectorization:** transform lists of tokens into list of their corresponding indices (given the built vocabularies)

* **Data formatting:** add special tokens `<SOS>` and `<EOS>` at the beginning and end of the token lists

The code cell below defines the method `collate_fn` that -- together with additional auxiliary methods -- performs these 3 steps for a given batch of sentence pairs.

In [15]:
# Helper method to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# method to add SOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([SOS_IDX]), torch.tensor(token_ids), torch.tensor([EOS_IDX])))


# `src` and `tgt` language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], # Tokenization
                                               vocab_transform[ln], # Vectorization
                                               tensor_transform)    # Add SOS/EOS and create tensor


# Method to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
    # pad_sequence is a built-in function provided by the torch package;
    # it pads all sequences in the batch to the length of the longest one
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
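
To illustrate what the formatting and padding steps produce, here is a minimal, self-contained sketch that skips tokenization and vectorization and starts directly from lists of token indices. The special indices (`PAD_IDX = 1`, `SOS_IDX = 2`, `EOS_IDX = 3`), the token ids, and the helper name `add_sos_eos()` are assumptions made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX, SOS_IDX, EOS_IDX = 1, 2, 3  # assumed special-token indices

def add_sos_eos(token_ids):
    # Same idea as tensor_transform(): wrap the indices with <SOS> and <EOS>
    return torch.cat((torch.tensor([SOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# Two "sentences" of different length, already converted to (made-up) indices
seqs = [add_sos_eos([10, 11, 12]), add_sos_eos([20])]

# By default, pad_sequence returns a tensor of shape (max_seq_len, batch_size)
batch = pad_sequence(seqs, padding_value=PAD_IDX)

print(batch.shape)
print(batch)
```

Note that each sequence ends up as a column of the batch tensor, not as a row. This is why the training code later slices along the first dimension.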

### Auxiliary Methods for Training and Evaluation

Like in previous notebooks, it is useful to define methods that handle the basic training and evaluation loop. This is done using the methods `train_epoch()` and `evaluate()` in the code cell below. Notice that in both methods the `DataLoader` class receives the `collate_fn()` as an input parameter. This ensures that during each iteration, the batch of raw sentence pairs is tokenized, vectorized, and formatted as described above.

One implementation detail is worth pointing out. After a batch of raw sentence pairs has been processed by the method `collate_fn()`, all sequences in the batch -- for both the source and the target language -- have the following format, using proper tokens instead of indices for better visualization:

```
[<SOS>, I , went, home, <EOS>, <PAD>, <PAD>, <PAD>, <PAD>]
```

However, the expected format is as follows:

<img src='data/images/transformer-mt-tensor-format.png' width='80%' />

While the format of the tensor matches the expected input of the encoder, we need some additional steps to get the right input and output format for the decoder. Most fundamentally, the output of the decoder needs to be shifted 1 token to the left -- we have already seen this when training the RNN decoder for machine translation. This shifting is done with the line `tgt_out = tgt[1:, :]` by removing the `<SOS>` token from the start of all sequences. Since the input and output of the decoder must have the same sequence length, the line `tgt_input = tgt[:-1, :]` removes the last item of all sequences. For most sequences this is a `<PAD>` token; for the longest sequence(s) in the batch it is the `<EOS>` token.
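The effect of the two slicing operations can be checked on a toy target batch. This is only an illustrative sketch; the index values (`<PAD>` = 1, `<SOS>` = 2, `<EOS>` = 3) and the remaining token ids are assumptions.

```python
import torch

# Toy target batch in (seq_len, batch_size) layout
# column 0: <SOS> 5 7 <EOS>      (longest sequence, no padding)
# column 1: <SOS> 6 <EOS> <PAD>  (shorter sequence, padded)
tgt = torch.tensor([[2, 2],
                    [5, 6],
                    [7, 3],
                    [3, 1]])

tgt_input = tgt[:-1, :]  # decoder input: drop the last position
tgt_out = tgt[1:, :]     # decoder output: drop the leading <SOS>

print(tgt_input[:, 0].tolist(), "->", tgt_out[:, 0].tolist())
print(tgt_input[:, 1].tolist(), "->", tgt_out[:, 1].tolist())
```

At every position, the decoder input holds the tokens seen so far and the decoder output holds the next token to predict. For the longest sequence, the dropped last item is `<EOS>`, which is fine because nothing needs to be predicted after it.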

In [16]:
def train_epoch(model, optimizer, criterion):
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in tqdm(train_dataloader, total=len(list(train_dataloader))):
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        # Remove last entry of all target sequences (typically PAD, EOS for the longest sequences)
        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        # Remove <SOS> from all targets
        tgt_out = tgt[1:, :]
        
        loss = criterion(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()
        
    return losses / len(list(train_dataloader))


def evaluate(model):
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in tqdm(val_dataloader, total=len(list(val_dataloader))):
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]
    
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        
        loss = criterion(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))
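
Since no gradients are needed during evaluation, a common optional refinement is to run the forward passes inside `torch.no_grad()`, which avoids building the computation graph and saves memory. The following sketch shows the effect on a stand-in model; the small `nn.Linear` layer and the random input are placeholders and not part of the notebook.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # placeholder for the actual Transformer model
x = torch.randn(3, 4)     # placeholder input batch

model.eval()

# Regular forward pass: the output is attached to the computation graph
out_with_grad = model(x)

# Forward pass without gradient tracking, as one might do inside evaluate()
with torch.no_grad():
    out_no_grad = model(x)

print(out_with_grad.requires_grad, out_no_grad.requires_grad)
```

The computed values are identical in both cases; only the bookkeeping for backpropagation is skipped.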

Now we have all the ingredients to train our model. Let's do it!




### Define Model

Let's now define the parameters of our model and instantiate it. Below, we also define our loss function, which is the cross-entropy loss, and the optimizer used for training. Of course, we do not want to compute the loss w.r.t. any padding tokens -- for example, see the last 4 `<PAD>` tokens of the decoder output in the figure above. While we could manually consider only the non-padding tokens of the decoder output, PyTorch allows us to specify which token index to ignore when defining the loss function. This means we can tell the loss function to ignore the losses for each padding token using `ignore_index=PAD_IDX`.

In [17]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

# Create model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

# Initialize weights
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Move model to device (ideally GPU, otherwise CPU)
transformer = transformer.to(DEVICE)

# Define loss function
criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Define optimizer
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
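
The effect of `ignore_index` can be verified with a small experiment: the loss computed with `ignore_index=PAD_IDX` should equal the loss computed manually over the non-padding positions only. The logits, the targets, and the value `PAD_IDX = 1` below are made up for this sketch.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed padding index

torch.manual_seed(0)
logits = torch.randn(4, 5)                        # 4 positions, toy vocabulary of size 5
targets = torch.tensor([2, 4, PAD_IDX, PAD_IDX])  # the last 2 positions are padding

# Loss that skips all positions whose target is PAD_IDX
loss_ignore = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits, targets)

# Manual alternative: keep only the first 2 (non-padding) positions
loss_manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(loss_ignore.item(), loss_manual.item())
```

With the default mean reduction, the loss is averaged over the non-ignored positions only, so padding neither contributes to the loss nor dilutes the average.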

### Training the Model

In the code cell below, we perform the actual training by calling the method `train_epoch()` `NUM_EPOCHS` times. After each epoch we also call the `evaluate()` method to evaluate the current state of the model using the validation data. This gives us 2 losses after each epoch, the training loss and the validation loss, which are both displayed via a print statement together with the measured epoch time (note that the timer in the code below wraps only the call of `train_epoch()`). This also means that you will see 2 progress bars for each epoch, one for the training and one for the validation. If you want to train the model further after the code cell has completed, you can simply run the same code cell again.

In [18]:
NUM_EPOCHS = 20

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer, criterion)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time (total) = {(end_time - start_time):.3f}s"))

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 227/227 [00:30<00:00,  7.40it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 15.88it/s]


Epoch: 1, Train loss: 5.341, Val loss: 4.109, Epoch time (total) = 47.239s


100%|██████████████████████████████████████████████████████████████████████████████████████████████| 227/227 [00:23<00:00,  9.64it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 15.87it/s]


Epoch: 2, Train loss: 3.763, Val loss: 3.326, Epoch time (total) = 40.089s


100%|██████████████████████████████████████████████████████████████████████████████████████████████| 227/227 [00:23<00:00,  9.66it/s]

KeyboardInterrupt



### Testing the Model

Seeing the validation loss go down is of course great, but at the end of the day, we actually want to use our model to translate sentences (here, from German to English). For this, the code cell below implements the method `translate()`, which takes the trained model and an input sentence as a string and uses the model to translate the sentence. The actual translation -- that is, the generation of the output sentence in the target language -- is done using the method `greedy_decode()`. As the name suggests, this method performs Greedy Decoding. This means that for each next word to be generated, the method always chooses the one with the highest probability.

**Side note:** Recall that a more sophisticated alternative to Greedy Decoding is Beam Search. Instead of picking only the one word with the highest probability in each step, Beam Search keeps multiple candidate sequences (say, 10) in each step and finally returns the candidate with the highest overall probability. Greedy Decoding does not guarantee to find the most probable sentence; Beam Search typically finds a more probable one, although it offers no guarantee either. Depending on the exact implementation or parametrization of Beam Search, the same input sentence might be translated in (slightly) different ways.

In [None]:
# Method to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# Actual method to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens+5, start_symbol=SOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<SOS>", "").replace("<EOS>", "")
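
For comparison with `greedy_decode()`, the idea of Beam Search mentioned in the side note can be sketched in plain Python. To keep the example self-contained, the trained Transformer is replaced by a hypothetical toy "model" (`TOY_MODEL`, a hand-written table of next-token probabilities); all names and numbers are invented for illustration, and refinements such as length normalization are omitted.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities
TOY_MODEL = {
    ("<SOS>",): {"a": 0.6, "b": 0.4},
    ("<SOS>", "a"): {"x": 0.55, "y": 0.45},
    ("<SOS>", "b"): {"z": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    # Any prefix not listed in the table is followed by <EOS> with probability 1
    probs = TOY_MODEL.get(prefix, {"<EOS>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(beam_width, max_len=5):
    beams = [(("<SOS>",), 0.0)]  # list of (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<EOS>":
                candidates.append((seq, score))  # finished sequences are kept as-is
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the beam_width most probable candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<EOS>" for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_score = beam_search(beam_width=1)  # equivalent to Greedy Decoding
beam_seq, beam_score = beam_search(beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With a beam width of 1, the search commits to `a` in the first step because it is locally the most probable token. With a beam width of 2, the initially less probable `b` is kept alive and leads to a sequence with a higher overall probability.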

Let's translate a couple of sentences to see how well our model performs. Keep in mind the nature of the Multi30k dataset: it consists of image captions/descriptions, i.e., typically short and simple sentences.

In [None]:
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

In [None]:
print(translate(transformer, "Ein Koch in weißer Uniform bereitet Essen in einer Restaurantküche zu ."))

In [None]:
print(translate(transformer, "Zwei junge Mädchen spielen Fußball auf einem Feld ."))

In [None]:
print(translate(transformer, "Eine Frau mit Hut und Sonnenbrille steht am Strand ."))

In [None]:
print(translate(transformer, "Zwei Freunde lachen und genießen ein Eis auf einer wunderschönen Wiese ."))

---

## Discussion

After completing the notebook, training a Transformer for machine translation might seem rather easy to do. After all, it does not take much training to see good translation results. However, this is a bit misleading insofar as we used only an arguably small dataset with generally simple sentences. In practice, training Transformers for machine translation poses several challenges. Here are some of the key difficulties:

* **Data scarcity:** Training effective machine translation models requires large amounts of high-quality bilingual data. Acquiring such data can be expensive and time-consuming, particularly for low-resource language pairs. Limited data can lead to overfitting and poor generalization.

* **Tokenization and subword units:** Transformers operate on sequences of discrete tokens taken from a fixed vocabulary. Deciding how to tokenize and represent words and phrases is crucial. Languages with complex morphology, agglutination, or a lack of clear word boundaries pose challenges in determining appropriate subword units. Deciding on an effective tokenization strategy is essential for achieving accurate translations.

* **Vocabulary mismatch:** Machine translation models are sensitive to the vocabulary mismatch between the source and target languages. Words and phrases that exist in one language may not have direct equivalents in the other. Handling out-of-vocabulary (OOV) words or rare words that may appear during inference is a challenge. Subword-based approaches partially alleviate this issue, but it is not entirely eliminated.

* **Rare and ambiguous translations:** Some translations are rare, context-dependent, or highly ambiguous. For example, the same source phrase can have multiple valid translations depending on the context. Capturing the correct meaning and selecting appropriate translations require the model to understand the source sentence's semantics and the target language's nuances.

* **Long-range dependencies:** Translations often require understanding long-range dependencies between words or phrases. Traditional recurrent neural networks (RNNs) struggle to capture these dependencies efficiently. Transformers address this issue by employing self-attention mechanisms that allow capturing dependencies across long distances, but training them effectively can still be challenging.

* **Biases and cultural differences:** Machine translation models can unintentionally amplify biases present in the training data, leading to biased translations. Addressing biases and ensuring fair and culturally sensitive translations is an ongoing challenge.

* **Computational resources:** Training Transformers for machine translation is computationally intensive. Transformers have a large number of parameters and require substantial computational resources, including powerful GPUs or specialized hardware, to train effectively. Training massive-scale models further amplifies the resource requirements.
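To illustrate the points on subword units and out-of-vocabulary words, here is a simplified sketch of a greedy longest-match segmentation. The function `segment()` and the tiny vocabulary are hypothetical; practical subword tokenizers (e.g., based on byte-pair encoding) learn their vocabulary from data and handle many more details.

```python
def segment(word, vocab):
    """Split a word into the longest matching subword units, from left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<UNK>"]  # no known piece matches at this position
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary
vocab = {"restaurant", "küche", "sonnen", "brille"}

print(segment("restaurantküche", vocab))
print(segment("sonnenbrille", vocab))
print(segment("iglu", vocab))
```

Compound words that are not in the vocabulary as a whole can still be represented by known pieces, while a word without any matching piece falls back to `<UNK>`.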

Overcoming these challenges requires a combination of large-scale, high-quality training data, careful preprocessing and tokenization, effective modeling techniques, architecture modifications, regularization methods, and continuous research and development efforts.

---

## Summary

Machine translation using the Transformer architecture has revolutionized the field by achieving state-of-the-art results. The Transformer is a neural network architecture that eliminates the need for recurrent or convolutional layers, making it highly parallelizable and efficient. It relies on a self-attention mechanism to capture dependencies between words in a sentence, allowing it to handle long-range relationships effectively. In a nutshell, the Transformer architecture for machine translation consists of an encoder and a decoder. The encoder processes the input sentence in the source language, generating a rich representation of the sentence. The decoder then uses this representation to generate the translation in the target language. Both the encoder and decoder are composed of multiple layers of self-attention and feed-forward neural networks.

The attention mechanism in the Transformer enables the model to attend to different positions in the input sentence, as well as to the previously generated target words, while generating the translation. By attending to relevant words and their dependencies, the Transformer captures contextual information effectively. Additionally, the use of residual connections and layer normalization helps alleviate vanishing gradient problems and aids in smoother training. The attention weights are computed using "scaled dot-product attention", which allows the model to assign appropriate weights to different words in the input sequence. In this notebook, the model is trained with teacher forcing to minimize the cross-entropy loss between the predicted translation and the reference translation.

Machine translation with the Transformer architecture has demonstrated strong performance across many language pairs. Its ability to capture long-range dependencies, its support for parallel processing, and its effective attention mechanism make it a powerful tool in the field of machine translation, enabling accurate and fluent translations across different languages.