# Introduction to Natural Language Processing 2 Lab02

## Encoder-decoder model

Today you will implement a pretty decent machine translation model using the transformer and then compare it to what you could have, with the same amount of training time, with a combination of RNN + Attention.

###  Go through the pyTorch tutorial

To start with, just follow the pyTorch [language translation with nn.Transformer and torchtext tutorial](https://pytorch.org/tutorials/beginner/translation_transformer.html).

To make the code turn on Google Colab, you need to update the preinstalled version of spaCy and download the small German and English spaCy models. As pyTorch doesn't seem to maintain its tutorial with their most recent changes, you also need to install torchdata.
```
!pip install spacy sacrebleu torchdata -U
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
```

As the training takes time (~20min), you can start looking at the following steps while it finishes.

At training, you will encounter `TypeError: ZipperIterDataPipe instance doesn't have valid length` (pyTorch doesn't update their tutorials). A workaround can be found [here](https://github.com/pytorch/tutorials/issues/1868).

### **(5 points)** Decoding functions

The tutorial uses a greedy approach at decoding. Implement the following variations.
* (2 points) A top K sampling with and without temperature.
* (3 points) A beam search (from scratch).
* (1 point) Qualitatively compare a few (at least 3) translation samples for each approach (even the greedy one).

### **(2 points)** Compute the BLEU score of the model

Use the [sacreBLEU](https://github.com/mjpost/sacreBLEU) implementation to evaluate your model and quantitatively compare the 4 implemented decoding approaches. Explain what all the output values mean (when using the `corpus_score` function).

In the [python section](https://github.com/mjpost/sacrebleu#using-sacrebleu-from-python), you'll notice the library accepts more than just one possible translation as reference, but the given dataset only has one translation per sample.

Using the `translate` function provided in the tutorial is pretty slow, as it translate text by text. It's recommended you modify the function to accept a list of texts as input, and batch them for translations (also **bonus point**).

### **(Bonus)** Try with another language

Use the [Tatoeba dataset](https://huggingface.co/datasets/tatoeba) with the language pair of your choice to train the model again. Beware that the Multi30K dataset has 29K training sample and 1K test sample, while the Tatoeba dataset only has a training set (you'll have to split it yourself) and 262K sentence pairs for their English-French data. So maybe train on a sub-sample. As a suggestion, sort the sentences per size and only use the first 30K. 

* Extract data from the Tatoeba dataset.
* Train a model with it.
* Compute the BLEU score using sacreBLEU on left-out data.

## Going further

If you want to understand in-depth how the transformer model works, I recommend you check [The Annotated Tranformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) from HarvardNLP. This article helps you write your own transformer from scratch in pyTorch.

## Evaluation

The assignment will be evaluated on the following criteria

* A report answering the questions above, describing your technical choices, and analysing your results.
* The quality of your code (modularity, efficiency, comments, coding standards).

For coding standards, please respect the following guidelines
* Use [docstring](https://www.programiz.com/python-programming/docstrings) format to describe your functions and their arguments
* Use typing
* Have clear and verbatim variable names (not x, x1, x2, xx, another_x, ...)
* Make your results reproducible (force random seeds values)
* Don't hesitate commenting in details part of the code you consider complex or hard to read

Provide a `README.md` file with 
* A short description of the project
* A description of the file/module architecture

This part provides 7 points + 3 points on coding standards: naming, typing, comments, and docstring. You can earn extra points by answering the bonus questions, and by packaging your code in extra python files. At the end of the module, all project points are summed and projected on a grade between 0 and 16. The last 4 points can be earned by answering the bonus questions, and presenting a language.

All projects have to be send back at `marc.von-wyl` at `epita` dot `fr` before Thursday 17th of November 2022 at midnight. Thought is is advised to send them progressively.

## Imports

In [1]:
%matplotlib inline

In [None]:
!pip install -q gwpy

In [148]:
%%capture
!pip install spacy sacrebleu torchdata -U
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

In [120]:
import numpy as np
from sklearn.preprocessing import normalize
from typing import Tuple
from sacrebleu.metrics import BLEU
from sacrebleu.metrics.bleu import BLEUScore


# Language Translation with nn.Transformer and torchtext

This tutorial shows:
    - How to train a translation model from scratch using Transformer. 
    - Use tochtext library to access  [Multi30k](http://www.statmt.org/wmt16/multimodal-task.html#task1)_ dataset to train a German to English translation model.


## Data Sourcing and Processing

[torchtext library](https://pytorch.org/text/stable/)_ has utilities for creating datasets that can be easily
iterated through for the purposes of creating a language translation
model. In this example, we show how to use torchtext's inbuilt datasets, 
tokenize a raw text sentence, build vocabulary, and numericalize tokens into tensor. We will use
[Multi30k dataset from torchtext library](https://pytorch.org/text/stable/datasets.html#multi30k)_
that yields a pair of source-target raw sentences. 

To access torchtext datasets, please install torchdata following instructions at https://github.com/pytorch/data. 




In [149]:
%%capture
!pip install torchtext=="0.14.0"

In [5]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List


# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}


# Create source and target language tokenizer. Make sure to install the dependencies.
# pip install -U torchdata
# pip install -U spacy
# python -m spacy download en_core_web_sm
# python -m spacy download de_core_news_sm
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
 
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator 
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object 
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set UNK_IDX as the default index. This index is returned when the token is not found. 
# If not set, it throws RuntimeError when the queried token is not found in the Vocabulary. 
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  vocab_transform[ln].set_default_index(UNK_IDX)

## Seq2Seq Network using Transformer

Transformer is a Seq2Seq model introduced in [“Attention is all you
need”](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)_
paper for solving machine translation tasks. 
Below, we will create a Seq2Seq network that uses Transformer. The network
consists of three parts. First part is the embedding layer. This layer converts tensor of input indices
into corresponding tensor of input embeddings. These embedding are further augmented with positional
encodings to provide position information of input tokens to the model. The second part is the 
actual [Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)_ model. 
Finally, the output of Transformer model is passed through linear layer
that give un-normalized probabilities for each token in the target language. 




In [6]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network 
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, 
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

During training, we need a subsequent word mask that will prevent model to look into
the future words when making predictions. We will also need masks to hide
source and target padding tokens. Below, let's define a function that will take care of both. 




In [7]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

Let's now define the parameters of our model and instantiate the same. Below, we also 
define our loss function which is the cross-entropy loss and the optmizer used for training.




In [8]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, 
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

## Collation

As seen in the ``Data Sourcing and Processing`` section, our data iterator yields a pair of raw strings. 
We need to convert these string pairs into the batched tensors that can be processed by our ``Seq2Seq`` network 
defined previously. Below we define our collate function that convert batch of raw strings into batch tensors that
can be fed directly into our model.   




In [9]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]), 
                      torch.tensor(token_ids), 
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tesors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

Let's define training and evaluation loop that will be called for each 
epoch.




In [10]:
from torch.utils.data import DataLoader

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    
    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_dataloader))


def evaluate(model):
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)
        
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))

Now we have all the ingredients to train our model. Let's do it!




In [11]:
from timeit import default_timer as timer
NUM_EPOCHS = 18

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


# function to generate output sequence using greedy algorithm 
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

Epoch: 1, Train loss: 5.344, Val loss: 4.114, Epoch time = 44.824s
Epoch: 2, Train loss: 3.761, Val loss: 3.320, Epoch time = 41.092s
Epoch: 3, Train loss: 3.161, Val loss: 2.894, Epoch time = 42.189s
Epoch: 4, Train loss: 2.768, Val loss: 2.638, Epoch time = 43.626s
Epoch: 5, Train loss: 2.480, Val loss: 2.441, Epoch time = 41.943s
Epoch: 6, Train loss: 2.250, Val loss: 2.315, Epoch time = 42.334s
Epoch: 7, Train loss: 2.060, Val loss: 2.201, Epoch time = 43.644s
Epoch: 8, Train loss: 1.897, Val loss: 2.113, Epoch time = 41.912s
Epoch: 9, Train loss: 1.754, Val loss: 2.058, Epoch time = 42.058s
Epoch: 10, Train loss: 1.631, Val loss: 2.002, Epoch time = 41.924s
Epoch: 11, Train loss: 1.524, Val loss: 1.975, Epoch time = 42.257s
Epoch: 12, Train loss: 1.420, Val loss: 1.945, Epoch time = 43.132s
Epoch: 13, Train loss: 1.333, Val loss: 1.967, Epoch time = 41.861s
Epoch: 14, Train loss: 1.251, Val loss: 1.942, Epoch time = 42.154s
Epoch: 15, Train loss: 1.173, Val loss: 1.928, Epoch time

In [12]:
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

 A group of people standing in front of an igloo . 


## (5 points) Decoding functions

### Greedy decode (from tutorial)

The Greedy method (from the given tutorial) is pretty easy to understand and use. In essence, it is converting the text input (to be translated) into a sequence of tokens and iterates on them, simply finding at each iteration the single word that the transformer model found most probable.

Not much was needed for this function as it is from the given tutorial, but was very useful to use as a basis for the other decoding methods. The results will be analysed at the end of the process.

In [150]:
# Tutorial function to generate output sequence using greedy algorithm
def greedy_decode(model: torch.nn.Module, src: str, src_mask: torch.bool, max_len: int, start_symbol: int) -> torch.long:
    """
    Tutorial function to generate output sequence using greedy algorithm

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src (str): the text to translate
        src_mask (torch.bool): a mask that will be used to encode the src
        max_len (int): the maximum number of iterations for the main translation loop
        start_symbol (int): the first symbol that will be in the translation (BOS_IDX)

    Returns:
        torch.long: returns th translation in form of a torch tensor
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(
            DEVICE
        )
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# Tutorial actual function to translate input sentence into target language
def greedy_translate(model: torch.nn.Module, src_sentence: str) -> str:
    """
    Tutorial actual function to translate input sentence into target language

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src_sentence (str): the sentence to translate

    Returns:
        str: the readable text translation
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX
    ).flatten()
    return (
        " ".join(
            vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))
        )
        .replace("<bos>", "")
        .replace("<eos>", "")
    )


Testing the Greedy decode (tutorial) on a classic example with variations of its parameters.

In [152]:
# We now test our top-k to translate the following sentence:
to_translate1 = "Menschen stehen vor einem Iglu und Essen wird zubereitet"
to_translate2 = "Motorräder werden vor einem Parkplatz abgestellt"
print("Original phrases:")
print("1 ->", to_translate1)
print("2 ->", to_translate2)

# Greedy decode translation:
print("\nTranslated phrase with greedy decode:")
translated_phrase1 = greedy_translate(transformer, to_translate1)
print("1 ->" + translated_phrase1)
translated_phrase2 = greedy_translate(transformer, to_translate2)
print("2 ->" + translated_phrase2)

# Proper Google translation result:
print("\nGoogle Translate phrase:")
print("1 ->", "People are standing in front of an igloo and food is being prepared")
print("2 ->", "Motorbikes are parked in front of a parking lot")

Original phrases:
1 -> Menschen stehen vor einem Iglu und Essen wird zubereitet
2 -> Motorräder werden vor einem Parkplatz abgestellt

Translated phrase with greedy decode:
1 -> People stand in front of an igloo and food . 
2 -> motorcycles are being lifted by in front of a parking lot .

Google Translate phrase:
1 -> People are standing in front of an igloo and food is being prepared
2 -> Motorbikes are parked in front of a parking lot


### Result analysis:
We can see that this decoding function is providing a decent translation.

For this example, a part of the original first sentence is pretty much ignored ("is being prepared").

The second sentence also has a translation error: the motorbikes are supposed to be parked, not lifted as predicted.

### (2 points) A top K sampling with and without temperature

The Top-K method of decoding operates the same way as the greedy decoding except for a major difference: it is selecting the k most probable translations made by the model when given a token (at each iteration) and randomly chooses one of them, based on their normalized probabilities.

The k value can be adjusted to have a broader selection.

The Top-K can also be implemented with a temperature, a parameter ranging from 0.0 to 1.0, used to extrapolate the existing probabilities the colder the temperature is.

At temperature = 1.0, nothing will change, and close to 0.0 is will be similar to a greedy search (the most probable translation will be so maximized that it will dominate the others by far, becoming almost impossible to not be selected when extremely close to 0.0

The implementation difference for the Top-K with a temperature parameter is mainly:
- a random selection of a token translation based on the model's translation candidates probabilities after their normalization
- a function used to apply the given temperature to this list of probabilities, with a maths formula

In [133]:
# Function used to apply temperature on given probabilities
# Temperature ranges from 0.0 (excluded) to 1.0
def apply_temperature(probs: List[float], temperature: float) -> List[float]:
    """
    Function used to apply temperature on given probabilities

    Args:
        probs (List[float]): the probabilities we want to extrapolate by temperature
        temperature (float): used to extrapolate existing probabilities

    Returns:
        List[float]: the temperature extrapolated probabilities
    """
    exp_sum = sum([np.float_power(prob, 1.0 / temperature) for prob in probs])
    for i in range(len(probs)):
        probs[i] = np.float_power(probs[i], 1.0 / temperature) / exp_sum
    return probs


# Toolbox function used to convert tokens to readable text
def translate_tokens(tgt_tokens: torch.Tensor) -> str:
    """
    Toolbox function used to convert tokens to readable text

    Args:
        tgt_tokens (torch.Tensor): the tokens we want to translate

    Returns:
        str: the readable text translation
    """
    translation = (
        " ".join(
            vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))
        )
        .replace("<bos>", "")
        .replace("<eos>", "")
    )
    return translation


# Main topk loop function
def top_k_decode(model: torch.nn.Module, src: str, src_mask: torch.bool, max_len: int, start_symbol: int, top_k: int, temperature: float) -> torch.long:
    """
    Main topk loop function

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src (str): the text to translate
        src_mask (torch.bool): a mask that will be used to encode the src
        max_len (int): the maximum number of iterations for the main translation loop
        start_symbol (int): the first symbol that will be in the translation (BOS_IDX)
        top_k (int): used to select the best k translations.
        temperature (float): used to extrapolate existing probabilities.

    Returns:
        torch.long: the translation in form of a torch tensor
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(
            DEVICE
        )
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        probs, next_words = torch.topk(prob, top_k, dim=1)
        probs_list = probs.tolist()[0]
        next_words = next_words.tolist()[0]
        norm_probs = [float(i) / sum(next_words) for i in next_words]
        norm_probs = apply_temperature(norm_probs, temperature)
        next_word = np.random.choice(next_words, 15, norm_probs)[0]
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# Main topk decoding translation function
# This is the function to call to translate a given text
def translate_topk(model: torch.nn.Module, src_sentence: str, top_k: int=10, temperature: float=1.0) -> str:
    """
    Main topk decoding translation function

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src_sentence (str): the sentence to translate
        top_k (int, optional): used to select the best k translations. Defaults to 10.
        temperature (float, optional): used to extrapolate existing probabilities. Defaults to 1.0.

    Returns:
        str: the readable text translation
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = top_k_decode(
        model,
        src,
        src_mask,
        max_len=num_tokens + 5,
        start_symbol=BOS_IDX,
        top_k=top_k,
        temperature=temperature,
    ).flatten()
    return translate_tokens(tgt_tokens)


Testing the Top-K method on a classic example with variations of its parameters.

In [161]:
# We now test our top-k to translate the following sentence:
to_translate = "Menschen stehen vor einem Iglu und Essen wird zubereitet"
print("Original phrase:")
print(to_translate)

# Setting up the different temperatures we will be using for testing
temperature1 = 1.0
temperature2 = 0.5
temperature3 = 0.3
temperature4 = 0.01

# Setting up the random seed for reproductible results
np.random.seed(6464)

# Top-k without temperature and k = 1
top_k = 1
print("\nTranslated phrase with k =", top_k, "(same as greedy decode):")
translated_phrase = translate_topk(transformer, to_translate, top_k)
print("->" + translated_phrase)

# Top-k with a temperature of 1.0 
top_k = 10
print("\nTranslated phrase with k =", top_k, "and various temperatures:")
print("Temperature =", temperature1)
translated_phrase = translate_topk(transformer, to_translate, top_k, temperature1)
print("->" + translated_phrase)
# Top-k with a temperature of 0.5
print("Temperature =", temperature2)
translated_phrase = translate_topk(transformer, to_translate, top_k, temperature2)
print("->" + translated_phrase)
# Top-k with a temperature of 0.3
print("Temperature =", temperature3)
translated_phrase = translate_topk(transformer, to_translate, top_k, temperature3)
print("->" + translated_phrase)
# Top-k with a temperature of 0.01
print("Temperature =", temperature4)
translated_phrase = translate_topk(transformer, to_translate, top_k, temperature4)
print("->" + translated_phrase)

# Proper Google translation result:
print("\nGoogle Translate phrase:")
print("People are standing in front of an igloo and food is being prepared")


Original phrase:
Menschen stehen vor einem Iglu und Essen wird zubereitet

Translated phrase with k = 1 (same as greedy decode):
-> People stand in front of an igloo and food . 

Translated phrase with k = 10 and various temperatures:
Temperature = 1.0
-> Several persons perform before prepare food in an igloo machine is getting cooking . up
Temperature = 0.5
-> Groups of individuals , in Hawaiian shirts while doing preparing prepare meal and eating 
Temperature = 0.3
-> There , some other individuals standing , before some food stand cooking on . and
Temperature = 0.01
-> Groups pf persons stand , including people getting meat while in the front setting watches

Google Translate phrase:
People are standing in front of an igloo and food is being prepared


### Result analysis:
This time, the results are much less accurate. 

Is is difficult to determine whether the temperature has a positive effect or not.

As expected, a top-k of 1 is the same as the greedy decode.

### (3 points) A beam search (from scratch).

The Beam Search method has a great difference with the 3 previous ones, and is considered as one of the most powerful decoding method.

The reason for that is that is runs a number (beam width) of parallel translations that will each be updated at each iteration (choosing another number of "beam" possible translations for them) and keeping only the "beam" best translations.

This allows the method to reconsider it's best translations at each iteration by taking the context into account, for a single word can change the meaning of the sentence we are translating at any moment.

This implementation was focused on:
- making a function sorting the "beam" best new translations from a given list of translations (like a top-k)
- getting the "beam" best possible translations from a given token, a process applied to each "beam" previous (ongoing) translations, which are in essence the best current candidates at a given moment 

In [57]:
# Return the k best ys from a given list of ys and their probabilities
# This function is used to isolate the current "beam width" best ys translations
def get_k_best_ys(ys_list: List, probs: List[float], k: int) -> Tuple[List, List[float]]:
    """
    Finds the k best ys from a given list of ys and their probabilities

    Args:
        ys_list (List): given list of translations as torch tensors
        probs (List[float]): given list of probabilities of the ys_list (translations)
        k (int): the number of best translations we want

    Returns:
        Tuple[List, List[float]]: the best translation candidates, their probabilities
    """
    best_ys = []
    best_probs = []
    for i in range(len(probs)):
        if len(best_ys) < k:
            best_ys.append(ys_list[i])
            best_probs.append(probs[i])
        else:
            min_index = best_probs.index(min(best_probs))
            best_ys[min_index] = ys_list[i]
            best_probs[min_index] = probs[i]
    return best_ys, best_probs


# Return the top-k best next translations (given ys + new word) by decoding
# k is the beam_width in this situation
def get_k_best_translation_candidates(ys: torch.long, k: int, memory: torch.tensor, model: torch.nn.Module, src: str) -> Tuple[List, List[float], List[int]]:
    """
    Used to get the top-k best next translations (given ys + new word) by decoding

    Args:
        ys (torch.long): the current translation as a torch tensor
        k (int): the number of best translations we want
        memory (torch.tensor): the memory of the translation, used to avoid word by word translation (without context)
        model (torch.nn.Module): the transformer used to predict a translation
        src (str): the text to translate

    Returns:
        Tuple[List, List[float], List[int]]: the translation current word candidates, their probabilities, the associated words
    """
    memory = memory.to(DEVICE)
    tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(DEVICE)
    out = model.decode(ys, memory, tgt_mask)
    out = out.transpose(0, 1)
    prob = model.generator(out[:, -1])
    probs, next_words = torch.topk(prob, k, dim=1)
    ys_candidates_probas = probs.tolist()[0]
    next_words = next_words.tolist()[0]

    ys_candidates = []
    for i in range(len(next_words)):
        next_word = next_words[i]
        new_ys = torch.cat(
            [ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0
        )
        ys_candidates.append(new_ys)
    return ys_candidates, ys_candidates_probas, next_words


# Main beam search loop function
def beam_search_decode(model: torch.nn.Module, src: str, src_mask: torch.bool, max_len: int, start_symbol: int, beam_width: int) -> torch.long:
    """
    Function used to generate output sequence using a beam search method

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src (str): the text to translate
        src_mask (torch.bool): a mask that will be used to encode the src
        max_len (int): the maximum number of iterations for the main translation loop
        start_symbol (int): the first symbol that will be in the translation (BOS_IDX)
        beam_width (int): the width of the beam search: number of parallel translations.

    Returns:
        torch.long: the best found translation in form of a torch tensor
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    memory = model.encode(src, src_mask)

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    beams_ys, beams_ys_probs, beam_words = get_k_best_translation_candidates(
        ys, beam_width, memory, model, src
    )

    for i in range(1, max_len - 1):
        ys_candidates = []
        ys_candidates_probas = []
        next_beam_words = []
        for beam_ys in beams_ys:
            new_ys, new_probas, tmp_beam_words = get_k_best_translation_candidates(
                beam_ys, beam_width, memory, model, src
            )
            ys_candidates += new_ys
            ys_candidates_probas += new_probas
            next_beam_words += beam_words

        beams_ys, beams_ys_probs = get_k_best_ys(
            ys_candidates, ys_candidates_probas, beam_width
        )

    max_index = beams_ys_probs.index(max(beams_ys_probs))
    return beams_ys[max_index]


# Toolbox function used to convert tokens to text while cutting at the first EOS
def translate_tokens_EOS(tgt_tokens: torch.Tensor) -> str:
    """
    Toolbox function used to convert tokens to text while cutting at the first EOS

    Args:
        tgt_tokens (torch.Tensor): the tokens we want to translate

    Returns:
        str: the readable text translation
    """
    translation = " ".join(
        vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))
    )
    eos_index = translation.index("<eos>")
    translation = translation[:eos_index].replace("<bos>", "").replace("<eos>", "")
    return translation


# Main beam search translation function
# This is the function to call to translate a given text
def translate_beamsearch(model: torch.nn.Module, src_sentence: str, beam_width: int=10) -> str:
    """
    Main beam search translation function

    Args:
        model (torch.nn.Module): the transformer used to predict a translation
        src_sentence (str): the sentence to translate
        beam_width (int, optional): the width of the beam search: number of parallel translations. Defaults to 10.

    Returns:
        str: the readable text translation
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = beam_search_decode(
        model,
        src,
        src_mask,
        max_len=num_tokens + 5,
        start_symbol=BOS_IDX,
        beam_width=beam_width,
    ).flatten()
    return translate_tokens_EOS(tgt_tokens)


Testing the Beam Search method on a classic example with variations of its parameters.

In [143]:
# We now test our Beam Search to translate the following sentence:
to_translate = "Menschen stehen vor einem Iglu und Essen wird zubereitet"
print("Original phrase:")
print(to_translate)

beam_width = 1
# Beam Search with a beam width of 1: 
print("\nTranslated phrase with Beam Width =", beam_width, ":")
translated_phrase = translate_beamsearch(transformer, to_translate, beam_width)
print("->" + translated_phrase)

beam_width = 5
# Beam Search with a beam width of 3:
print("\nTranslated phrase with Beam Width =", beam_width, ":")
translated_phrase = translate_beamsearch(transformer, to_translate, beam_width)
print("->" + translated_phrase)

beam_width = 10
# Beam Search with a beam width of 5:
print("\nTranslated phrase with Beam Width =", beam_width, ":")
translated_phrase = translate_beamsearch(transformer, to_translate, beam_width)
print("->" + translated_phrase)

# Proper Google translation result:
print("\nGoogle Translate phrase:")
print("People are standing in front of an igloo and food is being prepared")


Original phrase:
Menschen stehen vor einem Iglu und Essen wird zubereitet

Translated phrase with Beam Width = 1 :
-> People stand in front of an igloo and food . 

Translated phrase with Beam Width = 5 :
-> Several people stand ready in front of an igloo 

Translated phrase with Beam Width = 10 :
-> People stand in front of an igloo and cooking . 

Google Translate phrase:
People are standing in front of an igloo and food is being prepared


### Result analysis:
The Beam Search decoding seem much more efficient and accurate than the top-k. 

The result is even closer to the actual translation when the beam width is pretty large (10).

## SacreBLEU

The BLEU (bilingual evaluation understudy) scoring system, as a function, takes a list of translation hypotheses as input and compares them to the actual translation to give a score of accuracy.

The higher the score, the better the translation.

The implementation here is pretty simple, consisting only in a short sacreBleu scoring function (sacrebleu_score) and a pretty print function to make the result proper and readable.

This step is concluded by an analysis of the results, comparing the function's scores and the actual differences from a human perspective.

In [146]:
# Function that computes and returns a BLEU score by comparing the given predictions (hypotheses translations) with the references (actual translations)
def sacrebleu_score(hypotheses: List[str], references: List[str]) -> BLEUScore:
    """
    Function that computes and returns a BLEU score by comparing the given predictions (hypotheses translations) with the references (actual translations)

    Args:
        hypotheses (List[str]): the given predictions (hypotheses translations)
        references (List[str]): the references (actual translations)

    Returns:
        BLEUScore: the computed BLEU score
    """
    bleu = BLEU()
    result = bleu.corpus_score(hypotheses, references)
    return result


# Function used to pretty print the BLEU score of a decoding function
def pretty_print_bleu_score(hypotheses: List[str], references: List[str], decode_name: str) -> None:
    """
    Function used to pretty print the BLEU score of a decoding function

    Args:
        hypotheses (List[str]): the given predictions (hypotheses translations) that we will print
        references (List[str]): the references (actual translations) used to compute the BLEU score
        decode_name (str): the name of the decoding function to pretty print the results as a readable output
    """
    BLEU_score = sacrebleu_score(hypotheses, references)
    print("\n->", decode_name, "hypotheses:")
    print(hypotheses)
    print("    + Score BLEU =", BLEU_score.score)
    print("    + BLEU score details:", BLEU_score)


In [162]:
# Sentences that we want to translate to find the BLEU score of the 4 decoding approaches
to_be_translated = ["Menschen stehen vor einem Iglu und Essen wird zubereitet",
                    "Motorräder werden vor einem Parkplatz abgestellt",
                    "Autos sind auf der Autobahn",
                    "Der Sänger beginnt für das Publikum zu singen"]
# The Google Translate references (actual translations) of the previous sentences, used as references for the BLEU score
references = ["People are standing in front of an igloo and food is being prepared",
              "Motorbikes are parked in front of a parking lot",
              "Cars are on the highway",
              "The singer begins to sing for the audience"]

# Printing the sentences we want to translate and their translations for result clarity
print("\nSentences to translate:")
print(to_be_translated)
print("\nREFERENCES: Google Translation:")
print(references)

# 
hypotheses_greedy = [greedy_translate(transformer, sentence) for sentence in to_be_translated]
pretty_print_bleu_score(hypotheses_greedy, references, "Greedy")

hypotheses_topk = [translate_topk(transformer, sentence, top_k=5, temperature=1.0) for sentence in to_be_translated]
pretty_print_bleu_score(hypotheses_topk, references, "Top-k without temperature")

hypotheses_topk_temperature = [translate_topk(transformer, sentence, top_k=5, temperature=0.5) for sentence in to_be_translated]
pretty_print_bleu_score(hypotheses_topk_temperature, references, "Top-k with temperature")

hypotheses_beamsearch = [translate_beamsearch(transformer, sentence, beam_width=10) for sentence in to_be_translated]
pretty_print_bleu_score(hypotheses_beamsearch, references, "Beam Search")



Sentences to translate:
['Menschen stehen vor einem Iglu und Essen wird zubereitet', 'Motorräder werden vor einem Parkplatz abgestellt', 'Autos sind auf der Autobahn', 'Der Sänger beginnt für das Publikum zu singen']

REFERENCES: Google Translation:
['People are standing in front of an igloo and food is being prepared', 'Motorbikes are parked in front of a parking lot', 'Cars are on the highway', 'The singer begins to sing for the audience']

-> Greedy hypotheses:
[' People stand in front of an igloo and food . ', ' motorcycles are being lifted by in front of a parking lot .', ' cars on the highway . ', ' The singer is trying to sing the audience . ']
    + Score BLEU = 1.1919267551142738
    + BLEU score details: BLEU = 1.19 2.8/1.6/0.9/0.5 (BP = 1.000 ratio = 9.000 hyp_len = 36 ref_len = 4)

-> Top-k without temperature hypotheses:
[' Some persons are in the igloo outside in a cemetery cooking . . outside ', ' motorbikes being played together outside in a crowded area car ', ' cars 

### Result analysis:
We can see that the BLEU score of the greedy and Beam Search are pretty close (both close to 1.20), while the top-k predictions somewhat lacking (around 0.80).

However, when we actually read the translations as humans, the Beam Search method makes more contextual sense as compared to the greedy one, and is more easily understandable, even if the wording is somewhat different from the original sentences.

The BLEU score is a good indicator, but it seems a human reader could interpret the results differently.

## References

1. Attention is all you need paper.
   https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. The annotated transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding

