# Neural Machine Translation
New version: Jetic Gu

Original: https://github.com/A-Jacobson/minimal-nmt


This is an updated Python notebook to help you with the Neural Machine Translation homework.
The original notebook, written by Jacobson, uses the legacy torchtext API, which is no longer supported. This new version has been tested under Ubuntu 22.04 LTS with PyTorch 2.1 and torchtext 0.16.0, the last actively developed version of torchtext.

Note that the current state of the art most likely uses a Transformer-based architecture. However, knowledge of the traditional RNN-based attention mechanism will still help you understand multi-head attention better.

Also note that the attention implementation provided in this tutorial will not work as the solution to this homework. It should nevertheless help you understand how attention can be implemented as part of a sequence-to-sequence model.

In [1]:
import math
import torch
import random
import tqdm
from torch import nn
from torch.autograd import Variable
from torch.optim import Adam
import torchtext
from torchtext.datasets import multi30k, Multi30k
from torch.nn.utils import clip_grad_norm
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import spacy

## Convenience Functions

In [2]:
def sequence_to_text(sequence, lex):
    """
    Use a vocab object to convert a sequence of indices
    back to the original words.
    """
    # To convert a token to its index, index the vocab directly as if it
    # were a dict, e.g. lex['<pad>'].
    # To convert indices back to words, use the list returned by get_itos().
    itos = lex.get_itos()
    return " ".join(itos[int(i)] for i in sequence)

## Load Multi30k English/German parallel corpus for NMT, Build Lexicon
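In this version, spaCy handles tokenization, torchtext provides the dataset and the vocabulary (including the special tokens), and padding and batching are done with PyTorch utilities further down.

This is gonna take a while. Stand by, go get a coffee or something.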

In [3]:
import spacy

# loading tokeniser
spacy_de_tokeniser = spacy.load("de_core_news_sm")
spacy_en_tokeniser = spacy.load("en_core_web_sm")

def tokenise_de(text):
    return [tok.text for tok in spacy_de_tokeniser(text.lower())]

def tokenise_en(text):
    return [tok.text for tok in spacy_en_tokeniser(text.lower())]

In [4]:
# Define special symbols and indices
pad_idx, sos_idx, eos_idx, unk_idx = 0, 1, 2, 3
special_symbols = ['<pad>', '<sos>', '<eos>', '<unk>']

def load_dataset():
    # We need to modify the URLs for the dataset since the links to the original dataset are broken
    # Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
    multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
    multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
    
    # Load data, perform tokenisation
    print("Performing tokenisation on training set")
    train_list = list(Multi30k(split='train', language_pair=("de", "en")))
    train_list = [(tokenise_de(f), tokenise_en(e)) for f, e in tqdm.tqdm(train_list)]
    
    print("Performing tokenisation on validation set")
    valid_list = list(Multi30k(split='valid', language_pair=("de", "en")))
    valid_list = [(tokenise_de(f), tokenise_en(e)) for f, e in tqdm.tqdm(valid_list)]

    print("Building source lexicon")
    src_lex = torchtext.vocab.build_vocab_from_iterator([f for f, e in train_list],
                                                        min_freq=1,
                                                        specials=special_symbols,
                                                        special_first=True)

    tgt_lex = torchtext.vocab.build_vocab_from_iterator([e for f, e in train_list],
                                                        min_freq=1,
                                                        specials=special_symbols,
                                                        special_first=True)
    
    # Set ``unk_idx`` as the default index. This index is returned when the token is not found.
    # If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
    src_lex.set_default_index(unk_idx)
    tgt_lex.set_default_index(unk_idx)
    print("Moving on")

    return train_list, valid_list, src_lex, tgt_lex

train_list, valid_list, src_lex, tgt_lex = load_dataset()

Performing tokenisation on training set
100%|██████████| 29001/29001 [04:26<00:00, 108.68it/s]
Performing tokenisation on validation set
100%|██████████| 1015/1015 [00:10<00:00, 96.65it/s]
Building source lexicon
Moving on

## Model Inputs
Model inputs are (seq_len, batch_size) Tensors of word indices.

Creating batches with PyTorch 2.1 differs from the legacy approach based on `BucketIterator`. Instead, we write a batch sampler that creates the batches for an epoch, and a collate function that converts the text of each batch into tensors.
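
As a minimal sketch of the padding step (the index values are made up and not taken from the real lexicon), `pad_sequence` stacks variable-length 1-D tensors into a single `(max_len, batch)` tensor, which is the `(seq_len, batch_size)` layout mentioned above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # assumed padding index, matching pad_idx in this notebook

# Three toy "sentences" of different lengths (made-up indices).
toy_sentences = [
    torch.tensor([1, 7, 8, 9, 2]),
    torch.tensor([1, 5, 2]),
    torch.tensor([1, 4, 6, 2]),
]

# batch_first defaults to False, so the result is (max_len, batch).
toy_batch = pad_sequence(toy_sentences, padding_value=PAD)
print(toy_batch.shape)   # torch.Size([5, 3])
print(toy_batch[:, 1])   # tensor([1, 5, 2, 0, 0])
```

Each column is one sentence, and shorter sentences are filled with the padding index at the end.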

In [5]:
def collate_batch(batch, src_lex, tgt_lex):
    # This function is called when a new batch is loaded into the trainer.
    # It converts the original text into tensors using the constructed lexicons.
    source, target = [], []
    for f, e in batch:
        # Appending sos and eos special tokens
        source.append(torch.tensor([src_lex[f_tok] for f_tok in ['<sos>'] + f + ['<eos>']]))
        target.append(torch.tensor([tgt_lex[e_tok] for e_tok in ['<sos>'] + e + ['<eos>']]))

    # Padding makes all sequences in the batch the same length and stacks them into one (seq_len, batch) tensor.
    source = pad_sequence(source, padding_value=pad_idx)
    target = pad_sequence(target, padding_value=pad_idx)
    return source, target


def batch_sampler():
    # Shuffles the dataset, sorts each pool of batch_size * 100 examples by target length,
    # splits the pools into batches, then shuffles the batches.
    # This function is meant to be called once per epoch.
    indices = [(i, len(e)) for i, (f, e) in enumerate(train_list)]
    random.shuffle(indices)
    pooled_indices = []
    # create pools of indices with similar lengths
    for i in range(0, len(indices), batch_size * 100):
        pooled_indices.extend(sorted(indices[i:i + batch_size * 100], key=lambda x: x[1]))

    pooled_indices = [x[0] for x in pooled_indices]
    pools = []

    # collect the indices of each batch
    for i in range(0, len(pooled_indices), batch_size):
        pools.append(pooled_indices[i:i + batch_size])
    random.shuffle(pools)
    return pools
    
batch_size = 5
train_dl = \
    DataLoader(train_list, batch_sampler=batch_sampler(),
               collate_fn=lambda batch:collate_batch(batch, src_lex, tgt_lex))

example_batch = next(iter(train_dl))
example_batch

(tensor([[    1,     1,     1,     1,     1],
         [    8,     5,     5,     5,  1548],
         [ 1423,   493, 16106,    32,    48],
         [    9,    10,    95,    83,     5],
         [    5,     5,    42,    11,   224],
         [ 1335,    49,     5,  2328,   236],
         [   10,    74,   348,   995,     8],
         [    5,    21,    11,    42, 15165],
         [ 1299,     6,  3330,     5,    21],
         [  469,   104,     4,   135,    15],
         [   12,     7,     2,     4,   311],
         [   14, 18346,     0,     2,   151],
         [  136,  2686,     0,     0,   362],
         [   12,   117,     0,     0,     4],
         [    4,     4,     0,     0,     2],
         [    2,     2,     0,     0,     0]]),
 tensor([[   1,    1,    1,    1,    1],
         [   4,   21,    4,    4,  176],
         [ 183,  387, 1588,   35,   10],
         [ 781,   11,  705,   10,    4],
         [  11,   55,  107,   79,  159],
         [ 161,   17,   75,   60,  258],
         [1371, 

We can recover the original text by looking up each index in the vocabularies we built with the `load_dataset` function.

The `<pad>` tokens only fill shorter sequences up to the length of the batch. The trainer should ignore them when calculating the loss for the whole batch.
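
A small sketch of how padding can be excluded from the loss, assuming the padding index is 0 as in `pad_idx` (the logits and the toy vocabulary of 6 are made up): `F.cross_entropy` skips every position whose target equals `ignore_index`.

```python
import torch
import torch.nn.functional as F

PAD = 0  # assumed padding index, matching pad_idx in this notebook

torch.manual_seed(0)
logits = torch.randn(4, 6)                 # 4 target positions, toy vocabulary of 6
targets = torch.tensor([3, 5, PAD, PAD])   # the last two positions are padding

# Positions whose target equals ignore_index are dropped from the mean.
masked_loss = F.cross_entropy(logits, targets, ignore_index=PAD)

# The same value computed by hand over the two real positions only.
manual_loss = F.cross_entropy(logits[:2], targets[:2])

print(torch.allclose(masked_loss, manual_loss))  # True
```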

In [6]:
example_batch_f, example_batch_e = example_batch
print(sequence_to_text(example_batch_f[:, 0], src_lex))
print(sequence_to_text(example_batch_e[:, 0], tgt_lex))

<sos> eine sängerin , ein gitarrist und ein schlagzeuger treten auf einer bühne auf . <eos>
<sos> a female singer and male guitarist and drummer perform on a stage . <eos>


## Architecture 
NMT uses an encoder-decoder architecture to translate source sequences into target sequences that may have a different length.
![img](assets/encoder-decoder.png)

## Encoder
The encoder maps each word of the source sequence to a `hidden_dim`-dimensional feature vector, sometimes called an annotation. It also returns the final hidden state of the bidirectional encoder RNN.

In [7]:
class Encoder(nn.Module):
    def __init__(self, source_vocab_size, embed_dim, hidden_dim,
                 n_layers, dropout):
        super(Encoder, self).__init__()
        self.hidden_dim = hidden_dim
        # the padding index is pad_idx (0); index 1 is the <sos> token
        self.embed = nn.Embedding(source_vocab_size, embed_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(embed_dim, hidden_dim, n_layers,
                          dropout=dropout, bidirectional=True)

    def forward(self, source, hidden=None):
        embedded = self.embed(source)  # (seq_len, batch, embed_dim)
        encoder_out, encoder_hidden = self.gru(
            embedded, hidden)  # encoder_out: (seq_len, batch, hidden_dim*2)
        # sum bidirectional outputs, the other option is to retain concat features
        encoder_out = (encoder_out[:, :, :self.hidden_dim] +
                       encoder_out[:, :, self.hidden_dim:])
        return encoder_out, encoder_hidden

In [8]:
embed_dim = 256
hidden_dim = 512
n_layers = 2
dropout = 0.5

In [9]:
encoder = Encoder(source_vocab_size=len(src_lex), embed_dim=embed_dim,
                  hidden_dim=hidden_dim, n_layers=n_layers, dropout=dropout)

In [10]:
encoder_out, encoder_hidden = encoder(example_batch_f)
print('encoder output size: ', encoder_out.size())  # (source_len, batch_size, hidden_dim)
print('encoder hidden size: ', encoder_hidden.size())  # (n_layers * num_directions, batch_size, hidden_dim)

encoder output size:  torch.Size([16, 5, 512])
encoder hidden size:  torch.Size([4, 5, 512])


## Attention
In this example batch, `encoder_out` is a length-16 sequence and the target is a length-15 sequence. We need to compress the information in `encoder_out` into a context vector that holds all the information the decoder needs to predict the next step of its output. We will use Luong attention to create this context vector.
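
For reference, the "general" scoring variant from the Luong et al. paper, which is what the class below appears to implement, can be written as follows. The symbols are assumptions for this note: $h_t$ is the decoder hidden state, $\bar{h}_s$ is the encoder output at source position $s$, and $W_a$ is the learned matrix (`self.W` in the code).

```latex
\begin{aligned}
\mathrm{score}(h_t, \bar{h}_s) &= h_t^{\top} W_a \bar{h}_s \\
a_t(s) &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \\
c_t &= \sum_{s} a_t(s)\, \bar{h}_s
\end{aligned}
```

Here $a_t(s)$ corresponds to the returned `mask` and $c_t$ to the returned `context`.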

In [11]:
class LuongAttention(nn.Module):
    """
    LuongAttention from Effective Approaches to Attention-based Neural Machine Translation
    https://arxiv.org/pdf/1508.04025.pdf
    """

    def __init__(self, dim):
        super(LuongAttention, self).__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def score(self, decoder_hidden, encoder_out):
        # linear transform of encoder_out: (src_len, batch, dim)
        encoder_out = self.W(encoder_out)
        # reorder to (batch, src_len, dim)
        encoder_out = encoder_out.permute(1, 0, 2)
        # (batch, src_len, dim) @ (batch, dim, 1) -> (batch, src_len, 1)
        return encoder_out @ decoder_hidden.permute(1, 2, 0)

    def forward(self, decoder_hidden, encoder_out):
        energies = self.score(decoder_hidden, encoder_out)
        mask = F.softmax(energies, dim=1)  # (batch, src_len, 1), normalised over src_len
        # (batch, dim, src_len) @ (batch, src_len, 1) -> (batch, dim, 1)
        context = encoder_out.permute(1, 2, 0) @ mask
        context = context.permute(2, 0, 1)  # (1, batch, dim)
        mask = mask.permute(2, 0, 1)  # (1, batch, src_len)
        return context, mask

Attention is normally part of the decoder, since it takes the previous decoder hidden state as input, but I use it on its own here to show its inputs and outputs.

We will initialize the decoder RNN's hidden state with the last hidden state from the encoder. Because the encoder is bidirectional, its hidden state has `n_layers * 2` entries along the first dimension, so we slice it to select the layers we want.
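
To make the slicing less opaque, here is a small sketch (toy sizes, not the notebook's) of how PyTorch lays out the hidden state of a multi-layer bidirectional GRU. The first dimension can be split into a layer axis and a direction axis.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden, layers, batch, seq = 4, 2, 3, 5   # toy sizes for illustration

gru = nn.GRU(input_size=6, hidden_size=hidden, num_layers=layers,
             bidirectional=True)
out, h_n = gru(torch.randn(seq, batch, 6))

print(out.shape)   # torch.Size([5, 3, 8])  -> (seq, batch, 2 * hidden)
print(h_n.shape)   # torch.Size([4, 3, 4])  -> (layers * 2, batch, hidden)

# Separate the layer axis from the direction axis.
per_layer = h_n.reshape(layers, 2, batch, hidden)
top_forward, top_backward = per_layer[-1, 0], per_layer[-1, 1]

# The forward direction finishes at the last time step,
# the backward direction at the first one.
print(torch.allclose(top_forward, out[-1, :, :hidden]))   # True
print(torch.allclose(top_backward, out[0, :, hidden:]))   # True
```

Under this layout, `encoder_hidden[-1:]` is the backward state of the top layer, and with `n_layers = 2` the slice `encoder_hidden[-2:]` picks both directions of the top encoder layer rather than one state per layer.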

In [12]:
attention = LuongAttention(dim=hidden_dim)
context, mask = attention(encoder_hidden[-1:], encoder_out)
print(context.size())  # (1, batch, attention_dim) context vector
print(mask.size())  # (1, batch, source_len) weights used to compute the weighted sum over encoder_out

torch.Size([1, 5, 512])
torch.Size([1, 5, 16])


## Decoder with attention

In [13]:
class Decoder(nn.Module):
    def __init__(self, target_vocab_size, embed_dim, hidden_dim,
                 n_layers, dropout):
        super(Decoder, self).__init__()
        self.n_layers = n_layers
        # the padding index is pad_idx (0); index 1 is the <sos> token
        self.embed = nn.Embedding(target_vocab_size, embed_dim, padding_idx=pad_idx)
        self.attention = LuongAttention(hidden_dim)
        self.gru = nn.GRU(embed_dim + hidden_dim, hidden_dim, n_layers,
                          dropout=dropout)
        self.out = nn.Linear(hidden_dim * 2, target_vocab_size)

    def forward(self, output, encoder_out, decoder_hidden):
        """
        decodes one output frame
        """
        embedded = self.embed(output)  # (1, batch, embed_dim)
        # attend with the top layer's hidden state; context is (1, batch, hidden_dim)
        context, mask = self.attention(decoder_hidden[-1:], encoder_out)
        rnn_output, decoder_hidden = self.gru(torch.cat([embedded, context], dim=2),
                                              decoder_hidden)
        output = self.out(torch.cat([rnn_output, context], 2))
        return output, decoder_hidden, mask

In [14]:
decoder = Decoder(target_vocab_size=len(tgt_lex), embed_dim=embed_dim,
                  hidden_dim=hidden_dim, n_layers=n_layers, dropout=dropout)

To translate one word from German to English, the decoder needs:
1. `encoder_outputs`
2. `decoder_hidden`: initially the last `n_layers` entries of `encoder_hidden`, then its own returned hidden state.
3. `previous_output`: feed a batch of start-of-sequence tokens (`sos_idx`, 1) at the first step.

The attention mask that the decoder returns is not used in training but can be used to visualize where the decoder is "looking" in the input sequence in order to generate its current output.
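
As a rough illustration of reading such a mask without plotting, the sketch below prints, for each target word, the source word that received the largest attention weight. The words and weights are made up, with one row per target step and one column per source word. For the `(batch, src, trg)` tensor returned by the decoding helpers later on, this would correspond to a single sentence's matrix transposed.

```python
# Hypothetical attention weights for one sentence pair:
# one row per generated target word, one column per source word.
source_words = ["ein", "hund", "läuft", "."]
target_words = ["a", "dog", "runs", "."]
weights = [
    [0.70, 0.15, 0.10, 0.05],
    [0.10, 0.75, 0.10, 0.05],
    [0.05, 0.20, 0.70, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]

alignment = []
for target_word, row in zip(target_words, weights):
    # Index of the source word with the largest weight for this step.
    best = max(range(len(row)), key=row.__getitem__)
    alignment.append((target_word, source_words[best]))
    print(f"{target_word:>5} <- {source_words[best]}")
```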

In [15]:
decoder_hidden = encoder_hidden[-decoder.n_layers:]
start_token = example_batch_e[:1]
start_token

tensor([[1, 1, 1, 1, 1]])

In [16]:
output, decoder_hidden, mask = decoder(start_token, encoder_out, decoder_hidden)

In [17]:
print('output size: ', output.size())  # (1, batch, target_vocab) # predicted probability distribution over all possible target words
print('decoder hidden size ', decoder_hidden.size())
print('attention mask size', mask.size())

output size:  torch.Size([1, 5, 9795])
decoder hidden size  torch.Size([2, 5, 512])
attention mask size torch.Size([1, 5, 16])


## Decoding Helpers
NMT models use teacher forcing during training and greedy decoding or beam search for inference. To accommodate these behaviors, I've made simple helper classes that get output from the decoder using each policy.
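
Beam search is not implemented in this notebook. As a rough sketch of how it differs from greedy decoding, the toy example below uses a hypothetical lookup table (`TOY_MODEL`) in place of a real decoder and runs for a fixed number of steps without end-of-sequence handling. A beam width of 1 is equivalent to greedy decoding.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
# A real decoder would compute these from its hidden state instead.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def beam_search(model, steps, beam_width):
    """Keep the `beam_width` best prefixes (by summed log-probability) per step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = [
            (prefix + (token,), score + math.log(prob))
            for prefix, score in beams
            for token, prob in model[prefix].items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_score = beam_search(TOY_MODEL, steps=2, beam_width=1)
beam_seq, beam_score = beam_search(TOY_MODEL, steps=2, beam_width=2)

print(greedy_seq, round(math.exp(greedy_score), 2))  # ('a', 'x') 0.3
print(beam_seq, round(math.exp(beam_score), 2))      # ('b', 'z') 0.36
```

Greedy decoding commits to `a` because it is the best first token, while a beam of width 2 keeps `b` alive and finds the more probable full sequence.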

The Teacher class sometimes feeds the previous target to the decoder rather than the model's previous prediction. This can help speed up convergence but requires the targets to be loaded into the helper at each step.

In [18]:
class Teacher:
    def __init__(self, teacher_forcing_ratio=0.5):
        self.teacher_forcing_ratio = teacher_forcing_ratio
        self.targets = None
        self.maxlen = 0
        
    def load_targets(self, targets):
        self.targets = targets
        self.maxlen = len(targets)

    def generate(self, decoder, encoder_out, encoder_hidden):
        outputs = []
        masks = []
        decoder_hidden = encoder_hidden[-decoder.n_layers:]  # take what we need from encoder
        output = self.targets[0].unsqueeze(0)  # start token
        for t in range(1, self.maxlen):
            output, decoder_hidden, mask = decoder(output, encoder_out, decoder_hidden)
            outputs.append(output)
            masks.append(mask.detach())
            # feed the model's own prediction back in by default
            output = output.detach().argmax(dim=2)
            # teacher forcing
            is_teacher = random.random() < self.teacher_forcing_ratio
            if is_teacher:
                output = self.targets[t].unsqueeze(0)      
        return torch.cat(outputs), torch.cat(masks).permute(1, 2, 0)  # batch, src, trg

In [19]:
decode_helper = Teacher()
decode_helper.load_targets(example_batch_e)
outputs, masks = decode_helper.generate(decoder, encoder_out, encoder_hidden)

## Calc loss
Reshape the outputs and targets, and ignore the sos token at the start of the target batch.

In [20]:
# pad_idx (0) marks the padding positions that the loss should skip
F.cross_entropy(outputs.view(-1, outputs.size(2)),
                example_batch_e[1:].view(-1), ignore_index=pad_idx)

tensor(9.1786, grad_fn=<NllLossBackward0>)

The greedy decoder simply chooses the highest scoring word as output.
We can use the `set_maxlen` method to generate sequences of the same length as our targets, which makes it easy to check perplexity and BLEU score during evaluation steps.
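
Perplexity is the exponential of the mean cross-entropy. As a quick sanity check on the loss values printed in this notebook, the sketch below compares the first reported loss with the cross-entropy of a model that spreads probability uniformly over the target vocabulary (the two numbers are copied from the outputs above).

```python
import math

vocab_size = 9795        # target vocabulary size reported in the outputs above
reported_loss = 9.1786   # cross-entropy printed for the teacher-forced example

# A model that spreads probability uniformly over the vocabulary has
# cross-entropy ln(V) and perplexity V.
uniform_loss = math.log(vocab_size)
perplexity = math.exp(reported_loss)

print(round(uniform_loss, 4))   # 9.1896
print(perplexity < vocab_size)  # True
```

An untrained model scoring close to `ln(vocab_size)` is expected. The loss should fall well below that once training starts.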

In [21]:
class Greedy:
    def __init__(self, maxlen=20, sos_index=sos_idx):  # sos_idx is 1; index 2 is <eos>
        self.maxlen = maxlen
        self.sos_index = sos_index
        
    def set_maxlen(self, maxlen):
        self.maxlen = maxlen
        
    def generate(self, decoder, encoder_out, encoder_hidden):
        seq, batch, _ = encoder_out.size()
        outputs = []
        masks = []
        decoder_hidden = encoder_hidden[-decoder.n_layers:]  # take what we need from encoder
        # batch of start tokens, created on the same device as the encoder output
        output = torch.full((1, batch), self.sos_index, dtype=torch.long,
                            device=encoder_out.device)
        for t in range(self.maxlen):
            output, decoder_hidden, mask = decoder(output, encoder_out, decoder_hidden)
            outputs.append(output)
            masks.append(mask.detach())
            # greedy choice: feed the highest scoring word back in
            output = output.detach().argmax(dim=2)
        return torch.cat(outputs), torch.cat(masks).permute(1, 2, 0)  # batch, src, trg     

In [22]:
decode_helper = Greedy()
decode_helper.set_maxlen(len(example_batch_e[1:]))
outputs, masks = decode_helper.generate(decoder, encoder_out, encoder_hidden)

In [23]:
outputs.size()

torch.Size([14, 5, 9795])

In [24]:
# pad_idx (0) marks the padding positions that the loss should skip
F.cross_entropy(outputs.view(-1, outputs.size(2)),
                example_batch_e[1:].view(-1), ignore_index=pad_idx)

tensor(9.1879, grad_fn=<NllLossBackward0>)

## seq2seq wrapper

In [25]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, decoding_helper):
        encoder_out, encoder_hidden = self.encoder(source)
        outputs, masks = decoding_helper.generate(self.decoder, encoder_out, encoder_hidden)
        return outputs, masks

In [26]:
seq2seq = Seq2Seq(encoder, decoder)
decoding_helper = Teacher(teacher_forcing_ratio=0.5)


## example iteration with wrapper

In [27]:
decoding_helper.load_targets(example_batch_e)
# use the Teacher helper that was just loaded with targets
outputs, masks = seq2seq(example_batch_f, decoding_helper)

In [28]:
outputs.size(), masks.size()

(torch.Size([14, 5, 9795]), torch.Size([5, 16, 14]))

In [31]:
# pad_idx (0) marks the padding positions that the loss should skip
F.cross_entropy(outputs.view(-1, outputs.size(2)),
                example_batch_e[1:].view(-1), ignore_index=pad_idx)

tensor(9.1940, grad_fn=<NllLossBackward0>)