In [None]:
!pip install torchdata torchtext==0.13.1



# Practice: Sequence to Sequence for Neural Machine Translation.

_Reference: this notebook is based on an [open-source implementation](https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb) of seq2seq NMT in PyTorch._

We are going to implement the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. 

The model will be trained for German to English translations, but it can be applied to any problem that involves going from one sequence to another, such as summarization.


## Introduction

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which often use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a *context vector*. You can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq1.png)

The above image shows an example translation. 

## Attention

![](https://miro.medium.com/max/1400/1*BzhKcJJxv974OxWOVqUuQQ.webp)


![](https://miro.medium.com/max/882/1*4pWBd6sgTnr0YieOyRSZVQ.webp)


In [None]:
import torch
import torch.nn.functional as F
from torch import nn

In [None]:
def scaled_dot_product_attention(query, key, value):
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = F.softmax(temp / scale, dim=-1)
    return softmax, softmax.bmm(value)

In [None]:
class AttentionLayer(nn.Module):
    def __init__(self, dim_q, dim_k):
        super().__init__()
        self.q = nn.Linear(dim_q, dim_k)
        self.k = nn.Linear(dim_k, dim_k)

    def forward(self, query, key, value):
        return scaled_dot_product_attention(self.q(query), self.k(key), value)

In [None]:
qr = torch.rand([1, 256, 512]) 
kr = torch.rand([31, 256, 512])
vr = torch.rand([31, 256, 512])
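
As a rough sanity check, the sketch below pushes tensors with the same shapes as `qr`, `kr` and `vr` through scaled dot-product attention. It assumes those tensors are laid out as `[seq_len, batch_size, dim]`, so they are transposed to batch-first before `bmm` is applied; the function is repeated here only to keep the snippet self-contained.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value):
    # query: [batch, q_len, dim], key/value: [batch, k_len, dim]
    scores = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    weights = F.softmax(scores / scale, dim=-1)
    return weights, weights.bmm(value)


torch.manual_seed(0)
# Assumed layout: [seq_len, batch_size, dim] -> transpose to batch-first.
query = torch.rand(1, 256, 512).transpose(0, 1)   # [256, 1, 512]
key = torch.rand(31, 256, 512).transpose(0, 1)    # [256, 31, 512]
value = torch.rand(31, 256, 512).transpose(0, 1)  # [256, 31, 512]

weights, context = scaled_dot_product_attention(query, key, value)
print(weights.shape)  # torch.Size([256, 1, 31])
print(context.shape)  # torch.Size([256, 1, 512])
```

Each row of `weights` is a probability distribution over the 31 key positions, and `context` is the corresponding weighted average of the values.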

## Preparing Data

We'll be using data provided by [torchtext](https://pytorch.org/text/stable/) and coding the models in PyTorch. We'll also be using [nltk](https://www.nltk.org) to assist with the tokenization.

First of all, let's load the data. We will be using the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. We will train a model to translate sentences from German into English.

In [None]:
from torchtext.datasets import Multi30k


train_iter = Multi30k(split="train")

# torchtext datasets are exhaustible iterables: once consumed, they are empty.
# To be able to index and re-iterate the data, we convert it to a list.
train_data = list(train_iter)

print(f"Number of training examples: {len(train_data)}")
print(train_data[0])

Number of training examples: 29001
('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.', 'Two young, White males are outside near many bushes.')


As we can see, the dataset provides us with pairs of sentences. However, working with whole sentences is not convenient, so we will use a tokenizer. A tokenizer turns a string containing a sentence into a list of the individual tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. From now on we'll talk about sentences being sequences of tokens rather than sequences of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is a token, not a word.

Just like in the previous practice, we'll use the `WordPunctTokenizer` from the `nltk` library.

In [None]:
from nltk.tokenize import WordPunctTokenizer


tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("good morning!"))

['good', 'morning', '!']
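
To get a feel for what this kind of tokenizer does, here is a minimal standard-library sketch. It assumes the splitting rule is "runs of word characters, or runs of non-space punctuation"; the helper name `word_punct_tokenize` is made up for illustration and is not part of `nltk`.

```python
import re

# Assumed rule: a token is either a run of word characters
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return WORD_PUNCT.findall(text)


tokens = word_punct_tokenize("a woman's shoulders.")
print(tokens)  # ['a', 'woman', "'", 's', 'shoulders', '.']
```

Note how the apostrophe becomes a token of its own, which is why a possessive such as "woman's" is split into three tokens.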


Sentences in our dataset are slightly more complicated, but nothing nltk cannot handle. Before tokenization, however, it is important to lowercase the data, and to strip the trailing `\n` from each sentence whilst we are at it. This yields the following data processing pipeline:

In [None]:
src, trg = train_data[0]
print(tokenizer.tokenize(trg.rstrip().lower()))

['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


We will use this pipeline a lot, so let's pack it into a function.

In [None]:
def tokenize(sent):
    return tokenizer.tokenize(sent.rstrip().lower())


print(tokenize(src))
print(tokenize(trg))

['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.']
['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


In the paper we are implementing, the authors find it beneficial to reverse the order of the input, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". This means that we need to slightly modify the previous code:

In [None]:
print(tokenize(src)[::-1])
print(tokenize(trg))

['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei']
['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


Next, we need to build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer), and this index is used to build a one-hot encoding for each token (a vector of all zeros except for the position represented by the index, which is 1). The vocabularies of the source and target languages are distinct. It is important to note that your vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into your model, which would give you artificially inflated validation/test scores.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary.

In [None]:
from collections import Counter

from torchtext.vocab import vocab as Vocab


src_counter = Counter()
trg_counter = Counter()
for src, trg in train_data:
    src_counter.update(tokenize(src))
    trg_counter.update(tokenize(trg))

src_vocab = Vocab(src_counter, min_freq=2)
trg_vocab = Vocab(trg_counter, min_freq=2)

Tokens that appear only once (or do not appear in the training data at all) should be converted into an `<unk>` (unknown) token. We can achieve this by adding it to our vocabularies and making its index the default one.

In [None]:
unk_token = "<unk>"

for vocab in [src_vocab, trg_vocab]:
    if unk_token not in vocab:
        vocab.insert_token(unk_token, index=0)
    # Out-of-vocabulary lookups now return the index of <unk>.
    vocab.set_default_index(vocab[unk_token])
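
The combined effect of `min_freq` and a default `<unk>` index can be sketched with plain Python dictionaries. This is an illustration of the idea on a toy corpus, not of how torchtext implements it; the names `itos`, `stoi` and `lookup` are chosen here for the example.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration).
corpus = [["a", "man", "sleeps"], ["a", "man", "runs"]]
counter = Counter(tok for sent in corpus for tok in sent)

min_freq = 2
# Index 0 is reserved for <unk>; rare tokens are left out of the vocabulary.
itos = ["<unk>"] + [tok for tok, freq in counter.items() if freq >= min_freq]
stoi = {tok: idx for idx, tok in enumerate(itos)}


def lookup(token):
    """Return the token index, falling back to the <unk> index."""
    return stoi.get(token, stoi["<unk>"])


ids = [lookup(tok) for tok in ["a", "man", "sleeps", "dog"]]
print(ids)  # [1, 2, 0, 0]
```

Both the rare token "sleeps" and the unseen token "dog" map to index 0, the `<unk>` token.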

Other special tokens we want to have in our vocabularies are the `<sos>` (start of sequence), `<eos>` (end of sequence) and `<pad>` (padding) tokens.

In [None]:
sos_token, eos_token, pad_token = "<sos>", "<eos>", "<pad>"
specials = [sos_token, eos_token, pad_token]
for vocab in [src_vocab, trg_vocab]:
    for token in specials:
        if token not in vocab:
            vocab.append_token(token)

Let's check the sizes of our vocabularies:

In [None]:
print(f"Unique tokens in source (de) vocabulary: {len(src_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(trg_vocab)}")

Unique tokens in source (de) vocabulary: 7892
Unique tokens in target (en) vocabulary: 5903


Now we can encode our tokenized sentences (i.e. convert them into sequences of token indices) as follows:

In [None]:
# Tokenize sentence and add <sos> and <eos> special tokens.
tokenized = [sos_token] + tokenize(trg) + [eos_token]

# Transform tokens into indices using our vocab.
encoded = [trg_vocab[tok] for tok in tokenized]

[(tok, idx) for tok, idx in zip(tokenized, encoded)]

[('<sos>', 5900), ('<eos>', 5901)]

Just like before with `tokenize`, let's pack it into a neat little function:

In [None]:
def encode(sent, vocab):
    tokenized = [sos_token] + tokenize(sent) + [eos_token]
    return [vocab[tok] for tok in tokenized]


# Note that here we committed a little crime: after [::-1] the <sos> token in
# the src sequence has moved to the end, whilst <eos> has become the first token.
# However, it is not that big a problem: <sos> and <eos> tokens only have
# special meaning for us; for the model they are just some tokens until it has
# been trained to use them. For this reason it doesn't care about the actual
# names of the start and end tokens. And in the trg sequence everything is fine,
# so we do not end up starting a translation with the <eos> token or anything.
print(encode(src, src_vocab)[::-1])
print(encode(trg, trg_vocab))

[7890, 7889]
[5900, 5901]
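
The effect of reversing an already encoded source sequence is easier to see on a toy example. The sketch below uses a hand-written dictionary in place of the real vocabulary; the mapping `toy_stoi` and the helper `encode_tokens` are assumptions made for illustration.

```python
# Hypothetical toy vocabulary (token -> index).
toy_stoi = {"<unk>": 0, "zwei": 1, "männer": 2, "<sos>": 3, "<eos>": 4}


def encode_tokens(tokens, stoi):
    """Wrap tokens in <sos>/<eos> and map them to indices."""
    wrapped = ["<sos>"] + tokens + ["<eos>"]
    return [stoi.get(tok, stoi["<unk>"]) for tok in wrapped]


encoded_src = encode_tokens(["zwei", "männer"], toy_stoi)
reversed_src = encoded_src[::-1]

print(encoded_src)   # [3, 1, 2, 4]
print(reversed_src)  # [4, 2, 1, 3]
```

After reversal the sequence starts with the `<eos>` index and ends with the `<sos>` index, which is exactly the situation described in the comment above.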


Now we know how to preprocess our input and output sentences into a NN-readable format. The last thing we need to do is to create a `DataLoader` for our data, which will take our sentences and put them together to form a batch. The problem here is that sentences can have different lengths, whereas all items in one batch must have the same length. To solve this we use padding (remember the `<pad>` token). Back in version `0.8`, torchtext used to provide its own custom classes which handled tokenization, the `<sos>`, `<eos>` and `<unk>` tokens, and padding. Since then, torchtext has dropped this functionality in order to make its data loading API consistent with PyTorch's. This in turn means that we have to handle padding (as well as everything else mentioned) ourselves. Luckily, PyTorch's `DataLoader` supports custom collate functions, which is a convenient place to do all our preprocessing, including padding.

In [None]:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_batch(batch):
    src_list, trg_list = [], []
    for src, trg in batch:
        src_encoded = encode(src, src_vocab)[::-1]
        src_list.append(torch.tensor(src_encoded))

        trg_encoded = encode(trg, trg_vocab)
        trg_list.append(torch.tensor(trg_encoded))

    src_padded = pad_sequence(src_list, padding_value=src_vocab[pad_token])
    trg_padded = pad_sequence(trg_list, padding_value=trg_vocab[pad_token])

    return src_padded, trg_padded


batch_size = 128
train_dataloader = DataLoader(train_data, batch_size, shuffle=True, collate_fn=collate_batch)
src_batch, trg_batch = next(iter(train_dataloader))
src_batch.shape, trg_batch.shape

(torch.Size([29, 128]), torch.Size([29, 128]))
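
To see what `pad_sequence` does in isolation, here is a small sketch with two toy sequences. The index values and the choice of 0 as the padding index are assumptions for the example only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "sentences" of different lengths (indices are made up).
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# By default the result has shape [max_seq_len, batch_size].
padded = pad_sequence(seqs, padding_value=0)

print(padded.shape)     # torch.Size([3, 2])
print(padded.tolist())  # [[5, 8], [6, 9], [7, 0]]
```

Each column is one sequence; the shorter one is filled up with the padding value at the end.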

Cool! Now we can load our data and store it in batches. Whilst we are at it, let's create a dataloader for the validation set, which we will use to evaluate our model during training.

In [None]:
val_data = list(Multi30k(split="valid"))
val_dataloader = DataLoader(val_data, batch_size, collate_fn=collate_batch)

Note that the first dimension is now `seq_len`, not `batch_size` as it used to be. This is because in PyTorch, LSTM (and other recurrent units) by default expect input in the format `(seq_len, batch_size, input_size)`. Be careful with that (especially in your homework assignment).
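
The shape convention can be checked on a small randomly initialized LSTM. The sizes below are arbitrary values picked for the sketch.

```python
import torch
from torch import nn

seq_len, batch_size, emb_dim, hid_dim, n_layers = 7, 4, 16, 32, 2

# batch_first defaults to False, so input is [seq_len, batch_size, input_size].
rnn = nn.LSTM(emb_dim, hid_dim, n_layers)
x = torch.rand(seq_len, batch_size, emb_dim)

output, (h_n, c_n) = rnn(x)

print(output.shape)  # torch.Size([7, 4, 32])
print(h_n.shape)     # torch.Size([2, 4, 32])
print(c_n.shape)     # torch.Size([2, 4, 32])
```

`output` holds the top-layer hidden state for every time step, while `h_n` and `c_n` hold the final hidden and cell states of every layer.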

## Building the Seq2Seq Model

We'll be building our model in three parts: the encoder, the decoder, and a seq2seq model that encapsulates the encoder and decoder and provides a way to interface with each.

### Encoder

So our encoder looks something like this: 

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq2.png)



In [None]:
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, n_tokens, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.n_tokens = n_tokens
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # Define embedding, dropout and LSTM layers.
        self.embedding = nn.Embedding(n_tokens, emb_dim)
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)

    def forward(self, src):
        # src has a shape of [seq_len, batch_size]

        # Compute an embedding from src data and apply dropout.
        # embedded has a shape of [seq_len, batch_size, emb_dim]
        embedded = self.embedding(src)
        embedded = self.dropout(embedded)

        # output has a shape of [seq_len, batch_size, hid_dim];
        # hidden is a tuple (hidden state, cell state), each of shape
        # [n_layers, batch_size, hid_dim]
        output, hidden = self.rnn(embedded)

        return output, hidden

### Decoder

Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq3.png)


## Add attention

![](https://lena-voita.github.io/resources/lectures/seq2seq/attention/general_scheme-min.png)

In [None]:
class Decoder(nn.Module):
    def __init__(self, n_tokens, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.n_tokens = n_tokens
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(n_tokens, emb_dim)
        self.dropout = nn.Dropout(dropout)
        self.attention = AttentionLayer(hid_dim, hid_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.out = nn.Linear(hid_dim + hid_dim, n_tokens)

    def forward(self, encoder_output, input, hidden):
        # input has a shape of [batch_size]
        # hidden is a tuple of two tensors:
        # 1) hidden state
        # 2) cell state
        # both of shape [n_layers, batch_size, hid_dim]
        # (n_directions in the decoder shall always be 1)

        # Embed the input token and apply dropout. nn.LSTM expects input of
        # shape [seq_len, batch_size, emb_dim], so we first add a seq_len
        # dimension of size 1: [batch_size] -> [1, batch_size].
        input = input.unsqueeze(dim=0)
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)

        # output has a shape of [1, batch_size, hid_dim]
        output, hidden = self.rnn(embedded, hidden)

        # Attention works batch-first, so we transpose to
        # [batch_size, len, hid_dim]. The decoder output is the query;
        # the encoder outputs serve as both keys and values.
        attn_map, attn_applied = self.attention(
            output.transpose(0, 1),
            encoder_output.transpose(0, 1),
            encoder_output.transpose(0, 1),
        )

        # Back to [1, batch_size, hid_dim].
        attn_applied = attn_applied.transpose(0, 1)

        # Concatenate the RNN output with the attention context:
        # [1, batch_size, 2 * hid_dim]
        output = torch.cat((output, attn_applied), 2)

        pred = self.out(output.squeeze(dim=0))

        # pred should have a shape of [batch_size, n_tokens]
        return pred, hidden, attn_map
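
The transposes in the decoder are easier to follow with concrete shapes. The sketch below walks through a single decoding step on random tensors. As a simplification it applies plain scaled dot-product attention without the learned query/key projections of `AttentionLayer`, and all sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

batch_size, src_len, hid_dim = 4, 9, 32

# Stand-ins for the decoder RNN output (one step) and the encoder outputs.
decoder_output = torch.rand(1, batch_size, hid_dim)
encoder_output = torch.rand(src_len, batch_size, hid_dim)

# bmm needs batch-first tensors.
query = decoder_output.transpose(0, 1)  # [batch, 1, hid]
keys = encoder_output.transpose(0, 1)   # [batch, src_len, hid]

scores = query.bmm(keys.transpose(1, 2)) / hid_dim ** 0.5  # [batch, 1, src_len]
attn_map = F.softmax(scores, dim=-1)

# Weighted sum of encoder outputs, back to [1, batch, hid].
attn_applied = attn_map.bmm(keys).transpose(0, 1)

# Concatenate along the feature dimension: [1, batch, 2 * hid].
combined = torch.cat((decoder_output, attn_applied), dim=2)

print(attn_map.shape)  # torch.Size([4, 1, 9])
print(combined.shape)  # torch.Size([1, 4, 64])
```

This is why the output layer of the decoder takes `hid_dim + hid_dim` input features.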

### Seq2Seq



In [None]:
import random


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder

        assert encoder.hid_dim == decoder.hid_dim, "encoder and decoder must have same hidden dim"
        assert (
            encoder.n_layers == decoder.n_layers
        ), "encoder and decoder must have equal number of layers"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src has a shape of [src_seq_len, batch_size]
        # trg has a shape of [trg_seq_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing, e.g. if
        # teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        trg_len = trg.shape[0]

        # List of per-step decoder predictions; stacked into a tensor at the end.
        preds = []

        # The last hidden and cell states of the encoder are used as the
        # initial state of the decoder; the per-token encoder outputs are
        # what the decoder attends over.
        output, hidden = self.encoder(src)
        
        # First input to the decoder is the <sos> token.
        input = trg[0, :]

        for i in range(1, trg_len):
            pred, hidden, attn_map = self.decoder(output, input, hidden)
            preds.append(pred)
            teacher_force = random.random() < teacher_forcing_ratio
            _, top_pred = pred.max(dim=1)
            input = trg[i, :] if teacher_force else top_pred

        return torch.stack(preds)
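
The teacher forcing decision is a single coin flip per decoding step. A minimal standard-library sketch of that choice is shown below; the helper `choose_next_input` and the placeholder strings are made up for illustration.

```python
import random


def choose_next_input(ground_truth, prediction, teacher_forcing_ratio, rng):
    """Pick the next decoder input: ground truth with the given probability."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return ground_truth if use_teacher else prediction


rng = random.Random(0)
n_steps = 10_000
choices = [choose_next_input("gt", "pred", 0.75, rng) for _ in range(n_steps)]

# Fraction of steps on which the ground-truth token was fed to the decoder.
teacher_fraction = choices.count("gt") / n_steps
print(f"{teacher_fraction:.2f}")
```

Over many steps the printed fraction should be close to the chosen ratio of 0.75. With a ratio of 0 the decoder always consumes its own predictions, which is the situation at inference time.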

## Training the Seq2Seq Model

Now that we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimensions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our `Seq2Seq` model, which we place on the `device`.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
enc = Encoder(len(src_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
dec = Decoder(len(trg_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
model = Seq2Seq(enc, dec).to(device)

We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use Adam.

In [None]:
optimizer = torch.optim.Adam(model.parameters())

In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=trg_vocab[pad_token])
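
The `ignore_index` argument makes the loss skip padded positions. The sketch below checks this on random logits, assuming for the example that the padding index is 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
pad_idx = 0  # assumed padding index for this toy example

logits = torch.randn(4, 5)                        # 4 positions, 5 classes
targets = torch.tensor([2, 3, pad_idx, pad_idx])  # last two are padding

masked_loss = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, targets)
# Reference: plain cross-entropy over the two non-padded positions only.
reference_loss = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.allclose(masked_loss, reference_loss))  # True
```

The two values agree because the mean is taken only over the targets that are not ignored.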

Finally, let's train our model. We will train it for 10 epochs, evaluating it after each epoch on the validation data. 

In [None]:
from torch.nn.utils import clip_grad_norm_
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm, trange


n_epochs = 10
clip = 1
global_step = 0  # for writer
for epoch in trange(n_epochs, desc="Epochs"):
    model.train()
    train_loss = 0
    for src, trg in tqdm(train_dataloader, desc="Train", leave=False):
        src, trg = src.to(device), trg.to(device)
        output = model(src, trg)

        output = output.view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)
        optimizer.zero_grad()
        
        loss.backward()

        clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        train_loss += loss.item()
        global_step += 1

    train_loss /= len(train_dataloader)

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for src, trg in tqdm(val_dataloader, desc="Val", leave=False):
            src, trg = src.to(device), trg.to(device)
            output = model(src, trg)

            output = output.view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)

            val_loss += loss.item()

    val_loss /= len(val_dataloader)
    print(f"[{epoch}] train/val loss= ({train_loss}, {val_loss})")

[0] train/val loss= (3.116147783884393, 2.8875158429145813)
[1] train/val loss= (2.8908050238823577, 2.703189641237259)
[2] train/val loss= (2.701564844484371, 2.6018706262111664)
[3] train/val loss= (2.5705591056840533, 2.627768784761429)
[4] train/val loss= (2.4210391858600837, 2.540013700723648)
[5] train/val loss= (2.3325257479881927, 2.5494285821914673)
[6] train/val loss= (2.2014537955170685, 2.502074509859085)

KeyboardInterrupt: ignored
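
The reshaping done inside the training loop can be checked on random data. The sketch below assumes, as in the model above, that predictions cover every target position except the first (`<sos>`); all sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
trg_len, batch_size, vocab_size = 6, 3, 11

trg = torch.randint(0, vocab_size, (trg_len, batch_size))
# One prediction per target position after <sos>: [trg_len - 1, batch, vocab].
output = torch.randn(trg_len - 1, batch_size, vocab_size)

# CrossEntropyLoss expects [N, n_classes] logits and [N] class indices.
flat_output = output.view(-1, vocab_size)
flat_trg = trg[1:].reshape(-1)

loss = nn.CrossEntropyLoss()(flat_output, flat_trg)

print(flat_output.shape)  # torch.Size([15, 11])
print(flat_trg.shape)     # torch.Size([15])
```

Dropping the first row of `trg` keeps the predictions and the targets aligned position by position.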

Now that we've trained our model, let's see how good it is at actual translation. Let's translate the first 10 examples in the validation dataset.

In [None]:
trg_itos = trg_vocab.get_itos()
model.eval()
max_len = 50
with torch.no_grad():
    for src, trg in val_data[:10]:
        encoded = encode(src, src_vocab)[::-1]
        encoded = torch.tensor(encoded)[:, None].to(device)
        output, hidden = model.encoder(encoded)

        pred_tokens = [trg_vocab[sos_token]]

        attn_maps = []

        for _ in range(max_len):
            decoder_input = torch.tensor([pred_tokens[-1]]).to(device)
            pred, hidden, attn_map = model.decoder(output, decoder_input, hidden)
            attn_maps.append(attn_map)
            _, pred_token = pred.max(dim=1)
            if pred_token == trg_vocab[eos_token]:
                # Don't add it to prediction for cleaner output.
                break

            pred_tokens.append(pred_token.item())

        attn_map = torch.cat(attn_maps, dim=1)
        print("shape: ", attn_map.shape)

        print(f"src: '{src.rstrip().lower()}'")
        print(f"trg: '{trg.rstrip().lower()}'")
        print(f"pred: '{' '.join(trg_itos[i] for i in pred_tokens[1:])}'")
        print()

shape:  torch.Size([1, 12, 11])
src: 'eine gruppe von männern lädt baumwolle auf einen lastwagen'
trg: 'a group of men are loading cotton onto a truck'
pred: 'a group of men are on a a a a .'

shape:  torch.Size([1, 12, 13])
src: 'ein mann schläft in einem grünen raum auf einem sofa.'
trg: 'a man sleeping in a green room on a couch.'
pred: 'a man sleeping on a couch in a green room .'

shape:  torch.Size([1, 14, 13])
src: 'ein junge mit kopfhörern sitzt auf den schultern einer frau.'
trg: 'a boy wearing headphones sits on a woman's shoulders.'
pred: 'a boy with a hair sitting on a woman ' s shoulders .'

shape:  torch.Size([1, 12, 13])
src: 'zwei männer bauen eine blaue eisfischerhütte auf einem zugefrorenen see auf'
trg: 'two men setting up a blue ice fishing hut on an iced over lake'
pred: 'two men are a blue inflatable structure on a lake .'

shape:  torch.Size([1, 18, 20])
src: 'ein mann mit beginnender glatze, der eine rote rettungsweste trägt, sitzt in einem kleinen boot.'
trg: '