# 3 - Neural Machine Translation by Jointly Learning to Align and Translate

In this third notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473). This model achieves our best perplexity yet, ~27 compared to ~34 for the previous model.

## Introduction

As a reminder, here is the general encoder-decoder model:

![](assets/seq2seq1.png)

In the previous model, our architecture was set-up in a way to reduce "information compression" by explicitly passing the context vector, $z$, to the decoder at every time-step and by passing both the context vector and embedded input word, $d(y_t)$, along with the hidden state, $s_t$, to the linear layer, $f$, to make a prediction.

![](assets/seq2seq7.png)

Even though we have reduced some of this compression, our context vector still needs to contain all of the information about the source sentence. The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses *attention*. 

Attention works by first calculating an attention vector, $a$, whose length equals that of the source sentence. Each element of the attention vector is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states, $H$, to get a weighted source vector, $w$. 

$$w = \sum_{i}a_ih_i$$

We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction. We'll explain how to do all of this during the tutorial.
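As a minimal sketch of the weighted sum above, the snippet below builds a toy attention vector with a softmax and applies it to random stand-in hidden states. The sizes and the `scores` tensor are assumptions chosen for illustration; in the real model the scores come from the attention layer described later.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

src_len, hid_dim = 5, 4
H = torch.randn(src_len, hid_dim)  # stand-in encoder hidden states h_1..h_5
scores = torch.randn(src_len)      # hypothetical unnormalized attention scores

a = F.softmax(scores, dim=0)           # every a_i lies in (0, 1) and they sum to 1
w = (a.unsqueeze(1) * H).sum(dim=0)    # w = sum_i a_i * h_i

print(a.sum())   # approximately 1
print(w.shape)   # one vector of size hid_dim
```

The weighted source vector has the same size as a single hidden state, no matter how long the source sentence is.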

## Install required libraries

We'll be coding up the models in PyTorch and using torchtext to help us do all of the pre-processing required. We'll also be using spaCy to assist in the tokenization of the data. Install spaCy tokenizers for English and German. 

In [None]:
!pip install -U torchdata
!pip install -U spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!wget "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz"
!wget "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz"
!wget "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz"


## Imports

Import everything we need to train the model.

In [None]:
from typing import List, Iterable

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k

import spacy
import numpy as np

import random
import math
import time

We'll set the random seeds for deterministic results.

In [None]:
SEED: int = 1234

def random_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

random_seed(SEED)

The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. Currently, the default URL for Multi30k is broken, so we have to replace it with temporary links to download the dataset. 

Next, we'll create the tokenizers. A tokenizer is used to turn a string containing a sentence into a list of individual tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. We'll start talking about the sentences being a sequence of tokens from now, instead of saying they're a sequence of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is a token, not a word. 
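To make the word/token distinction concrete, here is a toy regex tokenizer. It is only a hypothetical stand-in (the name `toy_tokenize` is made up for this sketch) and is far simpler than the spaCy tokenizers used in this notebook, but it reproduces the example above.

```python
import re

def toy_tokenize(text: str) -> list:
    # A run of word characters is one token; any other
    # non-whitespace character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("good morning!")
print(tokens)  # ['good', 'morning', '!']
```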

```
tokenizer_de = get_tokenizer("spacy", language="de_core_news_sm")
tokenizer_en = get_tokenizer("spacy", language="en_core_web_sm")
```
spaCy has a model for each language ("de_core_news_sm" for German and "en_core_web_sm" for English), which needs to be loaded so we can access the tokenizer of each model. 

In [None]:
SOURCE_LANGUAGE = 'de'
TARGET_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SOURCE_LANGUAGE] = get_tokenizer("spacy", language="de_core_news_sm")
token_transform[TARGET_LANGUAGE] = get_tokenizer("spacy", language="en_core_web_sm")


def yield_tokens(data_iter: Iterable, language: str):
    language_index = {SOURCE_LANGUAGE: 0, TARGET_LANGUAGE: 1}
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

Next, we'll build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the source and target languages are distinct.

`torchtext`'s `build_vocab_from_iterator` handles the creation of the vocabulary for us. We have to supply a generator function (`yield_tokens`) that tokenizes each sentence, and we add the special tokens with their corresponding indexes at the start of the vocabulary. Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token. The special tokens are: `<unk>`, an unknown token that is not in the vocabulary; `<pad>`, a padding token used to make the sentences in a batch equal length; `<bos>`, the beginning of the sentence; and `<eos>`, the end of the sentence.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, giving us artificially inflated validation/test scores.

In [None]:
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

In [None]:
for language in [SOURCE_LANGUAGE, TARGET_LANGUAGE]:
    train_iter = Multi30k(root="/content", split="train", language_pair=(SOURCE_LANGUAGE, TARGET_LANGUAGE))
    vocab_transform[language] = build_vocab_from_iterator(
        yield_tokens(train_iter, language),
        min_freq=2,
        specials=special_symbols,
        special_first=True
    )

for language in [SOURCE_LANGUAGE, TARGET_LANGUAGE]:
    vocab_transform[language].set_default_index(UNK_IDX)
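The effect of `min_freq` and the default `<unk>` index can be illustrated with a small pure-Python sketch. This is not how torchtext is implemented; the toy corpus, the ordering rule and the `numericalize` helper are assumptions made for the example.

```python
from collections import Counter

specials = ['<unk>', '<pad>', '<bos>', '<eos>']
corpus = [["a", "man", "runs"], ["a", "dog", "runs"], ["a", "man", "sits"]]

# Count every token in the (toy) training set.
counts = Counter(tok for sent in corpus for tok in sent)

# Keep tokens seen at least twice; order by frequency, then alphabetically.
kept = sorted((t for t, c in counts.items() if c >= 2),
              key=lambda t: (-counts[t], t))

itos = specials + kept                         # index -> token, specials first
stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index

def numericalize(tokens, unk_idx=0):
    # Tokens missing from the vocabulary fall back to the <unk> index.
    return [stoi.get(t, unk_idx) for t in tokens]

ids = numericalize(["a", "dog", "runs"])
print(ids)  # [4, 0, 6]
```

Here "dog" appears only once in the toy corpus, so it is left out of the vocabulary and maps to index 0 (`<unk>`).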

We also need to define a `torch.device`. This is used to decide whether the model and tensors are placed on the GPU or the CPU. We use the `torch.cuda.is_available()` function, which will return `True` if a GPU is detected on our computer. We pass this `device` to the model later on.


In [None]:
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Unlike previous versions, where the torchtext `Field` handled tokenization and added the special symbols for us, with `torchtext>=0.12.0` you have to build the text-processing pipeline yourself. `text_transform` is a dictionary containing all the necessary transformations for the source and target languages. First, we tokenize the sentence. Second, we convert the tokens into vocabulary indexes and add the special symbols `<bos>` and `<eos>` to mark the start and end of the sentence. Finally, we convert the result into a `torch.Tensor`.

In [None]:
def sequential_transforms(*transforms):
    def func(text_input: str):
        for transform in transforms:
            text_input = transform(text_input)
        return text_input
    return func

def to_tensor(token_ids: List[int]):
    return torch.cat((
        torch.tensor([BOS_IDX]),
        torch.tensor(token_ids),
        torch.tensor([EOS_IDX]),
    ))

text_transform = {}

for language in [SOURCE_LANGUAGE, TARGET_LANGUAGE]:
    text_transform[language] = sequential_transforms(
        token_transform[language],  # Tokenization
        vocab_transform[language],  # Numericalization
        to_tensor  # Add BOS/EOS and create tensor
    )

def collate_fn(batch):
    source_batch, target_batch = [], []
    for source_sample, target_sample in batch:
        source_batch.append(text_transform[SOURCE_LANGUAGE](source_sample.rstrip("\n")))
        target_batch.append(text_transform[TARGET_LANGUAGE](target_sample.rstrip("\n")))

    source_batch = pad_sequence(source_batch, batch_first=False, padding_value=PAD_IDX)
    target_batch = pad_sequence(target_batch, batch_first=False, padding_value=PAD_IDX)
    return {
        "input": source_batch,
        "target": target_batch,
    }

Select a batch size for our dataloaders and pass them the `collate_fn` function to pad our sequences. When we get a batch of examples we need to make sure that all of the source sentences are padded to the same length, and the same for the target sentences. The legacy torchtext iterators handled this for us; here, `collate_fn` uses `pad_sequence` to create the padded batches.


In [None]:
BATCH_SIZE = 64

train_iter = Multi30k(root="/content", split='train', language_pair=(SOURCE_LANGUAGE, TARGET_LANGUAGE))
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

valid_iter = Multi30k(root="/content", split='valid', language_pair=(SOURCE_LANGUAGE, TARGET_LANGUAGE))
valid_dataloader = DataLoader(valid_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

test_iter = Multi30k(root="/content", split='test', language_pair=(SOURCE_LANGUAGE, TARGET_LANGUAGE))
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
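To see what `pad_sequence` does inside `collate_fn`, here is a small standalone sketch. The token ids are made up; only `PAD_IDX = 1`, `BOS_IDX = 2` and `EOS_IDX = 3` follow the values defined earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1

# Two already-numericalized sentences of different lengths (<bos>=2, <eos>=3).
seqs = [torch.tensor([2, 5, 6, 7, 3]), torch.tensor([2, 8, 3])]

batch = pad_sequence(seqs, batch_first=False, padding_value=PAD_IDX)

print(batch.shape)   # torch.Size([5, 2]) -> [seq len, batch size]
print(batch[:, 1])   # tensor([2, 8, 3, 1, 1]) -> short sentence padded with <pad>
```

With `batch_first=False` the sequence dimension comes first, which matches the **[src len, batch size]** layout the encoder expects.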

## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single layer GRU, however we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer. A *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before. 

![](assets/seq2seq8.png)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$ and we can denote all encoder hidden states (forward and backwards concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
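The relationship between `outputs` and `hidden` can be checked on a toy bidirectional GRU. The sizes below are arbitrary assumptions and the input is random noise rather than real embeddings; the point is only to show where the final forward and backward states sit inside `outputs`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_len, batch_size, emb_dim, hid_dim = 6, 3, 8, 5
rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)

embedded = torch.randn(src_len, batch_size, emb_dim)  # stand-in for embeddings
outputs, hidden = rnn(embedded)

print(outputs.shape)  # torch.Size([6, 3, 10]) -> [src len, batch size, hid dim * 2]
print(hidden.shape)   # torch.Size([2, 3, 5])  -> [num directions, batch size, hid dim]

# Forward RNN: its final state is found at the LAST time-step, first hid_dim features.
fwd_final = outputs[-1, :, :hid_dim]
# Backward RNN: its final state is found at the FIRST time-step, last hid_dim features.
bwd_final = outputs[0, :, hid_dim:]

print(torch.allclose(fwd_final, hidden[-2]))  # True
print(torch.allclose(bwd_final, hidden[-1]))  # True
```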

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.

In [None]:
class Encoder(nn.Module):
    def __init__(
        self,
        input_dim: int,
        embed_dim: int,
        enc_hidden_dim: int,
        dec_hidden_dim: int,
        dropout: float
    ):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)

        self.rnn = nn.GRU(embed_dim, enc_hidden_dim, bidirectional=True)

        self.fc = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        #x = [src len, batch size]
        embeddings = self.embedding(x)
        #embeddings = [src len, batch size, emb dim]

        outputs, hidden = self.rnn(embeddings)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.cat((hidden[-2, :], hidden[-1, :]), dim=1)
        hidden = self.fc(hidden)
        hidden = torch.tanh(hidden)

        return outputs, hidden

In [None]:
class Attention(nn.Module):
    def __init__(self, enc_hidden_dim: int, dec_hidden_dim: int):
        super(Attention, self).__init__()

        self.attn = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]

        source_len = encoder_outputs.size(0)

        #repeat the decoder hidden state once per source position
        hidden = hidden.unsqueeze(1).repeat(1, source_len, 1)

        encoder_outputs = encoder_outputs.permute(1, 0, 2)

        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        attn_input = torch.cat((hidden, encoder_outputs), dim=2)
        #attn_input = [batch size, src len, enc hid dim * 2 + dec hid dim]
        energy = torch.tanh(self.attn(attn_input))
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        #attention= [batch size, src len]

        return F.softmax(attention, dim=1)
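Reading the `Attention` module above as equations (a sketch inferred from the code; the symbols $E_t$ and $\hat{a}_t$ are notation introduced here for convenience):

$$\begin{align*}
E_t &= \tanh(\text{attn}([s_{t-1}; H]))\\
\hat{a}_t &= v E_t\\
a_t &= \text{softmax}(\hat{a}_t)
\end{align*}$$

Here $s_{t-1}$ is the previous decoder hidden state, repeated once per source position so that it can be concatenated with every encoder hidden state in $H$, and $\text{attn}$ and $v$ are the two linear layers of the module. The softmax is taken over the source positions, which is what makes each element of $a_t$ lie between 0 and 1 and the whole vector sum to 1.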

In [None]:
class Decoder(nn.Module):
    def __init__(
        self,
        output_dim: int,
        embed_dim: int,
        enc_hidden_dim: int,
        dec_hidden_dim: int,
        dropout: float,
        attention
    ):
        super(Decoder, self).__init__()

        self.output_dim = output_dim
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.rnn = nn.GRU(enc_hidden_dim * 2 + embed_dim, dec_hidden_dim)

        self.fc = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim + embed_dim, output_dim)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input, hidden, encoder_outputs):
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]

        input = input.unsqueeze(0)
        #input = [1, batch size]

        embeddings = self.dropout(self.embedding(input))
        #embeddings = [1, batch size, emb dim]

        attention = self.attention(hidden, encoder_outputs)
        #attention = [batch size, src len]
        
        attention = attention.unsqueeze(1)
        #attention = [batch size, 1, src len]

        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #encoder_outputs = [batch size, src len, enc hid dim * 2]

        weighted = torch.bmm(attention, encoder_outputs)
        #weighted = [batch size, 1, enc hid dim * 2]

        weighted = weighted.permute(1, 0, 2)
        #weighted = [1, batch size, enc hid dim * 2]

        rnn_input = torch.cat((embeddings, weighted),dim=2)
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]

        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()

        embeddings = embeddings.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        predictions = self.fc(torch.cat((output, weighted, embeddings), dim=1))
        #predictions = [batch size, output dim]

        return predictions, hidden.squeeze(0)

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, device: torch.device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, data, teacher_forcing_ratio: float = 0.5):
        for k, v in data.items():
            data[k] = v.to(self.device)

        batch_size = data["target"].size(1)
        target_len = data["target"].size(0)
        target_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(self.device)

        encoder_outputs, hidden = self.encoder(data["input"])

        #first input to the decoder is the <bos> token of the target sequence
        input = data["target"][0, :]

        for t in range(1, target_len):
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)

            outputs[t] = output

            teacher_force = random.random() < teacher_forcing_ratio

            top1 = output.argmax(1)

            input = data["target"][t, :] if teacher_force else top1

        return outputs

In [None]:
INPUT_DIM = len(vocab_transform[SOURCE_LANGUAGE])
OUTPUT_DIM = len(vocab_transform[TARGET_LANGUAGE])
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HIDDEN_DIM = 512
DEC_HIDDEN_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attention = Attention(ENC_HIDDEN_DIM, DEC_HIDDEN_DIM)
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HIDDEN_DIM, DEC_HIDDEN_DIM, ENC_DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HIDDEN_DIM, DEC_HIDDEN_DIM, DEC_DROPOUT, attention)

model = Seq2Seq(encoder, decoder, DEVICE).to(DEVICE)

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(8014, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(6191, 256)
    (rnn): GRU(1280, 512)
    (fc): Linear(in_features=1792, out_features=6191, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 21,170,223 trainable parameters


In [None]:
optimizer = optim.Adam(model.parameters())

In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
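A small sketch of what `ignore_index=PAD_IDX` does: positions whose target is the padding index contribute nothing to the loss, so the result matches the loss computed on the non-padded positions alone. The logits and targets are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 1
logits = torch.randn(4, 6)                        # 4 positions, toy vocabulary of 6
target = torch.tensor([4, 2, PAD_IDX, PAD_IDX])   # last two positions are padding

masked_criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss_masked = masked_criterion(logits, target)

# Same loss computed by hand on the two real (non-pad) positions only.
loss_manual = nn.CrossEntropyLoss()(logits[:2], target[:2])

print(torch.allclose(loss_masked, loss_manual))  # True
```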

In [None]:
def train(model, dataloader, optimizer, criterion, clip: float):
    model.train()

    epoch_loss = 0
    num_batches = 0

    for data in dataloader:

        optimizer.zero_grad()

        output = model(data)
        #output = [trg len, batch size, output dim]

        output_dim = output.size(-1)

        #skip position 0 (<bos>), then flatten for the loss
        output = output[1:].view(-1, output_dim)
        target = data["target"][1:].view(-1)
        
        loss = criterion(output, target)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()
        num_batches += 1

    #average over the batches seen, without iterating the dataloader a second time
    return epoch_loss / num_batches

In [None]:
def evaluate(model, dataloader, optimizer, criterion):
    model.eval()

    epoch_loss = 0
    num_batches = 0

    with torch.no_grad():

        for data in dataloader:

            output = model(data, 0)  #turn off teacher forcing

            output_dim = output.size(-1)

            output = output[1:].view(-1, output_dim)
            target = data["target"][1:].view(-1)
            
            loss = criterion(output, target)

            epoch_loss += loss.item()
            num_batches += 1

    #average over the batches seen, without iterating the dataloader a second time
    return epoch_loss / num_batches

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float("inf")

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss = train(model, train_dataloader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_dataloader, optimizer, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pth')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 1m 33s
	Train Loss: 4.618 | Train PPL: 101.331
	 Val. Loss: 4.281 |  Val. PPL:  72.348
Epoch: 02 | Time: 1m 33s
	Train Loss: 3.372 | Train PPL:  29.127
	 Val. Loss: 3.624 |  Val. PPL:  37.506
Epoch: 03 | Time: 1m 33s
	Train Loss: 2.663 | Train PPL:  14.334
	 Val. Loss: 3.370 |  Val. PPL:  29.071
Epoch: 04 | Time: 1m 33s
	Train Loss: 2.200 | Train PPL:   9.023
	 Val. Loss: 3.292 |  Val. PPL:  26.909
Epoch: 05 | Time: 1m 33s
	Train Loss: 1.852 | Train PPL:   6.372
	 Val. Loss: 3.387 |  Val. PPL:  29.590
Epoch: 06 | Time: 1m 34s
	Train Loss: 1.579 | Train PPL:   4.850
	 Val. Loss: 3.415 |  Val. PPL:  30.404
Epoch: 07 | Time: 1m 33s
	Train Loss: 1.381 | Train PPL:   3.979
	 Val. Loss: 3.542 |  Val. PPL:  34.533
Epoch: 08 | Time: 1m 33s
	Train Loss: 1.225 | Train PPL:   3.404
	 Val. Loss: 3.652 |  Val. PPL:  38.564
Epoch: 09 | Time: 1m 33s
	Train Loss: 1.093 | Train PPL:   2.982
	 Val. Loss: 3.765 |  Val. PPL:  43.182
Epoch: 10 | Time: 1m 33s
	Train Loss: 0.949 | Train PPL

In [None]:
model.load_state_dict(torch.load('model.pth'))

<All keys matched successfully>

In [None]:
test_iter = Multi30k(root="/content", split='test', language_pair=(SOURCE_LANGUAGE, TARGET_LANGUAGE))
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

test_loss = evaluate(model, test_dataloader, optimizer, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.277 | Test PPL:  26.498 |
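The perplexity printed alongside each loss is simply the exponential of the average cross-entropy loss. A minimal sketch (the helper name `perplexity` is made up for this example):

```python
import math

def perplexity(avg_cross_entropy: float) -> float:
    # Perplexity is the exponential of the average per-token cross-entropy (in nats).
    return math.exp(avg_cross_entropy)

ppl = perplexity(3.277)
print(f"{ppl:.1f}")  # roughly 26.5
```

The small difference from the value printed by the notebook comes from using the loss rounded to three decimals here.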


In [None]:
model.eval()

epoch_loss = 0
with torch.no_grad():
    for i, batch in enumerate(test_dataloader):
        for k, v in batch.items():
            batch[k] = v.to(DEVICE)

        output = model(batch, 0)
        break

In [None]:
idx2word = vocab_transform[TARGET_LANGUAGE].get_itos()

In [None]:
def reverse(index):
    string = []
    for idx in index:
        if idx > 3:
            string.append(idx2word[idx])
    return " ".join(string)

In [None]:
ground_truths = []
for i in range(batch["target"].size(1)):
    gt = batch["target"][:, i]
    ground_truths.append(reverse(gt.detach().cpu().numpy()))

In [None]:
predictions = []
for i in range(output.size(1)):
    test_output = output[:, i].argmax(-1)
    predictions.append(reverse(test_output.detach().cpu().numpy()))


In [None]:
for ground_truth, prediction in zip(ground_truths, predictions):
    print(ground_truth)
    print(prediction)
    print()

A man in an orange hat starring at something .
A man with an orange hat , something something something .

A Boston Terrier is running on lush green grass in front of a white fence .
A golden dog runs over grass grass in front of a white fence . .

A girl in karate uniform breaking a stick with a front kick .
A girl in a karate uniform is a a with a gun . .

Five people wearing winter jackets and helmets stand in the snow , with in the background .
Five people in winter winter and helmets stand in the snow with the background . background . background . background . background . background .

People are fixing the roof of a house .
People are on a building . . .

A man in light colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a gown .
A bright red - skinned man is a a of men in a woman in a woman in a dress dress around around around a woman in a light .

A group of people standing in front of an igloo .
A group of people standi