# Machine Translation

Seq2Seq model and Evaluation metric

### Tutorial Topics
- Machine Translation:
    - Seq2Seq model
    - Evaluation metric

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)
- torchtext
- NLTK

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Seq2Seq model

In this tutorial, we will build a neural network that translates French sentences into English.

We will introduce an important architecture in machine translation: the [sequence to sequence network](http://arxiv.org/abs/1409.3215), in which two recurrent neural networks work together to transform one sequence (e.g., a sentence) into another. An encoder network condenses an input sequence into a **single vector**, and a decoder network unfolds that vector into a new sequence in the target language.

# Sequence to Sequence Learning

A [Sequence to Sequence network](http://arxiv.org/abs/1409.3215), or seq2seq network, or [Encoder Decoder network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model consisting of two separate RNNs called the **`encoder`** and **`decoder`**. The `encoder` reads an input sequence one token at a time and outputs a vector at each step. The final output of the encoder is kept as the **context** vector. In a classification task, we use this **context** vector as a summary of the input sequence. In a seq2seq model, the decoder uses this context vector as its initial state to generate the translation. We will discuss the details in a later section.

![](https://i.imgur.com/tVtHhNp.png)

 Picture Courtesy: https://i.imgur.com/tVtHhNp.png
 
When using a single RNN, there is a one-to-one relationship between `inputs` and `outputs`. However, there is no direct one-to-one relationship between the words of the source language and those of the target language.

Consider a simple sentence "`Je ne suis pas le chat noir`" &rarr; "`I am not the black cat`". Many of the words have a pretty direct translation, like "chat" &rarr; "cat". However, the differing grammars cause words to appear in different orders, e.g. "chat noir" and "black cat". There is also the "ne ... pas" &rarr; "not" construction, which makes the two sentences have different lengths.

With the seq2seq model, by encoding many source inputs into one vector and decoding from one vector into many target outputs, we are freed from the constraints of sequence order and length. The encoded sequence is represented by a single vector, which is an $N$-dimensional representation. In an ideal case, this vector can be considered a summary of the sequence.

The rest of this tutorial is organized as follows:
1. Preparing data
2. Encoder
3. Decoder
4. Seq2seq
5. Training the model
6. Loading the trained model checkpoint
7. Evaluation

### Required imports

In [None]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
from torchtext.datasets import TranslationDataset

import spacy
import numpy as np

Here we will also define a constant that decides whether to use the GPU (with CUDA specifically) or the CPU.

If no GPU is available, the code below falls back to the CPU. Later, when we create tensors, this variable decides whether we keep them on the CPU or move them to the GPU.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


## 1. Preparing Data

***Define tokenizers:***
First, we create the tokenizers. A tokenizer turns a string containing a sentence into a list of individual tokens.

`spaCy` has a model for each language ("fr" for French and "en" for English), which needs to be loaded so we can access the tokenizer of each model.

***Note***: the models must first be downloaded using the following on the command line:

```
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
```

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz (14.7MB)
Building wheels for collected packages: fr-core-news-sm
  Building wheel for fr-core-news-sm (setup.py) ... done
  Created wheel for fr-core-news-sm: filename=fr_core_news_sm-2.2.5-cp36-none-any.whl size=14727027 sha256=5a28f923fb2cc8f60c35c44f2fd3a7f6a4f0315974499867dc3fff4e35801c62
  Stored in directory: /tmp/pip-ephem-wheel-cache-cng4hizv/wheels/46/1b/e6/29b020e3f9420a24c3f463343afe5136aaaf955dbc9e46dfc5
Successfully built fr-core-news-sm
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('fr_core_n

In [None]:
import fr_core_news_sm
import en_core_web_sm

spacy_fr = fr_core_news_sm.load()
spacy_en = en_core_web_sm.load()


Next, we create the tokenizer functions. These functions can be provided to TorchText and will take in the sentence as a string and return the sentence as a list of tokens.

In [None]:
def tokenize_fr(text):
    """
    Tokenizes French text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_fr.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

`TorchText`'s Fields handle how data should be processed. You can read all of the possible arguments [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61).

We set the tokenize argument to the corresponding tokenization function for each, with French being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" (\<sos\>) and "end of sequence" (\<eos\>) tokens via the `init_token` and `eos_token` arguments, and converts all words to lowercase.

In [None]:
SRC = torchtext.data.Field(tokenize = tokenize_fr, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
TRG = torchtext.data.Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

Next, we load the train, validation and test data.

The dataset we'll be using is the [Multi30k](https://github.com/multi30k/dataset) dataset. This is a dataset with ~30,000 parallel English, French and German sentences. Sentences are around 12 words long. You can find more information in [WMT18](http://www.statmt.org/wmt18/multimodal-task.html). The corpus is officially split into Training (29,000 sentences), Validation (1,014 sentences), and multiple Test sets. We provide Test 2016 (1,000 sentences).

The raw dataset is extracted into three `.tsv` files. Each file includes two columns, 'English' and 'French'. We use `torchtext.data.TabularDataset` to load these tsv files.

In [None]:
train, val, test = torchtext.data.TabularDataset.splits(
    path='./drive/My Drive/Colab Notebooks/eng-fre/', train='train_eng_fre.tsv',validation='val_eng_fre.tsv', test='test_eng_fre.tsv', 
    format='tsv', skip_header=True, fields=[('TRG', TRG), ('SRC', SRC)])

We can double check that we've loaded the right number of examples:

In [None]:
print(f"Number of training examples: {len(train.examples)}")
print(f"Number of validation examples: {len(val.examples)}")
print(f"Number of testing examples: {len(test.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example:

In [None]:
print(vars(train.examples[0]))

{'TRG': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.'], 'SRC': ['deux', 'jeunes', 'hommes', 'blancs', 'sont', 'dehors', 'près', 'de', 'buissons', '.']}


In [None]:
print(vars(val.examples[100]))

{'TRG': ['an', 'older', ',', 'overweight', 'man', 'flips', 'a', 'pancake', 'while', 'making', 'breakfast', '.'], 'SRC': ['un', 'homme', 'âgé', 'en', 'surpoids', 'fait', 'sauter', 'une', 'crêpe', 'en', 'préparant', 'le', 'petit', 'déjeuner', '.']}


Next, we'll build the vocabulary for the source and target languages. 

The vocabulary is used to associate each unique token with an index and this is used to build a one-hot encoding for each token. The vocabularies of the source and target languages have some minimal overlap (e.g., `critique` and `genre` are used in similar context in both languages).

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

It is important to note that your vocabulary should only be built from the `training set` and not the `validation/test set`. This prevents **"information leakage"** into your model, which would give you artificially inflated validation/test scores.
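As a rough illustration of what building a vocabulary with `min_freq=2` involves, here is a minimal plain-Python sketch. The helper names `build_vocab_sketch` and `numericalize` and the toy corpus are hypothetical, and the order of the special tokens is an assumption chosen to agree with the `<pad>` index of 1 printed later in this tutorial. torchtext's own implementation has more options than shown here.

```python
from collections import Counter

# Assumed ordering of special tokens (hypothetical, chosen so that <pad> has index 1).
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab_sketch(tokenized_sentences, min_freq=2):
    """Map every token seen at least min_freq times to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent tokens first; ties broken alphabetically for determinism.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = SPECIALS + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Replace each token by its index; unseen or rare tokens become <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

# Toy training corpus: "dog" and "sleeps" occur only once, so they are dropped.
toy_train = [["a", "black", "cat"], ["a", "black", "dog"], ["a", "cat", "sleeps"]]
stoi, itos = build_vocab_sketch(toy_train, min_freq=2)
ids = numericalize(["a", "white", "cat"], stoi)
print(ids)
```

The word "white" never appears in the toy training data, so it is mapped to the `<unk>` index, which is exactly what happens to unseen validation or test words when the vocabulary is built from the training set only.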

In [None]:
TRG.build_vocab(train,min_freq=2)
SRC.build_vocab(train,min_freq=2)

In [None]:
print(f"Unique tokens in source (fr) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (fr) vocabulary: 6462
Unique tokens in target (en) vocabulary: 5893


`TRG.vocab.stoi` is the dictionary of word to index. For example, the index of `<pad>` is 1.

In [None]:
print(TRG.vocab.stoi['<pad>'])

1


The final step of preparing the data is to create the `iterators` that generate batches. Iterating over them returns one batch of data at a time. The source and target text are each converted to a sequence of corresponding indexes, using the vocabularies.


We also need to pass a `torch.device`. This indicates whether the input `tensors` should be sent to the `GPU` or not. We already defined the `device` variable earlier.

Finally, the output of the iterator will be `padded`.

We use a `BucketIterator` to create batches.

In [None]:
train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
    (train, val, test), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(16, 256, 256), device = device,
    sort_key=lambda x: len(x.SRC), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=True)

Each batch will include two **tensors**: tensor of source language and tensor of target language. The size of each tensor is **[max_length, batch_size]**. Each example is already **padded** within batch dynamically.
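To make the `[max_length, batch_size]` layout concrete, the following plain-Python sketch pads a toy batch by hand. The function `pad_batch` is hypothetical and the index values for `<sos>`, `<eos>` and `<pad>` (2, 3 and 1) are assumptions taken from the outputs printed in this tutorial; the real iterator does this work internally and returns tensors rather than lists.

```python
def pad_batch(sequences, sos=2, eos=3, pad=1):
    """Wrap each index sequence in <sos>/<eos>, pad to the longest, and
    return a time-major layout: one row per time step, one column per example."""
    wrapped = [[sos] + seq + [eos] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [pad] * (max_len - len(seq)) for seq in wrapped]
    # Transpose from [batch_size, max_len] to [max_len, batch_size].
    return [list(step) for step in zip(*padded)]

batch = pad_batch([[9, 6, 43], [12, 4]])
print(len(batch), len(batch[0]))      # max_length and batch_size
print([row[1] for row in batch])      # second example, padded at the end
```

Reading one column, as in `trg[:,0]` in the notebook cell, therefore gives one full sentence: `<sos>`, the word indexes, `<eos>`, and then any `<pad>` indexes.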

In [None]:
# batch example of training data
for batch in train_iter:
    src = batch.SRC
    trg = batch.TRG
    print('tensor size of source language:', src.shape)
    print('tensor size of target language:', trg.shape)
    print('the tensor of first example in target language:', trg[:,0])
    break

tensor size of source language: torch.Size([11, 16])
tensor size of target language: torch.Size([12, 16])
the tensor of first example in target language: tensor([  2,   9,   6,  43,  12,   4,  59, 402,  77,   5,   3,   1],
       device='cuda:0')


We save our Fields for reproducibility. Make sure you create a directory named `ckpt` in your drive before running the following code.

In [None]:
import pickle
with open("./drive/My Drive/Colab Notebooks/ckpt/TRG.Field","wb") as f:
     pickle.dump(TRG,f)

with open("./drive/My Drive/Colab Notebooks/ckpt/SRC.Field","wb") as f:
     pickle.dump(SRC,f)

## Building the Seq2Seq Model

## 2. Encoder

![](https://pytorch.org/tutorials/_images/seq2seq.png)

First, we'll build the encoder model that encodes the French sentence. We use a single-layer `uni-directional LSTM`.

Similar to the classification task (covered in DSCI 572), we pass only the output of the embedding layer to the LSTM layer. The LSTM layer returns `outputs`, `hidden` and `cell`. `hidden` is the final hidden state of the LSTM layer (t=seq_len), and `cell` is its final cell state (t=seq_len). Together, `hidden` and `cell` can be considered the **context** representation of the source sentence.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim,n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dropout = dropout
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, enc_hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.lstm(embedded)
       
        # outputs are always from the top hidden layer, if bidirectional outputs are concatenated.
        # outputs shape [sequence_length, batch_size, hidden_dim * num_directions]
        # hidden is of shape [num_layers * num_directions, batch_size, hidden_size]
        # cell is of shape [num_layers * num_directions, batch_size, hidden_size]
        
        return hidden, cell

## 3. Decoder

![](https://pytorch.org/tutorials/_images/seq2seq.png)

Next up is the decoder. The decoder is a `uni-directional LSTM`.


At time step $t$, the inputs to the decoder LSTM are the embedded word vector of the $t$-th target word, $y_t$, the previous decoder hidden state, $h_{t-1}$, and the previous decoder cell state, $c_{t-1}$.

$$h_t, c_t = \text{DecoderLSTM}(y_t, (h_{t-1}, c_{t-1}))$$

Specifically, we will use the last `hidden state` and `cell state` of the encoder LSTM as the initial states of the decoder LSTM (i.e., $h_{0}, c_{0}$) rather than initializing them randomly.

We then pass hidden state of LSTM layer, $h_t$, through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(h_t)$$

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, dec_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.output_dim = output_dim
        self.dec_hid_dim = dec_hid_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, dec_hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(dec_hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
             
        # input is of shape [batch_size]
        # hidden is of shape [n_layer * num_directions, batch_size, hidden_size]
        # cell is of shape [n_layer * num_directions, batch_size, hidden_size]
        
        input = input.unsqueeze(0)
        
        # input shape is [1, batch_size]. The extra dimension is needed because the LSTM expects
        # a rank-3 tensor [seq_len, batch_size, emb_dim] once the indices are embedded,
        # so [1, batch_size] means a batch of batch_size sequences, each containing 1 index.
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]    
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        # output shape is [sequence_len, batch_size, hidden_dim * num_directions]
        # hidden shape is [num_layers * num_directions, batch_size, hidden_dim]
        # cell shape is [num_layers * num_directions, batch_size, hidden_dim]

        # sequence_len and num_directions will always be 1 in the decoder.
        # output shape is [1, batch_size, hidden_dim]
        # hidden shape is [num_layers, batch_size, hidden_dim]
        # cell shape is [num_layers, batch_size, hidden_dim]
        
        prediction = self.fc_out(output.squeeze(0)) # top-layer hidden state (same as hidden when n_layers=1); linear expects a rank 2 tensor as input
        # predicted shape is [batch_size, output_dim]
        
        return prediction, hidden, cell

## 4. Seq2Seq


![](https://pytorch.org/tutorials/_images/seq2seq.png)

The `encoder` returns both the final `hidden state` and `cell state` to be used as the initial `hidden state` and `cell state` for the `decoder`.

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y} = \{\hat{y_0}, \hat{y_1} ... \hat{y_t}\}$;
- the source sequence, $X = \{x_0,x_1,..., x_t\}$, is fed into the encoder to receive last hidden state, $h^{Encoder}_t$, and last cell state $c^{Encoder}_t$;
- the initial decoder hidden state is set to be the $h^{Encoder}_t$, and the initial decoder cell state is set to be the $c^{Encoder}_t$. (i.e., $h^{Decoder}_0$ = $h^{Encoder}_t$; $c^{Decoder}_0$ = $c^{Encoder}_t$);
- we use a batch of `<sos>` tokens as the first `input` (i.e., $y_1$);
- we then decode within a loop, for `i in range(1, t)`, where `t` is the maximal length of the target sequences in the batch:
  - inserting the input token $y_i$, the previous hidden state, $h^{Decoder}_{i-1}$, and the previous cell state, $c^{Decoder}_{i-1}$, into the decoder;
  - receiving a prediction, $\hat{y}_{i+1}$, which scores every word in the target vocabulary for the next position, a new hidden state, $h^{Decoder}_{i}$, and a new cell state, $c^{Decoder}_{i}$;
  - we then decide if we are going to **teacher force** or not, setting the next input as appropriate: if teacher forcing is on, the next input is the gold (ground-truth) target token for the position just predicted; otherwise, it is the token the decoder just predicted (the highest-scoring word).

In [None]:
class Seq2Seq(nn.Module):
    ''' This class contains the implementation of complete sequence to sequence network.
    It uses the encoder to produce the context vectors.
    It uses the decoder to produce the predicted target sentence.
    Args:
        encoder: An Encoder class instance.
        decoder: A Decoder class instance.
    '''
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src is of shape [src_sequence_len, batch_size]
        # trg is of shape [targ_sequence_len, batch_size]
        # if teacher_forcing_ratio is 0.5 we use ground-truth inputs 50% of time and 50% time we use decoder outputs.

        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # to store the outputs of the decoder
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # context vector, last hidden and cell state of encoder to initialize the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        input = trg[0, :]

        for t in range(1, max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            # draw a random number in [0, 1) and teacher force if it is below the ratio
            # if the ratio is 1.0, use_teacher_force is always True
            # if the ratio is 0.0, use_teacher_force is always False
            # if the ratio is 0.4, use_teacher_force is True about 40% of the time (on average)
            use_teacher_force = random.random() < teacher_forcing_ratio 
            # index of the highest-scoring token for each example in the batch
            top1 = output.max(1)[1]
            # decide the next token based on use_teacher_force
            # if teacher forcing is on, the next input is the gold token for the current timestep
            # otherwise, the next input is the token predicted at the current timestep
            input = (trg[t] if use_teacher_force else top1) 

        # outputs is of shape [sequence_len, batch_size, output_dim]
        return outputs
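
The teacher-forcing decision used in the loop above can be checked in isolation. The sketch below is a hypothetical helper (`teacher_forcing_fraction` is not part of the tutorial's model) that repeats the same `random() < ratio` comparison many times and reports how often the gold token would be fed back. A fixed seed is assumed only to make the run repeatable.

```python
import random

def teacher_forcing_fraction(ratio, n_steps=10_000, seed=0):
    """Fraction of decoding steps at which the gold token would be the next input."""
    rng = random.Random(seed)  # private generator so the global seed is untouched
    forced = sum(rng.random() < ratio for _ in range(n_steps))
    return forced / n_steps

never = teacher_forcing_fraction(0.0)   # random() is never below 0.0
always = teacher_forcing_fraction(1.0)  # random() is always below 1.0
half = teacher_forcing_fraction(0.5)    # roughly half of the steps
print(never, always, half)
```

Note that in the `Seq2Seq.forward` method the comparison is made once per time step, so the decision applies to the whole batch at that step rather than to each sentence separately.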

## 5. Training the Seq2Seq Model
We instantiate our encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

In [None]:
INPUT_DIM = len(SRC.vocab) # tokens in source vocabulary
OUTPUT_DIM = len(TRG.vocab) # tokens in target vocabulary

# hyperparameters
ENC_EMB_DIM = 256 # encoder embedding size
DEC_EMB_DIM = 256 # decoder embedding size
ENC_HID_DIM = 512 # encoder hidden size
DEC_HID_DIM = 512 # decoder hidden size
ENC_DROPOUT = 0.5 # dropout for encoder
DEC_DROPOUT = 0.3 # dropout for decoder
N_LAYERS = 1 # number of LSTM layers
LEARNING_RT = 0.001 # learning rate

# model
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, N_LAYERS, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)



We use a simplified version of the **weight initialization scheme**. Here, we will initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, i.e., $\mathcal{N}(0, 0.01^2)$.

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6462, 256)
    (lstm): LSTM(256, 512, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (lstm): LSTM(256, 512, dropout=0.3)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)

Calculate the number of parameters.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 9,339,909 trainable parameters
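
The reported count can be reproduced by hand from the layer shapes printed above. The sketch below assumes the usual PyTorch LSTM parameterization (four gates, each with input-to-hidden weights, hidden-to-hidden weights, and two bias vectors); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, emb_dim):
    # One emb_dim-sized vector per vocabulary entry.
    return vocab_size * emb_dim

def lstm_params(input_dim, hidden_dim):
    # Four gates; each has input-hidden weights, hidden-hidden weights,
    # and two bias vectors (single layer, uni-directional).
    return 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim

def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output feature.
    return in_features * out_features + out_features

SRC_VOCAB, TRG_VOCAB = 6462, 5893   # vocabulary sizes printed earlier
EMB_DIM, HID_DIM = 256, 512         # hyperparameters used in this tutorial

encoder_total = embedding_params(SRC_VOCAB, EMB_DIM) + lstm_params(EMB_DIM, HID_DIM)
decoder_total = (embedding_params(TRG_VOCAB, EMB_DIM)
                 + lstm_params(EMB_DIM, HID_DIM)
                 + linear_params(HID_DIM, TRG_VOCAB))
total = encoder_total + decoder_total
print(f"{total:,}")
```

Most of the parameters sit in the two embedding tables and the output layer, all of which grow linearly with the vocabulary size.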


Create an optimizer.

In [None]:
optimizer = optim.Adam(model.parameters(), lr = LEARNING_RT)

Initialize the loss function. The pad token needs to be ignored.
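
To see what ignoring the pad token means for the loss, here is a small plain-Python sketch of cross entropy averaged only over non-pad positions. The function `cross_entropy_ignoring` and the toy numbers are hypothetical; it is meant to mirror the idea behind `ignore_index`, not to replace the PyTorch loss used in the next cell.

```python
import math

def cross_entropy_ignoring(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, counted = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    return total / counted

PAD = 1
logits = [[0.0, 0.0, 0.0, 0.0],    # uniform scores over a 4-word vocabulary
          [0.0, 0.0, 0.0, 0.0],
          [9.0, -9.0, 3.0, 3.0]]   # scores at a padded position (ignored)
targets = [2, 3, PAD]
loss = cross_entropy_ignoring(logits, targets, ignore_index=PAD)
print(loss)
```

Only the first two positions count, so whatever the model outputs at the padded position has no effect on the loss or on the gradients.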

In [None]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
print('<pad> token index: ',TRG_PAD_IDX)
## we will ignore the pad token in the true target sequences
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

<pad> token index:  1


### Testing Model With a Single Batch
We will run the model on the first training batch to test our code.

In [None]:
clip = 1
model.train()

for i, batch in enumerate(train_iter):
    
    # read the source sentence and target sentence
    src = batch.SRC
    trg = batch.TRG

    # clear the gradient buffer
    optimizer.zero_grad()

    # forward pass
    output = model(src, trg)
    #trg = [trg len, batch size]
    #output = [trg len, batch size, output dim]

    output_dim = output.shape[-1]

    output = output[1:].view(-1, output_dim)
    trg = trg[1:].view(-1)

    #trg = [(trg len - 1) * batch size]
    #output = [(trg len - 1) * batch size, output dim]
    
    # compute the loss
    loss = criterion(output, trg)
    
    # compute the gradients
    loss.backward()

    # clip the gradients to prevent gradient explosion problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    
    # update the parameters
    optimizer.step()

    print(loss/src.shape[1])
    break

tensor(0.5426, device='cuda:0', grad_fn=<DivBackward0>)


## Full training process
If the single-batch test runs successfully, we can start the full training loop as follows:

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.SRC
        trg = batch.TRG
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        # the loss function expects 2d logits and 1d targets,
        # so flatten the trg and output tensors, ignoring the <sos> token
        # trg shape should be [(sequence_len - 1) * batch_size]
        # output shape should be [(sequence_len - 1) * batch_size, output_dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing (i.e., teacher forcing ratio = 0).

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.SRC
            trg = batch.TRG

            output = model(src, trg, 0) # turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Count the running time.

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

## Training the model

We will train the model for 10 epochs. At the end of each epoch, we will save a checkpoint and evaluate on the development set. We will print out the loss and perplexity of train and dev set.

In [None]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iter, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, val_iter, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # Create checkpoint at end of each epoch
    state_dict_model = model.state_dict() 
    state = {
        'epoch': epoch,
        'state_dict': state_dict_model,
        'optimizer': optimizer.state_dict()
        }

    torch.save(state, "./drive/My Drive/Colab Notebooks/ckpt/seq2seq_"+str(epoch+1)+".pt")

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 55s
	Train Loss: 4.402 | Train PPL:  81.591
	 Val. Loss: 4.433 |  Val. PPL:  84.177
Epoch: 02 | Time: 0m 56s
	Train Loss: 3.550 | Train PPL:  34.819
	 Val. Loss: 3.947 |  Val. PPL:  51.783
Epoch: 03 | Time: 0m 56s
	Train Loss: 3.034 | Train PPL:  20.788
	 Val. Loss: 3.648 |  Val. PPL:  38.409
Epoch: 04 | Time: 0m 57s
	Train Loss: 2.650 | Train PPL:  14.156
	 Val. Loss: 3.450 |  Val. PPL:  31.515
Epoch: 05 | Time: 0m 57s
	Train Loss: 2.365 | Train PPL:  10.649
	 Val. Loss: 3.368 |  Val. PPL:  29.027
Epoch: 06 | Time: 0m 56s
	Train Loss: 2.130 | Train PPL:   8.413
	 Val. Loss: 3.270 |  Val. PPL:  26.310
Epoch: 07 | Time: 0m 56s
	Train Loss: 1.932 | Train PPL:   6.902
	 Val. Loss: 3.282 |  Val. PPL:  26.629
Epoch: 08 | Time: 0m 57s
	Train Loss: 1.766 | Train PPL:   5.848
	 Val. Loss: 3.294 |  Val. PPL:  26.946
Epoch: 09 | Time: 0m 57s
	Train Loss: 1.611 | Train PPL:   5.009
	 Val. Loss: 3.330 |  Val. PPL:  27.950
Epoch: 10 | Time: 0m 56s
	Train Loss: 1.490 | Train PPL
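
The training loop defines `best_valid_loss` but saves a checkpoint after every epoch, so the best epoch has to be read off the log. The sketch below does that for the validation losses printed above. Only epochs 1 to 9 are included because the validation loss for epoch 10 is cut off in the output; the variable name `logged_val_losses` is hypothetical.

```python
# (epoch, validation loss) pairs copied from the training log above.
logged_val_losses = [(1, 4.433), (2, 3.947), (3, 3.648), (4, 3.450), (5, 3.368),
                     (6, 3.270), (7, 3.282), (8, 3.294), (9, 3.330)]

best_valid_loss = float("inf")
best_epoch = None
for epoch, valid_loss in logged_val_losses:
    # Keep the epoch whose validation loss is the lowest seen so far.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch

print(best_epoch, best_valid_loss)
```

After the best epoch the training loss keeps falling while the validation loss rises, which is the usual sign that the model has started to overfit.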

## 6. Load Checkpoint
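We will load a saved checkpoint for the following steps. Among the epochs whose validation loss is shown in the log above, epoch 6 has the lowest value (3.270, saved as `seq2seq_6.pt`); the code below loads `seq2seq_7.pt`, whose validation loss (3.282) is very close.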

Load the saved TRG and SRC fields:

In [None]:
with open("./drive/My Drive/Colab Notebooks/ckpt/TRG.Field","rb") as f:
     TRG_saved = pickle.load(f)

with open("./drive/My Drive/Colab Notebooks/ckpt/SRC.Field","rb") as f:
     SRC_saved = pickle.load(f)

Load trained model to `model_best` and put model on device.

In [None]:
INPUT_DIM = len(SRC_saved.vocab)
OUTPUT_DIM = len(TRG_saved.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.3
N_LAYERS = 1
LEARNING_RT = 0.001
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, N_LAYERS, DEC_DROPOUT)

model_best = Seq2Seq(enc, dec, device)



In [None]:
model_best.load_state_dict(torch.load('./drive/My Drive/Colab Notebooks/ckpt/seq2seq_7.pt')['state_dict'])
model_best = model_best.to(device)

Pre-process the source sentence and get the input tensor.

In [None]:
model_best.eval()
src_token = SRC_saved.preprocess('Une femme avec un gros sac passe par une porte.')
print("src_token:", src_token)
src_tensor = SRC_saved.process([src_token],device=device)
print("shape of source language: ", src_tensor.shape)

src_token: ['une', 'femme', 'avec', 'un', 'gros', 'sac', 'passe', 'par', 'une', 'porte', '.']
shape of source language:  torch.Size([13, 1])


We treat this as a test sample, so we don't have the gold target text. We create a placeholder for the target language that contains only 64 `<pad>` tokens (i.e., the maximal length of our generated translation is 64). The field also adds `<sos>` and `<eos>`, so the resulting tensor has length 66.

In [None]:
trg_token = ['<pad>']*64
trg_tensor = TRG_saved.process([trg_token],device=device)
print("shape of target language: ", trg_tensor.shape)

shape of target language:  torch.Size([66, 1])


In [None]:
output = model_best(src_tensor, trg_tensor, teacher_forcing_ratio = 0.0)
output_dim = output.shape[-1]
# get translation results, we ignore first token <sos> in both translation and target sentences. 
# output_translate = [(trg len - 1), batch, output dim] output dim is size of target vocabulary. 
output_translate = output[1:]
print("shape of output translate: ", output_translate.size())

shape of output translate:  torch.Size([65, 1, 5893])


Move the source input tensor to the CPU because the following steps operate on the CPU. Then select the only example in the batch so that the logits have shape `[(trg len - 1), output dim]`.

In [None]:
source_language_token_ids = src_tensor[:,0].cpu().numpy()
print("token indices in source language: ", source_language_token_ids)
translation_logit = output_translate[:,0,:].squeeze(1)
print("shape of logits for each prediction token in target language", translation_logit.size())

token indices in source language:  [  2   6  19  15   4 185 150 211  64   6 109   5   3]
shape of logits for each prediction token in target language torch.Size([65, 5893])


In [None]:
# Choose the top-1 word from the decoder's output; topk returns the unnormalized logit (stored in `prob`) and the index of that word
prob, token_id = translation_logit.data.topk(1)
print("shape of unnormalized logits corresponding to the top prediction for each prediction token = ", prob.size())
target_language_token_ids_along_with_pad = token_id.squeeze(1).cpu().numpy()
print("token indices in target language: ", target_language_token_ids_along_with_pad) 

shape of unnormalized logits corresponding to the top prediction for each prediction token =  torch.Size([65, 1])
token indices in target language:  [   4   14   13    4   59  265   10   41   49    4 1627    5    3    3
    3    3    3    3    3    3    3    3    3    3    3    3    3    3
    3    3    3    3    3    3    3    3    3    3    3    3    3    3
    3    3    3    3    3    3    3    3    3    3    3    3    3    3
    3    3    3    3    3    3    3    3    3]


To get the translation in text, we will use a dictionary to convert index to word.


In [None]:
# get source language in text
src_language_token_strs = []
for i in source_language_token_ids:
    if i == SRC.vocab.stoi['<eos>']:
        break
    else:
        token = SRC.vocab.itos[i]
        src_language_token_strs.append(token)
print("Source language:", src_language_token_strs)

# get machine translation in text
trans_language = []
for i in target_language_token_ids_along_with_pad:
    if i == TRG.vocab.stoi['<eos>']:
        break
    else:
        token = TRG.vocab.itos[i]
        trans_language.append(token)
print("Our model translation: ",  ' '.join(trans_language))
print("Gold translation: ", "a woman with a large purse is walking by a gate.")

Source language: ['<sos>', 'une', 'femme', 'avec', 'un', 'gros', 'sac', 'passe', 'par', 'une', 'porte', '.']
Our model translation:  a woman with a large bag is walking by a donkey .
Gold translation:  a woman with a large purse is walking by a gate.
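
Both loops in the cell above follow the same pattern: walk over the indices, stop at the first `<eos>`, and look up each word. A minimal sketch of that pattern as a reusable helper is shown below. The function name `ids_to_tokens` and the toy vocabulary are assumptions made for illustration; they are not part of torchtext or of our real vocabulary.

```python
def ids_to_tokens(token_ids, itos, eos_id):
    """Map token indices to strings, stopping before the first <eos>."""
    tokens = []
    for i in token_ids:
        if i == eos_id:
            break
        tokens.append(itos[i])
    return tokens

# Toy index-to-string list (hypothetical, much smaller than a real vocabulary).
toy_itos = ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'woman', 'walks', '.']
# A prediction followed by repeated <eos> indices, as in the output above.
toy_ids = [4, 5, 6, 7, 3, 3, 3]
print(ids_to_tokens(toy_ids, toy_itos, eos_id=3))  # ['a', 'woman', 'walks', '.']
```

With the real vocabularies, the equivalent call would pass `TRG.vocab.itos` and `TRG.vocab.stoi['<eos>']`.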


## 7. Evaluation 


### Evaluation with perplexity

[Perplexity](https://en.wikipedia.org/wiki/Perplexity) is an information-theoretic measure of how well a probability model predicts a sample. A lower perplexity indicates a better fit.

In language modeling, we build a model with distribution `q` that tries to match the distribution `p` of real-world language as closely as possible. In practice we do not know `p` exactly; we only know $\tilde{p}$, the empirical distribution of samples (i.e., our dataset) drawn from the real-world system. We therefore measure perplexity through the cross entropy between $\tilde{p}$ and `q`. Perplexity is defined as:

$PPL(\tilde{p}, q) = e^{Entropy[\tilde{p},q]}$

where `q` is our translation model and $Entropy[\tilde{p},q]$ is its `CrossEntropy` loss.
Provided the loss is averaged per token (the default for `nn.CrossEntropyLoss`), we can use `math.exp(train_loss)` directly to get the perplexity of our model.
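
To make the relation between cross entropy and perplexity concrete, here is a minimal sketch in plain Python. The probabilities are hypothetical values that a model might assign to the correct target tokens; they are not taken from our trained model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each correct token."""
    # Mean negative log-likelihood (natural log) is the cross entropy per token.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that guesses uniformly among 4 words at every step.
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that is fairly confident about every correct word.
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform))
print(perplexity(confident))
```

The uniform model has a perplexity of about 4, matching the intuition that perplexity is the effective number of choices the model is hesitating between; the confident model scores closer to 1.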

### Evaluation with BLEU score

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for evaluating a generated sentence against one or more reference sentences.

NLTK provides the [sentence_bleu()](http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu) function for evaluating a candidate sentence against one or more reference sentences.

In [None]:
from nltk.translate.bleu_score import sentence_bleu

Let's take a look at a simple example:

In [None]:
# two references for one candidate sentence
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)

1.0


By default, `sentence_bleu()` and `corpus_bleu()` calculate the cumulative 4-gram BLEU score, also called **BLEU-4**. BLEU-4 gives a weight of 0.25 (25%) to each of the 1-gram, 2-gram, 3-gram and 4-gram precisions and combines them with a geometric mean.
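
To show what goes into these scores, the sketch below computes the clipped n-gram precisions and a cumulative BLEU score by hand for our gold and model translations. It is a simplified illustration with a single reference and no smoothing, not NLTK's implementation; the helper names `ngrams`, `modified_precision` and `bleu` are our own.

```python
import math
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return Fraction(clipped, sum(hyp_counts.values()))

def bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25)):
    """Cumulative BLEU for one sentence pair; assumes every precision is non-zero."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, len(weights) + 1)]
    # Brevity penalty: punish hypotheses shorter than the reference.
    r, c = len(reference), len(hypothesis)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    log_mean = sum(w * math.log(float(p)) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)

reference = "a woman with a large purse is walking by a gate .".split()
hypothesis = "a woman with a large bag is walking by a donkey .".split()

for n in range(1, 5):
    print(n, modified_precision(reference, hypothesis, n))  # 5/6, 7/11, 1/2, 1/3
print(round(bleu(reference, hypothesis), 4))  # 0.5452
```

The two wrong words ("bag", "donkey") cost little at the 1-gram level but break every longer n-gram that contains them, which is why the precision drops quickly as n grows.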

In [None]:
# now, use BLEU to evaluate our translation

# tokenize our gold sentence
tokens = TRG.preprocess("a woman with a large purse is walking by a gate.")
print("tokens:\t", tokens)
# corpus_bleu expects, for every candidate, a list of reference token lists
reference = [[tokens]]
print("Our model translation:\t", trans_language)
our_translation = [trans_language]

# score against the gold reference built above, not the toy `references` from the previous example
score = corpus_bleu(reference, our_translation)  # default weights=(0.25, 0.25, 0.25, 0.25)
print("BLEU score: %f" % score)

tokens:	 ['a', 'woman', 'with', 'a', 'large', 'purse', 'is', 'walking', 'by', 'a', 'gate', '.']
Our model translation:	 ['a', 'woman', 'with', 'a', 'large', 'bag', 'is', 'walking', 'by', 'a', 'donkey', '.']
BLEU score: 0.545247


For 1-gram BLEU, the cumulative and individual scores use the same weights, (1, 0, 0, 0). Cumulative BLEU-2 assigns a weight of 50% to each of the 1-gram and 2-gram precisions, and cumulative BLEU-3 assigns about 33% to each of the 1-gram, 2-gram and 3-gram precisions.

Let’s make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4:

In [None]:
# cumulative BLEU scores against the gold reference (not the toy `references` example)
print('Cumulative 1-gram: %f' % corpus_bleu(reference, our_translation, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % corpus_bleu(reference, our_translation, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % corpus_bleu(reference, our_translation, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % corpus_bleu(reference, our_translation, weights=(0.25, 0.25, 0.25, 0.25)))

Cumulative 1-gram: 0.833333
Cumulative 2-gram: 0.728219
Cumulative 3-gram: 0.645287
Cumulative 4-gram: 0.545247


## Reference 
* https://pytorch.org/docs/stable/nn.html
* https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
* https://arxiv.org/abs/1409.3215
* https://github.com/graviraja/seq2seq
* https://github.com/eladhoffer/seq2seq.pytorch
* https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation
* http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
* https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
* https://leimao.github.io/blog/Entropy-Perplexity/