## Transformer for Machine Translation

<img src="https://mostafadehghani.com/wp-content/uploads/2021/04/transformer_arch-1024x604.png" alt="Transformer" title="Transformer" style="width: 650px;"/>

Picture Courtesy: [Tay et al., 2020](https://arxiv.org/pdf/2009.06732.pdf)


### Transformer mode

It is important to note the differences in the mode of usage of the Transformer block. Transformers can primarily be used in three ways, namely:
- encoder-only (e.g., for classification), 
- decoder-only (e.g., for language modeling), 
- encoder-decoder (e.g., for machine translation, which is our focus)

In encoder-decoder mode, there are usually multiple multi-headed self-attention modules, including a standard self-attention in both the encoder and the decoder, along with an encoder-decoder cross-attention that allows the decoder to utilize information from the encoder. This influences the design of the self-attention mechanism. In the encoder mode, there is no restriction or constraint that the self-attention mechanism has to be causal, i.e., dependent solely on the present and past tokens. In the encoder-decoder setting, the encoder and encoder-decoder cross attention can afford to be non-causal but the decoder self-attention must be causal. The ability to support causal auto-regressive decoding is required when designing efficient self-attention mechanisms since it can be a limiting factor in many applications.

This notebook was tested in a [google colab](https://colab.research.google.com/).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import required libraries

In [2]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
from torchtext.legacy import data

import spacy
import numpy as np

import math, copy, time
import matplotlib.pyplot as plt

import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

# set the pseudo-random generator
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

cuda


### Preparing Data

***Define tokenizers:***
we create the tokenizers. A tokenizer is used to turn a string containing a sentence into a list of individual tokens that make up that string.

spaCy has model for each language ("fr" for French and "en" for English) which need to be loaded so we can access the tokenizer of each model.

***Note***: the models must first be downloaded using the following on the command line:


In [3]:
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 4.7 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz (14.7 MB)
[K     |████████████████████████████████| 14.7 MB 5.0 MB/s 
Building wheels for collected packages: fr-core-news-sm
  Building wheel for fr-core-news-sm (setup.py) ... [?25l[?25hdone
  Created wheel for fr-core-news-sm: filename=fr_core_news_sm-2.2.5-py3-none-any.whl size=14727025 sha256=d1545f83306096363105ab98c087d3b2954d55c6e2b031aa24dc5282c7a1bad8
  Stored in directory: /tmp/pip-ephem-wheel-cache-rax20ehi/wheels/c9/a6/ea/0778337c34660027ee67ef3a91fb9d3600b76777a912ea1c24
Successfu

In [4]:
import fr_core_news_sm
import en_core_web_sm

spacy_fr = fr_core_news_sm.load()
spacy_en = en_core_web_sm.load()

Next, we create the tokenizer functions. These can be passed to TorchText and will take in the sentence as a string and return the sentence as a list of tokens.

In [5]:
def tokenize_fr(text):
    """
    Tokenizes French text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_fr.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

`TorchText`'s Fields handle how data should be processed. You can read all of the possible arguments [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61).

We set the tokenize argument to the corresponding tokenization function for each, with French being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" (\<sos\>) and "end of sequence" (\<eos\>) tokens via the `init_token` and `eos_token` arguments, and converts all words to lowercase.

In [6]:
SRC = data.Field(tokenize = tokenize_fr, 
            # init_token = '<sos>', # since initial encoder hidden state is always set to zero, the network can figure out that the time step is 0 and this token is optional
            eos_token = '<eos>', 
            lower = True)
TRG = data.Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

Next, we load the train, validation and test data.

The dataset we'll be using is the [Multi30k](https://github.com/multi30k/dataset) dataset. This is a dataset with ~30,000 parallel English, French and German sentences. You can find more information in [WMT18](http://www.statmt.org/wmt18/multimodal-task.html). This corpus was officially split to Training (29,000 sentences), Validation (1,014 sentences), and multiple Test sets. We provide Test 2016 (1,000 sentences). 

The raw dataset is extracted to three `.tsv` files. Each file includes two column, 'English' and 'French'. We use `torchtext.legacy.data.TabularDataset` to load these tsv files. 

In [7]:
train, val, test = data.TabularDataset.splits(
    path='./drive/MyDrive/Colab Notebooks/eng-fre/', train='train_eng_fre.tsv',validation='val_eng_fre.tsv', test='test_eng_fre.tsv', 
    format='tsv', skip_header=True, fields=[('TRG', TRG), ('SRC', SRC)])

We can double check that we've loaded the right number of examples:

In [8]:
print(f"Number of training examples: {len(train.examples)}")
print(f"Number of validation examples: {len(val.examples)}")
print(f"Number of testing examples: {len(test.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example:

In [9]:
print(vars(train.examples[0]))

{'TRG': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.'], 'SRC': ['deux', 'jeunes', 'hommes', 'blancs', 'sont', 'dehors', 'près', 'de', 'buissons', '.']}


In [10]:
print(vars(val.examples[100]))

{'TRG': ['an', 'older', ',', 'overweight', 'man', 'flips', 'a', 'pancake', 'while', 'making', 'breakfast', '.'], 'SRC': ['un', 'homme', 'âgé', 'en', 'surpoids', 'fait', 'sauter', 'une', 'crêpe', 'en', 'préparant', 'le', 'petit', 'déjeuner', '.']}


Next, we'll build the vocabulary for the source and target languages. 

The vocabulary is used to associate each unique token with an index and this is used to build a one-hot encoding for each token. The vocabularies of the source and target languages have some minimal overlap.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

It is important to note that your vocabulary should only be built from the `training set` and not the `validation/test set`. This prevents **"information leakage"** into your model, giving you artifically inflated validation/test scores.

In [11]:
TRG.build_vocab(train,min_freq=2)
SRC.build_vocab(train,min_freq=2)

In [12]:
print(f"Unique tokens in source (fr) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (fr) vocabulary: 6461
Unique tokens in target (en) vocabulary: 5893


`TRG.vocab.stoi` is the dictionary of word to index. For example, the index of `<pad>` is 1.

In [13]:
print(TRG.vocab.stoi['<pad>'])

1


The final step of preparing the data is to create the `iterators` to generate batches. These can be iterated on to return a batch of data. The text of both source and target text will be converted to two sequence of corresponding indexes, using the vocabularies.


We also need to define a `torch.device`. This indicate whether the input `tensors` should be sent to `GPU` or not. We already defined the `device` variable before. 

Finally, the output of the iterator will be `padded`. 

We use a `BucketIterator` to creates batches.

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [15]:
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(16, 256, 256),device = device,
    sort_key=lambda x: len(x.SRC), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False)

Each batch will include two tensors: tensor of source language and tensor of target language. The size of each tensor is **[max_length, batch_size]**. Each example is already padded within batch.

In [16]:
# batch example of training data
for batch in train_iter:
    src = batch.SRC
    trg = batch.TRG
    print('tensor size of source language:', src.shape)
    print('tensor size of target language:', trg.shape)
    print('the tensor of first example in target language:', trg[:,0])
    break

tensor size of source language: torch.Size([20, 16])
tensor size of target language: torch.Size([21, 16])
the tensor of first example in target language: tensor([   2, 1487, 1332,    0,  245,   18,  931,    4,    9,    5,    3,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1], device='cuda:0')


We save our Fields for reproducibility.

In [17]:
import pickle
with open("./drive/MyDrive/Colab Notebooks/ckpt_attn/TRG.Field","wb")as f:
     pickle.dump(TRG,f)

with open("./drive/MyDrive/Colab Notebooks/ckpt_attn/SRC.Field","wb")as f:
     pickle.dump(SRC,f)

### Transformer model - Implementation

[nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) module implements the entire encoder-decoder Transformer model, excluding the input token embeddings, position encodings, source and target masks. This module takes in the following parameters:
- **$d_{\text{model}}$** – the number of expected features in the encoder/decoder inputs 
- **nhead** – the number of heads in the multiheadattention models (the term 'h' in theory)
- **num_encoder_layers** - the number of sub-encoder-layers in the encoder
- **num_decoder_layers** - the number of sub-decoder-layers in the decoder
- **dim_feedforward** -  the dimension of the feedforward network model
- **dropout** - the dropout value

Let's understand `nn.Transformer` module by passing a sample batch.

Let's first convert the src and target word indices to word embeddings:






In [28]:
# word embedding layers for encoder (src) and decoder (targ)
token_embedding_encoder = nn.Embedding(num_embeddings=len(SRC.vocab), embedding_dim=512).to(device) # src vocab size x embedding size
print("token_embedding_encoder = ", token_embedding_encoder)
token_embedding_decoder = nn.Embedding(num_embeddings=len(TRG.vocab), embedding_dim=512).to(device) # targ vocab size x embedding size
print("token_embedding_decoder = ", token_embedding_decoder)

# print the shape tensors in sample batch: src and targ
print("src shape = ", src.shape) # src. max. seq len x batch size
print("targ shape = ", trg.shape) # targ. max. seq len x batch size

# pass the word indices to embedding layer to get embeddings
src_token_embeddings = token_embedding_encoder(src.to(device))
print("src_token_embeddings shape = ", src_token_embeddings.shape) # src. max. seq len x batch size x embedding size
targ_token_embeddings = token_embedding_decoder(trg.to(device))
print("targ_token_embeddings shape = ", targ_token_embeddings.shape) # targ. max. seq len x batch size x embedding size


token_embedding_encoder =  Embedding(6461, 512)
token_embedding_decoder =  Embedding(5893, 512)
src shape =  torch.Size([20, 16])
targ shape =  torch.Size([21, 16])
src_token_embeddings shape =  torch.Size([20, 16, 512])
targ_token_embeddings shape =  torch.Size([21, 16, 512])


Having built word embeddings, we can now build position embeddings for both encoder and decoder. Let's start by defining the position embedding layer.

In [29]:
# position embedding layers for encoder (src) and decoder (targ)
maximum_sentence_len = 200 # this will be the maximum length of source and target sentence that our model can process (can have separate maximum lengths for both encoder and decoder)
position_embedding_encoder = nn.Embedding(num_embeddings=maximum_sentence_len, embedding_dim=512).to(device) # maximum_sentence_len x embedding size
print("position_embedding_encoder = ", position_embedding_encoder)
position_embedding_decoder = nn.Embedding(num_embeddings=maximum_sentence_len, embedding_dim=512).to(device) # maximum_sentence_len x embedding size
print("position_embedding_decoder = ", position_embedding_decoder)

position_embedding_encoder =  Embedding(200, 512)
position_embedding_decoder =  Embedding(200, 512)


Now we can create the positional input for both encoder and decoder. For a single example setting, the positional inputs can be created as follows:

In [30]:
# if the no. of tokens in the source sentence is `src_seq_len`
src_seq_len = src.size(0)
src_position = torch.arange(0, src_seq_len)
print('positional input for a single source example = ', src_position, src_position.shape)

# if the no. of tokens in the target sentence is `trg_seq_len`
trg_seq_len = trg.size(0)
trg_position = torch.arange(0, trg_seq_len)
print('positional input for a single target example = ', trg_position, trg_position.shape)

positional input for a single source example =  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19]) torch.Size([20])
positional input for a single target example =  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20]) torch.Size([21])


If we have a batch of examples, the positional inputs can be extended to batch as follows:

In [31]:
batch_size = src.size(1) # size of the current batch
src_position = (torch.arange(0, src_seq_len).unsqueeze(1).expand(src_seq_len, batch_size).to(device))
print('positional input for a source batch = ', src_position, src_position.shape)

positional input for a source batch =  tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
        [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
        [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
        [ 6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
        [ 7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7],
        [ 8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8],
        [ 9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9],
        [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
        [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
        [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],

Similarly for the target batch,

In [32]:
batch_size = trg.size(1) # size of the current batch
targ_position = (torch.arange(0, trg_seq_len).unsqueeze(1).expand(trg_seq_len, batch_size).to(device))
print('positional input for a target batch = ', targ_position, targ_position.shape)

positional input for a target batch =  tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
        [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
        [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
        [ 6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
        [ 7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7],
        [ 8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8],
        [ 9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9],
        [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
        [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
        [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],

The positional embeddings can be obtained by passing the position input to position embeddings layer of encoder and decoder:

In [33]:
src_pos_embeddings = position_embedding_encoder(src_position)
print('positional input for source = ', src_position.shape)
print('positional embeddings for source = ', src_pos_embeddings.shape)

targ_pos_embeddings = position_embedding_encoder(targ_position)
print('positional input for target = ', targ_position.shape)
print('positional embeddings for target = ', targ_pos_embeddings.shape)

positional input for source =  torch.Size([20, 16])
positional embeddings for source =  torch.Size([20, 16, 512])
positional input for target =  torch.Size([21, 16])
positional embeddings for target =  torch.Size([21, 16, 512])


Having built word embedding and position embedding, we can now add both the embeddings to create the input embedding to Transformer module

In [34]:
src_input_embedding = src_token_embeddings + src_pos_embeddings
print('input embedding for source = ', src_input_embedding.shape)
targ_input_embedding = targ_token_embeddings + targ_pos_embeddings
print('input embedding for target = ', targ_input_embedding.shape)

input embedding for source =  torch.Size([20, 16, 512])
input embedding for target =  torch.Size([21, 16, 512])


We can optionally add dropout to both the input embeddings before feeding as an input to Transformer.

We need to create a binary mask for the source tokens that marks the pad tokens.

In [35]:
# create the source (pad) mask
mask_src = (src == SRC.vocab.stoi[SRC.pad_token]).transpose(0, 1).to(device) # batch size x src max. seq length
print(mask_src)

tensor([[False, False, False, False, False, False, False, False, False,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False,  True,  True,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False,  True,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False,  True,  True,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False, False,  True,  True,  True,  True,  True,  True],
        [False, False, False, False, False, Fa

We can now create an instance from `nn.Transformer` module:

In [36]:
# check doc for details about the input arguments
# https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
embed_dim = 512 #  the number of expected features in the encoder/decoder inputs
nhead = 8 # the number of heads in the multiheadattention models
num_encoder_layers = 2 # the number of sub-encoder-layers in the encoder
num_decoder_layers = 2 # the number of sub-decoder-layers in the decoder
dim_feedforward = 1024 # the dimension of the feedforward network model
dropout = 0.1 # the dropout value (default=0.1)
transformer_layer = nn.Transformer(d_model=embed_dim, nhead=nhead, num_encoder_layers=num_encoder_layers, dropout=dropout, num_decoder_layers=num_decoder_layers, dim_feedforward=dim_feedforward).to(device)
print(transformer_layer)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=1024, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=1024, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=1024, bias=True)
        (dropout): Dropout(p=0.1, in

We have created all the inputs that we need to pass to Transformer module except one: **target mask** that makes the decoder attend to only the past decoder context while predicting the next token.

In [37]:
# create the target (self-attention) mask 
targ_seq_len = trg.size(0)
mask_targ = transformer_layer.generate_square_subsequent_mask(targ_seq_len).to(device)
print("mask target = ", mask_targ, mask_targ.shape)  # targ. seq len x targ. seq len

mask target =  tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, 

Let's now pass all the tensor we've created to transformer layer:

In [38]:
print("shape of src_input_embedding = ", src_input_embedding.shape) # src. seq len x batch size x embedding size 
print("shape of targ_input_embedding = ", targ_input_embedding.shape)  # targ. seq len x batch size x embedding size 
print("shape of mask_src = ", mask_src.shape) #  batch size x src. seq len
print("shape of mask_targ = ", mask_targ.shape) # targ. seq len x targ. seq len
output = transformer_layer(src_input_embedding, targ_input_embedding, src_key_padding_mask=mask_src, tgt_mask=mask_targ)
print("shape of the transformer output = ", output.shape) # targ. seq len x batch size x embedding size 

shape of src_input_embedding =  torch.Size([20, 16, 512])
shape of targ_input_embedding =  torch.Size([21, 16, 512])
shape of mask_src =  torch.Size([16, 20])
shape of mask_targ =  torch.Size([21, 21])
shape of the transformer output =  torch.Size([21, 16, 512])


Having seen an example for using Transformer module, let's implement the Transformer model for machine translation:

In [39]:
class Transformer(nn.Module):
  def __init__(self, src_vocab, trg_vocab, embed_dim, nhead, num_encoder_layers, dropout, num_decoder_layers, dim_feedforward, maximum_sentence_len=200):
    super(Transformer, self).__init__()

    # get initial hyper-parameters
    self.src_vocab = src_vocab
    self.trg_vocab = trg_vocab
    self.embed_dim = embed_dim
    self.nhead = nhead
    self.num_encoder_layers = num_encoder_layers
    self.dropout = dropout
    self.num_decoder_layers = num_decoder_layers
    self.dim_feedforward = dim_feedforward

    # add embedding layers
    self.token_embedding_encoder = nn.Embedding(num_embeddings=self.src_vocab, embedding_dim=self.embed_dim)
    self.token_embedding_decoder = nn.Embedding(num_embeddings=self.trg_vocab, embedding_dim=self.embed_dim)
    self.position_embedding_encoder = nn.Embedding(num_embeddings=maximum_sentence_len, embedding_dim=self.embed_dim)
    self.position_embedding_decoder = nn.Embedding(num_embeddings=maximum_sentence_len, embedding_dim=self.embed_dim)
    
    # Encoder-Decoder Transformer
    self.transformer = nn.Transformer(d_model=self.embed_dim, nhead=self.nhead, num_encoder_layers=self.num_encoder_layers, dropout=self.dropout, num_decoder_layers=self.num_decoder_layers, dim_feedforward=self.dim_feedforward)
    
    # output layer to predict next token
    self.decoder = nn.Linear(self.embed_dim, self.trg_vocab)
    self.drop_layer = nn.Dropout()

  def forward(self, src, tgr):
    # read shapes
    # src = src_seq_len x batch_size
    # tgr = targ_seq_len x batch_size
    src_seq_len, batch_size = src.shape
    targ_seq_len, _ = tgr.shape

    # create position input for encoder
    src_position = (torch.arange(0, src_seq_len).unsqueeze(1).expand(src_seq_len, batch_size).to(device))
    # src_position = src_seq_len x batch_size

    # create position input for decoder
    targ_position = (torch.arange(0, targ_seq_len).unsqueeze(1).expand(targ_seq_len, batch_size).to(device))
    # src_position = targ_seq_len x batch_size

    # input embedding by merging token embedding with position embedding
    embed_src = self.drop_layer((self.token_embedding_encoder(src) + self.position_embedding_encoder(src_position)))
    embed_tgr = self.drop_layer((self.token_embedding_decoder(tgr) + self.position_embedding_decoder(targ_position)))
    # embed_src = src_seq_len x batch_size x d_model
    # embed_src = targ_seq_len x batch_size x d_model

    # create mask for source
    mask_src = (src == SRC.vocab.stoi[SRC.pad_token]).transpose(0, 1).to(device)
    # mask_src = batch_size x src_seq_len

    # create mask for target
    mask_targ = self.transformer.generate_square_subsequent_mask(targ_seq_len).to(device)
    # mask_targ = targ_seq_len x targ_seq_len

    # feed via transformer
    output = self.transformer(embed_src, embed_tgr, src_key_padding_mask=mask_src, tgt_mask=mask_targ)
    # output = targ_seq_len x batch_size x d_model

    # transform the output to match no of. tokens in target vocab 
    output = self.decoder(output) 
    # output = targ_seq_len x batch_size x targ_vocab_size

    return output

Let's set the hyperparameters and create a model instance:

In [40]:
# hyperparameters
src_vocab = len(SRC.vocab)
trg_vocab = len(TRG.vocab)
embed_dim = 512
nhead = 4
num_encoder_layers = 2
dropout = 0.1 
num_decoder_layers = 2
dim_feedforward = 512
learning_rate = 1e-4

# model instance
model = Transformer(src_vocab, trg_vocab, embed_dim, nhead, num_encoder_layers, dropout, num_decoder_layers, dim_feedforward).to(device)


Let's define the train logic (for a single epoch), which is very similar to previous seq2seq tutorials

In [41]:
def train(model, iterator, optimizer, criterion):
    manual_seed = 77
    torch.manual_seed(manual_seed)
    if n_gpu > 0:
        torch.cuda.manual_seed(manual_seed)
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.SRC.to(device)
        trg = batch.TRG.to(device)
        # src = src seq len x batch size
        # trg = targ seq len x batch size

        optimizer.zero_grad()
        
        output = model(src, trg[:-1, :]) # for target, provide targ seq len-1 tokens for each sentence
        
        #output = [targ seq len-1, batch size, output dim]

        output = output.reshape(-1, output.shape[2])
        target = trg[1:].reshape(-1)

        # loss function works only 2d logits, 1d targets
        # so flatten the trg, output tensors. Ignore the <sos> token
        # target shape shape should be [(targ seq len - 1) * batch_size]
        # output shape should be [(targ seq len - 1) * batch_size, output_dim]
        loss = criterion(output, target)
        
        loss.backward()

        # Clip to avoid exploding gradient issues, makes sure grads are
        # within a healthy range
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
        
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)


Let's define the inference logic for evaluating the quality of the model based on BLEU score

In [42]:
def inference(model, file_name, src_vocab, trg_vocab, attention = False, max_trg_len = 64):
    '''
    Function for translation inference

    Input: 
    model: translation model;
    file_name: the directoy of test file that the first column is target reference, and the second column is source language;
    trg_vocab: Target torchtext Field
    attention: the model returns attention weights or not.
    max_trg_len: the maximal length of translation text (optinal), default = 64

    Output:
    Corpus BLEU score.
    '''
    from nltk.translate.bleu_score import corpus_bleu
    from nltk.translate.bleu_score import sentence_bleu
    from torchtext.legacy.data import TabularDataset
    from torchtext.legacy.data import Iterator

    # convert index to text string
    def convert_itos(convert_vocab, token_ids):
        list_string = []
        for i in token_ids:
            if i == convert_vocab.vocab.stoi['<eos>']:
                break
            else:
                token = convert_vocab.vocab.itos[i]
                list_string.append(token)
        return list_string

    test = TabularDataset(
      path=file_name, # the root directory where the data lies
      format='tsv',
      skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get proceesed as data!
      fields=[('TRG', trg_vocab), ('SRC', src_vocab)])

    test_iter = Iterator(
      dataset = test, # we pass in the datasets we want the iterator to draw data from
      sort = False, 
      batch_size=1,
      sort_key=None,
      shuffle=False,
      sort_within_batch=False,
      device = device,
      train=False
    )
  
    model.eval()
    all_gold_trg_tokids = []
    all_translated_trg_tokids = []

    TRG_PAD_IDX = trg_vocab.vocab.stoi[trg_vocab.pad_token]

    with torch.no_grad():
    
        for i, batch in enumerate(test_iter):

            src = batch.SRC.to(device)
            #src = [src len, batch size]

            trg = batch.TRG.to(device)

            #src = GOLD_SRC.to(device)
            #trg = GOLD_TRG.to(device)
            #trg = [trg len, batch size]

            batch_size = trg.shape[1]

            outputs = [trg_vocab.vocab.stoi["<sos>"]]
            for i in range(max_trg_len):
                trg_tensor = torch.LongTensor(outputs).unsqueeze(1).to(device)
                
                output = model(src, trg_tensor)

                topv, topi = output[-1,0,:].topk(1)
                cur_decoded_token = topi.squeeze().detach()  # detach from history as input
                outputs.append(cur_decoded_token.item())

                if cur_decoded_token.item() == trg_vocab.vocab.stoi["<eos>"]:
                    break
            all_translated_trg_tokids.append(outputs[1:-1])
            all_gold_trg_tokids.append([ trg[idx, 0].item() for idx in range(1, trg.size(0)-1)])
    
    # convert token ids to token strs
    all_gold_text = []
    all_translated_text = []
    for i in range(len(all_gold_trg_tokids)): 
        all_gold_text.append([[trg_vocab.vocab.itos[idx] for idx in all_gold_trg_tokids[i]]])
        all_translated_text.append([trg_vocab.vocab.itos[idx] for idx in all_translated_trg_tokids[i]])
        
    corpus_bleu_score = corpus_bleu(all_gold_text, all_translated_text)  
    return corpus_bleu_score

Let's define the evaluation logic to compute the quality of the model based on the loss.

In [43]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.SRC.to(device)
            trg = batch.TRG.to(device)

            output = model(src, trg[:-1, :]) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            #output = output[1:].view(-1, output_dim)
            #trg = trg[1:].view(-1)

            output = output.reshape(-1, output.shape[2])
            target = trg[1:].reshape(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, target)

            epoch_loss += loss.item()
            #break
        
    bleu = inference(model, "./drive/MyDrive/Colab Notebooks/eng-fre/val_eng_fre.tsv", SRC, TRG, False, 64)
    return epoch_loss / len(iterator) , bleu

Let's perform the full-training of the Transformer model:

In [44]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# set the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# create the loss function
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
print('<pad> token index: ', TRG_PAD_IDX)
## we will ignore the pad token in true target set
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

# numer of epochs (hyperparameter)
N_EPOCHS = 15

# initial best valid loss
best_valid_loss = float('inf')

# kick-start training
print('Training started...')
for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iter, optimizer, criterion)
    valid_loss, bleu = evaluate(model, val_iter, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # Create checkpoint at end of each epoch
    state_dict_model = model.state_dict() 
    state = {
        'epoch': epoch,
        'state_dict': state_dict_model,
        'optimizer': optimizer.state_dict()
        }

    torch.save(state, "./drive/MyDrive/Colab Notebooks/ckpt_ex1/seq2seq_"+str(epoch+1)+".pt")

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\t Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
    print(f'\t Val. BLEU: {bleu:7.3f}')

# Evaluate the model after the last epoch on the test set
test_loss, bleu_test = evaluate(model, test_iter, criterion)
print(f'\t Test BLEU: {bleu_test:7.3f}')

<pad> token index:  1
Training started...
Epoch: 01 | Time: 1m 43s
	 Train Loss: 4.014 | Train PPL:  55.392
	 Val. Loss: 3.014 |  Val. PPL:  20.364
	 Val. BLEU:   0.148
Epoch: 02 | Time: 1m 33s
	 Train Loss: 3.119 | Train PPL:  22.618
	 Val. Loss: 2.544 |  Val. PPL:  12.733
	 Val. BLEU:   0.236
Epoch: 03 | Time: 1m 33s
	 Train Loss: 2.732 | Train PPL:  15.369
	 Val. Loss: 2.247 |  Val. PPL:   9.459
	 Val. BLEU:   0.292
Epoch: 04 | Time: 1m 36s
	 Train Loss: 2.456 | Train PPL:  11.662
	 Val. Loss: 2.066 |  Val. PPL:   7.894
	 Val. BLEU:   0.306
Epoch: 05 | Time: 1m 33s
	 Train Loss: 2.241 | Train PPL:   9.400
	 Val. Loss: 1.920 |  Val. PPL:   6.819
	 Val. BLEU:   0.348
Epoch: 06 | Time: 1m 32s
	 Train Loss: 2.065 | Train PPL:   7.884
	 Val. Loss: 1.793 |  Val. PPL:   6.007
	 Val. BLEU:   0.369
Epoch: 07 | Time: 1m 32s
	 Train Loss: 1.922 | Train PPL:   6.833
	 Val. Loss: 1.690 |  Val. PPL:   5.420
	 Val. BLEU:   0.393
Epoch: 08 | Time: 1m 35s
	 Train Loss: 1.799 | Train PPL:   6.045
	 V

## Reference:

https://nlp.seas.harvard.edu/2018/04/03/attention.html#position-wise-feed-forward-networks

https://pytorch.org/docs/master/nn.html#transformerencoderlayer

https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#Transformer