# 3 - Neural Machine Translation by Jointly Learning to Align and Translate

In this third notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473). This model achives our best perplexity yet, ~27 compared to ~34 for the previous model.

In the last tut-2 we tries to alleviate information compression by passing the context vector(last hidden state) of encoder at every decoder step. Though the context vector still needs to summarize well. This time, we'll use Attention!.

Attention works by first, calculating an attention vector $a$, that is the length of the source sentence. The attention vector has the property that each element is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states $H$, to get a weighted source vector, $w$.
$$w = \sum_{i}a_ih_i$$
We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction.

## Preparing Data

Preparation is same as previous tut-2.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

sandbox_path = "models/"

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

!python -m spacy download en
!python -m spacy download de

spacy_en = spacy.load('en')
spacy_de = spacy.load('de')
print(spacy_en)

def tokenize_de(text):
  "Tokenize a German language string"
  # Not reversing for this one
  return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]


foo = spacy_en.tokenizer("Hello. how do you do good, sir!")
print(type(foo))
print(foo)
[type(t.text) for t in foo]

SRC = Field(tokenize = tokenize_de, init_token='<sos>', 
            eos_token='<eos>', lower=True)
TRG = Field(tokenize = tokenize_en, init_token='<sos>', 
            eos_token='<eos>', lower=True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")
print(vars(train_data.examples[0]))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
print(f"Unique token in SRC (de) vocab: {len(SRC.vocab)}")
print(f"Unique token in SRC (en) vocab: {len(TRG.vocab)}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/en_core_web_sm
-->
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('de_core_news_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/de_core_news_sm
-->
/usr/local/google/home/ammarh/anaconda3/lib/python3.7/site-packages/spacy/data/de
You can now load the model via spacy.load('de')
<spacy.lang.en.English object at 0x7f9d47666c90>
<class 'spacy.tokens.doc.Doc'>
Hello. how do you do good, sir!
Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im

## Encoder
We use a single layer GRU, however now it is bidirectional. With a bidirectional RNN, we have two RNNs in each layer. A forward RNN going over the embedded sentence from left to right, and a backward RNN going over the embedded sentence from right to left. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$. 

outputs is of size [src len, batch size, hid dim * num directions] where the first hid_dim elements in the third axis are the hidden states from the top layer forward RNN, and the last hid_dim elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together

hidden is of size [n layers * num directions, batch size, hid dim], where [-2, :, :] gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and [-1, :, :] gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).

As the decoder is not bidirectional, it only needs a single context vector $z$ to use as its hidden state $s_0$ and we currently have two (a forward & a backward). We solve this by concatenating the 2 context vectors, passing them through a linear layer and applying a tanh actiovation. **Note:** Paper only uses the first backward RNN hidden state through a linear layer to get decoder initial hidden state.

As we want our model to look back over the whole of the source sentence we return outputs, aka the stacked forward & backward hidden states for every token in the source sentence. We also return hidden, which acts as our initial hidden state in the decoder.






In [2]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout_p):
    super().__init__()

    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
    self.fc = nn.Linear(2 * enc_hid_dim, dec_hid_dim)
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, src):
    # src: [src_len, batch_sz]

    embedded = self.dropout(self.embedding(src)) # [src_len, batch_sz, emb_dim]
    outputs, enc_hidden = self.rnn(embedded)
    # outputs: [src_len, batch_sz, num_directions*hid_dim]
    # enc_hidden: [num_directions, batch_sz, hid_dim]

    # project the final hidden state to get an initial decoder state.
    proj_hidden = torch.tanh(self.fc(
        torch.cat((enc_hidden[0,:,:], enc_hidden[1,:,:]), dim=1))) # [batch_sz, dec_hid_dim]

    return outputs, proj_hidden

## Attention
This will take in the previous hidden state of the decoder $s_{t-1}$ and all of the stacked forward & backward hidden states from the encoder $H$. The layer will output an attention vector $a_t$ that is the length of the source sentence, with each element between 0 & 1 and the entire vector sums to 1.

We implement **Additive Attention** (as described in XCS224N) for this paper. HWs were basic dot product attention. 
First, we calculate the energy between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of T tensors and our previous decoder hidden state is a single tensor, the first thing we do is repeat the previous decoder hidden state T times. We then calculate the energy $E_t$ between them by concatenating them together and passing them through a linear layer (`attn`) and tanh activation. This can be thought of as calculating how well each encoder hidden state "matches" the previous hidden state. 

We then take a time invariant tensor $v$ and compute a dot product with it giving us a [1, src_len] sized tensor. We implement $v$ as a linear layer without a bias. This is done because in theory we can use the `attn` layer to project the concatenated hidden states into some arbitrary dimension `d3` and then dot a vector `v` of size `d3` to get a score. Here we just use `d3` as `dec_hid_dim`. See lecture slide.

In [3]:
class Attention(nn.Module):
  def __init__(self, enc_hid_dim, dec_hid_dim):
    super().__init__()

    self.attn = nn.Linear(2*enc_hid_dim + dec_hid_dim, dec_hid_dim)
    self.v = nn.Linear(dec_hid_dim, 1, bias = False)

  def forward(self, hidden, encoder_outputs):
    # hidden: [batch_sz, dec_hid_dim]
    # encoder_outputs: [src_len, batch_sz, 2*enc_hid_dim]

    batch_sz = encoder_outputs.shape[1]
    src_len = encoder_outputs.shape[0]

    # repeat decoder hidden state for src_len times (tile it)
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1) # [batch_sz, src_len, dec_hid_dim]

    encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch_sz, src_len, 2*hid_dim]

    energy = torch.tanh(self.attn(
        torch.cat((hidden, encoder_outputs), dim = 2))) # [batch_sz, src_len, dec_hid_dim]

    attention = self.v(energy).squeeze(2) # [batch_sz, src_len]

    return F.softmax(attention, dim = 1)


## Decoder

The decoder contains the attention layer, which takes the previous hidden state and all of the encoder hidden states and returns an attention vector, $a_t$.

We then use the attention vector to create a weighted source vector (weighted sum of the encoder hidden states) $w_t$ using $a_t$ as the weights.

We then concatenate the embedded input word and the weighted source vector $w_t$ and pass it to a GRU with the previous hidden state $s_{t-1}$. 

Finally we concatenate the embedded input word, the weighted source vector $w_t$ and the current hidden state $s_t$ and pass it through a linear layer to make a next target word prediction.

In [4]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, enc_hid_dim,
               dec_hid_dim, dropout_p, attention):
    super().__init__()
    self.output_dim = output_dim
    self.attention = attention
    
    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.GRU(2*enc_hid_dim + emb_dim, dec_hid_dim)
    self.fc_out = nn.Linear(2*enc_hid_dim + emb_dim + dec_hid_dim, output_dim)
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, input, hidden, encoder_outputs):
    # input: [batch_sz]
    # hidden: [batch_sz, dec_hid_dim]
    # encoder_outputs: [src_len, batch_sz, 2*enc_hid_dim]

    input = input.unsqueeze(0) # [1, batch_sz]
    embedded = self.dropout(self.embedding(input)) # [1, batch_sz, emb_dim]

    a = self.attention(hidden, encoder_outputs) # [batch_sz, src_len]

    a = a.unsqueeze(1) # [batch_sz, 1, src_len]

    encoder_outputs = encoder_outputs.permute(1,0,2) # [batch_sz, src_len, 2*enc_hid_dim]

    weighted = torch.bmm(a, encoder_outputs) # [batch_sz, 1, 2*enc_hid_dim]

    weighted = weighted.permute(1, 0, 2) # [1, batch_sz, 2*enc_hid_dim]

    rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch_sz, 2*enc_hid_dim + emb_dim]

    output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
    # output: [1, batch_sz, dec_hid_dim]
    # hidden: [1, batch_sz, dec_hid_dim]
    # in this case output == hidden
    assert (output == hidden).all()

    embedded = embedded.squeeze(0)
    output = output.squeeze(0)
    weighted = weighted.squeeze(0)
    prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch_sz, output_dim]

    return prediction, hidden.squeeze(0)

## Seq2Seq Model

This Seq2Seq is similar to the previous two. The only addition is that the encoder returns all its hidden states. These are received and passed forward to the decoder for it to compute attention.

In [5]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device

  def forward(self, src, trg, teacher_forcing_ratio = 0.5):
    # src: [src_len, batch_sz]
    # trg: [trg_len, batch_sz]

    batch_size = src.shape[1]
    trg_len = trg.shape[0]
    trg_vocab_size = self.decoder.output_dim

    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

    encoder_outputs, hidden = self.encoder(src)

    input = trg[0,:]

    for t in range(1, trg_len):
      output, hidden = self.decoder(input, hidden, encoder_outputs)

      outputs[t] = output

      teacher_force = random.random() < teacher_forcing_ratio

      top1 = output.argmax(1)

      input = trg[t] if teacher_force else top1

    return outputs

## Training Code

The rest is very similar to the training code from the previous tutorials.

In [6]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
N_LAYERS = 2
DROPOUT_P = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DROPOUT_P)
dec = Decoder(OUTPUT_DIM, EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DROPOUT_P, attn)

model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.01)

model.apply(init_weights)

# define a function that will calculate the number of trainable parameters in the model.
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model has {count_parameters(model):,} trainable parameters.")

# function to tell us how long an epoch takes.
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs 

optimizer = optim.Adam(model.parameters())

# The CrossEntropyLoss function calculates both the log softmax as well 
# as the negative log-likelihood of our predictions.
# Our loss function calculates the average loss per token, however by passing
# the index of the <pad> token as the ignore_index argument we ignore the loss 
# whenever the target token is a padding token.
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

# Training Loop
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    src = batch.src
    trg = batch.trg # shp: [trg_len, batch_sz]

    optimizer.zero_grad()

    output = model(src, trg) # shp: [trg_len, batch_sz, output_dim]

    output_dim = output.shape[-1]

    output = output[1:].view(-1, output_dim) # shp: [(trg_len-1)*batch_sz, output_dim]
    trg = trg[1:].view(-1) # shp: [(trg_len-1)*batch_sz]

    loss = criterion(output, trg)

    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step()

    epoch_loss += loss.item()

  return epoch_loss / len(iterator)

# Evaluation Loop
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0

  with torch.no_grad():
    for i, batch in enumerate(iterator):
      src = batch.src
      trg = batch.trg # shp: [trg_len, batch_sz]

      output = model(src, trg, teacher_forcing_ratio=0.0)  # shp: [trg_len, batch_sz, output_dim]

      output_dim = output.shape[-1]

      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)

      loss = criterion(output, trg)

      epoch_loss += loss.item()

    return epoch_loss / len(iterator)

N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start_time = time.time()

  train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iterator, criterion)

  end_time = time.time()

  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  if valid_loss < best_valid_loss:
    best_valid_loss = best_valid_loss
    torch.save(model.state_dict(), sandbox_path + 'tut3-model.pt')

  print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s") 
  print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
  print(f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}")

Model has 20,518,661 trainable parameters.
Epoch: 01 | Time: 1m 18s
	Train Loss: 4.917 | Train PPL: 136.631
	 Val. Loss: 4.709 |  Val. PPL: 110.896
Epoch: 02 | Time: 1m 28s
	Train Loss: 3.968 | Train PPL:  52.878
	 Val. Loss: 4.159 |  Val. PPL:  64.039
Epoch: 03 | Time: 2m 15s
	Train Loss: 3.290 | Train PPL:  26.842
	 Val. Loss: 3.624 |  Val. PPL:  37.484
Epoch: 04 | Time: 2m 15s
	Train Loss: 2.785 | Train PPL:  16.192
	 Val. Loss: 3.369 |  Val. PPL:  29.061
Epoch: 05 | Time: 1m 44s
	Train Loss: 2.445 | Train PPL:  11.534
	 Val. Loss: 3.251 |  Val. PPL:  25.818
Epoch: 06 | Time: 1m 23s
	Train Loss: 2.145 | Train PPL:   8.543
	 Val. Loss: 3.224 |  Val. PPL:  25.119
Epoch: 07 | Time: 1m 18s
	Train Loss: 1.915 | Train PPL:   6.784
	 Val. Loss: 3.286 |  Val. PPL:  26.727
Epoch: 08 | Time: 1m 18s
	Train Loss: 1.716 | Train PPL:   5.561
	 Val. Loss: 3.234 |  Val. PPL:  25.378
Epoch: 09 | Time: 1m 18s
	Train Loss: 1.577 | Train PPL:   4.841
	 Val. Loss: 3.272 |  Val. PPL:  26.376
Epoch: 10 | 

In [7]:
# Load the best model with best validation and run the test set
model.load_state_dict(torch.load(sandbox_path + 'tut3-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.286 | Test PPL:  26.744 |
