# 2 - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

In this second notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078). This model will achieve improved test perplexity whilst only using a single layer RNN in both the encoder and the decoder. [[Tutorial](https://github.com/bentrevett/pytorch-seq2seq/blob/master/2%20-%20Learning%20Phrase%20Representations%20using%20RNN%20Encoder-Decoder%20for%20Statistical%20Machine%20Translation.ipynb)]

## Data Loading

Data loading is nearly identical to the first notebook (tut-1). The one difference is that this paper does not reverse the source sentences.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time


sandbox_path = "models/"

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

!python -m spacy download en
!python -m spacy download de

spacy_en = spacy.load('en')
spacy_de = spacy.load('de')
print(spacy_en)

def tokenize_de(text):
  "Tokenize a German language string"
  # Not reversing for this one
  return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]


foo = spacy_en.tokenizer("Hello. how do you do good, sir!")
print(type(foo))
print(foo)
[type(t.text) for t in foo]

SRC = Field(tokenize = tokenize_de, init_token='<sos>', 
            eos_token='<eos>', lower=True)
TRG = Field(tokenize = tokenize_en, init_token='<sos>', 
            eos_token='<eos>', lower=True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")
print(vars(train_data.examples[0]))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
print(f"Unique tokens in SRC (de) vocab: {len(SRC.vocab)}")
print(f"Unique tokens in TRG (en) vocab: {len(TRG.vocab)}")

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/home/ammar/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/home/ammar/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/home/ammar/anaconda3/lib/python3.7/site-packages/de_core_news_sm -->
/home/ammar/anaconda3/lib/python3.7/site-packages/spacy/data/de
You can now load the model via spacy.load('de')
<spacy.lang.en.English object at 0x7f3583ba5e10>
<class 'spacy.tokens.doc.Doc'>
Hello. how do you do good, sir!
Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two
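
The `build_vocab(train_data, min_freq=2)` calls above keep only tokens that occur at least twice in the training data; rarer tokens are looked up as `<unk>`. The sketch below mimics that behaviour in plain Python. It is an approximation for illustration only, not torchtext's implementation: the helper names `build_vocab` and `numericalize`, the toy corpus, and the ordering of the special tokens are all assumptions.

```python
from collections import Counter

# Assumed ordering of the special tokens, for illustration only.
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(sentences, min_freq=2):
    """Map each token that occurs at least `min_freq` times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}

def numericalize(sentence, stoi):
    """Wrap a sentence in <sos>/<eos> and replace unknown tokens with <unk>."""
    wrapped = ["<sos>"] + sentence + ["<eos>"]
    return [stoi.get(tok, stoi["<unk>"]) for tok in wrapped]

# Hypothetical two-sentence corpus: "männer" and "frauen" each occur only once.
toy_corpus = [
    ["zwei", "männer", "."],
    ["zwei", "frauen", "."],
]
stoi = build_vocab(toy_corpus, min_freq=2)
ids = numericalize(["zwei", "kinder", "."], stoi)
print(stoi)
print(ids)
```

In this toy run, "kinder" never appears in the corpus and "männer" appears only once, so both would map to the `<unk>` index.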

## Encoder

Single layer GRU.

In [2]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, dropout_p):
    super().__init__()

    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim, hid_dim) # no dropout argument: nn.GRU only applies dropout between layers, and there is only 1 layer
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, src):
    # src shp: [src_len, batch_sz]
    embedded = self.dropout(self.embedding(src)) # shp: [src_len, batch_sz, emb_dim]
    outputs, hidden = self.rnn(embedded)
    # outputs shp: [src_len, batch_sz, hid_dim]
    # outputs are always from the top hidden layer
    # hidden shp: [1, batch_sz, hid_dim]
    return hidden
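
The shape comments in `Encoder.forward` can be checked with a small, self-contained sketch. It uses a bare `nn.GRU` on random tensors rather than the `Encoder` class itself, and the sizes are toy values chosen only for illustration. It also shows why returning only `hidden` loses nothing here: for a single-layer, unidirectional GRU, the last time-step of `outputs` holds the same values as the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
src_len, batch_sz, emb_dim, hid_dim = 7, 3, 8, 16

gru = nn.GRU(emb_dim, hid_dim)                      # single layer, unidirectional
embedded = torch.randn(src_len, batch_sz, emb_dim)  # stand-in for embedded source tokens

outputs, hidden = gru(embedded)

print(outputs.shape)  # one hidden state per time-step: [src_len, batch_sz, hid_dim]
print(hidden.shape)   # final hidden state only:        [1, batch_sz, hid_dim]

# The final time-step of `outputs` matches the final hidden state.
same = torch.allclose(outputs[-1], hidden[0])
print(same)
```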

## Decoder
The Decoder is where the implementation differs from the original Seq2Seq from tut-1.

Instead of the GRU in the decoder taking just the embedded target token, $d(y_t)$, and the previous hidden state, $s_{t-1}$, as inputs, it also takes the context vector, $z$.

$$s_t = \text{DecoderGRU}(d(y_t), s_{t-1}, z)$$

Note how this context vector, $z$, does not have a $t$ subscript: we re-use the same context vector returned by the encoder at every time-step in the decoder.

Before, we predicted the next token, $\hat{y}_{t+1}$, with the linear layer, $f$, using only the top-layer decoder hidden state at that time-step, $s_t$, as $\hat{y}_{t+1}=f(s_t^L)$. Now, we also pass the embedding of the current token, $d(y_t)$, and the context vector, $z$, to the linear layer.

$$\hat{y}_{t+1} = f(d(y_t), s_t, z)$$

Hypothetically, the decoder hidden states, $s_t$, no longer need to contain information about the source sequence, as it is always available as an input. Thus, they only need to contain information about which tokens have been generated so far. The addition of $y_t$ to the linear layer also means this layer can directly see what the current token is, without having to get this information from the hidden state. This is only a hypothesis; we cannot be sure what the model actually learns. Nevertheless, it is a solid intuition, and the results seem to indicate that these modifications are a good idea!

In [3]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, dropout_p):
    super().__init__()
    self.output_dim = output_dim
    
    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
    self.fc_out = nn.Linear(emb_dim + 2*hid_dim, output_dim)
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, input, hidden, context):
    # input shp: [batch_sz]
    # hidden shp: [1, batch_sz, hid_dim]
    # context shp: [1, batch_sz, hid_dim]

    input = input.unsqueeze(0) # [1, batch_sz]
    embedded = self.dropout(self.embedding(input)) # [1, batch_sz, emb_dim]
    embedded_concat = torch.cat((embedded, context), dim = 2) # [1, batch_sz, emb_dim + hid_dim]

    output, hidden = self.rnn(embedded_concat, hidden)
    # output shp: [1, batch_sz, hid_dim]
    # hidden shp: [1, batch_sz, hid_dim]

    output_concat = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)),
                              dim = 1) # [batch_sz, emb_dim + 2*hid_dim]
    prediction = self.fc_out(output_concat) # [batch_sz, output_dim]
    
    return prediction, hidden
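
To make the two concatenations in `Decoder.forward` concrete, the following sketch runs one decoding step on random tensors. It is a minimal illustration under assumed toy sizes, using bare `nn.GRU` and `nn.Linear` modules in place of the `Decoder` class; the variable names `rnn_input` and `fc_input` are introduced here for readability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
batch_sz, emb_dim, hid_dim, vocab_sz = 3, 8, 16, 20

rnn = nn.GRU(emb_dim + hid_dim, hid_dim)             # GRU input is [d(y_t); z]
fc_out = nn.Linear(emb_dim + 2 * hid_dim, vocab_sz)  # linear input is [d(y_t); s_t; z]

embedded = torch.randn(1, batch_sz, emb_dim)  # d(y_t), the embedded current token
context = torch.randn(1, batch_sz, hid_dim)   # z, the same tensor at every step
hidden = context                              # the first decoder hidden state is z

# Concatenate along the feature dimension before the GRU.
rnn_input = torch.cat((embedded, context), dim=2)  # [1, batch_sz, emb_dim + hid_dim]
output, hidden = rnn(rnn_input, hidden)            # both [1, batch_sz, hid_dim]

# Drop the length-1 sequence dimension, then concatenate for the linear layer.
fc_input = torch.cat(
    (embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1
)                                                  # [batch_sz, emb_dim + 2*hid_dim]
prediction = fc_out(fc_input)                      # [batch_sz, vocab_sz]

print(rnn_input.shape, fc_input.shape, prediction.shape)
```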

## Seq2Seq Model

Steps:
- outputs tensor is created to hold all predictions, $\hat{Y}$
- source sequence is fed into encoder to receive a context vector.
- initial decoder hidden state is set to context vector
- get a batch of `<sos>` tokens as the first input.
- in a decoding loop:
  - insert input token, previous hidden state & context vector into decoder.
  - receive prediction & new hidden state
  - use teacher forcing to decide whether the next input is the ground-truth token or the model's prediction
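
The decoding loop in the steps above can be sketched without any neural network. In the toy version below, `predict_next` is a hypothetical stand-in for one decoder step, and the function returns the inputs that were fed to it, so the effect of the teacher forcing ratio is visible. As in the model code, one random draw is made per time-step.

```python
import random

def decode(target, predict_next, teacher_forcing_ratio, rng):
    """Run a toy decoding loop and return the inputs that were fed to the decoder.

    `target` is the ground-truth sequence including the leading <sos> token.
    `predict_next` is a hypothetical stand-in for one decoder step.
    """
    inputs_fed = []
    inp = target[0]                     # the first input is always <sos>
    for t in range(1, len(target)):
        inputs_fed.append(inp)
        prediction = predict_next(inp)  # the "model's" guess for position t
        teacher_force = rng.random() < teacher_forcing_ratio
        # Next input: ground truth if teacher forcing, else the model's own guess.
        inp = target[t] if teacher_force else prediction
    return inputs_fed

target = ["<sos>", "two", "young", "men", "<eos>"]  # illustrative target sentence

def always_wrong(token):
    """A deliberately bad 'model' that always predicts the same token."""
    return "???"

forced = decode(target, always_wrong, 1.0, random.Random(0))
free = decode(target, always_wrong, 0.0, random.Random(0))
print(forced)  # ground-truth tokens are fed back in
print(free)    # the model's own (wrong) predictions are fed back in
```

With a ratio of 1.0 the decoder only ever sees ground-truth tokens; with 0.0 it sees its own predictions after the first step, which is the setting used for evaluation.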


In [4]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device

  def forward(self, src, trg, teacher_forcing_ratio = 0.5):
    # src: [src_len, batch_sz]
    # trg: [trg_len, batch_sz]
    
    batch_size = src.shape[1]
    trg_len = trg.shape[0]
    trg_vocab_sz = self.decoder.output_dim

    outputs = torch.zeros(trg_len, batch_size, trg_vocab_sz).to(self.device)

    context = self.encoder(src)

    hidden = context

    input = trg[0,:]

    for t in range(1, trg_len):
      output, hidden = self.decoder(input, hidden, context)

      outputs[t] = output

      teacher_force = random.random() < teacher_forcing_ratio

      top1 = output.argmax(1)

      input = trg[t] if teacher_force else top1

    return outputs
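
With the encoder, decoder, and wrapper defined, the size of the model can be worked out by hand. The sketch below counts parameters from the layer shapes, using the embedding and hidden sizes set in the training cell (256 and 512). The vocabulary sizes are hypothetical: they are not shown in this notebook's output and were chosen here so that the total agrees with the 14,220,037 trainable parameters reported in the training log.

```python
def gru_params(input_size, hidden_size):
    """Parameters of a single-layer GRU: 3 gates, each with input weights,
    recurrent weights, and two bias vectors (as in nn.GRU)."""
    weights = 3 * hidden_size * (input_size + hidden_size)
    biases = 2 * 3 * hidden_size
    return weights + biases

def seq2seq_params(src_vocab, trg_vocab, emb_dim, hid_dim):
    """Total parameters of the encoder and decoder described in this notebook."""
    encoder = src_vocab * emb_dim + gru_params(emb_dim, hid_dim)
    decoder = (
        trg_vocab * emb_dim                       # target embedding
        + gru_params(emb_dim + hid_dim, hid_dim)  # GRU sees d(y_t) and z
        + (emb_dim + 2 * hid_dim) * trg_vocab     # fc_out weights
        + trg_vocab                               # fc_out bias
    )
    return encoder + decoder

# Hypothetical vocabulary sizes (assumption, not printed in the notebook).
total = seq2seq_params(src_vocab=7854, trg_vocab=5893, emb_dim=256, hid_dim=512)
print(f"{total:,}")
```

Most of the parameters sit in `fc_out`, whose input width is `emb_dim + 2*hid_dim` because it receives the token embedding, the hidden state, and the context vector.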
    

## Training the Seq2Seq Model

The training part is mostly the same as tut-1. The paper states the parameters are initialized from a normal distribution with a mean of 0 and a standard deviation of 0.01, i.e. $\mathcal{N}(0, 0.01)$. 
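
As a quick sanity check of this initialization, the sketch below applies the same `nn.init.normal_` call to a stand-in `nn.Linear` layer and measures the resulting statistics. The layer and its sizes are assumptions chosen for illustration, and the second argument of $\mathcal{N}(0, 0.01)$ is read as the standard deviation. Note that looping over `named_parameters()` touches biases as well as weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(256, 512)  # stand-in module; sizes chosen for illustration

for name, param in layer.named_parameters():
    # Overwrite every parameter (weights and biases) with draws from N(0, 0.01^2).
    nn.init.normal_(param.data, mean=0.0, std=0.01)

weight_mean = layer.weight.mean().item()
weight_std = layer.weight.std().item()
print(f"mean={weight_mean:.5f}, std={weight_std:.5f}")
```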

In [6]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
# Both GRUs are single-layer, so no N_LAYERS setting is needed.
DROPOUT_P = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, DROPOUT_P)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DROPOUT_P)
model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.01)

model.apply(init_weights)

# define a function that will calculate the number of trainable parameters in the model.
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model has {count_parameters(model):,} trainable parameters.")

# function to tell us how long an epoch takes.
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs 

optimizer = optim.Adam(model.parameters())

# CrossEntropyLoss combines log-softmax and negative log-likelihood in one call.
# It averages the loss over tokens; passing the index of the <pad> token as
# ignore_index excludes positions whose target is padding from that average.
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

# Training Loop
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    src = batch.src
    trg = batch.trg # shp: [trg_len, batch_sz]

    optimizer.zero_grad()

    output = model(src, trg) # shp: [trg_len, batch_sz, output_dim]

    output_dim = output.shape[-1]

    output = output[1:].view(-1, output_dim) # shp: [(trg_len-1)*batch_sz, output_dim]
    trg = trg[1:].view(-1) # shp: [(trg_len-1)*batch_sz]

    loss = criterion(output, trg)

    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step()

    epoch_loss += loss.item()

  return epoch_loss / len(iterator)

# Evaluation Loop
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0

  with torch.no_grad():
    for i, batch in enumerate(iterator):
      src = batch.src
      trg = batch.trg # shp: [trg_len, batch_sz]

      output = model(src, trg, teacher_forcing_ratio=0.0)  # shp: [trg_len, batch_sz, output_dim]

      output_dim = output.shape[-1]

      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)

      loss = criterion(output, trg)

      epoch_loss += loss.item()

    return epoch_loss / len(iterator)

N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start_time = time.time()

  train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iterator, criterion)

  end_time = time.time()

  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), sandbox_path + 'tut2-model.pt')

  print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s") 
  print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
  print(f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}")

Model has 14,220,037 trainable parameters.
Epoch: 01 | Time: 0m 41s
	Train Loss: 5.058 | Train PPL: 157.286
	 Val. Loss: 4.905 |  Val. PPL: 134.951
Epoch: 02 | Time: 0m 42s
	Train Loss: 4.255 | Train PPL:  70.445
	 Val. Loss: 4.767 |  Val. PPL: 117.549
Epoch: 03 | Time: 0m 42s
	Train Loss: 3.845 | Train PPL:  46.742
	 Val. Loss: 4.393 |  Val. PPL:  80.844
Epoch: 04 | Time: 1m 2s
	Train Loss: 3.456 | Train PPL:  31.690
	 Val. Loss: 4.002 |  Val. PPL:  54.731
Epoch: 05 | Time: 1m 16s
	Train Loss: 3.092 | Train PPL:  22.013
	 Val. Loss: 3.841 |  Val. PPL:  46.583
Epoch: 06 | Time: 1m 15s
	Train Loss: 2.803 | Train PPL:  16.486
	 Val. Loss: 3.749 |  Val. PPL:  42.475
Epoch: 07 | Time: 1m 15s
	Train Loss: 2.549 | Train PPL:  12.788
	 Val. Loss: 3.586 |  Val. PPL:  36.092
Epoch: 08 | Time: 1m 19s
	Train Loss: 2.347 | Train PPL:  10.459
	 Val. Loss: 3.576 |  Val. PPL:  35.719
Epoch: 09 | Time: 1m 18s
	Train Loss: 2.144 | Train PPL:   8.534
	 Val. Loss: 3.565 |  Val. PPL:  35.343
Epoch: 10 | T
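
The PPL columns in the log are simply the exponential of the corresponding loss, since the loss is an average per-token cross-entropy in nats. The sketch below recomputes two of the logged values from epoch 9; the helper name `perplexity` is introduced here for illustration. Small differences from the log are expected because the printed losses are rounded to three decimals.

```python
import math

def perplexity(avg_loss):
    """Perplexity is the exponential of the average per-token cross-entropy."""
    return math.exp(avg_loss)

# Losses copied from the epoch 9 lines of the log above.
train_ppl = perplexity(2.144)
valid_ppl = perplexity(3.565)
print(f"Train PPL: {train_ppl:7.3f}")
print(f" Val. PPL: {valid_ppl:7.3f}")
```

A perplexity of about 35 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 35 tokens at each step.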

In [None]:
# Load the checkpoint with the best validation loss and evaluate it on the test set
model.load_state_dict(torch.load(sandbox_path + 'tut2-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')