Implements the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. [\[Tutorial\]](https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb)



In [15]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

# Mount Google drive
# from google.colab import drive
# drive.mount('/content/drive')
# sandbox_path = "drive/My Drive/Colab Notebooks/PyTorch/NLP/bentrevett-tuts/"
sandbox_path = "models/"

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

!python -m spacy download en
!python -m spacy download de


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/home/ammar/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/home/ammar/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/home/ammar/anaconda3/lib/python3.7/site-packages/de_core_news_sm -->
/home/ammar/anaconda3/lib/python3.7/site-packages/spacy/data/de
You can now load the model via spacy.load('de')


## Data Preparation: TorchText

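In the seq2seq paper, the authors reverse the order of the input sentence, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". We do the same here by reversing the German source tokens right after tokenization.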

TorchText's Fields handle how data should be processed. All of the possible arguments are detailed [here](https://torchtext.readthedocs.io/en/latest/data.html#field).
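As a rough, library-free sketch of what the Field settings used in this notebook amount to (tokenize, lowercase, reverse the source, add `<sos>`/`<eos>`), the snippet below uses a plain whitespace split in place of spaCy. `preprocess` is a hypothetical helper written for illustration, not a TorchText function.

```python
def preprocess(text, reverse=False, init_token="<sos>", eos_token="<eos>"):
    """Hypothetical stand-in for a Field: tokenize, lowercase, optionally
    reverse, then wrap the tokens with start/end markers."""
    tokens = text.split()                     # assumption: whitespace split, not spaCy
    tokens = [tok.lower() for tok in tokens]  # mirrors lower=True
    if reverse:
        tokens = tokens[::-1]                 # the source side is reversed, as in the paper
    return [init_token] + tokens + [eos_token]


src_tokens = preprocess("Zwei junge Männer .", reverse=True)
trg_tokens = preprocess("Two young men .")

print(src_tokens)  # ['<sos>', '.', 'männer', 'junge', 'zwei', '<eos>']
print(trg_tokens)  # ['<sos>', 'two', 'young', 'men', '.', '<eos>']
```

Only the tokens are reversed; the `<sos>` and `<eos>` markers still sit at the start and end of the sequence.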

In [16]:
spacy_en = spacy.load('en')
spacy_de = spacy.load('de')
print(spacy_en)

def tokenize_de(text):
  "Tokenize and reverse a German language string"
  return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]


foo = spacy_en.tokenizer("Hello. how do you do good, sir!")
print(type(foo))
print(foo)
[type(t.text) for t in foo]

SRC = Field(tokenize = tokenize_de, init_token='<sos>', 
            eos_token='<eos>', lower=True)
TRG = Field(tokenize = tokenize_en, init_token='<sos>', 
            eos_token='<eos>', lower=True)

<spacy.lang.en.English object at 0x7fdb64d14f90>
<class 'spacy.tokens.doc.Doc'>
Hello. how do you do good, sir!



Next, we download and load the train, validation and test data.

The dataset we'll be using is the [Multi30k](https://github.com/multi30k/dataset) dataset. This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence.

In [17]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")
print(vars(train_data.examples[0]))

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


Next, we'll build the vocabulary for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the source and target languages are distinct.

Using the min_freq argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an <unk> (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, which would give us artificially inflated validation/test scores.
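The effect of a minimum frequency cutoff can be sketched without TorchText. The helpers `build_vocab` and `numericalize` below are hypothetical, and the order of the special tokens is an assumption for illustration; TorchText's own index layout may differ.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map every token seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first, ties broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


# Toy "training set": only tokens seen at least twice make it into the vocabulary.
train_sentences = [["two", "men", "walk"], ["two", "dogs", "run"], ["men", "run"]]
stoi, itos = build_vocab(train_sentences, min_freq=2)

ids = numericalize(["two", "cats", "run"], stoi)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'men', 'run', 'two']
print(ids)   # [6, 0, 5]
```

The token "cats" never appears in the toy training data, so it maps to index 0 (`<unk>`), exactly as a rare or unseen word would at validation or test time.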

In [18]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
print(f"Unique tokens in SRC (de) vocab: {len(SRC.vocab)}")
print(f"Unique tokens in TRG (en) vocab: {len(TRG.vocab)}")

Unique tokens in SRC (de) vocab: 7854
Unique tokens in TRG (en) vocab: 5893



The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a src attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a trg attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of readable tokens to a sequence of corresponding indexes, using the vocabulary.

When we get a batch of examples using an iterator, we need to make sure that all of the source sentences are padded to the same length, and likewise for the target sentences. Luckily, TorchText iterators handle this for us!

We use a BucketIterator instead of the standard Iterator as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.
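The benefit of grouping sentences of similar length can be illustrated with a small stdlib-only sketch. This only captures the core idea (sort by length before batching); the real BucketIterator also shuffles and works on both source and target, and the helper names here are hypothetical.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every sentence in the batch to the length of the longest one."""
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_token] * (max_len - len(sent)) for sent in batch]


def make_batches(sentences, batch_size):
    """Split a list of sentences into consecutive batches."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


def count_padding(batches, pad_token="<pad>"):
    """Total number of padding tokens needed across all batches."""
    return sum(tok == pad_token
               for batch in batches
               for sent in pad_batch(batch, pad_token)
               for tok in sent)


# Toy sentences with lengths 2, 9, 3, 8, 2, 9.
sentences = [["tok"] * n for n in (2, 9, 3, 8, 2, 9)]

naive_pad = count_padding(make_batches(sentences, 2))
bucketed_pad = count_padding(make_batches(sorted(sentences, key=len), 2))

print(naive_pad)     # 19
print(bucketed_pad)  # 5
```

Batching in the original order pairs short sentences with long ones and wastes 19 padding tokens; sorting by length first brings that down to 5.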

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

## Building the Seq2Seq Model

We'll be building the model in three parts: The encoder, the decoder, and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

### Encoder
The encoder is a 2-layer LSTM. The paper uses 4 layers, but we use 2 to save training time. Only the hidden state from the first layer is passed as input to the second layer, not the cell state. Note that the dropout argument to the LSTM controls how much dropout is applied between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used as the input of layer $l+1$.

The RNN returns: outputs (the top-layer hidden state for each time-step), hidden (the final hidden state for each layer, $h_T$, stacked on top of each other) and cell (the final cell state for each layer, $c_T$, stacked on top of each other).

As we only need the final hidden and cell states (to make our context vector), forward only returns hidden and cell.
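The shapes described above can be checked on a toy LSTM. The sizes below are arbitrary assumptions chosen to keep the example small; the random tensor stands in for a batch of embedded source tokens.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration, much smaller than the real model).
src_len, batch_size, emb_dim, hid_dim, n_layers = 7, 3, 8, 16, 2

rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=0.5)
rnn.eval()  # switch off inter-layer dropout so the check below is deterministic

embedded = torch.randn(src_len, batch_size, emb_dim)  # stand-in for embedded tokens

with torch.no_grad():
    outputs, (hidden, cell) = rnn(embedded)

print(outputs.shape)  # torch.Size([7, 3, 16]) -> src_len x batch_size x hid_dim
print(hidden.shape)   # torch.Size([2, 3, 16]) -> n_layers x batch_size x hid_dim
print(cell.shape)     # torch.Size([2, 3, 16]) -> n_layers x batch_size x hid_dim

# The last time-step of `outputs` is the final hidden state of the top layer.
top_layer_matches = torch.allclose(outputs[-1], hidden[-1])
print(top_layer_matches)  # True
```

Because `hidden` and `cell` already contain the final state of every layer, the encoder can discard `outputs` entirely.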

In [20]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout_p):
    super().__init__()
    self.hid_dim = hid_dim

    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout_p)
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, src):
    # src shp: src_len x batch_size
    
    embedded = self.dropout(self.embedding(src)) # shp: src_len x batch_size x emb_dim

    outputs, (hidden, cell) = self.rnn(embedded)

    # outputs are always from the top hidden layer
    # outputs shp: src_len x batch_size x hid_dim*n_directions
    # hidden shp: n_layers*n_directions x batch_size x hid_dim
    # cell shp: n_layers*n_directions x batch_size x hid_dim
    return hidden, cell

### Decoder
The decoder is also a 2-layer LSTM (4 in the paper). The Decoder class does a single step of decoding, i.e. it outputs a single token per time-step. The first layer receives a hidden and cell state from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feeds them through the LSTM with the current embedded token, $y_t$, to produce a new hidden and cell state, $(s_t^1, c_t^1)$. The subsequent layers use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their own layer, $(s_{t-1}^l, c_{t-1}^l)$. This gives equations very similar to those in the encoder. Remember that the initial hidden and cell states of our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$. We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$.

As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We unsqueeze the input tokens to add a sentence length dimension of 1. We then pass the output (after getting rid of the sentence length dimension) through the linear layer to receive our prediction. We then return the prediction, the new hidden state and the new cell state.

Note: as we always have a sequence length of 1, we could use nn.LSTMCell, instead of nn.LSTM, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. nn.LSTMCell is just a single cell and nn.LSTM is a wrapper around potentially multiple cells. Using the nn.LSTMCell in this case would mean we don't have to unsqueeze to add a fake sequence length dimension, but we would need one nn.LSTMCell per layer in the decoder and to ensure each nn.LSTMCell receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular nn.LSTM.
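To make the "one token at a time" idea concrete, the sketch below checks that stepping an nn.LSTM with a sequence length of 1 and carrying (hidden, cell) forward gives the same outputs as a single call over the whole sequence. The toy sizes, the zero initial states (instead of encoder context vectors) and the absence of dropout are assumptions made to keep the comparison exact.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hid_dim, n_layers = 6, 3, 4, 5, 2
rnn = nn.LSTM(emb_dim, hid_dim, n_layers)  # dropout left at 0 for an exact comparison
seq = torch.randn(seq_len, batch_size, emb_dim)

with torch.no_grad():
    # One call over the full sequence (initial states default to zeros).
    full_out, (full_h, full_c) = rnn(seq)

    # The same computation, one time-step per call, as the decoder does.
    h = torch.zeros(n_layers, batch_size, hid_dim)
    c = torch.zeros(n_layers, batch_size, hid_dim)
    step_outs = []
    for t in range(seq_len):
        step_in = seq[t].unsqueeze(0)      # add a sequence dimension of length 1
        out, (h, c) = rnn(step_in, (h, c))
        step_outs.append(out.squeeze(0))   # drop the sequence dimension again
    stepped = torch.stack(step_outs)

outputs_match = torch.allclose(full_out, stepped, atol=1e-5)
states_match = torch.allclose(full_h, h, atol=1e-5) and torch.allclose(full_c, c, atol=1e-5)
print(outputs_match, states_match)
```

Stepping manually costs nothing in correctness, and it is what lets the decoder choose its next input (ground truth or its own prediction) between steps.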



In [21]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout_p):
    super().__init__()

    self.output_dim = output_dim

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout_p)
    self.fc_out = nn.Linear(hid_dim, output_dim)
    self.dropout = nn.Dropout(dropout_p)

  def forward(self, input, hidden, cell):
    # input shp: batch_size
    # hidden shp: n_layers x batch_size x hid_dim
    # cell shp: n_layers x batch_size x hid_dim
    
    input = input.unsqueeze(0) # shp: 1 x batch_size
    embedded = self.dropout(self.embedding(input)) # shp: 1 x batch_size x emb_dim

    output, (hidden, cell) = self.rnn(embedded, (hidden, cell))

    # output shp: 1 x batch_size x hid_dim
    # hidden shp: n_layers x batch_size x hid_dim
    # cell shp: n_layers x batch_size x hid_dim

    prediction = self.fc_out(output.squeeze(0))

    # prediction shp: batch_sz x output_dim

    return prediction, hidden, cell

### Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle:

- receiving the input/source sentence
- using the encoder to produce the context vectors
- using the decoder to produce the predicted output/target sentence

The first thing we do in the forward method is to create an outputs tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, src, into the encoder and receive our final hidden and cell states. We know how long our target sentences are (trg_len in the code), so we loop over the target positions that follow the initial <sos> token. The last token input into the decoder is the one before the <eos> token - the <eos> token is never input into the decoder.

During each iteration of the loop, we:

- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/output in our tensor of predictions, $\hat{Y}$/outputs
- decide if we are going to "teacher force" or not
  - if we do, the next input is the ground-truth next token in the sequence, $y_{t+1}$/trg[t]
  - if we don't, the next input is the predicted next token in the sequence, $\hat{y}_{t+1}$/top1, which we get by doing an argmax over the output tensor

Once we've made all of our predictions, we return our tensor full of predictions.
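The control flow of this loop can be sketched without any neural network. In the snippet below, `toy_step` is a hypothetical stand-in for the decoder (it just adds 10 to its input) and the token ids are made up; the point is only to show which tokens get fed in with and without teacher forcing.

```python
import random


def toy_step(token):
    """Hypothetical 'decoder': deterministically predicts token + 10."""
    return token + 10


def decode(trg, step_fn, teacher_forcing_ratio, rng):
    """Run the decoding loop and record every token fed to the decoder.

    trg is the ground-truth sequence, starting with <sos> and ending with <eos>.
    """
    inputs_fed = []
    predictions = [None]      # position 0 (<sos>) is never predicted
    inp = trg[0]              # first input is always the <sos> token
    for t in range(1, len(trg)):
        inputs_fed.append(inp)
        pred = step_fn(inp)
        predictions.append(pred)
        teacher_force = rng.random() < teacher_forcing_ratio
        inp = trg[t] if teacher_force else pred
    return inputs_fed, predictions


trg = [0, 1, 2, 3, 99]  # assumed ids: 0 = <sos>, 99 = <eos>

fed_forced, preds_forced = decode(trg, toy_step, 1.0, random.Random(0))
fed_free, preds_free = decode(trg, toy_step, 0.0, random.Random(0))

print(fed_forced)  # [0, 1, 2, 3]    -> ground truth is fed, <eos> never is
print(fed_free)    # [0, 10, 20, 30] -> the model's own predictions are fed back
```

With a ratio of 1.0 the decoder always sees the ground truth; with 0.0 it only ever sees its own previous prediction, which is the setting used at evaluation time.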

In [22]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device

  def forward(self, src, trg, teacher_forcing_ratio=0.5):
    # src shp: [src_len, batch_sz]
    # trg shp: [trg_len, batch_sz]

    batch_sz = trg.shape[1]
    trg_len = trg.shape[0]
    trg_vocab_sz = self.decoder.output_dim

    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_sz, trg_vocab_sz).to(self.device)

    # last hidden state of the encoder is used as the initial hidden state of decoder
    hidden, cell = self.encoder(src)
    # first input to decoder is the <sos> token
    input = trg[0,:]

    for t in range(1, trg_len):
      # insert token embedding, previous hidden and previous cell states
      # receive output tensor (predictions) and new hidden & cell states
      output, hidden, cell = self.decoder(input, hidden, cell)

      # place predictions in a tensor 
      outputs[t] = output

      # decide if we are using teacher forcing or not
      teacher_force = random.random() < teacher_forcing_ratio

      # get highest predicted token from our predictions
      top1 = output.argmax(1)

      input = trg[t] if teacher_force else top1

    return outputs     


## Training the Seq2Seq Model
The input and output dimensions are defined by the size of the vocabulary.


In [23]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
DROPOUT_P = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, DROPOUT_P)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DROPOUT_P)
model = Seq2Seq(enc, dec, device).to(device)

In the paper they state they initialize all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

We initialize weights in PyTorch by creating a function which we apply to our model. When using apply, the init_weights function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with nn.init.uniform_.
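How apply visits sub-modules can be seen on a toy model. The variant below is a sketch, not the notebook's function: it uses `parameters(recurse=False)` so that each parameter is sampled exactly once, whereas looping over all nested parameters on every visited module re-samples some of them (harmless here, since every draw comes from the same distribution). The `toy` model and `init_uniform` name are assumptions for illustration.

```python
import torch.nn as nn


def init_uniform(module, bound=0.08):
    """Sample only this module's own parameters from U(-bound, bound)."""
    for param in module.parameters(recurse=False):
        nn.init.uniform_(param.data, -bound, bound)


# Toy container with the same kinds of layers as the seq2seq model.
toy = nn.ModuleList([
    nn.Embedding(10, 4),
    nn.LSTM(4, 6, num_layers=2),
    nn.Linear(6, 10),
])

toy.apply(init_uniform)  # calls init_uniform on every sub-module, then on toy itself

in_range = all(
    -0.08 <= p.min().item() and p.max().item() <= 0.08
    for p in toy.parameters()
)
print(in_range)  # True
```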

In [24]:
def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

# define a function that will calculate the number of trainable parameters in the model.
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model had {count_parameters(model):,} trainable parameters.")

optimizer = optim.Adam(model.parameters())

# The CrossEntropyLoss function calculates both the log softmax as well 
# as the negative log-likelihood of our predictions.
# Our loss function calculates the average loss per token, however by passing
# the index of the <pad> token as the ignore_index argument we ignore the loss 
# whenever the target token is a padding token.
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

Model had 13,898,757 trainable parameters.


At each iteration, we:
- get the source & target sentence from the batch
- zero the gradients calculated from the last batch
- feed the source & target into the model to get output
- flatten the output and target with a .view and calculate the loss. The loss function only works on 2d inputs [batch_sz, num_classes] with 1d targets [batch_sz]. We also slice off the first time step of each, since it only holds the <sos> token (trg) or zeros (output).
- calculate gradients with `loss.backward()`
- clip the gradients to prevent them from exploding.
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total
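The flattening step and the effect of ignore_index can be checked on a small random batch. The token indices used below (1 for `<pad>`, 2 for `<sos>`, 3 for `<eos>`) and the tensor sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

trg_len, batch_sz, output_dim, pad_idx = 5, 2, 7, 1

# Stand-in for the model output: [trg_len, batch_sz, output_dim].
output = torch.randn(trg_len, batch_sz, output_dim)

# Rows are time steps, columns are sentences; the second sentence is shorter.
trg = torch.tensor([[2, 2],
                    [4, 5],
                    [6, 3],
                    [3, 1],
                    [1, 1]])

# Drop the first time step (<sos>), then flatten time and batch together.
flat_output = output[1:].view(-1, output_dim)  # [(trg_len-1)*batch_sz, output_dim]
flat_trg = trg[1:].view(-1)                    # [(trg_len-1)*batch_sz]

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(flat_output, flat_trg)

# Equivalent: average the loss over the non-padding positions only.
keep = flat_trg != pad_idx
manual = nn.CrossEntropyLoss()(flat_output[keep], flat_trg[keep])

print(flat_output.shape, flat_trg.shape)
print(torch.allclose(loss, manual))  # True
```

Padding positions contribute nothing to the loss, and they are also excluded from the denominator of the average.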




In [25]:
def train(model, iterator, optimizer, criterion, clip):

  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    src = batch.src
    trg = batch.trg # shp: [trg_len, batch_sz]

    optimizer.zero_grad()

    output = model(src, trg) # shp: [trg_len, batch_sz, output_dim]

    output_dim = output.shape[-1]

    output = output[1:].view(-1, output_dim) # shp: [(trg_len-1)*batch_sz, output_dim]
    trg = trg[1:].view(-1) # shp: [(trg_len-1)*batch_sz]

    loss = criterion(output, trg)

    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step()

    epoch_loss += loss.item()

  return epoch_loss / len(iterator)

Our evaluation loop is similar to our training loop. However, as we aren't updating any parameters, we don't need to pass an optimizer or a clip value.

We must remember to set the model to `eval()` mode. This will turn off dropout (and batch norm if used).

We use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up.

The iteration loop is similar, without the parameter updates and with teacher forcing turned off.
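The two switches mentioned above can be observed in isolation. This is a minimal sketch on a standalone dropout layer and a toy tensor, not on the seq2seq model itself.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # each element is either zeroed or scaled by 1 / (1 - 0.5) = 2

drop.eval()
y_eval = drop(x)   # dropout is a no-op in eval mode

w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()  # no computation graph is recorded inside the block

print(torch.equal(y_eval, x))  # True
print(y.requires_grad)         # False
```

`eval()` changes how layers such as dropout behave, while `torch.no_grad()` stops autograd from tracking operations; evaluation uses both because they address different things.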

In [26]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0

  with torch.no_grad():
    for i, batch in enumerate(iterator):
      src = batch.src
      trg = batch.trg # shp: [trg_len, batch_sz]

      output = model(src, trg, teacher_forcing_ratio=0.0)  # shp: [trg_len, batch_sz, output_dim]

      output_dim = output.shape[-1]

      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)

      loss = criterion(output, trg)

      epoch_loss += loss.item()

    return epoch_loss / len(iterator)

# function to tell us how long an epoch takes.
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs   

At each epoch, we'll check whether our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called state_dict in PyTorch). Then, when we come to test our model, we'll load the saved parameters that achieved the best validation loss.
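The bookkeeping behind "keep the best checkpoint" can be sketched with plain numbers. The validation losses below are hypothetical values chosen for illustration, and `track_best` is a hypothetical helper that records which epochs would trigger a save; perplexity is simply the exponential of the average per-token loss.

```python
import math


def track_best(valid_losses):
    """Return the best loss and the (1-based) epochs at which a save would happen."""
    best = float("inf")
    saved_epochs = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss  # the running best must be updated on every improvement
            saved_epochs.append(epoch)
    return best, saved_epochs


# Hypothetical validation losses: epoch 3 is worse than epoch 2, so it is not saved.
best, saved = track_best([4.8, 4.5, 4.6, 4.3])
ppl = math.exp(best)

print(best)   # 4.3
print(saved)  # [1, 2, 4]
print(f"{ppl:.3f}")
```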

In [28]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start_time = time.time()

  train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iterator, criterion)

  end_time = time.time()

  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), sandbox_path + 'tut1-model.pt')

  print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s") 
  print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
  print(f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}")
  

Epoch: 01 | Time: 1m 13s
	Train Loss: 4.487 | Train PPL:  88.828
	 Val. Loss: 4.778 |  Val. PPL: 118.896
Epoch: 02 | Time: 1m 12s
	Train Loss: 4.197 | Train PPL:  66.505
	 Val. Loss: 4.662 |  Val. PPL: 105.881
Epoch: 03 | Time: 1m 13s
	Train Loss: 4.003 | Train PPL:  54.747
	 Val. Loss: 4.534 |  Val. PPL:  93.154
Epoch: 04 | Time: 1m 13s
	Train Loss: 3.870 | Train PPL:  47.940
	 Val. Loss: 4.410 |  Val. PPL:  82.259
Epoch: 05 | Time: 1m 16s
	Train Loss: 3.717 | Train PPL:  41.124
	 Val. Loss: 4.378 |  Val. PPL:  79.700
Epoch: 06 | Time: 1m 15s
	Train Loss: 3.574 | Train PPL:  35.665
	 Val. Loss: 4.260 |  Val. PPL:  70.789
Epoch: 07 | Time: 1m 9s
	Train Loss: 3.434 | Train PPL:  31.005
	 Val. Loss: 4.097 |  Val. PPL:  60.181
Epoch: 08 | Time: 0m 38s
	Train Loss: 3.315 | Train PPL:  27.533
	 Val. Loss: 3.988 |  Val. PPL:  53.951
Epoch: 09 | Time: 0m 38s
	Train Loss: 3.199 | Train PPL:  24.498
	 Val. Loss: 3.910 |  Val. PPL:  49.901
Epoch: 10 | Time: 0m 38s
	Train Loss: 3.084 | Train PPL:

In [None]:
# Load the parameters with the best validation loss and evaluate on the test set
model.load_state_dict(torch.load(sandbox_path + 'tut1-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')