# 1 - Sequence to Sequence Learning with Neural Networks

In this series we'll be building a machine learning model to go from one sequence to another, using PyTorch and TorchText. This will be done on German to English translations, but the models can be applied to any problem that involves going from one sequence to another, such as summarization.

In this first notebook, we'll start very simple to understand the general concepts by implementing the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. 

## Introduction

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which (commonly) use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook we'll refer to this single vector as a *context vector*. You can think of the context vector as being some abstract representation of the input sentence in the source language. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating one word at a time. 

![](assets/seq2seq1.png)

The input/source sentence "guten morgen" is input into the encoder (green) one word at a time. We also append a "start of sequence" (`<sos>`) and "end of sequence" (`<eos>`) token to the start and end of the sentence, respectively. At each time-step, the input to the encoder RNN is both the current word, $x_t$, and the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state, $h_t$. You can think of the RNN as a function of both of these inputs:

$$h_t = \text{RNN}(x_t, h_{t-1})$$

We're using the term RNN generally here; it could be any recurrent architecture, such as an *LSTM* or a *GRU*.

Here, we have $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$.
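
To make the recurrence concrete, here is a minimal sketch of an encoder loop built on a toy scalar RNN. The weights `w_x` and `w_h`, the `tanh` update and the numeric stand-ins for the tokens are all assumptions chosen for illustration; a real encoder uses learned embeddings and an LSTM or GRU.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """One step of a toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_prev)."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def encode(xs, h_0=0.0):
    """Feed the inputs through the RNN one at a time and return every hidden state."""
    hidden_states = []
    h = h_0  # initial hidden state, here initialized to zero
    for x_t in xs:
        h = rnn_step(x_t, h)
        hidden_states.append(h)
    return hidden_states

# Hypothetical numeric stand-ins for <sos>, guten, morgen, <eos>
xs = [1.0, 0.3, -0.7, 2.0]
hidden_states = encode(xs)
z = hidden_states[-1]  # the context vector is the final hidden state, z = h_T
```

Each hidden state depends on the current input and on everything seen before it, which is why the final one can serve as a summary of the whole sentence.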

Now that we have our context vector, we can start decoding it to get the output/target sentence, "good morning". Again, we append start and end of sequence tokens. At each time-step, the input to the decoder RNN (blue) is the current word, $y_t$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$. Thus, the initial decoder hidden state is the final encoder hidden state.

At each time-step we also use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. We always use `<sos>` for $y_1$, but for $y_{>1}$ we will sometimes use the actual next word in the sequence, $y_t$, and sometimes use the last predicted word, $\hat{y}_{t-1}$. Feeding in the actual word rather than the model's own prediction is called *teacher forcing*, and you can read more about it [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).

When training/testing our model, we know how many words are in our target sentence, so we stop generating words once we hit that many. During inference (i.e. real-world usage) it is common to keep generating words until the model outputs an `<eos>` token or until a set maximum number of words has been generated.
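
The inference-time stopping rule can be sketched as a simple loop. Everything here is hypothetical: `step_fn` stands in for one decoder step that returns the next token and the new state, and `toy_step` just replays a fixed sentence instead of running a real model.

```python
def greedy_decode(step_fn, state, sos="<sos>", eos="<eos>", max_len=10):
    """Generate tokens until the model emits <eos> or max_len tokens are produced."""
    tokens = []
    token = sos
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == eos:
            break
        tokens.append(token)
    return tokens

def toy_step(token, state):
    """Hypothetical decoder step: replays a canned sentence, then <eos> forever."""
    canned = ["good", "morning", "<eos>"]
    next_token = canned[min(state, len(canned) - 1)]
    return next_token, state + 1

full = greedy_decode(toy_step, state=0)             # stops because of <eos>
cut = greedy_decode(toy_step, state=0, max_len=1)   # stops because of max_len
```

The `max_len` cap matters in practice because an undertrained model may never produce `<eos>` on its own.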

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.
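
As a rough illustration of comparing predictions against targets, the sketch below scores a batch with cross-entropy, the criterion used later in this notebook. The tensor sizes, the target indices and the choice of `0` as the padding index are made up for the example.

```python
import math
import torch
import torch.nn as nn

T, B, V = 4, 2, 5          # target length, batch size, vocabulary size (made-up)
pad_idx = 0                # assumed index of the padding token

# All-zero logits: the "model" assigns probability 1/V to every word
logits = torch.zeros(T, B, V)
targets = torch.tensor([[1, 2], [3, 4], [2, 0], [0, 0]])  # [T, B], 0 = padding

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# CrossEntropyLoss expects [N, V] predictions and [N] targets, so flatten T and B
loss = criterion(logits.view(-1, V), targets.view(-1))
```

With uniform predictions the loss equals $\log V$, and the padded positions do not contribute to the average.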

## Preparing Data

We'll be coding up the models in PyTorch and using TorchText to help us do all of the pre-processing required. We'll also be using spaCy to assist in the tokenization of the data.

In [1]:
import torch
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator
import spacy

The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with around 30,000 parallel English, German and French sentences, each with around 12 words per sentence.

First, we'll download the dataset:

In [2]:
Multi30k.download('data');

Next, we'll create the tokenizers. A tokenizer is used to turn a string containing a sentence into a list of individual words/tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"].

spaCy has a model for each language ("de" for German and "en" for English), which needs to be loaded so we can access its tokenizer.

**Note**: the models must first be downloaded using the commands: 
```
python -m spacy download en
python -m spacy download de
```

We load the models as such:

In [3]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

Next, we create the tokenizer functions. These can be passed to TorchText; each takes in the sentence as a string and returns it as a list of tokens.

In the paper we are implementing, the authors find it beneficial to reverse the order of the input, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier".

In [4]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings and reverses it
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]
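
To see what these functions produce without downloading any spaCy models, here is a small sketch that uses a regular expression as a rough stand-in for the spaCy tokenizer. The names `simple_tokenize` and `simple_tokenize_reversed` are hypothetical, and the regex will not match spaCy's behaviour on harder inputs.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a spaCy tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_tokenize_reversed(text):
    """Tokenize, then reverse the token order, as done for the source sentences."""
    return simple_tokenize(text)[::-1]

tokens = simple_tokenize("good morning!")
reversed_tokens = simple_tokenize_reversed("guten morgen!")

print(tokens)           # ['good', 'morning', '!']
print(reversed_tokens)  # ['!', 'morgen', 'guten']
```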

TorchText's `Field`s handle how data should be processed. You can read all of the possible arguments [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61). 

We set the `tokenize` argument to the correct tokenization function for each, with German being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" and "end of sequence" tokens via the `init_token` and `eos_token` arguments.

In [5]:
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>')
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>')

Next, we load the train, validation and test data. 

`path` specifies the location of the dataset, `exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [6]:
train, valid, test = TranslationDataset.splits(      
  path = 'data/multi30k',  
  exts = ['.de', '.en'],   
  fields = [('src', SRC), ('trg', TRG)],
  train = 'train', 
  validation = 'val', 
  test = 'test2016')

We can double check that we've loaded the right number of examples:

In [7]:
print(f"Number of training examples: {len(train.examples)}")
print(f"Number of validation examples: {len(valid.examples)}")
print(f"Number of testing examples: {len(test.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example, making sure the source sentence is reversed:

In [8]:
print(vars(train.examples[0]))

{'src': ['.', 'Büsche', 'vieler', 'Nähe', 'der', 'in', 'Freien', 'im', 'sind', 'Männer', 'weiße', 'junge', 'Zwei'], 'trg': ['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


Next, we'll build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique word with an index (an integer) and this is used to build a one-hot encoding for each word (a vector of all zeros except for the position represented by the index, which is 1). The vocabularies of the source and target are distinct.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Words that appear only once are converted into an `<unk>` token.

It is important to note that your vocabulary should only be built from the training set and not the validation/test set. 

In [9]:
SRC.build_vocab(train, min_freq=2)
TRG.build_vocab(train, min_freq=2)

In [10]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 8011
Unique tokens in target (en) vocabulary: 6191
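
The idea behind `build_vocab` with `min_freq` can be sketched in plain Python. This is not TorchText's implementation: the helper names, the order of the special tokens and the alphabetical ordering of words are assumptions made for the example.

```python
from collections import Counter

def build_vocab(sentences, min_freq=2, specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(specials) + sorted(tok for tok, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(sentence, stoi):
    """Convert tokens to indices, mapping unknown tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in sentence]

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]

train_sents = [["two", "men", "walk"], ["two", "dogs", "walk"], ["a", "dog"]]
stoi, itos = build_vocab(train_sents)

indices = numericalize(["two", "cats", "walk"], stoi)
print(indices)                         # [4, 0, 5]
print(one_hot(indices[0], len(itos)))  # [0, 0, 0, 0, 1, 0]
```

Only "two" and "walk" appear twice in the toy training data, so every other word, including the unseen "cats", maps to the `<unk>` index.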


The final step of preparing the data is to create the iterators. These can be iterated to return a batch of data which will have a `src` attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a `trg` attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from readable words to their indexes using the vocabulary. 

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, and likewise for the target sentences. Luckily, TorchText handles this padding for us.

We use a `BucketIterator` over the standard `Iterator` as this iterator creates batches in such a way that it minimizes the amount of padding in both the source and target sentences. 

In [11]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test), batch_size=BATCH_SIZE, repeat=False)
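
To illustrate why grouping sentences of similar length reduces padding, the sketch below counts the `<pad>` tokens needed for the same sentences batched in two different orders. Sorting the whole dataset by length is a simplification used for illustration, not a description of how `BucketIterator` works internally, and the helper names are hypothetical.

```python
def pad_batch(batch, pad="<pad>"):
    """Pad every sentence in the batch to the length of the longest one."""
    max_len = max(len(s) for s in batch)
    return [s + [pad] * (max_len - len(s)) for s in batch]

def count_padding(sentences, batch_size):
    """Total number of <pad> tokens needed when batching sentences in the given order."""
    total = 0
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        max_len = max(len(s) for s in batch)
        total += sum(max_len - len(s) for s in batch)
    return total

lengths = [2, 9, 3, 8, 2, 9]
sentences = [["w"] * n for n in lengths]

unsorted_padding = count_padding(sentences, batch_size=2)
bucketed_padding = count_padding(sorted(sentences, key=len), batch_size=2)

print(unsorted_padding)  # 19
print(bucketed_padding)  # 5
```

Less padding means fewer wasted computations per batch, which is the motivation for bucketing.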

## Building the Seq2Seq Model

We'll be building our model in three parts: the encoder, the decoder and a "seq2seq" model that encapsulates the encoder and decoder. We do this so the whole model will simply take a source sentence as an input and output a predicted target sentence.

First, the encoder, a 2-layer LSTM. In the paper they use a 4-layer LSTM, but in the interest of training time, we'll cut this down to 2 layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers.

For a multi-layer RNN, the input sentence goes into the first (bottom) layer of the RNN and hidden states output by this layer are used as inputs to the next layer of RNNs. This means we'll also need an initial hidden state, $h_0$, per layer and we will also output a context vector, $z$, per layer.

Without going into too much detail about LSTMs (see [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) blog post if you want to learn more about them), all we need to know is that they're a type of RNN which, instead of just taking in a hidden state and returning a new hidden state, also takes in and returns a *cell state*, $c$. You can just think of $c$ as another type of hidden state. This means our context vector will be both the final hidden state and the final cell state, i.e. $z = (h_T, c_T)$.
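
The per-layer hidden and cell states can be inspected directly on a small `nn.LSTM`. The sizes below are arbitrary values picked for the example and the input is random noise rather than real embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sent_len, batch_size, emb_dim, hid_dim, n_layers = 5, 3, 8, 16, 2

lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)
embedded = torch.randn(sent_len, batch_size, emb_dim)

# No initial state is passed, so h_0 and c_0 default to zeros for every layer
outputs, (hidden, cell) = lstm(embedded)

print(outputs.shape)  # top-layer hidden state at every time-step
print(hidden.shape)   # final hidden state h_T of each layer
print(cell.shape)     # final cell state c_T of each layer

# The last time-step of `outputs` is the top layer's final hidden state
top_layer_matches = torch.allclose(outputs[-1], hidden[-1])
```

Here `hidden` and `cell` each have one slice per layer, which is what "a context vector per layer" means in practice.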

So our encoder looks something like this: 

![](assets/seq2seq2.png)

We create this in code by making an `Encoder` module, which requires we inherit from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `input_dim` is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in your RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.
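
The relationship between one-hot vectors and the embedding layer can be checked with a short sketch: looking up an index in `nn.Embedding` gives the same result as multiplying the corresponding one-hot vector by the embedding weight matrix. The sizes and token indices below are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

input_dim, emb_dim = 10, 4               # made-up vocabulary size and embedding size
embedding = nn.Embedding(input_dim, emb_dim)

src = torch.tensor([[2, 5], [7, 1]])     # [sent len, batch size] token indices

looked_up = embedding(src)               # [sent len, batch size, emb dim]

# Same result via explicit one-hot vectors multiplied by the embedding matrix
one_hot = F.one_hot(src, num_classes=input_dim).float()
multiplied = one_hot @ embedding.weight  # [sent len, batch size, emb dim]

same = torch.allclose(looked_up, multiplied)
```

In practice we pass the indices straight to the embedding layer, as the one-hot vectors never need to be materialized.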

The embedding layer is created using `nn.Embedding`, the LSTM with `nn.LSTM` and a dropout layer with `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these.

One thing to note is that the `dropout` argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $L$ and those same hidden states being used for the input of layer $L+1$.

In the `forward` method, we pass in the source sentence, which is converted into dense vectors by the embedding layer, and then dropout is applied. These embeddings are then passed into the RNN, which returns: `outputs` (the top-layer hidden state for each time-step/token), `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other) and `cell` (the final cell state for each layer, $c_T$, stacked on top of each other). As we only need the final hidden and cell states (to make our context vector), `forward` only returns `hidden` and `cell`.

The sizes of each of the tensors are left as comments in the code. In this implementation `n_directions` will always be 1; bidirectional RNNs (covered in tutorial 3) will have `n_directions` as 2.

In [12]:
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [sent len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top-layer
        
        return hidden, cell

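Next, we'll build the decoder, which is also a 2-layer LSTM. Unlike the encoder, the `Decoder` processes a single token per call: `forward` takes the current input token together with the previous hidden and cell states, embeds the token, applies dropout, passes it through the LSTM, and then uses a `Linear` layer to turn the LSTM output into a prediction over the target vocabulary. It returns this prediction along with the new hidden and cell states.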

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #sent len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        output = output.squeeze(0)
        
        return self.out(output), hidden, cell

In [None]:
OUTPUT_DIM = len(TRG.vocab)
INPUT_DIM = len(SRC.vocab)
EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
DROPOUT = 0.5

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)

In [None]:
import random

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        
        #src = [sent len, batch size]
        #trg = [sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
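        #first input to the decoder is the <sos> tokens
        output = trg[0,:]
        
        for t in range(1, max_len):
            #run one decoder step; output = [batch size, output dim]
            output, hidden, cell = self.decoder(output, hidden, cell)
            #store the predictions for this time-step
            outputs[t] = output
            #decide whether to use teacher forcing for this time-step
            teacher_force = random.random() < teacher_forcing_ratio
            #index of the highest-scoring token in each prediction
            top1 = output.max(1)[1]
            #next input is the ground-truth token if teacher forcing, else the predicted token
            output = (trg[t] if teacher_force else top1)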

        return outputs
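
The teacher forcing decision in the loop above is a single coin flip per time-step, shared by the whole batch. The sketch below isolates that decision in plain Python; `choose_next_input` and the string tokens are hypothetical stand-ins for the tensors used in the model.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    teacher_force = random.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

random.seed(0)
always = [choose_next_input("gold", "pred", 1.0) for _ in range(100)]
never = [choose_next_input("gold", "pred", 0.0) for _ in range(100)]
mixed = [choose_next_input("gold", "pred", 0.5) for _ in range(1000)]
gold_fraction = mixed.count("gold") / len(mixed)
```

A ratio of 1.0 always feeds in the ground truth and a ratio of 0.0 never does, which is why evaluation passes `0` to turn teacher forcing off.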

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(enc, dec, device).to(device)

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [None]:
pad_idx = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        loss = criterion(output[1:].view(-1, output.shape[2]), trg[1:].view(-1))
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            loss = criterion(output[1:].view(-1, output.shape[2]), trg[1:].view(-1))

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
import math
import os

N_EPOCHS = 25
CLIP = 10

best_valid_loss = float('inf')

#create the checkpoint directory if it doesn't already exist
os.makedirs('.save', exist_ok=True)

for epoch in range(N_EPOCHS):
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '.save/tut1_model.pt')
    
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f} | Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f} |')

In [None]:
model.load_state_dict(torch.load('.save/tut1_model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}')
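
The perplexity (PPL) values printed above are the exponential of the average cross-entropy loss. As a rough reference point, the sketch below computes the perplexity of a hypothetical model that guesses uniformly over the target vocabulary, using the vocabulary size printed earlier in this notebook.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

vocab_size = 6191                    # target vocabulary size printed earlier
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly
uniform_ppl = perplexity(uniform_loss)
```

A uniform guesser has a perplexity equal to the vocabulary size, so a trained model should score well below that, and a perfect model would reach 1.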