# Sequence to Sequence

Sequence-to-sequence (seq2seq) models are built from ordinary RNNs, but they combine two separate RNN structures that work together. The first is called the `Encoder` and the second is called the `Decoder`. The encoder encodes the input and produces a final vector known as the `Context Vector`. The decoder then takes this context vector as input and decodes it to generate the required result. This architecture has many applications in NLP and machine learning, such as machine translation, speech recognition, and image captioning.

For this task, we will use the `Multi30k` dataset from the torchtext library, which yields pairs of raw source-target sentences.

In [1]:
import math
import random
import time
import torch
import torch.nn as nn
from torch.optim import Adam
from typing import Iterable, List
from torch.utils.data import DataLoader
from torchtext.datasets import Multi30k
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator as bvfi

## Tokenization and Vocabulary Building

In [2]:
SRC_LANG = 'de'
TGT_LANG = 'en'
specials = {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}

tokenizer = dict()
vocab = dict()

Create the source and target language tokenizers. Make sure to install the dependencies first.

```
pip install -U torchdata
pip install -U spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

In [3]:
tokenizer[SRC_LANG] = get_tokenizer('spacy', language='de_core_news_sm')
tokenizer[TGT_LANG] = get_tokenizer('spacy', language='en_core_web_sm')

In [4]:
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANG: 0, TGT_LANG: 1}

    for data_sample in data_iter:
        yield tokenizer[language](data_sample[language_index[language]])

In [5]:
for lang in [SRC_LANG, TGT_LANG]:
    train_iterator, valid_iterator, test_iterator = Multi30k()    # (train, valid, test) data iterators
    vocab[lang] = bvfi(yield_tokens(train_iterator, lang), min_freq=1, specials=specials.keys(), special_first=True)

Set the `<UNK>` token index (0 here) as the default index. This index is returned when a token is not found. If it is not set, looking up a token that is not in the vocabulary raises a `RuntimeError`.

In [6]:
for lang in [SRC_LANG, TGT_LANG]:
  vocab[lang].set_default_index(specials['<UNK>'])
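
To make the role of the default index concrete, here is a minimal pure-Python sketch of a vocabulary lookup with an `<UNK>` fallback. The `TinyVocab` class, the `SPECIALS` list, and the whitespace tokenization are illustrative assumptions, not torchtext's implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

SPECIALS = ['<UNK>', '<PAD>', '<SOS>', '<EOS>']  # same order as `specials` above


class TinyVocab:
    """Hypothetical stand-in for a torchtext vocabulary (illustration only)."""

    def __init__(self, token_lists: Iterable[List[str]], min_freq: int = 1):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Special tokens come first, so '<UNK>' gets index 0.
        self.itos: List[str] = list(SPECIALS)
        # Remaining tokens: most frequent first, ties broken alphabetically.
        for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if freq >= min_freq and tok not in SPECIALS:
                self.itos.append(tok)
        self.stoi: Dict[str, int] = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index: Optional[int] = None

    def set_default_index(self, index: int) -> None:
        self.default_index = index

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise RuntimeError(f'Token {token!r} not found and no default index set')
        return self.default_index


sentences = ['a dog runs', 'a cat sleeps']
toy_vocab = TinyVocab(s.split() for s in sentences)
toy_vocab.set_default_index(0)  # index of '<UNK>'

known = toy_vocab['dog']      # a token seen while building the vocabulary
unknown = toy_vocab['zebra']  # an unseen token falls back to the default index
print(known, unknown)
```

In this sketch, removing the `set_default_index` call makes the lookup of `'zebra'` raise a `RuntimeError`, which mirrors the behaviour described above.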

## Defining Seq2seq Model

Each of the three modules below inherits from `torch.nn.Module` and calls `super().__init__()` as boilerplate. The encoder takes the following arguments:

- `input_dim:` Dimension/Size of the one-hot vectors that will be input to the encoder. This is equal to source vocabulary size.
- `emb_dim:` Dimension of the embedding layer. This layer converts the one-hot vectors into dense vectors with emb_dim dimensions.
- `hid_dim:` Dimension of hidden and cell states.
- `n_layers:` Number of layers in RNN.
- `dropout:` Amount of dropout. This is a regularization parameter to prevent overfitting.

In this case, `n_directions` will always be 1, i.e. we only consider unidirectional RNNs in this tutorial. Note, however, that bidirectional RNNs have `n_directions` equal to 2.

In [7]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embed = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        embedding = self.dropout(self.embed(src))  # [len(src), batch_size, emb_dim]
        output, (hidden, cell) = self.rnn(embedding)
        # output = [src len, batch size, hid dim * n directions]
        # hidden = cell = [n layers * n directions, batch size, hid dim]
        return hidden, cell

The arguments and initialization are similar to the Encoder, except we now have an `output_dim` which is the size of the vocabulary for the output/target. There is also the addition of the `Linear` layer, used to make the predictions from the top layer hidden state.

Within the `forward` method, we accept a batch of input tokens, previous hidden states and previous cell states. As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We `unsqueeze` the input tokens to add a sentence length dimension of 1.

**`Note:`** as we always have a sequence length of 1, we could use `nn.LSTMCell`, instead of `nn.LSTM`, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. LSTMCell is just a single cell and LSTM is a wrapper around potentially multiple cells. Using the LSTMCell in this case would mean we don't have to `unsqueeze` to add a fake sequence length dimension, but we would need one LSTMCell per layer in the decoder and to ensure each LSTMCell receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular LSTM.

In [8]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.output_dim = output_dim
        self.embed = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, inputs, hidden, cell):
        inputs = inputs.unsqueeze(0)  # [1, batch_size]
        embedding = self.dropout(self.embed(inputs))  # [1, batch_size, emb_dim]
        output, (hidden, cell) = self.rnn(embedding, (hidden, cell))
        # output = [seq len, batch size, hid dim * n directions]
        # hidden = cell = [n layers * n directions, batch size, hid dim]
        # seq len and n directions will always be 1 in the decoder
        prediction = self.fc(output.squeeze(0))  # [batch size, output dim]
        return prediction, hidden, cell

The Seq2Seq model takes in an Encoder, Decoder, and a device (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the Encoder and Decoder. This is not a general requirement: a sequence-to-sequence model does not necessarily need the same number of layers or the same hidden dimension sizes. However, if we used a different number of layers, we would need to decide how to handle it. For example, if our encoder has 2 layers and our decoder only has 1, do we average the two context vectors output by the encoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer?

Our forward method takes the source sentence, target sentence and a teacher forcing ratio. `teacher_forcing` is used when training our model. When decoding, at each time-step we predict the next token in the target sequence from the previously decoded tokens. With probability equal to the teacher forcing ratio, we use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. With probability `1 - teacher_forcing`, we instead use the token that the model predicted as the next input, even if it doesn't match the actual next token in the sequence.
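
The choice made at each time-step can be sketched in isolation. The snippet below is a pure-Python illustration, not part of the model; `choose_next_inputs` and the toy token ids are hypothetical names and values introduced only for this example.

```python
import random
from typing import List


def choose_next_inputs(targets: List[int], predictions: List[int],
                       teacher_forcing: float, rng: random.Random) -> List[int]:
    """Return the token that would be fed to the decoder after each step."""
    next_inputs = []
    for target_tok, predicted_tok in zip(targets, predictions):
        # Same test as in the model: a fresh random draw per time-step.
        use_teacher = rng.random() < teacher_forcing
        next_inputs.append(target_tok if use_teacher else predicted_tok)
    return next_inputs


rng = random.Random(0)            # seeded only to make the sketch repeatable
targets = [11, 12, 13, 14]        # toy ground-truth token ids
predictions = [21, 22, 23, 24]    # toy model predictions

always_forced = choose_next_inputs(targets, predictions, 1.0, rng)
never_forced = choose_next_inputs(targets, predictions, 0.0, rng)
mixed = choose_next_inputs(targets, predictions, 0.5, rng)
print(always_forced, never_forced, mixed)
```

A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the model's own prediction; values in between mix the two per time-step.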

The first thing we do in the forward method is create an `outputs` tensor that will store all of our predictions. We then feed the source sentence, `src`, into the encoder and get back the final hidden and cell states.

The first input to the decoder is the start-of-sequence (`<SOS>`) token. Our `tgt` tensor is assumed to already have the `<SOS>` token prepended to every sentence, so we get the first decoder input by slicing into it (`tgt[0, :]`). We know how long our target sentences are (`tgt_len`), so we loop over the remaining positions. The last token input into the decoder is the one before the `<EOS>` token; the `<EOS>` token is never input into the decoder.

In [9]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, "Encoder and decoder must have equal number of layers!"
    
    def forward(self, src, tgt, teacher_forcing=0.5):
        batch_size = tgt.shape[1]
        tgt_len = tgt.shape[0]
        tgt_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(tgt_len, batch_size, tgt_vocab_size).to(self.device)
        hidden, cell = self.encoder(src)
        inputs = tgt[0, :]  # First input to the decoder is the <sos> tokens
        
        for target in range(1, tgt_len):
            output, hidden, cell = self.decoder(inputs, hidden, cell)
            outputs[target] = output  # Place predictions in a tensor holding predictions for each token
            teacher_force = random.random() < teacher_forcing  # Decide if we are going to use teacher forcing
            top = output.argmax(1)  # Select the highest predicted token from our predictions
            # If teacher forcing, use the actual next token; otherwise use the predicted one
            inputs = tgt[target] if teacher_force else top 
        return outputs

## Training Seq2seq Model

In [10]:
INPUT_DIM = len(vocab[SRC_LANG])
OUTPUT_DIM = len(vocab[TGT_LANG])
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [11]:
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(encoder, decoder, device).to(device)

Now let's initialize the weights of our model. The paper (see References) states that all weights are initialized from a uniform distribution between -0.08 and +0.08. When using `apply`, the `init_weights` function is called on every module and sub-module within our model. For each module we loop through all of its parameters and sample them from a uniform distribution with `nn.init.uniform_`.

In [12]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embed): Embedding(19214, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embed): Embedding(10837, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=10837, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [13]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,608,853 trainable parameters
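
The reported total can be checked by hand from the layer sizes in the printed model. The sketch below is plain arithmetic; `lstm_params` is a helper introduced here for illustration, and it assumes the usual LSTM layout of four gates, each with input-to-hidden and hidden-to-hidden weights and two bias vectors.

```python
def lstm_params(input_size: int, hidden_size: int, n_layers: int) -> int:
    """Count the parameters of a stacked unidirectional LSTM."""
    total = 0
    for layer in range(n_layers):
        layer_input = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (layer_input + hidden_size)  # W_ih and W_hh
        biases = 2 * 4 * hidden_size                             # b_ih and b_hh
        total += weights + biases
    return total


src_vocab, tgt_vocab = 19214, 10837   # embedding sizes from the printed model
emb_dim, hid_dim, n_layers = 256, 512, 2

encoder_params = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
decoder_params = (tgt_vocab * emb_dim
                  + lstm_params(emb_dim, hid_dim, n_layers)
                  + hid_dim * tgt_vocab + tgt_vocab)  # Linear weight + bias

total_params = encoder_params + decoder_params
print(f'{total_params:,}')
```

Dropout layers have no parameters, so they do not contribute to the count.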


Next we define our optimizer and loss function. The `CrossEntropyLoss` function calculates both the log softmax and the negative log-likelihood of our predictions. Our loss function calculates the average loss per token; however, by passing the index of the `<PAD>` token as the `ignore_index` argument, we ignore the loss whenever the target token is a padding token.

In [14]:
optimizer = Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=specials['<PAD>'])
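
As a rough illustration of what this loss computes, here is a pure-Python sketch of cross-entropy with an ignored index. It is a simplified stand-in for the behaviour described above, not PyTorch's implementation; `cross_entropy_ignore_pad` and the toy logits are assumptions made for the example.

```python
import math
from typing import List

PAD_IDX = 1  # index of '<PAD>' in `specials`


def cross_entropy_ignore_pad(logits: List[List[float]], targets: List[int],
                             ignore_index: int = PAD_IDX) -> float:
    """Mean negative log-likelihood over tokens whose target is not `ignore_index`."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        # Numerically stable log-sum-exp, i.e. the normalizer of log softmax.
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_sum_exp - row[target]  # -log softmax(row)[target]
        count += 1
    return total / count if count else float('nan')


toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction, target 2
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction, target 3
    [5.0, -5.0, 0.0, 0.0],  # target is <PAD>, so this row is skipped
]
toy_targets = [2, 3, PAD_IDX]
toy_loss = cross_entropy_ignore_pad(toy_logits, toy_targets)
print(round(toy_loss, 4))
```

Because the padded position is skipped, the average is taken over the two real tokens only.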

`model.train()` sets our model to training mode, which turns on dropout (and batch normalization, which we aren't using). We then iterate through our data. Because our decoder loop starts at 1, not 0, the 0th element of our `outputs` tensor remains all zeros. When we calculate the loss, we therefore cut off the first element of both the output and target tensors. `loss.backward()` calculates the gradients, and we clip them to prevent them from exploding.
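
The idea behind clipping by norm can be sketched without tensors. The following pure-Python example rescales a set of gradients so that their combined L2 norm does not exceed a maximum; `clip_by_global_norm` is a hypothetical helper that only mimics the general approach, whereas the real `clip_grad_norm_` works in place on parameter tensors.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float,
                        eps: float = 1e-6) -> float:
    """Scale all gradients in place so their joint L2 norm is at most `max_norm`.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm


large_grads = [[3.0, 4.0], [0.0]]          # joint norm is 5
norm_before = clip_by_global_norm(large_grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in large_grads for g in grad))

small_grads = [[0.3, 0.4]]                 # joint norm is 0.5, below the limit
clip_by_global_norm(small_grads, max_norm=1.0)
print(norm_before, norm_after, small_grads)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its magnitude is limited.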

In [15]:
train_dataloader = DataLoader(train_iterator, batch_size=BATCH_SIZE)
valid_dataloader = DataLoader(valid_iterator, batch_size=BATCH_SIZE)
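
The model's `forward` reads `tgt.shape`, so it expects each batch to be an integer tensor of shape `[seq_len, batch_size]` that begins with `<SOS>`. The cells above do not show that conversion, and a batch of raw strings would fail at that point. The following is a minimal pure-Python sketch of what such a step might look like; `numericalize_and_pad`, `toy_stoi`, and the whitespace tokenizer are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Sequence

UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3  # indices from `specials` above


def numericalize_and_pad(sentences: Sequence[str],
                         tokenize: Callable[[str], List[str]],
                         stoi: Dict[str, int]) -> List[List[int]]:
    """Turn raw sentences into a time-major [seq_len, batch_size] grid of token ids."""
    sequences = []
    for sentence in sentences:
        ids = [stoi.get(tok, UNK_IDX) for tok in tokenize(sentence)]
        sequences.append([SOS_IDX] + ids + [EOS_IDX])  # wrap with <SOS> ... <EOS>
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence to the longest one in the batch.
    padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch_size, seq_len] to [seq_len, batch_size].
    return [list(step) for step in zip(*padded)]


toy_stoi = {'a': 4, 'dog': 5, 'runs': 6}  # toy string-to-index mapping
toy_batch = numericalize_and_pad(['a dog runs', 'a cat'], str.split, toy_stoi)
for step in toy_batch:
    print(step)
```

In a real pipeline, logic like this would typically live in a `collate_fn` passed to `DataLoader`, applied to the source and target sentences separately, with each resulting grid wrapped in `torch.tensor(...)`.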

In [16]:
def train(model, dataloader, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    batch = 0
    for src, tgt in dataloader:
#         src = src.to(device)  # [len(src), batch_size]
#         tgt = tgt.to(device)  # [len(tgt), batch_size]
        optimizer.zero_grad()
        output = model(src, tgt)  # [len(tgt), batch_size, output_dim]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        tgt = tgt[1:].view(-1)
        loss = criterion(output, tgt)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
        batch += 1
    return epoch_loss / batch

To turn evaluation mode on, we use `model.eval()`. This turns dropout (and batch normalization) off. We use the `torch.no_grad()` block to ensure no gradients are calculated within the block, which reduces memory consumption and speeds things up. The iteration loop is similar to the training loop, except that we are not updating parameters and teacher forcing is turned off. This causes the model to use only its own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [17]:
def evaluate(model, dataloader, criterion):
    model.eval()
    epoch_loss = 0
    batch = 0
    with torch.no_grad():
        for src, tgt in dataloader:
#             src = src.to(device)  # [len(src), batch_size]
#             tgt = tgt.to(device)  # [len(tgt), batch_size]
            output = model(src, tgt, 0)  # Teacher forcing is turned off
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)  # [(len(tgt) - 1) * batch size, output_dim]
            tgt = tgt[1:].view(-1)  # Shape = [(len(tgt) - 1) * batch size]
            loss = criterion(output, tgt)
            epoch_loss += loss.item()
            batch += 1
    return epoch_loss / batch

Let's create a function to track the time taken by each epoch during the training process.

In [18]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, it's training time. At each epoch, we check whether our model has achieved the best validation loss so far. If it has, we update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we use the saved parameters that achieved the best validation loss.

In [19]:
EPOCHS = 10
CLIP = 1

In [20]:
best_valid_loss = float('inf')
for epoch in range(EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_dataloader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_dataloader, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), './../models/seq2seq.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}:{epoch_secs} | Train Loss: {train_loss:.3f} | Val Loss: {valid_loss:.3f}')

AttributeError: 'tuple' object has no attribute 'shape'

In [None]:
model.load_state_dict(torch.load('./../models/seq2seq.pt'))
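test_dataloader = DataLoader(test_iterator, batch_size=BATCH_SIZE)  # batch the test split like train/valid
test_loss = evaluate(model, test_dataloader, criterion)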

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}')
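
The perplexity printed above is simply the exponential of the average per-token loss. As a small worked example, the sketch below shows what a model guessing uniformly over the target vocabulary would score; the `perplexity` helper is introduced here for illustration, and the vocabulary size is the one from the printed model.

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Convert an average per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_token_loss)


tgt_vocab_size = 10837
uniform_loss = math.log(tgt_vocab_size)   # loss of a uniform guess over the vocabulary
uniform_ppl = perplexity(uniform_loss)    # equals the vocabulary size, up to rounding

print(f'Uniform-guess loss: {uniform_loss:.3f} | PPL: {uniform_ppl:7.3f}')
```

A perplexity well below the vocabulary size therefore indicates that the model is doing better than uniform guessing.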

## References

- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)
- [Language Translation with nn, transformer and torchtext](https://pytorch.org/tutorials/beginner/translation_transformer.html)