<a href="https://colab.research.google.com/github/almostimplemented/seq2seq_shenanigans/blob/main/Silly%20Seq2Seq%20Shenanigans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Seq2Seq to learn a simple integer subsequence code

This is a small experiment in training a sequence-to-sequence ("seq2seq") model to learn a prescribed mapping between two sequences of decimal digits.

In [2]:
import math
import time
import random

import torch
from torch import nn
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Here we define the mapping.

The input $X$ is constructed as follows:

$$X = (idx, len, x_0, \ldots, x_N)$$

The output is a contiguous subsequence of the digits $x_0, \ldots, x_N$, selected by the zero-based start index $idx$ and the length $len$:

$$Y = (x_{idx}, \ldots, x_{idx + len - 1}, 0, \ldots, 0)$$

The tail of the output is padded with zeros so that $Y$ always has $N + 1$ elements.

In [3]:
def generate_example(l=10):
    # x = [start index, length] followed by l - 2 random digits in 1..9
    # (requires l >= 9 so that the range for sub_len is never empty)
    sub_start_idx = random.randint(0, 5)
    sub_len = random.randint(2, l - 2 - sub_start_idx)
    x_head = [sub_start_idx, sub_len]
    x_tail = [random.randint(1, 9) for _ in range(l - 2)]
    # y = the selected slice of the digits, zero-padded to length l - 2
    y_head = x_tail[sub_start_idx:sub_start_idx + sub_len]
    y_tail = [0] * (l - 2 - len(y_head))
    x = x_head + x_tail
    y = y_head + y_tail
    return torch.as_tensor(x), torch.as_tensor(y)


x, y = generate_example()
print("x = ", x)
print("y = ", y)
print(torch.equal(y[:x[1]], x[2 + x[0] : 2 + x[0] + x[1]]))

x =  tensor([2, 3, 1, 8, 2, 5, 8, 6, 4, 5])
y =  tensor([2, 5, 8, 0, 0, 0, 0, 0])
True
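As a cross-check on the definition above, the mapping can also be written as a plain function of $X$. This is a minimal sketch rather than part of the original notebook: the name `expected_output` is hypothetical, and it works on Python lists instead of tensors.

```python
def expected_output(x):
    """Compute Y from X = (idx, len, x_0, ..., x_N) as a plain Python list."""
    idx, length, digits = x[0], x[1], list(x[2:])
    # Copy the selected slice, then pad with zeros back to the length of the digits.
    head = digits[idx:idx + length]
    return head + [0] * (len(digits) - len(head))


# The example printed above: idx=2, len=3 selects the digits 2, 5, 8.
print(expected_output([2, 3, 1, 8, 2, 5, 8, 6, 4, 5]))
```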


Here we wrap our example generator in `torch`'s `IterableDataset`. The only requirement of this base class is implementing the `__iter__` method, which should return an iterator over the data. We additionally implement `__len__` since we pass a size in as a parameter anyway. This will be useful later when computing average metrics across the training dataset.

In [4]:
class SyntheticDataset(torch.utils.data.IterableDataset):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return (generate_example() for _ in range(self.n))
    
    def __len__(self):
        return self.n
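To see what a `DataLoader` does with such a dataset, here is a small self-contained sketch. `ToyDataset` is a hypothetical stand-in that yields fixed tensors rather than calling `generate_example`, so that the block runs on its own. It illustrates two points relied on later: the default collation stacks examples into shape `(batch_size, sequence_len)`, and `len()` of the loader is the number of batches.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class ToyDataset(IterableDataset):
    # Hypothetical stand-in for SyntheticDataset: n identical (x, y) pairs.
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        x = torch.arange(10)  # pretend input of length 10
        y = torch.arange(8)   # pretend target of length 8
        return ((x, y) for _ in range(self.n))

    def __len__(self):
        return self.n


loader = DataLoader(ToyDataset(10), batch_size=4)
shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(len(loader))  # number of batches, including the smaller last one
print(shapes[0])    # first batch: inputs (4, 10), targets (4, 8)
print(shapes[-1])   # last batch holds the remaining 2 examples
```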

# The Model Architecture

The seq2seq model we use has two components: an _encoder_ and a _decoder_. The encoder processes the input sequence from left to right and maps it into an abstract feature space. First, we treat each decimal digit as a "word" and learn an embedding that represents each digit as a feature vector with `emb_dim` components. This yields a sequence of embedding vectors, which we process one by one with [`torch.nn.GRU`](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html). We will not cover the internal mechanisms; the important point is that we process the full input sequence, element by element, and return the final "hidden state" as the representation of the input. This encoding has dimensionality `hidden_dim` for each GRU layer.

In [5]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, n_layers, dropout=0.5):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim) # "vocabulary size" -> "word vector dimensionality"
        self.rnn = nn.GRU(emb_dim, hidden_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        #src.shape = (src len, batch size)
        embedded = self.dropout(self.embedding(src))
        #embedded.shape = (src len, batch size, emb dim)
        outputs, final = self.rnn(embedded)
        
        return final
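The shape comments in `forward` can be checked directly with the two layers the encoder is built from. The sizes below are small illustrative assumptions, not the notebook's settings; the point of the sketch is that the final hidden state has one slice per GRU layer.

```python
import torch
from torch import nn

src_len, batch_size = 10, 4
vocab, emb_dim, hidden_dim, n_layers = 10, 16, 32, 3  # illustrative sizes only

embedding = nn.Embedding(vocab, emb_dim)
rnn = nn.GRU(emb_dim, hidden_dim, n_layers)

src = torch.randint(0, vocab, (src_len, batch_size))  # (src len, batch size)
embedded = embedding(src)                             # (src len, batch size, emb dim)
outputs, final = rnn(embedded)

print(embedded.shape)  # torch.Size([10, 4, 16])
print(outputs.shape)   # torch.Size([10, 4, 32]): top-layer output at every step
print(final.shape)     # torch.Size([3, 4, 32]): last hidden state of each layer
```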

The second part of the seq2seq model is a _decoder_. This also uses a GRU, but this time we use every intermediate output state, not only the final state. Every decoded word is fed back into the GRU, along with the current state. For the initial state, we use the final state of the encoder, and for the initial word we simply use the token `0`.

Note: During training, we know how many steps we need to decode (namely, the length of the target $Y$). If we needed to run inference without access to the target sequence length, we would need to greedily decode until we reach an "end-of-sequence" ("EOS") character. As it turns out, the way we define our sequences naturally makes $0$ the EOS character, since the digits $x_i$ are drawn from 1 to 9. However, because the mapping is deterministic and easy to compute, we can always pass in the target.

In [6]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hidden_dim, n_layers, dropout=0.5):
        super().__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, n_layers, dropout = dropout)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, final):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        output, final = self.rnn(embedded, final)
        prediction = self.fc_out(output.squeeze(0))
        return prediction, final
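The note above mentions greedy decoding until an EOS token when the target length is unknown. The notebook never does this, so the following is only a sketch of the control flow. `step_fn` is a hypothetical callable standing in for one decoder step (it takes the previous token and the state, and returns per-token scores and the new state), and `max_len` is an assumed safety cap.

```python
def greedy_decode(step_fn, state, start_token=0, eos_token=0, max_len=8):
    """Decode one sequence greedily, stopping at eos_token or after max_len steps."""
    tokens = []
    token = start_token
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        # Greedy choice: index of the highest score.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos_token:
            break
        tokens.append(token)
    return tokens


def scripted_step(token, state):
    # Toy stand-in for a trained decoder: replays a fixed script, one token per step.
    script = [6, 9, 4, 5, 0, 0, 0, 0]
    scores = [0.0] * 10
    scores[script[state]] = 1.0
    return scores, state + 1


print(greedy_decode(scripted_step, state=0))  # [6, 9, 4, 5]
```

With a real model, `step_fn` would wrap a call to the decoder and `scores` would be its output for a single example.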

Finally we put everything together into our `Seq2Seq` module. 

Note that we force the encoder and decoder to have equal dimensionality. If we wished to change this, we could insert a learnable layer (such as a linear transformation) to map from one feature space to the other. 
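As a sketch of that idea, assume an encoder hidden size of 64 and a decoder hidden size of 32 (both arbitrary). A single `nn.Linear` applied to the last dimension maps the encoder's final state into the decoder's space. The `tanh` is a further assumption, added to keep values in the range a GRU hidden state normally occupies; none of this is implemented in the notebook.

```python
import torch
from torch import nn

n_layers, batch_size = 3, 4
enc_hidden_dim, dec_hidden_dim = 64, 32  # illustrative sizes only

bridge = nn.Linear(enc_hidden_dim, dec_hidden_dim)

# Stand-in for the encoder's final state: (n layers, batch size, enc hidden dim)
enc_final = torch.randn(n_layers, batch_size, enc_hidden_dim)

# nn.Linear acts on the last dimension, so the layer and batch axes are untouched.
dec_init = torch.tanh(bridge(enc_final))
print(dec_init.shape)  # torch.Size([3, 4, 32])
```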

Another trick we use is called "teacher forcing". During decoding, we need the previous element of the output sequence to predict the next one. We can either use our last prediction or the ground truth; the latter choice is called "teacher forcing". At each decoding step we flip a coin, weighted by `teacher_forcing_ratio` (0.5 by default), to decide which to use.

In [7]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hidden_dim == decoder.hidden_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
           "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        # final hidden state of the encoder -> initial hidden state of the decoder
        final = self.encoder(src)
        # just feed zeros as input to seed the decoder input
        input = torch.zeros_like(trg[0,:])
        for t in range(0, trg_len):
            output, final = self.decoder(input, final)
            outputs[t] = output
            # get the highest predicted token from our predictions
            top1 = output.argmax(1)            
            # teacher forcing probability
            teacher_force = random.random() < teacher_forcing_ratio
            input = trg[t] if teacher_force else top1

        return outputs
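The coin flip in the loop above can be isolated into a few lines. This sketch uses plain Python values instead of tensors, and the helper name `next_input` is hypothetical. It only illustrates how `teacher_forcing_ratio` interpolates between always feeding the ground truth (1.0) and always feeding the model's own prediction (0.0).

```python
import random


def next_input(truth, prediction, teacher_forcing_ratio, rng=random):
    """Pick the decoder's next input token for one decoding step."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return truth if teacher_force else prediction


rng = random.Random(0)
truth, prediction = 7, 3

# ratio 1.0: random() is always < 1.0, so the ground truth is always used.
always_truth = {next_input(truth, prediction, 1.0, rng) for _ in range(100)}
# ratio 0.0: random() is never < 0.0, so the prediction is always used.
always_pred = {next_input(truth, prediction, 0.0, rng) for _ in range(100)}
# ratio 0.5: a mix of both.
picks = [next_input(truth, prediction, 0.5, rng) for _ in range(1000)]

print(always_truth, always_pred)
print(picks.count(truth) / len(picks))
```

Passing a ratio of 0, as `evaluate` does later, therefore makes the decoder rely entirely on its own predictions.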

# Preparing the learning program

Here we do the following steps:

1. Configure the dimensionalities of the various components of the model.
2. Instantiate the encoder and decoder, and use these to construct the Seq2Seq model.
3. Randomly initialize the weights of our model (uniformly across small values).
4. Create the [optimizer](https://pytorch.org/docs/stable/optim.html) (we use [`Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)).
5. Define the loss criterion (we use [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)). One thing to notice about this loss is that the target is a single decimal digit (0 - 9), whereas the model output is a vector of scores, one per possible digit. You can think of the model output as an unnormalized log-probability distribution. Using this loss is equivalent to applying [`LogSoftmax`](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html) to the output vector and then using Negative-Log-Likelihood Loss ([`NLLLoss`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)).
6. Finally, we create a [`DataLoader`](https://pytorch.org/docs/stable/data.html) for training and another for validation during our optimization routine.
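The equivalence mentioned in step 5 can be checked numerically. The following is a small sketch on random scores (the batch of 5 predictions is an arbitrary assumption), together with the loss of a model that assigns equal scores to all 10 digits, which is a useful baseline when reading the training logs.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 10)          # 5 predictions over the 10 digits
target = torch.randint(0, 10, (5,))  # 5 ground-truth digits

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(torch.allclose(ce, nll))  # True

# Equal scores for every digit give a loss of ln(10), about 2.3026.
uniform = nn.CrossEntropyLoss()(torch.zeros(1, 10), torch.tensor([3]))
print(uniform.item(), math.log(10))
```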

In [8]:
BATCH_SIZE = 32
INPUT_DIM = 10
OUTPUT_DIM = 10
ENC_EMB_DIM = 128
DEC_EMB_DIM = 128
HID_DIM = 256
N_LAYERS = 3

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS)
model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
d_train = DataLoader(SyntheticDataset(4096), batch_size=BATCH_SIZE)
d_valid = DataLoader(SyntheticDataset(8), batch_size=1)

The model has 2,177,034 trainable parameters


Here we define two helper functions to iterate over a dataset. They are very similar and should probably be refactored to share more logic.

In `train`, we call `model.train()` to put the model into training mode (this enables dropout). Then we iterate over batches from the input. Each batch is fed into the model and we capture its output for each example in the batch. Then we compute the loss across all predicted elements. Finally we call `loss.backward()`, which propagates the gradient of the loss w.r.t. each parameter through the entire network. We apply gradient clipping to avoid exploding gradients, and then call `optimizer.step()` to update the parameters.

The `evaluate` method is similar, except we do not use the loss signal to change the network. This is an important difference when data is scarce: we do not want to ever "learn", or (worse) "memorize", the validation / test examples. If we fed the loss signal to the network, these validation / testing examples would no longer give us an idea of how well the model will generalize to unseen examples.

We also print out our predicted sequence in evaluate by taking the highest predicted digit for each sequence element.

Last, we define a helper function to measure how long it takes to process the dataset once (AKA "one epoch").

In [9]:
def train(model, iterator, optimizer, criterion, clip):    
    model.train()    
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        # always reset your gradients!
        optimizer.zero_grad()
        
        # note: we did not customize our DataLoader
        #       our model wants the input shape to be (sequence_len, batch_size),
        #       but by default the DataLoader simply stacks the dataset examples,
        #       which yields a shape (batch_size, sequence_len).
        #   
        #       so we simply transpose
        src = batch[0].T
        trg = batch[1].T
        src = src.to(device)
        trg = trg.to(device)
        
        # predict sequence
        output = model(src, trg)
        
        output_dim = output.shape[-1]
        output = output.view(-1, output_dim)
        trg = trg.reshape(-1)
        loss = criterion(output, trg)
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch[0].T
            trg = batch[1].T
        
            src = src.to(device)
            trg = trg.to(device)
            output = model(src, trg, 0) # disable teacher forcing

            output_dim = output.shape[-1]
            output = output.view(-1, output_dim)
            trg = trg.reshape(-1)

            loss = criterion(output, trg)
            
            print("Input: ", src.reshape(-1))
            print("Output: ", output.argmax(1))
            print("Target: ", trg)
            print("Loss: ", loss)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
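The gradient clipping step used in `train` can be seen in isolation. This is a minimal sketch on a tiny, hypothetical `nn.Linear` layer with deliberately inflated gradients; it shows that `clip_grad_norm_` returns the total gradient norm measured before clipping and rescales the gradients in place so that their combined norm is at most `max_norm`.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)
# Produce deliberately large gradients (every gradient entry equals 1000).
loss = (layer(torch.ones(1, 4)) * 1000.0).sum()
loss.backward()

max_norm = 1.0
# Returns the total gradient norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm)
# Recompute the combined norm over all parameters after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before))  # far larger than max_norm
print(float(norm_after))   # approximately max_norm
```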

Finally, all we have to do is run the training function. We will run for 16 epochs.

You can see that in the first epoch, the outputs are not very good. By epochs 3 and 4, we are getting much more accurate predictions, but errors still occur. Soon the validation loss drops to nearly zero; this happens quickly partly because we intentionally use a small validation set (so that we can print out each example), but the training loss continues to drop. By the end, our network almost never makes an error and appears to have fully learned the sequence coding scheme.

In [10]:
N_EPOCHS = 16
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, d_train, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, d_valid, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'subseq-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Input:  tensor([3, 2, 1, 8, 5, 6, 9, 1, 4, 5], device='cuda:0')
Output:  tensor([5, 5, 0, 0, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([6, 9, 0, 0, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(0.7166, device='cuda:0')
Input:  tensor([0, 3, 2, 4, 4, 3, 1, 7, 6, 4], device='cuda:0')
Output:  tensor([4, 4, 0, 0, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([2, 4, 4, 0, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(0.7886, device='cuda:0')
Input:  tensor([5, 2, 1, 9, 8, 1, 1, 7, 9, 2], device='cuda:0')
Output:  tensor([1, 1, 0, 0, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([7, 9, 0, 0, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(0.7166, device='cuda:0')
Input:  tensor([1, 7, 7, 3, 4, 3, 3, 6, 6, 1], device='cuda:0')
Output:  tensor([3, 3, 3, 3, 3, 3, 3, 0], device='cuda:0')
Target:  tensor([3, 4, 3, 3, 6, 6, 1, 0], device='cuda:0')
Loss:  tensor(1.8924, device='cuda:0')
Input:  tensor([2, 5, 1, 3, 8, 1, 5, 2, 3, 5], device='cuda:0')
Output:  tensor([5, 5, 0, 0, 0, 0, 0, 0], device='cu

In [11]:
examples = DataLoader(SyntheticDataset(4), batch_size=1)
evaluate(model, examples, criterion)

Input:  tensor([2, 6, 2, 3, 8, 4, 8, 8, 9, 1], device='cuda:0')
Output:  tensor([8, 4, 8, 8, 9, 1, 0, 0], device='cuda:0')
Target:  tensor([8, 4, 8, 8, 9, 1, 0, 0], device='cuda:0')
Loss:  tensor(0.0021, device='cuda:0')
Input:  tensor([5, 2, 5, 9, 3, 6, 4, 1, 8, 8], device='cuda:0')
Output:  tensor([1, 8, 0, 0, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([1, 8, 0, 0, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(5.1764e-05, device='cuda:0')
Input:  tensor([2, 4, 9, 4, 6, 9, 4, 5, 2, 6], device='cuda:0')
Output:  tensor([6, 9, 4, 5, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([6, 9, 4, 5, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(6.8823e-05, device='cuda:0')
Input:  tensor([0, 3, 2, 5, 1, 6, 8, 5, 4, 6], device='cuda:0')
Output:  tensor([2, 5, 1, 0, 0, 0, 0, 0], device='cuda:0')
Target:  tensor([2, 5, 1, 0, 0, 0, 0, 0], device='cuda:0')
Loss:  tensor(0.0001, device='cuda:0')


0.0005933678330620751
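The `Train PPL` and `Val. PPL` columns printed by the training loop are perplexities, computed there as `math.exp` of the mean cross-entropy loss. As a short worked sketch (the helper name `perplexity` is hypothetical), a model that guesses uniformly over the 10 digits has perplexity 10, while the loss returned by the last cell corresponds to a perplexity barely above 1.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Uniform guessing over 10 digits: loss ln(10), perplexity 10 (up to rounding).
print(perplexity(math.log(10)))
# The value returned by evaluate in the last cell above.
print(perplexity(0.0005933678330620751))
```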