# Sequence to sequence (seq2seq)

In this notebook, we are going to train an encoder/decoder architecture able to translate short sentences from English to French. First of all, we need some data. The website https://www.manythings.org/anki/ has a beautiful data set of sentences with their corresponding translations. We are going to use a [text file in a .zip hosted by the PyTorch website](https://download.pytorch.org/tutorial/data.zip), which contains English-French sentence pairs.

In [1]:
# usual imports, plus some text manipulation utility library
# (__future__ imports must come before any other statement; in Python 3 they are no-ops)
from __future__ import unicode_literals, print_function, division
import unicodedata
import numpy as np
import re
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

from io import open
from torch import optim
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# check which type of computational devices are available for pytorch, and
# set the most appropriate one
device_name = "cpu"
if torch.cuda.is_available() :
  device_name = "cuda"
elif torch.backends.mps.is_available() :
  device_name = "mps"

# I have issues with the GPU
device_name = "cpu"
torch.device(device_name)

device(type='cpu')

In [2]:
# access a .zip file and read the content in the text file with the specified name
url_zip = "https://download.pytorch.org/tutorial/data.zip"
file_name = "data/eng-fra.txt"
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

resp = urlopen(url_zip)
myzip = ZipFile(BytesIO(resp.read()))
file_lines = [ l.decode('utf-8') for l in myzip.open(file_name).readlines() ]
print("The file contains %d lines! Example: \"%s\"" % (len(file_lines), file_lines[2]))

The file contains 135842 lines! Example: "Run!	Courez !
"


## Pre-processing the text file

Working with natural language requires a rather bothersome pre-processing step, with tasks such as ensuring that all characters use the same encoding (like ASCII or UTF-8), removing newline characters ('\n') and other special characters, and converting all letters to lowercase. The next code cells take care of that.

In [3]:
# we define a few functions to help perform text preprocessing, with one of them using the dreaded
# R E G U L A R  E X P R E S S I O N S (or RegEx)

# turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    return s.strip()

# we are going to use a support class called "Vocabulary" that collects and keeps
# track of all the tokens in a vocabulary; in our case, the Vocabulary will be the set of
# words used in the sentences we will see (in ENG and FRA), and tokens will correspond to words
class Vocabulary :
    def __init__(self, name=""):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count Start Of Sequence (SOS) and End Of Sequence (EOS)

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    # this special function is called whenever len(Vocabulary) is invoked from
    # the outside of the object; returns the number of words (tokens)
    def __len__(self) :
      return self.n_words
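
To see what these helpers do on concrete text, here is a minimal standalone sketch that mirrors `unicodeToAscii` and `normalizeString` on three made-up inputs. The names `strip_accents` and `normalize` are chosen here only for illustration; the logic is meant to be the same as in the cell above.

```python
import re
import unicodedata

def strip_accents(s):
    # decompose accented characters (NFD), then drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize(s):
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)    # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z!?]+", " ", s)  # any run of other characters becomes one space
    return s.strip()

for raw in ["I'm OK.", "Ça va.", "Vas-y !\n"]:
    print(repr(raw), "->", repr(normalize(raw)))
```

Periods and apostrophes do not survive the second regular expression, which is why the English prefixes used later in the notebook look like "i m " rather than "i'm ".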

Ok, now that we wrote some code to parse the text file we just accessed, let's try to create two Vocabulary instances for the two languages.

In [4]:
# empty vocabularies for the moment
vocabulary_ENG = Vocabulary("ENG")
vocabulary_FRA = Vocabulary("FRA")

print("At the moment, the only tokens in the ENG Vocabulary are:", vocabulary_ENG.index2word)

At the moment, the only tokens in the ENG Vocabulary are: {0: 'SOS', 1: 'EOS'}


It's time to populate the vocabularies using the sentence pairs we read at the beginning.

In [5]:
# iterate over all the file lines, and create pairs of sentences; we know that the two
# sentences in the same line are separated by a tab, '\t', so we can split the line on that character;
# we are also going to use the 'normalizeString' function to clean up each sentence
pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in file_lines]
print("Found %d pairs of sentences! This is an example of a pair of corresponding sentences: %s" % (len(pairs), str(pairs[221])))

# that's a lot of data! we cannot spend a lot of time waiting for the network
# to learn the intricacies of two different languages, so we drastically reduce
# the data set, by taking only short sentences and only sentences that start with
# one among a few selected prefixes
maximum_sentence_length = 10
acceptable_prefixes_ENG = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

pairs = [p for p in pairs if len(p[0].split(' ')) < maximum_sentence_length and \
            len(p[1].split(' ')) < maximum_sentence_length and \
            p[0].startswith(acceptable_prefixes_ENG)]
print("After selection, we now have %d sentence pairs!" % len(pairs))

Found 135842 pairs of sentences! This is an example of a pair of corresponding sentences: ['go ahead', 'vas y !']
After selection, we now have 11445 sentence pairs!


Good! Now we just need to add the words (tokens) found in the selected sentences to their respective Vocabularies.

In [6]:
print("Analyzing sentences and adding tokens to the vocabularies...")
for pair in pairs :
  vocabulary_ENG.addSentence(pair[0])
  vocabulary_FRA.addSentence(pair[1])

print("Now the ENG vocabulary has %d words, and the FRA vocabulary has %d!" %
      (len(vocabulary_ENG), len(vocabulary_FRA)))

Analyzing sentences and adding tokens to the vocabularies...
Now the ENG vocabulary has 2991 words, and the FRA vocabulary has 4601!


Feel free to print out some of the words inside the two Vocabularies. Their ids (or indexes) have no meaning: they were simply assigned in the order in which the words were first encountered while passing through the sentence pairs. Now that we finally have some clean data, we can create a DataLoader that will later feed our encoder/decoder.
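
As a rough illustration of why the ids carry no meaning, the sketch below uses a stripped-down stand-in for `Vocabulary` (the helper `build_word2index` is hypothetical, not part of the notebook). It assigns indexes in order of first appearance, starting from 2 because 0 and 1 are taken by SOS and EOS.

```python
def build_word2index(sentences, first_free_index=2):
    # indexes 0 and 1 are reserved for SOS and EOS, as in the Vocabulary class
    word2index = {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                word2index[word] = first_free_index + len(word2index)
    return word2index

toy = build_word2index(["i m ok", "i m fat"])
reordered = build_word2index(["i m fat", "i m ok"])
print(toy)
print(reordered)
```

The same words end up with different ids when the sentences are visited in a different order, so the numbers are only labels.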

## Creating the DataLoaders
Again, let's create some helper functions that will be applied to our samples. We will also split the data between training, validation, and test.

In [7]:
# set the random seed, then shuffle the pairs (only the lines)
import random
random.seed(42)
print("First five sentence pairs before shuffling:", pairs[:5])
random.shuffle(pairs)
print("First five sentence pairs after shuffling:", pairs[:5])

# let's keep just a small number of sentences for validation and test, and most
# of them for training
validation_set_size = 400
test_set_size = 400
training_set_size = len(pairs) - validation_set_size - test_set_size

pairs_training_set = pairs[:training_set_size]
pairs_validation_set = pairs[training_set_size:training_set_size+validation_set_size]
pairs_test_set = pairs[-test_set_size:]

print("Sizes: training set %d; validation set %d; test set %d " %
 (len(pairs_training_set), len(pairs_validation_set), len(pairs_test_set)))

First five sentence pairs before shuffling: [['i m ok', 'je vais bien'], ['i m ok', 'ca va'], ['i m fat', 'je suis gras'], ['i m fat', 'je suis gros'], ['i m fit', 'je suis en forme']]
First five sentence pairs after shuffling: [['he is a lazy student', 'c est un etudiant paresseux'], ['you re the best dad ever', 'tu es le meilleur papa de tous les temps'], ['you are early', 'tu viens tot'], ['i m at the beach', 'je suis a la plage'], ['he is always losing his umbrella', 'il perd tout le temps son parapluie']]
Sizes: training set 10645; validation set 400; test set 400 


In [8]:
# helper function, pretty much self-explanatory
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

# this function returns a DataLoader object, given a list of pairs, and the two
# input/output languages, to compute the token ids starting from the words
def get_dataloader(pairs, input_lang, output_lang, batch_size=32, max_length=10, EOS_token=1, device="cpu"):

    n = len(pairs)
    input_ids = np.zeros((n, max_length), dtype=np.int32)
    target_ids = np.zeros((n, max_length), dtype=np.int32)

    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromSentence(input_lang, inp)
        tgt_ids = indexesFromSentence(output_lang, tgt)
        inp_ids.append(EOS_token)
        tgt_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    # create a TensorDataset
    data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))

    # what kind of sampling strategy will we apply when the DataLoader will be
    # asked to return a batch of samples? Here we set a simple RandomSampler
    sampler = RandomSampler(data)
    # finally, instantiate and return the DataLoader
    dataloader = DataLoader(data, sampler=sampler, batch_size=batch_size)

    return dataloader

In [9]:
# create DataLoaders for train, validation, and test
batch_size = 32
train_loader = get_dataloader(pairs_training_set, vocabulary_ENG, vocabulary_FRA, batch_size=batch_size, device=device_name)

print("Total number of samples that the train_loader will provide:", len(train_loader.sampler))

# TODO create a validation_loader and a test_loader
validation_loader = get_dataloader(pairs_validation_set, vocabulary_ENG, vocabulary_FRA, batch_size=len(pairs_validation_set), device=device_name)
test_loader = get_dataloader(pairs_test_set, vocabulary_ENG, vocabulary_FRA, batch_size=len(pairs_test_set), device=device_name)

Total number of samples that the train_loader will provide: 10645


Now, modify the code cell above to (i) add the code to create the validation_loader and the test_loader, (ii) fetch a few samples from one of the DataLoaders and print them out. Do they look the way you expected?
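
As a hint of what to expect, here is a small self-contained sketch that pads two made-up lists of token ids the same way `get_dataloader` does, then fetches one batch. The token ids and the name `toy_loader` are invented for illustration, and no RandomSampler is used, so the order stays fixed.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

max_length, EOS_token = 10, 1
# made-up token ids standing in for two tokenized sentences
toy_sentences = [[2, 3, 4], [2, 3, 5, 6, 7]]

ids = np.zeros((len(toy_sentences), max_length), dtype=np.int32)
for row, sentence_ids in enumerate(toy_sentences):
    padded = sentence_ids + [EOS_token]   # every sentence ends with EOS
    ids[row, :len(padded)] = padded       # the rest of the row stays at 0

toy_tensor = torch.tensor(ids, dtype=torch.long)
toy_loader = DataLoader(TensorDataset(toy_tensor, toy_tensor), batch_size=2)

inputs, targets = next(iter(toy_loader))
print(inputs.shape)
print(inputs[0])
```

Each row has exactly `max_length` entries: the sentence ids, one EOS, and zeros as padding. Note that the padding value 0 is also the index of SOS in the Vocabulary.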

## Encoder/decoder architecture

It's again our favorite moment of a pytorch notebook! We finally get to create new classes that inherit from torch.nn.Module! Oh, joy! In this case, we will create two separate classes, one for the Encoder, the other for the Decoder, to keep things clearer. Interestingly, it does not really matter that we will later use two separate objects to perform a forward pass: the tensor travelling through the two objects maintains a computational graph of all the other tensors it interacted with, so the backward pass can later compute the gradients of the loss function with respect to the parameters of both networks.
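
The point about gradients flowing through two separate objects can be checked on a toy example. In this sketch, two `torch.nn.Linear` modules stand in for the encoder and the decoder (they are placeholders, not the networks defined below), and the loss is an arbitrary one chosen only to have something to differentiate.

```python
import torch

torch.manual_seed(0)
first = torch.nn.Linear(4, 3)    # stand-in for the encoder
second = torch.nn.Linear(3, 2)   # stand-in for the decoder

x = torch.randn(5, 4)
loss = second(first(x)).pow(2).mean()   # arbitrary scalar loss
loss.backward()                         # one backward pass for both modules

print(first.weight.grad.shape, second.weight.grad.shape)
```

After a single call to `backward()`, the parameters of both modules hold gradients, even though the modules are independent objects.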

In [10]:
# inherits from torch.nn.Module
class EncoderRNN(torch.nn.Module) :
    # as for the other exercise on RNNs, let's have the size of the hidden layer
    # as one of the arguments of the constructor; notice that the dropout probability
    # is another argument, as this time we will use a Dropout module to try to
    # reduce overfitting
    def __init__(self, input_size, hidden_size, dropout_p=0.1) :
        # invoke the constructor of the parent class
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # three modules: an embedding, to go from the Vocabulary tokens to a vector space (embedding)
        # internally, the Embedding module is a lookup table, equivalent to a linear layer
        # (without bias) applied to one-hot vectors, as we discussed in class
        self.embedding = torch.nn.Embedding(input_size, hidden_size)
        # a dropout module
        self.dropout = torch.nn.Dropout(dropout_p)
        # and a module of Gated Recurrent Units
        self.gru = torch.nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input) :
        # the forward pass is simple: first, we compute the embedding value for the input token
        embedded_token = self.embedding(input)
        # then, we pass through a dropout
        embedded_token = self.dropout(embedded_token)
        # and finally, through the GRU module, that will return the output and a hidden state
        output, hidden = self.gru(embedded_token)
        return output, hidden
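
To make the shapes returned by the encoder concrete, the following sketch runs a random batch of token ids through an Embedding and a GRU configured like the ones above. The sizes are toy values picked for the example, and Dropout is left out for brevity.

```python
import torch

torch.manual_seed(0)
vocabulary_size, hidden_size = 20, 8   # toy values
embedding = torch.nn.Embedding(vocabulary_size, hidden_size)
gru = torch.nn.GRU(hidden_size, hidden_size, batch_first=True)

tokens = torch.randint(0, vocabulary_size, (4, 10))   # (batch, sequence length)
output, hidden = gru(embedding(tokens))

print(output.shape)   # one vector per input token: (batch, sequence, hidden)
print(hidden.shape)   # final hidden state: (layers, batch, hidden)
```

With a single-layer, unidirectional GRU, the final hidden state is the same vector as the last time step of `output`; this is the context vector that the decoder receives.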

In [11]:
# the decoder, on the other hand, is a bit more complex, not really for the architecture
# (it uses only four modules), but for the way that it processes the output
class DecoderRNN(torch.nn.Module) :

    def __init__(self, hidden_size, output_size, device="cpu", max_length=10, SOS_token=0, EOS_token=1):
        super(DecoderRNN, self).__init__()
        # these are just internal attributes for variables that we will need to keep track of
        self.device = device # "cpu", "cuda", or "mps"
        self.max_length = max_length # number of decoding steps: the loop in forward() always generates this many tokens
        self.SOS_token = SOS_token # index of the SOS
        self.EOS_token = EOS_token # index of the EOS

        # modules
        self.embedding = torch.nn.Embedding(output_size, hidden_size)
        self.relu = torch.nn.ReLU()
        self.gru = torch.nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = torch.nn.Linear(hidden_size, output_size)

    # forward pass: notice that, besides the arguments you would expect (encoder_outputs
    # and encoder_hidden), there is an optional argument called 'target_tensor'
    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None) :
        # training samples are sent in batches, so one of the dimensions of the
        # tensor output by the encoder corresponds to the batch size;
        # let's keep track of it to set up the other tensors; we use encoder_outputs just for that
        batch_size = encoder_outputs.size(0)
        # first, fill in all the first inputs for the decoder as SOS (Start of Sequence) special tokens
        # the code below creates an empty tensor and fills it with the value self.SOS_token
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=self.device).fill_(self.SOS_token)
        # the initial hidden state of the GRU/RNN layer, on the other hand, will be set as the hidden state
        # (or context vector) that came out of the encoder
        decoder_hidden = encoder_hidden
        # we clearly want to keep track of the outputs
        decoder_outputs = []

        for i in range(self.max_length):

            # forward pass of the tensor through the modules
            z = self.embedding(decoder_input)
            z = self.relu(z)
            z, decoder_hidden = self.gru(z, decoder_hidden)
            decoder_output = self.out(z)

            # keep track of the outputs
            decoder_outputs.append(decoder_output)

            # target_tensor is used here to understand whether we are running during training
            # or during validation/test
            if target_tensor is not None:
                # teacher forcing: feed the known ground truth target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # teacher forcing
            else:
                # without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1) # returns the k largest elements of the given input tensor along a given dimension; here k=1
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_hidden
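
The branch without teacher forcing relies on a couple of shape manipulations that are easy to get wrong. This small sketch reproduces them on a random tensor shaped like one decoding step; the sizes are toy values and the variable `next_input` is a name used only here.

```python
import torch

torch.manual_seed(0)
batch_size, vocabulary_size = 3, 7   # toy values
# one decoding step: (batch, 1, vocabulary), as produced by self.out(z)
decoder_output = torch.randn(batch_size, 1, vocabulary_size)

_, topi = decoder_output.topk(1)         # index of the best token, along the last dimension
next_input = topi.squeeze(-1).detach()   # back to (batch, 1), ready for the Embedding

print(topi.shape, next_input.shape)
```

Taking `topk(1)` along the last dimension selects the same token as `argmax(dim=-1)`; the result has the same shape as the SOS tensor used for the first step, so it can be fed straight back into the decoder.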

## Training the networks

We are finally at the point where we can instantiate our networks and train them! In the code below, you will notice that I created two optimizers, one for the encoder and one for the decoder. As the gradients of the parameters of both networks are computed in the same backward pass, in theory we could use just one optimizer. However, I had some trouble passing the parameters of the two networks together to a single optimizer, so for the moment I used this workaround. I am sure that if I created another wrapper class, containing an instance of both EncoderRNN and DecoderRNN, I would be able to do it, but I ran out of time to prepare this example, sorry ^_^;

Maybe you can do better than what I did!
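
One possible way to use a single optimizer, without a wrapper class, is to chain the two parameter iterators together. The sketch below shows the idea on two toy `Linear` modules standing in for the encoder and the decoder; it is an assumption about what would work in this notebook, not something that was tested on the networks defined here.

```python
import itertools
import torch

torch.manual_seed(0)
toy_encoder = torch.nn.Linear(4, 3)   # stand-in for EncoderRNN
toy_decoder = torch.nn.Linear(3, 2)   # stand-in for DecoderRNN

# a single Adam instance that receives the parameters of both modules
optimizer = torch.optim.Adam(
    itertools.chain(toy_encoder.parameters(), toy_decoder.parameters()), lr=3e-4)

before = [p.detach().clone()
          for p in itertools.chain(toy_encoder.parameters(), toy_decoder.parameters())]

loss = toy_decoder(toy_encoder(torch.randn(5, 4))).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()   # one step updates the parameters of both modules

print(len(optimizer.param_groups[0]["params"]))
```

With this approach the training loop would need only one `zero_grad()` and one `step()` call per batch.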

In [12]:
# networks hyperparameters: size of all hidden states
hidden_size = 128

# optimizer hyperparameters
max_epochs = 100
learning_rate = 3e-4

# fix random seed
torch.manual_seed(42)

# instantiate networks! the input size for the encoder is the number of tokens in
# the input vocabulary (ENG), the output size of the decoder is the number of tokens
# in the output vocabulary (FRA); the decoder also needs to know the device, because
# it creates its first input tensor (filled with SOS tokens) by itself
encoder = EncoderRNN(vocabulary_ENG.n_words, hidden_size)
decoder = DecoderRNN(hidden_size, vocabulary_FRA.n_words, device=device_name)
encoder.to(device_name)
decoder.to(device_name)

# instantiate optimizers
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)

# loss function, works like Categorical Cross-Entropy, but expects in input
# log-probabilities, i.e. tensors that have already passed through a LogSoftmax
loss_function = nn.NLLLoss()
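
The relationship between the two losses can be verified numerically. In this sketch, random logits and targets (toy sizes, invented for the example) are flattened the same way as in the training loop, and NLLLoss applied to log-softmax outputs is compared with CrossEntropyLoss applied to the raw logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 10, 7)           # (batch, sequence, vocabulary), toy sizes
targets = torch.randint(0, 7, (2, 10))   # (batch, sequence)

log_probs = F.log_softmax(logits, dim=-1)
# flatten batch and sequence dimensions: one row of scores per predicted token
nll = torch.nn.NLLLoss()(log_probs.view(-1, log_probs.size(-1)), targets.view(-1))
ce = torch.nn.CrossEntropyLoss()(logits.view(-1, logits.size(-1)), targets.view(-1))

print(nll.item(), ce.item())
```

The two values agree up to floating point error, which is why the decoder applies `F.log_softmax` before returning its outputs.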

In [None]:
# just before starting the loop, set all models in 'training' mode, so that Dropouts
# and possibly other training-only behaviors are active
encoder.train()
decoder.train()

# and here is the training loop, in all its glory!
for epoch in range(0, max_epochs) :

  # keep track of the losses for each batch
  batches_loss = []

  # process all batches
  for input_tensor, target_tensor in train_loader :

    # pass input tensor through the encoder
    encoder_outputs, encoder_hidden = encoder(input_tensor.to(device_name))

    # pass outputs and hidden state to the decoder, along with the target tensor
    # that, during training, is used for teacher forcing; we also obtain
    # decoder_hidden, but we will not really use it
    decoder_outputs, decoder_hidden = decoder(encoder_outputs.to(device_name),
                                              encoder_hidden.to(device_name),
                                              target_tensor.to(device_name))

    # compute loss function
    batch_loss = loss_function(decoder_outputs.view(-1, decoder_outputs.size(-1)), 
                               target_tensor.view(-1))

    # and now, an optimization step! reset gradients on network parameters
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # backward pass, updating gradients
    batch_loss.backward()
    # optimizer step OH MY GOD, THIS IS HORRIBLE, WHY COULDN'T I MAKE IT WORK WITH ONE?
    encoder_optimizer.step()
    decoder_optimizer.step()

    # store loss value
    batches_loss.append(batch_loss.detach().item())

  # training loss for the epoch is the mean of the losses of each batch
  train_loss = np.mean(batches_loss)

  # TODO also compute the validation loss, using the torch.no_grad() context
  # TODO also print out the validation loss
  print("Epoch %d: training loss=%.6f" % (epoch, train_loss))

Epoch 0: training loss=3.886690
Epoch 1: training loss=2.836368
Epoch 2: training loss=2.536768


Fantastic! The loss is decreasing. Notice how each epoch takes a considerable amount of time. If you want to resume training from the weights obtained in the last iteration and keep going for a while longer, you can just re-run the last code cell. However, if you modify some hyperparameter and would like to restart from scratch, you will first have to re-run the code cell before the last one, because it sets the random seed and instantiates the networks.
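
For the validation TODOs in the training cell, a possible starting point is a helper along the following lines. It is a generic sketch: `mean_loss_without_gradients`, `toy_model` and `toy_batches` are hypothetical names, and a tiny classifier stands in for the encoder/decoder pair.

```python
import torch

def mean_loss_without_gradients(model, batches, loss_function):
    was_training = model.training
    model.eval()                    # disable Dropout and other training-only behaviors
    losses = []
    with torch.no_grad():           # no computational graph is built in this block
        for inputs, targets in batches:
            losses.append(loss_function(model(inputs), targets).item())
    model.train(was_training)       # restore the mode the model was in
    return sum(losses) / len(losses)

torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.LogSoftmax(dim=-1))
toy_batches = [(torch.randn(6, 4), torch.randint(0, 3, (6,))) for _ in range(2)]

validation_loss = mean_loss_without_gradients(toy_model, toy_batches, torch.nn.NLLLoss())
print("validation loss=%.6f" % validation_loss)
```

In this notebook, the call to `model(inputs)` would be replaced by the encoder followed by the decoder. Passing the target tensor to the decoder keeps teacher forcing on, while omitting it measures the loss on the decoder's own predictions.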

Now that we have the trained weights, we can finally evaluate our network! Write the code to:
1. Select a random sentence pair.
2. Transform the words in the input sentence into a tensor of tokens.
3. Send the tensor through the encoder and the decoder, to get a tensor of tokens in output.
4. Convert the tokens in the output back to words.
5. And actually, do that in a loop with a lot of sentences, because my experience is that some of them were translated properly and others absolutely were not :-)

You can write the code completely from scratch, or you can re-use the helper functions that we declared at the beginning of the notebook.
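
Step 4 of the list above can be sketched in plain Python. The helper `words_from_indexes` and the toy dictionary are made up for this example; in the notebook, the dictionary would be `vocabulary_FRA.index2word`.

```python
def words_from_indexes(index2word, indexes, EOS_token=1):
    # translate token ids back to words, stopping at the first EOS
    words = []
    for index in indexes:
        if index == EOS_token:
            break
        words.append(index2word[index])
    return words

toy_index2word = {0: "SOS", 1: "EOS", 2: "je", 3: "suis", 4: "gros"}
predicted = words_from_indexes(toy_index2word, [2, 3, 4, 1, 0, 0])
print(" ".join(predicted))
```

Stopping at EOS matters because the decoder always produces `max_length` tokens, so everything after the first EOS is filler.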

In [None]:
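number_of_test_pairs = 1

encoder.eval()
decoder.eval()

for i in range(0, number_of_test_pairs) :
  input_sentence, output_sentence = random.choice(pairs_training_set)
  print("Input: \"%s\"; Output (Ground truth): \"%s\"" % (input_sentence, output_sentence))

  # token ids of the input sentence, terminated by EOS as in get_dataloader
  input_ids = indexesFromSentence(vocabulary_ENG, input_sentence) + [decoder.EOS_token]

  with torch.no_grad() :
    # the Embedding module needs integer (long) indexes; add a batch dimension of size 1
    input_tensor = torch.tensor(input_ids, dtype=torch.long, device=device_name).unsqueeze(dim=0)
    encoder_outputs, encoder_hidden = encoder(input_tensor)
    decoder_outputs, decoder_hidden = decoder(encoder_outputs, encoder_hidden) # there is no target tensor

    # decoder_outputs contains log-probabilities, with shape (1, max_length, n_words):
    # pick the most likely token at each step, and stop at the first EOS
    predicted_ids = decoder_outputs.argmax(dim=-1).squeeze(0).tolist()
    if decoder.EOS_token in predicted_ids :
      predicted_ids = predicted_ids[:predicted_ids.index(decoder.EOS_token)]

    output_sentence_pred = " ".join(vocabulary_FRA.index2word[t] for t in predicted_ids)
    print("Output (prediction): \"%s\"" % output_sentence_pred)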
