Lab made by Melek GHOUMA / Mohamed Aymen BOUYAHIA

# Machine Translation with Seq2seq models via PyTorch

The goals of this lab are to:
- Familiarize yourself with the task of **Machine Translation (MT)**
- Implement a basic **recurrent sequence-to-sequence** model in PyTorch
- Train the model on a very simple English-French MT dataset
- Implement an **attention** module into the model and visualize its results

In this lab, we will focus on the model and leave aside what are normally very important aspects of Machine Learning methodology: in particular, we won't use validation and test data for hyperparameter search and performance evaluation.

In [None]:
# General stuff
from io import open
import re
import random
import numpy as np
import matplotlib.pyplot as plt

# Nice printing
from pprint import pprint

# Pytorch
import torch
import torch.nn as nn
from torch import optim
from torch.nn import functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

# Which device to use ?
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### I - Dataset and pre-processing

We're going to work with data from the **tatoeba** website. This website offers human-made translations for many (relatively) simple sentences, sometimes with several possible translations for one sentence.
Pre-processed versions of the *tatoeba dataset* can be found on this [website](https://www.manythings.org/anki/). On Moodle, you can find the 'English $\rightarrow$ French' data already cleaned, but you are free to use any other language you prefer.


<div class='alert alert-block alert-warning'>
            Question:</div>
            
We will define these as global variables - for convenience. Given what was said in class and how they are employed in this lab, explain briefly what each one is used for.

In [None]:
# Some global variables
PAD_TOKEN = 0
SOS_TOKEN = 1
EOS_TOKEN = 2

In [None]:
# PAD_TOKEN : used to fill in sequences to ensure they all have the same length when batched together
# SOS_TOKEN : start-of-sequence token
# EOS_TOKEN : end-of-sequence token

From the previous PyTorch lab, we know that we need to define some parameters. We can already choose the maximum length of sequences, the size of our batches, and the internal dimension used by our model. Note that the sequences are rather short in this data - you can take a look at the histogram.

In [None]:
# Parameters
max_length = 10
batch_size = 32
hidden_size = 128

In [None]:
# Read the file and split into lines
parallel = open('fra.txt', encoding='utf-8').\
        read().strip().split('\n')

In [None]:
# Data looks like this
pprint(parallel[0:5])

['Go.\tVa !',
 'Run!\tCours\u202f!',
 'Run!\tCourez\u202f!',
 'Wow!\tÇa alors\u202f!',
 'Fire!\tAu feu !']


We will need to clean this up. Use the regular expression package ```re``` to remove any non letter character. Be careful, though, with French, you need to keep the accents. We will then organize the data into pairs, as is usual in MT.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Lowercase and remove non-letter characters
def normalizeString(s):
    # Lowercase
    s = s.lower()
    # Keep only letters (including accented ones), whitespace and apostrophes
    s = re.sub(r"[^a-zàâçéèêëîïôûùüÿñ\s']", '', s)
    return s

In [None]:
pprint(parallel[0:5])

['Go.\tVa !',
 'Run!\tCours\u202f!',
 'Run!\tCourez\u202f!',
 'Wow!\tÇa alors\u202f!',
 'Fire!\tAu feu !']


In [None]:
# Split every line into pairs and normalize
pairs = [[normalizeString(s) for s in l.split('\t')] for l in parallel]

In [None]:
pprint(pairs[0:5])

[['go', 'va '],
 ['run', 'cours\u202f'],
 ['run', 'courez\u202f'],
 ['wow', 'ça alors\u202f'],
 ['fire', 'au feu ']]
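
The pairs above still carry trailing spaces and the narrow no-break space `\u202f` that French typography places before `!`, because `\s` matches any Unicode whitespace. A minimal sketch of a stricter variant is given below, assuming we also want to collapse whitespace and keep the ligatures `œ` and `æ`. The name `normalize_strict` is hypothetical and is not used elsewhere in this lab.

```python
import re

def normalize_strict(s):
    # Lowercase first so that the character class only needs lowercase letters
    s = s.lower()
    # Keep letters (accents and the ligatures æ / œ included), whitespace and apostrophes
    s = re.sub(r"[^a-zàâæçéèêëîïôœûùüÿñ\s']", "", s)
    # Collapse any run of whitespace (including \u202f) into one space, then trim
    s = re.sub(r"\s+", " ", s).strip()
    return s

line = "Run!\tCours\u202f!"
print([normalize_strict(part) for part in line.split("\t")])
```

With this variant, splitting a sentence on `' '` no longer yields empty tokens or tokens with a no-break space glued to them.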


Begin by implementing a class ```Vocab``` that will accumulate counts and indexes of words into language-specific dictionaries. In this case, we would like the vocabulary to be built on the fly, to work well with the format of our data (parallel sentences from both languages).

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class Vocab:
    def __init__(self):
        self.word2count = {}
        # Reserve an index for every special token, PAD included, so that the
        # first real word does not receive the same index as EOS_TOKEN
        self.word2idx = {"PAD": PAD_TOKEN, "SOS": SOS_TOKEN, "EOS": EOS_TOKEN}
        self.idx2word = {PAD_TOKEN: "PAD", SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}

    # Implemented assuming we will process lines one by one, easier given the format of our data
    def addSent(self, sent):
        for word in sent.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2idx:
            # The next free index is the current size of the vocabulary
            last_pos = len(self)
            self.word2idx[word] = last_pos
            self.word2count[word] = 1
            self.idx2word[last_pos] = word
        else:
            self.word2count[word] += 1

    def __len__(self):
        return len(self.word2idx)
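
To make the index layout concrete, here is a small self-contained sketch. It assumes that the three special tokens occupy indexes 0 to 2, so that the first real word receives index 3. The class `MiniVocab` and its method `add_sentence` are hypothetical names used only for this illustration.

```python
# Hypothetical miniature vocabulary, for illustration only
PAD, SOS, EOS = 0, 1, 2

class MiniVocab:
    def __init__(self):
        # Assumption: all three special tokens are registered up front
        self.word2idx = {"PAD": PAD, "SOS": SOS, "EOS": EOS}
        self.idx2word = {i: w for w, i in self.word2idx.items()}
        self.word2count = {}

    def add_sentence(self, sent):
        for word in sent.split():
            if word not in self.word2idx:
                # New word: give it the next free index
                idx = len(self.word2idx)
                self.word2idx[word] = idx
                self.idx2word[idx] = word
            # Count every occurrence, new or not
            self.word2count[word] = self.word2count.get(word, 0) + 1

vocab = MiniVocab()
vocab.add_sentence("i am ready")
vocab.add_sentence("i am here")
print(vocab.word2idx)
print(vocab.word2count)
```

Because indexes are assigned from the current dictionary size, any special token missing from the initial dictionary would have its index reused by a real word.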

Then, create a function ```tensorFromSentence``` that will take an untokenized sentence (hence, a string), a ```Vocab``` object, and the ```max_length``` parameter as inputs, and return a ```LongTensor``` representing the sequence of indexes.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
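def tensorFromSentence(sent, vocab, max_length):
    # Note: max_length is accepted for interface consistency, but the sequence is not truncated here
    # The sequence always starts with SOS ...
    indexes = [vocab.word2idx["SOS"]]
    for word in sent.split(' '):
        # Words unknown to the vocabulary are silently skipped
        if word in vocab.word2idx:
            indexes.append(vocab.word2idx[word])
    # ... and ends with EOS
    indexes.append(vocab.word2idx["EOS"])
    return torch.tensor(indexes, dtype=torch.long)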

Finally, complete this ```TranslationDataset``` class inheriting from ```Dataset```. It should, from the list of parallel sentences:
- Apply an optional filter to possibly reduce the dataset size and complexity,
- Instantiate and build ```Vocab``` objects for both languages,
- Create two lists containing ```LongTensor``` objects for each language,
- Group them into two tensors of the appropriate size with ```pad_sequence```.

You should note that, depending on the ordering of the pairs, one language will be the **source**, and the other will be the **target** of our model. In this case, English is the source and French the target.
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class TranslationDataset(Dataset):
    def __init__(self, parallel_data, max_length = 10, filter_target_prefixes = None):
        # We will select some subset on the data to avoid having too much
        self.pairs = self.filterData(parallel_data, filter_target_prefixes)

        self.max_length = max_length

        # Creating both vocabularies
        self.input_lang = Vocab()
        self.output_lang = Vocab()

        # Filling both vocabularies
        for pair in self.pairs:
            self.input_lang.addSent(pair[0])
            self.output_lang.addSent(pair[1])

        # Lists of tensors to be created
        self.tensor_inputs = [tensorFromSentence(pair[0], self.input_lang, self.max_length) for pair in self.pairs]
        self.tensor_outputs = [tensorFromSentence(pair[1], self.output_lang, self.max_length) for pair in self.pairs]
        print("The tensor inputs is \t")
        print(self.tensor_inputs[0])

        # Put them all at the same size with pad_sequence
        self.tensor_inputs = pad_sequence(self.tensor_inputs, padding_value=PAD_TOKEN, batch_first=True)
        self.tensor_outputs = pad_sequence(self.tensor_outputs, padding_value=PAD_TOKEN, batch_first=True)
        print("The tensor inputs is \t")
        print(self.tensor_inputs[0])

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # The iterator just gets one particular example
        # The dataloader will take care of the shuffling and batching
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_inputs[idx], self.tensor_outputs[idx]

    def filterPair(self, pair, prefixes):
        return pair[0].startswith(prefixes)

    def filterData(self, pairs, filter_target_prefixes):
        if filter_target_prefixes is not None:
            return [pair for pair in pairs if self.filterPair(pair, filter_target_prefixes)]
        else:
            return pairs
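
As an illustration of the padding step used in the class above, here is a minimal sketch of `pad_sequence` on two toy sequences. The index values are made up for the example; only `PAD_TOKEN = 0` matches the lab.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN = 0

# Two toy index sequences of different lengths (values are arbitrary)
seqs = [torch.tensor([1, 5, 2]), torch.tensor([1, 6, 7, 8, 2])]

# batch_first=True gives a (batch, longest_length) tensor, filled with PAD_TOKEN
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_TOKEN)
print(batch.shape)
print(batch)
```

Every sequence is padded to the length of the longest one in the list, which is why the padded tensors printed by the dataset are longer than `max_length` when no truncation is applied.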

Create a ```TranslationDataset``` from our data, with no filter, and look at its size, and the sizes of the vocabularies. What could be a problem here ?

<div class='alert alert-block alert-warning'>
            Question:</div>

In [None]:
# Create a TranslationDataset from our data with no filter
translationDataset = TranslationDataset(pairs, max_length=max_length)
print("Observing the sizes of the vocabularies\n")
print("The size of the source vocabulary is: {}\n".format(len(translationDataset.input_lang)))
print("The size of the target vocabulary is: {}\n".format(len(translationDataset.output_lang)))

The tensor inputs is 	
tensor([1, 2, 2])
The tensor inputs is 	
tensor([1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0])
Observing the sizes of the vocabularies

The size of the source vocabulary is: 13653

The size of the target vocabulary is: 29439



We note that the target vocabulary (French, 29,439 entries) is roughly twice the size of the source vocabulary (English, 13,653 entries). This can be a problem for model complexity: the decoder's embedding table and its final linear layer both grow with the target vocabulary, and the softmax applied in the last layer must discriminate between a much larger set of classes, many of which are likely to be rare words.
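
To put numbers on this, here is a rough parameter count using the vocabulary sizes printed above and `hidden_size = 128`. It is a back-of-the-envelope sketch; the helper names `embedding_params` and `linear_params` are hypothetical.

```python
hidden_size = 128
src_vocab, tgt_vocab = 13653, 29439   # sizes printed by the unfiltered dataset

def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry
    return vocab_size * dim

def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit
    return in_features * out_features + out_features

enc_emb = embedding_params(src_vocab, hidden_size)
dec_emb = embedding_params(tgt_vocab, hidden_size)
dec_out = linear_params(hidden_size, tgt_vocab)

print("encoder embedding:", enc_emb)
print("decoder embedding:", dec_emb)
print("decoder output layer:", dec_out)
```

Under these assumptions, the two vocabulary-dependent layers of the decoder hold about 7.6 million parameters, against about 1.7 million for the encoder embedding.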


---



We will now use a filter: we will only consider pairs whose English sentence begins with one of the strings in the ```prefixes``` tuple. Create the dataset with this filter. Look at the sizes involved. Create a dataloader with the previously defined ```batch_size```.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Consider only the sentences beginning with these
prefixes = ("i am ", "i m ",
            "he is", "he s ",
            "she is", "she s ",
            "you are", "you re ",
            "we are", "we re ",
            "they are", "they re ")

# New dataset:
new_dataset = TranslationDataset(pairs, max_length=max_length, filter_target_prefixes=prefixes)

The tensor inputs is 	
tensor([1, 2, 3, 4, 2])
The tensor inputs is 	
tensor([1, 2, 3, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


In [None]:
# Creating the dataloader
training_dataloader = DataLoader(new_dataset, batch_size=batch_size, shuffle=True)

### II - Sequence to sequence architecture and training

We will now create two pytorch objects, which will inherit from ```Module```: the ```EncoderRNN``` and the ```DecoderRNN``` classes. Both are based on RNNs; we will use the lighter ```GRU``` (gated recurrent unit) recurrent layer.
While we won't check it with validation data, we should try to avoid overfitting with ```Dropout```.

Begin by completing the **encoder**. It uses an ```Embedding``` layer, which has as many vectors as the size of the **source** vocabulary, plus the ```GRU```. Both the embeddings and the recurrent layer use dimension ```hidden_size```. It should output two things:
- A sequence of vectors, corresponding to the representations of each input word that has gone through the encoder,
- The last hidden state used by the GRU of the encoder.

**Important**: with our first decoder, we will only use the **last hidden state**. However, we can still add the sequence of representations to the outputs, as we will need it for the *attention module* later.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Create encoder
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden
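
As a quick shape check of the two encoder outputs described above, here is a minimal sketch with toy sizes. The values 20, 8, 4 and 6 are arbitrary assumptions, not the lab's settings.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 20, 8                 # toy sizes, for illustration
batch = torch.randint(0, vocab_size, (4, 6))    # 4 sentences of 6 token indexes

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

outputs, hidden = gru(embedding(batch))
print(outputs.shape)   # one vector per input position: (batch, length, hidden)
print(hidden.shape)    # last state, with a leading num_layers dimension: (1, batch, hidden)
```

Note that the hidden state is not batch-first even when `batch_first=True`: this flag only affects the input and output sequences.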

Next, you will need to complete the **decoder**. Besides the ```Embedding``` (for the **target** language) and ```GRU```, it needs an additional layer: a ```Linear``` layer to obtain output scores for the next word to be predicted.
The ```forward``` function is however a little more complicated: we will need it to be able to re-use what was predicted at the previous step during inference. Therefore, we will use the old-fashioned way: a **loop**. To summarize, we will:
- Create an empty tensor containing only the first token of the output sequence (*which is ?*) with ```torch.empty```.
- If we are in training mode, we can fill out that tensor with what we know to be the rest of the output sequence, make it go through the recurrent layer, and obtain scores.
- If we are in inference mode, we need to make a prediction at each step to re-insert the corresponding index as input afterwards. We can use the ```topk``` method to get the best index directly ! **Important:** use the ```detach()``` method to cut this from the computational graph.  

In both cases, we loop through the sequence and apply the same operations, which are in ```forward_step```. We return the log-probabilities of prediction at each step.

**Important:** Again, we also return an empty placeholder variable which we will later use for attention.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Create decoder
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, max_length, target_tensor=None):
        # We build the decoder the old-school way:
        # As we will need to loop through the decoder for inference, let's use the same structure for training
        # We will process one input and predict one output at a time, with a method implementing the recurrent step: "forward_step"
        batch_size = encoder_outputs.size(0)
        # Create the input to the decoder: which token is it ? Put it in "fill_"
        # Which shape should it be ?
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_TOKEN)
        # Where does the first hidden state come from ?
        decoder_hidden = encoder_hidden
        # We'll keep the output in a list
        decoder_outputs = []

        # Looping on the output sequence
        for i in range(max_length):
            # Apply the forward_step function ...
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            # and keep the output
            decoder_outputs.append(decoder_output)

            # We are in training mode: we know the target
            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                # Which shape do we need ?
                decoder_input = target_tensor[:, i].unsqueeze(1)

            # We are doing inference, we need to predict the next word and re-use it as input
            else:
                # Without teacher forcing: use its own predictions as the next input
                # Use the topk function to get the best index
                _, topi = decoder_output.topk(1)
                # Very important: to be re-used as input, detach from computational graph
                # Which shape do we need ?
                decoder_input = topi.squeeze(-1).detach()

        # Concatenate outputs on the second dimension (length of the sequence)
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        # Apply log_softmax
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        # We return `None` for consistency in the training loop - it will be used for attention later
        return decoder_outputs, decoder_hidden, None

    def forward_step(self, input, hidden):
        # Get your input through embedding, an activation function, the recurrent layer, and the output layer
        output = self.embedding(input)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output)
        return output, hidden
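
The shape handling of the greedy step (the `else` branch of the loop) can be checked in isolation. This is a small sketch on random scores with toy sizes, which are assumptions made for the example.

```python
import torch

batch_size, vocab_size = 3, 7                      # toy sizes
scores = torch.randn(batch_size, 1, vocab_size)    # scores for one decoding step

# topk(1) keeps the last dimension, so the indexes have shape (batch, 1, 1)
_, topi = scores.topk(1)
# Dropping the last dimension gives the (batch, 1) shape expected by the embedding
next_input = topi.squeeze(-1).detach()

print(topi.shape)
print(next_input.shape)
```

The result has the same shape as the teacher-forced input `target_tensor[:, i].unsqueeze(1)`, so both branches feed `forward_step` identically.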

Create an instance of one ```EncoderRNN``` and one ```DecoderRNN```. In order to do this, get the vocabulary sizes for the appropriate languages from the ```TranslationDataset``` object.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Get the vocabulary sizes for the appropriate languages from the TranslationDataset object
input_size = len(new_dataset.input_lang.word2idx)
output_size = len(new_dataset.output_lang.word2idx)

encoder = EncoderRNN(input_size=input_size, hidden_size=hidden_size).to(device)
decoder = DecoderRNN(hidden_size=hidden_size, output_size=output_size).to(device)

Implement the training loop in the ```train_epoch``` function, following the model from the previous lab. Note that we will use separate *optimizers* for the encoder and decoder. **Be careful with the shapes of the model outputs when passing them to the criterion!**

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion):
    total_loss = 0
    for data in dataloader:
        input_tensor, target_tensor = data
        input_tensor = input_tensor.to(device)
        target_tensor = target_tensor.to(device)
        # Initiate gradient
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()
        # Forward
        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor.size(1), target_tensor)
        # Compute loss : put the output at the right size, the reference too, and apply the criterion
        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        # Compute gradient
        loss.backward()
        # Update weights
        encoder_optimizer.step()
        decoder_optimizer.step()
        # Keep track of loss
        total_loss += loss.item()
    return total_loss / len(dataloader)
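
The reshaping done before the criterion can be illustrated on a toy batch. The sketch below also shows an optional variant, `nn.NLLLoss(ignore_index=PAD_TOKEN)`, which leaves padded positions out of the average. This variant is an assumption for illustration, not what the lab code uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_TOKEN = 0
batch_size, seq_len, vocab_size = 2, 4, 6          # toy sizes

# Fake decoder outputs: log-probabilities of shape (batch, length, vocab)
log_probs = F.log_softmax(torch.randn(batch_size, seq_len, vocab_size), dim=-1)
targets = torch.tensor([[1, 3, 2, 0],              # last position is padding
                        [1, 4, 5, 2]])

# NLLLoss expects (N, classes) scores and (N,) targets, hence the flattening
flat_log_probs = log_probs.view(-1, vocab_size)    # (8, 6)
flat_targets = targets.view(-1)                    # (8,)

loss_all = nn.NLLLoss()(flat_log_probs, flat_targets)
loss_no_pad = nn.NLLLoss(ignore_index=PAD_TOKEN)(flat_log_probs, flat_targets)
print(loss_all.item(), loss_no_pad.item())
```

Without `ignore_index`, the model is also trained to predict padding, which can make the loss look lower than it is on real tokens.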

We can now simply loop on this using the following function:

In [None]:
def train(train_dataloader, encoder, decoder, n_epochs=80, learning_rate=0.001, print_every=10, plot_every=10):
    encoder.train()
    decoder.train()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every
    # Initialize optimizers
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    # Initialize criterion
    criterion = nn.NLLLoss()
    # Training loop
    for epoch in range(n_epochs):
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if (epoch + 1) % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('(%d %d%%) %.4f' % (epoch, epoch / n_epochs * 100, print_loss_avg))

        if (epoch + 1) % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

We also need to implement an ```evaluate``` function. Here, we will need to use the decoder in **inference** mode, so it will re-use the outputs it generates to continue processing. We will then transform this sequence of outputs into **words**. What is the stopping condition for our model generating words ?

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def evaluate(encoder, decoder, sentence, max_length, input_lang, output_lang):
    encoder.eval()
    decoder.eval()
    # One example to evaluate
    # We need to make it into a batch of one example to respect tensor dimensions
    input_tensor = tensorFromSentence(sentence, input_lang, max_length).view(1, -1).to(device)
    # Forward, without tracking gradients since we are only doing inference
    with torch.no_grad():
        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden, max_length)
    # Get best output
    _, topi = decoder_outputs.topk(1)
    decoded_ids = topi.squeeze()
    # Decode until the model emits EOS, or until max_length words have been produced
    decoded_words = []
    for idx in decoded_ids:
        if idx.item() == EOS_TOKEN:
            decoded_words.append('<EOS>')
            break
        decoded_words.append(output_lang.idx2word[idx.item()])
    return decoded_words, decoder_attn
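
The stopping condition can be isolated in a few lines of plain Python. In this sketch, `ids_to_words` and the toy `idx2word` dictionary are hypothetical, introduced only for illustration.

```python
EOS_TOKEN = 2

def ids_to_words(ids, idx2word, eos=EOS_TOKEN):
    # Convert indexes to words, stopping at the first EOS (which is not kept here)
    words = []
    for idx in ids:
        if idx == eos:
            break
        words.append(idx2word[idx])
    return words

# Toy vocabulary: indexes 3+ are real words
idx2word = {1: "SOS", 2: "EOS", 3: "je", 4: "suis", 5: "prêt"}

# Everything the decoder produced after EOS is ignored
print(ids_to_words([3, 4, 5, 2, 4, 4], idx2word))
```

If EOS is never produced, generation simply stops when the decoder has run for `max_length` steps.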

Let's use this function to evaluate our model on a random subset of the training data:

In [None]:
def evaluateRandomly(encoder, decoder, dataset, n=10):
    # do n examples
    for i in range(n):
        # select one from the known data to avoid vocabulary issue
        pair = random.choice(dataset.pairs)
        print('>', pair[0])
        print('=', pair[1])

        output_words, _ = evaluate(encoder, decoder, pair[0], dataset.max_length, dataset.input_lang, dataset.output_lang)
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Now, execute the training loop and look at what the model generates. It should be fast on a GPU, and not take too long on a CPU.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
train(training_dataloader, encoder, decoder, print_every=5, plot_every=5)

(4 5%) 1.6361
(9 11%) 1.0442
(14 17%) 0.7959
(19 23%) 0.6289
(24 30%) 0.5072
(29 36%) 0.4142
(34 42%) 0.3445
(39 48%) 0.2915
(44 55%) 0.2494
(49 61%) 0.2149
(54 67%) 0.1874
(59 73%) 0.1648
(64 80%) 0.1463
(69 86%) 0.1279
(74 92%) 0.1140
(79 98%) 0.1013


In [None]:
evaluateRandomly(encoder, decoder, new_dataset)

> i am ready to follow you
= je suis prêt à te suivre
< <EOS>

> she is a nurse
= elle est infirmière
< SOS c'est un bébé <EOS>

> he is almost six feet tall
= il fait presque six pieds de haut
< SOS il est sûr de réussir l'examen <EOS>

> i am afraid it will rain in the afternoon
= je crains qu'il ne pleuve dans l'aprèsmidi
< j'ai amitié de tête de tête de café <EOS>

> i am interested in swimming
= la natation m'intéresse
< la dernière personne la plus la tête plus un de

> she is no match for me
= elle ne fait pas le poids avec moi
< SOS elle n'a pas très riche mais elle est heureuse

> we are the same age but different heights
= nous avons le même âge mais sommes de tailles différentes
< minute la tête de la ville <EOS>

> he is engaged to my younger sister
= il est fiancé à ma jeune sur
< SOS il est trop honnête pour dire à la dernière

> i am growing to hate the girl
= je me mets à détester cette fille
< <EOS>

> you aren't as short as i am
= vous n'êtes pas aussi petit que moi
<

### III - Attention module

We will now implement a new class ```Attention``` inheriting from ```Module```.

Begin by implementing it following the scheme presented in class. In order to implement this efficiently, you will need to use *batched* operations and pay attention to shapes. In particular, use the **batched matrix multiplication** ```torch.bmm```. Use shape manipulation functions (```permute, squeeze, unsqueeze```) when needed.
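
As a reminder of what `torch.bmm` computes, the sketch below compares it with an explicit loop over the batch on toy tensors. The sizes are arbitrary assumptions.

```python
import torch

batch, n, m, p = 3, 1, 4, 5          # toy sizes
a = torch.randn(batch, n, m)         # e.g. one query vector per example
b = torch.randn(batch, m, p)         # e.g. transposed keys per example

# bmm multiplies the i-th matrix of `a` with the i-th matrix of `b`
batched = torch.bmm(a, b)            # (batch, n, p)

# Same computation, one example at a time
looped = torch.stack([a[i] @ b[i] for i in range(batch)])

print(batched.shape)
print(torch.allclose(batched, looped, atol=1e-6))
```

Both inputs must be 3-dimensional with the batch dimension first, which is why the GRU hidden state of shape `(1, batch, hidden)` has to be permuted before being used as a query.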

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()

    def forward(self, query, keys):
        # What shape do we need the query in ?
        # The GRU hidden state is (1, batch, hidden): make it batch-first, (batch, 1, hidden)
        query = query.permute(1, 0, 2)
        # Compute similarity scores with bmm. Any shape change for keys ?
        # keys is (batch, src_len, hidden): transpose it so that scores is (batch, 1, src_len)
        scores = torch.bmm(query, keys.transpose(1, 2))
        # Apply softmax over the source positions to get weights
        weights = F.softmax(scores, dim=-1)
        # Use bmm to make the weighted sum of the (untransposed) keys: (batch, 1, hidden)
        context = torch.bmm(weights, keys)
        return context, weights

Then, you will need to modify the decoder class into a new ```AttentionDecoderRNN```. The usual way of implementing the loop in ```forward_step``` is as follows:

- Apply the recurrent loop as before: $\mathbf{s}_{t} = \text{GRU}(\mathbf{r}_t, \mathbf{s}_{t-1})$
- Noting $\mathbf{z}_t = \text{Attention}(\mathbf{H}, \mathbf{s}_t)$ the output of the attention, we compute a modified state $\tilde{\mathbf{s}}_t$: $$ \tilde{\mathbf{s}}_t = \tanh(\mathbf{W}_a \times [\mathbf{z}_t; \mathbf{s}_t])$$ based on the concatenation of the attention output and the output of the GRU.
- We predict scores based on this modified state: $\mathbf{o}_t = \mathbf{W}_{out} \times \tilde{\mathbf{s}}_t$.

**Important**:
- You need to instantiate the ```Attention``` class when building the decoder.
- You also need a new parameter representing $\mathbf{W}_a$, of the appropriate size - as this matrix is applied to a concatenation of the attention output and the decoder hidden state.
- You need to keep track of attention weights at each step, and also concatenate them and output them at the end of the ```forward```.
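
The last two equations can be sketched on random tensors to check the sizes involved. This is a minimal illustration with toy sizes. Note that `nn.Linear` adds a bias term that the equations above do not show, and the variable names simply mirror the symbols.

```python
import torch
import torch.nn as nn

batch, hidden_size, vocab_size = 2, 8, 11       # toy sizes
z_t = torch.randn(batch, 1, hidden_size)        # attention output (context)
s_t = torch.randn(batch, 1, hidden_size)        # GRU output at step t

# W_a acts on the concatenation [z_t; s_t], hence 2 * hidden_size inputs
W_a = nn.Linear(2 * hidden_size, hidden_size)
W_out = nn.Linear(hidden_size, vocab_size)

s_tilde = torch.tanh(W_a(torch.cat((z_t, s_t), dim=-1)))   # (batch, 1, hidden)
o_t = W_out(s_tilde)                                       # (batch, 1, vocab)

print(s_tilde.shape, o_t.shape)
```

Since attention is applied after the recurrent step in this scheme, the GRU itself only receives the embedded token, so its input size stays `hidden_size`.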

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class AttentionDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttentionDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        # Don't forget to instantiate the Attention
        self.attention = Attention(hidden_size=hidden_size)
        # The GRU only receives the embedded token, so its input size is hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        # And the new linear layer needed (W_a), applied to the concatenation [z_t; s_t]
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward_step(self, input, hidden, encoder_outputs):
        # Get your input through embedding, apply the recurrent layer
        embedded = self.dropout(self.embedding(input))       # (batch, 1, hidden)
        gru_output, hidden = self.gru(embedded, hidden)      # (batch, 1, hidden), (1, batch, hidden)
        # Compute the attention, using the new hidden state as query
        query = hidden
        context, attn_weights = self.attention(query, encoder_outputs)
        # Concatenate the result of the attention and the output of the GRU
        concat_input = torch.cat((context, gru_output), dim=2)   # (batch, 1, 2 * hidden)
        # Apply the linear transformation and tanh
        output = torch.tanh(self.concat(concat_input))
        # Apply the last layer to obtain scores
        output = self.out(output)
        return output, hidden, attn_weights

    def forward(self, encoder_outputs, encoder_hidden, max_length, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        # Same shape as in DecoderRNN: one SOS token per example, (batch, 1)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_TOKEN)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        # New: attention list
        attention_weights = []

        for i in range(max_length):
            decoder_output, decoder_hidden, weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            # Also keep track of attentions
            attention_weights.append(weights)

            if target_tensor is not None:
                decoder_input = target_tensor[:, i].unsqueeze(1)

            else:
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        # Concatenate the weights on the output length dimension: (batch, max_length, src_len)
        attention_weights = torch.cat(attention_weights, dim=1)
        return decoder_outputs, decoder_hidden, attention_weights

Create new encoder and decoders instances for this model, and train them !

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
encoder_att = EncoderRNN(input_size=input_size, hidden_size=hidden_size).to(device)
decoder_att = AttentionDecoderRNN(hidden_size=hidden_size, output_size=output_size).to(device)

train(training_dataloader, encoder_att, decoder_att, print_every=5, plot_every=5)

We tried several ways to implement the ```AttentionDecoderRNN```, and the recurring difficulty was a tensor-size error. The shapes to watch are the GRU hidden state, which is `(1, batch, hidden)` rather than batch-first, the decoder input, which must be `(batch, 1)`, and the GRU input size, which must match the embedding dimension. We would welcome a correction or further tips on this part, as we are keen to deepen our knowledge of NLP.


---



Use the following function to visualize the attention learnt by the model:

In [None]:
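def showAttention(input_sentence, output_words, attentions):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.detach().cpu().numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes: one tick per source token (columns) and per output word (rows)
    # tensorFromSentence wraps the input sentence between SOS and EOS
    x_labels = ['<SOS>'] + input_sentence.split(' ') + ['<EOS>']
    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels, rotation=90)
    ax.set_yticks(range(len(output_words)))
    ax.set_yticklabels(output_words)
    plt.show()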

def evaluateAndShowAttention(input_sentence, encoder, decoder, dataset):
    output_words, attentions = evaluate(encoder, decoder, input_sentence, dataset.max_length, dataset.input_lang, dataset.output_lang)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions[0, :len(output_words), :])

In [None]:
evaluateRandomly(encoder_att, decoder_att, new_dataset)

In [None]:
evaluateAndShowAttention('i am not a doctor but a teacher', encoder_att, decoder_att, new_dataset)

We played here with a dataset but did not rigorously evaluate. If you were to create a model or re-use one, describe the rigorous methodology you could use to **evaluate model performance** - look for appropriate metrics and which functions you could use.

<div class='alert alert-block alert-warning'>
            Question:</div>

In this lab, we did not split the dataset into distinct training and test sets. When creating or re-using a model, I should therefore start by splitting the data into training, validation and test subsets: the validation subset is used to tune hyperparameters, while the test subset is kept aside to evaluate the final model.
To evaluate the model, various metrics can be used, including the one we saw in class: ***BLEU (Bilingual Evaluation Understudy)***. BLEU is a standard choice because it measures how many words and n-grams in the translated output match the reference translations.
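
To make the idea behind BLEU concrete, here is a simplified sketch: clipped n-gram precisions up to bigrams, a single reference, a brevity penalty and no smoothing. The function `simple_bleu` is a hypothetical teaching version. Standard BLEU uses up to 4-grams at corpus level, and libraries such as NLTK (`nltk.translate.bleu_score`) or sacreBLEU would be preferred for real evaluations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0   # no smoothing in this sketch
        precisions.append(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "je suis prêt à te suivre".split()
print(simple_bleu(reference, reference))
print(simple_bleu("je suis prêt".split(), reference))
```

The second call shows the role of the brevity penalty: every n-gram of the short candidate appears in the reference, yet the score is well below 1 because half of the sentence is missing.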


---



We improved our initial model with attention. But considering our goal is to **generate text**, what is the aspect we did not consider yet and that we could improve ? What did we do 'too simply' and with which algorithm could we improve it ? How would you go about implementing that given our current code ?

<div class='alert alert-block alert-warning'>
            Question:</div>

Since we aim to generate text, an essential aspect that could be refined is the architecture: Transformers handle sequences with parallel processing and capture long-range dependencies more effectively than the RNN-based models used in this lab.
In this seq2seq project, we implemented a simple attention mechanism in the decoder. A potential improvement is the more elaborate ***self-attention mechanism***, which allows each part of the input sequence to attend to every other part.
To do so, we could replace the RNN-based encoder and decoder with Transformer blocks built on multi-head self-attention layers and positional encoding.


---

