<a href="https://colab.research.google.com/github/cerasole/ml4hep/blob/main/RNNs/torch_Encoder_Decoder_for_Neural_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine translation using RNNs

[It is suggested to run this notebook on a GPU, e.g. in Colab]

The theory of machine translation using RNNs is described, for instance, in Chapter 16 of https://github.com/ageron/handson-ml3.

Basics of the algorithm
- Input: sentence in a given language,
- Output: translation of the input sentence in a different language.

The architecture consists of an encoder-decoder system.
- The input sentence is fed to the encoder, which transforms it into a low-dimensional latent representation.
- The decoder transforms the latent representation into an output sentence.

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata

import string
import re
import random

import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
device

device(type='cpu')

## Language class for tokenization

We define a class for handling languages and sentences.

This class needs to
- register the words in the sentences given as input to the class.

In other words, we use this class to perform the tokenization, i.e. to encode every word as an integer.

In the preprocessing, we will also standardize the sentences. This step is not included in the Lang class, but it is a fundamental part of the preprocessing.

In [3]:
SOS_token = 0   # Start-of-Sentence
EOS_token = 1   # End-of-Sentence

class Lang:

    def __init__(self, name):
        self.name = name # name of the language
        self.word2index = {} # dictionary containing words: indices when they were first inserted in the dictionary
        self.word2count = {} # dictionary containing words: number of times they were inserted in the dictionary
        self.index2word = {0: "SOS", 1: "EOS"} # dictionary containing indices of first insertion in the dictionary: words
        self.n_words = 2  # Total counts of words in the dictionary, including SOS and EOS

    def addWord(self, word):
        # When we add a word, we have to check if it is already present in the dictionary, e.g. in the word2index.
        # If it is not present, we need to
        #  - add this word to the self.word2index dictionary, giving it as index the current self.n_words,
        #  - add this word to the self.word2count dictionary, giving it 1 count,
        #  - add, in the self.index2word dictionary, using as index given by self.n_words, the word itself,
        #  - increase by 1 the number of total words, aka self.n_words.
        # If it is already present, we need to
        #  - increase by 1 the corresponding self.word2count entry.
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def addSentence(self, sentence):
        # Add words splitting the sentence by space.
        # It would be better if the sentence is already standardized.
        # We will take care of this in the preprocessing of the sentences.
        for word in sentence.split(' '):
            self.addWord(word)


In [None]:
lang = Lang("prova")
print (lang.word2count, lang.word2index, lang.index2word, lang.n_words)
lang.addSentence("could you please stop the noise?")
print (lang.word2count, lang.word2index, lang.index2word, lang.n_words)

{} {} {0: 'SOS', 1: 'EOS'} 2
{'could': 1, 'you': 1, 'please': 1, 'stop': 1, 'the': 1, 'noise?': 1} {'could': 2, 'you': 3, 'please': 4, 'stop': 5, 'the': 6, 'noise?': 7} {0: 'SOS', 1: 'EOS', 2: 'could', 3: 'you', 4: 'please', 5: 'stop', 6: 'the', 7: 'noise?'} 8
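As a minimal sketch of what the tokenization gives us, a sentence can be mapped to integer indices and back with two dictionaries like those stored in `Lang`. The helper names `encode` and `decode` are illustrative and not part of the notebook.

```python
# Toy vocabulary built the same way Lang.addSentence does: split on spaces and
# give each new word the next free index (0 and 1 are reserved for SOS/EOS).
word2index = {}
index2word = {0: "SOS", 1: "EOS"}
for word in "could you please stop the noise ?".split(" "):
    if word not in word2index:
        index = len(index2word)
        word2index[word] = index
        index2word[index] = word

def encode(sentence):
    """Hypothetical helper: sentence -> list of integer indices."""
    return [word2index[word] for word in sentence.split(" ")]

def decode(indices):
    """Hypothetical helper: list of integer indices -> sentence."""
    return " ".join(index2word[i] for i in indices)

indices = encode("please stop the noise")
print(indices)          # [4, 5, 6, 7]
print(decode(indices))  # please stop the noise
```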


## Preprocessing: standardization of the sentences

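In the next cells, we look at how to standardize the input sentences:
- transform everything to lower case,
- strip the accents and remove special characters, keeping only letters and the punctuation marks ., ! and ?,
- take into account abbreviations (after this cleaning, e.g. "he's" becomes "he s").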

In [4]:
# https://stackoverflow.com/a/518232/2809427
# Turn a Unicode string into plain ASCII where possible, i.e.
# - decompose each accented letter into base letter + accent (NFD normalization)
# - drop the accents (category 'Mn'); characters with no such decomposition are left unchanged
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

In [None]:
unicodeToAscii("ciàò!")

'ciao!'
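To see why `unicodeToAscii` works, it may help to look at what the NFD normalization does to a single accented letter. The sketch below uses only the standard library; the characters chosen are just examples.

```python
import unicodedata

# NFD splits an accented letter into the base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e0")   # "\u00e0" is a with grave accent
print(len(decomposed))                                # 2
print([unicodedata.category(c) for c in decomposed])  # ['Ll', 'Mn']

# Dropping the 'Mn' (nonspacing mark) characters leaves the plain letter.
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)                                       # a

# A letter without a canonical decomposition (here the German sharp s) is unchanged.
print(unicodedata.normalize("NFD", "\u00df") == "\u00df")   # True
```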

In [5]:
def normalizeString(s):
    # s.lower() converts to lowercase
    # s.strip() removes leading and trailing whitespace
    # unicodeToAscii(s) strips the accents
    # re.sub(pattern, repl, string) returns the string obtained by replacing the
    #  leftmost non-overlapping occurrences of the pattern in string with the replacement repl
    #  First re.sub:
    #  - the [] define a set of characters: we want to match any of ".", "!" and "?".
    #  - the outer () "capture" what was matched, so that it can be re-used in the replacement:
    #    r" \1" means "a space followed by the captured text". The r prefix (raw string) is needed,
    #    otherwise Python would interpret "\1" as an escape sequence instead of passing the backslash to re.
    #  - So, the first command transforms "." to " .", "!" to " !", "?" to " ?", "??" to " ? ?"
    #  Second re.sub:
    #  - Again we use [] to indicate a set of characters for the pattern.
    #  - The leading ^ negates the set: the pattern matches everything that is *not* listed in the [].
    #    a-zA-Z excludes lowercase and uppercase letters (even though we already lowercased everything),
    #    .!? excludes the three characters ., ! and ?
    #    So, the pattern matches everything that is not a letter, nor a ., nor a !, nor a ?,
    #    and we replace it with a single space.
    #  - Without the final +, a string like "***" would be transformed to "   ", which is not good
    #    as later we will use split(" "). The + makes the pattern match one or more consecutive
    #    such characters, so that the whole run is replaced by a single space.
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

In [None]:
"Ciao!".lower()

'ciao!'

In [None]:
" ciao ! ".strip()

'ciao !'

In [None]:
re.sub(r"([?])", r" \1", "Ciao??")

'Ciao ? ?'

In [None]:
re.sub(r"[^a-zA-Z.!?]+", r" ", "a ?*^*:-_=+"), re.sub(r"[^a-zA-Z.!?]", r" ", "a ?*^*:-_=+")

('a ? ', 'a ?        ')
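Putting the three steps together, here is a self-contained sketch of the same normalization applied to two made-up sentences. The function is called `normalize` only to keep the example independent of the cells above.

```python
import re
import unicodedata

def normalize(s):
    # Same steps as normalizeString, repeated here so the example is self-contained.
    s = "".join(
        c for c in unicodedata.normalize("NFD", s.lower().strip())
        if unicodedata.category(c) != "Mn"
    )
    s = re.sub(r"([.!?])", r" \1", s)       # put a space before ., ! and ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)   # collapse any run of other characters into one space
    return s

print(normalize("Ciao, Mondo!!"))   # ciao mondo ! !
print(normalize("J'ai 20 ans."))    # j ai ans .
```

Note that digits and apostrophes are removed as well, which is why cleaned sentences such as 'j ai ans .' can appear in the dataset.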

## Download the **data**

In [7]:
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2024-08-27 03:49:31--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 18.160.10.36, 18.160.10.22, 18.160.10.28, ...
Connecting to download.pytorch.org (download.pytorch.org)|18.160.10.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2024-08-27 03:49:31 (25.9 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.tx

In [8]:
filename = "data/eng-fra.txt"
lines = open(filename, encoding = "utf-8").read().strip().split("\n")
pairs = [
    [
        normalizeString(s) for s in l.split("\t")
    ] for l in lines
]

In [9]:
len(pairs)

135842

In [None]:
pairs[0]

['go .', 'va !']

In [None]:
list(reversed(pairs[0]))

['va !', 'go .']

In [23]:
def reverse_pairs(pairs):
    return [list(reversed(pair)) for pair in pairs]

### Set the languages for the *translator*

In [10]:
############# Command for an English to French translator #############
lang1 = "eng"
lang2 = "fra"
english_index = 0
if pairs[0][0] != "go .":
  pairs = reverse_pairs(pairs)
input_lang, output_lang = Lang(lang1), Lang(lang2)

In [24]:
############# Command for a French to English translator #############
lang1 = "fra"
lang2 = "eng"
english_index = 1
if pairs[0][0] == "go .":
  pairs = reverse_pairs(pairs)
input_lang, output_lang = Lang(lang1), Lang(lang2)

### Cut in the **sentences**

Here we introduce a **drastic** cut on the sentences.
Among all the 135k sentence pairs, we retain only those that satisfy these requirements:
- the sentences in both languages must have fewer than MAX_LENGTH = 10 space-separated tokens,
- the English sentence must begin with a personal pronoun followed by the present simple of "to be" (e.g. "i am ", "he is " or the contracted "he s ").

In [25]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is ", "he s ",   # note that we need to include the blank space, otherwise also the sentence "he stopped" would be included
    "she is ", "she s ",
    "you are ", "you re ",
    "we are ", "we re ",
    "they are ", "they re "
)

def filterPair(p, english_index):
    # Cut on the length of the sentences in the two languages
    short_enough = all(len(p[i].split(" ")) < MAX_LENGTH for i in range(2))
    # Cut: the English sentence has to start with one of the prefixes in eng_prefixes
    # (str.startswith accepts a tuple of prefixes)
    return short_enough and p[english_index].startswith(eng_prefixes)

def filterPairs(pairs, english_index):
    return [pair for pair in pairs if filterPair(pair, english_index)]

cut_pairs = filterPairs(pairs, english_index)

In [26]:
print (f"From {len(pairs)} sentences, we ended up with {len(cut_pairs)} sentences after the cut.")

From 135842 sentences, we ended up with 10522 sentences after the cut.


In [None]:
pairs[0][0], cut_pairs[0][0]

('va !', 'j ai ans .')
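The effect of the two cuts can be checked on a few hand-written pairs. This is only a sketch: the prefix list is shortened, the pairs are invented, and `keep_pair` is an illustrative name rather than the notebook's `filterPair`.

```python
MAX_LENGTH = 10
eng_prefixes = ("i am ", "i m ", "he is ", "he s ")  # shortened list, for illustration only

def keep_pair(pair, english_index):
    # Both sentences must have fewer than MAX_LENGTH space-separated tokens ...
    short_enough = all(len(s.split(" ")) < MAX_LENGTH for s in pair)
    # ... and the English one must start with one of the prefixes.
    return short_enough and pair[english_index].startswith(eng_prefixes)

toy_pairs = [
    ["il est malade .", "he s sick ."],      # kept: starts with "he s "
    ["il s est arrete .", "he stopped ."],   # dropped: "he stopped" does not start with "he s "
    ["va !", "go ."],                        # dropped: no matching prefix
]
kept = [p for p in toy_pairs if keep_pair(p, english_index=1)]
print(kept)   # [['il est malade .', 'he s sick .']]
```

The second pair shows why the trailing blank space in the prefixes matters.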

## Preprocessing: tokenization of the post-cut sentences using the Lang objects

In [29]:
### Fill the Lang instances
if input_lang.n_words == 2:  # if not already filled
    for pair in cut_pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])

In [30]:
input_lang.name, input_lang.n_words

('fra', 4341)

In [31]:
output_lang.name, output_lang.n_words

('eng', 2802)

## Preprocessing: transform the tokenization results into torch.tensor objects



Now we have to transform the cut_pairs into something pytorch-friendly using the tokenization provided by the attributes inside the two Lang objects that we just filled.

In [18]:
def tensorFromSentence(lang, sentence):
    """
    This function
    - separates the sentence into words,
    - transforms the vector of words into a vector of indices,
    - appends the EOS token, i.e. 1
    - transforms the indices vector into a torch.tensor with shape (len(indices), 1) using the torch.Tensor.view method.
    """
    words = sentence.split(" ")
    indices = [lang.word2index[word] for word in words]
    indices.append(EOS_token)
    return torch.tensor(indices, dtype = torch.long, device = device).view(-1, 1)

def tensorsFromPair(pair):
    """
    This function takes a single pair, which contains the sentences in the two languages,
    and passes each of the sentences, with the corresponding Lang object, to the tensorFromSentence function.
    It returns the tuple of the input and target tensors.
    """
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

In [21]:
input_lang.name

'eng'

In [32]:
ptt = tensorFromSentence(output_lang, "i will try to fix you")
print (ptt)
print (ptt.shape)
print (ptt.dtype)

tensor([[   2],
        [1700],
        [1096],
        [ 532],
        [2218],
        [ 129],
        [   1]])
torch.Size([7, 1])
torch.int64
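One thing worth keeping in mind: `tensorFromSentence` looks up every word in `lang.word2index`, so a word that was never registered raises a `KeyError`. The sketch below reproduces the same logic with plain lists and a made-up three-word vocabulary; `indices_from_sentence` is an illustrative name.

```python
EOS_token = 1
word2index = {"i": 2, "will": 3, "try": 4}   # toy vocabulary, for illustration only

def indices_from_sentence(sentence):
    # Same logic as tensorFromSentence, but returning a nested list of shape (len, 1)
    # instead of a torch tensor.
    indices = [word2index[word] for word in sentence.split(" ")]
    indices.append(EOS_token)
    return [[i] for i in indices]

print(indices_from_sentence("i will try"))   # [[2], [3], [4], [1]]

try:
    indices_from_sentence("i will fly")
except KeyError as err:
    print("unknown word:", err)              # unknown word: 'fly'
```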

================================================================
### Useful preliminary: embeddings

[See pages 430-439 of the book by A. Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow"]

Embeddings are used in the preprocessing step. If the data contain categorical or text features, these need to be converted to numerical features.

For instance, consider a categorical feature such as the ocean proximity ("<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND") in the version of the California housing dataset used in the Géron book (note that the scikit-learn version [https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html] contains only numerical features).
Or, consider the words of each vocabulary that we're considering in this notebook.

Such features must be encoded numerically before they can be used in a NN.

##### **Dimension = 1**
As a first choice, it is possible to map each category to its index (0 to 4), possibly adding an index 5 for out-of-vocabulary (oov) instances.
The drawback is that categories with close indices (e.g. 2 and 3) are treated by the NN as more similar than categories with distant indices (e.g. 1 and 5).
Such a distance is an artifact of the arbitrary mapping of categories to indices and has nothing to do with the categories themselves. So, this is a bad choice that introduces a bias in the network.

[NB: Notice that this is what we did when we tokenized the words of the two languages: we mapped each word to an index.
However, if we used these indices directly in the NN, we would be implying that the words with indices 225 and 226 are closer than those with indices 10 and 450. This makes no sense: the indices only reflect the order in which we fed the sentences to the Lang objects.]

##### **Dimension = number N of categories**
As a second choice, we can use one-hot vectors: starting from the indices defined earlier, we associate the category with index i (out of N) with the vector of length N that is zero everywhere except for a 1 in the i-th position.
It's as if we were adding N features to the dataset.
Equivalently, each instance now becomes an array of N values.
This works well when N is small, but it becomes inefficient when the number of categories is large (e.g. the thousands of words of a vocabulary).
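A short sketch of one-hot encoding for N = 5 categories, using only plain Python lists (the helper names are illustrative). It also shows that, unlike raw indices, one-hot vectors put every pair of distinct categories at the same distance.

```python
def one_hot(index, num_categories):
    """Return a list of length num_categories with a single 1 at position index."""
    vector = [0] * num_categories
    vector[index] = 1
    return vector

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

N = 5
for i in range(N):
    print(i, one_hot(i, N))

# Neighbouring indices (0, 1) and distant indices (0, 4) are equally far apart.
print(squared_distance(one_hot(0, N), one_hot(1, N)))   # 2
print(squared_distance(one_hot(0, N), one_hot(4, N)))   # 2
```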

##### **So, let's take an intermediate number of dimensions between 1 and N?**
If so, let's take a number D of dimensions (called embedding_dim in PyTorch), intermediate between 1 and N.
This number D is a hyperparameter that can be tweaked, so we can try different values of D and see what changes.
- With one-hot vectors, of dimension N, each category is associated with a fixed vector that is zero everywhere except for a single 1.
- With embeddings, the situation is different, because embeddings are trainable!
At first, a vector of dimension D is initialized randomly for each category (equivalently, a matrix of shape N x D, i.e. num_embeddings x embedding_dim, is initialized randomly). Then, these vectors are tuned during training, so that each category settles in some region of the D-dimensional space, and hopefully different categories end up in different regions.

Trivially, the number of parameters of an embedding is equal to the product of num_embeddings and embedding_dim.

Learning such vectors during training is called 'representation learning'.
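As a rough illustration of what an embedding layer stores, the sketch below builds a random N x D matrix with plain Python lists and checks that looking up row i gives the same result as multiplying the one-hot vector of i by the matrix. The values N = 5 and D = 2 and the function names are arbitrary choices for the example; no training is performed here.

```python
import random

random.seed(0)
N, D = 5, 2   # number of categories, embedding dimension (toy values)

# Randomly initialised N x D matrix: one row (one D-dimensional vector) per category.
embedding_matrix = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]

def embed(index):
    # An embedding layer is a row lookup ...
    return embedding_matrix[index]

def embed_via_one_hot(index):
    # ... which gives the same vector as multiplying the one-hot vector by the matrix.
    one_hot = [1.0 if i == index else 0.0 for i in range(N)]
    return [sum(one_hot[i] * embedding_matrix[i][d] for i in range(N)) for d in range(D)]

difference = max(abs(a - b) for a, b in zip(embed(3), embed_via_one_hot(3)))
print(difference < 1e-12)                          # True
print(sum(len(row) for row in embedding_matrix))   # 10, i.e. N * D trainable numbers
```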

================================================================

In the following example, we create an embedding with 60 input categories (input_size = 60, the num_embeddings argument) and 10 hidden dimensions (hidden_size = 10, the embedding_dim argument).

So, essentially, it is a 60 x 10 matrix, which has one row per input category and one column per hidden dimension.

The tensor t contains the categories number 3, 4 and 58. Since indices start from 0, the input size (= first argument of the Embedding) must be strictly larger than the largest index, i.e. input_size >= 59. Otherwise, the lookup will raise an error.
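The index constraint can be checked directly. This sketch assumes PyTorch is installed and keeps everything on the CPU, where an out-of-range index raises an `IndexError` (on a GPU the failure may surface differently).

```python
import torch
import torch.nn as nn

emb_check = nn.Embedding(num_embeddings=60, embedding_dim=10)   # CPU only, for this check

print(emb_check(torch.tensor([3, 4, 58])).shape)   # torch.Size([3, 10])
print(emb_check(torch.tensor([59])).shape)         # torch.Size([1, 10]): 59 is the largest valid index

try:
    emb_check(torch.tensor([60]))                  # 60 is out of range for a table with 60 rows
except IndexError as err:
    print("IndexError:", err)
```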

In [None]:
# Example
emb = nn.Embedding(60, 10, device = device)

In [None]:
t = torch.tensor([3, 4, 58], dtype = torch.long, device = device)
emb(t)
# This is the representation of the 3 categories with indices 3, 4 and 58.

tensor([[ 1.3775, -1.1485,  1.1086,  0.4524,  1.1358,  0.6782,  1.2561, -0.3769,
          1.4632, -2.3228],
        [-1.7808, -1.4599,  1.8347, -0.1513, -1.4200,  0.5593, -0.8887, -0.2016,
          1.3862,  0.5823],
        [-0.4966, -0.0085,  1.2775, -0.4772,  1.3190, -1.2386, -1.5081,  1.1757,
         -0.3952, -0.6479]], grad_fn=<EmbeddingBackward0>)

In [None]:
t = torch.tensor([3, 4, 58], dtype = torch.long, device = device).view(-1, 1)
emb(t)  # no further .view(-1, 1, 10) is needed: the output already has shape (3, 1, 10)
# Same values as before, just a different shape because of the .view method
# --> we're considering separate categories, each instance has only 1 category.

# NB. By default, PyTorch RNNs take inputs with shape (seq_len, batch_size, input_size); here batch_size = 1.

tensor([[[ 1.3775, -1.1485,  1.1086,  0.4524,  1.1358,  0.6782,  1.2561,
          -0.3769,  1.4632, -2.3228]],

        [[-1.7808, -1.4599,  1.8347, -0.1513, -1.4200,  0.5593, -0.8887,
          -0.2016,  1.3862,  0.5823]],

        [[-0.4966, -0.0085,  1.2775, -0.4772,  1.3190, -1.2386, -1.5081,
           1.1757, -0.3952, -0.6479]]], grad_fn=<EmbeddingBackward0>)

So, in our case, the input size of the encoder is necessarily equal to input_lang.n_words.

By contrast, the hidden size is a hyperparameter that we can tweak.

## Define an RNN encoder

As is standard in PyTorch, a model is a subclass of the torch.nn.Module class and needs a forward method that evaluates the model on a given input.

**constructor**

In this case, the constructor of the encoder takes 2 arguments:
- input_size, which is NOT the number of words of the input sentence, but the number of words of the input_lang object ($\approx$ 4.3k for French).
- hidden_size, which is the number of dimensions of the embedding (and of the hidden state of the GRU).

1 - As usual, the first thing to do is to use the constructor of the parent class. This is done as `super().__init__()`, possibly passing some arguments that were given to the constructor.

2 - Then, we save as attributes the input and hidden sizes.

3 - Then, we define as attributes the layers that we're including in the encoder: an embedding for treating the categorical inputs and a GRU (Gated Recurrent Unit, a gated variant of the simple recurrent layer).

**forward**

1 - We feed self.embedding with the input tensor (which represents a sequence of word indices) and reshape the result to (-1, 1, hidden_size), because the GRU expects inputs of shape (seq_len, batch_size, input_size).

2 - We pass to the GRU both the embedding output and hidden, which is the previous hidden state of the GRU (recall that it's a Recurrent Neural Network, so the output at step j depends on the state at step j-1).

**initHidden**

1 - At the first step, there is no previous state, so we just initialize it to 0. It needs to have the shape that the GRU expects for its hidden state, i.e. (num_layers, batch_size, hidden_size) = (1, 1, self.hidden_size).


In [42]:
class EncoderRNN (torch.nn.Module):
    """
    Encoder: an embedding layer followed by a single-layer GRU.
    """
    def __init__(self, input_size, hidden_size):
        # Call the constructor of the parent class
        super().__init__()
        # Save the sizes as attributes
        self.input_size = input_size
        self.hidden_size = hidden_size
        # Define the layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward (self, input, hidden):
        embedded = self.embedding(input).view(-1, 1, self.hidden_size)  # shape (seq_len, batch = 1, hidden_size)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden (self):
        return torch.zeros(1, 1, self.hidden_size, device = device)

In [43]:
hidden_size = 256
encoder = EncoderRNN (input_size = input_lang.n_words, hidden_size = hidden_size).to(device)  # same device as the input tensors

In [44]:
def count_parameters(model):
    # Count the parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total number of parameters: {total_params}")

    # Show the individual parameters
    for name, param in model.named_parameters():
        print(f"Name: {name}, Shape: {param.shape}, Number of parameters: {param.numel()}")

count_parameters(encoder)  ### i = input, h = hidden

Total number of parameters: 1506048
Name: embedding.weight, Shape: torch.Size([4341, 256]), Number of parameters: 1111296
Name: gru.weight_ih_l0, Shape: torch.Size([768, 256]), Number of parameters: 196608
Name: gru.weight_hh_l0, Shape: torch.Size([768, 256]), Number of parameters: 196608
Name: gru.bias_ih_l0, Shape: torch.Size([768]), Number of parameters: 768
Name: gru.bias_hh_l0, Shape: torch.Size([768]), Number of parameters: 768
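The totals printed above can be reproduced by hand. The sketch below is plain arithmetic (the function names are illustrative): a single-layer GRU has three gates, hence the factor 3 and the 768 = 3 x 256 rows in its weight matrices. The second total anticipates a decoder that adds a linear layer with 2802 outputs.

```python
def embedding_params(num_embeddings, embedding_dim):
    return num_embeddings * embedding_dim

def gru_params(input_size, hidden_size):
    # Three gates (reset, update, new), each with an input-to-hidden matrix,
    # a hidden-to-hidden matrix and two bias vectors.
    weight_ih = 3 * hidden_size * input_size
    weight_hh = 3 * hidden_size * hidden_size
    biases = 2 * 3 * hidden_size
    return weight_ih + weight_hh + biases

def linear_params(in_features, out_features):
    return in_features * out_features + out_features

hidden_size = 256
encoder_total = embedding_params(4341, hidden_size) + gru_params(hidden_size, hidden_size)
decoder_total = (embedding_params(2802, hidden_size)
                 + gru_params(hidden_size, hidden_size)
                 + linear_params(hidden_size, 2802))
print(encoder_total)   # 1506048
print(decoder_total)   # 1832178
```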


## Define an RNN decoder

The RNN decoder has to process
- the output of the encoder,
- the sentences of the translations.

So, we need to pass to the class
- the dimension of the output of the encoder, i.e. hidden_size,
- analogously to the input_size of the encoder, the output_size, i.e. the number of words of the target language (output_lang.n_words, which is $\approx$ 2.8k for English).

**constructor**
- After the same preliminary steps as before, we define an embedding from output_size to hidden_size
- We also define a GRU that goes from hidden_size dimensions to hidden_size dimensions
- We define a nn.Linear from hidden_size to output_size and a LogSoftmax activation layer with dim = 1

**forward**
- The embedding takes the input and shapes it as desired
- **NB: UNLIKE IN THE ENCODER, THE EMBEDDED OUTPUT IS PASSED THROUGH A RELU BEFORE BEING PASSED TO THE GRU**. This does not change the shape
- Then, we pass it to the GRU, together with the hidden state as usual
- **NB: UNLIKE IN THE ENCODER, WE NOW GO THROUGH A LINEAR LAYER WITH LOGSOFTMAX ACTIVATION FUNCTION**. This step does change the shape: the GRU output has shape (1, 1, hidden_size), we take its first (and only) time step, of shape (1, hidden_size), and the linear layer maps it to shape (1, output_size), i.e. one log-probability per word of the target vocabulary. The LogSoftmax itself does not change the shape.

In [45]:
class DecoderRNN (torch.nn.Module):

    def __init__ (self, hidden_size, output_size):
        super().__init__()
        #
        self.hidden_size = hidden_size
        self.output_size = output_size
        #
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 1)
        return

    def forward (self, input, hidden):
        embedded = self.embedding(input).view(-1, 1, self.hidden_size)
        output = F.relu(embedded)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden (self):
        return torch.zeros(1, 1, self.hidden_size, device = device)

In [None]:
m = nn.LogSoftmax(dim=1)
input = torch.randn(2, 3)
output = m(input)
input.shape, output.shape  # This is just to show that LogSoftMax does not change the shape of the tensor

(torch.Size([2, 3]), torch.Size([2, 3]))
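For intuition, here is a plain-Python sketch of what a log-softmax computes on a single row of scores (the scores are made up). Exponentiating the result gives probabilities that sum to 1.

```python
import math

def log_softmax(scores):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

scores = [2.0, 1.0, 0.1]
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(len(log_probs) == len(scores))   # True: same number of entries as the input
print(round(sum(probs), 6))            # 1.0
```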

In [46]:
hidden_size = 256
decoder = DecoderRNN (hidden_size = hidden_size, output_size = output_lang.n_words).to(device)  # same device as the input tensors

In [None]:
count_parameters (decoder)  ### i = input, h = hidden

Total number of parameters: 1832178
Name: embedding.weight, Shape: torch.Size([2802, 256]), Number of parameters: 717312
Name: gru.weight_ih_l0, Shape: torch.Size([768, 256]), Number of parameters: 196608
Name: gru.weight_hh_l0, Shape: torch.Size([768, 256]), Number of parameters: 196608
Name: gru.bias_ih_l0, Shape: torch.Size([768]), Number of parameters: 768
Name: gru.bias_hh_l0, Shape: torch.Size([768]), Number of parameters: 768
Name: out.weight, Shape: torch.Size([2802, 256]), Number of parameters: 717312
Name: out.bias, Shape: torch.Size([2802]), Number of parameters: 2802


## Training over the whole training dataset

How can we train this?

0 - Define the sequence of sentences the algorithm will be trained on. For now, we train on just a few sentences, to see how it works

1 - We need to define some hyperparameters, such as the learning rate

2 - We need to define the optimizers of the encoder and decoder

3 - We need to specify the criterion to compute the loss.
We use the Negative Log Likelihood Loss (nn.NLLLoss), which is suited to classification problems with many classes when the model outputs log-probabilities, as our decoder does through its LogSoftmax layer [https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html].

\begin{equation}
    \ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad
    l_n = - w_{y_n} x_{n,y_n}, \quad
    w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore\_index}\},
\end{equation}
where `x` is the input, `y` is the target, `w` is the weight, and `N` is the batch size.


\begin{equation}
\ell(x, y) = \begin{cases}
        \sum_{n=1}^N \frac{1}{\sum_{n=1}^N w_{y_n}} l_n, &
        \text{if reduction} = \text{`mean';}\\
        \sum_{n=1}^N l_n,  &
        \text{if reduction} = \text{`sum'.}
    \end{cases}
\end{equation}
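The two formulas above can be illustrated with a small plain-Python sketch, under the simplifying assumptions that all class weights equal 1 and no `ignore_index` is used. The function `nll_loss` and the numbers are illustrative only, not PyTorch's implementation.

```python
import math

def nll_loss(log_probs, targets, reduction="mean"):
    """
    log_probs: list of N rows, each a list of per-class log-probabilities
    targets:   list of N class indices
    All class weights are taken to be 1 and no ignore_index is used.
    """
    losses = [-row[y] for row, y in zip(log_probs, targets)]   # l_n = -x_{n, y_n}
    if reduction == "sum":
        return sum(losses)
    return sum(losses) / len(losses)                            # 'mean' with unit weights

log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
targets = [0, 1]
print(round(nll_loss(log_probs, targets, "sum"), 4))    # 0.5798
print(round(nll_loss(log_probs, targets, "mean"), 4))   # 0.2899
```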

4 - Loop over the training instances, compute the loss for each of them, and accumulate the losses.
[Recall that the input tensor is the tensor of the indices of a sentence, including the final EOS token, with shape (number of tokens, 1).]

5 - We may like to print the average loss and elapsed time every `print_every` iterations.

In [48]:
### In the original notebook, this is the beginning of the trainIters method
import time
start = time.time()

#0
niters = 10
training_pairs = [tensorsFromPair(random.choice(cut_pairs)) for i in range(niters)]
print_every = 1
loss_partial = 0

#1
learning_rate = 0.01

#2
encoder_optimizer = optim.SGD(params = encoder.parameters(), lr = learning_rate, nesterov = False)
decoder_optimizer = optim.SGD(params = decoder.parameters(), lr = learning_rate, nesterov = False)

#3
criterion = nn.NLLLoss()

#4
for iter in range(1, niters + 1):
    tensor_pair = training_pairs[iter-1]
    input_tensor, output_tensor = tensor_pair[0], tensor_pair[1]
    # NB: the train function has not been defined yet, so running this cell raises a NameError
    loss = train (input_tensor, output_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
    loss_partial += loss

    #5
    if iter % print_every == 0:
        avg_loss = loss_partial / print_every
        loss_partial = 0
        seconds = time.time() - start
        minutes = int(seconds / 60)
        hour_string = "%d min, %.1f sec" % (minutes, seconds - 60 * minutes)
        print (f"{iter} / {niters}, ", hour_string, f" - Average loss in the last {print_every} iters = {avg_loss:.3f}")

NameError: name 'train' is not defined

## Training using a single pair

For a single training instance, this is what we have to do.

0 - We're using an RNN-based encoder and decoder: each sentence is fed word by word, in succession. Each word is an index.
- The first word is the first step.
- The second word is the second step and so on.

As it is an RNN, at step j we also need to give as input the hidden state produced at step j-1.
At step 0, the previous hidden state of the GRU is initialized to 0. This is done through the encoder_hidden variable.

1 - We need to reset to zero the gradients handled by the optimizers of the encoder and decoder (we're using SGD for both), because PyTorch accumulates gradients across backward passes: without this reset, the gradients of the previous training instance would be added to those of the current one.

2 - In the first place, we will need to loop over the tokens of the input tensor, so it's good to compute the lengths of the two (input and output) tensors.

3 - We loop over the tokens of the input tensor. For each token, input_tensor[ei] is a tensor of shape (1,) that contains the index of the token. We feed the encoder with this tensor and with the hidden state, of shape (1, 1, hidden_size).

What will the encoder do?

It computes the embedding of the input token and reshapes it to (1, 1, hidden_size). Then, the GRU receives this embedded tensor and the hidden tensor (the state from the previous step, or the zero tensor at the first step) and computes encoder_output and encoder_hidden. Here these two tensors hold the SAME values: we process a single time step with a single-layer, unidirectional GRU, so the output at that step is exactly the new hidden state, which is then given as input to the next step.

The last input token is torch.tensor([1]), which corresponds to the EOS token.
After the loop over the input tensor, we finally have the encoder_output and encoder_hidden tensors (which, as recalled above, hold the same values in this setup).

4 - We will have to loop over the size of the output tensor.
We define the initial input and hidden state of the decoder as follows:
 - the initial input of the decoder is a tensor with a single token, namely [[0]], the SOS token.
 - the initial hidden state of the decoder is, instead, THE OUTPUT OF THE ENCODER (or, equivalently, the hidden state of the encoder)

5 - Now, we finally loop over the target tensor.
 - At the first step, decoder_input is torch.tensor([[SOS_token]], device = device), while decoder_hidden is the encoder output.
 - In the next steps,
   - the decoder_input can be the prediction of the decoder itself (no teacher forcing);
     otherwise, it can be the actual token of the target tensor (teacher forcing)
   - the decoder_hidden is simply the hidden state returned by the previous step of the GRU within the decoder

   Notice that, in this case, the output of the decoder is something with dimension given by output_lang.n_words, whereas the hidden state of the GRU within the decoder has shape given by hidden_size, so the two tensors are not the same in this case.
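The choice of the next decoder input in step 5 can be sketched without any neural network. Below, the per-step scores are fixed, made-up numbers (in a real decoder they would depend on the previous input), and `next_decoder_input` is a hypothetical helper. Deciding once per sentence whether to use teacher forcing, and the value 0.5 for the ratio, are assumptions of this sketch.

```python
import random

SOS_token = 0

def next_decoder_input(step_scores, target_index, use_teacher_forcing):
    """Return the token index fed to the decoder at the next step."""
    if use_teacher_forcing:
        return target_index                      # feed the ground-truth token
    # Otherwise feed the decoder's own best guess (greedy choice).
    return max(range(len(step_scores)), key=lambda i: step_scores[i])

# Toy decoder outputs for a 3-token target over a 4-word vocabulary (made-up numbers).
scores_per_step = [[0.1, 0.2, 0.6, 0.1], [0.1, 0.1, 0.1, 0.7], [0.2, 0.5, 0.2, 0.1]]
target = [2, 2, 1]

teacher_forcing_ratio = 0.5
use_teacher_forcing = random.random() < teacher_forcing_ratio   # decided once per sentence

decoder_inputs = [SOS_token]
for scores, y in zip(scores_per_step, target):
    decoder_inputs.append(next_decoder_input(scores, y, use_teacher_forcing))
print(use_teacher_forcing, decoder_inputs)
```

With teacher forcing the inputs follow the target, [0, 2, 2, 1]; without it they follow the greedy predictions, [0, 2, 3, 1].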


In [50]:
### Now we have defined everything that is passed to the train method
input_tensor, output_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion;

In [54]:
input_tensor.shape

torch.Size([6, 1])

In [75]:
#0
encoder_hidden = encoder.initHidden()   # tensor of zeros of shape (1, 1, encoder.hidden_size)

#1
encoder_optimizer.zero_grad()
decoder_optimizer.zero_grad()

#2
length_input_tensor, length_output_tensor = input_tensor.size(0), output_tensor.size(0)
print (f"Input sentence has {length_input_tensor:d} tokens, whereas output sentence has {length_output_tensor:d} tokens.\n")

def sentenceFromTensor (tensor, lang):
    s = ""
    length_tensor = tensor.size(0)
    for i in range(length_tensor):
        s += lang.index2word[int(tensor[i][0])] + " "
    s = s.strip()
    return s

print (f"Input sentence: '{sentenceFromTensor(input_tensor, input_lang)}'")
print (f"Output sentence: '{sentenceFromTensor(output_tensor, output_lang)}'")

#3
loss = 0
for ei in range(length_input_tensor):
    encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)  ### at each step, the two tensors encoder_output and encoder_hidden are the same!

#4
# Initial input of the decoder
decoder_input = torch.tensor([[SOS_token]], device = device)
# Initial hidden state of the decoder
decoder_hidden = encoder_output # or, equivalently, encoder_hidden, as they are identical tensors.

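#5
teacher_forcing_ratio = 0.2
use_teacher_forcing = random.random() < teacher_forcing_ratio  # teacher forcing with probability teacher_forcing_ratio

for di in range(length_output_tensor):
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
    loss += criterion(decoder_output, output_tensor[di])
    if use_teacher_forcing:
        decoder_input = output_tensor[di]  # feed the actual target token
    else:
        decoder_input = decoder_output.argmax(dim = 1).detach()  # feed the decoder's own prediction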


Input sentence has 6 tokens, whereas output sentence has 6 tokens.

Input sentence: 'c est mon associe . EOS'
Output sentence: 'he s my partner . EOS'
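The steps above can be collected into a single function. What follows is only a sketch of what the missing `train` function might look like, assuming PyTorch is installed and everything runs on the CPU. Its signature mirrors the call in the training loop, while `TinyEncoder`, `TinyDecoder`, the default `teacher_forcing_ratio=0.5`, the early stop on EOS and the toy tensors are assumptions made to keep the example self-contained.

```python
import random
import torch
import torch.nn as nn

SOS_token, EOS_token = 0, 1

class TinyEncoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, token, hidden):
        embedded = self.embedding(token).view(-1, 1, self.hidden_size)
        return self.gru(embedded, hidden)

class TinyDecoder(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, token, hidden):
        embedded = torch.relu(self.embedding(token).view(-1, 1, self.hidden_size))
        output, hidden = self.gru(embedded, hidden)
        return torch.log_softmax(self.out(output[0]), dim=1), hidden

def train(input_tensor, target_tensor, encoder, decoder,
          encoder_optimizer, decoder_optimizer, criterion, teacher_forcing_ratio=0.5):
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Encode the input sentence token by token, starting from a zero hidden state.
    hidden = torch.zeros(1, 1, encoder.hidden_size)
    for ei in range(input_tensor.size(0)):
        _, hidden = encoder(input_tensor[ei], hidden)

    # Decode, starting from SOS and the final encoder hidden state.
    decoder_input = torch.tensor([[SOS_token]])
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    loss = 0.0
    for di in range(target_tensor.size(0)):
        log_probs, hidden = decoder(decoder_input, hidden)
        loss = loss + criterion(log_probs, target_tensor[di])
        if use_teacher_forcing:
            decoder_input = target_tensor[di]                  # ground-truth token
        else:
            decoder_input = log_probs.argmax(dim=1).detach()   # decoder's own prediction
            if decoder_input.item() == EOS_token:
                break

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    return loss.item() / target_tensor.size(0)

torch.manual_seed(0)
toy_encoder, toy_decoder = TinyEncoder(10, 8), TinyDecoder(8, 12)
enc_opt = torch.optim.SGD(toy_encoder.parameters(), lr=0.01)
dec_opt = torch.optim.SGD(toy_decoder.parameters(), lr=0.01)
src = torch.tensor([2, 3, 4, EOS_token]).view(-1, 1)
tgt = torch.tensor([5, 6, EOS_token]).view(-1, 1)
print(train(src, tgt, toy_encoder, toy_decoder, enc_opt, dec_opt, nn.NLLLoss()))
```

The returned value is the summed loss divided by the target length, so that sentences of different lengths are comparable.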


In [85]:
# these are some interesting tests made during the training step

In [77]:
input_tensor[0], encoder_hidden

(tensor([145]),
 tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [78]:
encoder_first_output, encoder_first_hidden = encoder(input_tensor[0], encoder_hidden)

In [81]:
encoder_first_output == encoder_first_hidden

tensor([[[True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          Tr

In [82]:
encoder_second_output, encoder_second_hidden = encoder(input_tensor[1], encoder_first_hidden)

In [83]:
encoder_second_output == encoder_second_hidden

tensor([[[True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          True, True, True, True, True, True, True, True, True, True, True,
          Tr
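
The element-wise equality observed above is what one would expect from a single-layer, unidirectional recurrent layer that processes one token per call: the output for the last time step and the final hidden state are the same quantity. The sketch below checks this on a standalone `nn.GRU`; the layer sizes and random inputs are assumptions for illustration and are unrelated to the notebook's encoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=3)   # defaults: one layer, unidirectional

# One time step: input of shape (seq_len=1, batch=1, features)
x = torch.randn(1, 1, 4)
h0 = torch.zeros(1, 1, 3)
output, hidden = gru(x, h0)
same_single_step = torch.allclose(output, hidden)

# Several time steps: only the LAST output row matches the final hidden state
seq = torch.randn(5, 1, 4)
outputs, last_hidden = gru(seq, h0)
last_matches = torch.allclose(outputs[-1:], last_hidden)

print(output.shape, hidden.shape, same_single_step)
print(outputs.shape, last_hidden.shape, last_matches)
```

With more than one layer or a bidirectional layer the two tensors no longer have the same shape, so the shortcut of using `encoder_output` as the decoder's initial hidden state would not carry over.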

In [84]:
input_tensor[-1]  # last token of the input_tensor is the EOS token, 1

tensor([1])

In [88]:
decoder_input = torch.tensor([[SOS_token]], device = device)
decoder_hidden = encoder_second_output

In [89]:
decoder_first_output, decoder_first_hidden = decoder (decoder_input, decoder_hidden)

In [91]:
decoder_first_output.shape, decoder_first_hidden.shape

(torch.Size([1, 2802]), torch.Size([1, 1, 256]))

In [92]:
decoder_first_output.topk(1) # top_value, top_index
# torch.topk returns the k largest elements (or the k smallest, with largest=False) of a tensor along a given dimension, together with their indices.

torch.return_types.topk(
values=tensor([[-7.4262]], grad_fn=<TopkBackward0>),
indices=tensor([[2057]]))

In [95]:
top_value, top_index = decoder_first_output.topk(1)  # both tensors with shape ([1, 1])
top_index, top_index.shape

(tensor([[2057]]), torch.Size([1, 1]))

In [96]:
top_index.squeeze()

tensor(2057)

In [97]:
top_index.squeeze().detach()
# .detach() returns a tensor that shares storage with top_index but is cut off from the computation graph.
# When the decoder feeds its own prediction back in as the next input, this keeps gradients from flowing back through the predicted token.

tensor(2057)
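
The greedy selection walked through in the last few cells can be condensed into a small standalone sketch. The four-element score vector below is made up for illustration and stands in for the decoder's output over the vocabulary.

```python
import torch

# Made-up scores over a 4-word vocabulary, shape (1, 4) like the decoder output
logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]], requires_grad=True)
log_probs = torch.log_softmax(logits, dim=1)

# topk(1) returns the largest value and its index, both of shape (1, 1)
top_value, top_index = log_probs.topk(1)

# Reduce to a 0-dim tensor and detach it so it can be used as the next decoder input
next_input = top_index.squeeze().detach()

print(top_index.shape, next_input.item())
print(top_value.requires_grad, next_input.requires_grad)
```

Here the index of the largest score is 1. The value tensor still tracks gradients, whereas the detached index does not.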

In [107]:
teacher_forcing_ratio = 0.2
r = random.random()
use_teacher_forcing = r < teacher_forcing_ratio   # teacher forcing is used with probability teacher_forcing_ratio
r, use_teacher_forcing

(0.15989888831532872, True)
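
Under the usual convention, the teacher forcing ratio is the probability that the decoder is fed the ground-truth token rather than its own prediction, which corresponds to the test `random.random() < ratio`. As a quick sanity check of that reading, the sketch below draws many samples and measures how often teacher forcing would be switched on; the seed and the number of trials are arbitrary choices.

```python
import random

random.seed(0)
teacher_forcing_ratio = 0.2
n_trials = 10_000

# Count how often a uniform draw in [0, 1) falls below the ratio
n_forced = sum(random.random() < teacher_forcing_ratio for _ in range(n_trials))
fraction_forced = n_forced / n_trials

print(f"teacher forcing used in {fraction_forced:.1%} of trials")
```

The measured fraction should land close to 20%. Flipping the comparison to `>` would instead enable teacher forcing in roughly 80% of the trials.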

In [109]:
decoder_input, decoder_input.item()

(tensor([[0]]), 0)

## Performance evaluation on the test set