<a href="https://colab.research.google.com/github/cerasole/ml4hep/blob/main/RNNs/torch_Encoder_Decoder_for_Neural_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine translation using RNNs

The theory behind RNN-based machine translation is covered, for instance, in Chapter 16 of the book whose notebooks are at https://github.com/ageron/handson-ml3.

Basics of the algorithm:
- Input: a sentence in a given language.
- Output: the translation of the input sentence into a different language.

The architecture is an encoder-decoder system.
- The input sentence is fed to the encoder, which transforms it into a low-dimensional latent representation.
- The decoder transforms the latent representation into an output sentence.
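The two steps above can be sketched with a pair of small GRU modules. This is only a minimal illustration, not necessarily the model built in this notebook: the class names `TinyEncoder` and `TinyDecoder`, the vocabulary sizes, and the hidden size are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical encoder: word indices -> final GRU hidden state."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, src_len) tensor of integer word indices
        _, hidden = self.gru(self.embedding(tokens))
        return hidden  # latent representation, shape (1, batch, hidden_size)

class TinyDecoder(nn.Module):
    """Hypothetical decoder: latent state + target tokens -> word logits."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden):
        # The encoder's hidden state initializes the decoder's GRU.
        output, hidden = self.gru(self.embedding(tokens), hidden)
        return self.out(output), hidden  # logits: (batch, tgt_len, vocab_size)

torch.manual_seed(0)
encoder = TinyEncoder(vocab_size=10, hidden_size=8)   # assumed source vocabulary size
decoder = TinyDecoder(vocab_size=12, hidden_size=8)   # assumed target vocabulary size

source = torch.tensor([[2, 3, 4, 1]])   # one source sentence, ending with EOS (index 1)
target_in = torch.tensor([[0, 5, 6]])   # decoder input, starting with SOS (index 0)

latent = encoder(source)
logits, _ = decoder(target_in, latent)
print(latent.shape, logits.shape)
```

The only information passed from encoder to decoder here is the fixed-size hidden state `latent`, whatever the length of the source sentence.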

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata

import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
device

device(type='cuda')

We define a class for handling languages and sentences.

This class will need to
- normalize the sentences to a standard form,
- register the words of the sentences passed to the class.
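As an illustration of the first requirement, here is a minimal sketch of what normalizing a sentence might look like. The function name `normalize_sentence` and the specific rules (lowercasing, stripping accents, separating `.`, `!` and `?` from words, dropping every other non-letter character) are assumptions made for this example, not necessarily the choices made later in the notebook.

```python
import re
import unicodedata

def normalize_sentence(sentence):
    """Hypothetical normalizer: lowercase, strip accents, isolate punctuation."""
    sentence = sentence.lower().strip()
    # Decompose accented characters and drop the combining marks (category "Mn").
    sentence = "".join(
        ch for ch in unicodedata.normalize("NFD", sentence)
        if unicodedata.category(ch) != "Mn"
    )
    # Put a space before sentence punctuation so that it becomes its own token.
    sentence = re.sub(r"([.!?])", r" \1", sentence)
    # Collapse every run of other characters (spaces, digits, symbols) into one space.
    sentence = re.sub(r"[^a-z.!?]+", " ", sentence)
    return sentence.strip()

print(normalize_sentence("Could you please stop the noise?"))
# could you please stop the noise ?
print(normalize_sentence("  Ça va très bien !"))
# ca va tres bien !
```

With rules like these, `Noise?`, `noise?` and `noise` no longer end up as three different vocabulary entries.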

In [4]:
SOS_token = 0   # Start-of-Sentence
EOS_token = 1   # End-of-Sentence

class Lang:

    def __init__(self, name):
        self.name = name  # name of the language
        self.word2index = {}  # word -> index assigned when the word was first added
        self.word2count = {}  # word -> number of times the word has been added
        self.index2word = {0: "SOS", 1: "EOS"}  # index -> word; SOS and EOS are pre-registered
        self.n_words = 2  # vocabulary size (number of distinct entries), including SOS and EOS

    def addWord(self, word):
        # If the word is not yet in self.word2index:
        #  - assign it the next free index, i.e. the current self.n_words,
        #  - initialize its count in self.word2count to 1,
        #  - record the reverse mapping in self.index2word,
        #  - increment self.n_words.
        # If it is already present:
        #  - increment its entry in self.word2count.
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def addSentence(self, sentence):
        # Split the sentence on single spaces and add each word
        for word in sentence.split(' '):
            self.addWord(word)

In [7]:
lang = Lang("prova")
print(lang.word2count, lang.word2index, lang.index2word, lang.n_words)
lang.addSentence("could you please stop the noise?")
print(lang.word2count, lang.word2index, lang.index2word, lang.n_words)

{} {} {0: 'SOS', 1: 'EOS'} 2
{'could': 1, 'you': 1, 'please': 1, 'stop': 1, 'the': 1, 'noise?': 1} {'could': 2, 'you': 3, 'please': 4, 'stop': 5, 'the': 6, 'noise?': 7} {0: 'SOS', 1: 'EOS', 2: 'could', 3: 'you', 4: 'please', 5: 'stop', 6: 'the', 7: 'noise?'} 8
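Once the vocabulary is built, a sentence can be turned into the list of integer indices that a network consumes. The sketch below assumes a hypothetical helper, `indexes_from_sentence`, and copies the `word2index` mapping from the output above so that it runs on its own. Appending `EOS_token` to mark the end of the sentence is also an assumption of this example.

```python
EOS_token = 1  # End-of-Sentence, same convention as in the Lang class

# Vocabulary copied from the output printed above.
word2index = {"could": 2, "you": 3, "please": 4, "stop": 5, "the": 6, "noise?": 7}

def indexes_from_sentence(word2index, sentence):
    """Hypothetical helper: map each word to its index and append EOS."""
    return [word2index[word] for word in sentence.split(" ")] + [EOS_token]

indexes = indexes_from_sentence(word2index, "could you stop the noise?")
print(indexes)
# [2, 3, 5, 6, 7, 1]
```

Note that `noise?` is stored with its question mark attached, so looking up a bare `noise` would raise a `KeyError`. This is one reason to normalize sentences before registering them.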
