[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/rnn/encoder_decoder.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Encoder-Decoder architecture

The [Encoder-Decoder architecture](https://medium.com/analytics-vidhya/encoders-decoders-sequence-to-sequence-architecture-5644efbb3392) is a neural network architecture used in sequence-to-sequence (Seq2Seq) tasks. It is composed of two main parts: the encoder and the decoder. The encoder processes the input sequence and compresses it into a fixed-size internal representation (hidden state or context vector). The decoder is a conditional language model that generates the output sequence from that representation.

One of the common uses of the Encoder-Decoder architecture is [neural machine translation](https://en.wikipedia.org/wiki/Neural_machine_translation), where the input sequence is a sentence in one language and the output sequence is its translation into another language. The encoder processes the input sentence and compresses it into a fixed-size internal representation. The decoder generates the translation of the sentence in the target language.

In this notebook, we will implement a simple Encoder-Decoder architecture using Recurrent Neural Networks (RNN) to translate English into Spanish. The Encoder is a simple bidirectional LSTM network, and the Decoder is a simple LSTM network.

<img src="img/encoder-decoder.jpg" width="1200">

In [40]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/rnn'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !pip uninstall -y keras --quiet
    !pip install keras==2.15.0 --quiet
    !pip install tensorflow==2.15.1 --quiet
    !git clone --depth 1 https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.
    !cp {directory}/img/* img/.

import numpy as np
from keras import Model
import tensorflow as tf
import os
from tensorflow.keras.models import load_model
import keras

Note: you may need to restart the kernel to use updated packages.


## Important variables

We define the following variables:
- `vocab_size`: the size of each vocabulary (number of distinct words kept per language; the same value is used for English and for Spanish).
- `max_length`: the maximum length of the input and output sequences (in words). If a sequence is longer than this, it will be truncated. If it is shorter, it will be padded.
- `chars_to_remove`: a list of characters to remove from the text.
- `train_size_percentage`: the percentage of the data to use for training ([0-100]).
- `embedding_size`: the size of the embedding layer (hyperparameter).
- `n_epochs`: the maximum number of epochs to train the model (early stopping is used).
- `SOS_word`, `EOS_word`: the start and end of sentence special words.
- `n_lstm_units`: the number of LSTM units in the Encoder and Decoder RNNs.
- `model_file_name`: the file name to save or load the trained model. If the file exists, the model is loaded from disk, otherwise, the model is trained and saved.

In [41]:
vocab_size = 5_000
max_length = 50
chars_to_remove = ["¡", "¿"]
train_size_percentage = 85
embedding_size = 128
n_epochs = 10
n_lstm_units = 512
SOS_word, EOS_word = "startofsentence", "endofsentence"
model_file_name = 'data/english_spanish_encoder_decoder.keras'

## Prepare the data

Our dataset is a collection of 118,964 English-Spanish sentence pairs taken from [here](https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip). We load the file and remove special characters.

In [42]:
# read the contents of the data/english-spanish.txt file
with open("data/english-spanish.txt", 'r', encoding='utf-8') as file:
    text = file.read()
# remove the special characters
for special_char in chars_to_remove:
    text = text.replace(special_char, "")

We get a list of the English and Spanish sentences by splitting each line by the tab character. We shuffle the list of pairs, convert it to a pair of lists and show some examples in them.

In [43]:
# take the English and Spanish sentences, by splitting each line by the tab character
pairs: list[list[str]] = [line.split("\t") for line in text.splitlines()]
np.random.shuffle(pairs)

# take a list of pairs and returns a pair of lists: one with the English sentences and one with the Spanish sentences
sentences_en, sentences_es = zip(*pairs)

assert (n_sentences := len(sentences_en)) == len(sentences_es)
print(f"Number of sentences: {n_sentences:,}.")

print("Some example translations:")
for i in range(5):
    print(f"\t{i+1}: {sentences_en[i]} -> {sentences_es[i]}")

Number of sentences: 118,964.
Some example translations:
	1: The ball rolled into the stream. -> La pelota rodó hasta el arrollo.
	2: You'd better not go. -> Mejor no vayas.
	3: I'm learning English. -> Estoy aprendiendo inglés.
	4: I wouldn't have done that if I were you. -> Si yo fuera tú, no habría hecho eso.
	5: They attempted in vain to bribe the witness. -> Ellos trataron en vano de sobornar al testigo.


The Encoder-Decoder ANN has two inputs: one for the Encoder (English) and another one for the Decoder (Spanish). Both are strings. However, we create a `TextVectorization` layer for each input, which transforms a batch of strings into a list of token indices or ids (ints). Upon creation, we pass the vocabulary size and the maximum length of the sequences.  

The vocabulary (the mapping from words to indices/ids) is built with the `adapt` method, which scans the sentences and keeps the most frequent words, up to the vocabulary size. Once adapted, the layer transforms each input sentence into a list of word indices. The most frequent words are mapped to the first token indices, and the least frequent words to the last ones. Words that are not frequent enough to fit in the vocabulary are all mapped to the same token index, [UNK].
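
To make the frequency-based mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the actual `TextVectorization` implementation: `build_vocabulary` and `vectorize` are hypothetical helpers, the toy sentences are invented, punctuation handling is omitted, and ties between equally frequent words are broken alphabetically only for this sketch.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # ids 0 and 1, as in the vocabularies printed in this notebook


def build_vocabulary(sentences: list[str], vocab_size: int) -> list[str]:
    """Hypothetical helper: padding and [UNK] first, then words by decreasing frequency."""
    counts = Counter(word for sentence in sentences for word in sentence.lower().split())
    # sort by decreasing count; ties are broken alphabetically (a choice made for this sketch only)
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return [PAD, UNK] + ranked[:vocab_size - 2]


def vectorize(sentence: str, vocabulary: list[str], max_length: int) -> list[int]:
    """Hypothetical helper: maps words to ids, then truncates or zero-pads to max_length."""
    word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
    # words outside the vocabulary share the [UNK] id (1)
    ids = [word_to_id.get(word, 1) for word in sentence.lower().split()][:max_length]
    return ids + [0] * (max_length - len(ids))


toy_sentences = ["tom likes the ball", "the ball is red", "the cat likes tom"]
toy_vocabulary = build_vocabulary(toy_sentences, vocab_size=6)
print(toy_vocabulary)
print(vectorize("the red ball likes tom", toy_vocabulary, max_length=8))
```

In this toy run, "red" appears only once, so it does not fit in the six-entry vocabulary and is encoded with the [UNK] id, while the unused positions at the end are filled with the padding id 0.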

For the Decoder input (Spanish sentences), we include the start and end of sentence tokens (SOS and EOS). SOS will indicate the Decoder to start generating the first Spanish word, and EOS will indicate the end of the sentence (termination of generation).

In [44]:
# TextVectorization is a Keras layer that converts a batch of strings into lists of token indices / ids (ints)
# It could also output a dense representation of the strings, where each token is represented by a dense vector (not used here)
# sentences longer than `max_length` are truncated, and shorter sentences are padded with zeros
text_vec_layer_en = tf.keras.layers.TextVectorization(vocab_size, output_sequence_length=max_length)
text_vec_layer_es = tf.keras.layers.TextVectorization(vocab_size, output_sequence_length=max_length)

# adapt builds the layer's vocabulary from the sentences, keeping the `vocab_size` most frequent words
text_vec_layer_en.adapt(sentences_en)
# we adapt the Spanish layer to the Spanish sentences, including the start and end of sentence tokens
text_vec_layer_es.adapt([f"{SOS_word} {sentence} {EOS_word}" for sentence in sentences_es])

print(f"Some example English words: {text_vec_layer_en.get_vocabulary()[:10]}")  # 0 is padding, visualized as ''
print(f"Some example Spanish words: {text_vec_layer_es.get_vocabulary()[:10]}")

Some example English words: ['', '[UNK]', 'the', 'i', 'to', 'you', 'tom', 'a', 'is', 'he']
Some example Spanish words: ['', '[UNK]', 'startofsentence', 'endofsentence', 'de', 'que', 'a', 'no', 'tom', 'la']


We split the data into training and validation sets (both for Encoder and Decoder inputs). We include the start of sentence token at the beginning of each Decoder input sentence. The target output of the Decoder is the same sentence shifted one position to the left: it drops the start of sentence token and ends with the end of sentence token, so the target at each position is the word that follows the corresponding input word.

The input is one full English sentence (e.g., "I like soccer") for the Encoder and the corresponding Spanish sentence prefixed with SOS (e.g., "SOS Me gusta el fútbol") for the Decoder. The output is the Spanish sentence without SOS and with EOS at the end (e.g., "Me gusta el fútbol EOS"). In this way, given "I like soccer" to the Encoder and SOS to the Decoder, the latter will generate "Me". This output will be passed again to the Decoder concatenated to the previous input (i.e., "SOS Me"), which will generate "gusta", and so on, until it generates the EOS.
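
The relationship between the Decoder input and its target can be sketched in a few lines of plain Python. `make_decoder_example` is a hypothetical helper written only for this illustration; the notebook itself builds these sequences with f-strings and the `TextVectorization` layer.

```python
SOS, EOS = "startofsentence", "endofsentence"  # same special words as in this notebook


def make_decoder_example(sentence_es: str) -> tuple[list[str], list[str]]:
    """Hypothetical helper: returns (Decoder input words, target words) for one sentence."""
    words = sentence_es.split()
    decoder_input = [SOS] + words  # what the Decoder reads
    target = words + [EOS]         # what the Decoder must predict
    return decoder_input, target


decoder_input, target = make_decoder_example("me gusta el fútbol")
for step, expected in enumerate(target):
    # at step t the Decoder has read decoder_input[:t + 1] and must output target[t]
    print(f"step {step}: input so far = {decoder_input[:step + 1]} -> target = {expected}")
```

Both lists have the same length, and each target word is the word that follows the input word at the same position.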

In [45]:
# we split the data into training and validation sets
# we first take the input for the Encoder (English sentences)
X_train_encoder = tf.constant(sentences_en[:n_sentences * train_size_percentage // 100])
X_valid_encoder = tf.constant(sentences_en[n_sentences*train_size_percentage//100:])

# then, we take the input for the Decoder (Spanish sentences)
# We include the SOS at the beginning of each sentence. This is because we want the Decoder to start generating
# the first Spanish word, by passing SOS as the first input. Then, the Decoder will generate the first word and
# we will pass it to the Decoder again, so it can generate the second word, and so on, until it generates the EOS.
# EOS does not need to be added to the input, since we want the Decoder to generate it (it will be added to
# Y training dataset).
X_train_decoder = tf.constant([f"{SOS_word} {sentence}" for sentence in
                               sentences_es[:n_sentences * train_size_percentage // 100]])
X_valid_decoder = tf.constant([f"{SOS_word} {sentence}" for sentence in
                               sentences_es[n_sentences * train_size_percentage // 100:]])

# The target output of the Decoder is its input shifted one position to the left (no SOS), with EOS at the end
# of each sentence. This is because we want the Decoder to generate the first word of the Spanish sentence, then
# the second word, and so on, until it generates the EOS.
Y_train = text_vec_layer_es([f"{sentence} {EOS_word}" for sentence in sentences_es[:n_sentences * train_size_percentage // 100]])
Y_valid = text_vec_layer_es([f"{sentence} {EOS_word}" for sentence in sentences_es[n_sentences * train_size_percentage // 100:]])

## Create the model

Let's create the Encoder-Decoder ANN with two recurrent neural networks (RNNs). The Encoder is a bidirectional LSTM network, and the Decoder is a simple LSTM network. The Encoder processes the input sequence and compresses it into a fixed-size internal representation. The Decoder generates the output sequence. 
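
As a sanity check on the size of this architecture, the following sketch counts trainable parameters by hand. It assumes the standard LSTM formulation with four gates and bias terms (the Keras default); the three helper functions are hypothetical and exist only for this illustration.

```python
def embedding_params(vocab_size: int, embedding_size: int) -> int:
    """One dense vector of `embedding_size` weights per word id."""
    return vocab_size * embedding_size


def lstm_params(input_size: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_size + units) + units)


def dense_params(input_size: int, output_size: int) -> int:
    """A weight matrix plus one bias per output."""
    return input_size * output_size + output_size


vocab_size, embedding_size, n_lstm_units = 5_000, 128, 512  # values used in this notebook

# Bidirectional Encoder: two LSTMs of n_lstm_units // 2 units, one per direction
encoder_units_per_direction = n_lstm_units // 2
encoder_state_size = 2 * encoder_units_per_direction  # after concatenating both directions
assert encoder_state_size == n_lstm_units  # so it can be used as the Decoder's initial state

param_counts = {
    "encoder embedding": embedding_params(vocab_size, embedding_size),
    "decoder embedding": embedding_params(vocab_size, embedding_size),
    "encoder bi-LSTM": 2 * lstm_params(embedding_size, encoder_units_per_direction),
    "decoder LSTM": lstm_params(embedding_size, n_lstm_units),
    "output dense": dense_params(n_lstm_units, vocab_size),
}
for name, count in param_counts.items():
    print(f"{name:>18}: {count:>9,}")
print(f"{'total':>18}: {sum(param_counts.values()):>9,}")
```

The embedding count (640,000) agrees with the `Param #` column of the model summary printed by `create_model`. The check on `encoder_state_size` shows why each Encoder direction gets half of the units: the concatenated states then have exactly the size the Decoder expects.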

In [46]:
def create_model(n_lstm_units_p: int, vocab_size_p: int) -> Model:
    """
    Creates a Keras model for the Encoder-Decoder architecture.
    :param n_lstm_units_p: Number of LSTM units in the Encoder and Decoder.
    :param vocab_size_p: Vocabulary size.
    :return: The model
    """
    # Both the Encoder and the Decoder will receive a batch of sentences (strings) as input 
    # (English sentences for the Encoder, and Spanish sentences for the Decoder).
    encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
    decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

    # We connect the inputs to the text vectorization layers using the Keras functional API
    # In this way, the input sequences are converted into lists of word indices / ids using the TextVectorization layers
    encoder_input_ids = text_vec_layer_en(encoder_inputs)
    decoder_input_ids = text_vec_layer_es(decoder_inputs)

    # The word indices are then converted into dense vectors using an Embedding layer of `embedding_size` dimensions
    # The padding character zero is masked out, so it is ignored by the model (its weight is not updated/learned). This speeds up training.
    encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size_p, embedding_size, mask_zero=True)
    decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size_p, embedding_size, mask_zero=True)
    # we connect the embedding layers to the input indices (functional API)
    encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
    decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

    # We create the Encoder as a single bidirectional LSTM layer with half of the units (two LSTMs, one for each direction)
    # Return_state=True => gets a reference to the layer’s final state (not the output for all the RNN steps)
    # Since we are using a bi-LSTM layer, the final state is a tuple containing 2 short- and 2 long-term states,
    # one pair for each direction (that is why we use *encoder_states, to store the four states in one single tuple variable)
    encoder = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(n_lstm_units_p // 2, return_state=True))
    encoder_outputs, *encoder_states = encoder(encoder_embeddings)

    # we concatenate the states of the left and right LSTMs (first, the 2 short-term states and then the 2 long-term states)
    # this way, we get a single state for each type (short and long-term) to be passed 
    # to the one-directional Decoder RNN as its initial state (conditional language model)
    encoder_state = (keras.layers.Concatenate(axis=1)([encoder_states[0], encoder_states[2]]), # short-term (0 & 2)
                     keras.layers.Concatenate(axis=1)([encoder_states[1], encoder_states[3]])) # long-term (1 & 3)

    # The Decoder is also an LSTM layer with `n_lstm_units` units, but it returns sequences (return_sequences=True)
    # instead of the final state: we want to know the output (probabilities) for all the words in the Spanish sentence, not just the last one. 
    # It cannot be bidirectional, since it needs to generate the words in order (otherwise, it would be cheating).
    # Remember that the Decoder is a conditional language model, so it needs to receive the states of the Encoder
    # (initial_state parameter)
    decoder = tf.keras.layers.LSTM(n_lstm_units_p, return_sequences=True)
    decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

    # For each step in the Decoder RNN, we add a Dense layer with a softmax activation function to predict the next word in the Spanish sentence
    output_layer = tf.keras.layers.Dense(vocab_size_p, activation="softmax")
    Y_probas = output_layer(decoder_outputs)

    # Finally, we create the Keras Model, specifying the inputs and outputs
    model_loc = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_probas])
    
    model_loc.summary()
    return model_loc

We compile and train the model. We use the `sparse_categorical_crossentropy` as the loss function, since the targets are integers (word indices / ids). Otherwise, if we had one-hot vectors, we would use `categorical_crossentropy`. We use the Nadam optimizer and accuracy as a metric. We train the model for a maximum of `n_epochs` epochs, using a batch size of 32. We use early stopping with patience of 2 epochs and restore the best weights.
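
The difference between the two losses is only in how the target is encoded. The sketch below, written in plain Python rather than with the Keras implementations, computes both for a single prediction; the probabilities and the target id are invented for the example.

```python
import math


def sparse_categorical_crossentropy(target_id: int, probabilities: list[float]) -> float:
    """Loss when the target is an integer word id."""
    return -math.log(probabilities[target_id])


def categorical_crossentropy(one_hot_target: list[float], probabilities: list[float]) -> float:
    """Loss when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


probabilities = [0.1, 0.7, 0.2]  # toy softmax output over a 3-word vocabulary
target_id = 1                    # integer target, as produced by the text vectorization layer
one_hot_target = [0.0, 1.0, 0.0]  # the same target as a one-hot vector

sparse_loss = sparse_categorical_crossentropy(target_id, probabilities)
dense_loss = categorical_crossentropy(one_hot_target, probabilities)
print(f"sparse: {sparse_loss:.4f}, one-hot: {dense_loss:.4f}")
```

Both values are the negative log of the probability assigned to the correct word. The sparse version avoids building a one-hot vector of `vocab_size` entries for every target word.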

If the model is already saved in the file `model_file_name`, we load it from disk. Otherwise, we compile and train the model and save it to disk.

*Note*: if you run the following cell, you need a GPU (otherwise, it will take more than 3 hours to train the model).

In [47]:
def compile_and_train_model(model: Model, X_train_encoder_p: np.array, X_train_decoder_p: np.array,
                            Y_train_p: np.array, X_valid_encoder_p: np.array, X_valid_decoder_p: np.array,
                            Y_valid_p: np.array, n_epochs_p: int, model_file_name: str) -> Model:
    if os.path.exists(model_file_name):
        return load_model(model_file_name)
    # we compile and train the model with sparse_categorical_crossentropy as the loss function, since the targets are integers
    model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.fit((X_train_encoder_p, X_train_decoder_p), Y_train_p,
          epochs=n_epochs_p, batch_size=32,
          validation_data=((X_valid_encoder_p, X_valid_decoder_p), Y_valid_p),
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)])
    model.save(model_file_name)
    return model


model = create_model(n_lstm_units, vocab_size)
model = compile_and_train_model(model, X_train_encoder, X_train_decoder, Y_train, X_valid_encoder,
                                X_valid_decoder, Y_valid, n_epochs, model_file_name)

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_13 (InputLayer)       [(None,)]                    0         []                            
                                                                                                  
 text_vectorization_8 (Text  (None, 50)                   0         ['input_13[0][0]']            
 Vectorization)                                                                                   
                                                                                                  
 input_14 (InputLayer)       [(None,)]                    0         []                            
                                                                                                  
 embedding_12 (Embedding)    (None, 50, 128)              640000    ['text_vectorization_8[0

## Inference

We use the model for inference. Now, we use a greedy search strategy to predict the next word in the Spanish sentence. We take the word with the highest probability as the next word. We continue this process until we predict the end of sentence token.
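
The greedy loop can be illustrated without the trained model. In the sketch below, `TOY_NEXT_WORD` is an invented table that plays the role of the Decoder (it maps the words generated so far to a next-word distribution), and `greedy_decode` is a hypothetical helper.

```python
EOS = "endofsentence"

# Hypothetical next-word distributions, indexed by the words generated so far
TOY_NEXT_WORD = {
    (): {"me": 0.6, "yo": 0.4},
    ("me",): {"gusta": 0.9, EOS: 0.1},
    ("me", "gusta"): {"el": 0.7, EOS: 0.3},
    ("me", "gusta", "el"): {"fútbol": 0.8, EOS: 0.2},
    ("me", "gusta", "el", "fútbol"): {EOS: 1.0},
}


def greedy_decode(next_word: dict, max_length: int = 10) -> str:
    """Repeatedly appends the most probable next word until EOS or max_length."""
    words: list[str] = []
    for _ in range(max_length):
        distribution = next_word[tuple(words)]
        best_word = max(distribution, key=distribution.get)
        if best_word == EOS:
            break  # the most probable continuation is the end of the sentence
        words.append(best_word)
    return " ".join(words)


greedy_translation = greedy_decode(TOY_NEXT_WORD)
print(greedy_translation)
```

The structure is the same as in the real inference code: one prediction per generated word, each conditioned on everything generated before it.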

In [48]:
def translate(sentence_en: str) -> str:
    """
    Translates an English sentence into Spanish, preparing the input for the model and calling the predict method.
    :param sentence_en: The English sentence to translate.
    :return: The Spanish translation.
    """
    translation = ""
    for word_idx in range(max_length):
        # Encoder input: one English sentence (batch size = 1)
        X_inf_encoder = np.array([sentence_en])
        # Decoder input: SOS + existing translation (empty at the beginning)
        X_inf_decoder = np.array([SOS_word + translation])
        # We call predict with (Encoder_input, Decoder_input) to get the probabilities of the next word
        # we take the first sentence ([0]) and the probabilities of the word at position word_idx (predict returns probabilities for all max_length positions)
        y_probas = model.predict((X_inf_encoder, X_inf_decoder), verbose=0)[0, word_idx]  # probas of the last predicted word
        # we take the word id with the highest probability
        predicted_word_id = np.argmax(y_probas)
        # we get the word from the vocabulary
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]
        if predicted_word == EOS_word:
            # we are done when we predict the end of sentence token
            break
        translation += " " + predicted_word
    return translation.strip()


# we test the translation with some sentences. Feel free to add more sentences to test the model
english_sentences = ["hello everyone",
                     "how old are you?",
                     "what is your name?",
                     "where are you from?",
                     "I like soccer",
                     "I want you to try to correctly translate this long sentence from English to Spanish"]
for sentence in english_sentences:
    print(f"{sentence} -> {translate(sentence)}.")

hello everyone -> hola a todos.
how old are you? -> cuántos años tiene.
what is your name? -> cómo se llama.
where are you from? -> de dónde eres.
I like soccer -> me gusta el fútbol.
I want you to try to correctly translate this long sentence from English to Spanish -> quiero que [UNK] algún día a traducir de traducir este color de inglés.


## ✨ Questions ✨

1. In the training data "I like" is mostly translated as "Me gustan". However, the model translates "I like" as "Me gusta". Why is that?
2. Is there any sentence that is not translated correctly? 
3. If so, why do you think that happens?

### Answers

*Write your answers here.*


## Beam search

The previous greedy search strategy is not optimal. It always takes the word with the highest probability at the current step, but that choice may not lead to the most probable sequence once the following words are considered. That is, $P(w_1, w_2)$ may be lower than $P(w_1', w_2')$ even though $P(w_1) > P(w_1')$. This is the problem of finding a local maximum instead of a global maximum. Performing an exact search is not tractable, since the number of possible sentences grows exponentially with the length of the sentence.

We can use the Beam search strategy, which considers the $k$ most probable words at each step. We keep track of the $k$ most probable sequences and continue the process until we predict the end of sentence token. We take the sequence with the highest probability at the end.
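
A toy example shows how the beam width changes the result. The probabilities below are invented, `next_word_probas` stands in for the Decoder, and `beam_search_toy` is a hypothetical, simplified helper that is independent of the trained model. With a beam width of 1, the procedure is equivalent to greedy search.

```python
import math

EOS = "endofsentence"

# Invented distributions: the most probable first word does not start the most probable sentence
FIRST_WORD = {"el": 0.6, "la": 0.4}
SECOND_WORD = {"el": {"perro": 0.55, "gato": 0.45}, "la": {"casa": 0.9, "mesa": 0.1}}


def next_word_probas(words: tuple[str, ...]) -> dict[str, float]:
    """Stand-in for the Decoder: next-word distribution given the words generated so far."""
    if len(words) == 0:
        return FIRST_WORD
    if len(words) == 1:
        return SECOND_WORD[words[0]]
    return {EOS: 1.0}  # every two-word sentence ends here


def beam_search_toy(beam_width: int, max_length: int = 5) -> tuple[str, float]:
    """Returns the best sentence found and its probability."""
    beams = [(0.0, ())]  # (log probability, words generated so far)
    for _ in range(max_length):
        candidates = []
        for log_proba, words in beams:
            if words and words[-1] == EOS:
                candidates.append((log_proba, words))  # finished: keep it, do not extend it
                continue
            for word, proba in next_word_probas(words).items():
                candidates.append((log_proba + math.log(proba), words + (word,)))
        # keep only the beam_width most probable sequences
        beams = sorted(candidates, reverse=True)[:beam_width]
        if all(words[-1] == EOS for _, words in beams):
            break
    best_log_proba, best_words = beams[0]
    return " ".join(w for w in best_words if w != EOS), math.exp(best_log_proba)


for k in (1, 2):
    sentence_found, proba_found = beam_search_toy(beam_width=k)
    print(f"k={k}: '{sentence_found}' with probability {proba_found:.2f}")
```

With `k=1` the search commits to "el" and ends with a sentence of probability 0.33, whereas with `k=2` it also keeps "la" alive and finds "la casa", whose probability is 0.36.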

In [49]:
def beam_search(sentence_en: str, beam_width: int, verbose=False) -> str:
    """
    Translates an English sentence into Spanish using beam search with k=beam_width.
    :param sentence_en: The Encoder input (English sentence).
    :param beam_width: The k parameter of the beam search algorithm.
    :param verbose: Whether to display the top words and translations at each step.
    :return: The Spanish translation.
    """
    # Translation of the first word
    # Encoder input: one English sentence (batch size = 1)
    X_inf_encoder = np.array([sentence_en])
    # Decoder input: SOS
    X_inf_decoder = np.array([SOS_word])
    # Predict the probabilities of the first word
    y_proba = model.predict((X_inf_encoder, X_inf_decoder), verbose=0)[0, 0]  # first token's probas
    # we take the k words with the highest probabilities (top_k returns their values and indices)
    top_k_words = tf.math.top_k(y_proba, k=beam_width)
    # list of best (log_proba, translation) pairs
    # Important: instead of taking Prob(w1) * Prob(w2) * ... * Prob(wn), we take the log of the product:
    # log(Prob(w1)) + log(Prob(w2)) + ... + log(Prob(wn))
    # this is because the product of many probabilities between 0 and 1 can be very small and lead to 0.0 after some iterations
    top_translations = [
        (np.log(word_proba), text_vec_layer_es.get_vocabulary()[word_id])
        for word_proba, word_id in zip(top_k_words.values, top_k_words.indices)
    ]

    # displays the top first words if verbose mode
    print("Top first words:", top_translations) if verbose else None

    # Translation of the next words (from 1 on)
    for idx in range(1, max_length):
        # list of best (log_proba, translation) pairs
        candidates: list[(float, str)] = []
        for log_proba, translation in top_translations:
            if translation.endswith(EOS_word):
                candidates.append((log_proba, translation))
                # translation is finished, so don't try to extend it
                continue
            # Encoder input: one English sentence (batch size = 1)
            X_inf_encoder = np.array([sentence_en])  # encoder input
            # Decoder input: SOS + existing translation
            X_inf_decoder = np.array([SOS_word + " " + translation])  # decoder input
            # probabilities of the new word
            y_proba = model.predict((X_inf_encoder, X_inf_decoder), verbose=0)[0, idx]  # last token's proba
            # we include in candidates the top k existing translations with all the possible next words and their probabilities
            vocabulary = text_vec_layer_es.get_vocabulary()  # fetched once per step instead of once per word
            for word_id, word_proba in enumerate(y_proba):
                candidates.append((log_proba + np.log(word_proba), f"{translation} {vocabulary[word_id]}"))
        # we sort the candidates by the log of the probabilities and take the top k
        top_translations = sorted(candidates, reverse=True)[:beam_width]

        # displays the top translation so far, if verbose mode
        print("Top translations so far:", top_translations) if verbose else None

        # the process terminates when all the K top translations end with the EOS token
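        if all([top_translation.endswith(EOS_word) for _, top_translation in top_translations]):
            # returns the best translation pair ([0] because it is sorted by log probabilities),
            # take the translation text ([1]) and remove the EOS token
            return top_translations[0][1].replace(EOS_word, "").strip()

    # max_length was reached before every top translation produced EOS: return the best one found so far
    return top_translations[0][1].replace(EOS_word, "").strip()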


for sentence in english_sentences:
    print("-" * 50)
    print(f"Translation with beam search for: \n\t '{sentence}':")
    translation = beam_search(sentence, 3, verbose=True)
    print(f"Spanish translation: {translation}.")

--------------------------------------------------
Translation with beam search for: 
	 'hello everyone':
Top first words: [(-0.9104477, 'hola'), (-1.7676975, '[UNK]'), (-2.3887048, 'y')]
Top translations so far: [(-1.1329316, 'hola a'), (-2.2534118, '[UNK] a'), (-3.7593431, '[UNK] todo')]
Top translations so far: [(-1.5704861, 'hola a todos'), (-2.7043881, '[UNK] a todos'), (-3.4118323, 'hola a dios')]
Top translations so far: [(-1.5736455, 'hola a todos endofsentence'), (-2.7050316, '[UNK] a todos endofsentence'), (-3.4296794, 'hola a dios endofsentence')]
Spanish translation: hola a todos.
--------------------------------------------------
Translation with beam search for: 
	 'how old are you?':
Top first words: [(-0.070351236, 'cuántos'), (-2.970761, 'qué'), (-5.0579257, 'cómo')]
Top translations so far: [(-0.20440114, 'cuántos años'), (-2.883246, 'cuántos hermanos'), (-3.056719, 'qué edad')]
Top translations so far: [(-1.1800603, 'cuántos años tiene'), (-1.3173417, 'cuántos años t

This simple model performs decently on short sentences, but it struggles with longer sentences. It is possible to significantly improve the translation quality by using [attention](https://en.wikipedia.org/wiki/Attention_(machine_learning)). The [Transformer](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)), a more sophisticated Encoder-Decoder architecture built on attention, is the state-of-the-art model for machine translation, and Transformer-based components are used by GPT, BERT, and many other models.

## ✨ Questions ✨

4. What would happen in beam search if we used the product of probabilities instead of the sum of $\log$ probabilities?
5. Do you think the last sentence will be translated better with $k=10$?

### Answers


*Write your answers here.*

