# Lab5: Recurrent Neural Network and LSTM

## Introduction
Sometimes datapoints are related to each other through time, for example:

* Audio sequence
* Text
* Video

In those situations, a datapoint has to be represented as $\{ x_t \}_{t=1}^T$, a sequence of observations over time. When we want to train a neural network on this kind of data, we need to exploit that temporal relationship (a word in a sentence, taken out of context, carries little meaning).

Recurrent Neural Networks (RNNs) are a specific kind of network, designed mainly for this kind of application.

![](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)

In a classical RNN model, each recurrent cell has an output $h_t$ for each time step $t>0$, and an internal memory, usually denoted $C_t$. To preserve temporal information and simplify backpropagation, usually $C_t = h_{t-1}$ for any $t>1$.
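As a minimal sketch of the recurrence described above, the following scalar RNN carries the previous output $h_{t-1}$ forward as its only memory. The `tanh` activation, the weights `w_x` and `w_h`, and the function name `rnn_forward` are assumptions made for illustration; they are not part of the lab code.

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a scalar RNN over the sequence `xs` and return every output h_t."""
    h_prev = h0  # memory carried into the first step
    outputs = []
    for x_t in xs:
        # The memory at step t is just the previous output: C_t = h_{t-1}.
        h_t = math.tanh(w_x * x_t + w_h * h_prev)
        outputs.append(h_t)
        h_prev = h_t
    return outputs

hs = rnn_forward([1.0, 0.0, 0.0])
print(hs)
```

Even though only the first input is non-zero, later outputs are non-zero too, because each step sees the previous output.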

Even if this approach works in theory, in practice there are severe issues when the sequence is long (i.e. when $T$ is big), namely the **vanishing gradient** and the **exploding gradient**. They arise because, when backpropagating through a $T$ time-step RNN, the gradient of a loss function $\ell(\{ x_t \}, \{ y_t \})$ w.r.t. a weight $w$ contains products of up to $T$ per-step factors $\frac{\partial h_t}{\partial h_{t-1}}$. If each factor has magnitude roughly $\gamma$, the contribution of the earliest time steps to $\frac{\partial \ell}{\partial w}$ scales like:

$$
\gamma^T
$$

Thus, if $\gamma < 1$, this contribution vanishes for large $T$, while if $\gamma > 1$, it explodes for large $T$.
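A quick numerical illustration of this effect, assuming the simplest possible recurrence $h_t = w\,h_{t-1}$ (a hypothetical linear, scalar RNN chosen only because its gradient has a closed form): the derivative of $h_T$ with respect to $h_0$ is exactly $w^T$.

```python
def gradient_through_time(w, T):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(T):
        grad *= w  # chain rule: one factor of w per time step
    return grad

for w in (0.9, 1.1):
    for T in (10, 100):
        print(f"w={w}, T={T}: gradient = {gradient_through_time(w, T):.3e}")
```

A weight only slightly below 1 already gives a negligible gradient after 100 steps, and a weight only slightly above 1 gives a very large one.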

LSTM mitigates this issue by limiting the effective time span of the model: the cell is allowed to _lose memory_ across iterations, removing information from $C_t$ over time.

![](https://miro.medium.com/max/662/1*mcHP77YF63SuqUGAIiBBsA.jpeg)

Thus, in an LSTM we have two **different** outputs for each $t$: the _state_ $h_t$ and the memory cell $C_t$. Unlike in classical RNNs, these are usually not the same.
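The sketch below shows one step of a scalar LSTM cell using the standard gate equations, to make the distinction between $h_t$ and $C_t$ concrete. The parameter values and the names `lstm_step` and `params` are hypothetical and chosen only for illustration; a real layer uses weight matrices learned during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate names to (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the memory to expose
    g = math.tanh(pre("candidate"))   # candidate new information

    c_t = f * c_prev + i * g          # memory cell: partly forgotten, partly updated
    h_t = o * math.tanh(c_t)          # state: a gated view of the memory
    return h_t, c_t

# Hypothetical parameters, chosen only to make the example concrete.
params = {
    "forget": (0.0, 0.0, -2.0),
    "input": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 2.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    print(f"h = {h:.4f}, C = {c:.4f}")
```

With zero input and a negative forget-gate bias, the memory $C_t$ shrinks at every step, which is the "losing memory" behaviour, and $h_t$ differs from $C_t$ throughout.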

## Case study: text-to-text translation
RNNs (and, in particular, LSTMs) have many applications:

* Time-series Classification
* NLP
* Text-to-Text translation
* ...

We will consider a text-to-text translation example, using an LSTM to translate a sentence from English to Italian.

In [1]:
# Import libraries
import numpy as np
import tensorflow as tf
from tensorflow import keras as ks

## Download and Prepare the data

We will use a dataset from the Tab-Delimited Bilingual Sentence Pairs (TBSP) collection (http://www.manythings.org/anki/). There you can find a large number of datasets of short sentences pairing English with other languages, which we can use for this task. Of course (since we are patriotic), we will choose English to Italian.

In [None]:
# Download and unzip the data
!curl -O http://www.manythings.org/anki/ita-eng.zip
!unzip ita-eng.zip

The dataset file is a $\texttt{.txt}$ file, which the official documentation describes as having the following form:

**English + TAB + The Other Language + TAB + Attribution**

where the **Attribution** part can be ignored. Thus, this dataset uses **TAB** to split the input from the output, and **\n** to split different datapoints.
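As a small sketch of how one line in this format can be taken apart, the helper below (a hypothetical `parse_pair`, not part of the lab code) keeps the first two TAB-separated fields and drops the attribution. The sample line is made up for illustration and is not copied from the dataset.

```python
def parse_pair(line):
    """Split one dataset line into (english, other_language), dropping attribution."""
    fields = line.rstrip("\n").split("\t")
    english, other = fields[0], fields[1]
    return english, other

# A made-up line in the documented format.
sample = "Hi.\tCiao.\tattribution text\n"
print(parse_pair(sample))
```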

We need to:
* Read the data and create lists containing the input and output sequences, split apart.
* Encode the sentences in matrix form, using for example one-hot encoding.
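To illustrate the encoding step on a toy case, the sketch below one-hot encodes a short string over a four-character alphabet and pads it with spaces up to a fixed length. The alphabet, the helper name `one_hot`, and the use of plain Python lists instead of NumPy arrays are assumptions made to keep the example self-contained.

```python
def one_hot(text, alphabet, length):
    """Encode `text` as `length` one-hot rows over `alphabet`, padding with spaces."""
    index = {char: i for i, char in enumerate(alphabet)}
    padded = text.ljust(length)  # pad on the right with " "
    rows = []
    for char in padded:
        row = [0.0] * len(alphabet)
        row[index[char]] = 1.0
        rows.append(row)
    return rows

alphabet = sorted(set(" abc"))  # [' ', 'a', 'b', 'c']
encoded = one_hot("cab", alphabet, length=4)
for row in encoded:
    print(row)
```

Each row has exactly one non-zero entry, and the last row encodes the padding space, which sorts first in this alphabet.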

In [None]:
def read_data_from_file(path, n_samples=10_000):
    """Read up to `n_samples` (input, target) sentence pairs from a TAB-delimited file."""
    input_texts = []
    target_texts = []
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().split("\n")

    for line in lines[: min(n_samples, len(lines) - 1)]:
        # Each line is: English + TAB + Italian + TAB + Attribution (ignored).
        input_text, target_text, _ = line.split("\t")

        # We use "tab" as the "start sequence" character
        # for the targets, and "\n" as "end sequence" character.
        target_text = "\t" + target_text + "\n"
        input_texts.append(input_text)
        target_texts.append(target_text)

    return input_texts, target_texts


def compute_unique_characters(texts):
    """Return the sorted list of distinct characters appearing in `texts`."""
    characters = set()
    for text in texts:
        characters.update(text)
    return sorted(characters)

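def one_hot_encode():
    # NOTE: this function reads the module-level variables defined further down
    # (input_texts, target_texts, input_characters, target_characters, the
    # sequence lengths and token counts) and uses the space " " as padding.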
    input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
    target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

    encoder_input_data = np.zeros(
        (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
    )
    decoder_input_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
    )
    decoder_target_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
    )

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            encoder_input_data[i, t, input_token_index[char]] = 1.0
        encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
        for t, char in enumerate(target_text):
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_input_data[i, t, target_token_index[char]] = 1.0
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
        decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
        decoder_target_data[i, t:, target_token_index[" "]] = 1.0

    return encoder_input_data, decoder_input_data, decoder_target_data


# Read data
input_texts, target_texts = read_data_from_file('ita.txt')
input_characters = compute_unique_characters(input_texts)
target_characters = compute_unique_characters(target_texts)

# Compute useful measures
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

# Encode data
encoder_input_data, decoder_input_data, decoder_target_data = one_hot_encode()

print('\n')
print("Shape of encoder input data:", encoder_input_data.shape)
print("Shape of decoder input data:", decoder_input_data.shape)
print("Shape of decoder target data:", decoder_target_data.shape)

## Explore the data

Let us take a look at the data to understand how the encoding works.

In [None]:
# Starting data
print(f"{input_texts[1]} -> {target_texts[1]}")
print(f"{input_texts[5]} -> {target_texts[5]}")
print(f"{input_texts[30]} -> {target_texts[30]}")

# Encoding: each row is the one-hot vector of one character
for one_hot in encoder_input_data[5]:
    idx = np.where(one_hot == 1.0)[0][0]
    print(f"{one_hot} -> {input_characters[idx]}")

# NOTE: Padding (space) character = [1, 0, 0, ..., 0]

## Model

Our model, referred to as Sequence-to-Sequence Learning (Seq2Seq), works in the following way:


1.   Consider two languages Lan1 and Lan2, and suppose we want to translate from Lan1 -> Lan2. 
2.   Create two LSTM models, referred to as Encoder and Decoder.
3.   Train the Encoder to map strings from Lan1 to an encoded version of them, ELan1. This is required to capture local information (translating character by character is not possible).
4.   Ignore the output of the Encoder and keep its states $(h_e, C_e)$, which will be used as the initial states $(h_0, C_0)$ of the Decoder.
5.   The Decoder takes as input the states from the Encoder (i.e. the encoded string in ELan1) and the string from Lan2, and produces the next character of the translated output sequence. Append that character to the output sequence.
6.   Repeat the process until a STOP character or the maximum length of the output string is reached.

During training, this process uses _teacher forcing_: the decoder input is the ground-truth target sequence (shifted by one time step) rather than the decoder's own predictions, so the decoder learns to predict character $t+1$ of the target given its first $t$ characters.
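To make the one-step shift between decoder input and decoder target concrete, here is a minimal sketch on a single sentence. It assumes the same start and stop characters as the preprocessing code (TAB and newline); the helper name `teacher_forcing_pair` is hypothetical.

```python
def teacher_forcing_pair(target_text, start="\t", stop="\n"):
    """Return (decoder_input, decoder_target) character lists for one sentence."""
    framed = start + target_text + stop
    decoder_input = list(framed[:-1])   # what the decoder is fed at each step
    decoder_target = list(framed[1:])   # what it should predict at that step
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("Ciao")
for fed, expected in zip(dec_in, dec_out):
    print(repr(fed), "->", repr(expected))
```

At every position the decoder is fed the true previous character, whatever it predicted at the step before.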

![](https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png)



In [4]:
# Define parameters for the model
latent_dim = 256

# Define an input sequence and process it.
encoder_inputs = ks.Input(shape=(None, num_encoder_tokens))
encoder = ks.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = ks.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = ks.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# Choose a character using softmax as output
decoder_dense = ks.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = ks.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
# Visualize the model
ks.utils.plot_model(model)

In [6]:
# Training parameters
batch_size = 64
epochs = 100

# Train
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
hist = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size, epochs=epochs, validation_split=0.1)

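# Save weights (the `.weights.h5` suffix is required by Keras 3
# and is also accepted by earlier versions)
model.save_weights('seq2seq.weights.h5')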

In [None]:
# Visualize overfitting
import matplotlib.pyplot as plt

# Loss
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.grid()
plt.xlabel('epoch')
plt.legend(['loss', 'val_loss'])
plt.title('Plot of Loss over Epochs')
plt.show()

# Accuracy
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.grid()
plt.xlabel('epoch')
plt.legend(['acc', 'val_acc'])
plt.title('Plot of Accuracy over Epochs')
plt.show()

## Evaluation

We now want to evaluate our model.
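At inference time there is no target sentence to feed to the decoder, so the translation is generated one character at a time, each time feeding back the most probable character (greedy decoding). The sketch below shows that loop in isolation; `toy_step` is a hypothetical stand-in for the trained decoder and simply spells a fixed word.

```python
def greedy_decode(step, start="\t", stop="\n", max_len=20):
    """Generate characters one at a time, always taking the most probable one.

    `step(prev_char, state)` must return (probabilities_by_char, new_state).
    """
    prev_char, state, decoded = start, None, ""
    while True:
        probs, state = step(prev_char, state)
        prev_char = max(probs, key=probs.get)  # argmax over the characters
        decoded += prev_char
        if prev_char == stop or len(decoded) >= max_len:
            return decoded

def toy_step(prev_char, state):
    """Hypothetical stand-in for the decoder: deterministically spells 'Ciao'."""
    script = "Ciao\n"
    position = 0 if state is None else state
    probs = {char: 0.0 for char in set(script)}
    probs[script[position]] = 1.0
    return probs, position + 1

print(repr(greedy_decode(toy_step)))
```

The loop stops either on the stop character or when the maximum length is reached, whichever comes first.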


In [7]:
# We want to split our model into its two parts: encoder and decoder.

# Build the Encoder model
encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = ks.Model(encoder_inputs, encoder_states)

# Build the decoder part
decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = ks.Input(shape=(latent_dim,))
decoder_state_input_c = ks.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = ks.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)


# Decode sequences back to something readable.
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))

    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence

In [None]:
for seq_index in range(20):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index : seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)

In [None]:
# Encode a new string
def encode_string(s):
    encoded_s = np.zeros((1, max_encoder_seq_length, num_encoder_tokens))

    for t, char in enumerate(s):
        encoded_s[0, t, input_token_index[char]] = 1.0
    encoded_s[0, t+1:, input_token_index[" "]] = 1.0

    return encoded_s

# Try it out (the string must be at most max_encoder_seq_length characters
# long and contain only characters seen in the training inputs)
s = "Hello."
e_s = encode_string(s)

# Translate
print(f"{s} -> {decode_sequence(e_s)}")