<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Natural Language Generation</div>

In this exercise, we will try out Natural Language Generation using a Seq2Seq (sequence-to-sequence) architecture.

Natural Language Generation has often relied on approaches close to "fill-in-the-blanks" templates.
Seq2Seq encoder-decoder architectures approach the problem differently, especially in the case of language translation.

The approach is the following:
1. An encoder LSTM network encodes the input sequence into state vectors with a predefined dimensionality.
2. A decoder LSTM predicts the next token of a target sequence based on the beginning of the sequence. Its initial state is given by the encoder.

The approach used here to train a system that predicts the next token based on the beginning of the sequence is called teacher forcing.

# 1. Parameters of the experiment

In this example, we will train a system that translates basic French sentences into English. The data used for this example is a list of French sentences and their English translations.

# 1. Teacher forcing

## 1.a Example

Say we work with the following sequence :

```
Rien ne sert de courir il faut partir à point
```

We want to train a model that predicts the next word of the sequence based on its beginning. For the first word and the end of the sequence to be predictable, we need to add beginning and end tokens to the sequence. We use \t as the beginning token and \n as the end token:

```
\t Rien ne sert de courir il faut partir à point \n
```

When training the system, we start by inputting the "\t" beginning token:

```
input: 
\t
prediction:
sert
```

The untrained model generated "sert" where we expected "Rien". There are now two options to continue:

### Without forcing :

We add the previous output, "sert", to the input sequence, and continue generating :

```
input: 
\t sert
```

With this approach, the error propagates to the following steps, which makes the model much slower to learn.

### With teacher forcing

After computing the error, we discard the output "sert" and replace it with the word that was actually expected ("Rien"). This is called *teacher forcing*:

```
input: 
\t Rien
```

Using this technique makes training the model much faster.
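To make the idea concrete, here is a minimal word-level sketch of how teacher forcing builds training pairs from the ground truth. The helper name `teacher_forcing_pairs` is hypothetical, and the notebook itself works at character level rather than word level.

```python
# Word-level illustration only; the rest of the notebook uses characters.
sentence = "Rien ne sert de courir il faut partir à point"
tokens = ["\t"] + sentence.split() + ["\n"]


def teacher_forcing_pairs(tokens):
    """Return (input prefix, expected next token) pairs.

    Each prefix is taken from the ground-truth sequence, never from
    the model's own (possibly wrong) predictions.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = teacher_forcing_pairs(tokens)
for prefix, expected in pairs[:3]:
    print(prefix, "->", repr(expected))
```

Whatever the model predicts at a given step, the next input prefix is always the true beginning of the sequence.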

## 1.b. Implementing




In [2]:
batch_size = 64  # Batch size for training.
epochs = 50  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 1000  # Number of samples to train on.

# Path to the data
data_path = 'datasets/enfratexts.txt'

# 2. Vectorizing the data



We vectorize the data using characters as features. 


For instance, if we only used the 26 letters of the alphabet to represent our data, the sequence "Bonjour" would be represented as a sequence of 7 (the length of the sequence) 26-dimensional one-hot vectors. Each vector would have a "1" at the position of the corresponding letter and zeros elsewhere (here, the first input vector for the sequence "Bonjour" would have a 1 in the 2nd position, corresponding to the letter B).
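The following is a small sketch of this encoding, assuming a case-insensitive alphabet of 26 lowercase letters. The helper `one_hot_encode` is hypothetical and uses plain Python lists instead of the NumPy arrays used later in the notebook.

```python
import string

# Assumption: only the 26 lowercase ASCII letters, case ignored.
alphabet = string.ascii_lowercase
char_index = {char: i for i, char in enumerate(alphabet)}


def one_hot_encode(text):
    """Encode each character of `text` as a 26-dimensional one-hot vector."""
    vectors = []
    for char in text.lower():
        vector = [0.0] * len(alphabet)
        vector[char_index[char]] = 1.0
        vectors.append(vector)
    return vectors


encoded = one_hot_encode("Bonjour")
print(len(encoded), len(encoded[0]))  # sequence length, vector dimension
print(encoded[0].index(1.0))          # zero-based position of "b"
```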


Of course, our input data may contain more characters than just 26 letters: accents, numbers, symbols... We also need to account for the two special characters we use, "\t" for start of sequence and "\n" for end of sequence.


Knowing this, the first steps are to:
- Figure out how many different characters we need to vectorize our input sequences (in French) and target sequences (in English). We can of course expect different dimensionalities for different languages.
- Figure out the maximum length of the input and target sequences, for encoding.
- Vectorize each of the input and target sequences, for our training data.

## 2.1. First step : defining the vector space

In [3]:
input_texts = [] #input texts (french)
target_texts = [] #target texts (english)
input_characters = set() #the set that will contain all the input characters, used for vectorizing
target_characters = set() #same as above, for target

#Read each line of the file containing data
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

# Run through the lines and isolate input (french) and target (english) texts
for line in lines[: min(num_samples, len(lines) - 1)]:
    target_text, input_text, _ = line.split('\t')
    # We use "\t" as the start sequence character, and "\n" as end sequence character.
    target_text = '\t' + target_text + '\n'
    # List of input texts
    input_texts.append(input_text)
    target_texts.append(target_text)
    # We list the characters required for encoding, by storing all unique characters in the sequences
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

            
input_characters = sorted(list(input_characters)) #List of all required characters to encode input sequences
target_characters = sorted(list(target_characters)) #List of all required characters to encode target sequences

# Dimensions of the encoding and decoding spaces
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])


print('Number of input tokens:', num_encoder_tokens)
print('Number of output tokens:', num_decoder_tokens)
print('Max sequence length of inputs:', max_encoder_seq_length)
print('Max sequence length of outputs:', max_decoder_seq_length)

Number of input tokens: 75
Number of output tokens: 65
Max sequence length of inputs: 31
Max sequence length of outputs: 12


## 2.2. Second step : vectorizing 

Prepare the vectors with the right dimensions that will hold the input data of the encoder, the output data of the decoder, and the input data of the decoder.

**Reminder:**
- The encoder takes the input sequences as input. The dimensionality of their vectorized form is the length of the longest sequence times the size of the character space.
- The decoder is meant to predict the next step of the sequence, based on its beginning. Consequently, its inputs and outputs have the same dimensionality.
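As a rough check of these dimensions, the sketch below computes the tensor shapes and float32 memory footprint from the values printed above (75 and 65 tokens, maximum lengths 31 and 12). The count of 1000 sequences is an assumption taken from `num_samples`, and both helper functions are hypothetical.

```python
def one_hot_tensor_shape(n_sequences, max_seq_length, n_tokens):
    """Shape of a one-hot tensor: (sequences, timesteps, vocabulary size)."""
    return (n_sequences, max_seq_length, n_tokens)


def float32_size_bytes(shape):
    """Memory used by a float32 array of the given shape (4 bytes per value)."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * 4


# Assumption: 1000 sequences, with the sizes printed by the previous cell.
encoder_shape = one_hot_tensor_shape(1000, 31, 75)
decoder_shape = one_hot_tensor_shape(1000, 12, 65)

print(encoder_shape, float32_size_bytes(encoder_shape))
print(decoder_shape, float32_size_bytes(decoder_shape))
```

The decoder input and decoder target share the same shape, so the decoder figure applies to each of them.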

In [5]:
import numpy as np

# Dictionaries for vectorizing sequences, or restoring vectorized sequences to a readable format
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())


#empty vectors to hold encoder and decoder data
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')


Next, we prepare the vectorized sequences.

**Reminder: the decoder target data is the same as the decoder input data, but ahead by one step.**
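The one-step offset can be illustrated on plain strings before any vectorizing. This is only a sketch: `shifted_decoder_sequences` is a hypothetical helper, and the padding with spaces mirrors what the vectorizing cell does.

```python
def shifted_decoder_sequences(target_text, max_length, pad=" "):
    """Return (decoder input, decoder target) as padded strings.

    The target drops the start character, so position t of the target
    holds the character found at position t + 1 of the input.
    """
    decoder_input = target_text.ljust(max_length, pad)
    decoder_target = target_text[1:].ljust(max_length, pad)
    return decoder_input, decoder_target


dec_in, dec_out = shifted_decoder_sequences("\tGo.\n", 8)
print(repr(dec_in))
print(repr(dec_out))
```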

In [6]:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Pad the rest of the input sequence with spaces, until the max length is reached
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one step and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    # Pad the rest of both decoder sequences with spaces, until the max length is reached
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.


# 3.0. Defining the Encoder-Decoder Architecture

## 3.1. Encoder

Encoding is performed via an LSTM, whose states we store to condition the decoding. The output is discarded, and we only keep the states (see the description of the approach at the beginning of this notebook).
The latent dimensionality of the encoder, `latent_dim`, was defined earlier in this notebook.

In [7]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]




## 3.2. Decoder

Decoding is performed through an LSTM and a softmax layer (used to predict character probabilities).
During inference, we will need to use the internal states of the decoder (see below), so we define the decoder accordingly.
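As a reminder of what the softmax layer produces, here is a minimal pure-Python sketch that turns raw scores into character probabilities and picks the most likely character. The three-character vocabulary and the scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the max avoids overflow without changing the result.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


characters = ["a", "b", "c"]   # hypothetical tiny vocabulary
scores = [1.0, 3.0, 0.5]       # hypothetical scores for one timestep
probs = softmax(scores)
best = characters[probs.index(max(probs))]
print(probs, best)
```

In the notebook, the same role is played by the `Dense` layer with `activation='softmax'`, and the most likely character is selected with `np.argmax` during inference.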

In [8]:
# Prepare the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We define our decoder to return full output sequences, and to return internal states as well. We won't use the
# returned states during training, but we will use them during inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)


## 3.3 Full model

We define the model :
- using two inputs: the encoder inputs (French sentences) and the decoder inputs (English sentences)
- returning the decoder outputs (English sentences, one step ahead)

We will train for only a few epochs, and on 1000 samples, to save memory and time.

In [9]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# 4. Training the Seq2Seq model

In this example, we train on 80% of the data and validate on the remaining 20%.
To keep processing time reasonable, we limit the size of the training data as well as the number of epochs we run.
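The sketch below shows how such an 80/20 split partitions 1000 samples. It approximates the behaviour of Keras' `validation_split` argument, which holds out the last fraction of the provided arrays without shuffling them first. The helper `split_indices` is hypothetical and is not part of Keras.

```python
def split_indices(n_samples, validation_split):
    """Return (training indices, validation indices).

    The validation part is taken from the end of the data, as Keras
    does with `validation_split`.
    """
    split_at = int(n_samples * (1.0 - validation_split))
    return list(range(split_at)), list(range(split_at, n_samples))


train_idx, val_idx = split_indices(1000, 0.2)
print(len(train_idx), len(val_idx))
print(val_idx[0], val_idx[-1])
```

If the data file happens to be ordered (for example by sentence length), the validation set will not follow the same distribution as the training set; shuffling the sentence pairs beforehand would avoid this.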

In [21]:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=5,
          validation_split=0.2)
# Save model
model.save('nlg_encode_decode_model')

Train on 800 samples, validate on 200 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# 5. Inference

First, we need to define the models required to perform inference. Inference proceeds as follows:

1) Encode the input and retrieve the initial decoder state
2) Run one step of the decoder with this initial state and a "start of sequence" token as input. The output will be the next target token
3) Repeat with the current target token and current states

In [11]:

# 1.Define encoder model transforming the encoder input into the states of the encoding LSTM
encoder_model = Model(encoder_inputs, encoder_states)

# 2.Define decoder : the initial states of the decoder are also an input of the model (input from the encoder)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Decoder LSTM
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
# Decode softmax layer
decoder_outputs = decoder_dense(decoder_outputs)
# Full decoder model : 
# - it takes the decoder inputs (english sequences) as well as the initial states as inputs
# - it returns the decoder outputs (english sequence ahead by one step) and the current states (that will be used for the next inference step)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

We then write a function that will perform decoding. The steps for decoding are the following :
- We encode the input sequence, and retrieve the states (step 1 in the cell above)
- We run one step of the decoder, using the initial state retrieved from the encoder, and a sequence containing only a start of sequence token "\t". 
- From this step, we retrieve the output sequence and states of the decoder
- We then repeat the 2 previous steps with the updated output sequence and states.
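Stripped of the Keras details, this loop is a greedy decoding loop. The sketch below shows its structure with a stand-in for the decoder: `toy_step` is a hypothetical function that simply replays a fixed string, where the real decoder would predict the most likely next character from its states.

```python
def greedy_decode(step_fn, initial_state, start_token="\t",
                  end_token="\n", max_length=20):
    """Repeatedly feed the last predicted token and state back to the decoder."""
    token, state = start_token, initial_state
    decoded = ""
    while True:
        # One decoder step: returns the next token and the updated state.
        token, state = step_fn(token, state)
        decoded += token
        # Stop on the end token or when the max length is reached.
        if token == end_token or len(decoded) >= max_length:
            return decoded


def toy_step(token, state):
    """Hypothetical decoder step: the state is the text left to emit."""
    return state[0], state[1:]


result = greedy_decode(toy_step, "Go.\n")
print(repr(result))
```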

In [12]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for sequences
    stop_condition = False #Defines when to stop iterating
    decoded_sentence = '' #Will hold the decoded sequence
    while not stop_condition:
        
        ## PREDICTING
        #Run one step of decoder using the target_seq and states as input
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Get the decoded token: we check which token is the most likely, decode it using the mapping dictionary,
        # and then add it to the decoded sequence
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length or predict stop character.
        if (sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # UPDATING
        # Update the target sequence using the predicted token as input.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states to the current states of the lstm
        states_value = [h, c]

    return decoded_sentence

## 5.1. Run inference

In this simple example, we under-trained our model for memory and processing time reasons. We propose to see what the output looks like using some of the first sequences we used during training.

In [13]:
for seq_index in range(100):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('French Sentence:', input_texts[seq_index])
    print('English Translation:', decoded_sentence)

French Sentence: Va !
English Translation: Go away.

French Sentence: Salut !
English Translation: Go away.

French Sentence: Salut.
English Translation: Go away.

French Sentence: Cours !
English Translation: We lost.

French Sentence: Courez !
English Translation: Go away.

French Sentence: Qui ?
English Translation: We lost.

French Sentence: Ça alors !
English Translation: Who fun?

French Sentence: Au feu !
English Translation: Go away.

French Sentence: À l'aide !
English Translation: Go away.

French Sentence: Saute.
English Translation: Go away.

French Sentence: Ça suffit !
English Translation: We lost.

French Sentence: Stop !
English Translation: Go away.

French Sentence: Arrête-toi !
English Translation: Go away.

French Sentence: Attends !
English Translation: Go away.

French Sentence: Attendez !
English Translation: Go away.

French Sentence: Poursuis.
English Translation: Go away.

French Sentence: Continuez.
English Translation: Help up.

French Sentence: Poursuivez.
