## Machine translation using LSTM cells on Sequence-to-Sequence models

Sequence-to-Sequence models map input sequences to target (output) sequences. This is still a prediction problem, but instead of assigning the input to one or more classes, the model predicts tokens one at a time until the end of the sequence, thereby generating an output sequence.

LSTMs are Long Short-Term Memory networks. They are commonly used for long-sequence problems because their gated cell state helps preserve information about earlier tokens.
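
To make the idea of preserving history a little more concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration only and not how the Keras `LSTM` layer is implemented internally; the helper name `lstm_step`, the weight layout, and the gate ordering (input, forget, candidate, output) are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper, for illustration only).

    x: (input_dim,)   h_prev, c_prev: (units,)
    W: (input_dim, 4 * units)   U: (units, 4 * units)   b: (4 * units,)
    """
    units = h_prev.shape[0]
    z = x @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])  # input gate
    f = sigmoid(z[1 * units:2 * units])  # forget gate
    g = np.tanh(z[2 * units:3 * units])  # candidate cell values
    o = sigmoid(z[3 * units:4 * units])  # output gate
    c = f * c_prev + i * g               # cell state carries history forward
    h = o * np.tanh(c)                   # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 5, 3
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
# Feed three one-hot "characters" through the cell, one step at a time
for x in np.eye(input_dim)[[0, 2, 4]]:
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate `f` decides how much of the previous cell state survives each step, which is what lets the cell retain information over many tokens.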

Here we apply LSTM cells in a Sequence-to-Sequence network at the character level for machine translation.

In [None]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model
import numpy as np

When working with multiple languages, it is highly recommended to use `utf-8` encoding so that non-ASCII characters are read properly.

In [None]:
with open('resource/en_de.txt', 'r', encoding='utf-8') as f:
    data = f.read().split('\n')

In [None]:
data[0:10]

The part of each line before the `\t` is in English, and the part after it is the German translation. We now extract them and store them as the input and target values for training the model.

In [None]:
input_texts = []
target_texts = []

In [None]:
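for line in data:
    # Skip blank or malformed lines (e.g. the empty string after the final newline)
    if '\t' not in line:
        continue
    # Keep only the first two fields in case a line has extra tab-separated columns
    ip, op = line.split('\t')[:2]
    input_texts.append(ip)
    target_texts.append('\t' + op + '\n')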

As you can see, we have deliberately prepended `\t` and appended `\n` to each target. These act as the START and END tokens. It is more common nowadays to use explicit `<START>` and `<END>` tokens, which are unambiguous; we use `\t` and `\n` for simplicity.
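
The extraction and wrapping steps can be seen on a single line. The sample line here is made up for illustration and is not taken from the dataset.

```python
# A made-up sample line in the "english<TAB>german" layout described above
sample_line = 'Hi.\tHallo!'

source, translation = sample_line.split('\t')
wrapped_target = '\t' + translation + '\n'  # START + text + END

print(repr(source))          # 'Hi.'
print(repr(wrapped_target))  # '\tHallo!\n'
```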

Since we work at the character level, `input_chars` and `target_chars` hold the character-level vocabularies extracted from the input and target texts.

In [None]:
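# Sort so the character order (and hence the index mapping) is the same on every run
input_chars = sorted(set(ch for text in input_texts for ch in text))
target_chars = sorted(set(ch for text in target_texts for ch in text))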
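
As a small illustration of vocabulary extraction, the sketch below builds character vocabularies from two made-up sentence pairs. Sorting the set is a choice made here to get a reproducible order; the `toy_` names are hypothetical and exist only for this example.

```python
toy_input_texts = ['Hi.', 'Go on.']                      # made-up samples
toy_target_texts = ['\tHallo!\n', '\tMach weiter.\n']    # already wrapped

# Collect every distinct character, then sort for a stable order
toy_input_chars = sorted(set(ch for text in toy_input_texts for ch in text))
toy_target_chars = sorted(set(ch for text in toy_target_texts for ch in text))

print(toy_input_chars)  # [' ', '.', 'G', 'H', 'i', 'n', 'o']
```

Note that the target vocabulary also contains `\t` and `\n`, because the START and END markers are ordinary characters as far as the model is concerned.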

### Encoder-Decoder network

Sequence-to-Sequence models adopt an encoder-decoder architecture. During training, the input sequences are fed into an encoder network and the target sentences into a decoder network. In order to do so, we must first define the shapes of the network inputs.

All input sequences are trained in the same network, but not all of them have the same length, so we take the maximum length of the input and target sequences as the encoder and decoder sequence lengths.

In [None]:
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

Each character token in the sequence is represented by a one-hot vector, so the length of each token vector (`num_enc_tokens`, `num_dec_tokens`) is the size of the respective vocabulary.

In [None]:
num_enc_tokens = len(input_chars)
num_dec_tokens = len(target_chars)
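
To show what a one-hot encoded, padded sequence looks like, here is a small sketch with a made-up four-character vocabulary. The names `vocab`, `max_len`, and `one_hot` are hypothetical; padding with the space character mirrors what this notebook does later when filling the training arrays.

```python
import numpy as np

vocab = [' ', 'a', 'b', 'c']  # made-up vocabulary
char2idx = {ch: i for i, ch in enumerate(vocab)}
max_len = 5                   # stands in for the maximum sequence length

def one_hot(text):
    """Encode text as a (max_len, vocab_size) one-hot matrix, padded with spaces."""
    out = np.zeros((max_len, len(vocab)), dtype='float32')
    for t, ch in enumerate(text):
        out[t, char2idx[ch]] = 1.0
    out[len(text):, char2idx[' ']] = 1.0  # pad the remaining timesteps
    return out

encoded = one_hot('cab')
print(encoded.shape)           # (5, 4)
print(encoded.argmax(axis=1))  # [3 1 2 0 0]
```

Each row has exactly one non-zero entry, and its width equals the vocabulary size.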

For one-hot encoding, we create character-index pairs so that we can map back and forth. Make sure that the order of `input_chars` and `target_chars` is preserved till the end; any change in the order leads to characters being mapped to the wrong indices.

In [None]:
input_char2idx = dict([(char, i) for i, char in enumerate(input_chars)])
target_char2idx = dict([(char, i) for i, char in enumerate(target_chars)])

The reverse lookup is not needed during training, but it comes in handy during inference.

In [None]:
input_idx2char = dict([(i, char) for i, char in enumerate(input_chars)])
target_idx2char = dict([(i, char) for i, char in enumerate(target_chars)])
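
The forward and reverse lookups can be checked with a quick round trip. This sketch uses a made-up string rather than the dataset, and the variable names are hypothetical.

```python
chars = sorted(set('hello world'))
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = {i: ch for ch, i in char2idx.items()}

indices = [char2idx[ch] for ch in 'hello']
restored = ''.join(idx2char[i] for i in indices)

print(indices)   # [3, 2, 4, 4, 5]
print(restored)  # hello
```

If the order of `chars` changed between building the two dictionaries, the round trip would no longer return the original text.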

### Model training

In order to feed the encoder-decoder network, the following step instantiates zero-filled arrays with defined shapes for the encoder input, decoder input, and decoder output. The shape of each array is (number of training samples, maximum sequence length, vocabulary size).

In [None]:
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_enc_tokens),dtype='float32')
decoder_input_data = np.zeros((len(target_texts), max_decoder_seq_length, num_dec_tokens),dtype='float32')
decoder_target_data = np.zeros((len(target_texts), max_decoder_seq_length, num_dec_tokens),dtype='float32')

Once the empty arrays are created, `encoder_input_data`, `decoder_input_data`, and `decoder_target_data` have to be filled with data from the English-German training dataset.

In [None]:
for idx, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[idx, t, input_char2idx[char]] = 1.
    encoder_input_data[idx, t + 1:, input_char2idx[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_input_data holds the full target, including the start character
        decoder_input_data[idx, t, target_char2idx[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[idx, t - 1, target_char2idx[char]] = 1.
    decoder_input_data[idx, t + 1:, target_char2idx[' ']] = 1.
    decoder_target_data[idx, t:, target_char2idx[' ']] = 1.
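
The one-timestep offset between decoder input and decoder target (commonly called teacher forcing) is easier to see on plain characters than on one-hot arrays. The target string below is made up for illustration.

```python
target_text = '\tHallo!\n'  # made-up wrapped target

decoder_input_chars = list(target_text)       # what the decoder sees at each step
decoder_target_chars = list(target_text[1:])  # what it should predict next

# Pair each input character with the character the decoder must output
for seen, expected in zip(decoder_input_chars, decoder_target_chars):
    print(repr(seen), '->', repr(expected))
```

The first pair maps the START character `\t` to `H`, and the last pair maps `!` to the END character `\n`.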

In [None]:
latent_units = 128 # The latent units are nothing but the hidden units

In [None]:
encoder_inputs = Input(shape=(None, num_enc_tokens))
encoder_lstm = LSTM(latent_units, return_state=True)

Make sure that the activation function in the final layer (`decoder_dense`) is `softmax`, since it predicts a probability distribution over the next character, which must sum to 1.

In [None]:
decoder_inputs = Input(shape=(None, num_dec_tokens))
decoder_lstm = LSTM(latent_units, return_state=True, return_sequences=True)
decoder_dense = Dense(num_dec_tokens, activation='softmax')
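
As a quick check of the sum-to-1 property of `softmax`, here is a small NumPy sketch. The `softmax` function is written by hand for illustration and the input scores are made up; it is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])  # made-up scores for three characters
probs = softmax(logits)

print(probs.sum())       # approximately 1.0
print(np.argmax(probs))  # 0
```

The largest score gets the largest probability, so taking `argmax` of the output picks the most likely next character.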

In [None]:
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

In [None]:
decoder_outputs, _,_ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=64, epochs=50, validation_split=0.2)

In [None]:
model.save('models/en2de.h5')

### Model inference

In [None]:
encoder_model = Model(encoder_inputs, encoder_states)

In [None]:
input_state_h = Input(shape=(latent_units,))
input_state_c = Input(shape=(latent_units,))
decoder_input_states = [input_state_h, input_state_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_input_states)
decoder_outputs = decoder_dense(decoder_outputs)
decoder_states = [state_h, state_c]

In [None]:
decoder_model = Model([decoder_inputs] + decoder_input_states, [decoder_outputs] + decoder_states)

In [None]:
def decode_sequence(input_seq):
    # Run input seq through encoder model
    states = encoder_model.predict(input_seq)
    
    target_seq = np.zeros((1, 1, num_dec_tokens))
    target_seq[0, 0, target_char2idx['\t']] = 1
    
    end_of_seq = False
    decoded_sentence = ''
    
    while not end_of_seq:
        output, h, c = decoder_model.predict([target_seq] + states)
        predicted_char_idx = np.argmax(output[0, -1, :])
        predicted_char = target_idx2char[predicted_char_idx]
        decoded_sentence += predicted_char
        
        if (predicted_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            end_of_seq = True
            
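        # Reset the one-hot input so it holds only the newly predicted character
        target_seq = np.zeros((1, 1, num_dec_tokens))
        target_seq[0, 0, predicted_char_idx] = 1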
        states = [h, c]
    return decoded_sentence
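
The greedy decoding loop can be tried without a trained model by replacing the decoder with a stand-in. In the sketch below, `fake_decoder_step`, `script`, and the tiny vocabulary are all hypothetical, and the LSTM states are left out to keep the example short. The loop builds a fresh one-hot input at every step so that only the latest character is set.

```python
import numpy as np

vocab = ['\t', '\n', 'a', 'b']
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = {i: ch for ch, i in char2idx.items()}
script = {'\t': 'a', 'a': 'b', 'b': '\n'}  # stand-in for a trained decoder

def fake_decoder_step(target_seq):
    """Stand-in for decoder_model.predict: scores for the next character."""
    current = idx2char[int(np.argmax(target_seq[0, 0]))]
    scores = np.zeros((1, 1, len(vocab)))
    scores[0, 0, char2idx[script[current]]] = 1.0
    return scores

def greedy_decode(max_len=10):
    # Start with a one-hot START character
    target_seq = np.zeros((1, 1, len(vocab)))
    target_seq[0, 0, char2idx['\t']] = 1.0
    decoded = ''
    while True:
        output = fake_decoder_step(target_seq)
        idx = int(np.argmax(output[0, -1, :]))
        decoded += idx2char[idx]
        # Stop at the END character or when the length limit is reached
        if idx2char[idx] == '\n' or len(decoded) >= max_len:
            return decoded
        # Fresh one-hot input, so only the latest character is set
        target_seq = np.zeros((1, 1, len(vocab)))
        target_seq[0, 0, idx] = 1.0

result = greedy_decode()
print(repr(result))  # 'ab\n'
```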

In [None]:
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)