## Sequence-to-sequence Learning

### Summary
In this notebook, we follow the newly published [tutorial](https://blog.keras.io/) on the official Keras blog. The objective is to demonstrate how to build a model that converts sequences from one domain into sequences in another domain.
The source code from the tutorial can be found on [GitHub](https://github.com/fchollet/keras/blob/master/examples/lstm_seq2seq.py).

**model prediction flow**:
input sequence of variable length --> model --> output sequence of variable length

**model architecture**:
* An RNN layer acts as the "encoder": it processes the input sequence and returns its own internal state. We discard the outputs of the encoder RNN and keep only the state. The state will serve as the "context" or "conditioning" of the decoder in the next step.

* Another RNN layer (or stack thereof) acts as "decoder": it is trained to predict the next character of the target sequence, given the previous characters of the target sequence.
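
The encoder/decoder split described above can be sketched without Keras. The block below is a minimal illustration, assuming a plain tanh RNN with small random (untrained) weights. The names `encode`, `decode_step`, and all the weight matrices are hypothetical and are not part of the tutorial, which uses LSTM layers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 5

# Randomly initialised, untrained weights (illustration only).
W_enc_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_enc_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_dec_x = rng.normal(size=(output_dim, hidden_dim)) * 0.1
W_dec_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_out = rng.normal(size=(hidden_dim, output_dim)) * 0.1


def encode(inputs):
    """Run a tanh RNN over inputs of shape (timesteps, input_dim).

    Only the final state is returned; the per-step outputs are discarded.
    """
    state = np.zeros(hidden_dim)
    for x_t in inputs:
        state = np.tanh(x_t @ W_enc_x + state @ W_enc_h)
    return state


def decode_step(prev_token, state):
    """Consume the previous one-hot token and return
    (next-token probabilities, updated state)."""
    state = np.tanh(prev_token @ W_dec_x + state @ W_dec_h)
    logits = state @ W_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum(), state


source = np.eye(input_dim)[[0, 2, 1]]   # three one-hot input steps
context = encode(source)                # the "conditioning" for the decoder
start_token = np.eye(output_dim)[0]     # assumed start-of-sequence token
probs, next_state = decode_step(start_token, context)
print(context.shape, probs.shape)
```

The only thing passed from encoder to decoder is `context`, so the input and output sequences may have different lengths.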

In [1]:
import os
import keras 
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Using TensorFlow backend.


In [15]:
batch_size = 64
epochs = 100
latent_dim = 256
num_samples = 10000
data_path = os.path.join(os.getcwd(), 'data', 'cmn-eng', 'cmn.txt')

### Load data for training the translator

In [16]:
if not os.path.isfile(data_path):
    msg = "ERROR! data file is not found ({}).".format(data_path)
    print(msg)

In [17]:
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
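# read as UTF-8 explicitly (the targets are Chinese) and close the file afterwards
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')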

for line in lines[:min(num_samples, len(lines) - 1)]:
    
    input_text, target_text = line.split('\t')
    target_text = '\t' + target_text + '\n'
    
    input_texts.append(input_text)
    target_texts.append(target_text)
    
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
            
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))

Derive the parameters that define the configuration of the encoder + decoder system from the data.

In [20]:
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples: ', len(input_texts))
print('Number of unique input tokens: ', num_encoder_tokens)
print('Number of unique output tokens: ', num_decoder_tokens)
print('Max sequence length for inputs: ', max_encoder_seq_length)
print('Max sequence length for outputs: ', max_decoder_seq_length)

Number of samples:  10000
Number of unique input tokens:  73
Number of unique output tokens:  2637
Max sequence length for inputs:  31
Max sequence length for outputs:  22


#### Data Preparation
In this section, we encode the original data as three arrays:
* **encoder_input_data**: a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters)
* **decoder_input_data**: a 3D array of shape (num_pairs, max_chinese_sentence_length, num_chinese_characters)
* **decoder_target_data**: the same as decoder_input_data but offset by one timestep.

The task is to train a seq2seq model to predict decoder_target_data given encoder_input_data and decoder_input_data.
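
To make the one-timestep offset concrete, here is a small sketch on a single toy target string. It follows the same convention as this notebook ('\t' as the start marker, '\n' as the end marker), but the string `'hi'` and the helper `decode_rows` are made up for illustration.

```python
import numpy as np

target_text = '\t' + 'hi' + '\n'        # start marker + text + end marker
chars = sorted(set(target_text))        # ['\t', '\n', 'h', 'i']
index = {c: i for i, c in enumerate(chars)}

steps, vocab = len(target_text), len(chars)
decoder_input = np.zeros((steps, vocab), dtype='float32')
decoder_target = np.zeros((steps, vocab), dtype='float32')

for t, c in enumerate(target_text):
    decoder_input[t, index[c]] = 1.0
    if t > 0:
        # the target at step t-1 is the character the decoder sees at step t
        decoder_target[t - 1, index[c]] = 1.0


def decode_rows(onehot):
    """Map each one-hot row back to its character (None for an all-zero row)."""
    return [chars[int(row.argmax())] if row.any() else None for row in onehot]


print(decode_rows(decoder_input))   # ['\t', 'h', 'i', '\n']
print(decode_rows(decoder_target))  # ['h', 'i', '\n', None]
```

At every step the decoder is shown the true previous character and asked for the next one. The last target row stays all zeros because nothing follows the end marker.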

In [22]:
input_token_index = dict(
    [(char, ii) for ii, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, ii) for ii, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), 
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), 
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), 
    dtype='float32')

zipped = zip(input_texts, target_texts)
for ii, (input_text, target_text) in enumerate(zipped):
    for tt, char in enumerate(input_text):
        encoder_input_data[ii, tt, input_token_index[char]] = 1.
    for tt, char in enumerate(target_text):
        decoder_input_data[ii, tt, target_token_index[char]] = 1.
        if tt > 0:
            decoder_target_data[ii, tt-1, target_token_index[char]] = 1.

In [25]:
encoder_input_data[0, ].shape, encoder_input_data[1, ].shape

((31, 73), (31, 73))

In [26]:
decoder_input_data[0, ].shape, decoder_input_data[1, ].shape

((22, 2637), (22, 2637))
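
Because the target vocabulary is large, the dense one-hot arrays are not small. The following back-of-the-envelope calculation uses the shapes reported above and assumes 4 bytes per `float32` element; the helper `dense_bytes` is hypothetical.

```python
# Shapes taken from the outputs above.
num_pairs = 10000
max_encoder_len, num_encoder_tokens = 31, 73
max_decoder_len, num_decoder_tokens = 22, 2637
BYTES_PER_FLOAT32 = 4


def dense_bytes(*shape):
    """Memory footprint in bytes of a dense float32 array with this shape."""
    total = BYTES_PER_FLOAT32
    for dim in shape:
        total *= dim
    return total


encoder_bytes = dense_bytes(num_pairs, max_encoder_len, num_encoder_tokens)
decoder_bytes = dense_bytes(num_pairs, max_decoder_len, num_decoder_tokens)

print(f"encoder_input_data: {encoder_bytes / 1e9:.2f} GB")  # 0.09 GB
print(f"decoder_input_data: {decoder_bytes / 1e9:.2f} GB")  # 2.32 GB
print(f"decoder input + target: {2 * decoder_bytes / 1e9:.2f} GB")  # 4.64 GB
```

So decoder_input_data and decoder_target_data together need roughly 4.6 GB of RAM at these settings. Lowering `num_samples` is the simplest way to reduce that if memory is tight.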

#### RNN constructor

In [30]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# define an input sequence and process it
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# return_state=True makes the LSTM return two pieces of information:
# 1) the output of the layer, and 2) its internal states (h and c)
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)
# we discard the encoder outputs and only keep the states
encoder_states = [state_h, state_c]

# set up the decoder, using 'encoder_states' as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# we set up our decoder to return full output sequences
# and to return internal states as well. We don't use the
# returned states in the training model, but we will use them
# during inference
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, 
                                    initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# define the model that will turn 
# 'encoder_input_data' & 'decoder_input_data' into 'decoder_target_data'
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [31]:
model.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy')

Training process

In [None]:
model.fit([encoder_input_data, decoder_input_data], 
          decoder_target_data, 
          batch_size=batch_size, 
          epochs=epochs,
          validation_split=0.2)
# save model
outfile_path = os.path.join(os.getcwd(), 'models', 'seq2seq_eng2chn_charlevel.h5')
model.save(outfile_path)

### Develop inference model
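
At inference time the true target sequence is not available, so the decoder has to consume its own previous prediction one step at a time, carrying its state forward. The loop below is a framework-free sketch of that greedy decoding procedure, not the notebook's actual inference code. `greedy_decode`, `encode_fn`, and `step_fn` are hypothetical names, and the toy stand-ins at the bottom replace a trained model so that the loop can be run on its own.

```python
def greedy_decode(encode_fn, step_fn, source,
                  start_token='\t', end_token='\n', max_len=22):
    """Decode one sequence greedily.

    encode_fn(source) -> initial decoder state
    step_fn(prev_token, state) -> (next_token, new_state)
    """
    state = encode_fn(source)
    token = start_token
    decoded = []
    for _ in range(max_len):            # hard cap in case the end marker never appears
        token, state = step_fn(token, state)
        if token == end_token:
            break
        decoded.append(token)
    return ''.join(decoded)


# Toy stand-ins for a trained model: the "state" is the list of characters
# still to emit, and each step pops the last one (so the output is reversed).
def toy_encode(source):
    return list(source)


def toy_step(prev_token, state):
    if not state:
        return '\n', state
    return state[-1], state[:-1]


print(greedy_decode(toy_encode, toy_step, 'abc'))  # cba
```

With the Keras model above, `encode_fn` would correspond to running the encoder to obtain `[state_h, state_c]`, and `step_fn` to one call of the decoder LSTM plus the dense softmax layer followed by an argmax over the output tokens. The default `max_len=22` mirrors the maximum output length reported earlier.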

In [None]:
import logging
