# HW5 Machine translation with Encoder-Decoder model

## Due April 24th, 23:59

In this homework, you are first shown an example of encoder-decoder machine translation model for a dummy problem. Make sure you understand how it works. Then you will need to build a similar model for a real machine translation data set. The data set provided in this homework is an italiano-english dataset (perché italiano 
è mia lingua preferita), but feel free to download your preferred language pari here (http://www.manythings.org/anki/).


You are given the following files:
- `Machine-Translation.ipynb`: This notebook file
- `ita.txt`: Training dataset (see http://www.manythings.org/anki/ to understand the structure)
- `utils/`: folder containing all utility code for the series of homeworks


### Deliverables (zip them all)

- pdf or html version of your final notebook
- Show some translation examples in your notebook
- writeup.pdf: Add a short essay discussing the biggest challenges you encounter during this assignment and what you have learnt.

(**You are encouraged to add the writeup doc into your notebook
using markdown/html langauge, just like how this notes is prepared**)

<h1>HW6 Write up</h1>
<h2>The dummy task</h2>
The date conversion task can get very high accracy, because it is very simple, even it can be described by some rules, so a NN can easily study its pattern, same as the own dummy task, 
But it is very useful, it let us understand the encoder-decoder stucture. It is suitable to solve the sequence generating problem with labeled data.(seqtoseq)
<h2> BLEU score </h2>
$BLEU score = BP * exp(\frac{1}{N}\sum_i^4{log(p_i)})$

$BP = min(1,e^{1−r/c})$

$p_i = \frac{\# of\ common\ ngram}{\#\ of\ total\ ngrams}$

<h2>Biggest challenge</h2>
First is the time

Second is the seed selection, for some seed, even the dummy task turns out to predict to be some weird value, I figured this out for quite some time. My own dummy "Adding number" does not converge as the date conversion example. I need more time to debug.

Third is the BLEU score in the translation problem. The vocab size and the training sample is large so it will take very long time to train. I did not implement it in this homework, I will definitely re-evaluate it.

# Set up

In [1]:
#%load_ext autoreload
#%autoreload 2
%matplotlib inline

import os, sys
# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path

from utils.general import show_keras_model

from keras.models import Model

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


# Dummy Translation Problem
We are not doing anything real here, rather, we create a dummy problem to demonstrate how easy or hard to use a S2S model for machine translation.

The dummy prblem I choose here is to translate datestr like "Aug-30-1989" to another format "1989/08/30". Sounds easy, isn't it? But think about it, you feel this simple because you have so much prior knowledge. You know the English meaning of "Aug", you know the different ways of representing dates, MM-DD-YYYY vs YYYY/MM/DD. But our model starts from absolute ignorance. Imagine you show this problem to a 2-year-old child, how much time does it make for him to figure out the rule? 

## Generate Training Data

In [2]:
import numpy as np

choice = np.random.choice
def source_generation(batch=100):
    months = choice(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], batch)
    days = choice(range(1, 28), batch)
    years = choice(range(1990, 2050), batch)
    
    return [ f"{m}-{d}-{y}" for m, d, y in zip(months, days, years)]

def translate(src):
    if type(src) == str: src = [src]
    mmap = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': "06", 'Jul': "07", 
            'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    result = []
    for d in src:
        m, d, y = d.split('-')
        result.append(f"{y}/{mmap[m]}/{str(d).rjust(2, '0')}")
        
    return result

In [3]:
# Let's generate some data
train_X_raw = source_generation(10000)
train_Y_raw = translate(train_X_raw)

# Verify the translation
print(train_X_raw[:5])
print(train_Y_raw[:5])

['Feb-15-1999', 'Jul-12-1997', 'May-1-1996', 'Jul-7-2004', 'Jan-12-2003']
['1999/02/15', '1997/07/12', '1996/05/01', '2004/07/07', '2003/01/12']


## Other dummy tasks

You are encouraged to generate your own dummy tasks, for example, what about a simple calculator, can you train your model to understand "186+95" equal to "281"?

# Encoder-Decoder Model

In [4]:
encoder_input_len = 11
decoder_input_len = 10
latent_dim = 256

## Raw data transformer

As of today, I guess you should be quite familar with what we are doing here.

In [5]:
from keras.preprocessing.sequence import pad_sequences

char_vocab = list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-/0123456789$^')

reverse_vocab = {k:v for v, k in enumerate(char_vocab)}
def char_to_num(X_raw, is_encoder=True):
    """
    Translate the raw input to the numerical encoding. We take different treatments for the
    encoder inputs and decoder inputs. This is because we need a starter character "^" for the 
    decoder inputs.
    """
    result = [[reverse_vocab[c] for c in sent] for sent in X_raw]
    
    if(is_encoder):
        assert all([len(row) <= encoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=encoder_input_len, 
                             padding='post', truncating='post', 
                             value=reverse_vocab['$'])
    else:
        assert all([len(row) == decoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=decoder_input_len+1, 
                             padding='pre', truncating='post', 
                             value=reverse_vocab['^'])

    return pad_sequences(result)

def num_to_char(X):
    return [''.join([char_vocab[c] for c in row]) for row in X]

## Training model

In [7]:
# from keras.models import Model
from keras.layers import (Input, LSTM, Dense, Bidirectional, Embedding, 
                          TimeDistributed, Concatenate)

"""
Define an input Layer. We use one-hot encoding instead of embedding layer here. Since
we are using character based model, embedding may not be necessary, and may not be very 
helpful neither. Do you know why?
"""
encoder_inputs = Input(shape=(encoder_input_len, len(char_vocab)), name="Encoder_Input")
# For encoder, we can see the entire sentence at once, so we can use Bidirectional LSTM
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True, name="Encoder_LSTM"))
# Bidrectional LSTM has 4 states instead of 2, we concatenate them to be comparable
# with the decoder LSTM
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_inputs)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Set up the decoder, using `encoder_states` as initial state
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(decoder_input_len, len(char_vocab)), name="Decoder_Input")
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, name="Decoder_LSTM")
decoder_lstm_outputs = decoder_lstm(decoder_inputs,
                                    initial_state=encoder_states)
decoder_dense = Dense(len(char_vocab), activation='softmax')
decoder_outputs = TimeDistributed(decoder_dense)(decoder_lstm_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

#show_keras_model(model)

## Train training model

In [8]:
# Run training
from keras.utils import to_categorical
"""
Don't be suprized that this model actually needs quite quite a lot of epochs to train, so please be patient.
After the model is trained, you can use the history.history object to plot the metrics improvment process.

While you are waiting for the model to train, feel free to read the next cell.
"""
batch_size = 1000
epochs = 75

# Here it's just some data transformation to translate the raw data to matrix inputs
encoder_input_data = to_categorical(char_to_num(train_X_raw, True), num_classes=len(char_vocab))
train_Y = to_categorical(char_to_num(train_Y_raw, False), num_classes=len(char_vocab))
# for decoder, the target lags input by 1 time step
decoder_input_data = train_Y[:, :-1, :]
decoder_target_data = train_Y[:, 1:, :]

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Train on 8000 samples, validate on 2000 samples
Epoch 1/75
Epoch 2/75
Epoch 3/75
Epoch 4/75
Epoch 5/75
Epoch 6/75
Epoch 7/75
Epoch 8/75
Epoch 9/75
Epoch 10/75
Epoch 11/75
Epoch 12/75
Epoch 13/75
Epoch 14/75
Epoch 15/75
Epoch 16/75
Epoch 17/75
Epoch 18/75
Epoch 19/75
Epoch 20/75
Epoch 21/75
Epoch 22/75
Epoch 23/75
Epoch 24/75
Epoch 25/75
Epoch 26/75
Epoch 27/75
Epoch 28/75
Epoch 29/75
Epoch 30/75
Epoch 31/75
Epoch 32/75
Epoch 33/75
Epoch 34/75
Epoch 35/75
Epoch 36/75
Epoch 37/75
Epoch 38/75
Epoch 39/75
Epoch 40/75
Epoch 41/75
Epoch 42/75
Epoch 43/75
Epoch 44/75
Epoch 45/75
Epoch 46/75
Epoch 47/75
Epoch 48/75
Epoch 49/75
Epoch 50/75
Epoch 51/75
Epoch 52/75
Epoch 53/75


Epoch 54/75
Epoch 55/75
Epoch 56/75
Epoch 57/75
Epoch 58/75
Epoch 59/75
Epoch 60/75
Epoch 61/75
Epoch 62/75
Epoch 63/75
Epoch 64/75
Epoch 65/75
Epoch 66/75
Epoch 67/75
Epoch 68/75
Epoch 69/75
Epoch 70/75
Epoch 71/75
Epoch 72/75
Epoch 73/75
Epoch 74/75
Epoch 75/75


## Inference model

Similar to HW04, we need a different model structure for the inference model. The inference model should copy exactly the same weights from the training model, but it predicts only 1 time step at a time.

In [10]:
# Trucate the encoder part of the training model as encoder model
encoder_model = Model(encoder_inputs, encoder_states)
#show_keras_model(encoder_model)

In [11]:
# Build the inference model
inference_inputs = Input(batch_shape=(1,1, len(char_vocab)), name="Inference_Input")
inference_lstm = LSTM(latent_dim*2, stateful=True,
                      name="Inference_LSTM",)
inference_lstm_outputs = inference_lstm(inference_inputs)

inference_dense = Dense(len(char_vocab), activation='softmax')
inference_outputs = inference_dense(inference_lstm_outputs)

# Assign the weights of decoder to inference model
inference_lstm.set_weights(decoder_lstm.get_weights())
inference_dense.set_weights(decoder_dense.get_weights())

inference_model = Model(inference_inputs, inference_outputs)
#show_keras_model(inference_model)

In [12]:
def inference(encoder_input_data):
    """
    A utility function to generate the model prediction
    """
    states_h, states_c = encoder_model.predict(encoder_input_data)
    results = []
    inference_model.reset_states()
    for h, c in zip(states_h, states_c):
        sent, seed = [], reverse_vocab['^']
        inference_lstm.states[0].assign(h[None, :])
        inference_lstm.states[1].assign(c[None, :])
        for i in range(decoder_input_len):
            seed = to_categorical(np.array([seed]), num_classes=len(char_vocab))[None, :, :]
            seed = inference_model.predict(seed)[0].argmax()
            sent.append(seed)
            
        results.append(sent)
        
    return num_to_char(results)

In [14]:
# Let's look at some output
print(num_to_char(encoder_input_data[:10].argmax(axis=2)))
print(translate(num_to_char(encoder_input_data[:10].argmax(axis=2))))
print(inference(encoder_input_data[:10]))
print(num_to_char(decoder_input_data[:10].argmax(axis=2)))
print(num_to_char(decoder_target_data[:10].argmax(axis=2)))

['Feb-15-1999', 'Jul-12-1997', 'May-1-1996$', 'Jul-7-2004$', 'Jan-12-2003', 'Oct-1-2017$', 'Jun-20-2030', 'Jan-5-1991$', 'May-23-2010', 'Jul-18-2036']
['1999/02/15', '1997/07/12', '1996$/05/01', '2004$/07/07', '2003/01/12', '2017$/10/01', '2030/06/20', '1991$/01/05', '2010/05/23', '2036/07/18']
['201//18111', '111/11/11/', '1181181181', '1811811811', '8118118118', '1181181181', '1811811811', '8118118118', '1181181181', '1811811811']
['^1999/02/1', '^1997/07/1', '^1996/05/0', '^2004/07/0', '^2003/01/1', '^2017/10/0', '^2030/06/2', '^1991/01/0', '^2010/05/2', '^2036/07/1']
['1999/02/15', '1997/07/12', '1996/05/01', '2004/07/07', '2003/01/12', '2017/10/01', '2030/06/20', '1991/01/05', '2010/05/23', '2036/07/18']


# Own dummy Model

In [15]:
choice = np.random.choice
def source_generation(batch=100):
    a = choice(range(1,1000), batch)
    b = choice(range(1,1000), batch)
    
    return [f"{m}+{n}" for m,n in zip(a,b)]

def translate(src):
    if type(src) == str: src = [src]
    result = []
    for d in src:
        a,b = d.split('+')
        result.append(f"{int(a)+int(b)}".rjust(5,'0'))
    return result

In [16]:
# Let's generate some data
train_X_raw = source_generation(10000)
train_Y_raw = translate(train_X_raw)

# Verify the translation
print(train_X_raw[:5])
print(train_Y_raw[:5])

['513+83', '24+814', '978+758', '346+625', '466+166']
['00596', '00838', '01736', '00971', '00632']


In [17]:
encoder_input_len = 8
decoder_input_len = 5
latent_dim = 256

In [18]:
from keras.preprocessing.sequence import pad_sequences

char_vocab = list('0123456789$^+')

reverse_vocab = {k:v for v, k in enumerate(char_vocab)}
def char_to_num(X_raw, is_encoder=True):
    """
    Translate the raw input to the numerical encoding. We take different treatments for the
    encoder inputs and decoder inputs. This is because we need a starter character "^" for the 
    decoder inputs.
    """
    result = [[reverse_vocab[c] for c in sent] for sent in X_raw]
    
    if(is_encoder):
        assert all([len(row) <= encoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=encoder_input_len, 
                             padding='post', truncating='post', 
                             value=reverse_vocab['$'])
    else:
        assert all([len(row) == decoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=decoder_input_len+1, 
                             padding='pre', truncating='post', 
                             value=reverse_vocab['^'])

    return pad_sequences(result)

def num_to_char(X):
    return [''.join([char_vocab[c] for c in row]) for row in X]

In [19]:
from keras.layers import (Input, LSTM, Dense, Bidirectional, Embedding, 
                          TimeDistributed, Concatenate)

encoder_inputs = Input(shape=(encoder_input_len, len(char_vocab)), name="Encoder_Input")
# For encoder, we can see the entire sentence at once, so we can use Bidirectional LSTM
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True, name="Encoder_LSTM"))
# Bidrectional LSTM has 4 states instead of 2, we concatenate them to be comparable
# with the decoder LSTM
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_inputs)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Set up the decoder, using `encoder_states` as initial state
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(decoder_input_len, len(char_vocab)), name="Decoder_Input")
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, name="Decoder_LSTM")
decoder_lstm_outputs = decoder_lstm(decoder_inputs,
                                    initial_state=encoder_states)
decoder_dense = Dense(len(char_vocab), activation='softmax')
decoder_outputs = TimeDistributed(decoder_dense)(decoder_lstm_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

#show_keras_model(model)

In [20]:
# Run training
"""
Don't be suprized that this model actually needs quite quite a lot of epochs to train, so please be patient.
After the model is trained, you can use the history.history object to plot the metrics improvment process.

While you are waiting for the model to train, feel free to read the next cell.
"""
batch_size = 1000
epochs = 75

# Here it's just some data transformation to translate the raw data to matrix inputs
encoder_input_data = to_categorical(char_to_num(train_X_raw, True), num_classes=len(char_vocab))
train_Y = to_categorical(char_to_num(train_Y_raw, False), num_classes=len(char_vocab))
# for decoder, the target lags input by 1 time step
decoder_input_data = train_Y[:, :-1, :]
decoder_target_data = train_Y[:, 1:, :]

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/75
Epoch 2/75
Epoch 3/75
Epoch 4/75
Epoch 5/75
Epoch 6/75
Epoch 7/75
Epoch 8/75
Epoch 9/75
Epoch 10/75
Epoch 11/75
Epoch 12/75
Epoch 13/75
Epoch 14/75
Epoch 15/75
Epoch 16/75
Epoch 17/75
Epoch 18/75
Epoch 19/75
Epoch 20/75
Epoch 21/75
Epoch 22/75
Epoch 23/75
Epoch 24/75
Epoch 25/75
Epoch 26/75
Epoch 27/75
Epoch 28/75
Epoch 29/75
Epoch 30/75
Epoch 31/75
Epoch 32/75
Epoch 33/75
Epoch 34/75
Epoch 35/75
Epoch 36/75
Epoch 37/75
Epoch 38/75
Epoch 39/75
Epoch 40/75
Epoch 41/75
Epoch 42/75
Epoch 43/75
Epoch 44/75
Epoch 45/75
Epoch 46/75
Epoch 47/75
Epoch 48/75
Epoch 49/75
Epoch 50/75
Epoch 51/75
Epoch 52/75
Epoch 53/75
Epoch 54/75
Epoch 55/75
Epoch 56/75


Epoch 57/75
Epoch 58/75
Epoch 59/75
Epoch 60/75
Epoch 61/75
Epoch 62/75
Epoch 63/75
Epoch 64/75
Epoch 65/75
Epoch 66/75
Epoch 67/75
Epoch 68/75
Epoch 69/75
Epoch 70/75
Epoch 71/75
Epoch 72/75
Epoch 73/75
Epoch 74/75
Epoch 75/75


In [21]:
# Trucate the encoder part of the training model as encoder model
encoder_model = Model(encoder_inputs, encoder_states)
#show_keras_model(encoder_model)

In [22]:
from keras.models import Model
# Build the inference model
inference_inputs = Input(batch_shape=(1,1, len(char_vocab)), name="Inference_Input")
inference_lstm = LSTM(latent_dim*2, stateful=True,
                      name="Inference_LSTM",)
inference_lstm_outputs = inference_lstm(inference_inputs)

inference_dense = Dense(len(char_vocab), activation='softmax')
inference_outputs = inference_dense(inference_lstm_outputs)

# Assign the weights of decoder to inference model
inference_lstm.set_weights(decoder_lstm.get_weights())
inference_dense.set_weights(decoder_dense.get_weights())

inference_model = Model(inference_inputs, inference_outputs)
#show_keras_model(inference_model)

In [23]:
def inference(encoder_input_data):
    """
    A utility function to generate the model prediction
    """
    states_h, states_c = encoder_model.predict(encoder_input_data)
    results = []
    inference_model.reset_states()
    for h, c in zip(states_h, states_c):
        sent, seed = [], reverse_vocab['^']
        inference_lstm.states[0].assign(h[None, :])
        inference_lstm.states[1].assign(c[None, :])
        for i in range(decoder_input_len):
            seed = to_categorical(np.array([seed]), num_classes=len(char_vocab))[None, :, :]
            seed = inference_model.predict(seed)[0].argmax()
            sent.append(seed)
            
        results.append(sent)
        
    return num_to_char(results)

In [24]:
# Let's look at some output
print(num_to_char(encoder_input_data[:10].argmax(axis=2)))
print(inference(encoder_input_data[:10]))

['513+83$$', '24+814$$', '978+758$', '346+625$', '466+166$', '743+379$', '16+586$$', '161+899$', '81+799$$', '914+32$$']
['08664', '01642', '24242', '24242', '24242', '24242', '24242', '24242', '24242', '24242']


# Real Machine translation 

In [25]:
"""
Now are you ready for the real challenge? You can use the ita.txt file as training data. 
But feel free to download different language from http://www.manythings.org/anki/. If you
happen to speak French or Japanese, it's time to show off!

1. Implement a Bidrectional LSTM Encoder-Decoder model, or other viable models to translate 
   the language dataset you choose.

2. Write the function to calculate the BLEU score of your model
"""
import os, sys
# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path

from utils.general import show_keras_model

from keras.models import Model

In [26]:
train_X_raw = []
train_Y_raw = []
f = open("ita.txt")
for r in f:
    s = r.split('\t')
    train_X_raw.append(s[0])
    train_Y_raw.append(s[1])
f.close()
encoder_input_len = max([len(X) for X in train_X_raw])
decoder_input_len = max([len(y) for y in train_Y_raw])
print(encoder_input_len,decoder_input_len)
latent_dim = 256

262 303


In [27]:
from collections import Counter
from keras.preprocessing.sequence import pad_sequences

total_chars = ''.join(train_X_raw)+''.join(train_Y_raw)
total_chars = Counter(total_chars)
char_vocab = sorted([c for c in total_chars])+['^','<END>']
reverse_vocab = {k:v for v, k in enumerate(char_vocab)}
def char_to_num(X_raw, is_encoder=True):
    """
    Translate the raw input to the numerical encoding. We take different treatments for the
    encoder inputs and decoder inputs. This is because we need a starter character "^" for the 
    decoder inputs.
    """
    result = [[reverse_vocab[c] for c in sent] for sent in X_raw]
    
    if is_encoder :
        assert all([len(row) <= encoder_input_len for row in X_raw])
        return pad_sequences(sequences=result, maxlen=encoder_input_len, 
                             padding='post', truncating='post', 
                             value=reverse_vocab['$'])
    else:
        assert all([len(row) <= decoder_input_len for row in X_raw])
        postpend = pad_sequences(sequences=result, maxlen=decoder_input_len, 
                             padding='post', truncating='post', 
                             value=reverse_vocab['<END>'])       
        return pad_sequences(sequences=postpend, maxlen=decoder_input_len+1, 
                             padding='pre', truncating='post', 
                             value=reverse_vocab['^'])

    return pad_sequences(result)

def num_to_char(X):
    return [''.join([char_vocab[c] for c in row]) for row in X]

In [28]:
print(char_vocab)
print(reverse_vocab)

[' ', '!', '"', '$', '%', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', '\xad', '°', 'º', 'È', 'à', 'á', 'ã', 'è', 'é', 'ê', 'ì', 'î', 'ï', 'ñ', 'ò', 'ö', 'ù', 'ú', 'ü', 'ō', '\u200b', '’', '€', '^', '<END>']
{' ': 0, '!': 1, '"': 2, '$': 3, '%': 4, "'": 5, ',': 6, '-': 7, '.': 8, '/': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, '?': 22, 'A': 23, 'B': 24, 'C': 25, 'D': 26, 'E': 27, 'F': 28, 'G': 29, 'H': 30, 'I': 31, 'J': 32, 'K': 33, 'L': 34, 'M': 35, 'N': 36, 'O': 37, 'P': 38, 'Q': 39, 'R': 40, 'S': 41, 'T': 42, 'U': 43, 'V': 44, 'W': 45, 'X': 46, 'Y': 47, 'Z': 48, 'a': 49, 'b': 50, 'c': 51, 'd': 52, 'e': 53, 'f

In [29]:
from keras.layers import (Input, LSTM, Dense, Bidirectional, Embedding, 
                          TimeDistributed, Concatenate)

encoder_inputs = Input(shape=(encoder_input_len, len(char_vocab)), name="Encoder_Input")
# For encoder, we can see the entire sentence at once, so we can use Bidirectional LSTM
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True, name="Encoder_LSTM"))
# Bidrectional LSTM has 4 states instead of 2, we concatenate them to be comparable
# with the decoder LSTM
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_inputs)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Set up the decoder, using `encoder_states` as initial state
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(decoder_input_len, len(char_vocab)), name="Decoder_Input")
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, name="Decoder_LSTM")
decoder_lstm_outputs = decoder_lstm(decoder_inputs,
                                    initial_state=encoder_states)
decoder_dense = Dense(len(char_vocab), activation='softmax')
decoder_outputs = TimeDistributed(decoder_dense)(decoder_lstm_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [32]:
# Run training
from keras.utils import to_categorical

"""
Don't be suprized that this model actually needs quite quite a lot of epochs to train, so please be patient.
After the model is trained, you can use the history.history object to plot the metrics improvment process.

While you are waiting for the model to train, feel free to read the next cell.
"""
batch_size = 1000
epochs = 1

# Here it's just some data transformation to translate the raw data to matrix inputs
encoder_input_data = to_categorical(char_to_num(train_X_raw[:10000], True), num_classes=len(char_vocab))
train_Y = to_categorical(char_to_num(train_Y_raw[:10000], False), num_classes=len(char_vocab))
# for decoder, the target lags input by 1 time step
decoder_input_data = train_Y[:, :-1, :]
decoder_target_data = train_Y[:, 1:, :]

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/1
