# Home 5: Build a seq2seq model for machine translation.

### Name: [David Fu]

### Task: Translate English to [Estonian]

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating **English** to **German** is not acceptable! Try another pair of languages.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the two. If you do both correctly, you can earn up to 1 bonus point toward the total.

    * Bi-LSTM instead of LSTM.
        
    * Attention. (You are allowed to use existing code.)
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 1 bonus point toward the total.
    
5. Convert the notebook to an .HTML file.

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your Google Drive, Dropbox, or GitHub repo. (If you submit the file via Google Drive or Dropbox, you must make the file "open-access". Delays caused by denied access may result in a late penalty.)

7. Submit the link to the HTML file to Canvas.    


### Hint: 

To implement ```Bi-LSTM```, you will need the following code to build the encoder. Do NOT use Bi-LSTM for the decoder.

In [1]:
from tensorflow.keras.layers import Bidirectional, Concatenate, LSTM

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## 1. Data preparation

1. Download data (e.g., "est-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [2]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
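
To see what `clean_data` does to a single sentence, the sketch below applies the same steps to one string. The helper name `clean_sentence` and the sample sentences are illustrative assumptions, not part of the assignment code.

```python
import re
import string
from unicodedata import normalize

def clean_sentence(line):
    """Apply the same cleaning steps as clean_data to one sentence."""
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # decompose accented characters, then drop the non-ASCII marks
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    # lowercase each token and strip punctuation
    words = [w.lower().translate(table) for w in line.split()]
    # remove any remaining non-printable characters
    words = [re_print.sub('', w) for w in words]
    # keep purely alphabetic tokens only
    return ' '.join(w for w in words if w.isalpha())

# "k\u00f5rval" is the Estonian word written with the letter o-tilde
cleaned = clean_sentence("Ma sooviksin, et ma oleksin su k\u00f5rval!")
print(cleaned)
```

Note that the ASCII step strips Estonian diacritics, which is why the cleaned pairs printed later contain forms such as "korval" and "valismaal".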

#### Fill the following blanks:

In [3]:
# e.g., filename = 'Data/deu.txt'
filename = 'est-eng/est.txt'

# e.g., n_train = 20000
n_train = 2187

In [4]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [5]:
for i in range(1300, 1310):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[i wish i were by your side] => [ma sooviksin et ma oleksin su korval]
[i wish i were by your side] => [ma sooviksin su korval olla]
[id like a map of the city] => [ma sooviksin selle linna kaarti]
[im not afraid of anything] => [ma ei karda midagi]
[if i were rich id buy it] => [kui oleksin rikas siis ostaksin selle]
[is that all right with you] => [kas see koik sobib sulle]
[is your uncle still abroad] => [kas su onu on ikka veel valismaal]
[life is hard for everybody] => [koigi elu on raske]
[many people like to travel] => [paljudele inimestele meeldib reisida]
[no one else has complained] => [keegi teine pole kurtnud]


In [6]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(numpy.shape(target_texts)))

Length of input_texts:  (2187,)
Length of target_texts: (2187,)


In [7]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 96
max length of target sentences: 92


**Remark:** At this point, you have two lists of sentences: input_texts and target_texts.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (2187, 96)
shape of input_token_index: 27
shape of decoder_input_seq: (2187, 92)
shape of target_token_index: 27
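
The character-level tokenization and zero-padding can be illustrated without Keras. The following is a simplified stand-in, not the Keras `Tokenizer` itself: the function names and the tie-breaking rule are assumptions made for the example. Index 0 is reserved for padding, which is why the vocabulary size used later is the number of characters plus one.

```python
from collections import Counter

def build_char_index(lines):
    """Map each character to an integer >= 1; 0 is reserved for padding."""
    counts = Counter(ch for line in lines for ch in line)
    # most frequent characters get the smallest indices (ties broken by character)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def to_padded_sequences(lines, char_index, max_len):
    """Convert each line to a list of indices, zero-padded on the right."""
    seqs = []
    for line in lines:
        seq = [char_index[ch] for ch in line][:max_len]
        seqs.append(seq + [0] * (max_len - len(seq)))
    return seqs

lines = ['ab', 'abc a']
char_index = build_char_index(lines)
padded = to_padded_sequences(lines, char_index, max_len=6)
print(char_index)
print(padded)
```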


In [9]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 28


**Remark:** At this point, the input language and target language texts have been converted to 2 matrices. 

- Both have n_train rows.
- They have max_encoder_seq_length and max_decoder_seq_length columns, respectively.

The following cells print a sentence and its representation as a sequence.

In [10]:
target_texts[100]

'\tkui sugav\n'

In [11]:
decoder_input_seq[100, :]

array([12, 14,  9,  4,  1,  5,  9, 19,  2, 17, 13,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0], dtype=int32)

### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by an $n\times t \times v$ tensor ($v$ is the vocabulary size, i.e., the number of unique chars plus one for padding) after the one-hot encoding.

In [12]:
from tensorflow.keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(2187, 96, 28)
(2187, 92, 28)
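
As a small illustration of the one-hot encoding and of how the decoder targets are built as a left shift of the decoder inputs, here is a sketch on toy data. The token ids (1 for the start sign, 2 for the stop sign) and the vocabulary size are assumptions made for the example only.

```python
import numpy as np

def onehot(seqs, vocab_size):
    """Turn an (n, t) integer matrix into an (n, t, v) one-hot tensor."""
    return np.eye(vocab_size)[seqs]

# toy decoder input: 1 = start sign, 2 = stop sign, 0 = padding
decoder_in = np.array([[1, 5, 3, 2, 0],
                       [1, 4, 2, 0, 0]])

# target = decoder input shifted left by one step, padded with 0
decoder_tgt = np.zeros_like(decoder_in)
decoder_tgt[:, :-1] = decoder_in[:, 1:]

x = onehot(decoder_in, vocab_size=6)
y = onehot(decoder_tgt, vocab_size=6)
print(x.shape, y.shape)
print(decoder_tgt)
```

At every time step the label is the character that follows the current decoder input, so the decoder learns to predict the next character.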


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encoding of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$), which is always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [13]:
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# # set the LSTM layer
# encoder_lstm = LSTM(latent_dim, return_state=True, 
#                     dropout=0.5, name='encoder_lstm')

# _, state_h, state_c = encoder_lstm(encoder_inputs)

# # set the BiDirectional LSTM layer
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))

encoder_lstm_output, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

encoder_state_h = Concatenate()([forward_h, backward_h])
encoder_state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[encoder_state_h, encoder_state_c],
                      name='encoder')
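
Concatenating the forward and backward states doubles the state size, which is why the decoder in Section 3.2 uses `latent_dim*2` units. The NumPy sketch below only mimics the shapes; the random arrays are stand-ins for the actual LSTM states, and the batch size is an arbitrary assumption.

```python
import numpy as np

latent_dim = 256
batch = 4
rng = np.random.default_rng(0)

# stand-ins for the final hidden states of the two LSTM directions
forward_h = rng.standard_normal((batch, latent_dim))
backward_h = rng.standard_normal((batch, latent_dim))

# Concatenate() joins along the last axis by default
state_h = np.concatenate([forward_h, backward_h], axis=-1)
print(state_h.shape)   # (4, 512)
```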

Print a summary and save the encoder network structure to "./encoder.pdf"

In [14]:
from IPython.display import SVG
from tensorflow.keras.utils import plot_model, model_to_dot
SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 512)          0           bidirectional[0][1]              
                                                                 bidirectional[0][3]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional[0][2]        

### 3.2. Decoder network

- Inputs:  

    -- one-hot encoding of the target language
    
    -- The initial hidden state $h_t$ 
    
    -- The initial conveyor belt $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [15]:
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, decoder_state_h, decoder_state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, decoder_state_h, decoder_state_c],
                      name='decoder')

Print a summary and save the decoder network structure to "./decoder.pdf"

In [16]:
from IPython.display import SVG
from tensorflow.keras.utils import plot_model, model_to_dot

SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='decoder.pdf'
)

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]      
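
The parameter counts in the two summaries can be checked by hand. The sketch below uses the standard count for an LSTM layer with bias; the function name is a hypothetical helper introduced for this check.

```python
def lstm_param_count(units, input_dim):
    """Trainable parameters of one LSTM layer with bias.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units), and a bias (units).
    """
    return 4 * (units * (input_dim + units) + units)

# the bidirectional encoder holds two LSTMs of 256 units each
encoder_params = 2 * lstm_param_count(units=256, input_dim=28)
# the decoder is a single LSTM of 512 units
decoder_params = lstm_param_count(units=512, input_dim=28)
print(encoder_params, decoder_params)
```

The two values agree with the `Param #` column reported for `bidirectional` (583680) and `decoder_lstm` (1107968).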

### 3.3. Connect the encoder and decoder

In [17]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

In [18]:
print(encoder_state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 512), dtype=tf.float32, name=None), name='concatenate/concat:0', description="created by layer 'concatenate'")
KerasTensor(type_spec=TensorSpec(shape=(None, 512), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [19]:
from IPython.display import SVG
from tensorflow.keras.utils import plot_model, model_to_dot

SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='model_training.pdf'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

### 3.4. Enhanced model with attention

In [20]:
from tensorflow.keras.layers import Attention

# attention over the final encoder and decoder hidden states (each of size 2*latent_dim = 512)
attn_inputs_encoder = Input(shape=(latent_dim*2,), 
                       name='attn_inputs_encoder')

attn_inputs_decoder = Input(shape=(latent_dim*2,), 
                       name='attn_inputs_decoder')

attn_layer = Attention(name='attention_layer')
attn_state_c = attn_layer([attn_inputs_encoder, attn_inputs_decoder])

concat_state = Concatenate()([attn_state_c, attn_inputs_decoder])

attn_dense = Dense(1024, activation='softmax', name='attn_dense')
final_state = attn_dense(concat_state)
attn_model = Model(inputs=[attn_inputs_encoder, attn_inputs_decoder],
                outputs=final_state, name='attn_model')

attn_model.summary()

# Updated Decoder Model
enhanced_decoder_dense = Dense(1024, activation='softmax', name='enhance_dense')

# connect encoder to decoder and attention
encoder_final_h, encoder_final_c = encoder_model([encoder_input_x])
decoder_lstm_output, decoder_final_h, decoder_final_c = decoder_lstm(decoder_input_x, initial_state=[encoder_final_h, encoder_final_c])
decoder_pred = decoder_dense(decoder_lstm_output)
attn_final_c = attn_model([encoder_final_h, decoder_final_h])
concat_pre = Concatenate()([attn_final_c, decoder_final_h])
enhance_pred = enhanced_decoder_dense(concat_pre)

# NOTE: enhance_pred has shape (None, 1024), which does not match
# decoder_target_data of shape (n, t, v); the model cannot be fit on it as-is.
enhance_decoder_model = Model(inputs=[encoder_input_x, decoder_input_x],
                      outputs=enhance_pred,
                      name='enhance_model')

enhance_decoder_model.summary()

Model: "attn_model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attn_inputs_encoder (InputLayer [(None, 512)]        0                                            
__________________________________________________________________________________________________
attn_inputs_decoder (InputLayer [(None, 512)]        0                                            
__________________________________________________________________________________________________
attention_layer (Attention)     (None, 512)          0           attn_inputs_encoder[0][0]        
                                                                 attn_inputs_decoder[0][0]        
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 1024)         0           attention_layer[0][0]   
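
The Keras `Attention` layer is a dot-product (Luong-style) attention layer that takes `[query, value]`. Attention in seq2seq models is usually computed over all encoder time steps rather than over the final state only. The NumPy sketch below shows that computation for a single sentence. The sequence lengths and random states are assumptions for illustration, and using it in this notebook would require the encoder to return its full output sequence, which the code above does not do.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(query, value):
    """query: (t_dec, d) decoder states; value: (t_enc, d) encoder states."""
    scores = query @ value.T             # (t_dec, t_enc) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    context = weights @ value            # (t_dec, d) weighted sums of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((7, 512))   # 7 source characters
decoder_states = rng.standard_normal((5, 512))   # 5 target characters
context, weights = dot_product_attention(decoder_states, encoder_states)
print(context.shape, weights.shape)
```

Each decoder step gets its own context vector, which is typically concatenated with the decoder output before the final softmax layer.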

In [21]:
from IPython.display import SVG
from tensorflow.keras.utils import plot_model, model_to_dot

SVG(model_to_dot(enhance_decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=enhance_decoder_model, show_shapes=False,
    to_file='attention_model.pdf'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1107968     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

### 3.5. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encoding of the input language

- decoder_input_data: one-hot encoding of the target language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stops decreasing.
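
The stopping rule in the last bullet can be sketched in plain Python. Keras provides an `EarlyStopping` callback (with `monitor='val_loss'` and a `patience` argument) for this purpose; the function below is only an illustration of the idea, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# made-up validation losses, for illustration only
val_losses = [1.20, 0.95, 0.80, 0.82, 0.81, 0.83, 0.79]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)   # 6
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs.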

In [22]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(2187, 96, 28)
shape of decoder_input_data(2187, 92, 28)
shape of decoder_target_data(2187, 92, 28)


In [23]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=16, epochs=100, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/100
...
Epoch 100/100


In [25]:
# enhance_decoder_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# enhance_decoder_model.fit([encoder_input_data, decoder_input_data],  # training data
#           decoder_target_data,                       # labels (left shift of the target sequences)
#           batch_size=16, epochs=100, validation_split=0.2)

# enhance_decoder_model.save('seq2seqattention.h5')

## 4. Make predictions


### 4.1. Translate English to Estonian

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop upon reaching the [stop] sign "\n").

In [26]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [34]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])
        
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence
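
The comment in `decode_sequence` suggests multinomial sampling with temperature instead of greedy selection. A minimal sketch of such a sampler follows; the function name, the epsilon, and the toy distribution are assumptions for illustration.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # rescale log-probabilities; the small epsilon avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = np.array([0.1, 0.6, 0.3])
rng = np.random.default_rng(0)
samples = [sample_with_temperature(probs, temperature=0.5, rng=rng)
           for _ in range(1000)]
print(np.bincount(samples, minlength=3) / 1000)
```

A temperature below 1 sharpens the distribution toward the greedy choice, and a temperature above 1 flattens it. If this replaces `numpy.argmax` in `decode_sequence`, note that index 0 (padding) can be sampled and has no entry in `reverse_target_char_index`, so it would need to be handled.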


In [28]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('Estonian (true): ', target_texts[seq_index][1:-1])
    print('Estonian (pred): ', decoded_sentence[0:-1])


-
English:        my sunglasses were stolen at the beach yesterday
Estonian (true):  mu paikeseprillid varastati eile rannas ara
Estonian (pred):  mu noorem vend vaatab televiisorit
-
English:        the period is missing at the end of the sentence
Estonian (true):  lause lopus puudub punkt
Estonian (pred):  tonaut kari maadad in taisi
-
English:        this flower is the most beautiful of all flowers
Estonian (true):  see lill on kauneim koigist lilledest
Estonian (pred):  see sird tahtud on pagem kui mitte midagi
-
English:        tom grabbed marys right hand with his left hand
Estonian (true):  tom haaras oma vasaku kaega mary parema
Estonian (pred):  tom utles et keegi teine ei olnud naljane
-
English:        tom isnt sure what he wants to do with his life
Estonian (true):  tom pole kindel mida ta oma eluga peale tahab hakata
Estonian (pred):  tom ei raaki kedus prantsuse keele opetaja
-
English:        what language do they speak in the united states
Estonian (true):  mis keelt am

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [29]:
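input_sentences = numpy.array(['i love you','i solve the problem'])
# NOTE: text2sequences fits a fresh Tokenizer on these sentences, so the
# character indices may differ from input_token_index used in training.
input_sequences, input_index = text2sequences(max_encoder_seq_length, input_sentences)
encoded_input_x = onehot_encode(input_sequences, max_encoder_seq_length, num_encoder_tokens)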

for seq_index in range(0, len(input_sentences)):
    input_seq = encoded_input_x[seq_index: seq_index + 1]
    translated_sentence = decode_sequence(input_seq)
    print('-')
    print('source sentence is: ' + input_sentences[seq_index])
    print('translated sentence is: ' + translated_sentence[0:-1])


-
source sentence is: i love you
translated sentence is: ma laan sida
-
source sentence is: i solve the problem
translated sentence is: ma opin selle ala aala
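
Because `text2sequences` builds a new character index on every call, new sentences are safest encoded with the index learned from the training data (`input_token_index`). The sketch below shows one way to do that; `encode_with_index` is a hypothetical helper and `toy_index` is a small stand-in for `input_token_index`.

```python
import numpy as np

def encode_with_index(sentences, token_index, max_len, vocab_size):
    """One-hot encode sentences using an existing char -> index mapping."""
    data = np.zeros((len(sentences), max_len, vocab_size))
    for i, sentence in enumerate(sentences):
        for t in range(max_len):
            # unknown characters and positions past the end map to padding (0)
            idx = token_index.get(sentence[t], 0) if t < len(sentence) else 0
            data[i, t, idx] = 1.0
    return data

# toy stand-in for input_token_index
toy_index = {' ': 1, 'i': 2, 'l': 3, 'o': 4, 'v': 5, 'e': 6}
x = encode_with_index(['i love'], toy_index, max_len=8, vocab_size=7)
print(x.shape)
print(x[0].argmax(axis=-1))
```

In the notebook, the call would take `input_token_index`, `max_encoder_seq_length`, and `num_encoder_tokens` as its last three arguments.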


## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


**Hint:** 

- Randomly partition the dataset into training, validation, and test sets. 

- Evaluate the BLEU score using the test set. Report the average.

- A reasonable BLEU score should be 0.1 ~ 0.5.

In [38]:
from sklearn.model_selection import train_test_split

x_train, x_remain, y_train, y_remain = train_test_split(encoder_input_data, target_texts, test_size=0.20, random_state=11)
x_valid, x_test, y_valid, y_test = train_test_split(x_remain, y_remain, test_size=0.1, random_state=5)
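
The split in the cell above is made after `model.fit` has already run on `encoder_input_data`, so some test sentences may have been seen during training. One way to follow the hint is to partition the row indices once, before training, and fit only on the training rows. The sketch below assumes an 80/10/10 split; the function name and fractions are illustrative.

```python
import numpy as np

def split_indices(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle 0..n-1 and cut into train / validation / test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(n * frac_train)
    n_valid = int(n * frac_valid)
    return (perm[:n_train],
            perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:])

train_idx, valid_idx, test_idx = split_indices(2187)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The same index arrays can then select rows from `encoder_input_data`, `decoder_input_data`, and `decoder_target_data`, which keeps the three tensors aligned.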

In [39]:
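from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
bleu_score = -1

# NOTE: known issues in this cell, left unchanged so the code matches the recorded output:
#  - sentence_bleu expects (list of tokenized references, tokenized hypothesis); here the
#    model output is passed as the reference and the true sentence as the candidate,
#    both as raw strings, and y_test entries still carry the '\t' and '\n' signs.
#  - range(1, len(x_test)-1) skips the first and the last test sentence.
#  - (bleu_score + result)/2 is a running half-and-half average, not the arithmetic mean.
for seq_index in range(1, len(x_test)-1):
    input_seq = x_test[seq_index: seq_index + 1]
    translated_sentence = decode_sequence(input_seq)
    reference = translated_sentence[0:-1]
    candidate = y_test[seq_index]
    result = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method7)
    if bleu_score == -1:
        bleu_score = result
    else:
        bleu_score = (bleu_score + result)/2
        
print("The mean of BLEU score for LSTM: ", bleu_score)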


The mean of BLEU score for LSTM:  0.36975452840460477
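
For reference, BLEU compares a tokenized candidate (the model output) against one or more tokenized references (the true translations), and the per-sentence scores are then averaged. The sketch below is a simplified pure-Python stand-in with one reference and add-one smoothing. It is not NLTK's implementation or its `method7` smoothing, and the predictions are made up for illustration (the references are taken from the cleaned pairs printed in Section 1.1).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with one reference and add-one smoothing.

    `reference` and `candidate` are lists of word tokens.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-one smoothing keeps the geometric mean away from log(0)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # brevity penalty for candidates shorter than the reference
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

references = ['ma ei karda midagi', 'koigi elu on raske']
predictions = ['ma ei karda midagi', 'elu on raske']   # made-up model outputs

scores = [simple_bleu(r.split(), p.split())
          for r, p in zip(references, predictions)]
mean_bleu = sum(scores) / len(scores)   # arithmetic mean over the test set
print(scores, mean_bleu)
```

An exact match scores 1.0, while the shorter second prediction is reduced by the brevity penalty even though all of its n-grams appear in the reference.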
