# Home 4: Build a seq2seq model for machine translation.

### Name: [Songgaojun Deng]

### Task: Translate English to [Spanish, French]

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translation English to **German** is not acceptable!!! Try another language.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the followings. By doing more, you will get up to 2 bonus scores to the total.

    * Bi-LSTM instead of LSTM
    
    * Multi-task learning (e.g., both English to French and English to Spanish)
    
    * Attention
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 2 bonus scores to the total.
    
5. Convert the notebook to .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your own Github repo. 

7. Submit the link to the HTML file to Canvas

    * E.g., https://github.com/wangshusen/CS583A-2019Spring/blob/master/homework/HM4/seq2seq.html

#### Hint: To implement Bi-LSTM, you will need the following code to build the encoder; the decoder won't be much different.

In [1]:
# from keras.layers import Bidirectional, Concatenate, LSTM, 

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [2]:
import re
import string
from unicodedata import normalize
import numpy as np
import os
os.environ["CUDA_VISIBLE_DEVICES"]='3'

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars form each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return np.array(cleaned)

#### Fill the following blanks:


In [3]:
# e.g., filename = 'Data/deu.txt'
filename = 'Data/spa.txt'
filename2 = 'Data/fra.txt'
# e.g., n_train = 20000
n_train = 40000

In [4]:
# load dataset
doc = load_doc(filename)
doc2 = load_doc(filename2)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)
pairs2 = to_pairs(doc2)

# find intersection and indices, for multi-task learning
def intersect_mtlb(a, b):
    a1, ia = np.unique(a, return_index=True)
    b1, ib = np.unique(b, return_index=True)
    aux = np.concatenate((a1, b1))
    aux.sort()
    c = aux[:-1][aux[1:] == aux[:-1]]
    return c, ia[np.isin(a1, c)], ib[np.isin(b1, c)]

A = np.array(pairs)[:,0]
B = np.array(pairs2)[:,0]
c, ia, ib = intersect_mtlb(A,B)

pairs = np.array(pairs)[ia]
pairs2 = np.array(pairs2)[ib]

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]
clean_pairs2 = clean_data(pairs2)[0:n_train, :]

In [5]:
for i in range(3000, 3005):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')
print('-')
for i in range(3000, 3005):
    print('[' + clean_pairs2[i, 0] + '] => [' + clean_pairs2[i, 1] + ']')

[do you know this man in the picture] => [conoces al hombre en esta fotografia]
[do you know this song] => [conoces esta cancion]
[do you know us] => [nos conoces]
[do you know what i mean] => [sabes que quiero decir]
[do you know what unesco stands for] => [sabe que significa unesco]
-
[do you know this man in the picture] => [connaistu lhomme sur cette photo]
[do you know this song] => [tu connais cette chanson]
[do you know us] => [nous connaistu]
[do you know what i mean] => [comprenezvous ce que je veux dire]
[do you know what unesco stands for] => [savezvous ce que unesco veut dire]


In [6]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]
target_texts2 = ['\t' + text + '\n' for text in clean_pairs2[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(len(target_texts)))
print('Length of target_texts2: ' + str(len(target_texts2)))

Length of input_texts:  (40000,)
Length of target_texts: 40000
Length of target_texts2: 40000


In [7]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)
max_decoder_seq_length2 = max(len(line) for line in target_texts2)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))
print('max length of target2 sentences: %d' % (max_decoder_seq_length2))

max length of input  sentences: 281
max length of target sentences: 330
max length of target2 sentences: 341


**Remark:** To this end, you have two lists of sentences: input_texts and target_texts

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.

In [8]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)
decoder_input_seq2, target_token_index2 = text2sequences(max_decoder_seq_length2, 
                                                       target_texts2)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))
print('shape of decoder_input_seq2: ' + str(decoder_input_seq2.shape))
print('shape of target_token_index2: ' + str(len(target_token_index2)))

Using TensorFlow backend.


shape of encoder_input_seq: (40000, 281)
shape of input_token_index: 27
shape of decoder_input_seq: (40000, 330)
shape of target_token_index: 29
shape of decoder_input_seq2: (40000, 341)
shape of target_token_index2: 29


In [9]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1
num_decoder_tokens2 = len(target_token_index2) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))
print('num_decoder_tokens2: ' + str(num_decoder_tokens2))

num_encoder_tokens: 28
num_decoder_tokens: 30
num_decoder_tokens2: 30


**Remark:** To this end, the input language and target language texts are converted to 2 matrices. 

- Their number of rows are both n_train.
- Their number of columns are respective max_encoder_seq_length and max_decoder_seq_length.
 
 
The followings print a sentence and its representation as a sequence.

In [10]:
target_texts[100], target_texts2[100]

('\tel oido de un ciego a menudo es muy agudo\n',
 '\tles aveugles ont souvent une perception auditive accrue\n')

In [11]:
decoder_input_seq[100, :], decoder_input_seq2[100, :]

(array([13,  2,  9,  1,  4,  8, 12,  4,  1, 12,  2,  1, 11,  6,  1, 15,  8,
         2, 22,  4,  1,  3,  1, 16,  2,  6, 11, 12,  4,  1,  2,  5,  1, 16,
        11, 23,  1,  3, 22, 11, 12,  4, 14,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0, 

## 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by a $n\times t \times v$ tensor ($t$ is the number of unique chars) after the one-hot encoding.

In [12]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = np.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)
decoder_input_data2 = onehot_encode(decoder_input_seq2, max_decoder_seq_length2, num_decoder_tokens2)

decoder_target_seq = np.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

decoder_target_seq2 = np.zeros(decoder_input_seq2.shape)
decoder_target_seq2[:, 0:-1] = decoder_input_seq2[:, 1:]
decoder_target_data2 = onehot_encode(decoder_target_seq2, 
                                    max_decoder_seq_length2, 
                                    num_decoder_tokens2)

print(encoder_input_data.shape)
print(decoder_input_data.shape)
print(decoder_input_data2.shape)

(40000, 281, 28)
(40000, 330, 30)
(40000, 341, 30)


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encode of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$) are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [13]:
from keras.layers import Input, LSTM, Bidirectional, Concatenate, CuDNNLSTM
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

## Bidirectional LSTM
encoder_lstm = Bidirectional(CuDNNLSTM(latent_dim, return_state=True,  #dropout=0.5, 
                    name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_inputs)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

print(state_h.shape, state_c.shape)

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')

(?, 512) (?, 512)


In [14]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None, 28)     0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 512), (None, 585728      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 512)          0           bidirectional_1[0][2]            
          

### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- The initial hidden state $h_t$ 
    
    -- The initial conveyor belt $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [17]:
from keras.layers import Input, LSTM, Dense, Activation, dot
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim * 2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim * 2,), name='decoder_input_c')

decoder_input_h2 = Input(shape=(latent_dim * 2,), name='decoder_input_h2')
decoder_input_c2 = Input(shape=(latent_dim * 2,), name='decoder_input_c2')

decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')
decoder_input_x2 = Input(shape=(None, num_decoder_tokens2), name='decoder_input_x2')


decoder_lstm = CuDNNLSTM(latent_dim * 2, return_sequences=True, #dropout=0.5, 
                    return_state=True, name='decoder_lstm')

decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])
decoder_lstm2 = CuDNNLSTM(latent_dim * 2, return_sequences=True, #dropout=0.5, 
                    return_state=True, name='decoder_lstm2')

decoder_lstm_outputs2, state_h2, state_c2 = decoder_lstm2(decoder_input_x2, 
                                                      initial_state=[decoder_input_h2, decoder_input_c2])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense') 
decoder_outputs = decoder_dense(decoder_lstm_outputs)

decoder_dense2 = Dense(num_decoder_tokens2, activation='softmax', name='decoder_dense2') 
decoder_outputs2 = decoder_dense2(decoder_lstm_outputs2)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

decoder_model2 = Model(inputs=[decoder_input_x2, decoder_input_h2, decoder_input_c2],
                      outputs=[decoder_outputs2, state_h2, state_c2],
                      name='decoder2')

Print a summary and save the encoder network structure to "./decoder.pdf"

In [20]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='decoder.pdf'
)

decoder_model.summary()

plot_model(
    model=decoder_model2, show_shapes=False,
    to_file='decoder2.pdf'
)

decoder_model2.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_lstm (CuDNNLSTM)        [(None, None, 512),  1114112     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]            
          

### 3.3. Connect the encoder and decoder

In [21]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])

decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

decoder_lstm_output2, _, _ = decoder_lstm2(decoder_input_x2, initial_state=encoder_final_states)
decoder_pred2 = decoder_dense2(decoder_lstm_output2)

# model = Model(inputs=[encoder_input_x, decoder_input_x], 
#               outputs=decoder_pred, 
#               name='model_training')

model = Model(inputs=[encoder_input_x, decoder_input_x, decoder_input_x2], 
              outputs=[decoder_pred, decoder_pred2], 
              name='model_training')

In [22]:
print(state_h)
print(decoder_input_h)
print(decoder_input_h2)

Tensor("decoder_lstm_2/strided_slice_16:0", shape=(?, 512), dtype=float32)
Tensor("decoder_input_h_2:0", shape=(?, 512), dtype=float32)
Tensor("decoder_input_h2_2:0", shape=(?, 512), dtype=float32)


In [23]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='model_training.pdf'
)

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    (None, None, 28)     0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
encoder (Model)                 [(None, 512), (None, 585728      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_input_x2 (InputLayer)   (None, None, 30)     0                                            
__________________________________________________________________________________________________
decoder_ls

### 3.5. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encode of the input language

- decoder_input_data: one-hot encode of the input language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stop decreasing.

In [24]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(40000, 281, 28)
shape of decoder_input_data(40000, 330, 30)
shape of decoder_target_data(40000, 330, 30)


In [25]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data, decoder_input_data2],  # training data
          [decoder_target_data, decoder_target_data2],                # labels (left shift of the target sequences)
          batch_size=64, epochs=1, validation_split=0.2)

model.save('seq2seq2.h5')


Train on 32000 samples, validate on 8000 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


  '. They will not be included '
  '. They will not be included '


## 4. Make predictions


### 4.1. Translate English to Spanish and French

1. Encoder read a sentence (source language) and output its final states, $h_t$ and $c_t$.
2. Take the [star] sign "\t" and the final state $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and predicted probability distribution.
4. sample a char from the predicted probability distribution
5. take the sampled char and the new states as input and repeat the process (stop if reach the [stop] sign "\n").

In [26]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())
reverse_target_char_index2 = dict((i, char) for char, i in target_token_index2.items())

In [38]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    states_value_org = states_value
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.
    
    target_seq2 = np.zeros((1, 1, num_decoder_tokens2))
    target_seq2[0, 0, target_token_index2['\t']] = 1.
    

    stop_condition = False
    decoded_sentence = ''
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]
        
    stop_condition = False
    decoded_sentence2 = ''
    states_value = states_value_org
    while not stop_condition:
        output_tokens, h, c = decoder_model2.predict([target_seq2] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index2 = np.argmax(output_tokens[0, -1, :])
        
        sampled_char = reverse_target_char_index2[sampled_token_index2]
        decoded_sentence2 += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence2) > max_decoder_seq_length2):
            stop_condition = True

        target_seq2 = np.zeros((1, 1, num_decoder_tokens2))
        target_seq2[0, 0, sampled_token_index2] = 1.

        states_value = [h, c]

    return decoded_sentence, decoded_sentence2


In [40]:
for seq_index in range(2100, 2110):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
#     print(input_seq)
#     print(input_seq.shape)
    decoded_sentence, decoded_sentence2 = decode_sequence(input_seq)
    print('-----------')
    print('English:       ', input_texts[seq_index])
    print('Spanish (true): ', target_texts[seq_index][1:-1])
    print('Spanish (pred): ', decoded_sentence[0:-1])
    print('French (true): ', target_texts2[seq_index][1:-1])
    print('French (pred): ', decoded_sentence2[0:-1])


-----------
English:        come back home
Spanish (true):  vuelve a casa
Spanish (pred):  vuelve a casa en la cama
French (true):  reviens a la maison
French (pred):  venez le boison
-----------
English:        come back in an hour
Spanish (true):  regresa en una hora
Spanish (pred):  vuelve a casa en un hospital
French (true):  revenez dans une heure
French (pred):  reviens a la maison
-----------
English:        come back soon
Spanish (true):  regresa pronto
Spanish (pred):  vuelve a casa
French (true):  ne sois pas long
French (pred):  reviens a la maison
-----------
English:        come back tomorrow
Spanish (true):  vuelve manana
Spanish (pred):  vuelve a casa
French (true):  reviens demain
French (pred):  reviens a la maison
-----------
English:        come back
Spanish (true):  vuelve
Spanish (pred):  ven casa
French (true):  reviens
French (pred):  venez ici
-----------
English:        come down here
Spanish (true):  baja aqui
Spanish (pred):  ven con nosotros
French (true):  

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [69]:
input_sentence = 'i need a new one'
# do tokenization
encoder_input_seq_test, input_token_index_test = text2sequences(max_encoder_seq_length, 
                                                      [input_sentence])
# do one-hot encode
encoder_input_data_test = onehot_encode(encoder_input_seq_test, max_encoder_seq_length, num_encoder_tokens)

#do translation
spanish_pred, french_pred = decode_sequence(encoder_input_data_test)

print('source sentence is: ')
print('[English]:', input_sentence)
print('-')
print('translated sentence is: ')
print()
print('[Spanish]:', spanish_pred)
print('[French]:', french_pred)

source sentence is: 
[English]: i need a new one
-
translated sentence is: 

[Spanish]: mas viene con su ayuda

[French]: ou a repondre a cette question



## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


In [73]:
from nltk.translate.bleu_score import sentence_bleu

for i in clean_pairs:
    if i[0] == input_sentence:
        ref = i[1]
        score = sentence_bleu([ref], spanish_pred)
        print(score)
for i in clean_pairs2:
    if i[0] == input_sentence:
        ref = i[1]
        score = sentence_bleu([ref], french_pred)
        print(score)

5.9720086015470606e-155
5.3158692855940656e-155
