<img src="../Pics/MLSb-T.png" width="160">
<br><br>
<center><u><H1>Neural Machine Translation with Seq2Seq</H1></u></center>

In [1]:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.log_device_placement = True
sess = tf.Session(config=config)
set_session(sess)

Using TensorFlow backend.


## Summary:
1) Turn the sentences into 3 NumPy arrays, encoder_input_data, decoder_input_data and decoder_target_data (called tokenized_eng_sent, tokenized_spa_sent and target_data in the code below):

encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.

decoder_input_data is a 3D array of shape (num_pairs, max_spanish_sentence_length, num_spanish_characters) containing a one-hot vectorization of the Spanish sentences.

decoder_target_data is the same as decoder_input_data but offset by one timestep: decoder_target_data[:, t, :] is the same as decoder_input_data[:, t + 1, :].

2) Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data.

3) Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
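
The one-timestep offset between decoder_input_data and decoder_target_data can be checked on a tiny example. The following is a minimal sketch, assuming two made-up target sentences; the names `toy_targets`, `toy_chars` and `toy_char2id` are hypothetical and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy target sentences, already wrapped in the
# start ('\t') and end ('\n') markers used in this notebook.
toy_targets = ['\tVe.\n', '\tHola.\n']
toy_chars = sorted(set(''.join(toy_targets)))
toy_char2id = {ch: i for i, ch in enumerate(toy_chars)}
max_len = max(len(s) for s in toy_targets)

decoder_input_data = np.zeros(
    (len(toy_targets), max_len, len(toy_chars)), dtype='float32')
decoder_target_data = np.zeros_like(decoder_input_data)

for i, sent in enumerate(toy_targets):
    for t, ch in enumerate(sent):
        # One-hot encode character ch at timestep t.
        decoder_input_data[i, t, toy_char2id[ch]] = 1.0
        if t > 0:
            # The target is the same character, one timestep earlier,
            # so the start marker never appears in the target.
            decoder_target_data[i, t - 1, toy_char2id[ch]] = 1.0

# decoder_target_data[:, t, :] equals decoder_input_data[:, t + 1, :].
offset_holds = np.array_equal(
    decoder_target_data[:, :-1, :], decoder_input_data[:, 1:, :])
print(decoder_input_data.shape, offset_holds)
```

The last timestep of decoder_target_data stays all zeros, because no input character follows it.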

The original author of this code is Francois Chollet: 

https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py

In [2]:
from keras.layers import Dense, LSTM, CuDNNLSTM
from keras import Model, Input
from keras.optimizers import Adam
import numpy as np

In [3]:
with open('../data/machine_translation/spa.txt', encoding='utf-8') as f:
    lines = f.read().split('\n')
lines

['Go.\tVe.',
 'Go.\tVete.',
 'Go.\tVaya.',
 'Go.\tVáyase.',
 'Hi.\tHola.',
 'Run!\t¡Corre!',
 'Run.\tCorred.',
 'Who?\t¿Quién?',
 'Wow!\t¡Órale!',
 'Fire!\t¡Fuego!',
 'Fire!\t¡Incendio!',
 'Fire!\t¡Disparad!',
 'Help!\t¡Ayuda!',
 'Help!\t¡Socorro! ¡Auxilio!',
 'Help!\t¡Auxilio!',
 'Jump!\t¡Salta!',
 'Jump.\tSalte.',
 'Stop!\t¡Parad!',
 'Stop!\t¡Para!',
 'Stop!\t¡Pare!',
 'Wait!\t¡Espera!',
 'Wait.\tEsperen.',
 'Go on.\tContinúa.',
 'Go on.\tContinúe.',
 'Hello!\tHola.',
 'I ran.\tCorrí.',
 'I ran.\tCorría.',
 'I try.\tLo intento.',
 'I won!\t¡He ganado!',
 'Oh no!\t¡Oh, no!',
 'Relax.\tTomátelo con soda.',
 'Smile.\tSonríe.',
 'Attack!\t¡Al ataque!',
 'Attack!\t¡Atacad!',
 'Get up.\tLevanta.',
 'Go now.\tVe ahora mismo.',
 'Got it!\t¡Lo tengo!',
 'Got it?\t¿Lo pillas?',
 'Got it?\t¿Entendiste?',
 'He ran.\tÉl corrió.',
 'Hop in.\tMétete adentro.',
 'Hug me.\tAbrázame.',
 'I fell.\tMe caí.',
 'I know.\tYo lo sé.',
 'I left.\tSalí.',
 'I lied.\tMentí.',
 'I lost.\tPerdí.',
 'I quit.\tDim

In [4]:
len(lines)

119937

## Creating English - Spanish Sentences:

In [5]:
eng_sent = []
spa_sent = []
eng_chars = set()
spa_chars = set()
n_samples = 10000

for line in lines[:n_samples]:

    # Each line holds an English sentence and its Spanish translation, separated by a tab
    eng_line, spa_line = line.split('\t')[:2]

    # Prepend '\t' to mark the start of the sentence and append '\n' to mark its end
    spa_line = '\t' + spa_line + '\n'
    eng_sent.append(eng_line)
    spa_sent.append(spa_line)

    # Collect the distinct characters seen in each language
    eng_chars.update(eng_line)
    spa_chars.update(spa_line)

In [6]:
eng_sent[0]

'Go.'

In [7]:
spa_sent[0]

'\tVe.\n'

## Preprocessing of the data:

In [8]:
spa_chars = sorted(list(spa_chars))
eng_chars = sorted(list(eng_chars))

In [9]:
print(eng_chars)

[' ', '!', '"', '$', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [10]:
print(spa_chars)

['\t', '\n', ' ', '!', '"', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¡', '«', '»', '¿', 'Á', 'É', 'Ó', 'Ú', 'á', 'é', 'í', 'ñ', 'ó', 'ú', 'ü']


In [11]:
# Maps an index to its English character (key: index, value: character)
eng_id2char = {}

# Maps an English character to its index (key: character, value: index)
eng_char2id = {}

for i, j in enumerate(eng_chars):
    eng_id2char[i] = j
    eng_char2id[j] = i

In [12]:
print(eng_id2char)

{0: ' ', 1: '!', 2: '"', 3: '$', 4: "'", 5: ',', 6: '-', 7: '.', 8: '0', 9: '1', 10: '2', 11: '3', 12: '4', 13: '5', 14: '6', 15: '7', 16: '8', 17: '9', 18: ':', 19: '?', 20: 'A', 21: 'B', 22: 'C', 23: 'D', 24: 'E', 25: 'F', 26: 'G', 27: 'H', 28: 'I', 29: 'J', 30: 'K', 31: 'L', 32: 'M', 33: 'N', 34: 'O', 35: 'P', 36: 'Q', 37: 'R', 38: 'S', 39: 'T', 40: 'U', 41: 'V', 42: 'W', 43: 'Y', 44: 'Z', 45: 'a', 46: 'b', 47: 'c', 48: 'd', 49: 'e', 50: 'f', 51: 'g', 52: 'h', 53: 'i', 54: 'j', 55: 'k', 56: 'l', 57: 'm', 58: 'n', 59: 'o', 60: 'p', 61: 'q', 62: 'r', 63: 's', 64: 't', 65: 'u', 66: 'v', 67: 'w', 68: 'x', 69: 'y', 70: 'z'}


In [13]:
# Maps an index to its Spanish character (key: index, value: character)
spa_id2char = {}

# Maps a Spanish character to its index (key: character, value: index)
spa_char2id = {}

for i, j in enumerate(spa_chars):
    spa_id2char[i] = j
    spa_char2id[j] = i

In [14]:
max_len_eng = max([len(line) for line in eng_sent])
max_len_eng

17

In [15]:
max_len_spa = max([len(line) for line in spa_sent])
max_len_spa

42

## Tokenizers:

In [16]:
tokenized_eng_sent = np.zeros(shape = (n_samples,max_len_eng,len(eng_chars)), dtype='float32')
tokenized_spa_sent = np.zeros(shape = (n_samples,max_len_spa,len(spa_chars)), dtype='float32')
target_data = np.zeros((n_samples, max_len_spa, len(spa_chars)),dtype='float32')

In [17]:
for i in range(n_samples):
    for k,ch in enumerate(eng_sent[i]):
        tokenized_eng_sent[i,k,eng_char2id[ch]] = 1
        
    for k,ch in enumerate(spa_sent[i]):
        tokenized_spa_sent[i,k,spa_char2id[ch]] = 1

        # target_data is ahead of tokenized_spa_sent by one timestep and does not include the start character.
        if k > 0:
            target_data[i,k-1,spa_char2id[ch]] = 1

In [18]:
tokenized_eng_sent.shape

(10000, 17, 71)

In [19]:
tokenized_spa_sent.shape

(10000, 42, 86)

In [20]:
target_data.shape

(10000, 42, 86)

## Encoder Model:

In [21]:
encoder_input = Input(shape=(None,len(eng_chars)))
encoder_LSTM = CuDNNLSTM(256,return_state = True)
encoder_outputs, encoder_h, encoder_c = encoder_LSTM (encoder_input)
encoder_states = [encoder_h, encoder_c]

## Decoder model:

In [22]:
decoder_input = Input(shape=(None,len(spa_chars)))
decoder_LSTM = CuDNNLSTM(256,return_sequences=True, return_state = True)
decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
decoder_dense = Dense(len(spa_chars),activation='softmax')
decoder_out = decoder_dense(decoder_out)

In [23]:
model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

In [24]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 71)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 86)     0                                            
__________________________________________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)        [(None, 256), (None, 336896      input_1[0][0]                    
__________________________________________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)        [(None, None, 256),  352256      input_2[0][0]                    
                                                                 cu_dnnlstm_1[0][1]               
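
The parameter counts reported in the summary can be reproduced by hand. The sketch below assumes the usual LSTM layout of four gates, each with an input kernel and a recurrent kernel, and assumes that CuDNNLSTM keeps two bias vectors per gate (8 * units bias terms rather than the 4 * units of a plain LSTM). The helper name `cudnn_lstm_param_count` is hypothetical.

```python
def cudnn_lstm_param_count(input_dim, units):
    """Count the weights of a CuDNNLSTM layer under the assumptions above."""
    kernel = 4 * units * input_dim        # input-to-hidden weights, 4 gates
    recurrent_kernel = 4 * units * units  # hidden-to-hidden weights, 4 gates
    bias = 8 * units                      # two bias vectors per gate
    return kernel + recurrent_kernel + bias

# Encoder: 71 English characters in, 256 units.
encoder_params = cudnn_lstm_param_count(input_dim=71, units=256)
# Decoder: 86 Spanish characters in, 256 units.
decoder_params = cudnn_lstm_param_count(input_dim=86, units=256)
print(encoder_params, decoder_params)
```

Both values agree with the Param # column shown for cu_dnnlstm_1 and cu_dnnlstm_2.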
          

In [25]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

In [26]:
%%time
model.fit(x=[tokenized_eng_sent,tokenized_spa_sent], 
          y=target_data,
          batch_size=64,
          epochs=20,
          validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Wall time: 39.4 s


<keras.callbacks.History at 0x1d6116715f8>

## Inference model for Testing:

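1) Encode the input sentence and retrieve the encoder's final states, which become the initial decoder states.

2) Run one step of the decoder with these states and the start-of-sequence character ('\t') as input. The output is a probability distribution over the next target character; the most probable one is kept.

3) Repeat with the character just predicted and the updated states until the end-of-sequence character ('\n') is produced or the maximum sentence length is exceeded.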
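
The loop described by these steps is a greedy decoder. Below is a minimal sketch of that control flow in isolation, assuming a stub in place of the trained decoder: `toy_decoder_step`, `toy_id2char` and `greedy_decode` are hypothetical names, and the stub only walks through a four-symbol vocabulary in order rather than predicting anything.

```python
import numpy as np

# Hypothetical 4-symbol target vocabulary; '\t' starts and '\n' ends a sentence.
toy_id2char = {0: '\t', 1: 'a', 2: 'b', 3: '\n'}
toy_char2id = {ch: i for i, ch in toy_id2char.items()}

def toy_decoder_step(token_one_hot, state):
    """Stand-in for one decoder prediction: returns (probabilities, new_state).

    A trained decoder would compute the distribution from the token and the
    LSTM states; this stub just emits the next symbol of the vocabulary.
    """
    next_id = min(int(np.argmax(token_one_hot)) + 1, len(toy_id2char) - 1)
    probs = np.zeros(len(toy_id2char))
    probs[next_id] = 1.0
    return probs, state + 1

def greedy_decode(initial_state, max_len=10):
    # Step 2: start from the start-of-sequence character and the initial state.
    token = np.zeros(len(toy_id2char))
    token[toy_char2id['\t']] = 1.0
    state, decoded = initial_state, ''
    while True:
        probs, state = toy_decoder_step(token, state)
        next_id = int(np.argmax(probs))  # keep the most probable character
        decoded += toy_id2char[next_id]
        if toy_id2char[next_id] == '\n' or len(decoded) >= max_len:
            return decoded
        # Step 3: feed the predicted character back in as the next input.
        token = np.zeros(len(toy_id2char))
        token[next_id] = 1.0

result = greedy_decode(initial_state=0)
print(repr(result))
```

Greedy selection with argmax is the simplest choice; it commits to one character per step and never revisits an earlier decision.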

In [27]:
# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states, outputs=[decoder_out] + decoder_states)

In [28]:
def decode_seq(inp_seq):
    
    # Initial states value is coming from the encoder 
    states_val = encoder_model_inf.predict(inp_seq)
    
    target_seq = np.zeros((1, 1, len(spa_chars)))
    target_seq[0, 0, spa_char2id['\t']] = 1
    
    translated_sent = ''
    stop_condition = False
    
    while not stop_condition:   
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
        
        max_val_index = np.argmax(decoder_out[0,-1,:])
        sampled_spa_char = spa_id2char[max_val_index]
        translated_sent += sampled_spa_char
        
        if ((sampled_spa_char == '\n') or (len(translated_sent) > max_len_spa)) :
            stop_condition = True
        
        target_seq = np.zeros((1, 1, len(spa_chars)))
        target_seq[0, 0, max_val_index] = 1
        
        states_val = [decoder_h, decoder_c]
        
    return translated_sent

In [29]:
values_idx = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
for index in values_idx:
    inp_seq = tokenized_eng_sent[index:index+1]
    translated_sent = decode_seq(inp_seq)
    print('--------')
    print('Input sentence:', eng_sent[index])
    print('Decoded sentence:', translated_sent)

--------
Input sentence: Go.
Decoded sentence: Vete.

--------
Input sentence: Fire!
Decoded sentence: ¡Pueda esto!

--------
Input sentence: Wait!
Decoded sentence: ¡Espera!

--------
Input sentence: Relax.
Decoded sentence: Larmate.

--------
Input sentence: Hop in.
Decoded sentence: ¡Estraba en las!

--------
Input sentence: I'm 19.
Decoded sentence: Estoy llena.

--------
Input sentence: No way!
Decoded sentence: ¡No puede ser!

--------
Input sentence: We try.
Decoded sentence: Lo hamos misto.

--------
Input sentence: Beat it.
Decoded sentence: Tramete.

--------
Input sentence: Come on.
Decoded sentence: Van a come.



## Reference:
#### Sentence pairs: http://www.manythings.org/anki/