# Assignment 3: Build a seq2seq model for machine translation.

### Name: Joshua Meharg

### Task: Change LSTM model to Bidirectional LSTM Model and Translate English to Spanish

### Due Date: Wednesday, April 17th, 11:59PM

## 0. You will do the following:

1. Read and run the code. Please make sure you have installed Keras or TensorFlow. Running the script on Colab will speed up training and help avoid package-loading issues.
2. Complete the code in Section 1.1 by filling in your data directory.
3. Directly modify the code in Section 3. Change the current LSTM layer to a Bidirectional LSTM Model.
4. Train your model and translate English to Spanish in Section 4.2. You could also try translating other languages.
5. Complete the code in Section 5.

### Hint:

To implement ```Bi-LSTM```, you will need the following code to build the encoder **in Section 3**. Do NOT use Bi-LSTM for the decoder. Note that **you will need to modify** other parts of the code to make it work.

In [1]:
# from keras.layers import Bidirectional, Concatenate

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## 1. Data preparation (10 points)

1. Download the Spanish-English data from http://www.manythings.org/anki/
2. You may try to use other languages.
3. Unzip the .ZIP file.
4. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".
5. Fill in your data directory in section 1.1.

### 1.1. Load and clean text


In [2]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file read-only as UTF-8 text; the context manager closes it
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()


# split a loaded document into sentence pairs (one tab-separated pair per line)
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
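
To see what the cleaning steps do to a single line, here is a small stand-alone sketch. The sample line is made up for illustration, and `clean_sentence` is a hypothetical helper that mirrors the main steps of `clean_data` for one sentence (it omits the non-printable-character regex).

```python
import string
from unicodedata import normalize


def clean_sentence(sentence):
    # strip accents: decompose characters, then drop the non-ASCII combining marks
    sentence = normalize('NFD', sentence).encode('ascii', 'ignore').decode('utf-8')
    # remove punctuation and lowercase each whitespace-separated token
    table = str.maketrans('', '', string.punctuation)
    words = [w.lower().translate(table) for w in sentence.split()]
    # keep purely alphabetic tokens only
    return ' '.join(w for w in words if w.isalpha())


# made-up line in the tab-separated "English<TAB>Spanish" format
sample_line = "We are here.\tEstamos aquí."
english, spanish = sample_line.split('\t')
cleaned_pair = [clean_sentence(english), clean_sentence(spanish)]
print(cleaned_pair)  # ['we are here', 'estamos aqui']
```

Note that the accent on "aquí" is lost, which matches the unaccented Spanish in the printed sentence pairs.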

#### Fill the following blanks:

In [3]:
# e.g., filename = 'Data/deu.txt'
filename = '/content/Data/spa.txt'

# e.g., n_train = 20000
n_train = 20000

In [4]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [5]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[we are here] => [estamos aqui]
[we ate eggs] => [hemos comido huevos]
[we ate eggs] => [comimos huevos]
[we broke up] => [nos separamos]
[we broke up] => [lo dejamos]
[we broke up] => [rompimos]
[we can help] => [podemos ayudar]
[we can help] => [nosotros podemos ayudar]
[we can meet] => [podemos encontrarnos]
[we can meet] => [podemos vernos]


In [6]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print(input_texts)
print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(numpy.shape(target_texts)))

['go' 'go' 'go' ... 'dont fall please' 'dont feed the dog'
 'dont get it wrong']
Length of input_texts:  (20000,)
Length of target_texts: (20000,)


In [7]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 18
max length of target sentences: 48


**Remark:** At this point, you have two lists of sentences: input_texts and target_texts.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
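
As a rough illustration of character-level tokenization followed by zero-padding, here is a pure-Python sketch. The helpers `build_char_index` and `to_padded_sequences` are hypothetical stand-ins for what the Keras `Tokenizer` and `pad_sequences` do; the exact index assigned to each character by Keras may differ.

```python
def build_char_index(lines):
    # count characters; index 0 is reserved for padding
    counts = {}
    for line in lines:
        for ch in line:
            counts[ch] = counts.get(ch, 0) + 1
    # more frequent characters get smaller indices (ties keep first-seen order)
    ordered = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i + 1 for i, ch in enumerate(ordered)}


def to_padded_sequences(lines, char_index, max_len):
    seqs = []
    for line in lines:
        seq = [char_index[ch] for ch in line][:max_len]
        # pad with zeros at the end ("post" padding)
        seqs.append(seq + [0] * (max_len - len(seq)))
    return seqs


lines = ['go', 'we go']  # made-up sentences
char_index = build_char_index(lines)
padded = to_padded_sequences(lines, char_index, max_len=5)
print(padded)  # [[1, 2, 0, 0, 0], [3, 4, 5, 1, 2]]
```

The result is an $n \times t$ table of integers with $n = 2$ and $t = 5$.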

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length,
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length,
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (20000, 18)
shape of input_token_index: 27
shape of decoder_input_seq: (20000, 48)
shape of target_token_index: 29


In [9]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


**Remark:** At this point, the input-language and target-language texts have been converted to two matrices.

- Both have n_train rows.
- They have max_encoder_seq_length and max_decoder_seq_length columns, respectively.

The following cells print a sentence and its representation as a sequence.

In [10]:
target_texts[100]

'\tentendiste\n'

In [11]:
decoder_input_seq[100, :]

array([ 6,  2,  9,  8,  2,  9, 15, 11,  5,  8,  2,  7,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      dtype=int32)

### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by an $n\times t \times v$ tensor after the one-hot encoding ($v$ is the vocabulary size: the number of unique chars plus one for the padding index 0).
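
The step from an $n \times t$ integer matrix to an $n \times t \times v$ tensor can be sketched with plain NumPy. The toy values below are made up; indexing an identity matrix is one common way to get the same result as `to_categorical`.

```python
import numpy

# n = 2 sentences, t = 3 time steps, integer-encoded and zero-padded
seqs = numpy.array([[1, 2, 0],
                    [3, 1, 2]])
vocab_size = 4  # v, including the padding index 0

# row k of the identity matrix is the one-hot vector for index k
onehot = numpy.eye(vocab_size)[seqs]
print(onehot.shape)  # (2, 3, 4)
```

Note that the padding index 0 is also encoded as a one-hot vector (with the 1 in position 0), not as an all-zero vector.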

In [12]:
from tensorflow.keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq,
                                    max_decoder_seq_length,
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(20000, 18, 28)
(20000, 48, 30)


## 3. Build the networks (for training) (20 points)

- In this section, we have already implemented the LSTM model for you. You can run the code and see what the code is doing.  

- **You need to change the existing LSTM model to a Bidirectional LSTM model. Just modify the network structure and do not change the training cell in section 3.4.**

- Build encoder, decoder, and connect the two modules to get "model".

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.



### 3.1. Encoder network

- Input:  one-hot encoding of the input language

- Return:

    -- output (all the hidden states   $h_1, \cdots , h_t$), which is always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [13]:
# from keras.layers import Bidirectional, Concatenate

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

In [14]:
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Concatenate
from tensorflow.keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

# set the BiLSTM layer
# encoder_lstm = LSTM(latent_dim, return_state=True,
#                     dropout=0.5, name='encoder_lstm')
# _, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

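# concatenated states of size 2 * latent_dim; computed here but not passed
# to the decoder
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
# The outputs are the backward states only (size latent_dim), so they match
# the state size of the decoder LSTM
encoder_model = Model(inputs=encoder_inputs,
                      outputs=[backward_h, backward_c],
                      name='encoder')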
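
The effect of concatenating forward and backward states on the state size can be checked with a small NumPy sketch. The arrays below are stand-ins for the real state tensors, and the batch size of 4 is an arbitrary assumption; Keras `Concatenate()` joins along the last axis by default, which is what `axis=-1` imitates here.

```python
import numpy

batch, latent_dim = 4, 256

# stand-ins for the final hidden states of the two LSTM directions
forward_h = numpy.zeros((batch, latent_dim))
backward_h = numpy.ones((batch, latent_dim))

# joining along the last axis doubles the state size
state_h = numpy.concatenate([forward_h, backward_h], axis=-1)
print(state_h.shape)  # (4, 512)
```

A decoder that is initialised with such concatenated states would therefore need an LSTM with `2 * latent_dim` units, whereas a single direction's states fit a decoder with `latent_dim` units.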

Print a summary and save the encoder network structure to "./encoder.pdf"

In [15]:
from IPython.display import SVG
from tensorflow.keras.utils import model_to_dot, plot_model

SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

Model: "encoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_inputs (InputLayer  [(None, None, 28)]        0         
 )                                                               
                                                                 
 bidirectional (Bidirection  [(None, 512),             583680    
 al)                          (None, 256),                       
                              (None, 256),                       
                              (None, 256),                       
                              (None, 256)]                       
                                                                 
Total params: 583680 (2.23 MB)
Trainable params: 583680 (2.23 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


### 3.2. Decoder network

- Inputs:  

    -- one-hot encoding of the target language
    
    -- the initial hidden state (the encoder's final hidden state $h_t$)
    
    -- the initial conveyor belt (the encoder's final conveyor belt $c_t$)

- Return:

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [16]:
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim, return_sequences=True,
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

Print a summary and save the decoder network structure to "./decoder.pdf"

In [17]:
from IPython.display import SVG
from tensorflow.keras.utils import model_to_dot, plot_model

SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='decoder.pdf'
)

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_input_x (InputLaye  [(None, None, 30)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_h (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_c (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                         

### 3.3. Connect the encoder and decoder

In [18]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [19]:
from IPython.display import SVG
from tensorflow.keras.utils import model_to_dot, plot_model

SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='model_training.pdf'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_input_x (InputLaye  [(None, None, 28)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_x (InputLaye  [(None, None, 30)]           0         []                            
 r)                                                                                               
                                                                                                  
 encoder (Functional)        [(None, 256),                583680    ['encoder_input_x[0][0]']     
                              (None, 256)]                                           

### 3.4. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encoding of the input language

- decoder_input_data: one-hot encoding of the target language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stops decreasing.
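
The "left shift" that turns decoder inputs into labels can be illustrated on one toy sequence. The integer values below are made up (6 plays the role of the start sign, 7 the stop sign, 0 the padding); the slicing is the same as in the one-hot encoding cell.

```python
import numpy

# one padded target sequence: [start], two chars, [stop], padding
decoder_input_seq = numpy.array([[6, 2, 9, 7, 0, 0]])

# the label at step i is the decoder input at step i + 1
decoder_target_seq = numpy.zeros(decoder_input_seq.shape, dtype=int)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
print(decoder_target_seq)  # [[2 9 7 0 0 0]]
```

At each time step the decoder sees the current character and is trained to predict the next one.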

In [20]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(20000, 18, 28)
shape of decoder_input_data(20000, 48, 30)
shape of decoder_target_data(20000, 48, 30)


In [21]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=50, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/50
...
Epoch 50/50


## 4. Make predictions

- In this section, you need to complete section 4.2 to translate English to the target language.


### 4.1. Translate English to XXX

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop when the [stop] sign "\n" is reached).
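
Step 4 can be done greedily (take the most likely char) or by sampling. Below is a minimal sketch of multinomial sampling with a temperature, assuming the decoder output for one step is a probability vector. The function name `sample_with_temperature` and the toy probabilities are illustrative and not part of the assignment code.

```python
import numpy


def sample_with_temperature(probs, temperature=1.0, rng=None):
    if rng is None:
        rng = numpy.random.default_rng()
    # rescale log-probabilities: a low temperature sharpens the distribution,
    # a high temperature flattens it
    logits = numpy.log(numpy.asarray(probs, dtype=float) + 1e-12) / temperature
    logits -= logits.max()  # for numerical stability
    scaled = numpy.exp(logits)
    scaled /= scaled.sum()
    # draw one index according to the rescaled probabilities
    return int(rng.choice(len(scaled), p=scaled))


probs = [0.1, 0.6, 0.3]  # made-up distribution over three chars
rng = numpy.random.default_rng(0)
index = sample_with_temperature(probs, temperature=0.5, rng=rng)
```

As the temperature approaches zero the behaviour approaches greedy selection with `argmax`; a temperature of 1.0 samples from the predicted distribution unchanged.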

In [22]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [23]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])

        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence


In [24]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('Spanish (true): ', target_texts[seq_index][1:-1])
    print('Spanish (pred): ', decoded_sentence[0:-1])


-
English:        come see me
Spanish (true):  venid a verme
Spanish (pred):  ven a aqui
-
English:        come see me
Spanish (true):  vengan a verme
Spanish (pred):  ven a aqui
-
English:        come see me
Spanish (true):  venga a verme
Spanish (pred):  ven a aqui
-
English:        comfort tom
Spanish (true):  consuela a tom
Spanish (pred):  lo hace tom
-
English:        contact tom
Spanish (true):  ponte en contacto con tom
Spanish (pred):  pondes a tom
-
English:        cook for me
Spanish (true):  cociname
Spanish (pred):  venga a un mi
-
English:        cook for me
Spanish (true):  cocina para mi
Spanish (pred):  venga a un mi
-
English:        count me in
Spanish (true):  cuenta conmigo
Spanish (pred):  no pereis
-
English:        count on it
Spanish (true):  cuenta con eso
Spanish (pred):  pores lo pierta
-
English:        count on it
Spanish (true):  cuente con eso
Spanish (pred):  pores lo pierta
-
English:        count on me
Spanish (true):  cuenta conmigo
Spanish (pred):  

### 4.2. Translate an English sentence to the target language (20 points)

1. Tokenization
2. One-hot encode
3. Translate

In [25]:
input_sentence = 'I love you'

input_texts = [input_sentence]

print(input_texts)

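# NOTE: text2sequences fits a new tokenizer on this single sentence, so the
# character indices and num_encoder_tokens differ from those used in training
input_sequence, input_token_index = text2sequences(len(input_sentence),
                                                   input_texts)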

print(str(input_sequence.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))

num_encoder_tokens = len(input_token_index) + 1

input_x = onehot_encode(input_sequence, len(input_sentence), num_encoder_tokens)

translated_sentence = decode_sequence(input_x)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)

['I love you']
(1, 10)
shape of input_token_index: 8
source sentence is: I love you
translated sentence is: esta
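
A trained encoder expects characters to be encoded with the same indices it saw during training. The sketch below shows one way to encode a new sentence through an existing character index instead of fitting a new tokenizer. `encode_with_existing_index` and `toy_index` are hypothetical; in the notebook, the index would be the `input_token_index` built from the training data, and the result would then be one-hot encoded with the training-time `num_encoder_tokens`.

```python
def encode_with_existing_index(sentence, char_index, max_len):
    # map characters through an index built earlier (e.g. on the training data);
    # characters that are not in the index are skipped
    seq = [char_index[ch] for ch in sentence.lower() if ch in char_index][:max_len]
    # zero-pad at the end up to max_len
    return seq + [0] * (max_len - len(seq))


# hypothetical index, standing in for the training-time input_token_index
toy_index = {' ': 1, 'e': 2, 'o': 3, 'i': 4, 'l': 5, 'v': 6, 'y': 7, 'u': 8}
encoded = encode_with_existing_index('I love you', toy_index, max_len=12)
print(encoded)  # [4, 1, 5, 3, 6, 2, 1, 7, 3, 8, 0, 0]
```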



## 5. Evaluate the translation using BLEU score

- We have already translated from English to the target language, but how can we evaluate the performance of our model quantitatively?

- In this section, you need to re-train the model we built in Section 3 and then evaluate the BLEU score on the test dataset.

Reference:

https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

https://en.wikipedia.org/wiki/BLEU

#### Hint:

- Randomly partition the dataset into training, validation, and test sets.

- Evaluate the BLEU score using the test set. Report the average.

- You may use packages to calculate the BLEU score, e.g., sentence_bleu() from the nltk package.
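
To make the metric concrete, here is a simplified, pure-Python BLEU sketch: the geometric mean of clipped n-gram precisions (up to bigrams here) times a brevity penalty, for a single reference. `simple_bleu` is a hypothetical helper for illustration and is not the nltk implementation; nltk's `sentence_bleu` uses up to 4-grams by default and supports smoothing, so its numbers will generally differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # multiset of all n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: penalise candidates shorter than the reference
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = 'podemos ayudar'.split()
candidate = 'nosotros podemos ayudar'.split()
score = simple_bleu(reference, candidate)
print(round(score, 4))  # 0.5774
```

Here the unigram precision is 2/3 and the bigram precision is 1/2, so the score is $\sqrt{1/3} \approx 0.577$. Averaging such per-sentence scores over the test set gives the number to report.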

### 5.1. Partition the dataset into training, validation, and test sets. Build a new token index. (10 points)

1. You may try to load more data/lines from text file.
2. Convert text to sequences and build token index using training data.
3. One-hot encode your training and validation text sequences.
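
One simple way to partition the data is to shuffle the row indices once and slice them. The sketch below uses only the standard library; the 80/10/10 proportions, the seed, and the helper name `split_indices` are arbitrary assumptions.

```python
import random


def split_indices(n, val_fraction=0.1, test_fraction=0.1, seed=42):
    # shuffle all row indices reproducibly, then cut into three parts
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50000)
print(len(train_idx), len(val_idx), len(test_idx))  # 40000 5000 5000
```

Keeping the index lists around makes it easy to look up the original text of any test sentence later, for example as a BLEU reference.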

In [26]:
# e.g., filename = 'Data/deu.txt'
filename = '/content/Data/spa.txt'

# Get 50,000 sentences this time
n_train = 50000

In [27]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [28]:
from sklearn.model_selection import train_test_split

input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

# X_train, X_test, y_train, y_test = train_test_split(input_texts, target_texts, test_size=0.2, train_size=0.8, random_state=42)


max length of input  sentences: 24
max length of target sentences: 68


In [29]:
encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length,
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length,
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (50000, 24)
shape of input_token_index: 27
shape of decoder_input_seq: (50000, 68)
shape of target_token_index: 29


In [30]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


In [31]:
encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq,
                                    max_decoder_seq_length,
                                    num_decoder_tokens)


In [32]:
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Concatenate
from tensorflow.keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

# set the BiLSTM layer
# encoder_lstm = LSTM(latent_dim, return_state=True,
#                     dropout=0.5, name='encoder_lstm')
# _, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

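# concatenated states of size 2 * latent_dim; computed here but not passed
# to the decoder
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
# The outputs are the backward states only (size latent_dim), so they match
# the state size of the decoder LSTM
encoder_model = Model(inputs=encoder_inputs,
                      outputs=[backward_h, backward_c],
                      name='encoder')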

In [33]:
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim, return_sequences=True,
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

In [34]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [35]:
# split all three arrays in one call so that their rows stay aligned
(en_train, en_test,
 dec_train, dec_test,
 decoder_target_train, decoder_target_test) = train_test_split(
    encoder_input_data, decoder_input_data, decoder_target_data,
    test_size=0.2, random_state=42)


### 5.2 Retrain your previous Bidirectional LSTM model with training and validation data and tune the parameters (learning rate, optimizer, etc) based on validation score. (25 points)

1. Use the model structure in Section 3 to train a new model with the new training and validation datasets.
2. Tune the parameters based on the validation BLEU score or loss.
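
When tuning on validation loss, a common rule is to stop once the loss has not improved for a fixed number of epochs (the "patience"). The sketch below shows only that logic in plain Python, with made-up loss values and a hypothetical helper name; in Keras the same idea is available through the `EarlyStopping` callback in `tensorflow.keras.callbacks`.

```python
def best_epoch_with_patience(val_losses, patience=3):
    # returns (epoch at which training stops, epoch with the best loss), 0-based
    best_loss, best_epoch = float('inf'), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # no improvement for `patience` epochs in a row
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]  # made-up values
stopped_at, best_epoch = best_epoch_with_patience(val_losses, patience=3)
print(stopped_at, best_epoch)  # 5 2
```

In this toy run, training stops at epoch 5 and never reaches the lower loss at epoch 6, which shows the trade-off involved in choosing the patience.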

In [36]:

from tensorflow.keras.optimizers import RMSprop

# Learning rate for RMSprop; raised from the Keras default of 0.001
# because the model was plateauing at certain loss values
learning_rate = 0.01

# Create the optimizer instance with the chosen learning rate
optimizer = RMSprop(learning_rate=learning_rate)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

model.fit([en_train, dec_train],  # training data
          decoder_target_train,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=50, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/50
...
Epoch 50/50


### 5.3 Evaluate the BLEU score using the test set. (15 points)

1. Use the model trained above to calculate the BLEU score on the test dataset.
2. A reasonable value should be 0.1-0.3. The higher, the better.

In [37]:
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [38]:
from tqdm import tqdm
import time
import random


def make_target_and_generated(generateds, num_samples):
    # Randomly sample `num_samples` distinct row indexes of `generateds`
    # (one-hot encoded source sentences) and translate each row with the model
    sampled_indexes = random.sample(range(len(generateds)), num_samples)

    generateds_output = []
    for index in tqdm(sampled_indexes):
        decoded = decode_sequence(generateds[index:index + 1])
        # trim the last character (the stop sign '\n' unless the length limit was hit)
        generateds_output.append(decoded[0:-1])

    return sampled_indexes, generateds_output
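
To score the generated sentences, each sampled index also needs its reference sentence. Since the decoder-side test array is split with the same permutation as the encoder-side one, one option is to convert the matching one-hot row back to text. The sketch below assumes this setup; `onehot_to_text` and the toy index are hypothetical, standing in for `reverse_target_char_index` and a row of the decoder test data.

```python
import numpy


def onehot_to_text(onehot_row, reverse_char_index):
    # argmax over the vocabulary axis recovers the integer sequence
    indices = numpy.argmax(onehot_row, axis=-1)
    # index 0 is padding and has no character
    chars = [reverse_char_index[i] for i in indices if i != 0]
    # drop the start sign '\t' and the stop sign '\n'
    return ''.join(chars).strip('\t\n')


toy_reverse_index = {1: '\t', 2: '\n', 3: 'h', 4: 'o', 5: 'l', 6: 'a'}
seq = numpy.array([1, 3, 4, 5, 6, 2, 0, 0])  # "\thola\n" plus padding
row = numpy.eye(7)[seq]                      # shape (8, 7) one-hot row
text = onehot_to_text(row, toy_reverse_index)
print(text)  # hola
```

Splitting both the reference and the generated sentence on spaces then gives the token lists that a BLEU function expects.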


In [39]:
print(en_test.shape)
# With the function above, I randomly select 100 test sentences and translate them with my model to compute its BLEU score
sampled_indexes, gens = make_target_and_generated(en_test, 100)




(10000, 24, 28)


100%|██████████| 100/100 [02:50<00:00,  1.70s/it]


In [40]:
print(gens)

['no quiero verte', 'el casa esta en el cario', 'eres muy inteligente', 'lo lo me encontre', 'no puedo contar con tom', 'no le gusta esa comida', 'eres muy inteligente', 'cual es la casa de tom', 'tom no puede estar en el coche', 'dejame venir ahora', 'dejame entrar este', 'tom esta miento', 'no tienes pada', 'me pendi en la comida', 'donde puedo contar la casa', 'tom se dio un poco de la cara', 'debes detar a tom', 'no estoy confuntado', 'podemos pararte a tom', 'eso es mio', 'no soy carie', 'mi padre esta aqui', 'puedo contar un poco de casa', 'deja el casa', 'todos estan de casa', 'eso es un carien', 'cuando esta este tiempo', 'salte mi coche', 'por que estamos enfermos', 'estas enfermo', 'el es un buen peligro', 'no necesitamos amudar', 'necesitas algo de mismo', 'eso no es tanido', 'estas estudiando', 'no tenemos pueno a tom', 'era un poco de comer', 'era un buen pesigio', 'era un auen comiliano', 'estas consado', 'no me gusta el cocie', 'segamos en la cama', 'voy a todos los dias

Because I partitioned the data after preprocessing, the test targets (`decoder_target_test`) exist only as one-hot arrays. The following functions transform them back into readable text.

In [41]:
import numpy as np

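def reverse_onehot(encoded_sequences):
    # Collapse the one-hot axis: each time step becomes the index of its token.
    # All-zero (padding) rows map to index 0.
    decoded_sequences = np.argmax(encoded_sequences, axis=-1)
    return decoded_sequences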

In [42]:
decoder_test = reverse_onehot(decoder_target_test)

print(decoder_test)

[[ 7 11  2 ...  0  0  0]
 [ 2 10  2 ...  0  0  0]
 [19  2  7 ...  0  0  0]
 ...
 [ 7  4 14 ...  0  0  0]
 [ 7  4 14 ...  0  0  0]
 [ 2  5  1 ...  0  0  0]]


In [43]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

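def reverse_text2sequences(seqs_pad, word_index):
    # Invert the char -> index mapping, then join the characters of each sequence.
    # Indexes that are not in the mapping decode to '', which drops the padding.
    reverse_word_index = {index: char for char, index in word_index.items()}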
    decoded_lines = []
    for seq in seqs_pad:
        decoded_line = ''.join(reverse_word_index.get(index, '') for index in seq)
        decoded_lines.append(decoded_line)

    return decoded_lines


In [44]:
decoder_test = reverse_text2sequences(decoder_test, target_token_index)
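
To make the round trip concrete, here is a small sketch of the same idea on toy data: one-hot rows are collapsed with `argmax`, indexes are mapped back to characters, and the final slice drops the end symbol. The vocabulary `toy_token_index`, the helper `toy_onehot`, and the use of `'\n'` as the end symbol are assumptions made for illustration; the notebook's real mapping is `target_token_index`.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is unused so it can act as padding.
toy_token_index = {'h': 1, 'o': 2, 'l': 3, 'a': 4, '\n': 5}
max_len = 8


def toy_onehot(text, token_index, max_len):
    # One row per time step; rows past the end of the text stay all-zero.
    encoded = np.zeros((max_len, len(token_index) + 1), dtype='float32')
    for position, char in enumerate(text):
        encoded[position, token_index[char]] = 1.0
    return encoded


batch = np.stack([toy_onehot('hola\n', toy_token_index, max_len)])

# Step 1: one-hot -> integer indexes (all-zero padding rows become 0).
indexes = np.argmax(batch, axis=-1)

# Step 2: indexes -> characters; unknown indexes such as 0 decode to ''.
reverse_index = {i: ch for ch, i in toy_token_index.items()}
decoded = [''.join(reverse_index.get(i, '') for i in row) for row in indexes]

# Step 3: drop the trailing end symbol.
trimmed = [line[:-1] for line in decoded]

print(indexes)
print(trimmed)
```

Note that `argmax` cannot distinguish a padding row from a token whose index really is 0, so this only works when index 0 is reserved for padding, as assumed here.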

In [45]:
# Drop the last character (the end symbol) of each decoded reference sentence
decoder_test_output = [e[0:-1] for e in decoder_test]

print(decoder_test_output)
len(decoder_test_output)

['tienen una bicicleta', 'eres casada', 'vete ya', 'puedo ir a trabajar', 'se de que se trata', 'cual fue el acuerdo', 'envuelvalo por favor', 'no le oi venir', 'hable frances', 'tom es un estudiante de arte', 'no me dejas eleccion', 'se rehuso a dar comentarios', 'se enfado', 'lo envidio', 'esto esta muy bueno', 'estaba en la ducha', 'tom estaba confundido', 'alcance el ultimo bus', 'te gustan los bichos', 'por que corriste', 'nadie sonrio', 'es usted olvidadiza', 'vivire', 'tom nunca hace trampa', 'perdi el colectivo', 'nuestro tiempo es limitado', 'tom nos condujo a una trampa', 'no estaba listo para ir', 'la premonicion de tom era correcta', 'te falta imaginacion', 'el tren iba lleno', 'puedo usar esta bicicleta', 'yo hice que el se fuera', 'se ve divertido', 'no cambies de idea', 'quisiera comermelo', 'hoy tengo muchas cosas que hacer', 'tom quiere ser abrazado', 'naciste en un establo', 'ella es una belleza', 'hay un patron aqui', 'ella hizo de mi una estrella', 'estas emocionado

10000

In [46]:

print(gens)
print("gens length: ", len(gens))
print("decoder_test_output: ", len(decoder_test_output))

['no quiero verte', 'el casa esta en el cario', 'eres muy inteligente', 'lo lo me encontre', 'no puedo contar con tom', 'no le gusta esa comida', 'eres muy inteligente', 'cual es la casa de tom', 'tom no puede estar en el coche', 'dejame venir ahora', 'dejame entrar este', 'tom esta miento', 'no tienes pada', 'me pendi en la comida', 'donde puedo contar la casa', 'tom se dio un poco de la cara', 'debes detar a tom', 'no estoy confuntado', 'podemos pararte a tom', 'eso es mio', 'no soy carie', 'mi padre esta aqui', 'puedo contar un poco de casa', 'deja el casa', 'todos estan de casa', 'eso es un carien', 'cuando esta este tiempo', 'salte mi coche', 'por que estamos enfermos', 'estas enfermo', 'el es un buen peligro', 'no necesitamos amudar', 'necesitas algo de mismo', 'eso no es tanido', 'estas estudiando', 'no tenemos pueno a tom', 'era un poco de comer', 'era un buen pesigio', 'era un auen comiliano', 'estas consado', 'no me gusta el cocie', 'segamos en la cama', 'voy a todos los dias

In [47]:
# Collect the reference translations at the same sampled indexes as the generated sentences
final_decoder_test_out = [decoder_test_output[e] for e in sampled_indexes]

print(sampled_indexes)
print(decoder_test_output[5643])


[2132, 5907, 6955, 531, 6399, 9391, 8065, 9232, 6980, 8073, 8417, 2747, 4278, 5083, 9144, 8172, 8848, 5190, 579, 9943, 77, 1531, 3881, 159, 573, 9738, 2960, 4607, 8810, 6832, 4221, 6923, 213, 6758, 7084, 3592, 5910, 7317, 794, 9240, 4438, 5847, 6930, 8998, 1257, 1, 3301, 7255, 9950, 4865, 1066, 5529, 7082, 9154, 7719, 5591, 3413, 7680, 4501, 1015, 8807, 500, 2474, 1307, 7702, 8808, 6653, 6114, 5676, 1029, 2977, 1861, 6935, 354, 1386, 5856, 4265, 7831, 1216, 1559, 7298, 8594, 273, 252, 3285, 2197, 5168, 698, 2354, 6009, 1487, 2024, 4392, 7745, 6410, 6063, 4519, 4117, 8275, 3067]
tom se esta quejando


In [48]:
from nltk.translate.bleu_score import corpus_bleu

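# Note: references and hypotheses are plain strings here, so NLTK iterates over
# their characters and the result is a character-level BLEU (up to 4-grams).
bleu_score = corpus_bleu([[reference] for reference in final_decoder_test_out], gens)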

print("BLEU Score:", bleu_score)

BLEU Score: 0.3656926612945475
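
`corpus_bleu` treats each reference and hypothesis as a sequence of tokens. When it is given raw strings, as in the cell above, the tokens are individual characters, so the score is a character-level BLEU. The sketch below compares that with a word-level score obtained by splitting on whitespace first. The sentence pairs are made-up toy data, not results from the model, and the variable names are illustrative.

```python
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical toy data: one imperfect translation and one exact match.
references = ['tom no puede estar en el coche', 'cual es la casa de tom']
hypotheses = ['tom no puede estar en la casa', 'cual es la casa de tom']

# Character-level: strings are iterated character by character.
char_bleu = corpus_bleu([[ref] for ref in references], hypotheses)

# Word-level: split on whitespace so that each token is a word.
word_bleu = corpus_bleu(
    [[ref.split()] for ref in references],
    [hyp.split() for hyp in hypotheses],
)

print("character-level BLEU:", char_bleu)
print("word-level BLEU:", word_bleu)
```

The two granularities generally give different numbers, so it is worth stating which one is reported. For very short sentences, word-level 4-gram counts can be zero; NLTK's `SmoothingFunction` can be passed through the `smoothing_function` argument in that case.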
