# Machine Translation (Deep Learning, LSTM)

Machine Translation is a sequence-to-sequence task, where the input is one sequence and the output is another sequence. Sequence-to-sequence tasks have seen a great improvement with the advent of Deep Learning. Various network models have been proposed, and new ones are still being proposed for further improvements. In this tutorial, we rely on recurrent network models (more precisely, LSTMs). A translation model of this kind combines two networks, an encoder and a decoder, whose connection can be sketched as follows:

<img src="images/rnn-many-to-many-machine-translation.png" alt="Drawing" style="width: 60%;"/>

The left LSTM network, the encoder, reads the input sequence. The hidden state of the encoder's LSTM cell is then copied to the decoder (the right-hand network), which uses this state to generate the new sequence.

**Task:** In this tutorial we translate keyword-based queries into the corresponding questions in proper English. Each query is about a restaurant or a hotel, and we assume that each restaurant or hotel name has already been extracted using NER and replaced with `<POI>` (point of interest). This is a common step for two reasons:

* Named entities often contain words that are not in the dictionary. Excluding them makes the vocabulary significantly smaller, which makes the network easier to train.
* Named entities (here, the names of restaurants and hotels) are usually not translated anyway.
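
The placeholder replacement described above can be sketched as follows. This is only an illustration, not the preprocessing that produced the dataset: the function `replace_poi` and the list of known names are hypothetical, and a real pipeline would take the entity spans from an NER model rather than from a fixed list.

```python
import re

def replace_poi(text, poi_names, placeholder="<POI>"):
    """Replace every known point-of-interest name in `text` with a placeholder."""
    # Longest names first, so "Grand Hotel Plaza" wins over "Grand Hotel"
    for name in sorted(poi_names, key=len, reverse=True):
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        # A callable replacement keeps the placeholder text literal
        text = pattern.sub(lambda match: placeholder, text)
    return text

known_names = ["Grand Hotel", "Grand Hotel Plaza"]  # hypothetical names
print(replace_poi("parking grand hotel plaza", known_names))
```

After this step, the vocabulary only has to contain the single token `<POI>` instead of every restaurant and hotel name.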

## Import required packages

In [None]:
import numpy as np
from numpy.random import shuffle

from pickle import dump, load

from keras.models import Sequential, Model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, TimeDistributed, Activation, Embedding, Dense, Bidirectional, RepeatVector, Flatten
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint

from timeit import default_timer as timer
from utils.timeutil import convert_seconds_to_time

## Data preparation

### Read dataset file

If you look into the text file `data/seq2seq-dataset/closeup-dataset-q2q-1m.txt` you can see that each line contains a pair of strings separated by a tab character `\t`. The first string represents a keyword-based query, while the second string is the corresponding question in proper English.


In [None]:
pairs = []
with open('data/seq2seq-dataset/closeup-dataset-q2q-1m.txt') as f:
    for i, line in enumerate(f):
        query, question = line.strip().split('\t')
        pairs.append([query, question])
        
# Convert list to numpy array for convenience
pairs = np.array(pairs)
        
for i in range(10):
    #print(pairs[i])
    print('[%s] => [%s]' % (pairs[i,0], pairs[i,1]))        
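
The loop above assumes that every line contains exactly one tab. If that is not guaranteed, a more defensive variant could look like the following sketch. The helper `parse_pairs` and the sample lines are made up for illustration and work on an in-memory list instead of the dataset file.

```python
def parse_pairs(lines):
    """Split tab-separated lines into [query, question] pairs, skipping bad lines."""
    pairs, skipped = [], 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Keep only lines with exactly two non-empty fields
        if len(parts) != 2 or not all(part.strip() for part in parts):
            skipped += 1
            continue
        pairs.append([parts[0].strip(), parts[1].strip()])
    return pairs, skipped

sample_lines = [  # made-up examples
    "<POI> opening hours\tWhat are the opening hours of <POI>?\n",
    "\n",
    "no tab here\n",
]
parsed, n_skipped = parse_pairs(sample_lines)
print(parsed)
print("skipped lines:", n_skipped)
```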

### Prepare training and test data

We first limit the number of query-question pairs to 10,000. While this is not enough to get proper results, training is reasonably fast. Only once everything works should one increase the size of the dataset for training.

We take 99% of all query-question pairs for training since we only need some test data for manually inspecting the results. Reliably quantifying the accuracy of machine translation is still an open challenge; some commonly used measures do exist, though.

In [None]:
# reduce dataset size
n_sentences = 10000
dataset = pairs[:n_sentences, :]

# random shuffle
shuffle(dataset)

# split into train/test
train_ratio = 0.99
train_size = int(train_ratio * len(dataset))
train, test = dataset[:train_size], dataset[train_size:]

print('Size of training data: {}'.format(len(train)))
print('Size of test data: {}'.format(len(test)))


### Data encoding

As usual, we need to properly encode the data to be used as input and output for the deep learning network.

This little auxiliary method returns the length of the longest string (in terms of the number of words/tokens) in a list of strings.

In [None]:
def max_length(lines):
    return max(len(line.split()) for line in lines)
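
As a quick check of what the method counts, here is a small usage sketch on made-up queries. The function is repeated so that the snippet runs on its own; note that `<POI>` counts as a single token because the split is on whitespace only.

```python
def max_length(lines):
    return max(len(line.split()) for line in lines)

toy_queries = ["<POI> price", "<POI> free wifi available", "parking <POI>"]  # made-up examples
print(max_length(toy_queries))  # 4
```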

We again use the `Tokenizer` class provided by Keras to generate the vocabulary and the word-to-index mapping. Since we do machine translation, we need to do this for both the input sequences (queries) and the output sequences (questions). Strictly speaking, since input and output sequences are both in English, one tokenizer would suffice in this case. However, we pretend that the sequences are in different "languages" to keep the code more flexible.

In [None]:
query_tokenizer = Tokenizer(filters='', lower=False)
query_tokenizer.fit_on_texts(dataset[:, 0])
query_vocab_size = len(query_tokenizer.word_index) + 1
query_length = max_length(dataset[:, 0])
print('Query vocabulary Size: %d' % query_vocab_size)
print('Query max length: %d' % (query_length))

In [None]:
question_tokenizer = Tokenizer(filters='', lower=False)
question_tokenizer.fit_on_texts(dataset[:, 1])
question_vocab_size = len(question_tokenizer.word_index) + 1
question_length = max_length(dataset[:, 1])
print('Question vocabulary Size: %d' % question_vocab_size)
print('Question max length: %d' % (question_length))

The following two auxiliary methods do the actual encoding of the data to enable the training:

* `encode_sequences()`: the method first converts the sequences from strings into lists of word indexes and then pads each list with zeros so that all lists have the same length

* `encode_output()`: only needed for the output sequences (i.e., the questions); it encodes each list of word indexes into a list of one-hot vectors, each of the size of the vocabulary.

In [None]:
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = np.array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
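
To see the three encoding steps (integer encoding, zero padding, one-hot encoding) without any Keras dependency, the following NumPy-only sketch mimics them on two made-up sentences. It is a simplified stand-in, not the Keras implementation: the helper names are hypothetical, and indexes are assigned in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency.

```python
import numpy as np

def build_word_index(lines):
    """Map each whitespace-separated token to an integer >= 1 (0 is reserved for padding)."""
    word_index = {}
    for line in lines:
        for word in line.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode_and_pad(word_index, length, lines):
    """Integer-encode each line and right-pad it with zeros up to `length`."""
    X = np.zeros((len(lines), length), dtype=np.int64)
    for row, line in enumerate(lines):
        ids = [word_index[word] for word in line.split()][:length]
        X[row, :len(ids)] = ids
    return X

def one_hot(sequences, vocab_size):
    """Turn an (items, length) integer array into (items, length, vocab_size)."""
    return np.eye(vocab_size, dtype=np.float32)[sequences]

toy = ["is <POI> open", "is <POI> cheap today"]  # made-up examples
index = build_word_index(toy)
vocab_size = len(index) + 1  # +1 for the padding index 0
X = encode_and_pad(index, 4, toy)
y = one_hot(X, vocab_size)
print(X)
print(y.shape)
```

The last axis of `y` has one entry per vocabulary word, which is why the one-hot encoded output grows so quickly with the vocabulary size.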

The following lines just illustrate what the encoded input and the encoded output look like for the first query-question pair in the training data.

In [None]:
query_sample_encoded = encode_sequences(query_tokenizer, query_length, train[0:1, 0])[0]

# Column 1 holds the questions (column 0 holds the queries)
question_sample_encoded = encode_sequences(question_tokenizer, question_length, train[0:1, 1])
question_sample_encoded = encode_output(question_sample_encoded, question_vocab_size)[0]

print(query_sample_encoded.shape)
print(query_sample_encoded)
print()
print(question_sample_encoded.shape)
print(question_sample_encoded)

## Defining the model

* The `Embedding` layer vectorizes the input sequences; each input sequence is a list of word indexes

* The first `LSTM` layer represents the encoder

* `RepeatVector` repeats the final output vector of the encoder once per output time step; this is how the encoder's result is passed on to the decoder

* The second `LSTM` layer represents the decoder; since the output is a sequence and not just one word, we need to return all vectors of the hidden layer (i.e., the vector of the hidden layer at each time step)

* The last layer is fully connected (`Dense`); notice that it is wrapped in a `TimeDistributed` layer, again since we need multiple outputs.

In [None]:
n_units = 256

model = Sequential()
model.add(Embedding(query_vocab_size, n_units, input_length=query_length, mask_zero=True))
model.add(LSTM(n_units))
model.add(RepeatVector(question_length))
model.add(LSTM(n_units, return_sequences=True))
model.add(TimeDistributed(Dense(question_vocab_size, activation='softmax')))

model.compile(optimizer='adam', loss='categorical_crossentropy')

# summarize defined model
model.summary()
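
The shape change between encoder and decoder can be illustrated with plain NumPy. The sketch below only mimics the repeat step on a toy array, under the assumption that the encoder yields one vector per input sequence; the sizes are toy values and not the ones used in the model above.

```python
import numpy as np

batch_size, n_units, question_length = 2, 3, 4  # toy sizes, not the tutorial's values

# Stand-in for the encoder output: one vector per input sequence
encoder_output = np.arange(batch_size * n_units, dtype=np.float32).reshape(batch_size, n_units)

# Repeat each vector once per output time step:
# (batch, n_units) -> (batch, question_length, n_units)
decoder_input = np.repeat(encoder_output[:, np.newaxis, :], question_length, axis=1)

print(encoder_output.shape)
print(decoder_input.shape)
```

Every time step of the decoder therefore sees the same encoder vector as its input.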

### Training the network

In principle, we can simply call `model.fit()` to train the network; see previous tutorials. But here we have a little problem: encoding all input and output sequences at once requires too much memory and results in an error. Remember that the output sequences have a shape of (#items, #max_length, #vocabulary_size). For example, 100,000 data points with a maximum sequence length of 15 and a vocabulary size of 1,000 words require storing 1,500,000,000 numbers in main memory. Even with internal optimizations, this is too much for a commodity PC.
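
The arithmetic behind this estimate is easy to reproduce. The sketch below assumes 4 bytes per number (32-bit floats); with a different data type the figure changes proportionally.

```python
n_items, max_len, vocab_size = 100_000, 15, 1_000

n_values = n_items * max_len * vocab_size
bytes_per_value = 4  # assumption: 32-bit floats
gib = n_values * bytes_per_value / 1024**3

print(f"{n_values:,} values, about {gib:.1f} GiB")
```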

To address this, we have to train the network in blocks, e.g., only 1,000 query-question pairs at a time. This means that we have to implement the notion of epochs manually.

Let's first define an auxiliary method that encodes the input and output sequences for the current block of data.

In [None]:
def prepare_data(macro_batch, num_samples):
    start = macro_batch * num_samples
    end = (macro_batch+1) * num_samples
    # prepare training data
    X_train = encode_sequences(query_tokenizer, query_length, train[start:end, 0])
    y_train = encode_sequences(question_tokenizer, question_length, train[start:end, 1])
    y_train = encode_output(y_train, question_vocab_size)
    return X_train, y_train
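
To check which rows each block covers, the index arithmetic can be tried in isolation. The helper `block_ranges` is hypothetical and not part of the tutorial code; 9,900 rows corresponds to 99% of 10,000 pairs. The helper clips the last block explicitly, whereas `prepare_data()` relies on NumPy slicing, which stops at the end of the array.

```python
import math

def block_ranges(n_rows, block_size):
    """Return the (start, end) row range of every block, clipping the last one."""
    n_blocks = math.ceil(n_rows / block_size)
    return [(b * block_size, min((b + 1) * block_size, n_rows)) for b in range(n_blocks)]

ranges = block_ranges(9900, 1000)
print(len(ranges))
print(ranges[0], ranges[-1])
```

The last block is smaller than the others whenever the number of rows is not a multiple of the block size.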

#### Training

Training now consists of two nested loops. The outer loop handles the number of epochs (e.g., 10). The inner loop generates the next block of training data (e.g., 1,000 query-question pairs) and trains the network with that block. The other commands just measure the execution time and print some informative output.

Note that in `model.fit()` we set `epochs=1` since we handle the number of epochs now in the outer loop ourselves.

In [None]:
NUM_SAMPLES = 1000
NUM_EPOCHS = 5

num_macro_batches = int(np.ceil(len(train) / NUM_SAMPLES))

filename = 'data/trained-models/q2q.keras'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=0, save_best_only=True, mode='min')


start = timer()
for epoch in range(NUM_EPOCHS):
    print("=====================================================")
    print("MACRO EPOCH: {}".format(epoch+1))
    start_epoch = timer()
    for macro_batch in range(num_macro_batches):
        X_train, y_train = prepare_data(macro_batch, NUM_SAMPLES)
        model.fit(X_train, y_train, epochs=1, batch_size=64, callbacks=[checkpoint], verbose=0, validation_split=0.1)
    end_epoch = timer()
    
    execution_time_epoch = convert_seconds_to_time(end_epoch - start_epoch)
    execution_time = convert_seconds_to_time(end_epoch - start)
    
    print(">>> epoch execution time {}, overall execution time: {}".format(execution_time_epoch, execution_time))

As you will notice, even with just around 10,000 input pairs, the training takes quite some time. If you use this trained model, you will also notice rather bad results. In practice, you need to train the network with much more data and for more epochs.

To see the improvement, you can load the weights that have been trained using about 1 million data pairs over 10 epochs. That training required about 15-20 hours on a modern commodity PC (CPU only).

In [None]:
# load model
#model.load_weights('data/trained-models/q2q.keras')
#model.load_weights('data/trained-models/q2q-1m-10epochs.keras')

## Evaluation

In this tutorial we evaluate the network only by inspecting individual predictions.

The following auxiliary method simply converts a word index back into the original word.

In [None]:
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
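
The method above scans the whole vocabulary on every call. For larger vocabularies, one could build the inverse mapping once and then look words up directly. The sketch below uses a made-up word index in place of `tokenizer.word_index`; the helper name is hypothetical.

```python
def build_index_word(word_index):
    """Invert a word -> index mapping into an index -> word mapping."""
    return {index: word for word, index in word_index.items()}

toy_word_index = {"what": 1, "is": 2, "<POI>": 3}  # made-up stand-in for tokenizer.word_index
index_word = build_index_word(toy_word_index)

print(index_word.get(3))
print(index_word.get(0))  # index 0 is padding and has no word
```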

The method `predict_sequence()` predicts the output sequence (question) for a single input sequence (query).

In [None]:
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [np.argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)
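
The decoding step can be tried without a trained model. The following sketch applies the same argmax-then-lookup logic to a made-up probability matrix with four time steps and a vocabulary of size four; the vocabulary itself is hypothetical.

```python
import numpy as np

index_word = {1: "is", 2: "<POI>", 3: "open"}  # made-up vocabulary; index 0 is padding

# Toy prediction: one row of word probabilities per time step
prediction = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.2, 0.6, 0.2],
    [0.1, 0.1, 0.1, 0.7],
    [0.9, 0.0, 0.1, 0.0],
])

words = []
for i in prediction.argmax(axis=-1):
    word = index_word.get(int(i))
    if word is None:  # padding index reached: stop decoding
        break
    words.append(word)

print(" ".join(words))
```

Picking the most probable word at each time step independently is the simplest decoding strategy; it never revisits earlier choices.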

Let's pick a couple of queries from the set of test data.

In [None]:
X_test_sample = encode_sequences(query_tokenizer, query_length, test[0:10, 0])

print(X_test_sample.shape)

In [None]:
for i, source in enumerate(X_test_sample):
    # translate encoded source text
    source = source.reshape((1, source.shape[0]))
    translation = predict_sequence(model, question_tokenizer, source)
    raw_src, raw_target = test[i]
    print('========================================')
    print('input query: %s\ntrue question: %s\npredicted question: %s' % (raw_src, raw_target, translation))
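
Manual inspection can be complemented with a simple automatic score. The following sketch computes a clipped unigram precision, which is the basic building block of BLEU-style measures. It is not a full BLEU implementation, and the function name and example sentences are made up for illustration.

```python
from collections import Counter

def unigram_precision(predicted, reference):
    """Fraction of predicted tokens that also occur in the reference (with clipping)."""
    pred_tokens = predicted.split()
    if not pred_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    pred_counts = Counter(pred_tokens)
    # A predicted token is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return matched / len(pred_tokens)

score = unigram_precision("what is the price of <POI>", "what is the price at <POI>")
print(round(score, 3))
```

Such a score ignores word order, so it should only be read as a rough indicator next to the printed translations.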