# Train Neural Translation Model

Training involves two steps: loading and preparing the clean text data for modeling, then defining and training the model on the prepared data.

In [1]:
# import libraries
from pickle import load
from numpy import array

In [2]:
# load a clean dataset
def load_clean_sentences(fname):
    # use a context manager so the file handle is closed after loading
    with open(fname, 'rb') as f:
        return load(f)

In [3]:
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

In [4]:
dataset[0:10]

array([['im helping you', 'ich helfe dir'],
       ['it wasnt ours', 'es war nicht unseres'],
       ['tom walked into the bar', 'tom ging in die kneipe'],
       ['he has an eye for art', 'er hat einen blick fur kunst'],
       ['do you use aftershave', 'benutzen sie aftershave'],
       ['i said drop your weapon', 'ich sagte lassen sie die waffe fallen'],
       ['ill stop by later', 'ich komme spater vorbei'],
       ['do you happen to know tom', 'kennen sie zufalligerweise tom'],
       ['he sketched an apple', 'er zeichnete einen apfel'],
       ['tom has ocd', 'tom leidet an einer zwangserkrankung']],
      dtype='<U370')

I use the combination of the train and test datasets to define the maximum length and vocabulary of the problem.

We can use the Keras Tokenizer class to map words to integers, as needed for modeling. We will use separate tokenizers for the English sequences and the German sequences.

In [5]:
# import libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [6]:
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
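
To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of the kind of index a fitted tokenizer holds. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, and the sketch only assumes that more frequent words receive lower indices, that index 0 is left free for padding, and that words missing from the index are dropped.

```python
from collections import Counter

def build_word_index(lines):
    # count every whitespace-separated, lowercased word
    counts = Counter(word for line in lines for word in line.lower().split())
    # most frequent word gets index 1; index 0 is kept free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(word_index, line):
    # words that are not in the index are skipped
    return [word_index[w] for w in line.lower().split() if w in word_index]

lines = ['tom walked into the bar', 'tom has ocd', 'he sketched an apple']
word_index = build_word_index(lines)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0
encoded = to_sequence(word_index, 'tom has an apple')
print(word_index['tom'], vocab_size, encoded)
```

The `+ 1` mirrors how `eng_vocab_size` and `ger_vocab_size` are computed in this notebook.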

In [7]:
# find the length of the longest sequence in a list of phrases.
def max_length(lines):
    return max(len(line.split()) for line in lines)

In [8]:
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

English Vocabulary Size: 3580
English Max Length: 8


In [9]:
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

German Vocabulary Size: 5237
German Max Length: 10


Each input and output sequence must be encoded to integers and padded to the maximum phrase length, because we will use a word embedding for the input sequences and one-hot encode the output sequences.

In [10]:
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
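
As an illustration of what post-padding does to the integer sequences, the following pure-Python sketch pads each sequence with trailing zeros. `pad_post` is a hypothetical helper, not part of Keras. For sequences longer than `length` it keeps the last `length` items, which is an assumption meant to mirror the default truncation behaviour of `pad_sequences`; that case does not arise here, since `length` is the maximum length in the dataset.

```python
def pad_post(sequences, length, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > length:
            # assumption: drop items from the front when a sequence is too long
            seq = seq[len(seq) - length:]
        # append the padding value until the sequence has the target length
        padded.append(seq + [value] * (length - len(seq)))
    return padded

print(pad_post([[1, 2, 3], [4]], length=5))
```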

The output sequence needs to be one-hot encoded because the model predicts a probability for each word in the vocabulary at every output position.

In [11]:
# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
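
The shape change produced by one-hot encoding can be seen on a toy example. This NumPy sketch is an equivalent formulation, not the `to_categorical` code itself; `one_hot` and the toy vocabulary size of 4 are assumptions made for illustration.

```python
import numpy as np

def one_hot(sequences, vocab_size):
    # row i of the identity matrix is the one-hot vector for integer i
    sequences = np.asarray(sequences)
    return np.eye(vocab_size, dtype='float32')[sequences]

padded = np.array([[1, 3, 0], [2, 0, 0]])  # 2 sequences, 3 time steps
y = one_hot(padded, vocab_size=4)
print(y.shape)
```

Each integer becomes a vector of length `vocab_size`, so a `(samples, timesteps)` array turns into `(samples, timesteps, vocab_size)`. Note that the padding value 0 is encoded as class 0 like any other index.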

In [12]:
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)

In [13]:
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

We will use an encoder-decoder LSTM model on this problem. In this architecture, the input sequence is encoded by a front-end model called the encoder, then decoded word by word by a back-end model called the decoder.

The function define_model() below defines the model. Its arguments configure the model: the sizes of the input and output vocabularies, the maximum lengths of input and output phrases, and the number of memory units in each LSTM layer.

The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical cross-entropy loss function, because we have framed the prediction problem as multi-class classification.
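
To show what the categorical cross-entropy measures, here is a small NumPy sketch on made-up numbers. For each time step the loss is the negative log of the probability the model assigned to the correct word. The function below is a hypothetical stand-in written for illustration; the clipping constant and the plain mean over all positions are assumptions, not a statement of how Keras reduces the loss internally.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0), then take -log of the probability of the true class
    y_pred = np.clip(y_pred, eps, 1.0)
    per_step = -np.sum(y_true * np.log(y_pred), axis=-1)
    # average over all samples and time steps
    return per_step.mean()

# one sample, two time steps, vocabulary of three words
y_true = np.array([[[0, 1, 0], [1, 0, 0]]], dtype=float)
y_pred = np.array([[[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]]])
loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Here the two per-step losses are `-log(0.8)` and `-log(0.5)`, so a confident correct prediction contributes far less than an uncertain one.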

In [14]:
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model
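
The role of RepeatVector is easiest to see through array shapes. The encoder LSTM returns one fixed-length vector per input phrase, and the decoder LSTM expects one input per output time step, so the vector is copied once per step. The NumPy sketch below imitates that copy with random stand-in data; it is an illustration of the shapes, not Keras code, and the batch size of 2 is arbitrary.

```python
import numpy as np

batch, n_units, tar_timesteps = 2, 256, 8

# stand-in for the encoder LSTM output: one vector per input phrase
encoded = np.random.rand(batch, n_units)

# copy each vector once per output time step, as RepeatVector(tar_timesteps) would
repeated = np.repeat(encoded[:, np.newaxis, :], tar_timesteps, axis=1)
print(encoded.shape, repeated.shape)
```

Every decoder time step therefore receives the same encoded vector; the decoder's own recurrent state is what lets it emit a different word at each position.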


In [16]:
# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# summarize defined model (summary() prints the table itself and returns None, which print() then echoes)
print(model.summary())
# plot_model(model, to_file='model.png', show_shapes=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 10, 256)           1340672   
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 8, 256)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 8, 256)            525312    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 8, 3580)           920060    
Total params: 3,311,356
Trainable params: 3,311,356
Non-trainable params: 0
_________________________________________________________________
None
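
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an embedding has one vector per vocabulary entry, an LSTM has four gates that each hold input weights, recurrent weights, and a bias, and a dense layer has a weight matrix plus a bias. The helper function names are hypothetical; the vocabulary sizes and unit count are the ones printed earlier in this notebook.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    return input_dim * output_dim + output_dim

src_vocab, tar_vocab, n_units = 5237, 3580, 256

counts = [
    embedding_params(src_vocab, n_units),  # embedding
    lstm_params(n_units, n_units),         # encoder LSTM
    lstm_params(n_units, n_units),         # decoder LSTM
    dense_params(n_units, tar_vocab),      # time-distributed dense
]
total = sum(counts)
print(counts, total)
```

RepeatVector contributes no parameters, which is why it shows 0 in the table.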


In [18]:
# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, 
                             save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=50, batch_size=64, 
          validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Train on 9000 samples, validate on 1000 samples
Epoch 1/50
Epoch 00001: val_loss improved from inf to 2.48823, saving model to model.h5
 - 21s - loss: 0.9200 - val_loss: 2.4882
Epoch 2/50
Epoch 00002: val_loss improved from 2.48823 to 2.48455, saving model to model.h5
 - 19s - loss: 0.8562 - val_loss: 2.4845
Epoch 3/50
Epoch 00003: val_loss improved from 2.48455 to 2.47814, saving model to model.h5
 - 19s - loss: 0.7976 - val_loss: 2.4781
Epoch 4/50
Epoch 00004: val_loss did not improve
 - 19s - loss: 0.7435 - val_loss: 2.4984
Epoch 5/50
Epoch 00005: val_loss did not improve
 - 19s - loss: 0.6902 - val_loss: 2.5166
Epoch 6/50
Epoch 00006: val_loss did not improve
 - 19s - loss: 0.6424 - val_loss: 2.5212
Epoch 7/50
Epoch 00007: val_loss did not improve
 - 19s - loss: 0.5970 - val_loss: 2.5197
Epoch 8/50
Epoch 00008: val_loss did not improve
 - 19s - loss: 0.5538 - val_loss: 2.5340
Epoch 9/50
Epoch 00009: val_loss did not improve
 - 19s - loss: 0.5123 - val_loss: 2.5376
Epoch 10/50
Epoch

<keras.callbacks.History at 0x202950afda0>
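
The effect of `save_best_only=True` with `mode='min'` can be traced on the validation losses logged above. The following pure-Python sketch replays the first nine epochs and records when a checkpoint would be written. It is a simplified model of the behaviour visible in the log, not the ModelCheckpoint implementation.

```python
# val_loss values copied from epochs 1 to 9 of the training log
val_losses = [2.4882, 2.4845, 2.4781, 2.4984, 2.5166,
              2.5212, 2.5197, 2.5340, 2.5376]

best = float('inf')
saved_epochs = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:
        # a new minimum: this is when the model file would be overwritten
        best = val_loss
        saved_epochs.append(epoch)

print(saved_epochs, best)
```

In this log the saved model is the one from epoch 3. After that the training loss keeps falling while the validation loss rises, a pattern that usually suggests the model is starting to overfit the training data.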