## German to English Dataset

In this tutorial, we will use a dataset of German to English terms used as the basis for flashcards for language learning. The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. 
    
    􏰀 Download the English-German pairs dataset: http://www.manythings.org/anki/deu-eng.zip

## Clean data

We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:

    􏰀 Remove all non-printable characters.
    􏰀 Remove all punctuation characters.
    􏰀 Normalize all Unicode characters to ASCII (e.g. Latin characters). 
    􏰀 Normalize the case to lowercase.
    􏰀 Remove any remaining tokens that are not alphabetic.

In [46]:
with open("../data/deu.txt") as file:
    doc = file.read().strip().split('\n')
    pairs = [ line.split('\t') for line in doc  ]
    file.close()
    
print(pairs[:10])

[['Hi.', 'Hallo!'], ['Hi.', 'Grüß Gott!'], ['Run!', 'Lauf!'], ['Fire!', 'Feuer!'], ['Help!', 'Hilfe!'], ['Help!', 'Zu Hülf!'], ['Stop!', 'Stopp!'], ['Wait!', 'Warte!'], ['Go on.', 'Mach weiter.'], ['Hello!', 'Hallo!']]


In [52]:
import string
import re
import pickle 
from unicodedata import normalize
from numpy import array

def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))

    for pair in lines:
        clean_pair = list()
        for line in pair:
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            line = line.split()
            line = [word.lower() for word in line]
            line = [re_punc.sub('', w) for w in line]
            line = [re_print.sub('', w) for w in line]
            line = [word for word in line if word.isalpha()]
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)    

cleaned_pairs = clean_pairs(pairs)
print(cleaned_pairs[:3])

[['hi' 'hallo']
 ['hi' 'gru gott']
 ['run' 'lauf']]


In [53]:
# Save cleaned dataset
with open("../data/translation_pairs.pkl", "wb") as file:
    pickle.dump(cleaned_pairs, file)
    file.close()
    
print("Finished saving.")

Finished saving.


## Split Data in Train Test

In [54]:
import pickle
from numpy.random import shuffle

#load cleaned pairs
#cleaned_pairs = pickle.load(open("../data/translation_pairs.pkl","rb"))

def save_data(data, filename):
    pickle.dump( data, open(filename, "wb") )
    print("saved: ", filename)
    
print("Raw Dataset size: ", len(cleaned_pairs))

# Reduce dataset size for training demo
n_sentences = 10000
dataset = cleaned_pairs[:n_sentences, :]
shuffle(dataset)

train, test = dataset[:9000], dataset[9000:]
# save
save_data(dataset, '../data/english-german-both.pkl') 
save_data(train, '../data/english-german-train.pkl') 
save_data(test, '../data/english-german-test.pkl')


Raw Dataset size:  169813
saved:  ../data/english-german-both.pkl
saved:  ../data/english-german-train.pkl
saved:  ../data/english-german-test.pkl


## Build and Train Model

In [61]:
import pickle

def load_data(filename):
    return pickle.load(open(filename,'rb'))

# load datasets
dataset = load_data('../data/english-german-both.pkl') 
train = load_data('../data/english-german-train.pkl')
test = load_data('../data/english-german-test.pkl')

# Fit tokenizer
def create_tokenizer(data):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(data)
    return tokenizer

# max sentence length
def max_length(data):
    return max( len(line.split()) for line in data )

### Prepare Tokenizers and encode sequences

In [62]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np 

# Encode sequences
def encode_sequences(tokenizer, length, lines):
    X = tokenizer.texts_to_sequences(lines)
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

# One hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = np.array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y


# English tokenizer
eng_tokenizer = create_tokenizer(dataset[:,0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:,0])
print("english vocab size %i, seq length: %i" %(eng_vocab_size, eng_length))

# German Tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1]) 
ger_vocab_size = len(ger_tokenizer.word_index) + 1 
ger_length = max_length(dataset[:, 1])
print("german vocab size %i, seq length: %i" %(ger_vocab_size, ger_length))

english vocab size 2315, seq length: 5
german vocab size 3686, seq length: 10


In [63]:
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
print("X train: ", trainX.shape, "Y train: ", trainY.shape)

# Prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)
print("X test: ", testX.shape, "Y test: ", testY.shape)

X train:  (9000, 10) Y train:  (9000, 5, 2315)
X test:  (1000, 10) Y test:  (1000, 5, 2315)


### Compile Model

In [64]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.callbacks import ModelCheckpoint

def define_model(src_vocab, tar_vocab, 
                 src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, 
                        input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True)) 
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax'))) # compile model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
    
    
model = define_model(ger_vocab_size, eng_vocab_size, 
                     ger_length, eng_length, 256)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 10, 256)           943616    
_________________________________________________________________
lstm_5 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_3 (RepeatVecto (None, 5, 256)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 5, 256)            525312    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 5, 2315)           594955    
Total params: 2,589,195
Trainable params: 2,589,195
Non-trainable params: 0
_________________________________________________________________


In [65]:
checkpoint = ModelCheckpoint('../models/nmt_model.h5', monitor='val_loss',
                             verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, 
          validation_data=(testX, testY),
          callbacks=[checkpoint], verbose=2)

Train on 9000 samples, validate on 1000 samples
Epoch 1/30
 - 19s - loss: 4.3151 - val_loss: 3.5629

Epoch 00001: val_loss improved from inf to 3.56295, saving model to ../models/nmt_model.h5
Epoch 2/30
 - 18s - loss: 3.4112 - val_loss: 3.4244

Epoch 00002: val_loss improved from 3.56295 to 3.42444, saving model to ../models/nmt_model.h5
Epoch 3/30
 - 18s - loss: 3.2742 - val_loss: 3.3699

Epoch 00003: val_loss improved from 3.42444 to 3.36994, saving model to ../models/nmt_model.h5
Epoch 4/30
 - 18s - loss: 3.1453 - val_loss: 3.2564

Epoch 00004: val_loss improved from 3.36994 to 3.25637, saving model to ../models/nmt_model.h5
Epoch 5/30
 - 18s - loss: 3.0051 - val_loss: 3.1740

Epoch 00005: val_loss improved from 3.25637 to 3.17402, saving model to ../models/nmt_model.h5
Epoch 6/30
 - 18s - loss: 2.8607 - val_loss: 3.0734

Epoch 00006: val_loss improved from 3.17402 to 3.07337, saving model to ../models/nmt_model.h5
Epoch 7/30
 - 18s - loss: 2.7196 - val_loss: 2.9614

Epoch 00007: va

<keras.callbacks.History at 0x7fc6edf8f518>

## Evaluate Neural Translation Model

In [66]:
# Load datasets if not loaded
dataset = load_data('../data/english-german-both.pkl')
train = load_data('../data/english-german-train.pkl') 
test = load_data('../data/english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])

# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])

# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])

In [69]:
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu
from numpy import argmax

# load model
#model = load_model("../models/nmt_model.h5")

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate target given source sequence
def predict_sequence(model, tokenizer, source):
	prediction = model.predict(source, verbose=0)[0]
	integers = [argmax(vector) for vector in prediction]
	target = list()
	for i in integers:
		word = word_for_id(i, tokenizer)
		if word is None:
			break
		target.append(word)
	return ' '.join(target)

# evaluate the skill of the model
def evaluate_model(model, sources, raw_dataset):
	actual, predicted = list(), list()
	for i, source in enumerate(sources):
		# translate encoded source text
		source = source.reshape((1, source.shape[0]))
		translation = predict_sequence(model, eng_tokenizer, source)
		raw_target, raw_src = raw_dataset[i]
		if i < 10:
			print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
		actual.append(raw_target.split())
		predicted.append(translation.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

In [70]:
# test on some training sequences
print('train')
evaluate_model(model, trainX, train)
# test on some test sequences
print('test')
evaluate_model(model, testX, test)           

train
src=[wo seid ihr], target=[where are you], predicted=[where are you]
src=[ich bin mude], target=[im sleepy], predicted=[im tired]
src=[ich sah nichts], target=[i saw nothing], predicted=[i didnt nothing]
src=[ich habe kaffee getrunken], target=[i drank coffee], predicted=[i love coffee]
src=[tom ging raus], target=[tom walked out], predicted=[tom went out]
src=[ich bin dein chef], target=[im your boss], predicted=[im your boss]
src=[das ist keine neuigkeit], target=[this isnt news], predicted=[this isnt news]
src=[mach es aus], target=[turn it off], predicted=[do it off]
src=[ich habe mir sorgen gemacht], target=[i was worried], predicted=[i been abroad]
src=[tom ist erstaunlich], target=[toms amazing], predicted=[toms amazing]


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU-1: 0.074037
BLEU-2: 0.000000
BLEU-3: 0.000000
BLEU-4: 0.000000
test
src=[ich ziehe mich aus], target=[im undressing], predicted=[i am undressing]
src=[tom wird nicht warten], target=[tom wont wait], predicted=[tom wont talk]
src=[tom ist pleite], target=[tom is broke], predicted=[tom is gasping]
src=[tom hat gebetet], target=[tom prayed], predicted=[tom obeyed]
src=[ich studiere englisch], target=[i study english], predicted=[i teach english]
src=[tom wird arbeiten], target=[tom will work], predicted=[tom will]
src=[fur hier oder zum mitnehmen], target=[here or to go], predicted=[here or do]
src=[ich bin knapp bei kasse], target=[im broke], predicted=[im in free]
src=[das ist ekelhaft], target=[this is ugly], predicted=[this is]
src=[die schule ist vorbei], target=[schools out], predicted=[school is over]
BLEU-1: 0.065689
BLEU-2: 0.000000
BLEU-3: 0.000000
BLEU-4: 0.000000
