
## Neural translation model
### English to Deutsch



In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import unicodedata
import re
from IPython.display import Image
import numpy as np

<img src="data/germany_uk_flags.png">

This project uses a language dataset from http://www.manythings.org/anki/. The size of the dataset used is not part of the grading rubric.

The goal is to develop a neural translation model from English to German, making use of a pre-trained English word embedding module.

In [2]:
# load the dataset
NUM_EXAMPLES = 20000
data_examples = []
with open('data/deu.txt', 'r', encoding='utf8') as f:
    for line in f.readlines():
        if len(data_examples) < NUM_EXAMPLES:
            data_examples.append(line)
        else:
            break

In [3]:
# These functions preprocess English and German sentences

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r"ü", 'ue', sentence)
    sentence = re.sub(r"ä", 'ae', sentence)
    sentence = re.sub(r"ö", 'oe', sentence)
    sentence = re.sub(r'ß', 'ss', sentence)
    
    sentence = unicode_to_ascii(sentence)
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r"[^a-z?.!,']+", " ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    
    return sentence.strip()
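
As a quick check of the preprocessing above, the following self-contained sketch gives a condensed restatement of the two functions and applies them to one sentence. The sample sentence is made up for illustration and is not taken from the dataset.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Drop combining marks (category 'Mn') after canonical decomposition
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    # Transliterate German umlauts and sharp s before stripping accents
    for src, dst in (("ü", "ue"), ("ä", "ae"), ("ö", "oe"), ("ß", "ss")):
        sentence = sentence.replace(src, dst)
    sentence = unicode_to_ascii(sentence)
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)   # pad punctuation with spaces
    sentence = re.sub(r"[^a-z?.!,']+", " ", sentence)   # keep letters and basic punctuation
    sentence = re.sub(r" +", " ", sentence)             # collapse repeated spaces
    return sentence.strip()

sample = "Das Mädchen heißt Zoë, oder?"  # hypothetical example sentence
cleaned = preprocess_sentence(sample)
print(cleaned)
```

Umlauts are transliterated (`ä` becomes `ae`), other accents are stripped, and punctuation ends up as separate space-delimited tokens.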

#### The custom translation model
The following is a schematic of the custom translation model architecture you will develop in this project.

<img src="data/neural_translation_model.png">

The custom model consists of an encoder RNN and a decoder RNN. The encoder takes words of an English sentence as input, and uses a pre-trained word embedding to embed the words into a 128-dimensional space. To indicate the end of the input sentence, a special end token (in the same 128-dimensional space) is passed in as an input. This token is a TensorFlow Variable that is learned in the training phase (unlike the pre-trained word embedding, which is frozen).

The decoder RNN takes the internal state of the encoder network as its initial state. A start token is passed in as the first input, which is embedded using a learned German word embedding. The decoder RNN then makes a prediction for the next German word, which during inference is then passed in as the following input, and this process is repeated until the special `<end>` token is emitted from the decoder.
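
The inference procedure described above is a greedy decoding loop. The following sketch shows only its control flow, with a hypothetical lookup table `toy_next_word` standing in for the decoder RNN. The table, the `max_len` cap, and the function name `greedy_decode` are assumptions for illustration and are not part of the model.

```python
# Hypothetical stand-in for the decoder: maps the previous word to the next one.
toy_next_word = {
    "<start>": "ich",
    "ich": "bin",
    "bin": "froh",
    "froh": ".",
    ".": "<end>",
}

def greedy_decode(next_word, max_len=15):
    """Feed each predicted word back in until <end> or max_len words."""
    words = []
    current = "<start>"
    while len(words) < max_len:
        current = next_word[current]   # the "prediction" for the next word
        if current == "<end>":
            break                      # stop without emitting the end token
        words.append(current)
    return words

print(" ".join(greedy_decode(toy_next_word)))
```

In the real model the decoder additionally carries its LSTM hidden and cell state from one step to the next, which a plain lookup table cannot represent.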

In [4]:
# lists of both sequences
eng = []
deu = []
for sente in data_examples:
    en, de = re.split("\t", sente)[0:2]
    eng.append(preprocess_sentence(en))
    deu.append(''.join(['<start> ', preprocess_sentence(de), ' <end>']))

In [5]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
tokenizer.fit_on_texts(deu)
tokenizer_config = tokenizer.get_config()

In [6]:
word_index = tokenizer_config['word_index']
index_word = tokenizer_config['index_word']

In [7]:
deu_tokenized = tokenizer.texts_to_sequences(deu)
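
To make the tokenization step concrete, here is a rough pure-Python sketch of what fitting on texts and converting to sequences amounts to: words are split on spaces and indexed by descending frequency, starting at 1 so that 0 stays free for padding. The helpers `build_word_index` and `to_sequences` are hypothetical names, and this approximates the behaviour rather than reproducing the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    # Count words split on spaces; the most frequent word gets index 1
    counts = Counter(word for text in texts for word in text.lower().split(" "))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Replace every word by its integer index
    return [[word_index[w] for w in text.lower().split(" ")] for text in texts]

toy_deu = ["<start> ich bin froh . <end>", "<start> ich sehe tom . <end>"]
toy_index = build_word_index(toy_deu)
toy_sequences = to_sequences(toy_deu, toy_index)
print(toy_index)
print(toy_sequences)
```

Because no characters are filtered out, the angle brackets survive and `<start>` and `<end>` are indexed like any other word.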

In [43]:
for i in np.random.randint(len(data_examples), size=10):
    print("English:           {}\nDeutsch:           {}\nDeutsch Tokenized: {}".format(eng[i], deu[i], deu_tokenized[i]))

English:           keep tom there .
Deutsch:           <start> halte tom dort . <end>
Deutsch Tokenized: [1, 288, 5, 141, 3, 2]
English:           tom told a joke .
Deutsch:           <start> tom hat einen witz erzaehlt . <end>
Deutsch Tokenized: [1, 5, 16, 40, 469, 757, 3, 2]
English:           you look healthy .
Deutsch:           <start> du siehst gesund aus . <end>
Deutsch Tokenized: [1, 13, 236, 700, 41, 3, 2]
English:           i'll pay later .
Deutsch:           <start> ich werde spaeter bezahlen . <end>
Deutsch Tokenized: [1, 4, 39, 613, 464, 3, 2]
English:           tom is a fool .
Deutsch:           <start> tom ist ein narr . <end>
Deutsch Tokenized: [1, 5, 6, 19, 1626, 3, 2]
English:           tom surprised me .
Deutsch:           <start> tom ueberraschte mich . <end>
Deutsch Tokenized: [1, 5, 3083, 22, 3, 2]
English:           i'm happy .
Deutsch:           <start> ich bin froh . <end>
Deutsch Tokenized: [1, 4, 15, 804, 3, 2]
English:           i fainted .
Deutsch:         

In [9]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
deu_tokenized_padded = pad_sequences(deu_tokenized,
                                     maxlen=None,
                                     padding='post',
                                     value=0.0)
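
The effect of `padding='post'` with `maxlen=None` can be sketched without Keras: every sequence is right-padded with zeros to the length of the longest one. The helper name `pad_post` is an assumption for illustration; the two toy sequences are taken from the sample output above.

```python
def pad_post(sequences, value=0):
    # Pad each sequence on the right up to the longest length in the list
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]

toy_sequences = [[1, 288, 5, 141, 3, 2], [1, 5, 16, 40, 469, 757, 3, 2]]
padded = pad_post(toy_sequences)
for row in padded:
    print(row)
```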

## 2. Prepare the data with tf.data.Dataset objects

#### Load the embedding layer
Download the module from [here](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1)

#### Import the pre-trained embedding layer

In [11]:
# Load embedding module from TensorFlow Hub
embedding_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1", 
                                 output_shape=[128], input_shape=[], dtype=tf.string)

In [12]:
# Test the layer
embedding_layer(tf.constant(["<start>", "these", "aren't", "the", "droids", "you're", "looking", "for", "<end>"]))

<tf.Tensor: shape=(9, 128), dtype=float32, numpy=
array([[-0.33764157, -0.12379622,  0.00591127, ...,  0.11612329,
         0.00694278, -0.11781787],
       [ 0.15317006, -0.06145132,  0.07350554, ..., -0.15094818,
        -0.12576084, -0.12233189],
       [ 0.140084  ,  0.02941015,  0.04331429, ...,  0.0944555 ,
        -0.1265336 , -0.25905257],
       ...,
       [ 0.03285561,  0.06345107, -0.05201129, ..., -0.08083786,
        -0.10174342,  0.03802322],
       [ 0.25726095,  0.01382531, -0.03725627, ..., -0.01149414,
         0.0629049 , -0.00084706],
       [-0.15089144,  0.276619  , -0.22944725, ...,  0.19615449,
        -0.1152845 , -0.13853867]], dtype=float32)>

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(eng, deu_tokenized_padded)

In [15]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [16]:
def split_eng(x,y):
    x = tf.strings.split(x, sep=' ')
    return x, y

train_dataset = train_dataset.map(split_eng)
test_dataset = test_dataset.map(split_eng)

In [17]:
def embed_eng(x,y):
    return embedding_layer(x), y

train_dataset = train_dataset.map(embed_eng)
test_dataset = test_dataset.map(embed_eng)

In [18]:
train_dataset = train_dataset.filter(lambda x,y: tf.shape(x)[0] <= 13) 
test_dataset = test_dataset.filter(lambda x,y: tf.shape(x)[0] <= 13)

In [19]:
def pad(x, y):
    paddings = tf.concat(([[13-tf.shape(x)[0],0]], [[0,0]]), axis=0)
    x = tf.pad(x, paddings)
    return x, y

In [20]:
train_dataset = train_dataset.map(pad)
test_dataset = test_dataset.map(pad)
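
Note that the `paddings` argument above places the zeros before the sentence, so the English embeddings are left-padded to 13 rows. A NumPy sketch of the same operation follows, using random numbers in place of real embeddings (an assumption made purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128))          # stand-in for a 4-word embedded sentence

# ((rows_before, rows_after), (cols_before, cols_after)), mirroring tf.pad
paddings = ((13 - x.shape[0], 0), (0, 0))
x_padded = np.pad(x, paddings)         # constant zero padding by default

print(x_padded.shape)
```

Left-padding keeps the real words directly next to the end token that the custom layer appends at the end of the sequence.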

In [21]:
train_dataset = train_dataset.batch(16)
test_dataset = test_dataset.batch(16)

In [22]:
train_dataset.element_spec

(TensorSpec(shape=(None, None, None), dtype=tf.float32, name=None),
 TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

In [23]:
for x,y in train_dataset.take(1):
    pass

print(x.shape)
print(y)

(16, 13, 128)
tf.Tensor(
[[   1  356  193    6    5    3    2    0    0    0    0    0    0    0]
 [   1 1539   55 2990    9    2    0    0    0    0    0    0    0    0]
 [   1  131    8   34  143    3    2    0    0    0    0    0    0    0]
 [   1   38   18    4 1437    7    2    0    0    0    0    0    0    0]
 [   1    5    6 1061 1289    3    2    0    0    0    0    0    0    0]
 [   1 1683   52   11  637    3    2    0    0    0    0    0    0    0]
 [   1   14   24   42 2665    3    2    0    0    0    0    0    0    0]
 [   1    8   69   10 3474    3    2    0    0    0    0    0    0    0]
 [   1    5    6  668    3    2    0    0    0    0    0    0    0    0]
 [   1    5  311   42 5303   63    3    2    0    0    0    0    0    0]
 [   1 3211   27    3    2    0    0    0    0    0    0    0    0    0]
 [   1    5   16   37  163  116    3    2    0    0    0    0    0    0]
 [   1  317    8   21  202  204    9    2    0    0    0    0    0    0]
 [   1   73   27 1417    7

## 3. Create the custom layer
You will now create a custom layer to add the learned end token embedding to the encoder model:

In [24]:
from tensorflow.keras.layers import Layer
class EmbedEndToken(Layer):
    def __init__(self, **kwargs):
        super(EmbedEndToken, self).__init__(**kwargs)
        self.end_token_embed = tf.Variable(initial_value=tf.random.uniform(shape=(128,)), trainable=True)

    def call(self, inputs): #inputs is of shape (16,13,128)
        end_token = tf.tile(
                tf.reshape(self.end_token_embed, shape=(1, 1, self.end_token_embed.shape[0])),
                [tf.shape(inputs)[0], 1, 1])
        return tf.keras.layers.concatenate([inputs, end_token], axis=1)

In [25]:
for x,y in train_dataset.take(1):
    # print(x[0])
    print(x.shape) #(16, 13, 128)
    print(EmbedEndToken()(x).shape)

(16, 13, 128)
(16, 14, 128)
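
The shape bookkeeping in `call` can be checked with a NumPy analogue: the 128-dimensional vector is reshaped to `(1, 1, 128)`, repeated once per batch element, and concatenated along the time axis. The arrays below are random stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(16, 13, 128))     # stand-in for a batch of embedded sentences
end_token_embed = rng.uniform(size=(128,))  # stand-in for the learned end token

# (128,) -> (1, 1, 128) -> (16, 1, 128)
end_token = np.tile(end_token_embed.reshape(1, 1, -1), (inputs.shape[0], 1, 1))
outputs = np.concatenate([inputs, end_token], axis=1)

print(outputs.shape)
```

Every sentence in the batch receives the same end token vector as its final time step.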


## 4. Build the encoder network

<img src="data/neural_translation_model_encoder.png">


In [26]:
inputs = tf.keras.layers.Input(shape=((13,128)))
h = EmbedEndToken()(inputs)
h = tf.keras.layers.Masking(mask_value=0.0)(h)
_, h_state, c_state = tf.keras.layers.LSTM(512, return_state=True, return_sequences=True)(h)

encoder = tf.keras.Model(inputs, [h_state, c_state])

In [27]:
encoder.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 13, 128)]         0         
                                                                 
 embed_end_token_1 (EmbedEnd  (None, 14, 128)          128       
 Token)                                                          
                                                                 
 masking (Masking)           (None, 14, 128)           0         
                                                                 
 lstm (LSTM)                 [(None, 14, 512),         1312768   
                              (None, 512),                       
                              (None, 512)]                       
                                                                 
Total params: 1,312,896
Trainable params: 1,312,896
Non-trainable params: 0
___________________________________________________

In [28]:
for x,y in train_dataset:
    outputs = encoder(x)
    break
    
print(outputs[0].shape)
print(outputs[1].shape)    

(16, 512)
(16, 512)


## 5. Build the decoder network
The decoder network follows the schematic diagram below. 

![Decoder schematic](data/neural_translation_model_decoder.png)

In [29]:
vocab_size = len(tokenizer.word_index)+1

In [30]:
from tensorflow.keras.layers import Embedding, LSTM, Dense
class Decoder(tf.keras.Model):

    def __init__(self, vocab_size, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embedding = Embedding(vocab_size, 128, mask_zero=True)
        self.lstm = LSTM(512, return_sequences=True, return_state=True)
        self.dense = Dense(vocab_size) 

    def call(self, inputs, hidden_state=None, cell_state=None):
        h = self.embedding(inputs)
        h, hidden_state, cell_state = self.lstm(h, initial_state=[hidden_state, cell_state])
        output = self.dense(h)
        return output, hidden_state, cell_state

In [31]:
decoder = Decoder(vocab_size)

In [32]:
for x,y in train_dataset.take(1):
    enc_h_state, enc_c_state = encoder(x)
    output, dec_h_state, dec_c_state = decoder(y, enc_h_state, enc_c_state)

In [33]:
print(enc_h_state.shape)
print(enc_c_state.shape)
print(dec_h_state.shape)
print(dec_c_state.shape)
print(output.shape)

(16, 512)
(16, 512)
(16, 512)
(16, 512)
(16, 14, 5744)


In [34]:
decoder.summary()

Model: "decoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  735232    
                                                                 
 lstm_1 (LSTM)               multiple                  1312768   
                                                                 
 dense (Dense)               multiple                  2946672   
                                                                 
Total params: 4,994,672
Trainable params: 4,994,672
Non-trainable params: 0
_________________________________________________________________


## 6. Make a custom training loop

In [35]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
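
With `from_logits=True` the loss applies a softmax to the decoder's raw scores and then takes the negative log-probability of the correct token. A NumPy sketch of that computation for two toy examples follows; the function name `sparse_ce_from_logits` and the numbers are assumptions for illustration.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    # Numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class for each example
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])  # raw scores over 3 classes
labels = np.array([0, 2])                              # integer class ids
losses = sparse_ce_from_logits(labels, logits)
print(losses, losses.mean())
```

Uniform logits over three classes give a loss of log 3, while a confident correct prediction gives a smaller value.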

In [36]:
trainable_variables = encoder.trainable_variables + decoder.trainable_variables

In [37]:
@tf.function
def optimization(eng_in, deu_in, deu_out):
    # Forward pass under the tape so the loss can be differentiated
    with tf.GradientTape() as tape:
        enc_h, enc_c = encoder(eng_in)
        dec_out,_,_ = decoder(deu_in, enc_h, enc_c)
        loss_val = tf.math.reduce_mean(loss(deu_out, dec_out))
    # Compute the gradients after leaving the tape context
    gradients = tape.gradient(loss_val, trainable_variables)
    return loss_val, gradients

In [None]:
epochs = 10
train_loss = []
val_loss = []

for epoch in range(epochs):
    train_loss_avg = tf.keras.metrics.Mean()
    val_loss_avg = tf.keras.metrics.Mean()

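    # Use batch-specific names so the `eng` and `deu` lists are not overwritten
    for eng_batch, deu_batch in train_dataset:
        deu_in, deu_out = deu_batch[:, :-1], deu_batch[:, 1:]
        loss_value, gradients = optimization(eng_batch, deu_in, deu_out)
        optimizer.apply_gradients(zip(gradients, trainable_variables))
        train_loss_avg(loss_value)

    for eng_batch, deu_batch in test_dataset:
        deu_in, deu_out = deu_batch[:, :-1], deu_batch[:, 1:]
        # Gradients are computed but discarded; no weights are updated here
        loss_value, _ = optimization(eng_batch, deu_in, deu_out)
        val_loss_avg(loss_value)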


    train_loss.append(train_loss_avg.result())
    val_loss.append(val_loss_avg.result())

    print('Epoch {}, train loss {}, val loss {}'.format(epoch, train_loss[-1], val_loss[-1]))
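
The `[:, :-1]` and `[:, 1:]` slices in the loop above implement teacher forcing: the decoder input is the target sentence without its last position, and the expected output is the same sentence shifted one step to the left. A small NumPy illustration, using one padded row from the batch printed earlier (the name `toy_batch` is an assumption):

```python
import numpy as np

toy_batch = np.array([[1, 356, 193, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 0]])

toy_in = toy_batch[:, :-1]   # starts with <start> (index 1)
toy_out = toy_batch[:, 1:]   # what the decoder should predict at each step

# Show the first few (input token -> target token) pairs
for inp, tgt in zip(toy_in[0][:6], toy_out[0][:6]):
    print(inp, "->", tgt)
```

At each step the decoder sees the true previous token and is trained to predict the next one, ending with `<end>` (index 2).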

In [None]:
import matplotlib.pyplot as plt
plt.plot(train_loss, label='Training')
plt.plot(val_loss, label='Validation')
plt.title('Epochs vs. Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

## 7. Use the model to translate





For the purposes of illustration, we load saved weights for the `encoder` and `decoder` models, because training this word-to-word translation model is computationally demanding.

In [38]:
encoder.load_weights('nmt/encoder.h5')
decoder.load_weights('nmt/decoder.h5')

In [45]:
eng = []
deu = []
for sente in data_examples:
    en, de = re.split("\t", sente)[0:2]
    eng.append(preprocess_sentence(en))
    deu.append(preprocess_sentence(de))

In [46]:
idx = np.random.randint(len(eng), size=5)

In [47]:
start_token = tokenizer.word_index['<start>']
end_token = tokenizer.word_index['<end>']

for i in np.random.choice(len(eng), size=15):
    en = eng[i]
    en = tf.strings.split(en)
    en_em = embedding_layer(en)
    padding = [[tf.math.maximum(13-tf.shape(en_em)[0],0), 0], [0,0]]
    padded = tf.expand_dims(tf.pad(en_em, padding), axis = 0)
    
    curr_token = start_token
    h_state, c_state = encoder(padded)
    
    deu_sente=[]
    
    while len(deu_sente) < 15:
        inputs = tf.constant([[curr_token]])
        out, h_state, c_state = decoder(inputs, h_state, c_state)
        curr_token = np.argmax(out[0][0].numpy())
        if curr_token==end_token:
            break
        deu_word = tokenizer.index_word[curr_token]
        deu_sente.append(deu_word)
        
    print("English: {}".format(eng[i]))
    print("Deutsch True: {}".format(deu[i]))
    print("Deutsch Pred: {}".format(' '.join(deu_sente)))
    print()

English: i see tom .
Deutsch True: ich sehe tom .
Deutsch Pred: ich sehe tom .

English: there's no proof .
Deutsch True: es gibt keine beweise .
Deutsch Pred: es ist kein bisschen da .

English: i'll go ask tom .
Deutsch True: ich frage tom .
Deutsch Pred: ich gucke fernsehen .

English: are we going far ?
Deutsch True: gehen wir weit weg ?
Deutsch Pred: gehen wir weit weg ?

English: tom sounded busy .
Deutsch True: tom klang beschaeftigt .
Deutsch Pred: tom wirkte stumm .

English: did you hit tom ?
Deutsch True: haben sie tom geschlagen ?
Deutsch Pred: haben sie tom geschlagen ?

English: i won't fail .
Deutsch True: ich werde nicht versagen .
Deutsch Pred: ich habe es nicht gemeint .

English: he began running .
Deutsch True: er fing an zu rennen .
Deutsch Pred: er ist in tokyo gegangen .

English: stay cool .
Deutsch True: bleibt ruhig .
Deutsch Pred: bleib duenn .

English: did tom send you ?
Deutsch True: hat tom dich geschickt ?
Deutsch Pred: hat tom euch gekuesst ?

English: 