<a href="https://colab.research.google.com/github/ap-nlp-research/language_translation_en_ru_tf2/blob/master/MachineTranslationModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation Project


The goal of the project is to compare the strength of the following recurrent models:

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

The models implemented in Tensorflow 2.0 with Keras as a high-level API. Models are trained and analyzed based on EN-RU [wmt19_translate dataset](https://www.tensorflow.org/datasets/datasets#wmt19_translate) ([ACL 2019 FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT19)](http://www.statmt.org/wmt19/translation-task.html)).

In [21]:
!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
!pip install tensorflow-gpu==2.0.0-alpha0
!git clone https://github.com/ap-nlp-research/language_translation_en_ru_tf2.git translation_en_ru

Collecting https://github.com/chengs/tqdm/archive/colab.zip
[?25l  Downloading https://github.com/chengs/tqdm/archive/colab.zip
[K     | 604kB 3.2MB/s
[?25hBuilding wheels for collected packages: tqdm
  Building wheel for tqdm (setup.py) ... [?25l[?25hdone
  Stored in directory: /tmp/pip-ephem-wheel-cache-4qr3zmaj/wheels/41/18/ee/d5dd158441b27965855b1bbae03fa2d8a91fe645c01b419896
Successfully built tqdm
Installing collected packages: tqdm
  Found existing installation: tqdm 4.28.1
    Uninstalling tqdm-4.28.1:
      Successfully uninstalled tqdm-4.28.1
Successfully installed tqdm-4.28.1


In [1]:
import os
import pickle as pk
import subprocess
import re
import numpy as np
#from tqdm import tqdm, tqdm_notebook
from tqdm.autonotebook import tqdm
from typing import Callable
from functools import partial
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.keras import backend as K



## Data ETL

The data load, extraction, and transformation was done beforehand with [create_dataset_en_ru.py](./create_dataset_en_ru.py) script. This script stores a dictionary containing source data under 'x' label. Target data is stored under 'y' label. In addition to the source and target data, the dictionary contains x and y tockenizers (stored as 'x_tk' and 'y_tk'):

dataset: dict

{
    'x': np.ndarray,
    'y': np.ndarray,
    'x_tk': keras.preprocessing.text.Tokenizer,
    'y_tk': keras.preprocessing.text.Tokenizer
}

In [0]:
with open("./translation_en_ru/data/wlm_en_ru.pkl", 'rb') as file:
    dataset = pk.load(file)

## Utility Functions

In addition to the data ETL, the code below provides two additional functions for converting logits into word indicies and converting word indicies into text.

In [0]:
def logits_to_id(logits):
    """
    Turns logits into word ids
    :param logits: Logits from a neural network
    """
    return np.argmax(logits, 1)

def id_to_text(idx, tokenizer):
    """
    Turns id into text using the tokenizer
    :param idx: word id
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in idx]).replace(" <PAD>", "")

In [4]:
print("Here is an example for a sample number 1:")
print("Source('en') example:", id_to_text(dataset['x'][0], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][0], dataset['y_tk']))
print(" ")
print("A sample number 2:")
print("Source('en') example:", id_to_text(dataset['x'][-10], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][-10], dataset['y_tk']))
print("source vocabulary size:", dataset['x'].max())
print("target vocabulary size:", dataset['y'].max())
print("Source shape:", dataset['x'].shape)
print("Target shape:", dataset['y'].shape)

Here is an example for a sample number 1:
Source('en') example: the company has been registered with the municipal court in prague in section b file 1 4 8 5 7
Target('ru') example: фирма зарегистрирована в городском суде в г праге раздел б [rare] 1 4 8 5 7
 
A sample number 2:
Source('en') example: six years ago i had a surgery and l 4 l 5 [rare] were [rare] now l 5 s 1 [rare] [rare] and i had a second surgery that went well
Target('ru') example: шесть лет назад мне сделали операцию и на дисках l 4 l 5 сейчас [rare] [rare] диски l 5 s 1 и было необходимо второе хирургическое вмешательство которое произошло вчера и прошло хорошо
source vocabulary size: 3499
target vocabulary size: 14999
Source shape: (14751, 148)
Target shape: (14751, 148)


## Models

The models are implemented with a similar set of parameters. The main idea is to keep models as small and simple as possible to quickly train them and validate the difference the primarely derived from model architectures. The summary of main hyper parameters presented below:

* Mapping:
    - Embeddings - word indices will be mapped into a 16-dimentional space
    - Dense mapping - recurrence outputs mapped into the target-language space, represented with OHE, via Dense mapping
* Layers:
    - GRU - number of units 256
    - Bidirectional GRU - number of untis set up to 128 in order to keep the total number of units the same (256)
    - Batch Normalization - To speed up the training batch normalization is inserted after embeddings and before dense mapping
* Optimization:
    - Adam - all models trained with Adam optimizer and the same learning rate (1e-3)
* Loss function:
    - sparse_categorical_crossentropy_from_logits - keras.losses.sparse_categorical_crossentropy

In [0]:
learning_rate = 1e-3
embeddings_units = 16
gru_units = 256
epochs = 10
validation_split = 0.1
batch_size = 64

def focal_loss(gamma=2., alpha=.25):

    def call(y_true, y_pred):
        y_true = tf.squeeze(tf.cast(y_true, tf.int32))
        vocab_size = y_pred.shape[-1]

        y_pred = tf.clip_by_value(y_pred, K.epsilon(), 1 - K.epsilon())
        truth = tf.one_hot(y_true, depth=vocab_size, dtype=tf.float32)

        p_1 = -alpha * tf.reduce_sum(truth * tf.pow(1. - y_pred, gamma) * tf.math.log(y_pred), axis=-1)
        p_0 = - (1 - alpha) * tf.reduce_sum((1 - truth) * tf.pow(y_pred, gamma) * tf.math.log(1. - y_pred), axis=-1)

        cost = p_1 + p_0

        return cost

    return call


#loss = keras.losses.sparse_categorical_crossentropy
loss = focal_loss(gamma=2., alpha=.25)

**Model list:**

1. Embedded GRU - 2 stacked GRU cells 256 units each
2. Embedded Bidirectional GRU - 2 stacked bi-directional GRU cells 256 units each
3. Embedded GRU encoder-decoder model - 1 GRU cell as an encoder and 1 GRU cell as a decoder (256 units each)
4. Embedded GRU encoder-decoder model with Multiplicative Attention

#### Model 1 - Embedded GRU

In [0]:
def embedded_gru_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
        
        
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.GRU(gru_units, return_sequences=True)(embedded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.GRU(gru_units, return_sequences=True)(rnn)
    )
    logits = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.TimeDistributed(keras.layers.Dense(4*gru_units, activation='elu'))(rnn)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(logits)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Train the neural network
keras.backend.clear_session()
embed_rnn_model = embedded_gru_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
embed_rnn_model.summary()

Model summary:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 148)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 148, 16)           56000     
_________________________________________________________________
time_distributed (TimeDistri (None, 148, 16)           64        
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 148, 256)          210432    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 148, 256)          1024      
_________________________________________________________________
unified_gru_1 (UnifiedGRU)   (None, 148, 256)          394752    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 148, 256) 

In [0]:
embed_rnn_model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split)

Train on 13275 samples, validate on 1476 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fb5ca71d8d0>

In [0]:
# Print prediction(s)
sentense_id = 2
x_sample = dataset['x'][sentense_id]
y_sample = dataset['y'][sentense_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Source('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
prediction = embed_rnn_model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))

Source('en') example: our team consists of highly experienced professionals who have already successfully implemented several [rare]
Source('ru') example: наша команда состоит из [rare] специалистов которые уже реализовали ряд успешных проектов
Translation(en_ru) example: наша [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare]


#### Model 2 - BiDirectional GRU

In [0]:
def bd_gru_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Bidirectional(keras.layers.GRU(int(gru_units/2), return_sequences=True))(embedded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Bidirectional(keras.layers.GRU(int(gru_units/2), return_sequences=True))(rnn)
    )
    logits = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.TimeDistributed(keras.layers.Dense(4*gru_units, activation='elu'))(rnn)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(logits)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Train the neural network
keras.backend.clear_session()
bd_rnn_model = embedded_gru_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
bd_rnn_model.summary()

Model summary:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 148)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 148, 16)           56000     
_________________________________________________________________
time_distributed (TimeDistri (None, 148, 16)           64        
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 148, 256)          210432    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 148, 256)          1024      
_________________________________________________________________
unified_gru_1 (UnifiedGRU)   (None, 148, 256)          394752    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 148, 256) 

In [0]:
bd_rnn_model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split
)

Train on 13275 samples, validate on 1476 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb54e6a3f60>

In [0]:
# Print prediction(s)
sentense_id = 2
x_sample = dataset['x'][sentense_id]
y_sample = dataset['y'][sentense_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Source('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
prediction = bd_rnn_model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))

Source('en') example: our team consists of highly experienced professionals who have already successfully implemented several [rare]
Source('ru') example: наша команда состоит из [rare] специалистов которые уже реализовали ряд успешных проектов
Translation(en_ru) example: антибактериальная команда [rare] [rare] [rare] специалистов [rare] [rare] реализовали реализовали в [rare] [rare] [rare]


###Model 3 - Seuqnce 2 Sequence

This model accumulates the hidden state during the encoder stage and then uses this state to make predictions during the decoder step.

In [0]:
class Encoder(keras.Model):


    def __init__(self, vocab_size: int, gru_units: int, embeddings_units: int):

        super(Encoder, self).__init__()

        self.embedding = keras.layers.Embedding(vocab_size, embeddings_units)
        self.BN_embedding = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))
        self.gru = keras.layers.GRU(gru_units, return_sequences=False, return_state=True)

    def call(self, x):

        embedded = self.embedding(x)
        bn_embedding = self.BN_embedding(embedded)
        h_s = self.gru(bn_embedding)

        return h_s[0]


class Decoder(keras.Model):


    def __init__(self, vocab_size: int, gru_units: int, embeddings_units: int):
        super(Decoder, self).__init__()

        self.hs_BN = keras.layers.BatchNormalization(axis=-1)
        # the embedding layer needs to contain 1 additional token for the start of the sequence (dataset['y'].max()+1)
        self.embedding = keras.layers.Embedding(vocab_size+1, embeddings_units)
        self.embedding_BN = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))
        self.gru = keras.layers.GRU(gru_units, return_sequences=True, return_state=True)
        self.gru_BN = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))
        self.dense = keras.layers.Dense(vocab_size, activation='softmax')


    def call(self, x, hidden):

        embedded = self.embedding_BN(
                self.embedding(x)
        )

        seq, h_t = self.gru(embedded, initial_state = self.hs_BN(hidden))
        p = self.dense(self.gru_BN(seq))

        return p, h_t

Additional element of the Encoder-Decoder architecture is the teacher-student relationship between the training function and the model. The training function during the decoding phase, uses forcefeeding of the lables as an input into the decoder. This setup allows to simulate the prediction of the correct translation for each word and reusing that prediciton to influence the prediction for the next word.

In [0]:
@tf.function
def train_step(inputs, targets,
               encoder, decoder,
               loss_fn, optimizer,
               start_word_index):

    loss = 0.0
    BATCH_SIZE = int(targets.shape[0])
    N_STEPS = int(targets.shape[1])

    with tf.GradientTape() as tape:
        h_s = encoder(inputs=inputs)

        # Kick-off decoding with a start word
        dec_input = tf.expand_dims([start_word_index] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        h_t = h_s
        for t in range(N_STEPS):
            p, h_t = decoder(dec_input, hidden=h_t)
            # normalize loss across time dimension
            loss += loss_fn(targets[:, t], p) / tf.cast(N_STEPS, p.dtype)
            # using teacher forcing
            dec_input = tf.expand_dims(targets[:, t], 1)

    # obtain all trainable variables
    variables = encoder.trainable_variables + decoder.trainable_variables

    # obtain gradient history across all steps with respect to trainable variables
    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return tf.reduce_mean(loss)

To allow a custom training step, we need to create our custom trianing loop function - fit:

In [0]:
def fit(inputs: np.ndarray, targets: np.ndarray, batch_size: int, epochs: int,
        train_step: Callable[[np.ndarray, np.ndarray], float]):

    steps = inputs.shape[0] // batch_size
    dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))
    dataset = dataset.batch(batch_size, drop_remainder=True)

    for epoch in range(epochs):

        pbar = tqdm(range(steps), desc="Epoch {}".format(epoch))
        minibatch = enumerate(dataset.take(steps))

        for i in pbar:

            _, (x, y) = next(minibatch)
            loss = train_step(x, y)
            pbar.set_postfix(ordered_dict={"loss": loss.numpy()})

Putting it all together

In [0]:
keras.backend.clear_session()
# remember 1 token is reserved for unknown words
encoder = Encoder(vocab_size=dataset['x'].max() + 1, gru_units=gru_units, embeddings_units=embeddings_units)
# remember 1 token is reserved for unknown words
decoder = Decoder(vocab_size=dataset['y'].max() + 1, gru_units=gru_units, embeddings_units=embeddings_units)

optimizer = tf.keras.optimizers.Adam(lr=learning_rate)

loss_fn = keras.losses.sparse_categorical_crossentropy
# the decoder contains embedding with target vocabulary size + 1. Additional (+1) token is reserved for the start
# token.
train_step_fn = partial(train_step, encoder=encoder, decoder=decoder, loss_fn=loss_fn, optimizer=optimizer,
                        start_word_index=dataset['y'].max()+1)

fit(dataset['x'], dataset['y'], batch_size=batch_size, epochs=epochs, train_step=train_step_fn)