<a href="https://colab.research.google.com/github/ap-nlp-research/language_translation_en_ru_tf2/blob/master/MachineTranslationModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation Project


The goal of the project is to compare the strength of the following recurrent models:

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

The models implemented in Tensorflow 2.0 with Keras as a high-level API. Models are trained and analyzed based on EN-RU [wmt19_translate dataset](https://www.tensorflow.org/datasets/datasets#wmt19_translate) ([ACL 2019 FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT19)](http://www.statmt.org/wmt19/translation-task.html)).

In [1]:
!pip install tensorflow-gpu==2.0.0-alpha0
!git clone https://github.com/ap-nlp-research/language_translation_en_ru_tf2.git translation_en_ru

Cloning into 'translation_en_ru'...
remote: Enumerating objects: 61, done.[K
remote: Counting objects: 100% (61/61), done.[K
remote: Compressing objects: 100% (52/52), done.[K
remote: Total 61 (delta 19), reused 33 (delta 8), pack-reused 0[K
Unpacking objects: 100% (61/61), done.


In [0]:
import os
import pickle as pk
import subprocess
import re
import numpy as np
from functools import partial
import tensorflow as tf
from tensorflow import keras

## Data ETL

The data load, extraction, and transformation was done beforehand with [create_dataset_en_ru.py](./create_dataset_en_ru.py) script. This script stores a dictionary containing source data under 'x' label. Target data is stored under 'y' label. In addition to the source and target data, the dictionary contains x and y tockenizers (stored as 'x_tk' and 'y_tk'):

dataset: dict

{
    'x': np.ndarray,
    'y': np.ndarray,
    'x_tk': keras.preprocessing.text.Tokenizer,
    'y_tk': keras.preprocessing.text.Tokenizer
}

In [0]:
with open("./translation_en_ru/data/wlm_en_ru.pkl", 'rb') as file:
    dataset = pk.load(file)

## Utility Functions

In addition to the data ETL, the code below provides two additional functions for converting logits into word indicies and converting word indicies into text.

In [0]:
def logits_to_id(logits):
    """
    Turns logits into word ids
    :param logits: Logits from a neural network
    """
    return np.argmax(logits, 1)

def id_to_text(idx, tokenizer):
    """
    Turns id into text using the tokenizer
    :param idx: word id
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in idx]).replace(" <PAD>", "")

In [13]:
print("Here is an example for a sample number 1:")
print("Source('en') example:", id_to_text(dataset['x'][0], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][0], dataset['y_tk']))
print(" ")
print("A sample number 2:")
print("Source('en') example:", id_to_text(dataset['x'][-10], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][-10], dataset['y_tk']))
print("source vocabulary size:", dataset['x'].max())
print("target vocabulary size:", dataset['y'].max())
print("Source shape:", dataset['x'].shape)
print("Target shape:", dataset['y'].shape)

Here is an example for a sample number 1:
Source('en') example: the company has been registered with the municipal court in prague in section b file 1 4 8 5 7
Target('ru') example: фирма зарегистрирована в городском суде в г праге раздел б вкладыш 1 4 8 5 7
 
A sample number 2:
Source('en') example: we all need to [rare] [rare] words because it is for our own good
Target('ru') example: мы все нуждаемся в словах любви и они делают нас лучше
source vocabulary size: 3499
target vocabulary size: 14999
Source shape: (13643, 136)
Target shape: (13643, 136)


## Models

The models are implemented with a similar set of parameters. The main idea is to keep models as small and simple as possible to quickly train them and validate the difference the primarely derived from model architectures. The summary of main hyper parameters presented below:

* Mapping:
    - Embeddings - word indices will be mapped into a 16-dimentional space
    - Dense mapping - recurrence outputs mapped into the target-language space, represented with OHE, via Dense mapping
* Layers:
    - GRU - number of units 256
    - Bidirectional GRU - number of untis set up to 128 in order to keep the total number of units the same (256)
    - Batch Normalization - To speed up the training batch normalization is inserted after embeddings and before dense mapping
* Optimization:
    - Adam - all models trained with Adam optimizer and the same learning rate (1e-3)
* Loss function:
    - sparse_categorical_crossentropy_from_logits - keras.losses.sparse_categorical_crossentropy

In [0]:
learning_rate = 1e-3
embeddings_units = 32
gru_units = 256
epochs = 10
validation_split = 0.1
batch_size = 256
loss = keras.losses.sparse_categorical_crossentropy

**Model list:**

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

#### Model 1 - Embedded GRU

In [7]:
def embedded_gru_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.GRU(int(gru_units), return_sequences=True)(embedded_seq)
    )
    logits = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.TimeDistributed(keras.layers.Dense(4*gru_units, activation='elu'))(rnn)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(logits)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Train the neural network
embed_rnn_model = embedded_gru_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
embed_rnn_model.summary()

Model summary:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 136)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 136, 32)           112000    
_________________________________________________________________
time_distributed (TimeDistri (None, 136, 32)           128       
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 136, 256)          222720    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 136, 256)          1024      
_________________________________________________________________
time_distributed_3 (TimeDist (None, 136, 1024)         263168    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 136, 1024)

In [8]:
embed_rnn_model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split)

Train on 12278 samples, validate on 1365 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f4958dc51d0>

In [9]:
# Print prediction(s)
sentense_id = 2
x_sample = dataset['x'][sentense_id]
y_sample = dataset['y'][sentense_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Source('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
prediction = embed_rnn_model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))

Source('en') example: about us news [rare] contacts
Source('ru') example: o нас новости референции контакт
Translation(en_ru) example: [rare] [rare] новости [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [ra

#### Model 2 - BiDirectional GRU

In [10]:
def bd_gru_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Bidirectional(keras.layers.GRU(int(gru_units/2), return_sequences=True))(embedded_seq)
    )
    logits = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.TimeDistributed(keras.layers.Dense(4*gru_units, activation='elu'))(rnn)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(logits)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Train the neural network
bd_rnn_model = embedded_gru_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
bd_rnn_model.summary()

Model summary:
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 136)]             0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 136, 32)           112000    
_________________________________________________________________
time_distributed_5 (TimeDist (None, 136, 32)           128       
_________________________________________________________________
unified_gru_1 (UnifiedGRU)   (None, 136, 256)          222720    
_________________________________________________________________
time_distributed_6 (TimeDist (None, 136, 256)          1024      
_________________________________________________________________
time_distributed_8 (TimeDist (None, 136, 1024)         263168    
_________________________________________________________________
time_distributed_7 (TimeDist (None, 136, 102

In [11]:
bd_rnn_model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split
)

Train on 12278 samples, validate on 1365 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f48a739ac50>

In [12]:
# Print prediction(s)
sentense_id = 2
x_sample = dataset['x'][sentense_id]
y_sample = dataset['y'][sentense_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Source('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
prediction = bd_rnn_model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))

Source('en') example: about us news [rare] contacts
Source('ru') example: o нас новости референции контакт
Translation(en_ru) example: [rare] [rare] новости [rare] новости [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [rare] [r