# Machine Translation Project


The goal of the project is to compare the strength of the following recurrent models:

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

The models implemented in Tensorflow 2.0 with Keras as a high-level API. Models are trained and analyzed based on [TedHrlrTranslate dataset](https://www.tensorflow.org/datasets/datasets#ted_hrlr_translate).

In [97]:
import re
import numpy as np
from functools import partial
from tqdm import tqdm, tqdm_notebook
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from tensorflow_datasets.translate.ted_hrlr import TedHrlrTranslate

## Data ETL

The data load, extraction, and transformation is done with data_etl() method. This method returns a dictionary containing source data stored under 'x' label. Target data is stored under 'y' label. In addition to the source and target data, the dictionary contains x and y tockenizers (stored as 'x_tk' and 'y_tk'):

In [111]:
def data_etl(lang_pairs: str = 'ru_to_en', download_dir: str = ".") -> dict:
    print("Start data ETL")
    # Download a language data-set specified by :param language_pairs
    builder = TedHrlrTranslate(data_dir=download_dir, config=lang_pairs)
    builder.download_and_prepare()
    datasets = builder.as_dataset()
    print("Downloaded successfully")

    # extract data
    target, source = [], []
    for dataset_name in ['train', 'test', 'validation']:
        # extract dataset
        dataset = datasets[dataset_name]
        # convert into numpy
        dataset = tfds.as_numpy(dataset)
        # convert to string
        dataset = list(map(lambda features: (features['ru'].decode("utf-8"), features['en'].decode("utf-8")), dataset))
        source.extend([t[1] for t in dataset])
        target.extend([t[0] for t in dataset])

    print("Extracted successfully")
    
    source = [re.sub("[0-9]", " \g<0>", re.sub("[^a-zA-Z0-0/-]", " ", s.lower())) for s in source]
    target = [re.sub("[0-9]", " \g<0>", re.sub("[^а-яА-ЯёЁ0-9/-]", " ", s.lower())) for s in target]

    # Tockenize
    x, x_tk = tokenize(source)
    y, y_tk = tokenize(target)

    x = pad(x)
    y = pad(y)

    print("Transformed successfully")

    return {'x': x, 'y': y, 'x_tk': x_tk, 'y_tk': y_tk}

def tokenize(x, num_words=5000, filters_regex=None):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :n_words: Limit of the number of words that will be kept
    :filters_regex: Regular expression filtering out words
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    if filters_regex:
        x_tk = keras.preprocessing.text.Tokenizer(num_words=num_words, filters=filters_regex)
    else:
        x_tk = keras.preprocessing.text.Tokenizer(num_words=num_words)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    if length is None:
        length = max([len(sentence) for sentence in x])

    return keras.preprocessing.sequence.pad_sequences(x, maxlen=length, padding='post')

dataset = data_etl()

{
    'x': np.ndarray,
    'y': np.ndarray,
    'x_tk': keras.preprocessing.text.Tokenizer,
    'y_tk': keras.preprocessing.text.Tokenizer
}

In [112]:
dataset = data_etl()

Start data ETL
Downloaded successfully
Extracted successfully
Transformed successfully


## Utility Functions

In addition to the data ETL, the code below provides two additional functions for converting logits into word indicies and converting word indicies into text.

In [113]:
def logits_to_id(logits):
    """
    Turns logits into word ids
    :param logits: Logits from a neural network
    """
    return [prediction for prediction in np.argmax(logits, 1)]

def id_to_text(idx, tokenizer):
    """
    Turns id into text using the tokenizer
    :param idx: word id
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in idx]).replace(" <PAD>", "")

In [114]:
print("Here is an example for a samples number 1:")
print("Source('en') example:", id_to_text(dataset['x'][0], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][0], dataset['y_tk']))
print(" ")
print("Samples number 2:")
print("Source('en') example:", id_to_text(dataset['x'][1], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][1], dataset['y_tk']))
print("source vocabulary size:", dataset['x'].max())
print("target vocabulary size:", dataset['y'].max())
dataset['y_tk'].word_index

Here is an example for a samples number 1:
Source('en') example: and i d like to tell you the story in three acts and if i have time still an
Target('ru') example: и я хотел бы рассказать вам эту историю в трех а если останется время и
 
Samples number 2:
Source('en') example: is you re of it
Target('ru') example: вы не её
source vocabulary size: 4999
target vocabulary size: 4999


{'и': 1,
 'в': 2,
 'что': 3,
 'я': 4,
 'это': 5,
 'на': 6,
 'не': 7,
 'мы': 8,
 'с': 9,
 'как': 10,
 '0': 11,
 'но': 12,
 'то': 13,
 'вы': 14,
 'они': 15,
 'из': 16,
 'для': 17,
 'а': 18,
 'так': 19,
 'у': 20,
 'к': 21,
 'о': 22,
 'он': 23,
 'по': 24,
 'если': 25,
 'когда': 26,
 '1': 27,
 'чтобы': 28,
 'за': 29,
 'их': 30,
 'бы': 31,
 'или': 32,
 'есть': 33,
 'от': 34,
 'было': 35,
 'же': 36,
 'очень': 37,
 'все': 38,
 'вот': 39,
 'его': 40,
 '2': 41,
 'мне': 42,
 'которые': 43,
 'она': 44,
 'нас': 45,
 'меня': 46,
 'всё': 47,
 'нам': 48,
 'потому': 49,
 'только': 50,
 'смех': 51,
 'был': 52,
 'эти': 53,
 '5': 54,
 'лет': 55,
 'том': 56,
 'вам': 57,
 'чем': 58,
 'может': 59,
 'людей': 60,
 'быть': 61,
 'того': 62,
 'до': 63,
 'этого': 64,
 'можно': 65,
 'люди': 66,
 'просто': 67,
 'этот': 68,
 'больше': 69,
 'этом': 70,
 'где': 71,
 'были': 72,
 'который': 73,
 'ещё': 74,
 'была': 75,
 '9': 76,
 'время': 77,
 'более': 78,
 'нет': 79,
 'вас': 80,
 'во': 81,
 '3': 82,
 'здесь': 83,
 'сей

## Models

The models are implemented with a similar set of parameters. The main idea is to keep models as small and simple as possible to quickly train them and validate the difference the primarely derived from model architectures. The summary of main hyper parameters presented below:

* Mapping:
    - Embeddings - word indices will be mapped into a 16-dimentional space
    - Dense mapping - recurrence outputs mapped into the target-language space, represented with OHE, via Dense mapping
* Layers:
    - GRU - number of units 128
    - Bidirectional GRU - number of untis set up to 64 in order to keep the total number of units the same (128)
    - Batch Normalization - To speed up the training batch normalization is inserted after embeddings and before dense mapping
* Optimization:
    - Adam - all models trained with Adam optimizer and the same learning rate (1e-3)
* Loss function:
    - sparse_categorical_crossentropy_from_logits - keras.losses.sparse_categorical_crossentropy

In [115]:
n_samples = int(dataset['x'].shape[0] * 0.30)
learning_rate = 1e-3
embeddings_units = 16
gru_units = 128
epochs = 10
validation_split = 0.1
sparse_categorical_crossentropy_from_logits = partial(keras.losses.sparse_categorical_crossentropy, from_logits=True)

**Model list:**

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

#### Model 1 - Embedded GRU

In [124]:
def embedded_gru_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    embedded_seq = keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(input_seq)
    rnn = keras.layers.GRU(gru_units, return_sequences=True)(embedded_seq)
    logits = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size))(rnn)
    model = keras.Model(input_seq, logits)
    model.compile(loss=keras.losses.sparse_categorical_crossentropy,
                  optimizer=keras.optimizers.Adam(learning_rate),
                  metrics=['accuracy'])
    return model

# Train the neural network
embed_rnn_model = embedded_gru_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    len(dataset['x_tk'].word_index)+1,
    len(dataset['y_tk'].word_index)+1)
print("Model summary:")
embed_rnn_model.summary()
embed_rnn_model.fit(dataset['x'][:n_samples], 
                    keras.layers.ZeroPadding1D((0, dataset['x'].shape[1]-dataset['y'].shape[1]))(dataset['y'][:n_samples][:,:,None]), 
                    batch_size=512, 
                    epochs=epochs, 
                    validation_split=validation_split)
# Print prediction(s)
print(logits_to_text(embed_rnn_model.predict(dataset['x'][-2:])[0], dataset['y_tk']))

Model summary:
Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         [(None, 113)]             0         
_________________________________________________________________
embedding_7 (Embedding)      (None, 113, 16)           816640    
_________________________________________________________________
unified_gru_7 (UnifiedGRU)   (None, 113, 128)          56064     
_________________________________________________________________
time_distributed_6 (TimeDist (None, 113, 155245)       20026605  
Total params: 20,899,309
Trainable params: 20,899,309
Non-trainable params: 0
_________________________________________________________________
Train on 58964 samples, validate on 6552 samples
Epoch 1/10
  128/58964 [..............................] - ETA: 4:28:25 - loss: 5.6606 - accuracy: 0.6659   

KeyboardInterrupt: 