# Machine Translation Project


The goal of this project is to compare the performance of the following recurrent models:

1. Embedded GRU
2. Embedded Bidirectional GRU
3. Embedded GRU encoder-decoder model
4. Embedded GRU encoder-decoder model with Multiplicative Attention

The models are implemented in TensorFlow 2.0 with Keras as the high-level API. They are trained and analyzed on the [TedHrlrTranslate dataset](https://www.tensorflow.org/datasets/datasets#ted_hrlr_translate).
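
To make the difference between models 3 and 4 concrete, here is a minimal pure-Python sketch of one multiplicative attention step. It assumes the common "general" form, where the score for each encoder state is `decoder_state^T · W · encoder_state`. The function name, argument names, and the list-based representation are illustrative assumptions, not the project's actual Keras layers.

```python
import math

def multiplicative_attention(decoder_state, encoder_states, W):
    """One attention step (illustrative sketch, not the project's layer).

    decoder_state: list of floats, length d_dec
    encoder_states: list of lists, each of length d_enc
    W: d_dec x d_enc weight matrix as a list of rows
    Returns (attention weights, context vector).
    """
    # Project the decoder state once: q = decoder_state^T W
    q = [sum(decoder_state[i] * W[i][j] for i in range(len(decoder_state)))
         for j in range(len(W[0]))]
    # One score per encoder position: q . h_s
    scores = [sum(qj * hj for qj, hj in zip(q, h)) for h in encoder_states]
    # Softmax over positions, shifted by the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

In a trained model `W` would be a learned parameter. The decoder would typically combine the context vector with its own state before predicting the next word.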

In [4]:
from tqdm import tqdm, tqdm_notebook
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from tensorflow_datasets.translate.ted_hrlr import TedHrlrTranslate

Data loading, extraction, and transformation are handled by the `data_etl()` function. It returns a dictionary with the source data stored under the 'x' key and the target data under the 'y' key. The dictionary also contains the fitted x and y tokenizers (stored as 'x_tk' and 'y_tk') and the maximum source and target sequence lengths ('x_length' and 'y_length'):

In [2]:
def data_etl(lang_pairs: str = 'ru_to_en', download_dir: str = ".") -> dict:
    print("Start data ETL")
    # Download the dataset for the language pair given by lang_pairs
    builder = TedHrlrTranslate(data_dir=download_dir, config=lang_pairs)
    builder.download_and_prepare()
    datasets = builder.as_dataset()
    print("Downloaded successfully")

    # extract data
    target, source = [], []
    for dataset_name in ['train', 'test', 'validation']:
        # extract dataset
        dataset = datasets[dataset_name]
        # convert into numpy
        dataset = tfds.as_numpy(dataset)
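        # decode the byte strings; the 'ru' and 'en' keys are hardcoded,
        # so a different lang_pairs value requires changing them
        dataset = [(features['ru'].decode("utf-8"), features['en'].decode("utf-8")) for features in dataset]
        # English sentences are the source (x), Russian sentences the target (y)
        source.extend(en for _, en in dataset)
        target.extend(ru for ru, _ in dataset)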

    print("Extracted successfully")

    # Tokenize
    x, x_tk = tokenize(source)
    y, y_tk = tokenize(target)

    x, x_length = pad(x)
    y, y_length = pad(y)

    print("Transformed successfully")

    return {'x': x, 'y': y, 'x_tk': x_tk, 'y_tk': y_tk, 'x_length': x_length, 'y_length': y_length}

def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    x_tk = keras.preprocessing.text.Tokenizer()
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

def pad(x, length=None) -> tuple:
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Tuple of (padded numpy array of sequences, length the sequences were padded to)
    """
    if length is None:
        length = max([len(sentence) for sentence in x])

    return keras.preprocessing.sequence.pad_sequences(x, maxlen=length, padding='post'), length

Calling `dataset = data_etl()` returns a dictionary with the following structure:

{
'x': np.ndarray,
'y': np.ndarray,
'x_tk': keras.preprocessing.text.Tokenizer,
'y_tk': keras.preprocessing.text.Tokenizer,
'x_length': int,
'y_length': int
}
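
To show what the tokenize and pad steps produce, here is a simplified pure-Python stand-in run on a toy corpus. `toy_tokenize` and `toy_pad` are hypothetical helpers written for illustration. They mimic the frequency-ranked word ids (starting at 1) and the zero post-padding used above, but unlike the Keras `Tokenizer` defaults they do not strip punctuation.

```python
from collections import Counter

def toy_tokenize(sentences):
    """Simplified stand-in for the Keras Tokenizer (illustration only)."""
    # Lowercase and split on whitespace
    tokens = [s.lower().split() for s in sentences]
    counts = Counter(word for sent in tokens for word in sent)
    # More frequent words get smaller ids; ids start at 1, 0 is left for padding
    word_index = {word: i
                  for i, (word, _) in enumerate(counts.most_common(), start=1)}
    sequences = [[word_index[w] for w in sent] for sent in tokens]
    return sequences, word_index

def toy_pad(sequences):
    """Post-pad every sequence with zeros to the length of the longest one."""
    length = max(len(seq) for seq in sequences)
    padded = [seq + [0] * (length - len(seq)) for seq in sequences]
    return padded, length

sequences, word_index = toy_tokenize(["the cat sat", "the dog sat down quickly"])
padded, length = toy_pad(sequences)
print(padded)   # [[1, 3, 2, 0, 0], [1, 4, 2, 5, 6]]
print(length)   # 5
```

The shorter sentence is filled with trailing zeros, so every row has the same length and the data can be stacked into a single array.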


In [3]:
dataset = data_etl()

Start data ETL
Downloaded successfully
Extracted successfully
Transformed successfully
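
With the dictionary loaded, a likely next step is to derive the vocabulary sizes that embedding layers need. The sketch below assumes that is how the values are used, since the model code is not shown in this section. `vocab_sizes` is a hypothetical helper. It relies on the Keras `Tokenizer` attribute `word_index`, whose ids start at 1, and adds one slot for the padding id 0.

```python
def vocab_sizes(dataset: dict) -> tuple:
    """Return (source vocabulary size, target vocabulary size).

    Hypothetical helper: expects the dictionary layout returned by data_etl().
    """
    # word_index maps each word to an id >= 1; id 0 is the padding value, hence + 1
    x_vocab = len(dataset['x_tk'].word_index) + 1
    y_vocab = len(dataset['y_tk'].word_index) + 1
    return x_vocab, y_vocab
```

Calling `x_vocab, y_vocab = vocab_sizes(dataset)` gives the values that could serve as `input_dim` for a source-side `Embedding` layer and as the width of the output softmax.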
