# Statistical Machine Translation

Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text from a source language to a target language given a large corpus of examples. This task of using a statistical model can be stated formally as follows:
    
    Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize Pr(S|T).
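
The objective in the quotation is usually rewritten with Bayes' rule before it is modeled. As a short worked step (the symbol $\hat{S}$ for the chosen sentence is introduced here for illustration and does not appear in the quotation):

```latex
\hat{S} = \arg\max_{S} \Pr(S \mid T)
        = \arg\max_{S} \frac{\Pr(S)\,\Pr(T \mid S)}{\Pr(T)}
        = \arg\max_{S} \Pr(S)\,\Pr(T \mid S)
```

The denominator $\Pr(T)$ does not depend on $S$, so it can be dropped from the maximization. What remains is a language model $\Pr(S)$ and a translation model $\Pr(T \mid S)$, which can be estimated separately.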

## What is Neural Machine Translation?

Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model for machine translation. The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

## Encoder-Decoder Model

Multilayer Perceptron neural network models can be used for machine translation, although the models are limited by a fixed-length input sequence where the output must be the same length. These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allows for variable-length input and output sequences.

Key to the encoder-decoder architecture is the ability of the model to encode the source text into an internal fixed-length representation called the context vector. Interestingly, once encoded, different decoding systems could be used, in principle, to translate the context into different languages.

    ... one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the “context” C. [...] A second model, usually an RNN, then reads the context C and generates a sentence in the target language.
    
## Encoder-Decoders with Attention

Although effective, the Encoder-Decoder architecture has problems with long sequences of text to be translated. The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence. The solution is the use of an attention mechanism that allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.

    Using a fixed-sized representation to capture all the semantic details of a very long sentence [...] is very difficult. [...] A more efficient approach, however, is to read the whole sentence or paragraph [...], then to produce the translated words one at a time, each time focusing on a different part of the input sentence to gather the semantic details required to produce the next output word.
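
To make the idea of "focusing on a different part of the input" concrete, here is a minimal sketch of one attention step. It assumes the encoder has produced one state vector per source word and scores each state with a plain dot product against the current decoder state. The dot-product scoring, the function names, and the toy vectors are assumptions made for brevity, not the mechanism of any system cited here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step.

    Scoring is a dot product, an assumption of this sketch.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, hs))
              for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the weighted average of the encoder states.
    context = [sum(w * hs[i] for w, hs in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Three source positions, each with a 2-dimensional encoder state (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 2.0], encoder_states)
```

The weights are recomputed for every output word, so the context changes from step to step instead of being a single fixed vector for the whole sentence.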

The encoder-decoder recurrent neural network architecture with attention is currently the state-of-the-art on some benchmark problems for machine translation. This architecture is also at the heart of the Google Neural Machine Translation system, or GNMT, used in the Google Translate service.

Although effective, neural machine translation systems still suffer from some issues, such as scaling to larger vocabularies of words and the slow speed of training the models. These are the current areas of focus for large production neural translation systems, such as the Google system.


# Encoder-Decoder Models for NMT

## Sutskever NMT Model

In this section, we will look at the neural machine translation model developed by Ilya Sutskever, et al. as described in their 2014 paper Sequence to Sequence Learning with Neural Networks. We will refer to it as the Sutskever NMT Model, for lack of a better name. This is an important paper as it was one of the first to introduce the Encoder-Decoder model for machine translation and, more generally, sequence-to-sequence learning. It is an important model in the field of machine translation as it was one of the first neural machine translation systems to outperform a baseline statistical machine translation model on a large translation task.

### Problem

The model was applied to English to French translation, specifically the WMT 2014 translation task. The translation task was processed one sentence at a time, and an end-of-sequence (`<EOS>`) token was added to the end of output sequences during training to signify the end of the translated sequence. This allowed the model to predict variable-length output sequences.

### Model

An Encoder-Decoder architecture was developed where an input sequence was read in entirety and encoded to a fixed-length internal representation. A decoder network then used this internal representation to output words until the end of sequence token was reached. LSTM networks were used for both the encoder and decoder.

    The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector.
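
The role of the end-of-sequence token can be sketched as a short decoding loop. This is a simplified illustration: it decodes greedily, one token at a time, whereas the paper used a beam search. The `step` function is a hypothetical stand-in for the decoder network, and the toy "decoder" below simply replays a stored sentence.

```python
EOS = "<EOS>"

def add_eos(target_tokens):
    """Training targets end with EOS so the model can learn when to stop."""
    return list(target_tokens) + [EOS]

def greedy_decode(context, step, max_len=50):
    """Emit one token at a time until EOS (max_len is a safety stop).

    `step(context, prefix)` is a hypothetical decoder interface: it returns
    the next token given the fixed-length context and the tokens so far.
    """
    output = []
    while len(output) < max_len:
        token = step(context, output)
        if token == EOS:
            break
        output.append(token)
    return output

def toy_step(context, prefix):
    """Toy decoder: replays a stored translation, then emits EOS."""
    reference = context["translation"]
    return reference[len(prefix)] if len(prefix) < len(reference) else EOS

toy_context = {"translation": ["le", "chat", "dort"]}
decoded = greedy_decode(toy_context, toy_step)
```

Because the loop stops when the decoder emits `<EOS>` rather than after a fixed number of steps, the output length is free to differ from the input length.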

The final model was an ensemble of 5 deep learning models. A left-to-right beam search was used during the inference of the translations.

![sutskever](images/sutskever.jpeg)

### Model Configuration

The following provides a summary of the model configuration taken from the paper:

- Input sequences were reversed.
- A 1000-dimensional word embedding layer was used to represent the input words.
- Softmax was used on the output layer.
- The input and output models had 4 layers with 1,000 units per layer.
- The model was fit for 7.5 epochs where some learning rate decay was performed.
- A batch size of 128 sequences was used during training.
- Gradient clipping was used during training to mitigate the chance of gradient explosions.
- Batches were comprised of sentences with roughly the same length to speed up computation.
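
Three of the items above are data-handling and training tricks that are easy to show in isolation. The sketch below is one plausible way to implement them and is not taken from the paper: sorting by source length is an assumed way to get batches of similar length, and the clipping threshold and toy data are illustrative values.

```python
import math

def reverse_source(tokens):
    """Reverse the source sentence; the target stays in normal order."""
    return tokens[::-1]

def length_bucketed_batches(pairs, batch_size):
    """Sort (source, target) pairs by source length, then slice into batches,
    so the sentences within a batch have roughly the same length."""
    ordered = sorted(pairs, key=lambda p: len(p[0]))
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

# Toy data: source lengths 4, 1, 3, 2 (targets are placeholders).
pairs = [("a b c d".split(), ["w"]), ("a".split(), ["x"]),
         ("a b c".split(), ["y"]), ("a b".split(), ["z"])]
batches = length_bucketed_batches(pairs, batch_size=2)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5 is scaled down to 1
```

Grouping by length reduces the amount of padding in each batch, which is where the speed-up comes from. Clipping keeps the direction of the gradient and only limits its size.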

### Result

The system achieved a BLEU score of 34.81, compared with 33.30 for the baseline statistical machine translation system. Importantly, this is the first example of a neural machine translation system that outperformed a phrase-based statistical machine translation baseline on a large-scale problem.

# How to Configure Encoder-Decoder Models for Machine Translation
 
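## Baseline Model

This section summarizes the findings of the 2017 paper Massive Exploration of Neural Machine Translation Architectures. The authors chose a baseline model configuration such that the model would perform reasonably well on the translation task.
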
- Embedding: 512 dimensions.
- RNN Cell: Gated Recurrent Unit or GRU.
- Encoder: Bidirectional.
- Encoder Depth: 2 layers (1 layer in each direction).
- Decoder Depth: 2 layers.
- Attention: Bahdanau-style.
- Optimizer: Adam.
- Dropout: 20% on input.
    
Each experiment started with the baseline model and varied one element in an attempt to isolate the impact of the design decision on the model skill, in this case, BLEU scores.
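
The "change one thing at a time" protocol can be sketched as a small configuration object. The values come from the baseline list above. The class name, field names, and the three example experiments are hypothetical and chosen only for illustration.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class NMTConfig:
    """Baseline settings; field names are illustrative, not from the paper."""
    embedding_dim: int = 512
    rnn_cell: str = "gru"
    encoder_bidirectional: bool = True
    encoder_layers: int = 2      # 1 layer in each direction
    decoder_layers: int = 2
    attention: str = "bahdanau"
    optimizer: str = "adam"
    input_dropout: float = 0.2

baseline = NMTConfig()

# Each experiment changes exactly one field relative to the baseline.
experiments = {
    "cell=lstm": replace(baseline, rnn_cell="lstm"),
    "embedding=128": replace(baseline, embedding_dim=128),
    "decoder_layers=4": replace(baseline, decoder_layers=4),
}

def changed_fields(config, reference):
    """List the field names whose values differ between two configs."""
    a, b = asdict(config), asdict(reference)
    return [name for name in a if a[name] != b[name]]
```

Making the config immutable and deriving every variant with `replace` means a difference in BLEU can be traced to the one field that `changed_fields` reports.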

![encoder](images/encoder_decoder.jpeg)

## Word Embedding Size

A word-embedding is used to represent words input to the encoder. This is a distributed representation where each word is mapped to a fixed-sized vector of continuous values. The benefit of this approach is that different words with similar meaning will have a similar representation. This distributed representation is often learned while fitting the model on the training data. The embedding size defines the length of the vectors used to represent words. It is generally believed that a larger dimensionality will result in a more expressive representation, and in turn, better skill. Interestingly, the results show that the largest size tested did achieve the best results, but the benefit of increasing the size was minor overall.
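
As a small illustration of what the embedding size controls, the sketch below builds a lookup table that maps each word to a vector of a chosen length. The vocabulary, the random initialization range, and the function names are assumptions. In a real model the table entries are trainable parameters that are updated while fitting the model.

```python
import random

def build_embedding(vocab, dim, seed=0):
    """Map each word to a vector of `dim` small random floats."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for word in vocab}

def embed(tokens, table):
    """Replace each token with its vector from the lookup table."""
    return [table[t] for t in tokens]

vocab = ["the", "cat", "sat", "<unk>"]   # toy vocabulary
table = build_embedding(vocab, dim=128)
vectors = embed(["the", "cat"], table)

# The table holds vocab_size * dim values, so the parameter count
# grows linearly with the embedding size.
param_count = len(vocab) * 128
```

With a realistic vocabulary of tens of thousands of words, the embedding table is a large share of the model's parameters, which is one reason to start with a small size.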


Recommendation: Start with a small embedding, such as 128 dimensions, and perhaps increase the size later for a minor lift in skill.

## RNN Cell Type

There are generally three types of recurrent neural network cells that are commonly used:

- Simple RNN.
- Long Short-Term Memory or LSTM.
- Gated Recurrent Unit or GRU.

The LSTM was developed to address the vanishing gradient problem of the Simple RNN that limited the training of deep RNNs. The GRU was developed in an attempt to simplify the LSTM. Results showed that both the GRU and LSTM were significantly better than the Simple RNN, but the LSTM was generally better overall.
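
One way to see in what sense the GRU is "simpler" is to count parameters. In their standard formulations, a simple RNN layer has one weight block, a GRU has three (two gates plus a candidate state), and an LSTM has four (three gates plus a candidate cell state). The sketch below assumes one bias vector per block and no peephole connections. Some libraries use two bias vectors per block, so their counts differ slightly. The layer sizes are illustrative.

```python
# Number of weight blocks per cell type in the standard formulations.
BLOCKS = {"simple": 1, "gru": 3, "lstm": 4}

def rnn_param_count(cell, input_dim, hidden_dim):
    """Parameters in one recurrent layer, assuming one bias per block."""
    per_block = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return BLOCKS[cell] * per_block

counts = {cell: rnn_param_count(cell, input_dim=512, hidden_dim=512)
          for cell in BLOCKS}
```

For these sizes the counts are 524,800 for the simple RNN, 1,574,400 for the GRU, and 2,099,200 for the LSTM, so a GRU layer has three quarters of the parameters of an LSTM layer of the same width.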

    In our experiments, LSTM cells consistently outperformed GRU cells.

Source: Massive Exploration of Neural Machine Translation Architectures, 2017.

Recommendation: Use LSTM RNN units in your model.

## Encoder-Decoder Depth

Generally, deeper networks are believed to achieve better performance than shallow networks. The key is to find a balance between network depth, model skill, and training time. This is because we generally do not have infinite resources to train very deep networks if the benefit to skill is minor. The authors explore the depth of both the encoder and decoder models and the impact on model skill. When it comes to encoders, it was found that depth did not have a dramatic impact on skill and more surprisingly, a 1-layer bidirectional model performs only slightly better than a 4-layer bidirectional configuration. A two-layer bidirectional encoder performed slightly better than other configurations tested.

Recommendation: Use a 1-layer bidirectional encoder and extend to 2 bidirectional layers for a small lift in skill.

A similar story was seen when it came to decoders. The skill of decoders with 1, 2, and 4 layers differed by only a small amount, with the 4-layer decoder slightly better. An 8-layer decoder did not converge under the test conditions.


Recommendation: Use a 1-layer decoder as a starting point and use a 4-layer decoder for better results.

## Direction of Encoder Input

The sequence of source text can be provided to the encoder in a number of ways:

- Forward or as-normal.
- Reversed.
- Both forward and reversed at the same time.

The authors explored the impact of the order of the input sequence on model skill, comparing various unidirectional and bidirectional configurations. Generally, they confirmed previous findings that a reversed sequence is better than a forward sequence and that bidirectional is slightly better than a reversed sequence.
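
The three input orders can be sketched with a toy recurrent encoder. The scalar RNN below is a hypothetical stand-in for a real encoder layer, and its weights are arbitrary. For brevity the forward and backward passes share the same weights, whereas a real bidirectional encoder learns a separate set for each direction.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Toy scalar RNN: return the hidden state after each timestep."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def encode(inputs, direction):
    """Return one tuple of encoder features per processed position."""
    if direction == "forward":
        return [(h,) for h in run_rnn(inputs)]
    if direction == "reversed":
        return [(h,) for h in run_rnn(inputs[::-1])]
    if direction == "bidirectional":
        fwd = run_rnn(inputs)
        # Run over the reversed input, then flip the states back so that
        # index i again refers to source position i.
        bwd = run_rnn(inputs[::-1])[::-1]
        return [(f, b) for f, b in zip(fwd, bwd)]
    raise ValueError(f"unknown direction: {direction}")

source = [1.0, -2.0, 0.5]   # toy input features, one per source word
bi = encode(source, "bidirectional")
```

In the bidirectional case each source position gets two features, one summarizing the words to its left and one summarizing the words to its right, which is why it doubles the size of the encoder output.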

Recommendation: Use a reversed order input sequence or move to bidirectional for a small lift in model skill.

## Attention Mechanism

A problem with the naive Encoder-Decoder model is that the encoder maps the input to a fixed-length internal representation from which the decoder must produce the entire output sequence. Attention is an improvement to the model that allows the decoder to pay attention to different words in the input sequence as it outputs each word in the output sequence. The authors look at a few variations on simple attention mechanisms. The results show that having attention results in dramatically better performance than not having attention.

The simple weighted average style attention described by Bahdanau, et al. in their 2015 paper Neural machine translation by jointly learning to align and translate was found to perform the best.
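
For reference, the weighted average in that paper can be written out as follows. The symbols follow the common presentation of the paper and are not defined elsewhere in this text: $h_j$ is the encoder state for source position $j$, $s_{i-1}$ is the previous decoder state, and $W_a$, $U_a$, $v_a$ are learned parameters.

```latex
e_{ij} = v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right)
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}
\qquad
c_i = \sum_{j} \alpha_{ij} h_j
```

The score $e_{ij}$ measures how relevant source position $j$ is to output position $i$, the softmax turns the scores into weights $\alpha_{ij}$ that sum to one, and the context $c_i$ is the resulting weighted average of the encoder states.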

Recommendation: Use attention and prefer the Bahdanau-style weighted average attention.

## Inference

It is common in neural machine translation systems to use a beam-search to sample the probabilities for the words in the sequence output by the model. The wider the beam width, the more exhaustive the search, and, it is believed, the better the results. The results showed that a modest beam-width of 3-5 performed the best, which could be improved only very slightly through the use of length penalties. The authors generally recommend tuning the beam width on each specific problem.
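
A beam search can be sketched in a few lines. The version below is a minimal illustration, not the procedure of any particular system: `step` is a hypothetical model interface that returns next-token probabilities for a prefix, and the length penalty simply divides the log-probability by the length raised to a power, which is one of several forms in use. The toy model is built so that the greedy choice at the first step leads to a worse overall sequence.

```python
import math

EOS = "<EOS>"

def beam_search(step, beam_width, max_len=20, length_penalty=0.0):
    """Return the best token sequence found.

    `step(prefix)` is a hypothetical model interface that returns a dict
    {token: probability} for the next token given the prefix.
    """
    beams = [([], 0.0)]   # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for token, p in step(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + [token], logp + math.log(p)))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], logp))
            else:
                beams.append((tokens, logp))
        if not beams:
            break

    def score(item):
        tokens, logp = item
        return logp / (max(len(tokens), 1) ** length_penalty)

    return max(finished or beams, key=score)[0]

def toy_step(prefix):
    """Toy model: 'a' looks best first, but 'b z' is the likeliest sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    return table.get(tuple(prefix), {EOS: 1.0})

greedy = beam_search(toy_step, beam_width=1)
wider = beam_search(toy_step, beam_width=2)
```

With a beam width of 1 the search commits to `a` and ends with a sequence of probability 0.3, while a width of 2 keeps `b` alive and finds `b z` with probability 0.4. The cost is that each step evaluates the model once per hypothesis in the beam.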

Recommendation: Start with a greedy search (beam=1) and tune based on your problem.

## Final Model

![nmt_model](images/nmt_model.jpeg)