# Neural Machine Translation

## Seq2Seq Modelling

Seq2Seq models use neural networks to translate a piece of text from a source language to a target language. Introduced by Google, the approach has proved itself in a variety of applications, including machine translation, image captioning, conversation models and text summarization.

As the name suggests, a Seq2Seq model ingests words in sequence and translates them. The beauty of Seq2Seq is that it considers not only the current word but also its neighbouring words, which lets it capture better semantics. It uses an RNN to learn from training data and then translate an unseen piece of text from one language to another. More often than not, a specialized RNN such as an LSTM or a GRU is used, since the vanilla RNN suffers from the problem of vanishing gradients.

Seq2Seq models use a special arrangement of neural networks, viz. __Encoder-Decoder__ networks, to translate. The encoder takes in the source text and outputs an intermediate vector that represents the input sequence. The decoder then reads this intermediate vector and translates it into the target language. The following figure gives a pictorial representation of a Seq2Seq model.

<img src="../images/encodec_overview.png"/>

### Encoder

1. A stack of several recurrent units (LSTM or GRU cells for better performance), where each unit accepts a single element of the input sequence, collects information for that element and propagates it forward.

2. The input sequence is the collection of all words in the source sentence. Each word is represented as x_i, where i is the position of that word. The words are encoded into a dense vector that represents the input text. This vector aims to encapsulate the information from all input elements in order to help the decoder make accurate predictions. It acts as the initial hidden state of the decoder part of the model.
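
As a rough illustration of the encoder idea, here is a minimal NumPy sketch of a vanilla RNN that folds a sequence into one fixed-size vector. The weights are random rather than trained, and the names `encode`, `W_xh`, `W_hh` and `b_h` are assumptions made for this sketch; a real encoder would use trained LSTM or GRU cells.

```python
import numpy as np

def encode(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over `inputs` (shape: [steps, input_dim]) and
    return the final hidden state, which plays the role of the context vector."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        # Each step mixes the current element with everything seen so far
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Random (untrained) parameters, for illustration only
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Two made-up "sentences" of different lengths (2 and 7 word vectors)
short_sentence = rng.normal(size=(2, input_dim))
long_sentence = rng.normal(size=(7, input_dim))

context_short = encode(short_sentence, W_xh, W_hh, b_h)
context_long = encode(long_sentence, W_xh, W_hh, b_h)
print(context_short.shape, context_long.shape)
```

Whatever the length of the input, the encoder returns a vector of the same size, which is what allows the decoder to start from it.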


### Decoder

1. A stack of several recurrent units where each predicts an output y_t at a time step t.

2. Each recurrent unit accepts a hidden state from the previous unit (for the first decoder unit, this is the final state of the encoder) and produces an output as well as its own hidden state.

3. The output sequence is the collection of all words in the translated sentence. Each word is represented as y_i, where i is the position of that word.

The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the input and output sequences do not have to be aligned word for word and their lengths can differ. This opens up a whole new range of problems that can be solved using such an architecture.


The Encoder-Decoder model not only achieves better accuracy, but is also much more scalable than the traditional machine translation methods known thus far. In fact, seq2seq is so popular and effective that much of modern machine translation is built on it. Google Translate is one of the most popular platforms relying on seq2seq for translating text between languages.

### Uses in Industry

Seq2seq has proved its mettle and is being used in a lot of areas to solve complex problems. Let's take a look at some of these.

#### Machine Translation

Used to translate text from one language to another. We will study this in more detail in later sections.

#### Image Captioning

Image captioning is a technique in which a computer automatically generates captions for an image. For example, if there is a picture in which a cat is sitting on a table, the computer would generate a caption like "Cat sitting on table".

For those of us with knowledge of or interest in convolutional neural networks: the image is converted into a vector using conv-nets, and a seq2seq model is then trained on the resulting vector. This can be visualized using the image below:

<img src="../images/img_captioning.jpg"/>

#### Other uses

Seq2seq finds additional uses in:

* Conversation models (for chatbots etc.)
* Document summarization

### Seq2Seq in action

Having seen how seq2seq works at a very high level, let us try to understand how we actually translate a text from a source language to a target language.

For our study let us consider the following sentence in Spanish :

__Patrick visitará la India en septiembre.__

The task at hand is to translate this sentence into English. However, we do have some problems to solve first.

1. To translate the text correctly into the target language.

2. To check if the translated text is correct.

3. To select the best available translation (after 1 & 2) in the target language.

Point 1 is pretty clear in itself. However, point 2 brings with it additional problems, i.e., how do we verify that the translated text is correct? Point 3 further asks: assuming that several translations are correct, which one do we select? How do we determine the accuracy of a translation? For example, let us look at the following translations of the above text:

1. Patrick will visit India in September.

2. Patrick is going to visit India in September.

3. In September, Patrick will be visiting India.

4. In September, Patrick is welcome in India.

Translations 1-3 are more or less correct and convey the meaning of the source text. However, translation 1 is the best of the four because it conveys the meaning of the source text precisely and concisely and is also grammatically correct. Translation 4, on the other hand, goes awfully wrong. We will see how to tackle these problems one by one.

Let us now define the objective. The objective of a machine translation model is to translate a given piece of text in one language to its best possible match in the target language. In other words, we want to find the piece of text in the target language whose probability of being a correct translation of the source text is the highest. Mathematically, we have to find the translation (text in the target language) that has the highest conditional probability of being the correct translation given the source text. That is, we need to find the candidate $Y_i$ that maximizes $P(Y_i|X)$, where $Y$ is the set of all probable candidates for the correct translation and $X$ is the source text. To find the words from the target dictionary, we use a specialized algorithm called Beam Search.

__Beam Search__

In order to translate successfully, we need to find the words to begin with. Beam search is a specially designed algorithm to help us with that. It will find the most probable word(s) from the dictionary of the target language that will be used in the translated text. Let's take a look at the image below:-

<img src="../images/beamsearch_step1.png"/>


Following are the actions performed by the beam search algorithm to find out the words from the target language.

__Step 1__:

1. Given the dictionary of the target language and the beam width (explained in the figure), find the probability of each word in the dictionary given the source text:

    $P(Y_i|X)$, where $Y_i$ are the words from the dictionary and $X$ is the source text to be translated.
    
    Essentially, the beam search algorithm runs the source text through the encoder-decoder network, whose final softmax layer assigns a conditional probability to every word in the target dictionary, so that the best first words can be chosen.
    
    <img src="../images/beamsearch2.png"/>

2. Select the top n words (n = beam width) with the highest conditional probability. In this case, let us say the top three words that the beam search algorithm picks are "patrick", "in" and "september".
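
A minimal sketch of this first step, assuming a made-up six-word vocabulary and hand-picked scores (`vocab` and `logits` are illustrative values, not outputs of a trained network). It applies a softmax to the scores and keeps the `beam_width` most probable words.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical target vocabulary and decoder scores for the first word
vocab = ["patrick", "in", "september", "visit", "india", "welcome"]
logits = np.array([2.5, 1.8, 1.2, 0.3, 0.1, -1.0])
beam_width = 3

probs = softmax(logits)
# Indices of the beam_width highest-probability words, best first
top = np.argsort(probs)[::-1][:beam_width]
first_words = [vocab[i] for i in top]
print(first_words)
```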


__Step 2__:

1. After getting the most probable first words for the translation, the beam search algorithm needs to find the next words. To do that,
    * It keeps the first words in memory (as many words as the beam width) and calculates the probability of the second word given the first word and the input sentence. An intuition can be developed from the image below:
    <img src="../images/beamsearch_s20.png"/>

<img src="../images/beamsearch_step2.png"/>


As we can see from the image above, the algorithm now tries to find the second word while keeping the first set of words (from step 1) fixed. Mathematically, in this step we are trying to maximize the joint probability of the first two words given the input sentence:

$P(Y_1,Y_2|X)$

By the chain rule of conditional probability, this can also be written as:

$P(Y_1,Y_2|X)=P(Y_1|X)*P(Y_2|X,Y_1)$

When the above operation is carried out for every word in the dictionary and for each of the first words kept from step 1, we get the pairs of words with the highest probability as probable candidates for the translation.

The above equation can also be extended to n terms, for example

$P(Y_1...Y_n|X)=P(Y_1|X)*P(Y_2|X,Y_1)...P(Y_n|X,Y_1...Y_{n-1})$
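
To make the procedure concrete, here is a minimal sketch of beam search over a hand-written toy model. The names `beam_search`, `TOY_MODEL`, `toy_next_word_probs` and the `<eos>` end marker are assumptions made for this sketch, and the probabilities are invented; in a real system each next-word distribution would come from the decoder's softmax layer. Every partial translation is scored by the sum of the log-probabilities of its words, each conditioned on the words before it.

```python
import math

def beam_search(next_word_probs, beam_width, max_len, end_token="<eos>"):
    """Keep the `beam_width` best partial translations at every step."""
    beams = [((), 0.0)]  # (words so far, log-probability of those words)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == end_token:
                candidates.append((words, score))  # finished candidates carry over
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((words + (word,), score + math.log(p)))
        # Keep only the most probable candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(words[-1] == end_token for words, _ in beams):
            break
    return beams

# Invented next-word distributions, keyed by the words generated so far
TOY_MODEL = {
    (): {"patrick": 0.5, "in": 0.4, "september": 0.1},
    ("patrick",): {"will": 0.7, "is": 0.3},
    ("in",): {"september": 0.9, "india": 0.1},
    ("patrick", "will"): {"visit": 0.9, "be": 0.1},
    ("in", "september"): {"patrick": 0.6, "india": 0.4},
    ("patrick", "will", "visit"): {"india": 0.9, "<eos>": 0.1},
}

def toy_next_word_probs(prefix):
    # Any prefix not listed above simply ends the sentence
    return TOY_MODEL.get(prefix, {"<eos>": 1.0})

beams = beam_search(toy_next_word_probs, beam_width=2, max_len=6)
for words, score in beams:
    print(" ".join(words), round(math.exp(score), 4))
```

In this toy setup, "in september" is slightly ahead after two words, but the candidate starting with "patrick will" wins in the end. This is why beam search keeps several candidates instead of committing to the single best word at every step.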

The above steps are repeated until the algorithm has found all the probable words for the translation. The entire process, when combined, might look something like this:

<img src="../images/machine_translation.png"/>

#### Translation Evaluation: BLEU Score (Bilingual Evaluation Understudy)

Now, once we have a translation, how do we verify the results? The standard way to evaluate an MT model is to calculate the BLEU score. A more familiar measure for evaluating an ML model is __Precision__, so let's first see why plain precision will not work in our example. Let's take our translation example:

Spanish : Patrick visitará la India en septiembre.

Some human verified references :

1. Patrick will visit India in September.

2. Patrick is going to visit India in September.


Now, let's take a look at what a machine translated output could look like in extreme cases:

1. Patrick will visit India in September.

2. visit visit visit visit visit visit

Now, if we were to calculate the precision of the above outputs, we would get some unexpected results. Let's calculate the precision of machine translated output #2.

Recall that precision is the number of correct predictions (here, a predicted word is correct if it is present in a human reference) divided by the total number of predictions. The machine translated output contains 6 words, and every one of them ("visit") is present in reference 1. So this essentially gives us a precision of 6/6 = 1 for a poorly translated output.
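
The calculation above can be reproduced with a few lines of Python. This is only a sketch: the helpers `tokenize` and `unigram_precision` are names invented here, and the tokenization (lowercasing and dropping full stops) is a simplifying assumption.

```python
def tokenize(sentence):
    """Lowercase, drop full stops and split on whitespace (a simplification)."""
    return sentence.lower().replace(".", "").split()

def unigram_precision(candidate, references):
    """Fraction of candidate words that appear in any reference."""
    reference_words = set()
    for ref in references:
        reference_words.update(tokenize(ref))
    words = tokenize(candidate)
    matches = sum(1 for w in words if w in reference_words)
    return matches / len(words)

references = [
    "Patrick will visit India in September.",
    "Patrick is going to visit India in September.",
]
good = "Patrick will visit India in September."
bad = "visit visit visit visit visit visit"

print(unigram_precision(good, references))
print(unigram_precision(bad, references))
```

Both outputs receive the same precision, which is exactly the weakness described above.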

It is to address this problem that a new measure of the effectiveness of machine translation was invented.

The BLEU score is calculated in the following way:

1. n-grams are considered for calculating the score.

2. The count of each n-gram (in the above example, unigrams) is capped at the maximum number of times that n-gram occurs in any single reference. In our example, "visit" occurs at most once in a reference, so only 1 of the 6 predicted words counts as correct, giving a score of 1/6 instead of 1.
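
A minimal sketch of this capped (clipped) n-gram precision follows. The helper names `tokenize`, `ngrams` and `modified_precision` are assumptions of this sketch, as is the simple tokenization. Note that a complete BLEU score also combines several n-gram orders and applies a brevity penalty, neither of which is shown here.

```python
from collections import Counter

def tokenize(sentence):
    """Lowercase, drop full stops and split on whitespace (a simplification)."""
    return sentence.lower().replace(".", "").split()

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    """n-gram precision where each count is capped by its maximum count in any reference."""
    candidate_counts = Counter(ngrams(tokenize(candidate), n))
    max_reference_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return clipped / total if total else 0.0

references = [
    "Patrick will visit India in September.",
    "Patrick is going to visit India in September.",
]
good = "Patrick will visit India in September."
bad = "visit visit visit visit visit visit"

print(modified_precision(good, references))
print(modified_precision(bad, references))
```

With the cap in place, the repeated-word output drops to 1/6 while the good translation keeps a score of 1.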


Having gained insight into how a seq2seq model works, let's try to implement it in Python in the next section.

## Machine Translation in Python

In the last chapter we learnt about Seq2Seq modelling for machine translation. In this chapter we will learn how to implement a machine translation model in Python. We will learn how to translate sentences from a source language to another language by building a Seq2Seq model using Keras, a popular Python deep learning API.

Here is what we are going to do:

1. Turn the sentences into 3 NumPy arrays, __encoder_input_data, decoder_input_data, decoder_target_data__:

__encoder_input_data__ is a 3D array of shape (num_pairs, max_jap_sentence_length, num_jap_characters) containing a one-hot vectorization of the Japanese sentences.

__decoder_input_data__ is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.

__decoder_target_data__ is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].

2. Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. Our model uses teacher forcing.

3. Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
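
The sketch below builds the three arrays for two made-up sentence pairs, to show the shapes and the one-timestep offset. The pairs are invented, and the use of a tab as the start marker and a newline as the end marker of each target sentence is an assumption of this sketch.

```python
import numpy as np

# Made-up toy pairs; "\t" and "\n" are assumed start and end markers
pairs = [("hi", "\tsalut\n"), ("go", "\tva\n")]
input_texts = [src for src, _ in pairs]
target_texts = [tgt for _, tgt in pairs]

# Character vocabularies and character-to-index lookups
input_chars = sorted(set("".join(input_texts)))
target_chars = sorted(set("".join(target_texts)))
input_index = {c: i for i, c in enumerate(input_chars)}
target_index = {c: i for i, c in enumerate(target_chars)}

max_in = max(len(t) for t in input_texts)
max_out = max(len(t) for t in target_texts)

encoder_input_data = np.zeros((len(pairs), max_in, len(input_chars)), dtype="float32")
decoder_input_data = np.zeros((len(pairs), max_out, len(target_chars)), dtype="float32")
decoder_target_data = np.zeros_like(decoder_input_data)

for i, (src, tgt) in enumerate(pairs):
    for t, ch in enumerate(src):
        encoder_input_data[i, t, input_index[ch]] = 1.0
    for t, ch in enumerate(tgt):
        decoder_input_data[i, t, target_index[ch]] = 1.0
        if t > 0:
            # The target is the input shifted one step to the left
            decoder_target_data[i, t - 1, target_index[ch]] = 1.0

print(encoder_input_data.shape, decoder_input_data.shape)
print(np.array_equal(decoder_target_data[:, :-1], decoder_input_data[:, 1:]))
```

The last line checks the offset property: at every timestep the target holds the character that the decoder input holds one step later.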


Because the training process and the inference process (decoding sentences) are quite different, we use different models for each, although they all leverage the same inner layers.

This is our training model. It leverages three key features of Keras RNNs:

1. The `return_state` constructor argument, configuring an RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. This is used to recover the states of the encoder.

2. The `initial_state` call argument, specifying the initial state(s) of an RNN. This is used to pass the encoder states to the decoder as initial states.

3. The `return_sequences` constructor argument, configuring an RNN to return its full sequence of outputs (instead of just the last output, which is the default behavior). This is used in the decoder.

Now that we have an overview of what we are trying to achieve, let's look at the code snippets that we can use to build the Keras model.

First and foremost, we import the libraries that we will need. The following snippet does that:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense
```

In the snippet below, we define the encoder inputs as a Keras `Input` layer. The layer is given the shape of the input, whose last dimension is the number of distinct tokens in the input sequences (`num_encoder_tokens`).

We also define an LSTM layer, `encoder`.

We then call the encoder on the inputs to get the encoder output and the state variables. We discard `encoder_outputs` and keep the state variables, namely `state_h` and `state_c`, for further processing.

```python
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
```

We will now set up the decoder network to decode from the encoder states. The following are the things that we do in this phase:

1. Set up the decoder, using `encoder_states` as initial state.

2. Set up the decoder to return full output sequences, and to return internal states as well. We don't use the returned states in the training model, but we will use them in inference.

The following snippet shows the process in more detail.

``` python

# Setup Decoder network

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
```

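Once we have the decoder layers set up, we define a Keras model that takes `encoder_inputs` and `decoder_inputs` and produces `decoder_outputs`. During training these are fed with `encoder_input_data` and `decoder_input_data`, and the output is compared against `decoder_target_data`.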

```python
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```

Once we have the model, we train it on the input Japanese data and the corresponding English targets. The following snippet does that:


``` python
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
```

Once we have the trained model, we will use this model to decode the sentence and print out the output. 


Hopefully the above walkthrough brings some clarity on how you can implement your own machine translation system in Python. In the section below, we will implement the above concepts on a sample dataset that contains English-French translation pairs. Along the lines described above, let's start building an encoder-decoder model using Keras.

```python
# Import necessary packages
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
import pandas as pd
from numpy.random import seed
seed(1)
# TensorFlow 1.x API; TensorFlow 2.x provides tf.random.set_seed instead
from tensorflow import set_random_seed
set_random_seed(2)

# Setting training environments and other variables
train_batch_size = 64
epochs = 100
latent_dim=256
num_samples=10000
```

Using TensorFlow backend.


```python
# Explore the dataset a bit, aye ?
df = pd.read_csv("../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt",delimiter='\t')
df.head(5)
```

|   | Go. | Va ! |
|---|---|---|
| 0 | Run! | Cours ! |
| 1 | Run! | Courez ! |
| 2 | Wow! | Ça alors ! |
| 3 | Fire! | Au feu ! |
| 4 | Help! | À l'aide ! |


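As you can see, the dataset is a tab-delimited file of English-French translation pairs. Note that because `read_csv` was called without `header=None`, pandas treated the first pair (`Go.` / `Va !`) as the column header. Let us now prepare this dataset to suit the parameters required by the Keras APIs, as described in the section above: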

```python
input_texts=[]
target_texts=[]
input_chars=set()
target_chars=set()
with open("../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt", encoding='utf-8') as f:
    lines = f.read().split("\n")
for line in lines:
    try:
        input_text, target_text = line.split("\t")
    except ValueError:
        # Skip lines that are not exactly one tab-separated pair
        # (for example the trailing empty line)
        continue
    # "\t" marks the start of a target sentence and "\n" marks its end
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)

    # Collect the sets of all unique input and output characters
    input_chars.update(input_text)
    target_chars.update(target_text)

input_chars = sorted(list(input_chars))
target_chars = sorted(list(target_chars))
num_encoder_tokens = len(input_chars) # aka size of the english alphabet + numbers, signs, etc.
num_decoder_tokens = len(target_chars) # aka size of the french alphabet + numbers, signs, etc.

input_token_index = {char: i for i, char in enumerate(input_chars)}
target_token_index = {char: i for i, char in enumerate(target_chars)}

max_encoder_seq_length = max([len(txt) for txt in input_texts]) # Get longest sequences length
max_decoder_seq_length = max([len(txt) for txt in target_texts])


# encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters)
# containing a one-hot vectorization of the English sentences.

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')

# decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters)
# containing a one-hot vectorization of the French sentences.

decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# decoder_target_data is the same as decoder_input_data but offset by one timestep.
# decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :]

decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')


# Loop over input texts
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    # Loop over each char in an input text
    for t, char in enumerate(input_text):
        # Create one hot encoding by setting the index to 1
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Loop over each char in the output text
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.

print("Done preparing data for training")
```

Done preparing data for training


Now that we have the dataset prepared as required by the Keras APIs, we will train the model as per the explanation above:

```python
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name = 'encoder_inputs')

# The return_state constructor argument, configuring a RNN layer to return a list 
# where the first entry is the outputs and the next entries are the internal RNN states. 
# This is used to recover the states of the encoder.
encoder = LSTM(latent_dim, 
                    return_state=True, 
                    name = 'encoder')

encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens), 
                       name = 'decoder_inputs')

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, 
                         return_sequences=True, 
                         return_state=True, 
                         name = 'decoder_lstm')

# The initial_state call argument, specifying the initial state(s) of a RNN. 
# This is used to pass the encoder states to the decoder as initial states.
# Basically making the first memory of the decoder the encoded semantics
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

decoder_dense = Dense(num_decoder_tokens, 
                      activation='softmax', 
                      name = 'decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```

```python
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], 
                    decoder_target_data,
                    batch_size=1000,
                    epochs=epochs,
                    validation_split=0.2)
# Save model
#model.save('s2s.h5')
```

Train on 116349 samples, validate on 29088 samples
Epoch 1/100


The next step is the inference mode. In this we do the following:

1. Encode the input and retrieve the initial decoder state.

2. Run one step of the decoder with this initial state and a start-of-sequence token as the target. The output of this will be the next target token.

3. Repeat with the current target token and the current states until the end-of-sequence character is produced or the maximum length is reached.

Let's see how to implement this in the following section.

```python
#Define the encoder model
encoder_model = Model(encoder_inputs, encoder_states)
#define decoder initial states
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
#define decoder output seq and model
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())
```

Once we have the inference model ready, we need to write a function that converts the decoded output (which is basically a sequence of token indices) into a human-readable translation.

```python
def decode(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
```


Now let us take our model for a spin and translate some of the English sentences to French.

```python
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
```

-
Input sentence: It's me!
Decoded sentence: Je sis sre.

-
Input sentence: Join us.
Decoded sentence: Noss eses

-
Input sentence: Join us.
Decoded sentence: Noss eses

-
Input sentence: Keep it.
Decoded sentence: Aasee--oi

-
Input sentence: Keep it.
Decoded sentence: Aasee--oi

-
Input sentence: Kiss me.
Decoded sentence: Aosee--oi

-
Input sentence: Kiss me.
Decoded sentence: Aosee--oi

-
Input sentence: Me, too.
Decoded sentence: Noss sasn.

-
Input sentence: Open up.
Decoded sentence: Aasee-oi

-
Input sentence: Open up.
Decoded sentence: Aasee-oi

-
Input sentence: Perfect!
Decoded sentence: Aaisee !

-
Input sentence: See you.
Decoded sentence: Aossees s

-
Input sentence: Show me.
Decoded sentence: Noss eses

-
Input sentence: Show me.
Decoded sentence: Noss eses

-
Input sentence: Shut up!
Decoded sentence: Aasee-ooi

-
Input sentence: Shut up!
Decoded sentence: Aasee-ooi

-
Input sentence: Shut up!
Decoded sentence: Aasee-ooi

-
Input sentence: Shut up!
Decoded sentence: Aas