# Chapter 1: Machine Translation

Machine translation (MT) is automated translation. It is the process by which computer software is used to translate a text from one natural language (such as English) to another (such as Spanish). 

Until fairly recently, if you wanted the Chinese translation of an English sentence, you had to hire a linguist who knew both English and Chinese to translate the text. Problem solved. But what if you wanted to translate the same piece of text into five different languages?

In recent years, MT has become one of the most active areas of research thanks to the success of deep learning. Most of us have used Google Translate at some point. Researchers keep devising new techniques that translate between languages with higher accuracy, and years of R&D in deep neural networks have brought machine translation to the point where a given piece of text can be translated into many different languages.

This will be the main focus of this tutorial, which presents different aspects of machine translation along with hands-on code so that you can implement it yourself. So let's get right to it.

## 1.1 Approaches

Machine Translation (MT) has been an active area of study since the very beginning of the computing age, and it has evolved considerably over the years. Early methods translated text using linguistic rules, for example replacing words of the source language with words of the target language. Later, with the advent of better search algorithms and large data sets, the field shifted to statistical approaches. More recently, advances in Artificial Intelligence have taken MT to new heights of accuracy. Let's take a look at each of these approaches in a bit more detail to better understand the journey of machine translation.

   __1: Rule Based MT__: The rule-based approach relies on a large number of hand-written linguistic rules and on large bilingual dictionaries for each language pair. The text is parsed into an intermediate representation, from which the final target text is generated. Because the translations are built on top of huge dictionaries covering the source and target languages, development cost and time increase dramatically. For example, let's say that we have to translate the following English text to German.
   
   __Source Language__: English: A girl eats an apple.
   
   __Target Language__: German
   
   According to rule-based MT, we need the following to successfully translate the English text to German:
   
   1. A dictionary set, mapping each word in the English language to its German counterpart.
   
   2. Rules to represent English grammar
   
   3. Rules to represent German grammar.
   
   4. Rules relating English grammar to German grammar.
   
   
   So using the above rules, we can attempt to translate the text to German in the following stages:
   
   __Stage 1__:
      Get POS (part-of-speech) information.
      
      `a` article, `girl` noun, `eats` verb, `an` article, `apple` noun
      
   __Stage 2__:
       
      Then translate into German text by dictionary lookup and rules of grammar
      
      `a` = `ein`
      
      `girl` = `Mädchen`
      
      `eat` = `essen`
      
      `apple` = `Apfel`
      
      So the translation would look something like:
      
      `A girl eats an apple` => `Ein Mädchen isst einen Apfel`
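
The two stages above can be sketched in a few lines of Python. This is only a toy, assuming a hand-written lexicon and a single agreement rule for the indefinite article; the names `POS`, `NOUNS`, `VERBS`, `INDEFINITE_ARTICLE` and `translate` are made up for this illustration and cover nothing beyond the example sentence.

```python
# Toy rule-based translator for the single example sentence above.
# The dictionaries and rules are hand-written assumptions for illustration only.

# Stage 1: a tiny part-of-speech lexicon.
POS = {"a": "article", "an": "article", "girl": "noun",
       "eats": "verb", "apple": "noun"}

# Stage 2: bilingual dictionary. Nouns carry the grammatical gender of the German word.
NOUNS = {"girl": ("Mädchen", "neuter"), "apple": ("Apfel", "masculine")}
VERBS = {"eats": "isst"}  # third person singular of "essen", stored directly

# Grammar rule: the German indefinite article depends on the gender of the noun
# and on its case (subject = nominative, direct object = accusative).
INDEFINITE_ARTICLE = {
    ("neuter", "nominative"): "ein",
    ("masculine", "nominative"): "ein",
    ("feminine", "nominative"): "eine",
    ("neuter", "accusative"): "ein",
    ("masculine", "accusative"): "einen",
    ("feminine", "accusative"): "eine",
}

def translate(sentence):
    words = sentence.lower().rstrip(".").split()
    output, case = [], "nominative"
    for i, word in enumerate(words):
        tag = POS[word]
        if tag == "article":
            gender = NOUNS[words[i + 1]][1]  # look ahead to the noun
            output.append(INDEFINITE_ARTICLE[(gender, case)])
        elif tag == "noun":
            output.append(NOUNS[word][0])
        elif tag == "verb":
            output.append(VERBS[word])
            case = "accusative"  # in this toy, everything after the verb is the object
    output[0] = output[0].capitalize()
    return " ".join(output)

print(translate("A girl eats an apple."))
```

Even for one sentence, the rule table already has to encode gender and case, which hints at why the rule sets of real systems grow so large.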
    
   __2: Statistical MT__: Statistical MT generates translations using statistical models whose parameters are estimated from bilingual text corpora. A document is translated according to the probability distribution $p(e|f)$ that a string $e$ in the target language (for example, English) is the translation of a string $f$ in the source language (for example, French).
   
   Typically, statistical MT models translate one sentence at a time. The model is trained on a bilingual text corpus to estimate these probabilities. Bayes' theorem is used to decompose $p(e|f)$ so that it can be computed from a translation model $p(f|e)$, which estimates the probability that the source string is a translation of a given target string.
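
Written out, the Bayes decomposition mentioned above is commonly stated as follows. The symbol $\hat{e}$ (the chosen translation) and the name "language model" for $p(e)$ are additions for this illustration; the denominator $p(f)$ is dropped because it does not depend on $e$.

$$\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \frac{p(f|e)\,p(e)}{p(f)} = \arg\max_{e} p(f|e)\,p(e)$$

Here $p(f|e)$ is the translation model and $p(e)$ is a language model that scores how plausible the target string is on its own.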
   
   __Limitations__:
   * The accuracy of this approach depends heavily on the size and quality of the available corpus for the language pair. Such corpora are scarce for many language pairs, which limits the success of the statistical method.
   
   * Creating a corpus and training on it can be really costly.
   
   * Some errors can be hard to predict and fix.
   
   __Other statistical based approaches:__
   
   __2.1: Word based MT__: In word-based MT, the fundamental unit of translation is a word. Each word of the source language is mapped to one or more words of the target language, and a probability function then determines the best word (from the target language) to represent the word from the source language. The number of words in the translated sentence can differ from the source, since a word in the source language may translate to one or more words in the target language. To measure this, a ratio of the lengths of the translated word sequences, called __fertility__, was introduced. It tells how many target-language words a source-language word produces. It was assumed, based on principles from information theory, that each of these words would cover the same concept. In practice, however, this is not really true. For example, the English word corner can be translated into Spanish as either rincón or esquina, depending on whether it means the internal or the external angle.
   
   Let us try and understand this with a more concrete example. In what follows we will try to translate a German text into English.
   
   Haus (German) -> house, building, home, household, shell (the house of a snail is its shell)
   
   We see here that one German word has multiple English translations depending on the context of usage. Using the statistical MT approach, we find the word with the highest probability in the corpus (as shown in the figure below).
   
   <img src="../images/wordbasedmt.png"/>


Now let's look at the maximum likelihood scores of the above words:

<img src="../images/probablity_wordmt.png"/>

So we choose "house" as the most probable translation of the word "Haus". Similarly, we translate all the other words in the source text (German) to the target text (English):

<img src="../images/wt.png"/>
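
The maximum likelihood scoring used above amounts to dividing counts. The sketch below assumes made-up alignment counts for "Haus" (they are not taken from the figures or from a real corpus) and picks the most probable English word; `alignment_counts` and `translation_probabilities` are hypothetical names.

```python
from collections import Counter

# Hypothetical counts of how often "Haus" was aligned to each English word
# in an imaginary parallel corpus. The numbers are invented for illustration.
alignment_counts = Counter({
    "house": 8000,
    "building": 1600,
    "home": 200,
    "household": 150,
    "shell": 50,
})

def translation_probabilities(counts):
    """Maximum likelihood estimate: relative frequency of each translation."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = translation_probabilities(alignment_counts)
best = max(probs, key=probs.get)

for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"t({word} | Haus) = {p:.3f}")
print("Chosen translation:", best)
```

With these invented counts, "house" wins simply because it is the most frequent alignment; context plays no role, which is the main weakness of purely word-based models.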

   __2.2: Phrase Based MT__: Phrase-based models translate in a similar way to word-based MT. The difference is that in word-based MT the atomic unit is a word, whereas in phrase-based MT the atomic unit is a phrase.
The source text is split into phrases and each phrase is then translated into the target language.

Let's try and understand this with an example. Let us consider the following source text (German):

<img src="../images/phrasemt1.png"/>

As you can see in the figure above, the foreign text is segmented into phrases. As in word-based translation, a maximum likelihood score is calculated from the corpus for each candidate phrase in the target language (English), and the phrase with the best score is selected.

For example, for the translation of the phrase "natuerlich", the probability scores could be something like:

<img src="../images/phrasemt2.png"/>

Similarly the scores for each of the phrases would be calculated and then the translation to the target language would be the output as shown in the following:

<img src="../images/phrasemt3.png"/>

   __3: Neural Machine Translation__: Neural MT takes advantage of advances in deep learning and attempts to mitigate the shortcomings of the machine translation approaches that we studied earlier. It has been observed to produce more accurate translations, provided enough parallel training data is available. In fact, many popular neural machine translation systems do not need any hand-written input about the grammar or rules of the language (unlike rule-based machine translation).
   
Many of the popular neural MT approaches use the Encoder-Decoder framework, in which the source text is encoded into an intermediate machine-generated representation that is then decoded into the target language by a decoder. This approach has proved very effective at generating accurate translations and is widely adopted in industry for machine translation. We will study it in greater detail in the next chapter.

# Chapter 2: Neural Machine Translation

## 2.1 Seq2Seq Modelling

Seq2Seq models use neural networks to translate a piece of text from a source language to a target language. Introduced by researchers at Google, the approach has proved itself in a variety of applications such as machine translation, image captioning, conversation models and text summarization.

As the name suggests, it ingests words in sequence and translates them. The beauty of seq2seq is that it considers not only the current word but also the neighbouring words, which lets it capture semantics better. It uses an RNN to learn from example translations and then translate an unseen piece of text from one language to another. More often than not, a specialized RNN such as an LSTM or GRU is used, since the vanilla RNN suffers from the problem of vanishing gradients.

Seq2Seq models use a special arrangement of neural networks, namely __Encoder-Decoder__ networks, to translate. The encoder takes in the source text and outputs an intermediate vector which represents the input sequence. The decoder then reads the intermediate vector output by the encoder and translates it into the target language. The following figure gives a pictorial representation of the seq2seq model.

<img src="../images/encodec_overview.png"/>

### Encoder

1. A stack of several recurrent units (LSTM or GRU cells for better performance), where each accepts a single element of the input sequence, collects information for that element and propagates it forward.

2. The input sequence is the collection of all words from the source sentence. Each word is represented as $x_i$, where $i$ is the position of that word. The encoder compresses the sequence into a dense vector that represents the input text. This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions. It acts as the initial hidden state of the decoder part of the model.


### Decoder

1. A stack of several recurrent units, where each predicts an output $y_t$ at time step $t$.

2. Each recurrent unit accepts a hidden state from the previous unit (for the first decoder unit, this is the final encoder state) and produces an output as well as its own hidden state.

3. The output sequence is the collection of all words from the translated sentence. Each word is represented as $y_i$, where $i$ is the position of that word.

The power of this model lies in the fact that it can map sequences of different lengths to each other. The input and output words do not have to line up one-to-one, and the lengths of the two sequences can differ. This opens up a whole new range of problems which can now be solved using such an architecture.


The Encoder-Decoder model not only achieves better accuracy, but is also much more scalable than the traditional machine translation methods known thus far. In fact, seq2seq is so popular and effective that it underlies many modern machine translation systems. Google Translate is one of the most popular platforms that has relied on seq2seq for translating text between languages.

### Uses in Industry

Seq2seq has proved its mettle and is being used in a lot of areas to solve complex problems. Let's take a look at some of these.

#### Machine Translation

Used to translate text from one language to another. We will study this in more detail in later sections.

#### Image Captioning

Image captioning is a technique in which a computer automatically generates a caption for an image. For example, if there is a picture in which a cat is sitting on a table, the computer would generate a caption like "Cat sitting on table".

For those of us with knowledge of or interest in convolutional neural networks: the image is converted into a vector using a conv-net, and the resulting vector is then fed to a seq2seq-style decoder that generates the caption. This can be visualized using the image below:

<img src="../images/img_captioning.jpg"/>

#### Other uses

Seq2seq finds additional uses in:

* Conversation models (for chatbots etc.)

* Document summarization
    
    
    
### Seq2Seq in action

Having seen how seq2seq works at a very high level, let us try and understand how we actually translate a text from a source language to a target language.

For our study let us consider the following sentence in Spanish :

__Patrick visitará la India en septiembre.__

The task at hand is to translate this sentence into English. However we do have some problems to solve first.

1. To translate the text correctly into the target language.

2. To check if the translated text is correct.

3. To select the best available translation(after 1 & 2) of the target language.

Point 1 is pretty clear in itself. However, point 2 brings additional problems, i.e., how do we verify that the translated text is correct? Point 3 further asks, assuming the translations are correct, which one do we select, and how do we determine the accuracy of a translation? For example, let us look at the following translations of the above text:

1. Patrick will visit India in September.

2. Patrick is going to visit India in September.

3. In September, Patrick will be visiting India.

4. In September, Patrick is welcome in India.

Translations 1-3 are more or less correct and convey the meaning of the source text. However, translation 1 is the best of the four because it conveys the meaning of the source text precisely and concisely and is also grammatically correct. Translation 4, on the other hand, goes badly wrong. We will see how to tackle the above problems one by one.

Let us try and define the objective now. The objective of a machine translation model is to translate a given piece of text in one language to its best possible match in the target language. In other words, find the piece of text in the target language whose probability of being a correct translation of the source text is the highest. Mathematically, we have to find the translation (text in the target language) which has the highest conditional probability of being the correct translation given the source text. That is, we need to find the candidate $Y$ which maximizes $P(Y|X)$, where $Y$ ranges over all the probable candidates for the correct translation and $X$ is the source text. To find the words from the target dictionary, we use a search algorithm called Beam Search.

__Beam Search__

In order to translate successfully, we need to find the words to begin with. Beam search is an algorithm designed to help us with that. It finds the most probable word(s) from the dictionary of the target language to use in the translated text. Let's take a look at the image below:

<img src="../images/beamsearch_step1.png"/>


Following are the actions performed by the beam search algorithm to find out the words from the target language.

__Step 1__:

1. Given the dictionary of the target language and the beam width (explained in the figure), find the probability of each word in the dictionary given the source text.

    $P(Y_i|X)$, where $Y_i$ are the words from the dictionary and $X$ is the source text to be translated.
    
    So essentially, the source text is run through the encoder-decoder network, whose final softmax layer outputs a conditional probability for every word in the dictionary of the target language. Beam search uses these probabilities to choose the best first words.
    
    <img src="../images/beamsearch2.png"/>

2. Select the top n words with the highest conditional probability, where n is the beam width. In this case, let us say the beam width is 3 and the top three words that the beam search algorithm picks are "patrick", "in" and "september".


__Step 2__:

1. After getting the most probable first words for the translation, the beam search algorithm needs to find the next words. To do that,
    * It keeps the first words in memory (the number of words kept equals the beam width) and calculates the probability of the second word given the first word and the input sentence. An intuition can be developed from the image below:
    <img src="../images/beamsearch_s20.png"/>
    
    


<img src="../images/beamsearch_step2.png"/>


As we can see from the image above, the algorithm now tries to find the second word while keeping the first set of words fixed (from step 1). Mathematically, in this step we are trying to maximize the joint probability of the first two words given the input sentence:

$P(Y_1,Y_2|X)$

By the chain rule of conditional probability, this can also be written as:

$P(Y_1,Y_2|X)=P(Y_1|X)*P(Y_2|X,Y_1)$

When the above operation is carried out for every word of the dictionary after each of the retained first words, we keep the pairs of words with the highest probability as probable candidates for the translation.

The above equation can also be extended to n terms, for example

$P(Y_1...Y_n|X)=P(Y_1|X)*P(Y_2|X,Y_1)...P(Y_n|X,Y_1...Y_{n-1})$

The above steps are repeated until all the probable words for the translation are found out by the algorithm. So the entire process when combined might look something like this:-

<img src="../images/machine_translation.png"/>
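
The steps above can be sketched in plain Python. This is a minimal illustration, not the decoder of any particular library: `TOY_MODEL` is a hand-written table of next-word probabilities standing in for the softmax output of a trained encoder-decoder, and `next_word_probs`, `beam_search` and the `<eos>` end marker are names assumed for this example. Scores are summed log-probabilities, which is the chain rule in log space.

```python
import math

# Hypothetical next-word probabilities P(word | X, words so far) for one source
# sentence. A real system would read these from the decoder's softmax output.
TOY_MODEL = {
    (): {"patrick": 0.6, "in": 0.3, "september": 0.1},
    ("patrick",): {"will": 0.7, "is": 0.3},
    ("patrick", "will"): {"visit": 1.0},
    ("patrick", "will", "visit"): {"india": 1.0},
    ("patrick", "is"): {"visiting": 1.0},
    ("patrick", "is", "visiting"): {"india": 1.0},
    ("in",): {"september": 1.0},
}
EOS = "<eos>"

def next_word_probs(prefix):
    # Any prefix not listed in the table simply ends the sentence.
    return TOY_MODEL.get(prefix, {EOS: 1.0})

def beam_search(next_probs, beam_width, max_len):
    beams = [((), 0.0)]  # (words so far, log-probability of those words)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in next_probs(words).items():
                # Chain rule: add log P(word | X, words so far) to the running score.
                candidates.append((words + (word,), score + math.log(p)))
        # Keep only the `beam_width` most probable partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:
            (finished if words[-1] == EOS else beams).append((words, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

results = beam_search(next_word_probs, beam_width=3, max_len=6)
for words, score in results:
    print(" ".join(words[:-1]), round(math.exp(score), 2))
```

With the toy table, the top-ranked candidate is "patrick will visit india", whose probability is the product 0.6 × 0.7 of its two non-trivial steps. Production decoders usually also normalize scores by length, which this sketch leaves out.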


#### Translation Evaluation: BLEU Score (Bilingual Evaluation Understudy)

Now, once we have a translation, how do we verify the results? The standard way to evaluate an MT model is to calculate the BLEU score. A more familiar way to measure the quality of an ML model's predictions is __Precision__, so let's first see why calculating plain precision will not work here. Take our translation example:

Spanish : Patrick visitará la India en septiembre.

Some human verified references :

1. Patrick will visit India in September.

2. Patrick is going to visit India in September.


Now, let's take a look at what a machine translated output could look like in extreme cases :-

1. Patrick will visit India in September.

2. visit visit visit visit visit visit

If we were to calculate the precision of the above outputs, we would get some unexpected results. Let's calculate the precision of machine-translated output #2.

Recall that precision is the number of correct predictions (in this case, a prediction is correct if the word is present in a human reference) divided by the total number of predictions. The machine-translated output has 6 words, and every one of them (visit) is present in human reference 1. So this essentially gives us a precision of 6/6 = 1 for a poorly translated output.

To address this problem, a new measure for calculating the effectiveness of machine translation was invented.

The BLEU score is calculated in the following way:

1. n-grams are considered for calculating the score.

2. The count of each n-gram (in the above example, unigrams) in the machine translation is capped at the maximum number of times that n-gram occurs in any single reference.
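
To see the effect of the capping, the sketch below compares plain precision with the capped ("modified") precision on the example above. It covers only this one ingredient of BLEU; the full score also combines several n-gram orders and applies a brevity penalty. The helper names (`tokenize`, `ngrams`, `plain_precision`, `modified_precision`) and the very simple tokenization are assumptions made for this illustration.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and drop full stops; real BLEU tooling tokenizes more carefully.
    return text.lower().replace(".", "").split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def plain_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in any reference (no capping)."""
    cand = ngrams(tokenize(candidate), n)
    ref_ngrams = {g for ref in references for g in ngrams(tokenize(ref), n)}
    return sum(g in ref_ngrams for g in cand) / len(cand)

def modified_precision(candidate, references, n=1):
    """Like plain precision, but each n-gram count is capped at its maximum
    count in any single reference."""
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = [
    "Patrick will visit India in September.",
    "Patrick is going to visit India in September.",
]
good = "Patrick will visit India in September."
bad = "visit visit visit visit visit visit"

print("plain precision (bad):   ", plain_precision(bad, references))
print("modified precision (bad):", modified_precision(bad, references))
print("modified precision (good):", modified_precision(good, references))
```

For the degenerate output, "visit" occurs at most once in any reference, so only one of its six occurrences is counted and the score drops from 6/6 to 1/6, while the good translation keeps a score of 1.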


Having gained insight into how the seq2seq model works, let's try and implement it in Python in the next section.



## 2.2 Machine Translation in Python

In the last section we learnt about Seq2Seq modelling for machine translation. In this section we will learn how to implement a machine translation model in Python. We will translate sentences from a source language to a target language by building a Seq2Seq model using Keras, a popular Python deep learning API.

Here is what we are going to do:

1. Turn the sentences into 3 NumPy arrays, __encoder_input_data, decoder_input_data, decoder_target_data__:

__encoder_input_data__ is a 3D array of shape (num_pairs, max_jap_sentence_length, num_jap_characters) containing a one-hot vectorization of the Japanese sentences.

__decoder_input_data__ is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.

__decoder_target_data__ is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].

2. Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. Our model uses teacher forcing.

3. Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
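
The one-timestep offset between `decoder_input_data` and `decoder_target_data` is easy to check on a toy example. The sketch below assumes a five-character target vocabulary and two made-up target strings, with a tab as the start marker and a newline as the end marker; none of these values come from a real dataset.

```python
import numpy as np

# Toy target vocabulary (assumed): "\t" marks sequence start, "\n" marks the end.
target_chars = ["\t", "\n", "a", "h", "i"]
index = {ch: i for i, ch in enumerate(target_chars)}

target_texts = ["\thi\n", "\tha\n"]
max_len = max(len(text) for text in target_texts)

decoder_input_data = np.zeros(
    (len(target_texts), max_len, len(target_chars)), dtype="float32")
decoder_target_data = np.zeros_like(decoder_input_data)

for i, text in enumerate(target_texts):
    for t, ch in enumerate(text):
        decoder_input_data[i, t, index[ch]] = 1.0
        if t > 0:
            # The target is shifted one step to the left and skips the start marker.
            decoder_target_data[i, t - 1, index[ch]] = 1.0

# The target at step t equals the decoder input at step t + 1.
same = np.array_equal(decoder_target_data[:, :-1, :], decoder_input_data[:, 1:, :])
print(decoder_input_data.shape, same)
```

This shift is what makes teacher forcing work: at every step the decoder is fed the true previous character and is trained to predict the next one.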


Because the training process and the inference process (decoding sentences) are quite different, we use different models for each, although they all leverage the same inner layers.

This is our training model. It leverages three key features of Keras RNNs:

1. The `return_state` constructor argument, configuring an RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. This is used to recover the states of the encoder.

2. The `initial_state` call argument, specifying the initial state(s) of an RNN. This is used to pass the encoder states to the decoder as initial states.

3. The `return_sequences` constructor argument, configuring an RNN to return its full sequence of outputs (instead of just the last output, which is the default behavior). This is used in the decoder.

Now that we have an overview of what we are trying to achieve, let's go through the code snippets that we can use to build the Keras model.


First and foremost, we import the libraries that we will need. The following snippet does that:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense
```

In the snippet below, we define the encoder inputs as a Keras `Input` layer. Its `shape` argument describes a sequence of arbitrary length (`None`) in which every timestep is a one-hot vector of size `num_encoder_tokens`.

We also define an LSTM layer, `encoder`.

We then call the encoder on the inputs to get the encoder output and the hidden state variables. We discard `encoder_outputs` and keep the state variables, namely `state_h` and `state_c`, for further processing.
```python
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
```

We will now set up the decoder network to decode from the encoder states. The following are the things that we do in this phase:

1. Set up the decoder, using `encoder_states` as initial state.

2. Set up the decoder to return full output sequences and to return internal states as well. We don't use the returned states in the training model, but we will use them in inference.

The following snippet shows the process in more detail.

``` python

# Setup Decoder network

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
```

Once we have the decoder set up, we define a Keras model that takes `encoder_input_data` and `decoder_input_data` as inputs and outputs `decoder_target_data`.

```python
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```

Once we have the model, we train it on the Japanese input data. The following snippet does that:


``` python
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
```

Once we have the trained model, we will use it to decode sentences and print the output.

Hopefully the above walkthrough brings some clarity on how you can implement your own machine translation system. In the section below, we will actually apply the above concepts to a sample dataset of English-French translation pairs. Along the lines described above, let's start to build an encoder-decoder model using Keras.

```python
# Import necessary packages
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
import pandas as pd
from numpy.random import seed
seed(1)
# TensorFlow 1.x API; in TensorFlow 2.x the equivalent call is tf.random.set_seed(2)
from tensorflow import set_random_seed
set_random_seed(2)

# Training settings and other variables
train_batch_size = 64   # number of sentence pairs per gradient update
epochs = 100            # number of passes over the training data
latent_dim = 256        # size of the LSTM hidden state
num_samples = 10000     # number of sentence pairs to use

# Explore the dataset a bit
df = pd.read_csv("../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt", delimiter='\t')
df.head(5)
```




As you can see, the dataset is a tab-delimited file of English-French translations. Let us now prepare this dataset to match the array format that the Keras model described above expects:

```python
input_texts = []
target_texts = []
input_chars = set()
target_chars = set()
with open("../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt", encoding='utf-8') as f:
    lines = f.read().split("\n")

# Only the first `num_samples` lines are used, to keep the one-hot arrays small.
for line in lines[:num_samples]:
    try:
        input_text, target_text = line.split("\t")
    except ValueError:
        # Skip lines that are not exactly "<english>\t<french>" (e.g. empty lines)
        continue
    # "\t" marks the start of a target sequence and "\n" marks its end
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Collect the sets of all unique input and output characters
    input_chars.update(input_text)
    target_chars.update(target_text)

input_chars = sorted(input_chars)
target_chars = sorted(target_chars)
num_encoder_tokens = len(input_chars)  # aka size of the english alphabet + numbers, signs, etc.
num_decoder_tokens = len(target_chars)  # aka size of the french alphabet + numbers, signs, etc.

input_token_index = {char: i for i, char in enumerate(input_chars)}
target_token_index = {char: i for i, char in enumerate(target_chars)}

max_encoder_seq_length = max(len(txt) for txt in input_texts)  # Get longest sequences length
max_decoder_seq_length = max(len(txt) for txt in target_texts)

# encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters)
# containing a one-hot vectorization of the English sentences.
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')

# decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters)
# containing a one-hot vectorization of the French sentences.
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# decoder_target_data is the same as decoder_input_data but offset by one timestep.
# decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :]
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# Loop over input texts
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    # Loop over each char in an input text
    for t, char in enumerate(input_text):
        # Create one hot encoding by setting the index to 1
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Loop over each char in the output text
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.

print("Done preparing data for training")
```
                    

Now that we have the dataset prepared in the format the Keras APIs require, we will define the model as per the explanation above:

```python
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

# The return_state constructor argument, configuring a RNN layer to return a list
# where the first entry is the outputs and the next entries are the internal RNN states.
# This is used to recover the states of the encoder.
encoder = LSTM(latent_dim,
               return_state=True,
               name='encoder')

encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens),
                       name='decoder_inputs')

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim,
                    return_sequences=True,
                    return_state=True,
                    name='decoder_lstm')

# The initial_state call argument, specifying the initial state(s) of a RNN.
# This is used to pass the encoder states to the decoder as initial states.
# Basically making the first memory of the decoder the encoded semantics
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

decoder_dense = Dense(num_decoder_tokens,
                      activation='softmax',
                      name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```

```python
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data],
          decoder_target_data,
          batch_size=train_batch_size,  # defined with the other training settings
          epochs=epochs,
          validation_split=0.2)
# Save model
#model.save('s2s.h5')
```

The next step is inference. Here we do the following:

1. Encode the input and retrieve the initial decoder state.

2. Run one step of the decoder with this initial state and a start-of-sequence token as target. The output will be the next target token.

3. Repeat with the current target token and the current states.

Let's see how to implement this in the following section.

```python
# Define the encoder model
encoder_model = Model(encoder_inputs, encoder_states)

# Define decoder initial states
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

# Define decoder output sequence and model
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())
```

Once the inference models are ready, we need a function that converts the decoder output (a probability distribution over the target characters at each step) into a human-readable translation.

In [None]:
def decode(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


Now let us take our model for a spin and translate some of the English sentences to French.

In [None]:
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

# Chapter 3: Attention Model

In the last chapter, we learnt about machine translation and how to implement it in Python. In this chapter, we will learn another very important concept called the attention mechanism. Attention mechanisms have gained much popularity in recent times because they improve the accuracy of the Encoder-Decoder RNN model.

### 3.1 Introduction of Attention Model

__What is Attention Model?__

The attention model was proposed mainly as a solution to a limitation of the traditional Encoder-Decoder model in machine translation. Recall that the encoder in that model ingests the whole input sentence and compresses it into a single fixed-length vector, which the decoder then uses to generate the entire output sentence.

__Why Attention Models?__

Now compare this with how a person translates. Imagine that you had to translate a given input sentence into a target language. If you read the complete sentence once and then translated it, you would probably do well for short sentences, say fewer than 10 words, but would you be as effective if the input sentences were long? Probably not. The same behaviour was observed in traditional machine translation models: accuracy was acceptable up to a certain input length, but it started to drop as the input sentences grew longer.

The following figure illustrates this:

<img src="../images/attention_1.png"/>

__Advantages__

Attention models address exactly this problem. Let's return to the example of a human translator. Given a long input sentence, a person subconsciously breaks it into parts of, say, 4-5 words, reads one part, translates it into the target language, and then moves on to the next part. How the chunks are chosen varies from person to person, but the idea is that a long sentence is broken into smaller pieces and each piece is translated in turn. It was observed that with this approach the MT accuracy stays high even for sentences with a large number of words.

<img src="../images/attention_2.png"/>

### 3.2 How does it work ?

In this section, let's try to understand how the attention model works and how it maintains prediction accuracy over a long input sentence. To do that, let's take a look at the following figure, which extends the encoder-decoder network from the last chapter.

<img src="../images/attention_3.png"/>

The Attention Model differs from the typical Encoder-Decoder model in that the decoder network does not directly decode the output of the Encoder network. Instead, it is passed a Context vector $C$, which combines the outputs of the Encoder network $A$ with the attention parameters $\alpha$. The Context vector $C$ is then passed on to the decoder network to get the final output. Let's try to understand this process in more detail.

1. As you can see in the figure above, an attention parameter is calculated for each element of the input sequence. We denote this attention parameter by $\alpha$: $\alpha_i$ is the amount of attention to pay to the corresponding input vector $x_i$. For intuition, think of it as a learned function that determines the weight of each input position.
The way you calculate $\alpha_i$ is to train a small neural network that takes two inputs:

    1. The output of the encoder network at a particular input position.
    
    2. The hidden state of the decoder network from the previous step (passed as `hidden` to the `Decoder` class later in this chapter). The traditional Encoder-Decoder network did not score the encoder outputs against this state at all.
    
    The following figure demonstrates the above concept. 
    
    <img src="../images/attentionparam.png"/>

2. Once you have the attention parameters $\alpha$, compute the Context vector, which is simply the weighted sum of the encoder outputs, with the attention parameters as the weights:
    1. Output of the encoder network, $A_i$.
    
    2. Attention parameter, $\alpha_i$.
    
    $C=\sum_{i=1}^n\alpha_i A_i$
    
3. The Context vector $C$ is then fed into the decoder network to predict the output. To predict the output, we use a feed-forward neural network, which takes the Context vector (from step 2) as input and predicts the decoded output of the network. The following figure attempts to give an intuition for this.

<img src = "../images/decoder_attention.png"/>

Notice that the decoder now receives the encoder outputs together with the attention parameters, which indicate how much attention to give to each word of the input sequence. This way the decoder network can decode the input sequence in chunks, since the attention parameter tends to zero for far-off words. This makes sense, since far-off words usually carry little context for, or relation to, the current word.
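
The computation of $\alpha$ and $C$ can be made concrete with a small NumPy sketch. It assumes an additive scoring network of the form $v \cdot \tanh(W_1 A_i + W_2 h)$, which is the form used by the `W1`, `W2` and `V` layers of the `Decoder` class later in this chapter. The sizes and the random, untrained weights are made up for illustration, so the resulting numbers carry no meaning beyond showing the shapes and the normalisation.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def additive_attention(enc_outputs, hidden, W1, W2, v):
    # enc_outputs: (n, d), one encoder output A_i per input position
    # hidden:      (d,),  the hidden state the scores are conditioned on
    # score_i = v . tanh(W1 A_i + W2 h), one scalar per input position
    scores = np.tanh(enc_outputs @ W1 + hidden @ W2) @ v   # shape (n,)
    alpha = softmax(scores)                                 # attention parameters
    context = alpha @ enc_outputs                           # C = sum_i alpha_i A_i
    return alpha, context


rng = np.random.default_rng(0)
n, d, k = 5, 4, 3                # assumed toy sizes: 5 positions, 4-dim states
A = rng.normal(size=(n, d))      # stand-in encoder outputs
h = rng.normal(size=d)           # stand-in hidden state
W1 = rng.normal(size=(d, k))
W2 = rng.normal(size=(d, k))
v = rng.normal(size=k)

alpha, C = additive_attention(A, h, W1, W2, v)
print('attention parameters:', alpha.round(3), 'sum =', alpha.sum())
print('context vector shape:', C.shape)
```

Because of the softmax, the attention parameters are non-negative and sum to one, so the Context vector is a weighted average of the encoder outputs.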

Now let's look at how to implement the attention model using TensorFlow in the following section.




We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs, one pair per line, with the English sentence and its translation separated by a tab.

There are a variety of languages available, but we'll use the English-French dataset as in the previous chapter. Here is a brief description of the process that we are going to follow to prepare the dataset.

1. Add a *start* and *end* token to each sentence.

2. Clean the sentences by removing special characters.

3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).

4. Pad each sentence to a maximum length.
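
The four steps can be tried on a toy example before running them on the real data. The sketch below uses plain Python only; the two sentences and the helper name `clean` are invented for illustration, and the cleaning rule is a simplified version of the `preprocess_sentence` function defined further down (it skips accent removal).

```python
import re

# Toy sentences, invented for illustration.
sentences = ['Go.', 'I am cold!']


def clean(s):
    # Step 2: lower-case, put spaces around punctuation, drop other characters.
    s = s.lower().strip()
    s = re.sub(r"([?.!,])", r" \1 ", s)
    s = re.sub(r"[^a-z?.!,]+", " ", s)
    return s.strip()


# Step 1 (applied after cleaning so the tokens survive): add start and end tokens.
tagged = ['<start> ' + clean(s) + ' <end>' for s in sentences]

# Step 3: word -> id and id -> word maps; id 0 is reserved for padding.
vocab = sorted({w for s in tagged for w in s.split(' ')})
word2idx = {'<pad>': 0}
word2idx.update({w: i + 1 for i, w in enumerate(vocab)})
idx2word = {i: w for w, i in word2idx.items()}

# Step 4: convert words to ids and pad on the right to the longest sentence.
ids = [[word2idx[w] for w in s.split(' ')] for s in tagged]
max_len = max(len(seq) for seq in ids)
padded = [seq + [0] * (max_len - len(seq)) for seq in ids]

for row in padded:
    print(row, [idx2word[i] for i in row])
```

Reserving id 0 for `<pad>` matters later: the loss function masks out every position whose target id is 0.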

As we saw in the last chapter, let's start by loading the dataset for processing.



In [None]:
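import pandas as pd
import tensorflow as tf
# TensorFlow 1.x only: eager execution is already the default in TensorFlow 2.x
tf.enable_eager_execution()
import unicodedata
import re
import numpy as np
import matplotlib.pyplot as plt  # needed by plot_attention below
from sklearn.model_selection import train_test_split
import time


path_to_file="../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt"
# The file has no header row: each line holds an English sentence and its French translation
df = pd.read_csv(path_to_file, delimiter='\t', header=None)
df.head(5)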


# We will now preprocess the dataset for model consumption.

## The following class creates a vocabulary index for the dataset.
## word2idx maps WORD -> ID (for example, "mom" -> 7) and idx2word maps ID -> WORD.

class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()

        self.create_index()
    
    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))

        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word
            
## Load the dataset in proper format

def max_length(tensor):
    return max(len(t) for t in tensor)


def load_dataset(path, num_examples):
    # creating cleaned [English, French] pairs
    pairs = create_dataset(path, num_examples)

    # index each language using the class defined above
    # (French is the input language, English is the target language)
    inp_lang = LanguageIndex(fr for en, fr in pairs)
    targ_lang = LanguageIndex(en for en, fr in pairs)
    
    # Vectorize the input and target languages
    
    # French sentences
    input_tensor = [[inp_lang.word2idx[s] for s in fr.split(' ')] for en, fr in pairs]
    
    # English sentences
    target_tensor = [[targ_lang.word2idx[s] for s in en.split(' ')] for en, fr in pairs]
    
    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    
    # Padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, 
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar


def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    
    w = w.strip()
    
    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, FRENCH]
def create_dataset(path, num_examples):
    # Read the file inside a context manager so that it is closed afterwards
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
    
    return word_pairs

In [None]:


# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show the sizes of the training and validation sets
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))


BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)




Let us now write our Encoder-Decoder model. For this chapter we will use a GRU instead of an LSTM for simplicity, since a GRU has just one state.

In [None]:
# Let us define the GRU layer used by the encoder and the decoder
def gru(units):
    # If you have a GPU, CuDNNGRU is recommended (about a 3x speedup over GRU);
    # this helper, however, always returns the standard GRU layer.
    return tf.keras.layers.GRU(units, 
                               return_sequences=True, 
                               return_state=True, 
                               recurrent_activation='sigmoid', 
                               recurrent_initializer='glorot_uniform')

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)
        
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)        
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
    


In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)
        
        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying tanh(FC(EO) + FC(H)) to self.V
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))
        
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state, attention_weights
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))

In [None]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

In [None]:
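# Define the optimizer and the loss function used to train the encoder and decoder
optimizer = tf.train.AdamOptimizer()

# Checkpoint object used by the training loop below to save the model.
# The path './training_checkpoints/ckpt' is an assumed location; change it as needed.
checkpoint_prefix = './training_checkpoints/ckpt'
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)


def loss_function(real, pred):
    # mask is 0 where the target is the <pad> id (0) and 1 elsewhere,
    # so padded positions do not contribute to the loss
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)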
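
To see what the mask in `loss_function` does, here is a NumPy-only sketch of the same idea. The function name `masked_cross_entropy` and the toy logits are made up for illustration; the point is only that positions whose target id is 0 (padding) add nothing to the loss, whatever the model predicts there.

```python
import numpy as np


def masked_cross_entropy(real, logits):
    # real:   (batch,) integer word ids, where 0 marks padding
    # logits: (batch, vocab) unnormalised scores
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(real)), real]
    mask = (real != 0).astype(per_example.dtype)   # 0 for padded positions
    return (per_example * mask).mean()


real = np.array([2, 0, 1])                 # the second position is padding
logits = np.array([[0.0, 0.0, 5.0],
                   [9.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
loss_a = masked_cross_entropy(real, logits)

# Changing the prediction at the padded position leaves the loss unchanged.
logits_b = logits.copy()
logits_b[1] = [0.0, 9.0, -3.0]
loss_b = masked_cross_entropy(real, logits_b)

print(loss_a, loss_b)
```

As in `loss_function`, the mean is taken over the whole batch, padded positions included, so a batch with many padded positions reports a smaller average loss.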

In [None]:
from __future__ import absolute_import, division, print_function


EPOCHS = 10

for epoch in range(EPOCHS):
    start = time.time()
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0
        
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)
            
            dec_hidden = enc_hidden
            
            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)       
            
            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                
                loss += loss_function(targ[:, t], predictions)
                
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)
        
        batch_loss = (loss / int(targ.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradients = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradients, variables))
        
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)
    
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

In [None]:
def evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    
    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word2idx[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        if targ_lang.idx2word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
    
    
def translate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    result, sentence, attention_plot = evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
        
    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(result))
    
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))

This brings us to the end of the chapter on the attention mechanism. Let us try to answer the following questions to test our understanding of attention mechanisms.

1. As a data scientist using machine translation algorithms, you find that the accuracy of your model decreases as the length of your input sentences increases. What would you do?

    1. Use recursive units instead of recurrent

    2. Use attention mechanism

    3. Use character level translation

    4. None of these

    Solution: __2__

2. The network learns to pay attention by learning the values of the attention parameters $\alpha$. Can we train a small NN to get these attention parameters?

    1. True

    2. False

    Solution: __1__

3. We expect an RNN with an attention mechanism to have the greatest advantage when:

    1. The length of the input sentence is large.

    2. The length of the input sentence is short.

    Solution: __1__