# Attention Model

In the last chapter, we learnt about Machine Translation and how to do it in Python. In this chapter, we will learn another very important concept called the Attention Mechanism. Attention mechanisms have become popular in recent years because they improve the accuracy of the Encoder-Decoder RNN model.

__What is an Attention Model?__

The attention model was proposed primarily as a remedy for a limitation of the traditional Encoder-Decoder model in machine translation. Recall that the Encoder ingests the whole input sequence at once and encodes it into a single fixed-length vector, which is then fed to the Decoder, and the Decoder produces the entire output sequence from that one vector.

__Why Attention Models?__

Now compare this with how a person translates. Imagine that you were asked to translate a given input sentence into a target language. If you read the complete sentence once and then translated it, you would probably do well on short sentences, say fewer than 10 words, but would you be as effective on long ones? Probably not. The same behaviour was observed in traditional machine translation models: accuracy was acceptable up to a certain input length, but as the number of words in the input sentence increased, the accuracy of the translation went down. The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence. The solution is an attention mechanism, which allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.

The following figure illustrates this:

<img src="../images/attention_1.png"/>

The key benefit of the neural approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

> Unlike the traditional phrase-based translation system which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.

— [Neural Machine Translation by Jointly Learning to Align and Translate, 2014](https://arxiv.org/abs/1409.0473)

> The strength of NMT lies in its ability to learn directly, in an end-to-end fashion, the mapping from input text to associated output text.

— [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.](https://arxiv.org/abs/1609.08144)

__Advantages__

Attention models address exactly this problem faced by traditional MT models. Let's go back to the example of a human translator. Given a long input sentence, a person will subconsciously break it into parts of, say, 4-5 words, read and translate one part into the target language, and then move on to the next chunk. How the chunks are chosen varies from person to person, but the idea is the same: a long sentence is broken into smaller pieces and each piece is translated in turn. It was observed that with this approach MT accuracy stayed high even for sentences with a large number of words.

<img src="../images/attention_2.png"/>



### How does it work?

In this section, let's try to understand how the attention model works and how it maintains prediction accuracy over a long input sentence. To do that, let's look at the following figure, which extends the encoder-decoder network from the previous chapter.

<img src="../images/attention_3.png"/>

The Attention Model differs from the typical Encoder-Decoder model in that the decoder network does not decode the output of the Encoder network directly. Instead, it is passed a Context vector $C$, which combines the outputs of the Encoder network $A$ with an attention parameter $\alpha$. The Context vector $C$ is then passed on to the decoder network to get the final output. Let's try to understand this process in more detail.

1. As you can see in the figure above, an attention parameter is calculated for each element of the input sequence. We denote this attention parameter by $\alpha$; $\alpha_i$ is the amount of attention to pay to the corresponding input vector $x_i$. For intuition, think of it as a learned function that determines the weight of each input word. 
The way you calculate $\alpha_i$ is to train a small neural network which takes in two inputs:

    1. The output of the encoder network for a particular input position.
    
    2. The hidden state of the network. In the implementation later in this chapter this is the decoder's hidden state, initialised with the final hidden state of the encoder. The traditional Encoder-Decoder network did not use this state to weight the individual encoder outputs.
    
    The following figure demonstrates the above concept. 
    
    <img src="../images/attentionparam.png"/>

2. Once you have the attention parameters $\alpha$, compute the Context vector, which is simply a weighted sum over the input positions of the product of:
    1. The output of the encoder network.
    
    2. Attention parameter.
    
    $C=\sum_{i=1}^n\alpha_i*A_i$
    
3. The Context vector $C$ is then fed into the decoder network to predict the output. To predict the output, we use a feed-forward neural network, which takes the Context vector (from step 2) as input and predicts the decoded output of the network. The following figure gives an intuition for this.

<img src = "../images/decoder_attention.png"/>

Notice that the input to the decoder now combines the output of the encoder with an attention parameter that indicates how much attention to give to each word in the input sequence. This way the decoder network is able to decode the input sequence in chunks, since the attention parameter tends to zero for far-off words. This makes sense, since far-off words usually do not carry the context of, or any relation to, the current word.
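The three steps above can be sketched numerically with NumPy. This is only an illustration under assumptions: the toy sizes, the random values, and the names `W1`, `W2` and `v` are hypothetical (the weights would be learned during training), and bias terms are omitted for brevity.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n, enc_dim, dec_dim, att_dim = 5, 4, 4, 3   # hypothetical toy sizes

A = rng.normal(size=(n, enc_dim))    # encoder outputs A_1..A_n, one row per input word
h = rng.normal(size=(dec_dim,))      # hidden state passed to the scoring network

# Parameters of the small scoring network (random here; learned in practice).
W1 = rng.normal(size=(enc_dim, att_dim))
W2 = rng.normal(size=(dec_dim, att_dim))
v = rng.normal(size=(att_dim,))

# Step 1: one score per input position, then softmax to obtain alpha.
scores = np.tanh(A @ W1 + h @ W2) @ v   # shape (n,)
alpha = softmax(scores)                 # non-negative, sums to 1

# Step 2: context vector C = sum_i alpha_i * A_i.
C = (alpha[:, None] * A).sum(axis=0)    # shape (enc_dim,)

print(alpha.round(3), C.shape)
```

Because the weights $\alpha_i$ come out of a softmax, they are non-negative and sum to 1, so $C$ is a weighted average of the encoder outputs. Step 3 would then feed `C` to the decoder.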

In the following section, let's look at how to implement the attention model using TensorFlow.

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains translation pairs, one pair per line, with the two sentences separated by a tab.

There are a variety of languages available, but we'll use the English-French dataset as in the previous chapter. Here is a brief description of the process we are going to follow to prepare the dataset.

1. Add a *start* and *end* token to each sentence.

2. Clean the sentences by removing special characters.

3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).

4. Pad each sentence to a maximum length.
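The four steps can be tried out on a couple of toy sentences before touching the real dataset. The sketch below uses only the standard library; the helper name `add_tokens` and the two example sentences are assumptions made for illustration, and the cleaning is done before the tokens are added so that the angle brackets survive.

```python
import re

def add_tokens(sentence):
    # Lowercase, separate punctuation from words, drop other characters,
    # then wrap the sentence in start/end tokens.
    s = sentence.lower().strip()
    s = re.sub(r"([?.!,])", r" \1 ", s)
    s = re.sub(r"[^a-z?.!,]+", " ", s).strip()
    return "<start> " + s + " <end>"

sentences = [add_tokens(s) for s in ["Go.", "I am cold!"]]

# Word index and reverse word index; id 0 is reserved for padding.
vocab = sorted({w for s in sentences for w in s.split(" ")})
word2idx = {"<pad>": 0}
word2idx.update({w: i + 1 for i, w in enumerate(vocab)})
idx2word = {i: w for w, i in word2idx.items()}

# Convert words to ids and pad on the right to the longest sentence.
ids = [[word2idx[w] for w in s.split(" ")] for s in sentences]
max_len = max(len(x) for x in ids)
padded = [x + [0] * (max_len - len(x)) for x in ids]

print(sentences)
print(padded)
```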

As we saw in the last tutorial, let's first load the dataset for processing.



In [2]:
import pandas as pd
import tensorflow as tf
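# Note: this chapter uses the TensorFlow 1.x API, where eager execution
# has to be enabled explicitly.
tf.enable_eager_execution()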
import unicodedata
import re
import numpy as np
from sklearn.model_selection import train_test_split
import time


path_to_file="../data/frenchenglish-bilingual-pairs/fra-eng/fra.txt"
# The file has no header row, so tell pandas not to treat the first pair as one
df = pd.read_csv(path_to_file, delimiter='\t', header=None)
df.head(5)


In [None]:
# We will now preprocess the dataset for model consumption.

## The following class creates a dictionary of words for the dataset. 
## word2idx maps WORD -> ID (for example, "mom" -> 7) and idx2word maps ID -> WORD.

class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()

        self.create_index()
    
    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))

        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word
            
## Load the dataset in proper format

def max_length(tensor):
    return max(len(t) for t in tensor)


def load_dataset(path, num_examples):
    # creating cleaned input, output pairs
    pairs = create_dataset(path, num_examples)

    # index language using the class defined above    
    inp_lang = LanguageIndex(fr for en, fr in pairs)
    targ_lang = LanguageIndex(en for en, fr in pairs)
    
    # Vectorize the input and target languages
    
    # French sentences
    input_tensor = [[inp_lang.word2idx[s] for s in fr.split(' ')] for en, fr in pairs]
    
    # English sentences
    target_tensor = [[targ_lang.word2idx[s] for s in en.split(' ')] for en, fr in pairs]
    
    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    
    # Padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, 
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar


def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    
    w = w.strip()
    
    # adding a start and an end token to the sentence
    # so that the model know when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

# 1. Remove the accents
# 2. Clean the sentences
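# 3. Return word pairs in the format: [ENGLISH, FRENCH]
def create_dataset(path, num_examples):
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    
    # Keep only the first two tab-separated fields (English, French);
    # newer downloads of the file may carry an extra attribution column.
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[:2]] for l in lines[:num_examples]]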
    
    return word_pairs

In [None]:


# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)


BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
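As a quick sanity check on the sizes defined in the cell above, the arithmetic can be worked out by hand. This sketch assumes the file really contains at least 30000 pairs and that the 80-20 split is exact; the lowercase variable names are illustrative stand-ins for `BUFFER_SIZE` and `N_BATCH`.

```python
num_examples = 30000   # sentence pairs requested from load_dataset
val_percent = 20       # 80-20 split
batch_size = 64

n_val = num_examples * val_percent // 100   # validation pairs
n_train = num_examples - n_val              # training pairs (BUFFER_SIZE)
n_batch = n_train // batch_size             # full batches per epoch (N_BATCH)
leftover = n_train % batch_size             # pairs that drop_remainder=True would discard

print(n_train, n_val, n_batch, leftover)    # 24000 6000 375 0
```

With these numbers the training set divides evenly into batches, so `drop_remainder=True` discards nothing; with a different `num_examples` a final partial batch would be dropped each epoch.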




Let us now write our Encoder-Decoder model. For this chapter we will use GRU instead of LSTM for simplicity, since a GRU has just one state.

In [None]:
# Let us define the GRU layer used by both the encoder and the decoder
def gru(units):
    # If you have a GPU, CuDNNGRU is recommended (about a 3x speedup over GRU).
    # Note that this function does not switch automatically: it always
    # returns a standard GRU layer.
    return tf.keras.layers.GRU(units, 
                               return_sequences=True, 
                               return_state=True, 
                               recurrent_activation='sigmoid', 
                               recurrent_initializer='glorot_uniform')

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)
        
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)        
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
    


In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)
        
        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying tanh(FC(EO) + FC(H)) to self.V
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))
        
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state, attention_weights
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))

In [None]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

In [None]:
# We will define the optimizer and a loss function to train our model
optimizer = tf.train.AdamOptimizer()


def loss_function(real, pred):
    # mask is 0 where the target is padding (id 0), so padded positions add no loss
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)
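To see what the mask in `loss_function` does, here is a NumPy re-implementation of the same idea. It is a sketch rather than the TensorFlow function itself: the name `masked_cross_entropy` and the toy logits are assumptions, and, like the cell above, it averages over all positions in the batch, including the masked ones.

```python
import numpy as np

def masked_cross_entropy(real, logits):
    # real:   (batch,) integer word ids, where 0 means padding
    # logits: (batch, vocab) unnormalised scores
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(real)), real]
    mask = (real != 0).astype(per_example.dtype)   # 0 for padding, 1 otherwise
    return (per_example * mask).mean()

real = np.array([2, 0, 1])                 # the second target is padding
logits = np.array([[0.1, 0.2, 3.0],
                   [5.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
loss = masked_cross_entropy(real, logits)

# Changing the prediction at the padded position leaves the loss unchanged.
other = logits.copy()
other[1] = [0.0, 9.0, -3.0]
print(loss, masked_cross_entropy(real, other))
```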

In [None]:
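from __future__ import absolute_import, division, print_function
import os

# The training loop below saves checkpoints, so the checkpoint objects are
# created here. The directory name is an assumption; change it as needed.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

EPOCHS = 10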

for epoch in range(EPOCHS):
    start = time.time()
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0
        
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)
            
            dec_hidden = enc_hidden
            
            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)       
            
            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                
                loss += loss_function(targ[:, t], predictions)
                
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)
        
        batch_loss = (loss / int(targ.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradients = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradients, variables))
        
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)
    
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
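The inner loop above uses teacher forcing: at step `t` the decoder is fed the true target token from step `t - 1` rather than its own previous prediction. The alignment between decoder inputs and expected outputs can be sketched in plain Python; the helper name `teacher_forcing_pairs` and the example sentence are hypothetical.

```python
def teacher_forcing_pairs(target_tokens):
    # At step t the decoder receives the true token t-1 and must predict token t.
    return [(target_tokens[t - 1], target_tokens[t])
            for t in range(1, len(target_tokens))]

target = ["<start>", "i", "am", "cold", "!", "<end>"]
pairs = teacher_forcing_pairs(target)

for dec_input, expected in pairs:
    print(dec_input, "->", expected)
```

The first decoder input is always `<start>` and the last expected output is `<end>`, which is why the loop in the training cell starts at `t = 1`.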

In [None]:
def evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    
    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word2idx[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        if targ_lang.idx2word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

# function for plotting the attention weights
import matplotlib.pyplot as plt

def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
    
    
def translate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    result, sentence, attention_plot = evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
        
    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(result))
    
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))
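The `evaluate` function above decodes greedily: at each step it takes the most likely word and feeds it back in as the next decoder input, stopping at `<end>` or at the maximum length. The control flow can be isolated in a small sketch in which a hypothetical lookup table `TABLE` stands in for the trained decoder.

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len):
    # Repeatedly feed the previous prediction back in until <end> or max_len.
    token, output = start_id, []
    for _ in range(max_len):
        logits = step_fn(token)
        token = int(np.argmax(logits))
        output.append(token)
        if token == end_id:
            break
    return output

# Hypothetical stand-in for the decoder: logits indexed by the previous token id.
# Ids: 0 = <pad>, 1 = <start>, 2 = <end>, 3 = an ordinary word.
TABLE = np.array([
    [0.0, 0.0, 0.0, 0.0],   # after <pad> (unused)
    [0.0, 0.0, 0.0, 5.0],   # after <start>: word 3 is most likely
    [0.0, 0.0, 0.0, 0.0],   # after <end> (unused)
    [0.0, 0.0, 4.0, 0.0],   # after word 3: <end> is most likely
])

result = greedy_decode(lambda tok: TABLE[tok], start_id=1, end_id=2, max_len=10)
print(result)  # [3, 2]
```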

This brings us to the end of the chapter on the Attention mechanism. Let us try to answer the following questions to test our understanding of Attention Mechanisms.

1. As a data scientist using machine translation algorithms, you find that the accuracy of your model decreases as the length of your input sentences increases. What would you do?

    a. Use recursive units instead of recurrent

    b. Use attention mechanism

    c. Use character level translation

    d. None of these

    Solution: __b__

2. The network learns to pay attention by learning the values of the Context vector. Can we train a small NN to get the context vectors?

    a. True

    b. False

    Solution: __a__

3. We expect an RNN with attention mechanism to have the greatest advantage when:

    a. The length of the input sentence is large.

    b. The length of the input sentence is short.

    Solution: __a__

# Concept Quiz

__1__ What is statistical machine translation ?

    a.
    
    b.
    
    Solution: a
    
__2__ What MT method will you adopt if you have a probability distribution from input text words to output words?

    a. Rule based MT
    
    b. Neural Network
    
    c. Statistical MT
    
    Solution: c
    
__3__ In beam search, if you increase the beam width B, which of the following would you expect to be true?

    a. Beam search will run more slowly
    
    b. Beam search will use more memory
    
    c. Beam search will generally find better solutions (i.e., do a better job at maximizing P(y|x))
    
    d. Beam search will converge after fewer steps
    
    Solution: a, b, c
    
__4__ In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.

    a. True
    
    b. False
    
    Solution: a
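A small worked example shows why the answer to question 4 is true. Each additional word multiplies the sentence probability by a number below 1, so the raw log-probability favours short outputs; dividing by the number of words removes that bias. The per-word probabilities below are made up for illustration, and the helper name `score` is hypothetical.

```python
import math

def score(token_probs, normalize):
    # Sum of log-probabilities, optionally divided by the number of tokens.
    total = sum(math.log(p) for p in token_probs)
    return total / len(token_probs) if normalize else total

short = [0.5, 0.5]              # 2 words
long_ = [0.6, 0.6, 0.6, 0.6]    # 4 words, each individually more likely

print(score(short, False), score(long_, False))   # raw scores
print(score(short, True), score(long_, True))     # normalized scores
```

Without normalization the short candidate scores higher even though each of its words is less likely; with normalization the longer candidate wins.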
    
__5__ Consider using this encoder-decoder model for MT:

<img src="../images/attention_model_quiz.png"/>
    

This model is a "conditional language model" in the sense that the encoder portion (shown in green) is modelling the probability of the input sentence x.

    a. True
    b. False
    
    Solution: b

__6__ Compared to the encoder-decoder model shown in question 5 of this quiz (which does not use an attention mechanism), we expect the attention model to have the greatest advantage when:

    a. The input sequence Tx is large
    
    b. The input sequence Tx is small
    
    Solution: a
    
__7__ What is a BLEU score?

    a. Formula to calculate the accuracy of translation of MT models.
    
    b. Statistical score to calculate the parameter weight
    
    c. Score to evaluate the weight of the features.
    
    Solution: a


__8__ Which RNN cell performs better in seq2seq Machine Translation ?

    a. RNN
    
    b. LSTM
    
    c. GRU
    
    Solution: c
    
__9__ Would you use attention if the size of the input text is roughly 5 words?
    
    a. Yes
    
    b. No
    
    Solution: b

    Explanation: No, seq2seq would suffice in such cases, as attention only performs better when the number of input words is large.
    
__10__ How do you handle input text of variable length?

    a. By ignoring the words outside of the threshold 
    
    b. By having a threshold equal to the maximum length of the input data and padding the remaining input.
    
    Solution: b
