<div align='center'>
    <h1>Sequence-to-Sequence Learning – Neural Machine Translation</h1>
</div>

- **Sequence-to-sequence learning** is the term used for tasks that require mapping an arbitrary-length sequence to another arbitrary-length sequence. 

- This is one of the most sophisticated tasks in NLP, which involves learning many-to-many mappings. 

- Examples of this task include **Neural Machine Translation (NMT)** and creating **chatbots**. NMT is where we translate a sentence from one language (source language) to another (target language) like Google Translate.

NMT systems are commonly evaluated with the BLEU score, which measures how many n-grams (for example, unigrams and bigrams) of a candidate translation also appear in the reference translation. The higher the BLEU score, the better the MT system.
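To make the n-gram matching idea concrete, here is a minimal sketch of the clipped n-gram precision that BLEU builds on. It is not the full metric: the brevity penalty and the combination of several n-gram orders are left out, and the helper name `ngram_precision` as well as the example sentences are assumptions made for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return matched / total

candidate = "ich ging nach Hause".split()
reference = "ich ging heute nach Hause".split()
unigram_p = ngram_precision(candidate, reference, 1)
bigram_p = ngram_precision(candidate, reference, 2)
print(f"Unigram precision: {unigram_p:.2f}, bigram precision: {bigram_p:.2f}")
```

Every candidate unigram appears in the reference, but the bigram `ging nach` does not, so the bigram precision is lower than the unigram precision.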

## Understanding Neural Machine Translation

### Intuition behind NMT systems

Let’s understand the intuition underlying an NMT system’s design. 

- For translating a sentence from English to German, say: `I went home --> Ich ging nach Hause`. The way we translate this is:
    - First, you read the English sentence, and then you create a thought or concept about what this sentence represents or implies, in your mind. And finally, you translate the sentence into German.<br></br>
    
- The same idea is used for building NMT systems. 
    - The **encoder** reads the source sentence (that is, similar to you reading the English sentence). 
    
    - Then the encoder outputs a **context vector** (the context vector corresponds to the thought/concept you imagined after reading the sentence). 
    
    - Finally, the **decoder** takes in the context vector and outputs the translation in German:
    
<div align='center'>
    <img src='images/nmt_system.png'/>
</div>

### NMT Architecture

- NMT is an encoder-decoder architecture. 

- The **encoder** converts a sentence from a given source language into a thought vector (i.e. a contextualized representation), and the **decoder** decodes or translates the thought into a target language.

- The left-hand side of the context vector denotes the encoder (which takes a source sentence in word by word to train a time-series model). 

- The right-hand side denotes the decoder, which outputs word by word (while using the previous word as the current input) the corresponding translation of the source sentence. 

- We will also use embedding layers (for both the source and target languages) where the semantics of the individual tokens will be learned and fed as inputs to the models:

<div align='center'>
    <img src='images/nmt_system1.png'/>
</div>

* **

The objective of NMT is to maximize the average log-likelihood of the translations given their source sentences. For the $i^{th}$ source sentence $x_s^{(i)}$ and its corresponding translation $y_T^{(i)}$: $$\frac{1}{N} \sum_{i = 1}^{N} \log P(y_T^{(i)} | x_s^{(i)})$$ Here, $N$ refers to the number of source-target sentence pairs we have as training data.

Then, during inference, for a given source sentence $x_s^{infer}$, we find the best translation $Y_T^{best}$ among the possible target sentences $Y_T$ as:

$$Y_T^{best} = {argmax}_{y\in Y_T} P(y | x_s^{infer})$$
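A small numeric sketch may help here. It assumes that the probability of a sentence factorizes into per-token probabilities (so the log-probability is a sum of token log-probabilities), and all probabilities and candidate sentences below are invented for illustration.

```python
import math

def sentence_log_prob(token_probs):
    """log P(y_T | x_s) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities assigned to the reference
# translation of each of N = 2 training pairs
batch_token_probs = [
    [0.9, 0.8, 0.5],
    [0.6, 0.5],
]
objective = sum(sentence_log_prob(p) for p in batch_token_probs) / len(batch_token_probs)

# Inference: pick the candidate translation with the highest probability
candidates = {
    "ich ging nach Hause": [0.9, 0.8, 0.7, 0.9],
    "ich gehe nach Hause": [0.9, 0.1, 0.7, 0.9],
}
best = max(candidates, key=lambda y: sentence_log_prob(candidates[y]))
print(f"Average log-likelihood: {objective:.4f}")
print(f"Best candidate: {best}")
```

Enumerating every possible target sentence is not feasible for a real vocabulary, so practical systems approximate the argmax, for example by picking the most likely token at each step.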

* **

Let's connect the dots between the embedding layer, the encoder, the context vector, and the decoder:

1. We use two word embedding layers, one for the source language and the other for the target, to better represent the semantics of the words in the respective languages.

2. The encoder is responsible for generating a thought vector or a context vector that represents what is meant by the source language.
    - The encoder is an RNN cell.
    
    - At time step $t_0$, the encoder's state is initialized with a zero vector by default. After processing the full sequence of source words, it produces a context vector, which is its final hidden state.<br></br>
    
3. The idea of the context vector is to represent a sentence of a source language concisely.
    - Also, in contrast to how the encoder’s state is initialized (that is, it is initialized with zeros), the context vector becomes the initial state for the decoder.
    
    - This creates a linkage between the encoder and the decoder and makes the whole model end-to-end differentiable. <br></br>
    
4. The decoder is responsible for decoding the context vector into the desired translation. Our decoder is an RNN as well.
    - The context vector is the only piece of information that is available to the decoder about the source sentence. Thus, it is a crucial link between the encoder and the decoder.
    
    - After being initialized with the context vector as its initial state, the decoder then learns the patterns in the target text.
    
    - Though it is possible for the encoder and decoder to share the same set of weights, it is usually better to use two different networks for the encoder and the decoder. This increases the number of parameters in our model, allowing us to learn the translations more effectively.
    
    - For prediction, we use a softmax layer over the target vocabulary to predict each word.
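As a rough illustration of the prediction step, the sketch below turns a vector of decoder scores into probabilities with softmax and picks the most likely word. The five-word vocabulary and the scores are toy values chosen for this example, not outputs of the model built later.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<unk>", "ich", "ging", "nach", "Hause"]  # toy target vocabulary
logits = [0.1, 2.0, 0.3, -1.0, 0.5]                # hypothetical decoder scores
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(f"Predicted word: {predicted}")
```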
    
    
The full NMT system with the details of how the GRU cell in the encoder connects to the GRU cell in the decoder, and how the softmax layer is used to output predictions, is shown:

<div align='center'>
    <img src='images/nmt_gru.png'/>
</div>

# Neural Machine Translation: English to German

In [1]:
import os
import random
import tensorflow as tf
import numpy as np
import pandas as pd
import time
import json

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
 
# Set the random seed
random_seed=4321
fix_random_seed(random_seed)

print(f"TensorFlow version: {tf.__version__}")

print(f"Tensorflow GPU Access status: {tf.config.list_physical_devices('GPU')}")

TensorFlow version: 2.10.1
Tensorflow GPU Access status: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## The Data

WMT-14 English-German translation data from https://nlp.stanford.edu/projects/nmt/. There are ~4.5 million sentence pairs available. However, we will use only 250,000 sentence pairs due to computational feasibility.

The required files are:
- File containing German sentences: `train.de`
- File containing English sentences: `train.en`
- File containing German vocabulary: `vocab.50K.de`
- File containing English vocabulary: `vocab.50K.en`

### Reading the dataset

In [2]:
n_sentences = 250000

# Loading English sentences
original_en_sentences = []
with open('data/train.en', 'r', encoding='utf-8') as en_file:
    for i,row in enumerate(en_file):
        # if i < 50: continue # or i==22183 or i==27781 or i==81827: continue
        if i >= n_sentences: break
        original_en_sentences.append(row.strip().split(" "))
        
# Loading German sentences
original_de_sentences = []
with open('data/train.de', 'r', encoding='utf-8') as de_file:
    for i, row in enumerate(de_file):
        # if i < 50: continue # or i==22183 or i==27781 or i==81827: continue
        if i >= n_sentences: break
        original_de_sentences.append(row.strip().split(" "))

In [3]:
len(original_en_sentences), len(original_de_sentences)

(250000, 250000)

In [4]:
# Print a few sentences
for en_s, de_s in zip(original_en_sentences[:10], original_de_sentences[:10]):
    print(f"English: {' '.join(en_s)}\nGerman: {' '.join(de_s)}\n")

English: iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould .
German: iron cement ist eine gebrauchs ##AT##-##AT## fertige Paste , die mit einem Spachtel oder den Fingern als Hohlkehle in die Formecken ( Winkel ) der Stahlguss -Kokille aufgetragen wird .

English: iron cement protects the ingot against the hot , abrasive steel casting process .
German: Nach der Aushärtung schützt iron cement die Kokille gegen den heissen , abrasiven Stahlguss .

English: a fire restant repair cement for fire places , ovens , open fireplaces etc .
German: feuerfester Reparaturkitt für Feuerungsanlagen , Öfen , offene Feuerstellen etc.

English: Construction and repair of highways and ...
German: Der Bau und die Reparatur der Autostraßen ...

English: An announcement must be commercial character .
German: die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen .

English: Goods and services adva

### Adding special tokens

We add special tokens `<s>` and `</s>` to denote the beginning and end of sequences respectively.

- **This is a very important step for Seq2Seq models. `<s>` and `</s>` tokens serve an extremely important role during model inference.** 

- At inference time, we will be using the decoder to predict one word at a time, by using the output of the previous time step as an input. This way we can predict for an arbitrary number of time steps. 

- Using `<s>` as the starting token gives us a way to signal to the decoder that it should start predicting tokens from the target language. 

- Next, if we do not use the `</s>` token to mark the end of a sentence, we cannot signal the decoder to end a sentence. This can lead the model to enter an infinite loop of predictions.
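The role of the two tokens at inference time can be sketched with a toy decoding loop. Here `next_token_fn` is a hypothetical stand-in for a trained decoder (a fixed lookup table in this example), and `max_steps` is an assumed safety cap for the case where `</s>` is never produced.

```python
def greedy_decode(next_token_fn, max_steps=20):
    """Generate tokens one at a time until </s> or max_steps is reached."""
    tokens = ["<s>"]                  # <s> tells the decoder to start predicting
    for _ in range(max_steps):        # hard cap guards against endless generation
        nxt = next_token_fn(tokens[-1])
        if nxt == "</s>":             # </s> tells us the sentence is complete
            break
        tokens.append(nxt)
    return tokens[1:]                 # drop the leading <s>

# Hypothetical stand-in for a trained decoder: a fixed next-token table
toy_table = {"<s>": "ich", "ich": "ging", "ging": "nach", "nach": "Hause", "Hause": "</s>"}
translation = greedy_decode(toy_table.get)
print(" ".join(translation))
```

A stand-in decoder that never emits `</s>` would only stop because of `max_steps`, which is exactly the failure mode described above.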

In [5]:
en_sentences = [["<s>"]+sent+["</s>"] for sent in original_en_sentences]
de_sentences = [["<s>"]+sent+["</s>"] for sent in original_de_sentences]

In [6]:
# Print a few sentences
for en_s, de_s in zip(en_sentences[:5], de_sentences[:5]):
    print(f"English: {' '.join(en_s)}\nGerman: {' '.join(de_s)}\n")

English: <s> iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould . </s>
German: <s> iron cement ist eine gebrauchs ##AT##-##AT## fertige Paste , die mit einem Spachtel oder den Fingern als Hohlkehle in die Formecken ( Winkel ) der Stahlguss -Kokille aufgetragen wird . </s>

English: <s> iron cement protects the ingot against the hot , abrasive steel casting process . </s>
German: <s> Nach der Aushärtung schützt iron cement die Kokille gegen den heissen , abrasiven Stahlguss . </s>

English: <s> a fire restant repair cement for fire places , ovens , open fireplaces etc . </s>
German: <s> feuerfester Reparaturkitt für Feuerungsanlagen , Öfen , offene Feuerstellen etc. </s>

English: <s> Construction and repair of highways and ... </s>
German: <s> Der Bau und die Reparatur der Autostraßen ... </s>

English: <s> An announcement must be commercial character . </s>
German: <s> die Mitteilungen sollen 

## Train, validation and test split

Here we split the full dataset as follows:
- Train - 80%
- Validation - 10%
- Test - 10%

In [7]:
from sklearn.model_selection import train_test_split

train_en_sentences, valid_test_en_sentences, train_de_sentences, valid_test_de_sentences = train_test_split(
    np.array(en_sentences, dtype=object), np.array(de_sentences, dtype=object), test_size=0.2
)

valid_en_sentences, test_en_sentences, valid_de_sentences, test_de_sentences = train_test_split(
    valid_test_en_sentences, valid_test_de_sentences, test_size=0.5)

print(f"Train size: {len(train_en_sentences)}")
print(f"Valid size: {len(valid_en_sentences)}")
print(f"Test size: {len(test_en_sentences)}")

Train size: 200000
Valid size: 25000
Test size: 25000


## Analyse lengths of sequences

A key statistic we have to understand at this point is how long, generally, the sentences in our corpus are. It is quite likely that the two languages will have different sentence lengths.


- Here we can see that 95% of English sentences have at most 54 tokens, whereas 95% of German sentences have at most 49 tokens.

- We use the 80th percentile of the sequence lengths for each language as a threshold to:
    - Truncate sequences longer than that
    - Add a special token (`<pad>`) to bring shorter sentences to that length

>Note that we are only using the training data for this calculation. If you include the validation or test datasets in these calculations, you may leak information about them into the model design. Therefore, it’s best to only use the training dataset for these calculations.
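The thresholding idea can be sketched in plain Python as follows. This is only an illustration on toy data: `nearest_rank_percentile` and `pad_or_truncate` are hypothetical helpers, and the nearest-rank percentile used here can differ slightly from the interpolated percentiles that pandas reports.

```python
def nearest_rank_percentile(values, q):
    """q-th percentile (integer q, 0 < q <= 100) using the nearest-rank definition."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100   # ceiling of q/100 * n with integer math
    return ordered[rank - 1]

def pad_or_truncate(tokens, length, pad_token="<pad>"):
    """Cut tokens beyond `length`, or append pad tokens up to `length`."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

# Toy training sentences; lengths are 3, 4, 5, 6 and 9 tokens
train = [["<s>"] + ["w"] * k + ["</s>"] for k in (1, 2, 3, 4, 7)]
threshold = nearest_rank_percentile([len(s) for s in train], 80)
padded = [pad_or_truncate(s, threshold) for s in train]
print(f"Threshold: {threshold}")
print(padded[0])
```

Note that truncating at the end of a sequence also removes its `</s>` token, which is the trade-off of choosing a threshold below the maximum length.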

In [8]:
print("Sequence lengths (English)")
print(pd.Series(train_en_sentences).str.len().describe(percentiles=[0.05,0.5,0.8,0.9,0.95]))

print("\nSequence lengths (German)")
print(pd.Series(train_de_sentences).str.len().describe(percentiles=[0.05,0.5,0.8,0.9,0.95]))

Sequence lengths (English)
count    200000.000000
mean         26.864400
std          13.587173
min           8.000000
5%           11.000000
50%          24.000000
80%          36.000000
90%          45.000000
95%          54.000000
max         102.000000
dtype: float64

Sequence lengths (German)
count    200000.000000
mean         24.757705
std          12.426796
min           8.000000
5%           11.000000
50%          22.000000
80%          33.000000
90%          41.000000
95%          49.000000
max         102.000000
dtype: float64


## Padding the sentences to a fixed length

In [9]:
n_en_seq_length = 36
n_de_seq_length = 33

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

pad_token = '<pad>'

train_en_sentences_padded = pad_sequences(train_en_sentences, maxlen=n_en_seq_length,
                                          value=pad_token, dtype=object, truncating='post',
                                          padding='post')

valid_en_sentences_padded = pad_sequences(valid_en_sentences, maxlen=n_en_seq_length,
                                          value=pad_token, dtype=object, truncating='post',
                                          padding='post')

test_en_sentences_padded = pad_sequences(test_en_sentences, maxlen=n_en_seq_length,
                                         value=pad_token, dtype=object, truncating='post',
                                         padding='post')


train_de_sentences_padded = pad_sequences(train_de_sentences, maxlen=n_de_seq_length,
                                          value=pad_token, dtype=object, truncating='post',
                                          padding='post')

valid_de_sentences_padded = pad_sequences(valid_de_sentences, maxlen=n_de_seq_length,
                                          value=pad_token, dtype=object, truncating='post',
                                          padding='post')

test_de_sentences_padded = pad_sequences(test_de_sentences, maxlen=n_de_seq_length,
                                         value=pad_token, dtype=object, truncating='post',
                                         padding='post')

In [11]:
print("Some validation sentences ...\n")
for en_sent, de_sent in zip(valid_en_sentences_padded[:3], valid_de_sentences_padded[:3]):
    en_sent_str = ' '.join(en_sent)
    de_sent_str = ' '.join(de_sent)
    print(f"English: {en_sent_str}\nGerman: {de_sent_str}\n")

print("*"*50)
print("Some test sentences ...\n")
for en_sent, de_sent in zip(test_en_sentences_padded[:3], test_de_sentences_padded[:3]):
    en_sent_str = ' '.join(en_sent)
    de_sent_str = ' '.join(de_sent)
    print(f"English: {en_sent_str}\nGerman: {de_sent_str}\n")

Some validation sentences ...

English: <s> For DVDs and Videos , delivery charges and handling fees are already included in the price . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
German: <s> In den Preisangaben für DVDs und Videos sind die Versandkosten und Bearbeitungsgebühren bereits enthalten . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>

English: <s> So configure the \ xampp \ apache \ bin \ php.ini for web changes . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
German: <s> Das ist zwar auch nicht ganz falsch , sollte PHP als Konsolenprogramm ( cli ) benutzt werden . In der Regel aber wird PHP im XAMPP über den Apache Webserver via mod

English: <s> Modern , elegant and equipped well , this excellent residence of the news hotel is a fantastic base of the festivities . </s> <pad> <pad> <pad>

* **
<div align='center'>
    <h3>Preprocessing Trick: Reversing the source sentence</h3>
</div>

We can also perform a special trick on the source sentences. Say we have the sentence
**ABC** in the source language, which we want to translate to $\alpha \beta \gamma \phi$  in the target language. 

We will first reverse the source sentences so that the sentence ABC is read as CBA. This means that in order to translate **ABC** to $\alpha \beta \gamma \phi$, we need to feed in **CBA**.

- This improves the performance of our model significantly, especially **when the source and target languages share the same sentence structure (for example, subject-verb-object).**

- Let’s try to understand why this helps. Mainly, it helps to build good communication between the encoder and the decoder. Let’s start from the previous example. We will concatenate the source and target sentences: $$ABC\alpha \beta \gamma$$

- If you calculate the distance (that is, the number of words separating two words)
from A to $\alpha$ or B to $\beta$, they will be the same. However, consider this when you reverse the source sentence, as shown here: $$CBA\alpha \beta \gamma$$


- Here, A is very close to $\alpha$ and so on. Also, to build good translations, building good communications at the very start is important. This simple trick can possibly help NMT systems to improve their performance.

>**Note:** Reversing the source sentence is a task-dependent preprocessing step. It might not be necessary for some translation tasks. 

- For example, if your translation task is to translate from Japanese (which is often written in *subject-object-verb* format) to Filipino (often written *verb-subject-object*), then reversing the source sentence might actually cause harm rather than help. This is because by reversing the text in Japanese, you are increasing the distance between the starting element of the target sentence (that is, the verb in Filipino) and the corresponding source language entity (that is, the verb in Japanese).
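The distance argument can be checked on the toy ABC example with a few lines of Python. The helpers `reverse_source` and `offset` are hypothetical names used only for this sketch, and the distance is measured as the difference in position within the concatenated source and target sequence.

```python
def reverse_source(tokens):
    """Reverse a source token sequence so its first word ends up next to the target."""
    return tokens[::-1]

def offset(src, tgt, src_word, tgt_word):
    """Position difference between two words in the concatenated sequence."""
    joined = src + tgt
    return joined.index(tgt_word) - joined.index(src_word)

source = ["A", "B", "C"]
target = ["alpha", "beta", "gamma"]

before = offset(source, target, "A", "alpha")
after = offset(reverse_source(source), target, "A", "alpha")
print(f"A -> alpha offset without reversing: {before}, with reversing: {after}")
```

Reversing brings the first source word right next to the first target word, at the cost of pushing the last source word further away from its counterpart.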
* **

## Loading Vocabulary 

Let's build the vocabulary dictionaries for both the source (English) and target (German) languages.

Originally, each vocabulary contains 50,000 tokens. However, we’ll take only half of this to reduce the memory requirement. 

> **Note:** We allow one extra token because there’s a special token `<unk>` to denote **out-of-vocabulary (OOV)** words. 
    
With a 50,000-token vocabulary, it is quite easy to run out of memory due to the size of the final prediction layer we’ll build. 

- **While cutting back on the size of the vocabulary, we have to make sure that we preserve the most common 25,000 words:**
    - Fortunately, each vocabulary file is organized such that words are ordered by their frequency of occurrence (high to low). Therefore, we just need to read the first 25,001 lines of text from the file:

In [12]:
# +1 to account for the out-of-vocabulary (OOV) token <unk>.
n_vocab = 25000 + 1

en_vocabulary = []
with open('data/vocab.50K.en', 'r', encoding='utf-8') as en_file:
    for ri, row in enumerate(en_file):
        if ri  >= n_vocab: break
            
        en_vocabulary.append(row.strip())
        

de_vocabulary = []
with open('data/vocab.50K.de', 'r', encoding='utf-8') as de_file:
    for ri, row in enumerate(de_file):
        if ri >= n_vocab: break
            
        de_vocabulary.append(row.strip())
        

# Each vocabulary file contains the special OOV token <unk> as its first line. 
# We pop it from the en_vocabulary and de_vocabulary lists because we need to 
# pass it separately in the next step:
en_unk_token = en_vocabulary.pop(0)
de_unk_token = de_vocabulary.pop(0)

## String lookup layer: Converting tokens to IDs

Here we define a `StringLookup` layer for each language, which will convert string tokens to numerical IDs.

- After getting the vocabulary of our text data, **we have one more text processing operation remaining, that is, converting the processed text tokens into numerical IDs.** 

- We are going to use a `tf.keras.layers.StringLookup` to create a layer in our model that converts each token into a numerical ID. 

In [13]:
en_lookup_layer = tf.keras.layers.StringLookup(vocabulary=en_vocabulary, oov_token=en_unk_token,
                                               mask_token=pad_token, pad_to_max_tokens=False)

de_lookup_layer = tf.keras.layers.StringLookup(vocabulary=de_vocabulary, oov_token=de_unk_token,
                                               mask_token=pad_token, pad_to_max_tokens=False)

Let’s understand the arguments provided to this layer:
- `vocabulary` – Contains a list of words that are found in the corpus (except certain special tokens that will be discussed below)

- `oov_token` – A special out-of-vocabulary token that will be used to replace tokens not listed in the vocabulary

- `mask_token` – A special token that will be used to mask inputs (e.g. uninformative padded tokens)

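- `pad_to_max_tokens` – Only relevant for the `multi_hot`, `count`, and `tf_idf` output modes, where it pads the output feature axis to `max_tokens`. It has no effect on the integer IDs we produce here, so we leave it as `False`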

In [14]:
wid_sample = en_lookup_layer(
    "iron cement protects the ingot against the hot , abrasive steel casting process .".split(" ")
)
print(f"Word IDs: {wid_sample}")
print(f"Sample vocabulary: {en_lookup_layer.get_vocabulary()[:15]}")

Word IDs: [ 4304 10519  6386     4     1   179     4  1840     5 19429  2315  7705
   224     6]
Sample vocabulary: ['<pad>', '<unk>', '<s>', '</s>', 'the', ',', '.', 'of', 'and', 'to', 'in', 'a', 'is', 'that', 'for']
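The ID layout visible in the sample vocabulary above (`<pad>` at 0, `<unk>` at 1, then the vocabulary in order) can be mimicked with a plain dictionary. This is only a conceptual sketch with a toy vocabulary, not how the Keras layer is implemented; `build_lookup` and `tokens_to_ids` are hypothetical helpers.

```python
def build_lookup(vocabulary, mask_token="<pad>", oov_token="<unk>"):
    """Map tokens to IDs: 0 for the mask token, 1 for OOV, then the vocabulary in order."""
    table = {mask_token: 0, oov_token: 1}
    for token in vocabulary:
        table[token] = len(table)
    return table

def tokens_to_ids(tokens, table, oov_token="<unk>"):
    """Convert tokens to IDs; unknown tokens fall back to the OOV ID."""
    return [table.get(tok, table[oov_token]) for tok in tokens]

toy_vocab = ["<s>", "</s>", "the", ",", "."]
table = build_lookup(toy_vocab)
ids = tokens_to_ids(["<s>", "the", "ingot", ".", "</s>", "<pad>"], table)
print(ids)
```

As in the real output above, a word missing from the vocabulary (`ingot`) maps to ID 1, and the padding token maps to ID 0.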


## Defining the Model

Here we define the model. We'll be focusing on the following primary components.

- **Encoder** - Converts an English token sequence to a context vector

- **Decoder** - Consumes the context vector and generates predictions

- **Decoder Attention** - Allows the decoder to look at any encoder state in order to learn about the source sentence

## Defining the Encoder

In [17]:
from tensorflow.keras import layers
import tensorflow.keras.backend as K
K.clear_session()

# Defining the encoder layers
encoder_input = layers.Input(shape=(n_en_seq_length,), dtype=tf.string)
# Converting tokens to IDs
encoder_wid_out = en_lookup_layer(encoder_input) # wid here means word_id

# Embedding layer and lookup
en_full_vocab_size = len(en_lookup_layer.get_vocabulary())
encoder_emb_out = layers.Embedding(en_full_vocab_size, 128, mask_zero=True)(encoder_wid_out)

# Encoder GRU layer
encoder_gru_out, encoder_gru_last_state = layers.GRU(256, return_sequences=True, return_state=True)(encoder_emb_out)

# Defining the encoder model: in - encoder_input / out - output of the GRU layer
encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)

<div align='center'>
    <h3>Explanation of the Encoder model</h3>
</div>

1. We start the encoder with an input layer. The input layer will take in a batch of sequences of tokens. Each sequence of tokens  is `n_en_seq_length` elements long. Remember that we padded or truncated the sentences to make sure all of them have a fixed length of `n_en_seq_length`:
```
# Defining the encoder layers
encoder_input = layers.Input(shape=(n_en_seq_length,), dtype=tf.string)
```

2. Next we use the previously defined `StringLookup` layer to convert the string tokens into word IDs. As we saw, the `StringLookup` layer can take a list of unique words (i.e. a vocabulary) and create a lookup operation to convert a given token into a numerical ID:
```
# Converting tokens to IDs
encoder_wid_out = en_lookup_layer(encoder_input)
```

3. With the tokens converted into IDs, we route the generated word IDs to a token embedding layer. We pass in the size of the vocabulary (derived from the en_lookup_layer's `get_vocabulary()` method) and the embedding size (128) and finally we ask the layer to mask any zero-valued inputs as they don’t contain any information: 
```
# Embedding layer and lookup
en_full_vocab_size = len(en_lookup_layer.get_vocabulary())
encoder_emb_out = layers.Embedding(en_full_vocab_size, 128, 
                                   mask_zero=True)(encoder_wid_out)
```

4. The output of the embedding layer is stored in encoder_emb_out. Next we define a GRU layer to process the sequence of English token embeddings:

    - Note how we are setting both the `return_sequences` and `return_state` arguments to `True`. 

    - To recap, `return_sequences` returns the full sequence of hidden states as the output (instead of returning only the last), whereas `return_state` returns the last state of the model as an additional output. 

    - We need both these outputs to build the rest of our model. For example, we need to pass the last state of the encoder to the decoder as the initial state. For that, we need the last state of the encoder (stored in `encoder_gru_last_state`). 
    
```
# Encoder GRU layer
encoder_gru_out, encoder_gru_last_state = layers.GRU(256, return_sequences=True, 
                                             return_state=True)(encoder_emb_out)
```


5. We now have everything to define the encoder part of our model. It takes in a batch of sequences of string tokens and returns the full sequence of GRU hidden states as the output:
```
# Defining the encoder model: in - encoder_input / out - output of the GRU layer
encoder = tf.keras.models.Model(inputs=encoder_input, outputs=encoder_gru_out)
```

## Defining the decoder

Our decoder will be more complex than the encoder. **The objective of the decoder is, given the last encoder state and the previous token the decoder predicted, predict the next token.** 

For example, take the English sentence `i went to the store` and its German translation `<s> ich ging zum Laden </s>`.

We define:

|  |  |  |  |  |  |
| ----- | --- | ---- | ---- | --- | ----- |
| **Input** | `<s>` | ich | ging | zum | Laden |
| **Output** | ich | ging | zum | Laden | `</s>` |

This technique is known as **teacher forcing**. In other words, the decoder is leveraging previous tokens of the target itself to predict the next token. This makes the translation task easier for the model.
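The input/output shift in the table above amounts to two slices of the same target sequence. The following minimal sketch shows this on the example sentence; the helper name `teacher_forcing_pair` is an assumption made for illustration.

```python
def teacher_forcing_pair(target_tokens):
    """Split one target sequence into decoder inputs and the labels to predict."""
    decoder_inputs = target_tokens[:-1]   # everything except the last token
    decoder_labels = target_tokens[1:]    # everything except the first token
    return decoder_inputs, decoder_labels

target = "<s> ich ging zum Laden </s>".split()
inputs, labels = teacher_forcing_pair(target)
for inp, lab in zip(inputs, labels):
    print(f"{inp:>6} -> {lab}")
```

Both slices are one token shorter than the original sequence, which is why the decoder works with sequences of length `n - 1`.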

* **

Training model architecture (Note that we're using GRU cells instead of LSTMs as shown in the image)

<div align='center'>
    <img src='images/model_arc.png'/>
</div>

To feed in previous tokens predicted by the decoder, we need an input layer for the decoder.

|  |  |  |  |  |  | |
| ----- | --- | ---- | ---- | --- | ----- | --- |
| **Input** | `<s>` | ich | ging | zum | Laden |---> i/p length is $n-1$ |
| **Output** | ich | ging | zum | Laden | `</s>` |---> o/p length is $n-1$  |

`<s> ich ging zum Laden </s>` ---> Actual length is $n$

As the table shows, when the decoder inputs and outputs are formulated this way, for a sequence of $n$ tokens, the input and output are each $n-1$ tokens long. Thus, in the decoder input layer we pass `shape=(n_de_seq_length-1,)`:
```
# Defining the decoder layers 
decoder_input = layers.Input(shape=(n_de_seq_length-1,), dtype=tf.string)
```

In [20]:
# Defining the decoder layers 
decoder_input = layers.Input(shape=(n_de_seq_length-1,), dtype=tf.string)
# Converting tokens to IDs (Decoder)
decoder_wid_out = de_lookup_layer(decoder_input) # wid here means word_id

# Embedding layer and lookup (decoder)
de_full_vocab_size = len(de_lookup_layer.get_vocabulary())
decoder_emb_out = layers.Embedding(de_full_vocab_size, 128, mask_zero=True)(decoder_wid_out)

# Decoder GRU layer
decoder_gru_out = layers.GRU(256, return_sequences=True)(decoder_emb_out, 
                                                         initial_state=encoder_gru_last_state)

# Note that we are passing the encoder’s last state to a special argument called 
# initial_state in the GRU’s call() method. This ensures that the decoder uses the 
# encoder's last state to initialize its memory.

## Attention: Analyzing the encoder states

- The next step of our journey takes us to one of the most important concepts in machine learning, **'attention'**. 

- *So far, the decoder had to rely on the encoder's last state as the 'only' input/signal about the source language. This is like asking to summarize a sentence using a single word. Generally, when doing so, you lose a lot of the meaning and message in this conversion.* 

- **Attention** alleviates this problem.


Instead of relying just on the encoder's last state, attention enables the decoder to analyze the complete history of state outputs. The decoder does this at every step of the prediction and creates a weighted average of all the state outputs depending on what it needs to produce at that step.

For example, in the translation `I went to the shop -> ich ging zum Laden`, when predicting the word `ging`, the decoder will pay more attention to the first part of the English sentence than the latter.

### The context/thought vector is a performance bottleneck

As we have seen in the encoder-decoder architecture of NMT, the encoder produces a summarized representation of the source sentence, the *context/thought vector*, which links the encoder to the decoder. The decoder then uses this vector to translate the sentence.

<div align='center'>
    <img src='images/enc_dec.png'/>
</div>

- ***To understand why the context/thought vector is a performance bottleneck***,

    - Let's imagine translating the following English sentence: $$\text{I went to the flower market to buy some flowers}$$

    - This translates to the following: $$\text{Ich ging zum Blumenmarkt, um Blumen zu kaufen}$$ 

- If we are to compress this into a fixed-length vector, the resulting vector needs to contain these:
    - *Information about the subject (I)*
    - *Information about the verbs (buy and went)*
    - *Information about the objects (flowers and flower market)*
    - *Interaction of the subjects, verbs, and objects with each other in the sentence*<br></br>


- *Generally, the context vector has a size of 128 or 256 elements.* Expecting such a small vector to store all of this information is impractical and places an extremely difficult requirement on the system. 

- *Therefore, most of the time, the context vector fails to provide the complete information required to make a good translation.*

- **This results in an underperforming decoder that suboptimally translates a sentence.**


- To make the problem worse, during decoding the context vector is observed only in the beginning. Thereafter, the decoder GRU must memorize the context vector until the end of the translation. This becomes more and more difficult for long sentences.

### How does **Attention** deal with this issue?

<div align='center'>
    <h4>Attention sidesteps this issue:</h4>
</div>


1. **With attention, the decoder will have access to the full state history of the encoder for each decoding time step.** 
    - *This allows the decoder to access a very rich representation of the source sentence.*<br></br> 

2. Furthermore, the attention mechanism introduces a `softmax` layer **that allows the decoder to calculate a weighted mean of the past observed encoder states, which will be used as the context vector for the decoder.** 
    - *This allows the decoder to pay different amounts of attention to different words at different decoding steps.*<br></br>
    
    
<div align='center'>
    <h4>Conceptual breakdown of the Attention Mechanism</h4>
    <img src='images/att_mech_nmt.png'/>
</div>

## **The Bahdanau Attention Mechanism**

Also called *Additive Attention*


#### ***Computing Attention:*** 

- [The Bahdanau Attention Mechanism - ML Mastery](https://machinelearningmastery.com/the-bahdanau-attention-mechanism/)
- [The Bahdanau Attention Mechanism - dl.ai](https://d2l.ai/chapter_attention-mechanisms-and-transformers/bahdanau-attention.html)

The Bahdanau attention mechanism was introduced in the paper [Neural Machine Translation by Learning to Jointly Align and Translate](https://arxiv.org/abs/1409.0473), by [Dzmitry Bahdanau](https://rizar.github.io/) et al.

**We'll implement a slightly different version of it, due to the limitations of TensorFlow.** 

Some notations:
- Encoder's $j^{th}$ hidden state: $h_j$
- $i^{th}$ target token: $y_i$
- Decoder's hidden state at the $i^{th}$ time step: $s_i$
- Context Vector: $c_i$

* **

- Our **decoder GRU** is a function of an input $y_i$ and a previous step's hidden state $s_{i-1}$. This can be represented as follows: $${GRU}_{dec} = \mathcal{f}(y_i, s_{i-1})$$ 

    - Here, $\mathcal{f}$ represents the GRU update rules that compute the new hidden state $s_i$ from $y_i$ and $s_{i-1}$.<br></br>
    
- *With the attention mechanism, we are introducing a new time-dependent context vector $c_i$ for the $i^{th}$ decoding step:* 

    - **The $c_i$ vector is a weighted mean of the hidden states of all the unrolled encoder steps.**

    - A higher weight will be given to the $j^{th}$ hidden state of the encoder if the $j^{th}$ word is more important for translating the $i^{th}$ word in the target language. 

    - This means the model can learn which words are important at which time step, regardless of the directionality of the two languages or alignment mismatches.<br></br>

- Now the decoder GRU becomes this: $${GRU}_{dec} = \mathcal{f}(y_i, s_{i-1}, c_i)$$ 
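
To make the extra argument concrete, here is a minimal NumPy sketch of one attention-augmented decoder step. It assumes that $c_i$ enters the GRU by being concatenated with the input $y_i$, which is one common way to realize $\mathcal{f}(y_i, s_{i-1}, c_i)$. The gate equations follow a standard GRU, and the sizes, the random weights, and the helper name `gru_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, s_prev, p):
    """One GRU update; returns the new hidden state s_i."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ s_prev)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ s_prev)              # reset gate
    s_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * s_prev))   # candidate state
    return (1.0 - z) * s_prev + z * s_cand

rng = np.random.default_rng(0)
emb_dim, ctx_dim, hidden = 4, 6, 5   # toy sizes (assumptions)

y_i = rng.normal(size=emb_dim)       # embedding of the current decoder input
c_i = rng.normal(size=ctx_dim)       # context vector from the attention layer
s_prev = np.zeros(hidden)            # previous decoder state s_{i-1}

# Assumption: the context vector is concatenated with the input.
x = np.concatenate([y_i, c_i])

# Random stand-ins for the learned weights.
params = {}
for gate in ("z", "r", "h"):
    params["W" + gate] = rng.normal(scale=0.1, size=(hidden, emb_dim + ctx_dim))
    params["U" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))

s_i = gru_step(x, s_prev, params)
print(s_i.shape)
```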


* **

- Conceptually, the attention mechanism can be thought of as a separate layer, as the illustrations in this section show. In other words, **attention functions as a layer**. 

- The attention layer is responsible for producing the context vector $c_i$ for the $i^{th}$ time step of the decoding process. $c_i$ is calculated as: $$c_i = \sum_{j=1}^{L}\alpha_{ij}h_j$$

    - Here, $L$ is the number of words in the source sentence, and 
    
    - $\alpha_{ij}$ is a normalized weight representing the importance of the $j^{th}$ encoder hidden state for calculating the $i^{th}$ decoder prediction.<br></br>
    

- $\alpha_{ij}$ is calculated from a quantity called the **energy value**.
    - We represent $e_{ij}$ as the energy of the encoder’s $j^{th}$ position for predicting the decoder’s $i^{th}$ position. 
    
    - $e_{ij}$ is computed using a small fully connected network as follows: $$e_{ij} = \nu_{a}^{T} \tanh(W_a s_{i-1} + U_a h_j)$$ 
    
    - In other words, $e_{ij}$ is calculated with a small multilayer perceptron whose weights are $\nu_{a}$, $W_a$, and $U_a$, and whose inputs are $s_{i-1}$ (the decoder’s previous hidden state, from the $(i-1)^{th}$ time step) and $h_j$ (the encoder’s $j^{th}$ hidden output).<br></br>
    
-  Finally, we compute the normalized energy values (i.e. weights) using softmax normalization over all encoder timesteps: $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{L} \exp({e_{ik}})}$$ <br></br>
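
The three equations above (energy, softmax normalization, and weighted sum) can be traced for a single decoding step with a few lines of NumPy. This is only a sketch: the dimensions are arbitrary, and the randomly drawn `W_a`, `U_a`, and `v_a` stand in for weights that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

L, enc_dim, dec_dim, att_units = 4, 6, 5, 3   # toy sizes (assumptions)

h = rng.normal(size=(L, enc_dim))     # encoder hidden states h_1 .. h_L
s_prev = rng.normal(size=dec_dim)     # previous decoder state s_{i-1}

# Random stand-ins for the learned weights of the small network.
W_a = rng.normal(size=(att_units, dec_dim))
U_a = rng.normal(size=(att_units, enc_dim))
v_a = rng.normal(size=att_units)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once.
energies = np.tanh(W_a @ s_prev + h @ U_a.T) @ v_a    # shape (L,)

# alpha_ij: softmax of the energies over the L encoder positions.
alpha = np.exp(energies - energies.max())
alpha = alpha / alpha.sum()

# c_i = sum_j alpha_ij * h_j
context = alpha @ h                                   # shape (enc_dim,)

print(alpha.round(3))
print(context.shape)
```

Repeating this for every decoding step $i$, with the corresponding $s_{i-1}$, yields one row of attention weights and one context vector per target token.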




<div align='center'>
    <h4>The Attention Mechanism</h4>
    <img src='images/bahdanau_attention.png'/>
</div>



### Simple & Intuitive Explanation of Bahdanau Attention

Credit: ChatGPT

- **The Bahdanau Attention Mechanism, also known as Additive Attention**, is like a spotlight that helps a machine learning model focus on the most relevant parts of a long piece of information when making decisions, just like how you pay attention to different words when reading a sentence.

Here's a simple and intuitive explanation:

- Imagine you're translating a sentence from one language to another, and the sentence is quite long. Bahdanau Attention is like having a little assistant who highlights specific words in the original sentence for you as you translate.

    1. **The Sentence**: Let's say you have a long sentence you want to translate, like "The big blue car drove quickly down the winding mountain road."

    2. **The Assistant**: Before you write each word of the translation, your Bahdanau Attention assistant looks at every word in the original sentence and decides which ones matter most for the word you are about to write.

    3. **Highlighting**: It highlights those words. When you are about to translate the subject, for example, it might highlight "big," "blue," and "car."

    4. **Translating**: You focus mostly on the highlighted words while writing that part of the translation, because they carry the details you need at that moment.

    5. **Dynamic Attention**: What's cool is that the highlights change at every step. Once you reach the end of the sentence, the assistant might highlight "winding," "mountain," and "road" instead, and for a new sentence it starts over with new highlights.

In summary, Bahdanau Attention is like having a helpful spotlight that guides you through understanding and translating sentences by emphasizing the important words. It's a way for machines to focus on the relevant parts of information when processing sequences of data, making them more efficient and accurate in tasks like translation, summarization, and more.

## *Implementing Attention*

- **As stated above, we’ll implement a slightly different variation of Bahdanau attention.**
    - *In the traditional Bahdanau attention mechanism, which is commonly used in sequence-to-sequence models, the attention scores (and hence the time-dependent context vector) are computed at each time step within the RNN model. These scores are then used to weigh the importance of different parts of the input sequence when making predictions.*
    
    - *TensorFlow currently does not support an attention mechanism that can be iteratively computed for each time step, similar to how an RNN works.*<br></br>

- **Therefore, we are going to decouple the attention mechanism from the GRU model and have it computed separately.** 

    - *We will concatenate the attention output with the hidden output of the GRU layer and feed it to the final prediction layer. In other words, we are not feeding attention output to the GRU model, but directly to the prediction layer as illustrated below:*


<div align='center'>
    <img src='images/att_mech_implemented.png'/>
</div>
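
The data flow of this decoupled variant can be sketched in NumPy as follows. The sketch assumes that the decoder GRU outputs for all time steps serve as the attention queries and that the encoder outputs serve as keys and values. All sizes and weight matrices are toy stand-ins, not the model built in this notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T_enc, T_dec, hidden, att_units, vocab = 5, 4, 8, 6, 10   # toy sizes

enc_states = rng.normal(size=(T_enc, hidden))   # encoder outputs (keys/values)
dec_states = rng.normal(size=(T_dec, hidden))   # decoder GRU outputs (queries)

# Random stand-ins for the learned attention weights.
W_a = rng.normal(size=(hidden, att_units))
U_a = rng.normal(size=(hidden, att_units))
v_a = rng.normal(size=att_units)

# Additive scores for every (decoder step, encoder step) pair at once.
q = dec_states @ W_a                                     # (T_dec, att_units)
k = enc_states @ U_a                                     # (T_enc, att_units)
scores = np.tanh(q[:, None, :] + k[None, :, :]) @ v_a    # (T_dec, T_enc)
weights = softmax(scores, axis=-1)
context = weights @ enc_states                           # (T_dec, hidden)

# Concatenate the attention output with the GRU output, then predict.
features = np.concatenate([dec_states, context], axis=-1)   # (T_dec, 2*hidden)
W_out = rng.normal(size=(2 * hidden, vocab))
probs = softmax(features @ W_out, axis=-1)                   # (T_dec, vocab)

print(weights.shape, features.shape, probs.shape)
```

Because the GRU has already produced all of its outputs before attention is applied, the context vectors for every decoding step can be computed in one batched operation instead of inside the recurrent loop.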

* **

To implement attention, we define a class called `BahdanauAttention` (which inherits from the `tf.keras.layers.Layer` class) and override two of its methods:
- `__init__()` – Defines the layer's initialization logic
- `call()` – Defines the computational logic of the layer

```python
import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau-style) attention computed outside the decoder GRU."""

    def __init__(self, units):
        super().__init__()
        # Weights to compute Bahdanau attention
        self.Wa = tf.keras.layers.Dense(units, use_bias=False)
        self.Ua = tf.keras.layers.Dense(units, use_bias=False)

        # Additive attention layer, a.k.a. Bahdanau-style attention
        self.attention = tf.keras.layers.AdditiveAttention(use_scale=True)

    def call(self, query, key, value, mask, return_attention_scores=False):
        # Project the query (decoder states): Wa . s
        wa_query = self.Wa(query)

        # Project the key (encoder states): Ua . h
        ua_key = self.Ua(key)

        # The query is never masked; the supplied mask is used for value and key
        query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
        value_mask = mask

        # AdditiveAttention expects its inputs ordered as [query, value, key]
        context_vector, attention_weights = self.attention(
            inputs=[wa_query, value, ua_key],
            mask=[query_mask, value_mask, value_mask],
            return_attention_scores=True,
        )

        if return_attention_scores:
            return context_vector, attention_weights
        return context_vector
```