# Bahdanau Attention (Additive Attention)

One motivation for Bahdanau attention was a limitation of the basic encoder-decoder approach: it compresses the whole source sentence into a single fixed-length context vector, which makes it underperform on long sentences. In the basic approach, the encoder state at the last element of the sequence carries the memory of all previous elements and serves as the fixed-dimension context vector. Bahdanau attention works differently:

- First, we initialize the decoder state from the last states of the encoder, as usual.
- Then, at each decoding time step:
    - We use all of the encoder's hidden states and the decoder's previous hidden state to calculate a context vector by applying an attention mechanism.
    - Lastly, we concatenate the context vector with the decoder's previous output to create the input to the decoder.
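
The attention step above can be written compactly. The notation here is an assumption borrowed loosely from the original paper rather than something this notebook defines: $h_j$ is the encoder hidden state at source position $j$, $s_{t-1}$ is the decoder's previous hidden state, and $W_1$, $W_2$, $v$ are learned parameters (presumably the `w1`, `w2`, and `v` of the attention module defined later).

$$
e_{t,j} = v^\top \tanh\left(W_1 s_{t-1} + W_2 h_j\right), \qquad
\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k} \exp(e_{t,k})}, \qquad
c_t = \sum_{j} \alpha_{t,j} h_j
$$

The weights $\alpha_{t,j}$ sum to 1 over the source positions, so the context vector $c_t$ is a weighted average of the encoder states that is recomputed at every decoding step. The scores combine the two projections by addition, which is why this variant is called additive attention.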

All the preprocessing steps are the same as those used in the seq2seq model, so let's start with them.

In [None]:
import os
import time
import math
import torch
import random
import torch.nn as nn
from torch.optim import Adam
from torch.nn.utils.rnn import pad_sequence
from typing import Iterable, List
from torch.utils.data import DataLoader
from torchtext.datasets import Multi30k
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator as bvfi

## Tokenization and Vocabulary Building

In [None]:
SRC_LANG = 'de'
TGT_LANG = 'en'
specials = {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}

tokenizer = dict()
vocab = dict()

Create the source and target language tokenizers. Make sure to install the dependencies first.

```
pip install -U torchdata
pip install -U spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

In [None]:
# !pip install -U torchdata
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
# !python -m spacy download de_core_news_sm

In [None]:
tokenizer[SRC_LANG] = get_tokenizer('spacy', language='de_core_news_sm')
tokenizer[TGT_LANG] = get_tokenizer('spacy', language='en_core_web_sm')

In [None]:
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANG: 0, TGT_LANG: 1}

    for data_sample in data_iter:
        yield tokenizer[language](data_sample[language_index[language]])

In [None]:
for lang in [SRC_LANG, TGT_LANG]:
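    # Multi30k returns the (train, valid, test) iterators; only the training split is used for the vocabulary
    train_iterator, valid_iterator, test_iterator = Multi30k(language_pair=(SRC_LANG, TGT_LANG))
    vocab[lang] = bvfi(yield_tokens(train_iterator, lang), min_freq=1, specials=list(specials), special_first=True)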

Set the index of `<UNK>` (0 here) as the default index. This index is returned when a queried token is not found in the vocabulary. If no default index is set, looking up an unknown token raises a `RuntimeError`.

In [None]:
for lang in [SRC_LANG, TGT_LANG]:
    vocab[lang].set_default_index(specials['<UNK>'])
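
To make the fallback behaviour concrete without downloading any data, here is a minimal pure-Python sketch. `build_toy_vocab`, `toy_specials`, and the two toy sentences are hypothetical stand-ins, not part of torchtext; the sketch only mimics the idea of placing the special tokens first and mapping unknown tokens to the `<UNK>` index.

```python
from collections import Counter

def build_toy_vocab(token_lists, special_tokens, default_token='<UNK>'):
    """Hypothetical stand-in for a vocabulary: specials first, then tokens by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically, skipping the special tokens
    ordered = [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
               if tok not in special_tokens]
    itos = list(special_tokens) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    default_index = stoi[default_token]

    def lookup(tokens):
        # Unknown tokens fall back to the default index instead of raising an error
        return [stoi.get(tok, default_index) for tok in tokens]

    return stoi, lookup

toy_specials = ['<UNK>', '<PAD>', '<SOS>', '<EOS>']
toy_stoi, toy_lookup = build_toy_vocab([['ein', 'hund'], ['ein', 'mann']], toy_specials)
toy_indices = toy_lookup(['ein', 'katze'])  # 'katze' never appeared in the toy data
print(toy_indices)  # [4, 0]
```

Here `ein` gets index 4 because the four special tokens occupy indices 0 to 3, and the unseen token `katze` maps to 0, the index of `<UNK>`.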

## Encoder

The encoder architecture is the same as the one used in seq2seq, with two differences:
- We use a single-layer RNN
- We use a bidirectional RNN (forward + backward)

As in seq2seq, both the forward and backward hidden states are initialized to tensors of zeros. We get two context vectors, one from each of the forward and backward RNNs. However, the decoder is unidirectional and needs a single context vector as input. To provide this, we concatenate the two context vectors and pass the result through a linear layer followed by `tanh`, which maps it to the decoder's hidden dimension.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embed = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim*2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        embedding = self.dropout(self.embed(src))  # [len(src), batch_size, emb_dim]
        output, (hidden, cell) = self.rnn(embedding)  # nn.LSTM returns (output, (h_n, c_n))
        # output = [len(src), batch_size, enc_hid_dim * n_directions]
        # hidden = cell = [n_layers * n_directions, batch_size, enc_hid_dim]
        # hidden[-2] is the final forward state, hidden[-1] the final backward state
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)))  # [batch_size, dec_hid_dim]
        return output, hidden
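
As a quick sanity check of the shapes involved, the following sketch runs a standalone bidirectional `nn.LSTM` on random input. The sizes and the variable names (`demo_rnn`, `demo_fc`, and so on) are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
src_len, batch_size, emb_dim, enc_hid_dim, dec_hid_dim = 5, 3, 8, 16, 32  # illustrative sizes

demo_rnn = nn.LSTM(emb_dim, enc_hid_dim, bidirectional=True)
demo_fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

embedded = torch.randn(src_len, batch_size, emb_dim)  # stands in for embedded source tokens
output, (h_n, c_n) = demo_rnn(embedded)               # nn.LSTM returns (output, (h_n, c_n))

# For a single layer, h_n[-2] is the final forward state and h_n[-1] the final backward state
final_states = torch.cat((h_n[-2], h_n[-1]), dim=1)   # [batch_size, enc_hid_dim * 2]
init_dec_hidden = torch.tanh(demo_fc(final_states))   # [batch_size, dec_hid_dim]

print(output.shape)           # torch.Size([5, 3, 32])
print(h_n.shape)              # torch.Size([2, 3, 16])
print(init_dec_hidden.shape)  # torch.Size([3, 32])
```

The forward direction finishes at the last time step and the backward direction at the first, so `h_n[-2]` matches `output[-1, :, :enc_hid_dim]` and `h_n[-1]` matches `output[0, :, enc_hid_dim:]`.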

## Attention

In [None]:
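class BahdanauAttention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.w1 = nn.Linear(dec_hid_dim, dec_hid_dim)      # projects the decoder hidden state
        self.w2 = nn.Linear(enc_hid_dim * 2, dec_hid_dim)  # encoder outputs are bidirectional
        # self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.FloatTensor(dec_hid_dim).uniform_(-0.1, 0.1))

    def forward(self, hidden, encoder_outputs):
        # hidden = [batch_size, dec_hid_dim]
        # encoder_outputs = [len(src), batch_size, enc_hid_dim * 2]
        # Sketch of the additive score v^T tanh(W1 s + W2 h), assuming the shapes above
        energy = torch.tanh(self.w1(hidden).unsqueeze(0) + self.w2(encoder_outputs))
        scores = torch.matmul(energy, self.v)  # [len(src), batch_size]
        return torch.softmax(scores, dim=0)    # attention weights over source positions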
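
The sketch below illustrates, on random tensors, how additive attention weights could be turned into a context vector and combined with the decoder's previous output, following the steps listed at the top of this notebook. It is a standalone illustration under assumed shapes: `W1`, `W2`, and `v_demo` are random stand-ins for learned parameters, and `prev_embedded` is a hypothetical embedding of the previously generated token.

```python
import torch

torch.manual_seed(0)
src_len, batch_size, emb_dim, enc_hid_dim, dec_hid_dim = 5, 3, 8, 16, 32  # illustrative sizes

# Random stand-ins for the learned projections and the scoring vector
W1 = torch.randn(dec_hid_dim, dec_hid_dim) * 0.1
W2 = torch.randn(enc_hid_dim * 2, dec_hid_dim) * 0.1
v_demo = torch.randn(dec_hid_dim) * 0.1

dec_hidden = torch.randn(batch_size, dec_hid_dim)                # previous decoder hidden state
enc_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)  # bidirectional encoder outputs

# Additive score: v^T tanh(W1 s + W2 h); the decoder term broadcasts over source positions
energy = torch.tanh(dec_hidden @ W1 + enc_outputs @ W2)          # [src_len, batch_size, dec_hid_dim]
scores = energy @ v_demo                                         # [src_len, batch_size]
weights = torch.softmax(scores, dim=0)                           # normalized over source positions

# Context vector: attention-weighted sum of the encoder outputs
context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=0)       # [batch_size, enc_hid_dim * 2]

# Decoder input: the previous output's embedding concatenated with the context vector
prev_embedded = torch.randn(batch_size, emb_dim)
rnn_input = torch.cat((prev_embedded, context), dim=1)           # [batch_size, emb_dim + enc_hid_dim * 2]

print(weights.sum(dim=0))  # each entry is approximately 1
print(rnn_input.shape)     # torch.Size([3, 40])
```

Because the softmax runs over the source dimension, each sentence in the batch gets its own distribution over source positions, and the context vector keeps the encoder output size regardless of the source length.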

## Decoder

## References

- [The Power of Attention in Deep Learning](https://www.youtube.com/watch?v=Qu81irGlR-0)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)
- [The Bahdanau Attention Mechanism](https://machinelearningmastery.com/the-bahdanau-attention-mechanism/)