# Attention is all you need

Attention mechanisms in neural networks are (very) loosely based on the visual attention mechanism found in humans.

**Issues with Seq2Seq Models**

![seq-2-seq](https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-example.png)

* Unable to capture long-term dependencies. Long-range dependencies remain tricky even with gated models such as LSTMs and GRUs.
* The input sequence is processed word by word.
* In translation systems, only the last encoder hidden state is passed to the decoder, and that single state is not enough to capture all the information needed for translation.

> You can’t cram the meaning of a whole %&! # sentence into a single &!#* vector!

In an attention mechanism, we do not encode the entire sequence into a single vector; instead, we allow the decoder to attend to different parts of the source sentence at each step of output generation.

Each decoder output word depends on a weighted combination of all the input states, not just the last state. Attention allows us to interpret and visualise what the model is doing as heatmaps. 

## DECODER

Rather than building a single context vector out of the encoder’s last hidden state, attention adds shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element. 

Consider a source sequence *x* of length *n* and a target sequence *y* of length *m*:

$$
\begin{aligned}
\mathbf{x} &= [x_1, x_2, \dots, x_n] \\
\mathbf{y} &= [y_1, y_2, \dots, y_m]
\end{aligned}
$$

The encoder produces a hidden state $h_{i}$ for each input timestep $i$. The decoder has hidden state
$s_{t}=f(s_{t-1}, y_{t-1}, c_{t})$ for $t=1, \dots, m$, where

$$
\begin{aligned}
\mathbf{c}_t &= \sum_{i=1}^n \alpha_{t,i} \boldsymbol{h}_i & \small{\text{; Context vector for output }y_t}\\
\alpha_{t,i} &= \text{align}(y_t, x_i) & \small{\text{; How well two words }y_t\text{ and }x_i\text{ are aligned.}}\\
&= \frac{\exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i))}{\sum_{i'=1}^n \exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_{i'}))} & \small{\text{; Softmax of some predefined alignment score.}}
\end{aligned}
$$


$\alpha_{t,i}$ is the attention weight, i.e. the softmax-normalized alignment score, for the input at position $i$ and the output at position $t$, that is, for the pair $(y_{t}, x_{i})$.
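
To make the two equations above concrete, here is a minimal NumPy sketch of one decoder step. It assumes a plain dot product as the `score` function and uses random toy vectors; the names `softmax` and `attention_context` are hypothetical helpers, not part of any library.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H):
    """Return (alpha_t, c_t) for one decoder step.

    s_prev: previous decoder state s_{t-1}, shape (d,)
    H:      encoder hidden states h_1..h_n stacked as rows, shape (n, d)
    The score is a plain dot product (an assumption made for this sketch).
    """
    scores = H @ s_prev      # score(s_{t-1}, h_i) for every i, shape (n,)
    alpha = softmax(scores)  # alpha_{t,i}, non-negative and summing to 1
    context = alpha @ H      # c_t = sum_i alpha_{t,i} * h_i, shape (d,)
    return alpha, context

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))  # n = 4 source positions, d = 3 (toy sizes)
s_prev = rng.normal(size=3)

alpha, c_t = attention_context(s_prev, H)
print(alpha)                 # one weight per source position
print(c_t.shape)             # same dimensionality as a single h_i
```

Each row of a typical attention heatmap is one such `alpha` vector, computed for a single output position.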


**Bahdanau**

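$$\text{score}(s_t, h_i) = v_a^\top \tanh(W_a[s_t;h_i])$$
where $v_a$ is a weight vector and $W_a$ is a weight matrix, both learned in the alignment model. In Bahdanau's formulation the decoder state fed to the score is the previous one, $s_{t-1}$, as in the softmax above.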
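
The additive score above can be sketched in NumPy as follows. The dimensions, the random initialisation of `W_a` and `v_a`, and the helper name `additive_score` are all assumptions for illustration; in a real model the parameters are learned.

```python
import numpy as np

def additive_score(s, h, W_a, v_a):
    """score(s, h) = v_a^T tanh(W_a [s; h]) for one decoder/encoder state pair."""
    concat = np.concatenate([s, h])       # [s; h], shape (d_s + d_h,)
    return v_a @ np.tanh(W_a @ concat)    # scalar

rng = np.random.default_rng(0)
d_s, d_h, k = 3, 4, 5                     # toy sizes, chosen arbitrarily
W_a = rng.normal(size=(k, d_s + d_h))     # random here, trained in practice
v_a = rng.normal(size=k)

s = rng.normal(size=d_s)                  # one decoder state
H = rng.normal(size=(6, d_h))             # six encoder states

scores = np.array([additive_score(s, h, W_a, v_a) for h in H])
print(scores.shape)                       # one score per encoder state
```

Because `tanh` is bounded in $[-1, 1]$, each score is bounded in magnitude by the sum of the absolute values of `v_a`.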

**Luong**

$$\text{score}(s_t, h_i) = s_t^\top W_a h_i$$
where $W_a$ is a trainable weight matrix in the attention layer.
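
A comparable sketch for the multiplicative score above, computed for all encoder states at once. As before, the shapes, the random `W_a`, and the helper name `general_score` are assumptions made for this example.

```python
import numpy as np

def general_score(s, H, W_a):
    """score(s, h_i) = s^T W_a h_i for every row h_i of H, returned as shape (n,)."""
    return H @ (W_a.T @ s)

rng = np.random.default_rng(0)
d_s, d_h, n = 3, 4, 6                 # toy sizes, chosen arbitrarily
W_a = rng.normal(size=(d_s, d_h))     # random here, trained in practice
s = rng.normal(size=d_s)              # one decoder state
H = rng.normal(size=(n, d_h))         # n encoder states

scores = general_score(s, H, W_a)
print(scores.shape)                   # one score per encoder state
```

This form needs only matrix products, so scoring every source position takes a single vectorised expression rather than a per-position pass through a `tanh` layer.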

The decoder RNN outputs only a hidden state vector. This vector is used to compute the *output vector*, and it is also combined with the context vector and passed as input to the next decoder unit.

### CodeWise

>score = 

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load the IMDB reviews, keeping only the 5,000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)
print(X_train.shape)  # (25000,)

# Before padding, the reviews have different lengths
for i in range(10):
    print(len(X_train[i]))
# 218, 189, 141, 550, 147, 43, 123, 562, 233, 130 (one per line)

# Pad or truncate every review to exactly 500 tokens
X_train = sequence.pad_sequences(X_train, maxlen=500)
for i in range(10):
    print(len(X_train[i]))
# 500 for each of the ten reviews

print(X_train.shape)  # (25000, 500)
```
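
To see what the padding step does to a single review, here is a small pure-Python sketch. It mimics the default behaviour of `pad_sequences` (zeros added at the front, overly long sequences cut from the front); `pad_pre` is a hypothetical helper written for this note, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen items."""
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]            # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # padding values go in front

print(pad_pre([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

After this step every sequence has the same length, which is why the array shape changes from `(25000,)` to `(25000, 500)`.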