# Attention

## Idea
The key idea behind attention is to quantify how much "attention" a system pays to different regions of its input (it tries to find correlations).
Attention originates from encoder-decoder architectures for NLP tasks, where long sequences need to be processed and references can get lost.
The idea of attention is to give the decoder in an encoder-decoder architecture the most relevant context information.
This is done by calculating an attention vector/matrix, from which the context vector is derived.
For each decoding step, the decoder receives the previous hidden state as well as the context vector as input and therefore knows what to focus on in the encoded sequence.

![](sentence-example-attention.png) - [Reference](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#a-family-of-attention-mechanisms)

## Improvement

In comparison to classic encoder-decoder approaches, attention lets the model focus on the useful parts of the input, even over longer sequences.
The attention weights tell the model which parts of the input should be in focus.


## Concept

### Definition

1. $a^{<t'>}$: activation vector for timestep *t'*

2. $\alpha^{<t,t'>}$: amount of *attention* $\hat{y}^{<t>}$ should pay to $a^{<t'>}$

3. $e^{<t,t'>}$: alignment score (aka energy); it can be learned with a single-layer feed-forward network

4. $c^{<t>}$: context vector as input to the decoder

### Calculus

1. $a^{<t'>}=(\overrightarrow{a}^{<t'>},\overleftarrow{a}^{<t'>})$

2. $\sum_{t'}\alpha^{<t,t'>}=1$ for every decoder step $t$

3. $\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}$ (using softmax to ensure $\sum\alpha=1$)

4. $e^{<t,t'>}=g_t(W_ya^{<t'>}+b_y)$ (simple linear network; see the scoring functions below)

5. $c^{<t>}=\sum_{t'}\alpha^{<t,t'>}a^{<t'>}$
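
Steps 3 and 5 can be sketched in a few lines of plain Python. This is only an illustration: the energies and activations are made-up values, and `softmax` and `context_vector` are hypothetical helper names, not part of any library.

```python
import math


def softmax(scores):
    """Turn raw alignment scores e^{<t,t'>} into attention weights alpha^{<t,t'>}."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def context_vector(weights, activations):
    """Weighted sum of the encoder activations: c^{<t>} = sum_t' alpha^{<t,t'>} a^{<t'>}."""
    dim = len(activations[0])
    return [sum(w * a[i] for w, a in zip(weights, activations)) for i in range(dim)]


# Made-up values (assumption): three encoder positions, two-dimensional activations.
energies = [2.0, 1.0, 0.1]
activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

alphas = softmax(energies)
context = context_vector(alphas, activations)
print(alphas, context)
```

The weights sum to 1, and the position with the highest energy contributes most to the context vector.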

![Attention](attention.png) - [Reference](https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)

### Scoring Functions
 
![Scoring-functions](scoring-functions.png) - [Reference](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#a-family-of-attention-mechanisms)
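
As a rough sketch, three of the common scoring functions (dot product, general, additive/concat) can be written in plain Python for a single decoder state `s` and a single encoder state `h`. The helper names, the toy vectors and the weight matrices are assumptions chosen for illustration; in a real model `W` and `v` are learned parameters. The location-based variant is left out here.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def dot_score(s, h):
    """Dot product: score = s^T h."""
    return sum(s_i * h_i for s_i, h_i in zip(s, h))


def general_score(s, h, W):
    """General: score = s^T W h."""
    return dot_score(s, matvec(W, h))


def additive_score(s, h, W, v):
    """Additive/concat: score = v^T tanh(W [s; h])."""
    concat = s + h  # list concatenation, i.e. [s; h]
    return dot_score(v, [math.tanh(x) for x in matvec(W, concat)])


# Toy values (assumption): 2-dimensional decoder state s and encoder state h.
s = [1.0, 0.0]
h = [0.5, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # picks s[0] and h[1]
v = [1.0, 1.0]

print(dot_score(s, h))
print(general_score(s, h, identity))
print(additive_score(s, h, W_concat, v))
```

With an identity matrix the general score reduces to the dot product, which makes the relationship between the two easy to check.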

### Approaches

1. Bahdanau et al. (2015)
   1. using the additive/concat scoring function
   2. the input to the next decoder step is the concatenation of the output from the previous decoder time step and the context vector from the current time step
2. Luong et al. (2015)
   1. using different scoring functions:
      1. additive/concat
      2. dot product
      3. location-based
      4. general
   2. the input to the next decoder step is the output of a feed-forward neural network, which receives the concatenation of the output from the previous decoder time step and the context vector from the current time step

## Example

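```python
import torch
import torch.nn as nn

# Shapes used below: x (B, 1, input_size), hidden (1, B, H),
# encoder_outputs (B, T, H), pad_mask (B, T) with True for real tokens.
# The GRU and the layer sizes are assumptions to make the sketch runnable.


class BahdanauDecoderStep(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear_hidden = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear_encoder = nn.Linear(hidden_size, hidden_size, bias=False)
        self.weight = nn.Linear(hidden_size, 1, bias=False)
        self.model = nn.GRU(input_size + hidden_size, hidden_size, batch_first=True)

    def forward(self, x, hidden, encoder_outputs, pad_mask):
        query = hidden[-1].unsqueeze(1)  # (B, 1, H)

        # compute score / attention energy first (additive scoring);
        # summing two projections equals one projection of the concatenation
        score = torch.tanh(self.linear_hidden(query) + self.linear_encoder(encoder_outputs))

        # multiply score with the weight vector -> one energy per encoder position
        attn_weights = self.weight(score).transpose(1, 2)  # (B, 1, T)

        # attention masking | set padded positions to -inf -> after softmax they become 0
        attn_weights = attn_weights.masked_fill(~pad_mask.unsqueeze(1), float("-inf"))
        attn_weights = torch.softmax(attn_weights, dim=-1)

        # multiply attention weights with encoder outputs
        context = torch.bmm(attn_weights, encoder_outputs)  # (B, 1, H)

        # concat input and context, then feed the result into the model
        output = torch.cat((x, context), dim=-1)
        output, hidden = self.model(output, hidden)
        return output, hidden


class LuongDecoderStep(nn.Module):
    def __init__(self, input_size, hidden_size, method="dot"):
        super().__init__()
        self.method = method
        self.model = nn.GRU(input_size, hidden_size, batch_first=True)
        if method == "general":
            self.linear = nn.Linear(hidden_size, hidden_size, bias=False)
        elif method == "concat":
            self.linear = nn.Linear(2 * hidden_size, hidden_size, bias=False)
            self.weight = nn.Linear(hidden_size, 1, bias=False)
        self.linear_out = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, x, hidden, encoder_outputs, pad_mask):
        # compute model "step" first
        output, hidden = self.model(x, hidden)
        query = hidden[-1].unsqueeze(1)  # (B, 1, H)

        # compute score / attention energy, shape (B, 1, T)
        if self.method == "dot":
            score = torch.bmm(query, encoder_outputs.transpose(1, 2))
        elif self.method == "general":
            score = torch.bmm(self.linear(query), encoder_outputs.transpose(1, 2))
        elif self.method == "concat":
            expanded = query.expand(-1, encoder_outputs.size(1), -1)  # (B, T, H)
            out = torch.tanh(self.linear(torch.cat((expanded, encoder_outputs), dim=-1)))
            score = self.weight(out).transpose(1, 2)
        else:
            raise ValueError(f"unknown scoring method: {self.method}")

        # attention masking | set padded positions to -inf -> after softmax they become 0
        attn_weights = score.masked_fill(~pad_mask.unsqueeze(1), float("-inf"))
        attn_weights = torch.softmax(attn_weights, dim=-1)

        # multiply attention weights with encoder outputs
        context = torch.bmm(attn_weights, encoder_outputs)  # (B, 1, H)

        # concat output and context and pass through linear projection layer
        output = self.linear_out(torch.cat((output, context), dim=-1))
        return output, hidden
```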
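
The masking step in the example relies on the fact that `exp(-inf)` is 0, so padded positions end up with zero attention weight. A minimal plain-Python sketch of that behaviour follows; `masked_softmax` is a hypothetical helper, the scores are made-up values, and it assumes at least one position is kept.

```python
import math


def masked_softmax(scores, keep):
    """Softmax over scores where positions with keep=False are set to -inf first."""
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    shift = max(masked)  # finite as long as at least one position is kept
    exps = [math.exp(m - shift) for m in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up example (assumption): a sequence of length 4 whose last two positions are padding.
scores = [1.5, 0.3, 0.9, 0.9]
keep = [True, True, False, False]

weights = masked_softmax(scores, keep)
print(weights)
```

The padded positions receive exactly zero weight, and the remaining weights are renormalised over the real tokens only.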



## References

1. [Attn: Illustrated Attention](https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)
2. [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#a-family-of-attention-mechanisms)
3. [Bahdanau et al., 2015 - Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
4. [Luong et al., 2015 - Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
5. [Vaswani et al., 2017 - Attention Is All You Need](https://arxiv.org/abs/1706.03762)