# Attention

After this lecture you will know:
* The concept of attention and its relevance in NNs
* How attention is used in seq2seq with RNNs
* How attention is used in seq2seq without RNNs

Sequence-to-sequence (seq2seq) model [(Sutskever et al., 2014)](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf):
* One RNN reads (encodes) the source sentence S into a fixed-length vector
* Another RNN decodes that vector into a target sentence T (e.g. a translation)

Note that $length(S)$ is not necessarily equal to $length(T)$.

<img src="pics/seq2seq_lite.png">

What is the problem?

The meaning of the whole sentence has to fit into a single fixed-length vector, no matter how long the sentence is. This works poorly for long sentences.
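The bottleneck can be illustrated with a toy encoder. This is a minimal sketch, assuming a vanilla (Elman) RNN with random weights rather than the LSTM used by Sutskever et al.; the sizes `d_in` and `d_hid` and the function name `encode` are hypothetical.

```python
import numpy as np

def encode(tokens, W_x, W_h):
    """Run a vanilla RNN over the token vectors and return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                        # hypothetical toy sizes
W_x = rng.normal(size=(d_hid, d_in))
W_h = rng.normal(size=(d_hid, d_hid))

short_sentence = rng.normal(size=(3, d_in))    # 3 token vectors
long_sentence = rng.normal(size=(50, d_in))    # 50 token vectors

v_short = encode(short_sentence, W_x, W_h)
v_long = encode(long_sentence, W_x, W_h)
print(v_short.shape, v_long.shape)        # (8,) (8,)
```

Both sentences end up in a vector of the same size, so the 50-token sentence has to be compressed far more aggressively than the 3-token one.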

## Attention in seq2seq with RNNs [(Bahdanau et al., 2014)](https://arxiv.org/pdf/1409.0473.pdf)
* Consider all the hidden states from the encoder (not only the last one)
* When decoding, the NN may decide to pay more **attention** to some source tokens than to others

<img src="pics/seq2seq_attention.png" width="30%">

<img src="pics/seq2seq_attention.png" width="15%">

* Input: $x=[x_1, x_2, ..., x_n]$
* Output: $y=[y_1, y_2, ..., y_m]$
* Encoder hidden state (bidirectional RNN): $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$
* Decoder hidden state: $s_t=f(s_{t-1}, y_{t-1}, c_t)$
* Context vector: $c_t=\sum_{i=1}^{n}\alpha_{t,i}h_i$
* Attention weights: $\alpha_{t,i}=\frac{\exp(score(s_{t-1},h_i))}{\sum_{i'=1}^{n}\exp(score(s_{t-1},h_{i'}))}$
* Score function, e.g. the bilinear form $score(s_{t-1},h_i)=s_{t-1}^T W_a h_i$ [(Luong et al., 2015)](https://arxiv.org/pdf/1508.04025.pdf)
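The definitions above can be sketched in a few lines of NumPy for a single decoding step. This is an illustrative sketch, not the authors' implementation: the function names, the toy dimensions and the random values are assumptions, and the score is the bilinear form from the last bullet.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, W_a):
    """Compute the attention weights and the context vector for one decoder step.

    s_prev: previous decoder state, shape (d_s,)
    H:      encoder hidden states h_1..h_n, shape (n, d_h)
    W_a:    score matrix, shape (d_s, d_h)
    """
    scores = H @ (W_a.T @ s_prev)   # scores[i] = s_prev^T W_a h_i
    alpha = softmax(scores)         # attention weights, sum to 1
    context = alpha @ H             # c_t = sum_i alpha_i * h_i
    return context, alpha

rng = np.random.default_rng(0)
n, d_h, d_s = 5, 6, 4               # hypothetical toy sizes
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_s, d_h))

context, alpha = attention_context(s_prev, H, W_a)
print(alpha.shape, context.shape)   # (5,) (6,)
```

The context vector has the size of one encoder state, but unlike the plain seq2seq encoding it is recomputed at every decoding step with different weights.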

<img src="pics/seq2seq_attention_length.png" width="60%">

Performance of RNN-based seq2seq without (RNNenc) and with attention (RNNsearch) for input sentences of different lengths.

## Transformer [(Vaswani et al., 2017)](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)

* A seq2seq architecture that uses neither RNNs nor CNNs.
* Instead, an attention mechanism draws global dependencies between input and output.
* Advantage: training is parallelizable across sequence positions, with no backpropagation through time.

### Self-attention and path length

**Self-attention**: attention mechanism that relates different positions of a single sequence $s$.

**Maximum path length** between any two positions in a sequence of length $n$: RNN: $O(n)$, CNN (dilated convolutions with kernel width $k$): $O(\log_k(n))$, self-attention: $O(1)$
* Self-attention: constant path length between any pair of positions.

<img src="pics/wavenet.jpg" width="60%">
Paths in WaveNet (CNN)
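The path-length comparison can be made concrete with a small calculation. This is a rough sketch under stated assumptions: the CNN is a WaveNet-style stack in which layer $l$ has dilation $k^l$, so the receptive field grows by a factor of $k$ per layer; the function names are hypothetical.

```python
def dilated_cnn_layers(n, k):
    """Number of stacked dilated-conv layers needed for a receptive field of at least n."""
    layers, receptive_field = 0, 1
    while receptive_field < n:
        receptive_field *= k   # assumption: dilation k**l in layer l
        layers += 1
    return layers

def path_lengths(n, k=2):
    """Steps a signal travels between the first and last of n positions."""
    return {
        "rnn": n - 1,                            # one recurrent step per position
        "dilated_cnn": dilated_cnn_layers(n, k), # one step per layer
        "self_attention": 1,                     # every position sees every other directly
    }

print(path_lengths(1024, k=2))
# {'rnn': 1023, 'dilated_cnn': 10, 'self_attention': 1}
```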

<img src="pics/transformer_overview.png" width="30%">

Masked attention (in the decoder): prevents a position from attending to future words, which have not been predicted yet.
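One common way to implement the mask is to set the scores of future positions to $-\infty$ before the softmax, so that their weights become exactly zero. The sketch below assumes that approach; the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix: entry (i, j) is True if position i may attend to position j."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # the diagonal keeps the max finite
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# With equal scores, each position spreads its attention evenly
# over itself and the positions before it.
weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
print(weights)
```

Row `i` of `weights` is nonzero only in columns `0..i`, so no information flows from future words.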

### Attention in Transformer

<img src="pics/transformer_attention.png" width="25%">

* Query $Q$: the current word
* Keys $K$: the words that can be attended to (the previous words, in masked self-attention)
* Values $V$: the content associated with each key

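$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Score each key by its dot product with the query, scaled by $\sqrt{d_k}$ (where $d_k$ is the dimension of the keys). A softmax turns the scores into weights, and the output is the weighted sum of the corresponding values.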
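The formula translates almost directly into NumPy. This is a minimal single-head sketch without masking or learned projections; the function name and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, one per key

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)   # (2, 5) (2, 3)
```

Each row of `weights` sums to 1, so each output row is a weighted average of the rows of `V`.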

### Results with Transformer

<img src="pics/transformer_results.png" width="75%">

## References

* D. Rao & B. McMahan's NLP with PyTorch (chapter 8)
* [Attention? Attention! Blog post by Lilian Weng](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
* [The annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
* [Video: Transformer presented by one of its authors](https://www.youtube.com/watch?v=rBCqOTEfxvg)