# Self-attention

A typical model takes a single vector as input.
What if the input is a set of vectors whose size varies, e.g., a sentence in which each word is a vector and sentences differ in length?
- speech
- a graph is also a set of vectors, with each node treated as a vector


Output:
- each vector has a label, so the output is a sequence of labels with the same length as the input
  - e.g., sequence labeling, POS tagging: "I saw a saw" -> "I/PRON saw/VERB a/DET saw/NOUN"
- the whole sequence has a single label
  - e.g., sentiment analysis: "This is good" -> "positive"
- the model decides the number of labels itself
  - e.g., sequence to sequence modeling


## Sequence Labeling

Sequences have different lengths. What is the best way to represent them?

Self-attention:
- flow: sequence(words) -> self-attention -> FC layer -> self-attention -> FC layer -> output
- self-attention layer: find the relevance of each word to every word in the sequence, including itself
  - relevance calculation: dot product of two vectors, additive combination of two vectors, etc.
  - dot product
    - query $q^1$: projection of the word we are interested in
    - key $k^i$: projection of each word the query is compared against
    - value $v^i$: projection of each word that carries the information to extract, weighted by its attention score

Get the relevance of $a^1$ to every vector $a^1, a^2, \dots, a^n$ (including itself):

- self-attention layer
  - compute the query and the keys with the learned matrices $W^q$ and $W^k$
    - $q^1 = W^q a^1$
    - $k^1 = W^k a^1$
    - $k^2 = W^k a^2$
    - ...
    - $k^n = W^k a^n$
  - get the relevance as a dot product
    - $\alpha_{1,1} = q^1 \cdot k^1$
    - $\alpha_{1,2} = q^1 \cdot k^2$
    - ...
    - $\alpha_{1,n} = q^1 \cdot k^n$
- softmax layer (or another normalization) -> attention scores
  - $\alpha^{\prime}_{1,i} = \frac{\exp(\alpha_{1,i})}{\sum_{j=1}^n \exp(\alpha_{1,j})}$
- extract information based on attention scores
  - $v^1 = W^v a^1$
  - $v^2 = W^v a^2$
  - ...
  - $v^n = W^v a^n$
  - $b^1 = \sum_{i=1}^n \alpha^{\prime}_{1,i} v^i$
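
The steps above for a single query can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`matvec`, `dot`, `softmax`, `attend`), the toy inputs, and the identity projection matrices are all assumptions made for the example. It follows the notes in using the unscaled dot product; the original Transformer additionally divides the scores by $\sqrt{d_k}$.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    """Dot product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def softmax(scores):
    """Normalize raw scores so that they are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_index, inputs, Wq, Wk, Wv):
    """Return (attention scores, output vector) for one query position."""
    q = matvec(Wq, inputs[query_index])
    keys = [matvec(Wk, a) for a in inputs]
    values = [matvec(Wv, a) for a in inputs]
    raw = [dot(q, k) for k in keys]   # raw relevance of the query to every key
    weights = softmax(raw)            # normalized attention scores
    dim = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    b = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, b

# Toy inputs a^1..a^3 and identity projections (hypothetical values).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
weights, b1 = attend(0, inputs, I2, I2, I2)
```

With these toy values the first and third inputs receive the same score, because both have the same dot product with the query, and the second input receives a smaller one.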


Matrix operations

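- stack the inputs as the columns of $X = [a^1, a^2, \dots, a^n]$
- $Q = W^q X$
- $K = W^k X$
- $V = W^v X$
- attention matrix $A = K^T Q$, normalized column by column: $A^{\prime} = \text{softmax}(A)$
- $O = V A^{\prime}$, where column $j$ of $O$ is the output $b^j$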
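
The matrix form can be sketched as follows, assuming the input vectors are the columns of $X$ and the softmax is applied to each column of $K^T Q$ (so that column $j$ holds the scores for query $j$). Matrices are plain lists of rows; the helper names and toy values are assumptions made for this example.

```python
import math

def matmul(A, B):
    """Matrix product of A (m x k) and B (k x n), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def softmax_columns(A):
    """Apply softmax independently to every column of A."""
    cols = []
    for col in zip(*A):
        m = max(col)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)

def self_attention(X, Wq, Wk, Wv):
    """Self-attention over the columns of X."""
    Q = matmul(Wq, X)
    K = matmul(Wk, X)
    V = matmul(Wv, X)
    A = matmul(transpose(K), Q)    # A[i][j] = k^i . q^j
    A_prime = softmax_columns(A)   # each column sums to 1
    return matmul(V, A_prime)      # column j is the output for query j

# Columns of X are toy inputs a^1, a^2, a^3 (hypothetical values).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
O = self_attention(X, I2, I2, I2)
```

All queries are handled by the same three matrix products, which is what makes the layer easy to parallelize.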


## Multi-head Attention

Multi-head self-attention uses several sets of query, key, and value matrices, so that each head can learn a different type of relevance.
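
A minimal sketch of the idea, under assumptions the notes do not spell out: each head has its own $W^q$, $W^k$, $W^v$, the head outputs are concatenated along the feature dimension, and an output matrix (called `Wo` here, a hypothetical name) mixes them, as in the original Transformer. The toy heads each look at only one input feature.

```python
import math

def matmul(A, B):
    """Matrix product of A (m x k) and B (k x n), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_columns(A):
    """Apply softmax independently to every column of A."""
    cols = []
    for col in zip(*A):
        m = max(col)
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)

def attention_head(X, Wq, Wk, Wv):
    """One self-attention head over the columns of X."""
    Q, K, V = matmul(Wq, X), matmul(Wk, X), matmul(Wv, X)
    return matmul(V, softmax_columns(matmul(transpose(K), Q)))

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples; Wo mixes the stacked outputs."""
    stacked = []
    for Wq, Wk, Wv in heads:
        # Stacking rows concatenates the heads along the feature dimension.
        stacked.extend(attention_head(X, Wq, Wk, Wv))
    return matmul(Wo, stacked)

# Columns of X are toy inputs (hypothetical values).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
head_a = ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]])  # uses feature 0 only
head_b = ([[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]])  # uses feature 1 only
Wo = [[1.0, 0.0], [0.0, 1.0]]
O = multi_head_attention(X, [head_a, head_b], Wo)
```

Because the two heads see different projections of the same inputs, they produce different attention patterns for the same query.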



## Positional Encoding

Self-attention ignores the order of the sequence, because the same operation is applied to every vector regardless of its position.
Positional encoding adds position information to the vectors.
- each position has a unique positional vector $p^i$, which is added to the original vector $a^i$ before the Q, K, V operations
  - $a^i \leftarrow a^i + p^i$
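- hand-crafted positional vector
  - sinusoidal positional encoding
  - $p^i_j = \begin{cases} \sin\left(\frac{i}{10000^{2\lfloor j/2 \rfloor/d}}\right) & \text{if } j \text{ is even} \\ \cos\left(\frac{i}{10000^{2\lfloor j/2 \rfloor/d}}\right) & \text{if } j \text{ is odd} \end{cases}$
  - $i$: position in the sequence
  - $j$: index of the dimension within the vector; each even/odd pair shares one frequency
  - $d$: dimension of the vector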
- learnable positional vector
- positional encoding is not necessary for RNNs, because they process the sequence in order
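
The sinusoidal encoding can be sketched as below. The sketch assumes the convention of the original Transformer, where each sine/cosine pair of dimensions shares one frequency (exponent $2\lfloor j/2 \rfloor / d$), and it assumes positions and dimensions are counted from 0. The function names are hypothetical.

```python
import math

def positional_vector(i, d):
    """Sinusoidal positional vector of dimension d for position i."""
    p = []
    for j in range(d):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = i / (10000 ** (2 * (j // 2) / d))
        p.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return p

def add_positions(vectors):
    """Add the positional vector of each position to the input at that position."""
    d = len(vectors[0])
    return [[x + pj for x, pj in zip(a, positional_vector(i, d))]
            for i, a in enumerate(vectors)]

p0 = positional_vector(0, 4)  # [0.0, 1.0, 0.0, 1.0]
encoded = add_positions([[0.0, 0.0, 0.0, 0.0] for _ in range(3)])
```

Two identical input vectors at different positions end up different after `add_positions`, which is what lets self-attention tell them apart.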

## Applications

- transformer
- BERT
- self-attention for speech
  - truncated self-attention: attend only within a limited range, because long speech sequences make the self-attention matrix very large
- self-attention for images
  - an image is a set of vectors: each pixel of an RGB image is a 3-dimensional vector
  - CNN is a simplified form of self-attention in which the receptive field is fixed by the shape of the convolutional kernel
    - ref: [on the relationship between self-attention and convolutional layers](https://arxiv.org/pdf/1911.03584.pdf)
  - CNN can be considered a special case of self-attention, so a CNN is less flexible. As a result, a CNN can give good results with less data, while self-attention needs more data to train.
    - ref: [An image is worth 16x16 words: Transformers for image recognition at scale](https://arxiv.org/pdf/2010.11929.pdf)

- self-attention for time series
  - self-attention has largely replaced RNNs in many sequence tasks
  
  - self-attention vs RNN
    - similar: both find relationships between vectors
    - different: how the operation is carried out
      - RNN: processes the sequence step by step, so each output has to wait for the previous ones. The history is carried in memory.
      - self-attention: processes all positions in parallel, with no need to wait for earlier positions to be computed and passed forward
  - ref: 
    - [Transformers are RNNs: fast autoregressive transformers with linear attention](https://arxiv.org/abs/2006.16236)
    - [Time Series Forecasting with Self-Attention Transformers](https://arxiv.org/pdf/1910.13051.pdf)
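
The truncated self-attention listed under speech can be sketched as a softmax restricted to a window around the query. This is only an illustration of the idea: the function name, the symmetric window, and the toy scores are assumptions, not details given in the notes.

```python
import math

def truncated_attention_weights(scores, query_index, window):
    """Softmax over raw scores, keeping only positions within `window` of the query."""
    lo = max(0, query_index - window)
    hi = min(len(scores), query_index + window + 1)
    m = max(scores[lo:hi])  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores[lo:hi]]
    total = sum(exps)
    weights = [0.0] * len(scores)  # positions outside the window get weight 0
    for offset, e in enumerate(exps):
        weights[lo + offset] = e / total
    return weights

# Raw relevance scores of one query against six positions (hypothetical values).
scores = [0.5, 1.0, 0.2, 0.9, 0.1, 0.7]
w = truncated_attention_weights(scores, query_index=2, window=1)
```

Only the positions inside the window need scores at all, so the cost per query depends on the window size rather than on the sequence length.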