# Transformer in Formula and Code

Based on [Dive Into Deep Learning (D2L)](https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html)

## Attention

### Basics
Denote by $D\stackrel{\mathrm{def}}{=}\{(k_{1}, v_{1}), \ldots, (k_{m}, v_{m})\}$ a database of $m$ tuples of _keys_ and _values_, and denote a _query_ by $q$. Then we can define the _attention_ over $D$ as

$$ \mathrm{Attention}(q, D)\stackrel{\mathrm{def}}{=}\sum_{i=1}^m \alpha(q, k_{i})v_{i} $$

where $\alpha(q, k_{i})$ are scalar attention weights. 

This operation pays more attention to terms where the weight is larger, hence the name _attention_.
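As a minimal sketch of the weighted sum above, assuming the values are stored as rows of a NumPy array and the weights are already given (the helper name `attention_pool` and the example numbers are illustrative, not from the source):

```python
import numpy as np

def attention_pool(weights, values):
    """Return sum_i weights[i] * values[i].

    weights: shape (m,), one scalar weight alpha(q, k_i) per key-value pair.
    values:  shape (m, d), one value vector v_i per row.
    """
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    return weights @ values  # weighted sum over the m rows

# Hypothetical database of m = 3 values in d = 2 dimensions.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
weights = np.array([0.1, 0.2, 0.7])  # assumed weights; the last value dominates

output = attention_pool(weights, values)
print(output)
```

The result lies closest to the third value because it carries the largest weight.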

### Requirements
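To train a model with this function smoothly and stably, we want the weights to satisfy two requirements:
- The weights $\alpha(q, k_{i})$ are nonnegative.
- The weights $\alpha(q, k_{i})$ sum to 1, so that together with nonnegativity they form a convex combination, i.e., $\sum_{i}\alpha(q,k_{i})=1$ and $\alpha(q, k_{i})\ge0$.

In the extreme case where exactly one of the weights is 1 and all others are 0, attention reduces to looking up a single value, as in a traditional database query. This is a special case of a convex combination, not a requirement.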

#### Sum to 1
To ensure the weights sum to 1, we can normalize them by dividing each weight by the total (the left-hand side denotes the updated weight):

$$ \alpha(q, k_{i}) = \frac{\alpha(q, k_{i})}{\sum_{j}\alpha(q,k_{j})} $$
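The normalization can be sketched in a few lines of NumPy. The function name `normalize_weights` and the sample numbers are assumptions made for illustration. The second call shows that dividing by the sum alone does not make the weights nonnegative.

```python
import numpy as np

def normalize_weights(raw_weights):
    """Divide each raw weight by the total so the result sums to 1."""
    raw_weights = np.asarray(raw_weights, dtype=float)
    return raw_weights / raw_weights.sum()

# Nonnegative raw weights: the result is a convex combination.
normalized = normalize_weights([1.0, 3.0, 4.0])
print(normalized, normalized.sum())

# A negative raw weight survives normalization: the sum is 1,
# but one weight is still negative.
mixed = normalize_weights([2.0, -1.0, 1.0])
print(mixed, mixed.sum())
```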

#### Non-negative
To also ensure that the weights are nonnegative, we can exponentiate before normalizing. Writing $a(q, k_{i})$ for the unnormalized score, which may be any real number:

$$ \alpha(q, k_{i}) = \frac{\exp(a(q, k_{i}))}{\sum_{j}\exp(a(q, k_{j}))} $$

The resulting weights are differentiable and their gradient never vanishes, both of which are desirable properties in a model.

This is the $\mathrm{softmax}$ operation.
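A minimal NumPy sketch of this operation, applied to a vector of raw scores (the values passed to $\exp$). Subtracting the maximum score before exponentiating is an addition not found in the text above: it leaves the result unchanged mathematically and avoids overflow for large scores. The function name and sample scores are illustrative.

```python
import numpy as np

def softmax(scores):
    """Map real-valued scores to positive weights that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # assumption: stability trick, same result
    exp_scores = np.exp(shifted)     # every entry is strictly positive
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores, including a negative one.
scores = np.array([2.0, -1.0, 1.0])
weights = softmax(scores)
print(weights, weights.sum())
```

Unlike plain normalization, the negative score now yields a small positive weight, and the largest score still receives the largest weight.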