# Transformer

A transformer is a deep learning architecture that uses an attention mechanism.

## Self-attention

Self-attention is a mechanism that processes an array of input data. In the general case, each element of the array is a vector, $x_i \in \mathbb{R}^k$. Unlike in the RNN architecture, the order of the $x_i$ is not central here, although self-attention is usually applied to ordered sequences as well.

The main idea is to build a transformation $SA$ that maps each input element to an output element:

$$SA: x_i \rightarrow y_i.\\ x_i \in \mathbb{R}^k, y_i \in \mathbb{R}^l.$$

For each $x_i$, it considers all the members of the array $x_j$ (including $x_i$ itself) and lets the most relevant ones influence the result most strongly.

The procedure involves computing several vectors that represent each element. An important parameter of the entire procedure is $d$, the dimensionality of the vectors computed for each element of the processed sequence.

At the highest level, the idea is simple: $y_i$ is the weighted sum of all the elements in the sequence:

$$y_i = \sum_{j=1}^n w_{ij} W^\nu x_j$$

Where:

- $W^\nu \in \mathbb{R}^{d \times k}$: a learnable matrix. Since $W^\nu x_j \in \mathbb{R}^d$, the output $y_i$ also lies in $\mathbb{R}^d$, so $l = d$ in this formulation.
- $w_{ij} \in \mathbb{R}$: the crucial element of the self-attention approach. It is the scalar weight of the $j$-th element of the array when computing $y_i$. The process of finding this weight is described below.

**Note:** In some sources, the product $W^\nu x_i$ is interpreted as the vector $\nu_i \in \mathbb{R}^d$ (often called the value vector), which contains the general representation of an element (for example, a word) regardless of context.
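The weighted sum above can be sketched in plain Python. This is only an illustration: the weights are assumed to be given (how they are found is described below), the numbers are made-up toy values, and the helper names `matvec` and `weighted_sum` are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in W]


def weighted_sum(weights, W_v, xs):
    """Compute y_i = sum_j w_ij * (W_v x_j) for one fixed i."""
    values = [matvec(W_v, x) for x in xs]  # nu_j = W_v x_j, each in R^d
    d = len(W_v)
    return [sum(w * v[r] for w, v in zip(weights, values)) for r in range(d)]


# Toy data (assumed for illustration): n = 3 elements, k = 2, d = 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 2.0]]
weights_i = [0.5, 0.25, 0.25]  # assumed given here; they sum to 1

y_i = weighted_sum(weights_i, W_v, xs)
print(y_i)  # [0.75, 1.0]
```

Each $W^\nu x_j$ is computed once and then mixed according to the weights, so the result has dimension $d$ regardless of the number of elements $n$.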

For each element, we introduce two vectors:

$$k_i = W^k x_i\\ q_i = W^q x_i.$$

Here $W^k \in \mathbb{R}^{d \times k}$ and $W^q \in \mathbb{R}^{d \times k}$ are learnable parameters.

The idea behind the method is that these vectors act as queries ($q$) and keys ($k$). The matrices that produce them ($W^k, W^q$) are learned in such a way that the keys of some elements match the queries of other elements.

In this context, "match" means that the scalar product of $q_i$ and $k_j$ is large.

For the $i$-th element of the array being processed, the vector $(q_i \cdot k_1, q_i \cdot k_2, \ldots, q_i \cdot k_n)$ is computed. Its $j$-th component is higher the better $k_j$ matches $q_i$.
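As a rough sketch, the vector of match scores for one element can be computed as follows. The matrices and inputs are made-up toy values, and the helper names (`matvec`, `dot`, `match_scores`) are hypothetical, introduced only for this example.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in W]


def dot(a, b):
    """Scalar product of two vectors of equal length."""
    return sum(a_r * b_r for a_r, b_r in zip(a, b))


def match_scores(i, W_q, W_k, xs):
    """Return (q_i . k_1, ..., q_i . k_n) for the i-th element (0-based)."""
    q_i = matvec(W_q, xs[i])               # query of the chosen element
    keys = [matvec(W_k, x) for x in xs]    # keys of all elements
    return [dot(q_i, k_j) for k_j in keys]


# Toy data (assumed for illustration): n = 3 elements, k = 2, d = 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 0.0], [0.0, 1.0]]

scores = match_scores(0, W_q, W_k, xs)
print(scores)  # [2.0, 0.0, 2.0]
```

With these toy matrices, the first element's query matches the keys of the first and third elements equally well and does not match the second at all.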

So for a chosen element $x_i$, this vector can be regarded as the weights of its matches with all the elements. However, to give them the properties of real weights (non-negative and summing to 1), the vector is usually passed through a softmax function.
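A minimal softmax sketch shows how raw match scores become weights. The input scores are made-up values, and subtracting the maximum is a common numerical-stability choice that does not change the result.

```python
import math


def softmax(scores):
    """Map real-valued scores to positive weights that sum to 1."""
    m = max(scores)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores (assumed for illustration).
weights = softmax([2.0, 0.0, 2.0])
print(weights)       # roughly [0.468, 0.063, 0.468]
print(sum(weights))  # approximately 1.0
```

Higher scores receive larger weights, and even a score of zero keeps a small positive weight.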

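Finally, the weights are computed as follows:

$$w_i = (w_{i1}, \ldots, w_{in}) = \operatorname{softmax} \left( (W^q x_i)^\top W^k x_1, (W^q x_i)^\top W^k x_2, \ldots, (W^q x_i)^\top W^k x_n \right)$$

Here $w_i \in \mathbb{R}^n$, and its $j$-th component is the weight of $x_j$ when computing $y_i$.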

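The entire transformation takes the following form, where the subscript $j$ selects the $j$-th component of the softmax output:

$$y_i = \sum_{j=1}^n \operatorname{softmax} \left( (W^q x_i)^\top W^k x_1, (W^q x_i)^\top W^k x_2, \ldots, (W^q x_i)^\top W^k x_n \right)_j W^\nu x_j$$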
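Putting the pieces together, the whole transformation can be sketched in plain Python. This follows the formula above literally and is not a production implementation: it omits extras such as the $\sqrt{d}$ scaling used in common transformer implementations, the matrices are made-up toy values rather than learned parameters, and all function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in W]


def dot(a, b):
    """Scalar product of two vectors of equal length."""
    return sum(a_r * b_r for a_r, b_r in zip(a, b))


def softmax(scores):
    """Map real-valued scores to positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs, W_q, W_k, W_v):
    """Return [y_1, ..., y_n] for inputs [x_1, ..., x_n]."""
    queries = [matvec(W_q, x) for x in xs]
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    d = len(W_v)
    ys = []
    for q_i in queries:
        # Weights of all elements in the context of the i-th element.
        w_i = softmax([dot(q_i, k_j) for k_j in keys])
        # Weighted sum of the value vectors.
        ys.append([sum(w * v[r] for w, v in zip(w_i, values)) for r in range(d)])
    return ys


# Toy data (assumed for illustration): n = 3 elements, k = 2, d = 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 2.0]]

ys = self_attention(xs, W_q, W_k, W_v)
for y in ys:
    print(y)
```

One sanity check on this sketch: if $W^q$ is the zero matrix, all scores are zero, the softmax weights become uniform, and every $y_i$ reduces to the plain average of the vectors $W^\nu x_j$.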