# Attention in Transformers: Concepts and Code in PyTorch

Instructor: Josh Starmer

## Introduction

The transformer architecture was first introduced in the 2017 paper ["Attention is all you need"](https://arxiv.org/abs/1706.03762) for machine translation tasks. The idea was to input, say, an English sentence and have the network output a German sentence. The same architecture turned out to be great at taking a **prompt** as input and producing a **response** to that prompt, such as a question and the answer to that question. This is what started the rise of **large language models**.

## The Main Ideas Behind Transformers and Attention

Transformers are based on 3 building blocks:

1) **Word Embedding**: converts tokens (words, parts of words, symbols, etc.) into numbers that can be fed into a neural network.

2) **Positional Encoding**: helps keep track of word order.

3) **Attention**: helps establish relationships among words. In *Self-Attention*, for each word we calculate its similarity to every word in the sentence, including itself. These similarities are then used to determine how the Transformer encodes each word. In the example below, if in the training sentences about "pizza" the word "it" was more commonly associated with "pizza" than with "oven", then the similarity score for "pizza" will have a larger impact on how the word "it" is encoded by the Transformer.

<img src="images/sa_example.jpg" width="400px" />


## The Matrix Math for Calculating Self-Attention

In Self-Attention, the equation to calculate the Attention depends on the 3 inputs Query, Key and Value:

$$\text{Attention}(Q,K,V) = \text{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The names come from database terminology, where a Query is the input that is matched against the Keys by similarity in order to retrieve the associated Value.

To see how Attention works consider the sentence

$$\text{write a poem}$$

And imagine that word embedding converts each of the 3 tokens into a 2-dimensional vector (in practice the dimension would typically be 512 or more), and that we stack these vectors one on top of the other:

<img src="images/encoded_input.jpg" width="200px" />

To get Q, K and V, we then multiply the encoded values by 3 square weight matrices, which gives matrices Q, K and V with the same dimensions as the encoded values:

<img src="images/QKV.jpg" width="400px" />

The transpose symbols are there because PyTorch prints out the weights in a way that requires them to be transposed before the math works out correctly.
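As a minimal sketch of this step, the snippet below builds Q, K and V with three bias-free `nn.Linear` layers and checks that the result matches multiplying by the transposed weights. The encoding values, the seed and the names `W_q`, `W_k`, `W_v` are assumptions chosen for illustration, not values taken from the figures.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the random weights are reproducible

# Illustrative (assumed) 2-dimensional encodings for "write", "a", "poem",
# stacked one on top of the other: shape (3 tokens, 2 values per token)
encodings = torch.tensor([[1.16, 0.23],
                          [0.57, 1.36],
                          [4.41, -2.16]])

d_model = 2  # number of values per token in this toy example

# Three square weight matrices, one each for Queries, Keys and Values
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(encodings)
K = W_k(encodings)
V = W_v(encodings)

# nn.Linear stores its weight with shape (out_features, in_features) and
# computes x @ weight.T, which is where the transpose comes from
Q_manual = encodings @ W_q.weight.T

print(Q.shape, K.shape, V.shape)
print(torch.allclose(Q, Q_manual))
```

Because the weight matrices are square, Q, K and V keep the same shape as the encoded input.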

Now the first step is to multiply matrix Q by the transpose of K:

<img src="images/QK.jpg" width="600px" />

Each entry of the resulting matrix is the *unscaled Dot Product* of one Query with one Key, so the matrix covers all combinations of Queries and Keys across the words. The Dot Product can be used as an unscaled measure of similarity between two things, and it's closely related to the *Cosine Similarity*, with the difference that the latter is scaled to be between $-1$ and $1$.
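To make the relationship between the two measures concrete, here is a small sketch with two made-up vectors that point in the same direction but have different lengths. The vectors `q` and `k` are assumptions for illustration only.

```python
import torch

# Two assumed vectors pointing in the same direction; k is twice as long as q
q = torch.tensor([3.0, 4.0])
k = torch.tensor([6.0, 8.0])

# Unscaled Dot Product: grows with the length of the vectors
dot = torch.dot(q, k)

# Cosine Similarity: the Dot Product divided by the product of the lengths,
# which keeps the result between -1 and 1
cos = dot / (q.norm() * k.norm())

print(dot.item())  # 3*6 + 4*8 = 50
print(cos.item())
```

The Dot Product is 50 here, while the Cosine Similarity comes out as 1 (up to floating-point rounding) because the two vectors are perfectly aligned.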

The second step is to scale the Dot Product by dividing it by $\sqrt{d_k}$, the square root of the dimension of the Keys, which is 2 in this example. This gives a *scaled Dot Product* similarity. Note that dividing by the square root of the number of values per token doesn't scale the Dot Product in any systematic way, but the authors of the original paper reported that it improved performance.
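The first two steps can be sketched as follows. The `Q` and `K` values are made up for illustration, and `d_k` is read off the last dimension of `K`, which is 2 in this toy setup.

```python
import math
import torch

# Assumed Query and Key matrices for 3 tokens with 2 values per token
Q = torch.tensor([[1.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
K = torch.tensor([[1.0, 1.0],
                  [0.5, -1.0],
                  [2.0, 0.0]])

# Step 1: unscaled Dot Product similarities, one row per Query
sims = Q @ K.T

# Step 2: divide by the square root of the number of values per Key
d_k = K.size(-1)
scaled_sims = sims / math.sqrt(d_k)

print(sims)
print(scaled_sims)
```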

The next step is to take the soft-max of each row in the matrix of scaled Dot Product similarities. The soft-max function $\sigma:\mathbb{R}^K \rightarrow (0,1)^K$, with $K>1$, takes a vector $z = (z_1, \dots, z_K) \in \mathbb{R}^K$ and computes each component of the vector $\sigma(z) \in (0,1)^K$ as

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$

<img src="images/softmax.jpg" width="600px" />
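A short sketch of the row-wise soft-max, comparing `torch.nn.functional.softmax` with the formula written out by hand. The input matrix is an assumption for illustration and does not reproduce the numbers in the figure.

```python
import torch
import torch.nn.functional as F

# Assumed matrix of scaled similarities: one row per Query, one column per Key
scaled_sims = torch.tensor([[0.0, 1.0, 2.0],
                            [1.0, 1.0, 1.0],
                            [-1.0, 0.0, 3.0]])

# dim=1 applies the soft-max independently to each row
percentages = F.softmax(scaled_sims, dim=1)

# The same computation written out from the formula
exp = torch.exp(scaled_sims)
manual = exp / exp.sum(dim=1, keepdim=True)

print(percentages)
print(percentages.sum(dim=1))  # each row sums to 1, up to rounding
```

Note that the second row, where all similarities are equal, turns into equal percentages of one third each.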

Each row of the resulting matrix sums to $1$, so we can think of these values as a summary of the relationships among tokens. For example, the word "write" is 36% similar to itself, 40% similar to "a" and 24% similar to "poem".

The last step is to multiply the soft-max percentages by the Values in matrix V. For example, the first Self-Attention value for the word "write" is obtained by taking 36% of the first Value for "write", adding 40% of the first Value for "a", and then adding 24% of the first Value for "poem", that is $0.60 \times 36\% - 0.35 \times 40\% + 3.86 \times 24\% \approx 1$. In other words, the percentages that come out of the soft-max function tell us how much influence each word should have on the final encoding for any given word:

<img src="images/softmax_V.jpg" width="600px" />
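The arithmetic for the first Self-Attention value of "write" can be checked with a few lines, using the rounded percentages and first-column Values quoted in the text above.

```python
import torch

# Soft-max percentages for "write" against "write", "a", "poem" (from the text)
percentages = torch.tensor([0.36, 0.40, 0.24])

# First Value for "write", "a", "poem" (from the text)
first_values = torch.tensor([0.60, -0.35, 3.86])

# Weighted sum: 0.216 - 0.140 + 0.9264
first_attention = torch.dot(percentages, first_values)

print(round(first_attention.item(), 4))  # 1.0024, i.e. roughly 1
```

Doing this for every row of percentages and every column of V at once is exactly the matrix product of the soft-max output with V.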

In summary, the equation for calculating Self-Attention amounts to:

1) calculating the scaled Dot Product similarity among all the words

2) converting those scaled similarities into percentages with the soft-max function

3) using those percentages to scale the Values to become the Self-Attention scores for each word
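Putting the three steps together, a minimal sketch of a Self-Attention module might look like the following. The class name `SelfAttention`, the bias-free linear layers, the seed and the input values are assumptions for illustration; this is not presented as the course's reference implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Minimal single-head Self-Attention sketch (illustrative only)."""

    def __init__(self, d_model=2):
        super().__init__()
        # Square weight matrices that produce Queries, Keys and Values
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_encodings):
        q = self.W_q(token_encodings)
        k = self.W_k(token_encodings)
        v = self.W_v(token_encodings)

        # 1) scaled Dot Product similarities among all the words
        sims = q @ k.transpose(-2, -1)
        scaled_sims = sims / math.sqrt(k.size(-1))

        # 2) convert each row of similarities into percentages
        percentages = F.softmax(scaled_sims, dim=-1)

        # 3) use the percentages to weight the Values
        return percentages @ v


torch.manual_seed(42)  # arbitrary seed for reproducible random weights

# Assumed encodings for the 3 tokens "write", "a", "poem"
encodings = torch.tensor([[1.16, 0.23],
                          [0.57, 1.36],
                          [4.41, -2.16]])

attention = SelfAttention(d_model=2)
output = attention(encodings)
print(output.shape)  # one row of Self-Attention values per token
```

The output has the same shape as the input, one row of Self-Attention values per token.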

## Self-Attention vs Masked Self-Attention

Consider a simple case of creating word embeddings. The training data consists of 2 sentences, "My pizza is great!" and "My pizza is awesome". Each of the 5 *distinct* words is encoded as a $1$ in its own input (with $0$ for all the other words) and passed through some randomly initialized weights and some activation functions that, put together, give a probability for the next word. The number of activation functions is the dimension of the embedding. After some training, the weights are the word embeddings, and words used in similar contexts, such as "great" and "awesome", end up close to each other:

<img src="images/embeddings.jpg" width="700px" />
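To see why the trained weights can be read as the embeddings, the sketch below passes a one-hot encoded word through a bias-free linear layer and compares the result with the matching column of the weight matrix. It assumes identity activation functions and untrained random weights, and the names `vocab` and `to_hidden` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for reproducible random weights

# The 5 distinct words from the two training sentences
vocab = ["my", "pizza", "is", "great", "awesome"]
embedding_dim = 2  # number of activation functions (assumed to be identity)

# One weight per (word, activation function) pair; weight shape is (2, 5)
to_hidden = nn.Linear(len(vocab), embedding_dim, bias=False)

# Encode "great" as a 1 in its own position and 0 everywhere else
index = vocab.index("great")
one_hot = torch.zeros(len(vocab))
one_hot[index] = 1.0

# Passing the one-hot vector through the layer just selects the weights
# attached to "great", so those weights act as the word's embedding
hidden = to_hidden(one_hot)
embedding = to_hidden.weight[:, index]

print(hidden)
print(torch.allclose(hidden, embedding))
```

With training (not shown here), these weights would be adjusted so that words appearing in similar contexts get similar embeddings.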

With more context, one can train word embeddings not only by using one word to predict the next one, but by using many preceding words at the same time. However, **order also matters**. Hence the idea of Positional Encoding, which takes word order into account when creating embeddings and is then followed by an Attention layer that helps establish relationships among words.