# Attention is all you need

Paper: [Attention Is All You Need. Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)

Explanation taken from the [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) by Jay Alammar.

In [67]:
import torch
import torch.nn as nn

## Attention

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

### Scaled Dot-Product Attention

<img src="assets/scaled_dotptoduct_attention.png" width="400" height="400">

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three weight matrices that are learned during training.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don’t have to be smaller; this is an architecture choice that keeps the total computation of multi-head attention roughly the same as that of a single head with full dimensionality (the paper uses 8 heads, and 8 × 64 = 512).

<img src="assets/transformer_self_attention_vectors.png" width="600" height="400">

In [68]:
embedding_size = 512
d_k = 64 

x1 = torch.rand(embedding_size)        # word embedding: input vector

Wq = torch.rand(embedding_size, d_k)   # query matrix
Wk = torch.rand(embedding_size, d_k)   # key matrix
Wv = torch.rand(embedding_size, d_k)   # value matrix

q1 = torch.matmul(x1, Wq)        # query vector
k1 = torch.matmul(x1, Wk)        # key vector
v1 = torch.matmul(x1, Wv)        # value vector

print(f"x1 shape: {x1.shape}")
print(f"Wq shape: {Wq.shape}")
print(f"q1 shape: {q1.shape}")

x1 shape: torch.Size([512])
Wq shape: torch.Size([512, 64])
q1 shape: torch.Size([64])
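
The cell above projects one word at a time. As a minimal sketch of the same step for a whole sentence, the embeddings can be stacked as rows of a matrix so that one matrix product yields every query, key, and value vector. The name `X`, the two-word sentence length, and the fixed seed are assumptions made for illustration; the weights are random stand-ins, not trained values.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

embedding_size = 512
d_k = 64
n_words = 2  # hypothetical two-word input sentence

# One embedding per row: shape (n_words, embedding_size)
X = torch.rand(n_words, embedding_size)

# Random stand-ins for the learned projection matrices
Wq = torch.rand(embedding_size, d_k)
Wk = torch.rand(embedding_size, d_k)
Wv = torch.rand(embedding_size, d_k)

# A single matmul projects every word; row i holds the vector for word i
Q = X @ Wq
K = X @ Wk
V = X @ Wv

print(Q.shape, K.shape, V.shape)

# Row 0 of Q agrees with the per-vector computation used in the cell above
q1 = torch.matmul(X[0], Wq)
print(torch.allclose(Q[0], q1, rtol=1e-4))
```

Each of `Q`, `K`, and `V` has shape `(n_words, d_k)`, which is the layout the matrix form of attention works with.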


The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

<img src="assets/transformer_self_attention_score.png" width="600" height="350">

In [69]:
score = torch.matmul(q1, k1)
score

tensor(1062170.5000)
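
The cell above computes a single score. A small sketch of scoring one query against every key in the sentence follows, first as a loop of dot products and then as one matrix-vector product. The three-word sentence length and the random `q1` and `K` are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)

d_k = 64
n_words = 3  # hypothetical sentence length

q1 = torch.rand(d_k)          # query of the word being encoded
K = torch.rand(n_words, d_k)  # one key vector per row

# Dot product of q1 with each key, one word at a time
scores_loop = torch.stack([torch.dot(q1, K[j]) for j in range(n_words)])

# The same scores from a single matrix-vector product
scores = K @ q1

print(scores.shape)
print(torch.allclose(scores, scores_loop))
```

Entry `j` of `scores` is the dot product of `q1` with the key of word `j`, so the first entry plays the role of `score` in the cell above.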

The third and fourth steps are to divide the scores by 8 (the square root of 64, the dimension of the key vectors used in the paper) and then pass the result through a softmax operation. The scaling leads to more stable gradients. Other values are possible, but this is the default. Softmax normalizes the scores so they’re all positive and add up to 1.

<img src="assets/self-attention_softmax.png" width="600" height="350">

In [70]:
normalized_score = score / (d_k ** 0.5)
softmax_score = torch.softmax(normalized_score, 0)   # softmax over a single score is always 1

print(f"normalized score: {normalized_score}")
print(f"softmax score: {softmax_score}")

normalized score: 132771.3125
softmax score: 1.0
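
The tensors in this notebook come from `torch.rand`, which samples uniformly from [0, 1), so the scores stay very large even after dividing by 8. The sketch below illustrates the scaling argument under a different, explicitly assumed setup: query and key components drawn independently with mean 0 and variance 1 (`torch.randn`). Under that assumption the raw dot product has a standard deviation of about the square root of `d_k`, and dividing by that value brings it back to about 1. The sample count and seed are arbitrary choices.

```python
import torch

torch.manual_seed(0)

d_k = 64
n_samples = 10_000

# Assumption: components are independent with mean 0 and variance 1
q = torch.randn(n_samples, d_k)
k = torch.randn(n_samples, d_k)

raw = (q * k).sum(dim=-1)   # one dot product per row
scaled = raw / d_k ** 0.5   # divide by sqrt(d_k) = 8

print(f"std of raw scores:    {raw.std().item():.2f}")     # close to 8
print(f"std of scaled scores: {scaled.std().item():.2f}")  # close to 1

# Softmax saturates when its inputs are far apart
far_apart = torch.tensor([132771.0, 132000.0])
print(torch.softmax(far_apart, dim=0))  # tensor([1., 0.])
```

A saturated softmax puts all the weight on one position and has near-zero gradients, which is the instability the scaling is meant to reduce.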


In [71]:
# same example with 2 input vectors

x2 = torch.rand(embedding_size)

q2 = torch.matmul(x2, Wq)        # query vector
k2 = torch.matmul(x2, Wk)
v2 = torch.matmul(x2, Wv)

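score_x2 = torch.matmul(q1, k2)                 # score of word 1 against word 2
normalized_x2_score = score_x2 / (d_k ** 0.5)   # scale by sqrt(d_k), as for the first score

# softmax over the scaled scores of word 1 against both words
softmax_score_x_x2 = torch.softmax(torch.stack((normalized_score, normalized_x2_score)), 0)
softmax_score_x_x2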

tensor([1., 0.])
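
Putting the steps together, the formula at the top of this section can be sketched in matrix form as below. The function name `scaled_dot_product_attention`, the two-word input, and the small random weights (scaled so the softmax does not saturate) are assumptions for illustration. This is not the paper's reference implementation, and it leaves out masking and multiple heads.

```python
import torch


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D (or batched) inputs."""
    d_k = K.shape[-1]
    # Entry (i, j) is the scaled score of query i against key j
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Normalize each row so the weights for one query sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors for each query
    return weights @ V, weights


torch.manual_seed(0)

n_words, embedding_size, d_k = 2, 512, 64
X = torch.randn(n_words, embedding_size)

# Hypothetical small random projections standing in for learned weights
Wq = torch.randn(embedding_size, d_k) / embedding_size ** 0.5
Wk = torch.randn(embedding_size, d_k) / embedding_size ** 0.5
Wv = torch.randn(embedding_size, d_k) / embedding_size ** 0.5

out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)

print(out.shape)            # torch.Size([2, 64])
print(weights.sum(dim=-1))  # each row sums to 1
```

Row `i` of `weights` holds the softmax scores of word `i` against every word, and row `i` of `out` is the corresponding weighted sum of value vectors.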