# <font color='#2B4865'>**Transformer's Attention Mechanism**</font>

---
### Natural Language Processing
Date: Dec 13, 2022

Last Update: Nov 23, 2023

---

This notebook is based on the reference notebooks from [Denis Rothman](https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Chapter02/Multi_Head_Attention_Sub_Layer.ipynb) and [Manuel Romero](https://colab.research.google.com/drive/1rPk3ohrmVclqhH7uQ7qys4oznDdAhpzF).

Our goal here is to obtain a mathematical view of the Transformer's Multi-Head attention mechanism.

In [None]:
%pip install --quiet colored

from colored import Fore, Back, Style
import numpy as np
from scipy.special import softmax

## <font color='#2B4865'>**Step 1: Represent the input**
---
</font>

For visualization purposes, we scale the input of the attention mechanism down from $d_{model}=512$, as in the original Transformer model, to $d_{model}=4$. Each input vector $x$ fed to the model therefore has only 4 dimensions.

<br><center><img src="https://drive.google.com/uc?id=1uGolVso4Z72KA88aQ3wWCoS-k9Vd8Y70" width="60%"></center><br>

In [None]:
x =np.array([[1.0, 0.0, 1.0, 0.0],   # Input #1
             [0.0, 2.0, 0.0, 2.0],   # Input #2
             [1.0, 1.0, 1.0, 1.0]])  # Input #3
print(x)

[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


## <font color='#2B4865'>**Step 2: Initialize weight matrices**
---
</font>

Every input must have three representations, namely <font color='#80D0FF'>queries</font>, <font color='#82D9B3'>keys</font> and <font color='#F89797'>values</font>. To obtain these representations, **the attention head uses three weight matrices, shared by all inputs:**

*   <font color='#80D0FF'>$W^Q$</font> to produce the <font color='#80D0FF'>queries</font>
*   <font color='#82D9B3'>$W^K$</font> to produce the <font color='#82D9B3'>keys</font>
*   <font color='#F89797'>$W^V$</font> to produce the <font color='#F89797'>values</font>

In the original Transformer model, each of these matrices projects an input of $d_{model}=512$ dimensions down to $d_{k}=64$ dimensions, but here we scale them down to $d_{k}=3$. Because every input has a dimension of $4$, each weight matrix must have a shape of $4\times 3$.

<br><center><img src="https://drive.google.com/uc?id=1CFA4GNYXHlGIF95hw9jjDvj_vpWYMuOn" width="60%"></center><br>

*Note: In a neural network setting, these weights are usually small numbers, initialized randomly, for example by sampling from a Gaussian distribution or by using an initialization scheme such as Xavier (Glorot) or Kaiming (He).*
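To make the note above concrete, here is a minimal sketch of Xavier (Glorot) uniform initialization for the three weight matrices. The helper `xavier_uniform` and the `*_init` variable names are introduced here for illustration only; the rest of this notebook keeps using the hand-picked integer weights.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

d_model, d_k = 4, 3                    # toy sizes used in this notebook
rng = np.random.default_rng(seed=0)    # fixed seed, only for reproducibility

# Hypothetical randomly initialized counterparts of w_query, w_key and w_value
w_query_init = xavier_uniform(d_model, d_k, rng)
w_key_init = xavier_uniform(d_model, d_k, rng)
w_value_init = xavier_uniform(d_model, d_k, rng)

print(w_query_init.shape)  # (4, 3)
```

In a real model, these matrices would then be updated by gradient descent during training.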

In [None]:
w_query =np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]])

w_key =np.array([[0, 0, 1],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 1, 0]])

w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])

print(Fore.LIGHT_BLUE + Style.BOLD + "Weights for query: \n" + Style.RESET, w_query)
print(Fore.LIGHT_GREEN + Style.BOLD + "Weights for key: \n" + Style.RESET, w_key)
print(Fore.LIGHT_RED + Style.BOLD + "Weights for value: \n" + Style.RESET, w_value)

Weights for query: 
 [[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]
Weights for key: 
 [[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]
Weights for value: 
 [[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


## <font color='#2B4865'>**Step 3: Matrix multiplication to obtain queries, keys and values**
---
</font>

Now that we have the three sets of weights, we obtain the <font color='#80D0FF'>query</font>, <font color='#82D9B3'>key</font> and <font color='#F89797'>value</font> representations for every input. We do this by multiplying the input vectors by the weight matrices:

<br><center><img src="https://drive.google.com/uc?id=10UTnlUdEf9xtbaCrcJqrjg9A24NUPwiM" width="60%"></center><br>

*Note: In practice, a bias vector may be added to the result of the matrix multiplication.*
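As a rough illustration of the note above, the sketch below adds a bias vector to the query projection. The bias values in `b_query` are made up for this example and are not used anywhere else in the notebook.

```python
import numpy as np

x = np.array([[1.0, 0.0, 1.0, 0.0],   # Input #1
              [0.0, 2.0, 0.0, 2.0],   # Input #2
              [1.0, 1.0, 1.0, 1.0]])  # Input #3

w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])

# Hypothetical bias, one entry per query dimension (d_k = 3)
b_query = np.array([0.5, -0.5, 0.0])

# Broadcasting adds the same bias vector to every row of x @ w_query
Q_with_bias = x @ w_query + b_query
print(Q_with_bias)
```

The same pattern would apply to the keys and the values, each with its own bias vector.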

In [None]:
Q=np.matmul(x,w_query)
K=np.matmul(x,w_key)
V=np.matmul(x,w_value)

print(Fore.LIGHT_BLUE + Style.BOLD + "Queries: \n" + Style.RESET, Q)
print(Fore.LIGHT_GREEN + Style.BOLD + "Keys: \n" + Style.RESET, K)
print(Fore.LIGHT_RED + Style.BOLD + "Values: \n" + Style.RESET, V)

Queries: 
 [[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]
Keys: 
 [[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]
Values: 
 [[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]


## <font color='#2B4865'>**Step 4: Calculate scaled attention scores**
---
</font>

The attention head now implements the original Transformer equation:

$$Attention(\mathbf{Q,K,V}) = softmax\left(\frac{\mathbf{QK^T}}{\sqrt{d_k}}\right)\mathbf{V}$$

This step focuses on $\mathbf{Q}$ and $\mathbf{K}$:

$$\left(\frac{\mathbf{QK^T}}{\sqrt{d_k}}\right)$$

We start by calculating the **scaled attention scores** for Input #1 by taking a scaled dot product between Input #1's <font color='#80D0FF'>query</font> and **all** <font color='#82D9B3'>keys</font>, including its own:

$$\text{Score of Input #1} = Q_1 K^T \hspace{2mm} \text{(all three keys)}$$

Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores for Input #1:

$$\text{Score of Input #1} = [2, 4, 4]$$
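
Written out with the values from Step 3, where $Q_1 = [1, 0, 2]$ and the keys are $K_1 = [0, 1, 1]$, $K_2 = [4, 4, 0]$ and $K_3 = [2, 3, 1]$, the three dot products are (before any scaling):

$$
\begin{aligned}
Q_1 \cdot K_1 &= (1)(0) + (0)(1) + (2)(1) = 2 \\
Q_1 \cdot K_2 &= (1)(4) + (0)(4) + (2)(0) = 4 \\
Q_1 \cdot K_3 &= (1)(2) + (0)(3) + (2)(1) = 4
\end{aligned}
$$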

Then, we repeat the same steps for both Input #2 and Input #3. In practice, we calculate the attention scores for all three inputs at once in matrix form.

*Note: For this example, we replace $\sqrt{d_k}=\sqrt{3}\approx 1.73$ with $1$ to simplify the computations.*
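For comparison, the sketch below applies the actual $\sqrt{d_k}$ scaling instead of the simplified divisor of 1. It is only meant to show what the unsimplified scores look like; the name `scaled_scores` is introduced here and the rest of the notebook keeps the simplification.

```python
import numpy as np

# Queries and keys as obtained in Step 3
Q = np.array([[1.0, 0.0, 2.0],
              [2.0, 2.0, 2.0],
              [2.0, 1.0, 3.0]])
K = np.array([[0.0, 1.0, 1.0],
              [4.0, 4.0, 0.0],
              [2.0, 3.0, 1.0]])

d_k = K.shape[-1]                          # 3 in this toy example
scaled_scores = (Q @ K.T) / np.sqrt(d_k)   # true scaling by sqrt(3)
print(np.round(scaled_scores, 3))
```

Dividing by $\sqrt{d_k}$ shrinks every score by the same factor, which makes the softmax in the next step less sharply peaked than with a divisor of 1.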

In [None]:
k_d = 1  # stands in for sqrt(d_k); the true value sqrt(3) is about 1.73, simplified to 1 here
attention_scores = (Q @ K.T) / k_d
print(Fore.YELLOW + Style.BOLD + "Attention scores: \n" + Style.RESET, attention_scores)

# tensor([[ 2.,  4.,  4.],  # attention scores from Query 1
#         [ 4., 16., 12.],  # attention scores from Query 2
#         [ 4., 12., 10.]]) # attention scores from Query 3

Attention scores: 
 [[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]


<br><center><img src="https://drive.google.com/uc?id=1vGdAIEDVa1eg3XbdXupHJcveKWsOEpmU" width="60%"></center><br>


## <font color='#2B4865'>**Step 5: Scaled softmax attention scores**
---
</font>

We now apply a softmax function to each row of the attention score matrix, so that the scores of each input sum to 1. Instead of processing the whole matrix at once, let's zoom in on each row vector:

In [None]:
attention_scores[0]=softmax(attention_scores[0])
attention_scores[1]=softmax(attention_scores[1])
attention_scores[2]=softmax(attention_scores[2])
print(Fore.YELLOW + Style.BOLD + "Scaled softmax attention scores: " + Style.RESET)
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

Scaled softmax attention scores: 
[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]


With this, we obtain a scaled softmax attention score for each vector. For example, the softmax of the score of $x_1$ is:

$$\text{Softmax(Score of Input #1)} \approx [0.06, 0.47, 0.47]$$

<br><center><img src="https://drive.google.com/uc?id=1KU1KdQPkFzdS94e3NoVo1w9TaJC6Jz0m" width="60%"></center><br>
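
The row-by-row softmax of this step can also be done in a single call. The sketch below uses a small NumPy helper, `softmax_rows`, written for this example; `scipy.special.softmax(scores, axis=1)` should give the same result. Subtracting the row maximum before exponentiating is a common trick to avoid overflow and does not change the output.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax applied independently to each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Attention scores from Step 4 (with the simplified divisor of 1)
raw_scores = np.array([[2.0, 4.0, 4.0],
                       [4.0, 16.0, 12.0],
                       [4.0, 12.0, 10.0]])

attention_weights = softmax_rows(raw_scores)
print(attention_weights)
print(attention_weights.sum(axis=1))  # every row sums to 1
```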

## <font color='#2B4865'>**Step 6: The final attention representations**
---
</font>

Based on the results obtained in Step 5, we can now finalize the complete attention equation presented in Step 4 by plugging V in:

$$Attention(\mathbf{Q,K,V}) = softmax\left(\frac{\mathbf{QK^T}}{\sqrt{d_k}}\right)\mathbf{V}$$

In *Steps 6 and 7*, we compute the attention output for input $x_1$ only. Note that the result is **one attention vector per word vector**. When we reach *Step 8*, we will generalize the attention calculation to the other two input vectors.

To obtain $Attention(\mathbf{Q, K,V})$ for $x_1$, we multiply each of the 3 softmax attention scores of $x_1$ by the corresponding value vector, one by one, to zoom in on the inner workings of the equation:

In [None]:
print(Fore.LIGHT_RED + Style.BOLD + "V1: \n" + Style.RESET, V[0])
print(Fore.LIGHT_RED + Style.BOLD + "V2: \n" + Style.RESET, V[1])
print(Fore.LIGHT_RED + Style.BOLD + "V3: \n" + Style.RESET, V[2])
print()

print(Fore.YELLOW + Style.BOLD + "Attention 1: " + Style.RESET)
attention1=attention_scores[0][0]*V[0]
print(attention1)

print(Fore.YELLOW + Style.BOLD + "Attention 2: " + Style.RESET)
attention2=attention_scores[0][1]*V[1]
print(attention2)

print(Fore.YELLOW + Style.BOLD + "Attention 3: " + Style.RESET)
attention3=attention_scores[0][2]*V[2]
print(attention3)

V1: 
 [1. 2. 3.]
V2: 
 [2. 8. 0.]
V3: 
 [2. 6. 3.]

Attention 1: 
[0.06337894 0.12675788 0.19013681]
Attention 2: 
[0.93662106 3.74648425 0.        ]
Attention 3: 
[0.93662106 2.80986319 1.40493159]


*Step 6* is complete: the 3 weighted value vectors for $x_1$, one per input, have been calculated:

<br><center><img src="https://drive.google.com/uc?id=1DrE82eezWWpQ4o5OtNPSI8jwFdR6c5u8" width="60%"></center><br>

## <font color='#2B4865'>**Step 7: Summing up the results**
---
</font>

In this step, we sum the results of *Step 6* to create the first row of the output matrix. The second row will hold the output for the next input, that is, $x_2$ in this example.

In [None]:
attention_input1=attention1+attention2+attention3
print(Fore.YELLOW + Style.BOLD + "Sum Attention for x1: " + Style.RESET, attention_input1)

Sum Attention for x1:  [1.93662106 6.68310531 1.59506841]


We can see the summed attention value for $x_1$ in the figure below, thus completing the steps for the first input.

<br><center><img src="https://drive.google.com/uc?id=1Na8BqeyuZYNY6E98-3qWiJ3M4QGG2ZCg" width="60%"></center><br>
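
Putting *Steps 6 and 7* together, the attention output for $x_1$ is a weighted sum of the value vectors, with the softmax scores of $x_1$ as weights (values rounded):

$$
\begin{aligned}
Attention(x_1) &\approx 0.0634\,V_1 + 0.4683\,V_2 + 0.4683\,V_3 \\
&= 0.0634\,[1, 2, 3] + 0.4683\,[2, 8, 0] + 0.4683\,[2, 6, 3] \\
&\approx [1.937, 6.683, 1.595]
\end{aligned}
$$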

## <font color='#2B4865'>**Step 8: Steps 1 to 7 for all the inputs**
---
</font>

The Transformer can now produce the attention values of inputs two and three using the same method described in *Steps 1 to 7* for one attention head.
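
As a sketch of how this generalizes, the function below computes the attention output for all three inputs at once in matrix form. The function name `scaled_dot_product_attention` and its `scale` parameter are choices made for this example; passing `scale=1.0` reproduces the simplification used in this notebook, while the default applies the true $\sqrt{d_k}$.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, scale=None):
    """Return softmax(Q K^T / scale) V, with scale defaulting to sqrt(d_k)."""
    if scale is None:
        scale = np.sqrt(K.shape[-1])
    scores = (Q @ K.T) / scale
    # Stable row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Queries, keys and values as obtained in Step 3
Q = np.array([[1.0, 0.0, 2.0], [2.0, 2.0, 2.0], [2.0, 1.0, 3.0]])
K = np.array([[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0], [2.0, 8.0, 0.0], [2.0, 6.0, 3.0]])

attention_output = scaled_dot_product_attention(Q, K, V, scale=1.0)
print(attention_output)  # row 0 matches the summed attention for x1 from Step 7
```

Each row of `attention_output` is the attention vector of one input, so the first row should agree with the result of *Step 7*.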

From this step onwards, we leave the toy dimensions behind to see what the original dimensions look like when they reach the sublayer's output. We have seen the attention process in detail with a small model, so let's now assume we have already generated the 3 attention representations of one head with learned weights, each with a dimension of $d_k = d_{model}/8 = 64$:

In [None]:
# We assume we have 3 results with learned weights (they were not trained in this example)
# We assume we are implementing the original Transformer paper. We will have 3 results of 64 dimensions each
attention_head1=np.random.random((3, 64))
print(attention_head1)

[[0.38464587 0.97440416 0.71904677 0.33169658 0.6099698  0.33166482
  0.89708217 0.51369812 0.80581456 0.39938642 0.82353179 0.85588956
  0.97686275 0.7680037  0.782911   0.93144075 0.55540618 0.25077953
  0.30564976 0.27504848 0.31693901 0.83645785 0.94001682 0.72108591
  0.03548944 0.66078724 0.89055292 0.70954996 0.83929716 0.63514769
  0.92006666 0.53657751 0.18989152 0.65710533 0.71292925 0.24049143
  0.12963479 0.4124736  0.28287217 0.32416978 0.115054   0.0779267
  0.88731908 0.18191865 0.25955368 0.10311267 0.49050582 0.78101243
  0.18736321 0.59980406 0.30262333 0.60651996 0.89307057 0.98256314
  0.95473747 0.4286933  0.06936093 0.52227428 0.64624981 0.27355311
  0.19688582 0.51638073 0.06039197 0.65467468]
 [0.48029979 0.87775841 0.08946413 0.27512154 0.26891502 0.32198958
  0.6145797  0.14588073 0.18634514 0.4750353  0.12467958 0.04376944
  0.80676552 0.79888299 0.96257485 0.063964   0.29551825 0.75819194
  0.05427341 0.38685383 0.3692209  0.49537121 0.62436386 0.45598848
  

The result displayed above simulates $z_0$, that is, the 3 output vectors of $d_k=64$ dimensions for head 1. With this, the Transformer has the output vectors of one head for all inputs. The next step is to generate the outputs of the 8 heads to create the final output of the attention sublayer.

## <font color='#2B4865'>**Step 9: The output of the heads of the attention sublayer**
---
</font>

For this step, we assume that we have trained the 8 heads of the attention sublayer. Each head now provides 3 output vectors (one for each of the 3 input vectors, which are words or word pieces) of $d_k = d_{model}/8 = 64$ dimensions each:

In [None]:
z0h1=np.random.random((3, 64))
z1h2=np.random.random((3, 64))
z2h3=np.random.random((3, 64))
z3h4=np.random.random((3, 64))
z4h5=np.random.random((3, 64))
z5h6=np.random.random((3, 64))
z6h7=np.random.random((3, 64))
z7h8=np.random.random((3, 64))
print("Shape of each head: ", z0h1.shape, "\nDimension of 8 heads: ",64*8)

Shape of each head:  (3, 64) 
Dimension of 8 heads:  512


The 8 heads have now produced $Z$:

$$Z = (Z_0,Z_1,Z_2,Z_3,Z_4,Z_5,Z_6,Z_7)$$

The Transformer will now concatenate the 8 elements of $Z$ for the final output of the Multi-Head attention sublayer.
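
Rather than simulating the heads with random outputs, the sketch below actually runs 8 attention heads on a random stand-in input. Everything here is an assumption made for illustration: the input `x_full` and the per-head projection matrices are drawn at random and are untrained, so only the shapes are meaningful.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a stable row-wise softmax."""
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(seed=0)
seq_len, d_model, num_heads = 3, 512, 8
d_k = d_model // num_heads                          # 64 dimensions per head

x_full = rng.standard_normal((seq_len, d_model))    # stand-in for 3 real embeddings

heads = []
for _ in range(num_heads):
    # Hypothetical untrained projections, one set per head
    w_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    w_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    w_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    heads.append(attention(x_full @ w_q, x_full @ w_k, x_full @ w_v))

print(len(heads), heads[0].shape)  # 8 (3, 64)
```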

## <font color='#2B4865'>**Step 10: Concatenation of the output of the heads**
---
</font>

The Transformer concatenates the 8 elements of $Z$:

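$$MultiHead(Output) = Concat(Z_0,Z_1,Z_2,Z_3,Z_4,Z_5,Z_6,Z_7)\,W^O$$

The result has one row per input and $d_{model}$ columns, that is, a shape of $3 \times 512$ in this example.

Note that the concatenation is multiplied by $W^O$, a weight matrix that is trained as well. In this notebook, we assume $W^O$ has been trained and folded into the concatenation, so the code for this step only performs the concatenation.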

In [None]:
output_attention=np.hstack((z0h1,z1h2,z2h3,z3h4,z4h5,z5h6,z6h7,z7h8))
print(Fore.YELLOW + Style.BOLD + "MultiHead Attention Output :\n " + Style.RESET, output_attention)

MultiHead Attention Output :
  [[0.02921319 0.99847071 0.87075338 ... 0.58217167 0.53745708 0.64086444]
 [0.42067198 0.98503287 0.1770938  ... 0.25072938 0.79252486 0.17376745]
 [0.54808903 0.90267648 0.90987587 ... 0.57604564 0.54924641 0.62331917]]


The concatenation can be visualized as stacking the elements of $Z$ side by side:

<br><center><img src="https://drive.google.com/uc?id=1e9hXbfE4j3zFgPY_rivRpMdNQsqFAHrQ" width="35%"></center><br>

<br><center><img src="https://drive.google.com/uc?id=1D3dyTN3msoLjgt-YoCMkcAyDPfq_h_td" width="60%"></center><br>
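
The code in this step stops at the concatenation. As a sketch of the remaining projection, the block below multiplies the concatenated heads by an output weight matrix. The matrix `w_output` is a random, untrained stand-in introduced for this example, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
seq_len, d_model, num_heads = 3, 512, 8
d_k = d_model // num_heads                  # 64 dimensions per head

# Simulated head outputs Z_0 ... Z_7, as in Step 9
head_outputs = [rng.random((seq_len, d_k)) for _ in range(num_heads)]

# Stack the heads side by side: (3, 8 * 64) = (3, 512)
concat = np.hstack(head_outputs)

# Hypothetical untrained output projection of shape (d_model, d_model)
w_output = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

multi_head_output = concat @ w_output
print(concat.shape, multi_head_output.shape)  # (3, 512) (3, 512)
```

The projection keeps the shape at $3 \times d_{model}$, so the sublayer's output has the same shape as its input.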