# Self Attention in Transformers

<img src="images/transformer-architecture-1.png" width="500">

## Generate Data

In [1]:
import math
import numpy as np

`L` is going to be the length of the input sequence. As an example, assume the input sequence: *My name is John*. The sequence has 4 words. Hence, `L` in this case would be 4.

For illustrative purposes, `d_k` is the dimension of each query and key vector and `d_v` is the dimension of each value vector. Both are set to 8 here, so `q`, `k` and `v` are all 4 x 8 matrices.

The matrices are then initialised from the standard normal distribution using the `np.random.randn` function.

In [2]:
L, d_k, d_v = 4, 8, 8

# Generate Query, Key and Value vectors
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

Each word is represented by an 8-dimensional vector, stored as one row in each of the matrices `q`, `k` and `v`, as shown below.

For example, `q[0]`, `k[0]` and `v[0]` will all represent the word *My* in the input sequence.

In [3]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[ 1.34972467  1.35332753 -0.77556227  1.91676259 -1.61592118  0.59815915
  -0.18490108  1.15882701]
 [-0.33608825 -1.19344485 -0.97662111  1.20519669  0.2540626   0.45435371
  -2.21570849 -0.55022568]
 [ 1.33711846  0.01455579 -0.61734755  0.34603767 -1.30180037  0.92430392
  -0.80426423 -0.1787877 ]
 [-0.44801764  0.64280112  0.40218577 -0.28686598  0.38526026  0.6576787
  -0.08885837 -0.07665715]]
K
 [[-0.96890619  0.96396798 -0.73914589  0.12728142 -0.134454    0.2575052
  -0.25883817 -0.97963846]
 [ 1.42722162  0.36437203 -0.33366425 -1.58075897 -0.18005312  0.81230955
   2.22149487 -0.94734918]
 [ 0.19142128 -1.33368536 -1.07531501 -0.2893011   0.30201615  0.06685315
  -0.07141276 -1.27233625]
 [ 0.41115131 -1.01680208  1.93874384 -1.34373299 -0.60134847  0.26848415
   0.01355881 -1.11879544]]
V
 [[-0.82589176 -0.8639556   0.75185056  1.36580367 -0.80202648  0.53314709
   0.60224679 -0.13482091]
 [-1.21390252  0.20359992  0.26225157  0.03961611 -0.15492762  0.64232898
  -1.175

## Self Attention

To create the initial self attention matrix, every word needs to look at every other word and determine how strong its affinity towards that word is.

This is captured by the query, $Q$, which represents what each word is looking for, and the key, $K$, which represents what each word has to offer.

\begin{align}

\text{self attention} = \text{softmax}(\frac{Q·K^{T}}{\sqrt{d_{k}}} + M) \nonumber

\end{align}

\begin{align}

\text{new } V = \text{self attention}·V \nonumber
 
\end{align}

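Here, $M$ is the mask described in the Masking section (it is left out in the encoder). The matrix multiplication between $Q$ and $K^{T}$ leads to a 4 x 4 matrix because our input sequence is of length 4. Entry $(i, j)$ of the resulting matrix is a raw score for how much attention word $i$ should pay to word $j$: the larger the score, the larger the weight after the softmax.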

In [4]:
np.matmul(q, k.T)

array([[ 0.09795386, -1.08341991, -3.17635367, -5.0670283 ],
       [ 1.24583087, -6.57139034,  3.19426403, -1.88281241],
       [ 0.01520996,  0.94057239,  0.75381478,  0.09321527],
       [ 0.93559223,  0.2541633 , -1.0283383 ,  0.3568614 ]])
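To make the meaning of each entry concrete, the sketch below checks that entry `[i, j]` of the product is the dot product of the query of word `i` with the key of word `j`. It uses its own seeded random data (the seed and the names `scores` and `manual` are assumptions for illustration), so the numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
L, d_k = 4, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))

scores = np.matmul(q, k.T)  # shape (L, L)

# Entry [i, j] is the dot product of word i's query with word j's key.
i, j = 2, 1
manual = sum(q[i, t] * k[j, t] for t in range(d_k))

print(scores.shape)
print(np.isclose(scores[i, j], manual))
```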

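### Why do we need the denominator, $\sqrt{d_{k}}$?

Each entry of $Q·K^{T}$ is a sum of $d_k$ products. If the entries of `q` and `k` are independent with zero mean and unit variance, that sum has a variance of about $d_k$, so the scores spread out as the dimension grows. Dividing by $\sqrt{d_{k}}$ brings the variance back to roughly 1, which stabilises the values fed into the softmax.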

We can view the variance of the `q` and `k` matrices as well as the resulting matrix from the matrix multiplication.

In [5]:
# Why do we need sqrt(d_k) in denominator?
q.var(), k.var(), np.matmul(q, k.T).var()

(0.8941885476429156, 0.8540676867127687, 5.722632018415711)

We can see that while the variances of the `q` and `k` matrices are close to 1, the resulting matrix has a much higher variance. Therefore, to stabilise these values and reduce the variance, we divide the resulting matrix by $\sqrt{d_{k}}$, the square root of the query and key dimension.

In [6]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k)

q.var(), k.var(), scaled.var()

(0.8941885476429156, 0.8540676867127687, 0.7153290023019637)

Now, you can see that the variances are all more or less in the same range (i.e. close to 1).

In [7]:
scaled

array([[ 0.03463192, -0.38304678, -1.12301061, -1.79146503],
       [ 0.44046773, -2.32333734,  1.12934288, -0.66567471],
       [ 0.00537753,  0.33254256,  0.26651377,  0.03295658],
       [ 0.3307818 ,  0.0898603 , -0.36357249,  0.12616956]])
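With only 16 scores, the variances above are noisy estimates. As a rough check of the scaling argument, the sketch below draws many independent query and key vectors (the sample size `n`, the seed and the variable names are assumptions for illustration) and compares the variance of the dot products before and after dividing by $\sqrt{d_{k}}$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n, d_k = 100_000, 8
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

# One query-key dot product per row.
dots = np.sum(q * k, axis=1)
scaled_dots = dots / math.sqrt(d_k)

print(dots.var())         # close to d_k
print(scaled_dots.var())  # close to 1
```

The unscaled variance should land near `d_k` and the scaled variance near 1, which is the behaviour the division is meant to produce.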

## Masking

Masking is not required in the encoder, but it is required in the decoder.

In the decoder, it ensures that a word does not get context from words generated after it. That would be cheating: at generation time, you don't know which word is going to come next, so it does not make sense to build a word's vector from those future words.

The encoder needs no mask because all of the words in the input sequence are passed into it simultaneously.

In [8]:
mask = np.tril(np.ones((L, L)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

The triangular matrix above will simulate the aforementioned masking. For example, recall our input sequence: *My name is John*.

The word *My* wouldn't be able to get context from the words after it. This is represented by the first row of the mask matrix above, where every entry is 0 except the one for *My* itself.

Similarly, for the word *is*, it has contextual information of previous words *My* and *name* (and itself, of course) while having no context from *John*, which comes after it. Therefore, the third row in the mask matrix above has 1's for the words *My*, *name* and *is* and 0 for *John*.

In [9]:
mask[mask == 0] = -np.inf
mask[mask == 1] = 0
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

The 1 values in the mask are converted into 0 so that, when we add the mask to the `scaled` matrix, the lower triangle (including the diagonal) remains the same, because it is simply an addition with 0.

The 0 values in the mask are converted into negative infinity because of the softmax function which we will be applying to the `scaled` matrix. Since $e^{-\infty} = 0$, those positions receive a softmax weight of 0, which is exactly what we want from the mask.

In [10]:
scaled + mask

array([[ 0.03463192,        -inf,        -inf,        -inf],
       [ 0.44046773, -2.32333734,        -inf,        -inf],
       [ 0.00537753,  0.33254256,  0.26651377,        -inf],
       [ 0.3307818 ,  0.0898603 , -0.36357249,  0.12616956]])
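The two-step conversion above can also be written in a single step. The sketch below is one possible way to do it with `np.where` (the names `allowed` and `weights`, and the all-zero placeholder scores, are assumptions for illustration), and it shows that positions holding negative infinity end up with a weight of exactly 0.

```python
import numpy as np

L = 4

# True on and below the diagonal (allowed), False above it (future words).
allowed = np.tril(np.ones((L, L), dtype=bool))

# Additive causal mask: 0 where attention is allowed, -inf elsewhere.
mask = np.where(allowed, 0.0, -np.inf)

scores = np.zeros((L, L))  # placeholder scores, for illustration only
masked = scores + mask

# exp(-inf) is exactly 0, so masked positions get zero weight.
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(mask)
print(weights)
```

Because the placeholder scores are all equal, each row spreads its weight evenly over the words it is allowed to see.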

## Softmax

The softmax operation is used to convert a vector into a probability distribution. Every value ends up between 0 and 1 and the values add up to 1, which makes the attention weights easy to interpret.

\begin{align}

\text{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \nonumber

\end{align}

In [11]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T
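The `softmax` above exponentiates the raw values directly, which is fine for the small scores in this example but can overflow for large inputs. A common variant, sketched below under the hypothetical name `stable_softmax`, subtracts the row-wise maximum first. This leaves the result unchanged mathematically because the shift cancels between numerator and denominator.

```python
import numpy as np


def stable_softmax(x):
    # Subtracting the row-wise max does not change the result,
    # but it keeps np.exp from overflowing on large inputs.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


x = np.array([[1000.0, 1001.0, 1002.0],   # would overflow without the shift
              [0.5, -np.inf, -np.inf]])   # a masked row
out = stable_softmax(x)
print(out)
```

One caveat of this sketch: a row made up entirely of negative infinity would produce `nan`. That does not happen with the causal mask used here, since every word can always attend to itself.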

In [12]:
attention = softmax(scaled + mask)
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.94068829, 0.05931171, 0.        , 0.        ],
       [0.2713384 , 0.37635459, 0.35230701, 0.        ],
       [0.32255324, 0.25349566, 0.16108206, 0.26286904]])

Now we perform a matrix multiplication between the attention matrix and the `v` matrix. Each row of the result is a weighted average of the value vectors, so it should better encapsulate the context of the input sequence.

In [13]:
new_v = np.matmul(attention, v)
new_v

array([[-0.82589176, -0.8639556 ,  0.75185056,  1.36580367, -0.80202648,
         0.53314709,  0.60224679, -0.13482091],
       [-0.84890534, -0.80063705,  0.72281161,  1.28714522, -0.76364594,
         0.53962286,  0.49682961, -0.21523046],
       [-0.88784853, -0.0067005 , -0.06165211,  0.61555196, -0.43572516,
         1.00204298, -0.68338215, -0.51311602],
       [-0.54625073, -0.13250759,  0.19673492,  0.24677181, -0.12115026,
         0.12774933, -0.05217519, -0.85492447]])

Comparing the new `v` matrix with the previous `v` matrix, we can see that the first row, which corresponds to the first word, *My*, is identical.

This is a direct effect of the mask: every word after *My* is masked out in the first row, so *My* places all of its attention on itself. As you move down the rows to the later words, the vectors differ from the originals because each one is a weighted average of its own value vector and those of the words before it.

In [14]:
v

array([[-0.82589176, -0.8639556 ,  0.75185056,  1.36580367, -0.80202648,
         0.53314709,  0.60224679, -0.13482091],
       [-1.21390252,  0.20359992,  0.26225157,  0.03961611, -0.15492762,
         0.64232898, -1.17509491, -1.49053203],
       [-0.58725654,  0.42888178, -1.03420484,  0.65297388, -0.45357274,
         1.74744249, -1.14826684,  0.23966213],
       [ 0.46585498,  0.09688181,  0.20670099, -1.17548068,  0.9505946 ,
        -1.85844719,  0.89936342, -1.7963295 ]])
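The comparison above can also be checked numerically. The sketch below repeats the masked attention steps on its own seeded random data (the seed is an assumption for illustration, so the values differ from the outputs shown) and confirms that the first output row reproduces `v[0]` while the second row is a blend of `v[0]` and `v[1]`.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
L, d_k, d_v = 4, 8, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
v = rng.standard_normal((L, d_v))

# Additive causal mask: 0 on and below the diagonal, -inf above it.
mask = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

scaled = np.matmul(q, k.T) / math.sqrt(d_k) + mask
e = np.exp(scaled)
attention = e / e.sum(axis=-1, keepdims=True)
new_v = np.matmul(attention, v)

# Row 0 attends only to itself, so v[0] passes through unchanged.
print(np.allclose(new_v[0], v[0]))

# Row 1 is a weighted average of v[0] and v[1] only.
blend = attention[1, 0] * v[0] + attention[1, 1] * v[1]
print(np.allclose(new_v[1], blend))
```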

## Function

In [15]:
import math
import numpy as np


def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T


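def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (L, d_k); v: (L, d_v); mask: optional additive (L, L) matrix of 0 / -inf
    d_k = k.shape[-1]
    # Raw scores, scaled by sqrt(d_k) to keep their variance near 1
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        # -inf entries turn into zero weights after the softmax
        scaled += mask
    attention = softmax(scaled)  # each row sums to 1
    out = np.matmul(attention, v)  # weighted average of the value vectors
    return out, attention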

The `mask` parameter is optional as the function can be used in the encoder as well as in the decoder. As mentioned before, the encoder does not require a mask as the inputs from the input sequence are passed into the encoder simultaneously.
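As a quick sanity check of the function's shapes, the sketch below repeats the definitions so that it runs on its own and calls the function with a `d_v` that deliberately differs from `d_k` (the value 16 and the seed are assumptions for illustration). The output takes its width from `v`, while the attention matrix stays `L` x `L`.

```python
import math
import numpy as np


def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T


def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = k.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        scaled += mask
    attention = softmax(scaled)
    out = np.matmul(attention, v)
    return out, attention


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
L, d_k, d_v = 4, 8, 16          # d_v deliberately differs from d_k here
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
v = rng.standard_normal((L, d_v))

out, attention = scaled_dot_product_attention(q, k, v)
print(out.shape)        # (4, 16)
print(attention.shape)  # (4, 4)
```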

### Encoder

In [16]:
values, attention = scaled_dot_product_attention(q, k, v)

print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[ 1.34972467  1.35332753 -0.77556227  1.91676259 -1.61592118  0.59815915
  -0.18490108  1.15882701]
 [-0.33608825 -1.19344485 -0.97662111  1.20519669  0.2540626   0.45435371
  -2.21570849 -0.55022568]
 [ 1.33711846  0.01455579 -0.61734755  0.34603767 -1.30180037  0.92430392
  -0.80426423 -0.1787877 ]
 [-0.44801764  0.64280112  0.40218577 -0.28686598  0.38526026  0.6576787
  -0.08885837 -0.07665715]]
K
 [[-0.96890619  0.96396798 -0.73914589  0.12728142 -0.134454    0.2575052
  -0.25883817 -0.97963846]
 [ 1.42722162  0.36437203 -0.33366425 -1.58075897 -0.18005312  0.81230955
   2.22149487 -0.94734918]
 [ 0.19142128 -1.33368536 -1.07531501 -0.2893011   0.30201615  0.06685315
  -0.07141276 -1.27233625]
 [ 0.41115131 -1.01680208  1.93874384 -1.34373299 -0.60134847  0.26848415
   0.01355881 -1.11879544]]
V
 [[-0.82589176 -0.8639556   0.75185056  1.36580367 -0.80202648  0.53314709
   0.60224679 -0.13482091]
 [-1.21390252  0.20359992  0.26225157  0.03961611 -0.15492762  0.64232898
  -1.175

### Decoder

In [17]:
values, attention = scaled_dot_product_attention(q, k, v, mask=mask)

print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[ 1.34972467  1.35332753 -0.77556227  1.91676259 -1.61592118  0.59815915
  -0.18490108  1.15882701]
 [-0.33608825 -1.19344485 -0.97662111  1.20519669  0.2540626   0.45435371
  -2.21570849 -0.55022568]
 [ 1.33711846  0.01455579 -0.61734755  0.34603767 -1.30180037  0.92430392
  -0.80426423 -0.1787877 ]
 [-0.44801764  0.64280112  0.40218577 -0.28686598  0.38526026  0.6576787
  -0.08885837 -0.07665715]]
K
 [[-0.96890619  0.96396798 -0.73914589  0.12728142 -0.134454    0.2575052
  -0.25883817 -0.97963846]
 [ 1.42722162  0.36437203 -0.33366425 -1.58075897 -0.18005312  0.81230955
   2.22149487 -0.94734918]
 [ 0.19142128 -1.33368536 -1.07531501 -0.2893011   0.30201615  0.06685315
  -0.07141276 -1.27233625]
 [ 0.41115131 -1.01680208  1.93874384 -1.34373299 -0.60134847  0.26848415
   0.01355881 -1.11879544]]
V
 [[-0.82589176 -0.8639556   0.75185056  1.36580367 -0.80202648  0.53314709
   0.60224679 -0.13482091]
 [-1.21390252  0.20359992  0.26225157  0.03961611 -0.15492762  0.64232898
  -1.175