### Import Libraries

This cell imports the necessary libraries: `numpy` for numerical operations and `math` for mathematical functions.

In [None]:
import numpy as np
import math

### Initialize Variables

This cell sets the sequence length `L` and the key and value dimensions `d_k` and `d_v`. It then draws random query, key, and value matrices (`q`, `k`, and `v`) from a standard normal distribution, with shapes `(L, d_k)`, `(L, d_k)`, and `(L, d_v)`.

In [None]:
L, d_k, d_v = 4, 8, 8
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

### Print Q, K, and V

This cell prints the initialized query (`q`), key (`k`), and value (`v`) matrices.

In [None]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[-0.23303838  0.4552888   0.95491234  0.30506956  0.32863433 -0.21465489
  -1.99278369  0.6920726 ]
 [-0.66801934 -0.47584393  0.02423653 -2.02101804  0.660781   -0.64228401
  -0.73995386  0.51579461]
 [-1.23958879 -1.38750281 -0.31920982 -1.50463355  1.96619006  1.4805927
  -0.09300824 -1.00037272]
 [ 1.01917479  1.15913976  0.10872518 -0.1800475  -0.55870587  1.37986459
   2.6599079   0.04387907]]
K
 [[-0.96900161 -1.12374365 -1.26487243  0.53755025 -1.35861454  0.65570976
  -1.15999351  0.12286737]
 [-0.06005638 -0.16036109 -0.74406779 -1.29358825 -0.56232546  0.49575063
   0.877804   -0.32933588]
 [ 0.17975888  0.95998137  1.42153112 -1.41099863 -0.48974893  0.28141373
   0.07628394  0.40003283]
 [ 2.07843739  1.33156505 -0.28815526 -1.24923639  1.09823374 -0.39904121
   1.00280328 -0.77046439]]
V
 [[-0.14568428  0.31482634  0.69393521  0.22385512  1.89562802  0.16136048
   0.530136   -1.01611595]
 [-0.92017587 -0.75384648 -0.96106     0.22128155  2.16263508 -0.4784105
   0.604

### Calculate Dot Product of Q and K Transpose

This cell multiplies the query matrix (`q`) by the transpose of the key matrix (`k.T`). Entry `(i, j)` of the resulting `L x L` matrix is the dot product of query `i` with key `j`, an unnormalized score for how strongly position `i` attends to position `j`.

In [None]:
np.matmul(q,k.T)

array([[ 0.47974519, -3.43258167,  1.22564215, -2.6193899 ],
       [-0.33219901,  1.20336642,  1.9547468 ,  0.33825211],
       [ 0.63982505,  2.35701837, -0.83909208, -0.20633162],
       [-3.94069403,  3.22358971,  2.58696108,  5.32469779]])

### Calculate Variance of Q, K, and the Dot Product

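This cell computes the variance of `q`, of `k`, and of the score matrix `np.matmul(q, k.T)`. The entries of `q` and `k` have variance close to 1, but each score sums `d_k` products, so the variance of the scores grows roughly in proportion to `d_k`. Here it is about 5.6 for `d_k = 8`, estimated from only 16 entries.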

In [None]:
q.var(), k.var(), np.matmul(q,k.T).var()

(np.float64(1.1530857762373627),
 np.float64(0.857386828919557),
 np.float64(5.598960354561339))
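
The growth of the score variance is easier to see with more samples than a 4 x 4 matrix provides. The following is a rough sketch under assumptions, not part of the original notebook: the helper name `raw_score_variance`, the sample size `n`, and the seed are hypothetical choices. With standard-normal entries, each score is a sum of `d_k` independent products of unit variance, so the estimate should land close to `d_k`.

```python
import numpy as np

def raw_score_variance(d_k, n=20_000, seed=0):
    """Estimate the variance of q . k for standard-normal q and k of size d_k."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    scores = np.sum(q * k, axis=-1)  # one dot product per sampled (q, k) pair
    return scores.var()

# The estimate should track d_k (hypothetical sizes chosen for illustration).
for d_k in (8, 64, 256):
    print(d_k, raw_score_variance(d_k))
```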

### Scale the Dot Product

This cell divides the scores by the square root of `d_k`. Doing so divides their variance by `d_k` (5.599 / 8 ≈ 0.700), which brings it back to the same order as the variance of `q` and `k`. Without this scaling, large scores push the softmax toward a near one-hot output, where its gradients become very small during training. The cell also shows the variance of the scaled matrix.

In [None]:
scaled = np.matmul(q,k.T) / math.sqrt(d_k)
q.var(), k.var(), scaled.var()

(np.float64(1.1530857762373627),
 np.float64(0.857386828919557),
 np.float64(0.6998700443201673))
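
To illustrate why large scores are a problem for softmax, the sketch below compares the largest attention weight per row with and without the division by `sqrt(d_k)`. It is an illustration under assumptions: `d_k = 512`, the seed, and the local `softmax` helper are choices made here, not values from this notebook.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum before exponentiating to avoid overflow.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512  # assumed dimension, large enough to make the effect visible
q = rng.standard_normal((4, d_k))
k = rng.standard_normal((4, d_k))

raw = q @ k.T
scaled = raw / np.sqrt(d_k)

peak_raw = softmax(raw).max(axis=-1)
peak_scaled = softmax(scaled).max(axis=-1)
print("largest weight per row, unscaled:", peak_raw)
print("largest weight per row, scaled:  ", peak_scaled)
```

Multiplying all scores by a factor greater than 1 can only increase the largest softmax weight in a row, so the unscaled weights are at least as peaked as the scaled ones.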

### Print Scaled Matrix

This cell prints the scaled dot product matrix.

In [None]:
scaled

array([[ 0.16961554, -1.21360089,  0.43332994, -0.92609418],
       [-0.11745009,  0.42545428,  0.69110736,  0.11959018],
       [ 0.22621232,  0.83333184, -0.29666385, -0.07294924],
       [-1.39324573,  1.13971107,  0.91462886,  1.88256496]])

### Create a Lower Triangular Mask

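This cell creates an `L x L` lower triangular mask with `np.tril`. Row `i` corresponds to query position `i` and column `j` to key position `j`; a 1 marks a pair that is allowed to attend (`j <= i`). This mask is used in masked self-attention to prevent attending to future tokens.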

In [None]:
mask = np.tril(np.ones(( L,L )))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

### Modify the Mask

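This cell converts the mask to additive form: allowed positions (the lower triangle, including the diagonal) become 0 and future positions (the strict upper triangle) become negative infinity (`-np.inf`). The order of the two assignments matters: replacing the 1s with 0 first would make the other assignment turn the whole matrix into `-np.inf`. When this mask is added to the scaled attention scores, softmax assigns a weight of exactly 0 to the masked positions, because `exp(-inf)` is 0.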

In [None]:
mask[mask == 0] = -np.inf
mask[mask == 1] = 0

mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
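
The same additive mask can also be built in a single step. The sketch below is one possible alternative, not the notebook's method; the names `two_step` and `one_step` are introduced here for comparison only.

```python
import numpy as np

L = 4  # sequence length, as in the cells above

# Two-step construction, mirroring the notebook.
two_step = np.tril(np.ones((L, L)))
two_step[two_step == 0] = -np.inf
two_step[two_step == 1] = 0

# One-step alternative: np.triu with k=1 keeps the entries strictly above
# the diagonal (-inf) and fills everything else with zeros.
one_step = np.triu(np.full((L, L), -np.inf), k=1)

print(np.array_equal(one_step, two_step))
```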

### Add Mask to Scaled Matrix

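This cell adds the scaled scores to the mask. Positions above the diagonal stay at negative infinity, while the remaining positions take the value of the corresponding score. Note that `mask += scaled` updates `mask` in place, so `mask` no longer holds only 0 and `-inf` afterwards, and re-running the cell adds `scaled` again. The finite values in the output below are twice the entries of `scaled`, which is what results when the cell is executed twice.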

In [None]:
mask += scaled
mask

array([[ 0.33923108,        -inf,        -inf,        -inf],
       [-0.23490017,  0.85090856,        -inf,        -inf],
       [ 0.45242463,  1.66666367, -0.5933277 ,        -inf],
       [-2.78649147,  2.27942214,  1.82925772,  3.76512991]])
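
Because `mask += scaled` overwrites `mask`, a non-mutating variant may be safer when cells are re-run. The following is a minimal sketch under assumptions: the name `masked_scores` is hypothetical and the random inputs are placeholders, not the values used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k = 4, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))

scaled = q @ k.T / np.sqrt(d_k)
mask = np.triu(np.full((L, L), -np.inf), k=1)  # 0 on/below the diagonal, -inf above

# Build a new array instead of updating `mask` in place, so the cell can be
# re-run and `mask` can be reused later.
masked_scores = scaled + mask

# `mask` still contains only -inf and 0.
print(np.unique(mask))
```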

### Define Softmax Function

This cell defines a `softmax` function that exponentiates a 2-D matrix `x` and normalizes each row so that it sums to 1. The two transposes are there so that the vector of row sums broadcasts against the correct axis. Softmax converts the attention scores into attention weights.

In [None]:
def softmax(x):
  return (np.exp(x).T / np.sum(np.exp(x), axis =-1)).T
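
Calling `np.exp` directly on the scores can overflow for large inputs, which turns the division into `inf / inf`. A common remedy is to subtract each row's maximum first, which leaves the result mathematically unchanged. The sketch below shows that variant; the name `stable_softmax` is hypothetical, and it assumes every row has at least one finite entry.

```python
import numpy as np

def stable_softmax(x):
    """Softmax over the last axis; assumes each row has at least one finite entry."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # largest entry becomes 0
    e = np.exp(shifted)                              # exp(-inf) evaluates to 0
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, -np.inf],
                   [0.0, 0.0, 0.0]])
print(stable_softmax(scores))
```

Using `keepdims=True` also removes the need for the transposes and works for inputs with more than two dimensions.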

### Calculate Attention Weights

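This cell applies `softmax` to the masked, scaled scores to obtain the attention weights; each row sums to 1 and the masked positions receive weight 0. Because `mask` already contains the scores from the in-place addition in the previous cell, `scaled + mask` counts them again: the weights shown below correspond to three times the scaled scores and are therefore more sharply peaked than the softmax of `scaled` plus an unmodified 0/`-inf` mask would be.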

In [None]:
attention = softmax(scaled + mask)
attention

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.64006731e-01, 8.35993269e-01, 0.00000000e+00, 0.00000000e+00],
       [1.35344010e-01, 8.36459676e-01, 2.81963139e-02, 0.00000000e+00],
       [4.64075221e-05, 9.26266314e-02, 4.71498140e-02, 8.60177147e-01]])
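
A few properties of causal attention weights can be checked directly. The sketch below uses placeholder random inputs and an unmodified 0/`-inf` mask, so its numbers differ from the output above; the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k = 4, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))

# Scaled scores plus an additive causal mask (-inf above the diagonal).
scores = q @ k.T / np.sqrt(d_k) + np.triu(np.full((L, L), -np.inf), k=1)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

assert np.allclose(weights.sum(axis=-1), 1.0)  # every row is a distribution
assert np.all(np.triu(weights, k=1) == 0)      # no weight on future positions
assert weights[0, 0] == 1.0                    # the first token sees only itself
```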

### Calculate New Value Matrix

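This cell computes the output of the self-attention layer by multiplying the attention weights by the value matrix (`v`). Each output row is a weighted average of the rows of `v`; the first row equals `v[0]` exactly, because the first token can attend only to itself.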

In [None]:
new_v = np.matmul(attention, v)
new_v

array([[-0.14568428,  0.31482634,  0.69393521,  0.22385512,  1.89562802,
         0.16136048,  0.530136  , -1.01611595],
       [-0.79315404, -0.57857694, -0.68962964,  0.22170364,  2.11884413,
        -0.37348375,  0.59204784, -1.59818304],
       [-0.79013698, -0.54460583, -0.69202264,  0.2189044 ,  2.0768932 ,
        -0.36643202,  0.57135829, -1.51267357],
       [-1.79824188, -0.71996795,  0.33820553, -0.40095999, -0.55024229,
        -0.22083659,  0.50540498, -0.12241615]])
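
The practical effect of the mask is that an output row never depends on later tokens. The sketch below checks this by perturbing only the last token's key and value; `causal_attention` is a hypothetical helper written for this example, with placeholder random inputs.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head masked attention for 2-D inputs (a sketch, not the notebook's function)."""
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + np.triu(np.full((n, n), -np.inf), k=1)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
L, d_k, d_v = 4, 8, 8
q, k = rng.standard_normal((2, L, d_k))
v = rng.standard_normal((L, d_v))

out = causal_attention(q, k, v)

# Perturb only the last token's key and value.
k2, v2 = k.copy(), v.copy()
k2[-1] += 1.0
v2[-1] += 1.0
out2 = causal_attention(q, k2, v2)

print(np.allclose(out[:-1], out2[:-1]))  # earlier positions should be unaffected
print(np.allclose(out[-1], out2[-1]))    # the last position is free to change
```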

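### Display the Original Value Matrix

This cell displays `v` so that it can be compared with `new_v`.

In [None]: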
v

array([[-0.14568428,  0.31482634,  0.69393521,  0.22385512,  1.89562802,
         0.16136048,  0.530136  , -1.01611595],
       [-0.92017587, -0.75384648, -0.96106   ,  0.22128155,  2.16263508,
        -0.4784105 ,  0.60419382, -1.71237406],
       [-0.02587124,  1.53731072,  0.636442  ,  0.1246209 ,  0.40339543,
         0.42203736, -0.20485709,  2.02804858],
       [-1.99003486, -0.84010627,  0.46174786, -0.49680789, -0.89477834,
        -0.22835937,  0.5336981 , -0.0690319 ]])

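### Combine the Steps into a Function

This cell collects the previous steps into a single function, `scaled_dot_product_attention`, which returns both the output and the attention weights. The optional `mask` is expected in additive form (0 for allowed positions, `-np.inf` for masked ones).

In [None]:
def softmax(x):
  # Exponentiate, then divide each row of a 2-D matrix by its row sum.
  return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
  d_k = q.shape[-1]
  # Raw scores, divided by sqrt(d_k) to keep their variance near 1.
  scaled = np.matmul(q, k.T) / math.sqrt(d_k)
  if mask is not None:
    # Additive mask: 0 keeps a score, -inf removes it.
    scaled += mask
  attention = softmax(scaled)
  out = np.matmul(attention, v)
  return out, attention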
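
The function above uses `k.T`, which only behaves as intended for 2-D inputs. A batched variant could swap just the last two axes instead. The following is a sketch under assumptions, not the notebook's implementation: `batched_attention`, the batch size `B`, and the random inputs are all introduced here for illustration.

```python
import numpy as np

def batched_attention(q, k, v, mask=None):
    """Scaled dot-product attention over arrays shaped (..., L, d)."""
    d_k = q.shape[-1]
    # Swap only the last two axes; `k.T` would reverse every axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # an (L, L) mask broadcasts over leading axes
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
B, L, d_k, d_v = 2, 4, 8, 8
q = rng.standard_normal((B, L, d_k))
k = rng.standard_normal((B, L, d_k))
v = rng.standard_normal((B, L, d_v))
mask = np.triu(np.full((L, L), -np.inf), k=1)

out, weights = batched_attention(q, k, v, mask)
print(out.shape, weights.shape)
```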

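### Run the Function

This cell calls `scaled_dot_product_attention` with a causal mask and prints the inputs, the new value matrix, and the attention weights.

In [None]:
# Rebuild the additive mask: the earlier `mask += scaled` overwrote `mask`.
mask = np.tril(np.ones((L, L)))
mask[mask == 0] = -np.inf
mask[mask == 1] = 0

values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)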

Q
 [[-0.23303838  0.4552888   0.95491234  0.30506956  0.32863433 -0.21465489
  -1.99278369  0.6920726 ]
 [-0.66801934 -0.47584393  0.02423653 -2.02101804  0.660781   -0.64228401
  -0.73995386  0.51579461]
 [-1.23958879 -1.38750281 -0.31920982 -1.50463355  1.96619006  1.4805927
  -0.09300824 -1.00037272]
 [ 1.01917479  1.15913976  0.10872518 -0.1800475  -0.55870587  1.37986459
   2.6599079   0.04387907]]
K
 [[-0.96900161 -1.12374365 -1.26487243  0.53755025 -1.35861454  0.65570976
  -1.15999351  0.12286737]
 [-0.06005638 -0.16036109 -0.74406779 -1.29358825 -0.56232546  0.49575063
   0.877804   -0.32933588]
 [ 0.17975888  0.95998137  1.42153112 -1.41099863 -0.48974893  0.28141373
   0.07628394  0.40003283]
 [ 2.07843739  1.33156505 -0.28815526 -1.24923639  1.09823374 -0.39904121
   1.00280328 -0.77046439]]
V
 [[-0.14568428  0.31482634  0.69393521  0.22385512  1.89562802  0.16136048
   0.530136   -1.01611595]
 [-0.92017587 -0.75384648 -0.96106     0.22128155  2.16263508 -0.4784105
   0.604