While editing this notebook, don't change cell types as that confuses the autograder.

Before you turn this notebook in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Carmen Pelayo Fernández"

_Understanding Deep Learning_

---

<a href="https://colab.research.google.com/github/DL4DS/sp2024_notebooks/blob/main/release/nbs12/12_2_Multihead_Self_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 12.2: Multihead Self-Attention

This notebook builds a multihead self-attention mechanism as in figure 12.6.

In [2]:
import numpy as np
import matplotlib.pyplot as plt

The multihead self-attention mechanism maps $N$ inputs $\mathbf{x}_{n}\in\mathbb{R}^{D}$ to $N$ outputs $\mathbf{x}'_{n}\in \mathbb{R}^{D}$.



In [3]:
# Set seed so we get the same random numbers
np.random.seed(3)
# Number of inputs
N = 6
# Number of dimensions of each input
D = 8

# Create an input matrix
X = np.random.normal(size=(D,N))

# Print X
print(X)

[[ 1.78862847  0.43650985  0.09649747 -1.8634927  -0.2773882  -0.35475898]
 [-0.08274148 -0.62700068 -0.04381817 -0.47721803 -1.31386475  0.88462238]
 [ 0.88131804  1.70957306  0.05003364 -0.40467741 -0.54535995 -1.54647732]
 [ 0.98236743 -1.10106763 -1.18504653 -0.2056499   1.48614836  0.23671627]
 [-1.02378514 -0.7129932   0.62524497 -0.16051336 -0.76883635 -0.23003072]
 [ 0.74505627  1.97611078 -1.24412333 -0.62641691 -0.80376609 -2.41908317]
 [-0.92379202 -1.02387576  1.12397796 -0.13191423 -1.62328545  0.64667545]
 [-0.35627076 -1.74314104 -0.59664964 -0.58859438 -0.8738823   0.02971382]]


We'll need the weights and biases for the queries, keys, and values (equations 12.2 and 12.4). We'll use two heads and, as in the figure, make the queries, keys, and values of size $D/H$.
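As a quick sanity check on the shapes, here is a minimal sketch of the query computation for a single head. The names `X_demo`, `omega_q`, `beta_q`, and `head_dim` are hypothetical and used only for illustration; the values are random and unrelated to the notebook's parameters.

```python
import numpy as np

D, N, H = 8, 6, 2
head_dim = D // H  # size of each head's queries, keys, and values

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(D, N))           # one input per column
omega_q = rng.normal(size=(head_dim, D))   # hypothetical query weights for one head
beta_q = rng.normal(size=(head_dim, 1))    # hypothetical query biases for one head

# The bias column broadcasts across all N inputs
q = beta_q + omega_q @ X_demo
print(q.shape)  # (4, 6)
```

Each head therefore works with a $D/H \times N$ matrix of queries, and the same holds for its keys and values.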

In [4]:
# Number of heads
H = 2
# Dimension of the queries, keys, and values (QKV) for each head
H_D = D // H

# Set seed so we get the same random numbers
np.random.seed(0)

# Choose random values for the parameters for the first head
omega_q1 = np.random.normal(size=(H_D,D))
omega_k1 = np.random.normal(size=(H_D,D))
omega_v1 = np.random.normal(size=(H_D,D))
beta_q1 = np.random.normal(size=(H_D,1))
beta_k1 = np.random.normal(size=(H_D,1))
beta_v1 = np.random.normal(size=(H_D,1))

# Choose random values for the parameters for the second head
omega_q2 = np.random.normal(size=(H_D,D))
omega_k2 = np.random.normal(size=(H_D,D))
omega_v2 = np.random.normal(size=(H_D,D))
beta_q2 = np.random.normal(size=(H_D,1))
beta_k2 = np.random.normal(size=(H_D,1))
beta_v2 = np.random.normal(size=(H_D,1))

# Choose random values for the weights that combine the concatenated head outputs
omega_c = np.random.normal(size=(D,D))

Now let's compute the multihead scaled self-attention.

In [5]:
# Define softmax operation that works independently on each column
def softmax_cols(data_in):
  # Exponentiate all of the values
  exp_values = np.exp(data_in)
  # Sum down each column
  denom = np.sum(exp_values, axis=0)
  # Replicate the denominator so it has the same number of rows as the input
  denom = np.matmul(np.ones((data_in.shape[0],1)), denom[np.newaxis,:])
  # Compute softmax
  softmax = exp_values / denom
  # Return the answer
  return softmax
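The softmax above exponentiates the raw values, which can overflow when the dot products are large. A common variant, sketched below under the hypothetical name `softmax_cols_stable` (not part of the original notebook), subtracts each column's maximum first. This does not change the result mathematically, because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax_cols_stable(data_in):
    # Subtract each column's maximum so the largest exponent is exp(0) = 1
    shifted = data_in - np.max(data_in, axis=0, keepdims=True)
    exp_values = np.exp(shifted)
    # keepdims=True lets the column sums broadcast over the rows
    return exp_values / np.sum(exp_values, axis=0, keepdims=True)

# The second column is the first column plus 999, so both softmaxes should agree
demo = np.array([[1.0, 1000.0],
                 [2.0, 1001.0],
                 [3.0, 1002.0]])
result = softmax_cols_stable(demo)
print(result.sum(axis=0))
```

For the small values used in this notebook the two versions should give the same answer, so this is optional here.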

In [6]:
# Now let's compute self attention in matrix form
def multihead_scaled_self_attention(X, omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c):

  # TODO Write the multihead scaled self-attention mechanism.
  # 1. Compute the values, queries, and keys for the first head
  v1 = beta_v1 + omega_v1 @ X
  q1 = beta_q1 + omega_q1 @ X
  k1 = beta_k1 + omega_k1 @ X

  # 2. Compute the dot products and scale by the square root of the query dimension
  prod1 = k1.T @ q1
  scaled_prod1 = prod1 / np.sqrt(q1.shape[0])

  # 3. Compute the attentions for the first head
  attentions1 = softmax_cols(scaled_prod1)

  # 4. Compute the output for the first head
  X_prime1 = v1 @ attentions1

  # 5. Compute the values, queries, and keys for the second head
  v2 = beta_v2 + omega_v2 @ X
  q2 = beta_q2 + omega_q2 @ X
  k2 = beta_k2 + omega_k2 @ X

  # 6. Compute the dot products and scale for the second head
  prod2 = k2.T @ q2
  scaled_prod2 = prod2 / np.sqrt(q2.shape[0])

  # 7. Compute the attentions for the second head
  attentions2 = softmax_cols(scaled_prod2)

  # 8. Compute the output for the second head
  X_prime2 = v2 @ attentions2

  # 9. Concatenate the outputs from the two heads and weight them by omega_c to produce the final output
  X_prime_combined = np.concatenate((X_prime1, X_prime2), axis=0)
  X_prime = omega_c @ X_prime_combined

  return X_prime
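The two heads follow identical steps, so the same computation can be written as a loop over any number of heads. The sketch below is one possible generalization, not the notebook's reference solution: the function `multihead_sa_loop`, the helper `softmax_columns`, and the per-head tuple layout are all assumptions made for illustration, and the parameters are random.

```python
import numpy as np

def softmax_columns(a):
    # Column-wise softmax (hypothetical helper, shifted by the column max)
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def multihead_sa_loop(X, heads, omega_c):
    # heads: list of (omega_v, omega_q, omega_k, beta_v, beta_q, beta_k) tuples
    outputs = []
    for omega_v, omega_q, omega_k, beta_v, beta_q, beta_k in heads:
        v = beta_v + omega_v @ X
        q = beta_q + omega_q @ X
        k = beta_k + omega_k @ X
        # Scaled dot products, then softmax over each column
        attention = softmax_columns(k.T @ q / np.sqrt(q.shape[0]))
        outputs.append(v @ attention)
    # Stack the head outputs vertically and recombine with omega_c
    return omega_c @ np.concatenate(outputs, axis=0)

# Illustrative run with four heads and random parameters
rng = np.random.default_rng(1)
D, N, H = 8, 6, 4
head_dim = D // H
X_demo = rng.normal(size=(D, N))
heads = []
for _ in range(H):
    weights = tuple(rng.normal(size=(head_dim, D)) for _ in range(3))
    biases = tuple(rng.normal(size=(head_dim, 1)) for _ in range(3))
    heads.append(weights + biases)
omega_c_demo = rng.normal(size=(D, D))

out = multihead_sa_loop(X_demo, heads, omega_c_demo)
print(out.shape)  # (8, 6)
```

Because the head outputs are concatenated to $D$ rows before multiplying by the $D\times D$ matrix, the output has the same shape as the input regardless of the number of heads, provided $H$ divides $D$.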


In [7]:
# Run the self attention mechanism
X_prime = multihead_scaled_self_attention(X,omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c)

# Print out the results
np.set_printoptions(precision=3)
print("Your answer:")
print(X_prime)

print("True values:")
print("[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]")
print(" [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]")
print(" [  5.479   1.115   9.244   0.453   5.656   7.089]")
print(" [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]")
print(" [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]")
print(" [  3.548  10.036  -2.244   1.604  12.113  -2.557]")
print(" [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]")
print(" [  1.248  18.894  -6.409   3.224  19.717  -5.629]]")

# If your answers don't match, then make sure that you are doing the scaling, and make sure the scaling value is correct

Your answer:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]
True values:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]


In [8]:
# Test cell. Do not edit

X_prime_true = np.array([[-21.207,  -5.373, -20.933,  -9.179, -11.319, -17.812],
 [ -1.995,   7.906, -10.516,   3.452,   9.863,  -7.24 ],
 [  5.479,   1.115,   9.244,   0.453,   5.656,   7.089],
 [ -7.413,  -7.416,   0.363,  -5.573,  -6.736,  -0.848],
 [-11.261,  -9.937,  -4.848,  -8.915, -13.378,  -5.761],
 [  3.548,  10.036,  -2.244,   1.604,  12.113,  -2.557],
 [  4.888,  -5.814,   2.407,   3.228,  -4.232,   3.71 ],
 [  1.248,  18.894,  -6.409,   3.224,  19.717,  -5.629]])

assert np.allclose(X_prime, X_prime_true, rtol=1e-3, atol=1e-3), "Test failed. Your answer did not match the correct answer."