## Understanding Self-Attention with NumPy ##

In this notebook we walk through a simple self-attention mechanism. We use only <b>NumPy</b> and aim for a basic understanding of the concept.

In [1]:
## Importing necessary packages ##

import numpy as np

Before diving into the code, let's take a moment to summarize what needs to be done and how self-attention is used.

- [ ] Here, we will use self-attention on a sequence, namely the English sentence "Self Attention is good". Our self-attention model doesn't understand words, so we need to convert each word into numbers. The easiest way to do this is to create a one-hot encoded vector for each word. Since there are 4 words in our sentence, each one-hot vector has dimension 4. So, let's prepare them: self_in_number = [1, 0, 0, 0], attention_in_number = [0, 1, 0, 0], is_in_number = [0, 0, 1, 0], and good_in_number = [0, 0, 0, 1].
- [ ] These vectors will be our inputs for self-attention.
- [ ] For self-attention we need to work out how much each word contributes to the output at the current time step. We compute a score for each word by taking the dot product of the current time step's input with every input (including itself). We pass these scores through a softmax function, which turns them into positive weights that sum to 1. Then we multiply each input by its weight and add the results up to get the output of the current step.
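
As a small sketch of the one-hot step described above, the vectors can also be generated from the sentence rather than typed by hand. The names `words`, `one_hot` and `word_to_vector` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# The example sentence from the text, split into its 4 words.
sentence = "Self Attention is good"
words = sentence.split()

# Row i of the identity matrix is the one-hot vector for word i.
one_hot = np.eye(len(words), dtype=int)

# Hypothetical lookup table from word to vector (an assumption for illustration).
word_to_vector = {word: one_hot[i] for i, word in enumerate(words)}

for word, vector in word_to_vector.items():
    print(f"{word:>9} -> {vector}")
```

This only works as written because no word repeats in the sentence; a real vocabulary would need one row per unique word.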

So, let's first make our inputs.

In [2]:
## Making inputs ##

x1 = np.array([1 , 0 , 0 , 0])
x2 = np.array([0, 1 , 0 , 0])
x3 = np.array([0 , 0 , 1 , 0])
x4 = np.array([0 , 0 , 0 , 1])

print(f'Inputs are \nx1 : \n{x1}\n\nx2 : \n{x2}\n\nx3 : \n{x3}\n\nx4 : \n{x4}')

inputs = np.array([x1 , x2 , x3 , x4])
print(f'Combined inputs \n{inputs}')

Inputs are 
x1 : 
[1 0 0 0]

x2 : 
[0 1 0 0]

x3 : 
[0 0 1 0]

x4 : 
[0 0 0 1]
Combined inputs 
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]]


What next? 

Now we must create our outputs.

Since we have 4 inputs, we expect 4 outputs.

We will initialize our outputs as zeros and will fill them with values later.

In [3]:
## Initializing our output values with zeros ##

outputs = np.zeros((4 , 4))

print(f'Initialized outputs : \n{outputs}')

Initialized outputs : 
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


But, how do we fill the values of our outputs? 

We calculate the output at step t as a weighted sum of all the inputs. The weight for each input comes from the dot product of that input vector with the input vector at time step t, passed through a softmax.
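
In symbols, the description above can be written roughly as follows. The symbols $s_{tj}$, $w_{tj}$ and $y_t$ are notation introduced here for illustration, with $x_j$ the input vectors and $n = 4$ the number of words.

```latex
s_{tj} = x_t \cdot x_j, \qquad
w_{tj} = \frac{\exp(s_{tj})}{\sum_{k=1}^{n} \exp(s_{tk})}, \qquad
y_t = \sum_{j=1}^{n} w_{tj}\, x_j
```

The middle expression is the softmax over the scores for step $t$, so the weights $w_{t1}, \dots, w_{tn}$ are positive and sum to 1.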

So, let's do it :D

In [4]:
## Filling the values of output ##

for i in range(len(outputs)):
    weight = np.zeros(outputs.shape[0])
    for j in range(len(inputs)):
        weight[j] = np.dot(inputs[i] ,  inputs[j])
    weight = np.exp(weight) / np.sum(np.exp(weight))
    print(f'Weight for word {i} is {weight}')
    
    for j in range(len(inputs)):
        outputs[i] += inputs[j] * weight[j]
    
print(f'Outputs is {outputs}')

Weight for word 0 is [0.47536689 0.1748777  0.1748777  0.1748777 ]
Weight for word 1 is [0.1748777  0.47536689 0.1748777  0.1748777 ]
Weight for word 2 is [0.1748777  0.1748777  0.47536689 0.1748777 ]
Weight for word 3 is [0.1748777  0.1748777  0.1748777  0.47536689]
Outputs is [[0.47536689 0.1748777  0.1748777  0.1748777 ]
 [0.1748777  0.47536689 0.1748777  0.1748777 ]
 [0.1748777  0.1748777  0.47536689 0.1748777 ]
 [0.1748777  0.1748777  0.1748777  0.47536689]]


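Woohoo, as expected, the output resembles the input since there is no connection between words: each one-hot vector has a dot product of 1 with itself and 0 with every other vector, so each word attends mostly to itself. If you replace the one-hot representation with an embedding representation, you will see different weights.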
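
The two numbers in the printed weights can be checked by hand. For one-hot inputs the scores for a word are 1 for itself and 0 for the other three, so the softmax gives e / (e + 3) and 1 / (e + 3). A minimal check, using names (`scores`, `expected_self`, `expected_other`) chosen here for illustration:

```python
import numpy as np

# One-hot inputs: the matrix of pairwise dot products is the identity.
inputs = np.eye(4)
scores = inputs @ inputs.T

# Softmax over the scores of the first word, [1, 0, 0, 0].
row = scores[0]
weights = np.exp(row) / np.exp(row).sum()

# Closed-form values: exp(1) for the word itself, exp(0) = 1 for each other word.
expected_self = np.e / (np.e + 3)
expected_other = 1 / (np.e + 3)

print(weights)
print(expected_self, expected_other)
```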

But before we move on to that, let's pause for a second and ask: "Is the previous solution efficient?" The answer is, of course, no. We can easily remove most of the loops with the help of matrices. So let's do it.

Also, this time we won't use one-hot encoded values; we will use random values instead.

#### We hate loops! Let the fun begin with matrices ####

In [5]:
## Setting our input ##
np.random.seed(4)
inp = np.random.randn(6 , 4) # each input is of dimension 4 #
print(f'Input is \n{inp}')
print(f'Input shape is {inp.shape}')

## Setting our weights ##
# our previous weights were nothing but the dot product of current step input and all other inputs ## 

weights = np.dot(inp , inp.T) # shape is (6 , 6) #

# We must softmax the weights #
# axis = 0 normalizes each column , so weights[: , i] sums to 1 #
weights = np.exp(weights)
weights = weights / np.sum(weights , axis = 0)
print(f'Weights is \n{weights}')
print(f'Weights shape is {weights.shape}')
print(f'Checking if softmax is working : {np.sum(weights , axis = 0)}')

Input is 
[[ 0.05056171  0.49995133 -0.99590893  0.69359851]
 [-0.41830152 -1.58457724 -0.64770677  0.59857517]
 [ 0.33225003 -1.14747663  0.61866969 -0.08798693]
 [ 0.4250724   0.33225315 -1.15681626  0.35099715]
 [-0.60688728  1.54697933  0.72334161  0.04613557]
 [-0.98299165  0.05443274  0.15989294 -1.20894816]]
Input shape is (6, 4)
Weights is 
[[0.41676423 0.03317058 0.02581661 0.36795706 0.03431829 0.02264204]
 [0.09500843 0.82755424 0.30216376 0.09750615 0.0023233  0.03800751]
 [0.0216077  0.08829633 0.54601934 0.02816939 0.00701479 0.05226323]
 [0.36149091 0.03344438 0.03306499 0.435864   0.01848149 0.02289683]
 [0.07837035 0.00185235 0.01913955 0.04295988 0.86969345 0.13172454]
 [0.02675839 0.01568212 0.07379574 0.02754352 0.06816868 0.73246585]]
Weights shape is (6, 6)
Checking if softmax is working : [1. 1. 1. 1. 1. 1.]
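
The cell above normalizes along `axis = 0`, so it is the columns of `weights` that sum to 1. If you would rather have each row hold the weights for one word, a row-wise softmax is one option. The sketch below assumes that convention and also subtracts the row maximum before exponentiating, a common trick to avoid overflow in `np.exp`; the helper name `softmax_rows` is hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max keeps np.exp from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

np.random.seed(4)
inp = np.random.randn(6, 4)
scores = inp @ inp.T  # symmetric (6, 6) matrix of dot products

# Each row of row_weights sums to 1.
row_weights = softmax_rows(scores)

# Column-normalized version, as in the notebook cell above.
col_weights = np.exp(scores) / np.sum(np.exp(scores), axis=0)

print(row_weights.sum(axis=1))
# Because scores is symmetric, the two versions are transposes of each other.
print(np.allclose(row_weights, col_weights.T))
```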


So let's get our output!

In [6]:
a = weights[:][0].reshape(1 , -1)
np.repeat(a , 4 , 0)

array([[0.41676423, 0.03317058, 0.02581661, 0.36795706, 0.03431829,
        0.02264204],
       [0.41676423, 0.03317058, 0.02581661, 0.36795706, 0.03431829,
        0.02264204],
       [0.41676423, 0.03317058, 0.02581661, 0.36795706, 0.03431829,
        0.02264204],
       [0.41676423, 0.03317058, 0.02581661, 0.36795706, 0.03431829,
        0.02264204]])

In [7]:
## Getting output ##

outputs = np.zeros(inp.shape)
print(f'output shape {outputs.shape}')

for i in range(len(outputs)):
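    # weights[:][i] is the same as weights[i] , i.e. row i of the weight matrix #
    # the softmax above normalized columns , so weights[: , i] is the slice that sums to 1 #
    current_weight = weights[:][i].reshape(-1 , 1)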
    print(f'current weight shape : {current_weight.shape}')
    outputs[i] = np.sum(np.multiply(inp , current_weight) , axis = 0)
    
print(f'Output is \n{outputs}')

output shape (6, 4)
current weight shape : (6, 1)
current weight shape : (6, 1)
current weight shape : (6, 1)
current weight shape : (6, 1)
current weight shape : (6, 1)
current weight shape : (6, 1)
Output is 
[[ 0.1290987   0.30275356 -0.81778664  0.41001273]
 [-0.23829336 -1.5724902  -0.54873169  0.52304713]
 [ 0.1019155  -0.73259804  0.23993999 -0.03317603]
 [ 0.16682378  0.26444536 -0.84840296  0.39399784]
 [-0.62948215  1.38112546  0.53304216 -0.05026326]
 [-0.73035879  0.05832655  0.14341142 -0.8512471 ]]
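
The remaining loop can also be replaced by a single matrix product. The sketch below is one way to do it, assuming a row-wise softmax so that row i of `weights` holds the weights for word i and sums to 1. The loop cell above instead takes rows of the column-normalized matrix (row 0 printed earlier sums to about 0.90), so its printed numbers will differ somewhat from what this version produces. The name `loop_outputs` is introduced here only for the comparison.

```python
import numpy as np

np.random.seed(4)
inp = np.random.randn(6, 4)

# Pairwise dot products, then a row-wise softmax (rows sum to 1).
scores = inp @ inp.T
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Row i of the result is the weighted sum of all inputs for word i.
outputs = weights @ inp  # (6, 6) @ (6, 4) -> (6, 4)

# Reference: the same computation written as explicit loops.
loop_outputs = np.zeros_like(inp)
for i in range(len(inp)):
    for j in range(len(inp)):
        loop_outputs[i] += weights[i, j] * inp[j]

print(f'Output shape is {outputs.shape}')
print(np.allclose(outputs, loop_outputs))
```

The matrix product does the multiply-and-sum of the inner loop for every word at once, which is why no Python-level loop is needed.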


And done :)

This was concise, and I hope it clears up a lot of doubts about self-attention.
We are going to build on this concept and move on to Transformer models using PyTorch soon.