In [1]:
import pandas as pd 
import numpy as np
import math

Every word is represented by an 8 x 1 vector. There are 4 words, so we are going to have 4 such vectors.
So the query holds 4 word representations, each of shape 8 x 1, and the same goes for the key and the value. In the code they are stored as the rows of a 4 x 8 array.

In [4]:
length, dk, dv = 4, 8, 8
# length is the length of the input sentence or sequence (number of words)
# dk - size of the query vector and of the key vector of each word
# dv - size of the value vector of each word

q = np.random.randn(length,dk)
k = np.random.randn(length,dk)
v = np.random.randn(length,dv)

In [5]:
print("Query vector initialized to : ",q)
print("Key vector initialized to : ",k)
print("Value vector initialized to : ",v)

Query vector initialized to :  [[-1.05226786 -1.45622728 -1.49409858 -0.22401677  1.27378093  0.37489617
   1.39358028 -2.04254455]
 [ 1.12208672  0.29292798  0.41313787 -2.65463817  0.1442023  -0.81215815
   1.6343774   1.63381934]
 [-0.3734988  -1.20677705  0.17089084  0.7537739  -1.55316242  0.0325164
  -0.20545718 -0.59549639]
 [-0.51613113 -0.18086092 -0.73833255  0.34159166 -0.02267987  0.3804978
   0.03070229  0.31732767]]
Key vector initialized to :  [[ 0.08967536 -0.73662656 -0.21155483 -0.34988207 -1.40330671  0.46893129
  -0.31229733 -0.54981254]
 [ 0.61002556  0.52360371  0.16391828  0.22941029 -0.63037566 -0.65122393
  -0.661597    0.03848236]
 [ 0.24159718  0.80764755  0.52046895 -1.26594196  0.91288491 -0.43786117
  -1.37108249 -0.98343145]
 [-1.7183082  -1.39770631 -1.61196873 -0.38908481  0.28559876 -0.25596778
  -1.26473164 -0.25460589]]
Value vector initialized to :  [[-0.0535808  -0.22407469 -1.00780423  1.45763766 -0.22523626 -1.17556314
  -0.51913612 -0.48569348]


To create an initial attention matrix, every word needs to look at every word in the sentence (itself included) to check its affinity toward that word. The query represents what a word is looking for. The key represents what a word has to offer.

In [6]:
np.matmul(q,k.T)

array([[ 0.44889684, -3.74839043, -0.82773109,  5.36446798],
       [-1.2656585 , -0.28383146,  0.72295591, -4.20456257],
       [ 3.14195194,  0.41213294, -2.494945  ,  1.71931393],
       [ 0.14981889, -0.69380765, -1.62895679,  1.97342748]])

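This matrix multiplication results in a 4 x 4 matrix in which entry (i, j) is the dot product of the query of word i with the key of word j. Looking at the first row, the first word is going to focus the most on the 4th word (score 5.36). Each entry is a sum of dk products, so the variance of the scores grows with dk, and large scores would push the softmax toward putting almost all of its weight on one word. To reduce the variance of this matrix and stabilize the values, we divide it by the square root of dk, the dimension of the key vectors (which is also the dimension of the query vectors).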

In [7]:
q.var()

1.0455276269036424

In [8]:
k.var()

0.5324130632969337

In [9]:
np.matmul(q,k.T).var()

5.611252748084973

In [12]:
scaled = np.matmul(q,k.T) / math.sqrt(dk)
q.var(),k.var(),scaled.var()

(1.0455276269036424, 0.5324130632969337, 0.7014065935106215)
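With only four words the variance estimates above are noisy. The sketch below repeats the experiment on a much larger sample to make the effect of the scaling easier to see. It assumes independent, zero-mean, unit-variance entries; the sample size `n`, the seed, and the names `q_big`, `k_big`, `dots` are illustrative choices, not part of the notebook.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n, dk = 10000, 8                # n is a hypothetical sample size

q_big = rng.standard_normal((n, dk))
k_big = rng.standard_normal((n, dk))

# Row-wise dot products q_i . k_i. Each one is a sum of dk products of
# independent unit-variance terms, so its variance is close to dk.
dots = np.sum(q_big * k_big, axis=-1)

# Dividing by sqrt(dk) divides the variance by dk, bringing it back near 1.
scaled_dots = dots / math.sqrt(dk)

print("variance before scaling:", dots.var())
print("variance after scaling: ", scaled_dots.var())
```

The first value should come out close to 8 and the second close to 1, which is why the divisor is sqrt(dk) rather than dk itself.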

Masking is done to ensure that words do not get context from words generated at later timesteps.
It is not necessary in encoders, but decoders need it.

In [15]:
mask = np.tril(np.ones( (length,length) ))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [16]:
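# np.infty was removed in NumPy 2.0; np.inf works in every version
mask[mask==0] = -np.inf   # future positions: blocked
mask[mask==1] = 0         # current and past positions: left unchanged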
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
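The same additive mask can also be built in a single step, without overwriting the 0/1 matrix in place. This is a small sketch of one alternative, not the notebook's method; the names `future` and `mask_alt` are illustrative.

```python
import numpy as np

length = 4

# True strictly above the diagonal, i.e. at positions j > i (future words)
future = np.triu(np.ones((length, length), dtype=bool), k=1)

# -inf where a position must be hidden, 0 where it may be attended to
mask_alt = np.where(future, -np.inf, 0.0)
print(mask_alt)
```

Keeping the boolean matrix separate avoids relying on the order of the two in-place assignments, which matters because the second one tests for a value that the first one must not have produced.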

In [17]:
scaled+mask

array([[ 0.158709  ,        -inf,        -inf,        -inf],
       [-0.44747785, -0.10034957,        -inf,        -inf],
       [ 1.11084776,  0.145711  , -0.88209626,        -inf],
       [ 0.05296898, -0.24529805, -0.5759232 ,  0.69771198]])

In [18]:
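def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)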

In [19]:
attention = softmax(scaled+mask)

In [20]:
attention 

array([[1.        , 0.        , 0.        , 0.        ],
       [0.41407898, 0.58592102, 0.        , 0.        ],
       [0.65909816, 0.25107099, 0.08983085, 0.        ],
       [0.23918967, 0.17750341, 0.12753166, 0.45577526]])

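The attention matrix holds the attention weights of the words: row i says how much word i attends to each word, and every row sums to 1.

The negative infinity entries of the mask were turned into zeros by the softmax function, since exp(-inf) = 0. In the attention matrix, each word therefore draws only on the words with non-zero attention weights, that is, itself and the words before it.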
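Both properties of the masked attention matrix can be checked directly. The sketch below is self-contained: it uses fresh random scores rather than the notebook's `scaled`, and `stable_softmax`, `scores`, `causal`, and `weights` are illustrative names.

```python
import numpy as np

def stable_softmax(x):
    # Row-wise softmax; subtracting the row max avoids overflow in exp
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # arbitrary seed
length = 4
scores = rng.standard_normal((length, length))

# Additive causal mask: -inf above the diagonal, 0 elsewhere
causal = np.where(np.triu(np.ones((length, length), dtype=bool), k=1), -np.inf, 0.0)
weights = stable_softmax(scores + causal)

# Every row is a probability distribution over the visible words
print(np.allclose(weights.sum(axis=-1), 1.0))
# No weight at all lands on future positions
print(np.all(np.triu(weights, k=1) == 0))
```

The check only works because every row keeps at least one unmasked entry (the diagonal). A row made entirely of -inf would produce NaN instead of zeros.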

Multiplying the attention matrix by the value matrix gives, for each word, a weighted average of the value vectors it is allowed to see, which is a better encapsulation of the contextual information.

In [22]:
computed_v = np.matmul(attention,v)
computed_v

array([[-0.0535808 , -0.22407469, -1.00780423,  1.45763766, -0.22523626,
        -1.17556314, -0.51913612, -0.48569348],
       [ 0.04744973, -0.30369767, -0.44676943,  1.38830232, -0.10962902,
        -0.34936582, -0.60095694,  0.0656511 ],
       [ 0.12489735, -0.2292826 , -0.69285061,  1.48262806, -0.14077788,
        -0.49975505, -0.44115993, -0.27139896],
       [ 0.76157713, -0.44783889,  0.44220809,  0.63717761,  0.47283842,
         0.13194181,  0.10563615, -1.28337573]])
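Because the first row of the attention matrix is [1, 0, 0, 0], the first row of the result is just the first value vector, which is why it matches the first row printed for `v` earlier. The sketch below makes the weighted-average reading explicit. It copies the attention weights from the output above and pairs them with a freshly drawn value matrix, so `v_demo` and `out_demo` are illustrative and do not reproduce the notebook's numbers.

```python
import numpy as np

# Attention weights copied from the notebook output above
attention_demo = np.array([
    [1.0,        0.0,        0.0,        0.0],
    [0.41407898, 0.58592102, 0.0,        0.0],
    [0.65909816, 0.25107099, 0.08983085, 0.0],
    [0.23918967, 0.17750341, 0.12753166, 0.45577526],
])

rng = np.random.default_rng(0)          # arbitrary seed
v_demo = rng.standard_normal((4, 8))    # hypothetical value matrix

out_demo = attention_demo @ v_demo

# Word 1 sees only itself, so its output is its own value vector
print(np.allclose(out_demo[0], v_demo[0]))

# Word 2 mixes the first two value vectors using its two non-zero weights
manual_row = 0.41407898 * v_demo[0] + 0.58592102 * v_demo[1]
print(np.allclose(out_demo[1], manual_row))
```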

In [23]:
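def scaled_dot_product(q, k, v, mask=None):
    # q, k: (length, dk); v: (length, dv); mask: additive (length, length) or None
    dk = q.shape[-1]
    # Scale the raw scores by sqrt(dk) to keep their variance in check
    scaled = np.matmul(q, k.T) / math.sqrt(dk)
    if mask is not None:
        # 0 keeps a position, -inf removes it after the softmax
        scaled = scaled + mask
    attention = softmax(scaled)
    # Each output row is a weighted average of the rows of v
    out = np.matmul(attention, v)
    return out, attention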
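A short usage sketch of the function, called once without a mask (the encoder case) and once with a causal mask (the decoder case). The function and a numerically stable variant of `softmax` are restated so that the block runs on its own. The seed and `dv = 6` are hypothetical choices, with `dv` picked to differ from `dk` to show that the output width follows the value vectors.

```python
import math
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def scaled_dot_product(q, k, v, mask=None):
    dk = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(dk)
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, v), attention

rng = np.random.default_rng(0)   # arbitrary seed
length, dk, dv = 4, 8, 6         # dv = 6 is a hypothetical value

q = rng.standard_normal((length, dk))
k = rng.standard_normal((length, dk))
v = rng.standard_normal((length, dv))

# Additive causal mask: -inf above the diagonal, 0 elsewhere
mask = np.where(np.triu(np.ones((length, length), dtype=bool), k=1), -np.inf, 0.0)

enc_out, enc_attn = scaled_dot_product(q, k, v)             # no mask
dec_out, dec_attn = scaled_dot_product(q, k, v, mask=mask)  # causal mask

print(enc_out.shape, enc_attn.shape)        # (length, dv) and (length, length)
print(np.all(enc_attn > 0))                 # unmasked: every word sees every word
print(np.all(np.triu(dec_attn, k=1) == 0))  # masked: nothing from future words
```

Note that `k.T` assumes 2-D inputs. Batched or multi-head inputs would need the last two axes swapped instead, which this sketch does not attempt.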
