# Self Attention

### Terminos a usar:
* Q - Query (Que estoy buscando)
* K - Key (Que puedo ofrecer)
* V - Value (Lo que ofresco)
* d_X - Dimension de X conjunto (dV, dK, dQ, ...)

De acuerdo con el paper de Attention is all you need, una funcion de atencion se describe como el mapeo de la Query y el conjunto de pares de Key-Values a una salida, donde Q-K-V son vectores. La salida es computada como una sumatoria con pesos de estos valores, y donde los pesos son obtenidos mediante una funcion de compatibilidad entre la query y la key correspondiente.

Primeramente, para abordar esta parte tenemos que generar nuestos datos Q-K-V

In [13]:
import numpy as np
import math

phrase = "Hola mi nombre es Alex"
input = phrase.split(' ')

In [4]:
input

['Hola', 'mi', 'nombre', 'es', 'Alex']

In [17]:
in_len = len(input)
d_K, d_V = 8, 8
q = np.random.randn(in_len, d_K)
k = np.random.randn(in_len, d_K)
v = np.random.randn(in_len, d_V)

In [19]:
q

array([[ 0.28992813,  0.85906143,  0.21495725, -2.1308364 ,  1.02194232,
        -0.36578623, -0.59514708, -0.46351817],
       [ 1.87962513, -0.02568894, -0.51524885,  1.23020388, -0.67709758,
        -0.91531064,  0.16847735,  0.2347544 ],
       [-0.2302133 , -0.51736049,  0.22206796, -0.21667557,  0.98000167,
        -0.25227682,  1.05231845,  0.03994449],
       [-0.57217994, -0.13183049, -0.7318054 , -0.90349994,  0.4996766 ,
         0.51383088,  1.47310469, -0.82662214],
       [ 0.71240792,  0.60729948,  1.27335944,  0.31328028, -1.56159438,
         0.12807969, -0.73653941,  1.42056491]])

In [20]:
k

array([[-1.58231432, -0.19734611, -1.00242319, -0.03805725, -1.28308434,
         0.8047156 ,  0.89423467,  0.17954249],
       [ 0.36907596, -1.14338458, -0.93052493, -0.5720036 ,  0.65796214,
         0.10875771,  0.39224373, -0.1506256 ],
       [-0.09344037,  0.80946902,  0.72635436,  0.34858423,  0.60613819,
        -3.30908328, -1.47275347,  0.1766531 ],
       [-1.61318901,  0.30499548,  1.20877345,  0.9539165 ,  1.00297772,
         1.20746745,  2.98391289, -0.11909048],
       [-1.08699605,  0.60438553,  0.6649686 , -0.12199497, -0.52800097,
        -0.27451453, -0.15154801,  0.91716749]])

In [21]:
v

array([[-2.09988025,  0.01087252, -0.66951067, -1.98764867,  1.22440004,
         0.12023061, -0.35682981, -0.18709454],
       [-0.05035914, -0.86577554,  2.00003202, -1.35962342,  0.48887871,
         0.11802904, -0.21923461,  1.30231047],
       [-0.24799169,  0.22273024,  0.77994725, -0.55120025,  1.04939514,
        -0.61955624, -2.75513078, -1.22062181],
       [ 0.09765184, -0.98867081, -0.83415428, -0.02786349,  0.20555664,
         0.42917456,  0.29869629, -0.32787272],
       [ 0.17731818,  0.85586311, -0.86626718, -1.6316923 ,  1.91028236,
        -0.93733119, -1.97092547, -1.42572689]])

Una vez teniendo estos, vamos a ser capaces de representar la parte de la funcion de atencion o el scaled dot-product attention, tomando como base lo que hay el en paper.

<img src="./images/scaled_dot_prod_attent.png" alt="scaled_dot_prod_attent" width="200" height="auto"> <img src="./images/attention.png" alt="attention formula" width="400" height="auto">

Para esto, en la formula mostrada en el paper se ignora la parte de la Mask, ya que esta es opcional, pero bastante util. unicamente se añade a la formula sumandola al resultado de lo que la division.

Funcion softmax se representa como: 

<img src="./images/softmax.png" alt="softmax func" width="200" height="auto">

In [23]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

In [24]:
scaled = np.matmul(q, k.T)/math.sqrt(d_K)

In [26]:
#Mask optional
mask = np.tril(np.ones( (in_len, in_len) ))

In [27]:
mask

array([[1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1.]])

In [28]:
mask[mask == 0] = -np.infty
mask[mask == 1] = 0

In [29]:
mask

array([[  0., -inf, -inf, -inf, -inf],
       [  0.,   0., -inf, -inf, -inf],
       [  0.,   0.,   0., -inf, -inf],
       [  0.,   0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.,   0.]])

In [30]:
attention = softmax(scaled+mask)

In [31]:
attention

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.31793058, 0.68206942, 0.        , 0.        , 0.        ],
       [0.26506949, 0.48489931, 0.2500312 , 0.        , 0.        ],
       [0.24693096, 0.21313512, 0.01900443, 0.52092948, 0.        ],
       [0.13352946, 0.05557004, 0.29655043, 0.06418016, 0.45016991]])

In [32]:
out = np.matmul(attention, V)

In [33]:
out

array([[-0.58386373, -0.38535245,  0.21646513,  0.49997813,  1.14483231,
         0.35961925, -0.6517341 , -0.59169241],
       [-0.29159001,  0.49177799,  0.77658553,  1.39734939,  0.59765795,
        -0.1503909 , -0.30023651, -0.76811127],
       [-0.22086216, -0.11948308,  0.33831905,  0.79815492,  0.53506018,
         0.01258004, -0.70790177, -0.80850755],
       [ 0.22481361,  0.90139912,  0.35280004,  1.58500633,  0.74560973,
         0.37851397, -1.21530399,  0.45751405],
       [-0.12599258, -0.3066491 ,  0.19512373,  0.90484334,  0.26112036,
         0.48219163, -0.52767298,  0.06996178]])

Todo esto conforma la parte del Scaled Dot Product Attention. El siguiente paso es escalar eso al muli-head attention, que en escencia es lo mismo pero a mayor escala.

<img src="./images/multi_head_attention.png" alt="scaled_dot_prod_attent" width="300" height="auto">