# Notebook Tasks

1. Perform matrix multiplication with NumPy's matrix multiplication functions.
2. Compute the weighted sum.
3. Compute the vectorized weight matrices.
4. Use the softmax function from `scipy.special` to compute the attention scores.
5. Use `hstack` from NumPy to concatenate the attention outputs of the individual heads in multi-head attention.

# Imports

In [6]:
from numpy import array
from numpy.random import randint
from numpy import random
from numpy import dot, matmul, hstack, vstack
from scipy.special import softmax
from math import sqrt

#ignore warnings
import warnings 
warnings.filterwarnings('ignore')

In [7]:
# set seed for reproducibility
random.seed(42) 

This Notebook is based on the following sources:
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [Transformers for Natural Language Processing](https://www.packtpub.com/product/transformers-for-natural-language-processing-second-edition/9781803247335), by Denis Rothman
- [Machine Learning for Text](https://link.springer.com/book/10.1007/978-3-030-96623-2), by Charu C. Aggarwal
- [Neural Networks and Deep Learning](https://link.springer.com/book/10.1007/978-3-319-94463-0), by Charu C. Aggarwal
- https://jalammar.github.io/illustrated-transformer/

We will discuss the inner workings of the network from the following paper:

[Link to the paper](https://arxiv.org/pdf/1706.03762.pdf)

# Attention

The neural architecture of a recurrent neural network (RNN) with an extra attention layer added:

Image credit: [Neural Networks and Deep Learning](https://link.springer.com/book/10.1007/978-3-319-94463-0), by Charu C. Aggarwal

![Transformer Architecture](images/transformer_architecture.png)

---

## Positional Encoding
<br>
Unlike RNNs, which parse a sentence word by word in a sequential manner, transformer models let all words in a sentence flow through the network simultaneously. To account for the order of the words in the input sequence, the transformer adds a vector to each input embedding. These positional encoding vectors follow a specific pattern and provide a way to compute distances between embedding vectors once they are projected into Q/K/V and during the dot-product attention.

The $PE(w_{t})$ is the positional encoding for the word $w$ at position $t$, which is a vector of dimension $d_{model}$ equal to the embedding dimension. We compute each dimension $i$ of this vector as follows: <br>

$$
PE_{i}\left(w_{t}\right)= \begin{cases}\sin \left(w_{j} \cdot t\right) & \text { if } \quad i=2 j \\ \cos \left(w_{j} \cdot t\right) & \text { if } \quad i=2 j+1\end{cases}
$$
where $d = d_{model}$ and the frequency $w_j$ (not to be confused with the word $w$) is
$$ w_j = \frac{1}{{10000}^{\frac{2j}{d}}} $$
The frequencies $w_j$ decrease along the vector dimension, so the corresponding wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. The positional embedding can be represented as a vector containing pairs of sin and cos for each frequency:

$$
\vec{p_t} = \left[\begin{array}{c}
\sin(w_1 \cdot t) \\
\cos(w_1 \cdot t) \\
\sin(w_2 \cdot t) \\
\cos(w_2 \cdot t) \\
\vdots \\
\sin(w_{\frac{d}{2}} \cdot t) \\
\cos(w_{\frac{d}{2}} \cdot t)
\end{array}\right]_{d \times 1}
$$

We can think of this as a bit representation of numbers, with each dimension of the vector acting as a bit. Each bit changes periodically, and we can tell that one number is bigger than another from the bits that are activated and their order. More on this intuition in this <a href="https://kazemnejad.com/blog/transformer_architecture_positional_encoding/" target="_blank">blog post</a>.
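The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the function name `positional_encoding`, the parameter `base`, and the choice to count $j$ from 0 (as in the original paper) are assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array of sinusoidal encodings.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    positions = np.arange(num_positions)[:, None]   # shape (T, 1)
    j = np.arange(d_model // 2)[None, :]            # shape (1, d_model / 2)
    freqs = 1.0 / base ** (2 * j / d_model)         # the frequencies w_j
    angles = positions * freqs                      # shape (T, d_model / 2)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions i = 2j
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions i = 2j + 1
    return pe

pe = positional_encoding(num_positions=4, d_model=8)
print(pe.shape)   # one row per position, one column per dimension
print(pe[0])      # position 0: every sine is 0 and every cosine is 1
```

Each row would be added to the embedding of the word at that position.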

---

For this simple example, the attention mechanism we are building is scaled down to $d_{model} = 3$
instead of $d_{model} = 512$. This brings the dimension of each input vector $x$ down to
three, which is simpler to visualize.

Now, let's begin by defining the word embeddings of the four distinct words whose attention will be calculated. In practice, these word embeddings would have been created by an encoder; nevertheless, for the purposes of this example, we will manually define them.



For the sake of simplicity, we use only 4 inputs and a 3-dimensional positional encoding.

------

## Step 1: Inputs - Word Embeddings

In [8]:
# manually defined word embeddings
word_1 = array([1, 0, 1])
word_2 = array([0, 1, 1])
word_3 = array([1, 0, 0])
word_4 = array([1, 1, 1])
 
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# print stacked words
print(words,'\n')

#print shape
print('\x1b[1;31mInputs: ',words.shape[0], '\x1b[1;30m ,','\x1b[1;32mDimension: ',words.shape[1])

[[1 0 1]
 [0 1 1]
 [1 0 0]
 [1 1 1]] 

Inputs:  4  , Dimension:  3


-----

## Step 2: Weight Matrices

### Self Attention Calculation
1. For each word create a Query vector, a Key vector, and a Value vector by multiplying the embedding vector of each word by the weight matrices ${W_Q}$, ${W_K}$ and ${W_V}$ that are learned during the training process.

![Query, Key and Value vectors](images/transformer_self_attention_vectors.png)

We generate these weight matrices randomly.<br>
<font color=red>Note: *In a neural network setting, these weights are usually small numbers, initialised randomly from an appropriate distribution such as a Gaussian.* </font><br>

For the matrix multiplication to be defined, the number of rows in each of these matrices must equal the dimension of the word embeddings (three in this example).
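As a small illustration of the note on initialisation and of the shape requirement, the sketch below draws one weight matrix from a Gaussian and multiplies a single embedding by it. The seed, the scale `0.02`, and the names `W_Q_gauss`, `d_model` and `d_k` are assumptions made for this example only; the notebook itself uses small random integers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for this sketch

d_model = 3   # embedding dimension used in this notebook
d_k = 3       # dimension of the query vector (hypothetical choice)

# small Gaussian-distributed weights; the scale 0.02 is an illustrative assumption
W_Q_gauss = rng.normal(loc=0.0, scale=0.02, size=(d_model, d_k))

x = np.array([1, 0, 1])     # one word embedding of length d_model
q = x @ W_Q_gauss           # (d_model,) @ (d_model, d_k) -> (d_k,)

print(W_Q_gauss.shape, q.shape)
```

Because the embedding here contains only zeros and ones, the query is simply the sum of the rows of the weight matrix selected by the ones.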

In [9]:
# randomly create weight matrices
W_Q = randint(3, size=(3, 3))
W_K = randint(3, size=(3, 3))
W_V = randint(3, size=(3, 3))

2.  Score each word of the input sentence against all the words in the sentence. The score determines how much focus to place on other parts of the input sentence as the word at a certain position is encoded. It is calculated by taking the dot product of the word's query vector with the key vector of the respective word being scored.

----

## Step 3: Matrix multiplication to obtain `Q,K,V`

This is a simple illustration of matrix multiplication, where $X$ contains the word embeddings, $W^{Q}$, $W^{K}$ and $W^{V}$ are the weight matrices, and $Q$, $K$ and $V$ are abstractions used for computing attention. The query, key, and value vectors for each word are generated by multiplying its word embedding by each of the weight matrices.

![Matrix Calculation of Self Attention](images/self-attention-matrix-calculation.png)


Image sources: https://jalammar.github.io/illustrated-transformer/

---

## Matrix multiplication to find Queries, Keys and Values

Multiply every word embedding by each of the weight matrices $W_{Q}$, $W_{K}$ and $W_{V}$ to obtain its query, key and value vectors, respectively.

In [10]:
# generating the queries, keys and values separately for each input word embedding

# this can be done with the NumPy function dot(word, W) or by calling the
# method dot on the array, e.g. query_1 = word_1.dot(W_Q)

# word 1
query_1 = dot(word_1, W_Q)
key_1 = dot(word_1, W_K)
value_1 = dot(word_1, W_V)

# word 2
query_2 = dot(word_2, W_Q)
key_2 = dot(word_2, W_K) 
value_2 = dot(word_2, W_V)

# word 3
query_3 = dot(word_3, W_Q)
key_3 = dot(word_3, W_K)
value_3 = dot(word_3, W_V)
 
# word 4    
query_4 = dot(word_4, W_Q)
key_4 = dot(word_4, W_K)
value_4 = dot(word_4, W_V)

Addressing the first word only, compare its query vector to all the key vectors using the dot product operation.

In [11]:
# comparing the first query vector to all key vectors
scores = array([
    dot(query_1, key_1),
    dot(query_1, key_2),
    dot(query_1, key_3),
    dot(query_1, key_4)
])

print(scores)

[23 11 18 29]


---

## Step 4: Scaled Attention Scores

<br> <br>
<font size = 4>
The generalized attention is then computed by a weighted sum of the value vectors, where each value vector is paired with a corresponding key:
    </font>
The score determines how much focus to place on other parts of the input sentence as the word  is encoded at a certain position.


$$
\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V
$$

3.  Divide the scores by the square root of the dimension of the key vectors, which leads to more stable gradients.
4. Pass the result through a softmax operation, which determines how much each word will be expressed at this position.

![Self Attention calculation in matrix form](images/self-attention-matrix-calculation-2.png)
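To see what the scaling in step 3 does, the sketch below applies a softmax to the scores of the first word printed above (`[23 11 18 29]`), once without and once with the division by $\sqrt{d}$. The helper `softmax_1d` is a hypothetical hand-written stand-in for `scipy.special.softmax`, included only to make the computation explicit.

```python
import numpy as np

def softmax_1d(x):
    """Numerically stable softmax for a 1-D array (illustrative helper)."""
    shifted = x - np.max(x)     # subtracting the max leaves the result unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([23.0, 11.0, 18.0, 29.0])   # scores of the first word from the cell above
d_k = 3                                        # key dimension in this notebook

unscaled = softmax_1d(scores)
scaled = softmax_1d(scores / np.sqrt(d_k))

print("unscaled:", unscaled.round(4))
print("scaled:  ", scaled.round(4))
```

Without scaling, nearly all of the weight lands on the largest score. Dividing by $\sqrt{d}$ flattens the distribution somewhat, which keeps the softmax away from its saturated region, where gradients become very small.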

In [12]:
# computing the weights using the softmax function
weights = softmax(scores / sqrt(key_1.shape[0]))

# softmax produces weights that sum to 1;
# here we check that this is indeed the case
assert(weights.sum().round(4) == 1.0)

# print out result
print(weights)

[3.02989149e-02 2.96856554e-05 1.68937823e-03 9.67982021e-01]


----

## Step 5: Weighted Sum

**Compute the weighted sum.**

The attention output is computed as the weighted sum of the four value vectors `value_1`, `value_2`, `value_3` and `value_4`. We can compute this by:
- multiplying the `first entry` of our weights vector (from above) by `value_1` and then
- `adding` this to
- the product of the `second entry` of our weights vector with `value_2`, and so on.

**Note:** You can access the first entry of the weights vector with `weights[0]`.

In [13]:
print(f"weights: {weights}")
print()
print(f"first weight: {weights[0]}")
print(f"value_1: {value_1}")
print(f"multiplication of first weight by value_1: {weights[0] * value_1}")
print()
print(f"multiplication of second weight by value_2: {weights[1] * value_2}")
print(f"multiplication of third weight by value_3: {weights[2] * value_3}")
print(f"multiplication of fourth weight by value_4: {weights[3] * value_4}")
print()
print(f"attention is the sum of the weighted values: {weights[0] * value_1 + weights[1] * value_2 + weights[2] * value_3 + weights[3] * value_4}")

weights: [3.02989149e-02 2.96856554e-05 1.68937823e-03 9.67982021e-01]

first weight: 0.03029891486600504
value_1: [1 1 0]
multiplication of first weight by value_1: [0.03029891 0.03029891 0.        ]

multiplication of second weight by value_2: [0.00000000e+00 2.96856554e-05 2.96856554e-05]
multiplication of third weight by value_3: [0.00168938 0.00168938 0.        ]
multiplication of fourth weight by value_4: [0.96798202 1.93596404 0.96798202]

attention is the sum of the weighted values: [0.99997031 1.96798202 0.96801171]


In [14]:
# weighted sum
values = array([value_1, value_2, value_3, value_4])
weights = array([weights[0], weights[1], weights[2], weights[3]])

#attention = sum([weight * value for weight, value in zip(weights, values)])
attention = dot(weights,values)

print(attention)

[0.99997031 1.96798202 0.96801171]


----

# Vectorized Form

As mentioned in the video "What is the attention mechanism?", for faster processing, the computations can be implemented in vectorized form to generate an attention output for all four words simultaneously. So, we are now looking at the vectorized version. 

![Attention Computation Vectorized](images/attention_comp.png)

For the vectorized version, use the stacked word embeddings `words` to obtain the matrices $Q$, $K$ and $V$, respectively.

In [15]:
# generating the matrices of queries Q, keys K and values V for all words at once
Q = dot(words, W_Q)
K = dot(words, W_K)
V = dot(words, W_V)

## Attention

Compute the scaled scores with $\frac{Q K^{T}}{\sqrt{d}}$ and plug them into the equation below. Remember that `K.shape[1]` is our dimension $d$ here.


$$
\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V
$$

5. Multiply each value vector by its softmax score, which makes it possible to focus on some words and pay less attention to others.
6. Sum up the weighted value vectors for that word.

The resulting vector is then fed forward.

Use the [softmax function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html) from `scipy`

In [16]:
# axis=1 makes softmax normalize each row, so the weights for each query sum to 1
attention = dot(softmax(dot(Q, K.T) / sqrt(K.shape[1]),  axis=1), V)

print(attention)

[[0.99997031 1.96798202 0.96801171]
 [0.99972362 1.89509373 0.89537011]
 [0.99307425 1.70208093 0.70900668]
 [0.99999705 1.9680079  0.96801085]]
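The one-line computation above can also be wrapped in a small reusable function that returns the attention weights alongside the output, which makes it easy to check that every row of weights sums to 1. This is a sketch under assumptions: the function name `scaled_dot_product_attention` is hypothetical, and for brevity the embeddings are used directly as $Q$, $K$ and $V$ (equivalent to identity weight matrices), which differs from the random weights used in the notebook.

```python
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d)) V."""
    d = K.shape[1]                          # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d)           # shape (n_queries, n_keys)
    weights = softmax(scores, axis=1)       # each row sums to 1
    return weights @ V, weights

# the four word embeddings from Step 1, used directly as Q, K and V (assumption)
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 1]], dtype=float)

out, attn_weights = scaled_dot_product_attention(X, X, X)
print(out.shape)                  # one output vector per input word
print(attn_weights.sum(axis=1))   # row sums of the attention weights
```

Since every output row is a convex combination of the rows of $V$, its entries stay within the range of the values in $V$.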


---

# Multihead Attention

Different attention mechanisms capture different relations, hence multi-head attention.

The weight matrices are learned parameters within the neural architecture. These matrices emphasize certain dimensions (meta-concepts) of the words, and their effect is to highlight these meta-concepts. A single phrase may include many meta-concepts of interest. The purpose of the multi-head technique is thus to use $k$ distinct sets of such matrices.
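The idea of using $k$ distinct sets of matrices can be sketched as follows. This is only an illustration under assumptions: the sizes (`num_heads = 2`, `d_head = 3`), the seed, and the names `attention`, `Wq`, `Wk` and `Wv` are hypothetical, and the weights are random rather than learned.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(seed=0)   # arbitrary seed for this sketch

def attention(Q, K, V):
    """Scaled dot-product attention for all rows at once."""
    return softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=1) @ V

# the four word embeddings from Step 1
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 1]], dtype=float)

num_heads, d_model, d_head = 2, 3, 3   # hypothetical sizes for illustration

heads = []
for _ in range(num_heads):
    # each head has its own set of weight matrices (random here, learned in practice)
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

multi_head = np.hstack(heads)   # shape (4, num_heads * d_head)
print(multi_head.shape)
```

Each head sees the same inputs but projects them differently, so each can emphasize a different relation between the words.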

Assume 3 outputs with 64 dimensions each, produced by learned weights.

Hence, assume that a single head yields three attention representations, each with a per-head dimension of $d_{model} / 8 = 64$:

In [17]:
# just for attention head 1
attention_head_1 = random.random((3, 64))
print(attention_head_1)

[[5.24774660e-01 3.99860972e-01 4.66656632e-02 9.73755519e-01
  2.32771340e-01 9.06064345e-02 6.18386009e-01 3.82461991e-01
  9.83230886e-01 4.66762893e-01 8.59940407e-01 6.80307539e-01
  4.50499252e-01 1.32649612e-02 9.42201756e-01 5.63288218e-01
  3.85416503e-01 1.59662522e-02 2.30893826e-01 2.41025466e-01
  6.83263519e-01 6.09996658e-01 8.33194912e-01 1.73364654e-01
  3.91060608e-01 1.82236088e-01 7.55361410e-01 4.25155874e-01
  2.07941663e-01 5.67700328e-01 3.13132925e-02 8.42284775e-01
  4.49754133e-01 3.95150236e-01 9.26658866e-01 7.27271996e-01
  3.26540769e-01 5.70443974e-01 5.20834260e-01 9.61172024e-01
  8.44533849e-01 7.47320110e-01 5.39692132e-01 5.86751166e-01
  9.65255307e-01 6.07034248e-01 2.75999182e-01 2.96273506e-01
  1.65266939e-01 1.56364067e-02 4.23401481e-01 3.94881518e-01
  2.93488175e-01 1.40798227e-02 1.98842404e-01 7.11341953e-01
  7.90175541e-01 6.05959975e-01 9.26300879e-01 6.51077026e-01
  9.14959676e-01 8.50038578e-01 4.49450674e-01 9.54101165e-02]
 [3.708

----

Suppose that the eight heads of the attention sub-layer have been trained. Each head now yields three output vectors, one for each of the three input vectors (words or word fragments), with $d_{model} / 8 = 64$ dimensions each:

![Multi Head attention mechanism](images/transformer_multi-headed_self-attention-recap.png)


----


![Attention heads in multi-head attention mechanism](images/transformer_self-attention_heads.png)


Image source: https://jalammar.github.io/illustrated-transformer/

In [18]:
z0 = random.random((3, 64))
z1 = random.random((3, 64))
z2 = random.random((3, 64))
z3 = random.random((3, 64))
z4 = random.random((3, 64))
z5 = random.random((3, 64))
z6 = random.random((3, 64))
z7 = random.random((3, 64))
print('Shape of one head', z0.shape, 'dimension of 8 heads', 64*8)

Shape of one head (3, 64) dimension of 8 heads 512


## Concatenation of heads 1 - 8 to obtain 8x64=512 output dimension of the model

![Concatenated attention heads](images/transformer_attention_heads_weight_matrix_o.png)

Image source: https://jalammar.github.io/illustrated-transformer/

Use [`hstack` from NumPy](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) to concatenate `z0 - z7` together.

In [19]:
output_attention = hstack((z0, z1, z2, z3, z4, z5, z6, z7))
print(output_attention)

[[0.44842414 0.99445746 0.17592525 ... 0.37847642 0.36294427 0.99548049]
 [0.62789441 0.19427395 0.07094092 ... 0.95732289 0.57304143 0.93273074]
 [0.76336442 0.80691298 0.34630432 ... 0.13380374 0.41858062 0.79033774]]
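In the original Transformer paper, the concatenated heads are then multiplied by a learned output matrix $W^{O}$, which maps the 512 concatenated dimensions back to $d_{model} = 512$. The sketch below illustrates this step with random stand-ins; the names `heads`, `concat` and `W_O`, the seed, and the scale `0.02` are assumptions for illustration, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed for this sketch

num_heads, seq_len, d_head = 8, 3, 64
d_model = num_heads * d_head          # 8 * 64 = 512

# random stand-ins for the head outputs z0 ... z7
heads = [rng.random((seq_len, d_head)) for _ in range(num_heads)]
concat = np.hstack(heads)             # shape (3, 512)

# hypothetical output projection; learned during training in a real model
W_O = rng.normal(scale=0.02, size=(d_model, d_model))
output = concat @ W_O                 # shape (3, 512)

print(concat.shape, output.shape)
# hstack keeps each head in its own block of 64 columns
print(np.array_equal(concat[:, :d_head], heads[0]))
```

The projection lets the model mix information across heads before the result is passed to the next sub-layer.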


----

# Additional: Feed-Forward Network

The illustration below depicts the fundamental structure of a feed-forward network, consisting of two hidden layers and one output layer. Even though each unit includes a single scalar variable, all units within a single layer are often represented as a single vector unit. Typically, vector units are shown as rectangles with connection matrices between them.

![Feed-forward network structure](images/feed_forward_network.png)

Image credit: [Neural Networks and Deep Learning](https://link.springer.com/book/10.1007/978-3-319-94463-0), by Charu C. Aggarwal
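A forward pass through such a network, with each layer treated as a single vector unit and a connection matrix between consecutive layers, might be sketched as below. The layer sizes, the ReLU activation, the seed, and the names `layer_weights`, `layer_biases` and `forward` are all assumptions made for illustration; they are not taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed for this sketch

def relu(x):
    """Element-wise ReLU activation (an illustrative choice)."""
    return np.maximum(0, x)

# hypothetical layer sizes: 5 inputs -> 3 hidden -> 3 hidden -> 1 output
sizes = [5, 3, 3, 1]

# one connection matrix and one bias vector per pair of consecutive layers
layer_weights = [rng.normal(scale=0.5, size=(n_in, n_out))
                 for n_in, n_out in zip(sizes[:-1], sizes[1:])]
layer_biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x, weights, biases):
    """Apply the hidden layers with ReLU, then a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

x = rng.random((4, 5))                # a batch of 4 input vectors
y = forward(x, layer_weights, layer_biases)
print(y.shape)
```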

---

<font color=red size=3>To experiment with a neural network, use the website: [playground.tensorflow](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.97131&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)</font>