# Sample Question

The intent of every question in the exam is to check your ability to convert an algorithm or pseudocode into working Python code.

In the actual exam, you will not be asked about topics that were not covered in the PML or ML lectures. However, you should still be able to complete this sample exercise with the information provided.

The solution is provided at the end of the notebook.

## Exercise (15 Points)
*Assume you have 40-45 minutes to complete this question.*


### Attention Layer

In this part, you should implement a simple self-attention layer in Python from scratch, without using deep learning libraries such as TensorFlow or PyTorch. You may only use NumPy for the matrix operations. Only the forward pass of the layer needs to be coded.

#### Brief Explanation
The intuition behind self-attention is to give a neural network the ability to focus on the relevant parts of its input, much like a human mind selectively focuses on parts of the information it is given. In a sequence of words, not all words contribute equally to the meaning of each word in the sentence. Self-attention allows a model to weigh the importance of words relative to each other for a given task, helping it understand context and relationships within the data. It does this by:
- Creating Query, Key, and Value vectors for each element in the sequence.
- Computing 'attention scores' by matching the Query with each Key, determining the relevance of each element to other elements.
- Computing an output for each element by combining the Value vectors, weighted by the attention scores.

In essence, self-attention helps a model make connections between different elements in a sequence, similar to how humans understand context between different words in a sentence.
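
To make the steps above concrete, here is a minimal sketch on a toy sequence of three elements. The Query, Key, and Value vectors are hand-picked, hypothetical numbers rather than the output of learned weight matrices, so this only illustrates how scores turn into weights and weights into outputs.

```python
import numpy as np

# Hypothetical toy inputs: 3 sequence elements, 2 dimensions each.
# In a real layer these would come from multiplying the input by weight matrices.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

# Attention scores: match every Query against every Key (3 x 3 matrix).
scores = Q @ K.T / np.sqrt(Q.shape[1])

# Row-wise softmax turns each row of scores into weights that sum to one.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each output row is a weighted combination of the Value vectors.
output = weights @ V

print(weights.round(3))
print(output.round(3))
```

In this toy setup the first Query matches the first and third Keys equally well and the second Key less well, so the first output row draws less from the second Value vector.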


#### Pseudocode

Each question will come with pseudocode, either in text form:

-------

```
Define Class SelfAttention:
  Define initialization function, __init__()
     Input: dims
     -  Declare self.dims
     -  Initialize weight matrices W_q, W_k, W_v, W_o randomly with shape (dims, dims). The matrices are to be initialized using normal distribution with mean = 0 and std = 1
  
  Define function forward
     Input: x with shape (n, dims)
     -  Calculate Q, K, V by performing dot product of x with W_q, W_k and W_v respectively.
     -  Calculate dot product of Q and transpose of K. Divide by sqrt of dims.
     -  Apply softmax to obtain scores
     -  Produce self-attention by calculating dot product of result from step above and V.
     -  Produce Output by calculating dot product of result from step above and W_o.
     -  Return Output
```

------

***OR* in algorithmic form**

Your class should have two functions: an `__init__` function
$$
\begin{align*}
&\mathrm{Initialize}(dims : num \ of \ dimensions \ for \ output):\\
\\
&W_Q,W_K,W_V,W_O  = Randomly \ Initialized \ 2D \ matrices \ [dims \times dims] \\ &from \ Normal \ Distribution (Mean = 0, Std = 1)\\
\end{align*}$$

And a `forward` function

$$\begin{align*}
&\mathrm{Forward \ Pass}(X\mathrm{: 2D \ matrix \ [n \times dims]}):\\
\\
&Q = X \times W_Q\\
&K = X \times W_K\\
&V = X \times W_V\\
&Score = Q \times K^T / \sqrt{dims}\\
&\mathrm{Attention \ Weights} = \mathrm{softmax}(Score)\\
&\mathrm{Self \ Attention} = \mathrm{Attention \ Weights} \times V\\
&\mathrm{Output} = \mathrm{Self \ Attention} \times W_O\\
\end{align*}$$

------

Where:
- $\mathrm{softmax}(x)_i = \frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}}$. You need to code this function as well.
- The sum of attention weights (i.e. after softmax) along each row should be equal to one. Hence, you need to apply the softmax function row-wise.
- $\times$ denotes matrix multiplication.

Note: `Score` and `Attention Weights` should be of dimension $[n \times n]$; `Q`, `K`, `V`, `Self Attention` and `Output` are of shape $[n \times dims]$, and the weight matrices are of shape $[dims \times dims]$.
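
A direct translation of the softmax formula above is enough for this exercise. As an optional refinement, the sketch below subtracts each row's maximum before exponentiating, which leaves the result mathematically unchanged but avoids floating-point overflow for large scores. The name `stable_softmax` and the sample scores are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np

def stable_softmax(X):
    # Shift each row so its largest entry is 0; softmax is invariant to
    # adding a constant per row, so the output is unchanged.
    shifted = X - np.max(X, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    # Normalize row-wise so every row sums to one.
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

# Hypothetical scores: the second row is the first row plus 999,
# so both rows should produce the same weights.
scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
weights = stable_softmax(scores)
print(weights)
print(weights.sum(axis=-1))
```

With the unshifted formula, `np.exp(1000.0)` overflows to infinity and the second row would come out as `nan`.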

In [5]:
### Write your code here:

import numpy as np

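def softmax(X):
    # Row-wise softmax: exponentiate, then normalize each row to sum to one
    e = np.exp(X)
    row_sums = np.sum(e, axis=1, keepdims=True)  # avoid shadowing the built-in sum
    return e / row_sums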

class SelfAttention:
    def __init__(self, dims):
        self.dims = dims
        # Weights are drawn from a normal distribution with mean 0 and std 1, as specified
        self.W_q = np.random.normal(0.0, 1.0, (dims, dims))
        self.W_k = np.random.normal(0.0, 1.0, (dims, dims))
        self.W_v = np.random.normal(0.0, 1.0, (dims, dims))
        self.W_o = np.random.normal(0.0, 1.0, (dims, dims))
        
    def forward(self, x):
        Q = np.dot(x, self.W_q)
        K = np.dot(x, self.W_k)
        V = np.dot(x, self.W_v)
        score = np.dot(Q, K.T) / np.sqrt(self.dims)
        attention_weights = softmax(score)
        self.attention = np.dot(attention_weights, V)
        output = np.dot(self.attention, self.W_o)
        return output

In [6]:
my_layer = SelfAttention(32)
output = my_layer.forward(np.random.random(size=(100,32)))
output.shape

(100, 32)

In [2]:
### Solution:

import numpy as np

def softmax(X):
    exp_x = np.exp(X)
    return exp_x/exp_x.sum(axis=-1, keepdims=True)

"""
Other valid implementations:
def softmax(X):
    num =  np.exp(X)
    denum = np.exp(X).sum(axis=-1, keepdims=True)
    return num/denum

def softmax_naive(X):
    new_matrix = []
    for ii in range(X.shape[0]):
        new_row = []
        for jj in range(X.shape[1]):
            num =  np.exp(X[ii,jj])
            new_row.append(num)
        new_row = np.array(new_row)
        new_row_sum = np.sum(new_row)
        new_matrix.append(new_row/new_row_sum)
    return np.vstack(new_matrix)
    
"""


class SelfAttention:
    def __init__(self, dims):
        self.dims = dims    
        
        self.W_q = np.random.normal(loc=0,scale=1,size=(dims,dims))
        self.W_k = np.random.normal(loc=0,scale=1,size=(dims,dims))
        self.W_v = np.random.normal(loc=0,scale=1,size=(dims,dims))
        self.W_o = np.random.normal(loc=0,scale=1,size=(dims,dims))
        
    def forward(self, x):
        # calculate Q, K and V
        Q = np.matmul(x, self.W_q)
        K = np.matmul(x, self.W_k)
        V = np.matmul(x, self.W_v)

        score = np.matmul(Q, np.transpose(K))
        score = score / np.sqrt(self.dims)
        attention_weights =  softmax(score)

        self_attention = np.matmul(attention_weights, V)
        output = np.matmul(self_attention, self.W_o)

        return output
    
### You can also use A@B notation for matrix multiplication instead of np.matmul(A, B); the two are equivalent.

my_layer = SelfAttention(32)
output = my_layer.forward(np.random.random(size=(100,32)))

In [4]:
output.shape

(100, 32)
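
The remark in the solution about the `@` operator can be checked with a quick sketch. The matrices `A` and `B` below are arbitrary random examples, not part of the exercise. For 2D arrays, `np.matmul`, the `@` operator, and `np.dot` (used in the first implementation above) all compute the same matrix product.

```python
import numpy as np

# Arbitrary example matrices with compatible shapes (4 x 3 and 3 x 5).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))

via_matmul = np.matmul(A, B)   # function form
via_operator = A @ B           # operator form, same as np.matmul
via_dot = np.dot(A, B)         # matches matmul for 2D inputs

print(np.allclose(via_matmul, via_operator))
print(np.allclose(via_matmul, via_dot))
```

The functions differ only for inputs with more than two dimensions, which do not occur in this exercise.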