### Important things to understand about PyTorch dimensions : 
- We are gonna work with tensors that extend up to $3$ dimensions ( and $4$ dimensions as well 💀)
- First get familiar with $dim = 3$, i.e. a tensor with $3$ dimensions :
    <div align="center">

    <img src="dim_explain1.jpeg">

    </div>

    <div align="center">

    <img src="dim_explain2.jpeg">

    </div>

    <div align="center">

    <img src="dim_explain3.jpeg">

    </div>

    
    - Say we want to get the sum. 
    - We can take it in 3 ways, one per dimension : 
- $dim = 0$
    - Here the sum is taken across the **batch** dimension, as shown above. 
    - Basically, the pink tiles (the same cell in every matrix) are summed up, and the same goes for every other cell. 
    - What's the output ? Say the input has shape $(3,4,5)$. 
    - We summed along the batch dimension for each element position in the individual matrices. 
    - So the final shape is $(4,5)$.

```python
import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=0)
print("sm.shape = ",sm.shape)
print(sm)
# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([4, 5])
# tensor([[111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.]])
```


- $dim = 1$
    - Here we sum down every column of each matrix, i.e. across its rows. So essentially all the pink boxes are summed up for each matrix. 
    - Final shape ? Each matrix shrinks into a single row where every element is the sum of its column, and we have $3$ such matrices. So we get $3$ rows and $5$ columns, and the final shape becomes $(3,5)$.

```python
import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=1)
print("sm.shape = ",sm.shape)
print(sm)

# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([3, 5])
# tensor([[   4.,    8.,   12.,   16.,   20.],
#         [  40.,   80.,  120.,  160.,  200.],
#         [ 400.,  800., 1200., 1600., 2000.]])
```


- $dim = 2$
    - Here we're summing up the elements in each row. All the green cells of a row shrink into one single cell, so every matrix becomes a column of $4$ row sums, and we have $3$ such matrices. Each column is laid out as a row and the $3$ rows are stacked, so the shape becomes $(3,4)$.

```python

import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=2)
print("sm.shape = ",sm.shape)
print(sm)


# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([3, 4])
# tensor([[  15.,   15.,   15.,   15.],
#         [ 150.,  150.,  150.,  150.],
#         [1500., 1500., 1500., 1500.]])

```

- **NOTE** : to get the final shape, remove from the shape of the original tensor the dimension along which the operation was applied. For a $3$-dimensional tensor the shape is (batch, rows, columns) : reducing over $dim=1$ collapses the rows (a column-wise operation) and reducing over $dim=2$ collapses the columns (a row-wise operation). That is why the dims are labelled (Batch, Column, Row) in these notes.

- Similar logic goes for applying **softmax**. The softmax is applied along the elements of the dimension we specify in the softmax function.

- $dim=0$ : Here the softmax is applied **stack-wise, across the matrices, for every cell position**. 

- $dim=1$ : Here the softmax is applied **column-wise, for each column of every matrix**.

- $dim=2$ : Here the softmax is applied **row-wise, for each row of every matrix**. 

- Note that softmax **doesn't change the shape of the tensor**. Rather than aggregating, it **normalizes the elements along the specified dimension** so that they are positive and sum to $1$, and returns a tensor of the same shape holding those normalized values.

- Note that negative indices refer to the same dimensions, counted from the end : 

<div align="center">

|dim:    |Batch|Column|Row |
|--------|-----|------|----|
|positive|$0$  |$1$   |$2$ |
|negative|$-3$ |$-2$  |$-1$|

</div>
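
The shape rule, the negative indices and the softmax behaviour described above can be checked with a quick sketch. The tensor `x` here is just an arbitrary stand-in with shape $(3,4,5)$, not the one from the examples.

```python
import torch

# Arbitrary stand-in tensor with shape (batch=3, rows=4, cols=5).
x = torch.arange(60, dtype=torch.float32).reshape(3, 4, 5)

# A reduction removes the dimension it runs over ...
shapes_after_sum = [tuple(x.sum(dim=d).shape) for d in range(3)]
print(shapes_after_sum)  # [(4, 5), (3, 5), (3, 4)]

# ... unless keepdim=True, which keeps that dimension with size 1.
kept_shape = tuple(x.sum(dim=1, keepdim=True).shape)
print(kept_shape)  # (3, 1, 5)

# Negative indices count from the end: dim=-1 is the same dimension as dim=2 here.
same_sum = torch.equal(x.sum(dim=-1), x.sum(dim=2))
print(same_sum)  # True

# softmax keeps the shape; along the chosen dim the values sum to 1.
p = torch.softmax(x, dim=-1)
softmax_shape = tuple(p.shape)
rows_sum_to_one = torch.allclose(p.sum(dim=-1), torch.ones(3, 4), atol=1e-5)
print(softmax_shape, rows_sum_to_one)  # (3, 4, 5) True
```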

------

### Last time : 
- We understood that there are $3$ separate matrices that are used to generate the key, query and value vectors for a single word.
- But what are we actually computing ? 

$\text{Attention}(Q,K,V) = \text{softmax}\left(\displaystyle\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$

where : 
$K = \text{ Key vector }$

$Q = \text{ Query vector }$

$V = \text{ Value vector }$

$d_{k}= \text{ dimension of key vectors }$


- Here a question arises : what is the shape of the matrices that we're going to multiply our vector with ? 
- The answer is $d_{model} \times d_{model}$.
- Here's what happens : 
- Suppose you have a sequence of words :
<div align="center">

`My cat is a lovely cat.`😸

</div>
     
    
- Each word has its own word embedding : $e_{My}\text{  }e_{cat}\text{  }e_{is}\text{  }e_{a}\text{  }e_{lovely}\text{  }e_{cat}$
- Now stack these vectors into one single matrix, one row per word ( the numbers below are placeholders ) :


<div align="center">

|dimensions      |$d_{0}$|$d_{1}$|$d_{2}$|$d_{3}$|$\cdots$|$d_{model}-1$|
|----------------|-------|-------|-------|-------|--------|-------------|     
|$e_{My}$        |   102 |   452 |   12  |  212  |$\cdots$|   864       |         
|$e_{cat}$       |   102 |   452 |   12  |  212  |$\cdots$|   864       |             
|$e_{is}$        |   102 |   452 |   12  |  212  |$\cdots$|   864       |     
|$e_{a}$         |   102 |   452 |   12  |  212  |$\cdots$|   864       | 
|$e_{lovely}$    |   102 |   452 |   12  |  212  |$\cdots$|   864       |         
|$e_{cat}$       |   102 |   452 |   12  |  212  |$\cdots$|   864       |         

</div>

- So here the dimensions of the input are not just $1\times d_{model}$ but actually $\text{ sequence\_length }\times d_{model}$
- Now we have $3$ weight matrices, one each for $K,Q\text{ \& }V$
- We multiply the input matrix by each of them to get $K$, $Q$ and $V$ :
<div align="center">

<img src="attention_calculation.jpeg">

</div>


- Here the individual vectors are the key, query and value vectors of each word.
- Suppose the input is a sentence of sequence length $5$. To calculate the contextual embedding of the $0^{th}$ word, we follow the process described below : 

<div align="center">

<img src="self_attention_calculation_process.jpeg">

</div>

- Here the scaled dot product is nothing but the division of $QK^{T}$ by $\sqrt{d_{k}}$. With a single head the key vectors have $d_{model}$ dimensions, so $d_{k} = d_{model}$.

- We'll explore why it's done later. **TODO**
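
To make the single-head flow concrete, here is a minimal sketch of the attention formula for one sentence. The sizes ( `seq_len = 6` , `d_model = 16` ) and the random matrices `E` , `W_q` , `W_k` , `W_v` are assumptions for illustration only; in a real model the weights are learned.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model = 6, 16            # hypothetical sizes: 6 words, 16-dim embeddings
E = torch.randn(seq_len, d_model)   # stand-in for the stacked word embeddings

# Three separate d_model x d_model matrices (random here, learned in practice).
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = E @ W_q, E @ W_k, E @ W_v        # each (seq_len, d_model)

d_k = K.shape[-1]                          # single head, so d_k == d_model
scores = Q @ K.T / math.sqrt(d_k)          # scaled dot product, (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)    # row i = coefficients for word i
contextual = weights @ V                   # (seq_len, d_model)

print(tuple(weights.shape), tuple(contextual.shape))  # (6, 6) (6, 16)
```

Row $0$ of `weights` holds the coefficients used to build the contextual embedding of the $0^{th}$ word from all the value vectors.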




### Any Flaw yet ?
- As of now we don't see any issue in our approach. 
- Okay, how about text with a double meaning 😏😼. Consider the text: 
<div align="center">

`A man saw a person with a telescope.`

</div> 

- There are 2 possible meanings : 
- The man saw a person **by using a telescope**. 
- The man saw a person who was **holding a telescope**.
- But under our current workflow, we can capture only one type of meaning. 
- So what's the fix ? 
- Instead of just one **type** of attention, calculate multiple attention scores. 
- Here comes the concept of **MULTIHEAD ATTENTION**. 
- But how exactly will we do this ? 
- Above we saw that we're supposed to find the key, query and value matrices for fetching the meaning of the current word given the words around it. 
- Here also we keep multiple key, query $\&$ value matrices, each capturing a different meaning, but in a slightly different way : 
    - We break the vector along its dimensions. We know that the matrix we're working with is $sequence\_length \times d_{model}$. 
    - Suppose we want to capture 8 different meanings of the same word. What we do is break the whole matrix ( column-wise ) into 8 different pieces :

    <div align="center">

    <img src="multihead_attention_splitting.jpeg">


    </div> 

- Here one important thing to note : $d_{model}\text{  }\% \text{  }  \text{number of attention heads }=0$
- Why ? How the duck will you divide $d_{model}$ into equal parts otherwise! 
- Now that we have broken down the matrix for multi-head attention, we'll do the same thing as before for each individual piece : find the contextual embeddings of each word, and then concatenate the embeddings.
- How exactly ? Like this : 


    <div align="center">

    <img src="multihead_attention_calculation_1.jpeg">

    </div> 

- Here we can see that we have calculated the contextual embeddings for a single head. 
- A point to note : since each head has its own set of key, query and value matrices, there will be $\text{num\_heads}$ sets of key, query and value matrices, one per head. 

- But this is for one head. We have $\text{num\_heads}$ heads, and therefore $\text{num\_heads}$ sets of contextual embeddings. 
- Now what? We have $\text{num\_heads}$ matrices, each with the dimension $\text{ sequence\_length } \times \frac{d_{model}}{num\_heads}$
- We'll simply concatenate 😛:

<div align="center">
<img src="multihead_attention_calculation_2.jpeg">
</div> 
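
The split and concatenate steps on their own can be sketched like this. It only shows the column-wise splitting and the shapes (the per-head attention is skipped), and the sizes are hypothetical.

```python
import torch

seq_len, d_model, num_heads = 6, 16, 8   # hypothetical sizes
assert d_model % num_heads == 0          # otherwise the pieces can't be equal
d_head = d_model // num_heads

X = torch.arange(seq_len * d_model, dtype=torch.float32).reshape(seq_len, d_model)

# Break the matrix column-wise into num_heads pieces of width d_head.
pieces = torch.chunk(X, num_heads, dim=-1)
print(len(pieces), tuple(pieces[0].shape))  # 8 (6, 2)

# (attention for each piece would be calculated here)

# Concatenating the pieces column-wise gives back sequence_length x d_model.
merged = torch.cat(pieces, dim=-1)
round_trip_ok = torch.equal(merged, X)
print(tuple(merged.shape), round_trip_ok)  # (6, 16) True
```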

-------


### SOME FINE DETAIL FOR GENERALITY : 
In the paper *Attention Is All You Need*, a more general approach to calculating multi-head attention was proposed : 
- We know that attention is : 
    - $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$
    - Here $d_{k}$ ( written $d_{key}$ below ) is the number of columns of the key matrix obtained by multiplying the word embeddings of the input by the key weight matrix of a given head. 
    - Note that $Q$ and $K$ must have the same number of columns ( here both are $sequence\_length \times d_{model}$ ) so that the product $QK^{T}$ is valid. 
    - In the original paper :
    - $\text{MULTIHEAD ATTENTION} = \text{Concat}(head_{1},head_{2},\cdots,head_{num\_heads})W^{O} $  
    where $head_{i} = \text{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$ 
    and $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are the matrices that we use for converting simple word embeddings to query, key and value vectors. 
    - Wait a 🦆ing minute ! What's $W^{O}$ ? I'll come back to this later.
    - Note that their dimensions are as follows : 
        - $Q : sequence\_length \times d_{model}$
        - $K : sequence\_length \times d_{model}$
        - $V : sequence\_length \times d_{model}$
        - $W_{i}^{Q} : d_{model} \times d_{key}$
        - $W_{i}^{K} : d_{model} \times d_{key}$
        - $W_{i}^{V} : d_{model} \times d_{value}$
    - Wait a minute : how did $d_{key} \& d_{value}$ come into the picture ? 
    - Just to add generality ! In the discussion above, we divided the matrix such that $d_{key} \& d_{value}$ came out to be the same. But what if someone decides to use different sizes for the key-query and value matrices ? 
    - By the way, why "key-query and value" and not "key, query and value" ? Because key and query are multiplied together 😭 like this : 
    $QK^{T}$, so for that product to work out we need the same $d_{key}$ in both the key and the query matrix. 
    Note that this difference in dimensions lives in the weight matrices that are used to generate the key, query and value matrices.
    - Note that we still want $d_{key}$ to be a factor of $d_{model}$
    - Now follow the flow : 
    $head_{i} = \text{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}) = \text{softmax}\left(\frac{QW_{i}^{Q}(KW_{i}^{K})^{T}}{\sqrt{d_{key}}}\right)VW_{i}^{V}$

    $QW_{i}^{Q} : (sequence\_length \times d_{model})  \times (d_{model} \times d_{key}) = (sequence\_length \times d_{key})$

    $KW_{i}^{K} : (sequence\_length \times d_{model})  \times (d_{model} \times d_{key})  = (sequence\_length \times d_{key})$

    $VW_{i}^{V} : (sequence\_length \times d_{model})  \times (d_{model} \times d_{value}) = (sequence\_length \times d_{value})$


    $QW_{i}^{Q}(KW_{i}^{K})^{T} : (sequence\_length \times d_{key}) \times (d_{key} \times sequence\_length) = (sequence\_length \times sequence\_length)$

    $\text{softmax}(\cdots)\,VW_{i}^{V} : (sequence\_length \times sequence\_length) \times (sequence\_length \times d_{value}) = (sequence\_length \times d_{value})$

    - So the final shape of attention for a single head comes out to be : $(sequence\_length \times d_{value})$

    - Since the heads are concatenated column-wise, $\text{Concat}(head_{1},head_{2},\cdots,head_{num\_heads})$ has the shape $(sequence\_length \times (num\_heads \times d_{value}))$ 
    - Note that it's still a $2D$ matrix, it's just that the number of columns increased $num\_heads$ times. 
    - Now here we have a smol 🤏 problem : 
        - The current shape is : $(sequence\_length \times (num\_heads \times d_{value}))$
        - But for further steps we need the matrix in $(sequence\_length \times  d_{model})$
        - How to convert ? 🤔💡
        - Simple, multiply it by a matrix whose dimensions are $((num\_heads \times d_{value})\times d_{model})$, which is nothing but the $W^{O}$ from the formula shown at the very beginning!
        - That's why  $\text{MULTIHEAD ATTENTION} = \text{Concat}(head_{1},head_{2},\cdots,head_{num\_heads}) \times W^{O}$
        - That is what makes $W^{O}$ important : it brings the output back to $d_{model}$ columns. 
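
The shape bookkeeping above can be checked with a small sketch that deliberately uses $d_{key} \neq d_{value}$. All sizes and the random weights are assumptions for illustration; `W_o` plays the role of the output matrix from the formula.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model, num_heads = 5, 12, 3   # hypothetical sizes
d_key, d_value = 4, 6                    # deliberately different from each other

X = torch.randn(seq_len, d_model)        # self-attention: Q = K = V = X

heads = []
for _ in range(num_heads):
    # Per-head weight matrices (random here, learned in practice).
    W_q = torch.randn(d_model, d_key)
    W_k = torch.randn(d_model, d_key)
    W_v = torch.randn(d_model, d_value)
    q, k, v = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(q @ k.T / math.sqrt(d_key), dim=-1)  # (seq_len, seq_len)
    heads.append(weights @ v)                                    # (seq_len, d_value)

concat = torch.cat(heads, dim=-1)                 # (seq_len, num_heads * d_value)
W_o = torch.randn(num_heads * d_value, d_model)   # maps back to d_model columns
out = concat @ W_o                                # (seq_len, d_model)

print(tuple(concat.shape), tuple(out.shape))  # (5, 18) (5, 12)
```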


----

### Some noteworthy implementation details : 

- We're not gonna make num_heads number of different matrices man 😩! 

- That's not handy to maintain and would need some different logic for training 💀. I know I can't even implement a simple backpropagation 😔, so implementing it for a custom system that I've got no clue about isn't very convincing 🤐.   
- What we're gonna do instead is something that we consider a smart move, but it's really not, cause we're not as smart as we think we are 😔💡: 
     - Make a single matrix each for key, query and value, shared by all the attention heads. 
     - Fragment the resulting key, query and value matrices into the desired number of heads and then calculate the attention for each head.
     - After we're done calculating the attention of each head, we'll simply concatenate the results as we planned before. 
     - Something like this : 
    <div align="center">

    <img src="multihead_attention_implementation_for_key_qeury_value_matrices.jpeg">

    </div>

    - Now one question : Wouldn't this kill our very idea of multi-head attention 😨🥹?
    - Isn't this simply calculating attention once, and not once per head 😭? 

    - **Actually not!**
    - We're seeing this in the wrong way!
    - We're just postponing a step for our convenience. 
    - Instead of keeping a separate key, query and value weight matrix for each head, we first calculate the combined key, query and value matrices and then break them down for calculating attention.  
```python
import numpy as np 
word_vector = np.array([[1,2,3,4],[5,6,7,8]])
key = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]
            ])
query = np.array([[1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4]
            ])
value = np.array([[1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4]
            ])

word_key = np.matmul(word_vector,key)
word_query = np.matmul(word_vector,query)
word_value = np.matmul(word_vector,value)

word_key_part_1 = word_key[:,0:2]
word_key_part_2 = word_key[:,2:4]

word_query_part_1 = word_query[:,0:2]
word_query_part_2 = word_query[:,2:4]

word_value_part_1 = word_value[:,0:2]
word_value_part_2 = word_value[:,2:4]


attention_part1 = np.matmul(np.matmul(word_query_part_1,word_key_part_1.transpose())/np.sqrt(2),word_value_part_1)
attention_part2 = np.matmul(np.matmul(word_query_part_2,word_key_part_2.transpose())/np.sqrt(2),word_value_part_2)

key1 = np.array([[1,2,],
                [1,2,],
                [1,2,],
                [1,2,]
            ])
query1 = np.array([[1,2,],
                  [1,2,],
                  [1,2,],
                  [1,2,]
            ])
value1 = np.array([[1,2,],
                  [1,2,],
                  [1,2,],
                  [1,2,]
            ])

key2 = np.array([[3,4,],
                 [3,4,],
                 [3,4,],
                 [3,4,]
            ])
query2 = np.array([[3,4,],
                   [3,4,],
                   [3,4,],
                   [3,4,]
            ])
value2 = np.array([[3,4,],
                   [3,4,],
                   [3,4,],
                   [3,4,]
            ])

word_key1 = np.matmul(word_vector,key1)
word_query1 = np.matmul(word_vector,query1)
word_value1 = np.matmul(word_vector,value1)
attention1 = np.matmul(np.matmul(word_query1,word_key1.transpose()/np.sqrt(2)),word_value1)


word_key2 = np.matmul(word_vector,key2)
word_query2 = np.matmul(word_vector,query2)
word_value2 = np.matmul(word_vector,value2)
attention2 = np.matmul(np.matmul(word_query2,word_key2.transpose()/np.sqrt(2)),word_value2)


print("attention1 = ", attention1)
print("attention2 = ", attention2)
print("attention_part1 = ", attention_part1)
print("attention_part2 = ", attention_part2)

# Outputs (aligned by hand):
# attention1 =      [[ 27435.74311004  54871.48622008]
#                    [ 71332.9320861  142665.8641722 ]]
# attention2 =      [[ 411536.14665057  548714.86220076]
#                    [1069993.98129148 1426658.64172198]]
# attention_part1 = [[ 27435.74311004  54871.48622008]
#                    [ 71332.9320861  142665.8641722 ]]
# attention_part2 = [[ 411536.14665057  548714.86220076]
#                    [1069993.98129148 1426658.64172198]]
```
- This is a small simulation of what we're doing here ( the softmax is left out to keep it short ). 
- In the first case ( `attention_part1` , `attention_part2` ), we multiply by the full key, query and value weight matrices without splitting them, and split the results into heads afterwards. 
- In the second case ( `attention1` , `attention2` ), we split the key, query and value weight matrices first and then calculate the attention for each head. 
- In both cases the corresponding attention values come out the same, thus proving our point 😎. 
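
In PyTorch the manual slicing ( `[:,0:2]` , `[:,2:4]` ) is usually replaced by a `view` followed by a `transpose`. A minimal sketch, assuming the same toy sizes as the simulation, to check that both give the same per-head pieces:

```python
import torch

seq_len, d_model, h = 2, 4, 2    # same toy sizes as the NumPy simulation
d_key = d_model // h

# word_vector @ key from the simulation above, written out by hand.
word_key = torch.tensor([[10., 20., 30., 40.],
                         [26., 52., 78., 104.]])

# (seq_len, d_model) -> (seq_len, h, d_key) -> (h, seq_len, d_key)
heads = word_key.view(seq_len, h, d_key).transpose(0, 1)

# heads[i] should be exactly the i-th column slice of width d_key.
matches_slices = all(
    torch.equal(heads[i], word_key[:, i * d_key:(i + 1) * d_key])
    for i in range(h)
)
print(tuple(heads.shape), matches_slices)  # (2, 2, 2) True
```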


------
 

### A time for reality check ( yes, once again 😔)
- Above we said that we can use different dimensions for the key-query and value matrices, but that's not quite true if we look at it from the point of view of our implementation. 
- Here's why : 
    - Suppose we have $512$ dimensions for our word embeddings.
    - Now we decide to take up $8$ attention heads. 
    - $\implies d_{key} = \frac{512}{8} = 64$
    - Now there are $8$ key-query matrix pairs, one per head. 
    - What about $d_{value}$?
    - Suppose $d_{value} \neq 64$.
    - Since we cut a single $d_{model}$-wide value matrix into pieces of width $d_{value}$, we'd get $\frac{512}{d_{value}} \neq 8$ value matrices, which causes some problems : 
        - How will we handle the $8$ resulting key-query products ? 
        - Which value matrix do we multiply each product with, since we don't have as many value matrices as key-query products ? 
    - Hence, to keep our lives peaceful, we'll use $d_{key}=d_{value}=\frac{d_{model}}{h}=\frac{512}{8} = 64$  
    - Now one important question : What about the output matrix $W^{O}$ ( `w_o` in the code ) ? Will we discard it 😔? 
    - No, we'll keep it ( reason : TODO : FIND OUT ) 


-------------------


### ANOTHER CONCEPT : MASKING 🤡

- While performing the operations in the decoder, this is the general flow : 
- Suppose the sentence under consideration is : 

<div align="center">

**INPUT** 

|My | cat | is | a | beautiful | cat.| 😸 |
|---|-----|----|---|-----------|-----|----|

**OUTPUT**

|मेरी | बिल्ली | एक | सुंदर | बिल्ली | है | 😸 |
|----|------|----|------|------|---|-----|

</div>

- In the encoder it will be converted to a matrix of embeddings plus positional embeddings. 
- The input will essentially look like this before entering the decoder : 

<div align="center">

(say $d_{model}=5$ )

| My      |  $1$  |  $3$  |  $4$ |  $8$ |  $2$ |
|---------|-------|-------|------|------|------|
| cat     |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ | 
| is      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  a      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|beautiful|  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  cat    |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |

</div>

- Now there are 2 scenarios which can be considered : 
     - Training 
     - Inference ( basically running the trained model for use )

- Case 1 : **INFERENCE MODE**
     - Here, since each step takes the output of the previous decoder step as its input, there is no scope for parallelism. 
     - Something like this happens : 
     
     <div align="center">

     <img src="inference_in_transformer.jpg">

     </div>

     - So if we have say $100$ words, this takes $100s$, given that $1$ computation takes $1s$. 


- Case 2 : **TRAINING MODE** : 
     - Here the story is a bit different. 
     - Here something similar happens but with a twist : 
          - Take the same translation example as above. 
          - Suppose that instead of producing **मेरी** as the first token, the model produced **हमारी**. 
          <div align="center">

          <img src="traninig_mode_in_transformer.jpg"> 

          </div>

          - Now here, instead of passing **हमारी** to the next step of the decoder, we pass on the true value that should have come at that place, to train the model on the fact that we want **बिल्ली** after the token **मेरी**. 
          - So three questions arise : (1) Do we actually need to get the **previous state from the transformer** ? (2) Do we already know what the next state is gonna be ? (3) And if we know the next state, why even perform the calculation using the transformer ? 
          - The answer to (3) is **TRAINING🤡**. How the hell are we supposed to calculate the loss without comparing what the transformer predicts with what we want it to learn ? 
          - The answer to (2) is : Yes we do, from the training data itself. 
          - The answer to (1) is : No, we simply don't. Our sole purpose of performing this operation is to train the transformer.
          - Since we know what the next state is gonna be, can we parallelize this operation for a single sequence ? The answer is yes we can, and that's what we do. 
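
A tiny sketch of how the decoder inputs and the expected outputs line up during training. The token ids and the start-of-sequence id `SOS` are made up for illustration, and these notes don't fix a particular tokenization.

```python
# Hypothetical token ids for the target sentence; बिल्ली appears twice, so id 12 repeats.
SOS = 0                               # assumed start-of-sequence id
target_ids = [11, 12, 13, 14, 12, 15]

# The decoder is fed the TRUE previous tokens, shifted right by one position ...
decoder_input = [SOS] + target_ids[:-1]
# ... and at every position it is asked to predict the true next token.
labels = target_ids

pairs = list(zip(decoder_input, labels))
for fed, expected in pairs:
    print(f"fed {fed:>2} -> should predict {expected}")
```

Since every ( fed, expected ) pair is known before the model runs, all positions can be processed in a single parallel pass.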

### SO EVERYTHING LOOKS FINE TILL NOW ? 
- One wise man said, if you think it's going to end well, then you're not paying attention 💀.
- Here there's one issue while training : 
- When we send the embeddings into the decoder in parallel, what we're essentially doing is: 
      - Suppose we send the sentence : 
     <div align="center">

     |My | cat | is | a | beautiful | cat.| 😸 |
     |---|-----|----|---|-----------|-----|----|

     </div>

- Here, when we send the word **My**, we're essentially also sending the information about what's going to come next in our sentence. 
- How so ? 
- Like this : 
- We sent this input to the decoder : 

<div align="center">

(say $d_{model}=5$ )

| My      |  $1$  |  $3$  |  $4$ |  $8$ |  $2$ |
|---------|-------|-------|------|------|------|
| cat     |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ | 
| is      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  a      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|beautiful|  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  cat    |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
           
</div>

- We sent the contextual embeddings ! 
- These embeddings contain information about the words that came before and **AFTER** the current word. 
- But how ? 
- Like this : 
- What we learned before was that we need to express the embedding of the current word in terms of its own meaning and the words around it. 
<div align="center">

Say we're considering the word **beautiful**:

$e_{beautiful} = \alpha_{1}\cdot e_{my} + \alpha_{2}\cdot e_{cat} + \alpha_{3}\cdot e_{is} + \alpha_{4}\cdot e_{a} + \alpha_{5}\cdot e_{beautiful} + \alpha_{6}\cdot e_{cat} $
</div>

- Up to $\alpha_{5}\cdot e_{beautiful}$ everything was okay, because up to that point we only had the context of the words that came before. But the moment the last term $\alpha_{6}\cdot e_{cat}$ came into the picture, it gave the transformer the hint that the next word is cat, which shouldn't happen. 
- Ideally we want to **HIDE** the information that the word **cat** is going to come up, because otherwise the model will fail to learn. 
- So what we want is that, while processing the inputs, the coefficients of the words that come after the current word are kept at zero. 
- That means : 
      - If our current word is cat ( the first occurrence ) then : 
      $\alpha_{3} = \alpha_{4} =  \alpha_{5} =  \alpha_{6} = 0$
- So ideally speaking, this should be done at the step of self attention calculation.
- There we calculated the coefficient matrix that gets multiplied with the value vectors. 
- The $softmax(\displaystyle\frac{QK^{T}}{\sqrt{d_{k}}})$ step, to be specific. 
- The output of this operation is actually the matrix of coefficients for the value matrix. 
- For better understanding, see this ( the numbers are placeholders, not actual softmax outputs ) : 
<div align="center">

|          |My        |cat       |is        |a         |beautiful|cat       |
|----------|----------|----------|----------|----------|---------|----------|
|My        |  1       |   5      |   6      |    6     |   3     |   5      |              
|cat       |  2       |   5      |   6      |    6     |   3     |   5      |              
|is        |  3       |   5      |   6      |    6     |   3     |   5      |    
|a         |  4       |   5      |   6      |    6     |   3     |   5      |    
|beautiful |  5       |   5      |   6      |    6     |   3     |   5      |         
|cat       |  5       |   5      |   6      |    6     |   3     |   5      |              

</div>

- Here, the values in the first row are the coefficients that are to be multiplied with the value vectors of the corresponding words. 
- What we would like to have instead is something like this : 

<div align="center">

|          |My        |cat       |is        |a         |beautiful|cat       |
|----------|----------|----------|----------|----------|---------|----------|
|My        |  1       |   0      |   0      |    0     |   0     |   0      |              
|cat       |  2       |   5      |   0      |    0     |   0     |   0      |              
|is        |  3       |   5      |   6      |    0     |   0     |   0      |    
|a         |  4       |   5      |   6      |    6     |   0     |   0      |    
|beautiful |  5       |   5      |   6      |    6     |   3     |   0      |         
|cat       |  5       |   5      |   6      |    6     |   3     |   5      |              

</div>

- So to achieve this what we do is : 

<div align="center">

<img src="masking_demo.jpg">

</div>

- Take a mask matrix that is $-\infty$ at the future positions and $0$ elsewhere, and add it to the scaled matrix before performing softmax. 
- Since $e^{-\infty}=0$, the softmax gives those positions a weight of $0$.
- This makes the coefficients of the future terms $0$.
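
Here is a minimal sketch of that masking step, assuming a random `scores` matrix as a stand-in for $\frac{QK^{T}}{\sqrt{d_{k}}}$ and a sequence length of $6$. The future positions are the ones above the diagonal.

```python
import torch

torch.manual_seed(0)

seq_len = 6
scores = torch.randn(seq_len, seq_len)   # stand-in for QK^T / sqrt(d_k)

# True above the diagonal = positions that come AFTER the current word.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Additive mask: -inf at the future positions, 0 everywhere else.
mask = torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))

weights = torch.softmax(scores + mask, dim=-1)

future_is_zero = bool(torch.all(weights[future] == 0))
rows_sum_to_one = torch.allclose(weights.sum(dim=-1), torch.ones(seq_len), atol=1e-5)
print(future_is_zero, rows_sum_to_one)  # True True
print(weights[0])  # first word can only attend to itself: 1 followed by zeros
```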
----------


### ADD AND NORM 



### Views and contiguous in pytorch : [Reference Article](https://medium.com/analytics-vidhya/pytorch-contiguous-vs-non-contiguous-tensor-view-understanding-view-reshape-73e10cdfa0dd)


- The `.view` function in PyTorch does not do much apart from returning an alternate way of **viewing** the same chunk of data.

- A view is nothing but an alternative way to interpret the original tensor's dimensions without making a physical copy in memory.

- This means that any change made through the view will be reflected in the original tensor as well, since the view reads its data from the same memory address as the original tensor.

- The same holds the other way round : when the original tensor is modified, the view also changes, for the very same reason. 

- Now, certain operations return a view and certain operations don't. 
- #TODO 
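
A small sketch of the points above, on a toy tensor: a view shares memory with the original, a transpose gives a non-contiguous view, and `.view` on a non-contiguous tensor fails until `.contiguous()` makes a copy.

```python
import torch

base = torch.arange(6)        # tensor([0, 1, 2, 3, 4, 5])
v = base.view(2, 3)           # same memory, different shape

# Writing through the view changes the original tensor too.
v[0, 0] = 100
shares_memory = base[0].item() == 100 and base.data_ptr() == v.data_ptr()
print(shares_memory)  # True

# Transposing also returns a view, but one that is no longer contiguous.
t = v.t()
transposed_contiguous = t.is_contiguous()
print(transposed_contiguous)  # False

# .view needs a compatible memory layout, so it fails on the transposed tensor ...
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)  # True

# ... while .contiguous() copies the data into a fresh layout that can be viewed.
flat = t.contiguous().view(6)
print(flat.tolist())  # [100, 3, 1, 4, 2, 5]
```

This is the reason for the `.transpose(1,2).contiguous().view(...)` chain in the multi-head attention code.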


In [None]:
from torch import nn
import torch 
class MultiHeadAttention(nn.Module):
    def __init__(self,d_model:int = 512,h:int = 8, dropout:float = 0.1):
        super().__init__()
        self.d_model = d_model  # expected input dimesions in each word
        
        self.h  = h   # number of heads 
        assert d_model%h == 0 ,"Expected values of d_model and h such that d_model is divisible by h"
        self.d_key = d_model//h
        self.w_query = nn.Linear(d_model,d_model) # weight matrix for query 
        self.w_key = nn.Linear(d_model,d_model) # weight matrix for key 
        
        
        # Note that here we're  going for d_model X d_model and not for d_model X (h*d_value)
        # Since we're given an input matrix of d_model.  
        self.w_value = nn.Linear(d_model,d_model) # weight matrix for value 
        self.w_o = nn.Linear(d_model,d_model)
    
    
    @staticmethod # this is a static method and can be accessed outside the class using this: MultiHeadAttention.attention(input parameters)
    def attention(self,key:torch.Tensor,query:torch.Tensor,value:torch.Tensor,mask):
        key_query = torch.matmul(query,key.T)
        scaled = key_query/torch.sqrt(self.d_key)  
        # At this point , the shape of the matrix is : (Batch , h , seqeuence_length , seqeuence_length)
        if mask is not None:
            scaled.masked_fill(mask==0,-1e9)
            
        softmax_ = scaled.softmax(dim=-1)
        
        return torch.matmul(softmax_,value)
    
    
    
    def forward(self,query,key,value,mask):
        q_w = self.w_query(query) # query matrix for input sequence , Shape : (Batch , sequence_length , d_model) 
        k_w = self.w_key(key)     # key matrix for input sequence   , Shape : (Batch , sequence_length , d_model)
        v_w = self.w_value(value) # value matrix for input sequence , Shape : (Batch , sequence_length , d_model)
        batch,sequence_length,d_model = q_w.shape
        
        # Split each projected matrix into h heads: (Batch , sequence_length , d_model) -> (Batch , sequence_length , h , d_key)
        # The transpose then gives (Batch , h , sequence_length , d_key), so that each head can be indexed on its own. 
        q_head_wise = q_w.view(batch,sequence_length,self.h,self.d_key).transpose(1,2) 
        k_head_wise = k_w.view(batch,-1,self.h,self.d_key).transpose(1,2) 
        v_head_wise = v_w.view(batch,-1,self.h,self.d_key).transpose(1,2) 
        attention_score = MultiHeadAttention.attention(self,k_head_wise,q_head_wise,v_head_wise,mask)
        
        #TODO : Study contiguous memory allocation in detail. 
        
        attention_score = attention_score.transpose(1,2).contiguous().view(batch,sequence_length,self.h*self.d_key)
        # Transpose back to (Batch , sequence_length , h , d_key) and merge the heads into a single d_model axis. 
        
        return self.w_o(attention_score)
    
    
class LayerNormalization(nn.Module):
    def __init__(self,epsillon: float = 1e-8):
        super().__init__()
        self.epsillon = epsillon
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scale, starts at 1
        self.beta  = nn.Parameter(torch.zeros(1)) # learnable shift, starts at 0
    def forward(self,x:torch.Tensor):
        mean = x.mean(dim = -1,keepdim=True) # x ---> (batch,row,col)
        std  = x.std(dim = -1,keepdim=True) 
        normalized = self.alpha * (x-mean)/(std + self.epsillon) + self.beta
        return normalized





class ResidualConnection(nn.Module):
    def __init__(self, self_attention_block:MultiHeadAttention,feed_forward_block : ):
        super().__init__()

### Residual Connection : TODO😭
- They wrap each sub-layer of the Transformer (self-attention and feed-forward): the sub-layer's input is added back to its output. 
- Study [this paper](https://arxiv.org/pdf/1603.05027) for a better understanding of why residual connections work well for large and deep models!
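
Since this block is still a TODO, here is a minimal sketch of what a residual wrapper could look like. It is not the final `ResidualConnection` class: the name `ResidualSketch`, the `dropout` argument, and the choice to normalize *before* the sub-layer (pre-norm) are all assumptions made for illustration. The original Transformer paper normalizes after the addition instead.

```python
import torch
from torch import nn


class ResidualSketch(nn.Module):
    """Hypothetical wrapper that returns x + dropout(sublayer(norm(x)))."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # built-in layer norm, used here for brevity
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # The skip path carries x through untouched; only the sublayer branch is transformed.
        return x + self.dropout(sublayer(self.norm(x)))


torch.manual_seed(0)
d_model = 8
block = ResidualSketch(d_model, dropout=0.0)   # dropout off so the result is deterministic
feed_forward = nn.Linear(d_model, d_model)     # stand-in for any sub-layer
x = torch.randn(2, 5, d_model)                 # (batch, sequence_length, d_model)
out = block(x, feed_forward)
print(out.shape)
```

The output keeps the input shape, which is what lets these wrappers be stacked many times. If the sub-layer outputs zeros, the block reduces to the identity, which is the intuition behind why the skip path helps deep stacks train.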

In [6]:
import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
# print("x.shape = ",x.shape)
# transpose = x.transpose(0,1)
# print("transpose.shape = ",transpose.shape)
# print(transpose)
print(x)
print("----------------------")
print(x.mean(dim=-1,keepdim=True))
print("----------------------")
print(x.mean(dim=-1))


tensor([[[  1.,   2.,   3.,   4.,   5.],
         [  1.,   2.,   3.,   4.,   5.],
         [  1.,   2.,   3.,   4.,   5.],
         [  1.,   2.,   3.,   4.,   5.]],

        [[ 10.,  20.,  30.,  40.,  50.],
         [ 10.,  20.,  30.,  40.,  50.],
         [ 10.,  20.,  30.,  40.,  50.],
         [ 10.,  20.,  30.,  40.,  50.]],

        [[100., 200., 300., 400., 500.],
         [100., 200., 300., 400., 500.],
         [100., 200., 300., 400., 500.],
         [100., 200., 300., 400., 500.]]])
----------------------
tensor([[[  3.],
         [  3.],
         [  3.],
         [  3.]],

        [[ 30.],
         [ 30.],
         [ 30.],
         [ 30.]],

        [[300.],
         [300.],
         [300.],
         [300.]]])
----------------------
tensor([[  3.,   3.,   3.,   3.],
        [ 30.,  30.,  30.,  30.],
        [300., 300., 300., 300.]])
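
The two outputs above differ only in shape: `keepdim=True` keeps a trailing axis of size 1. A small sketch of why that matters for layer normalization, using a made-up tensor of the same shape `(3, 4, 5)`: the kept axis lets the mean broadcast against `x`, while the squeezed version does not line up.

```python
import torch

x = torch.arange(60, dtype=torch.float32).reshape(3, 4, 5)

mean_keep = x.mean(dim=-1, keepdim=True)   # shape (3, 4, 1)
mean_drop = x.mean(dim=-1)                 # shape (3, 4)

# (3, 4, 5) - (3, 4, 1): the size-1 axis is broadcast across the 5 columns.
centered = x - mean_keep
print(centered.shape)

# (3, 4, 5) - (3, 4): the trailing sizes 5 and 4 do not match, so this fails.
try:
    x - mean_drop
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print("subtraction without keepdim failed:", broadcast_failed)
```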


In [50]:
import torch
from torch import nn
torch.manual_seed(42)
batch, sentence_length, embedding_dim = 1, 5, 4
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
print(embedding)
layer_norm_output = layer_norm(embedding)
print(layer_norm_output)
mean = embedding.sum(dim = 2)/embedding.shape[2]
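std = embedding.std(dim=2) # unbiased estimate (divides by N-1); nn.LayerNorm normalizes with the biased variance (divides by N)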
std


tensor([[[ 1.9269,  1.4873,  0.9007, -2.1055],
         [-0.7581,  1.0783,  0.8008,  1.6806],
         [ 0.3559, -0.6866, -0.4934,  0.2415],
         [-0.2316,  0.0418, -0.2516,  0.8599],
         [-0.3097, -0.3957,  0.8034, -0.6216]]])
tensor([[[ 0.8716,  0.5928,  0.2209, -1.6853],
         [-1.6203,  0.4198,  0.1115,  1.0889],
         [ 1.1111, -1.1985, -0.7703,  0.8577],
         [-0.7452, -0.1393, -0.7894,  1.6739],
         [-0.3243, -0.4803,  1.6947, -0.8900]]],
       grad_fn=<NativeLayerNormBackward0>)


tensor([[1.8211, 1.0394, 0.5212, 0.5210, 0.6366]])

In [49]:
list(layer_norm.named_parameters())

[('weight',
  Parameter containing:
  tensor([1., 1., 1., 1.], requires_grad=True)),
 ('bias',
  Parameter containing:
  tensor([0., 0., 0., 0.], requires_grad=True))]
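
To connect the `nn.LayerNorm` cell above with a hand-written version, the sketch below redoes the normalization manually on the same seeded tensor. It assumes the default `nn.LayerNorm` settings (weight 1, bias 0, as listed by `named_parameters()`). The variable names `manual_biased` and `manual_unbiased` are made up for this example.

```python
import torch
from torch import nn

torch.manual_seed(42)
embedding = torch.randn(1, 5, 4)           # (batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(4)
reference = layer_norm(embedding)

mean = embedding.mean(dim=-1, keepdim=True)

# Biased variance (divide by N), with eps added inside the square root.
var_biased = embedding.var(dim=-1, unbiased=False, keepdim=True)
manual_biased = (embedding - mean) / torch.sqrt(var_biased + layer_norm.eps)

# Unbiased std (divide by N-1), which is what Tensor.std returns by default.
std_unbiased = embedding.std(dim=-1, keepdim=True)
manual_unbiased = (embedding - mean) / (std_unbiased + layer_norm.eps)

matches_biased = torch.allclose(manual_biased, reference, atol=1e-5)
matches_unbiased = torch.allclose(manual_unbiased, reference, atol=1e-5)
print("biased variance matches nn.LayerNorm:", matches_biased)
print("unbiased std matches nn.LayerNorm:", matches_unbiased)
```

Only the biased version should agree with `nn.LayerNorm`. A custom layer built on `x.std(...)` will therefore give slightly smaller values, by a factor of $\sqrt{(N-1)/N}$ along the normalized axis.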