### Defining the input embeddings class
- Takes 2 inputs:
    - the number of words in the vocabulary, i.e. `vocab_size`
    - the dimension of each word embedding, i.e. `d_model`
- These are used to create an embedding object from the `torch.nn` library.
- See the documentation of [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
- `torch.nn.Embedding(num_embeddings, embedding_dim)` : 
    - `num_embeddings` = total number of embeddings to store 
    - `embedding_dim`  = the dimension we want for each word embedding
- The forward method takes an input `x`, which is nothing but the index (or a tensor of indices) of the word whose embedding is required. 
- **NOTE** that the embedding we return is multiplied by ${\sqrt{d_{model}}}$. This is stated in section `3.4` of the paper [Attention is all you Need](https://arxiv.org/pdf/1706.03762)
##### TODO : Find out the reason for this step of multiplying by ${\sqrt{d_{model}}}$ 


In [2]:
import torch
import torch.nn as nn
from typing import List,OrderedDict,Dict,Tuple,Literal
from math import sqrt,pow,sin,cos
class InputEmbeddings(nn.Module):
    def __init__(self,d_model:int,vocab_size:int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # one d_model-dimensional vector per vocabulary entry
        self.embeddings = nn.Embedding(num_embeddings=self.vocab_size,
                                       embedding_dim=self.d_model)
    def forward(self,x):
        # x holds token indices; the looked-up embeddings are scaled by sqrt(d_model) (section 3.4)
        return self.embeddings(x)*sqrt(self.d_model)
    # TODO : Find the reason for multiplying with the sqrt(d_model)
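
The lookup-and-scale step can be checked in isolation. The following is a minimal sketch, assuming toy sizes (`vocab_size = 10`, `d_model = 4`) and made-up token ids that are not part of the notebook; it only shows the output shape and that the result is the raw lookup multiplied by $\sqrt{d_{model}}$.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 10, 4               # toy sizes, chosen only for illustration
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

token_ids = torch.tensor([[1, 5, 7]])     # hypothetical ids, shape (batch=1, seq_len=3)
raw = embedding(token_ids)                # shape (1, 3, d_model)
scaled = raw * math.sqrt(d_model)         # the scaling done in forward()

print(scaled.shape)                       # torch.Size([1, 3, 4])
```

Each index is replaced by a `d_model`-long row of the embedding table, so an input of shape `(batch, seq_len)` becomes `(batch, seq_len, d_model)`.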
        

### Positional Encoding:
<div align="center">

|Good |morning| to| all |of |you|
|----|----|----|----|----|----|
|120|125|156|2145|215|7653|
|20|1345|156|2145|215|7653|
|10|145|156|2145|215|7653|
|11|345|156|2145|215|7653|
|10|135|156|2145|215|7653|
|1|12|156|2145|215|7653|
|2|345|156|2145|215|7653|
|....|.....|....|....|....|....|
|....|.....|....|....|....|....|
|....|.....|....|....|....|....|
|....|.....|....|....|....|....|
|....|.....|....|....|....|....|

</div>

- Each word in the vocabulary has its own embedding vector.
- Each vector is `d_model` long. 
- We want to help the model get an idea of what has come before the current word. 
- Furthermore, we want each word to carry some information about its position and relative importance in the sentence.
- One way to do this is via **Positional Encoding**.
- We add another vector, the positional encoding vector, to the word embedding.
- ${\textbf{PE}(x,2y  )=\sin({\frac{x}{10000^{\frac{2y  }{d_{model}}}}}) \quad \forall\textbf{ embedding values at even positions in a word embedding}  }$ 
- ${\textbf{PE}(x,2y+1)=\cos({\frac{x}{10000^{\frac{2y  }{d_{model}}}}}) \quad	\forall\textbf{ embedding values at odd  positions in a word embedding}}$ 
- Here $x$ is the index of the word in the sentence, and $2y$ or $2y+1$ is the index inside the embedding vector, with $y$ running from $0$ to $\frac{d_{model}}{2}-1$. 


|Good |morning| to| all |of |you|
|----|----|----|----|----|----|
|${\textbf{PE(0,0)}}$|${\textbf{PE(1,0)}}$|${\textbf{PE(2,0)}}$|${\textbf{PE(3,0)}}$|${\textbf{PE(4,0)}}$|${\textbf{PE(5,0)}}$|
|${\textbf{PE(0,1)}}$|${\textbf{PE(1,1)}}$|${\textbf{PE(2,1)}}$|${\textbf{PE(3,1)}}$|${\textbf{PE(4,1)}}$|${\textbf{PE(5,1)}}$|
|${\textbf{PE(0,2)}}$|${\textbf{PE(1,2)}}$|${\textbf{PE(2,2)}}$|${\textbf{PE(3,2)}}$|${\textbf{PE(4,2)}}$|${\textbf{PE(5,2)}}$|
|${\textbf{PE(0,3)}}$|${\textbf{PE(1,3)}}$|${\textbf{PE(2,3)}}$|${\textbf{PE(3,3)}}$|${\textbf{PE(4,3)}}$|${\textbf{PE(5,3)}}$|
|${\textbf{PE(0,4)}}$|${\textbf{PE(1,4)}}$|${\textbf{PE(2,4)}}$|${\textbf{PE(3,4)}}$|${\textbf{PE(4,4)}}$|${\textbf{PE(5,4)}}$|
|....|.....|....|....|....|....|
|....|.....|....|....|....|....|
|upto ${\textbf{PE}(0,d_{model}-1)}$ |upto ${\textbf{PE}(1,d_{model}-1)}$|upto ${\textbf{PE}(2,d_{model}-1)}$|....|....|....|
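
The table above can be filled in numerically. The following is a minimal sketch, assuming the base $10000$ used in section `3.5` of the paper, an even `d_model`, and a hypothetical helper name `sinusoidal_table` that does not appear elsewhere in this notebook.

```python
import math
import torch

def sinusoidal_table(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table: sin on even indices, cos on odd indices."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # x, shape (seq_len, 1)
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)          # the values 2y
    div = torch.pow(torch.tensor(10000.0), even_idx / d_model)           # 10000^(2y/d_model)
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(positions / div)   # PE(x, 2y)
    table[:, 1::2] = torch.cos(positions / div)   # PE(x, 2y+1)
    return table

pe = sinusoidal_table(seq_len=6, d_model=8)       # 6 words, toy d_model
print(pe.shape)                                   # torch.Size([6, 8])
print(pe[0])                                      # position 0: sin(0)=0 on even, cos(0)=1 on odd indices
```

Row `x` of `pe` is the column for word `x` in the table, i.e. the values $\textbf{PE}(x,0)$ up to $\textbf{PE}(x,d_{model}-1)$.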


- This formula was given in the section `3.5` of the paper : [Attention is all you need](https://arxiv.org/pdf/1706.03762)

- A really nice article explaining why we need this very formula for positional encodings : [Transformer Architecture: The Positional Encoding](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)

- A really nice video for the same : [Visual Guide to Transformer Neural Networks - (Episode 1) Position Embeddings](https://www.youtube.com/watch?v=dichIcUZfOw)


##### Personal understanding: 
- Transformers are encoder-decoder stacks. 
- On their own they have no notion of the position or order of the words they receive.  
- Hence we need to incorporate this understanding by some means. 
- One possible way is to add some clues or pieces of information to the inputs themselves.
- This information is mainly about the position of the word in the sequence.
- So we now need to figure out a method or trick with which we can pass the positional information of a word to the model. 

-----
### Idea 1: 
One possible solution would be to assign numbers index wise like:

<div align="center">




|I|love|to|pet|cats|
|----|----|----|----|----|
|$1$| $2$  | $3$|$4$  | $5$   |

</div>

The **issue** here is that in longer sentences the words that come later get larger and larger values, so they carry a higher weightage, which is not a good direction for generalizing the model to learn sequential information. 

### Idea 2: 
Normalize the positions so that they always lie between $0$ and $1$, whatever the length of the sentence. Something like: 

<div align="center">


| I    | love |  to  | play |
| ---- | ---- | ---- | ---- |
| $0.00$ | $0.33$ | $0.66$ | $1.00$ |

</div>

- We had to assign $4$ numbers between $0$ and $1$. The $0^{th}$ word gets $0$, and the remaining $3$ words split the interval up to $1$ into $3$ pieces, each worth $0.33$. Therefore
<div align="center">



$I \rightarrow 0.00$  
$love\rightarrow 0.33$  
$to\rightarrow0.66$  
$play\rightarrow 1.00$ 

</div>

<div align="center">

| I    | love |  to  | play | it   |
| ---- | ---- | ---- | ---- | ---- |
| $0.00$ | $0.25$ | $0.50$ | $0.75$ | $1.00$ |

</div>

- Extending the same logic to $7$ words ($7-1 = 6$ pieces): 

<div align="center">

| I    | love |  to  | play | it   | very | much |  
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |  
| $0.00$ | $0.16$ | $0.33$ | $0.50$ | $0.66$ | $0.83$ | $1.00$ | 

</div>

The **issue** here is that the meaning conveyed by a value varies drastically as we change the length of the sentence.
- Suppose we have a $5$-word sentence and a $10$-word sentence. In the $5$-word sentence neighbouring words are $0.25$ apart, so a difference of $0.5$ means a gap of $2$ words, whereas in the $10$-word sentence neighbouring words are $\frac{1}{9} \approx 0.11$ apart, so the same difference of $0.5$ means a gap of about $4.5$ words. Therefore the meaning conveyed by the values is highly varied. 

- In a $5$-word sentence, the value $0.5$ is the positional value of the $3^{rd}$ word in the sequence. But in a $10$-word sentence the same $0.5$ falls between the $5^{th}$ and $6^{th}$ words. 
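
The dependence on sentence length can be checked with a few lines of arithmetic. This is a small sketch; the helper name `normalized_positions` is made up for illustration.

```python
def normalized_positions(n_words: int) -> list:
    """Idea 2: spread n_words positions evenly over [0, 1]."""
    return [i / (n_words - 1) for i in range(n_words)]

for n in (5, 10):
    pos = normalized_positions(n)
    step = pos[1] - pos[0]                      # distance between neighbouring words
    print(f"{n} words: step = {step:.3f}, a gap of 0.5 spans {0.5 / step:.1f} words")
# 5 words: step = 0.250, a gap of 0.5 spans 2.0 words
# 10 words: step = 0.111, a gap of 0.5 spans 4.5 words
```

The same numeric difference corresponds to a different number of words in each sentence, which is exactly the inconsistency described above.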

Since our first $2$ ideas failed, we are now clear about what we want:
- It should output a unique encoding for each position in the sentence. 
- The distance between any $2$ positions should be consistent across sentences, even ones of different lengths. 
- Our model should be able to learn the positional significance of the values without much difficulty and in a generalizable way.

------
A point to **note** here is that we are using a **vector to represent a word rather than a single number.** Hence we need to devise a mechanism that induces a **sense of position in the vector that represents a word in a sequence**. 
Furthermore, this will **not become part of the model's learnable parameters**; rather, it will help the model tune its parameters. Hence we're **improving the input's interpretability for the model** so that it learns better.

------
So we need to define a function that takes a **vector index** and a **position** as input, and returns the **value that will be stored at that vector index**. 
The inputs are therefore: 
- ***position $t$*** : a number that signifies the location of the word in the sentence. 
- ***vector index $i$*** : a number that signifies a specific dimension in the embedding vector, ranging over $[0,d_{model}-1]$ (considering $0$ based indexing)

Let $\overrightarrow{p^{t}} \in \mathbb{R}^{d_{model}}$ be the final output vector for the position $t$ in the sentence. 


<div align="center">


$\implies\overrightarrow{p^{t}_{i}} = f(t)^{(i)} = \begin{Bmatrix}
\sin{(\omega_{k} t)} \textbf{  }\textbf{ if }\textbf{  } i = 2k; \\
\\
\cos{(\omega_{k} t)} \textbf{  }\textbf{ if }\textbf{  } i = 2k+1
\end{Bmatrix}
\newline
\textbf{where : } \omega_{k} = \frac{1}{10000^{\frac{2k}{d_{model}}}}
\newline
\newline 
\implies\overrightarrow{p^{t}_{i}} = f(t)^{(i)} = \begin{Bmatrix}
\sin{(\frac{t}{10000^{\frac{i}{d_{model}}}})} \textbf{  }\textbf{ if }\textbf{  } i = 2k; \\
\\
\cos{(\frac{t}{10000^{\frac{i-1}{d_{model}}}})} \textbf{  }\textbf{ if }\textbf{  } i = 2k+1
\end{Bmatrix}
\newline 
\textbf{ final vector will look like : }\newline \begin{pmatrix}
\sin{(\omega_{0} t)}  \textbf{ for } i = 0 \\
\cos{(\omega_{0} t)}  \textbf{ for } i = 1 \\
\sin{(\omega_{1} t)}  \textbf{ for } i = 2 \\
\cos{(\omega_{1} t)}  \textbf{ for } i = 3 \\
\sin{(\omega_{2} t)}  \textbf{ for } i = 4 \\
\cos{(\omega_{2} t)}  \textbf{ for } i = 5 \\
\sin{(\omega_{3} t)}  \textbf{ for } i = 6 \\
\cos{(\omega_{3} t)}  \textbf{ for } i = 7 \\
\vdots \\ 
\sin{(\omega_{\frac{d_{model}}{2}-1} t)}  \textbf{ for } i = d_{model}-2 \\
\cos{(\omega_{\frac{d_{model}}{2}-1} t)}  \textbf{ for } i = d_{model}-1 
\end{pmatrix} \textbf{which is also :}  \begin{pmatrix}
\sin{(\frac{t}{10000^{\frac{0}{d_{model}}}})}  \textbf{ for } i = 0  \\
\cos{(\frac{t}{10000^{\frac{0}{d_{model}}}})}  \textbf{ for } i = 1  \\
\sin{(\frac{t}{10000^{\frac{2}{d_{model}}}})}  \textbf{ for } i = 2  \\
\cos{(\frac{t}{10000^{\frac{2}{d_{model}}}})}  \textbf{ for } i = 3  \\
\sin{(\frac{t}{10000^{\frac{4}{d_{model}}}})}  \textbf{ for } i = 4  \\
\cos{(\frac{t}{10000^{\frac{4}{d_{model}}}})}  \textbf{ for } i = 5  \\
\sin{(\frac{t}{10000^{\frac{6}{d_{model}}}})}  \textbf{ for } i = 6  \\
\cos{(\frac{t}{10000^{\frac{6}{d_{model}}}})}  \textbf{ for } i = 7  \\
\vdots \\ 
\sin{(\frac{t}{10000^{\frac{d_{model}-2}{d_{model}}}})}  \textbf{ for } i = d_{model}-2 \\
\cos{(\frac{t}{10000^{\frac{d_{model}-2}{d_{model}}}})}  \textbf{ for } i = d_{model}-1 
\end{pmatrix}$

</div>

Since $d_{model}$ is fixed for a given model, the only thing that changes along the vector is $i$:
as $i$ increases ($k$ increases), the frequency $\omega_{k}$ decreases, so the later dimensions oscillate more slowly with the position $t$.
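
A quick numeric check of this trend, as a sketch assuming a toy `d_model = 8` and the base $10000$ from section `3.5` of the paper:

```python
d_model = 8  # toy size, for illustration only
omegas = [1.0 / (10000 ** (2 * k / d_model)) for k in range(d_model // 2)]

for k, w in enumerate(omegas):
    # each k covers the index pair (2k, 2k+1) of the positional vector
    print(f"k={k}: omega_k = {w:.4f}")
# k=0: omega_k = 1.0000
# k=1: omega_k = 0.1000
# k=2: omega_k = 0.0100
# k=3: omega_k = 0.0010
```

There are only $\frac{d_{model}}{2}$ distinct frequencies, since each one is shared by a sine and a cosine entry.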
### TODO : I'll dig deeper into this later. 











-----























#### Positional Encoding Class info :
- **d_model** : number of dimensions we need in a word embedding.
- **sequence length** : maximum length of the input sequence ( basically the number of words in the input sentence )
- **p_drop** : the dropout probability. Dropout is applied to the sum of the embeddings and the positional encodings, as explained in section `5.4` of the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762) . 
- **Note** that the shape of the position embeddings is $({\textbf{sequence length} \times \textbf{d\_{model}}}) $
- The formula is implemented in the same way as described above. 
-----
```python
        positional_embeddings = positional_embeddings.unsqueeze(0)
```
- Reason for doing this : it adds an extra leading dimension to the positional embeddings so that they can be broadcast over a batch. The new shape of `positional_embeddings` is  $ (1 \times \textbf{sequence length} \times \textbf{d\_{model}}) $
------
```python
        self.register_buffer(name='positional_embeddings',tensor=positional_embeddings)
```
- This line is used when we want a tensor to be included in the `state_dict` of the model even though we don't intend to train it. Note that the name must not already exist as an attribute of the module, so the tensor is passed in directly instead of being assigned to `self` first.
- Why is it needed? In real-life cases where d_model is $512$ and the sequence length becomes $>1000$, recomputing this matrix is costly and time-consuming. So it's better to keep it precomputed somewhere. 
- **Note**: You can read the stored buffers separately using the following code : 
```python
model = MyCustomModel() # suppose you have your own model class where you defined your buffer
for buffer_name, buffer in model.named_buffers():
    print(f"Buffer '{buffer_name}': {buffer}")
```

- Reference Documentation : 

    - [torch.nn.Module.register_buffer](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer)
    - [torch.nn.Module.get_buffer](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.get_buffer) $\rightarrow$ the docs give no example code for this one, so try it out and test. 
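
To see the difference between a parameter and a buffer, here is a minimal sketch. `BufferDemo` and the names `weight` and `table` are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):  # hypothetical module, not part of the Transformer code
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(2))           # trainable
        self.register_buffer("table", torch.ones(1, 3, 2))   # saved with the model, not trained

demo = BufferDemo()
print([name for name, _ in demo.named_parameters()])  # ['weight']
print([name for name, _ in demo.named_buffers()])     # ['table']
print(list(demo.state_dict().keys()))                 # ['weight', 'table']
print(demo.get_buffer("table").requires_grad)         # False
```

The buffer shows up in `state_dict()` but not in `named_parameters()`, so an optimizer built from `demo.parameters()` never updates it.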
-----
```python
def forward(self,sequence:torch.Tensor):
    if sequence.ndim==2:
        sequence =sequence.unsqueeze(0)
    assert sequence.shape[2] == self.d_model,f"""The embedding dimensions of model and input dont match model's dimensions : {self.d_model} , input dimensions : {sequence.shape[2]}"""
    sequence = sequence + (self.positional_embeddings[:,:sequence.shape[1],:]).requires_grad_(False)
    return self.dropout(sequence)
```
- Here, the input is the sequence of word embeddings, but for the positional part we're only concerned with the shape of the input. The reason is that the positional values depend only on the positions and not on the actual words. 

- The assertion that `sequence.shape[2]` equals `d_model` is necessary because both denote the embedding dimension, and the addition only works if they match.  

- Slicing with `sequence.shape[1]` means we take only as many positions as the input sequence is long. 

- `self.positional_embeddings[:,:sequence.shape[1],:].requires_grad_(False)`: calling `requires_grad_(False)` makes it explicit that we aren't supposed to learn these positional embeddings. A tensor stored with `register_buffer` is not a parameter anyway, so this acts as a safeguard.


- Furthermore, we expect the inputs to be in batched form, i.e. they should have an extra leading dimension. To be precise, the following dimensions : `[n_batches,sequence_length,d_model]`. 
- If we don't get the input in this format, then we bring it into this format using the following code : 
```python 
if sequence.ndim==2:
    sequence =sequence.unsqueeze(0)
```
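
The shapes involved can be checked with dummy tensors. This is a sketch with toy sizes; `positional` is a stand-in filled with arbitrary numbers, not the real sinusoidal table.

```python
import torch

batch, seq_len, d_model, max_len = 2, 3, 4, 10    # toy sizes, for illustration only
embeddings = torch.zeros(batch, seq_len, d_model)
# stand-in for the precomputed table of shape (1, max_len, d_model)
positional = torch.arange(max_len * d_model, dtype=torch.float32).reshape(1, max_len, d_model)

out = embeddings + positional[:, :seq_len, :]     # (1, 3, 4) broadcasts over the batch of 2
print(out.shape)                                  # torch.Size([2, 3, 4])
print(torch.equal(out[0], out[1]))                # True: every batch entry gets the same encodings

single = torch.zeros(seq_len, d_model)            # an unbatched input
print(single.unsqueeze(0).shape)                  # torch.Size([1, 3, 4])
```

Because the leading dimension of the table is $1$, the same positional values are added to every sentence in the batch without copying the table.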

- **Note** that we add the values of the positional encodings to the input sequence, i.e the input to the model is :

<div align="center">

$\newline \phi(x) + PE(x) \newline \textbf{ where } \phi(x) \textbf{ is the word embedding matrix of the sentence x}\newline \textbf{ where }\phi(x) \textbf{ looks like : } $



|    | $d_{0}$   | $d_{1}$   | $d_{2}$|$d_{3}$| $d_{4}$ |$d_{5}$ |$d_{6}$ |$\cdots$|$d_{d_{model}-1}$ |
|----|----       |----       |----    |----   |----     |----    |----    |----    |----              |
|I   | 0.1       | 0.4       | 0.9    |0.6    |$\cdots$ |$\cdots$|$\cdots$|$\cdots$|$\cdots$          |
|love| 0.4       | 0.8       | 0.2    |0.7    |$\cdots$ |$\cdots$|$\cdots$|$\cdots$|$\cdots$          |
|cats| 0.1       | 0.3       | 0.1    |0.2    |$\cdots$ |$\cdots$|$\cdots$|$\cdots$|$\cdots$          |

</div>




In [3]:
class PositionalEncoding(nn.Module):
    def __init__(self,d_model:int,sequence_length:int,p_drop:float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.dropout = nn.Dropout(p = p_drop)  # use the dropout probability that was passed in
        # float32 so that adding to float32 word embeddings does not change their dtype
        positional_embeddings = torch.zeros(size=(self.sequence_length,self.d_model),dtype=torch.float32)
        for i in range(self.d_model):
            for j in range(self.sequence_length):
                if i%2==0:
                    # even index i = 2k : sin(t / 10000^(i/d_model))
                    denominator = pow(10000,i/self.d_model)
                    positional_embeddings[j][i] = sin(j/denominator)
                else:
                    # odd index i = 2k+1 : cos(t / 10000^((i-1)/d_model))
                    denominator = pow(10000,(i-1)/self.d_model)
                    positional_embeddings[j][i] = cos(j/denominator)
        # (sequence_length, d_model) -> (1, sequence_length, d_model) so it broadcasts over the batch
        positional_embeddings = positional_embeddings.unsqueeze(0)
        # register_buffer creates self.positional_embeddings; assigning the attribute first raises a KeyError
        self.register_buffer(name='positional_embeddings',tensor=positional_embeddings)


    def forward(self,sequence:torch.Tensor):
        if sequence.ndim==2:
            sequence =sequence.unsqueeze(0)
        assert sequence.shape[2] == self.d_model,f"""The embedding dimensions of model and input dont match model's dimensions : {self.d_model} , input dimensions : {sequence.shape[2]}"""
        sequence =  sequence + (self.positional_embeddings[:,:sequence.shape[1],:]).requires_grad_(False)
        return self.dropout(sequence)
# pe = PositionalEncoding(d_model=4,sequence_length=3,p_drop=0.3) # 3 word sequence with each word represented as a vector of 4 numbers


#### Feed Forward Neural Network
- In the paper, each encoder layer and each decoder layer has its own FFN sub-layer. This is discussed in section `3.3` of the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762)

- As per the paper, there are 3 steps, 2 of which are linear transformations :

    1. A linear layer with the output : $y_{1} = W_{1}(x) + b_{1}$
    2. A ReLU operation with the output : $y_{relu} = max(0,y_{1})$
    3. Another linear layer with the output : $y_{2} = W_{2}(y_{relu})+b_{2}$ 
- Final output : $y_{2} = W_{2}(max(0,W_{1}(x)+b_{1})) + b_{2}$
- The first layer has an input dimension of $512(d_{model})$ and an output dimension of $2048(d_{ff})$
- The second layer has an input dimension of $2048(d_{ff})$ and an output dimension of $512(d_{model})$
- One question : What about the sequence ? How is the sequence processed in the FFN ?
- Simply one-by-one ( or we can say token-by-token ) : the same two layers are applied to every position independently.
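
The token-by-token claim can be verified directly. The following is a sketch with toy sizes (`d_model = 8`, `d_ff = 32`) and random inputs; `layer1` and `layer2` are stand-ins for the two linear layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32                        # toy sizes; the paper uses 512 and 2048
layer1 = nn.Linear(d_model, d_ff)
layer2 = nn.Linear(d_ff, d_model)

x = torch.randn(2, 5, d_model)               # (num_batch, seq_len, d_model)
whole = layer2(torch.relu(layer1(x)))        # all tokens in one call

# the same computation, feeding one position at a time
one_by_one = torch.stack(
    [layer2(torch.relu(layer1(x[:, t, :]))) for t in range(x.shape[1])], dim=1
)

print(whole.shape)                                     # torch.Size([2, 5, 8])
print(torch.allclose(whole, one_by_one, atol=1e-5))    # True
```

`nn.Linear` acts only on the last dimension, so passing the whole `(num_batch, seq_len, d_model)` tensor gives the same result as processing each token separately, just faster.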

--- 
### Some detailed info about FFN Layer : 
- Input to `layer1` : $(1,1,d_{model})\rightarrow \textbf{ this is for a single word }$
- This means that for the whole sequence which is `seq_len` long, Input : $(1,seq\_len,d_{model})\rightarrow\textbf{this is for a single batch}$
- This means for a whole batch which has `num_batch` entries : $(num\_batch,seq\_len,d_{model})$
- Now comes the output of `layer1` : 
    - since `layer1` has $d_{ff}$ neurons, the output's shape will be : $(1,1,d_{ff})\rightarrow \textbf{ this is for a single word }$
    - This means that for the whole sequence which is `seq_len` long, Output : $(1,seq\_len,d_{ff})\rightarrow\textbf{this is for a single batch}$
    - This means for a whole batch which has `num_batch` entries : $(num\_batch,seq\_len,d_{ff})$

- The `ReLU` layer doesn't modify the shape of the input provided to it, so the shape passes through as it is. 
- `layer2` maps $d_{ff}$ back to $d_{model}$, so the final output has the shape $(num\_batch,seq\_len,d_{model})$.
 

In [4]:
class FeedForwardNeuralNetwork(nn.Module):
    def __init__(self,input_dim:int = 512, output_dim:int = 2048, p_dropout:float = 0.1):
        super().__init__()
        # input_dim is d_model, output_dim is d_ff
        self.linear_layer1 = nn.Linear(in_features=input_dim,out_features=output_dim)
        self.linear_layer2 = nn.Linear(in_features=output_dim,out_features=input_dim)
        self.dropout = nn.Dropout(p = p_dropout)
    def forward(self,x:torch.Tensor):
        # W2(max(0, W1(x) + b1)) + b2, with dropout applied after the ReLU
        return self.linear_layer2(self.dropout(torch.relu(self.linear_layer1(x))))

#### ATTENTION
- In the older sequence-to-sequence models, we had an encoder and a decoder. 
- We feed the data to the encoder in a sequential format, and the decoder then produces its output in the same sequential format. 
<div align="center">
<img src="encoder_decoder.gif" width="700" height="200">
</div>

- After getting the whole input sequence, the encoder converts the whole sequence into a vector $\overrightarrow{C}$ (that can be seen as the summary of the input text)

- This vector $\overrightarrow{C}$ is then passed into the decoder and the decoder converts this vector to probabilities of next possible tokens.

#### ISSUE HERE : 

- For a long input sentence, the encoder is not able to keep up with the context of the sentence, and the vector $\overrightarrow{C}$ that is supposed to carry the information to the decoder gets overloaded with information. 

- Suppose we want to translate this sentence: बत्तियाँ बंद कर दो
- The translation would be : Turn off the lights. 
<div align="center">

|Turn  | off| the | lights| 
|----- |----|----|----|
|बत्तियाँ |बंद  |कर  |दो  |
</div>

- Here, to produce the word `lights` we need to focus more on the word `बत्तियाँ` than on the whole sentence. 

- Similarly, to produce `Turn off` we need to focus more on `बंद कर दो`.

- The LSTM could do this translation properly if, at each step, it could look at the part of the sentence where `बत्तियाँ` or `बंद` appears instead of going through the entire sentence.

- It would be preferable for the vector representation $\overrightarrow{C}$ to be dynamic rather than static. 
- We would want the LSTM to be able to focus on a specific part of the sentence when generating the probability of the most likely token for a particular state. 
- In other words, we would want the LSTM to give more **Attention** to a specific portion of the sentence when generating the output for a particular state.

----

- Another point to note is that context in which the word was used is also important.

<div align="center">

|`Apple` makes great mobile phones.|
|-----|
|An `apple` a day keeps the doctor away.|

</div>

- The meaning of the word `apple` is different in the 2 sentences. 

- Therefore the words surrounding a word also affect its true meaning. 

- Thereby `attention` should not only focus on the main words that need to be translated or interpreted; it should also deliver the actual meaning of those words by considering their surroundings. 


- **One question** : can't we just build our word embeddings in such a way that the context in which a word is used can be read off its embedding alone? 

- The answer is : 
     - Word embeddings do capture the meaning of words, but in an **averaged-out format**, i.e. they try to capture all possible meanings of the word at once. 

     - Different dimensions of the embedding capture different aspects of a word's meaning, hence both the fruit sense and the brand sense of `apple` are present in its word embedding. It just so happens that we're looking for a more refined and specific meaning, which the word embedding does contain but in a **diluted** form. 

     - Word embeddings are created once but used countless times. **They are static**.
     Hence re-training them is not a good idea!

     - **ONE POINT TO NOTE** : The training data used to generate the word embeddings may use the word `apple` as a fruit more often than as a brand. Therefore, there may be an **inherent bias** towards reading the word `apple` as a fruit rather than as a brand. To address this we need the help of the surrounding words to evaluate the true usage of the word. 

For a better understanding of what we expect from the **attention** mechanism, consider the following example : 
<div align="center">

`Apple` launched a new mobile phone while I was eating an `apple`. 

</div>

- Here, in a single sentence, the word `apple` is used with $2$ different meanings.

- We want our **attention mechanism** to differentiate between the $2$ types of apples being talked about.

- Based on the text surrounding the $1^{st}$ `apple`, the **technology/brand** aspect of the word should be given more weightage, whereas for the $2^{nd}$ `apple` the **fruit/edible** aspect of the word should be given more weightage.

-----

#### Self Attention 
- Suppose I have a sentence : 

<div align="center">

`Apple` launched a new mobile phone. 

</div>

- Here the word `Apple` points us to the context of a brand. 

- Since this conclusion is based on the words surrounding the word `Apple`, we must devise some mechanism to represent the word embedding of `Apple` as a weighted combination of the surrounding words.

<div align="center">


`Apple` $=\alpha_{11}*(\textbf{Apple}) +\alpha_{12}*(\textbf{launched}) + \alpha_{13}*(\textbf{a})+\alpha_{14}*(\textbf{new})+\alpha_{15}*(\textbf{mobile})+\alpha_{16}*(\textbf{phone}) \newline \text{ where : } \displaystyle\sum_{j=1}^{6}\alpha_{1j}=1$ 

</div>

- Similarly, for the word `launched`: 

<div align="center">

`launched`$=\alpha_{21}*(\textbf{Apple}) +\alpha_{22}*(\textbf{launched}) + \alpha_{23}*(\textbf{a})+\alpha_{24}*(\textbf{new})+\alpha_{25}*(\textbf{mobile})+\alpha_{26}*(\textbf{phone})$

</div>

and so on for other words. 

- **Note** that the words here are not literal words but rather word embeddings.  

- Now comes the main question : what should these $\alpha_{ij}$'s be?

- One approach can be : 
     - Since we want the surrounding words, which help define the meaning of the target word, to influence its embedding, how about using the similarity score between the 2 words ? ( basically the dot product )

     - Suppose we want the coefficient $\alpha_{16}$, i.e. the similarity between the words `apple` and `phone` (took this deliberately to prove a point later.)

     - If the words have similar meanings, they will have a high dot product. 

     - If the words are very unrelated, they will have a low dot product. 
     - Explanation : 
          - Word embeddings are vectors that capture the meaning of a word in a high-dimensional space. 
          - A word may have different meanings in different aspects. Since there are various aspects in which a word can be used, its meaning is spread across the various dimensions of the vector. 
          <div align="center">

          $\text{dot\_product}(a,b)  = \displaystyle\sum_{i=0}^{d_{model}-1}a_{i} \cdot b_{i}$

          </div>
          
          - **Note** that  $\text{dot\_product}(a,b)$ is a scalar value and not a vector. 

          - The higher the value of $\text{dot\_product}(a,b)$, the more $b$ influences the meaning of $a$, enhancing the overall contextual meaning of $a$.

          - Furthermore, the other words in the sentence also contribute to the new word embedding of the current word through the same process.

          - **NOTE**: we use normalized coefficients and not the raw value of the dot product. These normalized coefficients are produced by the softmax function : 
          <div align="center">

          $\alpha_{ij}  =  \frac{\displaystyle\ e^{\text{dot\_product}(i,j)}}{\displaystyle\sum_{k=1}^{n}e^{\text{dot\_product}(i,k)}}$

          </div>

          - Benefits of this scheme :
               - The calculation of the refined embeddings can be parallelized, as no operation depends on the result of a previous one. But how? 

                    - We only need the dot products of the word embeddings we already have. We don't need the output for the previous word in the sequence to compute the next set of word embeddings. 

                    - Suppose for the sequence : 
                      <div align="center">

                      `Apple launched new mobile`

                      </div>

                      - For this sentence, we can easily parallelize the following computations:

                         - $\text{dot\_product ( Apple, launched ) }$

                         - $\text{dot\_product ( Apple, new ) }$

                         - $\text{dot\_product ( Apple, mobile )}$

                         - $\text{dot\_product ( launched, new )}$

                         - $\text{dot\_product ( new, mobile )}$
                         
               - There are no parameters to learn; a simple runtime computation does the job, as the word embeddings are readily available. 
- Summarizing the whole process : 
     1) The word embedding of each word in the input sentence is fetched. 
     2) We compute the dot product of every pair of word embeddings. 
     3) Before using them to build the new embedding of each word, we take the softmax over the dot products involved in building that vector. 
     4) We compute the new word embeddings as weighted sums, so that they capture the meaning as per the surrounding words. 
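
The four steps above can be sketched in a few lines. This is a minimal illustration, assuming random vectors as stand-ins for real word embeddings and a toy `d_model = 8`; it has no learned parameters, as described.

```python
import torch

torch.manual_seed(0)
words = ["Apple", "launched", "new", "mobile"]
d_model = 8                                  # toy size, for illustration only
X = torch.randn(len(words), d_model)         # step 1: stand-ins for the fetched embeddings

scores = X @ X.T                             # step 2: every pairwise dot product in one matmul
alpha = torch.softmax(scores, dim=-1)        # step 3: row i holds alpha_i1 ... alpha_in, summing to 1
new_X = alpha @ X                            # step 4: each new embedding is a weighted sum of all embeddings

print(alpha.sum(dim=-1))                     # every entry is 1 (up to rounding)
print(new_X.shape)                           # torch.Size([4, 8])
```

The single matrix product `X @ X.T` is what makes the computation parallel: no row of `scores` depends on any other row.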


---- 
#### Is everything alright ? 

- Below is the whole flow of the concept of attention discussed above : 

<img src="key_query_value.jpeg">

- One thing to note : the embeddings in **Green** act like someone who is asking a question to all the words around it : Are you similar to me ? If yes, then how much ? 

- The word embeddings used in this part of the self-attention process are called the **Query**. 

- The word embeddings in Yellow are the ones being asked the question, and they compare their values with the values of the **Query** embedding.

- These word embeddings, which are compared against the Query embeddings, are called **Keys**.

- In the final stage, the Blue embeddings are nothing but the embeddings of the **Key**
words, now multiplied by the coefficients.

- These embeddings are called **Values**.

- Symbols : 
     - **Key** : $K$ 
     - **Query** : $Q$ 
     - **Value** : $V$ 

- Hence we can say that the word embedding of each word is used in $3$ different roles, i.e. as **Key**, **Query** $\&$ **Value**.  

- Now comes the main question : Do you think the **same word embedding** should be **used for all the 3 usage aspects**? 

- The usual answer is no. The reason for keeping the word embeddings different for the same word : 
     - # TODO : Get a convincing reason for it! 




----

- So now we know that we need different word embeddings to use as key, query and value vectors.

- But how will we find them? 

- So now our question is : given the word embedding W of a word : 
     - What is the **transformation T** to be applied on W to get the **Key embedding** $K(W)$ ? 
          - $T_{key}(W) = K(W)$

     - Similarly, for query and value : 
          - $T_{query}(W) = Q(W)$
          - $T_{value}(W) = V(W)$
     - Hence our task is now to find the linear transformations $T_{key},T_{query},T_{value}$

- Before jumping into how we find them, let's look at exactly what we are going to do : 

- We have a vector for each word. 
- We transform this vector into a Key, a Query and a Value vector by multiplying it with a transformation matrix for each role. 
- How do we get the transformation matrix for key, query and value ? Simply by learning it from data.

- So essentially we have 3 matrices which we learn during the training process : $W^{k},W^{q}\text{ \& }W^{v} $. The embeddings are multiplied with these matrices to generate new vectors that can be used as a key vector, a query vector and a value vector. 
This is exactly what a linear transformation does : we use a matrix to modify a vector. 


<div align="center">
<img src="key_query_value_matrix.jpeg">
</div>

- The matrices shown above are learned from the data itself. 
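
As a rough sketch of this idea, the three matrices can be written as bias-free `nn.Linear` layers. The sizes, the random input `X`, and the names `W_q`, `W_k`, `W_v` are assumptions for illustration; the attention weights are computed from plain dot products, as in the notes above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 6, 8                         # toy sizes, for illustration only
X = torch.randn(seq_len, d_model)               # stand-ins for the word embeddings

# three learnable matrices, one per role
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(X), W_k(X), W_v(X)                # each of shape (seq_len, d_model)
alpha = torch.softmax(Q @ K.T, dim=-1)          # query-key similarities, normalized per row
out = alpha @ V                                 # weighted sum of the value vectors

n_params = sum(p.numel() for layer in (W_q, W_k, W_v) for p in layer.parameters())
print(out.shape)                                # torch.Size([6, 8])
print(n_params)                                 # 192 = 3 * 8 * 8 learnable numbers
```

Unlike the parameter-free version, the weights of `W_q`, `W_k` and `W_v` receive gradients, so the notion of "similarity" itself is learned from data.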

------

### Views and contiguous in pytorch : [Reference Article](https://medium.com/analytics-vidhya/pytorch-contiguous-vs-non-contiguous-tensor-view-understanding-view-reshape-73e10cdfa0dd)


- The `.view` function in PyTorch does not do much apart from returning an alternate way of **viewing** the same chunk of data.

- View is nothing but an alternative way to interpret the original tensor’s dimension without making a physical copy in the memory.

- This means that any change made through the view will be reflected in the original tensor as well, since the view reads its data from the same memory address as the original tensor.

- The same holds in the other direction: when the original tensor is modified, the view also changes, for the very same reason. 

- Now, certain operations return a view and certain others don't. 
- #TODO 
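
The memory sharing can be observed directly. This is a small sketch on a toy tensor, using `data_ptr()` to compare the underlying memory addresses.

```python
import torch

x = torch.arange(6)                     # tensor([0, 1, 2, 3, 4, 5])
v = x.view(2, 3)                        # same data, seen as a 2 x 3 matrix
v[0, 0] = 100                           # write through the view ...
print(x[0].item())                      # 100  ... and the original changes too
print(v.data_ptr() == x.data_ptr())     # True: both point at the same memory

t = v.t()                               # transpose is also a view, but not contiguous
print(t.is_contiguous())                # False
c = t.contiguous()                      # this one makes a real copy in memory
print(c.data_ptr() == x.data_ptr())     # False
```

Calling `.view` on the non-contiguous `t` would fail for most target shapes, which is why `.contiguous()` (or `.reshape`) is often used before it.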


### Important things to understand about pytorch dimensions : 
- We are going to work with tensors that extend up to $3$ dimensions ( and $4$ dimensions as well 💀)
- First get familiar with $dim = 3$ 
    <div align="center">

    <img src="dim_explain1.jpeg">

    </div>

    <div align="center">

    <img src="dim_explain2.jpeg">

    </div>

    <div align="center">

    <img src="dim_explain3.jpeg">

    </div>

    
    - Say we want to get the sum. 
    - We can take it in 3 ways, one per dimension : 
- $dim = 0$
    - Here the sum is taken across the **batch** dimension as shown above. 
    - Basically, the pink tiles (the same position in every matrix) are summed up, and the same goes for every other cell. 
    - What's the output ? Say the input has shape $(3,4,5)$. 
    - We summed along the batch dimension for each element position of the individual matrices. 
    - So the final shape is $(4,5)$.

```python
import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=0)
print("sm.shape = ",sm.shape)
print(sm)
# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([4, 5])
# tensor([[111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.],
#         [111., 222., 333., 444., 555.]])
```


- $dim = 1$
    - Here we sum down every column of each matrix. So essentially all the pink boxes are summed up for each matrix. 
    - Final shape ? Each matrix shrinks into a single row where every element is the sum of its column, and we have $3$ such matrices. So we get $3$ such rows with $5$ columns each, and the final shape becomes $(3,5)$.

```python
import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=1)
print("sm.shape = ",sm.shape)
print(sm)

# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([3, 5])
# tensor([[   4.,    8.,   12.,   16.,   20.],
#         [  40.,   80.,  120.,  160.,  200.],
#         [ 400.,  800., 1200., 1600., 2000.]])
```


- $dim = 2$
    - Here we're summing up the elements in each row. All the green cells of a row shrink into one single cell, so every matrix becomes a column of $4$ row-sums, and we have $3$ such matrices. Each column is laid out as a row and the $3$ rows are stacked, so the shape becomes $(3,4)$.

```python

import torch
torch.set_printoptions(precision=4)
x = torch.tensor([
                    [
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                        [1  ,2  ,3  ,4  ,5  ],
                    ],
                    [
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                        [10 ,20 ,30 ,40 ,50 ],
                    ],
                    [
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                        [100,200,300,400,500],
                    ],
                ],
                dtype=torch.float32)
print("x.shape = ",x.shape)
sm = x.sum(dim=2)
print("sm.shape = ",sm.shape)
print(sm)


# Output:
# x.shape =  torch.Size([3, 4, 5])
# sm.shape =  torch.Size([3, 4])
# tensor([[  15.,   15.,   15.,   15.],
#         [ 150.,  150.,  150.,  150.],
#         [1500., 1500., 1500., 1500.]])

```

- **NOTE** : to get the final shape, remove from the original shape the dimension along which the operation was applied. For a 3-dimensional tensor the shape is (batch, rows, columns): reducing along $dim=1$ collapses the rows, i.e. it works down each column, and reducing along $dim=2$ collapses the columns, i.e. it works across each row.

- Similar logic goes for applying **softmax**: it is applied along the elements of the dimension we specify.

- $dim=0$ : the softmax is applied **stack-wise, over the elements at the same position in every matrix**. 

- $dim=1$ : the softmax is applied **column-wise, for each column of every matrix**.

- $dim=2$ : the softmax is applied **row-wise, for each row of every matrix**. 

- Note that softmax **doesn't change the shape of the tensor**. Rather than aggregating, it **normalizes the elements along the specified dimension** so that they are positive and sum to $1$, and **updates those elements in the output accordingly**.

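- Note that the labels below name the direction an operation runs in (stack-wise, column-wise, row-wise), and every dimension also has a negative index : 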
<div align="center">

|dim:    |Batch|Column|Row |
|--------|-----|------|----|
|positive|$0$  |$1$   |$2$ |
|negative|$-3$ |$-2$  |$-1$|
</div>
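A small sketch to check the softmax rules above, assuming an arbitrary $(2,3,4)$ tensor. It verifies that the shape never changes, that the values along the chosen dimension sum to $1$, and that a negative index picks the same dimension as its positive counterpart:

```python
import torch

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

sums_to_one = {}
for dim in (0, 1, 2):
    probs = x.softmax(dim=dim)
    assert probs.shape == x.shape        # softmax never changes the shape
    totals = probs.sum(dim=dim)          # summing along the same dim ...
    sums_to_one[dim] = torch.allclose(totals, torch.ones_like(totals))  # ... gives ones

# negative indices count from the end: -1 is the same dimension as 2 here
negative_matches = torch.equal(x.softmax(dim=-1), x.softmax(dim=2))

print(sums_to_one, negative_matches)
```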

------

### Last time : 
- We understood that there are $3$ separate matrices that are used to generate the key, query and value vectors for a single word.
- But what are we actually computing ? 

$\text{ATTENTION(Q,K,V) } =softmax(\displaystyle\frac{QK^{T}}{\sqrt{d_{k}}})V $

where : 
$K = \text{ Key vector }$

$Q = \text{ Query vector }$

$V = \text{ Value vector }$

$d_{k}= \text{ dimension of key vectors }$


- Here a question arises : what is the shape of the matrices that we're going to multiply our vectors with ? 
- The answer is $d_{model} \times d_{model}$.
- Here's what happens : 
- Suppose you have a sequence of words :
<div align="center">

`My cat is a lovely cat.`😸

</div>
     
    
- each word has its own word embedding : $e_{My}\text{  }e_{cat}\text{  }e_{is}\text{  }e_{a}\text{  }e_{lovely}\text{  }e_{cat}$
- Now stack these vectors into one single matrix, one row per word :


<div align="center">

|dimensions      |$d_{0}$|$d_{1}$|$d_{2}$|$d_{3}$|$\cdots$|$d_{model}-1$|
|----------------|-------|-------|-------|-------|--------|-------------|     
|$e_{My}$        |   102 |   452 |   12  |  212  |$\cdots$|   864       |         
|$e_{cat}$       |   102 |   452 |   12  |  212  |$\cdots$|   864       |             
|$e_{is}$        |   102 |   452 |   12  |  212  |$\cdots$|   864       |     
|$e_{a}$         |   102 |   452 |   12  |  212  |$\cdots$|   864       | 
|$e_{lovely}$    |   102 |   452 |   12  |  212  |$\cdots$|   864       |         
|$e_{cat}$       |   102 |   452 |   12  |  212  |$\cdots$|   864       |         

</div>

- So here the dimensions of the input are not just $1\times d_{model}$ but actually $\text{ sequence\_length }\times d_{model}$
- Now we have 3 weight matrices, one each for $K,Q\text{ \& }V$
- We multiply this input matrix with each of the three to get $K,Q,V$ :
<div align="center">

<img src="attention_calculation.jpeg">

</div>


- Here the individual vectors are the key, query and value vectors of each word.
- Suppose the input sentence has a sequence length of $5$. To calculate the contextual embedding of the $0^{th}$ word, we follow the process described below: 

<div align="center">

<img src="self_attention_calculation_process.jpeg">

</div>

- Here the scaled dot product is nothing but $QK^{T}$ divided by $\sqrt{d_{k}}$, where $d_{k}$ is the dimension of the key vectors (equal to $d_{model}$ as long as we use a single head).

- We'll explore why it's done later. **TODO**
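A minimal single-head sketch of the whole calculation, under stated assumptions: random numbers stand in for the learned embeddings and for the $W^{q}, W^{k}, W^{v}$ matrices, and the sizes are arbitrary. It is meant only to make the shapes concrete, not to mirror the class implemented later:

```python
import math
import torch

torch.manual_seed(0)
sequence_length, d_model = 6, 8          # e.g. "My cat is a lovely cat"

embeddings = torch.randn(sequence_length, d_model)   # one row per word
w_q = torch.randn(d_model, d_model)      # random stand-ins for the learned matrices
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

q = embeddings @ w_q                     # (sequence_length, d_model)
k = embeddings @ w_k
v = embeddings @ w_v

d_k = k.shape[-1]                        # key dimension, equals d_model for one head
scores = q @ k.T / math.sqrt(d_k)        # (sequence_length, sequence_length)
weights = scores.softmax(dim=-1)         # each row sums to 1
contextual = weights @ v                 # (sequence_length, d_model)

print(weights.shape, contextual.shape)
```

Row $i$ of `weights` holds the coefficients used to mix the value vectors into the contextual embedding of word $i$.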


-------

### Any Flaw yet ?
- As of now we don't see any issue in our approach. 
- Okay, how about text with a double meaning 😏😼? Consider the text: 
<div align="center">

`A man saw a person with a telescope.`

</div> 

- There are 2 possible meanings : 
- The man **used a telescope** to see the person. 
- The man saw a person who was **holding a telescope**.
- But under our current workflow, we can capture only one of these meanings. 
- So what's the fix ? 
- Instead of calculating just one **type** of attention score, calculate multiple attention scores. 
- Here comes the concept of **MULTIHEAD ATTENTION**. 
- But how exactly will we do this ? 
- Above we saw that we need the key, query and value matrices to fetch the meaning of the current word given the words around it. 
- Here also we keep multiple key, query $\&$ value matrices, each capturing a different meaning, but in a slightly different way : 
    - We break the vectors along the embedding dimension. We know that the matrix we're working with is $sequence\_length \times d_{model}$. 
    - Suppose we want to capture 8 different meanings of the same word. What we do is break the whole matrix ( column wise ) into 8 different pieces :

    <div align="center">

    <img src="multihead_attention_splitting.jpeg">


    </div> 

- Here one important thing to note : $d_{model}\text{  }\% \text{  }  \text{number of attention heads }=0$
- Why ? How the duck will you divide $d_{model}$ into equal parts otherwise! 
- Now that we have broken down the matrix for multi-head attention, we do the same thing as before for each individual piece : find the contextual embeddings of each word, and then concatenate the embeddings.
- How exactly ? Like this : 


    <div align="center">

    <img src="multihead_attention_calculation_1.jpeg">

    </div> 

- Here we can see that we have calculated the contextual embeddings for a single head. 
- A point to note: since each head has its own set of key, query and value matrices, there will be $\text{num\_heads}$ sets of key, query and value matrices, one per head. 

- But this is for one head. We have $\text{num\_heads}$ heads, therefore we get $\text{num\_heads}$ sets of contextual embeddings. 
- Now what? We have $\text{num\_heads}$ matrices, each with the dimension $\text{ sequence\_length } \times \frac{d_{model}}{num\_heads}$
- We'll simply concatenate 😛:

<div align="center">
<img src="multihead_attention_calculation_2.jpeg">
</div> 
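The split-then-concatenate idea can be sketched with plain tensor reshaping. This is only an illustration of the bookkeeping (no attention is computed here); the sizes and the name `d_head` are assumptions chosen for the example:

```python
import torch

sequence_length, d_model, num_heads = 6, 8, 2
assert d_model % num_heads == 0          # otherwise the columns can't be split evenly
d_head = d_model // num_heads

x = torch.arange(sequence_length * d_model, dtype=torch.float32)
x = x.reshape(sequence_length, d_model)

# (sequence_length, d_model) -> (num_heads, sequence_length, d_head)
heads = x.view(sequence_length, num_heads, d_head).transpose(0, 1)

# head 0 is simply the first d_head columns of x, head 1 the next d_head columns
first_head_ok = torch.equal(heads[0], x[:, :d_head])
second_head_ok = torch.equal(heads[1], x[:, d_head:2 * d_head])

# concatenating the heads column-wise gives back (sequence_length, d_model)
merged = heads.transpose(0, 1).contiguous().view(sequence_length, d_model)
round_trip_ok = torch.equal(merged, x)

print(tuple(heads.shape), first_head_ok, second_head_ok, round_trip_ok)
```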

-------


### SOME FINE DETAIL FOR GENERALITY : 
In the paper, Attention Is All You Need, a more generalized approach towards calculating multihead attention was proposed : 
- We know that attention is : 
    - $\text{ Attention}(Q,K,V)=\textstyle{ softmax}(\frac{\textstyle{QK^{T}}}{\textstyle\sqrt{d_{k}}})V$
    - Here $d_{k}$ is the number of columns of the key matrix obtained by multiplying the input word embeddings by the key weight matrix of a given head. 
    - Note that the $Q$ and $K$ that enter this formula must have the same number of columns ($d_{k}$) so that the product  $QK^{T}$ is valid. 
    - In the original paper :
    - $\text{MULTIHEAD ATTENTION} = \textstyle{Concat}(head_{1},head_{2},\cdots,head_{num\_heads})W^{0} $  
    where $head_{i} = \text{ Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$ 
    where $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are the matrices that we use for converting simple word embeddings to query, key and value vectors. 
    - Wait a 🦆ing minute ! What's $W^{0}$ (written $W^{O}$ in the paper, O for output) ? I'll come back to this later.
    - Note that their dimensions are as follow : 
        - Q : $\textstyle{sequence\_length} \times \textstyle{d_{model}}$
        - K : $\textstyle{sequence\_length} \times \textstyle{d_{model}}$
        - V : $\textstyle{sequence\_length} \times \textstyle{d_{model}}$
        - $W_{i}^{Q} : \textstyle{d_{model}} \times \textstyle{d_{key}}$
        - $W_{i}^{K} : \textstyle{d_{model}} \times \textstyle{d_{key}}$
        - $W_{i}^{V} : \textstyle{d_{model}} \times \textstyle{d_{value}}$
    - Wait a minute : how did $d_{key} \& d_{value}$ come into the picture ? 
    - Just to add generality ! In the discussion above, we divided the matrix such that $d_{key} \& d_{value}$ came out to be the same. What if someone decides to use different sizes for the key-query and value matrices ? 
    - By the way, why key-query and value, and not key, query and value ? Because key and query are supposed to be multiplied 😭 like this: 
    $QK^{T}$ so for the multiplication to work out, we need the same $d_{key}$ in both the key and the query matrix. 
    Note that this difference in dimensions lives in the weight matrices that are used to generate the key, query and value matrices.
    - Note that we still want $d_{key}$ to be a factor of $d_{model}$
    - Now follow the flow : 
    $head_{i} = \text{ Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}) = \textstyle{ softmax}(\frac{\textstyle{QW_{i}^{Q}(KW_{i}^{K})^{T}}}{\textstyle\sqrt{d_{k}}})VW_{i}^{V}$

    $ QW_{i}^{Q} : (\textstyle{sequence\_length} \times d_{model})  \times (d_{model} \times \textstyle{d_{key}}) = (\textstyle{sequence\_length} \times d_{key}) $

    $ KW_{i}^{K} : (\textstyle{sequence\_length} \times d_{model})  \times (d_{model} \times \textstyle{d_{key}})  = (\textstyle{sequence\_length} \times d_{key})$

    $VW_{i}^{V} : (\textstyle{sequence\_length} \times d_{model})  \times (d_{model} \times \textstyle{d_{value}}) = (\textstyle{sequence\_length} \times d_{value}) $


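    $\textstyle{QW_{i}^{Q}(KW_{i}^{K})^{T}} : (\textstyle{sequence\_length} \times d_{key}) \times (d_{key} \times \textstyle{sequence\_length}) = (\textstyle{sequence\_length} \times \textstyle{sequence\_length})$

    $\textstyle{ softmax}(\frac{\textstyle{QW_{i}^{Q}(KW_{i}^{K})^{T}}}{\textstyle\sqrt{d_{k}}})VW_{i}^{V} : (\textstyle{sequence\_length} \times \textstyle{sequence\_length}) \times (\textstyle{sequence\_length} \times d_{value}) = (\textstyle{sequence\_length} \times {d_{value}}) $

    - So the final shape of attention for a single head comes out to be : $(\textstyle{sequence\_length} \times {d_{value}}) $

    - The heads are concatenated along the columns, so $\textstyle{Concat}(head_{1},head_{2},\cdots,head_{num\_heads})$
    has the shape $(\textstyle{sequence\_length} \times (num\_heads \times d_{value})) $ 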
    - Note that it's still a $2D$ Matrix, it's just that the columns of this matrix increased $num\_heads$ times. 
    - Now here we have a smol 🤏 problem : 
        - the current shape is : $(\textstyle{sequence\_length} \times (num\_heads \times d_{value}))$
        - But for further steps we need the matrix in $(\textstyle{sequence\_length} \times  d_{model})$
        - How to convert ? 🤔💡
        - Simple , multiply it by a matrix whose dimensions are $((num\_heads \times d_{value})\times d_{model})$ which is nothing but $W^{0}$ as in the formula shown in the very beginning!
        - That's why  $\text{MULTIHEAD ATTENTION} = \textstyle{Concat}(head_{1},head_{2},\cdots,head_{num\_heads}) \times \textbf W^{0}$
        - This $W^0$ is important in this aspect. 
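The shape flow above can be traced in a short sketch. Everything here is an assumption made for illustration: the weights are random instead of learned, and $d_{key}$ and $d_{value}$ are deliberately chosen to be different so that the role of the output matrix (`w_o`, standing in for $W^{0}$) is visible:

```python
import torch

torch.manual_seed(0)
sequence_length, d_model, num_heads = 5, 12, 3
d_key, d_value = 4, 6                    # deliberately different sizes

x = torch.randn(sequence_length, d_model)

head_outputs = []
for _ in range(num_heads):
    w_q = torch.randn(d_model, d_key)    # per-head matrices, random stand-ins
    w_k = torch.randn(d_model, d_key)
    w_v = torch.randn(d_model, d_value)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = (q @ k.T / d_key ** 0.5).softmax(dim=-1)   # (seq, seq)
    head_outputs.append(weights @ v)                     # (seq, d_value)

concat = torch.cat(head_outputs, dim=-1)           # (seq, num_heads * d_value)
w_o = torch.randn(num_heads * d_value, d_model)    # maps back to d_model columns
output = concat @ w_o                              # (seq, d_model)

print(tuple(concat.shape), tuple(output.shape))
```

Without the final multiplication the result would have $num\_heads \times d_{value} = 18$ columns instead of the $d_{model} = 12$ that the following layers expect.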


----

### Some noteworthy implementation details : 

- We're not gonna make num_heads different matrices man 😩! 

- That's not handy to maintain and will need some different logic for training 💀. I know I can't even implement a simple backpropagation 😔, so implementing it for a custom system I've got no clue about isn't very convincing 🤐.   
- What we're gonna do instead is something that we consider a smart move, but it's really not, cause we're not as smart as we think we are 😔💡: 
     - Make a single matrix each for key, query and value, shared by all the attention heads. 
     - Split the resulting key, query and value matrices into the desired number of heads and then calculate the attention for each head.
     - After we're done calculating the attention of each head, we simply concatenate the results as we planned before. 
     - Something like this : 
    <div align="center">

    <img src="multihead_attention_implementation_for_key_qeury_value_matrices.jpeg">

    </div>

    - Now one question : Wouldn't this kill our very idea of multi-head attention 😨🥹?
    - Isn't this simply calculating attention once, and not once per head 😭? 

    - **Actually not!**
    - We're looking at this the wrong way!
    - We're just reordering steps for our convenience. 
    - Instead of keeping separate key, query and value matrices for each head, we first calculate the combined key, query and value matrices and then break them down per head for calculating attention.  
```python
import numpy as np 
word_vector = np.array([[1,2,3,4],[5,6,7,8]])
key = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]
            ])
query = np.array([[1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4]
            ])
value = np.array([[1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4],
                  [1,2,3,4]
            ])

word_key = np.matmul(word_vector,key)
word_query = np.matmul(word_vector,query)
word_value = np.matmul(word_vector,value)

word_key_part_1 = word_key[:,0:2]
word_key_part_2 = word_key[:,2:4]

word_query_part_1 = word_query[:,0:2]
word_query_part_2 = word_query[:,2:4]

word_value_part_1 = word_value[:,0:2]
word_value_part_2 = word_value[:,2:4]


attention_part1 = np.matmul(np.matmul(word_query_part_1,word_key_part_1.transpose())/np.sqrt(2),word_value_part_1)
attention_part2 = np.matmul(np.matmul(word_query_part_2,word_key_part_2.transpose())/np.sqrt(2),word_value_part_2)

key1 = np.array([[1,2,],
                [1,2,],
                [1,2,],
                [1,2,]
            ])
query1 = np.array([[1,2,],
                  [1,2,],
                  [1,2,],
                  [1,2,]
            ])
value1 = np.array([[1,2,],
                  [1,2,],
                  [1,2,],
                  [1,2,]
            ])

key2 = np.array([[3,4,],
                 [3,4,],
                 [3,4,],
                 [3,4,]
            ])
query2 = np.array([[3,4,],
                   [3,4,],
                   [3,4,],
                   [3,4,]
            ])
value2 = np.array([[3,4,],
                   [3,4,],
                   [3,4,],
                   [3,4,]
            ])

word_key1 = np.matmul(word_vector,key1)
word_query1 = np.matmul(word_vector,query1)
word_value1 = np.matmul(word_vector,value1)
attention1 = np.matmul(np.matmul(word_query1,word_key1.transpose()/np.sqrt(2)),word_value1)


word_key2 = np.matmul(word_vector,key2)
word_query2 = np.matmul(word_vector,query2)
word_value2 = np.matmul(word_vector,value2)
attention2 = np.matmul(np.matmul(word_query2,word_key2.transpose()/np.sqrt(2)),word_value2)


print("attention1 = ", attention1)
print("attention2 = ", attention2)
print("attention_part1 = ", attention_part1)
print("attention_part2 = ", attention_part2)


# OUTPUTS : 
# attention1 =      [[ 27435.74311004  54871.48622008]
#                    [ 71332.9320861  142665.8641722 ]]
# attention2 =      [[ 411536.14665057  548714.86220076]
#                    [1069993.98129148 1426658.64172198]]
# attention_part1 = [[ 27435.74311004  54871.48622008]
#                    [ 71332.9320861  142665.8641722 ]]
# attention_part2 = [[ 411536.14665057  548714.86220076]
#                    [1069993.98129148 1426658.64172198]]
```
- This is a small simulation of what we're doing here (softmax is left out to keep the numbers easy to follow). 
- In the first case (`attention_part1`, `attention_part2`), we multiply by the full key, query and value matrices and only then split the results into heads. 
- In the second case (`attention1`, `attention2`), we split the key, query and value matrices per head first and then calculate the attention. 
- In both cases the corresponding attention values come out to be the same, thus proving our point 😎. 


------
 

### A time for reality check ( yes, once again 😔)
- Above we said that we can use different dimensions for the key-query and value matrices, but that's not quite true once we look at it from the architectural point of view of our single-matrix implementation. 
- Here's why : 
    - Suppose we have $512$ dimensions for our word embeddings.
    - Now we decide to take up $8$ attention heads. 
    - $\implies d_{key} = 64$
    - So splitting gives us $8$ key and $8$ query matrices. 
    - What about $d_{value}$?
    - Consider that $d_{value} \neq 64$
    - Splitting the $512$ value columns into pieces of width $d_{value}$ then gives a number of value matrices different from $8$, which causes some problems: 
        - How will we handle the $8$ resultant key-query products ? 
        - Which value matrix do we multiply each product with, since we don't have as many value matrices as key-query product matrices ? 
    - Hence, to keep our lives peaceful, we'll do $d_{key}=d_{value}=\frac{d_{model}}{h}=\frac{512}{8} = 64$  
    - Now one important question : what about the $W^{0}$ matrix ? Will we discard it 😔? 
    - No, we'll keep it ( reason : TODO : FIND OUT ) 


-------------------


### ANOTHER CONCEPT : MASKING 🤡

- While performing the operations on decoder, this is the general flow : 
- Suppose the sentence under consideration is : 

<div align="center">

**INPUT** 

|My | cat | is | a | beautiful | cat.| 😸 |
|---|-----|----|---|-----------|-----|----|

**OUTPUT**

|मेरी | बिल्ली | एक | सुंदर | बिल्ली | है | 😸 |
|----|------|----|------|------|---|-----|
</div>

- In the encoder it will be converted to a matrix of embeddings and positional embeddings. 
- The input will essentially look like this before entering the decoder : 

<div align="center">

(say $d_{model}=5$ )

| My      |  $1$  |  $3$  |  $4$ |  $8$ |  $2$ |
|---------|-------|-------|------|------|------|
| cat     |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ | 
| is      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  a      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|beautiful|  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  cat    |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |


</div>

- Now there are 2 scenarios which can be considered : 
     - Training 
     - Inference ( basically running the trained model for use )

- Case 1 : **INFERENCE MODE**
     - Since each decoding step takes the output of the previous decoding step as its input, there is no scope for parallelism here. 
     - Something like this happens : 
     
     <div align="center">

     <img src="inference_in_transformer.jpg">

     </div>

     - So if we have, say, $100$ words to generate, this takes $100$ units of time, e.g. $100s$ if $1$ computation takes $1s$. 
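The one-token-at-a-time flow of inference can be sketched with a toy loop. Note the assumptions: `toy_decoder_step` and `TOY_TRANSLATION` are hypothetical stand-ins for a real decoder (which would run attention layers), and `<sos>` / `<eos>` are assumed start and end markers. The only point is that step $t$ cannot start before step $t-1$ has produced its token:

```python
# Hypothetical stand-in for a trained decoder: it just replays a fixed answer.
TOY_TRANSLATION = ["<sos>", "मेरी", "बिल्ली", "एक", "सुंदर", "बिल्ली", "है", "<eos>"]

def toy_decoder_step(generated):
    """Return the next token given everything generated so far."""
    return TOY_TRANSLATION[len(generated)]

def greedy_decode(max_steps=20):
    generated = ["<sos>"]
    steps = 0
    # each iteration needs the output of the previous one, so no parallelism
    while generated[-1] != "<eos>" and steps < max_steps:
        generated.append(toy_decoder_step(generated))
        steps += 1
    return generated, steps

tokens, steps = greedy_decode()
print(tokens, steps)
```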


- Case 2 : **TRAINING MODE** : 
     - Here the story is a bit different. 
     - Here something similar happens but with a twist : 
          - Take the same translation example as above. 
          - Suppose instead of producing the first token as **मेरी**, the token produced was **हमारी**. 
          <div align="center">

          <img src="traninig_mode_in_transformer.jpg"> 

          </div>

          - Now here, instead of passing **हमारी** to the next step of the decoder, we pass on the true value that should have come at that place, so that the model is trained on the fact that we want **बिल्ली** after the token **मेरी** (this trick is commonly called teacher forcing). 
          - So three questions arise : (1) Do we actually need the **previous state computed by the transformer** ? (2) Do we simply know what the next state is gonna be ? (3) If we already know the next state, why even perform the calculation using the transformer ? 
          - The answer to the third question is **TRAINING🤡**. How the hell are we supposed to calculate the loss without comparing what the transformer predicts against what we want it to learn ? 
          - The answer to the second question is : Yes we do! From the training data itself. 
          - The answer to the first question is : No, we simply don't. Our sole purpose of performing this operation is to train the transformer.
          - Since we know what the next state is gonna be, can we parallelize this operation for a single sequence ? The answer is yes we can, and that's what we do. 
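A tiny sketch of why training can be parallelized, assuming `<sos>` and `<eos>` as start and end markers (they are not named in these notes). Since the true translation is known, the decoder input and the expected output for every position can be laid out up front, and all positions can be fed in one go:

```python
target = ["मेरी", "बिल्ली", "एक", "सुंदर", "बिल्ली", "है"]

decoder_input = ["<sos>"] + target       # what the decoder is fed at each position
expected_output = target + ["<eos>"]     # what it should predict at each position

# every (input, expected) pair is known before the model runs at all
pairs = list(zip(decoder_input, expected_output))
for fed, wanted in pairs:
    print(f"fed {fed!r} -> should predict {wanted!r}")
```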






### SO EVERYTHING LOOKS FINE TILL NOW ? 
- One wise man said, if you think it's going to end well, then you're not paying attention 💀.
- There's one issue while training : 
- When we send the embeddings into the decoder in parallel, here's what we're essentially doing: 
      - Suppose we send the sentence : 
<div align="center">

|My | cat | is | a | beautiful | cat.| 😸 |
|---|-----|----|---|-----------|-----|---- |

</div>


- Here, when we send the word **My**, we're essentially also sending the information about what's going to come next in our sentence. 
- How so ? 
- Like this : 
- We wanted to send this input to the decoder : 

<div align="center">

(say $d_{model}=5$ )

| My      |  $1$  |  $3$  |  $4$ |  $8$ |  $2$ |
|---------|-------|-------|------|------|------|
| cat     |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ | 
| is      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  a      |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|beautiful|  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
|  cat    |  $1$  |  $5$  |  $4$ |  $5$ |  $6$ |
           
</div>

- We sent the contextual embeddings ! 
- These embeddings contain information about the words that came before and **AFTER** the current word. 
- But how ? 
- Like this : 
- What we learned before was that we need to express the embedding of the current word in terms of its own meaning and the words around it. 
<div align="center">

Say we're considering the word **beautiful**:

$e_{beautiful} = \alpha_{1}.e_{my} + \alpha_{2}.e_{cat} + \alpha_{3}.e_{is} + \alpha_{4}.e_{a} + \alpha_{5}.e_{beautiful} + \alpha_{6}.e_{cat} $
</div>

- Everything up to $\alpha_{5}.e_{beautiful}$ is okay, because up to that point we only use the context of the words that came before. But the moment the last term $ \alpha_{6}.e_{cat} $ comes into the picture, it gives the transformer a hint that the next word is cat, which shouldn't happen. 
- Ideally we want to **HIDE** the information that the word **cat** is going to come up, because otherwise the model will fail to learn to predict it. 
- So what we want is that, while processing the inputs, the coefficients of the words that come after the current word are kept at zero. 
- That means : 
      - If our current word is cat ( the first occurrence ) then the values : 
      $\alpha_{3} = \alpha_{4} =  \alpha_{5} =  \alpha_{6} = 0$
- So ideally speaking, this should be done at the self attention calculation step.
- There we calculated the coefficient matrix for multiplying it with the word embeddings. 
- $softmax(\displaystyle\frac{QK^{T}}{\sqrt{d_{k}}})$ this step to be specific. 
- The output of this operation is actually the coefficients for the value matrix. 
- For better understanding, see this : 
<div align="center">

|          |My        |cat       |is        |a         |beautiful|cat       |
|----------|----------|----------|----------|----------|---------|----------|
|My        |  1       |   5      |   6      |    6     |   3     |   5      |              
|cat       |  2       |   5      |   6      |    6     |   3     |   5      |              
|is        |  3       |   5      |   6      |    6     |   3     |   5      |    
|a         |  4       |   5      |   6      |    6     |   3     |   5      |    
|beautiful |  5       |   5      |   6      |    6     |   3     |   5      |         
|cat       |  5       |   5      |   6      |    6     |   3     |   5      |              

</div>

- Here, the values in the first row are the coefficients that are multiplied with the value vectors of the corresponding words to build the output for **My**, and similarly for the other rows. 
- What we would like to have instead is something like this : 

<div align="center">

|          |My        |cat       |is        |a         |beautiful|cat       |
|----------|----------|----------|----------|----------|---------|----------|
|My        |  1       |   0      |   0      |    0     |   0     |   0      |              
|cat       |  2       |   5      |   0      |    0     |   0     |   0      |              
|is        |  3       |   5      |   6      |    0     |   0     |   0      |    
|a         |  4       |   5      |   6      |    6     |   0     |   0      |    
|beautiful |  5       |   5      |   6      |    6     |   3     |   0      |         
|cat       |  5       |   5      |   6      |    6     |   3     |   5      |              

</div>
- So to achieve this what we do is : 
<div align="center">

<img src="masking_demo.jpg">

</div>

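- Take a mask matrix that is $0$ at the allowed positions and $-\infty$ at the future positions, and add it to the scaled matrix before performing softmax. 
- Since $e^{-\infty}=0$, the softmax assigns a weight of $0$ to every masked position.
- This makes the coefficients of the future terms $0$.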
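A minimal sketch of the masking step, assuming random scores in place of the real $\frac{QK^{T}}{\sqrt{d_{k}}}$ matrix. It uses `masked_fill` with a lower-triangular mask, which has the same effect as adding a matrix of $0$ and $-\infty$ entries; the implementation later in these notes fills with a large negative number (`-1e9`) instead of $-\infty$:

```python
import torch

torch.manual_seed(0)
sequence_length = 6                      # "My cat is a beautiful cat"
scores = torch.randn(sequence_length, sequence_length)   # stand-in for scaled QK^T

# 1 on and below the diagonal (current + past words), 0 above it (future words)
mask = torch.tril(torch.ones(sequence_length, sequence_length))

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = masked_scores.softmax(dim=-1)  # future positions get weight 0

print(weights)
```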
----------


### Residual Connection : TODO😭
- They are specifically placed at many places. 
- Study [this paper](https://arxiv.org/pdf/1603.05027) for better understanding of why residual connections workout well for large and deep models!
- This one got a lot of math to understand. 

In [5]:
from torch import nn
import torch 
class MultiHeadAttention(nn.Module):
    def __init__(self,d_model:int = 512,h:int = 8, dropout:float = 0.1):
        super().__init__()
        self.d_model = d_model  # expected input dimesions in each word
        
        self.h  = h   # number of heads 
        assert d_model%h == 0 ,"Expected values of d_model and h such that d_model is divisible by h"
        self.d_key = d_model//h
        self.w_query = nn.Linear(d_model,d_model) # weight matrix for query 
        self.w_key = nn.Linear(d_model,d_model) # weight matrix for key 
        
        
        # Note that here we're  going for d_model X d_model and not for d_model X (h*d_value)
        # Since we're given an input matrix of d_model.  
        self.w_value = nn.Linear(d_model,d_model) # weight matrix for value 
        self.w_o = nn.Linear(d_model,d_model)
    
    
    @staticmethod # static method: `self` is not bound automatically, so the instance is passed explicitly: MultiHeadAttention.attention(instance, key, query, value, mask)
    def attention(self,key:torch.Tensor,query:torch.Tensor,value:torch.Tensor,mask):
        # swap only the last two dims; `.T` would reverse every dim of a 4-D tensor
        key_query = torch.matmul(query,key.transpose(-2,-1))
        scaled = key_query/(self.d_key ** 0.5)  # d_key is a plain int, so torch.sqrt can't be used on it
        # At this point , the shape of the matrix is : (Batch , h , sequence_length , sequence_length)
        if mask is not None:
            # masked_fill is not in-place, so keep the returned tensor
            scaled = scaled.masked_fill(mask==0,-1e9)
            
        softmax_ = scaled.softmax(dim=-1)
        
        return torch.matmul(softmax_,value)
    
    
    
    def forward(self,query,key,value,mask):
        q_w = self.w_query(query) # query matrix for input sequence , Shape : (Batch , sequence_length , d_model) 
        k_w = self.w_key(key)     # key matrix for input sequence   , Shape : (Batch , sequence_length , d_model)
        v_w = self.w_value(value) # value matrix for input sequence , Shape : (Batch , sequence_length , d_model)
        batch,sequence_length,d_model = q_w.shape
        
        # This is done to break the resultant query/key/value matrices of the sequence into h matrices
        # transpose is mainly done to make the matrix head-based indexable. 
        # -1 lets key/value keep their own sequence length (it may differ from the query's)
        q_q_head_wise = q_w.view(batch,-1,self.h,self.d_key).transpose(2,1) 
        q_k_head_wise = k_w.view(batch,-1,self.h,self.d_key).transpose(2,1) 
        q_v_head_wise = v_w.view(batch,-1,self.h,self.d_key).transpose(2,1) 
        attention_score = MultiHeadAttention.attention(self,q_k_head_wise,q_q_head_wise,q_v_head_wise,mask)
        
        #TODO : Study contiguous memory allocation in detail. 
        
        attention_score = attention_score.transpose(1,2).contiguous().view(batch,sequence_length,self.h*self.d_key)
        # Tanspose to get the dimension normally indexable and merging the tensors back. 
        
        return self.w_o(attention_score)
    
    
class LayerNormalization(nn.Module):
    def __init__(self,epsillon: float = 1e-8):
        super().__init__()
        self.epsillon = epsillon
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scale, starts at 1
        self.beta  = nn.Parameter(torch.zeros(1))  # learnable shift, starts at 0
    def forward(self,x:torch.Tensor):
        mean = x.mean(dim = -1,keepdim=True)
        std  = x.std(dim = -1,keepdim=True) 
        normalized = self.alpha * (x-mean)/(std + self.epsillon) + self.beta
        return normalized





class ResidualConnection(nn.Module):
    def __init__(self):
        super().__init__()
        self.normalize = LayerNormalization(epsillon=1e-8)
    def forward(self,x:torch.Tensor,sub_layer):
        # sub_layer is any callable that maps x to a tensor of the same shape: here, either the multi-head attention or the feed-forward network.
        # Its output is added back to the input (the residual connection) and the sum is normalized, which is the "Add & Norm" step in the paper.
        return self.normalize(x + sub_layer(x))
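
The masking step inside `attention` can be checked in isolation. The sketch below is a standalone version of scaled dot-product attention written for illustration only; the function name `scaled_dot_product_attention` and the toy shapes are assumptions, not part of the class above. It shows that `masked_fill` returns a new tensor (so the result has to be assigned) and that masked key positions end up with near-zero attention weight.

```python
import math
import torch


def scaled_dot_product_attention(query, key, value, mask=None):
    """Illustrative helper (not the notebook's method).

    query/key/value: (batch, h, seq_len, d_key)
    mask: broadcastable to (batch, h, seq_len_q, seq_len_k); 0 means "hide".
    """
    d_key = query.shape[-1]
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_key)
    if mask is not None:
        # masked_fill is out-of-place, so the result must be assigned back.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    return weights @ value, weights


torch.manual_seed(0)
batch, h, seq_len, d_key = 2, 4, 5, 8
q = torch.randn(batch, h, seq_len, d_key)
k = torch.randn(batch, h, seq_len, d_key)
v = torch.randn(batch, h, seq_len, d_key)

# Hide the last key position from every query.
mask = torch.ones(batch, 1, 1, seq_len)
mask[..., -1] = 0

output, weights = scaled_dot_product_attention(q, k, v, mask)
print(output.shape)             # torch.Size([2, 4, 5, 8])
print(weights[..., -1].max())   # effectively zero: the masked column gets no weight
```

Each row of `weights` still sums to 1, because the softmax redistributes the probability mass over the unmasked positions.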

In [6]:
class EncoderBlock(nn.Module):
    def __init__(self,
                 d_model : int ,
                 h : int , 
                 dropout : float,
                 feed_forward_output : int , 
                ):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model=d_model,h=h,dropout=dropout)
        self.feed_forward_block = FeedForwardNeuralNetwork(input_dim=d_model,
                                                           output_dim = feed_forward_output,
                                                           p_dropout=dropout)
        # Each encoder block has two sub-layers (self-attention and feed-forward), and each one is wrapped in its own residual connection.
        # Keeping the two connections in a ModuleList registers their parameters with PyTorch and lets forward() pick them by index.
        
        self.skip_connections  = nn.ModuleList([ResidualConnection() for _ in range(2)])
        
    def forward(self,x:torch.Tensor,mask:torch.Tensor):
        # first we pass the input to the multi-head attention layer. 
        x1 = self.skip_connections[0](x,lambda x: self.self_attention(x,x,x,mask))
        x2 = self.skip_connections[1](x1,lambda x1:self.feed_forward_block(x1))
        return x2        
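
The "norm" half of the add-and-norm step used by these blocks can be sanity-checked numerically. The sketch below re-implements the same formula as a plain function (the name `layer_norm` and the toy shapes are assumptions for illustration) and compares it with `torch.nn.LayerNorm`. The two are not identical: `x.std` uses the unbiased estimator (dividing by N-1) by default and adds epsilon outside the square root, whereas `nn.LayerNorm` uses the biased variance with epsilon inside the square root.

```python
import math
import torch
import torch.nn as nn


def layer_norm(x, alpha, beta, eps=1e-8):
    # Same formula as the hand-written layer: normalize over the last dimension.
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)  # unbiased by default (divides by N - 1)
    return alpha * (x - mean) / (std + eps) + beta


torch.manual_seed(0)
features = 16
x = torch.randn(2, 5, features)

# Identity scale and zero shift, so only the normalization itself is visible.
out = layer_norm(x, alpha=torch.ones(1), beta=torch.zeros(1))
print(out.mean(dim=-1).abs().max())  # close to 0
print(out.std(dim=-1).mean())        # close to 1

# nn.LayerNorm divides by the biased standard deviation, so it differs
# from the version above by a constant factor sqrt(N / (N - 1)).
reference = nn.LayerNorm(features, eps=1e-8)(x)
scale = math.sqrt(features / (features - 1))
print(torch.allclose(reference, out * scale, atol=1e-5))
```

For `d_model = 512` the factor is about 1.001, so the difference is small in practice, but it explains why the outputs do not match `nn.LayerNorm` exactly.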

In [None]:


class Encoder(nn.Module):
    def __init__(self,
                 num_stacks: int = 6,
                 d_model : int = 512 , 
                 h : int = 8 , 
                 dropout : float = 0.1 , 
                 feed_forward_output : int = 2048 
                ):
        super().__init__()
        self.normalize = LayerNormalization()
        
        self.encoder_list = nn.ModuleList([EncoderBlock(d_model=d_model,
                                                        h = h,
                                                        dropout=dropout,
                                                        feed_forward_output=feed_forward_output)
                                           for _ in range(num_stacks)])
    
    def forward(self,x,mask):
        for layers in self.encoder_list:
            x = layers(x,mask)
        return self.normalize(x) # TODO: Find out where this final normalization is prescribed and add that to the notes.

In [8]:
class DecoderBlock(nn.Module):
    def __init__(self,
                 d_model : int ,
                 h : int , 
                 dropout : float,
                 feed_forward_output : int , 
                ) -> None:
        super().__init__()
        self.self_attention = MultiHeadAttention(
                                                 d_model=d_model,
                                                 h = h,
                                                 dropout=dropout
                                                )
        
        self.cross_attention = MultiHeadAttention(
                                                 d_model=d_model,
                                                 h = h,
                                                 dropout=dropout
                                                 )
        
        self.feed_forward_layers = FeedForwardNeuralNetwork(
                                                            input_dim=d_model,
                                                            output_dim=feed_forward_output,
                                                            p_dropout=dropout
                                                           )
        
        self.residual_connection = nn.ModuleList([ResidualConnection() for _ in range(3)])
        
    def forward(self,x:torch.Tensor,
                encoder_outputs : torch.Tensor,
                source_mask:torch.Tensor,
                target_mask:torch.Tensor,
                ):
        # source_mask hides padding positions of the source sequence from the attention over the encoder output.
        # target_mask additionally hides future target positions, so each position attends only to earlier ones
        # (the decoder self-attention masking described in section 3.2.3 of the paper).
        x1 = self.residual_connection[0](x , lambda x: self.self_attention(x,x,x,target_mask))
        # Cross-attention ("encoder-decoder attention", section 3.2.3 of the paper): the queries come from the decoder (x1),
        # while the keys and values come from the encoder output. MultiHeadAttention.forward takes (query, key, value, mask),
        # so the call is cross_attention(x1, encoder_outputs, encoder_outputs, source_mask), as in the tutorial.
        # This also keeps the output at the target sequence length, which the residual addition with x1 requires.
        
        
        x2 = self.residual_connection[1](x1, lambda x1:self.cross_attention(x1,encoder_outputs,encoder_outputs,source_mask))
        x3 = self.residual_connection[2](x2,lambda x2 : self.feed_forward_layers(x2))
        return x3
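
The shapes involved in cross-attention are easy to verify with plain tensors. The sketch below is a single-head version without the learned projections, which is a simplification for illustration; all names and sizes are assumptions. It uses the decoder state as the query and the encoder output as key and value, and shows that the result has the target sequence length, so it can be added back to the decoder state in the residual connection.

```python
import math
import torch

torch.manual_seed(0)
batch, d_model = 2, 16
source_len, target_len = 7, 4  # deliberately different lengths

encoder_output = torch.randn(batch, source_len, d_model)
decoder_state = torch.randn(batch, target_len, d_model)

# Queries come from the decoder; keys and values come from the encoder.
query = decoder_state
key = value = encoder_output

scores = query @ key.transpose(-2, -1) / math.sqrt(d_model)
weights = scores.softmax(dim=-1)   # one distribution over source positions per target position
context = weights @ value

print(scores.shape)    # torch.Size([2, 4, 7])
print(context.shape)   # torch.Size([2, 4, 16])

# The residual addition only works because context keeps the target length.
residual = decoder_state + context
```

Swapping the roles (encoder output as query) would produce a tensor of length `source_len`, and the addition with the decoder state would fail whenever the two lengths differ.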
    

In [9]:
class Decoder(nn.Module):
    def __init__(self,
                 num_stack:int = 6,
                 d_model : int = 512,
                 h : int = 8, 
                 dropout : float = 0.1,
                 feed_forward_output : int = 2048 ,
                 ) -> None:
        super().__init__()
        self.normalize = LayerNormalization()
        
        self.decoder_list = nn.ModuleList([DecoderBlock(
                                                        d_model=d_model,
                                                        h = h,
                                                        dropout=dropout,
                                                        feed_forward_output=feed_forward_output
                                                      ) for _ in range(num_stack)])
    
    def forward(self,x,encoder_output,source_mask,target_mask):
        for layer in self.decoder_list:
            x = layer(x,encoder_output,source_mask,target_mask)
        return self.normalize(x) # TODO: Find out where this final normalization is prescribed and add that to the notes.
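
The two masks passed to the decoder can be built from the token ids. The following is a minimal sketch under stated assumptions: the padding id `PAD_ID = 0` and the helper names `make_source_mask` and `make_target_mask` are hypothetical, and the masks use the convention of the attention code, where positions with value 0 are hidden via `mask == 0`.

```python
import torch

PAD_ID = 0  # hypothetical padding token id


def make_source_mask(source_ids, pad_id=PAD_ID):
    # True for real tokens, False for padding.
    # Shape (batch, 1, 1, source_len) broadcasts over heads and query positions.
    return (source_ids != pad_id).unsqueeze(1).unsqueeze(2)


def make_target_mask(target_ids, pad_id=PAD_ID):
    target_len = target_ids.shape[1]
    padding = (target_ids != pad_id).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, T)
    # Lower-triangular matrix: position i may attend to positions <= i only.
    causal = torch.tril(torch.ones(target_len, target_len, dtype=torch.bool))
    return padding & causal  # broadcasts to (batch, 1, T, T)


source_ids = torch.tensor([[5, 8, 3, 0, 0]])
target_ids = torch.tensor([[1, 7, 4, 0]])

source_mask = make_source_mask(source_ids)
target_mask = make_target_mask(target_ids)

print(source_mask.shape)  # torch.Size([1, 1, 1, 5])
print(target_mask[0, 0].int())
# rows: [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 0]
```

The last column of the target mask is all zeros because the fourth target token is padding; the triangular part is what stops a position from looking at later tokens during training.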

In [10]:
class ProjectionLayer(nn.Module):
    def __init__(self, vocab_size : int , d_model : int = 512) -> None:
        super().__init__()
        self.linear_layer = nn.Linear(in_features=d_model,out_features=vocab_size)
        
    
    def forward(self,x:torch.Tensor):
        return torch.log_softmax(self.linear_layer(x),dim = -1)
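
Because the projection layer returns log-probabilities rather than raw logits, it pairs with `nn.NLLLoss` during training; feeding its output into `nn.CrossEntropyLoss` would apply the log-softmax a second time. The sketch below uses random logits in place of the linear layer's output (sizes are arbitrary assumptions) to show the equivalence and a greedy token choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab_size = 2, 3, 10

# Stand-in for self.linear_layer(x): raw scores over the vocabulary.
logits = torch.randn(batch, seq_len, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

# Exponentiating recovers a proper probability distribution per position.
print(log_probs.exp().sum(dim=-1))  # every entry is close to 1

# Greedy decoding: pick the most likely token at each position.
next_tokens = log_probs.argmax(dim=-1)  # shape (batch, seq_len)

# NLLLoss on log-probabilities matches CrossEntropyLoss on the raw logits.
targets = torch.randint(0, vocab_size, (batch, seq_len))
nll = nn.NLLLoss()(log_probs.view(-1, vocab_size), targets.view(-1))
ce = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), targets.view(-1))
print(torch.allclose(nll, ce))  # True
```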
    

In [11]:
class Transformer(nn.Module):
    def __init__(self,
                 encoder:Encoder,
                 decoder:Decoder,
                 source_embedding: InputEmbeddings,
                 target_embedding: InputEmbeddings,
                 source_positional_encoding:PositionalEncoding,
                 target_positional_encoding:PositionalEncoding,
                 projection_layer:ProjectionLayer
                ):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.source_embedding = source_embedding
        self.target_embedding = target_embedding
        self.source_positional_encoding = source_positional_encoding
        self.target_positional_encoding = target_positional_encoding
        self.projection_layer = projection_layer
    def encode(self,
               source,
               source_mask
              ):
        input_embeddings = self.source_embedding(source)
        input_embeddings_with_positional_encodings = self.source_positional_encoding(input_embeddings)
        encodings = self.encoder(input_embeddings_with_positional_encodings,
                                 source_mask)
        return encodings
    
    
    def decode(self,encoder_output,source_mask,target,target_mask):
        target_embeddings = self.target_embedding(target)
        target_embeddings_with_positional_encodings = self.target_positional_encoding(target_embeddings)
        decodings = self.decoder(target_embeddings_with_positional_encodings,
                                 encoder_output,
                                 source_mask,
                                 target_mask)
        return decodings
    
    def project(self,x):
        return self.projection_layer(x)
    
    
    
    

In [None]:
def build_transformer(input_vocab_size:int,
                      target_vocab_size:int,
                      max_input_seq_lenght : int ,
                      max_target_seq_lenght : int ,
                      d_model : int  = 512,
                      dropout : float = 0.1, 
                      num_stacks : int = 6, 
                      num_attention_heads : int = 8, 
                      feed_forward_num_out : int = 2048)->Transformer:
    encoder= Encoder(
                      num_stacks=num_stacks,
                      d_model=d_model,
                      h = num_attention_heads,
                      dropout=dropout,
                      feed_forward_output=feed_forward_num_out)
    decoder= Decoder(num_stack=num_stacks,
                     d_model=d_model,
                     h=num_attention_heads,
                     dropout=dropout,
                     feed_forward_output=feed_forward_num_out)
    source_embedding=  InputEmbeddings(
                                        d_model=d_model,
                                        vocab_size=input_vocab_size
                                      )
    target_embedding=  InputEmbeddings(
                                        d_model=d_model,
                                        vocab_size=target_vocab_size
                                      )
    source_positional_encoding = PositionalEncoding(
                                                     d_model=d_model,
                                                     sequence_length=max_input_seq_lenght,
                                                     p_drop=dropout
                                                   )
    target_positional_encoding= PositionalEncoding(
                                                    d_model=d_model,
                                                    sequence_length=max_target_seq_lenght,
                                                    p_drop=dropout  # same dropout as the source encoding
                                                  )
    projection_layer= ProjectionLayer(  
                                        vocab_size=target_vocab_size,
                                        d_model=d_model
                                     )
    
    transformer_object = Transformer(
                                      encoder=encoder,
                                      decoder=decoder,
                                      source_embedding=source_embedding,
                                      target_embedding=target_embedding,
                                      source_positional_encoding=source_positional_encoding,
                                      target_positional_encoding=target_positional_encoding,
                                      projection_layer=projection_layer)
    
    
    # for p in transformer_object.parameters():
    #   print(p)
    return transformer_object 
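
Printing every parameter tensor, as in the commented-out loop, produces a lot of output for a full model. A lighter check is to list parameter names with their shapes and count the total. The sketch below does this on a small stand-in module; the `nn.Sequential` model and the helper `count_parameters` are hypothetical and only serve to keep the example self-contained. The same two calls should work on the object returned by `build_transformer`, since it is an `nn.Module`.

```python
import torch.nn as nn

# Hypothetical stand-in for the real model, small enough to count by hand.
model = nn.Sequential(
    nn.Embedding(100, 16),  # 100 * 16          = 1600 parameters
    nn.Linear(16, 32),      # 16 * 32 + 32      =  544 parameters
    nn.Linear(32, 100),     # 32 * 100 + 100    = 3300 parameters
)


def count_parameters(module: nn.Module) -> int:
    # Count only the parameters that will receive gradients.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


for name, parameter in model.named_parameters():
    print(name, tuple(parameter.shape))

total = count_parameters(model)
print(total)  # 5444
```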