#### Coding attention mechanism

Because the source and target languages have different grammatical structures, text cannot simply be translated word by word. To handle this, deep neural networks for translation use an encoder and a decoder. The encoder first reads in and processes the entire text, and the decoder then produces the translated text.

Before Transformers, RNNs (recurrent neural networks) were the most popular encoder-decoder architecture for translation. An RNN is a neural network in which outputs from previous steps are fed as inputs to the current step, making it well suited to sequential data like text.

Limitation of RNNs: the decoder cannot access earlier hidden states from the encoder during decoding, because it depends only on the current hidden state. This leads to a loss of context.
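
To make this limitation concrete, here is a minimal sketch using `torch.nn.RNN`. The sizes and the random "token embeddings" are hypothetical values chosen only for illustration. The encoder produces one hidden state per token, but a plain encoder-decoder setup passes only the final one to the decoder.

```python
import torch

torch.manual_seed(0)  # reproducible random weights and inputs

# Hypothetical sizes, chosen only for illustration
seq_len, embed_dim, hidden_dim = 6, 3, 4

encoder = torch.nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

# One "sentence" of 6 random token embeddings: shape (batch=1, seq_len, embed_dim)
tokens = torch.rand(1, seq_len, embed_dim)

# all_states: hidden state after every token; final_state: only the last one
all_states, final_state = encoder(tokens)

print(all_states.shape)   # torch.Size([1, 6, 4])
print(final_state.shape)  # torch.Size([1, 1, 4])

# In a plain encoder-decoder RNN only final_state is handed to the decoder,
# so the five earlier hidden states are not directly visible while decoding.
decoder_initial_state = final_state
```

Attention mechanisms address this by letting the decoder look at all encoder states instead of only the last one.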

#### Self-attention mechanism

`Self-attention` is a mechanism that allows each position in the input sequence to consider the relevance of all other positions in the same sequence when computing the representation of that sequence. It is a key component of LLMs based on the transformer architecture.

We have covered the input text and preprocessing parts. Now we move on to the self-attention module.

#### Attending to different parts of the input with self-attention

In self-attention, the “self” refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.


#### Self-attention mechanism without trainable weights

Notation used: x is the input sequence with T elements, written x^(1) to x^(T). Our goal is to find a `context vector` z^(i) for each element x^(i).

A context vector can be seen as an enriched embedding vector. These vectors play an important role in the self-attention mechanism: they create a representation of each element in the input sequence by incorporating information from all other elements in the sequence. Later, trainable weights are added that help the LLM learn to construct these context vectors.

Let's try to create a simple self-attention mechanism:

In [2]:
# Take example from book - Your journey starts with one step - our input sequence with 6 elements
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)

The first step is to compute intermediate values w, called `attention scores`. A note on reading the inputs: if values such as 0.87 and 0.85 are truncated to one decimal place (0.8), the embeddings of the words "journey" and "starts" may appear similar, so compare the full values rather than truncated ones.

We calculate these scores with a dot product. If the query token is the 2nd token x^(2), each score is the dot product of an input token with this embedded query token, giving w21, w22, w23, ..., w2T.

Let's see this concept:

In [5]:
query = inputs[1] # 2nd input token as query, see inputs above
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2) 

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
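
As a cross-check, the same scores can be computed without a loop. The sketch below repeats the inputs so it runs on its own; the names `scores_loop`, `scores_matmul`, and `w_21_by_hand` are introduced here for illustration. It also writes out the first dot product by hand.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your    (x^1)
     [0.55, 0.87, 0.66],  # journey (x^2)
     [0.57, 0.85, 0.64],  # starts  (x^3)
     [0.22, 0.58, 0.33],  # with    (x^4)
     [0.77, 0.25, 0.10],  # one     (x^5)
     [0.05, 0.80, 0.55]]  # step    (x^6)
)
query = inputs[1]  # x^(2)

# Loop version: one dot product per input token
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Vectorized version: a single matrix-vector product
scores_matmul = inputs @ query

# The first score written out element by element: x^(1) . x^(2)
w_21_by_hand = 0.43 * 0.55 + 0.15 * 0.87 + 0.89 * 0.66

print(scores_matmul)
print(round(w_21_by_hand, 4))  # 0.9544
```

A larger dot product means the two vectors are more aligned, which is why "journey" scores highest against itself and nearly as high against "starts".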


The next step is `normalization`. The goal is to obtain attention weights that sum to 1. This is useful for interpretation and for maintaining training stability in an LLM.

Normalizing the attention scores w gives the alpha values, which are the `attention weights`.

In [7]:
attn_scores_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights: ", attn_scores_2_tmp)
print("Sum: ", attn_scores_2_tmp.sum())

Attention weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum:  tensor(1.0000)


In practice, the `softmax` function is used for normalization, as it handles extreme values better and offers more favorable gradient properties during training.

The softmax function also ensures that the attention weights are always positive. This makes the output interpretable as probabilities, where higher weights indicate greater importance.

Let's see this:
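
A small sketch of why positivity matters, using hypothetical scores that include a negative value (they are not taken from the example inputs). Dividing by the sum still gives values that add up to 1, but one of them is negative, while softmax keeps every weight positive.

```python
import torch

# Hypothetical attention scores with a negative entry
scores = torch.tensor([2.0, -1.0, 0.5])

# Dividing by the sum: adds up to 1, but the second "weight" is negative
sum_normalized = scores / scores.sum()

# Softmax: adds up to 1 and every weight is strictly positive
softmax_weights = torch.softmax(scores, dim=0)

print("Sum-normalized:", sum_normalized)
print("Softmax:       ", softmax_weights)
```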

In [8]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


To avoid numerical instability, such as overflow and underflow caused by very large or very small values, we use PyTorch's built-in `torch.softmax`. Our scores here are small, so it gives the same output as above.

In [9]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
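
To see the instability, the sketch below feeds hypothetical large scores to the naive softmax. One common remedy is to subtract the maximum before exponentiating, which leaves the result mathematically unchanged. `softmax_stable` is an illustrative helper and makes no claim about how PyTorch implements `torch.softmax` internally.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Shifting by the maximum keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

# Hypothetical large scores: exp(1000.) overflows to inf in float32
large_scores = torch.tensor([1000.0, 999.0, 998.0])

naive_result = softmax_naive(large_scores)     # inf / inf gives nan
stable_result = softmax_stable(large_scores)   # finite weights that sum to 1
builtin_result = torch.softmax(large_scores, dim=0)

print("Naive:  ", naive_result)
print("Stable: ", stable_result)
print("PyTorch:", builtin_result)
```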


Now that we have the normalized attention weights, the final step is to calculate the context vector z^(2): multiply each embedded input token x^(i) by its corresponding attention weight and sum the resulting vectors.

In other words, the context vector z^(2) is the weighted sum of all input vectors, obtained by multiplying each input vector by its corresponding attention weight.
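
The three steps for the query x^(2) can be summarized in one place. This uses the notation of these notes (w for scores, alpha for weights) and assumes softmax as the normalization:

```latex
w_{2j} = x^{(j)} \cdot x^{(2)}, \qquad
\alpha_{2j} = \frac{\exp(w_{2j})}{\sum_{k=1}^{T} \exp(w_{2k})}, \qquad
z^{(2)} = \sum_{j=1}^{T} \alpha_{2j} \, x^{(j)}
```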

In [12]:
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i
print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
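
The weighted sum can also be written as a single vector-matrix product. The sketch below repeats the inputs so it runs on its own; `context_loop` and `context_matmul` are names introduced here. It checks that both forms agree.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your    (x^1)
     [0.55, 0.87, 0.66],  # journey (x^2)
     [0.57, 0.85, 0.64],  # starts  (x^3)
     [0.22, 0.58, 0.33],  # with    (x^4)
     [0.77, 0.25, 0.10],  # one     (x^5)
     [0.05, 0.80, 0.55]]  # step    (x^6)
)
query = inputs[1]  # x^(2)
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Loop version: accumulate weight * vector for each token
context_loop = torch.zeros(query.shape)
for weight, x_i in zip(attn_weights_2, inputs):
    context_loop += weight * x_i

# Vectorized version: (6,) weights times (6, 3) inputs gives a (3,) vector
context_matmul = attn_weights_2 @ inputs

print(context_matmul)
print(torch.allclose(context_loop, context_matmul))  # True
```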


#### A small example that explains it step by step

inputs = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
attn_weights_2 = torch.tensor([0.1, 0.2, 0.7])

Here's what happens step-by-step:

query = inputs[1]: query is assigned the value [3.0, 4.0].

context_vec_2 = torch.zeros(query.shape): context_vec_2 is initialized as a tensor [0.0, 0.0].

The for loop iterates over each input vector:

* For i = 0, x_i = [1.0, 2.0], the weighted vector is 0.1 * [1.0, 2.0] = [0.1, 0.2]. So, context_vec_2 becomes [0.1, 0.2].
* For i = 1, x_i = [3.0, 4.0], the weighted vector is 0.2 * [3.0, 4.0] = [0.6, 0.8]. So, context_vec_2 becomes [0.1 + 0.6, 0.2 + 0.8] = [0.7, 1.0].
* For i = 2, x_i = [5.0, 6.0], the weighted vector is 0.7 * [5.0, 6.0] = [3.5, 4.2]. So, context_vec_2 becomes [0.7 + 3.5, 1.0 + 4.2] = [4.2, 5.2].
* print(context_vec_2) prints the tensor [4.2, 5.2].
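
The walkthrough above can be run as is. This sketch uses the names `toy_inputs`, `toy_weights`, and `toy_context` (introduced here so the earlier `inputs` and `attn_weights_2` are not overwritten) and prints the running total after each iteration.

```python
import torch

toy_inputs = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_weights = torch.tensor([0.1, 0.2, 0.7])

# Start from zeros with the same shape as one input vector
toy_context = torch.zeros(toy_inputs.shape[1])

for i, x_i in enumerate(toy_inputs):
    toy_context += toy_weights[i] * x_i
    print(f"after i = {i}:", toy_context)

# Final value is approximately [4.2, 5.2], up to floating-point rounding
print(toy_context)
```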

Now we will generalize the above process.

##### Computing attention weights for all input tokens

Our input has 6 tokens, so computing a score for every pair of tokens gives a 6×6 tensor. We saw all the steps above for z^(2), so we modify the code to compute the attention scores for every query token, the first step toward getting all the context vectors:

In [13]:
attn_scores = torch.empty(6,6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
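
The nested loops can be replaced by matrix multiplication, and the remaining two steps (softmax and weighted sum) follow the same pattern as for z^(2). This is a sketch of that generalization; `all_context_vecs` is a name introduced here, and the softmax is applied along the last dimension so that each row sums to 1.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your    (x^1)
     [0.55, 0.87, 0.66],  # journey (x^2)
     [0.57, 0.85, 0.64],  # starts  (x^3)
     [0.22, 0.58, 0.33],  # with    (x^4)
     [0.77, 0.25, 0.10],  # one     (x^5)
     [0.05, 0.80, 0.55]]  # step    (x^6)
)

# Step 1: all pairwise dot products, row i holds the scores for query x^(i+1)
attn_scores = inputs @ inputs.T

# Step 2: normalize each row so its weights sum to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# Step 3: each context vector is a weighted sum of all input vectors
all_context_vecs = attn_weights @ inputs

print(attn_weights.sum(dim=-1))  # every row sums to 1
print(all_context_vecs[1])       # matches the z^(2) computed earlier
```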
