### Implementing the attention mechanism

We will implement four attention mechanisms:
- Simplified self-attention
- Self-attention
- Causal attention
- Multi-head attention

### Why do we need the attention mechanism?

Before diving into how the attention mechanism works, it helps to understand the state-of-the-art model of the time: the RNN encoder-decoder network. It was particularly useful for language translation. An RNN computes its next state from the previous state and the current input. This works well for short sentences and texts. On long sentences it falters, because the state mainly carries information from the most recent steps, and earlier inputs are hard to retain in this architecture.
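To make the recurrence concrete, here is a minimal sketch of a plain RNN update. It assumes a tanh nonlinearity and randomly initialized weights; the names `W_x`, `W_h` and all sizes are illustrative choices, not taken from any specific model.

```python
import torch

torch.manual_seed(0)

d_in, d_hidden, seq_len = 3, 4, 6             # illustrative sizes (assumptions)
W_x = torch.randn(d_in, d_hidden) * 0.5       # input-to-hidden weights (hypothetical)
W_h = torch.randn(d_hidden, d_hidden) * 0.5   # hidden-to-hidden weights (hypothetical)

xs = torch.randn(seq_len, d_in)               # a toy input sequence
h = torch.zeros(d_hidden)                     # initial hidden state

for x_t in xs:
    # the next state depends only on the current input and the previous state
    h = torch.tanh(x_t @ W_x + h @ W_h)

print(h.shape)
```

Every earlier token has to pass through the same fixed-size `h`, which is one way to see why information from early steps tends to fade on long sequences.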


This limitation motivated a new architecture that addresses the RNN encoder-decoder's difficulty with long sequences.

#### Capturing data dependencies with attention mechanisms


In 2014, researchers added an attention mechanism to the decoder of the RNN encoder-decoder (an approach often called Bahdanau attention). With it, the decoder can access the input sequence selectively, depending on the importance attached to each input in the sequence. This raises a question: how do we know the importance of each input in the sequence? That is the job of the self-attention mechanism, which we will see in a bit.


Using an attention mechanism, the text-generating decoder can access all the input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by an attention weight.

### Attending to different parts of the inputs with self-attention




In [None]:
import torch

inputs  = torch.tensor(
    [
        [0.43, 0.15, 0.89], # your
        [0.55, 0.87, 0.66], # journey
        [0.57, 0.85, 0.64], # starts
        [0.22, 0.58, 0.33], # with 
        [0.77, 0.25, 0.10], # one
        [0.05, 0.80, 0.55]  # step
    ]
)

#### How is the attention mechanism calculated?

First, we take the embedding of each word and compute its importance relative to every other word in the sequence. This is done with dot products: multiplying the input embedding matrix by its transpose yields a matrix of attention scores. The scores are then normalized so that each row sums to 1, which gives the attention weights. Finally, the attention weights are multiplied by the inputs to get the context vectors.
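In symbols, the three steps can be sketched as follows. Here $x_i$ is the embedding of token $i$, and $\omega$, $\alpha$ and $z$ are labels introduced here for illustration (they do not appear in the code). The sketch assumes softmax as the normalization:

$$
\omega_{ij} = x_i \cdot x_j, \qquad
\alpha_{ij} = \frac{\exp(\omega_{ij})}{\sum_{k} \exp(\omega_{ik})}, \qquad
z_i = \sum_{j} \alpha_{ij}\, x_j
$$

So $\omega_{ij}$ is an attention score, $\alpha_{ij}$ is an attention weight, and $z_i$ is the context vector for token $i$.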

In [3]:
query = inputs[1]
attention_score_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attention_score_2[i] = torch.dot(x_i, query)
print(attention_score_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
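The loop above can also be written as a single matrix-vector product. Below is a minimal sketch that re-declares `inputs` so it runs on its own; `scores_loop` and `scores_vec` are names introduced here for the comparison.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # your
        [0.55, 0.87, 0.66],  # journey
        [0.57, 0.85, 0.64],  # starts
        [0.22, 0.58, 0.33],  # with
        [0.77, 0.25, 0.10],  # one
        [0.05, 0.80, 0.55],  # step
    ]
)
query = inputs[1]  # second token, "journey"

# explicit loop: one dot product per input token
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# vectorized: every row of `inputs` is dotted with `query` at once
scores_vec = inputs @ query

print(torch.allclose(scores_loop, scores_vec))
```

Both versions give the same six scores. The vectorized form is the one that generalizes to the full `inputs @ inputs.T` product used later.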


In [4]:
# attention scores normalization will lead to the attention weights
attention_weight_2 =  attention_score_2 / attention_score_2.sum()
print("Attention Weights: ", attention_weight_2)
print("Sum: ", attention_weight_2.sum())

Attention Weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum:  tensor(1.0000)


It is advisable to normalize with softmax instead: it handles negative scores correctly (every weight stays positive and the weights still sum to 1), and it is more numerically stable in practice.
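A small sketch of why this matters, using made-up scores (not the ones from the example above). Sum-normalization lets a negative score through as a negative "weight", and a hand-written exp-based softmax overflows on large scores, while `torch.softmax` handles both cases.

```python
import torch

# made-up scores that include a negative value (an assumption for illustration)
scores = torch.tensor([1.0, -0.5, 0.5])

sum_norm = scores / scores.sum()          # keeps the negative entry
soft_norm = torch.softmax(scores, dim=0)  # every entry ends up strictly positive

print("sum-normalized:", sum_norm)
print("softmax:       ", soft_norm)

# large scores overflow a naive exp-based softmax in float32
big = torch.tensor([1000.0, 1001.0, 1002.0])
naive = torch.exp(big) / torch.exp(big).sum()  # inf / inf gives nan
stable = torch.softmax(big, dim=0)             # stays finite

print("naive softmax on large scores:", naive)
print("torch.softmax on large scores:", stable)
```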


In [5]:
attn_weight_2 =  torch.softmax(attention_score_2, dim=0)
print("softmax normalization:  ", attn_weight_2)

softmax normalization:   tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
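With the weights for the second token in hand, its context vector is the weighted sum of all input vectors. Here is a self-contained sketch of that last step; the name `context_vec_2` is introduced here and does not appear in the cells above.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # your
        [0.55, 0.87, 0.66],  # journey
        [0.57, 0.85, 0.64],  # starts
        [0.22, 0.58, 0.33],  # with
        [0.77, 0.25, 0.10],  # one
        [0.05, 0.80, 0.55],  # step
    ]
)
query = inputs[1]

attn_scores_2 = inputs @ query                         # one score per input token
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)   # weights sum to 1

# weighted sum of the input vectors
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print(context_vec_2)
```

The result should agree with the second row of the full context matrix computed further down.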


In [6]:
# now let's compute the attention scores for all inputs
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [8]:
# now normalize the attention scores obtained
attn_weights = torch.softmax(attn_scores, dim=-1) # dim=-1 normalizes over the last dimension, so each row sums to 1
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
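A quick way to check the effect of the `dim` argument is to sum the result along each axis. The sketch below re-declares `inputs` so it is self-contained; `row_norm` and `col_norm` are names introduced here.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # your
        [0.55, 0.87, 0.66],  # journey
        [0.57, 0.85, 0.64],  # starts
        [0.22, 0.58, 0.33],  # with
        [0.77, 0.25, 0.10],  # one
        [0.05, 0.80, 0.55],  # step
    ]
)
attn_scores = inputs @ inputs.T

row_norm = torch.softmax(attn_scores, dim=-1)  # each row becomes a distribution
col_norm = torch.softmax(attn_scores, dim=0)   # each column becomes a distribution

print(row_norm.sum(dim=-1))  # sums along each row
print(col_norm.sum(dim=0))   # sums along each column
```

We want `dim=-1` here because row `i` holds the scores of token `i` against every other token, and those are the values that should add up to 1.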


In [9]:
# compute the context vectors for all input tokens
all_context_vec = attn_weights @ inputs
print("The context vectors for all input tokens:  ", all_context_vec)

The context vectors for all input tokens:   tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
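The three steps (scores, weights, context vectors) can be wrapped in one small helper. This is only a sketch of the simplified, weight-free variant shown above; the function name `simple_self_attention` is hypothetical.

```python
import torch


def simple_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Simplified self-attention without trainable weights.

    x has shape (num_tokens, embed_dim); the result has the same shape.
    """
    attn_scores = x @ x.T                              # (num_tokens, num_tokens)
    attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
    return attn_weights @ x                            # one context vector per token


inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # your
        [0.55, 0.87, 0.66],  # journey
        [0.57, 0.85, 0.64],  # starts
        [0.22, 0.58, 0.33],  # with
        [0.77, 0.25, 0.10],  # one
        [0.05, 0.80, 0.55],  # step
    ]
)

context = simple_self_attention(inputs)
print(context.shape)
```

Each output row is a blend of all input rows, so the output keeps the input's shape.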


### Implementing self-attention with trainable weights