# Attention Mechanisms 

This notebook walks through the attention mechanism used in machine learning models for tasks such as machine translation. It shows how attention lets the decoder focus on the most relevant parts of the input sequence, and it illustrates queries, keys, and values with a small, step-by-step NumPy implementation.


## Libraries/packages used

* [<code style="color:blue;">numpy:</code>](https://numpy.org/doc/) A powerful library for numerical computing, providing support for multi-dimensional arrays, mathematical functions, and array operations

* [<code style="color:blue;">scipy:</code>](https://docs.scipy.org/doc/)  A library that provides additional scientific computing functionalities

To run this code on your local machine, install both packages from the command line:
* `pip install numpy`
* `pip install scipy`

<a id='TOC'></a>  
## Table of contents  

1. <a href="#attention_mechanism">Attention mechanism</a><br>  
2. <a href="#required_imports">Required imports</a><br>  
3. <a href="#word_embeddings">Word embeddings</a><br>  
4. <a href="#weight_matrices">Creating weight matrices</a><br>  
5. <a href="#queries_keys_values">Queries, keys, and values</a><br>  
6. <a href="#calculating_scores">Calculating scores</a><br>  
7. <a href="#softmax_weights">Applying softmax weights</a><br>  
8. <a href="#calculating_attention">Calculating attention</a><br>  

<a id='attention_mechanism'></a>  
## 1. Attention mechanism  
[Back to table of contents](#TOC)  

The attention mechanism enhances the encoder-decoder model by allowing the decoder to focus on the most relevant parts of the input sequence. It computes weighted sums of the input vectors, giving higher weights to the most relevant components.  
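
The steps in the sections that follow (scores, scaling, softmax, weighted sum) can be summarized in one expression. This is a compact restatement of what the notebook computes, not an additional method; the symbol $d_k$ is introduced here as shorthand for the number of columns in the key matrix $K$.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied to each row separately, so every row of the result is a weighted sum of the rows of $V$.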

<a id='required_imports'></a>  
## 2. Required imports  
[Back to table of contents](#TOC)  

Below are the functions needed for this implementation.  



In [2]:
# importing required functions  
from numpy import array  # for creating and handling arrays  
from numpy import random  # for generating random numbers  
from scipy.special import softmax  # for applying the softmax function

<a id='word_embeddings'></a>  
## 3. Word embeddings  
[Back to table of contents](#TOC)  

Here we define the embeddings for four words, which we will use later to compute attention. In real-world applications these embeddings would typically be produced by an encoder; for this example we define them by hand as 3-dimensional vectors.

* [<code style="color:blue;">array</code>](https://numpy.org/doc/stable/reference/generated/numpy.array.html)  


In [3]:
# manually defining word embeddings for four words  
word_1 = array([1, 0, 0])  
word_2 = array([0, 1, 0])  
word_3 = array([1, 1, 0])  
word_4 = array([0, 0, 1])  

In [4]:
# stacking all word embeddings into one array  
words = array([word_1, word_2, word_3, word_4])  

In [5]:
# printing the word embeddings  
print("Word embeddings:\n", words)  

Word embeddings:
 [[1 0 0]
 [0 1 0]
 [1 1 0]
 [0 0 1]]
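
As a quick sanity check, the sketch below reads the two axes of the stacked array. The names `seq_len` and `d_model` are labels chosen here for illustration; they do not appear elsewhere in the notebook.

```python
import numpy as np

# the same four hand-written embeddings as above, one row per word
words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# rows = words in the sequence, columns = embedding dimensions
seq_len, d_model = words.shape  # illustrative names, not from the notebook
print("sequence length:", seq_len)      # 4
print("embedding dimension:", d_model)  # 3
```

Every weight matrix in the next step needs `d_model` rows so that `words @ W` is defined.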


<a id='weight_matrices'></a>  
## 4. Creating weight matrices  
[Back to table of contents](#TOC)  

This step creates the weight matrices that turn the word embeddings into queries, keys, and values. Here we fill them with random integers (0, 1, or 2) and fix the seed so the results are reproducible. In real-world applications these matrices would be learned during training.

* [<code style="color:blue;">random</code>](https://numpy.org/doc/stable/reference/random/index.html)  

* [<code style="color:blue;">randint</code>](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html)  
* [<code style="color:blue;">seed</code>](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html)  

In [6]:
# setting a seed for reproducibility  
random.seed(42)  

# defining weight matrices for query, key, and value  
W_Q = random.randint(3, size=(3, 3))  # query weights  
W_K = random.randint(3, size=(3, 3))  # key weights  
W_V = random.randint(3, size=(3, 3))  # value weights  

In [7]:
# printing the weight matrices  
print("Query weights (W_Q):\n", W_Q)  
print("Key weights (W_K):\n", W_K)  
print("Value weights (W_V):\n", W_V)  

Query weights (W_Q):
 [[2 0 2]
 [2 0 0]
 [2 1 2]]
Key weights (W_K):
 [[2 2 2]
 [0 2 1]
 [0 1 1]]
Value weights (W_V):
 [[1 1 0]
 [0 1 1]
 [0 0 0]]
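
NumPy also provides a newer `Generator` interface for random numbers. The sketch below shows how the same kind of matrices could be drawn with `numpy.random.default_rng`. It is an alternative, not what this notebook uses: the generator produces a different random stream, so the values will not match the matrices printed above. The `_alt` names are hypothetical.

```python
import numpy as np

# Generator-based alternative to random.seed(...) + random.randint(...).
# Assumption: used here only for illustration; the rest of the notebook
# keeps the legacy seeded matrices shown above.
rng = np.random.default_rng(42)

d_model = 3  # embedding dimension (illustrative name)

# integers(low, high) draws from [low, high), so the values are 0, 1, or 2
W_Q_alt = rng.integers(0, 3, size=(d_model, d_model))
W_K_alt = rng.integers(0, 3, size=(d_model, d_model))
W_V_alt = rng.integers(0, 3, size=(d_model, d_model))

print("Alternative query weights:\n", W_Q_alt)
```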


<a id='queries_keys_values'></a>  
## 5. Queries, keys, and values  
[Back to table of contents](#TOC)  

The **query (Q)**, **key (K)**, and **value (V)** concept can be compared to how retrieval systems work. For instance, when you search for videos on YouTube, your **query (Q)**, the text entered in the search bar, is compared against a set of **keys (K)**, such as video titles, descriptions, or metadata stored in the database. The system then retrieves and presents the most relevant results as **values (V)**, the videos themselves. In this notebook, each matrix is obtained by multiplying the word embeddings by the corresponding weight matrix, so every word gets its own query, key, and value vector.

* [<code style="color:blue;">@ (Matrix Multiplication)</code>](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html)  

In [8]:
# computing query, key, and value matrices  
Q = words @ W_Q  # query  
K = words @ W_K  # key  
V = words @ W_V  # value  

# printing the matrices  
print("Query matrix (Q):\n", Q)  
print("Key matrix (K):\n", K)  
print("Value matrix (V):\n", V)  

Query matrix (Q):
 [[2 0 2]
 [2 0 0]
 [4 0 2]
 [2 1 2]]
Key matrix (K):
 [[2 2 2]
 [0 2 1]
 [2 4 3]
 [0 1 1]]
Value matrix (V):
 [[1 1 0]
 [0 1 1]
 [1 2 1]
 [0 0 0]]
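
The matrix product projects all four words at once. The sketch below, using the `W_Q` values printed earlier, checks that projecting one word at a time gives the same rows. It also shows that the projection is linear: because `word_3 = word_1 + word_2`, its query is the sum of their queries. The name `Q_rows` is introduced here for illustration.

```python
import numpy as np

words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# query weights as printed in section 4
W_Q = np.array([[2, 0, 2],
                [2, 0, 0],
                [2, 1, 2]])

# all words at once
Q = words @ W_Q

# one word at a time (illustrative, slower, same result)
Q_rows = np.array([word @ W_Q for word in words])
print(np.array_equal(Q, Q_rows))  # True

# linearity: the query of word_3 equals query(word_1) + query(word_2)
print(Q[2], Q[0] + Q[1])  # [4 0 2] [4 0 2]
```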


<a id='calculating_scores'></a>  
## 6. Calculating scores  
[Back to table of contents](#TOC)  

The score determines how much focus is placed on each part of the input sentence when encoding the word at a given position. Entry `scores[i, j]` is the dot product of the query vector of word `i` with the key vector of word `j`, so multiplying `Q` by the transpose of `K` produces all pairwise scores at once.

* [<code style="color:blue;">transpose</code>](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.transpose.html)  

In [9]:
# calculating attention scores  
scores = Q @ K.transpose()  

# printing the attention scores  
print("Attention scores:\n", scores) 

Attention scores:
 [[ 8  2 10  2]
 [ 4  0  4  0]
 [12  2 14  2]
 [10  4 14  3]]
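
To make a single entry concrete, the sketch below recomputes the score of word 1 (as query) against word 3 (as key) by hand, using the `Q` and `K` values printed in section 5. The variable name `score_1_3` is chosen here for illustration.

```python
import numpy as np

# Q and K as printed in section 5
Q = np.array([[2, 0, 2],
              [2, 0, 0],
              [4, 0, 2],
              [2, 1, 2]])
K = np.array([[2, 2, 2],
              [0, 2, 1],
              [2, 4, 3],
              [0, 1, 1]])

# query of word 1 (row 0) against key of word 3 (row 2)
score_1_3 = np.dot(Q[0], K[2])  # 2*2 + 0*4 + 2*3
print(score_1_3)  # 10

# the full matrix holds the same value at row 0, column 2
scores = Q @ K.T
print(scores[0, 2])  # 10
```

Because queries and keys come from different weight matrices, the score matrix is generally not symmetric: here `scores[0, 1]` is 2 while `scores[1, 0]` is 4.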


<a id='softmax_weights'></a>  
## 7. Applying softmax weights  
[Back to table of contents](#TOC)  

Next, we scale the scores by dividing them by the square root of the key dimension (the number of columns in `K`, here 3), which helps stabilize gradients. The `softmax` function is then applied to each row, making every entry positive and making each row sum to 1. Row `i` of the result gives the relative emphasis that word `i` places on every word in the sequence.

* [<code style="color:blue;">softmax</code>](https://docs.scipy.org/doc/scipy-1.15.0/reference/generated/scipy.special.softmax.html)  
* [<code style="color:blue;">shape</code>](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html)  

In [10]:
# scaling and applying softmax to the scores  
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)  # axis=1 means row-wise  

# printing the attention weights  
print("Attention weights:\n", weights)  

Attention weights:
 [[2.36089863e-01 7.38987555e-03 7.49130386e-01 7.38987555e-03]
 [4.54826323e-01 4.51736775e-02 4.54826323e-01 4.51736775e-02]
 [2.39275049e-01 7.43870015e-04 7.59237211e-01 7.43870015e-04]
 [8.99501754e-02 2.81554063e-03 9.05653685e-01 1.58059922e-03]]
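
For readers who want to see what `softmax` does internally, here is a minimal NumPy sketch of a row-wise softmax applied to the scaled scores. The helper `row_softmax` is a hypothetical name written for this illustration; the notebook itself uses `scipy.special.softmax`. Subtracting the row maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np

# attention scores as printed in section 6
scores = np.array([[ 8, 2, 10, 2],
                   [ 4, 0,  4, 0],
                   [12, 2, 14, 2],
                   [10, 4, 14, 3]])
d_k = 3  # number of columns in K


def row_softmax(x):
    """Softmax over each row of a 2-D array (illustrative helper)."""
    shifted = x - x.max(axis=1, keepdims=True)  # avoid large exponents
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


weights = row_softmax(scores / np.sqrt(d_k))

# each row sums to 1, up to floating-point rounding
print(weights.sum(axis=1))

# second row, for comparison with the output printed above
print(weights[1])
```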


<a id='calculating_attention'></a>  
## 8. Calculating attention  
[Back to table of contents](#TOC)  

Finally, we multiply the attention weights by the value matrix. Each row of the output is a weighted sum of the value vectors: words with large weights contribute their values almost unchanged, while words with small weights are largely suppressed.

In [11]:
# calculating attention output  
attention = weights @ V  

# printing the attention output  
print("Attention output:\n", attention) 

Attention output:
 [[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
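
The individual steps can also be collected into one function. The sketch below is one possible arrangement under the assumptions of this notebook (2-D arrays, no masking, no batching); the function name `scaled_dot_product_attention` and the variable `manual_row_0` are chosen here and do not appear in the notebook.

```python
import numpy as np
from scipy.special import softmax


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for 2-D Q, K, V. Illustrative sketch only."""
    d_k = K.shape[1]                                  # key dimension
    scores = Q @ K.T                                  # pairwise query-key scores
    weights = softmax(scores / np.sqrt(d_k), axis=1)  # row-wise softmax
    return weights @ V, weights                       # weighted sum of values


# Q, K, V as printed in section 5
Q = np.array([[2, 0, 2], [2, 0, 0], [4, 0, 2], [2, 1, 2]])
K = np.array([[2, 2, 2], [0, 2, 1], [2, 4, 3], [0, 1, 1]])
V = np.array([[1, 1, 0], [0, 1, 1], [1, 2, 1], [0, 0, 0]])

attention, weights = scaled_dot_product_attention(Q, K, V)
print("Attention output:\n", attention)

# the first output row, rebuilt as an explicit weighted sum of value vectors
manual_row_0 = sum(w * v for w, v in zip(weights[0], V))
print(np.allclose(manual_row_0, attention[0]))  # True
```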


