# Attention

> Attention is a mechanism in neural networks that lets a model focus on specific parts of the input sequence by assigning a different weight to each element, which helps it capture dependencies and relationships between elements.

> Note: this page covers scaled dot-product attention only; there are many other attention mechanisms.

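The code in the next section implements one specific variant, scaled dot-product attention. Using the same names as that code (queries $Q$, keys $K$, values $V$, and key dimension $d_k$), the computation can be summarized as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here the softmax is applied row by row, so each row of attention weights sums to 1.
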
## Code

```python
import numpy as np

def softmax(x):
    """Compute softmax values for each row."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # Stability trick
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
    - Q: Query matrix (seq_len, d)
    - K: Key matrix (seq_len, d)
    - V: Value matrix (seq_len, d)

    Returns:
    - output: Attention output matrix (seq_len, d)
    - attention_weights: Attention weight matrix (seq_len, seq_len)
    """
    d_k = Q.shape[-1]  # Dimension of keys (same as queries)

    # Step 1: Compute attention scores (QK^T), scaled by sqrt(d_k)
    scores = np.dot(Q, K.T) / np.sqrt(d_k)

    # Step 2: Apply softmax row-wise to get attention weights
    attention_weights = softmax(scores)

    # Step 3: Multiply attention weights by values (weighted sum)
    output = np.dot(attention_weights, V)

    return output, attention_weights
```

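The "stability trick" in `softmax` subtracts each row's maximum before exponentiating. Softmax is unchanged by adding a constant to every entry of a row, so this does not alter the result, but it keeps `np.exp` from overflowing on large scores. The sketch below illustrates this; `naive_softmax` and `stable_softmax` are hypothetical helper names introduced only for the comparison, and the input values are arbitrary.

```python
import numpy as np

def naive_softmax(x):
    """Softmax without max subtraction (hypothetical helper, for comparison only)."""
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def stable_softmax(x):
    """Same computation as softmax above: shift each row by its maximum first."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # the same row shifted by a constant

# For moderate inputs the two versions agree.
agree_on_small = np.allclose(naive_softmax(small), stable_softmax(small))

# Softmax is shift-invariant, so the stable version gives the same answer on the shifted row.
shift_invariant = np.allclose(stable_softmax(large), stable_softmax(small))

# The naive version overflows: exp(1001) is inf in float64, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_has_nan = bool(np.isnan(naive_softmax(large)).any())

print(agree_on_small, shift_invariant, naive_has_nan)  # True True True
```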

## Test

```python
# Define a small sequence of 3 input token embeddings (each of size 4)
np.random.seed(42)  # For reproducibility
X = np.random.rand(3, 4)  # (seq_len=3, d=4)

# Simulate learned weight matrices (random for now)
W_Q = np.random.rand(4, 4)  # Projection matrix for Q
W_K = np.random.rand(4, 4)  # Projection matrix for K
W_V = np.random.rand(4, 4)  # Projection matrix for V

# Compute Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention output
output, attention_weights = scaled_dot_product_attention(Q, K, V)

# Print results
print("Input embeddings (X):\n", X)
print("\nAttention Weights:\n", attention_weights)
print("\nAttention Output:\n", output)
```

```
Input embeddings (X):
 [[0.37454012 0.95071431 0.73199394 0.59865848]
 [0.15601864 0.15599452 0.05808361 0.86617615]
 [0.60111501 0.70807258 0.02058449 0.96990985]]

Attention Weights:
 [[0.52735889 0.11510842 0.35753269]
 [0.43524369 0.19971446 0.36504185]
 [0.5060941  0.12923692 0.36466898]]

Attention Output:
 [[0.9770141  0.93531103 1.15681926 1.43938704]
 [0.87857753 0.86060494 1.04849229 1.32790919]
 [0.9564023  0.92013827 1.13535784 1.41721063]]
```

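As a sanity check on the output above: each row of the attention weights comes from a softmax, so it should sum to 1 (the first printed row gives 0.52735889 + 0.11510842 + 0.35753269 = 1.0), and each output row is therefore a weighted average of the rows of `V`. The following is a minimal sketch that rebuilds the same setup and checks both properties. The names `rows_sum_to_one` and `within_value_range` are illustrative and not part of the original notebook.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the max-subtraction stability trick."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Same setup as the test cell: 3 tokens, embedding size 4, random projections.
np.random.seed(42)
X = np.random.rand(3, 4)
W_Q = np.random.rand(4, 4)
W_K = np.random.rand(4, 4)
W_V = np.random.rand(4, 4)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (3, 3)
output = weights @ V                               # (3, 4)

# Property 1: every row of weights is a probability distribution.
rows_sum_to_one = np.allclose(weights.sum(axis=-1), 1.0)

# Property 2: each output entry is a weighted average of one column of V,
# so it must lie between that column's minimum and maximum.
tol = 1e-12
within_value_range = bool(
    np.all(output >= V.min(axis=0) - tol) and np.all(output <= V.max(axis=0) + tol)
)

print(rows_sum_to_one, within_value_range)
```

Because the weights in each row are non-negative and sum to 1, the attention output can never leave the range spanned by the value vectors; it only mixes them.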

## References

- [Attention](https://en.wikipedia.org/wiki/Attention_(machine_learning))