# 📘 Scaled Dot-Product Attention

## 1. `scaled_dot_product_attention.py`

This module implements the **Scaled Dot-Product Attention** mechanism using **NumPy**, following the formulation used in Transformers.

---

## 🔢 Formula

For a single attention head:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right)V
$$

Where:

| Symbol | Meaning      | Shape                   |
| ------ | ------------ | ----------------------- |
| **Q**  | Query matrix | `(..., seq_len_q, d_k)` |
| **K**  | Key matrix   | `(..., seq_len_k, d_k)` |
| **V**  | Value matrix | `(..., seq_len_k, d_v)` |
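
To make the formula concrete, here is a small worked example with hand-checkable numbers. The toy values for `Q`, `K`, and `V` are hypothetical and chosen only for illustration (one query, two keys, `d_k = 2`, `d_v = 1`). The max-subtraction trick is left out because the scores are tiny.

```python
import numpy as np

# Hypothetical toy inputs: one query, two keys, d_k = 2, d_v = 1.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[10.0],
              [20.0]])

d_k = K.shape[-1]

# Scaled scores: Q K^T / sqrt(d_k), shape (1, 2)
scores = Q @ K.T / np.sqrt(d_k)

# Row-wise softmax turns the scores into weights that sum to 1
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Weighted sum of the value rows, shape (1, 1)
output = weights @ V

print(scores)   # approx. [[0.7071, 0.0]]
print(weights)  # approx. [[0.6698, 0.3302]]
print(output)   # approx. [[13.3024]]
```

The query matches the first key more closely, so the output lands nearer to the first value (10) than to the second (20).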

---

## ✨ Features

This implementation:

* Uses **NumPy** for all matrix operations
* Computes a **numerically stable softmax**
* Produces:

  * **Attention weights** → shape `(..., seq_len_q, seq_len_k)`
  * **Context vectors** → shape `(..., seq_len_q, d_v)`
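
To illustrate why the numerically stable softmax matters, the sketch below compares it with a naive version on large logits. The names `naive_softmax` and `stable_softmax` are illustrative only and are not part of the module; the stable variant uses the usual subtract-the-maximum trick.

```python
import numpy as np


def naive_softmax(x, axis=-1):
    # Exponentiates the raw scores directly; overflows for large inputs.
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def stable_softmax(x, axis=-1):
    # Subtracting the max leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so np.exp cannot overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


logits = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow / invalid-value warnings raised by the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)

stable = stable_softmax(logits)
reference = stable_softmax(np.array([0.0, 1.0, 2.0]))

print("naive: ", naive)    # every entry is nan (inf / inf)
print("stable:", stable)   # a valid probability vector
```

Because softmax is invariant to adding a constant to all scores, the stable result for `[1000, 1001, 1002]` matches the result for `[0, 1, 2]`.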



```python
from typing import Tuple

import numpy as np
```

```python
def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """
    Compute a numerically stable softmax along a given axis.

    Parameters
    ----------
    x : np.ndarray
        Input array of scores (logits).
    axis : int, optional
        Axis along which to apply softmax. Defaults to -1.

    Returns
    -------
    np.ndarray
        Softmax-normalized probabilities with the same shape as `x`.
    """
    # Subtract max for numerical stability: softmax(x) == softmax(x - max(x))
    x_max = np.max(x, axis=axis, keepdims=True)
    shifted = x - x_max

    exp_x = np.exp(shifted)
    sum_exp_x = np.sum(exp_x, axis=axis, keepdims=True)

    return exp_x / sum_exp_x
```


```python
def scaled_dot_product_attention(
    q: np.ndarray,
    k: np.ndarray,
    v: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute scaled dot-product attention.

    Parameters
    ----------
    q : np.ndarray
        Query matrix. Shape: (..., seq_len_q, d_k)
    k : np.ndarray
        Key matrix. Shape: (..., seq_len_k, d_k)
    v : np.ndarray
        Value matrix. Shape: (..., seq_len_k, d_v)

    Returns
    -------
    Tuple[np.ndarray, np.ndarray]
        A tuple containing:
        - attention_weights: shape (..., seq_len_q, seq_len_k)
        - context: shape (..., seq_len_q, d_v)

    Notes
    -----
    The leading dimensions `...` (for example, batch size or number of heads)
    are preserved. `np.matmul` broadcasts them, so `q`, `k`, and `v` must have
    identical or broadcast-compatible leading dimensions.
    """
    # Get the dimensionality of the keys (d_k) from the last dimension of K
    d_k = k.shape[-1]

    # 1. Compute raw attention scores: Q K^T
    #    Using matrix multiplication on the last two dimensions:
    #    (..., seq_len_q, d_k) @ (..., d_k, seq_len_k) -> (..., seq_len_q, seq_len_k)
    scores = np.matmul(q, np.swapaxes(k, -1, -2))

    # 2. Scale scores by sqrt(d_k) so that large dot products do not saturate
    #    the softmax, which would make its gradients very small
    scores = scores / np.sqrt(d_k)

    # 3. Normalize scores into probabilities using softmax
    attention_weights = softmax(scores, axis=-1)

    # 4. Compute the weighted sum of values:
    #    (..., seq_len_q, seq_len_k) @ (..., seq_len_k, d_v)
    #    -> (..., seq_len_q, d_v)
    context = np.matmul(attention_weights, v)

    return attention_weights, context
```
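
The division by `sqrt(d_k)` in step 2 can be motivated empirically. The sketch below assumes query and key entries are independent with zero mean and unit variance (an assumption made for illustration, not something the module enforces). Under that assumption the variance of a raw dot product grows roughly in proportion to `d_k`, while the scaled version stays near 1. The `variances` dictionary is a helper introduced just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Maps d_k -> (variance of raw dot products, variance after scaling)
variances = {}

for d_k in (4, 64, 512):
    # Assumed input distribution: i.i.d. standard normal entries
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))

    raw = np.sum(q * k, axis=-1)      # one dot product per sample
    scaled = raw / np.sqrt(d_k)       # same scaling as in the function above

    variances[d_k] = (raw.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(raw)={raw.var():8.2f}  var(scaled)={scaled.var():.2f}")
```

Without the scaling, larger `d_k` produces scores of larger magnitude, which pushes the softmax toward near one-hot outputs.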



```python
# Simple demo to show shapes and that the function runs correctly.
np.random.seed(42)

batch_size = 2
seq_len = 4
d_k = 8
d_v = 8

# Random toy inputs
q = np.random.randn(batch_size, seq_len, d_k)
k = np.random.randn(batch_size, seq_len, d_k)
v = np.random.randn(batch_size, seq_len, d_v)

attn_weights, context = scaled_dot_product_attention(q, k, v)

print("Attention weights shape:", attn_weights.shape)  # (2, 4, 4)
print("Context shape:", context.shape)                # (2, 4, 8)
```

Output:

```
Attention weights shape: (2, 4, 4)
Context shape: (2, 4, 8)
```
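
The demo above uses a single leading batch dimension and equal sizes throughout. As a further sanity check, the sketch below uses two leading dimensions (batch and heads), different query and key lengths, and `d_v != d_k`. So that the block runs on its own, it restates the computation in a compact helper named `attention`, which is a hypothetical stand-in for `scaled_dot_product_attention`.

```python
import numpy as np


def attention(q, k, v):
    # Compact restatement of scaled dot-product attention (illustrative only).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, weights @ v


rng = np.random.default_rng(0)
batch, heads, len_q, len_k, d_k, d_v = 2, 3, 5, 7, 8, 16

q = rng.standard_normal((batch, heads, len_q, d_k))
k = rng.standard_normal((batch, heads, len_k, d_k))
v = rng.standard_normal((batch, heads, len_k, d_v))

weights, context = attention(q, k, v)

print(weights.shape)  # (2, 3, 5, 7)
print(context.shape)  # (2, 3, 5, 16)

# Each query's weights over the keys should form a probability distribution.
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
```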
