# AIO Q3 — Simple Attention

We study a two-input attention-style model:

$$
\alpha_i = \frac{e^{\beta_i}}{e^{\beta_1} + e^{\beta_2}}, \quad
y = \alpha_1 x_1 + \alpha_2 x_2, \quad
L = \tfrac{1}{2}(y - t)^2
$$

Constants (used unless stated otherwise):  
$x_1 = 2,\; x_2 = 4,\; t = 3$  
Logits for forward pass: $\beta_1 = 0,\; \beta_2 = 1$

## Q1 — Forward: Exponentials

Compute $e^{\beta_1}$ and $e^{\beta_2}$ for $\beta_1 = 0,\; \beta_2 = 1$.

## Q2 — Forward: Softmax Weights

Using your results from Q1, compute

$$
\alpha_i = \frac{e^{\beta_i}}{e^{\beta_1} + e^{\beta_2}}, \quad i \in \{1, 2\}.
$$

## Q3 — Forward: Model Output

Given $x_1 = 2,\; x_2 = 4$, compute

$$
y = \alpha_1 x_1 + \alpha_2 x_2.
$$
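
To check hand calculations for Q1 through Q3, the forward pass can be sketched in a few lines of NumPy. The variable names (`exps`, `alpha`, `y`) are illustrative choices, not part of the problem statement.

```python
import numpy as np

# Constants from the problem statement
x = np.array([2.0, 4.0])
beta = np.array([0.0, 1.0])

exps = np.exp(beta)          # Q1: e^{beta_1}, e^{beta_2}
alpha = exps / exps.sum()    # Q2: softmax weights (they sum to 1)
y = float(alpha @ x)         # Q3: weighted combination of the inputs

print("exps :", exps)
print("alpha:", alpha)
print("y    :", y)
```

The weights should come out near $(0.269,\ 0.731)$ and the output near $3.462$, which lies between $x_1$ and $x_2$ as expected for a convex combination.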

## Q4 — Derivatives w.r.t. Inputs

Show that

$$
\frac{\partial L}{\partial x_i} = (y - t)\,\alpha_i.
$$

*Hint:* Chain rule.
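
One way to lay out the chain-rule argument, using only the definitions above: $L$ depends on $x_i$ only through $y$, and $y$ is linear in $x_i$ with coefficient $\alpha_i$, since the weights depend on the logits and not on the inputs.

$$
\frac{\partial L}{\partial x_i}
= \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x_i}
= (y - t)\,\alpha_i.
$$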

## Q5 — Softmax Derivative Formula

Differentiate

$$
\alpha_i = \frac{e^{\beta_i}}{e^{\beta_1} + e^{\beta_2}}
$$

with respect to $\beta_j$, and show that

$$
\frac{\partial \alpha_i}{\partial \beta_j} = \alpha_i (\delta_{ij} - \alpha_j).
$$

**The Kronecker delta** is defined as:

$$
\delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}
$$

*Hint:* Quotient rule.
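
As a sanity check on the formula (not a substitute for the derivation), the closed-form Jacobian can be compared against central differences. This is a minimal sketch; the helper `softmax` and the step size `1e-6` are assumptions made for illustration.

```python
import numpy as np

def softmax(b):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(b - b.max())
    return e / e.sum()

beta = np.array([0.0, 1.0])
alpha = softmax(beta)

# Closed form: J[i, j] = alpha_i * (delta_ij - alpha_j)
J_formula = np.diag(alpha) - np.outer(alpha, alpha)

# Central differences: column j approximates d alpha / d beta_j
eps = 1e-6
J_numeric = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    J_numeric[:, j] = (softmax(beta + step) - softmax(beta - step)) / (2 * eps)

print(J_formula)
print("max abs difference:", np.max(np.abs(J_formula - J_numeric)))
```

Each column of the Jacobian sums to zero, which reflects the constraint $\alpha_1 + \alpha_2 = 1$.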

## Q6 — Output Derivative w.r.t. Logits

Show that

$$
\frac{\partial y}{\partial \beta_j} = \alpha_j (x_j - y).
$$

*Hint:* Combine linearity of $y$ with Q5.
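
One possible route, assuming the Q5 identity: differentiate $y = \sum_i \alpha_i x_i$ term by term, substitute the softmax derivative, and use $\sum_i \alpha_i x_i = y$ to collapse the second sum.

$$
\frac{\partial y}{\partial \beta_j}
= \sum_{i} x_i\,\frac{\partial \alpha_i}{\partial \beta_j}
= \sum_{i} x_i\,\alpha_i(\delta_{ij} - \alpha_j)
= \alpha_j x_j - \alpha_j \sum_{i} \alpha_i x_i
= \alpha_j (x_j - y).
$$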

## Q7 — Gradient

Using $\dfrac{\partial L}{\partial y} = y - t$ and Q6, show that

$$
\frac{\partial L}{\partial \beta_j} = (y - t)\,\alpha_j\,(x_j - y).
$$

## Q8 — One Update Step

With learning rate $\eta = 0.2$, perform one gradient step:

$$
\beta_j' = \beta_j - \eta\,\frac{\partial L}{\partial \beta_j}.
$$

Starting from $\beta_1 = 0,\; \beta_2 = 1$, evaluate $\beta_1'$ and $\beta_2'$ numerically, and state whether the update increases the attention weight on $x_1$ or on $x_2$.
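
The update can be checked with a short NumPy sketch that applies the Q7 gradient formula and then recomputes the softmax weights. The helper `softmax` and the names `beta_new` and `alpha_new` are illustrative, not part of the problem statement.

```python
import numpy as np

x = np.array([2.0, 4.0])
t = 3.0
beta = np.array([0.0, 1.0])
eta = 0.2

def softmax(b):
    e = np.exp(b - b.max())
    return e / e.sum()

alpha = softmax(beta)
y = alpha @ x
grad = (y - t) * alpha * (x - y)   # Q7: dL/dbeta_j

beta_new = beta - eta * grad       # one gradient-descent step
alpha_new = softmax(beta_new)

print("grad     :", grad)
print("beta_new :", beta_new)
print("alpha    :", alpha, "->", alpha_new)
```

Because $y \approx 3.462$ exceeds $t = 3$, the step raises $\beta_1$ and lowers $\beta_2$, so the weight on the smaller input $x_1$ should grow.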


## Q9 — Finite-Difference Gradient Check (Python)

Numerically confirm $\frac{\partial L}{\partial \beta_j}$ at $(\beta_1,\beta_2) = (0,1)$ using a central difference with $\epsilon = 10^{-5}$:

$$
\frac{\partial L}{\partial \beta_j} \approx \frac{L(\beta_j + \epsilon) - L(\beta_j - \epsilon)}{2\epsilon}
$$

(holding the other $\beta$ fixed).  
Report the analytic value (from Q7), the finite-difference estimate, and the relative error for $j = 1, 2$.


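```python
import numpy as np

# Given constants
x = np.array([2.0, 4.0])
t = 3.0
beta = np.array([0.0, 1.0])
eps = 1e-5


def loss(b):
    """Return (L, alpha, y) for logits b, with L = 0.5 * (y - t)^2."""
    e = np.exp(b - np.max(b))  # max-shift for stability; softmax is unchanged
    alpha = e / e.sum()
    y = alpha @ x
    return 0.5 * (y - t) ** 2, alpha, y


# Analytic gradient from Q7: (y - t) * alpha_j * (x_j - y)
_, alpha, y = loss(beta)
analytic_grad = (y - t) * alpha * (x - y)

# Central difference: perturb one logit at a time, holding the other fixed
fd_gradients = np.zeros_like(beta)
for j in range(beta.size):
    step = np.zeros_like(beta)
    step[j] = eps
    fd_gradients[j] = (loss(beta + step)[0] - loss(beta - step)[0]) / (2 * eps)

print("Analytic gradients:", analytic_grad)
print("Finite-diff gradients:", fd_gradients)

# Optional: compute relative error
rel_error = np.abs((analytic_grad - fd_gradients) / (np.abs(fd_gradients) + 1e-12))
print("Relative error:", rel_error)
```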


```text
Analytic gradients: [-0.1817155  0.1817155]
Finite-diff gradients: [-0.1817155  0.1817155]
Relative error: [1.13457470e-10 5.61798568e-11]
```


## Q10 — Visualizing the Loss Surface

Compute $L(\beta_1, \beta_2)$ on a grid over $[-3, 3]^2$ and plot the 3D surface using `matplotlib`.  
Label axes $\beta_1$, $\beta_2$, and $L$.
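
A minimal sketch of one way to produce the plot follows. The grid resolution (61 points per axis), the colormap, the `Agg` backend line, and the output filename `loss_surface.png` are all assumptions made for illustration; in a notebook you would likely drop the backend line and call `plt.show()` instead of saving.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove inside a notebook
import matplotlib.pyplot as plt

x1, x2, t = 2.0, 4.0, 3.0

# Grid over [-3, 3]^2
b = np.linspace(-3.0, 3.0, 61)
B1, B2 = np.meshgrid(b, b)

# With two logits, alpha_1 = 1 / (1 + exp(beta_2 - beta_1))
A1 = 1.0 / (1.0 + np.exp(B2 - B1))
Y = A1 * x1 + (1.0 - A1) * x2
Z = 0.5 * (Y - t) ** 2

fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(projection="3d")
ax.plot_surface(B1, B2, Z, cmap="viridis")
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_zlabel(r"$L$")
fig.savefig("loss_surface.png", dpi=150)  # hypothetical output filename
```

Since the softmax weights depend only on the difference $\beta_1 - \beta_2$, the surface is constant along lines parallel to the diagonal. With these constants it touches zero on the diagonal itself, where $\alpha_1 = \alpha_2 = \tfrac{1}{2}$ and $y = t = 3$.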
