### 1. Start with shapes (this already explains half the story)

In [2]:
import numpy as np

# a small weight matrix and an input vector
W = np.array([[2, 1],
              [0, 3]])   # shape (2, 2)

x = np.array([1, 2])      # shape (2,)

y = W @ x                 # shape (2,)

print("W shape:", W.shape)
print("x shape:", x.shape)
print("y = W·x:", y)


W shape: (2, 2)
x shape: (2,)
y = W·x: [4 6]


#### What this means
* Each output value is a weighted sum of inputs
* Outputs are linear combinations of inputs
* No interaction terms (no products such as x₁·x₂), so no surprises
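
"Linear" has a concrete, checkable meaning: transforming a combination of inputs gives the same result as combining the transformed inputs. The sketch below tests that for the same `W`; the second vector `v` and the scalars `a`, `b` are arbitrary values picked for illustration and are not part of the notebook.

```python
import numpy as np

W = np.array([[2, 1],
              [0, 3]])

# two inputs and two scalars (illustrative values only)
u = np.array([1, 2])
v = np.array([-3, 5])
a, b = 2, -1

lhs = W @ (a * u + b * v)        # transform the combination
rhs = a * (W @ u) + b * (W @ v)  # combine the transforms

print("W(a·u + b·v):   ", lhs)
print("a·W(u) + b·W(v):", rhs)
print("Equal:", np.array_equal(lhs, rhs))
```

Any other choice of `u`, `v`, `a`, `b` should pass the same check, because the property comes from the matrix product and not from these particular numbers.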

### 2. Write what each output actually depends on

In [3]:
# manually unpacking
y1 = W[0, 0] * x[0] + W[0, 1] * x[1]
y2 = W[1, 0] * x[0] + W[1, 1] * x[1]

print("y1:", y1)
print("y2:", y2)


y1: 4
y2: 6


#### Key insight
* y₁ depends on x₁ and x₂
* y₂ can depend on both as well, but here W[1, 0] = 0, so only x₂ affects it
* Dependencies are linear and explicit
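
The manual unpacking generalizes to any shape with two loops: one over outputs, one over inputs. This is a minimal sketch; `matvec_loops` is a hypothetical helper written for this illustration, not a NumPy function.

```python
import numpy as np

W = np.array([[2, 1],
              [0, 3]])
x = np.array([1, 2])

def matvec_loops(W, x):
    """Compute W @ x with explicit loops (hypothetical helper, for illustration)."""
    n_out, n_in = W.shape
    y = np.zeros(n_out, dtype=np.result_type(W, x))
    for j in range(n_out):        # one output at a time
        for i in range(n_in):     # add each input's weighted contribution
            y[j] += W[j, i] * x[i]
    return y

y_loops = matvec_loops(W, x)
print("loops:", y_loops)
print("W @ x:", W @ x)
```

The inner loop makes each dependency visible: `W[j, i]` is the only place where input `i` can reach output `j`.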

### 3. Sensitivity: how much does each output change if x changes?

We ask: “If I nudge x slightly, how does y respond?” For y = W·x the answer is W itself, because ∂yⱼ/∂xᵢ = W[j, i] no matter where x sits.

In [4]:
# sensitivity of outputs wrt inputs
dy_dx = W

print("Sensitivity matrix dy/dx:")
print(dy_dx)


Sensitivity matrix dy/dx:
[[2 1]
 [0 3]]


#### What this matrix answers
* Row → which output
* Column → which input
* Entry → how strongly they are connected
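
One way to convince yourself of this reading is to measure the sensitivities instead of asserting them. The sketch below nudges one input at a time and records how each output moves. It assumes float copies of `W` and `x` (a small nudge added to an integer array would be truncated away), and the step size `eps` is an arbitrary small value.

```python
import numpy as np

W = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, 2.0])   # floats, so a small nudge is not truncated

eps = 1e-6                 # assumed step size
jac_numeric = np.zeros(W.shape)
for i in range(len(x)):
    nudge = np.zeros_like(x)
    nudge[i] = eps
    # column i: how every output responds to a nudge in x[i]
    jac_numeric[:, i] = (W @ (x + nudge) - W @ x) / eps

print(np.round(jac_numeric, 6))
print("Matches W:", np.allclose(jac_numeric, W))
```

Each pass of the loop fills one column, which matches the "column → which input" reading above.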

### 4. Why the transpose appears during backprop
* In ML, we usually start from a loss, not directly from y.
* Assume the loss gives us gradients wrt output:

In [5]:
# gradient coming from the loss
dL_dy = np.array([1.0, -2.0])   # shape (2,)

# backprop step: we want the gradient wrt x
dL_dx = W.T @ dL_dy             # shape (2,)

print("Gradient wrt x:", dL_dx)


Gradient wrt x: [ 2. -5.]


#### Why transpose?
* Gradients flow backward
* Contributions from all outputs must be summed per input
* Row i of Wᵀ is column i of W: the weights linking input i to every output, lined up for that sum

This is not a trick. It’s bookkeeping.
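
For readers who want the bookkeeping written out, here is the chain rule for this example, using the same `W` and `dL_dy` as the cell above. The index notation (j for outputs, i for inputs) is a convention chosen here.

```latex
y_j = \sum_i W_{ji}\, x_i
\quad\Longrightarrow\quad
\frac{\partial y_j}{\partial x_i} = W_{ji}

\frac{\partial L}{\partial x_i}
  = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}
  = \sum_j W_{ji}\,\frac{\partial L}{\partial y_j}
  = \sum_j (W^{\top})_{ij}\,\frac{\partial L}{\partial y_j}

\frac{\partial L}{\partial x_1} = 2\cdot 1 + 0\cdot(-2) = 2,
\qquad
\frac{\partial L}{\partial x_2} = 1\cdot 1 + 3\cdot(-2) = -5
```

The sum runs over the output index j, which is the row index of W. Summing over rows of W is the same as multiplying by Wᵀ, and that is the only reason the transpose shows up.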

### 5. Intuition using dot products

Each input component receives blame from all outputs it influenced.

In [8]:
for i in range(len(x)):
    contribution = dL_dy @ W[:, i]
    print(f"Total contribution to x[{i}]:", contribution)
# This is exactly what Wᵀ · dL/dy computes.

Total contribution to x[0]: 2.0
Total contribution to x[1]: -5.0
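
The same result can be built from the other direction: instead of asking what each input receives, ask what each output sends back. This sketch reuses the notebook's values; the accumulator name `dL_dx_rows` is made up for this illustration.

```python
import numpy as np

W = np.array([[2, 1],
              [0, 3]])
dL_dy = np.array([1.0, -2.0])

# accumulate one output's contribution at a time
dL_dx_rows = np.zeros(W.shape[1])
for j in range(W.shape[0]):
    # output j sends its gradient back along row j of W
    dL_dx_rows += dL_dy[j] * W[j, :]
    print(f"after output {j}:", dL_dx_rows)

print("W.T @ dL_dy:", W.T @ dL_dy)
```

Both views compute the same double sum, just in a different loop order: per input (columns of W) or per output (rows of W).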


### 6. Numerical check (no calculus, just sanity)

Small change in x → observe change in loss.

In [9]:
epsilon = 1e-5
x_perturbed = x.astype(float)   # x holds ints; without the cast, adding epsilon is truncated away
x_perturbed[0] += epsilon

y_original = W @ x
y_perturbed = W @ x_perturbed

# stand-in loss: L = y · dL_dy, chosen so that its gradient wrt y is exactly dL_dy
L_original = y_original @ dL_dy
L_perturbed = y_perturbed @ dL_dy

numerical_grad = (L_perturbed - L_original) / epsilon

# round away floating-point noise from the finite difference
print("Numerical gradient wrt x[0]:", round(numerical_grad, 6))
print("Analytical gradient wrt x[0]:", dL_dx[0])


Numerical gradient wrt x[0]: 2.0
Analytical gradient wrt x[0]: 2.0


They match.
That’s understanding, not memorization.
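
A fuller version of this check would cover every component of x, not just x[0]. The sketch below is one way to do that, with two choices that go beyond the cell above: it uses float arrays throughout (so the perturbation cannot be lost to integer truncation) and central differences instead of a one-sided difference. The `loss` helper and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

W = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, 2.0])          # float dtype so the perturbation survives
dL_dy = np.array([1.0, -2.0])

def loss(x):
    """Linear stand-in loss L = (W @ x) · dL_dy, so dL/dy equals dL_dy."""
    return (W @ x) @ dL_dy

eps = 1e-5                        # assumed step size
numerical = np.zeros_like(x)
for i in range(len(x)):
    step = np.zeros_like(x)
    step[i] = eps
    # central difference: (L(x + step) - L(x - step)) / (2 * eps)
    numerical[i] = (loss(x + step) - loss(x - step)) / (2 * eps)

analytical = W.T @ dL_dy
print("Numerical: ", np.round(numerical, 6))
print("Analytical:", analytical)
print("Match:", np.allclose(numerical, analytical))
```

Comparing with `np.allclose` rather than `==` is deliberate: finite differences carry a little floating-point noise, so exact equality is too strict a test.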