# Explaining Backpropagation with Example

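This notebook accompanies the blog post [Explaining Backpropagation with Example](https://derekzhouai.github.io/posts/explaining-backpropagation/). It works through one forward pass and one backward pass of a small network (2 inputs, 3 sigmoid hidden units, 1 linear output) on a batch of $m = 4$ examples.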

In [1]:
import numpy as np

## Initialization

$X \in \mathbb{R}^{m \times 2}$

$W_1 \in \mathbb{R}^{2 \times 3}, b_1 \in \mathbb{R}^{1 \times 3}$

$W_2 \in \mathbb{R}^{3 \times 1}, b_2 \in \mathbb{R}^{1 \times 1}$

In [2]:
# Initialization
m = 4

X = np.array([[0.05, 0.10], [0.15, 0.20], [0.25, 0.30], [0.35, 0.40]]) # (4, 2)
y = np.array([[0.01], [0.02], [0.03], [0.04]]) # (4, 1)

W_1 = np.array([[0.15, 0.20, 0.25], [0.30, 0.35, 0.40]]) # (2, 3)
b_1 = np.zeros((1, 3)) # (1, 3), float zeros rather than an integer array
W_2 = np.array([[0.45], [0.50], [0.55]]) # (3, 1)
b_2 = np.zeros((1, 1)) # (1, 1), float zeros rather than an integer array

## Forward Pass

In [3]:
def sigmoid(z):
    # Element-wise logistic function; the argument is named z so it does not shadow the input matrix X
    return 1 / (1 + np.exp(-z))

### Hidden layer

$Z_1 = XW_1 + b_1 \in \mathbb{R}^{m \times 3}$

$A_1 = \sigma(Z_1) \in \mathbb{R}^{m \times 3}$

In [4]:
Z_1 = np.dot(X, W_1) + b_1
Z_1.shape, Z_1

((4, 3),
 array([[0.0375, 0.045 , 0.0525],
        [0.0825, 0.1   , 0.1175],
        [0.1275, 0.155 , 0.1825],
        [0.1725, 0.21  , 0.2475]]))

In [5]:
A_1 = sigmoid(Z_1)
A_1.shape, A_1

((4, 3),
 array([[0.5093739 , 0.5112481 , 0.51312199],
        [0.52061331, 0.52497919, 0.52934125],
        [0.53183189, 0.53867261, 0.54549879],
        [0.54301838, 0.55230791, 0.56156107]]))

### Output layer

$Z_2 = A_1 W_2 + b_2 \in \mathbb{R}^{m \times 1}$

$A_2 = Z_2 \in \mathbb{R}^{m\times 1}$

In [6]:
Z_2 = np.dot(A_1, W_2) + b_2
Z_2.shape, Z_2

((4, 1),
 array([[0.7670594 ],
        [0.78790327],
        [0.80868499],
        [0.82937081]]))

In [7]:
A_2 = Z_2  # Linear activation for output layer
A_2.shape, A_2

((4, 1),
 array([[0.7670594 ],
        [0.78790327],
        [0.80868499],
        [0.82937081]]))

### Loss (MSE)

$
L = \frac{1}{2m} \parallel A_2 - y\parallel_F^2 = \frac{1}{2m}\sum_{i=1}^m (A_2^{(i,1)}-y^{(i,1)})^2 \in \mathbb{R}
$

In [8]:
# Loss: half the mean squared error (the 1/2 cancels the 2 produced by differentiating the square)
L = 1/2 * np.mean((A_2 - y) ** 2)
L

np.float64(0.29903386964167666)
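
As a quick sanity check, the Frobenius-norm form of the loss and the `np.mean` form used in the cell above should give the same number. The sketch below recomputes the forward pass so that it runs on its own; the names `L_fro` and `L_mean` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

# Same data and parameters as the initialization cell
X = np.array([[0.05, 0.10], [0.15, 0.20], [0.25, 0.30], [0.35, 0.40]])
y = np.array([[0.01], [0.02], [0.03], [0.04]])
W_1 = np.array([[0.15, 0.20, 0.25], [0.30, 0.35, 0.40]])
W_2 = np.array([[0.45], [0.50], [0.55]])
m = X.shape[0]

# Forward pass (both biases are zero in this example, so they are omitted)
A_1 = 1 / (1 + np.exp(-(X @ W_1)))
A_2 = A_1 @ W_2

# Frobenius-norm form: ||A_2 - y||_F^2 / (2m)
L_fro = np.linalg.norm(A_2 - y, ord="fro") ** 2 / (2 * m)
# Mean form used in the notebook cell
L_mean = 0.5 * np.mean((A_2 - y) ** 2)

print(L_fro, L_mean)
```

The two forms agree here because $A_2$ has a single column, so `np.mean` divides by exactly $m$ entries. With more than one output column, `np.mean` would also divide by the number of columns and the two expressions would differ by that factor.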

## Backward Pass

### Output layer

$
L=\frac{1}{2m} \parallel A_2 - y \parallel_F^2, \quad\text{with } A_2 = Z_2.
$

$
Z_2 = A_1W_2 + b_2
$

**Output error**

$
\delta_2 = \frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_2} = \frac{1}{m}(A_2 - y) \in \mathbb{R}^{m \times 1}
$

In [9]:
d_A2 = (A_2 - y) / m  # this is delta_2: dL/dA_2 equals dL/dZ_2 because the output activation is the identity
d_A2.shape, d_A2

((4, 1),
 array([[0.18926485],
        [0.19197582],
        [0.19467125],
        [0.1973427 ]]))
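
One way to gain confidence in the formula $\delta_2 = \frac{1}{m}(A_2 - y)$ is to compare it with a central finite difference of the loss. The following is a minimal sketch, assuming the rounded `A_2` values printed in the forward pass; the step size `eps` and the helper `loss` are illustrative choices, not part of the original notebook.

```python
import numpy as np

y = np.array([[0.01], [0.02], [0.03], [0.04]])
# A_2 copied (rounded) from the forward-pass output above
A_2 = np.array([[0.7670594], [0.78790327], [0.80868499], [0.82937081]])
m = y.shape[0]

def loss(A):
    """Half mean squared error, as in the loss cell."""
    return 0.5 * np.mean((A - y) ** 2)

# Analytic gradient from the formula
delta_2 = (A_2 - y) / m

# Central finite difference, perturbing one entry of A_2 at a time
eps = 1e-6  # illustrative step size
numeric = np.zeros_like(A_2)
for i in range(m):
    bump = np.zeros_like(A_2)
    bump[i, 0] = eps
    numeric[i, 0] = (loss(A_2 + bump) - loss(A_2 - bump)) / (2 * eps)

# Largest absolute disagreement between the two gradients
print(np.max(np.abs(numeric - delta_2)))
```

Because the loss is quadratic in $A_2$, the central difference has no truncation error, so any remaining disagreement is floating-point round-off.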

**Output gradient**

$
\nabla_{W_2} = \frac{\partial L}{\partial W_2} = A_1^\top\delta_2 \in \mathbb{R}^{3 \times 1}
$

$
\nabla_{b_2} = \frac{\partial L}{\partial b_2} = \mathbf{1}_m^\top\delta_2 = \sum_{i=1}^m \delta_2^{(i,1)} \in \mathbb{R}^{1 \times 1}
$

In [10]:
d_W2 = np.dot(A_1.T, d_A2)
d_W2.shape, d_W2

((3, 1),
 array([[0.40704483],
        [0.41140261],
        [0.41574958]]))

In [11]:
d_b2 = np.sum(d_A2, axis=0, keepdims=True)
d_b2.shape, d_b2

((1, 1), array([[0.77325462]]))
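
The bias gradient is a sum over the $m$ examples because $b_2$ is broadcast to every row of $Z_2$, so each example contributes its own error to the same parameter. The sketch below, which assumes the rounded $\delta_2$ values printed earlier, shows that this sum can also be written as a product with a vector of ones; `ones`, `grad_b2_matmul`, and `grad_b2_sum` are names made up for this illustration.

```python
import numpy as np

m = 4
# delta_2 copied (rounded) from the output-error cell above
delta_2 = np.array([[0.18926485], [0.19197582], [0.19467125], [0.1973427]])

# Matrix form: (1, m) @ (m, 1) -> (1, 1)
ones = np.ones((m, 1))
grad_b2_matmul = ones.T @ delta_2

# NumPy form used in the notebook: sum down the rows (axis=0), keep the 2-D shape
grad_b2_sum = np.sum(delta_2, axis=0, keepdims=True)

print(grad_b2_matmul, grad_b2_sum)
```

Both expressions produce a `(1, 1)` array, which matches the shape of $b_2$.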

### Hidden layer

$
A_1 = \sigma(Z_1) \Rightarrow \frac{\partial A_1}{\partial Z_1} \;=\; \sigma(Z_1)\odot \big(1-\sigma(Z_1)\big) \;=\; A_1\odot (1-A_1).
$

$Z_1=XW_1+b_1$

Here $\odot$ denotes element-wise (Hadamard) multiplication.
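
The identity $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ can be checked numerically. This is a small sketch, assuming two rows of $Z_1$ taken from the forward pass as test points; `eps`, `numeric`, and `analytic` are illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# First and last rows of Z_1 from the forward pass
z = np.array([[0.0375, 0.045, 0.0525],
              [0.1725, 0.21, 0.2475]])

# Central finite difference of the sigmoid, element by element
eps = 1e-6  # illustrative step size
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Closed form: sigma(z) * (1 - sigma(z))
a = sigmoid(z)
analytic = a * (1 - a)

# Largest absolute disagreement between the two
print(np.max(np.abs(numeric - analytic)))
```

The derivative never exceeds $0.25$ (its value at $z = 0$), which is why the hidden-layer errors come out noticeably smaller than the values of $\delta_2 W_2^\top$ they are computed from.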

**Hidden error**

$
\delta_1 = \frac{\partial L}{\partial Z_1} = \left(\frac{\partial L}{\partial A_1}\right)\odot \left(\frac{\partial A_1}{\partial Z_1}\right)
\;=\; \big(\delta_2 W_2^\top\big)\odot \big(A_1 \odot (1-A_1)\big) \in \mathbb{R}^{m \times 3}
$

In [12]:
d_A1 = d_A2 @ W_2.T
d_A1.shape, d_A1

((4, 3),
 array([[0.08516918, 0.09463242, 0.10409567],
        [0.08638912, 0.09598791, 0.1055867 ],
        [0.08760206, 0.09733562, 0.10706919],
        [0.08880422, 0.09867135, 0.10853849]]))

In [13]:
d_Z1 = d_A1 * (A_1 * (1 - A_1))
d_Z1.shape, d_Z1

((4, 3),
 array([[0.02128481, 0.02364613, 0.02600599],
        [0.02156057, 0.02393708, 0.02630577],
        [0.02181175, 0.02418833, 0.02654565],
        [0.02203671, 0.02439786, 0.02672329]]))

**Hidden gradients**

$
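\nabla_{W_1} = \frac{\partial L}{\partial W_1} = X^\top\delta_1 \in \mathbb{R}^{2 \times 3}
$

$
\nabla_{b_1} = \frac{\partial L}{\partial b_1} = \mathbf{1}_m^\top\delta_1 = \sum_{i=1}^m \delta_1^{(i,:)} \in \mathbb{R}^{1 \times 3}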
$

In [14]:
d_W1 = X.T @ d_Z1
d_W1.shape, d_W1

((2, 3),
 array([[0.01746411, 0.0193592 , 0.02123573],
        [0.02179881, 0.02416767, 0.02651476]]))

In [15]:
d_b1 = np.sum(d_Z1, axis=0, keepdims=True)
d_b1.shape, d_b1

((1, 3), array([[0.08669385, 0.09616941, 0.1055807 ]]))
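
To check all four gradients at once, the forward and backward passes can be wrapped in functions and compared against central finite differences of the loss. The following is a sketch under stated assumptions rather than part of the original notebook: the `params` dictionary, the helpers `forward`, `loss`, `backward`, and `numerical_grad`, and the step size `eps` are all introduced here for illustration.

```python
import numpy as np

X = np.array([[0.05, 0.10], [0.15, 0.20], [0.25, 0.30], [0.35, 0.40]])
y = np.array([[0.01], [0.02], [0.03], [0.04]])

# Same parameter values as the initialization cell, stored as float arrays
params = {
    "W_1": np.array([[0.15, 0.20, 0.25], [0.30, 0.35, 0.40]]),
    "b_1": np.zeros((1, 3)),
    "W_2": np.array([[0.45], [0.50], [0.55]]),
    "b_2": np.zeros((1, 1)),
}

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(p):
    A_1 = sigmoid(X @ p["W_1"] + p["b_1"])
    A_2 = A_1 @ p["W_2"] + p["b_2"]  # linear output layer
    return A_1, A_2

def loss(p):
    _, A_2 = forward(p)
    return 0.5 * np.mean((A_2 - y) ** 2)

def backward(p):
    """Analytic gradients, following the formulas in this notebook."""
    m = X.shape[0]
    A_1, A_2 = forward(p)
    d_Z2 = (A_2 - y) / m
    d_Z1 = (d_Z2 @ p["W_2"].T) * A_1 * (1 - A_1)
    return {
        "W_1": X.T @ d_Z1,
        "b_1": d_Z1.sum(axis=0, keepdims=True),
        "W_2": A_1.T @ d_Z2,
        "b_2": d_Z2.sum(axis=0, keepdims=True),
    }

def numerical_grad(p, name, eps=1e-6):
    """Central finite difference of the loss w.r.t. each entry of p[name]."""
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        original = p[name][idx]
        p[name][idx] = original + eps
        loss_plus = loss(p)
        p[name][idx] = original - eps
        loss_minus = loss(p)
        p[name][idx] = original  # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

analytic = backward(params)
max_abs_diff = {
    name: float(np.max(np.abs(analytic[name] - numerical_grad(params, name))))
    for name in params
}
print(max_abs_diff)
```

If the backward-pass formulas are right, every entry of `max_abs_diff` should be tiny compared with the gradients themselves. A large value for one parameter points to an error in that parameter's formula or in the error term feeding it.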