# 1. Backpropagation for a single neuron layer
<img src='images/nn_1_1_1_1_labled.svg'/>

### 1.1 Cost of a single sample from training set
The cost for one training example in this network is

$C_0(...)=( a^{(L)}-y)^2$

where $C_0$ denotes the cost for the first example in the training set.

---

### 1.2 Output of neurons in each layer

$z^{(L)}=w^{(L)}a^{(L-1)}+b^{(L)}$

$a^{(L)}=\sigma(z^{(L)}) $

---

### 1.3 Derivative of costs relative to weight


$\frac{ \partial C_0   }{ \partial w^{(L)}  } = \frac{ \partial C_0   }{ \partial a^{(L)}  } \frac{ \partial a^{(L)  } }{ \partial z^{(L)}  }   \frac{\partial z^{(L)}} { \partial w^{(L)} }  $


$\frac{ \partial C_0   }{ \partial a^{(L)}}=2( a^{(L)}-y)   $

$\frac{ \partial a^{(L)  } }{ \partial z^{(L)} }= \sigma' (z^{(L)}) $

$\frac{\partial z^{(L)}} { \partial w^{(L)} }=a^{(L-1)}$


Combining the three factors gives the derivative of the cost for a specific example:

$\frac{ \partial C_0   }{ \partial w^{(L)}  } = 2( a^{(L)}-y) \sigma' (z^{(L)})  a^{(L-1)}$

---

### 1.4 Derivative of costs relative to bias

$\frac{ \partial C_0   }{ \partial b^{(L)}  } = \frac{ \partial C_0   }{ \partial a^{(L)}  } \frac{ \partial a^{(L)  } }{ \partial z^{(L)}  }   \frac{\partial z^{(L)}} { \partial b^{(L)} }  $

$\frac{ \partial C_0   }{ \partial b^{(L)}  } =  2( a^{(L)}-y) \sigma' (z^{(L)}) \times 1$

---

### 1.5 Derivative of costs relative to the activation of previous layer

$\frac{ \partial C_0   }{ \partial a^{(L-1)}  } = \frac{ \partial C_0   }{ \partial a^{(L)}  } \frac{ \partial a^{(L)  } }{ \partial z^{(L)}  }   \frac{\partial z^{(L)}} { \partial a^{(L-1)} }  $

$\frac{ \partial C_0   }{ \partial a^{(L-1)}  } =  2( a^{(L)}-y) \sigma' (z^{(L)}) w^{(L)}$
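
The three derivatives of Sections 1.3 to 1.5 can be checked numerically. The following is a minimal sketch, assuming a sigmoid activation and illustrative values for $w$, $b$, $a^{(L-1)}$ and $y$ that are not taken from the text; the helper names (`single_neuron_grads`, `central_diff`) are hypothetical. It compares the chain-rule formulas with central finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    # C_0 = (a - y)^2 with a = sigmoid(w * a_prev + b)
    return (sigmoid(w * a_prev + b) - y) ** 2

def single_neuron_grads(w, b, a_prev, y):
    z = w * a_prev + b
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)            # derivative of the sigmoid at z
    common = 2.0 * (a - y) * sigma_prime   # dC/da * da/dz, shared by all three
    return common * a_prev, common, common * w   # dC/dw, dC/db, dC/da_prev

def central_diff(f, x, eps=1e-6):
    # Numerical derivative of a scalar function f at x
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Illustrative values (assumed, not taken from the text)
w, b, a_prev, y = 0.5, -0.2, 0.9, 1.0
dw, db, da_prev = single_neuron_grads(w, b, a_prev, y)

dw_num = central_diff(lambda v: cost(v, b, a_prev, y), w)
db_num = central_diff(lambda v: cost(w, v, a_prev, y), b)
da_num = central_diff(lambda v: cost(w, b, v, y), a_prev)

print(dw, dw_num)
print(db, db_num)
print(da_prev, da_num)
```

Each printed pair should agree to several decimal places, which is a cheap way to catch a wrong factor in a hand-derived gradient.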

---

### 1.6 Cost for all training examples
The cost over the whole training set is the average of the per-example costs, so its derivative is the average of the per-example derivatives:


$\frac{ \partial C}{ \partial w^{(L)}  } = \frac{1}{n} \sum_{k=0}^{n-1} \frac{ \partial C_k   }{ \partial w^{(L)}} $

This is one component of the gradient, which collects the partial derivatives of the cost with respect to all weights and biases:

$\nabla C=\begin{bmatrix} \frac{ \partial C}{ \partial w^{(1)}  }  \\
\frac{ \partial C}{ \partial b^{(1)}  } \\ \vdots   \\ \frac{ \partial C}{ \partial w^{(L)}  } \\ \frac{ \partial C}{ \partial b^{(L)}} 
\end{bmatrix}$
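
As a small illustration of the averaging step, the sketch below computes $\frac{\partial C}{\partial w^{(L)}}$ for a single sigmoid neuron over a toy training set, once with a per-example loop and once vectorized. The data values and the function name `grad_w_single` are assumptions made for the example; the input $x$ plays the role of $a^{(L-1)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w_single(w, b, x, y):
    # dC_k/dw = 2 (a - y) sigma'(z) x for one example
    a = sigmoid(w * x + b)
    return 2.0 * (a - y) * a * (1.0 - a) * x

# Toy training set (assumed values)
xs = np.array([0.1, 0.5, 0.9, 1.3])
ys = np.array([0.0, 0.0, 1.0, 1.0])
w, b = 0.5, -0.2

# Average of the per-example derivatives
per_sample = np.array([grad_w_single(w, b, x, y) for x, y in zip(xs, ys)])
grad_w = per_sample.mean()

# Same quantity, vectorized over the whole training set
a = sigmoid(w * xs + b)
grad_w_vec = np.mean(2.0 * (a - ys) * a * (1.0 - a) * xs)

print(grad_w, grad_w_vec)
```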

---

# 2. Backpropagation for a multi neuron layer

<img src='images/nn_a(1)_0.svg' height="50%" width="50%" />

The subscript indicates which neuron within a layer is meant, and the superscript indicates the layer:

- $a_0^{(L-1)}$: First neuron in the layer $L-1$
- $a_2^{(L)}$: Third neuron in the layer $L$

### 2.1 Dimension of the MLP

- The last layer $L$ has $n$ neurons,
- Layer $L-1$ has $m$ neurons,
- Therefore, the size of the weight matrix is: $\textbf{W}^{(L)}_{n \times m}$
- Similarly, layer $L-1$ has $m$ neurons and layer $L-2$ has $p$ neurons, therefore: $\textbf{W}^{(L-1)}_{m \times p}$

---

$z^{(L)}=w^{(L)}a^{(L-1)}+b^{(L)}$

$a^{(L)}=\sigma(z^{(L)}) $

---

$z^{(L)}= \textbf{W}^{(L)}_{n \times m} \times a^{(L-1)}_{m \times 1} + b^{(L)}_{n \times 1} $ 


---

For example, in the network shown above the first layer (layer $0$) has $6$ neurons and the second layer (layer $1$) has $4$ neurons:

$\begin{bmatrix}
w_{0,0}^{(1)} &w_{0,1}^{(1)}  & \cdots   &w_{0,5}^{(1)} \\ 
w_{1,0}^{(1)} &w_{1,1}^{(1)}  & \cdots   &w_{1,5}^{(1)} \\ 
 \vdots &  \ddots &  & \\ 
w_{3,0}^{(1)} &w_{3,1}^{(1)}  & \cdots   &w_{3,5}^{(1)} 
\end{bmatrix}_{4 \times 6} \times$
$\begin{bmatrix}
a^{(0)}_{0}
\\ a^{(0)}_{1}
\\ \vdots
\\ a^{(0)}_{5}
\end{bmatrix}_{6 \times 1}
+$
$\begin{bmatrix}
b^{(1)}_{0}
\\ b^{(1)}_{1}
\\ \vdots
\\ b^{(1)}_{3}
\end{bmatrix}_{4 \times 1}=$
$\begin{bmatrix}
z^{(1)}_{0}
\\ z^{(1)}_{1}
\\ \vdots
\\ z^{(1)}_{3}
\end{bmatrix}_{4 \times 1}$

---



$a^{(1)}_{0}=\sigma( 
w_{0,0}\times a^{(0)}_{0} +
w_{0,1}\times a^{(0)}_{1}+
w_{0,2}\times a^{(0)}_{2}+
w_{0,3}\times a^{(0)}_{3}+
w_{0,4}\times a^{(0)}_{4}+
w_{0,5}\times a^{(0)}_{5}+b_{0}^{(1)})$

$w_{j,k}^{(L)}$ denotes the weight that connects the $k$-th node in layer $L-1$ to the $j$-th node in layer $L$ (layer $L-1$ has $m$ neurons).

<img src='images/nn_last_layer.svg' height="50%" width="50%" />


$z_{j}^{(L)}=w_{j,0}^{(L)}\times a^{(L-1)}_{0} +
w_{j,1}^{(L)}\times a^{(L-1)}_{1}+
\cdots+
w_{j,k}^{(L)}\times a^{(L-1)}_{k}+
\cdots+
w_{j,m-1}^{(L)}\times a^{(L-1)}_{m-1}+b_{j}^{(L)}$

$a^{(L)}_{j}=\sigma(z_{j}^{(L)})$
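
The matrix form and the per-neuron sum describe the same computation. A minimal NumPy sketch, assuming random weights and the $6 \to 4$ layer sizes of the example above (variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 6, 4                          # neurons in layer L-1 and layer L
W = rng.normal(size=(n, m))          # W[j, k]: weight from node k to node j
a_prev = rng.normal(size=(m, 1))     # activations of layer L-1
b = rng.normal(size=(n, 1))

# Matrix form: z = W a_prev + b
z = W @ a_prev + b
a = sigmoid(z)

# Per-neuron form for j = 0: explicit sum over the m incoming weights
j = 0
z_j = sum(W[j, k] * a_prev[k, 0] for k in range(m)) + b[j, 0]

print(z.shape, a.shape)              # (4, 1) (4, 1)
print(np.isclose(z_j, z[j, 0]))      # True
```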

---

The last layer has $n$ outputs, so the cost for the $\textbf{first}$ example in the training set, $c_{0}$, is:

$c_{0}=\sum_{j=0}^{n-1} (a_{j}^{(L)}-y_{j})^2$

### 2.2 Computing error relative to the changes of weights in the last layer



$\frac{\partial c_{0}}{\partial w_{jk}^{(L)}} =
\frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial c_{0}}{\partial a_{j}^{(L)}}$

### 2.3 Computing error relative to the changes of biases in the last layer



$\frac{\partial c_{0}}{\partial b_{j}^{(L)}} =
\frac{\partial z_{j}^{(L)}}{\partial b_{j}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial C_{0}}{\partial a_{j}^{(L)}}$

## 2.4 $\delta^{(L)}_{j}$: The error of neuron $j$ in layer $L$


1) $\frac{\partial c_{0}}{\partial a_{j}^{(L)}}=2(a_{j}^{(L)}-y_{j})$

2) $\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}=\sigma^{\prime}(z_{j}^{(L)})$

3) $\frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}}=a^{(L-1)}_{k}$

4) $\frac{\partial z_{j}^{(L)}}{\partial b_{j}^{(L)}}=1$


Since the product $\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}\frac{\partial c_{0}}{\partial a_{j}^{(L)}}$ is common to both $\frac{\partial c_{0}}{\partial b_{j}^{(L)}}$ and $\frac{\partial c_{0}}{\partial w_{jk}^{(L)}}$, we call it $\delta^{(L)}_{j}$ (delta), the error of neuron $j$ in layer $L$:

$\delta^{(L)}_{j} =\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}\frac{\partial c_{0}}{\partial a_{j}^{(L)}}=\frac{\partial c_{0}}{\partial z_{j}^{(L)}}=2(a_{j}^{(L)}-y_{j})\sigma^{\prime}(z_{j}^{(L)})$

This will give us:

- $\frac{\partial c_{0}}{\partial w_{jk}^{(L)}} =
\frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial C_{0}}{\partial a_{j}^{(L)}}=2(a_{j}^{(L)}-y_{j})\sigma^{\prime}(z_{j}^{(L)})a^{(L-1)}_{k}=\delta^{(L)}_{j}a^{(L-1)}_{k}$

- $\frac{\partial c_{0}}{\partial b_{j}^{(L)}} =
\frac{\partial z_{j}^{(L)}}{\partial b_{j}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial C_{0}}{\partial a_{j}^{(L)}}=2(a_{j}^{(L)}-y_{j})\sigma^{\prime}(z_{j}^{(L)})=\delta^{(L)}_{j}$
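
The index-level formulas translate directly into loops. This is only a sketch, assuming a sigmoid activation, the cost $c_0=\sum_j (a_j^{(L)}-y_j)^2$, and random illustrative values; it is written with explicit indices to mirror the equations rather than for speed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 3, 4                          # neurons in layer L and layer L-1
W = rng.normal(size=(n, m))
b = rng.normal(size=n)
a_prev = rng.normal(size=m)
y = rng.uniform(size=n)

z = W @ a_prev + b
a = sigmoid(z)

delta = np.zeros(n)
dW = np.zeros((n, m))
db = np.zeros(n)
for j in range(n):
    # delta_j = 2 (a_j - y_j) sigma'(z_j), with sigma'(z) = a (1 - a)
    delta[j] = 2.0 * (a[j] - y[j]) * a[j] * (1.0 - a[j])
    db[j] = delta[j]                      # dz_j/db_j = 1
    for k in range(m):
        dW[j, k] = delta[j] * a_prev[k]   # dz_j/dw_jk = a_k

print(delta)
print(dW.shape, db.shape)   # (3, 4) (3,)
```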

## 2.5 Writing backpropagation equations in matrix form

### 2.5.1 $\delta^L$: The error term at the output layer $L$


The error term at the output layer:
$
\delta^L = \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial C}{\partial a^{(L)}}
$


#### 1) $ \frac{\partial C}{\partial a^{(L)}} $



Let's say our output size is $2$:

- $ a^{(L)} \in \mathbb{R}^{2 \times 1} $
- $ C \in \mathbb{R} $ (a scalar loss function)
- What size should $ \frac{\partial C}{\partial a^{(L)}} $ have?

---


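The cost is a scalar, $ C \in \mathbb{R} $.

In this subsection we use the squared-error loss with a factor $ \frac{1}{2} $. Compared with $ c_{0}=\sum_{j} (a_{j}^{(L)}-y_{j})^2 $ used earlier, this only removes the factor of $ 2 $ from the gradient: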
$
C = \frac{1}{2} \| a^L - y \|^2 = \frac{1}{2} \sum_i (a^L_i - y_i)^2
$

Then:
$
\frac{\partial C}{\partial a^L} = a^L - y \in \mathbb{R}^{2 \times 1}
$

- $ a^L $ is a $ 2 \times 1 $ vector
- So the gradient of scalar $ C $ with respect to a vector is also a $ 2 \times 1 $ **vector of partial derivatives**

---

#### 2) $ \frac{\partial a^L}{\partial z^L} $

This is the derivative of the **activation function**, taken element-wise. What are the dimensions of $ a^{(L)} $ and $ z^{(L)} $, and what is the shape of the derivative below?
$
\frac{\partial a^{(L)}}{\partial z^{(L)}}
$

---

Assume the output layer $ L $ has $ n_L $ neurons. Then:

- $ a^{(L)} \in \mathbb{R}^{n_L \times 1} $: activation (output) vector
- $ z^{(L)} \in \mathbb{R}^{n_L \times 1} $: pre-activation (weighted input) vector

Each element:
$
a^{(L)}_i = \sigma(z^{(L)}_i)
$

So $ a^{(L)} $ is computed **element-wise** from $ z^{(L)} $ via the sigmoid (or other) activation function.

---

What Is $ \frac{\partial a^{(L)}}{\partial z^{(L)}} $?

This is the derivative of a **vector-valued function** with respect to a **vector**.

#### Full Jacobian Form:

In general, for vector-valued $ a^{(L)} \in \mathbb{R}^{n_L \times 1} $, and vector-valued $ z^{(L)} \in \mathbb{R}^{n_L \times 1} $, the derivative is a **Jacobian matrix**:

$
\frac{\partial a^{(L)}}{\partial z^{(L)}} \in \mathbb{R}^{n_L \times n_L}
$

But here's the trick:  
Since the activation function is **applied element-wise**, the Jacobian is a **diagonal matrix**.

For sigmoid:
$
\frac{\partial a^{(L)}}{\partial z^{(L)}} =
\begin{bmatrix}
\sigma'(z_1) & 0 & \cdots & 0 \\
0 & \sigma'(z_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma'(z_{n_L})
\end{bmatrix}
$

So:
- Shape of $ \frac{\partial a^{(L)}}{\partial z^{(L)}} $: $ n_L \times n_L $
- Shape of $ \partial a^{(L)} $ or $ a^{(L)} $: $ n_L \times 1 $
- Shape of $ \partial z^{(L)} $: $ n_L \times 1 $

---
**Example: If $ n_L = 2 $**

Let’s say:
- $ a^{(L)} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} $
- $ z^{(L)} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} $

Then:
$
\frac{\partial a^{(L)}}{\partial z^{(L)}} =
\begin{bmatrix}
\frac{\partial a_1}{\partial z_1} & 0 \\
0 & \frac{\partial a_2}{\partial z_2}
\end{bmatrix}
=
\begin{bmatrix}
\sigma'(z_1) & 0 \\
0 & \sigma'(z_2)
\end{bmatrix}
\in \mathbb{R}^{2 \times 2}
$

---


**So Why Do We Use Element-wise Product?**

In neural network code (such as NumPy or PyTorch), we **don't explicitly construct the Jacobian**. With column-vector gradients, the chain rule reads:
$
\delta^L = \left( \frac{\partial a^L}{\partial z^L} \right)^T \cdot \frac{\partial C}{\partial a^L}
$

that is:
$
\delta^L = J^T \cdot v
\quad \text{(transposed Jacobian times gradient vector)}
$

But since $ J $ is diagonal, $ J^T = J $ and the product is equivalent to **element-wise multiplication**:


$
\delta^L = \left( a^L - y \right) \odot \sigma'(z^L)
$

$
\delta^L = \frac{\partial C}{\partial a^L} \odot \frac{\partial a^L}{\partial z^L}
$

---

**Summary**

| Expression | Meaning | Shape |
|------------|---------|-------|
| $ a^L $ | activation vector | $ n_L \times 1 $ |
| $ z^L $ | pre-activation vector | $ n_L \times 1 $ |
| $ \frac{\partial a^L}{\partial z^L} $ | Jacobian (element-wise derivative) | $ n_L \times n_L $, diagonal matrix |
| $ \delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) $ | vector of errors | $ n_L \times 1 $ |
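
The equivalence between the diagonal Jacobian and the Hadamard product is easy to confirm numerically. A minimal sketch, assuming a sigmoid output layer with two neurons, the loss $C=\frac{1}{2}\|a^L-y\|^2$, and made-up values for $z^L$ and $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed)
z = np.array([[0.3], [-1.2]])
y = np.array([[1.0], [0.0]])

a = sigmoid(z)
sigma_prime = a * (1.0 - a)          # element-wise derivative, shape (2, 1)
grad_a = a - y                       # dC/da for C = 1/2 ||a - y||^2

# Explicit Jacobian: diagonal 2 x 2 matrix
J = np.diag(sigma_prime.ravel())
delta_jacobian = J.T @ grad_a        # J^T v, shape (2, 1)

# Element-wise (Hadamard) form: no Jacobian is built
delta_hadamard = grad_a * sigma_prime

print(np.allclose(delta_jacobian, delta_hadamard))   # True
```

The Hadamard form needs $n_L$ multiplications and no $n_L \times n_L$ storage, which is the practical reason it is preferred.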

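For the sigmoid, $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $, so the diagonal of the Jacobian is:

 
$
\sigma'(z^L) = \sigma(z^L) \odot (1 - \sigma(z^L)) \in \mathbb{R}^{2 \times 1}
$

In the element-wise convention, $ \frac{\partial a^L}{\partial z^L} $ denotes this **vector** rather than the full Jacobian matrix: because the sigmoid is applied element-wise, only the diagonal is ever needed.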


$
\delta^L = \frac{\partial C}{\partial a^L} \odot \frac{\partial a^L}{\partial z^L}
$

So this is an **element-wise product** of two $ 2 \times 1 $ vectors:
$
\delta^L \in \mathbb{R}^{2 \times 1}
$


- $ \frac{\partial C}{\partial a^L} \in \mathbb{R}^{2 \times 1} $
- $ \delta^L \in \mathbb{R}^{2 \times 1} $

---


####  What If We Were Doing Matrix Calculus?

If you wrote everything using **Jacobian matrices**, then:
- $ \frac{\partial C}{\partial a^L} $ would be $ 1 \times 2 $
- $ \frac{\partial a^L}{\partial z^L} $ would be $ 2 \times 2 $ diagonal matrix
- And:
$
\delta^L = \left( \frac{\partial C}{\partial a^L} \right) \cdot \left( \frac{\partial a^L}{\partial z^L} \right)
\in \mathbb{R}^{1 \times 2}
$

But in deep learning frameworks and hand calculations, we **drop the Jacobians** and use element-wise Hadamard products, which is what the formula $ \delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) $ assumes.

---

#### Final Summary

- $ C \in \mathbb{R} $
- $ a^L \in \mathbb{R}^{2 \times 1} $
- $ \frac{\partial C}{\partial a^L} \in \mathbb{R}^{2 \times 1} $
- $ \frac{\partial a^L}{\partial z^L} \in \mathbb{R}^{2 \times 1} $
- So:
  $
  \delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) \in \mathbb{R}^{2 \times 1}
  $


---

##  The Two Worlds: Vector Calculus vs Matrix Calculus

If we use matrix calculus, $\delta^L$ is a ${1 \times 2}$ row vector; if we do not use the Jacobian, it is the column vector $\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) \in \mathbb{R}^{2 \times 1}$.




There are **two conventions** used in computing derivatives of vector functions:

### ✅ 1. **Vector (Coordinate-wise) Calculus**
- This is what we typically use in deep learning.
- Gradients are treated as **column vectors**.
- Element-wise operations (like sigmoid) make life easier.
- The error at the output is:
  $
  \delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) \in \mathbb{R}^{n_L \times 1}
  $

This is the standard **backpropagation view**, and how it's implemented in frameworks like PyTorch, TensorFlow, etc.

---

### ⚠️ 2. **Matrix Calculus (Jacobian-based)**

In strict matrix calculus, you define gradients as:
- For scalar function $ C: \mathbb{R}^n \rightarrow \mathbb{R} $,
  $
  \frac{\partial C}{\partial a^L} \in \mathbb{R}^{1 \times n_L}
  $
  (a row vector)
- And the Jacobian of $ \sigma: \mathbb{R}^{n_L} \rightarrow \mathbb{R}^{n_L} $ is:
  $
  \frac{\partial a^L}{\partial z^L} \in \mathbb{R}^{n_L \times n_L}
  $

Then, to compute $ \delta^L $ (which is $ \frac{\partial C}{\partial z^L} $), you would do:
$
\delta^L = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L}
\in \mathbb{R}^{1 \times n_L}
$

So yes — in this formal matrix calculus:
- $ \delta^L \in \mathbb{R}^{1 \times n_L} $

But that leads to:
- Transposed weight matrices everywhere
- Extra care for dimensions in chain rule

---

## 🧠 Deep Learning Convention

In deep learning, we **drop the full Jacobians** and use element-wise derivatives. So instead of:

$
\delta^L = \left( \frac{\partial C}{\partial a^L} \right) \cdot \left( \frac{\partial a^L}{\partial z^L} \right)
\quad (\text{matrix calculus})
$

We use:
$
\delta^L = \left( \frac{\partial C}{\partial a^L} \right) \odot \sigma'(z^L)
\quad (\text{element-wise / vector calculus})
$

Which gives:
- $ \delta^L \in \mathbb{R}^{n_L \times 1} $ (a column vector)

This is far more intuitive and easier to implement.

---

## ✅ Final Clarification

| Perspective | $ \frac{\partial C}{\partial a^L} $ | $ \delta^L $ | Notes |
|-------------|------------------------------|----------------|-------|
| Matrix calculus | $ 1 \times n_L $ (row vector) | $ 1 \times n_L $ | Uses full Jacobians |
| Deep learning / coordinate-wise | $ n_L \times 1 $ (column vector) | $ n_L \times 1 $ | Uses Hadamard product |

The **same concept** has **different dimensions depending on the convention**, and deep learning typically follows the column vector (Hadamard product) style.
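
The two conventions differ only by a transpose. The sketch below, with assumed values for a three-neuron sigmoid output layer, computes $\delta^L$ both ways and compares the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed)
z = np.array([[0.3], [-1.2], [0.8]])
y = np.array([[1.0], [0.0], [0.5]])
a = sigmoid(z)
sigma_prime = a * (1.0 - a)

# Matrix-calculus convention: row-vector gradient times explicit Jacobian
grad_row = (a - y).T                     # 1 x 3
J = np.diag(sigma_prime.ravel())         # 3 x 3
delta_row = grad_row @ J                 # 1 x 3

# Deep-learning convention: column vector and Hadamard product
delta_col = (a - y) * sigma_prime        # 3 x 1

print(delta_row.shape, delta_col.shape)      # (1, 3) (3, 1)
print(np.allclose(delta_row.T, delta_col))   # True
```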

---


We can write the equation for $\delta^{(L)}_{j}$ in vector form:

$\begin{align*}
\boldsymbol{\delta}^{(L)} &= (\delta^{(L)}_{0},\delta^{(L)}_{1},...,\delta^{(L)}_{n-1} ) \\
&= \frac{\partial \textbf{c}}{\partial \textbf{z}^{(L)}} \\
&= (\frac{\partial c}{\partial z_{0}^{(L)}} ,\frac{\partial c}{\partial z_{1}^{(L)}},... ,\frac{\partial c}{\partial z_{n-1}^{(L)}} ) \\
&= (\frac{\partial a_{0}^{(L)}}{\partial z_{0}^{(L)}}\frac{\partial c}{\partial a_{0}^{(L)}},\frac{\partial a_{1}^{(L)}}{\partial z_{1}^{(L)}}\frac{\partial c}{\partial a_{1}^{(L)}},...\frac{\partial a_{n-1}^{(L)}}{\partial z_{n-1}^{(L)}}\frac{\partial c}{\partial a_{n-1}^{(L)}}) \\
&= (2(a_{0}^{(L)}-y_{0})\sigma^{\prime}(z_{0}^{(L)}),2(a_{1}^{(L)}-y_{1})\sigma^{\prime}(z_{1}^{(L)}),...2(a_{n-1}^{(L)}-y_{n-1})\sigma^{\prime}(z_{n-1}^{(L)})) \\
&= 2(\textbf{a}^{(L)}-\textbf{y}) \odot \sigma^{\prime}(\textbf{z}^{(L)})\tag{1}
\end{align*}$

---

### 2.6.1 $\frac{\partial C}{\partial w^L}$: The error term relative to weight in layer $L$


$
\frac{\partial C}{\partial w^L} = \delta^L (a^{L-1})^T
$


At any layer $ L $, the **pre-activation** value (before the activation function) is:

$
z^L = w^L a^{L-1} + b^L
\quad \text{where } w^L \in \mathbb{R}^{n_L \times n_{L-1}}
$

Then:
$
a^L = \sigma(z^L)
$

The loss $ C $ depends on the final output $ a^L $, and we're interested in how changing each weight in $ w^L $ affects the cost.

---

Reminder:
$
\frac{\partial C}{\partial w^L_{ij}}
$

That is, how does changing the weight from neuron $ j $ in layer $ L-1 $ to neuron $ i $ in layer $ L $ affect the cost?

---

**Chain Rule View**

Use the chain rule:

$
\frac{\partial C}{\partial w^L_{ij}} = \frac{\partial C}{\partial z^L_i} \cdot \frac{\partial z^L_i}{\partial w^L_{ij}}
$

But:
- $ \frac{\partial C}{\partial z^L_i} = \delta^L_i $
- And $ z^L_i = \sum_j w^L_{ij} a^{L-1}_j + b^L_i \Rightarrow \frac{\partial z^L_i}{\partial w^L_{ij}} = a^{L-1}_j $

So:
$
\frac{\partial C}{\partial w^L_{ij}} = \delta^L_i \cdot a^{L-1}_j
$

From our previous section:

- $\frac{\partial c_{0}}{\partial w_{jk}^{(L)}} =
\frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial C_{0}}{\partial a_{j}^{(L)}}=2(a_{j}^{(L)}-y_{j})\sigma^{\prime}(z_{j}^{(L)})a^{(L-1)}_{k}=\delta^{(L)}_{j}a^{(L-1)}_{k}$

Now, write this in matrix form.



Let:
- $ \delta^l \in \mathbb{R}^{n_l \times 1} $
- $ a^{l-1} \in \mathbb{R}^{n_{l-1} \times 1} $

We want to build a full matrix of all partial derivatives:
$
\frac{\partial C}{\partial w^l} \in \mathbb{R}^{n_l \times n_{l-1}}
$

So we form the **outer product**:
$
\delta^l (a^{l-1})^T \in \mathbb{R}^{n_l \times n_{l-1}}
$

Each element is:
$
[\delta^l (a^{l-1})^T]_{ij} = \delta^l_i \cdot a^{l-1}_j
$


---

**So Why the Transpose?**

Because:
- $ a^{l-1} $ is a column vector
- You want $ \delta^l \cdot (a^{l-1})^T $ to get a **matrix**, not a scalar

### Example:

If:
- $ \delta^L = \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} \in \mathbb{R}^{2 \times 1} $
- $ a^{L-1} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \in \mathbb{R}^{3 \times 1} $

Then:
$
\delta^L (a^{L-1})^T = 
\begin{bmatrix}
\delta_1 \cdot a_1 & \delta_1 \cdot a_2 & \delta_1 \cdot a_3 \\
\delta_2 \cdot a_1 & \delta_2 \cdot a_2 & \delta_2 \cdot a_3
\end{bmatrix}
\in \mathbb{R}^{2 \times 3}
$

Which is the correct shape for $ w^L $ (2 neurons, each with 3 inputs).
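
In NumPy the outer product is a plain matrix product of a column vector with a row vector. A short sketch with assumed numbers for $\delta^L$ and $a^{L-1}$:

```python
import numpy as np

# Illustrative values (assumed)
delta = np.array([[0.2], [-0.5]])            # 2 x 1
a_prev = np.array([[1.0], [2.0], [3.0]])     # 3 x 1

# Outer product: (2 x 1) @ (1 x 3) -> 2 x 3, same shape as the weight matrix
grad_W = delta @ a_prev.T

# np.outer flattens its inputs and gives the same matrix
grad_W_outer = np.outer(delta, a_prev)

print(grad_W.shape)                          # (2, 3)
print(np.allclose(grad_W, grad_W_outer))     # True
```

Without the transpose, `delta @ a_prev` raises a shape error here, since a $2 \times 1$ and a $3 \times 1$ array cannot be multiplied.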

---

**Summary**

- The gradient of the cost w.r.t. a weight is:
  $
  \frac{\partial C}{\partial w^L_{ij}} = \delta^L_i \cdot a^{L-1}_j
  $
- In matrix form:
  $
  \frac{\partial C}{\partial w^L} = \delta^L (a^{L-1})^T
  $


---

Layer $L-1$ has $m$ neurons (indexed $0$ to $m-1$), so for neuron $j$ in layer $L$:
- $\frac{\partial \textbf{c}}{\partial \textbf{w}_{j}^{(L)}}
=(\frac{\partial z_{j}^{(L)}}{\partial w_{j0}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial c_{0}}{\partial a_{j}^{(L)}},
\frac{\partial z_{j}^{(L)}}{\partial w_{j1}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial c_{0}}{\partial a_{j}^{(L)}},...,
\frac{\partial z_{j}^{(L)}}{\partial w_{jm-1}^{(L)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial c_{0}}{\partial a_{j}^{(L)}})
=(\delta^{(L)}_{j}a^{(L-1)}_{0},\delta^{(L)}_{j}a^{(L-1)}_{1},...,\delta^{(L)}_{j}a^{(L-1)}_{m-1})
=\delta_{j}^{(L)} \textbf{a}^{(L-1)}$

and if we write it in the matrix form:
$ \frac{\partial \textbf{c}}{\partial \textbf{W}^{(L)}}=
\begin{bmatrix}
\delta^{(L)}_{0}a^{(L-1)}_{0} & \delta^{(L)}_{0}a^{(L-1)}_{1}&...&\delta^{(L)}_{0}a^{(L-1)}_{m-1}\\ 
\delta^{(L)}_{1}a^{(L-1)}_{0} & \delta^{(L)}_{1}a^{(L-1)}_{1}&...&\delta^{(L)}_{1}a^{(L-1)}_{m-1}\\
\vdots & \vdots & \vdots &\vdots \\ 
\delta^{(L)}_{j}a^{(L-1)}_{0} & \delta^{(L)}_{j}a^{(L-1)}_{1}&...&\delta^{(L)}_{j}a^{(L-1)}_{m-1}\\
\vdots & \vdots & \vdots &\vdots \\ 
\delta^{(L)}_{n-1}a^{(L-1)}_{0} & \delta^{(L)}_{n-1}a^{(L-1)}_{1}&...&\delta^{(L)}_{n-1}a^{(L-1)}_{m-1}\\ 
\end{bmatrix}
=\begin{bmatrix}
\delta^{(L)}_{0}\\ 
\delta^{(L)}_{1}\\ 
\vdots
 \\ 
\delta^{(L)}_{j}\\ 
\vdots \\
\delta^{(L)}_{n-1}
\end{bmatrix}
\cdot
\begin{bmatrix}
a^{(L-1)}_{0} & a^{(L-1)}_{1}&...&a^{(L-1)}_{m-1}\\ 
\end{bmatrix}=
\boldsymbol{\delta}^{(L)} \cdot (\textbf{a}^{(L-1)})^{\top} \tag{3}$

### 2.6.2 $\frac{\partial C}{\partial b^L}$: The error relative to bias in layer $L$


$\frac{\partial \textbf{c}}{\partial \textbf{b}^{(L)}}=\boldsymbol{\delta}^{(L)} \tag{2} $
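
Equations (1) to (3) together give the full last-layer gradient. The sketch below is one possible check, assuming a sigmoid activation, the cost $c_0=\sum_j (a_j^{(L)}-y_j)^2$, and random test data; `output_layer_grads` and `numerical_grad` are hypothetical helper names. It compares the matrix formulas with central finite differences over every entry of $\textbf{W}^{(L)}$ and $\textbf{b}^{(L)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    return float(np.sum((sigmoid(W @ a_prev + b) - y) ** 2))

def output_layer_grads(W, b, a_prev, y):
    a = sigmoid(W @ a_prev + b)
    delta = 2.0 * (a - y) * a * (1.0 - a)   # equation (1)
    return delta @ a_prev.T, delta          # equation (3), equation (2)

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences of the scalar f() w.r.t. every entry of x.
    # x is perturbed in place and restored after each evaluation.
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2.0 * eps)
    return g

rng = np.random.default_rng(2)
n, m = 3, 5
W = rng.normal(size=(n, m))
b = rng.normal(size=(n, 1))
a_prev = rng.normal(size=(m, 1))
y = rng.uniform(size=(n, 1))

dW, db = output_layer_grads(W, b, a_prev, y)
f = lambda: cost(W, b, a_prev, y)
dW_num = numerical_grad(f, W)
db_num = numerical_grad(f, b)

print(np.max(np.abs(dW - dW_num)), np.max(np.abs(db - db_num)))
```

Both printed differences should be close to zero.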

### 2.4 The error relative to the changes of activation in the previous layers
<img src='images/nn_l-1_layer_a.svg' height="50%" width="50%"  />

### 2.4.1 $ \frac{\partial c_{0}}{\partial a_{k}^{(L-1)}}$ error relative to the changes of single activation in the previous layers

We know that $a_{k}^{(L-1)}$ (layer $L-1$ has $m$ neurons) affects every neuron in layer $L$ (which has $n$ neurons), so the rate of change of the cost with respect to $a_{k}^{(L-1)}$ is a sum over all of them:

$\frac{\partial c_{0}}{\partial a_{k}^{(L-1)}} =
\sum_{j=0}^{n-1}( 
\frac{\partial z_{j}^{(L)}}{\partial a_{k}^{(L-1)}}
\frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}}
\frac{\partial C_{0}}{\partial a_{j}^{(L)}})
$


This expression tells us the total influence of $ a_k^{(L-1)} $ on the cost is the **sum of its effects through all neurons in the next layer**.

To compute the rate of change of the cost with respect to an activation in the previous layer $ a_k^{(L-1)} $, we use the chain rule across all neurons in layer $ L $:

$
\frac{\partial C}{\partial a_k^{(L-1)}} =
\sum_{j=0}^{n-1}
\left(
\underbrace{\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}}_{=w_{jk}^{(L)}} \cdot
\underbrace{\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}}_{=\sigma'(z_j^{(L)})} \cdot
\underbrace{\frac{\partial C}{\partial a_j^{(L)}}}_{\text{from output or recursive}}
\right)
$

This explains how the signal "flows backward" from every neuron in the next layer back to neuron $ k $ in layer $ L-1 $.

---


Subsequently we have:


$\frac{\partial c_{0}}{\partial a_{k}^{(L-1)}} =
\sum_{i=0}^{n-1} 2(a_{i}^{(L)}-y_{i})\sigma^{\prime}(z_{i}^{(L)}) w_{ik}^{(L)}=
\sum_{i=0}^{n-1} \delta^{(L)}_{i}w_{ik}^{(L)}
$



The factors inside the sum are evaluated as follows:

- $ \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} = w_{jk}^{(L)} $, because

  $z_{j}^{(L)}=w_{j,0}^{(L)}\times a^{(L-1)}_{0} +
  w_{j,1}^{(L)}\times a^{(L-1)}_{1}+
  \cdots+
  w_{j,k}^{(L)}\times a^{(L-1)}_{k}+
  \cdots+
  w_{j,m-1}^{(L)}\times a^{(L-1)}_{m-1}+b_{j}^{(L)} = \sum_{i=0}^{m-1} w_{ji}^{(L)} a_i^{(L-1)} + b_j^{(L)}$

  and only the term $w_{j,k}^{(L)} a^{(L-1)}_{k}$ depends on $a_k^{(L-1)}$.


- $ \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} = \sigma'(z_j^{(L)}) $,  
  the derivative of the activation function

- $ \frac{\partial C}{\partial a_j^{(L)}} $:  
  how the cost depends on the activation of neuron $ j $ in layer $ L $

So you can rewrite it as:

$
\frac{\partial C}{\partial a_k^{(L-1)}} =
\sum_{j=0}^{n-1}
\left(
w_{jk}^{(L)} \cdot \sigma'(z_j^{(L)}) \cdot \frac{\partial C}{\partial a_j^{(L)}}
\right)
$


**Example:**

Let’s say:
- Layer $ L-1 $ has $ m = 3 $ neurons
- Layer $ L $ has $ n = 2 $ neurons

That is:

- $ a^{(L-1)} \in \mathbb{R}^{3 \times 1} $
- $ w^{(L)} \in \mathbb{R}^{2 \times 3} $
- $ z^{(L)} \in \mathbb{R}^{2 \times 1} $
- $ a^{(L)} = \sigma(z^{(L)}) \in \mathbb{R}^{2 \times 1} $
- $ \frac{\partial C}{\partial a^{(L)}} \in \mathbb{R}^{2 \times 1} $
- $ \sigma'(z^{(L)}) \in \mathbb{R}^{2 \times 1} $
- $ \delta^L = \frac{\partial C}{\partial a^{(L)}} \odot \sigma'(z^{(L)}) \in \mathbb{R}^{2 \times 1} $

---

**Equation with Dimensions**

You’re computing:

$
\frac{\partial C}{\partial a_k^{(L-1)}} =
\sum_{j=0}^{n-1}
\left(
\underbrace{\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}}_{\textcolor{blue}{\mathbb{R}}} \cdot
\underbrace{\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}}_{\textcolor{green}{\mathbb{R}}} \cdot
\underbrace{\frac{\partial C}{\partial a_j^{(L)}}}_{\textcolor{red}{\mathbb{R}}}
\right)
$

Now the dimensions:

| Term | Description | Shape |
|------|-------------|-------|
| $ w^{(L)} $ | Weights from layer $ L-1 $ to $ L $ | $ \mathbb{R}^{2 \times 3} $ |
| $ \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} $ | One scalar: the weight $ w_{jk}^{(L)} $ | $ \mathbb{R} $ |
| $ \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} $ | Derivative of activation function | $ \mathbb{R} $ |
| $ \frac{\partial C}{\partial a_j^{(L)}} $ | Gradient of cost w.r.t. activation | $ \mathbb{R} $ |
| Whole sum | Gradient of cost w.r.t. $ a_k^{(L-1)} $ | $ \mathbb{R} $ |

So for each $ k \in [0, 2] $, you're summing over the $ n = 2 $ neurons in the next layer.

---

### 2.4.2 $\frac{\partial C}{\partial a^{(L-1)}}$ Error relative to the changes of activations in the previous layers in matrix Form



When we write the above in vector/matrix form for the full gradient, we get:

$
\frac{\partial C}{\partial \mathbf{a}^{(L-1)}} =
(w^{(L)})^T \cdot \delta^L
$

Where:
- $ w^{(L)} \in \mathbb{R}^{n \times m} $
- $ \delta^L \in \mathbb{R}^{n \times 1} $, with each entry:
  $
  \delta_j^{(L)} = \frac{\partial C}{\partial a_j^{(L)}} \cdot \sigma'(z_j^{(L)})
  $

---



| Quantity | Shape |
|----------|-------|
| $ w^{(L)} $ | $ \mathbb{R}^{2 \times 3} $ |
| $ (w^{(L)})^T $ | $ \mathbb{R}^{3 \times 2} $ |
| $ \delta^L $ | $ \mathbb{R}^{2 \times 1} $ |
| $ \frac{\partial C}{\partial a^{(L-1)}} $ | $ \mathbb{R}^{3 \times 1} $ ✅ |

So the result has one scalar for **each activation** in layer $ L-1 $.
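
The matrix form $(w^{(L)})^T \delta^L$ is just the per-neuron sum evaluated for every $k$ at once. A minimal sketch with the $n=2$, $m=3$ shapes used above and random illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2, 3                          # neurons in layer L and layer L-1
W = rng.normal(size=(n, m))          # w^{(L)}
delta = rng.normal(size=(n, 1))      # delta^L (assumed values)

# Matrix form: (3 x 2) @ (2 x 1) -> 3 x 1
grad_a_prev = W.T @ delta

# Index form: for each k, sum over the n neurons of layer L
grad_loop = np.array(
    [[sum(delta[j, 0] * W[j, k] for j in range(n))] for k in range(m)]
)

print(grad_a_prev.shape)                     # (3, 1)
print(np.allclose(grad_a_prev, grad_loop))   # True
```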

---

## 3. Putting all together 


The way we compute the **error term $ \delta^L $** at the **output layer** is different from how we compute the error $ \delta^l $ in **hidden layers**.

---

### 3.1 The Purpose of $ \delta^l $

In backpropagation, the error term at each layer is:

$
\delta^l = \frac{\partial C}{\partial z^l}
$

This is the **gradient of the cost function with respect to the weighted input** $ z^l $, not the activation $ a^l $.

It’s used to:
- Compute how much each neuron contributed to the final error
- Update weights and biases:  
  $
  \frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T, \quad \frac{\partial C}{\partial b^l} = \delta^l
  $

---

### 3.2 Case 1: Output Layer $ \delta^L $

Here, the cost function is **directly dependent** on the output activation $ a^L $, so you can directly compute:

$
\delta^L = \frac{\partial C}{\partial z^L}
= \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)
$

If you’re using **mean squared error**:
$
C = \frac{1}{2} \| a^L - y \|^2 \quad \Rightarrow \quad \frac{\partial C}{\partial a^L} = a^L - y
$

So:
$
\delta^L = (a^L - y) \odot \sigma'(z^L)
$

This is **explicitly derived** from the loss function.

---

### 3.3 Case 2: Hidden Layers $ \delta^l $, for $ l < L $

For hidden layers, the cost **does not directly depend** on $ a^l $, so we must **propagate the error backward** through the next layer using the chain rule.

We compute:

$
\delta^l = \frac{\partial C}{\partial z^l}
= \left( \frac{\partial z^{l+1}}{\partial a^l} \right)^T \cdot \frac{\partial C}{\partial z^{l+1}} \odot \sigma'(z^l)
$

Which simplifies to:

$
\delta^l = (w^{l+1})^T \delta^{l+1} \odot \sigma'(z^l)
$

This equation has three components:
1. $ (w^{l+1})^T \delta^{l+1} $: pulls the error **backward**
2. $ \odot $: element-wise product (Hadamard)
3. $ \sigma'(z^l) $: applies local sensitivity of activation

---


| Layer Type     | Equation                                                  | Reason |
|----------------|-----------------------------------------------------------|--------|
| Output Layer   | $ \delta^L = (a^L - y) \odot \sigma'(z^L) $             | Cost function is defined on $ a^L $ |
| Hidden Layer   | $ \delta^l = (w^{l+1})^T \delta^{l+1} \odot \sigma'(z^l) $ | We propagate gradient backward using the chain rule |
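
Both rows of the table can be exercised on a tiny two-layer network. The sketch below is an illustration under stated assumptions (sigmoid activations, $C=\frac{1}{2}\|a^L-y\|^2$, random weights, a hidden layer of 4 neurons and an output layer of 2); it checks that the hidden-layer formula really equals $\frac{\partial C}{\partial z^l}$ by perturbing $z^1$ directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost_from_z1(z1, W2, b2, y):
    # Cost as a function of the hidden pre-activation z^1
    a2 = sigmoid(W2 @ sigmoid(z1) + b2)
    return 0.5 * float(np.sum((a2 - y) ** 2))

rng = np.random.default_rng(3)
W2 = rng.normal(size=(2, 4))
b2 = rng.normal(size=(2, 1))
z1 = rng.normal(size=(4, 1))
y = rng.uniform(size=(2, 1))

a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

delta2 = (a2 - y) * sigmoid_prime(z2)            # output layer
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)     # hidden layer

# Central finite differences of C with respect to each entry of z^1
eps = 1e-6
delta1_num = np.zeros_like(z1)
for i in range(z1.shape[0]):
    step = np.zeros_like(z1)
    step[i, 0] = eps
    delta1_num[i, 0] = (
        cost_from_z1(z1 + step, W2, b2, y) - cost_from_z1(z1 - step, W2, b2, y)
    ) / (2.0 * eps)

print(np.max(np.abs(delta1 - delta1_num)))
```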

---

**Intuitive Analogy**

- At the **output layer**, we **see the error** directly (compare predicted vs target).
- At **hidden layers**, we don’t see the error directly — we only know **how errors in the next layer depend on these neurons**, so we propagate the error backward using the weights.

---

**Summary**


- $ \delta^L $ is computed **directly** from the loss
- $ \delta^l $ for $ l < L $ is computed **indirectly** using:
  $
  \delta^l = (w^{l+1})^T \delta^{l+1} \odot \sigma'(z^l)
  $





## 4. Recap and Derivation of the Backpropagation Rule

### **Recap:**

Consider an MLP (a feedforward neural network), with layers labeled:

- $ l = 1, 2, ..., L $, where $ L $ is the final layer (output layer)
- Each layer has:
  - **Weights** $ W^l $
  - **Biases** $ b^l $
  - **Input to activation** (pre-activation) $ z^l = W^l a^{l-1} + b^l $
  - **Activation** $ a^l = \sigma(z^l) $

We define:
- $ \delta^l $ as the **error term** (gradient of the loss with respect to $ z^l $), i.e.:
  $
  \delta^l = \frac{\partial C}{\partial z^l}
  $

---

### **The Backpropagation Rule:**

To backpropagate the error from layer $ l+1 $ to layer $ l $, we use the chain rule. The rule is:

$
\delta^l = (W^{l+1})^T \delta^{l+1} \odot \sigma'(z^l)
$

Let’s explain this step by step.

---

### **Step-by-Step Derivation:**

We want to compute:

$
\delta^l = \frac{\partial C}{\partial z^l}
$

Using the chain rule:

$
\delta^l = \frac{\partial C}{\partial z^l}
= \frac{\partial C}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l}
$

Now:

- $ \frac{\partial a^l}{\partial z^l} = \sigma'(z^l) $ — elementwise derivative of the activation function
- $ \frac{\partial C}{\partial a^l} $ is harder, but we can express it using the next layer:
  
Since:

$
z^{l+1} = W^{l+1} a^l + b^{l+1}
\Rightarrow a^{l+1} = \sigma(z^{l+1})
\Rightarrow C = C(a^{l+1})
$

By the chain rule again:

$
\frac{\partial C}{\partial a^l} = (W^{l+1})^T \frac{\partial C}{\partial z^{l+1}} = (W^{l+1})^T \delta^{l+1}
$

Therefore:

$
\delta^l = (W^{l+1})^T \delta^{l+1} \odot \sigma'(z^l)
$

---

### **What This Means Intuitively:**

- $ \delta^{l+1} $ tells you how the loss changes with respect to the next layer's inputs
- $ (W^{l+1})^T \delta^{l+1} $ propagates that error **backward through the weights**
- $ \odot \sigma'(z^l) $ adjusts that backward-flowing error by how sensitive the activation is (via derivative of the activation function)

---

### **Example (ReLU):**

If you're using ReLU:
- $ \sigma(z) = \max(0, z) $
- $ \sigma'(z) = 1 $ if $ z > 0 $, otherwise 0

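So only the neurons that were active in the forward pass ($ z > 0 $) receive a nonzero gradient in the backward pass. Inactive neurons output zero and pass no error backward, which is why both the activations and the gradients of a ReLU layer are often sparse.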
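
With ReLU, the factor $\sigma'(z^l)$ acts as a 0/1 mask. A small sketch, where `upstream` stands in for $(W^{l+1})^T \delta^{l+1}$ and all numbers are made up for illustration (the derivative at exactly $z=0$ is taken to be 0, a common convention):

```python
import numpy as np

# Pre-activations of a 4-neuron layer (assumed values)
z = np.array([[-1.5], [0.0], [2.0], [0.7]])

# Stand-in for (W^{l+1})^T delta^{l+1}
upstream = np.array([[0.3], [-0.8], [0.5], [-0.1]])

relu_prime = (z > 0).astype(float)   # 1 where the neuron was active, else 0
delta = upstream * relu_prime        # error passes only through active neurons

print(relu_prime.ravel())            # [0. 0. 1. 1.]
print(np.count_nonzero(delta))       # 2
```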

---




## 5. Numerical Example

This section works through a **fully detailed numerical example** of **backpropagation** for a small network, showing exactly how the errors $ \delta^l $ and the gradients are computed step by step.

---

## 🧮 Network Architecture

We'll use a small network to keep the math readable:

$
\textbf{Network structure: } [2, 2, 1]
$

- Input layer: 2 neurons  
- Hidden layer: 2 neurons  
- Output layer: 1 neuron  

We’ll assume:
- **Sigmoid** activation: $ \sigma(z) = \frac{1}{1 + e^{-z}} $
- **Quadratic cost**: $ C = \frac{1}{2}(a^L - y)^2 $

---

## 🎯 Step 1: Initialize Weights, Biases, Input

We'll choose **fixed values** to simplify:

### Input $ x $:
$
x = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}
$

### Target $ y $:
$
y = \begin{bmatrix} 0.8 \end{bmatrix}
$

### Weights and Biases:
| Layer | Weights $ w^l $ | Biases $ b^l $ |
|-------|-------------------|------------------|
| 1 (Input → Hidden) | $ w^1 = \begin{bmatrix} 0.1 & 0.3 \\ 0.2 & 0.4 \end{bmatrix} $ | $ b^1 = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix} $ |
| 2 (Hidden → Output) | $ w^2 = \begin{bmatrix} 0.7 & -1.2 \end{bmatrix} $ | $ b^2 = \begin{bmatrix} 0.05 \end{bmatrix} $ |

---

## 🔄 Step 2: Forward Pass

### Hidden layer (layer 1):
$
z^1 = w^1 x + b^1
= \begin{bmatrix} 0.1 & 0.3 \\ 0.2 & 0.4 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}
+ \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix}
= \begin{bmatrix} 0.1 + 0.15 \\ 0.2 + 0.2 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix}
= \begin{bmatrix} 0.35 \\ 0.3 \end{bmatrix}
$

Apply sigmoid:
$
a^1 = \sigma(z^1) \approx \begin{bmatrix} \sigma(0.35) \\ \sigma(0.3) \end{bmatrix}
\approx \begin{bmatrix} 0.5866 \\ 0.5744 \end{bmatrix}
$

---

### Output layer (layer 2):
$
z^2 = w^2 a^1 + b^2 = \begin{bmatrix} 0.7 & -1.2 \end{bmatrix} \begin{bmatrix} 0.5866 \\ 0.5744 \end{bmatrix} + 0.05
= (0.4106 - 0.6893) + 0.05 = -0.2287
$

Apply sigmoid:
$
a^2 = \sigma(z^2) = \sigma(-0.2287) \approx 0.4431
$
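
The forward pass above can be reproduced with a few lines of NumPy. This is only a sketch: the variable names (`w1`, `b1`, `w2`, `b2`) are chosen here to mirror the notation, and vectors are stored as column arrays.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values from Step 1 (column vectors)
x  = np.array([[1.0], [0.5]])
w1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([[0.1], [-0.1]])
w2 = np.array([[0.7, -1.2]])
b2 = np.array([[0.05]])

# Hidden layer
z1 = w1 @ x + b1
a1 = sigmoid(z1)

# Output layer
z2 = w2 @ a1 + b2
a2 = sigmoid(z2)

print("z1 =", np.round(z1.ravel(), 4))
print("a1 =", np.round(a1.ravel(), 4))
print("z2 =", np.round(z2.ravel(), 4))
print("a2 =", np.round(a2.ravel(), 4))
```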

---

## 🧠 Step 3: Backward Pass

---

### Output layer (layer 2)

Compute cost derivative:
$
\frac{\partial C}{\partial a^2} = a^2 - y = 0.4431 - 0.8 = -0.3569
$

$
\sigma'(z^2) = \sigma(z^2)(1 - \sigma(z^2)) = 0.4431 \cdot (1 - 0.4431) \approx 0.2467
$

$
\delta^2 = \frac{\partial C}{\partial z^2} = \frac{\partial C}{\partial a^2} \cdot \sigma'(z^2)
= -0.3569 \cdot 0.2467 \approx -0.0880
$

---

### Hidden layer (layer 1)

$
\delta^1 = ((w^2)^T \delta^2) \odot \sigma'(z^1)
$

First:
$
(w^2)^T \delta^2 = \begin{bmatrix} 0.7 \\ -1.2 \end{bmatrix} \cdot -0.0880
= \begin{bmatrix} -0.0616 \\ 0.1056 \end{bmatrix}
$

Now compute $ \sigma'(z^1) $:
$
\sigma'(z^1) = \sigma(z^1)(1 - \sigma(z^1)) \approx
\begin{bmatrix} 0.5866 \cdot (1 - 0.5866) \\ 0.5744 \cdot (1 - 0.5744) \end{bmatrix}
= \begin{bmatrix} 0.2425 \\ 0.2445 \end{bmatrix}
$

So:
$
\delta^1 = \begin{bmatrix} -0.0616 \\ 0.1056 \end{bmatrix}
\odot \begin{bmatrix} 0.2425 \\ 0.2445 \end{bmatrix}
= \begin{bmatrix} -0.0149 \\ 0.0258 \end{bmatrix}
$

---

## 🧾 Step 4: Gradient Calculation

### Gradients for output layer weights:
$
\frac{\partial C}{\partial w^2} = \delta^2 \cdot (a^1)^T
= -0.0880 \cdot \begin{bmatrix} 0.5866 & 0.5744 \end{bmatrix}
= \begin{bmatrix} -0.0516 & -0.0505 \end{bmatrix}
$

### Gradients for output layer biases:
$
\frac{\partial C}{\partial b^2} = \delta^2 = -0.0880
$

---

### Gradients for hidden layer weights:
$
\frac{\partial C}{\partial w^1} = \delta^1 \cdot (a^0)^T = \delta^1 \cdot x^T
= \begin{bmatrix} -0.0149 \\ 0.0258 \end{bmatrix}
\cdot \begin{bmatrix} 1.0 & 0.5 \end{bmatrix}
= \begin{bmatrix} -0.0149 & -0.0075 \\ 0.0258 & 0.0129 \end{bmatrix}
$

### Gradients for hidden layer biases:
$
\frac{\partial C}{\partial b^1} = \delta^1 = \begin{bmatrix} -0.0149 \\ 0.0258 \end{bmatrix}
$

---

## ✅ Final Summary

| Layer | $ \delta^l $ | $ \frac{\partial C}{\partial w^l} $ | $ \frac{\partial C}{\partial b^l} $ |
|-------|----------------|--------------------------|------------------|
| 2 (output) | -0.0880 | $[-0.0516, -0.0505]$ | -0.0880 |
| 1 (hidden) | $[-0.0149, 0.0258]$ | $ \begin{bmatrix} -0.0149 & -0.0075 \\ 0.0258 & 0.0129 \end{bmatrix} $ | same as $ \delta^1 $ |
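
One way to gain confidence in hand-computed gradients is a finite-difference check. The sketch below recomputes the backward pass for this $[2, 2, 1]$ network and compares $ \partial C / \partial w^1 $ with a central-difference estimate. The helper `cost` and the step size `eps` are choices made for this illustration. Because the code keeps full precision, its values can differ from the hand-rounded ones in the fourth decimal place.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w1, b1, w2, b2, x, y):
    # Quadratic cost C = 1/2 * ||a2 - y||^2 for the two-layer network
    a1 = sigmoid(w1 @ x + b1)
    a2 = sigmoid(w2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

x  = np.array([[1.0], [0.5]])
y  = np.array([[0.8]])
w1 = np.array([[0.1, 0.3], [0.2, 0.4]])
b1 = np.array([[0.1], [-0.1]])
w2 = np.array([[0.7, -1.2]])
b2 = np.array([[0.05]])

# Analytic gradients via backpropagation
a1 = sigmoid(w1 @ x + b1)
a2 = sigmoid(w2 @ a1 + b2)
delta2 = (a2 - y) * a2 * (1 - a2)            # sigma'(z) = a * (1 - a)
delta1 = (w2.T @ delta2) * a1 * (1 - a1)
grad_w2 = delta2 @ a1.T
grad_w1 = delta1 @ x.T

# Numerical gradient of C with respect to each entry of w1
eps = 1e-6
num_w1 = np.zeros_like(w1)
for j in range(w1.shape[0]):
    for k in range(w1.shape[1]):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[j, k] += eps
        w_minus[j, k] -= eps
        num_w1[j, k] = (cost(w_plus, b1, w2, b2, x, y)
                        - cost(w_minus, b1, w2, b2, x, y)) / (2 * eps)

print("delta2 :", np.round(delta2.ravel(), 4))
print("delta1 :", np.round(delta1.ravel(), 4))
print("grad_w2:", np.round(grad_w2, 4))
print("grad_w1:\n", np.round(grad_w1, 4))
print("max |analytic - numeric| for w1:", np.max(np.abs(grad_w1 - num_w1)))
```

If backpropagation is implemented correctly, the last printed difference should be tiny compared with the gradient entries themselves.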

---


The second example walks through a **full forward and backward pass** for a slightly larger neural network with architecture:

---

## 🧠 Network Structure: $[2, 3, 2]$

| Layer      | # Neurons | Notation |
|------------|-----------|----------|
| Input      | 2         | $ a^0 $ |
| Hidden     | 3         | $ a^1 $ |
| Output     | 2         | $ a^2 $ |

We'll assume:
- Activation: **Sigmoid**
- Cost: **Quadratic loss**
$
C = \frac{1}{2} \| a^2 - y \|^2
$

---

## 🔢 Step 1: Choose Input and Target

### Input:
$
x = \begin{bmatrix} 0.9 \\ 0.1 \end{bmatrix}
$

### Target:
$
y = \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix}
$

---

## ⚙️ Step 2: Weights and Biases

### Layer 1: Input → Hidden
$
w^1 = \begin{bmatrix}
0.1 & 0.4 \\
0.2 & 0.3 \\
0.5 & -0.1
\end{bmatrix}, \quad
b^1 = \begin{bmatrix} 0.1 \\ -0.1 \\ 0.2 \end{bmatrix}
$

### Layer 2: Hidden → Output
$
w^2 = \begin{bmatrix}
0.3 & -0.7 & 0.2 \\
0.6 & 0.1 & -0.5
\end{bmatrix}, \quad
b^2 = \begin{bmatrix} 0.05 \\ -0.05 \end{bmatrix}
$

---

## 🔄 Step 3: Forward Pass

### Hidden Layer
$
z^1 = w^1 x + b^1 = 
\begin{bmatrix}
0.1\cdot0.9 + 0.4\cdot0.1 + 0.1 \\
0.2\cdot0.9 + 0.3\cdot0.1 - 0.1 \\
0.5\cdot0.9 + (-0.1)\cdot0.1 + 0.2
\end{bmatrix}
= \begin{bmatrix} 0.23 \\ 0.11 \\ 0.64 \end{bmatrix}
$

Apply sigmoid:
$
a^1 = \sigma(z^1) \approx \begin{bmatrix} 0.5572 \\ 0.5275 \\ 0.6548 \end{bmatrix}
$

From here on, intermediate values are carried at full precision and rounded to four decimals only for display.

---

### Output Layer
$
z^2 = w^2 a^1 + b^2
= \begin{bmatrix}
0.3 & -0.7 & 0.2 \\
0.6 & 0.1 & -0.5
\end{bmatrix}
\cdot
\begin{bmatrix} 0.5572 \\ 0.5275 \\ 0.6548 \end{bmatrix}
+ \begin{bmatrix} 0.05 \\ -0.05 \end{bmatrix}
$

Calculating:
$
z^2_1 = 0.3(0.5572) - 0.7(0.5275) + 0.2(0.6548) + 0.05 \approx -0.0211
$
$
z^2_2 = 0.6(0.5572) + 0.1(0.5275) - 0.5(0.6548) - 0.05 \approx 0.0097
$

$
z^2 = \begin{bmatrix} -0.0211 \\ 0.0097 \end{bmatrix}
\quad \Rightarrow \quad
a^2 = \sigma(z^2) \approx \begin{bmatrix} 0.4947 \\ 0.5024 \end{bmatrix}
$

---

## 🧠 Step 4: Backward Pass

---

### Output Layer Error:
$
\delta^2 = (a^2 - y) \odot \sigma'(z^2)
$

$
a^2 - y = \begin{bmatrix} -0.5053 \\ 0.5024 \end{bmatrix}
$
$
\sigma'(z^2) = a^2 \odot (1 - a^2) \approx \begin{bmatrix} 0.4947(1 - 0.4947) \\ 0.5024(1 - 0.5024) \end{bmatrix}
\approx \begin{bmatrix} 0.2500 \\ 0.2500 \end{bmatrix}
$

$
\delta^2 \approx \begin{bmatrix} -0.1263 \\ 0.1256 \end{bmatrix}
$

---

### Gradients for Layer 2:

#### Weight gradients:
$
\nabla w^2 = \delta^2 \cdot (a^1)^T
$

$
\nabla w^2 \approx \begin{bmatrix}
-0.1263 \cdot 0.5572 & -0.1263 \cdot 0.5275 & -0.1263 \cdot 0.6548 \\
0.1256 \cdot 0.5572 & 0.1256 \cdot 0.5275 & 0.1256 \cdot 0.6548
\end{bmatrix}
\approx \begin{bmatrix}
-0.0704 & -0.0666 & -0.0827 \\
0.0700 & 0.0663 & 0.0822
\end{bmatrix}
$

#### Bias gradients:
$
\nabla b^2 = \delta^2 \approx \begin{bmatrix} -0.1263 \\ 0.1256 \end{bmatrix}
$

---

### Hidden Layer Error:

$
\delta^1 = ((w^2)^T \delta^2) \odot \sigma'(z^1)
$

First:
$
(w^2)^T \delta^2 = \begin{bmatrix}
0.3 & 0.6 \\
-0.7 & 0.1 \\
0.2 & -0.5
\end{bmatrix}
\cdot \begin{bmatrix} -0.1263 \\ 0.1256 \end{bmatrix}
\approx \begin{bmatrix}
0.0375 \\
0.1010 \\
-0.0881
\end{bmatrix}
$

Then:
$
\sigma'(z^1) = a^1 \odot (1 - a^1) \approx
\begin{bmatrix} 0.2467 \\ 0.2492 \\ 0.2261 \end{bmatrix}
$

So:
$
\delta^1 = \begin{bmatrix}
0.0375 \cdot 0.2467 \\
0.1010 \cdot 0.2492 \\
-0.0881 \cdot 0.2261
\end{bmatrix}
\approx \begin{bmatrix}
0.0092 \\
0.0252 \\
-0.0199
\end{bmatrix}
$

---

### Gradients for Layer 1:

#### Weight gradients:
$
\nabla w^1 = \delta^1 \cdot (x)^T =
\begin{bmatrix}
0.0092 \cdot 0.9 & 0.0092 \cdot 0.1 \\
0.0252 \cdot 0.9 & 0.0252 \cdot 0.1 \\
-0.0199 \cdot 0.9 & -0.0199 \cdot 0.1
\end{bmatrix}
\approx
\begin{bmatrix}
0.0083 & 0.0009 \\
0.0227 & 0.0025 \\
-0.0179 & -0.0020
\end{bmatrix}
$

#### Bias gradients:
$
\nabla b^1 = \delta^1 \approx \begin{bmatrix} 0.0092 \\ 0.0252 \\ -0.0199 \end{bmatrix}
$

---

## ✅ Summary

| Layer | $ \delta $ | $ \nabla w $ | $ \nabla b $ |
|-------|--------------|----------------|----------------|
| Output (2) | $ \begin{bmatrix}-0.1263\\0.1256\end{bmatrix} $ | $ 2 \times 3 $ matrix shown above | same as $ \delta^2 $ |
| Hidden (1) | $ \begin{bmatrix}0.0092\\0.0252\\-0.0199\end{bmatrix} $ | $ 3 \times 2 $ matrix shown above | same as $ \delta^1 $ |

---
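
The same four backpropagation steps work for any number of layers. Below is a minimal sketch of a generic routine, assuming sigmoid activations and the quadratic cost used in this example. The function name `backprop` and the list-of-arrays layout are choices made here for illustration. Running it on the $[2, 3, 2]$ network gives full-precision values for cross-checking the hand computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) for a sigmoid network with quadratic cost."""
    # Forward pass: keep every activation, starting with a^0 = x
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w @ activations[-1] + b))

    # Output error: (a^L - y) * sigma'(z^L), using sigma'(z) = a * (1 - a)
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1 - a_out)

    grad_w = [None] * len(weights)
    grad_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grad_w[l] = delta @ activations[l].T   # delta^l (a^{l-1})^T
        grad_b[l] = delta
        if l > 0:
            # Propagate the error one layer back
            a_prev = activations[l]
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
    return grad_w, grad_b

x = np.array([[0.9], [0.1]])
y = np.array([[1.0], [0.0]])
weights = [np.array([[0.1, 0.4], [0.2, 0.3], [0.5, -0.1]]),
           np.array([[0.3, -0.7, 0.2], [0.6, 0.1, -0.5]])]
biases = [np.array([[0.1], [-0.1], [0.2]]),
          np.array([[0.05], [-0.05]])]

grad_w, grad_b = backprop(weights, biases, x, y)
print("delta^2 =", np.round(grad_b[1].ravel(), 4))
print("delta^1 =", np.round(grad_b[0].ravel(), 4))
print("grad w^2 shape:", grad_w[1].shape, " grad w^1 shape:", grad_w[0].shape)
```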


**Reminder:**

The last layer $L$ has $n$ neurons.

Layer $L-1$ has $m$ neurons.

Layer $L-2$ has $p$ neurons.

$\textbf{W}^{(L)}_{n \times m}$

$\textbf{W}^{(L-1)}_{m \times p}$

$\textbf{W}^{(L)}_{n\times m}=\begin{bmatrix}
w_{0,0}^{(L)} &w_{0,1}^{(L)}  & \cdots   &w_{0,m-1}^{(L)} \\ 
w_{1,0}^{(L)} &w_{1,1}^{(L)}  & \cdots   &w_{1,m-1}^{(L)} \\ 
 \vdots &  \ddots &  & \\ 
w_{n-1,0}^{(L)} &w_{n-1,1}^{(L)}  & \cdots   &w_{n-1,m-1}^{(L)} 
\end{bmatrix}
_{n\times m}$

$(\textbf{W}^{(L)})^\top_{m\times n}=\begin{bmatrix}
w_{0,0}^{(L)} &w_{1,0}^{(L)}  & \cdots   & w_{n-1,0}^{(L)}\\ 
w_{0,1}^{(L)} &w_{1,1}^{(L)}  & \cdots   &w_{n-1,1}^{(L)} \\ 
 \vdots &  \ddots &  & \\ 
w_{0,m-1}^{(L)} &w_{1,m-1}^{(L)}  & \cdots   &w_{n-1,m-1}^{(L)} 
\end{bmatrix}
_{m\times n}$

$\boldsymbol\delta^{(L)}=\begin{bmatrix}
\delta^{(L)}_{0}\\ 
\delta^{(L)}_{1}\\ 
\vdots
 \\ 
\delta^{(L)}_{j}\\ 
\vdots \\
\delta^{(L)}_{n-1}
\end{bmatrix}$

$\boldsymbol \sigma'(z^{(L-1)})=
\begin{bmatrix}
\sigma'(z^{(L-1)}_{0})\\ 
\sigma'(z^{(L-1)}_{1})\\ 
\vdots\\
\sigma'(z^{(L-1)}_{m-1})
\end{bmatrix}$

$(\textbf{a}^{(L-2)})^\top=
\begin{bmatrix}
a^{(L-2)}_{0} & a^{(L-2)}_{1}&\cdots&a^{(L-2)}_{p-1}\\ 
\end{bmatrix}$

Now compute the derivative of the cost with respect to a single weight in layer $L-1$:

$\frac{\partial c_{0}}{\partial w_{jk}^{(L-1)}} =
\frac{\partial z_{j}^{(L-1)}}{\partial w_{jk}^{(L-1)}}
\frac{\partial a_{j}^{(L-1)}}{\partial z_{j}^{(L-1)}}\frac{\partial c_{0}}{\partial a_{j}^{(L-1)}}=
\frac{\partial z_{j}^{(L-1)}}{\partial w_{jk}^{(L-1)}}
\frac{\partial a_{j}^{(L-1)}}{\partial z_{j}^{(L-1)}}
\sum_{i=0}^{n-1} \delta^{(L)}_{i}w_{ij}^{(L)}=
a_{k}^{(L-2)}
\sigma' (z_{j}^{(L-1)})
\sum_{i=0}^{n-1} \delta^{(L)}_{i}w_{ij}^{(L)}$

$\frac{\partial c_{0}}{\partial w_{jk}^{(L-1)}} =
a_{k}^{(L-2)}
\sigma' (z_{j}^{(L-1)})
\sum_{i=0}^{n-1} \delta^{(L)}_{i}w_{ij}^{(L)}=
 \left (   (\textbf{W}^{(L)}_{(0:n-1,j)})^\top\cdot  \boldsymbol\delta^{(L)}\right ) \sigma' (z_{j}^{(L-1)})
 a_{k}^{(L-2)}$



This was only one element of $\textbf{W}^{(L-1)}_{m \times p}$. Written for the entire matrix, it becomes:




$\frac{\partial\textbf{c}}{\partial\textbf{W}^{(L-1)}}_{m\times p}=
\left(
\left(
\begin{bmatrix}
w_{0,0}^{(L)} &w_{1,0}^{(L)}  & \cdots   & w_{n-1,0}^{(L)}\\ 
w_{0,1}^{(L)} &w_{1,1}^{(L)}  & \cdots   &w_{n-1,1}^{(L)} \\ 
 \vdots &  \ddots &  & \\ 
w_{0,m-1}^{(L)} &w_{1,m-1}^{(L)}  & \cdots   &w_{n-1,m-1}^{(L)} 
\end{bmatrix}
_{m\times n}
\begin{bmatrix}
\delta^{(L)}_{0}\\ 
\delta^{(L)}_{1}\\ 
\vdots
 \\ 
\delta^{(L)}_{j}\\ 
\vdots \\
\delta^{(L)}_{n-1}
\end{bmatrix}_{n \times 1}
\right)
\odot 
\begin{bmatrix}
\sigma'(z^{(L-1)}_{0})\\ 
\sigma'(z^{(L-1)}_{1})\\ 
\vdots\\
\sigma'(z^{(L-1)}_{m-1})
\end{bmatrix}_{m\times 1}
\right)
\cdot 
\begin{bmatrix}
a^{(L-2)}_{0} & a^{(L-2)}_{1}&\cdots&a^{(L-2)}_{p-1}\\ 
\end{bmatrix}_{1 \times p}$


$\frac{\partial\textbf{c}}{\partial\textbf{W}^{(L-1)}}=
\left(
   ((\textbf{W}^{(L)})^\top \boldsymbol\delta^{(L)}) \odot \boldsymbol\sigma'(z^{(L-1)})
\right)
\cdot (\textbf{a}^{(L-2)})^\top$

Now let's compute $\delta_k^{(L-1)}$:


$\delta_k^{(L-1)} = \frac{\partial C}{\partial z_k^{(L-1)}} = \sum_{j=0}^{n-1} \frac{\partial C}{\partial z_j^{(L)}} \frac{\partial z_j^{(L)}}{\partial z_k^{(L-1)}} = \sum_{j=0}^{n-1}  \delta_j^{(L)}\frac{\partial z_j^{(L)}}{\partial z_k^{(L-1)}}$

Since:
$z^{(L)}_j = \sum_{i=0}^{m-1} w_{ji}^{(L)} \sigma(z_i^{(L-1)}) + b_j^{(L)}$

only the $i = k$ term depends on $z_k^{(L-1)}$, so the partial derivative is:

$\frac{\partial z_j^{(L)}}{\partial z_k^{(L-1)}} = w_{jk}^{(L)}\sigma'(z_{k}^{(L-1)})$


Putting these pieces together we have:

$\delta_k^{(L-1)} =\sum_{j=0}^{n-1} \delta_j^{(L)}\frac{\partial z_j^{(L)}}{\partial z_k^{(L-1)}}= \left(\sum_{j=0}^{n-1}  \delta_j^{(L)}  w_{jk}^{(L)}\right)\sigma'(z_{k}^{(L-1)})$



$\boldsymbol\delta^{(L-1)}=\frac{\partial c}{\partial z^{(L-1)}}  = ((\textbf{W}^{(L)})^\top \boldsymbol\delta^{(L)}) \odot \sigma'(z^{(L-1)})\tag{4}$


$\frac{\partial\textbf{c}}{\partial \textbf{b}^{(L-1)}}=\boldsymbol\delta^{(L-1)}\tag{5}$



$\frac{\partial\textbf{c}}{\partial\textbf{W}^{(L-1)}}=\boldsymbol\delta^{(L-1)}\cdot (\textbf{a}^{(L-2)})^\top\tag{6}$
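
To see that the element-wise formula and the matrix form describe the same gradient, the sketch below evaluates both on random data. The layer sizes `n, m, p` and the random values are arbitrary choices for this check, and sigmoid is assumed as the activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
n, m, p = 2, 3, 4                       # neurons in layers L, L-1, L-2

W_L     = rng.normal(size=(n, m))       # W^(L), shape n x m
delta_L = rng.normal(size=(n, 1))       # delta^(L)
z_prev  = rng.normal(size=(m, 1))       # z^(L-1)
a_prev2 = rng.normal(size=(p, 1))       # a^(L-2)

# Element-wise: dc/dw_jk = a_k * sigma'(z_j) * sum_i delta_i * w_ij
grad_loop = np.zeros((m, p))
for j in range(m):
    back = sum(delta_L[i, 0] * W_L[i, j] for i in range(n))
    for k in range(p):
        grad_loop[j, k] = a_prev2[k, 0] * sigmoid_prime(z_prev[j, 0]) * back

# Matrix form: equations (4) and (6)
delta_prev = (W_L.T @ delta_L) * sigmoid_prime(z_prev)   # shape m x 1
grad_matrix = delta_prev @ a_prev2.T                     # shape m x p

print("shapes:", grad_loop.shape, grad_matrix.shape)
print("forms agree:", np.allclose(grad_loop, grad_matrix))
```

The matrix form is what one would normally implement, since it replaces the double loop with two matrix products.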


That was only for the **first** item in the training set. Feed the rest of the training set through the network in the same way, and for each weight $w_{jk}^{(L)}$ and each bias average the resulting derivatives over all examples. The averaged gradient is the one you use in the gradient descent update below, where $\eta$ is the learning rate. In the output layer use equations (1), (2), (3), and in the hidden layers use equations (4), (5), (6):

$\textbf{W}_{new}=\textbf{W}_{initial}-\eta \nabla (\textbf{W}_{initial})$

$\textbf{b}_{new}=\textbf{b}_{initial}-\eta \nabla (\textbf{b}_{initial})$
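
A minimal sketch of this update rule, using a single sigmoid neuron so the gradient fits in two lines. The toy data `X`, `Y`, the starting values, the learning rate `eta`, and the number of steps are all made up for illustration; samples are stored as columns so the per-sample gradients can be averaged with one matrix product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy training set: one input feature, four samples (columns)
X = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = np.array([[0.1, 0.3, 0.7, 0.9]])

W = np.array([[0.5]])    # initial weight
b = np.array([[0.0]])    # initial bias
eta = 0.5                # learning rate

def average_cost(W, b):
    A = sigmoid(W @ X + b)
    return 0.5 * np.mean((A - Y) ** 2)

cost_before = average_cost(W, b)
for _ in range(200):
    A = sigmoid(W @ X + b)
    delta = (A - Y) * A * (1 - A)                  # one delta per sample
    grad_W = (delta @ X.T) / X.shape[1]            # average of delta * a^T
    grad_b = delta.mean(axis=1, keepdims=True)     # average of delta
    W = W - eta * grad_W                           # W_new = W - eta * grad
    b = b - eta * grad_b
cost_after = average_cost(W, b)

print("cost before:", cost_before)
print("cost after: ", cost_after)
```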



In practice, you usually shuffle the training set randomly, split it into small batches called mini-batches, and compute the averaged gradient (followed by one update step) for each batch.
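
The shuffling and splitting can be sketched as a small generator. The name `minibatches`, the column-per-sample layout, and the placeholder data are assumptions made for this illustration.

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; samples are stored as columns."""
    n_samples = X.shape[1]
    order = rng.permutation(n_samples)             # random order of sample indices
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(2, 10)      # placeholder: 10 samples, 2 features
Y = np.arange(10, dtype=float).reshape(1, 10)      # placeholder targets

for X_batch, Y_batch in minibatches(X, Y, batch_size=4, rng=rng):
    # A real training loop would run backpropagation on the batch here
    # and apply one gradient descent update with the averaged gradient.
    print(X_batch.shape, Y_batch.shape)
```

With 10 samples and a batch size of 4, the last batch holds the 2 remaining samples. Calling the generator again at the start of each epoch produces a fresh shuffle.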

Ref: [1](https://stats.stackexchange.com/questions/414825/how-exactly-is-the-error-backpropagated-in-backpropagation), [2](https://www.youtube.com/watch?v=xClK__CqZnQ)

## Unstable Gradient: Vanishing and Exploding Gradients


Refs [1](https://www.youtube.com/watch?v=qO_NLVjD6zE), [2](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)

Assume the activation function is the sigmoid:
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$


Its derivative is:
$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))$

because:

$\begin{aligned}
\dfrac{d}{dx} \sigma(x) &= \dfrac{d}{dx} \left[ \dfrac{1}{1 + e^{-x}} \right] \\
&= \dfrac{d}{dx} \left( 1 + e^{-x} \right)^{-1} \\
&= -(1 + e^{-x})^{-2}(-e^{-x}) \\
&= \dfrac{e^{-x}}{\left(1 + e^{-x}\right)^2} \\
&= \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}}  \\
&= \dfrac{1}{1 + e^{-x}} \cdot \dfrac{(1 + e^{-x}) - 1}{1 + e^{-x}}  \\
&= \dfrac{1}{1 + e^{-x}} \cdot \left( \dfrac{1 + e^{-x}}{1 + e^{-x}} - \dfrac{1}{1 + e^{-x}} \right) \\
&= \dfrac{1}{1 + e^{-x}} \cdot \left( 1 - \dfrac{1}{1 + e^{-x}} \right) \\
&= \sigma(x) \cdot (1 - \sigma(x))
\end{aligned}$
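
The identity can also be checked numerically. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central-difference estimate on a grid, then prints the largest value the derivative takes. The grid range and the step `h` are arbitrary choices. The final loop shows how quickly a product of such factors can shrink with depth, under the simplifying assumption that every layer contributes the maximum derivative and nothing else.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 1201)

# Closed form derived above
analytic = sigmoid(x) * (1 - sigmoid(x))

# Central-difference approximation of d(sigma)/dx
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print("max |analytic - numeric|:", np.max(np.abs(analytic - numeric)))
print("largest derivative value:", analytic.max())   # attained at x = 0

# Upper bound on a product of sigmoid derivatives across several layers
for depth in (1, 5, 10, 20):
    print(depth, "layers ->", 0.25 ** depth)
```

Since the sigmoid derivative never exceeds 0.25, every extra sigmoid layer multiplies the backpropagated error by a factor of at most 0.25 (before the weights are taken into account).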
