# Backward Propagation

### Weight matrix shapes in a neural network

![Weight matrices shape transformation](images/matrices.png)



### Calculate the weights

This calculates the gradient of the loss ***L*** with respect to the weight matrix ***W***.

From the forward pass: 
$$
Z = X \cdot W + b
$$
- ***X*** is the input to the layer.
- ***W*** is the weight matrix.
- ***b*** is the bias vector.
- ***Z*** is the result of the linear transformation.

The loss ***L*** depends on ***W*** through ***Z***. By the chain rule:
$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}
$$
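- $\frac{\partial L}{\partial Z}$ is $\delta$, the error signal propagated back from the next layer.
- $\frac{\partial Z}{\partial W}$ contributes the factor $X^T$, which multiplies $\delta$ from the left so that the result has the same shape as $W$.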

Thus:
$$
\frac{\partial L}{\partial W} = X^T \cdot \delta
$$
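The formula above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the layer sizes, the toy loss $L = \frac{1}{2}\sum Z^2$ (chosen only because it gives $\delta = Z$), and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch of 4 samples, 3 input features, 2 output units.
X = rng.normal(size=(4, 3))   # layer input, shape (batch, n_in)
W = rng.normal(size=(3, 2))   # weight matrix, shape (n_in, n_out)
b = np.zeros((1, 2))          # biases, shape (1, n_out)

Z = X @ W + b                 # forward pass, shape (batch, n_out)

# Assumed toy loss L = 0.5 * sum(Z**2), so dL/dZ = Z.
delta = Z                     # error signal, same shape as Z

dW = X.T @ delta              # gradient of L w.r.t. W, same shape as W


def loss(W_):
    """Toy loss as a function of the weights only."""
    return 0.5 * np.sum((X @ W_ + b) ** 2)


# Central finite difference on a single entry of W as a sanity check.
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
W_minus = W.copy()
W_minus[0, 0] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(dW.shape, np.isclose(numeric, dW[0, 0]))
```

The transpose is what makes the shapes line up: `X.T` is `(n_in, batch)` and `delta` is `(batch, n_out)`, so the product has the shape of `W`.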

### Calculate the biases

This calculates the gradient of the loss ***L*** with respect to the biases ***b***.

From the forward pass: 
$$
Z = X \cdot W + b
$$

The bias term ***b*** affects ***Z*** directly. By the chain rule:
$$
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial b}
$$

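Since $\frac{\partial Z}{\partial b} = 1$ and the same bias is added to every sample in the batch, the gradient is the error signal summed over the samples:
$$
\frac{\partial L}{\partial b} = \sum_{i} \delta_i
$$
where $\delta_i$ is the row of $\delta$ that belongs to sample $i$.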
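In code, the sum runs over the batch axis. A small sketch, assuming $\delta$ is stored with one row per sample and the bias has shape `(1, n_out)`; the numbers are made up for illustration:

```python
import numpy as np

# Hypothetical error signal for a batch of 3 samples and 2 output units.
delta = np.array([[0.1, -0.2],
                  [0.4,  0.0],
                  [-0.3, 0.5]])

# Sum over the batch axis (axis 0); keepdims preserves the (1, n_out) shape,
# so db can be subtracted from a bias of that shape directly.
db = np.sum(delta, axis=0, keepdims=True)

print(db.shape)
```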

### Gradient descent

#### Standard

Standard **gradient descent** moves the weights a small step against the gradient:
$$
W = W - \eta \cdot \nabla L
$$

Where:
- $W$: Weight matrix.
- $\eta$: Learning rate.
- $\nabla L$: Gradient of the loss with respect to $W$.
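A minimal sketch of this update rule, assuming a toy loss $L(W) = \frac{1}{2}\lVert W \rVert^2$ whose gradient is simply $W$; the function name `gradient_descent_step` and the learning rate are illustrative choices, not part of the original notes.

```python
import numpy as np


def gradient_descent_step(W, grad, lr=0.1):
    """Return the weights after one plain gradient descent step."""
    return W - lr * grad


# Toy loss L(W) = 0.5 * ||W||^2: its gradient is W and its minimum is at zero.
W = np.array([[1.0, -2.0],
              [0.5,  3.0]])

for _ in range(100):
    W = gradient_descent_step(W, grad=W, lr=0.1)

# Each step scales W by (1 - lr), so W shrinks towards the minimum at zero.
print(np.abs(W).max())
```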

#### Momentum-based optimization

Momentum smooths the updates by accumulating a velocity from past gradients:

$$
v^{(t)} = \gamma \cdot v^{(t-1)} - \eta \cdot \nabla L
$$

Where:
- $v^{(t)}$: Current velocity.
- $\gamma$: Momentum factor (e.g., 0.9).
- $\eta$: Learning rate.
- $\nabla L$: Gradient of the loss with respect to $W$.

The weights are then updated using:
$$
W^{(t+1)} = W^{(t)} + v^{(t)}
$$
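The two momentum equations can be sketched as a single step function. This assumes the velocity starts at zero and reuses a toy loss $L(W) = \frac{1}{2}\lVert W \rVert^2$ with gradient $W$; `momentum_step` is a hypothetical helper name.

```python
import numpy as np


def momentum_step(W, v, grad, lr=0.1, gamma=0.9):
    """One momentum update: returns the new weights and the new velocity."""
    v_new = gamma * v - lr * grad   # v(t) = gamma * v(t-1) - eta * grad
    W_new = W + v_new               # W(t+1) = W(t) + v(t)
    return W_new, v_new


# Toy loss L(W) = 0.5 * ||W||^2, whose gradient is W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)                # assumed initial velocity of zero

for _ in range(200):
    W, v = momentum_step(W, v, grad=W)

print(np.abs(W).max())
```

Because the velocity carries over a fraction $\gamma$ of the previous step, consecutive gradients that point the same way reinforce each other, while gradients that flip sign partly cancel.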


### Error signal

This calculates the error signal $\delta^{\text{prev}}$ that is passed back to the previous layer.

From the forward pass:
$$
Z = X \cdot W + b
$$

Here the layer input $X$ is the previous layer's activation $A^{\text{prev}}$. Applying the chain rule through $Z$ gives:
$$
\delta^{\text{prev}} = \frac{\partial L}{\partial A^{\text{prev}}} = \delta \cdot W^T
$$
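As with the weight gradient, this can be verified with a finite difference. The sketch below assumes the same toy setup as before (loss $L = \frac{1}{2}\sum Z^2$, so $\delta = Z$) and treats `X` as $A^{\text{prev}}$; sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch of 4 samples, 3 input features, 2 output units.
X = rng.normal(size=(4, 3))   # A_prev, the previous layer's activation
W = rng.normal(size=(3, 2))
b = np.zeros((1, 2))

Z = X @ W + b
delta = Z                     # dL/dZ for the assumed loss 0.5 * sum(Z**2)

# Error signal passed back: shape (batch, n_in), the same shape as X.
delta_prev = delta @ W.T


def loss(X_):
    """Toy loss as a function of the layer input only."""
    return 0.5 * np.sum((X_ @ W + b) ** 2)


# Central finite difference on a single entry of X as a sanity check.
eps = 1e-6
X_plus = X.copy()
X_plus[0, 0] += eps
X_minus = X.copy()
X_minus[0, 0] -= eps
numeric = (loss(X_plus) - loss(X_minus)) / (2 * eps)

print(delta_prev.shape, np.isclose(numeric, delta_prev[0, 0]))
```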
