# Theory Notes: Backpropagation and Neural Networks


## Introduction

Backpropagation is the algorithm used to compute the gradients needed to train a neural network. It applies the chain rule to obtain the gradient of the loss function with respect to each parameter (weights and biases); gradient descent then uses those gradients to update the parameters.



## Key Concepts

1. **Forward Pass**: Compute the predictions of the network:
   $$
   \hat{y} = f(X; W, b)
   $$

2. **Loss Function**: Quantifies the difference between predictions and actual values. For Mean Squared Error (MSE):
   $$
   L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
   $$

3. **Gradient Descent**: Update parameters to minimize the loss:
   $$
   \theta \leftarrow \theta - \alpha \frac{\partial L}{\partial \theta}
   $$
   where $\theta$ stands for any trainable parameter (a weight or a bias) and $\alpha$ is the learning rate.

4. **Chain Rule**: Used to compute gradients in multi-layer networks:
   $$
   \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta}
   $$
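
The four concepts above can be tied together in a few lines. The following is a minimal sketch, not a reference implementation: it assumes a hypothetical one-parameter model $\hat{y} = \theta x$, toy data chosen for illustration, and helper names (`mse`, `grad_theta`) that do not come from these notes.

```python
def mse(ys, preds):
    """Mean squared error: (1/n) * sum of (y_i - y_hat_i)^2."""
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n


def grad_theta(xs, ys, theta):
    """dL/dtheta for the assumed model y_hat = theta * x, via the chain rule."""
    n = len(ys)
    total = 0.0
    for x, y in zip(xs, ys):
        dL_dyhat = -2.0 * (y - theta * x) / n  # derivative of the MSE w.r.t. y_hat_i
        dyhat_dtheta = x                       # derivative of theta * x w.r.t. theta
        total += dL_dyhat * dyhat_dtheta       # chain rule, summed over samples
    return total


# Toy data (an assumption for illustration): y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta = 0.0   # initial parameter
alpha = 0.05  # learning rate

loss_before = mse(ys, [theta * x for x in xs])
theta = theta - alpha * grad_theta(xs, ys, theta)  # one gradient descent step
loss_after = mse(ys, [theta * x for x in xs])

print(loss_before, loss_after, theta)
```

A single step moves $\theta$ toward the value that generated the data and lowers the loss; repeating the step is what the training loop does.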



## Backpropagation Steps

1. **Forward pass and loss**: Run the forward pass to obtain $\hat{y}$, then compute the loss:
   $$
   L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
   $$

2. **Backward Pass**:
   - Calculate $\frac{\partial L}{\partial \hat{y}_i}$ for each prediction (the factor $\frac{1}{n}$ comes from the mean in the loss):
     $$
     \frac{\partial L}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i)
     $$
   - Use the chain rule to compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$.

3. **Parameter Update**:
   $$
   W \leftarrow W - \alpha \frac{\partial L}{\partial W}, \quad b \leftarrow b - \alpha \frac{\partial L}{\partial b}
   $$

Repeat these steps until the loss converges or the maximum number of iterations is reached.
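
The loop described above can be sketched as follows. This is an illustrative example under stated assumptions: the model is a hypothetical single-feature linear model $\hat{y} = wx + b$, and the function name `train_linear`, the stopping tolerance `tol`, and the toy data are all introduced here for the example.

```python
def train_linear(xs, ys, alpha=0.05, max_iters=2000, tol=1e-10):
    """Fit y_hat = w * x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iters):
        # Step 1: forward pass and loss.
        preds = [w * x + b for x in xs]
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n

        # Stop once the loss has (approximately) converged.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss

        # Step 2: backward pass.
        dL_dpred = [-2.0 * (y - p) / n for y, p in zip(ys, preds)]
        dL_dw = sum(g * x for g, x in zip(dL_dpred, xs))  # d(y_hat)/dw = x
        dL_db = sum(dL_dpred)                             # d(y_hat)/db = 1

        # Step 3: parameter update.
        w -= alpha * dL_dw
        b -= alpha * dL_db
    return w, b, loss


# Toy data (an assumption for illustration): y = 2x + 1 exactly.
w, b, final_loss = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b, final_loss)
```

The two stopping conditions mirror the text: the loop ends when the change in loss falls below `tol` or when `max_iters` is reached, whichever comes first.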



## Neural Network Layers

1. **Input Layer**: Receives input features $X$.
2. **Hidden Layers**: Transform inputs using weights $W$ and biases $b$.
   $$
   h = \sigma(XW + b)
   $$
   where $\sigma$ is an activation function (e.g., Sigmoid, ReLU).

3. **Output Layer**: Produces the prediction:
   $$
   \hat{y} = \sigma(hW_o + b_o)
   $$
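
To make the layer equations concrete, here is a small sketch of the forward pass for a single input row. It assumes sigmoid for both activations, a 2-3-1 layer layout, and arbitrary weight values; the helper names `dense` and `forward` are hypothetical and chosen only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, W, b, activation):
    """Compute activation(xW + b) for one input row.

    x: list of d_in values; W: d_in x d_out nested list; b: list of d_out values.
    """
    return [
        activation(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def forward(x, W, b, W_o, b_o):
    h = dense(x, W, b, sigmoid)          # hidden layer: h = sigma(xW + b)
    y_hat = dense(h, W_o, b_o, sigmoid)  # output layer: y_hat = sigma(h W_o + b_o)
    return h, y_hat


# Arbitrary example values (assumptions): 2 inputs, 3 hidden units, 1 output.
x = [0.5, -1.0]
W = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
b = [0.0, 0.1, -0.1]
W_o = [[0.2], [-0.3], [0.5]]
b_o = [0.05]

h, y_hat = forward(x, W, b, W_o, b_o)
print(h, y_hat)
```

Because the output layer also applies a sigmoid, every prediction lies strictly between 0 and 1.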



## Chain Rule in Depth

For a multi-layer network:
- The gradient of the loss with respect to weights in layer $k$ is:
  $$
  \frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_k} \cdot \frac{\partial h_k}{\partial W_k}
  $$

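- Gradients flow backward from the output layer toward the input layer. The factor $\frac{\partial \hat{y}}{\partial h_k}$ is itself a product of the local derivatives of every layer after layer $k$, so each layer reuses the gradient already computed for the layer above it.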
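
The layer-by-layer product can be checked numerically. The sketch below assumes the smallest possible network (one input, one sigmoid hidden unit, one sigmoid output) and a single-sample squared-error loss, so $n = 1$. The names `loss_fn`, `backprop`, and `numeric_grad` are hypothetical, and the finite-difference comparison is a common sanity check rather than part of backpropagation itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(x, y, w1, b1, w2, b2):
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    return (y - y_hat) ** 2


def backprop(x, y, w1, b1, w2, b2):
    # Forward pass, keeping the intermediate values.
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)

    # Backward pass: multiply local derivatives from the output inward.
    dL_dyhat = -2.0 * (y - y_hat)
    dL_dz2 = dL_dyhat * y_hat * (1.0 - y_hat)  # sigmoid'(z) = s * (1 - s)
    dL_dw2 = dL_dz2 * h
    dL_db2 = dL_dz2
    dL_dh = dL_dz2 * w2                        # gradient passed down to the hidden layer
    dL_dz1 = dL_dh * h * (1.0 - h)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1
    return dL_dw1, dL_db1, dL_dw2, dL_db2


def numeric_grad(f, params, i, eps=1e-6):
    """Central finite difference of f with respect to params[i]."""
    up = list(params)
    up[i] += eps
    down = list(params)
    down[i] -= eps
    return (f(*up) - f(*down)) / (2 * eps)


x, y = 0.7, 1.0
params = [0.3, -0.1, 0.8, 0.2]  # w1, b1, w2, b2 (arbitrary example values)

analytic = backprop(x, y, *params)
numeric = [
    numeric_grad(lambda *p: loss_fn(x, y, *p), params, i) for i in range(4)
]
print(analytic)
print(numeric)
```

Note how `dL_dz2` is computed once and reused for both the output-layer gradients and the hidden-layer gradients; that reuse is what makes the backward pass efficient.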



## Conclusion

Backpropagation enables neural networks to learn by iteratively updating parameters to minimize the loss. It combines:
- The chain rule for gradient computation.
- Gradient descent for optimization.

This theoretical foundation is crucial for understanding deep learning.
