# Backpropagation in Neural Networks: Main Ideas


Backpropagation is the algorithm that computes how a neural network's loss changes with respect to each parameter
(weights and biases); gradient descent then uses those gradients to update the parameters. This notebook explains the
main ideas of backpropagation using the chain rule and gradient descent. The concepts are broken down step by step,
with accompanying examples and mathematical notation.



## Prerequisites

Before diving into backpropagation, ensure you are familiar with:
1. Neural Networks
2. The Chain Rule in calculus
3. Gradient Descent

If any of these are unfamiliar, review them first; the derivations in this notebook rely on all three.



## Simple Example of a Neural Network

Consider a neural network with a single hidden layer designed to predict whether a drug dosage (low, medium, or high) is effective.

### Neural Network Workflow
1. Input features (e.g., dosage levels) pass through the network.
2. Activation functions, weights, and biases adjust the input to produce an output prediction.
3. Backpropagation calculates how weights and biases should be updated to minimize prediction error.
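
As a minimal sketch of steps 1 and 2, the forward pass of such a network might look like the code below. The layer size (two hidden nodes), the softplus activation, the dosage scaling, and every parameter value are assumptions made for illustration; the text above does not specify them.

```python
import math

def softplus(x):
    """Smooth activation function: log(1 + e^x). Assumed for this sketch."""
    return math.log1p(math.exp(x))

def forward(dose, params):
    """Forward pass: one input, two hidden nodes, one output."""
    h1 = softplus(params["w1"] * dose + params["b1"])
    h2 = softplus(params["w2"] * dose + params["b2"])
    # The final bias b3 is added directly to the weighted hidden outputs.
    return params["w3"] * h1 + params["w4"] * h2 + params["b3"]

# Hypothetical parameter values, chosen only to make the example runnable.
params = {"w1": 1.0, "b1": 0.0, "w2": -1.0, "b2": 0.0,
          "w3": 1.0, "w4": 1.0, "b3": 0.0}

doses = [0.0, 0.5, 1.0]  # low, medium, high (assumed scaling to [0, 1])
predictions = [forward(d, params) for d in doses]
print(predictions)
```

Because `b3` is added after everything else, changing it shifts every prediction by the same amount, which is what makes its gradient easy to derive later on.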



## Backpropagation Process

Backpropagation supplies the gradients needed to minimize the error between the predicted values and the observed values in the dataset.

### Steps:
1. **Forward Pass:** Calculate predictions using the current parameters.
2. **Error Calculation:** Compute the loss (e.g., sum of squared residuals).
3. **Backward Pass:** Use the chain rule to calculate the gradient of the loss function with respect to each parameter.
4. **Parameter Update:** Adjust weights and biases using gradient descent.

Mathematically, the process involves:
$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
where:
- $L$: Loss function (sum of squared residuals)
- $y_i$: Observed values
- $\hat{y}_i$: Predicted values
- $n$: Number of observations
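
The loss formula translates almost directly into code. The sketch below uses made-up observed and predicted values purely to show the arithmetic; the function name is hypothetical.

```python
def sum_squared_residuals(observed, predicted):
    """L = sum over i of (y_i - y_hat_i)^2."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

observed = [0.0, 1.0, 0.0]    # hypothetical labels (0 = not effective, 1 = effective)
predicted = [0.5, 0.5, 0.5]   # hypothetical network outputs

loss = sum_squared_residuals(observed, predicted)
print(loss)  # 0.75
```

Each residual here is ±0.5, so each squared term is 0.25 and the three terms sum to 0.75.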



## Using the Chain Rule

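To compute gradients, we apply the chain rule. Because every prediction $\hat{y}_i$ depends on the parameter $b_3$, the gradient sums the contribution of each observation:
$$
\frac{\partial L}{\partial b_3} = \sum_{i=1}^{n} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_3}
$$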

### Example:
Given a loss function:
$$
L = \sum_{i=1}^{3} (y_i - \hat{y}_i)^2
$$

1. Compute $\frac{\partial L}{\partial \hat{y}_i}$ for each observation $i$. Only the $i$-th term of the sum depends on $\hat{y}_i$, so:
$$
\frac{\partial L}{\partial \hat{y}_i} = -2 (y_i - \hat{y}_i)
$$

2. Compute $\frac{\partial \hat{y}_i}{\partial b_3}$:
For a simple network, $\hat{y}_i = f(x_i; b_3)$, where $b_3$ is added directly to the output, so:
$$
\frac{\partial \hat{y}_i}{\partial b_3} = 1
$$

Thus, summing over the three observations:
$$
\frac{\partial L}{\partial b_3} = \sum_{i=1}^{3} -2 (y_i - \hat{y}_i)
$$
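
One way to sanity-check a derivative like this is to compare it with a numerical estimate. The sketch below assumes the network output can be written as a fixed part plus $b_3$; the `base_outputs` values (the prediction before the final bias is added) and the observed values are hypothetical.

```python
def predict(base_outputs, b3):
    """Prediction = everything upstream of the final bias, plus b3."""
    return [base + b3 for base in base_outputs]

def loss(observed, base_outputs, b3):
    """Sum of squared residuals as a function of b3."""
    preds = predict(base_outputs, b3)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, preds))

def grad_b3(observed, base_outputs, b3):
    """Chain-rule result: dL/db3 = sum of -2 * (y_i - y_hat_i) * 1."""
    preds = predict(base_outputs, b3)
    return sum(-2 * (y - y_hat) for y, y_hat in zip(observed, preds))

observed = [0.0, 1.0, 0.0]        # hypothetical observations
base_outputs = [0.2, 0.4, 0.1]    # hypothetical pre-bias outputs
b3 = 0.0

analytic = grad_b3(observed, base_outputs, b3)

# Central finite difference as an independent check on the derivation.
eps = 1e-6
numeric = (loss(observed, base_outputs, b3 + eps)
           - loss(observed, base_outputs, b3 - eps)) / (2 * eps)

print(analytic, numeric)
```

If the two values agree to several decimal places, the chain-rule derivation and its implementation are very likely consistent with each other.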



## Gradient Descent

Gradient descent updates parameters iteratively to minimize the loss function.

The update rule for a parameter $b_3$:
$$
b_3 \leftarrow b_3 - \alpha \frac{\partial L}{\partial b_3}
$$
where:
- $\alpha$: Learning rate (step size)
- $\frac{\partial L}{\partial b_3}$: Gradient

### Example Update:
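Given a current value $b_3 = 0$, a gradient $\frac{\partial L}{\partial b_3} = -15.7$, and a learning rate $\alpha = 0.1$:
$$
b_3 \leftarrow 0 - (0.1)(-15.7) = 1.57
$$
Because the gradient is negative, the update increases $b_3$.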
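
The same update can be written as a one-line function. This is only a sketch of the rule above, and the function name is hypothetical.

```python
def gradient_descent_step(value, gradient, learning_rate):
    """Apply one update: value <- value - learning_rate * gradient."""
    return value - learning_rate * gradient

# Values from the example: b3 starts at 0, gradient is -15.7, alpha is 0.1.
b3 = gradient_descent_step(0.0, -15.7, 0.1)
print(round(b3, 2))  # 1.57
```

The result is rounded for display because floating-point multiplication may leave a tiny error in the last digits.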



## Iterative Optimization

Training alternates between backpropagation and gradient descent:
1. Compute gradients for all parameters (weights and biases).
2. Update parameters using the gradient descent rule.
3. Repeat until convergence, for example until the updates become very small or a maximum number of steps is reached.

In practice, this is implemented efficiently for large networks using frameworks such as TensorFlow or PyTorch.
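
Before reaching for a framework, the loop can be sketched by hand for a single parameter. The example below optimizes only $b_3$, holding the rest of the network fixed; the data, the `base_outputs` values, the learning rate, the tolerance, and the step limit are all assumptions chosen for illustration, and none of this is a framework's API.

```python
def grad_b3(observed, base_outputs, b3):
    """dL/db3 for the sum of squared residuals, with y_hat_i = base_i + b3."""
    return sum(-2 * (y - (base + b3)) for y, base in zip(observed, base_outputs))

observed = [0.0, 1.0, 0.0]        # hypothetical observations
base_outputs = [0.2, 0.4, 0.1]    # hypothetical pre-bias network outputs

b3 = 0.0
learning_rate = 0.1
tolerance = 1e-6                  # stop once the update is this small
max_steps = 1000                  # safety limit in case of non-convergence

for step in range(max_steps):
    gradient = grad_b3(observed, base_outputs, b3)   # 1. compute gradient
    update = learning_rate * gradient
    if abs(update) < tolerance:                      # 3. convergence check
        break
    b3 -= update                                     # 2. gradient descent rule

print(step, b3)
```

For this data the loss is minimized when $b_3$ equals the mean residual, $(-0.2 + 0.6 - 0.1)/3 = 0.1$, so the loop should settle close to that value. A real network repeats the same three steps for every weight and bias at once.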



## Conclusion

Backpropagation uses the chain rule to compute gradients, and gradient descent uses those gradients to optimize neural
network parameters. Iteratively reducing the loss moves the model's predictions closer to the observed data, although
reaching the lowest possible loss is not guaranteed.

Understanding these concepts is crucial for designing and training deep learning models.
