# Implementing Back-Propagation from Scratch

Backpropagation is the central algorithm for training neural networks. It allows us to efficiently compute gradients of loss functions with respect to all weights in the network. These gradients are then used to update the weights via gradient descent.


## **1. Motivation: Why Backpropagation?**

Let’s start with the key challenge in training a neural network:

> **We want to adjust the weights in a neural network so that it makes better predictions.**

But to adjust the weights, we need to compute **how the loss changes with respect to each weight**, i.e., compute **gradients**:

$$
\frac{\partial \mathcal{L}}{\partial w}
$$

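Computing each gradient independently, for example by nudging one weight at a time and re-running the network, is **computationally expensive**: it needs at least one extra forward pass per weight, which is impractical with millions of weights.
Backpropagation instead **reuses intermediate computations**, so all the gradients come out of a single backward pass whose cost is comparable to that of the forward pass and grows linearly with the number of layers.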
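
To make the cost of a naive approach concrete, here is a minimal sketch of a finite-difference gradient. The helper name `numerical_gradient` and the toy quadratic loss are assumptions made for illustration. Note that the loss is evaluated twice for every weight, which is what makes this approach impractical for large networks.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    """Central-difference estimate of d(loss)/dw, one weight at a time.

    Hypothetical helper for illustration: loss_fn is called twice
    for every entry of w.
    """
    grad = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step.flat[i] = eps
        grad.flat[i] = (loss_fn(w + step) - loss_fn(w - step)) / (2 * eps)
    return grad

# Toy loss (an assumption for the demo): L(w) = sum(w^2), so dL/dw = 2w.
w = np.array([1.0, -2.0, 3.0])
grad = numerical_gradient(lambda v: np.sum(v ** 2), w)
print(grad)  # close to [2, -4, 6]
```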



## **2. Forward Pass: Flow of Information**

Think of the neural network as a **factory assembly line**:

* Inputs go in.
* Each layer does a transformation.
* The final output is a prediction.
* The loss measures how wrong the prediction was.

For example, in a 2-layer neural network:

```plaintext
Input x → [W1, b1] → h1 = f1(W1x + b1)
           ↓
         [W2, b2] → ŷ = f2(W2h1 + b2)
```
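
The diagram above could be written in NumPy roughly as follows. The layer sizes, the random seed, and the choice of ReLU for `f1` and softmax for `f2` are assumptions for this sketch; the diagram itself does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 2 inputs, 3 hidden units, 2 outputs.
x = rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

def f1(a):
    return np.maximum(0, a)       # ReLU, chosen here as an example

def f2(a):
    e = np.exp(a - np.max(a))     # softmax, chosen here as an example
    return e / e.sum()

h1 = f1(W1 @ x + b1)              # first layer: hidden representation
y_hat = f2(W2 @ h1 + b2)          # second layer: the prediction
print(h1, y_hat)
```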


## **3. Backward Pass: Flow of Responsibility**

Now we ask: *How much of the error is each weight responsible for?*

Backpropagation works **backwards**, starting from the loss:

* How much did each output neuron contribute to the loss?
* How much did each hidden neuron contribute to the output?
* How much did each weight contribute to its neuron’s output?

This is where **the chain rule of calculus** comes in.





## **4. Mathematical Foundation: The Chain Rule**

Let’s say:

* $z = f(y)$
* $y = g(x)$

Then:

$$
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
$$

This rule allows us to “chain” the effects of input on the output. In neural networks, the output is a function of a function of a function… So we use the chain rule **repeatedly**, layer by layer.
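
As a quick sanity check of the rule, the sketch below picks two concrete functions (an arbitrary choice for illustration: $g(x) = x^2$ and $f(y) = \sin(y)$) and compares the chain-rule derivative with a finite-difference estimate.

```python
import math

def g(x):
    return x ** 2          # inner function: y = g(x)

def f(y):
    return math.sin(y)     # outer function: z = f(y)

x = 1.5
y = g(x)

dz_dy = math.cos(y)        # derivative of sin(y) with respect to y
dy_dx = 2 * x              # derivative of x^2 with respect to x
chain_rule = dz_dy * dy_dx

# Independent estimate: perturb x slightly and measure the change in z.
eps = 1e-6
finite_diff = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(chain_rule, finite_diff)  # the two values should agree closely
```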

---

### **Example: One Hidden Layer**

Let:

* $a = Wx + b$ — pre-activation
* $h = \sigma(a)$ — activation (e.g., ReLU or sigmoid)
* $\hat{y} = Uh + c$ — final score
* $\mathcal{L} = \text{loss}(\hat{y}, y)$

We want:

* $\frac{\partial \mathcal{L}}{\partial W}$
* $\frac{\partial \mathcal{L}}{\partial b}$

Let’s compute:

1. $\frac{\partial \mathcal{L}}{\partial \hat{y}}$ → from the loss function (e.g., softmax + cross entropy)
2. $\frac{\partial \hat{y}}{\partial h} = U$
3. $\frac{\partial h}{\partial a} = \sigma'(a)$
4. $\frac{\partial a}{\partial W} = x$

Chain them:

$$
\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W}
$$

In practice, we implement this using matrix derivatives and vectorized operations.
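
As a sketch of what that vectorized form looks like for this example, treat $x$, $a$, $h$, and $\frac{\partial \mathcal{L}}{\partial \hat{y}}$ as column vectors and write $\odot$ for elementwise multiplication. The symbol $\delta$ is introduced here only for convenience:

$$
\delta = \frac{\partial \mathcal{L}}{\partial a} = \left( U^\top \frac{\partial \mathcal{L}}{\partial \hat{y}} \right) \odot \sigma'(a),
\qquad
\frac{\partial \mathcal{L}}{\partial W} = \delta \, x^\top,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \delta
$$

The outer product $\delta \, x^\top$ has the same shape as $W$, and the bias gradient is simply $\delta$ because $a$ depends on $b$ through an identity map.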




## **5. Real-World Analogy**

Think of a restaurant kitchen with multiple stations:

* Ingredients (input)
* Chopping station (layer 1)
* Cooking station (layer 2)
* Plating station (layer 3)
* Final dish (output)

Now suppose a dish was terrible (high loss). To improve:

* The chef at the plating station adjusts based on feedback.
* That feedback goes backward to the cook.
* The cook, in turn, passes part of the blame to the chopping station for badly prepared ingredients.
* Each station gets **blame** in proportion to its **influence** on the final dish.

This blame assignment is the essence of backpropagation.


## **6. Vectorized Implementation**

For efficiency, we use matrix notation and perform the entire backward pass in **vectorized form**.

Let’s say for one hidden layer:

```python
# Forward pass
z1 = X @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_hat = softmax(z2)
loss = cross_entropy(y_hat, y_true)  # y_true: one-hot labels; loss averaged over the batch
```

Then backpropagation:

```python
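# Backward pass (y_true is one-hot; the loss is averaged over the N rows of X)
N = X.shape[0]
dz2 = (y_hat - y_true) / N       # gradient of the mean loss w.r.t. z2
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0)

da1 = dz2 @ W2.T
dz1 = da1 * relu_derivative(z1)  # gradient through the ReLU activation
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)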
```
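
One way to gain confidence in a backward pass like this is a finite-difference gradient check. The sketch below is self-contained and rests on a few assumptions: the loss is the mean cross-entropy over the batch (so the output gradient is divided by `N`), the sizes and seed are arbitrary, and the `forward`/`backward` helpers and the `params` dictionary are names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 5, 2, 3, 2                 # assumed sizes for the demo
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)          # integer class labels
Y = np.eye(C)[y]                        # one-hot labels
params = {
    "W1": rng.normal(size=(D, H)), "b1": np.zeros(H),
    "W2": rng.normal(size=(H, C)), "b2": np.zeros(C),
}

def forward(p):
    z1 = X @ p["W1"] + p["b1"]
    a1 = np.maximum(0, z1)
    z2 = a1 @ p["W2"] + p["b2"]
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    y_hat = e / e.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(y_hat[np.arange(N), y]))
    return loss, (z1, a1, y_hat)

def backward(p, cache):
    z1, a1, y_hat = cache
    dz2 = (y_hat - Y) / N               # mean loss, hence the 1/N factor
    grads = {"W2": a1.T @ dz2, "b2": dz2.sum(axis=0)}
    dz1 = (dz2 @ p["W2"].T) * (z1 > 0)  # ReLU passes gradient only where z1 > 0
    grads["W1"] = X.T @ dz1
    grads["b1"] = dz1.sum(axis=0)
    return grads

loss, cache = forward(params)
grads = backward(params, cache)

# Compare every analytic gradient entry with a central difference.
eps = 1e-6
max_err = 0.0
for name in params:
    for idx in np.ndindex(params[name].shape):
        old = params[name][idx]
        params[name][idx] = old + eps
        loss_plus, _ = forward(params)
        params[name][idx] = old - eps
        loss_minus, _ = forward(params)
        params[name][idx] = old         # restore the original value
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - grads[name][idx]))

print(max_err)  # should be tiny if the backward pass is correct
```

A check like this can disagree near the kink of ReLU at zero, so a small mismatch on an isolated entry is not always a bug.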


## **7. Why Backprop is Efficient**

Naively computing each gradient would be **costly**, especially with multiple layers and neurons. Backprop takes advantage of **dynamic programming** by:

* Storing intermediate values during forward pass
* Reusing them in backward pass (no recomputation)
* Propagating gradients in a chain (chain rule)

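Each layer is visited once on the way forward and once on the way back, so the total work grows linearly, **O(L)**, with the number of layers L (for a fixed layer width).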
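
The caching idea can be sketched for a network of arbitrary depth. This example makes several simplifying assumptions that are not in the text above: tanh activations, no biases, a single input vector, and a squared-error loss. The function names are illustrative.

```python
import numpy as np

def forward(x, weights):
    """Run the network and cache every layer's activation."""
    activations = [x]
    for W in weights:
        activations.append(np.tanh(W @ activations[-1]))
    return activations

def backward(grad_output, weights, activations):
    """Walk the layers once in reverse, reusing the cached activations."""
    grads = [None] * len(weights)
    delta = grad_output                       # dL/d(output of the last layer)
    for i in reversed(range(len(weights))):
        out = activations[i + 1]              # cached tanh output of layer i
        delta = delta * (1 - out ** 2)        # through tanh: dL/d(pre-activation)
        grads[i] = np.outer(delta, activations[i])
        delta = weights[i].T @ delta          # pass the gradient to the previous layer
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]                          # assumed layer widths
weights = [rng.normal(size=(m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
x, target = rng.normal(size=4), rng.normal(size=3)

activations = forward(x, weights)
grad_output = activations[-1] - target        # gradient of 0.5 * ||output - target||^2
grads = backward(grad_output, weights, activations)
print([g.shape for g in grads])
```

The backward loop runs once per layer and never recomputes an activation, which is where the linear cost comes from.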

---

## **8. Modern Applications of Backpropagation**

Backpropagation is the backbone of modern deep learning. Some use cases:

* **Vision**: CNNs use backprop to learn filters that detect edges, textures, and object parts.
* **Language**: Transformers use backprop to learn contextual embeddings.
* **Reinforcement Learning**: Policy-gradient methods and deep Q-learning use backprop to train their networks.
* **Biology**: Neural networks trained with backprop predict protein structures (AlphaFold).
* **Art**: Style transfer, GANs, and diffusion models are all powered by backprop.

---

## **9. Limitations and Challenges**

* **Vanishing gradients**: In deep networks, early layers get very small gradients.
* **Exploding gradients**: Opposite problem; gradients grow too large.
* **Non-convexity**: Many local minima or saddle points.
* **Computational cost**: Training large models can require GPUs/TPUs and a lot of energy.
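
The vanishing-gradient problem is easy to reproduce in a toy setting. The sketch below assumes a chain of scalar sigmoid layers that all share the weight `w = 1.0` (a deliberate simplification) and tracks how sensitive each layer's output is to the network input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

depth = 20        # assumed depth for the demo
w = 1.0           # every layer shares the same scalar weight (a simplification)
h = 0.5           # network input
grad = 1.0        # running product of local slopes
grad_by_depth = []

for _ in range(depth):
    h = sigmoid(w * h)
    grad *= w * h * (1 - h)    # chain rule: multiply by this layer's local slope
    grad_by_depth.append(grad)

print(grad_by_depth[0], grad_by_depth[-1])
```

Because the slope of the sigmoid never exceeds 0.25, each extra layer shrinks the product by at least a factor of four in this setup.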

---

## **10. Backprop vs. Alternatives**

While backprop dominates deep learning, some are exploring:

* **Hebbian Learning**: Inspired by neuroscience.
* **Feedback Alignment**: Replaces exact weight transposes with random matrices.
* **Neuro-symbolic methods**: Mix logic with learning.
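
To illustrate the feedback alignment bullet, the sketch below contrasts the two ways of sending the error back to the hidden layer. It is only a rough illustration under assumed sizes and a made-up matrix name `B`: the single change relative to backprop is that a fixed random matrix stands in for `W2.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 4, 2, 3, 2                      # assumed sizes for the demo
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(0, C, size=N)]    # one-hot labels
W1, W2 = rng.normal(size=(D, H)), rng.normal(size=(H, C))
B = rng.normal(size=(C, H))                  # fixed random feedback matrix, never trained

# Forward pass (biases omitted for brevity)
z1 = X @ W1
a1 = np.maximum(0, z1)
z2 = a1 @ W2
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
y_hat = e / e.sum(axis=1, keepdims=True)

dz2 = (y_hat - Y) / N
da1_backprop = dz2 @ W2.T                    # exact backprop uses the transpose of W2
da1_fa = dz2 @ B                             # feedback alignment uses B instead
dW1_fa = X.T @ (da1_fa * (z1 > 0))           # update signal for W1 under feedback alignment
```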

## **11. Implementing from Scratch**

We will build a simple neural network with:
- Input layer: 2 features (e.g., X shape = (N, 2))
- Hidden layer: 3 neurons with ReLU
- Output layer: 2 classes with Softmax
- Loss: Cross-entropy

> Goal: Classify input points into 2 classes

```python
import numpy as np
```

```python
# ReLU activation function and its derivative
def relu(x):
    """Rectified Linear Unit activation function."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 where x > 0, else 0."""
    return (x > 0).astype(float)
```

```python
# Softmax function
def softmax(logits):
    """Convert each row of logits into a probability distribution."""
    # Subtract the row-wise max to avoid overflow in exp (a new array, so the input is not modified)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)  # Normalize to get probabilities
```
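
To see why subtracting the maximum matters, the following sketch compares a naive softmax with the shifted version on deliberately large logits. The input values are an assumption chosen to trigger overflow.

```python
import numpy as np

logits = np.array([[1000.0, 1001.0, 1002.0]])    # deliberately huge values

# Naive version: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Shifted version: the largest exponent is now exp(0) = 1.
shifted = logits - logits.max(axis=-1, keepdims=True)
stable = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(naive)   # all nan
print(stable)  # a valid probability distribution
```

The shift does not change the result mathematically, because the common factor cancels between numerator and denominator.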

```python
# Cross-entropy loss function
def cross_entropy_loss(probs, y_true):
    """Mean cross-entropy; y_true holds integer class labels of shape (N,)."""
    N = y_true.shape[0]  # Number of samples
    correct_logprobs = -np.log(probs[np.arange(N), y_true] + 1e-15)  # Small constant avoids log(0)
    return np.sum(correct_logprobs) / N  # Average loss over all samples

# One-hot encoding function
def one_hot(y, num_classes):
    """Convert class labels to one-hot encoded format."""
    return np.eye(num_classes)[y]  # Index rows of the identity matrix with y
```
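
Putting the pieces together, a minimal training loop for the 2-3-2 network might look like the sketch below. It is self-contained, so it inlines the ReLU, softmax, and cross-entropy logic instead of calling the helpers defined earlier. The toy dataset, the seed, the weight initialization, the learning rate, and the number of steps are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for the demo): two Gaussian blobs, one per class.
N = 200
X = np.vstack([rng.normal(-2.0, 1.0, size=(N // 2, 2)),
               rng.normal(+2.0, 1.0, size=(N // 2, 2))])
y = np.array([0] * (N // 2) + [1] * (N // 2))
Y = np.eye(2)[y]                             # one-hot labels

# 2 inputs -> 3 hidden units (ReLU) -> 2 classes (softmax)
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 2)), np.zeros(2)

learning_rate, steps = 0.1, 500              # hypothetical hyperparameters
losses = []

for _ in range(steps):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    z2 = a1 @ W2 + b2
    exps = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = exps / exps.sum(axis=1, keepdims=True)
    losses.append(-np.mean(np.log(probs[np.arange(N), y] + 1e-15)))

    # Backward pass (mean loss, hence the 1/N factor)
    dz2 = (probs - Y) / N
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient-descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Accuracy of the predictions from the last forward pass
accuracy = np.mean(probs.argmax(axis=1) == y)
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

On well-separated data like this the loss should fall steadily; on harder data the learning rate and the hidden size would likely need tuning.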