# Implementing Back-Propagation from Scratch

Backpropagation is the central algorithm for training neural networks. It allows us to efficiently compute gradients of loss functions with respect to all weights in the network. These gradients are then used to update the weights via gradient descent.


## **1. Motivation: Why Backpropagation?**

Let’s start with the key challenge in training a neural network:

> **We want to adjust the weights in a neural network so that it makes better predictions.**

But to adjust the weights, we need to compute **how the loss changes with respect to each weight**, i.e., compute **gradients**:

$$
\frac{\partial \mathcal{L}}{\partial w}
$$

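Computing this separately for every weight is **computationally expensive** when the network has millions of weights across many layers: expanding the chain rule path by path revisits the same sub-expressions over and over, and the number of paths through the network grows exponentially with depth.
Backpropagation **reuses intermediate computations**, so the cost of computing all the gradients grows only linearly with the number of layers.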



## **2. Forward Pass: Flow of Information**

Think of the neural network as a **factory assembly line**:

* Inputs go in.
* Each layer does a transformation.
* The final output is a prediction.
* The loss measures how wrong the prediction was.

For example, in a 2-layer neural network:

```plaintext
Input x → [W1, b1] → h1 = f1(W1x + b1)
           ↓
         [W2, b2] → ŷ = f2(W2h1 + b2)
```
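The diagram can be traced with concrete numbers. The sketch below is only an illustration: the weight values are made up, `f1` is assumed to be ReLU and `f2` is assumed to be the identity, and a single input vector is used so that the products match the `W1x` notation of the diagram.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

# Hypothetical values, chosen so the arithmetic is easy to follow by hand
x  = np.array([1.0, 2.0])             # input with 2 features
W1 = np.array([[ 1.0,  0.0],
               [ 0.0,  1.0],
               [-1.0, -1.0]])         # shape (3, 2): 2 inputs -> 3 hidden units
b1 = np.zeros(3)
W2 = np.array([[1.0, 1.0, 1.0],
               [0.0, 2.0, 0.0]])      # shape (2, 3): 3 hidden units -> 2 outputs
b2 = np.array([0.5, -0.5])

h1 = relu(W1 @ x + b1)                # W1 @ x = [1, 2, -3], so h1 = [1, 2, 0]
y_hat = W2 @ h1 + b2                  # [3, 4] + [0.5, -0.5] = [3.5, 3.5]

print(h1, y_hat)
```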


## **3. Backward Pass: Flow of Responsibility**

Now we ask: *Which weight caused how much of the error?*

Backpropagation works **backwards**, starting from the loss:

* How much did each output neuron contribute to the loss?
* How much did each hidden neuron contribute to the output?
* How much did each weight contribute to its neuron’s output?

This is where **the chain rule of calculus** comes in.





## **4. Mathematical Foundation: The Chain Rule**

Let’s say:

* $z = f(y)$
* $y = g(x)$

Then:

$$
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
$$

This rule allows us to “chain” the effects of input on the output. In neural networks, the output is a function of a function of a function… So we use the chain rule **repeatedly**, layer by layer.
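A quick numerical sanity check of the chain rule, as a minimal sketch. The functions $g(x) = x^2$ and $f(y) = \sin(y)$ are arbitrary choices made here for illustration; they do not come from the network above.

```python
import math

def g(x):
    return x ** 2          # inner function: y = g(x)

def f(y):
    return math.sin(y)     # outer function: z = f(y)

x = 1.5
y = g(x)

# Chain rule: dz/dx = dz/dy * dy/dx
dz_dy = math.cos(y)
dy_dx = 2 * x
analytic = dz_dy * dy_dx

# Central finite difference on the composed function, as an independent check
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numeric)   # the two values should agree to several decimal places
```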

---

### **Example: One Hidden Layer**

Let:

* $a = Wx + b$ — pre-activation
* $h = \sigma(a)$ — activation (e.g., ReLU or sigmoid)
* $\hat{y} = Uh + c$ — final score
* $\mathcal{L} = \text{loss}(\hat{y}, y)$

We want:

* $\frac{\partial \mathcal{L}}{\partial W}$
* $\frac{\partial \mathcal{L}}{\partial b}$

Let’s compute:

1. $\frac{\partial \mathcal{L}}{\partial \hat{y}}$ → from the loss function (e.g., softmax + cross entropy)
2. $\frac{\partial \hat{y}}{\partial h} = U$
3. $\frac{\partial h}{\partial a} = \sigma'(a)$
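4. $\frac{\partial a}{\partial W} = x$
5. $\frac{\partial a}{\partial b} = 1$

Chain them:

$$
\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W}
$$

The bias gradient uses the same chain; only the last factor changes, and since $\frac{\partial a}{\partial b} = 1$ it drops out:

$$
\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial a}
$$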

In practice, we implement this using matrix derivatives and vectorized operations.
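As a sketch of what the matrix form looks like for this example, assume column vectors with $x \in \mathbb{R}^{D}$, $W \in \mathbb{R}^{H \times D}$ and $U \in \mathbb{R}^{C \times H}$. These shapes, and the symbol $\delta$ for the gradient at the pre-activation, are assumptions introduced here; $\odot$ denotes the element-wise product.

$$
\delta = \frac{\partial \mathcal{L}}{\partial a} = \left( U^\top \frac{\partial \mathcal{L}}{\partial \hat{y}} \right) \odot \sigma'(a),
\qquad
\frac{\partial \mathcal{L}}{\partial W} = \delta \, x^\top,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \delta
$$

The shapes confirm the ordering: $\delta \in \mathbb{R}^{H}$ and $x^\top \in \mathbb{R}^{1 \times D}$, so $\delta \, x^\top$ has the same shape as $W$.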




## **5. Real-World Analogy**

Think of a restaurant kitchen with multiple stations:

* Ingredients (input)
* Chopping station (layer 1)
* Cooking station (layer 2)
* Plating station (layer 3)
* Final dish (output)

Now suppose a dish was terrible (high loss). To improve:

* The chef at the plating station adjusts based on feedback.
* That feedback goes backward to the cook.
* The cook blames the chopping for bad ingredients.
* Each station gets **blame** proportionally to its **influence**.

This blame assignment is the essence of backpropagation.


## **6. Vectorized Implementation**

For efficiency, we use matrix notation and perform the entire backward pass in **vectorized form**.

Let’s say for one hidden layer:

```python
# Forward pass
z1 = X @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_hat = softmax(z2)
loss = cross_entropy(y_hat, y_true)  # y_true is one-hot; loss is averaged over the N rows of X
```

Then backpropagation:

```python
# Backward pass
N = X.shape[0]
dz2 = (y_hat - y_true) / N        # gradient of the mean loss w.r.t. z2
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0)

da1 = dz2 @ W2.T
dz1 = da1 * relu_derivative(z1)   # gradient through the activation
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)
```


## **7. Why Backprop is Efficient**

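Computing each gradient independently (for example, by perturbing one weight at a time and re-running the forward pass) costs at least one forward pass per weight, which is **costly** with many layers and neurons. Backprop takes advantage of **dynamic programming** by:

* Storing intermediate values during the forward pass
* Reusing them in the backward pass (no recomputation)
* Propagating gradients in a chain (chain rule)

All gradients come out of one forward and one backward pass, so the cost grows as **O(L)**, where L = number of layers (for a fixed layer width).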
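A rough way to see the saving, as a back-of-the-envelope sketch rather than a benchmark. The helper `count_parameters` is a hypothetical name introduced here, and the comparison assumes central finite differences as the naive method.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

layer_sizes = [2, 3, 2]              # 2 inputs, 3 hidden units, 2 outputs
P = count_parameters(layer_sizes)    # (2*3 + 3) + (3*2 + 2) = 17

# Central finite differences: two loss evaluations (forward passes) per parameter
forward_passes_finite_diff = 2 * P

# Backpropagation: one forward pass plus one backward pass yields every gradient
passes_backprop = 2

print(P, forward_passes_finite_diff, passes_backprop)
```

For this tiny network the gap is 34 passes versus 2; the finite-difference count scales with the number of parameters, while the backprop count stays fixed.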

---

## **8. Modern Applications of Backpropagation**

Backpropagation is the backbone of modern deep learning. Some use cases:

* **Vision**: CNNs use backprop to learn filters that detect edges, textures, and object parts.
* **Language**: Transformers use backprop to learn contextual embeddings.
* **Reinforcement Learning**: Policy-gradient methods and deep Q-learning use backprop to train their networks.
* **Biology**: Neural networks trained with backprop predict protein structures (AlphaFold).
* **Art**: Style transfer, GANs, and diffusion models are all powered by backprop.

---

## **9. Limitations and Challenges**

* **Vanishing gradients**: In deep networks, early layers get very small gradients.
* **Exploding gradients**: Opposite problem; gradients grow too large.
* **Non-convexity**: Many local minima or saddle points.
* **Computational cost**: Training large models can require GPUs/TPUs and a lot of energy.
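The vanishing-gradient item can be made concrete with the sigmoid, whose derivative never exceeds 0.25. The sketch below looks only at the product of activation derivatives along one path and ignores the weight matrices, which is a simplifying assumption; large weights can offset the shrinkage or cause the exploding case instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)     # largest possible value is 0.25, reached at z = 0

# Best case for the sigmoid: every pre-activation sits exactly at 0
for depth in (1, 5, 10, 20):
    factor = sigmoid_grad(0.0) ** depth
    print(f"depth {depth:2d}: activation-derivative factor <= {factor:.3e}")
```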

---

## **10. Backprop vs. Alternatives**

While backprop dominates deep learning, researchers are also exploring alternatives:

* **Hebbian Learning**: Inspired by neuroscience.
* **Feedback Alignment**: Replaces exact weight transposes with random matrices.
* **Neuro-symbolic methods**: Mix logic with learning.

## Implementing from Scratch

We will build a simple neural network with:
- Input layer: 2 features (e.g., X shape = (N, 2))
- Hidden layer: 3 neurons with ReLU
- Output layer: 2 classes with Softmax
- Loss: Cross-entropy

> Goal: Classify input points into 2 classes


### Forward Pass: Understanding the Math

Let’s define:

* $X \in \mathbb{R}^{N \times D}$: input data ($N$ samples, $D$ features)
* $y \in \{0, \dots, C-1\}^{N}$: true class labels, one integer per sample (here $C = 2$)
* $W_1 \in \mathbb{R}^{D \times H}$, $b_1 \in \mathbb{R}^{H}$: weights/biases for hidden layer ($H$ hidden units)
* $W_2 \in \mathbb{R}^{H \times C}$, $b_2 \in \mathbb{R}^{C}$: weights/biases for output layer ($C$ classes)

#### Forward equations:

1. **Hidden layer pre-activation**:

   $$
   z_1 = X W_1 + b_1
   $$
2. **Hidden activation (ReLU)**:

   $$
   a_1 = \text{ReLU}(z_1)
   $$
3. **Output logits**:

   $$
   z_2 = a_1 W_2 + b_2
   $$
4. **Softmax probabilities**:

   $$
   \hat{y} = \text{softmax}(z_2)
   $$

#### Loss (cross-entropy for classification):

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{i, y_i})
$$
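A small worked instance of this loss, using two made-up probability rows so the result can be checked by hand: the true-class probabilities are 0.9 and 0.8, so the loss is $-(\ln 0.9 + \ln 0.8)/2 \approx 0.1643$.

```python
import numpy as np

# Hypothetical softmax outputs for N = 2 samples and C = 2 classes
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
y = np.array([0, 1])                  # true class index of each sample

N = y.shape[0]
picked = probs[np.arange(N), y]       # probability assigned to the true class: [0.9, 0.8]
loss = -np.mean(np.log(picked))       # average negative log-likelihood

print(picked, loss)
```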

### Backpropagation: Chain Rule Step by Step

We want gradients of the loss with respect to all weights and biases:

* $\frac{\partial \mathcal{L}}{\partial W_2}$, $\frac{\partial \mathcal{L}}{\partial b_2}$
* $\frac{\partial \mathcal{L}}{\partial W_1}$, $\frac{\partial \mathcal{L}}{\partial b_1}$

#### Step-by-step:

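1. **Loss to logits** (the factor $\frac{1}{N}$ comes from averaging the loss over the $N$ samples):

   $$
   \frac{\partial \mathcal{L}}{\partial z_2} = \frac{1}{N} \left( \hat{y} - y_{\text{one-hot}} \right)
   $$
2. **Logits to output weights and bias** (the bias gradient sums the rows, one per sample):

   $$
   \frac{\partial \mathcal{L}}{\partial W_2} = a_1^T \cdot \frac{\partial \mathcal{L}}{\partial z_2}, \qquad
   \frac{\partial \mathcal{L}}{\partial b_2} = \sum_{i=1}^{N} \left( \frac{\partial \mathcal{L}}{\partial z_2} \right)_{i}
   $$
3. **Output back to hidden**:

   $$
   \frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot W_2^T
   $$
4. **Apply ReLU derivative** (element-wise product $\odot$):

   $$
   \frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \odot \text{ReLU}'(z_1)
   $$
5. **Input to hidden weights and bias**:

   $$
   \frac{\partial \mathcal{L}}{\partial W_1} = X^T \cdot \frac{\partial \mathcal{L}}{\partial z_1}, \qquad
   \frac{\partial \mathcal{L}}{\partial b_1} = \sum_{i=1}^{N} \left( \frac{\partial \mathcal{L}}{\partial z_1} \right)_{i}
   $$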
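One way to gain confidence in a derivation like this is a gradient check: compare the analytic gradient with a central finite difference of the loss. The sketch below is self-contained and uses random data; the function names `forward` and `backward`, the seed, and the sizes are choices made for this illustration. It divides the logit gradient by `N` because the loss above is an average over samples.

```python
import numpy as np

def forward(X, y, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                             # ReLU
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)            # stabilise softmax
    probs = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    return loss, (z1, a1, probs)

def backward(X, y, W2, cache):
    z1, a1, probs = cache
    N = X.shape[0]
    dz2 = probs.copy()
    dz2[np.arange(N), y] -= 1.0                        # probs - one_hot(y)
    dz2 /= N                                           # loss is a mean over samples
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)                      # ReLU derivative, element-wise
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 2)), np.array([0, 1, 1, 0, 1])
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

loss, cache = forward(X, y, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, y, W2, cache)

# Numerical gradient of the loss with respect to each entry of W1
eps = 1e-5
num_dW1 = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    original = W1[idx]
    W1[idx] = original + eps
    loss_plus, _ = forward(X, y, W1, b1, W2, b2)
    W1[idx] = original - eps
    loss_minus, _ = forward(X, y, W1, b1, W2, b2)
    W1[idx] = original                                 # restore the weight
    num_dW1[idx] = (loss_plus - loss_minus) / (2 * eps)

max_err = np.max(np.abs(num_dW1 - dW1))
print(f"max |numeric - analytic| for W1: {max_err:.2e}")
```

The same loop can be repeated for `b1`, `W2` and `b2`. A check like this can disagree near a ReLU kink (a pre-activation within `eps` of zero), so a small mismatch on an isolated entry is not necessarily a bug.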




```python
import numpy as np
```

```python
# ReLU activation function and its derivative
def relu(x):
    """Rectified Linear Unit activation function."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 where x > 0, else 0."""
    return (x > 0).astype(float)
```

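```python
# Softmax function
def softmax(logits):
    """Row-wise softmax; returns a new array and leaves `logits` unchanged."""
    # Subtract the row maximum for numerical stability (avoids overflow in exp).
    # A new array is created here instead of using `-=`, which would modify the caller's logits.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)  # normalize to get probabilities
```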

```python
# Cross-entropy loss function
def cross_entropy_loss(probs, y_true):
    """Mean negative log-probability of the true class; y_true holds integer class indices."""
    N = y_true.shape[0]  # number of samples
    correct_logprobs = -np.log(probs[np.arange(N), y_true] + 1e-15)  # small constant avoids log(0)
    return np.sum(correct_logprobs) / N  # average loss over all samples

# One-hot encoding function
def one_hot(y, num_classes):
    """Convert class labels to one-hot encoded format."""
    return np.eye(num_classes)[y]  # index the rows of the identity matrix with y
```

### Training Loop

```python
def train(X, y, hidden_dim=3, lr=1.0, num_epochs=1000):
    N, D = X.shape  # number of samples and input dimension
    C = int(np.max(y)) + 1  # number of classes (labels are assumed to be 0..C-1)

    # Initialise weights and biases
    W1 = 0.01 * np.random.randn(D, hidden_dim)  # input to hidden layer weights
    b1 = np.zeros((1, hidden_dim))  # hidden layer biases
    W2 = 0.01 * np.random.randn(hidden_dim, C)  # hidden to output layer weights
    b2 = np.zeros((1, C))  # output layer biases
    y_onehot = one_hot(y, C)  # labels as one-hot rows

    for i in range(num_epochs):
        # Forward pass
        z1 = X @ W1 + b1
        a1 = relu(z1)
        z2 = a1 @ W2 + b2
        probs = softmax(z2)
        loss = cross_entropy_loss(probs, y)

        if i % 100 == 0:
            print(f"Epoch {i}: Loss = {loss:.4f}")

        # Backward pass
        # Divide by N because the loss is a mean over samples; without it the
        # step is N times larger than the loss definition implies.
        dz2 = (probs - y_onehot) / N  # (N, C)
        dW2 = a1.T @ dz2
        db2 = np.sum(dz2, axis=0, keepdims=True)

        da1 = dz2 @ W2.T
        dz1 = da1 * relu_derivative(z1)
        dW1 = X.T @ dz1
        db1 = np.sum(dz1, axis=0, keepdims=True)

        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    return W1, b1, W2, b2
```

### Testing on Data

```python
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Creating synthetic data
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
# Training the network
train(X, y)
```

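The output below was recorded with `dz2 = probs - y_onehot`, that is, with the gradient summed over the 200 samples instead of averaged. Combined with `lr=1.0`, this makes each step 200 times larger than the averaged loss implies, and the run diverges: the loss jumps from ln 2 ≈ 0.6932 to 17.2694 and stays there, and some weights reach magnitudes around 1e29. The value 17.2694 equals `-log(1e-15) / 2`, which means the network assigns essentially zero probability to the true class for half of the samples.

```plaintext
Epoch 0: Loss = 0.6932
Epoch 100: Loss = 17.2694
Epoch 200: Loss = 17.2694
Epoch 300: Loss = 17.2694
Epoch 400: Loss = 17.2694
Epoch 500: Loss = 17.2694
Epoch 600: Loss = 17.2694
Epoch 700: Loss = 17.2694
Epoch 800: Loss = 17.2694
Epoch 900: Loss = 17.2694


(array([[-2.59549192e+02, -9.40066550e+03, -1.39900159e+28],
        [-2.75464751e+01,  1.17985290e+02, -3.37004499e+27]]),
 array([[-1.12887767e+03, -3.66176760e+04, -2.97129997e+28]]),
 array([[-1.28460763e+02,  1.28458781e+02],
        [-3.19103476e+03,  3.19106043e+03],
        [-4.44034525e+29,  4.44034525e+29]]),
 array([[-46.46629019,  46.46629019]]))
```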
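The loss alone says little about how many points are classified correctly. A minimal sketch of a prediction and accuracy helper follows; `predict` and `accuracy` are hypothetical names introduced here, and the demo parameters are hand-picked only to exercise the functions, not taken from a training run.

```python
import numpy as np

def predict(X, W1, b1, W2, b2):
    """Return the predicted class index for each row of X."""
    a1 = np.maximum(0, X @ W1 + b1)      # hidden layer with ReLU
    scores = a1 @ W2 + b2                # output logits
    # Softmax preserves the ordering of the logits, so argmax on the logits is enough
    return np.argmax(scores, axis=1)

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_pred == y_true))

# Hand-picked demo parameters (identity weights, zero biases)
W1_demo, b1_demo = np.eye(2), np.zeros((1, 2))
W2_demo, b2_demo = np.eye(2), np.zeros((1, 2))
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]])
y_demo = np.array([0, 1, 1])

y_pred = predict(X_demo, W1_demo, b1_demo, W2_demo, b2_demo)   # [0, 1, 0]
acc = accuracy(y_pred, y_demo)                                 # 2 of 3 correct
print(y_pred, acc)
```

With the network above, the call would look like `W1, b1, W2, b2 = train(X, y)` followed by `accuracy(predict(X, W1, b1, W2, b2), y)`.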