# 🎯 Backpropagation: Teaching the Network to Learn

**One of the Most Important Concepts in Deep Learning!**

Welcome to what many consider the **hardest** topic in neural networks. But don't worry! We're going to break it down into tiny, digestible pieces using lots of analogies and visualizations. By the end, you'll understand how neural networks actually learn.

---

## 📖 What We'll Learn

1. **The Problem**: We know the error, but how do we improve?
2. **Gradient Descent**: Finding the downhill direction
3. **The Chain Rule**: The secret sauce of backpropagation
4. **Backpropagation**: Working backward through the network
5. **Implementation**: Building it from scratch
6. **Common Issues**: What can go wrong

---

## 🤔 The Problem: We Know We're Wrong, But How Do We Get Better?

Imagine you're learning to throw darts:
- You throw a dart 🎯
- It misses the bullseye by 5 inches to the right
- **You know you're wrong** (the dart missed)
- **But HOW should you adjust your throw?**

This is exactly where we are with neural networks:
- We have a network that makes predictions
- We calculate the loss (how wrong we are)
- **But which weights should we change? And by how much?**

**Backpropagation is the answer!** It tells us exactly how to adjust each weight to reduce the error.

---

In [None]:
# Import necessary libraries
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For creating visualizations
from matplotlib.animation import FuncAnimation  # For animated plots
from IPython.display import HTML  # For displaying animations in notebook
from mpl_toolkits.mplot3d import Axes3D  # For 3D plots

# Set random seed for reproducibility (so we get same results every time)
np.random.seed(42)

# Configure matplotlib for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')  # Nice looking plot style
plt.rcParams['figure.figsize'] = (12, 6)  # Default figure size

---

## 🏔️ Part 1: Gradient Descent - Finding the Way Downhill

### The Mountain Analogy 🏔️

Imagine you're **blindfolded** on a mountain and need to get to the bottom (lowest point). How do you do it?

1. **Feel the ground around you** - which direction slopes down?
2. **Take a step in that direction** - the steepest downward direction
3. **Repeat** until you reach the bottom

This is **exactly** what gradient descent does!

- **Mountain height** = Loss (error)
- **Your position** = Current weights
- **Feeling the slope** = Computing gradients
- **Taking a step** = Updating weights
- **Bottom of mountain** = Minimum loss (best weights)

### 💡 Key Insight

The **gradient** tells us:
1. **Direction**: Which way is uphill (the gradient points toward increasing loss)
2. **Steepness**: How steep the slope is

To minimize loss, we move in the **opposite direction** of the gradient (downhill)!

In [None]:
# Let's visualize a simple loss curve (1D example)
# Imagine this is the loss as we change one weight

def simple_loss_function(weight):
    """A simple quadratic loss function: (weight - 3)^2
    The minimum is at weight = 3"""
    return (weight - 3) ** 2

def gradient_of_loss(weight):
    """The gradient (derivative) of our loss function
    This tells us the slope at any point"""
    return 2 * (weight - 3)

# Create a range of weight values
weights = np.linspace(0, 6, 100)  # 100 points from 0 to 6
losses = simple_loss_function(weights)  # Calculate loss at each point

# Create the plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss curve
ax1.plot(weights, losses, 'b-', linewidth=2, label='Loss Function')
ax1.axvline(x=3, color='r', linestyle='--', label='Minimum (optimal weight)')
ax1.set_xlabel('Weight Value', fontsize=12)
ax1.set_ylabel('Loss (Error)', fontsize=12)
ax1.set_title('Loss Curve: Our Goal is to Reach the Bottom', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Gradient (slope) at different points
gradients = gradient_of_loss(weights)
ax2.plot(weights, gradients, 'g-', linewidth=2, label='Gradient (Slope)')
ax2.axhline(y=0, color='r', linestyle='--', label='Zero gradient (minimum)')
ax2.axvline(x=3, color='r', linestyle='--', alpha=0.5)
ax2.set_xlabel('Weight Value', fontsize=12)
ax2.set_ylabel('Gradient', fontsize=12)
ax2.set_title('Gradient Shows Us Which Way is Downhill', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Understanding the plots:")
print("Left: The loss curve - we want to reach the minimum (red line)")
print("Right: The gradient at each point")
print("  • Positive gradient → slope goes uphill → move LEFT (decrease weight)")
print("  • Negative gradient → slope goes downhill → move RIGHT (increase weight)")
print("  • Zero gradient → we're at the minimum! 🎯")

### 🎯 Quick Summary

**Gradient**: The slope of the loss function. It tells us:
- If **positive**: loss increases as weight increases → decrease the weight
- If **negative**: loss decreases as weight increases → increase the weight
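- If **zero**: the slope is flat. For this bowl-shaped curve that means the minimum; in general it could be a local or global minimum, or even a maximum or saddle point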
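
A quick way to sanity-check these sign rules is to compare the exact gradient with a finite-difference estimate. The sketch below is only an illustration: it redefines the quadratic loss under new names so it can run on its own, and `numerical_gradient` and `eps` are names made up for this example.

```python
def loss_fn(weight):
    """Same quadratic loss as above: (weight - 3)^2, minimum at weight = 3."""
    return (weight - 3) ** 2

def analytic_gradient(weight):
    """Exact derivative of the loss: 2 * (weight - 3)."""
    return 2 * (weight - 3)

def numerical_gradient(f, weight, eps=1e-5):
    """Central-difference estimate of the slope of f at `weight` (hypothetical helper)."""
    return (f(weight + eps) - f(weight - eps)) / (2 * eps)

for w in [1.0, 3.0, 5.0]:
    exact = analytic_gradient(w)
    approx = numerical_gradient(loss_fn, w)
    if exact > 0:
        advice = "decrease the weight"
    elif exact < 0:
        advice = "increase the weight"
    else:
        advice = "stay put (flat slope)"
    print(f"w={w}: analytic={exact:+.4f}, numerical={approx:+.4f} -> {advice}")
```

If the two columns agree, the hand-written derivative is very likely correct. The same trick scales up to checking backpropagation gradients later on.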

---

## 🚶 Taking Steps: The Learning Rate

Now we know **which direction** to move. But **how big** should our steps be?

This is controlled by the **learning rate** (often denoted as α or lr).
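
For the quadratic loss used in this notebook, `(weight - 3)²`, the effect of the learning rate can be worked out exactly: one update multiplies the distance to the minimum by `1 - 2 × lr`. The sketch below assumes that specific loss; `distance_factor`, the starting point, and the three learning rates are illustrative choices.

```python
def distance_factor(learning_rate):
    """Factor applied to (weight - 3) on every step when loss = (weight - 3)^2.

    One update is: weight - lr * 2 * (weight - 3), so
    (new_weight - 3) = (1 - 2 * lr) * (weight - 3).
    """
    return 1 - 2 * learning_rate

for lr in [0.01, 0.3, 1.5]:  # illustrative values
    factor = distance_factor(lr)
    weight = 0.5  # arbitrary starting point
    for _ in range(5):
        weight = weight - lr * 2 * (weight - 3)  # plain gradient descent step
    verdict = "converges" if abs(factor) < 1 else "does not converge"
    print(f"lr={lr}: factor per step={factor:+.2f}, "
          f"weight after 5 steps={weight:.4f} ({verdict})")
```

A factor with magnitude below 1 shrinks the distance on every step; a magnitude above 1 grows it. The exact threshold depends on how curved the loss is, so it does not carry over to other problems.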

### 🐢 Learning Rate Too Small
- Baby steps
- Very slow progress
- Might take forever to reach the minimum

### 🐰 Learning Rate Too Large
- Giant leaps
- Might overshoot the minimum
- Could bounce around and never converge
- Might even make things worse!

### 🎯 Learning Rate Just Right
- Moderate steps
- Steady progress toward minimum
- Converges efficiently

In [None]:
# Let's see gradient descent in action with different learning rates!

def gradient_descent_1d(starting_weight, learning_rate, num_steps):
    """Perform gradient descent to find the minimum
    
    Args:
        starting_weight: Where we start
        learning_rate: How big our steps are
        num_steps: How many steps to take
    
    Returns:
        history: List of (weight, loss) at each step
    """
    weight = starting_weight  # Current weight
    history = []  # Track our journey
    
    for step in range(num_steps):
        # Calculate current loss
        loss = simple_loss_function(weight)
        history.append((weight, loss))
        
        # Calculate gradient (slope) at current position
        grad = gradient_of_loss(weight)
        
        # Update weight: move in OPPOSITE direction of gradient
        # (because we want to go downhill, not uphill!)
        weight = weight - learning_rate * grad
    
    return history

# Try three different learning rates
learning_rates = [0.01, 0.3, 1.5]  # Too small, just right, too large
starting_weight = 0.5  # All three runs start at the same place
num_steps = 20  # Take 20 steps

# Create the plot
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
titles = ['🐢 Too Small (lr=0.01)', '🎯 Just Right (lr=0.3)', '🐰 Too Large (lr=1.5)']

for idx, (ax, lr, title) in enumerate(zip(axes, learning_rates, titles)):
    # Run gradient descent
    history = gradient_descent_1d(starting_weight, lr, num_steps)
    weights_hist = [w for w, l in history]
    losses_hist = [l for w, l in history]
    
    # Plot the loss curve
    weights = np.linspace(0, 6, 100)
    losses = simple_loss_function(weights)
    ax.plot(weights, losses, 'b-', alpha=0.3, linewidth=2)
    
    # Plot the path taken by gradient descent
    ax.plot(weights_hist, losses_hist, 'ro-', linewidth=2, markersize=8, label='GD path')
    ax.plot(weights_hist[0], losses_hist[0], 'go', markersize=15, label='Start')
    ax.plot(weights_hist[-1], losses_hist[-1], 'r*', markersize=20, label='End')
    
    # Mark the true minimum
    ax.axvline(x=3, color='purple', linestyle='--', alpha=0.5, label='True minimum')
    
    ax.set_xlabel('Weight', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-0.5, 15)

plt.tight_layout()
plt.show()

print("\n🎓 What we learned:")
print("Left: Learning rate too small → slow but steady progress")
print("Middle: Learning rate just right → efficient convergence")
print("Right: Learning rate too large → every step overshoots farther and the loss grows!")

### ⚠️ Common Mistake

**Don't forget to negate the gradient!**

```python
# ❌ WRONG - moves uphill!
weight = weight + learning_rate * gradient

# ✅ CORRECT - moves downhill
weight = weight - learning_rate * gradient
```

---

## 🔗 Part 2: The Chain Rule - Connecting the Dots

Before we tackle backpropagation, we need to understand the **chain rule**. This is the mathematical tool that makes backpropagation possible.

### 🚗 Real-World Analogy: A Road Trip

Imagine you're planning a road trip:

1. **Your driving speed** affects **distance traveled**
2. **Distance traveled** affects **fuel consumed**
3. **Fuel consumed** affects **total cost**

```
Speed → Distance → Fuel → Cost
```

If you want to know: **"How does my speed affect my total cost?"**

You need to consider the **chain of effects**:
- Speed affects distance (faster = more miles in the same time)
- Distance affects fuel (more miles = more fuel)
- Fuel affects cost (more fuel = more money)

The chain rule lets us calculate this total effect by **multiplying** the individual effects!

### 📐 Simple Mathematical Example

Let's use actual numbers to make this concrete:

Suppose:
- You drive at **60 mph** for **2 hours**
- Distance = Speed × Time
- Cost = Distance × $0.50 per mile

**Chain**: `Speed → Distance → Cost`

Let's calculate the derivatives (rates of change):

In [None]:
# Simple chain rule example with numbers

# Given values
time_hours = 2  # We drive for 2 hours
cost_per_mile = 0.50  # $0.50 per mile

# Functions
def distance(speed):
    """Distance = Speed × Time"""
    return speed * time_hours

def cost(distance):
    """Cost = Distance × Cost per mile"""
    return distance * cost_per_mile

# Let's evaluate at speed = 60 mph
speed = 60
d = distance(speed)  # Calculate distance
c = cost(d)  # Calculate cost

print("🚗 Forward Calculation (from speed to cost):")
print(f"Speed: {speed} mph")
print(f"Distance: {d} miles")
print(f"Cost: ${c}")
print()

# Now let's calculate derivatives (how much things change)
print("📊 Derivatives (rates of change):")
print()

# How does distance change with speed?
# If speed increases by 1 mph, distance increases by time_hours miles
d_distance_d_speed = time_hours
print(f"∂distance/∂speed = {d_distance_d_speed}")
print(f"  → If speed ↑ by 1 mph, distance ↑ by {d_distance_d_speed} miles")
print()

# How does cost change with distance?
# If distance increases by 1 mile, cost increases by $0.50
d_cost_d_distance = cost_per_mile
print(f"∂cost/∂distance = {d_cost_d_distance}")
print(f"  → If distance ↑ by 1 mile, cost ↑ by ${d_cost_d_distance}")
print()

# CHAIN RULE: How does cost change with speed?
# We multiply the two derivatives!
d_cost_d_speed = d_distance_d_speed * d_cost_d_distance
print("⛓️ Chain Rule:")
print(f"∂cost/∂speed = (∂cost/∂distance) × (∂distance/∂speed)")
print(f"∂cost/∂speed = {d_cost_d_distance} × {d_distance_d_speed} = {d_cost_d_speed}")
print(f"  → If speed ↑ by 1 mph, cost ↑ by ${d_cost_d_speed}")
print()

# Verify with actual calculation
print("✅ Verification:")
speed_plus_1 = speed + 1
cost_plus_1 = cost(distance(speed_plus_1))
actual_change = cost_plus_1 - c
print(f"Cost at {speed} mph: ${c}")
print(f"Cost at {speed_plus_1} mph: ${cost_plus_1}")
print(f"Actual change: ${actual_change}")
print(f"Predicted change (chain rule): ${d_cost_d_speed}")
print(f"Match! ✓" if abs(actual_change - d_cost_d_speed) < 0.01 else "Mismatch ✗")

### 💡 Key Insight: The Chain Rule Formula

If we have a chain: `A → B → C`

To find how `C` changes with `A`, we multiply:

$$\frac{\partial C}{\partial A} = \frac{\partial C}{\partial B} \times \frac{\partial B}{\partial A}$$

In plain English:
- **How C changes with A** = (**How C changes with B**) × (**How B changes with A**)

For longer chains: `A → B → C → D`, we keep multiplying:

$$\frac{\partial D}{\partial A} = \frac{\partial D}{\partial C} \times \frac{\partial C}{\partial B} \times \frac{\partial B}{\partial A}$$

**This is the foundation of backpropagation!**
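
The formula also holds when the links in the chain are not straight lines. Here is a small sketch with a made-up chain `A → B → C`, where `B = A²` and `C = sin(B)`; the function names and the test point are assumptions for illustration. The chain-rule result is compared against a finite-difference estimate.

```python
import math

def inner(a):
    """B as a function of A: B = A^2."""
    return a ** 2

def outer(b):
    """C as a function of B: C = sin(B)."""
    return math.sin(b)

def chain_rule_derivative(a):
    """dC/dA = (dC/dB) * (dB/dA) = cos(A^2) * 2A."""
    dC_dB = math.cos(inner(a))
    dB_dA = 2 * a
    return dC_dB * dB_dA

def finite_difference(a, eps=1e-6):
    """Central-difference estimate of dC/dA, used only as a check."""
    return (outer(inner(a + eps)) - outer(inner(a - eps))) / (2 * eps)

a = 1.2  # arbitrary test point
print(f"chain rule:        {chain_rule_derivative(a):.6f}")
print(f"finite difference: {finite_difference(a):.6f}")
```

The two numbers should agree to several decimal places, which is the same kind of check the road trip example performed with a one-mph nudge.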

---

## 🔄 Part 3: Backpropagation - Putting It All Together

Now we're ready for the main event! **Backpropagation** is just applying the chain rule systematically to a neural network.

### 🧠 The Neural Network Chain

In a neural network, we have a chain like this:

```
Weights → Weighted Sum → Activation → Output → Loss
```

We want to know: **How does the loss change if we change a weight?**

We use the chain rule, working **backward** from loss to weights:

1. **Start at loss**: We know the error
2. **Work backward**: How did each layer contribute to this error?
3. **Assign blame**: Each weight gets a "blame score" (gradient)
4. **Update weights**: Nudge each weight against its gradient to reduce the loss
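
The four steps above can be sketched for the smallest possible case: one input, one weight, one bias, and one sigmoid neuron. The values of `x`, `target`, `w`, `b`, and `lr` are made up for illustration, and the loss is assumed to be half the squared error.

```python
import numpy as np

def sigmoid_fn(z):
    """Sigmoid activation (local copy so the sketch runs on its own)."""
    return 1 / (1 + np.exp(-z))

x, target = 1.5, 1.0   # one training example (illustrative values)
w, b = 0.2, 0.0        # initial weight and bias (illustrative values)
lr = 0.5               # learning rate (illustrative value)

# Forward: weight -> weighted sum -> activation -> loss
z = w * x + b
y = sigmoid_fn(z)
loss_before = 0.5 * (y - target) ** 2

# Backward: apply the chain rule link by link, starting at the loss
dL_dy = y - target          # how the loss changes with the output
dy_dz = y * (1 - y)         # sigmoid derivative, written in terms of y
dL_dz = dL_dy * dy_dz       # chain rule
dL_dw = dL_dz * x           # because z = w * x + b
dL_db = dL_dz               # because dz/db = 1

# Update: step against the gradient
w_new = w - lr * dL_dw
b_new = b - lr * dL_db

loss_after = 0.5 * (sigmoid_fn(w_new * x + b_new) - target) ** 2
print(f"dL/dw = {dL_dw:.4f}, dL/db = {dL_db:.4f}")
print(f"loss before = {loss_before:.6f}, loss after = {loss_after:.6f}")
```

The full network below does the same thing, only with vectors and matrices in place of single numbers.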

### 📋 Step-by-Step: 2-Layer Network Example

Let's build a tiny network and do backpropagation by hand!

**Network Architecture:**
- Input: 2 neurons (x₁, x₂)
- Hidden layer: 2 neurons (h₁, h₂) with sigmoid activation
- Output: 1 neuron (y) with sigmoid activation
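- Loss: Squared error with a ½ factor, `0.5 × (y_pred - y_true)²` (the MSE of a single sample, halved so the derivative is simply `y_pred - y_true`)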

```
Input     Hidden      Output
 x₁ ──┐   h₁ ──┐
      ├──→    ├──→  y  → Loss
 x₂ ──┘   h₂ ──┘
```
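
Written as equations, this network looks as follows. This is a sketch under two assumptions that match the code in this notebook: inputs are row vectors that multiply the weight matrix from the left, and the loss is half the squared error. Here $x$ is the input, $\hat{y}$ the prediction, $y$ the true output, and $\sigma$ the sigmoid.

Forward pass:

$$z_1 = x W_1 + b_1, \qquad h = \sigma(z_1)$$

$$z_2 = h W_2 + b_2, \qquad \hat{y} = \sigma(z_2)$$

$$L = \tfrac{1}{2}(\hat{y} - y)^2$$

Backward pass (chain rule, from the loss toward the input):

$$\frac{\partial L}{\partial z_2} = (\hat{y} - y) \times \sigma'(z_2)$$

$$\frac{\partial L}{\partial W_2} = h^\top \frac{\partial L}{\partial z_2}, \qquad \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2}$$

$$\frac{\partial L}{\partial z_1} = \left(\frac{\partial L}{\partial z_2} W_2^\top\right) \odot \sigma'(z_1)$$

$$\frac{\partial L}{\partial W_1} = x^\top \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}$$

The symbol $\odot$ means element-wise multiplication, one factor per hidden neuron.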

In [None]:
# Let's implement activation functions and their derivatives
# (We learned about these in Notebook 3!)

def sigmoid(x):
    """Sigmoid activation function: squashes values to (0, 1)"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid: sigmoid(x) * (1 - sigmoid(x))
    This tells us how fast sigmoid is changing at point x"""
    s = sigmoid(x)
    return s * (1 - s)

# Visualize sigmoid and its derivative
x = np.linspace(-6, 6, 100)
y_sigmoid = sigmoid(x)
y_derivative = sigmoid_derivative(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(x, y_sigmoid, 'b-', linewidth=2, label='sigmoid(x)')
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('sigmoid(x)', fontsize=12)
ax1.set_title('Sigmoid Activation Function', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()

ax2.plot(x, y_derivative, 'r-', linewidth=2, label="sigmoid'(x)")
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel("sigmoid'(x)", fontsize=12)
ax2.set_title('Derivative of Sigmoid (needed for backprop!)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()

plt.tight_layout()
plt.show()

print("💡 Why we need the derivative:")
print("During backpropagation, we need to know how the activation function")
print("contributes to the gradient. The derivative tells us this!")
print("Notice: derivative is highest around x=0, and very small for large |x|")

### 🔢 Numerical Example: Complete Forward and Backward Pass

Let's do a complete example with actual numbers so you can see every step!

In [None]:
# Simple 2-layer neural network with manual backpropagation
# We'll use tiny numbers and print everything!

print("="*60)
print("🧠 COMPLETE BACKPROPAGATION EXAMPLE")
print("="*60)
print()

# --- SETUP ---
print("📋 SETUP")
print("-" * 60)

# Input (2 features)
X = np.array([0.5, 0.8])  # Our input data
print(f"Input X: {X}")

# True output (what we want the network to predict)
y_true = 1.0
print(f"True output: {y_true}")
print()

# Weights (randomly initialized, but we'll use fixed values for clarity)
# Weights from input to hidden layer (2x2 matrix)
W1 = np.array([[0.1, 0.3],   # Weights from x1 to [h1, h2]
               [0.2, 0.4]])   # Weights from x2 to [h1, h2]
b1 = np.array([0.1, 0.2])     # Biases for hidden layer

print("Weights input→hidden (W1):")
print(W1)
print(f"Biases hidden (b1): {b1}")
print()

# Weights from hidden to output layer (2x1 matrix)
W2 = np.array([[0.5],   # Weight from h1 to output
               [0.6]])   # Weight from h2 to output
b2 = np.array([0.1])     # Bias for output

print("Weights hidden→output (W2):")
print(W2)
print(f"Bias output (b2): {b2}")
print()

learning_rate = 0.5  # How big our weight updates will be
print(f"Learning rate: {learning_rate}")
print()

# --- FORWARD PASS ---
print("="*60)
print("➡️ FORWARD PASS (Input → Prediction)")
print("="*60)
print()

# Hidden layer
print("🔵 Hidden Layer:")
print("-" * 60)

# Weighted sum at hidden layer: z1 = X · W1 + b1
z1 = np.dot(X, W1) + b1
print(f"Weighted sum z1 = X·W1 + b1")
print(f"  = {X} · {W1.tolist()} + {b1}")
print(f"  = {z1}")
print()

# Activation: apply sigmoid
h = sigmoid(z1)
print(f"Hidden activation h = sigmoid(z1)")
print(f"  = sigmoid({z1})")
print(f"  = {h}")
print()

# Output layer
print("🔴 Output Layer:")
print("-" * 60)

# Weighted sum at output: z2 = h · W2 + b2
z2 = np.dot(h, W2) + b2
print(f"Weighted sum z2 = h·W2 + b2")
print(f"  = {h} · {W2.T} + {b2}")
print(f"  = {z2}")
print()

# Final prediction: apply sigmoid
y_pred = sigmoid(z2)
print(f"Prediction y_pred = sigmoid(z2)")
print(f"  = sigmoid({z2})")
print(f"  = {y_pred}")
print()

# Loss calculation (squared error with a 0.5 factor)
print("💥 Loss Calculation:")
print("-" * 60)
loss = 0.5 * (y_pred - y_true) ** 2  # The 0.5 cancels the 2 when we differentiate
print(f"Loss = 0.5 × (y_pred - y_true)²")
print(f"     = 0.5 × ({y_pred[0]:.4f} - {y_true})²")
print(f"     = {loss[0]:.6f}")
print()

# --- BACKWARD PASS (BACKPROPAGATION!) ---
print("="*60)
print("⬅️ BACKWARD PASS (Error → Weight Updates)")
print("="*60)
print()

print("🎯 Our goal: Calculate how much each weight contributed to the error")
print("Then adjust weights in the opposite direction!")
print()

# Step 1: Gradient of loss with respect to prediction
print("Step 1️⃣: How does loss change with prediction?")
print("-" * 60)
dL_dy = y_pred - y_true  # Derivative of 0.5 × (y_pred - y_true)²
print(f"∂Loss/∂y_pred = y_pred - y_true")
print(f"              = {y_pred[0]:.4f} - {y_true}")
print(f"              = {dL_dy[0]:.4f}")
print("This tells us if prediction is too high (+) or too low (-)")
print()

# Step 2: Gradient at output layer (before activation)
print("Step 2️⃣: How does loss change with output layer weighted sum?")
print("-" * 60)
# Chain rule: dL/dz2 = dL/dy × dy/dz2
dy_dz2 = sigmoid_derivative(z2)  # Derivative of sigmoid
dL_dz2 = dL_dy * dy_dz2
print(f"∂Loss/∂z2 = (∂Loss/∂y) × (∂y/∂z2)")
print(f"          = {dL_dy[0]:.4f} × sigmoid'({z2[0]:.4f})")
print(f"          = {dL_dy[0]:.4f} × {dy_dz2[0]:.4f}")
print(f"          = {dL_dz2[0]:.4f}")
print("Chain rule in action! We multiplied two gradients.")
print()

# Step 3: Gradients for W2 and b2
print("Step 3️⃣: How does loss change with output weights W2 and bias b2?")
print("-" * 60)
# dL/dW2 = dL/dz2 × dz2/dW2 = dL/dz2 × h (because z2 = h·W2 + b2)
dL_dW2 = np.outer(h, dL_dz2)  # Outer product
print(f"∂Loss/∂W2 = (∂Loss/∂z2) × (∂z2/∂W2)")
print(f"          = {dL_dz2[0]:.4f} × {h}")
print(f"          = {dL_dW2.flatten()}")
print()

dL_db2 = dL_dz2  # dz2/db2 = 1, so gradient is just dL_dz2
print(f"∂Loss/∂b2 = {dL_db2[0]:.4f}")
print()

# Step 4: Gradient flowing back to hidden layer
print("Step 4️⃣: How does loss change with hidden layer activations?")
print("-" * 60)
# dL/dh = dL/dz2 × dz2/dh = dL/dz2 × W2ᵀ (because z2 = h·W2 + b2)
dL_dh = np.dot(dL_dz2, W2.T)
print(f"∂Loss/∂h = (∂Loss/∂z2) × W2ᵀ")
print(f"         = {dL_dz2[0]:.4f} × {W2.T}")
print(f"         = {dL_dh}")
print("The error is propagated back through the weights!")
print()

# Step 5: Gradient at hidden layer (before activation)
print("Step 5️⃣: How does loss change with hidden layer weighted sum?")
print("-" * 60)
# dL/dz1 = dL/dh × dh/dz1
dh_dz1 = sigmoid_derivative(z1)
dL_dz1 = dL_dh * dh_dz1
print(f"∂Loss/∂z1 = (∂Loss/∂h) × (∂h/∂z1)")
print(f"          = {dL_dh} × sigmoid'({z1})")
print(f"          = {dL_dh} × {dh_dz1}")
print(f"          = {dL_dz1}")
print()

# Step 6: Gradients for W1 and b1
print("Step 6️⃣: How does loss change with input weights W1 and bias b1?")
print("-" * 60)
# dL/dW1 = Xᵀ × dL/dz1 (outer product, because z1 = X·W1 + b1)
dL_dW1 = np.outer(X, dL_dz1)
print(f"∂Loss/∂W1 = Xᵀ × (∂Loss/∂z1)  (outer product)")
print(f"          = {X} × {dL_dz1}")
print("Result:")
print(dL_dW1)
print()

dL_db1 = dL_dz1
print(f"∂Loss/∂b1 = {dL_db1}")
print()

# --- WEIGHT UPDATE ---
print("="*60)
print("🔄 WEIGHT UPDATE (Gradient Descent Step)")
print("="*60)
print()

print("Formula: new_weight = old_weight - learning_rate × gradient")
print()

# Update all weights
W2_new = W2 - learning_rate * dL_dW2
b2_new = b2 - learning_rate * dL_db2
W1_new = W1 - learning_rate * dL_dW1
b1_new = b1 - learning_rate * dL_db1

print("Output layer updates:")
print(f"W2: {W2.flatten()} → {W2_new.flatten()}")
print(f"b2: {b2} → {b2_new}")
print()
print("Hidden layer updates:")
print(f"W1:")
print(f"  Old:\n{W1}")
print(f"  New:\n{W1_new}")
print(f"b1: {b1} → {b1_new}")
print()

# Verify improvement
print("="*60)
print("✅ VERIFICATION: Did we improve?")
print("="*60)
print()

# Forward pass with new weights
z1_new = np.dot(X, W1_new) + b1_new
h_new = sigmoid(z1_new)
z2_new = np.dot(h_new, W2_new) + b2_new
y_pred_new = sigmoid(z2_new)
loss_new = 0.5 * (y_pred_new - y_true) ** 2

print(f"Before update:")
print(f"  Prediction: {y_pred[0]:.6f}")
print(f"  Loss: {loss[0]:.6f}")
print()
print(f"After update:")
print(f"  Prediction: {y_pred_new[0]:.6f}")
print(f"  Loss: {loss_new[0]:.6f}")
print()
print(f"Improvement: {loss[0] - loss_new[0]:.6f} (loss decreased by {((loss[0] - loss_new[0]) / loss[0] * 100):.2f}%)")
print()
if loss_new[0] < loss[0]:
    print("🎉 Success! The loss went down, meaning we're learning!")
else:
    print("⚠️ The loss did not go down. Try a smaller learning rate.")

### 🎯 Quick Summary: What Just Happened?

1. **Forward Pass**: Computed prediction from inputs
2. **Loss**: Measured how wrong we were
3. **Backward Pass**: Used chain rule to compute gradients
   - Started from loss
   - Worked backward through each layer
   - Calculated how much each weight contributed to error
4. **Weight Update**: Adjusted weights to reduce error
5. **Result**: Loss decreased! 🎉
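
A common way to double-check hand-derived gradients like these is a finite-difference gradient check: nudge one weight at a time and measure how the loss changes. The sketch below repeats the example's values so it runs on its own; `sig`, `loss_for`, and `eps` are names chosen for this illustration.

```python
import numpy as np

def sig(z):
    """Sigmoid activation (local copy so the sketch runs on its own)."""
    return 1 / (1 + np.exp(-z))

# Same values as the worked example
X = np.array([0.5, 0.8])
y_true = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
b1 = np.array([0.1, 0.2])
W2 = np.array([[0.5], [0.6]])
b2 = np.array([0.1])

def loss_for(W1_candidate):
    """Forward pass and half-squared-error loss for a given W1."""
    h = sig(X @ W1_candidate + b1)
    y_pred = sig(h @ W2 + b2)
    return 0.5 * float((y_pred[0] - y_true) ** 2)

# Analytic gradient for W1 via backpropagation
h = sig(X @ W1 + b1)
y_pred = sig(h @ W2 + b2)
dL_dz2 = (y_pred - y_true) * y_pred * (1 - y_pred)
dL_dz1 = (dL_dz2 @ W2.T) * h * (1 - h)
dL_dW1 = np.outer(X, dL_dz1)

# Numerical gradient for W1: nudge one entry at a time
eps = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_dW1[i, j] = (loss_for(W_plus) - loss_for(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(num_dW1 - dL_dW1))
print("Backprop gradient:\n", dL_dW1)
print("Numerical gradient:\n", num_dW1)
print(f"Largest difference: {max_diff:.2e}")
```

The numerical version needs two forward passes per weight, while backpropagation gets every gradient from a single backward pass. That is why the check is a debugging tool rather than a training method.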

---

## 🛠️ Part 4: Complete Implementation from Scratch

Now let's put it all together in a clean, reusable implementation!

In [None]:
class TwoLayerNetwork:
    """A simple 2-layer neural network with backpropagation
    
    This class implements:
    - Forward propagation
    - Loss calculation
    - Backpropagation
    - Weight updates
    """
    
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        """Initialize the network with random weights
        
        Args:
            input_size: Number of input features
            hidden_size: Number of neurons in hidden layer
            output_size: Number of output neurons
            learning_rate: Step size for gradient descent
        """
        # Initialize weights with small random values
        # (We multiply by 0.5 to keep values small)
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)  # Biases start at zero
        
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)
        
        self.learning_rate = learning_rate
        
        # Storage for intermediate values (needed for backprop)
        self.cache = {}
    
    def forward(self, X):
        """Forward pass: compute predictions
        
        Args:
            X: Input data (batch_size, input_size)
        
        Returns:
            y_pred: Predictions (batch_size, output_size)
        """
        # Hidden layer
        z1 = np.dot(X, self.W1) + self.b1  # Weighted sum
        h = sigmoid(z1)  # Activation
        
        # Output layer
        z2 = np.dot(h, self.W2) + self.b2  # Weighted sum
        y_pred = sigmoid(z2)  # Activation
        
        # Save intermediate values for backpropagation
        self.cache = {
            'X': X,
            'z1': z1,
            'h': h,
            'z2': z2,
            'y_pred': y_pred
        }
        
        return y_pred
    
    def compute_loss(self, y_pred, y_true):
        """Compute Mean Squared Error loss
        
        Args:
            y_pred: Predictions from network
            y_true: True labels
        
        Returns:
            loss: Average loss across all samples
        """
        # MSE = mean of (prediction - true)^2
        return np.mean((y_pred - y_true) ** 2)
    
    def backward(self, y_true):
        """Backward pass: compute gradients using backpropagation
        
        Args:
            y_true: True labels
        
        Returns:
            gradients: Dictionary of gradients for all parameters
        """
        # Get cached values from forward pass
        X = self.cache['X']
        z1 = self.cache['z1']
        h = self.cache['h']
        z2 = self.cache['z2']
        y_pred = self.cache['y_pred']
        
        batch_size = X.shape[0]  # Number of samples
        
        # --- Backward pass through output layer ---
        
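        # Gradient of loss w.r.t. predictions
        # np.mean in compute_loss averages over every element, so divide by y_pred.size
        # (this equals batch_size when output_size is 1)
        dL_dy = 2 * (y_pred - y_true) / y_pred.size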
        
        # Gradient of loss w.r.t. z2 (before activation)
        # Chain rule: dL/dz2 = dL/dy × dy/dz2
        dy_dz2 = sigmoid_derivative(z2)
        dL_dz2 = dL_dy * dy_dz2
        
        # Gradients for W2 and b2
        dL_dW2 = np.dot(h.T, dL_dz2)  # (hidden_size, output_size)
        dL_db2 = np.sum(dL_dz2, axis=0)  # Sum across batch
        
        # --- Backward pass through hidden layer ---
        
        # Gradient of loss w.r.t. hidden activations
        # Chain rule: dL/dh = dL/dz2 × dz2/dh = dL/dz2 × W2ᵀ
        dL_dh = np.dot(dL_dz2, self.W2.T)
        
        # Gradient of loss w.r.t. z1 (before activation)
        # Chain rule: dL/dz1 = dL/dh × dh/dz1
        dh_dz1 = sigmoid_derivative(z1)
        dL_dz1 = dL_dh * dh_dz1
        
        # Gradients for W1 and b1
        dL_dW1 = np.dot(X.T, dL_dz1)  # (input_size, hidden_size)
        dL_db1 = np.sum(dL_dz1, axis=0)  # Sum across batch
        
        # Return all gradients
        return {
            'dW1': dL_dW1,
            'db1': dL_db1,
            'dW2': dL_dW2,
            'db2': dL_db2
        }
    
    def update_weights(self, gradients):
        """Update weights using gradient descent
        
        Args:
            gradients: Dictionary of gradients from backward pass
        """
        # Update each parameter: param = param - learning_rate × gradient
        self.W1 -= self.learning_rate * gradients['dW1']
        self.b1 -= self.learning_rate * gradients['db1']
        self.W2 -= self.learning_rate * gradients['dW2']
        self.b2 -= self.learning_rate * gradients['db2']
    
    def train_step(self, X, y_true):
        """Complete training step: forward, backward, update
        
        Args:
            X: Input data
            y_true: True labels
        
        Returns:
            loss: Current loss value
        """
        # Forward pass
        y_pred = self.forward(X)
        
        # Compute loss
        loss = self.compute_loss(y_pred, y_true)
        
        # Backward pass
        gradients = self.backward(y_true)
        
        # Update weights
        self.update_weights(gradients)
        
        return loss

print("✅ TwoLayerNetwork class implemented!")
print("This class can:")
print("  • Forward propagate (make predictions)")
print("  • Compute loss")
print("  • Backpropagate (compute gradients)")
print("  • Update weights")
print("\nLet's test it on the XOR problem!")

### 🧪 Testing on XOR Problem

XOR (exclusive OR) is a classic problem that can't be solved by a single neuron (it's not linearly separable). But our 2-layer network can learn it!

**XOR Truth Table:**
```
Input 1 | Input 2 | Output
--------|---------|--------
   0    |    0    |   0
   0    |    1    |   1
   1    |    0    |   1
   1    |    1    |   0
```
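
To see why one hidden layer is enough, here is a hand-built sketch (not a trained network) that computes XOR with two hidden units using a simple step activation. One hidden unit acts as OR, the other as AND. All weights and biases are picked by hand for illustration.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden unit 1 fires for OR(x1, x2); hidden unit 2 fires for AND(x1, x2)
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])

# Output fires when OR is on and AND is off, which is exactly XOR
W_out = np.array([1, -1])
b_out = -0.5

hidden = step(inputs @ W_hidden + b_hidden)
xor_out = step(hidden @ W_out + b_out)

for (x1, x2), (h_or, h_and), out in zip(inputs, hidden, xor_out):
    print(f"{x1} XOR {x2}: OR={h_or}, AND={h_and} -> output {out}")
```

Each hidden unit draws one straight line through the input space, and the output unit combines the two. Backpropagation has to discover a comparable set of weights on its own, starting from random values.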

In [None]:
# XOR dataset
X_xor = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

y_xor = np.array([
    [0],
    [1],
    [1],
    [0]
])

print("XOR Problem:")
print("Inputs:\n", X_xor)
print("\nTargets:\n", y_xor.flatten())
print()

# Create and train network
np.random.seed(42)  # For reproducibility
network = TwoLayerNetwork(
    input_size=2,
    hidden_size=4,  # 4 hidden neurons
    output_size=1,
    learning_rate=0.5
)

# Train for many iterations
num_iterations = 5000
losses = []

print("Training network on XOR...")
print()

for i in range(num_iterations):
    # Train on entire dataset
    loss = network.train_step(X_xor, y_xor)
    losses.append(loss)
    
    # Print progress every 1000 iterations
    if (i + 1) % 1000 == 0:
        print(f"Iteration {i+1:4d} | Loss: {loss:.6f}")

print("\nTraining complete!")
print()

# Test the network
predictions = network.forward(X_xor)

print("Final Results:")
print("="*50)
print("Input 1 | Input 2 | Target | Prediction | Rounded")
print("-"*50)
for i in range(len(X_xor)):
    x1, x2 = X_xor[i]
    target = y_xor[i, 0]
    pred = predictions[i, 0]
    rounded = int(pred > 0.5)  # Same 0.5 threshold as the accuracy below
    print(f"   {x1}    |    {x2}    |   {target}    |   {pred:.4f}   |    {rounded}")

print()
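accuracy = np.mean((predictions > 0.5) == y_xor) * 100
print(f"Accuracy: {accuracy:.1f}%")
print()
if accuracy == 100:
    print("🎉 The network learned XOR through backpropagation!")
else:
    print("⚠️ Not every point is classified correctly yet. Try more iterations or another seed.")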

In [None]:
# Visualize the training process
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss over time
ax1.plot(losses, 'b-', linewidth=2)
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Loss (MSE)', fontsize=12)
ax1.set_title('Loss Decreases Over Time (Learning!)', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')  # Log scale to see details

# Plot 2: Decision boundary
# Create a grid of points
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                     np.linspace(-0.5, 1.5, 100))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Predict for all grid points
Z = network.forward(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
contour = ax2.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.7)
ax2.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Plot the XOR points
scatter = ax2.scatter(X_xor[:, 0], X_xor[:, 1], c=y_xor.flatten(),
                     cmap='RdYlBu', s=200, edgecolors='black', linewidths=3)

ax2.set_xlabel('Input 1', fontsize=12)
ax2.set_ylabel('Input 2', fontsize=12)
ax2.set_title('Decision Boundary: Network Learned XOR!', fontsize=14, fontweight='bold')
plt.colorbar(contour, ax=ax2, label='Prediction')

# Add labels for each point
for i, (x, y, label) in enumerate(zip(X_xor[:, 0], X_xor[:, 1], y_xor.flatten())):
    ax2.text(x+0.05, y+0.05, f'({int(x)},{int(y)})→{int(label)}',
             fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Understanding the visualizations:")
print("Left: Loss curve shows learning progress - it goes down!")
print("Right: Decision boundary separates the two classes")
print("  • Red regions = network output close to 0")
print("  • Blue regions = network output close to 1")
print("  • The network learned the non-linear XOR pattern!")

### 💡 Key Insight: Why Backpropagation Works

Backpropagation works because:

1. **Chain Rule**: Allows us to decompose complex derivatives
2. **Systematic**: Works backward layer by layer
3. **Efficient**: Intermediate gradients are computed once and reused by every earlier layer
4. **General**: Works for any architecture built from differentiable operations

Without backpropagation, training deep neural networks would be impractically slow!

---

## ⚠️ Part 5: Common Issues with Backpropagation

### 1. Vanishing Gradients 📉

**Problem**: Gradients become extremely small as they propagate backward

**Why it happens**:
- Sigmoid derivative is at most 0.25
- Multiply many small numbers → very small gradient
- Deep networks: gradient = product of many derivatives

**Example**: In a 5-layer network with sigmoid:
- Each layer's activation multiplies the gradient by sigmoid'(z) ≤ 0.25
- Ignoring the weights, those factors alone shrink the gradient to at most 0.25⁵ ≈ 0.00098 (very small!)
- Early layers barely update → slow learning

In [None]:
# Demonstrate vanishing gradients with sigmoid

# Simulate gradients flowing through multiple sigmoid layers
num_layers = [1, 2, 3, 5, 10]  # Different network depths
initial_gradient = 1.0  # Start with gradient of 1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# For each network depth
for n in num_layers:
    gradients = [initial_gradient]
    
    # Simulate gradient flowing back through n layers
    current_grad = initial_gradient
    for layer in range(n):
        # Sigmoid derivative at z=0 is 0.25 (maximum value)
        # In practice, it's often smaller
        current_grad *= 0.25  # Multiply by sigmoid derivative
        gradients.append(current_grad)
    
    # Plot gradient at each layer
    layers = list(range(len(gradients)))
    ax1.plot(layers, gradients, 'o-', linewidth=2, markersize=8, label=f'{n} layers')

ax1.set_xlabel('Layer (from output to input)', fontsize=12)
ax1.set_ylabel('Gradient Magnitude', fontsize=12)
ax1.set_title('Vanishing Gradients with Sigmoid', fontsize=14, fontweight='bold')
ax1.set_yscale('log')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Compare sigmoid vs ReLU derivatives
x = np.linspace(-5, 5, 100)
sigmoid_deriv = sigmoid_derivative(x)
relu_deriv = (x > 0).astype(float)  # ReLU derivative: 0 if x<0, 1 if x>0

ax2.plot(x, sigmoid_deriv, 'b-', linewidth=2, label='Sigmoid derivative')
ax2.plot(x, relu_deriv, 'r-', linewidth=2, label='ReLU derivative')
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('Derivative', fontsize=12)
ax2.set_title('Why ReLU Helps: Derivative is 0 or 1', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.show()

print("⚠️ Vanishing Gradient Problem:")
print("Left: Gradients shrink exponentially in deep networks with sigmoid")
print("Right: ReLU has constant derivative of 1 (when x > 0)")
print("\nSolution: Use ReLU instead of sigmoid in hidden layers!")

### 2. Exploding Gradients 💥

**Problem**: Gradients become extremely large

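**Why it happens**:
- Large weight values
- Multiply many large numbers → explosion
- Weight updates overshoot and the network becomes unstable (the loss can jump around or turn into NaN)

**Solutions**:
- **Gradient clipping**: Cap gradients at a maximum value
- **Better weight initialization**: Start with small weights scaled to the layer size (e.g. Xavier or He initialization)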
- **Batch normalization**: Normalize activations
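
Gradient clipping is easy to sketch. The example below shows two common flavors: capping each entry, and rescaling everything so the overall norm stays below a limit. It assumes gradients are stored in a dictionary of NumPy arrays (like the `'dW1'` and `'dW2'` entries used in this notebook). The function names `clip_by_value` and `clip_by_norm` and the limits are illustrative choices, not something the notebook's network already does.

```python
import numpy as np

def clip_by_value(grads, max_value=1.0):
    """Cap every gradient entry to the range [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

def clip_by_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm <= max_norm:
        return grads                      # already small enough, leave untouched
    scale = max_norm / total_norm
    return {name: g * scale for name, g in grads.items()}

# Pretend these came out of a backward pass that "exploded"
exploding = {'dW1': np.full((2, 3), 50.0), 'dW2': np.full((3, 1), -80.0)}

value_clipped = clip_by_value(exploding, max_value=1.0)
norm_clipped = clip_by_norm(exploding, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in norm_clipped.values()))

print("Largest entry after value clipping:", max(np.max(np.abs(g)) for g in value_clipped.values()))
print("Total norm after norm clipping:", norm_after)
```

Clipping by norm keeps the *direction* of the update and only shortens the step, which is why it is often preferred over clipping each entry separately.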

### 3. Dead ReLU Problem 💀

**Problem**: ReLU neurons can "die" and stop learning

**Why it happens**:
- If a neuron's weighted input (its pre-activation z) is negative for every training example, ReLU always outputs 0
- The gradient through that neuron is then always 0 → no weight updates → neuron is "dead"

**Solution**: Use Leaky ReLU or other variants
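
Here is a minimal sketch of Leaky ReLU. The slope `alpha=0.01` for negative inputs is a common default but still an assumption you can tune, and the function names are chosen just for this example.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of becoming 0."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """Derivative is 1 for positive inputs and alpha otherwise (never exactly 0)."""
    return np.where(x > 0, 1.0, alpha)

z_demo = np.array([-3.0, -0.5, 0.0, 2.0])
print("Leaky ReLU:           ", leaky_relu(z_demo))
print("Leaky ReLU derivative:", leaky_relu_derivative(z_demo))
```

Because the derivative never reaches exactly 0, a neuron stuck on the negative side still receives a small gradient and has a chance to recover.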

---

## 🎮 Interactive Experiments

### Experiment 1: Effect of Learning Rate

In [None]:
# Train networks with different learning rates
learning_rates = [0.01, 0.1, 0.5, 2.0]
num_iterations = 3000

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, lr in enumerate(learning_rates):
    # Create network with specific learning rate
    np.random.seed(42)  # Same initialization for fair comparison
    network = TwoLayerNetwork(
        input_size=2,
        hidden_size=4,
        output_size=1,
        learning_rate=lr
    )
    
    # Train
    losses = []
    for i in range(num_iterations):
        loss = network.train_step(X_xor, y_xor)
        losses.append(loss)
    
    # Plot
    ax = axes[idx]
    ax.plot(losses, linewidth=2)
    ax.set_xlabel('Iteration', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(f'Learning Rate = {lr}', fontsize=13, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0, 0.3)
    
    # Final loss
    final_loss = losses[-1]
    ax.text(0.6, 0.9, f'Final: {final_loss:.4f}',
            transform=ax.transAxes, fontsize=11,
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("🎯 Try it yourself!")
print("Change the learning_rates list above and re-run to see different behaviors.")
print("\nWhat to look for:")
print("• Too small (0.01): Slow, steady decrease")
print("• Just right (0.1-0.5): Fast convergence")
print("• Too large (2.0): Unstable, might diverge")

### Experiment 2: Visualizing Gradient Flow

In [None]:
# Visualize how gradients flow through the network

# Create and train a small network
np.random.seed(42)
network = TwoLayerNetwork(input_size=2, hidden_size=3, output_size=1, learning_rate=0.5)

# Do one forward pass
y_pred = network.forward(X_xor)

# Do one backward pass
gradients = network.backward(y_xor)

# Visualize the network and gradient magnitudes
fig, ax = plt.subplots(figsize=(12, 8))

# Layer positions
layer_x = [0, 1, 2]  # Input, Hidden, Output
layer_sizes = [2, 3, 1]  # Neurons in each layer

# Draw neurons
neuron_positions = {}
for layer_idx, (x, size) in enumerate(zip(layer_x, layer_sizes)):
    y_positions = np.linspace(0, 1, size + 2)[1:-1]  # Evenly spaced
    for neuron_idx, y in enumerate(y_positions):
        circle = plt.Circle((x, y), 0.08, color='lightblue', ec='black', linewidth=2)
        ax.add_patch(circle)
        neuron_positions[(layer_idx, neuron_idx)] = (x, y)
        
        # Label neurons
        if layer_idx == 0:
            ax.text(x-0.15, y, f'x{neuron_idx+1}', fontsize=10, ha='right')
        elif layer_idx == 1:
            ax.text(x, y-0.15, f'h{neuron_idx+1}', fontsize=10, ha='center')
        else:
            ax.text(x+0.15, y, 'y', fontsize=10, ha='left')

# Draw connections with gradient-based colors
# Connections from input to hidden
max_grad_w1 = np.max(np.abs(gradients['dW1']))
for i in range(2):  # Input neurons
    for j in range(3):  # Hidden neurons
        x1, y1 = neuron_positions[(0, i)]
        x2, y2 = neuron_positions[(1, j)]
        
        # Color based on gradient magnitude
        grad_magnitude = np.abs(gradients['dW1'][i, j])
        color_intensity = grad_magnitude / max_grad_w1
        color = plt.cm.Reds(color_intensity)
        
        ax.plot([x1, x2], [y1, y2], color=color, linewidth=2, alpha=0.7)

# Connections from hidden to output
max_grad_w2 = np.max(np.abs(gradients['dW2']))
for i in range(3):  # Hidden neurons
    for j in range(1):  # Output neurons
        x1, y1 = neuron_positions[(1, i)]
        x2, y2 = neuron_positions[(2, j)]
        
        grad_magnitude = np.abs(gradients['dW2'][i, j])
        color_intensity = grad_magnitude / max_grad_w2
        color = plt.cm.Reds(color_intensity)
        
        ax.plot([x1, x2], [y1, y2], color=color, linewidth=2, alpha=0.7)

# Labels
ax.text(0, -0.2, 'Input\nLayer', fontsize=12, ha='center', fontweight='bold')
ax.text(1, -0.2, 'Hidden\nLayer', fontsize=12, ha='center', fontweight='bold')
ax.text(2, -0.2, 'Output\nLayer', fontsize=12, ha='center', fontweight='bold')

ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-0.3, 1.2)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Gradient Flow Visualization\n(Darker red = larger gradient)', 
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Gradient values (signed; the plot uses their magnitudes):")
print("\nInput → Hidden (W1):")
print(gradients['dW1'])
print("\nHidden → Output (W2):")
print(gradients['dW2'])
print("\n💡 Darker connections have larger gradients (will update more)")

---

## 🎯 Final Summary

Congratulations! You've learned the most important algorithm in deep learning. Let's recap:

### What is Backpropagation?

**Backpropagation** is an algorithm for computing gradients efficiently using the chain rule:

1. **Forward Pass**: Compute predictions layer by layer
2. **Compute Loss**: Measure how wrong we are
3. **Backward Pass**: Work backward computing gradients
4. **Update Weights**: Adjust weights to reduce loss
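
The four steps above can be sketched in a few lines for the smallest possible "network": a single sigmoid neuron. This is a toy recap under stated assumptions (mean squared error loss, the logical OR problem as data, a learning rate of 1.0), and `train_single_neuron` is a hypothetical helper written only for this summary.

```python
import numpy as np

def train_single_neuron(X, y, lr=1.0, steps=500, seed=0):
    """Train one sigmoid neuron with the four-step recipe; returns weights and loss history."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=(X.shape[1], 1))
    b = np.zeros((1, 1))
    history = []
    for _ in range(steps):
        # 1. Forward pass: weighted sum, then sigmoid
        z = X @ w + b
        a = 1.0 / (1.0 + np.exp(-z))
        # 2. Compute loss: mean squared error
        loss = np.mean((a - y) ** 2)
        history.append(loss)
        # 3. Backward pass: chain rule, from the loss back to the weights
        d_a = 2.0 * (a - y) / y.size
        d_z = d_a * a * (1.0 - a)
        d_w = X.T @ d_z
        d_b = d_z.sum(axis=0, keepdims=True)
        # 4. Update weights: step against the gradient
        w -= lr * d_w
        b -= lr * d_b
    return w, b, history

# Toy data: logical OR (linearly separable, so one neuron is enough)
X_or = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_or = np.array([[0.], [1.], [1.], [1.]])

w_or, b_or, recap_history = train_single_neuron(X_or, y_or)
print(f"Loss went from {recap_history[0]:.4f} to {recap_history[-1]:.4f}")
```

Strictly speaking, steps 1 to 3 are backpropagation, and step 4 is gradient descent using the gradients that backpropagation produced.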

### Key Concepts

✅ **Gradient Descent**: Follow the slope downhill to minimize loss

✅ **Learning Rate**: Controls step size (too small = slow, too large = unstable)

✅ **Chain Rule**: Multiply derivatives to propagate gradients backward

✅ **Vanishing Gradients**: Gradients shrink as they pass back through many sigmoid layers (common fix: use ReLU in hidden layers)

✅ **Systematic Process**: Same algorithm works for any network architecture

### Why It's Important

- **Enables Learning**: Without backprop, training multi-layer networks would be impractically slow
- **Efficient**: Computes all gradients in one backward pass
- **General**: Works for any differentiable function
- **Foundation**: Powers nearly all modern deep learning

### What's Next?

We've learned how a single weight update works, and we repeated it in a simple loop on the XOR problem. But real training needs more structure than that!

In the next notebook, we'll learn about the **training loop** - the process of:
- Repeating forward/backward passes many times
- Processing data in batches
- Tracking progress over epochs
- Choosing hyperparameters
- Building complete, trainable networks

---

## 🎓 Review Questions

Test your understanding:

1. **What does the gradient tell us?**
   - The direction and magnitude of steepest increase in loss
   - We move in the opposite direction to decrease loss

2. **Why do we need the chain rule?**
   - To compute how changes in early layers affect the final loss
   - To propagate gradients backward through multiple layers

3. **What's the vanishing gradient problem?**
   - Gradients become very small in deep networks with sigmoid
   - Early layers barely learn
   - Solution: Use ReLU activation

4. **Why is learning rate important?**
   - Too small: slow learning
   - Too large: unstable, might diverge
   - Need to choose carefully

5. **Can you explain backpropagation in one sentence?**
   - Use the chain rule to compute gradients by working backward through the network, then update weights to reduce loss.

---

## 💪 Challenge Exercises

Ready to test your skills?

In [None]:
# Challenge 1: Implement ReLU and its derivative
def relu(x):
    """TODO: Implement ReLU activation
    ReLU(x) = max(0, x)
    """
    pass  # Your code here!

def relu_derivative(x):
    """TODO: Implement ReLU derivative
    ReLU'(x) = 1 if x > 0, else 0
    """
    pass  # Your code here!

# Test your implementation
# test_x = np.array([-2, -1, 0, 1, 2])
# print("ReLU:", relu(test_x))
# print("ReLU':", relu_derivative(test_x))

In [None]:
# Challenge 2: Modify TwoLayerNetwork to use ReLU instead of sigmoid
# Hint: You'll need to:
# 1. Change activation function in forward pass
# 2. Change derivative in backward pass
# 3. Keep sigmoid for output layer (for binary classification)

# Your code here!

In [None]:
# Challenge 3: Add momentum to gradient descent
# Momentum helps accelerate learning in the right direction
# Update rule: velocity = momentum * velocity + learning_rate * gradient
#              weight = weight - velocity

# Your code here!

---

## 🎉 Congratulations!

You've conquered backpropagation - the hardest concept in neural networks!

You now understand:
- How neural networks learn from data
- The mathematics behind gradient descent
- How to implement backpropagation from scratch
- Common problems and their solutions

**This is a HUGE achievement!** Many people struggle with this topic, but you've made it through. 🌟

### Next Steps

In **Notebook 8: Training Loop**, we'll put everything together and learn how to:
- Train networks on real datasets
- Choose hyperparameters effectively
- Monitor training progress
- Avoid overfitting
- Build production-ready models

You're almost there! One more notebook and you'll have built a complete neural network from scratch! 🚀