# Part 3.2: Backpropagation

Backpropagation is the algorithm that makes deep learning possible. It efficiently computes how to adjust every weight in a neural network to reduce error. Without it, training networks with millions of parameters would be computationally infeasible.

## Learning Objectives
- [ ] Understand the credit assignment problem and why backpropagation solves it
- [ ] Draw and interpret computational graphs for neural networks
- [ ] Apply the chain rule to compute gradients through composed functions
- [ ] Manually compute gradients for a 2-layer neural network
- [ ] Implement backpropagation from scratch using NumPy
- [ ] Train a network to solve XOR (which single-layer perceptrons cannot!)
- [ ] Recognize vanishing and exploding gradient problems

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. The Credit Assignment Problem

### The Core Challenge

Imagine a neural network makes a prediction and gets it wrong. The error is clear at the output, but **how do we know which weights deep inside the network are responsible?**

This is the **credit assignment problem**: distributing "blame" (or credit) for the final error back to all the weights that contributed to it.

### Why This Is Hard

Consider a simple 2-layer network:

```
Input -> [Hidden Layer] -> [Output Layer] -> Prediction -> Error
           w1, w2              w3, w4
```

The error depends on w3 and w4 directly, but it depends on w1 and w2 **indirectly** through the hidden layer. How much should we adjust w1 versus w3?

### The Key Insight: Chain Rule

The breakthrough insight is that we can use the **chain rule from calculus** to decompose the problem:

$$\frac{\partial \text{Error}}{\partial w_1} = \frac{\partial \text{Error}}{\partial \text{hidden}} \times \frac{\partial \text{hidden}}{\partial w_1}$$

**What this means:** To find how w1 affects the error, we trace the path from w1 to the error, multiplying the local gradients along the way.
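
The path-tracing idea can be sketched on a one-path network. The setup below is an illustrative assumption, not the network used later in this notebook: a single input, a linear hidden unit `hidden = w1 * x`, a linear output `pred = w3 * hidden`, and a squared error. It multiplies the local gradients along the path from `w1` to the error and compares the result with a finite-difference estimate.

```python
# Hypothetical one-path network: x -> hidden -> pred -> error
def forward(w1, w3, x, target):
    hidden = w1 * x                      # hidden unit (linear, for simplicity)
    pred = w3 * hidden                   # output
    error = 0.5 * (pred - target) ** 2   # squared error
    return hidden, pred, error

x, target = 2.0, 1.0
w1, w3 = 0.5, -1.5
hidden, pred, error = forward(w1, w3, x, target)

# Local gradients along the path w1 -> hidden -> pred -> error
d_error_d_pred = pred - target
d_pred_d_hidden = w3
d_hidden_d_w1 = x

# Chain rule: multiply the local gradients
d_error_d_hidden = d_error_d_pred * d_pred_d_hidden
d_error_d_w1 = d_error_d_hidden * d_hidden_d_w1

# Central-difference check of the same derivative
eps = 1e-6
numeric = (forward(w1 + eps, w3, x, target)[2]
           - forward(w1 - eps, w3, x, target)[2]) / (2 * eps)

print(f"chain rule: {d_error_d_w1:.6f}")   # 7.500000
print(f"numerical:  {numeric:.6f}")
```

The intermediate quantity `d_error_d_hidden` is exactly the first factor in the formula above; every weight feeding the same hidden unit would reuse it.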

In [None]:
# Visualize the credit assignment problem
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Simple network diagram
ax = axes[0]
ax.set_xlim(-0.5, 4.5)
ax.set_ylim(-0.5, 3.5)

# Draw nodes
node_positions = {
    'x': (0, 1.5),
    'h1': (1.5, 2.5),
    'h2': (1.5, 0.5),
    'y': (3, 1.5),
    'L': (4, 1.5)
}

for name, pos in node_positions.items():
    # Use facecolor (not color): color would override edgecolor as well
    circle = plt.Circle(pos, 0.3, fill=True,
                       facecolor='lightblue' if name not in ['L'] else 'lightcoral',
                       edgecolor='black', linewidth=2)
    ax.add_patch(circle)
    ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=12, fontweight='bold')

# Draw edges with weight labels
edges = [
    ('x', 'h1', 'w1'),
    ('x', 'h2', 'w2'),
    ('h1', 'y', 'w3'),
    ('h2', 'y', 'w4'),
    ('y', 'L', '')
]

for start, end, label in edges:
    start_pos = node_positions[start]
    end_pos = node_positions[end]
    ax.annotate('', xy=(end_pos[0]-0.3, end_pos[1]), xytext=(start_pos[0]+0.3, start_pos[1]),
                arrowprops=dict(arrowstyle='->', color='black', lw=1.5))
    if label:
        mid = ((start_pos[0] + end_pos[0])/2, (start_pos[1] + end_pos[1])/2 + 0.2)
        ax.text(mid[0], mid[1], label, fontsize=10, color='darkblue')

ax.set_title('The Credit Assignment Problem\nHow much does each weight contribute to the loss?', fontsize=12)
ax.set_aspect('equal')
ax.axis('off')

# Right: Gradient flow (backward)
ax = axes[1]
ax.set_xlim(-0.5, 4.5)
ax.set_ylim(-0.5, 3.5)

for name, pos in node_positions.items():
    # Use facecolor (not color): color would override edgecolor as well
    circle = plt.Circle(pos, 0.3, fill=True,
                       facecolor='lightyellow' if name not in ['L'] else 'lightcoral',
                       edgecolor='black', linewidth=2)
    ax.add_patch(circle)
    ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=12, fontweight='bold')

# Draw backward edges (gradients flow backward)
backward_edges = [
    ('L', 'y', r'$\frac{\partial L}{\partial y}$'),
    ('y', 'h1', r'$\frac{\partial L}{\partial h_1}$'),
    ('y', 'h2', r'$\frac{\partial L}{\partial h_2}$'),
]

for start, end, label in backward_edges:
    start_pos = node_positions[start]
    end_pos = node_positions[end]
    ax.annotate('', xy=(end_pos[0]+0.3, end_pos[1]), xytext=(start_pos[0]-0.3, start_pos[1]),
                arrowprops=dict(arrowstyle='->', color='red', lw=2))
    mid = ((start_pos[0] + end_pos[0])/2 - 0.3, (start_pos[1] + end_pos[1])/2 + 0.3)
    ax.text(mid[0], mid[1], label, fontsize=10, color='darkred')

ax.set_title('Backpropagation Solution\nGradients flow backward through the network', fontsize=12)
ax.set_aspect('equal')
ax.axis('off')

plt.tight_layout()
plt.show()

print("Key insight: Backpropagation computes gradients by flowing information BACKWARD")
print("from the loss to each weight, using the chain rule at each step.")

### Deep Dive: Why Not Just Compute Gradients Directly?

You might wonder: why not just compute $\frac{\partial L}{\partial w_i}$ directly for each weight?

#### The Computational Cost Problem

| Approach | Cost for n weights | For 1 million weights |
|----------|-------------------|----------------------|
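| Numerical gradient (perturb each weight) | O(n) forward passes | ~1,000,000 forward passes |
| Backpropagation | 1 forward pass + 1 backward pass | 2 passes total |

**For 1 million weights, backpropagation needs roughly 500,000x fewer passes.** A backward pass costs about as much as a forward pass (up to a small constant factor), so the saving in passes translates into a comparable saving in compute.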

#### Key Insight

Backpropagation reuses intermediate computations. When computing $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$, both share the same $\frac{\partial L}{\partial h}$ term. Backprop computes this once and reuses it.
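
To make the cost difference concrete, here is a minimal sketch that counts loss evaluations. The tiny network (a `tanh` hidden layer with a squared-error loss) and the helper names are assumptions chosen for brevity. The numerical gradient uses central differences, so it needs two loss evaluations per weight rather than one.

```python
import numpy as np

calls = {"forward": 0}  # counts how often the loss is evaluated

def loss_fn(W1, W2, x, y):
    calls["forward"] += 1
    h = np.tanh(W1 @ x)
    return 0.5 * ((W2 @ h).item() - y) ** 2

def numerical_grad_W1(W1, W2, x, y, eps=1e-6):
    """Perturb one weight at a time: 2 loss evaluations per weight."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        plus, minus = W1.copy(), W1.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (loss_fn(plus, W2, x, y) - loss_fn(minus, W2, x, y)) / (2 * eps)
    return grad

def backprop_grad_W1(W1, W2, x, y):
    """One forward pass plus one backward pass for ALL weights in W1."""
    h = np.tanh(W1 @ x)              # forward
    err = (W2 @ h).item() - y
    dL_dh = W2.T * err               # shared term, computed once
    dL_dz = dL_dh * (1 - h ** 2)     # through tanh
    return dL_dz @ x.T               # gradient for every entry of W1

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
W1 = rng.normal(size=(5, 4))         # 20 weights
W2 = rng.normal(size=(1, 5))
y = 1.0

num = numerical_grad_W1(W1, W2, x, y)
numerical_calls = calls["forward"]   # 2 evaluations x 20 weights = 40
bp = backprop_grad_W1(W1, W2, x, y)

print(f"loss evaluations for numerical gradient: {numerical_calls}")
print(f"gradients agree: {np.allclose(num, bp, atol=1e-5)}")
```

The line computing `dL_dh` is the reuse described above: it is evaluated once and then feeds the gradient of every weight in `W1`.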

#### Historical Note

Backpropagation was independently discovered multiple times:
- 1960s: Henry J. Kelley (control theory)
- 1970: Seppo Linnainmaa (automatic differentiation)
- 1986: Rumelhart, Hinton, Williams (popularized for neural networks)

The 1986 paper popularized backpropagation for training multi-layer networks and is often cited as a turning point on the road to modern deep learning.

---

## 2. Computational Graphs

A **computational graph** is a visual way to represent mathematical expressions as a series of operations. It's the foundation for understanding backpropagation.

### The Core Idea

Any complex function can be broken down into simple operations:
- **Nodes** = operations (add, multiply, sigmoid, etc.)
- **Edges** = values flowing between operations
- **Forward pass** = compute the output, saving intermediate values
- **Backward pass** = compute gradients using saved values

### Example: $f(x, y, z) = (x + y) \cdot z$

```
     x --+
         +--[+]--q--[*]--f
     y --+         |
                   |
     z ------------+
```

Where $q = x + y$ and $f = q \cdot z$

In [None]:
# Interactive computational graph example
def visualize_computational_graph():
    """Visualize forward and backward pass through a simple computational graph."""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Values for our example
    x, y, z = 2.0, 3.0, 4.0
    
    # Forward pass
    q = x + y  # = 5
    f = q * z  # = 20
    
    # Backward pass (assuming df/df = 1)
    df_df = 1.0
    df_dq = z  # = 4 (partial of q*z w.r.t. q)
    df_dz = q  # = 5 (partial of q*z w.r.t. z)
    df_dx = df_dq * 1  # = 4 (chain rule: df/dq * dq/dx)
    df_dy = df_dq * 1  # = 4 (chain rule: df/dq * dq/dy)
    
    # Left plot: Forward pass
    ax = axes[0]
    ax.set_xlim(-0.5, 4.5)
    ax.set_ylim(-0.5, 3.5)
    
    # Node positions
    positions = {
        'x': (0, 2.5), 'y': (0, 1.5), 'z': (0, 0.5),
        '+': (1.5, 2), 'q': (2.5, 2),
        '*': (3, 1.5), 'f': (4, 1.5)
    }
    
    # Draw nodes
    for name, pos in positions.items():
        if name in ['+', '*']:
            circle = plt.Circle(pos, 0.25, fill=True, facecolor='lightgreen', edgecolor='black', linewidth=2)
        else:
            circle = plt.Circle(pos, 0.25, fill=True, facecolor='lightblue', edgecolor='black', linewidth=2)
        ax.add_patch(circle)
        ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=12, fontweight='bold')
    
    # Draw forward edges with values
    forward_edges = [
        ('x', '+', f'x={x}'),
        ('y', '+', f'y={y}'),
        ('+', 'q', ''),
        ('q', '*', f'q={q}'),
        ('z', '*', f'z={z}'),
        ('*', 'f', f'f={f}')
    ]
    
    for start, end, label in forward_edges:
        sp, ep = positions[start], positions[end]
        ax.annotate('', xy=(ep[0]-0.25, ep[1]), xytext=(sp[0]+0.25, sp[1]),
                    arrowprops=dict(arrowstyle='->', color='blue', lw=1.5))
        if label:
            mid = ((sp[0] + ep[0])/2, (sp[1] + ep[1])/2 + 0.25)
            ax.text(mid[0], mid[1], label, fontsize=9, color='blue')
    
    ax.set_title('Forward Pass: Compute Values\nf(x,y,z) = (x+y)*z', fontsize=12)
    ax.set_aspect('equal')
    ax.axis('off')
    
    # Right plot: Backward pass
    ax = axes[1]
    ax.set_xlim(-0.5, 4.5)
    ax.set_ylim(-0.5, 3.5)
    
    # Draw nodes with gradients
    gradients = {'x': df_dx, 'y': df_dy, 'z': df_dz, 'q': df_dq, 'f': df_df}
    
    for name, pos in positions.items():
        if name in ['+', '*']:
            circle = plt.Circle(pos, 0.25, fill=True, facecolor='lightyellow', edgecolor='black', linewidth=2)
            ax.add_patch(circle)
            ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=12, fontweight='bold')
        else:
            circle = plt.Circle(pos, 0.25, fill=True, facecolor='lightcoral', edgecolor='black', linewidth=2)
            ax.add_patch(circle)
            grad = gradients.get(name, '')
            ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=12, fontweight='bold')
            ax.text(pos[0], pos[1]-0.45, f'grad={grad}', ha='center', va='center', fontsize=8, color='darkred')
    
    # Draw backward edges
    backward_edges = [
        ('f', '*'),
        ('*', 'q'),
        ('*', 'z'),
        ('q', '+'),
        ('+', 'x'),
        ('+', 'y'),
    ]
    
    for start, end in backward_edges:
        sp, ep = positions[start], positions[end]
        ax.annotate('', xy=(ep[0]+0.25, ep[1]), xytext=(sp[0]-0.25, sp[1]),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2))
    
    ax.set_title('Backward Pass: Compute Gradients\nGradients flow backward', fontsize=12)
    ax.set_aspect('equal')
    ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed explanation
    print("Forward Pass:")
    print(f"  q = x + y = {x} + {y} = {q}")
    print(f"  f = q * z = {q} * {z} = {f}")
    print("\nBackward Pass:")
    print(f"  df/df = {df_df} (by definition)")
    print(f"  df/dq = z = {df_dq} (local gradient of multiplication)")
    print(f"  df/dz = q = {df_dz} (local gradient of multiplication)")
    print(f"  df/dx = df/dq * dq/dx = {df_dq} * 1 = {df_dx} (chain rule!)")
    print(f"  df/dy = df/dq * dq/dy = {df_dq} * 1 = {df_dy} (chain rule!)")

visualize_computational_graph()

### Deep Dive: Local Gradients

Each operation has **local gradients** - how its output changes with respect to its inputs.

| Operation | Forward | Local Gradients |
|-----------|---------|----------------|
| Add: $c = a + b$ | $c = a + b$ | $\frac{\partial c}{\partial a} = 1$, $\frac{\partial c}{\partial b} = 1$ |
| Multiply: $c = a \cdot b$ | $c = a \cdot b$ | $\frac{\partial c}{\partial a} = b$, $\frac{\partial c}{\partial b} = a$ |
| Power: $c = a^n$ | $c = a^n$ | $\frac{\partial c}{\partial a} = n \cdot a^{n-1}$ |
| Exp: $c = e^a$ | $c = e^a$ | $\frac{\partial c}{\partial a} = e^a$ |
| Sigmoid: $c = \sigma(a)$ | $c = \frac{1}{1+e^{-a}}$ | $\frac{\partial c}{\partial a} = c(1-c)$ |
| ReLU: $c = \max(0, a)$ | $c = \max(0, a)$ | $\frac{\partial c}{\partial a} = \mathbb{1}_{a > 0}$ |

**Key Insight:** The backward pass multiplies the incoming gradient by the local gradient and passes it back.
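
The entries in the table can be spot-checked numerically. The sketch below compares each analytic local gradient with a central-difference estimate at one arbitrary test point (`a = 1.5`, `b = -2.0`, `n = 3` are illustrative choices); the helper `numeric_derivative` is not part of the notebook's later code.

```python
import math

def numeric_derivative(fn, a, eps=1e-6):
    """Central-difference estimate of d fn / d a."""
    return (fn(a + eps) - fn(a - eps)) / (2 * eps)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

a, b, n = 1.5, -2.0, 3

# name -> (function of a, analytic local gradient from the table)
checks = {
    "add":      (lambda t, b=b: t + b,   1.0),
    "multiply": (lambda t, b=b: t * b,   b),
    "power":    (lambda t, n=n: t ** n,  n * a ** (n - 1)),
    "exp":      (math.exp,               math.exp(a)),
    "sigmoid":  (sigmoid,                sigmoid(a) * (1 - sigmoid(a))),
    "relu":     (lambda t: max(0.0, t),  1.0 if a > 0 else 0.0),
}

results = {}
for name, (fn, analytic) in checks.items():
    numeric = numeric_derivative(fn, a)
    results[name] = math.isclose(numeric, analytic, rel_tol=1e-6, abs_tol=1e-8)
    print(f"{name:9s} analytic={analytic:+.6f} numeric={numeric:+.6f} ok={results[name]}")
```

A check like this is only meaningful away from kinks: at `a = 0` the ReLU has no well-defined derivative, and the finite difference would return 0.5.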

In [None]:
# Implement basic operations as "gates" with forward and backward

class AddGate:
    """Addition gate: c = a + b"""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a + b
    
    def backward(self, grad_output):
        # Local gradients are both 1
        return grad_output * 1, grad_output * 1

class MultiplyGate:
    """Multiplication gate: c = a * b"""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    
    def backward(self, grad_output):
        # Local gradients are the OTHER input
        return grad_output * self.b, grad_output * self.a

class SigmoidGate:
    """Sigmoid gate: c = 1 / (1 + exp(-a))"""
    def forward(self, a):
        self.output = 1 / (1 + np.exp(-a))
        return self.output
    
    def backward(self, grad_output):
        # Sigmoid derivative: sigma(x) * (1 - sigma(x))
        return grad_output * self.output * (1 - self.output)

# Example: compute f = sigmoid((x * y) + z)
x, y, z = 2.0, -3.0, 1.0

# Create gates
mult = MultiplyGate()
add = AddGate()
sig = SigmoidGate()

# Forward pass
mult_out = mult.forward(x, y)  # x * y = -6
add_out = add.forward(mult_out, z)  # -6 + 1 = -5
f = sig.forward(add_out)  # sigmoid(-5) = 0.0067

print("Forward pass:")
print(f"  x * y = {x} * {y} = {mult_out}")
print(f"  (x*y) + z = {mult_out} + {z} = {add_out}")
print(f"  sigmoid({add_out}) = {f:.6f}")

# Backward pass (starting with df/df = 1)
grad_f = 1.0
grad_add = sig.backward(grad_f)
grad_mult, grad_z = add.backward(grad_add)
grad_x, grad_y = mult.backward(grad_mult)

print("\nBackward pass:")
print(f"  df/df = {grad_f}")
print(f"  df/d(add_out) = {grad_add:.6f}")
print(f"  df/d(mult_out) = {grad_mult:.6f}")
print(f"  df/dz = {grad_z:.6f}")
print(f"  df/dx = {grad_x:.6f}")
print(f"  df/dy = {grad_y:.6f}")

# Verify with numerical gradient
def f_func(x, y, z):
    return 1 / (1 + np.exp(-(x*y + z)))

h = 1e-5
numerical_grad_x = (f_func(x+h, y, z) - f_func(x-h, y, z)) / (2*h)

print(f"\nVerification (numerical gradient for x): {numerical_grad_x:.6f}")
print(f"Match: {np.isclose(grad_x, numerical_grad_x)}")

---

## 3. Chain Rule Deep Dive

The chain rule is the mathematical foundation of backpropagation. Let's build deep intuition.

### Single Variable Chain Rule

If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

**Intuition:** If $x$ changes by a tiny amount $\epsilon$:
1. $g$ changes by $\frac{dg}{dx} \cdot \epsilon$
2. This change in $g$ causes $y$ to change by $\frac{dy}{dg} \cdot (\frac{dg}{dx} \cdot \epsilon)$
3. So the total change in $y$ per unit change in $x$ is $\frac{dy}{dg} \cdot \frac{dg}{dx}$

In [None]:
# Visualize the chain rule with a concrete example
# y = (3x + 1)^2
# Let g(x) = 3x + 1, f(g) = g^2

def visualize_chain_rule():
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    x = np.linspace(-2, 2, 200)
    
    # g(x) = 3x + 1
    g = 3*x + 1
    dg_dx = 3  # constant
    
    # f(g) = g^2  (but we plot f as function of x for comparison)
    f_of_g = g**2
    
    # dy/dg = 2g (derivative of g^2)
    df_dg = 2*g
    
    # Chain rule: dy/dx = dy/dg * dg/dx = 2g * 3 = 6g = 6(3x+1) = 18x + 6
    df_dx = 6*(3*x + 1)
    
    # Plot 1: The functions
    axes[0].plot(x, g, 'b-', linewidth=2, label='g(x) = 3x + 1')
    axes[0].plot(x, f_of_g, 'r-', linewidth=2, label='f(g(x)) = (3x+1)^2')
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('y')
    axes[0].set_title('The Composed Function')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim(-5, 20)
    
    # Plot 2: The individual derivatives
    axes[1].axhline(y=dg_dx, color='blue', linewidth=2, label='dg/dx = 3 (constant)')
    axes[1].plot(x, df_dg, 'g-', linewidth=2, label='df/dg = 2g = 2(3x+1)')
    axes[1].set_xlabel('x')
    axes[1].set_ylabel('Derivative value')
    axes[1].set_title('Individual Derivatives')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # Plot 3: The chain rule result
    axes[2].plot(x, df_dx, 'purple', linewidth=2, label='df/dx = (df/dg)(dg/dx)')
    # Also show product of individual derivatives at specific points
    x_points = np.array([-1, 0, 0.5, 1])
    for xp in x_points:
        g_val = 3*xp + 1
        product = (2*g_val) * 3
        axes[2].scatter([xp], [product], s=100, c='red', zorder=5)
        axes[2].annotate(f'{product:.0f}', xy=(xp, product), xytext=(xp+0.1, product+2))
    
    axes[2].set_xlabel('x')
    axes[2].set_ylabel('df/dx')
    axes[2].set_title('Chain Rule Result\ndf/dx = 2(3x+1) * 3 = 18x + 6')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("Chain Rule in Action:")
    print("  y = f(g(x)) = (3x + 1)^2")
    print("  g(x) = 3x + 1  ->  dg/dx = 3")
    print("  f(g) = g^2     ->  df/dg = 2g")
    print("  ----------------------------------------")
    print("  dy/dx = df/dg * dg/dx = 2g * 3 = 6(3x+1)")

visualize_chain_rule()

### Multivariable Chain Rule

When a variable affects the output through **multiple paths**, we **sum** the contributions:

If $L = f(a, b)$ where both $a$ and $b$ depend on $x$:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial x} + \frac{\partial L}{\partial b} \cdot \frac{\partial b}{\partial x}$$

**This is crucial for neural networks:** A hidden unit's output often feeds into multiple neurons in the next layer!

In [None]:
# Visualize multivariable chain rule
# L = a*b where a = x+1 and b = x*2

fig, ax = plt.subplots(figsize=(10, 6))

ax.set_xlim(-0.5, 4.5)
ax.set_ylim(-0.5, 3.5)

# Positions
positions = {
    'x': (0, 1.5),
    'a': (2, 2.5),
    'b': (2, 0.5),
    'L': (4, 1.5)
}

# Draw nodes
for name, pos in positions.items():
    color = 'lightblue' if name != 'L' else 'lightcoral'
    circle = plt.Circle(pos, 0.3, fill=True, facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(circle)
    ax.text(pos[0], pos[1], name, ha='center', va='center', fontsize=14, fontweight='bold')

# Forward edges
ax.annotate('', xy=(1.7, 2.3), xytext=(0.3, 1.7),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2))
ax.text(0.7, 2.2, 'a = x+1', fontsize=10, color='blue')

ax.annotate('', xy=(1.7, 0.7), xytext=(0.3, 1.3),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2))
ax.text(0.7, 0.7, 'b = 2x', fontsize=10, color='blue')

ax.annotate('', xy=(3.7, 1.7), xytext=(2.3, 2.3),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2))

ax.annotate('', xy=(3.7, 1.3), xytext=(2.3, 0.7),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2))

ax.text(3, 2.3, 'L = a*b', fontsize=10, color='blue')

# Backward edges (gradients) - shown as curved red arrows
ax.annotate('', xy=(0.3, 1.7), xytext=(1.7, 2.3),
            arrowprops=dict(arrowstyle='->', color='red', lw=2, connectionstyle='arc3,rad=0.3'))
ax.text(0.5, 2.6, r'$\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial x}$', 
        fontsize=10, color='red')

ax.annotate('', xy=(0.3, 1.3), xytext=(1.7, 0.7),
            arrowprops=dict(arrowstyle='->', color='red', lw=2, connectionstyle='arc3,rad=-0.3'))
ax.text(0.5, 0.2, r'$\frac{\partial L}{\partial b}\cdot\frac{\partial b}{\partial x}$', 
        fontsize=10, color='red')

ax.set_title('Multivariable Chain Rule: Sum over all paths\n' + 
             r'$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial L}{\partial b}\frac{\partial b}{\partial x}$',
             fontsize=12)
ax.set_aspect('equal')
ax.axis('off')
plt.show()

# Numerical example
x = 3.0
a = x + 1  # = 4
b = 2 * x  # = 6
L = a * b  # = 24

# Gradients
dL_da = b  # = 6
dL_db = a  # = 4
da_dx = 1
db_dx = 2

# Chain rule (sum over paths)
dL_dx = dL_da * da_dx + dL_db * db_dx

print(f"x = {x}")
print(f"a = x + 1 = {a}")
print(f"b = 2x = {b}")
print(f"L = a * b = {L}")
print(f"\nPath 1: dL/da * da/dx = {dL_da} * {da_dx} = {dL_da * da_dx}")
print(f"Path 2: dL/db * db/dx = {dL_db} * {db_dx} = {dL_db * db_dx}")
print(f"Total:  dL/dx = {dL_dx}")

# Verify
def L_func(x):
    return (x + 1) * (2 * x)

h = 1e-5
numerical = (L_func(x + h) - L_func(x - h)) / (2 * h)
print(f"\nNumerical verification: {numerical:.4f}")

### Deep Dive: Why the Chain Rule Works

The chain rule captures a fundamental truth about composition: **local sensitivities multiply along paths**.

Think of it like currency exchange:
- 1 USD = 0.85 EUR (sensitivity of EUR to USD)
- 1 EUR = 0.88 GBP (sensitivity of GBP to EUR)
- Therefore: 1 USD = 0.85 * 0.88 ≈ 0.75 GBP

The sensitivities multiply along the chain!

| Concept | In Currency | In Neural Networks |
|---------|-------------|-------------------|
| Single path | USD -> EUR -> GBP | input -> hidden -> output |
| Multiply sensitivities | 0.85 * 0.88 | dL/dh * dh/dx |
| Multiple paths | USD -> EUR -> GBP and USD -> JPY -> GBP | hidden unit feeds into multiple outputs |
| Sum contributions | Total GBP from both paths | Sum gradients from all paths |
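
The last two rows of the table can be sketched for a single hidden activation that feeds two output neurons. The numbers, the linear outputs, and the squared-error loss are assumptions made to keep the example short.

```python
# Hypothetical setup: one hidden activation h feeds two linear output neurons.
h = 0.8
v = [1.5, -0.5]          # weights from h to output 1 and output 2
targets = [1.0, 0.0]

def loss(h_value):
    outputs = [v_k * h_value for v_k in v]
    return sum(0.5 * (o - t) ** 2 for o, t in zip(outputs, targets))

outputs = [v_k * h for v_k in v]
deltas = [o - t for o, t in zip(outputs, targets)]      # dL/d(output_k)

# One contribution per path: dL/d(output_k) * d(output_k)/dh
path_contributions = [d * v_k for d, v_k in zip(deltas, v)]
dL_dh = sum(path_contributions)                         # sum over paths

# Central-difference check
eps = 1e-6
numeric = (loss(h + eps) - loss(h - eps)) / (2 * eps)

print(f"path contributions: {path_contributions}")
print(f"dL/dh (sum over paths): {dL_dh:.6f}")   # 0.500000
print(f"numerical check:        {numeric:.6f}")
```

Written with matrices, this sum over paths is what a product of the form $W^T \cdot \delta$ computes for all hidden units at once.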

---

## 4. Backprop Through a Neural Network

Now let's apply these concepts to an actual neural network. We'll work through a 2-layer network step by step.

### Network Architecture

```
Input (x) -> [W1, b1] -> ReLU -> h -> [W2, b2] -> sigmoid -> y_pred -> Loss
```

### Forward Pass Equations

1. **Linear 1:** $z_1 = W_1 \cdot x + b_1$
2. **Activation 1:** $h = \text{ReLU}(z_1) = \max(0, z_1)$
3. **Linear 2:** $z_2 = W_2 \cdot h + b_2$
4. **Activation 2:** $\hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-z_2}}$
5. **Loss:** $L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$ (binary cross-entropy)

In [None]:
# Visualize network architecture with forward pass values

def visualize_network_forward():
    """Draw network architecture showing forward pass."""
    
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(-1, 11)
    ax.set_ylim(-1, 7)
    
    # Layer positions
    layers = {
        'input': [(0, 4), (0, 2)],
        'hidden': [(3, 5), (3, 3), (3, 1)],
        'output': [(6, 3)],
        'loss': [(9, 3)]
    }
    
    labels = {
        'input': ['x1', 'x2'],
        'hidden': ['h1', 'h2', 'h3'],
        'output': ['y_pred'],
        'loss': ['L']
    }
    
    colors = {
        'input': 'lightblue',
        'hidden': 'lightgreen',
        'output': 'lightyellow',
        'loss': 'lightcoral'
    }
    
    # Draw nodes
    for layer, positions in layers.items():
        for i, pos in enumerate(positions):
            circle = plt.Circle(pos, 0.4, fill=True, facecolor=colors[layer],
                              edgecolor='black', linewidth=2)
            ax.add_patch(circle)
            ax.text(pos[0], pos[1], labels[layer][i], ha='center', va='center', 
                   fontsize=10, fontweight='bold')
    
    # Draw connections
    for inp in layers['input']:
        for hid in layers['hidden']:
            ax.plot([inp[0]+0.4, hid[0]-0.4], [inp[1], hid[1]], 'b-', alpha=0.3, lw=1)
    
    for hid in layers['hidden']:
        for out in layers['output']:
            ax.plot([hid[0]+0.4, out[0]-0.4], [hid[1], out[1]], 'g-', alpha=0.3, lw=1)
    
    ax.plot([6.4, 8.6], [3, 3], 'r-', lw=2)
    
    # Add labels for operations
    ax.text(1.5, 6, 'z1 = W1*x + b1', fontsize=11, ha='center')
    ax.text(1.5, 5.5, 'h = ReLU(z1)', fontsize=11, ha='center')
    ax.text(4.5, 6, 'z2 = W2*h + b2', fontsize=11, ha='center')
    ax.text(4.5, 5.5, 'y = sigmoid(z2)', fontsize=11, ha='center')
    ax.text(7.5, 4, 'L = BCE(y, target)', fontsize=11, ha='center')
    
    # Add layer labels
    ax.text(0, 0, 'Input\nLayer', ha='center', fontsize=10)
    ax.text(3, 0, 'Hidden\nLayer', ha='center', fontsize=10)
    ax.text(6, 0, 'Output\nLayer', ha='center', fontsize=10)
    ax.text(9, 0, 'Loss', ha='center', fontsize=10)
    
    ax.set_title('2-Layer Neural Network: Forward Pass', fontsize=14)
    ax.set_aspect('equal')
    ax.axis('off')
    plt.show()

visualize_network_forward()

### Backward Pass: Computing Gradients

We compute gradients in reverse order, applying the chain rule at each step.

#### Step 1: Gradient of Loss w.r.t. Output

For binary cross-entropy with sigmoid output:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

Combined with the sigmoid derivative, we get a beautiful simplification:

$$\frac{\partial L}{\partial z_2} = \hat{y} - y$$
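
The intermediate steps are short enough to write out. Using the sigmoid derivative $\frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1-\hat{y})$ and the chain rule:

$$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y})$$

$$= -y(1-\hat{y}) + (1-y)\hat{y} = -y + y\hat{y} + \hat{y} - y\hat{y} = \hat{y} - y$$

The factors $\hat{y}$ and $(1-\hat{y})$ from the sigmoid derivative cancel the denominators of the cross-entropy gradient, which is why the result contains no division.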

#### Step 2: Gradients for Output Layer Weights

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z_2} \cdot h^T$$

$$\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2}$$

#### Step 3: Gradient flowing to Hidden Layer

$$\frac{\partial L}{\partial h} = W_2^T \cdot \frac{\partial L}{\partial z_2}$$

#### Step 4: Through ReLU

$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h} \odot \mathbb{1}_{z_1 > 0}$$

(Element-wise multiplication with indicator function)

#### Step 5: Gradients for Hidden Layer Weights

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot x^T$$

$$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}$$
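
Steps 1 to 5 can be collected into one function and checked against finite differences for every parameter, not just one. The sketch below is one way to do that, under stated assumptions: the helper names (`forward_loss`, `backward`, `numerical_grads`) are illustrative, and the bias `b1` is chosen here so that the second hidden unit is inactive, which exercises the ReLU mask in Step 4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loss(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)                      # ReLU
    y_hat = sigmoid(W2 @ h + b2)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss.item(), (z1, h, y_hat)

def backward(params, x, y):
    W1, b1, W2, b2 = params
    _, (z1, h, y_hat) = forward_loss(params, x, y)
    dz2 = y_hat - y                              # Step 1
    dW2, db2 = dz2 @ h.T, dz2                    # Step 2
    dh = W2.T @ dz2                              # Step 3
    dz1 = dh * (z1 > 0)                          # Step 4
    dW1, db1 = dz1 @ x.T, dz1                    # Step 5
    return [dW1, db1, dW2, db2]

def numerical_grads(params, x, y, eps=1e-6):
    """Central differences, one parameter entry at a time (slow, for checking only)."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(*p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus, _ = forward_loss(params, x, y)
            p[idx] = old - eps
            minus, _ = forward_loss(params, x, y)
            p[idx] = old                         # restore the entry
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

params = [
    np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),   # W1 (3, 2)
    np.array([[0.1], [-2.0], [0.1]]),                 # b1 (3, 1): unit 2 is inactive
    np.array([[0.7, 0.8, 0.9]]),                      # W2 (1, 3)
    np.array([[0.1]]),                                # b2 (1, 1)
]
x = np.array([[1.0], [2.0]])
y = np.array([[1.0]])

analytic = backward(params, x, y)
numeric = numerical_grads(params, x, y)
max_abs_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest difference over all parameters: {max_abs_diff:.2e}")
print(f"gradient row for the inactive hidden unit: {analytic[0][1]}")
```

The row of `dW1` belonging to the inactive unit is exactly zero: the ReLU mask blocks the gradient, so that unit's incoming weights receive no update for this input.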

In [None]:
# Step-by-step backpropagation with concrete numbers

def backprop_walkthrough():
    """Walk through backprop with actual numbers."""
    
    print("="*60)
    print("BACKPROPAGATION WALKTHROUGH")
    print("="*60)
    
    # Network: 2 inputs -> 3 hidden -> 1 output
    np.random.seed(42)
    
    # Initialize weights (fixed, hand-picked values so the arithmetic is easy to follow)
    W1 = np.array([[0.1, 0.2], 
                   [0.3, 0.4], 
                   [0.5, 0.6]])  # (3, 2)
    b1 = np.array([[0.1], [0.1], [0.1]])  # (3, 1)
    
    W2 = np.array([[0.7, 0.8, 0.9]])  # (1, 3)
    b2 = np.array([[0.1]])  # (1, 1)
    
    # Input and target
    x = np.array([[1.0], [2.0]])  # (2, 1)
    y = np.array([[1.0]])  # (1, 1) - target
    
    print("\n--- FORWARD PASS ---")
    
    # Layer 1
    z1 = W1 @ x + b1
    print(f"\nz1 = W1*x + b1:")
    print(f"  W1 @ x = \n{W1 @ x}")
    print(f"  z1 = \n{z1}")
    
    h = np.maximum(0, z1)  # ReLU
    print(f"\nh = ReLU(z1) = \n{h}")
    
    # Layer 2
    z2 = W2 @ h + b2
    print(f"\nz2 = W2*h + b2 = {z2[0,0]:.4f}")
    
    y_pred = 1 / (1 + np.exp(-z2))  # Sigmoid
    print(f"y_pred = sigmoid(z2) = {y_pred[0,0]:.4f}")
    
    # Loss (Binary Cross-Entropy)
    epsilon = 1e-7  # For numerical stability
    L = -(y * np.log(y_pred + epsilon) + (1 - y) * np.log(1 - y_pred + epsilon))
    print(f"\nLoss = {L[0,0]:.4f}")
    
    print("\n--- BACKWARD PASS ---")
    
    # Output layer gradient
    dL_dz2 = y_pred - y  # Beautiful simplification for BCE + sigmoid
    print(f"\ndL/dz2 = y_pred - y = {y_pred[0,0]:.4f} - {y[0,0]:.4f} = {dL_dz2[0,0]:.4f}")
    
    # Gradients for W2 and b2
    dL_dW2 = dL_dz2 @ h.T
    dL_db2 = dL_dz2
    print(f"\ndL/dW2 = dL/dz2 * h^T = \n{dL_dW2}")
    print(f"dL/db2 = {dL_db2[0,0]:.4f}")
    
    # Gradient flowing back to hidden layer
    dL_dh = W2.T @ dL_dz2
    print(f"\ndL/dh = W2^T * dL/dz2 = \n{dL_dh}")
    
    # Through ReLU
    dL_dz1 = dL_dh * (z1 > 0).astype(float)
    print(f"\ndL/dz1 = dL/dh * (z1 > 0) = \n{dL_dz1}")
    print(f"  (ReLU derivative is 1 where z1 > 0, else 0)")
    
    # Gradients for W1 and b1
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1
    print(f"\ndL/dW1 = dL/dz1 * x^T = \n{dL_dW1}")
    print(f"dL/db1 = \n{dL_db1}")
    
    print("\n--- VERIFICATION (Numerical Gradients) ---")
    
    # Verify one gradient numerically
    def compute_loss(W1, b1, W2, b2, x, y):
        z1 = W1 @ x + b1
        h = np.maximum(0, z1)
        z2 = W2 @ h + b2
        y_pred = 1 / (1 + np.exp(-z2))
        return -(y * np.log(y_pred + 1e-7) + (1-y) * np.log(1 - y_pred + 1e-7))
    
    h_epsilon = 1e-5
    W2_plus = W2.copy()
    W2_plus[0, 0] += h_epsilon
    W2_minus = W2.copy()
    W2_minus[0, 0] -= h_epsilon
    
    numerical_grad = (compute_loss(W1, b1, W2_plus, b2, x, y) - 
                     compute_loss(W1, b1, W2_minus, b2, x, y)) / (2 * h_epsilon)
    
    print(f"\nFor W2[0,0]:")
    print(f"  Backprop gradient: {dL_dW2[0,0]:.6f}")
    print(f"  Numerical gradient: {numerical_grad[0,0]:.6f}")
    print(f"  Match: {np.isclose(dL_dW2[0,0], numerical_grad[0,0])}")
    
    return W1, b1, W2, b2, dL_dW1, dL_db1, dL_dW2, dL_db2

_ = backprop_walkthrough()

In [None]:
# Visualize gradient flow through the network

def visualize_gradient_flow():
    """Visualize how gradients flow backward through layers."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Illustrative, hand-picked gradient magnitudes at each layer (not measured from a real network)
    layers = ['Loss', 'Output\n(sigmoid)', 'Hidden\n(ReLU)', 'Input']
    
    # Scenario 1: Healthy gradients
    healthy_grads = [1.0, 0.8, 0.6, 0.5]
    axes[0].barh(layers, healthy_grads, color=['coral', 'lightgreen', 'lightgreen', 'lightblue'])
    axes[0].set_xlabel('Gradient Magnitude')
    axes[0].set_title('Healthy Gradient Flow\n(ReLU + proper initialization)')
    axes[0].set_xlim(0, 1.5)
    for i, v in enumerate(healthy_grads):
        axes[0].text(v + 0.05, i, f'{v:.2f}', va='center')
    
    # Scenario 2: Vanishing gradients (sigmoid everywhere)
    vanishing_grads = [1.0, 0.25, 0.06, 0.015]  # sigmoid max derivative is 0.25
    axes[1].barh(layers, vanishing_grads, color=['coral', 'lightyellow', 'lightyellow', 'lightblue'])
    axes[1].set_xlabel('Gradient Magnitude')
    axes[1].set_title('Vanishing Gradients\n(All sigmoid activations)')
    axes[1].set_xlim(0, 1.5)
    for i, v in enumerate(vanishing_grads):
        axes[1].text(v + 0.05, i, f'{v:.3f}', va='center')
    
    # Scenario 3: Exploding gradients
    exploding_grads = [1.0, 2.0, 4.0, 8.0]
    axes[2].barh(layers, exploding_grads, color=['coral', 'lightcoral', 'red', 'darkred'])
    axes[2].set_xlabel('Gradient Magnitude')
    axes[2].set_title('Exploding Gradients\n(Large weights, no regularization)')
    axes[2].set_xlim(0, 10)
    for i, v in enumerate(exploding_grads):
        axes[2].text(v + 0.1, i, f'{v:.1f}', va='center')
    
    plt.tight_layout()
    plt.show()
    
    print("Key observations:")
    print("  - Healthy: Gradients stay roughly the same magnitude across layers")
    print("  - Vanishing: Gradients shrink exponentially (early layers barely learn)")
    print("  - Exploding: Gradients grow exponentially (training becomes unstable)")

visualize_gradient_flow()

### Why This Matters in Machine Learning

| Concept | What It Means | Practical Impact |
|---------|---------------|------------------|
| Chain rule | Gradients multiply through layers | Deep networks can have vanishing/exploding gradients |
| Local gradients | Each operation contributes | Activation functions determine gradient flow |
| Weight gradients | How to update weights | Larger gradient = larger weight update |
| Gradient accumulation | Sum over all paths | Hidden units feeding multiple outputs get combined gradients |
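
The first row of the table can be checked on a toy model. The sketch below is an assumption-laden simplification: a scalar chain $a_{k+1} = \text{act}(w \cdot a_k)$ with the same weight at every layer. It multiplies the local gradients through 10 layers, once with sigmoid activations and once with a linear chain and a weight of 2.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def chain_gradient(depth, weight, activation):
    """Return |d(output)/d(input)| for the scalar chain a_{k+1} = act(weight * a_k)."""
    a, grad = 0.5, 1.0
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)     # sigmoid derivative, at most 0.25
        else:                         # "linear"
            a = z
            local = 1.0
        grad *= local * weight        # chain rule: multiply per-layer factors
    return abs(grad)

depth = 10
vanishing = chain_gradient(depth, weight=1.0, activation="sigmoid")
exploding = chain_gradient(depth, weight=2.0, activation="linear")

print(f"sigmoid chain, w=1: {vanishing:.3e}  (bounded by 0.25**{depth} = {0.25**depth:.3e})")
print(f"linear chain,  w=2: {exploding:.1f}")   # 1024.0
```

Real networks use matrices rather than scalars, but the same multiplication of per-layer factors drives both effects.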

---

## 5. Implementing Backprop from Scratch

Now let's implement a complete 2-layer neural network with backpropagation and train it to solve the XOR problem - the classic problem that single-layer perceptrons couldn't solve!

### The XOR Problem

| Input 1 | Input 2 | XOR Output |
|---------|---------|------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

XOR is **not linearly separable** - you cannot draw a single line to separate the 0s from the 1s. This is why single-layer perceptrons fail, and why we need hidden layers!
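
A brute-force search illustrates the claim. The sketch below tries every linear threshold unit `w1*x1 + w2*x2 + b > 0` on a coarse grid of weights (the grid and the helper `best_linear_accuracy` are assumptions for illustration) and reports the best number of correctly classified points. A grid search is not a proof, but it matches the known result that no single line classifies all four XOR points.

```python
import itertools

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def best_linear_accuracy(dataset, grid):
    """Best count of correct points over all (w1, w2, b) combinations in the grid."""
    best = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        correct = sum(
            int(w1 * x1 + w2 * x2 + b > 0) == label
            for (x1, x2), label in dataset
        )
        best = max(best, correct)
    return best

grid = [i / 4 for i in range(-8, 9)]   # -2.0 to 2.0 in steps of 0.25

best_and = best_linear_accuracy(AND, grid)
best_xor = best_linear_accuracy(XOR, grid)

print(f"AND: best single-unit result = {best_and}/4")   # 4/4
print(f"XOR: best single-unit result = {best_xor}/4")   # 3/4
```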

In [None]:
# Visualize why XOR needs a hidden layer

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# AND gate - linearly separable
ax = axes[0]
ax.scatter([0, 0, 1], [0, 1, 0], c='red', s=200, marker='o', label='Output: 0')
ax.scatter([1], [1], c='blue', s=200, marker='s', label='Output: 1')
x_line = np.linspace(-0.5, 1.5, 100)
ax.plot(x_line, -x_line + 1.5, 'g--', linewidth=2, label='Decision boundary')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input 1')
ax.set_ylabel('Input 2')
ax.set_title('AND Gate\n(Linearly Separable)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# OR gate - linearly separable  
ax = axes[1]
ax.scatter([0], [0], c='red', s=200, marker='o', label='Output: 0')
ax.scatter([0, 1, 1], [1, 0, 1], c='blue', s=200, marker='s', label='Output: 1')
ax.plot(x_line, -x_line + 0.5, 'g--', linewidth=2, label='Decision boundary')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input 1')
ax.set_ylabel('Input 2')
ax.set_title('OR Gate\n(Linearly Separable)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# XOR gate - NOT linearly separable
ax = axes[2]
ax.scatter([0, 1], [0, 1], c='red', s=200, marker='o', label='Output: 0')
ax.scatter([0, 1], [1, 0], c='blue', s=200, marker='s', label='Output: 1')
ax.annotate('No single line\ncan separate!', xy=(0.5, 0.5), fontsize=10, ha='center',
           bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8))
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input 1')
ax.set_ylabel('Input 2')
ax.set_title('XOR Gate\n(NOT Linearly Separable!)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

plt.tight_layout()
plt.show()

print("The XOR problem exposed the limitation of single-layer perceptrons.")
print("Minsky & Papert's 1969 book highlighted this and contributed to the first 'AI Winter'.")
print("The solution: hidden layers + backpropagation!")

In [None]:
class NeuralNetwork:
    """
    A 2-layer neural network implemented from scratch.
    Architecture: Input -> Hidden (ReLU) -> Output (Sigmoid)
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize the network with random weights.
        
        Args:
            input_size: Number of input features
            hidden_size: Number of hidden neurons
            output_size: Number of output neurons
        """
        # He initialization (std = sqrt(2 / fan_in)), which suits the ReLU hidden layer
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((output_size, 1))
        
        # Store intermediate values for backward pass
        self.cache = {}
        
    def relu(self, z):
        """ReLU activation function."""
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        """Derivative of ReLU."""
        return (z > 0).astype(float)
    
    def sigmoid(self, z):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def forward(self, X):
        """
        Forward pass through the network.
        
        Args:
            X: Input data of shape (input_size, num_samples)
            
        Returns:
            Output predictions of shape (output_size, num_samples)
        """
        # Layer 1: Linear -> ReLU
        self.cache['z1'] = self.W1 @ X + self.b1
        self.cache['h'] = self.relu(self.cache['z1'])
        
        # Layer 2: Linear -> Sigmoid
        self.cache['z2'] = self.W2 @ self.cache['h'] + self.b2
        self.cache['y_pred'] = self.sigmoid(self.cache['z2'])
        
        # Store input for backward pass
        self.cache['X'] = X
        
        return self.cache['y_pred']
    
    def compute_loss(self, y_pred, y_true):
        """
        Compute binary cross-entropy loss.
        
        Args:
            y_pred: Predicted probabilities
            y_true: True labels
            
        Returns:
            Scalar loss value
        """
        epsilon = 1e-7  # Keeps the arguments of log() away from zero
        loss = -np.mean(y_true * np.log(y_pred + epsilon) + 
                       (1 - y_true) * np.log(1 - y_pred + epsilon))
        return loss
    
    def backward(self, y_true):
        """
        Backward pass: compute gradients using backpropagation.
        
        Args:
            y_true: True labels of shape (output_size, num_samples)
            
        Returns:
            Dictionary of gradients for all parameters
        """
        m = y_true.shape[1]  # Number of samples
        
        # Retrieve cached values
        y_pred = self.cache['y_pred']
        h = self.cache['h']
        z1 = self.cache['z1']
        X = self.cache['X']
        
        # Output layer gradients
        # For BCE + sigmoid: dL/dz2 = y_pred - y_true
        dz2 = y_pred - y_true  # (output_size, m)
        
        dW2 = (1/m) * dz2 @ h.T  # (output_size, hidden_size)
        db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)  # (output_size, 1)
        
        # Hidden layer gradients
        dh = self.W2.T @ dz2  # (hidden_size, m)
        dz1 = dh * self.relu_derivative(z1)  # (hidden_size, m)
        
        dW1 = (1/m) * dz1 @ X.T  # (hidden_size, input_size)
        db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)  # (hidden_size, 1)
        
        gradients = {
            'dW1': dW1, 'db1': db1,
            'dW2': dW2, 'db2': db2
        }
        
        return gradients
    
    def update_parameters(self, gradients, learning_rate):
        """
        Update parameters using gradient descent.
        
        Args:
            gradients: Dictionary of gradients
            learning_rate: Learning rate for gradient descent
        """
        self.W1 -= learning_rate * gradients['dW1']
        self.b1 -= learning_rate * gradients['db1']
        self.W2 -= learning_rate * gradients['dW2']
        self.b2 -= learning_rate * gradients['db2']
    
    def train(self, X, y, epochs, learning_rate, verbose=True):
        """
        Train the network.
        
        Args:
            X: Training inputs
            y: Training labels
            epochs: Number of training epochs
            learning_rate: Learning rate
            verbose: Whether to print progress
            
        Returns:
            List of losses during training
        """
        losses = []
        
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            
            # Compute loss
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            
            # Backward pass
            gradients = self.backward(y)
            
            # Update parameters
            self.update_parameters(gradients, learning_rate)
            
            if verbose and (epoch + 1) % 1000 == 0:
                predictions = (y_pred > 0.5).astype(int)
                accuracy = np.mean(predictions == y)
                print(f"Epoch {epoch+1:5d} | Loss: {loss:.4f} | Accuracy: {accuracy:.2%}")
        
        return losses
    
    def predict(self, X):
        """Make predictions (returns probabilities)."""
        return self.forward(X)

print("Neural Network class defined successfully!")
print("\nKey methods:")
print("  - forward(X): Computes predictions, caches intermediate values")
print("  - backward(y): Computes gradients using backpropagation")
print("  - update_parameters(): Applies gradient descent")
print("  - train(): Full training loop")
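
The shortcut `dz2 = y_pred - y_true` used in `backward` follows from the chain rule. A sketch for a single sample, writing $\hat{y}$ for `y_pred`, $y$ for `y_true`, and $z_2$ for `z2`, and ignoring the small `epsilon` added inside the logarithms:

$$L = -\big[y \log \hat{y} + (1-y)\log(1-\hat{y})\big], \qquad \hat{y} = \sigma(z_2)$$

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}, \qquad \frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1-\hat{y})$$

$$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z_2} = \hat{y} - y$$

The $\hat{y}(1-\hat{y})$ factors cancel, which is why the class never needs a sigmoid derivative. The `1/m` factor in `dW2` and `db2` comes from averaging the loss over the `m` samples.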

In [None]:
# Train on XOR!

# XOR dataset
X_xor = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]])  # (2, 4)

y_xor = np.array([[0, 1, 1, 0]])  # (1, 4)

print("XOR Dataset:")
print(f"X = \n{X_xor}")
print(f"y = {y_xor}")

# Create and train network
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)

print("\nTraining...")
losses = nn.train(X_xor, y_xor, epochs=10000, learning_rate=1.0)

# Final predictions
print("\n" + "="*40)
print("FINAL RESULTS")
print("="*40)
y_pred = nn.predict(X_xor)
print("\nPredictions (probabilities):")
for i in range(4):
    x1, x2 = X_xor[0, i], X_xor[1, i]
    pred = y_pred[0, i]
    true = y_xor[0, i]
    print(f"  XOR({x1}, {x2}) = {pred:.4f} (target: {true}, predicted: {int(pred > 0.5)})")

final_accuracy = np.mean((y_pred > 0.5).astype(int) == y_xor)
print(f"\nFinal Accuracy: {final_accuracy:.0%}")
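if final_accuracy == 1.0:
    print("\nWe solved XOR! The hidden layer learned to transform the space!")
else:
    print("\nNot every case is classified correctly. Try a different seed, more hidden units, or a smaller learning rate.")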

In [None]:
# Visualize training progress and learned decision boundary

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Loss curve
ax = axes[0]
ax.plot(losses, 'b-', linewidth=1)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss (Binary Cross-Entropy)')
ax.set_title('Training Loss Over Time')
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Plot 2: Decision boundary
ax = axes[1]

# Create mesh grid
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                     np.linspace(-0.5, 1.5, 100))
grid = np.c_[xx.ravel(), yy.ravel()].T  # (2, 10000)

# Get predictions for grid
Z = nn.predict(grid)
Z = Z.reshape(xx.shape)

# Plot decision boundary
contour = ax.contourf(xx, yy, Z, levels=50, cmap='RdYlBu', alpha=0.8)
plt.colorbar(contour, ax=ax, label='Predicted probability')

# Plot data points
for i in range(4):
    color = 'blue' if y_xor[0, i] == 1 else 'red'
    marker = 's' if y_xor[0, i] == 1 else 'o'
    ax.scatter(X_xor[0, i], X_xor[1, i], c=color, s=200, marker=marker, 
              edgecolor='black', linewidth=2)

ax.set_xlabel('Input 1')
ax.set_ylabel('Input 2')
ax.set_title('Learned Decision Boundary\n(Blue = 1, Red = 0)')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)

# Plot 3: Hidden layer representations
ax = axes[2]

# Get hidden layer activations for each input
_ = nn.forward(X_xor)
hidden_activations = nn.cache['h']  # (hidden_size, 4)

# Plot only the first two hidden units (a 2D slice of the 4-unit hidden layer)
colors = ['red', 'blue', 'blue', 'red']  # Based on XOR output
markers = ['o', 's', 's', 'o']

for i in range(4):
    x1, x2 = X_xor[0, i], X_xor[1, i]
    h1, h2 = hidden_activations[0, i], hidden_activations[1, i]
    ax.scatter(h1, h2, c=colors[i], s=200, marker=markers[i], 
              edgecolor='black', linewidth=2)
    ax.annotate(f'({x1},{x2})', (h1, h2), xytext=(5, 5), textcoords='offset points')

ax.set_xlabel('Hidden Unit 1')
ax.set_ylabel('Hidden Unit 2')
ax.set_title('Hidden Layer Representation\n(Data becomes linearly separable!)')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("The magic of hidden layers:")
print("- The hidden layer TRANSFORMS the input space")
print("- In the new space, XOR becomes linearly separable!")
print("- This is what deep learning is all about: learning useful representations")

### Deep Dive: What Did the Network Learn?

The hidden layer learned to transform the 2D input into a representation where XOR is solvable. The exact activations depend on the random initialization, and the plot above shows only the first two of the four hidden units, but a successful solution follows this pattern:

| Original Input | XOR Target | Hidden Representation | Why It Works |
|----------------|------------|-----------------------|--------------|
| (0,0) | 0 | Region A | Lands on the same side as (1,1) |
| (0,1) | 1 | Region B | Lands on the same side as (1,0) |
| (1,0) | 1 | Region B | On the opposite side from region A |
| (1,1) | 0 | Region A | A single line now separates A from B |

**This is the fundamental insight of deep learning:** Hidden layers learn transformations that make the problem easier to solve.
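
To see such a transformation without any training, here is a minimal sketch with hand-picked weights. These values are an assumption chosen for illustration, not the weights the trained network found, and the readout is a plain linear score with a 0.5 threshold rather than the sigmoid output used in `NeuralNetwork`.

```python
import numpy as np

# Hand-picked (not learned) weights for a 2-2-1 ReLU network.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # both hidden units see x1 + x2
b1 = np.array([[0.0], [-1.0]])     # second unit only fires when x1 + x2 > 1
W2 = np.array([[1.0, -2.0]])       # output score = h1 - 2 * h2
b2 = np.array([[0.0]])

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])       # (2, 4), same column layout as X_xor

H = np.maximum(0, W1 @ X + b1)     # hidden representation, shape (2, 4)
score = W2 @ H + b2                # linear readout, shape (1, 4)
pred = (score > 0.5).astype(int)

for i in range(X.shape[1]):
    print(f"input {X[:, i].tolist()} -> hidden {H[:, i].tolist()} "
          f"-> score {score[0, i]:.1f} -> class {pred[0, i]}")
```

In hidden space, (0,1) and (1,0) collapse onto the same point (1, 0), while (0,0) and (1,1) move to (0, 0) and (2, 1). The line `h1 - 2 * h2 = 0.5` then separates the two classes.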

---

## 6. Common Issues

Backpropagation can fail in several ways. Understanding these issues is crucial for training deep networks.

### Vanishing Gradients

When gradients become extremely small as they flow backward, early layers barely learn.

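**Cause:** Saturating activation functions like sigmoid shrink gradients. Sigmoid's derivative never exceeds 0.25, so each sigmoid layer scales the backward signal by at most 0.25 (a loss of at least 75%) before the weights are even taken into account. Across 10 layers, that factor alone is at most $0.25^{10} \approx 9.5 \times 10^{-7}$.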

**Symptoms:**
- Early layers' weights barely change
- Loss decreases very slowly
- Network seems "stuck"
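
These symptoms can be checked directly by measuring gradient norms layer by layer. The sketch below is an assumption-laden illustration, not part of the network built earlier: it uses a randomly initialized, bias-free sigmoid MLP with arbitrary sizes and an all-ones gradient at the output instead of a real loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width, batch = 8, 16, 32          # arbitrary illustrative sizes

# Random sigmoid MLP: a_l = sigmoid(W_l @ a_{l-1}), biases omitted for brevity
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
           for _ in range(depth)]

# Forward pass, caching every activation
activations = [rng.normal(size=(width, batch))]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass, starting from an all-ones gradient at the last activation
grad_a = np.ones((width, batch))
delta_norms = []                          # ||dL/dz_l||, collected from last layer to first
for W, a in zip(reversed(weights), reversed(activations[1:])):
    delta = grad_a * a * (1.0 - a)        # through sigmoid: sigma'(z) = a * (1 - a)
    delta_norms.append(np.linalg.norm(delta))
    grad_a = W.T @ delta                  # through the linear map to the previous layer
delta_norms.reverse()                     # index 0 is now the earliest layer

for layer, norm in enumerate(delta_norms, start=1):
    print(f"layer {layer}: ||dL/dz|| = {norm:.3e}")
```

The printed norms should shrink steadily toward layer 1, which is the "early layers barely learn" symptom in numeric form.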

In [None]:
# Demonstrate vanishing gradients with sigmoid

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Simulate gradient flow through layers
num_layers = 10
gradient = 1.0  # Starting gradient
gradients_sigmoid = [gradient]

# Assume average input to sigmoid is 0 (where derivative is max = 0.25)
for _ in range(num_layers):
    gradient *= 0.25  # Maximum sigmoid derivative
    gradients_sigmoid.append(gradient)

# Compare with ReLU
gradient = 1.0
gradients_relu = [gradient]
for _ in range(num_layers):
    # ReLU derivative is 1 for positive inputs
    gradient *= 1.0  # Assuming positive activations
    gradients_relu.append(gradient)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
layers = range(num_layers + 1)
ax.plot(layers, gradients_sigmoid, 'ro-', linewidth=2, markersize=8, label='Sigmoid')
ax.plot(layers, gradients_relu, 'bs-', linewidth=2, markersize=8, label='ReLU')
ax.set_xlabel('Layer (from output)')
ax.set_ylabel('Gradient Magnitude')
ax.set_title('Gradient Magnitude vs Layer Depth')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Show activation function derivatives
ax = axes[1]
x = np.linspace(-5, 5, 200)
ax.plot(x, sigmoid_derivative(x), 'r-', linewidth=2, label='Sigmoid derivative (max=0.25)')
ax.plot(x, (x > 0).astype(float), 'b-', linewidth=2, label='ReLU derivative (0 or 1)')
ax.axhline(y=0.25, color='r', linestyle='--', alpha=0.5)
ax.axhline(y=1.0, color='b', linestyle='--', alpha=0.5)
ax.set_xlabel('Input')
ax.set_ylabel('Derivative')
ax.set_title('Activation Function Derivatives')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"After {num_layers} layers:")
print(f"  Sigmoid gradient: {gradients_sigmoid[-1]:.2e} (basically zero!)")
print(f"  ReLU gradient: {gradients_relu[-1]:.2e} (preserved!)")
print("\nThis is a major reason ReLU became the default activation for deep networks.")

### Exploding Gradients

When gradients become extremely large, weights update too drastically and training becomes unstable.

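**Cause:** The backward pass multiplies the gradient by each layer's weights. When those weights are large enough that every layer scales the gradient by a factor greater than 1, the gradient grows exponentially with depth.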

**Symptoms:**
- Loss suddenly becomes NaN or Inf
- Weights become extremely large
- Training diverges
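
A cheap way to catch these symptoms early is to inspect the gradients before each update. The helper below is a hypothetical sketch: `gradient_report` and its `warn_norm` threshold are names and values introduced here for illustration, and the gradient dictionaries are made up to mimic the layout returned by `NeuralNetwork.backward`.

```python
import numpy as np

def gradient_report(gradients, warn_norm=100.0):
    """Summarize a dict of gradient arrays.

    `warn_norm` is an arbitrary illustrative threshold, not a recommended value.
    """
    finite = all(np.all(np.isfinite(g)) for g in gradients.values())
    # Global L2 norm over every entry of every gradient array
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in gradients.values())))
    return {
        "finite": finite,
        "global_norm": global_norm,
        "suspicious": (not finite) or global_norm > warn_norm,
    }

# Made-up gradients in the same dict layout that NeuralNetwork.backward returns
healthy = {"dW1": np.full((4, 2), 0.1), "dW2": np.full((1, 4), 0.1)}
exploded = {"dW1": np.full((4, 2), 1e6), "dW2": np.array([[np.nan, 1.0, 1.0, 1.0]])}

print(gradient_report(healthy))
print(gradient_report(exploded))
```

Logging the global norm every few hundred steps usually reveals an explosion several updates before the loss turns into NaN.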

In [None]:
# Demonstrate exploding gradients

def simulate_gradient_flow(weight_scale, num_layers=10):
    """Simulate gradient magnitude through layers."""
    gradient = 1.0
    gradients = [gradient]
    
    for _ in range(num_layers):
        # Gradient gets multiplied by weight magnitude at each layer
        gradient *= weight_scale
        gradients.append(gradient)
    
    return gradients

# Different weight scales
weight_scales = [0.5, 1.0, 1.5, 2.0]

fig, ax = plt.subplots(figsize=(10, 6))

colors = ['blue', 'green', 'orange', 'red']
for scale, color in zip(weight_scales, colors):
    gradients = simulate_gradient_flow(scale)
    ax.plot(range(len(gradients)), gradients, 'o-', color=color, 
           linewidth=2, markersize=6, label=f'Weight scale = {scale}')

ax.set_xlabel('Layer (from output)')
ax.set_ylabel('Gradient Magnitude')
ax.set_title('Gradient Flow with Different Weight Scales')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_yscale('log')  # all values are positive, so a log axis shows both decay and growth
ax.axhline(y=1, color='k', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("Observations:")
print("  - Weight scale < 1: Vanishing gradients (shrinks exponentially)")
print("  - Weight scale = 1: Perfect gradient flow (stays constant)")
print("  - Weight scale > 1: Exploding gradients (grows exponentially)")
print("\nThis is why weight initialization is so important!")

### Solutions to Gradient Problems

| Problem | Solution | How It Helps |
|---------|----------|-------------|
| Vanishing gradients | ReLU activation | Derivative is 1 for positive inputs |
| Vanishing gradients | Skip connections (ResNet) | Gradients can bypass layers |
| Exploding gradients | Gradient clipping | Caps gradient magnitude |
| Both | Proper initialization | Xavier/He initialization keeps gradients stable |
| Both | Batch normalization | Normalizes activations, stabilizes gradients |
| Both | Learning rate scheduling | Smaller steps when needed |
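
The "proper initialization" row can be illustrated with a small experiment. The sketch below is a simplified stand-in under stated assumptions: it tracks forward activations rather than gradients (the backward pass multiplies by the same weight matrices, so the scaling argument is similar), and the helper `final_activation_rms`, the layer sizes, and the fixed standard deviation of 0.01 are all choices made here for illustration.

```python
import numpy as np

def final_activation_rms(weight_std_fn, depth=20, width=256, batch=64, seed=0):
    """Push random inputs through `depth` ReLU layers; return the RMS of the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(width, batch))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std_fn(width), size=(width, width))
        a = np.maximum(0, W @ a)
    return float(np.sqrt(np.mean(a ** 2)))

# Fixed small standard deviation, independent of layer width
naive_rms = final_activation_rms(lambda fan_in: 0.01)
# He initialization: std = sqrt(2 / fan_in)
he_rms = final_activation_rms(lambda fan_in: np.sqrt(2.0 / fan_in))

print(f"fixed std 0.01: final activation RMS = {naive_rms:.3e}")
print(f"He init:        final activation RMS = {he_rms:.3e}")
```

With the fixed small scale the signal collapses toward zero after 20 layers, while the He-scaled weights keep it at roughly its original magnitude.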

In [None]:
# Demonstrate gradient clipping

def gradient_clip(gradients, max_norm):
    """Clip gradients to maximum norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        return [g * clip_coef for g in gradients]
    return gradients

# Simulate training with exploding gradients
np.random.seed(42)

# Create large random gradients (simulating explosion)
grad_W1 = np.random.randn(10, 5) * 100  # Huge gradients!
grad_W2 = np.random.randn(3, 10) * 100

original_grads = [grad_W1, grad_W2]
clipped_grads = gradient_clip(original_grads, max_norm=5.0)

orig_norm = np.sqrt(sum(np.sum(g**2) for g in original_grads))
clip_norm = np.sqrt(sum(np.sum(g**2) for g in clipped_grads))

print("Gradient Clipping Example:")
print(f"  Original gradient norm: {orig_norm:.2f}")
print(f"  Clipped gradient norm: {clip_norm:.2f}")
print(f"  Max allowed norm: 5.0")
print(f"\nGradient clipping preserved direction but limited magnitude!")

---

## Exercises

### Exercise 1: Manual Backprop

Compute the gradients by hand for a simple computational graph.

In [None]:
# EXERCISE 1: Manual Backpropagation
# Given: f = (a + b) * (b + c)
# Where: a = 1, b = 2, c = 3

# Compute the gradients df/da, df/db, df/dc manually, then verify with code.

a, b, c = 1.0, 2.0, 3.0

# Forward pass
p = a + b  # = 3
q = b + c  # = 5
f = p * q  # = 15

print("Forward pass:")
print(f"  p = a + b = {a} + {b} = {p}")
print(f"  q = b + c = {b} + {c} = {q}")
print(f"  f = p * q = {p} * {q} = {f}")

# TODO: Compute gradients manually
# Hint: df/df = 1
# Hint: For multiplication f = p * q: df/dp = q, df/dq = p
# Hint: For addition p = a + b: dp/da = 1, dp/db = 1
# Hint: Use chain rule to combine!

df_da = None  # TODO: Your answer
df_db = None  # TODO: Your answer (careful - b appears in both p and q!)
df_dc = None  # TODO: Your answer

# Uncomment to check your answers:
# print(f"\nYour answers:")
# print(f"  df/da = {df_da}")
# print(f"  df/db = {df_db}")
# print(f"  df/dc = {df_dc}")

# Numerical verification
def f_func(a, b, c):
    return (a + b) * (b + c)

h = 1e-5
numerical_da = (f_func(a+h, b, c) - f_func(a-h, b, c)) / (2*h)
numerical_db = (f_func(a, b+h, c) - f_func(a, b-h, c)) / (2*h)
numerical_dc = (f_func(a, b, c+h) - f_func(a, b, c-h)) / (2*h)

print(f"\nNumerical verification:")
print(f"  df/da = {numerical_da:.4f}")
print(f"  df/db = {numerical_db:.4f}")
print(f"  df/dc = {numerical_dc:.4f}")

### Exercise 2: Add Tanh Activation

Modify our neural network to use tanh instead of ReLU in the hidden layer.

In [None]:
# EXERCISE 2: Neural Network with Tanh
# Modify the NeuralNetwork class to use tanh activation

class NeuralNetworkTanh:
    """
    A 2-layer neural network with tanh hidden activation.
    
    TODO: Implement the tanh activation and its derivative
    Hint: tanh derivative = 1 - tanh(x)^2
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((output_size, 1))
        self.cache = {}
    
    def tanh(self, z):
        """Tanh activation function."""
        # TODO: Implement tanh
        pass
    
    def tanh_derivative(self, z):
        """Derivative of tanh."""
        # TODO: Implement tanh derivative
        # Hint: d/dx tanh(x) = 1 - tanh(x)^2
        pass
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def forward(self, X):
        self.cache['z1'] = self.W1 @ X + self.b1
        # TODO: Use tanh instead of ReLU
        self.cache['h'] = None  # Replace with tanh activation
        
        self.cache['z2'] = self.W2 @ self.cache['h'] + self.b2
        self.cache['y_pred'] = self.sigmoid(self.cache['z2'])
        self.cache['X'] = X
        return self.cache['y_pred']
    
    def backward(self, y_true):
        m = y_true.shape[1]
        y_pred = self.cache['y_pred']
        h = self.cache['h']
        z1 = self.cache['z1']
        X = self.cache['X']
        
        dz2 = y_pred - y_true
        dW2 = (1/m) * dz2 @ h.T
        db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)
        
        dh = self.W2.T @ dz2
        # TODO: Use tanh derivative instead of ReLU derivative
        dz1 = None  # Replace with correct gradient through tanh
        
        dW1 = (1/m) * dz1 @ X.T
        db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)
        
        return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    
    def train(self, X, y, epochs, learning_rate):
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = -np.mean(y * np.log(y_pred + 1e-7) + (1-y) * np.log(1 - y_pred + 1e-7))
            losses.append(loss)
            
            gradients = self.backward(y)
            
            self.W1 -= learning_rate * gradients['dW1']
            self.b1 -= learning_rate * gradients['db1']
            self.W2 -= learning_rate * gradients['dW2']
            self.b2 -= learning_rate * gradients['db2']
        
        return losses

# Test your implementation
# np.random.seed(42)
# nn_tanh = NeuralNetworkTanh(input_size=2, hidden_size=4, output_size=1)
# losses = nn_tanh.train(X_xor, y_xor, epochs=10000, learning_rate=1.0)
# y_pred = nn_tanh.forward(X_xor)
# accuracy = np.mean((y_pred > 0.5).astype(int) == y_xor)
# print(f"Tanh network accuracy on XOR: {accuracy:.0%}")

### Exercise 3: Gradient Checking

Implement a function to verify backpropagation is correct by comparing with numerical gradients.
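
Before tackling the network, it can help to see the pattern on a function whose gradient is known exactly. The sketch below uses a toy function `f(w) = sum(w**3)` that is unrelated to the network. The helper names `numerical_gradient` and `relative_error` are assumptions introduced here, and the relative-error formula is one common convention rather than the only choice.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of df/dw for a scalar-valued f and array w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        original = w[idx]
        w[idx] = original + eps
        f_plus = f(w)
        w[idx] = original - eps
        f_minus = f(w)
        w[idx] = original                      # restore before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """||a - b|| / (||a|| + ||b||), a scale-free way to compare two gradients."""
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12)

# Toy function (hypothetical, unrelated to the network): f(w) = sum(w**3)
f = lambda w: np.sum(w ** 3)
w = np.array([[0.5, -1.0], [2.0, 1.5]])

analytic = 3 * w ** 2                          # exact gradient of sum(w**3)
numeric = numerical_gradient(f, w)
err = relative_error(analytic, numeric)
print(f"relative error: {err:.2e}")
```

A relative error is easier to interpret than the absolute difference because it does not depend on the scale of the gradient itself.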

In [None]:
# EXERCISE 3: Gradient Checking

def gradient_check(network, X, y, epsilon=1e-7):
    """
    Verify backpropagation by comparing with numerical gradients.
    
    Args:
        network: NeuralNetwork instance
        X: Input data
        y: True labels
        epsilon: Small value for numerical gradient
        
    Returns:
        Dictionary with gradient differences for each parameter
    """
    # Get analytical gradients from backprop
    _ = network.forward(X)
    analytical_grads = network.backward(y)
    
    # TODO: Compute numerical gradients and compare
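    # For each weight:
    #   1. Set the weight to (original + epsilon)
    #   2. Run a forward pass and compute loss_plus
    #   3. Set the weight to (original - epsilon), so the two evaluations are 2*epsilon apart
    #   4. Run a forward pass and compute loss_minus
    #   5. Numerical gradient = (loss_plus - loss_minus) / (2 * epsilon)
    #   6. Restore the original weight
    
    # Hint: You'll need to temporarily modify weights, compute loss, then restore
    # Hint: compute_loss adds 1e-7 inside the logs, so expect tiny (not exactly zero) differences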
    
    results = {}
    
    # Example for one weight (W2[0,0]):
    # original_val = network.W2[0, 0]
    # 
    # network.W2[0, 0] = original_val + epsilon
    # _ = network.forward(X)
    # loss_plus = network.compute_loss(network.cache['y_pred'], y)
    # 
    # network.W2[0, 0] = original_val - epsilon
    # _ = network.forward(X)
    # loss_minus = network.compute_loss(network.cache['y_pred'], y)
    # 
    # numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
    # network.W2[0, 0] = original_val  # Restore
    # 
    # difference = abs(analytical_grads['dW2'][0, 0] - numerical_grad)
    
    return results

# Test
# np.random.seed(42)
# nn_check = NeuralNetwork(input_size=2, hidden_size=3, output_size=1)
# results = gradient_check(nn_check, X_xor, y_xor)
# print("Gradient check results:")
# for param, diff in results.items():
#     print(f"  {param}: max difference = {diff:.2e}")

---

## Summary

### Key Concepts

- **Credit Assignment Problem**: How do we know which weights caused the error?
- **Computational Graphs**: Break complex functions into simple operations
- **Chain Rule**: Multiply local gradients along paths, sum over all paths
- **Forward Pass**: Compute outputs, save intermediate values
- **Backward Pass**: Compute gradients from output to input
- **Vanishing/Exploding Gradients**: Gradients can shrink or grow exponentially with depth

### Connection to Deep Learning

| Concept | Application |
|---------|-------------|
| Computational graphs | PyTorch/TensorFlow build these automatically |
| Chain rule | The mathematical foundation of all gradient-based learning |
| Local gradients | Each layer implements forward() and backward() |
| Gradient caching | Frameworks save activations for efficient backward pass |
| Vanishing gradients | Why ReLU replaced sigmoid in deep networks |
| Exploding gradients | Why gradient clipping is used in RNNs |

### Checklist

- [ ] I can draw a computational graph for any mathematical expression
- [ ] I can compute gradients using the chain rule
- [ ] I understand why gradients flow backward
- [ ] I can implement backpropagation from scratch
- [ ] I know the difference between vanishing and exploding gradients
- [ ] I understand why hidden layers enable solving XOR

---

## Next Steps

Now that you understand backpropagation at a deep level, you're ready for:

1. **PyTorch Fundamentals** - See how autograd does all this automatically!
2. **Optimization Algorithms** - SGD, Adam, and learning rate schedules
3. **Regularization** - Dropout, weight decay, and preventing overfitting
4. **Deeper Networks** - Architectures like CNNs and Transformers

The concepts from this notebook will appear everywhere in deep learning. Every time you call `loss.backward()` in PyTorch, the algorithm you just implemented by hand is running under the hood!