# Multi-Layer Perceptron (1 Hidden Layer) - Introduction to Neural Networks

## From Logistic Regression to Neural Networks

So far we've learned:
- **Linear Regression**: $y = Xw + b$ (for continuous predictions)
- **Logistic Regression**: $y = \sigma(Xw + b)$ (for binary classification)

Both are **linear models** - linear regression fits a hyperplane, and logistic regression can only learn a linear decision boundary.

### The Problem with Linear Models

Some problems are **not linearly separable**!

Example: XOR problem
- Input (0,0) → 0
- Input (0,1) → 1
- Input (1,0) → 1
- Input (1,1) → 0

**No single line can separate these classes!**
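
As a quick illustration (not a proof), the sketch below fits the best linear function to the four XOR points in the least-squares sense. It uses least squares rather than logistic regression to keep the example short, and the names `X_xor`, `y_xor`, and `A` are illustrative. The fitted weights come out as zero and every prediction is 0.5 (up to floating-point error), so the linear fit carries no information about the class.

```python
import numpy as np

# XOR inputs and targets
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias is learned as a third weight
A = np.hstack([X_xor, np.ones((4, 1))])

# Best linear fit in the least-squares sense
params, *_ = np.linalg.lstsq(A, y_xor, rcond=None)
preds = A @ params

print("weights and bias:", np.round(params, 6))
print("predictions:     ", np.round(preds, 6))
```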

### The Solution: Add Hidden Layers!

A **Multi-Layer Perceptron (MLP)** has:
1. **Input layer**: our features
2. **Hidden layer(s)**: intermediate transformations
3. **Output layer**: final predictions

## Architecture of 1-Hidden-Layer MLP

```
Input (X)  →  Hidden Layer  →  Output Layer  →  Prediction (ŷ)
  (n_features)    (n_hidden)      (n_outputs)
```

### Forward Pass:

1. **Hidden layer**: 
   - $z_1 = XW_1 + b_1$
   - $a_1 = \sigma(z_1)$ (apply activation)

2. **Output layer**:
   - $z_2 = a_1W_2 + b_2$
   - $\hat{y} = \sigma(z_2)$ (for binary classification)

Where:
- $W_1, b_1$ = weights and bias for hidden layer
- $W_2, b_2$ = weights and bias for output layer
- $\sigma$ = activation function: sigmoid, ReLU, tanh, etc. in the hidden layer; sigmoid at the output for binary classification
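
To make the shapes in the forward pass concrete, here is a minimal sketch with random weights. The sizes (5 samples, 2 features, 3 hidden units, 1 output) and the name `X_demo` are arbitrary choices for illustration, and ReLU is assumed for the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features, n_hidden, n_outputs = 5, 2, 3, 1

X_demo = rng.normal(size=(n_samples, n_features))
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros((1, n_hidden))
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros((1, n_outputs))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: linear step, then ReLU
z1 = X_demo @ W1 + b1
a1 = np.maximum(0, z1)

# Output layer: linear step, then sigmoid
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)

print(z1.shape, a1.shape, z2.shape, y_hat.shape)  # (5, 3) (5, 3) (5, 1) (5, 1)
```

The bias rows `b1` and `b2` have shape `(1, n_units)` and are broadcast across all samples.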

### Why Hidden Layers Work:

- **First layer**: Learns useful feature combinations
- **Second layer**: Combines these features to make final decision
- Together: Can learn **non-linear** decision boundaries!

Think of it like:
- Hidden layer = creating new, more useful features
- Output layer = using these features for prediction
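
The XOR problem from earlier shows this idea in miniature. The sketch below uses hand-picked (not learned) weights for a network with two ReLU hidden units; the output is left linear, without a sigmoid, so the numbers are easy to check by hand. The hidden layer builds two new features, `relu(x1 + x2)` and `relu(x1 + x2 - 1)`, and the output layer combines them.

```python
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters, chosen for illustration only
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([[0.0, -1.0]])
W2 = np.array([[1.0],
               [-2.0]])
b2 = np.array([[0.0]])

a1 = np.maximum(0, X_xor @ W1 + b1)  # new features created by the hidden layer
z2 = a1 @ W2 + b2                    # linear combination of those features

print(a1)
print(z2.ravel())  # [0. 1. 1. 0.]
```

Thresholding `z2` at 0.5 reproduces the XOR labels, which no single linear layer on the raw inputs can do.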

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import train_test_split

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

## Step 1: Activation Functions

### Why Activation Functions?

Without activation functions:
- $z_2 = (XW_1 + b_1)W_2 + b_2 = X(W_1W_2) + (b_1W_2 + b_2)$
- This is just another linear transformation!
- Multiple layers without activation = single layer

**Activation functions add non-linearity!**
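
The collapse can be checked numerically. This is a small sketch with random matrices (the shapes and the names `W_eq`, `b_eq` are illustrative): two stacked linear layers give the same output as one layer with weights $W_1W_2$ and bias $b_1W_2 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 2))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

# Two linear layers with no activation in between
two_layers = (X_demo @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W_eq = W1 @ W2
b_eq = b1 @ W2 + b2
one_layer = X_demo @ W_eq + b_eq

print(np.allclose(two_layers, one_layer))  # True
```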

### Common Activation Functions:

1. **Sigmoid**: $\sigma(z) = \frac{1}{1 + e^{-z}}$
   - Range: $(0, 1)$
   - Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
   - Problem: Vanishing gradients for large $|z|$

2. **ReLU (Rectified Linear Unit)**: $f(z) = \max(0, z)$
   - Range: $[0, \infty)$
   - Derivative: $f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$
   - Advantages: Fast, no vanishing gradient for $z > 0$
   - Most popular for hidden layers!

3. **Tanh**: $f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
   - Range: $(-1, 1)$
   - Derivative: $f'(z) = 1 - f(z)^2$
   - Like sigmoid but centered at 0
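
To see the vanishing-gradient point in numbers, the sketch below evaluates the three derivatives at a few values of $z$ using only the standard library. The helper names are illustrative. At $z = 0$ the sigmoid derivative is at its maximum of 0.25 and the tanh derivative is 1; both shrink quickly as $z$ grows, while the ReLU derivative stays at 1 for any positive $z$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.2e}  "
          f"tanh'={tanh_grad(z):.2e}  relu'={relu_grad(z):.0f}")
```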

In [None]:
# Define activation functions and their derivatives
def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2

# Visualize activation functions
z = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Sigmoid
axes[0, 0].plot(z, sigmoid(z), linewidth=3, color='#2E86AB')
axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1 + e⁻ᶻ)', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Output', fontsize=11)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].axhline(y=0, color='black', linewidth=0.5)
axes[0, 0].axvline(x=0, color='black', linewidth=0.5)

axes[1, 0].plot(z, sigmoid_derivative(z), linewidth=3, color='#A23B72')
axes[1, 0].set_title('Sigmoid Derivative', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('z', fontsize=11)
axes[1, 0].set_ylabel('Derivative', fontsize=11)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].axhline(y=0, color='black', linewidth=0.5)
axes[1, 0].axvline(x=0, color='black', linewidth=0.5)

# ReLU
axes[0, 1].plot(z, relu(z), linewidth=3, color='#2E86AB')
axes[0, 1].set_title('ReLU: f(z) = max(0, z)', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].axhline(y=0, color='black', linewidth=0.5)
axes[0, 1].axvline(x=0, color='black', linewidth=0.5)

axes[1, 1].plot(z, relu_derivative(z), linewidth=3, color='#A23B72')
axes[1, 1].set_title('ReLU Derivative', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('z', fontsize=11)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axhline(y=0, color='black', linewidth=0.5)
axes[1, 1].axvline(x=0, color='black', linewidth=0.5)

# Tanh
axes[0, 2].plot(z, tanh(z), linewidth=3, color='#2E86AB')
axes[0, 2].set_title('Tanh: f(z) = (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ)', fontsize=12, fontweight='bold')
axes[0, 2].grid(True, alpha=0.3)
axes[0, 2].axhline(y=0, color='black', linewidth=0.5)
axes[0, 2].axvline(x=0, color='black', linewidth=0.5)

axes[1, 2].plot(z, tanh_derivative(z), linewidth=3, color='#A23B72')
axes[1, 2].set_title('Tanh Derivative', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('z', fontsize=11)
axes[1, 2].grid(True, alpha=0.3)
axes[1, 2].axhline(y=0, color='black', linewidth=0.5)
axes[1, 2].axvline(x=0, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()

## Step 2: Generate Non-Linear Data

Let's create data that **cannot** be separated by a straight line!

In [None]:
# Generate two types of non-linear datasets
X_moons, y_moons = make_moons(n_samples=500, noise=0.1, random_state=42)
X_circles, y_circles = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)

# We'll use moons dataset
X, y = X_moons, y_moons

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Visualize datasets
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Moons dataset
axes[0].scatter(X_moons[y_moons == 0, 0], X_moons[y_moons == 0, 1],
                label='Class 0', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
axes[0].scatter(X_moons[y_moons == 1, 0], X_moons[y_moons == 1, 1],
                label='Class 1', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Moons Dataset (Non-Linear)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Circles dataset
axes[1].scatter(X_circles[y_circles == 0, 0], X_circles[y_circles == 0, 1],
                label='Class 0', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
axes[1].scatter(X_circles[y_circles == 1, 0], X_circles[y_circles == 1, 1],
                label='Class 1', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title('Circles Dataset (Non-Linear)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X.shape[1]}")

## Step 3: Understanding Backpropagation

**Backpropagation** = computing gradients for all layers using the **chain rule**.

### Forward Pass (recap):

1. Hidden layer: $z_1 = XW_1 + b_1$, $a_1 = \sigma(z_1)$
2. Output layer: $z_2 = a_1W_2 + b_2$, $\hat{y} = \sigma(z_2)$
3. Loss: $L = -\frac{1}{n} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$
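
For a feel of how the loss in item 3 behaves, here is a per-sample sketch (the helper `bce` is a hypothetical name, not part of the notebook's class). A confident correct prediction costs little; a confident wrong one costs a lot.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(f"{bce(1, 0.9):.4f}")  # 0.1054  (true label 1, predicted 0.9)
print(f"{bce(1, 0.1):.4f}")  # 2.3026  (true label 1, predicted 0.1)
```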

### Backward Pass (computing gradients):

We need gradients for: $W_1, b_1, W_2, b_2$

**Start from the output and work backwards!**

#### Layer 2 (Output Layer) Gradients:

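**Step 1:** Gradient of loss w.r.t. output (written per sample; the $\frac{1}{n}$ averaging factor from $L$ is applied when forming the weight and bias gradients in Steps 3 and 6)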
$$\frac{\partial L}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right)$$

**Step 2:** Gradient w.r.t. $z_2$ (before activation)
$$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z_2} = \frac{\partial L}{\partial \hat{y}} \times \sigma'(z_2)$$

For binary cross-entropy + sigmoid: $$\boxed{\frac{\partial L}{\partial z_2} = \hat{y} - y}$$
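
The simplification follows by substituting $\sigma'(z_2) = \hat{y}(1 - \hat{y})$ into the chain-rule product from Steps 1 and 2 (per sample):

$$\frac{\partial L}{\partial z_2} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -\left[y(1-\hat{y}) - (1-y)\hat{y}\right] = \hat{y} - y$$

The denominators cancel against $\hat{y}(1-\hat{y})$, and the two $y\hat{y}$ terms cancel each other.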

**Step 3:** Gradients for $W_2$ and $b_2$
$$\boxed{\frac{\partial L}{\partial W_2} = \frac{1}{n} a_1^T(\hat{y} - y)}$$
$$\boxed{\frac{\partial L}{\partial b_2} = \frac{1}{n} \sum (\hat{y} - y)}$$

#### Layer 1 (Hidden Layer) Gradients:

This is where backpropagation really shines!

**Step 4:** Gradient w.r.t. $a_1$ (hidden layer activations)
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} \times \frac{\partial z_2}{\partial a_1}$$

Since $z_2 = a_1W_2 + b_2$, we have $\frac{\partial z_2}{\partial a_1} = W_2$, which gives

$$\boxed{\frac{\partial L}{\partial a_1} = (\hat{y} - y)W_2^T}$$

**Step 5:** Gradient w.r.t. $z_1$ (before activation)
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} = \frac{\partial L}{\partial a_1} \times \sigma'(z_1)$$

$$\boxed{\frac{\partial L}{\partial z_1} = [(\hat{y} - y)W_2^T] \odot \sigma'(z_1)}$$

($\odot$ means element-wise multiplication)

**Step 6:** Gradients for $W_1$ and $b_1$
$$\boxed{\frac{\partial L}{\partial W_1} = \frac{1}{n} X^T\left(\frac{\partial L}{\partial z_1}\right)}$$
$$\boxed{\frac{\partial L}{\partial b_1} = \frac{1}{n} \sum \left(\frac{\partial L}{\partial z_1}\right)}$$
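
One way to gain confidence in the boxed formulas is a finite-difference gradient check. The sketch below is an assumption-laden illustration, not part of the notebook's class: it uses a tiny random network with a tanh hidden layer (chosen because it is smooth everywhere), and the names `loss_fn`, `analytic_grads`, and `numeric_grads` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_in, n_hid = 6, 2, 3
X_demo = rng.normal(size=(n, n_in))
y_demo = rng.integers(0, 2, size=(n, 1)).astype(float)

params = {
    "W1": rng.normal(size=(n_in, n_hid)) * 0.5,
    "b1": np.zeros((1, n_hid)),
    "W2": rng.normal(size=(n_hid, 1)) * 0.5,
    "b2": np.zeros((1, 1)),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(p):
    """Mean binary cross-entropy of the 1-hidden-layer network."""
    a1 = np.tanh(X_demo @ p["W1"] + p["b1"])
    y_hat = sigmoid(a1 @ p["W2"] + p["b2"])
    return -np.mean(y_demo * np.log(y_hat) + (1 - y_demo) * np.log(1 - y_hat))

def analytic_grads(p):
    """Gradients from the boxed backpropagation formulas."""
    a1 = np.tanh(X_demo @ p["W1"] + p["b1"])
    y_hat = sigmoid(a1 @ p["W2"] + p["b2"])
    dz2 = y_hat - y_demo
    dz1 = (dz2 @ p["W2"].T) * (1 - a1 ** 2)  # tanh'(z1) = 1 - a1^2
    return {
        "W1": X_demo.T @ dz1 / n,
        "b1": dz1.sum(axis=0, keepdims=True) / n,
        "W2": a1.T @ dz2 / n,
        "b2": dz2.sum(axis=0, keepdims=True) / n,
    }

def numeric_grads(p, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            loss_plus = loss_fn(p)
            value[idx] = old - eps
            loss_minus = loss_fn(p)
            value[idx] = old  # restore the original entry
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads[name] = g
    return grads

analytic = analytic_grads(params)
numeric = numeric_grads(params)
max_diff = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two sets of gradients agree to many decimal places, the formulas and their implementation are very likely consistent with the loss.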

### Key Insights:

1. **Chain Rule**: Errors propagate backwards through layers
2. **Reuse Computations**: We store $z_1, a_1, z_2$ from forward pass
3. **Matrix Operations**: All gradients computed efficiently with matrices
4. **Same Pattern**: Each layer follows same gradient pattern

### Backpropagation in Plain English:

1. **Output layer error**: How wrong were we? $(\hat{y} - y)$
2. **Output layer updates**: Adjust $W_2$ and $b_2$ to fix this error
3. **Propagate error back**: How much did hidden layer contribute? (error $\times W_2^T$)
4. **Hidden layer error**: Scale by the activation derivative (with ReLU, only neurons that were active, $z_1 > 0$, pass any error back)
5. **Hidden layer updates**: Adjust $W_1$ and $b_1$ to fix propagated error

## Step 4: Implement MLP with 1 Hidden Layer

In [None]:
class MLP_1Hidden:
    """Multi-Layer Perceptron with 1 Hidden Layer"""
    
    def __init__(self, n_input, n_hidden, n_output, learning_rate=0.1, n_iterations=1000, activation='relu'):
        self.n_input = n_input
        self.n_hidden = n_hidden
        self.n_output = n_output
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.activation = activation
        
        # Initialize weights (He initialization for ReLU, Xavier for others)
        if activation == 'relu':
            self.W1 = np.random.randn(n_input, n_hidden) * np.sqrt(2.0 / n_input)
            self.W2 = np.random.randn(n_hidden, n_output) * np.sqrt(2.0 / n_hidden)
        else:
            self.W1 = np.random.randn(n_input, n_hidden) * np.sqrt(1.0 / n_input)
            self.W2 = np.random.randn(n_hidden, n_output) * np.sqrt(1.0 / n_hidden)
        
        self.b1 = np.zeros((1, n_hidden))
        self.b2 = np.zeros((1, n_output))
        
        self.cost_history = []
        
        # Select activation functions
        if activation == 'relu':
            self.act_fn = relu
            self.act_derivative = relu_derivative
        elif activation == 'tanh':
            self.act_fn = tanh
            self.act_derivative = tanh_derivative
        else:  # sigmoid
            self.act_fn = sigmoid
            self.act_derivative = sigmoid_derivative
    
    def forward(self, X):
        """Forward pass - compute predictions"""
        # Layer 1: Input → Hidden
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.act_fn(self.z1)
        
        # Layer 2: Hidden → Output
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = sigmoid(self.z2)  # Sigmoid for binary classification
        
        return self.a2
    
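    def compute_cost(self, y, y_pred):
        """Binary cross-entropy loss"""
        n = len(y)
        # Match shapes: y arrives as (n,) but y_pred is (n, 1). Without this,
        # broadcasting would build an (n, n) matrix and report a wrong cost.
        y = y.reshape(y_pred.shape)
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        cost = -(1/n) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
        return cost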
    
    def backward(self, X, y):
        """Backward pass - compute gradients"""
        n = len(y)
        y = y.reshape(-1, 1)  # Ensure correct shape
        
        # Output layer gradients
        dz2 = self.a2 - y  # Shape: (n, 1)
        dW2 = (1/n) * (self.a1.T @ dz2)  # Shape: (n_hidden, n_output)
        db2 = (1/n) * np.sum(dz2, axis=0, keepdims=True)  # Shape: (1, n_output)
        
        # Hidden layer gradients
        da1 = dz2 @ self.W2.T  # Shape: (n, n_hidden)
        dz1 = da1 * self.act_derivative(self.z1)  # Element-wise multiplication
        dW1 = (1/n) * (X.T @ dz1)  # Shape: (n_input, n_hidden)
        db1 = (1/n) * np.sum(dz1, axis=0, keepdims=True)  # Shape: (1, n_hidden)
        
        return dW1, db1, dW2, db2
    
    def fit(self, X, y, verbose=True):
        """Train the model"""
        for i in range(self.n_iterations):
            # Forward pass
            y_pred = self.forward(X)
            
            # Compute cost
            cost = self.compute_cost(y, y_pred)
            self.cost_history.append(cost)
            
            # Backward pass
            dW1, db1, dW2, db2 = self.backward(X, y)
            
            # Update weights
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
            
            if verbose and (i % 100 == 0 or i == self.n_iterations - 1):
                print(f"Iteration {i:4d} | Cost: {cost:.6f}")
    
    def predict_proba(self, X):
        """Predict probabilities"""
        return self.forward(X)
    
    def predict(self, X, threshold=0.5):
        """Predict class labels"""
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).ravel()

## Step 5: Train the Model

In [None]:
# Create and train model
n_input = X_train.shape[1]  # 2 features
n_hidden = 10  # 10 neurons in hidden layer
n_output = 1  # Binary classification

model = MLP_1Hidden(
    n_input=n_input,
    n_hidden=n_hidden,
    n_output=n_output,
    learning_rate=0.5,
    n_iterations=2000,
    activation='relu'
)

print(f"Network Architecture:")
print(f"Input Layer:  {n_input} neurons")
print(f"Hidden Layer: {n_hidden} neurons (ReLU activation)")
print(f"Output Layer: {n_output} neuron (Sigmoid activation)")
print(f"\nTotal Parameters: {model.W1.size + model.b1.size + model.W2.size + model.b2.size}")
print(f"  W1: {model.W1.shape}, b1: {model.b1.shape}")
print(f"  W2: {model.W2.shape}, b2: {model.b2.shape}")
print()

model.fit(X_train, y_train)

## Step 6: Evaluate and Visualize

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate accuracy
train_accuracy = np.mean(y_train_pred == y_train)
test_accuracy = np.mean(y_test_pred == y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:     {test_accuracy:.4f}")

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

print("\nConfusion Matrix (Test):")
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

print("\nClassification Report (Test):")
print(classification_report(y_test, y_test_pred))

In [None]:
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Cost history
axes[0, 0].plot(model.cost_history, linewidth=2, color='#2E86AB')
axes[0, 0].set_xlabel('Iteration', fontsize=12)
axes[0, 0].set_ylabel('Cost (Binary Cross-Entropy)', fontsize=12)
axes[0, 0].set_title('Learning Curve', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

contour = axes[0, 1].contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
axes[0, 1].contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
axes[0, 1].scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
                   label='Class 0', alpha=0.8, s=50, edgecolors='black', linewidth=0.5)
axes[0, 1].scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
                   label='Class 1', alpha=0.8, s=50, edgecolors='black', linewidth=0.5)
axes[0, 1].set_xlabel('Feature 1', fontsize=12)
axes[0, 1].set_ylabel('Feature 2', fontsize=12)
axes[0, 1].set_title('Decision Boundary (Training Data)', fontsize=14, fontweight='bold')
axes[0, 1].legend()
plt.colorbar(contour, ax=axes[0, 1], label='P(y=1)')

# Plot 3: Confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0],
            xticklabels=['Pred 0', 'Pred 1'],
            yticklabels=['True 0', 'True 1'])
axes[1, 0].set_title('Confusion Matrix (Test)', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('True Label', fontsize=12)
axes[1, 0].set_xlabel('Predicted Label', fontsize=12)

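# Plot 4: Hidden layer activations
# Visualize what hidden neurons learned
# Re-run the forward pass on the training data first: the decision-boundary
# plot above called predict_proba on the mesh grid, which overwrote model.a1
model.forward(X_train)
hidden_activations = model.a1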
axes[1, 1].imshow(hidden_activations[:50].T, aspect='auto', cmap='viridis', interpolation='nearest')
axes[1, 1].set_xlabel('Sample Index (first 50)', fontsize=12)
axes[1, 1].set_ylabel('Hidden Neuron', fontsize=12)
axes[1, 1].set_title('Hidden Layer Activations', fontsize=14, fontweight='bold')
plt.colorbar(axes[1, 1].images[0], ax=axes[1, 1], label='Activation Value')

plt.tight_layout()
plt.show()

## Step 7: Compare Different Activations

In [None]:
# Train models with different activations
activations = ['sigmoid', 'tanh', 'relu']
models = {}

for act in activations:
    print(f"\nTraining with {act} activation...")
    model_act = MLP_1Hidden(
        n_input=n_input,
        n_hidden=n_hidden,
        n_output=n_output,
        learning_rate=0.5,
        n_iterations=2000,
        activation=act
    )
    model_act.fit(X_train, y_train, verbose=False)
    models[act] = model_act
    
    # Evaluate
    train_acc = np.mean(model_act.predict(X_train) == y_train)
    test_acc = np.mean(model_act.predict(X_test) == y_test)
    print(f"  Train Accuracy: {train_acc:.4f}")
    print(f"  Test Accuracy:  {test_acc:.4f}")

In [None]:
# Compare learning curves
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
for act, m in models.items():  # use `m` so the earlier `model` variable is not overwritten
    plt.plot(m.cost_history, label=act, linewidth=2, alpha=0.8)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('Learning Curves: Different Activations', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# Compare final accuracies
plt.subplot(1, 2, 2)
train_accs = [np.mean(models[act].predict(X_train) == y_train) for act in activations]
test_accs = [np.mean(models[act].predict(X_test) == y_test) for act in activations]

x_pos = np.arange(len(activations))
width = 0.35
plt.bar(x_pos - width/2, train_accs, width, label='Train', alpha=0.8)
plt.bar(x_pos + width/2, test_accs, width, label='Test', alpha=0.8)
plt.xlabel('Activation Function', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy Comparison', fontsize=14, fontweight='bold')
plt.xticks(x_pos, activations)
plt.ylim(0.7, 1.0)
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## Key Takeaways

### What We Learned:

1. **Neural Networks = Stacked Transformations**
   - Input → Hidden Layer → Output
   - Each layer: linear transformation + non-linear activation
   - Can learn non-linear decision boundaries!

2. **Activation Functions are Crucial**
   - Without them: multiple layers = single layer (still linear)
   - **ReLU**: $f(z) = \max(0, z)$ - most popular for hidden layers
   - **Sigmoid**: $\sigma(z) = \frac{1}{1+e^{-z}}$ - good for output layer (probabilities)
   - **Tanh**: similar to sigmoid but centered at 0

3. **Forward Pass**
   - Layer 1: $z_1 = XW_1 + b_1$, $a_1 = \text{activation}(z_1)$
   - Layer 2: $z_2 = a_1W_2 + b_2$, $\hat{y} = \sigma(z_2)$
   - Store $z_1, a_1, z_2$ for backward pass!

4. **Backpropagation = Chain Rule**
   
   **Output Layer:**
   - $\frac{\partial L}{\partial z_2} = \hat{y} - y$ (error)
   - $\frac{\partial L}{\partial W_2} = \frac{1}{n} a_1^T(\hat{y} - y)$
   - $\frac{\partial L}{\partial b_2} = \frac{1}{n} \sum (\hat{y} - y)$
   
   **Hidden Layer:**
   - $\frac{\partial L}{\partial a_1} = (\hat{y} - y)W_2^T$ (propagate error back)
   - $\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \text{activation}'(z_1)$ (scale by derivative)
   - $\frac{\partial L}{\partial W_1} = \frac{1}{n} X^T\left(\frac{\partial L}{\partial z_1}\right)$
   - $\frac{\partial L}{\partial b_1} = \frac{1}{n} \sum \left(\frac{\partial L}{\partial z_1}\right)$

5. **The Flow of Backpropagation:**
   ```
   Forward:  X → z₁ → a₁ → z₂ → ŷ → Loss
   Backward: ∂W₁, ∂b₁ ← ∂z₁ ← ∂a₁ ← ∂z₂ = (ŷ-y) ← Loss
   ```

6. **Weight Initialization Matters**
   - All zeros → all neurons learn same thing
   - Too large → exploding gradients
   - Too small → vanishing gradients
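   - **He initialization** (for ReLU): $W \sim \mathcal{N}(0, 2/n_{in})$, i.e. standard deviation $\sqrt{2/n_{in}}$, where $n_{in}$ is the number of inputs to the layer
   - **Xavier-style initialization** (for sigmoid/tanh), in the simplified form used in the code: $W \sim \mathcal{N}(0, 1/n_{in})$, i.e. standard deviation $\sqrt{1/n_{in}}$ (the original Glorot formulation uses variance $2/(n_{in} + n_{out})$)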

7. **Hidden Layers Learn Features**
   - Each hidden neuron learns to detect a pattern
   - Output layer combines these patterns
   - More neurons → more complex patterns (and a higher risk of overfitting)
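
The symmetry problem from item 6 can be seen directly in the gradients. The sketch below is illustrative only: it assumes a tanh hidden layer with zero biases and a hypothetical helper `hidden_weight_grad`. When all weights start at the same constant, every hidden neuron receives exactly the same gradient, so the neurons can never become different from each other.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 2))
y_demo = rng.integers(0, 2, size=(8, 1)).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_grad(W1, W2):
    """Return dL/dW1 for a tanh hidden layer with zero biases."""
    a1 = np.tanh(X_demo @ W1)
    y_hat = sigmoid(a1 @ W2)
    dz2 = y_hat - y_demo
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
    return X_demo.T @ dz1 / len(X_demo)

# Constant initialization: every hidden neuron starts out identical
dW1_const = hidden_weight_grad(np.full((2, 3), 0.5), np.full((3, 1), 0.5))

# Random initialization scaled by sqrt(2 / n_in), as in the notebook's class
W1_rand = rng.normal(size=(2, 3)) * np.sqrt(2.0 / 2)
W2_rand = rng.normal(size=(3, 1)) * np.sqrt(2.0 / 3)
dW1_rand = hidden_weight_grad(W1_rand, W2_rand)

# Each column of dW1 is the gradient for one hidden neuron
print("constant init, identical columns:", np.allclose(dW1_const, dW1_const[:, :1]))
print("random init gradient:\n", dW1_rand)
```

With the constant start the first line prints `True`; random initialization is what lets the columns, and therefore the neurons, differ.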

### Mathematical Beauty:

The gradient formulas might look complex, but they follow a beautiful pattern:

**For any layer:**
1. Get error from next layer
2. Multiply by activation derivative (element-wise)
3. Compute weight gradient: input$^T$ @ error
4. Compute bias gradient: sum of errors
5. Pass error to previous layer: error @ weights$^T$

This pattern works for any number of layers!
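
The five steps can be sketched as one reusable function. `layer_backward` is a hypothetical helper, not part of the notebook's class, and it averages over the batch with $\frac{1}{n}$ in the same way the class does. The tiny example uses a single ReLU unit whose second sample is inactive, so that sample contributes nothing to the gradients.

```python
import numpy as np

def layer_backward(d_a, z, a_prev, W, act_derivative):
    """Backward step for one dense layer computing a = act(a_prev @ W + b).

    d_a:    error arriving from the next layer (step 1), shape (n, n_units)
    z:      pre-activations stored in the forward pass, shape (n, n_units)
    a_prev: the layer's input, shape (n, n_prev)
    """
    n = a_prev.shape[0]
    d_z = d_a * act_derivative(z)             # step 2: scale by derivative
    d_W = a_prev.T @ d_z / n                  # step 3: weight gradient
    d_b = d_z.sum(axis=0, keepdims=True) / n  # step 4: bias gradient
    d_a_prev = d_z @ W.T                      # step 5: error for previous layer
    return d_W, d_b, d_a_prev

def relu_derivative(z):
    return (z > 0).astype(float)

# 2 samples, 2 inputs, 1 ReLU unit
a_prev = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
W = np.array([[1.0],
              [-1.0]])
z = a_prev @ W                     # [[1.], [-2.]]: second sample is inactive
d_a = np.array([[1.0],
                [1.0]])

d_W, d_b, d_a_prev = layer_backward(d_a, z, a_prev, W, relu_derivative)
print(d_W.ravel())   # [1.  0.5]
print(d_b.ravel())   # [0.5]
print(d_a_prev)      # first row [1, -1], second row zeros
```

Calling this function once per layer, from the output back to the input, gives the gradients for a network of any depth.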

### Next Steps:

- Add more hidden layers (deep learning!)
- Different architectures
- Regularization techniques
- More complex activations and losses