# Multi-Layer Perceptron with 3 Hidden Layers

In this notebook, we'll implement a deeper neural network with **3 hidden layers**. This demonstrates how the backpropagation algorithm scales to deeper architectures.

## Network Architecture

Our network will have:
- **Input layer**: $n$ features
- **Hidden layer 1**: $h_1$ neurons with activation function $\sigma_1$
- **Hidden layer 2**: $h_2$ neurons with activation function $\sigma_2$
- **Hidden layer 3**: $h_3$ neurons with activation function $\sigma_3$
- **Output layer**: 1 neuron (binary classification with sigmoid)
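
To make the architecture concrete, the sketch below lists the weight and bias shapes each layer needs and counts the trainable parameters. The helper names `layer_shapes` and `count_parameters` are hypothetical, and the sizes (2 inputs, hidden widths 16, 12, 8) are only illustrative.

```python
def layer_shapes(n, h1, h2, h3, out=1):
    """Return (weight_shape, bias_shape) for each of the four layers."""
    sizes = [n, h1, h2, h3, out]
    return [((fan_in, fan_out), (1, fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]


def count_parameters(n, h1, h2, h3, out=1):
    """Total number of trainable weights and biases."""
    return sum(fan_in * fan_out + fan_out
               for (fan_in, fan_out), _ in layer_shapes(n, h1, h2, h3, out))


shapes = layer_shapes(2, 16, 12, 8)
total = count_parameters(2, 16, 12, 8)

for index, (w_shape, b_shape) in enumerate(shapes, start=1):
    print(f"Layer {index}: W {w_shape}, b {b_shape}")
print(f"Total parameters: {total}")  # 48 + 204 + 104 + 9 = 365
```

Each weight matrix has shape (inputs to the layer, neurons in the layer), so consecutive shapes chain together when multiplied from the left by the data matrix.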

## Why Deeper Networks?

Deeper networks can learn hierarchical features, with each layer building on the one before it. In an image model, for example:
- **Layer 1**: Simple patterns (edges, basic shapes)
- **Layer 2**: Combinations of simple patterns
- **Layer 3**: High-level features
- **Output**: Final decision

This is the foundation of deep learning!

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Forward Propagation

Let's denote:
- $X$: input matrix (shape: $m \times n$)
- $W^{(1)}$, $b^{(1)}$: weights and bias for layer 1
- $W^{(2)}$, $b^{(2)}$: weights and bias for layer 2
- $W^{(3)}$, $b^{(3)}$: weights and bias for layer 3
- $W^{(4)}$, $b^{(4)}$: weights and bias for output layer

### Forward Pass Equations:

**Layer 1:**
$$z^{(1)} = XW^{(1)} + b^{(1)}$$
$$a^{(1)} = \sigma_1(z^{(1)})$$

**Layer 2:**
$$z^{(2)} = a^{(1)}W^{(2)} + b^{(2)}$$
$$a^{(2)} = \sigma_2(z^{(2)})$$

**Layer 3:**
$$z^{(3)} = a^{(2)}W^{(3)} + b^{(3)}$$
$$a^{(3)} = \sigma_3(z^{(3)})$$

**Output Layer:**
$$z^{(4)} = a^{(3)}W^{(4)} + b^{(4)}$$
$$\hat{y} = \text{sigmoid}(z^{(4)})$$
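
As a quick shape check, the forward-pass equations can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: the layer sizes are arbitrary, `tanh` stands in for $\sigma_1$, $\sigma_2$, $\sigma_3$, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, h1, h2, h3 = 5, 2, 4, 3, 2   # illustrative sizes (assumption)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


X = rng.normal(size=(m, n))
sizes = [n, h1, h2, h3, 1]
# One (W, b) pair per layer; W has shape (fan_in, fan_out), b has shape (1, fan_out)
params = [(rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in), np.zeros((1, fan_out)))
          for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]

a = X
for layer, (W, b) in enumerate(params, start=1):
    z = a @ W + b                                   # z^(l) = a^(l-1) W^(l) + b^(l)
    a = sigmoid(z) if layer == 4 else np.tanh(z)    # sigmoid only at the output
    print(f"layer {layer}: z shape {z.shape}")

y_hat = a   # shape (m, 1), one probability per example
```

The bias of shape `(1, fan_out)` is broadcast across the `m` rows, which is why the equations can add $b^{(l)}$ directly to a matrix product.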

### Loss Function (Binary Cross-Entropy):

$$L = -\frac{1}{m}\sum_{i=1}^{m}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
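
To see how the loss behaves, the sketch below evaluates it on two hand-made sets of predictions. The function name `binary_cross_entropy` and the toy values are illustrative assumptions.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))


y_true = np.array([[1.0], [0.0], [1.0]])
confident_right = np.array([[0.9], [0.1], [0.9]])
confident_wrong = np.array([[0.1], [0.9], [0.1]])

loss_right = binary_cross_entropy(y_true, confident_right)   # -log(0.9), about 0.105
loss_wrong = binary_cross_entropy(y_true, confident_wrong)   # -log(0.1), about 2.303
print(f"confident and right: {loss_right:.3f}")
print(f"confident and wrong: {loss_wrong:.3f}")
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the property that makes this loss useful for classification.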

## 2. Backpropagation for 3 Hidden Layers

Working backwards from the output:

### Output Layer Gradients:

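For a single example $i$, with $L_i$ the $i$-th term of the sum in the loss:

$$\boxed{\frac{\partial L_i}{\partial z^{(4)}_i} = \hat{y}_i - y_i}$$

Let $\delta^{(4)} = \hat{y} - y$ (shape: $m \times 1$) stack these per-example errors. The factor $\frac{1}{m}$ from averaging over the batch appears explicitly in the weight and bias gradients.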

$$\boxed{\frac{\partial L}{\partial W^{(4)}} = \frac{1}{m}(a^{(3)})^T \delta^{(4)}}$$

$$\boxed{\frac{\partial L}{\partial b^{(4)}} = \frac{1}{m}\sum_{i=1}^{m}\delta^{(4)}_i}$$
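
The identity $\hat{y} - y$ for the output error can be sanity-checked numerically. The sketch below compares it with a central finite difference of the per-example cross-entropy. The helper `sample_loss` and the test values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample_loss(z, y):
    """Cross-entropy of each example as a function of its pre-activation z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


z = np.array([-2.0, -0.5, 0.3, 1.7])
y = np.array([0.0, 1.0, 1.0, 0.0])
eps = 1e-6

# Central difference approximates d(loss_i)/d(z_i) for every example at once
numeric = (sample_loss(z + eps, y) - sample_loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y

max_abs_diff = np.max(np.abs(numeric - analytic))
print(f"largest difference: {max_abs_diff:.2e}")
```

The check is per example. Averaging over the batch contributes the extra factor of $\frac{1}{m}$ seen in the weight and bias gradients.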

### Hidden Layer 3 Gradients:

$$\frac{\partial L}{\partial a^{(3)}} = \delta^{(4)}(W^{(4)})^T$$

$$\delta^{(3)} = \frac{\partial L}{\partial a^{(3)}} \odot \sigma_3'(z^{(3)})$$

$$\boxed{\frac{\partial L}{\partial W^{(3)}} = \frac{1}{m}(a^{(2)})^T \delta^{(3)}}$$

$$\boxed{\frac{\partial L}{\partial b^{(3)}} = \frac{1}{m}\sum_{i=1}^{m}\delta^{(3)}_i}$$

### Hidden Layer 2 Gradients:

$$\frac{\partial L}{\partial a^{(2)}} = \delta^{(3)}(W^{(3)})^T$$

$$\delta^{(2)} = \frac{\partial L}{\partial a^{(2)}} \odot \sigma_2'(z^{(2)})$$

$$\boxed{\frac{\partial L}{\partial W^{(2)}} = \frac{1}{m}(a^{(1)})^T \delta^{(2)}}$$

$$\boxed{\frac{\partial L}{\partial b^{(2)}} = \frac{1}{m}\sum_{i=1}^{m}\delta^{(2)}_i}$$

### Hidden Layer 1 Gradients:

$$\frac{\partial L}{\partial a^{(1)}} = \delta^{(2)}(W^{(2)})^T$$

$$\delta^{(1)} = \frac{\partial L}{\partial a^{(1)}} \odot \sigma_1'(z^{(1)})$$

$$\boxed{\frac{\partial L}{\partial W^{(1)}} = \frac{1}{m}X^T \delta^{(1)}}$$

$$\boxed{\frac{\partial L}{\partial b^{(1)}} = \frac{1}{m}\sum_{i=1}^{m}\delta^{(1)}_i}$$

### Pattern Recognition:

Notice the **universal pattern** for any layer $l$:
1. Compute error signal: $\delta^{(l)} = \frac{\partial L}{\partial a^{(l)}} \odot \sigma'(z^{(l)})$
2. Weight gradient: $\frac{\partial L}{\partial W^{(l)}} = \frac{1}{m}(a^{(l-1)})^T \delta^{(l)}$
3. Bias gradient: $\frac{\partial L}{\partial b^{(l)}} = \frac{1}{m}\sum \delta^{(l)}$
4. Propagate back: $\frac{\partial L}{\partial a^{(l-1)}} = \delta^{(l)}(W^{(l)})^T$

This pattern allows us to build networks of **any depth**!
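
One way to see that the four steps generalize is to write them as a loop over layers and compare the result with a finite-difference estimate. This is a minimal sketch, not the class used in this notebook: it assumes `tanh` hidden activations, a sigmoid output, and hypothetical helper names (`forward`, `backward`, `bce`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, weights, biases):
    """Return pre-activations and activations; activations[0] is X."""
    activations, pre_activations = [X], []
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = activations[-1] @ W + b
        pre_activations.append(z)
        activations.append(sigmoid(z) if l == last else np.tanh(z))
    return pre_activations, activations


def backward(y, weights, pre_activations, activations):
    """Apply the four-step pattern layer by layer, from output to input."""
    m = y.shape[0]
    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    delta = activations[-1] - y                              # output error: y_hat - y
    for l in reversed(range(len(weights))):
        grads_W[l] = activations[l].T @ delta / m            # weight gradient
        grads_b[l] = delta.sum(axis=0, keepdims=True) / m    # bias gradient
        if l > 0:
            dA = delta @ weights[l].T                        # propagate back
            delta = dA * (1 - np.tanh(pre_activations[l - 1]) ** 2)  # error signal
    return grads_W, grads_b


def bce(y, y_hat):
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))


rng = np.random.default_rng(0)
sizes = [2, 5, 4, 3, 1]   # any number of hidden layers works here
weights = [rng.normal(size=(i, o)) / np.sqrt(i) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, o)) for o in sizes[1:]]
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=(6, 1)).astype(float)

zs, acts = forward(X, weights, biases)
grads_W, grads_b = backward(y, weights, zs, acts)

# Finite-difference check on one entry of the first weight matrix
eps = 1e-6
weights[0][0, 0] += eps
loss_plus = bce(y, forward(X, weights, biases)[1][-1])
weights[0][0, 0] -= 2 * eps
loss_minus = bce(y, forward(X, weights, biases)[1][-1])
weights[0][0, 0] += eps   # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * eps)
gradient_error = abs(numeric - grads_W[0][0, 0])
print(f"gradient check error: {gradient_error:.2e}")
```

Changing `sizes` to add or remove hidden layers requires no change to `backward`, which is the practical meaning of the pattern.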

In [None]:
# Activation functions and their derivatives
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

In [None]:
class MLP_3Hidden:
    """
    Multi-Layer Perceptron with 3 hidden layers.
    
    Architecture:
    Input -> Hidden1 -> Hidden2 -> Hidden3 -> Output
    """
    
    def __init__(self, input_size, hidden1_size, hidden2_size, hidden3_size, 
                 output_size=1, activation='relu', learning_rate=0.01):
        """
        Initialize the network.
        
        Parameters:
        -----------
        input_size : int
            Number of input features
        hidden1_size : int
            Number of neurons in first hidden layer
        hidden2_size : int
            Number of neurons in second hidden layer
        hidden3_size : int
            Number of neurons in third hidden layer
        output_size : int
            Number of output neurons (default: 1 for binary classification)
        activation : str
            Activation function: 'relu', 'sigmoid', or 'tanh'
        learning_rate : float
            Learning rate for gradient descent
        """
        self.lr = learning_rate
        
        # Set activation function
        if activation == 'relu':
            self.activation = relu
            self.act_derivative = relu_derivative
            init_scale = np.sqrt(2.0)  # He initialization
        elif activation == 'sigmoid':
            self.activation = sigmoid
            self.act_derivative = sigmoid_derivative
            init_scale = 1.0  # 1/sqrt(fan_in) scaling (LeCun-style, often loosely called Xavier)
        elif activation == 'tanh':
            self.activation = tanh
            self.act_derivative = tanh_derivative
            init_scale = 1.0  # 1/sqrt(fan_in) scaling (LeCun-style, often loosely called Xavier)
        else:
            raise ValueError("Activation must be 'relu', 'sigmoid', or 'tanh'")
        
        # Initialize weights with appropriate scaling
        self.W1 = np.random.randn(input_size, hidden1_size) * init_scale / np.sqrt(input_size)
        self.b1 = np.zeros((1, hidden1_size))
        
        self.W2 = np.random.randn(hidden1_size, hidden2_size) * init_scale / np.sqrt(hidden1_size)
        self.b2 = np.zeros((1, hidden2_size))
        
        self.W3 = np.random.randn(hidden2_size, hidden3_size) * init_scale / np.sqrt(hidden2_size)
        self.b3 = np.zeros((1, hidden3_size))
        
        self.W4 = np.random.randn(hidden3_size, output_size) * init_scale / np.sqrt(hidden3_size)
        self.b4 = np.zeros((1, output_size))
        
        # For storing intermediate values during forward pass
        self.z1 = None
        self.a1 = None
        self.z2 = None
        self.a2 = None
        self.z3 = None
        self.a3 = None
        self.z4 = None
        self.a4 = None
        
    def forward(self, X):
        """
        Forward propagation through the network.
        
        Parameters:
        -----------
        X : ndarray, shape (m, n)
            Input data
            
        Returns:
        --------
        predictions : ndarray, shape (m, 1)
            Network predictions
        """
        # Layer 1
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.activation(self.z1)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.activation(self.z2)
        
        # Layer 3
        self.z3 = self.a2 @ self.W3 + self.b3
        self.a3 = self.activation(self.z3)
        
        # Output layer (sigmoid for binary classification)
        self.z4 = self.a3 @ self.W4 + self.b4
        self.a4 = sigmoid(self.z4)
        
        return self.a4
    
    def backward(self, X, y):
        """
        Backward propagation: compute gradients and apply one gradient-descent update.
        
        Parameters:
        -----------
        X : ndarray, shape (m, n)
            Input data
        y : ndarray, shape (m, 1)
            True labels
        """
        m = X.shape[0]
        
        # Output layer gradients (δ⁴ = ŷ - y)
        delta4 = self.a4 - y
        dW4 = (1/m) * (self.a3.T @ delta4)
        db4 = (1/m) * np.sum(delta4, axis=0, keepdims=True)
        
        # Hidden layer 3 gradients
        delta3 = (delta4 @ self.W4.T) * self.act_derivative(self.z3)
        dW3 = (1/m) * (self.a2.T @ delta3)
        db3 = (1/m) * np.sum(delta3, axis=0, keepdims=True)
        
        # Hidden layer 2 gradients
        delta2 = (delta3 @ self.W3.T) * self.act_derivative(self.z2)
        dW2 = (1/m) * (self.a1.T @ delta2)
        db2 = (1/m) * np.sum(delta2, axis=0, keepdims=True)
        
        # Hidden layer 1 gradients
        delta1 = (delta2 @ self.W2.T) * self.act_derivative(self.z1)
        dW1 = (1/m) * (X.T @ delta1)
        db1 = (1/m) * np.sum(delta1, axis=0, keepdims=True)
        
        # Update weights and biases using gradient descent
        self.W4 -= self.lr * dW4
        self.b4 -= self.lr * db4
        self.W3 -= self.lr * dW3
        self.b3 -= self.lr * db3
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
    
    def compute_loss(self, y_true, y_pred):
        """
        Compute binary cross-entropy loss.
        
        Parameters:
        -----------
        y_true : ndarray
            True labels
        y_pred : ndarray
            Predicted probabilities
            
        Returns:
        --------
        loss : float
            Binary cross-entropy loss
        """
        # Clip predictions to avoid log(0)
        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss
    
    def train(self, X, y, epochs=1000, verbose=True):
        """
        Train the network.
        
        Parameters:
        -----------
        X : ndarray, shape (m, n)
            Training data
        y : ndarray, shape (m, 1)
            Training labels
        epochs : int
            Number of training epochs
        verbose : bool
            Whether to print progress
            
        Returns:
        --------
        losses : list
            Loss history
        """
        losses = []
        
        for epoch in range(epochs):
            # Forward pass
            predictions = self.forward(X)
            
            # Compute loss
            loss = self.compute_loss(y, predictions)
            losses.append(loss)
            
            # Backward pass
            self.backward(X, y)
            
            # Print progress
            if verbose and (epoch % 100 == 0 or epoch == epochs - 1):
                print(f"Epoch {epoch}/{epochs}, Loss: {loss:.4f}")
        
        return losses
    
    def predict(self, X, threshold=0.5):
        """
        Make predictions.
        
        Parameters:
        -----------
        X : ndarray
            Input data
        threshold : float
            Decision threshold
            
        Returns:
        --------
        predictions : ndarray
            Binary predictions (0 or 1)
        """
        probabilities = self.forward(X)
        return (probabilities >= threshold).astype(int)

## 3. Training on a Complex Dataset

Let's test our network with 3 hidden layers on nested circles, a dataset that is not linearly separable and so cannot be solved by a linear classifier.

In [None]:
# Generate complex dataset (nested circles)
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.3, random_state=42)
y = y.reshape(-1, 1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Visualize dataset
plt.figure(figsize=(10, 6))
plt.scatter(X_train[y_train.ravel() == 0, 0], X_train[y_train.ravel() == 0, 1], 
           c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_train[y_train.ravel() == 1, 0], X_train[y_train.ravel() == 1, 1], 
           c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Complex Dataset: Nested Circles')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")

## 4. Train the 3-Layer Network

In [None]:
# Create and train the 3-layer MLP
mlp3 = MLP_3Hidden(
    input_size=2,
    hidden1_size=16,
    hidden2_size=12,
    hidden3_size=8,
    output_size=1,
    activation='relu',
    learning_rate=0.1
)

print("Training 3-Layer MLP...\n")
losses = mlp3.train(X_train_scaled, y_train, epochs=2000, verbose=True)

In [None]:
# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (Binary Cross-Entropy)')
plt.title('Learning Curve: 3-Layer MLP')
plt.grid(True, alpha=0.3)
plt.show()

## 5. Evaluate Performance

In [None]:
# Make predictions
y_train_pred = mlp3.predict(X_train_scaled)
y_test_pred = mlp3.predict(X_test_scaled)

# Calculate accuracy
train_accuracy = np.mean(y_train_pred == y_train) * 100
test_accuracy = np.mean(y_test_pred == y_test) * 100

print(f"Training Accuracy: {train_accuracy:.2f}%")
print(f"Test Accuracy: {test_accuracy:.2f}%")

## 6. Visualize Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, scaler, title):
    """
    Plot decision boundary for binary classification.
    """
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Scale mesh points
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    mesh_scaled = scaler.transform(mesh_points)
    
    # Predict on mesh
    Z = model.forward(mesh_scaled)
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(12, 8))
    plt.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.8)
    plt.colorbar(label='Predicted Probability')
    
    # Plot data points
    plt.scatter(X[y.ravel() == 0, 0], X[y.ravel() == 0, 1], 
               c='blue', label='Class 0', edgecolors='k', s=50)
    plt.scatter(X[y.ravel() == 1, 0], X[y.ravel() == 1, 1], 
               c='red', label='Class 1', edgecolors='k', s=50)
    
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

plot_decision_boundary(mlp3, X_test, y_test, scaler, 
                      '3-Layer MLP Decision Boundary (Test Set)')

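## 7. Comparing Networks of Different Sizes

Because `MLP_3Hidden` always has three hidden layers, the three models below differ in the width of those layers, not in depth. Let's compare them to see how performance changes with network size.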

In [None]:
# Train networks of different sizes on a second dataset
# (make_moons was already imported in the first cell)

# Generate the two-moons dataset
X_moons, y_moons = make_moons(n_samples=1000, noise=0.2, random_state=42)
y_moons = y_moons.reshape(-1, 1)

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

# All three models use MLP_3Hidden, so each has 3 hidden layers;
# they differ only in the widths of those layers.
results = {}

print("Training small MLP (16-8-4)...")
mlp1 = MLP_3Hidden(2, 16, 8, 4, 1, 'relu', 0.1)
losses1 = mlp1.train(X_train_m_scaled, y_train_m, epochs=1000, verbose=False)
acc1 = np.mean(mlp1.predict(X_test_m_scaled) == y_test_m) * 100
results['Small (16-8-4)'] = acc1
print(f"Small MLP Test Accuracy: {acc1:.2f}%\n")

print("Training medium MLP (16-12-6)...")
mlp2 = MLP_3Hidden(2, 16, 12, 6, 1, 'relu', 0.1)
losses2 = mlp2.train(X_train_m_scaled, y_train_m, epochs=1000, verbose=False)
acc2 = np.mean(mlp2.predict(X_test_m_scaled) == y_test_m) * 100
results['Medium (16-12-6)'] = acc2
print(f"Medium MLP Test Accuracy: {acc2:.2f}%\n")

print("Training large MLP (20-16-12)...")
mlp3_moons = MLP_3Hidden(2, 20, 16, 12, 1, 'relu', 0.1)
losses3 = mlp3_moons.train(X_train_m_scaled, y_train_m, epochs=1000, verbose=False)
acc3 = np.mean(mlp3_moons.predict(X_test_m_scaled) == y_test_m) * 100
results['Large (20-16-12)'] = acc3
print(f"Large MLP Test Accuracy: {acc3:.2f}%\n")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models = [mlp1, mlp2, mlp3_moons]
titles = ['Small MLP (16-8-4)', 'Medium MLP (16-12-6)', 'Large MLP (20-16-12)']
accs = [acc1, acc2, acc3]

for idx, (model, title, acc) in enumerate(zip(models, titles, accs)):
    ax = axes[idx]
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_test_m[:, 0].min() - 0.5, X_test_m[:, 0].max() + 0.5
    y_min, y_max = X_test_m[:, 1].min() - 0.5, X_test_m[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    mesh_scaled = scaler_m.transform(np.c_[xx.ravel(), yy.ravel()])
    Z = model.forward(mesh_scaled).reshape(xx.shape)
    
    # Plot
    contour = ax.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.8)
    ax.scatter(X_test_m[y_test_m.ravel() == 0, 0], X_test_m[y_test_m.ravel() == 0, 1],
              c='blue', edgecolors='k', s=30, label='Class 0')
    ax.scatter(X_test_m[y_test_m.ravel() == 1, 0], X_test_m[y_test_m.ravel() == 1, 1],
              c='red', edgecolors='k', s=30, label='Class 1')
    ax.set_title(f'{title}\nAccuracy: {acc:.2f}%')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Compare learning curves
plt.figure(figsize=(12, 6))
plt.plot(losses1, label='Small MLP (16-8-4)', linewidth=2, alpha=0.8)
plt.plot(losses2, label='Medium MLP (16-12-6)', linewidth=2, alpha=0.8)
plt.plot(losses3, label='Large MLP (20-16-12)', linewidth=2, alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Learning Curves Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 8. Key Takeaways

### Backpropagation Pattern

The beauty of backpropagation is its **universal pattern**:

For any layer $l$ in the network:

1. **Error signal**: $\delta^{(l)} = \frac{\partial L}{\partial a^{(l)}} \odot \sigma'(z^{(l)})$
2. **Weight gradient**: $\frac{\partial L}{\partial W^{(l)}} = \frac{1}{m}(a^{(l-1)})^T \delta^{(l)}$
3. **Bias gradient**: $\frac{\partial L}{\partial b^{(l)}} = \frac{1}{m}\sum \delta^{(l)}$
4. **Propagate**: $\frac{\partial L}{\partial a^{(l-1)}} = \delta^{(l)}(W^{(l)})^T$

This pattern works for networks of **any depth**!

### Depth vs Performance

- **Deeper networks** can learn more complex patterns
- But they also:
  - Take longer to train
  - May overfit if not enough data
  - Can suffer from vanishing gradients
  
Choose depth based on:
- Problem complexity
- Available data
- Computational resources
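
The vanishing-gradient point above can be illustrated with a rough bound. The sigmoid derivative is at most 0.25, so each sigmoid layer the error signal passes through scales it by at most that factor, before the weights are taken into account. The sketch below only evaluates this activation-derivative bound; it is not a measurement on a trained network.

```python
import numpy as np


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)


# The derivative peaks at z = 0, where sigmoid(0) = 0.5
max_slope = float(sigmoid_derivative(0.0))   # 0.25

for depth in (1, 3, 10, 20):
    bound = max_slope ** depth
    print(f"{depth:2d} sigmoid layers: activation-derivative factor <= {bound:.2e}")

bound_3 = max_slope ** 3   # 0.015625 for three hidden layers
```

ReLU has derivative 1 wherever its input is positive, which is one reason it is a common default for deeper networks.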

### Next Steps

Now that you understand deep networks, you're ready for:
- **Regularization**: Preventing overfitting (dropout, L2)
- **Batch normalization**: Stabilizing training
- **Advanced optimizers**: Adam, RMSprop
- **Convolutional networks**: For images
- **Recurrent networks**: For sequences

You now have the foundation of **deep learning**! 🎉