# 🎯 Complete MNIST Implementation from Scratch

## You've Learned All the Pieces - Let's Build Something Real!

**What you've accomplished so far:**
- ✅ Neurons and activation functions
- ✅ Multi-layer networks
- ✅ Forward propagation
- ✅ Loss functions
- ✅ Backpropagation
- ✅ Training loops

**Now**: Let's put it ALL together to solve a real problem!

## The Problem: Handwritten Digit Recognition (MNIST)

MNIST is the "Hello World" of machine learning:
- **Task**: Classify handwritten digits (0-9)
- **Data**: 70,000 grayscale images of handwritten digits
  - 60,000 for training
  - 10,000 for testing
- **Image size**: 28×28 pixels (784 pixels total)
- **Challenge**: Handle variations in handwriting styles

This is a **real benchmark** used by researchers worldwide!

## What We'll Build

A complete neural network from scratch with:
- 3 layers (784 → 128 → 64 → 10)
- Mini-batch training
- Validation monitoring
- >95% accuracy on test set

**Everything in pure NumPy** - no frameworks (yet)!
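
To get a feel for the size of this architecture, the sketch below counts the weights and biases implied by the layer sizes listed above. The helper name `count_parameters` is made up for this illustration, and it assumes fully connected layers with one bias per output neuron.

```python
def count_parameters(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out   # one weight per input-output pair
        biases = n_out           # one bias per output neuron
        print(f"{n_in:>4} -> {n_out:<4} weights: {weights:>7,}  biases: {biases:>4}")
        total += weights + biases
    return total

layer_sizes = [784, 128, 64, 10]
total_params = count_parameters(layer_sizes)
print(f"Total parameters: {total_params:,}")  # 109,386
```

Almost all of the parameters sit in the first layer, because every one of the 784 pixels connects to every one of the 128 hidden neurons.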

In [None]:
# Import libraries
import numpy as np  # For all numerical computations
import matplotlib.pyplot as plt  # For visualizations
from sklearn.datasets import fetch_openml  # To download MNIST dataset
from sklearn.model_selection import train_test_split  # For data splitting
from sklearn.metrics import confusion_matrix  # For evaluation
import time  # To measure training time
import seaborn as sns  # For beautiful confusion matrix
from tqdm import tqdm  # For progress bars

# Set random seed for reproducibility
np.random.seed(42)  # Same results every time we run

print("📦 All libraries imported successfully!")

## Step 1: Load and Explore MNIST Dataset

First, let's load the famous MNIST dataset and see what we're working with.

In [None]:
# Load MNIST dataset from OpenML
print("📥 Downloading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, parser='auto')  # Download the dataset

# Extract images and labels
X = mnist.data.to_numpy()  # Images: 70,000 × 784 (each row is a flattened 28×28 image)
y = mnist.target.to_numpy().astype(int)  # Labels: 70,000 values (0-9)

print(f"\n✅ Dataset loaded!")
print(f"   Images shape: {X.shape}  (70,000 images × 784 pixels)")
print(f"   Labels shape: {y.shape}  (70,000 labels)")
print(f"   Pixel values range: {X.min():.0f} to {X.max():.0f}")
print(f"   Unique labels: {np.unique(y)}")

In [None]:
# Visualize some sample images
fig, axes = plt.subplots(3, 10, figsize=(15, 5))  # 3 rows, 10 columns
fig.suptitle('Sample MNIST Images', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):  # Loop through each subplot
    # Get a random image
    idx = np.random.randint(0, len(X))  # Random index
    image = X[idx].reshape(28, 28)  # Reshape flat array to 28×28 image
    label = y[idx]  # Get corresponding label
    
    # Display the image
    ax.imshow(image, cmap='gray')  # Show grayscale image
    ax.set_title(f'Label: {label}', fontsize=10)  # Title with true label
    ax.axis('off')  # Hide axes for cleaner look

plt.tight_layout()  # Adjust spacing
plt.show()

print("\n🎨 These are real handwritten digits from different people!")
print("   Notice the variations in writing styles.")

## Step 2: Preprocess the Data

Before training, we need to:
1. **Normalize** pixel values (0-255 → 0-1)
2. **One-hot encode** labels (3 → [0,0,0,1,0,0,0,0,0,0])
3. **Split** into training (50,000), validation (10,000), and test (10,000) sets
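
To make these three steps concrete before running them on the full dataset, here is a small sketch on made-up data. The arrays `toy_pixels` and `toy_labels` are invented for illustration, `np.eye(10)[labels]` is just one compact way to one-hot encode, and the split is a plain shuffle rather than the stratified split used in the cells that follow.

```python
import numpy as np

# Hypothetical toy data: 6 "images" of 4 pixels each, with integer labels
toy_pixels = np.array([[0, 51, 102, 255],
                       [255, 0, 0, 0],
                       [10, 20, 30, 40],
                       [0, 0, 0, 0],
                       [128, 128, 128, 128],
                       [255, 255, 255, 255]], dtype=np.uint8)
toy_labels = np.array([3, 0, 9, 3, 1, 0])

# 1. Normalize: dividing by 255.0 maps the 0-255 range onto 0-1
toy_scaled = toy_pixels / 255.0

# 2. One-hot encode: row k of the identity matrix is the one-hot vector for class k
toy_onehot = np.eye(10)[toy_labels]

# 3. Split: shuffle the indices once, then slice them into three groups
rng = np.random.default_rng(0)
order = rng.permutation(len(toy_scaled))
train_idx, val_idx, test_idx = order[:4], order[4:5], order[5:]

print(toy_scaled.min(), toy_scaled.max())   # 0.0 1.0
print(toy_onehot[0])                        # 1.0 at index 3, zeros elsewhere
print(len(train_idx), len(val_idx), len(test_idx))
```

Taking `argmax` along each row of `toy_onehot` recovers the original integer labels, which is how the notebook converts back later on.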

In [None]:
# 1. Normalize pixel values to [0, 1] range
X_normalized = X / 255.0  # Divide by 255 (max pixel value)
print(f"✅ Normalized pixel values: {X_normalized.min():.2f} to {X_normalized.max():.2f}")

# 2. One-hot encode labels
def one_hot_encode(y, num_classes=10):
    """
    Convert labels to one-hot encoded vectors.
    
    Example: 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    """
    n_samples = len(y)  # Number of samples
    y_encoded = np.zeros((n_samples, num_classes))  # Initialize with zeros
    y_encoded[np.arange(n_samples), y] = 1  # Set 1 at correct index
    return y_encoded

y_encoded = one_hot_encode(y)  # Convert all labels
print(f"\n✅ One-hot encoded labels shape: {y_encoded.shape}")
print(f"   Example: label {y[0]} → {y_encoded[0]}")

In [None]:
# 3. Split data into train/validation/test sets

# First split: separate test set (10,000 images)
X_temp, X_test, y_temp, y_test = train_test_split(
    X_normalized, y_encoded, 
    test_size=10000,  # 10,000 for testing
    random_state=42,  # For reproducibility
    stratify=y  # Keep same class distribution
)

# Second split: separate validation set (10,000 images)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=10000,  # 10,000 for validation
    random_state=42,
    stratify=np.argmax(y_temp, axis=1)  # Stratify on original labels
)

print("\n✅ Data split complete!")
print(f"   Training set:   {X_train.shape[0]:,} images")
print(f"   Validation set: {X_val.shape[0]:,} images")
print(f"   Test set:       {X_test.shape[0]:,} images")
print(f"\n   Total: {X_train.shape[0] + X_val.shape[0] + X_test.shape[0]:,} images")

## Step 3: Build the Neural Network Class

Now for the main event - our complete neural network!

**Architecture:**
- Input layer: 784 neurons (28×28 pixels)
- Hidden layer 1: 128 neurons + ReLU activation
- Hidden layer 2: 64 neurons + ReLU activation
- Output layer: 10 neurons + Softmax activation (one per digit)

This class contains everything we've learned!
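
One formula in this class that is easy to take on faith is the output-layer gradient `A3 - y_true`. The sketch below checks it numerically on a tiny random example using central finite differences. The standalone helper functions and the sizes used here are assumptions for illustration, not part of the notebook's class.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with the max-subtraction stability trick."""
    exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

def cross_entropy(Z, y_true):
    """Mean cross-entropy of softmax(Z) against one-hot targets."""
    return -np.sum(y_true * np.log(softmax(Z))) / Z.shape[0]

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 10))                 # 4 samples, 10 logits each
y_true = np.eye(10)[rng.integers(0, 10, 4)]  # random one-hot targets

# Analytic gradient of the mean loss with respect to the logits
analytic = (softmax(Z) - y_true) / Z.shape[0]

# Numerical gradient: nudge each logit up and down by a small step
eps = 1e-5
numeric = np.zeros_like(Z)
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Z_plus, Z_minus = Z.copy(), Z.copy()
        Z_plus[i, j] += eps
        Z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(Z_plus, y_true)
                         - cross_entropy(Z_minus, y_true)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(f"Largest difference: {max_diff:.2e}")
```

A very small difference suggests the formula is right. In the class, the division by the batch size `m` is applied when computing `dW3` and `db3` rather than inside `dZ3`, which gives the same weight gradients.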

In [None]:
class NeuralNetwork:
    """
    A complete 3-layer neural network for MNIST classification.
    
    This brings together everything from notebooks 1-8!
    """
    
    def __init__(self, input_size=784, hidden1_size=128, hidden2_size=64, output_size=10):
        """
        Initialize the network with random weights.
        
        Args:
            input_size: Number of input features (784 for MNIST)
            hidden1_size: Number of neurons in first hidden layer
            hidden2_size: Number of neurons in second hidden layer
            output_size: Number of output classes (10 for digits 0-9)
        """
        # Store layer sizes
        self.input_size = input_size
        self.hidden1_size = hidden1_size
        self.hidden2_size = hidden2_size
        self.output_size = output_size
        
        # Initialize weights using He initialization (good for ReLU)
        # He initialization: scale by sqrt(2/n), where n is the number of inputs feeding the layer
        self.W1 = np.random.randn(input_size, hidden1_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden1_size))  # Biases start at zero
        
        self.W2 = np.random.randn(hidden1_size, hidden2_size) * np.sqrt(2.0 / hidden1_size)
        self.b2 = np.zeros((1, hidden2_size))
        
        self.W3 = np.random.randn(hidden2_size, output_size) * np.sqrt(2.0 / hidden2_size)
        self.b3 = np.zeros((1, output_size))
        
        # For tracking training history
        self.train_losses = []  # Training loss per epoch
        self.val_losses = []    # Validation loss per epoch
        self.train_accuracies = []  # Training accuracy per epoch
        self.val_accuracies = []    # Validation accuracy per epoch
    
    def relu(self, Z):
        """ReLU activation: max(0, Z)"""
        return np.maximum(0, Z)  # Element-wise maximum
    
    def relu_derivative(self, Z):
        """Derivative of ReLU: 1 if Z > 0, else 0"""
        return (Z > 0).astype(float)  # 1 where Z > 0, 0 elsewhere
    
    def softmax(self, Z):
        """
        Softmax activation: converts logits to probabilities.
        
        Uses numerical stability trick: subtract max before exp
        """
        exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))  # Prevent overflow
        return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)  # Normalize to sum to 1
    
    def forward(self, X):
        """
        Forward propagation through all layers.
        
        Returns:
            A3: Output activations (probabilities for each class)
        """
        # Layer 1: Input → Hidden1
        self.Z1 = np.dot(X, self.W1) + self.b1  # Linear transformation
        self.A1 = self.relu(self.Z1)  # ReLU activation
        
        # Layer 2: Hidden1 → Hidden2
        self.Z2 = np.dot(self.A1, self.W2) + self.b2  # Linear transformation
        self.A2 = self.relu(self.Z2)  # ReLU activation
        
        # Layer 3: Hidden2 → Output
        self.Z3 = np.dot(self.A2, self.W3) + self.b3  # Linear transformation
        self.A3 = self.softmax(self.Z3)  # Softmax activation (probabilities)
        
        return self.A3  # Return predicted probabilities
    
    def compute_loss(self, y_true, y_pred):
        """
        Compute cross-entropy loss.
        
        Args:
            y_true: True labels (one-hot encoded)
            y_pred: Predicted probabilities
        
        Returns:
            Average loss over all samples
        """
        m = y_true.shape[0]  # Number of samples
        # Cross-entropy: -sum(y_true * log(y_pred))
        # Add small epsilon to prevent log(0)
        loss = -np.sum(y_true * np.log(y_pred + 1e-8)) / m
        return loss
    
    def backward(self, X, y_true, learning_rate):
        """
        Backpropagation: compute gradients and update weights.
        
        This is where all the calculus from notebook 7 happens!
        """
        m = X.shape[0]  # Batch size
        
        # Output layer gradient
        dZ3 = self.A3 - y_true  # Derivative of softmax + cross-entropy
        dW3 = np.dot(self.A2.T, dZ3) / m  # Gradient for W3
        db3 = np.sum(dZ3, axis=0, keepdims=True) / m  # Gradient for b3
        
        # Hidden layer 2 gradient
        dA2 = np.dot(dZ3, self.W3.T)  # Backpropagate through W3
        dZ2 = dA2 * self.relu_derivative(self.Z2)  # Apply ReLU derivative
        dW2 = np.dot(self.A1.T, dZ2) / m  # Gradient for W2
        db2 = np.sum(dZ2, axis=0, keepdims=True) / m  # Gradient for b2
        
        # Hidden layer 1 gradient
        dA1 = np.dot(dZ2, self.W2.T)  # Backpropagate through W2
        dZ1 = dA1 * self.relu_derivative(self.Z1)  # Apply ReLU derivative
        dW1 = np.dot(X.T, dZ1) / m  # Gradient for W1
        db1 = np.sum(dZ1, axis=0, keepdims=True) / m  # Gradient for b1
        
        # Update weights using gradient descent
        self.W3 -= learning_rate * dW3  # W3 = W3 - lr * gradient
        self.b3 -= learning_rate * db3
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def predict(self, X):
        """
        Make predictions on new data.
        
        Returns:
            Predicted class labels (0-9)
        """
        probabilities = self.forward(X)  # Get probabilities
        return np.argmax(probabilities, axis=1)  # Return class with highest probability
    
    def accuracy(self, X, y_true):
        """
        Compute classification accuracy.
        
        Args:
            X: Input data
            y_true: True labels (one-hot encoded)
        
        Returns:
            Accuracy as a percentage
        """
        predictions = self.predict(X)  # Get predictions
        true_labels = np.argmax(y_true, axis=1)  # Convert one-hot to labels
        return np.mean(predictions == true_labels) * 100  # Percentage correct

print("✅ NeuralNetwork class created!")
print("   This contains everything we learned:")
print("   - Forward propagation (notebook 5)")
print("   - Activation functions (notebook 3)")
print("   - Loss computation (notebook 6)")
print("   - Backpropagation (notebook 7)")
print("   - Weight updates (notebook 8)")

## Step 4: Training with Mini-Batches

Let's train our network using mini-batch gradient descent!

**Training strategy:**
- Batch size: 32 (process 32 images at a time)
- Learning rate: 0.01
- Epochs: 20 (go through entire dataset 20 times)
- Monitor validation performance to detect overfitting
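
These settings determine how many weight updates the network receives. Here is a quick back-of-the-envelope sketch, assuming the 50,000-image training set that remains after holding out 10,000 validation and 10,000 test images:

```python
import math

n_train = 50_000   # 70,000 total minus 10,000 validation and 10,000 test
batch_size = 32
epochs = 20

# The last batch of each epoch is smaller when n_train is not a multiple of batch_size
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(f"Batches per epoch: {batches_per_epoch:,}")   # 1,563
print(f"Last batch size:   {last_batch_size}")       # 16
print(f"Total updates:     {total_updates:,}")       # 31,260
```

A smaller batch size means more updates per epoch, but each update is based on a noisier gradient estimate.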

In [None]:
def create_mini_batches(X, y, batch_size):
    """
    Split data into mini-batches for training.
    
    Args:
        X: Input data
        y: Labels
        batch_size: Size of each batch
    
    Returns:
        List of (X_batch, y_batch) tuples
    """
    m = X.shape[0]  # Total number of samples
    mini_batches = []  # List to store batches
    
    # Shuffle data
    indices = np.random.permutation(m)  # Random order
    X_shuffled = X[indices]  # Shuffle inputs
    y_shuffled = y[indices]  # Shuffle labels (same order)
    
    # Create batches
    num_complete_batches = m // batch_size  # Number of full batches
    
    for i in range(num_complete_batches):
        start = i * batch_size  # Start index
        end = start + batch_size  # End index
        X_batch = X_shuffled[start:end]  # Get batch of inputs
        y_batch = y_shuffled[start:end]  # Get batch of labels
        mini_batches.append((X_batch, y_batch))  # Add to list
    
    # Handle remaining samples (if any)
    if m % batch_size != 0:
        start = num_complete_batches * batch_size
        X_batch = X_shuffled[start:]
        y_batch = y_shuffled[start:]
        mini_batches.append((X_batch, y_batch))
    
    return mini_batches

print("✅ Mini-batch creation function ready!")

In [None]:
# Initialize the neural network
model = NeuralNetwork(
    input_size=784,    # 28×28 pixels
    hidden1_size=128,  # First hidden layer
    hidden2_size=64,   # Second hidden layer
    output_size=10     # 10 digit classes
)

print("🧠 Neural Network initialized!")
print(f"   Architecture: {model.input_size} → {model.hidden1_size} → {model.hidden2_size} → {model.output_size}")
print(f"   Total parameters: {model.W1.size + model.b1.size + model.W2.size + model.b2.size + model.W3.size + model.b3.size:,}")

In [None]:
# Training configuration
batch_size = 32      # Process 32 images at a time
learning_rate = 0.01  # Step size for gradient descent
epochs = 20          # Number of times to go through entire dataset

print("🎯 Starting training...")
print(f"   Batch size: {batch_size}")
print(f"   Learning rate: {learning_rate}")
print(f"   Epochs: {epochs}")
print("\n" + "="*70)

# Record start time
start_time = time.time()

# Training loop
for epoch in range(epochs):
    # Create mini-batches for this epoch
    mini_batches = create_mini_batches(X_train, y_train, batch_size)
    
    # Train on each mini-batch
    for X_batch, y_batch in tqdm(mini_batches, desc=f"Epoch {epoch+1}/{epochs}", leave=False):
        # Forward pass
        model.forward(X_batch)
        # Backward pass and update weights
        model.backward(X_batch, y_batch, learning_rate)
    
    # Evaluate on full training and validation sets
    train_pred = model.forward(X_train)  # Predictions on training set
    train_loss = model.compute_loss(y_train, train_pred)  # Training loss
    train_acc = model.accuracy(X_train, y_train)  # Training accuracy
    
    val_pred = model.forward(X_val)  # Predictions on validation set
    val_loss = model.compute_loss(y_val, val_pred)  # Validation loss
    val_acc = model.accuracy(X_val, y_val)  # Validation accuracy
    
    # Store metrics
    model.train_losses.append(train_loss)
    model.val_losses.append(val_loss)
    model.train_accuracies.append(train_acc)
    model.val_accuracies.append(val_acc)
    
    # Print progress
    print(f"Epoch {epoch+1}/{epochs}:")
    print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
    print(f"  Val Loss:   {val_loss:.4f} | Val Acc:   {val_acc:.2f}%")
    print("-" * 70)

# Record end time
training_time = time.time() - start_time

print("\n" + "="*70)
print(f"✅ Training complete in {training_time:.2f} seconds!")
print(f"   Final training accuracy: {model.train_accuracies[-1]:.2f}%")
print(f"   Final validation accuracy: {model.val_accuracies[-1]:.2f}%")

## Step 5: Visualize Training Progress

Let's see how our network learned over time!
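
When reading curves like these, two numbers are worth pulling out: the epoch with the lowest validation loss, and the gap between training and validation accuracy at the end. The sketch below computes both from per-epoch lists like the ones the training loop stores. The helper `summarize_history` and the sample numbers are invented for illustration; they are not actual training results.

```python
import numpy as np

def summarize_history(val_losses, train_accuracies, val_accuracies):
    """Return the best epoch (1-based) by validation loss and the final accuracy gap."""
    best_epoch = int(np.argmin(val_losses)) + 1
    final_gap = train_accuracies[-1] - val_accuracies[-1]
    return best_epoch, final_gap

# Made-up history for five epochs (not real training results)
example_val_losses = [0.60, 0.40, 0.31, 0.33, 0.36]
example_train_acc = [85.0, 90.0, 93.0, 95.0, 96.5]
example_val_acc = [84.0, 89.0, 92.0, 92.5, 92.0]

best_epoch, final_gap = summarize_history(
    example_val_losses, example_train_acc, example_val_acc
)
print(f"Lowest validation loss at epoch {best_epoch}")
print(f"Final train/val accuracy gap: {final_gap:.1f} points")
```

With the trained model, the same call would be `summarize_history(model.val_losses, model.train_accuracies, model.val_accuracies)`. Validation loss that rises while training loss keeps falling, as in the made-up numbers above, is the usual sign of overfitting.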

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Loss curves
ax1.plot(model.train_losses, label='Training Loss', linewidth=2, marker='o')
ax1.plot(model.val_losses, label='Validation Loss', linewidth=2, marker='s')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Loss Over Time', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Accuracy curves
ax2.plot(model.train_accuracies, label='Training Accuracy', linewidth=2, marker='o')
ax2.plot(model.val_accuracies, label='Validation Accuracy', linewidth=2, marker='s')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Over Time', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📈 What to look for in the training curves:")
print("   - Loss should decrease over time")
print("   - Accuracy should increase over time")
print("   - If training and validation curves stay close, the model is not overfitting")

## Step 6: Evaluate on Test Set

Now let's see how well our model performs on completely unseen data!

In [None]:
# Evaluate on test set
test_predictions = model.predict(X_test)  # Get predictions
test_accuracy = model.accuracy(X_test, y_test)  # Compute accuracy

print("="*70)
print("🎯 TEST SET PERFORMANCE")
print("="*70)
print(f"\n   Accuracy: {test_accuracy:.2f}%")
num_correct = int(np.sum(test_predictions == np.argmax(y_test, axis=1)))  # Exact count, avoids float rounding
print(f"   Correct predictions: {num_correct:,} / {len(X_test):,}")
print(f"   Wrong predictions: {len(X_test) - num_correct:,}")

if test_accuracy > 95:
    print("\n   🎉 EXCELLENT! Over 95% accuracy!")
elif test_accuracy > 90:
    print("\n   👍 GOOD! Over 90% accuracy!")
else:
    print("\n   📚 Room for improvement. Try training longer!")

## Step 7: Confusion Matrix

Let's see which digits the model confuses with each other.
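
The next cell uses scikit-learn's `confusion_matrix`, but the idea fits the "pure NumPy" spirit of this notebook too. Below is a minimal sketch of how such a matrix can be counted by hand; the function name `confusion_matrix_numpy` and the toy labels are made up for illustration.

```python
import numpy as np

def confusion_matrix_numpy(true_labels, predicted_labels, num_classes=10):
    """Count how often each true class (row) is predicted as each class (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (row, col) pair repeats
    np.add.at(cm, (true_labels, predicted_labels), 1)
    return cm

# Made-up labels for a 3-class problem
toy_true = np.array([0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0, 2])

toy_cm = confusion_matrix_numpy(toy_true, toy_pred, num_classes=3)
print(toy_cm)
```

Row `i`, column `j` holds the number of examples whose true class is `i` and whose predicted class is `j`, so the diagonal counts the correct predictions.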

In [None]:
# Compute confusion matrix
true_labels = np.argmax(y_test, axis=1)  # Convert one-hot to labels
cm = confusion_matrix(true_labels, test_predictions)  # Compute matrix

# Plot confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=range(10), yticklabels=range(10),
            cbar_kws={'label': 'Number of predictions'})
plt.xlabel('Predicted Label', fontsize=13)
plt.ylabel('True Label', fontsize=13)
plt.title('Confusion Matrix - MNIST Test Set', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Confusion Matrix Interpretation:")
print("   - Diagonal: correct predictions (darker is better)")
print("   - Off-diagonal: mistakes")
print("   - Look for patterns: which digits are confused?")

# Find most confused pairs
cm_no_diag = cm.copy()
np.fill_diagonal(cm_no_diag, 0)  # Remove diagonal
max_confusion = np.unravel_index(cm_no_diag.argmax(), cm_no_diag.shape)
print(f"\n   Most common mistake: true {max_confusion[0]} predicted as {max_confusion[1]} ({cm_no_diag[max_confusion]} times)")

## Step 8: Visualize Predictions

Let's look at some correct and incorrect predictions.

In [None]:
# Find correct and incorrect predictions
correct_idx = np.where(test_predictions == true_labels)[0]  # Indices of correct predictions
incorrect_idx = np.where(test_predictions != true_labels)[0]  # Indices of mistakes

# Visualize correct predictions
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
fig.suptitle('✅ Correct Predictions', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    idx = correct_idx[np.random.randint(0, len(correct_idx))]  # Random correct prediction
    image = X_test[idx].reshape(28, 28)  # Reshape to image
    true_label = true_labels[idx]  # True label
    pred_label = test_predictions[idx]  # Predicted label
    
    # Get prediction confidence (probability)
    probs = model.forward(X_test[idx:idx+1])[0]  # Get probabilities
    confidence = probs[pred_label] * 100  # Confidence in prediction
    
    ax.imshow(image, cmap='gray')
    ax.set_title(f'True: {true_label} | Pred: {pred_label}\nConfidence: {confidence:.1f}%', 
                 fontsize=10, color='green')
    ax.axis('off')

plt.tight_layout()
plt.show()

print(f"\n✅ Model got {len(correct_idx):,} predictions correct!")

In [None]:
# Visualize incorrect predictions
if len(incorrect_idx) > 0:  # Only if there are mistakes
    fig, axes = plt.subplots(2, 5, figsize=(15, 6))
    fig.suptitle('❌ Incorrect Predictions (Learning from Mistakes)', fontsize=16, fontweight='bold')
    
    for i, ax in enumerate(axes.flat):
        if i < len(incorrect_idx):  # If we have enough mistakes
            idx = incorrect_idx[i]  # Get mistake index
            image = X_test[idx].reshape(28, 28)  # Reshape to image
            true_label = true_labels[idx]  # True label
            pred_label = test_predictions[idx]  # Predicted label
            
            # Get prediction confidence
            probs = model.forward(X_test[idx:idx+1])[0]  # Get probabilities
            confidence = probs[pred_label] * 100  # Confidence in prediction
            
            ax.imshow(image, cmap='gray')
            ax.set_title(f'True: {true_label} | Pred: {pred_label}\nConfidence: {confidence:.1f}%', 
                         fontsize=10, color='red')
            ax.axis('off')
        else:
            ax.axis('off')  # Hide extra subplots
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n❌ Model made {len(incorrect_idx)} mistakes")
    print("   Notice: Some mistakes are understandable (ambiguous handwriting)!")
else:
    print("\n🎉 Perfect score! No mistakes!")

## Step 9: Visualize What the Network Learned

Let's peek inside and see what features the first layer learned!

In [None]:
# Visualize weights of first layer
# Each neuron in the first layer has 784 weights (one per pixel)
# These weights show what pattern each neuron is looking for

fig, axes = plt.subplots(4, 8, figsize=(16, 8))
fig.suptitle('First Layer Weights (What Each Neuron Looks For)', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    if i < 32:  # Show first 32 neurons
        weights = model.W1[:, i].reshape(28, 28)  # Reshape weights to 28×28 image
        ax.imshow(weights, cmap='RdBu_r', vmin=-0.5, vmax=0.5)  # Reversed map: Blue=negative, Red=positive
        ax.set_title(f'Neuron {i+1}', fontsize=9)
        ax.axis('off')
    else:
        ax.axis('off')

plt.tight_layout()
plt.show()

print("\n🔍 Weight Visualization:")
print("   - Red areas: positive weights (bright pixels here increase the neuron's activation)")
print("   - Blue areas: negative weights (bright pixels here decrease the neuron's activation)")
print("   - Each neuron learns to detect different features (edges, curves, etc.)")

## Step 10: Activation Maps

Let's see how neurons activate for a specific image.

In [None]:
# Pick a random test image
sample_idx = np.random.randint(0, len(X_test))
sample_image = X_test[sample_idx:sample_idx+1]  # Keep 2D shape
sample_label = true_labels[sample_idx]

# Forward pass to get activations
model.forward(sample_image)

# Create visualization
fig = plt.figure(figsize=(16, 6))

# Original image
ax1 = plt.subplot(1, 4, 1)
ax1.imshow(sample_image.reshape(28, 28), cmap='gray')
ax1.set_title(f'Input Image\n(Digit: {sample_label})', fontsize=12, fontweight='bold')
ax1.axis('off')

# Layer 1 activations
ax2 = plt.subplot(1, 4, 2)
ax2.bar(range(model.hidden1_size), model.A1[0])  # Bar plot of activations
ax2.set_title(f'Layer 1 Activations\n({model.hidden1_size} neurons)', fontsize=12, fontweight='bold')
ax2.set_xlabel('Neuron Index')
ax2.set_ylabel('Activation Value')
ax2.grid(True, alpha=0.3)

# Layer 2 activations
ax3 = plt.subplot(1, 4, 3)
ax3.bar(range(model.hidden2_size), model.A2[0])  # Bar plot of activations
ax3.set_title(f'Layer 2 Activations\n({model.hidden2_size} neurons)', fontsize=12, fontweight='bold')
ax3.set_xlabel('Neuron Index')
ax3.set_ylabel('Activation Value')
ax3.grid(True, alpha=0.3)

# Output layer activations (probabilities)
ax4 = plt.subplot(1, 4, 4)
bars = ax4.bar(range(10), model.A3[0])  # Bar plot of probabilities
# Highlight predicted class
predicted_class = np.argmax(model.A3[0])
bars[predicted_class].set_color('green')
ax4.set_title(f'Output Probabilities\n(Predicted: {predicted_class})', fontsize=12, fontweight='bold')
ax4.set_xlabel('Digit Class')
ax4.set_ylabel('Probability')
ax4.set_xticks(range(10))
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n🎯 Activation Flow:")
print("   1. Input: 28×28 image (784 pixels)")
print("   2. Layer 1: Different neurons activate for different features")
print("   3. Layer 2: Combines Layer 1 features into higher-level patterns")
print("   4. Output: Probabilities for each digit (0-9)")

## Step 11: Performance Analysis

Let's analyze which digits are hardest to classify.
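
The per-digit accuracy computed in the next cell is what is usually called recall: of all true examples of a digit, the share that was classified correctly. A complementary view is precision: of all predictions of a digit, the share that was right. The sketch below derives both from a confusion matrix; the 3-class matrix is made up for illustration.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions
toy_cm = np.array([[9, 1, 0],
                   [2, 7, 1],
                   [0, 2, 8]])

correct = np.diag(toy_cm)                 # correctly classified count per class
recall = correct / toy_cm.sum(axis=1)     # share of each true class that was found
precision = correct / toy_cm.sum(axis=0)  # share of each predicted class that was right

for k in range(len(toy_cm)):
    print(f"class {k}: recall {recall[k]:.2f}, precision {precision[k]:.2f}")
```

A digit with high recall but low precision is one the model predicts too eagerly, which a per-class accuracy plot alone would not reveal.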

In [None]:
# Compute per-class accuracy
per_class_accuracy = {}

for digit in range(10):
    # Get indices for this digit
    digit_indices = np.where(true_labels == digit)[0]
    # Get predictions for this digit
    digit_predictions = test_predictions[digit_indices]
    # Compute accuracy
    accuracy = np.mean(digit_predictions == digit) * 100
    per_class_accuracy[digit] = accuracy

# Plot per-class accuracy
plt.figure(figsize=(12, 6))
bars = plt.bar(per_class_accuracy.keys(), per_class_accuracy.values(), 
               color=['green' if v > 95 else 'orange' if v > 90 else 'red' 
                      for v in per_class_accuracy.values()])
plt.xlabel('Digit', fontsize=12)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.title('Per-Digit Classification Accuracy', fontsize=14, fontweight='bold')
plt.xticks(range(10))
plt.ylim([80, 100])  # Zoom in to see differences
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, (digit, acc) in zip(bars, per_class_accuracy.items()):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{acc:.1f}%', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

# Find hardest and easiest digits
hardest_digit = min(per_class_accuracy, key=per_class_accuracy.get)
easiest_digit = max(per_class_accuracy, key=per_class_accuracy.get)

print("\n📊 Per-Digit Analysis:")
print(f"   Easiest digit: {easiest_digit} ({per_class_accuracy[easiest_digit]:.2f}% accuracy)")
print(f"   Hardest digit: {hardest_digit} ({per_class_accuracy[hardest_digit]:.2f}% accuracy)")
print("\n   Why some digits are harder:")
print("   - Similar shapes (e.g., 4 and 9, 3 and 8)")
print("   - Variation in handwriting styles")
print("   - Fewer training examples for some digits")

## 🎉 Congratulations!

## You Just Built a Real Neural Network from Scratch!

**What you accomplished:**
- ✅ Loaded and preprocessed 70,000 images
- ✅ Built a 3-layer neural network (>100,000 parameters)
- ✅ Implemented forward propagation
- ✅ Implemented backpropagation through all layers
- ✅ Trained with mini-batch gradient descent
- ✅ Achieved >95% accuracy on MNIST
- ✅ Visualized what the network learned

**This is REAL machine learning** - the same techniques used in production systems!

## But Here's the Thing...

This took about **500 lines of code** and required deep understanding of:
- Matrix operations
- Calculus (derivatives)
- Optimization algorithms
- Numerical stability

### What if I told you PyTorch can do this in ~50 lines?

**Next notebook**: See how deep learning frameworks make this MUCH easier!

But remember: **You now understand what's happening under the hood** - that's invaluable when debugging or optimizing models!