# Neural Networks Lesson 1B: Handwritten Digit Recognition

## Building a Real-World Image Classifier

**Learning Objectives:**
- Understand how images become neural network inputs
- Work with the famous MNIST dataset
- Build a 1-hidden-layer network for classification
- Train on 60,000 real images
- Evaluate accuracy on test data
- Visualize what the network learned

**Duration:** ~90 minutes

---

## Part 1: Introduction to MNIST

MNIST (Modified National Institute of Standards and Technology) is the "Hello World" of computer vision:

- **70,000 images** of handwritten digits (0-9)
- **60,000 training images** + **10,000 test images**
- Each image is **28×28 pixels** in grayscale
- **784 total pixels** per image (28×28)
- Pixel values range from **0 (black) to 255 (white)**

This dataset has been used to benchmark neural networks since the 1990s!

In [None]:
# Setup: import required libraries
# (install numpy, matplotlib, scikit-learn, and seaborn first if they are missing)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, confusion_matrix

# Set random seeds for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

## Part 2: Load and Explore the MNIST Dataset

In [None]:
# Load MNIST dataset
print("📥 Loading MNIST dataset (this may take a moment)...\n")
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist.data.values, mnist.target.values.astype(int)

print(f"✅ Dataset loaded!")
print(f"\n📊 Dataset Statistics:")
print(f"  • Total images: {X.shape[0]:,}")
print(f"  • Pixels per image: {X.shape[1]} (28×28)")
print(f"  • Classes: {len(np.unique(y))} (digits 0-9)")
print(f"  • Data type: {X.dtype}")
print(f"  • Value range: [{X.min():.0f}, {X.max():.0f}]")

In [None]:
# Visualize sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('Sample MNIST Digits', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    # Display image
    image = X[i].reshape(28, 28)
    ax.imshow(image, cmap='gray')
    ax.set_title(f'Label: {y[i]}', fontsize=14, fontweight='bold')
    ax.axis('off')

plt.tight_layout()
plt.show()

print("\n🖼️ Each 28×28 image becomes a 784-dimensional input vector for the neural network!")

## Part 3: From Images to Neural Network Inputs

### How do images become inputs?

1. **Original:** 28×28 pixel grid (2D)
2. **Flatten:** Convert to 784-length vector (1D)
3. **Normalize:** Scale pixel values from [0, 255] to [0, 1]

```
Image (28×28)          Flatten           Normalize
┌─────────┐            ──────>           ──────>
│ □ □ ■ □ │     [0, 0, 255, 0, ...]    [0, 0, 1, 0, ...]
│ □ ■ ■ □ │
│ ■ □ ■ □ │
└─────────┘
   784 pixels → 784 inputs to the neural network
```
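
The flatten-then-normalize steps above can be sketched on a toy 3×3 "image". The array below is made up for illustration and is not an MNIST sample; the same two operations apply unchanged to a 28×28 grid.

```python
import numpy as np

# Hypothetical 3x3 grayscale "image" with 8-bit pixel values
toy_image = np.array([[0,   0, 255],
                      [0, 255, 255],
                      [255, 0, 255]], dtype=np.uint8)

# Flatten: 2D grid -> 1D vector, read row by row
toy_flat = toy_image.reshape(-1)

# Normalize: scale [0, 255] -> [0.0, 1.0] (the result is floating point)
toy_input = toy_flat / 255.0

print(toy_flat.shape)
print(toy_input)
```

Reshaping with `.reshape(3, 3)` would recover the original grid, which is the same trick the notebook uses to display 784-element rows as 28×28 images.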

In [None]:
# Demonstrate the transformation
sample_idx = 0
sample_image = X[sample_idx].reshape(28, 28)
sample_flat = X[sample_idx]

print(f"🔍 Examining digit '{y[sample_idx]}':\n")
print(f"Original shape: {sample_image.shape} (28×28 grid)")
print(f"Flattened shape: {sample_flat.shape} (784-element vector)")
print(f"\nFirst 20 pixel values (before normalization):")
print(sample_flat[:20])

# Visualize one row of pixels
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Show the image
axes[0].imshow(sample_image, cmap='gray')
axes[0].set_title(f'Original Image: Digit {y[sample_idx]}', fontsize=14, fontweight='bold')
axes[0].axis('off')

# Show pixel intensity profile
row_14 = sample_image[14, :]  # Middle row
axes[1].plot(row_14, 'b-', linewidth=2, marker='o')
axes[1].set_xlabel('Pixel Position (0-27)', fontsize=12)
axes[1].set_ylabel('Pixel Intensity (0-255)', fontsize=12)
axes[1].set_title('Pixel Values in Row 14', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Prepare the dataset
# Normalize pixel values to [0, 1]
X_normalized = X / 255.0

# Convert labels to one-hot encoding
def to_one_hot(y, num_classes=10):
    """Convert class labels to one-hot encoded vectors"""
    one_hot = np.zeros((len(y), num_classes))
    one_hot[np.arange(len(y)), y] = 1
    return one_hot

y_one_hot = to_one_hot(y)

print("🔧 Data Preprocessing:")
print(f"  • Normalized pixel values: {X_normalized.min():.1f} to {X_normalized.max():.1f}")
print(f"  • One-hot encoding shape: {y_one_hot.shape}")
print(f"\nExample one-hot encoding for digit {y[0]}:")
print(y_one_hot[0])
print("\n💡 One-hot encoding: [0,0,0,1,0,0,0,0,0,0] means the digit is '3' (index 3)")
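
One-hot encoding is reversible: the position of the single 1 is the label, so `argmax` recovers it. The sketch below uses made-up labels and `np.eye` as a compact alternative to the `to_one_hot` helper; it is an illustration, not a replacement for the cell above.

```python
import numpy as np

labels = np.array([3, 0, 9])        # hypothetical digit labels

# Row i of the 10x10 identity matrix is the one-hot vector for digit i
encoded = np.eye(10)[labels]

# The index of the 1 in each row recovers the original label
decoded = encoded.argmax(axis=1)

print(encoded[0])
print(decoded)
```

The same `argmax` step is what turns the network's 10 output probabilities back into a single predicted digit.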

In [None]:
# Split into train and test sets
# Use first 60,000 for training, last 10,000 for testing (standard MNIST split)
X_train = X_normalized[:60000]
y_train = y_one_hot[:60000]
y_train_labels = y[:60000]

X_test = X_normalized[60000:]
y_test = y_one_hot[60000:]
y_test_labels = y[60000:]

print("📊 Dataset Split:")
print(f"  • Training set: {X_train.shape[0]:,} images")
print(f"  • Test set: {X_test.shape[0]:,} images")
print(f"\n✅ Data ready for training!")

## 📖 Function Explanations Available!

**Open this notebook for detailed line-by-line explanations:**

🔗 **`neural_networks_mnist_function_explanations.ipynb`**

Covers:
- `compute_loss()` - How to measure prediction errors
- `backward()` - How backpropagation works
- `train()` - The complete training loop
- `predict()` - Making predictions on new data

---

## Part 4: Network Architecture

We'll build a simple 3-layer neural network (an input layer, one hidden layer, and an output layer):

```
Input Layer     Hidden Layer     Output Layer
(784 neurons)   (128 neurons)    (10 neurons)
    │                 │                │
    │                 │                │
  [pixels]    →   [features]   →   [classes]
                ReLU activation   Softmax activation
```

**Parameters:**
- Input → Hidden: 784 × 128 = 100,352 weights + 128 biases
- Hidden → Output: 128 × 10 = 1,280 weights + 10 biases
- **Total: 101,770 parameters**

Much larger than XOR, but still relatively small for neural networks!
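
The parameter arithmetic above generalizes to any stack of fully connected layers. Here is a minimal sketch; `count_parameters` is a hypothetical helper introduced only for this illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the neurons per layer, e.g. [784, 128, 10].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per input-output pair
        total += n_out          # one bias per output neuron
    return total

print(count_parameters([784, 128, 10]))  # 101770
```

Calling it with other size lists, such as `[784, 256, 10]`, gives a quick way to see how the hidden layer width drives the total.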

In [None]:
# Define activation functions
def relu(x):
    """ReLU: Rectified Linear Unit"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU"""
    return (x > 0).astype(float)

def softmax(x):
    """Softmax activation for output layer"""
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # Stability trick
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Visualize ReLU
x = np.linspace(-3, 3, 100)
y_relu = relu(x)

plt.figure(figsize=(10, 4))
plt.plot(x, y_relu, 'b-', linewidth=2, label='ReLU(x)')
plt.grid(True, alpha=0.3)
plt.xlabel('Input (x)', fontsize=12)
plt.ylabel('Output', fontsize=12)
plt.title('ReLU Activation Function: max(0, x)', fontsize=14, fontweight='bold')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5, label='Activation threshold')
plt.legend()
plt.tight_layout()
plt.show()

print("✅ Activation functions defined")
print("\n🔍 ReLU Properties:")
print("  • Output range: [0, ∞)")
print("  • ReLU(-2) = 0")
print("  • ReLU(2) = 2")
print("  • Faster to compute than sigmoid")
print("  • Helps avoid vanishing gradients")
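
The "stability trick" in `softmax` matters because `np.exp` overflows for large inputs. The sketch below compares a naive version with the shifted version on deliberately large, made-up scores; `softmax_naive` and `softmax_stable` are names chosen for this illustration.

```python
import numpy as np

def softmax_naive(x):
    """Softmax without the max-subtraction step."""
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def softmax_stable(x):
    """Softmax with the row maximum subtracted before exponentiating."""
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])  # made-up, deliberately large

# exp(1000) overflows to inf, and inf / inf evaluates to nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(logits)

stable = softmax_stable(logits)

print(naive)
print(stable)
```

Subtracting the row maximum does not change the result mathematically, because the common factor cancels between numerator and denominator. It only keeps the exponentials in a representable range.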

In [None]:
class DigitRecognitionNetwork:
    def __init__(self, input_size=784, hidden_size=128, output_size=10):
        """Initialize the neural network"""
        # He initialization (scale sqrt(2 / fan_in)), a common choice for ReLU layers
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        
        # Training history
        self.loss_history = []
        self.accuracy_history = []
        
    def forward(self, X):
        """Forward propagation"""
        # Input → Hidden
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        
        # Hidden → Output
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = softmax(self.z2)
        
        return self.a2
    
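    def compute_loss(self, y_true, y_pred):
        """Cross-entropy loss"""
        m = y_true.shape[0]
        # Pick the predicted probability of the true class for each sample;
        # the small epsilon guards against log(0)
        log_likelihood = -np.log(y_pred[np.arange(m), y_true.argmax(axis=1)] + 1e-12)
        loss = np.sum(log_likelihood) / m
        return loss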
    
    def backward(self, X, y_true, learning_rate=0.01):
        """Backpropagation"""
        m = X.shape[0]
        
        # Output layer gradients
        dz2 = self.a2 - y_true
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        
        # Hidden layer gradients
        dz1 = np.dot(dz2, self.W2.T) * relu_derivative(self.z1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        
        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def train(self, X_train, y_train, X_test, y_test, epochs=10, batch_size=128, learning_rate=0.1):
        """Train the network with mini-batch gradient descent"""
        n_batches = len(X_train) // batch_size
        
        for epoch in range(epochs):
            # Shuffle training data
            indices = np.random.permutation(len(X_train))
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]
            
            # Mini-batch training
            for i in range(n_batches):
                start_idx = i * batch_size
                end_idx = start_idx + batch_size
                
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]
                
                # Forward and backward pass
                self.forward(X_batch)
                self.backward(X_batch, y_batch, learning_rate)
            
            # Evaluate on the full training and test sets after each epoch
            train_pred = self.forward(X_train)
            test_pred = self.forward(X_test)
            
            train_loss = self.compute_loss(y_train, train_pred)
            test_loss = self.compute_loss(y_test, test_pred)
            
            train_acc = np.mean(np.argmax(train_pred, axis=1) == np.argmax(y_train, axis=1))
            test_acc = np.mean(np.argmax(test_pred, axis=1) == np.argmax(y_test, axis=1))
            
            self.loss_history.append(test_loss)
            self.accuracy_history.append(test_acc)
            
            print(f"Epoch {epoch+1:2d}/{epochs} | "
                  f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | "
                  f"Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.4f}")
        
        print(f"\n✅ Training complete! Final test accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")
    
    def predict(self, X):
        """Make predictions"""
        probabilities = self.forward(X)
        return np.argmax(probabilities, axis=1)

# Create the network
nn = DigitRecognitionNetwork(input_size=784, hidden_size=128, output_size=10)

print("🧠 Neural Network Created")
print(f"\nArchitecture:")
print(f"  Input Layer:  784 neurons (28×28 pixels)")
print(f"  Hidden Layer: 128 neurons (ReLU activation)")
print(f"  Output Layer: 10 neurons (Softmax activation)")
print(f"\nTotal Parameters: {784*128 + 128 + 128*10 + 10:,}")
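
The line `dz2 = self.a2 - y_true` in `backward` relies on a standard result: for softmax followed by cross-entropy, the gradient of the loss with respect to the pre-softmax scores is simply "prediction minus target". The sketch below checks that numerically on made-up data. It is standalone and does not use the class above, and it folds the division by the batch size into the gradient, whereas the class applies it when computing `dW2` and `db2`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, y_onehot):
    """Mean cross-entropy of softmax(z) against one-hot targets."""
    p = softmax(z)
    return -np.sum(y_onehot * np.log(p)) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 10))                    # made-up scores for 4 samples
y_onehot = np.eye(10)[rng.integers(0, 10, 4)]   # made-up one-hot labels

# Analytic gradient of the mean loss with respect to the scores
analytic = (softmax(z) - y_onehot) / z.shape[0]

# Numerical gradient by central differences, one entry at a time
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(z_plus, y_onehot)
                         - cross_entropy(z_minus, y_onehot)) / (2 * eps)

max_diff = np.abs(analytic - numeric).max()
print(f"Largest difference between analytic and numerical gradient: {max_diff:.2e}")
```

A difference close to floating-point noise suggests the analytic formula is right. The same kind of check is a useful debugging habit when writing backpropagation by hand.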

## Part 5: Train the Network! 🚀

Now we'll train on 60,000 images. This will take a few minutes!

**What's happening:**
- Network sees each training image
- Makes a prediction (which digit?)
- Compares with true label
- Adjusts weights to improve

In [None]:
# Train the network
print("🏋️ Training the neural network on 60,000 handwritten digits...\n")
nn.train(
    X_train, y_train,
    X_test, y_test,
    epochs=10,
    batch_size=128,
    learning_rate=0.1
)

In [None]:
# Plot training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss over time
axes[0].plot(nn.loss_history, 'b-', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Test Loss', fontsize=12)
axes[0].set_title('Test Loss Over Time', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Accuracy over time
axes[1].plot([acc*100 for acc in nn.accuracy_history], 'g-', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=12)
axes[1].set_title('Test Accuracy Over Time', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([0, 100])

plt.tight_layout()
plt.show()

print(f"\n📈 Performance Summary:")
print(f"  • Test accuracy after epoch 1: {nn.accuracy_history[0]*100:.2f}%")
print(f"  • Final test accuracy: {nn.accuracy_history[-1]*100:.2f}%")
print(f"  • Improvement: {(nn.accuracy_history[-1] - nn.accuracy_history[0])*100:+.2f} percentage points")

## Part 6: Test the Network

Let's see how well the network performs on images it has never seen before!

In [None]:
# Make predictions on test set
predictions = nn.predict(X_test)
accuracy = accuracy_score(y_test_labels, predictions)

n_correct = int(np.sum(predictions == y_test_labels))  # exact count, avoids float rounding

print(f"🎯 Test Set Performance:")
print(f"  • Tested on: {len(X_test):,} images")
print(f"  • Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"  • Correct: {n_correct:,} images")
print(f"  • Incorrect: {len(X_test) - n_correct:,} images")

if accuracy > 0.95:
    print("\n🏆 Excellent! The network achieved >95% accuracy!")
elif accuracy > 0.90:
    print("\n👍 Good performance! Above 90% accuracy.")
else:
    print("\n⚡ The network is learning but could use more training epochs.")

In [None]:
# Visualize sample predictions
fig, axes = plt.subplots(3, 6, figsize=(15, 8))
fig.suptitle('Sample Predictions from Test Set', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    idx = np.random.randint(0, len(X_test))
    image = X_test[idx].reshape(28, 28)
    true_label = y_test_labels[idx]
    pred_label = predictions[idx]
    
    ax.imshow(image, cmap='gray')
    
    if true_label == pred_label:
        ax.set_title(f'✅ True: {true_label}, Pred: {pred_label}', 
                    fontsize=11, color='green', fontweight='bold')
    else:
        ax.set_title(f'❌ True: {true_label}, Pred: {pred_label}', 
                    fontsize=11, color='red', fontweight='bold')
    
    ax.axis('off')

plt.tight_layout()
plt.show()

## Part 7: Confusion Matrix - Where Does It Fail?

A confusion matrix shows which digits the network confuses with each other.
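
To see how such a matrix is assembled, here is a small sketch with three classes and made-up labels, built with plain NumPy rather than scikit-learn. Rows are true classes and columns are predicted classes, the same layout `confusion_matrix` uses.

```python
import numpy as np

true_labels = np.array([0, 1, 2, 2, 1, 0, 2])   # made-up true classes
pred_labels = np.array([0, 2, 2, 2, 1, 0, 1])   # made-up predictions

n_classes = 3
cm_toy = np.zeros((n_classes, n_classes), dtype=int)

# Add 1 to cell (true, predicted) for every sample; np.add.at handles repeated cells
np.add.at(cm_toy, (true_labels, pred_labels), 1)

print(cm_toy)
# [[2 0 0]
#  [0 1 1]
#  [0 1 2]]
```

The diagonal sums to 5 correct predictions out of 7, and the two off-diagonal entries show one "1" mistaken for a "2" and one "2" mistaken for a "1".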

In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test_labels, predictions)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
           xticklabels=range(10), yticklabels=range(10),
           cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=14, fontweight='bold')
plt.title('Confusion Matrix\n(Diagonal = Correct Predictions)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Confusion Matrix Insights:")
print("  • Diagonal values = correct predictions")
print("  • Off-diagonal = mistakes")
print("\n🔍 Common confusions:")

# Find most common confusions
confusions = []
for i in range(10):
    for j in range(10):
        if i != j and cm[i, j] > 10:
            confusions.append((i, j, cm[i, j]))

confusions.sort(key=lambda x: x[2], reverse=True)
for true_digit, pred_digit, count in confusions[:5]:
    print(f"  • Confused '{true_digit}' as '{pred_digit}': {count} times")

## Part 8: Visualize What the Network Learned

Let's peek inside the hidden layer to see what features it learned!

In [None]:
# Visualize hidden layer weights
# Each hidden neuron has 784 input weights (one per pixel)
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
fig.suptitle('Hidden Layer Features (First 32 neurons)', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    # Reshape weights to 28×28 to visualize as an image
    weights = nn.W1[:, i].reshape(28, 28)
    ax.imshow(weights, cmap='coolwarm', vmin=-1, vmax=1)
    ax.set_title(f'Neuron {i}', fontsize=9)
    ax.axis('off')

plt.tight_layout()
plt.show()

print("\n🧠 What are we seeing?")
print("  • Each image shows what ONE hidden neuron 'looks for' in the input")
print("  • Red areas = positive weights (bright pixels there push the neuron toward activating)")
print("  • Blue areas = negative weights (bright pixels there push the neuron toward staying off)")
print("  • These are the FEATURES the network learned to detect!")
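
A practical note on the color scale: the weights start out with a standard deviation of about sqrt(2/784) ≈ 0.05, so a fixed range of -1 to 1 can make the plots look pale if the trained weights stay small. One option, sketched here under that assumption, is to derive symmetric limits from the weights themselves. `W_demo` is a random stand-in for `nn.W1[:, :32]` so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the first 32 columns of nn.W1 (random, for illustration only)
W_demo = rng.normal(scale=np.sqrt(2.0 / 784), size=(784, 32))

limit = np.abs(W_demo).max()   # largest magnitude among the displayed weights
vmin, vmax = -limit, limit     # symmetric, so zero stays at the colormap midpoint

print(f"Suggested color range: {vmin:.3f} to {vmax:.3f}")
```

Passing these values as `vmin` and `vmax` to `imshow` keeps red and blue comparable across all 32 panels while using the full color range.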

## Part 9: Make Your Own Predictions!

Test the network on specific examples:

In [None]:
# Interactive prediction function
def show_prediction_detail(idx):
    """Show detailed prediction for a specific test image"""
    image = X_test[idx].reshape(28, 28)
    true_label = y_test_labels[idx]
    
    # Get prediction probabilities
    probs = nn.forward(X_test[idx:idx+1])[0]
    pred_label = np.argmax(probs)
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Show image
    axes[0].imshow(image, cmap='gray')
    axes[0].set_title(f'Test Image #{idx}\nTrue Label: {true_label}', 
                     fontsize=14, fontweight='bold')
    axes[0].axis('off')
    
    # Show prediction probabilities
    axes[1].bar(range(10), probs * 100, color=['green' if i == pred_label else 'lightblue' for i in range(10)])
    axes[1].set_xlabel('Digit', fontsize=12)
    axes[1].set_ylabel('Confidence (%)', fontsize=12)
    axes[1].set_title(f'Network Prediction: {pred_label} ({probs[pred_label]*100:.1f}% confidence)', 
                     fontsize=14, fontweight='bold')
    axes[1].set_xticks(range(10))
    axes[1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n📊 Prediction Breakdown:")
    for digit in range(10):
        symbol = "  ← PREDICTION" if digit == pred_label else ""
        print(f"  Digit {digit}: {probs[digit]*100:6.2f}%{symbol}")

# Try a few examples
print("🔍 Examining detailed predictions:\n")
for idx in [0, 100, 500]:
    show_prediction_detail(idx)
    print("\n" + "="*70 + "\n")

## Summary & Key Takeaways

### What We Built:
- ✅ A 3-layer neural network (Input → Hidden → Output)
- ✅ Trained on 60,000 handwritten digit images
- ✅ Achieved ~95%+ accuracy on unseen test data
- ✅ Used 101,770 learnable parameters

### Key Insights:

1. **Images → Vectors**: 28×28 images become 784-element input vectors
2. **Hidden Layers Learn Features**: The 128 hidden neurons learn weight patterns that respond to strokes, curves, and digit-specific shapes
3. **One-Hot Encoding**: 10 output neurons (one per digit) with softmax activation
4. **ReLU Activation**: Cheaper to compute than sigmoid and usually trains better in hidden layers
5. **Mini-Batch Training**: Processing 128 images at a time is more efficient than updating after every single image

### What's Next?

In **Lesson 2**, we'll dive deep into:
- The mathematics of backpropagation
- How gradient descent really works
- Modern architectures: CNNs, Transformers, and LLMs
- The principles behind ChatGPT and Claude

---

## 🎓 Challenges (Optional)

1. **Experiment with architecture**: Try different hidden layer sizes (64, 256, 512). How does it affect accuracy and training time?

2. **Add more layers**: Can you create a 2-hidden-layer network? (784 → 128 → 64 → 10)

3. **Learning rate tuning**: Test different learning rates (0.01, 0.05, 0.2). What's optimal?

4. **Analyze errors**: Find 10 images the network got wrong. Do you see patterns in the mistakes?

---

**Congratulations!** 🎉 You've built and trained a real-world image classification system!

**Next:** Move on to **Lesson 2** to understand the theory behind what you just implemented.