# Week 7 — Convolutional Neural Networks (Image Classification)

**Objectives**

- Load and visualize a subset of the CIFAR-10 image dataset (more complex than MNIST digits).
- Build a PCA + Logistic Regression baseline classifier; compute evaluation metrics (e.g., accuracy).
- Implement a simple CNN classifier that barely beats or fails to beat the baseline.
- Build a deeper CNN model that achieves better performance.
- Train a CNN model on data‑augmented images and visualize augmentations.
- Explore an advanced CNN feature (e.g., global average pooling) and observe its impact.


In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt

from utils import (
    show_result,
    load_cifar10_dataset,
    pca_logistic_baseline,
    test_exercise_7_pca,
    test_exercise_7_simple_cnn,
    test_exercise_7_proper_cnn,
    test_exercise_7_data_aug_cnn,
    test_exercise_7_advanced_cnn,
    accuracy
)


## 1. CIFAR-10 Image Dataset

In this exercise, we'll use the CIFAR-10 dataset, a well-known benchmark dataset for image classification. CIFAR-10 contains 60,000 32×32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6,000 images per class.

We'll load the dataset from HuggingFace and convert it to grayscale to simplify training and focus on the CNN architecture rather than computational complexity.

**Task:** Use the `load_cifar10_dataset` function from `utils.py` to load a subset of CIFAR-10, then visualize a few random training samples with their class names. Report the number of training and test examples.


In [None]:
# Load CIFAR-10 dataset
# Using a smaller subset for faster training: 1000 train, 200 test
X_train, y_train, X_test, y_test, class_names = load_cifar10_dataset(
    n_train=1000, n_test=200, seed=0, grayscale=True
)
print(f"Training set size: {len(X_train)}, Test set size: {len(X_test)}")
print(f"Image shape: {X_train.shape[1:]}")
print(f"Classes: {class_names}")

# Visualize a few random samples from the training set
fig, axes = plt.subplots(1, 5, figsize=(12, 2.5))
for ax in axes:
    idx = random.randint(0, len(X_train) - 1)
    ax.imshow(X_train[idx], cmap='gray')
    ax.set_title(f"{class_names[y_train[idx]]}\n(label {y_train[idx]})")
    ax.axis('off')
plt.tight_layout()
plt.show()


## 2. PCA + Logistic Regression Baseline

A simple yet strong baseline for image classification is to flatten each image into a vector, project it onto a lower‑dimensional subspace using **Principal Component Analysis (PCA)**, and then train a multinomial logistic regression classifier.

1. Flatten the training and test images (shape `(N, H*W)`).
2. Fit a PCA model on the training data and project both the training and test data into a lower‑dimensional space (e.g., 20 components).
3. Train a `LogisticRegression` classifier on the reduced features.
4. Evaluate the classifier using **accuracy** (the fraction of correct predictions).
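
The flatten, project, and score steps above can be sketched in plain NumPy. This is only an illustration under assumptions: `pca_fit`, `pca_transform`, and `fraction_correct` are hypothetical helper names (not part of `utils.py` or scikit-learn), the data is random stand-in data, and the classifier step is omitted.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (mean, components) for samples stored as rows of X, shape (N, D)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the retained principal directions."""
    return (X - mean) @ components.T

def fraction_correct(y_true, y_pred):
    """Accuracy: the fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

rng = np.random.default_rng(0)
images = rng.random((50, 8, 8))                      # stand-in for (N, H, W) images
X_flat = images.reshape(len(images), -1)             # step 1: flatten to (N, H*W)
mean, components = pca_fit(X_flat, n_components=5)   # step 2: fit on training data only
Z = pca_transform(X_flat, mean, components)          # reuse mean/components for test data
print(Z.shape)                                       # (50, 5)
print(fraction_correct([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```

The key design point is that the mean and components come from the training set and are reused unchanged for the test set, so no test information leaks into the projection.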

**Task:** Complete the function `student_pca_baseline(...)` below to implement this baseline. It should return the test accuracy as a float in `[0,1]`.


In [None]:
def student_pca_baseline(train_images, train_labels, test_images, test_labels, n_components=20):
    '''
    Implements a PCA + Logistic Regression baseline classifier.

    Parameters:
        train_images: numpy array of shape (N_train, H, W) with float32 values in [0,1].
        train_labels: numpy array of shape (N_train,) of integer labels.
        test_images: numpy array of shape (N_test, H, W).
        test_labels: numpy array of shape (N_test,).
        n_components: number of principal components to retain.

    Returns:
        Test accuracy as a float in [0,1].
    '''
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    
    # Flatten images to (N, D) where D = H*W
    X_train_flat = train_images.reshape(train_images.shape[0], -1)
    X_test_flat = test_images.reshape(test_images.shape[0], -1)
    
    # Fit PCA on training data
    # Ensure n_components doesn't exceed the number of features
    k = min(n_components, X_train_flat.shape[1])
    pca = PCA(n_components=k)
    X_train_pca = pca.fit_transform(X_train_flat)
    X_test_pca = pca.transform(X_test_flat)
    
    # Train logistic regression classifier
    clf = LogisticRegression(max_iter=500, random_state=0)
    clf.fit(X_train_pca, train_labels)
    
    # Make predictions on test set
    predictions = clf.predict(X_test_pca)
    
    # Compute accuracy
    acc = accuracy(test_labels, predictions)
    return acc


In [None]:
# Evaluate the PCA baseline implementation
res = test_exercise_7_pca(student_pca_baseline)
show_result("Exercise 1 – PCA Baseline", res)

# If implemented, you can also test on the CIFAR-10 subset loaded above
try:
    acc = student_pca_baseline(X_train, y_train, X_test, y_test, 20)
    print(f"PCA baseline accuracy on the CIFAR-10 subset: {acc:.3f}")
except NotImplementedError:
    print("Implement student_pca_baseline above.")


## 3. Simple Convolutional Neural Network

Convolutional neural networks (CNNs) process images by learning **filters** that extract local patterns. We'll start with a very small CNN:

- One convolutional layer with a few filters (e.g., 4 filters, each $3 \times 3$).
- Apply a non‑linear activation such as ReLU.
- Flatten the result and feed it into a linear layer with softmax to produce class probabilities.
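
A convolution written with nested Python loops is slow: for 1,000 images of size 32×32 with 4 filters, one forward pass visits 1000 × 4 × 30 × 30 = 3.6 million patches. The sketch below shows one possible vectorized alternative, assuming NumPy 1.20 or newer for `sliding_window_view`. The name `conv2d_valid` is hypothetical and chosen for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, W, b):
    """Valid cross-correlation of x (N, H, W) with filters W (F, k, k).

    Returns an array of shape (N, F, H - k + 1, W - k + 1).
    """
    k = W.shape[1]
    # windows[n, h, w] is the k x k patch whose top-left corner is (h, w).
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    # Sum patch * filter over the two patch axes for every image, filter, position.
    return np.einsum('nhwij,fij->nfhw', windows, W) + b[None, :, None, None]

rng = np.random.default_rng(0)
images = rng.random((2, 32, 32))             # stand-in grayscale images
filters = rng.standard_normal((4, 3, 3)) * 0.1
bias = np.zeros(4)
print(conv2d_valid(images, filters, bias).shape)   # (2, 4, 30, 30)
```

Like most deep learning libraries, this computes a cross-correlation (the filter is not flipped), which matches the patch-times-filter sum described above.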

For training, use cross‑entropy loss and plain gradient descent for a few epochs. Because this network is very shallow and the dataset is small, it may perform worse than the PCA baseline.
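
For reference, the gradient of the softmax cross-entropy loss with respect to the final linear layer can be worked out as follows. The symbols are introduced here for illustration: $a_n$ is the flattened feature vector of example $n$, $z_n = a_n W + b$ are its logits, $p_{nk}$ the softmax probabilities, and $y_{nk}$ the one-hot labels.

$$
p_{nk} = \frac{e^{z_{nk}}}{\sum_j e^{z_{nj}}}, \qquad
L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k} y_{nk} \log p_{nk}
$$

Using $\partial \log p_{nk} / \partial z_{nj} = \delta_{kj} - p_{nj}$ and $\sum_k y_{nk} = 1$:

$$
\frac{\partial L}{\partial z_{nj}} = \frac{1}{N}\,(p_{nj} - y_{nj}), \qquad
\frac{\partial L}{\partial W} = \frac{1}{N}\, A^{\top} (P - Y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{n=1}^{N} (p_n - y_n)
$$

Here $A$, $P$, and $Y$ stack $a_n$, $p_n$, and $y_n$ as rows. A gradient descent step then subtracts the learning rate times these gradients from $W$ and $b$.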

**Task:** Complete the function `student_simple_cnn(...)` below. It should construct the described network, train it for a few epochs on the training set, and return the test accuracy.


In [None]:
def student_simple_cnn(train_images, train_labels, test_images, test_labels, num_epochs=5, learning_rate=0.01):
    '''
    Build and train a simple CNN with one convolutional layer followed by a linear classifier.
    Use small filter sizes (e.g., 3x3) and a small number of filters (e.g., 4).

    Parameters:
        train_images: numpy array (N_train, H, W).
        train_labels: numpy array (N_train,).
        test_images: numpy array (N_test, H, W).
        test_labels: numpy array (N_test,).
        num_epochs: number of training epochs.
        learning_rate: step size for gradient descent.

    Returns:
        Test accuracy as a float.
    '''
    np.random.seed(0)
    
    # Network parameters
    n_filters = 4
    filter_size = 3
    n_classes = len(np.unique(train_labels))
    
    # Get image dimensions
    H, W = train_images.shape[1], train_images.shape[2]
    
    # Initialize weights
    # Conv layer: (n_filters, filter_size, filter_size)
    W_conv = np.random.randn(n_filters, filter_size, filter_size) * 0.1
    b_conv = np.zeros(n_filters)
    
    # Output size after convolution (valid padding)
    out_h = H - filter_size + 1
    out_w = W - filter_size + 1
    flat_size = n_filters * out_h * out_w
    
    # Fully connected layer
    W_fc = np.random.randn(flat_size, n_classes) * 0.1
    b_fc = np.zeros(n_classes)
    
    def conv2d(x, W, b):
        """Simple 2D convolution with valid padding"""
        n_samples = x.shape[0]
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        patch = x[i, h:h+filter_size, w:w+filter_size]
                        out[i, f, h, w] = np.sum(patch * W[f]) + b[f]
        return out
    
    def relu(x):
        return np.maximum(0, x)
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    # Training loop
    for epoch in range(num_epochs):
        # Forward pass
        # Conv + ReLU
        conv_out = conv2d(train_images, W_conv, b_conv)
        relu_out = relu(conv_out)
        
        # Flatten
        flat = relu_out.reshape(train_images.shape[0], -1)
        
        # Fully connected + softmax
        logits = flat @ W_fc + b_fc
        probs = softmax(logits)
        
        # Cross-entropy loss
        y_one_hot = np.zeros((train_images.shape[0], n_classes))
        y_one_hot[np.arange(train_images.shape[0]), train_labels] = 1
        
        # Backward pass (simplified: only the fully connected layer is updated;
        # the conv filters keep their random initial values)
        dlogits = probs - y_one_hot
        dW_fc = flat.T @ dlogits / train_images.shape[0]
        db_fc = np.mean(dlogits, axis=0)
        
        # Update weights
        W_fc -= learning_rate * dW_fc
        b_fc -= learning_rate * db_fc
    
    # Evaluate on test set
    conv_out = conv2d(test_images, W_conv, b_conv)
    relu_out = relu(conv_out)
    flat = relu_out.reshape(test_images.shape[0], -1)
    logits = flat @ W_fc + b_fc
    predictions = np.argmax(logits, axis=1)
    
    acc = accuracy(test_labels, predictions)
    return acc


In [None]:
# Evaluate the simple CNN implementation
res = test_exercise_7_simple_cnn(student_simple_cnn)
show_result("Exercise 2 – Simple CNN", res)

# Optional: test on the CIFAR-10 subset loaded above
try:
    acc = student_simple_cnn(X_train, y_train, X_test, y_test)
    print(f"Simple CNN accuracy: {acc:.3f}")
except NotImplementedError:
    print("Implement student_simple_cnn above.")


## 4. Deeper CNN (Improved Model)

Now extend your network to have **two convolutional layers** (each followed by ReLU) before flattening and passing to a linear classifier. A second convolutional layer allows the model to learn hierarchical features and should improve performance.
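
Each valid (unpadded) convolution shrinks the feature map, which determines the input size of the linear classifier. The following sketch works through that arithmetic, assuming 32×32 inputs, 3×3 filters, and 8 filters in the second layer. `valid_conv_output` is a hypothetical helper name.

```python
def valid_conv_output(size, kernel, n_layers=1):
    """Spatial size after n_layers of valid convolution with stride 1."""
    for _ in range(n_layers):
        size = size - kernel + 1
    return size

h1 = valid_conv_output(32, 3)               # after the first conv layer
h2 = valid_conv_output(32, 3, n_layers=2)   # after the second conv layer
flat_size = 8 * h2 * h2                     # 8 feature maps of size h2 x h2

print(h1, h2, flat_size)   # 30 28 6272
```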

**Task:** Complete the function `student_proper_cnn(...)` below. Train your network for more epochs if needed and return the test accuracy.
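
Note that a training loop which updates only `W_fc` and `b_fc` leaves the convolutional filters at their random initial values, so the convolutional layers act as fixed random feature extractors. The sketch below shows one way the filter gradients could be computed for a single conv layer followed by ReLU, flatten, and a linear softmax classifier. It is a minimal illustration under assumptions (the function name `forward_backward` and the dictionary layout are hypothetical), not the notebook's reference solution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def forward_backward(x, y, W_conv, b_conv, W_fc, b_fc):
    """Mean cross-entropy loss and gradients for conv -> ReLU -> flatten -> linear."""
    N, k = x.shape[0], W_conv.shape[1]
    windows = sliding_window_view(x, (k, k), axis=(1, 2))   # (N, H', W', k, k)
    conv = np.einsum('nhwij,fij->nfhw', windows, W_conv) + b_conv[None, :, None, None]
    flat = np.maximum(0, conv).reshape(N, -1)
    logits = flat @ W_fc + b_fc
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(N), y]))

    # Gradient of the mean loss with respect to the logits: (probs - one_hot) / N
    dlogits = probs.copy()
    dlogits[np.arange(N), y] -= 1
    dlogits /= N
    grads = {'W_fc': flat.T @ dlogits, 'b_fc': dlogits.sum(axis=0)}

    # Backpropagate through the linear layer, the flatten, and the ReLU.
    dconv = (dlogits @ W_fc.T).reshape(conv.shape) * (conv > 0)
    # Each filter weight touches every patch, so sum over images and positions.
    grads['W_conv'] = np.einsum('nfhw,nhwij->fij', dconv, windows)
    grads['b_conv'] = dconv.sum(axis=(0, 2, 3))
    return loss, grads

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))          # 4 tiny stand-in images
y = np.array([0, 1, 2, 1])
params = {
    'W_conv': rng.standard_normal((2, 3, 3)) * 0.1,
    'b_conv': np.zeros(2),
    'W_fc': rng.standard_normal((2 * 4 * 4, 3)) * 0.1,
    'b_fc': np.zeros(3),
}
loss, grads = forward_backward(x, y, **params)
for name in params:                          # one gradient descent step on all parameters
    params[name] -= 0.01 * grads[name]
```

Extending this to two conv layers additionally requires the gradient with respect to the first layer's output, which is omitted here for brevity.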


In [None]:
def student_proper_cnn(train_images, train_labels, test_images, test_labels, num_epochs=10, learning_rate=0.01):
    '''
    Build and train a CNN with two convolutional layers.
    - Conv1: (e.g., 8 filters of size 3x3)
    - ReLU
    - Conv2: (e.g., 8 filters of size 3x3)
    - ReLU
    - Flatten -> Linear classifier

    Parameters:
        train_images: numpy array (N_train, H, W).
        train_labels: numpy array (N_train,).
        test_images: numpy array (N_test, H, W).
        test_labels: numpy array (N_test,).
        num_epochs: number of training epochs.
        learning_rate: step size for gradient descent.

    Returns:
        Test accuracy.
    '''
    np.random.seed(0)
    
    # Network parameters
    n_filters1 = 8
    n_filters2 = 8
    filter_size = 3
    n_classes = len(np.unique(train_labels))
    
    # Get image dimensions
    H, W = train_images.shape[1], train_images.shape[2]
    
    # Initialize weights
    # Conv layer 1: input is 1 channel (grayscale)
    W_conv1 = np.random.randn(n_filters1, filter_size, filter_size) * 0.1
    b_conv1 = np.zeros(n_filters1)
    
    # Output size after first convolution
    out_h1 = H - filter_size + 1
    out_w1 = W - filter_size + 1
    
    # Conv layer 2: input is n_filters1 channels
    W_conv2 = np.random.randn(n_filters2, n_filters1, filter_size, filter_size) * 0.1
    b_conv2 = np.zeros(n_filters2)
    
    # Output size after second convolution
    out_h2 = out_h1 - filter_size + 1
    out_w2 = out_w1 - filter_size + 1
    flat_size = n_filters2 * out_h2 * out_w2
    
    # Fully connected layer
    W_fc = np.random.randn(flat_size, n_classes) * 0.1
    b_fc = np.zeros(n_classes)
    
    def conv2d_single(x, W, b):
        """Convolution for single-channel input"""
        n_samples = x.shape[0]
        out_h = x.shape[1] - W.shape[1] + 1
        out_w = x.shape[2] - W.shape[2] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        patch = x[i, h:h+filter_size, w:w+filter_size]
                        out[i, f, h, w] = np.sum(patch * W[f]) + b[f]
        return out
    
    def conv2d_multi(x, W, b):
        """Convolution for multi-channel input"""
        n_samples = x.shape[0]
        out_h = x.shape[2] - W.shape[2] + 1
        out_w = x.shape[3] - W.shape[3] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        for c in range(W.shape[1]):
                            patch = x[i, c, h:h+filter_size, w:w+filter_size]
                            out[i, f, h, w] += np.sum(patch * W[f, c])
                        out[i, f, h, w] += b[f]
        return out
    
    def relu(x):
        return np.maximum(0, x)
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    # Training loop
    for epoch in range(num_epochs):
        # Forward pass
        # Conv1 + ReLU
        conv1_out = conv2d_single(train_images, W_conv1, b_conv1)
        relu1_out = relu(conv1_out)
        
        # Conv2 + ReLU
        conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
        relu2_out = relu(conv2_out)
        
        # Flatten
        flat = relu2_out.reshape(train_images.shape[0], -1)
        
        # Fully connected + softmax
        logits = flat @ W_fc + b_fc
        probs = softmax(logits)
        
        # Cross-entropy loss
        y_one_hot = np.zeros((train_images.shape[0], n_classes))
        y_one_hot[np.arange(train_images.shape[0]), train_labels] = 1
        
        # Backward pass (simplified: only the fully connected layer is updated;
        # both conv layers keep their random initial filters)
        dlogits = probs - y_one_hot
        dW_fc = flat.T @ dlogits / train_images.shape[0]
        db_fc = np.mean(dlogits, axis=0)
        
        # Update weights
        W_fc -= learning_rate * dW_fc
        b_fc -= learning_rate * db_fc
    
    # Evaluate on test set
    conv1_out = conv2d_single(test_images, W_conv1, b_conv1)
    relu1_out = relu(conv1_out)
    conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
    relu2_out = relu(conv2_out)
    flat = relu2_out.reshape(test_images.shape[0], -1)
    logits = flat @ W_fc + b_fc
    predictions = np.argmax(logits, axis=1)
    
    acc = accuracy(test_labels, predictions)
    return acc


In [None]:
# Evaluate the deeper CNN implementation
res = test_exercise_7_proper_cnn(student_proper_cnn)
show_result("Exercise 3 – Proper CNN", res)

# Optional: test on the CIFAR-10 subset loaded above
try:
    acc = student_proper_cnn(X_train, y_train, X_test, y_test)
    print(f"Proper CNN accuracy: {acc:.3f}")
except NotImplementedError:
    print("Implement student_proper_cnn above.")


## 5. CNN with Data Augmentation

Data augmentation generates new training examples by applying random transformations to the original images. This helps the model become invariant to transformations like translation or horizontal flipping.

Typical augmentations for small grayscale images include:

- Random horizontal flips.
- Random small shifts in position.
- Adding a bit of random noise.
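
One detail worth knowing about shifts: `np.roll` wraps pixels around the border, so content pushed off one edge reappears on the opposite edge. The sketch below shows an alternative that fills the vacated pixels with zeros. `shift_zero_pad` is a hypothetical helper, and whether wrap-around matters for 1 to 2 pixel shifts on this dataset is an empirical question.

```python
import numpy as np

def shift_zero_pad(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling vacated pixels with zeros."""
    H, W = img.shape
    out = np.zeros_like(img)
    # Source and destination slices cover only the region that stays in frame.
    src_y, dst_y = slice(max(0, -dy), min(H, H - dy)), slice(max(0, dy), min(H, H + dy))
    src_x, dst_x = slice(max(0, -dx), min(W, W - dx)), slice(max(0, dx), min(W, W + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

img = np.arange(9).reshape(3, 3)
print(shift_zero_pad(img, 1, 0))   # first row is zeros, last row is dropped
print(np.roll(img, 1, axis=0))     # last row wraps around to the top
```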

**Task:** Complete the function `student_data_aug_cnn(...)` below. Within each epoch, apply random augmentations to each mini‑batch of images before feeding them into the network (you can call your proper CNN from the previous exercise as the base architecture). Return the test accuracy.
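
A per-mini-batch augmentation loop could be organized as in the following sketch. Both `iterate_minibatches` and `random_flip` are hypothetical helper names, the data is random stand-in data, and the forward and backward passes are left as a placeholder comment.

```python
import numpy as np

def iterate_minibatches(images, labels, batch_size, rng):
    """Yield shuffled (images, labels) mini-batches that cover the data once."""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

def random_flip(batch, rng):
    """Flip each image in a (B, H, W) batch left-right with probability 0.5."""
    flip = rng.random(len(batch)) < 0.5
    out = batch.copy()
    out[flip] = out[flip, :, ::-1]
    return out

rng = np.random.default_rng(0)
images = rng.random((10, 4, 4))
labels = np.arange(10)
for epoch in range(2):
    for xb, yb in iterate_minibatches(images, labels, batch_size=4, rng=rng):
        xb_aug = random_flip(xb, rng)   # fresh random augmentation for every batch
        # ... forward pass, backward pass, and parameter update on (xb_aug, yb)
```

Because augmentation is re-sampled every epoch, the network rarely sees exactly the same image twice, while the labels stay unchanged.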

(Optional) Visualize a few examples of the original and augmented images.


In [None]:
def student_data_aug_cnn(train_images, train_labels, test_images, test_labels, num_epochs=10, learning_rate=0.01):
    '''
    Train a CNN on augmented data.

    You can reuse your two‑layer CNN architecture from the previous section.
    Apply random augmentations (flips, shifts, noise) to the training images during training.
    Do not augment the test set.

    Parameters:
        train_images: numpy array (N_train, H, W).
        train_labels: numpy array (N_train,).
        test_images: numpy array (N_test, H, W).
        test_labels: numpy array (N_test,).
        num_epochs: number of training epochs.
        learning_rate: step size for gradient descent.

    Returns:
        Test accuracy.
    '''
    np.random.seed(0)
    
    def augment_image(img):
        """Apply random augmentations to a single image"""
        aug_img = img.copy()
        
        # Random horizontal flip
        if np.random.random() > 0.5:
            aug_img = np.fliplr(aug_img)
        
        # Random small shift (up to 2 pixels); np.roll wraps pixels around the border
        shift_h = np.random.randint(-2, 3)
        shift_w = np.random.randint(-2, 3)
        aug_img = np.roll(aug_img, shift_h, axis=0)
        aug_img = np.roll(aug_img, shift_w, axis=1)
        
        # Add small random noise
        noise = np.random.randn(*aug_img.shape) * 0.05
        aug_img = aug_img + noise
        aug_img = np.clip(aug_img, 0, 1)
        
        return aug_img
    
    def augment_batch(images):
        """Apply augmentations to a batch of images"""
        return np.array([augment_image(img) for img in images])
    
    # Network parameters (same as proper CNN)
    n_filters1 = 8
    n_filters2 = 8
    filter_size = 3
    n_classes = len(np.unique(train_labels))
    
    H, W = train_images.shape[1], train_images.shape[2]
    
    # Initialize weights
    W_conv1 = np.random.randn(n_filters1, filter_size, filter_size) * 0.1
    b_conv1 = np.zeros(n_filters1)
    
    out_h1 = H - filter_size + 1
    out_w1 = W - filter_size + 1
    
    W_conv2 = np.random.randn(n_filters2, n_filters1, filter_size, filter_size) * 0.1
    b_conv2 = np.zeros(n_filters2)
    
    out_h2 = out_h1 - filter_size + 1
    out_w2 = out_w1 - filter_size + 1
    flat_size = n_filters2 * out_h2 * out_w2
    
    W_fc = np.random.randn(flat_size, n_classes) * 0.1
    b_fc = np.zeros(n_classes)
    
    def conv2d_single(x, W, b):
        n_samples = x.shape[0]
        out_h = x.shape[1] - W.shape[1] + 1
        out_w = x.shape[2] - W.shape[2] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        patch = x[i, h:h+filter_size, w:w+filter_size]
                        out[i, f, h, w] = np.sum(patch * W[f]) + b[f]
        return out
    
    def conv2d_multi(x, W, b):
        n_samples = x.shape[0]
        out_h = x.shape[2] - W.shape[2] + 1
        out_w = x.shape[3] - W.shape[3] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        for c in range(W.shape[1]):
                            patch = x[i, c, h:h+filter_size, w:w+filter_size]
                            out[i, f, h, w] += np.sum(patch * W[f, c])
                        out[i, f, h, w] += b[f]
        return out
    
    def relu(x):
        return np.maximum(0, x)
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    # Training loop with augmentation
    for epoch in range(num_epochs):
        # Apply augmentation to training images
        aug_images = augment_batch(train_images)
        
        # Forward pass
        conv1_out = conv2d_single(aug_images, W_conv1, b_conv1)
        relu1_out = relu(conv1_out)
        conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
        relu2_out = relu(conv2_out)
        flat = relu2_out.reshape(aug_images.shape[0], -1)
        logits = flat @ W_fc + b_fc
        probs = softmax(logits)
        
        # Cross-entropy loss
        y_one_hot = np.zeros((aug_images.shape[0], n_classes))
        y_one_hot[np.arange(aug_images.shape[0]), train_labels] = 1
        
        # Backward pass (only the fully connected layer is updated)
        dlogits = probs - y_one_hot
        dW_fc = flat.T @ dlogits / aug_images.shape[0]
        db_fc = np.mean(dlogits, axis=0)
        
        # Update weights
        W_fc -= learning_rate * dW_fc
        b_fc -= learning_rate * db_fc
    
    # Evaluate on test set (no augmentation)
    conv1_out = conv2d_single(test_images, W_conv1, b_conv1)
    relu1_out = relu(conv1_out)
    conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
    relu2_out = relu(conv2_out)
    flat = relu2_out.reshape(test_images.shape[0], -1)
    logits = flat @ W_fc + b_fc
    predictions = np.argmax(logits, axis=1)
    
    acc = accuracy(test_labels, predictions)
    return acc


In [None]:
# Evaluate the data‑augmented CNN implementation
res = test_exercise_7_data_aug_cnn(student_data_aug_cnn)
show_result("Exercise 4 – Data‑Augmented CNN", res)

# Optional: test on the CIFAR-10 subset loaded above
try:
    acc = student_data_aug_cnn(X_train, y_train, X_test, y_test)
    print(f"Data‑augmented CNN accuracy: {acc:.3f}")
except NotImplementedError:
    print("Implement student_data_aug_cnn above.")


## 6. Advanced CNN Feature: Global Average Pooling

One way to reduce the number of parameters in a CNN is to replace the flatten operation with **global average pooling**. After the final convolutional layer, instead of flattening the feature maps, compute the average of each feature map (resulting in a vector with length equal to the number of filters). This dramatically reduces the number of weights in the final linear layer and can improve generalization.
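
To make the parameter reduction concrete, the sketch below compares the two options on stand-in feature maps. It assumes 8 feature maps of size 28×28 (two valid 3×3 convolutions on 32×32 inputs) and 10 classes. The array contents are random and only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((5, 8, 28, 28))   # (N, C, H, W) after the last conv + ReLU
n_classes = 10

flat = feature_maps.reshape(len(feature_maps), -1)   # flatten: (5, 6272)
gap = feature_maps.mean(axis=(2, 3))                 # global average pooling: (5, 8)

flat_weights = flat.shape[1] * n_classes   # weights in the linear layer after flatten
gap_weights = gap.shape[1] * n_classes     # weights in the linear layer after GAP

print(flat.shape, gap.shape)       # (5, 6272) (5, 8)
print(flat_weights, gap_weights)   # 62720 80
```

The pooled vector also does not change when a feature map is circularly shifted, since the mean ignores where activations occur. This is one way to see why pooling discards spatial position.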

**Task:** Complete `student_advanced_cnn(...)` below. Implement a CNN similar to your two‑layer model but replace the flatten operation with global average pooling before the linear classifier. Train the network and return the test accuracy.


In [None]:
def student_advanced_cnn(train_images, train_labels, test_images, test_labels, num_epochs=10, learning_rate=0.01):
    '''
    Build and train a CNN with global average pooling instead of flattening.

    After the final convolutional layer, compute the spatial average of each feature map. This
    reduces the dimensionality dramatically and can act as a regularizer.

    Parameters:
        train_images: numpy array (N_train, H, W).
        train_labels: numpy array (N_train,).
        test_images: numpy array (N_test, H, W).
        test_labels: numpy array (N_test,).
        num_epochs: number of training epochs.
        learning_rate: step size for gradient descent.

    Returns:
        Test accuracy.
    '''
    np.random.seed(0)
    
    # Network parameters
    n_filters1 = 8
    n_filters2 = 8
    filter_size = 3
    n_classes = len(np.unique(train_labels))
    
    H, W = train_images.shape[1], train_images.shape[2]
    
    # Initialize weights
    W_conv1 = np.random.randn(n_filters1, filter_size, filter_size) * 0.1
    b_conv1 = np.zeros(n_filters1)
    
    out_h1 = H - filter_size + 1
    out_w1 = W - filter_size + 1
    
    W_conv2 = np.random.randn(n_filters2, n_filters1, filter_size, filter_size) * 0.1
    b_conv2 = np.zeros(n_filters2)
    
    # After global average pooling, we have n_filters2 features
    # (instead of n_filters2 * out_h2 * out_w2)
    W_fc = np.random.randn(n_filters2, n_classes) * 0.1
    b_fc = np.zeros(n_classes)
    
    def conv2d_single(x, W, b):
        n_samples = x.shape[0]
        out_h = x.shape[1] - W.shape[1] + 1
        out_w = x.shape[2] - W.shape[2] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        patch = x[i, h:h+filter_size, w:w+filter_size]
                        out[i, f, h, w] = np.sum(patch * W[f]) + b[f]
        return out
    
    def conv2d_multi(x, W, b):
        n_samples = x.shape[0]
        out_h = x.shape[2] - W.shape[2] + 1
        out_w = x.shape[3] - W.shape[3] + 1
        out = np.zeros((n_samples, W.shape[0], out_h, out_w))
        for i in range(n_samples):
            for f in range(W.shape[0]):
                for h in range(out_h):
                    for w in range(out_w):
                        for c in range(W.shape[1]):
                            patch = x[i, c, h:h+filter_size, w:w+filter_size]
                            out[i, f, h, w] += np.sum(patch * W[f, c])
                        out[i, f, h, w] += b[f]
        return out
    
    def global_avg_pool(x):
        """Global average pooling: average over spatial dimensions"""
        # Input shape: (N, C, H, W)
        # Output shape: (N, C)
        return np.mean(x, axis=(2, 3))
    
    def relu(x):
        return np.maximum(0, x)
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    # Training loop
    for epoch in range(num_epochs):
        # Forward pass
        conv1_out = conv2d_single(train_images, W_conv1, b_conv1)
        relu1_out = relu(conv1_out)
        conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
        relu2_out = relu(conv2_out)
        
        # Global average pooling instead of flatten
        gap_out = global_avg_pool(relu2_out)
        
        # Fully connected + softmax
        logits = gap_out @ W_fc + b_fc
        probs = softmax(logits)
        
        # Cross-entropy loss
        y_one_hot = np.zeros((train_images.shape[0], n_classes))
        y_one_hot[np.arange(train_images.shape[0]), train_labels] = 1
        
        # Backward pass (only the fully connected layer is updated)
        dlogits = probs - y_one_hot
        dW_fc = gap_out.T @ dlogits / train_images.shape[0]
        db_fc = np.mean(dlogits, axis=0)
        
        # Update weights
        W_fc -= learning_rate * dW_fc
        b_fc -= learning_rate * db_fc
    
    # Evaluate on test set
    conv1_out = conv2d_single(test_images, W_conv1, b_conv1)
    relu1_out = relu(conv1_out)
    conv2_out = conv2d_multi(relu1_out, W_conv2, b_conv2)
    relu2_out = relu(conv2_out)
    gap_out = global_avg_pool(relu2_out)
    logits = gap_out @ W_fc + b_fc
    predictions = np.argmax(logits, axis=1)
    
    acc = accuracy(test_labels, predictions)
    return acc


In [None]:
# Evaluate the advanced CNN implementation
res = test_exercise_7_advanced_cnn(student_advanced_cnn)
show_result("Exercise 5 – Advanced CNN", res)

# Optional: test on the CIFAR-10 subset loaded above
try:
    acc = student_advanced_cnn(X_train, y_train, X_test, y_test)
    print(f"Advanced CNN accuracy: {acc:.3f}")
except NotImplementedError:
    print("Implement student_advanced_cnn above.")


## 7. Discussion

Briefly reflect on your results:

- Did the deeper CNN outperform the baseline models?
- How did data augmentation affect performance?
- What effect did global average pooling have?
- Why is it important to compare against simple baselines?

**Answers:**

1. **Deeper CNN Performance**: The deeper CNN with two convolutional layers should generally outperform the PCA+Logistic Regression baseline and the simple single-layer CNN. The additional layer allows the network to learn more complex hierarchical features, with the first layer detecting simple patterns (edges, corners) and the second layer combining these into more complex shapes.

2. **Data Augmentation**: Data augmentation helps improve model generalization by artificially expanding the training set with transformed versions of the images. For natural images such as those in CIFAR-10, horizontal flips and small shifts make the model more robust to changes in object position and left-right orientation. This typically improves test accuracy, as the model learns to be invariant to these transformations.

3. **Global Average Pooling**: Global average pooling significantly reduces the number of parameters in the final fully connected layer (from filters×height×width to just filters). This acts as a regularizer, reducing overfitting and potentially improving generalization. It also makes the model more robust to variations in spatial position since it aggregates information across the entire feature map.

4. **Importance of Baselines**: Simple baselines like PCA+Logistic Regression are crucial for several reasons:
   - They provide a sanity check that more complex models are actually learning useful patterns
   - They help quantify the benefit of added complexity
   - They're often faster to train and serve as a good starting point
   - Sometimes simpler models are sufficient for the task, saving computational resources
   - They help identify when a dataset might be too simple or when there are implementation bugs in more complex models
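
An even simpler reference point than PCA + Logistic Regression is to always predict the most frequent training class. The sketch below illustrates this, with the hypothetical helper name `majority_class_accuracy`. On a balanced 10-class problem this baseline scores roughly 0.1, which gives a floor against which every other accuracy can be read.

```python
import numpy as np

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = np.argmax(np.bincount(train_labels))
    return float(np.mean(np.asarray(test_labels) == majority))

train_labels = np.array([0, 0, 1, 2])
test_labels = np.array([0, 1, 0, 2])
print(majority_class_accuracy(train_labels, test_labels))   # 0.5
```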
