## MLP for Handwritten Digit Classification

Objective: Implement a multi-layer perceptron (MLP) to classify handwritten digits from the scikit-learn Digits dataset. Evaluate the effect of different activation functions on convergence and test performance.

## 1. Data Loading and Preprocessing

The dataset is loaded from scikit-learn and preprocessed by:
- Normalising the input features
- Splitting into training and test sets (80-20)
- Wrapping the data in PyTorch DataLoaders for batching.

This ensures the MLP receives data in a format suitable for training.

In [1]:
import numpy as np
import torch

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import DataLoader, TensorDataset

# Data Loading and Preprocessing

# Load the Digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Set a random seed for reproducibility
torch.manual_seed(231)
np.random.seed(231)

# Split the data into training and test sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=231)

# Normalise the images by scaling pixel values to 0-1 range
normaliser = MinMaxScaler()
# Normaliser is fitted to the training data
X_train_normalised = normaliser.fit_transform(X_train) 
# Normaliser fit to training set applied to test set
X_test_normalised = normaliser.transform(X_test)

"""
Converting NumPy arrays into PyTorch tensors to allow for multi-dimensional data rep. & efficient computation
Using .float() for features since NN computations are more precise with floats.
"""
X_train_tensor = torch.from_numpy(X_train_normalised).float()
# .long() for target values, since Pytorch loss expects 'long' format
y_train_tensor = torch.from_numpy(y_train).long()
X_test_tensor = torch.from_numpy(X_test_normalised).float()
y_test_tensor = torch.from_numpy(y_test).long()

# Create DataLoader objects for both training and test sets
"""
Batch size 64 provides enough variability to aid generalisation,
and enough samples to calculate stable gradient estimates for smoother training.
"""
batch_size = 64

# TensorDataset() bundles multiple tensors into a single dataset
training_dataset = TensorDataset(X_train_tensor, y_train_tensor)
# DataLoader divides the dataset into smaller batches to allow for more frequent weight updates (once per batch).
# For training data, shuffle=true to improve generalisation by preventing the model from learning artifical order-related data patterns.
training_loader = DataLoader(training_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
# For test data, shuffle=false to prevent introduction of randomness that isn't reflective of how the model would be used in real-world scenarios.
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## 3. Model Model Architecture

A simple multilayer perceptron is implemented with an input layer, one hidden layer, and an output layer.
The forward pass uses the ReLU activation function.

In [2]:
# MLP Model Implementation

"""
Defining a neural network class by extending torch.nn.Module.
NN has one input layer with 64 neurons (since Digits dataset is 8x8 images), one hidden layer of 128 neurons, 
and one output layer with 10 neurons (since there are ten possible digit options).
Using ReLU activation for the hidden layer, and the softmax activation for the output layer.
"""

# Set a random seed for reproducibility
torch.manual_seed(231)
np.random.seed(231)

import torch.nn.functional as F

# NeuralNetwork inherits from torch.nn.Module
class NeuralNetwork(torch.nn.Module):
  def __init__(self):
    # super() calls the torch.nn.Module constructor
    super().__init__()
    # Define the three layers using torch.nn.Linear (creates fully connected/linear layers)
    # Input layer has 64 neurons (64 features) and transforms that input data into 128 neurons
    self.input_layer = torch.nn.Linear(64, 128)
    # Hidden layer with 128 neurons in and out
    self.hidden_layer = torch.nn.Linear(128, 128)
    # Output layer with 10 neurons out that correspond to the 10 classes (digits 0-9)
    self.output_layer = torch.nn.Linear(128, 10)

  # Forward pass through the neural network
  def forward(self, x):
    # Input data x undergoes linear transformation to z (z = w * x + b) and then ReLU activation is applied to preserve just positive values.
    x = F.relu(self.input_layer(x))
    # Same linear transformation and ReLU activation applied to hidden layer.
    x = F.relu(self.hidden_layer(x))
    # Softmax activation applied to output layer to convert raw output scores into probabilities for each possible output class.
    softmax = torch.nn.Softmax(dim=1)
    x = softmax(self.output_layer(x))
    return x

## 4. Model Training 

The MLP is trained using CrossEntropyLoss and the Adam optimiser for 15 epochs, with the training loss and accuracy printed per epoch.

In [3]:
# MLP Model Training 

# Hyperparameter tuning: Chose CrossEntropyLoss as loss function, decided on Adam optimiser over SGD, and set an appropriate learning rate. 
# Trained the model for 15 epochs, printing the training loss and accuracy with each epoch.

def train_show(network, lossFunc, optimiser, epochs):

    for epoch in range(epochs):
      # Storing every calculated loss and accuracy in the epoch to calculate average
      lossHistory = []
      accuracyHistory = []
      
      network.train() # Set model to training mode, which has active dropout to make the model more robust and prevent overfitting
        
      for data, targ in training_loader:
        optimiser.zero_grad() # PyTorch gradients accumulate by default, so they need to be zeroed out before each forward pass to avoid mixing between batches.

        y = network.forward(data) # Perform forward pass by calling forward() in NeuralNetwork class
        
        loss = lossFunc(y,targ) # Calculate the loss
        loss.backward() # Runs autograd to get the gradients needed by the optimiser: Computes gradients of the loss with respect to all model parameters using backpropagation.

        optimiser.step() # Takes a step: updates the model parameters in the direction that minimises the loss using the calculated gradients.

        """
        torch.argmax(y,dim=1) returns the index of the class with the highest
        probability for each input sample in the batch, which is then compared
        to the actual (targ) class value for that sample. A boolean tensor is
        returned to indicate whether the prediction is correct or not,
        and .float() converts that boolean tensor into a float tensor
        (1 for correct, 0 for incorrect).
        torch.mean() calculates the mean of all the input samples' float tensors
        to effectively get the proportion of correct predictions for that batch.
        """
        accuracy = torch.mean((torch.argmax(y,dim=1) == targ).float())
          
        # Add the loss and accuracy values to their lists. These lists will later be used to calculate the average over the epoch.
        lossHistory.append(loss.detach().item()) # Extracting the loss value as a Python scalar.
        accuracyHistory.append(accuracy.detach()) # Detaches the accuracy tensor from the computation graph

      # Calculate average loss and accuracy for the current epoch
      avg_loss = sum(lossHistory) / len(lossHistory)
      avg_accuracy = sum(accuracyHistory) / len(accuracyHistory)

      # Print average training loss and accuracy over the epoch
      print(f"Epoch {epoch+1}: Loss = {avg_loss:.3f}, Accuracy = {int(avg_accuracy*100)}%")

# Instantiating the neural network using the previously defined NeuralNetwork class
model = NeuralNetwork()

# Defining the loss function
lossFunction = torch.nn.CrossEntropyLoss() # calculates how off predictions are from actual values

# Choosing an optimiser and an appropriate learning rate: chose Adam over SGD since Adam has adaptive learning rate for each parameter
optimiser = torch.optim.Adam(model.parameters(), lr=0.001)

print("Training Output:")
train_show(model, lossFunction, optimiser, 15)

Training Output:
Epoch 1: Loss = 2.288, Accuracy = 41%
Epoch 2: Loss = 2.166, Accuracy = 59%
Epoch 3: Loss = 1.888, Accuracy = 66%
Epoch 4: Loss = 1.733, Accuracy = 80%
Epoch 5: Loss = 1.664, Accuracy = 84%
Epoch 6: Loss = 1.631, Accuracy = 86%
Epoch 7: Loss = 1.580, Accuracy = 93%
Epoch 8: Loss = 1.559, Accuracy = 94%
Epoch 9: Loss = 1.544, Accuracy = 94%
Epoch 10: Loss = 1.531, Accuracy = 95%
Epoch 11: Loss = 1.522, Accuracy = 96%
Epoch 12: Loss = 1.518, Accuracy = 96%
Epoch 13: Loss = 1.510, Accuracy = 97%
Epoch 14: Loss = 1.512, Accuracy = 96%
Epoch 15: Loss = 1.502, Accuracy = 97%


## 5. Model Evaluation

The trained model is evaluated on the test set, with the final test accuracy being reported to provide an unbiased estimate of its generalisation performance.

In [4]:
# Model Evaluation on Test Set

# Measuring the model's accuracy on the test set, and printing five example predictions with their actual labels.

# Set a random seed for reproducibility
torch.manual_seed(231)
np.random.seed(231)

def test_network(network):
  network.eval() # Set the network to evaluation mode, which disables dropout and normalises data using running mean and variance estimates collected during model training
  
  # Storing every calculated loss and accuracy to calculate the average
  lossHistory = []
  accuracyHistory = []
  
  with torch.no_grad(): # Disable gradient computation
    for test_data, test_targ in test_loader:
      y_test = network(test_data) # Perform forward pass
      test_loss = lossFunction(y_test, test_targ) # Compute the loss
      test_accuracy = torch.mean((torch.argmax(y_test,dim=1) == test_targ).float()) # Compute the accuracy

      # Add loss and accuracy to their lists to calculate the average later.
      lossHistory.append(test_loss.detach().item())
      accuracyHistory.append(test_accuracy.detach())

    # Calculate the average loss and accuracy for the test set.
    avg_loss = sum(lossHistory) / len(lossHistory)
    avg_accuracy = sum(accuracyHistory) / len(accuracyHistory)

    # Print the test accuracy
    print(f"Test Accuracy: {int(torch.round(avg_accuracy * 100).item())}%")

    # Choose five examples with their predictions and actual labels

    # Extracting the predictions of the first five test samples
    example_predictions = torch.argmax(y_test, dim=1)[:5]
    example_labels = test_targ[:5]

    # Print the predictions vs actual labels for those five samples
    for i in range(5):
        print(f"Test Image {i+1}: Predicted Label = {example_predictions[i].item()}, Actual Label = {example_labels[i].item()}")

test_network(model)

Test Accuracy: 95%
Test Image 1: Predicted Label = 5, Actual Label = 5
Test Image 2: Predicted Label = 3, Actual Label = 3
Test Image 3: Predicted Label = 1, Actual Label = 1
Test Image 4: Predicted Label = 2, Actual Label = 2
Test Image 5: Predicted Label = 9, Actual Label = 9


## 6. Activation Function Experiments
Additional experiments with different activation functions (ReLU, Sigmoid, Tanh) to assess their impact on training speed and final test accuracy.

In [15]:
# Experimenting with Activation Functions: Comparing the effectiveness of ReLU, Sigmoid, and Tanh.

# Modifying NeuralNetwork() to have __init__ accept an activation function as a parameter
# Referring to this class as NewNeuralNetwork()

# Set a random seed for reproducibility
torch.manual_seed(231)
np.random.seed(231)

import torch.nn.functional as F

class NewNeuralNetwork(torch.nn.Module):
  def __init__(self, activation_function):
    super().__init__()
    # Define the three layers using torch.nn.Linear (creates fully connected/linear layers)
    # Input layer has 64 neurons (64 features) and transforms that input data into 128 neurons
    self.input_layer = torch.nn.Linear(64, 128)
    # Hidden layer with 128 neurons in and out
    self.hidden_layer = torch.nn.Linear(128, 128)
    # Output layer with 10 neurons out that correspond to the 10 classes (digits 0-9)
    self.output_layer = torch.nn.Linear(128, 10)
    self.activation_function = activation_function

  def forward(self, x):
    # Forward pass through the neural network
    x = self.activation_function(self.input_layer(x))
    x = self.activation_function(self.hidden_layer(x))
    softmax = torch.nn.Softmax(dim=1)
    x = softmax(self.output_layer(x))
    return x

"""
Activation functions introduce non-linearity into the model.

ReLU: (max(z, 0)), replaces any negative values with 0
and preserves positive values; activates only the neurons
with positive outputs to make the model sparse and efficient.

Sigmoid: Maps any input z to a value between 0 and 1, which can be
interpreted as probabilities.

Tanh: Maps any input z to a value between -1 and 1, makes it zero-centered
which can help center the data and make optimisation easier.

ReLU and Tanh often used in hidden layers, Sigmoid in output layers for binary
classfication.
"""
activation_functions = {
    'ReLU': F.relu,
    'Sigmoid': torch.sigmoid,
    'Tanh': torch.tanh
}

# Train and test the network with each activation function
for name, act_fnc in activation_functions.items():
    print(f"\n{name} activation function:")
    # Initialising neural network using NewNeuralNetwork class
    new_model = NewNeuralNetwork(act_fnc)
    # Retraining same optimiser and learning rate as before
    new_optimiser = torch.optim.Adam(new_model.parameters(), lr=0.001)
    # Retraining same loss function as before
    new_lossFunc = torch.nn.CrossEntropyLoss()
    # train_show will print all 15 epoch results
    print("Model Training:")
    train_show(new_model, new_lossFunc, new_optimiser, 15)
    # test_network will print average accuracy and five sample results
    print("\nModel Evaluation:")
    test_network(new_model)


ReLU activation function:
Model Training:
Epoch 1: Loss = 2.288, Accuracy = 41%
Epoch 2: Loss = 2.166, Accuracy = 59%
Epoch 3: Loss = 1.888, Accuracy = 66%
Epoch 4: Loss = 1.733, Accuracy = 80%
Epoch 5: Loss = 1.664, Accuracy = 84%
Epoch 6: Loss = 1.631, Accuracy = 86%
Epoch 7: Loss = 1.580, Accuracy = 93%
Epoch 8: Loss = 1.559, Accuracy = 94%
Epoch 9: Loss = 1.544, Accuracy = 94%
Epoch 10: Loss = 1.531, Accuracy = 95%
Epoch 11: Loss = 1.522, Accuracy = 96%
Epoch 12: Loss = 1.518, Accuracy = 96%
Epoch 13: Loss = 1.510, Accuracy = 97%
Epoch 14: Loss = 1.512, Accuracy = 96%
Epoch 15: Loss = 1.502, Accuracy = 97%

Model Evaluation:
Test Accuracy: 95%
Test Image 1: Predicted Label = 5, Actual Label = 5
Test Image 2: Predicted Label = 3, Actual Label = 3
Test Image 3: Predicted Label = 1, Actual Label = 1
Test Image 4: Predicted Label = 2, Actual Label = 2
Test Image 5: Predicted Label = 9, Actual Label = 9

Sigmoid activation function:
Model Training:
Epoch 1: Loss = 2.304, Accuracy = 9%


## 7. Activation Function Comparison

- *ReLU* had fast convergence, stable training, and a strong final test accuracy of 95%.
- *Sigmoid* had slow learning and struggled to propagate gradients early in training, with its final test accuracy being only 63%.
- *Tanh* had slightly better convergence and final test accuracy (96%) than ReLU.

The conclusion is that **ReLU and Tanh are the most effective** for this MLP on the Digits dataset, while Sigmoid underperforms due to vanishing gradients.