# Generalisation and Overfitting

## Learning Objectives
By the end of this notebook, you will be able to:
- Understand the concept of overfitting and how it relates to model complexity
- Visualise how model complexity affects generalisation performance
- Implement and evaluate models with different numbers of parameters
- Apply early stopping as a regularisation technique to prevent overfitting

## Overview
In this notebook, we will explore the critical issue of overfitting and learn how to measure how well our trained models generalise to unseen data. This builds upon the generalisation concepts introduced in the fourth lecture. We'll use both a custom regression example and PyTorch to demonstrate key concepts.

## Exercise 1: Overfitting and Model Complexity in 1D Regression

In this exercise, we will explore a regression problem to understand how model complexity affects overfitting. Given a fixed set of noisy observations, we will use multi-layer network models to learn the relationship between inputs and outputs. Our goal is to visualise how increasing model complexity affects the model's ability to make predictions across the input space.

### The Target Function

To keep things simple, we will consider a function with a single input and a single output, defined by a fourth-degree (quartic) polynomial:

$$ f(x) = 10x^4 - 17x^3 + 8x^2 - x $$

The observed values are the function values plus zero-mean Gaussian noise:

$$ y = f(x) + 0.01\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1) $$

The inputs will be drawn from a uniform distribution on the interval $[0, 1]$.

**🔥 Run the cell below** to import the necessary modules and seed the random number generator.

In [None]:
# Import necessary libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Set plotting style for better visualisations
plt.style.use('ggplot')

# Set random seed for reproducible results
seed = 17102016 
rng = np.random.RandomState(seed)

## 📝 **YOUR TASK: Implement a Polynomial Function**

Write code in the cell below to calculate a polynomial function of one-dimensional inputs. 

If $\boldsymbol{c}$ is a length $P$ vector of coefficients corresponding to increasing powers in the polynomial (starting from the constant zero-power term up to the $(P-1)^{\text{th}}$ power), the function should correspond to:

$$f_{\text{polynomial}}(x, \boldsymbol{c}) = \sum_{p=0}^{P-1} c_p x^p$$

**Requirements:**
- The function should take an array of inputs and a coefficient vector
- Return the polynomial evaluated at each input point
- Handle vectorised operations efficiently
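
Once you have written your own version, one way to sanity-check it is against NumPy's `numpy.polynomial.polynomial.polyval`, which also takes coefficients in increasing order of power. The snippet below is only a cross-check sketch, not the intended solution; the names `check_inputs` and `check_coefficients` and their values are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

# Illustrative values: coefficients for 1 + 2x + 3x^2, lowest power first
check_coefficients = np.array([1., 2., 3.])
check_inputs = np.array([0., 1., 2.])

# polyval(x, c) evaluates c[0] + c[1]*x + c[2]*x**2 + ... at each x
reference_outputs = npoly.polyval(check_inputs, check_coefficients)
print(reference_outputs)
```

Comparing these reference values with the output of your own function on the same inputs should show agreement up to floating point error.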

In [None]:
def polynomial_function(inputs, coefficients):
    """Calculates polynomial with given coefficients for an array of inputs.
    
    Args:
        inputs: One-dimensional array of input values of shape (num_inputs,)
        coefficients: One-dimensional array of polynomial coefficient terms
           with `coefficients[0]` corresponding to the coefficient for the
           zero-order term (constant) and `coefficients[-1]` corresponding 
           to the highest order term.
           
    Returns:
        One-dimensional array of output values of shape (num_inputs,)
    
    Example:
        For coefficients [1, 2, 3] and input x, this computes: 1 + 2*x + 3*x^2
    """
    raise NotImplementedError("TODO Implement this function") 

**🔥 Run the cell below** to test your implementation.

In [None]:
# Test the polynomial function implementation
test_coefficients = np.array([-1., 3., 4.])  # Represents: -1 + 3x + 4x^2
test_inputs = np.array([0., 0.5, 1., 2.])
test_outputs = np.array([-1., 1.5, 6., 21.])  # Expected outputs

# Check output shape is correct
assert polynomial_function(test_inputs, test_coefficients).shape == (4,), (
    'Function gives wrong shape output.'
)

# Check output values are correct
assert np.allclose(polynomial_function(test_inputs, test_coefficients), test_outputs), (
    'Function gives incorrect output values.'
)

print("✅ Function is correct!")

### Generating Training Data

Now we'll use the random number generator to sample input values and calculate the corresponding target outputs using your polynomial implementation. **🔥 Run the cell below** to generate the noisy training data.

In [None]:
# Define the true function coefficients: f(x) = 10x^4 - 17x^3 + 8x^2 - x
coefficients = np.array([0, -1., 8., -17., 10.])

# Set up problem dimensions
input_dim, output_dim = 1, 1
noise_std = 0.01  # Standard deviation of Gaussian noise
num_data = 80     # Total number of data points

# Generate random inputs uniformly distributed in [0, 1]
inputs = rng.uniform(size=(num_data, input_dim))

# Generate noise samples from standard normal distribution
epsilons = rng.normal(size=num_data)

# Calculate noisy target outputs: y = f(x) + noise
targets = (polynomial_function(inputs[:, 0], coefficients) + 
           epsilons * noise_std)[:, None]  # Reshape to column vector

### Creating Training and Validation Sets

We will split the generated data points into equal-sized training and validation datasets. We'll use these to create data provider objects for our framework. Since the dataset is small, we'll use a batch size equal to the dataset size. 

**🔥 Run the cell below** to split the data and set up the data provider objects.

In [None]:
# Import data provider class
from mlp.data_providers import DataProvider

# Split data into training and validation sets (50/50 split)
num_train = num_data // 2
batch_size = num_train  # Use full batch gradient descent

# Create training and validation splits
inputs_train, targets_train = inputs[:num_train], targets[:num_train]
inputs_valid, targets_valid = inputs[num_train:], targets[num_train:]

# Create data provider objects for training and validation
train_data = DataProvider(inputs_train, targets_train, batch_size=batch_size, rng=rng)
valid_data = DataProvider(inputs_valid, targets_valid, batch_size=batch_size, rng=rng)

### Visualising the Data

Let's visualise the data we will be modelling. **🔥 Run the cell below** to plot the target outputs against inputs for both training and validation sets. Notice the clear underlying smooth functional relationship evident in the noisy data.

In [None]:
# Create a scatter plot of the training and validation data
fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot(111)

# Plot training and validation data points
ax.plot(inputs_train[:, 0], targets_train[:, 0], '.', label='Training data', alpha=0.7)
ax.plot(inputs_valid[:, 0], targets_valid[:, 0], '.', label='Validation data', alpha=0.7)

# Add labels and formatting
ax.set_xlabel('Inputs $x$', fontsize=14)
ax.set_ylabel('Outputs $y$', fontsize=14)
ax.legend(loc='best')
ax.set_title('Training and Validation Data')
fig.tight_layout()
plt.show()

### Radial Basis Function (RBF) Networks

We will fit models with varying numbers of parameters to the training data. Since multi-layer logistic sigmoid models tend to perform poorly on regression tasks like this, we will instead use a [Radial Basis Function (RBF) network](https://en.wikipedia.org/wiki/Radial_basis_function_network).

**What is an RBF Network?**
This model predicts the output as a weighted sum of basis functions (Gaussian-like "bumps") tiled across the input space. Each basis function has a centre and width, and the final prediction combines their weighted contributions.
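
Written as a formula, and assuming the Gaussian form used in the demonstration cell below, the prediction with $K$ basis functions is roughly as follows. The symbols $w_k$ (weights), $c_k$ (centres), $s$ (shared width) and $b$ (bias) are chosen here for illustration:

$$ f_{\text{RBF}}(x) = b + \sum_{k=1}^{K} w_k \exp\left(-\frac{(x - c_k)^2}{s^2}\right) $$

In the demonstration cell the centres $c_k$ are evenly spaced on $[0, 1]$ and the width is set to $s = 1/K$, so adding basis functions makes each bump narrower.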

**🔥 Run the cell below** to see an example of RBF network predictions. Try running it several times with different values of `num_weights` (e.g., 5, 15, 30) to get a feel for how the number of parameters affects the model's predictions.

In [None]:
# Try changing this value and re-running the cell to see different behaviours!
num_weights = 15
weights_scale = 1.
bias_scale = 1.

def basis_function(x, centre, scale):
    """Gaussian radial basis function."""
    return np.exp(-(x - centre)**2 / scale**2)

# Generate random weights and bias for demonstration
weights = rng.normal(size=num_weights) * weights_scale
bias = rng.normal() * bias_scale

# Place basis function centres evenly across input space [0, 1]
centres = np.linspace(0, 1, weights.shape[0])
scale = 1. / weights.shape[0]  # Scale inversely with number of centres

# Create dense grid of input points for smooth plotting
xs = np.linspace(0, 1, 200)
ys = np.zeros(xs.shape[0])

# Plot the RBF network prediction
fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(1, 1, 1)

# Sum weighted basis functions
for weight, centre in zip(weights, centres):
    ys += weight * basis_function(xs, centre, scale)
ys += bias  # Add bias term

ax.plot(xs, ys, linewidth=2)
ax.set_xlabel('Input', fontsize=14)
ax.set_ylabel('Output', fontsize=14)
ax.set_title(f'RBF Network with {num_weights} basis functions')
ax.grid(True, alpha=0.3)
plt.show()

### Model Implementation Details

You do not need to study the details of how to implement this model. All the additional code you need to fit RBF networks is provided in the `RadialBasisFunctionLayer` in the `mlp.layers` module. 

**Key Points:**
- The `RadialBasisFunctionLayer` class has the same interface as other layer classes (with `fprop` and `bprop` methods)
- We can include it as a layer in a `MultipleLayerModel` just like any other layer
- This demonstrates the advantage of using a modular framework - we can reuse existing code to train different model architectures

**Architecture:** We use the `RadialBasisFunctionLayer` as the first layer in a two-layer model:
1. **First layer:** `RadialBasisFunctionLayer` - calculates the basis function terms
2. **Second layer:** `AffineLayer` - weights and sums the basis functions together
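
To make the two stages concrete, here is a minimal NumPy sketch of the forward computation, assuming Gaussian basis functions with evenly spaced centres as in the demonstration cell above. It is not the `mlp.layers` implementation; `rbf_features` and `affine` are hypothetical helper names and the weight and bias values are arbitrary.

```python
import numpy as np

def rbf_features(inputs, centres, scale):
    """Stage 1 (hypothetical helper): Gaussian basis activations.

    inputs has shape (num_inputs,) and centres has shape (num_basis,).
    Returns an array of shape (num_inputs, num_basis).
    """
    diffs = inputs[:, None] - centres[None, :]
    return np.exp(-diffs**2 / scale**2)

def affine(features, weights, bias):
    """Stage 2 (hypothetical helper): weighted sum of activations plus a bias."""
    return features.dot(weights) + bias

num_basis = 5
centres = np.linspace(0., 1., num_basis)
scale = 1. / num_basis
example_inputs = np.linspace(0., 1., 11)
example_weights = np.ones(num_basis)  # arbitrary values for illustration
example_bias = 0.5

features = rbf_features(example_inputs, centres, scale)
outputs = affine(features, example_weights, example_bias)
print(features.shape, outputs.shape)  # (11, 5) (11,)
```

Keeping the two stages separate mirrors the layer structure described above: only the second stage has weights to learn in this sketch.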

**🔥 Run the cell below** to set up the necessary components for training.

In [None]:
# Import required modules for model training
from mlp.models import MultipleLayerModel
from mlp.layers import AffineLayer, RadialBasisFunctionLayer
from mlp.errors import SumOfSquaredDiffsError
from mlp.initialisers import ConstantInit, UniformInit
from mlp.learning_rules import GradientDescentLearningRule
from mlp.optimisers import Optimiser

# Set up training components
error = SumOfSquaredDiffsError()  # Appropriate for regression problems
learning_rule = GradientDescentLearningRule(0.1)  # Basic gradient descent with fixed learning rate

# Initialise weights and biases
weights_init = UniformInit(-0.1, 0.1)  # Small random weights
biases_init = ConstantInit(0.)         # Zero bias initialisation

# Training configuration
num_epoch = 2000  # Number of training epochs for all models

### Training Models with Different Complexities

The next cell defines RBF network models with varying numbers of weight parameters (equal to the number of basis functions) and fits each to the training set. We'll record the final training and validation set errors for the fitted models.

**🔥 Run the cell below** to fit the models and calculate error values. This may take a few minutes to complete.

In [None]:
# Define different model complexities to test
num_weight_list = [2, 5, 10, 25, 50, 100]

# Storage for results
models = []
train_errors = []
valid_errors = []

# Train models with different numbers of parameters
for num_weight in num_weight_list:
    # Create RBF network model
    model = MultipleLayerModel([
        RadialBasisFunctionLayer(num_weight),
        AffineLayer(input_dim * num_weight, output_dim, 
                    weights_init, biases_init)
    ])
    
    # Set up optimiser
    optimiser = Optimiser(model, error, learning_rule, 
                            train_data, valid_data)
    
    print('-' * 80)
    print(f'Training model with {num_weight} weights')
    print('-' * 80)
    
    # Train the model
    _ = optimiser.train(num_epoch, -1)  # -1 means no intermediate output
    
    # Calculate final errors on both datasets
    outputs_train = model.fprop(inputs_train)[-1]
    outputs_valid = model.fprop(inputs_valid)[-1]
    
    # Store results
    models.append(model)
    train_errors.append(error(outputs_train, targets_train))
    valid_errors.append(error(outputs_valid, targets_valid))
    
    print(f'  Final training set error: {train_errors[-1]:.1e}')
    print(f'  Final validation set error: {valid_errors[-1]:.1e}')

## 📝 **YOUR TASK: Analyse Training vs Validation Errors**

In the cell below, write code to create [bar charts](http://matplotlib.org/examples/api/barchart_demo.html) showing the training and validation set errors for the different fitted models.

**Think about these questions as you examine the plots:**

1. **Model Complexity:** Do models with more free parameters fit the training data better or worse?
2. **Generalisation:** What does the validation set error tell us about how well the models generalise?
3. **Best Model:** Which of the fitted models seems most likely to generalise well to unseen data?
4. **Overfitting:** Do any of the models appear to be overfitting? How can you tell?

**Hint:** Look for the "sweet spot" where validation error is minimised!

In [None]:
#TODO plot the bar charts here

## 📝 **YOUR TASK: Visualise Model Predictions**

Now let's visualise what the fitted models' predictions look like across the whole input space compared to the true function we were trying to fit.

**In the cell below, complete the following for each fitted model:**
1. Compute output predictions for the model across 500 linearly spaced input points between 0 and 1
2. Plot the predicted outputs and true function values as line plots on the same axis
3. Plot the training data as points on the same axis
4. Add appropriate labels and legends

**Look for:**
- **Underfitting:** Model is too simple and misses the true pattern
- **Good fit:** Model captures the underlying function well
- **Overfitting:** Model fits training points perfectly but behaves erratically between them

In [None]:

#TODO plot the graphs here

You should now be able to relate your answers to the questions above to what you see in these plots; ask a demonstrator if you are unsure what is going on. In particular, for the models that appeared to overfit and generalise poorly, you should now have an idea of how this shows up in the model's predictions and how those predictions relate to the training data points and the true function values.

# Exercise 2: Early Stopping with PyTorch

In the previous exercise with RBF networks, we saw how model complexity affects overfitting. We observed that:
- Simple models (few parameters) underfit the data
- Complex models (many parameters) overfit the training data
- There's an optimal complexity that minimises validation error

However, there's another way to think about overfitting: **through the lens of training time**.

## Overfitting and Training Duration

As we saw in [Lab 3](https://github.com/cortu01/mlpractical/tree/mlp2023-24/lab3/notebooks/03_Multiple_layer_models.ipynb), models can show signs of overfitting after training for too many epochs. Even with an appropriately sized model, overfitting can occur when we train for too long.

**Key Insight:** Overfitting happens when the model learns the training data *too well*, including its noise, and fails to generalise to unseen data. This can result from:
1. **Model complexity** (too many parameters) - as we saw in Exercise 1
2. **Training duration** (too many epochs) - which we'll explore now

## 🤔 **Think About This:**
*If we observe both high training error AND high validation error, what does this suggest about our model?*

<details>
<summary>Click for answer</summary>

**Answer:** High training error + high validation error typically indicates **underfitting**. The model is too simple to capture the underlying patterns in the data, so it performs poorly on both training and validation sets.

</details>

## Early Stopping: A Regularisation Technique

**Early stopping** is a simple yet effective technique to prevent overfitting. The idea is to monitor the validation error during training and stop when it starts to increase consistently, even if the training error continues to decrease.

In this section, we'll implement early stopping in PyTorch using the MNIST dataset and demonstrate how it can prevent overfitting during training.

In [None]:
# Import PyTorch and related libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data.sampler import SubsetRandomSampler

# Set random seed for reproducible results
torch.manual_seed(seed)

In [None]:
# Device configuration - use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set training hyperparameters
batch_size = 128      # Number of data points in each batch
learning_rate = 0.001 # Learning rate for gradient descent
num_epochs = 50       # Maximum number of training epochs
stats_interval = 1    # Epoch interval for recording and printing statistics

In [None]:
# Define data transformations (normalisation for MNIST)
transform = transforms.Compose([
    transforms.ToTensor(),
    # MNIST normalisation: mean=0.1307, std=0.3081 (computed from training set)
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST datasets
train_dataset = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('../data', train=False, download=True, transform=transform)

# Create train/validation split from training set
valid_size = 0.2  # Use 20% of training set for validation
num_train = len(train_dataset)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train))
rng.shuffle(indices)  # Shuffle indices with the seeded generator so the split is reproducible

# Split indices into training and validation sets
train_idx, valid_idx = indices[split:], indices[:split]
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# Create data loaders
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=train_sampler, pin_memory=True)
valid_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=valid_sampler, pin_memory=True)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

print(f"Training samples: {len(train_idx)}")
print(f"Validation samples: {len(valid_idx)}")
print(f"Test samples: {len(test_dataset)}")

In [None]:
# Define the neural network model
class MultipleLayerModel(nn.Module):
    """Multiple layer model for MNIST classification."""
    
    def __init__(self, input_dim, output_dim, hidden_dim):
        super().__init__()
        self.flatten = nn.Flatten()  # Flatten 28x28 images to 784-dimensional vectors
        
        # Define the network architecture
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # First hidden layer
            nn.ReLU(),                          # ReLU activation
            nn.Linear(hidden_dim, hidden_dim),  # Second hidden layer
            nn.ReLU(),                          # ReLU activation
            nn.Linear(hidden_dim, output_dim),  # Output layer
        )
        
    def forward(self, x):
        """Forward pass through the network."""
        x = self.flatten(x)              # Flatten input images
        logits = self.linear_relu_stack(x)  # Pass through network layers
        return logits

# Model configuration
input_dim = 1 * 28 * 28  # MNIST images are 28x28 pixels
output_dim = 10          # 10 classes (digits 0-9)
hidden_dim = 100         # Hidden layer size

# Create model and move to device
model = MultipleLayerModel(input_dim, output_dim, hidden_dim).to(device)

# Define loss function and optimiser
loss_fn = nn.CrossEntropyLoss()  # Cross-entropy loss for classification
optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # Adam optimiser

print(f"Model created with {sum(p.numel() for p in model.parameters())} parameters")
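
As a rough check on the printed count, the number of parameters can be worked out by hand: each `nn.Linear` layer with bias holds one weight per input-output pair plus one bias per output, while the flatten and ReLU steps have no parameters. The helper name `linear_params` below is purely illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer (hypothetical helper)."""
    return in_features * out_features + out_features

# (inputs, outputs) of the three linear layers: 784 -> 100 -> 100 -> 10
layer_sizes = [(784, 100), (100, 100), (100, 10)]
expected_total = sum(linear_params(i, o) for i, o in layer_sizes)
print(expected_total)  # 89610
```

The breakdown is 78,500 + 10,100 + 1,010, so almost all of the parameters sit in the first layer.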

### Implementing Early Stopping

Early stopping monitors the validation loss during training and stops when it hasn't improved for a specified number of epochs (called "patience").

**How it works:**
1. **Monitor validation loss** after each epoch
2. **Track the best (lowest) validation loss** seen so far
3. **Count consecutive epochs** without improvement
4. **Stop training** when patience is exceeded
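
The four steps can be traced on a toy sequence of validation losses. This is a simplified sketch in plain Python with made-up loss values and a hypothetical helper `stopping_epoch`; it counts every epoch that fails to beat the best loss, so it ignores the `min_delta` tolerance used by the class further down.

```python
def stopping_epoch(valid_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Hypothetical helper: stops once `patience` consecutive epochs
    have failed to improve on the best validation loss seen so far.
    """
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):  # step 1: monitor each epoch
        if loss < best_loss:
            best_loss = loss                     # step 2: new best loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1      # step 3: count the miss
            if epochs_without_improvement >= patience:
                return epoch                     # step 4: stop here
    return None

# Made-up losses: improve for four epochs, then drift upwards
toy_losses = [0.9, 0.6, 0.5, 0.45, 0.47, 0.46, 0.48, 0.50]
print(stopping_epoch(toy_losses, patience=3))  # 6
```

The best loss occurs at epoch 3, and epochs 4, 5 and 6 all fail to beat it, so with a patience of 3 training stops at epoch 6.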

## 🤔 **Think About This:**
*Can we say that overfitting is ultimately inevitable given training over a very large number of epochs?*

<details>
<summary>Click for answer</summary>

**Answer:** Generally yes, for most practical scenarios. Given unlimited training time, a sufficiently complex model will eventually memorise the training data perfectly, including noise. This leads to overfitting. Early stopping prevents this by halting training before this point is reached.

</details>

**🔥 Run the cell below** to see the implementation of an early stopping class.

In [None]:
class EarlyStopping:
    """Early stopping utility to prevent overfitting during training.
    
    Monitors validation loss and stops training when it hasn't improved
    for a specified number of epochs (patience).
    """
    
    def __init__(self, patience=5, min_delta=0):
        """
        Args:
            patience (int): Number of epochs with no improvement after which 
                          training will be stopped
            min_delta (float): Tolerance above the best validation loss; an 
                             epoch only counts towards patience if its loss 
                             exceeds the best loss by more than this amount
        """
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0  # Counter for epochs without improvement
        self.min_validation_loss = float('inf')  # Best validation loss seen
        self.early_stop = False  # Flag to indicate if training should stop

    def __call__(self, validation_loss):
        """Check if training should be stopped based on validation loss.
        
        Args:
            validation_loss (float): Current epoch's validation loss
        """
        # Check if we have a new best validation loss
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0  # Reset counter
        # Check if validation loss has worsened beyond tolerance
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            # Stop training once the counter reaches the patience threshold
            if self.counter >= self.patience: 
                self.early_stop = True

In [None]:
# Initialise early stopping with patience=5 epochs
early_stopping = EarlyStopping(patience=5, min_delta=0.01)

# Track loss values over training
train_losses = [] 
valid_losses = []

print("Starting training with early stopping...")
print("=" * 60)

# Training loop
for epoch in range(num_epochs):  # Train for at most num_epochs epochs
    # === TRAINING PHASE ===
    model.train()  # Set model to training mode
    batch_losses = []
    
    for batch_idx, (data, targets) in enumerate(train_loader):
        # Move data to device
        data, targets = data.to(device), targets.to(device)
        
        # Forward pass
        outputs = model(data)
        loss = loss_fn(outputs, targets)
        
        # Backward pass and optimisation
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters
        
        # Record batch loss
        batch_losses.append(loss.item())
    
    # Average training loss for this epoch
    train_losses.append(np.mean(batch_losses))

    # === VALIDATION PHASE ===
    model.eval()  # Set model to evaluation mode
    batch_losses = []
    
    with torch.no_grad():  # Disable gradient computation for efficiency
        for batch_idx, (data, targets) in enumerate(valid_loader):
            # Move data to device
            data, targets = data.to(device), targets.to(device)
            
            # Forward pass only
            outputs = model(data)
            loss = loss_fn(outputs, targets)
            
            # Record batch loss
            batch_losses.append(loss.item())
    
    # Average validation loss for this epoch
    valid_losses.append(np.mean(batch_losses))

    # Print progress every stats_interval epochs
    if epoch % stats_interval == 0:
        print(f'Epoch {epoch:2d}: Train Loss: {train_losses[-1]:.6f}, '
              f'Valid Loss: {valid_losses[-1]:.6f}')
            
    # Check for early stopping
    early_stopping(valid_losses[-1])
    
    if early_stopping.early_stop:
        print(f"\n🛑 Early stopping triggered at epoch {epoch}")
        print(f"Best validation loss: {early_stopping.min_validation_loss:.6f}")
        break

print("=" * 60)
print("Training completed!")

In [None]:
# Plot the training and validation loss curves
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

# Plot loss curves
epochs = range(len(train_losses))
ax.plot(epochs, train_losses, 'b-', label='Training Loss', linewidth=2)
ax.plot(epochs, valid_losses, 'r-', label='Validation Loss', linewidth=2)

# Mark the early stopping point
if early_stopping.early_stop:
    ax.axvline(x=len(train_losses)-1, color='orange', linestyle='--', 
               linewidth=2, label='Early Stopping Point')

# Formatting
ax.set_xlabel('Epoch', fontsize=14)
ax.set_ylabel('Loss', fontsize=14)
ax.set_title('Training and Validation Loss with Early Stopping', fontsize=16)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final statistics
print(f"\nFinal Statistics:")
print(f"Training stopped at epoch: {len(train_losses)-1}")
print(f"Final training loss: {train_losses[-1]:.6f}")
print(f"Final validation loss: {valid_losses[-1]:.6f}")
print(f"Best validation loss: {early_stopping.min_validation_loss:.6f}")

# Show the classic overfitting pattern
min_valid_idx = np.argmin(valid_losses)
print(f"\nOverfitting Analysis:")
print(f"Validation loss minimum at epoch: {min_valid_idx}")
if len(valid_losses) > min_valid_idx + 3:
    print("✅ Early stopping successfully prevented overfitting!")
else:
    print("ℹ️  Training stopped before significant overfitting occurred.")