<h1>Distributed Hyperparameter Optimization (HPO) Techniques for CNN on MNIST</h1>

In [1]:
# Install dependencies (run once, then restart the kernel)
# OptunaSearch is part of Ray Tune (ray.tune.search.optuna) and only needs optuna installed
%pip install torch torchvision torchsummary
%pip install optuna hpbandster
%pip install plotly matplotlib scikit-learn
%pip install -U ipywidgets
%pip install -U "ray[tune]" "ray[default]"
%pip install -U ConfigSpace
%pip show ray

<h2>1. Introduction</h2>

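Hyperparameter Optimization (HPO) is a critical step in training deep learning models, because settings such as the learning rate and dropout rate strongly affect both accuracy and training cost.
Traditional tuning approaches like Grid Search and Random Search train every candidate configuration to completion, which makes them computationally expensive as the search space grows.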

In this assignment, we compare and analyze different hyperparameter optimization strategies using distributed computing to achieve optimal hyperparameter selection efficiently.

<h2>2. Objectives</h2>

The goal of this project is to:

1. Compare multiple HPO techniques for training a Convolutional Neural Network (CNN) on the MNIST dataset.

2. Evaluate these techniques based on training speed, search efficiency, accuracy, and GPU resource utilization.

3. Implement real-time GPU monitoring to track memory usage and optimize resource allocation.

4. Identify the most effective HPO method that balances speed, accuracy, and efficiency.

<h2>3. HPO Strategies Implemented</h2>

We implemented and compared four different approaches for HPO:

1. Baseline (No HPO): Train the model with default hyperparameters.

2. ASHA (Asynchronous Successive Halving Algorithm): Eliminates underperforming trials early to speed up training.

3. BOHB (Bayesian Optimization + HyperBand): Uses Bayesian learning to intelligently select hyperparameters while efficiently allocating compute resources.

4. BOHB + ASHA Hybrid: Combines BOHB’s smart selection with ASHA’s aggressive pruning for improved efficiency.
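
To make the early-elimination idea behind ASHA concrete, the following is a minimal sketch of synchronous successive halving on a toy objective. The function `successive_halving`, the `evaluate` callback, and the scoring rule `toy_score` are illustrative assumptions and not part of this notebook's pipeline. ASHA applies the same "keep the top fraction, increase the budget" rule asynchronously, without waiting for every trial in a rung to finish.

```python
import math

def successive_halving(configs, evaluate, min_budget=1, max_budget=8, eta=2):
    """Keep the top 1/eta of configs at each rung, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated at that budget)
    while survivors:
        scored = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        history.append((budget, len(scored)))
        scored.sort(key=lambda pair: pair[0], reverse=True)  # higher score is better
        if budget >= max_budget or len(scored) == 1:
            return scored[0][1], history
        keep = max(1, len(scored) // eta)
        survivors = [cfg for _, cfg in scored[:keep]]
        budget *= eta

# Toy objective (assumption): score peaks near lr = 1e-3 and grows slightly with budget
def toy_score(cfg, budget):
    return -abs(math.log10(cfg["lr"]) + 3) + 0.01 * budget

candidates = [{"lr": 10 ** e} for e in (-5, -4, -3, -2, -1, 0, 1, 2)]
best, history = successive_halving(candidates, toy_score)
print(best, history)
```

Only one of the eight candidates receives the full budget here, which is where the compute savings come from.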

<h2>4. Implementation Details</h2>

<h2>4.1 Dataset: MNIST</h2>

The MNIST dataset consists of handwritten digits (0-9).

Training set: 10,000 images, randomly sampled from the 60,000-image MNIST training split.

Test set: 10,000 images (the full MNIST test split).

Image size: 28x28 pixels, grayscale.

Output classes: 10 (digits 0-9).
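
Because the subsets are drawn at random, results can vary between runs unless the draw is seeded. The sketch below shows one way to make the subset selection reproducible using only the standard library. The helper `sample_indices` and the seed value are hypothetical and are not used elsewhere in this notebook, which draws indices with an unseeded `np.random.choice`.

```python
import random

def sample_indices(dataset_size, subset_size, seed=0):
    """Draw `subset_size` distinct indices from range(dataset_size), reproducibly."""
    if subset_size > dataset_size:
        raise ValueError("subset_size cannot exceed dataset_size")
    rng = random.Random(seed)  # private generator: leaves the global random state untouched
    return rng.sample(range(dataset_size), subset_size)

# MNIST ships 60,000 training and 10,000 test images
train_idx = sample_indices(60_000, 10_000, seed=42)
test_idx = sample_indices(10_000, 10_000, seed=42)  # the whole test split, in shuffled order
print(len(train_idx), len(set(train_idx)))
```

The resulting index lists could be passed to `Subset` in the same way as the NumPy-generated indices.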

In [2]:
import numpy as np
import psutil
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

import optuna
from optuna.pruners import SuccessiveHalvingPruner
from optuna.visualization import plot_optimization_history, plot_param_importances

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

import ConfigSpace as CS
import hpbandster.core.nameserver as hpns
import hpbandster.core.result as hpres
from hpbandster.optimizers.bohb import BOHB
from hpbandster.core.worker import Worker

from torch.utils.tensorboard import SummaryWriter

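import ssl

# Workaround for certificate errors when downloading MNIST. This disables HTTPS
# certificate verification for the whole process, so use it only in a trusted environment.
ssl._create_default_https_context = ssl._create_unverified_context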



In [3]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset
import numpy as np

# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load full MNIST dataset
full_trainset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
full_testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Select 10,000 random indices for each of the train and test sets
# (the MNIST test split has exactly 10,000 images, so the test subset is the full split)
train_indices = np.random.choice(len(full_trainset), 10000, replace=False)
test_indices = np.random.choice(len(full_testset), 10000, replace=False)

# Create subsets of MNIST
trainset = Subset(full_trainset, train_indices)
testset = Subset(full_testset, test_indices)

# Create DataLoaders
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

dataset = (trainloader, testloader)

<h2>4.2 Model: CNN Architecture</h2>

The CNN model used for training consists of:

1. Two convolutional layers with ReLU activation.

2. Max-pooling layers for feature down-sampling.

3. Fully connected layers with a dropout layer.

4. A final linear layer that outputs 10 class logits; softmax is applied implicitly by `nn.CrossEntropyLoss` during training.

<b>Hyperparameters Considered</b>

Learning Rate - 1e-4 to 1e-2 (log scale)

Dropout Rate - 0.2 to 0.5

Number of Filters - 16, 32, 64
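
As a rough illustration of what "log scale" means for the learning rate, the sketch below draws configurations from the search space above using only the standard library. The helper `sample_config` is a hypothetical name. The notebook itself relies on Optuna's `suggest_float(..., log=True)` and `suggest_categorical` for sampling, so this only mirrors the idea.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from the search space listed above."""
    # Sampling uniformly in log space spreads draws evenly across orders of magnitude
    log_lr = rng.uniform(math.log(1e-4), math.log(1e-2))
    return {
        "lr": math.exp(log_lr),
        "dropout": rng.uniform(0.2, 0.5),
        "num_filters": rng.choice([16, 32, 64]),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

With a plain uniform draw over [1e-4, 1e-2], about 90% of samples would fall above 1e-3; the log-uniform draw gives each decade equal weight.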


In [4]:

# CNN Model for MNIST
class CNN(nn.Module):
    def __init__(self, dropout_rate=0.5, num_filters=32):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, num_filters, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(num_filters, num_filters * 2, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(num_filters * 2 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(dropout_rate)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
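
The input size of `fc1`, `num_filters * 2 * 7 * 7`, follows from the layer arithmetic: each 3x3 convolution with padding 1 preserves the spatial size, and each 2x2 max-pool halves it. The sketch below checks this with the standard output-size formula; the helper names `conv_out` and `flattened_features` are illustrative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(num_filters, image_size=28):
    """Number of features entering fc1 for the two-conv CNN defined above."""
    size = conv_out(image_size, kernel=3, stride=1, padding=1)  # conv1: 28 -> 28
    size = conv_out(size, kernel=2, stride=2)                   # pool:  28 -> 14
    size = conv_out(size, kernel=3, stride=1, padding=1)        # conv2: 14 -> 14
    size = conv_out(size, kernel=2, stride=2)                   # pool:  14 -> 7
    return num_filters * 2 * size * size  # conv2 has num_filters * 2 output channels

for n in (16, 32, 64):
    print(n, flattened_features(n))
```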

<h2>4.3 GPU Monitoring & Resource Utilization Tracking</h2>

We implemented real-time GPU monitoring using PyTorch’s memory allocation tracking.

GPU usage was recorded at each training epoch.

This allowed us to compare memory efficiency across different HPO techniques.

In [15]:
# Log system RAM and GPU memory usage, both in MB
def log_memory_usage(stage=""):
    """Return (ram_mb, gpu_mb). gpu_mb is a message string when CUDA is unavailable."""
    # System-wide RAM in use (not only this process), converted to MB
    ram_usage = psutil.virtual_memory().used / (1024 ** 2)
    if torch.cuda.is_available():
        device = torch.device("cuda")
        torch.cuda.empty_cache()  # release cached, unused blocks
        # Memory currently occupied by tensors on the device, in MB
        gpu_usage = torch.cuda.memory_allocated(device) / (1024 ** 2)
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        torch.mps.empty_cache()
        gpu_usage = "MPS does not expose memory tracking"
    else:
        gpu_usage = "No GPU available"
    return ram_usage, gpu_usage
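
Since `log_memory_usage` returns a message string instead of a number when no CUDA device is present, averaging the per-epoch readings needs some care. One possible approach, sketched below with a hypothetical helper `mean_numeric`, is to average only the numeric readings and return `None` when there are none.

```python
from numbers import Real

def mean_numeric(readings):
    """Average the numeric readings, ignoring string placeholders; None if there are none."""
    values = [r for r in readings if isinstance(r, Real) and not isinstance(r, bool)]
    return sum(values) / len(values) if values else None

print(mean_numeric([40.0, 42.0]))               # numeric GPU readings in MB
print(mean_numeric(["No GPU available"] * 3))   # placeholder strings only
```

This keeps the averaging logic independent of which device the notebook happens to run on.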



<h2>5. Comparison of HPO Approaches</h2>

The approaches are compared on three criteria:

1. Training Speed
2. Model Accuracy
3. GPU Memory Utilization

<h2>5.1 Baseline Model (No HPO)</h2>

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import time
import psutil
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset
import numpy as np


# Train Baseline Model (Without HPO) with GPU Logging
def train_baseline():
    writer = SummaryWriter(log_dir="./logs/baseline")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model = CNN().to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    
    start_time = time.time()
    memory_logs = []
    gpu_logs = []

    for epoch in range(5):
        model.train()
        epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        
        ram_usage, gpu_usage = log_memory_usage("Baseline")
        memory_logs.append(ram_usage)
        gpu_logs.append(gpu_usage)
        
        writer.add_scalar("Loss/train", epoch_loss / len(trainloader), epoch)
        writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)
        if torch.cuda.is_available():
            writer.add_scalar("Memory/GPU_RAM_MB", gpu_usage, epoch)

    end_time = time.time()

    avg_ram_usage = sum(memory_logs) / len(memory_logs)
    avg_gpu_usage = sum(gpu_logs) / len(gpu_logs) if torch.cuda.is_available() else "GPU not available"  # average over all epochs

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Baseline Accuracy: {accuracy:.2f}%, Training Time: {end_time - start_time:.2f}s, Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {avg_gpu_usage} MB")
    writer.close()

    return accuracy, end_time - start_time, avg_ram_usage, avg_gpu_usage

baseline_accuracy, baseline_time, baseline_memory, baseline_gpu = train_baseline()

Using device: cuda
Baseline Accuracy: 96.59%, Training Time: 14.33s, Avg CPU RAM Usage: 4210.89 MB, GPU Usage: 41.8486328125 MB


In [20]:
import optuna
from optuna.pruners import SuccessiveHalvingPruner
import multiprocessing
import torch
import time
import psutil
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter


class CNN(nn.Module):
    def __init__(self, dropout_rate=0.2, num_filters=16):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, num_filters, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(num_filters * 14 * 14, 10)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = x.view(x.size(0), -1)
        x = self.dropout(x)
        x = self.fc(x)
        return x


# Train Model with ASHA HPO, Memory Logging, and TensorBoard Logging
def train_cnn_asha(trial):
    writer = SummaryWriter(log_dir=f"./logs/asha_trial_{trial.number}")
    device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

    dropout_rate = trial.suggest_float("dropout", 0.2, 0.5)
    num_filters = trial.suggest_categorical("num_filters", [16, 32, 64])
    learning_rate = trial.suggest_float("lr", 1e-4, 1e-2, log=True)

    model = CNN(dropout_rate=dropout_rate, num_filters=num_filters).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    memory_logs = []
    gpu_logs = []
    start_time = time.time()

    for epoch in range(5):
        model.train()
        epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        ram_usage, gpu_usage = log_memory_usage("ASHA")
        memory_logs.append(ram_usage)
        gpu_logs.append(gpu_usage)

        writer.add_scalar("Loss/train", epoch_loss / len(trainloader), epoch)
        writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)
        if torch.cuda.is_available():
            writer.add_scalar("Memory/GPU_RAM_MB", gpu_usage, epoch)

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        trial.report(accuracy, epoch)

        if trial.should_prune():
            writer.close()
            raise optuna.exceptions.TrialPruned()

    end_time = time.time()

    avg_ram_usage = sum(memory_logs) / len(memory_logs)
    avg_gpu_usage = sum(gpu_logs) / len(gpu_logs) if torch.cuda.is_available() else "GPU not available"  # average over all epochs

    writer.add_scalar("Accuracy", accuracy)
    writer.add_scalar("Training Time (s)", end_time - start_time)
    if torch.cuda.is_available():
        writer.add_scalar("Memory/GPU_AVG_MB", avg_gpu_usage)
    writer.close()

    print(f"Accuracy: {accuracy:.2f}%, Training Time: {end_time - start_time:.2f}s, "
          f"Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {avg_gpu_usage}")

    return accuracy, end_time - start_time, avg_ram_usage, avg_gpu_usage

# Store best training time & resource usage
best_training_time = float("inf")
best_ram_usage = float("inf")
best_gpu_usage = None

# Optimize parallel processing
n_jobs = max(1, multiprocessing.cpu_count() // 2)  # Use half the available cores

# Allow faster reduced-precision (e.g. TF32) float32 matmuls where the hardware supports them
torch.set_float32_matmul_precision('high')

# Define Objective Function for Optuna
def objective(trial):
    global best_training_time, best_ram_usage, best_gpu_usage

    accuracy, training_time, avg_ram_usage, gpu_usage = train_cnn_asha(trial)

    # Track best training time & resource utilization
    if training_time < best_training_time:
        best_training_time = training_time
        best_ram_usage = avg_ram_usage
        best_gpu_usage = gpu_usage

    return accuracy  # Optuna optimizes based on accuracy

# Create Optuna Study with ASHA (Successive Halving)
study = optuna.create_study(
    study_name="asha_hpo",
    direction="maximize",
    pruner=SuccessiveHalvingPruner(),
    sampler=optuna.samplers.TPESampler(
        multivariate=True,
        constant_liar=True
    )
)

# Run Optimization (20 Trials)
study.optimize(objective, n_trials=20, n_jobs=n_jobs) # added n_jobs

# Print Best Results
print(f"\nBest Model Config: {study.best_params}")
print(f"Best Accuracy: {study.best_value:.2f}%")
print(f"Best Training Time: {best_training_time:.2f}s")
print(f"Best Avg CPU RAM Usage: {best_ram_usage:.2f} MB")

[I 2025-03-20 10:00:48,373] A new study created in memory with name: asha_hpo
[I 2025-03-20 10:03:55,930] Trial 5 pruned. 
[I 2025-03-20 10:05:16,634] Trial 4 finished with value: 96.95 and parameters: {'dropout': 0.24783961516293096, 'num_filters': 32, 'lr': 0.0010890338483695852}. Best is trial 0 with value: 96.95.
[I 2025-03-20 10:06:08,971] Trial 7 pruned. 
[I 2025-03-20 10:07:02,048] Trial 9 pruned. 
[I 2025-03-20 10:07:55,353] Trial 11 pruned. 
[I 2025-03-20 10:08:48,817] Trial 13 pruned. 
[I 2025-03-20 10:09:16,259] Trial 14 pruned. 
[I 2025-03-20 10:10:09,580] Trial 16 pruned. 
[I 2025-03-20 10:10:37,183] Trial 17 pruned. 
[I 2025-03-20 10:11:01,903] Trial 15 pruned. 
[I 2025-03-20 10:11:16,127] Trial 19 pruned. 


Accuracy: 94.42%, Training Time: 133.67s, Avg CPU RAM Usage: 4204.11 MB, GPU Usage: 45.9033203125
Accuracy: 96.95%, Training Time: 133.73s, Avg CPU RAM Usage: 4204.12 MB, GPU Usage: 46.21630859375
Accuracy: 96.32%, Training Time: 133.73s, Avg CPU RAM Usage: 4204.11 MB, GPU Usage: 52.3802490234375
Accuracy: 94.40%, Training Time: 133.65s, Avg CPU RAM Usage: 4202.19 MB, GPU Usage: 46.1240234375
Accuracy: 96.95%, Training Time: 134.23s, Avg CPU RAM Usage: 4199.96 MB, GPU Usage: 46.124267578125
Accuracy: 96.97%, Training Time: 133.76s, Avg CPU RAM Usage: 4194.03 MB, GPU Usage: 46.63330078125
Accuracy: 97.05%, Training Time: 133.31s, Avg CPU RAM Usage: 4191.87 MB, GPU Usage: 46.364990234375
Accuracy: 97.18%, Training Time: 133.13s, Avg CPU RAM Usage: 4197.78 MB, GPU Usage: 46.12451171875
Accuracy: 97.39%, Training Time: 133.66s, Avg CPU RAM Usage: 4195.47 MB, GPU Usage: 46.004638671875

Best Model Config: {'dropout': 0.2136591945049297, 'num_filters': 64, 'lr': 0.004820265494502213}
Best Ac

<h2>5.2 ASHA HPO</h2>

In [28]:
# Train Model with ASHA HPO, Memory Logging, and TensorBoard Logging
def train_cnn_asha(trial):
    writer = SummaryWriter(log_dir=f"./logs/asha_trial_{trial.number}")
    device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

    dropout_rate = trial.suggest_float("dropout", 0.2, 0.5)
    num_filters = trial.suggest_categorical("num_filters", [16, 32, 64])
    learning_rate = trial.suggest_float("lr", 1e-4, 1e-2, log=True)

    model = CNN(dropout_rate=dropout_rate, num_filters=num_filters).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    memory_logs = []
    gpu_logs = []
    start_time = time.time()

    for epoch in range(5):
        model.train()
        epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        ram_usage, gpu_usage = log_memory_usage("ASHA")
        memory_logs.append(ram_usage)
        gpu_logs.append(gpu_usage)

        writer.add_scalar("Loss/train", epoch_loss / len(trainloader), epoch)
        writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)
        if torch.cuda.is_available():
            writer.add_scalar("Memory/GPU_RAM_MB", gpu_usage, epoch)

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        trial.report(accuracy, epoch)

        if trial.should_prune():
            writer.close()
            raise optuna.exceptions.TrialPruned()

    end_time = time.time()

    avg_ram_usage = sum(memory_logs) / len(memory_logs)
    avg_gpu_usage = sum(gpu_logs) / len(gpu_logs) if torch.cuda.is_available() else "GPU not available"  # average over all epochs

    writer.add_scalar("Accuracy", accuracy)
    writer.add_scalar("Training Time (s)", end_time - start_time)
    if torch.cuda.is_available():
        writer.add_scalar("Memory/GPU_AVG_MB", avg_gpu_usage)
    writer.close()

    print(f"Accuracy: {accuracy:.2f}%, Training Time: {end_time - start_time:.2f}s, "
          f"Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {avg_gpu_usage}")

    return accuracy, end_time - start_time, avg_ram_usage, avg_gpu_usage

# Store best training time & resource usage
best_training_time = float("inf")
best_ram_usage = float("inf")
best_gpu_usage = None

# Optimize parallel processing
n_jobs = max(1, multiprocessing.cpu_count() // 2)  # Use half the available cores

# Allow faster reduced-precision (e.g. TF32) float32 matmuls where the hardware supports them
torch.set_float32_matmul_precision('high')

# Define Objective Function for Optuna
def objective(trial):
    global best_training_time, best_ram_usage, best_gpu_usage

    accuracy, training_time, avg_ram_usage, gpu_usage = train_cnn_asha(trial)

    # Track best training time & resource utilization
    if training_time < best_training_time:
        best_training_time = training_time
        best_ram_usage = avg_ram_usage
        best_gpu_usage = gpu_usage

    return accuracy  # Optuna optimizes based on accuracy

# Create Optuna Study with ASHA (Successive Halving)
study = optuna.create_study(
    study_name="asha_hpo",
    direction="maximize",
    pruner=SuccessiveHalvingPruner(),
    sampler=optuna.samplers.TPESampler(
        multivariate=True,
        constant_liar=True
    )
)

# Run Optimization (20 Trials)
study.optimize(objective, n_trials=20, n_jobs=n_jobs) # added n_jobs

# Print Best Results
print(f"\nBest Model Config: {study.best_params}")
print(f"Best Accuracy: {study.best_value:.2f}%")
print(f"Best Training Time: {best_training_time:.2f}s")
print(f"Best Avg CPU RAM Usage: {best_ram_usage:.2f}MB")

[I 2025-03-20 06:41:12,563] A new study created in memory with name: asha_hpo
[I 2025-03-20 06:44:25,651] Trial 3 pruned. 
[I 2025-03-20 06:44:26,482] Trial 4 pruned. 
[I 2025-03-20 06:44:26,799] Trial 5 pruned. 
[I 2025-03-20 06:45:20,935] Trial 6 pruned. 
[I 2025-03-20 06:45:22,035] Trial 7 pruned. 
[I 2025-03-20 06:45:22,324] Trial 8 pruned. 
[I 2025-03-20 06:46:16,685] Trial 9 pruned. 
[I 2025-03-20 06:46:17,562] Trial 10 pruned. 
[I 2025-03-20 06:46:18,488] Trial 11 pruned. 
[I 2025-03-20 06:47:13,960] Trial 13 pruned. 
[I 2025-03-20 06:47:15,760] Trial 14 pruned. 
[I 2025-03-20 06:48:10,517] Trial 15 pruned. 
[I 2025-03-20 06:48:12,766] Trial 16 pruned. 
[I 2025-03-20 06:49:07,474] Trial 17 pruned. 
[I 2025-03-20 06:49:08,577] Trial 18 pruned. 


Accuracy: 94.12%, Training Time: 138.57s, Avg CPU RAM Usage: 17585.58 MB, GPU Usage: 11760.55224609375
Accuracy: 96.83%, Training Time: 138.75s, Avg CPU RAM Usage: 17587.96 MB, GPU Usage: 11759.0458984375
Accuracy: 97.39%, Training Time: 142.65s, Avg CPU RAM Usage: 17582.59 MB, GPU Usage: 11759.58251953125

Best Model Config: {'dropout': 0.22384388148601603, 'num_filters': 32, 'lr': 0.008830491863782234}
Best Accuracy: 97.39%
Best Training Time: 137.99s
Best Avg CPU RAM Usage: 17585.49MB


<b>Optuna Study with ASHA (Successive Halving)</b>

In [31]:
import optuna
from optuna.pruners import SuccessiveHalvingPruner
import multiprocessing

# Store best training time & resource usage
best_training_time = float("inf")
best_ram_usage = float("inf")
best_gpu_usage = None

# Optimize parallel processing
# n_jobs = max(1, multiprocessing.cpu_count() // 2)  # Use half the available cores

# Allow faster reduced-precision (e.g. TF32) float32 matmuls where the hardware supports them
torch.set_float32_matmul_precision('high') 

# Define Objective Function for Optuna
def objective(trial):
    global best_training_time, best_ram_usage, best_gpu_usage

    accuracy, training_time, avg_ram_usage, gpu_usage = train_cnn_asha(trial)  # Now returns more metrics

    # Track best training time & resource utilization
    if training_time < best_training_time:
        best_training_time = training_time
        best_ram_usage = avg_ram_usage
        best_gpu_usage = gpu_usage

    return accuracy  # Optuna optimizes based on accuracy

# Create Optuna Study with ASHA (Successive Halving)
study = optuna.create_study(
    study_name="asha_hpo",
    direction="maximize",  # We want to maximize accuracy
    pruner=SuccessiveHalvingPruner(),  # ASHA Pruning
    sampler=optuna.samplers.TPESampler(
        multivariate=True,  # Optimizes multiple parameters together
        constant_liar=True  # Avoids redundant evaluations
    )
)

# Run Optimization (20 Trials)
study.optimize(objective, n_trials=20)

# Print Best Results
print(f"\nBest Model Config: {study.best_params}")
print(f"Best Accuracy: {study.best_value:.2f}%")
print(f"Best Training Time: {best_training_time:.2f}s")
print(f"Best Avg CPU RAM Usage: {best_ram_usage:.2f} MB")
print(f"Best GPU Usage: {best_gpu_usage}")

# Plot optimization history
optimization_history_fig = plot_optimization_history(study)
optimization_history_fig.show()

# Plot parameter importances
param_importances_fig = plot_param_importances(study)
param_importances_fig.show()


Argument ``multivariate`` is an experimental feature. The interface can change in the future.


Argument ``constant_liar`` is an experimental feature. The interface can change in the future.

[I 2025-03-20 07:17:37,196] A new study created in memory with name: asha_hpo
[I 2025-03-20 07:18:47,831] Trial 2 pruned. 
[I 2025-03-20 07:19:27,662] Trial 4 pruned. 
[I 2025-03-20 07:19:55,833] Trial 5 finished with value: 97.1 and parameters: {'dropout': 0.2591846152128633, 'num_filters': 64, 'lr': 0.0028806532818027863}. Best is trial 5 with value: 97.1.
[I 2025-03-20 07:20:07,065] Trial 6 pruned. 
[I 2025-03-20 07:20:18,522] Trial 7 pruned. 
[I 2025-03-20 07:20:30,563] Trial 8 pruned. 
[I 2025-03-20 07:20:42,338] Trial 9 pruned. 
[I 2025-03-20 07:21:11,798] Trial 10 pruned. 
[I 2025-03-20 07:21:23,121] Trial 11 pruned. 
[I 2025-03-20 07:21:52,965] Trial 12 pruned. 
[I 2025-03-20 07:22:23,656] Trial 13 pruned. 
[I 2025-03-20 07:22:35,583] Trial 14 pruned. 
[I 2025-03-20 07:22:47,495] Trial 15

Accuracy: 92.67%, Training Time: 29.09s, Avg CPU RAM Usage: 17603.83 MB, GPU Usage: 11757.201416015625
Accuracy: 96.36%, Training Time: 29.31s, Avg CPU RAM Usage: 17610.68 MB, GPU Usage: 11757.201416015625
Accuracy: 96.91%, Training Time: 28.43s, Avg CPU RAM Usage: 17616.26 MB, GPU Usage: 11757.738525390625
Accuracy: 97.10%, Training Time: 28.09s, Avg CPU RAM Usage: 17604.38 MB, GPU Usage: 11758.699462890625

Best Model Config: {'dropout': 0.2591846152128633, 'num_filters': 64, 'lr': 0.0028806532818027863}
Best Accuracy: 97.10%
Best Training Time: 28.09s
Best Avg CPU RAM Usage: 17604.38 MB
Best GPU Usage: 11758.699462890625


## Distributed Hyperparameter Optimization (HPO) with Ray Tune

To improve the efficiency of hyperparameter tuning, we implement a distributed HPO strategy using Ray Tune. Ray Tune combines search algorithms (such as Bayesian optimization and evolutionary methods) with trial schedulers such as Asynchronous Successive Halving (ASHA), which makes it well suited to large-scale HPO tasks.

### 1. Search Space Definition

We define the search space for a deep learning model as follows:

* **Learning rate:** Log-uniform distribution between 1e-5 and 1e-1.
* **Batch size:** Categorical values [16, 32, 64].
* **Dropout rate:** Uniform distribution between 0.1 and 0.5.
* **Number of layers:** Integer values from 2 to 4 (`tune.randint(2, 5)` excludes the upper bound).

### 2. Parallel Execution using Ray

Ray Tune enables parallel trial execution across multiple nodes and GPUs:

* Configure a Ray cluster for multi-node execution.
* Utilize BOHB (Bayesian Optimization HyperBand) for sample efficiency and effective exploration.
* Employ ASHA (Asynchronous Successive Halving) for dynamic early stopping, reducing unnecessary computation.
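
How many trials run in parallel depends on the resources requested per trial relative to what the cluster offers. The sketch below is a simplified back-of-the-envelope calculation, not Ray's actual scheduling logic: the helper `max_concurrent_trials` and the cluster sizes are hypothetical, and per-node placement constraints are ignored.

```python
def max_concurrent_trials(cluster, per_trial):
    """Number of trials that fit at once, limited by the scarcest requested resource."""
    limits = [
        int(cluster.get(name, 0) // amount)
        for name, amount in per_trial.items()
        if amount > 0  # resources a trial does not request impose no limit
    ]
    return min(limits) if limits else 0

# Hypothetical cluster: 2 nodes with 8 CPUs and 1 GPU each
cluster = {"cpu": 16, "gpu": 2}
cpu_only = max_concurrent_trials(cluster, {"cpu": 1, "gpu": 0})
shared_gpu = max_concurrent_trials(cluster, {"cpu": 2, "gpu": 0.5})  # two trials per GPU
print(cpu_only, shared_gpu)
```

Requesting a fractional GPU per trial is one way to run more small models, such as the MNIST networks here, side by side on the same device.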

### 3. Adaptive Scheduling and Resource Allocation

Ray Tune dynamically reallocates resources to the most promising trials:

* Trials demonstrating poor performance are stopped early using ASHA.
* Bayesian models refine the search space, guiding the search towards better configurations.
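
For intuition about when ASHA makes its stopping decisions, the sketch below computes the rung milestones for the scheduler settings used in this notebook (`max_t=10`, `grace_period=1`, `reduction_factor=2`) and a rough count of trials that continue past each rung. The helpers `asha_rungs` and `expected_survivors` are illustrative assumptions: Ray's implementation may differ in edge cases, and because decisions are asynchronous the actual counts are only approximate.

```python
def asha_rungs(grace_period, max_t, reduction_factor):
    """Iterations at which trials are compared: grace_period * reduction_factor**k."""
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs

def expected_survivors(num_trials, rungs, reduction_factor):
    """Rough count of trials continuing past each rung if the top 1/reduction_factor advance."""
    survivors = []
    n = num_trials
    for _ in rungs:
        n = max(1, n // reduction_factor)
        survivors.append(n)
    return survivors

rungs = asha_rungs(grace_period=1, max_t=10, reduction_factor=2)
survivors = expected_survivors(20, rungs, 2)
print(rungs, survivors)
```

Under these assumptions, most of the 20 sampled trials stop after one or two epochs, and only a handful train for the full budget.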

### 4. Implementation in Ray


In [None]:
import ray
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import Subset, DataLoader
import numpy as np
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
import torch.nn as nn
import torch.optim as optim

ray.shutdown()
ray.init()

# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load full MNIST dataset
full_trainset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
full_testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Select 10,000 random indices for train and test sets
train_indices = np.random.choice(len(full_trainset), 10000, replace=False)
test_indices = np.random.choice(len(full_testset), 10000, replace=False)

# Create subsets of MNIST
trainset = Subset(full_trainset, train_indices)
testset = Subset(full_testset, test_indices)

# Loaders using Ray
def load_data(config):
    trainloader = DataLoader(trainset, batch_size=config["batch_size"], shuffle=True)
    testloader = DataLoader(testset, batch_size=config["batch_size"], shuffle=False)
    return trainloader, testloader



# Search space
search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "batch_size": tune.choice([16, 32, 64]),
    "dropout": tune.uniform(0.1, 0.5),
    "num_layers": tune.randint(2, 5)
}

# ASHA scheduler
scheduler = ASHAScheduler(
    max_t=10,
    grace_period=1,
    reduction_factor=2
)

# Optuna Search
search_alg = OptunaSearch(
    metric="loss",
    mode="min"
)

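# Build model: the first hidden layer maps 784 -> 128, later hidden layers map 128 -> 128
def build_model(config):
    layers = [nn.Flatten()]
    in_features = 784
    for _ in range(config["num_layers"]):
        layers.append(nn.Linear(in_features, 128))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(config["dropout"]))
        in_features = 128  # every layer after the first receives 128 features
    layers.append(nn.Linear(128, 10))
    return nn.Sequential(*layers)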

# Training function
def train(config):
    trainloader, testloader = load_data(config)
    model = build_model(config)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(10):
        model.train()
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in testloader:
                outputs = model(inputs)
                val_loss += criterion(outputs, labels).item()

        # Pass a metrics dict; keyword-style tune.report(loss=...) is the legacy API
        tune.report(metrics={"loss": val_loss / len(testloader)})

# Tuner configuration
tuner = tune.Tuner(
    train,
    tune_config=tune.TuneConfig(
        search_alg=search_alg,
        scheduler=scheduler,
        num_samples=20,
        metric="loss",
        mode="min"
    ),
    param_space=search_space
)

results = tuner.fit()

print("Best hyperparameters found were: ", results.get_best_result().config)
ray.shutdown()

In [None]:
import ray
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import Subset, DataLoader
import numpy as np
from ray.tune.search.optuna import OptunaSearch
from ray import tune
from ray.tune.schedulers import ASHAScheduler
import torch.nn as nn
import torch.optim as optim
from ray.air import session
from ray.train import Checkpoint


ray.shutdown()
if not ray.is_initialized():
    ray.init()

# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load full MNIST dataset
full_trainset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
full_testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Select 10000 random indices for train and test sets
train_indices = np.random.choice(len(full_trainset), 10000, replace=False)
test_indices = np.random.choice(len(full_testset), 10000, replace=False)

# Create subsets of MNIST
trainset = Subset(full_trainset, train_indices)
testset = Subset(full_testset, test_indices)

# Place datasets in Ray's object store
trainset_ref = ray.put(trainset)
testset_ref = ray.put(testset)

# Loaders using Ray
def load_data(config, trainset_ref, testset_ref):
    trainset = ray.get(trainset_ref)
    testset = ray.get(testset_ref)
    trainloader = DataLoader(trainset, batch_size=config["batch_size"], shuffle=True)
    testloader = DataLoader(testset, batch_size=config["batch_size"], shuffle=False)
    return trainloader, testloader

# Search space
search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "batch_size": tune.choice([16, 32, 64]),
    "dropout": tune.uniform(0.1, 0.5),
    "num_layers": tune.randint(2, 5)
}

# ASHA scheduler
scheduler = ASHAScheduler(
    max_t=10,
    grace_period=1,
    reduction_factor=2
)

# Optuna Search
search_alg = OptunaSearch(
    metric="loss",
    mode="min"
)


def build_model(config):
    layers = [nn.Flatten()]
    layers.append(nn.Linear(784, 128))
    layers.append(nn.ReLU())
    layers.append(nn.Dropout(config["dropout"]))
    
    for _ in range(config["num_layers"] - 1):  # Subsequent layers take input size 128
        layers.append(nn.Linear(128, 128))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(config["dropout"]))

    layers.append(nn.Linear(128, 10))
    return nn.Sequential(*layers)



# Training function
def train(config):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Determine device
    trainloader, testloader = load_data(config, trainset_ref, testset_ref)  # Load data with ref
    model = build_model(config).to(device)  # Model to device

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(10):
        model.train()
        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.to(device)
            # inputs = inputs.view(inputs.size(0), -1)  # Ensure flattening - Not needed now!
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in testloader:
                inputs, labels = inputs.to(device), labels.to(device)
                # inputs = inputs.view(inputs.size(0), -1)  # Ensure flattening - Not needed now!
                outputs = model(inputs)
                val_loss_batch = criterion(outputs, labels)
                val_loss += val_loss_batch.item()

        tune.report(
            metrics={"loss": val_loss / len(testloader)}
        )

# Tuner configuration
tuner = tune.Tuner(
    tune.with_resources(
        train,
        resources={"cpu": 1, "gpu": 0}  # CPU-only trials; request "gpu": 1 (or a fraction) to let a trial train on a GPU
    ),
    tune_config=tune.TuneConfig(
        search_alg=search_alg,
        scheduler=scheduler,
        num_samples=20,
        metric="loss",
        mode="min"
    ),
    param_space=search_space
)

results = tuner.fit()

print("Best hyperparameters found were: ", results.get_best_result().config)
ray.shutdown()

Current time: 2025-03-20 18:33:10
Running for: 00:00:10.24
Memory: 10.4/16.0 GiB

| Trial name | status | loc | batch_size | dropout | lr | num_layers | iter | total time (s) | loss |
|---|---|---|---|---|---|---|---|---|---|
| train_5a9f8023 | RUNNING | 127.0.0.1:47964 | 64 | 0.408394 | 0.010291 | 4 | 1.0 | 5.30102 | 1.88316 |
| train_4212f678 | RUNNING | 127.0.0.1:47965 | 16 | 0.349154 | 0.000226136 | 3 | | | |
| train_b94a048c | PENDING | | 16 | 0.384521 | 0.000277304 | 4 | | | |
