<h1>Distributed Hyperparameter Optimization (HPO) Techniques for CNN on MNIST</h1>

In [None]:
# TO BE DELETED ONCE COMPLETE
%pip install torchvision
%pip install optuna
%pip install hpbandster
%pip install ConfigSpace
%pip install torch
%pip install torchsummary
%pip install plotly
%pip install matplotlib
%pip install "ray[tune]"
%pip install -U ipywidgets
%pip install "ray[tune]" ray[default] ray[tune-bohb]
%pip install OptunaSearch


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;4

<h2>1. Introduction</h2>

Hyperparameter Optimization (HPO) is a critical step in deep learning model training to improve accuracy and efficiency. 
Traditional hyperparameter tuning approaches like Grid Search and Random Search are computationally expensive and inefficient. 

In this assignment, we compare and analyze different hyperparameter optimization strategies using distributed computing to achieve optimal hyperparameter selection efficiently.

<h2>2. Objectives</h2>

The goal of this project is to:

1. Compare multiple HPO techniques for training a Convolutional Neural Network (CNN) on the MNIST dataset.

2. Evaluate these techniques based on training speed, search efficiency, accuracy, and GPU resource utilization.

3. Implement real-time GPU monitoring to track memory usage and optimize resource allocation.

4. Identify the most effective HPO method that balances speed, accuracy, and efficiency.

<h2>3. HPO Strategies Implemented</h2>

We implemented and compared four different approaches for HPO:

1. Baseline (No HPO): Train the model with default hyperparameters.

2. ASHA (Asynchronous Successive Halving Algorithm): Eliminates underperforming trials early to speed up training.

3. BOHB (Bayesian Optimization + HyperBand): Uses Bayesian learning to intelligently select hyperparameters while efficiently allocating compute resources.

4. BOHB + ASHA Hybrid: Combines BOHB’s smart selection with ASHA’s aggressive pruning for improved efficiency.

<h2>4. Implementation Details</h2>

<h2>4.1 Dataset: MNIST</h2>

The MNIST dataset consists of handwritten digits (0-9).

Training set: 1000 images.

Test set: 1000 images.

Image size: 28x28 pixels, grayscale.

Output classes: 10 (digits 0-9).

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

import optuna
from optuna.pruners import SuccessiveHalvingPruner

import hpbandster.core.nameserver as hpns
import hpbandster.core.result as hpres
from hpbandster.optimizers.bohb import BOHB
from hpbandster.core.worker import Worker
import ConfigSpace as CS

from torch.utils.tensorboard import SummaryWriter
import ssl
import time
import psutil
import random
import numpy as np

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

ssl._create_default_https_context = ssl._create_unverified_context

2025-03-19 11:32:59.997454: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:

# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load full MNIST dataset
full_trainset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
full_testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Select 1000 random indices for train and test sets
train_indices = np.random.choice(len(full_trainset), 10000, replace=False)
test_indices = np.random.choice(len(full_testset), 10000, replace=False)

# Create subsets of MNIST
trainset = Subset(full_trainset, train_indices)
testset = Subset(full_testset, test_indices)

# Create DataLoaders
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

dataset = (trainloader, testloader)


<h2>4.2 Model: CNN Architecture</h2>

The CNN model used for training consists of:

1. Two convolutional layers with ReLU activation.

2. Max-pooling layers for feature down-sampling.

3. Fully connected layers with a dropout layer.

4. Softmax activation for classification.

<b>Hyperparameters Considered</b>

Learning Rate - 1e-4 to 1e-2 (log scale)

Dropout Rate - 0.2 to 0.5

Number of Filters - 16, 32, 64


In [24]:


# CNN Model for MNIST
class CNN(nn.Module):
    def __init__(self, dropout_rate=0.5, num_filters=32):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, num_filters, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(num_filters, num_filters * 2, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(num_filters * 2 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(dropout_rate)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

<h2>4.3 GPU Monitoring & Resource Utilization Tracking</h2>

We implemented real-time GPU monitoring using PyTorch’s memory allocation tracking.

GPU usage was recorded at each training epoch.

This allowed us to compare memory efficiency across different HPO techniques.

In [25]:
# Function to log memory usage (CPU + GPU approximation)
def log_memory_usage(stage=""):
    # Get CPU RAM usage
    ram_usage = psutil.virtual_memory().used / (1024 ** 2)  # Convert to MB
    
    # Get GPU memory (Approximate via tensor usage)
    if device.type == "mps":
        torch.mps.empty_cache()  # Free unused memory (for better tracking)
        gpu_usage = "MPS does not expose memory tracking"
    else:
        gpu_usage = "GPU not in use"
    
    return ram_usage, gpu_usage

<h2>5. Comparison of HPO Approaches</h2>

1. Training Speed
2. Model Accuracy
3. GPU Memory Utilization

<h2>5.1 Baseline Model (No HPO)</h2>

In [26]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

In [27]:


# Train Baseline Model (Without HPO) with GPU Logging
def train_baseline():
    writer = SummaryWriter(log_dir="./logs/baseline")
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    print(f"Using device: {device}")
    model = CNN().to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    
    start_time = time.time()
    # gpu_usages = []
    memory_logs = []

    for epoch in range(5):
        model.train()
        epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        
        # Log GPU Memory
        # gpu_usage = log_gpu_usage("Baseline")
        # gpu_usages.append(gpu_usage)
        
        # Log Memory Usage
        ram_usage, gpu_usage = log_memory_usage("Baseline")
        memory_logs.append(ram_usage)
        
        writer.add_scalar("Loss/train", epoch_loss / len(trainloader), epoch)
        writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)
    
    end_time = time.time()
    
    # Compute GPU Usage Stats
    # avg_gpu_usage = sum(gpu_usages) / len(gpu_usages)
    # avg_gpu_usage = "MPS memory tracking unavailable"

    # Compute Average Memory Usage
    avg_ram_usage = sum(memory_logs) / len(memory_logs)

    # Test Model
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Baseline Accuracy: {accuracy:.2f}%, Training Time: {end_time - start_time:.2f}s, Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {gpu_usage}")
    writer.close()

    # return accuracy, end_time - start_time, avg_gpu_usage
    return accuracy, end_time - start_time, avg_ram_usage
    

# Run Baseline Training
# baseline_accuracy, baseline_time, baseline_gpu = train_baseline()
baseline_accuracy, baseline_time, baseline_memory = train_baseline()


Using device: mps
Baseline Accuracy: 97.94%, Training Time: 38.61s, Avg CPU RAM Usage: 9336.10 MB, GPU Usage: MPS does not expose memory tracking


<h2>5.2 ASHA HPO</h2>

In [28]:


# Train Model with ASHA HPO, Memory Logging, and TensorBoard Logging
def train_cnn_asha(trial):
    writer = SummaryWriter(log_dir=f"./logs/asha_trial_{trial.number}")  # TensorBoard log directory
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    # Sample hyperparameters using Optuna
    dropout_rate = trial.suggest_float("dropout", 0.2, 0.5)
    num_filters = trial.suggest_categorical("num_filters", [16, 32, 64])
    learning_rate = trial.suggest_float("lr", 1e-4, 1e-2, log=True)

    model = CNN(dropout_rate=dropout_rate, num_filters=num_filters).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    memory_logs = []
    start_time = time.time()

    for epoch in range(5):
        model.train()
        epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Log Memory Usage
        ram_usage, gpu_usage = log_memory_usage("ASHA")
        memory_logs.append(ram_usage)

        # Log Loss and Memory to TensorBoard
        writer.add_scalar("Loss/train", epoch_loss / len(trainloader), epoch)
        writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)

        # Evaluate Model (ASHA needs validation accuracy for pruning)
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total

        # Report accuracy for ASHA pruning
        trial.report(accuracy, epoch)

        # ASHA: Stop bad trials early
        if trial.should_prune():
            writer.close()  # Ensure writer closes even when pruned
            raise optuna.exceptions.TrialPruned()

    end_time = time.time()

    # Compute Average CPU Memory Usage
    avg_ram_usage = sum(memory_logs) / len(memory_logs)

    # Log final accuracy and memory stats to TensorBoard
    writer.add_scalar("Accuracy", accuracy)
    writer.add_scalar("Training Time (s)", end_time - start_time)
    writer.close()

    # Print Summary (Same Format as Baseline)
    print(f"Accuracy: {accuracy:.2f}%, Training Time: {end_time - start_time:.2f}s, "
          f"Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {gpu_usage}")

    return accuracy, end_time - start_time, avg_ram_usage, gpu_usage

In [29]:
import optuna
from optuna.pruners import SuccessiveHalvingPruner
import multiprocessing

# Store best training time & resource usage
best_training_time = float("inf")
best_ram_usage = float("inf")
best_gpu_usage = None

# Optimize parallel processing
# n_jobs = max(1, multiprocessing.cpu_count() // 2)  # Use half the available cores

# Enable best GPU performance for Apple MPS
torch.set_float32_matmul_precision('high') 

# Define Objective Function for Optuna
def objective(trial):
    global best_training_time, best_ram_usage, best_gpu_usage

    accuracy, training_time, avg_ram_usage, gpu_usage = train_cnn_asha(trial)  # Now returns more metrics

    # Track best training time & resource utilization
    if training_time < best_training_time:
        best_training_time = training_time
        best_ram_usage = avg_ram_usage
        best_gpu_usage = gpu_usage

    return accuracy  # Optuna optimizes based on accuracy

# Create Optuna Study with ASHA (Successive Halving)
study = optuna.create_study(
    study_name="asha_hpo",
    direction="maximize",  # We want to maximize accuracy
    pruner=SuccessiveHalvingPruner(),  # ASHA Pruning
    sampler=optuna.samplers.TPESampler(
        multivariate=True,  # Optimizes multiple parameters together
        constant_liar=True  # Avoids redundant evaluations
    )
)

# Run Optimization (20 Trials)
study.optimize(objective, n_trials=10)

# Print Best Results
print(f"\nBest Model Config: {study.best_params}")
print(f"Best Accuracy: {study.best_value:.2f}%")
print(f"Best Training Time: {best_training_time:.2f}s")
print(f"Best Avg CPU RAM Usage: {best_ram_usage:.2f} MB")
print(f"Best GPU Usage: {best_gpu_usage}")



Argument ``multivariate`` is an experimental feature. The interface can change in the future.


Argument ``constant_liar`` is an experimental feature. The interface can change in the future.

[I 2025-03-18 12:55:47,027] A new study created in memory with name: asha_hpo
[I 2025-03-18 12:56:44,067] Trial 0 finished with value: 98.34 and parameters: {'dropout': 0.43788936202229833, 'num_filters': 64, 'lr': 0.0011238396202911856}. Best is trial 0 with value: 98.34.


Accuracy: 98.34%, Training Time: 57.02s, Avg CPU RAM Usage: 9322.79 MB, GPU Usage: MPS does not expose memory tracking


[I 2025-03-18 12:57:55,684] Trial 1 finished with value: 98.18 and parameters: {'dropout': 0.3018953222890598, 'num_filters': 32, 'lr': 0.003024770650463434}. Best is trial 0 with value: 98.34.


Accuracy: 98.18%, Training Time: 71.59s, Avg CPU RAM Usage: 9164.19 MB, GPU Usage: MPS does not expose memory tracking


[I 2025-03-18 12:58:18,017] Trial 2 pruned. 
[I 2025-03-18 12:58:38,433] Trial 3 pruned. 
[I 2025-03-18 12:59:00,523] Trial 4 pruned. 
[I 2025-03-18 12:59:20,957] Trial 5 pruned. 
[I 2025-03-18 12:59:44,125] Trial 6 pruned. 
[I 2025-03-18 13:00:06,957] Trial 7 pruned. 
[I 2025-03-18 13:01:00,800] Trial 8 pruned. 
[I 2025-03-18 13:01:22,530] Trial 9 pruned. 



Best Model Config: {'dropout': 0.43788936202229833, 'num_filters': 64, 'lr': 0.0011238396202911856}
Best Accuracy: 98.34%
Best Training Time: 57.02s
Best Avg CPU RAM Usage: 9322.79 MB
Best GPU Usage: MPS does not expose memory tracking


In [30]:
from optuna.visualization import plot_optimization_history, plot_param_importances

# Plot optimization history
optimization_history_fig = plot_optimization_history(study)
optimization_history_fig.show()

# Plot parameter importances
param_importances_fig = plot_param_importances(study)
param_importances_fig.show()

<h2>5.3 Train with BOHB HPO</h2>

In [31]:


# Define Worker for BOHB
class CNNWorker(Worker):
    def __init__(self, run_id, dataset, **kwargs):
        # print('__init__')
        super().__init__(run_id, **kwargs)
        self.trainloader, self.testloader = dataset
        self.device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    def compute(self, config, budget, **kwargs):
        print(type(config))
        print(type(budget))
        writer = SummaryWriter(log_dir=f"./logs/bohb_trial_{config}")
        print('writer')

        # ✅ Convert `config` to Python native dict
        config_native = {key: int(value) if isinstance(value, np.integer) else float(value) if isinstance(value, np.floating) else value for key, value in config.items()}
        
        model = CNN(dropout_rate=float(config["dropout"]), num_filters=int(config["num_filters"])).to(self.device)
        optimizer = optim.Adam(model.parameters(), lr=float(config["lr"]))
        loss_fn = nn.CrossEntropyLoss()
        print('reached loss_fn')

        memory_logs = []
        start_time = time.time()

        for epoch in range(int(budget)):  # `budget` is set by BOHB (early stopping)
            model.train()
            epoch_loss = 0
            for images, labels in self.trainloader:
                images, labels = images.to(self.device), labels.to(self.device)
                optimizer.zero_grad()
                outputs = model(images)
                loss = loss_fn(outputs, labels)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()

            # Log Memory Usage
            ram_usage, gpu_usage = log_memory_usage()
            memory_logs.append(float(ram_usage))

            # Log Loss and Memory to TensorBoard
            writer.add_scalar("Loss/train", epoch_loss / len(self.trainloader), epoch)
            writer.add_scalar("Memory/CPU_RAM_MB", ram_usage, epoch)

        end_time = time.time()
        avg_ram_usage = sum(memory_logs) / len(memory_logs)

        # Evaluate Model
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in self.testloader:
                images, labels = images.to(self.device), labels.to(self.device)
                outputs = model(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total

        # ✅ Convert all NumPy types to standard Python types before returning
        accuracy = int(np.round(accuracy))  # Convert NumPy int64 to Python int
        avg_ram_usage = float(np.round(avg_ram_usage, 2))  # Convert float32 to Python float
        training_time = float(np.round(end_time - start_time, 2))  # Convert time to Python float

        writer.add_scalar("Accuracy", accuracy)
        writer.add_scalar("Training Time (s)", training_time)
        writer.close()

        print(f"Accuracy: {accuracy:.2f}%, Training Time: {training_time:.2f}s, "
              f"Avg CPU RAM Usage: {avg_ram_usage:.2f} MB, GPU Usage: {gpu_usage}")

        return {
        "loss": -accuracy,  # Loss should be negative for BOHB to maximize accuracy
        "info": {
            "training_time": training_time,
            "ram_usage": avg_ram_usage,
            "config": config_native  # ✅ Ensure all values are JSON serializable
            }
        }

    @staticmethod
    def get_configspace():
        cs = CS.ConfigurationSpace()
        cs.add(CS.UniformFloatHyperparameter("dropout", lower= float(0.2), upper= float(0.5)))
        cs.add(CS.CategoricalHyperparameter("num_filters", choices=[16, 32, 64]))
        cs.add(CS.UniformFloatHyperparameter("lr", lower=float(0.0001), upper=float(0.01)))
        # print('CS')
        return cs

In [32]:
def get_configspace():
    config_space = {
        "dropout": {"type": "float", "lower": 0.2, "upper": 0.5},
        "num_filters": {"type": "categorical", "choices": [16, 32, 64]},
        "lr": {"type": "float", "lower": 0.0001, "upper": 0.01}
    }
    return config_space


In [33]:


def sample_hyperparameters(config_space):
    sampled_config = {}
    for param, details in config_space.items():
        if details["type"] == "float":
            sampled_config[param] = random.uniform(details["lower"], details["upper"])
        elif details["type"] == "categorical":
            sampled_config[param] = random.choice(details["choices"])
    return sampled_config


In [34]:
config_space = get_configspace()
sampled_config = sample_hyperparameters(config_space)
print("Sampled Hyperparameters:", sampled_config)


Sampled Hyperparameters: {'dropout': 0.426433011998054, 'num_filters': 64, 'lr': 0.0015779085528022578}


In [35]:
# Set up BOHB optimization
run_id = "bohb_hpo"
# result_logger = hpres.json_result_logger(directory="./bohb_results", overwrite=True)

# Start Nameserver for BOHB
NS = hpns.NameServer(run_id=run_id, host="localhost", port=0)
NS.start()

# Start BOHB Worker
worker = CNNWorker(run_id=run_id, dataset=dataset, nameserver="localhost", nameserver_port=NS.port)
worker.run(background=True)

print(CNNWorker.get_configspace())

# Run BOHB Optimization
bohb = BOHB(
    configspace=sampled_config,
    run_id=run_id,
    nameserver="localhost",
    nameserver_port=NS.port,
    min_budget= int(1),  # Minimum epochs per trial
    max_budget= int(5)  # Maximum epochs per trial
    # result_logger=result_logger
)

res = bohb.run(n_iterations= int(2))  # Number of trials

# # Shutdown Nameserver and Worker
# bohb.shutdown(shutdown_workers=True)
# NS.shutdown()

# # Get Best Hyperparameters
# best_config = res.get_incumbent_id()
# best_accuracy = -res.get_incumbent_trajectory()["losses"][-1]

# print(f"\n🔹 Best Model Config: {res.get_id2config_mapping()[best_config]['config']}")
# print(f"✅ Best Accuracy: {best_accuracy:.2f}%")

Configuration space object:
  Hyperparameters:
    dropout, Type: UniformFloat, Range: [0.2, 0.5], Default: 0.35
    lr, Type: UniformFloat, Range: [0.0001, 0.01], Default: 0.00505
    num_filters, Type: Categorical, Choices: {16, 32, 64}, Default: 16



13:01:23 WORKER: Connected to nameserver <Pyro4.core.Proxy at 0x17aae7fd0; connected IPv4; for PYRO:Pyro.NameServer@localhost:65011>
13:01:23 WORKER: No dispatcher found. Waiting for one to initiate contact.
13:01:23 WORKER: start listening for jobs


AttributeError: 'dict' object has no attribute 'get_hyperparameters'

## Distributed Hyperparameter Optimization (HPO) with Ray Tune

To improve the efficiency of hyperparameter tuning, we implement a distributed HPO strategy using Ray Tune. Ray Tune supports various search algorithms, including Bayesian Optimization, Genetic Algorithms, and Asynchronous Successive Halving (ASHA), making it ideal for large-scale HPO tasks.

### 1. Search Space Definition

We define the search space for a deep learning model as follows:

* **Learning rate:** Log-uniform distribution between 1e-5 and 1e-1.
* **Batch size:** Categorical values [16, 32, 64].
* **Dropout rate:** Uniform distribution between 0.1 and 0.5.
* **Number of layers:** Integer values between 2 and 5.

### 2. Parallel Execution using Ray

Ray Tune enables parallel trial execution across multiple nodes and GPUs:

* Configure a Ray cluster for multi-node execution.
* Utilize BOHB (Bayesian Optimization HyperBand) for sample efficiency and effective exploration.
* Employ ASHA (Asynchronous Successive Halving) for dynamic early stopping, reducing unnecessary computation.

### 3. Adaptive Scheduling and Resource Allocation

Ray Tune dynamically reallocates resources to the most promising trials:

* Trials demonstrating poor performance are stopped early using ASHA.
* Bayesian models refine the search space, guiding the search towards better configurations.

### 4. Implementation in Ray


In [14]:
%pip install 'ConfigSpace<0.5.0'
%pip install --upgrade ray
%pip show ray
%pip install --upgrade ConfigSpace



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Name: ray
Version: 2.43.0
Summary: Ray provides a simple, universal API for building distributed applications.
Home-page: https://github.com/ray-project/ray
Author: Ray Team
Author-email: ray-dev@googlegroups.com
License: Apache 2.0
Location: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages
Requires: aiosignal, 

[31mERROR: Could not find a version that satisfies the requirement OptunaSearch (from versions: none)[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
[31mERROR: No matching distribution found for OptunaSearch[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [12]:
import ray
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import Subset, DataLoader
import numpy as np
from ray.tune.search.optuna import OptunaSearch 
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
import torch.nn as nn
import torch.optim as optim

ray.shutdown()
ray.init()

# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load full MNIST dataset
full_trainset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
full_testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Select 1000 random indices for train and test sets
train_indices = np.random.choice(len(full_trainset), 10000, replace=False)
test_indices = np.random.choice(len(full_testset), 10000, replace=False)

# Create subsets of MNIST
trainset = Subset(full_trainset, train_indices)
testset = Subset(full_testset, test_indices)

# Loaders using Ray
def load_data(config):
    trainloader = DataLoader(trainset, batch_size=config["batch_size"], shuffle=True)
    testloader = DataLoader(testset, batch_size=config["batch_size"], shuffle=False)
    return trainloader, testloader



# Search space
search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "batch_size": tune.choice([16, 32, 64]),
    "dropout": tune.uniform(0.1, 0.5),
    "num_layers": tune.randint(2, 5)
}

# ASHA scheduler
scheduler = ASHAScheduler(
    max_t=10,
    grace_period=1,
    reduction_factor=2
)

# Optuna Search
search_alg = OptunaSearch(
    metric="loss",
    mode="min"
)

# Build model
def build_model(config):
    layers = [nn.Flatten()]
    for _ in range(config["num_layers"]):
        layers.append(nn.Linear(784, 128))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(config["dropout"]))
    layers.append(nn.Linear(128, 10))
    return nn.Sequential(*layers)

# Training function
def train(config):
    trainloader, testloader = load_data(config)
    model = build_model(config)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config["lr"])

    for epoch in range(10):
        model.train()
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in testloader:
                outputs = model(inputs)
                val_loss += criterion(outputs, labels).item()

        tune.report(loss=val_loss / len(testloader))

# Tuner configuration
tuner = tune.Tuner(
    train,
    tune_config=tune.TuneConfig(
        search_alg=search_alg,
        scheduler=scheduler,
        num_samples=20,
        metric="loss",
        mode="min"
    ),
    param_space=search_space
)

results = tuner.fit()

print("Best hyperparameters found were: ", results.get_best_result().config)
ray.shutdown()

0,1
Current time:,2025-03-19 13:40:32
Running for:,00:03:14.11
Memory:,10.5/16.0 GiB

Trial name,# failures,error file
train_d6ded41f,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_d6ded41f_1_batch_size=64,dropout=0.4183,lr=0.0004,num_layers=2_2025-03-19_13-37-18/error.txt"
train_36123d5d,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_36123d5d_2_batch_size=32,dropout=0.1812,lr=0.0559,num_layers=2_2025-03-19_13-37-29/error.txt"
train_d091e8a8,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_d091e8a8_3_batch_size=16,dropout=0.1581,lr=0.0000,num_layers=2_2025-03-19_13-37-36/error.txt"
train_3fd9885c,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_3fd9885c_4_batch_size=32,dropout=0.2121,lr=0.0007,num_layers=4_2025-03-19_13-37-44/error.txt"
train_76433452,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_76433452_5_batch_size=32,dropout=0.2642,lr=0.0000,num_layers=2_2025-03-19_13-37-52/error.txt"
train_b27887e9,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_b27887e9_6_batch_size=32,dropout=0.1350,lr=0.0008,num_layers=4_2025-03-19_13-37-59/error.txt"
train_232a0a43,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_232a0a43_7_batch_size=32,dropout=0.1621,lr=0.0000,num_layers=2_2025-03-19_13-38-07/error.txt"
train_13645290,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_13645290_8_batch_size=32,dropout=0.1800,lr=0.0021,num_layers=2_2025-03-19_13-38-16/error.txt"
train_9d2739f7,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_9d2739f7_9_batch_size=32,dropout=0.3169,lr=0.0467,num_layers=3_2025-03-19_13-38-26/error.txt"
train_bee3ac6d,1,"/tmp/ray/session_2025-03-19_13-37-11_935773_79443/artifacts/2025-03-19_13-37-18/train_2025-03-19_13-37-17/driver_artifacts/train_bee3ac6d_10_batch_size=64,dropout=0.1323,lr=0.0003,num_layers=4_2025-03-19_13-38-37/error.txt"

Trial name,status,loc,batch_size,dropout,lr,num_layers
train_d6ded41f,ERROR,127.0.0.1:83217,64,0.418345,0.000435215,2
train_36123d5d,ERROR,127.0.0.1:83223,32,0.181223,0.0559019,2
train_d091e8a8,ERROR,127.0.0.1:83232,16,0.158059,1.08679e-05,2
train_3fd9885c,ERROR,127.0.0.1:83249,32,0.212094,0.000657658,4
train_76433452,ERROR,127.0.0.1:83254,32,0.26418,2.15961e-05,2
train_b27887e9,ERROR,127.0.0.1:83259,32,0.134954,0.000795979,4
train_232a0a43,ERROR,127.0.0.1:83270,32,0.16208,1.41901e-05,2
train_13645290,ERROR,127.0.0.1:83280,32,0.18005,0.00214639,2
train_9d2739f7,ERROR,127.0.0.1:83286,32,0.316933,0.0467282,3
train_bee3ac6d,ERROR,127.0.0.1:83291,64,0.132297,0.000278507,4




2025-03-19 13:37:29,603	ERROR tune_controller.py:1331 -- Trial task failed for trial train_d6ded41f
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
       

RuntimeError: No best trial found for the given metric: loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.