# Explore hyperparameter tuning with Ablator

In this demo, we train a LeNet model on the MNIST dataset with Ablator in a Colab environment. After that, we use Ablator's multi-processing feature to tune different hyperparameters for our model and compare the results.

## Prerequisites

Before we start, please enable the GPU in Colab: select the `Change Runtime type` option from the Colab toolbar and choose `GPU` as the hardware accelerator.

To start with, we install Ablator from GitHub into our Colab workspace:

In [1]:
try:
    import ablator
except ImportError:
    !pip install git+https://github.com/fostiropoulos/ablator.git@v0.0.1-misc-fixes
    print("Stopping RUNTIME! Please run again.")
    import os

    os.kill(os.getpid(), 9)


Then we import all the dependencies.

Please note: because of some package version conflicts, the installation process is separated from the rest of the notebook (the runtime must be restarted after installing). We will fix this problem later.

In [2]:
from pathlib import Path
from typing import Callable, Dict

import torch
import torchvision
from sklearn.metrics import accuracy_score
from torch import nn
from torchvision import transforms

In [3]:
from ablator import (
    ModelConfig,
    ModelWrapper,
    RunConfig,
    configclass,
    Literal,
    ParallelConfig,
    ParallelTrainer,
)

## Set up basic configurations

The first step in using Ablator is to set up the configurations, including the model configuration and the training configuration.

Since we are running Ablator on Colab, we define the parameters inline and use **NO** configuration files.

In [4]:
# Customized model config subclass, inheriting from ModelConfig base class
@configclass
class LenetConfig(ModelConfig):
    # Configurable attributes
    name: Literal["lenet5"]


# Customized Run config subclass, inheriting from RunConfig base class
@configclass
class LenetRunConfig(RunConfig):
    model_config: LenetConfig


# Customized Parallel Training config class, inheriting from ParallelConfig base class
@configclass
class MyParallelConfig(ParallelConfig):
    model_config: LenetConfig

Then we create an object for each of the necessary configuration classes and fill in the configuration values.

In this demo, we use these configuration objects:

*   `TrainConfig`: Training parameters, including the dataset, number of epochs, batch size, optimizer, etc.
*   `ModelConfig`: Specifies the model class we are going to try.
*   `ParallelConfig`: Experiment parameters, including the directory, the other configuration objects, the number of trials, the hyperparameter search space and hardware-related parameters.

In [5]:
from ablator import OptimizerConfig, TrainConfig
from ablator.config.hpo import SearchSpace

# Define the training configuration object
train_config = TrainConfig(
    dataset="mnist",
    batch_size=64,
    epochs=10,
    scheduler_config=None,
    optimizer_config=OptimizerConfig(
        name="sgd", arguments={"lr": 0.001, "momentum": 0.1}
    ),
)

# Define the model configuration object
model_config = LenetConfig(name="lenet5")

# Define the Main parallel running configuration object
run_config = MyParallelConfig(
    train_config=train_config,
    model_config=model_config,
    metrics_n_batches=200,
    total_trials=5,
    concurrent_trials=5,
    optim_metrics={"val_loss": "min"},
    optim_metric_name="val_loss",
    gpu_mb_per_experiment=1024,
    device="cuda",
    search_space={
        "train_config.optimizer_config.arguments.momentum": SearchSpace(
            value_range=("0.01", "0.1"), value_type="float"
        )
    },
)

### Hyperparameter tuning

In this demo, we train the model with different `momentum` values for the `SGD` optimizer. To do this, we set the following parameters in the configuration object above:

- `search_space`: The hyperparameters to vary, each given by its name (the dotted path into the configuration), value range and value type. Ablator generates values for each hyperparameter within that range, guided by the search algorithm and the optimization metric.
- `total_trials`: The number of trials to run, each with different hyperparameter values.
- `device`: The hardware used to run the experiments.

Please refer to the Ablator documentation for more information on how to set the configurations.
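
To make the idea of a search space concrete, here is a minimal sketch of drawing one `momentum` value per trial from the range above. It uses plain uniform random sampling as a stand-in and is not Ablator's actual sampler; the helper name `sample_momentum_trials` and the `seed` argument are hypothetical.

```python
import random


def sample_momentum_trials(low, high, total_trials, seed=0):
    """Draw one momentum value per trial, uniformly from [low, high].

    Illustration only: Ablator may use a different sampling strategy.
    """
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(total_trials)]


# Same range and trial count as the configuration above.
trial_values = sample_momentum_trials(0.01, 0.1, total_trials=5)
for trial_id, momentum in enumerate(trial_values):
    print(f"trial {trial_id}: momentum={momentum:.4f}")
```

Each trial then trains the same model with its own sampled value, which is what lets the runs be compared afterwards.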

## Set up the model and datasets

After creating our configurations, we can proceed to create our customized model and datasets.

### Model implementations

First, we define our customized model class. In this demo, we use a LeNet-5 model that we define ourselves, building each layer from PyTorch components.

In [6]:
# Customized Model class is defined here, where the model structure, forward pass
# and loss function are defined


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = self.relu3(self.fc1(x))
        x = self.relu4(self.fc2(x))
        x = self.fc3(x)
        return x


class MyModel(nn.Module):
    def __init__(self, config: LenetConfig) -> None:
        super().__init__()
        self.model = SimpleCNN()
        self.loss = nn.CrossEntropyLoss()
        # self.optimizer = optim.SGD(self.model.parameters(), lr=0.001, momentum=0.9)

    def forward(self, x, labels, custom_input=None):
        # custom_input is for demo purposes only, defined in the dataset wrapper
        out = self.model(x)
        # Labels are always provided by the MNIST dataloaders, so the loss
        # is computed once here (the original computed it twice).
        loss = self.loss(out, labels)

        # Convert logits to predicted class indices
        out = out.argmax(dim=-1)
        return {"y_pred": out[:, None], "y_true": labels[:, None]}, loss
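
The input size `16 * 4 * 4` of `fc1` follows from how each layer shrinks a 28x28 MNIST image. The short sketch below traces that arithmetic in plain Python, assuming the default stride 1 and padding 0 used by the `nn.Conv2d` layers above; the helper `conv_out` is a hypothetical name introduced for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                      # MNIST images are 28x28
side = conv_out(side, 5)       # conv1, 5x5 kernel -> 24
side = conv_out(side, 2, 2)    # pool1, 2x2 window, stride 2 -> 12
side = conv_out(side, 5)       # conv2, 5x5 kernel -> 8
side = conv_out(side, 2, 2)    # pool2, 2x2 window, stride 2 -> 4
flat_features = 16 * side * side  # conv2 produces 16 channels
print(side, flat_features)
```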

### Datasets implementations

Then, we import the MNIST dataset and create dataloaders from it.

In [7]:
# Create the training & validation dataloaders from the MNIST dataset.
# Also, data preprocessing is defined here, including normalization and other transformations

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)

trainset = torchvision.datasets.MNIST(
    root="./datasets", train=True, download=True, transform=transform
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True, num_workers=2
)

testset = torchvision.datasets.MNIST(
    root="./datasets", train=False, download=True, transform=transform
)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=64, shuffle=False, num_workers=2
)
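
As a small worked example of the preprocessing above: `ToTensor` scales pixel values to the range [0, 1], and `Normalize((0.5,), (0.5,))` then applies `(x - mean) / std` to each pixel, which maps that range to [-1, 1]. The sketch below reproduces the arithmetic in plain Python; the function name `normalize` is only for illustration.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-pixel operation applied by Normalize: (x - mean) / std."""
    return (pixel - mean) / std


# Darkest, middle and brightest pixel values after ToTensor.
for value in (0.0, 0.5, 1.0):
    print(value, "->", normalize(value))
```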

### Evaluation function implementations

We also define an evaluation function used during training and model evaluation. We use the accuracy score from the sklearn package as our metric.

In [8]:
def my_accuracy(y_true, y_pred):
    return accuracy_score(y_true.flatten(), y_pred.flatten())
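
To illustrate what this metric measures, the sketch below computes the same quantity, the fraction of predictions that match the labels, in plain Python. The labels are made-up toy values, and `fraction_correct` is a hypothetical stand-in for `accuracy_score`, not the sklearn implementation.

```python
def fraction_correct(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy column vectors of shape (N, 1), like those returned by MyModel.forward.
y_true = [[7], [2], [1], [0]]
y_pred = [[7], [2], [4], [0]]

# Flatten to one label per sample before comparing, as my_accuracy does.
flat_true = [row[0] for row in y_true]
flat_pred = [row[0] for row in y_pred]
score = fraction_correct(flat_true, flat_pred)
print(score)
```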

### Final Wrap-up

As a last step, we wrap the model, datasets and evaluation functions into a wrapper class that inherits from the `ModelWrapper` base class.

In [9]:
# Custom Model Wrapper, extending ModelWrapper class from Ablator
class MyModelWrapper(ModelWrapper):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def make_dataloader_train(self, run_config: LenetRunConfig):  # type: ignore
        return trainloader

    def make_dataloader_val(self, run_config: LenetRunConfig):  # type: ignore
        return testloader

    def evaluation_functions(self) -> Dict[str, Callable]:
        return {"accuracy_score": my_accuracy}

## Launch Ablator

With the configurations and customizations finished, we can now launch Ablator to run the experiments.

The launching process follows these steps:

*   Create the target directory for results
*   Initialize the Ray environment
*   Run the experiments

In [10]:
# Create results directory
!mkdir -p working_dir

In [11]:
!ray stop

Did not find any active Ray processes.

In [12]:
# Debug to make sure your model can train just fine.

import shutil

EXPERIMENT_DIR = Path.cwd().joinpath("experiment_dir")
shutil.rmtree(EXPERIMENT_DIR, ignore_errors=True)
run_config.experiment_dir = None

wrapper = MyModelWrapper(
    model_class=MyModel,
)
wrapper.train(run_config)

2023-08-27 22:34:37: Metrics batch-limit 200 is larger than 20% of the train dataloader length 938. You might experience slow-down during training. Consider decreasing `metrics_n_batches`.
2023-08-27 22:34:37: Creating new model
2023-08-27 22:34:44: Evaluation Step [1] val_loss: 2.31e+00 val_accuracy_score: 0.161900 train_loss: 2.30e+00 best_iteration: 00000938 best_val_loss: 2.31e+00 current_epoch: 00000001 current_iteration: 00000938 epochs: 00000010 learning_rate: 0.001000 total_steps: 00009380 train_accuracy_score: nan
2023-08-27 22:34:44: val_loss: 2.31e+00 val_accuracy_score: 0.161900 train_loss: 2.30e+00 best_iteration: 00000938 best_val_loss: 2.31e+00 current_epoch: 00000001 current_iteration: 00000938 epochs: 00000010 learning_rate: 0.001000 total_steps: 00009380 train_accuracy_score: 0.185000
2023-08-27 22:34:48: Evaluation Step [2] val_loss: 2.30e+00 val_accuracy_score: 0.236400 train_loss: 2.29e+00 best_iteration: 00001876 best_val_loss: 2.30e+00 current_epoch: 000

{'val_loss': 2.2950439453125,
 'val_accuracy_score': 0.2364,
 'train_loss': 2.2866522515570367,
 'best_iteration': 1876,
 'best_val_loss': 2.2950439453125,
 'current_epoch': 2,
 'current_iteration': 1876,
 'epochs': 10,
 'learning_rate': 0.001,
 'total_steps': 9380,
 'train_accuracy_score': 0.275}
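
The numbers in the log above can be checked by hand. The sketch below recomputes the dataloader length, the total number of steps and the 20% threshold mentioned in the warning, using the MNIST training set size (60,000 images) and the batch size, epochs and `metrics_n_batches` from our configuration. It assumes the last partial batch is kept, which is the `DataLoader` default.

```python
import math

train_samples = 60_000   # size of the MNIST training split
batch_size = 64
epochs = 10
metrics_n_batches = 200

# The last, partially filled batch is kept, hence the ceiling.
batches_per_epoch = math.ceil(train_samples / batch_size)
total_steps = batches_per_epoch * epochs
warning_threshold = 0.2 * batches_per_epoch

print(batches_per_epoch, total_steps, warning_threshold)
print("warning expected:", metrics_n_batches > warning_threshold)
```

Setting `metrics_n_batches` below the threshold should avoid the warning.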

In [13]:
# Launch Ablator to run experiments


WORKING_DIRECTORY = Path.cwd().joinpath("working_dir")
# ParallelTrainer prepares and launches parallel training

wrapper = MyModelWrapper(
    model_class=MyModel,
)
shutil.rmtree(EXPERIMENT_DIR, ignore_errors=True)
run_config.experiment_dir = EXPERIMENT_DIR

ablator = ParallelTrainer(
    wrapper=wrapper,
    run_config=run_config,
)

# NOTE to run on a cluster you will need to start ray with `ray start --head` and pass ray_head_address="auto"
ablator.launch(working_directory=WORKING_DIRECTORY)

2023-08-27 22:35:00:  - No git repository was detected at /home/iordanis/ablator-tutorials/examples/working_dir. We recommend setting the working directory to a git repository to keep track of changes.
(FileLogger pid=2398935) 2023-08-27 22:35:00:  - No git repository was detected at /home/iordanis/ablator-tutorials/examples/working_dir. We recommend setting the working directory to a git repository to keep track of changes.
2023-08-27 22:35:00:  - Scheduling uid: 54a0_c111_2cd2
Parameters: 
	train_config.optimizer_config.arguments.momentum:(float)0.1->(float)0.020468748655217375
	experiment_dir:(str)/home/iordanis/ablator-tutorials/examples/experiment_dir->(str)/home/iordanis/ablator-tutorials/examples/experiment_dir/54a0_c111_2cd2
-----
(FileLogger pid=2398935) 2023-08-27 22:35:00:  - Scheduling uid: 54a0_c111_2cd2
(FileLogger pid=2398935) Parameters: 
(FileLogger pid=2398935) 	train_config.optimizer_config.argumen

KeyboardInterrupt: 

(54a0_c111_2cd2 pid=2399690) 2023-08-27 22:35:39: Evaluation Step [7] val_loss: 7.91e-01 val_accuracy_score: 0.780900 train_loss: 1.03e+00 best_iteration: 00006566 best_val_loss: 7.91e-01 current_epoch: 00000007 current_iteration: 00006566 epochs: 00000010 learning_rate: 0.001000 total_steps: 00009380 train_accuracy_score: 0.635000 [repeated 5x across cluster]
(54a0_c111_2cd2 pid=2399690) 2023-08-27 22:35:39: val_loss: 7.91e-01 val_accuracy_score: 0.780900 train_loss: 1.03e+00 best_iteration: 00006566 best_val_loss: 7.91e-01 current_epoch: 00000007 current_iteration: 00006566 epochs: 00000010 learning_rate: 0.001000 total_steps: 00009380 train_accuracy_score: 0.730000 [repeated 5x across cluster]
(f8f0_c111_1e93 pid=2399702) 2023-08-27 22:35:45: Evaluation Step [8] val_loss: 2.22e+00 val_accuracy_score: 0.510300 train_loss: 2.24e+00 best_iteration: 00007504 best_val_loss: 2.22e+00 current_epoch: 00000008 current_iteration: 000075

## Experiments results analysis

After running the experiments, the results are stored in the experiment directory, which this notebook sets to `experiment_dir` under the current working directory via `run_config.experiment_dir`. The results directory has the following structure:

```
- dir
    - experiment_<experiment_id>
        - <trial1_id>
          - best_checkpoints/
          - checkpoints/
          - dashboard/
          - config.yaml
          - metadata.json
          - results.json
          - train.log
        - <trial2_id>
        - <trial3_id>
        - ...
        - <experiment_id>_optuna.db
        - <experiment_id>_state.db
        - default_config.yaml
        - mp.log
```

To make use of the results, here is a more detailed explanation of these files and directories:

- `default_config.yaml`: the overall configuration for the model, training and hyperparameter tuning
- `train.log`: console information from the training process
- `results.json`: metrics of the model during and after the training process
- `config.yaml`: the specific configuration for each trial, including the trial's hyperparameters
- `checkpoints/`: directory that stores the training checkpoints and trained models
- `dashboard/`: directory that stores the metrics data for TensorBoard visualization
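
As a starting point for working with these files, the sketch below walks an experiment directory and collects the path of each trial's `results.json`, keyed by the trial directory name. It relies only on the layout shown above and does not parse the file contents, since their exact format is not described here; the helper name `find_trial_results` is hypothetical.

```python
from pathlib import Path


def find_trial_results(experiment_dir):
    """Map each trial directory name to its results.json path, if present."""
    root = Path(experiment_dir)
    if not root.is_dir():
        # Nothing to collect if the experiments have not been run yet.
        return {}
    return {
        path.parent.name: path
        for path in sorted(root.rglob("results.json"))
    }


# List the result files found under the experiment directory.
for trial_id, results_path in find_trial_results("experiment_dir").items():
    print(trial_id, results_path)
```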

In the following section, we use TensorBoard to visualize the results from the different trials.

### TensorBoard visualization

To use TensorBoard, we load the TensorBoard extension, point it at the experiment directory and launch it.

In [None]:
import tensorflow as tf

# Load the TensorBoard extension
%load_ext tensorboard

In [15]:
from tensorboard import notebook

# Load TensorBoard with multiple directories
notebook.start(f"--logdir {EXPERIMENT_DIR}")

Reusing TensorBoard on port 6006 (pid 2424198), started 0:00:09 ago. (Use '!kill 2424198' to kill it.)

### Results analysis

TensorBoard gives us a clear view of how our model performs under different hyperparameter values, specifically the momentum values of the SGD optimizer.

With the momentum set to `0.027` or `0.044`, the model achieves the best overall performance, on both the training set and the validation set. Both higher and lower momentum values may lead to poorer performance for our LeNet-5 model on the MNIST dataset.