<img src="./imgs/hpe_logo.png" alt="HPE Logo" width="300">

<h1>Determined.AI - PyTorch Hub Model Porting Activity</h1>

This exercise aims to port a model to run on the HPE Machine Learning Development Environment (Determined.AI) and train it on custom data. We will use a U-Net model for identifying tumors in brain MRI scans. The model is available on the PyTorch Model Hub <a href="https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/">here</a> and was created by mateuszbuda. 

For the porting, we will use the Determined.AI trial APIs, which give us access to advanced features such as checkpointing, metrics tracking, distributed training, and hyperparameter search. The cells below provide the code needed to complete the exercise. Please follow the instructions and copy the code blocks to the correct section in the model_def.py and const.yaml files in the experiments folder.

<b>Please make sure you only work on your files. Although this is a Jupyter Notebook, you will not be able to execute the code in the cells. Please copy it to the correct files as described.</b>


<h2>Introduction</h2>

Determined provides high-level framework APIs for PyTorch, Keras, and Estimators that let users describe their model without boilerplate code. Determined reduces boilerplate by providing a state-of-the-art training loop that supports distributed training, hyperparameter search, automatic mixed precision, reproducibility, and many more features.

<h3>Why use Determined.AI?</h3>

Running deep learning training workloads can be cumbersome and typically requires a lot of boilerplate code for the training harness.
That can include code for distributed training, automatic checkpointing, hyperparameter search, and metrics tracking, often comprising hundreds of lines of code.
In addition, training a new model (for example, from a public repository) will often require changes to the model code and the training harness, taking up the valuable time of researchers and engineers. 

Determined.AI can remove the burden of writing and maintaining a custom training harness and offers a streamlined approach to onboard new models to a state-of-the-art training platform, offering the following integrated platform features:

<img src="./imgs/det_components.jpg" alt="Determined Components" width="900">

<h3>Overview of this workshop</h3>

In this activity, we’ll walk through an example and provide helpful hints to organize PyTorch code into Determined’s PyTorchTrial API successfully. Once your code is in the PyTorchTrial format, you can easily take advantage of Determined.AI’s open-source platform.

While all codebases are different, code to perform deep learning training tends to follow a typical pattern. Usually, there is a model, optimizer, data, and learning rate scheduler. determined.pytorch.PyTorchTrial follows this pattern to reduce porting friction. To port a model from PyTorch Hub, we will copy the code that loads the model into the init method and then define the remaining methods in the template <b>(model_def.py, located in the experiments folder)</b> to get data, train, and validate the model. <b>Below is the current content of the template model_def.py:</b>

<h2>Step 1: init method in model_def.py</h2>

To get started, let's take a look at the code to load the U-Net model from PyTorch Hub. The original code can also be found <a href="https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/">here</a>.

In [None]:
model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 
                       'unet',
                       in_channels=3, 
                       out_channels=1, 
                       init_features=32, 
                       pretrained=True)

As with any Python class, the __init__ method is invoked to construct our trial class. Determined passes this method a single parameter, an instance of PyTorchTrialContext, which inherits from TrialContext. The trial context contains information about the trial, such as the values of the hyperparameters to use for training. All models and optimizers must be wrapped with wrap_model and wrap_optimizer, respectively, which are provided by PyTorchTrialContext. In this PyTorch Hub example, we replace the hard-coded arguments with calls such as self.context.get_hparam("init_features"), so that the values are read from the experiment configuration instead. We also add some code to load the data. For this workshop, data.py is provided, which contains the functions to load our data. Because the init method is also invoked when we later load a checkpoint of the model to make predictions, and loading the data is expected to fail in that case, we wrap the data loading in a try/except block and ignore the error.

Please open the __model_def.py file in the experiments folder__ and copy the below code to the __init__ function in the <b>model_def.py</b>.

In [None]:
self.config = self.context.get_data_config()

# Load the data sets. This is expected to fail when the trial is re-created
# from a checkpoint to make predictions, so the error is deliberately ignored.
try:
    self.train_dataset, self.val_dataset = data.get_train_val_datasets(self.config["data_dir"],
                                                                       self.context.get_hparam("split_seed"),
                                                                       self.context.get_hparam("validation_ratio"))
except Exception:
    pass

self.download_directory = torch.hub.get_dir()

# Create the download directory; exist_ok=True makes this a no-op
# if the directory already exists.
os.makedirs(self.download_directory, exist_ok=True)

with filelock.FileLock(os.path.join(self.download_directory, "download.lock")):
    model = torch.hub.load(self.config["repo"],
                           self.config["model"],
                           in_channels=self.context.get_hparam("input_channels"),
                           out_channels=self.context.get_hparam("output_channels"),
                           init_features=self.context.get_hparam("init_features"),
                           pretrained=self.context.get_hparam("pretrained"))

Then, please __open the const.yaml file in the experiments folder__ and look for the workspace and project fields. Please replace the placeholders with your workspace name and project name, which you created during the preparation session:

workspace: <your_workspace>
project: <your_project>

Next, find the data configuration and add the <b>"repo" and "model"</b> values from the original code above. Your data configuration should then look like this:

Still in the <b>const.yaml</b> file, find the hyperparameters section, and add <b>"input_channels", "output_channels", "init_features", "pretrained" </b> with the values from the original code above. Your hyperparameter configuration should then look like this:

Back in the <b>model_def.py</b>, wrap the model and optimizer as shown below by copying the code to the <b>init</b> method, just below the model definition.

In [None]:
self.model = self.context.wrap_model(model)
self.optimizer = self.context.wrap_optimizer(optim.Adam(self.model.parameters(),
                                                        lr=self.context.get_hparam("learning_rate"),
                                                        weight_decay=self.context.get_hparam("weight_decay")))
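To make the link between const.yaml and the get_hparam / get_data_config calls more concrete, here is a minimal, self-contained sketch. FakeContext is a hypothetical stand-in written only for this illustration; it is not part of Determined, and the real PyTorchTrialContext is supplied by the platform at run time. The dictionary contains only the keys and values taken from the original PyTorch Hub call; other keys used in this step (data_dir, learning_rate, and so on) are left out.

```python
# Hypothetical stand-in for the trial context, for illustration only.
class FakeContext:
    def __init__(self, experiment_config):
        self._config = experiment_config

    def get_data_config(self):
        # Mirrors the "data" section of the experiment configuration.
        return self._config["data"]

    def get_hparam(self, name):
        # Mirrors a lookup in the "hyperparameters" section.
        return self._config["hyperparameters"][name]


# Values copied from the original torch.hub.load call at the start of Step 1.
experiment_config = {
    "data": {
        "repo": "mateuszbuda/brain-segmentation-pytorch",
        "model": "unet",
    },
    "hyperparameters": {
        "input_channels": 3,
        "output_channels": 1,
        "init_features": 32,
        "pretrained": True,
    },
}

context = FakeContext(experiment_config)
config = context.get_data_config()

# Build the keyword arguments the same way the __init__ code does.
hub_kwargs = {
    "in_channels": context.get_hparam("input_channels"),
    "out_channels": context.get_hparam("output_channels"),
    "init_features": context.get_hparam("init_features"),
    "pretrained": context.get_hparam("pretrained"),
}

print(config["repo"], config["model"])
print(hub_kwargs)
```

Changing a value in the configuration then changes the model that is built, without touching model_def.py.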

<h2>Step 2: Custom metric IoU (Intersection over Union)</h2>

Determined allows for any custom training and validation metric to be used. For this use case, IoU (Intersection over Union) is the appropriate metric as it quantifies the degree of overlap between the ground truth and the prediction. To use IoU as our training and validation metric, we <b>define the iou method</b> as shown below and copy it to the <b>model_def.py</b> file. Please make sure that you define it <b>outside of the init method</b> but <b>inside of the MRIUnetTrial class</b>. We will later call the iou method from the train_batch and evaluate_batch methods.

In [None]:
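def iou(self, pred, label):
    # pred and label are binary masks (0/1) of the same shape.
    intersection = (pred * label).sum()
    union = pred.sum() + label.sum() - intersection
    # Both masks empty: count as a perfect match and avoid dividing by zero.
    if pred.sum() == 0 and label.sum() == 0:
        return 1
    return intersection / union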
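As a quick worked example of the metric, the sketch below applies the same formula to small, flat binary masks using plain Python lists instead of tensors. The function iou_lists and the sample masks are made up for illustration.

```python
def iou_lists(pred, label):
    """IoU for two flat binary masks given as lists of 0/1 values."""
    intersection = sum(p * l for p, l in zip(pred, label))
    union = sum(pred) + sum(label) - intersection
    # Same special case as in the trial class: two empty masks match perfectly.
    if sum(pred) == 0 and sum(label) == 0:
        return 1.0
    return intersection / union


pred = [0, 1, 1, 1, 0, 0, 0, 0]
label = [0, 0, 1, 1, 1, 0, 0, 0]

# Two pixels overlap; 3 + 3 - 2 = 4 pixels are covered by either mask.
print(iou_lists(pred, label))          # 0.5
print(iou_lists([0, 0], [0, 0]))       # 1.0 (both masks empty)
print(iou_lists([1, 0], [0, 1]))       # 0.0 (no overlap)
```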

<h2>Step 3: Data Loaders</h2>

The next two methods we need to define are <b>build_training_data_loader</b> and <b>build_validation_data_loader</b>. Determined uses these methods to load the training and validation datasets, respectively. Both methods should return a determined.pytorch.DataLoader, which is very similar to torch.utils.data.DataLoader. All we have to do is provide the dataset and the batch size, and define whether we want to shuffle the data (True for training; False, the default, for validation). To ensure scalability, we will also define the number of workers to use for the data loaders. Once again, we set this as a hyperparameter (num_workers) that we can get from the const.yaml.

Copy the two methods below and place them <b>inside of the MRIUnetTrial class</b> in the <b>model_def.py</b> file. (Note, there is a placeholder for both methods in the model_def.py file. You can replace the placeholders with the below code.)

In [None]:
def build_training_data_loader(self):
    return DataLoader(self.train_dataset, batch_size=self.context.get_per_slot_batch_size(), shuffle=True, num_workers=self.context.get_hparam("num_workers"))

def build_validation_data_loader(self):
    return DataLoader(self.val_dataset, batch_size=self.context.get_per_slot_batch_size(), num_workers=self.context.get_hparam("num_workers"))
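Both loaders ask the context for the per-slot batch size rather than hard-coding one. As a rough sketch of the arithmetic, assuming the experiment configuration defines a global_batch_size hyperparameter (not shown in this section) and that the per-slot value is the global batch size divided evenly across the slots (GPUs) of a trial, the numbers work out as follows. The helper function and the value 32 are hypothetical.

```python
def per_slot_batch_size(global_batch_size, slots_per_trial):
    # Assumption: the global batch is split evenly across all slots.
    return global_batch_size // slots_per_trial


global_batch_size = 32  # made-up value for illustration

for slots in (1, 2):
    print(slots, "slot(s):", per_slot_batch_size(global_batch_size, slots), "samples per GPU")
```

Under this assumption, the same model_def.py works unchanged when the number of slots changes, because each data loader simply receives a smaller share of the global batch.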

<h2>Step 4: Train Batch</h2>

With our metric and data in place, we can now move on to training. 

The train_batch() method is passed a single batch of data from the training data set; it should run the forward passes on the models, the backward passes on the losses, and step the optimizers. This method should return a dictionary with user-defined training metrics, in this case the loss and IoU; Determined will automatically average all the metrics across batches. If an optimizer is set to automatically handle zeroing out the gradients, step_optimizer will zero out the gradients and there will be no need to call optim.zero_grad().

<b>The code below does the following:</b>
- unpacks our batch into imgs (features) and masks (labels)
- feeds the imgs to the model
- calculates the loss (based on the predictions and the labels (masks))
- runs the backward pass on the model using the loss 
- steps the optimizer
- calculates the iou training metric 
- returns the iou metric alongside the loss for the batch

Copy the <b>train_batch</b> method below and place it <b>inside of the MRIUnetTrial class</b> in the <b>model_def.py</b> file. (Note, there is a placeholder for the train_batch method in the model_def.py file. You can replace the placeholder with the below code.)

In [None]:
def train_batch(self, batch: TorchData, epoch_idx: int, batch_idx: int):
    imgs, masks = batch
    output = self.model(imgs)
    loss = torch.nn.functional.binary_cross_entropy(output, masks)
    self.context.backward(loss)
    self.context.step_optimizer(self.optimizer)
    iou = self.iou((output>0.5).int(), masks)
    return {"loss": loss, "IoU": iou}
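To show what the loss and the thresholding step compute, here is a small pure-Python sketch on four made-up pixel values. It assumes the model outputs are probabilities strictly between 0 and 1, which binary cross-entropy requires, and that the loss is averaged over all elements. The helper names are hypothetical and this is not how PyTorch implements the function internally.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        # Penalize low probability on positive pixels and high probability on negatives.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def threshold(probs, cutoff=0.5):
    # Same idea as (output > 0.5).int(): turn probabilities into a binary mask.
    return [1 if p > cutoff else 0 for p in probs]


probs = [0.9, 0.2, 0.6, 0.1]   # made-up model outputs
masks = [1, 0, 0, 0]           # made-up ground truth

loss = binary_cross_entropy(probs, masks)
pred_mask = threshold(probs)

print(round(loss, 4))
print(pred_mask)               # [1, 0, 1, 0]
```

The third pixel (0.6 for a background pixel) contributes most of the loss and also becomes a false positive after thresholding, which lowers the IoU.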

<h2>Step 5: Evaluate Batch</h2>

The evaluate_batch() method is passed a single batch of data from the validation data set; it should compute the user-defined validation metrics on that data (IoU for this example) and return them as a dictionary that maps metric names to values. The metric values for each batch are reduced (aggregated) to produce a single value of each metric for the entire validation set. By default, metric values are averaged, but this behavior can be customized by overriding evaluation_reducer().

<b>The code below does the following:</b>
- unpacks our batch into imgs (features) and masks (labels)
- feeds the imgs to the model
- calculates the validation loss (based on the predictions and the labels (masks))
- calculates the iou validation metric 
- returns the iou metric alongside the loss for the batch

Copy the <b>evaluate_batch</b> method below and place it <b>inside of the MRIUnetTrial class</b> in the <b>model_def.py</b> file. (Note, there is a placeholder for the evaluate_batch method in the model_def.py file. You can replace the placeholder with the below code.)

In [None]:
def evaluate_batch(self, batch: TorchData):
    imgs, masks = batch
    output = self.model(imgs)
    loss = torch.nn.functional.binary_cross_entropy(output, masks)
    iou = self.iou((output>0.5).int(), masks)
    return {"val_loss": loss, "val_IoU": iou}
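The default reduction is a plain average of the per-batch values. The sketch below, with made-up masks and a list-based IoU helper, shows that this average can differ from the IoU computed once over all validation pixels pooled together, which is one reason you might want a custom reducer. It only illustrates the arithmetic and does not use the Determined API.

```python
def iou_lists(pred, label):
    """IoU for two flat binary masks given as lists of 0/1 values."""
    intersection = sum(p * l for p, l in zip(pred, label))
    union = sum(pred) + sum(label) - intersection
    if sum(pred) == 0 and sum(label) == 0:
        return 1.0
    return intersection / union


# Three made-up validation batches as (prediction, label) pairs.
batches = [
    ([1, 1, 0, 0], [0, 0, 1, 0]),  # no overlap
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # both empty
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # perfect match
]

# Default-style reduction: average the per-batch metric values.
per_batch = [iou_lists(pred, label) for pred, label in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Alternative: pool all pixels first, then compute a single IoU.
all_pred = [v for pred, _ in batches for v in pred]
all_label = [v for _, label in batches for v in label]
pooled = iou_lists(all_pred, all_label)

print(per_batch)                  # [0.0, 1.0, 1.0]
print(round(mean_of_batches, 4))  # 0.6667
print(round(pooled, 4))           # 0.5714
```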

In the <b>const.yaml</b> file, find the searcher section, and <b>change the metric from val_loss to val_IoU</b> to use our custom validation metric we return from the evaluate_batch method.

<h2>Step 6: Final experiment configuration items</h2>

We are now done working on the model_def.py file. However, we should add a few configuration items to our <b>const.yaml</b> to submit our first experiment. 

First, let's enable profiling to keep track of the experiment performance and hardware utilization. All we have to do is set <b>profiling enabled to True</b>. Use the code below and add it to your <b>const.yaml</b>. 

Lastly, let's add the <b>resources</b> section to the <b>const.yaml</b> file and specify the number of slots (GPUs) and the resource pool we will use for this experiment. For this first experiment, please use one (1) slot and specify <b>the resource pool that was assigned to you at the beginning of the workshop.</b> Please copy the below code and paste it to the end of the __const.yaml file and specify the resource pool that was assigned to you.__

<h2>Step 7: (Optional) Full model_def.py and const.yaml files for your reference</h2>

In case you have any difficulties with your code (now or going forward), you can copy the full completed model_def.py and const.yaml below to continue with the workshop. <b>Please just make sure to replace the placeholders in const.yaml with your values.</b>

<h3>Reference model_def.py</h3>

<h3>Reference const.yaml</h3>

<h2>Step 8: Launch your first Determined.AI Experiment</h2>

We now have all files and configurations ready to launch the experiment. To do so, we can use the Determined CLI, which has been installed in this Jupyter environment for your convenience. Because the Jupyter Notebook is running on Determined, and we pass the Determined cluster context to the notebook, you can directly interact with the cluster using the CLI without logging in or authenticating.

Please <b>execute the below cell</b> to launch the const.yaml experiment.

In [None]:
!det e create ./experiments/const.yaml ./experiments/

Up <b>here ^</b>, you should see a confirmation saying that the experiment has been created. Switch back to the Determined.AI WebGUI, and browse to your workspace/project to find your experiment. You can observe the training and validation metrics, the checkpoints, the profiling, and the experiment logs as it runs.

<h2>Step 9: Launch a distributed training Experiment</h2>

With Determined, going from a single GPU training job to a multi-GPU distributed training job is as easy as changing a simple configuration line. There is no need to worry about setting up frameworks like Horovod or PyTorch Lightning.

Let's copy the <b>const.yaml</b> file and name the copy <b>distributed.yaml</b>. Open the <b>distributed.yaml</b> file. First, look for the name field and <b>change it from MRI-constant-1GPU to MRI-constant-2GPU</b> to indicate the distributed training job. Then, look for the resources section. Change the <b>slots_per_trial field from 1 to 2</b> to run a distributed training job on 2 GPUs. Save the file and <b>execute the below cell.</b>

In [None]:
!det e create ./experiments/distributed.yaml ./experiments/

Up <b>here ^</b>, you should see a confirmation saying that the experiment has been created. Switch back to the Determined.AI WebGUI, and browse to your workspace/project to find your experiment. You can observe the training and validation metrics, the checkpoints, the profiling, and the experiment logs as it runs. <b>Notice how you can see the two GPUs in the Profiler tab, and the different ranks in the Logs tab.</b>

<h2>Step 10: Launch a hyperparameter search experiment</h2>

The first step toward automatic hyperparameter tuning is to define the hyperparameter space, e.g., by listing the decisions that may impact model performance. We can specify a range of possible values in the experiment configuration for each hyperparameter in the search space.
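As a rough illustration of what a search space is, the sketch below describes four hyperparameters with the types int, double, log, and categorical and draws one random combination from them. The ranges are made up, the field names (minval, maxval, base, vals) follow the Determined configuration format as we understand it, and the sampling function is a simplification for illustration rather than the searcher's actual implementation.

```python
import random

# Hypothetical search space; all ranges are made up for illustration.
search_space = {
    "learning_rate": {"type": "log", "base": 10, "minval": -5, "maxval": -2},
    "weight_decay": {"type": "double", "minval": 0.0, "maxval": 0.01},
    "init_features": {"type": "int", "minval": 16, "maxval": 64},
    "pretrained": {"type": "categorical", "vals": [True, False]},
}


def sample(spec, rng):
    """Draw one value from a single hyperparameter specification."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["minval"], spec["maxval"])
    if kind == "double":
        return rng.uniform(spec["minval"], spec["maxval"])
    if kind == "log":
        # minval and maxval are exponents: the value lies in [base**minval, base**maxval].
        return spec["base"] ** rng.uniform(spec["minval"], spec["maxval"])
    if kind == "categorical":
        return rng.choice(spec["vals"])
    raise ValueError(f"unknown hyperparameter type: {kind}")


rng = random.Random(0)
trial_hparams = {name: sample(spec, rng) for name, spec in search_space.items()}
print(trial_hparams)
```

Each trial of a hyperparameter search corresponds to one such combination.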

To do this, copy the <b>distributed.yaml</b> file and name the copy <b>adaptive.yaml</b>. Open the <b>adaptive.yaml</b> file, change the <b>name from MRI-constant-2GPU to MRI-adaptive-1GPU </b> and then look for the hyperparameters section. In that section, we have to change the static hyperparameter values to ranges and specify the type (int, double, log, categorical). You can use the configuration in the below cell as a starting point. Copy the hyperparameter section from the cell below and use it to replace the hyperparameter section in your <b>adaptive.yaml</b>.

To tell the Determined.AI master that we want to search over the defined hyperparameter space, we have to <b>change the searcher from single to adaptive_asha</b>, which is the state-of-the-art search algorithm implementing early stopping. We will also tell Determined how many different combinations we would like to explore. Please make sure that the searcher name and max_trials are specified as below in your adaptive.yaml file under searcher:

Because we are using a shared cluster with all the workshop participants, please <b>change the slots_per_trial value from 2 to 1</b>, as otherwise you would be requesting 14 GPUs just for your experiment alone.<br>
Below is the correct resources section for this experiment. You can copy it and simply specify your compute pool again.

Save the file and <b>execute the below cell to run the experiment. Please note the experiment ID as you will need it for the next exercise.</b>

In [None]:
!det e create ./experiments/adaptive.yaml ./experiments/

Up <b>here ^</b>, you should see a confirmation saying that the experiment has been created. Switch back to the Determined.AI WebGUI, and browse to your workspace/project to find your experiment. <b>Please note the experiment ID for the next exercise.</b>

<b>Notice how the experiment overview changed.</b> You now have a new tab in the experiment overview called "Trials". Each trial represents a chosen combination of hyperparameters. Under "Visualization," you can see different plots showing how the various trials are doing relative to each other and any potential relationships between the hyperparameters. If you want to look at a specific trial, you can click on it and see particular information (Overview, chosen hyperparameters, Profiler, Logs, etc.) of that trial.

<b>Notice how poorly performing trials are getting stopped.</b> Determined.AI uses the state-of-the-art adaptive ASHA algorithm based on Hyperband. It implements the principle of early stopping: trials (hyperparameter combinations) that are doing poorly are terminated, while trials that are doing well are extended. This lets Determined balance resource utilization, time, and explored hyperparameter space. For more details on adaptive ASHA, visit our documentation <a href="https://docs.determined.ai/latest/training/hyperparameter/search-methods/hp-adaptive-asha.html?highlight=adaptive%20asha">here</a>.
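The early-stopping idea can be sketched with plain successive halving, the building block behind Hyperband-style methods. The code below is a simplified, synchronous toy version with made-up trial qualities and a made-up scoring function. It is not Determined's adaptive ASHA implementation, which runs asynchronously and manages budgets differently.

```python
def successive_halving(trials, evaluate, min_budget=1, divisor=2, rungs=3):
    """Train all trials briefly, keep the best fraction, and repeat with more budget."""
    survivors = list(trials)
    budget = min_budget
    history = []  # (budget, number of trials evaluated) per rung
    for _ in range(rungs):
        # Higher score is better here, as for a metric such as IoU.
        scored = sorted(survivors, key=lambda t: evaluate(t, budget), reverse=True)
        history.append((budget, len(scored)))
        keep = max(1, len(scored) // divisor)
        survivors = scored[:keep]   # poorly performing trials are stopped
        budget *= divisor           # surviving trials get a larger budget
    return survivors, history


# Made-up trials: "quality" stands in for how good a hyperparameter combination is.
qualities = [0.3, 0.8, 0.5, 0.6, 0.2, 0.9, 0.4, 0.7]
trials = [{"id": i, "quality": q} for i, q in enumerate(qualities)]


def toy_validation_score(trial, budget):
    # Toy learning curve: the score approaches the trial's quality as the budget grows.
    return trial["quality"] * (1 - 0.5 ** budget)


winners, history = successive_halving(trials, toy_validation_score)
print(history)   # [(1, 8), (2, 4), (4, 2)]
print(winners)   # [{'id': 5, 'quality': 0.9}]
```

Most of the budget ends up spent on the few trials that look promising early, which is the behavior you can observe in the Trials tab.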