# Getting Started with Determined, the Open-Source Deep Learning Training Platform - Lab 3
## Hyperparameter optimization with Determined without requiring any model code changes

In the previous part of the hands-on lab, you learned how to distribute a training task across multiple GPUs. In this section, you will look at another way an experiment can benefit from multiple GPUs: ***automatic model tuning***, also known as hyperparameter tuning or hyperparameter optimization (**HPO**). HPO searches for the best version of a model by running many training tasks (trials) on your dataset, using the Searcher algorithm and the hyperparameter ranges that you specify in the experiment configuration file. Determined then identifies the hyperparameter values that produce the best-performing model, as measured by a validation metric that you define in the experiment configuration file.

In this part of the lab, you will run an experiment with the same model code, but this time use Determined's hyperparameter optimization, which ML engineers typically use to improve model accuracy and to efficiently find the combination of hyperparameter values that yields the best-performing model. Here, the hyperparameters in the experiment configuration file are specified as ranges rather than fixed values, and the `Adaptive ASHA` searcher method explores the hyperparameter space to help you find the best hyperparameters for your model.

**With HPO, an experiment consists of multiple training tasks (trials)** running simultaneously on different GPUs. Each trial trains the same model code on the same dataset. However, each trial uses a different configuration of hyperparameters, **randomly** chosen by the Searcher from the ranges of values that you specified in the experiment configuration file.

For this part of the lab, the number of trials to run, the user-defined hyperparameter ranges, the searcher method, and the amount of data (batches or epochs) on which to train the model are defined in the experiment configuration file _adaptive.yaml_.

>Note: The _Adaptive ASHA_ searcher is a state-of-the-art method that is used to find effective hyperparameter settings within a predefined range of hyperparameter values.

***More about Hyperparameter optimization and Searcher methods supported by Determined can be found [here](https://docs.determined.ai/latest/training-hyperparameter/index.html#hyperparameter-tuning)***

### 1- Create an experiment to train multiple models as part of a hyperparameter search, using Determined hyperparameter optimization (HPO)

Let's run an experiment with the same model definition (same code), but this time leveraging Determined's hyperparameter optimization functionality using the _adaptive.yaml_ experiment configuration file. 

#### First, let's take a closer look at the experiment configuration file for HPO: adaptive.yaml

In [None]:
cat Code/adaptive.yaml

As you can see, this experiment configuration file is set up a bit differently from those of the previous experiments. The experiment configuration file _adaptive.yaml_ tells Determined to use ***adaptive_asha*** as the Searcher algorithm and specifies the range of values to explore for each hyperparameter. In the searcher section, the parameter ***max_trials*** sets the total number of trials that the experiment will create, that is, how many model configurations to explore. Each trial runs on one GPU because the resource parameter _slots_per_trial_ is not specified, so the default setting of _slots_per_trial=1_ is used.
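
For orientation, the sketch below writes a small configuration with the same overall shape to a temporary file and reads `max_trials` back. It is not the lab's actual _adaptive.yaml_: the hyperparameter names, their ranges, the batch size, and the `max_length` value are illustrative assumptions. Only the `adaptive_asha` searcher, `max_trials: 6`, and the `val_categorical_accuracy` metric are taken from this lab.

```shell
# Hypothetical HPO configuration, written to a temp file for illustration only.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
searcher:
  name: adaptive_asha
  metric: val_categorical_accuracy
  smaller_is_better: false
  max_trials: 6
  max_length:
    batches: 1000            # assumed value
hyperparameters:
  global_batch_size: 64      # assumed fixed value
  learning_rate:             # assumed name and range (10^-4 .. 10^-1)
    type: log
    base: 10
    minval: -4
    maxval: -1
  dropout:                   # assumed name and range
    type: double
    minval: 0.2
    maxval: 0.8
  dense_units:               # assumed name and values
    type: categorical
    vals: [64, 128, 256]
EOF

# Read back how many trials the searcher is allowed to create.
max_trials=$(awk '$1 == "max_trials:" {print $2}' "$cfg")
echo "max_trials=${max_trials}"
rm -f "$cfg"
```

Compare this shape with the output of `cat Code/adaptive.yaml` above: the parts to look for are the `searcher` block and the hyperparameters that are given as ranges instead of single values.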

>**Note:** The Adaptive ASHA method works best with many hundreds of trials. For the purpose of this hands-on lab, the maximum number of trials is set to 6.

#### Next, submit the experiment with the experiment configuration file _adaptive.yaml_:

In [None]:
#first define the DET_MASTER env variable:
masterUrl=$(kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}')
determined_master="http://${masterUrl}"
export DET_MASTER=${determined_master}
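
If the `kubectl` lookup above returns nothing, `DET_MASTER` ends up as a bare `http://` and later `det` commands cannot reach the master. A minimal sanity check, sketched here with a hypothetical helper name (`check_master_url`) and placeholder URLs, could look like this:

```shell
# Succeeds only if the URL has a non-empty host part after "http://".
check_master_url() {
  case "$1" in
    http://?*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Examples with placeholder values (not real endpoints):
check_master_url "http://gateway.example.com:10001" && echo "looks usable"
check_master_url "http://" || echo "empty host: the kubectl lookup returned nothing"
```

In the notebook you would call it as `check_master_url "${DET_MASTER}"` before submitting the experiment. The check only looks at the shape of the string; it does not contact the master.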

In [None]:
# Launch experiment to train the model with hyperparameter tuning (HPO)
det experiment create Code/adaptive.yaml Code

In the lab environment, each Kubernetes worker host has one GPU only. Therefore, each training task (trial) in the experiment will run on one GPU. 

>**Note:** In an environment with many multi-GPU devices, you could combine HPO and Distributed Training and assign more than one GPU to each trial by defining the parameter _slots_per_trial_ in the experiment configuration file, much like the Distributed Training experiment you examined earlier.
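
As a rough illustration of how the GPU count and the slots per trial interact, the arithmetic below estimates an upper bound on the number of trials that can run at the same time. The cluster size is a made-up value, not the size of the lab cluster, and the searcher may well schedule fewer trials than this bound at any given moment.

```shell
# Hypothetical cluster size; the lab's actual GPU count is not stated here.
total_gpus=4
slots_per_trial=1      # default when the setting is omitted
max_trials=6           # value used in this lab

# Integer division: how many trials fit on the cluster at the same time.
concurrent=$(( total_gpus / slots_per_trial ))
# Never more than the searcher will create.
if [ "$concurrent" -gt "$max_trials" ]; then concurrent=$max_trials; fi
echo "trials running at once (upper bound): ${concurrent}"
```

With `slots_per_trial=2` on the same hypothetical cluster, the bound drops to 2: each trial trains faster, but fewer hyperparameter configurations are explored in parallel.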

#### Using the command below, you will see that Determined Master has scheduled multiple trials for your experiment in the Kubernetes cluster, each of which will use its own GPU. The POD name (one per trial) for your experiment is in the form:

 _exp-\<experimentID\>-trial-\<TrialID\>-\<unique-name\>_

> Notice that each trial POD has been assigned a different trial ID, which means your experiment features multiple trials, each with a different set of hyperparameters.

> <font color="blue"> **Note:** As you are sharing the same Kubernetes resources with other participants, and depending on the number of concurrent experiments running, your training task PODs might be in the **Pending** state, waiting for GPU resources to become available in the Kubernetes cluster. You might need to wait a few minutes until other experiments complete for your training task PODs to become **Running**.</font>

In [None]:
kubectl get pods -n determinedai
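
On a shared cluster the pod list also contains other participants' trials. The sketch below filters pod names by experiment ID using the naming pattern described above. The helper name `trial_pods_for` and the sample pod names are hypothetical, and the pattern assumes that trial IDs are numeric.

```shell
# Keep only pod names that belong to the given experiment ID.
trial_pods_for() {
  grep -E "^exp-$1-trial-[0-9]+-"
}

# Hypothetical pod names, following the naming pattern described above.
sample_pods='exp-12-trial-31-aaaa
exp-12-trial-32-bbbb
exp-13-trial-40-cccc
determined-master-deployment-xyz'

matches=$(printf '%s\n' "$sample_pods" | trial_pods_for 12)
echo "$matches"
```

Against the live cluster, the same filter could be fed with `kubectl get pods -n determinedai --no-headers | awk '{print $1}' | trial_pods_for "<experimentID>"`, substituting your own experiment ID.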

#### Run the code cell below to monitor the execution progress of the experiment.

In [None]:
det experiment list | tail -1
# Get the experiment Id, remove spaces
myexpId=$(det experiment list | tail -1 | cut -d'|' -f 1 |  tr -d ' ')
#det experiment describe ${myexpId} --json | jq .[0].state
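
The pipeline in the cell above takes the last row of the table, keeps the first `|`-separated column, and strips the padding spaces. The sketch below replays those three steps on a made-up table so you can see each stage without a running cluster. The column set and values are hypothetical; only the pipe-separated layout is inferred from the pipeline itself.

```shell
# Hypothetical table in the pipe-separated layout the pipeline above expects.
sample_list=' ID | Owner | State
----+-------+-----------
 41 | user1 | COMPLETED
 42 | user1 | ACTIVE'

# Last row -> first column -> strip padding spaces.
sample_id=$(printf '%s\n' "$sample_list" | tail -1 | cut -d'|' -f 1 | tr -d ' ')
echo "experiment id: ${sample_id}"
```

Note that this approach relies on the experiment you just created being printed as the last row of the list.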

### 2- Monitor and visualize your experiment using the Determined Web User Interface

Determined runs up to _max_trials_ trials and automatically starts new trials as resources become available.

To monitor the progress of the training task and access information on both training and validation performance for the trials of your experiment, you can simply return to the Determined **WebUI**.

##### From the **Dashboard**, after a minute or so, you should see the experiment in the **active** state, along with its completion percentage.

> <font color="blue"> **Important Note:** If multiple participants are running the workshop concurrently, your experiment might not start right away because there are more experiments running than the Kubernetes cluster has GPUs. You might need to wait a few minutes until other experiments complete for your experiment to start running. </font>

##### Select your most recent experiment.

As you can see in the **Visualization** pane, Determined’s hyperparameter optimization provides you with several visualization options for analyzing results: Learning Curve, Parallel Plot, Scatter Plot, Heatmap.

>***Note:*** To learn more about these visualization options, check out the blog post [here](https://www.determined.ai/blog/hyperparameter-visualizations-determined).

<img src="Pictures/WebUI-Exp-adaptive-visualization.png" height="154" width="900">

>Note: You can navigate to the **Trials** tab to see the progress of the training tasks for your experiment.

As the experiment runs, the _Learning Curve_ graph shows the model validation accuracy metric (_val_categorical_accuracy_). From the **Metric** drop-down list, under **Training Metrics**, select _categorical_accuracy_. Click the ***Apply*** button, as shown in the picture above, to visualize the model accuracy on training data for each trial over the number of completed batches.

<img src="Pictures/WebUI-Exp-adaptive-graphs.png" height="394" width="900">

After the experiment completes, you might see that the ***early stopping*** capability of the Adaptive ASHA searcher has stopped poorly performing trials that do not warrant further training. By stopping trials that will never produce the best model, Determined frees valuable GPU resources.

### 3 - Get the trial and hyperparameters that yield the best model

As with the other experiments you explored earlier, you can use the command below to list the trial that yields the best model:

* _det experiment list-checkpoints [--best] [N best checkpoints to return] \<experiment_Id\>_

#### Run the code cell below to display the trial that yields the best model for your experiment

You can see on the experiment detail page that training the model with the hyperparameter settings in `adaptive.yaml` yields a validation accuracy between 93% and 97%. 

In [None]:
#list the best Trial checkpoint(s) (training task):
det experiment list-checkpoints --best 1 ${myexpId}

You can use the command below to discover the hyperparameters that yield the best model:

* _det trial describe \<trial_Id\>_

#### Run the code cell below to display the hyperparameters that yield the best model for your experiment
>**Note**: You might see an SQL error message. You can ignore the issue and continue with the next step to reclaim some storage space. Then, if time permits, you may want to launch a new HPO experiment. 

In [None]:
bestTrialId=$(det experiment list-checkpoints --best 1 ${myexpId} | head -3 | tail -1 | cut -d'|' -f 1 |  tr -d ' ')
echo "Best Trial ID: " $bestTrialId
det trial describe ${bestTrialId}
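
Because the trial ID is scraped from table output, it can come back empty or contain header text if the table layout differs from what the pipeline expects. A small guard, sketched here with a hypothetical helper name (`is_numeric_id`) and placeholder inputs, checks that the value is a plain number before it is passed on:

```shell
# Succeeds only for a non-empty string made of digits.
is_numeric_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *)           return 0 ;;
  esac
}

# Placeholder inputs: a clean ID, an empty result, and a stray header cell.
for candidate in "37" "" "Trial#"; do
  if is_numeric_id "$candidate"; then
    echo "'$candidate' -> ok"
  else
    echo "'$candidate' -> skip det trial describe"
  fi
done
```

In the notebook, the guard could wrap the last command of the cell above, for example `is_numeric_id "${bestTrialId}" && det trial describe ${bestTrialId}`.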

>**Note:** Unlike the non-HPO experiments you explored earlier, the _adaptive.yaml_ experiment configuration file does not define a periodic validation parameter (min_validation_period). The validated model is checkpointed at the end of the trial. The Adaptive ASHA method could also automatically checkpoint a model earlier if it makes sense to do so.

### 4- Delete the checkpoints to reclaim storage space in the storage file system

The default **checkpoint garbage collection policy** instructs Determined to save the most recent and the best checkpoint per training task (trial). The ***save_experiment_best***, ***save_trial_best*** and ***save_trial_latest*** parameters specify which checkpoints to save. The default policy is set as follows:

  * save_experiment_best: 0
  * save_trial_best: 1
  * save_trial_latest: 1
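
To get a feel for how much storage the default policy can hold on to, the arithmetic below computes a simple upper bound on the number of checkpoints retained for this lab's experiment. It is a back-of-the-envelope estimate under the assumption that every trial produces at least one checkpoint. The real number is often lower, because the best and the latest checkpoint of a trial can be the same one.

```shell
max_trials=6            # from this lab
save_experiment_best=0  # default policy values listed above
save_trial_best=1
save_trial_latest=1

# Experiment-wide checkpoints plus per-trial checkpoints for every trial.
upper_bound=$(( save_experiment_best + max_trials * (save_trial_best + save_trial_latest) ))
echo "checkpoints kept (at most): ${upper_bound}"
```

Setting all three values to 0 brings the bound down to 0, which is why that policy reclaims the storage used by the experiment's checkpoints.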
 
#### Run the code cell below to reclaim some storage disk space by changing the default checkpoint garbage collection policy as shown below:

In [None]:
# Delete the checkpoints data for the HPO training
det experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 ${myexpId}

#### Wrap up of the workshop.
Click on **Conclusion** below to open the Conclusion notebook. 
* [Conclusion](4-WKSHP-Conclusion.ipynb)