# Getting Started with Determined, the Open-Source Deep Learning Training Platform - Lab 1
## Learning the base principles of Determined

Inspired by the Bank of England paper 816 entitled "Machine learning explainability in finance: an application to default risk analysis" [1] from August 2019, where a U.K. mortgage dataset was used, we use the widely available public U.S. Fannie Mae mortgage dataset to demonstrate an XGBoost classifier for predicting loan delinquencies. For Fannie Mae, there are the loan acquisition files (kept in subdir acq) and loan performance files (kept in subdir perf).
The NVIDIA developer blog article is entitled "Explaining and Accelerating Machine Learning for Loan Delinquencies" by Mark Bennett, John Ashley, and Patrick Hogan [3]. 
Python code began with Kyle DeGrave's article entitled "Predicting Loan Defaults in the Fannie Mae Data Set" [2], and further code contributions were made for acceleration, deep learning, and explainability by Emanuel Scoullos, Jochen Papenbrock, and Mark Bennett of NVIDIA. 

More detailed analysis of explainability approaches for portfolio construction and mortgage loans is provided in the more recent articles [4], [5], [6] by Emanuel Scoullos, Jochen Papenbrock, Prabhu Ramamoorthy, Thomas Schoenemeyer, and Miguel Martinez.

## Features
Revised code for four Python Jupyter Notebooks:

- `1_mortcudf_data_prep.ipynb`
- `2_mortcudf_XGB_Pytorch.ipynb`
- `3_mortcudf_captum.ipynb`
- `4_mortcudf_shapley_viz.ipynb`

The RAPIDS ETL code, XGBoost classifier code, SHAP value code, PyTorch classifier code, Captum code, Shapley clustering and visualization code, and the Dockerfile are also contained here. In addition, the original notebook matching the November 2020 article is included as `mortcudf_xgb.ipynb`.

Note that the docker command should be run from the directory just above the `docker` directory.

## Build
The workshop leverages a shared dataset that consumes up to <strong>195GB</strong> of space. From the Determined AI WebUI, you will deploy a second JupyterLab environment that uses Determined AI agents and from which all necessary notebooks will be run.

Ensure that PyTorch is installed with CUDA 11.1 or greater. This can be checked in the running container by running: `conda list torch`
A satisfactory output will be `1.9.0+cu111` for PyTorch 1.9.0 with CUDA 11.1. Otherwise, please pip install a later version of PyTorch: `pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html`


Determined provides a Web User Interface (UI), a Command Line Interface (CLI), and APIs to interact with Determined. In this part of the lab, you will learn how to communicate with Determined, run tasks such as model training tasks, and monitor and visualize training progress and results.

* You will be introduced to Determined components.
* You will learn how to use the Determined CLI to interact with Determined.
* You will use the Determined CLI to create a training task in order to train your Neural Network model with a single GPU.
* You will also interact with the Determined AI WebUI and integrated TensorBoard to visualize experiment metrics and results.
* You will finally use Determined's Python API to interact with Determined, load the trained model, and make some predictions (inferences) using the trained model.

> <font color="green"> **Note:** The model has already been ported to run on Determined. Porting deep learning model code to Determined is beyond the scope of this workshop. The easiest way to learn how to port your existing deep learning model code to Determined is to start with the [PyTorch Porting tutorial](https://docs.determined.ai/latest/tutorials/pytorch-porting-tutorial.html).</font>

## 1- Determined components

The **Determined Master** is the central component of the Determined deployment on Kubernetes. The Master is responsible for:

* **Scheduling** Determined training tasks as a collection of Kubernetes PODs. The Master brings up PODs to run model training tasks (known as Trials), and auxiliary tasks such as TensorBoard instances and JupyterLab instances.
* **Tracking and storing** all model training task metadata (description, labels, hyperparameters, search algorithm used, training metrics, validation metrics, start/end time, logs) in the PostgreSQL database.
* **Serving** the Web User Interface (WebUI) for users to visualize training metrics and validation metrics across their model training tasks.

>**Note:** _For this hands-on lab, the Kubernetes cluster worker nodes that run Determined software have been configured to connect to a distributed file system provided by HPE Ezmeral Runtime Enterprise (from the pre-integrated HPE Ezmeral Data Fabric). The distributed file system provides shared storage that works with Determined to enable:_ 
>* _training tasks launched as container PODs on any Kubernetes worker nodes to access the shared model training and validation datasets,_ 
>* _training tasks to save and store model training artifacts (model files, code, model definition files) and training task checkpoints on a shared checkpoint storage. Checkpoints are saved versions of validated models that users can access later to test the model and deploy the model in production. Checkpoints are also used by Determined to ensure training work is not lost in case of system failure during training, so Determined can retry failed training tasks from the latest checkpoint._

**Now, let's see Determined in action!**

## 3- Communicating with Determined using the Determined CLI 

The Determined CLI is a command line tool that allows you to interact with Determined. For example, the CLI allows you to launch a new experiment to train your deep learning (DL) model. The Determined CLI is distributed as a Python package, so it needs a machine with Python 3.6 or later and Internet access. The Determined CLI has already been installed on the JupyterHub server.

To use Determined and interact with Determined with the CLI, you first need to tell the CLI where the Determined Master service is running. You then need to authenticate as a user of Determined.

#### Set the Determined Master service endpoint URL:

Any Det CLI command is in the form: ***det [-m \<det_master_URL_or_IP:port\>] \<command_argument\> \<action_verb\> [-h]***

The Master service endpoint is referenced using the ***-m*** flag to specify the URL of the Determined Master that the CLI connects to. 

Instead of specifying the -m flag in every command, you can define an environment variable, ***DET_MASTER***, that points to the Master service endpoint URL.

>Note: You can use the help flag [-h] to learn more about valid options.
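
The two ways of naming the Master can be sketched as follows. This is a minimal illustration that only builds and prints the command lines rather than running them; the URL is the endpoint used in this lab, and `experiment list` is used simply as an example command.

```shell
# Endpoint used in this lab; replace it with your own Master URL.
export DET_MASTER="https://determinedai2.hpedev.io"

# Form 1: name the Master explicitly on every call with -m.
explicit_cmd="det -m ${DET_MASTER} experiment list"

# Form 2: omit -m and let the CLI read DET_MASTER from the environment.
implicit_cmd="det experiment list"

echo "Explicit: ${explicit_cmd}"
echo "Implicit: ${implicit_cmd}"
```

Exporting the variable once per shell session keeps the later commands in this lab short.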


Run the code cell below to set the Determined Master endpoint URL and export it as the ***DET_MASTER*** environment variable. (On a Kubernetes deployment, the URL can be looked up with the _"kubectl describe service"_ command on the Determined Master service.)

In [None]:
#
# Getting the DeterminedAI Master service endpoint URL:
#
determined_master="https://determinedai2.hpedev.io"
export DET_MASTER="${determined_master}"
echo "The Determined Master service endpoint URL is exported as environment variable DET_MASTER: ${determined_master}"

#### Authenticate to Determined:
Run the code cell below and follow steps 1 to 4 below to authenticate to Determined as student\<yourId\>: 

In [None]:
studentId="student{{ STDID }}"
password="{{ PASSSTU }}"


echo "export DET_MASTER=${determined_master}" 
echo "det user login ${studentId}"
echo "---------------------------"
echo "copy/paste the two lines above to a Terminal"
echo "your password is: ${password}"

1. Start a Terminal in the Launcher (navigate to Launcher tab --> Click Terminal tile; or go to Menu --> File --> New Launcher --> Terminal)
2. In the Terminal, copy/paste the two commands above to authenticate to Determined as user student\<yourId\>.
3. When prompted, enter the password for your student ID in Determined and press the _Return_ key.
4. Then, continue from here. 

The command below displays the Determined CLI client version and Master version. 

In [None]:
det version

## 4- Launch your first Determined training workloads to train your model

Let's first introduce some fundamental Determined concepts that are leveraged in this workshop.

Determined permits data science teams to launch deep learning model training tasks (known as **trials**) for their models ported to Determined. These tasks are distributed across one or more GPUs as an **experiment** using a particular set of configuration parameters specified in an **experiment configuration file**.

* **Experiment:** In Determined terms, an ***experiment*** is a collection of one or more DL training tasks (trials). A Determined experiment can either train a single model with a single training task using one or multiple GPUs, or it can define a search over a user-defined hyperparameter space with several training tasks.

* **Trial:** Each training task in an experiment is called a ***trial***. A trial is a training task that consists of the dataset (training and validation/test dataset), a deep learning model (for example, the Python scripts that load the dataset, build and compile the model) `adjusted to run on Determined`, and _an experiment configuration file_. All the elements of a training task are put together in a `model definition directory`.

* **Experiment configuration file:** Determined uses a YAML manifest file that defines how to run the model training process on Determined in terms of the ***hyperparameters***, the number of GPUs to use for each trial, the amount of data on which to train a model, how often the trial task must report the training metrics and the validation metrics to the Master, how often the trial task must save the model file, and many other parameters.

>**Note:** The Experiment configuration file has some required fields and some optional ones. To learn more about Experiment configuration settings, check out the online documentation [here](https://docs.determined.ai/latest/training-apis/experiment-config.html).

* **Hyperparameters:** These are user-defined variables that define how a model is trained. They affect the accuracy of the trained model. By choosing the best combination of hyperparameters, you can obtain better performance for your model. 
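
To make these concepts concrete, here is a minimal sketch of what an experiment configuration file for a single-trial experiment might look like. The field names follow the Determined experiment configuration reference as best understood here; the experiment name, hyperparameter values, batch counts, and entrypoint are hypothetical and are not taken from the `const.yaml` used in this lab.

```shell
# Write a hypothetical experiment configuration to a scratch directory.
workdir="$(mktemp -d)"
cat > "${workdir}/const.yaml" <<'EOF'
name: example_const                  # hypothetical experiment name
hyperparameters:                     # fixed values: one trial, no search
  learning_rate: 0.001
  global_batch_size: 64
resources:
  slots_per_trial: 1                 # GPUs used by each trial
searcher:
  name: single                       # train a single trial
  metric: val_categorical_accuracy   # validation metric used to rank checkpoints
  smaller_is_better: false
  max_length:
    batches: 5000                    # amount of data to train on
min_validation_period:
  batches: 1000                      # how often to validate
entrypoint: model_def:ExampleTrial   # hypothetical module:class
EOF

# Show the fields that control trial size and validation cadence.
grep -E "slots_per_trial|batches" "${workdir}/const.yaml"
```

A hyperparameter search would replace the `single` searcher and the fixed hyperparameter values with a search method and value ranges.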

## 5- Monitor and visualize your experiment in Determined AI Web User Interface

To access information on both training and validation performance, you can also simply go to the Determined **WebUI** by entering the service endpoint URL of the Determined Master in your web browser connected to the Internet.

* Run the code cell below to get the Determined Master WebUI URL. 
* Then, click on the displayed link to connect. This will open a new tab in your browser with the Determined UI login banner.
* You will be prompted to enter your credentials. Type your student ID as the username, enter the password displayed by the code cell below, and press _Return_.
* Upon login you should see the WebUI **dashboard** as shown in the picture below. The Dashboard page shows an overview of tasks on the Determined system as well as an overview of the GPU resources utilization. 

In [1]:

echo "The Determined Master WebUI URL is: https://determinedai2.hpedev.io/"
echo "Click the link above to connect. Log in using your student identifier: ${studentId}. Your password is ${password}. Click the Sign In button."



<img src="Pictures/DetWebUI-Login.png" height="298" width="300">

From the WebUI, make sure **you select your StudentID** from the ***Users*** drop-down list as shown in the picture below. By default, the Experiments are displayed. You can select other icons to display auxiliary tasks such as TensorBoard tasks and JupyterLab tasks. We will explore these auxiliary tasks in the next sections. 

<img src="Pictures/DetWebUI-Users-v1.png" height="171" width="900">


##### From the dashboard, you should see the experiment in the **ACTIVE** state with its completion percentage, or in the **COMPLETED** state.

> <font color="blue"> **Important Note:** If there are multiple concurrent participants in the workshop, your experiment might not run yet because there are more experiments running than the Kubernetes cluster has GPUs. You might need to wait a few minutes until other experiments complete for your experiment to start running. </font>

#### Select the most recent experiment you want to visualize.

As the experiment runs, the graph shows the model **validation** accuracy metric (_val_categorical_accuracy_) over the number of completed batches. You can see the graph changing in real time as the experiment runs.

From the **Metrics** menu, under **Training Metrics**, select _categorical_accuracy_ (see picture below for an example). This metric indicates the model accuracy on **training** data while the _val_categorical_accuracy_ indicates the model accuracy on **validation** data. 

<img src="Pictures/WebUI-Exp-const-Metrics-selection.png" height="297" width="700">

As you can see in the graphs, the Master plots training metrics every **100 batches** of training data by default, while the validation metrics ("validation" accuracy) are plotted every 1000 batches based on the experiment configuration parameter _min_validation_period_.
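
As a quick check of these two reporting periods, the sketch below counts how many points each curve would contain. The total of 5000 batches is a hypothetical experiment length chosen for illustration; the periods of 100 and 1000 batches are the ones quoted above.

```shell
# Hypothetical experiment length; the two periods are the values quoted above.
total_batches=5000
training_report_period=100
validation_period=1000

# Integer division gives the number of reporting points on each curve.
training_points=$(( total_batches / training_report_period ))
validation_points=$(( total_batches / validation_period ))

echo "Training metric points:   ${training_points}"
echo "Validation metric points: ${validation_points}"
```

With these numbers the training curve has ten times as many points as the validation curve, which is why it looks smoother in the WebUI.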

After the experiment completes, although the validation accuracy may differ from one trial to another, you can see on the experiment detail page that training the model with the hyperparameter settings in `const.yaml` yields a validation accuracy between 93% and 97%. 

Scroll down to see a list of training and validation workloads and their metrics for the metric types you previously selected.
You might see one or two validation workloads with checkpoints as shown in the picture below. With the default checkpoint collection policy, Determined will checkpoint the most recent validated model and the best model per training task (trial). If the most recent checkpoint is also the best checkpoint for a given trial, only one checkpoint will be saved for that trial. 

<img src="Pictures/WebUI-Exp-const-graph.png" height="520" width="900">

## 6 - Launch a TensorBoard visualization instance as an **auxiliary workload** on Determined

[TensorBoard](https://www.tensorflow.org/tensorboard) is a tool widely used by ML engineers for visualizing and inspecting the learning curves of trained deep learning models. Determined is integrated with TensorBoard for deeper analysis of your experiment and to help you examine your neural network model by viewing the training and validation loss curves for your experiment in TensorBoard.

Determined lets you launch an **auxiliary workload** such as a TensorBoard server, and access TensorBoard in one click from the WebUI, or you can run the following command in Determined’s command line:

* _det tensorboard start \<experiment_Id\>_

#### Run the code cell below to launch the TensorBoard server instance.

This may take a minute or so as Determined has to launch the TensorBoard server as a Kubernetes POD.

In [None]:
echo "Start a TensorBoard server instance for your Experiment ${myexpId} with TensorBoard instance ID:"
# start the tensorBoard server instance for the experiment
det tensorboard start -d ${myexpId}

In [None]:
# Extract the ID (first column) of the TensorBoard instance in the RUNNING state
mytensorboard=$(det tensorboard list | grep RUNNING | cut -d'|' -f 1 | tr -d ' ')
echo "Your TensorBoard is running at https://determinedai2.hpedev.io/proxy/${mytensorboard}/"
echo "Click on the link to connect."
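
The `grep | cut | tr` pipeline used in the cell above can be tried without a running cluster. The sketch below applies it to a hypothetical table laid out like a CLI listing; the column names, IDs, and owners are invented for illustration and may differ from the real `det tensorboard list` output.

```shell
# Hypothetical listing; the real column layout may differ.
sample_output="$(cat <<'EOF'
 Id                                   | Owner    | Description         | State      | Experiment Id
--------------------------------------+----------+---------------------+------------+---------------
 11111111-2222-3333-4444-555555555555 | student1 | TensorBoard (exp 7) | RUNNING    | 7
 66666666-7777-8888-9999-000000000000 | student2 | TensorBoard (exp 9) | TERMINATED | 9
EOF
)"

# Keep RUNNING rows, take the first "|"-separated column, strip the padding.
tensorboard_id=$(printf '%s\n' "${sample_output}" | grep RUNNING | cut -d'|' -f 1 | tr -d ' ')
echo "${tensorboard_id}"
```

If several instances are in the RUNNING state, the pipeline returns one ID per line, so the variable would need further filtering (for example by owner) before being used in a URL.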

<img src="Pictures/TensorBoard-const-graph.png" height="413" width="900">

Determined created TensorBoard plots to show the training loss, validation loss, training accuracy and validation accuracy for the training task (trial).

#### When you have finished with TensorBoard, run the code cell below to `kill` the TensorBoard process

In [None]:
det tensorboard kill ${mytensorboard}

## 7 - List the best model created by the training process
By default, Determined will save the most recent and the best checkpoint per training task (trial) according to the validation metrics specified in the Searcher section of the configuration file for the experiment.

* _det experiment list-checkpoints [--best] [N best checkpoints to return] \<experiment_Id\>_

>**Note**: Upon completion of the training task, if the most recent checkpoint is also the best checkpoint for a given trial, only one checkpoint will be saved for that trial by Determined. Otherwise, two checkpoints will be saved. Other checkpoints will be automatically deleted to reclaim space. You can verify this using the command below.

#### Run the code cell below to display the best checkpoint(s) for your experiment

In [None]:
#list the best Trial checkpoint(s) (training task):
det experiment list-checkpoints --best 2 ${myexpId}

## 8 - Launch a JupyterLab instance as an **auxiliary workload** on Determined

Users can also launch a JupyterLab server instance as an **auxiliary workload** on Determined, in which they run Jupyter Notebooks. This is useful to load and test a model that was trained during the experiment because the Determined CLI is installed into the JupyterLab server instance by default, and the JupyterLab server container has access to the shared file system where the checkpoints are stored.

In the next section of this part of the lab, you can use the JupyterLab server instance on Determined to test your trained model and make predictions. 

Determined lets you launch an instance of a JupyterLab server in Determined on Kubernetes and access the JupyterLab server in one click from the WebUI, or you can run the following command in Determined’s command line:

* _det notebook start [--config-file configurationFile]_

The configuration file is used to control aspects of the JupyterLab environment such as a description, the checkpoint volume where trial checkpoints are stored in the shared file system, and the resources (CPU or GPU) used to launch the JupyterLab server. Run the next code cell to look at the content of the configuration file.

#### Run the code cell below to examine the settings for the JupyterLab instance.
The configuration file used here allows you to launch a JupyterLab server instance that does not use any GPUs (***resources.slots=0***) and that gets access to the shared checkpoint storage area where the model artifacts are stored. The shared checkpoint storage is mounted at the ***/determined_shared_fs*** mount point inside the JupyterLab container POD.

In [None]:
cat Code/notebook-config.yaml
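
For orientation, a notebook configuration with the properties described above might look like the following sketch. It is not the content of `Code/notebook-config.yaml`: the description and host path are hypothetical, and the use of `bind_mounts` to expose the shared storage is an assumption (the real file may use a different mechanism, such as a pod spec volume).

```shell
# Write a hypothetical notebook configuration to a scratch directory.
workdir="$(mktemp -d)"
cat > "${workdir}/notebook-config.yaml" <<'EOF'
description: example-jupyterlab              # hypothetical description
resources:
  slots: 0                                   # no GPU for this JupyterLab instance
bind_mounts:                                 # assumption: shared storage exposed as a bind mount
  - host_path: /path/to/shared/checkpoints   # hypothetical path on the worker node
    container_path: /determined_shared_fs    # mount point named in this lab
EOF

# Confirm the GPU setting that was written.
grep -n "slots" "${workdir}/notebook-config.yaml"
```

Setting `slots` to 0 keeps the JupyterLab instance from holding a GPU that training tasks could otherwise use.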

#### Run the code cell below to launch an instance of the JupyterLab
It may take a minute or so for the JupyterLab instance to become active as Determined has to launch the JupyterLab server instance as a Kubernetes POD in the Kubernetes cluster. 

In [None]:
echo "Start a JupyterLab server instance within Determined system with instance ID:"
# start the JupyterLab server instance
det notebook start -d --config-file Code/notebook-config.yaml

>**Note:** Run the code cell below. You will notice that the JupyterLab instance is launched as a container POD in the Kubernetes cluster. The POD name is in the form "_cmd-0-\<Jupyterlab-Instance-ID\>_". Determined proxies HTTP requests to and from the JupyterLab container through the Determined Master node.

In [None]:
kubectl get pod -n determinedai

#### Check the status of the JupyterLab instance using the command below:
* _det notebook list_

In [None]:
det notebook list | grep -e RUNNING -e STARTING

## 9- Inferences with Determined
When you train a model with Determined, all of the artifacts (model files) associated with that training task are tracked and stored in the _checkpoint storage_. Determined lets you access the artifacts programmatically using the Python API from within the JupyterLab server launched on the Determined system. This makes it easy to export your best-performing trained model out of Determined and load it for testing by making **inferences** (the process of using a trained model and new unlabeled data to make a prediction).

* More information about Determined's Python API can be found [here](https://docs.determined.ai/latest/interact/api-experimental-client.html).
* More information for downloading a trained model can be found [here](https://docs.determined.ai/latest/post-training/use-trained-models.html).

#### Run the code cell to adjust some environment variables in the notebook **Inferences.ipynb**

In [None]:
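# Use "|" as the sed delimiter so that values containing "/" (such as a URL) do not break the expression
sed -i "s|USERNAME|${studentId}|" Inferences.ipynb
sed -i "s|EXPID|${myexpId}|" Inferences.ipynb
sed -i "s|MASTERURL|${masterUrl}|" Inferences.ipynb
sed -i "s|PASSW|${password}|" Inferences.ipynb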
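
One detail worth checking with placeholder substitutions like these is the `sed` delimiter: a replacement value that contains `/`, such as a URL, ends an `s/…/…/` expression early. The sketch below shows the `|` delimiter on a scratch file; the file content and the variable `example_url` are hypothetical stand-ins for the notebook and the Master URL.

```shell
# Scratch file standing in for the notebook; MASTERURL is the placeholder.
scratch="$(mktemp)"
echo 'master = "MASTERURL"' > "${scratch}"

# Hypothetical value containing slashes.
example_url="https://determinedai2.hpedev.io"

# With "/" as the delimiter, the slashes in the URL would end the expression early:
#   sed -i "s/MASTERURL/${example_url}/" "${scratch}"
# Using "|" as the delimiter avoids the clash.
sed -i "s|MASTERURL|${example_url}|" "${scratch}"

result="$(cat "${scratch}")"
echo "${result}"
```

Any character that cannot appear in the substituted values works as a delimiter; `|` is a common choice for URLs and paths.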

#### Next, download the file **Inferences.ipynb** to your local PC/laptop. 

You will use this notebook to test your trained model by making some inferences from the JupyterLab instance you have just launched on Determined.

Right-click on the file **Inferences.ipynb** and choose **Download**.

#### Now, connect to the JupyterLab server instance you have just deployed: 

* Run the code cell below to get the JupyterLab URL. Then, click on the link to connect to the JupyterLab instance you have just launched on Determined.

* On the JupyterLab instance, click the ***up arrow*** icon to **upload** the file _Inferences.ipynb_ from your local PC/laptop. Once the file is uploaded, double-click the file to open the notebook. 

In [None]:
# Extract the ID (first column) of the JupyterLab instance in the RUNNING state
myNotebook=$(det notebook list | grep RUNNING | cut -d'|' -f 1 | tr -d ' ')
echo "Your JupyterLab instance is running at http://${Internet_access}:${portUI}/proxy/${myNotebook}/"
echo "Click on the link to connect to the JupyterLab instance you just launched."
echo "On the JupyterLab instance, click the up arrow to upload the file Inferences.ipynb."

> <font color="red"> **IMPORTANT: When you have finished with the inferences in JupyterLab on Determined, please go back to your local Jupyter Notebook to run the code cells below and perform some cleanup.** </font>

## 10- Time to clean up: delete the checkpoints for your experiment to reclaim storage space in the shared file system and stop the JupyterLab instance

The ***save_experiment_best***, ***save_trial_best*** and ***save_trial_latest*** parameters of the checkpoint collection policy specify which checkpoints to save. The default policy is set as follows:

  * save_experiment_best: 0
  * save_trial_best: 1
  * save_trial_latest: 1

The default **checkpoint garbage collection policy** directs Determined to keep the most recent (latest) validated checkpoint and the best checkpoint per training task (trial). The “best” checkpoint for a trial is the one whose model scores best on the validation metric defined in the Searcher settings of the experiment configuration file.
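
The effect of the default policy on a single trial can be illustrated with a few lines of shell. This is only a sketch of the rule as described above, not of how Determined implements garbage collection; the validation history (batch count and accuracy per checkpoint) is hypothetical.

```shell
# Hypothetical validation history for one trial: "<batches> <validation accuracy>".
validations="1000 0.91
2000 0.95
3000 0.94"

# Latest checkpoint: the last validated step.
latest=$(printf '%s\n' "${validations}" | tail -n 1 | cut -d' ' -f 1)

# Best checkpoint: the step with the highest accuracy (larger is better here).
best=$(printf '%s\n' "${validations}" | LC_ALL=C sort -k2,2nr | head -n 1 | cut -d' ' -f 1)

# Checkpoints kept with save_trial_best=1 and save_trial_latest=1 (duplicates collapse to one).
retained=$(printf '%s\n' "${best}" "${latest}" | sort -u | wc -l | tr -d ' ')

echo "latest=${latest} best=${best} retained=${retained}"
```

If the last validation had also been the best one, `best` and `latest` would coincide and only one checkpoint would be kept for the trial.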
  
#### Run the code cell below to reclaim some storage disk space by changing the default checkpoint garbage collection policy, as shown below:

In [None]:
# Delete the checkpoints data for the single model training using a single GPU
det experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 ${myexpId}

#### Next, delete the instance of the JupyterLab server.

In [None]:
det notebook kill ${myNotebook}

#### Now that you have the base principles of Determined in mind, let's explore distributed training with Determined.

Click on Lab 2 below to open a notebook to explore Distributed Training with Determined. 
* [Lab 2](2-WKSHP-DET-AI-101-Getting-started-Dist-Training.ipynb)