# CIFAR10 PyTorch

This notebook walks you through training an image classification model in Determined on the popular [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), using the PyTorch machine learning library. See [this notebook](../cifar10_cnn_tf_keras/CIFAR10-TensorFlow-Keras.ipynb) for the same example built on TensorFlow Keras.

In [None]:
# Test importing Determined. If Determined is properly installed, you should see no output.
import determined as det

In [None]:
# Replace with the IP address of the Determined master.
determined_master = '<master-ip>'

# Run an Experiment

First, we will explore the components of a Determined experiment; namely, the model definition and associated experiment configuration.

## Model Directory
- `model_def.py`: The PyTorch model definition
- `.yaml` configuration files that each govern an individual experiment run

Let's look at the contents of the model directory:

In [None]:
!ls .

### model_def.py
Now drill in and view the model definition file. Look for the implementation of Determined's `PyTorchTrial` interface. This is the interface between Determined and PyTorch. It lets the ML engineer use Determined's distributed hyperparameter search in a shared runtime without having to handle the underlying distributed-systems concerns.

In [None]:
!cat -n model_def.py

### const.yaml
For our first Determined experiment, we'll run this model training job with fixed hyperparameters. Note the following sections (<u>keywords are clickable</u> and take you to the [official API docs](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html)):

- [`name`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#name): A short human-readable name for the experiment.
- [`description`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#description): A short description of the experiment (ideally <255 chars).
- [`hyperparameters`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#hyperparameters): User-defined hyperparameters that are injected into the trial class at runtime. In this configuration, they are constant values.
- [`records_per_epoch`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#records-per-epoch): The number of records in the training data set. Mandatory since we're also setting `min_validation_period`.
- [`searcher`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#searcher): The hyperparameter search algorithm for the experiment.
- [`entrypoint`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#experiment-config-entrypoint): A model definition trial class specification or Python launcher script, which is the model processing entrypoint.
- [`min_validation_period`](https://hpe-mlde.determined.ai/latest/reference/training/experiment-config-reference.html#min-validation-period): Specifies the minimum frequency at which validation should be run for each trial.

Not all of these settings are mandatory in every experiment. See the referenced API documentation for details.
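To make the structure of these sections concrete, here is a minimal sketch that mirrors them as a Python dictionary. All values (the experiment name, the hyperparameter names, the `single` searcher settings, the metric name, and the `model_def:CIFARTrial` entrypoint) are assumptions for illustration. The `const.yaml` printed in the next cell is the authoritative configuration.

```python
# Illustrative mirror of the sections listed above.
# Every value here is a hypothetical placeholder, not the content of const.yaml.
experiment_config = {
    "name": "cifar10_pytorch_const",
    "description": "CIFAR-10 PyTorch training with fixed hyperparameters",
    "hyperparameters": {
        "learning_rate": 1.0e-4,   # constant value, not a search range
        "global_batch_size": 32,
    },
    "records_per_epoch": 50000,    # CIFAR-10 ships 50,000 training images
    "searcher": {
        "name": "single",          # one trial with fixed hyperparameters
        "metric": "validation_error",
        "smaller_is_better": True,
    },
    "entrypoint": "model_def:CIFARTrial",  # assumed "<module>:<TrialClass>" form
    "min_validation_period": {"epochs": 1},
}

# Check that every section described above is present in the sketch.
documented_keys = [
    "name", "description", "hyperparameters", "records_per_epoch",
    "searcher", "entrypoint", "min_validation_period",
]
missing = [key for key in documented_keys if key not in experiment_config]
print("missing sections:", missing)
```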

In [None]:
!cat -n const.yaml

## Submit Experiment

In [None]:
!det -m {determined_master} experiment create const.yaml .

Once the experiment completes (which may take a few minutes if Determined agents have to start up), look at the experiment page to see the single completed trial.  Note the validation error around 0.75.

# Adaptive Hyperparameter Search
### adaptive.yaml

Next, let's run an experiment with the same model definition, but we'll leverage Determined's adaptive hyperparameter search to efficiently determine the hyperparameter values that yield the lowest validation error.  Note that hyperparameters in the experiment configuration are specified as ranges as opposed to fixed values as in our [first experiment](#const.yaml).
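The difference between a fixed value and a range can be sketched in plain Python. The hyperparameter name, the bounds, and the `sample` helper below are hypothetical, and the uniform random sampling only illustrates what a range means. It is not Determined's adaptive search algorithm, which also decides how long each trial is trained.

```python
import random

# Fixed value, as in the first experiment (hypothetical number).
constant_spec = 1.0e-4

# Range specification in the style of an experiment configuration:
# a "log" range is sampled as base ** uniform(minval, maxval).
range_spec = {"type": "log", "base": 10, "minval": -5.0, "maxval": -2.0}


def sample(spec, rng):
    """Return one concrete value for a constant or range specification."""
    if not isinstance(spec, dict):
        return spec  # constants are passed through unchanged
    if spec["type"] == "double":
        return rng.uniform(spec["minval"], spec["maxval"])
    if spec["type"] == "log":
        return spec["base"] ** rng.uniform(spec["minval"], spec["maxval"])
    raise ValueError(f"unsupported hyperparameter type: {spec['type']}")


rng = random.Random(0)  # seeded so the sketch is repeatable
constant_value = sample(constant_spec, rng)
sampled_values = [sample(range_spec, rng) for _ in range(5)]
print(constant_value, sampled_values)
```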

In [None]:
!cat -n adaptive.yaml

## Submit Experiment

In [None]:
!det -m {determined_master} experiment create adaptive.yaml .

During and after the experiment run, you can view the best (lowest) validation error that Determined's adaptive search has found so far on the experiment page.

When the experiment finishes, note that your best performing model achieves a lower validation error than our first experiment that ran with constant hyperparameter values.  From the Determined experiment detail page, you can drill in to a particular trial and view the hyperparameter values used.  You can also access the saved checkpoint of your best-performing model and load it for real-time or batch inference as described in the PyTorch documentation [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training).

# Distributed training on multiple GPUs

See also the introduction to implementing distributed training, which you can find [here](https://docs.determined.ai/latest/model-dev-guide/dtrain/dtrain-implement.html#multi-gpu-training-implement).

### distributed.yaml

If you have a multi-GPU cluster running Determined, you can distribute your training across multiple GPUs by changing a few settings in your experiment configuration.

In [None]:
!cat -n distributed.yaml

<b>Note the slight differences from `const.yaml`:</b>
- We added `slots_per_trial` and set it to the number of GPUs we're training on (here: 16).
- Since we're training on 16 GPUs and want a per-GPU batch size of 32, we set `global_batch_size` to 32 * 16 = 512.
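The batch size arithmetic can be checked with a few lines of Python. The variable names below are chosen for this sketch. Only `slots_per_trial` and `global_batch_size` correspond to configuration keys.

```python
slots_per_trial = 16       # number of GPUs used by the trial
per_gpu_batch_size = 32    # desired batch size on each GPU

# The global batch is split evenly across all slots.
global_batch_size = per_gpu_batch_size * slots_per_trial

# Sanity check: the global batch must divide evenly among the GPUs.
assert global_batch_size % slots_per_trial == 0

print(global_batch_size)  # 512
```

If you change the number of GPUs, recompute `global_batch_size` the same way to keep the per-GPU batch size at 32.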

In [None]:
!det -m {determined_master} experiment create distributed.yaml .

# Distributed Batch Inference

When using PyTorch, you can use the distributed training workflow with `PyTorchTrial` to accelerate inference workloads. This workflow is not yet officially supported, so you must specify certain training-specific artifacts even though they are not used for inference. This is covered below. You can find further documentation [here](https://docs.determined.ai/latest/model-dev-guide/dtrain/dtrain-implement.html#distributed-inference).

### distributed_inference_example.yaml

In [None]:
!cat -n distributed_inference_example.yaml

Finally, launch batch inference the same way you would launch a training job.

In [None]:
!det -m {determined_master} experiment create distributed_inference_example.yaml .