# MNIST Training using PyTorch and Horovod

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and MXNet. This notebook example shows how to:
- use Horovod with PyTorch
- run distributed parallel training
- leverage Spot instances
- save checkpoints.

You may have to raise the number of parallel spot instances in your account to be able to run it (the code uses **6 p3.2xlarge instances in parallel** during the hyperparameter optimization).

For more information about the PyTorch in SageMaker, please visit [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) github repositories.


---

## Setup

_This notebook was created and tested on ml.p3.2xlarge notebook instances._

Let's start by creating a SageMaker session and specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role arn used to give training and hosting access to your data. See the [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with the appropriate full IAM role arn string(s).


In [1]:
from datetime import datetime

start_time = datetime.now()

In [2]:
import sagemaker
import time
import numpy as np
import boto3
import pandas as pd

from torchvision import datasets, transforms
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from IPython.display import HTML

sagemaker_session = sagemaker.Session()

bucket = 'sagemaker-pytorch-dist-demo' # or use `sagemaker_session.default_bucket()`
prefix = 'sagemaker/DEMO-pytorch-mnist'

role = sagemaker.get_execution_role()

## Data
### Getting the data

In this example, we will ues MNIST dataset. MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

In [3]:
data_start_time = datetime.now()

In [4]:
datasets.MNIST('data', download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
]))

Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )

### Uploading the data to S3
We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.


In [5]:
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print(f'input spec (in this case, just an S3 path): {inputs}')

input spec (in this case, just an S3 path): s3://sagemaker-pytorch-dist-demo/sagemaker/DEMO-pytorch-mnist


## Train
### Training script
The `mnist_ckppoint.py` script provides the code we need for training a SageMaker model.
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.
  These artifacts are uploaded to S3 for model hosting.
* `SM_NUM_GPUS`: The number of gpus available in the current container.
* `SM_CURRENT_HOST`: The name of the current container on the container network.
* `SM_HOSTS`: JSON encoded list containing all the hosts .

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.

This script uses Horovod framework for distributed training where Horovod-related lines are commented with `Horovod:`. For example, `hvd.broadcast_parameters`, `hvd.DistributedOptimizer` and etc.

For example, the script run by this notebook:

In [6]:
!pygmentize code/mnist_ckpoint.py

[37m#from __future__ import print_function[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mhorovod.torch[39;49;00m [34mas[39;49;00m [04m[36mhvd[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.nn.functional[39;49;00m [34mas[39;49;00m [04m[36mF[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.optim[39;49;00m [34mas[39;49;00m [04m[36moptim[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.utils.data.distributed[39;49;00m

[34mfrom[39;49;00m [04m[36mtorchvision[39;49;00m [34mimport[39;49;00m datasets, transforms
[34mfrom[39;49;00m [04m[36mtorch.nn[39;49;00m [34mimport[39;49;00m NLLLoss
[34mfrom[39;49;00m [04m[36mmodel_def[39;49;00m [34mimport[39;49;00m Net

logger = logging.getLogger([31m__nam

### Run training in SageMaker

The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on 2 ```ml.p3.xlarge``` instances. But this example can be ran on one or multiple, cpu or gpu instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The hyperparameters parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the `mnist_ckpoint.py` script above.


In [7]:
train_start_time = datetime.now()

In [8]:
metrics = [
    {
        "Name": "Test Accuracy",
        "Regex": "Test set:.+Accuracy: (\d+(?:\.\d+))%"
    },
    {
        "Name": "Test Loss",
        "Regex": "Test set:.+Average loss: (\d+(?:\.\d+)),.+" 
    },
    {
        "Name": "Epoch",
        "Regex": 'Train Epoch: (\d+)'
    },
    {
        "Name": "Epoch completion",
        "Regex": 'Train Epoch: \d+ \[\d+/\d+ \((\d+%)\)\]'
    },
    {
        "Name": "Train loss",
        "Regex": 'Train Epoch: \d+ \[\d+/\d+ \(\d+%\)\] Loss: (\d+\.\d+)'
    },
]

estimator = PyTorch(entry_point='mnist_ckpoint.py',
                    source_dir='code',
                    role=role,
                    framework_version='1.3.1',
                    train_instance_type='ml.p3.2xlarge',
                    metric_definitions=metrics,
                    train_use_spot_instances=True,
                    train_max_wait=25*60*60,
                    train_instance_count=2,
                    checkpoint_s3_uri=f's3://{bucket}/checkpoints',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'nccl',
                        'test-checkpoint': 4
                    })

In order to demonstrate the checkpoint resume capability, we'll tell the script to raise an exception at the beginning of the fourth epoch (hyperparameter `'test-checkpoint': 4`) above.

After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

**The next cell is designed to fail at the 4th epoch. When it does, just resume the notebook execution from the one after it**.


In [9]:
estimator.fit({'training': inputs}, wait=True)

2020-02-17 11:14:06 Starting - Starting the training job...
2020-02-17 11:14:08 Starting - Launching requested ML instances......
2020-02-17 11:15:16 Starting - Preparing the instances for training......
2020-02-17 11:16:20 Downloading - Downloading input data...
2020-02-17 11:16:53 Training - Downloading the training image.......[35mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[35mbash: no job control in this shell[0m
[35m2020-02-17 11:18:10,314 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[35m2020-02-17 11:18:10,339 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-02-17 11:18:13,059 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-02-17 11:18:13,084 sagemaker_pytorch_co


[34mStarting training from the beginning[0m
[35mStarting training from the beginning[0m
[35mTest checkpoint is 4[0m
[34mTest checkpoint is 4[0m
[34mGet train data sampler and data loader[0m
[34mGet test data sampler and data loader[0m
[34mProcesses 60000/60000 (100%) of train data[0m
[34mProcesses 10000/10000 (100%) of test data[0m
[34m[2020-02-17 11:18:23.525 algo-1:46 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.[0m
[34m[2020-02-17 11:18:23.525 algo-1:46 INFO hook.py:152] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.[0m
[35mGet train data sampler and data loader[0m
[35mGet test data sampler and data loader[0m
[35mProcesses 60000/60000 (100%) of train data[0m
[35mProcesses 10000/10000 (100%) of test data[0m
[35m[2020-02-17 11:18:23.539 algo-2:46 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.[0m

[34mINFO:__main__:Epoch: 1#011Test set: Average loss: 0.2013, Accuracy: 93.88%[0m
[0m
[35mINFO:__main__:Epoch: 1#011Test set: Average loss: 0.2009, Accuracy: 94.05%[0m
[0m
[34mEpoch: 1#011Test set: Average loss: 0.2013, Accuracy: 93.88%[0m
[35mEpoch: 1#011Test set: Average loss: 0.2009, Accuracy: 94.05%[0m
[0m
[35mSaving checkpoint for epoch 1 with accuracy 0.941[0m
[0m
[34mSaving checkpoint for epoch 1 with accuracy 0.939[0m




[35mINFO:__main__:Epoch: 2#011Test set: Average loss: 0.1267, Accuracy: 96.08%[0m
[0m
[34mINFO:__main__:Epoch: 2#011Test set: Average loss: 0.1267, Accuracy: 96.09%[0m
[0m
[34mEpoch: 2#011Test set: Average loss: 0.1267, Accuracy: 96.09%[0m
[0m
[34mSaving checkpoint for epoch 2 with accuracy 0.961[0m
[35mEpoch: 2#011Test set: Average loss: 0.1267, Accuracy: 96.08%[0m
[0m
[35mSaving checkpoint for epoch 2 with accuracy 0.961[0m


[35mINFO:__main__:Epoch: 3#011Test set: Average loss: 0.0991, Accuracy: 96.84%[0m
[0m
[35mTraceback (most recent call last):[0m
  File "mnist_ckpoint.py", line 194, in <module>[0m
    train(parser.parse_args())[0m
  File "mnist_ckpoint.py", line 104, in train[0m
    assert (args.test_checkpoint != epoch), "Interrupting the training for checkpoint testing"[0m
[35mAssertionError: Interrupting the training for checkpoint testing[0m
[34mINFO:__main__:Epoch: 3#011Test set: Average loss: 0.0988, Accuracy: 96.86%[0m
[0m
[34mTraceback (most recent call last):[0m
  File "mnist_ckpoint.py", line 194, in <module>[0m
    train(parser.parse_args())[0m
  File "mnist_ckpoint.py", line 104, in train[0m
    assert (args.test_checkpoint != epoch), "Interrupting the training for checkpoint testing"[0m
[34mAssertionError: Interrupting the training for checkpoint testing[0m
[34mEpoch: 3#011Test set: Average loss: 0.0988, Accuracy: 96.86%
[0m
[34mSaving checkpoint for epoch 3 with a


2020-02-17 11:19:14 Uploading - Uploading generated training model
2020-02-17 11:19:44 Failed - Training job failed


UnexpectedStatusException: Error for Training job pytorch-training-2020-02-17-11-14-06-031: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python mnist_ckpoint.py --backend nccl --epochs 5 --test-checkpoint 4"
INFO:__main__:Train Epoch: 1 [640/60000 (1%)] Loss: 2.304615
INFO:__main__:Train Epoch: 1 [1280/60000 (2%)] Loss: 2.312131
INFO:__main__:Train Epoch: 1 [1920/60000 (3%)] Loss: 2.291755
INFO:__main__:Train Epoch: 1 [2560/60000 (4%)] Loss: 2.294058
INFO:__main__:Train Epoch: 1 [3200/60000 (5%)] Loss: 2.291983
INFO:__main__:Train Epoch: 1 [3840/60000 (6%)] Loss: 2.259225
INFO:__main__:Train Epoch: 1 [4480/60000 (7%)] Loss: 2.270091
INFO:__main__:Train Epoch: 1 [5120/60000 (9%)] Loss: 2.273199
INFO:__main__:Train Epoch: 1 [5760/60000 (10%)] Loss: 2.240260
INFO:__main__:Train Epoch: 1 [6400/60000 (11%)] Loss: 2.219095
INFO:__main__:Train Epoch: 1 [7040/60000 (12%)] Loss: 2.242404
INFO:__main__:Train Epoch: 1 [7680/60000 (13%)] Loss: 2.145658
INFO:__main__:Train Epoch: 1 [8320/60000 (14%)] Loss: 2.101541
INFO:__main__:Train Epoch: 1 [8960/60000 (15%)] Loss: 2.152929
INFO:__main__:Tr

The original training run failed, as requested. But we have the checkpoints stored in our bucket, under the `/checkpoints` directory. If you check the `mnist_ckpoint.py` script, you'll see that it has two functions that respectively save and restore the checkpoints:
```python
def save_checkpoint(state, filename='/opt/ml/checkpoints/checkpoint.pth.tar'):
    print(f"Saving checkpoint for epoch {state['epoch']} with accuracy {state['best_accuracy']:.3f}")
    torch.save(state, filename)  # save checkpoint

    
def load_checkpoint(filename='/opt/ml/checkpoints/checkpoint.pth.tar'):
    return (torch.load(filename) if os.path.exists(filename) else None)
```

`save_checkpoint` is called inside the training loop, with the following code:
```python
        if (best_accuracy < accuracy) and (hvd.rank() == 0):
            best_accuracy = accuracy
            is_best = True
            save_checkpoint(
                {
                    'epoch': epoch,
                    'state_dict': model.state_dict(),
                    'best_accuracy': best_accuracy
                })
```
The `hvd.rank() == 0` is, according to [horovod's repo](https://github.com/horovod/horovod#usage) (see point 6), the right way to save checkpoints leveraging the framework (the example is for Tensorflow, we adapted it for PyTorch).

`load_checkpoint` is called at the beginning of the `train` function, and if it finds a saved checkpoint training will resume from there:
```python
    start_epoch = 1
    best_accuracy = 0
    checkpoint = load_checkpoint()
    if checkpoint:
        start_epoch = checkpoint['epoch']
        best_accuracy = checkpoint['best_accuracy']
        model.load_state_dict(checkpoint['state_dict'])
```

The next cell creates another PyTorch Estimator, without the request to interrupt training. From the logs, it can be seen that it started from the previous checkpoint

In [None]:
estimator2 = PyTorch(entry_point='mnist_ckpoint.py',
                    source_dir='code',
                    role=role,
                    framework_version='1.3.1',
                    train_instance_type='ml.p3.2xlarge',
                    metric_definitions=metrics,
                    train_use_spot_instances=True,
                    train_max_wait=25*60*60,
                    train_instance_count=2,
                    checkpoint_s3_uri='s3://sagemaker-us-east-1-113147044314/checkpoints',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'nccl',
                    })
estimator2.fit({'training': inputs}, wait=True)

2020-02-17 11:21:11 Starting - Starting the training job...
2020-02-17 11:21:13 Starting - Launching requested ML instances......
2020-02-17 11:22:17 Starting - Preparing the instances for training......
2020-02-17 11:23:37 Downloading - Downloading input data...
2020-02-17 11:24:07 Training - Downloading the training image......[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-02-17 11:24:58,460 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-02-17 11:24:58,485 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[35mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[35mbash: no job control in this shell[0m
[35m2020-02-17 11:25:14,877 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[35m2020-02-17 11:25:14,902 sagemaker_pytorch_con


[35mStarting training from the beginning[0m
[35mTest checkpoint is 0[0m
[35mGet train data sampler and data loader[0m
[35mGet test data sampler and data loader[0m
[35mProcesses 60000/60000 (100%) of train data[0m
[35mProcesses 10000/10000 (100%) of test data[0m
[35m[2020-02-17 11:25:25.250 algo-2:46 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.[0m
[35m[2020-02-17 11:25:25.251 algo-2:46 INFO hook.py:152] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.[0m
[35m[2020-02-17 11:25:25.251 algo-2:46 INFO hook.py:197] Saving to /opt/ml/output/tensors[0m
[35m[2020-02-17 11:25:25.253 algo-2:46 INFO hook.py:216] Initialized the hook with the last saved state: last_saved_step=5500 init_step = 5685, step = 5685 mode_steps = {<ModeKeys.GLOBAL: 4>: 5685}[0m
[35m[2020-02-17 11:25:25.253 algo-2:46 INFO hook.py:326] Monitoring the collections: losses[0m
[34mStarting training 

[35mINFO:__main__:Epoch: 1#011Test set: Average loss: 0.2000, Accuracy: 93.99%[0m
[0m
[35mEpoch: 1#011Test set: Average loss: 0.2000, Accuracy: 93.99%[0m
[0m
[35mSaving checkpoint for epoch 1 with accuracy 0.940[0m


[34mINFO:__main__:Epoch: 1#011Test set: Average loss: 0.2008, Accuracy: 94.02%[0m
[0m
[34mEpoch: 1#011Test set: Average loss: 0.2008, Accuracy: 94.02%[0m
[0m
[34mSaving checkpoint for epoch 1 with accuracy 0.940[0m


[35mINFO:__main__:Epoch: 2#011Test set: Average loss: 0.1271, Accuracy: 96.11%[0m
[0m
[35mEpoch: 2#011Test set: Average loss: 0.1271, Accuracy: 96.11%[0m
[0m
[35mSaving checkpoint for epoch 2 with accuracy 0.961[0m
[34mINFO:__main__:Epoch: 2#011Test set: Average loss: 0.1276, Accuracy: 96.04%[0m
[0m
[34mEpoch: 2#011Test set: Average loss: 0.1276, Accuracy: 96.04%[0m
[0m
[34mSaving checkpoint for epoch 2 with accuracy 0.960[0m


[35mINFO:__main__:Epoch: 3#011Test set: Average loss: 0.0984, Accuracy: 96.79%[0m
[0m
[35mEpoch: 3#011Test set: Average loss: 0.0984, Accuracy: 96.79%[0m
[0m
[35mSaving checkpoint for epoch 3 with accuracy 0.968[0m


[34mINFO:__main__:Epoch: 3#011Test set: Average loss: 0.1001, Accuracy: 96.79%[0m
[0m
[34mEpoch: 3#011Test set: Average loss: 0.1001, Accuracy: 96.79%[0m
[0m
[34mSaving checkpoint for epoch 3 with accuracy 0.968[0m


[35mINFO:__main__:Epoch: 4#011Test set: Average loss: 0.0840, Accuracy: 97.31%[0m
[0m
[35mEpoch: 4#011Test set: Average loss: 0.0840, Accuracy: 97.31%[0m
[0m
[35mSaving checkpoint for epoch 4 with accuracy 0.973[0m
[34mINFO:__main__:Epoch: 4#011Test set: Average loss: 0.0841, Accuracy: 97.27%[0m
[0m
[34mEpoch: 4#011Test set: Average loss: 0.0841, Accuracy: 97.27%[0m
[0m
[34mSaving checkpoint for epoch 4 with accuracy 0.973[0m


[35mINFO:__main__:Epoch: 5#011Test set: Average loss: 0.0743, Accuracy: 97.53%[0m
[0m
[35mEpoch: 5#011Test set: Average loss: 0.0743, Accuracy: 97.53%
[0m
[35mSaving checkpoint for epoch 5 with accuracy 0.975[0m
[35m[2020-02-17 11:26:45.388 algo-2:46 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.[0m
[35m2020-02-17 11:26:45,816 sagemaker-containers INFO     Reporting training SUCCESS[0m


[34mINFO:__main__:Epoch: 5#011Test set: Average loss: 0.0749, Accuracy: 97.58%[0m
[0m
[34mEpoch: 5#011Test set: Average loss: 0.0749, Accuracy: 97.58%
[0m
[34mSaving checkpoint for epoch 5 with accuracy 0.976[0m
[34m[2020-02-17 11:26:48.638 algo-1:46 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.[0m
[34m2020-02-17 11:26:49,012 sagemaker-containers INFO     Reporting training SUCCESS[0m

2020-02-17 11:26:50 Uploading - Uploading generated training model
2020-02-17 11:27:48 Completed - Training job completed
Training seconds: 502
Billable seconds: 152
Managed Spot Training savings: 69.7%


In [None]:
print("\n".join([f"{m['MetricName']}: {m['Value']}"for m in estimator2.latest_training_job.describe()['FinalMetricDataList']]))

Test Accuracy: 97.58000183105469
Test Loss: 0.07490000128746033
Epoch: 5.0
Train loss: 0.27115100622177124


Now that we have a fully trained estimator, we could save it as a deployable model. But before we do so, let's see how we can fine-tune the hyperparameters for it.

## Hyperparameter Tuning

In [None]:
hpo_start_time = datetime.now()

The first step for hyperparameter optimization is to define which parameters should be tried and their respective ranges or categories. We'll showcase how it's done with the learning rate and the batch size, but any parameter can be tried. Actually, by adapting the training script and passing it some parameters, even the network architecture itself could be tried dynamically.

We'll also define which metric we want to optimize for, in this case average test loss.

In [None]:
hyperparameter_ranges = {
    'lr': ContinuousParameter(0.001, 0.1),
    'batch-size': CategoricalParameter([32,64,128])
}
objective_metric_name = 'average test loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'average test loss',
                       'Regex': 'Test set: Average loss: ([0-9\\.]+)'}]

With the parameter ranges and target metric ready, we create a tuner from the estimator we trained before. We are limiting the search to 9 attempts and telling it to try up to 3 parallel training optimizations. Note that your account has to have quotas for the combination of optimization runs **times** the degree of parallelism of the estimator, and that the Bayesian optimization will have a better chance to reduce the total cost when trials are run sequentially.

In [None]:
tuner = HyperparameterTuner(estimator2,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=3,
                            early_stopping_type="Off",
                            objective_type=objective_type)

In [None]:
tuner.fit({'training': inputs})

In [17]:
tuner.wait()

.....................................................................................................................................................................................................................!


Once tuning is finished, we can request the best estimator and check its hyperparameters and achieved results.

In [18]:
best = tuner.best_estimator()

2020-02-17 11:40:37 Starting - Preparing the instances for training
2020-02-17 11:40:37 Downloading - Downloading input data
2020-02-17 11:40:37 Training - Training image download completed. Training in progress.
2020-02-17 11:40:37 Uploading - Uploading generated training model
2020-02-17 11:40:37 Completed - Training job completed[35mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[35mbash: no job control in this shell[0m
[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-02-17 11:38:56,381 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-02-17 11:38:56,384 sagemaker-containers INFO     Failed to parse hyperparameter _tuning_objective_metric value average test loss to Json.[0m
[34mReturning the value itself[0m
[34m2020-02-17 11:38:56,409 sagemaker_pytorch_container.training INFO     Block until all host DNS

In [19]:
best.hyperparameters()

{'_tuning_objective_metric': '"average test loss"',
 'backend': '"nccl"',
 'batch-size': '"128"',
 'epochs': '5',
 'lr': '0.07896950695095702',
 'sagemaker_container_log_level': '20',
 'sagemaker_enable_cloudwatch_metrics': 'false',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2020-02-17-11-28-26-417"',
 'sagemaker_program': '"mnist_ckpoint.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-113147044314/pytorch-training-2020-02-17-11-28-26-417/source/sourcedir.tar.gz"'}

In [20]:
print("\n".join([f"{m['MetricName']}: {m['Value']}"for m in best.latest_training_job.describe()['FinalMetricDataList']]))

average test loss: 0.05079999938607216
ObjectiveMetric: 0.05079999938607216


## Host
### Create endpoint
After training, we need to use the `PyTorch` estimator object to create a `PyTorchModel` object and set a different `entry_point`, otherwise, the training script `mnist_ckpoint.py` will be used for inference. (Note that the new `entry_point` must be under the same `source_dir` as `mnist_ckpoint.py`). Then we use the `PyTorchModel` object to deploy a `PyTorchPredictor`. This creates a Sagemaker Endpoint -- a hosted prediction service that we can use to perform inference.

An implementation of `model_fn` is required for inference script. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fm` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

Here's an example of the inference script:

In [21]:
!pygmentize code/inference.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m print_function
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m

[34mfrom[39;49;00m [04m[36mmodel_def[39;49;00m [34mimport[39;49;00m Net


[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    device = torch.device([33m"[39;49;00m[33mcuda[39;49;00m[33m"[39;49;00m [34mif[39;49;00m torch.cuda.is_available() [34melse[39;49;00m [33m"[39;49;00m[33mcpu[39;49;00m[33m"[39;49;00m)
    model = Net()
    [34mwith[39;49;00m [36mopen[39;49;00m(os.path.join(model_dir, [33m'[39;49;00m[33mmodel.pth[39;49;00m[33m'[39;49;00m), [33m'[39;49;00m[33mrb[39;49;00m[33m'[39;49;00m) [34mas[39;49;00m f:
        model.load_state_dict(torch.load(f))
    [34mreturn[39;49;00m model.to(device)


The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job.  Here we will deploy the model to a single ```ml.p3.2xlarge``` instance. Notice that endpoint deployment can take up to 10 minutes.

In [22]:
deploy_start_time = datetime.now()

In [23]:
# Create a PyTorchModel object with a different entry_point
model = best.create_model(entry_point='inference.py', source_dir='code')

In [24]:
# Deploy the model to an instance
predictor_gpu = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge', wait=True)

-------------------!

In [25]:
print("Endpoint characteristics:\n"
      f"\tAccept: {predictor_gpu.accept}\n"
      f"\tContent Type: {predictor_gpu.content_type}\n"
      f"\tSerializer: {predictor_gpu.serializer}\n"
      f"\tDeserializer: {predictor_gpu.deserializer}\n"
      f"\tEndpoint: {predictor_gpu.endpoint}")

Endpoint characteristics:
	Accept: application/x-npy
	Content Type: application/x-npy
	Serializer: <sagemaker.predictor._NPYSerializer object at 0x7fb924f0ea58>
	Deserializer: <sagemaker.predictor._NumpyDeserializer object at 0x7fb924f0e898>
	Endpoint: pytorch-training-200217-1128-004-d84ef244


### Evaluate
We can now use this predictor to classify hand-written digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the `predictor`. Notice that the drawing box only works under the original Jupyter interface (not Jupyter Lab).

In [26]:
eval_start_time = datetime.now()

In [27]:
HTML(open("input.html").read())

In [28]:
%%time
image = np.array([data], dtype=np.float32)
response = predictor_gpu.predict(image)
prediction = response.argmax(axis=1)[0]
print(prediction)

2
CPU times: user 7.47 ms, sys: 4.08 ms, total: 11.6 ms
Wall time: 5.84 s


#### Predictor Timing

We can also take some measures of how efficient each instance type is. Given that the model itself is quite simple, we should not see much difference between GPU and CPU powered instances.

In [33]:
%%timeit
results = predictor_gpu.predict(image)

11.2 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [34]:
predictor_gpu.delete_endpoint()
time.sleep(5)

In [35]:
predictor_cpu = model.deploy(initial_instance_count=1, instance_type='ml.m5.2xlarge', wait=True)



-------------------!

In [36]:
%%timeit
results = [predictor_cpu.predict(image) for _ in range(100)]

1.06 s ± 61.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
predictor_cpu.delete_endpoint()

Always remember to delete the endpoints after trying them out. Otherwise they'll remain active and generate additional costs in your account (even if idle).

In [38]:
end_time = datetime.now()

In [39]:
total_duration = end_time - start_time
data_duration = train_start_time - data_start_time
train_duration = hpo_start_time - train_start_time
hpo_duration = deploy_start_time - hpo_start_time
deploy_duration = eval_start_time - deploy_start_time

print(f"Total: {total_duration}\nData Load: {data_duration}\nTraining: {train_duration}\nHPO: {hpo_duration}\nDeploy: {deploy_duration}")

Total: 1:12:28.351083
Data Load: 0:00:04.658180
Training: 0:14:20.437503
HPO: 0:19:08.737477
Deploy: 0:09:32.900177
