# MNIST Training using MXNet

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
1. [Evaluate](#Evaluate)

---

## Background

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and MXNet. This notebook shows how to use Horovod with MXNet in SageMaker, using the MNIST dataset.

For more information about MXNet in SageMaker, see the following GitHub repositories:
1. [sagemaker-mxnet-training-toolkit](https://github.com/aws/sagemaker-mxnet-training-toolkit/)
2. [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) 
3. [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit)
4. [deep-learning-containers](https://github.com/aws/deep-learning-containers)

---

## Setup

_This notebook was created and tested on an ml.p2.xlarge notebook instance._

Let's start by creating a SageMaker session and specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create one.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).


In [None]:
import sagemaker
from sagemaker.mxnet import MXNet

sess = sagemaker.Session()

bucket = sess.default_bucket()
prefix = 'sagemaker/DEMO-mxnet-mnist-horovod'

role = sagemaker.get_execution_role()

output_path='s3://' + sess.default_bucket() + '/' + prefix

## Data
### Getting the data

You will download MNIST data from a public bucket and upload it to the default bucket associated with your AWS account.

In [None]:
import os

import boto3

# Download training and testing data from a public S3 bucket

public_bucket = 'sagemaker-sample-files'
local_data_dir = '/tmp/data'

def download_from_s3(data_dir, train=True):
    """Download the gzipped MNIST image and label files from S3
    
    Args:
        data_dir (str): directory to save the data
        train (bool): download training set
    
    Returns:
        None
    """
    
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    if train:
        images_file = "train-images-idx3-ubyte.gz"
        labels_file = "train-labels-idx1-ubyte.gz"
    else:
        images_file = "t10k-images-idx3-ubyte.gz"
        labels_file = "t10k-labels-idx1-ubyte.gz"
  
    # download objects
    s3 = boto3.client('s3')
    bucket = public_bucket
    for obj in [images_file, labels_file]:
        # S3 keys always use forward slashes, regardless of the local OS
        key = "datasets/image/MNIST/" + obj
        dest = os.path.join(data_dir, obj)
        if not os.path.exists(dest):
            s3.download_file(bucket, key, dest)
    return


download_from_s3(local_data_dir, True)
download_from_s3(local_data_dir, False)


# upload to the default bucket

prefix = 'mnist'
bucket = sess.default_bucket()
loc = sess.upload_data(path=local_data_dir, bucket=bucket, key_prefix=prefix)

channels = {
    "training": loc,
    "testing": loc
}

## Train

Define the hyperparameters of the training job. Note that the `entry_point` parameter names the training script that is executed on the Horovod distributed cluster. You can also pass any other parameters that your training script accepts.


### Training Script

The `train.py` script provides the code we need for training a SageMaker model. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

- `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
- `SM_NUM_GPUS`: The number of gpus available in the current container.
- `SM_CURRENT_HOST`: The name of the current container on the container network.
- `SM_HOSTS`: A JSON-encoded list of all the hosts in the training cluster.

For each input channel passed to the MXNet estimator's `fit()` method, SageMaker sets a variable named `SM_CHANNEL_[channel_name]`. For example, a channel named 'training' yields:

- `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.
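
As a rough illustration of how a training script might read these variables, the sketch below collects them into a dictionary. The helper name `read_sagemaker_env` and the fallback defaults are assumptions made for this example and are not taken from `train.py`.

```python
import json
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SageMaker training variables described above.

    `read_sagemaker_env` is a hypothetical helper; the defaults are
    placeholders so the function also works outside a training container.
    """
    # SM_HOSTS is a JSON-encoded list, so it needs json.loads.
    hosts = json.loads(environ.get("SM_HOSTS", '["localhost"]'))
    return {
        "model_dir": environ.get("SM_MODEL_DIR", "/tmp/model"),
        "num_gpus": int(environ.get("SM_NUM_GPUS", "0")),
        "current_host": environ.get("SM_CURRENT_HOST", "localhost"),
        "hosts": hosts,
        # One SM_CHANNEL_<NAME> variable exists per input channel.
        "channels": {
            key[len("SM_CHANNEL_"):].lower(): value
            for key, value in environ.items()
            if key.startswith("SM_CHANNEL_")
        },
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try outside a training container.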

For more information about training environment variables, see the [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit) repository (formerly SageMaker Containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.
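
A minimal sketch of that argument handling is shown below. The flag names mirror the hyperparameters this notebook passes to the estimator (`batch-size`, `dtype`, `epochs`, `lr`); the default values, the `--model-dir` flag, and the `parse_args` helper are assumptions for illustration, and the actual `train.py` may differ.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters that SageMaker passes as command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--dtype", type=str, default="float32")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.01)
    # Fall back to a local path when SM_MODEL_DIR is not set (assumption).
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/tmp/model"),
    )
    return parser.parse_args(argv)


# Flags that are omitted keep their defaults.
args = parse_args(["--batch-size", "128", "--lr", "0.05"])
print(args.batch_size, args.lr, args.epochs)
```

Note that `argparse` converts the dash in `--batch-size` to an underscore, so the value is read as `args.batch_size`.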

This script uses the Horovod framework for distributed training.

You can run the following command to view the script run by this notebook:

In [None]:
!pygmentize code/train.py

### Run training in SageMaker

The `MXNet` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we run our training job on one `ml.p3.8xlarge` instance, but this example can be run on one or multiple CPU or GPU instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)).

### SageMaker MXNet Estimator

The Estimator API in the SageMaker Python SDK supports distributed training via the `distribution` parameter (named `distributions` in SDK v1).
To use Horovod, we specify an `mpi` dictionary in that parameter. The dictionary can contain the following keys:
- `enabled`: True/False
- `custom_mpi_options`: string
- `processes_per_host`: integer

Note: `instance_type` and `processes_per_host` are interlinked. Make sure that `processes_per_host` doesn't exceed the number of GPUs available on the instance.


For details on AWS EC2 instance types and their GPU counts, see:
- P3 (https://aws.amazon.com/ec2/instance-types/p3/)
- G4 (https://aws.amazon.com/ec2/instance-types/g4/)
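
The constraint between the instance type and `processes_per_host` can be checked before launching a job. The sketch below is one way to do that; the `GPUS_PER_INSTANCE` table covers only a few instance types, and both it and the `total_workers` helper are assumptions for illustration rather than part of the SageMaker SDK.

```python
# GPU counts for a few SageMaker instance types (not an exhaustive list).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.g4dn.xlarge": 1,
    "ml.g4dn.12xlarge": 4,
}


def total_workers(instance_type, instance_count, processes_per_host):
    """Return the total number of MPI processes for a job.

    Raises ValueError when processes_per_host exceeds the GPUs on one host.
    """
    gpus = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host > gpus:
        raise ValueError(
            f"{instance_type} has {gpus} GPU(s), "
            f"but processes_per_host is {processes_per_host}"
        )
    # Each host runs processes_per_host workers.
    return instance_count * processes_per_host


print(total_workers("ml.p3.8xlarge", instance_count=1, processes_per_host=4))
```

With the settings used in this notebook (one `ml.p3.8xlarge` instance and `processes_per_host` set to 4), the job runs four Horovod workers, one per GPU.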

In [None]:
mpi_options = '-verbose -x orte_base_help_aggregate=0'
distributions = {
    'mpi':{
        'enabled': True,
        'custom_mpi_options': mpi_options,
        'processes_per_host': 4
    }
}
hyperparameters = {
    'batch-size': 64,
    'dtype': 'float32',
    'epochs': 10,
    'lr': 0.01,
}

In [None]:
estimator = MXNet(
    entry_point='train.py',
    source_dir='code',
    role=role,
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    framework_version='1.7.0',
    output_path=output_path,
    py_version='py3',
    distribution=distributions,
    hyperparameters=hyperparameters,
)

After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [None]:
estimator.fit(inputs=channels)

## Host
### Create an inference endpoint

After training, we use the `MXNetModel` class to build and deploy an `MXNetPredictor`. This creates a SageMaker endpoint, a hosted prediction service that we can use to perform inference.

This allows us to perform inference on JSON-encoded multi-dimensional arrays.
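
To make the request format concrete, the sketch below builds a JSON body for a batch of images shaped `(N, 1, 28, 28)`. The `inputs` key matches the request this notebook sends to the endpoint; the `make_payload` helper is hypothetical, and how `inference.py` parses the body is not shown here.

```python
import json


def make_payload(batch):
    """Serialize a nested list of shape (N, 1, 28, 28) as a JSON request body."""
    for image in batch:
        # Each image must have one channel of 28 rows with 28 pixels each.
        if (
            len(image) != 1
            or len(image[0]) != 28
            or any(len(row) != 28 for row in image[0])
        ):
            raise ValueError("each image must have shape (1, 28, 28)")
    return json.dumps({"inputs": batch})


blank = [[[0.0] * 28 for _ in range(28)]]  # one all-zero image, shape (1, 28, 28)
body = make_payload([blank, blank])
decoded = json.loads(body)
print(len(decoded["inputs"]), len(decoded["inputs"][0][0]))
```

When a predictor is configured with `JSONSerializer`, passing a dictionary to `predict` should produce an equivalent body, so a helper like this is mainly useful for checking shapes before sending a request.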

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single `ml.m4.xlarge` instance.

In [None]:
print(estimator.model_data)


In [None]:
#predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

from sagemaker.mxnet import MXNetModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


model = MXNetModel(
    entry_point='inference.py',
    source_dir='code',
    role=role,
    model_data=estimator.model_data,
    framework_version='1.7.0',
    py_version='py3'
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)


## Evaluate 
We can now use this predictor to classify handwritten digits. We reuse the MNIST test images downloaded earlier to `local_data_dir` to evaluate our trained model.

In [None]:
import random 
import boto3
import matplotlib.pyplot as plt
import os
import numpy as np
import gzip
import json

%matplotlib inline


fname = 't10k-images-idx3-ubyte.gz'
target = os.path.join(local_data_dir, fname)
# load the test images, skipping the 16-byte IDX header

with gzip.open(target, 'rb') as f:
    images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)


# randomly sample 16 images to inspect
mask = random.sample(range(images.shape[0]), 16)
samples = images[mask]

# plot the images 
fig, axs = plt.subplots(nrows=1, ncols=16, figsize=(16, 1))

for i, splt in enumerate(axs):
    splt.imshow(samples[i])
    
    
def normalize(x, axis):
    eps = np.finfo(float).eps
    mean = np.mean(x, axis=axis, keepdims=True)
    # avoid division by zero
    std = np.std(x, axis=axis, keepdims=True) + eps
    return (x - mean) / std

samples = normalize(samples.astype(np.float32), axis=(1, 2)) # mean 0; std 1
samples = np.expand_dims(samples, axis=1)
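
The `offset=16` argument above skips the header of the IDX image file: four big-endian 32-bit integers holding a magic number (2051 for image files), the image count, the row count, and the column count. A small sketch that reads this header explicitly is shown below; `read_idx_image_header` is a hypothetical helper and is not part of the notebook's code.

```python
import gzip
import struct


def read_idx_image_header(path):
    """Return (count, rows, cols) from a gzipped IDX3 image file."""
    with gzip.open(path, "rb") as f:
        # ">IIII" = four big-endian unsigned 32-bit integers (16 bytes).
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    return count, rows, cols
```

For the MNIST test file used here, `read_idx_image_header(target)` should return `(10000, 28, 28)`, which matches the `reshape(-1, 28, 28)` call above.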

In [None]:
data = {
    'inputs': samples.tolist()
}
res = predictor.predict(data)

In [None]:
print("Predictions: ", *map(int, res))
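
The cell above prints the predictions without checking them against the ground truth. One possible way to do that, assuming the endpoint returns one integer class per image, is to read the matching labels from `t10k-labels-idx1-ubyte.gz` (an 8-byte header with magic number 2049 and the label count, then one byte per label) and compute the fraction that agree. The helper names below are hypothetical.

```python
import gzip
import struct


def read_idx_labels(path):
    """Return all labels from a gzipped IDX1 label file as a list of ints."""
    with gzip.open(path, "rb") as f:
        # ">II" = big-endian magic number and label count (8 bytes).
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected magic number {magic}, expected 2049")
        return list(f.read(count))


def accuracy(predicted, expected):
    """Fraction of positions where the predicted class equals the label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(int(p) == int(e) for p, e in zip(predicted, expected))
    return matches / len(expected)
```

In this notebook, that could look like `labels = read_idx_labels(os.path.join(local_data_dir, 't10k-labels-idx1-ubyte.gz'))` followed by `accuracy(res, [labels[i] for i in mask])`, since `mask` holds the indices of the sampled test images.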

### (Optional) Clean up
After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()