## Local MNIST Training with MXNet and Gluon

### Pre-requisites

This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. This speeds up iterative testing and debugging while keeping the same familiar Python SDK interface: just change your estimator's `train_instance_type` to `local`. You can also use `local_gpu` if you're on an ml.p2 or ml.p3 notebook instance, but then you must set `train_instance_count=1`, since distributed local GPU training is not yet supported.

In order to use this feature you'll need to install docker-compose (and nvidia-docker if training with a GPU).  Running the setup.sh script below will handle this for you.

**Note, you can only run a single local notebook at one time.**

In [1]:
!/bin/bash ./setup.sh

nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


### Overview

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28-pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model in SageMaker local mode using MXNet and the Gluon API.

In [2]:
import os
import subprocess
import boto3
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

instance_type = 'local'

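try:
    # nvidia-smi exits with status 0 when a GPU and driver are available
    if subprocess.call('nvidia-smi') == 0:
        instance_type = 'local_gpu'
except OSError:
    # nvidia-smi is not installed, so keep the CPU instance type
    pass
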
print("Instance type = " + instance_type)

role = get_execution_role()

Instance type = local_gpu


## Download training and test data

In [3]:
gluon.data.vision.MNIST('./data/train', train=True)
gluon.data.vision.MNIST('./data/test', train=False)

<mxnet.gluon.data.vision.datasets.MNIST at 0x7fdf784e3290>

## Uploading the data

We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use this later when we start the training job.

In [4]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')
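
As a rough illustration of what `inputs` holds: for a directory upload, the returned value is an S3 URI built from the bucket and the `key_prefix`. The sketch below only mimics that string format. The helper name `expected_s3_uri` and the bucket name are hypothetical; the real bucket is whatever the session's default bucket resolves to in your account.

```python
def expected_s3_uri(bucket, key_prefix):
    """Mimic the 's3://<bucket>/<key_prefix>' form of the returned URI."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


# Hypothetical bucket name, for illustration only.
example_uri = expected_s3_uri("my-example-bucket", "data/mnist")
print(example_uri)
```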

## Implement the training function

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that you need to provide a `train` function. When SageMaker calls your function, it passes in arguments that describe the training environment. Check the script below to see how this works.

The script here is an adaptation of the [Gluon MNIST example](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/mnist.py) provided by the [Apache MXNet](https://mxnet.incubator.apache.org/) project. 
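
Because `train` is an ordinary Python function, one way to sanity-check argument handling before involving a container is to call a function with the same signature by hand. The sketch below is a stand-in, not the actual `mnist.py`: the stub body, the `./data` path, and the `algo-1` host name are illustrative assumptions, and no model is trained.

```python
def train(current_host, channel_input_dirs, hyperparameters, hosts, num_gpus):
    """Stub with the same signature as the training entry point."""
    # Read hyperparameters, falling back to defaults when a key is absent.
    batch_size = hyperparameters.get('batch_size', 100)
    epochs = hyperparameters.get('epochs', 10)
    learning_rate = hyperparameters.get('learning_rate', 0.1)

    # Return the resolved settings instead of training a model.
    return {
        'data_dir': channel_input_dirs['training'],
        'batch_size': batch_size,
        'epochs': epochs,
        'learning_rate': learning_rate,
        'is_distributed': len(hosts) > 1,
    }


# Hand-built arguments (hypothetical values) standing in for what SageMaker passes.
config = train(
    current_host='algo-1',
    channel_input_dirs={'training': './data'},
    hyperparameters={'epochs': 20},
    hosts=['algo-1'],
    num_gpus=0,
)
print(config)
```

Only `epochs` is supplied here, so the other hyperparameters fall back to their defaults.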

In [5]:
!cat 'mnist.py'

from __future__ import print_function

import logging
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn
import numpy as np
import json
import time
import argparse, os


logging.basicConfig(level=logging.DEBUG)

# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #


def train(current_host, channel_input_dirs, hyperparameters, hosts, num_gpus):
    # SageMaker passes num_cpus, num_gpus and other args we can use to tailor training to
    # the current container environment, but here we just use simple cpu context.
    ctx = mx.cpu()

    # retrieve the hyperparameters we set in notebook (with some defaults)
    batch_size = hyperparameters.get('batch_size', 100)
    epochs = hyperparameters.get('epochs', 10)
    learning_rate = hyperparameters.get('learning_rate', 0.1)
    momentum = hyperpar

## Run the training script on SageMaker

The `MXNet` class allows us to run our training function in SageMaker local mode. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. The instance type is the only difference from [mnist_with_gluon.ipynb](./mnist_with_gluon.ipynb): instead of `train_instance_type='ml.c4.xlarge'`, we set `train_instance_type='local'`. For local training with a GPU, we could set this to `'local_gpu'`. In this case, `instance_type` was set above based on whether you're running on a GPU instance.

In [6]:
m = MXNet("mnist.py",
          role=role,
          train_instance_count=1,
          train_instance_type=instance_type,
          framework_version="1.3.0",
          hyperparameters={'batch_size': 100,
                         'epochs': 20,
                         'learning_rate': 0.1,
                         'momentum': 0.9,
                         'log_interval': 100})
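
To help read the training log, here is a back-of-the-envelope sketch of how these hyperparameters translate into iterations. It assumes the 60,000 training images are consumed in full batches and that the script logs whenever the batch index is a positive multiple of `log_interval`; neither detail is confirmed by the truncated script shown earlier.

```python
train_images = 60000   # size of the MNIST training split
batch_size = 100
epochs = 20
log_interval = 100

# Number of parameter updates in one pass over the training data.
batches_per_epoch = train_images // batch_size

# Batch indices at which a progress line would be logged (assumed rule).
logged_batches = [i for i in range(batches_per_epoch)
                  if i > 0 and i % log_interval == 0]

total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 600
print(logged_batches)     # [100, 200, 300, 400, 500]
print(total_updates)      # 12000
```

Under these assumptions each epoch produces progress lines for batches 100 through 500, which matches the pattern in the training output.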

After we've constructed our `MXNet` object, we fit it using the data we uploaded to S3. Even though we're in local mode, using S3 as our data source makes sense because it maintains consistency with how SageMaker's distributed, managed training ingests data.

In [7]:
m.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2019-01-24-03-05-16-811


Creating tmplrobwo_algo-1-fpbs4_1 ... done
Attaching to tmplrobwo_algo-1-fpbs4_1
algo-1-fpbs4_1  | 2019-01-24 03:05:27,533 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
algo-1-fpbs4_1  | 2019-01-24 03:05:27,558 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_INPUT_DIR': '/opt/ml/input', 'SM_USER_ENTRY_POINT': 'mnist.py', 'SM_TRAINING_ENV': '{"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1-fpbs4","framework_module":"sagemaker_mxnet_container.training:main","hosts":["algo-1-fpbs4"],"hyperparameters":{"batch_size":100,"epochs":20,"learning_rate":0.1,"log_interval":100,"momentum":0.9},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-mxnet-2019-01-24-03-05-16-811","log_level":20,"master_hostname":"algo-

algo-1-fpbs4_1  | Namespace(model_dir='/opt/ml/model', sm_current_host='algo-1-fpbs4', sm_hosts=[u'algo-1-fpbs4'], sm_hps={u'epochs': 20, u'learning_rate': 0.1, u'momentum': 0.9, u'log_interval': 100, u'batch_size': 100}, sm_num_gpus=1, training='/opt/ml/input/data/training')
algo-1-fpbs4_1  | [Epoch 0 Batch 100] Training: accuracy=0.802376, 5389.958492 samples/s
algo-1-fpbs4_1  | [Epoch 0 Batch 200] Training: accuracy=0.862438, 7024.458215 samples/s
algo-1-fpbs4_1  | [Epoch 0 Batch 300] Training: accuracy=0.885415, 6994.936793 samples/s
algo-1-fpbs4_1  | [Epoch 0 Batch 400] Training: accuracy=0.899825, 5346.127079 samples/s
algo-1-fpbs4_1  | [Epoch 0 Batch 500] Training: accuracy=0.909661, 5380.071832 samples/s
algo-1-fpbs4_1  | [Epoch 0] Training: accuracy=0.916850
algo-1-fpbs4_1  | [Epoch 0] Validation: accuracy=0.964100
algo-1-fpbs4_1  | [Epoch 1 Batch 100] Training: accuracy=0.959010, 5325.626928 samp

algo-1-fpbs4_1  | [Epoch 12] Training: accuracy=0.993483
algo-1-fpbs4_1  | [Epoch 12] Validation: accuracy=0.976400
algo-1-fpbs4_1  | [Epoch 13 Batch 100] Training: accuracy=0.994059, 6671.922373 samples/s
algo-1-fpbs4_1  | [Epoch 13 Batch 200] Training: accuracy=0.994229, 6827.333398 samples/s
algo-1-fpbs4_1  | [Epoch 13 Batch 300] Training: accuracy=0.992724, 4859.299079 samples/s
algo-1-fpbs4_1  | [Epoch 13 Batch 400] Training: accuracy=0.991820, 6833.785193 samples/s
algo-1-fpbs4_1  | [Epoch 13 Batch 500] Training: accuracy=0.991697, 6511.782149 samples/s
algo-1-fpbs4_1  | [Epoch 13] Training: accuracy=0.991583
algo-1-fpbs4_1  | [Epoch 13] Validation: accuracy=0.974700
algo-1-fpbs4_1  | [Epoch 14 Batch 100] Training: accuracy=0.992871, 5272.339195 samples/s
algo-1-fpbs4_1  | [Epoch 14 Batch 200] Training: accuracy=0.994378, 6818.010989 samples/s
algo-1-fpbs4_1  | [Epoch 14 Ba

After training, we use the MXNet object to deploy an MXNetPredictor object. This creates a SageMaker endpoint locally that we can use to perform inference. 

This allows us to perform inference on JSON-encoded multi-dimensional arrays.
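
As a sketch of what such a payload might look like, the snippet below serializes a blank 28x28 image as a nested list and decodes it again. The 1x1x28x28 (batch, channel, height, width) layout is an assumption for illustration; the shape the endpoint actually expects depends on the model's input handling and on what `input.html` produces.

```python
import json

height, width = 28, 28

# One blank grayscale image, wrapped as [batch][channel][row][column].
image = [[[[0.0] * width for _ in range(height)]]]

# Encode to a JSON string, then decode to confirm the nesting round-trips.
payload = json.dumps(image)
decoded = json.loads(payload)

shape = (len(decoded), len(decoded[0]), len(decoded[0][0]), len(decoded[0][0][0]))
print(shape)
```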

In [8]:
predictor = m.deploy(initial_instance_count=1, instance_type=instance_type)

INFO:sagemaker:Creating model with name: sagemaker-mxnet-2019-01-24-03-05-16-811
INFO:sagemaker:Creating endpoint with name sagemaker-mxnet-2019-01-24-03-05-16-811


Attaching to tmpbyo5ch_algo-1-rgrmg_1
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [21] [INFO] Starting gunicorn 19.9.0
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [21] [INFO] Listening at: unix:/tmp/gunicorn.sock (21)
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [21] [INFO] Using worker: gevent
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [33] [INFO] Booting worker with pid: 33
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [41] [INFO] Booting worker with pid: 41
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [42] [INFO] Booting worker with pid: 42
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [53] [INFO] Booting worker with pid: 53
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [61] [INFO] Booting worker with pid: 61
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [62] [INFO] Booting worker with pid: 62
algo-1-rgrmg_1  | [2019-01-24 03:09:50 +0000] [64] [INFO] Booting worker with pid: 64
al

We can now use this predictor to classify handwritten digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the MXNet predictor.

In [9]:
from IPython.display import HTML
HTML(open("input.html").read())

The predictor runs inference on our input data and returns the predicted digit (as a float value, so we convert to int for display).

In [10]:
response = predictor.predict(data)
print(int(response))

algo-1-rgrmg_1  | DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
algo-1-rgrmg_1  | Processing /opt/ml/code
algo-1-rgrmg_1  | Building wheels for collected packages: mnist
algo-1-rgrmg_1  |   Building wheel for mnist (setup.py) ... done
algo-1-rgrmg_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-wzGaSx/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-rgrmg_1  | Successfully built mnist
algo-1-rgrmg_1  | Installing collected packages: mnist
algo-1-rgrmg_1  |   Found existing installation: mnist 1.0.0
algo-1-rgrmg_1  |     Uninstalling mnist-1.0.0:
algo-1-rgrmg_1  |       Successfully uninstalled mnist-1.0.0
algo-1-rgrmg_1  | Successfully installed mnist-1.0.0
3
algo-1-r

## Clean-up

Deleting the local endpoint when you're finished is important since you can only run one local endpoint at a time.

In [11]:
m.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-mxnet-2019-01-24-03-05-16-811


Gracefully stopping... (press Ctrl+C again to force)
