# MNIST training with MXNet and Gluon

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using MXNet and the Gluon API.



In [11]:
import os
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

sess = sagemaker.Session()

role = get_execution_role()

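# Default location: a prefix in the session's default bucket
output_path = 's3://' + sess.default_bucket() + '/mxnet/mnist'
# This assignment overrides the default above; it is the location used in the run shown below
output_path = 's3://' + 'sagemaker-model-artifacts-dev'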



## Implement the entry point for training

The entry point for training is a Python script that provides all 
the code for training an MXNet model. It is used by the SageMaker 
MXNet Estimator (`MXNet` class above) as the entry point for running the training job.

Under the hood, the SageMaker MXNet Estimator downloads a Docker image
with the runtime environment 
specified by the parameters you used to initialize the
estimator class, and it injects the training script into the 
Docker image to be used as the entry point to run the container.

In the rest of the notebook, we use *training image* to refer to the 
docker image specified by the MXNet Estimator and *training container*
to refer to the container that runs the training image. 

This means your training script is very similar to a training script
you might run outside Amazon SageMaker, but it can access the useful environment 
variables provided by the training image. See the [SageMaker Python SDK documentation for MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html?highlight=entry%20point) for some common environment 
variables you might use, and the [SageMaker Training Toolkit's `ENVIRONMENT_VARIABLES.md`](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md) for a complete 
description of all environment variables your training script
can access. 
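
As a rough illustration of the paragraph above, a training script might collect a few of these variables as follows. This is a minimal sketch, not necessarily how `code/train.py` does it: the helper name `read_sagemaker_env` and the local fallback values are assumptions made for this example, while `SM_MODEL_DIR`, `SM_CHANNEL_TRAINING`, `SM_CHANNEL_TESTING`, `SM_CURRENT_HOST` and `SM_NUM_GPUS` are variables described in the toolkit documentation linked above.

```python
import os


def read_sagemaker_env(environ=os.environ):
    """Collect a few SageMaker-provided settings.

    The fallback values are hypothetical defaults so that the same
    script can also be run outside a training container.
    """
    return {
        "model_dir": environ.get("SM_MODEL_DIR", "./model"),
        "train_dir": environ.get("SM_CHANNEL_TRAINING", "./data/train"),
        "test_dir": environ.get("SM_CHANNEL_TESTING", "./data/test"),
        "current_host": environ.get("SM_CURRENT_HOST", "localhost"),
        # Environment variables are strings, so convert numeric ones explicitly
        "num_gpus": int(environ.get("SM_NUM_GPUS", "0")),
    }


if __name__ == "__main__":
    print(read_sagemaker_env())
```

Passing the environment as an argument keeps the helper easy to test with a plain dictionary instead of the real process environment.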

In this example, we will use the training script `code/train.py`
as the entry point for our MXNet Estimator.

The script here is an adaptation of the [Gluon MNIST example](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/mnist.py) provided by the [Apache MXNet](https://mxnet.incubator.apache.org/) project. 

In [None]:
!cat 'code/train.py'

## Initiate the MXNet Estimator

The MXNet estimator allows you to run your training script on SageMaker
infrastructure in a containerized environment. You need to configure
it with the following parameters to set up the environment:

`role`: An IAM role to make AWS service requests.

`instance_type`:
The type of SageMaker instance to run your training script. 
Set it to `local` if you want to run the training job on 
the SageMaker instance you are using to run this notebook.

`instance_count`:
The number of instances you need to run your training job. 
Multiple instances are needed for distributed training.

`output_path`: 
S3 location to save training output (model artifacts and output files)

`framework_version`: The version of MXNet you need to use.

`py_version`: The Python version you need to use.

For more information, see the [`EstimatorBase` API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase).

### Set hyperparameters

In addition, the MXNet estimator allows you to pass command line arguments
to your training script via `hyperparameters`.

In [12]:
# set local_mode to be True if you want to run the training script
# on the machine that runs this notebook

local_mode=False

if local_mode:
    instance_type='local'
else:
    instance_type='ml.c4.xlarge'
    
est = MXNet(
    entry_point='train.py',
    source_dir='code', # directory of your training script
    role=role,
    framework_version='1.7.0',
    py_version='py3',
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,
    hyperparameters={
        'batch-size':100,
        'epochs':20,
        'learning-rate': 0.1,
        'momentum': 0.9,
        'log-interval':100
    }
)


The training container then executes your training script with a command like:

```
python train.py --batch-size 100 --epochs 20 --learning-rate 0.1
    --momentum 0.9 --log-interval 100
```
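
On the script side, these arguments can be read with `argparse`. The following is a minimal sketch under the assumption that the script declares one option per hyperparameter; the default values are placeholders chosen for this example, and the real `code/train.py` may define its options differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that arrive as command line arguments."""
    parser = argparse.ArgumentParser()
    # Defaults are illustrative placeholders, used when an option is omitted
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--log-interval", type=int, default=100)
    return parser.parse_args(argv)


# Simulate the command line shown above
command_line = (
    "--batch-size 100 --epochs 20 --learning-rate 0.1 "
    "--momentum 0.9 --log-interval 100"
)
args = parse_args(command_line.split())

# argparse exposes "--batch-size" as the attribute "batch_size"
print(args.batch_size, args.epochs, args.learning_rate)
```

Note that `argparse` converts the dashes in option names to underscores, so the hyperparameter key `'batch-size'` is read as `args.batch_size`.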

## Set up channels for training and testing data

You need to tell the `MXNet` estimator where to find your training and 
testing data. Each location can be an S3 URI, or a path
in your local file system if you use local mode. In this example,
we use a public S3 bucket.

In [9]:
region_name=sess.boto_region_name

train_data_loc='s3://sagemaker-sample-data-{}/mxnet/mnist/'.format(region_name)
test_data_loc='s3://sagemaker-sample-data-{}/mxnet/mnist'.format(region_name)

channels={
    'training': train_data_loc,
    'testing': test_data_loc
}

The keys of the dictionary `channels` are passed to the training container,
which creates an environment variable `SM_CHANNEL_<KEY NAME>` for each key (with the key name in upper case). 

In this example, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` are created in the training container (see 
how `code/train.py` accesses these variables). For more information,
see [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name).
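
The naming rule can be sketched in a few lines. This is only an illustration: the actual variables are exported by the SageMaker Training Toolkit inside the container, and the helper name `channel_env_vars` is made up for this example. The directory layout `/opt/ml/input/data/<channel name>` follows the toolkit documentation linked above.

```python
def channel_env_vars(channels, data_root="/opt/ml/input/data"):
    """Map each channel name to the environment variable that points at its data.

    Only the keys of `channels` matter here; the S3 locations are what
    SageMaker downloads into the corresponding directories.
    """
    return {
        "SM_CHANNEL_" + name.upper(): data_root + "/" + name
        for name in channels
    }


example_channels = {
    "training": "s3://example-bucket/train",  # placeholder locations
    "testing": "s3://example-bucket/test",
}
for variable, directory in channel_env_vars(example_channels).items():
    print(variable, "->", directory)
```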

If you want, you can create a channel for validation (here `val_data_loc` is a location you would define yourself):
```
channels = {
    'training': train_data_loc,
    'validation': val_data_loc,
    'testing': test_data_loc
}
```
You can then access this channel within your training script via
`SM_CHANNEL_VALIDATION`.


## Run the training script on SageMaker
Now the training container has everything it needs to execute your training
script. You can start the container by calling the `fit` method.

In [13]:
est.fit(inputs=channels)

2020-11-13 21:31:51 Starting - Starting the training job...
2020-11-13 21:31:54 Starting - Launching requested ML instances......
2020-11-13 21:33:04 Starting - Preparing the instances for training......
2020-11-13 21:33:56 Downloading - Downloading input data...
2020-11-13 21:34:45 Training - Training image download completed. Training in progress..
2020-11-13 21:34:46,567 sagemaker-training-toolkit INFO     Imported framework sagemaker_mxnet_container.training
2020-11-13 21:34:46,569 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-11-13 21:34:46,582 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":100,"epochs":20,"learning-rate":0.1,"log-interval":100,"momentum":0.9}', 'SM_USER_ENTRY_POINT': 'train.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interf

[Epoch 0 Batch 200] Training: accuracy=0.860348, 2631.522019 samples/s
[Epoch 0 Batch 300] Training: accuracy=0.884751, 2693.957981 samples/s
[Epoch 0 Batch 400] Training: accuracy=0.900599, 3496.972678 samples/s
[Epoch 0 Batch 500] Training: accuracy=0.910878, 3395.290327 samples/s
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.08117630332708359,Timestamp=1605303307.5658958,IterationNumber=500)
[Epoch 0] Training: accuracy=0.916983
[Epoch 0] Validation: accuracy=0.957600
[Epoch 1 Batch 100] Training: accuracy=0.958812, 3947.766013 samples/s
[Epoch 1 Batch 200] Training: accuracy=0.961592, 3393.971565 samples/s
[Epoch 1 Batch 300] Training: accuracy=0.962226, 3429.240455 samples/s
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.1518828421831131,Timestamp=1605303320.9749928,Iteratio

[Epoch 11] Validation: accuracy=0.972600
[Epoch 12 Batch 100] Training: accuracy=0.991980, 3801.633297 samples/s
DEBUG:root:Writing metric: _RawMetricData(MetricName='softmaxcrossentropyloss0_output_0_GLOBAL',Value=0.00780603289604187,Timestamp=1605303526.5178218,IterationNumber=8500)
[Epoch 12 Batch 200] Training: accuracy=0.992786, 3493.070164 samples/s
[Epoch 12 Batch 300] Training: accuracy=0.992591, 3519.213311 samples/s
[Epoch 12 Batch 400] Training: accuracy=0.992170, 3770.703202 samples/s
[Epoch 12 Batch 500] Training: accuracy=0.991697, 3354.584426 samples/s
[Epoch 12] Training: accuracy=0.991850
[Epoch 12] Validation: accuracy=0.976500
[Epoch 13 Batch 100] Training: accuracy=0.995248, 3948.026130 samples/s
[Epoch 13 Batch 200] Training: accuracy=0.994080, 3481.992744 samples/s
[Epoch 13 Batch 300] Training: accuracy=0.993023, 3516.705235 samples/s
[Epoch 13 Batch 4
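
If you save this log as text, the per-epoch accuracies can be pulled out with a regular expression. The snippet below is a small sketch based only on the line format visible in the output above; the helper name `epoch_accuracies` is an assumption made for this example and is not part of the SageMaker SDK.

```python
import re

# Matches the end-of-epoch summary lines, e.g. "[Epoch 12] Validation: accuracy=0.976500".
# Per-batch lines such as "[Epoch 12 Batch 100] ..." do not match, because
# the pattern requires "]" directly after the epoch number.
EPOCH_ACCURACY = re.compile(
    r"\[Epoch (\d+)\] (Training|Validation): accuracy=([0-9.]+)"
)


def epoch_accuracies(log_text):
    """Return {epoch: {"training": acc, "validation": acc}} found in log_text."""
    results = {}
    for epoch, phase, accuracy in EPOCH_ACCURACY.findall(log_text):
        results.setdefault(int(epoch), {})[phase.lower()] = float(accuracy)
    return results


sample_log = """\
[Epoch 12 Batch 500] Training: accuracy=0.991697, 3354.584426 samples/s
[Epoch 12] Training: accuracy=0.991850
[Epoch 12] Validation: accuracy=0.976500
[Epoch 13 Batch 100] Training: accuracy=0.995248, 3948.026130 samples/s
"""
print(epoch_accuracies(sample_log))
```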

## Inspect and store model data

Now that the training is finished, the model artifact has been saved under 
the `output_path`. Its S3 location is available from the estimator's `model_data` attribute.

In [14]:
model_data = est.model_data
print("Model artifact saved at:\n", model_data)

Model artifact saved at:
 s3://sagemaker-model-artifacts-dev/mxnet-training-2020-11-13-21-31-51-252/output/model.tar.gz
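
To inspect the artifact locally, you can split the S3 URI into a bucket and a key and download the archive. The following is a hedged sketch: the helper names `split_s3_uri` and `download_model` are made up for this example, the download assumes that `boto3` is installed and that your credentials can read the bucket, and the contents of the archive depend on what the training script saved.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


def download_model(uri, local_path="model.tar.gz"):
    """Download the model archive to local_path (requires AWS access)."""
    import boto3  # imported lazily so split_s3_uri works without AWS libraries

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


bucket, key = split_s3_uri(
    "s3://sagemaker-model-artifacts-dev/"
    "mxnet-training-2020-11-13-21-31-51-252/output/model.tar.gz"
)
print(bucket)
print(key)
```

After downloading, the archive can be unpacked with the standard `tarfile` module, for example `tarfile.open("model.tar.gz").extractall("model")`.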


We will store the variable `model_data` with the IPython `%store` magic, which persists it beyond the current notebook kernel. 
In the [next notebook](get_started_with_mnist_deploy.ipynb), you will learn how to retrieve the model artifact and deploy it to a SageMaker
Endpoint.

In [15]:
%store model_data

Stored 'model_data' (str)
