# Training and hosting SageMaker Models using the Apache MXNet Module API

The [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) makes it easy to train and deploy Apache MXNet models. In this example, we train a simple neural network using the Apache MXNet [Module API](https://mxnet.apache.org/api/python/module/module.html) and the MNIST dataset. The MNIST dataset is widely used for handwritten digit classification and consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes, one for each digit. The task is to train a model on the 60,000 training images and then measure its classification accuracy on the 10,000 test images.

### Setup

First we define a few variables that are needed later in this example.

In [1]:
from sagemaker import get_execution_role
from sagemaker.session import Session

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = Session().default_bucket()

# Bucket location where your custom code will be saved in the tar.gz format.
custom_code_upload_location = "s3://{}/mxnet-mnist-example/code".format(bucket)

# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/mxnet-mnist-example/artifacts".format(bucket)

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment.
role = get_execution_role()

### The training script

The `mnist.py` script provides all the code we need for training and hosting a SageMaker model. The script also checkpoints the model at the end of every epoch and saves the model graph, parameters, and optimizer state in the folder `/opt/ml/checkpoints`. If that folder does not exist, the script skips checkpointing. The script is adapted from the Apache MXNet [MNIST tutorial](https://mxnet.incubator.apache.org/tutorials/python/mnist.html).
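The skip-if-missing behavior described above can be sketched with the standard library alone. This is an illustration, not the code in `mnist.py`: the helper name `checkpoint_prefix` and the `mnist` file prefix are hypothetical, and the MXNet callback mentioned in the comment is only one possible way to write a checkpoint after each epoch.

```python
import os
import tempfile

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # folder named in the text above


def checkpoint_prefix(checkpoint_dir=CHECKPOINT_DIR, name="mnist"):
    """Return a file prefix for checkpoints, or None to skip checkpointing.

    `name` is a hypothetical prefix chosen for this sketch.
    """
    if not os.path.isdir(checkpoint_dir):
        return None  # folder missing: checkpointing is skipped
    return os.path.join(checkpoint_dir, name)


# With MXNet installed, a prefix like this could be handed to
# mx.callback.do_checkpoint(prefix) and passed to Module.fit via
# epoch_end_callback, so that a checkpoint is written after every epoch.
with tempfile.TemporaryDirectory() as existing_dir:
    found = checkpoint_prefix(existing_dir)
    missing = checkpoint_prefix(os.path.join(existing_dir, "does-not-exist"))

print(found is not None, missing is None)
```

Returning `None` rather than raising keeps the training loop free of special cases: the caller only registers a checkpoint callback when a prefix comes back.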



In [2]:
!pygmentize mnist.py

import argparse
import gzip
import json
import logging
import os
import struct

import mxnet as mx
import numpy as np


def load_data(path):
    with gzip.open(find_file(path, "labels.gz")) as flbl:
        struct.unpack(">II", flbl.read(8))
        labels = np.fromstring(flbl.read(), dtype=np.int8)
    with gzip.open(find_file(path, "images.gz")) as fimg:
        _, _, r
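The listing above is cut off. The following is not the remainder of `mnist.py`; it is a separate, standard-library sketch of how gzip-compressed MNIST files in the IDX layout are commonly read. It assumes the usual layout: an 8-byte big-endian header (magic number, item count) for labels and a 16-byte header (magic number, count, rows, columns) for images. The function names and the tiny synthetic files are made up for illustration.

```python
import gzip
import os
import struct
import tempfile


def read_idx_labels(path):
    """Read a gzip-compressed IDX label file into a list of ints."""
    with gzip.open(path, "rb") as f:
        _magic, count = struct.unpack(">II", f.read(8))  # two big-endian uint32
        labels = list(f.read())
    if len(labels) != count:
        raise ValueError("label count does not match header")
    return labels


def read_idx_images(path):
    """Read a gzip-compressed IDX image file into (count, rows, cols, bytes)."""
    with gzip.open(path, "rb") as f:
        _magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        pixels = f.read()
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel count does not match header")
    return count, rows, cols, pixels


# Write two tiny synthetic files (2 images of 2x2 pixels) and read them back.
tmp_dir = tempfile.mkdtemp()
labels_path = os.path.join(tmp_dir, "labels.gz")
images_path = os.path.join(tmp_dir, "images.gz")
with gzip.open(labels_path, "wb") as f:
    f.write(struct.pack(">II", 2049, 2) + bytes([7, 3]))
with gzip.open(images_path, "wb") as f:
    f.write(struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8)))

labels = read_idx_labels(labels_path)
count, rows, cols, pixels = read_idx_images(images_path)
print(labels, count, rows, cols)
```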

### SageMaker's MXNet estimator class

The SageMaker ```MXNet``` estimator allows us to run single machine or distributed training in SageMaker, using CPU or GPU-based instances.

When we create the estimator, we pass in the filename of our training script, our IAM execution role, and the S3 locations we defined in the setup section. We also provide a few other parameters. ``instance_count`` and ``instance_type`` (named ``train_instance_count`` and ``train_instance_type`` before version 2 of the SDK) determine the number and type of SageMaker instances that will be used for the training job. The ``hyperparameters`` parameter is a ``dict`` of values that will be passed to your training script -- you can see how to access these values in the ``mnist.py`` script above.
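As a rough illustration of how a training script can pick up these values: in script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument, and also exposes the full set as JSON in the `SM_HPS` environment variable. The sketch below is not the code in `mnist.py`; the function name and the default value are assumptions, and the argument list is simulated rather than supplied by a real training job.

```python
import argparse
import json


def parse_hyperparameters(argv):
    """Parse hyperparameters the way a script-mode entry point typically does."""
    parser = argparse.ArgumentParser()
    # "--learning-rate" matches the key in the estimator's hyperparameters dict;
    # the default value is an assumption made for this sketch.
    parser.add_argument("--learning-rate", type=float, default=0.1)
    # parse_known_args tolerates any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments appended when the script is launched.
args = parse_hyperparameters(["--learning-rate", "0.1"])
print(args.learning_rate)

# The same values serialized as JSON, in the form SM_HPS takes for this job.
sm_hps = '{"learning-rate":0.1}'
hps = json.loads(sm_hps)
print(hps["learning-rate"])
```

Note that `argparse` turns the dash in `--learning-rate` into an underscore, so the value is read as `args.learning_rate`.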

For this example, we will choose one ``ml.m4.xlarge`` instance.

In [3]:
from sagemaker.mxnet import MXNet

mnist_estimator = MXNet(
    entry_point="mnist.py",
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="1.4.1",
    py_version="py3",
    #distribution={"parameter_server": {"enabled": True}},
    hyperparameters={"learning-rate": 0.1},
)

distributions has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


### Running the Training Job

After we've constructed our MXNet object, we can fit it using data stored in S3. Below we run SageMaker training on two input channels: **train** and **test**.

During training, SageMaker makes this data stored in S3 available in the local filesystem where the mnist script is running. The ```mnist.py``` script simply loads the train and test data from disk.
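To make the mapping from channel names to local folders concrete, here is a small sketch. Inside a training container, each channel passed to `fit` is exposed through an environment variable named `SM_CHANNEL_<NAME>` and is conventionally mounted under `/opt/ml/input/data/<name>`. The helper `channel_dir` is hypothetical and is not taken from `mnist.py`.

```python
import os


def channel_dir(channel, environ=os.environ):
    """Return the local directory where SageMaker places a channel's data."""
    default = "/opt/ml/input/data/{}".format(channel)
    return environ.get("SM_CHANNEL_" + channel.upper(), default)


# With no variable set, the conventional default path is used.
train_dir = channel_dir("train", environ={})
# When the variable is set, its value wins.
test_dir = channel_dir("test", environ={"SM_CHANNEL_TEST": "/tmp/test-data"})
print(train_dir, test_dir)
```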

In [4]:
%%time
import boto3

region = boto3.Session().region_name
train_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/train".format(region)
test_data_location = "s3://sagemaker-sample-data-{}/mxnet/mnist/test".format(region)

mnist_estimator.fit({"train": train_data_location, "test": test_data_location})

2021-06-09 21:55:39 Starting - Starting the training job...
2021-06-09 21:55:41 Starting - Launching requested ML instances......
2021-06-09 21:56:50 Starting - Preparing the instances for training......
2021-06-09 21:57:59 Downloading - Downloading input data...
2021-06-09 21:58:24 Training - Downloading the training image..
2021-06-09 21:58:47,583 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2021-06-09 21:58:47,587 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-06-09 21:58:47,603 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"learning-rate":0.1}', 'SM_USER_ENTRY_POINT': 'mnist.py', 'SM_FRAMEWORK_PARAMS': '{"sagemaker_parameter_server_enabled":true}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}', 'SM_INPUT_DATA_CONFIG': '{"test":{"Reco

### Optimize your model with the Neo API
The Neo API lets us optimize our model for a specific hardware type. When calling the ``compile_model()`` function, we specify the target instance family (``ml_m4`` in this example) as well as the S3 location where the compiled model will be stored.

#### Important: If the following command results in a permission error, scroll up and locate the value of the execution role returned by ``get_execution_role()``. The role must have access to the S3 bucket specified in ``output_path``.

In [9]:
output_path = "/".join(mnist_estimator.output_path.split("/")[:-1])
neo_optimize = True
compiled_model = mnist_estimator.compile_model(
    target_instance_family="ml_m4",
    input_shape={"data": [1, 784], "softmax_label": [1]},
    role=role,
    output_path=output_path,
    framework="mxnet",
    framework_version="1.8.0",
)

?..................................................!

Defaulting to the only supported framework/algorithm version: 1.7. Ignoring framework/algorithm version: 1.8.0.
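The `output_path` expression in the compile cell is easy to misread, so here it is in isolation. The bucket name below is a made-up placeholder; in the notebook the real value comes from `Session().default_bucket()`.

```python
# Hypothetical bucket name, used only for this illustration.
artifacts_location = "s3://example-bucket/mxnet-mnist-example/artifacts"

# Same expression as in the compile cell: split on "/", drop the last piece, rejoin.
output_path = "/".join(artifacts_location.split("/")[:-1])
print(output_path)

# With a trailing slash, the dropped piece is the empty string after the slash,
# so the result still ends in "artifacts".
with_slash = "/".join((artifacts_location + "/").split("/")[:-1])
print(with_slash)
```

In other words, the compiled model is written one level above the training artifacts prefix, provided the estimator's output path has no trailing slash.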


### Creating an inference Endpoint

After training, we use the ``MXNet`` estimator object to build and deploy an ``MXNetPredictor``. This creates a SageMaker **Endpoint** -- a hosted prediction service that we can use to perform inference.

The arguments to the ``deploy`` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single ``ml.m4.xlarge`` instance.

In [10]:
import io
import numpy as np


def numpy_bytes_serializer(data):
    # Serialize an array to the bytes of NumPy's .npy format.
    f = io.BytesIO()
    np.save(f, data)
    f.seek(0)
    return f.read()


# Use the .npy serializer only when the model was compiled with Neo.
serializer = None
if neo_optimize:
    serializer = numpy_bytes_serializer

predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    serializer=serializer,
)

------!
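To see what `numpy_bytes_serializer` actually sends to the endpoint, we can run it locally and decode the result again. This is a sketch that needs only NumPy, not a live endpoint. The all-zero sample and its `float32` dtype are assumptions made for illustration; the `[1, 784]` shape mirrors the `input_shape` used for compilation.

```python
import io

import numpy as np


def numpy_bytes_serializer(data):
    # Same logic as the cell above: write the array in .npy format to memory.
    f = io.BytesIO()
    np.save(f, data)
    f.seek(0)
    return f.read()


# One flattened 28x28 image; the values and dtype are placeholders.
sample = np.zeros((1, 784), dtype=np.float32)
payload = numpy_bytes_serializer(sample)

# .npy payloads begin with a fixed magic string and can be decoded with np.load.
restored = np.load(io.BytesIO(payload))
print(payload[:6], restored.shape, restored.dtype)
```

Because the `.npy` header records shape and dtype, the receiving side can rebuild the array without any out-of-band metadata.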

### (Optional) Delete the Endpoint

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [13]:
print("Endpoint name: " + predictor.endpoint_name)

Endpoint name: mxnet-training-ml-m4-2021-06-09-23-15-20-517


In [16]:
predictor.delete_endpoint()