# Training and hosting SageMaker Models using the Apache MXNet Module API

The **SageMaker Python SDK** makes it easy to train and deploy MXNet models. In this example, we train a simple neural network using the Apache MXNet [Module API](https://mxnet.apache.org/api/python/module/module.html) and the driving score dataset.

### Setup

First we need to define a few variables that will be needed later in the example.

In [1]:
!pip install sagemaker==1.38.6
!pip freeze

Collecting sagemaker==1.38.6
  Downloading https://files.pythonhosted.org/packages/ae/8f/c8e7b13cfd4c911aa5ed1974d9b3ba2c14567cd756693b88d18a70eee4f9/sagemaker-1.38.6.tar.gz (219kB)
Collecting boto3>=1.9.213 (from sagemaker==1.38.6)
  Downloading https://files.pythonhosted.org/packages/52/a0/1d1993d62dc06da6e57616a87633408ed82d960ec6902c69afdaf10f840e/boto3-1.9.225-py2.py3-none-any.whl (128kB)
Collecting fabric>=2.0 (from sagemaker==1.38.6)
  Downloading https://files.pythonhosted.org/packages/d7/cb/47feeb00dae857f0fbd1153a61e902e54ed77ccdc578b371a514a3959a19/fabric-2.5.0-py2.py3-none-any.whl (51kB)
Collecting botocore<1.13.0,>=1.12.225 (from boto3>=1.9.213->sagemaker==1.38.6)
  Downloading https://files.pythonhosted.org/packages/19/28/8f0051

  Running setup.py bdist_wheel for sagemaker ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/75/cc/82/52125309cf9f3667c544ed09bc875c94f865e91974e7de4110
Successfully built sagemaker
awscli 1.16.198 has requirement botocore==1.12.188, but you'll have botocore 1.12.225 which is incompatible.
Installing collected packages: botocore, boto3, invoke, fabric, sagemaker
  Found existing installation: botocore 1.12.188
    Uninstalling botocore-1.12.188:
      Successfully uninstalled botocore-1.12.188
  Found existing installation: boto3 1.9.188
    Uninstalling boto3-1.9.188:
      Successfully uninstalled boto3-1.9.188
  Found existing installation: sagemaker 1.33.0
    Uninstalling sagemaker-1.33.0:
      Successfully uninstalled sagemaker-1.33.0
Successfully installed boto3-1.9.225 botocore-1.12.225 fabric-2.5.0 invoke-1.3.0 sagemaker-1.38.6
You are using pip version 10.0.1, however version 19.2.3 is available.
You should consider upgrading via th

In [2]:
from sagemaker import get_execution_role
from sagemaker.session import Session

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = Session().default_bucket()

# Location to save your custom code in tar.gz format.
custom_code_upload_location = 's3://{}/customcode/mxnet'.format(bucket)

# Location where results of model training are saved.
model_artifacts_location = 's3://{}/artifacts'.format(bucket)

In [3]:
# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = get_execution_role()

### The training script

The ``train.py`` script provides all the code we need for training and hosting a SageMaker model. It also checkpoints the model at the end of every epoch, saving the model graph, parameters, and optimizer state in the folder `/opt/ml/checkpoints`. If that folder does not exist, the script skips checkpointing.
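
The "skip checkpointing when the folder is missing" behaviour described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names `resolve_checkpoint_prefix` and `make_checkpoint_callback` and the `model` file prefix are hypothetical. The sketch assumes the Module API's `mx.callback.module_checkpoint` is used to save the symbol, parameters, and optimizer state at the end of each epoch.

```python
import os


def resolve_checkpoint_prefix(checkpoint_dir='/opt/ml/checkpoints', name='model'):
    """Return a checkpoint file prefix, or None when the folder is missing.

    Both the helper and the 'model' prefix are hypothetical names.
    """
    if not os.path.isdir(checkpoint_dir):
        # No checkpoint folder: the caller should skip checkpointing.
        return None
    return os.path.join(checkpoint_dir, name)


def make_checkpoint_callback(module, checkpoint_dir='/opt/ml/checkpoints'):
    """Build an epoch-end callback for Module.fit, or None to skip checkpointing."""
    prefix = resolve_checkpoint_prefix(checkpoint_dir)
    if prefix is None:
        return None
    # Imported lazily so the path check above also works without MXNet installed.
    import mxnet as mx
    # Saves the symbol and params every epoch; save_optimizer_states=True
    # additionally saves the optimizer state.
    return mx.callback.module_checkpoint(module, prefix, period=1,
                                         save_optimizer_states=True)
```

The returned value could then be passed as `epoch_end_callback` to `Module.fit`, which accepts `None` when no callback is wanted.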

In [4]:
print(bucket)

sagemaker-us-east-1-540323643652


In [5]:
!cat train.py

import argparse
import gzip
import json
import logging
import os
import struct
import boto3

import mxnet as mx
import numpy as np

import pickle
# from sklearn.preprocessing import StandardScaler
# from sklearn.utils import shuffle


def load_pickle_file(path):
    with open(path, 'rb') as f:
        data = pickle.load(f)

    return data

def load_data():
    data = load_pickle_file('/opt/ml/code/data.pickle')
    label = load_pickle_file('/opt/ml/code/label.pickle')
    
    return split_data(data, label)

def split_data(data, label):
    X, y = (data, label)
    # Split the dataset: the first 80 rows are for training, the rest for validation.
    train_data = X[:80, :].astype('float32')
    train_label = y[:80]
    val_data = X[80:, :].astype('float32')
    val_label = y[80:]
    return train_data, train_label, val_data, val_label

def build_graph():
    data = mx.sym.var('data')
    fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, name='act1'

### SageMaker's MXNet estimator class

The SageMaker ``MXNet`` estimator allows us to run single-machine or distributed training in SageMaker, using CPU or GPU-based instances.

When we create the estimator, we pass in the filename of our training script, our IAM execution role, and the S3 locations we defined in the setup section. We also provide a few other parameters. ``train_instance_count`` and ``train_instance_type`` determine the number and type of SageMaker instances that will be used for the training job. The optional ``hyperparameters`` parameter is a ``dict`` of values that will be passed to your training script; you can see how to access these values in the ``train.py`` script above.
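
As a rough illustration of how a training script can read such values, the sketch below parses them with `argparse`, which `train.py` imports. The truncated listing does not show how `train.py` actually consumes its hyperparameters, so this is an assumption: it presumes the container passes each entry as a `--name value` command-line argument, whereas older framework containers may instead hand the `dict` to a training function. The names `epochs`, `learning-rate`, and `batch-size` and their defaults are hypothetical.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed as command-line arguments.

    All argument names and defaults here are hypothetical examples.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--learning-rate', type=float, default=0.1)
    parser.add_argument('--batch-size', type=int, default=10)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    args = parse_hyperparameters()
    print('epochs={}, learning_rate={}, batch_size={}'.format(
        args.epochs, args.learning_rate, args.batch_size))
```

On the estimator side, the matching values would be supplied as, for example, ``hyperparameters={'epochs': 50, 'learning-rate': 0.1}``.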

For this example, we will choose one ``ml.m4.xlarge`` instance.

In [6]:
from sagemaker.mxnet import MXNet

drvscore_estimator = MXNet(entry_point='train.py',
                           role=role,
                           output_path=model_artifacts_location,
                           code_location=custom_code_upload_location,
                           train_instance_count=1,
                           train_instance_type='ml.m4.xlarge',
                           framework_version='1.2.1',
                           py_version='py2',
                           dependencies=['data/data.pickle','data/label.pickle'])

No handlers could be found for logger "sagemaker"


### Running the Training Job

After we've constructed our ``MXNet`` object, we can fit it. In this example, ``fit`` is called without any S3 input channels: the dataset is packaged with the training code through the ``dependencies`` argument above, and the training log below reports an empty ``channels`` dictionary.

During training, the ``train.py`` script simply loads ``data.pickle`` and ``label.pickle`` from ``/opt/ml/code`` on the training instance and splits them into training and validation sets.
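
The loading step in the ``train.py`` listing above reads two pickle files and uses the first 80 rows for training and the remainder for validation. The sketch below reproduces that flow locally so it can be tried without a training job. It is only an approximation: plain Python lists stand in for the NumPy arrays used by the script, the 100-row dataset is made up, and a temporary folder stands in for ``/opt/ml/code``. `split_rows` is a hypothetical helper name.

```python
import os
import pickle
import tempfile


def load_pickle_file(path):
    """Load one pickled object from disk, as train.py does."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def split_rows(data, label, n_train=80):
    """Use the first n_train rows for training and the rest for validation."""
    return data[:n_train], label[:n_train], data[n_train:], label[n_train:]


# Build a made-up 100-row dataset in a temporary folder that stands in
# for /opt/ml/code on the training instance.
code_dir = tempfile.mkdtemp()
rows = [[float(i), float(i) * 2.0] for i in range(100)]
labels = [i % 10 for i in range(100)]
for name, obj in (('data.pickle', rows), ('label.pickle', labels)):
    with open(os.path.join(code_dir, name), 'wb') as f:
        pickle.dump(obj, f)

data = load_pickle_file(os.path.join(code_dir, 'data.pickle'))
label = load_pickle_file(os.path.join(code_dir, 'label.pickle'))
train_data, train_label, val_data, val_label = split_rows(data, label)
print('train rows: {}, validation rows: {}'.format(len(train_data), len(val_data)))
```

Because the split is positional rather than shuffled, the order of rows in the pickle files determines which examples end up in the validation set.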

In [7]:
%%time
import boto3

region = boto3.Session().region_name

CPU times: user 16 ms, sys: 13.7 ms, total: 29.7 ms
Wall time: 24.6 ms


In [8]:
drvscore_estimator.fit(job_name='connectedcar-drvscore8')

2019-09-10 17:47:16 Starting - Starting the training job..................
2019-09-10 17:50:14 Starting - Launching requested ML instances.........
2019-09-10 17:51:19 Starting - Preparing the instances for training......
2019-09-10 17:52:44 Downloading - Downloading input data
2019-09-10 17:52:44 Training - Downloading the training image...
2019-09-10 17:53:14 Uploading - Uploading generated training model
2019-09-10 17:53:03,290 INFO - root - running container entrypoint
2019-09-10 17:53:03,290 INFO - root - starting train task
2019-09-10 17:53:03,297 INFO - container_support.training - Training starting
2019-09-10 17:53:10,429 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {}, '_ps_verbose': 0, 'resource_config': {u'hosts': [u'algo-1'], u'network_interface_name': u'eth0', u'current_host': u'algo-1'}, 'user_script_name': u'train.py', 'input_config_dir': '/opt/ml/input/confi

2019-09-10 17:53:11,687 INFO - root - Epoch[30] Train-accuracy=0.225000
2019-09-10 17:53:11,688 INFO - root - Epoch[30] Time cost=0.010
2019-09-10 17:53:11,693 INFO - root - Epoch[30] Validation-accuracy=0.100000
2019-09-10 17:53:11,704 INFO - root - Epoch[31] Train-accuracy=0.225000
2019-09-10 17:53:11,704 INFO - root - Epoch[31] Time cost=0.011
2019-09-10 17:53:11,710 INFO - root - Epoch[31] Validation-accuracy=0.100000
2019-09-10 17:53:11,720 INFO - root - Epoch[32] Train-accuracy=0.250000
2019-09-10 17:53:11,720 INFO - root - Epoch[32] Time cost=0.010
2019-09-10 17:53:11,724 INFO - root - Epoch[32] Validation-accuracy=0.100000
2019-09-10 17:53:11,731 INFO - root - Epoch[33] Train-accuracy=0.250000
2019-09-10 17:53:11,731 INFO - root - Epoch[33] Time cost=0.007
2019-09-10 17:53:11,735 INFO - root - Epoch[33] Validation-accuracy=0.100000
2019-09-10 17:53:11,742 INFO - root

### (Optional) Creating an inference Endpoint

After training, we use the ``MXNet`` estimator object to build and deploy an ``MXNetPredictor``. This creates a SageMaker **Endpoint**, a hosted prediction service that we can use to perform inference.

The arguments to the ``deploy`` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single ``ml.m4.xlarge`` instance.

In [9]:
#predictor = drvscore_estimator.deploy(initial_instance_count=1,
#                                   instance_type='ml.m4.xlarge')

### (Optional) Making an inference request

Now that our Endpoint is deployed and we have a ``predictor`` object, we can use it to run inference.

In [10]:
# Load the data you want to run inference on
#data = ...
#response = predictor.predict(data)

### (Optional) Delete the Endpoint

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [11]:
#print("Endpoint name: " + predictor.endpoint)
#predictor.delete_endpoint()