# MNIST distributed training  

The **SageMaker Python SDK** helps you deploy your models for training and hosting in optimized, production-ready containers in SageMaker. The SageMaker Python SDK is easy to use, modular, extensible, and compatible with TensorFlow and MXNet. This tutorial shows how to create a convolutional neural network model and train it on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using **Script Mode**, a new training format supported by SageMaker.

**Script Mode** supports training with a ``Python`` script, a ``Python`` module, or a shell script. This example uses a ``Python`` script.

In addition, this notebook demonstrates how to perform real-time inference with [SageMaker TensorFlow Serving containers](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the only supported inference method for **Script Mode**. For full documentation on TensorFlow Serving, see the [SageMaker Python SDK documentation](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).


### Set up the environment

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

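# Session object that manages interactions with the SageMaker APIs.
sagemaker_session = sagemaker.Session()

# Name of the IAM role used for training and hosting. When running inside a
# SageMaker notebook instance, get_execution_role() can be used instead.
role = 'SageMakerRole'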

### Training Data

The MNIST dataset has been loaded into the public S3 bucket ``sagemaker-sample-data-us-west-2`` under the prefix ``tensorflow/mnist``. There are four ``.npy`` files under this prefix: ``train_data.npy``, ``eval_data.npy``, ``train_labels.npy``, and ``eval_labels.npy``.

In [None]:
training_data_uri = 's3://sagemaker-sample-data-us-west-2/tensorflow/mnist'
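
As a small illustration of the layout described above, the sketch below splits the training URI into bucket and prefix and lists the four expected object URIs. The helper name `expected_mnist_uris` is hypothetical and not part of the SageMaker SDK; it only manipulates strings and does not contact S3.

```python
from urllib.parse import urlparse

# File names listed in the section above.
MNIST_FILES = ['train_data.npy', 'eval_data.npy',
               'train_labels.npy', 'eval_labels.npy']


def expected_mnist_uris(base_uri):
    """Return (bucket, prefix, uris) for the four MNIST files under base_uri.

    Hypothetical helper for illustration only; it does not check that the
    objects actually exist in S3.
    """
    parsed = urlparse(base_uri)
    bucket = parsed.netloc
    prefix = parsed.path.strip('/')
    uris = ['s3://{}/{}/{}'.format(bucket, prefix, name) for name in MNIST_FILES]
    return bucket, prefix, uris


bucket, prefix, uris = expected_mnist_uris(
    's3://sagemaker-sample-data-us-west-2/tensorflow/mnist')
for uri in uris:
    print(uri)
```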

# Construct a script for distributed training 

The training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path that can be used for data sharing during distributed training and for checkpointing and/or model persistence. In addition, we have added an argument-parsing function to handle training-related variables.

At the end of the training job, we also added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
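
As a rough sketch of the argument handling described above (not the actual contents of ``mnist.py``), the function below reads ``model_dir`` from the command line and falls back to SageMaker environment variables for local paths and cluster information. Apart from ``--model_dir`` and ``SM_MODEL_DIR``, which the text names, the argument names and defaults here are assumptions chosen for illustration, and the bucket in the usage line is hypothetical.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations (illustrative)."""
    parser = argparse.ArgumentParser()

    # Passed in by SageMaker; an S3 path usable for checkpoints and for
    # sharing data between hosts during distributed training.
    parser.add_argument('--model_dir', type=str)

    # Local folder whose contents SageMaker uploads to S3 after training.
    parser.add_argument('--sm-model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))

    # Local folder holding the downloaded "training" channel (assumed name).
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING'))

    # Cluster information; SM_HOSTS is a JSON-encoded list of host names.
    parser.add_argument('--hosts', type=json.loads,
                        default=os.environ.get('SM_HOSTS', '[]'))
    parser.add_argument('--current-host', type=str,
                        default=os.environ.get('SM_CURRENT_HOST'))

    # parse_known_args ignores any extra hyperparameters instead of failing.
    args, _ = parser.parse_known_args(argv)
    return args


# Hypothetical bucket name, for illustration only.
args = parse_args(['--model_dir', 's3://example-bucket/checkpoints'])
print(args.model_dir, args.sm_model_dir)
```

Reading defaults from the environment keeps the script runnable outside SageMaker, where those variables are not set.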

Here is the entire script:

In [None]:
!cat 'mnist.py'

## Create a training job using the sagemaker.TensorFlow estimator

The ``sagemaker.TensorFlow`` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job.

In [None]:
from sagemaker.tensorflow import TensorFlow


mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             train_instance_count=2,
                             train_instance_type='ml.p2.xlarge',
                             framework_version='1.12',
                             py_version='py3',
                             distributions={'parameter_server': {'enabled': True}},
                             base_job_name='test-tf')

mnist_estimator.fit(training_data_uri)

The **```fit```** method creates a training job on two **ml.p2.xlarge** instances. The logs above show the instances training, evaluating, and incrementing the number of **training steps**.

At the end of training, the training job generates a saved model for TensorFlow Serving.
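
To make the parameter-server setup requested through ``distributions`` more concrete, the sketch below builds a ``TF_CONFIG``-style cluster description for a list of hosts. ``TF_CONFIG`` is the environment variable TensorFlow reads for distributed training. The role assignment (first host as master, remaining hosts as workers, every host also running a parameter server), the port numbers, and the host names ``algo-1`` and ``algo-2`` are assumptions for illustration, not a statement of what the SageMaker container does internally.

```python
import json


def build_tf_config(hosts, current_host, ps_task=False,
                    worker_port=2222, ps_port=2223):
    """Build a TF_CONFIG-style dict for a parameter-server cluster (sketch)."""
    if current_host not in hosts:
        raise ValueError('current_host must be one of hosts')

    def addresses(names, port):
        return ['{}:{}'.format(name, port) for name in names]

    # Assumed layout: first host is the master, the rest are workers.
    cluster = {'master': addresses(hosts[:1], worker_port)}
    if len(hosts) > 1:
        cluster['worker'] = addresses(hosts[1:], worker_port)
        # Assumed: every host also runs a parameter-server process.
        cluster['ps'] = addresses(hosts, ps_port)

    # Describe the role of the process running on current_host.
    if ps_task:
        task = {'type': 'ps', 'index': hosts.index(current_host)}
    elif current_host == hosts[0]:
        task = {'type': 'master', 'index': 0}
    else:
        task = {'type': 'worker', 'index': hosts.index(current_host) - 1}

    return {'cluster': cluster, 'task': task}


config = build_tf_config(['algo-1', 'algo-2'], 'algo-2')
print(json.dumps(config, indent=2))
```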

# Deploy the trained model to prepare for predictions

The ``deploy()`` method creates a SageMaker model, which is then used to create an endpoint that serves prediction requests in real time.

In [None]:
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.p2.8xlarge')

# Invoking the endpoint

Let's download the training data and use it as input for inference.

In [None]:
import numpy as np

!aws s3 cp s3://sagemaker-sample-data-us-west-2/tensorflow/mnist/train_data.npy train_data.npy
!aws s3 cp s3://sagemaker-sample-data-us-west-2/tensorflow/mnist/train_labels.npy train_labels.npy

train_data = np.load('train_data.npy')
train_labels = np.load('train_labels.npy')


``Python`` lists or ``numpy`` arrays can be used as input for inference. In addition, TensorFlow Serving can process multiple items at once. You can find the complete documentation on all supported input formats [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint).
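
As a rough illustration of sending multiple items at once, the sketch below serializes a batch of inputs into the `{"instances": [...]}` JSON shape used by the TensorFlow Serving REST predict API. The helper name `to_instances_payload` and the tiny stand-in inputs are hypothetical. With the `predictor` created earlier you would normally pass the data directly and let the SDK handle serialization.

```python
import json


def to_instances_payload(batch):
    """Serialize a batch of inputs as a TensorFlow Serving 'instances' body.

    Accepts nested Python lists, or any object with a tolist() method
    (for example a numpy array).
    """
    if hasattr(batch, 'tolist'):
        batch = batch.tolist()
    return json.dumps({'instances': batch})


# Two tiny stand-in "images" of 4 values each, not real MNIST rows.
payload = to_instances_payload([[0.0, 0.5, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
print(payload)
```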

In [None]:
predictions = predictor.predict(train_data[:50])
for i in range(0, 50):
    prediction = predictions['predictions'][i]['classes']
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))
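
The per-item comparison above can also be summarized as a single accuracy figure. The sketch below assumes the response shape used in the loop (a `predictions` list whose entries carry a `classes` field). The helper `accuracy_from_response` and the fake response are hypothetical and only serve as an illustration.

```python
def accuracy_from_response(response, labels):
    """Return the fraction of predictions whose 'classes' value equals the label."""
    predicted = [item['classes'] for item in response['predictions']]
    if len(predicted) != len(labels):
        raise ValueError('number of predictions and labels differ')
    if not predicted:
        return 0.0
    # Count positions where the predicted class matches the label.
    matches = sum(int(p == l) for p, l in zip(predicted, labels))
    return matches / len(predicted)


# Fake response with the same structure as the endpoint output above.
fake_response = {'predictions': [{'classes': 5}, {'classes': 0}, {'classes': 4}]}
print(accuracy_from_response(fake_response, [5, 0, 1]))
```

With the objects from the cell above, this could be called as `accuracy_from_response(predictions, train_labels[:50])`.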

# Deleting the endpoint

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)