# Gluon CIFAR-10 Trained in Local Mode
_**ResNet model in Gluon trained locally in a notebook instance**_

---


_This notebook was created and tested on an ml.p3.8xlarge notebook instance._

## Setup

Import the required libraries and set the IAM role ARN.

In [1]:
%%time
import sagemaker
import boto3
import pandas
import re
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.mxnet import MXNet

role = "arn:aws:iam::437242975833:role/service-role/AmazonSageMaker-ExecutionRole-20180711T133970"
sagemaker_session = sagemaker.Session()
# inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')


bucket='sagemaker-crops-corn' # customize to your bucket

training_image = get_image_uri(boto3.Session().region_name, 'image-classification')

CPU times: user 770 ms, sys: 821 ms, total: 1.59 s
Wall time: 561 ms


Install the prerequisites for local training.

In [2]:
!/bin/bash setup.sh

The user has root access.


TEST1
nvidia-docker2 already installed. We are good to go!


---

## Data

We use the helper scripts to download CIFAR-10 training data and sample images.

In [3]:
from cifar10_utils import download_training_data
download_training_data()

downloading training data...
done


We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies that location; we will use it later when we start the training job.

Even though we are training within our notebook instance, we'll continue to use the S3 data location since it will allow us to easily transition to training in SageMaker's managed environment.

In [4]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-437242975833/data/DEMO-gluon-cifar10
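As the output shows, `inputs` is an ordinary S3 URI string. The following is a minimal sketch of splitting such a URI into its bucket and key prefix using only the standard library. The helper name `split_s3_uri` is hypothetical and not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, key_prefix).

    Hypothetical helper for illustration; not a SageMaker SDK function.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    # netloc holds the bucket name; path holds the key prefix with a leading '/'
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_prefix = split_s3_uri(
    "s3://sagemaker-us-east-1-437242975833/data/DEMO-gluon-cifar10"
)
print("bucket: {}".format(bucket_name))
print("prefix: {}".format(key_prefix))
```

Keeping the bucket and prefix separate can be handy when you later want to list or clean up the uploaded objects.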


---

## Script

We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your function, it will pass in arguments that describe the training environment. Check the script below to see how this works.

The network itself is a pre-built version contained in the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html).

In [5]:
!cat 'cifar10.py'

from __future__ import print_function

import json
import logging
import os
import time

import mxnet as mx
from mxnet import autograd as ag
from mxnet import gluon
from mxnet.gluon.model_zoo import vision as models


# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #

def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir, hyperparameters, **kwargs):
    # retrieve the hyperparameters we set in notebook (with some defaults)
    batch_size = hyperparameters.get('batch_size', 128)
    epochs = hyperparameters.get('epochs', 100)
    learning_rate = hyperparameters.get('learning_rate', 0.1)
    momentum = hyperparameters.get('momentum', 0.9)
    log_interval = hyperparameters.get('log_interval', 1)
    wd = hyperparameters.get('wd', 0.0001)

    if len(hosts) == 1:
        kvstore = 'device' if num_gpus > 0 else 'local'
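
The visible part of the script follows a common pattern: each hyperparameter falls back to a default when the notebook does not set it, and a single-host job picks its `kvstore` from the GPU count. Below is a minimal sketch of just that logic, runnable without MXNet. The function name `resolve_settings` is hypothetical, and the multi-host branch is deliberately left unset because the excerpt above is cut off before it.

```python
def resolve_settings(hyperparameters, hosts, num_gpus):
    """Mirror the default handling and single-host kvstore choice shown above.

    Hypothetical helper for illustration; the real script does this inline
    inside train().
    """
    settings = {
        'batch_size': hyperparameters.get('batch_size', 128),
        'epochs': hyperparameters.get('epochs', 100),
        'learning_rate': hyperparameters.get('learning_rate', 0.1),
        'momentum': hyperparameters.get('momentum', 0.9),
        'log_interval': hyperparameters.get('log_interval', 1),
        'wd': hyperparameters.get('wd', 0.0001),
    }
    if len(hosts) == 1:
        # single host: aggregate gradients on GPU if any are present
        settings['kvstore'] = 'device' if num_gpus > 0 else 'local'
    else:
        # multi-host case is not shown in the excerpt, so it is left unset here
        settings['kvstore'] = None
    return settings


# The hyperparameters this notebook passes to the estimator
example = resolve_settings(
    {'batch_size': 1024, 'epochs': 5, 'learning_rate': 0.8, 'momentum': 0.9},
    hosts=['algo-1'],
    num_gpus=4,
)
print(example)
```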
 

---

## Train (Local Mode)

The `MXNet` estimator will create our training job. To switch from training in SageMaker's managed environment to training within a notebook instance, just set `train_instance_type` to `local_gpu`.

In [6]:
m = MXNet('cifar10.py',
          role=role, 
          train_instance_count=1,
          train_instance_type='local_gpu',
          framework_version='1.1.0',
          hyperparameters={'batch_size': 1024,
                           'epochs': 5,
                           'learning_rate': 0.8,
                           'momentum': 0.9},
          py_version = "py3")
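
Because the only difference between local and managed training is the instance type string, the choice can be isolated in a small helper. This is a sketch under assumptions: the function name `choose_instance_type` is hypothetical, `'local'` is the SageMaker Python SDK's CPU counterpart of `'local_gpu'`, and `'ml.p3.2xlarge'` is only an example of a managed instance type.

```python
def choose_instance_type(run_locally, use_gpu=True, managed_type='ml.p3.2xlarge'):
    """Return the value to pass as train_instance_type.

    Hypothetical helper; 'ml.p3.2xlarge' is an illustrative default only.
    """
    if run_locally:
        # local mode runs the training container on the notebook instance itself
        return 'local_gpu' if use_gpu else 'local'
    # otherwise train on a managed SageMaker instance
    return managed_type


print(choose_instance_type(run_locally=True))
print(choose_instance_type(run_locally=True, use_gpu=False))
print(choose_instance_type(run_locally=False))
```

The result could then be passed as `train_instance_type` when constructing the estimator, leaving the rest of the notebook unchanged.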
        

After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [19]:
m.fit(inputs)

Creating tmpqioifl1x_algo-1-67dsm_1 ... done
Attaching to tmpqioifl1x_algo-1-67dsm_1
algo-1-67dsm_1  | 2019-07-15 20:35:30,378 INFO - root - running container entrypoint
algo-1-67dsm_1  | 2019-07-15 20:35:30,379 INFO - root - starting train task
algo-1-67dsm_1  | 2019-07-15 20:35:30,400 INFO - container_support.training - Training starting
algo-1-67dsm_1  | 2019-07-15 20:35:30,758 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'channels': {'training': {'TrainingInputMode': 'File'}}, '_ps_port': 8000, 'channel_dirs': {'training': '/opt/ml/input/data/training'}, 'current_host': 'algo-1-67dsm', 'resource_config': {'hosts': ['algo-1-67dsm'], 'current_host': 'algo-1-67dsm'}, '_scheduler_host': 'algo-1-67dsm', 'hyperparameters': {'batch_size': 1024, 'epochs': 5, 'momentum': 0.9, 'sagemaker_submit_directory': 's3://sagemaker-us-east-1-437242975833/sagemaker-mxnet-2019-07-15-20-32-44-985/source/sourcedir.tar.gz', 'sagemaker_job_name': 's

---

## Host

After training, we use the MXNet estimator object to deploy an endpoint. Because we trained locally, we'll also deploy the endpoint locally. The predictor object returned by `deploy` lets us call the endpoint and perform inference on our sample images.

In [20]:
predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')

Attaching to tmp2g0l4dal_algo-1-zl0u5_1
algo-1-zl0u5_1  | 2019-07-15 20:37:49,544 INFO - root - running container entrypoint
algo-1-zl0u5_1  | 2019-07-15 20:37:49,545 INFO - root - starting serve task
algo-1-zl0u5_1  | 2019-07-15 20:37:49,545 INFO - container_support.serving - reading config
algo-1-zl0u5_1  | Downloading s3://sagemaker-us-east-1-437242975833/sagemaker-mxnet-2019-07-15-20-32-44-985/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-zl0u5_1  | 2019-07-15 20:37:49,575 INFO - botocore.credentials - Found credentials in environment variables.
algo-1-zl0u5_1  | 2019-07-15 20:37:50,034 INFO - container_support.serving - loading framework-specific dependencies
algo-1-zl0u5_1  | 2019-07-15 20:37:50,382 INFO - container_support.serving - starting nginx
algo-1-zl0u5_1  | 2019-07-15 20:37:50,397 INFO - container_support.serving - starting gunicorn
algo-1-zl0u5_1  | 2019-07-15 20:37:50,400 INFO - container_

### Evaluate

We'll use these CIFAR-10 sample images to test the service:

<img style="display: inline; height: 32px; margin: 0.25em" src="images/airplane1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/automobile1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/bird1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/cat1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/deer1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/dog1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/frog1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/horse1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/ship1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/truck1.png" />



In [36]:
# Load the CIFAR-10 samples and convert them into a format we can use with the prediction endpoint
from cifar10_utils import read_images

filenames = ['images/airplane1.png',
             'images/bird1.png',
             'images/cat1.png',
             'images/deer1.png',
             'images/dog1.png',
             'images/frog1.png',
             'images/horse1.png',
             'images/ship1.png',
             'images/truck1.png',
             'images/test.png',
             'images/airplane.png',
             'images/frogadded.png']

image_data = read_images(filenames)

The predictor runs inference on our input data and returns the predicted class label as a float, which we convert to an int for display.

In [37]:
for i, img in enumerate(image_data):
    response = predictor.predict(img)
    print('image {}: class: {}'.format(i, int(response)))

image 0: class: 8
image 1: class: 7
image 2: class: 5
image 3: class: 6
image 4: class: 5
image 5: class: 5
image 6: class: 7
image 7: class: 8
image 8: class: 1
image 9: class: 5
image 10: class: 7
image 11: class: 6
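
The integer classes printed above are easier to read as label names. The sketch below assumes the model uses the standard CIFAR-10 class ordering (0 = airplane through 9 = truck). This notebook does not state that ordering explicitly, so treat it as an assumption. The names `CIFAR10_LABELS` and `label_name` are hypothetical.

```python
# Standard CIFAR-10 class order; assumed (not confirmed by this notebook)
# to match the indices the endpoint returns.
CIFAR10_LABELS = [
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck',
]


def label_name(class_index):
    """Map a predicted class index (int or float) to its CIFAR-10 label."""
    index = int(class_index)
    if not 0 <= index < len(CIFAR10_LABELS):
        raise ValueError("class index out of range: {}".format(class_index))
    return CIFAR10_LABELS[index]


# The first three predictions from the output above
for i, predicted in enumerate([8, 7, 5]):
    print('image {}: class: {} ({})'.format(i, predicted, label_name(predicted)))
```

In the prediction loop, `label_name(response)` could replace `int(response)` to print names instead of indices.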


---

## Cleanup

After you have finished with this example, remember to delete the prediction endpoint. Only one local endpoint can be running at a time.

In [12]:
m.delete_endpoint()

Gracefully stopping... (press Ctrl+C again to force)
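
Since only one local endpoint can run at a time, it can help to guarantee that cleanup happens even if inference raises an error. The following is one possible sketch using a context manager. The name `local_endpoint` is hypothetical, and it assumes the estimator exposes `deploy` and `delete_endpoint` as used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def local_endpoint(estimator, instance_type='local_gpu'):
    """Deploy an endpoint and always delete it when the block exits.

    Hypothetical helper; relies only on the deploy() and delete_endpoint()
    calls already used in this notebook.
    """
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type=instance_type)
    try:
        yield predictor
    finally:
        # runs on normal exit and on exceptions alike
        estimator.delete_endpoint()


# Usage with the estimator and images from this notebook (not executed here):
# with local_endpoint(m) as predictor:
#     for i, img in enumerate(image_data):
#         print('image {}: class: {}'.format(i, int(predictor.predict(img))))
```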
