# Object Detection using GluonCV on Amazon SageMaker

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Data Preparation and Exploration](#[Optional]-Data-Preparation-and-Exploration)
4. [Experimentation and Prototyping](#Experimentation-and-Prototyping)
5. [Train the model](#Train-the-model)
6. [Host the model](#Host-the-model)
7. [Test the model](#Test-the-model)
8. [Clean up](#Clean-up)

## Introduction

Object detection is the process of identifying and localizing objects in an image. A typical object detection solution takes an image as input and provides a bounding box on the image where an object of interest is found. It also identifies what type of object the box encapsulates. To create such a solution, we need to acquire and process a training dataset, then create and set up a training job so that the algorithm can learn from the dataset. Finally, we can host the trained model on an endpoint, to which we can supply images.

This notebook is an end-to-end example showing how GluonCV's object detection algorithms can be used with Amazon SageMaker and a public dataset of bird images. We demonstrate how to train and host an object detection model based on the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset. GluonCV provides variants of the SSD, YOLOv3 and Faster RCNN algorithms for object detection. In this notebook we demonstrate YOLOv3 on mobilenet1.0 with weights pretrained on the [VOC](http://host.robots.ox.ac.uk/pascal/VOC/) dataset.

![Sample results detecting a pair of goldfinch on a feeder](./goldfinch_detections.png)


## Setup

Before preparing the data, there are some initial steps required for setup.

First, you should be running this notebook on a **ml.c5d.4xlarge** or larger instance if you plan to run the training script locally on your notebook. Otherwise, you'll need to lower the batch size to avoid out-of-memory errors.

Any GPU instance, such as the ml.p3.2xlarge, will do and in practice is the ideal choice for a notebook instance.

Run the following cells to install and import the dependencies for this notebook. Most of the libraries that you need for MXNet and Gluon development are already packaged into the conda_mxnet_p36 environment.

----

In [3]:
!pip install --upgrade pip
!pip install matplotlib

Collecting pip
  Using cached https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.2.2
Collecting matplotlib
  Using cached matplotlib-3.3.1-cp36-cp36m-manylinux1_x86_64.whl (11.6 MB)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (88 kB)
Collecting certifi>=2020.06.20
  Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Collecting cycler>=0.10
  Using cached cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Installing collected packages: kiwisolver, certifi, cycler, matplotlib
  Attempting uninstall: certifi
    Found existing installation: certifi 2019.11.28
    Uninstalling certifi-2019.11.28:
      Successfully uninstalled certifi-2019.11.28
Successfully installed certifi-

In [4]:
import os
import boto3
import time
from datetime import datetime

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import mxnet as mx
from mxnet import autograd, gluon

import sagemaker
from sagemaker.mxnet import MXNet, MXNetModel
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

role = sagemaker.get_execution_role()

GluonCV isn't pre-installed into this environment. However, most of the dependencies required by GluonCV have already been packaged into the **conda_mxnet_p36** environment. 

This notebook has requirements that we will have to install manually:
* MXNet 1.6+
* GluonCV libraries

Installation instructions for GluonCV are provided [here](https://gluon-cv.mxnet.io/#installation)

---
MXNet dependencies are GPU dependent, so we'll check for the presence of GPU devices first.

In [5]:
USE_GPU = True
try:
    a = mx.nd.zeros((1,), ctx=mx.gpu(0))
    ctx = [mx.gpu(0)]
    print('GPU device is available')
except Exception:  # fall back to CPU if the GPU allocation fails
    ctx = [mx.cpu()]
    USE_GPU = False
    print('Using CPU on local machine. GPU device was not detected')

GPU device is available


Next we...
* Upgrade to MXNet 1.6, installing the CUDA 10.1 build if a GPU is available and the CPU (MKL) build otherwise.
* Install GluonCV

In [6]:
FRAMEWORK_VERSION = "1.6.0"
GLUONCV_COMPATIBILITY = {
    "1.5.0":"0.6.0",
    "1.6.0":"0.8.0"
}
start = time.time()

if USE_GPU :
    cmd = 'pip install mxnet-cu101mkl=={} --upgrade'.format(FRAMEWORK_VERSION)
else :
    cmd ='pip install mxnet-mkl=={} --upgrade'.format(FRAMEWORK_VERSION)

print("Changing MXNet: {}".format(cmd))
os.system(cmd)

cmd = "pip install gluoncv=={} --upgrade".format(GLUONCV_COMPATIBILITY[FRAMEWORK_VERSION])
print("Install GluonCV: {}".format(cmd))
os.system(cmd)

end = time.time()

print("GluonCV dependencies were installed in {:3f} seconds.".format(end-start))

Changing MXNet: pip install mxnet-cu101mkl==1.6.0 --upgrade
Install GluonCV: pip install gluoncv==0.8.0 --upgrade
GluonCV dependencies were installed in 2.570071 seconds.
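
The cell above ignores the return value of `os.system`, so a failed install would go unnoticed. As a minimal sketch (not part of the original lab), the same commands can be built as argument lists and run with `subprocess.run(..., check=True)` against the interpreter that backs the current kernel. The function names `build_install_commands` and `run_install` are hypothetical; the package names and version pairs are the ones used in the cell above.

```python
import subprocess
import sys

# Version pairs copied from the cell above.
GLUONCV_COMPATIBILITY = {"1.5.0": "0.6.0", "1.6.0": "0.8.0"}


def build_install_commands(framework_version, use_gpu):
    """Return the pip commands as argument lists (nothing is executed here)."""
    mxnet_pkg = "mxnet-cu101mkl" if use_gpu else "mxnet-mkl"
    gluoncv_version = GLUONCV_COMPATIBILITY[framework_version]
    pip = [sys.executable, "-m", "pip", "install", "--upgrade"]
    return [
        pip + ["{}=={}".format(mxnet_pkg, framework_version)],
        pip + ["gluoncv=={}".format(gluoncv_version)],
    ]


def run_install(commands):
    """Run each command; raises CalledProcessError if pip exits with an error."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_install_commands("1.6.0", use_gpu=True)
for cmd in commands:
    # Print everything after the interpreter path for readability.
    print(" ".join(cmd[1:]))
```

Calling `run_install(commands)` would perform the installation; it is left out here so that the sketch has no side effects.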


GluonCV should now be available. Import the required libraries.

In [7]:
import gluoncv as gcv
from gluoncv.utils import download, viz

In the next section, we'll create an optimized dataset (RecordIO format) for training our YOLOv3 model. The process can take 15 minutes.

It's up to you whether you go through the steps of pre-processing your dataset into RecordIO format. Alternatively, I've provided pre-processed datasets for you to use.

Run the cell below to set the locations of the pre-processed train and validation datasets. Note that MXNet doesn't require RecordIO format, but there are various benefits as described below.

In [8]:
SAMPLE_TRAINING_DATASET = 's3://reinvent2018-sagemaker-pytorch/datasets/caltech-birds/recordio/train'
SAMPLE_TEST_DATASET = 's3://reinvent2018-sagemaker-pytorch/datasets/caltech-birds/recordio/test'

# [Optional] Data Preparation and Exploration

---
This section is **optional**. For the purpose of making this lab as efficient as possible, data sets have already been prepared for you in MXNet [RecordIO format](https://mxnet.incubator.apache.org/versions/master/faq/recordio.html), which has various [benefits](https://mxnet.incubator.apache.org/versions/master/faq/recordio.html) including performance enhancements. The following are steps that were taken to produce training and validation samples in RecordIO format. Take note of the utility functions that MXNet provides for format conversion as well as the native data loaders. These are great features that reduce data wrangling work, and aren't provided by most frameworks.

---
This [link](https://cv-workshop-2.notebook.us-west-2.sagemaker.aws/notebooks/aws-cv-jumpstarter/notebooks/lab3b-optional-generate-recordio-files.ipynb) will launch another notebook, which will guide you through this optional section.

# Experimentation and Prototyping

The notebook environment is an ideal environment for rapid prototyping and experimentation. Run the cell below to preview the YOLOv3 training script that we will use to train the model.

The script is no different from one you would run outside of the Amazon SageMaker environment. The only exception is that the script contains conditions that set parameters from environment variables, which are present when we run remote SageMaker training. They're only necessary if you plan to refactor your script to take advantage of the zero-setup SageMaker training capabilities.
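
To make the environment-variable pattern concrete, here is a minimal sketch of how a training script can default its arguments to the values SageMaker sets inside the training container (`SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`) while still accepting explicit command-line values when run locally. This is an illustration only, not the provided YOLOv3 script; the fallback `./models` and the argument subset are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse arguments, defaulting to SageMaker's environment variables when set."""
    parser = argparse.ArgumentParser()
    # Inside a SageMaker training container these variables are defined;
    # on a laptop or notebook they are usually absent, so we fall back.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./models"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Local run: paths are passed explicitly on the command line.
args = parse_args(["--train", "./data/train", "--test", "./data/test"])
print(args.model_dir, args.train, args.test, args.epochs)
```

With this arrangement the same file runs unchanged in both places: explicit flags win locally, and the container's environment fills in the paths remotely.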

In [11]:
!pygmentize ../scripts/gluoncv_yolo3-train.py

import argparse
import os
import logging
import time
import numpy as np
import mxnet as mx
import pandas as pd
from mxnet import nd
from mxnet import gluon
from mxnet import autograd
import gluoncv as gcv
from gluoncv import data as gdata
from gluoncv import utils as

The command below runs the above script on this notebook. The script is configured to run for only one epoch because a single epoch can take over 5 minutes on CPU.

The YOLOv3 script has been configured to use mobilenet1.0 (versus darknet), and will be initialized with weights pretrained on the VOC dataset.

It's **optional** to run this script locally.

In [10]:
!mkdir ./models
!python ../scripts/gluoncv_yolo3-train.py \
--model-dir ./models \
--train ../data/caltech_birds_sample/recordio/train \
--test ../data/caltech_birds_sample/recordio/test \
--lr 0.001 \
--opt adam \
--local True \
--pre-trained-weights VOC \
--num-workers 1 \
--save-format symbolic \
--network mobilenet1.0 --syncbn True --batch-size 4 --epochs 1

mkdir: cannot create directory './models': File exists
INFO:root:Namespace(batch_size=4, data_prefix='', data_shape=416, epochs=1, gpus='0', label_smooth=False, local=True, log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,180', lr_decay_period=0, lr_mode='step', mixup=False, model_dir='./models', momentum=0.9, network='mobilenet1.0', no_mixup_epochs=20, no_random_shape=False, no_wd=False, num_samples=150, num_workers=1, obj_metric='mAP', opt='adam', pre_trained_weights='VOC', resume='', save_format='symbolic', save_interval=50, save_prefix='yolo3_mobilenet1.0_custom', seed=233, start_epoch=0, syncbn=True, test='../data/caltech_birds_sample/recordio/test', train='../data/caltech_birds_sample/recordio/train', val_interval=1, warmup_epochs=0, warmup_lr=0.0, wd=0.0005)
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0] Training cost: 10.102, ObjLoss=2491.172, BoxCenterLoss=2.382, BoxScaleLoss=1.335, ClassLoss=3.070
INFO:root:[Epoch 0] Validation: 
017.Cardinal=0.0
036

## Validate your Scripts

You may eventually need to perform large-scale training that is beyond what can be done from your local notebook. Before you leverage SageMaker training capabilities, you can run your training script in **"local mode"** to validate that your script will run within the SageMaker training environment.

Local mode will deploy a SageMaker training container locally, and run your script. In the following steps we will set up our notebook to train in "local mode."

### Installing Dependencies on SageMaker Managed Containers

---

One of the main benefits of the "Bring-Your-Own-Script" (BYOS) integration strategy is that it removes a lot of heavy lifting compared to the alternative. For instance, integrating a custom algorithm written in R with Amazon SageMaker is feasible. However, at this point in time, the BYOS integration option isn't supported for R. Integrating your R algorithm would require you to build an Amazon SageMaker compatible container to house your custom [algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html).

MXNet, however, supports BYOS: Amazon SageMaker provides and maintains MXNet containers natively. Even so, in cases like the one in this lab, we may want to install or upgrade select dependencies. You can either create your own SageMaker compatible container that uses the MXNet container as a base, or update the dependencies at launch time using [requirements.txt](https://pip.pypa.io/en/stable/user_guide/#id15).

DevOps best practices recommend building your own container. Requirements.txt is often ideal for development. You can apply minor updates quickly to support rapid prototyping.

Since the dependencies vary based on GPU support, I've provided two variations of requirements.txt.

---
Run the cell below to preview the dependency updates required by GluonCV for GPU enabled training instances.

In [12]:
!pygmentize ../scripts/requirements-gpu.txt

mxnet-cu101==1.6.0.post0
gluoncv==0.8.0


Run the cell below to preview the dependency updates required by GluonCV when your training instance doesn't have GPU available.

In [13]:
!pygmentize ../scripts/requirements-cpu.txt

mxnet-mkl==1.6.0
gluoncv==0.8.0


The appropriate requirements.txt file has to be provided depending on which training instance you've selected. The requirements.txt file needs to be placed in the root directory of your scripts. In the following steps, we'll provide some utilities to take care of this.

### GPU Enabled Environment Considerations

---
Your environment and certain hyperparameters depend on the number of available GPUs. First, we need to ensure the right libraries are installed if you select a GPU enabled instance.

Second, if you have multiple GPUs, either on a single instance or across a distributed cluster, you have more GPU memory available to support larger batch sizes, which help accelerate and scale your training process.

This YOLOv3 script also needs to know how many devices you would like to use for multi-GPU training on a single instance, so that it can shard the data across devices for data parallelism.

Note that you may need to adapt to model performance degradation as a result of training on large batch sizes by lowering the learning rate.
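
As an illustration of the data sharding described above, the sketch below parses a device string such as `'0,1,2,3'` and splits one batch into nearly equal slices, one per device. It is a pure-Python stand-in with hypothetical helper names (`parse_gpu_ids`, `shard_batch`); in Gluon code the equivalent step is typically done with `gluon.utils.split_and_load`, and the provided script may do it differently.

```python
def parse_gpu_ids(gpus):
    """Turn a string such as '0,1,2,3' into a list of integer device ids."""
    return [int(i) for i in gpus.split(",") if i.strip()]


def shard_batch(batch, num_devices):
    """Split one batch into num_devices contiguous, nearly equal slices."""
    base, extra = divmod(len(batch), num_devices)
    shards, start = [], 0
    for d in range(num_devices):
        # The first `extra` devices receive one additional sample.
        size = base + (1 if d < extra else 0)
        shards.append(batch[start:start + size])
        start += size
    return shards


device_ids = parse_gpu_ids("0,1,2,3")
batch = list(range(16))  # stand-in for 16 training samples
for dev, shard in zip(device_ids, shard_batch(batch, len(device_ids))):
    print("gpu({}) gets {} samples".format(dev, len(shard)))
```

Each device then computes gradients on its own slice, which is why the total batch size is usually chosen as a multiple of the number of GPUs.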

In [58]:
import shutil

# sample selection of SageMaker instances and available GPUs
INSTANCE_GPUS = {
    'ml.p3.2xlarge':1,
    'ml.p3.8xlarge':4,
    'ml.c5.4xlarge':0,
    'ml.c5.xlarge':0,
    'ml.g4dn.xlarge':1
}

def set_requirements_text(gpu_train=True) :
    
    if gpu_train :
        shutil.copy('../scripts/requirements-gpu.txt','../scripts/requirements.txt')
    else :
        shutil.copy('../scripts/requirements-cpu.txt','../scripts/requirements.txt')

    !pygmentize ../scripts/requirements.txt

def environment_device_config(instance_selection, acceleration_type=None) :
    
    USE_GPU = (INSTANCE_GPUS[instance_selection] > 0)

    gpu_devices = '0'
    for i in range(1,INSTANCE_GPUS[instance_selection]) :
        gpu_devices+= ","+str(i)

    print('Set requirements.txt for {}:'.format('GPU' if USE_GPU else 'CPU')) 
    print('-----------------------------------')
    set_requirements_text(USE_GPU)
    print('-----------------------------------')
    
    config={
        "gpu_devices":gpu_devices,
        "batch_size_range": [1*INSTANCE_GPUS[instance_selection] if USE_GPU else 16,
                          16*INSTANCE_GPUS[instance_selection] if USE_GPU else 32]
    }
    
    print('You have selected {}{}, which will utilize {}.'.format(instance_selection, 
                                                                '' if acceleration_type is None else " with EIA: {}".format(acceleration_type),
                                                                'GPU' if USE_GPU else 'CPU' if acceleration_type is None else "EIA"))
    print('Recommended GPU dependent configurations: {}'.format(config))
    
    return config

LOCAL_INSTANCE_TYPE = 'ml.g4dn.xlarge'
config = environment_device_config(LOCAL_INSTANCE_TYPE)

Set requirements.txt for GPU:
-----------------------------------
mxnet-cu101==1.6.0.post0
gluoncv==0.8.0
-----------------------------------
You have selected ml.g4dn.xlarge, which will utilize GPU.
Recommended GPU dependent configurations: {'gpu_devices': '0', 'batch_size_range': [1, 16]}


---

### Amazon SageMaker Local

We're now ready to configure our script with appropriate hyperparameters, and launch Amazon SageMaker training in local mode. 

We'll use the following hyperparameters.

In [16]:
params = {
    'optimizer': 'adam',
    'lr': 0.0001,
    'epochs': 1,
    # integer midpoint of the recommended range; batch sizes must be whole numbers
    'batch-size': (config['batch_size_range'][0] + config['batch_size_range'][1]) // 2,
    'network':'mobilenet1.0',
    'syncbn':True,
    'num-workers':1,
    'save-format':'symbolic',
    'pre-trained-weights':"VOC"} 
print(params)

{'optimizer': 'adam', 'lr': 0.0001, 'epochs': 1, 'batch-size': 8, 'network': 'mobilenet1.0', 'syncbn': True, 'num-workers': 1, 'save-format': 'symbolic', 'pre-trained-weights': 'VOC'}
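
When a script runs under SageMaker, the hyperparameter dictionary reaches the entry point as command-line arguments. The sketch below approximates that flattening so you can see what your argument parser will receive. The helper name `to_cli_args` is hypothetical, and the exact serialization is handled by the SageMaker training toolkit, so treat this as an approximation rather than its implementation.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...] strings."""
    args = []
    for key, value in hyperparameters.items():
        args.extend(["--{}".format(key), str(value)])
    return args


example = {"lr": 0.0001, "epochs": 1, "batch-size": 8, "network": "mobilenet1.0"}
print(to_cli_args(example))
```

Because every value arrives as a string, the types declared in the script's argument parser must be able to parse it: `int('8')` succeeds, whereas `int('8.0')` raises a `ValueError`.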


A sample of the Caltech Birds dataset in RecordIO format should have been copied over to your notebook through the GitHub integration. Run the cell below to validate the contents.

In [17]:
from os import listdir

LOCAL_TRAIN_DATA_DIR = os.path.join(os.getcwd(),'../data/caltech_birds_sample/recordio/train')
LOCAL_TEST_DATA_DIR = os.path.join(os.getcwd(),'../data/caltech_birds_sample/recordio/test')

print('Local training data directory contents:')
print('---------------------------------------')
for f in listdir(LOCAL_TRAIN_DATA_DIR) :
    print(f)
print()
print('Local test data directory contents:')
print('---------------------------------------')
for f in listdir(LOCAL_TEST_DATA_DIR) :
    print(f)

Local training data directory contents:
---------------------------------------
classes.csv
train.rec
train.idx

Local test data directory contents:
---------------------------------------
val.idx
val.rec


Next, we instantiate an MXNet estimator. This object encapsulates your MXNet algorithm, including the infrastructure configuration that it will use when you execute training via the fit() method.

The parameter **train_instance_type = 'local'** informs SageMaker that you wish to run this estimator in local mode. Also, ensure you replace '\<<NAME YOUR TRAINING JOB\>>' with a unique name.

In [19]:
#pip install 'sagemaker[local]' --upgrade

from sagemaker.local import LocalSession

local_session = LocalSession()
local_session.config = {'local': {'local_code': True}}

BASE_JOB_NAME ='dylantong-yolo3-local-training'
#BASE_JOB_NAME='<<NAME YOUR TRAINING JOB>>'
SOURCE_DIR = '../scripts'

local_estimator = MXNet(entry_point='gluoncv_yolo3-train.py',
                        role=role,
                        sagemaker_session = local_session,
                        source_dir=SOURCE_DIR,
                        framework_version=FRAMEWORK_VERSION,
                        py_version='py3',
                        train_instance_count=1,
                        train_instance_type='local',
                        base_job_name=BASE_JOB_NAME,  # use the job name defined above
                        hyperparameters=params)

Calling fit() deploys a training container locally to run your training script. In local mode, you have the option of reading data from the local file system. Alternatively, you can specify S3 locations as you normally would for remote training; the cell below takes this route by uploading the sample data to S3 with upload_data().
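
For the local file system option, SageMaker local mode accepts channel inputs given as `file://` URIs. A minimal sketch of building such inputs from the sample directories follows; the helper name `to_file_uri` is hypothetical, and the final `fit` call is left commented out because it requires Docker and docker-compose on the notebook instance.

```python
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local directory."""
    return Path(path).resolve().as_uri()


local_inputs = {
    "train": to_file_uri("../data/caltech_birds_sample/recordio/train"),
    "test": to_file_uri("../data/caltech_birds_sample/recordio/test"),
}
print(local_inputs)

# Assumption: run this only where local mode is available.
# local_estimator.fit(local_inputs)
```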

**SageMaker local mode does not work within SageMaker Studio as of 8/27/2020. Please skip this cell for now.**

In [20]:
LOCAL_TRAIN_INPUT = local_session.upload_data(path=LOCAL_TRAIN_DATA_DIR)
LOCAL_TEST_INPUT = local_session.upload_data(path=LOCAL_TEST_DATA_DIR)

local_estimator.fit({"train":LOCAL_TRAIN_INPUT, "test":LOCAL_TEST_INPUT})

ImportError: 'docker-compose' is not installed. Local Mode features will not work without docker-compose. For more information on how to install 'docker-compose', please, see https://docs.docker.com/compose/install/

---

# Train the model

So you've identified candidate algorithm(s) and an architecture that is ready for larger scale training, and you've validated your script in a SageMaker training environment via SageMaker Local. 

You can now leverage Amazon SageMaker to perform large-scale training with zero setup.

## Step 1: Define your hyperparameters

After you've verified that your script runs on Amazon SageMaker training without errors, a good first step is to use SageMaker's Automatic Model Tuning to home in on a good set of hyperparameters.

Let's prepare configurations for GPU training so our tuning job can finish in a reasonable time. With SageMaker Training you only pay for the resources used during training. GPU instances are more expensive, but they can also reduce training time by orders of magnitude compared to training on CPU, which in turn reduces the cost of training.

In [21]:
INSTANCE_SELECTION = 'ml.p3.8xlarge'
config = environment_device_config(INSTANCE_SELECTION)

Set requirements.txt for GPU:
-----------------------------------
mxnet-cu101==1.6.0.post0
gluoncv==0.8.0
-----------------------------------
You have selected ml.p3.8xlarge, which will utilize GPU.
Recommended GPU dependent configurations: {'gpu_devices': '0,1,2,3', 'batch_size_range': [4, 64]}


Next, we need to specify the objective metric so that SageMaker can use Bayesian Optimization to find hyperparameters that optimize on a specific objective. We'll use the standard objective, which is to maximize mAP on the validation set.

The regex expression tells SageMaker how to parse the training logs to obtain the configured objective metric. Your script should have logic to write the objective metric to the logs.

In [22]:
OBJECTIVE_METRIC_NAME = 'mAP'
OBJECTIVE_TYPE = 'Maximize'

METRIC_DEFINITIONS = [{ 'Name':OBJECTIVE_METRIC_NAME,
                        'Regex':'Target Objective: {} = ([0-9\\.]+)'.format(OBJECTIVE_METRIC_NAME)}]

print(METRIC_DEFINITIONS)

[{'Name': 'mAP', 'Regex': 'Target Objective: mAP = ([0-9\\.]+)'}]
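
To check a metric regex before launching a job, you can apply it to a sample log line with Python's `re` module. The sketch below uses the same pattern as the metric definition above; the log line itself is hypothetical, and the real script's logger prefix and number formatting may differ.

```python
import re

# Same pattern as METRIC_DEFINITIONS above, written as a raw string.
pattern = re.compile(r"Target Objective: mAP = ([0-9\.]+)")

# Hypothetical log line, used only to exercise the pattern.
log_line = "INFO:root:Target Objective: mAP = 0.4521"

match = pattern.search(log_line)
value = float(match.group(1)) if match else None
print(value)
```

If `value` comes back as `None` for a line your script really emits, the tuning job would have no objective to optimize, so this is worth verifying locally first.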


Next, we set the hyperparameters as well as the hyperparameter value ranges that you want Amazon SageMaker to search over.

The configurations below will search for optimal values for the optimizer to use, the learning rate, and the mini-batch-size. In practice, you might want to explore other variables such as the base network, pretrained weights, and other variables that could influence the performance of your model.

In [23]:
HYPERPARAMETER_RANGES = {'opt': CategoricalParameter(['adam','sgd']),
                         'lr': ContinuousParameter(0.0002, 0.001, scaling_type='Logarithmic'),
                         # hyphenated to match the script's --batch-size argument
                         'batch-size': IntegerParameter(config['batch_size_range'][0], 
                                                        config['batch_size_range'][1])}
params = {
    'opt': 'adam',
    'epochs': 1,
    'network':'mobilenet1.0',
    'syncbn':True,
    'gpus': config['gpu_devices'] 
} 

print(params)

{'opt': 'adam', 'epochs': 1, 'network': 'mobilenet1.0', 'syncbn': True, 'gpus': '0,1,2,3'}
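
The `scaling_type='Logarithmic'` setting on the learning rate means the range is explored on a log scale, so small values are not crowded out by large ones. The sketch below only illustrates that scaling by drawing log-uniform samples from the same range; it is not how SageMaker chooses candidates (the tuner uses Bayesian optimization), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(0.0002, 0.001, rng) for _ in range(5)]
print(["{:.5f}".format(s) for s in samples])
```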


## Step 2: Launch your Hyperparameter Tuning Job

Similar to Lab 2, we're going to launch a hyperparameter tuning job to automate the hyperparameter search process. The difference is that we're launching this job programmatically via the Python SDK.

Secondly, since we're training on a custom script, we have a couple of additional parameters: (1) we specify the location of the scripts in our local directory (source_dir), and (2) we provide the name of our MXNet (GluonCV) script that serves as the entry point (entry_point) for our algorithm.

Lastly, we have provided a requirements.txt file in the same directory as our scripts, so that it will be loaded into our training containers. Requirements.txt instructs SageMaker to install the necessary dependencies onto the training containers upon launch. The values framework_version=FRAMEWORK_VERSION ('1.6.0') and py_version='py3' tell SageMaker to use the MXNet container that has been pre-built and tested for MXNet 1.6 running on Python 3.

Replace '\<<NAME YOUR HPO JOB\>>' with a unique name, so that you can identify your hyperparameter tuning jobs.

In this cell we instantiate an MXNet estimator. This object encapsulates your MXNet algorithm, including the infrastructure configuration that it will use when you execute training via the fit() method.

In [25]:
BASE_JOB_NAME='dylantong-yolo3-hpo'
#BASE_JOB_NAME='<<NAME YOUR HPO JOB>>'
SOURCE_DIR = '../scripts'

estimator = MXNet(entry_point='gluoncv_yolo3-train.py',
                  role=role,
                  source_dir=SOURCE_DIR,
                  framework_version=FRAMEWORK_VERSION,
                  py_version='py3',
                  train_instance_count=1,
                  train_instance_type=INSTANCE_SELECTION,
                  sagemaker_session=sagemaker.Session(),
                  base_job_name=BASE_JOB_NAME,
                  hyperparameters=params)

Next, we configure the parameters for our automatic hyperparameter tuning job. We provide the same parameters as we did in Lab 2, except that we're instantiating the job programmatically via the SDK. Note that we only run one job (max_jobs=1) for the sake of demonstration. In practice, you'll need to run more to home in on optimal parameters.

In [26]:
tuner = HyperparameterTuner(estimator,
                            OBJECTIVE_METRIC_NAME,
                            HYPERPARAMETER_RANGES,
                            METRIC_DEFINITIONS,
                            objective_type = OBJECTIVE_TYPE,
                            max_jobs=1,
                            max_parallel_jobs=1)

Calling fit() on our tuner launches the job as a non-blocking asynchronous call. We pass two parameters to specify the location of our training and test sets.

'train' and 'test' are treated as "channels", and the training container exposes them through the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST respectively. The datasets are downloaded from the specified S3 locations and can be accessed inside the training container via those environment variables (e.g., os.environ['SM_CHANNEL_TRAIN']).

In [27]:
tuner.fit({'train': SAMPLE_TRAINING_DATASET, 
           'test': SAMPLE_TEST_DATASET})

SageMaker's Automatic Model Tuning can run training jobs sequentially or in parallel, and the tuning job itself runs asynchronously. You can call .wait() to block the current thread until the tuning job completes.

In [None]:
tuner.wait()

## Step 3: Finetune your model

The training process will be iterative, as described in Lab 2. Eventually, you will home in on a set of parameters, and may choose to further fine-tune the model.

The section below demonstrates how to perform incremental training on a model that you've previously trained. For instance, this could be the best model from your recent Hyperparameter Tuning Job.

The training time required to train a YOLOv3 model is beyond what is practical for a workshop. Therefore, I've provided a pretrained YOLOv3 model that has been fine-tuned on a sample of our Caltech Birds dataset. You can treat this model as if it were the output of your Hyperparameter Tuning Job.

In [28]:
TRAINED_MODEL = 'yolo3_mobilenet1.0_custom_best.params'
TRAINED_MODEL_URI = 's3://reinvent2018-sagemaker-pytorch/models/caltech-birds/gluoncv/yolov3/sample/{}'.format(TRAINED_MODEL)

Let's select an instance and reset our parameters. We'll use a smaller GPU instance to minimize the cost of this lab.

Note that we've also set the epochs to one for the sake of demonstration. The provided model has been trained for hundreds of epochs.

In [29]:
INSTANCE_SELECTION = 'ml.p3.2xlarge'
config = environment_device_config(INSTANCE_SELECTION)

params = {
    'optimizer': 'adam',
    'lr': 0.0001,
    'epochs': 1,
    'batch-size': config['batch_size_range'][1],
    'network':'mobilenet1.0',
    'syncbn':True,
    'pre-trained-weights':"NONE",
    'resume': TRAINED_MODEL,
    'save-format': 'symbolic',
    'gpus': config['gpu_devices'] } 
print()
print("Configured Parameters")
print("---------------------")
print(params)

Set requirements.txt for GPU:
-----------------------------------
mxnet-cu101==1.6.0.post0
gluoncv==0.8.0
-----------------------------------
You have selected ml.p3.2xlarge, which will utilize GPU.
Recommended GPU dependent configurations: {'gpu_devices': '0', 'batch_size_range': [1, 16]}

Configured Parameters
---------------------
{'optimizer': 'adam', 'lr': 0.0001, 'epochs': 1, 'batch-size': 16, 'network': 'mobilenet1.0', 'syncbn': True, 'pre-trained-weights': 'NONE', 'resume': 'yolo3_mobilenet1.0_custom_best.params', 'save-format': 'symbolic', 'gpus': '0'}


Replace '\<<NAME YOUR TRAINING JOB\>>' with a unique prefix that will help you identify your training job.

We're going to instantiate another MXNet estimator object. The main difference is the "model_uri" parameter, which causes the model artifacts at the specified S3 location to be downloaded into the training container. Your script can then initialize a model with those weights. The artifacts are located in the directory specified by the SM_CHANNEL_MODEL environment variable.
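
A minimal sketch of how a script might locate those artifacts follows. It joins the directory from `SM_CHANNEL_MODEL` with the file name passed in the `resume` hyperparameter. The helper name `resolve_resume_path` and the fallback behavior are assumptions for illustration; the provided training script may resolve the path differently.

```python
import os


def resolve_resume_path(resume, environ=os.environ):
    """Return the full path of the checkpoint named by `resume`, or '' if none."""
    if not resume:
        return ""
    # SM_CHANNEL_MODEL is present when the estimator was created with model_uri.
    model_channel = environ.get("SM_CHANNEL_MODEL")
    if model_channel:
        return os.path.join(model_channel, resume)
    # Outside SageMaker, treat the value as an ordinary path.
    return resume


# Typical location of the model channel inside a training container.
fake_env = {"SM_CHANNEL_MODEL": "/opt/ml/input/data/model"}
print(resolve_resume_path("yolo3_mobilenet1.0_custom_best.params", fake_env))
```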

In [31]:
BASE_JOB_NAME ='dylantong-yolo3-incremental-training'
#BASE_JOB_NAME='<<NAME YOUR TRAINING JOB>>'
SOURCE_DIR = '../scripts'

incr_estimator = MXNet(entry_point='gluoncv_yolo3-train.py',
                       role=role,
                       source_dir=SOURCE_DIR,
                       framework_version=FRAMEWORK_VERSION,
                       py_version='py3',
                       train_instance_count=1,
                       train_instance_type=INSTANCE_SELECTION,
                       sagemaker_session=sagemaker.Session(),
                       base_job_name=BASE_JOB_NAME,
                       hyperparameters=params,
                       model_uri=TRAINED_MODEL_URI)

In [32]:
incr_estimator.fit({'train': SAMPLE_TRAINING_DATASET, 
                    'test': SAMPLE_TEST_DATASET})

2020-08-28 07:45:08 Starting - Starting the training job...
2020-08-28 07:45:10 Starting - Launching requested ML instances......
2020-08-28 07:46:18 Starting - Preparing the instances for training......
2020-08-28 07:47:27 Downloading - Downloading input data...
2020-08-28 07:47:53 Training - Downloading the training image.....
2020-08-28 07:48:49,288 sagemaker-training-toolkit INFO     Imported framework sagemaker_mxnet_container.training
2020-08-28 07:48:49,312 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":16,"epochs":1,"gpus":"0","lr":0.0001,"network":"mobilenet1.0","optimizer":"adam","pre-trained-weights":"NONE","resume":"yolo3_mobilenet1.0_custom_best.params","save-format":"symbolic","syncbn":true}', 'SM_USER_ENTRY_POINT': 'gluoncv_yolo3-train.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_in

# Host the model

Next, we're going to programmatically deploy our model, so that we have a managed endpoint for inference.

This requires us to provide a script that is capable of loading a model into a model serving container and responding to requests. SageMaker expects integrators to implement a programmatic [interface](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html#model-serving); refer to the link for details. The interface is common among all the BYOS integrations (e.g., TensorFlow, PyTorch, MXNet).

The interface has default implementations, but generally, you will need to provide your own implementation. Run the cell below to view the model serving script that has been provided for the YOLOv3 model. The model expects images in a binary format, and returns three tensors (class, scores, bounding box) as a byte stream in a JSON response.

You have the freedom to accept and respond with different formats, and there are certainly performance improvements that can be made to the current implementation.
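
To show the overall shape of such a script, here is a skeleton of the two hooks commonly implemented for the MXNet serving container: `model_fn`, which loads the model once, and `transform_fn`, which handles each request. This is a sketch under stated assumptions, not the provided inference script: `DummyDetector`, the `application/x-image` content type, and the JSON field names are all hypothetical stand-ins.

```python
import json


class DummyDetector:
    """Stand-in for a loaded network; always returns one fixed detection."""

    def __call__(self, image_bytes):
        class_ids = [[0.0]]
        scores = [[0.9]]
        boxes = [[[10.0, 20.0, 110.0, 220.0]]]
        return class_ids, scores, boxes


def model_fn(model_dir):
    """Called once at container start; a real script loads weights from model_dir."""
    return DummyDetector()


def transform_fn(model, request_body, content_type, accept_type):
    """Decode the request, run the model, and serialize the response."""
    if content_type != "application/x-image":
        raise ValueError("Unsupported content type: {}".format(content_type))
    class_ids, scores, boxes = model(request_body)
    body = json.dumps({"class_ids": class_ids, "scores": scores, "boxes": boxes})
    return body, accept_type


# Exercise the hooks locally with placeholder bytes instead of a real image.
model = model_fn("/opt/ml/model")
body, out_type = transform_fn(model, b"placeholder-image-bytes",
                              "application/x-image", "application/json")
print(out_type, body)
```

Exercising the hooks this way, without a container, is a quick check that request decoding and response serialization agree before you deploy.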

In [33]:
!pygmentize ../scripts/gluoncv-yolo3-inference.py

import os
from os import walk
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluoncv as gcv

def load_sym_model(sym_f, param_f, model_dir) :

#    try:
#        a = mx.nd.zeros((1,), ctx=mx.gpu(0))
#        device = [mx.gpu(0)]
#        print('GPU device is available')
#    except:
    device = [mx.cpu()]
#        print('Using CPU on local machine. GPU device was not detected')

    sym_file = os.path.join(model_dir, sym_f)
    param_file = os.path.join(model_dir, param_f)

Let's start by selecting our instance and ensuring that the right dependencies will be installed on our model serving instance. We'll use a c5, a compute-optimized CPU instance, and attach GPU acceleration to it via Amazon Elastic Inference (EIA). GPU acceleration should give us better throughput on YOLOv3. With EIA, you can get performance that exceeds that of the K80 GPUs available on the P2 instances, at a lower cost.

In [63]:
INSTANCE_SELECTION = 'ml.c5.xlarge'
ACCELERATION_TYPE = 'ml.eia1.medium'
#ACCELERATION_TYPE = None
USE_EIA = ACCELERATION_TYPE is not None

config = environment_device_config(INSTANCE_SELECTION, ACCELERATION_TYPE)

Set requirements.txt for CPU:
-----------------------------------
mxnet-mkl==1.6.0
gluoncv==0.8.0
-----------------------------------
You have selected ml.c5.xlarge with EIA: ml.eia1.medium, which will utilize EIA.
Recommended GPU dependent configurations: {'gpu_devices': '0', 'batch_size_range': [16, 32]}


Now, let's programmatically register the last model that we trained so that it is available for deployment within SageMaker Hosting.

In [77]:
# As of 8/28/2020, the MXNet 1.6 containers don't support Elastic Inference. Using MXNet 1.5.1 for serving temporarily.
INFERENCE_MXNET_VER = '1.5.1'
env_vars = {"SAGEMAKER_MODEL_SERVER_TIMEOUT":"3600",
            "SYM_FILE_NAME":"yolov3-gcv-symbol.json",
            "PARAM_FILE_NAME":"yolov3-gcv-0000.params",
            "USE_EIA": str(int(USE_EIA))}
                              

yolo_model = MXNetModel(model_data = incr_estimator.model_data,
                        role=role,
                        framework_version=INFERENCE_MXNET_VER,
                        py_version="py3",
                        entry_point='gluoncv-yolo3-inference.py',
                        env = env_vars,
                        source_dir=incr_estimator.source_dir)

Replace '<<PROVIDE AN ENDPOINT NAME>>' with a unique name so that you can identify your endpoint. For the purpose of demonstration, we deploy only a single instance. However, the endpoint can be configured for auto-scaling and deployed across a fleet to meet throughput requirements.

**This may take around 10 minutes to deploy**

In [78]:
RT_ENDPOINT_NAME = 'dylantong-yolov3-rt-ep-{}'.format(datetime.now().strftime('%Y-%m-%d-%H-%M-%S'))
#RT_ENDPOINT_NAME = '<<PROVIDE AN ENDPOINT NAME>>'
print('endpoint name: {}'.format(RT_ENDPOINT_NAME))

start = time.time()
yolo_rt_ep = yolo_model.deploy(instance_type=INSTANCE_SELECTION,
                                initial_instance_count = 1,
                                endpoint_name= RT_ENDPOINT_NAME,
                                accelerator_type=ACCELERATION_TYPE)
end = time.time()

print("Endpoint was deployed in {:3f} seconds.".format(end-start))

endpoint name: dylantong-yolov3-rt-ep-2020-08-28-20-06-30
-------------!Endpoint was deployed in 411.327956 seconds.


# Test the model

Let's download some sample images from the internet and see how well our bird detector does at detecting the birds in them.

In [69]:
!wget -q -O multi-goldfinch-1.jpg https://t3.ftcdn.net/jpg/01/44/64/36/500_F_144643697_GJRUBtGc55KYSMpyg1Kucb9yJzvMQooW.jpg
!wget -q -O northern-flicker-1.jpg https://upload.wikimedia.org/wikipedia/commons/5/5c/Northern_Flicker_%28Red-shafted%29.jpg
!wget -q -O northern-cardinal-1.jpg https://cdn.pixabay.com/photo/2013/03/19/04/42/bird-94957_960_720.jpg
!wget -q -O blue-jay-1.jpg https://cdn12.picryl.com/photo/2016/12/31/blue-jay-bird-feather-animals-b8ee04-1024.jpg
!wget -q -O hummingbird-1.jpg http://res.freestockphotos.biz/pictures/17/17875-hummingbird-close-up-pv.jpg

Our endpoint expects the request payload to be an image sent as bytes within the request body. Here's a utility function to generate the requests.

In [70]:
def generate_request_payload(img_loc) :

    with open(img_loc, 'rb') as image:
        f = image.read()
        b = bytearray(f)
        
    return b
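
Because the sample images are fetched with `wget -q`, a failed download can silently leave an empty or non-image file on disk. As a minimal sketch, the check below tests for the JPEG signature bytes before a payload is sent. The helper name `looks_like_jpeg` is hypothetical and not part of the notebook.

```python
# Every JPEG file starts with the SOI marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(payload):
    """Return True if the payload starts with the JPEG signature bytes.

    Works for both bytes and bytearray. This only inspects the header;
    it does not prove the rest of the file is a valid image.
    """
    return bytes(payload[:3]) == JPEG_MAGIC
```

For example, `looks_like_jpeg(generate_request_payload('blue-jay-1.jpg'))` could be checked before calling the endpoint, so that a bad download is reported locally and not as a server-side prediction error.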

Our model server implementation returns responses in JSON format. Here's a utility function that retrieves the bytes from the JSON payload, and restructures them back into tensor outputs corresponding to the class, score and bounding box predictions.

In [71]:
CID_END_OFFSET = 100
SCORE_END_OFFSET = 200
BBOX_END_OFFSET = 600
Y = 100
N_SCORES = 1
N_CID = 1
N_BBOX_POINTS = 4

def parse_response_paylod(resp) :
    
    stacked_tensors = np.frombuffer(resp['Body'].read(), dtype="float32")
    #stacked_tensors = np.frombuffer(resp, dtype="float32")
    t= np.split(stacked_tensors,[CID_END_OFFSET,SCORE_END_OFFSET,BBOX_END_OFFSET])

    # t[0] holds the class IDs and t[1] the scores, one value per detection.
    c= mx.nd.array(t[0].reshape(Y,N_CID))
    s= mx.nd.array(t[1].reshape(Y,N_SCORES))
    bb= mx.nd.array(t[2].reshape(Y,N_BBOX_POINTS))
    
    return c,s,bb
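
The offsets above imply a flat layout of 600 float32 values: 100 class IDs, then 100 scores, then 100 boxes of 4 coordinates each. The sketch below reproduces that split with only the standard library, so the layout can be checked without MXNet. It assumes little-endian float32 values, and it assumes that unused detection slots are padded with -1, as GluonCV detectors typically do. The names `split_payload` and `keep_confident` are hypothetical.

```python
import struct

N_DETECTIONS = 100          # rows per tensor, matches Y in the notebook
FLOATS_PER_ROW = 1 + 1 + 4  # class ID + score + 4 box coordinates


def split_payload(raw, n=N_DETECTIONS):
    """Split a flat float32 byte string into class IDs, scores, and boxes."""
    expected = n * FLOATS_PER_ROW * 4  # 4 bytes per float32
    if len(raw) != expected:
        raise ValueError("expected {} bytes, got {}".format(expected, len(raw)))

    # Assumption: values are little-endian float32.
    values = struct.unpack("<{}f".format(n * FLOATS_PER_ROW), raw)
    class_ids = values[:n]
    scores = values[n:2 * n]
    flat_boxes = values[2 * n:]
    boxes = [flat_boxes[i:i + 4] for i in range(0, 4 * n, 4)]
    return class_ids, scores, boxes


def keep_confident(class_ids, scores, boxes, threshold=0.5):
    """Drop padded slots (class ID below 0) and low-scoring detections."""
    return [
        (int(c), s, b)
        for c, s, b in zip(class_ids, scores, boxes)
        if c >= 0 and s >= threshold
    ]
```

The length check makes a malformed or truncated response fail with a clear message before any reshaping is attempted.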

Run this cell to detect the birds in the collection of images that we've downloaded.

In [80]:
TEST_IMAGES = ['multi-goldfinch-1.jpg','northern-flicker-1.jpg', 
          'northern-cardinal-1.jpg','blue-jay-1.jpg','hummingbird-1.jpg']
CLASSES = ['Cardinal','Northern_Flicker','American_Goldfinch', 'Ruby_throated_Hummingbird','Blue_Jay']
runtime = boto3.client(service_name='runtime.sagemaker')

def test_endpoint(img_file_list) :
    
    for img in img_file_list :

        b = generate_request_payload(img)
        
        response = runtime.invoke_endpoint(EndpointName=RT_ENDPOINT_NAME,
                                            ContentType='image/jpeg',
                                            Body=b)

        c,s,bb = parse_response_paylod(response)

        trans_X, trans_image = gcv.data.transforms.presets.yolo.transform_test(mx.img.imdecode(b), 512)
        ax = viz.plot_bbox(trans_image, bb, s,c, class_names=CLASSES)
        plt.show()
    
test_endpoint(TEST_IMAGES)   

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from model with message "{
  "code": 503,
  "type": "InternalServerException",
  "message": "Prediction failed"
}
". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/dylantong-yolov3-rt-ep-2020-08-28-20-06-30 in account 407247006381 for more information.

# Clean up
Here we delete the SageMaker endpoint, as we will no longer be performing any inferences.  This is an important step, as your account is billed for the amount of time an endpoint is running, even when it is idle.

In [None]:
sagemaker.Session().delete_endpoint(yolo_rt_ep.endpoint)