# Hyperparameter Tuning with Amazon SageMaker and MXNet
_**Creating a Hyperparameter Tuning Job for an MXNet Network**_

---



## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Code](#Code)
1. [Tune](#Tune)
1. [Wrap-up](#Wrap-up)

---

## Background

This example notebook shows how to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) with MXNet. It uses SageMaker's hyperparameter tuning to launch multiple training jobs with different hyperparameter combinations and find the set that gives the best model performance. This is an important step in the machine learning process, because hyperparameter settings can have a large impact on model accuracy. In this example, we'll use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to create a hyperparameter tuning job for an MXNet estimator.

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/using-identity-based-policies.html) for more details on creating these. Note that if a role is not associated with the current notebook instance, or if more than one role is required for training and/or hosting, you should replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [1]:
import sagemaker

role = sagemaker.get_execution_role()

Now we'll import the Python libraries we'll need.

In [2]:
import sagemaker
import boto3
from sagemaker.mxnet import MXNet
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

---

## Data

The MNIST dataset is widely used for handwritten digit classification, and consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). See [here](http://yann.lecun.com/exdb/mnist/) for more details on MNIST.

For this example notebook we'll use a version of the dataset that's already been published in the desired format to a shared S3 bucket.  Let's specify that location now.

In [3]:
region = boto3.Session().region_name
train_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/train'.format(region)
test_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/test'.format(region)

---

## Code

To use SageMaker's pre-built MXNet containers, we need to pass in an MXNet script for the container to run.  For our example, we'll define several functions, including:
- `load_data()` and `find_file()` which help bring in our MNIST dataset as NumPy arrays
- `build_graph()` which defines our neural network structure
- `train()` which is the main function that is run during each training job and calls the other functions in order to read in the dataset, create a neural network, and train it.
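The `struct.unpack` calls in `load_data()` read the fixed-size headers of the gzip-compressed IDX files that MNIST is distributed in. As a minimal sketch of that layout, the snippet below builds tiny synthetic files in memory and reads them back using only the standard library. The helper names (`make_idx_labels`, `make_idx_images`, `read_idx_labels`, `read_idx_images`) are made up for illustration and are not part of `mnist.py`.

```python
import gzip
import io
import struct


def make_idx_labels(labels):
    """Build a gzip-compressed IDX label file in memory (synthetic data)."""
    # Header: magic number 2049 for label files, then the item count,
    # both stored as big-endian unsigned 32-bit integers.
    raw = struct.pack(">II", 2049, len(labels)) + bytes(labels)
    return gzip.compress(raw)


def make_idx_images(images, rows, cols):
    """Build a gzip-compressed IDX image file in memory (synthetic data)."""
    # Header: magic number 2051, item count, rows, cols.
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    body = b"".join(bytes(img) for img in images)
    return gzip.compress(header + body)


def read_idx_labels(blob):
    """Read the 8-byte header, then one byte per label."""
    with gzip.open(io.BytesIO(blob)) as f:
        magic, count = struct.unpack(">II", f.read(8))
        return magic, count, list(f.read())


def read_idx_images(blob):
    """Read the 16-byte header, then rows * cols bytes per image."""
    with gzip.open(io.BytesIO(blob)) as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        pixels = f.read()
    size = rows * cols
    images = [list(pixels[i * size:(i + 1) * size]) for i in range(count)]
    return magic, rows, cols, images


labels_blob = make_idx_labels([7, 2])
images_blob = make_idx_images([[0, 255, 0, 255], [1, 2, 3, 4]], rows=2, cols=2)
print(read_idx_labels(labels_blob))
print(read_idx_images(images_blob))
```

The real script does the same header parsing, but hands the remaining bytes to NumPy and reshapes them instead of slicing Python lists.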

There are also several functions for hosting which we won't define, like `input_fn()`, `output_fn()`, and `predict_fn()`.  These will take on their default values as described [here](https://github.com/aws/sagemaker-python-sdk#model-serving), and are not important for the purpose of showcasing SageMaker's hyperparameter tuning.

In [4]:
!cat mnist.py

import logging

import gzip
import mxnet as mx
import numpy as np
import os
import struct


def load_data(path):
    with gzip.open(find_file(path, "labels.gz")) as flbl:
        struct.unpack(">II", flbl.read(8))
        # np.frombuffer replaces the deprecated binary mode of np.fromstring
        labels = np.frombuffer(flbl.read(), dtype=np.int8)
    with gzip.open(find_file(path, "images.gz")) as fimg:
        _, _, rows, cols = struct.unpack(">IIII", fimg.read(16))
        images = np.frombuffer(fimg.read(), dtype=np.uint8).reshape(len(labels), rows, cols)
        images = images.reshape(images.shape[0], 1, 28, 28).astype(np.float32) / 255
    return labels, images


def find_file(root_path, file_name):
    for root, dirs, files in os.walk(root_path):
        if file_name in files:
            return os.path.join(root, file_name)


def build_graph():
    data = mx.sym.var('data')
    data = mx.sym.flatten(data=data)
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, act_type="relu")

Once we've specified and tested our training script to ensure it works, we can start our tuning job.  Testing can be done in either local mode or using SageMaker training.  Please see the [MXNet MNIST example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_mnist/mxnet_mnist.ipynb) for more detail.

---

## Tune

Similar to training a single MXNet job in SageMaker, we define our MXNet estimator passing in the MXNet script, IAM role, (per job) hardware configuration, and any hyperparameters we're not tuning.

In [5]:
estimator = MXNet(entry_point='mnist.py',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  sagemaker_session=sagemaker.Session(),
                  base_job_name='DEMO-hpo-mxnet',
                  hyperparameters={'batch_size': 100})

No framework_version specified, defaulting to version 1.2. This is not the latest supported version. If you would like to use version 1.4, please add framework_version=1.4 to your constructor.
The Python 2 mxnet images will be soon deprecated and may not be supported for newer upcoming versions of the mxnet images.
Please set the argument "py_version='py3'" to use the Python 3 mxnet image.


Once we've defined our estimator we can specify the hyperparameters we'd like to tune and their possible values.  We have three different types of hyperparameters.
- Categorical parameters need to take one value from a discrete set.  We define this by passing the list of possible values to `CategoricalParameter(list)`
- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`
- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`

*Note, if possible, it's almost always best to specify a value as the least restrictive type.  For example, tuning `learning_rate` as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning it as a categorical parameter with possible values of 0.01, 0.1, 0.15, or 0.2.*

In [6]:
hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.01, 0.2),
                         'num_epoch': IntegerParameter(10, 50)}

Next we'll specify the objective metric that we'd like to tune and its definition.  This includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of our training job.

In [7]:
objective_metric_name = 'Validation-accuracy'
metric_definitions = [{'Name': 'Validation-accuracy',
                       'Regex': 'Validation-accuracy=([0-9\\.]+)'}]
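Before starting a tuning job, it can help to check the regular expression locally. The sketch below applies the same pattern to a few log lines. The lines are made up for illustration, written in the style of MXNet's per-epoch logging, and are not taken from a real training job.

```python
import re

# Same pattern as in metric_definitions above, written as a raw string.
pattern = re.compile(r'Validation-accuracy=([0-9\.]+)')

# Made-up log lines for illustration only.
sample_log = """\
INFO:root:Epoch[0] Train-accuracy=0.912300
INFO:root:Epoch[0] Validation-accuracy=0.951200
INFO:root:Epoch[1] Train-accuracy=0.961000
INFO:root:Epoch[1] Validation-accuracy=0.968400
"""

# The capture group holds the numeric value of the metric.
values = [float(v) for v in pattern.findall(sample_log)]
print(values)
```

Only the `Validation-accuracy` lines match, and the captured group is the number that follows the equals sign.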

Now, we'll create a `HyperparameterTuner` object, which we pass:
- The MXNet estimator we created above
- Our hyperparameter ranges
- Objective metric name and definition
- Number of training jobs to run in total and how many training jobs should be run simultaneously.  More parallel jobs will finish tuning sooner, but may sacrifice accuracy.  We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).
- Whether we should maximize or minimize our objective metric (we haven't specified here since it defaults to 'Maximize', which is what we want for validation accuracy)

In [8]:
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=3)
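To get a feel for the trade-off between `max_jobs` and `max_parallel_jobs`, here is a back-of-the-envelope sketch. It assumes jobs run in lockstep batches, which is a simplification and not a description of how SageMaker actually schedules jobs. The function name `tuning_rounds` is hypothetical.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Rough model: jobs run in lockstep batches of max_parallel_jobs.

    Returns (number of sequential batches, number of jobs that start
    after at least one earlier batch has produced results).
    """
    rounds = math.ceil(max_jobs / max_parallel_jobs)
    first_batch = min(max_jobs, max_parallel_jobs)
    informed_jobs = max_jobs - first_batch
    return rounds, informed_jobs


# Settings used in this notebook: 9 jobs, 3 at a time.
print(tuning_rounds(9, 3))
# Fully sequential and fully parallel extremes for comparison.
print(tuning_rounds(9, 1))
print(tuning_rounds(9, 9))
```

Under this model, fewer parallel jobs means more batches (longer wall-clock time) but more jobs that can benefit from earlier results, which is the intuition behind keeping the parallel value small.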

And finally, we can start our tuning job by calling `.fit()` and passing in the S3 paths to our train and test datasets.

In [9]:
inputs = {'train': train_data_location, 'test': test_data_location}

In [10]:
tuner.fit(inputs)

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully and is `InProgress`.

In [11]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

'InProgress'
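If we'd rather wait in the notebook than check once, the status call can be wrapped in a simple polling loop. The helper below is a sketch, not part of the SageMaker SDK: `wait_for_tuning_job` and its parameters are hypothetical names, and the status lookup is passed in as a function so the loop itself has no AWS dependency.

```python
import time

# Statuses after which the tuning job will not change any further.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_tuning_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or max_polls is reached.

    get_status is any zero-argument function returning the status string.
    Returns the last status seen.
    """
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


# In the notebook, get_status could wrap the describe call shown above:
# status = wait_for_tuning_job(
#     lambda: boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
#         HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
#     )['HyperParameterTuningJobStatus'])
```

Passing `sleep` as a parameter keeps the loop easy to test with a fake status source.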

---

## Wrap-up

Now that we've started our hyperparameter tuning job, it will run in the background and we can close this notebook.  Once finished, we can use the [HPO Analysis notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) to determine which set of hyperparameters worked best.

For more detail on Amazon SageMaker's Hyperparameter Tuning, please refer to the AWS documentation. 

## BYO HPO [Random]

In [12]:
!cat random_tuner.py

import random
import time
import boto3
import re
import pandas as pd
import numpy as np


#################
# Hyperparameters
#################
class StaticParameter():
    '''
    Class for Static hyperparameters.
    Takes one argument which is a single hyperparameter value.
    '''
    def __init__(self, values):
        self.values = values
    def get_value(self):
        return self.values



class CategoricalParameter():
    '''
    Class for categorical hyperparameters.
    Takes one argument which is a list of possible hyperparameter values.
    '''
    def __init__(self, values):
        self.values = values
    def get_value(self):
        return random.choice(self.values)


class IntegerParameter():
    '''
    Class for integer hyperparameters.
    Takes two arguments: min_value and then max_value.
    '''
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value
    def get

In [13]:
import random_tuner as rt

We'll need to supply:
- A function that trains our MXNet model given a job name and a dictionary of hyperparameters. Note that `wait=False` is set in our `fit()` call so that we can train multiple jobs at once.
- A dictionary of hyperparameters, where the ones we want to tune are defined as one of three types (`ContinuousParameter`, `IntegerParameter`, or `CategoricalParameter`) with appropriate minimum and maximum values or a list of possible values.
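Since only part of `random_tuner.py` is shown above, here is a minimal, self-contained sketch of the random search idea: draw each hyperparameter independently from its range and hand the result to a training function. The class and function names (`SketchStatic`, `SketchContinuous`, `SketchInteger`, `random_search_sketch`) are hypothetical and are not the contents of the real module.

```python
import random


class SketchStatic:
    """Always returns the same value."""
    def __init__(self, value):
        self.value = value

    def get_value(self, rng):
        return self.value


class SketchContinuous:
    """Uniform draw from [min_value, max_value]."""
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def get_value(self, rng):
        return rng.uniform(self.min_value, self.max_value)


class SketchInteger:
    """Uniform integer draw; both ends are inclusive."""
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def get_value(self, rng):
        return rng.randint(self.min_value, self.max_value)


def random_search_sketch(train_fn, ranges, max_parallel_jobs, generations, seed=None):
    """Launch generations * max_parallel_jobs jobs with random hyperparameters."""
    rng = random.Random(seed)
    launched = {}
    for gen in range(generations):
        for job in range(max_parallel_jobs):
            job_name = 'sketch-gen-{}-job-{}'.format(gen, job)
            hps = {name: p.get_value(rng) for name, p in ranges.items()}
            train_fn(job_name, hps)
            launched[job_name] = hps
        # A real tuner would wait here until this generation's jobs finish.
    return launched


ranges = {'batch_size': SketchStatic(100),
          'learning_rate': SketchContinuous(0.01, 0.2),
          'num_epoch': SketchInteger(10, 50)}

# A stand-in training function that only prints what it would launch.
jobs_sketch = random_search_sketch(lambda name, hps: print(name, hps),
                                   ranges, max_parallel_jobs=3, generations=3, seed=0)
```

Because every draw is independent, random search needs no results from earlier jobs, which is what makes it easy to run in parallel.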

In [14]:
def fit_random(job_name, hyperparameters):
    estimator = MXNet(entry_point='mnist.py',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      sagemaker_session=sagemaker.Session(),
                      base_job_name='DEMO-hpo-mxnet',
                      hyperparameters=hyperparameters)
    estimator.fit(inputs, wait=False, job_name=job_name)

Define hyperparameter ranges

In [15]:
hyperparameter_ranges = {'batch_size': rt.StaticParameter(100),
                         'learning_rate': rt.ContinuousParameter(0.01, 0.2),
                         'num_epoch': rt.IntegerParameter(10, 50)}

Run custom training job

In [16]:
%%time

jobs = rt.random_search(fit_random,
                        hyperparameter_ranges,
                        max_parallel_jobs=3,
                        generations=3)

generation 0



generation 1



generation 2



CPU times: user 1.93 s, sys: 90.4 ms, total: 2.02 s
Wall time: 12min 28s


In [17]:
random_metrics = rt.table_metrics(jobs, rt.get_metrics(jobs, 'Validation-accuracy=([0-9\\.]+)'))
random_metrics.sort_values(['objective'], ascending=False)

|  | batch_size | learning_rate | num_epoch | objective | job_number |
|---|---|---|---|---|---|
| random-hp-2019-06-17-11-54-10-168gen-0job-0 | 100 | 0.18448 | 39 | 0.9786 | 1 |
| random-hp-2019-06-17-11-54-10-168gen-2job-2 | 100 | 0.163544 | 50 | 0.9785 | 9 |
| random-hp-2019-06-17-11-54-10-168gen-1job-1 | 100 | 0.169248 | 45 | 0.9778 | 5 |
| random-hp-2019-06-17-11-54-10-168gen-1job-2 | 100 | 0.164563 | 24 | 0.9777 | 6 |
| random-hp-2019-06-17-11-54-10-168gen-1job-0 | 100 | 0.156719 | 20 | 0.9776 | 4 |
| random-hp-2019-06-17-11-54-10-168gen-0job-2 | 100 | 0.097009 | 43 | 0.9763 | 3 |
| random-hp-2019-06-17-11-54-10-168gen-0job-1 | 100 | 0.08807 | 42 | 0.9761 | 2 |
| random-hp-2019-06-17-11-54-10-168gen-2job-0 | 100 | 0.109143 | 24 | 0.9743 | 7 |
| random-hp-2019-06-17-11-54-10-168gen-2job-1 | 100 | 0.017521 | 38 | 0.9627 | 8 |


## BYO HPO [Evolutionary Algorithms]

In [18]:
!pip install inspyred

Collecting inspyred
  Downloading https://files.pythonhosted.org/packages/ec/83/95dc9cc74d802e52b6f33d43de791dcfb376b187269757cf3c945ac7e0bb/inspyred-1.0.1-py2.py3-none-any.whl (88kB)
Installing collected packages: inspyred
Successfully installed inspyred-1.0.1


### Evolutionary Optimization Primer

In [19]:
def fit_evolutionary(job_name, hyperparameters):
    estimator = MXNet(entry_point='mnist.py',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      sagemaker_session=sagemaker.Session(),
                      base_job_name='DEMO-hpo-mxnet',
                      hyperparameters=hyperparameters)
    estimator.fit(inputs, wait=False, job_name=job_name)

In [20]:
from random import Random
from time import time
import inspyred


def generate_population(random, args):
    bounder = args["_ec"].bounder
    chromosome = []

    for lo, hi in zip(bounder.lower_bound, bounder.upper_bound):
        
        if(isinstance(lo, int) and isinstance(hi, int)):
            chromosome.append(random.randint(lo, hi))
        else:
            chromosome.append(random.uniform(lo, hi))
        
    return chromosome

def evaluate_population(candidates, args):
    fitness = []
    for cs in candidates:
        fit = 10 * len(cs) + sum([(x - 1)**2 - 10  for x in cs])
        fitness.append(fit)
    return fitness

def hpo(prng=None, display=False): 
    if prng is None:
        prng = Random()
        prng.seed(time()) 
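    # constraints[0] holds the lower bounds and constraints[1] the upper bounds,
    # so the first gene lies in [-5.1, -1] and the second (integer) gene in [20, 21].
    constraints = ((-5.1, 20), (-1, 21))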
    
    ea = inspyred.ec.GA(prng)
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    final_pop = ea.evolve(generator=generate_population,
                          evaluator=evaluate_population,
                          pop_size=10,
                          maximize=False,
                          bounder=inspyred.ec.Bounder(constraints[0], constraints[1]),
                          max_evaluations=40, 
                          num_elites=1)
                          
    if display:
        best = max(final_pop)
        print('Best Solution: \n{0}'.format(str(best)))
    return ea

In [21]:
hpo(display=True)

Best Solution: 
[-1.3081308127830642, 20] : 366.3274678489186


<inspyred.ec.ec.GA at 0x7f9cd2607ba8>
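To make the moving parts of the primer more explicit, here is a small pure-Python sketch of an evolutionary loop on the same toy objective and bounds. It is an elitist, mutation-only loop written for illustration. It is not `inspyred`'s GA, which also applies crossover, and the names `fitness`, `random_candidate`, `mutate`, and `evolve` are our own.

```python
import random

# Same bounds as the Bounder above: gene 1 in [-5.1, -1], gene 2 (integer) in [20, 21].
LOWER = (-5.1, 20)
UPPER = (-1, 21)


def fitness(candidate):
    """Same toy objective as evaluate_population above (lower is better)."""
    return 10 * len(candidate) + sum((x - 1) ** 2 - 10 for x in candidate)


def random_candidate(rng):
    """Draw each gene uniformly from its bounds (integers stay integers)."""
    genes = []
    for lo, hi in zip(LOWER, UPPER):
        if isinstance(lo, int) and isinstance(hi, int):
            genes.append(rng.randint(lo, hi))
        else:
            genes.append(rng.uniform(lo, hi))
    return genes


def mutate(candidate, rng, sigma=0.1, redraw_prob=0.2):
    """Perturb continuous genes with Gaussian noise; occasionally re-draw integer genes."""
    child = []
    for x, lo, hi in zip(candidate, LOWER, UPPER):
        if isinstance(lo, int) and isinstance(hi, int):
            if rng.random() < redraw_prob:
                x = rng.randint(lo, hi)
        else:
            x = x + rng.gauss(0, sigma * (hi - lo))
            x = min(max(x, lo), hi)  # clamp back into the bounds
        child.append(x)
    return child


def evolve(pop_size=10, generations=4, seed=0):
    """Keep the best candidate each generation and refill with mutated parents."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    history = [min(fitness(c) for c in population)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        elite = ranked[0]                   # elitism: the best survives unchanged
        parents = ranked[:pop_size // 2]    # truncation selection
        children = [mutate(rng.choice(parents), rng) for _ in range(pop_size - 1)]
        population = [elite] + children
        history.append(min(fitness(c) for c in population))
    return min(population, key=fitness), history


best, history = evolve(seed=0)
print(best, fitness(best))
print(history)
```

Within these bounds the objective cannot go below 365 (first gene at -1, second at 20), so the value printed by the `inspyred` run above is already close to the best achievable. Because the elite is carried over unchanged, the best fitness in `history` never gets worse from one generation to the next.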

## Genetic Algorithm

In [22]:
import evolutionary_tuner as et

In [23]:
hyperparameter = {'batch_size': et.StaticParameter(100),
                  'learning_rate': et.ContinuousParameter(0.01, 0.2),
                  'num_epoch': et.IntegerParameter(10, 50)}

In [21]:
jobs = et.evolutionary_search(base_name=None,
                              hyperparameters=hyperparameter,
                              generations=3,
                              max_parallel_jobs=6,
                              train_fn=fit_evolutionary)


In [29]:
evol_metrics = et.table_metrics(jobs, et.get_metrics(jobs, 'Validation-accuracy=([0-9\\.]+)'))
evol_metrics  # evol_metrics.sort_values(['objective'], ascending=False)

|  | batch_size | learning_rate | num_epoch | objective | job_number |
|---|---|---|---|---|---|
| evolve-hp-2018-11-07-13-34-06-311-job-0 | 100 | 0.118184 | 48 | 0.9774 | 1 |
| evolve-hp-2018-11-07-13-34-08-311-job-1 | 100 | 0.099719 | 11 | 0.9678 | 2 |
| evolve-hp-2018-11-07-13-34-11-311-job-2 | 100 | 0.030858 | 17 | 0.9541 | 3 |
| evolve-hp-2018-11-07-13-34-13-311-job-3 | 100 | 0.13671 | 18 | 0.9756 | 4 |
| evolve-hp-2018-11-07-13-34-15-311-job-4 | 100 | 0.17212 | 24 | 0.9781 | 5 |
| evolve-hp-2018-11-07-13-34-18-311-job-5 | 100 | 0.113952 | 26 | 0.9764 | 6 |
| evolve-hp-2018-11-07-13-40-23-311-job-0 | 100 | 0.17212 | 24 | 0.9781 | 7 |
| evolve-hp-2018-11-07-13-40-25-311-job-1 | 100 | 0.17212 | 24 | 0.9781 | 8 |
| evolve-hp-2018-11-07-13-40-28-311-job-2 | 100 | 0.17212 | 26 | 0.978 | 9 |
| evolve-hp-2018-11-07-13-40-31-311-job-3 | 100 | 0.113952 | 24 | 0.9761 | 10 |


## Particle Swarm Optimization

In [None]:
import evolutionary_tuner as et

hyperparameter = {'batch_size': et.StaticParameter(100),
                  'learning_rate': et.ContinuousParameter(0.01, 0.2),
                  'num_epoch': et.IntegerParameter(10, 50)}

jobs = et.evolutionary_search(base_name=None,
                              hyperparameters=hyperparameter,
                              generations=3,
                              max_parallel_jobs=6,
                              train_fn=fit_evolutionary)

In [32]:
evol_metrics = et.table_metrics(jobs, et.get_metrics(jobs, 'Validation-accuracy=([0-9\\.]+)'))
evol_metrics.sort_values(['job_number'], ascending=False)

|  | batch_size | learning_rate | num_epoch | objective | job_number |
|---|---|---|---|---|---|
| evolve-hp-2018-11-07-22-39-47-311-job-5 | 100 | 0.148788 | 34 | 0.9781 | 36 |
| evolve-hp-2018-11-07-22-39-45-311-job-4 | 100 | 0.168791 | 41 | 0.9786 | 35 |
| evolve-hp-2018-11-07-22-39-42-311-job-3 | 100 | 0.148788 | 34 | 0.9781 | 34 |
| evolve-hp-2018-11-07-22-39-40-311-job-2 | 100 | 0.156026 | 41 | 0.9795 | 33 |
| evolve-hp-2018-11-07-22-39-37-311-job-1 | 100 | 0.156026 | 44 | 0.9792 | 32 |
| evolve-hp-2018-11-07-22-39-35-311-job-0 | 100 | 0.156026 | 41 | 0.9795 | 31 |
| evolve-hp-2018-11-07-22-34-33-311-job-5 | 100 | 0.156026 | 41 | 0.9795 | 30 |
| evolve-hp-2018-11-07-22-34-30-311-job-4 | 100 | 0.168791 | 44 | 0.9788 | 29 |
| evolve-hp-2018-11-07-22-34-28-311-job-3 | 100 | 0.148788 | 34 | 0.9781 | 28 |
| evolve-hp-2018-11-07-22-34-26-311-job-2 | 100 | 0.148788 | 34 | 0.9781 | 27 |


## Pareto Archived Evolution Strategy (PAES)

In [None]:
import evolutionary_tuner as et

hyperparameter = {'batch_size': et.StaticParameter(100),
                  'learning_rate': et.ContinuousParameter(0.01, 0.2),
                  'num_epoch': et.IntegerParameter(10, 50)}

jobs = et.evolutionary_search(base_name=None,
                              hyperparameters=hyperparameter,
                              generations=3,
                              max_parallel_jobs=6,
                              train_fn=fit_evolutionary)

In [36]:
evol_metrics = et.table_metrics(jobs, et.get_metrics(jobs, 'Validation-accuracy=([0-9\\.]+)'))
evol_metrics.sort_values(['job_number'], ascending=False)

|  | batch_size | learning_rate | num_epoch | objective | job_number |
|---|---|---|---|---|---|
| evolve-hp-2018-11-08-00-02-53-312-job-5 | 100 | 0.117031 | 50 | 0.9783 | 66 |
| evolve-hp-2018-11-08-00-02-50-312-job-4 | 100 | 0.068905 | 47 | 0.9761 | 65 |
| evolve-hp-2018-11-08-00-02-48-312-job-3 | 100 | 0.085603 | 50 | 0.9769 | 64 |
| evolve-hp-2018-11-08-00-02-45-312-job-2 | 100 | 0.085603 | 50 | 0.9769 | 63 |
| evolve-hp-2018-11-08-00-02-42-312-job-1 | 100 | 0.085603 | 50 | 0.9769 | 62 |
| evolve-hp-2018-11-08-00-02-39-312-job-0 | 100 | 0.117031 | 50 | 0.9783 | 61 |
| evolve-hp-2018-11-07-23-57-32-311-job-5 | 100 | 0.085603 | 50 | 0.9769 | 60 |
| evolve-hp-2018-11-07-23-57-29-311-job-4 | 100 | 0.117031 | 14 | 0.9747 | 59 |
| evolve-hp-2018-11-07-23-57-27-311-job-3 | 100 | 0.068905 | 49 | 0.9759 | 58 |
| evolve-hp-2018-11-07-23-57-24-311-job-2 | 100 | 0.085603 | 14 | 0.9706 | 57 |
