<img src='../img/rapids_sagemaker.png' width="600" >

In [None]:
import sagemaker
from helper_functions import *

<span style="color:#8735fb; font-size:22pt"> Hyper-Parameter Optimization with RAPIDS + SageMaker </span>

Hyper-Parameter Optimization (HPO) improves model quality by searching the space of possible 'architecture parameters', i.e., parameters not usually trained during the learning process. 

This search can significantly boost model quality relative to default parameters and non-expert tuning; however, the search over architectures can take a very long time on a non-accelerated platform.

In this notebook, we containerize a RAPIDS workflow and run Bring-Your-Own-Container SageMaker HPO to show how we can overcome the computational complexity of model search. We accelerate HPO in two key ways: 1. by scaling within a node (e.g., multi-GPU, where each GPU brings an order of magnitude higher core count relative to CPUs), and 2. by scaling across nodes and running parallel trials on cloud instances.

With GPUs and the cloud, HPO is reduced from a multi-day search to just a few hours.
For example, with 10 years of airline data, we found 
XX overall speedup and XX cost savings
~3X cost savings between GPUs and CPUs [ ml.p3.8xlarge vs ml.m5.24xlarge ]. Further cost reductions (up to ~70%) were easily unlocked using spot instances.

With all these powerful tools at our disposal, every data scientist should feel empowered to uplevel their model before serving it to the world!

<img src='../img/three_steps_to_hpo.png' width=2000>

<span style="color:#8735fb; font-size:22pt"> Key Choices: </span>

<span style="color:#8735fb; font-size:18pt"> [ Dataset Size and S3 Bucket ] </span>

We target a large real-world structured dataset of flight logs for US airlines (published monthly since 1987 by the Bureau of Transportation Statistics, [dataset link](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=)) and train a model to predict flight delays. 

We host 3 increasingly large versions of this dataset as directories in a public bucket, and offer `1_year` (2019, 7.2M flights), `3_year` (2016-2019, 18M flights), or `10_year` (2009-2019, 125M flights) configurations. 

In [None]:
dataset_bucket = 'rapidslabdata'

In [None]:
dataset_directory = '3_year'   
assert( dataset_directory in [ '1_year', '3_year', '10_year'] )

In [None]:
s3_data_URI = f's3://{dataset_bucket}/{dataset_directory}'

<span style="color:#8735fb; font-size:18pt"> [ Algorithm ] </span>

From an ML/algorithm perspective, we offer `XGBoost` and `RandomForest` decision tree models, which do quite well on this structured dataset.

In [None]:
algorithm_choice = 'XGBoost'
assert ( algorithm_choice in [ 'XGBoost', 'RandomForest' ])

We can also optionally increase robustness via reshuffles of the train-test split (i.e., cross-validation folds).

In [None]:
cv_folds = 1  
assert ( cv_folds >= 1 )
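As a rough illustration of what a reshuffled train-test split means, the sketch below draws `cv_folds` independent shuffles of the row indices and splits each one. The function name `reshuffled_splits`, the 80/20 ratio, and the seeding scheme are assumptions made for this example; they are not taken from `train.py`.

```python
import random


def reshuffled_splits(n_rows, cv_folds=1, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices), one pair per reshuffle of the rows."""
    n_test = int(n_rows * test_fraction)
    for fold in range(cv_folds):
        rng = random.Random(seed + fold)  # a different shuffle for every fold
        indices = list(range(n_rows))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


splits = list(reshuffled_splits(n_rows=10, cv_folds=3))
for train_indices, test_indices in splits:
    print(len(train_indices), len(test_indices))
```

Each fold trains and scores the model once, so the cost of a single HPO trial grows roughly linearly with `cv_folds`.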

<span style="color:#8735fb; font-size:18pt"> [ Code ] </span>

Lastly, we enable the option of running the pipeline in single or multi CPU/GPU within each node. The possible options are `singleCPU`, `singleGPU`, `multiCPU`, and `multiGPU`.
The singleCPU option is code written with pandas and sklearn; singleGPU runs RAPIDS cudf and cuml (i.e., GPU equivalents of pandas and sklearn). In both multiCPU and multiGPU we add dask to parallelize the workflow and allow it to run on a cluster of CPUs/GPUs.

In [None]:
code_choice = 'multiGPU' 
assert ( code_choice in [ 'singleCPU', 'singleGPU', 'multiCPU', 'multiGPU'])
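The mapping from code choice to libraries described above can be written down as a small lookup table. This is only a sketch based on the description in this notebook: the dictionary and the helper `missing_libraries` are hypothetical names, and the actual workflow code may import additional packages.

```python
import importlib.util

# Libraries named in the text for each code choice (illustrative only).
LIBRARIES_BY_CODE_CHOICE = {
    'singleCPU': ['pandas', 'sklearn'],
    'singleGPU': ['cudf', 'cuml'],
    'multiCPU': ['dask', 'pandas', 'sklearn'],
    'multiGPU': ['dask', 'cudf', 'cuml'],
}


def missing_libraries(code_choice):
    """Return the libraries for this choice that cannot be imported here."""
    return [name for name in LIBRARIES_BY_CODE_CHOICE[code_choice]
            if importlib.util.find_spec(name) is None]


print(missing_libraries('multiGPU'))
```

The GPU libraries only need to be present inside the container, so a non-empty result on the notebook instance itself is not necessarily a problem.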

<span style="color:#8735fb; font-size:18pt"> [ Compute Instance ] </span>

Based on the dataset size and compute choice, we will try to recommend an instance type; you are of course welcome to select alternate configurations. In the CPU case we choose a large-memory instance (ml.r5), since memory utilization during training can exceed 200GB when using the 10 year dataset.

In [None]:
instance_type = recommend_instance_type ( code_choice, dataset_directory  ) 

In [None]:
use_spot_instances_flag = True

In [None]:
max_duration_of_experiment_seconds = 60*60*24 # 24 hrs 
assert ( max_duration_of_experiment_seconds > 60*60*2 ) # 2 hrs

<span style="color:#8735fb; font-size:18pt"> [ HPO ] </span>

One of the most important choices when running HPO is to choose the bounds of the hyper-parameter search process. Below we've set the ranges of the hyper-parameters to allow for significant variation in all of the different dimensions though you are welcome to try different variations.

In [None]:
n_trees_variable_name = 'num_boost_round' if ('XGBoost' in algorithm_choice) else 'n_estimators'
from sagemaker.parameter import ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    'max_depth'           : IntegerParameter        ( 5, 15 ),
    n_trees_variable_name : IntegerParameter        ( 100, 500 ),
    'max_features'        : ContinuousParameter     ( 0.1, 1.0 ),
}
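To get a feel for the space these bounds describe, here is a minimal sketch that draws a few candidate settings uniformly at random in plain Python. The tuple-based `search_space` and the helper `sample_configuration` are stand-ins invented for this example (assuming the XGBoost choice, hence `num_boost_round`); during a real HPO run SageMaker's tuner picks the candidates.

```python
import random

# Same bounds as hyperparameter_ranges, written as plain tuples (illustrative).
search_space = {
    'max_depth': ('int', 5, 15),
    'num_boost_round': ('int', 100, 500),
    'max_features': ('float', 0.1, 1.0),
}


def sample_configuration(space, rng):
    """Draw one hyper-parameter setting uniformly from the given bounds."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == 'int':
            config[name] = rng.randint(low, high)  # both ends inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(42)
candidates = [sample_configuration(search_space, rng) for _ in range(4)]
for candidate in candidates:
    print(candidate)
```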

We also need to decide the search strategy, how many total experiments/jobs to run, and how many jobs can run in parallel.

In [None]:
search_strategy = 'Bayesian'

In [None]:
max_parallel_jobs = 2  

In [None]:
max_jobs = 4  
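These two numbers also bound how long the search can take. As a back-of-the-envelope sketch (the helper names are made up here, and the estimate assumes every job runs for the full time limit and ignores any time spent waiting for spot capacity):

```python
import math


def hpo_rounds(max_jobs, max_parallel_jobs):
    """Number of sequential waves needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


def worst_case_hours(max_jobs, max_parallel_jobs, max_seconds_per_job):
    """Wall-clock upper bound if every job uses its full time limit."""
    return hpo_rounds(max_jobs, max_parallel_jobs) * max_seconds_per_job / 3600


print(hpo_rounds(4, 2))                       # default settings in this notebook
print(worst_case_hours(4, 2, 60 * 60 * 24))   # with the 24 hr per-job limit
print(hpo_rounds(100, 10))                    # a larger search
```

With the defaults above this gives 2 waves and a 48 hour worst case. In practice jobs usually finish well before the limit, so the bound is loose.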

<span style="color:#8735fb; font-size:22pt"> Validate: </span>

We'll need to capture our configuration choices into unique job names when we submit our Estimator for testing and when we run HPO. These job names will allow us to do experiment tracking, and also enable the correct code to run inside the container.

In [None]:
new_job_name_from_config( dataset_directory, code_choice, algorithm_choice, cv_folds, instance_type );
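The helper `new_job_name_from_config` lives in `helper_functions` and is not reproduced here. Purely as an illustration of the constraints such a name has to satisfy, the sketch below builds a name from the configuration values, keeps only letters, digits and hyphens, and truncates to 32 characters (the length limit for SageMaker tuning job names). The function `sketch_job_name` and its encoding are hypothetical and do not mirror the real helper.

```python
import re
import uuid


def sketch_job_name(*parts, max_length=32):
    """Join config values into a name made of letters, digits and hyphens."""
    suffix = uuid.uuid4().hex[:4]  # short random tail to keep names unique
    base = '-'.join(str(part) for part in parts)
    base = re.sub(r'[^a-zA-Z0-9]+', '-', base).strip('-')
    base = base[:max_length - len(suffix) - 1].rstrip('-')
    return f'{base}-{suffix}'


name = sketch_job_name('3_year', 'multiGPU', 'XGBoost', 1, 'ml.p3.8xlarge')
print(name, len(name))
```

Plain truncation like this drops the trailing fields, so a helper that the container relies on to select code would need a more compact encoding of each choice.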

In [None]:
summarize_choices( s3_data_URI, code_choice, algorithm_choice, cv_folds,
                   instance_type, use_spot_instances_flag, search_strategy, 
                   max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds )

-----

<span style="color:#8735fb; font-size:22pt"> 1. Build ML Pipeline </span>

-----

<img src='../img/airline_dataset.png' width='1250px'>

<span style="color:#8735fb; font-size:20pt"> 1.1 - Dataset </span>

In this demo we'll utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). 

For each flight the features in the data include information about time, the airline, source and destination airports, distance, and departure delay. Using these features we'll be trying to build a classifier model to predict whether a flight is going to be more than 15 minutes late on arrival as it prepares to depart.

We have a cleaned version of our dataset on a public S3 bucket, which we specify here and will subsequently use as an input to our HPO Estimators.


<span style="color:#8735fb; font-size:20pt"> 1.2 - Python DS Workflow [ ETL, Train, Eval ] </span>

In [None]:
# %load ../code/train.py

In [None]:
# %load ../code/rapids_cloud_ml.py

If you would like to point the code at your own data, just modify the top few lines of `train.py` and be sure that the `dataset_columns` (columns/features of your dataset) and `target_variable` (the label column which will be the classification target) match your dataset.

-----

<span style="color:#8735fb; font-size:22pt"> 2. Define Estimator </span>

-----

To build a RAPIDS enabled SageMaker HPO we first need to build an Estimator. 

An Estimator is a docker container image that captures all the software needed to run an HPO experiment.

The container is augmented with special **entrypoint code** that will be triggered at runtime by each worker. 

The entrypoint code enables us to write custom models and hook them up to data. 

<img src='../img/estimator.png'>

If you want to dig into the custom code, check out the `train.py` script as well as its supporting library `rapids_cloud_ml.py`.

In order to work with SageMaker HPO, the entrypoint logic should parse hyper-parameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyper-parameter setting.
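The steps listed above can be sketched as a skeleton entrypoint. This is not the `train.py` shipped with this notebook: the data is synthetic, the "model" is a one-feature threshold that ignores its hyper-parameters, and the argument names and defaults are placeholders. It assumes the SageMaker training toolkit passes hyper-parameters as command-line flags and exposes the input channel through the `SM_CHANNEL_TRAINING` environment variable.

```python
import argparse
import os
import random


def parse_hyperparameters(argv=None):
    """Read hyper-parameters from command-line flags, ignoring unknown ones."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--max_features', type=float, default=1.0)
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


def load_and_split(data_dir, test_fraction=0.2, seed=0):
    """Placeholder: synthetic (feature, label) rows stand in for files in data_dir."""
    rng = random.Random(seed)
    rows = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(1000))]
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def _mean(values):
    return sum(values) / len(values)


def train(train_rows, hyperparameters):
    """Placeholder model: a threshold halfway between the two class means.

    A real entrypoint would pass the hyper-parameters to the model here.
    """
    negatives = [x for x, label in train_rows if label == 0]
    positives = [x for x, label in train_rows if label == 1]
    return (_mean(negatives) + _mean(positives)) / 2


def evaluate(threshold, test_rows):
    """Fraction of held-out rows classified correctly."""
    correct = sum(int(x > threshold) == label for x, label in test_rows)
    return correct / len(test_rows)


def main(argv=None):
    hyperparameters = parse_hyperparameters(argv)
    data_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')
    train_rows, test_rows = load_and_split(data_dir)
    model = train(train_rows, hyperparameters)
    score = evaluate(model, test_rows)
    # Emit the final score in a form a metric regex can pick out of the log.
    print(f'test-accuracy: {score:.4f};')
    return score


# Inside the container, main() would be called without arguments so that the
# flags supplied by SageMaker (sys.argv) are parsed.
score = main(argv=[])
```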

We've already built sample entrypoint code leveraging the cuml.RandomForest classifier model. If you would like to make changes by adding your custom model logic feel free to modify the **train.py** file.

<span style="color:#8735fb; font-size:20pt"> 2.1 - Containerize and Push to ECR </span>

Now let's turn to building our container so that it can integrate with the AWS SageMaker HPO API.

To get things rolling, let's make sure we can query our AWS SageMaker execution role and session as well as our account ID and AWS region.

In [None]:
sm_execution_role = sagemaker.get_execution_role()
sm_session = sagemaker.Session()

account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

In [None]:
account, region

Our container takes the latest RAPIDS [ nightly ] image as a starting layer, adds some bits to interoperate with AWS SageMaker (i.e., github.com/aws/sagemaker-training-toolkit), and copies in custom entrypoint code that will run when the Estimator is spawned. We'll discuss the custom logic in the section below; for now, let's actually build our container and push it to the Amazon Elastic Container Registry (ECR). 



In [None]:
rapids_base_container = 'rapidsai/rapidsai-nightly:0.15-cuda10.1-runtime-ubuntu18.04-py3.7'

Let's decide on the full name of our container `image_base:image_tag`

In [None]:
image_base = 'cloud-ml-sagemaker'
image_tag  = rapids_base_container.split(':')[1]

In [None]:
ecr_fullname = f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"

In [None]:
ecr_fullname

<span style="color:#8735fb; font-size:18pt"> 2.1.1 - Write Dockerfile </span>

Below we write the Dockerfile to disk, and then execute the docker build command.
> Note that we're copying in custom logic [ `train.py`, `rapids_cloud_ml.py` ] from the code directory.

In [None]:
workdir='~/SageMaker/cloud-ml-examples/aws/code'

In [None]:
%cd {workdir}

In [None]:
%%writefile Dockerfile
# make sure the container base matches {rapids_base_container}
FROM rapidsai/rapidsai-nightly:0.15-cuda10.1-runtime-ubuntu18.04-py3.7 

# install https://github.com/aws/sagemaker-training-toolkit
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \ 
    && source activate rapids && pip3 install sagemaker-training

# path where sagemaker looks for our code
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container 
COPY rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
COPY train.py $CLOUD_PATH/train.py

# sagemaker entrypoint will be train.py
ENV SAGEMAKER_PROGRAM train.py 

WORKDIR $CLOUD_PATH

In [None]:
# validate that our desired rapids image matches the Dockerfile
with open('Dockerfile') as df: 
    assert( rapids_base_container in df.read())

<span style="color:#8735fb; font-size:18pt"> 2.1.2 Build and Tag </span>

The build usually takes less than 1 minute.

In [None]:
%%time
!docker build . -t $ecr_fullname -f Dockerfile

<span style="color:#8735fb; font-size:18pt"> 2.1.3 - Publish to Elastic Container Registry (ECR) </span>

Now that we've built and tagged our container, it's time to push it to Amazon's container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.


Docker Login to ECR

In [None]:
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)

In [None]:
!{docker_login_str[0]}

Create ECR repository [ if it doesn't already exist]

In [None]:
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name $image_base)

Let's now actually push the container to ECR
> Note the first push to ECR may take some time (hopefully less than 10 minutes).

In [None]:
ecr_fullname

In [None]:
!docker push $ecr_fullname

<span style="color:#8735fb; font-size:20pt"> 2.2 - Create Estimator </span>

Having built our container [ +custom logic ] and pushed it to ECR, we can finally compile all of our efforts into an **Estimator** object. You can think of the Estimator as the software stack that AWS SageMaker will replicate to each worker node.

We'll build the Estimator using our SageMaker execution role, the ECR image we built/tagged, and add an output path to [optionally] save models trained during the HPO experimentation.

For additional options and details see the [Estimator documentation](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator) (e.g., to change the size in GB of the EBS volume to use for storing input data during training, default = 30GB ).

In [None]:
estimator_params = {
    
    'sagemaker_session' : sm_session,     
    'role' : sm_execution_role,
    
    'image_name' : ecr_fullname,
    
    'train_instance_type' : instance_type, 
    'train_instance_count' : 1, 
    
    'train_use_spot_instances': use_spot_instances_flag,
    
    'train_max_run' : max_duration_of_experiment_seconds,
    'train_max_wait' : max_duration_of_experiment_seconds+1,     
    
    'input_mode' : 'File'    
}

In [None]:
sm_estimator = sagemaker.estimator.Estimator( **estimator_params  )
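When spot instances are enabled, the maximum wait time must be at least as long as the maximum run time, which is why `train_max_wait` is set to one second more than `train_max_run` above. A small pre-flight check along these lines can catch a misconfiguration before a job is submitted; `check_spot_settings` is a hypothetical helper written for this example, not part of the SageMaker SDK.

```python
def check_spot_settings(params):
    """Return a list of problems with the spot-related settings (empty if none)."""
    problems = []
    if params.get('train_use_spot_instances', False):
        max_run = params.get('train_max_run')
        max_wait = params.get('train_max_wait')
        if max_wait is None:
            problems.append('train_max_wait is required with spot instances')
        elif max_run is not None and max_wait < max_run:
            problems.append('train_max_wait must be >= train_max_run')
    return problems


good = {'train_use_spot_instances': True,
        'train_max_run': 60 * 60 * 24,
        'train_max_wait': 60 * 60 * 24 + 1}
bad = {'train_use_spot_instances': True,
       'train_max_run': 60 * 60 * 24,
       'train_max_wait': 60 * 60}

print(check_spot_settings(good))
print(check_spot_settings(bad))
```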

<span style="color:#8735fb; font-size:20pt"> 2.3 - Test Estimator </span>

Now we are ready to test by asking SageMaker to run the BYOContainer logic inside our Estimator. This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

> Note: This verification step will use the default hyper-parameter values declared in our custom train code, as SageMaker HPO will not be orchestrating a search for this single run.

In [None]:
assert ( input('confirm test run? [ y / n ] : ').lower() == 'y' )

job_name = new_job_name_from_config( dataset_directory, code_choice, 
                                     algorithm_choice, cv_folds,
                                     instance_type  )

sm_estimator.fit(inputs = s3_data_URI, job_name=job_name.lower())

-----

<span style="color:#8735fb; font-size:22pt"> 3 - HPO </span>

-----

With a working SageMaker Estimator in hand, the hardest part is behind us. Now all we have to do is tell SageMaker about the space of hyper-parameters in which to search for the best model.

For more documentation check out the AWS SageMaker [HyperParameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

<span style="color:#8735fb; font-size:20pt"> 3.1 - Define Metric </span>

The definitions below specify regular expressions (i.e., string parsing rules) that find the metrics we use to evaluate performance in the output log of each worker/Estimator. In this case we are only interested in the performance of our model on the test data (i.e., `test-accuracy`), so we have a single metric to track.

For additional details on metrics refer to the [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html).

In [None]:
metric_definitions = [{'Name': 'test-accuracy', 'Regex': 'test-accuracy: (.*);'}]

In [None]:
objective_metric_name = 'test-accuracy'
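To see what the metric regex does, the sketch below applies the same pattern to a few made-up log lines with Python's `re` module. The log lines and the helper `extract_metric` are invented for illustration; only the pattern is taken from `metric_definitions`.

```python
import re

metric_regex = re.compile(r'test-accuracy: (.*);')

# Hypothetical worker output; only the second line carries the metric.
log_lines = [
    'loading data ...',
    'test-accuracy: 0.8571;',
    'done',
]


def extract_metric(lines, pattern):
    """Return the captured metric values, in log order, as floats."""
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


print(extract_metric(log_lines, metric_regex))
```

Because `(.*)` is greedy, it captures everything up to the last semicolon on the line, so the training code should print the metric on a line of its own. A stricter group such as `([0-9.]+)` is one possible alternative.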

<span style="color:#8735fb; font-size:20pt"> 3.2 - Define Tuner </span>

Below we set up the parameters that define the HPO job. By default (to avoid accidentally spawning large compute jobs), we have limited the run to the small `max_jobs` and `max_parallel_jobs` values chosen in the Key Choices section (4 and 2, respectively).

To run a more realistic large-scale HPO, change `max_jobs` to 100 and `max_parallel_jobs` to 10 (or as high as your instance limit permits).

In [None]:
hpo = sagemaker.tuner.HyperparameterTuner( estimator = sm_estimator,
                                           metric_definitions = metric_definitions, 
                                           objective_metric_name = objective_metric_name,
                                           objective_type = 'Maximize',
                                           hyperparameter_ranges = hyperparameter_ranges,
                                           strategy = search_strategy,  
                                           max_jobs = max_jobs,
                                           max_parallel_jobs = max_parallel_jobs)

<span style="color:#8735fb; font-size:20pt"> 3.3 - Run HPO </span>

In [None]:
summarize_choices( s3_data_URI, code_choice, algorithm_choice, cv_folds,
                   instance_type, use_spot_instances_flag, search_strategy, 
                   max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds )

Let's be sure we take a moment to confirm before launching all of our HPO experiments.

In [None]:
assert ( input('confirm HPO launch? [ y / n ] : ').lower() == 'y' )

tuning_job_name = new_job_name_from_config( dataset_directory, code_choice, 
                                            algorithm_choice, cv_folds, 
                                            instance_type )
hpo.fit( inputs = s3_data_URI, 
         job_name = tuning_job_name, 
         wait = True, logs = 'All') 

hpo.wait() # block until the .fit call above is completed

<img src='../img/run_hpo.png'>

<span style="color:#8735fb; font-size:20pt"> 3.4 - Results and Summary </span>

In [None]:
results_df = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()

In [None]:
results_df
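A common next step is to pull out the best trial. The sketch below works on a list of per-job records, such as `results_df.to_dict('records')` would produce, and assumes the columns `TrainingJobName`, `TrainingJobStatus` and `FinalObjectiveValue` are present. The sample records and the helper `best_trial` are made up for illustration and are not results from this notebook.

```python
def best_trial(records, objective_type='Maximize'):
    """Return the completed record with the best objective value, or None."""
    completed = [record for record in records
                 if record.get('TrainingJobStatus') == 'Completed'
                 and record.get('FinalObjectiveValue') is not None]
    if not completed:
        return None
    pick = max if objective_type == 'Maximize' else min
    return pick(completed, key=lambda record: record['FinalObjectiveValue'])


# Hypothetical records, for illustration only.
sample_records = [
    {'TrainingJobName': 'job-001', 'TrainingJobStatus': 'Completed',
     'FinalObjectiveValue': 0.81, 'max_depth': 7},
    {'TrainingJobName': 'job-002', 'TrainingJobStatus': 'Completed',
     'FinalObjectiveValue': 0.84, 'max_depth': 12},
    {'TrainingJobName': 'job-003', 'TrainingJobStatus': 'Stopped',
     'FinalObjectiveValue': None, 'max_depth': 15},
]

best = best_trial(sample_records)
print(best['TrainingJobName'], best['FinalObjectiveValue'])
```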

AWS SageMaker + NVIDIA RAPIDS HPO FTW!

<span style="color:#8735fb; font-size:20pt"> Rapids References </span>


[cloud-ml-examples](http://github.com/rapidsai/cloud-ml-examples)

[cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)

<span style="color:#8735fb; font-size:20pt"> SageMaker References </span>

[SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

[Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)

#### 1 year of data, NVIDIA V100 vs Intel Xeon E5-2698
    > ingestion : speedup: 13.79 x  -- cpu: 22.70 seconds, gpu: 1.65 seconds
    > dropna : speedup: 86.62 x  -- cpu: 5.52 seconds, gpu: 0.06 seconds
    > split : speedup: 26.08 x  -- cpu: 2.66 seconds, gpu: 0.10 seconds
    > RandomForest.train : speedup: 11.92 x  -- cpu: 16.73 seconds, gpu: 1.40 seconds
    > RandomForest.predict : speedup: 14.27 x  -- cpu: 0.57 seconds, gpu: 0.04 seconds
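The per-stage timings above can be combined into an end-to-end figure by summing the CPU and GPU seconds across stages. This is simple arithmetic on the rounded numbers listed above, so it may differ slightly from a ratio computed on unrounded measurements.

```python
# (cpu_seconds, gpu_seconds) per stage, copied from the list above.
stage_seconds = {
    'ingestion': (22.70, 1.65),
    'dropna': (5.52, 0.06),
    'split': (2.66, 0.10),
    'RandomForest.train': (16.73, 1.40),
    'RandomForest.predict': (0.57, 0.04),
}

cpu_total = sum(cpu for cpu, _gpu in stage_seconds.values())
gpu_total = sum(gpu for _cpu, gpu in stage_seconds.values())
overall_speedup = cpu_total / gpu_total

print(f'cpu: {cpu_total:.2f} s, gpu: {gpu_total:.2f} s, '
      f'speedup: {overall_speedup:.2f} x')
```

The totals come to 48.18 seconds on CPU versus 3.25 seconds on GPU, roughly a 14.8x end-to-end speedup for this pipeline.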