## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

This notebook shows how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn model (Random Forest). More information on Scikit-Learn is available at https://scikit-learn.org/stable/index.html. We use the Boston Housing dataset that ships with Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset. Note that `load_boston` was deprecated in Scikit-Learn 1.0 and removed in 1.2, so the notebook needs an older Scikit-Learn version to run as written.


## Setup libraries and environment


In [1]:
import datetime
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

sm_boto3 = boto3.client('sagemaker')
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)


Using bucket sagemaker-ap-southeast-1-765838616097


## Prepare data
We load a dataset from Scikit-Learn, split it into training and test sets, and save both as CSV files for later upload to S3.

In [2]:
# we use the Boston housing dataset 
data = load_boston()

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [4]:
trainX.head()

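| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09103 | 0.0 | 2.46 | 0.0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3.0 | 193.0 | 17.8 | 394.12 | 4.82 | 37.9 |
| 1 | 3.53501 | 0.0 | 19.58 | 1.0 | 0.871 | 6.152 | 82.6 | 1.7455 | 5.0 | 403.0 | 14.7 | 88.01 | 15.02 | 15.6 |
| 2 | 0.03578 | 20.0 | 3.33 | 0.0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5.0 | 216.0 | 14.9 | 387.31 | 3.76 | 45.4 |
| 3 | 0.38735 | 0.0 | 25.65 | 0.0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2.0 | 188.0 | 19.1 | 359.29 | 27.26 | 15.7 |
| 4 | 0.06724 | 0.0 | 3.24 | 0.0 | 0.46 | 6.333 | 17.2 | 5.2146 | 4.0 | 430.0 | 16.9 | 375.21 | 7.34 | 22.6 |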


In [5]:
# create directories (-p avoids an error if they already exist)
! mkdir -p data
! mkdir -p source
! mkdir -p model

# save data as csv
trainX.to_csv('data/boston_train.csv')
testX.to_csv('data/boston_test.csv')

## Create a training script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [6]:
%%writefile source/sklearn_training_script.py

import argparse
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
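try:
    import joblib  # standalone package, required with scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn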


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == '__main__':
    
    #------------------------------- parsing input parameters (from command line)
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # RandomForest hyperparameters
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--min_samples_leaf', type=int, default=3)
    
    # Data, model, and output directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train_file', type=str, default='boston_train.csv')
    parser.add_argument('--test_file', type=str, default='boston_test.csv')
    parser.add_argument('--features', type=str)  # explicitly name which features to use
    parser.add_argument('--target_variable', type=str)  # explicitly name the column to be used as target

    args, _ = parser.parse_known_args()
    
    #------------------------------- data preparation
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target_variable]
    y_test = test_df[args.target_variable]
    
    #------------------------------- model training
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)
    
    model.fit(X_train, y_train)
    
    #-------------------------------  model testing
    print('testing model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # percentile absolute errors
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
        
    #------------------------------- save model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model saved at ' + path)


Writing source/sklearn_training_script.py
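
The script takes its data and model directories from command-line arguments and falls back to SageMaker environment variables when they are not given. The following is a minimal, standalone sketch of that pattern, not the full training script. The `parse_args` helper, the `--undeclared_flag` argument and the hand-set environment values are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse script arguments, falling back to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    # parse_known_args returns undeclared arguments separately instead of failing
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Local run: directories are passed explicitly on the command line
local_args, _ = parse_args(['--model_dir', 'model/', '--train_dir', 'data/'])

# Simulated SageMaker run: the directories come from environment variables.
# The values below are set by hand for this sketch (illustrative only).
os.environ['SM_MODEL_DIR'] = '/opt/ml/model'
os.environ['SM_CHANNEL_TRAIN'] = '/opt/ml/input/data/train'
sm_args, extra = parse_args(['--n_estimators', '100', '--undeclared_flag', 'x'])

print(local_args.model_dir, local_args.train_dir)
print(sm_args.model_dir, sm_args.train_dir, sm_args.n_estimators, extra)
```

Because explicit arguments take precedence over the environment defaults, the same script can be pointed at local folders without any change to its code.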


## Local training
Script arguments allow us to keep SageMaker-specific configuration out of the script, so it can also run locally.

In [7]:
! python source/sklearn_training_script.py \
    --n_estimators 100 \
    --min_samples_leaf 3 \
    --model_dir 'model/' \
    --train_dir 'data/' \
    --test_dir 'data/' \
    --train_file 'boston_train.csv' \
    --test_file 'boston_test.csv' \
    --features 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT' \
    --target_variable 'target'

extracting arguments
reading data
building training and testing datasets
training model
testing model
AE-at-10th-percentile: 0.2823231818181775
AE-at-50th-percentile: 1.6322408369408414
AE-at-90th-percentile: 4.461983407148411
model saved at model/model.joblib
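
The `AE-at-Nth-percentile` lines report percentiles of the absolute prediction error on the test set. To make the metric concrete, here is a small pure-Python sketch of a percentile with linear interpolation between ranks, which is our understanding of what `np.percentile` computes by default. The helper `percentile_linear` and the sample values are hypothetical and are not taken from the Boston data.

```python
import math


def percentile_linear(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical targets and predictions, for illustration only
y_true = [24.0, 21.5, 34.5, 33.0, 36.0]
y_pred = [25.0, 20.0, 30.5, 33.5, 36.25]
abs_err = [abs(p - t) for p, t in zip(y_pred, y_true)]

for q in [10, 50, 90]:
    print('AE-at-' + str(q) + 'th-percentile: ' + str(percentile_linear(abs_err, q)))
```

Read this way, the 50th percentile is the median absolute error, and the 90th percentile bounds the error for nine out of ten test rows.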


## SageMaker Training

### Creating data input channels (copy to S3)

In [8]:
# send data to S3. SageMaker will read the training data from S3
train_path_s3 = sess.upload_data(
    path='data/boston_train.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

test_path_s3 = sess.upload_data(
    path='data/boston_test.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

print('Train set URI:', train_path_s3)
print('Test set URI:', test_path_s3)

Train set URI: s3://sagemaker-ap-southeast-1-765838616097/sagemaker/sklearncontainer/boston_train.csv
Test set URI: s3://sagemaker-ap-southeast-1-765838616097/sagemaker/sklearncontainer/boston_test.csv
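
The returned URIs follow the pattern `s3://<bucket>/<key_prefix>/<file name>`, as the output above shows. The sketch below composes and splits such a URI with the standard library only. The helper names and the bucket `my-example-bucket` are hypothetical, and nothing is uploaded.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the URI that a single-file upload is expected to produce."""
    filename = posixpath.basename(local_path)
    return 's3://' + posixpath.join(bucket, key_prefix, filename)


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    if not uri.startswith('s3://'):
        raise ValueError('not an S3 URI: ' + uri)
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key


# 'my-example-bucket' is a placeholder, not a real bucket
uri = expected_s3_uri('my-example-bucket', 'sagemaker/sklearncontainer', 'data/boston_train.csv')
print(uri)
print(split_s3_uri(uri))
```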


### Launching a training job with the Python SDK

In [9]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='source/sklearn_training_script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.20.0',
    base_job_name='rf-scikit',
    metric_definitions=[
        { 'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$' },
    ],
    hyperparameters={
        'n_estimators': 100,
        'min_samples_leaf': 3,
        'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
        'target_variable': 'target',
    },
    max_run=20*60,  # Maximum allowed active runtime (in seconds)
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30*60,  # Maximum clock time (including spot delays)
)
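
The `metric_definitions` entry relies on a regular expression that picks the median absolute error out of the lines the training script prints. A quick local check of the pattern, using Python's `re` module against lines in the script's output format, might look like the sketch below. This only tests the pattern itself and assumes the service matches log lines in a comparable way.

```python
import re

# Same pattern as the 'median-AE' metric definition
METRIC_REGEX = r'AE-at-50th-percentile: ([0-9.]+).*$'

# Log lines in the format printed by the training script
log_lines = [
    'testing model',
    'AE-at-10th-percentile: 0.2823231818181775',
    'AE-at-50th-percentile: 1.6322408369408414',
    'AE-at-90th-percentile: 4.461983407148411',
]

# Keep the captured group of every line that matches
matches = [re.search(METRIC_REGEX, line) for line in log_lines]
median_ae = [float(m.group(1)) for m in matches if m]
print(median_ae)
```

Only the 50th-percentile line matches, so the metric receives a single value per training run.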

In [10]:
sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)


2021-01-28 03:11:16 Starting - Starting the training job...
2021-01-28 03:11:40 Starting - Launching requested ML instances......
ProfilerReport-1611803476: InProgress
2021-01-28 03:12:41 Starting - Preparing the instances for training...
2021-01-28 03:13:15 Downloading - Downloading input data...
2021-01-28 03:13:45 Training - Downloading the training image..
2021-01-28 03:13:58,454 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training
2021-01-28 03:13:58,458 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-01-28 03:13:58,472 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-01-28 03:13:58,625 botocore.utils INFO     IMDS ENDPOINT: http://169.254.169.254/
2021-01-28 03:13:58,764 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-01-28 03:14:01,787 sagemaker-training-toolkit INFO     No GPUs de

The training job we ran is very light because the dataset is very small. Running it locally on the notebook instance is therefore faster than running it on SageMaker: SageMaker has to provision the training infrastructure first, and for a job this small the provisioning takes longer than the training itself.

With large datasets, running on SageMaker can considerably speed up execution, especially if multiple instances are used in parallel.
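
To put a number on that overhead, the sketch below computes the time between the first log entry of the job and the moment the container invoked the user script. It uses only two timestamps copied from the training log above, so it is a rough estimate for this one run, not a general benchmark.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# Timestamps copied from the training job log above
job_started = datetime.strptime('2021-01-28 03:11:16', FMT)
script_invoked = datetime.strptime('2021-01-28 03:13:58', FMT)

# Time spent on provisioning, data download and image download
overhead_seconds = (script_invoked - job_started).total_seconds()
print('Seconds before the user script started:', overhead_seconds)
```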

## Deploy to a real-time endpoint

### Deploy with Python SDK

An `Estimator` can be deployed directly after training with `Estimator.deploy()`, but here we showcase the more general process of creating a model from S3 artifacts. This approach also works for a model that was trained in a different session or even outside SageMaker.

In [11]:
sklearn_estimator.latest_training_job.wait(logs='None')

model_artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact saved at:', model_artifact)


2021-01-28 03:14:23 Starting - Preparing the instances for training
2021-01-28 03:14:23 Downloading - Downloading input data
2021-01-28 03:14:23 Training - Training image download completed. Training in progress.
2021-01-28 03:14:23 Uploading - Uploading generated training model
2021-01-28 03:14:23 Completed - Training job completed
Model artifact saved at: s3://sagemaker-ap-southeast-1-765838616097/rf-scikit-2021-01-28-03-11-16-212/output/model.tar.gz
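
The artifact is a gzipped tar archive. As a rough illustration of its layout, the sketch below builds a stand-in `model.tar.gz` in a temporary directory and then lists and extracts it with `tarfile`. It assumes the archive holds `model.joblib` at its root, as saved by the training script. The file written here contains placeholder bytes, not a real model, and nothing is downloaded from S3.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a trained model file (placeholder bytes only)
    model_path = os.path.join(workdir, 'model.joblib')
    with open(model_path, 'wb') as f:
        f.write(b'placeholder bytes, not a real model')

    # Package the file at the archive root, mimicking the assumed artifact layout
    archive_path = os.path.join(workdir, 'model.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(model_path, arcname='model.joblib')

    # List the archive members, then extract them into a separate folder
    extract_dir = os.path.join(workdir, 'extracted')
    os.makedirs(extract_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        member_names = tar.getnames()
        tar.extractall(extract_dir)

    extracted_files = os.listdir(extract_dir)

print(member_names)
print(extracted_files)
```

The same packaging step is what would let a model trained outside SageMaker be wrapped into an archive before uploading it to S3.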


In [12]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.20.0',
    py_version='py3',
    role=get_execution_role(),
    entry_point='source/sklearn_training_script.py',
)

In [13]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

---------------!

### Real-time inference

In [14]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

[22.89781825 32.32461933 17.11533017 23.60930065 16.82640491 21.52786205
 19.36272707 16.1172184  21.22235491 21.35377148 20.08234131 19.68911854
  8.37202551 21.68816071 19.81871102 25.39598968 18.92982765  8.6959904
 44.79736857 15.42281086 24.01005689 23.9527035  14.82616667 23.01741364
 14.7972312  15.88109992 21.50991216 14.15101125 19.36409918 21.10882036
 20.15901966 23.51973211 28.53647262 20.37366353 14.5850798  15.77324762
 34.74719603 19.17739405 21.06302862 23.92563996 19.49016432 29.03301753
 45.25695191 19.44070602 22.61508492 13.90397262 15.54876281 24.05766634
 18.19423359 28.02974614 21.29765284 33.53048732 16.85717872 26.64862957
 45.4798163  21.68480754 15.81891082 33.15620317 22.45891548 20.98030999
 25.77045105 33.68503651 30.738625   18.91233247 27.56041346 16.0303732
 13.49678997 23.13210105 29.12914441 14.95546002 20.50972165 27.4869562
  9.97853506 22.17713889 22.20545397  7.40620646 20.03419776 45.99623603
 11.14932013 13.95852644 21.46770169 11.02019978 20.12

### Delete endpoint

In [15]:
predictor.delete_endpoint(delete_endpoint_config=True)
