## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

This notebook shows how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest). More information on Scikit-Learn is available at https://scikit-learn.org/stable/index.html. We use the Boston Housing dataset bundled with Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset


## Setup libraries and environment


In [2]:
import datetime
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

sm_boto3 = boto3.client('sagemaker')
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)


Using bucket sagemaker-ap-northeast-2-951310885027


## Prepare data
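We load a dataset from Scikit-Learn, split it into training and test sets, and save both as CSV files for later upload to S3. Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an earlier release.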

In [3]:
# we use the Boston housing dataset 
data = load_boston()

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [5]:
trainX.head()

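| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09103 | 0.0 | 2.46 | 0.0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3.0 | 193.0 | 17.8 | 394.12 | 4.82 | 37.9 |
| 1 | 3.53501 | 0.0 | 19.58 | 1.0 | 0.871 | 6.152 | 82.6 | 1.7455 | 5.0 | 403.0 | 14.7 | 88.01 | 15.02 | 15.6 |
| 2 | 0.03578 | 20.0 | 3.33 | 0.0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5.0 | 216.0 | 14.9 | 387.31 | 3.76 | 45.4 |
| 3 | 0.38735 | 0.0 | 25.65 | 0.0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2.0 | 188.0 | 19.1 | 359.29 | 27.26 | 15.7 |
| 4 | 0.06724 | 0.0 | 3.24 | 0.0 | 0.46 | 6.333 | 17.2 | 5.2146 | 4.0 | 430.0 | 16.9 | 375.21 | 7.34 | 22.6 |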


In [6]:
# create directories (-p avoids an error when they already exist)
! mkdir -p data source model

# save data as csv
# the DataFrame index is written as an unnamed first column;
# the training script selects features by name, so it is ignored there
trainX.to_csv('data/boston_train.csv')
testX.to_csv('data/boston_test.csv')


## Create a training script
The script below contains both the training and the inference code, and it runs either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [7]:
!pwd

/root/AWS-Enterprise-Boost/3_migration_challenge


In [8]:
%%writefile source/sklearn_training_script.py

import argparse
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
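try:
    import joblib  # standalone package, used by scikit-learn >= 0.21
except ImportError:
    from sklearn.externals import joblib  # vendored copy in scikit-learn 0.20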


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == '__main__':
    
    #------------------------------- parsing input parameters (from command line)
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # RandomForest hyperparameters
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--min_samples_leaf', type=int, default=3)
    
    # Data, model, and output directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train_file', type=str, default='boston_train.csv')
    parser.add_argument('--test_file', type=str, default='boston_test.csv')
    parser.add_argument('--features', type=str)  # explicitly name which features to use
    parser.add_argument('--target_variable', type=str)  # explicitly name the column to be used as target

    args, _ = parser.parse_known_args()
    
    #------------------------------- data preparation
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target_variable]
    y_test = test_df[args.target_variable]
    
    #------------------------------- model training
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)
    
    model.fit(X_train, y_train)
    
    #-------------------------------  model testing
    print('testing model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # percentile absolute errors
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
        
    #------------------------------- save model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model saved at ' + path)


Overwriting source/sklearn_training_script.py
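
The script stays portable because every SageMaker-specific path is only an `argparse` default read from an environment variable, and any value passed on the command line takes precedence. A minimal sketch of that pattern, reusing the `SM_MODEL_DIR` variable and the `--features` flag from the script above; the helper `parse_args`, the `--unused_flag` argument and the example paths are hypothetical:

```python
import argparse
import os


def parse_args(argv):
    """Parse arguments the way the training script does (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # In a SageMaker training container SM_MODEL_DIR is set by the platform.
    # Locally it is usually absent, so the default is None and --model_dir must be passed.
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--features', type=str)
    # parse_known_args returns unknown arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


# Local run: the path comes from the command line
local_args = parse_args(['--model_dir', 'model/', '--features', 'CRIM ZN INDUS'])
print(local_args.model_dir, local_args.features.split())

# Simulated container run: the path comes from the environment
os.environ['SM_MODEL_DIR'] = '/opt/ml/model'
container_args = parse_args(['--features', 'CRIM ZN INDUS', '--unused_flag', '1'])
print(container_args.model_dir)
```

The space-separated `--features` string is split into a list of column names only after parsing, which is why the whole list can travel as a single hyperparameter value.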


## Local training
Script arguments allow us to keep SageMaker-specific configuration out of the script and run it locally.

In [9]:
! python source/sklearn_training_script.py \
    --n_estimators 100 \
    --min_samples_leaf 3 \
    --model_dir 'model/' \
    --train_dir 'data/' \
    --test_dir 'data/' \
    --train_file 'boston_train.csv' \
    --test_file 'boston_test.csv' \
    --features 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT' \
    --target_variable 'target'

extracting arguments
reading data
building training and testing datasets
training model
testing model
AE-at-10th-percentile: 0.2933440548340556
AE-at-50th-percentile: 1.472081152181154
AE-at-90th-percentile: 4.407958537018532
model saved at model/model.joblib
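
The three `AE-at-...` lines report percentiles of the absolute error on the test set. The sketch below reproduces the calculation in plain Python on made-up numbers, assuming the linear interpolation that `np.percentile` applies by default; the `percentile` helper and the sample values are hypothetical and are not taken from the Boston data:

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)  # fractional position in the sorted list
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical targets and predictions
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pred = [11.0, 18.0, 33.0, 36.0, 55.0]
abs_err = [abs(p - t) for p, t in zip(y_pred, y_true)]

for q in [10, 50, 90]:
    print('AE-at-' + str(q) + 'th-percentile: ' + str(percentile(abs_err, q)))
```

Read this way, the 50th-percentile line in the output above means that half of the test predictions were off by less than about 1.47 target units.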


## SageMaker Training

### Creating data input channels (copy to S3)

In [10]:
# send data to S3. SageMaker will take training data from s3
train_path_s3 = sess.upload_data(
    path='data/boston_train.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

test_path_s3 = sess.upload_data(
    path='data/boston_test.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

print('Train set URI:', train_path_s3)
print('Test set URI:', test_path_s3)

Train set URI: s3://sagemaker-ap-northeast-2-951310885027/sagemaker/sklearncontainer/boston_train.csv
Test set URI: s3://sagemaker-ap-northeast-2-951310885027/sagemaker/sklearncontainer/boston_test.csv
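
Judging from the URIs printed above, each file lands at `s3://<bucket>/<key_prefix>/<file name>`. A small sketch of that layout, useful for predicting where a channel will point; `expected_s3_uri` is a hypothetical helper and `example-bucket` is a placeholder rather than the bucket used in this notebook:

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose a URI with the same layout as the upload_data output (hypothetical helper)."""
    file_name = posixpath.basename(local_path)
    return 's3://' + posixpath.join(bucket, key_prefix, file_name)


print(expected_s3_uri('example-bucket', 'sagemaker/sklearncontainer', 'data/boston_train.csv'))
```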


### Launching a training job with the Python SDK

In [11]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='source/sklearn_training_script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.20.0',
    base_job_name='rf-scikit',
    metric_definitions=[
        { 'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$' },
    ],
    hyperparameters={
        'n_estimators': 100,
        'min_samples_leaf': 3,
        'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
        'target_variable': 'target',
    },
    max_run=20*60,  # Maximum allowed active runtime (in seconds)
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30*60,  # Maximum clock time (including spot delays)
)
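
Each entry in `metric_definitions` pairs a metric name with a regular expression whose capture group is taken as the metric value when a log line matches. The sketch below applies the `median-AE` pattern with Python's `re` module to lines copied from the local training output, as a quick way to check a pattern before launching a job; it only approximates how SageMaker scans the logs:

```python
import re

# Same pattern as the 'median-AE' entry in metric_definitions
pattern = re.compile(r'AE-at-50th-percentile: ([0-9.]+).*$')

# Lines copied from the local training output earlier in this notebook
log_lines = [
    'testing model',
    'AE-at-10th-percentile: 0.2933440548340556',
    'AE-at-50th-percentile: 1.472081152181154',
    'AE-at-90th-percentile: 4.407958537018532',
]

matches = [pattern.search(line) for line in log_lines]
median_ae = [float(m.group(1)) for m in matches if m]
print(median_ae)
```

Only the 50th-percentile line matches, so the 10th and 90th percentiles would need their own entries to be tracked as metrics.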

In [12]:
sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)


INFO:sagemaker:Creating training-job with name: rf-scikit-2023-06-19-01-39-50-504


2023-06-19 01:39:50 Starting - Starting the training job...
2023-06-19 01:40:18 Starting - Preparing the instances for training......
2023-06-19 01:41:10 Downloading - Downloading input data...
2023-06-19 01:41:50 Training - Downloading the training image...
2023-06-19 01:42:31 Uploading - Uploading generated training model
2023-06-19 01:42:23,184 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-06-19 01:42:23,187 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-06-19 01:42:23,195 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-06-19 01:42:23,400 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-06-19 01:42:23,412 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-06-19 01:42:23,423 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gp

Keep in mind that this training job is very light because the dataset is very small. Running it locally on the notebook instance is therefore faster than running it on SageMaker, which first has to provision the training infrastructure. For a job this small, provisioning takes longer than the training itself.

With large datasets, running on SageMaker can considerably shorten execution time, especially when multiple instances are used in parallel.
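
The overhead can be estimated from the timestamps in the job log above. This is a rough calculation, assuming the status lines printed by the SDK are close to the real phase boundaries:

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# Timestamps copied from the training job log above
job_created = datetime.strptime('2023-06-19 01:39:50', FMT)
script_invoked = datetime.strptime('2023-06-19 01:42:23', FMT)
upload_started = datetime.strptime('2023-06-19 01:42:31', FMT)

# Provisioning, data download and image download happen before the script runs
overhead_s = (script_invoked - job_created).total_seconds()
# Time between invoking the user script and the start of the model upload
script_s = (upload_started - script_invoked).total_seconds()

print('before the script started:', overhead_s, 's')
print('script start to model upload:', script_s, 's')
```

On these numbers, well over two minutes pass before the script starts, while the script itself finishes within seconds.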

## Deploy to a real-time endpoint

### Deploy with Python SDK

An `Estimator` can be deployed directly after training with `Estimator.deploy()`. Here we showcase the more extensive process of creating a model from S3 artifacts, which can also be used to deploy a model that was trained in a different session or even outside SageMaker.

In [13]:
sklearn_estimator.latest_training_job.wait(logs='None')

model_artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact saved at:', model_artifact)


2023-06-19 01:42:41 Starting - Preparing the instances for training
2023-06-19 01:42:41 Downloading - Downloading input data
2023-06-19 01:42:41 Training - Training image download completed. Training in progress.
2023-06-19 01:42:41 Uploading - Uploading generated training model
2023-06-19 01:42:41 Completed - Training job completed
Model artifact saved at: s3://sagemaker-ap-northeast-2-951310885027/rf-scikit-2023-06-19-01-39-50-504/output/model.tar.gz
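
The artifact is a gzip-compressed tarball of whatever the script wrote to the model directory, which here is the single file `model.joblib`. The sketch below builds a stand-in archive with the same layout and lists its contents with `tarfile`; the placeholder bytes are not a real model, and a downloaded `model.tar.gz` could be inspected the same way:

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the file written by joblib.dump in the training script
    model_file = os.path.join(tmp, 'model.joblib')
    with open(model_file, 'wb') as f:
        f.write(b'placeholder bytes, not a real model')

    # Build an archive with the same layout as the training output
    archive = os.path.join(tmp, 'model.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(model_file, arcname='model.joblib')

    # List the members, as one would for a downloaded artifact
    with tarfile.open(archive, 'r:gz') as tar:
        members = tar.getnames()

print(members)
```

`model_fn` in the training script expects exactly this layout, since it loads `model.joblib` from the directory where the archive is extracted.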


In [14]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.20.0',
    py_version='py3',
    role=get_execution_role(),
    entry_point='source/sklearn_training_script.py',
)

In [15]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2023-06-19-01-43-18-795
INFO:sagemaker:Creating endpoint-config with name sagemaker-scikit-learn-2023-06-19-01-43-19-434
INFO:sagemaker:Creating endpoint with name sagemaker-scikit-learn-2023-06-19-01-43-19-434


----!

### Realtime inference

In [16]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

[22.94055588 31.38659982 16.53363196 23.326014   17.28188987 21.50379527
 19.54200545 15.58337826 21.3146965  20.8856131  19.82728839 19.80642619
  8.01143846 21.76707431 19.36751962 25.99059019 18.60906638  8.85813023
 44.14577868 15.62706775 24.3282636  24.15583005 15.06637242 24.28007078
 14.56851472 15.43823994 21.56505833 14.10941905 19.45325891 20.87988843
 19.96523474 23.50013734 28.9771294  20.33729185 14.41959663 16.01453698
 35.5173161  19.1919735  20.6456816  24.09476046 19.82199637 28.79939297
 44.44159091 19.7570145  23.11445635 13.5517561  15.59662367 24.48814058
 18.77217918 28.89131046 21.00800317 33.57822637 17.41983568 26.07649012
 45.58253644 21.54892551 15.49546808 31.78997745 22.25722053 21.04572442
 25.48368297 34.00273662 31.08214665 18.72318052 27.65262734 16.86650664
 13.4077039  23.23273106 28.73741439 14.98365177 20.52904361 26.97773669
 10.25145725 21.379418   22.26883557  7.4382377  20.13328384 45.23097468
 11.39048089 13.63256825 21.59888286 10.99570062 20
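
The endpoint can also be called without the SageMaker Python SDK, for example from an application that only has `boto3`. The sketch below serializes feature rows to CSV and sends them with `invoke_endpoint`. It assumes the container's default handlers accept `text/csv` input and can return `application/json`, which should be checked against the container version in use; `rows_to_csv` and `invoke` are hypothetical helpers:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize feature rows to a header-less CSV payload."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return buffer.getvalue()


def invoke(endpoint_name, rows):
    """Call the endpoint through boto3 (needs AWS credentials and a live endpoint)."""
    import boto3  # imported here so the serializer can be used without AWS access

    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',      # assumption: supported by the default input handler
        Accept='application/json',   # assumption: supported by the default output handler
        Body=rows_to_csv(rows),
    )
    return response['Body'].read().decode('utf-8')


# First two training rows shown earlier in this notebook, features only
sample = [
    [0.09103, 0.0, 2.46, 0.0, 0.488, 7.155, 92.2, 2.7006, 3.0, 193.0, 17.8, 394.12, 4.82],
    [3.53501, 0.0, 19.58, 1.0, 0.871, 6.152, 82.6, 1.7455, 5.0, 403.0, 14.7, 88.01, 15.02],
]
payload = rows_to_csv(sample)
print(payload)
```

While the endpoint is still running, `invoke(predictor.endpoint_name, sample)` would send these two rows. The columns must be in the same order as the `features` hyperparameter used for training.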

### Delete endpoint

In [17]:
predictor.delete_endpoint(delete_endpoint_config=True)


INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-scikit-learn-2023-06-19-01-43-19-434
INFO:sagemaker:Deleting endpoint with name: sagemaker-scikit-learn-2023-06-19-01-43-19-434
