## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest). More information on Scikit-Learn is available at https://scikit-learn.org/stable/index.html. We use the Boston Housing dataset that ships with Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset. Note that `load_boston` was deprecated in Scikit-Learn 1.0 and removed in 1.2, so the data-loading cell requires an older Scikit-Learn version.


## Setup libraries and environment


In [None]:
import datetime
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

sm_boto3 = boto3.client('sagemaker')
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)


## Prepare data
We load a dataset from Scikit-Learn, split it into training and test sets, and save both as CSV files that are later uploaded to S3.

In [None]:
# we use the Boston housing dataset 
data = load_boston()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [None]:
trainX.head()

In [None]:
# create directories (-p avoids an error if they already exist)
! mkdir -p data
! mkdir -p source
! mkdir -p model

# save data as csv (index=False avoids writing an extra unnamed index column)
trainX.to_csv('data/boston_train.csv', index=False)
testX.to_csv('data/boston_test.csv', index=False)

## Create a training script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [None]:
%%writefile source/sklearn_training_script.py

import argparse
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
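try:
    import joblib  # standalone package (sklearn.externals.joblib was removed in scikit-learn 0.23)
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn (e.g. 0.20)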


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == '__main__':
    
    #------------------------------- parsing input parameters (from command line)
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # RandomForest hyperparameters
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--min_samples_leaf', type=int, default=3)
    
    # Data, model, and output directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train_file', type=str, default='boston_train.csv')
    parser.add_argument('--test_file', type=str, default='boston_test.csv')
    parser.add_argument('--features', type=str)  # explicitly name which features to use
    parser.add_argument('--target_variable', type=str)  # explicitly name the column to be used as target

    args, _ = parser.parse_known_args()
    
    #------------------------------- data preparation
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target_variable]
    y_test = test_df[args.target_variable]
    
    #------------------------------- model training
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)
    
    model.fit(X_train, y_train)
    
    #-------------------------------  model testing
    print('testing model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # percentile absolute errors
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
        
    #------------------------------- save model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model saved at ' + path)
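
The test step of the script reports the absolute error at the 10th, 50th and 90th percentiles. As a rough illustration of what those numbers mean, here is a minimal pure-Python sketch, assuming linear interpolation between sorted values (the default behaviour of `np.percentile`). The helper `percentile` and the sample predictions are made up for this example and are not part of the training script.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # fractional position of the percentile in the sorted list
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# made-up targets and predictions, for illustration only
y_true = [24.0, 21.5, 30.0, 18.0, 15.0]
y_pred = [23.5, 22.5, 28.0, 22.0, 23.0]
abs_err = [abs(p - t) for p, t in zip(y_pred, y_true)]

# same log format as the training script
for q in [10, 50, 90]:
    print('AE-at-' + str(q) + 'th-percentile: ' + str(percentile(abs_err, q)))
```

The 50th percentile is the median absolute error, which is the value the notebook later tracks as a training metric.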


## Local training
Script arguments allow us to keep SageMaker-specific configuration out of the script, so it can also run locally.

In [None]:
! python source/sklearn_training_script.py \
    --n_estimators 100 \
    --min_samples_leaf 3 \
    --model_dir 'model/' \
    --train_dir 'data/' \
    --test_dir 'data/' \
    --train_file 'boston_train.csv' \
    --test_file 'boston_test.csv' \
    --features 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT' \
    --target_variable 'target'
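
The local run above works because the script reads its directories from command-line flags and only falls back to SageMaker environment variables when a flag is missing. The sketch below isolates that pattern. The function `parse_args`, its `environ` parameter and the flag `--some_other_flag` are hypothetical and exist only for this illustration; the `/opt/ml/...` paths are example values for the simulated environment.

```python
import argparse


def parse_args(argv, environ):
    """Mirror the script's pattern: env vars supply defaults, flags override them."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--model_dir', type=str, default=environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=environ.get('SM_CHANNEL_TRAIN'))
    # parse_known_args tolerates flags the parser does not declare
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# local run: no SageMaker variables, directories come from the command line
local_args, _ = parse_args(['--model_dir', 'model/', '--train_dir', 'data/'], environ={})

# simulated SageMaker run: directories come from the environment
sm_env = {
    'SM_MODEL_DIR': '/opt/ml/model',
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
}
sm_args, extra = parse_args(
    ['--n_estimators', '100', '--some_other_flag', 'x'], environ=sm_env)

print(local_args.model_dir, local_args.train_dir)
print(sm_args.model_dir, sm_args.train_dir, sm_args.n_estimators, extra)
```

Using `parse_known_args` instead of `parse_args` means an unexpected extra argument does not abort the script.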

## SageMaker Training

### Creating data input channels (copy to S3)

In [None]:
# send data to S3. SageMaker will take training data from S3
train_path_s3 = sess.upload_data(
    path='data/boston_train.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

test_path_s3 = sess.upload_data(
    path='data/boston_test.csv',  # source
    bucket=bucket,
    key_prefix='sagemaker/sklearncontainer'  # destination path in S3
)

print('Train set URI:', train_path_s3)
print('Test set URI:', test_path_s3)

### Launching a training job with the Python SDK

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='source/sklearn_training_script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.20.0',
    base_job_name='rf-scikit',
    metric_definitions=[
        { 'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$' },
    ],
    hyperparameters={
        'n_estimators': 100,
        'min_samples_leaf': 3,
        'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
        'target_variable': 'target',
    },
    max_run=20*60,  # Maximum allowed active runtime (in seconds)
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30*60,  # Maximum clock time (including spot delays)
)
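
The `metric_definitions` entry above tells SageMaker to scrape the training logs with a regular expression and record the captured group as the metric value. The sketch below applies the same regex to a few sample log lines with Python's `re` module. The log values and the helper `extract_metric` are invented for this illustration, and SageMaker's own log scanning may differ in detail.

```python
import re

# same pattern as in metric_definitions
METRIC_REGEX = 'AE-at-50th-percentile: ([0-9.]+).*$'

# invented log lines in the format printed by the training script
sample_log = [
    'training model',
    'testing model',
    'AE-at-10th-percentile: 0.21',
    'AE-at-50th-percentile: 1.35',
    'AE-at-90th-percentile: 4.80',
]


def extract_metric(lines, pattern):
    """Return the captured value of every line that matches the pattern."""
    values = []
    for line in lines:
        match = re.search(pattern, line)
        if match:
            values.append(float(match.group(1)))
    return values


median_ae = extract_metric(sample_log, METRIC_REGEX)
print(median_ae)
```

Only the 50th-percentile line matches, so a single value is captured. One caveat of the character class `[0-9.]+` is that it stops at the first character that is not a digit or a dot, so a value printed in scientific notation such as `1e-05` would be captured as `1`.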

In [None]:
sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)


Keep in mind that this training job is very light because the dataset is small. Running it locally on the notebook instance is therefore faster than running it on SageMaker, which first has to provision the training infrastructure. For such a small job, provisioning takes longer than the training itself.

In a real situation with large datasets, running on SageMaker can considerably speed up execution, especially if multiple instances are used in parallel.

## Deploy to a real-time endpoint

### Deploy with Python SDK

An `Estimator` can be deployed directly after training with `Estimator.deploy()`. Here we instead showcase the more extensive process of creating a model from S3 artifacts, which can also be used to deploy a model that was trained in a different session or even outside SageMaker.

In [None]:
sklearn_estimator.latest_training_job.wait(logs='None')

model_artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact saved at:', model_artifact)
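
The model artifact is a `model.tar.gz` archive holding whatever the script wrote to the model directory, here `model.joblib`. The sketch below shows how such an archive could be inspected with the standard `tarfile` module once it has been downloaded. To stay self-contained it first builds a stand-in archive with a placeholder file instead of downloading the real artifact, so all paths and file contents are assumptions.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# stand-in for a downloaded artifact: model.tar.gz holding a placeholder model.joblib
placeholder = os.path.join(workdir, 'model.joblib')
with open(placeholder, 'wb') as f:
    f.write(b'not a real model')

archive_path = os.path.join(workdir, 'model.tar.gz')
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(placeholder, arcname='model.joblib')

# list and unpack the archive, as one might do with the real artifact
extract_dir = os.path.join(workdir, 'extracted')
os.makedirs(extract_dir)
with tarfile.open(archive_path, 'r:gz') as tar:
    members = tar.getnames()
    tar.extractall(path=extract_dir)

print(members)
print(os.listdir(extract_dir))
```

With the real artifact, the extracted `model.joblib` could then be loaded by the script's `model_fn`, which expects a directory containing that file.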

In [None]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.20.0',
    py_version='py3',
    role=get_execution_role(),
    entry_point='source/sklearn_training_script.py',
)

In [None]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

### Realtime inference

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

### Delete endpoint

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
