## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest). More info on Scikit-Learn is available at https://scikit-learn.org/stable/index.html. We use the Boston Housing dataset, which ships with Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset


More info on the dataset:

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

References

 * Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980, pp. 244-261.
 * Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

In [None]:
import datetime
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston


sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

## Prepare data
We load a dataset from sklearn, split it and send it to S3

In [None]:
# we use the Boston housing dataset 
data = load_boston()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [None]:
trainX.head()

In [None]:
trainX.to_csv('boston_train.csv')
testX.to_csv('boston_test.csv')

In [None]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path='boston_train.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')

testpath = sess.upload_data(
    path='boston_test.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [None]:
%%writefile script.py

import argparse
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
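try:
    import joblib  # standalone package, required with scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn (e.g. 0.20)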



# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf



if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--min-samples-leaf', type=int, default=3)

    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='boston_train.csv')
    parser.add_argument('--test-file', type=str, default='boston_test.csv')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)
    
    model.fit(X_train, y_train)

    # compute the absolute error on the test set
    print('validating model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print a few percentiles of the absolute error
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
        
    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model persisted at ' + path)
    print(args.min_samples_leaf)

## Local training
Script arguments allow us to keep any SageMaker-specific configuration out of the script, so it can also run locally.

In [None]:
! python script.py --n-estimators 100 \
                   --min-samples-leaf 2 \
                   --model-dir /home/ec2-user/SageMaker \
                   --train /home/ec2-user/SageMaker \
                   --test /home/ec2-user/SageMaker \
                   --features 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT' \
                   --target target
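
The command-line mechanics used above can be sketched in isolation. This is a minimal illustration, assuming a reduced parser with only two of the script's flags; `parse_script_args` is a hypothetical helper, not part of `script.py`. It shows that hyphenated flags become underscore attributes, and that `parse_known_args` silently ignores flags it does not recognise.

```python
import argparse


def parse_script_args(argv):
    """Parse a reduced version of the training script's arguments (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--min-samples-leaf', type=int, default=3)
    # parse_known_args returns (namespace, list of unrecognised arguments)
    return parser.parse_known_args(argv)


# Hyphenated flags are exposed as underscore attributes on the namespace
args, unknown = parse_script_args(['--n-estimators', '100', '--min-samples-leaf', '2'])
print(args.n_estimators, args.min_samples_leaf, unknown)

# A flag spelled with an underscore is not an error: it ends up in the
# unrecognised list and the default value is kept
args2, unknown2 = parse_script_args(['--n_estimators', '300'])
print(args2.n_estimators, unknown2)
```

Because unrecognised flags do not raise an error, a mismatch between a hyperparameter name and the corresponding flag can go unnoticed, so it may be worth checking the values the script actually received.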

## SageMaker Training

### Launching a training job with the Python SDK

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    framework_version='0.20.0',
    metric_definitions=[
        {'Name': 'median-AE',
         'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 3,
                       'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
                       'target': 'target'})

In [None]:
# launch training job, with asynchronous call
sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)
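
The `metric_definitions` regex above is applied to the training logs to extract the tracked metric. As a rough local check, the sketch below applies the same pattern to a made-up log line; `extract_metric` and the sample lines are assumptions for illustration only.

```python
import re

# Same pattern as in metric_definitions
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


sample_log = "AE-at-50th-percentile: 1.2345"  # made-up log line
print(extract_metric(sample_log))

# Lines for other percentiles do not match the median pattern
print(extract_metric("AE-at-10th-percentile: 0.4"))
```

Note that `[0-9.]+` captures only digits and dots, so a value printed in scientific notation (for example `1e-05`) would be truncated to its leading digits.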

### Alternative: launching a training with `boto3`
`boto3` is more verbose, yet gives more visibility into the low-level details of Amazon SageMaker

In [None]:
# first compress the code and send to S3

source = 'source.tar.gz'
project = 'scikitlearn-train-from-boto3'

# the context manager closes the archive even if adding the file fails
with tarfile.open(source, 'w:gz') as tar:
    tar.add('script.py')

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)
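
Before uploading, it can help to confirm that the archive has the script at its root, since `sagemaker_program` is given as a bare file name. The sketch below is a self-contained check on a temporary dummy file; `package_script` is a hypothetical helper and the dummy script content is made up.

```python
import os
import tarfile
import tempfile


def package_script(script_path, archive_path):
    """Write script_path into a gzip tarball at the archive root and list its members."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname drops any directory prefix so the file sits at the archive root
        tar.add(script_path, arcname=os.path.basename(script_path))
    with tarfile.open(archive_path, 'r:gz') as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as workdir:
    demo_script = os.path.join(workdir, 'script.py')
    with open(demo_script, 'w') as f:
        f.write("print('hello')\n")  # dummy content, for illustration only
    names = package_script(demo_script, os.path.join(workdir, 'source.tar.gz'))
    print(names)
```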

When using `boto3` to launch a training job we must explicitly point to a Docker image. Below we define a rudimentary function that builds the ECR image URI (not an ARN, despite the function name) of a sklearn 0.20 CPU container for Python 3

In [None]:
image_registry_map = {
    'us-west-1': '746614075791',
    'us-west-2': '246618743249',
    'us-east-1': '683313688378',
    'us-east-2': '257758044811',
    'ap-northeast-1': '354813040037',
    'ap-northeast-2': '366743142698',
    'ap-southeast-1': '121021644041',
    'ap-southeast-2': '783357654285',
    'ap-south-1': '720646828776',
    'eu-west-1': '141502667606',
    'eu-west-2': '764974769150',
    'eu-central-1': '492215442770',
    'ca-central-1': '341280168497',
    'us-gov-west-1': '414596584902',
    'us-iso-east-1': '833128469047',
}

def container_arn(region):
    
    return (image_registry_map[region] + '.dkr.ecr.' + region 
            + '.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3')

In [None]:
# launch training job

response = sm_boto3.create_training_job(
    TrainingJobName='sklearn-boto3-' + datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
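        # keys must match the script's argparse flags (hyphens, not underscores),
        # otherwise parse_known_args ignores them and the defaults are used
        'n-estimators': '300',
        'min-samples-leaf': '3',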
        'sagemaker_program': 'script.py',
        'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
        'target': 'target',
        'sagemaker_submit_directory': 's3://' + bucket + '/' + project + '/' + source 
    },
    AlgorithmSpecification={
        'TrainingImage': container_arn(region),
        'TrainingInputMode': 'File',
        'MetricDefinitions': [
            {'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$'},
        ]
    },
    RoleArn=get_execution_role(),
    InputDataConfig=[
        {
            'ChannelName': 'train',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': trainpath,
                    'S3DataDistributionType': 'FullyReplicated',
                }
            }},
        {
            'ChannelName': 'test',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': testpath,
                    'S3DataDistributionType': 'FullyReplicated',
                }
            }},
    ],
    OutputDataConfig={'S3OutputPath': 's3://'+ bucket + '/sagemaker-sklearn-artifact/'},
    ResourceConfig={
        'InstanceType': 'ml.c5.xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 10
    },
    StoppingCondition={'MaxRuntimeInSeconds': 86400},
    EnableNetworkIsolation=False
)

print(response)
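
`create_training_job` returns as soon as the job is submitted, so the job status has to be checked separately. A minimal polling sketch follows, assuming a client object that exposes `describe_training_job` like the boto3 SageMaker client does; `wait_for_training_job` and `FakeSageMakerClient` are hypothetical names, and the fake client only exists so the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes state
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('training job %s did not finish in time' % job_name)


class FakeSageMakerClient:
    """Stand-in for boto3.client('sagemaker'), replaying a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobStatus': next(self._statuses)}


fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(fake_client, 'demo-job', poll_seconds=0)
print(final_status)
```

With a real client, the call would look like `wait_for_training_job(sm_boto3, job_name)`, where `job_name` is the `TrainingJobName` passed to `create_training_job`. boto3 also provides a built-in waiter, `sm_boto3.get_waiter('training_job_completed_or_stopped')`, which may be preferable in practice.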

### Launching a tuning job with the Python SDK

In [None]:
# we use the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    'n-estimators': IntegerParameter(20, 100),
    'min-samples-leaf': IntegerParameter(2, 6)}

# create Optimizer
Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=sklearn_estimator,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name='RF-tuner',
    objective_type='Minimize',
    objective_metric_name='median-AE',
    metric_definitions=[
        {'Name': 'median-AE',
         'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}],  # extract tracked metric from logs with regexp 
    max_jobs=20,
    max_parallel_jobs=2)

In [None]:
Optimizer.fit({'train': trainpath, 'test': testpath})

In [None]:
# get tuner results in a df
results = Optimizer.analytics().dataframe()
results.head()

## Deploy to a real-time endpoint

### Deploy with Python SDK

An `Estimator` could be deployed directly after training with `Estimator.deploy()`, but here we showcase the more extensive process of creating a model from S3 artifacts. This approach can also be used to deploy a model that was trained in a different session or even outside SageMaker.

In [None]:
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)

In [None]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=artifact,
    role=get_execution_role(),
    entry_point='script.py')

In [None]:
endpoint_name = 'rf-scikit-endpoint'

model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
    endpoint_name=endpoint_name)

### Invoke with the Python SDK

In [None]:
# we use the SKLearnPredictor from the Python SDK
predictor = sagemaker.sklearn.model.SKLearnPredictor(endpoint_name=endpoint_name)

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

### Alternative: invoke with `boto3`

In [None]:
runtime = boto3.client('sagemaker-runtime')

#### Option 1: `csv` serialization

In [None]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=testX[data.feature_names].to_csv(header=False, index=False).encode('utf-8'),
    ContentType='text/csv')

print(response['Body'].read())
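
The request body above is plain CSV with no header row and no index column, one row per observation. The sketch below builds an equivalent payload with the standard library only, assuming the columns are in the same order as the features used at training time; `rows_to_csv_body` and the sample values are hypothetical.

```python
import csv
import io


def rows_to_csv_body(rows):
    """Encode rows of feature values as header-less CSV bytes."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return buffer.getvalue().encode('utf-8')


# Two made-up observations with three features each
body = rows_to_csv_body([[0.1, 18.0, 2.31], [0.2, 0.0, 7.07]])
print(body)
```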

#### Option 2: `npy` serialization

In [None]:
# npy serialization
from io import BytesIO


# Serialise the numpy ndarray as bytes (testX is assumed to be a DataFrame)
buffer = BytesIO()
np.save(buffer, testX[data.feature_names].values)

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=buffer.getvalue(),
    ContentType='application/x-npy')

print(response['Body'].read())
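
The `npy` payload is the binary format written by `np.save`, which keeps shape and dtype. The round trip below is a local sketch of that serialization, independent of the endpoint; `to_npy_bytes`, `from_npy_bytes` and the sample array are illustrative names and values, not part of the notebook.

```python
from io import BytesIO

import numpy as np


def to_npy_bytes(array):
    """Serialise an ndarray to bytes in the npy format."""
    buffer = BytesIO()
    np.save(buffer, array)
    return buffer.getvalue()


def from_npy_bytes(payload):
    """Rebuild an ndarray from npy-formatted bytes."""
    return np.load(BytesIO(payload))


features = np.array([[0.1, 2.0, 3.5], [4.0, 5.5, 6.0]])  # made-up values
payload = to_npy_bytes(features)
restored = from_npy_bytes(payload)
print(restored.shape, restored.dtype, np.array_equal(features, restored))
```

Unlike CSV, this format carries the array's dtype and shape, so no text parsing is involved on the receiving side.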

## Don't forget to delete the endpoint!

In [None]:
sm_boto3.delete_endpoint(EndpointName=endpoint_name)