# Train Using SageMaker Managed Warm Pools with Scikit-Learn Random Forest 

* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

SageMaker Managed Warm Pools let you retain and reuse provisioned infrastructure after the completion of a training job to reduce latency for repetitive workloads, such as iterative experimentation or running many jobs consecutively. Subsequent training jobs that match specified parameters run on the retained warm pool infrastructure, which speeds up start times by reducing the time spent provisioning resources.

In this notebook, we show how to use Amazon SageMaker to develop and train a Scikit-Learn based ML model (Random Forest). For more information on Scikit-Learn, see https://scikit-learn.org/stable/index.html. We use the California Housing dataset, which ships with Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:

> Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.
 
**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

In [1]:
!pip install sagemaker -U

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker
  Downloading sagemaker-2.112.2.tar.gz (579 kB)
  Preparing metadata (setup.py) ... done
Collecting botocore<1.28.0,>=1.27.82
  Downloading botocore-1.27.91-py3-none-any.whl (9.2 MB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.112.2-py2.py3-none-any.whl size=796129 sha256=81c2e8e3b342fe43f4c0401c80dbee1531873af81f55a8c54c16cb1088ec9cdd
  Stored in directory: /home/ec2-user/.cache/pip/wheels/36/9f/18/06cf3b1b76d5f220e62ab030e576092ea53819ea543ff3e790
Successfully built sagemaker
Installing collected

In [2]:
import datetime
import time
import tarfile

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

sm_boto3 = boto3.client("sagemaker")

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print("Using bucket " + bucket)
print(f"Using sagemaker version {sagemaker.__version__}")

Using bucket sagemaker-us-east-1-062083580489
Using sagemaker version 2.112.2


## Prepare data
We load a dataset from scikit-learn, split it into training and test sets, and upload both to S3.

In [3]:
# we use the California housing dataset
data = fetch_california_housing()

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX["target"] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX["target"] = y_test

In [5]:
trainX.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,4.2143,37.0,5.288235,0.973529,860.0,2.529412,33.81,-118.12,2.285
1,5.3468,42.0,6.364322,1.08794,957.0,2.404523,37.16,-121.98,2.799
2,3.9191,36.0,6.110063,1.059748,711.0,2.235849,38.45,-122.69,1.83
3,6.3703,32.0,6.0,0.990196,1159.0,2.272549,34.16,-118.41,4.658
4,2.3684,17.0,4.795858,1.035503,706.0,2.088757,38.57,-121.33,1.5


In [6]:
trainX.to_csv("california_housing_train.csv")
testX.to_csv("california_housing_test.csv")

In [7]:
# send data to S3; SageMaker reads the training data from S3
trainpath = sess.upload_data(
    path="california_housing_train.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

testpath = sess.upload_data(
    path="california_housing_test.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)
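
For orientation, `upload_data` returns the S3 URI of the uploaded object, and these URIs are later passed to `fit()` as the `train` and `test` channels. The sketch below only mimics how such a URI is composed from bucket, key prefix, and file name. `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and the bucket name is a placeholder.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Compose the S3 URI an uploaded file is expected to have (hypothetical helper)."""
    file_name = posixpath.basename(local_path)
    # S3 keys use forward slashes regardless of the local operating system.
    key = posixpath.join(key_prefix.strip("/"), file_name)
    return f"s3://{bucket}/{key}"


train_uri = expected_s3_uri(
    "my-example-bucket",  # placeholder bucket name
    "sagemaker/sklearncontainer",
    "california_housing_train.csv",
)
print(train_uri)
```

Printing `trainpath` and `testpath` in the notebook is a quick way to confirm where the files actually landed.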

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). For detailed guidance, see https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [8]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == "__main__":

    print("extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="california_housing_train.csv")
    parser.add_argument("--test-file", type=str, default="california_housing_test.csv")
    parser.add_argument(
        "--features", type=str
    )  # in this script we ask user to explicitly name features
    parser.add_argument(
        "--target", type=str
    )  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print("training model")
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
    )

    model.fit(X_train, y_train)

    # compute the absolute error on the test set
    print("validating model")
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print a few percentiles of the absolute error
    for q in [10, 50, 90]:
        print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print("model persisted at " + path)
    print(args.min_samples_leaf)

Writing script.py
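
To illustrate how the script receives its configuration, the sketch below parses a simulated command line the same way `script.py` does. It is a trimmed-down illustration rather than the full script: the argument list is shortened, and the `/tmp/model` fallback for `SM_MODEL_DIR` is an assumption added so the example also runs outside SageMaker.

```python
import argparse
import os


def parse_args(argv):
    """Parse a subset of the arguments that script.py accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Fallback directory is an assumption for this sketch; SageMaker sets SM_MODEL_DIR itself.
    parser.add_argument(
        "--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/tmp/model")
    )
    parser.add_argument("--features", type=str)
    # parse_known_args collects unrecognized arguments instead of raising an error.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_args(
    ["--n-estimators", "100", "--features", "MedInc HouseAge", "--some-extra-flag", "1"]
)
print(args.n_estimators)      # dashes in option names become underscores
print(args.features.split())  # the space-separated string becomes a list of column names
print(unknown)                # arguments the parser does not recognize
```

Because every SageMaker-specific value enters through an argument or an environment variable default, the same script can be started from a plain shell.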


## Local training
Script arguments let us keep SageMaker-specific configuration out of the script, so the same script can also run locally.

In [9]:
! python script.py --n-estimators 100 \
                   --min-samples-leaf 2 \
                   --model-dir ./ \
                   --train ./ \
                   --test ./ \
                   --features 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude' \
                   --target target

extracting arguments
reading data
building training and testing datasets
training model
validating model
AE-at-10th-percentile: 0.03138073809523758
AE-at-50th-percentile: 0.2059283809523813
AE-at-90th-percentile: 0.7769970204761901
model persisted at ./model.joblib
2
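
The `AE-at-Nth-percentile` lines report percentiles of the absolute error between predictions and targets. As a rough illustration of what these numbers mean, the sketch below recomputes such percentiles in plain Python using linear interpolation between sorted values, which is assumed here to correspond to the default behavior of `np.percentile`. The targets and predictions are made-up values, not results from the notebook.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 2.0, 4.25, 7.0]
abs_err = [abs(p - t) for p, t in zip(y_pred, y_true)]

for q in [10, 50, 90]:
    print(f"AE-at-{q}th-percentile: {percentile(abs_err, q)}")
```

A low 50th percentile with a much higher 90th percentile, as in the run above, indicates that most predictions are close while a minority of errors are considerably larger.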


## SageMaker Training

### Launching a training job with the Python SDK and creating a warm pool

To create a warm pool, use the SageMaker Python SDK to create an estimator with a `keep_alive_period_in_seconds` value greater than 0 and call `fit()`. 

When the training job completes, a warm pool is retained. For more information on training scripts and estimators, see [Train a Model with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). If your script does not create a warm pool, see [Warm pool creation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-creation) for possible explanations.

In [10]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator_1 = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    job_name="rf-scikit-1",
    keep_alive_period_in_seconds=3600,
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "features": "MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude",
        "target": "target",
    },
)
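
The `metric_definitions` entry tells SageMaker which regular expression to apply to the training logs to extract the `median-AE` metric. The sketch below applies the same pattern with Python's `re` module to the log lines printed by the local run. This is only an approximation for checking the pattern: the service's own regex handling may differ in details.

```python
import re

# Same pattern as in metric_definitions.
pattern = re.compile(r"AE-at-50th-percentile: ([0-9.]+).*$")

# Log lines taken from the local training run shown earlier.
log_lines = [
    "AE-at-10th-percentile: 0.03138073809523758",
    "AE-at-50th-percentile: 0.2059283809523813",
    "AE-at-90th-percentile: 0.7769970204761901",
]

matches = [pattern.search(line) for line in log_lines]
# Only the 50th-percentile line matches; the captured group is the metric value.
median_ae = [float(m.group(1)) for m in matches if m]
print(median_ae)
```

Testing a pattern against sample log output like this before launching a job helps catch a regex that silently matches nothing.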

Training the model will take around 4 minutes to complete. 

This time includes preparing the instances for training, downloading the input data and the training image, training the model, and finally uploading the trained model to S3.

In [11]:
%%time

# launch the training job and block until it completes (wait=True makes the call synchronous)
sklearn_estimator_1.fit({"train": trainpath, "test": testpath}, wait=True)

2022-10-16 04:35:54 Starting - Starting the training job...
2022-10-16 04:36:18 Starting - Preparing the instances for training
ProfilerReport-1665894954: InProgress
.........
2022-10-16 04:37:39 Downloading - Downloading input data...
2022-10-16 04:38:19 Training - Downloading the training image...
2022-10-16 04:38:55 Uploading - Uploading generated training model
2022-10-16 04:38:42,213 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-10-16 04:38:42,217 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:38:42,226 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-10-16 04:38:42,621 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:38:42,636 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:38:42,647 sagemaker-training-toolkit I

Check the warm pool status of the first job to confirm that the warm pool is `Available`.

In [14]:
response = sess.describe_training_job(sklearn_estimator_1._current_job_name)
response['WarmPoolStatus']

{'Status': 'Available'}
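
The `WarmPoolStatus` field can be turned into a readable summary with a small helper. The sketch below is a hypothetical convenience function, not part of the SDK. It assumes the status values `Available`, `InUse`, `Reused`, and `Terminated` and the optional `ReusedByJob` field as described in the boto3 `describe_training_job` documentation, and it runs against a hand-written sample response rather than a live API call.

```python
def summarize_warm_pool(describe_response: dict) -> str:
    """Describe the warm pool state found in a describe_training_job response."""
    warm_pool = describe_response.get("WarmPoolStatus")
    if warm_pool is None:
        return "no warm pool information"
    status = warm_pool["Status"]
    if status == "Available":
        return "warm pool is idle and can be reused by a matching job"
    if status == "InUse":
        return "warm pool is currently running a training job"
    if status == "Reused":
        return f"warm pool was reused by {warm_pool.get('ReusedByJob', 'another job')}"
    if status == "Terminated":
        return "warm pool has been terminated"
    return f"unrecognized status: {status}"


# Sample response shaped like the output above; not fetched from AWS.
sample_response = {"WarmPoolStatus": {"Status": "Available"}}
print(summarize_warm_pool(sample_response))
```

In the notebook, the same helper could be called on `response` from the cell above.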

### Launching another training job reusing the warm pool created previously

Next, create a second matching training job. In this example, we create `rf-scikit-2`, which has all of the necessary attributes to match with `rf-scikit-1`, but has a different hyperparameter for experimentation. The second training job reuses the warm pool and starts up faster than the first training job. For more information on which attributes need to match, see Matching training jobs.
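
The idea of matching can be sketched as a comparison of infrastructure-related attributes while ignoring hyperparameters. The attribute list below is an illustrative subset chosen for this example, not the authoritative matching criteria, and the role ARN is a placeholder. The actual matching is performed by the SageMaker service.

```python
# Attributes compared in this sketch; an illustrative subset, not the authoritative list.
MATCH_KEYS = ("role", "instance_type", "instance_count")


def may_share_warm_pool(job_a: dict, job_b: dict) -> bool:
    """Return True when both job descriptions agree on every compared attribute."""
    return all(job_a.get(key) == job_b.get(key) for key in MATCH_KEYS)


ROLE = "arn:aws:iam::111122223333:role/ExampleRole"  # placeholder role ARN

job_1 = {"role": ROLE, "instance_type": "ml.c5.xlarge", "instance_count": 1, "n_estimators": 100}
job_2 = {"role": ROLE, "instance_type": "ml.c5.xlarge", "instance_count": 1, "n_estimators": 200}
job_3 = {"role": ROLE, "instance_type": "ml.m5.xlarge", "instance_count": 1, "n_estimators": 100}

print(may_share_warm_pool(job_1, job_2))  # hyperparameters differ, infrastructure matches
print(may_share_warm_pool(job_1, job_3))  # instance type differs
```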

In [15]:
sklearn_estimator_2 = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    job_name="rf-scikit-2",
    keep_alive_period_in_seconds=3600,
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 200,
        "min-samples-leaf": 3,
        "features": "MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude",
        "target": "target",
    },
)

In [16]:
%%time

# launch the 2nd training job, using SageMaker Warm Pools
sklearn_estimator_2.fit({"train": trainpath, "test": testpath}, wait=True)

2022-10-16 04:41:10 Starting - Starting the training job...
2022-10-16 04:41:34 Downloading - Downloading input data
2022-10-16 04:41:34 Training - Training image download completed. Training in progress.
ProfilerReport-1665895270: InProgress
.
2022-10-16 04:41:33,754 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-10-16 04:41:33,757 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:41:33,765 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-10-16 04:41:34,355 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:41:34,371 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:41:34,387 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-10-16 04:41:34,398 sagemaker-training-toolkit INFO

This time, because the job reused the warm pool of the first training job, the training took around one minute to complete!
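
The startup overhead of the two jobs can be estimated from the log timestamps shown above. The sketch below takes the first `Starting` timestamp and the first container log timestamp of each job, truncated to whole seconds, and computes the difference. Treat the result as a rough estimate: the two timestamps come from different log sources and may not be perfectly aligned.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def startup_seconds(job_started: str, container_started: str) -> float:
    """Seconds between the job start and the first log line from the training container."""
    delta = datetime.strptime(container_started, FMT) - datetime.strptime(job_started, FMT)
    return delta.total_seconds()


# Timestamps copied from the two fit() outputs above.
cold_start = startup_seconds("2022-10-16 04:35:54", "2022-10-16 04:38:42")
warm_start = startup_seconds("2022-10-16 04:41:10", "2022-10-16 04:41:33")

print(f"first job (no warm pool): {cold_start:.0f} s")
print(f"second job (warm pool):   {warm_start:.0f} s")
```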

### Terminate a warm pool

To terminate a warm pool manually, update the training job and set `KeepAlivePeriodInSeconds` to 0.

In [17]:
sess.update_training_job(sklearn_estimator_2._current_job_name, resource_config={"KeepAlivePeriodInSeconds":0})
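
The same update can also be expressed with the low-level boto3 client. The sketch below only builds the request arguments so that it runs without AWS credentials. `build_terminate_request` is a hypothetical helper, the job name is a placeholder, and the commented-out call assumes the boto3 `update_training_job` operation with its `TrainingJobName` and `ResourceConfig` parameters.

```python
def build_terminate_request(job_name: str) -> dict:
    """Build update_training_job arguments that end the warm pool of a job (hypothetical helper)."""
    return {
        "TrainingJobName": job_name,
        # A keep-alive period of 0 releases the retained instances.
        "ResourceConfig": {"KeepAlivePeriodInSeconds": 0},
    }


request = build_terminate_request("my-training-job")  # placeholder job name
print(request)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("sagemaker").update_training_job(**request)
```

Checking `WarmPoolStatus` again after the update is a simple way to confirm that the pool is no longer retained.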