In [None]:
!pip install -q -U git+https://github.com/aws/sagemaker-python-sdk

# Scheduling SKLearn with Amazon SageMaker Pipelines

In this notebook, we use Amazon SageMaker Pipelines to create two workflows with scikit-learn. The first pipeline preprocesses data and trains a model (using a scikit-learn `Pipeline`); the second schedules inference with SageMaker Batch Transform.

### The dataset

For this example, we will create an artificial dataset with the `sklearn.datasets.make_classification()` function from the scikit-learn library. Once that's created, we will store it locally and then in S3.

In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd
from sagemaker import Session

session = Session()
bucket = session.default_bucket()  # Change to another bucket if running outside of SageMaker
prefix = "sklearn-pipeline/data"  # Choose your preferred prefix, but keep it consistent

# Create a random dataset for classification
X, y = make_classification(random_state=42)
data = pd.concat([pd.DataFrame(X), pd.DataFrame(y, columns=["y"])], axis=1)
data.to_csv("/tmp/data.csv", index=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pd.DataFrame(X_test).to_csv("/tmp/x_test.csv", index=False, header=False)
pd.DataFrame(y_test).to_csv("/tmp/y_test.csv", index=False, header=False)
# Upload to S3
data_path = session.upload_data(path="/tmp/data.csv", bucket=bucket, key_prefix=f"{prefix}/source")
x_test_path = session.upload_data(
    path="/tmp/x_test.csv", bucket=bucket, key_prefix=f"{prefix}/test"
)
y_test_path = session.upload_data(
    path="/tmp/y_test.csv", bucket=bucket, key_prefix=f"{prefix}/test"
)

print(data_path)

s3://sagemaker-eu-west-1-859755744029/sklearn-pipeline/data/source/data.csv


### SageMaker helper variables

In [2]:
import sagemaker
import boto3

sess = sagemaker.session.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# Training pipeline

Let's start by creating the preprocessing and training pipeline. We use two pre-existing scripts, `preprocessing.py` and `training.py`, to preprocess the input data and train the model. Both scripts use the `sklearn.pipeline` module and are expected to output a `joblib`-compressed file that is reused during inference.

## Step 1 - Create the parameters of the pipeline

Before we begin to create the pipeline itself, we should think about how to parameterize it. For example, we may use different instance types for different purposes, such as CPU-based types for data processing and GPU-based or more powerful types for model training. These are all "knobs" of the pipeline that we can parameterize. Parameterizing enables custom pipeline executions and schedules without having to modify the pipeline definition.

In [3]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)

# raw input data
input_data = ParameterString(name="InputData", default_value=data_path)

# processing step parameters
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)
processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)

# training step parameters
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.c5.2xlarge")
training_instance_count = ParameterInteger(name="TrainingInstanceCount", default_value=1)

# batch inference step parameters
batch_instance_type = ParameterString(name="BatchInstanceType", default_value="ml.c5.xlarge")
batch_instance_count = ParameterInteger(name="BatchInstanceCount", default_value=1)
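To make the effect of these parameters concrete, the sketch below mimics in plain Python how one execution combines the defaults with per-run overrides. The `resolve_parameters` helper is hypothetical and not part of the SageMaker SDK, and rejecting unknown names is an assumption of this sketch. With the SDK, overrides are passed as `pipeline.start(parameters={...})`.

```python
# Defaults mirror the ParameterString/ParameterInteger objects defined above
# (InputData is left out because its default depends on the uploaded S3 path).
DEFAULTS = {
    "ProcessingInstanceType": "ml.m5.xlarge",
    "ProcessingInstanceCount": 1,
    "TrainingInstanceType": "ml.c5.2xlarge",
    "TrainingInstanceCount": 1,
    "BatchInstanceType": "ml.c5.xlarge",
    "BatchInstanceCount": 1,
}


def resolve_parameters(overrides, defaults=DEFAULTS):
    """Return the values one execution would use: overrides win over defaults."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Assumption of this sketch: names that were never declared are an error.
        raise KeyError(f"Unknown pipeline parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


# Only the training instance type changes; everything else keeps its default.
run_values = resolve_parameters({"TrainingInstanceType": "ml.m5.4xlarge"})
print(run_values["TrainingInstanceType"])
print(run_values["ProcessingInstanceType"])
```

The pipeline definition itself stays untouched; only the values supplied at start time differ between runs.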

## Step 2 - Create the `SKLearnProcessor` and the `ProcessingStep`

The first step in the pipeline preprocesses the data to prepare it for training. We create an `SKLearnProcessor` whose instance type and count are pipeline parameters, so the job configuration can be tracked and changed separately, for example to use a larger instance type or more instances as the dataset grows.

In [4]:
from sagemaker.sklearn.processing import SKLearnProcessor


framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="scheduled-pipelines-sklearn",
    sagemaker_session=sess,
    role=role,
)

In [5]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep


step_process = ProcessingStep(
    name="scheduled-pipeline-sklearn-process",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(
            source=input_data,
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
        ProcessingOutput(output_name="pipeline", source="/opt/ml/processing/output/pipeline"),
    ],
    code="./preprocessing.py",
)
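The step above fixes a directory contract for the script: input files appear under `/opt/ml/processing/input`, and whatever the script writes under the three output directories is uploaded to S3. The actual `preprocessing.py` is not shown in this notebook, so the following is only a minimal standard-library sketch of that contract. The `split_csv` function is hypothetical, and it omits the scikit-learn pipeline that the real script is expected to save under `output/pipeline`.

```python
import csv
import random
from pathlib import Path


def split_csv(input_dir, output_dir, test_fraction=0.25, seed=0):
    """Split the rows of every CSV in input_dir into <output_dir>/train and /test."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    header, rows = None, []
    for csv_file in sorted(input_dir.glob("*.csv")):
        with csv_file.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader)  # assumes every file starts with the same header
            rows.extend(reader)
    if header is None:
        raise FileNotFoundError(f"No CSV files found in {input_dir}")

    # Deterministic shuffle so repeated runs produce the same split.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    splits = {"test": rows[:n_test], "train": rows[n_test:]}

    for name, subset in splits.items():
        target = output_dir / name
        target.mkdir(parents=True, exist_ok=True)
        with (target / f"{name}.csv").open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(subset)
    return {name: len(subset) for name, subset in splits.items()}


# Inside the processing container the call would use the paths from the step:
# split_csv("/opt/ml/processing/input", "/opt/ml/processing/output")
```

Keeping the paths as arguments makes the same function testable on a local directory before it runs in a processing job.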

## Step 3 - Create the `SKLearnEstimator` and its `TrainingStep`

Next, we specify an `SKLearn` estimator object and define a `TrainingStep` that inserts the training job into the pipeline, with inputs taken from the previous SageMaker Processing step. The hyperparameters are those of the best estimator from a previously run tuning job (not shown in this notebook).

In [6]:
from sagemaker.sklearn.estimator import SKLearn

# Define the Estimator from SageMaker (Script Mode)
sklearn_estimator = SKLearn(
    entry_point="training.py",
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    framework_version=framework_version,
    base_job_name="sklearn-training",
    metric_definitions=[{"Name": "model_accuracy", "Regex": "Model Accuracy: ([0-9.]+).*$"}],
    hyperparameters={"n-estimators": 100, "min-samples-leaf": 3},
)
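The `metric_definitions` entry only produces a metric if the training script prints a line that the regular expression matches. The sketch below applies the same pattern to sample log lines with Python's `re` module. The log lines are made up for illustration, and the `extract_accuracy` helper is hypothetical.

```python
import re

# Same pattern as in metric_definitions above.
ACCURACY_PATTERN = re.compile(r"Model Accuracy: ([0-9.]+).*$")


def extract_accuracy(log_line):
    """Return the captured accuracy as a float, or None when the line does not match."""
    match = ACCURACY_PATTERN.search(log_line)
    return float(match.group(1)) if match else None


# Illustrative log lines; the real training.py must print a matching line itself.
print(extract_accuracy("Model Accuracy: 0.92"))
print(extract_accuracy("Fitting the model"))
```

Checking the pattern locally against the exact string the script prints is a cheap way to catch a metric that would otherwise never appear.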

In [7]:
from sagemaker.workflow.steps import TrainingStep
from sagemaker.inputs import TrainingInput

step_train = TrainingStep(
    name="scheduled-pipeline-sklearn-training",
    estimator=sklearn_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri
        ),
        "pipeline": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "pipeline"
            ].S3Output.S3Uri
        ),
    },
)
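Each key of the `inputs` dictionary becomes a data channel inside the training container. In script mode, the hyperparameters arrive as command-line flags and the channel directories are exposed through `SM_CHANNEL_<NAME>` environment variables. The actual `training.py` is not shown here, so the following is only a sketch of how such a script might read them. The `parse_args` function and the fallback paths are assumptions of this sketch.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and channel directories for a script-mode entry point."""
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed as flags named after the dictionary keys.
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # One directory per channel; the fallbacks let the script run outside SageMaker.
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    parser.add_argument(
        "--test", default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test")
    )
    parser.add_argument(
        "--pipeline",
        default=os.environ.get("SM_CHANNEL_PIPELINE", "/opt/ml/input/data/pipeline"),
    )
    # Artifacts written here are packaged as the model output of the training job.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    return parser.parse_args(argv)


args = parse_args(["--n-estimators", "200"])
print(args.n_estimators, args.min_samples_leaf)
```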

## Step 4 - Create the `SKLearnModel` abstraction

As another step, we create a SageMaker `SKLearnModel` object to wrap the model artifact, and associate it with a separate SageMaker prebuilt SKLearn inference container to potentially use later for validation and inference.

In [8]:
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.steps import CreateModelStep
from sagemaker.utils import name_from_base

model = SKLearnModel(
    entry_point="training.py",
    framework_version=framework_version,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sess,
    role=role,
    name=name_from_base("sklearn-pipeline-model"),
)

inputs_model = CreateModelInput(instance_type=batch_instance_type)

step_create_model = CreateModelStep(
    name="scheduled-pipeline-sklearn-createmodel",
    model=model,
    inputs=inputs_model,
)

## Create and execute the Pipeline

With all the pipeline steps now defined, we can define the pipeline itself as a `Pipeline` object comprising those steps. Parallel and conditional steps are also possible.

In [9]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "sklearn-scheduled-training"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        processing_instance_type,
        processing_instance_count,
        training_instance_type,
        training_instance_count,
        batch_instance_type,
        batch_instance_count,
    ],
    steps=[step_process, step_train, step_create_model],
    sagemaker_session=sess,
)

In [10]:
pipeline.upsert(role_arn=role)
execution = pipeline.start()

In [11]:
execution.wait()

In [25]:
step_create_model.properties.ModelName.__dict__

{'_path': 'Steps.scheduled-pipeline-sklearn-createmodel.ModelName',
 '_shape_name': 'ModelName',
 '__str__': 'ModelName'}
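The output above shows that a step property holds a path, not a value: the model name only exists once an execution has run the step. To our knowledge, the SDK serializes such a reference as a `Get` expression in the pipeline definition. The sketch below rebuilds that expression by hand; the `property_reference` helper is hypothetical and the exact JSON shape is an assumption.

```python
import json


def property_reference(step_name, property_path):
    """Build the placeholder that stands for a step property in a pipeline definition."""
    return {"Get": f"Steps.{step_name}.{property_path}"}


ref = property_reference("scheduled-pipeline-sklearn-createmodel", "ModelName")
print(json.dumps(ref))
```

Because the reference is resolved by the pipeline service at run time, it is only meaningful inside the definition of the pipeline that contains the step.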

# Inference Pipeline

Once the model has been trained and the model abstraction created, we can define the pipeline used to schedule inference. It is a very simple pipeline composed of a single step, the `TransformStep`; its only purpose is to wrap Batch Transform so that it can run on a schedule.

In [12]:
from sagemaker.workflow.parameters import ParameterInteger, ParameterString

# model name
model_name = ParameterString(name="ModelName")

# test data
test_data = ParameterString(name="TestData")
output_path = ParameterString(name="OutputPath", default_value=f"s3://{bucket}/{prefix}/output/")

# batch inference step parameters
batch_instance_type = ParameterString(name="BatchInstanceType", default_value="ml.c5.xlarge")
batch_instance_count = ParameterInteger(name="BatchInstanceCount", default_value=1)

In [13]:
from sagemaker.sklearn import SKLearnModel
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=model_name,
    instance_count=batch_instance_count,
    instance_type=batch_instance_type,
    base_transform_job_name="sklearn-transformer",
    output_path=output_path,
)

In [14]:
from sagemaker.workflow.steps import TransformStep
from sagemaker.inputs import TransformInput

transformer_step = TransformStep(
    name="scheduled-pipeline-sklearn-transformer",
    transformer=transformer,
    inputs=TransformInput(data=test_data, content_type="text/csv"),
)

In [15]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "sklearn-scheduled-inference"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        model_name,
        test_data,
        output_path,
        batch_instance_type,
        batch_instance_count,
    ],
    steps=[transformer_step],
    sagemaker_session=sess,
)
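The parameters and steps passed to `Pipeline` end up in a JSON definition, which `pipeline.definition()` returns as a string. The sketch below summarizes such a definition once it has been parsed into a dictionary. The `summarize_definition` helper is hypothetical, and the sample dictionary is a hand-written, abbreviated stand-in whose layout is an assumption, not output captured from this notebook.

```python
def summarize_definition(definition):
    """Return parameter defaults and (name, type) pairs of the steps in a definition dict."""
    parameters = {
        p["Name"]: p.get("DefaultValue") for p in definition.get("Parameters", [])
    }
    steps = [(s["Name"], s["Type"]) for s in definition.get("Steps", [])]
    return parameters, steps


# Hand-written stand-in for json.loads(pipeline.definition()); abbreviated and assumed.
sample_definition = {
    "Version": "2020-12-01",
    "Parameters": [
        {"Name": "ModelName", "Type": "String"},
        {"Name": "BatchInstanceType", "Type": "String", "DefaultValue": "ml.c5.xlarge"},
    ],
    "Steps": [
        {"Name": "scheduled-pipeline-sklearn-transformer", "Type": "Transform", "Arguments": {}},
    ],
}

parameters, steps = summarize_definition(sample_definition)
print(parameters)
print(steps)
```

Parameters without a default, such as `ModelName` here, show up as `None` and must be supplied when an execution is started.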

In [16]:
import json
from IPython.display import JSON

definition = json.loads(pipeline.definition())
JSON(definition)

<IPython.core.display.JSON object>

In [17]:
pipeline.upsert(
    role_arn=role,
    description="A SM Pipeline to have scheduled inference with SM Batch Transform",
)

{'PipelineArn': 'arn:aws:sagemaker:eu-west-1:859755744029:pipeline/sklearn-scheduled-inference',
 'ResponseMetadata': {'RequestId': 'd58fcbd6-98d2-4d8c-b0fd-c252ba6afc76',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd58fcbd6-98d2-4d8c-b0fd-c252ba6afc76',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '95',
   'date': 'Tue, 22 Jun 2021 12:12:27 GMT'},
  'RetryAttempts': 0}}

In [18]:
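# A pipeline property is only a placeholder, so it cannot be passed as a parameter value.
# Look up the concrete model name in the training execution (`execution` from In [10]).
create_model = next(
    s for s in execution.list_steps() if s["StepName"] == step_create_model.name
)
trained_model_name = create_model["Metadata"]["Model"]["Arn"].split("/")[-1]

execution = pipeline.start(
    parameters={
        "ModelName": trained_model_name,  # or replace with your model name
        "TestData": x_test_path,  # or replace with your S3 path
    }
)
execution.wait()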

WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

In [19]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:eu-west-1:859755744029:pipeline/sklearn-scheduled-inference',
 'PipelineExecutionArn': 'arn:aws:sagemaker:eu-west-1:859755744029:pipeline/sklearn-scheduled-inference/execution/qy9ipmpe3qki',
 'PipelineExecutionDisplayName': 'execution-1624363948316',
 'PipelineExecutionStatus': 'Failed',
 'CreationTime': datetime.datetime(2021, 6, 22, 12, 12, 27, 957000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 6, 22, 12, 12, 30, 295000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:eu-west-1:859755744029:user-profile/d-albioydxzy86/davide-d4f',
  'UserProfileName': 'davide-d4f',
  'DomainId': 'd-albioydxzy86'},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:eu-west-1:859755744029:user-profile/d-albioydxzy86/davide-d4f',
  'UserProfileName': 'davide-d4f',
  'DomainId': 'd-albioydxzy86'},
 'ResponseMetadata': {'RequestId': '5a4a61a1-f969-4951-89c2-dcdb71dc1e69',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn
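The `describe()` output reports that the execution failed but not why. The per-step details come from `execution.list_steps()`, which returns one dictionary per step. The sketch below filters such a list for failed steps; the `failed_steps` helper is hypothetical, and the sample data is hand-written, not taken from the run above.

```python
def failed_steps(steps):
    """Return (step name, failure reason) pairs for steps whose status is 'Failed'."""
    return [
        (step["StepName"], step.get("FailureReason", "no reason reported"))
        for step in steps
        if step.get("StepStatus") == "Failed"
    ]


# Hand-written sample in the shape expected from execution.list_steps();
# the failure text is illustrative only.
sample_steps = [
    {
        "StepName": "scheduled-pipeline-sklearn-transformer",
        "StepStatus": "Failed",
        "FailureReason": "ClientError: example failure message",
    }
]
print(failed_steps(sample_steps))
```

In the notebook, the call would be `failed_steps(execution.list_steps())`.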

In [None]:
!aws s3 ls s3://$bucket/$prefix/output/

You can now schedule this pipeline to run at specific times of day with an [EventBridge scheduled rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html) or a [CloudWatch Events rule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Scheduled-Rule.html).
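As a starting point, the sketch below builds the request payloads for the EventBridge `put_rule` and `put_targets` calls that would start the inference pipeline on a schedule. The `build_schedule_requests` helper, the rule name, the ARNs, the model name and the cron expression are all placeholders introduced for illustration. The role is assumed to be one that EventBridge can assume and that is allowed to start pipeline executions.

```python
def build_schedule_requests(rule_name, schedule_expression, pipeline_arn, role_arn, parameters):
    """Build keyword arguments for the EventBridge put_rule and put_targets calls."""
    rule = {
        "Name": rule_name,
        "ScheduleExpression": schedule_expression,
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": f"{rule_name}-target",
                "Arn": pipeline_arn,
                "RoleArn": role_arn,
                # Pipeline parameters are passed as a list of name/value strings.
                "SageMakerPipelineParameters": {
                    "PipelineParameterList": [
                        {"Name": name, "Value": str(value)}
                        for name, value in parameters.items()
                    ]
                },
            }
        ],
    }
    return rule, targets


# Placeholder values: replace the account ID, role, model name, and S3 path with your own.
rule, targets = build_schedule_requests(
    rule_name="sklearn-daily-inference",
    schedule_expression="cron(0 6 * * ? *)",  # every day at 06:00 UTC
    pipeline_arn="arn:aws:sagemaker:eu-west-1:111122223333:pipeline/sklearn-scheduled-inference",
    role_arn="arn:aws:iam::111122223333:role/example-eventbridge-role",
    parameters={"ModelName": "example-model-name", "TestData": "s3://example-bucket/x_test.csv"},
)
print(rule)
print(targets["Targets"][0]["SageMakerPipelineParameters"])
```

With boto3, the payloads would be sent as `events = boto3.client("events")`, then `events.put_rule(**rule)` and `events.put_targets(**targets)`.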