# Use SageMaker Pipelines With Step Caching

This notebook demonstrates how to take advantage of pipeline step caching. With step caching, SageMaker tracks the arguments used for each step execution and re-uses previous, successful executions when the call signatures match. SageMaker only tracks arguments important for the output of the step, so pipeline steps are optimized for cache hits and unnecessary step executions are avoided.

 See the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html) and the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration) for more information. 

## A SageMaker Pipeline
The pipeline in this notebook follows a shortened version of a typical ML pattern. Just two steps are included - preprocessing and training.

## Dataset

The dataset you use is the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone) [1].  The aim for this task is to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.

The dataset contains several features: length (the longest shell measurement), diameter (the diameter perpendicular to length), height (the height with meat in the shell), whole_weight (the weight of whole abalone), shucked_weight (the weight of meat), viscera_weight (the gut weight after bleeding), shell_weight (the weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), and rings (integer).

The number of rings turns out to be a good approximation for age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the rings through a microscope, which is time-consuming. The other physical measurements are easier to determine. You use the dataset to build a predictive model of the variable rings from these other physical measurements.
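
The rings-to-age relationship above is simple arithmetic. A minimal sketch, where the function name `estimated_age` is illustrative and not part of the dataset or the SDK:

```python
def estimated_age(rings: int) -> float:
    """Approximate abalone age in years from a ring count (age = rings + 1.5)."""
    return rings + 1.5


# A shell with 9 rings corresponds to an estimated age of 10.5 years.
print(estimated_age(9))
```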

Before you upload the data to an S3 bucket, install the SageMaker Python SDK and gather some constants you can use later in this notebook.

> [1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

#### Install the latest version of the SageMaker Python SDK. 

In [None]:
!pip install 'sagemaker' --upgrade

## Define Constants

Before downloading the dataset, gather some constants you can use later in this notebook.

In [None]:
import boto3
import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig

sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()  # Or a literal role ARN you've created in your account
pipeline_session = PipelineSession()
default_bucket = sagemaker_session.default_bucket()
model_package_group_name = "AbaloneModelPackageGroupName"
step_cache_config = CacheConfig(enable_caching=True, expire_after="T12H")

Download the Abalone dataset:

In [None]:
local_path = "artifacts/abalone-dataset.csv"

s3 = boto3.resource("s3")
s3.Bucket("sagemaker-sample-files").download_file(
    "datasets/tabular/uci_abalone/abalone.csv", local_path
)

## Define Parameters to Parametrize Pipeline Execution

Define Pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.

The supported parameter types include:

* `ParameterString` - represents a `str` Python type
* `ParameterInteger` - represents an `int` Python type
* `ParameterFloat` - represents a `float` Python type

These parameters support providing a default value, which can be overridden on pipeline execution. The default value specified should be an instance of the type of the parameter.

The parameters defined in this workflow include:

* `processing_instance_count` - The instance count of the processing job.
* `instance_type` - The `ml.*` instance type of the training job.

In [None]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")

## Define a Processing Step for Feature Engineering

First, develop a preprocessing script that is specified in the Processing step.

The file `preprocessing.py` in `artifacts/code/processing` contains the preprocessing script. You can update the script and save the file to overwrite it. The preprocessing script uses `scikit-learn` to do the following:

* Fill in missing sex category data and encode it so that it is suitable for training.
* Scale and normalize all numerical fields other than `sex` and `rings`.
* Split the data into training, validation, and test datasets.
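
The last bullet, splitting the data into training, validation, and test sets, can be sketched with the standard library alone. The 70/15/15 proportions, the fixed seed, and the function name `split_rows` are assumptions made for illustration; the actual `preprocessing.py` script uses `scikit-learn` and may split differently.

```python
import random


def split_rows(rows, train_pct=70, validation_pct=15, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic keeps the split boundaries exact.
    train_end = len(shuffled) * train_pct // 100
    validation_end = len(shuffled) * (train_pct + validation_pct) // 100
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )


train, validation, test = split_rows(range(100))
print(len(train), len(validation), len(test))
```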

The Processing step executes the script on the input data. The Training step uses the preprocessed training features and labels to train a model.

Next, create an instance of a `SKLearnProcessor` processor and use that in our `ProcessingStep`.

You also specify the `framework_version` to use throughout this notebook.

Note the `processing_instance_count` parameter used by the processor instance.

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor


framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type="ml.m5.xlarge",
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=pipeline_session,
)

Finally, take the output of the processor's `.run()` method and pass it as the `step_args` of the `ProcessingStep`. Because the processor was created with a `pipeline_session` as its `sagemaker_session`, `.run()` returns a function call rather than launching a processing job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.

Note the `"train"`, `"validation"`, and `"test"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step.

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.dataset_definition.inputs import S3Input

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(
            source="artifacts/abalone-dataset.csv",
            input_name="abalone-dataset",
            s3_input=S3Input(
                local_path="/opt/ml/processing/input",
                s3_uri="artifacts/abalone-dataset.csv",
                s3_data_type="S3Prefix",
                s3_input_mode="File",
                s3_data_distribution_type="FullyReplicated",
                s3_compression_type="None",
            ),
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="artifacts/code/processing/preprocessing.py",
)

step_process = ProcessingStep(
    name="AbaloneProcess", step_args=processor_args, cache_config=step_cache_config
)

## Define a Training Step to Train a Model

In this section, use Amazon SageMaker's [XGBoost Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train on this dataset. Configure an Estimator for the XGBoost algorithm and the input dataset. A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later.

The model path where the models from training are saved is also specified.

Note the `instance_type` parameter may be used in multiple places in the pipeline. In this case, the `instance_type` is passed into the estimator.

In [None]:
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

model_path = f"s3://{default_bucket}/AbaloneTrain"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type="ml.m5.xlarge",
)
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=1,
    output_path=model_path,
    role=role,
    sagemaker_session=pipeline_session,
)
xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
)

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)

Finally, use the output of the estimator's `.fit()` method as the `step_args` of the `TrainingStep`. Because the estimator was created with a `pipeline_session` as its `sagemaker_session`, `.fit()` returns a function call rather than launching the training job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.

Pass in the `S3Uri` of the `"train"` output channel to the `.fit()` method. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object.

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep


step_train = TrainingStep(name="AbaloneTrain", step_args=train_args, cache_config=step_cache_config)
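
To make the `properties` placeholder concrete: at run time, `step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri` resolves to the value you would find by looking up the output named `train` in a `DescribeProcessingJob` response. The sketch below performs that lookup on a hand-written response fragment. The bucket and path values are made up, and `output_s3_uri` is a hypothetical helper, not an SDK function.

```python
def output_s3_uri(describe_response: dict, output_name: str) -> str:
    """Return the S3 URI of a named output in a DescribeProcessingJob-style response."""
    for output in describe_response["ProcessingOutputConfig"]["Outputs"]:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"no processing output named {output_name!r}")


# Hand-written fragment shaped like a DescribeProcessingJob response (values are made up).
sample_response = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train",
                "S3Output": {"S3Uri": "s3://example-bucket/job/output/train"},
            },
            {
                "OutputName": "validation",
                "S3Output": {"S3Uri": "s3://example-bucket/job/output/validation"},
            },
        ]
    }
}

print(output_s3_uri(sample_response, "train"))
```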

## Define a Pipeline of Parameters and Steps

In this section, combine the steps into a Pipeline, so it can be executed.

A pipeline requires a `name`, `parameters`, and `steps`. Names must be unique within an `(account, region)` pair.

Note:

* All the parameters used in the definitions must be present.
* Steps passed into the pipeline do not have to be listed in the order of execution. The SageMaker Pipeline service resolves the data dependency DAG to determine the order in which the steps run.
* Step names must be unique across the pipeline step list and all condition step if/else lists.

In [None]:
from sagemaker.workflow.pipeline import Pipeline


pipeline_name = "AbaloneBetaPipelineCaching"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_count,
        instance_type,
    ],
    steps=[step_process, step_train],
)

### (Optional) Examining the pipeline definition

The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and the parameters and step properties resolve correctly.

For example, you might check the `ProcessingInputs` of the pre-processing step. The Python SDK intentionally structures input code artifacts' S3 paths in order to optimize caching - more explanation on this later in the notebook.

In [None]:
import json


definition = json.loads(pipeline.definition())
definition
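
The full definition can be long, so it may help to pull out a single step. The sketch below assumes the parsed definition keeps its steps in a top-level `Steps` list whose entries carry `Name`, `Arguments`, and `CacheConfig` keys. `find_step` is a hypothetical helper, and `sample_definition` is a trimmed, hand-written stand-in for the real `definition` object.

```python
def find_step(definition: dict, step_name: str) -> dict:
    """Return the step entry with the given name from a parsed pipeline definition."""
    for step in definition.get("Steps", []):
        if step.get("Name") == step_name:
            return step
    raise KeyError(f"no step named {step_name!r} in the definition")


# Trimmed stand-in for the parsed definition (only the keys used below).
sample_definition = {
    "Steps": [
        {
            "Name": "AbaloneProcess",
            "Type": "Processing",
            "Arguments": {"ProcessingInputs": [{"InputName": "abalone-dataset"}]},
            "CacheConfig": {"Enabled": True, "ExpireAfter": "T12H"},
        }
    ]
}

step = find_step(sample_definition, "AbaloneProcess")
print(step["Arguments"]["ProcessingInputs"])
print(step["CacheConfig"])
```

In the notebook, you would pass the real `definition` instead of `sample_definition`.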

## Submit the pipeline to SageMaker and start execution

Submit the pipeline definition to the Pipeline service. The Pipeline service uses the role that is passed in to create all the jobs defined in the steps.

In [None]:
pipeline.upsert(role_arn=role)

Start the pipeline and accept all the default parameters.

In [None]:
execution = pipeline.start()
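
The call above accepts every default. To override a parameter for a single execution, `pipeline.start` also accepts a `parameters` dictionary keyed by the parameter names defined earlier (`ProcessingInstanceCount`, `TrainingInstanceType`). The sketch below only builds and checks such a dictionary. `check_overrides` and `EXPECTED_TYPES` are hypothetical helpers, and the override is not needed for the rest of this walkthrough.

```python
# Expected Python type for each pipeline parameter, keyed by its `name=` value.
EXPECTED_TYPES = {
    "ProcessingInstanceCount": int,
    "TrainingInstanceType": str,
}


def check_overrides(overrides: dict) -> dict:
    """Reject unknown parameter names or values of the wrong type."""
    for name, value in overrides.items():
        if name not in EXPECTED_TYPES:
            raise KeyError(f"unknown pipeline parameter: {name}")
        if not isinstance(value, EXPECTED_TYPES[name]):
            raise TypeError(f"{name} expects {EXPECTED_TYPES[name].__name__}")
    return overrides


overrides = check_overrides({"ProcessingInstanceCount": 2})

# In the notebook, after the pipeline has been upserted:
# execution = pipeline.start(parameters=overrides)
print(overrides)
```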

## Pipeline Operations: Examining and Waiting for Pipeline Execution

Describe the pipeline execution.

In [None]:
execution.describe()

Wait for the execution to complete.

In [None]:
execution.wait()

List the steps in the execution. These are the steps in the pipeline that have been resolved by the step executor service.

In [None]:
execution.list_steps()

## Caching Behavior
In the next part of the notebook, we observe both cache hit and cache miss scenarios. There are many parameters that are passed into SageMaker pipeline steps. Some directly influence the results of the corresponding SageMaker jobs such as the input data, while others describe how the job will run, for example an `instance_type`. When parameters from the first group are updated, a cache miss occurs and the step re-runs. When parameters from the second group are updated, a cache hit occurs and the step does not execute, as the job results are unaffected. In the following pipeline execution examples, parameters from both categories are updated and the effects of each one are observed.

There are many other parameters outside of these examples - for more information on how they affect caching, or for more information on how to opt in to or out of caching, please refer to the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html) and the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration). 
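
The two parameter groups can be pictured with a toy model: build a key from only the tracked arguments, and treat two executions with the same key as a cache hit. This is a sketch for intuition, not SageMaker's actual implementation. The `TRACKED` set, the argument names, and the `cache_key` function are all invented for this example.

```python
import hashlib
import json

# Assumption for illustration: argument names that influence the job's output.
TRACKED = {"code_s3_uri", "input_s3_uri", "hyperparameters"}


def cache_key(step_name: str, arguments: dict) -> str:
    """Hash the step name together with its tracked arguments only."""
    tracked = {k: v for k, v in arguments.items() if k in TRACKED}
    payload = json.dumps({"step": step_name, "args": tracked}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first_run = {
    "input_s3_uri": "s3://example-bucket/abalone.csv",
    "hyperparameters": {"max_depth": 5},
    "instance_type": "ml.m5.xlarge",
}
smaller_instance = dict(first_run, instance_type="ml.m5.large")
new_hyperparameters = dict(first_run, hyperparameters={"max_depth": 4})

baseline = cache_key("AbaloneTrain", first_run)
# Untracked change: the key is unchanged, which models a cache hit.
print(baseline == cache_key("AbaloneTrain", smaller_instance))
# Tracked change: the key differs, which models a cache miss.
print(baseline == cache_key("AbaloneTrain", new_hyperparameters))
```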

**Hint:** If you are executing this notebook in SageMaker Studio, use the following tip to easily track caching behavior.

To verify whether a cache hit or miss occurred for a particular step during a pipeline execution, open the SageMaker resources tab on the left. Click on Pipelines in the dropdown menu and find the "AbaloneBetaPipelineCaching" pipeline created in this notebook. Click on the pipeline in order to view the different executions tracked under that pipeline. You can click on each execution to view a graph of the steps and their behavior during that execution. In the graph, click on a step and then click on the "information" column to view the cache information.

Here is an example of a cache hit in SageMaker Studio, displayed in the pane on the right side of the page:

!["studio cache hit image"](artifacts/studio_cache_hit.png)

And here is an example of a cache miss in SageMaker Studio:

!["studio cache miss image"](artifacts/studio_cache_miss.png)

Information tab with cache hit result, enlarged:

!["studio cache hit zoomed image"](artifacts/studio_cache_hit_zoomed.png)

### Cache Hit
Now that the pipeline has executed, the cache for the steps has been created. To observe cache hit behavior, change the `instance_type` parameter for both steps from `ml.m5.xlarge` to `ml.m5.large`.

In [None]:
sklearn_processor.instance_type = "ml.m5.large"
xgb_train.instance_type = "ml.m5.large"

Create the step args again, and pass the updated steps to the pipeline.

In [None]:
processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(
            source="artifacts/abalone-dataset.csv",
            input_name="abalone-dataset",
            s3_input=S3Input(
                local_path="/opt/ml/processing/input",
                s3_uri="artifacts/abalone-dataset.csv",
                s3_data_type="S3Prefix",
                s3_input_mode="File",
                s3_data_distribution_type="FullyReplicated",
                s3_compression_type="None",
            ),
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="artifacts/code/processing/preprocessing.py",
)

step_process = ProcessingStep(
    name="AbaloneProcess", step_args=processor_args, cache_config=step_cache_config
)

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args, cache_config=step_cache_config)

pipeline.steps = [step_process, step_train]

View the pipeline definition again and verify our changes are reflected there.

In [None]:
definition = json.loads(pipeline.definition())
definition

Update the pipeline and re-execute. The new execution results in cache hits for both steps, as the `instance_type` parameter does not affect the result of the jobs. SageMaker does not track this parameter when evaluating the cache for previous step executions, so it has no effect.

In [None]:
pipeline.update(role)
second_execution = pipeline.start()

Describe the new execution.

In [None]:
second_execution.describe()

Wait for the new execution to complete.

In [None]:
second_execution.wait()

List the steps in the new execution.

In [None]:
second_execution.list_steps()
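
Outside of Studio, the step list itself can be checked for cache hits. The sketch below assumes each entry follows the `ListPipelineExecutionSteps` response shape, where a reused step carries a `CacheHitResult` with a `SourcePipelineExecutionArn`. `cache_sources` is a hypothetical helper, and the sample entries, including the ARN, are made up to show one hit and one miss.

```python
def cache_sources(steps: list) -> dict:
    """Map each step name to the execution ARN it was reused from, or None on a miss."""
    return {
        step["StepName"]: step.get("CacheHitResult", {}).get("SourcePipelineExecutionArn")
        for step in steps
    }


# Made-up entries shaped like the output of `list_steps()`.
sample_steps = [
    {
        "StepName": "AbaloneTrain",
        "StepStatus": "Succeeded",
        "CacheHitResult": {
            "SourcePipelineExecutionArn": (
                "arn:aws:sagemaker:us-east-1:111122223333:"
                "pipeline/example/execution/example-id"
            )
        },
    },
    {"StepName": "AbaloneProcess", "StepStatus": "Succeeded"},
]

for name, source in cache_sources(sample_steps).items():
    print(name, "cache hit" if source else "cache miss")
```

In the notebook, `cache_sources(second_execution.list_steps())` would apply the same check to the real execution.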

### Cache Miss

Now, change a different set of parameters for the steps. For the processing step, use a different code script from the artifacts directory. For the training step, update some hyperparameters.

In [None]:
# processing
processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(
            source="artifacts/abalone-dataset.csv",
            input_name="abalone-dataset",
            s3_input=S3Input(
                local_path="/opt/ml/processing/input",
                s3_uri="artifacts/abalone-dataset.csv",
                s3_data_type="S3Prefix",
                s3_input_mode="File",
                s3_data_distribution_type="FullyReplicated",
                s3_compression_type="None",
            ),
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="artifacts/code/processing/preprocessing_2.py",
)

step_process = ProcessingStep(
    name="AbaloneProcess", step_args=processor_args, cache_config=step_cache_config
)


# training
xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=30,
    max_depth=4,
    eta=0.2,
    gamma=5,
    min_child_weight=6,
    subsample=0.6,
)

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args, cache_config=step_cache_config)

pipeline.steps = [step_process, step_train]

View the pipeline definition again and verify the changes.

In [None]:
definition = json.loads(pipeline.definition())
definition

Because input code artifacts and hyperparameters directly affect the job results, these attributes are tracked by SageMaker. This results in cache misses during the next pipeline execution, and both steps re-execute.

**Note**: When local data or code artifacts are passed in as parameters to pipeline steps, the Python SDK uses a specific path structure when uploading these artifacts to S3. The contents of code files, and in some cases configuration files, are hashed, and this hash is included in the S3 upload path (view the pipeline definition to see the path structure). SageMaker tracks the S3 paths of these artifacts when evaluating whether a step has already executed. As a result, when you provide a new local code or data file, the SDK creates a new S3 upload path, a cache miss occurs, and the step runs again with the new data. For more information on the Python SDK's S3 path structures, see the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration).
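
The idea in the note above, a content hash embedded in the upload path, can be sketched as follows. The path layout, the choice of SHA-256, and the function name `hashed_upload_prefix` are assumptions for illustration only; inspect the pipeline definition to see the structure the SDK actually produces.

```python
import hashlib


def hashed_upload_prefix(bucket: str, pipeline_name: str, code: bytes) -> str:
    """Build an S3 prefix that changes whenever the code contents change."""
    digest = hashlib.sha256(code).hexdigest()
    return f"s3://{bucket}/{pipeline_name}/code/{digest}"


original = hashed_upload_prefix("example-bucket", "AbaloneBetaPipelineCaching", b"print('v1')\n")
unchanged = hashed_upload_prefix("example-bucket", "AbaloneBetaPipelineCaching", b"print('v1')\n")
edited = hashed_upload_prefix("example-bucket", "AbaloneBetaPipelineCaching", b"print('v2')\n")

# Identical contents map to the same path; edited contents map to a new one.
print(original == unchanged)
print(original == edited)
```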

Update the pipeline and re-execute.

In [None]:
pipeline.update(role)
third_execution = pipeline.start()

Describe the new execution.

In [None]:
third_execution.describe()

Wait for the new execution to complete.

In [None]:
third_execution.wait()

List the steps in the new execution.

In [None]:
third_execution.list_steps()