# Using @step Decorator with SageMaker Pipelines Local Mode 



This notebook demonstrates how to orchestrate pipeline steps defined with the `@step` decorator on your local machine. We will build a SageMaker Pipeline that processes a dataset, trains a model on the processed dataset, and evaluates the trained model. All of these steps are defined using the `@step` decorator and run locally using a `LocalPipelineSession`.

**Notes**:
1. This notebook will not run in SageMaker Studio. You can run it on a SageMaker Classic notebook instance or in your local IDE.
2. This notebook runs only on **Python 3.8** or **Python 3.10**. On any other version, you get an error message prompting you to provide an `image_uri` when you define a step.
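
Before defining any steps, you can confirm that the interpreter is one of the versions named in the notes. The following is a minimal sketch; the helper name `is_supported_python` is made up for illustration and is not part of the SageMaker Python SDK.

```python
import sys

# Versions listed in the notes above; adjust if the supported set changes.
SUPPORTED_VERSIONS = {(3, 8), (3, 10)}


def is_supported_python(version_info=sys.version_info):
    """Return True if the (major, minor) pair is one of the supported versions."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


if not is_supported_python():
    print(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} detected; "
        "a step definition may ask for an explicit image_uri."
    )
```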

### Dataset

This notebook uses the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone) [1]. The aim of this task is to predict the age of an abalone from its physical measurements. At its core, this is a regression problem.

The dataset contains several features: length (the longest shell measurement), diameter (the diameter perpendicular to length), height (the height with meat in the shell), whole_weight (the weight of whole abalone), shucked_weight (the weight of meat), viscera_weight (the gut weight after bleeding), shell_weight (the weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), and rings (integer).

The number of rings is a good approximation of age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the rings under a microscope, which is time-consuming. The other physical measurements are easier to obtain. You use the dataset to build a model that predicts the variable rings from these other physical measurements.
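
To make the column layout and the age rule of thumb concrete, here is a small sketch that parses one headerless CSV line and applies age = rings + 1.5. The helper names and the sample row are assumptions made for illustration; the row is made up and is not taken from the dataset.

```python
import csv
import io

# Column order of the headerless CSV, as described above.
COLUMNS = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]


def parse_row(line):
    """Parse one headerless CSV line into a dict keyed by column name."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(COLUMNS, values))
    # Every column except "sex" is numeric.
    return {k: (v if k == "sex" else float(v)) for k, v in record.items()}


def estimated_age(rings):
    """Approximate age in years using the rule of thumb age = rings + 1.5."""
    return rings + 1.5


# Made-up measurements for illustration only.
sample = parse_row("F,0.5,0.4,0.1,0.8,0.3,0.2,0.25,10")
print(sample["sex"], estimated_age(sample["rings"]))
```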

Before you define the pipeline, install the dependencies and set up the configuration that the rest of this notebook uses.

> [1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

#### Install the dependencies

Install the dependencies required by this notebook.

In [None]:
! pip install -r ./requirements.txt

#### Set up the configuration file

<div class="alert alert-info"> 💡 <strong> Set Execution Role for Permissions </strong>

If you are running this notebook from a local machine, as opposed to within the SageMaker Jupyter environment, you must add a SageMaker execution role ARN to <a href="./config.yaml">config.yaml</a>. You must also specify the execution role ARN with the <code>role</code> variable below.
</div>

In [None]:
role = None

If `role` is not specified, then fetch the execution role using `get_execution_role()`.

In [None]:
import sagemaker

if role is None:
    role = sagemaker.get_execution_role()

print(role)
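
If you set `role` by hand, a quick format check can catch a pasted value that is not a role ARN. This is only a rough sketch: the helper name `looks_like_role_arn` is hypothetical, the pattern checks the general shape of an IAM role ARN rather than its validity, and the account ID below is a placeholder.

```python
import re

# Rough shape of an IAM role ARN:
# arn:<partition>:iam::<12-digit account ID>:role/<role name or path>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def looks_like_role_arn(value):
    """Return True if `value` has the general shape of an IAM role ARN."""
    return isinstance(value, str) and ROLE_ARN_PATTERN.match(value) is not None


# Placeholder ARN for illustration only.
print(looks_like_role_arn("arn:aws:iam::123456789012:role/ExampleSageMakerRole"))
```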

Set the directory in which `config.yaml` resides so that the `@step` decorator can use our settings.

In [None]:
import os

os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
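
Because the variable points at a directory, it can help to confirm that `config.yaml` is actually there before defining any steps. The check below is a sketch using only the standard library; `config_file_in` is a hypothetical helper, not an SDK function.

```python
import os
from pathlib import Path


def config_file_in(directory, filename="config.yaml"):
    """Return the path to the config file in `directory`, or None if it is missing."""
    candidate = Path(directory) / filename
    return candidate if candidate.is_file() else None


# Fall back to the current directory if the variable has not been set yet.
override_dir = os.environ.get("SAGEMAKER_USER_CONFIG_OVERRIDE", os.getcwd())
found = config_file_in(override_dir)
print(f"Found {found}" if found else f"No config.yaml in {override_dir}")
```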

## Define the pipeline steps

### Processing step

Transform and split the abalone data into train, validation, and test datasets. 

In [None]:
import numpy as np
import pandas as pd

from sagemaker.workflow.function_step import step

# Since we get a headerless CSV file, we specify the column names here.
feature_columns_names = [
    "sex",
    "length",
    "diameter",
    "height",
    "whole_weight",
    "shucked_weight",
    "viscera_weight",
    "shell_weight",
]

label_column = "rings"

feature_columns_dtype = {
    "sex": str,
    "length": np.float64,
    "diameter": np.float64,
    "height": np.float64,
    "whole_weight": np.float64,
    "shucked_weight": np.float64,
    "viscera_weight": np.float64,
    "shell_weight": np.float64,
}

label_column_dtype = {"rings": np.float64}


def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z


@step(name="AbaloneProcess")
def process(input_data_s3_uri: str) -> tuple:
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    df = pd.read_csv(
        input_data_s3_uri,
        header=None,
        names=feature_columns_names + [label_column],
        dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),
    )
    numeric_features = list(feature_columns_names)
    numeric_features.remove("sex")
    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )

    categorical_features = ["sex"]
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )

    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    y = df.pop("rings")
    X_pre = preprocess.fit_transform(df)
    y_pre = y.to_numpy().reshape(len(y), 1)

    X = np.concatenate((y_pre, X_pre), axis=1)

    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])

    return pd.DataFrame(train), pd.DataFrame(validation), pd.DataFrame(test)
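
The split at the end of `process` cuts the shuffled rows at the 70% and 85% marks, which yields roughly 70/15/15 train, validation, and test sets. The sketch below reproduces the same index arithmetic with the standard library instead of NumPy. The function name and the fixed seed are assumptions added so the sketch is repeatable; the step itself shuffles without a seed.

```python
import random


def split_70_15_15(rows, seed=0):
    """Shuffle a copy of `rows` and cut it at the 70% and 85% marks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    first_cut = int(0.7 * len(shuffled))
    second_cut = int(0.85 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train_rows, validation_rows, test_rows = split_70_15_15(range(200))
print(len(train_rows), len(validation_rows), len(test_rows))
```

Every row lands in exactly one of the three sets, so no example is shared between training and evaluation.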

### Training step

Train an XGBoost model using the train and validation datasets from the `AbaloneProcess` step.


In [None]:
import xgboost


@step(
    name="AbaloneTrain",
)
def train(
    train_df,
    validation_df,
    *,
    num_round=50,
    objective="reg:squarederror",  # replaces the deprecated "reg:linear" name
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    use_gpu=False,
):
    y_train = train_df.iloc[:, 0].to_numpy()
    train_df.drop(train_df.columns[0], axis=1, inplace=True)
    x_train = train_df.to_numpy()
    train_dmatrix = xgboost.DMatrix(x_train, label=y_train)

    y_validation = validation_df.iloc[:, 0].to_numpy()
    validation_df.drop(validation_df.columns[0], axis=1, inplace=True)
    x_validation = validation_df.to_numpy()
    validation_dmatrix = xgboost.DMatrix(x_validation, label=y_validation)

    param = {
        "objective": objective,
        "max_depth": max_depth,
        "eta": eta,
        "gamma": gamma,
        "min_child_weight": min_child_weight,
        "subsample": subsample,
        # Histogram-based tree construction; the GPU variant is used only when use_gpu is True
        "tree_method": "gpu_hist" if use_gpu else "hist",
    }

    evaluation_results = {}  # Filled with the per-round metrics for each evaluation set
    booster = xgboost.train(
        param,
        train_dmatrix,
        num_round,
        evals=[(train_dmatrix, "train"), (validation_dmatrix, "validation")],
        early_stopping_rounds=5,
        evals_result=evaluation_results,
    )

    return booster
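
The `early_stopping_rounds=5` argument ends training once the validation metric has gone five rounds without improving. The following is a simplified illustration of that stopping rule, not XGBoost's implementation; the function name is hypothetical, and it assumes a lower-is-better metric such as RMSE.

```python
def early_stopping_round(validation_scores, patience=5):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training stops at the first round that is `patience` rounds past the best
    score seen so far; if that never happens, stop_round is the last round.
    """
    if not validation_scores:
        raise ValueError("validation_scores must not be empty")
    best_round = 0
    for current_round, score in enumerate(validation_scores):
        if score < validation_scores[best_round]:
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(validation_scores) - 1


# Made-up validation scores for illustration only.
scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.6, 3.8, 3.9, 3.0]
print(early_stopping_round(scores))
```

In this made-up sequence, the later improvement in the last round is never reached because training has already stopped.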

### Evaluation step

Evaluate the model from the `AbaloneTrain` step by calculating the mean squared error and the standard deviation of the prediction errors on the test dataset.

In [None]:
import numpy as np

from sklearn.metrics import mean_squared_error


@step(name="AbaloneEval")
def evaluate(model, test_df):
    y_test = test_df.iloc[:, 0].to_numpy()
    test_df.drop(test_df.columns[0], axis=1, inplace=True)
    x_test = test_df.to_numpy()

    predictions = model.predict(xgboost.DMatrix(x_test))

    mse = mean_squared_error(y_test, predictions)
    std = np.std(y_test - predictions)
    report_dict = {
        "regression_metrics": {
            "mse": {"value": mse, "standard_deviation": std},
        },
    }
    print(f"evaluation report: {report_dict}")
    return report_dict
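
To see what the two numbers in the report mean, the sketch below recomputes them with the standard library on a few made-up values. The function name `regression_report` is an assumption for illustration; `statistics.pstdev` is used because it is the population standard deviation, which is what `np.std` computes by default.

```python
import math
import statistics


def regression_report(y_true, y_pred):
    """Return the MSE and the standard deviation of the residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    # Population standard deviation, matching np.std's default (ddof=0).
    std = statistics.pstdev(residuals)
    return {"regression_metrics": {"mse": {"value": mse, "standard_deviation": std}}}


# Made-up values for illustration only.
report = regression_report([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(report)
```

The two metrics are related: the MSE equals the squared standard deviation of the residuals plus the squared mean residual, so a gap between MSE and the squared standard deviation indicates a systematic bias in the predictions.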

## Define the Pipeline using `LocalPipelineSession`

We will create a `LocalPipelineSession` object and define our pipeline with it so that each step will run locally.

In [None]:
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession

# To run the pipeline in the cloud, you must change `LocalPipelineSession()` to `PipelineSession()`
local_pipeline_session = LocalPipelineSession()

# Resolve the S3 location of the Abalone dataset
abalone_s3_uri = f"s3://sagemaker-example-files-prod-{local_pipeline_session.boto_region_name}/datasets/tabular/uci_abalone/abalone.csv"

# Define the DelayedReturn objects
delayed_data = process(input_data_s3_uri=abalone_s3_uri)
delayed_model = train(train_df=delayed_data[0], validation_df=delayed_data[1])
delayed_evaluation = evaluate(model=delayed_model, test_df=delayed_data[2])

pipeline_name = "StepDecoratorLocalModePipeline"
pipeline = Pipeline(
    name=pipeline_name,
    steps=[delayed_evaluation],
    sagemaker_session=local_pipeline_session,
)
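
Only `delayed_evaluation` is passed to `steps`; the upstream steps are reachable through the deferred results it depends on. The toy sketch below shows the general idea of walking such a dependency graph from the final step. The `Delayed` class and `collect_steps` function are hypothetical stand-ins written for this illustration and do not reflect how the SageMaker SDK is implemented.

```python
class Delayed:
    """Toy stand-in for a deferred step result (not the SageMaker class)."""

    def __init__(self, name, *upstream):
        self.name = name
        self.upstream = upstream


def collect_steps(leaf):
    """Return step names in dependency order, starting from the final step."""
    ordered = []

    def visit(node):
        # Visit dependencies first so they appear before the step that uses them.
        for parent in node.upstream:
            visit(parent)
        if node.name not in ordered:
            ordered.append(node.name)

    visit(leaf)
    return ordered


data = Delayed("AbaloneProcess")
model = Delayed("AbaloneTrain", data)
evaluation = Delayed("AbaloneEval", model, data)
print(collect_steps(evaluation))
```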

### (Optional) Examining the pipeline definition

The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and the parameters and step properties resolve correctly.


In [None]:
import json

definition = json.loads(pipeline.definition())
definition
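
The full definition can be long, so a short summary of the steps is often easier to scan. The helper below is a sketch that assumes the definition has a top-level `Steps` list whose entries carry `Name` and `Type` keys. The sample definition is hypothetical and heavily trimmed; its values are placeholders used only to exercise the helper.

```python
import json


def step_summary(definition):
    """Return (name, type) pairs for each entry under the "Steps" key."""
    return [(s.get("Name"), s.get("Type")) for s in definition.get("Steps", [])]


# Hypothetical, trimmed definition; real definitions contain many more fields.
sample_definition = json.loads(
    '{"Steps": ['
    '{"Name": "AbaloneProcess", "Type": "Training"},'
    '{"Name": "AbaloneEval", "Type": "Training"}]}'
)
print(step_summary(sample_definition))
```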

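## Create the pipeline and start the execution

Register the pipeline definition with `upsert`. Because the pipeline was created with a `LocalPipelineSession`, the definition is registered with the local session rather than with the SageMaker Pipelines service. When you run in the cloud with a `PipelineSession`, the Pipelines service uses the role that is passed in to create all the jobs defined in the steps.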


In [None]:
pipeline.upsert(role_arn=role)

Start the pipeline execution, then list its steps to check their status.

In [None]:
execution = pipeline.start()

In [None]:
execution.list_steps()
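
The raw step listing can be condensed into a name-to-status mapping. The helper below is a sketch under stated assumptions: it assumes each step entry is a dict with `StepName` and `StepStatus` keys, and it accepts either a plain list or a dict that wraps the list under `PipelineExecutionSteps`. The function name and the sample response are hypothetical.

```python
def summarize_steps(steps):
    """Map each step name to its status from a list-steps style response."""
    # Accept either a plain list or a dict wrapping the list (assumed shapes).
    if isinstance(steps, dict):
        steps = steps.get("PipelineExecutionSteps", [])
    return {s.get("StepName"): s.get("StepStatus") for s in steps}


# Hypothetical response used only to exercise the helper.
sample_steps = [
    {"StepName": "AbaloneProcess", "StepStatus": "Succeeded"},
    {"StepName": "AbaloneTrain", "StepStatus": "Succeeded"},
    {"StepName": "AbaloneEval", "StepStatus": "Executing"},
]
print(summarize_steps(sample_steps))
```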

