# SageMaker Pipeline - Local Mode

This notebook demonstrates how to orchestrate SageMaker jobs locally using SageMaker Pipelines.

Using a `LocalPipelineSession` object, you can now run your pipelines on your local machine before running them in the cloud.

The `LocalPipelineSession` object is used while defining each pipeline step and when defining the complete Pipeline object. To run this pipeline in the cloud, each step along with the Pipeline object must be redefined using `PipelineSession`.
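
One way to avoid editing every step by hand when moving to the cloud is to choose the session class from a single flag. The following is a minimal sketch that assumes the SageMaker Python SDK is installed; `make_pipeline_session` and `run_locally` are hypothetical names introduced here for illustration and are not part of the SDK.

```python
def make_pipeline_session(run_locally=True):
    """Return a local or a cloud pipeline session, depending on the flag.

    Hypothetical helper: only the two session classes come from the SDK.
    """
    # Imported lazily so the helper can be defined before the SDK is installed.
    from sagemaker.workflow.pipeline_context import (
        LocalPipelineSession,
        PipelineSession,
    )

    return LocalPipelineSession() if run_locally else PipelineSession()
```

Passing the returned object as `sagemaker_session` to each step and to the `Pipeline` keeps the switch in one place.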

Note: This notebook will not run in SageMaker Studio. You can run this on SageMaker Classic Notebook instances OR your local IDE.

In this notebook, we run a pipeline in our local environment that performs the following steps:

* ProcessingStep by using `FrameworkProcessor`
* TrainingStep by using `Estimator` with a custom PyTorch container

## Dataset

We are using a subset of ~20000 records of synthetic transactions, each of which is labeled as fraudulent or not fraudulent.
We'd like to train a model based on the features of these transactions so that we can predict risky or fraudulent transactions in the future.

This is a binary classification problem:

* 1 - Fraud
* 0 - No Fraud

In [None]:
! rm -rf ./data && mkdir -p ./data

! aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/user0_credit_card_transactions.csv ./data/creditcard_csv.csv

***

## Prerequisites

Install the latest version of the SageMaker Python SDK.

In [None]:
! pip install 'sagemaker' --upgrade

***

## Build Container

To run an Amazon SageMaker Training Job with a custom image, the first step is to build the image and push it to a private [Amazon ECR Repository](https://docs.aws.amazon.com/en_en/AmazonECR/latest/userguide/what-is-ecr.html).

The Dockerfile starts from the public [torch 1.12.1 image](https://hub.docker.com/layers/pytorch/pytorch/1.12.1-cuda11.3-cudnn8-runtime/images/sha256-0bc0971dc8ae319af610d493aced87df46255c9508a8b9e9bc365f11a56e7b75?context=explore) and uses
[sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit) to make the container compatible with Amazon SageMaker, so that we can provide our training script when defining the `Estimator`.

To simplify building the Docker image and pushing it to the Amazon ECR Repository, we provide a utility script, [build_image.sh](./code/build_image.sh).

For more information on its usage, please read the [README](./code/README.md).

In [None]:
! pygmentize ./code/training/Dockerfile

***

## Part 1/3 - Setup

Here we'll import some libraries and define some variables.

In [None]:
import boto3
import json
import logging
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
s3_client = boto3.client("s3")

Create `LocalPipelineSession` object so that each pipeline step will run locally.

To run this pipeline in the cloud, you must change `LocalPipelineSession()` to `PipelineSession()`

In [None]:
sagemaker_session = LocalPipelineSession()

region = sagemaker_session.boto_region_name

default_bucket = sagemaker_session.default_bucket()

role = None

## Please Note: Provide SageMaker Execution Role ARN if not running on SageMaker Notebook environment

<div class="alert alert-info"> 💡 
    <strong> Set Execution Role for Permissions </strong>
    If you are running this notebook from a local machine, as opposed to within the SageMaker Jupyter environment, you will need to set <code>role</code> in the cell above to the ARN of a valid SageMaker Execution Role.
    <p>
        <strong>
            <a style="color: #0397a7" href="https://console.aws.amazon.com/iam/home#/roles">
                <u>Click here to lookup IAM SageMaker Execution Roles</u>
            </a>
            The cell below falls back to <code>sagemaker.get_execution_role()</code> when <code>role</code> is left as <code>None</code>.
        </strong>
    </p>
</div>

In [None]:
if role is None:
    role = sagemaker.get_execution_role()

### Upload Dataset in the Default Amazon S3 Bucket

To make the data available to the pipeline, we upload the downloaded dataset to the default S3 bucket.

In [None]:
s3_client.delete_object(Bucket=default_bucket, Key="sg-pipeline-local/data/input")

input_data = sagemaker_session.upload_data(
    "./data/creditcard_csv.csv", key_prefix="sg-pipeline-local/data/input"
)

input_data

***

## Part 2/3 - Create Amazon SageMaker Pipeline

In this section, we create the Amazon SageMaker Pipeline and define the input parameters that make it usable both in local mode and in the cloud.

### Compress source code for installing additional python modules

By using [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit), we can provide the execution scripts and the `requirements.txt` for installing additional dependencies to the `Estimator` that we define a few steps below.

To make sure that Amazon SageMaker installs our additional Python modules by reading `requirements.txt`, we compress the content of the [training](./code/training) folder and upload it to the default S3 bucket.

In [None]:
! cd ./code/training && rm -rf ./../dist/training && mkdir -p ./../dist/training && tar --exclude='Dockerfile' --exclude='.dockerignore' -czvf ./../dist/training/sourcedir.tar.gz *
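
For environments where the shell `tar` command is not available, the same archive can be approximated with Python's standard `tarfile` module. This is a minimal sketch, not the notebook's method: `make_sourcedir` is a hypothetical helper name, and it excludes members by file name only.

```python
import tarfile
from pathlib import Path


def make_sourcedir(src_dir, out_path, exclude=("Dockerfile", ".dockerignore")):
    """Pack the contents of src_dir into a gzipped tarball at out_path."""
    src = Path(src_dir)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)

    def skip_excluded(member):
        # Returning None drops the member (and, for a directory, its contents).
        return None if Path(member.name).name in exclude else member

    with tarfile.open(out, "w:gz") as archive:
        for entry in sorted(src.iterdir()):
            if entry.name.startswith("."):
                continue  # the shell glob `*` also skips hidden top-level entries
            # arcname keeps paths relative, so files sit at the archive root.
            archive.add(entry, arcname=entry.name, filter=skip_excluded)
    return out
```

Calling `make_sourcedir("./code/training", "./code/dist/training/sourcedir.tar.gz")` would write to the same path that the upload cell reads from.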

In [None]:
# Clean the bucket first
s3_client.delete_object(Bucket=default_bucket, Key="sg-pipeline-local/artifact/training")

code_path = sagemaker_session.upload_data(
    "./code/dist/training/sourcedir.tar.gz", key_prefix="sg-pipeline-local/artifact/training"
)

code_path

By using `FrameworkProcessor`, we can provide the Amazon SageMaker Job with the execution scripts and the `requirements.txt` for installing additional Python modules. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job-frameworks.html) for additional information.

To make sure that Amazon SageMaker installs our additional Python modules by reading `requirements.txt`, we compress the content of the [processing](./code/processing) folder and upload it to the default S3 bucket.

In [None]:
! cd ./code/processing && rm -rf ./../dist/processing && mkdir -p ./../dist/processing && tar -czvf ./../dist/processing/sourcedir.tar.gz *

In [None]:
# Clean the bucket first
s3_client.delete_object(Bucket=default_bucket, Key="sg-pipeline-local/artifact/processing")

code_path = sagemaker_session.upload_data(
    "./code/dist/processing/sourcedir.tar.gz", key_prefix="sg-pipeline-local/artifact/processing"
)

code_path

***

### Global Parameters

In [None]:
processing_artifact_path = "sg-pipeline-local/artifact/processing"
processing_artifact_name = "sourcedir.tar.gz"
processing_framework_version = "0.23-1"
processing_instance_count = 1
processing_input_files_path = "sg-pipeline-local/data/input"
processing_output_files_path = "sg-pipeline-local/data/output"

training_image_name = "torch-1.12.1"
training_image_version = "latest"
training_artifact_path = "sg-pipeline-local/artifact/training"
training_artifact_name = "sourcedir.tar.gz"
training_output_files_path = "sg-pipeline-local/models"
training_python_version = "py37"
training_instance_count = 1
training_hyperparameters = {"epochs": 6, "learning_rate": 1.34e-4, "batch_size": 100}

### Pipeline Parameters

To make the Amazon SageMaker Pipeline runnable both in `local mode` and in the cloud, we define the following `ParameterString` objects, which let us provide the instance type of each step at runtime.

In [None]:
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.t3.large"
)

training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.large")

#### SageMaker Processing Step

In [None]:
processing_inputs = [
    ProcessingInput(
        source="s3://{}/{}".format(default_bucket, processing_input_files_path),
        destination="/opt/ml/processing/input",
    )
]

processing_outputs = [
    ProcessingOutput(
        output_name="output",
        source="/opt/ml/processing/output",
        destination="s3://{}/{}".format(default_bucket, processing_output_files_path),
    )
]

processing_source_dir = "s3://{}/{}/{}".format(
    default_bucket, processing_artifact_path, processing_artifact_name
)

Define the `FrameworkProcessor` object

In [None]:
processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=processing_framework_version,
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    sagemaker_session=sagemaker_session,
)

run_args = processor.get_run_args(
    "processing.py",
    source_dir=processing_source_dir,
    inputs=processing_inputs,
    outputs=processing_outputs,
)

In [None]:
step_process = ProcessingStep(
    name="ProcessData",
    code=run_args.code,
    processor=processor,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
)

#### SageMaker Training Step

#### Utility methods

To pass the compressed `sourcedir` to the `Estimator` through its `hyperparameters`, we define a utility function that JSON-encodes each hyperparameter value.

In [None]:
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}
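
To see what this encoding does, the sketch below repeats the helper and applies it to a small dictionary. The sample values are taken from the global parameters above; the point to notice is that every value becomes a JSON string, so string values gain an extra pair of double quotes.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    # Same helper as in the cell above, repeated so the example is self-contained.
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


sample = {"epochs": 6, "learning_rate": 1.34e-4, "sagemaker_program": "train.py"}
encoded = json_encode_hyperparameters(sample)

print(encoded["epochs"])             # 6
print(encoded["learning_rate"])      # 0.000134
print(encoded["sagemaker_program"])  # "train.py"  (the quotes are part of the value)

# Decoding each value with json.loads recovers the original dictionary.
assert {k: json.loads(v) for k, v in encoded.items()} == sample
```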

In [None]:
source_dir = "s3://{}/{}/{}".format(default_bucket, training_artifact_path, training_artifact_name)

training_hyperparameters["sagemaker_program"] = "train.py"
training_hyperparameters["sagemaker_submit_directory"] = source_dir

training_hyperparameters

In [None]:
training_input = TrainingInput(
    s3_data="s3://{}/{}/train".format(default_bucket, processing_output_files_path),
    content_type="text/csv",
)

test_input = TrainingInput(
    s3_data="s3://{}/{}/test".format(default_bucket, processing_output_files_path),
    content_type="text/csv",
)

In [None]:
output_path = "s3://{}/{}".format(default_bucket, training_output_files_path)

##### Get ECR image uri

Let's build the `image_uri` of the custom image that we want to use for our training job.

In [None]:
container = "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
    boto3.client("sts").get_caller_identity().get("Account"),
    boto3.session.Session().region_name,
    training_image_name,
    training_image_version,
)

print(container)
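
The URI printed above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a minimal sketch, the same formatting can be isolated in a pure function that needs no AWS credentials. `ecr_image_uri` is a hypothetical helper name, the account ID is a placeholder, and the sketch does not cover partitions that use a different DNS suffix.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI from its four components."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
        account_id, region, repository, tag
    )


# Placeholder account ID and region, for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "torch-1.12.1"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/torch-1.12.1:latest
```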

Define the `Estimator` object

In [None]:
estimator = Estimator(
    image_uri=container,
    output_path=output_path,
    hyperparameters=json_encode_hyperparameters(training_hyperparameters),
    enable_sagemaker_metrics=True,
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    disable_profiler=True,
)

In [None]:
step_train = TrainingStep(
    depends_on=[step_process],
    name="TrainModel",
    estimator=estimator,
    inputs={"train": training_input, "test": test_input},
)

#### Pipeline definition

Let's create the pipeline object, which takes as `parameters` the inputs defined in the previous sections and as `steps` the `ProcessingStep` and `TrainingStep` defined a few cells above.

In [None]:
pipeline = Pipeline(
    name="FraudTrainingPipeline",
    parameters=[processing_instance_type, training_instance_type],
    steps=[step_process, step_train],
    sagemaker_session=sagemaker_session,
)

In [None]:
pipeline.upsert(role_arn=role)

In [None]:
json.loads(pipeline.definition())

***

## Part 3/3 - Run SageMaker Pipeline

To run the Amazon SageMaker Pipeline in our local environment, we override the instance-type parameters of both the `ProcessingStep` and the `TrainingStep` with the value `local` when starting the execution.

In [None]:
execution = pipeline.start(
    parameters={"ProcessingInstanceType": "local", "TrainingInstanceType": "local"}
)
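
Conceptually, the values passed to `pipeline.start` replace the defaults declared in the `ParameterString` objects, and any parameter that is not passed keeps its default. The sketch below illustrates that idea with plain dictionaries. It is not the SDK's implementation: `resolve_parameters` is a hypothetical name, and rejecting unknown parameter names is a choice made for this example.

```python
def resolve_parameters(defaults, overrides=None):
    """Merge runtime overrides on top of the declared default values."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError("Unknown pipeline parameters: {}".format(sorted(unknown)))
    return {**defaults, **overrides}


# Defaults as declared in the "Pipeline Parameters" section.
defaults = {
    "ProcessingInstanceType": "ml.t3.large",
    "TrainingInstanceType": "ml.m5.large",
}

# Local run: both instance types are overridden, as in the cell above.
local_run = resolve_parameters(
    defaults, {"ProcessingInstanceType": "local", "TrainingInstanceType": "local"}
)

# Run without overrides: the defaults apply unchanged.
default_run = resolve_parameters(defaults)
```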

In [None]:
execution.describe()

In [None]:
execution.list_steps()