# SageMaker Pipeline - Local Mode

This notebook demonstrates how to orchestrate SageMaker jobs locally using SageMaker Pipelines.

Using a `LocalPipelineSession` object, you can now run your pipelines on your local machine before running them in the cloud.

The `LocalPipelineSession` object is used while defining each pipeline step and when defining the complete Pipeline object. To run this pipeline in the cloud, each step along with the Pipeline object must be redefined using `PipelineSession`.

Note: This notebook will not run in SageMaker Studio. Run it on a SageMaker classic notebook instance or in your local IDE.

In this notebook, we run a pipeline in our local environment that performs the following steps:

* ProcessingStep by using `FrameworkProcessor`
* TrainingStep by using `Estimator` with a custom PyTorch container

## Dataset

We perform binary classification on an anonymized dataset of transactions to detect fraud. Each record carries one of two labels:

* 1 - Fraud
* 0 - No Fraud

***

### Download Dataset

In [None]:
! rm -rf ./data && mkdir -p ./data

!curl https://s3.us-east-1.amazonaws.com/ee-assets-prod-us-east-1/modules/aab8e619f53f4d79b65d2272f3ee8de1/v1/datasets/fraud/data.zip -o ./data/data.zip

In [None]:
! unzip ./data/data.zip -d ./data
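
Before building the pipeline, it can help to check how imbalanced the two classes are. The following is a minimal sketch using only the standard library. The function name `class_balance` and the column name `Class` are hypothetical placeholders; check the header of `./data/data.csv` for the actual label column.

```python
import csv
from collections import Counter


def class_balance(lines, label_column):
    """Count rows per label value.

    `lines` is any iterable of CSV lines (an open file works) whose first
    line is a header containing `label_column`.
    """
    reader = csv.DictReader(lines)
    return Counter(row[label_column] for row in reader)


# Hypothetical usage; "Class" is a placeholder, not a confirmed column name.
# with open("./data/data.csv", newline="") as f:
#     print(class_balance(f, label_column="Class"))
```

Values are counted as strings because `csv` does not convert types, so the keys are `"0"` and `"1"` rather than integers.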

## Install Requirements

## Build Container

The Dockerfile starts from the public torch 1.12.1 image and uses the [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit) to make the container compatible with Amazon SageMaker.

You can build the image with the automation script [build_image.sh](./code/build_image.sh). For more information, read the [README](./code/README.md).

In [None]:
! pygmentize ./code/training/Dockerfile

## Part 1/3 - Setup
Here we'll import some libraries and define some variables.

In [None]:
import boto3
import json
import logging
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
s3_client = boto3.client("s3")

In [None]:
sagemaker_session = LocalPipelineSession()

region = sagemaker_session.boto_region_name

default_bucket = sagemaker_session.default_bucket()

role = None

<div class="alert alert-info"> 💡 
    <strong> Set Execution Role for Permissions </strong>
    If you are running this notebook from a local machine, as opposed to within the SageMaker Jupyter environment, set <code>role</code> in the cell above to the ARN of a valid SageMaker Execution Role.
    <p>
        <strong>
            <a style="color: #0397a7" href="https://console.aws.amazon.com/iam/home#/roles">
                <u>Click here to lookup IAM SageMaker Execution Roles</u>
            </a>
            If <code>role</code> is left as <code>None</code>, the cell below calls <code>sagemaker.get_execution_role()</code> to look it up.
        </strong>
    </p>
</div>

In [None]:
if role is None:
    role = sagemaker.get_execution_role()

### Upload Dataset into an Amazon S3 Bucket

In [None]:
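# Remove any previously uploaded copy first.
# delete_object takes a full object key, not a prefix.
s3_client.delete_object(Bucket=default_bucket, Key="data/input/data.csv")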

input_data = sagemaker_session.upload_data("./data/data.csv", key_prefix="data/input")

print(input_data)

***

## Part 2/3 - Create Amazon SageMaker Pipeline

### Compress source code for installing additional python modules

In [None]:
! cd ./code/training && rm -rf ./../dist/training && mkdir -p ./../dist/training && tar --exclude='Dockerfile' --exclude='.dockerignore' -czvf ./../dist/training/sourcedir.tar.gz *
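
The same archive can be produced from Python, which avoids depending on a shell `tar` binary. This is a sketch with the standard `tarfile` module, not the notebook's own method. The helper name `build_sourcedir` is hypothetical, and it skips dotfiles to mirror the shell `*` glob used above.

```python
import tarfile
from pathlib import Path


def build_sourcedir(src_dir, archive_path, exclude=("Dockerfile", ".dockerignore")):
    """Pack the top-level entries of src_dir into a gzip-compressed tarball.

    Entries named in `exclude` and hidden entries (names starting with ".")
    are skipped; directories are added recursively.
    Returns the sorted member names found in the written archive.
    """
    src = Path(src_dir)
    archive = Path(archive_path)
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        for entry in sorted(src.iterdir()):
            if entry.name in exclude or entry.name.startswith("."):
                continue
            # arcname keeps paths relative, as `cd dir && tar ... *` does.
            tar.add(entry, arcname=entry.name)
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())


# Hypothetical usage matching the paths in this notebook:
# build_sourcedir("./code/training", "./code/dist/training/sourcedir.tar.gz")
```

Keep the archive outside `src_dir`, as the notebook does with `./code/dist`, so the tarball never tries to include itself.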

In [None]:
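# Remove any previously uploaded training archive, then upload the new one.
# delete_object takes a full object key, not a prefix.
s3_client.delete_object(Bucket=default_bucket, Key="artifact/training/sourcedir.tar.gz")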

code_path = sagemaker_session.upload_data(
    "./code/dist/training/sourcedir.tar.gz", key_prefix="artifact/training"
)

LOGGER.info(code_path)

In [None]:
! cd ./code/processing && rm -rf ./../dist/processing && mkdir -p ./../dist/processing && tar -czvf ./../dist/processing/sourcedir.tar.gz *

In [None]:
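# Remove any previously uploaded processing archive, then upload the new one.
# delete_object takes a full object key, not a prefix.
s3_client.delete_object(Bucket=default_bucket, Key="artifact/processing/sourcedir.tar.gz")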

code_path = sagemaker_session.upload_data(
    "./code/dist/processing/sourcedir.tar.gz", key_prefix="artifact/processing"
)

LOGGER.info(code_path)

***

### Global Parameters

In [None]:
processing_artifact_path = "artifact/processing"
processing_artifact_name = "sourcedir.tar.gz"
processing_framework_version = "0.23-1"
processing_instance_count = 1
processing_input_files_path = "data/input"
processing_output_files_path = "data/output"

training_image_name = "torch-1.12.1"
training_image_version = "latest"
training_artifact_path = "artifact/training"
training_artifact_name = "sourcedir.tar.gz"
training_output_files_path = "models"
training_python_version = "py37"
training_instance_count = 1
training_hyperparameters = {"epochs": 6, "learning_rate": 1.34e-4, "batch_size": 100}

### Utility methods

In [None]:
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}
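
To make the effect of this helper concrete, the sketch below repeats its definition and applies it to a small dictionary. Every value becomes a JSON string, so string values gain embedded double quotes, and a consumer that JSON-decodes each value recovers the original types. The sample dictionary is illustrative and mixes values from the parameters defined above.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


encoded = json_encode_hyperparameters(
    {"epochs": 6, "learning_rate": 1.34e-4, "sagemaker_program": "train.py"}
)
print(encoded)
# {'epochs': '6', 'learning_rate': '0.000134', 'sagemaker_program': '"train.py"'}

# Decoding each value with json.loads restores int, float, and str.
decoded = {k: json.loads(v) for k, v in encoded.items()}
print(decoded)
```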

***

### Pipeline Parameters

In [None]:
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.t3.large"
)

training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.large")

#### SageMaker Processing Step

In [None]:
processing_inputs = [
    ProcessingInput(
        source="s3://{}/{}".format(default_bucket, processing_input_files_path),
        destination="/opt/ml/processing/input",
    )
]

processing_outputs = [
    ProcessingOutput(
        output_name="output",
        source="/opt/ml/processing/output",
        destination="s3://{}/{}".format(default_bucket, processing_output_files_path),
    )
]

processing_source_dir = "s3://{}/{}/{}".format(
    default_bucket, processing_artifact_path, processing_artifact_name
)

In [None]:
processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=processing_framework_version,
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    sagemaker_session=sagemaker_session,
)

run_args = processor.get_run_args(
    "processing.py",
    source_dir=processing_source_dir,
    inputs=processing_inputs,
    outputs=processing_outputs,
)

In [None]:
step_process = ProcessingStep(
    name="ProcessData",
    code=run_args.code,
    processor=processor,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
)

***

#### SageMaker Training Step

In [None]:
source_dir = "s3://{}/{}/{}".format(default_bucket, training_artifact_path, training_artifact_name)

training_hyperparameters["sagemaker_program"] = "train.py"
training_hyperparameters["sagemaker_submit_directory"] = source_dir

In [None]:
training_input = TrainingInput(
    s3_data="s3://{}/{}/train".format(default_bucket, processing_output_files_path),
    content_type="text/csv",
)

test_input = TrainingInput(
    s3_data="s3://{}/{}/test".format(default_bucket, processing_output_files_path),
    content_type="text/csv",
)

In [None]:
output_path = "s3://{}/{}".format(default_bucket, training_output_files_path)

##### Get ECR image uri

In [None]:
container = "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
    boto3.client("sts").get_caller_identity().get("Account"),
    boto3.session.Session().region_name,
    training_image_name,
    training_image_version,
)

print(container)
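
The URI printed above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a sketch, the formatting can be isolated in a small pure function, which makes it easy to check without AWS credentials. The name `ecr_image_uri` and the account ID below are placeholders.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Format a private ECR image URI using the same pattern as the cell above."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


# 123456789012 is a placeholder account ID.
print(ecr_image_uri("123456789012", "us-east-1", "torch-1.12.1"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/torch-1.12.1:latest
```

The repository name and tag must match the image pushed by the build step, otherwise the training step cannot pull the container.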

In [None]:
estimator = Estimator(
    image_uri=container,
    output_path=output_path,
    hyperparameters=json_encode_hyperparameters(training_hyperparameters),
    enable_sagemaker_metrics=True,
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    disable_profiler=True,
    # Use the same local session as the other steps and the Pipeline object.
    sagemaker_session=sagemaker_session,
)

In [None]:
step_train = TrainingStep(
    depends_on=[step_process],
    name="TrainModel",
    estimator=estimator,
    inputs={"train": training_input, "test": test_input},
)

***

#### Pipeline definition

In [None]:
pipeline = Pipeline(
    name="FraudTrainingPipeline",
    parameters=[processing_instance_type, training_instance_type],
    steps=[step_process, step_train],
    sagemaker_session=sagemaker_session,
)

In [None]:
pipeline.upsert(role_arn=role)

In [None]:
json.loads(pipeline.definition())
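
The full definition is verbose, so a compact summary of parameters and steps can be easier to scan. The sketch below assumes the definition dictionary has top-level `Parameters` and `Steps` lists whose entries carry `Name`, `Type`, and optionally `DependsOn`. The helper name `summarize_definition` is hypothetical, and the `example` dictionary is a hand-written stand-in, not output captured from this pipeline.

```python
def summarize_definition(definition):
    """Return (parameter names, [(step name, step type, depends_on), ...])."""
    parameters = [p["Name"] for p in definition.get("Parameters", [])]
    steps = [
        (s["Name"], s["Type"], s.get("DependsOn", []))
        for s in definition.get("Steps", [])
    ]
    return parameters, steps


# Hand-written stand-in for json.loads(pipeline.definition()).
example = {
    "Parameters": [
        {"Name": "ProcessingInstanceType", "Type": "String"},
        {"Name": "TrainingInstanceType", "Type": "String"},
    ],
    "Steps": [
        {"Name": "ProcessData", "Type": "Processing"},
        {"Name": "TrainModel", "Type": "Training", "DependsOn": ["ProcessData"]},
    ],
}

params, steps = summarize_definition(example)
print(params)
print(steps)
```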

***

## Part 3/3 - Run SageMaker Pipeline

In [None]:
execution = pipeline.start(
    parameters={"ProcessingInstanceType": "local", "TrainingInstanceType": "local"}
)

In [None]:
execution.describe()

In [None]:
execution.list_steps()
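
To see at a glance which steps succeeded, the step list can be reduced to a name-to-status mapping. This is a sketch under assumptions: it expects each step entry to have `StepName` and `StepStatus` keys, and it accepts either a plain list of entries or a dictionary wrapping that list under `PipelineExecutionSteps`, since the exact return shape may differ between local and cloud executions. The helper name `step_statuses` is hypothetical.

```python
def step_statuses(list_steps_result):
    """Map step name to status from a list of step dicts or a dict wrapping one."""
    if isinstance(list_steps_result, dict):
        steps = list_steps_result.get("PipelineExecutionSteps", [])
    else:
        steps = list_steps_result
    return {s["StepName"]: s["StepStatus"] for s in steps}


# Hypothetical usage with the execution object from this notebook:
# print(step_statuses(execution.list_steps()))
```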