# SageMaker Pipelines SKLearn-->SKLearn Serial Batch Inference Demo End to End

This notebook demonstrates the end-to-end process of creating a SageMaker Pipeline that preprocesses data and trains a model with scikit-learn, registers the resulting pipeline model, and runs serial batch inference.


## Table of Contents
1. [Configure AWS](#configure)
2. [Load Data](#data)
3. [Define Pipeline Parameters](#params)
4. [Define a `features.py` script for training and inference](#features)
5. [Define a `model.py` script for training and inference](#model)
6. [Define model creation and registration steps](#createregister)
7. [Configure Sagemaker Pipeline](#pipeline)
8. [Execute Sagemaker Pipeline to build features, train model, register artifacts, and create Sagemaker Model](#submit)
9. [Pass pipeline to Batch Transform for serial inference](#inference)
10. [Download resulting predictions](#download)



In [2]:
# !pip install -U sagemaker

### 1. Configure AWS <a name="configure"></a>

Set up your Sagemaker Session, Sagemaker Pipeline session, roles, predefined variables, etc. 

In [42]:
import os
import time
import boto3
import json
import numpy as np
import pandas as pd
from sagemaker.workflow.pipeline_context import PipelineSession
import sagemaker
from sagemaker import get_execution_role
from botocore.exceptions import ClientError

# Load configs
with open('config.json', 'r') as file:
    config_data = json.load(file)
print("Configs:")    
print(config_data)

# Configure boto3, bucket info, and Sagemaker Sessions
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(default_bucket = config_data['DEFAULT_BUCKET'], boto_session=sess)
pipeline_session = PipelineSession(default_bucket = config_data['DEFAULT_BUCKET'], boto_session=sess) 

# Configure region
region = sagemaker_session.boto_region_name
print(f"Region: {region}")

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()
print(f"Role: {role}")

# S3 bucket
bucket = sagemaker_session.default_bucket()
print(f"Bucket: {bucket}")

Configs:
{'DEFAULT_BUCKET': 'sklearn-mlops-inf-pipeline-demo', 'MODEL_PACKAGE_GROUP_NAME': 'PipelineSKLearnModelPackageGroup', 'PREFIX': 'serial-inference-pipeline', 'PIPELINE_NAME': 'serial-inference-pipeline', 'INPUT_DATA': 's3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/housing_data/raw', 'MODEL_APPROVAL_STATUS': 'Approved', 'PROCESSING_INSTANCE_TYPE': 'ml.m5.xlarge', 'PROCESSING_INSTANCE_COUNT': 1, 'TRAINING_INSTANCE_TYPE': 'ml.m5.xlarge', 'CREATE_MODEL_INSTANCE_TYPE': 'ml.m5.large', 'BATCH_TRANSFORM_INSTANCE_COUNT': 1, 'BATCH_TRANSFORM_INSTANCE_TYPE': 'ml.m4.xlarge'}
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Region: ca-

In [43]:
# define prefixes and names
model_package_group_name = config_data['MODEL_PACKAGE_GROUP_NAME']
prefix = config_data['PREFIX']
pipeline_name = config_data['PIPELINE_NAME'] 
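
The cells above index `config_data` by key, so a missing key only surfaces as a `KeyError` at the point of use. A minimal sketch of an up-front check follows, assuming the key set is exactly the one printed in the config output above; the helper name `missing_config_keys` and the incomplete example config are hypothetical.

```python
import json

# Keys this notebook reads from config.json (taken from the printed config above).
REQUIRED_KEYS = {
    "DEFAULT_BUCKET",
    "MODEL_PACKAGE_GROUP_NAME",
    "PREFIX",
    "PIPELINE_NAME",
    "INPUT_DATA",
    "MODEL_APPROVAL_STATUS",
    "PROCESSING_INSTANCE_TYPE",
    "PROCESSING_INSTANCE_COUNT",
    "TRAINING_INSTANCE_TYPE",
    "CREATE_MODEL_INSTANCE_TYPE",
    "BATCH_TRANSFORM_INSTANCE_COUNT",
    "BATCH_TRANSFORM_INSTANCE_TYPE",
}


def missing_config_keys(config):
    """Return the required keys that are absent from the config dict, sorted."""
    return sorted(REQUIRED_KEYS - set(config))


# Hypothetical, deliberately incomplete config used only for illustration.
example = json.loads('{"DEFAULT_BUCKET": "my-bucket", "PREFIX": "demo"}')
missing = missing_config_keys(example)
print(f"{len(missing)} required keys are missing")
```

Running such a check right after `json.load` would fail fast, before any AWS call is made.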

### 2. Load Dataset to Studio & Upload to S3 for training <a name="data"></a>

We use the California housing dataset. More info on the dataset:

* This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/
* The target variable is the median house value for California districts.
* This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

In [44]:
data_dir = os.path.join(os.getcwd(), "housing_data")
os.makedirs(data_dir, exist_ok=True)

raw_dir = os.path.join(os.getcwd(), "housing_data/raw")
os.makedirs(raw_dir, exist_ok=True)

In [45]:
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/california_housing/cal_housing.tgz",
    "cal_housing.tgz",
)

In [46]:
!tar -zxf cal_housing.tgz

tar: CaliforniaHousing/cal_housing.data: Cannot change ownership to uid 10017, gid 166: Operation not permitted
tar: CaliforniaHousing/cal_housing.domain: Cannot change ownership to uid 10017, gid 166: Operation not permitted
tar: Exiting with failure status due to previous errors


In [47]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
cal_housing_df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)
cal_housing_df[
    "medianHouseValue"
] /= 500000  # Scaling target down to avoid overcomplicating the example
cal_housing_df.to_csv("./housing_data/raw/raw_data_all.csv", header=True, index=False)
rawdata_s3_prefix = "{}/housing_data/raw".format(prefix)
raw_s3 = sagemaker_session.upload_data(path="./housing_data/raw/", key_prefix=rawdata_s3_prefix)
print(raw_s3)

s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/housing_data/raw
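
Because `medianHouseValue` is divided by 500000 before upload, both the labels and the model's predictions live on that scaled range. A small sketch of converting between the two scales is shown below; the helper names `to_dollars` and `to_scaled` are hypothetical and not part of the notebook.

```python
TARGET_SCALE = 500_000  # same divisor applied to medianHouseValue above


def to_dollars(scaled_value):
    """Convert a scaled label or prediction back to dollars."""
    return scaled_value * TARGET_SCALE


def to_scaled(dollars):
    """Convert a dollar amount to the scaled range used for training."""
    return dollars / TARGET_SCALE


print(to_dollars(0.5))
```

Under this scaling, a scaled value of 1.000002 corresponds to a median house value of about 500001 dollars.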


### 3. Define Parameters to Parametrize Pipeline Execution <a name="params"></a>

Define Pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.

The supported parameter types include:

- ParameterString - represents a str Python type
- ParameterInteger - represents an int Python type
- ParameterFloat - represents a float Python type
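
Conceptually, each execution resolves a parameter to the supplied override if there is one, and to the declared default otherwise. The sketch below illustrates that idea in plain Python. It is not the SDK's implementation: `resolve_parameters` is a hypothetical helper, and rejecting undeclared names or mismatched types is a choice made for this sketch.

```python
# Defaults mirror values printed from config.json earlier in this notebook.
DEFAULTS = {
    "ProcessingInstanceType": "ml.m5.xlarge",
    "ProcessingInstanceCount": 1,
    "TrainingInstanceType": "ml.m5.xlarge",
    "ModelApprovalStatus": "Approved",
}


def resolve_parameters(defaults, overrides=None):
    """Illustrative only: merge per-execution overrides over declared defaults."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Undeclared parameters: {sorted(unknown)}")
    resolved = dict(defaults)
    for name, value in overrides.items():
        expected = type(defaults[name])
        if not isinstance(value, expected):
            raise TypeError(f"{name} expects {expected.__name__}")
        resolved[name] = value
    return resolved


resolved = resolve_parameters(DEFAULTS, {"ModelApprovalStatus": "PendingManualApproval"})
print(resolved["ModelApprovalStatus"])
```

In the SageMaker Python SDK, overrides of this kind are passed as a dictionary to `pipeline.start(parameters=...)`, keyed by the parameter `name`.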

In [89]:
from sagemaker.workflow.parameters import ParameterInteger, ParameterString, ParameterFloat

# raw input data
input_data = ParameterString(name='InputData', default_value=config_data['INPUT_DATA'])

# status of newly trained model in registry
model_approval_status = ParameterString(name="ModelApprovalStatus", default_value=config_data['MODEL_APPROVAL_STATUS'])

# processing step parameters
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value=config_data['PROCESSING_INSTANCE_TYPE']
)
processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=config_data['PROCESSING_INSTANCE_COUNT'])

# training step parameters
training_instance_type = ParameterString(name="TrainingInstanceType", default_value=config_data['TRAINING_INSTANCE_TYPE'])

# create model step parameters 
create_model_instance_type = ParameterString(name="CreateModelInstanceType", default_value=config_data["CREATE_MODEL_INSTANCE_TYPE"])

### 4. Feature Build and Inference Script <a name="features"></a>

Define a Sagemaker processing job for feature engineering, utilizing a scikit-learn StandardScaler(). Save the scaler as a feature artifact during training, and deserialize it for custom inference transformations.
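
The save-then-reload round trip described above can be sketched with the standard library alone. This is a stand-in, not the notebook's code: a toy single-column scaler replaces `StandardScaler`, `pickle` replaces `joblib`, and the file name `model.pkl` and all helper names are hypothetical. It assumes the archive is unpacked into the model directory before the scaler is deserialized.

```python
import os
import pickle
import statistics
import tarfile
import tempfile


def fit_scaler(values):
    """Toy single-column stand-in for StandardScaler.fit (population std)."""
    return {"mean": statistics.fmean(values), "scale": statistics.pstdev(values)}


def transform(scaler, values):
    """Standardize values with the stored mean and scale."""
    return [(v - scaler["mean"]) / scaler["scale"] for v in values]


def save_artifact(scaler, out_dir):
    """Training side: serialize the fitted scaler and pack it as model.tar.gz."""
    artifact = os.path.join(out_dir, "model.pkl")
    with open(artifact, "wb") as fh:
        pickle.dump(scaler, fh)
    archive = os.path.join(out_dir, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(artifact, arcname="model.pkl")
    return archive


def load_artifact(archive, model_dir):
    """Inference side: unpack the archive and deserialize the scaler."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(model_dir)
    with open(os.path.join(model_dir, "model.pkl"), "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as work:
    train_dir = os.path.join(work, "train")
    model_dir = os.path.join(work, "model")
    os.makedirs(train_dir)
    os.makedirs(model_dir)
    fitted = fit_scaler([1.0, 2.0, 3.0])
    restored = load_artifact(save_artifact(fitted, train_dir), model_dir)
    scaled = transform(restored, [2.0])
```

The point of the round trip is that inference applies exactly the statistics learned on the training split, instead of refitting on incoming data.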

#### Script Structure

Inside the main guard (`if __name__ == "__main__":`), provide the training-time feature-build code, with arguments aligned to the SageMaker Processing Job documentation.

Outside the main guard, define four inference functions as expected by Sagemaker:
* `input_fn`: reads input data from the relative directory passed into the feature container
* `model_fn`: deserializes the tar.gz artifact from the model registry, containing pretrained feature artifact(s)
* `predict_fn`: computes the data transformation step for inference data
* `output_fn`: sends transformed data to the model step container as JSON

Refer to the Sagemaker Python SDK documentation for details. If no custom inference functions are provided, the default Sagemaker inference handler will run.

`features.py` is the entry point for data preprocessing functions.
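
As a rough mental model, the container loads the artifact once with `model_fn` and then runs `input_fn`, `predict_fn`, and `output_fn` in that order for each request. The sketch below imitates that flow with the standard library only. The `handle_request` dispatcher and the toy mean and scale values are assumptions made for illustration, not SageMaker's actual serving code.

```python
import csv
import json
from io import StringIO


def model_fn(model_dir):
    """Toy 'model': per-column mean and scale (hypothetical values)."""
    return {"mean": [10.0, 100.0], "scale": [2.0, 50.0]}


def input_fn(request_body, content_type):
    """Parse a CSV payload into rows of floats."""
    if content_type != "text/csv":
        raise ValueError(f"{content_type} not supported by script!")
    reader = csv.reader(StringIO(request_body))
    return [[float(cell) for cell in row] for row in reader if row]


def predict_fn(rows, model):
    """Standardize every column with the stored statistics."""
    return [
        [(value - m) / s for value, m, s in zip(row, model["mean"], model["scale"])]
        for row in rows
    ]


def output_fn(prediction, accept):
    """Serialize transformed rows as JSON for the next container."""
    if accept != "application/json":
        raise ValueError(f"{accept} accept type is not supported by this script.")
    return json.dumps({"instances": prediction})


def handle_request(body, content_type, accept, model):
    """Assumed per-request order of the three request-level hooks."""
    return output_fn(predict_fn(input_fn(body, content_type), model), accept)


model = model_fn("/opt/ml/model")  # loaded once, reused for every request
response = handle_request("12.0,150.0\n8.0,50.0\n", "text/csv", "application/json", model)
print(response)
```

Keeping each hook a plain function like this also makes the script easy to exercise locally before it is packaged into a model.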


In [77]:
!mkdir -p code

In [78]:
%%writefile code/features.py

import glob
import numpy as np
import pandas as pd
import os
import json
import joblib
from io import StringIO
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tarfile

try:
    from sagemaker_containers.beta.framework import (
        content_types,
        encoders,
        env,
        modules,
        transformer,
        worker,
        server,
    )
except ImportError:
    pass

feature_columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
]
label_column = "medianHouseValue"

base_dir = "/opt/ml/processing"
base_output_dir = "/opt/ml/output/"

# feature build logic 
if __name__ == "__main__":
    df = pd.read_csv(f"{base_dir}/input/raw_data_all.csv")
    x_train, x_test, y_train, y_test = train_test_split(df[feature_columns], df[label_column], test_size=0.33)

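    # Fit the scaler on the training split only, then scale that split.
    # x_test is left unscaled: test.csv is reused later as raw input for batch inference.
    scaler = StandardScaler()
    scaler.fit(x_train.values)
    x_train = pd.DataFrame(scaler.transform(x_train.values), columns=feature_columns, index=x_train.index)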

    train_dataset = pd.concat([x_train, y_train], axis=1) 
    test_dataset = pd.concat([x_test, y_test], axis=1)
        
    train_dataset.to_csv(f"{base_dir}/train/train.csv", header=None, index=None) 
    test_dataset.to_csv(f"{base_dir}/test/test.csv", header=None, index=None)
    
    # save feature artifact for inference
    joblib.dump(scaler, "model.joblib")
    with tarfile.open(f"{base_dir}/scaler_model/model.tar.gz", "w:gz") as tar_handle:
        tar_handle.add(f"model.joblib")

# inference functions
def input_fn(input_data, content_type):
    """Parse input data payload
    """
    print("Entering preprocessing input fn.")
    if content_type == "text/csv":
        # Read the raw input data as CSV.
        df = pd.read_csv(StringIO(input_data), header=None) 
        
        # If labelled, drop before inference
        if len(df.columns) == len(feature_columns) + 1:
            df.columns = feature_columns + [label_column]
            df=df.drop(columns = label_column)
        
        # If unlabelled, continue    
        elif len(df.columns) == len(feature_columns):
            df.columns = feature_columns
        return df
    else:
        raise ValueError("{} not supported by script!".format(content_type))

def model_fn(model_dir):
    """Deserialize fitted model"""
    print("Entering preprocessing model fn.")
    preprocessor = joblib.load(os.path.join(model_dir, "model.joblib"))
    return preprocessor

def predict_fn(input_data, model):
    """Apply feature transform to data
    """
    print("Entering preprocessing predict fn.")
    features = model.transform(input_data.values) 
    return features

def output_fn(prediction, accept):
    """Format prediction output
    The default accept/content-type between containers for serial inference is JSON.
    """
    print("Entering preprocessing output fn.")
    if accept == "application/json":
        instances = []
        for row in prediction.tolist():
            instances.append(row)
        json_output = {"instances": instances}

        return worker.Response(json.dumps(json_output), mimetype=accept)
    elif accept == "text/csv":
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        raise RuntimeError("{} accept type is not supported by this script.".format(accept))


Overwriting code/features.py


Define a scikit-learn `FrameworkProcessor` to wrap the feature build script in a SageMaker Processing Job.

In [79]:
from sagemaker.processing import ProcessingInput, ProcessingOutput, FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn

sklearn_framework_version = "1.2-1"

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=sklearn_framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-housing-data-process",
    role=role,
    sagemaker_session=pipeline_session,
    code_location=f"s3://{bucket}/{prefix}/processing"
)

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="scaler_model", source="/opt/ml/processing/scaler_model", destination = f"s3://{bucket}/{prefix}/processing"), 
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train", destination = f"s3://{bucket}/{prefix}/train"), 
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test", destination = f"s3://{bucket}/{prefix}/test"), 
    ],
    code="code/features.py",
)



Wrap the feature script in a SageMaker Pipelines `ProcessingStep`.

In [80]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep, CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="30d")


step_process = ProcessingStep(
    name="PreprocessData",
    step_args=processor_args,
    cache_config=cache_config,
)

### 5. Model Training and Inference Script <a name="model"></a>

This section demonstrates scikit-learn model training and artifact registration. The custom inference functions deserialize the model to compute predictions.

#### Script Structure

Inside the main guard (`if __name__ == "__main__":`), provide the training code, with arguments for the SageMaker Training Job.

Outside the main guard, define Sagemaker's expected inference functions:
* `input_fn`: reads preprocessed input data from the feature step via the relative directory passed into the model step container
* `model_fn`: deserializes the tar.gz artifact from the model registry containing any pretrained model artifacts
* `predict_fn`: computes model inference predictions
* (optional) `output_fn`: can configure additional custom handling of output predictions
  
Refer to the Sagemaker Python SDK documentation for model-specific inferencing details. If no custom inferencing functions are provided, the default Sagemaker inference handler will run.
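
The two containers only agree on a payload shape: the feature step emits `{"instances": [...]}` as JSON and the model step reads the `instances` list back. The sketch below exercises the model side of that contract with a toy linear model standing in for the fitted regressor; the weights, bias, and payload values are hypothetical.

```python
import json


def input_fn(request_body, request_content_type):
    """Accept only the JSON payload emitted by the preprocessing container."""
    if request_content_type != "application/json":
        raise ValueError("This model only supports application/json input")
    return json.loads(request_body)["instances"]


def predict_fn(rows, model):
    """Toy linear 'model' standing in for the fitted RandomForestRegressor."""
    return [
        sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
        for row in rows
    ]


# Payload shaped like the output of the feature script's output_fn.
upstream_payload = json.dumps({"instances": [[1.0, 1.0], [-1.0, -1.0]]})
toy_model = {"weights": [0.5, 0.25], "bias": 0.25}  # hypothetical values
predictions = predict_fn(input_fn(upstream_payload, "application/json"), toy_model)
print(predictions)
```

If the upstream script ever changes its accept type or the `instances` key, this is the function that would start raising, so it is a natural place for a local regression test.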


In [81]:
%%writefile code/model.py

import argparse, os
import boto3
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn import metrics
import joblib
import pathlib
import pickle
from io import StringIO


def get_train_data(train_dir):
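    # train.csv is written without a header row, so do not treat the first record as column names
    train_data = pd.read_csv(os.path.join(train_dir, "train.csv"), header=None)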
    x_train = train_data.iloc[:,:-1].to_numpy()
    y_train = train_data.iloc[:,-1].to_numpy()
    print("x train", x_train.shape, "y train", y_train.shape)

    return x_train, y_train


def get_test_data(test_dir):
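    # test.csv is written without a header row, so do not treat the first record as column names
    test_data = pd.read_csv(os.path.join(test_dir, "test.csv"), header=None)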
    x_test = test_data.iloc[:,:-1].to_numpy()
    y_test = test_data.iloc[:,-1].to_numpy()
    print("x test", x_test.shape, "y test", y_test.shape)

    return x_test, y_test


# Model Training
if __name__ == '__main__':
    
    feature_columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    ]
    
    label_column = "medianHouseValue"
    
    # Passing in environment variables and hyperparameters for our training script
    parser = argparse.ArgumentParser()
    
    # Hyperparameters and SageMaker environment arguments
    parser.add_argument('--model_dir', type=str)
    parser.add_argument("--n_estimators", type=int, default=20)
    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--eval", type=str, default= "/opt/ml/processing/evaluation")
    
    args, _ = parser.parse_known_args()
    n_estimators     = args.n_estimators
    model_dir  = args.model_dir
    sm_model_dir = args.sm_model_dir
    training_dir   = args.train
    eval_dir = args.eval
    
    # Reading in data
    df = pd.read_csv(training_dir + '/train.csv',sep=',', header=None)
    print("Data read into model script:")
    print(df)
    
    # Split data    
    print("Training data location: {}".format(args.train))
    print("Test data location: {}".format(args.test))
    x_train, y_train = get_train_data(args.train)
    x_test, y_test = get_test_data(args.test)
    
    # Model Building
    model = RandomForestRegressor(n_estimators=args.n_estimators)
    model.fit(x_train, y_train)
    
    # Evaluate on test set
    y_preds = model.predict(x_test)
    print("\nR2 score on test set :", model.score(x_test,y_test))
    
    # Write results to file
    eval_results = {
        "Test R2": model.score(x_test,y_test)
    }

    # Serializing json
    results = json.dumps(eval_results, indent=4)

    # Writing serialized eval results to file in container
    pathlib.Path(eval_dir).mkdir(parents=True, exist_ok=True)
    evaluation_path = f"{eval_dir}/evaluation.json"
    
    with open(f"{evaluation_path}", "w") as outfile:
        outfile.write(results)
                
    # Save model
    joblib.dump(model, os.path.join(args.sm_model_dir, "model.joblib"))
    
    
# inference functions
def input_fn(request_body, request_content_type):
    """Parse input data payload
    """
    print("Entering model input_fn.")
    if request_content_type == "application/json":
        request_body = json.loads(request_body)
        inpVar = request_body["instances"]
        return inpVar
    else:
        raise ValueError("This model only supports application/json input")

def model_fn(model_dir):
    """
    Deserialize fitted model
    """
    print("Entering model model_fn.")
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    return model


def predict_fn(input_data, model):
    """
    Generate model inference predictions
    """
    print("Entering model predict_fn.")
    return model.predict(input_data)


Overwriting code/model.py


Define the SKLearn estimator and wrap model training in a `TrainingStep`.

In [82]:
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession
import time

sklearn_estimator = SKLearn(
    sagemaker_session=pipeline_session,
    entry_point="code/model.py", 
    framework_version=sklearn_framework_version, 
    instance_type=training_instance_type, 
    role=role
)

train_args = sklearn_estimator.fit(
    {
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)

step_train_model = TrainingStep(name="TrainSKLearnModel", step_args=train_args)



### 6. Define a model creation step <a name="createregister"></a>

In [83]:
from sagemaker.model import Model
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import PipelineModel
from sagemaker import image_uris


scaler_model_s3 = "{}/model.tar.gz".format(
    step_process.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
)

scaler_model = SKLearnModel(
    model_data=scaler_model_s3,
    role=role,
    sagemaker_session=pipeline_session,
    entry_point="code/features.py",
    framework_version=sklearn_framework_version,
)

sklearn_model_image_uri = image_uris.retrieve(
    framework="sklearn",
    region=region,
    version=sklearn_framework_version,
    py_version="py3",
    instance_type=training_instance_type,
)

sklearn_model = SKLearnModel(
    framework_version = sklearn_framework_version,
    model_data=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    entry_point="code/model.py", 
    role=role,
)

pipeline_model = PipelineModel(
    models=[scaler_model, sklearn_model], role=role, sagemaker_session=pipeline_session)

INFO:sagemaker.processing:Uploaded None to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/processing/sklearn-housing-data-process-2023-12-15-21-22-29-101/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/processing/sklearn-housing-data-process-2023-12-15-21-22-29-101/source/runproc.sh


Using provided s3_resource


In [84]:
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(
    name="PipelineModelCreation",
    step_args=pipeline_model.create(instance_type=create_model_instance_type)
)



### Define a Model Registration Step

In [90]:
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel


register_args = pipeline_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=[processing_instance_type, training_instance_type],
    transform_instances=[config_data["BATCH_TRANSFORM_INSTANCE_TYPE"]],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
)

step_register_pipeline_model = ModelStep(
    name="PipelineModelRegistration",
    step_args=register_args
)



### 7. Define a SageMaker Pipeline <a name="pipeline"></a>

Wrap the feature building and model building for training and inference in a Sagemaker Pipeline.

In [91]:
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name=pipeline_name,
    sagemaker_session=sagemaker_session,
    parameters=[
        training_instance_type,
        processing_instance_type,
        processing_instance_count,
        input_data,
        model_approval_status,
    ],
    
    steps = [step_process, step_train_model, step_create_model, step_register_pipeline_model]
)
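
The order in which steps run follows from the data dependencies between them, not from their position in the `steps` list. The sketch below shows that idea with a plain topological sort. The dependency map is an assumption read off the property references in this notebook (training consumes the processing outputs; model creation and registration consume the training artifact), and it uses the `ModelStep` names rather than any sub-step names the service may generate.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose outputs it consumes (assumed from the
# property references used when the steps were defined).
dependencies = {
    "PreprocessData": set(),
    "TrainSKLearnModel": {"PreprocessData"},
    "PipelineModelCreation": {"TrainSKLearnModel"},
    "PipelineModelRegistration": {"TrainSKLearnModel"},
}

# static_order() yields every step after all of its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two model steps have no dependency on each other, so their relative order is not determined by the graph.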

In [92]:
import json

definition = json.loads(pipeline.definition())
definition


INFO:sagemaker.processing:Uploaded None to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/processing/serial-inference-pipeline/code/285c41c3c73f2437a1b221759a34fd6c/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/code/ece5752bed68960c30544c25e83cd9ee/runproc.sh


Using provided s3_resource
Using provided s3_resource


{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'TrainingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'ProcessingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'InputData',
   'Type': 'String',
   'DefaultValue': 's3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/housing_data/raw'},
  {'Name': 'ModelApprovalStatus',
   'Type': 'String',
   'DefaultValue': 'Approved'}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'PreprocessData',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecificati

### 8. Submit the pipeline and start execution <a name="submit"></a>

Upserting the pipeline definition with the execution role (`role_arn`) and starting a [pipeline execution](https://docs.aws.amazon.com/sagemaker/latest/dg/run-pipeline.html) kicks off preprocessing, training, model creation, and registration.

In [93]:
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()
print("------- done -------")

INFO:sagemaker.processing:Uploaded None to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/processing/serial-inference-pipeline/code/285c41c3c73f2437a1b221759a34fd6c/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/code/ece5752bed68960c30544c25e83cd9ee/runproc.sh


Using provided s3_resource
Using provided s3_resource


------- done -------


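Get the name of the most recently created model.

In [94]:
# Sort explicitly so that the first entry is the newest model
sm_model_name = sm.list_models(SortBy="CreationTime", SortOrder="Descending", MaxResults=1)["Models"][0]["ModelName"]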

print(sm_model_name)

pipelines-cqmtuckawika-PipelineModelCreatio-QYzdoR4H69


List model registry information:

In [95]:
from utils import get_approved_package

pck = get_approved_package(model_package_group_name, sm) 
model_description = sm.describe_model_package(ModelPackageName=pck["ModelPackageArn"])

print(model_description)

INFO:utils:Identified the latest approved model package: arn:aws:sagemaker:ca-central-1:817463428454:model-package/PipelineSKLearnModelPackageGroup/5


{'ModelPackageGroupName': 'PipelineSKLearnModelPackageGroup', 'ModelPackageVersion': 5, 'ModelPackageArn': 'arn:aws:sagemaker:ca-central-1:817463428454:model-package/PipelineSKLearnModelPackageGroup/5', 'CreationTime': datetime.datetime(2023, 12, 15, 21, 25, 55, 928000, tzinfo=tzlocal()), 'InferenceSpecification': {'Containers': [{'Image': '341280168497.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3', 'ImageDigest': 'sha256:9b43ef4706faae38d10bdff012a0d1b35ed9c5b3aac9e60c960170f10d29fa51', 'ModelDataUrl': 's3://sklearn-mlops-inf-pipeline-demo/serial-inference-pipeline/processing/model.tar.gz', 'Environment': {'SAGEMAKER_CONTAINER_LOG_LEVEL': '20', 'SAGEMAKER_PROGRAM': 'features.py', 'SAGEMAKER_REGION': 'ca-central-1', 'SAGEMAKER_SUBMIT_DIRECTORY': 's3://sklearn-mlops-inf-pipeline-demo/sagemaker-scikit-learn-2023-12-15-21-23-05-348/sourcedir.tar.gz'}}, {'Image': '341280168497.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3', 'ImageDigest'

### 9. Run serial batch inference job <a name="inference"></a>
After the pipeline has finished executing, look up the SageMaker model name and pass it to a SageMaker Batch Transform job for inference, along with a raw test dataset.

In [96]:
import sagemaker
input_data_path = 's3://{}/{}'.format(sagemaker_session.default_bucket(), f"{prefix}/test/test.csv") 
output_data_path = 's3://{}/{}'.format(sagemaker_session.default_bucket(), f'{prefix}/batch-transform/output')

transform_job = sagemaker.transformer.Transformer(
    model_name = sm_model_name,
    instance_count = int(config_data["BATCH_TRANSFORM_INSTANCE_COUNT"]),
    instance_type = config_data["BATCH_TRANSFORM_INSTANCE_TYPE"],
    strategy = 'MultiRecord',
    assemble_with = 'Line',
    output_path = output_data_path,
    base_transform_job_name='inference-pipelines-batch',
    sagemaker_session=sagemaker_session,
    accept = 'text/csv')

transform_job.transform(data = input_data_path, 
                        content_type = 'text/csv', 
                        split_type = 'Line',
                        join_source='Input')
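
With `join_source='Input'` and line-based splitting and assembly, each output record is the original input record with its prediction appended. That is why the result file has one more column than the input. The sketch below imitates that behavior in plain Python. It is an illustration, not the service's implementation; `join_with_input` is a hypothetical helper and the records are example values with eight features plus the label.

```python
def join_with_input(input_lines, predictions):
    """Append each prediction to its input record, one CSV line per record."""
    if len(input_lines) != len(predictions):
        raise ValueError("one prediction per input record is required")
    return [f"{line},{pred}" for line, pred in zip(input_lines, predictions)]


# Example records: eight features followed by the scaled label.
inputs = [
    "-118.36,34.07,48.0,1740.0,360.0,748.0,357.0,4.7019,0.8222",
    "-121.41,38.64,38.0,1384.0,287.0,682.0,280.0,1.9167,0.1288",
]
joined = join_with_input(inputs, [0.73684, 0.14469])
for line in joined:
    print(line)
```

Each joined line has ten fields, matching the ten column names used when the predictions are read back.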

INFO:sagemaker:Creating transform job with name: inference-pipelines-batch-2023-12-15-21-26-35-112


...................................
2023-12-15 21:32:27,838 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2023-12-15 21:32:27,842 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2023-12-15 21:32:27,843 INFO - sagemaker-containers - nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}

### 10. Download Inference Predictions<a name="download"></a>

After the batch transform job has completed, you can download and read inference predictions for future use.

In [97]:
s3 = boto3.client('s3')

####### LIST FILES IN S3 #########
res = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/batch-transform/output")
                         
# select the files
for item in res["Contents"]:
    print(item['Key'])


serial-inference-pipeline/batch-transform/output/test.csv.out


In [98]:
s3 = boto3.client('s3')

FILE_NAME = "batch_inference_preds.csv"
BUCKET_NAME = bucket
OBJECT_NAME = f"{prefix}/batch-transform/output/test.csv.out"

s3.download_file(BUCKET_NAME, OBJECT_NAME, FILE_NAME)

In [100]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
    "predictedTarget"
]

inf_results = pd.read_csv("batch_inference_preds.csv", names=columns)

display(inf_results)

| | longitude | latitude | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | medianHouseValue | predictedTarget |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.36 | 34.07 | 48.0 | 1740.0 | 360.0 | 748.0 | 357.0 | 4.7019 | 0.822200 | 0.736840 |
| 1 | -121.41 | 38.64 | 38.0 | 1384.0 | 287.0 | 682.0 | 280.0 | 1.9167 | 0.128800 | 0.144690 |
| 2 | -117.87 | 33.84 | 17.0 | 2395.0 | 410.0 | 1224.0 | 399.0 | 5.1182 | 0.498400 | 0.434500 |
| 3 | -122.02 | 38.26 | 20.0 | 3899.0 | 763.0 | 2198.0 | 779.0 | 3.2061 | 0.240800 | 0.262640 |
| 4 | -119.73 | 36.31 | 20.0 | 2440.0 | 433.0 | 1579.0 | 400.0 | 2.8281 | 0.120400 | 0.176340 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6807 | -117.91 | 33.63 | 20.0 | 3442.0 | 1526.0 | 1427.0 | 977.0 | 3.1985 | 0.212600 | 0.534670 |
| 6808 | -120.86 | 37.76 | 32.0 | 964.0 | 198.0 | 623.0 | 201.0 | 3.0917 | 0.177800 | 0.275460 |
| 6809 | -122.43 | 37.81 | 39.0 | 3275.0 | 837.0 | 1137.0 | 725.0 | 3.7679 | 1.000002 | 0.700341 |
| 6810 | -121.91 | 36.59 | 31.0 | 2034.0 | 335.0 | 966.0 | 322.0 | 4.6964 | 0.582600 | 0.553750 |
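
Since the joined output keeps the label next to the prediction, a quick error estimate can be computed directly from it. The sketch below does this for the first five rows shown above only, so the figure is a spot check and not a metric for the whole test set. The variable names are hypothetical, and the 500000 factor is the target scaling applied when the data was prepared.

```python
# (medianHouseValue, predictedTarget) pairs copied from the first five rows above.
rows = [
    (0.8222, 0.73684),
    (0.1288, 0.14469),
    (0.4984, 0.4345),
    (0.2408, 0.26264),
    (0.1204, 0.17634),
]
TARGET_SCALE = 500_000  # divisor applied to medianHouseValue during data preparation

# Mean absolute error on the scaled target, then converted back to dollars.
mae_scaled = sum(abs(actual - pred) for actual, pred in rows) / len(rows)
mae_dollars = mae_scaled * TARGET_SCALE
print(f"MAE (scaled): {mae_scaled:.4f}")
print(f"MAE (dollars): {mae_dollars:.0f}")
```

The same calculation applied to the full `inf_results` DataFrame would use its `medianHouseValue` and `predictedTarget` columns.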
