# Step 3: Add an ML pipeline with PyTorch Autoencoder
<div class="alert alert-warning"> This notebook demonstrates PyTorch autoencoder pipeline for anomaly detection. Last tested on a SageMaker Studio JupyterLab instance using the <code>SageMaker Distribution Image 3.0.1</code> and with the SageMaker Python SDK version <code>2.245.0</code></div>

In this step you automate our end-to-end ML workflow using [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) and [Amazon SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html). You make feature engineering re-usable, repeatable, and scaleable using [Amazon SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/).

This pipeline implements a **PyTorch autoencoder for anomaly detection** with the following key differences from supervised learning:
- **Unsupervised learning approach** - No target labels needed for training
- **Reconstruction error-based evaluation** - Anomalies have higher reconstruction errors
- **Threshold-based classification** - Uses percentile-based thresholds for anomaly detection

||||
|---|---|---|
|1. |Experiment with autoencoder in a notebook ||
|2. |Scale with SageMaker AI processing jobs and SageMaker SDK ||
|3. |Operationalize with ML pipeline, model registry|**<<<< YOU ARE HERE**|
|4. |Add a model deployment pipeline ||
|5. |Add streaming inference with SQS ||

<div class="alert alert-info"> Make sure you using <code>Python 3</code> kernel in JupyterLab for this notebook.</div>




In [None]:
# Standard library
import os
from time import gmtime, strftime
from importlib.metadata import version

# Third-party libraries
import boto3
import mlflow
import pandas as pd  # Keep if used in pipeline_steps modules

# SageMaker imports
import sagemaker
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.processing import FrameworkProcessor
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep, CacheConfig
from sagemaker.workflow.parameters import (
    ParameterInteger, 
    ParameterFloat, 
    ParameterString
)
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig

# IPython/Jupyter specific
from IPython.display import HTML
import torch
# Import from local modules (if these exist)

(sagemaker.__version__, boto3.__version__, mlflow.__version__, torch.__version__)


('2.249.0', '1.40.3', '2.22.1', '2.6.0')

In [4]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")

Unable to restore variable 'pytorch_estimator', ignoring (use %store -d to forget!)
The error was: <class 'KeyError'>
Stored variables and their in-db values:
baseline_s3_url                        -> 's3://sagemaker-us-west-2-224425919845/from-idea-t
bucket_name                            -> 'sagemaker-us-west-2-224425919845'
bucket_prefix                          -> 'from-idea-to-prod/autoencoder'
dataset_file_local_path                -> 'data/bank-additional/bank-additional-full.csv'
domain_id                              -> 'd-rtctvdud9qsp'
endpoint_name                          -> 'from-idea-to-prod-autoencoder-endpoint-06-04-51-2
evaluation_s3_url                      -> 's3://sagemaker-us-west-2-224425919845/from-idea-t
initialized                            -> True
input_s3_url                           -> 's3://sagemaker-us-west-2-224425919845/from-idea-t
mlflow_arn                             -> 'arn:aws:sagemaker:us-west-2:224425919845:mlflow-t
mlflow_name                  

In [5]:
# Set names of pipeline objects, experiment, and a model
project = "from-idea-to-prod"

current_timestamp = strftime('%d-%H-%M-%S', gmtime())

registered_model_name = f"{project}-autoencoder-pipeline-model-{current_timestamp}"
experiment_name = f"{project}-autoencoder-pipeline-{current_timestamp}"
pipeline_name = f"{project}-autoencoder-pipeline-{current_timestamp}"
pipeline_model_name = f"{project}-model-autoencoder"
model_package_group_name = registered_model_name
endpoint_config_name = f"{project}-autoencoder-endpoint-config"
endpoint_name = f"{project}-autoencoder-endpoint"
model_approval_status = "PendingManualApproval"

In [7]:
# Set instance types and counts for autoencoder training
process_instance_type = "ml.m5.large"
train_instance_type = "ml.g4dn.xlarge"  # Slightly larger for PyTorch training

In [8]:

# Set S3 urls for various datasets produced in the pipeline
output_s3_prefix = f"s3://{bucket_name}/{bucket_prefix}"
output_s3_url = f"{output_s3_prefix}/output"

train_s3_url = f"{output_s3_prefix}/train"
validation_s3_url = f"{output_s3_prefix}/validation"
test_s3_url = f"{output_s3_prefix}/test"
evaluation_s3_url = f"{output_s3_prefix}/evaluation"

baseline_s3_url = f"{output_s3_prefix}/baseline"
baseline_results_s3_url = f"{baseline_s3_url}/results"

prediction_baseline_s3_url = f"{output_s3_prefix}/prediction_baseline"
prediction_baseline_results_s3_url=f"{prediction_baseline_s3_url}/results"


In [9]:
%store train_s3_url
%store validation_s3_url
%store test_s3_url
%store baseline_s3_url
%store pipeline_name
%store model_package_group_name
%store evaluation_s3_url
%store prediction_baseline_s3_url
%store output_s3_url

Stored 'train_s3_url' (str)
Stored 'validation_s3_url' (str)
Stored 'test_s3_url' (str)
Stored 'baseline_s3_url' (str)
Stored 'pipeline_name' (str)
Stored 'model_package_group_name' (str)
Stored 'evaluation_s3_url' (str)
Stored 'prediction_baseline_s3_url' (str)
Stored 'output_s3_url' (str)


In [10]:
print(f"Train S3 url: {train_s3_url}")
print(f"Validation S3 url: {validation_s3_url}")
print(f"Test S3 url: {test_s3_url}")
print(f"Data baseline S3 url: {baseline_s3_url}")
print(f"Evaluation metrics S3 url: {evaluation_s3_url}")
print(f"Model prediction baseline S3 url: {prediction_baseline_s3_url}")


Train S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/train
Validation S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/validation
Test S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/test
Data baseline S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/baseline
Evaluation metrics S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/evaluation
Model prediction baseline S3 url: s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/prediction_baseline


In [11]:

def get_pytorch_autoencoder_estimator(
    session,
    instance_type,
    output_s3_url,
    base_job_name,
):
    """Create PyTorch estimator for autoencoder training"""
    estimator = PyTorch(
        entry_point='train_autoencoder.py',
        source_dir='./training',
        role=sagemaker.get_execution_role(),
        instance_type=instance_type,
        instance_count=1,
        framework_version='1.12',
        py_version='py38',
        output_path=output_s3_url,
        sagemaker_session=session,
        base_job_name=base_job_name,
        environment={
            'MLFLOW_TRACKING_ARN': mlflow_arn,
            'MLFLOW_EXPERIMENT_NAME': experiment_name,
            'REGION': region
        },
        enable_sagemaker_metrics=True,
        metric_definitions=[
            {'Name': 'train_loss', 'Regex': 'Train Loss: ([0-9\\.]+)'},
            {'Name': 'val_loss', 'Regex': 'Val Loss: ([0-9\\.]+)'},
            {'Name': 'reconstruction_threshold', 'Regex': 'threshold.*: ([0-9\\.]+)'}
        ]
    )
    
    # Set hyperparameters for autoencoder
    estimator.set_hyperparameters(
        encoding_dim=32,
        dropout_rate=0.2,
        learning_rate=0.001,
        batch_size=64,
        num_epochs=100,
        weight_decay=1e-5
    )

    return estimator

def get_pytorch_processor(
    session,
    instance_type,
    base_job_name,
):
    """Create PyTorch processor for data processing"""
    processor = FrameworkProcessor(
        estimator_cls=PyTorch,
        framework_version='1.12',
        py_version='py38',
        role=sagemaker.get_execution_role(),
        instance_type=instance_type,
        instance_count=1,
        base_job_name=base_job_name,
        sagemaker_session=session,
        env={
            'MLFLOW_TRACKING_ARN': mlflow_arn,
            'MLFLOW_EXPERIMENT_NAME': experiment_name,
            'REGION': region
        }
    )
    
    return processor


## Configure MLflow

In [13]:
sm = boto3.client("sagemaker")

while sm.describe_mlflow_tracking_server(TrackingServerName=mlflow_name)['TrackingServerStatus'] != 'Created':
    print(f"The MLflow server {mlflow_name} is not in the status 'Created'")
    sleep(30)
else:
    print(f"Using server {mlflow_name}")

Using server mlflow-d-rtctvdud9qsp


In [14]:
mlflow.set_tracking_uri(mlflow_arn)
experiment = mlflow.set_experiment(experiment_name=experiment_name)

2025/08/06 08:54:39 INFO mlflow.tracking.fluent: Experiment with name 'from-idea-to-prod-autoencoder-pipeline-06-08-50-22' does not exist. Creating a new experiment.


## A SageMaker pipeline

### Setup pipeline parameters

In [15]:

# Set processing instance type
process_instance_type_param = ParameterString(
    name="ProcessingInstanceType",
    default_value=process_instance_type,
)

# Set training instance type
train_instance_type_param = ParameterString(
    name="TrainingInstanceType",
    default_value=train_instance_type,
)

# Set model approval status for the model registry
model_approval_status_param = ParameterString(
    name="ModelApprovalStatus",
    default_value=model_approval_status
)

# Minimal threshold for model performance on the test dataset (ROC AUC for autoencoder)
test_score_threshold_param = ParameterFloat(
    name="TestScoreThreshold",
    default_value=0.65  # Lower threshold for autoencoder anomaly detection
)

# Parametrize the S3 url for input dataset
input_s3_url_param = ParameterString(
    name="InputDataUrl",
    default_value=input_s3_url,
)

# Model package group name
model_package_group_name_param = ParameterString(
    name="ModelPackageGroupName",
    default_value=model_package_group_name,
)

# MLflow tracking server ARN
tracking_server_arn_param = ParameterString(
    name="TrackingServerARN",
    default_value=mlflow_arn,
)

# Autoencoder hyperparameters
encoding_dim_param = ParameterInteger(name="EncodingDim", default_value=32)
dropout_rate_param = ParameterFloat(name="DropoutRate", default_value=0.2)
learning_rate_param = ParameterFloat(name="LearningRate", default_value=0.001)
batch_size_param = ParameterInteger(name="BatchSize", default_value=64)
num_epochs_param = ParameterInteger(name="NumEpochs", default_value=100)
weight_decay_param = ParameterFloat(name="WeightDecay", default_value=1e-5)


In [16]:
!aws s3 ls {input_s3_url}

2025-08-06 06:22:17    5834924 bank-additional-full.csv


### Implement and test the pipeline steps

In [20]:
%mkdir -p pipeline_steps/

In [21]:
%%writefile pipeline_steps/preprocess_autoencoder.py
#!/usr/bin/env python3

import pandas as pd
import numpy as np
import argparse
import os
import mlflow
from time import gmtime, strftime
from sklearn.preprocessing import StandardScaler
import boto3

def preprocess_autoencoder(
    input_data_s3_path,
    output_s3_prefix,
    tracking_server_arn,
    experiment_name,
    pipeline_run_name=None,
):
    """
    Preprocess data for autoencoder training - unsupervised learning approach
    """
    
    # Set up MLflow
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)
    
    run_name = f"preprocess-autoencoder-{strftime('%d-%H-%M-%S', gmtime())}"
    if pipeline_run_name:
        run_name = f"preprocess-{pipeline_run_name}"
    
    with mlflow.start_run(run_name=run_name, description="Data preprocessing for autoencoder") as run:
        
        # Download and load data
        print(f"Loading data from {input_data_s3_path}")
        
        # Extract bucket and key from S3 path
        s3_parts = input_data_s3_path.replace("s3://", "").split("/", 1)
        bucket = s3_parts[0]
        key = s3_parts[1]
        
        # Download file locally
        s3_client = boto3.client('s3')
        local_file = '/tmp/input_data.csv'
        s3_client.download_file(bucket, key, local_file)
        
        # Load data
        df_raw = pd.read_csv(local_file, sep=";")
        print(f"Original data shape: {df_raw.shape}")
        
        # Feature engineering (same as before but we'll use all features for reconstruction)
        df_data = df_raw.copy()
        df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)
        df_data["not_working"] = np.where(
            np.in1d(df_data["job"], ["student", "retired", "unemployed"]), 1, 0
        )

        # Remove unnecessary data but keep more features for autoencoder
        df_model_data = df_data.drop(
            ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
            axis=1,
        )

        # Age binning
        bins = [18, 30, 40, 50, 60, 70, 90]
        labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-plus']
        df_model_data['age_range'] = pd.cut(df_model_data.age, bins, labels=labels, include_lowest=True)
        df_model_data = pd.concat([df_model_data, pd.get_dummies(df_model_data['age_range'], prefix='age', dtype=int)], axis=1)
        df_model_data.drop('age', axis=1, inplace=True)
        df_model_data.drop('age_range', axis=1, inplace=True)

        # Scale numerical features
        scaled_features = ['pdays', 'previous', 'campaign']
        scaler = StandardScaler()
        df_model_data[scaled_features] = scaler.fit_transform(df_model_data[scaled_features])

        # Convert categorical variables to dummy variables
        df_model_data = pd.get_dummies(df_model_data, dtype=int)

        # For autoencoder, we'll separate the target for evaluation but not use it in training
        target_col = "y"
        if 'y_yes' in df_model_data.columns and 'y_no' in df_model_data.columns:
            # Keep target for anomaly evaluation
            target_data = df_model_data["y_yes"].copy()
            # Remove target columns from features for unsupervised learning
            feature_data = df_model_data.drop(["y_no", "y_yes"], axis=1)
        else:
            target_data = None
            feature_data = df_model_data
        
        print(f"Feature data shape after processing: {feature_data.shape}")
        
        # For autoencoder, we typically use normal data for training and test on both normal and anomalous
        # Split data: 70% train, 15% validation, 15% test
        train_size = int(0.7 * len(feature_data))
        val_size = int(0.15 * len(feature_data))
        
        # Shuffle data
        shuffled_indices = np.random.permutation(len(feature_data))
        feature_data_shuffled = feature_data.iloc[shuffled_indices].reset_index(drop=True)
        if target_data is not None:
            target_data_shuffled = target_data.iloc[shuffled_indices].reset_index(drop=True)
        
        # Split features
        train_features = feature_data_shuffled[:train_size]
        val_features = feature_data_shuffled[train_size:train_size + val_size]
        test_features = feature_data_shuffled[train_size + val_size:]
        
        # Split targets (for evaluation)
        if target_data is not None:
            train_targets = target_data_shuffled[:train_size]
            val_targets = target_data_shuffled[train_size:train_size + val_size]
            test_targets = target_data_shuffled[train_size + val_size:]
        
        print(f"Data split > train:{train_features.shape} | validation:{val_features.shape} | test:{test_features.shape}")
        
        # Log parameters to MLflow
        mlflow.log_params({
            "train_features": train_features.shape,
            "val_features": val_features.shape,
            "test_features": test_features.shape,
            "total_features": feature_data.shape[1]
        })

        mlflow.set_tags({
            'mlflow.source.type': 'JOB',
            'model_type': 'autoencoder',
            'step': 'preprocessing'
        })
        
        # Upload datasets to S3
        s3_client = boto3.client('s3')
        
        # Extract bucket from output prefix
        output_bucket = output_s3_prefix.replace("s3://", "").split("/")[0]
        output_prefix = "/".join(output_s3_prefix.replace("s3://", "").split("/")[1:])
        
        # Save and upload train data
        train_local = '/tmp/train.csv'
        train_features.to_csv(train_local, index=False, header=False)
        train_key = f"{output_prefix}/train/train.csv"
        s3_client.upload_file(train_local, output_bucket, train_key)
        train_s3_url = f"s3://{output_bucket}/{train_key}"
        
        # Save and upload validation data
        val_local = '/tmp/validation.csv'
        val_features.to_csv(val_local, index=False, header=False)
        val_key = f"{output_prefix}/validation/validation.csv"
        s3_client.upload_file(val_local, output_bucket, val_key)
        validation_s3_url = f"s3://{output_bucket}/{val_key}"
        
        # Save and upload test features
        test_x_local = '/tmp/test_features.csv'
        test_features.to_csv(test_x_local, index=False, header=False)
        test_x_key = f"{output_prefix}/test/test_features.csv"
        s3_client.upload_file(test_x_local, output_bucket, test_x_key)
        test_x_s3_url = f"s3://{output_bucket}/{test_x_key}"
        
        # Save and upload test targets (for evaluation)
        if target_data is not None:
            test_y_local = '/tmp/test_targets.csv'
            test_targets.to_csv(test_y_local, index=False, header=False)
            test_y_key = f"{output_prefix}/test/test_targets.csv"
            s3_client.upload_file(test_y_local, output_bucket, test_y_key)
            test_y_s3_url = f"s3://{output_bucket}/{test_y_key}"
        else:
            test_y_s3_url = None
        
        # Save and upload baseline data
        baseline_local = '/tmp/baseline.csv'
        feature_data.to_csv(baseline_local, index=False, header=False)
        baseline_key = f"{output_prefix}/baseline/baseline.csv"
        s3_client.upload_file(baseline_local, output_bucket, baseline_key)
        baseline_s3_url = f"s3://{output_bucket}/{baseline_key}"
        
        # Log artifacts to MLflow
        mlflow.log_artifact(baseline_local, "baseline")
        
        print("## Processing complete.")
        
        return {
            'train_data': train_s3_url,
            'validation_data': validation_s3_url,
            'test_x_data': test_x_s3_url,
            'test_y_data': test_y_s3_url,
            'baseline_data': baseline_s3_url,
            'experiment_name': experiment_name,
            'pipeline_run_id': pipeline_run_name or run.info.run_id
        }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-data-s3-path', type=str, required=True)
    parser.add_argument('--output-s3-prefix', type=str, required=True)
    parser.add_argument('--tracking-server-arn', type=str, required=True)
    parser.add_argument('--experiment-name', type=str, required=True)
    parser.add_argument('--pipeline-run-name', type=str, default=None)
    
    args = parser.parse_args()
    
    result = preprocess_autoencoder(
        input_data_s3_path=args.input_data_s3_path,
        output_s3_prefix=args.output_s3_prefix,
        tracking_server_arn=args.tracking_server_arn,
        experiment_name=args.experiment_name,
        pipeline_run_name=args.pipeline_run_name
    )
    
    print(f"Preprocessing result: {result}")


Writing pipeline_steps/preprocess_autoencoder.py


#### Processing step

In [23]:
from pipeline_steps.preprocess_autoencoder import preprocess_autoencoder

In [24]:
r_preprocess = preprocess_autoencoder(
    input_data_s3_path=input_s3_url,
    output_s3_prefix=output_s3_prefix,
    tracking_server_arn=mlflow_arn,
    experiment_name=f"local-test-{current_timestamp}"
)
r_preprocess

2025/08/06 09:00:18 INFO mlflow.tracking.fluent: Experiment with name 'local-test-06-08-50-22' does not exist. Creating a new experiment.


Loading data from s3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/input/bank-additional-full.csv
Original data shape: (41188, 21)
Feature data shape after processing: (41188, 64)
Data split > train:(28831, 64) | validation:(6178, 64) | test:(6179, 64)
## Processing complete.
🏃 View run preprocess-autoencoder-06-09-00-18 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56/runs/c7e41bb462cf4562aae4faf914d4d6b5
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56


{'train_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/train/train.csv',
 'validation_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/validation/validation.csv',
 'test_x_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/test/test_features.csv',
 'test_y_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/test/test_targets.csv',
 'baseline_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/baseline/baseline.csv',
 'experiment_name': 'local-test-06-08-50-22',
 'pipeline_run_id': 'c7e41bb462cf4562aae4faf914d4d6b5'}

In [25]:
# check that the function generated output
!aws s3 ls {output_s3_prefix}/test/

2025-08-06 09:00:20    1121718 test_features.csv
2025-08-06 09:00:21      12358 test_targets.csv
2025-07-28 02:51:48     600465 test_x.csv
2025-07-28 02:51:47       8238 test_y.csv


#### Training step

In [26]:
# use PipelineSession() in the estimator for pipeline construction
estimator = get_pytorch_autoencoder_estimator(
    session=sagemaker.Session(),
    instance_type=train_instance_type,
    output_s3_url=output_s3_url,
    base_job_name=f"{project}-autoencoder-train",
)


In [28]:
# Set up the training inputs using the outputs from preprocess function
training_inputs = {
    "train": TrainingInput(
        s3_data=r_preprocess['train_data'],
        content_type="text/csv",
    ),
    "validation": TrainingInput(
        s3_data=r_preprocess['validation_data'],
        content_type="text/csv",
    ),
}

## Prepare evaluate step

In [29]:
%%writefile pipeline_steps/evaluate_autoencoder.py
#!/usr/bin/env python3

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import boto3
import tarfile
import os
import json
import mlflow
from time import gmtime, strftime
from sklearn.metrics import precision_recall_curve, roc_curve, auc, classification_report
import io

class Autoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=32, dropout_rate=0.2):
        super(Autoencoder, self).__init__()
        
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(64, encoding_dim),
            nn.ReLU()
        )
        
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, input_dim),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

def load_autoencoder_model(model_s3_path):
    """Load autoencoder model from S3"""
    s3_client = boto3.client('s3')
    
    # Parse S3 path
    s3_parts = model_s3_path.replace("s3://", "").split("/", 1)
    bucket = s3_parts[0]
    key = s3_parts[1]
    
    # Download model artifacts
    local_model_path = '/tmp/model.tar.gz'
    s3_client.download_file(bucket, key, local_model_path)
    
    # Extract model
    extract_path = '/tmp/model'
    os.makedirs(extract_path, exist_ok=True)
    with tarfile.open(local_model_path, 'r:gz') as tar:
        tar.extractall(path=extract_path)
    
    # Load model checkpoint - Fix for PyTorch 2.6+ weights_only issue
    checkpoint = torch.load(os.path.join(extract_path, 'model.pth'), map_location='cpu', weights_only=False)
    
    # Create and load model
    model = Autoencoder(
        checkpoint['input_dim'],
        checkpoint['encoding_dim'],
        checkpoint['dropout_rate']
    )
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    return model, checkpoint

def evaluate_autoencoder(
    test_x_data_s3_path,
    test_y_data_s3_path,
    model_s3_path,
    output_s3_prefix,
    tracking_server_arn,
    experiment_name,
    pipeline_run_id=None,
):
    """
    Evaluate autoencoder model for anomaly detection
    """
    
    # Set up MLflow
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)
    
    run_name = f"evaluate-autoencoder-{strftime('%d-%H-%M-%S', gmtime())}"
    if pipeline_run_id:
        run_name = f"evaluate-{pipeline_run_id}"
    
    with mlflow.start_run(run_name=run_name, description="Autoencoder model evaluation") as run:
        
        # Load test data
        s3_client = boto3.client('s3')
        
        # Load test features
        test_x_parts = test_x_data_s3_path.replace("s3://", "").split("/", 1)
        test_x_local = '/tmp/test_features.csv'
        s3_client.download_file(test_x_parts[0], test_x_parts[1], test_x_local)
        test_features = pd.read_csv(test_x_local, header=None)
        
        # Load test targets
        test_y_parts = test_y_data_s3_path.replace("s3://", "").split("/", 1)
        test_y_local = '/tmp/test_targets.csv'
        s3_client.download_file(test_y_parts[0], test_y_parts[1], test_y_local)
        test_targets = pd.read_csv(test_y_local, header=None)[0].values
        
        print(f"Loaded test data: {test_features.shape} features, {len(test_targets)} targets")
        
        # Load model
        model, checkpoint = load_autoencoder_model(model_s3_path)
        threshold = checkpoint['threshold']
        
        print(f"Loaded model with threshold: {threshold}")
        
        # Make predictions
        test_tensor = torch.FloatTensor(test_features.values)
        
        with torch.no_grad():
            reconstructed = model(test_tensor)
            reconstruction_errors = torch.mean((test_tensor - reconstructed) ** 2, dim=1).numpy()
        
        # Calculate metrics
        precision, recall, pr_thresholds = precision_recall_curve(test_targets, reconstruction_errors)
        pr_auc = auc(recall, precision)
        
        fpr, tpr, roc_thresholds = roc_curve(test_targets, reconstruction_errors)
        roc_auc = auc(fpr, tpr)
        
        # Calculate F1 scores and find optimal threshold
        f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1])
        optimal_idx = np.argmax(f1_scores)
        optimal_threshold = pr_thresholds[optimal_idx]
        max_f1_score = np.max(f1_scores)
        
        # Predictions using optimal threshold
        predictions = (reconstruction_errors > optimal_threshold).astype(int)
        
        # Calculate confusion matrix components
        tp = np.sum((test_targets == 1) & (predictions == 1))
        fp = np.sum((test_targets == 0) & (predictions == 1))
        tn = np.sum((test_targets == 0) & (predictions == 0))
        fn = np.sum((test_targets == 1) & (predictions == 0))
        
        # Calculate additional metrics
        precision_score = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall_score = tp / (tp + fn) if (tp + fn) > 0 else 0
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        
        # Create evaluation results
        evaluation_result = {
            "anomaly_detection_metrics": {
                "roc_auc": {"value": float(roc_auc)},
                "pr_auc": {"value": float(pr_auc)},
                "optimal_threshold": {"value": float(optimal_threshold)},
                "max_f1_score": {"value": float(max_f1_score)},
                "precision": {"value": float(precision_score)},
                "recall": {"value": float(recall_score)},
                "accuracy": {"value": float(accuracy)},
                "true_positives": {"value": int(tp)},
                "false_positives": {"value": int(fp)},
                "true_negatives": {"value": int(tn)},
                "false_negatives": {"value": int(fn)},
                "mean_reconstruction_error": {"value": float(np.mean(reconstruction_errors))},
                "std_reconstruction_error": {"value": float(np.std(reconstruction_errors))}
            }
        }
        
        # Log metrics to MLflow
        mlflow.log_metrics({
            "roc_auc": roc_auc,
            "pr_auc": pr_auc,
            "optimal_threshold": optimal_threshold,
            "max_f1_score": max_f1_score,
            "precision": precision_score,
            "recall": recall_score,
            "accuracy": accuracy,
            "mean_reconstruction_error": np.mean(reconstruction_errors),
            "std_reconstruction_error": np.std(reconstruction_errors)
        })
        
        mlflow.set_tags({
            'mlflow.source.type': 'JOB',
            'model_type': 'autoencoder',
            'step': 'evaluation'
        })
        
        # Create prediction baseline for monitoring
        prediction_baseline = pd.DataFrame({
            'prediction': predictions,
            'probability': reconstruction_errors,
            'label': test_targets
        })
        
        # Save prediction baseline
        baseline_local = '/tmp/prediction_baseline.csv'
        prediction_baseline.to_csv(baseline_local, index=False)
        
        # Upload to S3
        output_bucket = output_s3_prefix.replace("s3://", "").split("/")[0]
        output_prefix = "/".join(output_s3_prefix.replace("s3://", "").split("/")[1:])
        
        baseline_key = f"{output_prefix}/prediction_baseline/prediction_baseline.csv"
        s3_client.upload_file(baseline_local, output_bucket, baseline_key)
        prediction_baseline_s3_url = f"s3://{output_bucket}/{baseline_key}"
        
        # Save evaluation results
        eval_results_local = '/tmp/evaluation.json'
        with open(eval_results_local, 'w') as f:
            json.dump(evaluation_result, f, indent=2)
        
        eval_key = f"{output_prefix}/evaluation/evaluation.json"
        s3_client.upload_file(eval_results_local, output_bucket, eval_key)
        
        # Log artifacts
        mlflow.log_artifact(baseline_local, "prediction_baseline")
        mlflow.log_artifact(eval_results_local, "evaluation")
        
        print(f"Evaluation completed. ROC AUC: {roc_auc:.4f}, PR AUC: {pr_auc:.4f}")
        
        return {
            'evaluation_result': evaluation_result,
            'prediction_baseline_data': prediction_baseline_s3_url
        }

if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument('--test-x-data-s3-path', type=str, required=True)
    parser.add_argument('--test-y-data-s3-path', type=str, required=True)
    parser.add_argument('--model-s3-path', type=str, required=True)
    parser.add_argument('--output-s3-prefix', type=str, required=True)
    parser.add_argument('--tracking-server-arn', type=str, required=True)
    parser.add_argument('--experiment-name', type=str, required=True)
    parser.add_argument('--pipeline-run-id', type=str, default=None)
    
    args = parser.parse_args()
    
    result = evaluate_autoencoder(
        test_x_data_s3_path=args.test_x_data_s3_path,
        test_y_data_s3_path=args.test_y_data_s3_path,
        model_s3_path=args.model_s3_path,
        output_s3_prefix=args.output_s3_prefix,
        tracking_server_arn=args.tracking_server_arn,
        experiment_name=args.experiment_name,
        pipeline_run_id=args.pipeline_run_id
    )
    
    print(f"Evaluation result: {result}")


Writing pipeline_steps/evaluate_autoencoder.py


## The next code cell fits the estimator. Wait for the training job to finish.

In [31]:
from pipeline_steps.evaluate_autoencoder import load_autoencoder_model
mlflow.set_experiment(r_preprocess['experiment_name'])
with mlflow.start_run(
    run_name=f"autoencoder-training-{strftime('%d-%H-%M-%S', gmtime())}",
    description="autoencoder training in the notebook 03 with a training job") as run:
    mlflow.log_params(estimator.hyperparameters())
    
    estimator.fit(training_inputs)

    mlflow.log_param("training job name", estimator.latest_training_job.name)
    mlflow.log_metrics({i['metric_name'].replace(':', '_'):i['value'] for i in estimator.training_job_analytics.dataframe().iloc})
    mlflow.pytorch.log_model(load_autoencoder_model(estimator.model_data)[0], artifact_path="autoencoder")


INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: from-idea-to-prod-autoencoder-train-2025-08-06-09-12-00-357


2025-08-06 09:12:01 Starting - Starting the training job...
2025-08-06 09:12:22 Starting - Preparing the instances for training...
2025-08-06 09:12:47 Downloading - Downloading input data...
2025-08-06 09:13:12 Downloading - Downloading the training image...............
2025-08-06 09:16:04 Training - Training image download completed. Training in progress...[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2025-08-06 09:16:16,905 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2025-08-06 09:16:16,927 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2025-08-06 09:16:16,938 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2025-08-06 09:16:17,004 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2025-08-06 09:16:17,237 sag



🏃 View run autoencoder-training-06-09-12-00 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56/runs/ff6464d38ef44e9aaa2ce7a85c068bd8
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56


In [33]:
!aws s3 ls {estimator.model_data}

2025-08-06 09:19:31     143107 model.tar.gz


In [35]:
from pipeline_steps.evaluate_autoencoder import evaluate_autoencoder
r_eval = evaluate_autoencoder(
    test_x_data_s3_path=r_preprocess['test_x_data'],
    test_y_data_s3_path=r_preprocess['test_y_data'],
    model_s3_path=estimator.model_data,
    output_s3_prefix=output_s3_prefix,
    tracking_server_arn=mlflow_arn,
    experiment_name=r_preprocess['experiment_name'],
)
r_eval


Loaded test data: (6179, 64) features, 6179 targets
Loaded model with threshold: 0.13681658655405043


  f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1])


Evaluation completed. ROC AUC: 0.6765, PR AUC: 0.2590
🏃 View run evaluate-autoencoder-06-09-24-50 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56/runs/8295d1a6976145f096d320ac88d34f8c
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56


{'evaluation_result': {'anomaly_detection_metrics': {'roc_auc': {'value': 0.6765062665196033},
   'pr_auc': {'value': 0.25897085378591583},
   'optimal_threshold': {'value': 2.501936912536621},
   'max_f1_score': {'value': nan},
   'precision': {'value': 0.0},
   'recall': {'value': 0.0},
   'accuracy': {'value': 0.8876840912769056},
   'true_positives': {'value': 0},
   'false_positives': {'value': 2},
   'true_negatives': {'value': 5485},
   'false_negatives': {'value': 692},
   'mean_reconstruction_error': {'value': 0.043993186205625534},
   'std_reconstruction_error': {'value': 0.14815062284469604}}},
 'prediction_baseline_data': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/prediction_baseline/prediction_baseline.csv'}

In [36]:
# check that the evaluation function generated output
!aws s3 ls {output_s3_prefix}/prediction_baseline/

2025-08-06 09:24:52      99347 prediction_baseline.csv


#### Model registration step


In [40]:
%%writefile pipeline_steps/register_autoencoder.py
#!/usr/bin/env python3

import boto3
import json
import mlflow
from time import gmtime, strftime
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.drift_check_baselines import DriftCheckBaselines

def register_autoencoder(
    training_job_name,
    model_package_group_name,
    model_approval_status,
    evaluation_result,
    output_s3_prefix,
    tracking_server_arn,
    experiment_name,
    pipeline_run_id=None,
    model_statistics_s3_path=None,
    model_constraints_s3_path=None,
    model_data_statistics_s3_path=None,
    model_data_constraints_s3_path=None,
):
    """
    Register autoencoder model in SageMaker Model Registry
    """
    
    # Set up MLflow
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)
    
    run_name = f"register-autoencoder-{strftime('%d-%H-%M-%S', gmtime())}"
    if pipeline_run_id:
        run_name = f"register-{pipeline_run_id}"
    
    with mlflow.start_run(run_name=run_name, description="Autoencoder model registration") as run:
        
        # Get SageMaker client
        sm_client = boto3.client('sagemaker')
        
        # Ensure Model Package Group exists with proper tags
        try:
            # Check if model package group exists
            sm_client.describe_model_package_group(ModelPackageGroupName=model_package_group_name)
            print(f"Model Package Group {model_package_group_name} already exists")
        except sm_client.exceptions.ClientError as e:
            if e.response['Error']['Code'] == 'ValidationException':
                # Model Package Group doesn't exist, create it with tags
                print(f"Creating Model Package Group: {model_package_group_name}")
                sm_client.create_model_package_group(
                    ModelPackageGroupName=model_package_group_name,
                    ModelPackageGroupDescription=f"PyTorch autoencoder models for anomaly detection",
                    Tags=[
                        {"Key": "ModelType", "Value": "Autoencoder"},
                        {"Key": "Framework", "Value": "PyTorch"},
                        {"Key": "UseCase", "Value": "AnomalyDetection"},
                        {"Key": "Project", "Value": "from-idea-to-prod"}
                    ]
                )
            else:
                raise e
        
        # Get training job details
        training_job = sm_client.describe_training_job(TrainingJobName=training_job_name)
        model_data_url = training_job['ModelArtifacts']['S3ModelArtifacts']
        
        # Create model metrics
        model_metrics = None
        if evaluation_result:
            # Save evaluation metrics to S3
            s3_client = boto3.client('s3')
            output_bucket = output_s3_prefix.replace("s3://", "").split("/")[0]
            output_prefix = "/".join(output_s3_prefix.replace("s3://", "").split("/")[1:])
            
            metrics_local = '/tmp/model_metrics.json'
            with open(metrics_local, 'w') as f:
                json.dump(evaluation_result, f, indent=2)
            
            metrics_key = f"{output_prefix}/model_metrics/model_metrics.json"
            s3_client.upload_file(metrics_local, output_bucket, metrics_key)
            metrics_s3_url = f"s3://{output_bucket}/{metrics_key}"
            
            model_metrics = ModelMetrics(
                model_statistics=MetricsSource(
                    s3_uri=metrics_s3_url,
                    content_type="application/json"
                )
            )
        
        # Create drift check baselines if provided
        drift_check_baselines = None
        if any([model_statistics_s3_path, model_constraints_s3_path, 
                model_data_statistics_s3_path, model_data_constraints_s3_path]):
            drift_check_baselines = DriftCheckBaselines(
                model_statistics=MetricsSource(
                    s3_uri=model_statistics_s3_path,
                    content_type="application/json"
                ) if model_statistics_s3_path else None,
                model_constraints=MetricsSource(
                    s3_uri=model_constraints_s3_path,
                    content_type="application/json"
                ) if model_constraints_s3_path else None,
                model_data_statistics=MetricsSource(
                    s3_uri=model_data_statistics_s3_path,
                    content_type="application/json"
                ) if model_data_statistics_s3_path else None,
                model_data_constraints=MetricsSource(
                    s3_uri=model_data_constraints_s3_path,
                    content_type="application/json"
                ) if model_data_constraints_s3_path else None,
            )
        
        # Get execution role
        execution_role = training_job['RoleArn']
        
        # Get container image
        container_image = training_job['AlgorithmSpecification']['TrainingImage']
        
        # Create model package (without tags - tags go on the group, not individual versions)
        model_package_input_dict = {
            "ModelPackageGroupName": model_package_group_name,
            "ModelPackageDescription": f"PyTorch autoencoder for anomaly detection. Training job: {training_job_name}",
            "ModelApprovalStatus": model_approval_status,
            "InferenceSpecification": {
                "Containers": [
                    {
                        "Image": container_image,
                        "ModelDataUrl": model_data_url,
                        "Framework": "PYTORCH",
                        "FrameworkVersion": "1.12"
                    }
                ],
                "SupportedContentTypes": ["text/csv"],
                "SupportedResponseMIMETypes": ["application/json"],
                "SupportedRealtimeInferenceInstanceTypes": [
                    "ml.t2.medium",
                    "ml.m5.large",
                    "ml.m5.xlarge"
                ],
                "SupportedTransformInstanceTypes": [
                    "ml.m5.large",
                    "ml.m5.xlarge"
                ]
            }
            # Note: Tags removed - they should be on the Model Package Group, not individual versions
        }
        
        # Add model metrics if available
        if model_metrics:
            model_package_input_dict["ModelMetrics"] = {
                "ModelQuality": {
                    "Statistics": {
                        "ContentType": "application/json",
                        "S3Uri": metrics_s3_url
                    }
                }
            }
        
        # Add drift check baselines if available
        if drift_check_baselines:
            model_package_input_dict["DriftCheckBaselines"] = drift_check_baselines.to_request()
        
        # Create model package
        try:
            response = sm_client.create_model_package(**model_package_input_dict)
            model_package_arn = response['ModelPackageArn']
            
            print(f"✅ Model package created: {model_package_arn}")
            
            # Log to MLflow
            mlflow.log_params({
                "model_package_group_name": model_package_group_name,
                "model_approval_status": model_approval_status,
                "training_job_name": training_job_name
            })
            
            if evaluation_result and 'anomaly_detection_metrics' in evaluation_result:
                metrics = evaluation_result['anomaly_detection_metrics']
                mlflow.log_metrics({
                    "registered_model_roc_auc": metrics.get('roc_auc', {}).get('value', 0),
                    "registered_model_pr_auc": metrics.get('pr_auc', {}).get('value', 0),
                    "registered_model_f1_score": metrics.get('max_f1_score', {}).get('value', 0)
                })
            
            mlflow.set_tags({
                'mlflow.source.type': 'JOB',
                'model_type': 'autoencoder',
                'step': 'registration',
                'model_package_arn': model_package_arn
            })
            
            return {
                'model_package_arn': model_package_arn,
                'model_package_group_name': model_package_group_name,
                'model_approval_status': model_approval_status
            }
            
        except Exception as e:
            print(f"❌ Error creating model package: {str(e)}")
            raise e

if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument('--training-job-name', type=str, required=True)
    parser.add_argument('--model-package-group-name', type=str, required=True)
    parser.add_argument('--model-approval-status', type=str, required=True)
    parser.add_argument('--evaluation-result', type=str, required=True)
    parser.add_argument('--output-s3-prefix', type=str, required=True)
    parser.add_argument('--tracking-server-arn', type=str, required=True)
    parser.add_argument('--experiment-name', type=str, required=True)
    parser.add_argument('--pipeline-run-id', type=str, default=None)
    
    args = parser.parse_args()
    
    # Parse evaluation result from JSON string
    evaluation_result = json.loads(args.evaluation_result)
    
    result = register_autoencoder(
        training_job_name=args.training_job_name,
        model_package_group_name=args.model_package_group_name,
        model_approval_status=args.model_approval_status,
        evaluation_result=evaluation_result,
        output_s3_prefix=args.output_s3_prefix,
        tracking_server_arn=args.tracking_server_arn,
        experiment_name=args.experiment_name,
        pipeline_run_id=args.pipeline_run_id
    )
    
    print(f"Registration result: {result}")


Writing pipeline_steps/register_autoencoder.py


In [41]:

from pipeline_steps.register_autoencoder import register_autoencoder

In [42]:
r_register = register_autoencoder(
    training_job_name=estimator.latest_training_job.name,
    model_package_group_name=model_package_group_name,
    model_approval_status=model_approval_status,
    evaluation_result=r_eval['evaluation_result'],
    output_s3_prefix=output_s3_url,
    tracking_server_arn=mlflow_arn,
    experiment_name=r_preprocess['experiment_name'],
)
r_register

Creating Model Package Group: from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22
✅ Model package created: arn:aws:sagemaker:us-west-2:224425919845:model-package/from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22/1
🏃 View run register-autoencoder-06-09-26-26 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56/runs/d35ec0bf13fd4144950207663dd744be
🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/56


{'model_package_arn': 'arn:aws:sagemaker:us-west-2:224425919845:model-package/from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22/1',
 'model_package_group_name': 'from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22',
 'model_approval_status': 'PendingManualApproval'}

In [43]:
# check that a new model version has been registered in the model package group
boto3.client('sagemaker').describe_model_package(ModelPackageName=r_register['model_package_arn'])

{'ModelPackageGroupName': 'from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22',
 'ModelPackageVersion': 1,
 'ModelPackageArn': 'arn:aws:sagemaker:us-west-2:224425919845:model-package/from-idea-to-prod-autoencoder-pipeline-model-06-08-50-22/1',
 'ModelPackageDescription': 'PyTorch autoencoder for anomaly detection. Training job: from-idea-to-prod-autoencoder-train-2025-08-06-09-12-00-357',
 'CreationTime': datetime.datetime(2025, 8, 6, 9, 26, 27, 431000, tzinfo=tzlocal()),
 'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12-gpu-py38',
    'ImageDigest': 'sha256:71b4ded5aac900d117824923fa93a6e626bed538fcae7282dcfdcbe7a752e3eb',
    'ModelDataUrl': 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod/autoencoder/output/from-idea-to-prod-autoencoder-train-2025-08-06-09-12-00-357/output/model.tar.gz',
    'Framework': 'PYTORCH',
    'FrameworkVersion': '1.12',
    'ModelDataETag': 'b38c7e78a1ed0a2a96c2e14a0137081

### Construct a pipeline

In [45]:

# preprocess data step
step_preprocess = step(
    preprocess_autoencoder, 
    instance_type=process_instance_type_param,
    name=f"{project}-preprocess",
    keep_alive_period_in_seconds=3600,
)(
    input_data_s3_path=input_s3_url_param,
    output_s3_prefix=output_s3_prefix,
    tracking_server_arn=tracking_server_arn_param,
    experiment_name=experiment_name,
    pipeline_run_name=ExecutionVariables.PIPELINE_EXECUTION_ID,
)

cache_config = CacheConfig(enable_caching=True)
cache_config.expire_after = "p30d"

# train step
step_train = TrainingStep(
    name=f"{project}-autoencoder-train",
    step_args=get_pytorch_autoencoder_estimator(
        session=PipelineSession(),
        instance_type=train_instance_type_param,
        output_s3_url=output_s3_url,
        base_job_name=f"{project}-autoencoder-train",
    ).fit(
        {
            "train": TrainingInput(
                step_preprocess['train_data'],
                content_type="text/csv",
            ),
            "validation": TrainingInput(
                step_preprocess['validation_data'],
                content_type="text/csv",
            ),
        }
    ),
    cache_config=cache_config,
)    

# evaluate step
step_evaluate = step(
    evaluate_autoencoder,
    instance_type=process_instance_type_param,
    name=f"{project}-evaluate",
    keep_alive_period_in_seconds=3600,
)(
    test_x_data_s3_path=step_preprocess['test_x_data'],
    test_y_data_s3_path=step_preprocess['test_y_data'],
    model_s3_path=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    output_s3_prefix=output_s3_prefix,
    tracking_server_arn=tracking_server_arn_param,
    experiment_name=step_preprocess['experiment_name'],
    pipeline_run_id=step_preprocess['pipeline_run_id'],
)

# register model step
step_register = step(
        register_autoencoder,
        name=f"{project}-register",
        keep_alive_period_in_seconds=3600,
    )(
        training_job_name=step_train.properties.TrainingJobName,
        model_package_group_name=model_package_group_name_param,
        model_approval_status=model_approval_status_param,
        evaluation_result=step_evaluate['evaluation_result'],
        output_s3_prefix=output_s3_url,
        tracking_server_arn=tracking_server_arn_param,
        experiment_name=step_preprocess['experiment_name'],
        pipeline_run_id=step_preprocess['pipeline_run_id'],
    )


# fail the pipeline execution step
step_fail = FailStep(
    name=f"{project}-fail",
    error_message=Join(on=" ", values=["Execution failed due to ROC AUC Score < ", test_score_threshold_param]),
)

# condition to check in the condition step (using ROC AUC for autoencoder)
condition_gte = ConditionGreaterThanOrEqualTo(
        left=step_evaluate['evaluation_result']['anomaly_detection_metrics']['roc_auc']['value'],  
        right=test_score_threshold_param,
)

# conditional register step
step_conditional_register = ConditionStep(
    name=f"{project}-check-metrics",
    conditions=[condition_gte],
    if_steps=[step_register],
    else_steps=[step_fail],
)

# Create a pipeline object
pipeline = Pipeline(
    name=f"{pipeline_name}",
    parameters=[
        input_s3_url_param,
        process_instance_type_param,
        train_instance_type_param,
        model_approval_status_param,
        test_score_threshold_param,
        model_package_group_name_param,
        tracking_server_arn_param,
        encoding_dim_param,
        dropout_rate_param,
        learning_rate_param,
        batch_size_param,
        num_epochs_param,
        weight_decay_param,
    ],
    steps=[step_conditional_register],
    pipeline_definition_config=PipelineDefinitionConfig(use_custom_job_prefix=True)
)

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.


In [48]:
# Upsert operation serialize the function code, arguments, and other artefacts to S3 where it can be accessed during pipeline's runtime
pipeline.upsert(role_arn=sm_role)

2025-08-06 09:36:50,279 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-register/2025-08-06-09-36-50-279/function
2025-08-06 09:36:50,367 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-register/2025-08-06-09-36-50-279/arguments
2025-08-06 09:36:50,564 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmposvtpddf/requirements.txt'
2025-08-06 09:36:50,589 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-register/2025-08-06-09-36-50-279/pre_exec_script_and_dependencies'
2025-08-06 09:36:50,796 sagemaker.remote_function INFO   

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns


2025-08-06 09:36:53,644 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-evaluate/2025-08-06-09-36-50-279/function
2025-08-06 09:36:53,705 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-evaluate/2025-08-06-09-36-50-279/arguments
2025-08-06 09:36:53,763 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmp_r3nns0x/requirements.txt'
2025-08-06 09:36:53,791 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-evaluate/2025-08-06-09-36-50-279/pre_exec_script_and_dependencies'


sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns


2025-08-06 09:36:55,491 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-preprocess/2025-08-06-09-36-50-279/function
2025-08-06 09:36:55,542 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-preprocess/2025-08-06-09-36-50-279/arguments
2025-08-06 09:36:55,597 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmp308ao68p/requirements.txt'
2025-08-06 09:36:55,623 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-west-2-224425919845/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/from-idea-to-prod-preprocess/2025-08-06-09-36-50-279/pre_exec_script_and_dependencies'
INFO:sagemaker.image_uris:image_uri is not presente

{'PipelineArn': 'arn:aws:sagemaker:us-west-2:224425919845:pipeline/from-idea-to-prod-autoencoder-pipeline-06-08-50-22',
 'ResponseMetadata': {'RequestId': '41dcb184-1e66-42dc-94c6-6b9ed053aaca',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '41dcb184-1e66-42dc-94c6-6b9ed053aaca',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '118',
   'date': 'Wed, 06 Aug 2025 09:36:56 GMT'},
  'RetryAttempts': 0}}

# Show the pipeline link

In [50]:
pipeline_execution = pipeline.start()
pipeline_execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-west-2:224425919845:pipeline/from-idea-to-prod-autoencoder-pipeline-06-08-50-22',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-west-2:224425919845:pipeline/from-idea-to-prod-autoencoder-pipeline-06-08-50-22/execution/bb57oy07ns1e',
 'PipelineExecutionDisplayName': 'execution-1754473052584',
 'PipelineExecutionStatus': 'Executing',
 'CreationTime': datetime.datetime(2025, 8, 6, 9, 37, 32, 530000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2025, 8, 6, 9, 37, 32, 530000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-west-2:224425919845:user-profile/d-rtctvdud9qsp/studio-user-7788a530',
  'UserProfileName': 'studio-user-7788a530',
  'DomainId': 'd-rtctvdud9qsp',
  'IamIdentity': {'Arn': 'arn:aws:sts::224425919845:assumed-role/tm-ws-SageMakerExecutionRole-Ou4AK8i38tA1/SageMaker',
   'PrincipalId': 'AROATIQGTYVSVCQU4RTXO:SageMaker'}},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-west-2:224425919

In [51]:
pipeline_execution.wait() 
pipeline_execution.list_steps()

[{'StepName': 'from-idea-to-prod-preprocess',
  'StartTime': datetime.datetime(2025, 8, 6, 9, 37, 33, 986000, tzinfo=tzlocal()),
  'StepStatus': 'Executing',
  'Metadata': {'TrainingJob': {'Arn': 'arn:aws:sagemaker:us-west-2:224425919845:training-job/preprocess-autoencoder-bb57oy07ns1e-IzILMJxHMv'}},
  'AttemptCount': 1}]

In [52]:
# Show the pipeline execution link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/pipelines/{}/executions/{}/graph">the pipeline execution</a> in the Studio UI</b>'.format(
            domain_id, region, pipeline_name, pipeline_execution.describe()['PipelineExecutionArn'].split('/')[-1]))
)

print("✅ PyTorch Autoencoder Pipeline Created and Executed Successfully!")
print("📊 Key Changes Made:")
print("  - Replaced XGBoost with PyTorch autoencoder")
print("  - Updated hyperparameters for autoencoder training")
print("  - Modified evaluation metrics for anomaly detection")
print("  - Adjusted pipeline steps for unsupervised learning")
print("  - Updated bucket prefix to 'autoencoder'")
print("  - Changed threshold evaluation to use ROC AUC")



✅ PyTorch Autoencoder Pipeline Created and Executed Successfully!
📊 Key Changes Made:
  - Replaced XGBoost with PyTorch autoencoder
  - Updated hyperparameters for autoencoder training
  - Modified evaluation metrics for anomaly detection
  - Adjusted pipeline steps for unsupervised learning
  - Updated bucket prefix to 'autoencoder'
  - Changed threshold evaluation to use ROC AUC
