# Notebook 3. Track Model Quality with SageMaker MLOps

## Learning Objectives
- Automate Machine Learning Operations (MLOps) with SageMaker Pipelines.

## Environment Notes:
This notebook was created and tested on an `ml.t3.medium (2 vCPU + 4 GiB)` notebook instance running the `Python 3.0 (Data Science)` kernel in SageMaker Studio.

-----
## 1. Background

In Notebook 2 of this series, we demonstrated how SageMaker Processing, Training, and Hyperparameter Optimization (HPO) jobs can make the development of new machine learning (ML) models faster and more cost-efficient. In this notebook, we'll look at some best practices for deploying models to production and managing them once they are there. Many of these practices fall into the category of "Machine Learning Operations", or "MLOps", and are increasingly a part of many [regulatory and quality requirements](https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf).

MLOps plays a key role in the **Model Deployment** and **Model Monitoring/Maintenance** phases of the Machine Learning Lifecycle. For more information, please refer to the [Machine Learning Best Practices in Healthcare and Life Sciences Whitepaper](https://d1.awsstatic.com/whitepapers/ML-best-practices-health-science.pdf?did=wp_card&trk=wp_card).

![Machine Learning Life Cycle - Part 1](img/MLLC2.png "ML Life Cycle - Part 1")

[Amazon SageMaker Model Building Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) is a tool for building machine learning pipelines that take advantage of direct SageMaker integration. Because of this integration, you can create a pipeline and set up SageMaker Projects for orchestration using a tool that handles much of the step creation and management for you. You can manage these pipelines in the SageMaker Studio UI and automatically capture data and model lineage.

One of the challenges with deploying ML solutions is that their effectiveness can change over time. For example, the distribution of your data may shift from year to year, or the boundaries of a classification category may change. In these cases, you want to be able to quickly retrain and deploy new versions of your model, either on a schedule or in response to some event.
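As a rough illustration of what "the distribution of your data shifts" can mean in practice, the sketch below computes a population stability index (PSI) between a reference sample and a newer one. PSI is only one common heuristic and is not part of this notebook's pipeline; the function name, the synthetic data, and the `RETRAIN_THRESHOLD` cutoff are all assumptions made for the example.

```python
import math
from collections import Counter


def population_stability_index(reference, current, bins=10):
    """Compare two numeric samples; larger values suggest a bigger shift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        # Assign each value to a bin defined on the reference range,
        # clamping values that fall outside that range.
        counts = Counter(
            min(max(int((v - lo) / width), 0), bins - 1) for v in values
        )
        # A small floor avoids log(0) for empty bins.
        return [max(counts.get(i, 0) / len(values), 1e-6) for i in range(bins)]

    ref, cur = bin_fractions(reference), bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic data: the "shifted" sample covers only the upper half of the range.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

RETRAIN_THRESHOLD = 0.2  # hypothetical cutoff, tune for your own data

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
needs_retraining = psi_shifted > RETRAIN_THRESHOLD
print(f"PSI (same data): {psi_same:.3f}, PSI (shifted): {psi_shifted:.3f}")
```

A check like this could serve as the "event" that triggers a new pipeline execution, although a scheduled retrain is often simpler to operate.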

Amazon SageMaker Pipelines allows us to define reproducible ML processes that we can trigger at will. In this example, we'll combine the processing, training, and registration components introduced earlier in this series into a pipeline and demonstrate how to execute it.

---
## 2. Preparation

Let's start by specifying:

- The Python libraries that we'll use throughout the analysis
- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the `get_execution_role` call with the appropriate full IAM role ARN string(s).
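If you do hard-code a role ARN instead of looking it up, a quick format check can catch copy-and-paste mistakes before a job fails. The following is a minimal sketch: the regular expression covers the common IAM role ARN shape but the partition list is not exhaustive, and the example ARN and account ID are made up.

```python
import re

# Assumed pattern: arn:<partition>:iam::<12-digit account>:role/<optional path/><name>
ROLE_ARN_PATTERN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$"
)


def is_role_arn(value):
    """Return True if `value` looks like a full IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Hypothetical ARN, used only to exercise the check
example_role_arn = "arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole"
print(is_role_arn(example_role_arn))
print(is_role_arn("ExampleSageMakerRole"))  # a bare role name is not a full ARN
```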

### 2.1. Import Python Libraries

In [None]:
%pip install --disable-pip-version-check -q -U 'boto3==1.35.16' 'sagemaker==2.231.0'

In [None]:
import boto3
import os
import sagemaker
from time import strftime

### 2.2. Create Some Necessary Clients

In [None]:
boto_session = boto3.session.Session()
region = boto_session.region_name
sagemaker_session = sagemaker.session.Session(boto_session)
sagemaker_execution_role = sagemaker.session.get_execution_role(sagemaker_session)
sagemaker_boto_client = boto_session.client("sagemaker")
s3_boto_client = boto_session.client("s3")
account_id = boto_session.client("sts").get_caller_identity().get("Account")
print(f"Assumed SageMaker role is {sagemaker_execution_role}")

### 2.3. Specify S3 Bucket and Prefix

In [None]:
S3_BUCKET = sagemaker_session.default_bucket()
S3_PREFIX = "brca-her2-classifier"
S3_PATH = sagemaker.s3.s3_path_join("s3://", S3_BUCKET, S3_PREFIX)
print(f"S3 path is {S3_PATH}")

### 2.4. Define Local Working Directories

In [None]:
WORKING_DIR = os.getcwd()
DATA_DIR = os.path.join(WORKING_DIR, "data")
print(f"Working directory is {WORKING_DIR}")
print(f"Data directory is {DATA_DIR}")

### 2.7. Define MLflow parameters

Verify that you have a running MLflow tracking server.

In [None]:
running_mlflow_servers = [
    summary
    for summary in sagemaker_boto_client.list_mlflow_tracking_servers().get(
        "TrackingServerSummaries"
    )
    if summary.get("TrackingServerStatus") == "Created"
]
tracking_server_arn = [
    server["TrackingServerArn"] for server in running_mlflow_servers
][-1]
running_mlflow_servers
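The cell above raises a bare `IndexError` when no server matches, and it only accepts the `"Created"` status. The sketch below shows one way to make the selection more explicit. It works on plain dictionaries shaped like the entries in `TrackingServerSummaries`; the helper name, the extra statuses (`"Updated"`, `"Started"`), and the example ARNs are assumptions rather than something this notebook defines.

```python
# Statuses treated as "running". The notebook cell checks only "Created";
# "Updated" and "Started" are included here as an assumption about the API.
RUNNING_STATUSES = ("Created", "Updated", "Started")


def pick_tracking_server_arn(summaries, statuses=RUNNING_STATUSES):
    """Return the ARN of the last summary whose status is in `statuses`."""
    running = [s for s in summaries if s.get("TrackingServerStatus") in statuses]
    if not running:
        raise RuntimeError(
            "No running MLflow tracking server found; create or start one first."
        )
    return running[-1]["TrackingServerArn"]


# Made-up summaries standing in for the list_mlflow_tracking_servers() response
example_summaries = [
    {
        "TrackingServerArn": "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/example-a",
        "TrackingServerStatus": "Stopped",
    },
    {
        "TrackingServerArn": "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/example-b",
        "TrackingServerStatus": "Created",
    },
]

selected_arn = pick_tracking_server_arn(example_summaries)
print(selected_arn)
```

In the notebook, the same helper could be called with the list returned under `"TrackingServerSummaries"`.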

### 2.8. Define pipeline inputs

In [None]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

hiseq_uri = ParameterString(
    name="HiSeqURI",
    default_value="https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2_PANCAN.gz",
)
brca_clinical_matrix_uri = ParameterString(
    name="BRCAClinicalURI",
    default_value="https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/BRCA_clinicalMatrix",
)
train_test_split_ratio = ParameterFloat(name="TrainTestSplit", default_value=0.2)
gene_count = ParameterInteger(name="GeneCount", default_value=2000)

s3_bucket = ParameterString(
    name="S3Bucket", default_value=sagemaker_session.default_bucket()
)
s3_prefix = ParameterString(name="S3Prefix", default_value="brca-classifier-pipeline")

# What instance type to use for processing.
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)

# What instance type to use for training.
training_instance_type = ParameterString(
    name="TrainingInstanceType", default_value="ml.m5.xlarge"
)

### 2.9. Define additional pipeline parameters

In [None]:
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig

pipeline_session = PipelineSession()
cache_config = CacheConfig(enable_caching=True, expire_after="PT1H")
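The `expire_after="PT1H"` value is an ISO 8601 duration, here meaning one hour: a cached step result older than that is not reused. To make the notation concrete, the sketch below parses a small subset of such durations (days, hours, minutes, seconds) into a `timedelta`. The parser is illustrative only and is not what SageMaker uses internally; the exact set of formats SageMaker accepts may be broader.

```python
import re
from datetime import timedelta

# Subset of ISO 8601 durations: P[nD][T[nH][nM][nS]]
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$",
    re.IGNORECASE,
)


def parse_duration(text):
    """Convert a duration such as 'PT1H' or 'P30D' into a timedelta."""
    match = _DURATION.match(text)
    if not match or text.upper() in ("P", "PT"):
        raise ValueError(f"Unsupported duration: {text!r}")
    # Keep only the components that were present in the string.
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


print(parse_duration("PT1H"))     # the value used in cache_config
print(parse_duration("P30D"))
print(parse_duration("P1DT12H"))
```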

---
## 3. Define Data Processing Step

In [None]:
from sagemaker.processing import FrameworkProcessor, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.2-1",
    instance_count=1,
    instance_type=processing_instance_type,
    role=sagemaker_execution_role,
    sagemaker_session=pipeline_session,  ########## Pipelines-specific
)

processing_step_args = sklearn_processor.run(
    job_name=f"data-processing-job-{strftime('%Y-%m-%d-%H-%M-%S')}",
    code="scripts/processing/pipeline-processing.py",
    dependencies=["scripts/processing/requirements.txt"],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            destination=Join(
                on="/",
                values=[
                    "s3:/",
                    s3_bucket,
                    s3_prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "train",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/output/val",
            destination=Join(
                on="/",
                values=[
                    "s3:/",
                    s3_bucket,
                    s3_prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "validation",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            destination=Join(
                on="/",
                values=[
                    "s3:/",
                    s3_bucket,
                    s3_prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "test",
                ],
            ),
        ),
    ],
    arguments=[
        "--brca_clinical_matrix_url",
        brca_clinical_matrix_uri.to_string(),
        "--hiseq_url",
        hiseq_uri.to_string(),
        "--train_test_split_ratio",
        train_test_split_ratio.to_string(),
        "--gene_count",
        gene_count.to_string(),
        "--create_test_data",
    ],
)
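The `Join` calls above start with `"s3:/"` rather than `"s3://"` because the separator `"/"` supplies the second slash when the values are concatenated at execution time. The plain-Python sketch below mimics that concatenation so you can see the resulting URI. The `resolve_join` helper, the bucket name, and the execution ID are stand-ins; in a real run, the pipeline parameters and `ExecutionVariables.PIPELINE_EXECUTION_ID` are resolved by SageMaker.

```python
def resolve_join(on, values):
    """Mimic how a Join is flattened once every value is a concrete string."""
    return on.join(str(v) for v in values)


train_destination = resolve_join(
    "/",
    [
        "s3:/",                      # "/" separator adds the second slash
        "example-bucket",            # stand-in for the S3Bucket parameter
        "brca-classifier-pipeline",  # default of the S3Prefix parameter
        "abc123exec",                # stand-in for the pipeline execution ID
        "train",
    ],
)
print(train_destination)
```

Because the execution ID is part of the path, each pipeline run writes its train, validation, and test splits to a separate location.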

In [None]:
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="ProcessBRCAData", step_args=processing_step_args, cache_config=cache_config
)

---
## 4. Define Model Training Step

In [None]:
# define the data type and paths to the training and validation datasets
content_type = "text/csv"

s3_input_train = sagemaker.inputs.TrainingInput(
    step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
    content_type=content_type,
)

s3_input_validation = sagemaker.inputs.TrainingInput(
    step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
    content_type=content_type,
)

model_output_path = f"s3://{S3_BUCKET}/{S3_PREFIX}/models/"

In [None]:
from sagemaker.xgboost.estimator import XGBoost

xgb_job_name = f"XGB-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

framework_version = "1.7-1"
py_version = "py3"

hyper_params_dict = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "eval_metric": "error",
    "scale_pos_weight": 9.0,
    "max_depth": 3,
    "min_child_weight": 5,
    "subsample": 0.9,
    "verbosity": 1,
    "tree_method": "auto",
}

xgb_estimator = XGBoost(
    enable_sagemaker_metrics=True,
    entry_point="xgb_train.py",
    framework_version=framework_version,
    hyperparameters=hyper_params_dict,
    instance_count=1,
    instance_type=training_instance_type,
    output_path=model_output_path,
    py_version=py_version,
    role=sagemaker_execution_role,
    sagemaker_session=pipeline_session,  ########## Pipelines-specific
    source_dir="scripts/xgb_train",
    environment={"MLFLOW_TRACKING_ARN": tracking_server_arn},
)

training_step_args = xgb_estimator.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    job_name=xgb_job_name,
)
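The `scale_pos_weight` hyperparameter up-weights the positive class in imbalanced binary problems. A common starting point is the ratio of negative to positive training examples. The sketch below computes that ratio for a made-up label list with a 9:1 imbalance; whether the `9.0` used above was derived this way is an assumption, and the helper name is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("No positive examples; the ratio is undefined.")
    return negatives / positives


# Made-up labels: 10 positive and 90 negative samples
example_labels = [1] * 10 + [0] * 90
weight = suggested_scale_pos_weight(example_labels)
print(weight)
```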

In [None]:
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(
    name="TrainXGBoost", step_args=training_step_args, cache_config=cache_config
)

---
## 5. Define Model Registration Step

In [None]:
from sagemaker.model import Model

model = Model(
    image_uri=xgb_estimator.training_image_uri(),
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=sagemaker_execution_role,
)

register_model_step_args = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="brca",
)

In [None]:
from sagemaker.workflow.model_step import ModelStep

step_model_create = ModelStep(
    name="BRCAModelCreationStep", step_args=register_model_step_args
)

---
## 6. Create Pipeline

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "BRCAPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        hiseq_uri,
        brca_clinical_matrix_uri,
        train_test_split_ratio,
        gene_count,
        s3_bucket,
        s3_prefix,
        processing_instance_type,
        training_instance_type,
    ],
    steps=[step_process, step_train, step_model_create],
)

In [None]:
pipeline.upsert(role_arn=sagemaker_execution_role)

In [None]:
pipeline.start()
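Calling `pipeline.start()` with no arguments uses the default value of every pipeline parameter. To change one for a single run, you can pass a dictionary keyed by parameter name. The sketch below builds such a dictionary and rejects names that the pipeline does not define. The `build_overrides` helper and the override values are illustrative assumptions; the defaults are copied from the parameter definitions earlier in this notebook.

```python
# Defaults copied from the ParameterFloat/ParameterInteger/ParameterString cells
PIPELINE_DEFAULTS = {
    "TrainTestSplit": 0.2,
    "GeneCount": 2000,
    "ProcessingInstanceType": "ml.m5.xlarge",
    "TrainingInstanceType": "ml.m5.xlarge",
}


def build_overrides(defaults, **overrides):
    """Validate override names and drop values equal to the default."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown pipeline parameter(s): {sorted(unknown)}")
    return {k: v for k, v in overrides.items() if defaults[k] != v}


overrides = build_overrides(PIPELINE_DEFAULTS, GeneCount=5000, TrainTestSplit=0.2)
print(overrides)

# In the notebook, with the `pipeline` object defined above:
# execution = pipeline.start(parameters=overrides)
```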

In [None]:
latest_execution = pipeline.list_executions()["PipelineExecutionSummaries"][0]
latest_execution

In [None]:
pipeline.build_parameters_from_execution(latest_execution["PipelineExecutionArn"])
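Once an execution is running, it is often useful to see which steps have finished and why any of them failed. The sketch below summarizes a list of step records. It assumes each record is a dictionary with `StepName`, `StepStatus`, and an optional `FailureReason`, which is the shape returned by `list_steps()` on an execution object to the best of my knowledge; the example records and the failure message are made up.

```python
from collections import Counter


def summarize_steps(steps):
    """Return (status counts, {failed step name: failure reason})."""
    status_counts = Counter(step["StepStatus"] for step in steps)
    failures = {
        step["StepName"]: step.get("FailureReason", "no reason reported")
        for step in steps
        if step["StepStatus"] == "Failed"
    }
    return dict(status_counts), failures


# Made-up records; in the notebook these would come from, for example,
# `execution = pipeline.start()` followed by `execution.list_steps()`.
example_steps = [
    {"StepName": "ProcessBRCAData", "StepStatus": "Succeeded"},
    {
        "StepName": "TrainXGBoost",
        "StepStatus": "Failed",
        "FailureReason": "example failure message",
    },
]

counts, failures = summarize_steps(example_steps)
print(counts)
print(failures)
```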