# Notebook 2: Use SageMaker Training Jobs to Accelerate Model Development

## Learning Objectives
- Use SageMaker processing and training jobs to optimize cost and performance
- Compare model performance using SageMaker MLflow
- Deploy a model as an endpoint for real-time inference

## Environment Notes:
This notebook was created and tested on an `ml.t3.medium (2 vCPU + 4 GiB)` notebook instance running the `Python 3.0 (Data Science)` kernel in SageMaker Studio.

---

## 1. Background
In notebook 1 of this series, we used RNAseq data to predict HER2 status with the compute resources on the notebook server. However, using notebook server resources to process large amounts of data or train complex models is generally not a good idea. You can scale up your notebook server, but you then pay for that extra capacity during non-compute-intensive tasks (i.e., most of your time), when it sits idle. A better approach is to run your notebook on a small server and submit compute-intensive tasks as independent jobs. SageMaker provides managed services for running data processing, model training, and hyperparameter tuning jobs. In this notebook, we'll demonstrate how to use these services to optimize the performance and cost of our tasks.

Specifically, we'll demonstrate two best practices: **Jobs** and **Experiments**.

These best practices play a key role in the **Prepare Data** and **Model Development** phases of the Machine Learning Lifecycle. For more information, please refer to the [Machine Learning Best Practices in Healthcare and Life Sciences Whitepaper](https://d1.awsstatic.com/whitepapers/ML-best-practices-health-science.pdf?did=wp_card&trk=wp_card).

![Machine Learning Life Cycle - Part 1](img/MLLC1.png "ML Life Cycle - Part 1")

---

### 1.1. SageMaker Jobs

[SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) processing, training, and hyperparameter optimization (HPO) jobs allow data scientists to submit compute-heavy processes to external services. This keeps costs optimized and ensures that these tasks run in reproducible environments. It also improves data scientist productivity by allowing these jobs to run in "the background" and provides resiliency if something happens to your notebook environment.

![SageMaker jobs](img/jobs.png "Jobs")

### 1.2. MLflow

![MLflow workflow](img/mlflow-diagram.png "MLflow")

[Amazon SageMaker with MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) is a capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments. 

Machine learning is an iterative process that requires experimenting with various combinations of data, algorithms, and parameters, while observing their impact on model accuracy. The iterative nature of ML experimentation results in numerous model training runs and versions, making it challenging to track the best performing models and their configurations. The complexity of managing and comparing iterative training runs increases with generative artificial intelligence (generative AI), where experimentation involves not only fine-tuning models but also exploring creative and diverse outputs. Researchers must adjust hyperparameters, select suitable model architectures, and curate diverse datasets to optimize both the quality and creativity of the generated content. Evaluating generative AI models requires both quantitative and qualitative metrics, adding another layer of complexity to the experimentation process.

Use MLflow with Amazon SageMaker to track, organize, view, analyze, and compare iterative ML experimentation to gain comparative insights and register and deploy your best performing models.

---
## 2. Preparation

Let's start by specifying:

- The Python libraries that we'll use throughout the analysis
- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role` call in section 2.2 with the appropriate full IAM role ARN string(s).

### 2.1. Import Python Libraries

In [None]:
%pip install --disable-pip-version-check -q -U 'boto3==1.35.16' 'sagemaker==2.231.0' 'mlflow==2.13.2' 'sagemaker-mlflow==0.1.0'

In [None]:
import boto3
import os
import sagemaker
from sagemaker.processing import FrameworkProcessor, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tensorflow import TensorFlow
from time import strftime

### 2.2. Create Some Necessary Clients

In [None]:
boto_session = boto3.session.Session()
region = boto_session.region_name
sagemaker_session = sagemaker.session.Session(boto_session)
sagemaker_execution_role = sagemaker.session.get_execution_role(sagemaker_session)
sagemaker_boto_client = boto_session.client("sagemaker")
s3_boto_client = boto_session.client("s3")
account_id = boto_session.client("sts").get_caller_identity().get("Account")
print(f"Assumed SageMaker role is {sagemaker_execution_role}")

### 2.3. Specify S3 Bucket and Prefix

In [None]:
S3_BUCKET = sagemaker_session.default_bucket()
S3_PREFIX = "brca-her2-classifier"
S3_PATH = sagemaker.s3.s3_path_join("s3://", S3_BUCKET, S3_PREFIX)
print(f"S3 path is {S3_PATH}")

### 2.4. Define Local Working Directories

In [None]:
WORKING_DIR = os.getcwd()
DATA_DIR = os.path.join(WORKING_DIR, "data")
print(f"Working directory is {WORKING_DIR}")
print(f"Data directory is {DATA_DIR}")

### 2.5. Define MLflow parameters

If you haven't yet, create an [MLflow tracking server](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-studio.html). Then, run the following cell to verify that your MLflow server is ready for use.

In [None]:
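# Keep only the tracking servers that have finished creating
running_mlflow_servers = [
    summary
    for summary in sagemaker_boto_client.list_mlflow_tracking_servers().get(
        "TrackingServerSummaries", []
    )
    if summary.get("TrackingServerStatus") == "Created"
]
if not running_mlflow_servers:
    raise RuntimeError(
        "No MLflow tracking server with status 'Created' was found; create one first."
    )
# Use the last server returned by the API
tracking_server_arn = running_mlflow_servers[-1]["TrackingServerArn"]
running_mlflow_servers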

---
## 3. Data Preparation with Amazon SageMaker Processing

Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker. Processing jobs accept data from Amazon S3 as input and store data into Amazon S3 as output.

![processing](https://sagemaker.readthedocs.io/en/stable/_images/amazon_sagemaker_processing_image1.png)

Here, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances.  SageMaker Processing includes off-the-shelf support for Scikit-learn, as well as a Bring Your Own Container option, so it can be used with many different data transformation technologies and tasks.    

To use SageMaker Processing, simply supply a Python data preprocessing script as shown below.  For this example, we're using a SageMaker prebuilt Scikit-learn container, which includes many common functions for processing data.  There are few limitations on what kinds of code and operations you can run, and only a minimal contract:  input and output data must be placed in specified directories.  If this is done, SageMaker Processing automatically loads the input data from S3 and uploads transformed data back to S3 when the job is complete.

For this example, we'll download the raw data directly from Xenahubs as part of the processing script, so we do not need to specify an input bucket.
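
The output-directory contract described above can be sketched as a small script. This is not the contents of `scripts/processing/processing.py`, which is not reproduced in this notebook. The function names `parse_args` and `split_and_write`, the `--output_root` flag, the shuffle-based split, and the headerless CSV layout are all assumptions for illustration. The other flag names and the `/opt/ml/processing/output/...` directories mirror the ones used in the processing job in section 3.1.

```python
import argparse
import csv
import os
import random


def parse_args(argv=None):
    """Parse the flags that the notebook passes through `arguments=[...]`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--brca_clinical_matrix_url", type=str, default=None)
    parser.add_argument("--hiseq_url", type=str, default=None)
    parser.add_argument("--train_test_split_ratio", type=float, default=0.2)
    parser.add_argument("--gene_count", type=int, default=20000)
    parser.add_argument("--create_test_data", action="store_true")
    # Hypothetical flag: lets this sketch run outside a processing container.
    parser.add_argument("--output_root", default="/opt/ml/processing/output")
    return parser.parse_args(argv)


def split_and_write(rows, output_root, holdout_ratio, seed=0):
    """Shuffle rows, hold out a fraction for validation, and write headerless CSVs."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * holdout_ratio)
    splits = {"val": rows[:n_val], "train": rows[n_val:]}
    for name, subset in splits.items():
        # Each split goes to its own directory, e.g. <output_root>/train/train.csv
        out_dir = os.path.join(output_root, name)
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, f"{name}.csv"), "w", newline="") as handle:
            csv.writer(handle).writerows(subset)
    return {name: len(subset) for name, subset in splits.items()}
```

Inside a processing job, files written under a `ProcessingOutput` source directory are uploaded to the matching S3 destination when the job completes, so the script itself never needs to call S3.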

### 3.1. Submit SageMaker Processing Job

This will take about 5 minutes to complete. Notice that this code block references `scripts/processing/processing.py`. The processing job will run this script on a different compute instance, in this case an `ml.m5.xlarge`. This allows us to use a small instance for our notebook server while still taking advantage of a more powerful instance for the processing.

In [None]:
HISEQ_URL = "https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2_PANCAN.gz"
BRCA_CLINICAL_MATRIX_URL = (
    "https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/BRCA_clinicalMatrix"
)
processing_run_name = f"data-processing-job-{strftime('%Y-%m-%d-%H-%M-%S')}"
train_test_split_ratio = 0.2
gene_count = 20000

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.2-1",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    role=sagemaker_execution_role,
    sagemaker_session=sagemaker_session,
)

sklearn_processor.run(
    job_name=processing_run_name,
    code="scripts/processing/processing.py",
    dependencies=["scripts/processing/requirements.txt"],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/train/",
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/output/val",
            destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/val/",
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/test/",
        ),
    ],
    arguments=[
        "--brca_clinical_matrix_url",
        BRCA_CLINICAL_MATRIX_URL,
        "--hiseq_url",
        HISEQ_URL,
        "--train_test_split_ratio",
        str(train_test_split_ratio),
        "--gene_count",
        str(gene_count),
        "--create_test_data",
    ],
    wait=True,
)

### 3.2. Download Processed Data from S3

In [None]:
sagemaker_session.download_data(
    f"{DATA_DIR}/output/train",
    bucket=S3_BUCKET,
    key_prefix=f"{S3_PREFIX}/data/train/train.csv",
)
sagemaker_session.download_data(
    f"{DATA_DIR}/output/val",
    bucket=S3_BUCKET,
    key_prefix=f"{S3_PREFIX}/data/val/val.csv",
)
sagemaker_session.download_data(
    f"{DATA_DIR}/output/test",
    bucket=S3_BUCKET,
    key_prefix=f"{S3_PREFIX}/data/test/test.csv",
)

---
## 4. Model Training

Now that our training data is set up, we can train some models. To highlight the benefits of experiment tracking, we're going to train models using three different frameworks:
- The random forest model from scikit-learn
- A multi-layer perceptron (MLP) neural network in Keras
- The open-source XGBoost algorithm

Since we're using SageMaker jobs to run our training, we don't need to install any additional libraries or spin up expensive compute resources on our notebook server. The jobs use their own dependencies and we're only charged for the time they run.

First, let's define some variables that all three training jobs will need.

In [10]:
# define the data type and paths to the training and validation datasets
content_type = "text/csv"

s3_input_train = sagemaker.inputs.TrainingInput(
    f"s3://{S3_BUCKET}/{S3_PREFIX}/data/train/train.csv", content_type=content_type
)

s3_input_validation = sagemaker.inputs.TrainingInput(
    f"s3://{S3_BUCKET}/{S3_PREFIX}/data/val/val.csv", content_type=content_type
)

model_output_path = f"s3://{S3_BUCKET}/{S3_PREFIX}/models/"

### 4.1. Train Model Using a SKLearn Random Forest Algorithm

Here again we're passing a script (`scripts/rf_train/rf_train.py`) to run during the training job. Notice that we've also included a `requirements.txt` file in the training script directory to install additional dependencies in the training container. This is a great way to install an extra package or two without creating your own container image from scratch!

Setting `wait=False` allows us to continue running the notebook while the training job runs in "the background" (on a different machine).

In [None]:
rf_job_name = f"RF-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

n_estimators = 100
min_samples_leaf = 3

rf_estimator = SKLearn(
    base_job_name=rf_job_name,
    enable_sagemaker_metrics=True,
    entry_point="rf_train.py",
    framework_version="1.2-1",
    hyperparameters={
        "n-estimators": n_estimators,
        "min-samples-leaf": min_samples_leaf,
    },
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path=model_output_path,
    role=sagemaker_execution_role,
    sagemaker_session=sagemaker_session,
    source_dir="scripts/rf_train",
    environment={"MLFLOW_TRACKING_ARN": tracking_server_arn},
)
rf_estimator.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    job_name=rf_job_name,
    wait=False,
)
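
For orientation, the hand-off between the estimator and the training script can be sketched as follows. In SageMaker script mode, each entry in `hyperparameters` reaches the script as a command-line flag (for example `--n-estimators 100`), each input channel is exposed through an `SM_CHANNEL_<NAME>` environment variable, and the model directory through `SM_MODEL_DIR`. The snippet is an illustrative assumption about how a script such as `rf_train.py` might read these values; it is not the script's actual contents, and the helper name `parse_training_args` is hypothetical.

```python
import argparse
import os


def parse_training_args(argv=None, environ=None):
    """Collect hyperparameters (CLI flags) and paths (environment variables)."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as flags named after the keys of `hyperparameters`.
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Input channels and the model directory are provided as environment variables.
    parser.add_argument("--train", default=environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--model-dir", default=environ.get("SM_MODEL_DIR"))
    # parse_known_args tolerates any extra flags the container may add.
    args, _ = parser.parse_known_args(argv)
    # Set through the estimator's `environment` argument.
    args.mlflow_tracking_arn = environ.get("MLFLOW_TRACKING_ARN")
    return args
```

Passing `environ` explicitly keeps the function easy to exercise locally with a plain dictionary before submitting a job.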

---

### 4.2. Train Model using a Keras MLP

In [None]:
tf_job_name = f"TF-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

tf_estimator = TensorFlow(
    enable_sagemaker_metrics=True,
    entry_point="tf_train.py",
    framework_version="2.16",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "Validation Accuracy: ([0-9.]+)$"},
        {"Name": "validation:precision", "Regex": "Validation Precision: ([0-9.]+)$"},
        {"Name": "validation:f1", "Regex": "Validation F1 Score: ([0-9.]+)$"},
    ],
    output_path=model_output_path,
    py_version="py310",
    role=sagemaker_execution_role,
    sagemaker_session=sagemaker_session,
    source_dir="scripts/tf_train",
    environment={"MLFLOW_TRACKING_ARN": tracking_server_arn},
)

tf_estimator.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    job_name=tf_job_name,
    wait=False,
)
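
The `metric_definitions` above tell SageMaker which log lines to turn into metrics: each regex is applied to the job's log output and the captured group becomes the metric value. A quick local check of the patterns can catch formatting mismatches before a job is submitted. The sketch below is only an approximation of that matching; `extract_metrics` is a hypothetical helper and the sample log lines are invented, since the real output of `tf_train.py` is not shown here.

```python
import re

metric_definitions = [
    {"Name": "validation:accuracy", "Regex": "Validation Accuracy: ([0-9.]+)$"},
    {"Name": "validation:precision", "Regex": "Validation Precision: ([0-9.]+)$"},
    {"Name": "validation:f1", "Regex": "Validation F1 Score: ([0-9.]+)$"},
]


def extract_metrics(log_lines, definitions):
    """Return {metric name: last captured value} for lines matching each regex."""
    found = {}
    for line in log_lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log lines; the real training output may differ.
sample_log = [
    "Epoch 10/10",
    "Validation Accuracy: 0.91",
    "Validation Precision: 0.84",
    "Validation F1 Score: 0.80 ",  # trailing space, so the `$` anchor blocks a match
]
print(extract_metrics(sample_log, metric_definitions))
```

In this example the F1 line is not captured because of the trailing space, which illustrates how strict the `$` anchor is about the exact print format.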

### 4.3. Train Model Using the XGBoost Algorithm

Compare the XGBoost training script we're about to run (`scripts/xgb_train/xgb_train.py`) with the training function we used in Notebook 1. You'll notice that the `xgb.train` call is the same in both!

Since we're setting `wait=True`, our Jupyter session will wait until this training job is finished before moving on.

Submit the training job:

In [None]:
xgb_job_name = f"XGB-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

framework_version = "1.7-1"
py_version = "py3"

hyper_params_dict = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "eval_metric": "error",
    "scale_pos_weight": 9.0,
    "max_depth": 3,
    "min_child_weight": 5,
    "subsample": 0.9,
    "verbosity": 1,
    "tree_method": "auto",
}

xgb_estimator = XGBoost(
    enable_sagemaker_metrics=True,
    entry_point="xgb_train.py",
    framework_version=framework_version,
    hyperparameters=hyper_params_dict,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=model_output_path,
    py_version=py_version,
    role=sagemaker_execution_role,
    sagemaker_session=sagemaker_session,
    source_dir="scripts/xgb_train",
    environment={"MLFLOW_TRACKING_ARN": tracking_server_arn},
)

xgb_estimator.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    job_name=xgb_job_name,
    logs=True,
    wait=True,
)
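
The `scale_pos_weight` value of 9.0 in the hyperparameters above is taken as given; this notebook does not say how it was chosen. As a point of reference, the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for imbalanced binary problems. The sketch below shows that heuristic, assuming 0/1 labels stored in the first column of a headerless CSV. Both `suggest_scale_pos_weight` and `labels_from_csv` are hypothetical helpers.

```python
import csv
import io


def suggest_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    labels = [int(float(value)) for value in labels]
    positives = sum(1 for value in labels if value == 1)
    negatives = sum(1 for value in labels if value == 0)
    if positives == 0:
        raise ValueError("No positive examples found; the ratio is undefined.")
    return negatives / positives


def labels_from_csv(handle):
    """Read the first column of a headerless CSV as the label column (assumption)."""
    return [row[0] for row in csv.reader(handle) if row]


# Tiny in-memory example: 1 positive and 9 negatives.
example = io.StringIO("1,0.5,2.1\n" + "0,0.1,0.3\n" * 9)
print(suggest_scale_pos_weight(labels_from_csv(example)))
```

Treat the result as a starting point to log and compare across runs rather than a fixed rule.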

---
## 5. Model Evaluation

### 5.1. Compare Model Results in MLflow

In [None]:
import mlflow

mlflow.set_tracking_uri(tracking_server_arn)
runs = mlflow.search_runs()
display(runs)
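
`mlflow.search_runs()` returns a pandas DataFrame in which logged metrics appear in columns named `metrics.<name>`. The helper below sketches one way to rank runs by a single metric once the DataFrame has been converted with `runs.to_dict("records")`. The metric name `validation_accuracy` is an assumption: the names actually logged by the three training scripts are not shown in this notebook, so substitute whichever `metrics.*` column appears in your results. `rank_runs` is a hypothetical helper.

```python
import math


def rank_runs(records, metric_column, higher_is_better=True):
    """Sort run records by one metric column, dropping runs that did not log it."""

    def has_metric(record):
        value = record.get(metric_column)
        return value is not None and not (isinstance(value, float) and math.isnan(value))

    scored = [record for record in records if has_metric(record)]
    return sorted(scored, key=lambda r: r[metric_column], reverse=higher_is_better)


# Hypothetical records standing in for runs.to_dict("records").
records = [
    {"run_id": "a", "metrics.validation_accuracy": 0.88},
    {"run_id": "b", "metrics.validation_accuracy": float("nan")},
    {"run_id": "c", "metrics.validation_accuracy": 0.93},
]
best = rank_runs(records, "metrics.validation_accuracy")[0]
print(best["run_id"])
```

`mlflow.search_runs` also accepts `order_by` and `search_all_experiments` arguments, which may be more convenient when the sorting can be done on the tracking server side or when the runs are spread over several experiments.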

Finally, we'll look at options for deploying this model.

---
## 6. Deploy

Now that we've trained a model, we can allow other applications to use it for inference by deploying it. SageMaker offers several deployment options, based on your performance, cost, and data needs:

![SageMaker model deployment options](img/deployment_options.png "SageMaker Model Deployment Options")

### 6.1. Deploy Model as SageMaker Endpoint

Real-time inference endpoints are deployed to a persistent EC2 instance. This allows them to respond quickly to requests and support a wide range of custom properties. It's a good choice for models with steady usage. Deploying this endpoint will take about 5 minutes.

In [None]:
realtime_endpoint_name = f"her2-real-time-endpoint-{strftime('%Y-%m-%d-%H-%M-%S')}"

xgb_predictor = xgb_estimator.deploy(
    endpoint_name=realtime_endpoint_name,
    serializer=sagemaker.serializers.CSVSerializer(),  # Serialize request features as CSV
    deserializer=sagemaker.deserializers.JSONDeserializer(),  # Parse the JSON response body
    wait=True,
    instance_type="ml.t2.medium",  # Instance type we want to use to host our endpoint.
    initial_instance_count=1,  # For this example, we'll only use a single hosting instance.
)

### 6.2. Test Endpoint

In [None]:
import pandas as pd
from time import sleep

# Load a random sample of 25 records from the test data
test_df = pd.read_csv("data/output/test/test.csv", header=None).sample(n=25)

# Submit the 25 samples to the inference endpoint and compare the actual and predicted values
print(
    f"Sending test traffic to the endpoint {xgb_predictor.endpoint_name}. \nPlease wait..."
)

for i, row in test_df.iterrows():
    print(
        f"[Actual | predicted] labels for record {i:3} are [{row[0]} | {xgb_predictor.predict([row.iloc[1:]])[0]:.3f}]"
    )
    sleep(0.1)
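
The loop above prints a predicted value next to each true label. To turn such output into a single summary number, the predictions can be thresholded and compared with the labels. This is a minimal sketch, assuming the endpoint returns probabilities, a 0.5 decision threshold, and 0/1 labels (none of which is specified in this notebook). `accuracy_at_threshold` is a hypothetical helper and the values are made up.

```python
def accuracy_at_threshold(labels, probabilities, threshold=0.5):
    """Fraction of records whose thresholded probability equals the true label."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    hits = sum(
        int(probability >= threshold) == int(label)
        for label, probability in zip(labels, probabilities)
    )
    return hits / len(labels)


# Made-up values for illustration only.
print(accuracy_at_threshold([1, 0, 0, 1], [0.92, 0.10, 0.61, 0.35]))
```

With imbalanced classes, accuracy alone can look good while positives are missed, so it is worth reading alongside the precision and F1 metrics tracked during training.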

### 6.3. Clean Up

In [None]:
# Delete endpoint
xgb_predictor.delete_endpoint()

# Delete only the S3 objects under this notebook's prefix, not the whole default bucket
bucket = boto_session.resource("s3").Bucket(S3_BUCKET)
bucket.objects.filter(Prefix=f"{S3_PREFIX}/").delete()

import shutil

shutil.rmtree("data", ignore_errors=True)
shutil.rmtree("models", ignore_errors=True)
shutil.rmtree("generated", ignore_errors=True)
shutil.rmtree("training_reports", ignore_errors=True)