# Build and Deploy Many Models Leveraging Cancer Gene Expression Data With SageMaker Pipelines and SageMaker Multi-Model Endpoints

When building machine learning models that leverage genomic data, a key problem is how to allow users to select which features should be used when querying models. To address this, data scientists will sometimes build multiple models to handle specific sub-problems within the dataset. In the context of survival analysis for cancer, a common approach is to analyze gene signatures and to predict the survival of patients based on those gene expression signatures. See [here](https://www.nature.com/articles/s41598-021-84787-5) for an example of such an approach across a number of different cancer types. See also [this](https://pubmed.ncbi.nlm.nih.gov/31296308/) review, which discusses different techniques to perform survival analysis.

A problem that may occur is that, should an application require publishing models based on many hundreds or thousands of gene signatures, managing and deploying all such models may become unwieldy and difficult to maintain. In this blog post, we show how you can leverage SageMaker Pipelines and SageMaker Multi-Model Endpoints to build and deploy many such models.

To give a specific example, we will leverage the sample cancer RNA expression dataset discussed in the paper [Non-Small Cell Lung Cancer Radiogenomics Map Identifies Relationships between Molecular and Imaging Phenotypes with Prognostic Implications](https://pubmed.ncbi.nlm.nih.gov/28727543/). To simplify the use case, we will focus on 21 co-expressed groups that this paper found to be clinically significant in NSCLC (see that paper, Table 2). These groups of genes, which the authors term metagenes, are annotated with different cellular pathways. For example, the first group of genes (LRIG1, HPGD and GDF15) relates to the EGFR signaling pathway, while CIM, LMO2 and EFR2 are all involved in cell hypoxia/inflammation. Each patient is also annotated with a Survival Status (1 for deceased; 0 for alive at the time the dataset was collected). We followed [this blog post](https://aws.amazon.com/blogs/industries/building-scalable-machine-learning-pipelines-for-multimodal-health-data-on-aws/) for preprocessing the data. As described more fully in that blog post, the final dataset contains 119 patients, where each cancer patient (row) has gene expression values (columns). If you run the pipeline described in that blog post, you will get the entire gene expression profile based on the raw FASTQ files, or you can access the entire gene expression profile at [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103584).

The architecture for this approach is as follows:

![](images/Architecture.jpeg)

As can be seen in the diagram, we start with data that is located in S3. We then create a [SageMaker Pipeline](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/index.html). SageMaker Pipelines is a feature that allows data scientists to wrap the different components of their workload as a pipeline. This allows for a deployment strategy whereby each step of the analysis is automatically kicked off after the previous job finishes. See the associated code repository for the specific syntax for creating a SageMaker Pipeline.
The pipeline consists of:

* A SageMaker Processing job for preprocessing the data

* A SageMaker Training job for training the model. 

* A SageMaker Processing job for evaluating and registering the model in SageMaker Model Registry.

* A separate SageMaker Processing job for deploying the model on a SageMaker Multi-Model Endpoint (MME)




Before we begin, let's verify the SageMaker version.

In [None]:
import sagemaker
sagemaker.__version__

In [None]:
%pip install --upgrade --quiet sagemaker==2.244.2

* Please restart the kernel after the sagemaker update. You can do that from the menu: Kernel->Restart Kernel.
* After restarting, execute the cell below. Make sure that the sagemaker version is '>=2.94.0'.

In [None]:
import sagemaker
sagemaker.__version__
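
When checking the reported version against the stated minimum, note that comparing version strings directly can mislead. The following is a minimal sketch, assuming plain dotted numeric versions; the helper `parse_version` is hypothetical and written only for illustration (the `packaging.version.Version` class is a more robust option if the `packaging` library is available).

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2.244.2' into (2, 244, 2)."""
    return tuple(int(part) for part in version.split("."))


MINIMUM = "2.94.0"      # minimum version stated in the note above
installed = "2.244.2"   # version pinned in the install cell above

# Comparing integer tuples compares 244 with 94 numerically.
meets_minimum = parse_version(installed) >= parse_version(MINIMUM)

# Comparing the raw strings goes character by character, so "2.2..." sorts
# before "2.9..." and the check gives the wrong answer.
naive_string_check = installed >= MINIMUM

print(meets_minimum, naive_string_check)
```

In the notebook you would pass `sagemaker.__version__` instead of the hard-coded `installed` value.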

Then let's import the rest of the packages needed.

In [None]:
import time
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sagemaker import get_execution_role

from sagemaker.multidatamodel import MultiDataModel

from sagemaker.pytorch import PyTorch
from sagemaker.pytorch.model import PyTorchModel

from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

from sagemaker.predictor import Predictor

import matplotlib.pyplot as plt

### Read the data 

Data related to the project is available in the `data` folder. Let's read the data, do some exploratory analysis, and apply basic pre-processing.

In [None]:
genomic_data_with_label = pd.read_csv("data/Genomic-data-119patients.csv")
genomic_data_with_label

You can see that for each patient (`Case_ID`) we have all gene expression levels, as well as `SurvivalStatus`. Note that this dataset also contains a pathological label for the patient. We will not be leveraging this column, but you can read more about the histopathology data associated with this dataset [here](https://aws.amazon.com/blogs/industries/building-scalable-machine-learning-pipelines-for-multimodal-health-data-on-aws/). Thus, we remove `Case_ID` and `PathologicalMstage`.

In [None]:
genomic_data_with_label.drop(columns=["Case_ID", "PathologicalMstage"], inplace=True)

Next, we check the class balance.

In [None]:
genomic_data_with_label.SurvivalStatus.value_counts().plot.bar()
plt.show()

While class `0` makes up a greater proportion of cases, there are enough class `1` cases to proceed without rebalancing the data.

Next, we will rescale the data column by column.

In [None]:
genomic_data = genomic_data_with_label.drop(columns=["SurvivalStatus"])
labels = genomic_data_with_label["SurvivalStatus"]

scaler = MinMaxScaler()
genomic_data[genomic_data.columns] = scaler.fit_transform(genomic_data.to_numpy())
genomic_data                  
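
To make the column-wise rescaling concrete, here is a minimal pure-Python sketch of min-max scaling: each value becomes `(x - min) / (max - min)` for its own column, so every column ends up in the range 0 to 1. The gene names and values below are made up for illustration, and mapping a constant column to 0 is a choice made in this sketch to avoid dividing by zero.

```python
def min_max_scale(column):
    """Rescale one column of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical expression values: one list per gene (column).
expression = {
    "GENE_A": [2.0, 4.0, 6.0],
    "GENE_B": [10.0, 10.0, 30.0],
}

# Each column is scaled independently of the others.
scaled = {gene: min_max_scale(values) for gene, values in expression.items()}
print(scaled)
```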

### Split the data Train/Test


In [None]:
X_train, X_val, y_train, y_val = train_test_split(genomic_data, labels, test_size = 0.2)


After splitting the data, let's visually verify that the class distributions are similar in both the `train` and `validation` data.

In [None]:
y_train.value_counts().plot.bar()
plt.show()

In [None]:
y_val.value_counts().plot.bar()
plt.show()
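
The bar charts can be complemented with a numeric check. The sketch below, using made-up label lists rather than the real `y_train` and `y_val`, computes the share of each class in both splits and the largest difference between them. If the gap is large, one option is to pass `stratify=labels` to `train_test_split`, which keeps the class proportions the same in both splits.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Hypothetical labels standing in for y_train and y_val.
y_train_example = [0, 0, 0, 1, 0, 1, 0, 1]
y_val_example = [0, 1, 0, 0]

train_props = class_proportions(y_train_example)
val_props = class_proportions(y_val_example)

# Largest absolute difference in class share between the two splits.
largest_gap = max(abs(train_props[c] - val_props.get(c, 0.0)) for c in train_props)
print(train_props, val_props, largest_gap)
```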

### Save data

In [None]:
X_train.insert(0, "SurvivalStatus", y_train)
X_train.to_csv("./data/train_data.csv", index = False, header=True)

In [None]:
X_val.insert(0, "SurvivalStatus", y_val)
X_val.to_csv("./data/validation_data.csv", index = False, header=True)

### Prepare for SageMaker Training

In [None]:
role = get_execution_role()
session = sagemaker.Session()
bucket = session.default_bucket()

s3_prefix = "genome-survival-classification/data"

### Upload to S3

In [None]:
input_train = session.upload_data(
        path="./data/train_data.csv", bucket=bucket, key_prefix="{}/train".format(s3_prefix)
    )

input_val = session.upload_data(
        path="./data/validation_data.csv", bucket=bucket, key_prefix="{}/validation".format(s3_prefix)
    )

print("Train data : [{}]".format(input_train))
print("Val data : [{}]".format(input_val))

## Create the Multimodel Endpoint 

Here we create the multi-model endpoint (a one-time configuration) to serve the models that will be delivered by the SageMaker pipelines. Note that for now we are deploying an MME model that points to an empty collection of models; we will populate the collection later in a SageMaker Pipeline step. We also specify a custom `inference.py` script, which will allow users to choose which model to invoke.


In [None]:
FRAMEWORK_VERSION = "1.12.0"

mme_model_data_location = "s3://{}/{}/mme-models-location".format(bucket, s3_prefix)

endpoint_name = "Genome-Survival-Prediction-MultiModel-Endpoint-{}".format(time.strftime("%H-%M-%S"))

model = PyTorchModel(model_data="./model/model.tar.gz", 
                     source_dir='src', 
                     entry_point='inference.py', 
                     role=role, 
                     framework_version=FRAMEWORK_VERSION,
                     py_version = "py38",
                     sagemaker_session=session)

 
mme = MultiDataModel(
    name = "Genome-Survival-Prediction-MME-Model-{}".format(time.strftime("%H-%M-%S")),
    model_data_prefix = mme_model_data_location,
    model = model,  # passing our model
    sagemaker_session=session,
)

mme_predictor = mme.deploy(
    initial_instance_count=1, 
    instance_type="ml.m5.large", 
    endpoint_name=endpoint_name
)

#### Check for current models (First time it should be empty)

In [None]:
list(mme.list_models())

## Creating the pipeline 

At this point, the trained models are stored on S3, and the Multi-Model Endpoint can dynamically retrieve the needed model based on the user request. The user specifies not only the input data to run, but also which specific model to use.
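
A multi-model endpoint resolves the `target_model` given in a request relative to the S3 prefix the endpoint was created with. The sketch below illustrates that lookup with plain string handling. The `model-<group>.tar.gz` naming pattern is inferred from the `target_model` values used later in this notebook; the helper names and the bucket `example-bucket` are hypothetical.

```python
def artifact_name(genome_group: str) -> str:
    """Artifact file name for one gene group (pattern inferred from this notebook)."""
    return f"model-{genome_group}.tar.gz"


def target_model(genome_group: str) -> str:
    """Value to pass as target_model when invoking the endpoint."""
    return "/" + artifact_name(genome_group)


def artifact_s3_uri(model_data_prefix: str, genome_group: str) -> str:
    """Full S3 location the endpoint would load the artifact from."""
    return model_data_prefix.rstrip("/") + target_model(genome_group)


# Hypothetical prefix; in the notebook this is mme_model_data_location.
prefix = "s3://example-bucket/genome-survival-classification/data/mme-models-location"

print(target_model("metagene_19"))
print(artifact_s3_uri(prefix, "ALL"))
```

Keeping one artifact per gene group under a shared prefix is what lets a single endpoint serve an arbitrary number of signature-specific models.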

Thinking back to the gene expression data, the following diagram represents an overview of the modeling process:

![](images/image_2.jpg)

In this diagram, we start with the original gene expression data (red indicates higher expression; blue lower expression), and then split that data into N separate subsets of gene expression data. Model 1, for example, is built on genes 1, 2, 3; Model 2 on genes 4, 5, 6; and so on. We then train multiple models, where each subset of gene expression data is leveraged to predict survival. Note that each execution of the SageMaker Pipeline corresponds to building one model based on a gene signature.
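
The splitting step can be sketched as a lookup from a gene group name to the columns that feed its model. This is only an illustration under stated assumptions: the training script is not shown here, so how it actually uses the group name is an assumption. The `metagene_19` membership is taken from the prediction example later in this notebook, and the row values are made up.

```python
# Gene group name -> genes (columns) used by that group's model.
GENE_GROUPS = {
    "metagene_19": ["LRIG1", "HPGD", "GDF15"],
}


def select_features(row: dict, genome_group: str) -> list:
    """Return the feature values one model would see for a single patient."""
    if genome_group == "ALL":
        # The "ALL" model uses every gene column, but never the label.
        return [value for name, value in row.items() if name != "SurvivalStatus"]
    return [row[gene] for gene in GENE_GROUPS[genome_group]]


# Hypothetical patient row with rescaled expression values.
patient = {"SurvivalStatus": 0, "LRIG1": 0.1, "HPGD": 0.5, "GDF15": 0.9, "LMO2": 0.3}

print(select_features(patient, "ALL"))
print(select_features(patient, "metagene_19"))
```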

As mentioned in the introduction, we are leveraging a small data set for just 21 genes found to be significant in predicting survival in lung cancer. However, you could do a similar analysis with other groups of genes, such as those present in the [KEGG pathway database](https://www.genome.jp/kegg/pathway.html) or the [Molecular Signatures Database](http://www.gsea-msigdb.org/gsea/msigdb/index.jsp).



In [None]:
pipeline_session = PipelineSession()

from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

input_train_data = ParameterString(
    name="InputTrainData",
    default_value=input_train,
)

input_validation_data = ParameterString(
    name="InputValidationData",
    default_value=input_val,
)

genome_group = ParameterString(
    name="genomeGroup",
    default_value="ALL",
)

training_instance_type = ParameterString(
    name="TrainingInstanceType", 
    default_value="ml.m5.large"
)

mme_model_location = ParameterString(
    name="MMEModelsLocation",
    default_value=mme_model_data_location,
)

from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="PT1H")


#### Training Step

In [None]:
pytorch_estimator = PyTorch(
     source_dir="src",           
     entry_point="train.py",
     framework_version = "1.12.0",
     py_version = "py38",
     instance_type= training_instance_type,
     instance_count=1,
     role = role,
     hyperparameters = {
         "genome-group" : genome_group
     },
    sagemaker_session = pipeline_session
)

#pytorch_estimator.fit({"train_data" : input_train, "val_data": input_val})

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(
    name="Genome-Survival-Prediction-Training",
    estimator=pytorch_estimator,
    inputs={
        "train_data": TrainingInput(
            s3_data=input_train_data,
            content_type="text/csv",
        ),
         "val_data": TrainingInput(
            s3_data=input_validation_data,
            content_type="text/csv",
        )
    },
    cache_config=cache_config
)

#### Model evaluation Step

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.properties import PropertyFile
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type="ml.m5.large",
    instance_count=1,
    base_job_name="Genome-Survival-Prediction-Eval",
    role=role,
    env = {
        "genomeGroup" : genome_group
    },
    sagemaker_session = pipeline_session
)

In [None]:
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

step_eval = ProcessingStep(
    name="Genome-Survival-Prediction-Eval",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=input_validation_data,
            destination="/opt/ml/processing/test",
        ),
        ProcessingInput(
            source="./src",
            destination="/opt/ml/processing/code",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")
    ],
    code="src/evaluation.py",
    property_files=[evaluation_report],
)

In [None]:
step_fail = FailStep(
    name="Genome-Survival-Prediction-Fail",
    error_message="Execution failed because the objective metric was not met",
)

#### Define a Register Model Step to Create a Model Package


In [None]:
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri="{}/evaluation.json".format(
            step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
        ),
        content_type="application/json",
    )
)

model = PyTorchModel(
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    entry_point="inference.py",
    source_dir = "src",
    framework_version = "1.12.0",
    py_version = "py38",
    sagemaker_session=pipeline_session
)

# in addition, we might also want to register a model to SageMaker Model Registry
register_model_step_args = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name='Genome-Survival-Prediction-Model-Package-Group',
    approval_status = "Approved"
)

step_model_registration = ModelStep(
   name="Genome-Survival-Prediction-Model-Registration",
   step_args=register_model_step_args,
)



#### Define MME Deployment Step


In [None]:
sklearn_processor_for_mme_deployment = SKLearnProcessor(
    framework_version=framework_version,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="Genome-Survival-Prediction-Deployment",
    role=role,
    env = {
        "modelPackageArn" : step_model_registration.steps[1].properties.ModelPackageArn,
        "mmeModelLocation" : mme_model_location,
        "genomeGroup" : genome_group,
        "AWS_DEFAULT_REGION": session.boto_region_name
    }
)

step_mme_deployment = ProcessingStep(
    name="Genome-Survival-Prediction-MME-Deployment",
    processor=sklearn_processor_for_mme_deployment,
    inputs=[
        
    ],
    outputs=[
        ProcessingOutput(output_name="mme_model_location", source="/opt/ml/processing/model/mme")
    ],
    code="src/mme_deployment.py"
)

#### Condition Step

In [None]:
cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="metrics.test_accuracy.value",
    ),
    right=0.4
)

step_cond = ConditionStep(
    name="Genome-Survival-Prediction-Condition",
    conditions=[cond_lte],
    if_steps=[step_fail],
    else_steps=[step_model_registration, step_mme_deployment],
)
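
The branching above can be mirrored in plain Python to show what the evaluation report needs to contain. This is a sketch under assumptions: the evaluation script is not shown here, so the JSON layout is inferred from the `json_path` used in the condition, the accuracy value is made up, and `json_get` is a simplified stand-in for `JsonGet` that only handles dotted keys.

```python
import json

ACCURACY_THRESHOLD = 0.4  # same value as the right-hand side of the condition


def json_get(document: dict, json_path: str):
    """Follow a dotted path such as 'metrics.test_accuracy.value'."""
    value = document
    for key in json_path.split("."):
        value = value[key]
    return value


def next_steps(evaluation_json: str) -> list:
    """Return the names of the steps that would run after evaluation."""
    report = json.loads(evaluation_json)
    accuracy = json_get(report, "metrics.test_accuracy.value")
    if accuracy <= ACCURACY_THRESHOLD:
        return ["fail"]
    return ["register_model", "deploy_to_mme"]


# Hypothetical contents of evaluation.json.
example_report = json.dumps({"metrics": {"test_accuracy": {"value": 0.72}}})
print(next_steps(example_report))
```

Because the condition is "less than or equal to", a model scoring exactly 0.4 is still rejected.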

### Create the pipeline using all the steps defined above

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "Genome-Survival-Prediction-Pipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_train_data,
        input_validation_data,
        training_instance_type,
        genome_group,
        mme_model_location
    ],
    steps=[step_train, step_eval, step_cond]
)

In [None]:
import json

definition = json.loads(pipeline.definition())
definition

In [None]:
pipeline.upsert(role_arn=role)

If you are using SageMaker Studio, you can visualize what each step of the pipeline actually looks like:

![](images/image_3.jpg)
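
The same structure can be written down as a small dependency map. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later) to derive a valid execution order. The short step names are shorthand for the steps defined above, and the dependencies listed are those implied by the property references between steps (for example, evaluation reads the training step's model artifacts, and deployment reads the registered model package ARN).

```python
from graphlib import TopologicalSorter

# step -> set of steps that must finish first
dependencies = {
    "train": set(),
    "eval": {"train"},
    "condition": {"eval"},
    "fail": {"condition"},       # taken when accuracy is too low
    "register": {"condition"},   # taken otherwise
    "deploy": {"register"},
}

# One valid order in which the steps could run.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

In a real execution only one branch of the condition runs, so either `fail` or the `register` and `deploy` pair is executed, never both.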

### Start the pipeline with all the Gene groups.

In [None]:
execution = pipeline.start({
        "genomeGroup" : "ALL"
    }
)

### Pipeline Operations: Examining and Waiting for Pipeline Execution

Describe the pipeline execution

In [None]:
execution.describe()

Wait for the execution to complete.


In [None]:
execution.wait()

### Verify how many models are deployed on the MME

In [None]:
list(mme.list_models())

* We can see there is already a model suffixed with 'ALL' in the MME location. Let's do some predictions with the test dataset.

### Predict with trained models using test data


In [None]:
predictor = Predictor(endpoint_name = endpoint_name)

predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.CSVDeserializer()

In [None]:
payload = {
    "inputs" : X_val.iloc[:, 1:].values
}

predictor.predict(payload, target_model="/model-ALL.tar.gz")
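
With a JSON serializer, the request body for the call above is a JSON object whose `inputs` key holds one list of feature values per patient. The sketch below builds such a body by hand with made-up values. How the inference script parses it is not shown here, so treat the exact contract as an assumption; the point is that the number of values per row has to match the number of features the target model was trained on.

```python
import json

# Hypothetical rescaled expression values: two patients, three genes each.
rows = [
    [0.12, 0.80, 0.33],
    [0.56, 0.10, 0.91],
]

# Request body as it would travel over the wire.
body = json.dumps({"inputs": rows})

# Decoding it again shows the shape the endpoint receives.
decoded = json.loads(body)
n_samples = len(decoded["inputs"])
n_features = len(decoded["inputs"][0])
print(body)
print(n_samples, n_features)
```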

### Next, let's train a model with the "metagene_19" gene group

In [None]:
execution = pipeline.start(
    parameters=dict(
        genomeGroup="metagene_19"
    )
)

In [None]:
execution.wait()

### Verify how many models are deployed on the MME

In [None]:
list(mme.list_models())

We can see there is a new model suffixed with 'metagene_19' in the MME location. Let's do some predictions with the test dataset. 

In [None]:
payload = {
    "inputs" : X_val[['LRIG1', 'HPGD', 'GDF15']].iloc[0:5, :].values
}
payload

In [None]:
predictor.predict(payload, target_model="/model-metagene_19.tar.gz")
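
To build models for many gene groups, the same start call can be repeated in a loop, one execution per group. The sketch below is a minimal illustration: `start_executions` is a hypothetical helper, and `fake_start` is a stand-in so the example runs without AWS access.

```python
def start_executions(start_fn, genome_groups):
    """Start one pipeline execution per gene group and return them by group name."""
    executions = {}
    for group in genome_groups:
        # Same parameter name as the pipeline's genomeGroup parameter.
        executions[group] = start_fn(parameters={"genomeGroup": group})
    return executions


def fake_start(parameters):
    """Stand-in for pipeline.start that only records what was requested."""
    return f"execution-for-{parameters['genomeGroup']}"


groups = ["ALL", "metagene_19"]
executions = start_executions(fake_start, groups)
print(executions)
```

In the notebook you would pass `pipeline.start` in place of `fake_start` and then call `wait()` on each returned execution before listing the models on the endpoint.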

## Clean up

Once you have completed your work with the notebook, delete the endpoint by uncommenting and running the following code.

In [None]:
#predictor.delete_endpoint()