# Lung Cancer Survival Prediction Endpoint Demo

This endpoint demo notebook shows how to send inference requests to a pre-deployed endpoint and read the model response.

For an end-to-end solution covering data processing, feature store, model training, and deployment with SageMaker, see the solution notebook `xxxxxxx.ipynb`. It walks through the following steps:

1. Processing multi-modal data (genomic, clinical, medical imaging) to obtain ML features.
2. Ingesting and managing multi-modal features in SageMaker Feature Store.
3. Training a survival status prediction model using PCA and XGBoost.
4. Hosting a model for inference.

The exposition in this notebook is deliberately brief.

>**<span style="color:RED">Important</span>**: 
>This solution is for demonstrative purposes only. It is not for clinical use. The ML inference should not be used to inform any clinical decision. The associated notebooks, including the trained model and sample data, are not intended for production.

### (dev only) Hosting a trained model from local model.tar.gz
Remove this section once the solution team has tested it. The JS team should not include this section in the integration test.

In [5]:
import sagemaker
import boto3
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sagemaker-soln-lcsp-js'

In [10]:
model_local = './js-us-west-2-output/model.tar.gz'
model_s3_prefix = f's3://{default_bucket}/{prefix}/js-us-west-2-output'
model_s3_uri = sagemaker.s3.S3Uploader.upload(model_local, model_s3_prefix, sagemaker_session=sagemaker_session)

In [11]:
image = sagemaker.image_uris.retrieve('xgboost', region=region, version='1.2-1')
model = sagemaker.model.Model(image, model_data=model_s3_uri, role = role, sagemaker_session=sagemaker_session)

In [None]:
import uuid
suffix=uuid.uuid1().hex[:5] # to be used in resource names
endpoint_name = f'sagemaker-soln-lcsp-js-{suffix}'
model.deploy(initial_instance_count=1, instance_type='ml.t2.large', endpoint_name=endpoint_name)

### Step 1: Read in the solution config

In [None]:
# import json

# SOLUTION_CONFIG = json.load(open("stack_outputs.json"))
# ROLE = SOLUTION_CONFIG["IamRole"]
# SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
# REGION = SOLUTION_CONFIG["AWSRegion"]
# SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
# BUCKET = SOLUTION_CONFIG["S3Bucket"]

### Step 2: Download and read in the multimodal dataset for inference

The test multimodal dataset consists of genomic, clinical and imaging features for X patients. The features have been condensed by PCA from 216 to 65 principal components.

In [None]:
# from sagemaker.s3 import S3Downloader

# input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
# print("original data: ")
# S3Downloader.list(input_data_bucket)

#### Download the data for inference from S3

In [None]:
# inference_data = f"{input_data_bucket}/test.csv"
# !aws s3 cp $inference_data .

In [24]:
import pandas as pd

df_test = pd.read_csv("js-us-west-2-output/test.csv", header=None)
print(df_test.shape)
groundtruth = df_test[0].values
df_test = df_test.drop(columns=[0])

(24, 66)


In [25]:
df_test.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,56,57,58,59,60,61,62,63,64,65
0,-4.580822,4.778989,1.79645,-1.587991,-2.254577,-0.992406,1.329909,1.500276,-0.564313,-0.170869,...,1.543979,1.883418,1.032847,-0.924655,-0.505878,0.224683,-0.689189,-0.929684,0.394437,1.20787
1,12.791638,0.107049,-0.136994,-0.964423,0.017514,0.513931,0.086338,2.386275,-1.994754,0.244188,...,-0.320036,-0.389636,0.584013,0.11658,0.332778,1.098054,-0.246963,-1.01692,-1.405553,-0.400758
2,-6.650556,0.387462,-3.685193,-3.511565,0.171694,-0.518721,-2.470869,3.635547,0.896038,-1.177886,...,-0.835626,0.05824,-1.875023,-0.521941,-0.225894,-0.303791,1.210972,0.630887,-1.033824,1.368265
3,12.88435,0.469195,-2.623686,0.795279,-1.279601,1.27434,-2.278928,-0.307392,-2.002069,0.676743,...,-0.800259,1.468973,1.47895,0.881505,0.699133,0.031382,0.478656,2.755765,-0.022756,-3.56667
4,-3.01372,-2.245453,1.419483,-5.974647,-1.185193,1.607399,0.492786,-0.75744,-0.376684,-0.993567,...,0.821061,0.050708,-0.238885,-0.205702,-0.310094,1.212688,-2.072481,-0.197707,-1.123059,0.362785


The features are principal components computed from 216 original features. The original feature vector includes features from genomic secondary analysis, clinical health records, and radiomic features extracted from within the lung tumor in computed tomography images.
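To make the dimensionality reduction concrete, here is a minimal sketch of PCA via singular value decomposition that maps 216 features to 65 components. It runs on synthetic random data, and the helper name `pca_project` is hypothetical; it is not the code used in the solution notebook, which may fit PCA differently (for example, fitting on the training split and reusing that transform for the test split).

```python
import numpy as np


def pca_project(features, n_components):
    """Project rows of `features` onto the top `n_components` principal directions."""
    # Center each column so the components describe variance around the mean.
    centered = features - features.mean(axis=0)
    # Rows of `vt` are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of total variance captured by each kept component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, explained[:n_components]


# Synthetic stand-in for the 216-dimensional multimodal feature matrix.
rng = np.random.default_rng(0)
raw_features = rng.normal(size=(100, 216))

reduced, explained_ratio = pca_project(raw_features, n_components=65)
print(reduced.shape)
print(round(float(explained_ratio.sum()), 3))
```

The printed shape has 65 columns, matching the width of the feature rows in `test.csv` once the label column is dropped.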

#### Snapshot of data
##### Clinical
![clinical-data](../images/clinical-data-screenshot.png)

##### Genomic
![genomic-data](../images/genomic-secondary.png)

##### Medical imaging
![imaging-data](../images/CT-tumor-overlay.png)

### Step 3: Predicting survival status

To use the demo endpoint successfully, your dataframe must have the same columns as `df_test` shown in the previous step: the 65 principal-component features, in the same order, with the label column removed.
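As an illustration of the expected input layout, the sketch below builds a CSV request body by hand and rejects rows that do not have 65 feature values. The `CSVSerializer` used in the next cell performs the conversion to CSV for you; the helper name `rows_to_csv_payload` and the width check are assumptions added here for illustration, not part of the SageMaker SDK.

```python
import csv
import io

EXPECTED_FEATURES = 65  # principal components per row in this demo


def rows_to_csv_payload(rows, expected=EXPECTED_FEATURES):
    """Return a CSV string with one line per row, without a header or label column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {index} has {len(row)} values, expected {expected}"
            )
        writer.writerow([float(value) for value in row])
    # Drop the trailing newline so the body ends with the last record.
    return buffer.getvalue().rstrip("\n")


# Two made-up feature rows, only to show the shape of the payload.
example_rows = [[0.5] * 65, [1.0] * 65]
payload = rows_to_csv_payload(example_rows)
print(len(payload.split("\n")))
```

With a dataframe, the equivalent input would be `df_test.values.tolist()`.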

In [None]:
import sagemaker
from sagemaker import Predictor
import numpy as np

# endpoint_name = SOLUTION_CONFIG["SolutionPrefix"] + "-demo-endpoint" 


predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    serializer=sagemaker.serializers.CSVSerializer(),
)

prediction = predictor.predict(df_test.values)


Let's look at the count of each predicted survival status and evaluate the model's performance.
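The evaluation cells that follow expect `prediction` to hold integer class labels (0 or 1). Depending on how the model was trained, the endpoint may instead return scores between 0 and 1. The sketch below shows one way to turn scores into labels and count outcomes in plain Python. The 0.5 threshold, the helper names `to_labels` and `confusion_counts`, and the sample values are all assumptions made for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Map each score to 1 if it reaches the threshold, else 0."""
    return [1 if float(score) >= threshold else 0 for score in scores]


def confusion_counts(truth, predicted):
    """Return [[tn, fp], [fn, tp]]: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(truth, predicted):
        matrix[int(true_label)][int(predicted_label)] += 1
    return matrix


# Made-up scores and labels, not output from the demo endpoint.
sample_scores = [0.1, 0.8, 0.6, 0.3]
sample_truth = [0, 1, 0, 1]

sample_labels = to_labels(sample_scores)
print(sample_labels)
print(confusion_counts(sample_truth, sample_labels))
```

The matrix layout follows the same convention as `sklearn.metrics.confusion_matrix` for binary labels.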

In [None]:
print(np.bincount(prediction))

In [None]:
import sklearn.metrics as skm
skm.confusion_matrix(groundtruth, prediction)

In [None]:
# classification_report returns a string; print it so the table renders line by line
print(skm.classification_report(groundtruth, prediction))

In [None]:
# Compute the ROC curve and the AUC score (uses `skm`, imported in an earlier cell)
fpr, tpr, thresholds = skm.roc_curve(groundtruth, prediction, pos_label=1)
skm.auc(fpr, tpr)
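As a rough illustration of what the AUC score measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. The helper name `auc_from_scores` and the sample values are assumptions for illustration. When hard 0/1 labels are passed instead of scores, the ROC curve has only one interior point, so passing model scores generally gives a more informative curve.

```python
def auc_from_scores(truth, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(truth, scores) if t == 1]
    negatives = [s for t, s in zip(truth, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, not results from the demo model.
sample_truth = [0, 0, 1, 1]
sample_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(sample_truth, sample_scores))  # 0.75
```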