# Lung Cancer Survival Prediction Endpoint Demo

In this endpoint demo notebook, we demonstrate how to send inference requests to a pre-deployed endpoint and read the model response, which is the predicted survival status (survival vs. death).

For details of the end-to-end solution covering data processing, feature store, model training, and deployment with SageMaker, start with the solution notebook [1_preprocess_genomic_data.ipynb](./1_preprocess_genomic_data.ipynb).

The end-to-end solution walks through the following steps:
1. processing multi-modal data (genomic, clinical, medical imaging) to obtain ML features
2. ingesting and managing multi-modal features in SageMaker Feature Store
3. training a survival status prediction model using PCA and XGBoost
4. hosting a model for inference

The exposition in this notebook is deliberately brief. 

>**<span style="color:RED">Important</span>**: 
>This solution is for demonstrative purposes only. It is not for clinical use. The ML inference should not be used to inform any clinical decision. The associated notebooks, including the trained model and sample data, are not intended for production.

### Step 1: Read in the solution config

In [None]:
import json

# Load the stack outputs; the context manager closes the file after reading
with open("stack_outputs.json") as config_file:
    SOLUTION_CONFIG = json.load(config_file)

ROLE = SOLUTION_CONFIG["IamRole"]
SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
REGION = SOLUTION_CONFIG["AWSRegion"]
SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
BUCKET = SOLUTION_CONFIG["S3Bucket"]
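
A missing key in `stack_outputs.json` only surfaces as a `KeyError` at the point of use. The following is a minimal sketch of an up-front check, assuming the config is a flat JSON object. The helper `missing_config_keys` and the example config fragment are hypothetical and are not part of the solution.

```python
import json

# Keys this notebook reads from the config ("SolutionPrefix" is used in Step 3).
REQUIRED_KEYS = [
    "IamRole",
    "SolutionS3Bucket",
    "AWSRegion",
    "SolutionName",
    "S3Bucket",
    "SolutionPrefix",
]


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the config dict."""
    return [key for key in required if key not in config]


# Hypothetical config fragment, used only for illustration.
example_config = json.loads('{"IamRole": "example-role", "AWSRegion": "us-east-2"}')
missing = missing_config_keys(example_config)
print(missing)
```

In the notebook, the same helper could be called as `missing_config_keys(SOLUTION_CONFIG)`; an empty list means every key is present.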

### Step 2: Download and read in the multimodal dataset for inference

The test multimodal dataset consists of genomic, clinical, and imaging features for X patients. PCA has condensed the original 216 features into 65 principal components.

#### Download the data for inference from S3

In [None]:
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
inference_data = f"{input_data_bucket}/test.csv"
!aws s3 cp $inference_data .

In [None]:
import pandas as pd

df_test = pd.read_csv("test.csv", header=None)
print(df_test.shape)

In [None]:
# Separate the labels from the test dataset
groundtruth = df_test[0].values
df_test = df_test.drop(columns=[0])

The features are principal components computed from 216 features. The original feature vector includes features from genomic secondary analysis, features from clinical health records, and radiomic features extracted from within the lung tumor in computed tomography images.
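
To make the dimensionality reduction concrete, here is a minimal sketch of PCA via the singular value decomposition, using NumPy on synthetic data. It is not the solution's preprocessing code: the random matrix, the row count of 100, and the variable names are assumptions for illustration. Only the 216 and 65 dimensions come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 216))  # synthetic stand-in: 100 rows, 216 raw features

# PCA operates on mean-centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 65
X_reduced = X_centered @ Vt[:n_components].T  # project onto the top 65 directions

# Fraction of total variance retained by the kept components
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()
print(X_reduced.shape)
```

On real data the retained variance depends on how correlated the original features are, so the value computed here for random data says nothing about the solution's model.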

#### Snapshot of data
##### Clinical
![clinical-data](https://sagemaker-solutions-prod-us-east-2.s3-us-east-2.amazonaws.com/sagemaker-lung-cancer-survival-prediction/1.0.0/docs/clinical-data-screenshot.png)

##### Genomic
![genomic-data](https://sagemaker-solutions-prod-us-east-2.s3-us-east-2.amazonaws.com/sagemaker-lung-cancer-survival-prediction/1.0.0/docs/genomic-secondary.png)

##### Medical imaging
![imaging-data](https://sagemaker-solutions-prod-us-east-2.s3-us-east-2.amazonaws.com/sagemaker-lung-cancer-survival-prediction/1.0.0/docs/CT-tumor-overlay.png)

### Step 3: Predicting survival status

To use the demo endpoint successfully, your dataframe must have the same columns as `df_test` from the previous step.
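
A quick shape check before calling the endpoint can catch mismatched input early. The sketch below assumes each row carries the 65 principal components mentioned earlier. The helper `check_feature_rows` is hypothetical and is not part of the solution.

```python
EXPECTED_FEATURES = 65  # assumption: the number of principal components stated above


def check_feature_rows(rows, expected=EXPECTED_FEATURES):
    """Return True if every row has `expected` numeric values, else raise ValueError."""
    for index, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {index} has {len(row)} features, expected {expected}"
            )
        for value in row:
            float(value)  # raises ValueError for non-numeric strings
    return True


# Synthetic rows, used only for illustration.
good_rows = [[0.1] * 65, [0.2] * 65]
print(check_feature_rows(good_rows))

try:
    check_feature_rows([[0.1] * 64])
except ValueError as error:
    message = str(error)
    print(message)
```

With the notebook's data, the call would look like `check_feature_rows(df_test.values.tolist())`.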

In [None]:
import sagemaker
from sagemaker.deserializers import CSVDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

endpoint_name = SOLUTION_CONFIG["SolutionPrefix"] + "-demo-endpoint"

# Make prediction requests to an Amazon SageMaker endpoint.
# Requests are sent as CSV and responses are parsed as CSV.
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    deserializer=CSVDeserializer(),
    serializer=CSVSerializer(),
)

# Retrieve predictions by passing the test dataset to the Predictor's predict method.
prediction = predictor.predict(df_test.values)
prediction

In [None]:
# The predictions come back as strings, so convert them to floats.
# The response is a list of lists; the first inner list holds the predictions.

prediction_float = [float(pred) for pred in prediction[0]]
prediction_float
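
The conversion above can be tried without an endpoint. The sketch below assumes the response body is a single CSV line with one probability per input row, which is consistent with reading `prediction[0]`. The example body is made up, and parsing it with the standard `csv` module only approximates what a CSV deserializer does.

```python
import csv
import io

# Hypothetical response body: one CSV line holding one probability per input row.
response_body = "0.91,0.12,0.55\n"

# Parsing CSV text yields a list of rows, each a list of strings
rows = list(csv.reader(io.StringIO(response_body)))

# Take the first (and only) row and convert each entry to a float
probabilities = [float(value) for value in rows[0]]
print(rows)
print(probabilities)
```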

Let's evaluate the model performance on this sample data against the ground truth using the ROC curve (Receiver Operating Characteristic curve) and AUC (Area Under the Curve) for ROC.

In [None]:
# Compute the ROC curve and the AUC score from the predicted probabilities
import sklearn.metrics as skm

fpr, tpr, thresholds = skm.roc_curve(groundtruth, prediction_float)
skm.auc(fpr, tpr)
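
One way to read the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing every positive-negative pair in pure Python. The function `pairwise_auc` and the example labels and scores are illustrative and are not taken from the test dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, used only for illustration.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = pairwise_auc(labels, scores)
print(round(auc, 4))
```

In this example, five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. The pairwise loop is quadratic in the number of samples, so it serves as an explanation rather than a replacement for `skm.auc`.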

In [None]:
from matplotlib import pyplot as plt

plt.subplots(1, figsize=(5, 5))
plt.title("Receiver Operating Characteristic")
plt.plot(fpr, tpr)
# Diagonal reference line: the expected curve of a random classifier
plt.plot([0, 1], [0, 1], ls="--")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

Let's convert the probabilities to hard predictions, survival (1) or death (0), using a threshold of 0.5.

In [None]:
predictions_label = [1 if pred > 0.5 else 0 for pred in prediction_float]
predictions_label
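
The threshold of 0.5 is a choice, and moving it trades one kind of error for the other. The sketch below applies the same rule at several thresholds to a few made-up probabilities. The helper `to_labels` is hypothetical and is not part of the solution.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it is strictly greater than the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up probabilities, used only for illustration.
example_probs = [0.2, 0.5, 0.55, 0.8]

labels_by_threshold = {t: to_labels(example_probs, t) for t in (0.3, 0.5, 0.7)}
for threshold, labels_at_t in labels_by_threshold.items():
    print(threshold, labels_at_t)
```

Note that a probability exactly equal to the threshold maps to 0, because the comparison is strict.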

In [None]:
# Rows are true labels and columns are predicted labels, each ordered 0 then 1
skm.confusion_matrix(groundtruth, predictions_label)

In [None]:
print(skm.classification_report(groundtruth, predictions_label))
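
The numbers in the classification report follow directly from the confusion matrix counts. The sketch below tallies the four counts by hand for a binary problem and derives precision and recall for class 1. The labels are made up, and `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Made-up labels, used only for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the predicted 1s, how many were truly 1
recall = tp / (tp + fn)     # of the true 1s, how many were predicted 1
print(tn, fp, fn, tp)
```

For binary labels, `skm.confusion_matrix` arranges the same counts as `[[tn, fp], [fn, tp]]`.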

## Next Stage

Next, we'll start the end-to-end solution from [here](./1_preprocess_genomic_data.ipynb).