# Clinical, Genomic, and Imaging Data - Training and Testing

---
This notebook demonstrates how to use the Amazon SageMaker [AutoGluon-Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) algorithm to train and test a tabular binary classification model. Tabular classification is the task of assigning a class to an example of structured or relational data. The Amazon SageMaker API for tabular classification can assign an example to one of two classes (binary classification) or to one of more than two classes (multi-class classification).

In this notebook, we demonstrate three steps of a tabular classification workflow using the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/):

* How to get features from Amazon SageMaker Feature Store. The preprocess-multimodal-data notebooks for the clinical, genomic, and imaging data must be run before this notebook.
* How to train a tabular model on a multimodal dataset for binary classification. This notebook shows examples for four outcomes: Alzheimer's disease, coronary heart disease, stroke, and hypertension.
* How to evaluate predictions on the out-of-sample test data.

Note: This notebook was tested in Amazon SageMaker Studio on an ml.t3.xlarge instance with the Python 3 (Data Science 3.0) kernel.

---

In [None]:
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
import pandas as pd
import io, os
import sys
from sklearn.model_selection import train_test_split

In [None]:
!pip install autogluon

In [None]:
from autogluon.tabular import TabularPredictor
import autogluon as ag

## Get data type to train model

In [None]:
data_type = 'genomic-clinical-imaging'
PatientID = 'patientid'

## Set up S3 buckets and session

In [None]:
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
region = boto3.Session().region_name
role = get_execution_role()

boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

s3_client = boto3.client('s3', region_name=region)

default_s3_bucket_name = sm_session.default_bucket()
prefix = 'multi-model-health-ml'


## Get features from SageMaker Feature Store based on data type

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup

genomic_feature_group_name = 'genomic-feature-group'
clinical_feature_group_name = 'clinical-feature-group'
imaging_feature_group_name = 'imaging-feature-group'

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)
clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)
imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)

In [None]:
genomic_query = genomic_feature_group.athena_query()
clinical_query = clinical_feature_group.athena_query()
imaging_query = imaging_feature_group.athena_query()

genomic_table = genomic_query.table_name
clinical_table = clinical_query.table_name
imaging_table = imaging_query.table_name

print('Table names')
print(genomic_table)
print(clinical_table)
print(imaging_table)


In [None]:
# Only the combined three-modality dataset is handled in this notebook.
supported_data_type = ['genomic-clinical-imaging']

def get_features(data_type, output_location):
    # Fail early so that `dataset` is never returned unassigned.
    if data_type not in supported_data_type:
        raise KeyError(f'data_type {data_type} is not supported for this analysis.')

    # Join the three feature group tables on the patient ID.
    query_string = f'''SELECT * FROM "{genomic_table}", "{clinical_table}", "{imaging_table}"
                       WHERE "{genomic_table}".{PatientID} = "{clinical_table}".{PatientID}
                       AND "{genomic_table}".{PatientID} = "{imaging_table}".{PatientID}
                       ORDER BY "{clinical_table}".{PatientID} ASC'''
    print(query_string)

    # Run the query through Athena and load the result into a DataFrame.
    genomic_query.run(query_string=query_string, output_location=output_location)
    genomic_query.wait()
    return genomic_query.as_dataframe()
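
The query in `get_features` is an implicit inner join: only patients present in all three feature groups appear in the result. A minimal pandas sketch of the same logic on invented toy tables (the column names other than `patientid` are hypothetical) shows why the joined dataset can have fewer rows than any single feature group:

```python
import pandas as pd

# Toy stand-ins for the three feature group tables (all values are invented).
genomic = pd.DataFrame({"patientid": ["p1", "p2", "p3"], "gene_a": [0.1, 0.2, 0.3]})
clinical = pd.DataFrame({"patientid": ["p1", "p2", "p4"], "age": [70, 65, 80]})
imaging = pd.DataFrame({"patientid": ["p2", "p1", "p5"], "volume": [1.5, 1.7, 1.2]})

# Inner-join on patientid, then sort, mirroring the WHERE ... ORDER BY clauses.
joined = (
    genomic.merge(clinical, on="patientid", how="inner")
    .merge(imaging, on="patientid", how="inner")
    .sort_values("patientid")
    .reset_index(drop=True)
)
print(joined)
```

Unlike `SELECT *`, `merge` keeps a single `patientid` column, so this sketch does not reproduce the duplicated columns that the SQL query returns.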

In [None]:
fs_output_location = f's3://{default_s3_bucket_name}/{prefix}/feature-store-queries'
dataset = get_features(data_type, fs_output_location)
dataset = dataset.astype(str).replace({"{":"", "}":""}, regex=True)

# Write to a local CSV (header and index included), then upload it to S3.
filename = f'{data_type}-dataset.csv'
dataset_uri_prefix = f's3://{default_s3_bucket_name}/{prefix}/training_input/'

dataset.to_csv(filename)
s3_client.upload_file(filename, default_s3_bucket_name, f'{prefix}/training_input/{filename}')
print("Observing the different features in the dataset")
dataset.head(3)
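
Because the query selects `*` from three tables, columns that exist in every feature group (for example `patientid` and the Feature Store metadata columns) appear more than once in the result. When pandas parses a CSV with repeated header names, it renames the repeats with `.1` and `.2` suffixes, which is presumably where names such as `eventtime.1` in the following cells come from. A small sketch with an invented CSV:

```python
import io
import pandas as pd

# Invented CSV in which the same header name appears three times.
csv_text = "patientid,gene_a,patientid,age,patientid,volume\np1,0.1,p1,70,p1,1.5\n"

df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))

# Keep the first copy of the repeated column and drop the suffixed copies.
deduplicated = df.drop(columns=["patientid.1", "patientid.2"])
print(list(deduplicated.columns))
```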

In [None]:
ag.core.utils.random.seed(25)

## Alzheimer's Disease Prediction
Splitting data for training and testing

In [None]:
#Alzheimers Prediction
#Splitting data into training and testing 80:20
dataset = dataset.loc[:, ~dataset.columns.str.startswith('diagnostics')]
# Drop Feature Store metadata columns and the suffixed duplicate label/ID columns.
dataset = dataset.drop(columns=[
    'eventtime', 'write_time', 'api_invocation_time', 'is_deleted',
    'eventtime.1', 'write_time.1', 'api_invocation_time.1', 'is_deleted.1',
    'alzheimers_prediction.1', 'coronary_heart_disease_prediction.1',
    'stroke_prediction.1', 'hypertension_prediction.1', 'patientid.1',
    'eventtime.2', 'write_time.2', 'api_invocation_time.2', 'is_deleted.2',
    'alzheimers_prediction.2', 'coronary_heart_disease_prediction.2',
    'stroke_prediction.2', 'hypertension_prediction.2', 'patientid.2',
])
training = dataset.sample(frac=0.8, random_state=23)
training = training.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['alzheimers_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))
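
The same split is repeated for each of the four outcomes in this notebook, changing only the label column and the random seed. The pattern can be sketched as a small helper; `split_for_label`, `LABELS`, and the toy data are hypothetical and not part of the original notebook:

```python
import pandas as pd

LABELS = [
    "alzheimers_prediction",
    "coronary_heart_disease_prediction",
    "stroke_prediction",
    "hypertension_prediction",
]

def split_for_label(dataset, label, seed, frac=0.8):
    """Return (training, testing, X_test) for one outcome column."""
    # Drop the patient ID and the three outcomes that are not being predicted.
    other_labels = [c for c in LABELS if c != label]
    data = dataset.drop(columns=["patientid"] + other_labels)
    # Sample the training rows, then use the remaining rows for testing.
    training = data.sample(frac=frac, random_state=seed)
    testing = data.drop(training.index)
    X_test = testing.drop(columns=[label])
    return training, testing, X_test

# Toy dataset with ten patients and one feature (all values are invented).
toy = pd.DataFrame({"patientid": [f"p{i}" for i in range(10)], "feature": range(10)})
for name in LABELS:
    toy[name] = [i % 2 for i in range(10)]

training, testing, X_test = split_for_label(toy, "stroke_prediction", seed=30)
print(len(training), len(testing))
```

Note that `drop(training.index)` only yields a clean partition when the DataFrame index is unique, which holds for the default integer index used here.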

### Alzheimer's disease prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
bucket = sm_session.default_bucket()
# Store the trained models under an outcome-specific S3 prefix.
model_prefix = "genomic-clinical-imaging-alzheimers-prediction"
save_file = 's3://{}/{}'.format(bucket, model_prefix)
predictor = TabularPredictor(
    label='alzheimers_prediction', problem_type='binary', path=save_file
).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))
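
With `holdout_frac=0.1`, AutoGluon sets aside roughly 10% of `training` as validation data (when bagging is not enabled), so the full dataset is split into approximately 72% for fitting, 8% for validation, and 20% for out-of-sample testing. A quick arithmetic sketch, assuming a hypothetical dataset of 1,000 rows (the exact counts inside AutoGluon may differ by rounding):

```python
total_rows = 1000      # hypothetical dataset size
train_frac = 0.8       # dataset.sample(frac=0.8)
holdout_frac = 0.1     # passed to TabularPredictor.fit

training_rows = round(total_rows * train_frac)
testing_rows = total_rows - training_rows
validation_rows = round(training_rows * holdout_frac)
fit_rows = training_rows - validation_rows

print(fit_rows, validation_rows, testing_rows)
```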

In [None]:
predictor.evaluate_predictions(y_true=testing['alzheimers_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)
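
`evaluate_predictions` reports standard binary classification metrics such as accuracy, precision, recall, and F1. To make those numbers easier to interpret, the sketch below computes them by hand from a confusion matrix on invented labels; it illustrates the definitions and is not AutoGluon's implementation:

```python
# Invented ground truth and predictions for eight patients (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```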

## Coronary Heart Disease Prediction
Splitting data for training and testing

In [None]:
#coronary_heart_disease_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=25)
training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['coronary_heart_disease_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

### Coronary heart disease prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
bucket = sm_session.default_bucket()
# Store the trained models under an outcome-specific S3 prefix.
model_prefix = "genomic-clinical-imaging-coronary-heart-disease-prediction"
save_file = 's3://{}/{}'.format(bucket, model_prefix)
predictor = TabularPredictor(
    label='coronary_heart_disease_prediction', problem_type='binary', path=save_file
).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [None]:
predictor.evaluate_predictions(y_true=testing['coronary_heart_disease_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

## Stroke Prediction
Splitting data for training and testing

In [None]:
#stroke_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=30)
training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['stroke_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

### Stroke prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
bucket = sm_session.default_bucket()
# Store the trained models under an outcome-specific S3 prefix.
model_prefix = "genomic-clinical-imaging-stroke_prediction"
save_file = 's3://{}/{}'.format(bucket, model_prefix)
predictor = TabularPredictor(
    label='stroke_prediction', problem_type='binary', path=save_file
).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [None]:
predictor.evaluate_predictions(y_true=testing['stroke_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

## Hypertension Prediction
Splitting data for training and testing

In [None]:
#hypertension_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=25)
training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])
X_test = testing.drop(columns = ['hypertension_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

### Hypertension prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
bucket = sm_session.default_bucket()
# Store the trained models under an outcome-specific S3 prefix.
model_prefix = "genomic-clinical-imaging-hypertension-prediction"
save_file = 's3://{}/{}'.format(bucket, model_prefix)
predictor = TabularPredictor(
    label='hypertension_prediction', problem_type='binary', path=save_file
).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [None]:
predictor.evaluate_predictions(y_true=testing['hypertension_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)