# Clinical, Genomic, and Imaging Data - Training and Testing

---
This notebook demonstrates the use of the Amazon SageMaker [AutoGluon-Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) algorithm to train and test a tabular binary classification model. Tabular classification is the task of assigning a class to an example of structured or relational data. The Amazon SageMaker API for tabular classification can be used to classify an example into two classes (binary classification) or more than two classes (multi-class classification).

In this notebook, we walk through three steps of building tabular classification models on the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/):

* How to get features from Amazon SageMaker Feature Store. The preprocess-multimodal-data notebooks for clinical, genomic, and imaging data must be run before this notebook.
* How to train a tabular model on a multimodal dataset for binary classification. This notebook shows examples for four outcomes: Alzheimer's disease, coronary heart disease, stroke, and hypertension.
* How to evaluate predictions on the out-of-sample test data.

Note: This notebook was tested in Amazon SageMaker Studio on an ml.t3.xlarge instance with the Python 3 (Data Science 3.0) kernel.

---

In [6]:
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
import pandas as pd
import io, os
import sys
from sklearn.model_selection import train_test_split

In [None]:
!pip install autogluon

In [7]:
from autogluon.tabular import TabularPredictor
import autogluon as ag

## Get data type to train model

In [8]:
data_type = 'genomic-clinical-imaging'
PatientID = 'patientid'

## Set up S3 buckets and session

In [9]:
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
region = boto3.Session().region_name
role = get_execution_role()

boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

s3_client = boto3.client('s3', region_name=region)

default_s3_bucket_name = sm_session.default_bucket()
prefix = 'multi-model-health-ml'


## Get features from SageMaker Feature Store based on data type

In [10]:
from sagemaker.feature_store.feature_group import FeatureGroup

genomic_feature_group_name = 'genomic-feature-group'
clinical_feature_group_name = 'clinical-feature-group'
imaging_feature_group_name = 'imaging-feature-group'

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)
clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)
imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)

In [11]:
genomic_query = genomic_feature_group.athena_query()
clinical_query = clinical_feature_group.athena_query()
imaging_query = imaging_feature_group.athena_query()

genomic_table = genomic_query.table_name
clinical_table = clinical_query.table_name
imaging_table = imaging_query.table_name

print('Table names')
print(genomic_table)
print(clinical_table)
print(imaging_table)


Table names
genomic_feature_group_1688073848
clinical_feature_group_1688073736
imaging_feature_group_1692317426


In [12]:
def get_features(data_type, output_location):
    # Only the combined genomic, clinical, and imaging view is implemented here.
    if data_type != 'genomic-clinical-imaging':
        raise KeyError(f'data_type {data_type} is not supported for this analysis.')

    # Join the three feature group tables on the shared patient ID.
    query_string = f'''SELECT * FROM "{genomic_table}", "{clinical_table}", "{imaging_table}"
                           WHERE "{genomic_table}".{PatientID} = "{clinical_table}".{PatientID}
                           AND "{genomic_table}".{PatientID} = "{imaging_table}".{PatientID}
                           ORDER BY "{clinical_table}".{PatientID} ASC'''
    print(query_string)

    # Run the query through Athena and load the result into a DataFrame.
    genomic_query.run(query_string=query_string, output_location=output_location)
    genomic_query.wait()
    dataset = genomic_query.as_dataframe()

    return dataset
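
The query in `get_features` joins the three tables with a comma-separated `FROM` list and `WHERE` equality conditions. The same inner join can also be written with explicit `JOIN ... ON` clauses, which may be easier to extend to more tables. The following is a minimal sketch, not the notebook's code: `build_join_query` is a hypothetical helper, and the table names in the usage lines are placeholders rather than real Feature Store tables.

```python
def build_join_query(tables, key):
    """Build an inner-join SELECT that matches every table to the first one on `key`."""
    if len(tables) < 2:
        raise ValueError("need at least two tables to join")
    first, *rest = tables
    lines = [f'SELECT * FROM "{first}"']
    for table in rest:
        # Each additional table is joined back to the first table on the shared key.
        lines.append(f'JOIN "{table}" ON "{first}".{key} = "{table}".{key}')
    lines.append(f'ORDER BY "{first}".{key} ASC')
    return "\n".join(lines)


# Placeholder table names, used only for illustration.
example_query = build_join_query(
    ["genomic_table", "clinical_table", "imaging_table"], "patientid"
)
print(example_query)
```

Because the join keys are equal across tables, ordering by the first table's key gives the same row order as ordering by the clinical table's key.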

In [30]:
fs_output_location = f's3://{default_s3_bucket_name}/{prefix}/feature-store-queries'
dataset = get_features(data_type, fs_output_location)
dataset = dataset.astype(str).replace({"{":"", "}":""}, regex=True)

# Write the dataset to a local CSV file (header and index column included), then upload it to S3.
filename = f'{data_type}-dataset.csv'
dataset_uri_prefix = f's3://{default_s3_bucket_name}/{prefix}/training_input/'

dataset.to_csv(filename)
s3_client.upload_file(filename, default_s3_bucket_name, f'{prefix}/training_input/{filename}')
print("Observing the different features in the dataset")
dataset.head(3)

SELECT * FROM "genomic_feature_group_1688073848", "clinical_feature_group_1688073736", "imaging_feature_group_1692317426"
                           WHERE "genomic_feature_group_1688073848".patientid = "clinical_feature_group_1688073736".patientid
                           AND "genomic_feature_group_1688073848".patientid = "imaging_feature_group_1692317426".patientid
                           ORDER BY "clinical_feature_group_1688073736".patientid ASC
Observing the different features in the dataset


Unnamed: 0,patientid,gene_info,clinical_significance,contigname,start,referenceallele,alternatealleles,phased,calls,alzheimers_prediction,...,original_ngtdm_strength,imagesetid,alzheimers_prediction.2,coronary_heart_disease_prediction.2,stroke_prediction.2,hypertension_prediction.2,eventtime.2,write_time.2,api_invocation_time.2,is_deleted.2
0,0074596f-5fd0-7965-db0f-cce71c81567d,"'SOD3', None, 'TNF', 'LPL', 'F5', 'PON1', 'PPARG', 'FTO', 'HFE', 'DCHS1', 'ABCG2', 'HRC', 'PLTP', 'GCH1', 'ANKK1', 'EDN1', 'LRP8', 'DSP', 'TTR', 'SCN5A', 'APOB', 'CDKN2B', 'BMPR2', 'LPA', 'HABP2', 'F12', 'PPP1R3A', 'ALOX15', 'F2', 'CCL2', 'APOE', 'AGTR1', 'UMOD', 'ITGB3', 'AGT', 'SOD1', 'LDLR', 'LTA', 'CDKN2B-AS1', 'ADRB2', 'SMAD3', 'TNNI3', 'PSEN2', 'KCNE1', 'LDB3', 'ADRB3'","'Conflicting_interpretations_of_pathogenicity', 'Conflicting_interpretations_of_pathogenicity|_other', 'Conflicting_interpretations_of_pathogenicity|_other|_risk_factor', 'Benign', 'Benign/Likely_benign', 'Pathogenic', 'Likely_benign', 'Uncertain_significance', 'drug_response', 'risk_factor', 'Affects|_association', 'Conflicting_interpretations_of_pathogenicity|_risk_factor', 'Pathogenic/Likely_pathogenic', 'not_provided', 'association', 'Pathogenic|_risk_factor'","'14', '9', '6', '16', '17', '8', '10', '11', '2', '18', '1', '3', '21', '4', '19', '5', '7', '20', '15'","53767041, 44919688, 31667848, 12351625, 44908683, 6624137, 31573006, 55156238, 11089558, 22003223, 11110680, 38575384, 38613786, 20354331, 160589085, 230710047, 49188640, 47283363, 113400105, 46739504, 19956017, 22096055, 148742200, 88139961, 148827321, 31598654, 113588286, 22125503, 45919553, 11106626, 226885569, 26090950, 19961927, 160540104, 37966279, 11116872, 202520141, 53247054, 86681679, 21006288, 4632018, 24800211, 31575253, 12296021, 202555351, 67166300, 34252768, 11105249, 95308133, 113878378, 169549810, 21006195, 34449522, 7562998, 11113336, 177409530, 11110779, 54843773","'T', 'G', 'A', 'C'","'[A]', '[C]', '[G]', '[T]'",False,"'[1, 1]', '[0, 1]'",0,...,8.547276310306838,'d7298fe7dde4537b8343b5a702979aea',0,0,0,0,1692317426.0,2023-08-18 00:16:36.401,2023-08-18 00:11:13.000,False
1,0618424e-ed51-3100-ea5c-e46492bfd65b,"'SOD3', None, 'TNF', 'LPL', 'F5', 'PON1', 'PPARG', 'FTO', 'HFE', 'DCHS1', 'ABCG2', 'HRC', 'PLTP', 'ANKK1', 'GCH1', 'EDN1', 'LRP8', 'DSP', 'TTR', 'SCN5A', 'APOB', 'CDKN2B', 'BMPR2', 'LPA', 'HABP2', 'F12', 'PPP1R3A', 'ALOX15', 'F2', 'CCL2', 'APOE', 'AGTR1', 'UMOD', 'ITGB3', 'AGT', 'SOD1', 'LDLR', 'LTA', 'CDKN2B-AS1', 'ADRB2', 'SMAD3', 'TNNI3', 'PSEN2', 'KCNE1', 'LDB3', 'ADRB3'","'Conflicting_interpretations_of_pathogenicity', 'Conflicting_interpretations_of_pathogenicity|_other', 'Conflicting_interpretations_of_pathogenicity|_other|_risk_factor', 'Benign', 'Benign/Likely_benign', 'Pathogenic', 'Likely_benign', 'Uncertain_significance', 'drug_response', 'risk_factor', 'Affects|_association', 'Conflicting_interpretations_of_pathogenicity|_risk_factor', 'Pathogenic/Likely_pathogenic', 'not_provided', 'association', 'Pathogenic|_risk_factor'","'14', '9', '6', '16', '17', '8', '10', '11', '2', '18', '1', '3', '21', '4', '19', '5', '7', '20', '15'","53767041, 44919688, 12351625, 31667848, 44908683, 6624137, 31573006, 55156238, 11089558, 22003223, 11110680, 38575384, 38613786, 20354331, 160589085, 230710047, 49188640, 47283363, 113400105, 46739504, 19956017, 22096055, 148742200, 88139961, 148827321, 31598654, 113588286, 22125503, 45919553, 11106626, 226885569, 26090950, 19961927, 160540104, 37966279, 11116872, 202520141, 53247054, 86681679, 21006288, 4632018, 24800211, 31575253, 12296021, 202555351, 67166300, 34252768, 11105249, 95308133, 113878378, 169549810, 21006195, 34449522, 7562998, 11113336, 177409530, 11110779, 54843773","'T', 'G', 'A', 'C'","'[A]', '[C]', '[G]', '[T]'",False,"'[1, 1]', '[0, 1]'",0,...,9.241749116927542,'e5e2ccf0487eed5523395431675fa708',0,0,0,0,1692317426.0,2023-08-18 00:16:06.915,2023-08-18 00:11:14.000,False
2,06cc033a-f09a-0fb2-4a1a-4c4d99d88839,"'SOD3', None, 'TNF', 'LPL', 'F5', 'PON1', 'PPARG', 'FTO', 'HFE', 'DCHS1', 'ABCG2', 'HRC', 'PLTP', 'GCH1', 'ANKK1', 'EDN1', 'LRP8', 'DSP', 'TTR', 'SCN5A', 'APOB', 'CDKN2B', 'BMPR2', 'LPA', 'HABP2', 'F12', 'PPP1R3A', 'ALOX15', 'F2', 'CCL2', 'APOE', 'AGTR1', 'UMOD', 'ITGB3', 'AGT', 'SOD1', 'LDLR', 'LTA', 'CDKN2B-AS1', 'ADRB2', 'SMAD3', 'TNNI3', 'PSEN2', 'KCNE1', 'LDB3', 'ADRB3'","'Conflicting_interpretations_of_pathogenicity', 'Conflicting_interpretations_of_pathogenicity|_other', 'Conflicting_interpretations_of_pathogenicity|_other|_risk_factor', 'Benign', 'Benign/Likely_benign', 'Pathogenic', 'Likely_benign', 'Uncertain_significance', 'drug_response', 'risk_factor', 'Affects|_association', 'Conflicting_interpretations_of_pathogenicity|_risk_factor', 'Pathogenic/Likely_pathogenic', 'not_provided', 'association', 'Pathogenic|_risk_factor'","'14', '9', '6', '16', '17', '8', '10', '11', '2', '18', '1', '3', '21', '4', '19', '5', '7', '20', '15'","53767041, 44919688, 31667848, 12351625, 44908683, 6624137, 31573006, 55156238, 11089558, 22003223, 11110680, 38575384, 38613786, 20354331, 160589085, 230710047, 49188640, 47283363, 113400105, 46739504, 19956017, 22096055, 148742200, 88139961, 148827321, 31598654, 113588286, 22125503, 45919553, 11106626, 226885569, 26090950, 19961927, 160540104, 37966279, 11116872, 202520141, 53247054, 86681679, 21006288, 4632018, 24800211, 31575253, 12296021, 202555351, 67166300, 34252768, 11105249, 95308133, 113878378, 169549810, 21006195, 34449522, 7562998, 11113336, 177409530, 11110779, 54843773","'T', 'G', 'A', 'C'","'[A]', '[C]', '[G]', '[T]'",False,"'[1, 1]', '[0, 1]'",1,...,9.02168678253902,'1a3c6b0044e2b58a13c106896da369ef',1,0,1,1,1692317426.0,2023-08-18 00:16:06.943,2023-08-18 00:11:14.000,False
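
The `astype(str).replace(..., regex=True)` call in the cell above casts every cell to a string and strips the curly braces that wrap set-like values. The effect on a single value can be sketched with the standard `re` module; `strip_braces` is a hypothetical helper and the sample string is made up for illustration.

```python
import re


def strip_braces(value):
    """Cast a value to str and drop every '{' and '}' character."""
    return re.sub(r"[{}]", "", str(value))


sample = "{'T', 'G', 'A', 'C'}"
cleaned = strip_braces(sample)
print(cleaned)  # 'T', 'G', 'A', 'C'
```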


In [31]:
ag.core.utils.random.seed(25)

## Alzheimer's Prediction
Splitting data for training and testing

In [32]:
# Alzheimer's prediction
# Split the data into training and testing sets (80:20)
dataset = dataset.loc[:, ~dataset.columns.str.startswith('diagnostics')]
dataset = dataset.drop(columns = ['eventtime', 'write_time', 'api_invocation_time', 'is_deleted', 'eventtime.1', 'write_time.1', 'api_invocation_time.1', 'is_deleted.1', 'alzheimers_prediction.1',
                                    'coronary_heart_disease_prediction.1', 'stroke_prediction.1', 'hypertension_prediction.1', 'patientid.1', 'eventtime.2', 'write_time.2', 'api_invocation_time.2', 'is_deleted.2', 
                                   'alzheimers_prediction.2', 'coronary_heart_disease_prediction.2', 'stroke_prediction.2', 'hypertension_prediction.2', 'patientid.2'])
training= dataset.sample(frac=0.8, random_state=23)
training = training.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['alzheimers_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

Training size =  121
Out of sample testing size =  30
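
The 121/30 sizes are consistent with `sample(frac=0.8)` rounding 0.8 × 151 = 120.8 to 121 rows, leaving 30 rows for testing once the sampled index is dropped. The sketch below mimics that sample-then-drop pattern with the standard library only. The total of 151 rows is an assumption inferred from the printed sizes, and the rows selected here will not match the ones pandas picks for the same seed.

```python
import random

n_rows = 151  # assumed total, inferred from 121 + 30
frac = 0.8

rng = random.Random(23)
n_train = round(frac * n_rows)  # 120.8 rounds to 121

# Sample training row positions without replacement, then keep the rest for testing.
train_idx = set(rng.sample(range(n_rows), n_train))
test_idx = [i for i in range(n_rows) if i not in train_idx]

print("Training size = ", len(train_idx))
print("Out of sample testing size = ", len(test_idx))
```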


### Alzheimer's prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
buckt = sm_session.default_bucket()
prefix= "genomic-clinical-imaging-alzheimers-prediction"
save_file = 's3://{}/{}'.format(buckt, prefix)
predictor = TabularPredictor(label= 'alzheimers_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))
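
The training cells measure wall-clock time with `time.time()`. One reusable alternative is a small context manager built on `time.perf_counter()`, a clock intended for measuring intervals. This is only a sketch: `timed` and `timings` are hypothetical names, and the `sleep` call stands in for the `fit` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the elapsed seconds of the wrapped block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"--- {label} time= {results[label]:.2f} seconds ---")


timings = {}
with timed("Training", timings):
    time.sleep(0.01)  # stand-in for TabularPredictor(...).fit(...)
```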

In [34]:
predictor.evaluate_predictions(y_true=testing['alzheimers_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

Evaluation: accuracy on test data: 0.8666666666666667
Evaluations on test data:
{
    "accuracy": 0.8666666666666667,
    "balanced_accuracy": 0.6296296296296297,
    "mcc": 0.25925925925925924,
    "f1": 0.3333333333333333,
    "precision": 0.3333333333333333,
    "recall": 0.3333333333333333
}
Detailed (per-class) classification report:
{
    "0": {
        "precision": 0.9259259259259259,
        "recall": 0.9259259259259259,
        "f1-score": 0.9259259259259259,
        "support": 27
    },
    "1": {
        "precision": 0.3333333333333333,
        "recall": 0.3333333333333333,
        "f1-score": 0.3333333333333333,
        "support": 3
    },
    "accuracy": 0.8666666666666667,
    "macro avg": {
        "precision": 0.6296296296296297,
        "recall": 0.6296296296296297,
        "f1-score": 0.6296296296296297,
        "support": 30
    },
    "weighted avg": {
        "precision": 0.8666666666666667,
        "recall": 0.8666666666666667,
        "f1-score": 0.8666666666666667,
        "support": 30
    }
}

{'accuracy': 0.8666666666666667,
 'balanced_accuracy': 0.6296296296296297,
 'mcc': 0.25925925925925924,
 'f1': 0.3333333333333333,
 'precision': 0.3333333333333333,
 'recall': 0.3333333333333333,
 'confusion_matrix':     0  1
 0  25  2
 1   2  1,
 'classification_report': {'0': {'precision': 0.9259259259259259,
   'recall': 0.9259259259259259,
   'f1-score': 0.9259259259259259,
   'support': 27},
  '1': {'precision': 0.3333333333333333,
   'recall': 0.3333333333333333,
   'f1-score': 0.3333333333333333,
   'support': 3},
  'accuracy': 0.8666666666666667,
  'macro avg': {'precision': 0.6296296296296297,
   'recall': 0.6296296296296297,
   'f1-score': 0.6296296296296297,
   'support': 30},
  'weighted avg': {'precision': 0.8666666666666667,
   'recall': 0.8666666666666667,
   'f1-score': 0.8666666666666667,
   'support': 30}}}
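
The headline metrics above can be recomputed from the confusion matrix, where rows are true labels and columns are predictions: 25 true negatives, 2 false positives, 2 false negatives, and 1 true positive. The sketch below uses the standard textbook definitions; `binary_metrics` is a hypothetical helper, not part of AutoGluon, and it assumes every denominator is non-zero.

```python
from math import sqrt


def binary_metrics(tn, fp, fn, tp):
    """Compute common binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, recall of the positive class
    specificity = tn / (tn + fp)   # recall of the negative class
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }


metrics = binary_metrics(tn=25, fp=2, fn=2, tp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With only 3 positive cases among the 30 test rows, the 0.87 accuracy is driven mostly by the majority class, which is why balanced accuracy (0.63) and MCC (0.26) are much lower.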

## Coronary Heart Disease Prediction
Splitting data for training and testing

In [20]:
#coronary_heart_disease_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=25)
training =  training.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['coronary_heart_disease_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

Training size =  121
Out of sample testing size =  30


### Coronary heart disease prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
buckt = sm_session.default_bucket()
prefix= "genomic-clinical-imaging-coronary-heart-disease-prediction"
save_file = 's3://{}/{}'.format(buckt, prefix)
predictor = TabularPredictor(label= 'coronary_heart_disease_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [22]:
predictor.evaluate_predictions(y_true=testing['coronary_heart_disease_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

Evaluation: accuracy on test data: 0.9
Evaluations on test data:
{
    "accuracy": 0.9,
    "balanced_accuracy": 0.625,
    "mcc": 0.4734320764739993,
    "f1": 0.4,
    "precision": 1.0,
    "recall": 0.25
}
Detailed (per-class) classification report:
{
    "0": {
        "precision": 0.896551724137931,
        "recall": 1.0,
        "f1-score": 0.9454545454545454,
        "support": 26
    },
    "1": {
        "precision": 1.0,
        "recall": 0.25,
        "f1-score": 0.4,
        "support": 4
    },
    "accuracy": 0.9,
    "macro avg": {
        "precision": 0.9482758620689655,
        "recall": 0.625,
        "f1-score": 0.6727272727272727,
        "support": 30
    },
    "weighted avg": {
        "precision": 0.9103448275862068,
        "recall": 0.9,
        "f1-score": 0.8727272727272728,
        "support": 30
    }
}


{'accuracy': 0.9,
 'balanced_accuracy': 0.625,
 'mcc': 0.4734320764739993,
 'f1': 0.4,
 'precision': 1.0,
 'recall': 0.25,
 'confusion_matrix':     0  1
 0  26  0
 1   3  1,
 'classification_report': {'0': {'precision': 0.896551724137931,
   'recall': 1.0,
   'f1-score': 0.9454545454545454,
   'support': 26},
  '1': {'precision': 1.0, 'recall': 0.25, 'f1-score': 0.4, 'support': 4},
  'accuracy': 0.9,
  'macro avg': {'precision': 0.9482758620689655,
   'recall': 0.625,
   'f1-score': 0.6727272727272727,
   'support': 30},
  'weighted avg': {'precision': 0.9103448275862068,
   'recall': 0.9,
   'f1-score': 0.8727272727272728,
   'support': 30}}}

## Stroke Prediction
Splitting data for training and testing

In [29]:
#stroke_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=30)
training =  training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])
X_test = testing.drop(columns = ['stroke_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

Training size =  121
Out of sample testing size =  30


### Stroke prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
buckt = sm_session.default_bucket()
prefix= "genomic-clinical-imaging-stroke_prediction"
save_file = 's3://{}/{}'.format(buckt, prefix)
predictor = TabularPredictor(label= 'stroke_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [73]:
predictor.evaluate_predictions(y_true=testing['stroke_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

Evaluation: accuracy on test data: 0.9666666666666667
Evaluations on test data:
{
    "accuracy": 0.9666666666666667,
    "balanced_accuracy": 0.9642857142857143,
    "mcc": 0.9348527048856053,
    "f1": 0.9696969696969697,
    "precision": 0.9411764705882353,
    "recall": 1.0
}
Detailed (per-class) classification report:
{
    "0": {
        "precision": 1.0,
        "recall": 0.9285714285714286,
        "f1-score": 0.962962962962963,
        "support": 14
    },
    "1": {
        "precision": 0.9411764705882353,
        "recall": 1.0,
        "f1-score": 0.9696969696969697,
        "support": 16
    },
    "accuracy": 0.9666666666666667,
    "macro avg": {
        "precision": 0.9705882352941176,
        "recall": 0.9642857142857143,
        "f1-score": 0.9663299663299664,
        "support": 30
    },
    "weighted avg": {
        "precision": 0.9686274509803922,
        "recall": 0.9666666666666667,
        "f1-score": 0.9665544332210999,
        "support": 30
    }
}


{'accuracy': 0.9666666666666667,
 'balanced_accuracy': 0.9642857142857143,
 'mcc': 0.9348527048856053,
 'f1': 0.9696969696969697,
 'precision': 0.9411764705882353,
 'recall': 1.0,
 'confusion_matrix':     0   1
 0  13   1
 1   0  16,
 'classification_report': {'0': {'precision': 1.0,
   'recall': 0.9285714285714286,
   'f1-score': 0.962962962962963,
   'support': 14},
  '1': {'precision': 0.9411764705882353,
   'recall': 1.0,
   'f1-score': 0.9696969696969697,
   'support': 16},
  'accuracy': 0.9666666666666667,
  'macro avg': {'precision': 0.9705882352941176,
   'recall': 0.9642857142857143,
   'f1-score': 0.9663299663299664,
   'support': 30},
  'weighted avg': {'precision': 0.9686274509803922,
   'recall': 0.9666666666666667,
   'f1-score': 0.9665544332210999,
   'support': 30}}}

## Hypertension Prediction
Splitting data for training and testing

In [30]:
#hypertension_prediction
#Splitting data into training and testing 80:20
training = dataset.sample(frac=0.8, random_state=25)
training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])
testing = dataset.drop(training.index)
testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])
X_test = testing.drop(columns = ['hypertension_prediction'])
print("Training size = ", len(training))
print("Out of sample testing size = ", len(testing))

# Move the target column to the front and save the training and testing splits as CSV files for deploying to an endpoint
training_target = training.pop("hypertension_prediction")
training.insert(0, 'hypertension_prediction', training_target)
testing_target = testing.pop("hypertension_prediction")
testing.insert(0, 'hypertension_prediction', testing_target)
training.to_csv('s3://multimodal-dataset-clinical-genomic-imaging/multimodal_hypertension_training.csv', index=False)
testing.to_csv('s3://multimodal-dataset-clinical-genomic-imaging/multimodal_hypertension_testing.csv', index=False)

Training size =  121
Out of sample testing size =  30
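
The `pop` and `insert` calls in the cell above move the target to the first column before the splits are written as CSV files without an index. That layout can be sketched with the standard `csv` module; `label_first` is a hypothetical helper, and the feature names and values below are invented for illustration.

```python
import csv
import io


def label_first(rows, label):
    """Return (header, records) with the `label` column moved to the front."""
    header = [label] + [column for column in rows[0] if column != label]
    records = [[row[column] for column in header] for row in rows]
    return header, records


# Invented rows; the real dataset has many more feature columns.
rows = [
    {"age": 71, "bmi": 27.4, "hypertension_prediction": 1},
    {"age": 45, "bmi": 22.1, "hypertension_prediction": 0},
]
header, records = label_first(rows, "hypertension_prediction")

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(records)
print(buffer.getvalue())
```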


### Hypertension prediction on clinical, genomic, and imaging data using AutoGluon

In [None]:
import time
start_time = time.time()
buckt = sm_session.default_bucket()
prefix= "genomic-clinical-imaging-hypertension-prediction"
save_file = 's3://{}/{}'.format(buckt, prefix)
predictor = TabularPredictor(label= 'hypertension_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])
print("--- Training time= %s seconds ---" % (time.time() - start_time))

In [76]:
predictor.evaluate_predictions(y_true=testing['hypertension_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)

Evaluation: accuracy on test data: 0.8666666666666667
Evaluations on test data:
{
    "accuracy": 0.8666666666666667,
    "balanced_accuracy": 0.9047619047619048,
    "mcc": 0.7486251134176306,
    "f1": 0.8181818181818181,
    "precision": 0.6923076923076923,
    "recall": 1.0
}
Detailed (per-class) classification report:
{
    "0": {
        "precision": 1.0,
        "recall": 0.8095238095238095,
        "f1-score": 0.8947368421052632,
        "support": 21
    },
    "1": {
        "precision": 0.6923076923076923,
        "recall": 1.0,
        "f1-score": 0.8181818181818181,
        "support": 9
    },
    "accuracy": 0.8666666666666667,
    "macro avg": {
        "precision": 0.8461538461538461,
        "recall": 0.9047619047619048,
        "f1-score": 0.8564593301435406,
        "support": 30
    },
    "weighted avg": {
        "precision": 0.9076923076923077,
        "recall": 0.8666666666666667,
        "f1-score": 0.8717703349282296,
        "support": 30
    }
}


{'accuracy': 0.8666666666666667,
 'balanced_accuracy': 0.9047619047619048,
 'mcc': 0.7486251134176306,
 'f1': 0.8181818181818181,
 'precision': 0.6923076923076923,
 'recall': 1.0,
 'confusion_matrix':     0  1
 0  17  4
 1   0  9,
 'classification_report': {'0': {'precision': 1.0,
   'recall': 0.8095238095238095,
   'f1-score': 0.8947368421052632,
   'support': 21},
  '1': {'precision': 0.6923076923076923,
   'recall': 1.0,
   'f1-score': 0.8181818181818181,
   'support': 9},
  'accuracy': 0.8666666666666667,
  'macro avg': {'precision': 0.8461538461538461,
   'recall': 0.9047619047619048,
   'f1-score': 0.8564593301435406,
   'support': 30},
  'weighted avg': {'precision': 0.9076923076923077,
   'recall': 0.8666666666666667,
   'f1-score': 0.8717703349282296,
   'support': 30}}}
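
Each of the four sections repeats the same pattern: keep one outcome as the label and drop `patientid` plus the other three outcome columns. A small helper could derive those column lists instead of spelling them out by hand. `columns_to_drop` is a hypothetical name, and the sketch only covers the column bookkeeping, not the training itself.

```python
OUTCOMES = [
    "alzheimers_prediction",
    "coronary_heart_disease_prediction",
    "stroke_prediction",
    "hypertension_prediction",
]


def columns_to_drop(label, id_column="patientid"):
    """List the ID column and every outcome column other than `label`."""
    if label not in OUTCOMES:
        raise ValueError(f"unknown outcome: {label}")
    return [id_column] + [column for column in OUTCOMES if column != label]


for label in OUTCOMES:
    print(label, "->", columns_to_drop(label))
```

The returned list could be passed to `DataFrame.drop(columns=...)` in place of the hand-written lists in each section.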