Using Machine Learning to Predict Survival of Patients with Heart Failure

Overview

This project uses machine learning to predict patients’ survival based on their medical data.

I create two models in Azure Machine Learning Studio: one using Automated Machine Learning (AutoML) and one custom model whose hyperparameters are tuned using HyperDrive. I then compare the performance of the two models and deploy the best-performing one as a web service using Azure Container Instances (ACI).

The diagram below gives a rough overview of the operations that take place in this project:

Project Workflow

Project Set Up and Installation

In order to run the project in Azure Machine Learning Studio, we will need the two Jupyter Notebooks:

  • automl.ipynb: for the AutoML experiment;
  • hyperparameter_tuning.ipynb: for the HyperDrive experiment.

The following files are also necessary:

  • heart_failure_clinical_records_dataset.csv: the dataset file. It can also be taken directly from Kaggle;
  • train.py: a basic script for manipulating the data used in the HyperDrive experiment;
  • scoring_file_v_1_0_0.py: the script used to deploy the model, downloaded from within Azure Machine Learning Studio; and
  • env.yml: the environment file which is also downloaded from within Azure Machine Learning Studio.

Dataset

Overview

Cardiovascular diseases (CVDs) kill approximately 18 million people every year and are the number one cause of death globally. Heart failure is one of the two main ways in which CVDs manifest (the other being myocardial infarction) and occurs when the heart cannot pump enough blood to meet the body’s needs. People with cardiovascular disease, or who are at high cardiovascular risk, need early detection and management, and this is where machine learning can be of great help. That is what this project attempts to do: create an ML model that can help predict patients’ survival based on their medical data.

The dataset used is taken from Kaggle and, as we can read in the original research article, the data come from 299 patients with heart failure collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) during April–December 2015. The patients consisted of 105 women and 194 men, and their ages ranged from 40 to 95 years.

The dataset contains 13 columns (12 features plus the target):

| Feature | Explanation | Measurement |
| --- | --- | --- |
| age | Age of the patient | Years (40-95) |
| anaemia | Decrease of red blood cells or hemoglobin | Boolean (0=No, 1=Yes) |
| creatinine_phosphokinase | Level of the CPK enzyme in the blood | mcg/L |
| diabetes | Whether the patient has diabetes or not | Boolean (0=No, 1=Yes) |
| ejection_fraction | Percentage of blood leaving the heart at each contraction | Percentage |
| high_blood_pressure | Whether the patient has hypertension or not | Boolean (0=No, 1=Yes) |
| platelets | Platelets in the blood | kiloplatelets/mL |
| serum_creatinine | Level of creatinine in the blood | mg/dL |
| serum_sodium | Level of sodium in the blood | mEq/L |
| sex | Female (F) or Male (M) | Binary (0=F, 1=M) |
| smoking | Whether the patient smokes or not | Boolean (0=No, 1=Yes) |
| time | Follow-up period | Days |
| DEATH_EVENT | Whether the patient died during the follow-up period | Boolean (0=No, 1=Yes) |

Task

The main task that I seek to solve with this project and dataset is to classify patients based on their odds of survival. The prediction is based on the first 12 features in the table above, while the classification result is reflected in the last column, DEATH_EVENT (the target), which is either 0 (no) or 1 (yes).

Access

First, I made the data publicly accessible in the current GitHub repository via this link: https://raw.githubusercontent.com/dimikara/heart-failure-prediction/master/heart_failure_clinical_records_dataset.csv

and then created the dataset:

Dataset creation
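For reference, here is a minimal sketch of how this dataset creation and registration could look with the Azure ML SDK; the workspace config and variable names are assumptions, while the dataset name mirrors the registered name shown below:

from azureml.core import Workspace, Dataset

# Load the workspace (assumes a config.json is present in the project folder)
ws = Workspace.from_config()

# Create a TabularDataset from the raw CSV hosted in this GitHub repository
url = 'https://raw.githubusercontent.com/dimikara/heart-failure-prediction/master/heart_failure_clinical_records_dataset.csv'
dataset = Dataset.Tabular.from_delimited_files(path=url)

# Register the dataset in the workspace
dataset = dataset.register(workspace=ws,
                           name='heart-failure-prediction',
                           description='Heart failure clinical records dataset')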

As it is depicted below, the dataset is registered in Azure Machine Learning Studio:

Registered datasets: Dataset heart-failure-prediction registered Registered datasets

I also access the data directly via:

import pandas as pd

data = pd.read_csv('./heart_failure_clinical_records_dataset.csv')

Automated ML

AutoML settings and configuration:

AutoML settings & configuration

Below you can see an overview of the AutoML settings and configuration I used for the run:

import logging
from azureml.train.automl import AutoMLConfig

automl_settings = {"n_cross_validations": 2,
                   "primary_metric": 'accuracy',
                   "enable_early_stopping": True,
                   "max_concurrent_iterations": 4,
                   "experiment_timeout_minutes": 20,
                   "verbosity": logging.INFO
                  }

automl_config = AutoMLConfig(compute_target = compute_target,
                             task = 'classification',
                             training_data = dataset,
                             label_column_name = 'DEATH_EVENT',
                             path = project_folder,
                             featurization = 'auto',
                             debug_log = 'automl_errors.log',
                             enable_onnx_compatible_models = False,
                             **automl_settings
                             )

"n_cross_validations": 2

This parameter sets how many cross-validations to perform, based on the same number of folds (subsets). Since a single validation split could result in overfitting, I chose 2 folds for cross-validation; the reported metrics are therefore the average of the 2 validation metrics.

"primary_metric": 'accuracy'

I chose accuracy as the primary metric as it is the default metric used for classification tasks.

"enable_early_stopping": True

This enables early termination if the score is not improving in the short term. In this experiment it could also be omitted, because experiment_timeout_minutes is already defined below.

"max_concurrent_iterations": 4

It represents the maximum number of iterations that would be executed in parallel.

"experiment_timeout_minutes": 20

This is an exit criterion used to define how long, in minutes, the experiment should continue to run. To help avoid experiment timeout failures, I used the value of 20 minutes.

"verbosity": logging.INFO

The verbosity level for writing to the log file.

compute_target = compute_target

The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.

task = 'classification'

This defines the experiment type which in this case is classification. Other options are regression and forecasting.

training_data = dataset

The training data to be used within the experiment. It should contain both training features and a label column - see next parameter.

label_column_name = 'DEATH_EVENT'

The name of the label column, i.e. the target column on which the prediction is based.

path = project_folder

The full path to the Azure Machine Learning project folder.

featurization = 'auto'

This parameter defines whether the featurization step should be done automatically, as in this case (auto), or not (off).

debug_log = 'automl_errors.log'

The log file to write debug information to.

enable_onnx_compatible_models = False

I chose not to enforce ONNX-compatible models at this stage; however, I will try it in the future. For more info on the Open Neural Network Exchange (ONNX), please see here.
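With the configuration in place, the AutoML experiment can be submitted. A minimal sketch, assuming the workspace object is named ws; the experiment name here is illustrative, not taken from the notebook:

from azureml.core import Experiment

# Submit the AutoML run with the configuration defined above
experiment = Experiment(ws, 'automl-heart-failure')
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()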

Results

During the AutoML run, the Data Guardrails are run when automatic featurization is enabled. As we can see in the screenshot below, the dataset passed all three checks:

Data Guardrails Checks in the Notebook Data Guardrails Checks

Data Guardrails Checks in Azure Machine Learning Studio Data Guardrails Checks

Completion of the AutoML run (RunDetails widget):

AutoML completed

AutoML run models

Best model

After completion, we can retrieve the metrics and details of the best run:

Best run metrics and details

Best run properties

Fitted model parameters
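As a sketch, the best run and its fitted model can be retrieved from the completed run object (here assumed to be named remote_run, as in the submission sketch above):

# Retrieve the best run and the fitted model from the completed AutoML run
best_run, fitted_model = remote_run.get_output()

# Print the metrics logged for the best run
for metric, value in best_run.get_metrics().items():
    print(metric, value)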

Best model results:

| | AutoML Model |
| --- | --- |
| id | AutoML_213153bb-f0e4-4be9-b265-6bbad4f0f9e4_40 |
| Accuracy | 0.8595525727069351 |
| AUC_weighted | 0.9087491748331944 |
| Algorithm | VotingEnsemble |

Screenshots from Azure ML Studio

AutoML models

Best model data

Best model metrics

Charts

Best model metrics - Charts

Aggregate feature importance

Best model metrics - Charts

As we can see, time is by far the most important factor, followed by serum creatinine and ejection fraction.

Hyperparameter Tuning

For this experiment I use a custom Scikit-learn logistic regression model, whose hyperparameters I optimise using HyperDrive. Logistic regression is well suited to binary classification problems like this one, which is the main reason I chose it.


Parameter sampler

I specified the parameter sampler as such:

from azureml.train.hyperdrive import RandomParameterSampling, choice

ps = RandomParameterSampling(
    {
        '--C': choice(0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500, 1000),
        '--max_iter': choice(50, 100, 200, 300)
    }
)

I chose discrete values with choice for both parameters, C and max_iter.

C is the inverse of the regularization strength (smaller values mean stronger regularization), while max_iter is the maximum number of iterations taken for the solver to converge.

RandomParameterSampling is one of the available sampler choices, and I chose it because it is faster and supports early termination of low-performance runs. If budget were not an issue, we could use GridParameterSampling to exhaustively search over the search space, or BayesianParameterSampling to explore the hyperparameter space.

Early stopping policy

An early stopping policy is used to automatically terminate poorly performing runs thus improving computational efficiency. I chose the BanditPolicy which I specified as follows:

from azureml.train.hyperdrive import BanditPolicy

policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

evaluation_interval: This is optional and represents the frequency for applying the policy. Each time the training script logs the primary metric counts as one interval.

slack_factor: The amount of slack allowed with respect to the best performing training run. This factor specifies the slack as a ratio.

Any run that does not fall within the slack factor (or slack amount) of the evaluation metric with respect to the best performing run is terminated. This means that, with this policy, the best performing runs will execute until they finish, which is why I chose it.
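To show how the pieces fit together, here is a sketch of wiring the sampler and the policy into a HyperDrive configuration; the environment name, the use of env.yml for training, the primary metric name, and the run counts are assumptions, not taken from the notebook:

from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

# Environment for the training script (file and name are assumptions)
sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='./env.yml')

# Run train.py on the compute cluster with the sampled hyperparameters
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=sklearn_env)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=ps,
                                     policy=policy,
                                     primary_metric_name='Accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=4)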

Results

Completion of the HyperDrive run (RunDetails widget):

HyperDrive run

HyperDrive RunDetails widget

Please also see the video here, where the RunDetails widget is shown enabled and the experiment keeps logging during its run until it shows Completed.

Best model

After completion, we can retrieve the metrics and details of the best run:

Best run metrics and details
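A sketch of how the best run and its details can be retrieved, assuming the submitted HyperDrive run object is named hyperdrive_run:

# Retrieve the best run of the HyperDrive experiment and its metrics
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_metrics())

# The sampled hyperparameters appear among the run's script arguments
print(best_run.get_details()['runDefinition']['arguments'])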

HyperDrive run hyperparameters

Best model overview:

| | HyperDrive Model |
| --- | --- |
| id | HD_debd4c29-658d-4280-b761-2308b5eff7e4_1 |
| Accuracy | 0.8333333333333334 |
| --C | 0.01 |
| --max_iter | 300 |

Screenshots from Azure ML Studio

HyperDrive model

Best model data and details

Best model details

Best model metrics

Model Deployment

The deployment is done following the steps below:

  • Selection of an already registered model
  • Preparation of an inference configuration
  • Preparation of an entry script
  • Choosing a compute target
  • Deployment of the model
  • Testing the resulting web service

Registered model

Using accuracy as the basis for comparison, the best AutoML model (accuracy 0.8596) is superior to the best model resulting from the HyperDrive run (accuracy 0.8333). For this reason, I chose to deploy the best model from the AutoML run (best_run_automl.pkl, Version 2).

Registered models in Azure Machine Learning Studio

Registered models

Runs of the experiment

Best model deployment

Inference configuration

The inference configuration defines the environment used to run the deployed model. It includes two entities, which are used to run the model when it is deployed:

Inference configuration

  • An entry script, named scoring_file_v_1_0_0.py.
  • An Azure Machine Learning environment, named env.yml in this case. The environment defines the software dependencies needed to run the model and entry script.

Inference configuration
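A minimal sketch of how these two entities could be combined into an inference configuration (the environment name is an assumption):

from azureml.core import Environment
from azureml.core.model import InferenceConfig

# Build the environment from the downloaded env.yml and point to the entry script
env = Environment.from_conda_specification(name='deploy-env', file_path='./env.yml')
inference_config = InferenceConfig(entry_script='scoring_file_v_1_0_0.py',
                                   environment=env)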

Entry script

The entry script is the scoring_file_v_1_0_0.py file. It loads the model when the deployed service starts and is responsible for receiving data, passing it to the model, and returning a response.

Compute target

As the compute target, I chose the Azure Container Instances (ACI) service, which is suited for low-scale, CPU-based workloads that require less than 48 GB of RAM.

The AciWebservice class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. The deployed service is created from the model, the entry script, and the associated files, as explained above. The resulting web service is a load-balanced HTTP endpoint with a REST API. We can send data to this API and receive the prediction returned by the model.

Compute target

cpu_cores: The number of CPU cores to allocate for this Webservice; it can also be a decimal.

memory_gb: The amount of memory (in GB) to allocate for this Webservice; it can be a decimal as well.

auth_enabled: I set it to True in order to enable authentication for the Webservice.

enable_app_insights: I set it to True in order to enable Application Insights for the Webservice.
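A sketch of the corresponding deployment configuration, using the settings described above (the exact core and memory values are assumptions):

from azureml.core.webservice import AciWebservice

# ACI deployment configuration with authentication and App Insights enabled
aci_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                memory_gb=1,
                                                auth_enabled=True,
                                                enable_app_insights=True)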

Deployment

Bringing all of the above together, here is the actual deployment in action:

Model deployment
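In code, the deployment step could look roughly like this, assuming the inference and deployment configurations sketched above; the model name and version follow the Registered model section, and the service name matches the aciservice used later:

from azureml.core.model import Model

# Deploy the registered AutoML model as an ACI web service
model = Model(ws, name='best_run_automl.pkl', version=2)
service = Model.deploy(workspace=ws,
                       name='aciservice',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)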

Best AutoML model deployed (Azure Machine Learning Studio)

Best AutoML model deployed successfully

Deployment takes some time to complete, but when it finishes successfully, the ACI web service has a status of Healthy and the model is deployed correctly. We can then move to the next step of actually testing the endpoint.

Consuming/testing the endpoint (ACI service)

Endpoint (Azure Machine Learning Studio)

ACI service

After the successful deployment of the model and with a Healthy service, I can print the scoring URI, the Swagger URI and the primary authentication key:

ACI service status and data
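A sketch of how these details can be printed, assuming the deployed AciWebservice object is named service:

# Print the endpoint details of the healthy service
print('Scoring URI:', service.scoring_uri)
print('Swagger URI:', service.swagger_uri)
primary_key, secondary_key = service.get_keys()
print('Primary key:', primary_key)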

The same info can be retrieved from Azure Machine Learning Studio as well:

ACI service details

The scoring URI can be used by clients to submit requests to the service.

In order to test the deployed model, I use a Python file, named endpoint.py:

endpoint.py file

At the beginning of the file, I fill in scoring_uri and key with the data of the aciservice printed above. We can then test the deployed service, using test data in JSON format, to make sure the web service returns a result.

In order to request data, the REST API expects the body of the request to be a JSON document with the following structure:

{
    "data":
        [
            <model-specific-data-structure>
        ]
}

In our case:

Data structure

The data is then converted to JSON string format:

Conversion to JSON string format

We set the content type:

Setting the content type

Finally, we make the request and print the response on screen:

Request and response
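Putting these steps together, here is a minimal sketch of what such an endpoint.py could look like; the URI, the key, and the feature values are placeholders or purely illustrative, not the actual ones used in the run:

import json
import requests

# Placeholders: fill these in with the values printed for the deployed service
scoring_uri = '<scoring-uri>'
key = '<primary-key>'

# One sample record with the 12 input features (values are illustrative)
data = {"data":
        [
          {"age": 75, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
           "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000,
           "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0,
           "time": 4}
        ]
       }

# Convert the payload to a JSON string
input_data = json.dumps(data)

# Set the content type and the authorization header with the primary key
headers = {'Content-Type': 'application/json',
           'Authorization': f'Bearer {key}'}

# Make the request and print the response on screen
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())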

I execute Cell 21 and, based on the above, I expect to get a response of true or false:

Running endpoint.py file within the cell

In order to test the deployed service, one could use the above file by inserting data in the endpoint.py file, saving it, and then running the relevant cell in the automl.ipynb Jupyter Notebook.

Another way would be using the Swagger URI of the deployed service and the Swagger UI.

A third way would be to use Azure Machine Learning Studio: go to the Endpoints section, choose aciservice, and click on the Test tab:

Testing ACI service in Azure ML Studio

Fill in the empty fields with the medical data you want to get a prediction for and click Test:

Getting response

Screen Recording

The screen recording can be found here and it shows the project in action.

More specifically, the screencast demonstrates:

  • A working model
  • Demo of the deployed model
  • Demo of a sample request sent to the endpoint and its response

Comments and future improvements

  • The first factor that could improve the model is increasing the training time. This suggestion might seem like a no-brainer, but it would also increase costs, and that is a limitation that can be very difficult to overcome: there must always be a balance between the minimum required accuracy and the assigned budget.

  • Continuing the above point, it would be great to experiment more with the hyperparameters chosen for the HyperDrive model, or even to try running it with more of the available hyperparameters, with fewer time constraints.

  • Another thing I would try is deploying the best models to the Edge using Azure IoT Edge and enabling logging in the deployed web apps.

  • I would certainly try to deploy the HyperDrive model as well, since the deployment procedure is a bit different from the one used for the AutoML model.

  • In the original research article where this dataset was used, it is mentioned that:

Random Forests [...] turned out to be the top performing classifier on the complete dataset

I would love to explore this further in order to create a model with higher accuracy that would give better and more reliable results, with potential practical benefits in the field of medicine.

  • The question of how much training data is required for machine learning is always valid and, by all means, the dataset used here is rather small and geographically limited: it contains the medical records of only 299 patients from a single geographical area. Increasing the sample size could mean a higher level of accuracy and more reliable results. Moreover, a dataset including patients from around the world would be more representative, as it would compensate for factors specific to particular geographical regions.

  • Finally, although cheerful and taking into account gender equality, it would be great not to stumble upon issues like this:

Notebook not available

Dataset Citation

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).

