## Predicting survival of patients with heart failure

#### *Drew Solomon*

### Introduction
##### The Problem

Cardiovascular diseases (CVDs) are the leading cause of death globally. In 2016, an estimated 17.9 million people died of CVDs, representing 31% of all global deaths (WHO, 2016). In the US, 1 in 4 deaths is due to heart disease (CDC, 2020). Thus, CVDs represent a major public health burden. These disorders of the heart and blood vessels include heart attacks, strokes, and heart failure, as well as other conditions. Within CVDs, heart failure is rising in prevalence globally, affecting an estimated 26 million people (Savarese & Lund, 2017). Specifically, heart failure occurs when the heart cannot pump enough blood to the body, and it is associated with diabetes, high blood pressure, and other heart diseases. While predicting heart failure is common in medicine, clinicians' predictions of heart failure-related events have limited accuracy (Buchan et al., 2019). Thus, accurately predicting the survival of patients with heart failure using available medical information is an important machine learning problem. Identifying key features related to patients' survival and building classification models to accurately predict survival may ultimately assist clinicians in extending and improving the lives of people with heart failure.

As such, I want to predict the survival of patients with heart failure using medical record information. First, I want to identify which clinical, body, and lifestyle features are important for predicting patients' survival. Then, I want to build and evaluate a machine learning model to accurately predict heart failure patients' survival. I aim to do so using a dataset of medical records from 299 heart failure patients at a hospital in Pakistan, collected from April-December 2015 (described below).

##### Target variable 
The target variable is the death event, which measures whether the patient died before the end of the follow-up period (130 days on average). It is a binary variable, where `1` represents a patient's death, and `0` their survival. Regarding the balance, 203 patients (67.89%) survived and 96 patients (32.11%) died.
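
As a quick check of the class balance, the percentages can be recomputed from the two counts reported above. The sketch below also derives the accuracy of a trivial classifier that always predicts the majority class; this baseline is an added illustration (not part of the original analysis) that gives a floor any useful model should beat.

```python
# Class counts as reported in the text above
n_survived = 203
n_died = 96
n_total = n_survived + n_died

survival_rate = n_survived / n_total
death_rate = n_died / n_total

# A classifier that always predicts "survived" (0) is correct for every
# survivor, so its accuracy equals the majority-class share.
baseline_accuracy = max(survival_rate, death_rate)

print(f"total patients: {n_total}")
print(f"survived: {survival_rate:.2%}, died: {death_rate:.2%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Because the classes are moderately imbalanced, accuracy alone can look respectable for a model that never predicts a death, which is one reason to keep the class proportions in view when evaluating.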

##### Classification problem
Since the target variable is a categorical one with two classes (survived or not), the problem here is a binary classification problem. In particular, given the dataset, this classification problem involves predicting heart failure patient survival within a time period of around 130 days, for patients who have already suffered heart failures and rank highly on the NYHA progression score.

##### Why it matters
This problem is both important and interesting. Solving this problem could help doctors predict heart failure patients' survival with greater accuracy. Further, feature selection may simplify medical decision-making by identifying key risk factors. For example, if a few features can accurately predict survival, then doctors may still predict patients' survival with incomplete medical records (e.g. missing tests). Importantly, an accurate predictive model may help doctors triage patients, particularly given their limited time and resources. In addition, an accurate prediction of survival may help support patients and their families in planning and coping with loss. Broadly, using medical data to uncover relationships between risk factors and heart failure-related events may improve the lives of people at risk of heart failure.

##### Dataset description

This dataset contains the (anonymous) medical records of 299 heart failure patients admitted to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, during April-December 2015. The dataset has 299 data points (one for each patient) and 13 features (including the target variable). The patients consisted of 105 women and 194 men, ranging between 40 and 95 years old. All patients had previous heart failures and ranked in the top 2 of 4 classes of the NYHA score on heart failure progression, which classifies patients based on physical activity limitations (AHA, 2017).

The dataset contains 13 features from electronic health records. The categorical features (all binary) are: anaemia, high blood pressure, diabetes, sex, smoking, and death event (whether the patient died before the end of the follow-up period). The continuous features are: age, creatinine phosphokinase (level of the CPK enzyme in the blood - an indicator of muscle tissue damage), ejection fraction (% of blood leaving the heart at each contraction), platelets, serum creatinine (creatinine in the blood - an indicator of kidney dysfunction), serum sodium (sodium in the blood - a possible indicator of heart failure if very low), and time (days between hospital visit and follow-up).

This dataset was obtained via UCI. In its original publication, Dr. Ahmad and his colleagues employed traditional biostatistics time-dependent models (such as Cox regression and Kaplan–Meier survival plots) to identify risk factors and model mortality of patients with heart failure. Ahmad et al. (2017) found that age, renal dysfunction, blood pressure, ejection fraction and anemia are significant risk factors for mortality among heart failure patients.

Chicco and Jurman (2020) extended this work by comparing biostatistics approaches with machine learning ones in identifying key features and predicting patients' survival. Within both approaches, the authors found that serum creatinine and ejection fraction were the two most important features in predicting patients' survival. Moreover, Chicco and Jurman (2020) found that these two features were not only sufficient for predicting heart failure patient survival from medical records, but yielded more accurate predictions than using all of the original features. While the two features are well-known drivers of heart failure, Chicco and Jurman (2020) were successful in predicting heart failure patients' survival using only two features, which has useful practical implications.

In [42]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
from IPython.display import Image

# load the heart failure clinical records dataset
df = pd.read_csv('../data/heart_failure_clinical_records_dataset.csv')
print("shape:",df.shape)

shape: (299, 13)


In [43]:
# rename column names
df.columns=['age', 'anaemia', 'creatinine phosphokinase', 'diabetes',
       'ejection fraction', 'high blood pressure', 'platelets',
       'serum creatinine', 'serum sodium', 'sex', 'smoking', 'time',
       'death event']
columns = df.columns

### Exploratory Data Analysis 

In [44]:
df.head()

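| | age | anaemia | creatinine phosphokinase | diabetes | ejection fraction | high blood pressure | platelets | serum creatinine | serum sodium | sex | smoking | time | death event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |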


In [45]:
# count missing values
df.isnull().sum(axis=0)

age                         0
anaemia                     0
creatinine phosphokinase    0
diabetes                    0
ejection fraction           0
high blood pressure         0
platelets                   0
serum creatinine            0
serum sodium                0
sex                         0
smoking                     0
time                        0
death event                 0
dtype: int64

### Selected Figures:

The first figure is a scatter matrix of patients' continuous features: age, the time of their follow-up visit, and their clinical information. These scatterplots are grouped by the patients' survival status, to visualize the patterns in the dataset by the target variable. The most visible pattern is that patients who died had a shorter time until follow-up, as expected. Beyond the time column, there is a distinct grouping of target classes within the ejection fraction and serum creatinine scatterplot, whereby patients with a higher ejection fraction and lower serum creatinine survived more frequently.

![title](../figures/fig1_scattermatrix_by_survival.png)

The second figure is a stacked bar plot of patients' survival by sex. The survival rates of the two sexes are remarkably similar: 67.62% for female patients (71 of 105) and 68.04% for male patients (132 of 194).

![title](../figures/fig2_stacked_bar_by_sex_survival.png)
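
Group-wise survival rates like these can be computed with a pandas `groupby`. The sketch below is illustrative only: `toy` is a hypothetical ten-row frame (not the real dataset) and `survival_rate_by_group` is a helper name introduced here; the column names follow the renamed columns above, and the meaning of the `0`/`1` codes for `sex` is whatever the dataset documentation specifies.

```python
import pandas as pd

# Hypothetical toy records, NOT the real dataset: sex (0/1) and death event (0/1)
toy = pd.DataFrame({
    "sex":         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "death event": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

def survival_rate_by_group(frame, group_col, target_col="death event"):
    """Return the share of patients with target == 0 (survived) in each group."""
    survived = frame[target_col].eq(0)
    return survived.groupby(frame[group_col]).mean()

rates = survival_rate_by_group(toy, "sex")
print((rates * 100).round(2))
```

Applied to the full data frame, the same call would be `survival_rate_by_group(df, "sex")`, and the helper works for any of the other binary features as well.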

The third figure is a scatter plot of ejection fraction and serum creatinine. Ejection fraction (EF) represents the percentage of blood the left ventricle pumps out with each contraction. Creatinine is a waste product of muscle metabolism, and high serum levels may indicate kidney dysfunction. This plot illustrates that patients who did not survive typically had a lower EF and higher serum creatinine, whereas patients who survived (within the follow-up window) had a higher EF and lower serum creatinine. The separation between these groups suggests that these features may be important for classifying patients by survival.

![title](../figures/fig3_scatter_serum_creatinine_by_EF.png)

Focusing on the role of ejection fraction, the fourth figure is a violin plot illustrating the distribution of patients' EF by survival status. Patients who survived had a clearly higher EF, with a mode around 38%, than patients who did not, with a mode around 23%. However, some patients who died had a high EF, indicating the possibility of heart failure with preserved ejection fraction (HFpEF), where the ventricle fails to relax (Chicco & Jurman, 2020).

![title](../figures/fig4_violin_EF_by_survival.png)

## Data preprocessing 

#### Split proportion

Given the small dataset size, I split the data into 60% training, 20% validation, and 20% test sets, in order to train, tune, and test the classification model on sufficient data, respectively. Importantly, the model must accurately predict new patients' survival, so the (unseen) test set must be sufficiently large and varied.

#### IID data

The dataset is IID (independent and identically distributed) because each data point represents a unique patient, without repeated visits. Each observation is independent of the others. The data does not have a group structure nor is it generated by a time-dependent process.

#### Split type and justification

Given the IID structure, categorical target variable, small size ($n=299$) of the dataset, and the smaller proportion of positive target cases (death event = 1), I split the dataset using a stratified KFold split. While the data is not heavily imbalanced (~32% died, ~68% survived), the small number of observations results in large variation (on the order of 10%) in balance across the training, validation, and test sets when using a regular KFold split (with shuffling). Instead, the stratified KFold split - which splits the data into folds that preserve the proportion of samples from each class - results in variation in balance on the order of only 1%. Thus, this split ensures that the balance of the target class - patient death or survival - is preserved across the training, validation, and test sets despite the small sample size.
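
One way to obtain a stratified 60/20/20 split with scikit-learn is sketched below. This is an assumption about the mechanics, not necessarily the exact procedure used here: it first holds out a stratified 20% test set with `train_test_split`, then takes one fold of a 4-fold `StratifiedKFold` on the remainder as the validation set. The labels are a synthetic stand-in with the reported class counts, the features are random placeholders, and the `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in with the class counts reported above (203 survived, 96 died);
# the feature values are random placeholders, not real patient data.
rng = np.random.default_rng(0)
y = np.array([0] * 203 + [1] * 96)
X = rng.normal(size=(len(y), 11))

# Step 1: hold out 20% of the data as a stratified test set.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: 4 stratified folds on the remaining 80% give roughly
# 60% training / 20% validation of the full dataset per fold.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_idx, val_idx = next(skf.split(X_other, y_other))
X_train, y_train = X_other[train_idx], y_other[train_idx]
X_val, y_val = X_other[val_idx], y_other[val_idx]

# Report the size and share of deaths in each set.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, share died={labels.mean():.3f}")
```

Iterating over all four folds from `skf.split` instead of taking only the first would give four train/validation pairs for cross-validated tuning while the test set stays untouched.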


#### Splitting the dataset based on the classification goal 

Since I am predicting heart failure patients' survival from medical record information, I discarded the 'time' feature, which indicates the time until a patient's follow-up (whose procedures were not specified in the original dataset or paper), since this information is not available when predicting new patients' survival. Moreover, the stratified KFold split (described above) ensures that each training/validation/test set has a consistent balance of heart failure patients who survived and patients who did not. Thus, the machine learning classification model will be fit, tuned, and tested on a consistent ratio of patient classes, supporting more accurate survival prediction. For example, this avoids the risk of a predictive model being fit largely on data from patients who survived, which could occur in a small dataset.

#### Preprocessors by feature 

For the continuous variables (age, creatinine phosphokinase, ejection fraction, platelets, serum creatinine, and serum sodium), I standardized them by applying the StandardScaler. I chose the StandardScaler to prepare these features for feature ranking as well as for machine learning estimators that require features to have comparable scales, such as linear or logistic regression. Moreover, many of the continuous features have long-tailed distributions or wide ranges (especially the blood test measurements), so standardizing was more appropriate than scaling by the minimum and maximum values.
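
To make the scaling choice concrete, the two transforms can be written as follows, where $\mu$ and $\sigma$ denote the mean and standard deviation of a feature estimated on the training set, and $x_{\min}$, $x_{\max}$ its training-set minimum and maximum:

$$z = \frac{x - \mu}{\sigma}, \qquad x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

As a rough illustration using only the five creatinine phosphokinase values shown by `df.head()` above (582, 7861, 146, 111, 160), min-max scaling maps 7861 to 1 and 111 to 0, while 582 maps to $(582 - 111)/(7861 - 111) \approx 0.06$. A single extreme value therefore squeezes the remaining four values into the interval below 0.07, which is the behavior the standardization choice is meant to soften. The full-dataset minimum and maximum will differ from these five rows.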

In this dataset, the categorical features (anaemia, diabetes, high blood pressure, sex, and smoking) are already in binary numerical format (`0` if false, `1` if true), so they do not need transforming. I therefore specified in my preprocessor to pass these columns through unchanged. Likewise, the target variable 'death event' has unique values of `0` and `1`, and thus is ready for classification.
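
The preprocessing described in this section can be sketched with a scikit-learn `ColumnTransformer`. This is a minimal sketch under stated assumptions rather than the exact pipeline used here: the `demo` frame holds synthetic placeholder values (not patient data), the variable names `continuous_ftrs`, `binary_ftrs`, and `preprocessor` are introduced for illustration, and dropping 'time' via `remainder="drop"` is one of several equivalent ways to discard it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

continuous_ftrs = ["age", "creatinine phosphokinase", "ejection fraction",
                   "platelets", "serum creatinine", "serum sodium"]
binary_ftrs = ["anaemia", "diabetes", "high blood pressure", "sex", "smoking"]

# Synthetic placeholder rows (NOT real patient data) with the renamed columns.
rng = np.random.default_rng(0)
n_rows = 50
demo = pd.DataFrame(
    {col: rng.normal(loc=100, scale=15, size=n_rows) for col in continuous_ftrs}
)
for col in binary_ftrs:
    demo[col] = rng.integers(0, 2, size=n_rows)
demo["time"] = rng.integers(1, 300, size=n_rows)

X_train, X_new = demo.iloc[:40], demo.iloc[40:]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), continuous_ftrs),  # standardize continuous features
        ("bin", "passthrough", binary_ftrs),         # leave 0/1 features untouched
    ],
    remainder="drop",  # any column not listed (here 'time') is discarded
)

# Fit on the training rows only, then reuse the fitted scaler on unseen rows.
X_train_prep = preprocessor.fit_transform(X_train)
X_new_prep = preprocessor.transform(X_new)

print(X_train.shape, "->", X_train_prep.shape)
print(X_new.shape, "->", X_new_prep.shape)
```

Fitting the scaler on the training rows alone keeps validation and test statistics out of the preprocessing step, so the held-out sets remain a fair proxy for new patients.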

####  Preprocessed data features

After preprocessing, there are 11 features in the preprocessed data. The 'time' feature was discarded, and the continuous features were standardized. The initial training, validation, and test sets had shapes (180, 12), (59, 12), and (60, 12) respectively. After preprocessing, the transformed training, validation, and test sets had shapes (180, 11), (59, 11), and (60, 11) respectively. Thus, one feature was dropped, and no columns were added by standardization. The sets are split 60/20/20 and are consistently balanced for the target variable at 68% survived, 32% died, with a variation on the order of 1%.

### References 


Ahmad, T., Munir, A., Bhatti, S. H., Aftab, M., & Raza, M. A. (2017). Survival analysis of heart failure patients: A case study. PloS one, 12(7), e0181001.

American Heart Association. (2017). Classes of Heart Failure. www.heart.org/en/health-topics/heart-failure/what-is-heart-failure/classes-of-heart-failure.

Buchan, T. A., Ross, H. J., McDonald, M., Billia, F., Delgado, D., Posada, J. D., ... & Alba, A. C. (2019). Physician prediction versus model predicted prognosis in ambulatory patients with heart failure. The Journal of Heart and Lung Transplantation, 38(4), S381.

World Health Organization. (2016). Cardiovascular Diseases (CVDs). www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).

Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC medical informatics and decision making, 20(1), 16.


Centers for Disease Control and Prevention. (2020, September 8). Heart Disease Facts. www.cdc.gov/heartdisease/facts.htm.

Raphael, C., Briscoe, C., Davies, J., Whinnett, Z. I., Manisty, C., Sutton, R., ... & Francis, D. P. (2007). Limitations of the New York Heart Association functional classification system and self-reported walking distances in chronic heart failure. Heart, 93(4), 476-482.

Savarese, G., & Lund, L. H. (2017). Global public health burden of heart failure. Cardiac failure review, 3(1), 7.
