# Heart Failure Prediction

## About

* Our team has built a classification model to predict whether a patient is at risk of dying after surviving a heart attack, based on their medical records.
* The current model reaches a test accuracy of 0.78; performance is limited by the small amount of available data.
* Working with real-world healthcare data often presents challenges such as class imbalance and limited availability (due to patient data confidentiality). Despite that, this project showcases how machine learning can be used to help save lives.

## Data 

| Column Name              | Description                                               |
|--------------------------|-----------------------------------------------------------|
| age                      | Patient's age                                             |
| anaemia                  | Decrease of red blood cells or hemoglobin                 |
| creatinine_phosphokinase | Level of the CPK enzyme in the blood                      |
| diabetes                 | If the patient has diabetes                               |
| ejection_fraction        | Percentage of blood leaving the heart at each contraction |
| high_blood_pressure      | If the patient has hypertension                           |
| platelets                | Platelets in the blood                                    |
| serum_creatinine         | Level of serum creatinine in the blood                    |
| serum_sodium             | Level of serum sodium in the blood                        |
| sex                      | Woman or man                                              |
| smoking                  | If the patient smokes or not                              |
| time                     | Follow-up period                                          |
| DEATH_EVENT              | Whether the patient died or not (target variable)         |


## Model

We compared Decision Tree, KNN, and Logistic Regression, and selected Logistic Regression for its interpretability and its cross-validation performance. Logistic Regression models a linear relationship between the features and the log-odds of the outcome, so each coefficient can be read directly as the effect of one feature. It scored higher than the other two models, which is plausible here: with few features and a relatively small dataset, a linear model is less prone to overfitting than more flexible models such as Decision Trees or KNN.
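
The interpretability argument can be illustrated by reading fitted coefficients as odds ratios. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` rather than the heart failure data, and the names `feature_0` to `feature_3` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the heart failure dataset)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=522)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
model.fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-standard-deviation increase in that feature
coefs = model.named_steps["logisticregression"].coef_[0]
odds_ratios = pd.Series(np.exp(coefs), index=feature_names).sort_values(ascending=False)
print(odds_ratios)
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and a value below 1 pushes them away from it.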

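## Results and Conclusion

The final Logistic Regression model reaches an accuracy of 0.78, a precision of 0.67, and a recall of 0.63 on the test set. Machine learning can be useful in helping to save patients' lives.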

## EDA and Analysis

### Dataset and Imports

In [235]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import altair_ally as aly
import os
from vega_datasets import data
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # needed for the Decision Tree pipeline below
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score


# Enable Vegafusion for better data transformation
aly.alt.data_transformers.enable('vegafusion')
alt.data_transformers.enable('vegafusion')

DataTransformerRegistry.enable('vegafusion')

In [61]:
# Load the dataset
file_path = 'data/heart_failure_clinical_records_dataset.csv'
heart_failure_data = pd.read_csv(file_path)

### EDA and Visualisations

In [62]:
heart_failure_data.shape

(299, 13)

In [63]:
heart_failure_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [64]:
heart_failure_data['DEATH_EVENT'].value_counts()

DEATH_EVENT
0    203
1     96
Name: count, dtype: int64

* Dataset Size: The dataset is relatively small, with only 299 rows.
* Class Imbalance: The target variable, DEATH_EVENT, has fewer examples in the positive class (96 of 299 patients, about 32%, died), which might affect the model's ability to learn and generalize well. This class imbalance will be taken into consideration during analysis and model evaluation.
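
To put the imbalance in numbers, the class counts above can be turned into class shares and a majority-class baseline. This is a minimal sketch that hard-codes the counts from the `value_counts()` output; the variable names are illustrative.

```python
import pandas as pd

# Class counts reported by value_counts() above
class_counts = pd.Series({0: 203, 1: 96}, name="count")

# Share of each class in the dataset
class_share = class_counts / class_counts.sum()

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = class_share.max()

print(class_share.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model should be judged against this baseline of roughly 0.68, which is one reason precision and recall are reported alongside accuracy later on.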

In [59]:
# Summary statistics
print("Summary Statistics:")
heart_failure_data.describe()

Summary Statistics:


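|       | age       | anaemia  | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets     | serum_creatinine | serum_sodium | sex      | smoking | time       | DEATH_EVENT |
|-------|-----------|----------|--------------------------|----------|-------------------|---------------------|---------------|------------------|--------------|----------|---------|------------|-------------|
| count | 299.0     | 299.0    | 299.0                    | 299.0    | 299.0             | 299.0               | 299.0         | 299.0            | 299.0        | 299.0    | 299.0   | 299.0      | 299.0       |
| mean  | 60.833893 | 0.431438 | 581.839465               | 0.41806  | 38.083612         | 0.351171            | 263358.029264 | 1.39388          | 136.625418   | 0.648829 | 0.32107 | 130.26087  | 0.32107     |
| std   | 11.894809 | 0.496107 | 970.287881               | 0.494067 | 11.834841         | 0.478136            | 97804.236869  | 1.03451          | 4.412477     | 0.478136 | 0.46767 | 77.614208  | 0.46767     |
| min   | 40.0      | 0.0      | 23.0                     | 0.0      | 14.0              | 0.0                 | 25100.0       | 0.5              | 113.0        | 0.0      | 0.0     | 4.0        | 0.0         |
| 25%   | 51.0      | 0.0      | 116.5                    | 0.0      | 30.0              | 0.0                 | 212500.0      | 0.9              | 134.0        | 0.0      | 0.0     | 73.0       | 0.0         |
| 50%   | 60.0      | 0.0      | 250.0                    | 0.0      | 38.0              | 0.0                 | 262000.0      | 1.1              | 137.0        | 1.0      | 0.0     | 115.0      | 0.0         |
| 75%   | 70.0      | 1.0      | 582.0                    | 1.0      | 45.0              | 1.0                 | 303500.0      | 1.4              | 140.0        | 1.0      | 1.0     | 203.0      | 1.0         |
| max   | 95.0      | 1.0      | 7861.0                   | 1.0      | 80.0              | 1.0                 | 850000.0      | 9.4              | 148.0        | 1.0      | 1.0     | 285.0      | 1.0         |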


In [5]:
# Check for missing values

missing_values = heart_failure_data.isnull().sum()
print("\nMissing Values:")
print(missing_values)


Missing Values:
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64


There are no missing values, so no imputation is required.

In [7]:
aly.heatmap(heart_failure_data,color="DEATH_EVENT")

RuntimeError: The versions of the vegafusion and vegafusion-python-embed packages must match
and must be version 1.5.0 or greater.
Found:
 - vegafusion==2.0.0rc1
 - vegafusion-python-embed==1.6.9


alt.VConcatChart(...)

In [8]:
# 1. Distributions of all columns
print("Visualizing distributions for all columns...")
aly.dist(heart_failure_data)

Visualizing distributions for all columns...


RuntimeError: The versions of the vegafusion and vegafusion-python-embed packages must match
and must be version 1.5.0 or greater.
Found:
 - vegafusion==2.0.0rc1
 - vegafusion-python-embed==1.6.9


alt.ConcatChart(...)

In [5]:
aly.pair(heart_failure_data,color="DEATH_EVENT")

In [6]:
# Correlations for the heart failure data (not the vega_datasets movies example)
aly.corr(heart_failure_data)

In [7]:
aly.parcoord(heart_failure_data,color = 'DEATH_EVENT')

In [8]:
# Create the distribution plots
aly.dist(heart_failure_data,color = 'DEATH_EVENT')

### Data Splitting

In [86]:
heart_failure_train, heart_failure_test = train_test_split(heart_failure_data, 
                                                           train_size = 0.8, 
                                                           stratify = heart_failure_data['DEATH_EVENT'],
                                                           random_state = 522)

url_processed = 'data/processed/'
# Create the output directory if it does not exist yet
os.makedirs(url_processed, exist_ok=True)
heart_failure_train.to_csv(os.path.join(url_processed, 'heart_failure_train.csv'))
heart_failure_test.to_csv(os.path.join(url_processed, 'heart_failure_test.csv'))
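
The effect of `stratify` can be checked on a toy frame. The sketch below assumes a made-up DataFrame with the same class counts as the dataset (203 zeros, 96 ones) and one random filler column; it is not the real patient data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same class counts as the dataset; the feature column
# is random filler, not real patient data
rng = np.random.default_rng(522)
toy = pd.DataFrame({
    "feature": rng.normal(size=299),
    "DEATH_EVENT": np.r_[np.zeros(203, dtype=int), np.ones(96, dtype=int)],
})

toy_train, toy_test = train_test_split(
    toy, train_size=0.8, stratify=toy["DEATH_EVENT"], random_state=522
)

# With stratify, the positive rate is nearly identical in both splits
train_rate = toy_train["DEATH_EVENT"].mean()
test_rate = toy_test["DEATH_EVENT"].mean()
print(len(toy_train), len(toy_test))
print(round(train_rate, 3), round(test_rate, 3))
```

Without stratification, a test set of only 60 rows could end up with a noticeably different share of deaths than the training set, which would make the test metrics harder to interpret.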

### Preprocessing columns

In [148]:
# Define numeric columns
numeric_columns = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 
                   'platelets', 'serum_creatinine', 'serum_sodium', 'time']
# List of binary columns
binary_columns = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']

# Convert all binary columns to True/False so they're treated as categorical data
heart_failure_train[binary_columns] = heart_failure_train[binary_columns].astype(bool)
heart_failure_test[binary_columns] = heart_failure_test[binary_columns].astype(bool)

In [149]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop='if_binary', dtype = int), binary_columns),
    remainder = 'passthrough'
)

# preprocessor.fit(heart_failure_train)
# heart_failure_scaled_train = preprocessor.transform(heart_failure_train)
# heart_failure_scaled_test = preprocessor.transform(heart_failure_test)

### Building Model
We test three models: Decision Tree, KNN, and Logistic Regression.

#### Decision Tree

In [244]:
pipeline = make_pipeline(
        preprocessor, 
        DecisionTreeClassifier(random_state=522)
    )

dt_scores = cross_validate(pipeline, 
                           heart_failure_train.drop(columns=['DEATH_EVENT']), 
                           heart_failure_train['DEATH_EVENT'],
                           return_train_score=True
                          )

dt_scores = pd.DataFrame(dt_scores).sort_values('test_score', ascending = False)
dt_scores

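|   | fit_time | score_time | test_score | train_score |
|---|----------|------------|------------|-------------|
| 4 | 0.005264 | 0.002517   | 0.829787   | 1.0         |
| 1 | 0.007444 | 0.002875   | 0.8125     | 1.0         |
| 3 | 0.005588 | 0.002598   | 0.791667   | 1.0         |
| 2 | 0.005796 | 0.00278    | 0.770833   | 1.0         |
| 0 | 0.010477 | 0.00349    | 0.666667   | 1.0         |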


#### KNN

In [182]:
pipeline = make_pipeline(
        preprocessor, 
        KNeighborsClassifier()
    )

param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 3)
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=10,  
    n_jobs=-1,  
    return_train_score=True,
)

heart_failure_fit = grid_search.fit(heart_failure_train.drop(columns=['DEATH_EVENT']), heart_failure_train['DEATH_EVENT'] )

knn_best_model = grid_search.best_estimator_ 
knn_best_model

In [187]:
pd.DataFrame(grid_search.cv_results_).sort_values('mean_test_score', ascending = False)[['params', 'mean_test_score']].iloc[0]

params             {'kneighborsclassifier__n_neighbors': 19}
mean_test_score                                     0.777899
Name: 6, dtype: object


#### Logistic Regression

In [195]:
pipeline = make_pipeline(
        preprocessor, 
        LogisticRegression(random_state=522, max_iter=2000)
    )

param_grid = {
    "logisticregression__C": 10.0 ** np.arange(-4, 7, 1)
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=10,  
    n_jobs=-1,  
    return_train_score=True
)

heart_failure_fit = grid_search.fit(heart_failure_train.drop(columns=['DEATH_EVENT']), heart_failure_train['DEATH_EVENT'] )

lr_best_model = grid_search.best_estimator_ 
lr_best_model

In [205]:
lr_scores = pd.DataFrame(grid_search.cv_results_).sort_values('mean_test_score', ascending = False)[['param_logisticregression__C', 'mean_test_score', 'mean_train_score']]
lr_scores.iloc[0:5]

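|   | param_logisticregression__C | mean_test_score | mean_train_score |
|---|-----------------------------|-----------------|------------------|
| 4 | 1.0                         | 0.853986        | 0.871688         |
| 5 | 10.0                        | 0.849819        | 0.871688         |
| 6 | 100.0                       | 0.849819        | 0.871223         |
| 7 | 1000.0                      | 0.849819        | 0.871223         |
| 8 | 10000.0                     | 0.849819        | 0.871223         |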


The model performs best with C = 1.0. The mean test score (0.854) is close to the mean train score (0.872), which indicates that the model is neither overfitting nor underfitting.
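
For context on the hyperparameter being tuned: in scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more. The sketch below demonstrates this on synthetic data from `make_classification` (an assumption, not the heart failure dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the heart failure dataset)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=522)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=2000))
    model.fit(X, y)
    coefs = model.named_steps["logisticregression"].coef_
    # Smaller C means a stronger L2 penalty, which shrinks the coefficients
    coef_norms[C] = float(np.linalg.norm(coefs))

print(coef_norms)
```

The coefficient norm grows as `C` increases, which matches the grid search picture: beyond a certain `C` the penalty has little effect and the scores level off.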

In [220]:
alt.Chart(lr_scores).mark_line().encode(
    x = "param_logisticregression__C",
    y = "mean_test_score",
    color = alt.Color(value = "skyblue")
) + alt.Chart(lr_scores).mark_line().encode(
    x = "param_logisticregression__C",
    y = "mean_train_score",
    color = alt.Color(value = "pink")
)

TypeError: Too few parameters for <class 'altair.utils.plugin_registry.PluginRegistry'>; actual 1, expected at least 2

alt.LayerChart(...)

**Logistic Regression performs better than Decision Tree and KNN on the cross-validation data. Hence, we select it as our final model.**
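
The comparison can be summarized in one place. The sketch below hard-codes the cross-validation scores printed in the outputs above; note that the Decision Tree was evaluated untuned with 5 folds while the other two used a 10-fold grid search, so the comparison is indicative rather than strictly like-for-like.

```python
import pandas as pd

# Per-fold test scores of the Decision Tree, copied from the output above
dt_fold_scores = [0.829787, 0.8125, 0.791667, 0.770833, 0.666667]

# Mean cross-validation accuracy per model
cv_summary = pd.Series({
    "Decision Tree": sum(dt_fold_scores) / len(dt_fold_scores),  # 5-fold mean
    "KNN (n_neighbors=19)": 0.777899,                            # 10-fold mean
    "Logistic Regression (C=1.0)": 0.853986,                     # 10-fold mean
}).sort_values(ascending=False)

print(cv_summary.round(3))
```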

### Evaluation

#### Confusion Matrix

In [231]:
# Confusion Matrix

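# Drop the target column so that only the features are passed to predict
heart_failure_predictions = heart_failure_test.assign(
    predicted=heart_failure_fit.predict(heart_failure_test.drop(columns=['DEATH_EVENT']))
)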

cm_crosstab = pd.crosstab(heart_failure_predictions['DEATH_EVENT'], 
                          heart_failure_predictions['predicted'], 
                          rownames=["Actual"], 
                          colnames=["Predicted"]
                         )


cm_crosstab
# cm = confusion_matrix(heart_failure_test["DEATH_EVENT"], heart_failure_fit.predict(heart_failure_test))
# cm

| Actual \ Predicted | 0  | 1  |
|--------------------|----|----|
| 0                  | 35 | 6  |
| 1                  | 7  | 12 |


In [238]:
accuracy = accuracy_score(heart_failure_predictions['DEATH_EVENT'], heart_failure_predictions['predicted'])
precision = precision_score(heart_failure_predictions['DEATH_EVENT'], heart_failure_predictions['predicted'])
recall = recall_score(heart_failure_predictions['DEATH_EVENT'], heart_failure_predictions['predicted'])
f1 = f1_score(heart_failure_predictions['DEATH_EVENT'], heart_failure_predictions['predicted'])

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.7833
Precision: 0.6667
Recall: 0.6316
F1 Score: 0.6486
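
These metrics follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four cell counts, treating `DEATH_EVENT = 1` as the positive class; the variable names are illustrative.

```python
# Counts read from the confusion matrix above (positive class = 1)
tn, fp = 35, 6   # actual 0: predicted 0, predicted 1
fn, tp = 7, 12   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted deaths that were real
recall = tp / (tp + fn)      # share of real deaths that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

A recall of 0.63 means that 7 of the 19 patients who died in the test set were not flagged by the model.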
