In [560]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Comparing Classifiers for Predicting Death in Heart Failure Patients

## Introduction and Goal

Heart failure is a condition where the heart becomes too weak to pump blood effectively throughout the body. Patient outcomes can vary widely and often depend on many biological factors.

**The goal of this project is to train and compare three classifiers for predicting survival in heart failure patients.** The models I’ll be using are K-Nearest Neighbors, Gaussian Naive Bayes, and Logistic Regression. Each model will learn from biological features to predict whether a patient survived or not.

The process will be as follows: first, train all three models and evaluate them using multiple performance metrics. Next, refine the models through parameter tuning and dimensionality reduction. Finally, compare their performances and determine which model provides the most reliable predictions.

## Data Description

The dataset contains information on 299 patients with heart failure. Each row represents one patient, and each of the 13 columns provides a biological or clinical feature:

- Age
- Sex
- Whether they are anaemic
- Whether they have hypertension
- CPK enzyme level
- Whether they have diabetes
- Ejection fraction (percentage of blood leaving the heart with each contraction)
- Platelet count
- Serum creatinine level
- Serum sodium level
- Smoking status
- Follow-up period
- Death event

The follow-up period indicates how long a patient was monitored after their heart failure diagnosis, with an average of 130 days. The death event column is the target variable and records whether the patient survived (0) or died (1) during follow-up.

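For prediction, I’ll use the 11 biological and clinical features, leaving out the follow-up period and the death event. Since all of them are numeric or already encoded numerically, no additional encoding is needed; the only preparation required is a train/test split and feature scaling.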

Access the data [here](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records).

## Data Preprocessing

In [561]:
df = pd.read_csv('dataset.csv')
df.head()

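| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |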


Check for missing values: 

In [562]:
df.isna().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

Check for balance of groups: 

In [563]:
df['DEATH_EVENT'].value_counts()

0    203
1     96
Name: DEATH_EVENT, dtype: int64

Separate features from target variable:

In [564]:
X = df.drop(['DEATH_EVENT', 'time'], axis = 1) # features 
y = df['DEATH_EVENT'] # target variable 
y_labels = ['Died' if yi == 1 else 'Survived' for yi in y] # target labels 
model_names = ['KNN', 'Gaussian NB', 'LR']

Separate into training and testing sets for evaluation purposes. Stratify by the target variable so that the training and testing sets keep the same proportion of survivors and deaths as the full dataset.

In [565]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    stratify=y, 
                                                    random_state=26)
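As a quick check of what stratification does, the sketch below rebuilds a label vector from the class counts reported above (203 survived, 96 died) and splits it the same way. The array names are illustrative and no patient features are involved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts above: 203 survived (0), 96 died (1)
y_demo = np.array([0] * 203 + [1] * 96)
row_ids = np.arange(len(y_demo))  # stand-in for the feature rows

ids_train, ids_test, y_demo_train, y_demo_test = train_test_split(
    row_ids, y_demo, test_size=0.2, stratify=y_demo, random_state=26)

# The share of deaths should be nearly identical in all three sets
print("Overall death rate:", round(y_demo.mean(), 3))
print("Train death rate:  ", round(y_demo_train.mean(), 3))
print("Test death rate:   ", round(y_demo_test.mean(), 3))
print("Test set size:", len(y_demo_test))
```

Without `stratify`, a split of a dataset this small could by chance leave the test set with noticeably more or fewer deaths than the training set.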

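Standardize each feature to have a mean of 0 and a variance of 1, fitting the scaler on the training set only and reusing it for the test set. This matters most for KNN, where unscaled features with large ranges (such as platelet count) would otherwise dominate the distance computations.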

In [566]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
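The effect of the scaler can be seen on a tiny example. This is a sketch on made-up numbers (an age-like column and a platelet-like column), not the patient data, and the `toy_` names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training/test arrays (illustrative values, not the patient data)
toy_train = np.array([[40.0, 100000.0],
                      [60.0, 250000.0],
                      [80.0, 400000.0]])
toy_test = np.array([[50.0, 300000.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learn mean/std from training rows only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuse the training mean/std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)                # test row expressed in training-set units
```

After scaling, both columns are on a comparable scale, so neither one dominates a Euclidean distance.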

## Analysis: Baseline Models 

First, I'm going to train and test the three models on the 11 standardized features with default hyperparameters.

**K-Nearest Neighbors (KNN)** is an instance-based, non-parametric algorithm that predicts outcomes based on similarity to nearby data points. Using k = 5 neighbors as the baseline parameter, it classifies a patient by majority vote among the 5 closest points. Unlike probabilistic models, it doesn’t build an explicit model of the data distribution and instead makes predictions directly from the training set. It’s sensitive to feature scaling and can be slower at larger scales, but it’s useful because it makes very few assumptions about the data.
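The majority-vote rule can be written out by hand in a few lines. This is a minimal sketch on toy points, assuming Euclidean distance and already-scaled features; `knn_predict_one` is a hypothetical helper and not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each training row
    nearest = np.argsort(distances)[:k]                   # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())            # count labels among those neighbors
    return votes.most_common(1)[0][0]

# Toy, already-scaled data (illustrative values only)
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
                  [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.1, 0.1])))  # neighbors are mostly class 0
print(knn_predict_one(X_toy, y_toy, np.array([2.0, 2.0])))  # neighbors are mostly class 1
```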

**Gaussian Naive Bayes** is a fast, simple probabilistic classifier that applies Bayes’ theorem under the assumption of feature independence. The Gaussian version models continuous features as normally distributed. The performance of this model will depend on how well the normality assumption holds for the training features.

**Logistic Regression** is a linear probabilistic model that predicts the probability of each outcome using a logistic function of the input features. With scikit-learn's default L2 penalty and inverse regularization strength C = 1.0, it balances fit and simplicity. Since it doesn’t rely on independence or normality assumptions, it may outperform Naive Bayes when features are correlated or deviate from a Gaussian distribution.

Model performance will be evaluated using 4 metrics:

- Accuracy: The overall proportion of correct predictions out of all predictions.
- Precision: Of the patients predicted to die, the proportion that actually died.
- Recall: Of the patients who actually died, the proportion that the model correctly identified.
- F1-Score: The harmonic mean of precision and recall.

Since the groups are unbalanced (<33% of participants died), it’s important to evaluate model performance using more than just accuracy. A model that predicts all heart failure patients will survive would have an accuracy >67%, despite being essentially useless. The F1-score is a better measure of model fit because it balances the effects of false negatives and false positives. A false negative in this context is a patient who dies but is predicted to survive, while a false positive is the opposite: a patient who survives but is predicted to die. As in most medical cases, it’s better to catch as many at-risk patients as possible, even if that means some patients are flagged who ultimately survive. That said, maintaining a balance is still important, which is why the F1-score is a valuable metric.
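The "always predict survival" case can be checked directly. The sketch below rebuilds the labels from the class counts reported earlier (203 survived, 96 died) and scores a hypothetical model that never predicts death; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Labels rebuilt from the class counts above: 203 survived (0), 96 died (1)
y_true = np.array([0] * 203 + [1] * 96)
y_all_survive = np.zeros_like(y_true)  # hypothetical model that always predicts "survived"

acc = accuracy_score(y_true, y_all_survive)
# zero_division=0 sets the score to 0 (without a warning) when no deaths are predicted
prec = precision_score(y_true, y_all_survive, zero_division=0)
rec = recall_score(y_true, y_all_survive)
f1 = f1_score(y_true, y_all_survive, zero_division=0)

print(f"Accuracy: {acc:.3f}, Precision: {prec:.3f}, Recall: {rec:.3f}, F1: {f1:.3f}")
```

Accuracy looks respectable because most patients survived, while recall and F1 show that the model never identifies a single death.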

In [567]:
baseline_performances = []

In [568]:
def trainAndTest(model, X_train, y_train, X_test, y_test, name):
    """
    Train a model and evaluate it on the test set.
    - Fits the model on the training data
    - Predicts labels for the test data
    - Computes performance metrics: accuracy, precision, recall, F1
    - Returns a dictionary of performance metrics labeled with the model name
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Precision, recall and F1 are computed for the positive class (1 = died)
    performance = {"Model": name,
                   "Accuracy": accuracy_score(y_test, y_pred),
                   "Precision": precision_score(y_test, y_pred),
                   "Recall": recall_score(y_test, y_pred),
                   "F1": f1_score(y_test, y_pred)}
    
    return performance

#### K-Nearest Neighbors

In [569]:
performance_knn = trainAndTest(KNeighborsClassifier(n_neighbors=5),
                               X_train_scaled, 
                               y_train, 
                               X_test_scaled, 
                               y_test, 
                              'KNN')
baseline_performances.append(performance_knn)

#### Gaussian Naive Bayes 

In [570]:
performance_gnb = trainAndTest(GaussianNB(),
                               X_train_scaled, 
                               y_train, 
                               X_test_scaled, 
                               y_test,
                              'GNB')
baseline_performances.append(performance_gnb)

#### Logistic Regression

In [571]:
performance_lr = trainAndTest(LogisticRegression(max_iter=1000),
                               X_train_scaled, 
                               y_train, 
                               X_test_scaled, 
                               y_test,
                             'LR')
baseline_performances.append(performance_lr)
baseline_df = pd.DataFrame(baseline_performances)

## Analysis: Refined Models

### Dimensionality Reduction via PCA

By transforming correlated features into a smaller set of uncorrelated components, PCA can help remove noise and redundant information, which may improve model performance. This is especially useful for algorithms like KNN that are sensitive to irrelevant or highly correlated features, and it can also make probabilistic models like Naive Bayes and Logistic Regression more robust. 

In [572]:
pca = PCA(n_components=0.95, random_state=26) # using enough PCs to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
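To see how many components a 95% variance threshold keeps, the fitted PCA object can be inspected. The sketch below uses synthetic data with two deliberately redundant columns rather than the patient data, so the `_demo` names and values are illustrative; in the notebook, the same attributes are available on `pca`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(26)
base = rng.normal(size=(200, 3))
# Synthetic data: 3 independent columns plus 2 noisy copies (illustrative only)
X_demo = np.column_stack([base,
                          base[:, 0] + 0.05 * rng.normal(size=200),
                          base[:, 1] + 0.05 * rng.normal(size=200)])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps the fewest components reaching that share of variance
pca_demo = PCA(n_components=0.95)
X_demo_pca = pca_demo.fit_transform(X_demo_scaled)

print("Original features:", X_demo.shape[1])
print("Components kept:", pca_demo.n_components_)
print("Cumulative variance explained:", np.cumsum(pca_demo.explained_variance_ratio_))
```

Because the two copied columns carry almost no new information, fewer components than original features are enough to pass the threshold.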

### Parameter Tuning 

Choosing the right parameters can significantly improve model performance by balancing underfitting and overfitting.

#### K-Nearest Neighbors

The hyperparameters I will test for the KNN algorithm:
- K: the number of neighbors that vote on each prediction (3 to 21)
- Weighting scheme: uniform votes, or votes weighted by inverse distance

I'm keeping the distance metric as Euclidean and using F1-score to evaluate model fit/parameter performance. 

In [573]:
knn_param_grid = {
    "n_neighbors": list(range(3, 22)),  
    "weights": ["uniform", "distance"], 
    "metric": ["euclidean"]}
knn_grid = GridSearchCV(KNeighborsClassifier(), 
                        knn_param_grid,
                        cv=5,
                        scoring="f1")
knn_grid.fit(X_train_pca, y_train)
best_knn_params = knn_grid.best_params_
print("Best parameters:", best_knn_params)

Best parameters: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
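One caveat about the search above: the scaler and PCA were fit on the whole training set before cross-validation, so each validation fold had already influenced those transforms. A common alternative, sketched here on synthetic stand-in data, wraps the steps in a `Pipeline` so they are refit inside every fold. The step names (`scale`, `pca`, `knn`) and the synthetic dataset are assumptions for illustration, and the best parameters it reports say nothing about the patient data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient data (illustrative only)
X_syn, y_syn = make_classification(n_samples=240, n_features=11, n_informative=5,
                                   weights=[0.68, 0.32], random_state=26)

# Scaling and PCA are refit on the training portion of every CV fold
knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("pca", PCA(n_components=0.95)),
                     ("knn", KNeighborsClassifier(metric="euclidean"))])

pipe_grid = GridSearchCV(knn_pipe,
                         {"knn__n_neighbors": list(range(3, 22)),
                          "knn__weights": ["uniform", "distance"]},
                         cv=5,
                         scoring="f1")
pipe_grid.fit(X_syn, y_syn)
print("Best parameters:", pipe_grid.best_params_)
```

With the notebook's data, the pipeline would be fit on the unscaled `X_train` and `y_train`, since the scaling step is part of the pipeline itself.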


Now, I can train/test the refined KNN model using the optimal parameters and PCA-transformed features. 

In [574]:
refined_performances = []
performance_knn_refined = trainAndTest(KNeighborsClassifier(**best_knn_params),
                                       X_train_pca, 
                                       y_train, 
                                       X_test_pca, 
                                       y_test, 
                                      'KNN')
refined_performances.append(performance_knn_refined)

#### Gaussian Naive Bayes 

There are no major hyperparameters to tune for the Gaussian NB algorithm, so I can go ahead and train the refined model using the PCA-transformed data.

In [575]:
performance_gnb_refined = trainAndTest(GaussianNB(),
                               X_train_pca, 
                               y_train, 
                               X_test_pca, 
                               y_test, 
                               'GNB')
refined_performances.append(performance_gnb_refined)

#### Logistic Regression

The hyperparameter to be tuned for the Logistic Regression algorithm is C, the inverse of the regularization strength: smaller values of C regularize more heavily. I will test values on a logarithmic scale: 0.01, 0.1, 1, 10, 100. Again, I'm using F1-score to evaluate model fit/parameter performance.

In [576]:
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000), 
                       {"C": [0.01, 0.1, 1, 10, 100]},
                       cv=5,                  
                       scoring="f1")
lr_grid.fit(X_train_pca, y_train)
best_lr_params = lr_grid.best_params_
print("Best Logistic Regression params:", best_lr_params)

Best Logistic Regression params: {'C': 1}
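To make the role of C concrete, the sketch below fits the same model at each grid value and reports the size of the learned coefficients. It uses synthetic stand-in data rather than the patient data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
X_syn, y_syn = make_classification(n_samples=240, n_features=8, random_state=26)
X_syn = StandardScaler().fit_transform(X_syn)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    lr = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    coef_norms[C] = float(np.linalg.norm(lr.coef_))  # overall size of the fitted coefficients

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

Small values of C shrink the coefficients toward zero (a simpler model), while large values let them grow to fit the training data more closely.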


The optimal value happens to be the same as the default value, so the refined LR model differs from the baseline model only in that it uses PCA-transformed training features.

In [577]:
performance_lr_refined = trainAndTest(LogisticRegression(max_iter=1000, 
                                                        **best_lr_params),
                               X_train_pca, 
                               y_train, 
                               X_test_pca, 
                               y_test, 
                               'LR')
refined_performances.append(performance_lr_refined)
refined_df = pd.DataFrame(refined_performances)

## Results

In [578]:
print('Baseline Models Performances:')
baseline_df

Baseline Models Performances:


| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | KNN | 0.633333 | 0.2 | 0.052632 | 0.083333 |
| 1 | GNB | 0.683333 | 0.5 | 0.263158 | 0.344828 |
| 2 | LR | 0.7 | 0.538462 | 0.368421 | 0.4375 |


In [579]:
print('Refined Models Performances:')
refined_df

Refined Models Performances:


| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | KNN | 0.633333 | 0.2 | 0.052632 | 0.083333 |
| 1 | GNB | 0.683333 | 0.5 | 0.315789 | 0.387097 |
| 2 | LR | 0.716667 | 0.6 | 0.315789 | 0.413793 |


Looking at the baseline model performances, Logistic Regression (LR) achieves the highest accuracy (0.70) and F1-score (0.4375), followed by Gaussian Naive Bayes (GNB), with K-Nearest Neighbors (KNN) performing the worst. KNN struggles in this dataset, particularly in recall (0.0526, or 1 of the 19 deaths in the test set), meaning it misses almost all patients who actually died. Its accuracy (0.633) is also below what a model predicting survival for every patient would achieve. This poor performance may be due to the small dataset size and KNN's sensitivity to irrelevant or correlated features; the features were already standardized, so scaling is unlikely to be the cause. GNB and LR show more balanced results, with LR outperforming GNB on every metric.

After refinement with parameter tuning and PCA, the results are mixed. LR is the only model whose accuracy improves (0.70 to 0.7167), and its precision rises to 0.6, but its recall decreases (0.3684 to 0.3158), resulting in a small drop in F1 (0.4375 to 0.4138). GNB improves in recall (0.2632 to 0.3158) and F1-score (0.3448 to 0.3871) with precision unchanged at 0.5, indicating it catches more of the at-risk patients without sacrificing precision. KNN is unchanged on every metric. The grid search selected the same parameters as the baseline (k = 5, uniform weights), so PCA was the only difference, and it did not help. This reinforces that KNN is not well-suited to this dataset even after tuning and dimensionality reduction.
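The changes described above can be tabulated by subtracting the two results tables. The sketch below re-enters the reported metrics by hand so that it runs on its own; the `baseline`, `refined`, and `delta` names are illustrative.

```python
import pandas as pd

# Metrics copied from the two results tables above
baseline = pd.DataFrame({"Model": ["KNN", "GNB", "LR"],
                         "Accuracy": [0.633333, 0.683333, 0.700000],
                         "Precision": [0.2, 0.5, 0.538462],
                         "Recall": [0.052632, 0.263158, 0.368421],
                         "F1": [0.083333, 0.344828, 0.437500]}).set_index("Model")
refined = pd.DataFrame({"Model": ["KNN", "GNB", "LR"],
                        "Accuracy": [0.633333, 0.683333, 0.716667],
                        "Precision": [0.2, 0.5, 0.6],
                        "Recall": [0.052632, 0.315789, 0.315789],
                        "F1": [0.083333, 0.387097, 0.413793]}).set_index("Model")

# Positive values mean the refined model scored higher than the baseline
delta = (refined - baseline).round(4)
print(delta)
```

In the notebook itself, the same table could be produced with `refined_df.set_index("Model") - baseline_df.set_index("Model")`.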

Applying PCA reduces the dataset to fewer orthogonal features, which can affect the models differently. GNB may perform better because the independence assumption becomes more valid when features are uncorrelated. LR could lose some interpretability since the principal components are combinations of original features, but it may gain speed and stability. KNN might benefit if PCA reduces noise, though in this case the improvement appears limited.

The differences in performance also reflect the underlying model assumptions. GNB assumes feature independence and normally distributed features, which may not fully hold in the raw dataset, limiting its predictive power. LR, by contrast, does not make these assumptions and estimates all coefficients jointly, so correlated features do not violate the model, making it more reliable here.

## Conclusion 

This analysis suggests that LR provides the most accurate and interpretable predictions for heart failure patient survival, with the highest accuracy and F1-score in both the baseline and refined comparisons. Even so, no model identified more than 37% of the patients who died in the test set, so recall remains the main weakness. GNB may still be useful for quickly identifying at-risk patients, especially after refinement and dimensionality reduction, but KNN appears unsuitable for this type of structured, correlated clinical data. These results highlight the importance of selecting models whose assumptions align with the characteristics of the dataset.