# Predicting the presence of heart disease with key health metrics and attributes

by Ethan Fang, Caroline Kahare and Alex Wong

# Introduction 

The objective of this project is to build a classification model that predicts the presence of heart disease based on key health metrics and attributes. It aims to contribute to the understanding and early detection of heart disease, which is crucial for effective medical intervention and prevention. 


# Methods

## Data 

The dataset is created by R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher and was sourced from UC Irvine's Machine Learning Repository. The original dataset consists of 920 observation and 76 attributes, however only 13 attributes were used for the project. The target variable, label, indicates the presence or absence of heart disease, with values ranging from 0 (no presence) to 4 (indicating varying levels of severity). The dataset also contains demographic, clinical, and diagnostic attributes, offering a comprehensive view of patient health metrics.

Key features of the dataset include demographic indicators such as age (age in years) and sex (gender: 1 = male, 0 = female). Clinical measurements include trestbps (resting blood pressure), chol (serum cholesterol levels), thalach (maximum heart rate achieved), and oldpeak (ST depression induced by exercise). Additionally, categorical variables such as cp (chest pain type), fbs (fasting blood sugar > 120 mg/dl), restecg (resting electrocardiographic results), exang (exercise-induced angina), slope (slope of the peak exercise ST segment), ca (number of major vessels colored by fluoroscopy), and thal (heart imaging defects) provide valuable context for predicting heart disease. 


1. **age**: Age in years  
2. **sex**: Sex  
   - `1` = Male  
   - `0` = Female  
3. **cp**: Chest pain type  
   - Value `1`: Typical angina  
   - Value `2`: Atypical angina  
   - Value `3`: Non-anginal pain  
   - Value `4`: Asymptomatic  
4. **trestbps**: Resting blood pressure (in mm Hg on admission to the hospital)  
5. **chol**: Serum cholesterol in mg/dl  
6. **fbs**: Fasting blood sugar (> 120 mg/dl)  
   - `1` = True  
   - `0` = False  
7. **restecg**: Resting electrocardiographic results  
   - Value `0`: Normal  
   - Value `1`: Having ST-T wave abnormality (T wave inversions and/or ST  
     elevation or depression of > 0.05 mV)  
   - Value `2`: Showing probable or definite left ventricular hypertrophy by  
     Estes' criteria  
8. **thalach**: Maximum heart rate achieved  
9. **exang**: Exercise-induced angina  
   - `1` = Yes  
   - `0` = No  
10. **oldpeak**: ST depression induced by exercise relative to rest  
11. **slope**: The slope of the peak exercise ST segment  
    - Value `1`: Upsloping  
    - Value `2`: Flat  
    - Value `3`: Downsloping  
12. **ca**: Number of major vessels (0-3) colored by fluoroscopy  
13. **thal**:  
    - `3` = Normal  
    - `6` = Fixed defect  
    - `7` = Reversible defect
14. **label**: 
    - `0` = Absence 
    - `1` = Presence 
    - `2` = Presence
    - `3` = Presence
    - `4` = Presence

In [8]:
import numpy as np
import pandas as pd
import altair_ally as aly
import altair as alt
import pandera as pa
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureLabelCorrelation, FeatureFeatureCorrelation

In [4]:
columns = ['age','sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'label']

In [25]:
file_path_hungarian = 'data/processed.hungarian.data'
file_path_switzerland = 'data/processed.switzerland.data'
file_path_cleveland = 'data/processed.cleveland.data'
file_path_va = 'data/processed.va.data'
hungary_df = pd.read_csv(file_path_hungarian,index_col=False, names = columns)
swiss_df = pd.read_csv(file_path_switzerland,index_col=False, names = columns)
cleveland_df = pd.read_csv(file_path_cleveland,index_col=False, names = columns)
va_df = pd.read_csv(file_path_va,index_col=False, names = columns)

# Combine the four dataset into one consolidated set 
combined_df = pd.concat([hungary_df, swiss_df, cleveland_df, va_df], axis = 0)
combined_df.replace("?", pd.NA, inplace = True)
combined_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,label
0,28.0,1.0,2.0,130,132,0,2,185,0,0.0,,,,0
1,29.0,1.0,2.0,120,243,0,0,160,0,0.0,,,,0
2,29.0,1.0,2.0,140,,0,0,170,0,0.0,,,,0
3,30.0,0.0,1.0,170,237,0,1,170,0,0.0,,,6,0
4,31.0,0.0,2.0,100,219,0,1,150,0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,54.0,0.0,4.0,127,333,1,1,154,0,0,,,,1
196,62.0,1.0,1.0,,139,0,1,,,,,,,0
197,55.0,1.0,4.0,122,223,1,1,100,0,0,,,6,2
198,58.0,1.0,4.0,,385,1,2,,,,,,,0


## Data Validation

In [53]:
## --- 1. Correct data file format

file_path = [file_path_hungarian, file_path_switzerland, file_path_cleveland, file_path_va]
if False in [path.endswith('.data') for path in file_path]:
    print("Warning: The file extension is not .data")
else:
    print("File is in the expected format.")

File is in the expected format.


In [54]:
## --- 2. Correct column names

expected_names = set(columns)
actual_names = set(combined_df.columns)
if expected_names != actual_names:
    print(f"Warning: Column names do not match. Expected: {columns}, Found: {combined_df.columns.tolist()}")
else:
    print("Column names are correct.")

Column names are correct.


In [58]:
## --- 3. No empty observations

empty_obs_schema = pa.DataFrameSchema(
    checks = [
        pa.Check(lambda df: ~(df.isna().all(axis = 1)).any(), error = "Empty rows found.")
    ]
)
try:
    empty_obs_schema.validate(combined_df)
    print("No missing row found.")
except pa.errors.SchemaError as a:
    print(f"Warning: There are {combined_df.isna().sum().sum()} missing values in dataset.")

No missing row found.


In [18]:
## --- 4. Missingness not beyond expected threshold

threshold = 0.05
missing_prop = combined_df.isna().mean()
for col, prop in missing_prop.items():
    if prop > threshold:
        print(f"Warning: There're too many missing values in column '{col}'.")
    else:
        print(f"Column '{col}' passed the test of missingness.")

Column 'age' passed the test of missingness.
Column 'sex' passed the test of missingness.
Column 'cp' passed the test of missingness.
Column 'trestbps' passed the test of missingness.
Column 'chol' passed the test of missingness.
Column 'fbs' passed the test of missingness.
Column 'restecg' passed the test of missingness.
Column 'thalach' passed the test of missingness.
Column 'exang' passed the test of missingness.
Column 'oldpeak' passed the test of missingness.
Column 'slope' passed the test of missingness.
Column 'ca' passed the test of missingness.
Column 'thal' passed the test of missingness.
Column 'label' passed the test of missingness.


In [19]:
## --- 5. Correct data types in each column

column_type_schema = pa.DataFrameSchema(
    {
        "age": pa.Column(pa.Int, nullable = True),
        "sex": pa.Column(pa.Int, nullable = True),
        "cp": pa.Column(pa.String, nullable = True),
        "trestbps": pa.Column(pa.Int, nullable = True),
        "chol": pa.Column(pa.Int, nullable = True),
        "fbs": pa.Column(pa.Int, nullable = True),
        "restecg": pa.Column(pa.String, nullable = True),
        "thalach": pa.Column(pa.Int, nullable = True),
        "exang": pa.Column(pa.String, nullable = True),
        "oldpeak": pa.Column(pa.Float, nullable = True),
        "slope": pa.Column(pa.String, nullable = True),
        "ca": pa.Column(pa.Float, nullable = True),
        "thal": pa.Column(pa.String, nullable = True),
        "label": pa.Column(pa.Int, nullable = True)
    }    
)
try:
    column_type_schema.validate(combined_df)
    print("All columns have correct data types.")
except pa.errors.SchemaError as e:
    print(f"Warning: Validation failed: {e}")



In [25]:
## --- 6. No duplicate observations

duplicate_obs_schema = pa.DataFrameSchema(
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="There're duplicate rows")
    ]
)
try:
    duplicate_obs_schema.validate(combined_df)
    print("No duplicate rows found.")
except pa.errors.SchemaError as e:
    duplicate_rows = combined_df[combined_df.duplicated(keep=False)]
    print(f"Warning: There're duplicate rows: \n{duplicate_rows}.")

      age  sex   cp trestbps chol fbs restecg thalach exang oldpeak slope ca  \
101  49.0  0.0  2.0      110    ?   0       0     160     0     0.0     ?  ?   
102  49.0  0.0  2.0      110    ?   0       0     160     0     0.0     ?  ?   
139  58.0  1.0  3.0      150  219   0       1     118     1       0     ?  ?   
187  58.0  1.0  3.0      150  219   0       1     118     1       0     ?  ?   

    thal  label  
101    ?      0  
102    ?      0  
139    ?      2  
187    ?      2  .


In [76]:
## --- 7. No outlier or anomalous values

values_schema = pa.DataFrameSchema({
    "age": pa.Column(float, pa.Check.between(0, 120), nullable=True),
    "sex": pa.Column(float, pa.Check.isin([0.0, 1.0]), nullable=True), 
    "cp": pa.Column(float, pa.Check.isin([1.0, 2.0, 3.0, 4.0]), nullable=True), 
    "trestbps": pa.Column(float, pa.Check.between(20, 220), nullable=True),
    "chol": pa.Column(float, pa.Check.between(50, 800), nullable=True), 
    "fbs": pa.Column(float, pa.Check.isin([0.0, 1.0]), nullable=True), 
    "restecg": pa.Column(float, pa.Check.isin([0.0, 1.0, 2.0]), nullable=True),  
    "thalach": pa.Column(float, pa.Check.between(50, 240), nullable=True),  
    "exang":  pa.Column(float, pa.Check.isin([0.0, 1.0]), nullable=True),  
    "oldpeak": pa.Column(float, pa.Check.between(0.0, 7.0), nullable=True),  
    "slope": pa.Column(float, pa.Check.isin([1.0, 2.0, 3.0]), nullable=True),  
    "ca": pa.Column(float, pa.Check.between(0, 4), nullable=True), 
    "thal": pa.Column(float, pa.Check.isin([3.0, 6.0, 7.0]), nullable=True),  
    "label": pa.Column(float, pa.Check.between(0.0, 4.0), nullable=True),  
})
replicate_df = combined_df.applymap(lambda x: float(x) if pd.notnull(x) else x)
try:
    values_schema.validate(replicate_df, lazy = True)
    print("No outlier or anomalous value found.")
except pa.errors.SchemaErrors as e:
    print(f"Warning: There're outlier or anomalous values.")



  replicate_df = combined_df.applymap(lambda x: float(x) if pd.notnull(x) else x)


In [78]:
## --- 8. There's no categorical value in this dataframe.

In [9]:
## --- 9. Target/response variable follows expected distribution

proportions = combined_df.label.value_counts(normalize=True)
print("Class proportions are", proportions)
print("Class proportions are as expected.")

Class proportions are label
0    0.446739
1    0.288043
2    0.118478
3    0.116304
4    0.030435
Name: proportion, dtype: float64
Class proportions are as expected.


In [17]:
## --- 10. No anomalous correlations between target variable and features variables

features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
ds = Dataset(combined_df, label='label', cat_features=categorical_features)

check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=ds)
check_feat_lab_corr.run(dataset=ds).show()
if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("The correlation between target and features variables exceeds the threshold.")
else:
    print("Everything is fine.")


Dataframe index has duplicate indexes, setting index to [0,1..,n-1].



VBox(children=(HTML(value='<h4><b>Feature Label Correlation</b></h4>'), HTML(value='<p>Return the PPS (Predict…

Everything is fine.


In [18]:
## --- 11. No anomalous correlations between features variables

check_feat_feat_corr = FeatureFeatureCorrelation(threshold=0.9)
check_feat_feat_corr_result = check_feat_feat_corr.run(dataset=ds)
check_feat_feat_corr.run(dataset=ds).show()

if not check_feat_feat_corr_result.passed_conditions():
    raise ValueError("The correlation between features variables exceeds the threshold.")
else:
    print("Everything is fine.")

VBox(children=(HTML(value='<h4><b>Feature-Feature Correlation</b></h4>'), HTML(value='<p>    Checks for pairwi…

Everything is fine.


## Data Cleaning

In [26]:
## delete duplicated rows
duplicate_index = [102,187]
combined_df = combined_df.drop(index=duplicate_index).reset_index(drop=True)
combined_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,label
0,28.0,1.0,2.0,130,132,0,2,185,0,0.0,,,,0
1,29.0,1.0,2.0,120,243,0,0,160,0,0.0,,,,0
2,29.0,1.0,2.0,140,,0,0,170,0,0.0,,,,0
3,30.0,0.0,1.0,170,237,0,1,170,0,0.0,,,6,0
4,31.0,0.0,2.0,100,219,0,1,150,0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
908,54.0,0.0,4.0,127,333,1,1,154,0,0,,,,1
909,62.0,1.0,1.0,,139,0,1,,,,,,,0
910,55.0,1.0,4.0,122,223,1,1,100,0,0,,,6,2
911,58.0,1.0,4.0,,385,1,2,,,,,,,0


----------------------------------

## Exploratory data analysis

#### Key findings

- Patients with no heart disease exhibits on average higher ST depression induced by exercise relative to rest, higher maximum heart rate and lower serum cholestorel. 

- Heart disease is more common among patients over 55. 

- Patients with heart disease are more likely to experience asymptomatic chest pains.

- Males appear to be more susceptible to heart disease. 

- Patients without heart disease tend to have lower fasting blood sugar when compared to the positive group.

In [19]:
train_df, test_df = train_test_split(combined_df, test_size=0.3, random_state=123)

In [20]:
aly.alt.data_transformers.enable('vegafusion')
aly.dist(train_df, color='label')

AttributeError: module 'pyarrow' has no attribute 'from_pandas'

alt.ConcatChart(...)

In [21]:
aly.dist(train_df, dtype = 'object', color = 'label')

AttributeError: module 'pyarrow' has no attribute 'from_pandas'

alt.ConcatChart(...)

## Features pre-processing 

In [32]:
X_train = train_df.drop(columns=["label"])
X_test = test_df.drop(columns=["label"])
y_train = train_df["label"]
y_test = test_df["label"]

In [33]:
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'] # standard scaling for numerical features
categorical_features = ['cp', 'restecg'] # onehot encoding for categorical features with > 2 classes
binary_features = ['sex', 'exang', 'fbs'] # simple imputing on the binary features
drop_features = ['thal', 'ca', 'slope'] # dropping features with signifcant NaN values 

In [34]:
numeric_transformer_pipe = make_pipeline(SimpleImputer(strategy = 'median'), StandardScaler())
categorical_transfomer_pipe = make_pipeline(SimpleImputer(strategy = 'most_frequent'), OneHotEncoder(drop = 'if_binary', sparse_output = False)) 
imputer = SimpleImputer(strategy = 'most_frequent')

In [35]:
preprocessor = make_column_transformer(
    (numeric_transformer_pipe, numeric_features),
    (categorical_transfomer_pipe, categorical_features),
    (imputer, binary_features),
    ("drop", drop_features)
)

In [36]:
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.fit_transform(X_test)

In [37]:
col_names = ( 
    numeric_features +
    preprocessor.named_transformers_['pipeline-2'].get_feature_names_out().tolist() + 
    binary_features
)

In [38]:
X_train_transformed = pd.DataFrame(X_train_transformed, columns = col_names)
X_test_transformed = pd.DataFrame(X_test_transformed, columns = col_names)
X_train_transformed 

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,cp_1.0,cp_2.0,cp_3.0,cp_4.0,restecg_0.0,restecg_1.0,restecg_2.0,sex,exang,fbs
0,-0.806021,-0.1196,0.178542,-1.048509,-1.048509,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.046673,-0.660108,0.402221,-1.12703,-1.12703,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.325714,0.961415,-1.890497,-1.205551,-1.205551,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
3,-0.059914,0.420908,0.122622,0.129303,0.129303,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,-2.085061,-0.1196,-0.389978,2.013803,2.013803,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
639,-0.592847,-0.1196,0.392901,0.835991,0.835991,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
640,-0.379674,0.691162,-1.890497,0.011522,0.011522,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
641,0.79278,-0.1196,-1.890497,-2.422623,-2.422623,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
642,0.47302,2.042431,-1.890497,-1.323332,-1.323332,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0


## Machine learning models application and hyperparameters tuning
#### Summary

- Various machine learning models were tested and optimized using hyperparameter tuning to identify the best-performing model.
- Randomized Search CV was used to perform hyperparameter optimization for each model. Key parameters tuned include:
   - Logistic Regression: Regularization strength(C), Solver(liblinear, lbfgs).
   - Decision Tree: Maximum depth, Minimum samples per split.
   - SVM: Regularization strength(C), Kernel type(linear, rbf).
   - KNN: Number of neighbours, Weight type(uniform, distance).

- Results:

    <img src="docs/Best_Models.png" alt="Model Summary Table" width="800"/>

#### Conclusion

- The K-Nearest Neighbors (KNN) model emerged as the best-performing classifier based on accuracy and weighted F1-score. Despite moderate overall performance, it provided reasonable balance across classes compared to other models.This study demonstrates the potential of leveraging machine learning techniques for predicting heart disease but also highlights the challenges posed by multi-class classification and limited data quality.

In [39]:
#importing libraries

import warnings
warnings.filterwarnings('ignore')

#pio.templates.default = "plotly_white"

#%matplotlib inline

#Models for scikit learn

#Model Evaluations



In [40]:
models = {
    'Logistic Regression': LogisticRegression(random_state = 123, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Support Vector Machine': SVC(random_state = 123, probability=True),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

param_distributions = {
    'Logistic Regression': {
        'classifier__C': stats.loguniform(1e-3, 1e3),
        'classifier__solver': ['liblinear', 'lbfgs']
    },
    'Decision Tree': {
        'classifier__max_depth': [3, 5, 10],
        'classifier__min_samples_split': stats.randint(2, 20)
    },
    'Support Vector Machine': {
        'classifier__C': stats.loguniform(1e-2, 1e2),
        'classifier__kernel': ['linear', 'rbf']
    },
    'K-Nearest Neighbors': {
        'classifier__n_neighbors': stats.randint(3, 20),
        'classifier__weights': ['uniform', 'distance']
    }
}

In [41]:
best_models = {}

for model_name, model in models.items():
    print(f"Tuning hyperparameters for {model_name} using RandomizedSearchCV...")
    
    clf = Pipeline(steps=[('classifier', model)])
    
    random_search = RandomizedSearchCV(
        estimator=clf,
        param_distributions=param_distributions[model_name],
        scoring=make_scorer(roc_auc_score, needs_proba=True),
        n_iter=10, 
        cv=5,
        random_state=42
    )
    
    random_search.fit(X_train_transformed, y_train)
    
    best_models[model_name] = random_search.best_estimator_
    
    print(f"Best parameters for {model_name}: {random_search.best_params_}")
    print("-" * 40)

for model_name, model in best_models.items():
    print(f"Evaluating {model_name} on test set...")
    y_pred = model.predict(X_test_transformed)
    
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("-" * 40)

Tuning hyperparameters for Logistic Regression using RandomizedSearchCV...
Best parameters for Logistic Regression: {'classifier__C': np.float64(0.1767016940294795), 'classifier__solver': 'liblinear'}
----------------------------------------
Tuning hyperparameters for Decision Tree using RandomizedSearchCV...
Best parameters for Decision Tree: {'classifier__max_depth': 10, 'classifier__min_samples_split': 16}
----------------------------------------
Tuning hyperparameters for Support Vector Machine using RandomizedSearchCV...
Best parameters for Support Vector Machine: {'classifier__C': np.float64(0.31489116479568624), 'classifier__kernel': 'linear'}
----------------------------------------
Tuning hyperparameters for K-Nearest Neighbors using RandomizedSearchCV...
Best parameters for K-Nearest Neighbors: {'classifier__n_neighbors': 9, 'classifier__weights': 'distance'}
----------------------------------------
Evaluating Logistic Regression on test set...
Classification Report:
        

# Reference 

```
{bibliography}
```