# Heart Disease Stage Prediction Project  

## Project Overview  
The **Heart Disease Stage Prediction Project** focuses on predicting the presence and stages of heart disease based on patient data. Using machine learning models and exploratory data analysis, this project aims to identify key factors contributing to heart disease, assist in early diagnosis, and provide actionable insights for healthcare providers.  

---

## Context  
This dataset is a **multivariate dataset**, meaning each patient record is described by several variables. Of the 76 attributes originally recorded, it keeps the 14 that have been widely used in machine learning research.  
The **Cleveland database** is the most commonly utilized subset for heart disease prediction tasks.  

The main goals of this project are:  
1. To predict whether a person has heart disease based on given attributes.  
2. To analyze the dataset for insights that could improve understanding and early detection of heart disease.  
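
Goal 1 is a yes/no question, while the `num` target in this dataset has five levels. A minimal sketch of how a presence/absence target could be derived, assuming the convention that any `num` above 0 counts as disease present; the `has_disease` column name and the toy DataFrame are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy frame standing in for the real data; only the `num` column matters here.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Hypothetical binary target: 1 if any stage of heart disease is present, else 0.
toy["has_disease"] = (toy["num"] > 0).astype(int)

print(toy)
```

The models in this notebook keep the five-level `num` target, so this is only relevant if the binary question is modeled separately.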

---

## Data Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data

## About the Dataset  

### Column Descriptions  

| Column     | Description                                                                                       |
|------------|---------------------------------------------------------------------------------------------------|
| `id`       | Unique identifier for each patient.                                                              |
| `age`      | Age of the patient in years.                                                                      |
| `origin`   | Place of study where data was collected.                                                          |
| `sex`      | Gender of the patient (`Male`/`Female`).                                                          |
| `cp`       | Chest pain type (`typical angina`, `atypical angina`, `non-anginal`, `asymptomatic`).              |
| `trestbps` | Resting blood pressure (in mm Hg on admission to the hospital).                                   |
| `chol`     | Serum cholesterol level in mg/dl.                                                                 |
| `fbs`      | Fasting blood sugar (`True` if >120 mg/dl, else `False`).                                          |
| `restecg`  | Resting electrocardiographic results (`normal`, `st-t abnormality`, `lv hypertrophy`).            |
| `thalch`   | Maximum heart rate achieved during exercise.                                                      |
| `exang`    | Exercise-induced angina (`True`/`False`).                                                         |
| `oldpeak`  | ST depression induced by exercise relative to rest.                                               |
| `slope`    | Slope of the peak exercise ST segment.                                                            |
| `ca`       | Number of major vessels (0-3) colored by fluoroscopy.                                             |
| `thal`     | Results of the thalassemia test (`normal`, `fixed defect`, `reversable defect`; spelled this way in the data). |
| `num`      | Target attribute (`0` = no heart disease; `1, 2, 3, 4` = stages of heart disease).                |

---

## Problem Statement
   - **Baseline Models:** Use decision trees for initial benchmarks.  
   - **Advanced Models:** Train machine learning models such as Random Forest, XGBoost, and SVM.
   - **Hyperparameter Tuning:** Optimize models to enhance accuracy and efficiency.  

### Import Libraries

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

### Settings

In [2]:
# warnings
warnings.filterwarnings("ignore")

# Plot
sns.set_style("darkgrid")

# DataFrame
pd.set_option("display.max_columns", None)

# Data
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "hd_uci_no_missing.csv")

### Load Data

In [42]:
df = pd.read_csv(csv_path)

In [43]:
# Check data
df.head()

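|   | age | sex | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|-----|-----|----|----------|------|-----|---------|--------|-------|---------|-------|----|------|-----|
| 0 | 63 | Male | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 67 | Male | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 67 | Male | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 37 | Male | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 41 | Female | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |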


### Preprocessing

In [44]:
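# Encode the categorical features as integers
df["sex"] = df["sex"].map({"Male": 1, "Female": 0})
# fbs and exang are left as booleans (True/False), so the string-keyed maps below stay disabled
# df["fbs"] = df["fbs"].map({"True": 1, "False": 0})
# df["exang"] = df["exang"].map({"True": 1, "False": 0})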
df["restecg"] = df["restecg"].map({"normal": 0, "lv hypertrophy": 1, "st-t abnormality": 2})
df["cp"] = df["cp"].map({"asymptomatic": 0, "typical angina": 3, "atypical angina": 2, "non-anginal": 1})
df["slope"] = df["slope"].map({"downsloping": -1, "upsloping": 1, "flat": 0})
df["thal"] = df["thal"].map({"normal": 0, "fixed defect": 1, "reversable defect": 2})
# Sanity check
df.head()

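|   | age | sex | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|-----|-----|----|----------|------|-----|---------|--------|-------|---------|-------|----|------|-----|
| 0 | 63 | 1 | 3 | 145.0 | 233.0 | True | 1 | 150.0 | False | 2.3 | -1 | 0.0 | 1 | 0 |
| 1 | 67 | 1 | 0 | 160.0 | 286.0 | False | 1 | 108.0 | True | 1.5 | 0 | 3.0 | 0 | 2 |
| 2 | 67 | 1 | 0 | 120.0 | 229.0 | False | 1 | 129.0 | True | 2.6 | 0 | 2.0 | 2 | 1 |
| 3 | 37 | 1 | 1 | 130.0 | 250.0 | False | 0 | 187.0 | False | 3.5 | -1 | 0.0 | 0 | 0 |
| 4 | 41 | 0 | 2 | 130.0 | 204.0 | False | 1 | 172.0 | False | 1.4 | 1 | 0.0 | 0 | 0 |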
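
`Series.map` turns any value that is missing from the dictionary into `NaN`, which is easy to overlook (for example, the data spells one category `reversable defect`). A small sketch of a check that could be run alongside the encoding; the helper name `unmapped_values` and the toy series are illustrative and not part of the notebook:

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> list:
    """Return the distinct non-null values of `series` that have no key in `mapping`."""
    present = series.dropna().unique()
    return sorted(v for v in present if v not in mapping)

thal_map = {"normal": 0, "fixed defect": 1, "reversable defect": 2}
# Toy column that mixes the dataset's spelling with the dictionary spelling.
thal = pd.Series(["normal", "reversable defect", "reversible defect", "fixed defect"])

missing = unmapped_values(thal, thal_map)
encoded = thal.map(thal_map)

print(missing)               # values that .map() would silently turn into NaN
print(encoded.isna().sum())  # number of rows affected
```

Running such a check on the raw column before mapping makes it clear whether every category is covered.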


In [45]:
# Separate Input and output features
X = df.drop("num", axis=1)
y = df["num"]
# Sanity check
X.shape, y.shape

((299, 13), (299,))

In [46]:
# Split the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

# Sanity check
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((239, 13), (239,), (60, 13), (60,))
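
With five classes of different sizes, a plain random split can leave the rarest stages with very few rows in a 60-row test set. One option, which this notebook does not use, is to pass `stratify=y` so that each split keeps roughly the same class proportions. A sketch on synthetic labels (the class counts below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels; the counts are invented for illustration only.
y_demo = np.repeat([0, 1, 2, 3, 4], [160, 60, 40, 30, 10])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each class contributes about 20% of its rows to the test split.
print(np.bincount(y_te))
```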

In [47]:
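# Standardize the data (the scaler is fit on the training split only).
# Note: the tree-based models below are trained on the unscaled X_train / X_test, so these scaled arrays are not used later.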
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Model Training and Evaluation

In [48]:
def train_evaluate(model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate the model on training data
    train_a = accuracy_score(y_train, y_train_pred)
    train_p = precision_score(y_train, y_train_pred, average="weighted")
    train_r = recall_score(y_train, y_train_pred, average="weighted")
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    print("Evaluation on Training Data")
    print(f"Accuracy: {train_a * 100: 0.3f}")
    print(f"Precision: {train_p * 100 : 0.3f}")
    print(f"Recall: {train_r * 100 : 0.3f}")
    print(f"F1: {train_f1 * 100 : 0.3f}")

    # Evaluate the model on test data
    test_a = accuracy_score(y_test, y_test_pred)
    test_p = precision_score(y_test, y_test_pred, average="weighted")
    test_r = recall_score(y_test, y_test_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")
    print("Evaluation on Test Data")
    print(f"Accuracy: {test_a * 100: 0.3f}")
    print(f"Precision: {test_p * 100 : 0.3f}")
    print(f"Recall: {test_r * 100 : 0.3f}")
    print(f"F1: {test_f1 * 100 : 0.3f}")

    return train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1

In [49]:
# Train the base model Decision Tree Classifier
dt = DecisionTreeClassifier()
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1= train_evaluate(dt, X_train, y_train, X_test, y_test)

Evaluation on Training Data
Accuracy:  100.000
Precision:  100.000
Recall:  100.000
F1:  100.000
Evaluation on Test Data
Accuracy:  53.333
Precision:  52.787
Recall:  53.333
F1:  52.720


### Performance Analysis of Decision Tree Classifier

#### Training Data:

- **Accuracy, Precision, Recall, F1: All 100%.**
    - This perfect score on the training data indicates that the model has learned the training data very well.
    - However, it suggests that the model has overfit, meaning it may not generalize well to unseen data.

#### Test Data:

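- **Accuracy: 53.333%**

    - Only 32 of the 60 test patients are classified correctly. With five classes, uniform random guessing would score about 20%, but class `0` makes up roughly half of the rows (about 160 of 299, judging by the 800 balanced rows that SMOTE produces later), so this is little or no better than always predicting "no heart disease". Generalization is poor.
- **Precision: 52.787%**

    - Weighted precision this low means that, across classes, a large share of the predicted labels are wrong.
- **Recall: 53.333%**

    - Because the scores use `average="weighted"`, recall coincides with accuracy: nearly half of the test patients are not assigned their true class.
- **F1 Score: 52.720%**

    - The F1 score, which balances precision and recall, is also quite low, reflecting the poor predictive performance on the test set.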
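
In every result block of this notebook, recall is identical to accuracy. That follows from `average="weighted"`: weighting each class's recall by its share of the samples adds up to the overall fraction of correct predictions. A quick check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up multiclass labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 3, 4])
y_pred = np.array([0, 0, 1, 0, 1, 2, 2, 0, 3, 0])

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# Weighted recall = sum_c (n_c / N) * (TP_c / n_c) = (sum_c TP_c) / N = accuracy
print(acc, weighted_recall)
```

For a view that does distinguish the classes, `average=None` or `classification_report` returns one score per stage.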

#### Key Observations

- **Overfitting:**

    - The model performs perfectly on the training data but fails on the test data. This is a clear sign of overfitting.
    - Decision trees can overfit easily if not properly regularized (e.g., by setting constraints like max_depth, min_samples_split, or min_samples_leaf).
- **Poor Generalization:**

    - The gap between training and test performance is too large, indicating that the model has memorized the training data instead of learning patterns that generalize well.

#### Recommendations to Improve Performance

- **Regularization:**

    - Limit the complexity of the decision tree:
        - Use hyperparameters like max_depth, min_samples_split, or min_samples_leaf to control the depth and growth of the tree.
- **Alternative Models:**

    - Consider using ensemble methods like Random Forest or Gradient Boosting (e.g., XGBoost, LightGBM), which are more robust and less prone to overfitting compared to a single decision tree.
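
The effect of limiting tree depth can be seen by comparing training and test accuracy for a few `max_depth` values. The sketch below uses synthetic data from `make_classification`, not the heart disease data, so the exact numbers are not meaningful; the point is that a fully grown tree typically fits the training split perfectly while a shallow one does not:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the shape loosely mirrors the notebook (300 rows, 13 features).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=13, n_informative=5, n_classes=3,
    n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```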

In [50]:
# Train the base model Decision Tree Classifier with regularization
dt = DecisionTreeClassifier(max_depth=5, min_samples_split=10, min_samples_leaf=5)
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1= train_evaluate(dt, X_train, y_train, X_test, y_test)

Evaluation on Training Data
Accuracy:  72.385
Precision:  71.535
Recall:  72.385
F1:  70.068
Evaluation on Test Data
Accuracy:  58.333
Precision:  48.939
Recall:  58.333
F1:  52.903


### Performance Analysis of Decision Tree Classifier After Applying Regularization

#### Training Data:

- **Accuracy: 72.385%**
    - The accuracy has dropped significantly compared to the initial overfitted model (100%), indicating reduced overfitting.
- **Precision: 71.535%, Recall: 72.385%, F1: 70.068%**
    - These moderate values suggest the model has learned some patterns from the training data but is still not highly effective in distinguishing between classes.

#### Test Data:

- **Accuracy: 58.333%**

    - Test accuracy rose from 53.333% to 58.333%, a difference of three patients out of 60. This is a five-class problem, so the 50% chance level of binary classification does not apply; the share of the largest class is a more telling reference.
- **Precision: 48.939%, Recall: 58.333%, F1: 52.903%**

    - Weighted precision actually fell (from 52.787%) and F1 is almost unchanged (from 52.720%), so the model still assigns a large share of patients to the wrong class.
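
A concrete reference point for "random guessing" in a multiclass problem is a trivial baseline. A sketch using scikit-learn's `DummyClassifier` on made-up labels; the class counts are chosen only for illustration and are not taken from the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative label distribution in which class 0 dominates.
y_demo = np.repeat([0, 1, 2, 3, 4], [160, 55, 36, 35, 13])
X_demo = np.zeros((len(y_demo), 1))  # the dummy strategies ignore the features

majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X_demo, y_demo)

majority_acc = majority.score(X_demo, y_demo)
uniform_acc = uniform.score(X_demo, y_demo)

print("always predict the majority class:", round(majority_acc, 3))
print("uniform random guessing:", round(uniform_acc, 3))
```

Any real model should be judged against the stronger of these two baselines, which for imbalanced data is usually the majority-class one.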

#### Key Observations

- **Improved Generalization:**

    - The gap between training and test performance has decreased, which is a positive sign of reduced overfitting.
- **Low Overall Performance:**

    - Both training and test set metrics are quite low, suggesting that the model is underfitting or that the data might not have enough discriminative features for the current model.
- **Feature Complexity:**

    - The current feature set may not capture the underlying patterns well, or the relationships between features and the target variable may be too complex for a single DecisionTreeClassifier, even with regularization.

#### Recommendations to Further Improve Performance

- **Use an Ensemble Model**

    - A single decision tree tends to overfit when grown fully and to underfit when heavily constrained. Switching to ensemble models like Random Forest or Gradient Boosting (e.g., XGBoost, LightGBM) can help improve performance:
        - Random Forest averages multiple decision trees, reducing variance.
        - Gradient Boosting optimizes prediction errors sequentially, improving accuracy.

In [51]:
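# Use SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset.
# Note: resampling happens before the train/test split below, so the test set also contains synthetic samples.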
smote = SMOTE(random_state= 42)
X_balanced, y_balanced = smote.fit_resample(X, y)

# Split the train and test data
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size= 0.2, random_state= 42)

# Sanity check
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((640, 13), (640,), (160, 13), (160,))

In [52]:
# Train the Random Forest Classifier
rfc = RandomForestClassifier(random_state= 42, n_estimators= 200,max_depth=5, min_samples_split=10, min_samples_leaf=5)
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1= train_evaluate(rfc, X_train, y_train, X_test, y_test)

Evaluation on Training Data
Accuracy:  84.375
Precision:  85.291
Recall:  84.375
F1:  83.845
Evaluation on Test Data
Accuracy:  73.125
Precision:  73.424
Recall:  73.125
F1:  72.611


### Performance Analysis

#### Training Data:

- **Accuracy: 84.375%, Precision: 85.291%, Recall: 84.375%, F1: 83.845%**
    - These metrics indicate that the model is learning well from the oversampled training data without excessive overfitting.
    - The balance between precision and recall demonstrates the ability to correctly classify the majority and minority classes.

#### Test Data:

- **Accuracy: 73.13%**
    - Higher than the 58.33% of the regularized decision tree before SMOTE, which suggests that oversampling and the ensemble together help. The two figures are not strictly comparable, though: this test set was split from the SMOTE-balanced data, so it contains synthetic samples.
- **Precision: 73.42%, Recall: 73.13%, F1: 72.61%**
    - The closeness of precision and recall suggests that the model handles false positives and false negatives about equally well on the test set.
    - The F1 score is much higher than before (52.90% for the regularized decision tree), which is consistent with better handling of the minority classes.

#### Key Observations

- **Improved Generalization:**

    - The increased test accuracy (73.13%) and F1 score (72.61%) suggest that SMOTE has mitigated the impact of class imbalance, allowing the model to perform better on the held-out portion of the balanced data.
- **Balanced Training Performance:**

    - The training metrics remain strong but not excessively high, demonstrating that the Random Forest is neither underfitting nor overfitting the oversampled data.
- **Impact of SMOTE:**

    - By balancing the class distribution, SMOTE has allowed the Random Forest to learn more effectively, improving recall and F1 score across both training and test datasets.

In [53]:
# Train the XGBoost Classifier
xgbc = XGBClassifier()
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1= train_evaluate(xgbc, X_train, y_train, X_test, y_test)

Evaluation on Training Data
Accuracy:  100.000
Precision:  100.000
Recall:  100.000
F1:  100.000
Evaluation on Test Data
Accuracy:  82.500
Precision:  83.339
Recall:  82.500
F1:  82.463


### Performance Analysis of XGBoost Classifier

#### Training Data:

- **Accuracy, Precision, Recall, F1: 100.00%**
    - While these metrics are perfect, they may indicate potential overfitting on the training data, as the model could be memorizing the oversampled dataset instead of generalizing.
    - However, this concern is mitigated by the excellent performance on the test data.

#### Test Data:

- **Accuracy: 82.50%**
    - A significant improvement compared to the Random Forest post-SMOTE model (73.13%), demonstrating that XGBoost effectively captures complex patterns in the data.
- **Precision: 83.34%, Recall: 82.50%, F1: 82.46%**
    - These metrics indicate a well-balanced model capable of handling both false positives and false negatives effectively.
    - The high F1 score highlights the model's robustness, as it balances precision and recall.

#### Key Observations
- **Excellent Generalization:**

    - The test accuracy (82.50%) and F1 score (82.46%) confirm that the XGBoost classifier generalizes better than the Random Forest, even on oversampled data.
- **Improvements Across Metrics:**

    - Compared to the Random Forest, XGBoost improves every test metric by roughly 9 to 10 percentage points (for example, F1 rises from 72.611% to 82.463%).
- **Handling Class Imbalance:**

    - The SMOTE-oversampled data gives XGBoost an equal number of examples per class, and the weighted test metrics suggest that it performs well on minority and majority classes alike. Per-class metrics would be needed to confirm this.

### Model Optimization

In [19]:
xgb = XGBClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100],       # Number of trees
    'learning_rate': [0.01, 0.1],    # Step size shrinkage
    'max_depth': [3, 5, 7],               # Maximum depth of a tree
    # 'subsample': [0.6, 0.8],         # Subsample ratio of the training set
    'colsample_bytree': [0.8, 1.0],  # Subsample ratio of features for each tree
    'gamma': [0, 1],                   # Minimum loss reduction required for further split
    'reg_alpha': [0, 0.1, 1],             # L1 regularization term on weights
    # 'reg_lambda': [1, 2]               # L2 regularization term on weights
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    # scoring='f1_weighted',  # Optional; for this multiclass target use an averaged scorer such as 'accuracy', 'f1_weighted', or 'recall_weighted'
    cv=5,          # 5-fold cross-validation
    verbose=1,     # Output progress
    # n_jobs=-1      # Use all available CPU cores
)

# Fit the model on the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and the corresponding score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best Parameters: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'reg_alpha': 0}
Best Score: 0.8234375
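
The "72 candidates" in the log is the product of the list lengths in `param_grid` (1 × 2 × 3 × 2 × 2 × 3), and 5-fold cross-validation multiplies that by 5. The size of a grid can be checked before launching a long search with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same active entries as the grid above (commented-out entries omitted).
param_grid = {
    "n_estimators": [100],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "colsample_bytree": [0.8, 1.0],
    "gamma": [0, 1],
    "reg_alpha": [0, 0.1, 1],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5

print(n_candidates, n_fits)  # 72 360
```

Re-enabling the commented-out `subsample` and `reg_lambda` entries would multiply the candidate count by 4, which is worth knowing before removing the `#`.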


In [21]:
# Train with the best parameter set
best_params = grid_search.best_params_


In [54]:
xgbmodel = XGBClassifier(**best_params)
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1= train_evaluate(xgbmodel, X_train, y_train, X_test, y_test)

Evaluation on Training Data
Accuracy:  100.000
Precision:  100.000
Recall:  100.000
F1:  100.000
Evaluation on Test Data
Accuracy:  84.375
Precision:  84.881
Recall:  84.375
F1:  84.168


### Performance Analysis of XGBoost Classifier After Optimization

#### Training Data:

- The model achieves perfect scores (100%) for accuracy, precision, recall, and F1 on the training set.
- While this might indicate overfitting, the strong test set performance suggests that the model is generalizing well.

#### Test Data:

- **Accuracy (84.375%):** The proportion of correctly classified instances is high.
- **Precision (84.881%):** The classifier is reliable in predicting true positives out of all predicted positives.
- **Recall (84.375%):** The model captures a high percentage of actual positives, showing strong sensitivity.
- **F1 Score (84.168%):** The harmonic mean of precision and recall reflects a good trade-off between them.

In [25]:
# Save the model
xgb_path = os.path.join(model_path, "model_xgb.pkl")
with open(xgb_path, "wb") as xgb_model_file:
    pickle.dump(xgbmodel, xgb_model_file)
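
To reuse the saved model, it can be loaded back with `pickle.load` and called like any fitted estimator. The sketch below round-trips a small scikit-learn tree as a stand-in for the XGBoost model and writes to a temporary directory rather than `../models`; the file name `model_demo.pkl` is hypothetical. Unpickling generally expects the same library versions that were used when saving.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in model and data; in the notebook this would be `xgbmodel`.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The reloaded model gives the same predictions as the original.
print(loaded.predict(X_demo))
```

New inputs must go through the same encoding steps and column order as the training data before calling `predict`.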