# Heart Disease Stage Prediction Project  

## Project Overview  
The **Heart Disease Stage Prediction Project** focuses on predicting the presence and stages of heart disease based on patient data. Using machine learning models and exploratory data analysis, this project aims to identify key factors contributing to heart disease, assist in early diagnosis, and provide actionable insights for healthcare providers.  

---

## Context  
This is a **multivariate dataset**: each patient record combines several clinical measurements. It contains 14 primary attributes out of 76 available ones; these 14 have been widely used in machine learning research.  
The **Cleveland database** is the most commonly utilized subset for heart disease prediction tasks.  

The main goals of this project are:  
1. To predict whether a person has heart disease based on given attributes.  
2. To analyze the dataset for insights that could improve understanding and early detection of heart disease.  

---

## Data Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data

## About the Dataset  

### Column Descriptions  

| Column     | Description                                                                                       |
|------------|---------------------------------------------------------------------------------------------------|
| `id`       | Unique identifier for each patient.                                                              |
| `age`      | Age of the patient in years.                                                                      |
| `origin`   | Place of study where data was collected.                                                          |
| `sex`      | Gender of the patient (`Male`/`Female`).                                                          |
| `cp`       | Chest pain type (`typical angina`, `atypical angina`, `non-anginal`, `asymptomatic`).              |
| `trestbps` | Resting blood pressure (in mm Hg on admission to the hospital).                                   |
| `chol`     | Serum cholesterol level in mg/dl.                                                                 |
| `fbs`      | Fasting blood sugar (`True` if >120 mg/dl, else `False`).                                          |
| `restecg`  | Resting electrocardiographic results (`normal`, `st-t abnormality`, `lv hypertrophy`).            |
| `thalch`   | Maximum heart rate achieved during exercise.                                                      |
| `exang`    | Exercise-induced angina (`True`/`False`).                                                         |
| `oldpeak`  | ST depression induced by exercise relative to rest.                                               |
| `slope`    | Slope of the peak exercise ST segment.                                                            |
| `ca`       | Number of major vessels (0-3) colored by fluoroscopy.                                             |
| `thal`     | Results of the thalassemia test (`normal`, `fixed defect`, `reversible defect`).                  |
| `num`      | Predicted attribute (`0` = no heart disease; `1, 2, 3, 4` = stages of heart disease).             |
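
Because `num` encodes both presence (`0` versus non-zero) and stage (`1` to `4`), the same column can serve a binary "disease or not" task as well as the multi-class stage task. The following is a minimal sketch on a small hand-made frame rather than the real data; the column name `has_disease` is hypothetical and does not exist in the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; only the `num` column matters here.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Collapse stages 1-4 into a single "disease present" flag.
# `has_disease` is a hypothetical column name, not part of the dataset.
toy["has_disease"] = (toy["num"] > 0).astype(int)

print(toy["has_disease"].tolist())
```

Keeping the original `num` column alongside the flag lets both framings be evaluated on the same split.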

---

## Problem Statement
   - **Baseline Models:** Use decision trees for initial benchmarks.  
   - **Advanced Models:** Train machine learning models such as Random Forest, XGBoost, and SVM.
   - **Hyperparameter Tuning:** Optimize models to enhance accuracy and efficiency.  
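
The hyperparameter tuning step listed above might look roughly like the sketch below. It runs a cross-validated grid search over a decision tree on synthetic five-class data from `make_classification`, which stands in for the real features. The grid values are illustrative assumptions, not tuned settings from this project, and the same pattern should carry over to Random Forest or SVM estimators.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data standing in for the heart disease features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical search grid; the values are illustrative, not tuned.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1 weights every class equally
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# search.score applies the same macro F1 scorer to the held-out split.
print(f"Held-out macro F1: {search.score(X_te, y_te):.3f}")
```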

### Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Settings

In [5]:
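# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Plot style
sns.set_style("darkgrid")

# Show every DataFrame column when printing
pd.set_option("display.max_columns", None)

# Paths to the data and model directories
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "hd_a_cleaned.csv")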

### Load Data

In [11]:
df = pd.read_csv(csv_path)

In [12]:
# Check data
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,0
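
Before encoding, it can help to check column types, missing values, and the balance of the target classes. The sketch below does this on the five preview rows shown above, inlined as text so that it runs without the CSV file. The variable name `preview` is hypothetical, and the counts describe only these five rows, not the full dataset.

```python
import io

import pandas as pd

# The five preview rows, inlined so the example is self-contained.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,num
63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,0
67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,2
67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,1
37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,0
41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,0
"""
preview = pd.read_csv(io.StringIO(csv_text))

# Column types: pandas parses True/False text as booleans.
print(preview.dtypes)

# Total number of missing values across all columns.
print(int(preview.isna().sum().sum()))

# Class balance of the target, ordered by class label.
print(preview["num"].value_counts().sort_index().to_dict())
```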


### Preprocessing

In [14]:
# Encode the categorical features as integer codes
df["sex"] = df["sex"].map({"Male": 1, "Female": 0})
# fbs and exang are already boolean (True/False), so they are left unmapped
df["restecg"] = df["restecg"].map({"normal": 0, "lv hypertrophy": 1, "st-t abnormality": 2})
df["cp"] = df["cp"].map({"asymptomatic": 0, "typical angina": 1, "atypical angina": 2, "non-anginal": 3})
df["slope"] = df["slope"].map({"-1": -1, "downsloping": 1, "upsloping": 2, "flat": 3})

# Sanity check
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,num
0,63,1,1,145.0,233.0,True,1,150.0,False,2.3,1,0.0,0
1,67,1,0,160.0,286.0,False,1,108.0,True,1.5,3,3.0,2
2,67,1,0,120.0,229.0,False,1,129.0,True,2.6,3,2.0,1
3,37,1,3,130.0,250.0,False,0,187.0,False,3.5,1,0.0,0
4,41,0,2,130.0,204.0,False,1,172.0,False,1.4,2,0.0,0
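
`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a misspelled or unexpected category silently becomes a missing value. The sketch below adds a guard against that on a toy column. `encode_strict` is a hypothetical helper introduced here for illustration; it is not part of the notebook.

```python
import pandas as pd


def encode_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to codes and fail loudly on unmapped labels."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped categories in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)


# Same chest pain mapping as in the preprocessing cell.
cp_map = {"asymptomatic": 0, "typical angina": 1, "atypical angina": 2, "non-anginal": 3}

# Toy column standing in for df["cp"].
toy_cp = pd.Series(["typical angina", "asymptomatic", "non-anginal"], name="cp")
print(encode_strict(toy_cp, cp_map).tolist())
```

Existing missing values pass through unchanged; only labels that are present but unmapped raise an error.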


In [15]:
# Separate input and output features
X = df.drop("num", axis=1)
y = df["num"]

# Sanity check
X.shape, y.shape

((919, 12), (919,))

In [16]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sanity check
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((735, 12), (735,), (184, 12), (184,))
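
With five target classes, a purely random split can leave the rarer stages under-represented in the test set. `train_test_split` accepts a `stratify` argument that keeps class proportions similar in both splits. The sketch below uses made-up class sizes to show the effect; whether stratification changes the results on the real data is not something this notebook has measured.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for `num` (class sizes are made up).
y_toy = np.repeat([0, 1, 2, 3, 4], [50, 25, 10, 10, 5])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Per-class counts in the test split; each class keeps about 20% of its rows.
print(np.bincount(y_te, minlength=5))
```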

In [17]:
# Standardize the features (fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
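
Fitting the scaler on the training split and only transforming the test split keeps test statistics out of training. Scale-sensitive models such as SVM benefit most from this step. The sketch below bundles scaling and an SVM in a scikit-learn `Pipeline`, so the same rule is applied inside each cross-validation fold. It uses synthetic data, and the names `svm_pipeline` and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the 12 heart disease features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6, random_state=42
)

# The scaler is refitted on the training part of every fold.
svm_pipeline = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Mean CV accuracy: {scores.mean():.3f}")
```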

### Model Training and Evaluation

In [18]:
def train_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model, print train then test metrics (in %), and return all eight scores."""
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate the model on training data (printed first)
    train_a = accuracy_score(y_train, y_train_pred)
    train_p = precision_score(y_train, y_train_pred, average="macro")
    train_r = recall_score(y_train, y_train_pred, average="macro")
    train_f1 = f1_score(y_train, y_train_pred, average="macro")
    print(f"Accuracy: {train_a * 100: 0.3f}")
    print(f"Precision: {train_p * 100 : 0.3f}")
    print(f"Recall: {train_r * 100 : 0.3f}")
    print(f"F1: {train_f1 * 100 : 0.3f}")

    # Evaluate the model on test data (printed second)
    test_a = accuracy_score(y_test, y_test_pred)
    test_p = precision_score(y_test, y_test_pred, average="macro")
    test_r = recall_score(y_test, y_test_pred, average="macro")
    test_f1 = f1_score(y_test, y_test_pred, average="macro")
    print(f"Accuracy: {test_a * 100: 0.3f}")
    print(f"Precision: {test_p * 100 : 0.3f}")
    print(f"Recall: {test_r * 100 : 0.3f}")
    print(f"F1: {test_f1 * 100 : 0.3f}")

    return train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1
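
The `average="macro"` setting used in `train_evaluate` computes each metric per class and then takes the unweighted mean, so a rare stage counts as much as a common one. The sketch below reproduces macro precision by hand on toy labels and compares it with `precision_score`. The labels are invented for illustration.

```python
from sklearn.metrics import precision_score

# Toy labels; every class is predicted at least once, so no division by zero.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

per_class = []
for label in sorted(set(y_true)):
    # True labels of the samples that were predicted as `label`.
    predicted_as_label = [t for t, p in zip(y_true, y_pred) if p == label]
    true_positives = sum(1 for t in predicted_as_label if t == label)
    per_class.append(true_positives / len(predicted_as_label))

# Unweighted mean over classes.
macro_manual = sum(per_class) / len(per_class)

print(per_class)
print(macro_manual, precision_score(y_true, y_pred, average="macro"))
```

Here class 0 has perfect precision while classes 1 and 2 each reach 0.5, and all three contribute equally to the final value.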

In [23]:
# Train the baseline model: a decision tree classifier.
# Trees split on feature thresholds, so the unscaled features are passed as-is.
dt = DecisionTreeClassifier()
train_a, train_p, train_r, train_f1, test_a, test_p, test_r, test_f1 = train_evaluate(dt, X_train, y_train, X_test, y_test)

Accuracy:  100.000
Precision:  100.000
Recall:  100.000
F1:  100.000
Accuracy:  47.826
Precision:  38.194
Recall:  35.748
F1:  36.528
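
The first four lines are training scores and the last four are test scores. A perfect training score next to a test accuracy of about 48% is the usual signature of an unconstrained decision tree memorizing its training set. One common response is to limit the tree depth and compare train and test scores side by side. The sketch below does this on synthetic data with some label noise. The depth values are illustrative assumptions, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data with 10% label noise, standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=5, n_clusters_per_class=1, flip_y=0.1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Map each depth limit to (train accuracy, test accuracy).
results = {}
for depth in (None, 3, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A smaller gap between the two scores at a limited depth would suggest the constraint is helping generalization.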
