# Heart Disease Stage Prediction Project  

## Project Overview  
The **Heart Disease Stage Prediction Project** focuses on predicting the presence and stages of heart disease based on patient data. Using machine learning models and exploratory data analysis, this project aims to identify key factors contributing to heart disease, assist in early diagnosis, and provide actionable insights for healthcare providers.  

---

## Context  
This is a **multivariate dataset**: each patient record combines several numerical and categorical variables. Of the 76 attributes in the original database, it contains the 14 primary attributes that have been widely used in machine learning research.  
The **Cleveland database** is the most commonly utilized subset for heart disease prediction tasks.  

The main goals of this project are:  
1. To predict whether a person has heart disease based on given attributes.  
2. To analyze the dataset for insights that could improve understanding and early detection of heart disease.  

---

## Data Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data

## About the Dataset  

### Column Descriptions  

| Column     | Description                                                                                       |
|------------|---------------------------------------------------------------------------------------------------|
| `id`       | Unique identifier for each patient.                                                              |
| `age`      | Age of the patient in years.                                                                      |
| `dataset`  | Place of study where data was collected.                                                          |
| `sex`      | Gender of the patient (`Male`/`Female`).                                                          |
| `cp`       | Chest pain type (`typical angina`, `atypical angina`, `non-anginal`, `asymptomatic`).              |
| `trestbps` | Resting blood pressure (in mm Hg on admission to the hospital).                                   |
| `chol`     | Serum cholesterol level in mg/dl.                                                                 |
| `fbs`      | Fasting blood sugar (`True` if >120 mg/dl, else `False`).                                          |
| `restecg`  | Resting electrocardiographic results (`normal`, `st-t abnormality`, `lv hypertrophy`).            |
| `thalch`   | Maximum heart rate achieved during exercise.                                                      |
| `exang`    | Exercise-induced angina (`True`/`False`).                                                         |
| `oldpeak`  | ST depression induced by exercise relative to rest.                                               |
| `slope`    | Slope of the peak exercise ST segment.                                                            |
| `ca`       | Number of major vessels (0-3) colored by fluoroscopy.                                             |
| `thal`     | Results of the thalassemia test (`normal`, `fixed defect`, `reversable defect`, as spelled in the data). |
| `num`      | Target attribute (`0` = no heart disease; `1, 2, 3, 4` = stages of heart disease).                |
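
The first project goal (presence versus absence of heart disease) needs a binary label, while `num` encodes stages. A minimal sketch of one way to derive such a label is shown below; the toy values and the `has_disease` column name are assumptions made for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy values for the `num` column (0 = no heart disease, 1-4 = stages).
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Hypothetical helper column: 1 if any stage of heart disease is present.
toy["has_disease"] = (toy["num"] > 0).astype(int)

print(toy["has_disease"].tolist())  # [0, 1, 1, 0, 1, 1]
```

Keeping `num` alongside the binary column leaves the option of multi-class stage prediction open.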

---

## Problem Statement
   - **Exploratory Data Analysis (EDA):** Perform statistical analysis and visualize distributions, trends, and relationships in the data to find patterns that help predict the stage of heart disease.  
   - **Data Assessment:** Identify missing values, outliers, and inconsistencies.
   - **Data Cleaning:** Drop duplicate records and impute missing values.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os

# Missing value imputation 
from sklearn.experimental import enable_iterative_imputer  # Required to use IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

### Settings

In [2]:
# warnings
warnings.filterwarnings("ignore")

# Plot
sns.set_style("darkgrid")

# DataFrame
pd.set_option("display.max_columns", None)

# Data
data_path = "../data"
csv_path = os.path.join(data_path, "heart_disease_uci.csv")

### Load Data and Explore

In [3]:
# Load data
df = pd.read_csv(csv_path)

In [4]:
# Show the first 5 rows to get an idea of what information is stored in each feature
df.head()

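|   | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|----|-----|-----|---------|----|----------|------|-----|---------|--------|-------|---------|-------|----|------|-----|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |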


In [5]:
# Data Description
def describe_data():
    print("=" * 60)
    print("DATA DESCRIPTION")
    print("=" * 60)
    print(f"Number of observations: {df.shape[0]}")
    print(f"Number of features: {df.shape[1]}")

describe_data()

DATA DESCRIPTION
Number of observations: 920
Number of features: 16


In [6]:
# Feature Description
def describe_features():
    print("=" * 60)
    print("FEATURE DESCRIPTION")
    print("=" * 60)
    print(df.dtypes)

    # Get numerical and categorical features
    num_cols = [col for col in df.columns if df[col].dtype != "object"]
    cat_cols = [col for col in df.columns if df[col].dtype == "object"]
    unique_cols = [col for col in df.columns if df[col].nunique() == df.shape[0]]
    print("-" * 60)
    print(f"Number of Categorical features: {len(cat_cols)}")
    print(cat_cols)
    print("-" * 60)
    print(f"Number of Numerical features: {len(num_cols)}")
    print(num_cols)
    print("-" * 60)
    print(f"Number of features containing unique values: {len(unique_cols)}")
    print(unique_cols)

describe_features()

FEATURE DESCRIPTION
id            int64
age           int64
sex          object
dataset      object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object
------------------------------------------------------------
Number of Categorical features: 8
['sex', 'dataset', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
------------------------------------------------------------
Number of Numerical features: 8
['id', 'age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'num']
------------------------------------------------------------
Number of features containing unique values: 1
['id']
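
The numerical/categorical split above compares each dtype with the string `"object"`. As an alternative sketch, `DataFrame.select_dtypes` expresses the same split without string comparisons. The toy frame below is an assumption for illustration. Note that boolean columns would land in the non-numeric group here, which differs from the `dtype != "object"` check.

```python
import pandas as pd

# Toy frame that mimics a few columns of the dataset (values are illustrative).
toy = pd.DataFrame({
    "age": [63, 67, 37],
    "sex": ["Male", "Male", "Female"],
    "chol": [233.0, 286.0, None],
    "cp": ["typical angina", "asymptomatic", "non-anginal"],
})

# Numeric columns (int and float) versus everything else.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)  # ['age', 'chol']
print(cat_cols)  # ['sex', 'cp']
```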


In [7]:
# Missing value Detection
def check_missing():
    print("=" * 60)
    print("MISSING VALUE DETECTION")
    print("=" * 60)

    if df.isnull().sum().sum() > 0:
        print(df.isnull().sum().sort_values(ascending=False))
    else:
        print("No missing value present in any feature")

check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
thalch       55
exang        55
chol         30
restecg       2
id            0
age           0
sex           0
dataset       0
cp            0
num           0
dtype: int64


In [8]:
# Duplicate observation detection
def check_duplicate():
    print("=" * 60)
    print("DUPLICATE OBSERVATION DETECTION")
    print("=" * 60)
    print(f"Number of duplicate observations: {df.duplicated().sum()}")

check_duplicate()

DUPLICATE OBSERVATION DETECTION
Number of duplicate observations: 0


### Key Findings

#### Dataset Overview
- **Observations:** The dataset consists of **920 records** of patients.
- **Features:**
  - **Categorical:** 8 features, including `sex`, `cp`, and `thal`.
  - **Numerical:** 8 features, including `age`, `trestbps`, and `chol`.
- **Target Variable (`num`):**
  - `0`: Indicates no heart disease.
  - `1, 2, 3, 4`: Represent different stages of heart disease severity.
- **Unique Identifier:** The `id` column is unique for each record and does not contribute to analysis.

#### Key Observations
1. **Missing Data:**
   - Several features have missing values:
     - `trestbps` (59), `chol` (30), `fbs` (90), `restecg` (2), `thalch` (55), `exang` (55), `oldpeak` (62), `slope` (309), `ca` (611), `thal` (486).
   - Features like `slope`, `ca`, and `thal` have a high percentage of missing values (33.6%, 66.4%, and 52.8%, respectively).
   - Handling missing data is critical to ensure model accuracy and reliability.

2. **Feature Characteristics:**
   - **Categorical Features:** Require encoding for machine learning models.
     - Example: `cp` has values like `typical angina` that need transformation into numerical labels.
   - **Numerical Features:** Some columns may require scaling (e.g., `chol`, `trestbps`) to improve model performance.
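
The percentages quoted for `slope`, `ca`, and `thal` follow directly from the missing-value counts and the 920 observations. A short sketch of that arithmetic, using the counts printed earlier, is shown below. On the loaded DataFrame the same figures could be obtained with `df.isnull().mean() * 100`.

```python
import pandas as pd

n_rows = 920  # number of observations reported above

# Missing-value counts copied from the detection output.
missing_counts = pd.Series({
    "ca": 611, "thal": 486, "slope": 309, "fbs": 90, "oldpeak": 62,
    "trestbps": 59, "thalch": 55, "exang": 55, "chol": 30, "restecg": 2,
})

# Share of missing values per feature, in percent.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```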

#### Recommendations for Data Cleaning
1. **Handle Missing Data:**
   - Impute missing values for categorical features (`fbs`, `restecg`, `exang`, `slope`, `thal`) using the mode or predictive imputation techniques.
   - For numerical features (`trestbps`, `chol`, `thalch`, `oldpeak`, `ca`), use the median or advanced imputation methods (e.g., KNN).

2. **Encode Categorical Features:**
   - Use Label Encoding or One-Hot Encoding based on the type of categorical data.

3. **Normalize Numerical Features:**
   - Normalize features like `trestbps`, `chol`, and `thalch` using Min-Max Scaling or StandardScaler.

4. **Remove Redundant Features:**
   - Drop the `id` column, as it does not contribute to predictive analysis. The cleaning step in this notebook also drops `dataset`, the place of study.
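
The imputation, encoding, and scaling recommendations can be combined in a single preprocessing step. The following is a minimal sketch using scikit-learn's `ColumnTransformer` on a toy frame. The column selection, the toy values, and the choice of median/mode imputation with one-hot encoding are assumptions for illustration, not the cleaning procedure this notebook applies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with missing values in numerical and categorical columns.
toy = pd.DataFrame({
    "trestbps": [145.0, 160.0, np.nan, 130.0],
    "chol": [233.0, np.nan, 229.0, 250.0],
    "cp": ["typical angina", "asymptomatic", "asymptomatic", "non-anginal"],
    "restecg": ["lv hypertrophy", np.nan, "normal", "normal"],
})

# Numerical columns: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["trestbps", "chol"]),
        ("cat", categorical, ["cp", "restecg"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)  # (4, 7): 2 scaled columns + 3 cp categories + 2 restecg categories
```

Fitting the transformer on training data only and reusing it on held-out data avoids leaking test-set statistics into the imputation and scaling.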


### Data Cleaning

In [9]:
# Drop the 'id' and 'dataset' features
df.drop(["id", "dataset"], axis=1, inplace=True)

In [10]:
# Sanity check
describe_features()

FEATURE DESCRIPTION
age           int64
sex          object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object
------------------------------------------------------------
Number of Categorical features: 7
['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
------------------------------------------------------------
Number of Numerical features: 7
['age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'num']
------------------------------------------------------------
Number of features containing unique values: 0
[]


In [19]:
# Impute restecg with the most frequent value (mode)
df["restecg"] = df["restecg"].fillna(df["restecg"].mode()[0])
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
thalch       55
exang        55
chol         30
age           0
sex           0
cp            0
restecg       0
num           0
dtype: int64


In [20]:
# Impute chol (cholesterol level) with the median of the feature
df["chol"] = df["chol"].fillna(df["chol"].median())
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
thalch       55
exang        55
age           0
sex           0
cp            0
chol          0
restecg       0
num           0
dtype: int64


In [21]:
# Impute exang with the most frequent value (mode)
df["exang"] = df["exang"].fillna(df["exang"].mode()[0])
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
thalch       55
age           0
sex           0
cp            0
chol          0
restecg       0
exang         0
num           0
dtype: int64


In [22]:
# Impute thalch with median of the feature
df["thalch"] = df["thalch"].fillna(df["thalch"].median())
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
age           0
sex           0
cp            0
chol          0
restecg       0
thalch        0
exang         0
num           0
dtype: int64


In [23]:
# Impute trestbps and oldpeak with median of the feature
df["trestbps"] = df["trestbps"].fillna(df["trestbps"].median())
df["oldpeak"] = df["oldpeak"].fillna(df["oldpeak"].median())
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
fbs          90
age           0
sex           0
cp            0
trestbps      0
chol          0
restecg       0
thalch        0
exang         0
oldpeak       0
num           0
dtype: int64


In [24]:
# Impute fbs (fasting blood sugar) with the most frequent value (mode)
df["fbs"] = df["fbs"].fillna(df["fbs"].mode()[0])
# Sanity check
check_missing()

MISSING VALUE DETECTION
ca          611
thal        486
slope       309
age           0
sex           0
cp            0
trestbps      0
chol          0
fbs           0
restecg       0
thalch        0
exang         0
oldpeak       0
num           0
dtype: int64
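
The per-column median and mode steps above all follow the same pattern, so they could be condensed into one helper. The sketch below assumes a hypothetical function name, `impute_simple`, and toy data. It is not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_simple(frame, median_cols, mode_cols):
    """Return a copy with median imputation for median_cols and mode imputation for mode_cols."""
    out = frame.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data with one missing value per column.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0],
    "restecg": ["normal", None, "normal", "lv hypertrophy"],
})

filled = impute_simple(toy, median_cols=["chol"], mode_cols=["restecg"])
print(filled)
```

Returning a copy instead of modifying the frame in place makes it easier to compare the data before and after imputation.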


In [25]:
# Impute ca, thal and slope using MICE (Multiple Imputation by Chained Equations)

# Categorical features with missing values
categorical_features = ['slope', 'thal']

# Encode categorical features with LabelEncoder, keeping missing values as NaN.
# A string placeholder such as "-1" would be encoded as an ordinary category,
# so IterativeImputer would never treat those entries as missing.
encoders = {}
for feature in categorical_features:
    le = LabelEncoder()
    observed = df[feature].notna()
    codes = pd.Series(np.nan, index=df.index)
    codes.loc[observed] = le.fit_transform(df.loc[observed, feature].astype(str))
    df[feature] = codes
    encoders[feature] = le  # Store encoder for later decoding

# Initialize IterativeImputer
imputer = IterativeImputer(max_iter=50, random_state=42, estimator=RandomForestRegressor())

# Fit and transform the imputer for selected features
df[['slope', 'ca', 'thal']] = imputer.fit_transform(df[['slope', 'ca', 'thal']])

# Decode the imputed values back to original categories
for feature in categorical_features:
    le = encoders[feature]
    df[feature] = df[feature].round().astype(int)  # Round to the nearest integer code
    df[feature] = le.inverse_transform(df[feature])

# Sanity check
check_missing()

MISSING VALUE DETECTION
No missing value present in any feature
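
One detail is worth checking when label-encoding a column before passing it to `IterativeImputer`: the imputer only fills entries that are `NaN`. The sketch below uses a toy `slope` series (an assumption for illustration) to show that a string placeholder is encoded as a regular class, whereas encoding only the observed values keeps the gaps visible as `NaN`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column with two missing entries.
slope = pd.Series(["flat", None, "upsloping", "flat", None])

# Placeholder approach: "-1" becomes an ordinary class (it sorts first, so code 0).
placeholder_codes = LabelEncoder().fit_transform(slope.fillna("-1").astype(str))
print(list(placeholder_codes))  # [1, 0, 2, 1, 0]

# NaN-preserving approach: encode observed values only, leave the rest as NaN.
observed = slope.notna()
le = LabelEncoder().fit(slope[observed].tolist())
nan_codes = pd.Series(np.nan, index=slope.index)
nan_codes.loc[observed] = le.transform(slope[observed].tolist())
print(nan_codes.tolist())  # [0.0, nan, 1.0, 0.0, nan]
```

A quick post-imputation check is to confirm that no placeholder string (here `"-1"`) remains among the decoded categories.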


In [26]:
# Save the cleaned data
cleaned_path = os.path.join(data_path, "hd_uci_cleaned.csv")
df.to_csv(cleaned_path, index= False)

In [15]:
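# Alternative strategy: drop the rows containing a missing value in ca instead of imputing.
# Judging by the cell numbers and the output below, this cell was run on the data before the imputation cells above.
df.dropna(subset=["ca"], inplace=True)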

# Sanity check
describe_data()
check_missing()

DATA DESCRIPTION
Number of observations: 309
Number of features: 14
MISSING VALUE DETECTION
thal        10
fbs          5
slope        3
chol         1
age          0
sex          0
cp           0
trestbps     0
restecg      0
thalch       0
exang        0
oldpeak      0
ca           0
num          0
dtype: int64


In [16]:
# Drop the rows containing a missing value in thal
df.dropna(subset=["thal"], inplace=True)

# Sanity check
describe_data()
check_missing()

DATA DESCRIPTION
Number of observations: 299
Number of features: 14
MISSING VALUE DETECTION
No missing value present in any feature


In [17]:
# Save the cleaned data
cleaned_path = os.path.join(data_path, "hd_uci_no_missing.csv")
df.to_csv(cleaned_path, index= False)
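
Dropping incomplete rows is simple, but it discards a large share of the data: the two steps above keep 299 of the 920 observations, about 32.5%. The sketch below, on assumed toy data, shows how both columns can be handled in one `dropna` call and how the retained fraction can be reported.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in ca and thal (values are illustrative).
toy = pd.DataFrame({
    "age": [63, 67, 37, 41, 56],
    "ca": [0.0, 3.0, np.nan, 0.0, np.nan],
    "thal": ["fixed defect", "normal", "normal", None, "normal"],
})

# Keep only rows where both ca and thal are present.
complete = toy.dropna(subset=["ca", "thal"])

# Fraction of rows that survive listwise deletion.
retained = len(complete) / len(toy)
print(f"Rows kept: {len(complete)} of {len(toy)} ({retained:.1%})")
```

Comparing this fraction with the imputed dataset's full row count helps in deciding which cleaned file to use for modeling.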