# Heart Disease Prediction
Author: [Usman Tariq](https://github.com/USM811)

&emsp;<a id="content_list">**Content List**</a>
> 1.0 - [Introduction](#intro)\
> &emsp;1.1 - [Purpose](#purpose)\
> &emsp;1.2 - [Methodology](#methodology)\
> \
> 2.0 - [About Data](#about_data)\
> &emsp;2.1 - [Context](#context)\
> &emsp;2.2 - [Column Descriptions](#column_descriptions)\
> \
> 3.0 - [Acknowledgements](#acknowledgements)\
> &emsp;3.1 - [Creators](#creators)\
> \
> 4.0 - [Import Libraries](#import_libraries)\
> \
> 5.0 - [Load Dataset](#loading_dataset)\
> \
> 6.0 - [Data Overview](#data_overview)\
> &emsp;6.1 - [Inspect Data Dimensions](#dimenstions)\
> &emsp;6.2 - [Inspect Missing Data](#inspect_missing)\
> &emsp;6.3 - [Inspect Categorical Features](#inspect_cat_cols)\
> &emsp;6.4 - [Inspect Numerical Features](#inspect_num_cols)\
> \
> 7.0 - [Data Pre-processing](#data_preprocessing)\
> &emsp;7.1 - [Drop the Irrelevant Features](#irrelevant_cols)\
> &emsp;7.2 - [Handling Missing Data](#handle_missing)\
> &emsp;&emsp;7.2.1 - [Impute the `Numerical Features` with low (`less than 10%`) Missing Values](#handle_num_cols)\
> &emsp;&emsp;7.2.2 - [Impute the Categorical Features with low (`less than 10%`) Missing Values](#handle_cat_cols)\
> &emsp;&emsp;7.2.3 - [Impute the Features with `high percentage` of Missing Values](#handle_high_missing_cols)\
> &emsp;[Observations and Improvements - 7.2](#observation_improvements)

## <a id="intro">1.0 - Introduction</a>

&emsp;<a id="purpose">**1.1 - Purpose**</a>
> In this notebook, I built a machine learning workflow to `handle the missing data` by carefully examining the medical attributes provided in the [`UCI Heart Disease Dataset`](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data).
>
> My ultimate goal is to predict the severity of heart disease in a patient from the patient's `age` and `sex` together with measured medical parameters (`e.g. chest pain, blood pressure, cholesterol, heart rate, etc.`).


&emsp;<a id="methodology">**1.2 - Methodology**</a>
+ The following machine learning algorithms will be used to build the model:
> 1. Logistic Regression
> 2. Support Vector Machine (SVC)
> 3. K-Nearest Neighbors (KNN)
> 4. Decision Tree Algorithm
> 5. Random Forest Algorithm

+ The following metrics will be used to measure the model performance:
> 1. Confusion Matrix
> 2. Classification Report
> 3. F1 Score
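
As a rough illustration of how the algorithms and metrics listed above fit together, the sketch below fits each of the five classifiers and collects the three metrics. It runs on synthetic data generated with `make_classification`, not on the heart disease dataset, and the helper name `evaluate_models` and all hyperparameters are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, random_state=42):
    """Fit five classifiers; return {name: (confusion matrix, report, weighted F1)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    # Hypothetical default settings, chosen only for illustration.
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVC": SVC(),
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = (
            confusion_matrix(y_test, y_pred),
            classification_report(y_test, y_pred, zero_division=0),
            f1_score(y_test, y_pred, average="weighted"),
        )
    return results


# Synthetic stand-in data: 300 samples, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
demo_results = evaluate_models(X_demo, y_demo)
for name, (_, _, f1) in demo_results.items():
    print(f"{name}: weighted F1 = {f1:.3f}")
```

The weighted F1 average is used here because the target `num` has more than two classes; other averaging modes would be equally reasonable choices.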

<div style='text-align: center;'>
Back to the <a href="#content_list">Content List</a>
</div>

## <a id="about_data">2.0 - About Data</a>

&emsp;<a id="context">**2.1 - Context**</a>
> This is a multivariate dataset, meaning that each record involves several separate statistical variables. It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, and thalassemia. The original database includes 76 attributes, but all published studies use a subset of 14 of them, and the Cleveland database is the only one used by ML researchers to date.
>
> One major task on this dataset is to predict, from the given attributes of a patient, whether that person has heart disease. The other is exploratory: to diagnose and draw insights from the dataset that help in understanding the problem.

&emsp;<a id="column_descriptions">**2.2 - Column Descriptions**</a>
> 01. `id`: unique id for each patient
> 02. `age`: age of the patient in years
> 03. `dataset`: place of study
> 04. `sex`: Male/Female
> 05. `cp`: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])
> 06. `trestbps`: resting blood pressure (in mm Hg on admission to the hospital)
> 07. `chol`: serum cholesterol in mg/dl
> 08. `fbs`: whether fasting blood sugar > 120 mg/dl (True/False)
> 09. `restecg`: resting electrocardiographic results ([normal, st-t abnormality, lv hypertrophy])
> 10. `thalch`: maximum heart rate achieved
> 11. `exang`: exercise-induced angina (True/False)
> 12. `oldpeak`: ST depression induced by exercise relative to rest
> 13. `slope`: the slope of the peak exercise ST segment
> 14. `ca`: number of major vessels (0-3) colored by fluoroscopy
> 15. `thal`: [normal; fixed defect; reversable defect]
> 16. `num`: the predicted attribute (severity of the heart disease, `0` to `4`)

<div style='text-align: center;'>
Back to the <a href="#content_list">Content List</a>
</div>

## <a id="acknowledgements"> 3.0 - Acknowledgements</a>
&emsp;<a id="creators">**3.1 - Creators**</a>
> 1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
> 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
> 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
> 4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## <a id="import_libraries">4.0 - Import Libraries</a>

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier


from sklearn.model_selection import train_test_split

# import label encoder
from sklearn.preprocessing import LabelEncoder

# import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# import warnings
import warnings
warnings.filterwarnings('ignore')

## <a id="loading_dataset">5.0 - Load Dataset</a>

In [2]:
# Importing Dataset
df = pd.read_csv('heart_disease_uci.csv')

## <a id="data_overview">6.0 - Data Overview</a>

&emsp;<a id="dimenstions">**6.1 - Inspect Data Dimensions**</a>

In [3]:
# Display first 5 rows.
df.head()

| | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |


In [4]:
# shape of the dataset
df.shape

(920, 16)

In [5]:
# Columns of the dataset
df.columns

Index(['id', 'age', 'sex', 'dataset', 'cp', 'trestbps', 'chol', 'fbs',
       'restecg', 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

&emsp; **Observations - 6.1**
> + There are `920` rows, i.e. records for `920` patients.
> + There are `16` columns in total, including `id` and `dataset` (the place of study).
> + The target feature `num` represents the ordinal numeric severity of the heart disease (`[0, 1, 2, 3, 4]`).
> + There are `13` features or `medical parameters` (excluding `id` and `dataset`), which will be used to predict the target feature `num` (the intensity of the heart disease).

&emsp;<a id="inspect_missing">**6.2 - Inspect Missing Data**</a>

In [6]:
# Identify the columns in which the data is missing.
round((df.isnull().sum()[df.isnull().sum()>0]/len(df)*100),1).sort_values(ascending=False)

ca          66.4
thal        52.8
slope       33.6
fbs          9.8
oldpeak      6.7
trestbps     6.4
thalch       6.0
exang        6.0
chol         3.3
restecg      0.2
dtype: float64

&emsp;**Observations - 6.2**
> + There are `10` features in which the data is missing.
> + There are `7` features in which the `percentage of missing data` is `less than 10%`
> + There are `3` features in which the `percentage of missing data` is `high (33.6%, 52.8%, 66.4%)`.
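
The low/high split described above can be computed in one step. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and the `10%` threshold used in this notebook; it is demonstrated on a small made-up frame rather than on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame, threshold=10.0):
    """Return percent missing per column (only columns with gaps), flagged low/high."""
    pct = frame.isnull().mean().mul(100).round(1)
    pct = pct[pct > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "missing_pct": pct,
        "level": np.where(pct < threshold, "low", "high"),
    })


# Tiny made-up frame, not the heart disease data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
    "b": [np.nan] * 6 + [1.0] * 6,
    "c": range(12),
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the notebook's dataframe should reproduce the percentages shown in the output above, with an extra column marking each feature as `low` or `high`.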

&emsp;<a id="inspect_cat_cols">**6.3 - Inspect Categorical Features**</a>

In [7]:
# Identify the unique values in each categorical column.
for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype == 'category':
        print(col, ":", df[col].unique(), '\n')

sex : ['Male' 'Female'] 

dataset : ['Cleveland' 'Hungary' 'Switzerland' 'VA Long Beach'] 

cp : ['typical angina' 'asymptomatic' 'non-anginal' 'atypical angina'] 

fbs : [True False nan] 

restecg : ['lv hypertrophy' 'normal' 'st-t abnormality' nan] 

exang : [False True nan] 

slope : ['downsloping' 'flat' 'upsloping' nan] 

thal : ['fixed defect' 'normal' 'reversable defect' nan] 



&emsp;**Observations - 6.3**
> + Each category is spelled consistently within its feature, so there are no duplicate variants to merge. Note that the dataset spells one `thal` category as `'reversable defect'`.
> 
> + The feature `slope` can be considered to be ordinal.
> 
>   + `'downsloping'` represents a `downward slope`.
>   + `'flat'` represents `no significant slope`.
>   + `'upsloping'` represents an `upward slope`.
>   + `This ordering` implies a natural progression `from downsloping to flat to upsloping`.
>
> + The feature `thal` can be considered to be ordinal.
> 
>   + `'normal'` represents `no abnormality`.
>   + `'reversable defect'` indicates a `potentially reversible abnormality`.
>   + `'fixed defect'` represents a `permanent abnormality`.
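
If `slope` and `thal` are treated as ordinal, one way to make the order explicit is an ordered pandas `Categorical`. The sketch below is an illustration only: the helper `to_ordinal_codes` and the two order lists are assumptions based on the observations above, with category spellings taken from the dataset output.

```python
import pandas as pd

# Category orders as described in the observations; spellings follow the dataset.
SLOPE_ORDER = ["downsloping", "flat", "upsloping"]
THAL_ORDER = ["normal", "reversable defect", "fixed defect"]


def to_ordinal_codes(series, order):
    """Map a categorical column to codes 0..n-1 following `order`; missing stays NaN."""
    cat = pd.Categorical(series, categories=order, ordered=True)
    codes = pd.Series(cat.codes, index=series.index, dtype="float64")
    return codes.where(codes >= 0)  # pandas marks missing with -1; restore NaN


sample = pd.Series(["flat", "upsloping", None, "downsloping"])
slope_codes = to_ordinal_codes(sample, SLOPE_ORDER)
print(slope_codes.tolist())
```

Unlike a plain `LabelEncoder`, which assigns codes in alphabetical order, this keeps the integer codes aligned with the intended progression of each feature.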

&emsp;<a id="inspect_num_cols">**6.4 - Inspect Numerical Features**</a>

In [8]:
# Create a list of numerical columns.
num_cols = [col for col in df.columns if df[col].dtype != 'object' and df[col].dtype != 'category']
num_cols

['id', 'age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'num']

In [9]:
# Checking the unique values of each numerical feature one by one.
df['ca'].unique()

array([ 0.,  3.,  2.,  1., nan])

&emsp;**Observations - 6.4**
> + The feature `ca` can be considered to be ordinal.
> 
>   + The values `'0'`, `'1'`, `'2'`, and `'3'` can be arranged in a specific order based on the `increasing number of vessels colored`.
>   + The ordering from `0 to 3` signifies an `increase in the severity` or extent of the condition being measured.

## <a id="data_preprocessing">7.0 - Data Pre-processing</a>

&emsp;<a id="irrelevant_cols">**7.1 - Drop the Irrelevant Features based on the `Observation - 6.1`**</a>

In [10]:
# Drop the column 'id' from the dataframe.
df.drop(['id'], axis=1, inplace=True)

&emsp;<a id="handle_missing">**7.2 - Handling Missing Data**</a>

In [11]:
# Identify the columns in which the data is missing.
round((df.isnull().sum()[df.isnull().sum()>0]/len(df)*100),1).sort_values(ascending=False)

ca          66.4
thal        52.8
slope       33.6
fbs          9.8
oldpeak      6.7
trestbps     6.4
thalch       6.0
exang        6.0
chol         3.3
restecg      0.2
dtype: float64

&emsp;&emsp;<a id="handle_num_cols">**7.2.1 - Impute the `Numerical Features` with low (`less than 10%`) Missing Values by using `IterativeImputer`**</a>

In [12]:
# Finding the numerical columns from the dataframe.
num_cols = df.select_dtypes(include=['float', 'int']).columns

# Calculating the percentage of missing values in each numerical column
missing_percentages = (df[num_cols].isnull().sum() / len(df)) * 100

# Select numerical features with missing data less than 10%
num_low_missing_cols = missing_percentages[(missing_percentages > 0) & (missing_percentages < 10)].index.tolist()

# Print the selected numerical features
print("Numerical features having missing data less than 10% are:", num_low_missing_cols)

Numerical features having missing data less than 10% are: ['trestbps', 'chol', 'thalch', 'oldpeak']


In [13]:
# Apply IterativeImputer with RandomForestRegressor to impute missing values of low missing numerical columns.
iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), random_state=42, add_indicator=True)
imputed_values = iterative_imputer.fit_transform(df[num_low_missing_cols])

df[num_low_missing_cols] = imputed_values[:, :len(num_low_missing_cols)]

In [14]:
# Verifying the missing values in low missing numerical columns after imputation.
round((df.isnull().sum()[df.isnull().sum()>0]/len(df)*100),1).sort_values(ascending=False)

ca         66.4
thal       52.8
slope      33.6
fbs         9.8
exang       6.0
restecg     0.2
dtype: float64

+ The numerical features with less than 10% missing values have been imputed successfully.
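
The slice `imputed_values[:, :len(num_low_missing_cols)]` used in this step is needed because `add_indicator=True` appends one 0/1 indicator column for every feature that had missing values. The sketch below shows the resulting layout on made-up data; it uses the default `IterativeImputer` estimator instead of `RandomForestRegressor` to stay short.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Made-up data: columns 0 and 1 contain gaps, column 2 is complete.
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 110.0],
    [np.nan, 30.0, 120.0],
    [4.0, 40.0, 130.0],
    [5.0, 50.0, 140.0],
])

imputer = IterativeImputer(random_state=42, add_indicator=True)
out = imputer.fit_transform(X)

n_features = X.shape[1]
imputed = out[:, :n_features]     # the imputed feature values
indicators = out[:, n_features:]  # 0/1 flags, one column per feature that had gaps

print(out.shape)  # (5, 5): 3 features + 2 indicator columns
```

In the notebook's cell the indicator columns are discarded by the slice, so only the imputed feature values are written back to `df`.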

&emsp;&emsp;<a id="handle_cat_cols">**7.2.2 - Impute the Categorical Features with low (`less than 10%`) Missing Values by using `Random Values Imputation` technique**</a>

In [15]:
# Finding the categorical columns from the dataframe.
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# Calculating the percentage of missing values in each column.
missing_percentages = (df[cat_cols].isnull().sum() / len(df)) * 100

# Select categorical features with missing data less than 10%.
cat_low_missing_cols = missing_percentages[(missing_percentages > 0) & (missing_percentages < 10)].index.tolist()

# Print the selected categorical features.
print("Categorical features having missing data less than 10% are:", cat_low_missing_cols)

Categorical features having missing data less than 10% are: ['fbs', 'restecg', 'exang']


In [16]:
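# Imputing the missing values in categorical columns with low percentage of missing values.
for col in cat_low_missing_cols:
    missing_mask = df[col].isnull()
    # Fill each gap with a random draw from the observed values; use .loc to avoid chained assignment.
    df.loc[missing_mask, col] = df[col].dropna().sample(missing_mask.sum(), replace=True).values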

In [17]:
# Verifying the missing values in low missing categorical columns after imputation.
round((df.isnull().sum()[df.isnull().sum()>0]/len(df)*100),1).sort_values(ascending=False)


ca       66.4
thal     52.8
slope    33.6
dtype: float64

+ The categorical features with less than 10% missing values have been imputed successfully.

&emsp;&emsp;<a id="handle_high_missing_cols">**7.2.3 - Impute the Features with `high percentage` of Missing Values by using `RandomForestClassifier`**</a>

In [18]:
# Identify the features that still have missing values (all of them are missing in more than 10% of rows).
high_missing_cols = df.isnull().sum()[df.isnull().sum() > 0].index.tolist()
high_missing_cols

['slope', 'ca', 'thal']

In [19]:
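def impute_high_missing_data(passed_col):
    """Impute `passed_col` with a RandomForestClassifier trained on the rows where it is known."""

    # Split rows by whether the target column is missing; copy so the edits below stay local.
    df_null = df[df[passed_col].isnull()].copy()
    df_not_null = df[df[passed_col].notnull()].copy()

    X = df_not_null.drop(passed_col, axis=1)
    y = df_not_null[passed_col]

    other_missing_cols = [col for col in high_missing_cols if col != passed_col]

    label_encoder = LabelEncoder()

    # Label-encode the categorical features (the encoder is refit for every column).
    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), add_indicator=True)

    # Temporarily fill any NaN left in the other high-missing features.
    # Each column is passed on its own, so the imputer has no other features to learn from.
    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]

    # Hold out 20% of the known rows to estimate how well the classifier recovers the feature.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf_classifier = RandomForestClassifier()
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    acc_score = accuracy_score(y_test, y_pred)

    print("The feature '"+ passed_col+ "' has been imputed with", round((acc_score * 100), 2), "accuracy\n")

    # Prepare the rows where the target is missing in the same way.
    # Caveat: the encoder is refit on these rows only, so the integer codes match
    # the training rows only if the same categories occur in both subsets.
    X = df_null.drop(passed_col, axis=1)

    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]

    # Predict the missing values of the target column.
    if len(df_null) > 0:
        df_null[passed_col] = rf_classifier.predict(X)

    # Recombine both parts; assigning the result to df[col] aligns rows by index.
    df_combined = pd.concat([df_not_null, df_null])

    return df_combined[passed_col]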

In [20]:
df.isnull().sum()[df.isnull().sum() > 0]

slope    309
ca       611
thal     486
dtype: int64

In [21]:
for col in high_missing_cols:
    df[col] = impute_high_missing_data(col)

The feature 'slope' has been imputed with 68.29 accuracy



The feature 'ca' has been imputed with 64.52 accuracy

The feature 'thal' has been imputed with 70.11 accuracy



&emsp;<a id="observation_improvements">**Observations and Improvements - 7.2**</a>
> + One of my colleagues, [Muhammad Bilal Khan](https://www.kaggle.com/devbilalkhan), also worked on the same dataset to impute the missing values. See his [notebook](https://www.kaggle.com/code/devbilalkhan/ml-heart-disease-detection-random-forest/notebook).
>
> + `'Muhammad Bilal Khan'` used the `SimpleImputer` for the features having `low percentages (less than 10%) of missing data`, with the `mean` strategy for `numerical features` and the `most_frequent` strategy for the `categorical features`.
> 
> 
> + I used the `Multivariate Iterative Imputer` for the `numerical features` having `low percentages (less than 10%) of missing data`, with the parameter `add_indicator=True`, which also flags the entries that were originally missing.
> 
> 
> + I imputed the missing data in the `categorical features` with `Non-Null Random Values` drawn from the `corresponding categorical features`, because a random draw from the observed values selects each category according to its `probability of occurrence`.
>   + Although this strategy is not perfect, it is a better way to `avoid the skew or bias` created by filling every gap with the `most_frequent` value.
> 
> 
> + `'Muhammad Bilal Khan'` used the `RandomForestClassifier` for the features having `high percentages of missing data`, but he dropped the other remaining features with high percentages of missing data while imputing the first and second features.
>   + Although this strategy increases the accuracy of the imputation of the high-missing features, its drawback is that we are totally `ignoring some features` from the dataset while imputing one of them.
>   + Ignoring a feature can negatively affect the `accuracy` of the final prediction of the `target feature (num)`.
> 
> 
> + To overcome the issue stated immediately above, instead of dropping the remaining features having high percentages of missing data, I followed this strategy:
>   + I temporarily imputed the other features having high missing data by using the `IterativeImputer`, just before applying the `RandomForestClassifier` to the feature being imputed, and then wrote the imputed feature back into the original dataframe.
> 
>   + With this technique, the accuracy of the imputation of the features having high missing data is `a little lower` than the accuracy achieved by `Muhammad Bilal Khan` in his [`notebook`](https://www.kaggle.com/code/devbilalkhan/ml-heart-disease-detection-random-forest/notebook), but this imputation technique considers all the features and it may `positively affect the final prediction` of the `target feature (num)`.
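
To make the contrast between the `most_frequent` strategy and random-value imputation concrete, the sketch below applies both to a made-up categorical column and prints the resulting category shares. The data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)

# Made-up column: 60 'normal' and 40 'abnormal' observed values, plus 50 missing entries.
observed = ["normal"] * 60 + ["abnormal"] * 40
col = pd.Series(observed + [np.nan] * 50, dtype="object")

# Strategy 1: fill every gap with the most frequent category.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(col.to_frame())[:, 0]
mode_filled = pd.Series(mode_filled)

# Strategy 2: fill each gap with a random draw from the observed values.
random_filled = col.copy()
mask = random_filled.isnull()
random_filled[mask] = rng.choice(col.dropna().to_numpy(), size=mask.sum(), replace=True)

print("observed      :", col.dropna().value_counts(normalize=True).round(2).to_dict())
print("most_frequent :", mode_filled.value_counts(normalize=True).round(2).to_dict())
print("random sample :", random_filled.value_counts(normalize=True).round(2).to_dict())
```

With `most_frequent`, all 50 gaps become `'normal'`, so its share rises from 60% to about 73%, whereas random sampling keeps the shares close to those of the observed values.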

&emsp;**Further Possible Improvements - 7.2**
> + We can create separate small functions for the `temporary imputation` and the `feature encoding` steps that are used while imputing the features having high percentages of missing data.
> 
> + In this way, we can `improve the code reusability` and `reduce the duplicated code` of the program.
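
One possible shape for such helpers is sketched below. The function names (`encode_features`, `temporarily_impute`, `prepare_features`) are hypothetical, the default `IterativeImputer` estimator is used for brevity, and the demo frame is made up. This is not the notebook's implementation, only an outline of the refactoring under those assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer


def encode_features(frame):
    """Return a copy with object/category/bool columns as numeric codes; NaN is kept."""
    encoded = frame.copy()
    for col in encoded.select_dtypes(include=["object", "category", "bool"]).columns:
        codes = pd.Series(pd.Categorical(encoded[col]).codes, index=encoded.index, dtype="float64")
        encoded[col] = codes.where(codes >= 0)  # -1 marks missing; restore NaN
    return encoded


def temporarily_impute(frame, random_state=42):
    """Fill the remaining NaN values, letting the imputer use all columns together."""
    imputer = IterativeImputer(random_state=random_state)
    filled = imputer.fit_transform(frame)
    return pd.DataFrame(filled, columns=frame.columns, index=frame.index)


def prepare_features(frame, target_col):
    """Encode and impute every column except `target_col`, once, for all rows."""
    features = encode_features(frame.drop(columns=target_col))
    return temporarily_impute(features)


# Made-up demo frame, not the heart disease data.
toy = pd.DataFrame({
    "age": [63, 67, 37, 41, 56, 62],
    "slope": ["flat", None, "upsloping", "flat", "downsloping", None],
    "ca": [0.0, 3.0, np.nan, 0.0, 1.0, np.nan],
    "thal": ["normal", "fixed defect", None, "normal", None, "normal"],
})
X_all = prepare_features(toy, target_col="thal")
print(X_all.isnull().sum().sum())  # 0
```

Because the encoding is done once on the full frame, the rows with a known target and the rows to be imputed share the same integer codes. The imputed codes for categorical columns are continuous values and are meant only as temporary classifier inputs.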