### Title of Project: Heart Disease Prediction

`Author of notebook:` [Usman Khan](https://github.com/UsmanK7)\
`Date: ` 31 Jan 2025 \
`Dataset used: ` [Heart Disease UCI](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/data)


# Meta Data

## Context
This is a multivariate dataset: each record combines several numerical and categorical variables describing one patient. It is composed of 14 attributes, the 13 features listed below plus the prediction target (`num`):

- Age  
- Sex  
- Chest pain type  
- Resting blood pressure  
- Serum cholesterol  
- Fasting blood sugar  
- Resting electrocardiographic results  
- Maximum heart rate achieved  
- Exercise-induced angina  
- Oldpeak — ST depression induced by exercise relative to rest  
- Slope of the peak exercise ST segment  
- Number of major vessels  
- Thalassemia  

The original database includes 76 attributes, but all published studies use a subset of 14 of them. The Cleveland database is the only one used by ML researchers to date.  

One of the major tasks for this dataset is to predict whether a given patient has heart disease based on the provided attributes. Another experimental task is to diagnose and find insights from the dataset that can help in understanding heart disease better.

## Content

### Column Descriptions:

- **id**: Unique ID for each patient  
- **age**: Age of the patient in years  
- **dataset**: Place of study (for example, Cleveland)  
- **sex**: Male/Female  
- **cp**: Chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])  
- **trestbps**: Resting blood pressure (in mm Hg on admission to the hospital)  
- **chol**: Serum cholesterol in mg/dl  
- **fbs**: Whether fasting blood sugar is > 120 mg/dl (True/False)  
- **restecg**: Resting electrocardiographic results  
  - Values: [normal, ST-T abnormality, left ventricular hypertrophy]  
- **thalch**: Maximum heart rate achieved  
- **exang**: Exercise-induced angina (True/False)  
- **oldpeak**: ST depression induced by exercise relative to rest  
- **slope**: Slope of the peak exercise ST segment ([upsloping, flat, downsloping])  
- **ca**: Number of major vessels (0-3) colored by fluoroscopy  
- **thal**: [normal, fixed defect, reversible defect]; the data spells the last value `reversable defect`  
- **num**: The predicted attribute (presence of heart disease), an integer where 0 means no disease and 1-4 indicate presence  
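
Because `num` is stored as an integer rather than a yes/no flag, a binary label is often derived from it before classification. The sketch below is one way to do that, assuming the UCI convention that 0 means absence and 1 to 4 mean presence of disease. The values in `num` are hypothetical and only illustrate the mapping; `has_disease` is a name introduced here, not a column in the dataset.

```python
import pandas as pd

# Hypothetical `num` values for illustration; the real column comes from the CSV.
num = pd.Series([0, 2, 1, 0, 4], name="num")

# Assumption: 0 = no heart disease, 1-4 = heart disease present.
has_disease = (num > 0).astype(int)

print(has_disease.tolist())  # [0, 1, 1, 0, 1]
```

Keeping the original `num` column alongside the binary label preserves the option of treating the task as multiclass later.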

### Acknowledgements

#### Creators:
- **Hungarian Institute of Cardiology, Budapest**: Andras Janosi, M.D.  
- **University Hospital, Zurich, Switzerland**: William Steinbrunn, M.D.  
- **University Hospital, Basel, Switzerland**: Matthias Pfisterer, M.D.  
- **V.A. Medical Center, Long Beach and Cleveland Clinic Foundation**: Robert Detrano, M.D., Ph.D.  

#### Relevant Papers:
- Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989).  
  *International application of a new probability algorithm for the diagnosis of coronary artery disease.*  
  American Journal of Cardiology, 64, 304–310.  

- Aha, D.W., & Kibler, D.  
  *Instance-based prediction of heart-disease presence with the Cleveland database.*  

- Gennari, J.H., Langley, P., & Fisher, D. (1989).  
  *Models of incremental concept formation.*  
  Artificial Intelligence, 40, 11–61.  

#### Citation Request
The authors of the dataset have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for data collection at each institution:

- **Hungarian Institute of Cardiology, Budapest**: Andras Janosi, M.D.  
- **University Hospital, Zurich, Switzerland**: William Steinbrunn, M.D.  
- **University Hospital, Basel, Switzerland**: Matthias Pfisterer, M.D.  
- **V.A. Medical Center, Long Beach and Cleveland Clinic Foundation**: Robert Detrano, M.D., Ph.D.  


## Aims and Objectives:

### Import libraries
Let's start the project by importing the libraries required for data manipulation, visualization, and machine learning.


In [2]:
# import libraries

# 1. Data manipulation
import pandas as pd
import numpy as np

# 2. Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 3. Data preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler

# 4. Imputation of missing values
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer is experimental, so the enabling import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# 5. Data splitting, model selection, and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 6. Classification models
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
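
The `enable_iterative_imputer` import above looks unused, but `IterativeImputer` is still an experimental scikit-learn estimator and cannot be imported without it. The sketch below shows the import order and a minimal fit on a toy matrix. The values in `X` are hypothetical and are not taken from the heart disease data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

# Toy matrix (hypothetical values) with one missing entry in the second column.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

# Each feature with missing values is modelled from the other features.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```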

### Load the Dataset
In this section, we will load the heart disease dataset into a pandas DataFrame for further analysis and processing.

In [4]:
# load the dataset from the csv file using pandas
df = pd.read_csv('heart_disease_uci.csv')
df.head()

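|   | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|----|-----|-----|---------|----|----------|------|-----|---------|--------|-------|---------|-------|----|------|-----|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |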
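
After loading, it is usually worth confirming the shape, the column types, and how many values are missing before any preprocessing. The sketch below does this on the five rows shown above, so it runs without the CSV file. `summarize_missing` is a hypothetical helper introduced here, not part of the notebook; on the full data the same calls would be applied to `df`.

```python
import io
import pandas as pd

# The five rows from the df.head() output, embedded so the sketch is self-contained.
sample_csv = """id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0
"""
sample = pd.read_csv(io.StringIO(sample_csv))


def summarize_missing(frame):
    """Return the count and percentage of missing values per column, highest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


print(sample.shape)    # rows and columns
print(sample.dtypes)   # numeric vs. categorical columns
print(summarize_missing(sample))
```

These five rows happen to be complete, so the summary reports zero missing values here; the imputers imported earlier suggest the full file is not.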
