# ü©∫ Clinical Heart Disease Prediction with AI  
---
<span style="color:red">*by **Ridwan Oladipo, MD | AI Specialist***</span>

End-to-end machine learning pipeline for **coronary artery disease (CAD)** prediction using the **Cleveland Heart Disease dataset** ‚Äî designed for transparency, clinical relevance, and production readiness.


**üéØ Objectives**
- Predict **heart disease presence** from standard clinical features (age, chest pain, cholesterol, ECG, exercise tolerance)
- Explain predictions with **SHAP** for clinician trust
- Deploy via **FastAPI + Streamlit on AWS Fargate**


**üìä Dataset**
- 303 patients √ó 14 features  
- Target: `1 = disease`, `0 = no disease`  

> ‚ö†Ô∏è **Note:** The Cleveland dataset is small (303 patients) and used here as a **methods + deployment template**.  
> The same pipeline generalizes seamlessly to **larger EHR datasets** (e.g., **MIMIC-IV**, multi-center registries), ensuring real-world scalability.


**üß† Methods**
- Feature engineering aligned with **cardiology guidelines**
- Models: **Logistic Regression, Random Forest, XGBoost**
- End-to-end **MLOps deployment** (FastAPI + Streamlit + AWS Fargate)

---

> ‚öïÔ∏è **Built by a physician‚Äìdata scientist to merge clinical insight with AI precision.**

## ==================================================================
## 1Ô∏è‚É£ DATA LOADING & OVERVIEW
## ==================================================================

### üì• Import Libraries & Set Global Styles

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

### üìÇ Load Dataset & Show Basic Shape + Memory

In [2]:
df = pd.read_csv('../data/heartdisease.csv')

print("üìä DATASET OVERVIEW")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
print("\n" + "="*50)

üìä DATASET OVERVIEW
Shape: (303, 14)
Memory usage: 33.27 KB



### üîç Preview First 5 Rows of the Dataset

In [3]:
print("\nüîç FIRST 5 ROWS:")
display(df.head())


üîç FIRST 5 ROWS:


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### üìù Dataset Info & Data Types

In [4]:
print("\nüìã DATASET INFO:")
print(df.info())


üìã DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
None


### üö® Check for Missing Values

In [5]:
print("\nMISSING VALUES:")
missing_counts = df.isnull().sum()
missing_pct = (missing_counts / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct
}).round(2)

print(missing_df[missing_df['Missing Count'] > 0])
if missing_df['Missing Count'].sum() == 0:
    print("No missing values detected!")


MISSING VALUES:
Empty DataFrame
Columns: [Missing Count, Missing %]
Index: []
No missing values detected!


### üìà Statistical Summary of Numerical Features

In [6]:
print("\nüìà STATISTICAL SUMMARY:")
display(df.describe())


üìà STATISTICAL SUMMARY:


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


### üßê Verify Unique Categorical Values (Data Audit)

In [7]:
for col in ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal', 'ca']:
    print(f"{col}: {sorted(df[col].unique())}")

sex: [0, 1]
cp: [0, 1, 2, 3]
fbs: [0, 1]
restecg: [0, 1, 2]
exang: [0, 1]
slope: [0, 1, 2]
thal: [0, 1, 2, 3]
ca: [0, 1, 2, 3, 4]


### ü©∫ **Clinical Data Correction: Handling Invalid 'thal' and 'ca' Codes**

In [8]:
print("\nü©∫ CLINICAL DATA CORRECTION:")

# Check thal
num_thal_errors = df[df['thal'] == 0].shape[0]
if num_thal_errors > 0:
    thal_mode = df['thal'][df['thal'] != 0].mode()[0]
    df['thal'] = df['thal'].replace(0, thal_mode)
    print(f"Corrected {num_thal_errors} 'thal' rows by replacing 0 with mode: {thal_mode}")
else:
    print("No 'thal' correction needed.")

# Check ca
num_ca_errors = df[df['ca'] == 4].shape[0]
if num_ca_errors > 0:
    ca_mode = df['ca'][df['ca'] != 4].mode()[0]
    df['ca'] = df['ca'].replace(4, ca_mode)
    print(f"Corrected {num_ca_errors} 'ca' rows by replacing 4 with mode: {ca_mode}")
else:
    print("No 'ca' correction needed.")


ü©∫ CLINICAL DATA CORRECTION:
Corrected 2 'thal' rows by replacing 0 with mode: 2
Corrected 5 'ca' rows by replacing 4 with mode: 0


### Save cleaned dataframe to data directory and load it back

In [9]:
df.to_csv('../data/heartdisease_cleaned.csv', index=False)
print("Cleaned dataset saved to '../data/heartdisease_cleaned.csv'.")

df = pd.read_csv('../data/heartdisease_cleaned.csv')
print("Reloaded cleaned data into 'df' for continued analysis.")

Cleaned dataset saved to '../data/heartdisease_cleaned.csv'.
Reloaded cleaned data into 'df' for continued analysis.


### ü©∫**Clinical Integrity Highlights:** 
---

- üßë‚Äç‚öïÔ∏è **Population:** Middle-aged to elderly (mean ~54), slight male predominance.  
- ü©∏ **Cholesterol:** Extreme values up to **564**, likely reflecting true severe hyperlipidemia or rare data capture artifacts.  
  ‚ûî **Kept intentionally** to train robust models on possible high-risk outliers.

- üí° **Data Corrections:**  
  - Fixed invalid `thal=0` ‚ûî replaced with mode to reflect realistic thalassemia states.  
  - Fixed invalid `ca=4` ‚ûî replaced with mode to maintain coronary calcium plausibility (0‚Äì3).

> **Dataset now clinically sound & ready for deeper analysis.**