### 🫀 Heart Disease Prediction with Machine Learning

A comprehensive machine learning pipeline to predict heart disease using clinical and lifestyle features.  
This notebook includes data preprocessing, model training with hyperparameter tuning, evaluation, and model export for deployment.

**Best Model:** LightGBM  
**Evaluation Metrics:** Accuracy, Precision, Recall, F1 Score, ROC AUC  
**Deployment-ready:** Model saved with Pickle & Joblib, and prediction function supports threshold tuning.

---


## 1. Import Libraries

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load data
df = pd.read_csv(r"C:\Users\pascal\Desktop\PROJECTS 2025\Multi-disease prediction models\multi-health-ml-predictor\data\heart_data.csv")


In [None]:
df.head()

In [None]:
df.shape

In [None]:
# Rename columns for clarity
df.rename(columns={
    'age': 'age_days',
    'gender': 'sex',
    'height': 'height_cm',
    'weight': 'weight_kg',
    'ap_hi': 'systolic_bp',
    'ap_lo': 'diastolic_bp',
    'cholesterol': 'cholesterol_level',
    'gluc': 'glucose_level',
    'smoke': 'smoking',
    'alco': 'alcohol_intake',
    'active': 'physical_activity',
    'cardio': 'heart_disease'
}, inplace=True)

In [None]:
df.info()

In [None]:
# Convert age from days to years
df['age_years'] = (df['age_days'] // 365).astype(int)

In [None]:
# Drop age_days if you prefer
df.drop(columns='age_days', inplace=True)

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
missing_values = df.isnull().sum()
missing_values


In [None]:
duplicates = df.duplicated().sum()
duplicates

In [None]:
df.nunique()

In [None]:
# Select only numerical columns (excluding index/id)
numerical_cols = df.select_dtypes(include=['int64', 'int32','float64']).drop(['index', 'id', 'heart_disease'], axis=1).columns

# Set up the plot grid
plt.figure(figsize=(16, len(numerical_cols) * 4))

for i, col in enumerate(numerical_cols, 1):
    # Histogram
    plt.subplot(len(numerical_cols), 2, 2*i - 1)
    sns.histplot(df[col], kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {col}')
    
    # Boxplot
    plt.subplot(len(numerical_cols), 2, 2*i)
    sns.boxplot(x=df[col], color='salmon')
    plt.title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()

In [None]:
# 1. Remove blood pressure outliers
df = df[(df['systolic_bp'].between(90, 250)) & 
        (df['diastolic_bp'].between(60, 140))]


In [None]:

# 2. Remove implausible height and weight
df = df[(df['height_cm'].between(120, 220)) & 
        (df['weight_kg'].between(30, 200))]

In [None]:
df.head()

In [None]:
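def enhanced_heart_features(df):
    # Keep ALL original features, ADD derived ones.
    # Work on a copy: df is a filtered slice at this point, and assigning new
    # columns to a slice can trigger pandas' SettingWithCopyWarning.
    df = df.copy()
    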
    # 1. BMI (keep weight_kg, height_cm, ADD bmi)
    df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2
    
    # 2. Blood pressure derivatives (keep systolic_bp, diastolic_bp, ADD derived)
    df['pulse_pressure'] = df['systolic_bp'] - df['diastolic_bp']
    df['mean_arterial_pressure'] = df['diastolic_bp'] + (df['pulse_pressure'] / 3)
    
    # 3. Risk interactions (keep individual features, ADD interactions)
    df['bp_age_risk'] = (df['systolic_bp'] - 120) * df['age_years']
    df['metabolic_risk'] = df['bmi'] * df['glucose_level']
    
    # 4. Risk categories (ADD as new features)
    df['hypertension'] = ((df['systolic_bp'] >= 140) | (df['diastolic_bp'] >= 90)).astype(int)
    df['high_cholesterol'] = (df['cholesterol_level'] >= 3).astype(int)
    df['obesity'] = (df['bmi'] >= 30).astype(int)
    
    # 5. Cardiovascular risk score (ADD composite feature)
    df['cv_risk_score'] = (
        (df['age_years'] > 55).astype(int) +
        (df['systolic_bp'] > 140).astype(int) +
        (df['cholesterol_level'] >= 3).astype(int) +
        (df['bmi'] > 30).astype(int) +
        df['smoking']
    )
    
    return df

In [None]:
df = enhanced_heart_features(df)

In [None]:
df.head()

In [None]:
df['heart_disease'].value_counts(normalize=True)


In [None]:
df.dtypes

In [None]:
# Separate the classes
df_pos = df[df['heart_disease'] == 1]
df_neg = df[df['heart_disease'] == 0]

# Choose sample size per class (20,000 rows from each class here)
sample_size = 20000

# Sample from each class
df_pos_sample = df_pos.sample(n=sample_size, random_state=42)
df_neg_sample = df_neg.sample(n=sample_size, random_state=42)

# Combine and shuffle
df_sampled = pd.concat([df_pos_sample, df_neg_sample]).sample(frac=1, random_state=42).reset_index(drop=True)


In [None]:
# dfmodel = df_sampled.drop(columns=['index', 'id'])
dfmodel = df.drop(columns=['index', 'id'])


# Compute correlation matrix
correl_matrix = dfmodel.corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correl_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={'shrink': .8})
plt.title("Correlation Matrix", fontsize=16)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
correl_matrix 

In [None]:
correlation_with_target = correl_matrix['heart_disease'].drop('heart_disease')

correlation_with_target_df = correlation_with_target.to_frame().reset_index()
correlation_with_target_df.columns = ['Feature', 'Correlation heart_disease']

print(correlation_with_target_df)

In [None]:
correlation_threshold = 0.06 # define the correlation threshold below which features will be dropped

strong_correlations = correlation_with_target[abs(correlation_with_target) >= correlation_threshold]

features_to_keep = strong_correlations.index.tolist()

dfmodel_final = dfmodel[features_to_keep + ['heart_disease']]

dfmodel_final.head()

In [None]:
dfmodel_final.columns

In [None]:
dfmodel.columns

In [None]:
selected = [ 'systolic_bp', 'diastolic_bp']
corr_subset = dfmodel[selected].corr()

print("\n🔗 Correlation Matrix:")
print(corr_subset)

#### 🔍 Data Preparation & Feature Engineering Summary

Before building the models, the dataset was preprocessed to improve data quality. The key steps were:

- **Age Transformation**: Converted `age_days` to the more interpretable `age_years` using integer division by 365, then dropped `age_days`.
- **Outlier Removal**: Dropped rows with implausible values (rows were removed, not capped):
  - `systolic_bp` outside 90 to 250
  - `diastolic_bp` outside 60 to 140
  - `height_cm` outside 120 to 220
  - `weight_kg` outside 30 to 200
- **Feature Engineering**: Added `bmi`, `pulse_pressure`, `mean_arterial_pressure`, two interaction terms (`bp_age_risk`, `metabolic_risk`), binary risk flags (`hypertension`, `high_cholesterol`, `obesity`), and a composite `cv_risk_score`.
- **Correlation Analysis**: Explored pairwise relationships using a correlation matrix to detect multicollinearity among features.
- **Feature Selection**: Kept only features whose absolute correlation with `heart_disease` is at least 0.06, to reduce redundancy and the risk of overfitting.

These steps produce a cleaner, more stable input for model training.
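
The correlation-based selection step can be illustrated on a toy frame. This is a minimal sketch, not the notebook's actual data: the helper name `select_by_target_correlation` and the columns `strong_signal` and `no_signal` are made up for illustration, and only the 0.06 threshold comes from the notebook.

```python
import pandas as pd


def select_by_target_correlation(frame, target, threshold):
    """Return features whose |Pearson correlation| with `target` meets `threshold`."""
    corr = frame.corr()[target].drop(target)
    keep = corr[corr.abs() >= threshold].index.tolist()
    return keep, corr


# Toy data (hypothetical): one informative column, one uninformative column.
toy = pd.DataFrame({
    "strong_signal": [1, 2, 3, 4, 5, 6, 7, 8],
    "no_signal": [1, -1, -1, 1, 1, -1, -1, 1],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})

kept, corr_with_target = select_by_target_correlation(toy, "target", threshold=0.06)
print(kept)
```

Pearson correlation only measures linear association, so a feature dropped by this filter may still carry signal that a tree-based model could use. Treat the threshold as a heuristic rather than a guarantee.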


In [None]:
# pip install lightgbm

In [None]:
# pip install catboost

### Model Training and Evaluation

In [None]:

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

In [None]:

# ========== Split data ==========
X = dfmodel_final.drop('heart_disease', axis=1)
y = dfmodel_final['heart_disease']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ========== Preprocessing ==========
scaler = ColumnTransformer(
    transformers=[('num', StandardScaler(), X.columns)],
    remainder='passthrough'
)

# ========== Classifiers & Param Grids ==========
model_param_grid = [


    (
    'Logistic Regression', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42), {
        'clf__C': [0.01, 0.1, 1, 10],
        'clf__solver': ['liblinear', 'lbfgs']
    }
    )
    ,

    (
        'Random Forest', RandomForestClassifier(random_state=42), {
            'clf__n_estimators': [100, 200, 300],
            'clf__max_depth': [None, 10, 20],
            'clf__min_samples_split': [2, 5],
            'clf__min_samples_leaf': [1, 2]
        }
    ),
    (
        'Decision Tree', DecisionTreeClassifier(random_state=42), {
            'clf__max_depth': [None, 10, 20],
            'clf__min_samples_split': [2, 5],
            'clf__min_samples_leaf': [1, 2]
        }
    ),
    (
        'XGBoost', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42), {
            'clf__n_estimators': [100, 200],
            'clf__max_depth': [3, 5, 7],
            'clf__learning_rate': [0.01, 0.1],
            'clf__subsample': [0.8, 1.0]
        }
    ),
    (
        'SVM', SVC(probability=True, random_state=42), {
            'clf__C': [0.1, 1, 10],
            'clf__kernel': ['linear', 'rbf']
        }
    ),
    (
        'Gaussian NB', GaussianNB(), {}
    ),
    (
        'Gradient Boosting', GradientBoostingClassifier(random_state=42), {
            'clf__n_estimators': [100, 200],
            'clf__learning_rate': [0.01, 0.1],
            'clf__max_depth': [3, 5]
        }
    ),
    (
        'LightGBM', LGBMClassifier(random_state=42), {
            'clf__n_estimators': [100, 200],
            'clf__learning_rate': [0.01, 0.1],
            'clf__max_depth': [5, 10, -1]
        }
    ),
    (
        'CatBoost', CatBoostClassifier(verbose=0, random_state=42), {
            'clf__iterations': [100, 200],
            'clf__learning_rate': [0.01, 0.1],
            'clf__depth': [4, 6, 8]
        }
    )
]

# ========== Results container ==========
results = {
    'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [],
    'F1 Score': [], 'ROC AUC': []
}

trained_models = []

# ========== Loop and evaluate ==========
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, clf, param_grid in model_param_grid:
    pipe = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('scaler', scaler),
        ('clf', clf)
    ])
    
    if param_grid:
        search = RandomizedSearchCV(
            pipe, param_distributions=param_grid,
            scoring='f1', n_iter=20, cv=cv, random_state=42, n_jobs=-1
        )
        search.fit(X_train, y_train)
        best_model = search.best_estimator_
    else:
        pipe.fit(X_train, y_train)
        best_model = pipe

    y_pred = best_model.predict(X_test)
    y_proba = best_model.predict_proba(X_test)[:, 1] if hasattr(best_model, "predict_proba") else None

    results['Model'].append(name)
    results['Accuracy'].append(round(accuracy_score(y_test, y_pred), 4))
    results['Precision'].append(round(precision_score(y_test, y_pred), 4))
    results['Recall'].append(round(recall_score(y_test, y_pred), 4))
    results['F1 Score'].append(round(f1_score(y_test, y_pred), 4))
    results['ROC AUC'].append(round(roc_auc_score(y_test, y_proba), 4) if y_proba is not None else 'N/A')

    trained_models.append((name, best_model))

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.tight_layout()
    plt.show()

# ========== Show results ==========
results_df = pd.DataFrame(results)
print("\nModel Performance Comparison:\n")
print(results_df.sort_values(by='Accuracy', ascending=False))


### Get Best Model

In [None]:
import os, pickle, joblib, warnings

#  suppress harmless warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
# Find best model based on accuracy
best_index = results_df['Accuracy'].idxmax()
best_model_name = results_df.loc[best_index, 'Model']
print(f"\n✅ Best Model (based on Accuracy): {best_model_name}")
print(results_df.loc[best_index])  # See all metrics

In [None]:
# Retrieve the trained best pipeline
best_pipeline = next(pipe for name, pipe in trained_models if name == best_model_name)

# Define save path
model_name_safe = best_model_name.replace(" ", "_").lower()
save_path = r"C:\Users\pascal\Desktop\PROJECTS 2025\Multi-disease prediction models\multi-health-ml-predictor\models"
os.makedirs(save_path, exist_ok=True)

In [None]:
# Save as Pickle
with open(os.path.join(save_path, f"{model_name_safe}_model_heart.pkl"), 'wb') as f:
    pickle.dump(best_pipeline, f)

# Save as Joblib
joblib.dump(best_pipeline, os.path.join(save_path, f"{model_name_safe}_model_heart.joblib"))

print(f"\n✅ Saved best model ({best_model_name}) as both Pickle and Joblib.")

### Test Model

In [None]:
# Load the file written in the previous step. Its name is derived from the best
# model, so it is built from save_path and model_name_safe instead of being hardcoded.
model_path = os.path.join(save_path, f"{model_name_safe}_model_heart.joblib")
model = joblib.load(model_path)

In [None]:
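# Single patient sample.
# The keys must match the columns the pipeline was trained on (X.columns),
# including any engineered features that survived the correlation filter.
# Note: enhanced_heart_features treats cholesterol_level >= 3 as high, which
# suggests an ordinal code; check that the values below use the training scale.
sample_data = {
    'systolic_bp': 130,
    'diastolic_bp': 85,
    'cholesterol_level': 5.8,
    'glucose_level': 6.2,
    'physical_activity': 1,
    'weight_kg': 75,
    'age_years': 54
}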



In [None]:
def predict_heart_disease(model, input_dict):
    df = pd.DataFrame([input_dict])
    pred = model.predict(df)[0]
    proba = model.predict_proba(df)[0][1]
    result = 'Heart Disease' if pred == 1 else 'No Heart Disease'
    print(f"🔍 Prediction: {result}")
    print(f"🧪 Probability: {round(proba, 4)}")
    return pred, proba


In [None]:
predict_heart_disease(model, sample_data)

## Set a better threshold to improve recall

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, roc_auc_score

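# 1. Get predicted probabilities from the selected pipeline.
# Note: after the training loop, `best_model` refers to the LAST model trained
# (CatBoost), not the best one, so use `best_pipeline` here.
y_proba = best_pipeline.predict_proba(X_test)[:, 1]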


# 2. Set thresholds to test
thresholds = np.arange(0.1, 0.91, 0.05)

# 3. Track performance metrics
scores = {
    'Threshold': [], 'Accuracy': [], 'Precision': [],
    'Recall': [], 'F1 Score': [], 'ROC AUC': []
}

for thresh in thresholds:
    y_pred_thresh = (y_proba >= thresh).astype(int)
    
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_thresh, average='binary')
    accuracy = accuracy_score(y_test, y_pred_thresh)
    auc = roc_auc_score(y_test, y_proba)  # stays constant
    
    scores['Threshold'].append(thresh)
    scores['Accuracy'].append(accuracy)
    scores['Precision'].append(precision)
    scores['Recall'].append(recall)
    scores['F1 Score'].append(f1)
    scores['ROC AUC'].append(auc)

# 4. Plot the metrics
plt.figure(figsize=(10, 6))
plt.plot(scores['Threshold'], scores['Accuracy'], label='Accuracy')
plt.plot(scores['Threshold'], scores['Precision'], label='Precision')
plt.plot(scores['Threshold'], scores['Recall'], label='Recall')
plt.plot(scores['Threshold'], scores['F1 Score'], label='F1 Score')
plt.axvline(0.5, color='gray', linestyle='--', label='Default Threshold (0.5)')
plt.title('Threshold Tuning for Heart Disease Prediction')
plt.xlabel('Decision Threshold')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
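
The plot shows the trade-off, but the threshold can also be chosen programmatically. Below is a minimal sketch, assuming the goal is the highest threshold that still meets a minimum recall. The helper `pick_threshold_for_recall`, the toy labels and probabilities, and the 0.75 recall target are illustrative and not taken from the notebook.

```python
import numpy as np


def pick_threshold_for_recall(y_true, y_proba, thresholds, min_recall):
    """Return the highest threshold whose recall is at least `min_recall`, or None."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    best = None
    for t in sorted(thresholds):
        y_pred = (y_proba >= t).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        # Recall never increases as the threshold rises, so the last
        # threshold that passes is the highest one that meets the target.
        if recall >= min_recall:
            best = t
    return best


# Toy labels and predicted probabilities (hypothetical values).
toy_y_true = [0, 0, 1, 0, 1, 1, 0, 1]
toy_y_proba = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.65, 0.90]

chosen = pick_threshold_for_recall(
    toy_y_true, toy_y_proba, thresholds=[0.3, 0.4, 0.5, 0.6], min_recall=0.75
)
print(chosen)
```

In the notebook this would be called with `y_test` and the predicted probabilities. Choosing the threshold on a separate validation split rather than on the test set avoids an optimistic estimate of the final metrics.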


### Evaluate the new threshold 0.45

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict probabilities with the selected pipeline (`best_model` would be the
# last model from the training loop, not the best one)
y_proba = best_pipeline.predict_proba(X_test)[:, 1]

# Apply new threshold
custom_threshold = 0.45
y_pred_thresh = (y_proba >= custom_threshold).astype(int)

# Re-evaluate
print(f"🔍 Evaluation at Threshold = {custom_threshold}")
print("Accuracy:", accuracy_score(y_test, y_pred_thresh))
print("Precision:", precision_score(y_test, y_pred_thresh))
print("Recall:", recall_score(y_test, y_pred_thresh))
print("F1 Score:", f1_score(y_test, y_pred_thresh))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))  # AUC stays same

print("\n📊 Classification Report:\n", classification_report(y_test, y_pred_thresh))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_thresh)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix at Threshold = {custom_threshold}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()


### Test new threshold with sample data

In [None]:
DEFAULT_THRESHOLD = 0.45  # Matches the threshold evaluated in the previous section

def predict_heart_disease(model, input_dict, threshold=DEFAULT_THRESHOLD):
    df = pd.DataFrame([input_dict])
    proba = model.predict_proba(df)[0][1]
    pred = int(proba >= threshold)
    result = 'Heart Disease' if pred == 1 else 'No Heart Disease'
    
    print(f"🔍 Prediction at threshold {threshold}: {result}")
    print(f"🧪 Probability: {round(proba, 4)}")
    
    return pred, proba



In [None]:
pred, proba = predict_heart_disease(model, sample_data, threshold=0.45)
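
`predict_heart_disease` passes the input dict straight to the pipeline, whose `ColumnTransformer` selects columns by name, so a missing or misnamed key is likely to fail deep inside scikit-learn. A small guard can surface the problem earlier. This is a sketch under assumptions: `build_input_frame` is a hypothetical helper, and the three-column list stands in for the real training columns.

```python
import pandas as pd


def build_input_frame(input_dict, expected_columns):
    """Build a one-row DataFrame with exactly `expected_columns`, in that order."""
    missing = [c for c in expected_columns if c not in input_dict]
    extra = [k for k in input_dict if k not in expected_columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")
    return pd.DataFrame([input_dict])[list(expected_columns)]


# Hypothetical column list; in the notebook this would be list(X_train.columns).
expected = ["systolic_bp", "diastolic_bp", "age_years"]

row = build_input_frame(
    {"age_years": 54, "systolic_bp": 130, "diastolic_bp": 85}, expected
)
print(row.columns.tolist())

try:
    build_input_frame({"systolic_bp": 130}, expected)
except ValueError as err:
    print(err)
```

The resulting frame could then be passed to `model.predict_proba` in place of `pd.DataFrame([input_dict])`.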

### Machine Learning Pipeline Summary

The section above covers the end-to-end process of building, tuning, evaluating, and exporting a machine learning model to predict heart disease.

### Key Steps:

- **Data Splitting**: Used `train_test_split` with stratification so that the train and test sets preserve the target class distribution.
- **Preprocessing Pipeline**:
  - Handled class imbalance using `SMOTE` oversampling inside an imbalanced-learn pipeline, so resampling is applied only to the training folds during cross-validation.
  - Scaled numerical features using `StandardScaler`.
- **Model Selection & Tuning**:
  - Trained nine classifiers: Logistic Regression, Random Forest, Decision Tree, XGBoost, SVM, Gaussian Naive Bayes, Gradient Boosting, LightGBM, and CatBoost.
  - Applied `RandomizedSearchCV` with 5-fold stratified cross-validation to tune hyperparameters for F1 Score (Gaussian Naive Bayes has no tuned parameters and was fit directly).
- **Evaluation Metrics**:
  - Accuracy, Precision, Recall, F1 Score, ROC AUC
  - Plotted confusion matrices for visual interpretation of predictions.
  - Performed **threshold tuning** to trade precision against recall.
- **Best Model**:  
  📌 **LightGBM** delivered the best balance of performance metrics (Accuracy ≈ 73%, F1 Score ≈ 0.72, ROC AUC ≈ 0.798). The notebook selects the best model by test-set accuracy.

- **Model Export**:
  - Saved the final tuned pipeline in both `pickle` and `joblib` formats.
  - Created a prediction function with a customizable decision threshold, suitable for deployment.

The pipeline is reusable and can be served behind an API for a front-end interface.
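
The tuning and thresholding steps summarized above can be condensed into a short, self-contained sketch. It runs on synthetic data from `make_classification` instead of the heart dataset, uses a single Logistic Regression candidate, and omits SMOTE so that it depends on scikit-learn only. The dataset size and the parameter grid are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart dataset (assumption: 500 rows, 6 features).
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": [0.01, 0.1, 1, 10]},
    n_iter=4,  # equals the grid size, so every candidate is tried
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_train, y_train)

# Apply a custom decision threshold to the tuned pipeline's probabilities.
test_proba = search.best_estimator_.predict_proba(X_test)[:, 1]
test_pred = (test_proba >= 0.45).astype(int)
print(search.best_params_, round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold during the search, which keeps validation-fold statistics out of the preprocessing step.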
