# Hack4Health: Heart Disease Risk Prediction (Tabular)

This notebook trains and evaluates machine learning models to predict **HeartDisease** from routine clinical + ECG-derived features.

**Dataset:** `b2b/Datasets/Heart Attack/heart_processed.csv`

## Reproducibility
- Fixed random seed
- Stratified train/test split
- Cross-validation for model selection
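
As a minimal sketch of what the fixed seed buys us, the snippet below builds stratified folds twice with the same seed and checks that they match. The toy arrays `X_demo` and `y_demo` and the helper `fold_indices` are illustrative assumptions, not part of the notebook's data or code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 30 negatives followed by 20 positives.
y_demo = np.array([0] * 30 + [1] * 20)
X_demo = np.zeros((50, 1))  # feature values do not affect the split


def fold_indices(seed):
    """Return the test indices of each fold for a given seed (hypothetical helper)."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(X_demo, y_demo)]


folds_a = fold_indices(42)
folds_b = fold_indices(42)

# Same seed -> identical folds, so CV scores are comparable across runs.
same_folds = folds_a == folds_b
# Stratification keeps the class ratio in every fold (4 positives out of 10 here).
positives_per_fold = [int(y_demo[f].sum()) for f in folds_a]
print(same_folds, positives_per_fold)
```

Changing the seed would typically give different folds, which is why `RANDOM_STATE` is set once and reused everywhere.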

> If running in Google Colab: upload the `b2b` folder or adjust `DATA_PATH`.


In [None]:
import sys
import os

# If a required package is missing, install dependencies into the current kernel.
# (Works in Jupyter/Colab; may take a minute on first run.)
try:
    import numpy as np
    import pandas as pd
except ModuleNotFoundError:
    !{sys.executable} -m pip -q install numpy pandas scikit-learn matplotlib seaborn
    import numpy as np
    import pandas as pd

try:
    import xgboost
    from xgboost import XGBClassifier
except ModuleNotFoundError:
    !{sys.executable} -m pip -q install xgboost
    import xgboost
    from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, average_precision_score, accuracy_score, f1_score,
    confusion_matrix, classification_report, RocCurveDisplay, PrecisionRecallDisplay
)
from sklearn.calibration import calibration_curve
from sklearn.inspection import permutation_importance

import matplotlib.pyplot as plt
import seaborn as sns

RANDOM_STATE = 42
sns.set_theme(style='whitegrid')


In [None]:
from pathlib import Path

# Robust path resolution for VS Code / Jupyter.
# This notebook lives in the 'b2b' folder, but the working directory can vary.

cwd = Path.cwd()

# Candidate locations: the working directory may be the parent of 'b2b' or the 'b2b' folder itself.
candidates = [
    cwd / 'b2b' / 'Datasets' / 'Heart Attack' / 'heart_processed.csv',
    cwd / 'Datasets' / 'Heart Attack' / 'heart_processed.csv',
]

# Also try walking up parent directories and looking for a 'b2b' folder
for p in [cwd] + list(cwd.parents):
    candidates.append(p / 'b2b' / 'Datasets' / 'Heart Attack' / 'heart_processed.csv')

DATA_PATH = None
for c in candidates:
    if c.exists():
        DATA_PATH = c
        break

if DATA_PATH is None:
    raise FileNotFoundError(
        "Could not find heart_processed.csv. Tried:\n" + "\n".join(str(x) for x in candidates)
    )

print('Using dataset at:', DATA_PATH)
df = pd.read_csv(DATA_PATH)
df.head()


In [None]:
df.shape, df.columns.tolist()


## 1) Data checks
We verify:
- Target exists and is binary
- No missing values
- Boolean columns are valid
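
The three checks can also be bundled into one function, which may be handy if the dataset is reloaded later. This is only a sketch: `check_dataset` and the toy frame are hypothetical and use made-up values, not rows from `heart_processed.csv`.

```python
import pandas as pd


def check_dataset(df: pd.DataFrame, target: str = "HeartDisease") -> dict:
    """Summarise the checks listed above (hypothetical helper)."""
    if target not in df.columns:
        raise KeyError(f"Missing target column: {target}")
    target_values = set(df[target].dropna().unique().tolist())
    bool_cols = [c for c in df.columns if df[c].dtype == bool]
    return {
        "target_is_binary": target_values <= {0, 1},  # only 0/1 labels present
        "n_missing": int(df.isna().sum().sum()),      # total missing cells
        "bool_columns": bool_cols,                    # columns still stored as True/False
    }


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "Age": [54, 61, 47, 39],
    "Sex_M": [True, False, True, True],
    "HeartDisease": [1, 0, 1, 0],
})
report = check_dataset(toy)
print(report)
```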


In [None]:
assert 'HeartDisease' in df.columns
df['HeartDisease'].value_counts(dropna=False)


In [None]:
missing = df.isna().mean().sort_values(ascending=False)
missing.head(15)


In [None]:
# Convert boolean columns (True/False) to 0/1 (safe if already numeric)
for col in df.columns:
    if df[col].dtype == 'bool':
        df[col] = df[col].astype(int)

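# Also handle string 'True'/'False' just in case.
# The `unique and` guard skips all-NaN object columns, which would otherwise
# pass the subset test and then fail in astype(int).
for col in df.columns:
    if df[col].dtype == 'object':
        unique = set(df[col].dropna().unique().tolist())
        if unique and unique.issubset({'True', 'False'}):
            df[col] = df[col].map({'False': 0, 'True': 1}).astype(int)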

df.dtypes


## 2) Quick EDA


In [None]:
target_rate = df['HeartDisease'].mean()
print(f'Positive class prevalence (HeartDisease=1): {target_rate:.3f}')


In [None]:
num_cols = [c for c in df.columns if c != 'HeartDisease']
df[num_cols].describe().T


In [None]:
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='HeartDisease')
plt.title('Target distribution')
plt.show()


In [None]:
# Correlation with target (quick sanity check; not causal)
corr = df.corr(numeric_only=True)['HeartDisease'].sort_values(ascending=False)
corr


## 3) Train/test split
We keep a hold-out test set for a final unbiased estimate.
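
To illustrate why the split is stratified, here is a small sketch on synthetic labels (an assumption, with 30% positives) showing that `stratify=` keeps the positive rate close to the overall rate in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))       # synthetic features
y_demo = np.array([1] * 60 + [0] * 140)  # synthetic labels, 30% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Overall, train and test prevalence should be nearly identical.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different class balance, which makes test metrics harder to compare with the CV results.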

In [None]:
X = df.drop(columns=['HeartDisease'])
y = df['HeartDisease'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
X_train.shape, X_test.shape


## 4) Models
We compare:
- **Logistic Regression (scaled)**: strong baseline + interpretable
- **Random Forest**: nonlinear model, handles interactions
- **XGBoost**: gradient-boosted trees, often strong on tabular data

Evaluation metrics:
- ROC-AUC
- PR-AUC (Average Precision)
- F1, Accuracy
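
For orientation, the sketch below computes the four metrics on a tiny hand-made example (the labels and scores are assumptions chosen for illustration). ROC-AUC and PR-AUC are computed from the predicted scores, while F1 and accuracy need hard labels, here obtained with a 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score, accuracy_score, f1_score
)

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Threshold-free metrics: use the scores directly.
demo_roc_auc = roc_auc_score(y_true, y_score)
demo_pr_auc = average_precision_score(y_true, y_score)

# Threshold-dependent metrics: convert scores to 0/1 predictions first.
y_pred = (y_score >= 0.5).astype(int)
demo_accuracy = accuracy_score(y_true, y_pred)
demo_f1 = f1_score(y_true, y_pred)

print(demo_roc_auc, demo_pr_auc, demo_accuracy, demo_f1)
```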


In [None]:
logreg = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=500, class_weight='balanced', random_state=RANDOM_STATE))
])

rf = RandomForestClassifier(
    n_estimators=600,
    random_state=RANDOM_STATE,
    class_weight='balanced',
    min_samples_leaf=2,
    n_jobs=-1
)

xgb = XGBClassifier(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_lambda=1.0,
    min_child_weight=1,
    gamma=0,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    eval_metric='logloss'
)

models = {
    'LogReg (scaled)': logreg,
    'RandomForest': rf,
    'XGBoost': xgb,
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {
    'roc_auc': 'roc_auc',
    'avg_precision': 'average_precision',
    'accuracy': 'accuracy',
    'f1': 'f1'
}

rows = []
for name, m in models.items():
    scores = cross_validate(m, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
    rows.append({
        'model': name,
        'cv_roc_auc_mean': np.mean(scores['test_roc_auc']),
        'cv_pr_auc_mean': np.mean(scores['test_avg_precision']),
        'cv_f1_mean': np.mean(scores['test_f1']),
        'cv_accuracy_mean': np.mean(scores['test_accuracy']),
    })

pd.DataFrame(rows).sort_values('cv_roc_auc_mean', ascending=False)


## 5) Final evaluation on test set
We fit each model on the full training set, then evaluate it once on the hold-out test set.

In [None]:
def eval_model(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)

    out = {
        'model': name,
        'roc_auc': roc_auc_score(y_test, proba),
        'pr_auc': average_precision_score(y_test, proba),
        'accuracy': accuracy_score(y_test, pred),
        'f1': f1_score(y_test, pred),
    }

    print(name)
    print(out)
    print('Confusion matrix:\n', confusion_matrix(y_test, pred))
    print('\nClassification report:\n', classification_report(y_test, pred, digits=3))

    return out, proba


test_rows = []
test_probas = {}
for name, m in models.items():
    out, proba = eval_model(m, X_train, y_train, X_test, y_test, name)
    test_rows.append(out)
    test_probas[name] = proba

pd.DataFrame(test_rows).sort_values('roc_auc', ascending=False)


In [None]:
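fig, ax = plt.subplots(figsize=(6,5))
for name, proba in test_probas.items():
    # Pass ax= so all models share one plot; otherwise each call opens a new figure.
    RocCurveDisplay.from_predictions(y_test, proba, name=name, ax=ax)
ax.set_title('ROC curves (test set)')
plt.show()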


In [None]:
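fig, ax = plt.subplots(figsize=(6,5))
for name, proba in test_probas.items():
    # Pass ax= so all models share one plot; otherwise each call opens a new figure.
    PrecisionRecallDisplay.from_predictions(y_test, proba, name=name, ax=ax)
ax.set_title('Precision-Recall curves (test set)')
plt.show()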


## 6) Calibration
Calibration matters for risk prediction: we want probabilities that reflect true risk.
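
As a rough sketch of how to read such a plot, the simulation below draws outcomes from known probabilities, so the first set of predictions is calibrated by construction, and then distorts them. All data here are simulated assumptions, unrelated to the heart dataset.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Simulated "true" risks and outcomes drawn from them.
p_true = rng.uniform(0.0, 1.0, size=20_000)
y_sim = (rng.uniform(size=p_true.size) < p_true).astype(int)

# Calibrated predictions: the curve should sit close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_sim, p_true, n_bins=10, strategy="quantile")
max_gap = float(np.max(np.abs(frac_pos - mean_pred)))

# Distorted predictions (squared risks): same ranking, but risk is understated.
p_distorted = p_true ** 2
frac_pos_d, mean_pred_d = calibration_curve(y_sim, p_distorted, n_bins=10, strategy="quantile")
gap_distorted = float(np.max(np.abs(frac_pos_d - mean_pred_d)))

print(f"max gap (calibrated): {max_gap:.3f}, max gap (distorted): {gap_distorted:.3f}")
```

Squaring is a monotone transform, so the distorted scores keep the same ROC-AUC while being poorly calibrated. This is why calibration is checked separately from discrimination.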

In [None]:
plt.figure(figsize=(6,5))
for name, proba in test_probas.items():
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10, strategy='quantile')
    plt.plot(mean_pred, frac_pos, marker='o', label=name)

plt.plot([0, 1], [0, 1], linestyle='--', color='black')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration plot (test set)')
plt.legend()
plt.show()


## 7) Interpretability (Permutation importance)
Permutation importance measures how much model performance drops when a feature is shuffled.
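
A minimal sketch of the idea on synthetic data, assuming one informative feature and one pure-noise feature (both made up for this example): shuffling the informative column should cost a lot of ROC-AUC, while shuffling the noise column should cost almost nothing.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1000
signal = rng.normal(size=n)  # drives the label
noise = rng.normal(size=n)   # unrelated to the label
X_demo = np.column_stack([signal, noise])
y_demo = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

# Fit on the first 700 rows, measure importance on the remaining 300.
clf = LogisticRegression().fit(X_demo[:700], y_demo[:700])
result = permutation_importance(
    clf, X_demo[700:], y_demo[700:],
    n_repeats=10, random_state=42, scoring="roc_auc"
)

# Mean ROC-AUC drop per feature: [signal, noise]
print(result.importances_mean)
```

Importance is computed on held-out rows here, mirroring the notebook's use of `X_test`, so the drop reflects generalisation rather than memorised training patterns.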

In [None]:
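# Pick the model with the highest test ROC-AUC.
# Note: this selects on the hold-out set; to keep the test estimate unbiased,
# choose the model from the CV table in section 4 instead.
best_name = max(test_rows, key=lambda r: r['roc_auc'])['model']
best_model = models[best_name]
# The model was already fitted in eval_model; refitting is redundant but harmless.
best_model.fit(X_train, y_train)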

perm = permutation_importance(
    best_model, X_test, y_test,
    n_repeats=20, random_state=RANDOM_STATE, scoring='roc_auc'
)

imp = pd.DataFrame({
    'feature': X.columns,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values('importance_mean', ascending=False)

imp.head(15)


In [None]:
topk = 12
plt.figure(figsize=(8,5))
sns.barplot(data=imp.head(topk), x='importance_mean', y='feature', orient='h')
plt.title(f'Permutation importance (ROC-AUC drop) - {best_name} (top {topk})')
plt.xlabel('Mean importance')
plt.ylabel('Feature')
plt.show()


## 8) What to write in the report
Include:
- Problem framing (why early detection matters)
- Data description + preprocessing
- Model comparison table (CV + test)
- 1 ROC curve + 1 calibration plot
- Top features + interpretation + limitations
