# CSCA 5622 — Supervised Learning Final Project (Heart Disease — Cleveland)

**Task:** Binary classification — predict presence of heart disease from clinical attributes (Cleveland subset of UCI Heart Disease).  
**Dataset license:** CC BY 4.0 (UCI ML Repository).  
**Run date:** 2025-09-29 09:45 UTC


## 1) Data Provenance & Context

- **Source:** UCI Machine Learning Repository — *Heart Disease* (Cleveland subset). `id=45` in UCI.  
  - Citation: Janosi, Steinbrunn, Pfisterer, & Detrano (1989). Heart Disease [Dataset]. UCI ML Repository.  
- **License:** Creative Commons Attribution 4.0 (CC BY 4.0).  
- **Columns used (14: 13 features plus the target):** `age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, num(target)`  
  - Target is `num` (0..4). We binarize to **presence = 1** if `num > 0`, else **0**.
- **Known quirks:** Missing values in **ca** and **thal** (often encoded as `?`). We convert these to `NaN` during cleaning and impute them inside the modeling pipelines.
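
As a quick illustration of the binarization rule above, here is a minimal sketch on made-up `num` values (the toy series is an assumption for illustration, not data from the Cleveland file):

```python
import pandas as pd

# Toy values standing in for the `num` column (0 = no disease, 1-4 = severity).
toy_num = pd.Series([0, 2, 0, 4, 1], name="num")

# Presence is 1 whenever any non-zero severity is recorded, else 0.
toy_target = (toy_num > 0).astype(int)
print(toy_target.tolist())  # [0, 1, 0, 1, 1]
```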


In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    roc_auc_score
)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

pd.set_option('display.max_columns', 100)
np.random.seed(42)


## 2) Load Data

We first try the **official UCI helper `ucimlrepo`** to fetch the dataset.  
On success, we **cache** a CSV locally (same folder) for reproducibility.

> If you're offline, re-run later with internet, or manually place `heart_cleveland_uci.csv` beside this notebook (columns: the 14 Cleveland features).

In [1]:
# Fetch via ucimlrepo (preferred). If unavailable/offline, show an instructive error.

def load_uci_cleveland() -> pd.DataFrame:
    try:
        from ucimlrepo import fetch_ucirepo
    except ImportError as e:
        # Chain the original error so the traceback shows the real cause
        raise RuntimeError(
            "`ucimlrepo` not installed. Run: `pip install ucimlrepo` and re-run."
        ) from e
    heart = fetch_ucirepo(id=45)  # Heart Disease
    # UCI helper returns features/targets as DataFrames
    X = heart.data.features.copy()
    y = heart.data.targets.copy()
    # `id=45` is expected to return the processed Cleveland data, but the target column
    # name may differ between sources. The checks below normalize it to `num` so the
    # combined frame matches the 14-column schema.
    if 'num' not in y.columns:
        # Sometimes target column is named differently
        # Attempt common fallbacks
        for cand in ['target','disease','diagnosis','class']:
            if cand in y.columns:
                y = y.rename(columns={cand: 'num'})
                break
        if 'num' not in y.columns:
            y = y.rename(columns={y.columns[0]: 'num'})
    df = pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
    return df

CACHE_PATH = Path('heart_cleveland_uci.csv')

try:
    if CACHE_PATH.exists():
        df = pd.read_csv(CACHE_PATH)
    else:
        df = load_uci_cleveland()
        # Keep only the canonical 14 columns if present
        canonical_cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
        # If all canonical present, filter and save
        if set(canonical_cols).issubset(df.columns):
            df = df[canonical_cols]
        df.to_csv(CACHE_PATH, index=False)
    print('Loaded data with shape:', df.shape)
    display(df.head())
except Exception as e:
    # Re-raise with instructions, keeping the original exception as the cause
    raise RuntimeError(
        "Could not load UCI Cleveland dataset automatically.\n"
        f"Reason: {e}\n\n"
        "Fix: Ensure internet is available, then:\n"
        "  pip install ucimlrepo\n"
        "  # and re-run this cell.\n\n"
        "Alternatively, download from UCI and save as 'heart_cleveland_uci.csv' beside this notebook."
    ) from e



## 3) Cleaning & Target Definition

- Convert `?` to `NaN` (notably **ca**, **thal**).  
- Coerce numeric types.  
- Create binary target `target = 1 if num > 0 else 0`.  
- Drop rows with missing target, then impute numeric/categorical as needed (inside pipelines to avoid leakage).
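
The first two cleaning steps can be sketched on a tiny made-up frame (the values below are hypothetical, chosen only to show the mechanics). `pd.to_numeric(..., errors="coerce")` turns any unparseable entry, including `?`, into `NaN`:

```python
import pandas as pd

# Hypothetical raw rows: '?' marks a missing value, as in the UCI files.
toy = pd.DataFrame({
    "ca": ["0", "?", "2"],
    "thal": ["3", "7", "?"],
    "num": [0, 1, None],
})

# Coerce to numeric; entries that cannot be parsed (such as '?') become NaN.
for col in ["ca", "thal", "num"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Keep only rows where the label is known; feature gaps stay as NaN for later imputation.
toy = toy.dropna(subset=["num"])
print(toy)
print(toy.isna().sum())
```

Feature gaps are deliberately left in place here so that imputation can happen later, inside the cross-validated pipeline.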

In [None]:
# Replace '?' with NaN and coerce numerics
df = df.replace('?', np.nan)

# Coerce specific columns to numeric (errors='coerce' turns bad parses into NaN)
for col in ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

print('Missing values per column (after coercion):')
print(df.isna().sum().sort_values(ascending=False).head(10))

# Drop rows with a missing label first: NaN > 0 evaluates to False, so binarizing
# before this step would silently label those rows as "no disease".
df = df.dropna(subset=['num']).copy()
print('Remaining rows after ensuring target present:', len(df))

# Binarize target: presence(1) if num > 0 else 0
df['target'] = (df['num'] > 0).astype(int)

# Basic sanity checks
print('\nClass balance (0=no disease, 1=disease):')
print(pd.DataFrame({
    'count': df['target'].value_counts(),
    'share': df['target'].value_counts(normalize=True).round(3)
}))

# Remaining NaNs in the features (e.g. ca, thal) are imputed inside the pipelines


## 4) EDA Highlights

- Univariate distributions  
- Correlation with target  
- Quick boxplots for most informative features

In [None]:
# Univariate
subset_cols = [c for c in ['age','trestbps','chol','thalach','oldpeak','ca'] if c in df.columns]
_ = df[subset_cols].hist(figsize=(10,8))
plt.tight_layout(); plt.show()

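# Correlation with target (numeric-only). Drop `num` first: the target is derived from it,
# so it would trivially top the ranking. The target's self-correlation is removed as well.
corr = (
    df.drop(columns=['num'], errors='ignore')
      .corr(numeric_only=True)['target']
      .drop('target')
      .sort_values(ascending=False)
)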
print('Top correlations with target:')
print(corr.head(10))
print('\nLeast correlations with target:')
print(corr.tail(10))


In [None]:
import math

# Boxplots of top 6 absolute correlations (excluding 'target' and its source column 'num')
abs_corr = df.corr(numeric_only=True)['target'].abs().sort_values(ascending=False)
feat_order = [c for c in abs_corr.index if c not in ('target', 'num')][:6]
rows = math.ceil(len(feat_order)/3)
fig, axes = plt.subplots(rows, 3, figsize=(12, 4*rows))
axes = axes.flatten()
for i, f in enumerate(feat_order):
    axes[i].boxplot([df[df['target']==0][f].dropna(), df[df['target']==1][f].dropna()])
    # Set tick labels separately; the `labels=` keyword is deprecated in newer Matplotlib releases
    axes[i].set_xticklabels(['no disease','disease'])
    axes[i].set_title(f)
# Hide any unused axes in the grid
for j in range(len(feat_order), len(axes)):
    axes[j].axis('off')
plt.tight_layout(); plt.show()


## 5) Train/Test Split & Preprocessing

- Split with stratification.  
- Numeric features: Standardize.  
- Categorical features (if any remain as discrete codes): One-hot encode.  
- Impute missing values **within** the pipeline to avoid leakage.
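
To make the leakage point concrete, here is a minimal sketch with made-up numbers: the imputer learns its fill value from the training rows only, so an extreme value in the held-out rows cannot influence it. The arrays `toy_train` and `toy_test` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-feature data; NaN marks a missing entry.
toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
toy_test = np.array([[np.nan], [100.0]])

# Fit on training rows only: the median is computed from 1, 3 and 5.
toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)

# The held-out gap is filled with the training median, not a value influenced by 100.
toy_test_filled = toy_imputer.transform(toy_test)
print(toy_imputer.statistics_)   # [3.]
print(toy_test_filled.ravel())   # [  3. 100.]
```

Placing the imputer inside a `Pipeline` gives the same behavior automatically within each cross-validation fold.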


In [None]:
from sklearn.impute import SimpleImputer

feature_cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
feature_cols = [c for c in feature_cols if c in df.columns]
X = df[feature_cols].copy()
y = df['target'].copy()

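# Treat the integer-coded clinical codes as categorical (a modeling choice; `ca` is a small
# vessel count and could also be kept numeric). Everything else is treated as numeric.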
categorical_like = [c for c in ['sex','cp','fbs','restecg','exang','slope','ca','thal'] if c in X.columns]
numeric_like = [c for c in X.columns if c not in categorical_like]

numeric_tf = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

categorical_tf = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer([
    ('num', numeric_tf, numeric_like),
    ('cat', categorical_tf, categorical_like)
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print('Train/Test sizes:', X_train.shape, X_test.shape)


## 6) Baseline Model Comparison (CV)

We compare: Logistic Regression, SVC (RBF), Random Forest, Gradient Boosting.  
Scoring: **accuracy, F1, ROC AUC**.
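
For reference, the three metrics can be computed by hand on a tiny made-up example (the labels and scores below are hypothetical). Accuracy and F1 depend on thresholded predictions, while ROC AUC depends only on how the scores rank positives against negatives:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.6, 0.4, 0.9]

# Hard predictions at the conventional 0.5 cut-off: [0, 1, 0, 1].
toy_pred = [int(s >= 0.5) for s in toy_score]

toy_acc = accuracy_score(toy_true, toy_pred)   # 2 of 4 correct -> 0.5
toy_f1 = f1_score(toy_true, toy_pred)          # precision 0.5, recall 0.5 -> 0.5
toy_auc = roc_auc_score(toy_true, toy_score)   # 3 of 4 positive/negative pairs ranked correctly -> 0.75
print(toy_acc, toy_f1, toy_auc)
```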


In [None]:
models = {
    'LogisticRegression': Pipeline([
        ('prep', preprocess),
        ('clf', LogisticRegression(max_iter=1000))
    ]),
    'SVC (RBF)': Pipeline([
        ('prep', preprocess),
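        # probability=True fits an internal cross-validated calibration; seed it for reproducibility
        ('clf', SVC(probability=True, random_state=42))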
    ]),
    'RandomForest': Pipeline([
        ('prep', preprocess),
        ('clf', RandomForestClassifier(random_state=42))
    ]),
    'GradientBoosting': Pipeline([
        ('prep', preprocess),
        ('clf', GradientBoostingClassifier(random_state=42))
    ])
}

scoring = {'accuracy':'accuracy', 'f1':'f1', 'roc_auc':'roc_auc'}
cv_results = {}
for name, pipe in models.items():
    scores = cross_validate(pipe, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)
    cv_results[name] = {k.replace('test_',''): float(np.mean(v)) for k,v in scores.items() if k.startswith('test_')}

pd.DataFrame(cv_results).T.sort_values('roc_auc', ascending=False)


## 7) Hyperparameter Tuning — choose a top model (SVC)

We tune `C` and `gamma` for an RBF SVC using 5-fold stratified CV, selecting (and refitting) the configuration with the best mean ROC AUC.
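
Grid keys for a pipeline follow the `<step name>__<parameter>` convention. The sketch below shows this on synthetic data from `make_classification`; the data, the smaller grid, and the `demo_*` names are assumptions for illustration and say nothing about results on the Cleveland data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# The step is named 'clf', so its hyperparameters are addressed as 'clf__C' and 'clf__gamma'.
demo_pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
demo_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_search = GridSearchCV(demo_pipe, demo_grid, scoring="roc_auc", cv=demo_cv)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 4))
```

Because the scaler sits inside the pipeline, it is re-fit on the training portion of every fold during the search.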


In [None]:
svc_pipe = models['SVC (RBF)']
param_grid = {
    'clf__C': [0.1, 1, 10, 30],
    'clf__gamma': ['scale', 0.1, 0.01, 0.001]
}

gs = GridSearchCV(svc_pipe, param_grid, scoring='roc_auc', cv=cv, n_jobs=-1)
gs.fit(X_train, y_train)
print('Best params:', gs.best_params_)
print('Best CV ROC AUC:', round(gs.best_score_, 4))


## 8) Final Evaluation on Holdout Test Set

In [None]:
best_model = gs.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred, target_names=['no disease','disease']))
print('Test ROC AUC:', round(roc_auc_score(y_test, y_proba), 4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title('Confusion Matrix — Best SVC')
plt.show()

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title('ROC Curve — Best SVC')
plt.show()


## 9) Interpretability Notes

- For linear models (e.g., **Logistic Regression**), inspect standardized coefficients to see directional effects.
- For tree ensembles, use **permutation importance** and **partial dependence**; for SVC, prefer **model-agnostic** methods (Permutation Importance, SHAP).
- Beware of **category encodings** impacting interpretation (one-hot reference levels).
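
A minimal sketch of permutation importance for an SVC, using synthetic data rather than the notebook's fitted model (the data and the `syn_*` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 5 features, 2 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

syn_model = SVC().fit(X_syn_tr, y_syn_tr)

# Shuffle each feature on held-out data and record the resulting drop in ROC AUC.
syn_result = permutation_importance(
    syn_model, X_syn_te, y_syn_te, scoring="roc_auc", n_repeats=10, random_state=0
)

# Report features from most to least important.
for idx in syn_result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {syn_result.importances_mean[idx]:.3f} "
          f"+/- {syn_result.importances_std[idx]:.3f}")
```

When the estimator is a full pipeline, passing the raw feature frame to `permutation_importance` yields importances per original column rather than per one-hot level, which sidesteps the encoding caveat above.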


## 10) Conclusions & Next Steps

**Summary**
- Cleaned Cleveland dataset (handled `?` missing values) and binarized `num`.
- Compared 4 models; tuned SVC for ROC AUC.
- Reported holdout metrics & diagnostic plots.

**Limitations**
- Small tabular clinical dataset; potential site-specific biases.
- Binary reduction of multi-class severity (`num`) discards granularity.

**Next Steps**
- Explore threshold tuning for cost-sensitive decisions (FN vs FP).
- Calibrate probabilities (Platt/Isotonic) and evaluate precision-recall curves, which help when the classes are imbalanced.
- Add SHAP for feature attribution; test alternative learners (XGBoost/LightGBM).
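
As a starting point for the threshold-tuning step, here is a minimal sketch that scans candidate cut-offs and picks the one with the lowest total cost. The labels, probabilities, and the 5:1 cost ratio for false negatives versus false positives are all hypothetical.

```python
import numpy as np

# Hypothetical held-out labels and predicted disease probabilities.
toy_labels = np.array([0, 0, 1, 1, 1, 0])
toy_proba = np.array([0.15, 0.45, 0.35, 0.85, 0.65, 0.75])

# Assumed costs: missing a disease case is 5x worse than a false alarm.
COST_FN, COST_FP = 5.0, 1.0

def total_cost(threshold: float) -> float:
    """Total misclassification cost when predicting disease at proba >= threshold."""
    pred = (toy_proba >= threshold).astype(int)
    fn = int(((pred == 0) & (toy_labels == 1)).sum())
    fp = int(((pred == 1) & (toy_labels == 0)).sum())
    return COST_FN * fn + COST_FP * fp

# Scan cut-offs 0.1, 0.2, ..., 0.9 and keep the first one with minimal cost.
thresholds = np.round(np.linspace(0.1, 0.9, 9), 2)
costs = [total_cost(t) for t in thresholds]
best_threshold = float(thresholds[int(np.argmin(costs))])

print(best_threshold, min(costs))  # 0.2 2.0
```

In practice the threshold should be chosen on validation folds, not on the final test set, so that the reported test metrics stay unbiased.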

**Reproducibility**
- Cache created: `heart_cleveland_uci.csv`. Commit code, not large/raw external data. Provide data download instructions in README.
