# 🫀 Heart Disease Prediction and Interpretability

This notebook demonstrates a complete machine learning pipeline applied to the UCI Heart Disease dataset. All models and tools are implemented from scratch, including:

- Logistic Regression
- Decision Tree
- Random Forest
- LIME (Local Interpretable Model-Agnostic Explanations)

The goal is both predictive performance and interpretability — understanding **why** the model made a prediction.


## 🔧 Setup and Imports

We begin by importing all required modules from our custom `courselib` framework.

In [1]:
# # Heart Disease Prediction and Interpretability
# Using custom Logistic Regression, Decision Tree, Random Forest, and LIME

import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo

from courselib.utils.loaders import load_heart_data
from courselib.utils.preprocessing import preprocess_dataframe
from courselib.utils.normalization import minmax_normalize
from courselib.utils.splits import train_test_split
from courselib.utils.metrics import accuracy, f1_score
from courselib.models.logistic import LogisticRegression
from courselib.models.tree import DecisionTree
from courselib.models.forest import RandomForest
from courselib.optimizers import GDOptimizer
from courselib.explain.lime import LimeTabularExplainer

## 📥 Load and Preprocess Data

We use the UCI Heart Disease dataset, which includes patient data (e.g. age, sex, cholesterol, etc.) and a target variable indicating presence (1-4) or absence (0) of heart disease (Source: https://archive.ics.uci.edu/dataset/45/heart+disease). 

> ### 💡 Binary Transformation
>
> The target ranges from 0 to 4:
> - 0 = no presence of heart disease
> - 1–4 = presence of heart disease  
>  
> Our research focuses on the binary classification task: **presence (1–4) vs. absence (0)** as described in the dataset information.
> So we convert:
>
> $$
> y = \begin{cases}
>     0 & \text{if } y = 0 \\\\
>     1 & \text{if } y \in \{1, 2, 3, 4\}
> \end{cases}
> $$

Steps:
- Fetch dataset
- Convert multiclass to binary target
- Encode categorical features
- Normalize numerical features to [0, 1] range
- Split into training and testing sets


In [7]:
y_series

<pandas.core.indexing._iLocIndexer at 0x18a5bb1c220>

In [None]:
# Load raw data
X_df, y_series = load_heart_data()

# Convert target to binary: 0 (no disease), 1 (disease)
y_series = (y_series > 0).astype(int)

# Combine and preprocess
X, y = preprocess_dataframe(pd.concat([X_df, y_series.rename("target")], axis=1))

# Normalize features
X = minmax_normalize(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, seed=42)
print('Training data split as follows:')
print(f'  Training data samples: {len(X_train)}')
print(f'      Test data samples: {len(X_test)}')


TypeError: '>' not supported between instances of '_iLocIndexer' and 'int'

## 📈 Logistic Regression

We implement logistic regression using gradient descent.

> ### 💡 Model and Loss
>
> The model computes probabilities as:
>
> $$
> \hat{y}_i = \sigma(w^T x_i + b), \quad \text{where} \quad \sigma(z) = \frac{1}{1 + e^{-z}}
> $$
>
> The loss function is binary cross-entropy:
>
> $$
> \mathcal{L}(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
> $$
>
> Optimized using gradient descent.


In [None]:
optimizer = GDOptimizer(learning_rate=1e-4)
logreg = LogisticRegression(
    w=np.zeros(X_train.shape[1]), 
    b=0.0, 
    optimizer=optimizer, 
    penalty="none"
)
logreg.fit(X_train, y_train, num_epochs=100)

y_pred_logreg = logreg(X_test)
print("Logistic Regression Accuracy:", accuracy(y_test, y_pred_logreg))


TypeError: return arrays must be of ArrayType

In [58]:
from courselib.optimizers import GDOptimizer

w = np.zeros(X_train.shape[1])
b = 0
optimizer = GDOptimizer(learning_rate=1e-2)

model = LogisticRegression(w, b, optimizer)
metrics_history = model.fit(X_train,y_train,num_epochs=500, batch_size=len(X_train), compute_metrics=False)

TypeError: return arrays must be of ArrayType

## 🌳 Decision Tree

Decision trees recursively split the data to reduce impurity and create interpretable decision rules.

> ### 💡 Impurity Measures
>
> Gini impurity:
> $$
> G = 1 - \sum_{k=1}^K p_k^2
> $$
>
> Entropy:
> $$
> H = -\sum_{k=1}^K p_k \log(p_k)
> $$
>
> A split is chosen to minimize weighted impurity.


In [25]:
tree = DecisionTree(max_depth=4)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
print("Decision Tree Accuracy:", accuracy(y_test, y_pred_tree))


Decision Tree Accuracy: 0.7377049180327869


## 🌲 Random Forest

Random forests are ensembles of decision trees, trained on random subsets of the data and features.

> ### 💡 Key Idea
>
> Combine multiple weak learners (trees) to create a strong learner.
> Each tree votes, and the majority decision is the output.
>
> This improves generalization and reduces variance.


In [26]:
forest = RandomForest(n_estimators=5, max_depth=4)
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)
print("Random Forest Accuracy:", accuracy(y_test, y_pred_forest))


Random Forest Accuracy: 0.7704918032786885


## 🔍 LIME: Model Interpretability

LIME explains individual predictions by approximating the model locally with a simpler interpretable model.

> ### 💡 How LIME Works
>
> 1. Sample points around the instance using noise
> 2. Get predictions from the black-box model
> 3. Fit a weighted linear model (e.g. ridge regression)
> 4. Interpret feature weights of this surrogate model


In [27]:
explainer = LimeTabularExplainer(X_train, feature_names=X_df.columns.tolist())
instance = X_test[5]
predict_fn = lambda x: np.array([logreg.decision_function(x)]).T
weights, idx = explainer.explain_instance(instance, predict_fn, num_samples=300)

print("Top LIME Features:")
for feat, w in explainer.as_list(weights, idx, top_k=5):
    print(f"{feat}: {w:.4f}")


IndexError: index 1 is out of bounds for axis 1 with size 1

## ✅ Summary

We implemented and interpreted multiple models from scratch:

- 🧮 Logistic Regression (with gradients and sigmoid)
- 🌳 Decision Tree (using impurity criteria)
- 🌲 Random Forest (ensemble of trees)
- 🔍 LIME (local explanations with RidgeRegression)

All within a modular framework designed for transparency and learning.
