### Preprocessing and Modelling
This notebook prepares the **Cleveland Heart Disease dataset** for machine learning and walks through feature encoding and scaling, model training, threshold tuning, and performance evaluation with cross-validation.

## Loading Required Libraries

The notebook begins by importing essential Python libraries for data analysis, visualization, and modeling:

1. numpy and pandas for numerical operations and data handling

2. matplotlib and seaborn for visualizations

Seaborn’s default theme is then activated with sns.set() for cleaner plots. This sets up the environment for the entire workflow.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [2]:
df = pd.read_csv('~/heart_disease_predictor/data/heart_disease_clean.csv')

The cleaned heart disease dataset is loaded from disk. This file is assumed to contain no missing values or corrupted records, as earlier cleaning was handled in a previous notebook.

In [3]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [4]:
df['thal'] = df['thal'].astype(str)

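The thal variable is converted to string so that it is handled as a categorical feature.
The DictVectorizer used later one-hot encodes string values, which keeps the model from treating thal codes such as 3.0, 6.0, and 7.0 as points on a continuous numeric scale.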

In [5]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [9]:
feature_cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal']

### Threshold Tuning with Cross-Validation

This section evaluates how different classification thresholds affect precision and recall for Logistic Regression. Since the default threshold of 0.5 isn’t always ideal for medical prediction tasks, we test a range of thresholds from 0.10 to 0.90.

For each threshold, we perform 5-fold stratified cross-validation to get more reliable metrics. In every fold, the pipeline is fitted on the training split, probabilities are generated for the validation split, and predictions are made by applying the chosen threshold. We then compute precision and recall for that fold. Note that the folds are drawn from the full dataset (X, y), so the rows later held out by train_test_split also take part in threshold selection.

After evaluating all folds, we calculate the average precision and recall for each threshold. The final DataFrame provides an overview of how precision and recall change as the threshold increases, helping us choose the best trade-off for this project.

In [10]:
X = df[feature_cols]
y = df['target']

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer


# Candidate thresholds from 0.10 to 0.90 in steps of 0.05
thresholds = np.arange(0.1, 0.91, 0.05)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = []

# One-hot encode string features, standardize, then fit a regularized model
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    StandardScaler(),
    LogisticRegression(max_iter=2000, random_state=42, C=0.1)
)

print("Starting Threshold Tuning...")

for thresh in thresholds:
    fold_precisions = []
    fold_recalls = []

    for train_idx, val_idx in kf.split(X, y):
        X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
        y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]

        # DictVectorizer expects a list of dicts, one per row
        X_train_dicts = X_train_fold.to_dict(orient='records')
        X_val_dicts = X_val_fold.to_dict(orient='records')

        # Refit the whole pipeline on this fold's training split only
        pipeline.fit(X_train_dicts, y_train_fold)

        # Probability of the positive class for the validation rows
        probs = pipeline.predict_proba(X_val_dicts)[:, 1]

        # Predict positive when the probability reaches the threshold
        preds = (probs >= thresh).astype(int)

        fold_precisions.append(precision_score(y_val_fold, preds, zero_division=0))
        fold_recalls.append(recall_score(y_val_fold, preds, zero_division=0))

    # Average the per-fold metrics for this threshold
    results.append({
        "threshold": thresh,
        "precision_mean": np.mean(fold_precisions),
        "recall_mean": np.mean(fold_recalls)
    })

threshold_df = pd.DataFrame(results)
threshold_df

Starting Threshold Tuning...


Unnamed: 0,threshold,precision_mean,recall_mean
0,0.1,0.555929,0.985714
1,0.15,0.611657,0.964286
2,0.2,0.644469,0.93545
3,0.25,0.694352,0.906878
4,0.3,0.742781,0.885185
5,0.35,0.801802,0.856085
6,0.4,0.812224,0.812963
7,0.45,0.823906,0.79127
8,0.5,0.847919,0.79127
9,0.55,0.859879,0.79127


In [14]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, cross_val_score

Rows are converted to dictionaries because the next step uses a DictVectorizer, which expects dictionary-based input.

In [15]:
X_train_dict = X_train.to_dict(orient='records')
X_test_dict = X_test.to_dict(orient='records')

In [16]:
dv = DictVectorizer(sparse=False)

In [17]:
X_train = dv.fit_transform(X_train_dict)
X_test = dv.transform(X_test_dict)

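1. String-valued features (here only thal, which was cast to str earlier) are one-hot encoded automatically.

2. Numerical features are passed through unchanged, including integer-coded categories such as cp, slope, and restecg.

3. sparse=False makes the output a dense NumPy array.

The result is a fully numerical feature matrix ready for modeling.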

In [18]:
features = dv.get_feature_names_out()

This stores the final list of feature names after encoding, which is useful for inspection or future documentation.

## Feature Scaling
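Standardization is applied to every column of the encoded feature matrix (including the one-hot thal columns) so that each has zero mean and unit variance on the training data.
This is especially important for the regularized Logistic Regression used here: the penalty treats all coefficients alike, so features measured on larger scales would otherwise be penalized unevenly.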
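
As a small illustration of what the scaler does, the sketch below fits StandardScaler on a made-up training array and reuses the learned mean and standard deviation on a made-up test row. The arrays and the names train, test, and demo_scaler are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two numeric columns.
train = np.array([[120.0, 200.0], [140.0, 240.0], [160.0, 280.0]])
test = np.array([[150.0, 220.0]])

demo_scaler = StandardScaler()

# fit_transform learns the per-column mean and standard deviation from train.
train_scaled = demo_scaler.fit_transform(train)

# transform reuses the training statistics; nothing is learned from test.
test_scaled = demo_scaler.transform(test)

print(demo_scaler.mean_)            # per-column training means
print(train_scaled.mean(axis=0))    # approximately 0 for each column
print(train_scaled.std(axis=0))     # approximately 1 for each column
print(test_scaled)
```

Fitting on the training split only, as the notebook does, keeps test-set statistics out of the preprocessing step.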

In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
scaler = StandardScaler()

In [21]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
model = LogisticRegression(max_iter=2000, random_state=42, C=0.1)

A regularized Logistic Regression model is used:

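1. max_iter=2000 gives the solver ample iterations to converge

2. C=0.1 applies stronger regularization than the default C=1.0 (C is the inverse of the regularization strength)

3. random_state=42 is set for reproducibility, although it only affects the sag, saga, and liblinear solvers and not the default lbfgs solver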
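
The effect of C can be sketched by fitting the same model at several values and comparing the size of the learned coefficients. The data below is synthetic (make_classification), and the names X_demo, y_demo, and coef_norms are illustrative assumptions; the exact numbers will differ on the heart disease data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_demo, y_demo)
    # Smaller C means a heavier L2 penalty, which shrinks the coefficients.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<5} coefficient norm={norm:.3f}")
```

A smaller coefficient norm means a smoother, less confident model, which is often a sensible default for a dataset of only a few hundred rows.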

The model is evaluated using 5-fold cross-validation with ROC-AUC as the scoring metric.
This gives a more stable estimate of performance before touching the test set.

In [24]:
cross_val_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')

In [25]:
cross_val_scores

array([0.94314381, 0.82154882, 0.91958042, 0.92132867, 0.8548951 ])

In [26]:
print("Mean: ", cross_val_scores.mean())
print("Standard Deviation: ", cross_val_scores.std())

Mean:  0.8920993660124095
Standard Deviation:  0.04599267877050984
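
The scores above come from a matrix that was scaled once before cross-validation, so every validation fold contributed to the scaler's statistics. A minimal sketch of an alternative is to wrap the scaler and the model in one pipeline, so that scaling is refitted inside each training fold. The data is synthetic and the names X_demo, y_demo, scaled_model, and auc_scores are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=240, n_features=8, random_state=0)

# The scaler is part of the estimator, so each fold fits it on its own
# training portion and only applies it to the validation portion.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, C=0.1),
)

auc_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5, scoring="roc_auc")
print("Mean: ", auc_scores.mean())
print("Standard Deviation: ", auc_scores.std())
```

With standardization the difference is usually small, but keeping preprocessing inside the pipeline removes the question entirely.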


## Evaluating Multiple Metrics

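ROC-AUC alone does not show how the model behaves at a fixed decision threshold, so cross_validate is used to score several metrics in one run:

Accuracy, Precision, Recall, F1 Score, ROC AUC

Because each metric is computed on five different validation folds, the scores are less dependent on any single split.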

In [27]:
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

In [28]:
scoring_metrics = [
    'accuracy', 
    'precision', 
    'recall', 
    'f1', 
    'roc_auc'
]

In [29]:
from sklearn.model_selection import cross_validate
cross_val_scores = cross_validate(model, X_train_scaled, y_train, cv=5, scoring=scoring_metrics)

In [30]:
pd.DataFrame(cross_val_scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_roc_auc
0,0.007195,0.036191,0.918367,1.0,0.826087,0.904762,0.943144
1,0.007899,0.035909,0.795918,0.8,0.727273,0.761905,0.821549
2,0.004323,0.0181,0.833333,0.818182,0.818182,0.818182,0.91958
3,0.004386,0.014896,0.8125,0.782609,0.818182,0.8,0.921329
4,0.003534,0.012333,0.791667,0.833333,0.681818,0.75,0.854895
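
A per-fold table like the one above is easier to read once it is condensed to a mean and spread per metric. The sketch below does this on synthetic data; the names X_demo, y_demo, cv_df, and summary are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=240, n_features=8, random_state=0)

cv_results = cross_validate(
    LogisticRegression(max_iter=2000, C=0.1),
    X_demo,
    y_demo,
    cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
cv_df = pd.DataFrame(cv_results)

# Keep only the score columns (drop fit_time and score_time),
# then report the mean and standard deviation across folds.
summary = cv_df.filter(like="test_").agg(["mean", "std"]).T
print(summary)
```

Note that pandas computes the sample standard deviation (ddof=1), so the spread reported here is slightly larger than what NumPy's .std() returns for the same five values.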


In [31]:
from sklearn.metrics import precision_recall_curve

In [32]:
model.fit(X_train_scaled, y_train)

parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,0.1
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,42
solver,'lbfgs'
max_iter,2000


In [33]:
y_pred = model.predict_proba(X_test_scaled)[:, 1]

In [34]:
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

In [35]:
y_pred_final = (y_pred >= 0.4).astype(int)

In [36]:
accuracy_score(y_test, y_pred_final)

0.8524590163934426

In [39]:
cl_rep = classification_report(y_test, y_pred_final, output_dict=True)

In [40]:
pd.DataFrame(cl_rep)

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.928571,0.787879,0.852459,0.858225,0.863991
recall,0.787879,0.928571,0.852459,0.858225,0.852459
f1-score,0.852459,0.852459,0.852459,0.852459,0.852459
support,33.0,28.0,0.852459,61.0,61.0


*The identical F1 scores for both classes, and the mirrored precision and recall values, are a statistical quirk: the True Positives and True Negatives were both exactly 26.*
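
The quirk can be checked by hand. The counts below are inferred from the report rather than taken from a printed confusion matrix: class 0 has support 33 with recall 0.787879 (26/33), and class 1 has support 28 with recall 0.928571 (26/28), which implies 7 false positives and 2 false negatives. The names y_true_demo, y_pred_demo, and scores are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts inferred from the classification report (an assumption).
tn, fp, fn, tp = 26, 7, 2, 26

# Rebuild label vectors that produce exactly this confusion matrix.
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

scores = {}
for label in (0, 1):
    scores[label] = {
        "precision": precision_score(y_true_demo, y_pred_demo, pos_label=label),
        "recall": recall_score(y_true_demo, y_pred_demo, pos_label=label),
        "f1": f1_score(y_true_demo, y_pred_demo, pos_label=label),
    }
    print(label, {k: round(v, 6) for k, v in scores[label].items()})

print("accuracy", round(accuracy_score(y_true_demo, y_pred_demo), 6))
```

With TP equal to TN, precision for one class is 26/(26+7) and recall for the other is also 26/(26+7), and likewise with 26/(26+2). Both F1 scores reduce to 52/61, which is also the accuracy, so every entry in the f1-score row is 0.852459.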

In [38]:
roc_auc_score(y_test, y_pred)

0.9523809523809523

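The Logistic Regression model achieved a ROC-AUC of 0.95 on the test set, indicating strong discriminative power. Lowering the decision threshold from the default 0.5 to 0.4 favors recall, which reduces false negatives (missing a patient with heart disease). In cross-validation this threshold gave a mean recall of 0.81 and a mean precision of 0.81; on the test set, recall for the positive class was 0.93 with a precision of 0.79.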