# 10.5 • Machine Learning: From First Principles to Trustworthy Models

This notebook demystifies ML: there’s no magic, only **assumptions + optimisation + evaluation**. We’ll show not only *how* to fit models, but *why* specific steps are necessary for results you can defend in a viva or a regulatory audit.

### Learning objectives
1. Data workflow: leakage-free preprocessing, train/validation/test, reproducibility.
2. Bias–variance and sample complexity: learning curves and when to stop.
3. Honest model selection: cross-validation that doesn’t lie.
4. Interpretable outputs: permutation importance, partial dependence, SHAP (optional).
5. Probability quality: calibration curves, Brier score, thresholds tuned to utility.
6. Robustness: stress tests, subgroup performance, drift, and documentation.

We simulate a medium-sized clinical/metabolomics-like dataset (n = 1,200, p = 30) with three weakly correlated feature blocks and a partly non-linear signal carried by six features.

In [None]:
# Setup
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import (train_test_split, StratifiedKFold, GridSearchCV,
                                     cross_val_score, learning_curve)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay,
                             classification_report, brier_score_loss, precision_recall_curve,
                             PrecisionRecallDisplay)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
np.random.seed(11088); sns.set_style('whitegrid')
pd.set_option('display.max_columns', 200)

## 0) Data simulation (transparent ground truth)
We bake in structure so we know whether models recover it.

In [None]:
n, p = 1200, 30
X = np.random.normal(size=(n, p))
# Inject correlated blocks
for b in range(3):
    z = np.random.normal(size=(n,1))
    X[:, 10*b:10*(b+1)] += 0.6*z
cols = [f'feat_{i+1}' for i in range(p)]

# Non-linear signal in a subset
sig = (0.4*X[:,2] - 0.35*X[:,5] + 0.5*(X[:,7]**2>0.5).astype(float)
       + 0.45*X[:,12] - 0.4*X[:,17] + 0.35*np.sin(X[:,26]))
logit = -0.2 + sig
proba = 1/(1+np.exp(-logit))
y = (np.random.rand(n) < 0.08 + 0.84*proba).astype(int)  # probabilities squashed into [0.08, 0.92] (label noise); check y.mean() for the class balance
df = pd.DataFrame(X, columns=cols); df['target'] = y
df.head()
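
Because the ground truth is known, the true label probabilities give a ceiling on how well any model can discriminate. The sketch below re-creates the same design with its own generator and seed (so its numbers will not match `df` exactly) and scores the true probabilities against the sampled labels. The names `simulate`, `p_true` and `oracle_auc` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def simulate(n=1200, p=30, seed=0):
    """Re-create the design above with a local generator (seed is an assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    for b in range(3):                                   # three correlated blocks of 10 features
        X[:, 10*b:10*(b+1)] += 0.6 * rng.normal(size=(n, 1))
    sig = (0.4*X[:, 2] - 0.35*X[:, 5] + 0.5*(X[:, 7]**2 > 0.5)
           + 0.45*X[:, 12] - 0.4*X[:, 17] + 0.35*np.sin(X[:, 26]))
    p_true = 0.08 + 0.84 / (1 + np.exp(-(-0.2 + sig)))   # probability each label is 1
    y = (rng.random(n) < p_true).astype(int)
    return X, y, p_true

X_sim, y_sim, p_true = simulate()
oracle_auc = roc_auc_score(y_sim, p_true)   # score of a "model" that knows the truth
print(f"prevalence={y_sim.mean():.3f}  oracle ROC-AUC={oracle_auc:.3f}")
```

A cross-validated AUC well above this ceiling would point to leakage rather than skill; one well below it leaves room for a better model.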

## 1) Reproducible split: train (75%) / test (25%)
**Golden rule**: the test set is a sealed envelope — use **once** at the end. Everything else (preprocessing, hyper-parameters, model choice) happens *inside* cross-validation on the training set.

In [None]:
X = df.drop(columns='target').values
y = df['target'].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=11088)
X_tr.shape, X_te.shape, y.mean()

## 2) Baselines and leakage-safe pipelines
- Numerical scaling *must* be fit on training folds only.
- Use `Pipeline` so cross-validation doesn’t leak information from validation folds.
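
To see why this matters, here is a minimal sketch on pure-noise data, where the honest AUC should hover around 0.5. Scaling alone leaks very little, so the demo uses univariate feature selection (`SelectKBest`), which is an assumption added for illustration and not a step in this notebook. All `*_noise` and `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # no feature carries any signal
y_noise = np.repeat([0, 1], 50)          # labels are independent of X_noise
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: choose the 20 "best" features using ALL rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise,
                            cv=cv_demo, scoring='roc_auc').mean()

# Honest: scaling and selection are refit on the training part of every fold.
honest_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(honest_pipe, X_noise, y_noise,
                             cv=cv_demo, scoring='roc_auc').mean()

print(f"leaky CV AUC={leaky_auc:.3f}  honest CV AUC={honest_auc:.3f}")
```

The leaky estimate looks impressive only because the validation rows helped choose the features; wrapping every data-dependent step in the pipeline removes that advantage.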

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11088)

pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])
grid_lr = GridSearchCV(pipe_lr, param_grid={'clf__C':[0.1, 0.5, 1, 2, 5]},
                       scoring='roc_auc', cv=cv, n_jobs=-1)
grid_lr.fit(X_tr, y_tr)
grid_lr.best_params_, grid_lr.best_score_

In [None]:
pipe_rf = Pipeline([('clf', RandomForestClassifier(random_state=11088, class_weight='balanced'))])
grid_rf = GridSearchCV(pipe_rf,
    param_grid={'clf__n_estimators':[200, 400, 800],
                'clf__max_depth':[None, 10, 20],
                'clf__max_features':['sqrt','log2']},
    scoring='roc_auc', cv=cv, n_jobs=-1)
grid_rf.fit(X_tr, y_tr)
grid_rf.best_params_, grid_rf.best_score_

## 3) Learning curves: do we need more data or a simpler model?
Bias–variance in practice: if training and CV scores are both low and close together → underfit (high bias; add capacity or better features, because more data alone will not help). If training ≫ CV → overfit (high variance; add data or regularise).

In [None]:
def plot_learning_curve(est, X, y, title):
    sizes, train_scores, val_scores = learning_curve(
        est, X, y, cv=cv, scoring='roc_auc', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 8), random_state=11088)
    plt.figure(figsize=(6,4))
    plt.plot(sizes, train_scores.mean(axis=1), label='Train')
    plt.plot(sizes, val_scores.mean(axis=1), label='CV')
    plt.xlabel('Training size'); plt.ylabel('ROC-AUC'); plt.title(title); plt.legend(); plt.tight_layout(); plt.show()

plot_learning_curve(grid_lr.best_estimator_, X_tr, y_tr, 'Logistic learning curve')
plot_learning_curve(grid_rf.best_estimator_, X_tr, y_tr, 'RandomForest learning curve')
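
The same diagnosis can be read off as numbers instead of a plot. The sketch below is self-contained and uses synthetic data from `make_classification` and an unpruned decision tree, both assumptions chosen to make the high-variance pattern obvious; it is not a rerun of the models above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=5, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=cv_demo, scoring='roc_auc', train_sizes=np.linspace(0.2, 1.0, 5))

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)       # fold-to-fold spread of the CV score

for n_i, tr, va, sd in zip(sizes, train_mean, val_mean, val_std):
    print(f"n={n_i:4d}  train={tr:.3f}  cv={va:.3f} ± {sd:.3f}  gap={tr - va:.3f}")
```

A gap that stays large as `n` grows suggests regularising the model; a gap that is still shrinking at the largest size suggests that more data would help.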

## 4) Final evaluation on the *sealed* test set
Report ROC-AUC, the PR curve (useful under class imbalance), and the confusion matrix at the default 0.5 threshold. Choosing a utility-tuned threshold instead is the subject of Exercise 1.

In [None]:
for name, model in [('Logistic', grid_lr.best_estimator_), ('RandomForest', grid_rf.best_estimator_)]:
    proba = model.predict_proba(X_te)[:,1]
    auc = roc_auc_score(y_te, proba)
    print(f"\n{name}: Test ROC-AUC={auc:.3f}")
    RocCurveDisplay.from_predictions(y_te, proba); plt.title(f'{name} ROC'); plt.show()
    pr, rc, _ = precision_recall_curve(y_te, proba)
    PrecisionRecallDisplay(precision=pr, recall=rc).plot(); plt.title(f'{name} PR'); plt.show()
    yhat = (proba >= 0.5).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_te, yhat); plt.title(f'{name} @0.5'); plt.show()
    print(classification_report(y_te, yhat, digits=3))
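
The 0.5 cut-off is only a default. A minimal sketch of a cost-based alternative follows, assuming the costs from Exercise 1 (FP = 1, FN = 5). The helper names are hypothetical, and the synthetic, perfectly calibrated scores stand in for out-of-fold training predictions, so that the test set stays sealed while the threshold is chosen.

```python
import numpy as np

def expected_cost(y_true, proba, threshold, cost_fp=1.0, cost_fn=5.0):
    """Average misclassification cost per case at a given threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    fp = np.sum(pred & (y_true == 0))     # false alarms
    fn = np.sum(~pred & (y_true == 1))    # missed positives
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

def best_threshold(y_true, proba, cost_fp=1.0, cost_fn=5.0):
    """Sweep a grid of thresholds; return (threshold, cost) at the minimum."""
    grid = np.linspace(0.01, 0.99, 99)
    costs = np.array([expected_cost(y_true, proba, t, cost_fp, cost_fn) for t in grid])
    i = int(costs.argmin())
    return float(grid[i]), float(costs[i])

# Synthetic stand-in: scores that are calibrated by construction.
rng = np.random.default_rng(0)
p_demo = rng.uniform(size=5000)
y_demo = (rng.uniform(size=5000) < p_demo).astype(int)

t_star, c_star = best_threshold(y_demo, p_demo)
cost_default = expected_cost(y_demo, p_demo, 0.5)
print(f"best threshold={t_star:.2f}  cost={c_star:.3f}  cost at 0.5={cost_default:.3f}")
```

For calibrated probabilities the cost-minimising threshold is `cost_fp / (cost_fp + cost_fn)`, here 1/6, so the sweep should land well below 0.5. A large departure from that value on real predictions is itself a hint of miscalibration.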

## 5) Interpreting models responsibly
### Permutation importance (model-agnostic)
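Shuffle one feature's values in held-out data and measure how much the metric drops; the drop estimates how much the model relies on that feature. Unlike impurity-based tree importance, it is computed on held-out data and is not biased towards high-cardinality features. Correlated features (such as the blocks simulated here) can still share or mask each other's importance.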
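
The procedure is short enough to write by hand. The sketch below is a simplified NumPy version for intuition, not scikit-learn's implementation; `permutation_importance_manual` and the toy model are hypothetical.

```python
import numpy as np

def permutation_importance_manual(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between column j and y
            drops[j] += baseline - metric(y, predict(X_perm))
    return drops / n_repeats

def toy_predict(X):
    """A 'model' that looks only at column 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y, y_hat):
    return float(np.mean(y == y_hat))

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

imp_toy = permutation_importance_manual(toy_predict, X_toy, y_toy, accuracy)
print(imp_toy)   # column 1 is ignored by the model, so its importance is exactly 0
```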

### Partial dependence (marginal effect curves)
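Average prediction as one feature is swept over a grid while the other features keep their observed values (that is, averaging over their empirical distribution). With correlated features this can evaluate the model on combinations that never occur in the data, so read the curves with care.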
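
A hand-rolled version makes the definition concrete. This is a sketch for intuition with a hypothetical helper and a toy function whose partial dependence is known in closed form; it is not how `PartialDependenceDisplay` is implemented internally.

```python
import numpy as np

def partial_dependence_manual(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v                 # overwrite one feature for every row
        curve.append(predict(X_mod).mean())   # average over the other features as observed
    return np.array(curve)

def toy_predict(X):
    """Known function f(x) = 2*x0 + x1."""
    return 2 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
grid = np.linspace(-1, 1, 5)

pd_curve = partial_dependence_manual(toy_predict, X_toy, feature=0, grid=grid)
print(np.round(pd_curve, 3))   # equals 2*grid + mean of column 1
```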

In [None]:
rf = grid_rf.best_estimator_
perm = permutation_importance(rf, X_te, y_te, scoring='roc_auc', n_repeats=20, random_state=11088)
imp = pd.DataFrame({'feature': cols, 'importance': perm.importances_mean}).sort_values('importance', ascending=False).head(12)
sns.barplot(data=imp, x='importance', y='feature'); plt.title('Permutation importance (RF)'); plt.tight_layout(); plt.show()
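# Columns 2, 5, 7, 12 are feat_3, feat_6, feat_8, feat_13 (four of the six signal features)
PartialDependenceDisplay.from_estimator(rf, X_te, features=[2,5,7,12], feature_names=cols); plt.tight_layout(); plt.show()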

## 6) Probability calibration: when 0.7 should mean 70%
If you’ll **act** on probabilities (screening, triage), calibration matters as much as discrimination.
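- Brier score: mean squared error between predicted probabilities and 0/1 outcomes (lower is better; it reflects both calibration and discrimination).
- Fix miscalibration with **isotonic** regression or **Platt** (sigmoid) scaling; isotonic is more flexible but can overfit when calibration data are scarce.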
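
Both quantities are easy to compute by hand, which helps when reading the plots. The sketch below implements the Brier score, BS = mean((p_i - y_i)^2), and a uniform-bin reliability table (roughly what a calibration curve plots). The helper names and the synthetic calibrated scores are assumptions for illustration.

```python
import numpy as np

def brier(y_true, proba):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return float(np.mean((np.asarray(proba, dtype=float) - np.asarray(y_true)) ** 2))

def reliability_table(y_true, proba, n_bins=10):
    """Per-bin (mean predicted probability, observed event rate, count); empty bins skipped."""
    y_true, proba = np.asarray(y_true), np.asarray(proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(proba, edges[1:-1])     # bin index in 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        m = idx == b
        if m.any():
            rows.append((float(proba[m].mean()), float(y_true[m].mean()), int(m.sum())))
    return rows

# Synthetic scores that are calibrated by construction.
rng = np.random.default_rng(0)
p_demo = rng.uniform(size=2000)
y_demo = (rng.uniform(size=2000) < p_demo).astype(int)

base_rate = y_demo.mean()
brier_model = brier(y_demo, p_demo)
brier_base = brier(y_demo, np.full(2000, base_rate))   # always predict the prevalence
print(f"Brier(model)={brier_model:.3f}  Brier(base rate)={brier_base:.3f}")
for mean_p, observed, count in reliability_table(y_demo, p_demo, n_bins=5):
    print(f"predicted={mean_p:.2f}  observed={observed:.2f}  n={count}")
```

Predicting the prevalence for everyone scores `base_rate * (1 - base_rate)`, which is a useful reference: a model whose Brier score is not clearly below it adds little.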

In [None]:
uncal = grid_rf.best_estimator_
cal_iso = CalibratedClassifierCV(uncal, method='isotonic', cv=5).fit(X_tr, y_tr)
for name, model in [('Uncalibrated', uncal), ('Isotonic', cal_iso)]:
    p = model.predict_proba(X_te)[:,1]
    print(f"{name}: Brier={brier_score_loss(y_te,p):.3f}, AUC={roc_auc_score(y_te,p):.3f}")
    CalibrationDisplay.from_predictions(y_te, p, n_bins=10)
    plt.title(f'Calibration: {name}'); plt.tight_layout(); plt.show()

## 7) Robustness, fairness, and documentation
- **Stress tests**: add Gaussian noise, drop top-k features, simulate shift (mean offset) and re-evaluate.
- **Subgroup performance**: if the data include a sex or age flag, report metrics per group as well as overall.
- **Model card (mini)**: purpose, data, preprocessing, metrics, limitations, update policy.
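
The simulated data have no demographic columns, so the sketch below invents a subgroup flag purely for illustration. `auc_by_group`, `group_demo` and the noise levels are all hypothetical; the point is the reporting pattern, including the group size and the case where AUC is undefined.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_group(y_true, proba, group):
    """Return {group: (n, ROC-AUC)}; AUC is NaN if a group contains a single class."""
    y_true, proba, group = np.asarray(y_true), np.asarray(proba), np.asarray(group)
    result = {}
    for g in np.unique(group).tolist():
        m = group == g
        has_both = np.unique(y_true[m]).size == 2
        auc = roc_auc_score(y_true[m], proba[m]) if has_both else float('nan')
        result[g] = (int(m.sum()), float(auc))
    return result

rng = np.random.default_rng(0)
n_demo = 1000
y_demo = rng.integers(0, 2, size=n_demo)
group_demo = np.where(np.arange(n_demo) % 2 == 0, 'A', 'B')   # hypothetical subgroup flag
noise_sd = np.where(group_demo == 'A', 0.5, 3.0)              # scores are noisier for group B
score_demo = y_demo + noise_sd * rng.normal(size=n_demo)

by_group = auc_by_group(y_demo, score_demo, group_demo)
for g, (n_g, auc_g) in by_group.items():
    print(f"group {g}: n={n_g}  AUC={auc_g:.3f}")
```

An overall AUC would average these two groups together and hide the weaker one, which is why the per-group table belongs in the model card.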

In [None]:
# Simple stress test: additive noise
def stress(model, X, y, sigma):
    p = model.predict_proba(X + np.random.normal(0, sigma, X.shape))[:,1]
    return roc_auc_score(y, p)
for s in [0.0, 0.05, 0.1, 0.2]:
    print(f"RF AUC under noise σ={s}: {stress(uncal, X_te, y_te, s):.3f}")
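
The mean-offset shift mentioned in the list can be sketched the same way. The example below is self-contained and uses synthetic data and freshly fitted models (all assumptions, not the tuned estimators above). It also illustrates a subtlety: adding a constant to one feature moves every logistic-regression score by the same amount, so a rank-based metric such as ROC-AUC is unchanged (up to floating-point ties), whereas a tree ensemble can react.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# shuffle=False keeps the informative features in the first columns.
X_d, y_d = make_classification(n_samples=800, n_features=10, n_informative=4,
                               n_redundant=0, n_clusters_per_class=1,
                               shuffle=False, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_d, y_d, test_size=0.25,
                                      stratify=y_d, random_state=0)

models = {
    'LR': LogisticRegression(max_iter=1000).fit(X_a, y_a),
    'RF': RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a),
}

def auc_under_shift(model, X, y, column, offset):
    """ROC-AUC after adding a constant offset to one column of X."""
    X_shift = X.copy()
    X_shift[:, column] += offset
    return roc_auc_score(y, model.predict_proba(X_shift)[:, 1])

offsets = (0.0, 0.3, 1.0)
results = {name: [auc_under_shift(m, X_b, y_b, column=0, offset=o) for o in offsets]
           for name, m in models.items()}
for name, aucs in results.items():
    print(name, [f"{a:.3f}" for a in aucs])
```

An unchanged AUC does not mean the shifted model is safe to use: its probabilities are all displaced, so calibration and any fixed decision threshold can still degrade.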

## Exercises
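1. **Utility-tuned threshold**: Suppose an FP costs 1 and an FN costs 5. Sweep thresholds on out-of-fold training predictions (not on the test set, which stays sealed) to minimise expected cost, then report the cost at that threshold on the test set.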
2. **Top-k features**: Refit RF on only the top-10 permutation-important features. What happens to test AUC and calibration?
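3. **Data drift**: Add a +0.3 shift to `feat_12` in the test set only. By how much does AUC drop? How would you monitor for this in production? (Note that `feat_12` is column index 11, which carries no direct signal in the simulation; repeat with `feat_13`, column index 12, which does.)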

## Key takeaways
- Fitting preprocessing inside a `Pipeline` under CV keeps validation-fold information out of training.
- Always show a **learning curve**.
- Report **calibration** if probabilities drive action.
- Document stress tests and subgroup performance — that’s what trust looks like.