# 10.5 • Machine Learning the Safe Way (Hippo Edition)

**Why this notebook?**  
Many “ML tutorials” skip straight to `.fit()` and `.predict()` and then celebrate a single number.  
Here we do it properly: clearly defined assumptions, leakage-free evaluation, calibrated probabilities, and documented limitations.

**Our playful scenario**  
You're consulting for *Riverbend Hippo Reserve*. Keepers want to **flag hippos for a proactive dental check** next month.  
You’ll build a **binary classifier** that outputs a probability that a hippo will **benefit from a dental check** (1 = yes, 0 = no).  
Features are simple husbandry metrics: daily forage types, mud-bathing time, water salinity, social rank, etc.  
It’s synthetic—but realistic enough to teach the right habits.

> Rule of thumb: *If a model will influence a real decision, you must be able to defend it in an audit.*

## What is leakage?

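In machine learning, “leakage” means that information that will not be available at prediction time (validation or test rows, or the target itself) has influenced training. The result is an evaluation score that looks better than real-world performance will be.

A leakage-free workflow fits every data-dependent step (scaling, feature selection, hyper-parameter tuning) on the training data only, then applies the fitted step unchanged to validation and test data.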
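
To make the definition concrete, here is a minimal sketch on pure-noise data, where an honest AUC should hover around 0.5. It is a separate toy example, not part of the hippo pipeline: the names `X_noise`, `y_noise`, `cv_demo` and `honest_pipe` are made up for illustration, and feature selection stands in for any data-dependent preprocessing step.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (assumption for this demo): features and labels are unrelated.
rng = np.random.default_rng(0)
n_samples, n_features = 100, 1000
X_noise = rng.normal(size=(n_samples, n_features))
y_noise = rng.integers(0, 2, size=n_samples)  # coin-flip labels

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those of the validation folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise,
    cv=cv_demo, scoring='roc_auc'
).mean()

# Leakage-free: the selector is refitted inside each training fold.
honest_pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(
    honest_pipe, X_noise, y_noise, cv=cv_demo, scoring='roc_auc'
).mean()

print(f"Leaky CV AUC:  {leaky_auc:.3f}")
print(f"Honest CV AUC: {honest_auc:.3f}")
```

The only difference between the two scores is *where* the selector is fitted. Because the data contain no signal, any advantage of the leaky variant is produced by the leak itself.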

### Learning objectives
1. Understand a leakage-free workflow (train/validation/test) and why it matters.
2. Read learning curves to decide whether to collect more data or simplify the model.
3. Do honest model selection with cross-validation inside a `Pipeline`.
4. Explain model behaviour with permutation importance and partial dependence.
5. Check and fix probability **calibration** (because 0.7 should really mean 70%).
6. Stress-test robustness and write a mini **model card** summarising limitations.

### What you’ll do
- **Simulate** a labelled dataset with a known ground truth (so we can verify claims).
- **Split** into train/test once; then do **all** tuning via cross-validation on the **training data only**.
- Compare a **regularised logistic regression** and a **random forest**.
- Plot **learning curves** to see bias–variance trade-offs.
- Evaluate performance **only once** on the sealed test set.
- Interpret the model and **calibrate** probabilities.
- Run simple **stress tests**; draft a compact **model card**.

## 0) Setup
> We use only `matplotlib` for plotting (department rule of thumb), and scikit-learn for modelling.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay,
    classification_report, brier_score_loss, precision_recall_curve, PrecisionRecallDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

# Reproducibility
RNG = np.random.default_rng(11088)
np.random.seed(11088)
pd.set_option('display.max_columns', 200)

## 1) Data: simulate a transparent ground truth (hippo husbandry)

We generate **n = 1,200** hippo-days with **p = 30** features.  
To make things realistic:
- Some features are **correlated** (e.g., forage types co-occur).
- Only a handful carry **true signal**; others are noise.
- Signal includes **non-linear** patterns (thresholds and sines).

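**Target:** `needs_dental_check` (1/0). Real screening tasks usually have a rarer positive class; this simulation comes out roughly balanced, so check the positive rate printed in section 2 rather than assuming imbalance.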

In [None]:
n, p = 1200, 30
X = RNG.normal(size=(n, p))

# Create three correlated blocks of features (to mimic co-occurring husbandry variables)
for b in range(3):
    z = RNG.normal(size=(n, 1))
    X[:, 10*b:10*(b+1)] += 0.6 * z

cols = [f'feat_{i+1}' for i in range(p)]

# Map a few columns to playful meanings (purely for pedagogy)
hippo_dict = {
    'feat_3': 'mud_minutes',         # time spent mud-bathing
    'feat_6': 'forage_quality',      # higher is better
    'feat_8': 'dominance_index',     # social rank proxy
    'feat_13': 'water_salinity',     # small salinity changes affect behaviour
    'feat_18': 'keeper_visits',      # number of keeper interactions
    'feat_27': 'playfulness_score'   # sine-shaped relation (too low/high may correlate oddly)
}

pretty_cols = [hippo_dict.get(c, c) for c in cols]

# True signal (non-linear pieces on a subset)
sig = (
    0.40*X[:,2]              # more mud_minutes raises the odds of needing a check
  - 0.35*X[:,5]              # poor forage quality increases risk
  + 0.50*((X[:,7]**2) > 0.5).astype(float)  # dominance threshold effect
  + 0.45*X[:,12]             # water_salinity small positive relation
  - 0.40*X[:,17]             # more keeper_visits reduce risk (preventative care)
  + 0.35*np.sin(X[:,26])     # playfulness has a wavy relation with dental health
)

logit = -0.2 + sig
proba = 1/(1 + np.exp(-logit))
y = (RNG.random(n) < (0.08 + 0.84*proba)).astype(int)  # label noise: P(y=1) is squeezed into [0.08, 0.92]

df = pd.DataFrame(X, columns=pretty_cols)
df['needs_dental_check'] = y

df.head()

## 2) Split once: **train (75%)** and **test (25%)**

- The **test set is a sealed envelope**. Touch it only once at the very end.
- All preprocessing (scaling), feature selection, and hyper-parameter tuning happen inside **cross-validation on the training data**.

In [None]:
X = df.drop(columns='needs_dental_check').values
y = df['needs_dental_check'].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=11088
)

X_tr.shape, X_te.shape, y.mean()

## 3) Leakage-safe pipelines and honest cross-validation

Why pipelines? Because **fitting the scaler on the whole dataset leaks information** from validation folds and inflates scores.

We compare two baselines:
- **Logistic regression** with regularisation.
- **Random forest** as a flexible non-linear model.

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11088)

pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])
grid_lr = GridSearchCV(
    pipe_lr,
    param_grid={'clf__C':[0.1, 0.5, 1, 2, 5]},
    scoring='roc_auc', cv=cv, n_jobs=-1
).fit(X_tr, y_tr)

pipe_rf = Pipeline([('clf', RandomForestClassifier(random_state=11088, class_weight='balanced'))])
grid_rf = GridSearchCV(
    pipe_rf,
    param_grid={
        'clf__n_estimators':[200, 400, 800],
        'clf__max_depth':[None, 10, 20],
        'clf__max_features':['sqrt','log2']
    },
    scoring='roc_auc', cv=cv, n_jobs=-1
).fit(X_tr, y_tr)

print('Best LR:', grid_lr.best_params_, 'CV AUC:', round(grid_lr.best_score_, 3))
print('Best RF:', grid_rf.best_params_, 'CV AUC:', round(grid_rf.best_score_, 3))

## 4) Learning curves: do we need more data, or a simpler model?

Interpretation guide:
- **Train and CV both low**: underfitting → increase model capacity or add features.
- **Train ≫ CV**: high variance → get more data, regularise, or simplify.

In [None]:
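def plot_learning_curve(est, X, y, title):
    # Fold assignment is already seeded through `cv`; learning_curve's own
    # random_state only matters when shuffle=True, so it is omitted here.
    sizes, train_scores, val_scores = learning_curve(
        est, X, y, cv=cv, scoring='roc_auc', n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 8)
    )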
    plt.figure(figsize=(6,4))
    plt.plot(sizes, train_scores.mean(axis=1), label='Train')
    plt.plot(sizes, val_scores.mean(axis=1), label='CV')
    plt.xlabel('Training size')
    plt.ylabel('ROC-AUC')
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()

plot_learning_curve(grid_lr.best_estimator_, X_tr, y_tr, 'Logistic learning curve')
plot_learning_curve(grid_rf.best_estimator_, X_tr, y_tr, 'RandomForest learning curve')
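
The interpretation guide above can also be read off as numbers rather than from a plot. The sketch below reports train AUC, CV AUC, and their gap at the largest training size. The helper `summarise_gap` and the toy data are assumptions made for a standalone demo; an unpruned decision tree is used only because it reliably shows a large train/CV gap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

def summarise_gap(est, X, y, cv=5):
    """Return (train_auc, cv_auc, gap) at the largest training size."""
    sizes, train_scores, val_scores = learning_curve(
        est, X, y, cv=cv, scoring='roc_auc',
        train_sizes=np.linspace(0.2, 1.0, 5)
    )
    # Rows are training sizes, columns are folds; take the last (largest) size.
    train_auc = train_scores[-1].mean()
    cv_auc = val_scores[-1].mean()
    return train_auc, cv_auc, train_auc - cv_auc

# Toy data (assumption): 400 rows, 4 informative features out of 20.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=4, random_state=0
)
train_auc, cv_auc, gap = summarise_gap(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy
)
print(f"train={train_auc:.3f}  cv={cv_auc:.3f}  gap={gap:.3f}")
```

In the notebook, the same helper could be called as `summarise_gap(grid_rf.best_estimator_, X_tr, y_tr, cv=cv)`. What counts as a "large" gap is a judgment call, so no threshold is hard-coded.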

## 5) Final evaluation on the **sealed** test set

We report:
- ROC-AUC (ranking quality)
- Precision–Recall curve (useful if the positive class is rarer)
- Confusion matrix at a neutral threshold (0.5) **and** at a **utility-tuned** threshold (later exercise)

In [None]:
for name, model in [('Logistic', grid_lr.best_estimator_), ('RandomForest', grid_rf.best_estimator_)]:
    proba = model.predict_proba(X_te)[:,1]
    auc = roc_auc_score(y_te, proba)
    print(f"\n{name}: Test ROC-AUC={auc:.3f}")
    RocCurveDisplay.from_predictions(y_te, proba)
    plt.title(f'{name} ROC')
    plt.show()

    pr, rc, _ = precision_recall_curve(y_te, proba)
    PrecisionRecallDisplay(precision=pr, recall=rc).plot()
    plt.title(f'{name} PR')
    plt.show()

    yhat = (proba >= 0.5).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_te, yhat)
    plt.title(f'{name} @0.5')
    plt.show()
    print(classification_report(y_te, yhat, digits=3))

## 6) Interpreting models responsibly

### Permutation importance (model-agnostic)
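Shuffle one feature at a time in the **test** set and measure the drop in the metric; a larger drop means the model relies more on that feature. Unlike impurity-based tree importances, this is computed on held-out data and works for any model. Caveat: when features are correlated (as in our simulated blocks), the model can fall back on a correlated partner, so individual importances may look smaller than the true effect.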

### Partial dependence
Average effect of a feature on predictions—useful for sanity-checking directionality (but beware of strong feature interactions).

In [None]:
rf = grid_rf.best_estimator_
perm = permutation_importance(rf, X_te, y_te, scoring='roc_auc', n_repeats=20, random_state=11088)

imp = (
    pd.DataFrame({'feature': df.columns[:-1], 'importance': perm.importances_mean})
    .sort_values('importance', ascending=False)
    .head(12)
)
imp

# Bar plot with matplotlib
plt.figure(figsize=(6,4))
plt.barh(imp['feature'][::-1], imp['importance'][::-1])
plt.xlabel('Permutation importance (AUC drop)')
plt.title('Top features (RandomForest)')
plt.tight_layout()
plt.show()

# Partial dependence on a few pedagogically named features
features_to_plot = []
for f in ['mud_minutes', 'forage_quality', 'dominance_index', 'water_salinity']:
    if f in df.columns:
        features_to_plot.append(list(df.columns[:-1]).index(f))

if features_to_plot:
    PartialDependenceDisplay.from_estimator(rf, X_te, features=features_to_plot)
    plt.tight_layout()
    plt.show()
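
To show what `permutation_importance` does under the hood, here is a hand-rolled sketch of the same idea: shuffle one column, re-score, and record the AUC drop. The function `manual_permutation_importance` and the toy data are illustrative assumptions; this is not scikit-learn's implementation, which should be preferred in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean AUC drop per feature after shuffling that feature's column."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
            drops[j, r] = baseline - permuted
    return drops.mean(axis=1)

# Toy data (assumption): only column 0 carries signal, columns 1-3 are noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 4))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)
toy_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)

toy_imp = manual_permutation_importance(toy_rf, X_b, y_b)
print(np.round(toy_imp, 3))
```

Because the ground truth of the toy data is known, the result can be checked by eye: column 0 should dominate and the noise columns should sit near zero.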

## 7) Probability calibration: when 0.7 should mean 70%

If decisions (e.g., scheduling a dental check) depend on **probabilities**, calibration is as important as discrimination.

- **Brier score**: mean squared error between predicted probabilities and the 0/1 outcomes (lower is better).
- Fix miscalibration with **isotonic regression** (non-parametric) or **Platt scaling** (a fitted sigmoid).

In [None]:
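uncal = grid_rf.best_estimator_
# CalibratedClassifierCV clones the pipeline and refits it on training folds only,
# so the test set stays untouched during calibration.
cal_iso = CalibratedClassifierCV(uncal, method='isotonic', cv=5).fit(X_tr, y_tr)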

for name, model in [('Uncalibrated RF', uncal), ('Isotonic RF', cal_iso)]:
    p = model.predict_proba(X_te)[:,1]
    print(f"{name}: Brier={brier_score_loss(y_te,p):.3f}, AUC={roc_auc_score(y_te,p):.3f}")
    CalibrationDisplay.from_predictions(y_te, p, n_bins=10)
    plt.title(f'Calibration: {name}')
    plt.tight_layout()
    plt.show()
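
The two calibration diagnostics used above are simple enough to compute by hand. The sketch below does so on a tiny made-up example; `brier_by_hand`, `reliability_table`, `y_demo` and `p_demo` are illustrative names, and the equal-width binning is one possible choice rather than the only one.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def brier_by_hand(y_true, p_hat):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.mean((p_hat - y_true) ** 2))

def reliability_table(y_true, p_hat, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so that p = 1.0 lands in the last bin.
    bin_ids = np.digitize(p_hat, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((float(p_hat[mask].mean()),
                         float(y_true[mask].mean()),
                         int(mask.sum())))
    return rows

# Made-up outcomes and forecasts, for illustration only.
y_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_demo = np.array([0.1, 0.3, 0.9, 0.7, 0.9, 0.3, 0.7, 0.1])

print(f"Brier by hand: {brier_by_hand(y_demo, p_demo):.3f}")
print(f"Brier sklearn: {brier_score_loss(y_demo, p_demo):.3f}")
for mean_p, obs_rate, count in reliability_table(y_demo, p_demo):
    print(f"predicted={mean_p:.2f}  observed={obs_rate:.2f}  n={count}")
```

A well-calibrated model has `predicted` close to `observed` in every bin. In the notebook, the same functions could be applied to `y_te` and the test-set probabilities `p`.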

## 8) Robustness checks and simple stress tests

- **Add noise** to inputs (simulates measurement error) and re-check AUC.
- **Drop top features** and see how brittle the model is.
- **Shift** a key feature at inference-time to mimic **dataset drift**.

In [None]:
def stress_noise(model, X, y, sigma):
    # Draw noise from the notebook's seeded Generator rather than NumPy's global state
    p = model.predict_proba(X + RNG.normal(0, sigma, X.shape))[:,1]
    return roc_auc_score(y, p)

print("AUC under additive noise:")
for s in [0.0, 0.05, 0.1, 0.2]:
    print(f"  σ={s}: {stress_noise(uncal, X_te, y_te, s):.3f}")

# Drift: add a +0.3 shift to 'water_salinity' at test time (if present)
X_te_drift = X_te.copy()
if 'water_salinity' in df.columns:
    j = list(df.columns[:-1]).index('water_salinity')
    X_te_drift[:, j] = X_te_drift[:, j] + 0.3

p_norm = uncal.predict_proba(X_te)[:,1]
p_drift = uncal.predict_proba(X_te_drift)[:,1]
print("\nAUC normal:", roc_auc_score(y_te, p_norm))
print("AUC with +0.3 drift on water_salinity:", roc_auc_score(y_te, p_drift))
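
The "drop top features" check from the list above is not covered by the cell, so here is one possible sketch. A fitted model still expects every column, so instead of deleting features this version overwrites them with a constant (the training mean), which removes their information. The helper `stress_neutralise` and the toy data are assumptions for a standalone demo; permuting the columns would be an equally reasonable alternative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def stress_neutralise(model, X, y, feature_idx, fill_values):
    """AUC after overwriting the chosen columns with constant fill values."""
    X_mod = X.copy()
    for j in feature_idx:
        X_mod[:, j] = fill_values[j]
    return roc_auc_score(y, model.predict_proba(X_mod)[:, 1])

# Toy data (assumption): columns 0 and 1 carry signal, columns 2-4 are noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(800, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + 0.5 * rng.normal(size=800) > 0).astype(int)
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)
toy_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)

train_means = X_a.mean(axis=0)  # fill values come from training data only
auc_base = roc_auc_score(y_b, toy_rf.predict_proba(X_b)[:, 1])
auc_drop_one = stress_neutralise(toy_rf, X_b, y_b, [0], train_means)
auc_drop_two = stress_neutralise(toy_rf, X_b, y_b, [0, 1], train_means)

print(f"baseline AUC:           {auc_base:.3f}")
print(f"column 0 neutralised:   {auc_drop_one:.3f}")
print(f"columns 0-1 neutralised: {auc_drop_two:.3f}")
```

In the notebook this could be called as `stress_neutralise(uncal, X_te, y_te, top_idx, X_tr.mean(axis=0))`, where `top_idx` is a hypothetical list of column indices for the features you want to knock out. A steep fall after neutralising a single feature signals a brittle model.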

## 9) Exercises (for credit)
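1. **Utility-tuned threshold**: Suppose FP costs 1 and FN costs 5. Using out-of-fold predictions on the **training** data (e.g., via `cross_val_predict`), sweep thresholds to minimise expected cost. Then report the chosen threshold and show the test-set confusion matrix at that threshold. Tuning the threshold on the test set itself would be leakage.
2. **Top-k features**: Refit the RF using only the top-10 features ranked by permutation importance computed on the **training** data (e.g., within CV folds), not on the test set. Compare test AUC and calibration.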
3. **Monitoring plan**: Propose simple metrics to monitor **data drift** monthly (e.g., population means/SDs, PSI) and a policy for **recalibration**.
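
As a starting point for exercise 3, here is a minimal sketch of the population stability index (PSI) for one numeric feature: PSI = Σ (cᵢ − rᵢ) · ln(cᵢ / rᵢ), where rᵢ and cᵢ are the fractions of reference and current data falling in bin i. The function name, the quantile binning, and the `eps` clipping are assumptions of this sketch, and PSI definitions vary slightly between sources.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles; the outer bins are open-ended.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_frac = np.bincount(np.digitize(reference, edges), minlength=n_bins) / reference.size
    cur_frac = np.bincount(np.digitize(current, edges), minlength=n_bins) / current.size
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Toy monitoring scenario (assumption): last month's values vs. this month's.
rng = np.random.default_rng(0)
ref_sample = rng.normal(size=5000)
same_sample = rng.normal(size=5000)           # no drift
shifted_sample = rng.normal(size=5000) + 0.3  # mean shifted by +0.3

psi_same = population_stability_index(ref_sample, same_sample)
psi_shift = population_stability_index(ref_sample, shifted_sample)
print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with +0.3 shift: {psi_shift:.4f}")
```

A commonly quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a substantial shift, but these cut-offs are conventions rather than guarantees; your monitoring plan should justify whatever thresholds it uses.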

## 10) Mini Model Card (template)

**Intended use**: Flag hippos for proactive dental checks at Riverbend.  
**Data**: 1,200 synthetic hippo-days; 30 numeric features; positive rate as printed in section 2.  
**Preprocessing**: Standardisation (for LR) within CV folds; RF on raw inputs.  
**Algorithms**: Logistic regression (L2), Random forest.  
**Validation**: 5-fold stratified CV for selection; single held-out test for final report.  
**Performance (example)**: RF test ROC-AUC ≈ 0.85–0.9 (your run may vary); calibration improved by isotonic regression.  
**Robustness**: Moderate sensitivity to additive noise; AUC declines under feature drift (e.g., water_salinity +0.3).  
**Fairness**: If subgroup labels (sex/age/pen condition) existed, report per-group metrics.  
**Limitations**: Synthetic features; unmodelled temporal/seasonal effects; no clinician/keeper feedback loop.  
**Update policy**: Retrain quarterly or when drift metrics breach thresholds; recalibrate monthly if needed.

### Key takeaways
- **Pipelines + CV** prevent information leakage.
- **Learning curves** inform whether to simplify or collect more data.
- **Calibration** is essential if probabilities guide actions.
- Always document **stress tests**, **drift**, and **limitations**—that’s what trustworthy ML looks like.