# Module 3: First Clinical Prediction Model (Traditional ML)
## Predicting 30-Day Readmission Risk

**Goal:** Build, compare, and interpret two baseline models (Logistic Regression and Decision Tree)
for a clinically meaningful prediction task.

### How to use this notebook
- Run cells top to bottom.
- Keep notes on what changes model behavior.
- This module introduces model training in small, reproducible steps.

### Learning objectives
1. Define features (`X`) and label (`y`) for a clinical prediction task.
2. Train a baseline Logistic Regression model and Decision Tree model.
3. Compare metrics that matter clinically, not just accuracy.
4. Explore threshold trade-offs between sensitivity and specificity.
5. Inspect model behavior and review missed high-risk patients.

## Section 0: Clinical Problem
At discharge, a care team wants to identify patients at high risk of 30-day readmission.
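The team can only intervene for a limited number of patients, so the two error types carry different
operational costs: a false negative is a high-risk patient who gets no intervention, while a false positive
spends a scarce intervention slot on a patient who would not have been readmitted.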
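
To make the capacity constraint concrete, here is a minimal sketch. It assumes six hypothetical encounters with made-up risk scores and outcomes, and an assumed capacity of two interventions; none of these values come from the course dataset.

```python
# Hypothetical risk scores and outcomes for six discharges (illustrative values only).
patients = [
    # (encounter_id, predicted_risk, readmitted_within_30d)
    ('enc_01', 0.82, 1),
    ('enc_02', 0.35, 0),
    ('enc_03', 0.58, 1),
    ('enc_04', 0.67, 0),
    ('enc_05', 0.12, 0),
    ('enc_06', 0.49, 1),
]
capacity = 2  # assumed number of patients the team can follow up

# Rank by predicted risk and flag only as many patients as capacity allows.
ranked = sorted(patients, key=lambda p: p[1], reverse=True)
flagged = ranked[:capacity]
not_flagged = ranked[capacity:]

false_positives = [p[0] for p in flagged if p[2] == 0]      # slot used, no readmission
false_negatives = [p[0] for p in not_flagged if p[2] == 1]  # readmitted, never flagged

print('Flagged:', [p[0] for p in flagged])
print('False positives:', false_positives)
print('False negatives:', false_negatives)
```

With only two slots, one slot goes to a patient who was not readmitted and two readmitted patients are never flagged, which is the trade-off the rest of this module examines.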

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from IPython.display import display

try:
    import ipywidgets as widgets
except ImportError as exc:
    raise ImportError('ipywidgets is required for the threshold demo.') from exc

def resolve_data_path(filename):
    candidates = [Path('../data') / filename, Path('data') / filename]
    for cand in candidates:
        if cand.exists():
            return cand
    raise FileNotFoundError(f'Could not locate {filename} in ../data or data/')

DATA_PATH = resolve_data_path('module_02_cleaned_for_module_03.csv')
df = pd.read_csv(DATA_PATH)
print(f'Loaded {len(df)} rows and {df.shape[1]} columns from {DATA_PATH}.')


In [None]:
display(df.head(8))
print()
print('Class balance:')
print(df['label_readmit_30d'].value_counts(normalize=True).rename('proportion').round(3))


## Section 1: Define Features and Split Data
We use only information available by discharge time.
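
For the split itself, a minimal sketch on made-up data may help show what `stratify` does. It assumes 100 hypothetical patients with a 20% positive rate; `X_toy` and `y_toy` are illustrative names and are not part of the course dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 20 readmissions among 100 hypothetical patients.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature, one column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,
    random_state=42,
    stratify=y_toy,  # keep the positive rate the same in both portions
)

print(f'Train prevalence: {y_tr.mean():.3f}')
print(f'Test prevalence: {y_te.mean():.3f}')
```

Without stratification, a small test set can end up with noticeably more or fewer readmissions than the training set, which makes the test metrics harder to interpret.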

In [None]:
id_col = 'encounter_id'
target_col = 'label_readmit_30d'

numeric_features = [
    'age',
    'sex_female',
    'sbp',
    'ldl_mg_dl',
    'smoke',
    'comorbidity_count',
    'prior_admissions_12m',
]
categorical_features = ['insurance_type', 'race_group']
feature_cols = numeric_features + categorical_features

X = df[feature_cols].copy()
y = df[target_col].astype(int).copy()
encounter_ids = df[id_col].copy()

X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X,
    y,
    encounter_ids,
    test_size=0.25,
    random_state=42,
    stratify=y,
)

print(f'Train rows: {len(X_train)}')
print(f'Test rows: {len(X_test)}')
print(f'Train prevalence: {y_train.mean():.3f}')
print(f'Test prevalence: {y_test.mean():.3f}')


## Section 2: Train Baseline Models
Model A: Logistic Regression (linear, interpretable coefficients).

Model B: Decision Tree (rule-like, nonlinear splits).
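
To illustrate what "linear" means for Model A, the sketch below fits a logistic regression on a tiny made-up dataset and rebuilds its predicted probabilities by hand from the intercept and coefficient. The data and the names `X_toy`, `y_toy`, and `toy_lr` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: one feature, six patients.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

toy_lr = LogisticRegression().fit(X_toy, y_toy)

# Linear score (log-odds): intercept + coefficient * feature.
log_odds = toy_lr.intercept_[0] + X_toy[:, 0] * toy_lr.coef_[0, 0]

# The sigmoid maps log-odds to a probability between 0 and 1.
manual_proba = 1.0 / (1.0 + np.exp(-log_odds))

sklearn_proba = toy_lr.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

A decision tree has no such single formula; it assigns each patient to a leaf through a sequence of threshold splits, which is why it can capture nonlinear patterns.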

In [None]:
lr_preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

tree_preprocess = ColumnTransformer([
    ('num', 'passthrough', numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

lr_model = Pipeline([
    ('prep', lr_preprocess),
    ('clf', LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)),
])

tree_model = Pipeline([
    ('prep', tree_preprocess),
    ('clf', DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)),
])

lr_model.fit(X_train, y_train)
tree_model.fit(X_train, y_train)

lr_proba = lr_model.predict_proba(X_test)[:, 1]
tree_proba = tree_model.predict_proba(X_test)[:, 1]


In [None]:
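def metrics_from_proba(y_true, proba, threshold=0.5):
    """Return threshold-based metrics plus ROC AUC for predicted probabilities."""
    # Flag a patient as predicted-readmit when the probability meets the threshold.
    pred = (proba >= threshold).astype(int)
    # With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()

    # Return NaN instead of dividing by zero when a denominator is empty.
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else np.nan
    precision = tp / (tp + fp) if (tp + fp) else np.nan
    recall = tp / (tp + fn) if (tp + fn) else np.nan
    specificity = tn / (tn + fp) if (tn + fp) else np.nan
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else np.nan
    # ROC AUC uses the raw probabilities, so it does not depend on the threshold.
    auc = roc_auc_score(y_true, proba)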

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall_sensitivity': recall,
        'specificity': specificity,
        'f1': f1,
        'roc_auc': auc,
        'tp': tp,
        'tn': tn,
        'fp': fp,
        'fn': fn,
    }

lr_metrics = metrics_from_proba(y_test, lr_proba, threshold=0.5)
tree_metrics = metrics_from_proba(y_test, tree_proba, threshold=0.5)

comparison = pd.DataFrame([lr_metrics, tree_metrics], index=['Logistic Regression', 'Decision Tree'])
display(comparison[['accuracy', 'precision', 'recall_sensitivity', 'specificity', 'f1', 'roc_auc']].round(3))


## Section 3: Confusion Matrix View at Threshold 0.5
Confusion counts make the clinical trade-offs concrete.
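
As a small worked example, the sketch below computes the four counts for ten hypothetical patients and derives the metrics from them. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for ten hypothetical patients.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of true readmissions that were flagged
specificity = tn / (tn + fp)   # share of non-readmissions correctly left unflagged

print(f'TP={tp} TN={tn} FP={fp} FN={fn}')
print(f'accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}')
```

Here accuracy is 0.70 even though half of the readmitted patients were missed, which is why the counts are worth reading alongside the summary metrics.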

In [None]:
def plot_confusion_counts(metrics, title):
    counts = pd.Series({
        'TP': metrics['tp'],
        'TN': metrics['tn'],
        'FP': metrics['fp'],
        'FN': metrics['fn'],
    })
    fig, ax = plt.subplots(figsize=(5.8, 3.3))
    ax.bar(counts.index, counts.values, color=['#54a24b', '#4c78a8', '#f58518', '#e45756'])
    ax.set_title(title)
    ax.set_ylabel('Patients')
    plt.tight_layout()
    plt.show()

plot_confusion_counts(lr_metrics, 'Logistic Regression @ 0.50 threshold')
plot_confusion_counts(tree_metrics, 'Decision Tree @ 0.50 threshold')


## Section 4: Interactive Threshold Tuning (Logistic Regression)
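Lowering the threshold flags more patients. Sensitivity can only stay the same or rise (more true readmissions are caught), while false positives can only stay the same or rise, so specificity can only stay the same or fall.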
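
The trade-off can be sketched without any library, using eight made-up probabilities and labels (illustrative values, not model output):

```python
# Made-up predicted probabilities and true labels for eight hypothetical patients.
proba_toy = [0.15, 0.35, 0.55, 0.75, 0.30, 0.60, 0.20, 0.80]
actual_toy = [0, 0, 1, 1, 1, 0, 0, 1]


def sens_spec(threshold):
    """Return (sensitivity, specificity) for the toy data at a given threshold."""
    pred = [int(p >= threshold) for p in proba_toy]
    tp = sum(1 for a, p in zip(actual_toy, pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual_toy, pred) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual_toy, pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual_toy, pred) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Sweep from a strict threshold to a lenient one.
results = {t: sens_spec(t) for t in (0.7, 0.5, 0.3)}
for t, (sens, spec) in results.items():
    print(f'threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}')
```

On this toy data, sensitivity climbs from 0.50 to 1.00 as the threshold drops from 0.7 to 0.3, while specificity falls from 1.00 to 0.50.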

In [None]:
def threshold_demo(threshold=0.50):
    m = metrics_from_proba(y_test, lr_proba, threshold=threshold)
    print(f'Threshold: {threshold:.2f}')
    print(
        f"Accuracy: {m['accuracy']:.3f} | Precision: {m['precision']:.3f} | "
        f"Sensitivity: {m['recall_sensitivity']:.3f} | Specificity: {m['specificity']:.3f}"
    )

    fig, ax = plt.subplots(figsize=(5.8, 3.3))
    labels = ['TP', 'TN', 'FP', 'FN']
    values = [m['tp'], m['tn'], m['fp'], m['fn']]
    colors = ['#54a24b', '#4c78a8', '#f58518', '#e45756']
    ax.bar(labels, values, color=colors)
    ax.set_title('Logistic Regression Confusion Counts')
    ax.set_ylabel('Patients')
    plt.tight_layout()
    plt.show()

widgets.interact(
    threshold_demo,
    threshold=widgets.FloatSlider(
        value=0.50, min=0.10, max=0.90, step=0.05, description='Threshold', continuous_update=False
    ),
)


## Section 5: Model Interpretation
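Logistic Regression coefficients show the direction of each feature's association with readmission: a positive coefficient raises the predicted log-odds and a negative one lowers them. The numeric features were standardized, so their coefficients are per one-standard-deviation change; the one-hot categorical columns were not standardized.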

Decision Trees show path-based logic.
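
One common way to read a logistic coefficient is to exponentiate it into an odds ratio. The sketch below uses hypothetical coefficient values chosen for illustration; they are not taken from the fitted model.

```python
import math

# Hypothetical coefficients on the log-odds scale (not from the fitted model).
example_coefs = {
    'prior_admissions_12m': 0.40,
    'age': 0.15,
    'ldl_mg_dl': -0.10,
}

# exp(coefficient) is the multiplicative change in the odds of readmission for a
# one-unit increase in the (standardized) feature, holding the other features fixed.
odds_ratios = {name: math.exp(coef) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.2f}')
```

An odds ratio above 1 means higher values of the feature go with higher odds of readmission; a ratio below 1 means the opposite. These are associations in the fitted model, not causal effects.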

In [None]:
lr_feature_names = lr_model.named_steps['prep'].get_feature_names_out()
lr_coefs = lr_model.named_steps['clf'].coef_[0]
coef_table = pd.DataFrame({'feature': lr_feature_names, 'coefficient': lr_coefs})
coef_table['abs_coefficient'] = coef_table['coefficient'].abs()
coef_table = coef_table.sort_values('abs_coefficient', ascending=False)

display(coef_table[['feature', 'coefficient']].head(12).round(3))

top10 = coef_table.head(10).iloc[::-1]
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(top10['feature'], top10['coefficient'], color='#4c78a8')
ax.axvline(0, color='black', linewidth=1)
ax.set_title('Top Logistic Coefficients (numeric features standardized)')
plt.tight_layout()
plt.show()


In [None]:
tree_feature_names = tree_model.named_steps['prep'].get_feature_names_out()
fig, ax = plt.subplots(figsize=(15, 6))
plot_tree(
    tree_model.named_steps['clf'],
    feature_names=tree_feature_names,
    class_names=['No readmit', 'Readmit'],
    filled=True,
    max_depth=2,
    ax=ax,
)
ax.set_title('Decision Tree (first 2 levels)')
plt.tight_layout()
plt.show()


## Section 6: Error Analysis - Who Did We Miss?
Focus on false negatives because missed high-risk patients may not receive interventions.
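
A simple starting point is to compare missed readmissions with the ones the model caught. The sketch below uses a made-up six-row table; the column names `actual` and `pred` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up test-set rows for six hypothetical patients.
toy = pd.DataFrame({
    'actual': [1, 1, 1, 1, 0, 0],
    'pred': [1, 1, 0, 0, 0, 1],
    'age': [70, 80, 50, 60, 40, 65],
    'prior_admissions_12m': [3, 4, 0, 1, 0, 2],
})

# Keep only patients who were actually readmitted, then label each as caught or missed.
readmitted = toy[toy['actual'] == 1].copy()
readmitted['outcome'] = readmitted['pred'].map({1: 'true_positive', 0: 'false_negative'})

# Compare average feature values between the two groups.
summary = readmitted.groupby('outcome')[['age', 'prior_admissions_12m']].mean()
print(summary)
```

In this toy table the missed patients are younger and have fewer prior admissions than the caught ones. A pattern like that in real data would suggest the model leans heavily on those features.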

In [None]:
pred_df = X_test.copy()
pred_df['encounter_id'] = ids_test.values
pred_df['actual_readmit_30d'] = y_test.values
pred_df['lr_probability'] = lr_proba
pred_df['lr_pred_0_50'] = (lr_proba >= 0.50).astype(int)
pred_df['tree_probability'] = tree_proba
pred_df['tree_pred_0_50'] = (tree_proba >= 0.50).astype(int)

false_negatives = pred_df[(pred_df['actual_readmit_30d'] == 1) & (pred_df['lr_pred_0_50'] == 0)].copy()
false_negatives = false_negatives.sort_values('lr_probability', ascending=True)

print(f'False negatives (Logistic Regression @ 0.50): {len(false_negatives)}')
fn_cols = [
    'encounter_id', 'age', 'sbp', 'ldl_mg_dl', 'smoke',
    'comorbidity_count', 'prior_admissions_12m', 'lr_probability',
]
display(false_negatives[fn_cols].head(10).round(3))


## Section 7: Save Predictions for Module 4
Module 4 will go deeper into evaluation and threshold policy decisions.

In [None]:
out_path = DATA_PATH.parent / 'module_03_test_predictions.csv'
pred_cols = [
    'encounter_id', 'actual_readmit_30d',
    'lr_probability', 'lr_pred_0_50',
    'tree_probability', 'tree_pred_0_50',
]
pred_df[pred_cols].to_csv(out_path, index=False)

print(f'Saved test prediction file to {out_path}')
display(pred_df.head(8).round(3))


## Wrap-up: Key Takeaways
- You trained your first two baseline clinical prediction models.
- Accuracy alone is not enough in medicine; sensitivity/specificity trade-offs matter.
- Threshold policy can change who gets flagged for intervention.
- Error analysis helps identify where the model may fail clinically.
- In Module 4, we will formalize evaluation beyond a single threshold.