
# Model Evaluation Metrics

Goal: practice computing and interpreting common classification metrics, see threshold trade-offs, and learn which metric to use when. Uses scikit-learn with an imbalanced synthetic dataset (roughly 90% negative, 10% positive).

In [None]:
# Core
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ML
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, RocCurveDisplay, average_precision_score, PrecisionRecallDisplay,
    precision_recall_curve, auc, balanced_accuracy_score
)

# Reproducible synthetic, imbalanced data
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=6, n_redundant=3,
    weights=[0.9, 0.1], flip_y=0.01, class_sep=1.0, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Simple, robust baseline classifier
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
]).fit(X_train, y_train)

# Predictions & scores
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1] # for ROC/PR/thresholds

## 1) Accuracy (and why it can mislead on imbalance)

On imbalanced data, a model that always predicts the majority class can score high accuracy, so compare against that baseline and rely on more informative metrics.

In [None]:
acc = accuracy_score(y_test, y_pred)
base_rate = (y_test == 0).mean()  # majority-class baseline accuracy
print(f"Accuracy: {acc:.3f}  | Majority baseline: {base_rate:.3f}")
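
To see why the comparison against the majority baseline matters, here is a minimal worked sketch with made-up counts (900 negatives and 100 positives, not the notebook's data). A hypothetical classifier that always predicts the negative class reaches 90% accuracy while finding no positives at all.

```python
# Hypothetical labels: 900 negatives (0) and 100 positives (1).
y_true_toy = [0] * 900 + [1] * 100

# A "classifier" that always predicts the majority class.
y_always_negative = [0] * len(y_true_toy)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true_toy, y_always_negative))
accuracy_toy = correct / len(y_true_toy)

# Recall on the positive class: found positives / all positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true_toy, y_always_negative))
recall_toy = true_pos / sum(y_true_toy)

print(f"Accuracy: {accuracy_toy:.3f}  Recall: {recall_toy:.3f}")
```

A trained model is only useful if it beats this baseline on the metrics that reflect the positive class.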

## 2) Precision, Recall, F1 + Report

When to prefer:
* **Precision** when false positives are costly (e.g., auto-lock accounts).
* **Recall** when missing positives is costly (e.g., fraud/attack detection).
* **F1** when you need a single balance of precision & recall.
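
The three metrics above reduce to simple ratios of confusion-matrix counts. The sketch below works them out by hand on a small hypothetical label vector (not the notebook's data); the helper name `precision_recall_f1` is an assumption for illustration. On the real test split, `precision_score`, `recall_score`, `f1_score`, and `classification_report` from the imports cell compute the same quantities from `y_test` and `y_pred`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators (same idea as zero_division=0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical toy labels: 6 negatives followed by 4 positives.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# Report both classes, similar in spirit to a per-class report.
for cls in (0, 1):
    p, r, f = precision_recall_f1(y_true_toy, y_pred_toy, positive=cls)
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

For class 1 the toy data has 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.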

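## 3) Confusion Matrix, Specificity & Sensitivity

Why useful: it shows where errors happen (FP vs. FN) and quantifies performance on each class separately (specificity for negatives, sensitivity for positives).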

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=[0,1])
disp = ConfusionMatrixDisplay(cm, display_labels=["Negative (0)","Positive (1)"])
disp.plot(values_format="d"); plt.title("Confusion Matrix"); plt.show()

tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)        # True Negative Rate
sensitivity = tp / (tp + fn)        # Recall
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Specificity: {specificity:.3f}  Sensitivity/Recall: {sensitivity:.3f}  Balanced Acc: {bal_acc:.3f}")
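
For a binary problem, balanced accuracy is the mean of sensitivity and specificity, which is why it does not reward ignoring the minority class. A minimal worked sketch with hypothetical counts (not taken from the notebook's output):

```python
# Hypothetical confusion-matrix counts for 900 negatives and 100 positives.
tn_toy, fp_toy, fn_toy, tp_toy = 850, 50, 40, 60

specificity_toy = tn_toy / (tn_toy + fp_toy)   # negatives correctly rejected
sensitivity_toy = tp_toy / (tp_toy + fn_toy)   # positives correctly found

accuracy_toy = (tp_toy + tn_toy) / (tn_toy + fp_toy + fn_toy + tp_toy)
balanced_accuracy_toy = (sensitivity_toy + specificity_toy) / 2

print(f"Accuracy: {accuracy_toy:.3f}  "
      f"Balanced accuracy: {balanced_accuracy_toy:.3f}")
```

Plain accuracy is dominated by the 900 negatives, while balanced accuracy gives the 100 positives equal weight and so comes out noticeably lower here.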


## 4) ROC Curve & ROC-AUC (ranking quality across thresholds)

Use when: class imbalance is moderate and you care about how well the model ranks positives above negatives across all thresholds, rather than about one specific cutoff.

In [None]:
auc_roc = roc_auc_score(y_test, y_proba)
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title(f"ROC Curve (AUC = {auc_roc:.3f})"); plt.show()
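
ROC-AUC has a direct ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting half. The sketch below checks that idea by brute force on hypothetical scores; `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities, split by true class.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ordered correctly (only 0.4 vs 0.7 is wrong).
print(f"Pairwise AUC: {pairwise_auc(pos_scores, neg_scores):.3f}")
```

The quadratic loop is only meant to expose the definition; on real data `roc_auc_score` is the practical choice.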

## 5) Precision-Recall Curve & Average Precision (AP)

Use when: the positive class is rare. Precision and recall both ignore true negatives, so the PR curve is not flattered by a large negative class the way a ROC curve can be.

In [None]:
ap = average_precision_score(y_test, y_proba)
PrecisionRecallDisplay.from_predictions(y_test, y_proba)
plt.title(f"Precision-Recall Curve (AP = {ap:.3f})"); plt.show()
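
Average precision summarizes the PR curve as a sum of precision values weighted by the increase in recall at each threshold. The sketch below implements that sum by hand on hypothetical data, under the simplifying assumption that all scores are distinct; `average_precision` is an illustrative helper name.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (recall gain) * (precision at that rank).

    Assumes all scores are distinct, so each rank is its own threshold.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Recall rises by 1/n_pos at each positive; precision is hits/rank.
            ap += (hits / rank) / n_pos
    return ap

# Hypothetical labels and scores, already listed from highest to lowest score.
y_true_toy = [1, 0, 1, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

# Positives sit at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4.
print(f"AP: {average_precision(y_true_toy, scores_toy):.3f}")
```

Because only the ranks of the positives enter the sum, AP rewards models that place positives near the top of the list.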


## 6) Threshold Tuning (trade precision vs recall)

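Moving the threshold trades false positives against false negatives: lowering it raises recall and usually lowers precision, and raising it does the opposite. Choose the operating point that matches the business cost of each error type.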

In [None]:
thr_list = np.linspace(0.1, 0.9, 17)
rows = []
for thr in thr_list:
    y_hat = (y_proba >= thr).astype(int)
    rows.append({
        "threshold": thr,
        "precision": precision_score(y_test, y_hat, zero_division=0),
        "recall": recall_score(y_test, y_hat, zero_division=0),
        "f1": f1_score(y_test, y_hat, zero_division=0),
        "balanced_acc": balanced_accuracy_score(y_test, y_hat)
    })
thr_df = pd.DataFrame(rows)
thr_df.head()


In [None]:
fig, ax = plt.subplots(figsize=(7,4))
ax.plot(thr_df["threshold"], thr_df["precision"], label="Precision")
ax.plot(thr_df["threshold"], thr_df["recall"],    label="Recall")
ax.plot(thr_df["threshold"], thr_df["f1"],        label="F1")
ax.set_xlabel("Decision Threshold"); ax.set_ylabel("Score"); ax.grid(True); ax.legend(); plt.show()

# Pick operating point by business goal: e.g., target recall >= 0.85 with best precision
target_recall = 0.85
candidates = thr_df[thr_df["recall"] >= target_recall].sort_values("precision", ascending=False)
candidates.head(3)
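
Another way to pick an operating point is to assign a cost to each error type and choose the threshold with the lowest total cost. The sketch below assumes hypothetical costs (a missed positive is five times as expensive as a false alarm) and toy scores; `total_cost`, `COST_FP`, and `COST_FN` are illustrative names, not part of the notebook.

```python
COST_FP = 1.0  # hypothetical cost of one false alarm
COST_FN = 5.0  # hypothetical cost of one missed positive

def total_cost(y_true, scores, threshold):
    """Total error cost when predicting positive for score >= threshold."""
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, scores))
    return COST_FP * fp + COST_FN * fn

# Hypothetical labels and predicted probabilities.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores_toy = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# Evaluate a few candidate thresholds and keep the cheapest one.
costs = {thr: total_cost(y_true_toy, scores_toy, thr) for thr in (0.3, 0.5, 0.7)}
best_threshold = min(costs, key=costs.get)
print(costs, "-> best:", best_threshold)
```

With these costs the lowest threshold wins because it misses no positives; if both error types cost the same, the same data would favor the highest threshold instead.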


## 7) Macro/Micro Averaging (quick note + example)

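Macro averaging computes the metric per class and takes the unweighted mean, so every class counts equally regardless of size. Micro averaging pools the TP/FP/FN counts across classes before computing the metric, so large classes dominate. The distinction matters most for multi-class or highly imbalanced binary tasks.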

In [None]:
# Quick illustration: compute macro/micro F1 via cross_validate (binary still OK)
scores = cross_validate(
    clf, X_train, y_train, cv=5,
    scoring={"f1_macro":"f1_macro", "f1_micro":"f1_micro", "f1":"f1"}
)
{m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std()) for m in ["f1_macro","f1_micro","f1"]}
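
To make the difference concrete, the sketch below computes macro and micro F1 by hand from hypothetical binary confusion counts (not the notebook's results); `f1_from_counts` is an illustrative helper. It also shows that, for single-label classification, micro F1 coincides with accuracy.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical binary confusion counts (positive class = 1).
tn_toy, fp_toy, fn_toy, tp_toy = 850, 50, 40, 60

# Per-class counts as (TP, FP, FN), treating each class in turn as "positive".
per_class = {
    1: (tp_toy, fp_toy, fn_toy),
    0: (tn_toy, fn_toy, fp_toy),  # for class 0, the roles of FP and FN swap
}

# Macro: average the per-class F1 scores with equal weight.
macro_f1 = sum(f1_from_counts(*c) for c in per_class.values()) / len(per_class)

# Micro: pool the counts across classes first, then compute F1 once.
pooled = [sum(c[i] for c in per_class.values()) for i in range(3)]
micro_f1 = f1_from_counts(*pooled)

accuracy_toy = (tp_toy + tn_toy) / (tn_toy + fp_toy + fn_toy + tp_toy)
print(f"macro F1: {macro_f1:.3f}  micro F1: {micro_f1:.3f}  "
      f"accuracy: {accuracy_toy:.3f}")
```

Macro F1 is pulled down by the weak minority-class score, while micro F1 matches accuracy and hides it.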


## 8) Multiple Metrics via cross_validate (one pass)

Tip: Report mean ± std over folds to communicate stability.

In [None]:
scorers = {
    "acc": "accuracy",
    "f1": "f1",
    "rocauc": "roc_auc",
    "avg_prec": "average_precision",
    "bal_acc": "balanced_accuracy"
}
cv = cross_validate(clf, X_train, y_train, cv=5, scoring=scorers, return_train_score=False)
pd.DataFrame(cv).agg(["mean","std"]).T
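
When reporting mean ± std, it helps to say which standard deviation is meant: pandas `.std()` defaults to the sample estimate (divides by n - 1), whereas NumPy's `.std()` defaults to the population form (divides by n). With only five folds the gap is visible. A minimal sketch using hypothetical fold scores (not real results) and the standard library:

```python
import statistics

# Hypothetical per-fold F1 scores from a 5-fold run (not real results).
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.79]

mean_score = statistics.mean(fold_scores)
sample_std = statistics.stdev(fold_scores)       # divides by n - 1
population_std = statistics.pstdev(fold_scores)  # divides by n

print(f"F1: {mean_score:.3f} ± {sample_std:.3f} (sample std over 5 folds)")
print(f"Population std for comparison: {population_std:.4f}")
```

Either convention is defensible as long as the report states which one was used.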


## 9) Putting It Together: Metric Selection

* Balanced data & equal error costs: accuracy, F1.
* Rare positives (imbalance): PR curve, Average Precision, recall/F1.
* Screening (don’t miss positives): high recall, monitor FP.
* Auto-action (avoid false alarms): high precision, accept lower recall.
* Ranking scenarios: ROC-AUC.
* Unequal class importance: balanced accuracy, specificity + sensitivity.

## Summary

You computed and interpreted: Accuracy, Precision, Recall, F1, Confusion Matrix, ROC-AUC, PR/AP, Balanced Accuracy, and performed threshold tuning & CV with multiple metrics. You can now select and justify the right metric for the business need and set an operating threshold that matches risk.