<a href="https://colab.research.google.com/github/dougyd92/ML-Foudations/blob/main/Notebooks/6_Classification_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 6: Classification Evaluation & Advanced Classification

This notebook covers:
1. **Confusion Matrix** — understanding prediction errors
2. **Classification Metrics** — precision, recall, F1, and beyond
3. **Threshold Selection & ROC/AUC** — tuning the decision boundary
4. **Precision-Recall Curves** — evaluation for imbalanced settings
5. **Handling Imbalanced Datasets** — stratified splits, class weights, SMOTE
6. **Multi-class Classification** — OvR, OvO, softmax

By the end of this notebook, you'll be able to train a classifier and rigorously evaluate its performance using the right metrics for your problem.

In [None]:
# ============================================================
# Setup — Run this cell first!
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap

from sklearn.datasets import load_breast_cancer, make_classification, load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    classification_report, accuracy_score,
    precision_score, recall_score, f1_score,
    roc_curve, auc, RocCurveDisplay,
    precision_recall_curve, average_precision_score, PrecisionRecallDisplay
)

# For SMOTE (install if needed)
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    SMOTE_AVAILABLE = True
except ImportError:
    !pip install imbalanced-learn -q
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    SMOTE_AVAILABLE = True

np.random.seed(42)

print("✅ All imports successful!")

---
# Section 1: The Confusion Matrix

Accuracy tells us the fraction of predictions we got right. But when classes are imbalanced, accuracy can be deeply misleading. A model that **always** predicts "healthy" on a dataset where 99% of patients are healthy gets 99% accuracy — while missing every single sick patient.

The **confusion matrix** breaks down predictions into four categories, giving us a much richer picture of model performance.

## Building Intuition: A Simple Classifier

Let's train a logistic regression model on the **Breast Cancer Wisconsin** dataset and examine its confusion matrix. This dataset has two classes: **malignant** (positive) and **benign** (negative).

In [None]:
# Load and prepare the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Note: in this dataset, 1 = benign, 0 = malignant
# Let's flip so that 1 = malignant (the class we want to detect)
y = 1 - y

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.sum(y == 0)} benign, {np.sum(y == 1)} malignant")
print(f"Malignant rate: {np.mean(y):.1%}")

In [None]:
# Train/test split and fit logistic regression
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42, max_iter=5000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]  # probabilities for positive class

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

## The Confusion Matrix

The confusion matrix is a 2×2 table that cross-tabulates **actual** labels (rows) against **predicted** labels (columns):

|  | Predicted Negative | Predicted Positive |
|---|---|---|
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |

- **True Positives (TP):** Correctly identified malignant tumors
- **True Negatives (TN):** Correctly identified benign tumors
- **False Positives (FP):** Benign tumors incorrectly flagged as malignant ("false alarms")
- **False Negatives (FN):** Malignant tumors missed by the model ("misses") ← *the dangerous ones*
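
Before applying this to the real data, here is a minimal sketch on ten made-up labels (the `*_toy` names and values are hypothetical, chosen only for illustration). It shows how scikit-learn lays out the 2×2 matrix and the order in which `ravel()` returns the four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (not from the breast cancer data): 1 = positive, 0 = negative
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Rows are actual classes and columns are predicted classes,
# so flattening row by row yields TN, FP, FN, TP
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()
print(f"TN={tn_toy}, FP={fp_toy}, FN={fn_toy}, TP={tp_toy}")
```

Six negatives split into 4 correct and 2 false alarms; four positives split into 3 caught and 1 missed.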

In [None]:
# Compute and display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix (raw counts):")
print(cm)
print()

# Extract the four components
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives  (TN): {tn}  — Correctly identified benign")
print(f"False Positives (FP): {fp}  — Benign incorrectly flagged as malignant")
print(f"False Negatives (FN): {fn}  — Malignant MISSED by the model")
print(f"True Positives  (TP): {tp}  — Correctly identified malignant")

In [None]:
# Visualize with ConfusionMatrixDisplay
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Raw counts
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=["Benign", "Malignant"],
    cmap="Blues", ax=axes[0]
)
axes[0].set_title("Confusion Matrix (Counts)")

# Normalized by true class (each row sums to 1)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=["Benign", "Malignant"],
    normalize='true', cmap="Blues", ax=axes[1],
    values_format='.2%'
)
axes[1].set_title("Confusion Matrix (Normalized by Actual)")

plt.tight_layout()
plt.show()

---
## ✏️ Exercise 1: Confusion Matrix Interpretation

Use the confusion matrix from our breast cancer model above to answer these questions.

**Task 1:** Calculate the accuracy manually from TP, TN, FP, FN. Verify it matches `accuracy_score`.

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
print(f"Manual accuracy:  {accuracy_manual:.3f}")
print(f"sklearn accuracy: {accuracy_score(y_test, y_pred):.3f}")

**Task 2:** If this model were deployed as a cancer screening tool, which type of error (FP or FN) would be more dangerous? Calculate the **false negative rate** (FN / total actual positives).

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
fnr = fn / (fn + tp)
fpr_val = fp / (fp + tn)
print(f"False Negative Rate: {fnr:.3f} ({fnr:.1%} of malignant tumors missed)")
print(f"False Positive Rate: {fpr_val:.3f} ({fpr_val:.1%} of benign flagged)")
print()
print("In cancer screening, FN is usually the more dangerous error: a missed malignant tumor")
print("could mean delayed treatment. FP typically leads to follow-up testing such as a biopsy,")
print("which is stressful and costly but far less harmful than a missed cancer.")

**Task 3:** Create a "dummy" model that always predicts benign (class 0). Compute its confusion matrix and accuracy. What does this tell you about accuracy as a metric?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
y_pred_dummy = np.zeros_like(y_test)
print(f"Dummy model accuracy: {accuracy_score(y_test, y_pred_dummy):.3f}")
print()
cm_dummy = confusion_matrix(y_test, y_pred_dummy)
print("Confusion Matrix (dummy):")
print(cm_dummy)
print()
tn_d, fp_d, fn_d, tp_d = cm_dummy.ravel()
print(f"TP={tp_d}, FP={fp_d}, FN={fn_d}, TN={tn_d}")
print()
print("The dummy model gets decent accuracy by predicting the majority class,")
print("but it misses EVERY malignant tumor (TP=0, FN=all positives).")
print("This is why accuracy alone is insufficient!")

---
# Section 2: Classification Metrics

Now that we understand the confusion matrix, we can derive more informative metrics. Each one answers a different question about model performance.

## Precision and Recall

**Precision** answers: *"When the model predicts positive, how often is it right?"*

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall** (Sensitivity) answers: *"Of all actual positives, how many did the model catch?"*

$$\text{Recall} = \frac{TP}{TP + FN}$$

These metrics capture fundamentally different concerns:
- **High precision** → few false alarms (important for spam filtering)
- **High recall** → few missed cases (important for disease detection)
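
As a quick worked example, the two formulas can be evaluated by hand from a set of counts and cross-checked against scikit-learn. The counts below are hypothetical and the `*_ex` names are ours, not part of the notebook's model.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical counts for illustration only
tp_ex, fp_ex, fn_ex, tn_ex = 3, 2, 1, 4

precision_ex = tp_ex / (tp_ex + fp_ex)  # 3 of the 5 positive predictions are right
recall_ex = tp_ex / (tp_ex + fn_ex)     # 3 of the 4 actual positives are caught
print(f"Precision = {precision_ex:.2f}, Recall = {recall_ex:.2f}")

# Rebuild label vectors with exactly those counts and compare with scikit-learn
y_true_ex = [1] * tp_ex + [0] * fp_ex + [1] * fn_ex + [0] * tn_ex
y_pred_ex = [1] * tp_ex + [1] * fp_ex + [0] * fn_ex + [0] * tn_ex
print(f"sklearn precision = {precision_score(y_true_ex, y_pred_ex):.2f}")
print(f"sklearn recall    = {recall_score(y_true_ex, y_pred_ex):.2f}")
```

Note that TN never appears in either formula: both metrics focus entirely on the positive class.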

In [None]:
# Compute precision and recall for our breast cancer model
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"  → When the model says 'malignant', it's correct {precision:.1%} of the time")
print()
print(f"Recall:    {recall:.3f}")
print(f"  → The model catches {recall:.1%} of all malignant tumors")

## F1 Score

The **F1 score** is the harmonic mean of precision and recall, providing a single balanced metric:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Why the **harmonic** mean instead of the arithmetic mean? Because the harmonic mean is pulled toward the smaller of the two values, so it penalizes a large gap between precision and recall. If precision = 1.0 and recall = 0.0, the arithmetic mean would be 0.5, but the F1 is 0.0, which correctly reflects a useless model.
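
To see the effect numerically, the sketch below compares the two means for a few hypothetical precision/recall pairs. The helper `f1_from` is our own illustrative function, not a scikit-learn API.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs, from very lopsided to balanced
pairs = [(1.0, 0.0), (0.9, 0.1), (0.9, 0.8), (0.5, 0.5)]

for p_ex, r_ex in pairs:
    arith_ex = (p_ex + r_ex) / 2
    print(f"P={p_ex:.1f}, R={r_ex:.1f} -> arithmetic={arith_ex:.2f}, F1={f1_from(p_ex, r_ex):.2f}")
```

The two means agree only when precision equals recall; the more lopsided the pair, the further F1 falls below the arithmetic mean.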

In [None]:
# F1 score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
print()

# Compare harmonic vs arithmetic mean
arith_mean = (precision + recall) / 2
print(f"Arithmetic mean of P and R: {arith_mean:.3f}")
print(f"Harmonic mean (F1):         {f1:.3f}")
print("The harmonic mean is always ≤ the arithmetic mean.")

## The Classification Report

sklearn's `classification_report` gives you everything at a glance: precision, recall, F1, and support (count) for each class, plus macro and weighted averages.

In [None]:
# The all-in-one classification report
print(classification_report(y_test, y_pred, target_names=["Benign", "Malignant"]))

---
# Section 3: Threshold Selection & ROC/AUC

By default, logistic regression predicts class 1 when the predicted probability exceeds **0.5**. But this threshold is just a default — not a law of nature.

By adjusting the threshold, we can trade off between precision and recall to match our application's needs.

## How Threshold Affects Predictions

Let's see what happens when we change the decision threshold.

In [None]:
# Show how threshold changes predictions
thresholds = [0.3, 0.5, 0.7]

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, thresh in zip(axes, thresholds):
    y_pred_thresh = (y_proba >= thresh).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_thresh)
    ConfusionMatrixDisplay(cm_t, display_labels=["Benign", "Malignant"]).plot(
        ax=ax, cmap="Blues"
    )
    p = precision_score(y_test, y_pred_thresh, zero_division=0)
    r = recall_score(y_test, y_pred_thresh)
    ax.set_title(f"Threshold = {thresh}\nPrecision={p:.2f}, Recall={r:.2f}")

plt.tight_layout()
plt.show()

print("Notice: lower threshold → more positives → higher recall but usually lower precision")
print("        higher threshold → fewer positives → usually higher precision but lower recall")

## The Precision-Recall Trade-off

As we sweep the threshold from 0 to 1, recall can only fall or stay flat, while precision generally rises (it can dip locally, because it is not guaranteed to be monotonic). Let's visualize this.

In [None]:
# Precision and recall as a function of threshold
thresholds_range = np.linspace(0.01, 0.99, 200)
precisions = []
recalls = []

for t in thresholds_range:
    y_pred_t = (y_proba >= t).astype(int)
    precisions.append(precision_score(y_test, y_pred_t, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_t, zero_division=0))

plt.figure(figsize=(10, 6))
plt.plot(thresholds_range, precisions, label="Precision", linewidth=2)
plt.plot(thresholds_range, recalls, label="Recall", linewidth=2)
plt.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5, label="Default threshold (0.5)")
plt.xlabel("Threshold", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.title("Precision-Recall Trade-off vs Threshold", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## ROC Curve

The **ROC (Receiver Operating Characteristic) curve** plots the True Positive Rate (recall) against the False Positive Rate at every threshold simultaneously.

- **Perfect model:** curve passes through the top-left corner (0, 1)
- **Random model:** diagonal line from (0, 0) to (1, 1)
- **Better models:** curve bows further toward the top-left

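The **AUC (Area Under the Curve)** summarizes the entire ROC curve as a single number between 0 and 1: 0.5 corresponds to random guessing, 1.0 to a perfect ranking, and values below 0.5 mean the model ranks worse than chance (its scores are effectively inverted).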
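
One way to build intuition for AUC is that it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below checks this on simulated scores; the data are hypothetical (drawn from two normal distributions we picked for illustration), and `roc_auc_score` is scikit-learn's one-call shortcut for the area.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical scores: positives tend to score higher than negatives
scores_neg = rng.normal(loc=0.0, scale=1.0, size=200)
scores_pos = rng.normal(loc=1.0, scale=1.0, size=100)

# Compare every positive with every negative; ties count as half a win
diff = scores_pos[:, None] - scores_neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

y_true_sim = np.concatenate([np.zeros(200, dtype=int), np.ones(100, dtype=int)])
y_score_sim = np.concatenate([scores_neg, scores_pos])
sk_auc = roc_auc_score(y_true_sim, y_score_sim)

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {sk_auc:.4f}")

# Negating the scores reverses the ranking, which mirrors the AUC around 0.5
inverted_auc = roc_auc_score(y_true_sim, -y_score_sim)
print(f"AUC with inverted scores:  {inverted_auc:.4f}")
```

Because AUC depends only on how examples are ranked, it does not change if the scores are rescaled by any strictly increasing transformation.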

In [None]:
# Plot ROC curve
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 7))
plt.plot(fpr, tpr, color='steelblue', linewidth=2, label=f'Logistic Regression (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random Classifier (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.1, color='steelblue')

plt.xlabel("False Positive Rate", fontsize=12)
plt.ylabel("True Positive Rate (Recall)", fontsize=12)
plt.title("ROC Curve — Breast Cancer Classification", fontsize=14)
plt.legend(fontsize=11, loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"AUC = {roc_auc:.3f}")
print(f"Interpretation: there is a {roc_auc:.1%} chance that the model")
print("ranks a random malignant sample higher than a random benign sample.")

---
## ✏️ Exercise 2: Threshold Selection for Medical Screening

You're deploying the breast cancer model as a **screening tool**. The clinical team requires that **at least 95% of malignant tumors are detected** (recall ≥ 0.95). False positives are acceptable since they just lead to additional testing.

**Task 1:** Find the highest threshold that achieves at least 95% recall. *Hint:* use the `roc_thresholds` from the ROC curve computation above, along with the corresponding `tpr` values.

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
# Find the highest threshold where TPR >= 0.95
mask = tpr >= 0.95
valid_thresholds = roc_thresholds[mask]
best_threshold = valid_thresholds.max()  # highest threshold that still meets recall requirement

print(f"Best threshold for ≥95% recall: {best_threshold:.4f}")

# Verify
y_pred_clinical = (y_proba >= best_threshold).astype(int)
print(f"Recall at this threshold:    {recall_score(y_test, y_pred_clinical):.3f}")
print(f"Precision at this threshold: {precision_score(y_test, y_pred_clinical):.3f}")
print(f"Accuracy at this threshold:  {accuracy_score(y_test, y_pred_clinical):.3f}")

**Task 2:** Compare the confusion matrices at threshold=0.5 vs your clinical threshold side by side. How many additional false positives do we accept to meet the recall requirement?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Default threshold
y_pred_default = (y_proba >= 0.5).astype(int)
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred_default,
    display_labels=["Benign", "Malignant"],
    cmap="Blues", ax=axes[0]
)
axes[0].set_title("Default Threshold (0.5)")

# Clinical threshold
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred_clinical,
    display_labels=["Benign", "Malignant"],
    cmap="Blues", ax=axes[1]
)
axes[1].set_title(f"Clinical Threshold ({best_threshold:.3f})")

plt.tight_layout()
plt.show()

cm_default = confusion_matrix(y_test, y_pred_default)
cm_clinical = confusion_matrix(y_test, y_pred_clinical)
extra_fp = cm_clinical[0, 1] - cm_default[0, 1]
fewer_fn = cm_default[1, 0] - cm_clinical[1, 0]
print(f"Additional false positives accepted: {extra_fp}")
print(f"Additional malignant tumors caught:  {fewer_fn}")
print("This trade-off is worth it in a medical screening context!")

---
# Section 4: Precision-Recall Curves

ROC curves are widely used, but they can be **overly optimistic** when classes are imbalanced. When the negative class is much larger, even many false positives barely move the False Positive Rate.

**Precision-Recall (PR) curves** plot precision vs. recall at every threshold and are more informative for imbalanced problems.
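
A little arithmetic makes the point concrete. The scenario below is hypothetical (the counts are invented for illustration): 100 positives hidden among 10,000 negatives, with a model that catches 90 positives but also raises 500 false alarms.

```python
# Hypothetical screening scenario; all counts are made up for illustration
n_neg_scr, n_pos_scr = 10_000, 100
tp_scr, fp_scr = 90, 500
fn_scr = n_pos_scr - tp_scr
tn_scr = n_neg_scr - fp_scr

fpr_scr = fp_scr / (fp_scr + tn_scr)        # x-axis of the ROC curve
tpr_scr = tp_scr / (tp_scr + fn_scr)        # y-axis of the ROC curve (recall)
precision_scr = tp_scr / (tp_scr + fp_scr)  # y-axis of the PR curve

print(f"False Positive Rate: {fpr_scr:.1%}")
print(f"True Positive Rate:  {tpr_scr:.1%}")
print(f"Precision:           {precision_scr:.1%}")
```

On the ROC plot this operating point sits close to the top-left corner, yet most of the positive predictions are wrong. The large negative class dilutes the 500 false positives in the FPR denominator, while precision compares them directly with the 90 true positives.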

In [None]:
# Plot ROC and PR curves side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curve (manual plot for full control)
fpr_plot, tpr_plot, _ = roc_curve(y_test, y_proba)
roc_auc_plot = auc(fpr_plot, tpr_plot)
axes[0].plot(fpr_plot, tpr_plot, color='steelblue', linewidth=2, label=f'AUC = {roc_auc_plot:.3f}')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.3)
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].set_title("ROC Curve", fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve (manual plot for full control)
prec_plot, rec_plot, _ = precision_recall_curve(y_test, y_proba)
ap_plot = average_precision_score(y_test, y_proba)
axes[1].plot(rec_plot, prec_plot, color='coral', linewidth=2, label=f'AP = {ap_plot:.3f}')
prevalence = np.mean(y_test)
axes[1].axhline(y=prevalence, color='gray', linestyle='--', alpha=0.5, label=f'Baseline (prevalence={prevalence:.2f})')
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision-Recall Curve", fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

ap = average_precision_score(y_test, y_proba)
print(f"Average Precision (AP): {ap:.3f}")
print(f"ROC AUC:                {roc_auc:.3f}")
print()
print("For this relatively balanced dataset, both curves look strong.")
print("The difference becomes dramatic with highly imbalanced data (next section).")

---
# Section 5: Handling Imbalanced Datasets

Many real-world classification problems are imbalanced: fraud detection (0.1% fraud), disease screening (1-5% positive), manufacturing defects (<1%), etc.

Let's create a deliberately imbalanced dataset and see how different techniques affect performance.

In [None]:
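# Create an imbalanced dataset (roughly 5% positive class; flip_y adds label
# noise, so the observed rate can differ slightly from the requested weights)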
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_redundant=5, weights=[0.95, 0.05],
    flip_y=0.02, random_state=42
)

print(f"Class distribution: {np.sum(y_imb == 0)} negative, {np.sum(y_imb == 1)} positive")
print(f"Positive rate: {np.mean(y_imb):.1%}")

## The Problem: Naive Approach

Let's train a standard logistic regression on this imbalanced data and see what happens.

In [None]:
# Stratified split — always use stratify for imbalanced data!
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

scaler_imb = StandardScaler()
X_train_imb_s = scaler_imb.fit_transform(X_train_imb)
X_test_imb_s = scaler_imb.transform(X_test_imb)

# Train without any imbalance handling
model_naive = LogisticRegression(random_state=42, max_iter=1000)
model_naive.fit(X_train_imb_s, y_train_imb)
y_pred_naive = model_naive.predict(X_test_imb_s)

print("=== Naive Logistic Regression (no imbalance handling) ===")
print(f"Accuracy: {accuracy_score(y_test_imb, y_pred_naive):.3f}")
print()
print(classification_report(y_test_imb, y_pred_naive, target_names=["Negative", "Positive"]))
print("Notice: high accuracy, but look at recall for the Positive class!")

## Fix 1: Class Weights

The simplest approach — tell the model that errors on the minority class cost more.
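
To make "cost more" concrete, the sketch below shows how scikit-learn derives the weights for `class_weight='balanced'`: each class gets `n_samples / (n_classes * count_of_that_class)`. The label vector `y_demo` is a hypothetical 95/5 split used only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 95 negatives, 5 positives
y_demo = np.array([0] * 95 + [1] * 5)
classes_demo = np.unique(y_demo)

# The 'balanced' heuristic, computed by hand
manual_weights = len(y_demo) / (len(classes_demo) * np.bincount(y_demo))

# The same weights from scikit-learn's helper
sk_weights = compute_class_weight(class_weight="balanced", classes=classes_demo, y=y_demo)

for cls, w_manual, w_sk in zip(classes_demo, manual_weights, sk_weights):
    print(f"class {cls}: manual weight = {w_manual:.3f}, sklearn weight = {w_sk:.3f}")
```

With a 95/5 split, a mistake on a minority sample counts 19 times as much as a mistake on a majority sample in the training loss.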

In [None]:
# Train with balanced class weights
model_weighted = LogisticRegression(
    class_weight='balanced', random_state=42, max_iter=1000
)
model_weighted.fit(X_train_imb_s, y_train_imb)
y_pred_weighted = model_weighted.predict(X_test_imb_s)

print("=== Logistic Regression with class_weight='balanced' ===")
print(f"Accuracy: {accuracy_score(y_test_imb, y_pred_weighted):.3f}")
print()
print(classification_report(y_test_imb, y_pred_weighted, target_names=["Negative", "Positive"]))
print("Accuracy dropped, but recall for the Positive class improved significantly!")

## Fix 2: SMOTE (Synthetic Minority Oversampling)

SMOTE creates **new synthetic samples** by interpolating between existing minority class examples. It picks two nearby minority points and creates a new point somewhere on the line between them.
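
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not imbalanced-learn's implementation: the helper `smote_like_sample` and the points in `X_min` are hypothetical, and the sketch assumes no duplicate points.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2D
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.8], [4.0, 4.0]])

def smote_like_sample(X_minority, k, rng):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbors."""
    i = rng.integers(len(X_minority))
    dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

synthetic = np.array([smote_like_sample(X_min, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment joining two real minority points, so SMOTE fills in the region the minority class already occupies rather than inventing samples far from it.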

In [None]:
# Apply SMOTE to the training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_imb_s, y_train_imb)

print(f"Before SMOTE: {np.sum(y_train_imb == 0)} negative, {np.sum(y_train_imb == 1)} positive")
print(f"After SMOTE:  {np.sum(y_train_smote == 0)} negative, {np.sum(y_train_smote == 1)} positive")
print()

# Train on SMOTE-resampled data
model_smote = LogisticRegression(random_state=42, max_iter=1000)
model_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = model_smote.predict(X_test_imb_s)

print("=== Logistic Regression with SMOTE ===")
print(f"Accuracy: {accuracy_score(y_test_imb, y_pred_smote):.3f}")
print()
print(classification_report(y_test_imb, y_pred_smote, target_names=["Negative", "Positive"]))

## Comparing All Three Approaches

Let's compare the ROC and PR curves side by side for all three models.

In [None]:
# Get probabilities from all three models
y_proba_naive = model_naive.predict_proba(X_test_imb_s)[:, 1]
y_proba_weighted = model_weighted.predict_proba(X_test_imb_s)[:, 1]
y_proba_smote = model_smote.predict_proba(X_test_imb_s)[:, 1]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curves
for name, proba, color in [
    ("Naive", y_proba_naive, "gray"),
    ("Class Weights", y_proba_weighted, "steelblue"),
    ("SMOTE", y_proba_smote, "coral"),
]:
    fpr_i, tpr_i, _ = roc_curve(y_test_imb, proba)
    auc_i = auc(fpr_i, tpr_i)
    axes[0].plot(fpr_i, tpr_i, label=f"{name} (AUC={auc_i:.3f})", linewidth=2, color=color)

axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.3)
axes[0].set_title("ROC Curves", fontsize=14)
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# PR Curves
for name, proba, color in [
    ("Naive", y_proba_naive, "gray"),
    ("Class Weights", y_proba_weighted, "steelblue"),
    ("SMOTE", y_proba_smote, "coral"),
]:
    prec_i, rec_i, _ = precision_recall_curve(y_test_imb, proba)
    ap_i = average_precision_score(y_test_imb, proba)
    axes[1].plot(rec_i, prec_i, label=f"{name} (AP={ap_i:.3f})", linewidth=2, color=color)

prevalence_imb = np.mean(y_test_imb)
axes[1].axhline(y=prevalence_imb, color='gray', linestyle='--', alpha=0.5, label=f'Baseline ({prevalence_imb:.2f})')
axes[1].set_title("Precision-Recall Curves", fontsize=14)
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key observation: The ROC curves look similar for all three models,")
print("but the PR curves reveal much bigger differences in the imbalanced setting.")
print("This is why PR curves are preferred for imbalanced data!")

---
## ✏️ Exercise 3: Imbalanced Data Strategies

Using the imbalanced dataset from above, experiment with stratified cross-validation and compare approaches more rigorously.

**Task 1:** Use `StratifiedKFold` with 5 folds to compute the mean cross-validated F1 score for the **naive** model (no class weights). Then do the same for the **class-weighted** model. Which performs better?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
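from sklearn.pipeline import make_pipeline

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scale inside a pipeline so each fold fits the scaler on its own training split.
# Scaling all of X_imb up front would leak validation-fold statistics into training
# and would also refit scaler_imb, which was fitted on the training set earlier.

# Naive model
scores_naive = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=1000)),
    X_imb, y_imb,
    cv=skf, scoring='f1'
)

# Weighted model
scores_weighted = cross_val_score(
    make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
    ),
    X_imb, y_imb,
    cv=skf, scoring='f1'
)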

print(f"Naive model    — Mean F1: {scores_naive.mean():.3f} ± {scores_naive.std():.3f}")
print(f"Weighted model — Mean F1: {scores_weighted.mean():.3f} ± {scores_weighted.std():.3f}")
print()
print("The weighted model has a substantially higher F1 on the minority class.")

**Task 2:** Repeat the cross-validation, but this time use `scoring='recall'`. Why might a medical team prefer to optimize for recall over F1?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
from sklearn.pipeline import make_pipeline

# As before, scale inside a pipeline so the scaler is fitted per training fold
scores_naive_recall = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=1000)),
    X_imb, y_imb,
    cv=skf, scoring='recall'
)

scores_weighted_recall = cross_val_score(
    make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
    ),
    X_imb, y_imb,
    cv=skf, scoring='recall'
)

print(f"Naive model    — Mean Recall: {scores_naive_recall.mean():.3f} ± {scores_naive_recall.std():.3f}")
print(f"Weighted model — Mean Recall: {scores_weighted_recall.mean():.3f} ± {scores_weighted_recall.std():.3f}")
print()
print("A medical team might optimize for recall because missing a positive case")
print("(false negative) is far more costly than a false alarm (false positive).")
print("F1 balances precision and recall equally, but in medicine, the costs are not equal.")

---
# Section 6: Multi-class Classification

So far we've focused on **binary** classification (two classes). Many real-world problems have **more than two classes**: digit recognition (0–9), species identification, document categorization, etc.

There are three main strategies for extending binary classifiers to multi-class problems:
1. **One-vs-Rest (OvR):** Train K binary classifiers, one per class
2. **One-vs-One (OvO):** Train K(K−1)/2 classifiers, one per pair
3. **Softmax (Multinomial):** Natively predict all K classes at once
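
The three strategies can be sketched numerically before we fit anything. The helpers `n_classifiers` and `softmax` below are our own illustrative functions (not scikit-learn APIs), and the raw scores are made up.

```python
import numpy as np

def n_classifiers(k):
    """Number of binary models each decomposition strategy trains for k classes."""
    return {"ovr": k, "ovo": k * (k - 1) // 2}

def softmax(scores):
    """Turn K raw scores into K probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

print("K=3  (e.g. Iris):  ", n_classifiers(3))
print("K=10 (e.g. digits):", n_classifiers(10))

# Hypothetical raw scores for one sample across three classes
probs_demo = softmax(np.array([2.0, 1.0, 0.1]))
print("Softmax probabilities:", np.round(probs_demo, 3))
print("Sum:", probs_demo.sum(), "| predicted class:", probs_demo.argmax())
```

OvO grows quadratically with K, although each of its models trains on only two classes' worth of data. Softmax instead fits a single model whose outputs are coupled through the shared normalization.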

## Demo: Multi-class with the Iris Dataset

The classic Iris dataset has 3 classes (setosa, versicolor, virginica) with 4 features. Let's compare OvR and Multinomial approaches.

In [None]:
# Load Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_tr_iris, X_te_iris, y_tr_iris, y_te_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_tr_iris_s = scaler_iris.fit_transform(X_tr_iris)
X_te_iris_s = scaler_iris.transform(X_te_iris)

# OvR strategy — use OneVsRestClassifier for explicit OvR
from sklearn.multiclass import OneVsRestClassifier

model_ovr = OneVsRestClassifier(
    LogisticRegression(max_iter=1000, random_state=42)
)
model_ovr.fit(X_tr_iris_s, y_tr_iris)

# Multinomial (softmax) strategy — default for LogisticRegression with lbfgs solver
model_multi = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
model_multi.fit(X_tr_iris_s, y_tr_iris)

print("=== One-vs-Rest (OvR) ===")
print(classification_report(y_te_iris, model_ovr.predict(X_te_iris_s),
                            target_names=iris.target_names))

print("\n=== Multinomial (Softmax) ===")
print(classification_report(y_te_iris, model_multi.predict(X_te_iris_s),
                            target_names=iris.target_names))

## Multi-class Confusion Matrix

The confusion matrix generalizes to K×K. The diagonal shows correct predictions; off-diagonal entries reveal which classes get confused with each other.

In [None]:
# Multi-class confusion matrix
y_pred_iris = model_multi.predict(X_te_iris_s)

fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_te_iris, y_pred_iris,
    display_labels=iris.target_names,
    cmap="Blues", ax=ax
)
ax.set_title("Multi-class Confusion Matrix — Iris (Softmax)", fontsize=14)
plt.tight_layout()
plt.show()

print("Off-diagonal entries show which species the model confuses.")
print("Setosa is perfectly separated; versicolor and virginica are occasionally swapped.")

## Multi-class Decision Boundaries (2D Visualization)

To visualize decision boundaries, we'll use only 2 features (petal length and petal width) so we can plot in 2D.

In [None]:
# Helper function to plot decision boundaries
def plot_decision_boundary_multiclass(X, y, model, scaler, feature_names, class_names, title, ax):
    h = 0.02  # grid step size, in original (unscaled) feature units

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    grid = np.c_[xx.ravel(), yy.ravel()]
    grid_s = scaler.transform(grid)
    Z = model.predict(grid_s).reshape(xx.shape)

    cmap_bg = ListedColormap(['#FFDDD2', '#DCEDC8', '#B3E5FC'])
    cmap_pts = ListedColormap(['#E63946', '#2D6A4F', '#1D3557'])

    ax.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_bg)
    for i, name in enumerate(class_names):
        mask = y == i
        ax.scatter(X[mask, 0], X[mask, 1], c=[cmap_pts.colors[i]],
                   label=name, edgecolors='k', s=50, alpha=0.8)

    ax.set_xlabel(feature_names[0], fontsize=11)
    ax.set_ylabel(feature_names[1], fontsize=11)
    ax.set_title(title, fontsize=13)
    ax.legend(fontsize=9)

# Use only petal length and petal width for 2D visualization
X_iris_2d = X_iris[:, 2:4]
feature_names_2d = [iris.feature_names[2], iris.feature_names[3]]

X_tr_2d, X_te_2d, y_tr_2d, y_te_2d = train_test_split(
    X_iris_2d, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

scaler_2d = StandardScaler()
X_tr_2d_s = scaler_2d.fit_transform(X_tr_2d)

# Train both strategies on 2D data
model_ovr_2d = OneVsRestClassifier(
    LogisticRegression(max_iter=1000, random_state=42)
)
model_ovr_2d.fit(X_tr_2d_s, y_tr_2d)

model_multi_2d = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
model_multi_2d.fit(X_tr_2d_s, y_tr_2d)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

plot_decision_boundary_multiclass(
    X_iris_2d, y_iris, model_ovr_2d, scaler_2d,
    feature_names_2d, iris.target_names, "OvR Decision Boundaries", axes[0]
)
plot_decision_boundary_multiclass(
    X_iris_2d, y_iris, model_multi_2d, scaler_2d,
    feature_names_2d, iris.target_names, "Multinomial (Softmax) Decision Boundaries", axes[1]
)

plt.tight_layout()
plt.show()

print("Both strategies produce linear decision boundaries.")
print("Differences are subtle here; they diverge more with complex, overlapping classes.")

---
## ✏️ Exercise 4: Multi-class Evaluation

Using the Iris dataset with all 4 features (already split and scaled above as `X_tr_iris_s` and `X_te_iris_s`), practice with multi-class metrics.

**Task 1:** Print the classification report for the multinomial model. Which class has the lowest F1 score? Why might that be?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
y_pred_multi = model_multi.predict(X_te_iris_s)
print(classification_report(y_te_iris, y_pred_multi, target_names=iris.target_names))
print("Setosa typically has a perfect F1 because it's linearly separable.")
print("Versicolor and virginica overlap more, so one of them usually has the lowest F1.")

**Task 2:** Compute the **macro-averaged** and **weighted-averaged** F1 scores manually from the per-class F1 scores. Verify they match what sklearn reports.

*Hint:* macro = simple mean; weighted = weighted by support (number of samples per class).

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
# Per-class F1 scores
f1_per_class = f1_score(y_te_iris, y_pred_multi, average=None)
support = np.array([np.sum(y_te_iris == c) for c in range(3)])

print("Per-class F1 scores:")
for i, name in enumerate(iris.target_names):
    print(f"  {name}: {f1_per_class[i]:.3f} (support={support[i]})")

# Macro average: simple mean
macro_manual = f1_per_class.mean()
macro_sklearn = f1_score(y_te_iris, y_pred_multi, average='macro')
print(f"\nMacro F1 (manual):  {macro_manual:.3f}")
print(f"Macro F1 (sklearn):  {macro_sklearn:.3f}")

# Weighted average: weighted by support
weighted_manual = np.average(f1_per_class, weights=support)
weighted_sklearn = f1_score(y_te_iris, y_pred_multi, average='weighted')
print(f"\nWeighted F1 (manual):  {weighted_manual:.3f}")
print(f"Weighted F1 (sklearn):  {weighted_sklearn:.3f}")

**Task 3:** Plot the multi-class confusion matrix for the OvR model. Compare it visually to the multinomial model's confusion matrix above. Are the same classes confused?

In [None]:
# Write your code here

In [None]:
#@title Click to reveal solution.
y_pred_ovr = model_ovr.predict(X_te_iris_s)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

ConfusionMatrixDisplay.from_predictions(
    y_te_iris, y_pred_ovr,
    display_labels=iris.target_names,
    cmap="Blues", ax=axes[0]
)
axes[0].set_title("OvR Confusion Matrix", fontsize=14)

ConfusionMatrixDisplay.from_predictions(
    y_te_iris, y_pred_multi,
    display_labels=iris.target_names,
    cmap="Blues", ax=axes[1]
)
axes[1].set_title("Multinomial (Softmax) Confusion Matrix", fontsize=14)

plt.tight_layout()
plt.show()

print("Both models tend to confuse versicolor ↔ virginica.")
print("Setosa is linearly separable and almost always perfectly classified.")

---
# Summary

## Key Takeaways

| Concept | What It Answers | When to Use |
|---|---|---|
| **Confusion Matrix** | What types of errors is the model making? | Always — it's the foundation |
| **Precision** | When model says positive, is it right? | False positives are costly (spam) |
| **Recall** | Did the model find all positives? | False negatives are costly (medical) |
| **F1 Score** | Balance of precision and recall | Need a single balanced metric |
| **ROC / AUC** | Overall discriminative ability | Comparing models, balanced data |
| **PR Curve / AP** | Performance on the positive class | Imbalanced datasets |
| **Class Weights** | Force model to attend to minority class | Simplest imbalance fix |
| **SMOTE** | Generate synthetic minority samples | When class weights aren't enough |
| **Stratified Splits** | Preserve class proportions | Always with imbalanced data |
| **OvR / OvO** | Extend binary classifiers to multi-class | Strategy depends on model |
| **Softmax** | Native multi-class probabilities | Logistic regression, neural networks |

## What's Next

**Session 7:** Decision Trees & Ensemble Methods — a fundamentally different approach to both classification and regression, followed by random forests and boosting.