# HAM10000 Melanoma Detection - Model Benchmarks

## Overview

This notebook provides a comprehensive analysis of all model benchmarks for the melanoma detection project. We compare:

1. **Traditional ML Baselines** (sklearn) - Logistic Regression, Random Forest, Gradient Boosting
2. **Deep Learning Teachers** - ResNet family (18/34/50/101/152) and EfficientNet (B0-B7)
3. **Knowledge Distilled Student** - MobileNetV3-Small trained via knowledge distillation

### Key Metrics
- **ROC-AUC**: Primary metric for imbalanced classification
- **F1 Score**: Balance of precision and recall
- **Sensitivity/Specificity**: Clinical relevance for melanoma screening
- **Model Size & Latency**: Deployment considerations
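
As a rough illustration of how these metrics relate to one another, the sketch below computes them in plain Python on a made-up toy example. The function names (`confusion_counts`, `threshold_metrics`, `roc_auc`) and the toy labels are illustrative assumptions; they are not the project's `src.evaluation.metrics` implementation.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for binary labels at a probability threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, precision and F1 at a single threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true, y_prob):
    """Threshold-free ROC-AUC: the probability that a random positive
    scores higher than a random negative (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 melanoma cases among 8 lesions
toy_true = [0, 0, 0, 0, 1, 0, 1, 1]
toy_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.35, 0.8, 0.9]
print(threshold_metrics(toy_true, toy_prob, threshold=0.5))
print(round(roc_auc(toy_true, toy_prob), 4))  # 14 of 15 pairs ranked correctly -> 0.9333
```

ROC-AUC does not depend on a threshold, whereas F1, sensitivity, and specificity all change with the chosen operating point.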

In [1]:
# =============================================================================
# Setup and Imports
# =============================================================================
import json
import sys
import time
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from IPython.display import Image, Markdown, display

# Add project root to path
PROJECT_DIR = Path.cwd().parent
sys.path.insert(0, str(PROJECT_DIR))

# Project imports
from src.config import (
    CHECKPOINTS_DIR,
    PROCESSED_DIR,
    DataConfig,
    StudentConfig,
    TeacherConfig,
    get_device,
)
from src.data.dataset import HAM10000Dataset, get_eval_transforms
from src.data.splits import load_or_create_splits
from src.evaluation.metrics import compute_deployment_metrics, evaluate_model
from src.models.architectures import StudentModel, TeacherModel
from src.plotting.benchmarks import (
    load_sklearn_results,
    load_student_checkpoints,
    load_teacher_checkpoints,
    plot_complete_model_comparison,
    plot_holdout_evaluation,
    plot_kd_effectiveness,
    plot_latency_benchmarks,
    plot_teacher_comparison,
    plot_threshold_curves,
)

# Configure display
pd.set_option('display.max_columns', 20)
pd.set_option('display.precision', 4)
plt.style.use('seaborn-v0_8-whitegrid')

# Project paths
ARTIFACTS_DIR = PROJECT_DIR / "artifacts" / "tbls" / "01_baselines"
IMGS_DIR = PROJECT_DIR / "artifacts" / "imgs" / "01_baselines"

print(f"Project Directory: {PROJECT_DIR}")
print(f"Checkpoints Directory: {CHECKPOINTS_DIR}")
print(f"Images Directory: {IMGS_DIR}")

Project Directory: /Users/ryan.healy/DS_MSDS/Deep_Learning_Final_Project
Checkpoints Directory: /Users/ryan.healy/DS_MSDS/Deep_Learning_Final_Project/models/checkpoints


---

## 1. Traditional ML Baselines (sklearn)

These models use handcrafted features extracted from images:
- **Color Histograms**: RGB, LAB, HSV color space statistics
- **HOG Features**: Histogram of Oriented Gradients for texture
- **GLCM Texture**: Gray-Level Co-occurrence Matrix features

Each model was evaluated with both default hyperparameters and tuned versions.
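
To make the texture features more concrete, here is a minimal pure-Python sketch of a gray-level co-occurrence matrix and two statistics commonly derived from it. The helper names (`glcm`, `glcm_contrast`, `glcm_homogeneity`), the horizontal offset, and the 4-level toy image are assumptions for illustration; the project's actual feature extraction is not shown in this notebook and may differ.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalised co-occurrence matrix for pixel pairs offset by (dx, dy).

    image  : 2-D list of integer gray levels in [0, levels)
    returns: levels x levels matrix of joint probabilities (sums to 1)
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def glcm_contrast(p):
    """Large when neighbouring pixels differ strongly in gray level."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


def glcm_homogeneity(p):
    """Close to 1 when neighbouring pixels have similar gray levels."""
    size = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(size) for j in range(size))


# Hypothetical 4x4 patch quantised to 4 gray levels
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(patch, levels=4)
print(round(glcm_contrast(p), 4), round(glcm_homogeneity(p), 4))
```

A perfectly uniform patch gives a contrast of 0 and a homogeneity of 1, which is why these statistics can act as texture descriptors.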

In [2]:
# Load sklearn baseline results
df_sklearn = load_sklearn_results(ARTIFACTS_DIR)

if not df_sklearn.empty:
    df_sklearn = df_sklearn.sort_values("ROC-AUC", ascending=False).reset_index(drop=True)
    display(Markdown("### sklearn Baseline Results (Validation Set)"))
    display(df_sklearn.style.format({
        "Accuracy": "{:.3f}",
        "ROC-AUC": "{:.3f}"
    }).background_gradient(subset=["ROC-AUC"], cmap="Greens"))

    best = df_sklearn.iloc[0]
    print(f"\nBest sklearn Model: {best['Model']}")
    print(f"   ROC-AUC: {best['ROC-AUC']:.4f}")
else:
    print("No sklearn results found. Run `make sklearn-baselines` first.")

### sklearn Baseline Results (Validation Set)

| | Model | Type | Accuracy | ROC-AUC |
|---|-------|------|----------|---------|
| 0 | Gradient Boosting (tuned) | sklearn | 0.901 | 0.897 |
| 1 | Logistic Regression (tuned) | sklearn | 0.903 | 0.877 |
| 2 | Gradient Boosting (baseline) | sklearn | 0.904 | 0.877 |
| 3 | Logistic Regression (baseline) | sklearn | 0.895 | 0.865 |
| 4 | Random Forest (tuned) | sklearn | 0.894 | 0.865 |
| 5 | Random Forest (baseline) | sklearn | 0.893 | 0.857 |



Best sklearn Model: Gradient Boosting (tuned)
   ROC-AUC: 0.8974


---

## 2. Deep Learning Teacher Models

We trained two families of pretrained models as teacher candidates:

### ResNet Family
| Model | Parameters | Input Size | Notes |
|-------|------------|------------|-------|
| ResNet-18 | 11.7M | 224×224 | Lightweight baseline |
| ResNet-34 | 21.8M | 224×224 | Good accuracy/size tradeoff |
| ResNet-50 | 25.6M | 224×224 | Popular choice |
| ResNet-101 | 44.5M | 224×224 | Deeper features |
| ResNet-152 | 60.2M | 224×224 | Highest capacity |

### EfficientNet Family
| Model | Parameters | Input Size | Notes |
|-------|------------|------------|-------|
| EfficientNet-B0 | 5.3M | 224×224 | Most efficient |
| EfficientNet-B1 | 7.8M | 240×240 | |
| EfficientNet-B2 | 9.2M | 260×260 | |
| EfficientNet-B3 | 12.0M | 300×300 | |
| EfficientNet-B4 | 19.3M | 380×380 | Good balance |
| EfficientNet-B5 | 30.4M | 456×456 | |
| EfficientNet-B6 | 43.0M | 528×528 | |
| EfficientNet-B7 | 66.3M | 600×600 | Highest accuracy |

All models were fine-tuned with:
- **Loss**: Focal Loss (handles class imbalance)
- **Optimizer**: AdamW with weight decay
- **Early Stopping**: Patience of 10 epochs on validation ROC-AUC
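
For readers unfamiliar with focal loss, the following sketch shows the per-example binary form in plain Python. The defaults `gamma=2.0` and `alpha=0.25` are the values suggested in the original focal loss paper and are an assumption here; the notebook does not state which values the project used, and `binary_focal_loss` is a hypothetical helper name.

```python
import math


def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Focal loss for a single example.

    prob   : predicted probability of the positive class (melanoma)
    target : ground-truth label, 0 or 1
    gamma  : focusing parameter; gamma=0 reduces to weighted cross-entropy
    alpha  : weight for the positive class (1 - alpha for the negative class)
    """
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)            # avoid log(0)
    p_t = prob if target == 1 else 1 - prob        # probability of the true class
    alpha_t = alpha if target == 1 else 1 - alpha  # class weight
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# A confidently correct melanoma prediction contributes almost nothing,
# while a missed melanoma dominates the loss.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.10, 1)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

The `(1 - p_t) ** gamma` factor down-weights well-classified examples, which is what lets training focus on the rare, hard melanoma cases.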

In [3]:
# Load teacher checkpoint results
df_teachers = load_teacher_checkpoints(CHECKPOINTS_DIR)

if not df_teachers.empty:
    df_teachers = df_teachers.sort_values("ROC-AUC", ascending=False).reset_index(drop=True)
    display(Markdown("### Teacher Model Results (Validation Set)"))

    display_cols = ["Model", "Epoch", "ROC-AUC", "F1", "Recall", "Specificity", "Accuracy"]
    available_cols = [c for c in display_cols if c in df_teachers.columns]

    display(df_teachers[available_cols].style.format({
        col: "{:.4f}" for col in available_cols if col not in ["Model", "Epoch"]
    }).background_gradient(subset=["ROC-AUC"], cmap="Greens"))

    print("\nTeacher Model Summary:")
    print(f"   Total Models Trained: {len(df_teachers)}")
    print(f"   ROC-AUC Range: {df_teachers['ROC-AUC'].min():.4f} - {df_teachers['ROC-AUC'].max():.4f}")

    best = df_teachers.iloc[0]
    print(f"\nBest Teacher Model: {best['Model']}")
    print(f"   ROC-AUC: {best['ROC-AUC']:.4f}")
else:
    print("No teacher checkpoints found. Run `make train-teacher` first.")

### Teacher Model Results (Validation Set)

| | Model | Epoch | ROC-AUC | F1 | Recall | Specificity | Accuracy |
|---|-------|-------|---------|----|--------|-------------|----------|
| 0 | efficientnet_b2 | 23 | 0.9134 | 0.5656 | 0.5808 | 0.9409 | 0.9009 |
| 1 | efficientnet_b0 | 12 | 0.9038 | 0.5401 | 0.6647 | 0.9005 | 0.8743 |
| 2 | efficientnet_b1 | 11 | 0.9000 | 0.5284 | 0.6407 | 0.9020 | 0.8730 |
| 3 | efficientnet_b3 | 24 | 0.8883 | 0.5614 | 0.5749 | 0.9409 | 0.9003 |
| 4 | efficientnet_b4 | 28 | 0.8853 | 0.5263 | 0.5689 | 0.9260 | 0.8863 |
| 5 | efficientnet_b7 | 4 | 0.8837 | 0.5093 | 0.5749 | 0.9147 | 0.8770 |
| 6 | efficientnet_b5 | 4 | 0.8834 | 0.4892 | 0.7425 | 0.8384 | 0.8278 |
| 7 | efficientnet_b6 | 6 | 0.8757 | 0.5257 | 0.5808 | 0.9215 | 0.8836 |
| 8 | resnet34 | 29 | 0.8658 | 0.4249 | 0.6946 | 0.8033 | 0.7912 |
| 9 | resnet18 | 28 | 0.8597 | 0.4248 | 0.7186 | 0.7921 | 0.7839 |



Teacher Model Summary:
   Total Models Trained: 13
   ROC-AUC Range: 0.7838 - 0.9134

Best Teacher Model: efficientnet_b2
   ROC-AUC: 0.9134


In [4]:
# Visualize teacher model comparison
if not df_teachers.empty:
    plot_teacher_comparison(df_teachers, save_path=IMGS_DIR / "teacher_comparison.png")
    plt.show()



---

## 3. Knowledge Distilled Student Models

The student model (MobileNetV3-Small) was trained using knowledge distillation from the best teacher. Different temperature (T) and alpha (α) values were explored:

- **Temperature (T)**: Controls softness of teacher predictions
  - Higher T → softer probability distributions → more knowledge transfer
- **Alpha (α)**: Balance between hard labels and teacher's soft labels
  - α=0.5 → equal weight to both
  - α=0.9 → mostly rely on teacher's soft labels
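
The roles of T and α can be sketched for a single binary example as follows. This is a minimal illustration under stated assumptions: a single-logit sigmoid output, α weighting the soft (teacher) term as described above, and the T² scaling of the soft term that is conventional in the distillation literature. The project's training code is not shown in this notebook, so `distillation_loss` and its exact form are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(target_prob, pred_prob):
    """Binary cross-entropy between a target probability and a prediction."""
    eps = 1e-12
    q = min(max(pred_prob, eps), 1 - eps)
    return -(target_prob * math.log(q) + (1 - target_prob) * math.log(1 - q))


def distillation_loss(student_logit, teacher_logit, label, temperature=2.0, alpha=0.5):
    """Blend of hard-label loss and temperature-softened teacher loss."""
    # Hard term: student vs. ground-truth label at temperature 1
    hard = bce(float(label), sigmoid(student_logit))
    # Soft term: both logits are divided by T before the sigmoid
    soft_target = sigmoid(teacher_logit / temperature)
    soft_pred = sigmoid(student_logit / temperature)
    soft = bce(soft_target, soft_pred)
    # T^2 keeps the soft-term gradient scale comparable across temperatures
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft


# Higher T pulls the teacher's probability toward 0.5 (a "softer" target)
print(round(sigmoid(4.0 / 1.0), 3), round(sigmoid(4.0 / 2.0), 3))
print(round(distillation_loss(1.0, 3.0, label=1, temperature=2.0, alpha=0.5), 4))
```

With `alpha=0.0` the loss reduces to ordinary cross-entropy on the hard labels; with `alpha=0.9` the student is driven mostly by the teacher's softened output.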

### Student Architecture: MobileNetV3-Small
- **Parameters**: ~2.5M for the stock architecture (vs. roughly 5-66M for the teacher candidates listed above)
- **Target**: Mobile/edge deployment
- **Goal**: Maintain >90% of teacher performance with <10% of parameters

In [5]:
# Load student checkpoint results
df_students = load_student_checkpoints(CHECKPOINTS_DIR)

if not df_students.empty:
    df_students = df_students.sort_values("ROC-AUC", ascending=False).reset_index(drop=True)
    display(Markdown("### Student Model Results (Validation Set)"))

    display_cols = ["Model", "Epoch", "ROC-AUC", "F1", "Accuracy"]
    available_cols = [c for c in display_cols if c in df_students.columns]

    display(df_students[available_cols].style.format({
        col: "{:.4f}" for col in available_cols if col not in ["Model", "Epoch"]
    }).background_gradient(subset=["ROC-AUC"], cmap="Greens"))

    best_student = df_students.iloc[0]
    print(f"\nBest Student Configuration: {best_student['Model']}")
    print(f"   ROC-AUC: {best_student['ROC-AUC']:.4f}")
else:
    print("No student checkpoints found. Run `make train-student` first.")

### Student Model Results (Validation Set)

| | Model | Epoch | ROC-AUC | F1 | Accuracy |
|---|-------|-------|---------|----|----------|
| 0 | MobileNetV3 (T2.0_alpha0.5) | 33 | 0.8998 | 0.5142 | 0.8404 |
| 1 | MobileNetV3 (T2.0_alpha0.9) | 33 | 0.8994 | 0.5282 | 0.8444 |
| 2 | MobileNetV3 (T1.0_alpha0.9) | 37 | 0.8980 | 0.5257 | 0.8836 |
| 3 | MobileNetV3 (T1.0_alpha0.5) | 31 | 0.8972 | 0.5619 | 0.8870 |



Best Student Configuration: MobileNetV3 (T2.0_alpha0.5)
   ROC-AUC: 0.8998


---

## 4. Overall Model Comparison

Compare all models across the pipeline: traditional ML → deep learning teachers → knowledge distilled student.

In [6]:
# Comprehensive Model Comparison
all_results = []

# Add sklearn results
if not df_sklearn.empty:
    for _, row in df_sklearn.iterrows():
        all_results.append({
            "Model": row["Model"], "Type": "sklearn",
            "ROC-AUC": row["ROC-AUC"], "Accuracy": row["Accuracy"],
        })

# Add teacher results
if not df_teachers.empty:
    for _, row in df_teachers.iterrows():
        all_results.append({
            "Model": row["Model"], "Type": "Teacher (DL)",
            "ROC-AUC": row["ROC-AUC"], "Accuracy": row.get("Accuracy", np.nan),
            "F1": row.get("F1", np.nan),
        })

# Add student results
if not df_students.empty:
    for _, row in df_students.iterrows():
        all_results.append({
            "Model": row["Model"], "Type": "Student (KD)",
            "ROC-AUC": row["ROC-AUC"], "Accuracy": row.get("Accuracy", np.nan),
            "F1": row.get("F1", np.nan),
        })

df_all = pd.DataFrame(all_results)

if not df_all.empty:
    plot_complete_model_comparison(df_all, save_path=IMGS_DIR / "complete_model_comparison.png")
    plt.show()
else:
    print("No results available for comparison.")



In [7]:
# Summary statistics by model type
if not df_all.empty:
    display(Markdown("### Summary by Model Type"))

    summary = df_all.groupby("Type").agg({
        "ROC-AUC": ["count", "mean", "std", "min", "max"]
    }).round(4)
    summary.columns = ["Count", "Mean ROC-AUC", "Std", "Min", "Max"]
    display(summary)

    best_overall = df_all.loc[df_all["ROC-AUC"].idxmax()]

    print("\n" + "="*60)
    print("BEST OVERALL MODEL")
    print("="*60)
    print(f"   Model: {best_overall['Model']}")
    print(f"   Type: {best_overall['Type']}")
    print(f"   ROC-AUC: {best_overall['ROC-AUC']:.4f}")
    print("="*60)

### Summary by Model Type

| Type | Count | Mean ROC-AUC | Std | Min | Max |
|------|-------|--------------|-----|-----|-----|
| Student (KD) | 4 | 0.8986 | 0.0012 | 0.8972 | 0.8998 |
| Teacher (DL) | 13 | 0.8729 | 0.0328 | 0.7838 | 0.9134 |
| sklearn | 6 | 0.8730 | 0.0141 | 0.8574 | 0.8974 |



BEST OVERALL MODEL
   Model: efficientnet_b2
   Type: Teacher (DL)
   ROC-AUC: 0.9134


---

## 5. Knowledge Distillation Effectiveness

How well does the student retain the teacher's knowledge?

In [8]:
# Knowledge Distillation Effectiveness Analysis
if not df_teachers.empty and not df_students.empty:
    best_teacher = df_teachers.loc[df_teachers["ROC-AUC"].idxmax()]
    best_student = df_students.loc[df_students["ROC-AUC"].idxmax()]

    teacher_auc = best_teacher["ROC-AUC"]
    student_auc = best_student["ROC-AUC"]
    retention = (student_auc / teacher_auc) * 100

    # Approximate model sizes in millions of parameters, taken from the
    # parameter counts measured in Section 6 (binary-classification heads)
    teacher_params = 7.70  # EfficientNet-B2 (the best teacher), not ResNet-50
    student_params = 1.52  # MobileNetV3-Small student
    compression = teacher_params / student_params

    print("="*60)
    print("KNOWLEDGE DISTILLATION EFFECTIVENESS")
    print("="*60)
    print(f"\nBest Teacher: {best_teacher['Model']}")
    print(f"   ROC-AUC: {teacher_auc:.4f}")
    print(f"   Approx. Parameters: ~{teacher_params}M")

    print(f"\nBest Student: {best_student['Model']}")
    print(f"   ROC-AUC: {student_auc:.4f}")
    print(f"   Approx. Parameters: ~{student_params}M")

    print("\nResults:")
    print(f"   Knowledge Retention: {retention:.1f}%")
    print(f"   Model Compression: {compression:.1f}x smaller")
    print(f"   Performance Gap: {(teacher_auc - student_auc):.4f} AUC points")
    print("="*60)

    plot_kd_effectiveness(teacher_auc, student_auc, teacher_params, student_params,
                          save_path=IMGS_DIR / "kd_effectiveness.png")
    plt.show()
else:
    print("Need both teacher and student results for KD analysis.")

KNOWLEDGE DISTILLATION EFFECTIVENESS

Best Teacher: efficientnet_b2
   ROC-AUC: 0.9134
   Approx. Parameters: ~7.7M

Best Student: MobileNetV3 (T2.0_alpha0.5)
   ROC-AUC: 0.8998
   Approx. Parameters: ~1.52M

Results:
   Knowledge Retention: 98.5%
   Model Compression: 5.1x smaller
   Performance Gap: 0.0136 AUC points




---

## 5b. Threshold Tuning Curves & 95% Sensitivity Analysis

For clinical deployment, melanoma screening requires high sensitivity (recall) to minimize missed diagnoses. 
We analyze the threshold-performance tradeoff and identify optimal operating points for 95% sensitivity.
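
One way to pick such an operating point is sketched below in plain Python: choose the highest threshold that still catches the required fraction of melanomas, then report the specificity at that threshold. The function name `threshold_for_sensitivity` and the toy scores are illustrative assumptions; the project's `evaluate_model` and `plot_threshold_curves` may select the threshold differently.

```python
import math


def threshold_for_sensitivity(y_true, y_prob, target_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) for the highest
    threshold whose sensitivity is at least the target.

    Predictions are positive when prob >= threshold.
    """
    positives = sorted((p for y, p in zip(y_true, y_prob) if y == 1), reverse=True)
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")

    # Number of positives that must score at or above the threshold
    needed = max(1, math.ceil(target_sensitivity * len(positives)))
    threshold = positives[needed - 1]

    tp = sum(p >= threshold for p in positives)
    tn = sum(n < threshold for n in negatives)
    return threshold, tp / len(positives), tn / len(negatives)


# Toy scores (hypothetical): 4 melanomas, 5 benign lesions
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_prob = [0.9, 0.7, 0.6, 0.2, 0.8, 0.5, 0.4, 0.3, 0.1]
print(threshold_for_sensitivity(toy_true, toy_prob, target_sensitivity=0.75))
print(threshold_for_sensitivity(toy_true, toy_prob, target_sensitivity=1.0))
```

Raising the target from 75% to 100% sensitivity forces the threshold down to the lowest-scoring melanoma, and specificity drops accordingly. A specificity of exactly 0 at the chosen threshold would mean that no benign lesion scores below it.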

In [9]:
# Threshold Tuning Curves for Best Teacher and Student

def load_model_and_get_predictions(model_type, checkpoint_path, arch=None, device='cpu'):
    """Load model and get predictions on validation set."""
    if model_type == "teacher":
        config = TeacherConfig(architecture=arch)
        model = TeacherModel(config)
    else:
        config = StudentConfig()
        model = StudentModel(config)

    checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
    if "model_state_dict" in checkpoint:
        model.load_state_dict(checkpoint["model_state_dict"])
    elif "state_dict" in checkpoint:
        model.load_state_dict(checkpoint["state_dict"])
    else:
        model.load_state_dict(checkpoint)

    model = model.to(device)
    model.eval()

    data_config = DataConfig()
    _, val_path, _ = load_or_create_splits()
    val_dataset = HAM10000Dataset(val_path, transform=get_eval_transforms(data_config))
    val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)

    all_probs, all_targets = [], []
    with torch.no_grad():
        for images, targets in val_loader:
            images = images.to(device)
            logits = model(images)
            probs = torch.sigmoid(logits).cpu().numpy()
            all_probs.append(probs)
            all_targets.append(targets.numpy())

    return np.concatenate(all_targets), np.concatenate(all_probs)

# Setup
device = get_device()
print(f"Using device: {device}")

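# Best teacher is hard-coded from the validation results above;
# best student is selected by validation ROC-AUC from the checkpoint metadata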
best_efficientnet = "efficientnet_b2"
best_teacher_ckpt = CHECKPOINTS_DIR / f"teacher_{best_efficientnet}_focal_best.pth"

best_student_ckpt, best_student_config = None, None
best_student_auc = 0

for meta_file in CHECKPOINTS_DIR.glob("student_*_meta.json"):
    data = json.loads(meta_file.read_text())
    auc_val = data.get("metrics", {}).get("roc_auc", 0)
    if auc_val > best_student_auc:
        best_student_auc = auc_val
        best_student_config = meta_file.stem.replace("_best_meta", "")
        best_student_ckpt = CHECKPOINTS_DIR / f"{best_student_config}_best.pth"

print(f"Best Teacher: {best_efficientnet}")
print(f"Best Student: {best_student_config}")

Using device: mps
Best Teacher: efficientnet_b2
Best Student: student_T2.0_alpha0.5


In [10]:
# Generate predictions and plot threshold curves
try:
    print("Loading models and generating predictions...")
    teacher_true, teacher_prob = load_model_and_get_predictions(
        "teacher", best_teacher_ckpt, best_efficientnet, device
    )
    student_true, student_prob = load_model_and_get_predictions(
        "student", best_student_ckpt, device=device
    )

    # Plot teacher curves
    plot_threshold_curves(teacher_true, teacher_prob, f"Teacher ({best_efficientnet})",
                          save_path=IMGS_DIR / "teacher_threshold_curves.png")
    plt.show()

    # Plot student curves
    plot_threshold_curves(student_true, student_prob, "Student (MobileNetV3)",
                          save_path=IMGS_DIR / "student_threshold_curves.png")
    plt.show()

except Exception as e:
    print(f"Could not generate threshold curves: {e}")

Loading models and generating predictions...




---

## 6. Inference Latency Benchmarks

Measuring inference speed is critical for deployment decisions. We benchmark:
- **CPU inference**: For server/edge deployment without GPU
- **GPU inference** (if available): For high-throughput scenarios
- **Model loading time**: One-time cost at startup
- **Throughput**: Images processed per second

In [11]:
# Inference Latency Benchmarks - Helper Functions

def benchmark_model(model, input_size=(1, 3, 224, 224), num_warmup=10, num_runs=100, device='cpu'):
    """Benchmark model inference latency."""
    model = model.to(device)
    model.eval()
    dummy_input = torch.randn(*input_size).to(device)

    with torch.no_grad():
        for _ in range(num_warmup):
            _ = model(dummy_input)

    if device != 'cpu' and torch.cuda.is_available():
        torch.cuda.synchronize()

    latencies = []
    with torch.no_grad():
        for _ in range(num_runs):
            start = time.perf_counter()
            _ = model(dummy_input)
            if device != 'cpu' and torch.cuda.is_available():
                torch.cuda.synchronize()
            latencies.append((time.perf_counter() - start) * 1000)

    latencies = np.array(latencies)
    return {
        'mean_ms': latencies.mean(),
        'std_ms': latencies.std(),
        'throughput_fps': 1000 / latencies.mean(),
    }

def get_model_size_mb(model):
    """Calculate model size in MB."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_size + buffer_size) / (1024 * 1024)

def count_parameters(model):
    """Count parameters."""
    return sum(p.numel() for p in model.parameters())

print("Benchmark functions defined")

Benchmark functions defined


In [12]:
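# Run Latency Benchmarks
# Note: get_device() is printed for reference only. Every benchmark_model call
# below passes device='cpu', so the reported timings are CPU latencies.
device = str(get_device())
print(f"Benchmarking on device: {device}")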

benchmark_configs = [
    ("ResNet-18", "resnet18"),
    ("ResNet-50", "resnet50"),
    ("EfficientNet-B0", "efficientnet_b0"),
    ("EfficientNet-B2", "efficientnet_b2"),
    ("MobileNetV3-Small (Student)", "student"),
]

latency_results = []
print("\nRunning inference benchmarks...")

for name, arch in benchmark_configs:
    try:
        if arch == "student":
            config = StudentConfig(pretrained=False)
            model = StudentModel(config=config)
        else:
            config = TeacherConfig(architecture=arch, pretrained=False)
            model = TeacherModel(config=config)

        params = count_parameters(model)
        size_mb = get_model_size_mb(model)
        cpu_stats = benchmark_model(model, num_warmup=5, num_runs=50, device='cpu')

        latency_results.append({
            'Model': name,
            'Parameters (M)': params / 1e6,
            'Size (MB)': size_mb,
            'CPU Latency (ms)': cpu_stats['mean_ms'],
            'CPU Throughput (FPS)': cpu_stats['throughput_fps'],
        })
        print(f"  {name}: {cpu_stats['mean_ms']:.2f}ms")
        del model

    except Exception as e:
        print(f"  {name}: Failed - {e}")

df_latency = pd.DataFrame(latency_results)
print(f"\nBenchmarked {len(latency_results)} models")

Benchmarking on device: mps

Running inference benchmarks...
  ResNet-18: 6.85ms
  ResNet-50: 16.38ms
  EfficientNet-B0: 269.31ms
  EfficientNet-B2: 386.50ms
  MobileNetV3-Small (Student): 94.50ms

Benchmarked 5 models


In [13]:
# Visualize Latency Results
if not df_latency.empty:
    display(Markdown("### Inference Latency Results"))
    display(df_latency.style.format({
        'Parameters (M)': '{:.2f}',
        'Size (MB)': '{:.1f}',
        'CPU Latency (ms)': '{:.2f}',
        'CPU Throughput (FPS)': '{:.1f}',
    }).background_gradient(subset=['CPU Latency (ms)'], cmap='Reds_r'))

    plot_latency_benchmarks(df_latency, save_path=IMGS_DIR / "latency_benchmarks.png")
    plt.show()

    # Summary
    student_row = df_latency[df_latency['Model'].str.contains('Student')]
    if not student_row.empty:
        student_latency = student_row['CPU Latency (ms)'].values[0]
        teacher_rows = df_latency[~df_latency['Model'].str.contains('Student')]
        avg_teacher_latency = teacher_rows['CPU Latency (ms)'].mean()
        speedup = avg_teacher_latency / student_latency

        print("\n" + "="*60)
        print("LATENCY SUMMARY")
        print("="*60)
        print(f"Student CPU Latency: {student_latency:.2f} ms")
        print(f"Average Teacher Latency: {avg_teacher_latency:.2f} ms")
        print(f"Student Speedup: {speedup:.1f}x faster")
        print("="*60)

### Inference Latency Results

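| | Model | Parameters (M) | Size (MB) | CPU Latency (ms) | CPU Throughput (FPS) |
|---|-------|----------------|-----------|------------------|----------------------|
| 0 | ResNet-18 | 11.18 | 42.7 | 6.85 | 145.9 |
| 1 | ResNet-50 | 23.51 | 89.9 | 16.38 | 61.0 |
| 2 | EfficientNet-B0 | 4.01 | 15.5 | 269.31 | 3.7 |
| 3 | EfficientNet-B2 | 7.70 | 29.6 | 386.50 | 2.6 |
| 4 | MobileNetV3-Small (Student) | 1.52 | 5.8 | 94.50 | 10.6 |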



LATENCY SUMMARY
Student CPU Latency: 94.50 ms
Average Teacher Latency: 169.76 ms
Student Speedup: 1.8x faster




In [14]:
# Save latency benchmark results
if not df_latency.empty:
    latency_output_path = ARTIFACTS_DIR / "latency_benchmarks.csv"
    df_latency.to_csv(latency_output_path, index=False)
    print(f"Latency results saved to: {latency_output_path}")

Latency results saved to: /Users/ryan.healy/DS_MSDS/Deep_Learning_Final_Project/artifacts/tbls/01_baselines/latency_benchmarks.csv


---

## 7. Holdout Set Evaluation (Final Test)

Evaluate the best models on the **holdout test set** - data never seen during training or hyperparameter tuning. This provides an unbiased estimate of model generalization performance.

**Note**: The holdout set represents 15% of the original data.
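
A holdout of this kind is typically carved out with a stratified split so that melanoma prevalence is similar in every partition. The sketch below shows the idea in plain Python; `stratified_holdout`, the seed, and the synthetic labels are assumptions for illustration, and the project's `load_or_create_splits` may work differently (for example, by grouping images of the same lesion).

```python
import random


def stratified_holdout(labels, holdout_fraction=0.15, seed=42):
    """Split indices into (train_idx, holdout_idx), sampling the holdout
    separately within each class so class ratios are preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Synthetic labels (hypothetical): 10% positive class
labels = [1] * 100 + [0] * 900
train_idx, holdout_idx = stratified_holdout(labels)
holdout_prevalence = sum(labels[i] for i in holdout_idx) / len(holdout_idx)
print(len(train_idx), len(holdout_idx), holdout_prevalence)
```

Keeping the prevalence comparable across splits matters here because metrics such as PPV and F1 depend directly on the positive rate.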

In [15]:
# Holdout Set Evaluation Setup
device = str(get_device())
print(f"Using device: {device}")

# Load holdout data
data_config = DataConfig()
holdout_path = PROCESSED_DIR / "holdout_data.csv"

if holdout_path.exists():
    holdout_dataset = HAM10000Dataset(holdout_path, transform=get_eval_transforms(data_config))
    holdout_loader = torch.utils.data.DataLoader(holdout_dataset, batch_size=32, shuffle=False, num_workers=0)
    print(f"Holdout set size: {len(holdout_dataset)} samples")

    holdout_df = pd.read_csv(holdout_path)
    print(f"   Positive (melanoma): {holdout_df['target'].sum()} ({holdout_df['target'].mean()*100:.1f}%)")
    print(f"   Negative (benign): {len(holdout_df) - holdout_df['target'].sum()}")
else:
    print(f"Holdout data not found at {holdout_path}")
    holdout_loader = None

Using device: mps
Holdout set size: 1513 samples
   Positive (melanoma): 172 (11.4%)
   Negative (benign): 1341


In [16]:
# Evaluate Best Teacher on Holdout
holdout_results = {}

if holdout_loader is not None:
    best_teacher_arch = "efficientnet_b2"
    best_teacher_ckpt = CHECKPOINTS_DIR / f"teacher_{best_teacher_arch}_focal_best.pth"

    if best_teacher_ckpt.exists():
        print(f"\nEvaluating Best Teacher: {best_teacher_arch}")

        teacher_config = TeacherConfig(architecture=best_teacher_arch)
        teacher = TeacherModel(teacher_config)

        checkpoint = torch.load(best_teacher_ckpt, map_location=device, weights_only=False)
        if "model_state_dict" in checkpoint:
            teacher.load_state_dict(checkpoint["model_state_dict"])
        else:
            teacher.load_state_dict(checkpoint)

        teacher = teacher.to(device)
        teacher.eval()

        teacher_metrics = evaluate_model(teacher, holdout_loader, device, target_sensitivity=0.95)
        teacher_deployment = compute_deployment_metrics(teacher, device=device)

        holdout_results['Teacher (EfficientNet-B2)'] = {
            'metrics': teacher_metrics,
            'deployment': teacher_deployment,
        }

        print(f"   ROC-AUC: {teacher_metrics.roc_auc:.4f}")
        print(f"   F1: {teacher_metrics.f1:.4f}")
        print(f"   Specificity @95% sens: {teacher_metrics.specificity_at_target_sens:.4f}")
    else:
        print(f"Teacher checkpoint not found: {best_teacher_ckpt}")


Evaluating Best Teacher: efficientnet_b2
   ROC-AUC: 0.9044
   F1: 0.5926
   Specificity @95% sens: 0.0000


In [17]:
# Evaluate Best Student on Holdout
if holdout_loader is not None:
    best_student_ckpt, best_student_config = None, None
    best_student_val_auc = 0

    for meta_file in CHECKPOINTS_DIR.glob("student_*_meta.json"):
        data = json.loads(meta_file.read_text())
        auc_val = data.get("metrics", {}).get("roc_auc", 0)
        if auc_val > best_student_val_auc:
            best_student_val_auc = auc_val
            best_student_config = meta_file.stem.replace("_best_meta", "")
            best_student_ckpt = CHECKPOINTS_DIR / f"{best_student_config}_best.pth"

    if best_student_ckpt and best_student_ckpt.exists():
        print(f"\nEvaluating Best Student: {best_student_config}")

        student_config = StudentConfig()
        student = StudentModel(student_config)

        checkpoint = torch.load(best_student_ckpt, map_location=device, weights_only=False)
        if "model_state_dict" in checkpoint:
            student.load_state_dict(checkpoint["model_state_dict"])
        else:
            student.load_state_dict(checkpoint)

        student = student.to(device)
        student.eval()

        student_metrics = evaluate_model(student, holdout_loader, device, target_sensitivity=0.95)
        student_deployment = compute_deployment_metrics(student, device=device)

        holdout_results['Student (MobileNetV3)'] = {
            'metrics': student_metrics,
            'deployment': student_deployment,
        }

        print(f"   ROC-AUC: {student_metrics.roc_auc:.4f}")
        print(f"   F1: {student_metrics.f1:.4f}")
        print(f"   Specificity @95% sens: {student_metrics.specificity_at_target_sens:.4f}")
    else:
        print("No student checkpoints found")


Evaluating Best Student: student_T2.0_alpha0.5
   ROC-AUC: 0.9200
   F1: 0.5605
   Specificity @95% sens: 0.0000


In [18]:
# Holdout Results Comparison
from src.plotting.benchmarks import plot_holdout_evaluation

if len(holdout_results) == 2:
    display(Markdown("### Holdout Set Results (Final Test)"))

    comparison_data = []
    for model_name, data in holdout_results.items():
        m = data['metrics']
        d = data['deployment']
        comparison_data.append({
            'Model': model_name,
            'ROC-AUC': m.roc_auc,
            'PR-AUC': m.pr_auc,
            'F1': m.f1,
            'Sensitivity': m.recall,
            'Specificity': m.specificity,
            'Spec @95% Sens': m.specificity_at_target_sens,
            'PPV @95% Sens': m.ppv_at_target_sens,
            'ECE': m.ece,
            'Size (MB)': d.model_size_mb,
            'Latency (ms)': d.avg_latency_ms,
        })

    df_holdout = pd.DataFrame(comparison_data)
    display(df_holdout.style.format({
        'ROC-AUC': '{:.4f}', 'PR-AUC': '{:.4f}', 'F1': '{:.4f}',
        'Sensitivity': '{:.4f}', 'Specificity': '{:.4f}',
        'Spec @95% Sens': '{:.4f}', 'PPV @95% Sens': '{:.4f}',
        'ECE': '{:.4f}', 'Size (MB)': '{:.2f}', 'Latency (ms)': '{:.2f}',
    }).background_gradient(subset=['ROC-AUC'], cmap='Greens'))

    plot_holdout_evaluation(df_holdout, save_path=IMGS_DIR / "holdout_evaluation.png")
    plt.show()

    # Summary
    teacher_m = holdout_results['Teacher (EfficientNet-B2)']['metrics']
    student_m = holdout_results['Student (MobileNetV3)']['metrics']
    retention = (student_m.roc_auc / teacher_m.roc_auc) * 100

    print("\n" + "="*70)
    print("HOLDOUT SET EVALUATION SUMMARY")
    print("="*70)
    print(f"Teacher ROC-AUC: {teacher_m.roc_auc:.4f}")
    print(f"Student ROC-AUC: {student_m.roc_auc:.4f}")
    print(f"Knowledge Retention: {retention:.1f}%")
    print("="*70)
else:
    print("Could not complete holdout evaluation - missing model results")

### Holdout Set Results (Final Test)

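| | Model | ROC-AUC | PR-AUC | F1 | Sensitivity | Specificity | Spec @95% Sens | PPV @95% Sens | ECE | Size (MB) | Latency (ms) |
|---|-------|---------|--------|----|-------------|-------------|----------------|---------------|-----|-----------|--------------|
| 0 | Teacher (EfficientNet-B2) | 0.9044 | 0.6587 | 0.5926 | 0.6512 | 0.9299 | 0.0000 | 0.1137 | 0.0637 | 29.64 | 9.35 |
| 1 | Student (MobileNetV3) | 0.9200 | 0.6451 | 0.5605 | 0.8488 | 0.8486 | 0.0000 | 0.1137 | 0.1340 | 5.84 | 3.78 |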



HOLDOUT SET EVALUATION SUMMARY
Teacher ROC-AUC: 0.9044
Student ROC-AUC: 0.9200
Knowledge Retention: 101.7%




In [19]:
# Save Holdout Evaluation Results
if len(holdout_results) == 2 and 'df_holdout' in dir():
    holdout_csv_path = ARTIFACTS_DIR / "holdout_evaluation_results.csv"
    df_holdout.to_csv(holdout_csv_path, index=False)
    print(f"Holdout results saved to: {holdout_csv_path}")

Holdout results saved to: /Users/ryan.healy/DS_MSDS/Deep_Learning_Final_Project/artifacts/tbls/01_baselines/holdout_evaluation_results.csv


---

## 8. Conclusions & Recommendations

### Key Findings

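1. **Traditional ML Baselines** reach 0.857-0.897 validation ROC-AUC; tuned Gradient Boosting is the best at 0.897
2. **Deep Learning Teachers** show a modest gain at the top end: the best teacher beats the best sklearn baseline by about 0.016 ROC-AUC, while the mean across all 13 teachers (0.873) matches the sklearn mean (0.873)
   - **Best Teacher: EfficientNet-B2** with validation ROC-AUC 0.913 (0.904 on holdout)
3. **Knowledge Distillation** successfully transfers knowledge
   - Student retains 98.5% of the teacher's validation ROC-AUC and exceeds the teacher on holdout (0.920 vs 0.904)
   - Model size reduced by ~5x (29.6 MB vs 5.8 MB)
4. **Clinical Operating Point (95% Sensitivity)**
   - On holdout, both models report 0.0000 specificity at the 95% sensitivity threshold, and the PPV of 0.1137 equals the melanoma prevalence (172/1513), so the selected threshold flags every image as positive
   - This is hard to reconcile with ROC-AUC above 0.90 and suggests the threshold selection in `evaluate_model` should be verified before drawing clinical conclusions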

### Recommendations

| Deployment Scenario | Recommended Model |
|---------------------|-------------------|
| **Cloud/Server** | EfficientNet-B2 |
| **Mobile App** | MobileNetV3 Student |
| **Edge Device** | MobileNetV3 Student |



In [20]:
# Export All Results
if not df_all.empty:
    output_path = ARTIFACTS_DIR / "complete_benchmark_results.csv"
    df_all.to_csv(output_path, index=False)
    print(f"Results saved to: {output_path}")

    display(Markdown("### Complete Benchmark Results"))
    display(df_all.sort_values("ROC-AUC", ascending=False).head(10))

Results saved to: /Users/ryan.healy/DS_MSDS/Deep_Learning_Final_Project/artifacts/tbls/01_baselines/complete_benchmark_results.csv


### Complete Benchmark Results

| | Model | Type | ROC-AUC | Accuracy | F1 |
|---|-------|------|---------|----------|----|
| 6 | efficientnet_b2 | Teacher (DL) | 0.9134 | 0.9009 | 0.5656 |
| 7 | efficientnet_b0 | Teacher (DL) | 0.9038 | 0.8743 | 0.5401 |
| 8 | efficientnet_b1 | Teacher (DL) | 0.9000 | 0.8730 | 0.5284 |
| 19 | MobileNetV3 (T2.0_alpha0.5) | Student (KD) | 0.8998 | 0.8404 | 0.5142 |
| 20 | MobileNetV3 (T2.0_alpha0.9) | Student (KD) | 0.8994 | 0.8444 | 0.5282 |
| 21 | MobileNetV3 (T1.0_alpha0.9) | Student (KD) | 0.8980 | 0.8836 | 0.5257 |
| 0 | Gradient Boosting (tuned) | sklearn | 0.8974 | 0.9006 | NaN |
| 22 | MobileNetV3 (T1.0_alpha0.5) | Student (KD) | 0.8972 | 0.8870 | 0.5619 |
| 9 | efficientnet_b3 | Teacher (DL) | 0.8883 | 0.9003 | 0.5614 |
| 10 | efficientnet_b4 | Teacher (DL) | 0.8853 | 0.8863 | 0.5263 |
