In [1]:
"""
Calibration Theory Notes
Based on Guo et al. (2017) - "On Calibration of Modern Neural Networks"
"""

print("=" * 70)
print("CALIBRATION IN NEURAL NETWORKS")
print("=" * 70)

"""
## What is Calibration?

### Definition
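A model is **calibrated** if the confidence it assigns to a prediction matches how often that prediction is correct:
  P(Ŷ = Y | confidence = c) = c   for all c in [0, 1]
where Ŷ is the predicted class, Y is the true class, and confidence is the probability assigned to Ŷ.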

Example:
- Model makes 100 predictions, each with 80% confidence
- If 80 of them are actually correct → CALIBRATED ✓
- If only 60 are correct → OVERCONFIDENT ✗
- If 95 are correct → UNDERCONFIDENT ✗

### Why It Matters for Medical AI
1. **Trust**: Doctors need accurate confidence scores
2. **Decision Making**: Confidence determines when to defer
3. **Regulatory**: EU MDR requires transparency
4. **Safety**: Overconfident wrong predictions are dangerous

### Perfect vs Imperfect Calibration

Perfect Calibration:
- 90% confident predictions → 90% correct
- 50% confident predictions → 50% correct
- Reliability diagram follows diagonal line

Miscalibrated (typical modern networks):
- 90% confident predictions → only 70% correct (OVERCONFIDENT)
- Reliability diagram deviates from diagonal

## Expected Calibration Error (ECE)

### Formula
ECE = Σ_{m=1..M} (|B_m| / n) × |acc(B_m) - conf(B_m)|

Where:
- M = number of confidence bins
- B_m = set of predictions whose confidence falls in bin m
- acc(B_m) = accuracy within bin m
- conf(B_m) = average confidence in bin m (confidence = probability of the predicted class)
- n = total number of samples

### Interpretation
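- ECE ∈ [0, 1]; lower is better, and 0 means perfect calibration
- The value depends on the number of bins, so compare ECE only at a fixed bin count

Working thresholds for this project (rules of thumb, not from Guo et al.):
- **ECE < 0.05**: Excellent calibration ✓
- **0.05 ≤ ECE < 0.10**: Good calibration
- **0.10 ≤ ECE < 0.15**: Acceptable calibration
- **ECE ≥ 0.15**: Poor calibration, needs fixing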

### How to Calculate
1. Group predictions into equal-width confidence bins (typically 10-15 bins)
2. For each bin:
   - Calculate average confidence
   - Calculate actual accuracy
   - Find absolute difference
3. Weight by number of samples in bin
4. Sum across all bins
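
The four steps above can be sketched in plain Python. This is a minimal illustration, not the paper's reference code: the function name `expected_calibration_error`, the `n_bins` default, and the toy data are assumptions made for this example, and a real pipeline would more likely use NumPy or PyTorch tensors.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: probability of the predicted class per sample, in [0, 1]
    # correct: 1 if the predicted class was right, else 0
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing to the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Absolute gap, weighted by the fraction of samples in this bin.
        ece += abs(accuracy - avg_conf) * len(members) / n
    return ece


# Toy data: 100 predictions at 80% confidence, of which only 60 are correct.
confidences = [0.8] * 100
correct = [1] * 60 + [0] * 40
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.2
```

All samples land in one bin here, so the ECE is simply the gap |0.60 - 0.80| = 0.20, matching the overconfident case in the example near the top of these notes.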

## Temperature Scaling

### Concept
Scale logits by temperature T before softmax:

Original:  p_i = exp(z_i) / Σ exp(z_j)
Scaled:    p_i = exp(z_i/T) / Σ exp(z_j/T)

Where:
- z_i = logit for class i
- T = temperature parameter (T > 0)

### Effect of Temperature
- T = 1: No change (original model)
- T > 1: Softer probabilities (less confident)
- T < 1: Harder probabilities (more confident)
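
A small sketch of the effect, assuming hypothetical logits for a 3-class problem. The helper name `softmax_with_temperature` is invented for this example; it implements the scaled softmax formula given above.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]  # hypothetical logits, not taken from a real model
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top probability shrinks as T grows, while the ranking of the classes stays the same.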

### Why It Works
- **Single parameter**: Easy to optimize
- **Post-hoc**: Applied after training
- **No retraining**: Use validation set to find T
- **Preserves accuracy**: Dividing every logit by the same T > 0 leaves the argmax unchanged, so the predicted class stays the same
- **Fixes overconfidence**: Common in modern networks

### Optimization
Use validation set to find T that minimizes NLL (Negative Log-Likelihood):

T* = argmin_T  NLL(validation_set)

Typically T > 1 for overconfident networks; values between 1 and 5 are a rough rule of thumb, not a guarantee
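
One way to sketch the search for T*, under stated assumptions: the function names `nll` and `fit_temperature`, the search range, and the tiny validation set are all made up for illustration, and the plain grid search is chosen for readability (real implementations usually use a gradient-based optimizer).

```python
import math


def nll(logits, labels, temperature):
    # Mean negative log-likelihood of the true labels after scaling by T.
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / temperature for v in z]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(logits, labels, low=0.05, high=10.0, step=0.01):
    # Evaluate the NLL on a grid of candidate temperatures and keep the best.
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda t: nll(logits, labels, t))


# Hypothetical validation set: the logits claim about 99% confidence in
# class 0, but class 0 is correct only 3 times out of 4.
val_logits = [[5.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
best_t = fit_temperature(val_logits, val_labels)
print(round(best_t, 2))
```

For this toy set the NLL is minimized when the scaled confidence equals the observed accuracy of 0.75, which happens at T = 5 / ln(3), roughly 4.55.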

## Why Modern Networks are Miscalibrated

### Key Findings from Paper
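1. **Model Capacity**: Increasing depth and width tends to increase miscalibration, even as accuracy improves
2. **Batch Normalization**: Models trained with batch normalization tend to be more miscalibrated
3. **Weight Decay**: Training with less weight decay tends to worsen calibration
4. **Modern vs Older Architectures**: A 110-layer ResNet is far more overconfident than a 5-layer LeNet on CIFAR-100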

### Practical Implication
ALWAYS measure calibration on held-out data and recalibrate when needed, especially for:
- Deep networks (>50 layers)
- Networks with batch normalization
- Medical AI applications
- When using ImageNet pretrained weights

## Reliability Diagrams

### How to Read
- X-axis: Predicted confidence (binned)
- Y-axis: Actual accuracy
- Diagonal line: Perfect calibration
- Gap from diagonal: Calibration error

### Patterns
- Points above diagonal: Underconfident
- Points below diagonal: Overconfident (common)
- Close to diagonal: Well-calibrated
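
The patterns above can be checked numerically before plotting anything. The sketch below computes the per-bin points of a reliability diagram and labels each one; `reliability_bins`, `describe`, the `tolerance` value, and the toy predictions are all assumptions made for this example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    # Returns one (avg_confidence, accuracy, count) tuple per non-empty bin.
    grouped = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        grouped[idx].append((conf, hit))
    rows = []
    for members in grouped:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            rows.append((avg_conf, accuracy, len(members)))
    return rows


def describe(avg_conf, accuracy, tolerance=0.02):
    # tolerance is an arbitrary illustrative threshold, not a standard value.
    if accuracy > avg_conf + tolerance:
        return "underconfident (above diagonal)"
    if accuracy < avg_conf - tolerance:
        return "overconfident (below diagonal)"
    return "close to diagonal"


# Hypothetical predictions: two confidence levels, ten samples each.
confidences = [0.65] * 10 + [0.95] * 10
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
for avg_conf, accuracy, count in reliability_bins(confidences, correct):
    print(round(avg_conf, 2), round(accuracy, 2), count, describe(avg_conf, accuracy))
```

Each returned tuple is one point on the diagram (x = average confidence, y = accuracy), and the count is what a confidence histogram would show for that bin.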

### Best Practices
- Use 10-15 bins
- Report the number of samples in each bin
- Show both before/after calibration
- Include confidence histogram

## Summary for Our Project

### Phase 1 (Baseline)
1. Train EfficientNet-B0 (likely overconfident)
2. Calculate ECE on the validation set
3. Fit temperature T on the validation set and apply temperature scaling
4. Calculate ECE after scaling, ideally on a held-out test set not used to fit T
5. Target: ECE < 0.05

### Phases 2-3 (Uncertainty)
1. MC Dropout provides epistemic uncertainty
2. Temperature scaling provides calibrated probabilities
3. Use BOTH together for best results

### Key Metrics to Track
- Accuracy (classification performance)
- AUROC (discrimination ability)
- ECE (calibration quality)
- Reliability diagram (visualization)

## References
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017).
On Calibration of Modern Neural Networks.
ICML 2017.
https://arxiv.org/abs/1706.04599
"""

print("\n✓ Calibration theory notes complete")
print("\nKey takeaways:")
print("  1. Calibration: predicted probability = actual frequency")
print("  2. ECE measures calibration quality (lower is better)")
print("  3. Temperature scaling: simple, effective post-hoc method")
print("  4. Modern networks are typically overconfident")
print("  5. Always check calibration in medical AI applications")

CALIBRATION IN NEURAL NETWORKS

✓ Calibration theory notes complete

Key takeaways:
  1. Calibration: predicted probability = actual frequency
  2. ECE measures calibration quality (lower is better)
  3. Temperature scaling: simple, effective post-hoc method
  4. Modern networks are typically overconfident
  5. Always check calibration in medical AI applications


In [None]:
"""
## Monte Carlo Dropout (Preview)

### High-Level Concept
MC Dropout keeps dropout ACTIVE at test time:
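- Run multiple stochastic forward passes (T passes; this T is a count, unrelated to the temperature T used in temperature scaling)
- Each pass uses a different random dropout mask
- Variance across passes approximates epistemic uncertainty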
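
A toy sketch of the idea in plain Python, ahead of the real Phase 2 implementation. Everything here is an assumption for illustration: the tiny one-hidden-layer network, its fixed weights, the function names, and the defaults for `n_passes` and `drop_prob`. A real model would instead keep its framework's dropout layers active at inference time.

```python
import math
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, drop_prob, rng):
    # One stochastic forward pass through a tiny one-hidden-layer network.
    hidden = []
    for weights in w_hidden:
        activation = max(0.0, sum(w * v for w, v in zip(weights, x)))  # ReLU
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale surviving units by 1 / (1 - drop_prob).
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid probability


def mc_dropout_predict(x, w_hidden, w_out, n_passes=50, drop_prob=0.5, seed=0):
    # Repeat the stochastic pass and summarize the spread of the outputs.
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, w_hidden, w_out, drop_prob, rng)
               for _ in range(n_passes)]
    return statistics.mean(samples), statistics.pvariance(samples)


# Hypothetical fixed weights standing in for a trained model.
w_hidden = [[0.8, -0.2], [-0.5, 0.9], [0.3, 0.3], [1.0, -1.0]]
w_out = [1.2, -0.7, 0.5, 0.9]
mean_prob, variance = mc_dropout_predict([1.0, 0.5], w_hidden, w_out)
print(round(mean_prob, 3), round(variance, 4))
```

The mean over passes serves as the prediction and the variance as a rough uncertainty signal; with `drop_prob=0.0` every pass is identical and the variance collapses to zero.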

### What It Provides
- Epistemic (model) uncertainty
- Confidence intervals
- Out-of-distribution detection
- Deferral capability

### Difference from Temperature Scaling

**Temperature Scaling:**
- Fixes calibration (probabilities match frequencies)
- Single forward pass
- Fast
- No estimate of epistemic uncertainty

**MC Dropout:**
- Quantifies epistemic uncertainty
- Multiple forward passes (slower)
- Can detect when model "doesn't know"
- Provides prediction variance

### Why Use Both?
- Temperature Scaling → Calibrated probabilities
- MC Dropout → Uncertainty estimates
- Combined → Calibrated probabilities WITH uncertainty

### Phase 2 Focus
We'll implement and analyze MC Dropout in detail:
- How many forward passes (T)?
- What uncertainty metrics?
- How to visualize?
- When to defer to expert?

For now: Just know it exists and we'll use it later.
"""

print("\n✓ MC Dropout overview added")