# Error Analysis to Decisions - Thresholds, Calibration, and KPI Alignment

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/16_decision_thresholds_calibration.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Translate model outputs into business decisions (thresholds, costs, constraints)
2. Evaluate calibration and when to calibrate probabilities
3. Compare models by decision impact, not only by AUC/accuracy
4. Produce a threshold/decision recommendation
5. Document risks and assumptions explicitly

---

## 1. Setup

Before we can translate model scores into business decisions, we need the right
toolkit. This cell imports scikit-learn's calibration utilities alongside the
usual data-wrangling and plotting libraries. We also fix `RANDOM_SEED = 474`
so that every threshold sweep and calibration curve you see is fully
reproducible.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)

print("✓ Setup complete!")

**Reading the output:**

The `Setup complete!` confirmation tells you every import resolved and the
global random seed is locked to **474**. If any import had failed, Colab would
show a `ModuleNotFoundError` here. Notice that we import both
`calibration_curve` (for diagnosis) and `CalibratedClassifierCV` (for
correction) -- the two halves of the calibration workflow.

**Why this matters:** A clean setup cell is the first reproducibility
checkpoint. If someone else opens this notebook in a fresh Colab runtime,
this cell must succeed unchanged.

---


## 2. Load Data and Train Model

We generate a synthetic binary-classification dataset with `make_classification`
(5,000 samples, 20 features, 70/30 class imbalance, 5 % label noise). The data
is then split 60/20/20 into train, validation, and test sets. A Random Forest
is trained on the training set and its predicted probabilities on the validation
set become the raw material for every threshold and calibration analysis that
follows.


In [None]:
# Generate classification dataset
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=15,
    n_redundant=5, weights=[0.7, 0.3], flip_y=0.05,
    random_state=RANDOM_SEED
)

# Split
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
print(f"Class distribution (validation): {np.bincount(y_val)}")

**Reading the output:**

The printout shows the **Train / Val / Test** sizes (**3,000 / 1,000 /
1,000**), confirming the 60/20/20 split. The class distribution in the
validation set should be close to the 70/30 imbalance we specified in
`make_classification`; label noise from `flip_y` and the random split shift it
slightly. If the minority class were extremely rare (say < 5 %), we would pass
`stratify=` to `train_test_split` so that every split keeps the class ratio --
here the imbalance is moderate enough that a standard split works well.

**Key takeaway:** Always print split sizes and class counts right after
splitting so you can spot unexpected imbalances early.
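
For the rarer-class case mentioned above, a stratified version of the same 60/20/20 split might look like the sketch below. The dataset here is a separate, smaller illustration (the 90/10 weights and the `*_demo` names are assumptions, not part of the notebook's pipeline).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 474  # same seed value as the notebook; any fixed integer works

# Illustrative dataset with a rarer positive class (roughly 10 %)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=SEED
)

# stratify= keeps the class proportions nearly identical in every split
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=SEED
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=SEED
)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```

The second call stratifies on `y_tmp`, not `y_demo`, because it splits the remaining 80 % of the data.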

---


In [None]:
# Train Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=RANDOM_SEED, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Get predicted probabilities
y_val_proba = rf_model.predict_proba(X_val)[:, 1]

print(f"\nROC-AUC Score: {roc_auc_score(y_val, y_val_proba):.4f}")

**Reading the output:**

The **ROC-AUC** score on the validation set tells you how well the Random
Forest separates the two classes *across all possible thresholds*. A value
close to **1.0** means the model produces well-separated probability
distributions for positives and negatives. However, a high AUC does **not**
guarantee that the default 0.50 threshold is the best business decision --
that is exactly what the cost-based sweep below will reveal.

**Why this matters:** AUC is a useful summary, but the optimal operating point
depends on the cost matrix, not on AUC alone.
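
To make the point concrete, the sketch below uses hand-made labels and scores (illustrative values, not output from the Random Forest). Two score vectors with the same ranking share the same AUC, yet they produce different decisions at the default 0.50 cut-off.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hand-made labels and scores (illustrative only)
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores_a = np.array([0.10, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95])
scores_b = scores_a / 2  # same ranking, compressed below 0.5

# AUC depends only on the ranking, so both score vectors agree
auc_a = roc_auc_score(y_toy, scores_a)
auc_b = roc_auc_score(y_toy, scores_b)

# Decisions at the default 0.50 threshold differ completely
pred_a = (scores_a >= 0.5).astype(int)
pred_b = (scores_b >= 0.5).astype(int)

print(f"AUC a: {auc_a:.4f}, AUC b: {auc_b:.4f}")
print(f"Flagged at 0.50 -> a: {pred_a.sum()}, b: {pred_b.sum()}")
```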

---


## 3. From Prediction to Action: Thresholds and Costs

### 3.1 Define Cost Matrix

**Business Context:**
- True Positive (TP): Correctly identify positive case → gain $100
- False Positive (FP): Incorrectly classify negative as positive → cost $30
- False Negative (FN): Miss a positive case → cost $150
- True Negative (TN): Correctly identify negative case → gain $0

In [None]:
# Cost matrix
COST_MATRIX = {
    'TP': 100,   # Benefit
    'FP': -30,   # Cost
    'FN': -150,  # Cost
    'TN': 0      # No action
}

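def compute_expected_value(y_true, y_pred, cost_matrix):
    """Return the total dollar value of the predictions and the confusion-matrix counts."""
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is present
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()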
    
    total_value = (
        tp * cost_matrix['TP'] +
        fp * cost_matrix['FP'] +
        fn * cost_matrix['FN'] +
        tn * cost_matrix['TN']
    )
    
    return total_value, {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn}

print("Cost Matrix Defined:")
for key, value in COST_MATRIX.items():
    print(f"  {key}: ${value}")

**Reading the output:**

The printed cost matrix shows **TP = +$100**, **FP = -$30**, **FN = -$150**,
and **TN = $0**. Two things jump out: missing a positive case (FN) costs
**five times** more than a false alarm (FP), and correctly identifying a
negative adds no monetary value. A missed positive also forfeits the $100 TP
gain, so turning a TP into an FN swings the total by $250, against $30 for
turning a TN into an FP. This asymmetry will push the optimal threshold
*below* 0.50 because we want to avoid the expensive false negatives even at
the price of more false positives.

**Key takeaway:** The ratio of FN cost to FP cost is the single most
important driver of threshold selection. Always map it before tuning.
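
If the predicted probabilities were perfectly calibrated (an assumption that Section 4 tests), the cost matrix alone would imply a break-even threshold. Flagging a case with probability $p$ pays off when $p \cdot TP + (1-p) \cdot FP > p \cdot FN + (1-p) \cdot TN$. A minimal sketch of that arithmetic, using a hypothetical helper named `breakeven_threshold`:

```python
COST_MATRIX = {"TP": 100, "FP": -30, "FN": -150, "TN": 0}  # values from the notebook

def breakeven_threshold(cost_matrix):
    """Probability above which flagging a case has higher expected value.

    Assumes calibrated probabilities; derived from
    p*TP + (1-p)*FP > p*FN + (1-p)*TN.
    """
    gain_on_positive = cost_matrix["TP"] - cost_matrix["FN"]  # 100 - (-150) = 250
    loss_on_negative = cost_matrix["TN"] - cost_matrix["FP"]  # 0 - (-30) = 30
    return loss_on_negative / (loss_on_negative + gain_on_positive)

p_star = breakeven_threshold(COST_MATRIX)
print(f"Break-even threshold: {p_star:.3f}")
```

Here the result is 30 / 280, roughly 0.107. An empirical sweep on uncalibrated scores can land somewhere else, which is one reason to compare the two.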

---


### 3.2 Threshold Sweep with Expected Cost

Instead of accepting the default 0.50 decision boundary, we sweep thresholds
from 0.10 to 0.85 and compute the total expected value under our cost matrix
at each point. The threshold that maximises total value is the one the
business should adopt. This is where ML stops being about accuracy and starts
being about dollars.


In [None]:
# Sweep thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for threshold in thresholds:
    y_pred = (y_val_proba >= threshold).astype(int)
    total_value, counts = compute_expected_value(y_val, y_pred, COST_MATRIX)
    
    results.append({
        'threshold': threshold,
        'total_value': total_value,
        'avg_value_per_case': total_value / len(y_val),
        'TP': counts['TP'],
        'FP': counts['FP'],
        'FN': counts['FN'],
        'TN': counts['TN']
    })

results_df = pd.DataFrame(results)
best_threshold = results_df.loc[results_df['total_value'].idxmax(), 'threshold']

print("\n=== THRESHOLD SWEEP RESULTS ===")
print(results_df.head(10))
print(f"\nOptimal Threshold (by total value): {best_threshold:.2f}")

**Reading the output:**

The results table lists the first ten thresholds (`head(10)`) alongside the
**total expected value** and the four confusion-matrix counts; the optimal
threshold, chosen across the full sweep, is printed at the bottom. Because
our FN cost dominates, the optimal point is typically below the default
0.50 -- the model accepts more false positives in exchange for catching
almost every true positive. The `avg_value_per_case` column gives a quick
per-record profitability estimate you can quote to stakeholders.

**Why this matters:** Picking a threshold by accuracy alone would miss the
cost-optimal operating point. Always let the cost matrix, not a default, drive
the threshold.
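
A tiny hand-made example (ten illustrative cases, not the notebook's data) shows how the accuracy-optimal and value-optimal thresholds can disagree under the same cost matrix. The helper `accuracy_and_value` is a hypothetical name introduced only for this sketch.

```python
import numpy as np

COSTS = {"TP": 100, "FP": -30, "FN": -150, "TN": 0}  # same values as the notebook

# Ten hand-made cases: true labels and predicted probabilities (illustrative)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_toy = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90])

def accuracy_and_value(y_true, proba, threshold, costs):
    """Return (accuracy, total dollar value) at a given threshold."""
    pred = proba >= threshold
    pos = y_true == 1
    tp = int((pred & pos).sum())
    fp = int((pred & ~pos).sum())
    fn = int((~pred & pos).sum())
    tn = int((~pred & ~pos).sum())
    accuracy = (tp + tn) / len(y_true)
    value = (tp * costs["TP"] + fp * costs["FP"]
             + fn * costs["FN"] + tn * costs["TN"])
    return accuracy, value

grid = [0.25, 0.50, 0.75]
rows = [(t, *accuracy_and_value(y_toy, p_toy, t, COSTS)) for t in grid]
for t, acc, val in rows:
    print(f"threshold={t:.2f}  accuracy={acc:.2f}  value=${val}")

# max() returns the first maximum, so ties resolve to the lower threshold
best_by_accuracy = max(rows, key=lambda r: r[1])[0]
best_by_value = max(rows, key=lambda r: r[2])[0]
print(f"Best by accuracy: {best_by_accuracy}, best by value: {best_by_value}")
```

In this toy case accuracy prefers 0.50 while dollar value prefers 0.25, because the lower threshold trades three cheap false alarms for zero missed positives.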

---


In [None]:
# Visualize threshold impact
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total value vs threshold
axes[0].plot(results_df['threshold'], results_df['total_value'], marker='o')
axes[0].axvline(best_threshold, color='r', linestyle='--', label=f'Optimal: {best_threshold:.2f}')
axes[0].set_xlabel('Threshold')
axes[0].set_ylabel('Total Expected Value ($)')
axes[0].set_title('Expected Value vs Threshold')
axes[0].legend()
axes[0].grid(True)

# Confusion matrix components
axes[1].plot(results_df['threshold'], results_df['TP'], marker='o', label='True Positives')
axes[1].plot(results_df['threshold'], results_df['FP'], marker='s', label='False Positives')
axes[1].plot(results_df['threshold'], results_df['FN'], marker='^', label='False Negatives')
axes[1].axvline(best_threshold, color='r', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('Count')
axes[1].set_title('Error Counts vs Threshold')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

**Reading the output:**

The **left panel** shows the total expected value curve, with the optimal
threshold marked by the red dashed line. Past that point the value falls as
we become too conservative and start missing true positives. If the optimum
sits at the lowest threshold tested, the curve only falls, which is a signal
to extend the sweep downward. The **right panel** decomposes the
confusion-matrix counts: as the threshold increases, true positives and
false positives both drop, while false negatives climb. The optimal threshold
sits where the savings from avoiding further FPs ($30 each) no longer cover
the loss from turning TPs into FNs ($250 each: the forgone $100 gain plus the
$150 penalty).

**Key takeaway:** Visualising value *and* error counts together lets you
explain the recommendation to non-technical stakeholders: 'We chose this
threshold because it balances alarm fatigue against missed detections.'

---


## 📝 PAUSE-AND-DO Exercise 1 (5 minutes)

**Task:** Select a threshold that maximizes expected value (equivalently, minimizes expected cost) and justify it.

**Instructions:**
1. Review the threshold sweep results above
2. Identify the threshold that maximizes expected value
3. Explain why this threshold makes business sense
4. Discuss what tradeoffs are being made

---

### YOUR THRESHOLD RECOMMENDATION HERE:

**Recommended Threshold:**  
[Value and reasoning]

**Business Justification:**  
[Why this threshold makes sense for the business]

**Tradeoffs:**  
[What are we gaining vs losing at this threshold?]

---

## 4. Calibration: Are Probabilities Trustworthy?

### 4.1 Calibration Plot

In [None]:
# Compute calibration curve
prob_true, prob_pred = calibration_curve(y_val, y_val_proba, n_bins=10, strategy='uniform')

# Plot calibration
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Calibration plot
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
axes[0].plot(prob_pred, prob_true, marker='o', label='Random Forest')
axes[0].set_xlabel('Mean Predicted Probability')
axes[0].set_ylabel('Fraction of Positives')
axes[0].set_title('Calibration Plot')
axes[0].legend()
axes[0].grid(True)

# Probability histogram
axes[1].hist(y_val_proba[y_val == 0], bins=30, alpha=0.5, label='Negative Class', edgecolor='black')
axes[1].hist(y_val_proba[y_val == 1], bins=30, alpha=0.5, label='Positive Class', edgecolor='black')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Predicted Probability Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n⚠️ Calibration Assessment:")
print("  - Points close to diagonal = well-calibrated")
print("  - Points below diagonal = overconfident")
print("  - Points above diagonal = underconfident")

**Reading the output:**

The **left panel** (reliability diagram) plots the actual fraction of
positives against the model's predicted probability in each bin. Points on
the diagonal mean perfect calibration; points *below* the diagonal indicate
the model is **overconfident** (it says 0.80 but the true rate is only 0.65).
The **right panel** shows how predicted probabilities are distributed for each
class -- well-separated histograms confirm good discrimination.

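**Why this matters:** The threshold sweep above scores each threshold against
observed labels, so it does not itself require calibrated probabilities.
Calibration matters as soon as a probability is read at face value: when you
compute a per-case expected value, derive a threshold directly from the cost
matrix, or report "this customer has a 70 % chance" to a stakeholder. In
those cases, miscalibrated scores lead to miscalculated expected values.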
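
A reliability diagram can be backed up with a number. The sketch below computes the Brier score with scikit-learn's `brier_score_loss` and a simple binned expected calibration error (ECE). The `expected_calibration_error` helper is a hypothetical implementation written for this illustration, and the inputs are toy values, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed and predicted rates per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; p == 1.0 lands in the last bin
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by share of cases in the bin
    return ece

# Toy case 1: confident and correct predictions
brier = brier_score_loss([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])

# Toy case 2: the model says 0.90 four times but only 3 of 4 are positive
ece_overconfident = expected_calibration_error([1, 1, 1, 0], [0.9, 0.9, 0.9, 0.9])

# Toy case 3: the model says 0.50 and half the cases are positive
ece_calibrated = expected_calibration_error([0, 1], [0.5, 0.5])

print(f"Brier score: {brier:.3f}")
print(f"ECE (overconfident): {ece_overconfident:.3f}")
print(f"ECE (calibrated):    {ece_calibrated:.3f}")
```

Lower is better for both metrics. The Brier score mixes calibration and discrimination, while the binned ECE isolates the calibration gap but depends on the number of bins.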

---


### 4.2 Apply Calibration

When a reliability diagram reveals systematic miscalibration, we can wrap the
trained classifier in `CalibratedClassifierCV` with isotonic regression. This
post-hoc adjustment maps raw scores to better-calibrated probabilities without
retraining the base model. We fit on the validation set and evaluate on the
held-out test set to avoid double-dipping.


In [None]:
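# Calibrate the already-fitted forest using isotonic regression.
# Note: on recent scikit-learn releases (1.6+), cv='prefit' is deprecated in
# favour of wrapping the fitted model in sklearn.frozen.FrozenEstimator.
calibrated_model = CalibratedClassifierCV(rf_model, method='isotonic', cv='prefit')
calibrated_model.fit(X_val, y_val)

# Score the held-out test set with both the raw and the calibrated model
y_test_proba = rf_model.predict_proba(X_test)[:, 1]
y_cal_proba = calibrated_model.predict_proba(X_test)[:, 1]

# Compute both curves on the same (test) data so they are directly comparable
prob_true_raw, prob_pred_raw = calibration_curve(y_test, y_test_proba, n_bins=10, strategy='uniform')
prob_true_cal, prob_pred_cal = calibration_curve(y_test, y_cal_proba, n_bins=10, strategy='uniform')

# Plot comparison
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
ax.plot(prob_pred_raw, prob_true_raw, marker='o', label='Original RF', alpha=0.7)
ax.plot(prob_pred_cal, prob_true_cal, marker='s', label='Calibrated RF', alpha=0.7)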
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration: Before vs After')
ax.legend()
ax.grid(True)
plt.tight_layout()
plt.show()

print("\n✓ Calibration applied using isotonic regression")

**Reading the output:**

The overlay plot compares the **original** Random Forest calibration curve
(circles) with the **isotonic-calibrated** curve (squares). After calibration,
the points should hug the diagonal more tightly, meaning that a predicted
probability of, say, 0.70 now corresponds much more closely to a 70 % observed
positive rate. Isotonic regression is non-parametric and flexible, but it can
overfit with very small validation sets -- always check the curve visually.

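**Key takeaway:** Calibration re-maps the scores so they can be interpreted
as genuine probabilities; it does not reverse the model's ranking of
instances. Isotonic regression is monotone but not strictly so: it can merge
neighbouring scores into ties, so AUC stays close to the original but may
shift slightly. Sigmoid (Platt) scaling is a strictly monotone map, so it
normally leaves AUC unchanged.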
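
Whether the ranking is preserved exactly depends on the calibration method. The sketch below fits scikit-learn's `IsotonicRegression` directly on six hand-made scores (illustrative values, fitted and evaluated on the same toy data). It shows the step-function behaviour: distinct raw scores can collapse into ties, which can move the AUC.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score

# Six hand-made cases with alternating labels (illustrative only)
raw = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y_toy = np.array([0, 1, 0, 1, 0, 1])

# Isotonic regression fits a non-decreasing step function to the labels
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
cal = iso.fit_transform(raw, y_toy)

auc_raw = roc_auc_score(y_toy, raw)
auc_cal = roc_auc_score(y_toy, cal)

print("raw scores:       ", raw)
print("calibrated scores:", cal)
print(f"distinct values: {len(np.unique(raw))} -> {len(np.unique(cal))}")
print(f"AUC raw: {auc_raw:.3f}, AUC calibrated: {auc_cal:.3f}")
```

The order is never reversed, but pooled scores become ties, and ties are scored as half-wins in the AUC calculation.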

---


## 📝 PAUSE-AND-DO Exercise 2 (5 minutes)

**Task:** Check calibration and decide whether calibration is needed.

**Instructions:**
1. Review the calibration plots above
2. Assess whether the model is well-calibrated
3. Decide if calibration would improve decision-making
4. Justify your recommendation

---

### YOUR CALIBRATION ASSESSMENT HERE:

**Calibration Quality:**  
[Is the model well-calibrated? What patterns do you see?]

**Recommendation:**  
[Should we use calibrated probabilities?]

**Justification:**  
[Why or why not? What's the impact on decision-making?]

---

## 5. Decision Policy Summary

### 5.1 Final Recommendation

In [None]:
# Apply optimal threshold
y_val_pred_optimal = (y_val_proba >= best_threshold).astype(int)
total_value, counts = compute_expected_value(y_val, y_val_pred_optimal, COST_MATRIX)

print("=== DECISION POLICY RECOMMENDATION ===")
print(f"\nOptimal Threshold: {best_threshold:.2f}")
print(f"Expected Total Value: ${total_value:,.2f}")
print(f"Expected Value per Case: ${total_value/len(y_val):,.2f}")
print(f"\nConfusion Matrix at Optimal Threshold:")
print(f"  True Positives: {counts['TP']}")
print(f"  False Positives: {counts['FP']}")
print(f"  False Negatives: {counts['FN']}")
print(f"  True Negatives: {counts['TN']}")

print(f"\nClassification Report:")
print(classification_report(y_val, y_val_pred_optimal))

**Reading the output:**

The summary prints the **optimal threshold**, the **total expected value** in
dollars, and the per-case average. The confusion-matrix breakdown shows
exactly how many true positives, false positives, false negatives, and true
negatives result from that threshold. The classification report adds
precision, recall, and F1 for both classes, giving a complete picture.
Because the threshold was selected on this same validation set, treat the
dollar figure as an optimistic estimate and confirm it on the test set before
quoting it. Quoting all three views (dollars, counts, rates) makes the
recommendation accessible to finance, operations, and data-science audiences
alike.

**Why this matters:** A decision policy must be communicated in units each
stakeholder cares about -- dollars for the CFO, rates for the analyst,
counts for the operations team.
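
Before quoting a dollar figure, it helps to confirm it on data that played no part in choosing the threshold. The sketch below is a self-contained illustration of that check. It uses a small synthetic dataset and a logistic regression as stand-ins (both are assumptions for brevity, not the notebook's Random Forest), and `total_value` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COSTS = {"TP": 100, "FP": -30, "FN": -150, "TN": 0}  # same values as the notebook
SEED = 474

def total_value(y_true, proba, threshold, costs):
    """Total dollar value of thresholded predictions under a cost matrix."""
    pred = proba >= threshold
    pos = y_true == 1
    return int(
        (pred & pos).sum() * costs["TP"]
        + (pred & ~pos).sum() * costs["FP"]
        + (~pred & pos).sum() * costs["FN"]
        + (~pred & ~pos).sum() * costs["TN"]
    )

# Stand-in data and model (illustrative)
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           weights=[0.7, 0.3], random_state=SEED)
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.20, random_state=SEED)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=SEED)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_va = model.predict_proba(X_va)[:, 1]
p_te = model.predict_proba(X_te)[:, 1]

# Choose the threshold on validation data only
grid = np.linspace(0.05, 0.95, 19)
val_values = [total_value(y_va, p_va, t, COSTS) for t in grid]
chosen = float(grid[int(np.argmax(val_values))])

# Report the value of that fixed threshold on untouched test data
val_per_case = max(val_values) / len(y_va)
test_per_case = total_value(y_te, p_te, chosen, COSTS) / len(y_te)

print(f"Chosen threshold: {chosen:.2f}")
print(f"Value per case -> validation: ${val_per_case:.2f}, test: ${test_per_case:.2f}")
```

A test figure noticeably below the validation figure suggests the threshold was tuned to noise in the validation set.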

---


### 5.2 Sensitivity Analysis

Cost assumptions are never perfectly known. A sensitivity analysis varies the
false-negative cost over a plausible range and records how the optimal
threshold and expected value respond. If the threshold barely moves, the
decision policy is robust; if it swings dramatically, stakeholders need to
invest in tighter cost estimates before deploying.


In [None]:
# Test sensitivity to cost assumptions
fn_costs = [100, 150, 200, 250]
sensitivity_results = []

for fn_cost in fn_costs:
    temp_cost_matrix = COST_MATRIX.copy()
    temp_cost_matrix['FN'] = -fn_cost
    
    # Find optimal threshold for this cost
    best_value = -np.inf
    best_thresh = 0.5
    
    for threshold in thresholds:
        y_pred = (y_val_proba >= threshold).astype(int)
        value, _ = compute_expected_value(y_val, y_pred, temp_cost_matrix)
        if value > best_value:
            best_value = value
            best_thresh = threshold
    
    sensitivity_results.append({
        'FN_Cost': fn_cost,
        'Optimal_Threshold': best_thresh,
        'Expected_Value': best_value
    })

sensitivity_df = pd.DataFrame(sensitivity_results)

print("\n=== SENSITIVITY ANALYSIS ===")
print("How does optimal threshold change with FN cost?")
print(sensitivity_df)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(sensitivity_df['FN_Cost'], sensitivity_df['Optimal_Threshold'], marker='o')
axes[0].set_xlabel('False Negative Cost ($)')
axes[0].set_ylabel('Optimal Threshold')
axes[0].set_title('Threshold Sensitivity to FN Cost')
axes[0].grid(True)

axes[1].plot(sensitivity_df['FN_Cost'], sensitivity_df['Expected_Value'], marker='o')
axes[1].set_xlabel('False Negative Cost ($)')
axes[1].set_ylabel('Expected Value ($)')
axes[1].set_title('Expected Value vs FN Cost')
axes[1].grid(True)

plt.tight_layout()
plt.show()

**Reading the output:**

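The sensitivity table shows how the optimal threshold and expected value
change as the **false-negative cost** varies from $100 to $250. If the
optimal threshold barely shifts across this range, the policy is **robust**
to cost-estimation error. The two plots reinforce this: a flat threshold
line means stability, while a steep slope means the business must invest in
more precise cost estimates before committing to a threshold. One caveat: a
flat line can also appear when the optimum is pinned at the lowest threshold
in the sweep (0.10), in which case the grid, not the policy, is what is
stable. Check whether the reported threshold sits on the grid boundary.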

**Key takeaway:** Never present a single optimal threshold without showing
how sensitive it is to the assumptions. Sensitivity analysis is what turns a
model recommendation into a trustworthy business policy.
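
As a complement to the empirical sweep, the sketch below tabulates the break-even threshold implied by the cost matrix when two costs vary at once. It assumes calibrated probabilities, the FP range is hypothetical (the notebook varies only the FN cost), and `breakeven` is a helper name introduced for this illustration.

```python
import pandas as pd

TP_GAIN, TN_GAIN = 100, 0  # fixed at the notebook's values

def breakeven(tp, fp, fn, tn):
    """Probability above which flagging has higher expected value (calibrated p)."""
    return (tn - fp) / ((tn - fp) + (tp - fn))

fn_costs = [100, 150, 200, 250]  # same FN range as the notebook
fp_costs = [15, 30, 60]          # hypothetical FP range, not from the notebook

# Rows: FN cost, columns: FP cost, cells: break-even threshold
table = pd.DataFrame(
    {
        f"FP=${fp}": [breakeven(TP_GAIN, -fp, -fn, TN_GAIN) for fn in fn_costs]
        for fp in fp_costs
    },
    index=pd.Index([f"FN=${fn}" for fn in fn_costs], name="cost"),
)
print(table.round(3))
```

With the notebook's $30 FP cost, the implied threshold falls from about 0.13 to about 0.08 across the FN range, so part of that range lies below the 0.10 lower bound of the empirical sweep.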

---


## 6. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Cost-Sensitive Decisions**: How to translate predictions into actions using cost matrices
2. **Threshold Optimization**: Finding thresholds that maximize business value
3. **Calibration Assessment**: Evaluating whether probabilities are trustworthy
4. **Calibration Techniques**: Using isotonic regression to improve probability estimates
5. **Sensitivity Analysis**: Understanding how decisions change with cost assumptions

### Decision-Making Best Practices:

- ✓ Define cost matrix based on business reality, not convenience
- ✓ Optimize thresholds on validation set, not training set
- ✓ Check calibration before using probabilities for decisions
- ✓ Perform sensitivity analysis on cost assumptions
- ✓ Document decision policy clearly for stakeholders
- ✓ Plan for monitoring and updating thresholds over time

### Remember:

> **"Model performance metrics don't pay the bills - business value does."**  
> Always optimize for business outcomes, not just AUC or accuracy.

---

## Participation Assignment Submission Instructions

### To Submit This Notebook:

1. **Complete all exercises**: Fill in both PAUSE-AND-DO exercise cells with your findings
2. **Run All Cells**: Execute `Runtime → Run all` to ensure everything works
3. **Save a Copy**: `File → Save a copy in Drive`
4. **Submit**: Upload your `.ipynb` file to the participation assignment on the course Brightspace page.

### Before Submitting, Check:

- [ ] All cells execute without errors
- [ ] All outputs are visible
- [ ] Both exercise responses are complete
- [ ] Notebook is shared with correct permissions
- [ ] You can explain every line of code you wrote

### Next Step:

Complete the **Quiz** in Brightspace (auto-graded)

---

## Bibliography

- Provost, F., & Fawcett, T. (2013). *Data Science for Business*. O'Reilly Media.
- scikit-learn User Guide: [Probability Calibration](https://scikit-learn.org/stable/modules/calibration.html)
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting good probabilities with supervised learning." *ICML*.
- Zadrozny, B., & Elkan, C. (2001). "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers." *ICML*.

---




<center>

Thank you!

</center>