<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/24_classification_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Classification Models

This notebook contains code examples from the **Evaluating Classification Models** chapter (Chapter 24) of the BANA 4080 textbook. Follow along to practice advanced classification evaluation metrics using pandas, scikit-learn, and Python.

## 📚 Chapter Overview

This chapter teaches you to evaluate classification models using metrics that align with business reality. While accuracy seems intuitive, it can be deeply misleading in real business scenarios, especially with imbalanced datasets like credit default prediction.

## 🎯 What You'll Practice

- Identify the "accuracy trap" and understand why 97.3% accuracy can be misleading with imbalanced data
- Construct and interpret confusion matrices to understand exactly how your model makes errors
- Calculate precision, recall, and F1-score and explain their business implications for different scenarios
- Use ROC curves and AUC to evaluate model ranking quality for risk-based pricing strategies
- Design business-aligned evaluation frameworks that select appropriate metrics based on specific costs

## 💡 How to Use This Notebook

1. **Read the chapter first** - This notebook supplements the textbook; it does not replace it
2. **Run cells sequentially** - Code builds on previous examples
3. **Experiment freely** - Modify code to test your understanding
4. **Practice variations** - Try different approaches to reinforce learning

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from ISLP import load_data
import warnings

# Suppress numerical warnings for cleaner output
warnings.filterwarnings('ignore', category=RuntimeWarning)

# Load the Default dataset from Chapter 23
Default = load_data('Default')
print("Default dataset shape:", Default.shape)
print("\nFirst few rows:")
Default.head()

## The Accuracy Trap: When High Accuracy Misleads

Let's start by exploring why accuracy can be deeply misleading in business scenarios, especially with imbalanced datasets like credit default prediction.
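
A quick back-of-the-envelope check shows the trap before touching any data. The sketch below uses hypothetical counts (1,000 customers, 30 of whom default; these figures are assumptions for illustration, not the Default dataset) and a model that always predicts "no default".

```python
# Hypothetical counts, chosen only for illustration
n_customers = 1000
n_defaults = 30

# A model that always predicts "no default" is right for every non-defaulter...
lazy_correct = n_customers - n_defaults
lazy_accuracy = lazy_correct / n_customers

# ...and catches none of the customers who actually default
lazy_defaults_caught = 0
lazy_recall = lazy_defaults_caught / n_defaults

print(f"Accuracy: {lazy_accuracy:.1%}")
print(f"Defaults caught: {lazy_recall:.1%}")
```

The accuracy looks strong even though the model never identifies a single default, which is the pattern to watch for in the cells that follow.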

In [None]:
# Explore the class imbalance in our Default dataset
print("Default distribution:")
print(Default['default'].value_counts())
print(f"\nDefault rate: {Default['default'].value_counts(normalize=True)['Yes']:.1%}")

# What would happen if we always predicted "No Default"?
naive_accuracy = (Default['default'] == 'No').mean()
print(f"\nAccuracy of always predicting 'No Default': {naive_accuracy:.1%}")
print(f"This 'lazy' model provides ZERO business value but achieves high accuracy!")

In [None]:
# Simulate fraud detection scenario to demonstrate the accuracy paradox
np.random.seed(42)
n_transactions = 100000
fraud_rate = 0.01

# True labels: 1% fraud, 99% legitimate
y_true = np.random.binomial(1, fraud_rate, n_transactions)

# Model A: "Lazy" model that always predicts "legitimate" 
y_pred_lazy = np.zeros(n_transactions)  # Always predicts 0 (legitimate)

# Model B: "Smart" model that catches some fraud but makes some mistakes
y_pred_smart = y_true.copy()
# Miss 20% of fraud (false negatives)
fraud_indices = np.where(y_true == 1)[0]
missed_fraud = np.random.choice(fraud_indices, int(0.2 * len(fraud_indices)), replace=False)
y_pred_smart[missed_fraud] = 0

# Flag 2% of legitimate transactions as fraud (false positives)  
legit_indices = np.where(y_true == 0)[0]
false_flags = np.random.choice(legit_indices, int(0.02 * len(legit_indices)), replace=False)
y_pred_smart[false_flags] = 1

# Calculate accuracies
accuracy_lazy = accuracy_score(y_true, y_pred_lazy)
accuracy_smart = accuracy_score(y_true, y_pred_smart)

print("Fraud Detection Model Comparison:")
print(f"Dataset: {n_transactions:,} transactions, {fraud_rate:.1%} fraud rate")
print(f"\nModel A (Lazy): {accuracy_lazy:.1%} accuracy")
print(f"Model B (Smart): {accuracy_smart:.1%} accuracy")
print(f"\nWhich model would you choose for your business?")

### 🏃‍♂️ Try It Yourself

Create your own accuracy paradox example. Try different fraud rates (0.5%, 2%, 5%) and see how the "lazy model" accuracy changes. What happens to the accuracy as fraud becomes rarer?

In [None]:
# Your code here
# Try different fraud rates and calculate lazy model accuracy
fraud_rates = [0.005, 0.02, 0.05]
for rate in fraud_rates:
    # Simulate a small dataset at this fraud rate
    # (named y_sim so it cannot be confused with the y_test split created later)
    y_sim = np.random.binomial(1, rate, 10000)
    lazy_pred = np.zeros(len(y_sim), dtype=int)  # Always predict 0
    lazy_acc = accuracy_score(y_sim, lazy_pred)
    print(f"Fraud rate: {rate:.1%}, Lazy model accuracy: {lazy_acc:.1%}")

## Building the Classification Model (Following Chapter 23)

Let's build the same logistic regression model from Chapter 23 to evaluate its performance using proper metrics.

In [None]:
# Use the Default dataset from Chapter 23 with identical preparation
Default_encoded = pd.get_dummies(Default, columns=['student'], drop_first=True)
Default_encoded['default_binary'] = (Default_encoded['default'] == 'Yes').astype(int)

# Use the same feature matrix and target as Chapter 23
X = Default_encoded[['balance', 'income', 'student_Yes']]
y = Default_encoded['default_binary']

# Split the data using the same approach as Chapter 23 for consistency
X_simple = Default_encoded[['balance']]
X_simple_train, X_simple_test, X_train, X_test, y_train, y_test = train_test_split(
    X_simple, X, y, test_size=0.3, random_state=42
)

# Fit the same logistic regression model from Chapter 23
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions on test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f"Dataset context (matching Chapter 23 results):")
print(f"Total test examples: {len(y_test):,}")
print(f"Actual default cases: {y_test.sum():,} ({y_test.mean():.1%})")
print(f"Actual non-default cases: {len(y_test) - y_test.sum():,} ({1-y_test.mean():.1%})")
print(f"\nBasic accuracy: {accuracy_score(y_test, y_pred):.1%}")

## The Confusion Matrix: Foundation for Understanding Model Performance

The confusion matrix shows exactly how your model makes mistakes, providing the foundation for all other classification metrics.
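
Before using scikit-learn's `confusion_matrix`, it can help to build one by hand. The sketch below uses eight hypothetical customers (the labels are made up for illustration) and arranges the counts with rows as actual classes and columns as predicted classes, which is the layout `confusion_matrix` returns for 0/1 labels: `[[TN, FP], [FN, TP]]`.

```python
from collections import Counter

# Hypothetical labels for eight customers (1 = default, 0 = no default)
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 0]

# Count each (actual, predicted) pair
pair_counts = Counter(zip(actual, predicted))

toy_tn = pair_counts[(0, 0)]  # actual 0, predicted 0
toy_fp = pair_counts[(0, 1)]  # actual 0, predicted 1
toy_fn = pair_counts[(1, 0)]  # actual 1, predicted 0
toy_tp = pair_counts[(1, 1)]  # actual 1, predicted 1

# Rows are actual classes, columns are predicted classes
toy_cm = [[toy_tn, toy_fp],
          [toy_fn, toy_tp]]
print(toy_cm)
```

Keeping this row and column order in mind makes it easier to read the unpacked `tn, fp, fn, tp` values correctly.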

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix for Default Prediction Model:")
print(cm)

# Extract the four core values from our confusion matrix
tn, fp, fn, tp = cm.ravel()
total = tn + fp + fn + tp

print("\nConfusion Matrix Components for Default Prediction:")
print("=" * 60)
print(f"True Negatives (TN):  {tn:,} - Correctly identified non-default customers")
print(f"False Positives (FP): {fp:,} - Safe customers incorrectly flagged as high risk")
print(f"False Negatives (FN): {fn:,} - Risky customers that were missed")
print(f"True Positives (TP):  {tp:,} - Correctly identified default customers")
print(f"Total customers:      {total:,}")

# Manually calculate basic accuracy
accuracy = (tp + tn) / total
print(f"\nAccuracy = (TP + TN) / Total = ({tp} + {tn}) / {total} = {accuracy:.3f} or {accuracy:.1%}")

In [None]:
# Visualize the confusion matrix with business context
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted: No Default', 'Predicted: Default'],
            yticklabels=['Actual: No Default', 'Actual: Default'])
plt.title('Confusion Matrix: Default Prediction Model')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.tight_layout()
plt.show()

print("\nBusiness Impact Analysis:")
print(f"• False Positives ({fp} customers): Good customers denied credit → Lost revenue, customer churn")
print(f"• False Negatives ({fn} customers): Bad customers approved for credit → Direct financial losses")

### 🏃‍♂️ Try It Yourself

Calculate the business impact of this confusion matrix. Assume each false positive costs $50 (unhappy customer) and each false negative costs $500 (missed default). What's the total cost of model errors?

In [None]:
# Your code here
fp_cost = 50   # Cost of incorrectly flagging good customer
fn_cost = 500  # Cost of missing default

total_fp_cost = fp * fp_cost
total_fn_cost = fn * fn_cost
total_cost = total_fp_cost + total_fn_cost

print(f"Business costs:")
print(f"• False positives: {fp:,} × ${fp_cost} = ${total_fp_cost:,}")
print(f"• False negatives: {fn:,} × ${fn_cost} = ${total_fn_cost:,}")
print(f"• Total error cost: ${total_cost:,}")

## Essential Classification Metrics: Precision, Recall, and F1-Score

While accuracy treats all errors equally, business decisions require understanding the specific types of errors your model makes.
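
As a reference for the cells that follow, here is a minimal sketch of the three metrics computed directly from confusion-matrix counts. The helper name `precision_recall_f1` and the counts passed to it are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Precision: of everyone flagged, what share truly defaulted?
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: of everyone who truly defaulted, what share did we flag?
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall
    if precision + recall == 0:
        return precision, recall, 0.0
    f1_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f1_value

# Hypothetical counts: 30 true positives, 10 false positives, 70 false negatives
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=30, fp=10, fn=70)
print(f"Precision: {toy_precision:.3f}, Recall: {toy_recall:.3f}, F1: {toy_f1:.3f}")
```

Note how the F1-score lands closer to the weaker of the two metrics: a model cannot earn a high F1 by excelling at only one of them.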

In [None]:
# Calculate precision manually and verify with sklearn
precision = tp / (tp + fp)

# Verify with sklearn's precision_score function
sklearn_precision = precision_score(y_test, y_pred)

print("PRECISION ANALYSIS")
print("=" * 30)
print(f"Precision = TP / (TP + FP) = {tp} / ({tp} + {fp}) = {precision:.3f} or {precision:.1%}")
print(f"\nBusiness Interpretation:")
print(f"• When our model flags a customer as 'high default risk', it's correct {precision:.1%} of the time")
print(f"• Out of {tp + fp} customers flagged as high risk, {tp} actually defaulted")
print(f"• {fp} customers were incorrectly flagged (false alarms)")

print(f"\nSklearn verification:")
print(f"• Manual calculation: {precision:.3f}")
print(f"• sklearn precision_score: {sklearn_precision:.3f}")
print(f"• Results match: {'✓' if abs(precision - sklearn_precision) < 0.001 else '✗'}")

In [None]:
# Calculate recall manually and verify with sklearn
recall = tp / (tp + fn)

# Verify with sklearn's recall_score function
sklearn_recall = recall_score(y_test, y_pred)

print("RECALL ANALYSIS")
print("=" * 30)
print(f"Recall = TP / (TP + FN) = {tp} / ({tp} + {fn}) = {recall:.3f} or {recall:.1%}")
print(f"\nBusiness Interpretation:")
print(f"• Our model catches {recall:.1%} of all customers who actually default")
print(f"• Out of {tp + fn} customers who actually defaulted, we caught {tp}")
print(f"• We missed {fn} customers who defaulted (this could be costly!)")

print(f"\nSklearn verification:")
print(f"• Manual calculation: {recall:.3f}")
print(f"• sklearn recall_score: {sklearn_recall:.3f}")
print(f"• Results match: {'✓' if abs(recall - sklearn_recall) < 0.001 else '✗'}")

In [None]:
# Comprehensive model evaluation
print("DEFAULT PREDICTION MODEL EVALUATION")
print("=" * 40)
print(f"Precision: {precision:.1%} - When we flag someone as high risk, we're right {precision:.1%} of the time")
print(f"Recall: {recall:.1%} - We catch {recall:.1%} of all customers who actually default")

# Business cost implications
print(f"\nBusiness Impact:")
print(f"• High precision ({precision:.1%}) = Few false alarms = Happy customers")
print(f"• Low recall ({recall:.1%}) = Miss many defaults = Financial losses")
print(f"\nThis suggests our model is conservative - it makes fewer false accusations,")
print(f"but it misses many customers who will actually default.")

In [None]:
# Demonstrate the precision-recall trade-off
print("PRECISION-RECALL TRADE-OFF DEMONSTRATION")
print("=" * 45)

# Test different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'Business Impact'}")
print("-" * 70)

for threshold in thresholds:
    # Make predictions at this threshold
    y_pred_thresh = (y_pred_proba > threshold).astype(int)
    
    if y_pred_thresh.sum() > 0:  # Avoid division by zero
        prec = precision_score(y_test, y_pred_thresh)
        rec = recall_score(y_test, y_pred_thresh)
        
        # Interpret the business impact
        if threshold <= 0.3:
            impact = "Flag many as risky - catch more defaults but annoy customers"
        elif threshold >= 0.7:
            impact = "Flag few as risky - happy customers but miss defaults"
        else:
            impact = "Balanced approach"
            
        print(f"{threshold:<12.1f} {prec:<12.3f} {rec:<12.3f} {impact}")
    else:
        print(f"{threshold:<12.1f} {'N/A':<12} {'0.000':<12} No customers flagged as risky")

In [None]:
# Calculate F1-score manually and verify with sklearn
f1 = 2 * (precision * recall) / (precision + recall)

# Compare with sklearn
sklearn_f1 = f1_score(y_test, y_pred)

print("F1-SCORE CALCULATION")
print("=" * 25)
print(f"F1-Score = 2 × (Precision × Recall) / (Precision + Recall)")
print(f"F1-Score = 2 × ({precision:.3f} × {recall:.3f}) / ({precision:.3f} + {recall:.3f})")
print(f"F1-Score = {f1:.3f} or {f1:.1%}")

print(f"\nSklearn verification: F1-Score = {sklearn_f1:.3f}")
print(f"Results match: {'✓' if abs(f1 - sklearn_f1) < 0.001 else '✗'}")

print(f"\nBusiness Interpretation:")
print(f"The F1-score of {f1:.1%} reflects the challenge of predicting rare events.")
print(f"While our model has good precision (few false alarms), it suffers from")
print(f"poor recall (missing many actual defaults).")

### 🏃‍♂️ Try It Yourself

Find the threshold that maximizes F1-score for our model. What business trade-offs does this represent? Would you recommend this threshold for a conservative bank vs. an aggressive lender?

In [None]:
# Your code here
# Find optimal threshold for F1-score
thresholds_fine = np.arange(0.05, 0.95, 0.05)
f1_scores = []

for threshold in thresholds_fine:
    y_pred_thresh = (y_pred_proba > threshold).astype(int)
    if y_pred_thresh.sum() > 0:  # Avoid division by zero
        f1_thresh = f1_score(y_test, y_pred_thresh)
        f1_scores.append((threshold, f1_thresh))

# Find best threshold
best_threshold, best_f1 = max(f1_scores, key=lambda x: x[1])
print(f"Optimal threshold for F1-score: {best_threshold:.2f}")
print(f"Best F1-score: {best_f1:.3f}")

# Analyze what this means
y_pred_optimal = (y_pred_proba > best_threshold).astype(int)
prec_opt = precision_score(y_test, y_pred_optimal)
rec_opt = recall_score(y_test, y_pred_optimal)

print(f"\nAt optimal threshold:")
print(f"• Precision: {prec_opt:.1%}")
print(f"• Recall: {rec_opt:.1%}")
print(f"\nThis represents a more balanced approach between catching defaults and avoiding false alarms.")

## ROC Curves and AUC: Evaluating Ranking Quality

ROC curves evaluate how well your model distinguishes between classes across all possible thresholds - crucial for risk ranking and pricing decisions.
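
One way to build intuition for AUC is its ranking interpretation: it equals the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as half. The sketch below computes that share by brute force on hypothetical scores; the helper `pairwise_auc` is illustrative only, and `roc_auc_score` remains the practical choice for real data.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted default probabilities
defaulter_scores = [0.9, 0.6, 0.4]
non_defaulter_scores = [0.5, 0.3, 0.2, 0.1]

toy_auc = pairwise_auc(defaulter_scores, non_defaulter_scores)
print(f"Pairwise AUC: {toy_auc:.3f}")
```

Here 11 of the 12 pairs are ranked correctly; the only miss is the defaulter scored 0.4 against the non-defaulter scored 0.5.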

In [None]:
# Calculate ROC curve and AUC score
fpr, tpr, roc_thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"Our model's AUC: {auc_score:.3f}")
print(f"\nSimple interpretation: there is a {auc_score:.1%} chance that our model ranks a randomly")
print(f"chosen defaulting customer as higher risk than a randomly chosen non-defaulting customer.")

In [None]:
# Create ROC curve visualization
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=3, label=f'Our Model (AUC = {auc_score:.3f})', color='blue')
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Guessing (AUC = 0.5)', alpha=0.7)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve: How Well Our Model Ranks Customers by Risk')
plt.legend()
plt.grid(True, alpha=0.3)

# Highlight the ideal points
plt.annotate('Perfect Model\n(0, 1)', xy=(0, 1), xytext=(0.02, 0.9),
            arrowprops=dict(arrowstyle='->', color='green'), fontsize=10)
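# Point the "Our Model" arrow at the spot on the curve where recall first reaches 90%
idx = np.argmax(tpr >= 0.9)
plt.annotate('Our Model\nPerformance', xy=(fpr[idx], tpr[idx]), xytext=(0.25, 0.7),
            arrowprops=dict(arrowstyle='->', color='blue'), fontsize=10)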

plt.tight_layout()
plt.show()

# AUC interpretation guide
print("\nAUC Score Interpretation Guide:")
print("• 0.9 - 1.0: Outstanding (Deploy with confidence)")
print("• 0.8 - 0.9: Excellent (Strong business value)")
print("• 0.7 - 0.8: Good (Useful with monitoring)")
print("• 0.6 - 0.7: Fair (Limited value)")
print("• 0.5 - 0.6: Poor (Barely better than random)")
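# Pick the label from the same bands listed in the guide above
auc_label = ('Outstanding' if auc_score >= 0.9 else 'Excellent' if auc_score >= 0.8
             else 'Good' if auc_score >= 0.7 else 'Fair' if auc_score >= 0.6 else 'Poor')
print(f"\nOur model AUC of {auc_score:.3f} is {auc_label}!")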

In [None]:
# Create risk tiers using our model's probability predictions
risk_buckets = pd.qcut(y_pred_proba, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
risk_analysis = pd.DataFrame({
    'Risk_Bucket': risk_buckets,
    'Actual_Default': y_test
}).groupby('Risk_Bucket', observed=True)['Actual_Default'].mean()

print("Default Rates by Risk Tier:")
for bucket, default_rate in risk_analysis.items():
    print(f"{bucket:>10}: {default_rate:.4%} default rate")

print(f"\nOur model creates excellent risk separation!")
print(f"This AUC of {auc_score:.3f} enables confident risk-based pricing decisions.")

### 🏃‍♂️ Try It Yourself

Create a risk-based pricing strategy using the model's probability scores. Suggest appropriate interest rates for each risk bucket based on the default rates you calculated above.

In [None]:
# Your code here
# Create pricing strategy based on risk
base_rate = 5.0  # Base interest rate in %

print("Risk-Based Pricing Strategy:")
print("=" * 35)

for bucket, default_rate in risk_analysis.items():
    # Add risk premium based on default rate
    risk_premium = default_rate * 100 * 2  # 2x the default rate as premium
    interest_rate = base_rate + risk_premium
    
    print(f"{bucket:>10}: {interest_rate:.1f}% interest rate ({default_rate:.1%} default risk)")

print(f"\nThis pricing strategy reflects actual risk levels while remaining competitive.")

## Choosing the Right Metric for Your Business Context

The most sophisticated aspect of classification evaluation is aligning your choice of metrics with your specific business context and cost structure.
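
When the two error costs are known, they imply a decision threshold directly. The sketch below assumes the predicted probabilities are well calibrated and that correct decisions cost nothing; under those assumptions, flagging a customer is cheaper in expectation whenever the default probability exceeds `fp_cost / (fp_cost + fn_cost)`. The helper name `cost_minimizing_threshold` is hypothetical, and the $50 and $500 costs are the illustrative figures used in this notebook's exercises.

```python
def cost_minimizing_threshold(fp_cost, fn_cost):
    """Probability above which flagging has lower expected cost than not flagging.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    # Flagging costs (1 - p) * fp_cost in expectation; not flagging costs p * fn_cost.
    # Flag when p * fn_cost > (1 - p) * fp_cost, i.e. p > fp_cost / (fp_cost + fn_cost).
    return fp_cost / (fp_cost + fn_cost)

# Illustrative costs: $50 per false positive, $500 per false negative
toy_threshold = cost_minimizing_threshold(fp_cost=50, fn_cost=500)
print(f"Flag customers whose predicted default probability exceeds {toy_threshold:.3f}")
```

With missed defaults ten times as costly as false alarms, the implied threshold sits far below 0.5. In practice the probabilities may not be perfectly calibrated, so it is worth comparing this value against an empirical threshold search on held-out data.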

In [None]:
# Business-aligned metric selection framework
print("BUSINESS-ALIGNED METRIC SELECTION GUIDE")
print("=" * 50)
print("\n• Use PRECISION when: False positives cost more than false negatives")
print("  Examples: Credit card fraud detection, Email spam filtering")
print("\n• Use RECALL when: False negatives cost more than false positives")
print("  Examples: Disease screening, Safety system alerts")
print("\n• Use F1-SCORE when: Both error types matter equally")
print("  Examples: Marketing campaigns, Quality control")
print("\n• Use ROC-AUC when: You need ranking quality across all thresholds")
print("  Examples: Insurance pricing, Risk assessment")
print("\n• Avoid ACCURACY when: You have imbalanced classes")
print("  Like our 3% default rate - accuracy can be misleading!")

# Summary of our model's performance
print(f"\nOUR DEFAULT MODEL PERFORMANCE SUMMARY:")
print(f"• Precision: {precision:.1%} (Good - few false alarms)")
print(f"• Recall: {recall:.1%} (Poor - misses many defaults)")
print(f"• F1-Score: {f1:.1%} (Low - reflects precision-recall imbalance)")
print(f"• AUC: {auc_score:.3f} (Excellent - great for risk ranking)")

print(f"\nBUSINESS RECOMMENDATION:")
print(f"This model is excellent for risk-based pricing (high AUC) but may")
print(f"need threshold adjustment for binary default decisions (low recall).")

## 🚀 Practice Challenges

Test your understanding with these additional exercises that combine multiple concepts from the chapter.

### Challenge 1: Custom Business Metric

Create a custom evaluation metric that calculates the total business cost for our Default model. Use the costs: False Positive = $50, False Negative = $500. Which threshold minimizes total cost?

In [None]:
# Your solution here
def calculate_business_cost(y_true, y_pred, fp_cost=50, fn_cost=500):
    """
    Calculate total business cost based on false positives and false negatives
    """
    # labels=[0, 1] keeps the matrix 2x2 even if a class is missing from the inputs
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    total_cost = (fp * fp_cost) + (fn * fn_cost)
    return total_cost, fp, fn

# Test different thresholds to find minimum cost
thresholds = np.arange(0.1, 0.9, 0.05)
costs = []

for thresh in thresholds:
    y_pred_thresh = (y_pred_proba > thresh).astype(int)
    cost, fp_count, fn_count = calculate_business_cost(y_test, y_pred_thresh)
    costs.append((thresh, cost, fp_count, fn_count))

# Find optimal threshold
best_threshold, min_cost, best_fp, best_fn = min(costs, key=lambda x: x[1])

print(f"Optimal threshold for minimum business cost: {best_threshold:.2f}")
print(f"Minimum total cost: ${min_cost:,}")
print(f"At this threshold: {best_fp} false positives, {best_fn} false negatives")

# Compare with default threshold
default_cost, default_fp, default_fn = calculate_business_cost(y_test, y_pred)
print(f"\nComparison with default threshold (0.5):")
print(f"Default cost: ${default_cost:,}")
print(f"Cost savings: ${default_cost - min_cost:,}")

### Challenge 2: Model Comparison

Compare our logistic regression model with a simple rule-based model (predict default if balance > $1500). Which performs better on different metrics?

In [None]:
# Your solution here
# Create simple rule-based model
balance_threshold = 1500
X_test_balance = X_test['balance']
y_pred_rule = (X_test_balance > balance_threshold).astype(int)

# Calculate metrics for both models
models = {
    'Logistic Regression': y_pred,
    'Rule-Based (Balance > $1500)': y_pred_rule
}

print("MODEL COMPARISON:")
print("=" * 50)
print(f"{'Model':<30} {'Precision':<12} {'Recall':<10} {'F1-Score':<10} {'Accuracy':<10}")
print("-" * 76)

for name, predictions in models.items():
    prec = precision_score(y_test, predictions)
    rec = recall_score(y_test, predictions)
    f1_model = f1_score(y_test, predictions)  # separate name so the earlier `f1` is not overwritten
    acc = accuracy_score(y_test, predictions)
    
    print(f"{name:<30} {prec:<12.3f} {rec:<10.3f} {f1_model:<10.3f} {acc:<10.3f}")

print(f"\nBusiness Insights:")
print(f"• Logistic regression combines balance, income, and student status; compare its row above to see where that helps")
print(f"• Rule-based model is simpler to explain but may miss nuanced patterns")
print(f"• Choice depends on business needs: interpretability vs. performance")

### Challenge 3: Threshold Strategy for Different Business Goals

Design optimal thresholds for three different business strategies: (1) Conservative bank (minimize defaults), (2) Growth-focused bank (maximize approvals), (3) Balanced approach.

In [None]:
# Your solution here
# Define different business strategies
strategies = {
    'Conservative (Minimize Defaults)': 'recall',  # Catch more defaults
    'Growth-Focused (Maximize Approvals)': 'precision',  # Fewer false alarms
    'Balanced Approach': 'f1'  # Balance both
}

print("BUSINESS STRATEGY OPTIMIZATION:")
print("=" * 40)

for strategy_name, metric in strategies.items():
    best_score = 0
    best_thresh = 0.5
    
    for thresh in np.arange(0.1, 0.9, 0.05):
        y_pred_thresh = (y_pred_proba > thresh).astype(int)
        
        if y_pred_thresh.sum() > 0:  # Avoid division by zero
            if metric == 'recall':
                score = recall_score(y_test, y_pred_thresh)
            elif metric == 'precision':
                score = precision_score(y_test, y_pred_thresh)
            else:  # f1
                score = f1_score(y_test, y_pred_thresh)
            
            if score > best_score:
                best_score = score
                best_thresh = thresh
    
    # Calculate metrics at optimal threshold
    y_pred_optimal = (y_pred_proba > best_thresh).astype(int)
    prec = precision_score(y_test, y_pred_optimal)
    rec = recall_score(y_test, y_pred_optimal)
    f1_opt = f1_score(y_test, y_pred_optimal)  # separate name so the earlier `f1` is not overwritten
    
    print(f"\n{strategy_name}:")
    print(f"  Optimal threshold: {best_thresh:.2f}")
    print(f"  Precision: {prec:.1%}, Recall: {rec:.1%}, F1: {f1_opt:.1%}")
    
    # Business interpretation: applicants NOT flagged (probability <= threshold) are approved
    approval_rate = (y_pred_proba <= best_thresh).mean()
    print(f"  Approval rate: {approval_rate:.1%} of applicants")

print(f"\nConclusion: Different business goals require different thresholds!")
print(f"The 'optimal' threshold depends entirely on your business priorities.")

## 📝 Chapter Summary

In this notebook, you practiced:

- ✅ Understanding why accuracy can be misleading with imbalanced data (3% default rate)
- ✅ Constructing and interpreting confusion matrices for business decision-making
- ✅ Calculating precision, recall, and F1-score and understanding their business implications
- ✅ Using ROC curves and AUC to evaluate model ranking quality for risk-based pricing
- ✅ Aligning evaluation metrics with specific business costs and objectives
- ✅ Analyzing precision-recall trade-offs and threshold optimization for different business strategies

## 🔗 Connections to Other Chapters

- **Previous chapters**: Built on logistic regression from Chapter 23, using the same Default dataset and model
- **Upcoming chapters**: These evaluation principles apply to all classification algorithms you'll learn (decision trees, random forests, neural networks, etc.)

## 📚 Additional Resources

- [Scikit-learn Classification Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
- [Understanding ROC Curves](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
- [Precision and Recall Trade-offs](https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/)

## 🎯 Next Steps

1. **Review the chapter** to reinforce concepts about business-aligned evaluation
2. **Complete the end-of-chapter exercises** in the textbook using different business scenarios
3. **Practice with your own classification problems** to build intuition
4. **Apply these evaluation principles** to future classification algorithms in the course