# 📊 ML Cutoff Optimizer - Basic Usage

This notebook demonstrates the basic usage of **ML Cutoff Optimizer**, a library for finding optimal probability thresholds in binary classification.

## 🎯 What You'll Learn

1. How to create synthetic binary classification data
2. How to train a simple model
3. How to visualize probability distributions
4. How to find optimal cutoff points for three decision zones
5. How to interpret the results

---

## 📦 Step 1: Import Libraries

First, let's import all the necessary libraries.

In [None]:
# Standard libraries
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn for dataset and model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Add project to path (if running from examples/notebooks/)
sys.path.insert(0, '../../src')

# Our library!
from ml_cutoff_optimizer import ThresholdVisualizer, CutoffOptimizer, MetricsCalculator

print("✅ All libraries imported successfully!")
print(f"   NumPy version: {np.__version__}")
print(f"   Pandas version: {pd.__version__}")

## 📊 Step 2: Create Synthetic Dataset

We'll create a synthetic binary classification dataset with:
- 1000 samples
- 20 features
- 2 classes (0 and 1)
- Mildly imbalanced classes (about 60% class 0, 40% class 1, before label noise)

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.6, 0.4],  # 60% class 0, 40% class 1
    flip_y=0.05,  # 5% label noise
    random_state=42
)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("📊 Dataset Created:")
print(f"   Training set: {len(X_train)} samples")
print(f"   Test set:     {len(X_test)} samples")
print(f"\n   Class distribution (test set):")
print(f"   Class 0: {np.sum(y_test == 0)} samples ({np.sum(y_test == 0)/len(y_test)*100:.1f}%)")
print(f"   Class 1: {np.sum(y_test == 1)} samples ({np.sum(y_test == 1)/len(y_test)*100:.1f}%)")

## 🤖 Step 3: Train a Binary Classification Model

We'll use Logistic Regression, but **any binary classifier works** (Random Forest, XGBoost, Neural Networks, etc.).

The only requirement is that your model can output **probabilities** (via `predict_proba()`).
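
As a rough illustration of that requirement, the sketch below defines a hypothetical `ToyScorer` class (not part of scikit-learn or ML Cutoff Optimizer) that follows the scikit-learn `predict_proba()` convention: one row per sample, one column per class, with column 1 holding the class 1 probability. The weights and inputs are made-up values.

```python
import numpy as np

class ToyScorer:
    """Hypothetical stand-in for any model that exposes predict_proba()."""

    def __init__(self, weights, bias=0.0):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def predict_proba(self, X):
        # Logistic function of a linear score gives P(class 1)
        scores = np.asarray(X, dtype=float) @ self.weights + self.bias
        p1 = 1.0 / (1.0 + np.exp(-scores))
        # Column 0 = P(class 0), column 1 = P(class 1), as in scikit-learn
        return np.column_stack([1.0 - p1, p1])

toy_model = ToyScorer(weights=[1.5, -2.0])
X_demo = np.array([[2.0, 0.5], [0.0, 0.0], [-1.0, 1.0]])

proba = toy_model.predict_proba(X_demo)
y_proba_demo = proba[:, 1]  # same slicing as used in this notebook
```

Any object that returns an array of this shape can supply the `y_proba` vector used in the rest of the notebook.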

In [None]:
# Train Logistic Regression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Get predictions and probabilities
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability for class 1

# Evaluate model (standard threshold = 0.5)
accuracy = model.score(X_test, y_test)

print("🤖 Model Training Complete:")
print(f"   Model: Logistic Regression")
print(f"   Accuracy (threshold=0.5): {accuracy:.2%}")
print(f"\n   Probability statistics:")
print(f"   Min:  {y_proba.min():.4f}")
print(f"   Mean: {y_proba.mean():.4f}")
print(f"   Max:  {y_proba.max():.4f}")
print(f"\n   Classification Report (threshold=0.5):")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

## 🎨 Step 4: Visualize Probability Distributions

Now let's visualize how the predicted probabilities are distributed.

**What to look for:**
- 🔵 **Blue bars**: Overall population distribution
- 🔴 **Red bars**: Distribution of positive class (y=1) only
- **Good model**: Red bars concentrated on the right (high probabilities for class 1)
- **Bad model**: Red bars spread evenly (model is guessing)
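
To make the two sets of bars concrete, here is a minimal NumPy sketch of how overlapping histograms like these could be computed. It assumes that a step of 0.05 means 20 equal-width bins over [0, 1]. How `ThresholdVisualizer` bins internally is not shown in this notebook, and the labels and scores are toy values.

```python
import numpy as np

# Toy labels and scores (illustrative values, not from the notebook's model)
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score_demo = np.array([0.03, 0.12, 0.18, 0.41, 0.44, 0.58, 0.63, 0.81, 0.92, 0.97])

n_bins = 20  # assumed: 20 bins of width 0.05
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)

# "Blue bars": everyone; "red bars": positive class only
all_counts, _ = np.histogram(y_score_demo, bins=bin_edges)
pos_counts, _ = np.histogram(y_score_demo[y_true_demo == 1], bins=bin_edges)

# Express both as a percentage of the whole population so the bars are comparable
all_pct = 100.0 * all_counts / len(y_score_demo)
pos_pct = 100.0 * pos_counts / len(y_score_demo)

# Share of positives that land in the upper half of the probability range
pos_upper_share = pos_counts[n_bins // 2:].sum() / pos_counts.sum()
```

A value of `pos_upper_share` close to 1 corresponds to the "red bars on the right" pattern described above.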

In [None]:
# Create visualizer with 5% bins
visualizer = ThresholdVisualizer(y_test, y_proba, step=0.05)

# Plot distributions
fig, ax = visualizer.plot_distributions(
    figsize=(16, 7),
    title="Probability Distribution Analysis - Basic Example"
)

plt.show()

print("\n📈 Interpretation:")
print("   - If red bars are on the RIGHT: Model gives high probabilities to positive class ✅")
print("   - If red bars are SPREAD OUT: Model is uncertain about positive class ⚠️")
print("   - If red bars are on the LEFT: Model is confused (needs improvement) ❌")

## 🎯 Step 5: Find Optimal Cutoffs (Three-Zone Strategy)

Instead of using a single threshold (0.5), let's find **three zones**:

1. **Negative Zone** (0% - X%): High confidence predictions for class 0 → **Auto-reject**
2. **Manual Zone** (X% - Y%): Uncertain predictions → **Human review**
3. **Positive Zone** (Y% - 100%): High confidence predictions for class 1 → **Auto-accept**

Routing uncertain cases to manual review means automated decisions are made only where the model is confident, which should lower the error rate on those decisions.
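
The three zones can be sketched with plain NumPy. The cutoffs below (0.30 and 0.70) are illustrative placeholders, not values produced by `CutoffOptimizer`, and placing scores that sit exactly on a cutoff in the manual zone is an assumed convention.

```python
import numpy as np

# Illustrative scores and cutoffs; real cutoffs would come from the optimizer
y_score_demo = np.array([0.05, 0.15, 0.28, 0.35, 0.50, 0.62, 0.71, 0.88, 0.93, 0.99])
negative_cutoff_demo = 0.30
positive_cutoff_demo = 0.70

# Assumed boundary convention: the cutoffs themselves fall in the manual zone
zones = np.where(
    y_score_demo < negative_cutoff_demo, "negative",
    np.where(y_score_demo > positive_cutoff_demo, "positive", "manual"),
)

# Percentage of the population routed to each zone
zone_share = {
    name: 100.0 * np.mean(zones == name)
    for name in ("negative", "manual", "positive")
}
```

Widening the gap between the two cutoffs sends more samples to manual review; narrowing it automates more decisions at the cost of including less certain predictions.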

In [None]:
# Create optimizer
optimizer = CutoffOptimizer(y_test, y_proba)

# Suggest three zones
cutoffs = optimizer.suggest_three_zones(
    negative_zone_metric='specificity',  # Optimize for identifying negatives
    positive_zone_metric='recall',       # Optimize for identifying positives
    min_metric_value=0.80,               # Require 80% minimum performance
    max_manual_zone_width=0.40           # Max 40% of data in manual zone
)

print("🎯 SUGGESTED CUTOFFS:")
print("=" * 70)
print(f"   Negative Zone: 0% - {cutoffs['negative_cutoff']*100:.1f}%")
print(f"   Manual Zone:   {cutoffs['negative_cutoff']*100:.1f}% - {cutoffs['positive_cutoff']*100:.1f}%")
print(f"   Positive Zone: {cutoffs['positive_cutoff']*100:.1f}% - 100%")
print("=" * 70)

## 📊 Step 6: Analyze Population Distribution

In [None]:
# Extract population statistics
pop = cutoffs['population']

print("\n👥 POPULATION DISTRIBUTION:")
print("=" * 70)
print(f"   Negative Zone: {pop['negative_zone_count']:3d} samples ({pop['negative_zone_pct']:.1f}%)")
print(f"   Manual Zone:   {pop['manual_zone_count']:3d} samples ({pop['manual_zone_pct']:.1f}%)")
print(f"   Positive Zone: {pop['positive_zone_count']:3d} samples ({pop['positive_zone_pct']:.1f}%)")
print("=" * 70)

# Create pie chart
fig, ax = plt.subplots(figsize=(10, 6))
labels = ['Negative Zone\n(Auto-Reject)', 'Manual Zone\n(Human Review)', 'Positive Zone\n(Auto-Accept)']
sizes = [pop['negative_zone_pct'], pop['manual_zone_pct'], pop['positive_zone_pct']]
colors = ['#4CAF50', '#FFC107', '#2196F3']
explode = (0.05, 0.1, 0.05)

ax.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',
       shadow=True, startangle=90, textprops={'fontsize': 12, 'weight': 'bold'})
ax.set_title('Population Distribution Across Three Zones', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print(f"   → {pop['negative_zone_pct'] + pop['positive_zone_pct']:.1f}% of decisions can be AUTOMATED")
print(f"   → {pop['manual_zone_pct']:.1f}% require HUMAN REVIEW")
print(f"   → Compared with reviewing every case by hand, manual workload drops by {pop['negative_zone_pct'] + pop['positive_zone_pct']:.1f}%")

## 📈 Step 7: Analyze Zone Performance
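
The rates printed in the cell below follow the usual confusion-matrix definitions. As a minimal sketch, and assuming scores at or above a threshold are predicted as class 1 (the library's own boundary convention is not shown in this notebook), they could be computed as follows. `confusion_rates` is a hypothetical helper, not part of `MetricsCalculator`.

```python
import numpy as np

def confusion_rates(y_true, y_score, threshold):
    """Confusion counts and rates when scores >= threshold are predicted as class 1."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_score) >= threshold).astype(int)

    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty
        return num / den if den else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "specificity": ratio(tn, tn + fp),  # TN / (TN + FP)
        "fpr": ratio(fp, tn + fp),          # FP / (TN + FP) = 1 - specificity
        "recall": ratio(tp, tp + fn),       # TP / (TP + FN)
        "fnr": ratio(fn, tp + fn),          # FN / (TP + FN) = 1 - recall
        "precision": ratio(tp, tp + fp),    # TP / (TP + FP)
    }

# Toy labels and scores (illustrative values only)
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_score_demo = [0.03, 0.12, 0.18, 0.41, 0.44, 0.58, 0.63, 0.81, 0.92, 0.97]
rates_demo = confusion_rates(y_true_demo, y_score_demo, threshold=0.5)
```

Note that FPR and specificity always sum to 1, as do FNR and recall, so each pair describes the same errors from two directions.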

In [None]:
# Extract metrics
neg_metrics = cutoffs['metrics']['negative_zone']
pos_metrics = cutoffs['metrics']['positive_zone']

print("\n📈 ZONE PERFORMANCE METRICS:")
print("=" * 70)
print("\n🟢 NEGATIVE ZONE (Auto-Reject):")
print(f"   Specificity:        {neg_metrics['specificity']:.2%}  ← How well we identify true negatives")
print(f"   False Positive Rate: {neg_metrics['fpr']:.2%}  ← Share of true negatives predicted positive")
print(f"   Confusion Matrix: TN={neg_metrics['tn']}, FP={neg_metrics['fp']}, FN={neg_metrics['fn']}, TP={neg_metrics['tp']}")

print("\n🔵 POSITIVE ZONE (Auto-Accept):")
print(f"   Recall:             {pos_metrics['recall']:.2%}  ← How well we identify true positives")
print(f"   Precision:          {pos_metrics['precision']:.2%}  ← Accuracy of positive predictions")
print(f"   False Negative Rate: {pos_metrics['fnr']:.2%}  ← Share of true positives predicted negative")
print(f"   Confusion Matrix: TN={pos_metrics['tn']}, FP={pos_metrics['fp']}, FN={pos_metrics['fn']}, TP={pos_metrics['tp']}")
print("=" * 70)

## 🎨 Step 8: Visualize with Cutoff Lines

In [None]:
# Create new visualization with cutoff lines
visualizer = ThresholdVisualizer(y_test, y_proba, step=0.05)
fig, ax = visualizer.plot_distributions(
    figsize=(16, 7),
    title="Probability Distribution with Suggested Cutoffs"
)

# Add cutoff lines
visualizer.add_cutoff_lines(
    cutoffs={
        'negative_cutoff': cutoffs['negative_cutoff'],
        'positive_cutoff': cutoffs['positive_cutoff']
    },
    labels={
        'negative_cutoff': f"Negative Cutoff ({cutoffs['negative_cutoff']*100:.1f}%)",
        'positive_cutoff': f"Positive Cutoff ({cutoffs['positive_cutoff']*100:.1f}%)"
    }
)

plt.show()

print("\n🎨 Visual Interpretation:")
print("   🟢 Green line (left):   Negative cutoff - everything LEFT is auto-rejected")
print("   🟠 Orange line (right):  Positive cutoff - everything RIGHT is auto-accepted")
print("   🟡 Area between lines:   Manual zone - requires human review")

## 📊 Step 9: View Full Justification Report

In [None]:
# Print the full justification
print(cutoffs['justification'])

## 🔍 Step 10: Compare with Standard Threshold (0.5)
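
One way to put the two approaches on the same footing is to compare the error rate over the decisions that are actually automated. The sketch below does this on toy data with illustrative cutoffs (0.30 and 0.70). These are assumptions for the example, not output from `CutoffOptimizer`, and the result on real data may differ.

```python
import numpy as np

# Toy labels and scores (illustrative values only)
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score_demo = np.array([0.03, 0.12, 0.18, 0.41, 0.44, 0.58, 0.63, 0.81, 0.92, 0.97])

# Single threshold: every sample gets an automated decision
single_pred = (y_score_demo >= 0.5).astype(int)
single_error = np.mean(single_pred != y_true_demo)

# Three zones with illustrative cutoffs: only confident samples are automated
negative_cutoff_demo, positive_cutoff_demo = 0.30, 0.70
auto_mask = (y_score_demo < negative_cutoff_demo) | (y_score_demo > positive_cutoff_demo)
auto_pred = (y_score_demo[auto_mask] > positive_cutoff_demo).astype(int)

auto_error = np.mean(auto_pred != y_true_demo[auto_mask])
coverage = np.mean(auto_mask)  # fraction of samples decided automatically
```

Reading `auto_error` together with `coverage` shows the trade-off: a lower error rate on automated decisions is bought by sending the remaining samples to human review.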

In [None]:
# Calculate metrics at standard threshold
standard_metrics = MetricsCalculator.calculate_all_metrics(y_test, y_proba, threshold=0.5)

print("\n⚖️ COMPARISON: Three-Zone Strategy vs Standard Threshold")
print("=" * 70)
print("\nSTANDARD APPROACH (Single threshold = 0.5):")
print(f"   Accuracy:  {standard_metrics['accuracy']:.2%}")
print(f"   Precision: {standard_metrics['precision']:.2%}")
print(f"   Recall:    {standard_metrics['recall']:.2%}")
print(f"   F1-Score:  {standard_metrics['f1']:.2%}")
print(f"   → 100% of decisions are automated (no human review)")
print(f"   → Error rate: {(1 - standard_metrics['accuracy'])*100:.1f}%")

print("\nTHREE-ZONE STRATEGY:")
print(f"   Negative Zone Specificity: {neg_metrics['specificity']:.2%}")
print(f"   Positive Zone Recall:      {pos_metrics['recall']:.2%}")
print(f"   → {pop['negative_zone_pct'] + pop['positive_zone_pct']:.1f}% automated with HIGH CONFIDENCE")
print(f"   → {pop['manual_zone_pct']:.1f}% flagged for human review (uncertain cases)")
print("   → Automated decisions are limited to high-confidence predictions")
print("=" * 70)

print("\n✅ ADVANTAGES OF THREE-ZONE STRATEGY:")
print("   1. Higher confidence on automated decisions")
print("   2. Reduces critical errors by flagging uncertain cases")
print("   3. Allows human expertise where it matters most")
print("   4. Balances automation efficiency with accuracy")

## 🎓 Summary & Key Takeaways

### What We Learned:

1. **Standard threshold (0.5) is not always optimal** - It forces an automated decision on every prediction, including those close to the boundary

2. **Three-zone strategy separates confident from uncertain predictions** - Only predictions far from the boundary are decided automatically

3. **Manual review zone is valuable** - Instead of making wrong decisions automatically, flag uncertain cases for humans

4. **Visualization is key** - Overlapping histograms show where the model is confident vs uncertain

5. **Customizable metrics** - You can optimize for different goals (precision, recall, F1, etc.)

### Real-World Applications:

- **Spam Detection**: Auto-block obvious spam, auto-allow obvious ham, manually review borderline emails
- **Credit Risk**: Auto-reject high risk, auto-approve low risk, manually review medium risk
- **Medical Diagnosis**: Auto-negative for healthy, auto-positive for critical, manual review for uncertain
- **Fraud Detection**: Auto-block suspicious, auto-allow normal, investigate borderline cases

---

## 🚀 Next Steps

Check out the other notebooks:
- `02_spam_detection.ipynb` - Real-world spam classification example
- `03_credit_risk.ipynb` - Credit approval with cost-benefit analysis

---

**⭐ If you found this useful, star the repository!**