# CR_Score Playbook 07: Production Monitoring & Observability

**Level:** Advanced  
**Time:** 35-40 minutes  
**Goal:** Master production monitoring, drift detection, and observability

## What You'll Learn

- Population Stability Index (PSI) for drift detection
- Characteristic Stability Index (CSI)
- Performance monitoring over time
- Alert management and notification
- Metrics collection (Prometheus-compatible)
- SHAP explainability for model interpretability
- Regulatory-compliant reason codes (FCRA/ECOA)
- Interactive observability dashboards
- Exporting comprehensive reports

## Prerequisites

- Completed Playbook 01 or 06
- Understanding of model monitoring concepts

## Step 1: Setup and Train Model

In [None]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

from cr_score import ScorecardPipeline
from cr_score.monitoring import (
    PerformanceMonitor,
    DriftMonitor,
    PredictionMonitor,
    AlertManager,
    MetricsCollector
)
from cr_score.explainability import (
    SHAPExplainer,
    ReasonCodeGenerator,
    FeatureImportanceAnalyzer
)
from cr_score.reporting import ObservabilityDashboard, ReportExporter

print("[OK] Libraries imported!")

# Load data
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

print(f"Training: {len(train_df)} samples")
print(f"Test: {len(test_df)} samples")

In [None]:
# Train scorecard
pipeline = ScorecardPipeline(max_n_bins=5, pdo=20, base_score=600)
pipeline.fit(train_df, target_col='default')

# Get predictions for both train and test
train_scores = pipeline.predict(train_df)
train_probas = pipeline.predict_proba(train_df)

test_scores = pipeline.predict(test_df)
test_probas = pipeline.predict_proba(test_df)

print("[OK] Model trained and predictions generated")

## Step 2: Population Stability Index (PSI)

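PSI measures how far a distribution has shifted between training data (the expected sample) and production data (the actual sample). In this playbook the test set stands in for production data.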

In [None]:
from cr_score.evaluation import StabilityMetrics

# Calculate PSI between train and test scores
psi = StabilityMetrics.calculate_psi(
    expected=train_scores,
    actual=test_scores,
    bins=10
)

status = StabilityMetrics.psi_interpretation(psi)

print(f"PSI: {psi:.4f}")
print(f"Status: {status.upper()}")
print("\nInterpretation:")
print("  PSI < 0.1:  No significant change (STABLE)")
print("  0.1-0.2:    Moderate change (WARNING)")
print("  PSI > 0.2:  Significant change (CRITICAL)")

# Get detailed breakdown
psi_breakdown = StabilityMetrics.calculate_psi_breakdown(
    expected=train_scores,
    actual=test_scores,
    bins=10
)

print("\nPSI Breakdown by Bin:")
print(psi_breakdown[['bin_label', 'expected_percent', 'actual_percent', 'psi']].to_string(index=False))
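To make the number less of a black box, here is a minimal sketch of the PSI formula, the sum over bins of `(actual% - expected%) * ln(actual% / expected%)`. The function name `psi_by_hand`, the equal-width binning, and the `eps` floor for empty bins are assumptions for illustration; `StabilityMetrics.calculate_psi` may bin differently (for example by quantiles), so the values need not match exactly.

```python
import math

def psi_by_hand(expected, actual, bins=10, eps=1e-6):
    """PSI with equal-width bins spanning the range of the expected sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # guard against a constant expected sample

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range values land in the first or last bin.
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        n = len(values)
        # eps keeps log() finite when a bin is empty.
        return [max(c / n, eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Illustrative data only: a sample compared with itself, then with a shifted copy.
scores = list(range(100))
print(psi_by_hand(scores, scores))
print(psi_by_hand(scores, [s + 30 for s in scores]))
```

Every term in the sum is non-negative, so PSI is zero only when the two binned distributions are identical.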

## Step 3: Feature-Level Stability

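Check PSI for individual features to find which ones are drifting. Applied to a single input characteristic rather than the final score, this measure is commonly called the Characteristic Stability Index (CSI).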

In [None]:
# Select numeric features
numeric_cols = train_df.select_dtypes(include='number').columns.tolist()  # covers int32/float32 as well
feature_cols = [col for col in numeric_cols if col not in ['application_id', 'default']]

# Calculate feature stability
stability = StabilityMetrics.calculate_feature_stability(
    expected_df=train_df,
    actual_df=test_df,
    features=feature_cols,
    bins=10
)

print("Feature Stability Analysis:")
print("="*60)
print(stability.head(15).to_string(index=False))

# Count by status
print("\nSummary:")
print(stability['status'].value_counts().to_string())

## Step 4: Performance Monitoring

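Set a baseline, then monitor performance over time against it. This playbook computes the baseline on the training data for simplicity; training metrics tend to be optimistic, so a held-out validation set is the safer baseline in production.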

In [None]:
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# Calculate baseline metrics on training data
train_pred = (train_probas > 0.5).astype(int)
baseline_metrics = {
    'auc': roc_auc_score(train_df['default'], train_probas),
    'precision': precision_score(train_df['default'], train_pred),
    'recall': recall_score(train_df['default'], train_pred)
}

print("Baseline Metrics (Training Data):")
for metric, value in baseline_metrics.items():
    print(f"  {metric}: {value:.4f}")

# Initialize performance monitor
perf_monitor = PerformanceMonitor(
    baseline_metrics=baseline_metrics,
    alert_thresholds={'auc': 0.05, 'precision': 0.10, 'recall': 0.10}
)

print("\n[OK] Performance monitor initialized")

In [None]:
# Monitor test set performance
test_pred = (test_probas > 0.5).astype(int)

metrics = perf_monitor.record_predictions(
    y_true=test_df['default'],
    y_pred=test_pred,
    y_proba=test_probas,
    metadata={'dataset': 'test', 'date': '2026-01-16'}
)

print("Current Metrics (Test Data):")
for metric in ['auc', 'precision', 'recall']:
    current = metrics[metric]
    baseline = baseline_metrics[metric]
    diff = current - baseline
    print(f"  {metric}: {current:.4f} (baseline: {baseline:.4f}, diff: {diff:+.4f})")

# Check health
health = perf_monitor.check_health()
print(f"\nHealth Status: {health['status'].upper()}")

if health['alerts']:
    print("\nALERTS:")
    for alert in health['alerts']:
        print(f"  - {alert['metric']}: {alert['degradation_pct']:.1%} degradation")
else:
    print("No alerts - model is performing well!")
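The health check above can be pictured as a comparison of each metric's drop against its threshold. The sketch below assumes the thresholds are relative drops from baseline, which this playbook does not confirm, so `PerformanceMonitor` may apply a different rule. The function `degradation_alerts` and the metric values are hypothetical.

```python
def degradation_alerts(baseline, current, thresholds):
    """Return one alert per metric whose relative drop exceeds its threshold."""
    alerts = []
    for metric, limit in thresholds.items():
        base = baseline[metric]
        if base == 0:
            continue  # a relative drop is undefined for a zero baseline
        drop = (base - current[metric]) / base
        if drop > limit:
            alerts.append({"metric": metric, "degradation_pct": drop})
    return alerts

# Made-up numbers for illustration, not results from this playbook.
baseline = {"auc": 0.80, "precision": 0.50, "recall": 0.40}
current = {"auc": 0.74, "precision": 0.48, "recall": 0.39}
thresholds = {"auc": 0.05, "precision": 0.10, "recall": 0.10}

alerts = degradation_alerts(baseline, current, thresholds)
for alert in alerts:
    print(f"{alert['metric']}: {alert['degradation_pct']:.1%} degradation")
```

Only AUC triggers here: its relative drop of 7.5% exceeds the 5% limit, while precision (4%) and recall (2.5%) stay within their 10% limits.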

## Step 5: Drift Detection Monitor

Initialize drift monitor and detect distribution changes.

In [None]:
# Initialize drift monitor with training data as reference
drift_monitor = DriftMonitor(
    reference_data=train_df[feature_cols],
    psi_threshold=0.1,
    ks_threshold=0.05
)

# Detect drift in test data
drift_report = drift_monitor.detect_drift(test_df[feature_cols])

print("Drift Detection Report:")
print("="*60)
print(f"Overall Status: {drift_report['overall_status'].upper()}")
print("\nSummary:")
print(f"  Critical: {drift_report['drift_summary']['critical']} features")
print(f"  Warning:  {drift_report['drift_summary']['warning']} features")
print(f"  Stable:   {drift_report['drift_summary']['stable']} features")

# Show top drifted features
print("\nTop 5 Drifted Features:")
feature_drifts = [(feat, results['psi']) 
                  for feat, results in drift_report['feature_results'].items()]
feature_drifts.sort(key=lambda x: x[1], reverse=True)

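# Use a separate loop variable so the score-level `psi` from Step 2 is not overwritten
for feat, feat_psi in feature_drifts[:5]:
    status_emoji = '🔴' if feat_psi > 0.2 else '🟡' if feat_psi > 0.1 else '🟢'
    print(f"  {status_emoji} {feat}: PSI = {feat_psi:.4f}")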
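Alongside PSI, the drift monitor takes a `ks_threshold`. As a rough illustration of the Kolmogorov-Smirnov side, the sketch below computes the two-sample KS statistic, the largest gap between the two empirical CDFs. The function name `ks_statistic` is hypothetical, and whether `DriftMonitor` compares its threshold against this statistic or against a p-value is not stated in this playbook.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for v in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

# Illustrative data only.
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(ks_statistic([1, 2, 3], [4, 5, 6]))
```

Unlike PSI, the KS statistic needs no binning, which makes it a useful cross-check for continuous features.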

## Step 6: SHAP Explainability

Understand what drives model predictions using SHAP values.

In [None]:
# Get WoE-encoded features (what model actually uses)
X_woe = pipeline.woe_encoder_.transform(test_df[feature_cols])

# Create SHAP explainer
shap_explainer = SHAPExplainer(
    model=pipeline.model_,
    model_type='linear'  # Our LogisticScorecard
)

# Fit on sample of data (for speed)
shap_explainer.fit(X_woe, sample_size=100)

# Get global feature importance
shap_importance = shap_explainer.get_feature_importance(X_woe)

print("SHAP Feature Importance (Global):")
print("="*60)
print(shap_importance.head(10).to_string(index=False))

print("\n[OK] SHAP analysis complete")
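For a linear model such as a logistic scorecard, SHAP values have a simple closed form when features are treated as independent: each feature contributes its coefficient times the distance of its value from the background mean, measured in log-odds. The sketch below shows that formula with made-up coefficients and data. `linear_shap` is a hypothetical helper, and `SHAPExplainer` may compute its values differently.

```python
def linear_shap(coefs, background, x):
    """SHAP values for a linear model, assuming independent features.

    phi_j = coef_j * (x_j - mean_j), where mean_j is taken over the background rows.
    """
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(coefs))]
    return [b * (xj - m) for b, xj, m in zip(coefs, x, means)]

# Made-up values for illustration.
coefs = [2.0, -1.0]
background = [[0.0, 1.0], [2.0, 3.0]]  # feature means are 1.0 and 2.0
x = [3.0, 4.0]

phi = linear_shap(coefs, background, x)
print(phi)

# Additivity: the contributions sum to f(x) minus the average prediction.
f = lambda row: sum(b * v for b, v in zip(coefs, row))
mean_f = sum(f(row) for row in background) / len(background)
print(sum(phi), f(x) - mean_f)
```

The additivity check is the property that makes SHAP values usable as a per-feature breakdown of a single prediction.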

## Step 7: Reason Codes (Regulatory Compliance)

Generate FCRA/ECOA-compliant adverse action reason codes.

In [None]:
# Create reason code generator
reason_generator = ReasonCodeGenerator(
    model=pipeline.model_,
    feature_names=pipeline.selected_features_
)

# Find declined applications (score < 620)
declined_mask = test_scores < 620
declined_apps = test_df[declined_mask].head(5)  # First 5 examples

print(f"Analyzing {len(declined_apps)} declined applications...\n")

for idx in declined_apps.index:
    app_id = declined_apps.loc[idx, 'application_id']
    score = test_scores[idx]
    
    # Generate top 4 reasons
    reasons = reason_generator.generate_reasons(
        x=test_df.loc[idx, pipeline.selected_features_],
        score=score,
        threshold=620,
        num_reasons=4
    )
    
    print(f"Application {app_id} (Score: {score:.0f}):")
    for rank, reason in enumerate(reasons, 1):
        print(f"  {rank}. {reason['code']}: {reason['description']}")
    print()
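One common way to rank adverse action reasons for a points-based scorecard is by points lost: for each feature, the gap between the points of its best bin and the points the applicant actually earned. The sketch below illustrates that idea only. The feature names, point values, and the `top_reasons` helper are hypothetical, and `ReasonCodeGenerator` may use a different ranking rule.

```python
def top_reasons(points_earned, max_points, num_reasons=4):
    """Rank features by points lost relative to the best attainable bin."""
    shortfall = {f: max_points[f] - p for f, p in points_earned.items()}
    ranked = sorted(shortfall.items(), key=lambda kv: kv[1], reverse=True)
    # Features where the applicant already earned the maximum are not reasons.
    return [(f, lost) for f, lost in ranked[:num_reasons] if lost > 0]

# Hypothetical applicant and scorecard, for illustration only.
earned = {"utilization": 12, "delinquencies": 30, "tenure": 45, "inquiries": 20}
best = {"utilization": 60, "delinquencies": 55, "tenure": 45, "inquiries": 35}

reasons = top_reasons(earned, best, num_reasons=3)
for rank, (feature, lost) in enumerate(reasons, 1):
    print(f"{rank}. {feature}: {lost} points below the best bin")
```

Mapping each ranked feature to a notice-ready code and description is a separate step that depends on your compliance requirements.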

## Step 8: Metrics Collection (Prometheus-Compatible)

Collect system metrics for monitoring.

In [None]:
# Initialize metrics collector
metrics_collector = MetricsCollector(enable_prometheus=True)

# Record various metrics
metrics_collector.increment_counter('predictions_total', value=len(test_df))
metrics_collector.set_gauge('model_auc', value=metrics['auc'])
metrics_collector.set_gauge('psi_score', value=psi)
metrics_collector.record_histogram('score_value', value=test_scores.mean())

# Get all metrics
all_metrics = metrics_collector.get_metrics()

print("Collected Metrics:")
print("="*60)
for metric_name, metric_data in all_metrics.items():
    print(f"{metric_name}: {metric_data['value']} ({metric_data['type']})")

print("\n[OK] Metrics collected")
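"Prometheus-compatible" ultimately means the metrics can be served in the Prometheus text exposition format, which a scraper reads over HTTP. The sketch below renders counters and gauges in that format. It assumes the `{name: {'type': ..., 'value': ...}}` shape implied by the loop above, uses made-up values, and leaves out histograms, which need separate `_bucket`, `_sum`, and `_count` series. `to_prometheus_text` is a hypothetical helper, not part of `MetricsCollector`.

```python
def to_prometheus_text(metrics):
    """Render counters and gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        m = metrics[name]
        lines.append(f"# TYPE {name} {m['type']}")
        lines.append(f"{name} {m['value']}")
    return "\n".join(lines) + "\n"

# Made-up values for illustration.
sample = {
    "predictions_total": {"type": "counter", "value": 1500},
    "model_auc": {"type": "gauge", "value": 0.78},
}

text = to_prometheus_text(sample)
print(text)
```

In a real deployment the official `prometheus_client` package would normally handle this rendering and the HTTP endpoint.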

## Step 9: Alert Management

Create and manage alerts for critical issues.

In [None]:
from cr_score.monitoring.alert_manager import AlertSeverity

# Initialize alert manager
alert_manager = AlertManager()

# Create alerts based on monitoring results
if health['status'] == 'critical':
    for alert_info in health['alerts']:
        alert = alert_manager.create_alert(
            title=f"Performance Degradation: {alert_info['metric']}",
            severity=AlertSeverity.CRITICAL,
            details=alert_info,
            source='performance_monitor'
        )
        print(f"ALERT CREATED: {alert['title']}")

if drift_report['overall_status'] == 'critical':
    alert = alert_manager.create_alert(
        title='Critical Data Drift Detected',
        severity=AlertSeverity.CRITICAL,
        details=drift_report['drift_summary'],
        source='drift_monitor'
    )
    print(f"ALERT CREATED: {alert['title']}")

# Get all active alerts
active_alerts = alert_manager.get_active_alerts()
alert_summary = alert_manager.get_alert_summary()

print("\nAlert Summary:")
print(f"  Total: {alert_summary['total']}")
print(f"  Active: {alert_summary['active']}")
print(f"  Critical: {alert_summary.get('by_severity', {}).get('critical', 0)}")
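The cell above only raises alerts for critical findings. If you also want warnings, a mapping from PSI to severity along the bands used in this playbook might look like the sketch below. The `Severity` enum is a hypothetical stand-in, since only `AlertSeverity.CRITICAL` is shown here and the other members of `AlertSeverity` are not.

```python
from enum import Enum

class Severity(Enum):
    """Hypothetical stand-in; not cr_score's AlertSeverity."""
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

def psi_severity(psi, warn=0.1, crit=0.2):
    """Map a PSI value to a severity: < 0.1 stable, 0.1-0.2 warning, > 0.2 critical."""
    if psi > crit:
        return Severity.CRITICAL
    if psi >= warn:
        return Severity.WARNING
    return Severity.INFO

# Illustrative PSI values only.
for value in (0.05, 0.15, 0.25):
    print(f"PSI {value:.2f} -> {psi_severity(value).value}")
```

Keeping the band edges as parameters makes it easy to tighten them for high-stakes features.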

## Step 10: Observability Dashboard

Generate interactive HTML dashboard with all monitoring data.

In [None]:
# Create observability dashboard
dashboard = ObservabilityDashboard(
    title="Production Scorecard Monitoring Dashboard"
)

# Add performance section
metrics_df = perf_monitor.get_metrics_summary()
dashboard.add_performance_section(metrics_df, health)

# Add drift section
dashboard.add_drift_section(drift_report)

# Add prediction section
pred_summary = {
    'mean_score': test_scores.mean(),
    'std_score': test_scores.std(),
    'min_score': test_scores.min(),
    'max_score': test_scores.max(),
    'mean_proba': test_probas.mean(),
}
dashboard.add_prediction_section(pred_summary)

# Add metrics section
dashboard.add_metrics_section(all_metrics)

# Add alerts section
dashboard.add_alerts_section(active_alerts, alert_summary)

# Export dashboard
dashboard.export('reports/observability_dashboard.html')

print("[OK] Observability dashboard generated!")
print("Open: reports/observability_dashboard.html")

## Step 11: Export Comprehensive Reports

Export all analysis in multiple formats.

In [None]:
# Create comprehensive report
exporter = ReportExporter()

# Get full metrics
full_metrics = pipeline.model_.get_performance_metrics(
    test_df['default'],
    test_probas,
    include_stability=True,
    y_train_proba=train_probas
)

# Export to all formats
files = exporter.export_comprehensive_report(
    model=pipeline.model_,
    metrics=full_metrics,
    X_test=pipeline.woe_encoder_.transform(test_df[feature_cols]),
    y_test=test_df['default'],
    output_dir='reports/comprehensive/',
    formats=['json', 'csv', 'excel', 'markdown'],
    include_curves=True
)

print("[OK] Comprehensive reports exported!")
print("\nGenerated files:")
for fmt, paths in files.items():
    print(f"  {fmt}: {len(paths)} file(s) in reports/comprehensive/")

## Summary

### What You Learned:

1. ✅ **PSI Calculation** - Detect population shifts
2. ✅ **Feature Stability** - Monitor individual features for drift
3. ✅ **Performance Monitoring** - Track metrics over time with baselines
4. ✅ **Drift Detection** - Automated monitoring with PSI/KS tests
5. ✅ **SHAP Explainability** - Understand model decisions
6. ✅ **Reason Codes** - FCRA/ECOA-compliant adverse action notices
7. ✅ **Metrics Collection** - Prometheus-compatible system metrics
8. ✅ **Alert Management** - Multi-severity alerting system
9. ✅ **Observability Dashboard** - Interactive monitoring interface
10. ✅ **Comprehensive Reports** - Multi-format exports

### Production Checklist:

- ✅ Set baseline metrics from validation data
- ✅ Configure alert thresholds (PSI > 0.1, performance drop > 5%)
- ✅ Monitor both feature-level and score-level drift
- ✅ Generate reason codes for all declined applications
- ✅ Set up automated dashboard generation (daily/weekly)
- ✅ Configure notification channels (email, Slack, PagerDuty)
- ✅ Document compliance procedures
- ✅ Establish model retraining criteria

### Key Metrics to Monitor:

**Performance:**
- AUC (should stay within 5% of baseline)
- Gini coefficient
- KS statistic
- Brier score (calibration)

**Stability:**
- PSI < 0.1: Stable
- PSI 0.1-0.2: Warning
- PSI > 0.2: Critical (investigate/retrain)

**Operational:**
- Prediction latency
- Throughput (predictions/second)
- Error rates
- System resource usage

### Next Steps:

1. Set up automated monitoring pipeline
2. Configure notification channels
3. Establish model retraining schedule
4. Document compliance procedures
5. Create runbooks for alert responses

### Files Generated:

- `reports/observability_dashboard.html` - Interactive monitoring dashboard
- `reports/comprehensive/` - Full reports (JSON, CSV, Excel, Markdown)
- Metrics, alerts, and monitoring data

### Production Tips:

1. **Daily Monitoring**: Run drift detection daily
2. **Weekly Reviews**: Review performance metrics weekly
3. **Monthly Deep Dive**: Full model health check monthly
4. **Retrain Triggers**: PSI > 0.2 OR performance drop > 10%
5. **Documentation**: Keep audit trail of all monitoring activities
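The retrain trigger in tip 4 can be written as a small predicate. This sketch assumes "performance drop" means a relative drop in AUC from baseline, which is one reasonable reading rather than a documented rule. The function name and default limits are illustrative.

```python
def should_retrain(psi, baseline_auc, current_auc, psi_limit=0.2, drop_limit=0.10):
    """True when PSI exceeds its limit OR the relative AUC drop exceeds its limit."""
    drop = (baseline_auc - current_auc) / baseline_auc
    return psi > psi_limit or drop > drop_limit

# Made-up monitoring readings for illustration.
print(should_retrain(psi=0.25, baseline_auc=0.80, current_auc=0.80))
print(should_retrain(psi=0.05, baseline_auc=0.80, current_auc=0.79))
print(should_retrain(psi=0.05, baseline_auc=0.80, current_auc=0.70))
```

The first and third readings trigger retraining (PSI above 0.2, and a 12.5% AUC drop); the second does not.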