# EWS Pipeline - Quick Start Guide
**Corporate Credit Early Warning System**

This notebook provides executable code to run the complete EWS pipeline from data generation to production scoring.

## 📋 Prerequisites
- Python 3.13+ with virtual environment activated
- All packages installed: `pip install -r requirements.txt`
- Project structure in place (see README.md)

## 🎯 Pipeline Overview

```
1. Generate Data        → Synthetic training data
2. Feature Engineering  → 20+ financial/behavioral features
3. Train Model          → LightGBM + SHAP + Calibration
4. Generate Scores      → Production scoring output
5. Validate Results     → Backtest + Dashboard
```

**Estimated runtime**: ~2-3 minutes for full pipeline (1000 customers)

## ⚙️ Setup: Verify Environment

First, let's check that all required packages and directories are in place:

In [None]:
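import importlib
import os
import sys
from pathlib import Path

# Resolve the project root whether Jupyter was started in the repo root or in notebooks/
PROJECT_ROOT = Path.cwd() if (Path.cwd() / "src").exists() else Path.cwd().parent
os.chdir(PROJECT_ROOT)  # later cells use relative paths such as "data/raw"

# Add src to Python path to enable imports
sys.path.insert(0, str(PROJECT_ROOT / "src"))

# Import core libraries
import pandas as pd
import numpy as np
import json

# Check Python version
print(f"Python: {sys.version}")

# Check key packages (sklearn is the import name of the scikit-learn package)
packages = ["pandas", "numpy", "lightgbm", "shap", "sklearn"]
missing_pkgs = []
for pkg in packages:
    try:
        importlib.import_module(pkg)
        print(f"✓ {pkg}")
    except ImportError:
        missing_pkgs.append(pkg)
        print(f"✗ {pkg} - Run: pip install -r requirements.txt")

# Check project structure relative to the project root, not the notebook folder
required_dirs = ["src", "data", "artifacts", "notebooks"]
missing_dirs = [d for d in required_dirs if not (PROJECT_ROOT / d).exists()]
for d in required_dirs:
    print(f"✗ {d}/ - Missing!" if d in missing_dirs else f"✓ {d}/")

if missing_pkgs or missing_dirs:
    print("\n⚠️ Fix the missing packages or directories listed above!")
else:
    print("\n✅ Setup complete!")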

Python: 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)]
✓ pandas
✓ numpy
✓ lightgbm
✓ shap
✓ sklearn
✗ src/ - Missing!
✗ data/ - Missing!
✗ artifacts/ - Missing!
✗ notebooks/ - Missing!

⚠️ Fix missing directories!


---

## 🔄 STEP 1: Generate Synthetic Data

Create training data with 1000 corporate customers:
- 5 raw data tables (financials, credit, cashflow, covenant, labels)
- 12 quarters of financial history
- 180 days of behavioral data
- Binary default labels (12-month horizon)

In [None]:
%%time

# Import data generation module
from gen_data.gen_input import Config, run

# Configure data generation
config = Config(
    random_seed=42,
    n_customers=1000,
    end_quarter="2025-06-30",
    n_quarters=12,
    asof_date="2025-06-30",
    behavior_days=180,
    label_horizon_days=365,
    output_dir="data/raw"
)

# Generate synthetic data
print("🔄 Generating synthetic corporate data...")
result = run(config)

# Display summary
print("\n📂 Files created in data/raw/:")
for f in Path("data/raw").glob("*.parquet"):
    size_mb = f.stat().st_size / 1024 / 1024
    print(f"  • {f.name} ({size_mb:.2f} MB)")

print(f"\n📊 Data Summary:")
print(f"  Customers: {result['summary']['n_customers']}")
print(f"  Default rate (12M): {result['summary']['event_rate_h12m']:.2%}")


📂 Files created in data/raw/:
CPU times: total: 31.2 ms
Wall time: 98.4 ms


python: can't open file 'c:\\Users\\Admin\\OneDrive\\Dylandocs\\Projects\\corporate-credit-ews\\notebooks\\src\\gen_data\\gen_input.py': [Errno 2] No such file or directory


---

## 🔧 STEP 2: Feature Engineering

Transform raw tables into modeling features:
- **Financial ratios**: Debt/EBITDA, ICR, Current Ratio, etc.
- **Credit behavior**: Utilization, DPD, breach counts
- **Cashflow metrics**: Negative days, volatility
- **Covenant tracking**: Breach frequency
- **Normalization**: Z-scores by sector & size
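
As a rough illustration of the normalization step, the sketch below computes z-scores within each (sector, size) peer group using only the standard library. It is not the code inside `make_features`; the field names `sector`, `size`, and `debt_to_ebitda` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_group(rows, value_key, group_keys):
    """Return one z-score per row, computed within the row's peer group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row[value_key])

    # Mean and population standard deviation per peer group
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}

    scores = []
    for row in rows:
        mu, sigma = stats[tuple(row[k] for k in group_keys)]
        # A constant (or single-member) group has sigma == 0; emit 0.0 instead of dividing by zero
        scores.append(0.0 if sigma == 0 else (row[value_key] - mu) / sigma)
    return scores


# Hypothetical sample: three large manufacturers and one small retailer
customers = [
    {"sector": "MFG", "size": "L", "debt_to_ebitda": 2.0},
    {"sector": "MFG", "size": "L", "debt_to_ebitda": 4.0},
    {"sector": "MFG", "size": "L", "debt_to_ebitda": 6.0},
    {"sector": "RTL", "size": "S", "debt_to_ebitda": 3.0},
]
z = zscore_by_group(customers, "debt_to_ebitda", ("sector", "size"))
print([round(v, 3) for v in z])
```

Normalizing within peer groups means a leverage ratio is judged against comparable companies rather than against the whole portfolio, which is presumably why the pipeline groups by sector and size.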

In [None]:
%%time

# Import feature engineering module
from modeling.feature_engineering import make_features

# Build features
print("🔧 Engineering features...")
df_features = make_features(
    raw_dir=Path("data/raw"),
    asof="2025-06-30",
    winsor=True,
    normalize=True,
    out_path=Path("data/processed/feature_ews.parquet")
)

print(f"\n✅ Features created: {df_features.shape[0]} customers × {df_features.shape[1]} columns")
print(f"\nTarget distribution:")
print(df_features['event_h12m'].value_counts())
print(f"Default rate: {df_features['event_h12m'].mean():.2%}")

print(f"\n📋 Feature columns:")
print(f"  Normalized features: {len([c for c in df_features.columns if '__zs_' in c])}")
print(f"  Raw features: {len([c for c in df_features.columns if '__zs_' not in c])}")

---

## 🤖 STEP 3: Train Model

Train LightGBM classifier with:
- **80/20 train-test split** (stratified)
- **Platt calibration** for probability adjustment
- **SHAP explainability** for feature importance
- **Percentile thresholds**: Red = top 5%, Amber = top 10%

**Outputs:**
- `model_lgbm.pkl` - Trained model bundle
- `thresholds.json` - Red/Amber cutoffs
- `baseline_metrics.json` - AUC, KS, Brier, and PR-AUC scores
- `scores_calibrated.csv` - All predictions with tiers
- `shap_summary.csv` + `shap_summary.png` - Feature importance rankings and plot
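
To make the percentile thresholds concrete, here is a minimal sketch of how cutoffs and tiers could be derived from a list of calibrated PDs. It assumes the Amber cutoff is cumulative (customers in the top 10% who are not already Red) and uses a hypothetical `Green` label for everyone else; `train_and_calibrate` may define the bands differently.

```python
import math


def percentile_thresholds(pds, red_pct=0.05, amber_pct=0.10):
    """Return PD cutoffs so that roughly the top red_pct / amber_pct of scores are flagged."""
    ranked = sorted(pds, reverse=True)
    n = len(ranked)
    # round() guards against floating-point noise such as 5.000000000000001
    n_red = max(1, math.ceil(round(n * red_pct, 9)))
    n_amber = max(n_red, math.ceil(round(n * amber_pct, 9)))  # cumulative: Red + Amber
    return {"red": ranked[n_red - 1], "amber": ranked[n_amber - 1]}


def assign_tier(pd_value, thr):
    """Map a PD to a tier; 'Green' is a hypothetical name for the unflagged band."""
    if pd_value >= thr["red"]:
        return "Red"
    if pd_value >= thr["amber"]:
        return "Amber"
    return "Green"


# Hypothetical PDs: 0.01, 0.02, ..., 1.00
pds = [i / 100 for i in range(1, 101)]
thr = percentile_thresholds(pds)
tiers = [assign_tier(p, thr) for p in pds]
print(thr, {t: tiers.count(t) for t in ("Red", "Amber", "Green")})
```

Because the cutoffs are PD values, tied scores at a boundary all land in the same tier, so the flagged share can slightly exceed the nominal percentage.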

In [None]:
%%time

# Import training module
from modeling.train_baseline import train_and_calibrate

# Train model with calibration
print("🤖 Training LightGBM with probability calibration...")
result = train_and_calibrate(
    df=df_features,
    target_col="event_h12m",
    test_size=0.2,
    seed=42,
    red_pct=0.05,     # Red percentile cutoff (5%)
    amber_pct=0.10,   # Amber percentile cutoff (10%)
    outdir=Path("artifacts/models")
)

# Display metrics
print("\n📊 Model Performance:")
metrics = result['metrics']
print(f"  AUC:    {metrics['AUC']:.3f} (target > 0.80)")
print(f"  KS:     {metrics['KS']:.3f}")
print(f"  Brier:  {metrics['Brier']:.4f} (target < 0.02)")
print(f"  PR-AUC: {metrics['PR_AUC']:.3f}")

print(f"\n🎯 Thresholds (percentile-based):")
thr = result['thresholds']
print(f"  Red tier:   PD ≥ {thr['red']:.2%}")
print(f"  Amber tier: PD ≥ {thr['amber']:.2%}")

print(f"\n📁 Artifacts saved to: {result['artifacts_dir']}")
print(f"  • model_lgbm.pkl")
print(f"  • thresholds.json")
print(f"  • baseline_metrics.json")
print(f"  • shap_summary.csv + .png")

---

## 📊 STEP 4: View Feature Importance

Check which features drive the model predictions:

In [None]:
# Display top 10 features from SHAP analysis
df_shap = pd.read_csv("artifacts/models/shap_summary.csv")
print("🔍 Top 10 Most Important Features:\n")
print(df_shap.head(10).to_string(index=False))

# Show SHAP summary plot if available
from IPython.display import Image, display
shap_plot = Path("artifacts/models/shap_summary.png")
if shap_plot.exists():
    print("\n📈 SHAP Summary Plot:")
    display(Image(filename=str(shap_plot), width=800))

---

## 🎯 STEP 5: Production Scoring

Score the production portfolio (use the generated data or supply your own):

In [None]:
%%time

# Import portfolio creation module
from gen_data.gen_portfolio import create_portfolio

# Create portfolio from scored data
print("🎯 Creating production portfolio...")
df_portfolio = create_portfolio(
    scores_path=Path("data/processed/scores_calibrated.csv"),
    output_path=Path("data/processed/portfolio_scored.csv"),
    seed=42,
    id_col="customer_id",
    sector_col="sector_code",
    pd_col="prob_calibrated"
)

print(f"\n✅ Portfolio created: {len(df_portfolio)} customers")
print(f"   Saved to: data/processed/portfolio_scored.csv")

In [None]:
%%time

# Import scoring module
from scoring import run_pipeline

# Score the portfolio
print("📊 Scoring production portfolio...")
result = run_pipeline(
    features_path=Path("data/processed/feature_ews.parquet"),
    model_path=Path("artifacts/models/model_lgbm.pkl"),
    thr_path=Path("artifacts/models/thresholds.json"),
    asof="2025-06-30",
    outdir=Path("artifacts/scoring")
)

# View scoring results
df_scored = pd.read_csv("artifacts/scoring/ews_scored_2025-06-30.csv")
print(f"\n📋 Scored {len(df_scored)} customers")
print("\nTier Distribution:")
print(df_scored['tier'].value_counts())
print("\nSample High-Risk Customers:")
print(df_scored[df_scored['tier'] == 'Red'][['customer_id', 'prob_default_12m_calibrated', 'score_ews', 'tier']].head())

---

## ✅ STEP 6: Validation & Backtesting

Run backtest over 18 months to validate model stability:

In [None]:
%%time

# Import cohort generation
from gen_data.gen_cohorts import generate_synthetic_cohorts

# Generate 18 months of backtest data
print("📅 Generating backtest cohorts...")
df_cohorts = generate_synthetic_cohorts(
    start="2024-01-31",
    end="2025-06-30",
    n_customers=10000,
    seed=42,
    output="data/processed/backtest_cohorts.parquet"
)

print(f"\n✅ Backtest cohorts created")
print(f"   Months: {df_cohorts['as_of_date'].nunique()}")
print(f"   Total rows: {len(df_cohorts):,}")

In [None]:
%%time

# Import backtest module
from backtest.backtest_monthly import run_backtest

# Run monthly backtest
print("🔄 Running monthly backtest...")
monthly_df, calibration_df = run_backtest(
    data_path=Path("data/processed/backtest_cohorts.parquet"),
    as_of_col="as_of_date",
    pd_col="pd_12m",
    y_col="y_event_12m",
    start_month="2024-01",
    end_month="2025-06",
    outdir=Path("artifacts/backtest")
)

# Display summary
print(f"\n📊 Backtest Results ({len(monthly_df)} months):\n")
print(f"AUC:   mean={monthly_df['auc'].mean():.3f}, range={monthly_df['auc'].min():.3f}–{monthly_df['auc'].max():.3f}")
print(f"Brier: mean={monthly_df['brier'].mean():.4f}")
print(f"Amber Alert Rate: {monthly_df['amber_alert_rate'].mean():.1%}")
print(f"Amber Precision:  {monthly_df['amber_precision'].mean():.1%}")
print(f"Amber Recall:     {monthly_df['amber_recall'].mean():.1%}")
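
The alert columns summarized above can be read with the usual definitions, sketched below on made-up data. This assumes an "Amber alert" means a customer flagged at Amber or above; the exact definitions in `run_backtest` may differ.

```python
def alert_metrics(y_true, flagged):
    """Compute alert rate, precision, and recall for a set of binary alerts."""
    n_flagged = sum(flagged)
    n_events = sum(y_true)
    hits = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    return {
        "alert_rate": n_flagged / len(y_true),                  # share of customers flagged
        "precision": hits / n_flagged if n_flagged else 0.0,    # flagged customers that defaulted
        "recall": hits / n_events if n_events else 0.0,         # defaulters that were flagged
    }


# Made-up month: 10 customers, 3 defaults, 3 alerts, 2 of them correct
y = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
flagged = [True, True, False, True, False, False, False, False, False, False]
m = alert_metrics(y, flagged)
print({k: round(v, 3) for k, v in m.items()})
```

Lowering the Amber cutoff raises the alert rate and usually recall, at the cost of precision, so the three numbers are best reviewed together.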

In [None]:
%%time

# Import PSI calculation
from backtest.calculate_psi import compute_psi_monthly

# Calculate PSI
print("📉 Calculating Population Stability Index...")
df_psi = compute_psi_monthly(
    cohorts_path=Path("data/processed/backtest_cohorts.parquet"),
    pd_col="pd_12m",
    as_of_col="as_of_date",
    output_path=Path("artifacts/backtest/psi_monthly.csv")
)

print("\n📉 PSI Stability Check:")
print(df_psi.tail(5).to_string(index=False))
print(f"\nMax PSI: {df_psi['psi'].max():.4f} (threshold: 0.10 = warning, 0.25 = critical)")
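
For reference, PSI compares the binned distribution of scores in a baseline period with that of a later period: for each bin, take (actual share - expected share) × ln(actual share / expected share) and sum over bins. The sketch below assumes ten equal-width bins on [0, 1] and made-up PD samples; `compute_psi_monthly` may bin differently (quantile bins from the baseline are also common).

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so that a score of exactly 1.0 falls in the last bin
            counts[min(int(v * n_bins), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Made-up samples: a uniform baseline and the same scores shifted up by 0.1
baseline = [i / 1000 for i in range(1000)]
shifted = [min(v + 0.1, 0.999) for v in baseline]
print(f"PSI (no change): {psi(baseline, baseline):.4f}")
print(f"PSI (shifted):   {psi(baseline, shifted):.4f}")
```

An unchanged population gives a PSI of zero, and the value grows as score mass moves between bins, which is what the 0.10 and 0.25 thresholds are applied to.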

---

## 📊 STEP 7: Generate Validation Dashboard

Create visual validation report with 5 plots:

In [None]:
%%time

# Import plot generation
from plot_validation import create_all_plots

# Generate validation plots
print("📊 Generating validation plots...")
plots_created = create_all_plots(
    monthly_metrics_path=Path("artifacts/backtest/monthly_metrics.csv"),
    calibration_path=Path("artifacts/backtest/monthly_calibration.csv"),
    output_dir=Path("artifacts/validation/plots")
)

print("\n✅ Plots generated in artifacts/validation/plots/:")
for plot_name in plots_created:
    print(f"   • {plot_name}")

In [None]:
# Display validation dashboard
from IPython.display import Image, display

dashboard = Path("artifacts/validation/plots/validation_dashboard.png")
if dashboard.exists():
    print("📊 Validation Dashboard:")
    display(Image(filename=str(dashboard), width=1000))
else:
    print("⚠️ Dashboard not found. Run plot_validation.py first.")

---

## 🧪 OPTIONAL: Stress Testing

Test model under crisis scenarios (recession, sector shock, credit crunch):

In [None]:
%%time

# Import stress testing
from stress_test import run_stress_test

# Run stress tests
print("🔥 Running stress scenarios...")
df_stress = run_stress_test(
    portfolio_path=Path("data/processed/portfolio_scored.csv"),
    scenarios_path=Path("artifacts/stress_testing/stress_scenarios.yaml"),
    output_path=Path("artifacts/stress_testing/stress_results.csv")
)

print("\n🔥 Stress Test Results:\n")
print(df_stress[['scenario', 'baseline_red_pct', 'stressed_red_pct', 'migration_rate']].to_string(index=False))
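
One simple way to picture a stress scenario is as a uniform shift of every PD in log-odds space, followed by a recount of the Red tier. The sketch below is only an illustration under that assumption; the scenario definitions in `stress_scenarios.yaml`, the shift size, the Red threshold, and the migration-rate definition used here are all hypothetical.

```python
import math


def stress_pd(pd_value, logit_shift):
    """Shift a PD by a constant amount in log-odds space and map it back to a probability."""
    pd_value = min(max(pd_value, 1e-9), 1 - 1e-9)  # keep the logit finite
    logit = math.log(pd_value / (1 - pd_value))
    return 1 / (1 + math.exp(-(logit + logit_shift)))


def red_share(pds, red_threshold):
    """Share of the portfolio at or above the Red cutoff."""
    return sum(p >= red_threshold for p in pds) / len(pds)


# Hypothetical portfolio PDs, Red cutoff, and scenario severity
portfolio = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35]
red_threshold = 0.15
stressed = [stress_pd(p, logit_shift=0.8) for p in portfolio]

baseline_red = red_share(portfolio, red_threshold)
stressed_red = red_share(stressed, red_threshold)

# Hypothetical migration rate: share of non-Red customers that become Red under stress
non_red = [(b, s) for b, s in zip(portfolio, stressed) if b < red_threshold]
migration_rate = sum(s >= red_threshold for _, s in non_red) / len(non_red)

print(f"Red share: {baseline_red:.1%} -> {stressed_red:.1%}, migration: {migration_rate:.1%}")
```

Working in log-odds keeps stressed PDs strictly between 0 and 1, which a simple multiplier on the PD itself would not guarantee.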

---

## 📦 Output Summary

After running this notebook, you'll have:

### 📁 Data Files
- `data/raw/*.parquet` - Raw input tables (5 files: financials, credit, cashflow, covenant, labels)
- `data/processed/feature_ews.parquet` - Modeling features (20+ engineered features)
- `data/processed/scores_calibrated.csv` - All scored customers with tier assignments
- `data/processed/backtest_cohorts.parquet` - 18 months backtest data

### 🤖 Model Artifacts
- `artifacts/models/model_lgbm.pkl` - Trained LightGBM model bundle
- `artifacts/models/baseline_metrics.json` - Performance metrics (AUC, KS, Brier, PR-AUC)
- `artifacts/models/thresholds.json` - Red/Amber cutoffs (percentile-based)
- `artifacts/shap/feature_importance.csv` - SHAP feature rankings
- `artifacts/shap/top_drivers_per_customer.csv` - Individual explanations

### 📊 Validation Reports
- `artifacts/validation/plots/*.png` - 5 validation charts
- `artifacts/validation/VALIDATION_REPORT_EN.md` - Full validation documentation
- `artifacts/backtest/monthly_metrics.csv` - 18-month performance tracking
- `artifacts/backtest/psi_monthly.csv` - Population stability monitoring

### 🎯 Scoring Outputs
- `artifacts/scoring/ews_scored_2025-06-30.csv` - Production scores with tiers
- `artifacts/stress_testing/stress_results.csv` - Scenario analysis results

---

## 🔄 Re-run Individual Steps

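To re-run a specific step without re-running the entire pipeline, execute only the relevant call below. Each call assumes the imports and variables from the earlier cells (for example `df_features`) are still in memory: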

In [None]:
# Re-train model only (if you modified hyperparameters)
result = train_and_calibrate(
    df=df_features,
    target_col="event_h12m",
    test_size=0.2,
    seed=42,
    red_pct=0.05,
    amber_pct=0.10,
    outdir=Path("artifacts/models")
)

# Re-score portfolio only (if you have new customer data)
result = run_pipeline(
    features_path=Path("data/processed/feature_ews.parquet"),
    model_path=Path("artifacts/models/model_lgbm.pkl"),
    thr_path=Path("artifacts/models/thresholds.json"),
    asof="2025-06-30",
    outdir=Path("artifacts/scoring")
)

# Regenerate plots only
plots = create_all_plots(
    monthly_metrics_path=Path("artifacts/backtest/monthly_metrics.csv"),
    calibration_path=Path("artifacts/backtest/monthly_calibration.csv"),
    output_dir=Path("artifacts/validation/plots")
)

---

## 📚 Next Steps

1. **Review validation report**: Open `artifacts/validation/VALIDATION_REPORT_EN.md` in your browser or editor
2. **Inspect SHAP drivers**: Check `artifacts/shap/top_drivers_per_customer.csv` to understand individual predictions
3. **Analyze backtest stability**: Review `artifacts/backtest/monthly_metrics.csv` for performance trends
4. **Customize thresholds**: Edit Red/Amber cutoffs in `artifacts/models/thresholds.json` if needed
5. **Deploy to production**: Use `src/scoring.py` with your real customer data

**For detailed technical documentation**, see: `reports/PIPELINE_DOCUMENTATION.md`

---

## 🆘 Troubleshooting

### Common Issues

**1. Import errors**
```bash
pip install -r requirements.txt
```

**2. File not found errors**

In [None]:
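# Verify data files exist (a missing directory is reported instead of raising an error)
from pathlib import Path

for folder in ["data/raw", "data/processed"]:
    path = Path(folder)
    print(f"📂 {folder}/:")
    if not path.exists():
        print("   ✗ Missing! Re-run the step that creates it.")
        continue
    for f in sorted(path.iterdir()):
        print(f"   • {f.name}")
    print()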

**3. Model training fails**
- Check default rate > 1% (need sufficient positive examples)
- Verify features have no missing values
- Reduce dataset size if memory issues: use `--n 500` instead of `--n 1000`

**4. Plots not displaying**

In [None]:
# Regenerate all plots
plots = create_all_plots(
    monthly_metrics_path=Path("artifacts/backtest/monthly_metrics.csv"),
    calibration_path=Path("artifacts/backtest/monthly_calibration.csv"),
    output_dir=Path("artifacts/validation/plots")
)

# Check if files were created
print("📊 Generated plots:")
for plot in plots:
    print(f"   • {plot}")

### Performance Tips

- **Quick testing**: Use `--n 500` for faster experimentation (30 seconds total)
- **Full production run**: Use `--n 10000` for realistic evaluation (~5 minutes)
- **Parallel processing**: Set `n_jobs=-1` in train_baseline.py to use all CPU cores
- **Memory optimization**: If you get MemoryError, reduce `--n` parameter or close other applications

---

## 📞 Support

- **Full Documentation**: `reports/PIPELINE_DOCUMENTATION.md`
- **Repository**: corporate-credit-ews
- **Validation Report**: `artifacts/validation/VALIDATION_REPORT_EN.md`

---

✅ **You're all set!** Run the cells above from top to bottom (Shift+Enter) to execute the complete EWS pipeline.