# Corporate Credit Early Warning System (EWS)

## Project Overview

The Corporate Credit Early Warning System (EWS) follows Basel standards and uses machine learning to predict each customer's probability of default (PD) over the next 12 months.

**Main objectives:**
- Predict the likelihood of default for corporate customers (event horizon: 12 months)
- Classify risk into 3 tiers: **Green** (safe), **Amber** (warning), **Red** (danger)
- Provide specific action recommendations for the Risk Management team

**Tech Stack:**
- Python 3.13
- LightGBM (classification model)
- SHAP (model explainability)
- Sklearn (calibration, metrics)
- Pandas/Numpy (data processing)

## Pipeline Architecture

The project is organized as an **end-to-end ML pipeline** with 7 main steps:

```
1. Data Generation (generate_data.py)
   ↓
2. Feature Engineering (feature_engineering.py)
   ↓
3. Model Training + Calibration (train_baseline.py)
   ↓
4. Model Explainability (explain.py)
   ↓
5. [Optional] Make Raw Scores (make_scores_raw.py)
   ↓
6. [Optional] Re-calibration (calibrate.py)
   ↓
7. Production Scoring (scoring.py)
```

### Data Flow
- **Raw Data** → `data/raw/` (fin_quarterly, credit_daily, cashflow_daily, covenant, labels)
- **Features** → `data/processed/` (feature_ews.parquet)
- **Models** → `artifacts/models/` (model_lgbm.pkl, SHAP artifacts)
- **Scores** → `artifacts/scoring/` (ews_scored_YYYY-MM-DD.csv)

## Step 1: Data Generation

Module: `src/generate_data.py`

### Overview

Generates synthetic data for training and testing the EWS model. The data is designed to resemble real corporate credit operations closely and to follow Basel principles.

### Why Synthetic Data?

#### Benefits and limitations
Synthetic data avoids the confidentiality and compliance risks of using a bank's internal data. It also gives exact knowledge of the ground truth the model learns from (here, `event_h12m`), which matters a great deal when building machine learning models such as LightGBM. Scaling is easy as well: generating 1K, 10K, or 100K customer records takes little effort, something that is hard for projects with a small budget and limited access to real data. A further advantage is the ability to construct edge cases and stress scenarios, that is, cases in the tails of the usual distribution that rarely appear in real data.

The biggest limitation of synthetic data is distribution shift relative to real data. If the synthetic data does not accurately reproduce the complex relationships and hidden nuances in the bank's original data, a model trained on it may perform poorly or produce seriously biased predictions when applied in a real environment.

#### Role
Synthetic data plays an essential part in several stages of a data science project. It is used to establish a proof of concept for a model or solution before real data is involved, which reduces risk. It is also well suited to training new teams without exposing them to sensitive information.

---

### Customer Profile Configuration

```python
@dataclass
class Config:
    n_customers: int = 1000
    sectors: Tuple[str, ...] = (
        "MFG",  # Manufacturing (18%)
        "TRA",  # Trading (12%)
        "CON",  # Construction (10%)
        "AGR",  # Agriculture (8%)
        "ENG",  # Engineering (8%)
        "CHE",  # Chemicals (10%)
        "RET",  # Retail (12%)
        "LOG",  # Logistics (10%)
        "TEL",  # Telecom (6%)
        "IT"    # Information Technology (6%)
    )
    size_buckets: Tuple[str, ...] = ("SME", "Corp")
    size_probs: Tuple[float, ...] = (0.8, 0.2)  # 80% SME, 20% Corp
```

**Sector Risk Premiums:**
Each sector has a different risk premium (added to debt_mult):
- **Low risk:** ENG (0.0), MFG (0.0), TEL (-0.01), IT (-0.02)
- **Medium risk:** CHE (0.02), LOG (0.02), RET (0.03), CON (0.03)
- **Higher risk:** AGR (0.04), TRA (0.05)

**Size Characteristics:**
- **SME (Small-Medium Enterprise):**
  - Base revenue: lognormal(mean=10.5, σ=0.5)
  - Debt multiplier: 0.8 + 0.4 = 1.2
  - Higher default risk due to less stability
  
- **Corp (Large Corporate):**
  - Base revenue: lognormal(mean=12, σ=0.5)  → ~4.5x larger at the median (e^1.5)
  - Debt multiplier: 0.8 + 0.6 = 1.4
  - More stable but higher leverage

---

### Data Tables Generated

#### 1. **fin_quarterly.parquet** - Quarterly financial statements

**Time Range:** 12 quarters (3 years) history ending at `2025-06-30`

**Columns (15 total):**

| Column | Description | Generation Logic |
|--------|-------------|------------------|
| `customer_id` | Unique ID (C0001-C1000) | Sequential |
| `fq_date` | Quarter end date | 2022-09-30 → 2025-06-30 |
| `sector_code` | Industry sector | Random from 10 sectors |
| `size_bucket` | SME or Corp | 80/20 split |
| `revenue` | Quarterly revenue | Growth ~2% QoQ + noise |
| `cogs` | Cost of Goods Sold | ~75% of revenue |
| `ebitda` | Earnings Before Interest, Tax, D&A | ~15% margin + noise |
| `ebit` | Earnings Before Interest & Tax | EBITDA - D&A (proxy 2% revenue) |
| `interest_expense` | Quarterly interest | Debt × 8% annual / 4 |
| `total_debt` | Total outstanding debt | Revenue × (0.3 + sector_risk) × debt_mult |
| `current_assets` | AR + Inventory + Cash | Function of revenue/COGS |
| `current_liab` | AP + short-term liabilities | Function of COGS/revenue |
| `inventory` | Inventory level | ~12% of COGS |
| `ar` | Accounts Receivable | ~18% of revenue (DSO ~66 days) |
| `ap` | Accounts Payable | ~20% of COGS (DPO ~73 days) |

**Realism Features:**
- ✅ **Growth patterns:** QoQ growth ~2% ± 5% noise
- ✅ **Profitability:** EBITDA margin 15% ± 7%
- ✅ **Leverage:** Debt/Revenue varies by sector
- ✅ **Working capital:** Realistic DSO, DPO, Inventory turnover
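
The leverage and interest rules from the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `generate_data.py`: the names `SECTOR_RISK`, `DEBT_MULT`, `total_debt`, and `quarterly_interest` are hypothetical, and the values are copied from the sector premiums and size multipliers listed above.

```python
# Hypothetical lookup tables; values copied from the sector and size lists above.
SECTOR_RISK = {
    "ENG": 0.00, "MFG": 0.00, "TEL": -0.01, "IT": -0.02,
    "CHE": 0.02, "LOG": 0.02, "RET": 0.03, "CON": 0.03,
    "AGR": 0.04, "TRA": 0.05,
}
DEBT_MULT = {"SME": 1.2, "Corp": 1.4}


def total_debt(revenue: float, sector: str, size: str) -> float:
    """Total debt = revenue x (0.3 + sector premium) x size debt multiplier."""
    return revenue * (0.3 + SECTOR_RISK[sector]) * DEBT_MULT[size]


def quarterly_interest(debt: float, annual_rate: float = 0.08) -> float:
    """Quarterly interest expense at an 8% annual rate, as in the table."""
    return debt * annual_rate / 4


debt = total_debt(1_000_000, "TRA", "SME")
print(f"debt={debt:,.0f}, interest={quarterly_interest(debt):,.0f}")
```

With the same revenue, a Trading SME carries more debt than an IT SME because its sector premium is higher.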

---

#### 2. **credit_daily.parquet** - Daily credit behavior

**Time Range:** 
- Observation window: 180 days before `asof_date` (2025-01-02 → 2025-06-30)
- Label window: 365 days after `asof_date` (2025-07-01 → 2026-06-30)
- **Total:** 545 days per customer

**Columns (7 total):**

| Column | Description | Generation Logic |
|--------|-------------|------------------|
| `customer_id` | Customer ID | - |
| `date` | Business date | Daily from start to end |
| `limit` | Credit limit | 30% of latest revenue √ó randomness |
| `utilized` | Amount used | Limit √ó utilization_rate |
| `breach_flag` | 1 if utilized > limit | Binary indicator |
| `dpd_days` | Days Past Due | Markov chain: 98.5% stay/improve, 1.5% deteriorate |
| `product_type` | OD/TERM/TR_LOAN | 70% Overdraft, 20% Term Loan, 10% Trade Finance |

**DPD Markov Process:**
```python
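import numpy as np

rng = np.random.default_rng(42)
dpd = 0  # current days past due

# Each day:
if rng.random() < 0.985:
    dpd = max(0, dpd - rng.binomial(1, p=0.3))  # 30% chance of dropping by 1 day
else:
    dpd += rng.choice([1, 3, 7], p=[0.6, 0.3, 0.1])  # 60% +1, 30% +3, 10% +7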
```
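
To see how this transition rule behaves over a full customer history, the sketch below runs it for 545 days. It is an illustration only: the function name `simulate_dpd`, the seed, and the starting value of 0 are assumptions, not taken from `generate_data.py`.

```python
import numpy as np


def simulate_dpd(n_days: int, seed: int = 42) -> np.ndarray:
    """Simulate one customer's daily DPD path with the transition rule above."""
    rng = np.random.default_rng(seed)
    path = np.zeros(n_days, dtype=int)
    dpd = 0
    for day in range(n_days):
        if rng.random() < 0.985:
            # Stay or improve: 30% chance of curing one day, floored at zero.
            dpd = max(0, dpd - int(rng.binomial(1, 0.3)))
        else:
            # Deteriorate by 1, 3, or 7 days.
            dpd += int(rng.choice([1, 3, 7], p=[0.6, 0.3, 0.1]))
        path[day] = dpd
    return path


path = simulate_dpd(545)  # 180 observation days + 365 label days
print(f"max DPD: {path.max()}, share of days past due: {(path > 0).mean():.1%}")
```

Under these probabilities the expected daily change while DPD is positive is about 0.985 × (−0.3) + 0.015 × 2.2 ≈ −0.26 days, so the process pulls DPD back toward zero and long past-due streaks should be rare.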

**Utilization Pattern:**
```python
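import numpy as np

rng = np.random.default_rng(42)
day_of_year = np.arange(1, 366)
util_level = rng.beta(2, 2)  # centered around 0.5
seasonal = np.sin(day_of_year / 365 * 2 * np.pi) * 0.05  # ±5% seasonality
daily_util = util_level + seasonal + rng.normal(0, 0.05, size=day_of_year.size)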
```

---

#### 3. **cashflow_daily.parquet** - Daily cash flow

**Time Range:** Same as credit_daily (545 days)

**Columns (4 total):**

| Column | Description | Generation Logic |
|--------|-------------|------------------|
| `customer_id` | Customer ID | - |
| `date` | Business date | Daily |
| `inflow` | Cash inflow | Daily avg revenue × seasonality × noise |
| `outflow` | Cash outflow | ~90% of inflow × noise |

**Seasonality Model:**
```python
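import numpy as np

rng = np.random.default_rng(42)
annual_revenue, day_of_year = 1_000_000.0, 180  # illustrative inputs

daily_mean = annual_revenue / 365 * rng.uniform(0.6, 1.1)
seasonal_factor = 1 + 0.2 * np.sin(day_of_year / 365 * 2 * np.pi)
inflow = max(0.0, rng.normal(daily_mean * seasonal_factor, daily_mean * 0.3))
outflow = max(0.0, rng.normal(inflow * 0.9, daily_mean * 0.25))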
```

**Use cases:**
- Detect sudden revenue drops (inflow_drop_60d)
- Monitor burn rate (outflow > inflow)
- Identify cashflow volatility

---

#### 4. **covenant.parquet** - Covenant tracking

**Time Range:** Daily for observation + label windows

**Columns (8 total):**

| Column | Description | Threshold | Breach Logic |
|--------|-------------|-----------|--------------|
| `customer_id` | Customer ID | - | - |
| `date` | Business date | - | - |
| `icr` | Interest Coverage | ≥ 1.5 | 1 if < 1.5 |
| `dscr` | Debt Service Coverage | ≥ 1.2 | 1 if < 1.2 |
| `leverage` | Debt/EBITDA | ≤ 4.0 | 1 if > 4.0 |
| `breach_icr` | ICR breach flag | - | Binary |
| `breach_dscr` | DSCR breach flag | - | Binary |
| `breach_leverage` | Leverage breach flag | - | Binary |
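
The breach logic in the table translates directly into three comparisons. The sketch below assumes a DataFrame with the `icr`, `dscr`, and `leverage` columns; the helper name `add_breach_flags` and the demo rows are hypothetical.

```python
import pandas as pd

# Covenant thresholds from the table above.
ICR_MIN, DSCR_MIN, LEVERAGE_MAX = 1.5, 1.2, 4.0


def add_breach_flags(cov: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `cov` with a binary breach flag for each covenant."""
    out = cov.copy()
    out["breach_icr"] = (out["icr"] < ICR_MIN).astype(int)
    out["breach_dscr"] = (out["dscr"] < DSCR_MIN).astype(int)
    out["breach_leverage"] = (out["leverage"] > LEVERAGE_MAX).astype(int)
    return out


# Illustrative rows, not real data.
demo = pd.DataFrame({
    "customer_id": ["C0001", "C0002"],
    "icr": [2.1, 1.2],
    "dscr": [1.2, 0.9],
    "leverage": [3.5, 4.6],
})
flags = add_breach_flags(demo)
print(flags)
```

A value sitting exactly on a threshold (DSCR = 1.2 for C0001) does not count as a breach, which matches the strict inequalities in the table.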

**Why Important:**
- Covenant breach = Early warning signal
- Typical in loan agreements
- Triggers renegotiation or penalties

---

#### 5. **labels.parquet** - Target variable

**Columns (3 total):**

| Column | Description | Logic |
|--------|-------------|-------|
| `customer_id` | Customer ID | - |
| `asof_date` | Snapshot date | 2025-06-30 |
| `event_h12m` | Default in 12M | 1 if DPD ≥ 90 for ≥ 30 consecutive days |

**Label Definition (Basel-compliant):**
```python
def compute_label(future_dpd_series):
    """
    future_dpd_series: DPD values for 365 days after asof_date
    """
    max_consecutive_90plus = 0
    current_streak = 0
    
    for dpd in future_dpd_series:
        if dpd >= 90:
            current_streak += 1
            max_consecutive_90plus = max(max_consecutive_90plus, current_streak)
        else:
            current_streak = 0
    
    return 1 if max_consecutive_90plus >= 30 else 0
```

**Additional Bumps (increase default probability):**
- **High utilization bump:** If util_rate > 90% at asof_date → +20% PD
- **Covenant breach bump:** If any covenant breached → +20% PD

**Expected Default Rate:** ~5-10% (typical for corporate portfolio)

---

### Output Files

All tables saved in both **Parquet** (preferred) and **CSV** (fallback):

```
data/raw/
├── fin_quarterly.parquet       # ~1000 customers × 12 quarters = 12K rows
├── credit_daily.parquet        # ~1000 customers × 545 days = 545K rows
├── cashflow_daily.parquet      # ~1000 customers × 545 days = 545K rows
├── covenant.parquet            # ~1000 customers × 545 days = 545K rows
└── labels.parquet              # 1000 rows (one per customer)
```

**File sizes:**
- Parquet: ~5-10 MB total (compressed)
- CSV: ~30-50 MB total (uncompressed)

---

### Usage Example

```bash
# Generate with default config (1000 customers)
python src/generate_data.py --output-dir data/raw

# Generate 5000 customers for stress test
python src/generate_data.py --n-customers 5000 --output-dir data/raw_large

# Custom end date
python src/generate_data.py --end-quarter 2024-12-31 --output-dir data/raw_2024
```

---

### Quality Checks

**After generation, verify:**

✅ **Completeness:** All 5 files generated  
✅ **Row counts:** Consistent customer_ids across tables  
✅ **Date ranges:** Correct observation (180d) + label (365d) windows  
✅ **Label distribution:** Default rate 5-10%  
✅ **Financial sanity:** Revenue > 0, EBITDA margin reasonable, Debt > 0  
✅ **DPD distribution:** Majority < 30, some 30-90, few > 90  

```python
# Quick checks
import pandas as pd

labels = pd.read_parquet('data/raw/labels.parquet')
print(f"Default rate: {labels['event_h12m'].mean():.1%}")  # Should be ~5-10%

credit = pd.read_parquet('data/raw/credit_daily.parquet')
print(f"Max DPD: {credit['dpd_days'].max()}")  # Should see some 90+ days
print(f"Breach rate: {credit['breach_flag'].mean():.1%}")  # Should be low ~1-5%
```

In [None]:
# Example: generate synthetic data and verify the results

import sys
import os
sys.path.append('../src')

print("=" * 80)
print("STEP 1: GENERATE SYNTHETIC DATA")
print("=" * 80)

# Command to generate the data
print("\n📝 Command to generate data:")
print("python src/generate_data.py --n-customers 1000 --output-dir data/raw")

print("\n📊 Expected outputs:")
outputs = {
    "fin_quarterly.parquet": "~12,000 rows (1000 customers × 12 quarters)",
    "credit_daily.parquet": "~545,000 rows (1000 customers × 545 days)",
    "cashflow_daily.parquet": "~545,000 rows (1000 customers × 545 days)",
    "covenant.parquet": "~545,000 rows (1000 customers × 545 days)",
    "labels.parquet": "1,000 rows (1 row per customer)"
}

for file, desc in outputs.items():
    print(f"  ✓ data/raw/{file:30s} → {desc}")

print("\n" + "=" * 80)
print("VERIFICATION AFTER GENERATION")
print("=" * 80)

# If the data already exists, verify it
data_dir = "../data/raw"
if os.path.exists(data_dir):
    try:
        import pandas as pd
        
        print("\n✅ Data directory found! Checking files...")
        
        # Check labels
        labels_path = f"{data_dir}/labels.parquet"
        if os.path.exists(labels_path):
            labels = pd.read_parquet(labels_path)
            default_rate = labels['event_h12m'].mean()
            print(f"\n📌 labels.parquet:")
            print(f"   - Total customers: {len(labels):,}")
            print(f"   - Default rate (event_h12m=1): {default_rate:.1%}")
            print(f"   - Expected: 5-10% ✓" if 0.05 <= default_rate <= 0.10 else "   - Warning: Outside expected range")
        
        # Check credit_daily
        credit_path = f"{data_dir}/credit_daily.parquet"
        if os.path.exists(credit_path):
            credit = pd.read_parquet(credit_path)
            print(f"\n📌 credit_daily.parquet:")
            print(f"   - Total rows: {len(credit):,}")
            print(f"   - Date range: {credit['date'].min()} to {credit['date'].max()}")
            print(f"   - Max DPD: {credit['dpd_days'].max()} days")
            print(f"   - Breach rate: {credit['breach_flag'].mean():.1%}")
            print(f"   - Avg utilization: {(credit['utilized']/credit['limit']).mean():.1%}")
        
        # Check fin_quarterly
        fin_path = f"{data_dir}/fin_quarterly.parquet"
        if os.path.exists(fin_path):
            fin = pd.read_parquet(fin_path)
            print(f"\n📌 fin_quarterly.parquet:")
            print(f"   - Total rows: {len(fin):,}")
            print(f"   - Unique customers: {fin['customer_id'].nunique():,}")
            print(f"   - Quarters: {fin['fq_date'].nunique()}")
            print(f"   - Avg EBITDA margin: {(fin['ebitda']/fin['revenue']).mean():.1%}")
            print(f"   - Avg Debt/EBITDA: {(fin['total_debt']/fin['ebitda']).mean():.1f}x")
        
        print("\n✅ All checks passed!")
        
    except Exception as e:
        print(f"\n⚠️  Could not verify data: {e}")
        print("   Run the generate_data.py script first to create the data.")
else:
    print(f"\n⚠️  Data directory not found: {data_dir}")
    print("   Run the following command to generate data:")
    print("   python src/generate_data.py --n-customers 1000 --output-dir data/raw")

print("\n" + "=" * 80)

## 🔧 Step 2: Feature Engineering

Module: `src/feature_engineering.py`

### Overview

Feature engineering is the most important step in building the Early Warning System model: it turns raw data from the financial, credit-behavior, and cash-flow tables into features with strong predictive power. The process combines bank credit domain knowledge with data analysis techniques to produce a set of indicators that reflects the financial position and risk of each corporate customer.

The features are designed around Basel principles and credit risk management practice, and fall into 5 main groups: Financial Ratios, Behavioral Features, Cashflow Features, Covenant Breach Flags, and Normalization. Each group serves a specific purpose in assessing a customer's likelihood of default.

---

### A. Financial Ratios (TTM - Trailing 12 Months)

Financial ratios are computed on trailing-12-month (TTM) data so that they reflect the longer-term trend and dampen short-term volatility. We aggregate the 4 most recent quarters to compute them.

#### Liquidity & Coverage Ratios

**Interest Coverage Ratio (ICR)** is the most important indicator of debt-servicing capacity, computed as EBIT divided by interest expense. It measures the company's ability to pay interest out of operating profit. By banking convention, an ICR below 1.5 is considered dangerous: operating income leaves little or no cushion for covering interest obligations.

**Debt Service Coverage Ratio (DSCR)** measures the ability to repay both principal and interest from EBITDA after deducting capital expenditure (CAPEX). Because the synthetic data has no detailed principal repayment schedule, we use a proxy that estimates CAPEX as 30% of EBITDA. A DSCR below 1.2 indicates that the company struggles to meet its debt obligations.

**Current Ratio** reflects short-term liquidity, computed as current assets divided by current liabilities. It shows whether the company can pay the debts falling due within the next 12 months. A Current Ratio below 1.0 is a warning sign of a serious liquidity shortfall.

#### Leverage Ratio

**Debt-to-EBITDA** measures financial leverage: how many years of EBITDA the company would need to repay all of its debt. It is computed as total debt divided by TTM EBITDA. By industry convention, a Debt-to-EBITDA above 4.0 indicates an excessive debt burden that materially raises default risk.

#### Working Capital Efficiency

This group assesses how efficiently working capital is managed, using three components and one composite measure:

**Days Sales Outstanding (DSO)** measures the average number of days needed to collect cash from customers, computed as (AR / Revenue) × 365. A rising DSO indicates difficulty collecting receivables, which can lead to a cash shortfall.

**Days Payables Outstanding (DPO)** measures the average number of days the company takes to pay its suppliers, computed as (AP / COGS) × 365. A high DPO can be a positive sign (good use of trade credit) or a negative one (liquidity trouble).

**Days On Hand (DOH)** measures the average number of days inventory is held, computed as (Inventory / COGS) × 365. A high DOH indicates heavy inventory, which can disrupt cash flow.

**Cash Conversion Cycle (CCC)** is the composite measure, computed as DSO + DOH - DPO. It measures the number of days capital stays "frozen" in the business cycle, from paying for purchases to collecting cash from customers. A rising CCC indicates poor working capital management and a deteriorating cash position.
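
The ratio definitions above can be sketched for a single customer as follows. This is an illustration under stated assumptions rather than the logic in `feature_engineering.py`: flow items are summed over the last 4 quarters, balance-sheet items are taken from the latest quarter, the DSCR proxy is assumed to be EBITDA net of CAPEX divided by interest expense, and the names `ttm_ratios`, `dpo`, and `doh` are hypothetical.

```python
import pandas as pd


def ttm_ratios(fin: pd.DataFrame) -> pd.Series:
    """Compute TTM ratios for one customer from its quarterly rows."""
    last4 = fin.sort_values("fq_date").tail(4)
    latest = last4.iloc[-1]  # balance-sheet items use the latest quarter

    revenue = last4["revenue"].sum()
    cogs = last4["cogs"].sum()
    ebitda = last4["ebitda"].sum()
    interest = last4["interest_expense"].sum()

    dso = latest["ar"] / revenue * 365
    dpo = latest["ap"] / cogs * 365
    doh = latest["inventory"] / cogs * 365
    return pd.Series({
        "icr_ttm": last4["ebit"].sum() / interest,
        # Assumed proxy: EBITDA net of CAPEX (30% of EBITDA) over interest.
        "dscr_ttm_proxy": 0.7 * ebitda / interest,
        "current_ratio": latest["current_assets"] / latest["current_liab"],
        "debt_to_ebitda": latest["total_debt"] / ebitda,
        "dso": dso, "dpo": dpo, "doh": doh,
        "ccc": dso + doh - dpo,
    })


# Four identical illustrative quarters for one customer (not real data).
fin = pd.DataFrame({
    "fq_date": pd.to_datetime(["2024-09-30", "2024-12-31", "2025-03-31", "2025-06-30"]),
    "revenue": 100.0, "cogs": 75.0, "ebitda": 15.0, "ebit": 13.0,
    "interest_expense": 2.0, "total_debt": 100.0,
    "current_assets": 50.0, "current_liab": 40.0,
    "inventory": 9.0, "ar": 18.0, "ap": 15.0,
})
ratios = ttm_ratios(fin)
print(ratios.round(2))
```

A production version would also need to guard against zero or negative denominators (for example, negative EBITDA), which this sketch leaves out.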

#### Trend Analysis (QoQ)

Beyond the static ratios, we also compute the quarter-over-quarter (QoQ) change for key indicators. **delta_dso_qoq** and **delta_ccc_qoq** give the change in DSO and CCC versus the previous quarter, which helps catch early signs of deteriorating working capital management.
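
A QoQ delta is a within-customer difference between consecutive quarters. The sketch below assumes a table with one row per customer and quarter that already holds `dso` and `ccc`; the demo values are made up for illustration.

```python
import pandas as pd

# Illustrative per-quarter values for two customers (not real data).
q = pd.DataFrame({
    "customer_id": ["C0001"] * 3 + ["C0002"] * 3,
    "fq_date": pd.to_datetime(["2024-12-31", "2025-03-31", "2025-06-30"] * 2),
    "dso": [60.0, 62.0, 70.0, 45.0, 44.0, 43.0],
    "ccc": [50.0, 55.0, 66.0, 30.0, 30.0, 28.0],
})

q = q.sort_values(["customer_id", "fq_date"])
# Change versus the previous quarter, computed within each customer.
q["delta_dso_qoq"] = q.groupby("customer_id")["dso"].diff()
q["delta_ccc_qoq"] = q.groupby("customer_id")["ccc"].diff()

# Keep the most recent quarter per customer as the feature value.
latest = q.groupby("customer_id").tail(1).set_index("customer_id")
print(latest[["delta_dso_qoq", "delta_ccc_qoq"]])
```

Here C0001 shows a worsening trend (positive deltas) while C0002 is improving (negative deltas).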

### B. Behavioral Features (Observation Window = 180 days)

Behavioral features are extracted from daily transaction data over the 180 days before the as-of date. They reflect how the customer actually uses credit and often predict default better than traditional financial ratios, because they capture liquidity problems and financial stress as soon as these arise.

#### Credit Utilization

The credit limit utilization rate is a key indicator of how dependent the company is on bank borrowing. We compute two main measures:

**%util_mean_60d** is the average limit utilization over the most recent 60 days, computed as the mean of (Utilized / Limit). It shows how tight the company's liquidity is. A utilization rate above 85% means the company is pressing against its limit and may run short of liquidity under any shock.

**%util_p95_60d** is the 95th percentile of utilization over the same 60 days and measures peak usage of the limit. It matters because it shows what share of the limit the company draws on its "worst" days, which helps detect temporary periods of liquidity stress.
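
Both utilization features can be computed with one grouped aggregation. In this sketch the 60-day window is assumed to end on the as-of date and include it; the function name `utilization_features` and the demo data are hypothetical.

```python
import pandas as pd


def utilization_features(credit: pd.DataFrame, asof: str) -> pd.DataFrame:
    """Mean and 95th-percentile utilization over the 60 days ending at `asof`."""
    asof_ts = pd.Timestamp(asof)
    in_window = (credit["date"] > asof_ts - pd.Timedelta(days=60)) & (credit["date"] <= asof_ts)
    window = credit[in_window].copy()
    window["util"] = window["utilized"] / window["limit"]
    grouped = window.groupby("customer_id")["util"]
    return pd.DataFrame({
        "%util_mean_60d": grouped.mean(),
        "%util_p95_60d": grouped.quantile(0.95),
    })


# Illustrative data: 31 older days at 10%, then 54 days at 80% and 6 days at 100%.
dates = pd.date_range("2025-04-01", "2025-06-30", freq="D")
credit = pd.DataFrame({
    "customer_id": "C0001",
    "date": dates,
    "limit": 100.0,
    "utilized": [10.0] * 31 + [80.0] * 54 + [100.0] * 6,
})
feats = utilization_features(credit, "2025-06-30")
print(feats)
```

The mean stays below the 85% warning level while the 95th percentile reaches the full limit, which is the kind of short-lived stress the percentile feature is meant to expose.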

#### Delinquency Patterns

Days Past Due (DPD) is the most direct indicator of payment difficulty. We analyze it from several angles:

**dpd_max_180d** is the maximum number of days past due over the last 180 days. By industry convention, DPD above 30 days is treated as an early warning signal, while DPD above 90 days is a clear sign of default risk under the Basel definition.

**dpd_trend_180d** measures the trend of DPD over time as the slope of a linear regression of DPD on time. A positive slope means DPD is rising (the situation is getting worse), while a negative slope means the company's payment behavior is improving.

**near_due_freq_7d** measures how often the customer was "nearly overdue" in the most recent 7 days, defined as the share of days with 0 < DPD < 30. It identifies customers who are frequently late but not yet seriously delinquent, which is an important early warning.
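
The three delinquency features can be sketched from a single daily DPD array. The function name `dpd_features` is hypothetical, and the series is assumed to be ordered oldest to newest and to cover the 180-day observation window.

```python
import numpy as np


def dpd_features(dpd: np.ndarray) -> dict:
    """DPD features from a daily series ordered oldest to newest (180 days)."""
    days = np.arange(dpd.size)
    slope = np.polyfit(days, dpd, deg=1)[0]  # least-squares slope, DPD per day
    last7 = dpd[-7:]
    return {
        "dpd_max_180d": int(dpd.max()),
        "dpd_trend_180d": float(slope),
        # Share of the last 7 days with 0 < DPD < 30.
        "near_due_freq_7d": float(((last7 > 0) & (last7 < 30)).mean()),
    }


# Illustrative series: clean for 170 days, then DPD climbs 1, 2, ..., 10.
series = np.concatenate([np.zeros(170), np.arange(1, 11)])
feats = dpd_features(series)
print(feats)
```

In this example the maximum DPD is still far below 30, yet the positive slope and a near-due frequency of 1.0 already flag the customer.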

#### Credit Limit Breach

**limit_breach_cnt_90d** counts how many times the customer exceeded its credit limit in the most recent 90 days. Any breach (> 0) is a serious warning sign: either the company needs more funding than it has been approved for, or its internal controls are weak.

---

### C. Cashflow Features

Cash flow is the lifeblood of a business and matters more than accounting profit when predicting default. We analyze daily inflows and outflows over the last 180 days to build these features:

**inflow_mean_60d** and **outflow_mean_60d** are the average daily inflow and outflow over the most recent 60 days. They reflect the scale and stability of business activity. A large gap between inflow and outflow (a high burn rate) is a warning sign.

**inflow_drop_60d** measures how far inflow over the most recent 60 days has fallen relative to the 6-month median, computed as (median_6m - mean_60d) / median_6m. It helps detect a revenue decline early, one of the main causes of default. An inflow drop above 20% means cash inflow is falling sharply and calls for immediate intervention.
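
The `inflow_drop_60d` formula can be sketched as below. The 6-month baseline is assumed to be the last 180 daily observations, and the NaN returned for a non-positive baseline is an illustrative choice rather than documented behavior.

```python
import numpy as np


def inflow_drop_60d(inflow: np.ndarray) -> float:
    """(median_6m - mean_60d) / median_6m for a daily inflow series (oldest first)."""
    median_6m = np.median(inflow[-180:])
    mean_60d = inflow[-60:].mean()
    if median_6m <= 0:
        return float("nan")  # ratio undefined without a positive baseline
    return float((median_6m - mean_60d) / median_6m)


# Illustrative series: 120 days at 100, then 60 days at 70.
inflow = np.concatenate([np.full(120, 100.0), np.full(60, 70.0)])
drop = inflow_drop_60d(inflow)
print(f"inflow_drop_60d = {drop:.0%}")
```

Using the median for the baseline keeps a few unusually large receipts from inflating it, so the drop is measured against a typical day.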

---

### D. Covenant Breach Flags

Covenants are financial thresholds the customer must maintain under the credit agreement. A covenant breach is a very important early warning signal and usually occurs before an actual default.

We track three main covenants: **breach_icr** (ICR < 1.5), **breach_dscr** (DSCR < 1.2), and **breach_leverage** (Debt/EBITDA > 4.0). Each breach flag is a binary variable (0/1) that indicates whether the customer has breached the corresponding covenant. Breaching any covenant triggers actions such as renegotiation, tighter terms, or closer monitoring.

---

### E. Normalization (Sector-Size)

One of the biggest challenges in credit scoring is comparing companies that differ in size and industry. A Retail SME with ICR = 2.0 may be considered good, while the same value for a large IT corporate is average or weak.

To address this, we apply **Z-score normalization** within each (Sector, Size_bucket) group. Each feature is standardized against customers in the same sector and of the same size:

```
z_score = (value - median_group) / IQR_group
```

We use the **median and IQR (interquartile range)** instead of the mean and standard deviation because they are more robust to outliers, which are very common in financial data. Normalized features carry the suffix `__zs_sector_size`, for example `icr_ttm__zs_sector_size` and `dpd_max_180d__zs_sector_size`.

This normalization brings two important benefits: (1) a fair comparison between companies with the same characteristics, and (2) stronger predictive power, because features are adjusted to the context of each group.
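
The group-wise robust z-score can be sketched with a pandas `groupby` and `transform`. The function name `robust_z_by_group` is hypothetical, and leaving zero-IQR groups as NaN is an assumption about how degenerate groups might be handled.

```python
import pandas as pd


def robust_z_by_group(df: pd.DataFrame, feature: str) -> pd.Series:
    """(value - group median) / group IQR within each (sector, size) group."""
    grouped = df.groupby(["sector_code", "size_bucket"])[feature]
    median = grouped.transform("median")
    iqr = grouped.transform(lambda s: s.quantile(0.75) - s.quantile(0.25))
    # Groups with zero IQR would divide by zero; leave those as NaN.
    return (df[feature] - median) / iqr.where(iqr > 0)


# Illustrative data: two groups whose ICR values sit on very different scales.
df = pd.DataFrame({
    "sector_code": ["RET"] * 5 + ["IT"] * 5,
    "size_bucket": ["SME"] * 5 + ["Corp"] * 5,
    "icr_ttm": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
df["icr_ttm__zs_sector_size"] = robust_z_by_group(df, "icr_ttm")
print(df)
```

Both groups end up with the same normalized values (-1 to 1), even though their raw ICR levels differ by a factor of ten, so each customer is scored relative to its own peers.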

In [None]:
# Example: Feature Engineering
print("Command to create features:")
print("python src/feature_engineering.py --raw-dir data/raw --asof 2025-06-30 --outdir data/processed")
print("\nKey features created:")
features = [
    "Financial: icr_ttm, dscr_ttm_proxy, debt_to_ebitda, current_ratio, dso, ccc",
    "Behavioral: %util_mean_60d, dpd_max_180d, dpd_trend_180d, limit_breach_cnt_90d",
    "Cashflow: inflow_mean_60d, inflow_drop_60d",
    "Covenant: breach_icr, breach_dscr, breach_leverage",
    "Normalized: *__zs_sector_size versions"
]
for f in features:
    print(f"  - {f}")

## Step 3: Model Training & Calibration

Module: `src/train_baseline.py`

### Overview

This step builds the machine learning model that predicts the probability of default (PD) over the next 12 months. We use **LightGBM** as the base classifier because it handles many features well, learns non-linear relationships automatically, and supports class balancing. After training, the model is **calibrated** with Isotonic Regression so that predicted probabilities match true probabilities, which is important for credit risk management and Basel compliance.

---

### A. LightGBM Configuration

LightGBM was chosen because it handles features on very different scales (financial ratios, utilization rates, DPD counts), learns feature interactions automatically ("low ICR + high utilization = high risk"), and supports class weighting for imbalanced data (default rate ~5-10%).

**Hyperparameters:**

```python
from lightgbm import LGBMClassifier

# pos_rate: share of defaults (event_h12m = 1) in the training labels
LGBMClassifier(
    n_estimators=300,           # 300 trees in the ensemble
    learning_rate=0.05,         # slow but stable learning
    num_leaves=15,              # limit tree complexity
    max_depth=6,                # guard against overfitting
    subsample=0.8,              # row sampling (bagging); only active when subsample_freq > 0
    colsample_bytree=0.8,       # column sampling
    min_child_samples=10,       # each leaf holds ≥ 10 samples
    reg_lambda=0.1,             # L2 regularization
    scale_pos_weight=(1-pos_rate)/pos_rate,  # auto-balance classes
    objective='binary',
    random_state=42
```

**Train-Test Split**: 80% training, 20% holdout test, with stratified sampling so that the default rate is the same in both sets.

---

### B. Isotonic Calibration (CV=5)

Gradient boosting models often produce **uncalibrated probabilities**: when the model predicts PD = 20%, the actual default rate may be 15% or 30%. In credit risk this is dangerous, because important decisions (capital allocation, pricing, provisioning) rely on the PD figure.

**Isotonic Regression** is a non-parametric, monotonic calibration method. It learns a mapping from raw to calibrated probabilities under a non-decreasing constraint, so a customer with a higher raw PD never receives a lower calibrated PD. We use 5-fold CV to avoid overfitting:

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    base_lgbm,
    method='isotonic',
    cv=5
)
calibrated_model.fit(X_train, y_train)
```

Isotonic Regression is preferred over Platt Scaling because: (1) it assumes no functional form (no sigmoid required), (2) its mapping is monotonic, so the rank order is preserved up to ties, and (3) it performs better with a large sample size (≥ 1000 customers).

---

### C. Risk Tiers & Thresholds

After calibration, customers are assigned to 3 risk tiers using **percentile-based thresholds** (rather than absolute PD cutoffs) to manage capacity. The bank only has the resources to manage a limited number of the highest-risk customers closely, so fixing the Red tier at the top 5% keeps the number of customers under intensive monitoring within capacity.

**Tier Definitions:**

| Tier | Percentile | Typical PD | Action | Capacity |
|------|-----------|-----------|--------|----------|
| **Red** | Top 5% | ≥ 20% | Meet the customer within ≤5 days; build a 13-week cash flow; tighten covenants; watchlist | ~50 customers → 5 RMs |
| **Amber** | Top 5-15% | 5-20% | Review within ≤10 days; request management accounts; restrict limits | ~100 customers → 10 RMs |
| **Green** | Bottom 85% | < 5% | Routine quarterly monitoring; no special action required | Portfolio monitoring |

**Threshold Calculation:**

```python
import numpy as np

train_probs = calibrated_model.predict_proba(X_train)[:, 1]
red_threshold = np.percentile(train_probs, 95)      # e.g., 0.23
amber_threshold = np.percentile(train_probs, 85)    # e.g., 0.08
```

The thresholds are saved to `thresholds.json` and reused consistently in later scoring runs.
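
Applying fixed thresholds to new scores can be sketched as follows. The probabilities here are random stand-ins rather than model output, the helper `assign_tiers` is hypothetical, and the JSON keys are an assumed layout that may differ from the real `thresholds.json`.

```python
import json

import numpy as np


def assign_tiers(probs: np.ndarray, red_thr: float, amber_thr: float) -> np.ndarray:
    """Map calibrated PDs to Red / Amber / Green using fixed thresholds."""
    return np.where(probs >= red_thr, "Red",
                    np.where(probs >= amber_thr, "Amber", "Green"))


# Stand-in for calibrated training PDs (illustrative, not model output).
rng = np.random.default_rng(42)
train_probs = rng.beta(1, 12, size=1000)

red_thr = float(np.percentile(train_probs, 95))
amber_thr = float(np.percentile(train_probs, 85))
tiers = assign_tiers(train_probs, red_thr, amber_thr)

# Hypothetical JSON layout; the real thresholds.json schema may differ.
payload = json.dumps({"red": red_thr, "amber": amber_thr})
print(payload)
print({t: int((tiers == t).sum()) for t in ("Red", "Amber", "Green")})
```

On 1000 scores this yields 50 Red and 100 Amber customers, which lines up with the capacity column in the tier table. Because the thresholds are frozen, the tier shares on a later scoring run can drift away from 5% and 10% if the portfolio's PD distribution shifts.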

---

### D. Evaluation Metrics

The model is evaluated with several metrics, each measuring a different aspect:

**1. AUC-ROC (Discrimination Power)**  
Measures the ability to separate defaulters from non-defaulters. AUC = 0.80 means that for a randomly chosen defaulter/non-defaulter pair, the model assigns the higher PD to the defaulter 80% of the time. **Target**: ≥ 0.75 (industry standard).

**2. PR-AUC (Precision-Recall)**  
Important for imbalanced data (low default rate). Precision = share of customers predicted to default that actually default. Recall = share of actual defaults that are detected. **Target**: ≥ 0.40 (with a base rate of ~8%).

**3. KS Statistic (Kolmogorov-Smirnov)**  
Measures the maximum separation between the cumulative score distributions of defaulters and non-defaulters. KS = max(TPR - FPR). **Target**: ≥ 0.35 (good discriminatory power).

**4. Brier Score (Calibration Quality)**  
MSE of the probabilities: `Brier = (1/N) × Σ(predicted_prob - actual_outcome)²`. A low Brier score means accurate predictions (if 10 customers are each assigned PD = 20%, ideally 2 of them default). **Target**: ≤ 0.10. Brier drops noticeably after calibration (from ~0.12 to ~0.08).

**5. Calibration Curve (Reliability Diagram)**  
Visualizes calibration: plots mean predicted probability against the actual default rate in each bin. The ideal line is y = x (the diagonal).
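
The definitions above can be checked by hand on a tiny example. The following sketch implements pairwise AUC, KS, and the Brier score in plain Python; the function names and the toy labels and scores are made up for illustration, and this is not the project's own metric code.

```python
def auc_pairwise(y_true, p):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count 0.5."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, p):
    """Maximum of TPR - FPR over all score thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(p)):
        tpr = sum(1 for yi, pi in zip(y_true, p) if yi == 1 and pi >= t) / n_pos
        fpr = sum(1 for yi, pi in zip(y_true, p) if yi == 0 and pi >= t) / n_neg
        best = max(best, tpr - fpr)
    return best


def brier_score(y_true, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)


# Toy holdout set (hypothetical): 8 customers, 3 defaults
y_te = [0, 0, 0, 0, 1, 0, 1, 1]
p_te = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]

auc = auc_pairwise(y_te, p_te)
ks = ks_statistic(y_te, p_te)
brier = brier_score(y_te, p_te)
print(f"AUC={auc:.3f}  KS={ks:.3f}  Brier={brier:.4f}")
```

AUC and KS depend only on how the scores rank customers, while the Brier score also depends on the probability values themselves, which is why calibration mainly moves the Brier score.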

---

### E. Outputs & Artifacts

**1. Model File**: `model_lgbm.pkl` - Contains the base LightGBM model, the calibrated model, feature names, and metadata (training date, hyperparameters, test AUC, test Brier).

**2. Scores**: `scores_all.csv` - Predictions for the full dataset (train + test) with columns: `customer_id`, `prob_default_12m_base`, `prob_default_12m_calibrated`, `tier`, `is_test`.

**3. Thresholds**: `thresholds.json` - Stores the red/amber/green thresholds and percentiles for use in later scoring.

**4. Visualizations**:
- `calibration_lgbm.png`: Reliability diagram (before vs after calibration)
- `pr_curve_lgbm.png`: Precision-Recall curve
- `roc_curve_lgbm.png`: ROC curve
- `shap_summary.png`: Quick SHAP summary (top 10 features)

All artifacts are version-controlled to ensure reproducibility and auditability.

Example: command to train the LightGBM model:

`python src/train_baseline.py --features data/processed/feature_ews.parquet --test-size 0.2 --seed 42 --red-pct 0.05 --amber-pct 0.10 --outdir artifacts/models`

Outputs:
- `artifacts/models/model_lgbm.pkl` (base + calibrated model + features)
- `artifacts/models/scores_all.csv` (predictions + tiers)
- `artifacts/models/thresholds.json`
- `artifacts/models/calibration_lgbm.png`
- `artifacts/models/pr_curve_lgbm.png`
- `artifacts/models/shap_summary.csv/png`

## 🔍 Step 4: Model Explainability (SHAP)

Module: `src/explain.py`

### Overview

SHAP (SHapley Additive exPlanations) explains each feature's contribution to a prediction, based on game theory. A positive SHAP value means the feature raises the predicted default probability, a negative value means it lowers default risk, and the magnitude indicates the strength of the effect. This matters for the Credit Committee (explaining decisions), the RM team (advising customers on how to improve), and Model Validation (confirming the model has learned sensible patterns).

---

### Global Explainability

**Feature Importance** (`feature_importance.csv`): Mean absolute SHAP values per feature, showing which features influence the model most across the whole portfolio. For example, `dpd_max_180d__zs_sector_size` is usually the most important feature because DPD is the strongest signal of default risk.

**SHAP Summary Plot** (`shap_summary.png`): Beeswarm plot visualizing the impact of all features. Each point is one customer; red = high feature value, blue = low feature value. The plot shows not only which features matter but also the direction of their impact (high DPD → high risk, high ICR → low risk).
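
To make the "mean absolute SHAP" ranking concrete, here is a small sketch that computes it from per-customer SHAP values held in plain dictionaries. The function name `mean_abs_shap`, the shortened feature names, and the numbers are hypothetical; the real `explain.py` may organize this differently.

```python
def mean_abs_shap(shap_rows):
    """Rank features by mean |SHAP| across customers.

    shap_rows: list of dicts mapping feature name -> SHAP value, one dict per customer.
    """
    n = len(shap_rows)
    importance = {
        feature: sum(abs(row[feature]) for row in shap_rows) / n
        for feature in shap_rows[0]
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for three customers
shap_rows = [
    {"dpd_max_180d": 0.52, "icr_ttm": 0.20, "ccc": -0.05},
    {"dpd_max_180d": -0.30, "icr_ttm": -0.10, "ccc": 0.02},
    {"dpd_max_180d": 0.10, "icr_ttm": 0.30, "ccc": -0.02},
]

ranking = mean_abs_shap(shap_rows)
for feature, score in ranking:
    print(f"{feature}: {score:.3f}")
```

Taking the absolute value first matters: a feature that pushes some customers up and others down would otherwise average out to roughly zero and look unimportant.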

---

### Local Explainability

**Top Drivers per Customer** (`top_drivers_per_customer.csv`): The top 5 most influential features for each individual customer, which helps answer questions such as "Why was customer C0042 placed in the Red tier?". The output includes the feature name, SHAP value, and actual feature value.

Example for customer C0042:
1. `dpd_max_180d__zs_sector_size`: SHAP = +0.52 (value = 120 days) → high DPD
2. `%util_mean_60d__zs_sector_size`: SHAP = +0.31 (value = 0.95) → utilization close to the credit limit
3. `icr_ttm__zs_sector_size`: SHAP = +0.20 (value = 0.8) → low ICR, difficulty covering interest

With this information, the RM can advise the customer to: (1) clear outstanding payments to reduce DPD, (2) reduce credit usage or apply for a limit increase, (3) improve profitability or restructure debt.
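
The per-customer selection can be sketched as follows, assuming drivers are ranked by absolute SHAP value (the source does not say whether `explain.py` ranks by absolute or signed value). The function name `top_drivers` is hypothetical, and the last two features and their numbers are invented to pad the example.

```python
def top_drivers(shap_values, feature_values, k=5):
    """Return the k features with the largest |SHAP| as (name, shap, value) tuples."""
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    return [(name, shap_values[name], feature_values[name]) for name in ranked[:k]]


# SHAP values for customer C0042; the first three match the example above
shap_c0042 = {
    "dpd_max_180d__zs_sector_size": 0.52,
    "%util_mean_60d__zs_sector_size": 0.31,
    "icr_ttm__zs_sector_size": 0.20,
    "ccc__zs_sector_size": -0.04,            # hypothetical
    "current_ratio__zs_sector_size": 0.08,   # hypothetical
}
values_c0042 = {
    "dpd_max_180d__zs_sector_size": 120,
    "%util_mean_60d__zs_sector_size": 0.95,
    "icr_ttm__zs_sector_size": 0.8,
    "ccc__zs_sector_size": 0.3,              # hypothetical
    "current_ratio__zs_sector_size": 1.1,    # hypothetical
}

drivers = top_drivers(shap_c0042, values_c0042, k=3)
for rank, (name, shap_value, value) in enumerate(drivers, start=1):
    print(f"{rank}. {name}: SHAP = {shap_value:+.2f} (value = {value})")
```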

---

### Dependence Plots

SHAP dependence plots for key features (`icr_ttm`, `ccc`, `%util_mean_60d`) show the nonlinear relationship between feature value and SHAP value. For example, the dependence plot for ICR might show that ICR < 1.5 has very high SHAP values (risk rises sharply), ICR 1.5-3.0 has gradually declining SHAP values, and ICR > 3.0 has SHAP values near 0 (no additional risk). These insights help validate that the model has learned the right business logic.

---

### Outputs Summary

| File | Type | Purpose |
|------|------|---------|
| `feature_importance.csv` | Global | Ranking features by importance |
| `shap_summary.png` | Global | Visual impact of all features |
| `top_drivers_per_customer.csv` | Local | Top 5 drivers for each customer |
| `shap_dependence_*.png` | Global | Nonlinear relationships |
| `summary.json` | Metadata | Config v√† stats |

Example: command to generate SHAP explanations:

`python src/explain.py --model artifacts/models/model_lgbm.pkl --features data/processed/feature_ews.parquet --outdir artifacts/shap --max-display 20`

Outputs:
- `artifacts/shap/feature_importance.csv`
- `artifacts/shap/shap_summary.png`
- `artifacts/shap/top_drivers_per_customer.csv`
- `artifacts/shap/shap_dependence_*.png`
- `artifacts/shap/summary.json`

## ⚙️ Step 5-6: [Optional] Re-calibration

### Why Optional?

Step 3 (`train_baseline.py`) already produces a **calibrated model** with percentile-based thresholds (Red = top 5%, Amber = top 5-15%) that is ready for production. Steps 5-6 are only needed when the business wants to **change the threshold strategy** from percentile-based to **absolute PD cutoffs** (for example: Red ≥ 20% PD, Amber ≥ 5% PD) to match a specific risk appetite or regulatory requirement.

In practice, the percentile-based approach is often preferred because it keeps the number of customers needing intensive monitoring within capacity. However, some organizations (especially banks following Basel/IFRS 9) require absolute thresholds for consistency with internal risk rating systems or regulatory reporting.

---

### Step 5: Extract Raw Scores

**Module**: `src/make_scores_raw.py`

Extracts raw probabilities from the **base LightGBM** model (before the isotonic calibration of Step 3 is applied) to provide baseline scores for the re-calibration process. The output is `scores_raw.csv`, containing uncalibrated predictions for the full dataset.

**Why needed?** Re-calibration takes raw scores as input because a new calibrator is fitted and paired with absolute thresholds, separately from the calibrator in Step 3.

---

### Step 6: Re-calibrate with Absolute Thresholds

**Module**: `src/calibrate.py`

Re-fit **Isotonic Regression** on the raw scores and apply absolute PD cutoffs instead of percentiles. The process consists of: (1) fit the calibrator on the training set, (2) map raw scores → calibrated PD, (3) apply absolute thresholds (Red ≥ 20%, Amber ≥ 5%), (4) save the calibrator and thresholds.
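
To illustrate steps (1) to (3), the sketch below fits a monotone step-function calibrator with the pool-adjacent-violators idea and then applies the absolute cutoffs. It is a simplified pure-Python stand-in, not the project's `calibrate.py`: a library implementation such as scikit-learn's also handles tied scores and interpolation. The function names and toy data are hypothetical, and distinct raw scores are assumed.

```python
def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns [(upper_score_bound, calibrated_pd), ...]."""
    blocks = []  # each block: [label_sum, count, max_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, max_score = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(max_score, label_sum / count) for label_sum, count, max_score in blocks]


def predict_isotonic(mapping, score):
    """Look up the calibrated PD of the first block whose upper bound covers the score."""
    for upper, pd_value in mapping:
        if score <= upper:
            return pd_value
    return mapping[-1][1]


def tier_from_pd(pd_value, red_thr=0.20, amber_thr=0.05):
    """Absolute cutoffs: Red >= 20% PD, Amber >= 5% PD."""
    if pd_value >= red_thr:
        return "Red"
    return "Amber" if pd_value >= amber_thr else "Green"


# Toy training data (hypothetical): raw scores and observed default labels
raw_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
defaults = [0, 0, 0, 1, 0, 0, 1, 1]

mapping = fit_isotonic(raw_scores, defaults)
calibrated = predict_isotonic(mapping, 0.45)
print(mapping)
print(calibrated, tier_from_pd(calibrated))
```

In this toy data the scores 0.40 to 0.60 are pooled into one block because their labels (1, 0, 0) are out of order, so all three receive the block's average default rate.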

**Key difference from Step 3:**
- Step 3: Calibrate → Calculate percentile thresholds → Tiers fixed by % (top 5%, next 10%)
- Step 6: Calibrate → Apply absolute PD thresholds → Tiers vary by portfolio quality

**Outputs:**
- `calibrator.pkl`: New isotonic calibrator
- `mapping.csv`: Raw score → Calibrated PD mapping table
- `thresholds.json`: Absolute cutoffs (red: 0.20, amber: 0.05)
- `calibration_full.png`: Reliability diagram
- `pr_curve_full.png`: Precision-Recall curve

**Tradeoff:** With absolute thresholds, the number of customers in the Red/Amber tiers can fluctuate with portfolio quality (good period → few Red, bad period → many Red), which makes capacity planning harder.

## 🎯 Step 7: Production Scoring

Module: `src/scoring.py`

Scoring is the final step in putting the model into production. The script loads the trained model, predicts PD for all customers from the feature snapshot at the as-of date (for example: 2025-06-30), then assigns tiers and produces action recommendations. The output is used directly by the RM team and Risk Committee for business decisions.

---

### Inputs & Outputs

**Inputs:**
1. **Features**: `data/processed/feature_ews.parquet` - Feature snapshot at the as-of date
2. **Model**: `artifacts/models/model_lgbm.pkl` - Trained & calibrated LightGBM
3. **Thresholds**: `artifacts/calibration/thresholds.json` or `artifacts/models/thresholds.json` - Depending on the approach (absolute vs percentile)

**Output**: `ews_scored_YYYY-MM-DD.csv` with columns:

| Column | Description | Example |
|--------|-------------|---------|
| `customer_id` | Customer identifier | C0042 |
| `prob_default_12m_calibrated` | 12-month PD (0-1) | 0.2341 |
| `score_ews` | EWS Score (0-100) | 76.59 |
| `tier` | Risk tier | Red |
| `action` | Recommended action | Customer meeting within 5 days; build a 13-week cash flow;... |

**EWS Score Formula**: `100 × (1 - PD)` → A high score means low risk (100 = best, 0 = worst)
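
As a quick check of the formula against the example row in the table, here is a minimal sketch. The function name `ews_score` and the rounding to two decimals are assumptions based on the example value, not confirmed details of `scoring.py`.

```python
def ews_score(pd_12m):
    """EWS score = 100 x (1 - PD), rounded to 2 decimals (rounding is an assumption)."""
    if not 0.0 <= pd_12m <= 1.0:
        raise ValueError("PD must lie in [0, 1]")
    return round(100.0 * (1.0 - pd_12m), 2)


example_score = ews_score(0.2341)
print(example_score)  # 76.59, the value shown in the example row
```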

---

### Risk Tiers & Actions

| Tier | Criteria | Action | Frequency |
|------|----------|--------|-----------|
| **Green** | PD < 5% (or bottom 85%) | Routine monitoring; financial statements updated on time | Quarterly |
| **Amber** | 5% ≤ PD < 20% (or top 5-15%) | RM review within 10 days; request management accounts; check outstanding debts; restrict credit limits | Monthly |
| **Red** | PD ≥ 20% (or top 5%) | Customer meeting within 5 days; build a 13-week cash flow; consider covenant tightening/collateral; watchlist | Weekly |

**Note**: The criteria depend on the threshold approach (absolute vs percentile) chosen in Step 3 or Steps 5-6.

---

### Production Workflow

**Monthly Cadence**:
1. **Last day of month**: Run the scoring script with as-of date = month-end
2. **Day 1-2**: Distribute the report to the RM team and Risk Committee
3. **Day 3-10**: RMs carry out actions by tier (Amber reviews, Red meetings)
4. **Throughout month**: Track action completion and update customer status

**Integration with Banking Systems**:
- **Input**: Features from the core banking system (financial data, credit transactions, cashflow)
- **Output**: EWS scores imported into CRM/Credit Risk systems
- **Alerts**: Auto-trigger emails/notifications for customers moving into the Red tier

**Monitoring**: Track tier migrations month-over-month to identify portfolio trends (improving/deteriorating).
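
One possible way to track tier migrations is a simple count of (previous tier, current tier) pairs, sketched below with plain dictionaries. The function name `tier_migrations` and the customer data are hypothetical; in practice the inputs would come from two `ews_scored_*.csv` files.

```python
from collections import Counter


def tier_migrations(previous, current):
    """Count (previous_tier, current_tier) pairs for customers scored in both months."""
    common = previous.keys() & current.keys()
    return Counter((previous[c], current[c]) for c in common)


# Hypothetical tiers from two consecutive monthly runs (customer_id -> tier)
previous = {"C0001": "Green", "C0002": "Amber", "C0003": "Amber", "C0004": "Red"}
current = {
    "C0001": "Green",
    "C0002": "Red",
    "C0003": "Green",
    "C0004": "Red",
    "C0005": "Amber",  # new customer, ignored in the migration count
}

migrations = tier_migrations(previous, current)

# Customers that newly entered Red are the ones that would trigger alerts
new_red = sorted(
    c for c in previous.keys() & current.keys()
    if current[c] == "Red" and previous[c] != "Red"
)

for (old, new), count in sorted(migrations.items()):
    print(f"{old} -> {new}: {count}")
print("Newly Red:", new_red)
```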

## 📊 Model Performance & Validation

### Expected Performance Metrics (Holdout 20%)

The model is evaluated on the holdout test set using the following metrics (computed in `train_baseline.py`, lines 99-104):

| Metric | Target Range | Meaning | Code |
|--------|-------------|---------|------|
| **AUC-ROC** | 0.75 - 0.85 | Ability to separate defaulters from non-defaulters | `roc_auc_score(y_te, p_te)` |
| **PR-AUC** | 0.40 - 0.60 | Performance on the positive class (important with imbalanced data) | `average_precision_score(y_te, p_te)` |
| **KS Statistic** | 0.35 - 0.50 | Maximum separation between the cumulative distributions | `ks_score(y_te, p_te)` |
| **Brier Score** | 0.05 - 0.10 | Calibration quality (lower is better) | `brier_score_loss(y_te, p_te)` |

**Calibration Quality**: The reliability curve (predicted probabilities vs actual default rates) should lie close to the diagonal (y = x). Isotonic calibration improves this noticeably, typically lowering the Brier score from ~0.12 to ~0.08. Plots are generated by the `plot_calibration_pr()` function (lines 47-61).

**Precision-Recall Tradeoff**: The Red threshold (PD ≥ 20%) has high precision and moderate recall; the Amber threshold (PD ≥ 5%) has a more balanced precision-recall profile.

---

### Model Monitoring & Maintenance

**Quarterly Reviews** (monitoring scripts need to be implemented separately):
1. **Performance drift**: Monitor AUC and KS on new data (target: no drop of more than 5%)
   - Re-run `train_baseline.py` on the new data and compare metrics
2. **Population Stability Index (PSI)**: Measures distribution shift in the features (target: PSI < 0.15)
   - Formula: `PSI = Σ[(actual% - expected%) × ln(actual% / expected%)]`, summed over bins
3. **Feature stability**: Check data quality, missing values, outliers
   - Use data profiling tools or pandas `.describe()`
4. **Recalibration**: If the Brier score rises above 0.10, consider re-fitting the calibrator
   - Re-run Step 6 (`calibrate.py`) with the new data
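
Since the monitoring scripts are left to the reader, here is a minimal sketch of the PSI calculation from item 2. The function names, the bin edges, and the toy ICR values are assumptions for illustration; the small `eps` floor is one common way to avoid taking the log of an empty bin.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges):
    """Share of values per bin; `edges` are the inner cut points in ascending order."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected_pct, actual_pct, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to keep the log finite
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical ICR values: training snapshot vs a recent snapshot
baseline = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
recent = [0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 1.6, 2.0]
edges = [0.5, 1.0, 1.5]  # bin cut points fixed on the baseline

expected_pct = bin_shares(baseline, edges)
actual_pct = bin_shares(recent, edges)
psi_value = psi(expected_pct, actual_pct)

print(f"PSI = {psi_value:.4f}")
print("Investigate" if psi_value >= 0.15 else "Stable")
```

The bin edges should be fixed once on the baseline data and reused for every later snapshot; recomputing them per snapshot would hide the very shift PSI is meant to detect.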

**Red Flags Trigger Retraining**:
- AUC drops below 0.70
- Brier score > 0.15
- Large prediction shifts without business explanation (e.g., 10% of customers changing tier unexpectedly)
- PSI > 0.25 (severe distribution shift)

## Feature Importance Ranking

Based on the SHAP analysis in `explain.py`, these are the 10 features with the strongest impact on the default prediction:

| # | Feature Name | Category | Business Interpretation |
|---|--------------|----------|------------------------|
| 1 | `dpd_max_180d__zs_sector_size` | Behavioral | Maximum DPD over 6 months - the strongest signal of default risk |
| 2 | `%util_mean_60d__zs_sector_size` | Behavioral | Average credit utilization - reflects liquidity stress |
| 3 | `icr_ttm__zs_sector_size` | Financial | Interest Coverage Ratio - ability to service interest |
| 4 | `debt_to_ebitda__zs_sector_size` | Financial | Financial leverage - degree of financial gearing |
| 5 | `ccc__zs_sector_size` | Financial | Cash Conversion Cycle - working-capital management efficiency |
| 6 | `inflow_drop_60d__zs_sector_size` | Cashflow | Drop in revenue inflows - cashflow deterioration |
| 7 | `dpd_trend_180d__zs_sector_size` | Behavioral | Rising DPD trend - payment behavior is worsening |
| 8 | `breach_icr` | Covenant | ICR covenant breach - a direct trigger event |
| 9 | `current_ratio__zs_sector_size` | Financial | Current Ratio < 1.0 - short-term liquidity risk |
| 10 | `delta_ccc_qoq__zs_sector_size` | Financial | Quarter-over-quarter change in CCC - efficiency is declining |

### Analysis by Category

- **Behavioral (40%)**: Actual payment patterns are the strongest predictor - DPD history and utilization give the earliest signal of financial difficulty
- **Financial (35%)**: Fundamental financial indicators (ICR, leverage, liquidity ratios) are the second most important
- **Cashflow (15%)**: Revenue trends and cashflow dynamics detect deterioration earlier than financial statements do
- **Covenant (10%)**: Breach events have a notable impact but appear later

**Conclusion**: The model prioritizes behavioral signals because payment difficulties appear before financial statements fully reflect them. This is consistent with risk management practice in credit monitoring.

## 🚀 Complete End-to-End Pipeline

### Full Workflow (Development)

```bash
# Step 1: Generate synthetic data
python src/generate_data.py --n-customers 1000 --output-dir data/raw

# Step 2: Feature engineering
python src/feature_engineering.py --raw-dir data/raw --asof 2025-06-30 --outdir data/processed

# Step 3: Train model + calibration
python src/train_baseline.py --features data/processed/feature_ews.parquet --test-size 0.2 --seed 42 --red-pct 0.05 --amber-pct 0.10 --outdir artifacts/models

# Step 4: Generate SHAP explanations
python src/explain.py --model artifacts/models/model_lgbm.pkl --features data/processed/feature_ews.parquet --outdir artifacts/shap --max-display 20

# [Optional] Step 5-6: Re-calibration with absolute thresholds
python src/make_scores_raw.py --features data/processed/feature_ews.parquet --model artifacts/models/model_lgbm.pkl --out data/processed/scores_raw.csv
python src/calibrate.py --input data/processed/scores_raw.csv --red-thr 0.20 --amber-thr 0.05 --outdir artifacts/calibration

# Step 7: Production scoring
python src/scoring.py --features data/processed/feature_ews.parquet --model artifacts/models/model_lgbm.pkl --thresholds artifacts/calibration/thresholds.json --asof 2025-06-30 --outdir artifacts/scoring
```

### Or use Makefile (if configured)

```bash
make requirements    # Install dependencies
make lint           # Check code quality
make format         # Format code with ruff
make test          # Run pytest
```

## 💼 Business Use Cases

### 1. Portfolio Review Meeting (Monthly)

**Input:** `ews_scored_YYYY-MM-DD.csv`

**Analysis:**
```python
import pandas as pd

scores = pd.read_csv('artifacts/scoring/ews_scored_2025-06-30.csv')

# Portfolio distribution
print(scores['tier'].value_counts())
# Green: 850 customers (85%)
# Amber: 100 customers (10%)
# Red:    50 customers (5%)

# High-risk customers requiring immediate action
red_tier = scores[scores['tier'] == 'Red'].sort_values('prob_default_12m_calibrated', ascending=False)
print(f"Red tier: {len(red_tier)} customers with avg PD = {red_tier['prob_default_12m_calibrated'].mean():.1%}")
```

**Actions:**
- **Red tier:** Immediate RM meeting + action plan
- **Amber tier:** Enhanced monitoring + covenant review
- **Green tier:** Standard periodic review

---

### 2. Credit Approval Process

**New loan application from customer C0523:**

```python
# Get customer's EWS score
customer = scores[scores['customer_id'] == 'C0523'].iloc[0]
print(f"Customer C0523:")
print(f"  - EWS Score: {customer['score_ews']}")
print(f"  - PD 12M: {customer['prob_default_12m_calibrated']:.1%}")
print(f"  - Tier: {customer['tier']}")
print(f"  - Action: {customer['action']}")

# Decision rule
if customer['tier'] == 'Red':
    print("REJECT or require additional collateral")
elif customer['tier'] == 'Amber':
    print("APPROVE with covenant tightening")
else:
    print("APPROVE standard terms")
```

---

### 3. Early Intervention

**Identify deteriorating customers:**

```python
# Compare current month vs last month
current = pd.read_csv('artifacts/scoring/ews_scored_2025-06-30.csv')
previous = pd.read_csv('artifacts/scoring/ews_scored_2025-05-31.csv')

merged = current.merge(previous, on='customer_id', suffixes=('_current', '_previous'))
merged['pd_change'] = merged['prob_default_12m_calibrated_current'] - merged['prob_default_12m_calibrated_previous']

# Customers with PD increasing by > 10pp
deteriorating = merged[merged['pd_change'] > 0.10].sort_values('pd_change', ascending=False)
print(f"Deteriorating customers: {len(deteriorating)}")
```

**Trigger actions:**
- Request updated financials
- Schedule RM meeting
- Review credit limits

---

### 4. SHAP-based Customer Advisory

**Why is customer C0042 in Red tier?**

```python
import pandas as pd

# Load SHAP drivers
drivers = pd.read_csv('artifacts/shap/top_drivers_per_customer.csv')
customer_drivers = drivers[drivers['customer_id'] == 'C0042'].iloc[0]

print("Top 3 risk drivers:")
for i in range(1, 4):
    print(f"{i}. {customer_drivers[f'feat{i}']}: {customer_drivers[f'shap{i}']:.3f} (value={customer_drivers[f'value{i}']})")

# Output:
# 1. dpd_max_180d__zs_sector_size: 0.523 (value=120)
# 2. %util_mean_60d__zs_sector_size: 0.312 (value=0.95)
# 3. icr_ttm__zs_sector_size: 0.201 (value=0.8)
```

**RM Advice to customer:**
1. **DPD 120 days:** Clear outstanding payments immediately
2. **95% utilization:** Reduce credit line usage or apply for limit increase
3. **ICR 0.8:** Improve profitability or restructure debt

## 📈 Artifacts & Outputs Summary

### Directory Structure

```
artifacts/
├── models/
│   ├── model_lgbm.pkl              # Trained model (base + calibrated + features)
│   ├── scores_all.csv              # Train + test predictions + tiers
│   ├── thresholds.json             # Percentile-based thresholds
│   ├── calibration_lgbm.png        # Reliability diagram
│   ├── pr_curve_lgbm.png           # Precision-Recall curve
│   └── shap_summary.csv/png        # Quick SHAP summary
│
├── calibration/
│   ├── calibrator.pkl              # Isotonic calibrator (re-fitted)
│   ├── mapping.csv                 # Raw score → Calibrated PD mapping
│   ├── thresholds.json             # Absolute PD thresholds (Red ≥20%, Amber ≥5%)
│   ├── calibration_full.png        # Reliability curve (re-calibrated)
│   └── pr_curve_full.png           # PR curve (re-calibrated)
│
├── shap/
│   ├── feature_importance.csv      # Global feature importance (mean |SHAP|)
│   ├── shap_summary.png            # SHAP beeswarm summary plot
│   ├── top_drivers_per_customer.csv # Local explanations (top 5 features per customer)
│   ├── shap_dependence_*.png       # Dependence plots for key features
│   └── summary.json                # Metadata
│
└── scoring/
    ├── ews_scored_2025-06-30.csv   # Production scores (customer_id, PD, score, tier, action)
    └── thresholds_used.json        # Thresholds applied in this run
```

### Key Files for Different Stakeholders

| Stakeholder | Key Files |
|-------------|-----------|
| **Risk Manager** | `ews_scored_*.csv`, `top_drivers_per_customer.csv` |
| **Credit Committee** | `scores_all.csv`, `shap_summary.png`, `pr_curve_lgbm.png` |
| **Data Scientist** | `model_lgbm.pkl`, `feature_importance.csv`, all plots |
| **Model Validator** | `calibration_*.png`, `thresholds.json`, metrics in console output |
| **Auditor** | All artifacts + `summary.json` for traceability |

## 🔬 Technical Deep Dives

### 1. Why Isotonic Calibration?

**Problem with raw LightGBM probabilities:**
- Overconfident near 0 and 1
- Not well-calibrated for credit risk (regulatory requirement)

**Isotonic Regression:**
- Non-parametric, monotonic calibration
- Preserves ranking up to ties (AUC essentially unchanged)
- Improves Brier score and reliability

**Alternative: Platt Scaling (Logistic Regression)**
- Parametric (assumes sigmoid relationship)
- Less flexible than Isotonic
- Use if you need smooth curve

---

### 2. Class Imbalance Handling

**Default rate ~5-10%** → Highly imbalanced

**Strategies applied:**
1. **`scale_pos_weight`** in LightGBM
   - Up-weights the positive class in the loss
   - Formula: `(n_negative / n_positive)`
   
2. **Evaluation metrics:** PR-AUC instead of just ROC-AUC
   - ROC-AUC can be misleading with imbalanced data
   
3. **Threshold tuning:** Separate from 0.5
   - Red/Amber thresholds based on business capacity
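
As a small worked example of the `scale_pos_weight` formula, the sketch below computes the ratio from a label list and places it in a parameter dictionary. The helper name and the 8% toy default rate are assumptions; `objective` and `scale_pos_weight` are standard LightGBM parameter names, but the dictionary is not the project's actual configuration.

```python
def compute_scale_pos_weight(labels):
    """Ratio of negatives to positives, the usual setting for scale_pos_weight."""
    n_positive = sum(labels)
    n_negative = len(labels) - n_positive
    if n_positive == 0:
        raise ValueError("No positive (default) cases in the training labels")
    return n_negative / n_positive


# Hypothetical training labels: 1000 customers, 80 defaults (8% base rate)
y_train = [1] * 80 + [0] * 920

spw = compute_scale_pos_weight(y_train)

# Illustrative parameter dict (not the project's actual hyperparameters)
params = {"objective": "binary", "scale_pos_weight": spw}
print(params)
```

Re-weighting the positive class shifts the raw predicted probabilities upward relative to the true default rate, which is one more reason the calibration step matters before the scores are read as PDs.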

---

### 3. Feature Normalization (Sector-Size)

**Why normalize by (Sector, Size)?**

```python
import numpy as np

# Example: ICR = 2.0 for a SME in Retail
# Is this good or bad?

# Without normalization: Compare to all companies → Looks average
# With sector-size normalization: Compare to SME Retailers → Looks good!

# Implementation:
def sector_size_normalize(df, cols):
    for c in cols:
        grouped = df.groupby(['sector_code', 'size_bucket'])
        median = grouped[c].transform('median')
        iqr = grouped[c].transform(lambda x: x.quantile(0.75) - x.quantile(0.25))
        # A zero IQR (constant group) would give ±inf or NaN; map it to NaN explicitly
        df[f'{c}__zs_sector_size'] = (df[c] - median) / iqr.replace(0, np.nan)
    return df
```

**Benefits:**
- Fair comparison (SME vs SME, Corp vs Corp, same sector)
- Robust to outliers (median/IQR instead of mean/std)
- Better predictive power

---

### 4. Label Definition: Event Horizon = 12 Months

**Basel Standard:** PD typically measured over 12-month horizon

**Label rule:**
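```python
from itertools import groupby

# Default if: DPD ≥ 90 days for at least 30 consecutive days in next 12M
# (lengths of the consecutive runs of days with DPD >= 90)
runs = (len(list(g)) for over_90, g in groupby(d >= 90 for d in future_dpd_sequence) if over_90)
event_h12m = 1 if max(runs, default=0) >= 30 else 0
```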

**Rationale:**
- 90 DPD: Industry standard for "default"
- 30 consecutive days: Avoid transient spikes
- 12M horizon: Align with regulatory reporting

## 🎓 Basel & Regulatory Alignment

### Basel Framework Compliance

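**1. PD (Probability of Default) Estimation**
- ✅ 12-month horizon (Basel standard)
- ✅ PD calibration via Isotonic Regression (calibrated on the training sample; a through-the-cycle (TTC) estimate would additionally require data spanning a full credit cycle)
- ✅ Backtesting with holdout set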

**2. Key Financial Ratios**
- ✅ **ICR (Interest Coverage Ratio):** EBIT / Interest
- ✅ **DSCR (Debt Service Coverage Ratio):** (EBITDA - CAPEX) / Debt Service
- ✅ **Leverage Ratio:** Total Debt / EBITDA
- ✅ **Liquidity Ratio:** Current Assets / Current Liabilities

**3. Early Warning Indicators**
- ✅ DPD tracking (30, 60, 90+ days)
- ✅ Credit limit breach monitoring
- ✅ Covenant breach flags
- ✅ Cashflow deterioration signals

**4. Model Governance**
- ✅ **Explainability:** SHAP for transparency
- ✅ **Calibration:** Reliability curves
- ✅ **Validation:** AUC, KS, Brier on holdout
- ✅ **Documentation:** All artifacts saved with metadata

---

### Risk Appetite Framework

**Tier Definitions aligned with Risk Appetite:**

| Tier | PD Range | Portfolio Allocation | Risk Appetite |
|------|----------|---------------------|---------------|
| Green | < 5% | 85% | Accept: Standard monitoring |
| Amber | 5-20% | 10% | Tolerate: Enhanced monitoring |
| Red | ≥ 20% | 5% | Mitigate/Exit: Immediate action |

**Capacity Management:**
- Red tier (5%): Max ~50 customers → 5 FTE RM (10 customers/RM)
- Amber tier (10%): Max ~100 customers → 10 FTE RM (10 customers/RM)
- Green tier (85%): Portfolio monitoring only
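
The capacity figures above follow from simple arithmetic, sketched here for a portfolio of 1000 customers (the size implied by the ~50 and ~100 customer counts). The function name `rms_needed` is hypothetical, and the load of 10 customers per RM is taken from the list above.

```python
import math


def rms_needed(n_customers, tier_share, customers_per_rm=10):
    """Return (tier size, number of RMs) for a tier holding `tier_share` of the portfolio."""
    tier_size = round(n_customers * tier_share)
    return tier_size, math.ceil(tier_size / customers_per_rm)


red_size, red_rms = rms_needed(1000, 0.05)
amber_size, amber_rms = rms_needed(1000, 0.10)

print(f"Red:   {red_size} customers -> {red_rms} RMs")
print(f"Amber: {amber_size} customers -> {amber_rms} RMs")
```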

---

### Regulatory Reporting

**Outputs compatible with:**
- **IFRS 9:** Expected Credit Loss (ECL) calculation
  - PD × LGD × EAD = ECL
  - Model provides PD component
  
- **Basel II/III:** Internal Ratings-Based (IRB) approach
  - PD model for corporate exposures
  - Complement with LGD and EAD models
  
- **Stress Testing:** Scenario-based PD adjustments
  - Re-run model with stressed features
  - Example: Revenue shock, Interest rate shock
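
To show how the model's PD would feed an expected credit loss figure, here is a worked example of `ECL = PD × LGD × EAD`. The PD is the example value used earlier in this document; the LGD and EAD numbers are purely hypothetical, since this project models PD only.

```python
def expected_credit_loss(pd_12m, lgd, ead):
    """12-month ECL as the product of PD, loss given default, and exposure at default."""
    return pd_12m * lgd * ead


pd_12m = 0.2341        # calibrated PD from the EWS model (example value)
lgd = 0.45             # hypothetical loss given default
ead = 1_000_000        # hypothetical exposure at default, in currency units

ecl = expected_credit_loss(pd_12m, lgd, ead)
print(f"ECL = {ecl:,.0f}")
```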

## 🛠️ Development & Deployment

### Local Development Setup

```bash
# 1. Clone repository
git clone https://github.com/dylanng3/corporate-credit-ews.git
cd corporate-credit-ews

# 2. Create virtual environment (Python 3.13)
python -m venv .venv
.venv\Scripts\activate  # Windows
source .venv/bin/activate  # Linux/Mac

# 3. Install dependencies
pip install -U pip
pip install -r requirements.txt

# 4. Run linting
make lint

# 5. Format code
make format

# 6. Run tests
make test
```

---

### Testing Strategy

**Unit Tests (`tests/test_data.py`):**
```python
def test_feature_engineering():
    """Test feature calculation logic"""
    # Load sample data
    # Compute features
    # Assert feature values are correct

def test_model_inference():
    """Test model scoring pipeline"""
    # Load trained model
    # Score test cases
    # Assert outputs are valid probabilities [0, 1]
```

**Integration Tests:**
- End-to-end pipeline test
- Data → Features → Model → Scores

---

### Production Deployment Options

**Option 1: Batch Scoring (Recommended)**
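```bash
# Cron job: 23:00 on the last day of each month.
# Standard cron has no "L" (last day) field, so run on days 28-31 and continue
# only if tomorrow is the 1st (GNU date; % must be escaped as \% in a crontab)
0 23 28-31 * * [ "$(date -d tomorrow +\%d)" = "01" ] && python src/scoring.py --features <path> --model <path> --asof $(date +\%Y-\%m-\%d) --outdir artifacts/scoring
```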

**Option 2: REST API (Real-time)**
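```python
from fastapi import FastAPI
import joblib

app = FastAPI()
# model_lgbm.pkl bundles the base model, calibrated model and feature names;
# if it loads as a container, select the calibrated estimator from it here
model = joblib.load('artifacts/models/model_lgbm.pkl')

@app.post('/predict')
def predict(features: dict):
    # prepare_features and assign_tier are placeholders still to be implemented
    X = prepare_features(features)
    prob = float(model.predict_proba(X)[:, 1][0])
    tier = assign_tier(prob)
    return {'prob': prob, 'tier': tier}
```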

**Option 3: Airflow DAG (Orchestration)**
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Tasks created inside the `with` block are attached to the DAG automatically
with DAG(
    'ews_monthly',
    schedule_interval='0 0 L * *',
    start_date=datetime(2025, 1, 1),  # example start date
    catchup=False,
) as dag:
    generate_features = BashOperator(
        task_id='generate_features',
        bash_command='python src/feature_engineering.py ...'
    )

    score_customers = BashOperator(
        task_id='score_customers',
        bash_command='python src/scoring.py ...'
    )

    generate_features >> score_customers
```

---

### Model Versioning

**Recommended: MLflow or DVC**
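```python
# Track model with MLflow
# (assumes `model` is a scikit-learn compatible estimator;
#  mlflow.lightgbm.log_model would be the counterpart for a raw LightGBM model)
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.sklearn.log_model(model, 'lgbm_ews_v1')
    mlflow.log_metrics({'auc': 0.82, 'ks': 0.45})
    mlflow.log_artifacts('artifacts/models')
```

```bash
# Or version with DVC
dvc add artifacts/models/model_lgbm.pkl
git add artifacts/models/model_lgbm.pkl.dvc
git commit -m "Model v1.0 - AUC 0.82"
```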

## 📚 Future Enhancements

### 1. Model Improvements

**Advanced Models:**
- ✨ **XGBoost:** Alternative to LightGBM (may perform better)
- ✨ **Neural Networks:** TabNet, FT-Transformer for tabular data
- ✨ **Ensemble:** Stack LightGBM + XGBoost + Logistic Regression

**Feature Engineering:**
- ✨ **Macro variables:** GDP growth, interest rate, sector indices
- ✨ **Time-series features:** ARIMA residuals, trend slopes
- ✨ **Text features:** NLP on management discussions, news sentiment
- ✨ **Network features:** Supply chain relationships, customer concentration

---

### 2. Data Sources

**External Data Integration:**
- 📊 **Credit Bureau:** Payment history from other banks
- 📊 **Market Data:** Stock price (if listed), bond spreads
- 📊 **Alternative Data:** Social media sentiment, web traffic
- 📊 **Geospatial:** Location risk (flood zones, economic zones)

---

### 3. Interpretability Enhancements

**Beyond SHAP:**
- 📖 **LIME:** Local surrogate models
- 📖 **Counterfactuals:** "If ICR increased by 0.5, tier would change from Red to Amber"
- 📖 **Rule extraction:** Decision trees from LightGBM for simple rules
- 📖 **Narrative generation:** Auto-generate credit memos

---

### 4. Operational Features

**Dashboard (Streamlit/Dash):**
```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Corporate Credit EWS Dashboard")

scores = pd.read_csv('artifacts/scoring/ews_scored_latest.csv')

# Tier distribution pie chart
st.plotly_chart(px.pie(scores, names='tier'))

# Customer search
customer_id = st.text_input("Customer ID")
if customer_id:
    customer = scores[scores['customer_id'] == customer_id]
    if customer.empty:
        # Avoid an IndexError when the ID is not in the scored file
        st.warning(f"Customer {customer_id} not found")
    else:
        st.metric("EWS Score", customer['score_ews'].iloc[0])
        st.metric("PD 12M", f"{customer['prob_default_12m_calibrated'].iloc[0]:.1%}")
```

**Alerting System:**
- Email/Slack alerts when customer moves to Red tier
- Weekly summary of tier migrations
- Covenant breach notifications

---

### 5. Model Monitoring

**Drift Detection:**
- **Population Stability Index (PSI):** Track feature distributions
- **Model Performance Tracking:** AUC, KS on rolling windows
- **Prediction Stability:** Variance of predictions month-over-month

**Auto-retraining:**
- Trigger retraining if PSI > 0.15 or AUC drops > 5%
- A/B test new model vs production model
- Gradual rollout with champion-challenger strategy

## 📖 References & Resources

### Academic & Industry Papers

1. **Basel Committee on Banking Supervision**
   - [Basel II: International Convergence of Capital Measurement](https://www.bis.org/publ/bcbs128.htm)
   - PD, LGD, EAD estimation frameworks
   
2. **IFRS 9 - Expected Credit Loss**
   - 12-month vs Lifetime PD
   - Staging models (Stage 1, 2, 3)

3. **Altman Z-Score (1968)**
   - Classic credit scoring model for manufacturing firms
   - Foundation for many modern models

4. **SHAP: Lundberg & Lee (2017)**
   - [A Unified Approach to Interpreting Model Predictions](https://arxiv.org/abs/1705.07874)
   - Game-theoretic feature attribution

---

### Tools & Libraries

**Python Packages:**
- `lightgbm`: Gradient boosting framework
- `shap`: Model explainability
- `scikit-learn`: ML utilities, calibration
- `pandas`, `numpy`: Data manipulation
- `matplotlib`, `seaborn`, `plotly`: Visualization

**Development:**
- `ruff`: Fast Python linter & formatter
- `pytest`: Testing framework
- `cookiecutter-data-science`: Project template

---

### Credit Risk Resources

**Books:**
- *Credit Risk Modeling* by David Lando
- *The Credit Scoring Toolkit* by Raymond Anderson
- *Machine Learning for Asset Managers* by Marcos López de Prado

**Online Courses:**
- Coursera: *Credit Risk Modeling in Python*
- Udemy: *Credit Risk Analytics with Python*

**Websites:**
- [Risk.net](https://www.risk.net) - Industry news
- [Kaggle Credit Risk Datasets](https://www.kaggle.com/search?q=credit+risk)

---

### Contact & Support

**Project Maintainer:** Duong N.C.K  
**Repository:** [github.com/dylanng3/corporate-credit-ews](https://github.com/dylanng3/corporate-credit-ews)  
**License:** MIT License

**For questions or contributions:**
- Open an issue on GitHub
- Submit a pull request

## 🎯 Summary & Conclusion

### What We Built

A **production-ready Early Warning System** for corporate credit risk with:

✅ **End-to-end ML pipeline** (data → features → model → scoring → explainability)  
✅ **Basel-aligned methodology** (12M PD, ICR, DSCR, leverage ratios)  
✅ **LightGBM + Isotonic Calibration** (AUC ~0.75-0.85, well-calibrated probabilities)  
✅ **SHAP explainability** (global + local feature importance)  
✅ **3-tier risk classification** (Green/Amber/Red with actionable recommendations)  
✅ **Production scoring pipeline** (batch scoring with thresholds)  
✅ **Comprehensive artifacts** (models, scores, plots, metadata)

---

### Key Strengths

1. **Comprehensive Feature Engineering**
   - Financial ratios (ICR, DSCR, Leverage, CCC)
   - Behavioral signals (DPD, utilization, trends)
   - Cashflow monitoring
   - Covenant breach tracking
   - Sector-size normalization

2. **Model Quality**
   - Handles class imbalance with `scale_pos_weight`
   - Isotonic calibration for reliable probabilities
   - SHAP for transparency and trust

3. **Business Integration**
   - Clear tier definitions aligned with risk appetite
   - Actionable recommendations for RM team
   - Monthly scoring cadence
   - Customer-level explanations

4. **Code Quality**
   - Modular design (separate scripts for each stage)
   - Type hints and documentation
   - CLI interfaces for all scripts
   - Cookiecutter-data-science template

---

### Success Criteria

**Model Performance:**
- ✅ AUC-ROC ≥ 0.75
- ✅ PR-AUC ≥ 0.40
- ✅ KS ≥ 0.35
- ✅ Brier Score ≤ 0.10
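
To make the thresholds above concrete, here is a dependency-free sketch that computes the four metrics on a toy sample and checks them against the criteria. In the project itself these would more naturally come from scikit-learn (`roc_auc_score`, `average_precision_score`, `brier_score_loss`). The helper names and the toy labels and scores below are illustrative assumptions, not project results.

```python
def _split(y_true, y_prob):
    """Separate scores of defaulters (label 1) and non-defaulters (label 0)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    return pos, neg


def roc_auc(y_true, y_prob):
    """Probability that a random defaulter scores above a random non-defaulter."""
    pos, neg = _split(y_true, y_prob)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean precision at the rank of each defaulter (ties favor defaulters here)."""
    ranked = sorted(zip(y_prob, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def ks_statistic(y_true, y_prob):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = _split(y_true, y_prob)
    return max(
        abs(sum(p <= t for p in pos) / len(pos) - sum(n <= t for n in neg) / len(neg))
        for t in set(y_prob)
    )


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and calibrated PDs (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80]

metrics = {
    "auc": roc_auc(y_true, y_prob),
    "pr_auc": average_precision(y_true, y_prob),
    "ks": ks_statistic(y_true, y_prob),
    "brier": brier_score(y_true, y_prob),
}
passed = {
    "auc": metrics["auc"] >= 0.75,
    "pr_auc": metrics["pr_auc"] >= 0.40,
    "ks": metrics["ks"] >= 0.35,
    "brier": metrics["brier"] <= 0.10,
}
print(metrics)
print(passed)
```

The pairwise AUC and the per-threshold KS loop are quadratic in the sample size, which is fine for illustration but too slow for a full portfolio.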

**Business Impact:**
- ✅ Early identification of high-risk customers
- ✅ Reduction in unexpected defaults
- ✅ Efficient RM resource allocation
- ✅ Regulatory compliance (Basel, IFRS 9)

---

### Next Steps

1. **Validation:** Run on real historical data
2. **Backtesting:** Validate predictions against actual defaults
3. **Integration:** Connect to core banking system for live data feeds
4. **Monitoring:** Set up drift detection and performance tracking
5. **Iteration:** Refine features and model based on feedback

---

**This notebook serves as the complete documentation for the Corporate Credit EWS project.**  
**For hands-on experimentation, run the pipeline commands in your terminal or create interactive cells below.**

🚀 **Happy modeling!**