# Corporate Credit Early Warning System (EWS)

## Project Overview

The Corporate Credit Early Warning System (EWS) is built to Basel standards and uses machine learning to predict the probability of default (PD) over the next 12 months.

**Main objectives:**
- Predict the likelihood of default for corporate customers (event horizon: 12 months)
- Classify risk into 3 tiers: **Green** (safe), **Amber** (warning), **Red** (danger)
- Provide specific action recommendations for the Risk Management team

**Tech Stack:**
- Python 3.13
- LightGBM (classification model)
- SHAP (model explainability)
- Sklearn (calibration, metrics)
- Pandas/Numpy (data processing)

## Pipeline Architecture

The project is organized as an **end-to-end ML pipeline** with 7 main steps:

```
1. Data Generation (generate_data.py)
   ↓
2. Feature Engineering (feature_engineering.py)
   ↓
3. Model Training + Calibration (train_baseline.py)
   ↓
4. Model Explainability (explain.py)
   ↓
5. [Optional] Make Raw Scores (make_scores_raw.py)
   ↓
6. [Optional] Re-calibration (calibrate.py)
   ↓
7. Production Scoring (scoring.py)
```

### Data Flow
- **Raw Data** → `data/raw/` (fin_quarterly, credit_daily, cashflow_daily, covenant, labels)
- **Features** → `data/processed/` (feature_ews.parquet)
- **Models** → `artifacts/models/` (model_lgbm.pkl, SHAP artifacts)
- **Scores** → `artifacts/scoring/` (ews_scored_YYYY-MM-DD.csv)

## Step 1: Data Generation

Module: `src/generate_data.py`

### Purpose

Generate synthetic data to train the PD model while real bank data is not yet available. The data is designed to stay close to real corporate lending practice and to follow Basel principles.

**Why synthetic data:**
- ✅ Avoids security and compliance issues with internal data
- ✅ Provides exact ground truth (we know for certain who defaults) for model evaluation
- ✅ Scales easily (1K, 10K, 100K customers) for stress testing
- ✅ Can create edge cases that are rare in reality (severe stress scenarios)
- ❌ Limitation: distribution shift versus real data → the model must be retrained when deployed to production

---

### Portfolio Configuration

**10 Sectors** with different risk premiums:
- Low risk: IT (-0.02), Telecom (-0.01), Engineering/Manufacturing (0.0)
- Medium: Chemicals/Logistics (0.02), Retail/Construction (0.03)
- Higher risk: Agriculture (0.04), Trading (0.05)

**2 Size buckets:**
- **SME (80%)**: Revenue ~40K/quarter, Debt multiplier 1.2, higher default risk
- **Corp (20%)**: Revenue ~110K/quarter, Debt multiplier 1.4, more leverage but stable

→ Total: 1000 customers (800 SMEs + 200 Corps)

---

### Generated Data

**1. fin_quarterly.parquet** - Quarterly financial statements

12 quarters (3 years of history up to 2025-06-30) × 1000 customers = 12,000 rows

Key columns: `revenue`, `ebitda`, `ebit`, `interest_expense`, `total_debt`, `current_assets`, `current_liab`, `ar`, `ap`, `inventory`

Logic:
- Revenue growth ~2% QoQ + noise
- EBITDA margin ~15% ± 7%
- Debt = Revenue × (0.3 + sector_risk) × size_multiplier
- Working capital items (AR, AP, Inventory) follow industry standards
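As a worked instance of the debt rule above, the sketch below plugs in values from the portfolio configuration. The function name `quarterly_debt` and the two dictionaries are illustrative and not taken from `generate_data.py`; the sketch also assumes that "Revenue" in the rule means the quarterly revenue listed for each size bucket.

```python
# Illustrative subset of the sector premiums and size multipliers listed above.
SECTOR_RISK = {"IT": -0.02, "Telecom": -0.01, "Retail": 0.03, "Trading": 0.05}
SIZE_MULTIPLIER = {"SME": 1.2, "Corp": 1.4}


def quarterly_debt(revenue: float, sector: str, size_bucket: str) -> float:
    """Debt = Revenue x (0.3 + sector_risk) x size_multiplier."""
    return revenue * (0.3 + SECTOR_RISK[sector]) * SIZE_MULTIPLIER[size_bucket]


sme_trading = quarterly_debt(40_000, "Trading", "SME")  # 40K x 0.35 x 1.2
corp_it = quarterly_debt(110_000, "IT", "Corp")         # 110K x 0.28 x 1.4
print(round(sme_trading), round(corp_it))
```

Under these assumptions a riskier sector carries proportionally more debt for the same revenue, which is how the sector premium feeds into leverage-based features later in the pipeline.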

---

**2. credit_daily.parquet** - Daily credit behavior

545 days (180 days observation + 365 days label window) × 1000 = 545,000 rows

Key columns: `limit`, `utilized`, `dpd_days`, `breach_flag`, `product_type`

**DPD Dynamics** (Days Past Due):
- Markov chain: 98.5% stable/improve, 1.5% deteriorate
- Majority < 30 days, some 30-90, few > 90
- Persistent DPD ≥ 90 for ≥ 30 consecutive days → Default

**Utilization Pattern:**
- Beta distribution (centered ~50%) + seasonality (±5%) + noise
- High utilization (>90%) = stress signal

---

**3. cashflow_daily.parquet** - Daily cash flow

545 days × 1000 = 545,000 rows

Key columns: `inflow`, `outflow`

Logic:
- Daily inflow = Annual revenue / 365 × seasonality × noise
- Daily outflow = ~90% of inflow × noise
- Seasonality: sine wave with 20% amplitude
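The three rules above can be sketched as a small simulator. This is only an illustration: the function name `simulate_daily_cashflow`, the lognormal noise with sigma 0.1, and the 365-day seasonal period are assumptions, since the source does not specify the noise distribution or the period of the sine wave.

```python
import numpy as np


def simulate_daily_cashflow(annual_revenue, n_days=545, period_days=365, seed=0):
    """Return (inflow, outflow) arrays following the rules listed above."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_days)
    # Sine-wave seasonality with 20% amplitude (period is an assumption).
    seasonality = 1.0 + 0.20 * np.sin(2 * np.pi * t / period_days)
    # Multiplicative noise; lognormal keeps the flows strictly positive.
    inflow = annual_revenue / 365 * seasonality * rng.lognormal(0.0, 0.1, n_days)
    outflow = 0.90 * inflow * rng.lognormal(0.0, 0.1, n_days)
    return inflow, outflow


# An SME with ~40K revenue per quarter has ~160K annual revenue.
inflow, outflow = simulate_daily_cashflow(annual_revenue=160_000)
print(inflow[:3].round(1), outflow[:3].round(1))
```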

Use cases: Detect revenue drops, burn rate, cashflow volatility

---

**4. covenant.parquet** - Covenant compliance tracking

545 days × 1000 = 545,000 rows

Key covenants:
- **ICR** (Interest Coverage Ratio): EBIT / Interest ≥ 1.5
- **DSCR** (Debt Service Coverage Ratio): Cashflow / Debt service ≥ 1.2
- **Leverage**: Debt / EBITDA ≤ 4.0

Breach flags: `breach_icr`, `breach_dscr`, `breach_leverage`

---

**5. labels.parquet** - Target variable

1000 rows (1 per customer at asof_date = 2025-06-30)

**Label Definition** (Basel-compliant):
```python
# event_h12m = 1 if DPD >= 90 for >= 30 consecutive days within the 12 months after asof_date
# event_h12m = 0 otherwise
```
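A minimal runnable sketch of this label rule follows. It assumes the input is one customer's daily DPD values restricted to the 12-month label window after `asof_date`; the function name `has_default_event` is illustrative and not taken from `generate_data.py`.

```python
def has_default_event(dpd_series, dpd_threshold=90, min_run=30):
    """Return 1 if DPD >= dpd_threshold on at least min_run consecutive days."""
    run = 0
    for dpd in dpd_series:
        # Extend the current streak, or reset it when DPD falls below the threshold.
        run = run + 1 if dpd >= dpd_threshold else 0
        if run >= min_run:
            return 1
    return 0


print(has_default_event([95] * 30))                # 30 consecutive days
print(has_default_event([95] * 29 + [0] + [95] * 29))  # streak interrupted
```

The second call shows why the "consecutive" condition matters: 58 days at DPD ≥ 90 do not count as a default when the streak is broken before day 30.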

**PD Bumps** (increase the default probability):
- High utilization (>90%) → +20% PD
- Covenant breach → +20% PD

Expected default rate: **~5-10%** (realistic for a corporate portfolio)

---

### Output Structure

```
data/raw/
├── fin_quarterly.parquet       # 12K rows - Financial statements
├── credit_daily.parquet        # 545K rows - Credit behavior
├── cashflow_daily.parquet      # 545K rows - Cash movements
├── covenant.parquet            # 545K rows - Covenant tracking
└── labels.parquet              # 1K rows - Default labels
```

File sizes: ~5-10 MB total (Parquet compressed)

---

### Realism Features

✅ **Financial ratios** follow industry norms (EBITDA margin 15%, Debt/Revenue 0.3-0.6)  
✅ **DPD patterns** are realistic (gradual deterioration, no sudden jumps from 0 → 90)  
✅ **Seasonality** in revenue and cashflow (quarterly peaks)  
✅ **Covenant breaches** trigger increased default risk  
✅ **Working capital** (DSO ~66 days, DPO ~73 days) follows industry benchmarks

→ A model trained on this data learns patterns that are close to reality.

```python
# Example: generate synthetic data and verify the result

import os
import sys

sys.path.append('../src')

print("=" * 80)
print("STEP 1: GENERATE SYNTHETIC DATA")
print("=" * 80)

# Command to generate the data
print("\n📝 Command to generate data:")
print("python src/generate_data.py --n-customers 1000 --output-dir data/raw")

print("\n📊 Expected outputs:")
outputs = {
    "fin_quarterly.parquet": "~12,000 rows (1000 customers × 12 quarters)",
    "credit_daily.parquet": "~545,000 rows (1000 customers × 545 days)",
    "cashflow_daily.parquet": "~545,000 rows (1000 customers × 545 days)",
    "covenant.parquet": "~545,000 rows (1000 customers × 545 days)",
    "labels.parquet": "1,000 rows (1 row per customer)"
}

for file, desc in outputs.items():
    print(f"  ✓ data/raw/{file:30s} → {desc}")

print("\n" + "=" * 80)
print("VERIFICATION AFTER GENERATION")
print("=" * 80)

# If the data already exists, verify it
data_dir = "../data/raw"
if os.path.exists(data_dir):
    try:
        import pandas as pd

        print("\n✅ Data directory found! Checking files...")

        # Check labels
        labels_path = f"{data_dir}/labels.parquet"
        if os.path.exists(labels_path):
            labels = pd.read_parquet(labels_path)
            default_rate = labels['event_h12m'].mean()
            print("\n📌 labels.parquet:")
            print(f"   - Total customers: {len(labels):,}")
            print(f"   - Default rate (event_h12m=1): {default_rate:.1%}")
            # The check accepts 5-15%, which is wider than the 5-10% target.
            print("   - Within checked range 5-15% (target: 5-10%) ✓" if 0.05 <= default_rate <= 0.15 else "   - Warning: Outside expected range")

        # Check credit_daily
        credit_path = f"{data_dir}/credit_daily.parquet"
        if os.path.exists(credit_path):
            credit = pd.read_parquet(credit_path)
            print("\n📌 credit_daily.parquet:")
            print(f"   - Total rows: {len(credit):,}")
            print(f"   - Date range: {credit['date'].min()} to {credit['date'].max()}")
            print(f"   - Max DPD: {credit['dpd_days'].max()} days")
            print(f"   - Breach rate: {credit['breach_flag'].mean():.1%}")
            print(f"   - Avg utilization: {(credit['utilized']/credit['limit']).mean():.1%}")

        # Check fin_quarterly
        fin_path = f"{data_dir}/fin_quarterly.parquet"
        if os.path.exists(fin_path):
            fin = pd.read_parquet(fin_path)
            print("\n📌 fin_quarterly.parquet:")
            print(f"   - Total rows: {len(fin):,}")
            print(f"   - Unique customers: {fin['customer_id'].nunique():,}")
            print(f"   - Quarters: {fin['fq_date'].nunique()}")
            print(f"   - Avg EBITDA margin: {(fin['ebitda']/fin['revenue']).mean():.1%}")
            # Uses quarterly (not TTM) EBITDA, so this runs higher than the TTM ratio.
            print(f"   - Avg Debt/EBITDA: {(fin['total_debt']/fin['ebitda']).mean():.1f}x")

        # Nothing is asserted above, so report completion rather than success.
        print("\n✅ Checks completed.")

    except Exception as e:
        print(f"\n⚠️  Could not verify data: {e}")
        print("   Run the generate_data.py script first to create the data.")
else:
    print(f"\n⚠️  Data directory not found: {data_dir}")
    print("   Run the following command to generate data:")
    print("   python src/generate_data.py --n-customers 1000 --output-dir data/raw")

print("\n" + "=" * 80)
```

## 🔧 Step 2: Feature Engineering

Module: `src/feature_engineering.py`

### Overview

Feature engineering is the most important step in building the Early Warning System model, because it turns raw data from the financial, credit-behavior, and cash-flow tables into features with strong predictive power. The process combines bank credit expertise with data analysis techniques to produce a set of indicators that fully reflects the financial position and risk of each corporate customer.

The features are designed around Basel principles and credit risk management practice, and fall into 5 main groups: Financial Ratios, Behavioral Features, Cashflow Features, Covenant Breach Flags, and Normalization. Each group serves a specific purpose in assessing a customer's likelihood of default.

---

### A. Financial Ratios (TTM - Trailing 12 Months)

The financial ratios are computed over the trailing twelve months (TTM) to reflect longer-term trends and dampen short-term volatility. We aggregate the four most recent quarters to calculate them.

#### Liquidity & Coverage Ratios

**Interest Coverage Ratio (ICR)** is the most important indicator of debt-servicing capacity, computed as EBIT divided by interest expense.

$$\text{ICR} = \frac{\text{EBIT}_{\text{TTM}}}{\text{Interest Expense}_{\text{TTM}}}$$

This ratio measures the company's ability to pay interest out of operating profit. By banking convention, an ICR below 1.5 is considered dangerous: operating income leaves little or no cushion to cover interest obligations.

**Debt Service Coverage Ratio (DSCR)** measures the ability to repay both principal and interest out of EBITDA after deducting capital expenditure (CAPEX).

$$\text{DSCR} = \frac{\text{EBITDA} - \text{CAPEX}}{\text{Principal} + \text{Interest}}$$

Because the synthetic data has no detailed information on principal repayments, we use a proxy that estimates CAPEX as 30% of EBITDA. A DSCR below 1.2 indicates that the company struggles to meet its debt obligations.

**Current Ratio** reflects short-term liquidity, computed as current assets divided by current liabilities.

$$\text{Current Ratio} = \frac{\text{Current Assets}}{\text{Current Liabilities}}$$

This ratio shows the company's ability to pay obligations falling due within the next 12 months. A Current Ratio below 1.0 is a warning sign of a serious liquidity shortfall.

#### Leverage Ratio

**Debt-to-EBITDA** measures financial leverage: how many years of EBITDA the company would need to repay all of its debt.

$$\text{Debt-to-EBITDA} = \frac{\text{Total Debt}}{\text{EBITDA}_{\text{TTM}}}$$

The ratio is total debt divided by TTM EBITDA. By industry convention, a Debt-to-EBITDA above 4.0 indicates an excessive debt burden that significantly raises default risk.
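The sketch below shows one way the TTM ratios could be computed for a single customer with pandas. It assumes that flow items (EBIT, EBITDA, interest) are summed over the last four quarters while balance-sheet items use the latest quarter; the function name `ttm_ratios` and the output keys other than `icr_ttm` are illustrative. DSCR is left out because the source does not spell out how debt service is proxied, and zero denominators are not handled.

```python
import pandas as pd


def ttm_ratios(quarters: pd.DataFrame) -> dict:
    """quarters: one customer's fin_quarterly rows, sorted oldest to newest."""
    last4 = quarters.tail(4)    # trailing twelve months = last 4 quarters
    latest = quarters.iloc[-1]  # stock items are read from the latest quarter
    ebitda_ttm = last4["ebitda"].sum()
    return {
        "icr_ttm": last4["ebit"].sum() / last4["interest_expense"].sum(),
        "debt_to_ebitda": latest["total_debt"] / ebitda_ttm,
        "current_ratio": latest["current_assets"] / latest["current_liab"],
    }


# Made-up figures for five quarters of one customer.
quarters = pd.DataFrame({
    "ebit": [10.0, 10.0, 20.0, 30.0, 40.0],
    "ebitda": [20.0, 25.0, 25.0, 25.0, 25.0],
    "interest_expense": [5.0, 5.0, 10.0, 10.0, 5.0],
    "total_debt": [300.0, 310.0, 320.0, 340.0, 350.0],
    "current_assets": [140.0, 140.0, 145.0, 150.0, 150.0],
    "current_liab": [100.0] * 5,
})
ratios = ttm_ratios(quarters)
print(ratios)
```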

#### Working Capital Efficiency

This group of indicators assesses how efficiently working capital is managed, through three main components:

**Days Sales Outstanding (DSO)** measures the average number of days needed to collect cash from customers:

$$\text{DSO} = \frac{\text{Accounts Receivable}}{\text{Revenue}} \times 365$$

A rising DSO indicates that the company is struggling to collect receivables, which can lead to cash shortfalls.

**Days Payables Outstanding (DPO)** measures the average number of days the company takes to pay its suppliers:

$$\text{DPO} = \frac{\text{Accounts Payable}}{\text{COGS}} \times 365$$

A high DPO can be a positive sign (making use of trade credit) or a negative one (liquidity difficulties).

**Days On Hand (DOH)** measures the average number of days inventory is held:

$$\text{DOH} = \frac{\text{Inventory}}{\text{COGS}} \times 365$$

A high DOH indicates large inventories, which can disrupt cash flow.

**Cash Conversion Cycle (CCC)** is the composite indicator:

$$\text{CCC} = \text{DSO} + \text{DOH} - \text{DPO}$$

CCC measures the number of days capital is "frozen" in the business cycle, from paying for purchases to collecting cash from customers. A rising CCC indicates poor working capital management and deteriorating cash flow.
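As a worked example of the four formulas above, the sketch below computes DSO, DPO, DOH, and CCC from annualized figures. The function name `working_capital_days` is illustrative. Note that `cogs` is not among the key columns listed for `fin_quarterly.parquet`, so how it is obtained in the actual module is left open here.

```python
def working_capital_days(ar, ap, inventory, revenue, cogs, days=365):
    """Return DSO, DPO, DOH and the resulting cash conversion cycle in days."""
    dso = ar / revenue * days
    dpo = ap / cogs * days
    doh = inventory / cogs * days
    return {"dso": dso, "dpo": dpo, "doh": doh, "ccc": dso + doh - dpo}


# Made-up annual figures: revenue 1000, COGS 600.
wc = working_capital_days(ar=180, ap=120, inventory=100, revenue=1000, cogs=600)
print({name: round(value, 1) for name, value in wc.items()})
```

With these inputs DSO is about 66 days and DPO is 73 days, in line with the benchmarks quoted for the synthetic data.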

#### Trend Analysis (QoQ)

Beyond the static indicators, we also compute the quarter-over-quarter (QoQ) change for the key indicators:

$$\Delta \text{DSO}_{\text{QoQ}} = \text{DSO}_{\text{current}} - \text{DSO}_{\text{previous quarter}}$$

$$\Delta \text{CCC}_{\text{QoQ}} = \text{CCC}_{\text{current}} - \text{CCC}_{\text{previous quarter}}$$

**delta_dso_qoq** and **delta_ccc_qoq** give the change in DSO and CCC relative to the previous quarter, helping to detect early signs of deteriorating working capital management.
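A minimal pandas sketch of the QoQ deltas, assuming a table with one row per customer and quarter that already holds `dso` and `ccc`; the sample values and the intermediate frame names are made up for illustration.

```python
import pandas as pd

ratios = pd.DataFrame({
    "customer_id": ["C0001"] * 3 + ["C0002"] * 3,
    "fq_date": pd.to_datetime(["2024-12-31", "2025-03-31", "2025-06-30"] * 2),
    "dso": [60.0, 64.0, 71.0, 55.0, 54.0, 50.0],
    "ccc": [50.0, 52.0, 61.0, 40.0, 41.0, 38.0],
})
ratios = ratios.sort_values(["customer_id", "fq_date"])

# Difference against the previous quarter, computed within each customer.
ratios["delta_dso_qoq"] = ratios.groupby("customer_id")["dso"].diff()
ratios["delta_ccc_qoq"] = ratios.groupby("customer_id")["ccc"].diff()

# Keep the most recent quarter per customer as the feature row.
latest = ratios.groupby("customer_id").tail(1).set_index("customer_id")
print(latest[["delta_dso_qoq", "delta_ccc_qoq"]])
```

Here C0001 shows DSO and CCC rising by 7 and 9 days (a deterioration), while C0002 improves on both.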

### B. Behavioral Features (Observation Window = 180 days)

Behavioral features are extracted from daily transaction data over the 180 days preceding the assessment date (as-of date). They reflect how the customer actually uses credit and often have more predictive power than traditional financial ratios, because they capture liquidity problems and financial distress as soon as they arise.

#### Credit Utilization

The credit limit utilization rate is a key indicator of how dependent the company is on bank borrowing. We compute two main indicators:

**%util_mean_60d** is the average limit utilization over the most recent 60 days:

$$
\text{Util Mean}_{60d} = \frac{1}{60} \sum_{t=1}^{60} \frac{\text{Utilized}_t}{\text{Limit}_t}
$$

This indicator shows how "tight" the company's liquidity is. A utilization rate above 85% means the company is running close to its limit and risks a liquidity shortfall under any shock.

**%util_p95_60d** is the 95th percentile of utilization over the 60 days:

$$
\text{Util P95}_{60d} = P_{95}\left(\frac{\text{Utilized}_t}{\text{Limit}_t}\right)_{t=1}^{60}
$$

This indicator matters because it shows how much of the limit the company uses on its "worst" days, which helps detect temporary periods of liquidity stress.
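The two utilization features can be sketched with NumPy as follows. The function name `utilization_features` is illustrative, and the sketch assumes the inputs are daily series ordered oldest to newest so that the last 60 entries form the window.

```python
import numpy as np


def utilization_features(utilized, limit, window=60):
    """Mean and 95th percentile of daily utilization over the last `window` days."""
    utilized = np.asarray(utilized, dtype=float)[-window:]
    limit = np.asarray(limit, dtype=float)[-window:]
    util = utilized / limit
    return {
        "%util_mean_60d": float(util.mean()),
        "%util_p95_60d": float(np.percentile(util, 95)),
    }


# 180 days of history; only the last 60 (54 calm days, 6 stressed days) count.
limit = [100.0] * 180
utilized = [10.0] * 120 + [50.0] * 54 + [95.0] * 6
feats = utilization_features(utilized, limit)
print(feats)
```

The mean (about 55%) looks unremarkable, while the 95th percentile (95%) exposes the short stress episode, which is the reason for tracking both.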

#### Delinquency Patterns

Days Past Due (DPD) is the most direct indicator of payment difficulty. We analyze DPD from several angles:

**dpd_max_180d** is the maximum number of days past due over the last 180 days:

$$
\text{DPD Max}_{180d} = \max_{t=1}^{180} (\text{DPD}_t)
$$

By industry convention, DPD above 30 days is treated as an early warning signal, while DPD above 90 days is a clear sign of default risk under the Basel definition.

**dpd_trend_180d** measures the trend of DPD over time as the slope of a linear regression:

$$
\text{DPD Trend}_{180d} = \beta_1 \text{ where } \text{DPD}_t = \beta_0 + \beta_1 \cdot t + \epsilon_t
$$

A positive slope ($\beta_1 > 0$) means DPD is trending upward (the situation is worsening), while a negative slope ($\beta_1 < 0$) means the company's payment capacity is improving.

**near_due_freq_7d** measures how often the customer was "nearly overdue" in the most recent 7 days:

$$
\text{Near Due Freq}_{7d} = \frac{1}{7} \sum_{t=1}^{7} \mathbb{1}(0 < \text{DPD}_t < 30)
$$

This indicator helps identify customers who are frequently late but have not yet reached serious delinquency, which is an important early warning.
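A minimal NumPy sketch of the three delinquency features, assuming a daily DPD series ordered oldest to newest. The function name `dpd_features` is illustrative; the slope is obtained with `np.polyfit` as an ordinary least-squares fit of DPD against the day index.

```python
import numpy as np


def dpd_features(dpd_daily):
    """Max, linear trend and near-due frequency of a daily DPD series."""
    dpd = np.asarray(dpd_daily, dtype=float)[-180:]
    t = np.arange(len(dpd))
    slope = np.polyfit(t, dpd, 1)[0]  # beta_1 in DPD_t = beta_0 + beta_1 * t
    last7 = dpd[-7:]
    near_due = np.mean((last7 > 0) & (last7 < 30))
    return {
        "dpd_max_180d": float(dpd.max()),
        "dpd_trend_180d": float(slope),
        "near_due_freq_7d": float(near_due),
    }


rising = dpd_features(0.5 * np.arange(180))                  # steadily worsening
recent = dpd_features([0] * 173 + [0, 5, 10, 0, 3, 40, 2])   # late only recently
print(rising)
print(recent)
```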

#### Credit Limit Breach

**limit_breach_cnt_90d** counts the number of times the customer exceeded the credit limit in the most recent 90 days:

$$
\text{Limit Breach Count}_{90d} = \sum_{t=1}^{90} \mathbb{1}(\text{Utilized}_t > \text{Limit}_t)
$$

Any breach (> 0) is a serious warning sign: either the company's funding needs exceed what has been approved, or its internal controls are weak.
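The breach count is a direct translation of the indicator sum above. In this sketch the function name `limit_breach_count` is illustrative and the inputs are assumed to be daily series ordered oldest to newest.

```python
def limit_breach_count(utilized, limit, window=90):
    """Count days in the last `window` days where utilization exceeded the limit."""
    pairs = list(zip(utilized, limit))[-window:]
    return sum(1 for used, lim in pairs if used > lim)


# 120 days of history with breaches on days 10, 50 and 119 (0-indexed).
limit = [100] * 120
utilized = [50] * 120
for day in (10, 50, 119):
    utilized[day] = 120

count = limit_breach_count(utilized, limit)
print(count)
```

Day 10 falls outside the 90-day window, so only the two later breaches are counted.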

---

### C. Cashflow Features

Cash flow is the "lifeblood" of a business, and it matters more than accounting profit when predicting default. We analyze daily cash inflows and outflows over the last 180 days to build the following features:

**inflow_mean_60d** and **outflow_mean_60d** are the average inflow and outflow over the most recent 60 days:

$$
\text{Inflow Mean}_{60d} = \frac{1}{60} \sum_{t=1}^{60} \text{Inflow}_t
$$

$$
\text{Outflow Mean}_{60d} = \frac{1}{60} \sum_{t=1}^{60} \text{Outflow}_t
$$

These two indicators reflect the scale and stability of business activity. A large gap between inflow and outflow (a high burn rate) is a warning sign.

**inflow_drop_60d** measures the relative decline of inflow in the most recent 60 days compared with the 6-month median:

$$
\text{Inflow Drop}_{60d} = \frac{\text{Median}_{6m}(\text{Inflow}) - \text{Mean}_{60d}(\text{Inflow})}{\text{Median}_{6m}(\text{Inflow})}
$$

This indicator helps detect a revenue decline early, one of the main causes of default. An inflow drop above 20% means cash inflow is falling sharply and calls for immediate intervention.
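A small NumPy sketch of the inflow-drop formula. It assumes the "6-month" median is taken over the 180-day observation window and that the series is ordered oldest to newest; the function mirrors the feature name but is not taken from `feature_engineering.py`.

```python
import numpy as np


def inflow_drop_60d(inflow_daily):
    """(6-month median - 60-day mean) / 6-month median of daily inflow."""
    inflow = np.asarray(inflow_daily, dtype=float)[-180:]
    baseline = np.median(inflow)    # median over the 180-day window
    recent = inflow[-60:].mean()    # mean of the most recent 60 days
    return float((baseline - recent) / baseline)


# Inflow of 100 per day for four months, then 70 per day for the last 60 days.
drop = inflow_drop_60d([100.0] * 120 + [70.0] * 60)
print(round(drop, 2))
```

The result here is a 30% drop, which would exceed the 20% level described above.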

---

### D. Covenant Breach Flags

Covenants are financial thresholds the customer must maintain under the credit agreement. A covenant breach is an extremely important early warning signal and usually occurs before an actual default.

We track three main covenants with the following trigger conditions:

$$
\text{breach\_icr} = \mathbb{1}(\text{ICR}_{\text{TTM}} < 1.5)
$$

$$
\text{breach\_dscr} = \mathbb{1}(\text{DSCR}_{\text{TTM}} < 1.2)
$$

$$
\text{breach\_leverage} = \mathbb{1}\left(\frac{\text{Total Debt}}{\text{EBITDA}_{\text{TTM}}} > 4.0\right)
$$

Each breach flag is a binary variable (0/1) indicating whether the customer breaches the corresponding covenant. A breach of any covenant triggers actions such as renegotiation, tightening of terms, or increased monitoring.
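The three trigger conditions map directly onto a small function. The name `covenant_flags` and its argument names are illustrative; the thresholds are the ones stated above.

```python
def covenant_flags(icr_ttm, dscr_ttm, debt_to_ebitda):
    """Return 0/1 breach flags for the ICR, DSCR and leverage covenants."""
    return {
        "breach_icr": int(icr_ttm < 1.5),
        "breach_dscr": int(dscr_ttm < 1.2),
        "breach_leverage": int(debt_to_ebitda > 4.0),
    }


flags = covenant_flags(icr_ttm=1.4, dscr_ttm=1.3, debt_to_ebitda=4.5)
print(flags)
```

Values sitting exactly on a threshold (ICR = 1.5, DSCR = 1.2, leverage = 4.0) are not flagged, because the conditions use strict inequalities.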

---

### E. Normalization (Sector-Size)

One of the biggest challenges in credit scoring is comparing companies that differ in size and industry. An ICR of 2.0 may be considered good for an SME in Retail, while the same value for a large corporate in IT is average or weak.

To address this, we apply a robust **Z-score normalization** by group (Sector, Size_bucket). Each feature is standardized against customers in the same sector and size bucket:

$$
Z\text{-score} = \frac{x - \text{Median}_{\text{group}}}{\text{IQR}_{\text{group}}}
$$

Where:
- $x$ = the customer's feature value
- $\text{Median}_{\text{group}}$ = the median of the (Sector, Size_bucket) group
- $\text{IQR}_{\text{group}}$ = the interquartile range (Q3 - Q1) of the group

We use the **median and IQR** instead of the mean and standard deviation because they are more robust to outliers, which are very common in financial data. Normalized features carry the suffix `__zs_sector_size`, for example `icr_ttm__zs_sector_size` and `dpd_max_180d__zs_sector_size`.

This normalization brings two important benefits: (1) a fair comparison between companies with similar characteristics, and (2) stronger predictive power, because features are adjusted to the context of each group.
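A sketch of the group-wise robust z-score with pandas. The column names `sector` and `size_bucket` and the function name `robust_zscore_by_group` are assumptions for illustration. Groups with a zero IQR (all values identical) are returned as NaN here, which is one possible convention and not necessarily what the actual module does.

```python
import pandas as pd


def robust_zscore_by_group(df, feature, group_cols=("sector", "size_bucket")):
    """(x - group median) / group IQR, computed within each group."""
    grouped = df.groupby(list(group_cols))[feature]
    median = grouped.transform("median")
    iqr = grouped.transform(lambda s: s.quantile(0.75) - s.quantile(0.25))
    # A zero IQR would divide by zero; mask it so the result is NaN instead.
    return (df[feature] - median) / iqr.where(iqr != 0)


df = pd.DataFrame({
    "sector": ["Retail"] * 5 + ["IT"] * 3,
    "size_bucket": ["SME"] * 5 + ["Corp"] * 3,
    "icr_ttm": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0],
})
df["icr_ttm__zs_sector_size"] = robust_zscore_by_group(df, "icr_ttm")
print(df)
```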

## Step 3: Model Training & Calibration

Module: `src/train_baseline.py`

### Overview

This step builds the machine learning model that predicts the probability of default (PD) over the next 12 months. We use **LightGBM** as the base classifier because it handles many features well, learns non-linear relationships automatically, and supports class balancing. After training, the model is **calibrated** with Isotonic Regression so that predicted probabilities reflect true probabilities, which is essential for credit risk management and Basel compliance.

---

### A. LightGBM Configuration

LightGBM was chosen because it copes well with features on different scales (financial ratios, utilization rates, DPD counts), learns feature interactions automatically ("low ICR + high utilization = high risk"), and supports class weighting for imbalanced data (default rate ~5-10%).

**Hyperparameters:**

```python
from lightgbm import LGBMClassifier

# Share of defaults; assumed here to be computed from the training labels.
pos_rate = y_train.mean()

base_lgbm = LGBMClassifier(
    n_estimators=300,           # 300 trees in the ensemble
    learning_rate=0.05,         # Slow but stable learning
    num_leaves=15,              # Limits tree complexity
    max_depth=6,                # Guards against overfitting
    subsample=0.8,              # Row sampling (bagging); only active when subsample_freq > 0
    colsample_bytree=0.8,       # Column sampling
    min_child_samples=10,       # Each leaf holds >= 10 samples
    reg_lambda=0.1,             # L2 regularization
    scale_pos_weight=(1-pos_rate)/pos_rate,  # Auto-balance classes
    objective='binary',
    random_state=42
)
```

**Train-Test Split**: 80% training, 20% holdout test, with stratified sampling so that the default rate is the same in both sets.

---

### B. Isotonic Calibration (CV=5)

Gradient boosting models often produce **uncalibrated probabilities**: when the model predicts PD = 20%, the actual default rate may be 15% or 30%. In credit risk this is dangerous, because key decisions (capital allocation, pricing, provisioning) rely on the PD figure.

**Isotonic Regression** is a non-parametric, monotonic calibration method. It learns a mapping from raw probabilities to calibrated probabilities under a non-decreasing constraint (a customer with a higher raw PD gets a calibrated PD that is at least as high). We use 5-fold CV to avoid overfitting:

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    base_lgbm,
    method='isotonic',
    cv=5
)
calibrated_model.fit(X_train, y_train)
```

Isotonic Regression is preferred over Platt Scaling because: (1) it assumes no functional form (no sigmoid needed), (2) it preserves the ranking up to ties, and (3) it performs better with a large sample (≥ 1000 customers).

---

### C. Risk Tiers & Thresholds

After calibration, customers are assigned to 3 risk tiers using **percentile-based thresholds** (rather than absolute PD cutoffs) in order to manage capacity. The bank only has the resources to closely manage a limited number of the highest-risk customers, so fixing the Red tier at the top 5% ensures that the number of customers needing intensive monitoring does not exceed capacity.

**Tier Definitions:**

| Tier | Percentile | Typical PD | Action | Capacity |
|------|-----------|-----------|--------|----------|
| **Red** | Top 5% | ≥ 20% | Meet the customer within 5 days; prepare a 13-week cash flow; tighten covenants; watchlist | ~50 customers → 5 RMs |
| **Amber** | Top 5-15% | 5-20% | Review within 10 days; request management accounts; restrict limits | ~100 customers → 10 RMs |
| **Green** | Bottom 85% | < 5% | Routine quarterly monitoring; no special action needed | Portfolio monitoring |

**Threshold Calculation:**

```python
import numpy as np

train_probs = calibrated_model.predict_proba(X_train)[:, 1]
red_threshold = np.percentile(train_probs, 95)      # e.g., 0.23
amber_threshold = np.percentile(train_probs, 85)    # e.g., 0.08
```

The thresholds are saved to `thresholds.json` and reused consistently in later scoring runs.
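Once the two thresholds are fixed, tier assignment reduces to a pair of comparisons. The sketch below uses the example values from the code comments (0.23 and 0.08) and assumes that a PD exactly on a threshold falls into the higher-risk tier; the function name `assign_tier` is illustrative.

```python
def assign_tier(pd_calibrated, red_threshold, amber_threshold):
    """Map a calibrated PD to Red / Amber / Green using fixed thresholds."""
    if pd_calibrated >= red_threshold:
        return "Red"
    if pd_calibrated >= amber_threshold:
        return "Amber"
    return "Green"


# Example threshold values only; real ones come from the training percentiles.
tiers = [assign_tier(p, red_threshold=0.23, amber_threshold=0.08)
         for p in (0.30, 0.10, 0.02)]
print(tiers)
```

Because the thresholds are frozen at training time, the share of Red customers in a later scoring run can drift away from 5% if the portfolio's PD distribution shifts.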

---

### D. Evaluation Metrics

The model is evaluated comprehensively with several metrics, each measuring a different aspect:

**1. AUC-ROC (Discrimination Power)**  
Measures the ability to separate defaulters from non-defaulters. AUC = 0.80 means that for a randomly chosen defaulter and non-defaulter, the model ranks them correctly (assigns the higher PD to the defaulter) 80% of the time. **Target**: ≥ 0.75 (industry standard).

**2. PR-AUC (Precision-Recall)**  
Important for imbalanced data (low default rate). Precision = % of customers predicted to default who actually default. Recall = % of actual defaults that are detected. **Target**: ≥ 0.40 (with a base rate of ~8%).

**3. KS Statistic (Kolmogorov-Smirnov)**  
Measures the maximum separation between the cumulative distributions of defaulters and non-defaulters. KS = max(TPR - FPR). **Target**: ≥ 0.35 (good discriminatory power).

**4. Brier Score (Calibration Quality)**  
MSE of the probabilities: `Brier = (1/N) × Σ(predicted_prob - actual_outcome)²`. A small Brier score means the predictions are accurate (if 10 customers are each predicted at PD=20%, ideally 2 default). **Target**: ≤ 0.10. Brier drops noticeably after calibration (from ~0.12 to ~0.08).

**5. Calibration Curve (Reliability Diagram)**  
Visualizes calibration by plotting the mean predicted probability against the actual default rate in each bin. The ideal line is y = x (the diagonal).
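To make the KS and Brier definitions concrete, here is a NumPy-only sketch. The function names are illustrative, and the KS value is obtained by scanning every distinct score as a cutoff, which is a straightforward way to compute max(TPR - FPR) on small samples rather than an optimized routine.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def ks_statistic(y_true, y_prob):
    """max(TPR - FPR) over all score cutoffs."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    cutoffs = np.unique(y_prob)
    tpr = np.array([(pos >= c).mean() for c in cutoffs])
    fpr = np.array([(neg >= c).mean() for c in cutoffs])
    return float(np.max(tpr - fpr))


y = [0, 0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.7, 0.9]
brier = brier_score(y, p)
ks = ks_statistic(y, p)

# Ten customers at PD = 20% with exactly two defaults (perfectly calibrated).
brier_calibrated = brier_score([1, 1] + [0] * 8, [0.2] * 10)
print(round(brier, 3), round(ks, 3), round(brier_calibrated, 3))
```

The last value illustrates that the Brier score also depends on how far the probabilities sit from 0 and 1: a perfectly calibrated PD of 20% for every customer still scores 0.16, so the score is best compared against a baseline on the same portfolio.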

---

### E. Outputs & Artifacts

**1. Model File**: `model_lgbm.pkl` - Contains the base LightGBM, the calibrated model, feature names, and metadata (training date, hyperparameters, test AUC, test Brier).

**2. Scores**: `scores_all.csv` - Predictions for the whole dataset (train + test) with columns: `customer_id`, `prob_default_12m_base`, `prob_default_12m_calibrated`, `tier`, `is_test`.

**3. Thresholds**: `thresholds.json` - Stores the red/amber/green thresholds and percentiles for use in later scoring.

**4. Visualizations**:
- `calibration_lgbm.png`: Reliability diagram (before vs after calibration)
- `pr_curve_lgbm.png`: Precision-Recall curve
- `roc_curve_lgbm.png`: ROC curve  
- `shap_summary.png`: Quick SHAP summary (top 10 features)

All artifacts are version controlled to ensure reproducibility and auditability.

## Step 3.5: Model Monitoring

Module: `run_monitoring.py`

### The Model Degradation Problem

After training, the model reaches AUC = 0.93 on the test set. In production, however, performance can degrade over time for three reasons:

**1. Data Drift**: The portfolio changes (for example, the bank shifts lending from Manufacturing to Tech). Features follow a different distribution than in the training data → model predictions are no longer reliable.

**2. Concept Drift**: The relationship between features and default risk changes (for example, ICR < 1.5 used to be a strong signal, but after government support programs many firms with low ICR still do not default). The model has learned old patterns that no longer apply.

**3. Calibration Drift**: Predicted probabilities are no longer accurate. The model predicts PD = 20% for each of 100 customers, but only 10 actually default (10%) instead of 20 (20%). The Brier score rises.

→ **Monitoring is mandatory** to detect these changes before they affect business decisions.

---

### Metrics Monitoring

**PSI (Population Stability Index)**: Measures the distribution shift of features.

$$
\text{PSI} = \sum_{i=1}^{n} (\text{Actual}_i - \text{Expected}_i) \times \ln\left(\frac{\text{Actual}_i}{\text{Expected}_i}\right)
$$

Split each feature into bins (quantiles) and compute the share of observations in each bin for the baseline and the current data. A high PSI means the distribution has changed a lot.

- **PSI < 0.10**: Distribution is stable, the model still works well
- **0.10 ≤ PSI < 0.25**: Mild shift, investigate which features changed and why
- **PSI ≥ 0.25**: Severe shift, the model no longer fits the current data → **must retrain**
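A NumPy sketch of the PSI calculation for one feature follows. It assumes 10 quantile bins whose edges are taken from the baseline sample, and it clips empty bins to a small `eps` so the logarithm stays finite; both choices, and the function name `psi`, are illustrative rather than taken from `run_monitoring.py`.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior quantile edges from the baseline; outer bins are open-ended.
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_cnt = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    a_cnt = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    e_pct = np.clip(e_cnt / len(expected), eps, None)
    a_pct = np.clip(a_cnt / len(actual), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)
stable = rng.normal(0.0, 1.0, 5000)    # same distribution as the baseline
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation
psi_stable = psi(baseline, stable)
psi_shifted = psi(baseline, shifted)
print(round(psi_stable, 4), round(psi_shifted, 4))
```

In this synthetic setup the unchanged sample stays far below 0.10, while the one-standard-deviation shift lands well above the 0.25 retraining level.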

---

**Performance Metrics**: Compare AUC, PR-AUC, KS, and Brier between the baseline and the current data.

| Metric | Baseline | Current | Degradation Threshold |
|--------|----------|---------|----------------------|
| **AUC** | 0.9309 | 0.9315 | < 0.70 or a drop of > 5% |
| **PR-AUC** | 0.1782 | 0.1789 | Drop of > 10% |
| **KS** | 0.7927 | 0.7901 | Drop of > 20% |
| **Brier** | 0.0194 | 0.0195 | > 0.15 or an increase of > 5% |

**AUC < 0.70**: The model's ability to separate defaulters from non-defaulters has become too weak to rely on (for reference, random ranking gives AUC = 0.50).

**Brier > 0.15**: Calibration has broken down and the predictions are not trustworthy for decision making.

**KS drops by 20%**: Separation between good and bad customers has fallen sharply, so the tiers (Red/Amber/Green) are no longer meaningful.
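The degradation thresholds in the table can be expressed as a simple check. This sketch assumes the percentage thresholds are relative changes against the baseline value; the function name `degradation_alerts` and the dictionary keys are illustrative and do not reflect the actual layout of `baseline_metrics.json`.

```python
def degradation_alerts(baseline, current):
    """Return the names of metrics that cross a degradation threshold."""
    alerts = []
    if current["auc"] < 0.70 or current["auc"] < baseline["auc"] * 0.95:
        alerts.append("auc")
    if current["pr_auc"] < baseline["pr_auc"] * 0.90:
        alerts.append("pr_auc")
    if current["ks"] < baseline["ks"] * 0.80:
        alerts.append("ks")
    if current["brier"] > 0.15 or current["brier"] > baseline["brier"] * 1.05:
        alerts.append("brier")
    return alerts


baseline = {"auc": 0.9309, "pr_auc": 0.1782, "ks": 0.7927, "brier": 0.0194}
current = {"auc": 0.9315, "pr_auc": 0.1789, "ks": 0.7901, "brier": 0.0195}
alerts = degradation_alerts(baseline, current)
print(alerts)
```

With the baseline and current values from the table, no metric crosses its threshold, so the list of alerts is empty.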

---

### Automated Workflow

`run_monitoring.py` is integrated with `train_baseline.py`:

1. **Training** automatically creates `baseline_metrics.json` (AUC and Brier on the test set) and `feature_ews_train.parquet` (training data)
2. **Monitoring** loads the baseline metrics, predicts on the current data, and computes PSI plus the performance metrics
3. **Alerting** raises an automatic warning when thresholds are crossed (AUC < 0.70, PSI ≥ 0.25, Brier > 0.15)
4. **Logging** saves the results to `monitoring_YYYYMMDD_HHMMSS.json` and `psi_details.csv` to track trends

→ The production team runs monitoring **monthly** on each new month of data. If alerts fire, the Data Science team is convened to retrain or recalibrate the model.

---

### When to Retrain?

**Principle**: Retrain when the model no longer reflects current reality.

- **AUC < 0.70**: Discrimination power is too low
- **PSI ≥ 0.25**: The portfolio has changed too much since the training period
- **Brier > 0.15**: Probabilities are no longer accurate enough for capital allocation/provisioning

**Best practice**: Retrain **quarterly** on the most recent 2-3 years of data so the model keeps capturing the latest patterns, even when the metrics have not yet degraded seriously.

## Step 4: Model Explainability (SHAP)

Module: `src/explain.py`

### Overview

SHAP (SHapley Additive exPlanations) explains each feature's contribution to a prediction using game theory. A positive SHAP value means the feature raises the predicted default probability, a negative value means it lowers default risk, and the magnitude indicates the strength of the effect. This matters for the Credit Committee (explaining decisions), the RM Team (advising customers on how to improve), and Model Validation (confirming the model has learned the right patterns).

---

### Global Explainability

**Feature Importance** (`feature_importance.csv`): Mean absolute SHAP value per feature, showing which features influence the model most across the whole portfolio. For example, `dpd_max_180d__zs_sector_size` is usually the most important feature because DPD is the strongest signal of default risk.
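
As a rough illustration of the mean |SHAP| ranking, the sketch below takes a SHAP matrix (one row per customer, one column per feature) as plain Python lists. The helper name `mean_abs_shap` and the numbers are hypothetical; in the real pipeline the matrix would come from the SHAP explainer used in `explain.py`.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across customers."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP matrix: 3 customers x 3 features (illustrative numbers only).
features = [
    "dpd_max_180d__zs_sector_size",
    "icr_ttm__zs_sector_size",
    "ccc__zs_sector_size",
]
shap_rows = [
    [0.52, 0.20, -0.05],
    [-0.10, -0.30, 0.02],
    [0.40, 0.10, 0.03],
]
for name, value in mean_abs_shap(shap_rows, features):
    print(f"{name}: {value:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some customers up and others down would otherwise cancel out and look unimportant.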

**SHAP Summary Plot** (`shap_summary.png`): Beeswarm plot visualizing the impact of all features. Each dot is one customer; red = high feature value, blue = low feature value. The plot shows not only which features matter but also the direction of their impact (high DPD → high risk, high ICR → low risk).

---

### Local Explainability

**Top Drivers per Customer** (`top_drivers_per_customer.csv`): The top 5 features for each individual customer, answering questions such as "Why was customer C0042 placed in the Red tier?". The output includes the feature name, SHAP value, and actual feature value.

Example for customer C0042:
1. `dpd_max_180d__zs_sector_size`: SHAP = +0.52 (value = 120 days) → High DPD
2. `%util_mean_60d__zs_sector_size`: SHAP = +0.31 (value = 0.95) → Utilization close to the limit
3. `icr_ttm__zs_sector_size`: SHAP = +0.20 (value = 0.8) → Low ICR, difficulty servicing interest

With this information, the RM can advise the customer to: (1) clear outstanding payments to reduce DPD, (2) reduce credit usage or apply for a limit increase, (3) improve profitability or restructure debt.
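
Selecting the top drivers for one customer amounts to sorting by absolute SHAP value. The sketch below is an assumption about how this could be done, not the code in `explain.py`; the helper `top_drivers` is hypothetical, the first three entries reuse the C0042 example, and the last two are made up for illustration.

```python
def top_drivers(shap_values, feature_values, k=5):
    """Return (feature, shap, value) for the k largest |SHAP| of one customer."""
    ranked = sorted(shap_values, key=lambda f: abs(shap_values[f]), reverse=True)
    return [(f, shap_values[f], feature_values[f]) for f in ranked[:k]]


# Customer C0042: first three entries from the example, the rest hypothetical.
shap_c0042 = {
    "icr_ttm__zs_sector_size": 0.20,
    "dpd_max_180d__zs_sector_size": 0.52,
    "ccc__zs_sector_size": -0.04,
    "%util_mean_60d__zs_sector_size": 0.31,
    "breach_icr": 0.01,
}
values_c0042 = {
    "icr_ttm__zs_sector_size": 0.8,
    "dpd_max_180d__zs_sector_size": 120,
    "ccc__zs_sector_size": 0.3,
    "%util_mean_60d__zs_sector_size": 0.95,
    "breach_icr": 0,
}
for feature, shap_value, value in top_drivers(shap_c0042, values_c0042, k=3):
    print(f"{feature}: SHAP = {shap_value:+.2f} (value = {value})")
```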

---

### Dependence Plots

SHAP dependence plots for key features (`icr_ttm`, `ccc`, `%util_mean_60d`) show the non-linear relationship between feature value and SHAP value. For example, the ICR dependence plot may show: ICR < 1.5 has very high SHAP values (risk rises sharply), ICR 1.5-3.0 has gradually decreasing SHAP values, and ICR > 3.0 has SHAP near 0 (no additional risk). These insights help validate that the model is learning the correct business logic.

---

### Outputs Summary

| File | Type | Purpose |
|------|------|---------|
| `feature_importance.csv` | Global | Ranking features by importance |
| `shap_summary.png` | Global | Visual impact of all features |
| `top_drivers_per_customer.csv` | Local | Top 5 drivers per customer |
| `shap_dependence_*.png` | Global | Non-linear relationships |
| `summary.json` | Metadata | Config and stats |

## ⚙️ Step 5-6: [Optional] Re-calibration

### Why Optional?

Step 3 (`train_baseline.py`) already produces a **calibrated model** with percentile-based thresholds (Red = top 5%, Amber = top 5-15%) that is ready for production. Steps 5-6 are only needed when the business wants to **change the threshold strategy** from percentile-based to **absolute PD cutoffs** (for example: Red ≥ 20% PD, Amber ≥ 5% PD) to match a specific risk appetite or regulatory requirement.

In practice, the percentile-based approach is often preferred because it keeps the number of customers needing intensive monitoring within capacity. However, some organizations (especially banks complying with Basel/IFRS 9) require absolute thresholds for consistency with internal risk rating systems or regulatory reporting.

---

### Step 5: Extract Raw Scores

**Module**: `src/make_scores_raw.py`

Extracts raw probabilities from the **base LightGBM** (before the isotonic calibration applied in Step 3) to provide baseline scores for the re-calibration process. The output is `scores_raw.csv`, containing uncalibrated predictions for the entire dataset.

**Why needed?** Re-calibration takes raw scores as input because a new calibrator is fitted and combined with absolute thresholds, unlike the calibrator in Step 3.

---

### Step 6: Re-calibrate with Absolute Thresholds

**Module**: `src/calibrate.py`

Re-fits **Isotonic Regression** on the raw scores and applies absolute PD cutoffs instead of percentiles. The process: (1) fit the calibrator on the training set, (2) map raw scores → calibrated PD, (3) apply absolute thresholds (Red ≥ 20%, Amber ≥ 5%), (4) save the calibrator and thresholds.
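
To make steps (1) to (3) concrete, here is a dependency-free sketch of isotonic calibration using the pool-adjacent-violators algorithm with a step-function lookup. It is an illustration under stated assumptions, not the code in `calibrate.py` (which may rely on scikit-learn's `IsotonicRegression`, whose predictions interpolate linearly between knots). All function names and the toy data are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns (knots, pds), both non-decreasing."""
    agg = {}  # pool identical raw scores first
    for s, y in zip(scores, labels):
        total, n = agg.get(s, (0, 0))
        agg[s] = (total + y, n + 1)
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s in sorted(agg):
        blocks.append([agg[s][0], agg[s][1], s])
        # Merge backwards while the previous block has a higher default rate.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            total, n, s_max = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += n
            blocks[-1][2] = s_max
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict_pd(knots, pds, score):
    """Step-function lookup, clipped to the last block for out-of-range scores."""
    return pds[min(bisect_left(knots, score), len(pds) - 1)]


def assign_tier(pd_value, red_thr=0.20, amber_thr=0.05):
    if pd_value >= red_thr:
        return "Red"
    return "Amber" if pd_value >= amber_thr else "Green"


# Toy training data: raw scores with observed default flags (illustrative only).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
defaulted = [0, 0, 1, 0, 0, 1, 1, 1]
knots, pds = fit_isotonic(raw, defaulted)
for score in (0.15, 0.35, 0.90):
    pd_value = predict_pd(knots, pds, score)
    print(score, round(pd_value, 3), assign_tier(pd_value))
```

The three mid-range scores (0.3, 0.4, 0.5) violate monotonicity, so they are pooled into one block with a default rate of 1/3.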

**Key difference from Step 3:**
- Step 3: Calibrate → Calculate percentile thresholds → Tiers fixed by % (top 5%, 10%)
- Step 6: Calibrate → Apply absolute PD thresholds → Tiers vary by portfolio quality

**Outputs:**
- `calibrator.pkl`: New isotonic calibrator
- `mapping.csv`: Raw score → Calibrated PD mapping table
- `thresholds.json`: Absolute cutoffs (red: 0.20, amber: 0.05)
- `calibration_full.png`: Reliability diagram
- `pr_curve_full.png`: Precision-Recall curve

**Tradeoff:** With absolute thresholds, the number of customers in the Red/Amber tiers can fluctuate with portfolio quality (good period → few Red, bad period → many Red), which complicates capacity planning.
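
The tradeoff can be seen by tiering the same portfolios both ways. This is a sketch with hypothetical helper names and synthetic PDs; the percentile shares (5% Red, 10% Amber) and the cutoffs (0.20, 0.05) are the ones quoted in this document.

```python
from collections import Counter


def tiers_by_percentile(pds, red_pct=0.05, amber_pct=0.10):
    """Top red_pct of PDs become Red, the next amber_pct Amber, the rest Green."""
    order = sorted(range(len(pds)), key=lambda i: pds[i], reverse=True)
    n_red = round(len(pds) * red_pct)
    n_amber = round(len(pds) * amber_pct)
    tiers = ["Green"] * len(pds)
    for rank, i in enumerate(order):
        if rank < n_red:
            tiers[i] = "Red"
        elif rank < n_red + n_amber:
            tiers[i] = "Amber"
    return tiers


def tiers_by_cutoff(pds, red_thr=0.20, amber_thr=0.05):
    """Absolute PD cutoffs: tier sizes depend on portfolio quality."""
    return [
        "Red" if p >= red_thr else "Amber" if p >= amber_thr else "Green"
        for p in pds
    ]


good_period = [i / 2500 for i in range(1, 101)]  # synthetic PDs up to 4%
bad_period = [i / 250 for i in range(1, 101)]    # synthetic PDs up to 40%
for name, pds in (("good", good_period), ("bad", bad_period)):
    print(name, Counter(tiers_by_percentile(pds)), Counter(tiers_by_cutoff(pds)))
```

The percentile split always yields 5 Red and 10 Amber customers out of 100, while the cutoff split goes from no Red customers in the good period to a large Red tier in the bad one.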

## 🎯 Step 7: Production Scoring

Module: `src/scoring.py`

Scoring is the final step that puts the model into production. The script loads the trained model, predicts PD for all customers from the feature snapshot at the as-of date (for example: 2025-06-30), then assigns tiers and action recommendations. The output is used directly by the RM team and Risk Committee for business decisions.

---

### Inputs & Outputs

**Inputs:**
1. **Features**: `data/processed/feature_ews.parquet` - Feature snapshot at the as-of date
2. **Model**: `artifacts/models/model_lgbm.pkl` - Trained & calibrated LightGBM
3. **Thresholds**: `artifacts/calibration/thresholds.json` or `artifacts/models/thresholds.json` - Depending on the approach (absolute vs percentile)

**Output**: `ews_scored_YYYY-MM-DD.csv` with columns:

| Column | Description | Example |
|--------|-------------|---------|
| `customer_id` | Customer identifier | C0042 |
| `prob_default_12m_calibrated` | 12-month PD (0-1) | 0.2341 |
| `score_ews` | EWS Score (0-100) | 76.59 |
| `tier` | Risk tier | Red |
| `action` | Recommended action | Client meeting within 5 days; prepare 13-week cash flow;... |

**EWS Score Formula**: `100 × (1 - PD)` → High score = low risk (100 = best, 0 = worst)

---

### Risk Tiers & Actions

| Tier | Criteria | Action | Frequency |
|------|----------|--------|-----------|
| **Green** | PD < 5% (or bottom 85%) | Periodic monitoring; financial statements updated on time | Quarterly |
| **Amber** | 5% ≤ PD < 20% (or top 5-15%) | RM review within 10 days; request management accounts; check receivables and payables; restrict credit limit | Monthly |
| **Red** | PD ≥ 20% (or top 5%) | Client meeting within 5 days; prepare 13-week cash flow; consider covenant tightening/collateral; watchlist | Weekly |

**Note**: Criteria depend on the threshold approach (absolute vs percentile) chosen in Step 3 or Steps 5-6.
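
A sketch of how one output row could be assembled from a calibrated PD, using the absolute-threshold variant. The function name `score_customer`, the `red`/`amber` threshold keys, and the abbreviated action strings are assumptions for illustration; `scoring.py` may structure this differently.

```python
# Abbreviated action texts per tier (hypothetical wording).
ACTIONS = {
    "Green": "Periodic monitoring; financial statements updated on time",
    "Amber": "RM review within 10 days; request management accounts",
    "Red": "Client meeting within 5 days; prepare 13-week cash flow",
}


def score_customer(customer_id, pd_12m, thresholds):
    """Build one scoring row: PD, EWS score = 100 * (1 - PD), tier, action."""
    if pd_12m >= thresholds["red"]:
        tier = "Red"
    elif pd_12m >= thresholds["amber"]:
        tier = "Amber"
    else:
        tier = "Green"
    return {
        "customer_id": customer_id,
        "prob_default_12m_calibrated": pd_12m,
        "score_ews": round(100 * (1 - pd_12m), 2),
        "tier": tier,
        "action": ACTIONS[tier],
    }


thresholds = {"red": 0.20, "amber": 0.05}  # assumed layout of thresholds.json
row = score_customer("C0042", 0.2341, thresholds)
print(row["score_ews"], row["tier"])
```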

---

### Production Workflow

**Monthly Cadence**:
1. **Last day of month**: Run the scoring script with as-of date = month-end
2. **Day 1-2**: Distribute the report to the RM team and Risk Committee
3. **Day 3-10**: RMs carry out actions by tier (Amber reviews, Red meetings)
4. **Throughout month**: Track action completion and update customer status

**Integration with Banking Systems**:
- **Input**: Features from the core banking system (financial data, credit transactions, cashflow)
- **Output**: EWS scores imported into CRM/Credit Risk systems
- **Alerts**: Auto-trigger emails/notifications for customers moving into the Red tier

**Monitoring**: Track tier migrations month-over-month to identify portfolio trends (improving/deteriorating).
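
Tier migrations can be tallied as a simple transition count between two monthly scoring files. The sketch below assumes each file has been loaded into a `customer_id → tier` mapping; the helper name and the sample customers are hypothetical.

```python
from collections import Counter


def tier_migrations(previous, current):
    """Count (previous_tier, current_tier) pairs for customers in both months."""
    common = previous.keys() & current.keys()
    return Counter((previous[c], current[c]) for c in common)


may = {"C0001": "Green", "C0002": "Amber", "C0042": "Amber"}
june = {"C0001": "Green", "C0002": "Green", "C0042": "Red", "C0099": "Green"}

migrations = tier_migrations(may, june)
for (old, new), count in sorted(migrations.items()):
    print(f"{old} -> {new}: {count}")
```

Customers that appear in only one month (new or closed relationships) are excluded here and would need separate handling.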

## 📊 Model Performance & Validation

### Expected Performance Metrics (Holdout 20%)

The model is evaluated on the holdout test set with the following metrics (computed in `train_baseline.py` lines 99-104):

| Metric | Target Range | Meaning | Code |
|--------|-------------|---------|------|
| **AUC-ROC** | 0.75 - 0.85 | Ability to separate defaulters from non-defaulters | `roc_auc_score(y_te, p_te)` |
| **PR-AUC** | 0.40 - 0.60 | Performance on the positive class (important with imbalanced data) | `average_precision_score(y_te, p_te)` |
| **KS Statistic** | 0.35 - 0.50 | Maximum separation between cumulative distributions | `ks_score(y_te, p_te)` |
| **Brier Score** | 0.05 - 0.10 | Calibration quality (lower is better) | `brier_score_loss(y_te, p_te)` |

**Calibration Quality**: The reliability curve (predicted probabilities vs actual default rates) should lie close to the diagonal (y = x). Isotonic calibration improves this considerably, typically reducing the Brier score from ~0.12 to ~0.08. Plots are generated in the `plot_calibration_pr()` function (lines 47-61).

**Precision-Recall Tradeoff**: The Red threshold (PD ≥ 20%) gives high precision and moderate recall; the Amber threshold (PD ≥ 5%) gives balanced precision and recall.
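
For readers unfamiliar with KS and Brier, the following dependency-free sketch shows what the two numbers measure. These are illustrative re-implementations on toy data; the `ks_score` helper in `train_baseline.py` and scikit-learn's `brier_score_loss` may differ in details such as tie handling.

```python
def ks_score(y_true, y_prob):
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all customers sharing the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
        i = j
    return best


def brier_score(y_true, y_prob):
    """Mean squared error between predicted PD and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy holdout set (illustrative only): 4 defaulters, 4 non-defaulters.
y_te = [0, 0, 0, 1, 0, 1, 1, 1]
p_te = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
print(ks_score(y_te, p_te), brier_score(y_te, p_te))
```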

---

### Model Monitoring & Maintenance

**Quarterly Reviews** (monitoring scripts must be implemented separately):
1. **Performance drift**: Monitor AUC and KS on new data (target: no drop > 5%)
   - Re-run `train_baseline.py` on new data and compare metrics
2. **Population Stability Index (PSI)**: Measures distribution shift in features (target: PSI < 0.15)
   - Formula: `PSI = Σ [(actual% - expected%) × ln(actual% / expected%)]`
3. **Feature stability**: Check data quality, missing values, outliers
   - Use data profiling tools or pandas `.describe()`
4. **Recalibration**: If the Brier score rises above 0.10, consider re-fitting the calibrator
   - Re-run Step 6 (`calibrate.py`) with the new data

**Red Flags Trigger Retraining**:
- AUC drops below 0.70
- Brier score > 0.15
- Large prediction shifts without business explanation (e.g., 10% of customers changing tier unexpectedly)
- PSI > 0.25 (severe distribution shift)

## Feature Importance Ranking

Based on the SHAP analysis in `explain.py`, these are the 10 features with the strongest impact on default predictions:

| # | Feature Name | Category | Business Interpretation |
|---|--------------|----------|------------------------|
| 1 | `dpd_max_180d__zs_sector_size` | Behavioral | Maximum DPD over 6 months - strongest signal of default risk |
| 2 | `%util_mean_60d__zs_sector_size` | Behavioral | Average credit utilization - reflects liquidity stress |
| 3 | `icr_ttm__zs_sector_size` | Financial | Interest Coverage Ratio - ability to service interest |
| 4 | `debt_to_ebitda__zs_sector_size` | Financial | Financial leverage - degree of indebtedness |
| 5 | `ccc__zs_sector_size` | Financial | Cash Conversion Cycle - working capital management efficiency |
| 6 | `inflow_drop_60d__zs_sector_size` | Cashflow | Revenue decline - deteriorating cashflow |
| 7 | `dpd_trend_180d__zs_sector_size` | Behavioral | Rising DPD trend - payment behavior is worsening |
| 8 | `breach_icr` | Covenant | ICR covenant breach - direct trigger event |
| 9 | `current_ratio__zs_sector_size` | Financial | Current Ratio < 1.0 - short-term liquidity risk |
| 10 | `delta_ccc_qoq__zs_sector_size` | Financial | Quarter-over-quarter change in CCC - efficiency is declining |

### Analysis by Category

- **Behavioral (40%)**: Actual payment patterns are the strongest predictor - DPD history and utilization give the earliest signal of financial difficulty
- **Financial (35%)**: Fundamental financial indicators (ICR, leverage, liquidity ratios) rank second
- **Cashflow (15%)**: Revenue trends and cashflow dynamics detect deterioration earlier than financial statements
- **Covenant (10%)**: Breach events have a significant impact but appear later

**Conclusion**: The model prioritizes behavioral signals because payment difficulties appear before financial statements fully reflect them. This is consistent with risk management practice in credit monitoring.

## 🚀 Complete End-to-End Pipeline

### Full Workflow (Development)

```bash
# Step 1: Generate synthetic data
python src/generate_data.py --n-customers 1000 --output-dir data/raw

# Step 2: Feature engineering
python src/feature_engineering.py --raw-dir data/raw --asof 2025-06-30 --outdir data/processed

# Step 3: Train model + calibration
python src/train_baseline.py --features data/processed/feature_ews.parquet --test-size 0.2 --seed 42 --red-pct 0.05 --amber-pct 0.10 --outdir artifacts/models

# Step 4: Generate SHAP explanations
python src/explain.py --model artifacts/models/model_lgbm.pkl --features data/processed/feature_ews.parquet --outdir artifacts/shap --max-display 20

# [Optional] Step 5-6: Re-calibration with absolute thresholds
python src/make_scores_raw.py --features data/processed/feature_ews.parquet --model artifacts/models/model_lgbm.pkl --out data/processed/scores_raw.csv
python src/calibrate.py --input data/processed/scores_raw.csv --red-thr 0.20 --amber-thr 0.05 --outdir artifacts/calibration

# Step 7: Production scoring
python src/scoring.py --features data/processed/feature_ews.parquet --model artifacts/models/model_lgbm.pkl --thresholds artifacts/calibration/thresholds.json --asof 2025-06-30 --outdir artifacts/scoring
```

## 📈 Artifacts & Outputs Summary

### Directory Structure

```
artifacts/
├── models/
│   ├── model_lgbm.pkl              # Trained model (base + calibrated + features)
│   ├── scores_all.csv              # Training set predictions + tiers
│   ├── thresholds.json             # Percentile-based thresholds
│   ├── calibration_lgbm.png        # Reliability diagram
│   ├── pr_curve_lgbm.png           # Precision-Recall curve
│   └── shap_summary.csv/png        # Quick SHAP summary
│
├── calibration/
│   ├── calibrator.pkl              # Isotonic calibrator (re-fitted)
│   ├── mapping.csv                 # Raw score → Calibrated PD mapping
│   ├── thresholds.json             # Absolute PD thresholds (Red ≥20%, Amber ≥5%)
│   ├── calibration_full.png        # Reliability curve (re-calibrated)
│   └── pr_curve_full.png           # PR curve (re-calibrated)
│
├── shap/
│   ├── feature_importance.csv      # Global feature importance (mean |SHAP|)
│   ├── shap_summary.png            # SHAP summary (beeswarm) plot
│   ├── top_drivers_per_customer.csv # Local explanations (top 5 features per customer)
│   ├── shap_dependence_*.png       # Dependence plots for key features
│   └── summary.json                # Metadata
│
└── scoring/
    ├── ews_scored_2025-06-30.csv   # Production scores (customer_id, PD, score, tier, action)
    └── thresholds_used.json        # Thresholds applied in this run
```

### Key Files for Different Stakeholders

| Stakeholder | Key Files |
|-------------|-----------|
| **Risk Manager** | `ews_scored_*.csv`, `top_drivers_per_customer.csv` |
| **Credit Committee** | `scores_all.csv`, `shap_summary.png`, `pr_curve_lgbm.png` |
| **Data Scientist** | `model_lgbm.pkl`, `feature_importance.csv`, all plots |
| **Model Validator** | `calibration_*.png`, `thresholds.json`, metrics in console output |
| **Auditor** | All artifacts + `summary.json` for traceability |

## 🔬 Technical Deep Dives

### 1. Why Isotonic Calibration?

**Problem with raw LightGBM probabilities:**
- Overconfident near 0 and 1
- Not well-calibrated for credit risk (regulatory requirement)

**Isotonic Regression:**
- Non-parametric, monotonic calibration
- Preserves rank order up to ties (AUC essentially unchanged)
- Improves Brier score and reliability

---

### 2. Class Imbalance Handling

**Default rate ~5-10%** → Highly imbalanced

**Strategies applied:**
1. **`scale_pos_weight`** in LightGBM
   - Up-weights the positive (default) class in the training loss
   - Formula: `(n_negative / n_positive)`
   
2. **Evaluation metrics:** PR-AUC instead of just ROC-AUC
   - ROC-AUC can be misleading with imbalanced data
   
3. **Threshold tuning:** Separate from 0.5
   - Red/Amber thresholds based on business capacity
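
A small sketch of how the `scale_pos_weight` value can be derived from the training labels. The helper name and the 8% default rate are hypothetical; `objective` and `scale_pos_weight` are standard LightGBM parameter names, but the exact parameter set used in `train_baseline.py` is not shown in this document.

```python
def compute_scale_pos_weight(labels):
    """Ratio n_negative / n_positive for binary 0/1 labels."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive (default) samples in the training labels")
    return n_neg / n_pos


labels = [1] * 8 + [0] * 92  # hypothetical training set with an 8% default rate
params = {
    "objective": "binary",
    "scale_pos_weight": compute_scale_pos_weight(labels),
}
print(params)
```

Re-weighting the positive class shifts the raw predicted probabilities upward, which is one more reason the calibration step is applied before PDs are used.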

---

### 3. Feature Normalization (Sector-Size)

**Why normalize by (Sector, Size)?**

```python
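# Example: ICR = 2.0 for a SME in Retail
# Is this good or bad?

# Without normalization: Compare to all companies → Looks average
# With sector-size normalization: Compare to SME Retailers → Looks good!

# Implementation:
import numpy as np
import pandas as pd


def sector_size_normalize(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Robust z-score per (sector, size) group: (x - median) / IQR."""
    df = df.copy()  # avoid mutating the caller's frame
    grouped = df.groupby(['sector_code', 'size_bucket'])
    for c in cols:
        median = grouped[c].transform('median')
        iqr = grouped[c].transform(lambda x: x.quantile(0.75) - x.quantile(0.25))
        # A zero IQR (constant group) would divide by zero; emit NaN instead
        df[f'{c}__zs_sector_size'] = (df[c] - median) / iqr.replace(0, np.nan)
    return df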
```

**Benefits:**
- Fair comparison (SME vs SME, Corp vs Corp, same sector)
- Robust to outliers (median/IQR instead of mean/std)
- Better predictive power

---

### 4. Label Definition: Event Horizon = 12 Months

**Basel Standard:** PD typically measured over 12-month horizon

**Label rule:**
```python
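from itertools import groupby

# Default if: DPD ≥ 90 days for at least 30 consecutive days in next 12M
# (future_dpd_sequence: daily DPD values over the next 12 months)
runs = [sum(1 for _ in g) for over_90, g in groupby(d >= 90 for d in future_dpd_sequence) if over_90]
event_h12m = 1 if max(runs, default=0) >= 30 else 0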
```

**Rationale:**
- 90 DPD: Industry standard for "default"
- 30 consecutive days: Avoid transient spikes
- 12M horizon: Align with regulatory reporting

## 🎓 Basel & Regulatory Alignment

### Basel Framework Compliance

**1. PD (Probability of Default) Estimation**
- ✅ 12-month horizon (Basel standard)
- ✅ Calibration via Isotonic Regression (a through-the-cycle (TTC) PD additionally needs default data spanning a full credit cycle)
- ✅ Backtesting with holdout set

**2. Key Financial Ratios**
- ✅ **ICR (Interest Coverage Ratio):** EBIT / Interest
- ✅ **DSCR (Debt Service Coverage Ratio):** (EBITDA - CAPEX) / Debt Service
- ✅ **Leverage Ratio:** Total Debt / EBITDA
- ✅ **Liquidity Ratio:** Current Assets / Current Liabilities

**3. Early Warning Indicators**
- ✅ DPD tracking (30, 60, 90+ days)
- ✅ Credit limit breach monitoring
- ✅ Covenant breach flags
- ✅ Cashflow deterioration signals

**4. Model Governance**
- ✅ **Explainability:** SHAP for transparency
- ✅ **Calibration:** Reliability curves
- ✅ **Validation:** AUC, KS, Brier on holdout
- ✅ **Documentation:** All artifacts saved with metadata

---

### Risk Appetite Framework

**Tier Definitions aligned with Risk Appetite:**

| Tier | PD Range | Portfolio Allocation | Risk Appetite |
|------|----------|---------------------|---------------|
| Green | < 5% | 85% | Accept: Standard monitoring |
| Amber | 5-20% | 10% | Tolerate: Enhanced monitoring |
| Red | ≥ 20% | 5% | Mitigate/Exit: Immediate action |

**Capacity Management:**
- Red tier (5%): Max ~50 customers → 5 FTE RM (10 customers/RM)
- Amber tier (10%): Max ~100 customers → 10 FTE RM (10 customers/RM)
- Green tier (85%): Portfolio monitoring only
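
The headcount figures above follow from a portfolio of roughly 1,000 customers. A short sketch of the arithmetic, with a hypothetical helper name:

```python
import math


def rm_headcount(n_customers, tier_share, customers_per_rm=10):
    """FTE relationship managers needed to cover one tier."""
    tier_size = round(n_customers * tier_share)
    return math.ceil(tier_size / customers_per_rm)


portfolio = 1000  # assumed portfolio size implied by the ~50 / ~100 figures
print("Red:", rm_headcount(portfolio, 0.05), "FTE")
print("Amber:", rm_headcount(portfolio, 0.10), "FTE")
```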

---

### Regulatory Reporting

**Outputs compatible with:**
- **IFRS 9:** Expected Credit Loss (ECL) calculation
  - PD × LGD × EAD = ECL
  - Model provides PD component
  
- **Basel II/III:** Internal Ratings-Based (IRB) approach
  - PD model for corporate exposures
  - Complement with LGD and EAD models
  
- **Stress Testing:** Scenario-based PD adjustments
  - Re-run model with stressed features
  - Example: Revenue shock, Interest rate shock
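
As a worked instance of the ECL identity, the sketch below plugs the example PD of 0.2341 into the formula. The LGD and EAD values are purely hypothetical, since this project only provides the PD component.

```python
def expected_credit_loss(pd_12m, lgd, ead):
    """12-month ECL = PD x LGD x EAD."""
    return pd_12m * lgd * ead


pd_12m = 0.2341   # calibrated PD from the EWS model
lgd = 0.45        # hypothetical loss given default
ead = 1_000_000   # hypothetical exposure at default
print(f"ECL = {expected_credit_loss(pd_12m, lgd, ead):,.0f}")
```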

## References & Resources

### Academic & Industry Papers

1. **Basel Committee on Banking Supervision**
   - [Basel II: International Convergence of Capital Measurement](https://www.bis.org/publ/bcbs128.htm)
   - PD, LGD, EAD estimation frameworks
   
2. **IFRS 9 - Expected Credit Loss**
   - 12-month vs Lifetime PD
   - Staging models (Stage 1, 2, 3)

3. **Altman Z-Score (1968)**
   - Classic credit scoring model for manufacturing firms
   - Foundation for many modern models

4. **SHAP: Lundberg & Lee (2017)**
   - [A Unified Approach to Interpreting Model Predictions](https://arxiv.org/abs/1705.07874)
   - Game-theoretic feature attribution

---

### Tools & Libraries

**Python Packages:**
- `lightgbm`: Gradient boosting framework
- `shap`: Model explainability
- `scikit-learn`: ML utilities, calibration
- `pandas`, `numpy`: Data manipulation
- `matplotlib`, `seaborn`, `plotly`: Visualization

**Development:**
- `ruff`: Fast Python linter & formatter
- `pytest`: Testing framework
- `cookiecutter-data-science`: Project template

---

### Contact & Support

**Project Maintainer:** Duong N.C.K  
**Repository:** [github.com/dylanng3/corporate-credit-ews](https://github.com/dylanng3/corporate-credit-ews)  
**License:** MIT License