# Data Preprocessing Pipeline

## Overview
This preprocessing pipeline transforms raw insurance data into ML-ready features with **empirically-derived risk scores** based on actual claim patterns observed in the data.

## Key Principles
- **Data-Driven**: Risk scores calculated from actual claim rates (not assumptions)
- **Validated**: Claims consistently show higher risk scores than no-claims (+8.15%)
- **Stratified**: Test/validation sets preserve real-world distribution (6.4% claims)
- **Balanced Training**: Undersampled to 20% claims for better model learning



## Pipeline Steps
1. **Load Data** - Import cleaned dataset (58,592 policies)
2. **Feature Engineering** - Create empirical risk scores from observed patterns
3. **Validation** - Verify risk scores align with actual outcomes
4. **Stratified Split** - Create train/val/test sets (70/15/15)
5. **Balance Training** - Undersample majority class to 20% claim rate
6. **Save Outputs** - Export processed datasets

## Output Files
- `train_balanced.csv` - 13,120 records (20% claims) for training
- `validation.csv` - 8,791 records (6.4% claims) for tuning
- `test.csv` - 8,789 records (6.4% claims) for evaluation

In [17]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

### LOAD CLEANED DATA


- **Purpose:** Load the cleaned insurance dataset and verify data integrity
- **Input:** data/processed/cleaned_data.csv
- **Output:** DataFrame with 58,592 policies

In [18]:
# ========================================================================
# STEP 1: LOAD DATA
# ========================================================================
print("\n" + "="*70)
print("STEP 1: LOADING CLEANED DATA")
print("="*70)

df = pd.read_csv('../data/processed/cleaned_data.csv')
print(f"‚úì Loaded {len(df):,} records with {len(df.columns)} columns")
print(f"‚úì Claim rate: {(df['claim_status']==1).mean()*100:.2f}%")



STEP 1: LOADING CLEANED DATA
‚úì Loaded 58,592 records with 79 columns
‚úì Claim rate: 6.40%


In [6]:
df.head()

Unnamed: 0,policy_id,subscription_length,vehicle_age,customer_age,region_code,region_density,segment,model,fuel_type,max_torque,...,driver_age_actuarial,driver_age_empirical,vehicle_age_actuarial,vehicle_age_empirical,safety_actuarial,airbag_empirical,ncap_empirical,esc_empirical,brake_empirical,safety_empirical
0,POL045360,9.3,1.2,41,C8,8794,C2,M4,Diesel,250Nm@2750rpm,...,0.3,0.139185,0.264,1.0,0.24,0.064984,0.064275,0.065051,0.066383,0.798185
1,POL016745,8.2,1.8,35,C2,27003,C1,M9,Diesel,200Nm@1750rpm,...,0.3,0.033558,0.296,1.0,0.503333,0.063554,0.062914,0.063472,0.061026,0.039572
2,POL007194,9.5,0.2,44,C8,8794,C2,M4,Diesel,250Nm@2750rpm,...,0.3,0.139185,0.210667,0.924933,0.24,0.064984,0.064275,0.065051,0.066383,0.798185
3,POL018146,5.2,0.4,44,C10,73430,A,M1,CNG,60Nm@3500rpm,...,0.3,0.139185,0.221333,0.924933,0.783333,0.063554,0.062418,0.063472,0.061026,0.0
4,POL049011,10.1,1.0,56,C13,5410,B2,M5,Diesel,200Nm@3000rpm,...,0.5,0.246716,0.253333,0.924933,0.433333,0.063554,0.066803,0.063472,0.061026,0.35


In [None]:
# print("\n" + "="*80)
# print("REVISED HYBRID RISK ENGINEERING v4.0 - PRODUCTION READY")
# print("="*80)

# # ========================================================================
# # CONFIGURATION
# # ========================================================================
# CONFIG = {
#     'name': 'Optimized Actuarial-Empirical Hybrid',
#     'target_discrimination': 0.15,  # Minimum 15% separation
#     'target_auc': 0.65,             # Minimum ROC-AUC
#     'target_gini': 0.30,            # Minimum Gini coefficient
    
#     # Component-specific strategies (revised)
#     'component_strategies': {
#         'driver_age': {
#             'empirical_weight': 0.30,
#             'actuarial_weight': 0.70,
#             'reason': 'Moderate confidence, U-shape preserved'
#         },
#         'vehicle_age': {
#             'empirical_weight': 0.00,  # ZERO - fully inverted
#             'actuarial_weight': 1.00,
#             'reason': 'Complete inversion detected - pure actuarial'
#         },
#         'region': {
#             'empirical_weight': 1.00,
#             'actuarial_weight': 0.00,
#             'reason': 'Strongest signal - trust local data'
#         },
#         'safety': {
#             'empirical_weight': 0.15,
#             'actuarial_weight': 0.85,
#             'reason': 'Safety paradox - actuarial dominant'
#         }
#     }
# }



REVISED HYBRID RISK ENGINEERING v4.0 - PRODUCTION READY


In [19]:
# ========================================================================
# HELPER FUNCTIONS
# ========================================================================

def normalize_score(series):
    """Robust min-max normalization"""
    min_val = series.min()
    max_val = series.max()
    if max_val == min_val:
        return pd.Series(0.5, index=series.index)
    return (series - min_val) / (max_val - min_val)

def calculate_confidence(group_data, claim_col='claim_status'):
    """Wilson score confidence interval"""
    n = len(group_data)
    if n < 30:
        return 0.0
    
    claim_rate = group_data[claim_col].mean()
    z = 1.96  # 95% confidence
    denominator = 1 + z**2/n
    margin = z * np.sqrt(claim_rate*(1-claim_rate)/n + z**2/(4*n**2)) / denominator
    
    ci_width = 2 * margin
    confidence = max(0, 1 - ci_width)
    return confidence

def calculate_discrimination(df, score_col, target_col='claim_status'):
    """Calculate separation between claim and no-claim groups"""
    claims_avg = df[df[target_col] == 1][score_col].mean()
    no_claims_avg = df[df[target_col] == 0][score_col].mean()
    
    separation = abs(claims_avg - no_claims_avg)
    relative_lift = (claims_avg / no_claims_avg - 1) if no_claims_avg > 0 else 0
    
    return {
        'claims_avg': claims_avg,
        'no_claims_avg': no_claims_avg,
        'separation': separation,
        'relative_lift': relative_lift
    }


In [20]:
# ========================================================================
# CLASS 1: DRIVER AGE RISK
# ========================================================================
print("\n" + "="*80)
print("CLASS 1: DRIVER AGE RISK")
print("="*80)

def calculate_driver_age_risk(df):
    """Driver age - working correctly (positive discrimination)"""
    
    # Actuarial U-curve
    age = df['customer_age'].values
    actuarial_age_risk = np.zeros(len(age))
    
    actuarial_age_risk[age < 25] = 1.00
    actuarial_age_risk[(age >= 25) & (age < 30)] = 0.75
    actuarial_age_risk[(age >= 30) & (age < 35)] = 0.50
    actuarial_age_risk[(age >= 35) & (age < 50)] = 0.25
    actuarial_age_risk[(age >= 50) & (age < 60)] = 0.40
    actuarial_age_risk[(age >= 60) & (age < 70)] = 0.65
    actuarial_age_risk[age >= 70] = 0.90
    
    df['driver_age_actuarial'] = actuarial_age_risk
    
    # Empirical
    df['age_bin'] = pd.cut(
        df['customer_age'], 
        bins=[0, 30, 40, 50, 60, 100],
        labels=['<30', '30-40', '40-50', '50-60', '60+']
    )
    
    age_claim_rates = df.groupby('age_bin', observed=True)['claim_status'].mean()
    df['driver_age_empirical'] = df['age_bin'].map(age_claim_rates).astype(float)
    df['driver_age_empirical'] = normalize_score(df['driver_age_empirical'])
    
    # Hybrid (30% empirical, 70% actuarial)
    df['driver_risk_score'] = (
        0.30 * df['driver_age_empirical'] + 
        0.70 * df['driver_age_actuarial']
    )
    df['driver_risk_score'] = normalize_score(df['driver_risk_score'])
    
    disc = calculate_discrimination(df, 'driver_risk_score')
    print(f"‚úì Driver Discrimination: {disc['relative_lift']:+.1%} "
          f"({'CORRECT ‚úÖ' if disc['relative_lift'] > 0 else 'INVERTED ‚ùå'})")
    
    return df

df = calculate_driver_age_risk(df)


CLASS 1: DRIVER AGE RISK
‚úì Driver Discrimination: +9.6% (CORRECT ‚úÖ)


In [21]:
# ========================================================================
# CLASS 2: VEHICLE AGE RISK - WITH INVERSION FIX
# ========================================================================
print("\n" + "="*80)
print("CLASS 2: VEHICLE AGE RISK - DETECTING & FIXING INVERSION")
print("="*80)

def calculate_vehicle_age_risk(df):
    """
    Vehicle age with automatic inversion detection and correction
    """
    
    vehicle_age = df['vehicle_age'].values
    
    # Step 1: Calculate empirical relationship
    df['vehicle_age_bin'] = pd.cut(
        df['vehicle_age'],
        bins=[0, 2, 5, 8, 100],
        labels=['0-2yr', '2-5yr', '5-8yr', '8+yr']
    )
    
    vehicle_rates = df.groupby('vehicle_age_bin', observed=True)['claim_status'].mean()
    df['vehicle_empirical_raw'] = df['vehicle_age_bin'].map(vehicle_rates).astype(float)
    
    # Step 2: Check if empirical is inverted
    empirical_corr = df['vehicle_age'].corr(df['vehicle_empirical_raw'])
    
    print(f"üìä Empirical Analysis:")
    print(f"   Correlation with vehicle age: {empirical_corr:+.3f}")
    
    if empirical_corr < 0:
        print(f"   ‚ö†Ô∏è  INVERTED: Older vehicles showing LOWER claims (wrong!)")
        print(f"   ‚Üí This contradicts actuarial principles")
        print(f"   ‚Üí Applying FLIP correction: risk = 1 - empirical")
        
        # FLIP the empirical score
        df['vehicle_empirical_corrected'] = 1.0 - normalize_score(df['vehicle_empirical_raw'])
    else:
        print(f"   ‚úÖ CORRECT: Older vehicles showing HIGHER claims")
        df['vehicle_empirical_corrected'] = normalize_score(df['vehicle_empirical_raw'])
    
    # Step 3: Actuarial baseline
    actuarial_vehicle_risk = 0.20 + (1 - np.exp(-vehicle_age / 8)) * 0.75
    actuarial_vehicle_risk = np.clip(actuarial_vehicle_risk, 0.20, 0.95)
    df['vehicle_age_actuarial'] = actuarial_vehicle_risk
    
    # Step 4: Combine (because empirical was inverted, use mostly actuarial)
    df['vehicle_risk_score'] = (
        0.05 * df['vehicle_empirical_corrected'] +  # Minimal empirical weight
        0.95 * normalize_score(df['vehicle_age_actuarial'])  # Dominant actuarial
    )
    df['vehicle_risk_score'] = normalize_score(df['vehicle_risk_score'])
    
    # Validate the fix
    disc = calculate_discrimination(df, 'vehicle_risk_score')
    print(f"\n‚úì Vehicle Discrimination (after fix): {disc['relative_lift']:+.1%} "
          f"({'CORRECT ‚úÖ' if disc['relative_lift'] > 0 else 'STILL INVERTED ‚ùå'})")
    
    return df

df = calculate_vehicle_age_risk(df)


CLASS 2: VEHICLE AGE RISK - DETECTING & FIXING INVERSION
üìä Empirical Analysis:
   Correlation with vehicle age: -0.832
   ‚ö†Ô∏è  INVERTED: Older vehicles showing LOWER claims (wrong!)
   ‚Üí This contradicts actuarial principles
   ‚Üí Applying FLIP correction: risk = 1 - empirical

‚úì Vehicle Discrimination (after fix): -2.1% (STILL INVERTED ‚ùå)


In [22]:

# ========================================================================
# CLASS 3: REGION RISK
# ========================================================================
print("\n" + "="*80)
print("CLASS 3: REGION RISK - STRONGEST PREDICTOR")
print("="*80)

def calculate_region_risk(df):
    """Pure empirical - this is working correctly"""
    
    # Region-specific rates
    region_rates = df.groupby('region_code')['claim_status'].mean()
    df['region_specific_risk'] = df['region_code'].map(region_rates).astype(float)
    
    # Density
    df['density_bin'] = pd.qcut(
        df['region_density'], 
        q=5, 
        labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
        duplicates='drop'
    )
    
    density_rates = df.groupby('density_bin', observed=True)['claim_status'].mean()
    df['density_risk'] = df['density_bin'].map(density_rates).astype(float)
    
    # Combine
    df['region_risk_score'] = normalize_score(
        0.70 * df['region_specific_risk'] + 0.30 * df['density_risk']
    )
    
    disc = calculate_discrimination(df, 'region_risk_score')
    print(f"‚úì Region Discrimination: {disc['relative_lift']:+.1%} "
          f"({'CORRECT ‚úÖ' if disc['relative_lift'] > 0 else 'INVERTED ‚ùå'})")
    
    return df

df = calculate_region_risk(df)


CLASS 3: REGION RISK - STRONGEST PREDICTOR
‚úì Region Discrimination: +6.0% (CORRECT ‚úÖ)


In [23]:
# ========================================================================
# CLASS 4: SAFETY FEATURES - WITH INVERSION FIX
# ========================================================================
print("\n" + "="*80)
print("CLASS 4: SAFETY FEATURES - DETECTING & FIXING PARADOX")
print("="*80)

def calculate_safety_risk(df):
    """
    Safety with automatic paradox detection and correction
    """
    
    # Step 1: Actuarial (INVERSE: more safety = less risk)
    max_airbags = df['airbags'].max()
    actuarial_airbag = 1.0 - (df['airbags'] / max_airbags)
    actuarial_ncap = 1.0 - (df['ncap_rating'] / 5.0)
    actuarial_esc = df['is_esc'].map({0: 0.7, 1: 0.3})
    actuarial_brake = df['is_brake_assist'].map({0: 0.6, 1: 0.4})
    
    df['safety_actuarial'] = (
        0.40 * actuarial_airbag +
        0.40 * actuarial_ncap +
        0.15 * actuarial_esc +
        0.05 * actuarial_brake
    )
    
    # Step 2: Empirical (raw)
    airbag_rates = df.groupby('airbags')['claim_status'].mean()
    ncap_rates = df.groupby('ncap_rating')['claim_status'].mean()
    
    df['safety_empirical_raw'] = normalize_score(
        0.50 * df['airbags'].map(airbag_rates) +
        0.50 * df['ncap_rating'].map(ncap_rates)
    )
    
    # Step 3: Check for paradox (positive correlation = more safety ‚Üí more claims)
    airbag_corr = df['airbags'].corr(df['claim_status'])
    ncap_corr = df['ncap_rating'].corr(df['claim_status'])
    
    print(f"üìä Safety Paradox Check:")
    print(f"   Airbags correlation: {airbag_corr:+.4f}")
    print(f"   NCAP correlation:    {ncap_corr:+.4f}")
    
    if airbag_corr > 0.01 or ncap_corr > 0.01:
        print(f"   ‚ö†Ô∏è  PARADOX DETECTED: More safety ‚Üí more claims (wrong!)")
        print(f"   ‚Üí Likely due to: risk compensation, exposure bias")
        print(f"   ‚Üí Using pure actuarial (0% empirical)")
        
        # Use pure actuarial
        df['safety_score'] = normalize_score(df['safety_actuarial'])
    else:
        print(f"   ‚úÖ NO PARADOX: More safety ‚Üí fewer claims")
        df['safety_score'] = (
            0.30 * df['safety_empirical_raw'] +
            0.70 * df['safety_actuarial']
        )
        df['safety_score'] = normalize_score(df['safety_score'])
    
    disc = calculate_discrimination(df, 'safety_score')
    print(f"\n‚úì Safety Discrimination (after fix): {disc['relative_lift']:+.1%} "
          f"({'CORRECT ‚úÖ' if disc['relative_lift'] > 0 else 'STILL INVERTED ‚ùå'})")
    
    return df

df = calculate_safety_risk(df)


CLASS 4: SAFETY FEATURES - DETECTING & FIXING PARADOX
üìä Safety Paradox Check:
   Airbags correlation: +0.0028
   NCAP correlation:    +0.0038
   ‚úÖ NO PARADOX: More safety ‚Üí fewer claims

‚úì Safety Discrimination (after fix): -0.6% (STILL INVERTED ‚ùå)


In [25]:
# ========================================================================
# OPTIMIZED WEIGHTS (POST-CORRECTION)
# ========================================================================
print("\n" + "="*80)
print("OPTIMIZING WEIGHTS (AFTER INVERSION FIXES)")
print("="*80)

def optimize_weights_post_correction(df):
    """Re-optimize after fixing inversions"""
    
    X = df[['driver_risk_score', 'vehicle_risk_score', 
            'region_risk_score', 'safety_score']]
    y = df['claim_status']
    
    lr = LogisticRegression(random_state=42, max_iter=1000)
    lr.fit(X, y)
    
    coefs = lr.coef_[0]
    
    print("\nüìä Logistic Regression Coefficients (Post-Fix):")
    for feature, coef in zip(X.columns, coefs):
        sign = '‚úÖ' if coef > 0 else '‚ùå'
        print(f"  {feature:20s}: {coef:+.4f} {sign}")
    
    # Check if all are positive now
    all_positive = all(c > 0 for c in coefs)
    
    if all_positive:
        print("\n‚úÖ All components now predict in CORRECT direction!")
        
        # Normalize coefficients to weights
        abs_coefs = np.abs(coefs)
        data_driven_weights = abs_coefs / abs_coefs.sum()
        
        # Use data-driven weights since corrections worked
        final_weights = {}
        for feature, weight in zip(X.columns, data_driven_weights):
            final_weights[feature] = weight
            
        print("\nüéØ Final Weights (100% data-driven after fixes):")
    else:
        print("\n‚ö†Ô∏è  Some components still have negative coefficients")
        print("   Using manual fallback weights")
        
        final_weights = {
            'driver_risk_score': 0.25,
            'vehicle_risk_score': 0.20,
            'region_risk_score': 0.40,
            'safety_score': 0.15
        }
        
        print("\nüéØ Final Weights (manual fallback):")
    
    for feature, weight in final_weights.items():
        print(f"  {feature:20s}: {weight:.3f} ({weight*100:.1f}%)")
    
    return final_weights

final_weights = optimize_weights_post_correction(df)



OPTIMIZING WEIGHTS (AFTER INVERSION FIXES)


ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [14]:
# ========================================================================
# FINAL COMPOSITE RISK SCORE
# ========================================================================
print("\n" + "="*80)
print("CALCULATING FINAL COMPOSITE RISK SCORE")
print("="*80)

df['overall_risk_score'] = (
    final_weights['driver_risk_score'] * df['driver_risk_score'] +
    final_weights['vehicle_risk_score'] * df['vehicle_risk_score'] +
    final_weights['region_risk_score'] * df['region_risk_score'] +
    final_weights['safety_score'] * df['safety_score']
)

df['overall_risk_score'] = normalize_score(df['overall_risk_score'])


CALCULATING FINAL COMPOSITE RISK SCORE


In [15]:
# ========================================================================
# REVISED RISK CATEGORIES (BALANCED DISTRIBUTION)
# ========================================================================
print("\nüìä Creating Balanced Risk Categories...")

# Use quantiles for even distribution
df['risk_category'] = pd.qcut(
    df['overall_risk_score'],
    q=[0, 0.25, 0.50, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    duplicates='drop'
)

print("\n‚úì Risk Distribution:")
print(df['risk_category'].value_counts().sort_index())


üìä Creating Balanced Risk Categories...

‚úì Risk Distribution:
risk_category
LOW          14648
MODERATE     14774
HIGH         14527
VERY HIGH    14643
Name: count, dtype: int64


In [16]:
# ========================================================================
# COMPREHENSIVE VALIDATION METRICS
# ========================================================================
print("\n" + "="*80)
print("MODEL VALIDATION - PRODUCTION READINESS CHECK")
print("="*80)

def validate_model(df):
    """Comprehensive validation with industry standards"""
    
    y_true = df['claim_status']
    y_score = df['overall_risk_score']
    
    # 1. ROC-AUC Score
    auc = roc_auc_score(y_true, y_score)
    print(f"\nüìà ROC-AUC Score: {auc:.4f}")
    print(f"   Target: ‚â• 0.65 {'‚úÖ PASS' if auc >= 0.65 else '‚ùå FAIL'}")
    
    # 2. Gini Coefficient
    gini = 2 * auc - 1
    print(f"\nüìä Gini Coefficient: {gini:.4f}")
    print(f"   Target: ‚â• 0.30 {'‚úÖ PASS' if gini >= 0.30 else '‚ùå FAIL'}")
    
    # 3. Discrimination
    disc = calculate_discrimination(df, 'overall_risk_score')
    print(f"\nüéØ Discrimination Index:")
    print(f"   Claims avg:     {disc['claims_avg']:.4f}")
    print(f"   No-claims avg:  {disc['no_claims_avg']:.4f}")
    print(f"   Separation:     {disc['separation']:.4f} ({disc['relative_lift']:.1%})")
    print(f"   Target: ‚â• 15% {'‚úÖ PASS' if disc['relative_lift'] >= 0.15 else '‚ùå FAIL'}")
    
    # 4. Lift Analysis
    print(f"\nüìä Lift Analysis (Top Percentiles):")
    base_rate = y_true.mean()
    
    for pct in [10, 20, 30]:
        threshold = df['overall_risk_score'].quantile(1 - pct/100)
        high_risk = df[df['overall_risk_score'] >= threshold]
        lift = high_risk['claim_status'].mean() / base_rate
        print(f"   Top {pct:2d}%: {lift:.2f}x lift {'‚úÖ' if lift > 1.5 else '‚ö†Ô∏è'}")
    
    # 5. Component Discrimination
    print(f"\nüîç Component-Level Discrimination:")
    for component in ['driver_risk_score', 'vehicle_risk_score', 
                      'region_risk_score', 'safety_score']:
        comp_disc = calculate_discrimination(df, component)
        status = '‚úÖ' if comp_disc['relative_lift'] > 0.05 else '‚ö†Ô∏è'
        print(f"   {component:20s}: {comp_disc['relative_lift']:+6.1%} {status}")
    
    # 6. Summary
    print("\n" + "="*80)
    passed = (auc >= 0.65 and gini >= 0.30 and disc['relative_lift'] >= 0.15)
    print(f"{'‚úÖ MODEL READY FOR DEPLOYMENT' if passed else '‚ö†Ô∏è  MODEL NEEDS IMPROVEMENT'}")
    print("="*80)
    
    return {
        'auc': auc,
        'gini': gini,
        'discrimination': disc['relative_lift'],
        'passed': passed
    }

validation_results = validate_model(df)


MODEL VALIDATION - PRODUCTION READINESS CHECK

üìà ROC-AUC Score: 0.5294
   Target: ‚â• 0.65 ‚ùå FAIL

üìä Gini Coefficient: 0.0588
   Target: ‚â• 0.30 ‚ùå FAIL

üéØ Discrimination Index:
   Claims avg:     0.4621
   No-claims avg:  0.4491
   Separation:     0.0130 (2.9%)
   Target: ‚â• 15% ‚ùå FAIL

üìä Lift Analysis (Top Percentiles):
   Top 10%: 1.11x lift ‚ö†Ô∏è
   Top 20%: 1.09x lift ‚ö†Ô∏è
   Top 30%: 1.08x lift ‚ö†Ô∏è

üîç Component-Level Discrimination:
   driver_risk_score   :  +9.6% ‚úÖ
   vehicle_risk_score  :  -8.5% ‚ö†Ô∏è
   region_risk_score   :  +6.0% ‚úÖ
   safety_score        :  -0.9% ‚ö†Ô∏è

‚ö†Ô∏è  MODEL NEEDS IMPROVEMENT


In [32]:
print("\n" + "="*70)
print("STEP 2: CORRECTED FEATURE ENGINEERING V2 (FIXES INVERSIONS)")
print("="*70)

# ========================================================================
# APPROACH: Calculate risk from WITHIN-GROUP claim rates, not exposure
# ========================================================================

print("\nüéØ NEW APPROACH: Group-based claim rates (exposure-neutral)")
print("   This avoids the subscription length contamination problem")

# ========================================================================
# 2.1 DRIVER RISK - Based on age groups and their actual claim rates
# ========================================================================
print("\nüìä Calculating DRIVER risk score...")

# Create age bins and calculate ACTUAL claim rate per bin
df['age_bin'] = pd.cut(
    df['customer_age'], 
    bins=[0, 25, 30, 35, 40, 45, 50, 55, 65, 100],
    labels=['<25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-65', '65+']
)

# Calculate claim rate per age group
age_claim_rates = df.groupby('age_bin', observed=True)['claim_status'].mean()
print("\n   Age Group Claim Rates:")
print(age_claim_rates)

# Map back to dataframe
df['driver_risk_score'] = df['age_bin'].map(age_claim_rates).astype(float)

# Normalize to 0-1
min_age_risk = df['driver_risk_score'].min()
max_age_risk = df['driver_risk_score'].max()
df['driver_risk_score'] = (df['driver_risk_score'] - min_age_risk) / (max_age_risk - min_age_risk)

print(f"   ‚úì Driver risk: {df['driver_risk_score'].min():.3f} to {df['driver_risk_score'].max():.3f}")

# ========================================================================
# 2.2 VEHICLE RISK - Age + Segment + Fuel Type
# ========================================================================
print("\nüìä Calculating VEHICLE risk score...")

# Vehicle age bins
df['vehicle_age_bin'] = pd.cut(
    df['vehicle_age'],
    bins=[0, 1, 3, 5, 7, 100],
    labels=['0-1yr', '1-3yr', '3-5yr', '5-7yr', '7+yr']
)

vehicle_age_rates = df.groupby('vehicle_age_bin', observed=True)['claim_status'].mean()
print("\n   Vehicle Age Claim Rates:")
print(vehicle_age_rates)

# Segment claim rates
segment_rates = df.groupby('segment')['claim_status'].mean()
print("\n   Segment Claim Rates:")
print(segment_rates)

# Fuel type claim rates
fuel_rates = df.groupby('fuel_type')['claim_status'].mean()
print("\n   Fuel Type Claim Rates:")
print(fuel_rates)

# Combine vehicle features (equal weighting)
# Convert to float to avoid Categorical type issues
df['vehicle_age_risk'] = df['vehicle_age_bin'].map(vehicle_age_rates).astype(float)
df['segment_risk'] = df['segment'].map(segment_rates).astype(float)
df['fuel_risk'] = df['fuel_type'].map(fuel_rates).astype(float)

# Composite vehicle risk
df['vehicle_risk_score'] = (
    0.50 * df['vehicle_age_risk'] + 
    0.30 * df['segment_risk'] + 
    0.20 * df['fuel_risk']
)

# CRITICAL FIX: Apply domain knowledge adjustments for vehicle age
# The data shows inverted pattern due to sample size issues
# Apply manual corrections based on insurance industry standards

print("\n   ‚ö†Ô∏è  Detected inverted vehicle age pattern. Applying corrections...")

# Create age-based adjustment factor (higher for older vehicles)
age_correction = pd.Series(0.0, index=df.index)
age_correction[df['vehicle_age'] < 1] = -0.05   # Newest: reduce risk slightly
age_correction[df['vehicle_age'] >= 1] = 0.00   # 1-3 years: baseline
age_correction[df['vehicle_age'] >= 3] = 0.10   # 3-5 years: increase risk
age_correction[df['vehicle_age'] >= 5] = 0.25   # 5-7 years: higher risk
age_correction[df['vehicle_age'] >= 7] = 0.40   # 7+ years: highest risk

# Apply correction
df['vehicle_risk_score'] = df['vehicle_risk_score'] + age_correction

# Normalize after correction
min_veh = df['vehicle_risk_score'].min()
max_veh = df['vehicle_risk_score'].max()
df['vehicle_risk_score'] = (df['vehicle_risk_score'] - min_veh) / (max_veh - min_veh)

print(f"   ‚úì Vehicle risk (corrected): {df['vehicle_risk_score'].min():.3f} to {df['vehicle_risk_score'].max():.3f}")

# ========================================================================
# 2.3 REGION RISK - Geographic claim patterns
# ========================================================================
print("\nüìä Calculating REGION risk score...")

# Region-specific claim rates
region_rates = df.groupby('region_code')['claim_status'].mean()
print("\n   Top 5 Riskiest Regions:")
print(region_rates.sort_values(ascending=False).head())

# Density bins
df['density_bin'] = pd.qcut(df['region_density'], q=4, labels=['Rural', 'Suburban', 'Urban', 'Dense Urban'], duplicates='drop')
density_rates = df.groupby('density_bin', observed=True)['claim_status'].mean()

# Combine region + density
df['region_specific_risk'] = df['region_code'].map(region_rates).astype(float)
df['density_risk'] = df['density_bin'].map(density_rates).astype(float)

df['region_risk_score'] = 0.7 * df['region_specific_risk'] + 0.3 * df['density_risk']

# Normalize
min_reg = df['region_risk_score'].min()
max_reg = df['region_risk_score'].max()
df['region_risk_score'] = (df['region_risk_score'] - min_reg) / (max_reg - min_reg)

print(f"   ‚úì Region risk: {df['region_risk_score'].min():.3f} to {df['region_risk_score'].max():.3f}")

# ========================================================================
# 2.4 SAFETY RISK - Airbags + NCAP + Features
# ========================================================================
print("\nüìä Calculating SAFETY risk score...")

# Airbag bins
df['airbag_bin'] = pd.cut(df['airbags'], bins=[0, 2, 4, 6], labels=['1-2', '3-4', '5-6'], include_lowest=True)
airbag_rates = df.groupby('airbag_bin', observed=True)['claim_status'].mean()

# NCAP ratings
ncap_rates = df.groupby('ncap_rating')['claim_status'].mean()

# ESC (critical safety feature)
esc_rates = df.groupby('is_esc')['claim_status'].mean()

# Brake assist
brake_rates = df.groupby('is_brake_assist')['claim_status'].mean()

print("\n   Safety Feature Claim Rates:")
print(f"   Airbags: {airbag_rates.to_dict()}")
print(f"   NCAP: {ncap_rates.to_dict()}")
print(f"   ESC: {esc_rates.to_dict()}")
print(f"   Brake Assist: {brake_rates.to_dict()}")

# Map to dataframe (convert to float)
df['airbag_risk'] = df['airbag_bin'].map(airbag_rates).astype(float)
df['ncap_risk'] = df['ncap_rating'].map(ncap_rates).astype(float)
df['esc_risk'] = df['is_esc'].map(esc_rates).astype(float)
df['brake_risk'] = df['is_brake_assist'].map(brake_rates).astype(float)

# Composite safety risk (weighted by importance)
df['safety_score'] = (
    0.35 * df['airbag_risk'] +
    0.35 * df['ncap_risk'] +
    0.20 * df['esc_risk'] +
    0.10 * df['brake_risk']
)

# Normalize
min_safety = df['safety_score'].min()
max_safety = df['safety_score'].max()
df['safety_score'] = (df['safety_score'] - min_safety) / (max_safety - min_safety)

print(f"   ‚úì Safety risk: {df['safety_score'].min():.3f} to {df['safety_score'].max():.3f}")

# ========================================================================
# 2.5 CALCULATE FEATURE CORRELATIONS (for weighting)
# ========================================================================
print(f"\nüìä Calculating feature correlations with claims...")

correlations = {
    'driver': abs(df['driver_risk_score'].corr(df['claim_status'])),
    'vehicle': abs(df['vehicle_risk_score'].corr(df['claim_status'])),
    'region': abs(df['region_risk_score'].corr(df['claim_status'])),
    'safety': abs(df['safety_score'].corr(df['claim_status']))
}

print("\n   Raw Correlations:")
for feature, corr in sorted(correlations.items(), key=lambda x: x[1], reverse=True):
    print(f"      {feature:12s}: {corr:.4f}")

# If correlations are too weak, use insurance industry standards
if max(correlations.values()) < 0.10:
    print(f"\n   ‚ö†Ô∏è  Correlations are weak. Using insurance industry weights:")
    weights = {
        'driver': 0.30,
        'vehicle': 0.30, 
        'region': 0.20,
        'safety': 0.20
    }
else:
    # Normalize to weights
    total_corr = sum(correlations.values())
    weights = {k: v/total_corr for k, v in correlations.items()}

print("\n   üéØ Final Weights:")
for feature, weight in sorted(weights.items(), key=lambda x: x[1], reverse=True):
    print(f"      {feature:12s}: {weight:.3f}")

# ========================================================================
# 2.6 CREATE OVERALL RISK SCORE
# ========================================================================
df['overall_risk_score'] = (
    weights['driver'] * df['driver_risk_score'] +
    weights['vehicle'] * df['vehicle_risk_score'] +
    weights['region'] * df['region_risk_score'] +
    weights['safety'] * df['safety_score']
)

print(f"\n‚úì Overall risk range: {df['overall_risk_score'].min():.3f} to {df['overall_risk_score'].max():.3f}")

# ========================================================================
# 2.7 CREATE RISK CATEGORIES
# ========================================================================
df['risk_category'] = pd.cut(
    df['overall_risk_score'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    include_lowest=True
)

print(f"\nüìä Risk Category Distribution:")
print(df['risk_category'].value_counts().sort_index())

# ========================================================================
# 2.8 EXPOSURE FACTOR (separate from risk)
# ========================================================================
df['exposure_factor'] = df['subscription_length'] / df['subscription_length'].max()

print(f"\n‚úì Exposure factor created (for premium calculation only)")



STEP 2: CORRECTED FEATURE ENGINEERING V2 (FIXES INVERSIONS)

üéØ NEW APPROACH: Group-based claim rates (exposure-neutral)
   This avoids the subscription length contamination problem

üìä Calculating DRIVER risk score...

   Age Group Claim Rates:
age_bin
30-35    0.059003
35-40    0.056685
40-45    0.066298
45-50    0.066154
50-55    0.066595
55-65    0.073724
65+      0.125749
Name: claim_status, dtype: float64
   ‚úì Driver risk: 0.000 to 1.000

üìä Calculating VEHICLE risk score...

   Vehicle Age Claim Rates:
vehicle_age_bin
0-1yr    0.058688
1-3yr    0.063451
3-5yr    0.044960
5-7yr    0.037037
7+yr     0.000000
Name: claim_status, dtype: float64

   Segment Claim Rates:
segment
A          0.060389
B1         0.058471
B2         0.068581
C1         0.064099
C2         0.064275
Utility    0.060380
Name: claim_status, dtype: float64

   Fuel Type Claim Rates:
fuel_type
CNG       0.060748
Diesel    0.064862
Petrol    0.066384
Name: claim_status, dtype: float64

   ‚ö†Ô∏è  Detect

In [33]:

# ========================================================================
# STEP 3: ENHANCED VALIDATION
# ========================================================================
print("\n" + "="*70)
print("STEP 3: COMPREHENSIVE VALIDATION")
print("="*70)

claim_mask = df['claim_status'] == 1
no_claim_mask = df['claim_status'] == 0

# Overall validation
print(f"\n‚úÖ OVERALL RISK SCORE VALIDATION:")
claim_risk = df[claim_mask]['overall_risk_score'].mean()
no_claim_risk = df[no_claim_mask]['overall_risk_score'].mean()
difference = claim_risk - no_claim_risk
pct_diff = (difference / no_claim_risk) * 100

print(f"   Claims avg:      {claim_risk:.4f}")
print(f"   No-claims avg:   {no_claim_risk:.4f}")
print(f"   Difference:      {difference:+.4f} ({pct_diff:+.1f}%)")

if difference > 0.05:
    print(f"   ‚úÖ EXCELLENT: Strong discrimination")
elif difference > 0.02:
    print(f"   ‚úÖ GOOD: Acceptable discrimination")
elif difference > 0:
    print(f"   ‚ö†Ô∏è  WEAK: Poor discrimination")
else:
    print(f"   ‚ùå ERROR: Inverted scores!")

# Component validation
print(f"\nüìä Component Validation:")
components = ['driver_risk_score', 'vehicle_risk_score', 'region_risk_score', 'safety_score']

for comp in components:
    claim_avg = df[claim_mask][comp].mean()
    no_claim_avg = df[no_claim_mask][comp].mean()
    diff = claim_avg - no_claim_avg
    pct = (diff / no_claim_avg) * 100 if no_claim_avg > 0 else 0
    
    status = '‚úÖ' if diff > 0 else '‚ùå'
    print(f"   {comp:25s}: {diff:+.4f} ({pct:+.1f}%) {status}")

# Domain knowledge checks
print(f"\nüîç Domain Knowledge Checks:")

# Check 1: Young vs Mature drivers
young_mask = df['customer_age'] < 30
mature_mask = (df['customer_age'] >= 35) & (df['customer_age'] <= 50)

if young_mask.sum() > 0:
    young_risk = df[young_mask]['overall_risk_score'].mean()
    mature_risk = df[mature_mask]['overall_risk_score'].mean()
    diff = young_risk - mature_risk
    status = '‚úÖ' if diff > 0 else '‚ö†Ô∏è'
    print(f"   Young (<30) vs Mature (35-50): {diff:+.3f} {status}")
else:
    print(f"   Young drivers: No data (age range 35-75)")

# Check 2: Old vs New vehicles
old_veh = df[df['vehicle_age'] >= 5]['overall_risk_score'].mean()
new_veh = df[df['vehicle_age'] < 3]['overall_risk_score'].mean()
diff = old_veh - new_veh
status = '‚úÖ' if diff > 0 else '‚ùå'
print(f"   Old (5+yr) vs New (<3yr) vehicles: {diff:+.3f} {status}")

# Check 3: Low vs High safety
low_safety = df[df['airbags'] <= 2]['overall_risk_score'].mean()
high_safety = df[df['airbags'] >= 4]['overall_risk_score'].mean()
diff = low_safety - high_safety
status = '‚úÖ' if diff > 0 else '‚ùå'
print(f"   Low (‚â§2) vs High (‚â•4) airbags: {diff:+.3f} {status}")



STEP 3: COMPREHENSIVE VALIDATION

‚úÖ OVERALL RISK SCORE VALIDATION:
   Claims avg:      0.2613
   No-claims avg:   0.2417
   Difference:      +0.0195 (+8.1%)
   ‚ö†Ô∏è  WEAK: Poor discrimination

üìä Component Validation:
   driver_risk_score        : +0.0093 (+8.9%) ‚úÖ
   vehicle_risk_score       : -0.0053 (-5.0%) ‚ùå
   region_risk_score        : +0.0272 (+6.5%) ‚úÖ
   safety_score             : +0.0104 (+2.3%) ‚úÖ

üîç Domain Knowledge Checks:
   Young drivers: No data (age range 35-75)
   Old (5+yr) vs New (<3yr) vehicles: +0.240 ‚úÖ
   Low (‚â§2) vs High (‚â•4) airbags: -0.108 ‚ùå


In [34]:
# ========================================================================
# STEP 4: STRATIFIED SPLITTING
# ========================================================================
print("\n" + "="*70)
print("STEP 4: STRATIFIED DATA SPLITTING")
print("="*70)

train_df, temp_df = train_test_split(
    df,
    test_size=0.30,
    stratify=df['claim_status'],
    random_state=42
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df['claim_status'],
    random_state=42
)

print(f"\n‚úì Split Sizes:")
print(f"   Train: {len(train_df):,} ({len(train_df)/len(df)*100:.1f}%) - {(train_df['claim_status']==1).mean()*100:.2f}% claims")
print(f"   Val:   {len(val_df):,} ({len(val_df)/len(df)*100:.1f}%) - {(val_df['claim_status']==1).mean()*100:.2f}% claims")
print(f"   Test:  {len(test_df):,} ({len(test_df)/len(df)*100:.1f}%) - {(test_df['claim_status']==1).mean()*100:.2f}% claims")

# Validate splits
print(f"\nüìä Split Quality Check:")
for split_name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    claim_r = split_df[split_df['claim_status']==1]['overall_risk_score'].mean()
    no_claim_r = split_df[split_df['claim_status']==0]['overall_risk_score'].mean()
    diff = claim_r - no_claim_r
    pct = (diff / no_claim_r) * 100 if no_claim_r > 0 else 0
    status = '‚úÖ' if diff > 0.02 else '‚ö†Ô∏è' if diff > 0 else '‚ùå'
    print(f"   {split_name:5s}: Œî = {diff:+.4f} ({pct:+.1f}%) {status}")



STEP 4: STRATIFIED DATA SPLITTING

‚úì Split Sizes:
   Train: 41,014 (70.0%) - 6.40% claims
   Val:   8,789 (15.0%) - 6.39% claims
   Test:  8,789 (15.0%) - 6.39% claims

üìä Split Quality Check:
   Train: Œî = +0.0184 (+7.6%) ‚ö†Ô∏è
   Val  : Œî = +0.0218 (+9.0%) ‚úÖ
   Test : Œî = +0.0225 (+9.3%) ‚úÖ


In [35]:
# ========================================================================
# FINAL SANITY CHECKS
# ========================================================================
print("\n" + "="*70)
print("FINAL SANITY CHECKS")
print("="*70)

checks_passed = 0
checks_total = 5

# Check 1: Overall discrimination
if difference > 0.01:
    print("‚úÖ Overall discrimination > 1%")
    checks_passed += 1
else:
    print("‚ùå Overall discrimination too weak")

# Check 2: All components positive
all_positive = all([
    df[claim_mask][comp].mean() > df[no_claim_mask][comp].mean() 
    for comp in components
])
if all_positive:
    print("‚úÖ All risk components show positive correlation")
    checks_passed += 1
else:
    print("‚ö†Ô∏è  Some components show negative correlation")

# Check 3: Old vehicles > new vehicles
if old_veh > new_veh:
    print("‚úÖ Old vehicles have higher risk than new")
    checks_passed += 1
else:
    print("‚ùå Vehicle age pattern inverted")

# Check 4: Low safety > high safety
if low_safety > high_safety:
    print("‚úÖ Low safety vehicles have higher risk")
    checks_passed += 1
else:
    print("‚ùå Safety pattern inverted")

# Check 5: Test split valid
test_claim_r = test_df[test_df['claim_status']==1]['overall_risk_score'].mean()
test_no_claim_r = test_df[test_df['claim_status']==0]['overall_risk_score'].mean()
if test_claim_r > test_no_claim_r:
    print("‚úÖ Test set maintains correct pattern")
    checks_passed += 1
else:
    print("‚ùå Test set pattern inverted")

print(f"\n{'='*70}")
print(f"CHECKS PASSED: {checks_passed}/{checks_total}")
if checks_passed == checks_total:
    print("üéâ ALL CHECKS PASSED - Ready for text generation!")
elif checks_passed >= 3:
    print("‚ö†Ô∏è  SOME ISSUES - Review before proceeding")
else:
    print("‚ùå CRITICAL ISSUES - Do not proceed!")
print(f"{'='*70}")


FINAL SANITY CHECKS
‚úÖ Overall discrimination > 1%
‚ö†Ô∏è  Some components show negative correlation
‚úÖ Old vehicles have higher risk than new
‚ùå Safety pattern inverted
‚úÖ Test set maintains correct pattern

CHECKS PASSED: 3/5
‚ö†Ô∏è  SOME ISSUES - Review before proceeding


In [36]:
# ========================================================================
# HYBRID RISK ENGINEERING - COMBINING ACTUARIAL PRIORS WITH EMPIRICAL DATA
# ========================================================================
# Strategy: Use actuarial principles as baseline, adjust with local data
# where signal is strong enough to override domain knowledge
# ========================================================================

print("\n" + "="*70)
print("HYBRID RISK ENGINEERING v3.0")
print("Combining Actuarial Science with Empirical Evidence")
print("="*70)

# ========================================================================
# CONFIGURATION: Adjust these based on data quality
# ========================================================================
# CONFIG = {
#     'empirical_weight': 0.30,      # How much to trust your data (30%)
#     'actuarial_weight': 0.70,      # How much to trust principles (70%)
#     'min_samples_override': 100,   # Min samples to override actuarial
#     'min_correlation': 0.05,       # Min correlation to trust empirical
#     'confidence_level': 0.95       # Statistical confidence threshold
# }
CONFIG_CONSERVATIVE = {
    'name': 'Conservative Actuarial-Heavy',
    'description': 'Minimize inversions, maximize regulatory defensibility',
    
    # Global settings
    'empirical_weight': 0.20,      # Lower trust in empirical (20%)
    'actuarial_weight': 0.80,      # Higher trust in actuarial (80%)
    
    # Component-specific overrides
    'component_strategies': {
        'driver_age': {
            'empirical_weight': 0.25,  # Slight empirical (good data)
            'actuarial_weight': 0.75,
            'reason': 'High confidence but no young drivers in data'
        },
        'vehicle_age': {
            'empirical_weight': 0.00,  # Pure actuarial (inverted)
            'actuarial_weight': 1.00,
            'reason': 'Empirical contradicts - full actuarial override'
        },
        'region': {
            'empirical_weight': 1.00,  # Pure empirical (strong signal)
            'actuarial_weight': 0.00,
            'reason': 'Strongest predictor - trust local data completely'
        },
        'safety': {
            'empirical_weight': 0.10,  # Minimal empirical (paradox)
            'actuarial_weight': 0.90,
            'reason': 'Safety paradox detected - actuarial override'
        }
    },
    
    # Component weights in final score
    'final_weights': {
        'driver': 0.25,   # Reduced (data coverage issues)
        'vehicle': 0.30,  # Maintained (actuarial solid)
        'region': 0.30,   # Increased (strongest signal)
        'safety': 0.15    # Reduced (weak signal)
    }
}

print("\nüìä Configuration:")
print(f"   Empirical weight: {CONFIG_CONSERVATIVE['empirical_weight']:.0%}")
print(f"   Actuarial weight: {CONFIG_CONSERVATIVE['actuarial_weight']:.0%}")
#print(f"   Min samples for override: {CONFIG_CONSERVATIVE['min_samples_override']}")

# ========================================================================
# HELPER FUNCTIONS
# ========================================================================

def normalize_score(series):
    """Min-max normalization to [0, 1]"""
    min_val = series.min()
    max_val = series.max()
    if max_val == min_val:
        return pd.Series(0.5, index=series.index)
    return (series - min_val) / (max_val - min_val)

def calculate_confidence(group_data, claim_col='claim_status'):
    """Calculate statistical confidence in empirical estimates"""
    n = len(group_data)
    claim_rate = group_data[claim_col].mean()
    
    if n < 30:
        return 0.0  # Not enough data
    
    # Wilson score confidence interval
    z = 1.96  # 95% confidence
    denominator = 1 + z**2/n
    centre = (claim_rate + z**2/(2*n)) / denominator
    margin = z * np.sqrt(claim_rate*(1-claim_rate)/n + z**2/(4*n**2)) / denominator
    
    ci_width = 2 * margin
    confidence = max(0, 1 - ci_width)  # Narrower CI = higher confidence
    
    return confidence

def validate_empirical_pattern(empirical_scores, expected_direction, threshold=0.05):
    """
    Check if empirical pattern aligns with actuarial expectations
    expected_direction: 'increasing', 'decreasing', 'u_shaped'
    """
    if expected_direction == 'increasing':
        # Check monotonic increase
        return np.corrcoef(range(len(empirical_scores)), empirical_scores)[0,1] > threshold
    elif expected_direction == 'decreasing':
        return np.corrcoef(range(len(empirical_scores)), empirical_scores)[0,1] < -threshold
    elif expected_direction == 'u_shaped':
        # Check if middle values are lower
        if len(empirical_scores) < 3:
            return False
        middle_idx = len(empirical_scores) // 2
        return empirical_scores[middle_idx] < empirical_scores[0] and \
                empirical_scores[middle_idx] < empirical_scores[-1]
    return False



HYBRID RISK ENGINEERING v3.0
Combining Actuarial Science with Empirical Evidence

üìä Configuration:
   Empirical weight: 20%
   Actuarial weight: 80%


In [37]:
# ========================================================================
# CLASS 1: DRIVER AGE RISK
# ========================================================================
print("\n" + "="*70)
print("CLASS 1: DRIVER AGE RISK")
print("="*70)
print("Actuarial Prior: U-shaped curve (young & elderly = high risk)")
print("Empirical: Check if local data supports or contradicts this")

def calculate_driver_age_risk_hybrid(df):
    """
    Hybrid approach for driver age risk
    Actuarial: U-shaped curve peaking at <25 and >70
    Empirical: Local claim rates by age group
    """
    
    # -------------------- ACTUARIAL PRIOR --------------------
    print("\nüìö Calculating Actuarial Prior (Standard Industry Curve)...")
    
    # Standard actuarial risk curve
    age = df['customer_age'].values
    
    # U-shaped curve: high risk for young (<25) and elderly (>65)
    actuarial_age_risk = np.zeros(len(age))
    
    actuarial_age_risk[age < 25] = 0.90        # Very high risk
    actuarial_age_risk[(age >= 25) & (age < 30)] = 0.70
    actuarial_age_risk[(age >= 30) & (age < 35)] = 0.45
    actuarial_age_risk[(age >= 35) & (age < 45)] = 0.30  # Sweet spot
    actuarial_age_risk[(age >= 45) & (age < 55)] = 0.35
    actuarial_age_risk[(age >= 55) & (age < 65)] = 0.50
    actuarial_age_risk[(age >= 65) & (age < 70)] = 0.70
    actuarial_age_risk[age >= 70] = 0.85        # High risk
    
    df['driver_age_actuarial'] = actuarial_age_risk
    
    print(f"   ‚úì Actuarial baseline: {actuarial_age_risk.min():.2f} to {actuarial_age_risk.max():.2f}")
    
    # -------------------- EMPIRICAL DATA --------------------
    print("\nüìä Calculating Empirical Risk (Local Claim Rates)...")
    
    # Create age bins
    df['age_bin'] = pd.cut(
        df['customer_age'], 
        bins=[0, 25, 30, 35, 40, 45, 50, 55, 65, 100],
        labels=['<25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-65', '65+']
    )
    
    # Calculate empirical claim rates
    age_groups = df.groupby('age_bin', observed=True).agg({
        'claim_status': ['mean', 'count', 'std']
    }).round(4)
    
    print("\n   Age Group Statistics:")
    print(age_groups)
    
    # Map empirical rates
    age_claim_rates = df.groupby('age_bin', observed=True)['claim_status'].mean()
    df['driver_age_empirical'] = df['age_bin'].map(age_claim_rates).astype(float)
    
    # Normalize empirical scores
    df['driver_age_empirical'] = normalize_score(df['driver_age_empirical'])
    
    # -------------------- CONFIDENCE ASSESSMENT --------------------
    print("\nüîç Assessing Empirical Confidence...")
    
    # Calculate confidence for each age group
    age_confidence = {}
    for age_group in df['age_bin'].cat.categories:
        group_data = df[df['age_bin'] == age_group]
        confidence = calculate_confidence(group_data)
        age_confidence[age_group] = confidence
        
        status = "‚úÖ HIGH" if confidence > 0.7 else "‚ö†Ô∏è MEDIUM" if confidence > 0.4 else "‚ùå LOW"
        print(f"   {age_group:10s}: n={len(group_data):5d}, confidence={confidence:.2f} {status}")
    
    # Overall confidence in empirical age data
    avg_confidence = np.mean(list(age_confidence.values()))
    
    # Check if pattern aligns with actuarial (should be U-shaped)
    empirical_pattern_valid = validate_empirical_pattern(
        age_claim_rates.values, 
        'u_shaped', 
        threshold=0.03
    )
    
    print(f"\n   Average confidence: {avg_confidence:.2f}")
    print(f"   Pattern validation: {'‚úÖ U-shaped confirmed' if empirical_pattern_valid else '‚ùå Pattern differs from actuarial'}")
    
    # -------------------- HYBRID COMBINATION --------------------
    print("\nüîÑ Creating Hybrid Score...")
    
    # Adjust weights based on confidence
    if avg_confidence > 0.7 and empirical_pattern_valid:
        # High confidence - trust empirical more
        empirical_wt = 0.50
        actuarial_wt = 0.50
        strategy = "BALANCED (high empirical confidence)"
    elif avg_confidence > 0.4:
        # Medium confidence - favor actuarial
        empirical_wt = 0.30
        actuarial_wt = 0.70
        strategy = "ACTUARIAL-LEANING (medium confidence)"
    else:
        # Low confidence - mostly actuarial
        empirical_wt = 0.15
        actuarial_wt = 0.85
        strategy = "ACTUARIAL-DOMINANT (low confidence)"
    
    print(f"   Strategy: {strategy}")
    print(f"   Weights: {empirical_wt:.0%} empirical + {actuarial_wt:.0%} actuarial")
    
    df['driver_risk_score'] = (
        empirical_wt * df['driver_age_empirical'] + 
        actuarial_wt * df['driver_age_actuarial']
    )
    
    df['driver_risk_score'] = normalize_score(df['driver_risk_score'])
    
    print(f"   ‚úì Final driver risk: {df['driver_risk_score'].min():.3f} to {df['driver_risk_score'].max():.3f}")
    
    return df

df = calculate_driver_age_risk_hybrid(df)


CLASS 1: DRIVER AGE RISK
Actuarial Prior: U-shaped curve (young & elderly = high risk)
Empirical: Check if local data supports or contradicts this

üìö Calculating Actuarial Prior (Standard Industry Curve)...
   ‚úì Actuarial baseline: 0.30 to 0.85

üìä Calculating Empirical Risk (Local Claim Rates)...

   Age Group Statistics:
        claim_status               
                mean  count     std
age_bin                            
30-35         0.0590   2949  0.2357
35-40         0.0567  16865  0.2312
40-45         0.0663  15008  0.2488
45-50         0.0662  12093  0.2486
50-55         0.0666   6532  0.2493
55-65         0.0737   4978  0.2613
65+           0.1257    167  0.3326

üîç Assessing Empirical Confidence...
   <25       : n=    0, confidence=0.00 ‚ùå LOW
   25-30     : n=    0, confidence=0.00 ‚ùå LOW
   30-35     : n= 2949, confidence=0.98 ‚úÖ HIGH
   35-40     : n=16865, confidence=0.99 ‚úÖ HIGH
   40-45     : n=15008, confidence=0.99 ‚úÖ HIGH
   45-50     : n=12093, 

In [38]:
# ========================================================================
# CLASS 2: VEHICLE AGE RISK
# ========================================================================
print("\n" + "="*70)
print("CLASS 2: VEHICLE AGE RISK")
print("="*70)
print("Actuarial Prior: Linear increase (older = riskier)")
print("Empirical: Often shows inverted pattern due to exposure bias")

def calculate_vehicle_age_risk_hybrid(df):
    """
    Hybrid approach for vehicle age
    Actuarial: Linear increase with age
    Empirical: Local data (but often contaminated)
    """
    
    # -------------------- ACTUARIAL PRIOR --------------------
    print("\nüìö Calculating Actuarial Prior (Linear Age Curve)...")
    
    vehicle_age = df['vehicle_age'].values
    
    # Linear increase: 0 years = 0.2, 15+ years = 1.0
    actuarial_vehicle_risk = np.clip(
        0.20 + (vehicle_age / 15.0) * 0.80,
        0.20, 1.0
    )
    
    df['vehicle_age_actuarial'] = actuarial_vehicle_risk
    
    print(f"   Age 0-1: Risk = {actuarial_vehicle_risk[vehicle_age <= 1].mean():.2f}")
    print(f"   Age 5-7: Risk = {actuarial_vehicle_risk[(vehicle_age >= 5) & (vehicle_age < 7)].mean():.2f}")
    print(f"   Age 10+: Risk = {actuarial_vehicle_risk[vehicle_age >= 10].mean():.2f}")
    
    # -------------------- EMPIRICAL DATA --------------------
    print("\nüìä Calculating Empirical Risk (Local Claim Rates)...")
    
    df['vehicle_age_bin'] = pd.cut(
        df['vehicle_age'],
        bins=[0, 1, 3, 5, 7, 10, 100],
        labels=['0-1yr', '1-3yr', '3-5yr', '5-7yr', '7-10yr', '10+yr']
    )
    
    vehicle_age_stats = df.groupby('vehicle_age_bin', observed=True).agg({
        'claim_status': ['mean', 'count', 'std']
    }).round(4)
    
    print("\n   Vehicle Age Statistics:")
    print(vehicle_age_stats)
    
    vehicle_age_rates = df.groupby('vehicle_age_bin', observed=True)['claim_status'].mean()
    df['vehicle_age_empirical'] = df['vehicle_age_bin'].map(vehicle_age_rates).astype(float)
    df['vehicle_age_empirical'] = normalize_score(df['vehicle_age_empirical'])
    
    # -------------------- PATTERN VALIDATION --------------------
    print("\nüîç Validating Empirical Pattern...")
    
    # Check if empirical follows increasing pattern
    pattern_valid = validate_empirical_pattern(
        vehicle_age_rates.values, 
        'increasing', 
        threshold=0.05
    )
    
    # Calculate correlation between empirical and actuarial
    correlation = np.corrcoef(
        df['vehicle_age_actuarial'], 
        df['vehicle_age_empirical']
    )[0, 1]
    
    print(f"   Pattern validation: {'‚úÖ Increasing trend' if pattern_valid else '‚ùå NOT increasing'}")
    print(f"   Correlation with actuarial: {correlation:.3f}")
    
    # -------------------- DECISION LOGIC --------------------
    if pattern_valid and correlation > 0.3:
        empirical_wt = 0.40
        actuarial_wt = 0.60
        strategy = "HYBRID (empirical aligns)"
    elif correlation > 0.1:
        empirical_wt = 0.20
        actuarial_wt = 0.80
        strategy = "ACTUARIAL-DOMINANT (weak alignment)"
    else:
        empirical_wt = 0.05
        actuarial_wt = 0.95
        strategy = "ACTUARIAL-ONLY (empirical contradicts)"
        print("   ‚ö†Ô∏è  WARNING: Empirical pattern contradicts actuarial principles")
        print("   ‚Üí Applying minimal empirical weight to avoid inversion")
    
    print(f"\nüîÑ Strategy: {strategy}")
    print(f"   Weights: {empirical_wt:.0%} empirical + {actuarial_wt:.0%} actuarial")
    
    df['vehicle_risk_score'] = (
        empirical_wt * df['vehicle_age_empirical'] + 
        actuarial_wt * df['vehicle_age_actuarial']
    )
    
    df['vehicle_risk_score'] = normalize_score(df['vehicle_risk_score'])
    
    print(f"   ‚úì Final vehicle risk: {df['vehicle_risk_score'].min():.3f} to {df['vehicle_risk_score'].max():.3f}")
    
    return df

df = calculate_vehicle_age_risk_hybrid(df)


CLASS 2: VEHICLE AGE RISK
Actuarial Prior: Linear increase (older = riskier)
Empirical: Often shows inverted pattern due to exposure bias

üìö Calculating Actuarial Prior (Linear Age Curve)...
   Age 0-1: Risk = 0.22
   Age 5-7: Risk = 0.50
   Age 10+: Risk = 0.98

üìä Calculating Empirical Risk (Local Claim Rates)...

   Vehicle Age Statistics:
                claim_status               
                        mean  count     std
vehicle_age_bin                            
0-1yr                 0.0587  23071  0.2350
1-3yr                 0.0635  25815  0.2438
3-5yr                 0.0450   4226  0.2072
5-7yr                 0.0370    189  0.1894
7-10yr                0.0000     28  0.0000
10+yr                 0.0000      6  0.0000

üîç Validating Empirical Pattern...
   Pattern validation: ‚ùå NOT increasing
   Correlation with actuarial: nan
   ‚Üí Applying minimal empirical weight to avoid inversion

üîÑ Strategy: ACTUARIAL-ONLY (empirical contradicts)
   Weights: 5% empirica

In [39]:
# ========================================================================
# CLASS 3: REGION RISK (HIGH CONFIDENCE - USE EMPIRICAL)
# ========================================================================
print("\n" + "="*70)
print("CLASS 3: REGION RISK")
print("="*70)
print("Actuarial Prior: None (region-specific)")
print("Empirical: STRONG SIGNAL - Trust local data")

def calculate_region_risk_hybrid(df):
    """
    Region risk - empirical data is usually reliable
    This is where your data shines!
    """
    
    print("\nüìä Calculating Region Risk (Empirical-Dominant)...")
    
    # Calculate region-specific claim rates
    region_stats = df.groupby('region_code').agg({
        'claim_status': ['mean', 'count', 'std']
    }).round(4)
    
    region_stats.columns = ['claim_rate', 'sample_size', 'std_dev']
    region_stats = region_stats.sort_values('claim_rate', ascending=False)
    
    print("\n   Top 10 Riskiest Regions:")
    print(region_stats.head(10))
    
    print("\n   Bottom 5 Safest Regions:")
    print(region_stats.tail(5))
    
    # Calculate confidence for each region
    region_confidence = {}
    for region in df['region_code'].unique():
        region_data = df[df['region_code'] == region]
        confidence = calculate_confidence(region_data)
        region_confidence[region] = confidence
    
    avg_confidence = np.mean(list(region_confidence.values()))
    print(f"\n   Average regional confidence: {avg_confidence:.2f}")
    
    # Region-specific risk
    region_rates = df.groupby('region_code')['claim_status'].mean()
    df['region_specific_risk'] = df['region_code'].map(region_rates).astype(float)
    
    # Population density (empirical)
    df['density_bin'] = pd.qcut(
        df['region_density'], 
        q=4, 
        labels=['Rural', 'Suburban', 'Urban', 'Dense Urban'], 
        duplicates='drop'
    )
    
    density_rates = df.groupby('density_bin', observed=True)['claim_status'].mean()
    df['density_risk'] = df['density_bin'].map(density_rates).astype(float)
    
    # Combine (region is more specific than density)
    df['region_risk_score'] = (
        0.75 * normalize_score(df['region_specific_risk']) +
        0.25 * normalize_score(df['density_risk'])
    )
    
    print(f"\n   ‚úì Region risk: {df['region_risk_score'].min():.3f} to {df['region_risk_score'].max():.3f}")
    print(f"   üìå Strategy: 100% EMPIRICAL (strong regional signal)")
    
    return df

df = calculate_region_risk_hybrid(df)


CLASS 3: REGION RISK
Actuarial Prior: None (region-specific)
Empirical: STRONG SIGNAL - Trust local data

üìä Calculating Region Risk (Empirical-Dominant)...

   Top 10 Riskiest Regions:
             claim_rate  sample_size  std_dev
region_code                                  
C18              0.1074          242   0.3103
C22              0.0821          207   0.2752
C14              0.0768         3660   0.2663
C4               0.0767          665   0.2663
C21              0.0765          379   0.2662
C19              0.0746          952   0.2629
C3               0.0710         6101   0.2568
C2               0.0708         7342   0.2566
C8               0.0699        13654   0.2549
C6               0.0618          890   0.2409

   Bottom 5 Safest Regions:
             claim_rate  sample_size  std_dev
region_code                                  
C9               0.0497         2734   0.2175
C15              0.0493          771   0.2166
C10              0.0469         3155   0.2115


In [40]:
# ========================================================================
# CLASS 4: SAFETY FEATURES RISK
# ========================================================================
print("\n" + "="*70)
print("CLASS 4: SAFETY FEATURES RISK")
print("="*70)
print("Actuarial Prior: INVERSE relationship (more safety = less risk)")
print("Empirical: Often shows positive correlation (paradox)")

def calculate_safety_risk_hybrid(df):
    """
    Safety features - actuarial prior is strong
    More airbags/safety = lower risk (inverse relationship)
    """
    
    # -------------------- ACTUARIAL PRIOR --------------------
    print("\nüìö Calculating Actuarial Prior (Inverse Relationship)...")
    
    # More airbags = lower risk (inverse)
    max_airbags = df['airbags'].max()
    actuarial_airbag = 1.0 - (df['airbags'] / max_airbags)
    
    # Lower NCAP = higher risk (inverse)
    actuarial_ncap = 1.0 - (df['ncap_rating'] / 5.0)
    
    # No ESC = higher risk
    actuarial_esc = df['is_esc'].map({0: 0.7, 1: 0.3})
    
    # No brake assist = higher risk
    actuarial_brake = df['is_brake_assist'].map({0: 0.6, 1: 0.4})
    
    # Composite actuarial safety risk
    df['safety_actuarial'] = (
        0.35 * actuarial_airbag +
        0.35 * actuarial_ncap +
        0.20 * actuarial_esc +
        0.10 * actuarial_brake
    )
    
    print(f"   ‚úì Actuarial safety risk: {df['safety_actuarial'].min():.2f} to {df['safety_actuarial'].max():.2f}")
    print(f"   ‚Üí LOW airbags + LOW NCAP = HIGH risk")
    
    # -------------------- EMPIRICAL DATA --------------------
    print("\nüìä Calculating Empirical Risk...")
    
    # Airbags
    df['airbag_bin'] = pd.cut(
        df['airbags'], 
        bins=[0, 2, 4, 6], 
        labels=['1-2', '3-4', '5-6'], 
        include_lowest=True
    )
    
    airbag_stats = df.groupby('airbag_bin', observed=True).agg({
        'claim_status': ['mean', 'count']
    })
    print("\n   Airbag Statistics:")
    print(airbag_stats)
    
    # NCAP
    ncap_stats = df.groupby('ncap_rating').agg({
        'claim_status': ['mean', 'count']
    })
    print("\n   NCAP Statistics:")
    print(ncap_stats)
    
    # Map empirical rates
    airbag_rates = df.groupby('airbag_bin', observed=True)['claim_status'].mean()
    ncap_rates = df.groupby('ncap_rating')['claim_status'].mean()
    esc_rates = df.groupby('is_esc')['claim_status'].mean()
    brake_rates = df.groupby('is_brake_assist')['claim_status'].mean()
    
    df['airbag_empirical'] = df['airbag_bin'].map(airbag_rates).astype(float)
    df['ncap_empirical'] = df['ncap_rating'].map(ncap_rates).astype(float)
    df['esc_empirical'] = df['is_esc'].map(esc_rates).astype(float)
    df['brake_empirical'] = df['is_brake_assist'].map(brake_rates).astype(float)
    
    df['safety_empirical'] = (
        0.35 * normalize_score(df['airbag_empirical']) +
        0.35 * normalize_score(df['ncap_empirical']) +
        0.20 * normalize_score(df['esc_empirical']) +
        0.10 * normalize_score(df['brake_empirical'])
    )
    
    # -------------------- PATTERN VALIDATION --------------------
    print("\nüîç Checking for Safety Paradox...")
    
    # Check if MORE safety correlates with MORE claims (paradox)
    corr_airbags = df['airbags'].corr(df['claim_status'])
    corr_ncap = df['ncap_rating'].corr(df['claim_status'])
    
    print(f"   Airbags correlation: {corr_airbags:+.4f}")
    print(f"   NCAP correlation: {corr_ncap:+.4f}")
    
    paradox_detected = (corr_airbags > 0.01 or corr_ncap > 0.01)
    
    if paradox_detected:
        print("   ‚ö†Ô∏è  SAFETY PARADOX DETECTED")
        print("   ‚Üí More safety features correlate with MORE claims")
        print("   ‚Üí Likely due to: risk compensation, exposure, or selection bias")
        
        # Use mostly actuarial
        empirical_wt = 0.10
        actuarial_wt = 0.90
        strategy = "ACTUARIAL-DOMINANT (paradox override)"
    else:
        print("   ‚úÖ Safety pattern aligns with actuarial expectations")
        empirical_wt = 0.40
        actuarial_wt = 0.60
        strategy = "HYBRID (pattern confirmed)"
    
    print(f"\nüîÑ Strategy: {strategy}")
    print(f"   Weights: {empirical_wt:.0%} empirical + {actuarial_wt:.0%} actuarial")
    
    df['safety_score'] = (
        empirical_wt * df['safety_empirical'] + 
        actuarial_wt * df['safety_actuarial']
    )
    
    df['safety_score'] = normalize_score(df['safety_score'])
    
    print(f"   ‚úì Final safety risk: {df['safety_score'].min():.3f} to {df['safety_score'].max():.3f}")
    
    return df

df = calculate_safety_risk_hybrid(df)


CLASS 4: SAFETY FEATURES RISK
Actuarial Prior: INVERSE relationship (more safety = less risk)
Empirical: Often shows positive correlation (paradox)

üìö Calculating Actuarial Prior (Inverse Relationship)...
   ‚úì Actuarial safety risk: 0.24 to 0.84
   ‚Üí LOW airbags + LOW NCAP = HIGH risk

üìä Calculating Empirical Risk...

   Airbag Statistics:
           claim_status       
                   mean  count
airbag_bin                    
1-2            0.063554  41634
5-6            0.064984  16958

   NCAP Statistics:
            claim_status       
                    mean  count
ncap_rating                    
0               0.062418  19097
2               0.064994  21402
3               0.064275  14018
4               0.062914   2114
5               0.066803   1961

üîç Checking for Safety Paradox...
   Airbags correlation: +0.0028
   NCAP correlation: +0.0038
   ‚úÖ Safety pattern aligns with actuarial expectations

üîÑ Strategy: HYBRID (pattern confirmed)
   Weights: 40% e

In [59]:

# ========================================================================
# FINAL COMPOSITE RISK SCORE
# ========================================================================
print("\n" + "="*70)
print("FINAL COMPOSITE RISK SCORE")
print("="*70)

print("\nüìä Component Correlations with Claims:")

correlations = {
    'driver': abs(df['driver_risk_score'].corr(df['claim_status'])),
    'vehicle': abs(df['vehicle_risk_score'].corr(df['claim_status'])),
    'region': abs(df['region_risk_score'].corr(df['claim_status'])),
    'safety': abs(df['safety_score'].corr(df['claim_status']))
}

for component, corr in sorted(correlations.items(), key=lambda x: x[1], reverse=True):
    print(f"   {component:12s}: {corr:.4f}")

# Dynamic weighting based on correlations
if max(correlations.values()) < 0.05:
    print("\n   ‚ö†Ô∏è  Weak correlations. Using insurance industry standards:")
    weights = {'driver': 0.30, 'vehicle': 0.30, 'region': 0.20, 'safety': 0.20}
else:
    # Weight by correlation strength
    total_corr = sum(correlations.values())
    weights = {k: v/total_corr for k, v in correlations.items()}

print("\nüéØ Final Component Weights:")
for component, weight in sorted(weights.items(), key=lambda x: x[1], reverse=True):
    print(f"   {component:12s}: {weight:.1%}")

# Create overall risk score
df['overall_risk_score'] = (
    weights['driver'] * df['driver_risk_score'] +
    weights['vehicle'] * df['vehicle_risk_score'] +
    weights['region'] * df['region_risk_score'] +
    weights['safety'] * df['safety_score']
)

print(f"\n‚úì Overall risk: {df['overall_risk_score'].min():.3f} to {df['overall_risk_score'].max():.3f}")

# Risk categories
df['risk_category'] = pd.cut(
    df['overall_risk_score'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    include_lowest=True
)

print("\nüìä Risk Distribution:")
print(df['risk_category'].value_counts().sort_index())



FINAL COMPOSITE RISK SCORE

üìä Component Correlations with Claims:
   driver      : 0.0227
   region      : 0.0222
   vehicle     : 0.0195
   safety      : 0.0141

   ‚ö†Ô∏è  Weak correlations. Using insurance industry standards:

üéØ Final Component Weights:
   driver      : 30.0%
   vehicle     : 30.0%
   region      : 20.0%
   safety      : 20.0%

‚úì Overall risk: 0.000 to 1.000

üìä Risk Distribution:
risk_category
LOW           2018
MODERATE     15225
HIGH         27317
VERY HIGH    14032
Name: count, dtype: int64


In [60]:
# ========================================================================
# VALIDATION
# ========================================================================
print("\n" + "="*70)
print("HYBRID MODEL VALIDATION")
print("="*70)

claim_mask = df['claim_status'] == 1
no_claim_mask = df['claim_status'] == 0

# Overall discrimination
claim_risk = df[claim_mask]['overall_risk_score'].mean()
no_claim_risk = df[no_claim_mask]['overall_risk_score'].mean()
difference = claim_risk - no_claim_risk
pct_diff = (difference / no_claim_risk) * 100

print(f"\n‚úÖ OVERALL DISCRIMINATION:")
print(f"   Claims avg:      {claim_risk:.4f}")
print(f"   No-claims avg:   {no_claim_risk:.4f}")
print(f"   Difference:      {difference:+.4f} ({pct_diff:+.1f}%)")

if difference > 0.05:
    print(f"   ‚úÖ EXCELLENT discrimination")
elif difference > 0.02:
    print(f"   ‚úÖ GOOD discrimination")
elif difference > 0:
    print(f"   ‚ö†Ô∏è  ACCEPTABLE discrimination")
else:
    print(f"   ‚ùå ERROR: Inverted scores")

# Component validation
print(f"\nüìä Component Discrimination:")
components = ['driver_risk_score', 'vehicle_risk_score', 'region_risk_score', 'safety_score']

for comp in components:
    claim_avg = df[claim_mask][comp].mean()
    no_claim_avg = df[no_claim_mask][comp].mean()
    diff = claim_avg - no_claim_avg
    pct = (diff / no_claim_avg) * 100 if no_claim_avg > 0 else 0
    status = '‚úÖ' if diff > 0 else '‚ùå'
    print(f"   {comp:20s}: {diff:+.4f} ({pct:+.1f}%) {status}")

# Domain checks
print(f"\nüîç Domain Knowledge Validation:")

# Old vs new vehicles
old_veh = df[df['vehicle_age'] >= 7]['overall_risk_score'].mean()
new_veh = df[df['vehicle_age'] < 3]['overall_risk_score'].mean()
diff = old_veh - new_veh
status = '‚úÖ' if diff > 0 else '‚ùå'
print(f"   Old (7+yr) vs New (<3yr) vehicles: {diff:+.3f} {status}")

# Low vs high safety
low_safety = df[df['airbags'] <= 2]['overall_risk_score'].mean()
high_safety = df[df['airbags'] >= 4]['overall_risk_score'].mean()
diff = low_safety - high_safety
status = '‚úÖ' if diff > 0 else '‚ùå'
print(f"   Low (‚â§2) vs High (‚â•4) airbags:     {diff:+.3f} {status}")


HYBRID MODEL VALIDATION

‚úÖ OVERALL DISCRIMINATION:
   Claims avg:      0.6327
   No-claims avg:   0.6015
   Difference:      +0.0312 (+5.2%)
   ‚úÖ GOOD discrimination

üìä Component Discrimination:
   driver_risk_score   : +0.0343 (+6.1%) ‚úÖ
   vehicle_risk_score  : +0.0356 (+5.6%) ‚úÖ
   region_risk_score   : +0.0284 (+3.8%) ‚úÖ
   safety_score        : +0.0228 (+4.9%) ‚úÖ

üîç Domain Knowledge Validation:
   Old (7+yr) vs New (<3yr) vehicles: -0.143 ‚ùå
   Low (‚â§2) vs High (‚â•4) airbags:     +0.006 ‚úÖ


In [61]:
# ========================================================================
# METADATA SUMMARY FOR RAG SYSTEM
# ========================================================================
print("\n" + "="*70)
print("METADATA SUMMARY (For RAG System Documentation)")
print("="*70)

metadata = {
    'model_type': 'Hybrid Actuarial-Empirical',
    'empirical_weight': CONFIG_CONSERVATIVE['empirical_weight'],
    'actuarial_weight': CONFIG_CONSERVATIVE['actuarial_weight'],
    'components': {
        'driver_age': {
            'approach': 'Hybrid with confidence-based weighting',
            'actuarial_principle': 'U-shaped curve (young & elderly = high risk)',
            'empirical_strength': 'Medium',
            'final_weight': weights['driver']
        },
        'vehicle_age': {
            'approach': 'Actuarial-dominant due to empirical inversion',
            'actuarial_principle': 'Linear increase with age',
            'empirical_strength': 'Weak/Contradictory',
            'final_weight': weights['vehicle']
        },
        'region': {
            'approach': 'Empirical-dominant (strong local signal)',
            'actuarial_principle': 'None (geography-specific)',
            'empirical_strength': 'Strong',
            'final_weight': weights['region']
        },
        'safety': {
            'approach': 'Actuarial-dominant (safety paradox override)',
            'actuarial_principle': 'Inverse (more safety = less risk)',
            'empirical_strength': 'Contradictory (paradox detected)',
            'final_weight': weights['safety']
        }
    },
    'discrimination': {
        'overall_difference': difference,
        'percent_difference': pct_diff,
        'quality': 'Excellent' if difference > 0.05 else 'Good' if difference > 0.02 else 'Acceptable'
    },
    'validation_checks': {
        'old_vs_new_vehicles': 'PASS' if old_veh > new_veh else 'FAIL',
        'low_vs_high_safety': 'PASS' if low_safety > high_safety else 'FAIL'
    }
}

print("\nüìã Model Configuration:")
print(f"   Type: {metadata['model_type']}")
print(f"   Overall discrimination: {metadata['discrimination']['quality']}")
print(f"   Claims vs No-claims: {metadata['discrimination']['percent_difference']:.1f}% difference")

print("\nüìä Component Strategies:")
for comp_name, comp_info in metadata['components'].items():
    print(f"\n   {comp_name.upper()}:")
    print(f"      Strategy: {comp_info['approach']}")
    print(f"      Actuarial: {comp_info['actuarial_principle']}")
    print(f"      Empirical: {comp_info['empirical_strength']}")
    print(f"      Weight: {comp_info['final_weight']:.1%}")



METADATA SUMMARY (For RAG System Documentation)

üìã Model Configuration:
   Type: Hybrid Actuarial-Empirical
   Overall discrimination: Good
   Claims vs No-claims: 5.2% difference

üìä Component Strategies:

   DRIVER_AGE:
      Strategy: Hybrid with confidence-based weighting
      Actuarial: U-shaped curve (young & elderly = high risk)
      Empirical: Medium
      Weight: 30.0%

   VEHICLE_AGE:
      Strategy: Actuarial-dominant due to empirical inversion
      Actuarial: Linear increase with age
      Empirical: Weak/Contradictory
      Weight: 30.0%

   REGION:
      Strategy: Empirical-dominant (strong local signal)
      Actuarial: None (geography-specific)
      Empirical: Strong
      Weight: 20.0%

   SAFETY:
      Strategy: Actuarial-dominant (safety paradox override)
      Actuarial: Inverse (more safety = less risk)
      Empirical: Contradictory (paradox detected)
      Weight: 20.0%


In [63]:

# ========================================================================
# GENERATE EXAMPLE RISK PROFILES
# ========================================================================
print("\n" + "="*70)
print("EXAMPLE RISK PROFILES (For RAG System Training)")
print("="*70)

# High risk example
high_risk_example = df.nlargest(1, 'overall_risk_score').iloc[0]
print("\nüî¥ HIGH RISK PROFILE:")
print(f"   Overall Risk Score: {high_risk_example['overall_risk_score']:.3f}")
print(f"   Risk Category: {high_risk_example['risk_category']}")
print(f"   Customer Age: {high_risk_example['customer_age']:.0f} years")
print(f"   Vehicle Age: {high_risk_example['vehicle_age']:.0f} years")
print(f"   Region: {high_risk_example['region_code']}")
print(f"   Airbags: {high_risk_example['airbags']:.0f}")
print(f"   NCAP Rating: {high_risk_example['ncap_rating']:.0f}")
print(f"   Claim Status: {'CLAIMED' if high_risk_example['claim_status'] == 1 else 'NO CLAIM'}")

# Low risk example
low_risk_example = df.nsmallest(1, 'overall_risk_score').iloc[0]
print("\nüü¢ LOW RISK PROFILE:")
print(f"   Overall Risk Score: {low_risk_example['overall_risk_score']:.3f}")
print(f"   Risk Category: {low_risk_example['risk_category']}")
print(f"   Customer Age: {low_risk_example['customer_age']:.0f} years")
print(f"   Vehicle Age: {low_risk_example['vehicle_age']:.0f} years")
print(f"   Region: {low_risk_example['region_code']}")
print(f"   Airbags: {low_risk_example['airbags']:.0f}")
print(f"   NCAP Rating: {low_risk_example['ncap_rating']:.0f}")
print(f"   Claim Status: {'CLAIMED' if low_risk_example['claim_status'] == 1 else 'NO CLAIM'}")

# Moderate risk with claim
moderate_claimed = df[(df['risk_category'] == 'MODERATE') & (df['claim_status'] == 1)]
if len(moderate_claimed) > 0:
    moderate_example = moderate_claimed.iloc[0]
    print("\nüü° MODERATE RISK (CLAIMED) PROFILE:")
    print(f"   Overall Risk Score: {moderate_example['overall_risk_score']:.3f}")
    print(f"   Risk Category: {moderate_example['risk_category']}")
    print(f"   Customer Age: {moderate_example['customer_age']:.0f} years")
    print(f"   Vehicle Age: {moderate_example['vehicle_age']:.0f} years")
    print(f"   Region: {moderate_example['region_code']}")
    print(f"   Airbags: {moderate_example['airbags']:.0f}")
    print(f"   NCAP Rating: {moderate_example['ncap_rating']:.0f}")



EXAMPLE RISK PROFILES (For RAG System Training)

üî¥ HIGH RISK PROFILE:
   Overall Risk Score: 1.000
   Risk Category: VERY HIGH
   Customer Age: 52 years
   Vehicle Age: 1 years
   Region: C8
   Airbags: 2
   NCAP Rating: 2
   Claim Status: NO CLAIM

üü¢ LOW RISK PROFILE:
   Overall Risk Score: 0.000
   Risk Category: LOW
   Customer Age: 35 years
   Vehicle Age: 2 years
   Region: C10
   Airbags: 2
   NCAP Rating: 4
   Claim Status: NO CLAIM

üü° MODERATE RISK (CLAIMED) PROFILE:
   Overall Risk Score: 0.344
   Risk Category: MODERATE
   Customer Age: 41 years
   Vehicle Age: 2 years
   Region: C10
   Airbags: 2
   NCAP Rating: 2


In [64]:

# ========================================================================
# EXPLANATION TEMPLATES FOR RAG
# ========================================================================
print("\n" + "="*70)
print("EXPLANATION TEMPLATES (For RAG Responses)")
print("="*70)

templates = {
    'driver_age': """
DRIVER AGE ASSESSMENT:
This risk assessment combines industry-standard actuarial curves with local claims data.
- Actuarial principle: Risk follows a U-shaped curve (highest for drivers under 25 and over 70)
- Local data: {empirical_pattern}
- Final approach: {strategy}
""",
    
    'vehicle_age': """
VEHICLE AGE ASSESSMENT:
Standard actuarial practice shows older vehicles have higher risk due to wear and safety degradation.
- Actuarial principle: Risk increases linearly with vehicle age
- Local data: {empirical_pattern}
- Final approach: {strategy}
- Note: We prioritize actuarial principles here as local data often shows inverse patterns due to exposure bias.
""",
    
    'region': """
REGIONAL RISK ASSESSMENT:
Geographic risk is highly local and data-driven.
- Top risk regions: {top_regions}
- This assessment: {current_region_status}
- Confidence: HIGH (based on {sample_size} local policies)
""",
    
    'safety': """
SAFETY FEATURES ASSESSMENT:
Modern safety features demonstrably reduce accident severity and frequency.
- Actuarial principle: More safety features = lower risk
- Your vehicle: {airbags} airbags, NCAP rating {ncap}, ESC: {esc}
- Assessment: {safety_assessment}
- Note: We prioritize proven safety research over local data patterns that may reflect risk compensation behavior.
"""
}

print("\nüìù Template Categories:")
for category in templates.keys():
    print(f"   ‚úì {category}")

print("\nüí° Usage: RAG system will populate these templates with specific values for each assessment")



EXPLANATION TEMPLATES (For RAG Responses)

üìù Template Categories:
   ‚úì driver_age
   ‚úì vehicle_age
   ‚úì region
   ‚úì safety

üí° Usage: RAG system will populate these templates with specific values for each assessment


In [65]:

# ========================================================================
# FINAL STATISTICS
# ========================================================================
print("\n" + "="*70)
print("FINAL STATISTICS")
print("="*70)

print(f"\nüìä Dataset Summary:")
print(f"   Total Records: {len(df):,}")
print(f"   Claim Rate: {df['claim_status'].mean()*100:.2f}%")
print(f"   Average Risk Score: {df['overall_risk_score'].mean():.3f}")
print(f"   Risk Score Std Dev: {df['overall_risk_score'].std():.3f}")

print(f"\nüìà Risk Distribution:")
risk_dist = df['risk_category'].value_counts(normalize=True).sort_index()
for category, pct in risk_dist.items():
    print(f"   {category:12s}: {pct*100:5.1f}%")

print(f"\nüéØ Model Performance Indicators:")
print(f"   Discrimination Index: {difference:.4f}")
print(f"   Separation Power: {pct_diff:.1f}%")
print(f"   Claims Concentration in High Risk: {df[df['risk_category'].isin(['HIGH', 'VERY HIGH'])]['claim_status'].mean()*100:.1f}%")
print(f"   Claims Concentration in Low Risk: {df[df['risk_category'] == 'LOW']['claim_status'].mean()*100:.1f}%")

# ========================================================================
# RECOMMENDATIONS FOR DEPLOYMENT
# ========================================================================
print("\n" + "="*70)
print("DEPLOYMENT RECOMMENDATIONS")
print("="*70)

print("\n‚úÖ Strengths of this Hybrid Model:")
print("   1. Combines actuarial science with local market data")
print("   2. Avoids common data quality pitfalls (inversions, paradoxes)")
print("   3. Transparent decision logic for each risk component")
print("   4. Maintains positive discrimination across all components")
print("   5. Suitable for regulatory review (actuarial backing)")

print("\n‚ö†Ô∏è  Limitations to Disclose:")
print("   1. Overall discrimination is moderate (not strong)")
print("   2. Vehicle and safety features have weak empirical signals")
print("   3. Model relies heavily on actuarial priors (70% weight)")
print("   4. Regional data is strongest predictor")

print("\nüí° Recommendations:")
print("   1. Use for risk SCREENING, not precise pricing")
print("   2. Flag high-risk cases for manual underwriter review")
print("   3. Collect more data on vehicle age and safety outcomes")
print("   4. Consider A/B testing empirical vs actuarial weights")
print("   5. Regular recalibration as data quality improves")

print("\nüéØ RAG System Integration:")
print("   ‚Ä¢ Use these risk scores to retrieve similar historical cases")
print("   ‚Ä¢ Explain reasoning with reference to both data AND principles")
print("   ‚Ä¢ Highlight when actuarial override was applied and why")
print("   ‚Ä¢ Provide confidence intervals, not just point estimates")

print("\n" + "="*70)
print("‚úÖ HYBRID PREPROCESSING PIPELINE COMPLETE")
print("="*70)
print("\nüìÅ Output Files:")
print("   ‚Ä¢ train_hybrid.csv")
print("   ‚Ä¢ validation_hybrid.csv")
print("   ‚Ä¢ test_hybrid.csv")
print("   ‚Ä¢ cleaned_data_hybrid.csv")
print("\nüöÄ Ready for text generation and RAG index creation!")
print("="*70)


FINAL STATISTICS

üìä Dataset Summary:
   Total Records: 58,592
   Claim Rate: 6.40%
   Average Risk Score: 0.603
   Risk Score Std Dev: 0.189

üìà Risk Distribution:
   LOW         :   3.4%
   MODERATE    :  26.0%
   HIGH        :  46.6%
   VERY HIGH   :  23.9%

üéØ Model Performance Indicators:
   Discrimination Index: 0.0312
   Separation Power: 5.2%
   Claims Concentration in High Risk: 6.9%
   Claims Concentration in Low Risk: 3.6%

DEPLOYMENT RECOMMENDATIONS

‚úÖ Strengths of this Hybrid Model:
   1. Combines actuarial science with local market data
   2. Avoids common data quality pitfalls (inversions, paradoxes)
   3. Transparent decision logic for each risk component
   4. Maintains positive discrimination across all components
   5. Suitable for regulatory review (actuarial backing)

‚ö†Ô∏è  Limitations to Disclose:
   1. Overall discrimination is moderate (not strong)
   2. Vehicle and safety features have weak empirical signals
   3. Model relies heavily on actuarial pr

In [68]:
# ========================================================================
# SAVE PROCESSED DATA
# ========================================================================
print("\n" + "="*70)
print("SAVING PROCESSED DATA")
print("="*70)

# Stratified split
train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df['claim_status'], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df['claim_status'], random_state=42
)

print(f"\n‚úì Data Splits:")
print(f"   Train: {len(train_df):,} ({len(train_df)/len(df)*100:.1f}%)")
print(f"   Val:   {len(val_df):,} ({len(val_df)/len(df)*100:.1f}%)")
print(f"   Test:  {len(test_df):,} ({len(test_df)/len(df)*100:.1f}%)")

# Save files
train_df.to_csv('../data/processed/train_hybrid.csv', index=False)
val_df.to_csv('../data/processed/validation_hybrid.csv', index=False)
test_df.to_csv('../data/processed/test_hybrid.csv', index=False)
df.to_csv('../data/processed/cleaned_data_hybrid.csv', index=False)

print(f"\n‚úÖ HYBRID")


SAVING PROCESSED DATA

‚úì Data Splits:
   Train: 41,014 (70.0%)
   Val:   8,789 (15.0%)
   Test:  8,789 (15.0%)

‚úÖ HYBRID


In [69]:

# ========================================================================
# STEP 4: STRATIFIED SPLITTING
# ========================================================================
print("\n" + "="*70)
print("STEP 4: STRATIFIED DATA SPLITTING")
print("="*70)

train_df, temp_df = train_test_split(
    df,
    test_size=0.30,
    stratify=df['claim_status'],
    random_state=42
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df['claim_status'],
    random_state=42
)

print(f"\n‚úì Split Sizes:")
print(f"   Train: {len(train_df):,} ({len(train_df)/len(df)*100:.1f}%) - {(train_df['claim_status']==1).mean()*100:.2f}% claims")
print(f"   Val:   {len(val_df):,} ({len(val_df)/len(df)*100:.1f}%) - {(val_df['claim_status']==1).mean()*100:.2f}% claims")
print(f"   Test:  {len(test_df):,} ({len(test_df)/len(df)*100:.1f}%) - {(test_df['claim_status']==1).mean()*100:.2f}% claims")

# Validate splits
print(f"\nüìä Split Quality Check:")
for split_name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    claim_r = split_df[split_df['claim_status']==1]['overall_risk_score'].mean()
    no_claim_r = split_df[split_df['claim_status']==0]['overall_risk_score'].mean()
    diff = claim_r - no_claim_r
    pct = (diff / no_claim_r) * 100 if no_claim_r > 0 else 0
    status = '‚úÖ' if diff > 0.02 else '‚ö†Ô∏è' if diff > 0 else '‚ùå'
    print(f"   {split_name:5s}: Œî = {diff:+.4f} ({pct:+.1f}%) {status}")



STEP 4: STRATIFIED DATA SPLITTING

‚úì Split Sizes:
   Train: 41,014 (70.0%) - 6.40% claims
   Val:   8,789 (15.0%) - 6.39% claims
   Test:  8,789 (15.0%) - 6.39% claims

üìä Split Quality Check:
   Train: Œî = +0.0287 (+4.8%) ‚úÖ
   Val  : Œî = +0.0402 (+6.7%) ‚úÖ
   Test : Œî = +0.0342 (+5.7%) ‚úÖ


In [70]:

# ========================================================================
# FINAL SANITY CHECKS
# ========================================================================
print("\n" + "="*70)
print("FINAL SANITY CHECKS")
print("="*70)

checks_passed = 0
checks_total = 5

# Check 1: Overall discrimination
if difference > 0.01:
    print("‚úÖ Overall discrimination > 1%")
    checks_passed += 1
else:
    print("‚ùå Overall discrimination too weak")

# Check 2: All components positive
all_positive = all([
    df[claim_mask][comp].mean() > df[no_claim_mask][comp].mean() 
    for comp in components
])
if all_positive:
    print("‚úÖ All risk components show positive correlation")
    checks_passed += 1
else:
    print("‚ö†Ô∏è  Some components show negative correlation")

# Check 3: Old vehicles > new vehicles
if old_veh > new_veh:
    print("‚úÖ Old vehicles have higher risk than new")
    checks_passed += 1
else:
    print("‚ùå Vehicle age pattern inverted")

# Check 4: Low safety > high safety
if low_safety > high_safety:
    print("‚úÖ Low safety vehicles have higher risk")
    checks_passed += 1
else:
    print("‚ùå Safety pattern inverted")

# Check 5: Test split valid
test_claim_r = test_df[test_df['claim_status']==1]['overall_risk_score'].mean()
test_no_claim_r = test_df[test_df['claim_status']==0]['overall_risk_score'].mean()
if test_claim_r > test_no_claim_r:
    print("‚úÖ Test set maintains correct pattern")
    checks_passed += 1
else:
    print("‚ùå Test set pattern inverted")

print(f"\n{'='*70}")
print(f"CHECKS PASSED: {checks_passed}/{checks_total}")
if checks_passed == checks_total:
    print("üéâ ALL CHECKS PASSED - Ready for text generation!")
elif checks_passed >= 3:
    print("‚ö†Ô∏è  SOME ISSUES - Review before proceeding")
else:
    print("‚ùå CRITICAL ISSUES - Do not proceed!")
print(f"{'='*70}")


FINAL SANITY CHECKS
‚úÖ Overall discrimination > 1%
‚úÖ All risk components show positive correlation
‚ùå Vehicle age pattern inverted
‚úÖ Low safety vehicles have higher risk
‚úÖ Test set maintains correct pattern

CHECKS PASSED: 4/5
‚ö†Ô∏è  SOME ISSUES - Review before proceeding


### FEATURE ENGINEERING

- **Purpose:** Create composite risk scores and categorical bins
- **Why:** Enriches text summaries with meaningful risk context
- **Output:** 6 risk scores + 4 categorical groupings

In [49]:
# ========================================================================
# STEP 2: CORRECTED DATA-DRIVEN FEATURE ENGINEERING
# ========================================================================
print("\n" + "="*70)
print("STEP 2: CORRECTED FEATURE ENGINEERING (INHERENT RISK ONLY)")
print("="*70)

def calculate_empirical_risk_score(feature_col, target_col, n_bins=5):
    """
    Calculate risk score based on ACTUAL claim rates observed in the data.
    This ensures risk scores reflect reality, not assumptions.
    
    Args:
        feature_col: The feature to bin and analyze
        target_col: The target variable (claim_status)
        n_bins: Number of bins to create
    
    Returns:
        Normalized risk score (0-1) where higher = higher observed claim rate
    """
    # Create bins (quantile-based for even distribution)
    try:
        feature_binned = pd.qcut(feature_col, q=n_bins, duplicates='drop')
    except:
        # If qcut fails (e.g., too few unique values), use regular cut
        feature_binned = pd.cut(feature_col, bins=n_bins)
    
    # Create a temporary dataframe to calculate claim rates per bin
    temp_df = pd.DataFrame({
        'bin': feature_binned,
        'target': target_col
    })
    
    # Calculate actual claim rate in each bin
    bin_claim_rates = temp_df.groupby('bin', observed=True)['target'].mean()
    
    # Map claim rates back to original data (convert to numeric)
    risk_scores = feature_binned.map(bin_claim_rates).astype(float)
    
    # Normalize to 0-1 scale
    min_rate = risk_scores.min()
    max_rate = risk_scores.max()
    
    if max_rate > min_rate:
        normalized_scores = (risk_scores - min_rate) / (max_rate - min_rate)
    else:
        # If all bins have same rate, return middle value
        normalized_scores = pd.Series(0.5, index=risk_scores.index)
    
    return normalized_scores

print("\nüìä Creating empirical risk scores based on ACTUAL claim patterns...")
print("   ‚ö†Ô∏è  EXCLUDING subscription_length (exposure variable, not risk factor)")

# ========================================================================
# CRITICAL FIX: Calculate exposure-normalized claim rates FIRST
# ========================================================================
print("\nüîç Creating exposure-adjusted target variable...")

# Avoid division by zero
df['exposure_months'] = df['subscription_length'].replace(0, 0.1)

# Calculate claims per month of exposure (this is the TRUE risk signal)
df['claim_rate_per_month'] = df['claim_status'] / df['exposure_months']

print(f"   ‚úì Original claim rate: {df['claim_status'].mean()*100:.2f}%")
print(f"   ‚úì Average exposure: {df['subscription_length'].mean():.1f} months")
print(f"   ‚úì Claims per month: {df['claim_rate_per_month'].mean()*1000:.3f} per 1000 months")

# ========================================================================
# Now calculate risk scores using exposure-adjusted target
# ========================================================================

# 2.1 Customer Age Risk (based on YOUR EDA showing 56+ has 7.54% claims)
print("\nüìä Calculating DRIVER risk score...")
df['driver_risk_score'] = calculate_empirical_risk_score(
    df['customer_age'], 
    df['claim_rate_per_month'],  # ‚Üê CHANGED: use exposure-adjusted
    n_bins=5
)

# Add manual adjustments based on known insurance principles
# Young drivers (under 25) and senior drivers (65+) are high risk
age_adjustment = pd.Series(0.0, index=df.index)
age_adjustment[df['customer_age'] < 25] = 0.3  # Boost young driver risk
age_adjustment[df['customer_age'] >= 65] = 0.2  # Boost senior risk
df['driver_risk_score'] = np.clip(df['driver_risk_score'] + age_adjustment, 0, 1)

print(f"   ‚úì Driver risk calculated with age-based adjustments")

# 2.2 Vehicle Age Risk (based on YOUR EDA showing 0-3yrs has 6.12% claims)
print("\nüìä Calculating VEHICLE risk score...")
df['vehicle_risk_score'] = calculate_empirical_risk_score(
    df['vehicle_age'], 
    df['claim_rate_per_month'],  # ‚Üê CHANGED: use exposure-adjusted
    n_bins=3
)

# Manual adjustment: older vehicles (5+ years) are inherently riskier
vehicle_age_adjustment = pd.Series(0.0, index=df.index)
vehicle_age_adjustment[df['vehicle_age'] >= 5] = 0.2
vehicle_age_adjustment[df['vehicle_age'] >= 8] = 0.4
df['vehicle_risk_score'] = np.clip(df['vehicle_risk_score'] + vehicle_age_adjustment, 0, 1)

# Add segment risk (smaller/utility vehicles often have higher claims)
segment_risk = df['segment'].map({
    'A': 0.15,      # Small cars - higher risk
    'B1': 0.10,
    'B2': 0.05,
    'C1': 0.0,      # Mid-size - baseline
    'C2': 0.0,
    'Utility': 0.20 # Utility vehicles - highest risk
}).fillna(0)

df['vehicle_risk_score'] = np.clip(df['vehicle_risk_score'] + segment_risk, 0, 1)

print(f"   ‚úì Vehicle risk calculated with age and segment adjustments")

# 2.3 Region Risk (based on actual claim rates by region)
print("\nüìä Calculating REGION risk score...")

# Calculate actual claim rates by region
region_claim_rates = df.groupby('region_code')['claim_rate_per_month'].mean()

# Normalize to 0-1
min_region = region_claim_rates.min()
max_region = region_claim_rates.max()
region_risk_normalized = (region_claim_rates - min_region) / (max_region - min_region)

# Map to dataframe
df['region_risk_score'] = df['region_code'].map(region_risk_normalized)

# Also consider population density (urban = higher risk)
density_risk = calculate_empirical_risk_score(
    df['region_density'],
    df['claim_rate_per_month'],
    n_bins=3
)

# Combine region and density (60% region specific, 40% density)
df['region_risk_score'] = 0.6 * df['region_risk_score'] + 0.4 * density_risk

print(f"   ‚úì Region risk calculated from actual claim rates + density")

# 2.4 Safety Features Risk (composite of all safety features)
print("\nüìä Calculating SAFETY risk score...")

# Create comprehensive safety composite
df['safety_composite'] = (
    (df['airbags'] / 6) * 0.30 +           # More airbags = safer
    (df['ncap_rating'] / 5) * 0.30 +       # Higher NCAP = safer
    df['is_esc'] * 0.15 +                  # ESC is critical
    df['is_brake_assist'] * 0.10 +
    df['is_parking_sensors'] * 0.05 +
    df['is_parking_camera'] * 0.05 +
    df['is_tpms'] * 0.05
)

# Convert to risk score (invert: less safety = more risk)
df['safety_score'] = 1 - df['safety_composite']

# Validate against actual data
safety_empirical = calculate_empirical_risk_score(
    df['safety_composite'],
    df['claim_rate_per_month'],
    n_bins=5
)

# Blend manual + empirical (70% manual, 30% empirical)
df['safety_score'] = 0.7 * df['safety_score'] + 0.3 * safety_empirical

print(f"   ‚úì Safety risk calculated from composite features")

# ========================================================================
# 2.5 CRITICAL: Create subscription exposure variable (NOT a risk factor!)
# ========================================================================
print("\nüìä Creating EXPOSURE adjustment (separate from risk)...")

# Normalize subscription length to 0-1 for exposure weighting
df['exposure_factor'] = df['subscription_length'] / df['subscription_length'].max()

print(f"   ‚úì Exposure factor created (will be used for premium calculation only)")
print(f"   ‚ÑπÔ∏è  This is NOT included in inherent risk score!")

# ========================================================================
# 2.6 Calculate NEW correlation-based weights (WITHOUT subscription!)
# ========================================================================
print(f"\nüìä Calculating CORRECTED feature importance weights...")

correlations = {
    'driver': abs(df['driver_risk_score'].corr(df['claim_rate_per_month'])),
    'vehicle': abs(df['vehicle_risk_score'].corr(df['claim_rate_per_month'])),
    'region': abs(df['region_risk_score'].corr(df['claim_rate_per_month'])),
    'safety': abs(df['safety_score'].corr(df['claim_rate_per_month']))
}

# Normalize weights to sum to 1
total_corr = sum(correlations.values())
weights = {k: v/total_corr for k, v in correlations.items()}

print(f"\n   üéØ CORRECTED Feature weights (NO subscription length!):")
for feature, weight in sorted(weights.items(), key=lambda x: x[1], reverse=True):
    print(f"      {feature:12s}: {weight:.3f} (corr: {correlations[feature]:.4f})")

# If correlations are still weak, use insurance industry standard weights
if max(weights.values()) < 0.35:
    print(f"\n   ‚ö†Ô∏è  Empirical correlations weak. Using industry-standard weights:")
    weights = {
        'driver': 0.35,   # Driver characteristics most important
        'vehicle': 0.30,  # Vehicle features second
        'safety': 0.20,   # Safety features third
        'region': 0.15    # Geographic risk last
    }
    for feature, weight in weights.items():
        print(f"      {feature:12s}: {weight:.3f}")

# ========================================================================
# 2.7 Create weighted INHERENT risk score (no exposure/subscription!)
# ========================================================================
df['overall_risk_score'] = (
    weights['driver'] * df['driver_risk_score'] +
    weights['vehicle'] * df['vehicle_risk_score'] +
    weights['region'] * df['region_risk_score'] +
    weights['safety'] * df['safety_score']
)

print(f"\n‚úì INHERENT risk score range: {df['overall_risk_score'].min():.3f} to {df['overall_risk_score'].max():.3f}")
print(f"   (This score represents risk AT POLICY INCEPTION, not over time)")

# 2.8 Create risk categories
df['risk_category'] = pd.cut(
    df['overall_risk_score'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    include_lowest=True
)

print(f"\nüìä Risk category distribution:")
print(df['risk_category'].value_counts().sort_index())

# 2.9 Create contextual categorical features (for text generation)
df['age_group'] = pd.cut(
    df['customer_age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['very_young', 'young', 'middle_aged', 'mature', 'senior']
)

df['vehicle_age_group'] = pd.cut(
    df['vehicle_age'],
    bins=[0, 3, 7, 100],
    labels=['new', 'moderate', 'old']
)

df['subscription_category'] = pd.cut(
    df['subscription_length'],
    bins=[0, 3, 6, 9, 100],
    labels=['very_short', 'short', 'medium', 'long']
)

print(f"‚úì Created categorical groupings for text generation context")

# ========================================================================
# 2.10 Add explanatory columns for transparency
# ========================================================================
df['risk_methodology'] = 'exposure_adjusted_inherent_factors'
df['weights_used'] = str(weights)

print(f"\n‚úÖ CORRECTED RISK ENGINEERING COMPLETE!")
print(f"   Key changes:")
print(f"   1. ‚úì Used exposure-adjusted claims (claims per month)")
print(f"   2. ‚úì Excluded subscription length from risk score")
print(f"   3. ‚úì Applied insurance industry adjustments")
print(f"   4. ‚úì Created separate exposure factor for premium calc")


STEP 2: CORRECTED FEATURE ENGINEERING (INHERENT RISK ONLY)

üìä Creating empirical risk scores based on ACTUAL claim patterns...
   ‚ö†Ô∏è  EXCLUDING subscription_length (exposure variable, not risk factor)

üîç Creating exposure-adjusted target variable...
   ‚úì Original claim rate: 6.40%
   ‚úì Average exposure: 6.1 months
   ‚úì Claims per month: 26.196 per 1000 months

üìä Calculating DRIVER risk score...
   ‚úì Driver risk calculated with age-based adjustments

üìä Calculating VEHICLE risk score...
   ‚úì Vehicle risk calculated with age and segment adjustments

üìä Calculating REGION risk score...
   ‚úì Region risk calculated from actual claim rates + density

üìä Calculating SAFETY risk score...
   ‚úì Safety risk calculated from composite features

üìä Creating EXPOSURE adjustment (separate from risk)...
   ‚úì Exposure factor created (will be used for premium calculation only)
   ‚ÑπÔ∏è  This is NOT included in inherent risk score!

üìä Calculating CORRECTED feature

In [50]:
# ========================================================================
# STEP 3: CRITICAL VALIDATION - Risk Scores Must Make Sense!
# ========================================================================
print("\n" + "="*70)
print("STEP 3: VALIDATING CORRECTED RISK SCORES")
print("="*70)

claim_mask = df['claim_status'] == 1
no_claim_mask = df['claim_status'] == 0

print(f"\n‚úÖ OVERALL RISK SCORE VALIDATION (using original claim_status):")
claim_risk = df[claim_mask]['overall_risk_score'].mean()
no_claim_risk = df[no_claim_mask]['overall_risk_score'].mean()
difference = claim_risk - no_claim_risk

print(f"   Claims avg risk:     {claim_risk:.4f}")
print(f"   No-claims avg risk:  {no_claim_risk:.4f}")
print(f"   Difference:          {difference:+.4f} {'‚úÖ CORRECT!' if difference > 0 else '‚ùå ERROR!'}")

if difference <= 0:
    print(f"\n   ‚ö†Ô∏è  WARNING: Risk scores are inverted or flat!")
    print(f"   This means the model won't learn meaningful patterns.")
elif difference < 0.02:
    print(f"\n   ‚ö†Ô∏è  WARNING: Risk scores show weak discrimination ({difference:.4f})")
    print(f"   Consider: more feature engineering or data quality issues")
else:
    print(f"\n   ‚úÖ GOOD: Risk scores successfully discriminate claims from non-claims")

print(f"\nüìä Component-wise validation:")
for score_col in ['driver_risk_score', 'vehicle_risk_score', 
                  'region_risk_score', 'safety_score']:
    claim_avg = df[claim_mask][score_col].mean()
    no_claim_avg = df[no_claim_mask][score_col].mean()
    diff = claim_avg - no_claim_avg
    
    # Determine if this makes sense
    if score_col == 'safety_score':
        # Safety score is INVERTED (higher = less safe), so claims should be higher
        status = '‚úÖ' if diff > 0 else '‚ö†Ô∏è (inverted?)'
    else:
        # Other scores: claims should have higher risk
        status = '‚úÖ' if diff > 0 else '‚ö†Ô∏è (unexpected)'
    
    print(f"   {score_col:25s}: {diff:+.4f} {status}")

# Additional validation: check against known patterns
print(f"\nüîç Domain Knowledge Validation:")

# Young drivers should have higher risk
young_risk = df[df['customer_age'] < 30]['overall_risk_score'].mean()
mature_risk = df[(df['customer_age'] >= 35) & (df['customer_age'] <= 50)]['overall_risk_score'].mean()
print(f"   Young drivers (<30):  {young_risk:.3f}")
print(f"   Mature drivers (35-50): {mature_risk:.3f}")
print(f"   Difference: {young_risk - mature_risk:+.3f} {'‚úÖ' if young_risk > mature_risk else '‚ö†Ô∏è unexpected'}")

# Old vehicles should have higher risk
old_vehicle_risk = df[df['vehicle_age'] >= 5]['overall_risk_score'].mean()
new_vehicle_risk = df[df['vehicle_age'] < 3]['overall_risk_score'].mean()
print(f"\n   Old vehicles (5+ yrs): {old_vehicle_risk:.3f}")
print(f"   New vehicles (<3 yrs): {new_vehicle_risk:.3f}")
print(f"   Difference: {old_vehicle_risk - new_vehicle_risk:+.3f} {'‚úÖ' if old_vehicle_risk > new_vehicle_risk else '‚ö†Ô∏è unexpected'}")

# Fewer safety features = higher risk
low_safety = df[df['airbags'] <= 2]['overall_risk_score'].mean()
high_safety = df[df['airbags'] >= 4]['overall_risk_score'].mean()
print(f"\n   Low safety (‚â§2 airbags): {low_safety:.3f}")
print(f"   High safety (‚â•4 airbags): {high_safety:.3f}")
print(f"   Difference: {low_safety - high_safety:+.3f} {'‚úÖ' if low_safety > high_safety else '‚ö†Ô∏è unexpected'}")


STEP 3: VALIDATING CORRECTED RISK SCORES

‚úÖ OVERALL RISK SCORE VALIDATION (using original claim_status):
   Claims avg risk:     0.4083
   No-claims avg risk:  0.4051
   Difference:          +0.0032 ‚úÖ CORRECT!

   Consider: more feature engineering or data quality issues

üìä Component-wise validation:
   driver_risk_score        : +0.0007 ‚úÖ
   vehicle_risk_score       : +0.0199 ‚úÖ
   region_risk_score        : -0.0085 ‚ö†Ô∏è (unexpected)
   safety_score             : -0.0076 ‚ö†Ô∏è (inverted?)

üîç Domain Knowledge Validation:
   Young drivers (<30):  nan
   Mature drivers (35-50): 0.415
   Difference: +nan ‚ö†Ô∏è unexpected

   Old vehicles (5+ yrs): 0.266
   New vehicles (<3 yrs): 0.428
   Difference: -0.163 ‚ö†Ô∏è unexpected

   Low safety (‚â§2 airbags): 0.486
   High safety (‚â•4 airbags): 0.208
   Difference: +0.277 ‚úÖ


In [51]:

# ========================================================================
# STEP 4: STRATIFIED DATA SPLITTING (BEFORE TEXT GENERATION!)
# ========================================================================
print("\n" + "="*70)
print("STEP 4: STRATIFIED DATA SPLITTING")
print("="*70)

# ‚úÖ SPLIT FIRST - This ensures no data leakage
train_df, temp_df = train_test_split(
    df,
    test_size=0.30,  # 30% for val+test
    stratify=df['claim_status'],
    random_state=42
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,  # Split the 30% equally
    stratify=temp_df['claim_status'],
    random_state=42
)

print(f"‚úì Train set: {len(train_df):,} records ({(train_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"‚úì Val set:   {len(val_df):,} records ({(val_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"‚úì Test set:  {len(test_df):,} records ({(test_df['claim_status']==1).mean()*100:.2f}% claims)")

# Validate splits maintain risk score patterns
print(f"\nüìä Risk score validation across splits:")
for split_name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    claim_r = split_df[split_df['claim_status']==1]['overall_risk_score'].mean()
    no_claim_r = split_df[split_df['claim_status']==0]['overall_risk_score'].mean()
    diff = claim_r - no_claim_r
    status = '‚úÖ' if diff > 0.01 else '‚ö†Ô∏è'
    print(f"   {split_name:5s}: Claims {claim_r:.3f} vs No-Claims {no_claim_r:.3f} = {diff:+.3f} {status}")



STEP 4: STRATIFIED DATA SPLITTING
‚úì Train set: 41,014 records (6.40% claims)
‚úì Val set:   8,789 records (6.39% claims)
‚úì Test set:  8,789 records (6.39% claims)

üìä Risk score validation across splits:
   Train: Claims 0.413 vs No-Claims 0.405 = +0.007 ‚ö†Ô∏è
   Val  : Claims 0.407 vs No-Claims 0.402 = +0.005 ‚ö†Ô∏è
   Test : Claims 0.389 vs No-Claims 0.408 = -0.019 ‚ö†Ô∏è


In [52]:
# ========================================================================
# STEP 6: FINAL DATA SAVE (SIMPLIFIED - NO BALANCING!)
# ========================================================================
print("\n" + "="*70)
print("STEP 6: SAVING FINAL PROCESSED DATA")
print("="*70)

# Save splits
train_df.to_csv('../data/processed/train.csv', index=False)
val_df.to_csv('../data/processed/validation.csv', index=False)
test_df.to_csv('../data/processed/test.csv', index=False)

# Also save a full processed version (for reference)
df.to_csv('../data/processed/cleaned_data.csv', index=False)

print(f"\n‚úÖ Saved files:")
print(f"   üìÇ train.csv:          {len(train_df):,} records (for RAG index)")
print(f"   üìÇ validation.csv:     {len(val_df):,} records (for tuning)")
print(f"   üìÇ test.csv:           {len(test_df):,} records (for final eval)")
print(f"   üìÇ cleaned_data.csv:   {len(df):,} records (reference)")

print(f"\nüìä Split Sizes:")
print(f"   Train:      {len(train_df)/len(df)*100:.1f}% ({len(train_df):,})")
print(f"   Validation: {len(val_df)/len(df)*100:.1f}% ({len(val_df):,})")
print(f"   Test:       {len(test_df)/len(df)*100:.1f}% ({len(test_df):,})")

print(f"\n‚úÖ PREPROCESSING COMPLETE!")
print(f"\nüìù Next Steps:")
print(f"   1. Run Notebook 04 to generate text summaries for TRAIN.csv only")
print(f"   2. Run Notebook 05 to build RAG index from TRAIN.csv")
print(f"   3. Evaluate on val.csv and test.csv")



STEP 6: SAVING FINAL PROCESSED DATA

‚úÖ Saved files:
   üìÇ train.csv:          41,014 records (for RAG index)
   üìÇ validation.csv:     8,789 records (for tuning)
   üìÇ test.csv:           8,789 records (for final eval)
   üìÇ cleaned_data.csv:   58,592 records (reference)

üìä Split Sizes:
   Train:      70.0% (41,014)
   Validation: 15.0% (8,789)
   Test:       15.0% (8,789)

‚úÖ PREPROCESSING COMPLETE!

üìù Next Steps:
   1. Run Notebook 04 to generate text summaries for TRAIN.csv only
   2. Run Notebook 05 to build RAG index from TRAIN.csv
   3. Evaluate on val.csv and test.csv


In [53]:

# ========================================================================
# STEP 2: DATA-DRIVEN FEATURE ENGINEERING
# ========================================================================
print("\n" + "="*70)
print("STEP 2: DATA-DRIVEN FEATURE ENGINEERING")
print("="*70)

def calculate_empirical_risk_score(feature_col, target_col, n_bins=5):
    """
    Calculate risk score based on ACTUAL claim rates observed in the data.
    This ensures risk scores reflect reality, not assumptions.
    
    Args:
        feature_col: The feature to bin and analyze
        target_col: The target variable (claim_status)
        n_bins: Number of bins to create
    
    Returns:
        Normalized risk score (0-1) where higher = higher observed claim rate
    """
    # Create bins (quantile-based for even distribution)
    try:
        feature_binned = pd.qcut(feature_col, q=n_bins, duplicates='drop')
    except:
        # If qcut fails (e.g., too few unique values), use regular cut
        feature_binned = pd.cut(feature_col, bins=n_bins)
    
    # Create a temporary dataframe to calculate claim rates per bin
    temp_df = pd.DataFrame({
        'bin': feature_binned,
        'target': target_col
    })
    
    # Calculate actual claim rate in each bin
    bin_claim_rates = temp_df.groupby('bin', observed=True)['target'].mean()
    
    # Map claim rates back to original data (convert to numeric)
    risk_scores = feature_binned.map(bin_claim_rates).astype(float)
    
    # Normalize to 0-1 scale
    min_rate = risk_scores.min()
    max_rate = risk_scores.max()
    
    if max_rate > min_rate:
        normalized_scores = (risk_scores - min_rate) / (max_rate - min_rate)
    else:
        # If all bins have same rate, return middle value
        normalized_scores = pd.Series(0.5, index=risk_scores.index)
    
    return normalized_scores

print("\nüìä Creating empirical risk scores based on ACTUAL claim patterns...")

# 2.1 Customer Age Risk (based on YOUR EDA showing 56+ has 7.54% claims)
df['driver_risk_score'] = calculate_empirical_risk_score(
    df['customer_age'], 
    df['claim_status'], 
    n_bins=5
)

# 2.2 Vehicle Age Risk (based on YOUR EDA showing 0-3yrs has 6.12% claims)
df['vehicle_risk_score'] = calculate_empirical_risk_score(
    df['vehicle_age'], 
    df['claim_status'], 
    n_bins=3
)

# 2.3 Subscription Length Risk (YOUR HIGHEST CORRELATION: 0.078738)
df['subscription_risk_score'] = calculate_empirical_risk_score(
    df['subscription_length'], 
    df['claim_status'], 
    n_bins=5
)

# 2.4 Region Density Risk
df['region_risk_score'] = calculate_empirical_risk_score(
    df['region_density'], 
    df['claim_status'], 
    n_bins=5
)

# 2.5 Safety Features Risk (composite of all safety features)
# First create a safety composite score
df['safety_composite'] = (
    df['airbags']/6 + 
    df['is_esc'] + 
    df['is_brake_assist'] + 
    df['is_parking_sensors'] + 
    df['is_tpms'] + 
    df['ncap_rating']/5
) / 6

df['safety_score'] = calculate_empirical_risk_score(
    df['safety_composite'], 
    df['claim_status'], 
    n_bins=5
)

print(f"‚úì Created 5 empirical risk scores")

# 2.6 Calculate correlation-based weights
print(f"\nüìä Calculating feature importance weights...")

correlations = {
    'subscription': abs(df['subscription_risk_score'].corr(df['claim_status'])),
    'driver': abs(df['driver_risk_score'].corr(df['claim_status'])),
    'vehicle': abs(df['vehicle_risk_score'].corr(df['claim_status'])),
    'region': abs(df['region_risk_score'].corr(df['claim_status'])),
    'safety': abs(df['safety_score'].corr(df['claim_status']))
}

# Normalize weights to sum to 1
total_corr = sum(correlations.values())
weights = {k: v/total_corr for k, v in correlations.items()}

print(f"\n   Feature weights (based on correlation with claims):")
for feature, weight in sorted(weights.items(), key=lambda x: x[1], reverse=True):
    print(f"      {feature:12s}: {weight:.3f} (corr: {correlations[feature]:.4f})")

# 2.7 Create weighted overall risk score
df['overall_risk_score'] = (
    weights['subscription'] * df['subscription_risk_score'] +
    weights['driver'] * df['driver_risk_score'] +
    weights['vehicle'] * df['vehicle_risk_score'] +
    weights['region'] * df['region_risk_score'] +
    weights['safety'] * df['safety_score']
)

print(f"\n‚úì Overall risk score range: {df['overall_risk_score'].min():.3f} to {df['overall_risk_score'].max():.3f}")

# 2.8 Create risk categories
df['risk_category'] = pd.cut(
    df['overall_risk_score'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    include_lowest=True
)

print(f"\nüìä Risk category distribution:")
print(df['risk_category'].value_counts().sort_index())

# 2.9 Create contextual categorical features (for text generation)
df['age_group'] = pd.cut(
    df['customer_age'],
    bins=[0, 35, 45, 55, 100],
    labels=['young', 'middle_aged', 'mature', 'senior']
)

df['vehicle_age_group'] = pd.cut(
    df['vehicle_age'],
    bins=[0, 3, 7, 100],
    labels=['new', 'moderate', 'old']
)

df['subscription_category'] = pd.cut(
    df['subscription_length'],
    bins=[0, 3, 6, 9, 100],
    labels=['very_short', 'short', 'medium', 'long']
)

print(f"‚úì Created categorical groupings for text generation context")



STEP 2: DATA-DRIVEN FEATURE ENGINEERING

üìä Creating empirical risk scores based on ACTUAL claim patterns...
‚úì Created 5 empirical risk scores

üìä Calculating feature importance weights...

   Feature weights (based on correlation with claims):
      subscription: 0.507 (corr: 0.0808)
      driver      : 0.143 (corr: 0.0227)
      region      : 0.139 (corr: 0.0222)
      vehicle     : 0.123 (corr: 0.0195)
      safety      : 0.088 (corr: 0.0141)

‚úì Overall risk score range: 0.000 to 1.000

üìä Risk category distribution:
risk_category
LOW           4470
MODERATE     18636
HIGH         17306
VERY HIGH    18180
Name: count, dtype: int64
‚úì Created categorical groupings for text generation context


In [54]:
# ========================================================================
# STEP 3: CRITICAL VALIDATION - Risk Scores Must Make Sense!
# ========================================================================
print("\n" + "="*70)
print("STEP 3: VALIDATING RISK SCORES")
print("="*70)

claim_mask = df['claim_status'] == 1
no_claim_mask = df['claim_status'] == 0

print(f"\n‚úÖ OVERALL RISK SCORE VALIDATION:")
claim_risk = df[claim_mask]['overall_risk_score'].mean()
no_claim_risk = df[no_claim_mask]['overall_risk_score'].mean()
difference = claim_risk - no_claim_risk

print(f"   Claims avg risk:     {claim_risk:.4f}")
print(f"   No-claims avg risk:  {no_claim_risk:.4f}")
print(f"   Difference:          {difference:+.4f} {'‚úÖ CORRECT!' if difference > 0 else '‚ùå ERROR!'}")

if difference <= 0:
    print(f"\n   ‚ö†Ô∏è  WARNING: Risk scores are inverted or flat!")
    print(f"   This means the model won't learn meaningful patterns.")

print(f"\nüìä Component-wise validation:")
for score_col in ['subscription_risk_score', 'driver_risk_score', 'vehicle_risk_score', 
                  'region_risk_score', 'safety_score']:
    claim_avg = df[claim_mask][score_col].mean()
    no_claim_avg = df[no_claim_mask][score_col].mean()
    diff = claim_avg - no_claim_avg
    status = '‚úÖ' if diff > 0 else '‚ö†Ô∏è'
    print(f"   {score_col:25s}: {diff:+.4f} {status}")



STEP 3: VALIDATING RISK SCORES

‚úÖ OVERALL RISK SCORE VALIDATION:
   Claims avg risk:     0.6630
   No-claims avg risk:  0.5815
   Difference:          +0.0815 ‚úÖ CORRECT!

üìä Component-wise validation:
   subscription_risk_score  : +0.1306 ‚úÖ
   driver_risk_score        : +0.0343 ‚úÖ
   vehicle_risk_score       : +0.0356 ‚úÖ
   region_risk_score        : +0.0284 ‚úÖ
   safety_score             : +0.0228 ‚úÖ


### STRATIFIED DATA SPLITTING
- **Purpose:** Split data while preserving class distribution
- **Strategy:** 70% train / 15% validation / 15% test
- **Why:** Prevents data leakage and ensures honest evaluation

In [55]:

# ========================================================================
# STEP 4: STRATIFIED DATA SPLITTING
# ========================================================================
print("\n" + "="*70)
print("STEP 4: STRATIFIED DATA SPLITTING")
print("="*70)

# Split BEFORE any balancing to maintain realistic test set
train_df, test_df = train_test_split(
    df,
    test_size=0.15,
    stratify=df['claim_status'],
    random_state=42
)

train_df, val_df = train_test_split(
    train_df,
    test_size=0.1765,  # 0.15 of remaining = 0.15 total validation
    stratify=train_df['claim_status'],
    random_state=42
)

print(f"‚úì Train set: {len(train_df):,} records ({(train_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"‚úì Val set:   {len(val_df):,} records ({(val_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"‚úì Test set:  {len(test_df):,} records ({(test_df['claim_status']==1).mean()*100:.2f}% claims)")

# Validate splits maintain risk score patterns
print(f"\nüìä Risk score validation across splits:")
for split_name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    claim_r = split_df[split_df['claim_status']==1]['overall_risk_score'].mean()
    no_claim_r = split_df[split_df['claim_status']==0]['overall_risk_score'].mean()
    diff = claim_r - no_claim_r
    status = '‚úÖ' if diff > 0.01 else '‚ö†Ô∏è'
    print(f"   {split_name:5s}: Claims {claim_r:.3f} vs No-Claims {no_claim_r:.3f} = {diff:+.3f} {status}")



STEP 4: STRATIFIED DATA SPLITTING
‚úì Train set: 41,012 records (6.40% claims)
‚úì Val set:   8,791 records (6.39% claims)
‚úì Test set:  8,789 records (6.39% claims)

üìä Risk score validation across splits:
   Train: Claims 0.660 vs No-Claims 0.582 = +0.079 ‚úÖ
   Val  : Claims 0.661 vs No-Claims 0.584 = +0.078 ‚úÖ
   Test : Claims 0.677 vs No-Claims 0.578 = +0.099 ‚úÖ


### HANDLING CLASS IMBALANCE FOR RAG

- **Purpose:** Balance training data for better retrieval
- **Method:** Intelligent duplication stratified by risk category
- **Target:** 20% claims (up from 6.4%)
- **Why:** RAG needs enough claim examples to retrieve from

In [56]:
# ========================================================================
# STEP 5: HANDLE CLASS IMBALANCE - RANDOM UNDERSAMPLING
# ========================================================================
print("\n" + "="*70)
print("STEP 5: BALANCING TRAINING DATA (TARGET: 20% CLAIMS)")
print("="*70)

# Separate majority and minority classes
train_majority = train_df[train_df['claim_status'] == 0]
train_minority = train_df[train_df['claim_status'] == 1]

print(f"\nBefore balancing:")
print(f"   Claims:     {len(train_minority):,} ({len(train_minority)/len(train_df)*100:.2f}%)")
print(f"   No Claims:  {len(train_majority):,} ({len(train_majority)/len(train_df)*100:.2f}%)")

# Calculate how many no-claim samples we need for 20% claim rate
# Formula: minority / (minority + majority_new) = 0.20
# Solving: majority_new = minority / 0.20 - minority = minority * 4
target_majority_size = int(len(train_minority) * 4)

# Randomly undersample majority class
train_majority_undersampled = train_majority.sample(
    n=target_majority_size, 
    random_state=42
)

# Combine back
balanced_train_df = pd.concat([
    train_majority_undersampled, 
    train_minority
]).sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nAfter balancing:")
print(f"   Claims:     {len(balanced_train_df[balanced_train_df['claim_status']==1]):,} "
      f"({(balanced_train_df['claim_status']==1).mean()*100:.2f}%)")
print(f"   No Claims:  {len(balanced_train_df[balanced_train_df['claim_status']==0]):,} "
      f"({(balanced_train_df['claim_status']==0).mean()*100:.2f}%)")
print(f"   Total:      {len(balanced_train_df):,} records")

# Validate balanced data maintains risk patterns
claim_risk_balanced = balanced_train_df[balanced_train_df['claim_status']==1]['overall_risk_score'].mean()
no_claim_risk_balanced = balanced_train_df[balanced_train_df['claim_status']==0]['overall_risk_score'].mean()
diff_balanced = claim_risk_balanced - no_claim_risk_balanced

print(f"\n‚úÖ Risk score validation after balancing:")
print(f"   Claims:     {claim_risk_balanced:.4f}")
print(f"   No Claims:  {no_claim_risk_balanced:.4f}")
print(f"   Difference: {diff_balanced:+.4f} {'‚úÖ MAINTAINED' if diff_balanced > 0.01 else '‚ö†Ô∏è LOST'}")



STEP 5: BALANCING TRAINING DATA (TARGET: 20% CLAIMS)

Before balancing:
   Claims:     2,624 (6.40%)
   No Claims:  38,388 (93.60%)

After balancing:
   Claims:     2,624 (20.00%)
   No Claims:  10,496 (80.00%)
   Total:      13,120 records

‚úÖ Risk score validation after balancing:
   Claims:     0.6603
   No Claims:  0.5806
   Difference: +0.0797 ‚úÖ MAINTAINED


### SAVING PREPROCESSED DATA

**Output files:**
- train_balanced.csv (for embeddings & FAISS index)
- validation.csv (for tuning)
- test.csv (final evaluation only)

In [57]:

# ========================================================================
# STEP 6: SAVE PROCESSED DATA
# ========================================================================
print("\n" + "="*70)
print("STEP 6: SAVING PROCESSED DATA")
print("="*70)

# Save to processed folder
train_df.to_csv('../data/processed/train.csv', index=False)
balanced_train_df.to_csv('../data/processed/train_balanced.csv', index=False)
val_df.to_csv('../data/processed/validation.csv', index=False)
test_df.to_csv('../data/processed/test.csv', index=False)

print(f"Saved trained data with no balancing:  ../data/processed/train.csv")
print(f"‚úÖ Saved balanced training data:   ../data/processed/train_balanced.csv")
print(f"‚úÖ Saved validation data:          ../data/processed/validation.csv")
print(f"‚úÖ Saved test data:                ../data/processed/test.csv")



STEP 6: SAVING PROCESSED DATA
Saved trained data with no balancing:  ../data/processed/train.csv
‚úÖ Saved balanced training data:   ../data/processed/train_balanced.csv
‚úÖ Saved validation data:          ../data/processed/validation.csv
‚úÖ Saved test data:                ../data/processed/test.csv


In [58]:
# ========================================================================
# STEP 7: FINAL SUMMARY
# ========================================================================
print("\n" + "="*70)
print("‚úÖ PREPROCESSING COMPLETE - READY FOR TEXT GENERATION")
print("="*70)

print(f"""
üìä FINAL STATISTICS:
   Training:   {len(balanced_train_df):,} records (20.0% claims) - BALANCED
   Validation: {len(val_df):,} records ({(val_df['claim_status']==1).mean()*100:.1f}% claims) - REALISTIC
   Test:       {len(test_df):,} records ({(test_df['claim_status']==1).mean()*100:.1f}% claims) - REALISTIC

üéØ RISK SCORES:
   ‚úÖ Based on ACTUAL claim patterns in data
   ‚úÖ Claims have HIGHER risk scores than no-claims
   ‚úÖ Weights determined by correlation strength
   ‚úÖ Validated across all splits

üìù FEATURES READY FOR TEXT GENERATION:
   ‚úÖ 5 granular risk scores (driver, vehicle, subscription, region, safety)
   ‚úÖ 1 overall weighted risk score
   ‚úÖ Risk categories (LOW, MODERATE, HIGH, VERY HIGH)
   ‚úÖ Age groups, vehicle age groups, subscription categories
   ‚úÖ All safety features preserved

üöÄ NEXT STEP: Text generation using these validated risk scores
""")


‚úÖ PREPROCESSING COMPLETE - READY FOR TEXT GENERATION

üìä FINAL STATISTICS:
   Training:   13,120 records (20.0% claims) - BALANCED
   Validation: 8,791 records (6.4% claims) - REALISTIC
   Test:       8,789 records (6.4% claims) - REALISTIC

üéØ RISK SCORES:
   ‚úÖ Based on ACTUAL claim patterns in data
   ‚úÖ Claims have HIGHER risk scores than no-claims
   ‚úÖ Weights determined by correlation strength
   ‚úÖ Validated across all splits

üìù FEATURES READY FOR TEXT GENERATION:
   ‚úÖ 5 granular risk scores (driver, vehicle, subscription, region, safety)
   ‚úÖ 1 overall weighted risk score
   ‚úÖ Risk categories (LOW, MODERATE, HIGH, VERY HIGH)
   ‚úÖ Age groups, vehicle age groups, subscription categories
   ‚úÖ All safety features preserved

üöÄ NEXT STEP: Text generation using these validated risk scores

