# Data Preprocessing Pipeline

## Overview
This preprocessing pipeline transforms raw insurance data into ML-ready features with **empirically-derived risk scores** based on actual claim patterns observed in the data.

## Key Principles
- **Data-Driven**: Risk scores calculated from actual claim rates (not assumptions)
- **Validated**: Policies with claims have a higher mean overall risk score than policies without claims (0.6630 vs. 0.5815, a difference of +0.0815 on the 0-1 scale)
- **Stratified**: Test/validation sets preserve real-world distribution (6.4% claims)
- **Balanced Training**: Undersampled to 20% claims for better model learning



## Pipeline Steps
1. **Load Data** - Import cleaned dataset (58,592 policies)
2. **Feature Engineering** - Create empirical risk scores from observed patterns
3. **Validation** - Verify risk scores align with actual outcomes
4. **Stratified Split** - Create train/val/test sets (70/15/15)
5. **Balance Training** - Undersample majority class to 20% claim rate
6. **Save Outputs** - Export processed datasets

## Output Files
- `train.csv` - 41,012 records (6.4% claims), the unbalanced training split
- `train_balanced.csv` - 13,120 records (20% claims) for training
- `validation.csv` - 8,791 records (6.4% claims) for tuning
- `test.csv` - 8,789 records (6.4% claims) for evaluation

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

### LOAD CLEANED DATA


- **Purpose:** Load the cleaned insurance dataset and verify data integrity
- **Input:** data/processed/cleaned_data.csv
- **Output:** DataFrame with 58,592 policies

In [5]:
# ========================================================================
# STEP 1: LOAD DATA
# ========================================================================
print("\n" + "="*70)
print("STEP 1: LOADING CLEANED DATA")
print("="*70)

df = pd.read_csv('../data/processed/cleaned_data.csv')
print(f"✓ Loaded {len(df):,} records with {len(df.columns)} columns")
print(f"✓ Claim rate: {(df['claim_status']==1).mean()*100:.2f}%")



STEP 1: LOADING CLEANED DATA
✓ Loaded 58,592 records with 41 columns
✓ Claim rate: 6.40%


In [41]:
df.head()

Unnamed: 0,policy_id,subscription_length,vehicle_age,customer_age,region_code,region_density,segment,model,fuel_type,max_torque,...,is_brake_assist,is_power_door_locks,is_central_locking,is_power_steering,is_driver_seat_height_adjustable,is_day_night_rear_view_mirror,is_ecw,is_speed_alert,ncap_rating,claim_status
0,POL045360,9.3,1.2,41,C8,8794,C2,M4,Diesel,250Nm@2750rpm,...,1,1,1,1,1,0,1,1,3,0
1,POL016745,8.2,1.8,35,C2,27003,C1,M9,Diesel,200Nm@1750rpm,...,0,1,1,1,1,1,1,1,4,0
2,POL007194,9.5,0.2,44,C8,8794,C2,M4,Diesel,250Nm@2750rpm,...,1,1,1,1,1,0,1,1,3,0
3,POL018146,5.2,0.4,44,C10,73430,A,M1,CNG,60Nm@3500rpm,...,0,0,0,1,0,0,0,1,0,0
4,POL049011,10.1,1.0,56,C13,5410,B2,M5,Diesel,200Nm@3000rpm,...,0,1,1,1,0,0,1,1,5,0


### FEATURE ENGINEERING

- **Purpose:** Create composite risk scores and categorical bins
- **Why:** Enriches text summaries with meaningful risk context
- **Output:** 6 risk scores + 4 categorical groupings

In [6]:

# ========================================================================
# STEP 2: DATA-DRIVEN FEATURE ENGINEERING
# ========================================================================
print("\n" + "="*70)
print("STEP 2: DATA-DRIVEN FEATURE ENGINEERING")
print("="*70)

def calculate_empirical_risk_score(feature_col, target_col, n_bins=5):
    """
    Calculate risk score based on ACTUAL claim rates observed in the data.
    This ensures risk scores reflect reality, not assumptions.
    
    Args:
        feature_col: The feature to bin and analyze
        target_col: The target variable (claim_status)
        n_bins: Number of bins to create
    
    Returns:
        Normalized risk score (0-1) where higher = higher observed claim rate
    """
    # Create bins (quantile-based for even distribution)
    try:
        feature_binned = pd.qcut(feature_col, q=n_bins, duplicates='drop')
    except ValueError:
        # If qcut fails (e.g., too few unique values), fall back to equal-width bins
        feature_binned = pd.cut(feature_col, bins=n_bins)
    
    # Create a temporary dataframe to calculate claim rates per bin
    temp_df = pd.DataFrame({
        'bin': feature_binned,
        'target': target_col
    })
    
    # Calculate actual claim rate in each bin
    bin_claim_rates = temp_df.groupby('bin', observed=True)['target'].mean()
    
    # Map claim rates back to original data (convert to numeric)
    risk_scores = feature_binned.map(bin_claim_rates).astype(float)
    
    # Normalize to 0-1 scale
    min_rate = risk_scores.min()
    max_rate = risk_scores.max()
    
    if max_rate > min_rate:
        normalized_scores = (risk_scores - min_rate) / (max_rate - min_rate)
    else:
        # If all bins have same rate, return middle value
        normalized_scores = pd.Series(0.5, index=risk_scores.index)
    
    return normalized_scores

print("\n📊 Creating empirical risk scores based on ACTUAL claim patterns...")

# 2.1 Customer Age Risk (based on YOUR EDA showing 56+ has 7.54% claims)
df['driver_risk_score'] = calculate_empirical_risk_score(
    df['customer_age'], 
    df['claim_status'], 
    n_bins=5
)

# 2.2 Vehicle Age Risk (based on YOUR EDA showing 0-3yrs has 6.12% claims)
df['vehicle_risk_score'] = calculate_empirical_risk_score(
    df['vehicle_age'], 
    df['claim_status'], 
    n_bins=3
)

# 2.3 Subscription Length Risk (YOUR HIGHEST CORRELATION: 0.078738)
df['subscription_risk_score'] = calculate_empirical_risk_score(
    df['subscription_length'], 
    df['claim_status'], 
    n_bins=5
)

# 2.4 Region Density Risk
df['region_risk_score'] = calculate_empirical_risk_score(
    df['region_density'], 
    df['claim_status'], 
    n_bins=5
)

# 2.5 Safety Features Risk (composite of all safety features)
# First create a safety composite score
df['safety_composite'] = (
    df['airbags']/6 + 
    df['is_esc'] + 
    df['is_brake_assist'] + 
    df['is_parking_sensors'] + 
    df['is_tpms'] + 
    df['ncap_rating']/5
) / 6

df['safety_score'] = calculate_empirical_risk_score(
    df['safety_composite'], 
    df['claim_status'], 
    n_bins=5
)

print("✓ Created 5 empirical risk scores")

# 2.6 Calculate correlation-based weights
print("\n📊 Calculating feature importance weights...")

correlations = {
    'subscription': abs(df['subscription_risk_score'].corr(df['claim_status'])),
    'driver': abs(df['driver_risk_score'].corr(df['claim_status'])),
    'vehicle': abs(df['vehicle_risk_score'].corr(df['claim_status'])),
    'region': abs(df['region_risk_score'].corr(df['claim_status'])),
    'safety': abs(df['safety_score'].corr(df['claim_status']))
}

# Normalize weights to sum to 1
total_corr = sum(correlations.values())
weights = {k: v/total_corr for k, v in correlations.items()}

print(f"\n   Feature weights (based on correlation with claims):")
for feature, weight in sorted(weights.items(), key=lambda x: x[1], reverse=True):
    print(f"      {feature:12s}: {weight:.3f} (corr: {correlations[feature]:.4f})")

# 2.7 Create weighted overall risk score
df['overall_risk_score'] = (
    weights['subscription'] * df['subscription_risk_score'] +
    weights['driver'] * df['driver_risk_score'] +
    weights['vehicle'] * df['vehicle_risk_score'] +
    weights['region'] * df['region_risk_score'] +
    weights['safety'] * df['safety_score']
)

print(f"\n✓ Overall risk score range: {df['overall_risk_score'].min():.3f} to {df['overall_risk_score'].max():.3f}")

# 2.8 Create risk categories
df['risk_category'] = pd.cut(
    df['overall_risk_score'],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['LOW', 'MODERATE', 'HIGH', 'VERY HIGH'],
    include_lowest=True
)

print("\n📊 Risk category distribution:")
print(df['risk_category'].value_counts().sort_index())

# 2.9 Create contextual categorical features (for text generation)
df['age_group'] = pd.cut(
    df['customer_age'],
    bins=[0, 35, 45, 55, 100],
    labels=['young', 'middle_aged', 'mature', 'senior']
)

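df['vehicle_age_group'] = pd.cut(
    df['vehicle_age'],
    bins=[0, 3, 7, 100],
    labels=['new', 'moderate', 'old'],
    include_lowest=True  # keep vehicles with age exactly 0 in the 'new' bin
)

df['subscription_category'] = pd.cut(
    df['subscription_length'],
    bins=[0, 3, 6, 9, 100],
    labels=['very_short', 'short', 'medium', 'long'],
    include_lowest=True  # keep zero-length subscriptions in the first bin
)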

print("✓ Created categorical groupings for text generation context")



STEP 2: DATA-DRIVEN FEATURE ENGINEERING

📊 Creating empirical risk scores based on ACTUAL claim patterns...
✓ Created 5 empirical risk scores

📊 Calculating feature importance weights...

   Feature weights (based on correlation with claims):
      subscription: 0.507 (corr: 0.0808)
      driver      : 0.143 (corr: 0.0227)
      region      : 0.139 (corr: 0.0222)
      vehicle     : 0.123 (corr: 0.0195)
      safety      : 0.088 (corr: 0.0141)

✓ Overall risk score range: 0.000 to 1.000

📊 Risk category distribution:
risk_category
LOW           4470
MODERATE     18636
HIGH         17306
VERY HIGH    18180
Name: count, dtype: int64
✓ Created categorical groupings for text generation context
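
The weight normalization in step 2.6 can be reproduced by hand from the correlations printed above. The following is a minimal sketch that plugs in those rounded values; because the inputs are rounded to four decimals, the third decimal of a weight may differ slightly from the notebook output.

```python
# Absolute correlations as printed in the Step 2 output (rounded to 4 decimals)
correlations = {
    "subscription": 0.0808,
    "driver": 0.0227,
    "region": 0.0222,
    "vehicle": 0.0195,
    "safety": 0.0141,
}

# Divide each correlation by the total so the weights sum to 1
total_corr = sum(correlations.values())
weights = {name: corr / total_corr for name, corr in correlations.items()}

# Print weights from largest to smallest
for name, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s}: {weight:.3f}")
```

Since the subscription score carries roughly half of the total weight, the overall risk score is dominated by subscription length.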


In [7]:
# ========================================================================
# STEP 3: CRITICAL VALIDATION - Risk Scores Must Make Sense!
# ========================================================================
print("\n" + "="*70)
print("STEP 3: VALIDATING RISK SCORES")
print("="*70)

claim_mask = df['claim_status'] == 1
no_claim_mask = df['claim_status'] == 0

print("\n✅ OVERALL RISK SCORE VALIDATION:")
claim_risk = df[claim_mask]['overall_risk_score'].mean()
no_claim_risk = df[no_claim_mask]['overall_risk_score'].mean()
difference = claim_risk - no_claim_risk

print(f"   Claims avg risk:     {claim_risk:.4f}")
print(f"   No-claims avg risk:  {no_claim_risk:.4f}")
print(f"   Difference:          {difference:+.4f} {'✅ CORRECT!' if difference > 0 else '❌ ERROR!'}")

if difference <= 0:
    print("\n   ⚠️  WARNING: Risk scores are inverted or flat!")
    print("   This means the model won't learn meaningful patterns.")

print("\n📊 Component-wise validation:")
for score_col in ['subscription_risk_score', 'driver_risk_score', 'vehicle_risk_score', 
                  'region_risk_score', 'safety_score']:
    claim_avg = df[claim_mask][score_col].mean()
    no_claim_avg = df[no_claim_mask][score_col].mean()
    diff = claim_avg - no_claim_avg
    status = '✅' if diff > 0 else '⚠️'
    print(f"   {score_col:25s}: {diff:+.4f} {status}")



STEP 3: VALIDATING RISK SCORES

✅ OVERALL RISK SCORE VALIDATION:
   Claims avg risk:     0.6630
   No-claims avg risk:  0.5815
   Difference:          +0.0815 ✅ CORRECT!

📊 Component-wise validation:
   subscription_risk_score  : +0.1306 ✅
   driver_risk_score        : +0.0343 ✅
   vehicle_risk_score       : +0.0356 ✅
   region_risk_score        : +0.0284 ✅
   safety_score             : +0.0228 ✅


### STRATIFIED DATA SPLITTING
- **Purpose:** Split data while preserving class distribution
- **Strategy:** 70% train / 15% validation / 15% test
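- **Why:** Splitting before any balancing keeps the validation and test sets at the real-world claim rate, so evaluation stays honest. Note that the risk scores in Step 2 were computed on the full dataset before this split, so they already incorporate claim outcomes from validation and test rows.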
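
The 70/15/15 strategy is implemented as two consecutive splits, so the second split needs a fraction of the remainder rather than of the full dataset. The sketch below works through that arithmetic; `second_stage_fraction` is a hypothetical helper, and the assumption that held-out counts are rounded up is chosen because it reproduces the split sizes reported by the notebook.

```python
import math


def second_stage_fraction(val_share: float, test_share: float) -> float:
    """Fraction of the post-test remainder that equals val_share of the full data."""
    return val_share / (1.0 - test_share)


n_total = 58_592  # policies in the cleaned dataset

frac = second_stage_fraction(0.15, 0.15)  # 0.15 / 0.85, about 0.1765

# Assumption: held-out counts are rounded up
n_test = math.ceil(0.15 * n_total)
n_rest = n_total - n_test
n_val = math.ceil(0.1765 * n_rest)
n_train = n_rest - n_val

print(f"second-stage fraction: {frac:.4f}")
print(f"train={n_train:,}  val={n_val:,}  test={n_test:,}")
```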

In [8]:

# ========================================================================
# STEP 4: STRATIFIED DATA SPLITTING
# ========================================================================
print("\n" + "="*70)
print("STEP 4: STRATIFIED DATA SPLITTING")
print("="*70)

# Split BEFORE any balancing to maintain realistic test set
train_df, test_df = train_test_split(
    df,
    test_size=0.15,
    stratify=df['claim_status'],
    random_state=42
)

train_df, val_df = train_test_split(
    train_df,
    test_size=0.1765,  # 0.15 / 0.85: 17.65% of the remaining 85% is about 15% of the full dataset
    stratify=train_df['claim_status'],
    random_state=42
)

print(f"✓ Train set: {len(train_df):,} records ({(train_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"✓ Val set:   {len(val_df):,} records ({(val_df['claim_status']==1).mean()*100:.2f}% claims)")
print(f"✓ Test set:  {len(test_df):,} records ({(test_df['claim_status']==1).mean()*100:.2f}% claims)")

# Validate splits maintain risk score patterns
print("\n📊 Risk score validation across splits:")
for split_name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    claim_r = split_df[split_df['claim_status']==1]['overall_risk_score'].mean()
    no_claim_r = split_df[split_df['claim_status']==0]['overall_risk_score'].mean()
    diff = claim_r - no_claim_r
    status = '✅' if diff > 0.01 else '⚠️'
    print(f"   {split_name:5s}: Claims {claim_r:.3f} vs No-Claims {no_claim_r:.3f} = {diff:+.3f} {status}")



STEP 4: STRATIFIED DATA SPLITTING
✓ Train set: 41,012 records (6.40% claims)
✓ Val set:   8,791 records (6.39% claims)
✓ Test set:  8,789 records (6.39% claims)

📊 Risk score validation across splits:
   Train: Claims 0.660 vs No-Claims 0.582 = +0.079 ✅
   Val  : Claims 0.661 vs No-Claims 0.584 = +0.078 ✅
   Test : Claims 0.677 vs No-Claims 0.578 = +0.099 ✅
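
Because the risk scores in Step 2 are derived from `claim_status` over the full dataset, the validation and test scores have seen their own labels. One possible variant, sketched here under assumptions and not part of the notebook, learns bin edges and per-bin claim rates from training rows only and then applies them to held-out rows. `fit_bin_rates` and `apply_bin_rates` are hypothetical helper names, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def fit_bin_rates(feature, target, n_bins=5):
    """Learn quantile bin edges and per-bin claim rates from training rows only."""
    # retbins=True also returns the bin edges so they can be reused later
    _, edges = pd.qcut(feature, q=n_bins, duplicates="drop", retbins=True)
    edges = np.asarray(edges, dtype=float).copy()
    edges[0], edges[-1] = -np.inf, np.inf  # cover values outside the training range
    codes = pd.cut(feature, bins=edges, labels=False)  # integer bin index per row
    rates = target.groupby(codes).mean()  # observed claim rate per bin
    return edges, rates


def apply_bin_rates(feature, edges, rates):
    """Score any split using the edges and rates learned on the training split."""
    codes = pd.cut(feature, bins=edges, labels=False)
    raw = codes.map(rates).astype(float)
    lo, hi = rates.min(), rates.max()
    if hi > lo:
        return (raw - lo) / (hi - lo)  # min-max scale with training rates
    return pd.Series(0.5, index=feature.index)  # flat rates: neutral score


# Synthetic training data: claim probability rises with the feature value
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.uniform(0, 10, 2000)})
train["y"] = (rng.uniform(0, 1, 2000) < 0.02 + 0.01 * train["x"]).astype(int)

# Held-out rows, including values outside the training range
held_out = pd.DataFrame({"x": [-5.0, 2.5, 7.5, 50.0]})

edges, rates = fit_bin_rates(train["x"], train["y"])
held_out["score"] = apply_bin_rates(held_out["x"], edges, rates)
print(held_out)
```

Using this variant in the pipeline would mean running the split first, calling the fit step on the training split, and applying the learned edges and rates to the validation and test splits.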


### HANDLING CLASS IMBALANCE FOR RAG

- **Purpose:** Balance training data for better retrieval
- **Method:** Random undersampling of the no-claim majority class (all claim records are kept)
- **Target:** 20% claims (up from 6.4%)
- **Why:** RAG needs enough claim examples to retrieve from

In [9]:
# ========================================================================
# STEP 5: HANDLE CLASS IMBALANCE - RANDOM UNDERSAMPLING
# ========================================================================
print("\n" + "="*70)
print("STEP 5: BALANCING TRAINING DATA (TARGET: 20% CLAIMS)")
print("="*70)

# Separate majority and minority classes
train_majority = train_df[train_df['claim_status'] == 0]
train_minority = train_df[train_df['claim_status'] == 1]

print(f"\nBefore balancing:")
print(f"   Claims:     {len(train_minority):,} ({len(train_minority)/len(train_df)*100:.2f}%)")
print(f"   No Claims:  {len(train_majority):,} ({len(train_majority)/len(train_df)*100:.2f}%)")

# Calculate how many no-claim samples we need for 20% claim rate
# Formula: minority / (minority + majority_new) = 0.20
# Solving: majority_new = minority / 0.20 - minority = minority * 4
target_majority_size = int(len(train_minority) * 4)

# Randomly undersample majority class
train_majority_undersampled = train_majority.sample(
    n=target_majority_size, 
    random_state=42
)

# Combine back
balanced_train_df = pd.concat([
    train_majority_undersampled, 
    train_minority
]).sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nAfter balancing:")
print(f"   Claims:     {len(balanced_train_df[balanced_train_df['claim_status']==1]):,} "
      f"({(balanced_train_df['claim_status']==1).mean()*100:.2f}%)")
print(f"   No Claims:  {len(balanced_train_df[balanced_train_df['claim_status']==0]):,} "
      f"({(balanced_train_df['claim_status']==0).mean()*100:.2f}%)")
print(f"   Total:      {len(balanced_train_df):,} records")

# Validate balanced data maintains risk patterns
claim_risk_balanced = balanced_train_df[balanced_train_df['claim_status']==1]['overall_risk_score'].mean()
no_claim_risk_balanced = balanced_train_df[balanced_train_df['claim_status']==0]['overall_risk_score'].mean()
diff_balanced = claim_risk_balanced - no_claim_risk_balanced

print("\n✅ Risk score validation after balancing:")
print(f"   Claims:     {claim_risk_balanced:.4f}")
print(f"   No Claims:  {no_claim_risk_balanced:.4f}")
print(f"   Difference: {diff_balanced:+.4f} {'✅ MAINTAINED' if diff_balanced > 0.01 else '⚠️ LOST'}")



STEP 5: BALANCING TRAINING DATA (TARGET: 20% CLAIMS)

Before balancing:
   Claims:     2,624 (6.40%)
   No Claims:  38,388 (93.60%)

After balancing:
   Claims:     2,624 (20.00%)
   No Claims:  10,496 (80.00%)
   Total:      13,120 records

✅ Risk score validation after balancing:
   Claims:     0.6603
   No Claims:  0.5806
   Difference: +0.0797 ✅ MAINTAINED
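
The factor of 4 in Step 5 is the special case of a general formula: for a target claim rate r, the number of no-claim rows to keep is minority × (1 − r) / r. A minimal sketch, using a hypothetical helper `majority_size_for_rate` and the training claim count printed above:

```python
def majority_size_for_rate(n_minority: int, target_rate: float) -> int:
    """Majority rows to keep so the minority class makes up target_rate of the total."""
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")
    # minority / (minority + majority) = r  =>  majority = minority * (1 - r) / r
    return round(n_minority * (1.0 - target_rate) / target_rate)


n_claims = 2_624  # claim records in the training split
n_no_claims = majority_size_for_rate(n_claims, 0.20)
n_balanced = n_claims + n_no_claims

print(f"no-claim rows to keep: {n_no_claims:,}")
print(f"balanced total:        {n_balanced:,}")
print(f"claim rate:            {n_claims / n_balanced:.2%}")
```

Changing the target rate, for example to 0.25, only changes the argument; the rest of the undersampling step stays the same.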


### SAVING PREPROCESSED DATA

**Output files:**
- train.csv (unbalanced training split, kept for reference)
- train_balanced.csv (for embeddings & FAISS index)
- validation.csv (for tuning)
- test.csv (final evaluation only)

In [None]:

# ========================================================================
# STEP 6: SAVE PROCESSED DATA
# ========================================================================
print("\n" + "="*70)
print("STEP 6: SAVING PROCESSED DATA")
print("="*70)

# Save to processed folder
train_df.to_csv('../data/processed/train.csv', index=False)
balanced_train_df.to_csv('../data/processed/train_balanced.csv', index=False)
val_df.to_csv('../data/processed/validation.csv', index=False)
test_df.to_csv('../data/processed/test.csv', index=False)

print("✅ Saved unbalanced training data: ../data/processed/train.csv")
print("✅ Saved balanced training data:   ../data/processed/train_balanced.csv")
print("✅ Saved validation data:          ../data/processed/validation.csv")
print("✅ Saved test data:                ../data/processed/test.csv")



STEP 6: SAVING PROCESSED DATA
Saved trained data with no balancing  ./data/processed/train.csv
✅ Saved balanced training data:   ../data/processed/train_balanced.csv
✅ Saved validation data:          ../data/processed/validation.csv
✅ Saved test data:                ../data/processed/test.csv


In [11]:
# ========================================================================
# STEP 7: FINAL SUMMARY
# ========================================================================
print("\n" + "="*70)
print("✅ PREPROCESSING COMPLETE - READY FOR TEXT GENERATION")
print("="*70)

print(f"""
📊 FINAL STATISTICS:
   Training:   {len(balanced_train_df):,} records (20.0% claims) - BALANCED
   Validation: {len(val_df):,} records ({(val_df['claim_status']==1).mean()*100:.1f}% claims) - REALISTIC
   Test:       {len(test_df):,} records ({(test_df['claim_status']==1).mean()*100:.1f}% claims) - REALISTIC

🎯 RISK SCORES:
   ✅ Based on ACTUAL claim patterns in data
   ✅ Claims have HIGHER risk scores than no-claims
   ✅ Weights determined by correlation strength
   ✅ Validated across all splits

📝 FEATURES READY FOR TEXT GENERATION:
   ✅ 5 granular risk scores (driver, vehicle, subscription, region, safety)
   ✅ 1 overall weighted risk score
   ✅ Risk categories (LOW, MODERATE, HIGH, VERY HIGH)
   ✅ Age groups, vehicle age groups, subscription categories
   ✅ All safety features preserved

🚀 NEXT STEP: Text generation using these validated risk scores
""")


✅ PREPROCESSING COMPLETE - READY FOR TEXT GENERATION

📊 FINAL STATISTICS:
   Training:   13,120 records (20.0% claims) - BALANCED
   Validation: 8,791 records (6.4% claims) - REALISTIC
   Test:       8,789 records (6.4% claims) - REALISTIC

🎯 RISK SCORES:
   ✅ Based on ACTUAL claim patterns in data
   ✅ Claims have HIGHER risk scores than no-claims
   ✅ Weights determined by correlation strength
   ✅ Validated across all splits

📝 FEATURES READY FOR TEXT GENERATION:
   ✅ 5 granular risk scores (driver, vehicle, subscription, region, safety)
   ✅ 1 overall weighted risk score
   ✅ Risk categories (LOW, MODERATE, HIGH, VERY HIGH)
   ✅ Age groups, vehicle age groups, subscription categories
   ✅ All safety features preserved

🚀 NEXT STEP: Text generation using these validated risk scores

