# Create Human-Readable Summaries

## Text Generation: Turning Data Into Stories

**What we're doing:** Converting each policy from a spreadsheet row of numbers and codes into a natural language narrative that reads like an underwriter's case file.

**Why?** AI embedding models understand *meaning* in sentences better than raw spreadsheet cells. 

**The idea:** We want the embedding model to read each policy the way an underwriter reads a case file, rather than the way a calculator crunches numbers.


In [1]:
import pandas as pd
import numpy as np

## 1. Load Preprocessed Data

In [2]:
# ========================================================================
# LOAD PREPROCESSED DATA WITH RISK SCORES
# ========================================================================
print("="*70)
print("LOADING PREPROCESSED DATA")
print("="*70)

# Load the data that already has risk scores calculated
df = pd.read_csv('../data/processed/train_hybrid.csv')
print(f"✓ Loaded {len(df):,} policies")
print(f"✓ Claim rate: {(df['claim_status']==1).mean()*100:.1f}%")
print(f"✓ Features available: {len(df.columns)} columns")

# Verify risk scores exist
required_cols = ['overall_risk_score', 'risk_category', 'driver_risk_score', 
                 'subscription_risk_score', 'vehicle_risk_score', 'safety_score']
missing = [col for col in required_cols if col not in df.columns]
if missing:
    print(f"\n⚠️  WARNING: Missing columns: {missing}")
    print("   Please run the preprocessing script first!")
else:
    print("✓ All risk score columns present")

LOADING PREPROCESSED DATA
✓ Loaded 41,014 policies
✓ Claim rate: 6.4%
✓ Features available: 80 columns
✓ All risk score columns present


In [3]:
df.head()

Unnamed: 0,policy_id,subscription_length,vehicle_age,customer_age,region_code,region_density,segment,model,fuel_type,max_torque,...,driver_age_empirical,vehicle_age_actuarial,vehicle_age_empirical,safety_actuarial,airbag_empirical,ncap_empirical,esc_empirical,brake_empirical,safety_empirical,subscription_risk_score
0,POL020365,1.2,2.0,48,C9,17804,B2,M6,Petrol,113Nm@4400rpm,...,0.1371,0.306667,1.0,0.623333,0.063554,0.064994,0.063472,0.066383,0.305614,0.0
1,POL008292,7.3,1.8,64,C5,34738,B2,M6,Petrol,113Nm@4400rpm,...,0.246716,0.296,1.0,0.623333,0.063554,0.064994,0.063472,0.066383,0.305614,0.607964
2,POL014801,9.4,0.0,43,C19,27742,A,M3,Petrol,91Nm@4250rpm,...,0.139185,0.2,,0.643333,0.063554,0.064994,0.063472,0.061026,0.205614,1.0
3,POL022698,4.3,0.2,48,C12,34791,A,M1,CNG,60Nm@3500rpm,...,0.1371,0.210667,0.924933,0.783333,0.063554,0.062418,0.063472,0.061026,0.0,0.607964
4,POL035537,10.4,0.6,46,C8,8794,B2,M6,Petrol,113Nm@4400rpm,...,0.1371,0.232,0.924933,0.623333,0.063554,0.064994,0.063472,0.066383,0.305614,1.0


## 2. Design Summary Template

**Why this matters:** 
- AI embedding models understand *meaning* and *context* in sentences, not just raw numbers
- A "42-year-old driver with a 3-month subscription" carries semantic weight that "customer_age: 42, subscription_length: 3" doesn't
- We're creating training data that mirrors how insurance professionals actually think and communicate

**The transformation:**
```
FROM (structured data):
customer_age: 42, vehicle_age: 1.2, fuel_type: Petrol, segment: B2, 
airbags: 4, ncap_rating: 4, subscription_length: 3, claim_status: 0

TO (human narrative):
"A 42-year-old driver in region C14 operates a 1.2-year-old Petrol B2 
with 4 airbags and 4-star NCAP rating. Short 3-month subscription. 
Risk Score: 0.65. No claim filed."
```


In [4]:
# ========================================================================
# ENHANCED SUMMARY GENERATOR
# ========================================================================
print("\n" + "="*70)
print("CREATING RISK-AWARE POLICY SUMMARIES")
print("="*70)

def create_risk_aware_summary(row):
    """
    Creates human-readable summaries that incorporate:
    1. Risk assessment (overall + component scores)
    2. All statistically significant features
    3. Natural language that an underwriter would use
    4. Contextual risk factors
    """
    
    # ===== RISK CONTEXT =====
    risk_level = row['risk_category']
    overall_risk = row['overall_risk_score']
    
    # Risk descriptor
    risk_descriptors = {
        'LOW': 'low-risk',
        'MODERATE': 'moderate-risk',
        'HIGH': 'elevated-risk',
        'VERY HIGH': 'high-risk'
    }
    risk_desc = risk_descriptors.get(risk_level, 'unknown-risk')
    
    # ===== DRIVER PROFILE =====
    age_group = row.get('age_group', 'unknown')
    age_descriptors = {
        'young': 'young',
        'middle_aged': 'middle-aged',
        'mature': 'mature',
        'senior': 'senior'
    }
    age_desc = age_descriptors.get(age_group, str(row['customer_age']) + '-year-old')
    
    # ===== SUBSCRIPTION ANALYSIS (HIGHEST PREDICTOR) =====
    sub_length = row['subscription_length']
    sub_category = row.get('subscription_category', 'unknown')
    
    sub_descriptors = {
        'very_short': f'very short {sub_length}-month',
        'short': f'short {sub_length}-month',
        'medium': f'{sub_length}-month',
        'long': f'long-term {sub_length}-month'
    }
    sub_desc = sub_descriptors.get(sub_category, f'{sub_length}-month')
    
    # ===== VEHICLE PROFILE =====
    vehicle_age = row['vehicle_age']
    vehicle_age_group = row.get('vehicle_age_group', 'unknown')
    
    vehicle_age_descriptors = {
        'new': f'{vehicle_age:.1f}-year-old (new)',
        'moderate': f'{vehicle_age:.1f}-year-old',
        'old': f'{vehicle_age:.1f}-year-old (aging)'
    }
    vehicle_desc = vehicle_age_descriptors.get(vehicle_age_group, f'{vehicle_age:.1f}-year-old')
    
    # ===== SAFETY FEATURES =====
    safety_features = []
    
    # Critical safety features (identified in the EDA)
    if row['is_esc'] == 1:
        safety_features.append('ESC')
    if row['is_brake_assist'] == 1:
        safety_features.append('brake assist')
    if row['is_parking_sensors'] == 1:
        safety_features.append('parking sensors')
    if row['is_tpms'] == 1:
        safety_features.append('TPMS')
    if row['is_adjustable_steering'] == 1:
        safety_features.append('adjustable steering')
    
    # Format safety text
    if len(safety_features) == 0:
        safety_text = 'minimal safety features'
    elif len(safety_features) <= 2:
        safety_text = f'limited safety features ({", ".join(safety_features)})'
    else:
        safety_text = f'equipped with {", ".join(safety_features)}'
    
    # ===== REGION CONTEXT =====
    density = row['region_density']
    if density > 20000:
        density_desc = 'high-density urban'
    elif density > 15000:
        density_desc = 'moderate-density'
    else:
        density_desc = 'low-density rural'
    
    # ===== BUILD SUMMARY =====
    summary = (
        f"[{risk_level} RISK - Score: {overall_risk:.2f}] "
        f"A {age_desc} driver (age {row['customer_age']}) in {density_desc} region "
        f"{row['region_code']} operates a {vehicle_desc} {row['fuel_type']} "
        f"{row['segment']} {row['model']} with {row['transmission_type'].lower()} transmission. "
        f"The vehicle has {int(row['airbags'])} airbags, {safety_text}, "
        f"and a {int(row['ncap_rating'])}-star NCAP rating. "
        f"Policy holder maintains a {sub_desc} subscription. "
    )
    
    # Add risk factors if elevated/high risk
    if risk_level in ['HIGH', 'VERY HIGH']:
        risk_factors = []
        
        if row['subscription_risk_score'] > 0.6:
            risk_factors.append('short subscription')
        if row['driver_risk_score'] > 0.6:
            risk_factors.append('driver age profile')
        if row['safety_score'] > 0.6:
            risk_factors.append('limited safety features')
        if row['vehicle_risk_score'] > 0.6:
            risk_factors.append('vehicle age')
        
        if risk_factors:
            summary += f"Key risk factors: {', '.join(risk_factors)}. "
    
    # Add outcome
    if row['claim_status'] == 1:
        summary += "CLAIM FILED."
    else:
        summary += "No claim filed."
    
    return summary


def create_concise_summary(row):
    """
    Alternative: More concise version for faster generation
    Focus on key risk indicators only
    """
    risk_level = row['risk_category']
    overall_risk = row['overall_risk_score']
    
    summary = (
        f"[{risk_level}] {row['customer_age']}yo driver, {row['subscription_length']}mo subscription, "
        f"{row['vehicle_age']:.1f}yo {row['fuel_type']} {row['segment']}, "
        f"{row['airbags']} airbags, {row['ncap_rating']}★ NCAP, "
        f"region {row['region_code']}. "
        f"Risk: {overall_risk:.2f}. "
        f"{'CLAIM' if row['claim_status'] == 1 else 'NO CLAIM'}."
    )
    
    return summary


CREATING RISK-AWARE POLICY SUMMARIES


### Why Natural Language Summaries?

**Traditional ML:** Treats `age=32` and `age=33` as bare numbers, with no context about what those ages mean for a driver.

**Embedding Models:** Understand that "32-year-old" and "33-year-old" are *semantically similar* - both are middle-aged, experienced drivers.

**Real power:**
- "6-year-old Petrol Honda" is similar to "5-year-old Petrol Toyota"
- "2 airbags, no ESC" clusters with other low-safety profiles
- Geographic patterns emerge: "urban region R002" matches other urban cases

💡 **Human analogy:** Reading case notes is more insightful than staring at a spreadsheet. AI works the same way.

## 3. Generate All Summaries

In [5]:
# ========================================================================
# GENERATE SUMMARIES
# ========================================================================
print("\n📝 Generating detailed summaries...")

# Test on sample first
sample_row = df.iloc[0]
print("\n" + "-"*70)
print("SAMPLE SUMMARY (Detailed):")
print("-"*70)
print(create_risk_aware_summary(sample_row))

print("\n" + "-"*70)
print("SAMPLE SUMMARY (Concise):")
print("-"*70)
print(create_concise_summary(sample_row))

# Generate for all rows (choose one approach)
print("\n" + "="*70)
print("GENERATING ALL SUMMARIES")
print("="*70)

# Option 1: Detailed summaries (recommended for training)
print("\n⏳ Creating detailed summaries (this may take 1-2 minutes)...")
df['summary'] = df.apply(create_risk_aware_summary, axis=1)

# Option 2: Concise summaries (faster, use if memory constrained)
# print("\n⏳ Creating concise summaries...")
# df['summary'] = df.apply(create_concise_summary, axis=1)

print(f"✓ Created {len(df):,} summaries")



📝 Generating detailed summaries...

----------------------------------------------------------------------
SAMPLE SUMMARY (Detailed):
----------------------------------------------------------------------
[MODERATE RISK - Score: 0.43] A mature driver (age 48) in moderate-density region C9 operates a 2.0-year-old (new) Petrol B2 M6 with manual transmission. The vehicle has 2 airbags, equipped with brake assist, parking sensors, adjustable steering, and a 2-star NCAP rating. Policy holder maintains a very short 1.2-month subscription. No claim filed.

----------------------------------------------------------------------
SAMPLE SUMMARY (Concise):
----------------------------------------------------------------------
[MODERATE] 48yo driver, 1.2mo subscription, 2.0yo Petrol B2, 2 airbags, 2★ NCAP, region C9. Risk: 0.43. NO CLAIM.

GENERATING ALL SUMMARIES

⏳ Creating detailed summaries (this may take 1-2 minutes)...
✓ Created 41,014 summaries


## 4. Quality Check

In [6]:
# ========================================================================
# QUALITY CHECKS
# ========================================================================
print("\n" + "="*70)
print("QUALITY VALIDATION")
print("="*70)

# Check summary lengths
summary_lengths = df['summary'].str.len()
print(f"\n📊 Summary Statistics:")
print(f"   Average length: {summary_lengths.mean():.0f} characters")
print(f"   Min length:     {summary_lengths.min()} characters")
print(f"   Max length:     {summary_lengths.max()} characters")
print(f"   Median length:  {summary_lengths.median():.0f} characters")

# Check for missing summaries
missing_summaries = df['summary'].isna().sum()
print(f"\n✓ Missing summaries: {missing_summaries}")

# Show distribution by risk category
print(f"\n📊 Summaries by Risk Category:")
print(df.groupby('risk_category')['summary'].count())

# Display sample summaries from each risk category
print("\n" + "="*70)
print("SAMPLE SUMMARIES BY RISK CATEGORY")
print("="*70)

for risk_cat in ['LOW', 'MODERATE', 'HIGH', 'VERY HIGH']:
    samples = df[df['risk_category'] == risk_cat].head(1)
    if len(samples) > 0:
        print(f"\n{risk_cat} RISK:")
        print("-" * 70)
        print(samples.iloc[0]['summary'])


QUALITY VALIDATION

📊 Summary Statistics:
   Average length: 385 characters
   Min length:     287 characters
   Max length:     454 characters
   Median length:  386 characters

✓ Missing summaries: 0

📊 Summaries by Risk Category:
risk_category
HIGH         19100
LOW           1413
MODERATE     10667
VERY HIGH     9834
Name: summary, dtype: int64

SAMPLE SUMMARIES BY RISK CATEGORY

LOW RISK:
----------------------------------------------------------------------
[LOW RISK - Score: 0.20] A middle-aged driver (age 36) in high-density urban region C12 operates a 2.6-year-old (new) Petrol B2 M6 with manual transmission. The vehicle has 2 airbags, equipped with brake assist, parking sensors, adjustable steering, and a 2-star NCAP rating. Policy holder maintains a long-term 12.4-month subscription. No claim filed.

MODERATE RISK:
----------------------------------------------------------------------
[MODERATE RISK - Score: 0.43] A mature driver (age 48) in moderate-density region C

**Detailed (385 chars avg):** Comprehensive narratives including risk factors, safety features, and regional context


**Quality controls:**
- Every summary includes the calculated risk score and category
- High/Very High risk summaries explicitly list contributing risk factors
- Consistent structure ensures the AI can learn patterns effectively
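
The structural guarantees listed above can be checked mechanically. The following is a minimal sketch, assuming every detailed summary follows the template shown earlier (a bracketed risk tag with a two-decimal score at the start, a claim outcome at the end). The names `SUMMARY_PATTERN` and `check_summary_structure` are hypothetical helpers, not part of the notebook, and the sample string is an abbreviated version of the MODERATE example printed earlier.

```python
import re

# Pattern assumed from the detailed template:
# "[<CATEGORY> RISK - Score: 0.43] ... No claim filed." or "... CLAIM FILED."
SUMMARY_PATTERN = re.compile(
    r"^\[(LOW|MODERATE|HIGH|VERY HIGH) RISK - Score: (\d\.\d{2})\] "
    r".+ (CLAIM FILED\.|No claim filed\.)$"
)

def check_summary_structure(summary):
    """Return (category, score, claim_filed), or None if the format is off."""
    match = SUMMARY_PATTERN.match(summary)
    if match is None:
        return None
    category, score, outcome = match.groups()
    return category, float(score), outcome == "CLAIM FILED."

# Abbreviated copy of the MODERATE sample summary shown in the output above.
sample = (
    "[MODERATE RISK - Score: 0.43] A mature driver (age 48) in "
    "moderate-density region C9 operates a 2.0-year-old (new) Petrol B2 M6 "
    "with manual transmission. Policy holder maintains a very short "
    "1.2-month subscription. No claim filed."
)
print(check_summary_structure(sample))  # ('MODERATE', 0.43, False)
```

Applied to the full dataset, something like `df['summary'].map(check_summary_structure).isna().sum()` should report how many summaries deviate from the template, and the parsed category and score can be compared against the `risk_category` and `overall_risk_score` columns.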

---



## SUMMARY: Text Generation Results

### Generation Success
- **41,014 summaries created** (100% coverage, zero missing)
- **Average length: 385 characters** (consistent, detailed narratives)
- **Range: 287-454 characters** (a narrow range, as expected from a fixed template)

### 📈 Risk Category Distribution
| Risk Level | Count | Percentage |
|------------|-------|------------|
| LOW | 1,413 | 3.4% |
| MODERATE | 10,667 | 26.0% |
| HIGH | 19,100 | 46.6% |
| VERY HIGH | 9,834 | 24.0% |
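
The percentages in a distribution table like this follow directly from the per-category counts. Below is a minimal sketch that uses the counts printed by the `groupby` output in the quality-check cell; the variable names are illustrative, and plain Python is used so the arithmetic is easy to verify by hand.

```python
# Counts copied from the groupby output printed by the quality-check cell.
category_counts = {
    "LOW": 1413,
    "MODERATE": 10667,
    "HIGH": 19100,
    "VERY HIGH": 9834,
}

total = sum(category_counts.values())

# Share of each category, rounded to one decimal place for display.
category_share = {
    name: round(100 * count / total, 1)
    for name, count in category_counts.items()
}

print(f"Total: {total:,}")
for name, share in category_share.items():
    print(f"{name:<10} {category_counts[name]:>6,} {share:>5.1f}%")
```

With the DataFrame in memory, `df['risk_category'].value_counts(normalize=True)` gives the same shares as fractions without copying any numbers by hand.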

### 🎯 Quality Validation
**Sample outputs show:**
- ✅ **Readable narratives** - Natural language, not code-speak
- ✅ **Risk context** - Scores and categories prominently featured
- ✅ **Key features preserved** - Age, vehicle specs, safety equipment, subscription length all included
- ✅ **Actionable insights** - High-risk policies explicitly list contributing factors


In [7]:
# Total number of claims filed
total_claims = df['claim_status'].sum()
print(f"Total claims filed: {total_claims}")

# Percentage of claims filed
claim_percent = (total_claims / len(df)) * 100
print(f"Percentage of policies with claims: {claim_percent:.2f}%")


Total claims filed: 2624
Percentage of policies with claims: 6.40%


## 5. Save Enhanced Dataset

In [None]:
# ========================================================================
# SAVE ENHANCED DATA
# ========================================================================
print("\n" + "="*70)
print("SAVING ENHANCED DATA")
print("="*70)

output_path = '../data/processed/train_data_with_summaries.csv'
df.to_csv(output_path, index=False)
print(f"✅ Saved data with summaries: {output_path}")

# Also save just the summaries for quick loading
summaries_only = df[['policy_id', 'claim_status', 'risk_category', 
                      'overall_risk_score', 'summary']]
summaries_path = '../data/processed/summaries_train.csv'
summaries_only.to_csv(summaries_path, index=False)
print(f"✅ Saved summaries only: {summaries_path}")

print("\n" + "="*70)
print("✅ SUMMARY GENERATION COMPLETE")
print("="*70)
print(f"""
📊 RESULTS:
   Total summaries:     {len(df):,}
   Average length:      {summary_lengths.mean():.0f} chars
   Risk categories:     {df['risk_category'].nunique()}
   Claim rate:          {(df['claim_status']==1).mean()*100:.1f}%
   """)


SAVING ENHANCED DATA
✅ Saved data with summaries: ../data/processed/train_data_with_summaries.csv
✅ Saved summaries only: ../data/processed/summaries_train.csv

✅ SUMMARY GENERATION COMPLETE

📊 RESULTS:
   Total summaries:     41,014
   Average length:      385 chars
   Risk categories:     4
   Claim rate:          6.4%
   



### Notable Pattern
The "VERY HIGH RISK" example shows how the weighted risk score combines several factors:
- 47-year-old driver (mature age group, slight risk increase)
- 12.5-month subscription (long-term, so not a risk factor on its own)
- 2 airbags + 2-star NCAP (below-average safety)
- **Risk Score: 0.93** - The weighted combination flags this policy as very high risk despite the longer subscription

### Output Files
1. `train_data_with_summaries.csv` - Full dataset with summaries appended
2. `summaries_train.csv` - Standalone summaries for direct embedding generation

### Ready For Next Phase
The text summaries are now embedding-ready. Each narrative contains the semantic richness needed for the AI to learn:
- What makes a policy high-risk vs. low-risk
- How different features interact (young driver + short subscription + weak safety = elevated risk)
- The language patterns that describe insurance claims scenarios

**Next step:** Generate embeddings from these summaries to create dense vector representations that capture claim risk patterns.
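
To make the hand-off to that step concrete, here is a minimal sketch of the shape such a pipeline might take. It is not the notebook's method. `toy_embed` is a hypothetical stand-in (a hashed bag of words) used only so the example runs without external dependencies. It captures word overlap, not meaning, and a real run would pass an actual embedding model's encode function as `embed_fn`. The names `embed_summaries` and `cosine_similarity` are likewise illustrative.

```python
import hashlib
import math
import re

def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9.]+", text.lower()):
        # md5 keeps the bucket assignment stable across Python processes.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    # Normalize to unit length so similarity reduces to a dot product.
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def embed_summaries(summaries, embed_fn=toy_embed):
    """Map each summary to a vector using whichever embedder is supplied."""
    return [embed_fn(text) for text in summaries]

# The concise sample summary from the notebook output, plus a shortened variant.
summaries = [
    "[MODERATE] 48yo driver, 1.2mo subscription, 2.0yo Petrol B2, "
    "2 airbags, 2★ NCAP, region C9. Risk: 0.43. NO CLAIM.",
    "[MODERATE] 48yo driver, 1.2mo subscription, 2.0yo Petrol B2.",
]
vectors = embed_summaries(summaries)
print(len(vectors), len(vectors[0]))  # 2 64
print(cosine_similarity(vectors[0], vectors[1]))
```

Keeping the embedder as a parameter means the same loop works unchanged once a real model replaces the toy function, and the resulting vectors can be stored alongside `policy_id` from `summaries_train.csv`.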