# Agentic Customer Lifetime Value (LTV) Prediction

## Project Overview
This project develops an agentic LTV prediction solution using retrieval-augmented reasoning and local LLM agents to combine structured customer/policy data with contextual document evidence.

## Step 1: Data Exploration & Preprocessing

### Objective
- Load and examine customer demographics, policy details, premium & payment history, claim history, and renewal/retention records
- Augment with a document corpus: policy wording, underwriting notes, correspondence, and case summaries
- Clean and preprocess data for LTV modeling
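
The cleaning objective above can be illustrated with a minimal sketch. It assumes the column names used by the synthetic dataset in this notebook (`policy_type`, `num_claims`, `last_claim_date`); the `preprocess` helper, the `-1` sentinel for customers with no claims, and the reference date are illustrative assumptions, not the project's actual preprocessing step.

```python
import pandas as pd

# Toy frame that borrows column names from the synthetic dataset in this notebook.
toy = pd.DataFrame({
    "customer_id": ["CUST_00001", "CUST_00002", "CUST_00003"],
    "policy_type": ["Auto", "Home", "Auto"],
    "num_claims": [0, 2, 1],
    "last_claim_date": [pd.NaT, pd.Timestamp("2024-11-01"), pd.Timestamp("2024-12-01")],
})


def preprocess(df, reference_date="2024-12-31"):
    """Hypothetical cleaning step: encode claim recency and one-hot encode policy type."""
    out = df.copy()
    ref = pd.Timestamp(reference_date)
    out["last_claim_date"] = pd.to_datetime(out["last_claim_date"])

    # A missing last_claim_date means "no claims", so keep that signal as a flag.
    out["has_claim"] = out["last_claim_date"].notna().astype(int)

    # Days between the reference date and the last claim; -1 is an assumed sentinel for "never".
    days = (ref - out["last_claim_date"]).dt.days
    out["days_since_last_claim"] = days.fillna(-1).astype(int)

    out = out.drop(columns=["last_claim_date"])
    # One-hot encode the categorical column as 0/1 integers.
    return pd.get_dummies(out, columns=["policy_type"], dtype=int)


features = preprocess(toy)
print(features)
```

Turning the missing date into an explicit flag plus a sentinel keeps the "no claims" information instead of dropping those rows or imputing a date.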

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import json
import os
from pathlib import Path

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.3


In [2]:
# Generate synthetic insurance customer dataset
def generate_insurance_dataset(n_customers=5000, start_date='2020-01-01', end_date='2024-12-31'):
    """
    Generate a comprehensive insurance customer dataset for LTV prediction
    """
    np.random.seed(42)
    
    # Date range
    start = pd.to_datetime(start_date)
    end = pd.to_datetime(end_date)
    
    # Customer demographics
    customer_ids = [f"CUST_{i:05d}" for i in range(1, n_customers + 1)]
    
    # Age distribution (18-80, weighted towards middle age)
    ages = np.random.beta(2, 2, n_customers) * 62 + 18
    ages = np.round(ages).astype(int)
    
    # Gender
    genders = np.random.choice(['M', 'F'], n_customers, p=[0.52, 0.48])
    
    # Geographic regions
    regions = np.random.choice(['Urban', 'Suburban', 'Rural'], n_customers, p=[0.4, 0.45, 0.15])
    
    # Income levels (correlated with age)
    income_base = 30000 + ages * 800 + np.random.normal(0, 15000, n_customers)
    income_levels = np.clip(income_base, 20000, 200000)
    
    # Customer tenure (months since first policy)
    tenure_months = np.random.exponential(24, n_customers)
    tenure_months = np.clip(tenure_months, 1, 60)
    
    # Policy details
    policy_types = np.random.choice(['Auto', 'Home', 'Life', 'Health'], n_customers, p=[0.4, 0.3, 0.2, 0.1])
    
    # Premium amounts (correlated with income and policy type)
    premium_base = np.where(policy_types == 'Auto', 1200,
                   np.where(policy_types == 'Home', 1800,
                   np.where(policy_types == 'Life', 2400, 3000)))
    
    premium_amounts = premium_base * (1 + income_levels / 100000) * np.random.lognormal(0, 0.3, n_customers)
    premium_amounts = np.round(premium_amounts, 2)
    
    # Payment frequency
    payment_freq = np.random.choice(['Monthly', 'Quarterly', 'Annual'], n_customers, p=[0.6, 0.25, 0.15])
    
    # Payment regularity (0-1 score)
    payment_regularity = np.random.beta(3, 1, n_customers)
    
    # Number of claims (Poisson distribution, higher for certain demographics)
    claim_rate = np.where(ages < 25, 0.3, np.where(ages > 65, 0.2, 0.1))
    num_claims = np.random.poisson(claim_rate * tenure_months / 12, n_customers)
    
    # Last claim date (if any claims)
    last_claim_dates = []
    for claims in num_claims:
        if claims > 0:
            days_ago = np.random.exponential(180, claims)
            last_claim = end - pd.Timedelta(days=np.min(days_ago))
            last_claim_dates.append(last_claim)
        else:
            last_claim_dates.append(None)
    
    # Renewal probability (inversely correlated with claims and age)
    renewal_prob = 1 - (num_claims * 0.1 + (ages - 50) * 0.002)
    renewal_prob = np.clip(renewal_prob, 0.3, 0.95)
    
    # Churn indicator (1 if churned, 0 if retained)
    churned = np.random.binomial(1, 1 - renewal_prob, n_customers)
    
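    # Customer lifetime value (target variable)
    # Base LTV = premium * tenure * renewal_prob - claim_costs
    # tenure_months is in months, so premium_amount acts as a per-month figure in this formula
    # Total claim cost: claim count times one lognormal severity draw per customer
    claim_costs = num_claims * np.random.lognormal(8, 1, n_customers)
    base_ltv = premium_amounts * tenure_months * renewal_prob - claim_costs
    ltv = np.maximum(base_ltv, 0)  # Floor at zero: LTV can't be negative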
    
    # Create DataFrame
    df = pd.DataFrame({
        'customer_id': customer_ids,
        'age': ages,
        'gender': genders,
        'region': regions,
        'income_level': income_levels,
        'policy_type': policy_types,
        'premium_amount': premium_amounts,
        'payment_frequency': payment_freq,
        'payment_regularity': payment_regularity,
        'tenure_months': tenure_months,
        'num_claims': num_claims,
        'last_claim_date': last_claim_dates,
        'renewal_probability': renewal_prob,
        'churned': churned,
        'customer_ltv': ltv
    })
    
    return df

# Generate the dataset
print("Generating synthetic insurance customer dataset...")
customer_data = generate_insurance_dataset(n_customers=5000)
print(f"Dataset generated with {len(customer_data)} customers")
print(f"Date range: 2020-01-01 to 2024-12-31")


Generating synthetic insurance customer dataset...
Dataset generated with 5000 customers
Date range: 2020-01-01 to 2024-12-31
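
To make the target definition concrete, here is a small worked example of the generator's renewal-probability and LTV formulas for a single customer. The helper names `renewal_probability` and `base_ltv` and the sample customer values are hypothetical; the arithmetic mirrors the expressions in `generate_insurance_dataset`.

```python
def renewal_probability(num_claims, age):
    """Same expression as the generator: penalize claims and age above 50, then clip."""
    raw = 1 - (num_claims * 0.1 + (age - 50) * 0.002)
    return min(max(raw, 0.3), 0.95)


def base_ltv(premium_amount, tenure_months, renewal_prob, claim_costs):
    """Same expression as the generator's target, floored at zero."""
    return max(premium_amount * tenure_months * renewal_prob - claim_costs, 0.0)


# Hypothetical customer: age 60, one claim costing 3,000, premium 2,000, 24 months of tenure.
p = renewal_probability(num_claims=1, age=60)   # 1 - (0.1 + 0.02), about 0.88
ltv = base_ltv(2000.0, 24, p, 3000.0)           # 2000 * 24 * 0.88 - 3000, about 39,240

print(f"renewal probability: {p:.2f}")
print(f"LTV: ${ltv:,.2f}")

# A young customer with no claims would exceed 1.0 before clipping, so the cap applies.
print(f"capped: {renewal_probability(num_claims=0, age=20):.2f}")
```

Because the formula multiplies the premium by tenure in months, it treats `premium_amount` as a per-month figure. If the premium is meant to be annual, dividing the tenure by 12 would give a smaller and arguably more realistic LTV scale.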


In [3]:
# Generate document corpus for retrieval-augmented reasoning
def generate_document_corpus():
    """
    Generate synthetic document corpus including policy wording, 
    underwriting notes, correspondence, and case summaries
    """
    
    # Policy wording documents
    policy_documents = {
        "auto_policy_terms": """
        AUTO INSURANCE POLICY TERMS AND CONDITIONS
        
        Coverage Details:
        - Comprehensive Coverage: $50,000 deductible
        - Collision Coverage: $1,000 deductible  
        - Liability Coverage: $100,000/$300,000/$50,000
        - Uninsured Motorist: $25,000/$50,000
        
        Premium Calculation Factors:
        - Age: Drivers under 25 pay 25% surcharge
        - Claims History: Each claim increases premium by 15%
        - Payment Frequency: Annual payments receive 5% discount
        - Tenure: 5+ year customers receive loyalty discount
        
        Renewal Terms:
        - Automatic renewal unless 30-day notice given
        - Premium adjustments based on claims and risk factors
        - Grace period: 15 days for payment
        """,
        
        "home_policy_terms": """
        HOMEOWNERS INSURANCE POLICY TERMS
        
        Coverage Details:
        - Dwelling Coverage: Replacement cost up to $500,000
        - Personal Property: 70% of dwelling coverage
        - Liability Protection: $300,000 per occurrence
        - Additional Living Expenses: 20% of dwelling coverage
        
        Risk Assessment Factors:
        - Location: High-risk areas (flood, earthquake) have higher premiums
        - Home Age: Properties over 30 years require inspection
        - Claims History: 3+ claims in 5 years may result in non-renewal
        - Credit Score: Lower scores increase premium by up to 20%
        
        Deductible Options:
        - Standard: $1,000
        - High Deductible: $2,500 (15% premium reduction)
        - Low Deductible: $500 (20% premium increase)
        """,
        
        "life_policy_terms": """
        LIFE INSURANCE POLICY TERMS
        
        Coverage Types:
        - Term Life: 10, 20, 30-year terms available
        - Whole Life: Permanent coverage with cash value
        - Universal Life: Flexible premium and death benefit
        
        Underwriting Guidelines:
        - Medical Exam required for coverage over $500,000
        - Age limits: 18-75 for new policies
        - Health conditions: Pre-existing conditions may affect rates
        - Lifestyle factors: Smoking increases rates by 100-200%
        
        Premium Factors:
        - Age at issue: Primary factor in rate calculation
        - Health status: Medical underwriting determines rate class
        - Coverage amount: Higher amounts may require additional underwriting
        - Payment mode: Annual payments receive discount
        """,
        
        "health_policy_terms": """
        HEALTH INSURANCE POLICY TERMS
        
        Coverage Levels:
        - Bronze: 60% coverage, lowest premium
        - Silver: 70% coverage, moderate premium
        - Gold: 80% coverage, higher premium
        - Platinum: 90% coverage, highest premium
        
        Network Types:
        - HMO: Must use network providers, referrals required
        - PPO: Can use out-of-network with higher costs
        - EPO: Network only, no referrals needed
        - POS: Hybrid of HMO and PPO
        
        Pre-existing Conditions:
        - Coverage available after 12-month waiting period
        - Premium surcharge may apply
        - Annual maximum out-of-pocket: $8,700 individual
        """
    }
    
    # Underwriting notes templates
    underwriting_notes = {
        "high_risk_auto": """
        UNDERWRITING NOTES - HIGH RISK AUTO
        
        Risk Factors Identified:
        - Multiple traffic violations in past 3 years
        - Previous claims history indicates aggressive driving
        - Young driver (under 25) with high-performance vehicle
        - Urban location with high accident rates
        
        Recommended Actions:
        - Increase premium by 40%
        - Require defensive driving course completion
        - Install telematics device for monitoring
        - Consider 6-month policy term for review
        """,
        
        "preferred_home": """
        UNDERWRITING NOTES - PREFERRED HOME RISK
        
        Positive Factors:
        - Excellent credit score (750+)
        - No claims history in past 10 years
        - New construction with modern safety features
        - Located in low-risk area (no flood/earthquake zone)
        
        Recommended Actions:
        - Apply preferred customer discount (15%)
        - Offer multi-policy discount if applicable
        - Consider higher coverage limits
        - Annual policy term with automatic renewal
        """,
        
        "life_medical_risk": """
        UNDERWRITING NOTES - MEDICAL RISK ASSESSMENT
        
        Medical History:
        - Controlled diabetes (Type 2, HbA1c < 7%)
        - Regular medication compliance
        - Annual medical checkups maintained
        - No complications or hospitalizations
        
        Risk Assessment:
        - Standard Plus rate class recommended
        - 25% premium surcharge for diabetes
        - Annual medical review required
        - Consider term life over whole life
        """
    }
    
    # Correspondence templates
    correspondence = {
        "premium_increase_notice": """
        PREMIUM INCREASE NOTICE
        
        Dear Valued Customer,
        
        We are writing to inform you that your insurance premium will increase 
        effective your next renewal date. This adjustment is based on:
        
        - Recent claims activity in your area
        - Updated risk assessment models
        - General market conditions
        
        Your new premium will be $X.XX per month, representing a X% increase.
        We remain committed to providing you with excellent coverage and service.
        
        If you have any questions, please contact us at 1-800-INSURANCE.
        
        Sincerely,
        Your Insurance Team
        """,
        
        "claims_processing_update": """
        CLAIMS PROCESSING UPDATE
        
        Dear Customer,
        
        We have received your claim #CLM-XXXXXX and are processing it promptly.
        
        Current Status: Investigation in Progress
        Estimated Resolution: 7-10 business days
        Adjuster Assigned: John Smith (License #12345)
        
        Next Steps:
        1. Damage assessment scheduled
        2. Documentation review
        3. Settlement determination
        4. Payment processing
        
        We will keep you updated throughout the process.
        
        Best regards,
        Claims Department
        """,
        
        "renewal_reminder": """
        POLICY RENEWAL REMINDER
        
        Dear Customer,
        
        Your insurance policy expires in 30 days. To ensure continuous coverage:
        
        - Review your current coverage limits
        - Update any changes in circumstances
        - Consider additional coverage options
        - Complete renewal payment
        
        Renewal Options:
        - Online: www.insurance.com/renew
        - Phone: 1-800-INSURANCE
        - Agent: Contact your local agent
        
        Thank you for your continued trust in our services.
        
        Sincerely,
        Customer Service Team
        """
    }
    
    # Case summaries
    case_summaries = {
        "high_ltv_retention": """
        CASE SUMMARY - HIGH LTV CUSTOMER RETENTION
        
        Customer Profile:
        - 15-year tenure with company
        - Premium: $3,200 annually
        - Claims: 2 minor claims in 15 years
        - Payment: Always on time, annual billing
        
        Retention Strategy:
        - Offered loyalty discount (20%)
        - Upgraded to premium service tier
        - Assigned dedicated account manager
        - Provided additional coverage options
        
        Result: Customer retained, increased coverage by 25%
        LTV Impact: +$8,000 projected lifetime value
        """,
        
        "churn_prevention": """
        CASE SUMMARY - CHURN PREVENTION SUCCESS
        
        Customer Profile:
        - 3-year tenure
        - Premium: $1,800 annually
        - Recent claim: $5,000 auto accident
        - Risk: Premium increase notification sent
        
        Intervention Strategy:
        - Personal call from retention specialist
        - Explained claim impact on premium
        - Offered accident forgiveness program
        - Provided payment plan options
        
        Result: Customer retained with 10% premium increase
        LTV Impact: Preserved $12,000 lifetime value
        """,
        
        "new_customer_onboarding": """
        CASE SUMMARY - NEW CUSTOMER ONBOARDING
        
        Customer Profile:
        - First-time insurance buyer
        - Age: 28, single, urban location
        - Income: $65,000 annually
        - Vehicle: 2020 Honda Civic
        
        Onboarding Strategy:
        - Comprehensive coverage education
        - Set up automatic payments
        - Enrolled in mobile app
        - Scheduled 6-month check-in
        
        Result: Customer engaged, no early churn
        LTV Impact: $15,000 projected lifetime value
        """
    }
    
    return {
        "policy_documents": policy_documents,
        "underwriting_notes": underwriting_notes,
        "correspondence": correspondence,
        "case_summaries": case_summaries
    }

# Generate document corpus
print("Generating document corpus...")
document_corpus = generate_document_corpus()
print("Document corpus generated successfully!")
print(f"Policy documents: {len(document_corpus['policy_documents'])}")
print(f"Underwriting notes: {len(document_corpus['underwriting_notes'])}")
print(f"Correspondence templates: {len(document_corpus['correspondence'])}")
print(f"Case summaries: {len(document_corpus['case_summaries'])}")


Generating document corpus...
Document corpus generated successfully!
Policy documents: 4
Underwriting notes: 3
Correspondence templates: 3
Case summaries: 3
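
As a rough illustration of how this nested corpus could feed retrieval-augmented reasoning, the sketch below ranks documents by simple token overlap with a query. This is a stand-in technique chosen for brevity, not the project's retriever; `tokenize`, `flatten_corpus`, `retrieve`, and the two-document `toy_corpus` are hypothetical names and data. It assumes the same `{category: {doc_id: text}}` layout that `generate_document_corpus` returns.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flatten_corpus(corpus):
    """Turn {category: {doc_id: text}} into a list of (category, doc_id, text)."""
    return [
        (category, doc_id, text)
        for category, docs in corpus.items()
        for doc_id, text in docs.items()
    ]


def retrieve(query, corpus, top_k=2):
    """Rank documents by the number of query tokens they share (token-overlap score)."""
    query_counts = Counter(tokenize(query))
    scored = []
    for category, doc_id, text in flatten_corpus(corpus):
        doc_counts = Counter(tokenize(text))
        score = sum(min(count, doc_counts[token]) for token, count in query_counts.items())
        if score > 0:
            scored.append((score, category, doc_id))
    # Highest score first; ties broken alphabetically by document id.
    scored.sort(key=lambda item: (-item[0], item[2]))
    return scored[:top_k]


# Tiny stand-in corpus with the same nesting as document_corpus.
toy_corpus = {
    "policy_documents": {
        "auto_policy_terms": "Drivers under 25 pay 25% surcharge. Each claim increases premium by 15%.",
    },
    "case_summaries": {
        "churn_prevention": "Offered accident forgiveness program after premium increase notification.",
    },
}

print(retrieve("accident forgiveness", toy_corpus))
```

Passing `document_corpus` instead of `toy_corpus` should work unchanged, since only the nesting is assumed. An embedding-based retriever would likely replace the overlap score in a fuller pipeline.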


In [4]:
# Data Exploration - Examine the datasets
print("=" * 60)
print("DATA EXPLORATION")
print("=" * 60)

# Basic information about customer data
print("\n1. CUSTOMER DATA OVERVIEW")
print("-" * 30)
print(f"Dataset shape: {customer_data.shape}")
print(f"Memory usage: {customer_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n2. DATA TYPES AND MISSING VALUES")
print("-" * 30)
customer_data.info()  # info() prints its own report and returns None, so wrapping it in print() adds a stray "None"

print("\n3. FIRST FEW ROWS")
print("-" * 30)
print(customer_data.head())

print("\n4. BASIC STATISTICS")
print("-" * 30)
print(customer_data.describe())


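============================================================
DATA EXPLORATION
============================================================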

1. CUSTOMER DATA OVERVIEW
------------------------------
Dataset shape: (5000, 15)
Memory usage: 1.84 MB

2. DATA TYPES AND MISSING VALUES
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   customer_id          5000 non-null   object        
 1   age                  5000 non-null   int64         
 2   gender               5000 non-null   object        
 3   region               5000 non-null   object        
 4   income_level         5000 non-null   float64       
 5   policy_type          5000 non-null   object        
 6   premium_amount       5000 non-null   float64       
 7   payment_frequency    5000 non-null   object        
 8   payment_regularity   5000 non-null   float64       
 9   tenure_months        5000 non-null   float64       
 10  num_claims      

In [5]:
# Detailed data analysis
print("\n5. CATEGORICAL VARIABLES ANALYSIS")
print("-" * 30)

categorical_vars = ['gender', 'region', 'policy_type', 'payment_frequency']
for var in categorical_vars:
    print(f"\n{var.upper()}:")
    print(customer_data[var].value_counts())
    print(f"Unique values: {customer_data[var].nunique()}")

print("\n6. TARGET VARIABLE ANALYSIS (Customer LTV)")
print("-" * 30)
print(f"LTV Statistics:")
print(f"  Mean: ${customer_data['customer_ltv'].mean():,.2f}")
print(f"  Median: ${customer_data['customer_ltv'].median():,.2f}")
print(f"  Std: ${customer_data['customer_ltv'].std():,.2f}")
print(f"  Min: ${customer_data['customer_ltv'].min():,.2f}")
print(f"  Max: ${customer_data['customer_ltv'].max():,.2f}")

print(f"\nLTV Distribution by Churn Status:")
churn_ltv = customer_data.groupby('churned')['customer_ltv'].agg(['count', 'mean', 'median', 'std'])
print(churn_ltv)

print("\n7. CLAIMS ANALYSIS")
print("-" * 30)
print(f"Customers with claims: {(customer_data['num_claims'] > 0).sum()} ({(customer_data['num_claims'] > 0).mean()*100:.1f}%)")
print(f"Average claims per customer: {customer_data['num_claims'].mean():.2f}")
print(f"Max claims: {customer_data['num_claims'].max()}")

# Check for missing values in last_claim_date
missing_claim_dates = customer_data['last_claim_date'].isna().sum()
print(f"Missing last claim dates: {missing_claim_dates} (customers with no claims)")

print("\n8. PAYMENT REGULARITY ANALYSIS")
print("-" * 30)
print(f"Payment regularity statistics:")
print(f"  Mean: {customer_data['payment_regularity'].mean():.3f}")
print(f"  Median: {customer_data['payment_regularity'].median():.3f}")
print(f"  Std: {customer_data['payment_regularity'].std():.3f}")

# Payment regularity by churn status
print(f"\nPayment regularity by churn status:")
churn_payment = customer_data.groupby('churned')['payment_regularity'].agg(['mean', 'median', 'std'])
print(churn_payment)



5. CATEGORICAL VARIABLES ANALYSIS
------------------------------

GENDER:
gender
M    2629
F    2371
Name: count, dtype: int64
Unique values: 2

REGION:
region
Suburban    2245
Urban       2045
Rural        710
Name: count, dtype: int64
Unique values: 3

POLICY_TYPE:
policy_type
Auto      2062
Home      1451
Life       990
Health     497
Name: count, dtype: int64
Unique values: 4

PAYMENT_FREQUENCY:
payment_frequency
Monthly      3004
Quarterly    1273
Annual        723
Name: count, dtype: int64
Unique values: 3

6. TARGET VARIABLE ANALYSIS (Customer LTV)
------------------------------
LTV Statistics:
  Mean: $64,136.57
  Median: $42,335.21
  Std: $66,680.80
  Min: $0.00
  Max: $554,532.48

LTV Distribution by Churn Status:
         count          mean        median           std
churned                                                 
0         4694  63816.100516  41874.920777  66711.794896
1          306  69052.590378  51063.386347  66117.121879

7. CLAIMS ANALYSIS
-----------------