# Claims Severity Prediction by Fine-Tuning a Foundation Model

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('train.csv')

print("=== BASIC PANDAS PROFILING FOR NUMERIC COLUMNS ===")
print(f"Dataset shape: {df.shape}")

# Identify numeric columns
numeric_columns = df.select_dtypes(include=['int64', 'float64', 'int32', 'float32']).columns
print(f"\nNumeric columns found: {len(numeric_columns)}")
print(f"Column names: {numeric_columns.tolist()}")

print("\n" + "="*70)

# Profile each numeric column
for col in numeric_columns:
    print(f"\n📊 COLUMN: {col}")
    print("-" * 50)
    
    # Basic statistics
    print("📈 Descriptive Statistics:")
    print(df[col].describe())
    
    # Data quality
    print(f"\n🔍 Data Quality:")
    print(f"   • Null values: {df[col].isnull().sum()} ({df[col].isnull().sum()/len(df)*100:.2f}%)")
    print(f"   • Unique values: {df[col].nunique()}")
    print(f"   • Data type: {df[col].dtype}")
    
    # Additional insights (skipped for constant columns)
    if df[col].nunique() > 1:
        print(f"   • Range: {df[col].max() - df[col].min():.2f}")
        # The CV divides by the mean, so guard against a zero mean
        if df[col].mean() != 0:
            print(f"   • Coefficient of Variation: {df[col].std()/df[col].mean()*100:.2f}%")
    
    # Check for potential outliers (using IQR method)
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"   • Potential outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
    
    print("=" * 50)

print(f"\n✅ Profiling completed for {len(numeric_columns)} numeric columns!")

=== BASIC PANDAS PROFILING FOR NUMERIC COLUMNS ===
Dataset shape: (54000, 15)

Numeric columns found: 8
Column names: ['Age', 'DependentChildren', 'DependentsOther', 'WeeklyWages', 'HoursWorkedPerWeek', 'DaysWorkedPerWeek', 'InitialIncurredCalimsCost', 'UltimateIncurredClaimCost']


📊 COLUMN: Age
--------------------------------------------------
📈 Descriptive Statistics:
count    54000.000000
mean        33.842370
std         12.122165
min         13.000000
25%         23.000000
50%         32.000000
75%         43.000000
max         81.000000
Name: Age, dtype: float64

🔍 Data Quality:
   • Null values: 0 (0.00%)
   • Unique values: 68
   • Data type: int64
   • Range: 68.00
   • Coefficient of Variation: 35.82%
   • Potential outliers: 22 (0.04%)

📊 COLUMN: DependentChildren
--------------------------------------------------
📈 Descriptive Statistics:
count    54000.000000
mean         0.119185
std          0.517780
min          0.000000
25%          0.000000
50%          0.000000
75%

## Numerical Columns Profiling Analysis

### Dataset Overview
- **Total Records:** 54,000 
- **Numerical Columns:** 8 columns
- **Data Quality:** Excellent - Zero missing values across all 8 numerical features

---

### Detailed Analysis of All Numerical Features

#### **Demographic Features**

**1. Age**
- **Mean:** 33.8 years | **Range:** 13-81 years (68 years span)
- **Distribution:** Well-balanced (CV: 35.82%)
- **Quality:** Minimal outliers (0.04%) - excellent for modeling

**2. DependentChildren**
- **Mean:** 0.12 children | **Range:** 0-9 children
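- **Distribution:** Highly skewed - the 75th percentile is 0, so at least 75% have no children (CV: 434%)
- **Outliers:** 6.22% - with an IQR of 0, the rule flags every non-zero count, so this reflects sparse count data rather than bad records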

**3. DependentsOther**
- **Mean:** 0.01 individuals | **Range:** 0-5 dependents  
- **Distribution:** Extremely sparse (CV: 1099%) - 99% have zero
- **Recommendation:** Consider removal due to low information value
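
The outlier percentages for the two dependents columns follow from how the IQR rule behaves on zero-inflated counts. The sketch below uses a small synthetic array (made up for illustration, not drawn from `train.csv`) to show that when both quartiles are 0, the fences collapse to 0 and every non-zero value is flagged.

```python
import numpy as np

# Synthetic, illustrative counts: 94 zeros and 6 non-zero values.
# These numbers are assumptions, not values from train.csv.
counts = np.array([0] * 94 + [1, 1, 2, 2, 3, 4])

q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With q1 == q3 == 0 both fences are 0, so any non-zero count falls outside.
flagged = (counts < lower) | (counts > upper)
share_flagged = flagged.mean()
share_nonzero = (counts != 0).mean()

print(f"IQR = {iqr}, fences = [{lower}, {upper}]")
print(f"Flagged: {share_flagged:.2%}, non-zero: {share_nonzero:.2%}")
```

For columns like these, the share of non-zero values is probably a more informative summary than the IQR outlier rate.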

---

#### **Employment & Economic Features**

**4. WeeklyWages**
- **Mean:** $416.36 | **Median:** $392.20 | **Range:** $1-$7,497
- **Distribution:** Right-skewed (CV: 59.72%) - typical wage distribution
- **Outliers:** 2.74% - high earners, manageable level
- **Insights:** Most workers earn $200-$500/week (Q1-Q3)

**5. HoursWorkedPerWeek** ⚠️
- **Mean:** 37.7 hours | **Median:** 38 hours | **Range:** 0-640 hours
- **Distribution:** Clustered around full-time (CV: 33.31%)
- **Critical Issue:** 13.79% outliers, including impossible values: the maximum of 640 hours exceeds the 168 hours in a week, and the minimum of 0 hours is also suspect
- **Action Required:** Cap extreme values or investigate data quality

**6. DaysWorkedPerWeek**
- **Mean:** 4.9 days | **Median:** 5 days | **Range:** 1-7 days
- **Distribution:** Very stable (CV: 11.25%) - mostly standard work week
- **Outliers:** 8.92% - likely weekend/shift workers

---

#### **Claim Cost Features (Critical for Prediction)**

**7. InitialIncurredClaimsCost** (spelled `InitialIncurredCalimsCost` in the dataset)
- **Mean:** $7,841 | **Median:** $2,000 | **Range:** $1-$2M
- **Distribution:** Heavily right-skewed (CV: 262.51%)
- **Outliers:** 8.06% - high-cost initial assessments
- **Pattern:** Mean >> Median indicates extreme skewness

**8. UltimateIncurredClaimCost** **[TARGET VARIABLE]**
- **Mean:** $11,003 | **Median:** $3,371 | **Range:** $122-$4.03M
- **Distribution:** Extremely right-skewed (CV: 303.46%)
- **Outliers:** 12.60% - highest among all features
- **Critical Insight:** Claims escalate from initial ($7.8K) to ultimate ($11K) on average, so the mean ultimate cost is about 40% above the mean initial estimate

---

### Key Data Insights

#### **Distribution Patterns:**
- **Balanced (low to moderate CV):** Age, DaysWorkedPerWeek
- **Right-Skewed:** WeeklyWages, Cost variables
- **Highly Sparse:** DependentChildren, DependentsOther
- **Clustered:** HoursWorkedPerWeek around 38-40 hours

#### **Data Quality Issues:**
1. **HoursWorkedPerWeek:** Unrealistic maximum (640 hours) needs investigation
2. **Cost Variables:** Extreme outliers but expected in insurance data
3. **Dependents Features (DependentChildren, DependentsOther):** Very sparse, likely limited predictive value

---

### **Comprehensive Preprocessing Strategy**

#### **Feature Transformations:**

**1. Log Transformation Required:**
- WeeklyWages, InitialIncurredClaimsCost, UltimateIncurredClaimCost
- *Reason:* Right-skewed distributions; CV is about 60% for WeeklyWages and above 260% for both cost variables
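
A minimal sketch of the log transformation, assuming `np.log1p` is an acceptable choice (it behaves like a plain log for large values and stays defined at 0). The frame below holds made-up values; only the column names follow the profiling output, including the dataset's own spelling `InitialIncurredCalimsCost`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with made-up values (not rows from train.csv).
df = pd.DataFrame({
    "WeeklyWages": [1.0, 200.0, 392.2, 500.0, 7497.0],
    "InitialIncurredCalimsCost": [1.0, 800.0, 2000.0, 9000.0, 2_000_000.0],
    "UltimateIncurredClaimCost": [122.0, 900.0, 3371.0, 12000.0, 4_030_000.0],
})

skewed_cols = [
    "WeeklyWages",
    "InitialIncurredCalimsCost",
    "UltimateIncurredClaimCost",
]

# log1p(x) = log(1 + x); add the transformed copies next to the raw columns.
for col in skewed_cols:
    df[f"log_{col}"] = np.log1p(df[col])

# Predictions made on the log scale are mapped back to dollars with expm1.
recovered = np.expm1(df["log_UltimateIncurredClaimCost"])
print(df.filter(like="log_").round(3))
```

If the target is modeled on the log scale, predictions need the inverse transform (`np.expm1`) before they are compared with raw dollar amounts.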

**2. Outlier Treatment:**
- **HoursWorkedPerWeek:** Cap at reasonable maximum (80 hours)
- **Cost Variables:** Use robust scaling methods
- **Age:** Minimal outliers, keep as-is
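
The cap on hours could be applied along these lines. This is a sketch on a made-up series; the 80-hour limit comes from the text above and should be treated as a tunable assumption, and the `hours_was_capped` flag is a hypothetical extra column, not something the original analysis specifies.

```python
import pandas as pd

MAX_WEEKLY_HOURS = 80  # cap suggested in the text; adjust as needed

# Illustrative values only, including the implausible maximum of 640.
hours = pd.Series([0.0, 38.0, 40.0, 75.0, 640.0], name="HoursWorkedPerWeek")

# Keep a flag so downstream steps can still see that the raw value was extreme.
hours_was_capped = hours > MAX_WEEKLY_HOURS

# Series.clip with only `upper` set leaves low values untouched.
hours_capped = hours.clip(upper=MAX_WEEKLY_HOURS)

print(pd.DataFrame({"raw": hours, "capped": hours_capped, "flag": hours_was_capped}))
```

Capping keeps the row in the training set while limiting the influence of the extreme value; whether values such as 640 are data-entry errors is a separate question that capping does not answer.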

**3. Feature Engineering:**
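- **Hourly Rate:** WeeklyWages ÷ HoursWorkedPerWeek (HoursWorkedPerWeek has a minimum of 0, so handle zero hours before dividing)
- **Cost Escalation:** UltimateIncurredClaimCost ÷ InitialIncurredClaimsCost (derived from the target, so use it for analysis only, not as a model input)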
- **Work Intensity:** Categorical encoding for hours/days patterns
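
The two ratio features might be computed as follows. This is a sketch on made-up rows: treating zero hours as missing is one possible convention, not the only one, and `HourlyRate`, `escalation`, and `feature_cols` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the profiling output, including the
# dataset's own spelling "InitialIncurredCalimsCost".
df = pd.DataFrame({
    "WeeklyWages": [500.0, 392.2, 200.0],
    "HoursWorkedPerWeek": [40.0, 0.0, 38.0],
    "InitialIncurredCalimsCost": [2000.0, 1000.0, 5000.0],
    "UltimateIncurredClaimCost": [3000.0, 900.0, 20000.0],
})

# Hourly rate: treat zero hours as missing instead of dividing by zero.
hours = df["HoursWorkedPerWeek"].replace(0, np.nan)
df["HourlyRate"] = df["WeeklyWages"] / hours

# The escalation ratio is built from the target, so it is kept as a
# separate diagnostic series and never added to the feature columns.
escalation = df["UltimateIncurredClaimCost"] / df["InitialIncurredCalimsCost"]

feature_cols = [c for c in df.columns if c != "UltimateIncurredClaimCost"]
print(df[feature_cols])
print(escalation.describe())
```

Any feature computed from `UltimateIncurredClaimCost` would be unavailable at prediction time, which is why the ratio stays outside `feature_cols`.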

**4. Scaling Strategy:**
- **StandardScaler:** Age, work pattern features
- **RobustScaler:** Cost variables (outlier-resistant)
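- **After log transformation:** Wage and cost variables become far less skewed (approximately normal if the raw values are log-normal), so scale the log-transformed values rather than the raw ones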
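
To make the difference between the two scalers concrete, the sketch below re-implements them in pandas: centering on the mean and dividing by the standard deviation versus centering on the median and dividing by the IQR. These helper functions are illustrative stand-ins written for this example; they mirror the default behavior of scikit-learn's `StandardScaler` and `RobustScaler`, and the input values are made up.

```python
import numpy as np
import pandas as pd

def standard_scale(s: pd.Series) -> pd.Series:
    """Center on the mean and divide by the population standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)

def robust_scale(s: pd.Series) -> pd.Series:
    """Center on the median and divide by the interquartile range."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr

# Illustrative values only.
age = pd.Series([18.0, 23.0, 32.0, 43.0, 81.0])
log_cost = np.log1p(pd.Series([122.0, 900.0, 3371.0, 12000.0, 4_030_000.0]))

age_scaled = standard_scale(age)
cost_scaled = robust_scale(log_cost)

print(age_scaled.round(3).tolist())
print(cost_scaled.round(3).tolist())
```

In a real pipeline the statistics (mean, standard deviation, median, IQR) would be estimated on the training split only and reused on validation and test data.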

---

### **Model Development Implications**

#### **Feature Importance Ranking (Expected):**
1. **High Impact:** InitialIncurredClaimsCost, WeeklyWages, Age
2. **Medium Impact:** HoursWorkedPerWeek, DaysWorkedPerWeek
3. **Low Impact:** DependentChildren, DependentsOther

#### **Target Variable Characteristics:**
- **Extreme Skewness:** Requires log transformation
- **High Outlier Rate:** 12.60% - consider robust loss functions
- **Wide Range:** $122 to $4M - multi-scale prediction challenge
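
As one example of a robust loss, the sketch below implements the Huber loss in NumPy and compares it with mean squared error on log-scale targets. The Huber loss is an assumption chosen for illustration (the text above does not name a specific loss), and the target and prediction values are made up.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_res = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_res - 0.5 * delta)
    return float(np.mean(np.where(abs_res <= delta, quadratic, linear)))

# Illustrative log-scale targets and predictions (made-up numbers);
# the last claim is badly under-predicted.
y_true = np.log1p([3000.0, 3500.0, 4_000_000.0])
y_pred = np.log1p([3200.0, 3300.0, 5000.0])

mse = float(np.mean((y_true - y_pred) ** 2))
huber = huber_loss(y_true, y_pred, delta=1.0)

print(f"MSE:   {mse:.3f}")
print(f"Huber: {huber:.3f}")
```

Because the Huber loss grows linearly for large residuals, a single extreme claim contributes far less to it than to the squared error, which is the property that makes it a candidate for this target.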

#### **Correlation Expectations:**
- **Strong:** InitialIncurredClaimsCost ↔ UltimateIncurredClaimCost
- **Moderate:** Age ↔ WeeklyWages, WeeklyWages ↔ HoursWorkedPerWeek
- **Weak:** Dependents features (DependentChildren, DependentsOther) with other features
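
These expectations can be checked once the data is loaded. The sketch below uses synthetic stand-in data (generated so that the ultimate cost depends on the initial cost by construction) and rank correlation, which is less sensitive to the heavy right tails than Pearson correlation on raw values. All distributions and parameters here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: ultimate cost is the initial cost times random
# noise, so a strong rank correlation is expected by construction.
initial = rng.lognormal(mean=8.0, sigma=1.5, size=n)
ultimate = initial * rng.lognormal(mean=0.3, sigma=0.5, size=n)
dependents = rng.poisson(0.1, size=n)

df = pd.DataFrame({
    "InitialIncurredCalimsCost": initial,
    "UltimateIncurredClaimCost": ultimate,
    "DependentChildren": dependents,
})

# The Pearson correlation of ranks is the Spearman rank correlation.
spearman = df.rank().corr()
print(spearman.round(3))
```

On the real dataset the same two lines (`df[numeric_columns].rank().corr()`) would show whether the strong, moderate, and weak relationships listed above actually hold.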