# Claims Severity Prediction by Fine-Tuning a Foundation Model

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('train.csv')

print("=== BASIC PANDAS PROFILING FOR NUMERIC COLUMNS ===")
print(f"Dataset shape: {df.shape}")

# Identify numeric columns
numeric_columns = df.select_dtypes(include=['int64', 'float64', 'int32', 'float32']).columns
print(f"\nNumeric columns found: {len(numeric_columns)}")
print(f"Column names: {numeric_columns.tolist()}")

print("\n" + "="*70)

# Profile each numeric column
for col in numeric_columns:
    print(f"\n📊 COLUMN: {col}")
    print("-" * 50)
    
    # Basic statistics
    print("📈 Descriptive Statistics:")
    print(df[col].describe())
    
    # Data quality
    print(f"\n🔍 Data Quality:")
    print(f"   • Null values: {df[col].isnull().sum()} ({df[col].isnull().sum()/len(df)*100:.2f}%)")
    print(f"   • Unique values: {df[col].nunique()}")
    print(f"   • Data type: {df[col].dtype}")
    
    # Additional insights
    if df[col].nunique() > 1:  # Skip constant columns
        print(f"   • Range: {df[col].max() - df[col].min():.2f}")
        if df[col].mean() != 0:  # CV is undefined when the mean is zero
            print(f"   • Coefficient of Variation: {df[col].std()/df[col].mean()*100:.2f}%")
    
    # Check for potential outliers (using IQR method)
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"   • Potential outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
    
    print("=" * 50)

print(f"\n✅ Profiling completed for {len(numeric_columns)} numeric columns!")

=== BASIC PANDAS PROFILING FOR NUMERIC COLUMNS ===
Dataset shape: (54000, 15)

Numeric columns found: 8
Column names: ['Age', 'DependentChildren', 'DependentsOther', 'WeeklyWages', 'HoursWorkedPerWeek', 'DaysWorkedPerWeek', 'InitialIncurredCalimsCost', 'UltimateIncurredClaimCost']


📊 COLUMN: Age
--------------------------------------------------
📈 Descriptive Statistics:
count    54000.000000
mean        33.842370
std         12.122165
min         13.000000
25%         23.000000
50%         32.000000
75%         43.000000
max         81.000000
Name: Age, dtype: float64

🔍 Data Quality:
   • Null values: 0 (0.00%)
   • Unique values: 68
   • Data type: int64
   • Range: 68.00
   • Coefficient of Variation: 35.82%
   • Potential outliers: 22 (0.04%)

📊 COLUMN: DependentChildren
--------------------------------------------------
📈 Descriptive Statistics:
count    54000.000000
mean         0.119185
std          0.517780
min          0.000000
25%          0.000000
50%          0.000000
75%

## Numerical Columns Profiling Analysis

### Dataset Overview
- **Total Records:** 54,000 
- **Numerical Columns:** 8 columns
- **Data Quality:** Excellent - Zero missing values across all numerical features

---

### Detailed Analysis of All Numerical Features

#### **Demographic Features**

**1. Age**
- **Mean:** 33.8 years | **Range:** 13-81 years (68 years span)
- **Distribution:** Well-balanced (CV: 35.82%)
- **Quality:** Minimal outliers (0.04%) - excellent for modeling

**2. DependentChildren**
- **Mean:** 0.12 children | **Range:** 0-9 children
- **Distribution:** Highly skewed - at least 75% have no children (CV: 434%)
- **Outliers:** 6.22% - expected for zero-inflated count data, where the IQR is 0 and every non-zero value is flagged

**3. DependentsOther**
- **Mean:** 0.01 individuals | **Range:** 0-5 dependents  
- **Distribution:** Extremely sparse (CV: 1099%) - 99% have zero
- **Recommendation:** Consider removal due to low information value
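
The outlier percentages for the two dependents columns are easier to interpret with a small example. The sketch below applies the same IQR rule as the profiling cell to a made-up zero-inflated column (`toy_children` is hypothetical data, not taken from `train.csv`) and shows that the "outlier" share is simply the share of non-zero rows.

```python
import pandas as pd


def iqr_outlier_share(s: pd.Series) -> float:
    """Share of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return float(mask.mean())


# Hypothetical zero-inflated count column: 90% zeros, 10% non-zero
toy_children = pd.Series([0] * 90 + [1] * 6 + [2] * 3 + [3])

outlier_share = iqr_outlier_share(toy_children)
nonzero_share = float((toy_children > 0).mean())

# With Q1 = Q3 = 0 both fences collapse to 0, so every non-zero value is flagged
print(f"IQR outlier share: {outlier_share:.2%}")
print(f"Non-zero share:    {nonzero_share:.2%}")
```

For columns like these, the share of non-zero rows is probably a more useful summary than the IQR outlier rate.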

---

#### **Employment & Economic Features**

**4. WeeklyWages**
- **Mean:** $416.36 | **Median:** $392.20 | **Range:** $1-$7,497
- **Distribution:** Right-skewed (CV: 59.72%) - typical wage distribution
- **Outliers:** 2.74% - high earners, manageable level
- **Insights:** The middle 50% of workers earn roughly $200-$500/week (Q1-Q3)

**5. HoursWorkedPerWeek** ⚠️
- **Mean:** 37.7 hours | **Median:** 38 hours | **Range:** 0-640 hours
- **Distribution:** Clustered around full-time (CV: 33.31%)
- **Critical Issue:** 13.79% outliers, including impossible values (the maximum of 640 hours exceeds the 168 hours in a week)
- **Action Required:** Cap extreme values or investigate data quality

**6. DaysWorkedPerWeek**
- **Mean:** 4.9 days | **Median:** 5 days | **Range:** 1-7 days
- **Distribution:** Very stable (CV: 11.25%) - mostly standard work week
- **Outliers:** 8.92% - likely weekend/shift workers

---

#### **Claim Cost Features (Critical for Prediction)**

**7. InitialIncurredClaimsCost** (spelled `InitialIncurredCalimsCost` in the dataset)
- **Mean:** $7,841 | **Median:** $2,000 | **Range:** $1-$2M
- **Distribution:** Heavily right-skewed (CV: 262.51%)
- **Outliers:** 8.06% - high-cost initial assessments
- **Pattern:** Mean >> Median indicates extreme skewness

**8. UltimateIncurredClaimCost** **[TARGET VARIABLE]**
- **Mean:** $11,003 | **Median:** $3,371 | **Range:** $122-$4.03M
- **Distribution:** Extremely right-skewed (CV: 303.46%)
- **Outliers:** 12.60% - highest among all features
- **Critical Insight:** Claims escalate from initial ($7.8K) to ultimate ($11K) on average

---

### Key Data Insights

#### **Distribution Patterns:**
- **Normal/Balanced:** Age, DaysWorkedPerWeek
- **Right-Skewed:** WeeklyWages, Cost variables
- **Highly Sparse:** DependentChildren, DependentsOther
- **Clustered:** HoursWorkedPerWeek around 38-40 hours

#### **Data Quality Issues:**
1. **HoursWorkedPerWeek:** Unrealistic maximum (640 hours) needs investigation
2. **Cost Variables:** Extreme outliers but expected in insurance data
3. **Dependents Features (DependentChildren, DependentsOther):** Very sparse, limited predictive value

---

### **Comprehensive Preprocessing Strategy**

#### **Feature Transformations:**

**1. Log Transformation Required:**
- WeeklyWages, InitialIncurredClaimsCost, UltimateIncurredClaimCost
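- *Reason:* Right-skewness; CV exceeds 100% for both cost variables (262.51% and 303.46%), while WeeklyWages is more moderately skewed (CV: 59.72%)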
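
A minimal sketch of the log step, assuming `np.log1p` (log of 1 + x) is an acceptable choice of transform. The `toy` frame is illustrative only; its values loosely echo the summary statistics above and are not rows from the dataset. The `log_` prefix is a naming assumption.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not real rows from train.csv)
toy = pd.DataFrame({
    "WeeklyWages": [200.0, 392.2, 500.0, 7497.0],
    "UltimateIncurredClaimCost": [122.0, 3371.0, 11003.0, 4_030_000.0],
})

skewed_cols = ["WeeklyWages", "UltimateIncurredClaimCost"]

# log1p compresses the long right tail and stays defined at 0
logged = np.log1p(toy[skewed_cols]).add_prefix("log_")

# expm1 is the exact inverse, needed to report predictions in dollars
restored = np.expm1(logged["log_UltimateIncurredClaimCost"])

print(logged.round(2))
```

If the target is modeled on the log scale, predictions have to be mapped back with `np.expm1` before computing dollar-based error metrics.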

**2. Outlier Treatment:**
- **HoursWorkedPerWeek:** Cap at reasonable maximum (80 hours)
- **Cost Variables:** Use robust scaling methods
- **Age:** Minimal outliers, keep as-is
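
The capping step for HoursWorkedPerWeek could look roughly like this. The 80-hour ceiling comes from the bullet above; the `toy_hours` values and the idea of keeping a `was_capped` flag are assumptions made for illustration.

```python
import pandas as pd

MAX_WEEKLY_HOURS = 80  # ceiling suggested in the outlier-treatment notes

# Hypothetical values, including one impossible entry
toy_hours = pd.Series([0.0, 38.0, 40.0, 72.0, 640.0], name="HoursWorkedPerWeek")

# Record which rows were affected before overwriting them
was_capped = toy_hours > MAX_WEEKLY_HOURS

# clip(upper=...) replaces anything above the ceiling with the ceiling itself
capped_hours = toy_hours.clip(upper=MAX_WEEKLY_HOURS)

print(capped_hours.tolist())
print(f"Rows capped: {int(was_capped.sum())}")
```

Keeping the flag lets a model distinguish genuinely long weeks from values that were probably data-entry errors.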

**3. Feature Engineering:**
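- **Hourly Rate:** WeeklyWages ÷ HoursWorkedPerWeek (guard against zero hours, since the minimum is 0)
- **Cost Escalation:** UltimateIncurredClaimCost ÷ InitialIncurredClaimsCost (derived from the target, so use it for analysis only, never as a model input)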
- **Work Intensity:** Categorical encoding for hours/days patterns
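
A hedged sketch of the hourly-rate and work-intensity features. The `toy` rows, the bin edges (20 and 45 hours), and the category labels are all assumptions chosen for illustration; none of them come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one has zero recorded hours
toy = pd.DataFrame({
    "WeeklyWages": [400.0, 760.0, 500.0],
    "HoursWorkedPerWeek": [40.0, 38.0, 0.0],
})

# Treat zero hours as missing so the division yields NaN instead of inf
hours = toy["HoursWorkedPerWeek"].replace(0, np.nan)
toy["HourlyRate"] = toy["WeeklyWages"] / hours

# Assumed thresholds: <=20h part-time, 20-45h full-time, >45h overtime
toy["WorkIntensity"] = pd.cut(
    toy["HoursWorkedPerWeek"],
    bins=[-np.inf, 20, 45, np.inf],
    labels=["part_time", "full_time", "overtime"],
)

print(toy)
```

Rows with a missing `HourlyRate` would still need an imputation rule before modeling.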

**4. Scaling Strategy:**
- **StandardScaler:** Age, work pattern features
- **RobustScaler:** Cost variables (outlier-resistant)
- **Log-transformed variables:** Wage and cost variables are closer to normal after the log transformation, so apply scaling after the log step
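
One way to wire these choices together is a scikit-learn `ColumnTransformer`. This is a sketch under assumptions: the grouping of columns, the decision to chain `log1p` with `RobustScaler`, and the `toy` values are illustrative, and the cost column uses the dataset's own spelling `InitialIncurredCalimsCost`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler, StandardScaler

# Illustrative values only (not real rows from train.csv)
toy = pd.DataFrame({
    "Age": [23, 32, 43, 61],
    "DaysWorkedPerWeek": [5, 5, 4, 7],
    "WeeklyWages": [200.0, 392.2, 500.0, 7497.0],
    "InitialIncurredCalimsCost": [500.0, 2000.0, 9000.0, 250000.0],
})

# Skewed columns: log first, then scale by median and IQR
log_then_robust = make_pipeline(FunctionTransformer(np.log1p), RobustScaler())

preprocess = ColumnTransformer([
    ("standard", StandardScaler(), ["Age", "DaysWorkedPerWeek"]),
    ("log_robust", log_then_robust, ["WeeklyWages", "InitialIncurredCalimsCost"]),
])

X = preprocess.fit_transform(toy)

# Standard-scaled columns have mean 0; robust-scaled columns have median 0
print(X.round(2))
```

Fitting the transformer on the training split only, then reusing it on validation data, avoids leaking statistics across splits.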

---

### **Model Development Implications**

#### **Feature Importance Ranking (Expected):**
1. **High Impact:** InitialIncurredClaimsCost, WeeklyWages, Age
2. **Medium Impact:** HoursWorkedPerWeek, DaysWorkedPerWeek
3. **Low Impact:** DependentChildren, DependentsOther

#### **Target Variable Characteristics:**
- **Extreme Skewness:** Requires log transformation
- **High Outlier Rate:** 12.60% - consider robust loss functions
- **Wide Range:** $122 to $4M - multi-scale prediction challenge
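
The log transformation and robust loss can be combined in one estimator. The sketch below is an assumption-laden illustration, not the notebook's model: it uses synthetic data, a single made-up feature (`log_initial`), scikit-learn's `HuberRegressor` as the robust loss, and `TransformedTargetRegressor` to handle the log/exp round trip.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: log of an initial cost estimate, and a noisy ultimate cost
log_initial = rng.normal(loc=8.0, scale=1.5, size=500)
log_ultimate = 0.5 + 0.95 * log_initial + rng.normal(scale=0.3, size=500)

X = log_initial.reshape(-1, 1)
y = np.exp(log_ultimate)  # heavily right-skewed, in "dollars"

# Fit on log1p(y) with a Huber loss; predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=HuberRegressor(max_iter=1000),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)

print(f"Median actual:    {np.median(y):,.0f}")
print(f"Median predicted: {np.median(preds):,.0f}")
```

Because the loss is minimized on the log scale, the back-transformed predictions track the median of the cost distribution more closely than its mean.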

#### **Correlation Expectations:**
- **Strong:** InitialIncurredClaimsCost ↔ UltimateIncurredClaimCost
- **Moderate:** Age ↔ WeeklyWages, WeeklyWages ↔ HoursWorkedPerWeek
- **Weak:** Dependents features (DependentChildren, DependentsOther) with other features
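
These expectations are hypotheses until checked. A minimal way to test them is a rank correlation matrix, which is less sensitive to the heavy tails than Pearson correlation. The data below is synthetic and only demonstrates the mechanics; the cost column uses the dataset's spelling `InitialIncurredCalimsCost`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in: ultimate cost is the initial estimate times a noisy factor
initial = np.exp(rng.normal(8.0, 1.5, size=n))
ultimate = initial * np.exp(rng.normal(0.3, 0.5, size=n))

toy = pd.DataFrame({
    "InitialIncurredCalimsCost": initial,
    "UltimateIncurredClaimCost": ultimate,
    "Age": rng.integers(18, 65, size=n),  # independent of cost by construction
})

# Spearman works on ranks, so extreme claims do not dominate the estimate
spearman = toy.corr(method="spearman")
print(spearman.round(2))
```

On the real data, the same call on `df[numeric_columns]` would show whether the expected strong, moderate, and weak relationships actually hold.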

## NLP-specific stats

In [3]:
from collections import Counter
import re
from sklearn.feature_extraction.text import CountVectorizer

descriptions = df['ClaimDescription'].fillna('')

print("=== NLP-SPECIFIC STATISTICS FOR CLAIM DESCRIPTIONS ===")
print(f"Total descriptions: {len(descriptions)}")
print(f"Empty descriptions: {descriptions.str.strip().eq('').sum()}")

# 1. TOKEN LENGTH ANALYSIS
print("\n📝 TOKEN LENGTH ANALYSIS")
print("-" * 50)
token_lengths = descriptions.str.split().str.len()
print(f"Average tokens per description: {token_lengths.mean():.2f}")
print(f"Median tokens: {token_lengths.median():.1f}")
print(f"Min tokens: {token_lengths.min()}")
print(f"Max tokens: {token_lengths.max()}")
print(f"Standard deviation: {token_lengths.std():.2f}")

# Character length analysis
char_lengths = descriptions.str.len()
print(f"\nAverage characters per description: {char_lengths.mean():.2f}")
print(f"Max characters: {char_lengths.max()}")

# 2. TOP N-GRAMS ANALYSIS
print("\n🔤 TOP N-GRAMS ANALYSIS")
print("-" * 50)

# Clean text function
def clean_text(text):
    # Convert to uppercase and remove extra spaces
    text = str(text).upper().strip()
    # Remove special characters but keep spaces
    text = re.sub(r'[^A-Z\s]', ' ', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    return text

# Clean descriptions
cleaned_descriptions = descriptions.apply(clean_text)

# Top 1-grams (words)
all_words = ' '.join(cleaned_descriptions).split()
top_words = Counter(all_words).most_common(15)
print("Top 15 Words (1-grams):")
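# Note: count is the total number of occurrences, so the percentage below is
# occurrences per 100 descriptions, not the share of descriptions containing the word
for word, count in top_words:
    print(f"  '{word}': {count} times ({count/len(descriptions)*100:.2f}% of descriptions)")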

# Top 2-grams
vectorizer_2gram = CountVectorizer(ngram_range=(2, 2), max_features=10)
try:
    vectorizer_2gram.fit(cleaned_descriptions)
    feature_names = vectorizer_2gram.get_feature_names_out()
    word_count_vector = vectorizer_2gram.transform(cleaned_descriptions)
    word_counts = word_count_vector.sum(axis=0).A1
    top_2grams = [(feature_names[i], word_counts[i]) for i in word_counts.argsort()[-10:][::-1]]
    
    print("\nTop 10 2-grams:")
    for phrase, count in top_2grams:
        print(f"  '{phrase}': {count} times")
except ValueError:  # raised by CountVectorizer when the vocabulary is empty
    print("\nCould not generate 2-grams (insufficient data)")

# 3. VOCABULARY ANALYSIS
print("\n📚 VOCABULARY ANALYSIS")
print("-" * 50)
unique_words = set(all_words)
print(f"Total unique words (vocabulary size): {len(unique_words)}")
print(f"Total word instances: {len(all_words)}")
print(f"Vocabulary richness: {len(unique_words)/len(all_words):.4f}")

# Most frequent word categories (injury-related)
# Note: these are substring matches, so 'ARM' also counts 'FOREARM' and 'HIT' also counts 'WHITE'
injury_words = [word for word in all_words if 'INJUR' in word or 'HURT' in word or 'PAIN' in word]
body_parts = [word for word in all_words if any(part in word for part in ['ARM', 'LEG', 'BACK', 'HAND', 'FINGER', 'HEAD', 'NECK', 'SHOULDER'])]
actions = [word for word in all_words if any(action in word for action in ['LIFT', 'FALL', 'CUT', 'HIT', 'SLIP', 'TWIST'])]

print(f"\nDomain-specific word categories:")
print(f"  Injury-related words: {len(injury_words)} instances")
print(f"  Body part mentions: {len(body_parts)} instances")
print(f"  Action words: {len(actions)} instances")

# 4. LABEL SKEW ANALYSIS (TARGET VARIABLE)
print("\n⚖️ LABEL SKEW ANALYSIS")
print("-" * 50)
target = df['UltimateIncurredClaimCost']

# Multiple binning strategies
print("Cost Distribution Analysis:")

# Strategy 1: Fixed dollar thresholds (the bin widths are not equal)
bins_equal = [0, 1000, 5000, 10000, 50000, float('inf')]
labels_equal = ['Very Low (<$1K)', 'Low ($1K-$5K)', 'Medium ($5K-$10K)', 'High ($10K-$50K)', 'Very High (>$50K)']
cost_ranges_equal = pd.cut(target, bins=bins_equal, labels=labels_equal)
print("\nFixed-threshold binning:")
# Count once, then derive the percentages from the same counts
counts = cost_ranges_equal.value_counts().sort_index()
for label, count in counts.items():
    print(f"  {label}: {count / counts.sum():.1%} ({count:,} claims)")

# Strategy 2: Quantile-based bins
quantiles = target.quantile([0, 0.25, 0.5, 0.75, 0.9, 1.0])
print(f"\nQuantile-based analysis:")
# Use the observed minimum rather than a hard-coded $0 lower bound
print(f"  0-25th percentile: ${quantiles[0.0]:,.0f} - ${quantiles[0.25]:,.0f}")
print(f"  25-50th percentile: ${quantiles[0.25]:,.0f} - ${quantiles[0.5]:,.0f}")
print(f"  50-75th percentile: ${quantiles[0.5]:,.0f} - ${quantiles[0.75]:,.0f}")
print(f"  75-90th percentile: ${quantiles[0.75]:,.0f} - ${quantiles[0.9]:,.0f}")
print(f"  90-100th percentile: ${quantiles[0.9]:,.0f} - ${quantiles[1.0]:,.0f}")

# Skewness analysis
from scipy import stats
skewness = stats.skew(target)
print(f"\nTarget variable skewness: {skewness:.2f}")
if skewness > 1:
    print("  → Highly right-skewed (log transformation recommended)")
elif skewness > 0.5:
    print("  → Moderately right-skewed")
else:
    print("  → Relatively symmetric")

# 5. TEXT QUALITY METRICS
print("\n✅ TEXT QUALITY METRICS")
print("-" * 50)
# Empty or very short descriptions
very_short = (token_lengths <= 2).sum()
print(f"Very short descriptions (≤2 tokens): {very_short} ({very_short/len(descriptions)*100:.2f}%)")

# Descriptions with numbers (might indicate codes)
with_numbers = descriptions.str.contains(r'\d', na=False).sum()
print(f"Descriptions containing numbers: {with_numbers} ({with_numbers/len(descriptions)*100:.2f}%)")

# All caps descriptions (current format)
all_caps = descriptions.str.isupper().sum()
print(f"All uppercase descriptions: {all_caps} ({all_caps/len(descriptions)*100:.2f}%)")

print(f"\n✅ NLP Analysis completed!")
print("Recommendations:")
print("  • Text is clean and consistent (all uppercase)")
print("  • Good description length for model training")
print("  • Strong domain vocabulary for injury claims")
print("  • Target variable needs log transformation due to skewness")

=== NLP-SPECIFIC STATISTICS FOR CLAIM DESCRIPTIONS ===
Total descriptions: 54000
Empty descriptions: 0

📝 TOKEN LENGTH ANALYSIS
--------------------------------------------------
Average tokens per description: 7.02
Median tokens: 7.0
Min tokens: 1
Max tokens: 14
Standard deviation: 1.65

Average characters per description: 43.45
Max characters: 94

🔤 TOP N-GRAMS ANALYSIS
--------------------------------------------------
Top 15 Words (1-grams):
  'RIGHT': 22648 times (41.94% of descriptions)
  'LEFT': 20756 times (38.44% of descriptions)
  'BACK': 16346 times (30.27% of descriptions)
  'STRAIN': 15259 times (28.26% of descriptions)
  'LOWER': 9950 times (18.43% of descriptions)
  'AND': 9103 times (16.86% of descriptions)
  'FINGER': 8584 times (15.90% of descriptions)
  'LIFTING': 8300 times (15.37% of descriptions)
  'HAND': 7723 times (14.30% of descriptions)
  'STRUCK': 7354 times (13.62% of descriptions)
  'SHOULDER': 6198 times (11.48% of descriptions)
  'FELL': 5747 times (10.6