# Actuarial Pricing Process Automation
## End-to-End Insurance Claims Cost Prediction

### Objective
Demonstrate actuarial pricing automation with decision intervention points at each stage.

### Process Overview
1. Data Collection
2. Data Cleaning
3. Exploratory Data Analysis
4. Loss Estimation
5. Risk Factor Analysis
6. Premium Calculation and Final Pricing
7. Output Generation & Submission

## Stage 0: Environment Setup

In [None]:
!pip install -q xgboost lightgbm catboost kaggle

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

print('Libraries loaded')

---
## Stage 1: Data Collection

**Decision Points:**
- Time period for analysis
- External data requirements
- Data granularity level

---
**Egyptian Universal Health Insurance (UHI) Context:**

Per Law 2/2018, Article 44, an actuarial review must be conducted every 4 years.

In [None]:
# Step 1: Upload kaggle.json file
from google.colab import files
print('Please upload your kaggle.json file:')
uploaded = files.upload()

In [None]:
# Step 2: Setup Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
print('Kaggle credentials configured.')

In [None]:
# Step 3: Download competition data
# Competition: actuarial-loss-estimation

COMPETITION = 'actuarial-loss-estimation'

print(f'Downloading data from {COMPETITION}...')
!kaggle competitions download -c {COMPETITION}

# Extract (check for the exact archive that the unzip command expects)
zip_path = f'{COMPETITION}.zip'
if os.path.exists(zip_path):
    print(f'Found: {zip_path}')
    !unzip -q -o {zip_path}
    print('Data extracted successfully!')
else:
    print('ERROR: Download failed. Please accept competition rules at:')
    print(f'https://www.kaggle.com/competitions/{COMPETITION}')

In [None]:
# Load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(f'Training data: {train.shape}')
print(f'Test data: {test.shape}')
train.head()

---
## Stage 2: Data Cleaning

**Decision Points:**
- Missing value imputation method
- Outlier treatment approach
- Data quality thresholds

In [None]:
print('Data Quality Report')
print(f'Total records: {len(train):,}')
missing = train.isnull().sum()
missing_pct = (missing / len(train) * 100).round(2)
missing_report = pd.DataFrame({'Missing': missing, 'Pct': missing_pct})
print(missing_report[missing_report['Missing'] > 0])

In [None]:
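# ACTUARY DECISION: Missing Value Strategy
MISSING_STRATEGY = {
    'numeric': 'median',       # 'median' or 'mean'
    'categorical': 'mode',     # 'mode'; any other value fills with the literal 'MISSING'
    'threshold_drop': 0.5      # recorded for reference; not enforced by clean_data
}

def clean_data(df, config):
    """Impute missing values column by column according to `config`."""
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isnull().any():
            fill = df[col].mean() if config['numeric'] == 'mean' else df[col].median()
            df[col] = df[col].fillna(fill)
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].isnull().any():
            modes = df[col].mode()
            use_mode = config['categorical'] == 'mode' and not modes.empty
            df[col] = df[col].fillna(modes.iloc[0] if use_mode else 'MISSING')
    return df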

train_clean = clean_data(train, MISSING_STRATEGY)
test_clean = clean_data(test, MISSING_STRATEGY)
print('Data cleaned')
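
Outlier treatment is listed among the Stage 2 decision points, but the cleaning cell above only imputes missing values. One common option is quantile capping (winsorizing), sketched below. The helper name `cap_outliers`, the 1%/99% bounds, and the toy `InitialCost` column are illustrative assumptions, not part of the original pipeline. In practice the bounds would likely be computed on the training data and reused for the test data.

```python
import numpy as np
import pandas as pd

def cap_outliers(df, cols, lower=0.01, upper=0.99):
    """Clip each listed numeric column to its [lower, upper] quantile range."""
    df = df.copy()
    for col in cols:
        low_bound, high_bound = df[col].quantile([lower, upper])
        df[col] = df[col].clip(lower=low_bound, upper=high_bound)
    return df

# Toy data with one extreme value (hypothetical column name)
demo = pd.DataFrame({'InitialCost': [100.0, 120.0, 110.0, 130.0, 1_000_000.0]})
capped = cap_outliers(demo, ['InitialCost'])

# The extreme value is pulled down to the 99th-percentile bound
print(capped['InitialCost'].max() < demo['InitialCost'].max())
```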

---
## Stage 3: Exploratory Data Analysis

**Decision Points:**
- Target distribution transformation
- Key predictive variables

In [None]:
TARGET = 'UltimateIncurredClaimCost'

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(train_clean[TARGET], bins=50, color='steelblue', edgecolor='white')
axes[0].set_title('Original Distribution')
axes[1].hist(np.log1p(train_clean[TARGET]), bins=50, color='coral', edgecolor='white')
axes[1].set_title('Log Transformed')
plt.tight_layout()
plt.show()

In [None]:
# ACTUARY DECISION: Target Transformation
TARGET_TRANSFORM = 'log'
print(f'Selected transformation: {TARGET_TRANSFORM}')

if TARGET_TRANSFORM == 'log':
    y = np.log1p(train_clean[TARGET])
else:
    y = train_clean[TARGET]
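
The log transformation has to be inverted before predictions can be read as claim costs. The sketch below uses made-up claim amounts to check that `np.expm1` undoes `np.log1p`, and to show why `log1p` is a convenient choice: it is defined for a claim cost of zero.

```python
import numpy as np

claims = np.array([0.0, 250.0, 4_000.0, 150_000.0])  # made-up claim costs

y_log = np.log1p(claims)       # log(1 + x), defined at x = 0
recovered = np.expm1(y_log)    # exp(y) - 1, the inverse of log1p

# The round trip recovers the original amounts up to floating-point error
print(np.allclose(recovered, claims))
```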

---
## Stage 4: Loss Estimation

**Decision Points:**
- Model selection
- Hyperparameters
- Validation approach

In [None]:
EXCLUDE_COLS = [TARGET, 'ClaimNumber', 'ClaimDescription', 'AccidentDescription',
                'DateTimeOfAccident', 'DateReported', 'DateOfBirth']

feature_cols = [c for c in train_clean.columns if c not in EXCLUDE_COLS]

for col in train_clean[feature_cols].select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    all_vals = pd.concat([train_clean[col], test_clean[col]]).astype(str).unique()
    le.fit(all_vals)
    train_clean[col] = le.transform(train_clean[col].astype(str))
    test_clean[col] = le.transform(test_clean[col].astype(str))

X = train_clean[feature_cols].fillna(-999)
X_test = test_clean[feature_cols].fillna(-999)
print(f'Features prepared: {len(feature_cols)}')

In [None]:
# ACTUARY DECISION: Model Configuration
MODEL_CONFIG = {
    'models': ['xgboost', 'lightgbm', 'catboost'],
    'n_folds': 5,
    'ensemble_method': 'average'
}

print('Model Configuration:')
for k, v in MODEL_CONFIG.items():
    print(f'  {k}: {v}')

In [None]:
kf = KFold(n_splits=MODEL_CONFIG['n_folds'], shuffle=True, random_state=42)
results = {}

for model_name in MODEL_CONFIG['models']:
    print(f'\nTraining {model_name}...')
    oof = np.zeros(len(X))
    pred = np.zeros(len(X_test))
    
    for fold, (tr_idx, val_idx) in enumerate(kf.split(X)):
        X_tr, X_val = X.iloc[tr_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[tr_idx], y.iloc[val_idx]
        
        if model_name == 'xgboost':
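            # early_stopping_rounds is a constructor argument in xgboost >= 1.6;
            # it makes the eval_set below actually stop training, as for the other two models
            model = xgb.XGBRegressor(n_estimators=1000, max_depth=6, learning_rate=0.05,
                                      random_state=42, verbosity=0,
                                      early_stopping_rounds=50)
            model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)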
        elif model_name == 'lightgbm':
            model = lgb.LGBMRegressor(n_estimators=1000, max_depth=6, learning_rate=0.05,
                                       random_state=42, verbose=-1)
            model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                     callbacks=[lgb.early_stopping(50, verbose=False)])
        else:
            model = CatBoostRegressor(iterations=1000, depth=6, learning_rate=0.05,
                                       random_state=42, verbose=0)
            model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=50)
        
        oof[val_idx] = model.predict(X_val)
        pred += model.predict(X_test) / MODEL_CONFIG['n_folds']
        # MAE is measured on the transformed target (log scale when TARGET_TRANSFORM == 'log')
        print(f'  Fold {fold+1}: MAE = {mean_absolute_error(y_val, oof[val_idx]):.4f}')
    
    results[model_name] = {'oof': oof, 'pred': pred}
    print(f'  OOF MAE: {mean_absolute_error(y, oof):.4f}')
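
When `TARGET_TRANSFORM` is `'log'`, the MAE values printed by the training loop are in log units and cannot be read as currency amounts. A hedged sketch of converting out-of-fold predictions back to the original scale before scoring follows. `mae_original_scale` is a hypothetical helper, and the arrays are synthetic stand-ins for `y` and `results[m]['oof']`.

```python
import numpy as np

def mae_original_scale(y_log, pred_log):
    """MAE in original units for targets and predictions stored as log1p values."""
    y_true = np.expm1(np.asarray(y_log))
    y_pred = np.expm1(np.asarray(pred_log))
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic stand-ins for y and one model's out-of-fold predictions
y_demo = np.log1p(np.array([1_000.0, 5_000.0, 20_000.0]))
oof_demo = np.log1p(np.array([1_200.0, 4_500.0, 26_000.0]))

mae_demo = mae_original_scale(y_demo, oof_demo)
print(f'MAE (original units): {mae_demo:,.0f}')
```

In the notebook, the equivalent call would be `mae_original_scale(y, results['xgboost']['oof'])`.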

---
## Stage 5: Risk Factor Analysis

In [None]:
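# `model` is the last estimator fitted in Stage 4 (CatBoost, final fold),
# so these importances describe that single fit, not the whole ensemble
importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)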

plt.figure(figsize=(10, 6))
plt.barh(importance['Feature'][:15], importance['Importance'][:15], color='steelblue')
plt.xlabel('Importance')
plt.title('Top 15 Risk Factors')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
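
The plot above relies on whichever model was fitted last, so it reflects one fold of one algorithm. One possible refinement is to store the importances from every fit and average them after normalizing each vector, since XGBoost, LightGBM, and CatBoost report importances on different scales. The sketch below uses made-up importance vectors, and `average_importance` is a hypothetical helper.

```python
import numpy as np

def average_importance(importance_list):
    """Normalize each importance vector to sum to 1, then average across fits."""
    normalized = []
    for imp in importance_list:
        imp = np.asarray(imp, dtype=float)
        total = imp.sum()
        # Leave an all-zero vector unchanged to avoid dividing by zero
        normalized.append(imp / total if total > 0 else imp)
    return np.mean(normalized, axis=0)

# Made-up importances from three fits over the same three features
fold_importances = [
    [0.6, 0.3, 0.1],       # fractions that already sum to 1
    [120, 60, 20],         # split counts
    [50.0, 30.0, 20.0],    # percentages
]
avg = average_importance(fold_importances)
print(avg.round(3))
```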

---
## Stage 6: Premium Calculation

**Egyptian UHI Context:**

| Category | Individual | Employer/State |
|----------|------------|----------------|
| Employee | 1% | 4% |
| Non-working spouse | 3% | - |
| Children (max 2) | 1% each | - |
| Self-employed | 4% | - |
| Pensioners | 1% | 2% |
| Unable-to-pay | 0% | 5% |

In [None]:
# ACTUARY DECISION: Pricing Parameters
PRICING_CONFIG = {
    'expense_loading': 0.25,
    'profit_margin': 0.10,
    'contingency_margin': 0.05,
    'reinsurance_cost': 0.03,
    'commission_rate': 0.15
}

total_loading = sum(PRICING_CONFIG.values())
print(f'Total Loading: {total_loading*100:.0f}%')

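ensemble_pred = np.mean([results[m]['pred'] for m in results], axis=0)
# Undo the Stage 3 transformation only if one was applied
expected_loss = np.expm1(ensemble_pred) if TARGET_TRANSFORM == 'log' else ensemble_pred
premium = expected_loss * (1 + total_loading)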

print(f'Mean Premium: ${np.mean(premium):,.0f}')
print(f'Median Premium: ${np.median(premium):,.0f}')
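
The premium formula above adds the five loadings together and applies the total as a markup on expected loss. With the configured values the loadings sum to 0.58, so an expected loss of 1,000 yields a premium of 1,580. The sketch below restates that arithmetic on a single made-up loss; `gross_premium` is a hypothetical helper name.

```python
PRICING_CONFIG = {
    'expense_loading': 0.25,
    'profit_margin': 0.10,
    'contingency_margin': 0.05,
    'reinsurance_cost': 0.03,
    'commission_rate': 0.15,
}

def gross_premium(expected_loss, config):
    """Expected loss marked up by the sum of all loadings (additive approach)."""
    total_loading = sum(config.values())
    return expected_loss * (1 + total_loading)

# Made-up expected loss of 1,000 currency units
premium_demo = gross_premium(1_000.0, PRICING_CONFIG)
print(f'{premium_demo:,.2f}')
```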

---
## Stage 7: Output & Submission

In [None]:
# Create submission file
submission = pd.DataFrame({
    'ClaimNumber': test_clean['ClaimNumber'],
    'UltimateIncurredClaimCost': np.maximum(expected_loss, 0)
})

submission.to_csv('submission.csv', index=False)
print('Submission file created!')
print(submission.head())
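
Before submitting, a few sanity checks on the submission frame can catch problems early. The checks below are a sketch under assumptions: `validate_submission` is a hypothetical function, and the rules (expected columns, unique IDs, no missing or negative costs) are plausible requirements rather than ones stated by the competition.

```python
import pandas as pd

def validate_submission(sub, id_col='ClaimNumber',
                        target_col='UltimateIncurredClaimCost'):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(sub.columns) != [id_col, target_col]:
        return ['unexpected columns']
    problems = []
    if sub[id_col].duplicated().any():
        problems.append('duplicate claim IDs')
    if sub[target_col].isnull().any():
        problems.append('missing predictions')
    if (sub[target_col] < 0).any():
        problems.append('negative predictions')
    return problems

# Toy frame with made-up IDs and costs
demo_sub = pd.DataFrame({
    'ClaimNumber': ['A1', 'A2', 'A3'],
    'UltimateIncurredClaimCost': [1200.0, 530.5, 0.0],
})
print(validate_submission(demo_sub))
```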

In [None]:
# Submit to Kaggle (reuses the COMPETITION name set in Stage 1)
!kaggle competitions submit -c {COMPETITION} -f submission.csv -m "Ensemble XGB+LGB+CatBoost"
print('\nSubmission command finished; check the output above for errors.')

In [None]:
# Download submission file locally
from google.colab import files
files.download('submission.csv')

---
## Summary

| Stage | Objective | Decision Points |
|-------|-----------|----------------|
| 1. Data Collection | Gather historical data | Sources, period, granularity |
| 2. Data Cleaning | Ensure quality | Imputation, outliers |
| 3. EDA | Pattern analysis | Transformations |
| 4. Loss Estimation | Predictive modeling | Model selection |
| 5. Risk Factors | Driver identification | Factor validation |
| 6. Premium Calculation | Apply loadings | Margins |
| 7. Output & Submit | Kaggle submission | Final review |