# Synthetic Insurance Claims Data Exploration

This notebook provides an initial exploration of our synthetic insurance claims dataset. The data was generated to simulate real-world insurance claims while keeping every feature interpretable and relevant to the business.

## Data Generation Overview

The synthetic data includes:
- Policyholder information (age)
- Vehicle information (type, make, age)
- Claims history
- Accident details
- Geographic information
- Claim severity amounts

## Key Features

- Driver age: 18-80 years
- Vehicle types: Sedan, SUV, Sports, Truck
- Vehicle makes: High-end, Regular
- Accident types: Minor, Moderate, Severe
- Regions: Urban, Suburban, Rural
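
The ranges and categories listed above can double as a quick validation check once the data is loaded. The following is a minimal sketch. The column names (`driver_age`, `vehicle_type`, `vehicle_make`, `accident_type`, `region`) are assumptions for illustration and may differ from the actual CSV headers.

```python
import pandas as pd

# Expected domains taken from the feature list above.
# NOTE: the column names are hypothetical; adjust them to the real headers.
EXPECTED_CATEGORIES = {
    "vehicle_type": {"Sedan", "SUV", "Sports", "Truck"},
    "vehicle_make": {"High-end", "Regular"},
    "accident_type": {"Minor", "Moderate", "Severe"},
    "region": {"Urban", "Suburban", "Rural"},
}
AGE_COLUMN = "driver_age"  # hypothetical column name
AGE_RANGE = (18, 80)       # inclusive bounds from the feature list


def find_domain_violations(df: pd.DataFrame) -> dict:
    """Count out-of-domain values per checked column (missing values count too)."""
    violations = {}
    for column, allowed in EXPECTED_CATEGORIES.items():
        if column in df.columns:
            violations[column] = int((~df[column].isin(allowed)).sum())
    if AGE_COLUMN in df.columns:
        low, high = AGE_RANGE
        violations[AGE_COLUMN] = int((~df[AGE_COLUMN].between(low, high)).sum())
    return violations


# Toy frame for demonstration only (not rows from the actual dataset)
toy = pd.DataFrame({
    "driver_age": [25, 17, 80],
    "vehicle_type": ["Sedan", "SUV", "Van"],
    "region": ["Urban", "Rural", "Suburban"],
})
print(find_domain_violations(toy))
```

Columns that are absent from the frame are skipped rather than reported, so the same helper could be applied to both the training and test sets.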

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

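# Set plot style
# The 'seaborn' matplotlib style name was deprecated in matplotlib 3.6 and is
# no longer available in recent releases, so the theme is set through seaborn.
sns.set_theme(style='darkgrid', palette='husl')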

# Load the synthetic data
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')
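
A first look at each loaded frame usually covers its shape, column types, and missing values. The helper below is a minimal sketch of such a summary. The function name `overview` is hypothetical, and the toy frame stands in for the real data.

```python
import pandas as pd


def overview(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Print the frame's shape and return a per-column summary table."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # missing values are not counted as a category
    })
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    return summary


# Toy frame for demonstration only; in the notebook this would be
# overview(train_df, "train") and overview(test_df, "test").
toy = pd.DataFrame({"a": [1, 2, None], "b": ["x", "x", "y"]})
result = overview(toy, "toy")
print(result)
```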

## Synthetic Data Generation
The dataset was generated using a combination of:
- Realistic distributions (normal distributions for driver age and vehicle age)
- Business rules (e.g., young drivers have higher risk)
- Multipliers for different factors:
  - Vehicle type (SUV: 1.2x, Sports: 1.5x, Truck: 1.3x, Sedan: 1.0x)
  - Vehicle make (High-end: 1.5x, Regular: 1.0x)
  - Accident severity (Minor: 0.5x, Moderate: 1.0x, Severe: 2.0x)
  - Region (Urban: 1.2x, Suburban: 1.0x, Rural: 0.8x)
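
How these multipliers interact is not specified above. As a minimal sketch, assuming they combine multiplicatively and scale a base claim amount, the combined factor for a single claim could be computed as shown below. Both the multiplicative rule and the base amount of 1000 are assumptions for illustration, not the generator's confirmed logic.

```python
# Multipliers copied from the list above
VEHICLE_TYPE = {"Sedan": 1.0, "SUV": 1.2, "Sports": 1.5, "Truck": 1.3}
VEHICLE_MAKE = {"Regular": 1.0, "High-end": 1.5}
ACCIDENT_SEVERITY = {"Minor": 0.5, "Moderate": 1.0, "Severe": 2.0}
REGION = {"Urban": 1.2, "Suburban": 1.0, "Rural": 0.8}


def combined_multiplier(vehicle_type, vehicle_make, accident_type, region):
    """Product of the four factor multipliers (assumed to be multiplicative)."""
    return (
        VEHICLE_TYPE[vehicle_type]
        * VEHICLE_MAKE[vehicle_make]
        * ACCIDENT_SEVERITY[accident_type]
        * REGION[region]
    )


BASE_AMOUNT = 1000.0  # hypothetical base claim amount

factor = combined_multiplier("Sports", "High-end", "Severe", "Urban")
print(round(factor, 2))                # 5.4
print(round(BASE_AMOUNT * factor, 2))  # 5400.0
```

Under this assumption the combined factor ranges from 0.4 (Sedan, Regular, Minor, Rural) to 5.4 (Sports, High-end, Severe, Urban), which gives a rough sense of the spread to expect in claim amounts before any age-related effects.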