# Predictive Modeling for Car Insurance Claims

## Introduction

Car insurance plays a critical role in the financial stability of vehicle owners, especially after accidents or damage. Insurers, in turn, face the challenge of determining which customers are most likely to make claims. Identifying high-risk customers early lets insurers optimize pricing, improve risk management, mitigate fraud, and streamline claims processing. This project therefore develops a predictive model that uses customer demographics, vehicle information, and historical claims data to estimate the likelihood of a future claim.

The dataset, sourced from Kaggle, contains customer demographics, vehicle details, and insurance-related history. The goal is to understand the relationship between these variables and the occurrence of claims (claim_flag) to build a predictive model that estimates whether a policyholder will file an insurance claim or not. I explore several models: Logistic Regression, Decision Trees, Random Forest, and XGBoost, assessing their performance and identifying key predictive features.

Through this report, I shall take you through the steps involved, including data cleaning, exploration, model development, and analysis of outcomes. The final findings provide insights into factors influencing claims and demonstrate how machine learning can empower decision-making in insurance.


## Data Description

The dataset used for this analysis was sourced from https://www.kaggle.com/datasets/xiaomengsun/car-insurance-claim-data/data.
It comprises 26 features and a binary target variable, claim_flag.

Feature Overview:

Demographic Features:

*   youth_drivers: Number of young drivers in the household.
*   age: Age of the primary policyholder.
*   num_of_children: Number of children in the household.
*   years_on_job: Number of years the policyholder has been employed.
*   income: Annual income of the policyholder.
*   single_parent: Whether the policyholder is a single parent.
*   marital_status: Marital status of the policyholder.
*   gender: Gender of the policyholder.
*   education_level: Highest education level achieved by the policyholder.
*   occupation: Occupation of the policyholder.

Vehicle-Related Features:

*  vehicle_value: Value of the vehicle.
*  vehicle_age: Age of the vehicle.
*  vehicle_type: Type of vehicle (e.g., van, sedan, SUV).
*  red_vehicle: Whether the vehicle is red (Yes/No).
*  travel_time: Estimated travel time for the policyholder.

Historical Claim and Licensing Features:

* 5yr_total_claims_value: Total value of claims filed in the past five years.
* 5yr_num_of_claims: Number of claims filed in the past five years.
* licence_revoked: Whether the policyholder's license has been revoked.
* license_points: Number of penalty points on the policyholder's license.
* time_in_force: Duration for which the policy has been active (in months).
* claim_flag: Target variable (1 if a claim was filed, 0 otherwise).

Additional Features:

* area: Type of area the policyholder resides in (urban/rural).
* home_value: Estimated value of the policyholder's home.
* new_claim_amount: Amount associated with a new claim.
* type_of_use: Primary vehicle use (commercial or private).

As part of basic data cleaning, I first renamed the columns for better readability. Several monetary columns, such as income and vehicle_value, contained special characters like $ and commas, which I removed so the columns could be converted to a numerical format. I also dropped unnecessary features like ID and DOB, as they did not contribute to the prediction task.
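The cleaning steps above could be sketched roughly as follows. This is a minimal illustration rather than the exact code used in the project: the toy rows are made up, and the helper name `clean_currency` is hypothetical.

```python
import pandas as pd

# Toy frame standing in for the raw Kaggle file; all values are invented.
raw = pd.DataFrame({
    "ID": [101, 102, 103],
    "DOB": ["01JAN70", "15MAR82", "30JUL91"],
    "income": ["$67,349", "$91,449", None],
    "vehicle_value": ["$14,230", "$4,010", "$21,970"],
    "claim_flag": [0, 1, 0],
})

def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' then convert to numbers; missing values stay NaN."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

money_cols = ["income", "vehicle_value"]

# Drop identifier-like columns, then convert the monetary columns.
cleaned = raw.drop(columns=["ID", "DOB"])
cleaned[money_cols] = cleaned[money_cols].apply(clean_currency)

print(cleaned.dtypes)
```

Missing monetary values are deliberately left as NaN here so that they can be imputed later in the modeling pipeline.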

The dataset is imbalanced, with most policyholders not filing claims. This imbalance makes prediction more challenging, as models may tend to favor the majority class (no claims) while ignoring the minority class (claims). I addressed this later during modeling through stratified sampling and careful performance evaluation.

### Exploratory Data Analysis (EDA)

Before modeling, I explored the data to identify patterns and relationships between features and the target variable. This phase helped me understand feature importance and make informed preprocessing decisions. Histograms and boxplots were generated to visualize distributions of selected numerical features.
The graphs revealed that:

Income:
Claimants and non-claimants have similar median incomes, although claimants show more extreme outliers at higher income levels. Overall, this suggests no strong relationship between income and claims.

New Claim Amount:
The distribution remains very low, suggesting that even for those who file claims, the claim amount is often modest.

5-Year Number of Claims:
Claimants tend to have a higher number of claims in the past 5 years, highlighting claim history as a strong predictor of future claims.

Vehicle Age:
Older vehicles are more likely to be associated with claims, suggesting that vehicle age is a relevant predictor due to increased likelihood of issues with aging vehicles.

License Points:
Individuals with higher license points, indicative of poor driving behavior, are more likely to file claims, making this a key feature in predicting claim likelihood.

Feature Correlations:

To quantify relationships, I computed the correlation matrix and visualized it as a heatmap. The features 5yr_num_of_claims and license_points emerged as the most positively correlated with claim_flag. Other variables, like income and home_value, also showed some correlation with the target. This initial analysis informed feature selection and guided my preprocessing decisions. Features with very low correlation (e.g., red_vehicle) were removed.
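A correlation ranking of this kind might be computed as in the sketch below. The data here is synthetic (generated so that past claims and license points drive the target), so the printed numbers are illustrative only and are not the values from the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three of the dataset's columns.
claims_5yr = rng.poisson(0.8, n)
points = rng.poisson(1.5, n)
logit = -1.5 + 0.6 * claims_5yr + 0.3 * points
claim_flag = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

df = pd.DataFrame({
    "5yr_num_of_claims": claims_5yr,
    "license_points": points,
    "red_vehicle": rng.integers(0, 2, n),  # unrelated to the target by construction
    "claim_flag": claim_flag,
})

# Pearson correlation matrix, then each feature's correlation with the target.
corr = df.corr(numeric_only=True)
target_corr = corr["claim_flag"].drop("claim_flag").sort_values(ascending=False)
print(target_corr)

# A heatmap could then be drawn with, e.g., seaborn.heatmap(corr, annot=True).
```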

## Modeling and Methods


I employed four machine learning models to predict claim_flag: Logistic Regression, Decision Trees, Random Forest, and XGBoost. A baseline model was also used to establish a lower bound for performance. Each model was trained on a stratified train-test split (80-20), ensuring that the class proportions were preserved in both the training and test sets.
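As a rough sketch, a stratified 80-20 split can be produced with scikit-learn's `train_test_split`. The feature matrix below is random noise and the 27% positive rate is an assumption chosen to resemble the imbalance described in this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))          # placeholder features
y = np.array([1] * 270 + [0] * 730)     # assumed ~27% claims, ~73% non-claims

# stratify=y keeps the claim rate (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train claim rate: {y_train.mean():.3f}")
print(f"test claim rate:  {y_test.mean():.3f}")
```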

I built a Preprocessing Pipeline to prepare data for modeling:

1. Imputation:

   Numerical features: Missing values replaced with the median.

   Categorical features: Missing values replaced with the mode.

2. Encoding: Applied One-Hot Encoding to categorical variables.

3. Scaling: Standardized numerical features using StandardScaler to ensure equal weighting during model training.

The pipeline was implemented using make_column_transformer from sklearn and applied uniformly to all models.
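A pipeline along these lines could look like the sketch below. The column lists and the toy data are assumptions for illustration; the real project would pass the full sets of numerical and categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative subsets of the columns; not the full feature lists.
num_cols = ["age", "income"]
cat_cols = ["vehicle_type"]

preprocessor = make_column_transformer(
    # Numerical: median imputation, then standardization.
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_cols),
    # Categorical: mode imputation, then one-hot encoding.
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), cat_cols),
)

toy = pd.DataFrame({
    "age": [35, 52, np.nan, 41],
    "income": [50000.0, np.nan, 72000.0, 61000.0],
    "vehicle_type": ["SUV", "Van", np.nan, "SUV"],
})

X = preprocessor.fit_transform(toy)
X_dense = X.toarray() if hasattr(X, "toarray") else X
print(X_dense.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

In practice the transformer would be fitted on the training split only and then applied to the test split, so that no test-set statistics leak into imputation or scaling.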

### Baseline Model

To evaluate the models effectively, I created a benchmark that predicts the majority class ("No Claims") for all instances. This resulted in an accuracy of 73%, but recall, precision, and F1 were zero since no claims were predicted. This highlighted the need for a more robust predictive model.
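The baseline can be reproduced in spirit with scikit-learn's `DummyClassifier`. The class counts below (1512 non-claims, 549 claims) are taken from the XGBoost confusion matrix reported later in this report, under the assumption that all models share the same test split. Fitting the dummy model on the test labels is a shortcut for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumed test-set composition: 1408 + 104 non-claims, 295 + 254 claims.
y_test = np.array([0] * 1512 + [1] * 549)
X_dummy = np.zeros((len(y_test), 1))  # features are ignored by DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_test)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.1f} recall={rec:.1f} f1={f1:.1f}")
```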



### Logistic Regression

Logistic Regression achieved an accuracy of 79.8%, showing a solid performance overall. The model demonstrated a precision of 69.1%, meaning that when it predicted a claim, it was correct about 69% of the time. However, the recall was 44.0%, indicating that it only correctly identified 44% of all actual claims. This is further reflected in the F1-Score of 53.8%, which balances precision and recall but highlights the model's struggle to effectively capture all positive cases.

The confusion matrix revealed that the model correctly identified 242 true positives (actual claims predicted correctly) but failed to detect 307 claims, categorizing them as false negatives. This suggests that while the model performs well in avoiding false positives, it tends to miss many true claims.
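As a quick consistency check, the reported recall and F1-Score can be recomputed from the figures above. Only the true positives, false negatives, and precision are given in the text, so those are the only inputs used here.

```python
# Figures reported for Logistic Regression in this section.
tp, fn = 242, 307
precision = 0.691

# Recall is the share of actual claims that were caught.
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recall = {recall:.3f}, F1 = {f1:.3f}")
```

Both values agree with the reported 44.0% recall and 53.8% F1-Score up to rounding.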

Logistic Regression is a valuable starting point due to its simplicity and interpretability. The coefficients of the model allow us to understand the influence of each feature on the predictions.
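One common way to read the coefficients is to exponentiate them into odds ratios. The sketch below uses synthetic data, and the feature names are attached purely for illustration, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical names attached to synthetic columns.
feature_names = ["5yr_num_of_claims", "license_points", "vehicle_age", "income"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = multiplicative change in the odds of a claim
# for a one-standard-deviation increase in the feature.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(feature_names, odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means the feature pushes predictions toward a claim, and a value below 1 pushes them away from one.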


### Decision Tree

The Decision Tree model displayed lower overall performance compared to Logistic Regression, with an accuracy of 71%. While it improved slightly in recall by identifying 46% of claims, this came at the cost of precision, which dropped to 45.6%. The F1-Score of 45.8% highlights the imbalance between precision and recall, indicating an inability to consistently identify claims without misclassifying other instances.

One major challenge with the Decision Tree is overfitting, where the model becomes overly tailored to the training data, capturing noise instead of general patterns. This leads to poor generalization on unseen data. The model's reliance on splitting criteria for individual features can also make it sensitive to outliers and small variations in the data.
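The overfitting pattern described above can be illustrated by comparing an unrestricted tree with a depth-limited one. The data is synthetic with roughly the same class imbalance as this project, and `max_depth=4` is an arbitrary choice for the comparison, not a setting taken from the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with some label noise (flip_y).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.73, 0.27], flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# A large gap between train and test accuracy is a sign of overfitting.
for name, tree in [("unrestricted", full), ("max_depth=4", shallow)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train={tree.score(X_tr, y_tr):.3f}, test={tree.score(X_te, y_te):.3f}")
```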



### Random Forest
The Random Forest model improved accuracy to 79.08%, significantly outperforming the standalone Decision Tree. Its precision of 71.6% shows that when the model predicted a claim, it was correct in most cases. However, the recall remained relatively low at 35.5%, indicating that the model missed a substantial number of actual claims.

The improved F1-Score of 47.5% reflects a better balance between precision and recall compared to the Decision Tree. This improvement is due to the Random Forest's ability to reduce overfitting by combining predictions from multiple trees, which improves performance and generalization.

While Random Forest demonstrated better overall performance, the trade-off between precision and recall remains a challenge. The model tends to favor precision over recall, which may lead to missed claims (false negatives). Nonetheless, Random Forest's strength lies in its ability to handle non-linear relationships and feature importance analysis, making it a reliable step forward in our modeling process.
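Feature importance analysis with a Random Forest might look like the following sketch. The data is synthetic and the feature names are hypothetical labels, so the ranking printed here does not reflect the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names attached to synthetic columns.
feature_names = ["5yr_num_of_claims", "license_points", "vehicle_age", "income"]
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.73, 0.27], random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```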


### XGBoost

To build on these results, I applied XGBoost (Extreme Gradient Boosting), an efficient implementation of gradient boosting that often strikes a better balance between precision and recall. Gradient boosting sequentially trains a series of weak learners (such as shallow Decision Trees), with each new learner correcting the errors of its predecessors. XGBoost enhances this process with regularization and parallel processing, which typically makes it faster and more accurate than traditional boosting implementations. Its key strengths are its ability to handle non-linear relationships in the data and its many tunable hyperparameters, which help balance overfitting against underfitting.

XGBoost was first applied with its default parameters to establish a reference point before tuning. This initial model achieved a reasonably good accuracy of 77.9% and a precision of 61.6%, meaning it correctly predicted many non-claims (class 0) while keeping its positive predictions (class 1) fairly reliable. However, the recall of 45.5% indicated that the model missed a significant number of claims (false negatives), suggesting room for improvement in identifying true claims.

To improve the model's performance, I tuned its hyperparameters with GridSearchCV, which tests multiple combinations of hyperparameters using cross-validation. Cross-validation evaluates each configuration on different subsets of the data, which reduces the risk of selecting parameters that overfit a single split.

The following hyperparameters were considered:

* n_estimators: Number of boosting rounds or trees in the ensemble. Higher values allow the model to learn more but risk overfitting.
* max_depth: Maximum depth of each tree. Increasing this allows the model to capture more complex relationships but may lead to overfitting.
* learning_rate: Step size used to update the predictions at each boosting iteration. A smaller value ensures gradual learning but requires more iterations.
* subsample: Fraction of samples used to fit each tree, helping to prevent overfitting.
* min_child_weight: Minimum sum of weights required in a child node, controlling model complexity.

GridSearchCV performed a 3-fold cross-validation over 243 parameter combinations for a total of 729 fits, evaluating each configuration based on the F1 Score. F1 Score was chosen as the evaluation metric because it balances precision and recall, making it ideal for imbalanced datasets where both false positives and false negatives are critical.
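The sketch below shows one grid shape that is consistent with these counts: three candidate values for each of the five hyperparameters gives 3^5 = 243 combinations, and 3 folds give 729 fits. The candidate values themselves are assumptions, since the report does not list the grid that was actually searched, and the helper `build_search` is a hypothetical name.

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Assumed candidate values (three per hyperparameter); not the actual grid.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.7, 0.8, 1.0],
    "min_child_weight": [1, 3, 5],
}

cv_folds = 3
n_combinations = len(ParameterGrid(param_grid))
total_fits = n_combinations * cv_folds
print(n_combinations, total_fits)  # 243 729

def build_search():
    """Return an unfitted grid search scored on F1 (requires the xgboost package)."""
    from xgboost import XGBClassifier  # imported lazily so the count above runs without it
    return GridSearchCV(
        estimator=XGBClassifier(eval_metric="logloss", random_state=42),
        param_grid=param_grid,
        scoring="f1",
        cv=cv_folds,
        n_jobs=-1,
    )
```

Calling `build_search().fit(X_train, y_train)` on the preprocessed training data would then run the search, after which `best_params_` holds the selected configuration.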

After hyperparameter tuning, the best parameters were:

* learning_rate: 0.1
* max_depth: 3
* n_estimators: 200
* subsample: 0.7
* min_child_weight: 1

Using these optimized hyperparameters, the model's performance improved as follows:

* Accuracy: 80.6%
* Precision: 70.9%
* Recall: 46.3%
* F1 Score: 56.0%

The model correctly classified 80.6% of all instances. This is an improvement over the pre-tuned model's accuracy. When the model predicted a claim, it was correct 70.9% of the time. This indicates that false positives have been well controlled. The model identified 46.3% of all true claims. While this is a modest improvement over the pre-tuned recall, it reflects the challenge of increasing recall without sacrificing precision. The F1 Score improved to 56.0%, showing an overall better handling of both false positives and false negatives.

The confusion matrix for XGBoost revealed the following results:

* True Negatives: 1408 (non-claims correctly identified)
* False Positives: 104 (non-claims incorrectly predicted as claims)
* False Negatives: 295 (actual claims missed by the model)
* True Positives: 254 (actual claims correctly identified)

This indicates that while the model performs well in minimizing false positives, it still struggles to identify a significant portion of actual claims, as reflected by the 295 false negatives. Although the model successfully predicted 254 true claims, the tendency to miss true positives highlights the challenge of improving recall without compromising precision.
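The headline metrics for the tuned model follow directly from these four counts, as the short calculation below shows.

```python
# Confusion-matrix counts reported for the tuned XGBoost model.
tn, fp, fn, tp = 1408, 104, 295, 254

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted claims that are real
recall = tp / (tp + fn)                      # share of real claims that are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%} precision={precision:.1%} "
      f"recall={recall:.1%} f1={f1:.1%}")
```

The results match the reported 80.6% accuracy, 70.9% precision, 46.3% recall, and 56.0% F1 Score.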

XGBoost demonstrated superior overall performance compared to previous models like Logistic Regression and Random Forest, helped by its ability to handle non-linear relationships and complex patterns in the data. While its recall still has room for improvement, the tuned model achieved the highest accuracy and F1 Score of all the models, with precision close to that of Random Forest, making it the best-performing model in this analysis.

## Conclusion

#### Analysis of Model Behavior:

Each model had strengths and weaknesses. Logistic Regression was the second best but needed improvement on recall. Decision Tree overfit the data, while Random Forest improved precision but further reduced recall. XGBoost emerged as the best-performing model after tuning but still exhibited a trade-off between precision and recall.

The improvements seen in precision and F1 Score indicate that the model better balances precision and recall after tuning, making it more reliable for real-world application. However, recall remains lower than precision, meaning the model still tends to miss claims, and missed claims can be costly for insurance companies. This behavior is common in classification problems with imbalanced datasets, where the majority class (non-claims) dominates. XGBoost, being an ensemble method, excels at capturing complex relationships in data, but balancing precision and recall often requires additional strategies like:

* Adjusting class weights to penalize false negatives more heavily.
* Exploring other evaluation metrics for additional insights.
* Further fine-tuning hyperparameters.
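For the class-weight strategy, the weights can be derived from the class counts. The sketch below uses the test-set composition implied by the XGBoost confusion matrix (1512 non-claims, 549 claims) and assumes the training set has a similar ratio because the split was stratified. This was not part of the analysis above; it only shows how the weights could be computed.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class composition: 1512 non-claims, 549 claims.
y = np.array([0] * 1512 + [1] * 549)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# XGBoost's scale_pos_weight is commonly set to negatives / positives.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

print(f"class weights: non-claim={weights[0]:.3f}, claim={weights[1]:.3f}")
print(f"scale_pos_weight ~ {scale_pos_weight:.2f}")
```

Passing `scale_pos_weight` to `XGBClassifier` (or `class_weight="balanced"` to scikit-learn estimators that support it) makes errors on the claim class count more heavily during training, which usually raises recall at some cost in precision.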

#### Steps to improve models:

1. Advanced Feature Engineering: Create new features, such as ratios between existing numerical features or aggregating categorical variables into meaningful groups. Improve recall using techniques addressing class imbalance.

2. Model Optimization: Further hyperparameter tuning using techniques like RandomizedSearchCV or Bayesian Optimization to explore a larger hyperparameter space efficiently.

3. Data Augmentation and Preprocessing: Explore transformations such as log scaling for highly skewed features (e.g., income and claim amounts). Identify and address outliers more effectively.

4. Monitoring and Validation: Monitor the model's performance over time with new data to ensure it maintains accuracy and precision. Retrain the model periodically as patterns in claims may evolve.

5. Incorporate additional features into analysis like telematics-based driving behavior data, vehicle maintenance history, credit scores, inconsistencies in claim amounts or delays in reporting, etc.

By integrating these additional data sources and features into the analysis, I could refine the predictive models further, leading to a deeper understanding of the drivers of insurance claims. This approach would not only enhance model accuracy but also offer actionable insights that help insurance companies improve risk assessment strategies, design better pricing models, detect fraudulent claims more effectively, and provide tailored solutions to high-risk policyholders.
