# Causal Analysis of Risk Factors

**Project:** PRISM – Predictive & Research-based Insurance Statistical Modeling

## Objective
To distinguish causal risk drivers from mere correlations using causal inference methods.



In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
freq = pd.read_csv(
    "/content/drive/MyDrive/freMTPL2freq.csv"
)

freq = freq.rename(columns={
    "IDpol": "policy_id",
    "ClaimNb": "claim_count",
    "Exposure": "exposure",
    "Area": "area",
    "VehPower": "vehicle_power",
    "VehAge": "vehicle_age",
    "DrivAge": "driver_age",
    "BonusMalus": "bonus_malus",
    "VehBrand": "vehicle_brand",
    "VehGas": "vehicle_gas"
})


In [4]:
# Treatment: Bonus-Malus above the portfolio median
freq["high_bonus"] = (freq["bonus_malus"] > freq["bonus_malus"].median()).astype(int)

# Outcome: Claim occurrence
freq["has_claim"] = (freq["claim_count"] > 0).astype(int)


In [6]:
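# Covariates used to adjust for confounding between treatment and outcome
X_ps = freq[["vehicle_power", "vehicle_age", "driver_age", "area"]]
y_ps = freq["high_bonus"]

# One-hot encode the 'area' column
X_ps = pd.get_dummies(X_ps, columns=["area"], drop_first=True)

# Propensity model: P(high_bonus = 1 | covariates)
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X_ps, y_ps)

# Estimated propensity score for each policy
freq["propensity"] = ps_model.predict_proba(X_ps)[:, 1]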

In [7]:
freq.groupby("high_bonus")["propensity"].mean()


high_bonus  propensity
0           0.301378
1           0.605963
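
The group means above show that the propensity model separates the two groups, but inverse probability weighting also depends on overlap: if some propensities sit close to 0 or 1, a handful of policies receive very large weights and dominate the estimate. The following is a minimal sketch of an overlap check, not part of the original notebook. The helper name `overlap_summary`, the 0.05/0.95 cut-offs, and the synthetic demo data are all illustrative assumptions.

```python
import numpy as np

def overlap_summary(propensity, treated, low=0.05, high=0.95):
    """Summarize propensity overlap per group (hypothetical helper)."""
    propensity = np.asarray(propensity, dtype=float)
    treated = np.asarray(treated).astype(bool)
    summary = {}
    for name, mask in (("treated", treated), ("control", ~treated)):
        p = propensity[mask]
        summary[name] = {
            "min": float(p.min()),
            "max": float(p.max()),
            # Share of units with propensity outside [low, high]
            "share_extreme": float(np.mean((p < low) | (p > high))),
        }
    # Largest inverse probability weight across both groups
    weights = np.where(treated, 1 / propensity, 1 / (1 - propensity))
    summary["max_weight"] = float(weights.max())
    return summary

# Synthetic stand-in data (assumption: not the freMTPL2 portfolio)
rng = np.random.default_rng(0)
demo_propensity = rng.uniform(0.1, 0.9, size=1_000)
demo_treated = rng.random(1_000) < demo_propensity
demo_summary = overlap_summary(demo_propensity, demo_treated)
print(demo_summary)
```

In the notebook itself, the same check would be called as `overlap_summary(freq["propensity"], freq["high_bonus"])`.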


In [8]:
# Inverse probability weights: 1/e(x) for treated, 1/(1 - e(x)) for controls
freq["weight"] = np.where(
    freq["high_bonus"] == 1,
    1 / freq["propensity"],
    1 / (1 - freq["propensity"])
)

# Weighted outcome means (np.average normalizes by the sum of the weights)
is_treated = freq["high_bonus"] == 1
treated = np.average(freq.loc[is_treated, "has_claim"],
                     weights=freq.loc[is_treated, "weight"])

control = np.average(freq.loc[~is_treated, "has_claim"],
                     weights=freq.loc[~is_treated, "weight"])

# Weighted claim rates and their difference (estimated average effect)
treated, control, treated - control


(np.float64(0.07826164296183412),
 np.float64(0.043085580894735374),
 np.float64(0.035176062067098744))
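
The point estimate above comes without a measure of uncertainty. One common way to attach an interval is a percentile bootstrap over policies. The sketch below is illustrative only: `ipw_difference` and `bootstrap_ci` are hypothetical helpers, the data are synthetic with a built-in effect of 0.03, and for brevity the propensity scores are held fixed across resamples (a fuller version would refit the propensity model inside each resample).

```python
import numpy as np

def ipw_difference(outcome, treated, propensity):
    """Normalized IPW difference in mean outcome (treated minus control)."""
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treated).astype(bool)
    propensity = np.asarray(propensity, dtype=float)
    mean_treated = np.average(outcome[treated],
                              weights=1 / propensity[treated])
    mean_control = np.average(outcome[~treated],
                              weights=1 / (1 - propensity[~treated]))
    return float(mean_treated - mean_control)

def bootstrap_ci(outcome, treated, propensity, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval; propensities are NOT refit (simplification)."""
    outcome, treated, propensity = map(np.asarray, (outcome, treated, propensity))
    rng = np.random.default_rng(seed)
    n = len(outcome)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample policies with replacement
        estimates[b] = ipw_difference(outcome[idx], treated[idx], propensity[idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Synthetic data (assumption): one confounder x, true effect of 0.03
rng = np.random.default_rng(42)
n = 20_000
x = rng.normal(size=n)
demo_propensity = 1 / (1 + np.exp(-0.8 * x))
demo_treated = rng.random(n) < demo_propensity
claim_prob = 0.05 + 0.02 * (x > 0) + 0.03 * demo_treated
demo_outcome = (rng.random(n) < claim_prob).astype(float)

point = ipw_difference(demo_outcome, demo_treated, demo_propensity)
ci_low, ci_high = bootstrap_ci(demo_outcome, demo_treated, demo_propensity,
                               n_boot=200)
print(point, (ci_low, ci_high))
```

Applied to the notebook's data, the call would be `bootstrap_ci(freq["has_claim"], freq["high_bonus"], freq["propensity"])`.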

## Causal Effect of Bonus–Malus on Claim Risk

After adjusting for confounding with inverse propensity score weighting on vehicle power, vehicle age, driver age, and area,
high Bonus–Malus policies have a weighted claim probability of 7.8%, compared with 4.3% for comparable low Bonus–Malus policies, a difference of 3.5 percentage points.
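
For reference, the weighted comparison computed above can be written as a normalized inverse probability weighting estimator. The symbols are introduced here for illustration and do not appear in the notebook: $T_i$ is `high_bonus`, $Y_i$ is `has_claim`, and $\hat{e}(X_i)$ is the fitted `propensity` of policy $i$.

```latex
\hat{\tau}
  = \frac{\sum_i T_i \, Y_i / \hat{e}(X_i)}{\sum_i T_i / \hat{e}(X_i)}
  - \frac{\sum_i (1 - T_i) \, Y_i / \bigl(1 - \hat{e}(X_i)\bigr)}
         {\sum_i (1 - T_i) / \bigl(1 - \hat{e}(X_i)\bigr)}
```

The two denominators reflect that `np.average` divides by the sum of the weights within each group, so each term is a weighted mean rather than a raw weighted sum.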

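Provided the four adjustment covariates capture the confounding between Bonus–Malus and claim occurrence, this is evidence that Bonus–Malus is a causal risk driver and supports retaining it as a pricing factor.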
