# Frequency Modeling using Generalized Linear Models

**Project:** PRISM – Predictive & Research-based Insurance Statistical Modeling

## Objective
To model claim frequency using Poisson and Negative Binomial GLMs with exposure offsets and compare their suitability.
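
As a short sketch of what the exposure offset does, using symbols that are not in the original notebook ($N_i$ for the claim count of policy $i$, $e_i$ for its exposure, $x_i$ for its rating factors, $\beta$ for the coefficients): with a log link, the offset enters the linear predictor with its coefficient fixed at 1, so the fitted coefficients describe claims per unit of exposure.

$$
\log \mathbb{E}[N_i] = \log e_i + x_i^\top \beta
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[N_i]}{e_i} = \exp\left(x_i^\top \beta\right)
$$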


In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf


In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
freq = pd.read_csv(
    "/content/drive/MyDrive/freMTPL2freq.csv"
)

freq = freq.rename(columns={
    "IDpol": "policy_id",
    "ClaimNb": "claim_count",
    "Exposure": "exposure",
    "Area": "area",
    "VehPower": "vehicle_power",
    "VehAge": "vehicle_age",
    "DrivAge": "driver_age",
    "BonusMalus": "bonus_malus",
    "VehBrand": "vehicle_brand",
    "VehGas": "vehicle_gas"
})

In [5]:
freq.head()


Unnamed: 0,policy_id,claim_count,exposure,area,vehicle_power,vehicle_age,driver_age,bonus_malus,vehicle_brand,vehicle_gas,Density,Region
0,1.0,1,0.1,D,5,0,55,50,B12,Regular,1217,R82
1,3.0,1,0.77,D,5,0,55,50,B12,Regular,1217,R82
2,5.0,1,0.75,B,6,2,52,50,B12,Diesel,54,R22
3,10.0,1,0.09,B,7,0,46,50,B12,Diesel,76,R72
4,11.0,1,0.84,B,7,0,46,50,B12,Diesel,76,R72


In [6]:
formula = """
claim_count ~ C(area) + vehicle_power + vehicle_age + driver_age + bonus_malus + C(vehicle_brand) + C(vehicle_gas)
"""

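# Poisson GLM with the family's default log link. log(exposure) enters as an
# offset, so the coefficients describe claims per unit of exposure.
poisson_model = smf.glm(
    formula=formula,
    data=freq,
    family=sm.families.Poisson(),
    offset=np.log(freq["exposure"])
).fit()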

poisson_model.summary()


Dep. Variable:,claim_count,No. Observations:,678013.0
Model:,GLM,Df Residuals:,677992.0
Model Family:,Poisson,Df Model:,20.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-143400.0
Date:,"Sat, 17 Jan 2026",Deviance:,217470.0
Time:,11:13:44,Pearson chi2:,1790000.0
No. Iterations:,7,Pseudo R-squ. (CS):,0.01028
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.9326,0.041,-96.171,0.000,-4.013,-3.852
C(area)[T.B],0.0430,0.021,2.004,0.045,0.001,0.085
C(area)[T.C],0.0735,0.017,4.231,0.000,0.039,0.108
C(area)[T.D],0.1541,0.018,8.571,0.000,0.119,0.189
C(area)[T.E],0.1969,0.018,10.681,0.000,0.161,0.233
C(area)[T.F],0.1727,0.034,5.147,0.000,0.107,0.239
C(vehicle_brand)[T.B10],-0.0104,0.037,-0.282,0.778,-0.082,0.062
C(vehicle_brand)[T.B11],0.0687,0.040,1.736,0.082,-0.009,0.146
C(vehicle_brand)[T.B12],0.1222,0.017,7.327,0.000,0.090,0.155


In [7]:
poisson_model.deviance / poisson_model.df_resid


np.float64(0.320760605713998)

## Poisson GLM Results

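The Poisson model estimates claim frequency from the rating factors, with log exposure as an offset.  
The deviance-to-degrees-of-freedom ratio is about 0.32. For sparse, low-mean count data such as this, however, the deviance is an unreliable overdispersion diagnostic, so a ratio below 1 does not rule overdispersion out. The Pearson statistic reported in the summary gives 1,790,000 / 677,992 ≈ 2.64, which suggests overdispersion relative to the Poisson assumption.  
The Poisson model remains a reasonable baseline for claim frequency modeling, but its standard errors may be too small, which motivates the Negative Binomial fit.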
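
A minimal sketch of how the two dispersion ratios can be computed side by side. It uses only the rounded figures printed in the Poisson summary, so the results are approximate; the helper name `dispersion_ratio` is hypothetical and not part of the notebook.

```python
def dispersion_ratio(statistic: float, df_resid: float) -> float:
    """Divide a goodness-of-fit statistic by the residual degrees of freedom."""
    if df_resid <= 0:
        raise ValueError("df_resid must be positive")
    return statistic / df_resid


# Rounded values copied from the Poisson GLM summary output.
DF_RESID = 677_992
DEVIANCE = 217_470.0
PEARSON_CHI2 = 1_790_000.0

deviance_ratio = dispersion_ratio(DEVIANCE, DF_RESID)
pearson_ratio = dispersion_ratio(PEARSON_CHI2, DF_RESID)

print(f"deviance / df_resid     = {deviance_ratio:.3f}")
print(f"pearson_chi2 / df_resid = {pearson_ratio:.3f}")
```

On the fitted object itself, the same Pearson ratio should be available as `poisson_model.pearson_chi2 / poisson_model.df_resid`, without the rounding.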




## Negative Binomial GLM

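The Negative Binomial model is fitted to relax the Poisson assumption that the variance equals the mean and to account for potential overdispersion in claim frequency. As specified here, `sm.families.NegativeBinomial()` keeps the dispersion parameter at its default value `alpha=1.0`; it is not estimated from the data.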
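
To make the relaxed assumption concrete, the sketch below compares the Poisson variance function with the Negative Binomial (NB2) variance function `mu + alpha * mu**2`. The function names and the example means are illustrative assumptions; `alpha=1.0` mirrors the statsmodels default for `sm.families.NegativeBinomial()`.

```python
def poisson_variance(mu: float) -> float:
    """Poisson variance function: the variance equals the mean."""
    return mu


def negbin_variance(mu: float, alpha: float = 1.0) -> float:
    """NB2 variance function: mu + alpha * mu**2 (alpha = 0 recovers Poisson)."""
    if mu < 0 or alpha < 0:
        raise ValueError("mu and alpha must be non-negative")
    return mu + alpha * mu**2


# Illustrative expected claim counts, not values taken from the fitted model.
for mu in (0.05, 0.10, 0.50):
    ratio = negbin_variance(mu) / poisson_variance(mu)
    print(f"mu={mu:.2f}  Poisson var={poisson_variance(mu):.4f}  "
          f"NB var={negbin_variance(mu):.4f}  ratio={ratio:.2f}")
```

The ratio equals `1 + alpha * mu`, so with small expected counts the two variance functions stay close, and the extra variance grows as the mean grows.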




In [8]:
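# Same specification with a Negative Binomial family. alpha is left at the
# statsmodels default of 1.0, so it is fixed rather than estimated.
nb_model = smf.glm(
    formula=formula,
    data=freq,
    family=sm.families.NegativeBinomial(),
    offset=np.log(freq["exposure"])
).fit()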

nb_model.summary()



Dep. Variable:,claim_count,No. Observations:,678013.0
Model:,GLM,Df Residuals:,677992.0
Model Family:,NegativeBinomial,Df Model:,20.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-142980.0
Date:,"Sat, 17 Jan 2026",Deviance:,189460.0
Time:,11:27:05,Pearson chi2:,1730000.0
No. Iterations:,8,Pseudo R-squ. (CS):,0.009682
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.9392,0.043,-91.332,0.000,-4.024,-3.855
C(area)[T.B],0.0434,0.022,1.959,0.050,-2.4e-05,0.087
C(area)[T.C],0.0735,0.018,4.099,0.000,0.038,0.109
C(area)[T.D],0.1559,0.019,8.387,0.000,0.119,0.192
C(area)[T.E],0.1972,0.019,10.334,0.000,0.160,0.235
C(area)[T.F],0.1776,0.035,5.094,0.000,0.109,0.246
C(vehicle_brand)[T.B10],-0.0069,0.038,-0.181,0.856,-0.082,0.068
C(vehicle_brand)[T.B11],0.0690,0.041,1.677,0.093,-0.012,0.150
C(vehicle_brand)[T.B12],0.1402,0.017,8.118,0.000,0.106,0.174


In [9]:
poisson_model.aic, nb_model.aic


(np.float64(286838.15101673116), np.float64(285993.22311047727))

## Model Comparison

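The Poisson and Negative Binomial models are compared using AIC, where a lower value indicates a better trade-off between fit and complexity.  
Although the deviance-based ratio did not flag overdispersion, the Negative Binomial model achieves a lower AIC (285,993 vs. 286,838, a difference of about 845), suggesting a better fit. Note that the Negative Binomial dispersion parameter was left at the statsmodels default (`alpha=1.0`) rather than estimated, so the comparison applies to that specific value.  
Therefore, the Negative Binomial GLM is selected as the final frequency model for pricing.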
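
As a rough cross-check of the comparison, the sketch below recomputes AIC as `2k - 2 * log-likelihood` from the rounded log-likelihoods in the two summaries and takes the difference of the reported AIC values. It assumes `k = 21` parameters for both models (Df Model of 20 plus the intercept); the helper name `aic` is hypothetical.

```python
def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


# AIC values as printed by the notebook.
POISSON_AIC = 286838.15101673116
NEGBIN_AIC = 285993.22311047727

# Rounded log-likelihoods from the summaries; k = Df Model (20) + intercept.
N_PARAMS = 21
poisson_aic_approx = aic(-143400.0, N_PARAMS)
negbin_aic_approx = aic(-142980.0, N_PARAMS)

delta_aic = POISSON_AIC - NEGBIN_AIC

print(f"Poisson AIC (from rounded log-likelihood): {poisson_aic_approx:.0f}")
print(f"NegBin AIC  (from rounded log-likelihood): {negbin_aic_approx:.0f}")
print(f"Reported AIC difference (Poisson - NegBin): {delta_aic:.2f}")
```

The approximate values differ from the reported ones by a few units only because the summaries round the log-likelihood; the sign and size of the difference are unchanged.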
