<a href="https://colab.research.google.com/github/danielbauer1979/CAS_PredMod/blob/main/pa_pynb_sess3_GLMCaseStudy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 3 -- GLM Case Study

Dani Bauer, 9/12/2022

In this tutorial, we will present a detailed case study in the context of auto liability insurance, showcasing a variety of GLM techniques.

Let's start by loading the libraries we will need. Again, we rely on the statistical learning toolkit scikit-learn, which provides GLM functionality and will also be used later for algorithmic learners. It is less convenient to use than some other packages and, unlike R, does not support formulas. But it is versatile and fast, and therefore one of the most popular predictive modeling toolkits.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor

We consider predictive modeling of auto claims, i.e., the overarching challenge is predicting the frequencies and/or severities of claims. We rely on the comprehensive French Motor Third-Party Liability datasets `freMTPLfreq` and `freMTPLsev`, available within the [CASdatasets package](http://cas.uqam.ca/).

Let's take a peek, first at the frequency dataset:

In [None]:
!git clone https://github.com/danielbauer1979/CAS_PredMod.git

In [None]:
dat_frq1 = pd.read_csv('CAS_PredMod/pa_data_freMTPLfreq1.csv')
dat_frq2 = pd.read_csv('CAS_PredMod/pa_data_freMTPLfreq2.csv')
dat_frq = pd.concat([dat_frq1,dat_frq2])
dat_frq.head()

In [None]:
dat_frq.describe()

In [None]:
pd.crosstab(index=dat_frq['ClaimNb'], columns="count")

In [None]:
pd.crosstab(index=dat_frq['ClaimNb'], columns="count").plot(kind='bar')

So, as expected, multiple claims are rare. The vast majority of cases don't have a claim.

Let's look at the severities:

In [None]:
dat_sev = pd.read_csv('CAS_PredMod/pa_data_freMTPLsev.csv')
print(dat_sev.shape)
dat_sev.head()

In [None]:
dat_sev.describe()

So, again, as expected, we have a few very large claims.

## Merge the Data Sets

Since a policy can have multiple claims, let's summarize the claims at the policy level (here, as the average claim amount per policy) to allow for an easy merge:

In [None]:
df = dat_sev.groupby('PolicyID', as_index=False).agg({"ClaimAmount":"mean"})
df
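
To see what this aggregation does, here is a minimal sketch on a made-up three-row severity table (the values are purely illustrative and not taken from the data). A policy with two claims collapses to a single row holding its average claim amount:

```python
import pandas as pd

# Hypothetical toy severity table: policy 1 has two claims, policy 2 has one.
toy_sev = pd.DataFrame({
    "PolicyID": [1, 1, 2],
    "ClaimAmount": [100.0, 300.0, 50.0],
})

# One row per policy, holding the mean claim amount of that policy.
toy_mean = toy_sev.groupby("PolicyID", as_index=False).agg({"ClaimAmount": "mean"})
print(toy_mean)
```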

Now we can simply merge the frequency and severity sets into our master dataset, setting the missing (`NaN`) `ClaimAmount` entries to zero for policies without claims:

In [None]:
dat = pd.merge(dat_frq, dat_sev.groupby('PolicyID', as_index=False).agg({"ClaimAmount":"mean"}),how='left')
dat = dat.fillna(0)
dat.head()
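
The left merge keeps every policy from the frequency table, and policies without a matching severity row receive a missing `ClaimAmount`. A small sketch on hypothetical toy tables (all `toy_` names and values are made up) shows the effect. It passes `on="PolicyID"` explicitly, which is an assumption about the intended join key:

```python
import pandas as pd

# Hypothetical toy tables: three policies, only two of which have claims.
toy_frq = pd.DataFrame({"PolicyID": [1, 2, 3], "ClaimNb": [2, 1, 0]})
toy_avg = pd.DataFrame({"PolicyID": [1, 2], "ClaimAmount": [200.0, 50.0]})

# A left merge keeps all rows of toy_frq; policy 3 gets NaN for ClaimAmount.
toy_dat = pd.merge(toy_frq, toy_avg, on="PolicyID", how="left")
n_missing = int(toy_dat["ClaimAmount"].isna().sum())

# Replace the missing amounts with zero for the claim-free policies.
toy_dat["ClaimAmount"] = toy_dat["ClaimAmount"].fillna(0)
print(n_missing)
print(toy_dat)
```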

## Building Models

As in the previous session, we need to convert the categorical variables to dummy variables. To do so, we separate the categorical and numerical variables, create the dummies, and then concatenate.

In [None]:
dummies = pd.get_dummies(dat[dat.columns[[4,7,8,9]]],drop_first=True)
dat = dat.drop(dat.columns[[0, 1, 4, 7, 8, 9]], axis=1)
dat = pd.concat([dat,dummies], axis=1)
dat.head()
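
As a small illustration of `drop_first=True`, consider a hypothetical table with one categorical column (the column names and values are made up for this sketch). With two levels, only one indicator column is kept, and the dropped level acts as the reference category:

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical column.
toy = pd.DataFrame({"Gas": ["Diesel", "Regular", "Diesel"], "CarAge": [3, 10, 1]})

# Encode the categorical column; drop_first=True drops the first level ("Diesel").
toy_dummies = pd.get_dummies(toy[["Gas"]], drop_first=True)

# Replace the original categorical column with its dummy encoding.
toy_X = pd.concat([toy.drop(columns=["Gas"]), toy_dummies], axis=1)
print(toy_X.columns.tolist())
```

Selecting columns by name, as in this sketch, tends to be more robust than positional indices if the column order of the input file ever changes.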

Let's do some visualizations of some of the variables:

In [None]:
plt.hist(dat['Exposure'])
plt.show()

In [None]:
plt.hist(dat['DriverAge'])
plt.show()

...likely what we expected...

In [None]:
plt.hist(dat['CarAge'])
plt.show()

...also no surprises...

Let's look at population density, which is a more interesting variable:

In [None]:
plt.hist(dat['Density'])
plt.show()

So it looks like there are many very small densities and a few very high ones. Let's switch to a log scale.

In [None]:
dat.loc[:, 'Density'] = np.log(dat['Density'])
plt.hist(dat['Density'])
plt.show()

Let's also check out our targets:

In [None]:
dat.loc[dat['ClaimAmount']>0,'ClaimAmount'].quantile([.9, .95, .99, .999])

It is possible that the few very large claims will cause trouble, so let's cap the claim amounts at 50,000:

In [None]:
# Use .loc to avoid chained assignment, which may silently fail to modify dat
dat.loc[dat['ClaimAmount']>50000, 'ClaimAmount'] = 50000
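
Capping at a threshold can equivalently be expressed with `Series.clip`. A minimal sketch on made-up claim amounts:

```python
import pandas as pd

# Hypothetical claim amounts, two of which exceed the cap.
toy_claims = pd.Series([1200.0, 49999.0, 75000.0, 2000000.0])

cap = 50000
# clip(upper=cap) replaces every value above the cap with the cap itself.
capped = toy_claims.clip(upper=cap)
print(capped.tolist())
```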

Let's split into a training and a test sample, so that we can evaluate our model:

In [19]:
Train, Test = train_test_split(dat, test_size=0.25)
Train_y_freq = Train['ClaimNb']
Train_y_sev = Train['ClaimAmount']
Train_X = Train.drop(columns = ['ClaimNb','ClaimAmount'])
Test_y_freq = Test['ClaimNb']
Test_y_sev = Test['ClaimAmount']
Test_X = Test.drop(columns = ['ClaimNb','ClaimAmount'])
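
Note that `train_test_split` shuffles the rows randomly, so reported numbers such as correlations will vary from run to run. One common option, sketched here on a hypothetical toy frame, is to fix `random_state` (the value 42 is arbitrary) so that the split is reproducible:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 20 rows.
toy = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) % 2})

# Two calls with the same random_state produce the same split.
a_train, a_test = train_test_split(toy, test_size=0.25, random_state=42)
b_train, b_test = train_test_split(toy, test_size=0.25, random_state=42)
print(a_test.index.equals(b_test.index))
```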


Now let's model frequencies via a Poisson regression:

In [20]:
freqmodel = PoissonRegressor()
freqmodel.fit(Train_X,Train_y_freq)
preds_Train = freqmodel.predict(Train_X)
np.corrcoef(preds_Train,Train_y_freq)

array([[1.        , 0.01765559],
       [0.01765559, 1.        ]])

The correlation is fairly low, but when we look at a scatter plot...

In [None]:
plt.scatter(Train_y_freq,preds_Train)

it does seem that policies with more claims have higher predictions, though the predictions are still close to zero:

In [None]:
preds_Test = freqmodel.predict(Test_X)
np.corrcoef(preds_Test,Test_y_freq)

It seems the positive correlation persists in the test set (out of sample), and the scatter plot:

In [None]:
plt.scatter(Test_y_freq,preds_Test)

also suggests the model is not unreasonable.
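
Correlation is a fairly blunt measure for count data. A possible complement, sketched here on synthetic data rather than on the insurance data, is to compare the mean Poisson deviance of a fitted model with that of a constant prediction. The simulated coefficients (-2.0 and 0.5) are assumptions chosen purely for illustration. The sketch also sets `alpha=0.0`, since `PoissonRegressor` applies an L2 penalty with `alpha=1.0` by default, which shrinks the coefficients:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

# Simulate claim counts with a log-linear mean (illustrative parameters).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 1))
mu = np.exp(-2.0 + 0.5 * x[:, 0])
y = rng.poisson(mu)

# Fit an unpenalized Poisson GLM (alpha=0.0 turns off the default L2 penalty).
toy_freq = PoissonRegressor(alpha=0.0)
toy_freq.fit(x, y)
pred = toy_freq.predict(x)

# Compare against a baseline that predicts the overall mean for everyone.
baseline = np.full(n, y.mean())
dev_model = mean_poisson_deviance(y, pred)
dev_base = mean_poisson_deviance(y, baseline)
print(dev_model, dev_base)
```

A lower deviance than the baseline indicates that the covariates carry some signal, even when the correlation between predictions and observed counts is small.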

Let's look at the claim amounts:

In [None]:
plt.hist(dat['ClaimAmount'])
plt.show()

and at the non-zero claims on a log scale:

In [None]:
plt.hist(np.log(dat.loc[dat['ClaimAmount']>0,'ClaimAmount']))
plt.show()

Let's run a Gamma regression for the severities:

In [None]:
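# Keep only training policies with a positive claim amount
# (note: this overwrites Train and Train_X from the frequency model)
Train = Train.loc[Train['ClaimAmount']>0,:]
Train_y_sev = Train['ClaimAmount']
Train_X = Train.drop(columns = ['ClaimNb','ClaimAmount'])
# Fit the Gamma GLM and plot observed vs. predicted severities
sevmodel = GammaRegressor()
sevmodel.fit(Train_X,Train_y_sev)
plt.scatter(Train_y_sev,sevmodel.predict(Train_X))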

So we're kind of catching the trend.
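
To get a feel for what a Gamma regression estimates, here is a sketch on simulated severities with made-up parameters (the intercept 7.0, slope 0.3, and shape 2.0 are assumptions for illustration only). `GammaRegressor` uses a log link, so its predictions are always positive. Like `PoissonRegressor`, it applies an L2 penalty with `alpha=1.0` by default, which the sketch disables:

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

# Simulate positive claim sizes whose mean depends log-linearly on x.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 1))
mean_sev = np.exp(7.0 + 0.3 * x[:, 0])
shape = 2.0
y = rng.gamma(shape, mean_sev / shape)  # Gamma draws with mean equal to mean_sev

# Fit an unpenalized Gamma GLM with a log link.
toy_sev = GammaRegressor(alpha=0.0, max_iter=1000)
toy_sev.fit(x, y)
pred = toy_sev.predict(x)
print(toy_sev.intercept_, toy_sev.coef_)
```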

Now we can combine the two models by multiplying the predicted frequencies and severities:

In [None]:
preds_Test_sev = sevmodel.predict(Test_X)
preds_Test_tot = preds_Test * preds_Test_sev

In [None]:
plt.scatter(Test_y_freq * Test_y_sev,preds_Test_tot)
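
The multiplication reflects the usual frequency-severity decomposition: assuming claim counts and claim sizes are independent, the expected total loss per policy is the expected number of claims times the expected claim size. A tiny numerical sketch with made-up predictions for three policies:

```python
import numpy as np

# Hypothetical predicted claim frequencies and severities for three policies.
pred_freq = np.array([0.05, 0.10, 0.02])
pred_sev = np.array([1500.0, 2000.0, 1000.0])

# Expected total loss per policy: frequency times severity,
# roughly 75, 200 and 20 for these three policies.
pred_total = pred_freq * pred_sev
print(pred_total)
```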