<a href="https://colab.research.google.com/github/danielbauer1979/CAS_PredMod/blob/main/pa_pynb_sess2_IntroGLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 2 -- Intro to GLM: Gamma and Poisson Regression
Dani Bauer, 9/8/2022

In this tutorial, we will see Poisson and Gamma regression in action. We will cover two examples that go a little beyond what was covered in class.

Let's start by loading the libraries that are going to be helpful. We're going to rely on the statistical learning toolkit [scikit-learn](https://scikit-learn.org/stable/index.html), which provides GLM functionality and which we will also use later in the context of algorithmic learners. It is less convenient to use than some of the other packages and, unlike R, does not support model formulas. But it is versatile and fast, and therefore one of the most popular predictive modeling toolkits.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor

## Gamma Regression

We consider a Gamma regression on a grouped data set. More precisely, the data are auto collision losses grouped by age and vehicle use. We are interested in predicting the loss severity based on age and usage.

The underlying dataset is a "classical" one in the actuarial literature used by Baxter, Coutts, and Ross (1980), McCullagh and Nelder (1989), Mildenhall (1999), and Frees (2010).

Let's take a look.

In [None]:
!git clone https://github.com/danielbauer1979/CAS_PredMod.git

In [None]:
dat_autocoll = pd.read_csv('CAS_PredMod/pa_data_AutoCollision.csv')
dat_autocoll

In [None]:
dat_autocoll.describe()

In [5]:
losses = dat_autocoll['Severity'] * dat_autocoll['Claim_Count']
dat_autocoll = dat_autocoll.assign(Loss=losses)

Let's see how claim counts and severities vary by age: 

In [None]:
dat_autocoll_byage = dat_autocoll.groupby('Age').sum()
plt.bar(['A','B','C','D','E','F','G','H'],dat_autocoll_byage['Claim_Count'])
plt.show()

In [None]:
plt.bar(['A','B','C','D','E','F','G','H'],dat_autocoll_byage['Loss']/dat_autocoll_byage['Claim_Count'])
plt.show()

So it seems like higher ages (by letter) have more claims but those claims, on average, are milder.
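Note that the bar heights in the severity plot are total losses divided by total claim counts, that is, a claim-count-weighted average of the group severities rather than a simple average. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical; only the column names mirror `dat_autocoll`):

```python
import pandas as pd

# Hypothetical grouped data: two age groups, two rows each
toy = pd.DataFrame({
    "Age": ["A", "A", "B", "B"],
    "Severity": [200.0, 400.0, 100.0, 300.0],
    "Claim_Count": [10, 30, 5, 5],
})
toy["Loss"] = toy["Severity"] * toy["Claim_Count"]

# Aggregate losses and counts, then divide: claim-count-weighted severity
by_age = toy.groupby("Age")[["Loss", "Claim_Count"]].sum()
weighted_sev = by_age["Loss"] / by_age["Claim_Count"]

# Simple average of the group severities, ignoring claim counts
simple_sev = toy.groupby("Age")["Severity"].mean()

print(pd.DataFrame({"weighted": weighted_sev, "simple": simple_sev}))
```

The two averages agree only when the claim counts within a group are equal, as in group `B` here.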

Let's do the same for Vehicle use:

In [None]:
dat_autocoll_byuse = dat_autocoll.groupby('Vehicle_Use').sum()
dat_autocoll_byuse

In [None]:
plt.bar(['Business','DriveLong','DriveShort','Pleasure'],dat_autocoll_byuse['Claim_Count'])
plt.show()

In [None]:
plt.bar(['Business','DriveLong','DriveShort','Pleasure'],dat_autocoll_byuse['Loss']/dat_autocoll_byuse['Claim_Count'])
plt.show()

So Business claims are less frequent but, again, more severe.

Hence, overall, it appears there is a clear relationship between the groups and incurred loss severities. Let's try to model this via a regression model.

Let's fit a Gamma regression model with a log link function, which is the default for Gamma regression in scikit-learn (see [the documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.GammaRegressor.html)).
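As a reminder of what such a model assumes, here is a sketch in notation of our own choosing (not taken from the class notes): $Y_i$ is the severity of group $i$, $x_{ij}$ are its features, and $\phi$ is a dispersion parameter.

$$
\log {\rm E}(Y_i) = \beta_0 + \sum_j \beta_j\,x_{ij}, \qquad {\rm Var}(Y_i) = \phi\,\{{\rm E}(Y_i)\}^2
$$

The log link makes the effects of the features multiplicative on the expected severity, and the variance grows with the square of the mean, so the coefficient of variation is the same for every group.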

We need to prep the data by converting the categorical variables to dummies. Those will make up our feature matrix $X$.

In [None]:
X = pd.get_dummies(dat_autocoll[dat_autocoll.columns[[0,1]]])
X.head()
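To see what `pd.get_dummies` produces, here is a small sketch on a hypothetical frame (the `toy` data are made up; only the column names mirror the data set). It also shows the `drop_first=True` option, which drops one reference level per factor; the notebook itself keeps all levels.

```python
import pandas as pd

# Hypothetical categorical data with the same column names as the data set
toy = pd.DataFrame({
    "Age": ["A", "B", "A"],
    "Vehicle_Use": ["Business", "Pleasure", "Pleasure"],
})

# One indicator column per level, named <column>_<level>
full = pd.get_dummies(toy)

# Same encoding with the first level of each factor dropped
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With all levels kept, the dummies of each factor sum to one in every row, so they are collinear with the intercept. That is worth keeping in mind when reading individual coefficients later.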

The labels are going to be the severities:

In [12]:
y = dat_autocoll['Severity']

Now we can fit our Gamma regression model as follows:

In [None]:
sevmodel = GammaRegressor(alpha=0)
sevmodel.fit(X,y)

Let's check out the fit by plotting predictions against realizations:

In [None]:
plt.scatter(y,sevmodel.predict(X))

The coefficients are:

In [None]:
sevmodel.coef_
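With a log link, coefficients act multiplicatively: the ratio of expected severities between two levels of a factor is the exponential of the difference of their coefficients. Since our feature matrix contains a dummy for every level, only such differences within a factor are meaningful, not the individual values. A minimal sketch with hypothetical coefficients (the numbers below are illustrative and not output of `sevmodel`):

```python
import math

# Hypothetical log-link coefficients for three levels of one factor
coefs = {"Age_A": 0.20, "Age_B": 0.05, "Age_C": -0.10}
base = "Age_A"  # reference level chosen for illustration

def relativities(c, base_level):
    """Expected-severity ratios of each level relative to the base level."""
    return {k: math.exp(v - c[base_level]) for k, v in c.items()}

rel = relativities(coefs, base)

# Shifting every coefficient of the factor by the same constant
# (absorbed by the intercept) leaves the relativities unchanged.
shifted = {k: v + 1.5 for k, v in coefs.items()}
rel_shifted = relativities(shifted, base)

for level in coefs:
    print(level, round(rel[level], 4), round(rel_shifted[level], 4))
```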

But there is a twist: Should we allow all observations to have equal weight? After all, some groups have relatively small claim counts, some have greater claim counts. If we want to understand how a certain age group influences severities, we should probably lend more credibility to those group observations with large claim numbers.

We can do just that by weighting the severity observations by the claim counts.

In [None]:
sevmodel2 = GammaRegressor(alpha=0)
sevmodel2.fit(X, y, sample_weight=dat_autocoll['Claim_Count'])  # weight each group by its claim count
sevmodel2.coef_

Quite a difference. Let's look at the predictions:

In [None]:
plt.scatter(y,sevmodel2.predict(X))

Again, they look different. One of the key reasons is that the observation with a severity close to 800 is based on very few claims and therefore carries little credibility.
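To see what the claim-count weights do in isolation, consider a sketch with a single dummy feature, where the fitted mean for each level should simply be the (weighted) average severity of that level. All numbers are made up, and the tighter `tol` and larger `max_iter` are our choices to make the comparison cleaner.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

# Hypothetical grouped data: one dummy feature, average severities,
# and the number of claims behind each average
X_toy = np.array([[0.0], [0.0], [1.0], [1.0]])
severity = np.array([100.0, 200.0, 300.0, 500.0])
claim_count = np.array([1.0, 3.0, 1.0, 1.0])

# Same model, once with equal weights and once weighted by claim counts
unweighted = GammaRegressor(alpha=0, tol=1e-8, max_iter=1000)
unweighted.fit(X_toy, severity)
weighted = GammaRegressor(alpha=0, tol=1e-8, max_iter=1000)
weighted.fit(X_toy, severity, sample_weight=claim_count)

# Predictions for the two levels of the dummy
levels = np.array([[0.0], [1.0]])
pred_unweighted = unweighted.predict(levels)
pred_weighted = weighted.predict(levels)
print(pred_unweighted, pred_weighted)
```

For the first level, the weighted fit should land close to the claim-count-weighted mean (175 here) rather than the simple mean (150), because the row backed by three claims counts three times as much.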

## Poisson Regression

Similarly to before, we consider a Poisson regression on a grouped data set. More precisely, we have death counts for a set of diabetes patients that differ by age and gender. Obviously, we are looking to predict death rates -- deaths per exposure -- as a function of age and gender.

In [None]:
dat_diab = pd.read_csv('CAS_PredMod/pa_data_deJongHeller_diabetes.csv')
dat_diab

Let's look at death rates:

In [22]:
death_rate = dat_diab['deaths'] / dat_diab['pop']
dat_diab = dat_diab.assign(deathrate=death_rate)

In [None]:
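plt.bar(dat_diab[dat_diab['gender']=='Male']['age'],dat_diab[dat_diab['gender']=='Male']['deathrate'],alpha=0.5,label='Male')
plt.bar(dat_diab[dat_diab['gender']=='Female']['age'],dat_diab[dat_diab['gender']=='Female']['deathrate'],alpha=0.5,label='Female')
plt.legend()  # semi-transparent bars plus a legend keep both genders distinguishable
plt.show()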

Clearly, death rates are higher for Males. 

### First Effort

Let's commence with a basic Poisson regression.

Again, we need to transform the categorical variable (gender) to dummies:

In [None]:
dummies = pd.get_dummies(dat_diab['gender'])
X = pd.concat([dummies,dat_diab['ageMid']], axis=1)
X

In [25]:
y = dat_diab['deaths']
mortmodel = PoissonRegressor(alpha=0)
mortmodel.fit(X,y)

Here are the coefficients:

In [None]:
mortmodel.coef_

Let's look at the model fit:

In [None]:
plt.scatter(y,mortmodel.predict(X))

One problem here is that we are not considering the population size. This explains the two outliers. We can account for that by incorporating offsets.

### Second Try

Let's instead consider a Poisson regression with an offset, where we use the log population count as the offset since we are interested in rates: $\log\{{\rm E}(d)/l\} = \beta_0 + \beta_1\,x \Leftrightarrow \log\{{\rm E}(d)\} = \log\{l\} + \beta_0 + \beta_1\,x$, with $d$ denoting the death count and $l$ the population count.

scikit-learn's `PoissonRegressor` does not take an offset argument, but we can accomplish the same thing via a (quasi-)Poisson model with the death rate as the response and the population counts as the weights:

In [None]:
y = dat_diab['deathrate']
mortmodel2 = PoissonRegressor(alpha=0)
mortmodel2.fit(X, y, sample_weight=dat_diab['pop'])  # population counts act as exposure weights
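Why can weights stand in for an offset? With a log link and linear predictor $\eta_i$, the Poisson score for counts $d_i$ with offset $\log l_i$ is $\sum_i x_i\,\{d_i - l_i e^{\eta_i}\}$, while the score for rates $d_i/l_i$ weighted by $l_i$ is $\sum_i l_i\,x_i\,\{d_i/l_i - e^{\eta_i}\}$, which is the same expression. The sketch below checks this numerically; the arrays and the trial coefficient vector `beta` are hypothetical and not taken from `dat_diab`.

```python
import numpy as np

# Hypothetical design matrix: intercept, gender dummy, age midpoint
X_toy = np.column_stack([
    np.ones(6),
    np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]),
    np.array([30.0, 50.0, 70.0, 30.0, 50.0, 70.0]),
])
pop = np.array([1000.0, 2000.0, 1500.0, 1200.0, 1800.0, 900.0])
deaths = np.array([2.0, 10.0, 30.0, 3.0, 14.0, 25.0])

# Arbitrary trial coefficients; the identity holds for any value
beta = np.array([-8.0, 0.3, 0.06])
eta = X_toy @ beta

# Score of the count model with offset log(pop)
score_offset = X_toy.T @ (deaths - pop * np.exp(eta))

# Score of the rate model with pop as observation weights
score_weighted = X_toy.T @ (pop * (deaths / pop - np.exp(eta)))

print(np.allclose(score_offset, score_weighted))
```

Since both scores agree for every `beta`, they vanish at the same coefficients, so the two formulations give the same estimates.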

Let's look at the coefficients:

In [None]:
mortmodel2.coef_

Again, this is quite a drastic change. The coefficient on age (as a Gompertz parameter) is now close to 11%, versus around 5% before. Let's check the predictions of death counts now:

In [None]:
plt.scatter(dat_diab['deaths'],mortmodel2.predict(X)*dat_diab['pop'])

Clearly better!!
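To put the age coefficient in perspective: with a log link, a coefficient $\beta$ on age means the death rate is multiplied by $e^{\beta}$ for each additional unit of age, so rates double roughly every $\log(2)/\beta$ units. A small sketch using the rounded values 0.05 and 0.11 quoted above, assuming `ageMid` is measured in years (the helper names are ours):

```python
import math

def annual_increase(beta):
    """Relative increase in the death rate per additional year of age."""
    return math.exp(beta) - 1.0

def doubling_time(beta):
    """Years of age after which the death rate has doubled."""
    return math.log(2.0) / beta

# Rounded age coefficients from the first and second model
for beta in (0.05, 0.11):
    print(beta, round(annual_increase(beta), 4), round(doubling_time(beta), 2))
```

Under these rounded values, the second model implies death rates that double about every six years of age, compared with about every fourteen years in the first model.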