<a href="https://colab.research.google.com/github/danielbauer1979/CAS_PredMod/blob/main/pa_pynb_sess4_LASSO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularized Regression

In this tutorial, we discuss regularized regression approaches, in particular the least absolute shrinkage and selection operator, or [LASSO](http://statweb.stanford.edu/~tibs/lasso.html/) for short, in the context of predicting claim sizes.

Let's import the relevant packages. Again, we rely on the statistical learning toolkit scikit-learn, which provides LASSO regression along with many other predictive models. As before, it is less convenient to use than some of the other packages and, unlike R, does not support formulas. But it is versatile and fast, and therefore one of the most popular predictive modeling toolkits.

In [19]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.metrics import mean_squared_error

## Ridge and LASSO Regression

Ridge and LASSO regression are both examples of *regularized* regression approaches.  In what follows, we will first briefly review the corresponding approaches, and particularly highlight how they differ from their unregularized counterparts.   We then will work through a simulated example to become familiar with the impact of the *tuning parameter* on the resulting coefficient estimates.  We will also determine the in- and out-of-sample fit depending on the choice of the tuning parameter, uncovering a familiar relationship.

## Review of Concepts and Maths

In a conventional (linear) regression problem with independent variables $x_i$ and dependent variables $y_i$, we determine the best fit in the least-squares sense:
$$
\hat{\beta}^{\text{OLS}} = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 \right\}.
$$
Within a *regularized* approach, we now include penalties for choosing many or large parameters:
$$
\hat{\beta}^{\text{REG}}_\lambda = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 + \lambda \times R(\beta) \right\}.
$$
Here, $R(\beta)$ is a so-called *regularization* term that imposes a penalty on the complexity of the regression equation.  In particular, within Ridge regression the penalty term is *quadratic* (an L2 penalty), $R(\beta) = \sum_{j=1}^p \beta_j^2,$ whereas the LASSO uses an L1 penalty, $R(\beta) = \sum_{j=1}^p |\beta_j|.$

We call $\lambda$ the *tuning parameter*, and it governs how sizable the complexity penalty will be.  In particular, note that for $\lambda=0$ we are back to the unregularized problem, whereas for large $\lambda$ the penalty will be severe, which leads to *shrinkage* of the coefficient estimates.  As $\lambda$ becomes larger and larger, the prediction will more and more closely resemble a *constant* prediction, $\hat{y}_i = \beta_0.$  Thus, the choice of the tuning parameter directly trades off a reduction in variance (due to shrinkage) against an increase in bias (due to the less flexible model fit).  Again, we will explore these aspects in more detail in the context of our example below.
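
The effect of the tuning parameter can be illustrated on simulated data. The following is a minimal sketch and not part of the original analysis: the data-generating process, the names `X_sim`, `y_sim`, and `beta_true`, and the grid of $\lambda$ values are all assumptions chosen for illustration. Note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\,\text{RSS} + \alpha \sum_j |\beta_j|$, whereas `Ridge` minimizes $\text{RSS} + \alpha \sum_j \beta_j^2$, so the sketch rescales $\lambda$ to match the formulas above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Simulated data (an assumption for illustration): 3 relevant and 7 irrelevant features
rng = np.random.default_rng(0)
n, p = 200, 10
X_sim = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y_sim = X_sim @ beta_true + rng.normal(size=n)

lambdas = [1.0, 10.0, 100.0, 1000.0, 10000.0]
lasso_nonzero, lasso_l1, ridge_l2 = [], [], []
for lam in lambdas:
    # Lasso in scikit-learn scales the RSS by 1/(2n), so alpha = lambda / (2n)
    lasso = Lasso(alpha=lam / (2 * n), max_iter=10000).fit(X_sim, y_sim)
    # Ridge in scikit-learn uses the unscaled RSS, so alpha = lambda
    ridge = Ridge(alpha=lam).fit(X_sim, y_sim)
    lasso_nonzero.append(int(np.sum(lasso.coef_ != 0)))
    lasso_l1.append(float(np.sum(np.abs(lasso.coef_))))
    ridge_l2.append(float(np.sqrt(np.sum(ridge.coef_ ** 2))))

for lam, nz, l1, l2 in zip(lambdas, lasso_nonzero, lasso_l1, ridge_l2):
    print(f"lambda={lam:>8.0f}  LASSO nonzero={nz:2d}  LASSO L1 norm={l1:6.3f}  Ridge L2 norm={l2:6.3f}")
```

In this sketch, the LASSO sets more and more coefficients exactly to zero as $\lambda$ grows (selection), while Ridge shrinks all coefficients toward zero without eliminating them.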


## Practical Application: Predicting Claim Sizes

We consider a data set of claim sizes (severities) from Allstate that was used in a [Kaggle competition](https://www.kaggle.com/c/allstate-claims-severity) and is now available from various repositories, e.g. [here](https://github.com/Architectshwet/Allstate-Claims-Severity-Data/blob/master/Datasets).

Let's load it, and take a look:

In [None]:
!git clone https://github.com/danielbauer1979/CAS_PredMod.git

In [4]:
dat_1 = pd.read_csv('CAS_PredMod/pa_data_Allstate_train1.csv')
dat_2 = pd.read_csv('CAS_PredMod/pa_data_Allstate_train2.csv')
dat_3 = pd.read_csv('CAS_PredMod/pa_data_Allstate_train3.csv')
df = pd.concat([dat_1,dat_2,dat_3])

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe()

So it is a large data set, and it is particularly large in the $p$ direction; that is, there are many covariates. Shrinkage and selection may therefore come in handy here.  One quick comment about the dataset, and about many Kaggle competitions more generally: we don't really know what the variables `catx` and `contx` stand for, so it is difficult to use experience or intuition in building a model, which is an important aspect in real-world applications.  So performing well in a Kaggle competition does not necessarily qualify a data scientist to work in the insurance industry.

## Preparing the data

There are a few very small losses that are outliers.  We therefore disregard them and keep only records with a loss greater than $\$100$, which are also the claims of interest in actual settings.

In [8]:
df = df[df['loss']>100]

We convert categoricals into dummies:

In [9]:
del df['id']
objects = []
for c in df.columns:
    if str(df[c].dtype) == 'object':
        objects.append(c)
X_ = df.drop(objects, axis = 1).astype('float64')
X_ = X_.drop(['loss'], axis = 1)
dummies = pd.get_dummies(df[objects], drop_first=True)
X = pd.concat([X_, dummies], axis = 1)
y = df.loss
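
To make the dummy encoding concrete, here is a minimal sketch on a toy data frame. The frame `toy` and its column values are hypothetical and only mimic the `catx`/`contx` naming of the Allstate data. With `drop_first=True`, the first level of each categorical variable is dropped and serves as the reference level, which avoids perfect collinearity with the intercept.

```python
import pandas as pd

# Hypothetical toy data mimicking the cat*/cont* layout
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "B"],
    "cat2": ["X", "Y", "Z", "X"],
    "cont1": [0.1, 0.5, 0.3, 0.9],
})
cat_cols = ["cat1", "cat2"]

# One indicator column per non-reference level; numeric columns are passed through
toy_X = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
print(toy_X.columns.tolist())
print(toy_X)
```

Here `cat1` (two levels) yields one indicator column and `cat2` (three levels) yields two.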

Let's look at our features:

In [None]:
X

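We split the data into training, validation, and test sets:

In [11]:
# Keep 20% for training, then split the remaining 80% into validation (40%) and test (60%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.6, random_state=2)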
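
For the validation and test errors to be honest out-of-sample estimates, the three sets should not share any records. A minimal sketch of such a two-stage split and a disjointness check follows; it uses a hypothetical index array `idx` instead of the actual data, and the split proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical record indices standing in for the rows of the data set
idx = np.arange(1000)

# Stage 1: training set vs. hold-out; stage 2: split the hold-out only
idx_train, idx_hold = train_test_split(idx, test_size=0.8, random_state=1)
idx_val, idx_test = train_test_split(idx_hold, test_size=0.6, random_state=2)

# No record may appear in more than one set
overlap = (set(idx_train) & set(idx_val)) | (set(idx_train) & set(idx_test)) | (set(idx_val) & set(idx_test))
print(len(idx_train), len(idx_val), len(idx_test), len(overlap))
```

The key point is that the second call splits the hold-out from the first call rather than the full data set.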

We move to logs. The log-transformation makes the data much more amenable to regression.

In [12]:
y_train = np.log(y_train)
y_val = np.log(y_val)
y_test = np.log(y_test)
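
One consequence of modeling log losses is that predictions and errors live on the log scale. Exponentiating a predicted mean log loss gives a dollar amount, but since the exponential of an average of logs is a geometric mean, it will generally fall below the arithmetic mean of the losses. The sketch below illustrates this on simulated lognormal losses; the distribution parameters and names are assumptions for illustration only.

```python
import numpy as np

# Simulated right-skewed losses (an assumption for illustration)
rng = np.random.default_rng(1)
losses = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)
log_losses = np.log(losses)

# Naive back-transform of the mean log loss (the geometric mean of the losses)
naive_back_transform = np.exp(log_losses.mean())
# Arithmetic mean on the original dollar scale
arithmetic_mean = losses.mean()

print(naive_back_transform, arithmetic_mean)
```

So the error measures reported below should be read as errors in log losses, not in dollars.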

## Run Models

We start with OLS:

In [None]:
model_ols = LinearRegression(fit_intercept=True)
model_ols.fit(X_train, y_train)
print(model_ols.intercept_)
print(model_ols.coef_)

The test RMSE (on the log scale) is:

In [None]:
TestRMSE_ols = np.sqrt(mean_squared_error(y_test,model_ols.predict(X_test)))
print(TestRMSE_ols)

In [None]:
# Absolute test errors (log scale): median, 90th, and 99th percentiles
tmp = np.abs(model_ols.predict(X_test)-y_test)
print(np.median(tmp))
print(np.quantile(tmp,0.9))
print(np.quantile(tmp,0.99))

Next, we run LASSO regressions for a few predefined values of the tuning parameter $\lambda$ (called `alpha` in scikit-learn) and record the validation MSE for each:

In [None]:
alphas = np.array([0.000001, 0.000003, 0.000007, 0.00001, 0.0001, 0.001])
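# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, scale the features beforehand instead
model_lasso = Lasso(max_iter = 10000, normalize = True)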
coefs = []
MSE = []
for a in alphas:
    model_lasso.set_params(alpha=a)
    model_lasso.fit(X_train, y_train)
    coefs.append(model_lasso.coef_)
    MSE.append(mean_squared_error(y_val, model_lasso.predict(X_val)))
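
On recent scikit-learn versions, where the `normalize` argument is no longer available, a common alternative is to standardize the features in a pipeline ahead of the LASSO. The following is a minimal sketch on simulated data, with `X_demo`, `y_demo`, and `demo_alphas` as hypothetical names. Note that `StandardScaler` divides by the standard deviation, whereas the old `normalize=True` divided by the L2 norm, so the `alpha` values are not directly comparable between the two approaches.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated features on very different scales (an assumption for illustration)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 8)) * np.array([1, 10, 100, 1, 1, 1, 1, 1])
y_demo = 0.5 * X_demo[:, 0] + 0.02 * X_demo[:, 2] + rng.normal(size=300)
X_fit, X_hold = X_demo[:200], X_demo[200:]
y_fit, y_hold = y_demo[:200], y_demo[200:]

# The scaler is fit on the training part only and reused for the hold-out part
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
demo_alphas = [0.001, 0.01, 0.1, 1.0]
val_rmse = []
for a in demo_alphas:
    pipe.set_params(lasso__alpha=a)  # step name follows make_pipeline's lowercase convention
    pipe.fit(X_fit, y_fit)
    val_rmse.append(np.sqrt(mean_squared_error(y_hold, pipe.predict(X_hold))))

print(dict(zip(demo_alphas, val_rmse)))
```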

In [23]:
RMSE = np.sqrt(MSE)
RMSE

array([0.56483571, 0.56444628, 0.56407866, 0.56411633, 0.58435664,
       0.72678256])

In [None]:
plt.semilogx(alphas, RMSE, marker='o')  # validation RMSE against alpha (log-scaled x-axis)
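
Instead of looping over a grid by hand with a single validation set, the tuning parameter can also be chosen by cross-validation with `LassoCV`, which is imported at the top of this notebook. The following is a minimal sketch on simulated data; the data, the grid `grid`, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated data (an assumption for illustration): two relevant features out of eight
rng = np.random.default_rng(3)
X_cv = rng.normal(size=(300, 8))
y_cv = 2.0 * X_cv[:, 0] - 1.0 * X_cv[:, 1] + rng.normal(size=300)

grid = np.array([0.001, 0.01, 0.1, 1.0])
model_cv = LassoCV(alphas=grid, cv=5, max_iter=10000).fit(X_cv, y_cv)

# mse_path_ has one row per alpha (ordered as in alphas_) and one column per fold
cv_rmse = np.sqrt(model_cv.mse_path_.mean(axis=1))
print("selected alpha:", model_cv.alpha_)
print(dict(zip(model_cv.alphas_, cv_rmse)))
```

After fitting, `model_cv` is refit on all the data passed to `fit` at the selected `alpha_`, so it can be used for prediction directly.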

In [None]:
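# Refit on the training data at the alpha with the lowest validation RMSE, then evaluate on the test set
model_lasso.set_params(alpha=0.000007)
model_lasso.fit(X_train, y_train)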
TestRMSE_lasso = np.sqrt(mean_squared_error(y_test,model_lasso.predict(X_test)))
print(TestRMSE_lasso)

In [None]:
# Absolute test errors (log scale): median, 90th, and 99th percentiles
tmp = np.abs(model_lasso.predict(X_test)-y_test)
print(np.median(tmp))
print(np.quantile(tmp,0.9))
print(np.quantile(tmp,0.99))