<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Assignment3_LASSO_Sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3

Let's import the relevant packages. Again, we rely on the statistical learning toolkit scikit-learn, which provides LASSO regression along with many other predictive models. As before, it is less convenient to use than some of the other packages and, unlike R, does not support formulas. But it is versatile and fast, and therefore one of the most popular predictive modeling toolkits.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.metrics import mean_squared_error

## Practical Application: Predicting Claim Sizes

We consider a data set of claim sizes (severities) from Allstate that was used in a [Kaggle competition](https://www.kaggle.com/c/allstate-claims-severity) and is now available from various repositories, e.g. [here](https://github.com/Architectshwet/Allstate-Claims-Severity-Data/blob/master/Datasets).

Let's load it, and take a look:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

Cloning into 'ML_656'...
remote: Enumerating objects: 155, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 155 (delta 21), reused 0 (delta 0), pack-reused 117
Receiving objects: 100% (155/155), 23.39 MiB | 12.31 MiB/s, done.
Resolving deltas: 100% (71/71), done.
Checking out files: 100% (29/29), done.


In [None]:
dat_1 = pd.read_csv('ML_656/Allstate_train1.csv')
dat_2 = pd.read_csv('ML_656/Allstate_train2.csv')
dat_3 = pd.read_csv('ML_656/Allstate_train3.csv')
# ignore_index=True gives the combined frame a fresh, unique row index
df = pd.concat([dat_1, dat_2, dat_3], ignore_index=True)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe()

So it is a large data set, and it is particularly large in the $p$ direction -- that is, there are many covariates. Shrinkage and selection may therefore come in handy here. One quick comment about this dataset, and many Kaggle competitions more generally: we don't really know what the variables `catx` and `contx` stand for, so it is difficult to use experience or intuition in building a model -- which is an important aspect in real-world applications. So performing well in a Kaggle competition does not necessarily qualify a data scientist to work in the insurance industry.

## Preparing the data

There are a few very small losses that are outliers. We therefore keep only records with a loss greater than $\$100$, which are also the claims of interest in actual settings.

In [None]:
df = df[df['loss']>100]

We convert categoricals into dummies:

In [None]:
del df['id']
objects = []
for c in df.columns:
    if str(df[c].dtype) == 'object':
        objects.append(c)
X_ = df.drop(objects, axis = 1).astype('float64')
X_ = X_.drop(['loss'], axis = 1)
dummies = pd.get_dummies(df[objects], drop_first=True)
X = pd.concat([X_, dummies], axis = 1)
y = df.loss
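
To see what the dummy encoding does, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the Allstate data. With `drop_first=True`, the first level of each categorical is dropped and serves as the baseline, so a variable with $k$ levels yields $k-1$ dummy columns.

```python
import pandas as pd

# Toy frame with hypothetical columns mimicking the cat*/cont* layout
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "C"],
    "cat2": ["X", "X", "Y", "Y"],
    "cont1": [0.1, 0.5, 0.3, 0.9],
})

# Treat every non-numeric column as categorical
cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# One dummy column per level, minus the first (baseline) level of each variable
toy_dummies = pd.get_dummies(toy[cat_cols], drop_first=True)

# Numeric columns first, then the dummies
toy_X = pd.concat([toy.drop(columns=cat_cols), toy_dummies], axis=1)
print(list(toy_X.columns))
```

Here `cat1` has three levels and `cat2` has two, so the three original columns become four features.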

Let's look at our features:

In [None]:
X

We split the data into training, validation, and test sets (50%, 25%, and 25% of the records, respectively):

In [None]:
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size = 0.5, random_state=2)
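
As a quick sanity check on the two-stage split, the following sketch applies the same two calls to a small synthetic array (100 made-up rows, not the Allstate data) and confirms that the pieces have the expected sizes and do not overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features; y doubles as a row identifier
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: half for training, half held back
X_tr, X_rest, y_tr, y_rest = train_test_split(X_toy, y_toy, test_size=0.5, random_state=1)
# Second split: divide the held-back half into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=2)

print(len(X_tr), len(X_va), len(X_te))

# No row appears in more than one set
overlap = (set(y_tr) & set(y_va)) | (set(y_tr) & set(y_te)) | (set(y_va) & set(y_te))
print(len(overlap))
```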

We go to logs. The log-transformation makes the data much more amenable to regression.

In [None]:
y_train = np.log(y_train)
y_val = np.log(y_val)
y_test = np.log(y_test)
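
One way to see why logs help is to compare the skewness of a heavy-tailed variable before and after the transformation. The sketch below uses simulated lognormal "losses"; the parameters `loc=7.5` and `scale=0.8` are arbitrary assumptions for illustration, not estimates from the Allstate data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated right-skewed losses (hypothetical parameters)
loss_toy = pd.Series(np.exp(rng.normal(loc=7.5, scale=0.8, size=10_000)))

skew_raw = loss_toy.skew()          # strongly positive: long right tail
skew_log = np.log(loss_toy).skew()  # close to zero: roughly symmetric

print(skew_raw, skew_log)

# Predictions made on the log scale are mapped back to dollars with np.exp
roundtrip_ok = np.allclose(np.exp(np.log(loss_toy)), loss_toy)
```

Keep in mind that all error measures computed below are on the log scale, so they describe relative rather than absolute dollar errors.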

## Run Models

We start with OLS:

In [None]:
model_ols = LinearRegression(fit_intercept=True)
model_ols.fit(X_train, y_train)
print(model_ols.intercept_)
print(model_ols.coef_)

The test RMSE (on the log scale) is

In [None]:
TestRMSE_ols = np.sqrt(mean_squared_error(y_test,model_ols.predict(X_test)))
print(TestRMSE_ols)

In [None]:
tmp = np.abs(model_ols.predict(X_test)-y_test)
print(np.median(tmp))
print(np.quantile(tmp,0.9))
print(np.quantile(tmp,0.99))

Next, let's run a LASSO regression with some predefined values of the penalty parameter lambda (called `alpha` in scikit-learn), recording the validation MSE for each:

In [None]:
alphas = np.array([0.000001, 0.000003, 0.000007, 0.00001, 0.0001, 0.001])
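# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older version
model_lasso = Lasso(max_iter = 10000, normalize = True)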
coefs = []
MSE = []
for a in alphas:
    model_lasso.set_params(alpha=a)
    model_lasso.fit(X_train, y_train)
    coefs.append(model_lasso.coef_)
    MSE.append(mean_squared_error(y_val, model_lasso.predict(X_val)))

In [None]:
RMSE = np.sqrt(MSE)
RMSE

In [None]:
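plt.semilogx(alphas, RMSE, marker='o')  # log-scaled x-axis, since the alphas span several orders of magnitude
plt.xlabel('alpha'); plt.ylabel('validation RMSE')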
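
Rather than reading the best penalty off a plot, one can pick it programmatically as the value with the smallest validation RMSE. Here is a minimal sketch on synthetic data from `make_regression`. The data, the alpha grid, and the names `alphas_demo` and `rmse_demo` are assumptions for illustration and are unrelated to the Allstate results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with only a few informative features
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.5, random_state=1)

# Hypothetical grid of penalties
alphas_demo = np.array([0.01, 0.1, 1.0, 10.0, 100.0])

# Fit on the training set, score on the validation set
rmse_demo = []
for a in alphas_demo:
    m = Lasso(alpha=a, max_iter=10000).fit(X_tr, y_tr)
    rmse_demo.append(np.sqrt(mean_squared_error(y_va, m.predict(X_va))))
rmse_demo = np.array(rmse_demo)

# The chosen penalty is the one minimizing the validation RMSE
best_alpha = alphas_demo[np.argmin(rmse_demo)]
print(best_alpha)
```

The chosen value is then used to refit on the training data, and the test set is touched only once, for the final evaluation.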

In [None]:
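# Refit on the training set with the alpha picked from the validation results; the test set is used only for evaluation
model_lasso.set_params(alpha=0.000007)
model_lasso.fit(X_train, y_train)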
TestRMSE_lasso = np.sqrt(mean_squared_error(y_test,model_lasso.predict(X_test)))
print(TestRMSE_lasso)

In [None]:
tmp = np.abs(model_lasso.predict(X_test)-y_test)
print(np.median(tmp))
print(np.quantile(tmp,0.9))
print(np.quantile(tmp,0.99))
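
In recent scikit-learn versions the `normalize` argument of `Lasso` is no longer available. One possible alternative, sketched below on synthetic data, is to standardize the features in a pipeline and let `LassoCV` choose the penalty by cross-validation instead of using a manual grid and validation set. This is a swapped-in technique, not the procedure used above, and the resulting alphas are on a different scale from those used with `normalize=True`, so they cannot be compared directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Allstate data)
X_syn, y_syn = make_regression(n_samples=200, n_features=30, n_informative=5,
                               noise=5.0, random_state=0)

# Standardize features, then fit a LASSO whose alpha is chosen by 5-fold CV
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000, random_state=0))
pipe.fit(X_syn, y_syn)

# make_pipeline names each step after its lowercased class name
lasso_cv = pipe.named_steps["lassocv"]
n_selected = int(np.sum(lasso_cv.coef_ != 0))

print(lasso_cv.alpha_, n_selected)
```

On the actual data, one would fit the pipeline on `X_train` and `y_train` and evaluate it once on the test set, exactly as for the models above.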