<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Assignment3_LASSO_Prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3

Let's load the relevant packages. Again, we rely on the statistical learning toolkit scikit-learn, which provides LASSO regression along with many other predictive models. As before, it is less convenient to use than some of the other packages and, unlike R, does not support formulas. But it is versatile and fast, and is therefore one of the most popular predictive modeling toolkits.

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.metrics import mean_squared_error

## Practical Application: Predicting Claim Sizes

We consider a data set of claim sizes (severities) from Allstate that was used in a [Kaggle competition](https://www.kaggle.com/c/allstate-claims-severity) and is now available from various repositories, e.g., [here](https://github.com/Architectshwet/Allstate-Claims-Severity-Data/blob/master/Datasets).

Let's load it, and take a look:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

In [17]:
dat_1 = pd.read_csv('ML_656/Allstate_train1.csv')
dat_2 = pd.read_csv('ML_656/Allstate_train2.csv')
dat_3 = pd.read_csv('ML_656/Allstate_train3.csv')
df = pd.concat([dat_1,dat_2,dat_3])

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe()

So it is a large data set, and it is particularly large in the $p$ direction; that is, there are many covariates. Shrinkage and selection may therefore come in handy here. One quick comment about the data set, and many Kaggle competitions more generally: we don't really know what the variables `catx` and `contx` stand for, so it is difficult to use experience or intuition in building a model, which is an important aspect of real-world applications. Performing well in a Kaggle competition therefore does not necessarily qualify a data scientist to work in the insurance industry.
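
For reference, shrinkage and selection in the LASSO come from an $\ell_1$ penalty on the coefficients. A sketch of the objective, written in the parameterization used by scikit-learn's `Lasso` (the symbols $n$, $\beta_0$, $\beta$, and $\alpha$ are notation introduced here for illustration):

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\,\lVert y - \beta_0 - X\beta \rVert_2^2 \;+\; \alpha\,\lVert \beta \rVert_1 ,
$$

where $n$ is the number of observations and $\alpha \ge 0$ is the tuning parameter. For $\alpha = 0$ this reduces to OLS; larger values of $\alpha$ shrink the coefficients and set more of them exactly to zero.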

## Preparing the data

There are a few very small losses that are outliers. We therefore disregard them and keep only records with a loss greater than $\$100$, which are also the claims of interest in practical settings.

In [21]:
df = df[df['loss']>100]

We delete the `id` column:

In [22]:
del df['id']

Since running the LASSO on the full data set can take a very long time (especially if the penalty parameters aren't chosen well), it is fine to use only a sample of the data:

In [23]:
df = df.sample(n=50000, random_state=45)

We convert categoricals into dummies:

In [24]:
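# Collect the names of all categorical (object-typed) columns
objects = []
for c in df.columns:
    if str(df[c].dtype) == 'object':
        objects.append(c)
# Numeric features: everything that is not categorical, without the target
X_ = df.drop(objects, axis = 1).astype('float64')
X_ = X_.drop(['loss'], axis = 1)
# One dummy per category level, dropping the first level of each variable
dummies = pd.get_dummies(df[objects], drop_first=True)
X = pd.concat([X_, dummies], axis = 1)
y = df.loss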
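
To illustrate what `drop_first=True` does, here is a minimal sketch on a made-up toy frame (the values are hypothetical; only the column naming mimics the data set). A categorical variable with three levels yields two dummy columns, with the first level serving as the baseline:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Allstate data set
toy = pd.DataFrame({"cat1": ["A", "B", "A", "C"],
                    "cont1": [0.1, 0.4, 0.3, 0.9]})

# The first level ("A") is dropped and becomes the reference category
toy_dummies = pd.get_dummies(toy[["cat1"]], drop_first=True)
print(list(toy_dummies.columns))  # ['cat1_B', 'cat1_C']
```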

Let's look at our features:

In [None]:
X

In [None]:
plt.bar(X.columns,X.std(axis=0))

All the variables appear to be on similar scales already, so normalizing the data does not seem strictly necessary.

We still carry out this step to be safe, since the LASSO penalty depends on the scale of each feature.

In [None]:
X_org = X
scaler = StandardScaler()
X = scaler.fit_transform(X)
plt.bar(X_org.columns,X.std(axis=0))

We see that all features now have a standard deviation of 1.
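
As a quick check of what the scaler does, the following sketch (on a made-up toy array) compares `StandardScaler` with the manual formula: subtract the column mean, then divide by the column standard deviation. Both use the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual standardization; np.std uses ddof=0 by default
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```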

We split the data into training (50%), validation (25%), and test (25%) sets:

In [28]:
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size = 0.5, random_state=2)

We work with log losses. The log transformation makes the data much more amenable to regression.

In [29]:
y_train = np.log(y_train)
y_val = np.log(y_val)
y_test = np.log(y_test)
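
Note that any model fitted to these targets predicts log losses, so predictions have to be exponentiated to get back to dollar amounts. A minimal sketch with made-up values (the log requires strictly positive losses, which the filter on `loss` above ensures):

```python
import numpy as np

# Hypothetical dollar losses, for illustration only
loss = np.array([150.0, 1200.0, 9800.0])

log_loss = np.log(loss)        # scale used for fitting
recovered = np.exp(log_loss)   # back-transform to dollars

print(np.allclose(recovered, loss))  # True
```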

**Up to you**:
- Run a basic OLS model for loss prediction.
- Check if a LASSO regression can improve upon the basic OLS approach, and how drastic the improvement will be. Proceed as follows:
  - Run a LASSO regression.
  - Evaluate and visualize the LASSO fit for a selection of tuning parameters.
  - Determine a good choice for the tuning parameter.
  - Evaluate the performance of your predictive model.
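
As a starting point, the workflow above can be sketched on synthetic data. This is only an illustration under stated assumptions, not a solution for the Allstate data: the data-generating process, the names `X_syn`, `y_syn`, and `alphas`, and the grid of penalty values are all made up here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 50 features, only the first 5 matter
rng = np.random.default_rng(0)
n, p = 300, 50
X_syn = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y_syn = X_syn @ beta + rng.normal(size=n)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=0)

# Baseline: OLS, evaluated on the validation set
ols = LinearRegression().fit(X_tr, y_tr)
mse_ols = mean_squared_error(y_va, ols.predict(X_va))

# LASSO over a (hypothetical) grid of tuning parameters
alphas = np.logspace(-3, 0, 20)
mse_lasso = []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10000).fit(X_tr, y_tr)
    mse_lasso.append(mean_squared_error(y_va, model.predict(X_va)))

# Pick the tuning parameter with the lowest validation MSE
best_alpha = alphas[int(np.argmin(mse_lasso))]
print(f"OLS validation MSE:   {mse_ols:.3f}")
print(f"LASSO validation MSE: {min(mse_lasso):.3f} (alpha = {best_alpha:.4f})")
```

For the assignment itself, the same pattern would use `X_train`, `X_val`, and the log losses, with the test set held back for the final evaluation. The imported `LassoCV` is an alternative that chooses the tuning parameter by cross-validation on the training data instead of a separate validation set.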