### 6.6 Lab 2: Ridge Regression and the Lasso

In [1]:
import pandas as pd
import numpy as np
# Load dataset
hitters = pd.read_csv('Data/Hitters.csv', index_col = 0)
hitters = hitters.dropna()
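# One-hot encode the categorical columns, dropping the first level of each
dummies = pd.get_dummies(hitters[['League', 'Division', 'NewLeague']], drop_first=True)

# Replace the original string columns with their indicator columns
hitters.drop(labels=['League', 'Division', 'NewLeague'], axis="columns", inplace=True)
hitters[['League', 'Division', 'NewLeague']] = dummies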


### 6.6.1 Ridge Regression
We will use sklearn's Ridge class. Here alpha corresponds to lambda in the ISLR notation: it is the hyperparameter that controls the amount of regularisation.
We consider a grid of values ranging from alpha = 10^10 to alpha = 10^−2, essentially covering the full range of scenarios from the null model containing
only the intercept to the least squares fit.
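One way to build such a grid is with `np.logspace`, which takes the base-10 exponents of the endpoints rather than the endpoint values themselves. The sketch below is only illustrative; the variable name `alphas` and the choice of 100 grid points are assumptions.

```python
import numpy as np

# np.logspace(start, stop, num) returns num values from 10**start to 10**stop,
# so the arguments are exponents, not the alpha values themselves.
alphas = np.logspace(10, -2, 100)

print(alphas[0])    # largest penalty in the grid
print(alphas[-1])   # smallest penalty in the grid
```

Passing the alpha values themselves (for example `1e-2` and `1e10`) as the first two arguments would instead be read as exponents and produce a very different grid.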

We expect the coefficient estimates to be much smaller, in terms of L2 norm,
when a large value of alpha is used, as compared to when a small value of alpha is
used. Here is the L2 norm of the coefficients when alpha = 11,498:

In [2]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X = hitters.drop(columns="Salary", axis = 1)
X = scaler.fit_transform(X)
Y = hitters.Salary
ridge = Ridge(alpha = 11498)
ridge.fit(X, Y)
display(f"The l2 norm is {np.linalg.norm(ridge.coef_)}")


'The l2 norm is 15.508830341737179'

In contrast, here is the L2 norm of the coefficients when alpha = 705.
Note the much larger L2 norm of the coefficients associated with this
smaller value of alpha.

In [3]:
X = hitters.drop(columns="Salary", axis = 1)
X = scaler.fit_transform(X)
Y = hitters.Salary
ridge = Ridge(alpha = 705)
ridge.fit(X, Y)
display(f"The l2 norm is {np.linalg.norm(ridge.coef_)}")

'The l2 norm is 84.42978999311623'
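The same shrinkage effect can be reproduced on synthetic data, without the Hitters file. The sketch below uses the closed-form ridge solution on centered data rather than sklearn's solver; the helper `ridge_coef` and the simulated data are hypothetical and serve only to show the L2 norm of the coefficients falling as alpha grows.

```python
import numpy as np

def ridge_coef(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

# Simulated data: 200 observations, 5 roughly unit-variance features
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.standard_normal(200)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = [np.linalg.norm(ridge_coef(X, y, a)) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha = {a:>8}: L2 norm = {n:.4f}")
```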

We now split the samples into a training set and a test set in order
to estimate the test error of ridge regression and the lasso. Next we fit a ridge regression model on the training set and evaluate
its MSE on the test set, using alpha = 4.

In [4]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
X = hitters.drop(columns="Salary", axis = 1)
X = scaler.fit_transform(X)
Y = hitters.Salary
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

ridge = Ridge(alpha = 4)
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)
mse = mean_squared_error(y_test, predictions)
mse

127015.05249180352

The test MSE is 127015. Note that if we had instead simply fit a model
with just an intercept, we would have predicted each test observation using
the mean of the training observations. In that case, we could compute the
test set MSE like this:

In [5]:
np.mean(np.square(np.mean(y_train)-y_test))

231853.4150633351

We could also get nearly the same result by fitting a ridge regression model with
a very large value of alpha:

In [6]:
ridge = Ridge(alpha = 1e10)
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)
mse = mean_squared_error(y_test, predictions)
mse

231853.39962619243

So fitting a ridge regression model with alpha = 4 leads to a much lower test
MSE than fitting a model with just an intercept. We now check whether
there is any benefit to performing ridge regression with alpha = 4 instead of
just performing least squares regression. Recall that least squares is simply
ridge regression with alpha = 0.

In [7]:
ridge = Ridge(alpha = 0)
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)
mse = mean_squared_error(y_test, predictions)
mse

134597.49902509098
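Both limiting cases can be checked numerically. The following sketch uses a hypothetical closed-form helper, `ridge_fit`, on simulated data (not the Hitters data, and not sklearn's `Ridge`). With alpha = 0 the ridge coefficients coincide with the least squares fit, and with a very large alpha the predictions collapse to the training mean.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Return (intercept, coef) for ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coef, coef

# Simulated data, split in half into training and test sets
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.standard_normal(100)
X_train, X_test, y_train, y_test = X[:50], X[50:], y[:50], y[50:]

# alpha = 0: compare with ordinary least squares on a design matrix
# that includes an explicit intercept column
_, coef_zero = ridge_fit(X_train, y_train, 0.0)
A = np.column_stack([np.ones(len(X_train)), X_train])
ols = np.linalg.lstsq(A, y_train, rcond=None)[0]
print(np.allclose(coef_zero, ols[1:]))

# Very large alpha: coefficients shrink toward zero, so the test MSE
# approaches that of predicting the training mean everywhere
intercept_big, coef_big = ridge_fit(X_train, y_train, 1e10)
mse_big = np.mean((intercept_big + X_test @ coef_big - y_test) ** 2)
mse_null = np.mean((y_train.mean() - y_test) ** 2)
print(mse_big, mse_null)
```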

Ridge regression with alpha = 4 gives a lower test MSE than least squares here (127015 versus 134597).
In general, instead of arbitrarily choosing alpha = 4, it would be better to
use cross-validation to choose the tuning parameter alpha. We can do this using
sklearn's built-in RidgeCV class. With an integer cv, the folds are contiguous and unshuffled, so the results
are reproducible without setting a random seed.

In [8]:
from sklearn.linear_model import RidgeCV
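# np.logspace takes exponents, so this call starts at 10**0.01 (about 1.023) and
# every later point overflows to inf; np.logspace(-2, 10, 100) would give the
# 10^-2 to 10^10 range described earlier.
grid = np.logspace(1e-2, 1e10, 100, endpoint=True)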
ridgecv = RidgeCV(alphas=grid, cv=10)
ridgecv.fit(X_train, y_train)
predictions = ridgecv.predict(X_test)
mse = mean_squared_error(y_test, predictions)
display(f"Best score is {mse} with alpha = {ridgecv.alpha_}")


'Best score is 125945.92026715071 with alpha = 1.023292992280754'
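To make the selection step concrete, here is a minimal hand-rolled version of K-fold cross-validation over an alpha grid. It is a sketch under stated assumptions, not how `RidgeCV` is implemented: the helpers `ridge_fit` and `cv_mse`, the contiguous unshuffled folds, the grid bounds, and the simulated data are all illustrative.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Return (intercept, coef) for ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coef, coef

def cv_mse(X, y, alpha, n_folds=10):
    """Mean validation MSE over contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False          # hold out the current fold
        intercept, coef = ridge_fit(X[train_mask], y[train_mask], alpha)
        pred = intercept + X[val_idx] @ coef
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Simulated data with a few irrelevant features
rng = np.random.default_rng(2)
X = rng.standard_normal((120, 6))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.7, 0.0]) + rng.standard_normal(120)

alphas = np.logspace(-2, 4, 25)              # 10**-2 up to 10**4
scores = np.array([cv_mse(X, y, a) for a in alphas])
best_alpha = alphas[np.argmin(scores)]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {scores.min():.4f}")
```

If the selected alpha sits at either end of the grid, that is usually a hint to widen the grid in that direction and rerun the search.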

This is a slight improvement over the test MSE that we got using
alpha = 4 (125946 versus 127015). Finally, we refit our ridge regression model on the full data set,
using the value of alpha chosen by cross-validation, and examine the coefficient
estimates.

In [9]:
# Reuse the alpha selected by RidgeCV above (about 1.0233)
ridge = Ridge(alpha = ridgecv.alpha_)
X = hitters.drop(columns="Salary", axis = 1)
X = scaler.fit_transform(X)
Y = hitters.Salary
ridge.fit(X, Y)
ridge.coef_

array([-269.91994573,  295.76138136,   17.88982349,  -28.9406864 ,
         -8.97697153,  124.2303736 ,  -38.97993038, -222.94459644,
        126.98360993,   39.5876684 ,  318.09263198,  159.34003461,
       -183.99779792,   78.60149557,   47.33392441,  -23.78702208,
         31.01112897,  -60.27326218,  -13.68439816])

As expected, none of the coefficients are zero: ridge regression does not
perform variable selection.

### 6.6.2 The Lasso
We saw that ridge regression with a wise choice of alpha can outperform least
squares as well as the null model on the Hitters data set. We now ask
whether the lasso can yield either a more accurate or a more interpretable
model than ridge regression.

In [58]:
from sklearn.linear_model import LassoCV
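# np.logspace takes exponents: this grid runs from 10**0.001 (about 1.002)
# to 10**305, so no alpha below 1 is tried.
grid = np.logspace(1e-3, 305, 150, endpoint=True)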
X = hitters.drop(columns="Salary", axis = 1)
X = scaler.fit_transform(X)
Y = hitters.Salary
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)
lasso = LassoCV(alphas=grid, cv=10)
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)
mse = mean_squared_error(y_test, predictions)
display(f"Best score is {mse} with alpha = {lasso.alpha_}")

'Best score is 128991.99155773126 with alpha = 1.0023052380778996'

This is far lower than the test set MSE of the null model (231853), somewhat lower
than that of least squares (134597), and similar to the test MSE of ridge regression with alpha
chosen by cross-validation (125946).
However, the lasso has a substantial advantage over ridge regression in
that the resulting coefficient estimates are sparse. Here we see that 2 of
the 19 coefficient estimates are exactly zero.

In [57]:
lasso.coef_

array([-126.28609652,  103.16157218,   90.71663763,  -14.62400387,
         -0.        ,   70.19279164,  -66.18973161, -618.97481896,
        705.42415496,   -0.        ,  257.32916914,  -20.39986592,
        -13.0852749 ,   41.44428577,   48.92161909,  -60.23940143,
         14.13477793,  -43.13284435,   18.18541659])
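The number of exact zeros can also be counted programmatically rather than by eye. The sketch below copies the coefficient array printed above into a hypothetical variable `coef`; in the notebook one would use `lasso.coef_` directly.

```python
import numpy as np

# Coefficients copied from the lasso output above
coef = np.array([-126.28609652, 103.16157218, 90.71663763, -14.62400387,
                 -0.0, 70.19279164, -66.18973161, -618.97481896,
                 705.42415496, -0.0, 257.32916914, -20.39986592,
                 -13.0852749, 41.44428577, 48.92161909, -60.23940143,
                 14.13477793, -43.13284435, 18.18541659])

n_zero = int(np.sum(coef == 0))              # -0.0 compares equal to 0
zero_positions = np.flatnonzero(coef == 0)   # 0-based column positions
print(f"{n_zero} of {coef.size} coefficients are exactly zero")
print("zero at positions:", zero_positions)
```

The positions index the columns of X in the order they appear after the dummy variables were appended, so they can be mapped back to feature names through the columns of the predictor DataFrame.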