Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

# Lasso and elasticNet

Here are two more supervised learning algorithms that use shrinkage for regularization.

Lasso is an acronym for "least absolute shrinkage and selection operator."  It uses L1 regularization: the penalty applied to the cost (loss) function minimized during training is an L1 norm:  a constant (alpha here) times the sum of the absolute values of the regression weights.

elasticNet is a regularization method that combines the regularization penalties used in ridge regression (the squared L2 norm) and in the Lasso (the L1 norm).
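Written out, the three penalties look as follows. This uses the parameterization that scikit-learn documents for its `Ridge`, `Lasso`, and `ElasticNet` estimators. The symbols are introduced here for illustration: $n$ is the number of cases, $X$ the predictor matrix, $y$ the outcome, $w$ the regression weights, $\alpha$ the shrinkage constant, and $\rho$ the mixing weight that scikit-learn calls `l1_ratio`.

$$
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{elasticNet:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

With $\rho = 1$ the elasticNet cost reduces to the Lasso cost. Note that the Ridge cost carries no $1/(2n)$ factor, so an `alpha` value for Ridge is not on the same scale as one for the Lasso or elasticNet.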

An important difference between the Lasso and Ridge regression is that the latter can shrink coefficients, but it can never make them equal to zero short of `alpha` going to $\infty$.  Lasso _can_ shrink coefficients to zero, an advantage when there are lots of regressors, and interpretation of their importance is of interest.  Lasso can be used for variable _selection_.
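A minimal sketch of that difference, on synthetic data rather than the patient data. The sample size, true weights, and `alpha` values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
# Only the first three features drive y; the other five are pure noise.
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.3).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Lasso coefs:", np.round(lasso.coef_, 3))
print("Ridge coefs:", np.round(ridge.coef_, 3))
print("exact zeros -> Lasso:", n_zero_lasso, " Ridge:", n_zero_ridge)
```

The Ridge coefficients for the noise features come out small but not exactly zero, while the Lasso sets at least some of them to exactly zero.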

In the following we'll again use the inpatient satisfaction data.  We'll apply the Lasso.  We'll rescale the predictor variables, use cross validation, and search for a "good" shrinkage parameter value.  Then we'll experiment a little with elasticNet.

# Getting Some Packages and the Data

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score # Basic metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

Get that pesky satisfaction data, dummy code the patient categories, and create the numpy arrays we need.  

Assuming that the data file is in the `DATA/ML` subdirectory:

In [2]:
ptSatDF=pd.read_csv('DATA/ML/DECART-patSat.csv')
patSatDF2=ptSatDF.copy()
patSatDF2[['ptCat1','ptCat2']]=pd.get_dummies(patSatDF2.ptCat,drop_first=True)
patSatDF2=patSatDF2.drop(['caseID','ptCat'],axis=1)
patSatDF2.columns

Index(['patSat', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'ptCat1',
       'ptCat2'],
      dtype='object')

# Setting Up for CV

In [4]:
X=patSatDF2.iloc[:,1:].to_numpy()
y=patSatDF2.iloc[:,0].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
X_train.shape
X_test.shape
y_train.shape
y_test.shape

(1358, 10)

(453, 10)

(1358,)

(453,)

# Creating a Pipeline, a Grid to Search, and doing CV Lasso

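Since we're going to do MinMax scaling on our "features" (predictors), we need to be sure that the scaler is fit on training data only and then applied, unchanged, to the data held out for evaluation. Otherwise information about the held-out data leaks into training.

To make doing this more convenient we're going to set up a "pipeline" of methods that will sequentially apply rescaling and then the Lasso. Within each CV fold the scaler is fit on that fold's training portion and applied to its validation portion, and the same happens for the training and test X data.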
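What a pipeline does with the scaler can be spelled out by hand. The sketch below uses made-up data and an arbitrary `alpha`. It compares a pipeline against the equivalent manual steps, where the scaler learns its minimum and maximum from the training rows and the test rows are transformed with those same values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(120, 4))
y = X @ np.array([1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=99)

# Pipeline route: fit() learns min/max from X_tr only; predict() reuses them.
pipe = Pipeline([("scaler", MinMaxScaler()), ("lasso", Lasso(alpha=0.01))])
pipe.fit(X_tr, y_tr)
pred_pipe = pipe.predict(X_te)

# Manual route, doing the same thing step by step.
scaler = MinMaxScaler().fit(X_tr)                     # statistics from training rows only
lasso = Lasso(alpha=0.01).fit(scaler.transform(X_tr), y_tr)
pred_manual = lasso.predict(scaler.transform(X_te))   # test rows reuse training min/max

same = bool(np.allclose(pred_pipe, pred_manual))
print("pipeline and manual predictions agree:", same)
```

Inside `GridSearchCV` the same fit-then-transform sequence is repeated for every fold, which is what makes the pipeline convenient.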

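Note that in the specification of the grid `param_grid`, the `alpha` parameter array is named `lasso__alpha`, "lasso(double underscore)alpha". The part before the double underscore is the name of the pipeline step, and the part after it is the name of that step's parameter.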

In [5]:
lassoReg=linear_model.Lasso(random_state=99)  # normalize=False was the default; the argument was removed in scikit-learn 1.2
pipe = Pipeline([("scaler", MinMaxScaler()), ("lasso", lassoReg)])
#
param_grid={'lasso__alpha': [0.001, 0.01, 0.1, 1, 10, 100]} # alpha values
#
grid = GridSearchCV(pipe, param_grid=param_grid, cv=20)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.4f}".format(grid.best_score_))
print("Test set score: {:.4f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

GridSearchCV(cv=20, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('lasso',
                                        Lasso(alpha=1.0, copy_X=True,
                                              fit_intercept=True, max_iter=1000,
                                              normalize=False, positive=False,
                                              precompute=False, random_state=99,
                                              selection='cyclic', tol=0.0001,
                                              warm_start=False))],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'lasso__alpha': [0.001, 0.01, 0.1, 1, 10, 100]},
             pre_dispatch='2*n_jobs', refit=Tr

Best cross-validation accuracy: 0.6747
Test set score: 0.7283
Best parameters: {'lasso__alpha': 0.01}
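Although the printed labels say "accuracy" and "score", for a regressor such as the Lasso the default score that `GridSearchCV` reports is R², the coefficient of determination. A minimal sketch on synthetic data (the data, grid, and 5-fold setting are illustrative assumptions) that checks this against `r2_score`:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=99)

pipe = Pipeline([("scaler", MinMaxScaler()), ("lasso", Lasso())])
grid = GridSearchCV(pipe, {"lasso__alpha": [0.001, 0.01, 0.1]}, cv=5)
grid.fit(X_tr, y_tr)

# grid.score() on held-out data is the R^2 of the refit best pipeline.
test_r2 = grid.score(X_te, y_te)
manual_r2 = r2_score(y_te, grid.predict(X_te))

# best_score_ is the highest mean R^2 across the CV folds.
best_mean = grid.cv_results_["mean_test_score"].max()
print(test_r2, manual_r2, grid.best_score_, best_mean)
```

A different metric can be requested through the `scoring` argument of `GridSearchCV`, for example `scoring="neg_mean_squared_error"`.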


In [6]:
# Summarize the cv results

pd.DataFrame(grid.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_lasso__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | split13_test_score | split14_test_score | split15_test_score | split16_test_score | split17_test_score | split18_test_score | split19_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.003758 | 0.003261 | 0.000705 | 5.2e-05 | 0.001 | {'lasso__alpha': 0.001} | 0.73748 | 0.549968 | 0.745594 | 0.66729 | ... | 0.687571 | 0.787872 | 0.721716 | 0.622673 | 0.702962 | 0.667604 | 0.690136 | 0.674172 | 0.080751 | 2 |
| 1 | 0.001537 | 2e-05 | 0.000692 | 1.3e-05 | 0.01 | {'lasso__alpha': 0.01} | 0.736189 | 0.552228 | 0.740032 | 0.663569 | ... | 0.69083 | 0.790482 | 0.720442 | 0.624 | 0.699477 | 0.664623 | 0.689167 | 0.674723 | 0.077819 | 1 |
| 2 | 0.001493 | 2.3e-05 | 0.00069 | 1e-05 | 0.1 | {'lasso__alpha': 0.1} | 0.68393 | 0.554172 | 0.654748 | 0.606068 | ... | 0.670422 | 0.745071 | 0.651251 | 0.596415 | 0.635112 | 0.61938 | 0.65236 | 0.639161 | 0.054744 | 3 |
| 3 | 0.001357 | 1.4e-05 | 0.000691 | 7e-06 | 1.0 | {'lasso__alpha': 1} | -1e-06 | -0.02016 | -0.02843 | -0.038922 | ... | -0.024961 | -0.013839 | -0.198912 | -0.017209 | -0.058169 | -0.030193 | -0.02103 | -0.030924 | 0.040904 | 4 |
| 4 | 0.001343 | 1.7e-05 | 0.000685 | 1.2e-05 | 10.0 | {'lasso__alpha': 10} | -1e-06 | -0.02016 | -0.02843 | -0.038922 | ... | -0.024961 | -0.013839 | -0.198912 | -0.017209 | -0.058169 | -0.030193 | -0.02103 | -0.030924 | 0.040904 | 4 |
| 5 | 0.001351 | 1.6e-05 | 0.000701 | 2.5e-05 | 100.0 | {'lasso__alpha': 100} | -1e-06 | -0.02016 | -0.02843 | -0.038922 | ... | -0.024961 | -0.013839 | -0.198912 | -0.017209 | -0.058169 | -0.030193 | -0.02103 | -0.030924 | 0.040904 | 4 |
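The scores for `alpha` values of 1, 10, and 100 are identical and slightly negative. That is the pattern one would expect if the penalty were strong enough to shrink every coefficient to zero, leaving a model that simply predicts the training mean of `y`. Whether that is what happened with the patient data would need to be checked on the fitted models. The sketch below only shows the behavior itself, on made-up data with arbitrary true weights.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
# Features rescaled to [0, 1], as in the notebook's pipeline.
X = MinMaxScaler().fit_transform(rng.normal(size=(200, 4)))
y = X @ np.array([2.0, 1.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

for alpha in [0.01, 1, 10, 100]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3), round(float(model.intercept_), 3))

# With a heavy penalty every weight is zero and only the intercept remains.
heavy = Lasso(alpha=100).fit(X, y)
all_zero = bool(np.all(heavy.coef_ == 0.0))
predicts_mean = bool(np.allclose(heavy.predict(X), y.mean()))
print("all coefficients zero:", all_zero, " predicts mean of y:", predicts_mean)
```

Once all coefficients are zero, raising `alpha` further changes nothing, which would explain identical rows in a CV summary.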


That doesn't seem all that different from the CV regression model we trained earlier.  Let's take a look at the "best" Lasso model.  The pipeline we defined consists of _named steps_ that can be accessed individually.  There are two: "scaler" and "lasso":

In [40]:
print("Best estimator:\n{}".format(grid.best_estimator_))

Best estimator:
Pipeline(memory=None,
         steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('lasso',
                 Lasso(alpha=0.01, copy_X=True, fit_intercept=True,
                       max_iter=1000, normalize=False, positive=False,
                       precompute=False, random_state=99, selection='cyclic',
                       tol=0.0001, warm_start=False))],
         verbose=False)


The "lasso" step is:

In [41]:
print("lasso step:\n{}".format(grid.best_estimator_.named_steps["lasso"]))

lasso step:
Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=99,
      selection='cyclic', tol=0.0001, warm_start=False)


And the Lasso coefficients:

In [42]:
print("Lasso coefs:\n{}".format(grid.best_estimator_.named_steps["lasso"].coef_))

Lasso coefs:
[1.01168204 0.8140796  0.41591071 3.43746095 0.9112827  0.1551897
 0.         0.12520995 0.87604514 1.57825472]
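The coefficient array carries no labels, so it helps to pair it with the predictor names. A minimal sketch, assuming the column order of `X` is the order of the columns it was built from. The column names `a` to `d` and the data are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
# Hypothetical predictor names standing in for the real columns.
df = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
y = 2.0 * df["a"] - 1.0 * df["c"] + rng.normal(scale=0.5, size=150)

pipe = Pipeline([("scaler", MinMaxScaler()), ("lasso", Lasso(alpha=0.01))])
pipe.fit(df.to_numpy(), y.to_numpy())

lasso = pipe.named_steps["lasso"]
# Position i of coef_ belongs to the i-th column of X.
coefs = pd.Series(lasso.coef_, index=df.columns)
print(coefs)
# The intercept is stored separately from coef_.
print("intercept:", lasso.intercept_)
```

With the notebook's objects the matching index would be `patSatDF2.columns[1:]`, since `X` was built from `patSatDF2.iloc[:,1:]`.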


Is that coefficient _really_ zero? Counting from the first predictor column (`q2`), the "zero" sits at position 6 of `coef_`, which makes it the coefficient for the variable `q8`.  _Looks_ like a zero:

In [44]:
grid.best_estimator_.named_steps['lasso'].coef_[6]

0.0

It's worth taking a look at the correlations between `patSat` and `q8` and the other vars:

In [48]:
patSatDF2.corr().iloc[:,0]

patSat    1.000000
q2        0.649388
q3        0.613785
q4        0.652300
q5        0.741797
q6        0.589858
q7        0.564058
q8        0.443638
q9        0.452323
ptCat1    0.061737
ptCat2    0.588030
Name: patSat, dtype: float64
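A sizable correlation with the outcome does not by itself keep a feature in a Lasso model: if other predictors already carry the same information, the L1 penalty can drop it. The sketch below illustrates this on synthetic data, not the patient data. The variables `x1` and `x2`, the weights, and `alpha=0.3` are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
# x2 overlaps with x1 but has no effect of its own on y.
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
corr_x2_y = float(np.corrcoef(x2, y)[0, 1])   # marginal correlation of x2 with y
lasso = Lasso(alpha=0.3).fit(X, y)

print("corr(x2, y):", round(corr_x2_y, 3))
print("Lasso coefs:", np.round(lasso.coef_, 3))
```

Here `x2` correlates with `y` only through its overlap with `x1`, so once `x1` is in the model the penalty removes `x2`. Looking at the correlations among the predictors themselves, for example with `patSatDF2.corr()`, is one way to see whether something similar could be going on.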

## What Happened?

# Lasso and LARS

LARS is an acronym for _least-angle regression_, a method that does regularization and feature selection.  It has strengths and weaknesses relative to other methods. The latter include difficulty dealing with highly collinear features.  See [LARS on Wikipedia](https://en.wikipedia.org/wiki/Least-angle_regression).

There is a version of the Lasso that uses LARS, ["LassoLARS"](https://en.wikipedia.org/wiki/Lasso_(statistics)), which is implemented in scikit-learn's `linear_model` module as the class [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html#sklearn.linear_model.LassoLars).
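For orientation, here is a minimal sketch of the `LassoLars` estimator on synthetic data, next to the coordinate-descent `Lasso`. It assumes a recent scikit-learn release (older releases normalized the features inside `LassoLars` by default, which changes the effective penalty). The data, `alpha`, and tolerance settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Same penalty, two different solvers.
cd = Lasso(alpha=0.1, tol=1e-8, max_iter=10000).fit(X, y)  # coordinate descent
lars = LassoLars(alpha=0.1).fit(X, y)                      # LARS-based solver

max_gap = float(np.max(np.abs(cd.coef_ - lars.coef_)))
n_zero_lars = int(np.sum(lars.coef_ == 0.0))
print("Lasso     :", np.round(cd.coef_, 3))
print("LassoLars :", np.round(lars.coef_, 3))
print("largest coefficient difference:", max_gap)
```

Both estimators expose `coef_` and `intercept_`, so `LassoLars` can take the place of `Lasso` as a pipeline step.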

# A UDU: Replicate the Above Lasso Application Using LassoLARS

# elasticNet

AKA "elastic net," it has some advantages over the Lasso.  It can select more features than there are cases (rows of data), and where the Lasso tends to select just one feature from a set of highly intercorrelated features, elastic net tends to keep such features together.  It does this by using a cost function that combines the penalizations of both Lasso (L1) and Ridge (L2) regression.  It can do a little better than the Lasso when features are (multi)collinear.
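How the two penalties treat highly correlated features can be seen on a small synthetic example. In the sketch below, two features are near copies of the same signal; the data and the `alpha` and `l1_ratio` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)
# Two nearly identical copies of the same signal, plus one unrelated feature.
X = np.column_stack([
    z + rng.normal(scale=0.01, size=n),
    z + rng.normal(scale=0.01, size=n),
    rng.normal(size=n),
])
y = 3.0 * z + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1, max_iter=100000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=100000).fit(X, y)

print("Lasso      :", np.round(lasso.coef_, 3))
print("ElasticNet :", np.round(enet.coef_, 3))

# Difference between the weights given to the two twin features.
twin_gap = float(abs(enet.coef_[0] - enet.coef_[1]))
print("ElasticNet twin gap:", twin_gap)
```

The L2 part of the elasticNet penalty pushes the weights of the twin features toward each other, so they share the effect. The Lasso has no such pull and may split the weight unevenly or put it all on one twin.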

Let's try elasticNet.  The code below reuses `X_train` and `y_train` from above, that is, the inpatient satisfaction data.  We'll again do 20-fold CV with MinMax scaling, as with the Lasso above.

In [50]:
eNetReg=linear_model.ElasticNet(random_state=99)  # normalize=False was the default; the argument was removed in scikit-learn 1.2
pipe = Pipeline([("scaler", MinMaxScaler()), ("eNet", eNetReg)])
#
param_grid={'eNet__alpha': [0.001, 0.01, 0.1, 1, 10, 100]} # alpha values
#
grid = GridSearchCV(pipe, param_grid=param_grid, cv=20)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.4f}".format(grid.best_score_))
print("Test set score: {:.4f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

GridSearchCV(cv=20, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('eNet',
                                        ElasticNet(alpha=1.0, copy_X=True,
                                                   fit_intercept=True,
                                                   l1_ratio=0.5, max_iter=1000,
                                                   normalize=False,
                                                   positive=False,
                                                   precompute=False,
                                                   random_state=99,
                                                   selection='cyclic',
                                                   tol=0.0001,
                               

Best cross-validation accuracy: 0.6751
Test set score: 0.7306
Best parameters: {'eNet__alpha': 0.01}
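The search above varies only `alpha`, so the mix between the L1 and L2 penalties stays at the `ElasticNet` default of `l1_ratio=0.5`. A minimal sketch of searching over both parameters at once, on synthetic data; the grid values and the 5-fold setting are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 6))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=300)

pipe = Pipeline([("scaler", MinMaxScaler()), ("eNet", ElasticNet(max_iter=10000))])
param_grid = {
    "eNet__alpha": [0.001, 0.01, 0.1, 1],
    "eNet__l1_ratio": [0.1, 0.5, 0.9],   # 1.0 would be the pure Lasso penalty
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)

# Every combination of the two lists is one candidate model.
n_candidates = len(grid.cv_results_["params"])
print("candidates tried:", n_candidates)
print("best parameters:", grid.best_params_)
```

The grid is the cross product of the two lists, so the number of fits grows quickly as values are added.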


In [51]:
print("ElasticNet coefs:\n{}".format(grid.best_estimator_.named_steps["eNet"].coef_))

ElasticNet coefs:
[0.99156813 0.82309054 0.57352466 3.08506985 0.90074559 0.2880325
 0.         0.20559254 0.89634168 1.59854033]


Note that `coef_` holds one coefficient per feature, in column order. The intercept, which is fit by default, is not part of `coef_`; it is stored separately in `intercept_`.

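By comparison, the Lasso coefficients from above were:

[1.01168204 0.8140796  0.41591071 3.43746095 0.9112827  0.1551897
 0.         0.12520995 0.87604514 1.57825472]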

# UDU: Apply elasticNet to the pt Sat Data, Find a "Best" shrinkage parameter value