
<br>
===================================================<br>
Lasso model selection: Cross-Validation / AIC / BIC<br>
===================================================<br>
Use the Akaike information criterion (AIC), the Bayesian information<br>
criterion (BIC) and cross-validation to select an optimal value<br>
of the regularization parameter alpha of the :ref:`lasso` estimator.<br>
Results obtained with LassoLarsIC are based on AIC/BIC criteria.<br>
Information-criterion based model selection is very fast, but it<br>
relies on a proper estimation of the degrees of freedom. The criteria are<br>
derived for large samples (asymptotic results) and assume the model<br>
is correct, i.e. that the data are actually generated by this model.<br>
They also tend to break down when the problem is badly conditioned<br>
(more features than samples).<br>
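
As a rough illustration of how such criteria trade goodness of fit against model size, the sketch below evaluates AIC and BIC in their common textbook form for a Gaussian linear model, `n * log(RSS / n) + penalty * df`, with the degrees of freedom `df` taken to be the number of non-zero coefficients. This is a simplified form stated up to an additive constant and not necessarily the exact expression LassoLarsIC evaluates; the function name `information_criterion` and the candidate values are hypothetical.

```python
import math


def information_criterion(rss, n_samples, df, criterion="aic"):
    """Gaussian-likelihood AIC/BIC, up to an additive constant.

    rss: residual sum of squares of the fitted model
    df: degrees of freedom (for the Lasso, commonly estimated by the
        number of non-zero coefficients)
    """
    if criterion == "aic":
        penalty = 2.0
    elif criterion == "bic":
        penalty = math.log(n_samples)
    else:
        raise ValueError("criterion must be 'aic' or 'bic'")
    return n_samples * math.log(rss / n_samples) + penalty * df


# Hypothetical candidates along a path: (residual sum of squares, non-zero coefs)
candidates = [(400.0, 1), (250.0, 3), (235.0, 5)]
n = 100
aic = [information_criterion(rss, n, df, "aic") for rss, df in candidates]
bic = [information_criterion(rss, n, df, "bic") for rss, df in candidates]

# Index of the candidate minimizing each criterion.
best_aic = min(range(len(candidates)), key=aic.__getitem__)
best_bic = min(range(len(candidates)), key=bic.__getitem__)
print("AIC picks candidate", best_aic, "- BIC picks candidate", best_bic)
```

With these toy numbers BIC prefers the sparser candidate, because its penalty `log(n)` per degree of freedom exceeds the AIC penalty of 2 once n is larger than about 7.
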
For cross-validation, we use 20 folds and two algorithms to compute the<br>
Lasso path: coordinate descent, as implemented by the LassoCV class, and<br>
Lars (least angle regression), as implemented by the LassoLarsCV class.<br>
Both algorithms give roughly the same results. They differ in<br>
execution speed and in their sources of numerical error.<br>
Lars computes a path solution only for each kink in the path. As a<br>
result, it is very efficient when there are only a few kinks, which is<br>
the case if there are few features or samples. Also, it is able to<br>
compute the full path without setting any meta-parameter. In<br>
contrast, coordinate descent computes the path points on a pre-specified<br>
grid (here we use the default). Thus it is more efficient if the number<br>
of grid points is smaller than the number of kinks in the path. Such a<br>
strategy can be interesting if the number of features is really large<br>
and there are enough samples to select a large number of them. In terms of<br>
numerical errors, for heavily correlated variables, Lars will accumulate<br>
more errors, while the coordinate descent algorithm will only sample the<br>
path on a grid.<br>
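
To make the "pre-specified grid" idea concrete, here is a minimal pure-Python sketch of cyclic coordinate descent evaluated on a log-spaced grid of alphas, warm-starting each solve from the previous one. It assumes the objective `(1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1` with no intercept; the helper names (`soft_threshold`, `lasso_path_cd`) and the fixed number of sweeps are illustrative choices, not the LassoCV implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_path_cd(X, y, n_alphas=5, eps=1e-3, n_sweeps=200):
    """Cyclic coordinate descent on a geometric grid of alphas.

    X is a list of rows, y a list of targets; n_alphas must be >= 2.
    Returns the grid and the coefficient vector found at each grid point.
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    # Smallest alpha for which every coefficient is exactly zero.
    alpha_max = max(abs(sum(c * t for c, t in zip(col, y))) for col in cols) / n
    # Pre-specified grid, log-spaced from alpha_max down to eps * alpha_max.
    alphas = [alpha_max * eps ** (k / (n_alphas - 1)) for k in range(n_alphas)]
    w = [0.0] * p
    residual = list(y)  # equals y - X w, with w = 0 initially
    path = []
    for alpha in alphas:  # warm start: w carries over from the previous alpha
        for _ in range(n_sweeps):
            for j, col in enumerate(cols):
                z = sum(c * c for c in col) / n
                if z == 0.0:
                    continue  # an all-zero column cannot be updated
                # Correlation of feature j with the residual that ignores w[j].
                rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
                new_wj = soft_threshold(rho, alpha) / z
                delta = new_wj - w[j]
                if delta != 0.0:
                    residual = [r - c * delta for c, r in zip(col, residual)]
                    w[j] = new_wj
        path.append(list(w))
    return alphas, path


# Toy design with orthogonal columns, so each solution is a soft-threshold.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [4.0, -2.0, 2.0, -4.0]  # y = 3 * column 0 + 1 * column 1
alphas, path = lasso_path_cd(X, y, n_alphas=3, eps=1e-2)
for alpha, coefs in zip(alphas, path):
    print("alpha=%.3f -> coefficients %s" % (alpha, coefs))
```

The cost of this approach scales with the number of grid points rather than with the number of kinks, which is the trade-off described above.
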
Note how the optimal value of alpha varies for each fold. This<br>
illustrates why nested cross-validation is necessary when trying to<br>
evaluate the performance of a method for which a parameter is chosen by<br>
cross-validation: this choice of parameter may not be optimal for unseen<br>
data.<br>
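
The nested scheme can be sketched without any library on a toy problem. The example below uses a one-feature Lasso with no intercept, for which the solution has a closed form, and synthetic data generated on the spot. The inner loop selects alpha on each outer training set, and the outer loop scores the complete "select, then refit" procedure on data that played no part in the selection. All function names here are hypothetical, and the fold layout (interleaved indices) is just one simple choice.

```python
import random


def fit_lasso_1d(x, y, alpha):
    """Closed-form Lasso for a single feature and no intercept."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    shrunk = max(abs(rho) - alpha, 0.0)
    return (shrunk if rho > 0 else -shrunk) / z


def mse(x, y, w):
    """Mean squared error of the prediction w * x."""
    return sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x)


def k_folds(indices, k):
    """Yield k interleaved (train, test) index splits."""
    for f in range(k):
        test = indices[f::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def select_alpha(x, y, indices, alphas, k):
    """Inner loop: pick the alpha with the lowest mean validation MSE."""
    def cv_error(alpha):
        errors = []
        for train, test in k_folds(indices, k):
            w = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], alpha)
            errors.append(mse([x[i] for i in test], [y[i] for i in test], w))
        return sum(errors) / k
    return min(alphas, key=cv_error)


def nested_cv(x, y, alphas, outer_k=4, inner_k=3):
    """Outer loop: score the whole 'select alpha, then refit' procedure."""
    results = []
    for train, test in k_folds(list(range(len(x))), outer_k):
        alpha = select_alpha(x, y, train, alphas, inner_k)
        w = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], alpha)
        results.append((alpha, mse([x[i] for i in test], [y[i] for i in test], w)))
    return results


# Synthetic data: y = 2 * x + noise (seeded for reproducibility).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(60)]
y = [2.0 * a + rng.gauss(0.0, 1.0) for a in x]
grid = [0.01, 0.1, 0.5, 1.0]
results = nested_cv(x, y, alphas=grid)
for alpha, test_mse in results:
    print("selected alpha=%.2f, outer-fold MSE=%.3f" % (alpha, test_mse))
```

The outer-fold errors estimate how the selection procedure generalizes, whereas the inner validation error at the selected alpha is typically optimistic because alpha was tuned on it.
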


In [None]:
print(__doc__)

Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort<br>
License: BSD 3 clause

In [None]:
import time

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets

This small offset is added to alpha before calling np.log10, to avoid taking the log of zero (which NumPy reports as a divide-by-zero warning)

In [None]:
EPSILON = 1e-4
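
The effect of the offset on the plotted x-axis can be checked with the standard library alone. This sketch uses `math.log10` rather than `np.log10` (note that `math.log10(0)` raises `ValueError`, whereas NumPy returns `-inf` with a warning); the helper name `neg_log10` is hypothetical.

```python
import math

EPSILON = 1e-4


def neg_log10(alpha, offset=EPSILON):
    """x-axis transform used in the plots: -log10(alpha + offset)."""
    return -math.log10(alpha + offset)


at_zero = neg_log10(0.0)    # finite thanks to the offset
at_one = neg_log10(1.0)     # the offset is negligible for large alpha
at_small = neg_log10(1e-4)  # the offset matters when alpha is close to EPSILON
print(at_zero, at_one, at_small)
```

One consequence of this design is that the axis saturates near `-log10(EPSILON)`: alphas much smaller than EPSILON all map to roughly the same position.
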

In [None]:
X, y = datasets.load_diabetes(return_X_y=True)

In [None]:
rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]  # add some bad features

Normalize the data (scale each column to unit Euclidean norm), as done by Lars, to allow for comparison

In [None]:
X /= np.sqrt(np.sum(X ** 2, axis=0))
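
For readers who want to see what this in-place division does, the following dependency-free sketch applies the same column scaling to a tiny list-of-rows matrix and checks that every column ends up with unit Euclidean norm. The function name `normalize_columns` and the toy matrix are illustrative only.

```python
import math


def normalize_columns(X):
    """Scale every column of X (a list of rows) to unit Euclidean norm."""
    n_features = len(X[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]
    return [[row[j] / norms[j] for j in range(n_features)] for row in X]


X_toy = [[3.0, 1.0], [4.0, 1.0]]
X_unit = normalize_columns(X_toy)
# Recompute the column norms of the scaled matrix; each should be 1.
column_norms = [math.sqrt(sum(row[j] ** 2 for row in X_unit)) for j in range(2)]
print(X_unit, column_norms)
```
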

#############################################################################<br>
LassoLarsIC: least angle regression with BIC/AIC criterion

In [None]:
model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_

In [None]:
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_

In [None]:
def plot_ic_criterion(model, name, color):
    alpha_ = model.alpha_ + EPSILON
    alphas_ = model.alphas_ + EPSILON
    criterion_ = model.criterion_
    plt.plot(-np.log10(alphas_), criterion_, '--', color=color,
             linewidth=3, label='%s criterion' % name)
    plt.axvline(-np.log10(alpha_), color=color, linewidth=3,
                label='alpha: %s estimate' % name)
    plt.xlabel('-log(alpha)')
    plt.ylabel('criterion')

In [None]:
plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)'
          % t_bic)

#############################################################################<br>
LassoCV: coordinate descent

Compute paths

In [None]:
print("Computing regularization path using the coordinate descent lasso...")
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1

Display results

In [None]:
m_log_alphas = -np.log10(model.alphas_ + EPSILON)

In [None]:
plt.figure()
ymin, ymax = 2300, 3800
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_ + EPSILON), linestyle='--', color='k',
            label='alpha: CV estimate')

In [None]:
plt.legend()

In [None]:
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent '
          '(train time: %.2fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
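
Each dotted curve in this figure is one column of `mse_path_`, which has one row per alpha and one column per fold. The sketch below shows, on a small hand-written matrix, how the per-fold minimizers can differ from the alpha that minimizes the fold-averaged error. The numbers are hypothetical and chosen only to make the point; the helper names are not part of scikit-learn.

```python
def best_alpha_per_fold(alphas, mse_path):
    """Return, for each fold, the alpha with the lowest validation MSE.

    mse_path is laid out like LassoCV.mse_path_: one row per alpha,
    one column per fold.
    """
    n_folds = len(mse_path[0])
    best = []
    for fold in range(n_folds):
        column = [row[fold] for row in mse_path]
        best.append(alphas[column.index(min(column))])
    return best


def best_alpha_on_average(alphas, mse_path):
    """Return the alpha minimizing the MSE averaged over the folds."""
    means = [sum(row) / len(row) for row in mse_path]
    return alphas[means.index(min(means))]


# Hypothetical values for three alphas and three folds (not real results).
alphas = [1.0, 0.1, 0.01]
mse_path = [
    [3.0, 2.9, 3.4],
    [2.5, 3.0, 2.8],
    [2.7, 3.2, 2.6],
]
per_fold = best_alpha_per_fold(alphas, mse_path)
overall = best_alpha_on_average(alphas, mse_path)
print("per-fold choices:", per_fold, "- averaged choice:", overall)
```
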

#############################################################################<br>
LassoLarsCV: least angle regression

Compute paths

In [None]:
print("Computing regularization path using the Lars lasso...")
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1

Display results

In [None]:
m_log_alphas = -np.log10(model.cv_alphas_ + EPSILON)

In [None]:
plt.figure()
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_ + EPSILON), linestyle='--', color='k',
            label='alpha CV')
plt.legend()

In [None]:
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (train time: %.2fs)'
          % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

In [None]:
plt.show()