## 1.1. Linear Models

### 1.1.3. Lasso

The [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients (see [Compressive sensing: tomography reconstruction with L1 prior (Lasso)](https://scikit-learn.org/stable/auto_examples/applications/plot_tomography_l1_reconstruction.html#sphx-glr-auto-examples-applications-plot-tomography-l1-reconstruction-py)).

Mathematically, it consists of a linear model with an added regularization term. The objective function to minimize is:

$$\min_w\frac{1}{2n_{samples}}\lVert Xw - y \rVert_2^2 + \alpha\lVert w \rVert_1$$

The Lasso estimate thus minimizes the least-squares loss plus the penalty $\alpha\lVert w \rVert_1$, where $\alpha$ is a constant and $\lVert w \rVert_1$ is the $\ell_1$-norm of the coefficient vector.
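
As a rough illustration of the objective above, the sketch below evaluates it with NumPy for a fitted model and for perturbed coefficients. The helper `lasso_objective`, the synthetic data, and the inclusion of the intercept in the residual (because `Lasso` fits an intercept by default) are assumptions made for this example, not part of the library.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_objective(X, y, w, intercept, alpha):
    """Evaluate (1 / (2 n)) * ||Xw + b - y||_2^2 + alpha * ||w||_1."""
    n_samples = X.shape[0]
    residual = X @ w + intercept - y
    return residual @ residual / (2 * n_samples) + alpha * np.abs(w).sum()


# Synthetic data (illustrative): only two of five features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.5, 0.0, -2.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

alpha = 0.1
reg = Lasso(alpha=alpha).fit(X, y)

# Objective at the fitted solution.
fitted_value = lasso_objective(X, y, reg.coef_, reg.intercept_, alpha)

# Moving away from the fitted coefficients should increase the objective.
perturbed_value = lasso_objective(X, y, reg.coef_ + 0.5, reg.intercept_, alpha)
zero_value = lasso_objective(X, y, np.zeros(5), y.mean(), alpha)

print(fitted_value, perturbed_value, zero_value)
```

The fitted coefficients should give the smallest of the three printed values, which is what minimizing the objective means in practice.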

The [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) class uses coordinate descent to fit the coefficients. See [Least Angle Regression](https://scikit-learn.org/stable/modules/linear_model.html#least-angle-regression) for another implementation:

```python
from sklearn import linear_model

reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
reg.predict([[1, 1]])
# array([0.8])
```

The function [lasso_path](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lasso_path.html) is useful for lower-level tasks, as it computes the coefficients along the full path of possible `alpha` values.
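
A minimal sketch of how `lasso_path` might be called is shown below. The synthetic data and variable names are assumptions for illustration. The function returns the grid of `alpha` values in decreasing order together with one coefficient column per `alpha`.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data (illustrative): two informative features out of four.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=60)

# alphas: decreasing grid of regularization strengths
# coefs: array of shape (n_features, n_alphas), one column per alpha
alphas, coefs, dual_gaps = lasso_path(X, y)

# Count the active (non-zero) coefficients at each point of the path.
n_active = (np.abs(coefs) > 1e-10).sum(axis=0)
print(alphas[0], n_active[0])    # strongest penalty on the grid
print(alphas[-1], n_active[-1])  # weakest penalty on the grid
```

At the largest `alpha` on the grid the coefficients are (numerically) zero, and features enter the model as `alpha` decreases.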

**Example:**
- L1-based models for Sparse Signals - [Sci-kit Link](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_and_elasticnet.html#sphx-glr-auto-examples-linear-model-plot-lasso-and-elasticnet-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/06_example_plot_lasso_and_elasticnet.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/06_example_plot_lasso_and_elasticnet.ipynb)
- Compressive sensing: tomography reconstruction with L1 prior (Lasso) - [Sci-kit Link](https://scikit-learn.org/stable/auto_examples/applications/plot_tomography_l1_reconstruction.html#sphx-glr-auto-examples-applications-plot-tomography-l1-reconstruction-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/07_example_plot_tomography_l1_reconstruction.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/07_example_plot_tomography_l1_reconstruction.ipynb)
- Common pitfalls in the interpretation of coefficients of linear models - [Sci-kit Link](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/05_example_plot_linear_model_coefficient_interpretation.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/05_example_plot_linear_model_coefficient_interpretation.ipynb)

<div class="alert alert-info"><h5>Note</h5><h4>Feature selection with Lasso</h4>
    
<p>

As the Lasso regression yields sparse models, it can thus be used to perform feature selection, as detailed in [L1-based feature selection](https://scikit-learn.org/stable/modules/feature_selection.html#l1-feature-selection).</p></div>
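
One possible way to use this in practice is to wrap a `Lasso` estimator in `SelectFromModel`, as sketched below. The synthetic data and the choice `alpha=0.5` are assumptions for this example. In a real problem `alpha` would need to be tuned.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (illustrative): only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)

# Features whose Lasso coefficient is (numerically) zero are dropped.
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
selected = np.flatnonzero(selector.get_support())
X_reduced = selector.transform(X)

print(selected)         # indices of the retained features
print(X_reduced.shape)  # data restricted to those features
```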

**References**

The following two references explain the iterations used in the coordinate descent solver of scikit-learn, as well as the duality gap computation used for convergence control.

- *“Regularization Paths for Generalized Linear Models via Coordinate Descent”*, Friedman, Hastie & Tibshirani, J Stat Softw, 2010 ([Paper](https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf)).
- *“An Interior-Point Method for Large-Scale L1-Regularized Least Squares”*, S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky, in IEEE Journal of Selected Topics in Signal Processing, 2007 ([Paper](https://web.stanford.edu/~boyd/papers/pdf/l1_ls.pdf)).

#### 1.1.3.1. Setting regularization parameter

The `alpha` parameter controls the degree of sparsity of the estimated coefficients.
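
The effect of `alpha` on sparsity can be sketched by fitting the same data with increasing penalties and counting the non-zero coefficients. The data and the particular `alpha` values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative): three informative features out of ten.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.0]
y = X @ true_w + 0.5 * rng.normal(size=100)

alphas = [0.001, 0.1, 1.0, 10.0]
nonzero_counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    nonzero_counts.append(int(np.count_nonzero(coef)))

# Larger alpha values drive more coefficients to exactly zero.
for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha}: {count} non-zero coefficients")
```

With a very small `alpha` the fit is close to ordinary least squares, while a sufficiently large `alpha` sets every coefficient to zero.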

##### 1.1.3.1.1. Using cross-validation

scikit-learn exposes objects that set the Lasso `alpha` parameter by cross-validation: [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV) and [LassoLarsCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html#sklearn.linear_model.LassoLarsCV). `LassoLarsCV` is based on the [Least Angle Regression](https://scikit-learn.org/stable/modules/linear_model.html#least-angle-regression) algorithm explained below.

For high-dimensional datasets with many collinear features, `LassoCV` is most often preferable. However, `LassoLarsCV` has the advantage of exploring more relevant values of the `alpha` parameter, and if the number of samples is very small compared to the number of features, it is often faster than `LassoCV`.

<center><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_lasso_model_selection_002.png" /><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_lasso_model_selection_003.png" /></center>
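
A minimal usage sketch of both cross-validated estimators follows. The synthetic data and the choice of 5 folds are assumptions for this example. The selected regularization strength is exposed as the `alpha_` attribute after fitting.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsCV

# Synthetic data (illustrative): two informative features out of ten.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * rng.normal(size=80)

# Coordinate-descent based selection over a grid of alpha values.
cd_model = LassoCV(cv=5).fit(X, y)

# LARS based selection over the alpha values found along the path.
lars_model = LassoLarsCV(cv=5).fit(X, y)

print("LassoCV alpha:", cd_model.alpha_)
print("LassoLarsCV alpha:", lars_model.alpha_)
```

The two estimators search different sets of candidate `alpha` values, so the selected values need not coincide.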

##### 1.1.3.1.2. Information-criteria based model selection

Alternatively, the estimator [LassoLarsIC](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsIC.html#sklearn.linear_model.LassoLarsIC) selects `alpha` using the **Akaike Information Criterion (AIC)** or the **Bayesian Information Criterion (BIC)**. It is a computationally cheaper way to find the optimal value of `alpha`, since the regularization path is computed only once instead of k+1 times as with k-fold cross-validation.

Indeed, these criteria are computed on the in-sample training set. In short, they penalize the over-optimistic scores of the different Lasso models by their flexibility (cf. the “Mathematical details” section below).

However, such criteria need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results), and assume that the correct model is among the candidates under investigation. They also tend to break down when the problem is badly conditioned (e.g. more features than samples).

<center><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_lasso_lars_ic_001.png" /></center>
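
The sketch below shows one way `LassoLarsIC` might be used with both criteria. The synthetic data is an assumption for this example, and it deliberately has more samples than features so that the default noise-variance estimate is available.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

# Synthetic data (illustrative): n_samples > n_features, two informative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.5 * rng.normal(size=100)

aic_model = LassoLarsIC(criterion="aic").fit(X, y)
bic_model = LassoLarsIC(criterion="bic").fit(X, y)

# criterion_ holds one value per alpha on the path; alpha_ is the minimizer.
best_aic_index = int(np.argmin(aic_model.criterion_))
print("AIC alpha:", aic_model.alpha_, "at path index", best_aic_index)
print("BIC alpha:", bic_model.alpha_)
print("non-zero coefficients (AIC, BIC):",
      np.count_nonzero(aic_model.coef_), np.count_nonzero(bic_model.coef_))
```

Because BIC penalizes each parameter by $\log(N)$ rather than $2$, it selects a model that is no less sparse than the AIC choice whenever $\log(N) > 2$.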

**Example:**
- Lasso model selection: AIC-BIC / cross-validation - [Sci-kit Link](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html#sphx-glr-auto-examples-linear-model-plot-lasso-model-selection-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/08_example_plot_lasso_model_selection.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/08_example_plot_lasso_model_selection.ipynb)
- Lasso model selection via information criteria - [Sci-kit Link](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars_ic.html#sphx-glr-auto-examples-linear-model-plot-lasso-lars-ic-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/09_example_plot_lasso_lars_ic.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/09_example_plot_lasso_lars_ic.ipynb)

##### 1.1.3.1.3. AIC and BIC criteria

The definition of AIC and BIC might differ in the literature. In this section, we give more information regarding the criterion computed in scikit-learn.

**Mathematical details**

The AIC criterion is defined as:

$$AIC = -2\log(\hat{L}) + 2d$$

where $\hat{L}$ is the maximum likelihood of the model and $d$ is the number of parameters (also referred to as degrees of freedom in the previous section).

The definition of BIC replaces the constant $2$ by $\log(N)$:

$$BIC = -2\log(\hat{L}) + \log(N)d$$

where $N$ is the number of samples.

For a linear Gaussian model, the maximum log-likelihood is defined as:

$$\log(\hat{L}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{2\sigma^2}$$

where $\sigma^2$ is an estimate of the noise variance, $y_i$ and $\hat{y}_i$ are respectively the true and predicted targets, and $n$ is the number of samples.

Plugging the maximum log-likelihood in the AIC formula yields:

$$AIC = n\log(2\pi\sigma^2) + \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sigma^2} + 2d$$

The first term of the above expression is sometimes discarded since it is a constant when $\sigma^2$ is provided. In addition, it is sometimes stated that the AIC is equivalent to the $C_p$ statistic [[1](#ZHT2007)]. In a strict sense, however, it is equivalent only up to some constant and a multiplicative factor.

Finally, we mentioned above that $\sigma^2$ is an estimate of the noise variance. In [LassoLarsIC](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsIC.html#sklearn.linear_model.LassoLarsIC), when the parameter `noise_variance` is not provided (default), the noise variance is estimated via the unbiased estimator [[2](#CVY2003)] defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p}$$

where $p$ is the number of features and $\hat{y}_i$ is the predicted target using an ordinary least squares regression. Note that this formula is valid only when `n_samples > n_features`.
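
The formulas in this section can be written out directly in NumPy, as in the sketch below. The helper names `noise_variance_ols` and `gaussian_information_criterion`, the synthetic data, and the intercept-free least squares fit are assumptions for illustration. The code follows the equations as stated here and is not a copy of the scikit-learn implementation.

```python
import numpy as np


def noise_variance_ols(X, y):
    """Estimate sigma^2 = RSS / (n - p) from an ordinary least squares fit."""
    n_samples, n_features = X.shape
    if n_samples <= n_features:
        raise ValueError("The estimate requires n_samples > n_features.")
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return residuals @ residuals / (n_samples - n_features)


def gaussian_information_criterion(y, y_pred, n_params, sigma2, criterion="aic"):
    """Compute n*log(2*pi*sigma^2) + RSS/sigma^2 + factor*d for AIC or BIC."""
    n_samples = y.shape[0]
    rss = np.sum((y - y_pred) ** 2)
    factor = 2.0 if criterion == "aic" else np.log(n_samples)
    return n_samples * np.log(2 * np.pi * sigma2) + rss / sigma2 + factor * n_params


# Synthetic data (illustrative): true noise variance is 0.5 ** 2 = 0.25.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + 0.5 * rng.normal(size=200)

sigma2 = noise_variance_ols(X, y)
y_pred = X @ w  # predictions of a candidate model with d = 3 parameters
aic = gaussian_information_criterion(y, y_pred, 3, sigma2, criterion="aic")
bic = gaussian_information_criterion(y, y_pred, 3, sigma2, criterion="bic")
print(sigma2, aic, bic)
```

For the same model, the two criteria differ only in the penalty term, so `bic - aic` equals $(\log(N) - 2)d$.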

**References**

1.  <a id='ZHT2007'></a>Zou, Hui, Trevor Hastie, and Robert Tibshirani. *“On the degrees of freedom of the lasso.”* The Annals of Statistics 35.5 (2007): 2173-2192 ([Paper](https://arxiv.org/abs/0712.0881.pdf))
2.  <a id='CVY2003'></a>Cherkassky, Vladimir, and Yunqian Ma. *“Comparison of model selection for regression.”* Neural computation 15.7 (2003): 1691-1714 ([Paper](https://doi.org/10.1162/089976603321891864))

##### 1.1.3.1.4. Comparison with the regularization parameter of SVM

The equivalence between `alpha` and the regularization parameter of SVM, `C`, is given by `alpha = 1 / C` or `alpha = 1 / (n_samples * C)`, depending on the estimator and the exact objective function optimized by the model.

### 1.1.4. Multi-task Lasso