$\textbf{Linear Models}$

We consider regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value, then

$\hat{y} (w, x) = w_0 + w_1 x_1 + \ldots + w_p x_p$.

We designate the vector $w = (w_1, \ldots, w_p)$ as ${\rm coef\_}$,
and $w_0$ as ${\rm intercept\_}$.

$\textbf{Ordinary Least Squares}$

LinearRegression fits a linear model with coefficients $w = (w_1, \ldots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Mathematically it solves a problem of the form:

$\min_{w} || X w - y ||_{2}^{2}$.

LinearRegression will take in its fit method arrays $X$, $y$ and will store the coefficients $w$ of the linear model in its ${\rm coef\_}$ member:

In [3]:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_

array([0.5, 0.5])

In this example we had $X = \begin{pmatrix}
0 & 0 \\
1 & 1 \\
2 & 2
\end{pmatrix}$ and $y = \begin{pmatrix}
0 \\
1 \\
2
\end{pmatrix}$, which gave $w = \begin{pmatrix}
1/2 \\
1/2
\end{pmatrix}$ from our linear regression model.

The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix $X$ have an approximately linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
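The sensitivity described above can be illustrated numerically. The following is a minimal sketch on synthetic data of our own choosing (the noise levels, sample size and seed are assumptions, not values from the text): two nearly identical features are generated, the target noise is redrawn many times, and the spread of the fitted coefficients is compared with the spread of their sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two features that are almost copies of each other (assumed toy data).
x1 = rng.normal(size=n_samples)
x2 = x1 + 1e-3 * rng.normal(size=n_samples)
X = np.column_stack([x1, x2])
w_true = np.array([1.0, 1.0])

# A large condition number signals a design matrix close to singular.
cond = np.linalg.cond(X)
print("condition number:", cond)

# Refit ordinary least squares on many noisy copies of the target.
coefs = []
for _ in range(200):
    y = X @ w_true + 0.1 * rng.normal(size=n_samples)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(w_hat)
coefs = np.array(coefs)

coef_std = coefs.std(axis=0)        # spread of each individual coefficient
sum_std = coefs.sum(axis=1).std()   # spread of w_1 + w_2
print("std of each coefficient:", coef_std)
print("std of their sum:", sum_std)
```

The individual coefficients should vary far more than their sum: the data pin down the combined effect of the two features well, but not how it is split between them.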

It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods). LinearRegression accepts a boolean positive parameter: when it is set to True, Non-Negative Least Squares is applied.
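As a brief illustration of the positive parameter, the sketch below fits the same synthetic data (an assumption of ours, built so that one true coefficient is negative) with and without the constraint.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# Assumed toy target: the second true coefficient is negative.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1]

ols = LinearRegression().fit(X, y)
nnls = LinearRegression(positive=True).fit(X, y)

print("unconstrained:", ols.coef_)   # second coefficient is negative
print("non-negative: ", nnls.coef_)  # every coefficient is >= 0
```

The constrained fit cannot reproduce the negative effect, so it trades some accuracy for coefficients that respect the sign restriction.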

The least squares solution is computed using the singular value decomposition of $X$. If $X$ is a matrix of shape ($n_{\rm samples}$, $n_{\rm features}$) this method has a cost of $\mathcal{O} (n_{\rm samples} n_{\rm features}^{2})$ assuming that $n_{\rm samples} \geq n_{\rm features}$.
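To show how a singular value decomposition yields the solution, here is a minimal NumPy sketch applied to the small example from above. It is not the library's code: centring the data to recover the intercept and the tolerance used to discard zero singular values are assumptions of this sketch.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])

# Centre the data so the intercept can be recovered afterwards.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Thin SVD: Xc = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Invert only the singular values that are numerically non-zero.
tol = s.max() * max(Xc.shape) * np.finfo(float).eps
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

# Minimum-norm least-squares solution w = V diag(1/s) U^T y.
w = Vt.T @ (s_inv * (U.T @ yc))
intercept = y_mean - x_mean @ w
print(w, intercept)
```

Because the two columns of $X$ are identical, infinitely many coefficient vectors fit the data equally well. Dropping the zero singular value selects the one with the smallest norm, which splits the weight evenly and gives approximately $(1/2, 1/2)$.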

$\textbf{Ridge Regression and Classification}$

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$\min_{w} || X w - y ||_{2}^{2} + \alpha || w ||_{2}^{2}$.

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.

As with other linear models, Ridge will take in its fit method arrays $X$, $y$ and will store the coefficients $w$ of the linear model in its ${\rm coef\_}$ member:

In [6]:
from sklearn import linear_model

reg = linear_model.Ridge(alpha=.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.coef_)
print(reg.intercept_)

[0.34545455 0.34545455]
0.13636363636363638
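These numbers can be checked against the closed-form ridge solution $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below assumes that the intercept is handled by centring $X$ and $y$ and is not penalized; it is a hand computation in NumPy, not the solver that Ridge uses.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.1, 1.0])
alpha = 0.5

# Centre the data; the intercept is recovered from the means afterwards.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Solve the regularized normal equations (Xc^T Xc + alpha I) w = Xc^T yc.
A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
w = np.linalg.solve(A, Xc.T @ yc)
intercept = y_mean - x_mean @ w

print(w)
print(intercept)
```

Under these assumptions the result agrees with the coefficients and intercept printed above, up to floating-point rounding.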


The Ridge regressor has a classifier variant: RidgeClassifier. This classifier first converts binary targets to $\{-1, 1\}$ and then treats the problem as a regression task, optimizing the same objective as above. The predicted class corresponds to the sign of the regressor’s prediction. For multiclass classification, the problem is treated as multi-output regression, and the predicted class corresponds to the output with the highest value.
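The binary case can be reproduced by hand. In this sketch (synthetic two-cluster data and a 0/1 labelling are our own assumptions), a plain Ridge regressor is fitted on targets recoded to $\{-1, 1\}$ and the sign of its prediction is compared with the output of RidgeClassifier.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeClassifier

rng = np.random.default_rng(0)
# Assumed toy data: two Gaussian clusters labelled 0 and 1.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(40, 2)),
    rng.normal(1.0, 1.0, size=(40, 2)),
])
labels = np.array([0] * 40 + [1] * 40)

clf = RidgeClassifier(alpha=1.0).fit(X, labels)

# Manual version: regress on -1/+1 targets, then threshold at zero.
targets = np.where(labels == 1, 1.0, -1.0)
reg = Ridge(alpha=1.0).fit(X, targets)
manual_pred = (reg.predict(X) > 0).astype(int)

agreement = np.mean(manual_pred == clf.predict(X))
print("fraction of matching predictions:", agreement)
```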

It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge losses. However, in practice, all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the RidgeClassifier allows for a very different choice of the numerical solvers with distinct computational performance profiles.

The RidgeClassifier can be significantly faster than e.g. LogisticRegression with a high number of classes because it can compute the projection matrix $(X^T X)^{-1} X^T$ only once.

This classifier is sometimes referred to as a Least Squares Support Vector Machine with a linear kernel.

RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Leave-One-Out Cross-Validation:

In [8]:
import numpy as np
from sklearn import linear_model

reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
print(reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]))
print(reg.alpha_)

RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
0.01
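To make the leave-one-out selection concrete, the sketch below repeats it with an explicit loop in NumPy. This brute-force version is an assumption-laden illustration: it refits a closed-form ridge model for each held-out sample and scores by mean squared error, which is not how RidgeCV computes the result internally. The helper names ridge_fit and loo_error are hypothetical.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.1, 1.0])
alphas = np.logspace(-6, 6, 13)

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept (via centring)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    return w, y_mean - x_mean @ w

def loo_error(X, y, alpha):
    """Mean squared error when each sample is held out in turn."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = ridge_fit(X[keep], y[keep], alpha)
        errors.append((y[i] - (X[i] @ w + b)) ** 2)
    return np.mean(errors)

scores = [loo_error(X, y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print(best_alpha)
```

On this data the loop selects the same value as the alpha_ attribute printed above.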


$\textbf{Lasso}$

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).

Mathematically, it consists of a linear model with an added regularization term. The objective function to minimize is:

$\min_{w} \frac{1}{2 n_{\rm samples}} || X w - y ||_{2}^{2} + \alpha || w ||_{1}$

The lasso estimate thus solves the minimization of the least-squares penalty with $\alpha || w ||_{1}$ added, where $\alpha$ is a constant and $|| w ||_{1}$ is the $\ell_{1}$-norm of the coefficient vector.

The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least Angle Regression for another implementation:

In [10]:
from sklearn import linear_model

reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
# Get the values of the linear regression
print(reg.coef_)
print(reg.intercept_)
# Predict what y is for a given input value x
print(reg.predict([[1, 1]]))

[0.6 0. ]
0.2
[0.8]
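The coordinate descent idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the library's implementation: it assumes a centred problem with an unpenalized intercept, cycles over the features in order, and runs a fixed number of sweeps instead of checking convergence. The function names soft_threshold and lasso_cd are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=100):
    """Cyclic coordinate descent for the Lasso objective given above."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            col_sq = Xc[:, j] @ Xc[:, j]
            if col_sq == 0.0:
                continue  # a constant feature carries no information
            # Residual with the contribution of feature j removed.
            r_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r_j
            # Exact minimizer of the objective in coordinate j.
            w[j] = soft_threshold(rho, n_samples * alpha) / col_sq
    return w, y_mean - x_mean @ w

w, intercept = lasso_cd([[0, 0], [1, 1]], [0, 1], alpha=0.1)
print(w, intercept)
```

On the two-sample example the sketch gives coefficients of approximately $(0.6, 0)$ and an intercept of approximately $0.2$. The first feature absorbs all the weight because it is updated first; once it has been fitted, the remaining correlation of the second feature does not exceed the threshold.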

$\textbf{Multi-task Lasso}$

The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: $y$ is a 2D array, of shape ($n_{\rm samples}$, $n_{\rm tasks}$). The constraint is that the selected features are the same for all the regression problems, also called tasks.
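A short sketch of the shared-support constraint follows. The synthetic data, the number of tasks and the value of alpha are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_samples, n_features, n_tasks = 60, 8, 3

# Assumed toy data: only the first two features influence the tasks.
X = rng.normal(size=(n_samples, n_features))
W_true = np.zeros((n_tasks, n_features))
W_true[:, :2] = rng.normal(loc=3.0, size=(n_tasks, 2))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n_samples, n_tasks))

reg = MultiTaskLasso(alpha=0.5).fit(X, Y)

# coef_ has one row per task and one column per feature.
print(reg.coef_.shape)

# A feature is either used by every task or by none of them.
support = reg.coef_ != 0
print(support)
```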

$\textbf{Elastic-Net}$

ElasticNet is a linear regression model trained with both $\ell_1$ and $\ell_2$-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of $\ell_1$ and $\ell_2$ using the l1_ratio parameter.

Elastic-net is useful when there are multiple features that are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.

A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s stability under rotation.

The objective function to minimize is in this case

$\min_{w} \frac{1}{2 n_{\rm samples}} || X w - y ||_{2}^{2} + \alpha \rho || w ||_{1} + \frac{1}{2} \alpha (1 - \rho) || w ||_{2}^{2}$

The class ElasticNetCV can be used to set the parameters alpha ($\alpha$) and l1_ratio ($\rho$) by cross-validation.
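The different treatment of correlated features can be seen on the two-sample example used for Lasso, where both features are identical. The values alpha=0.1 and l1_ratio=0.5 are arbitrary choices made for this sketch.

```python
from sklearn.linear_model import ElasticNet, Lasso

# Two perfectly correlated (identical) features.
X = [[0, 0], [1, 1]]
y = [0, 1]

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso:      ", lasso.coef_)
print("Elastic-Net:", enet.coef_)
```

Lasso should place all the weight on a single feature, whereas the $\ell_2$ part of the Elastic-Net penalty makes the objective strictly convex and spreads the weight over both.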

$\textbf{LARS Lasso}$

LassoLars is a lasso model implemented using the LARS (Least Angle Regression) algorithm, and unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

In [12]:
from sklearn import linear_model
reg = linear_model.LassoLars(alpha=.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
reg.coef_

array([0.6, 0. ])

$\textbf{Orthogonal Matching Pursuit (OMP)}$

OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e., the $\ell_0$ pseudo-norm).

Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

$\min_{w} || X w - y ||_{2}^{2}$ subject to $|| w ||_0 \leq n_{\rm nonzero\_coefs}$

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

$\min_{w} || w ||_0 $ subject to $|| X w - y ||_{2}^{2} \leq {\rm tolerance}$

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
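The greedy procedure can be written down in a few lines. The following NumPy sketch is our own simplified version, not the library routine: it has no intercept, no tolerance-based stopping rule, and assumes at least one non-zero coefficient is requested. The function name omp is hypothetical, and the dictionary is given orthonormal columns so that recovery is guaranteed in this demonstration.

```python
import numpy as np

def omp(X, y, n_nonzero_coefs):
    """Greedy orthogonal matching pursuit (assumes n_nonzero_coefs >= 1)."""
    residual = y.copy()
    support = []
    for _ in range(n_nonzero_coefs):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(X.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick an atom twice
        support.append(int(np.argmax(correlations)))
        # Orthogonal projection of y onto the chosen atoms.
        w_support, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ w_support
    w = np.zeros(X.shape[1])
    w[support] = w_support
    return w

rng = np.random.default_rng(0)
# Assumed toy dictionary with orthonormal columns.
Q, _ = np.linalg.qr(rng.normal(size=(50, 20)))
w_true = np.zeros(20)
w_true[[3, 7, 12]] = [2.0, -1.5, 1.0]
y = Q @ w_true

w_hat = omp(Q, y, n_nonzero_coefs=3)
print(np.flatnonzero(w_hat))
```

Refitting all selected coefficients at every step is what distinguishes this from plain matching pursuit, which would only update the coefficient of the newly chosen atom.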

$\textbf{Bayesian Regression}$

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.

This can be done by introducing uninformative priors over the hyperparameters of the model. The $\ell_2$ regularization used in Ridge regression and classification is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the coefficients $w$ with precision $\lambda$ (that is, variance $\lambda^{-1}$). Instead of setting $\lambda$ manually, it is possible to treat it as a random variable to be estimated from the data.

To obtain a fully probabilistic model, the output $y$ is assumed to be Gaussian distributed around $Xw$:

$p(y | X, w, \alpha) = \mathcal{N} (y | Xw, \alpha^{-1})$

where $\alpha$ is again treated as a random variable that is to be estimated from the data.
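The link with Ridge can be made explicit by a short derivation. This is our own sketch; we write $\alpha_{\rm ridge}$ for the Ridge penalty to distinguish it from the noise precision $\alpha$ used here. Combining the Gaussian likelihood above with a zero-mean spherical Gaussian prior $\mathcal{N} (w | 0, \lambda^{-1} \mathbf{I}_{p})$ gives, up to terms that do not depend on $w$,

$-\log p(w | X, y, \alpha, \lambda) = \frac{\alpha}{2} || X w - y ||_{2}^{2} + \frac{\lambda}{2} || w ||_{2}^{2} + {\rm const}$.

Dividing by $\alpha / 2$ shows that the maximum a posteriori estimate minimizes

$|| X w - y ||_{2}^{2} + \frac{\lambda}{\alpha} || w ||_{2}^{2}$,

which is the Ridge objective with $\alpha_{\rm ridge} = \lambda / \alpha$.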

The advantages of Bayesian Regression are:

- It adapts to the data at hand.
- It can be used to include regularization parameters in the estimation procedure.

The disadvantages of Bayesian regression include:

- Inference of the model can be time consuming.

BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the coefficient $w$ is given by a spherical Gaussian:

$p (w | \lambda) = \mathcal{N} (w | 0, \lambda^{-1} \mathbf{I}_{p})$

The priors over $\alpha$ and $\lambda$ are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian. The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge.

The parameters $w$, $\lambda$ and $\alpha$ are estimated jointly during the fit of the model, the regularization parameters $\alpha$ and $\lambda$ being estimated by maximizing the log marginal likelihood. The scikit-learn implementation is based on the algorithm described in Appendix A of (Tipping, 2001) where the update of the parameters $\alpha$ and $\lambda$ is done as suggested in (MacKay, 1992). The initial value of the maximization procedure can be set with the hyperparameters alpha_init and lambda_init.

There are four more hyperparameters, $\alpha_1$, $\alpha_2$, $\lambda_1$ and $\lambda_2$ of the gamma prior distributions over $\alpha$ and $\lambda$. These are usually chosen to be non-informative. By default $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.



In [13]:
from sklearn import linear_model

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = linear_model.BayesianRidge()
reg.fit(X, Y)

In [14]:
reg.predict([[1, 0.]])

array([0.50000013])

In [15]:
reg.coef_

array([0.49999993, 0.49999993])
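Since the regularization parameters are estimated during the fit, they can be inspected afterwards, and predictions can be returned together with a standard deviation. The sketch below reuses the data from above; the printed values are not reproduced here because they depend on the library version.

```python
from sklearn.linear_model import BayesianRidge

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = BayesianRidge().fit(X, Y)

# Estimated precision of the noise and of the weights.
print(reg.alpha_)
print(reg.lambda_)

# Predictive mean and standard deviation for a new point.
y_mean, y_std = reg.predict([[1, 0.]], return_std=True)
print(y_mean, y_std)
```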