## 1.1. Linear Models

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value:

$$\hat{y}(w, x) = w_0 + w_1x_1 + \dots + w_px_p$$

Across the module, we designate the vector $w = (w_1, \dots, w_p)$ as `coef_` and $w_0$ as `intercept_`.
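As a minimal sketch of this notation, the snippet below fits a `LinearRegression` on made-up toy data (the arrays are illustrative assumptions, not from the guide) and checks that `intercept_ + X @ coef_` reproduces what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: y = 1 + 2*x1 + 3*x2 exactly
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [4.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

reg = LinearRegression().fit(X, y)

# Manual prediction: w_0 + w_1 * x_1 + ... + w_p * x_p
manual = reg.intercept_ + X @ reg.coef_
same_as_predict = np.allclose(manual, reg.predict(X))
```

Here `coef_` holds $(w_1, \dots, w_p)$ and `intercept_` holds $w_0$, so the manual expression is just the formula above written with arrays.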

To perform classification with generalized linear models, see **Logistic Regression**.

### 1.1.1. Ordinary Least Squares

[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) fits a linear model with coefficients $w = (w_1, \dots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Mathematically it solves a problem of the form:

$$\min_w \lVert Xw - y \rVert_2^2$$
<center><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_ols_001.png" /></center>

[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) will take in its `fit` method arrays `X`, `y` and will store the coefficients $w$ of the linear model in its `coef_` member:

In [1]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_



array([0.5, 0.5])

The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix $X$ have an approximately linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of _**multicollinearity**_ can arise, for example, when data are collected without an experimental design.
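The effect of multicollinearity can be illustrated with a small simulation. This is only a sketch under assumed settings (synthetic Gaussian features, a noise level of 0.1, and a hypothetical helper `coef_spread`): it compares how much the fitted coefficients fluctuate across noisy targets for an independent design and for a nearly collinear one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)

# Two designs: independent columns vs. a second column that almost copies the first
X_indep = np.column_stack([x1, rng.normal(size=n)])
X_collin = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])

def coef_spread(X, n_repeats=200):
    """Standard deviation of the OLS coefficients over repeated noisy targets."""
    coefs = []
    for _ in range(n_repeats):
        y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)
        coefs.append(LinearRegression().fit(X, y).coef_)
    return np.std(coefs, axis=0)

spread_indep = coef_spread(X_indep)
spread_collin = coef_spread(X_collin)

# The condition number measures how close the design matrix is to singular
cond_indep = np.linalg.cond(X_indep)
cond_collin = np.linalg.cond(X_collin)
```

In this setup the nearly collinear design should show both a much larger condition number and a much larger coefficient spread, which is the variance inflation described above.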

#### 1.1.1.1. Non-Negative Least Squares

It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods).
[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) accepts a boolean `positive` parameter: when set to `True`, [Non-Negative Least Squares](https://en.wikipedia.org/wiki/Non-negative_least_squares) is applied.

In mathematical optimization, the problem of **non-negative least squares (NNLS)** is a type of constrained least squares problem where the coefficients are not allowed to become negative. That is, given a matrix $A$ and a (column) vector of response variables $y$, the goal is to find:

$$\text{arg}\min_x \lVert Ax - y\rVert_2^2 \quad \text{subject to } x \geq 0$$

Here $x \geq 0$ means that each component of the vector $x$ should be non-negative, and $\lVert \cdot \rVert_2$ denotes the Euclidean norm.
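A short sketch of the NNLS problem in practice, assuming synthetic data and no intercept term: it solves the same problem with `scipy.optimize.nnls` and with `LinearRegression(positive=True, fit_intercept=False)`, so the two solutions can be compared directly. The matrix `A`, the vector `true_x`, and the noise level are made up for illustration.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4))
true_x = np.array([1.5, 0.0, 2.0, 0.0])  # non-negative by construction
y = A @ true_x + 0.01 * rng.normal(size=30)

# SciPy solves: minimize ||A x - y||_2 subject to x >= 0
x_scipy, residual_norm = nnls(A, y)

# Same constraint through scikit-learn; the intercept is disabled so that
# the problem matches the formulation above exactly
reg = LinearRegression(positive=True, fit_intercept=False).fit(A, y)
x_sklearn = reg.coef_
```

Both solutions should have only non-negative components and agree up to numerical tolerance.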

Non-negative least squares problems turn up as subproblems in matrix decomposition, e.g. in algorithms for PARAFAC and non-negative matrix/tensor factorization. The latter can be considered a generalization of NNLS.

Another generalization of NNLS is **bounded-variable least squares** (BVLS), with simultaneous upper and lower bounds $\alpha_i \leq x_i \leq \beta_i$.

**Quadratic programming version**

The NNLS problem is equivalent to a quadratic programming problem:

$$\text{arg}\min_{x \geq 0} \left(\frac{1}{2}x^TQx+c^Tx\right)$$

where $Q = A^TA$ and $c = -A^Ty$. This problem is convex, as $Q$ is positive semidefinite and the non-negativity constraints form a convex feasible set.
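The equivalence follows from expanding $\frac{1}{2}\lVert Ax - y\rVert_2^2 = \frac{1}{2}x^TA^TAx - y^TAx + \frac{1}{2}y^Ty$: the last term does not depend on $x$, and scaling by $\frac{1}{2}$ does not move the minimizer. The sketch below checks this numerically on random data (the arrays and the helper names `least_squares_objective` and `qp_objective` are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = rng.normal(size=20)

Q = A.T @ A
c = -A.T @ y

def least_squares_objective(x):
    # Half the squared Euclidean norm of the residual
    return 0.5 * np.sum((A @ x - y) ** 2)

def qp_objective(x):
    # Quadratic programming form: 1/2 x^T Q x + c^T x
    return 0.5 * x @ Q @ x + c @ x

# At any feasible point the two objectives differ by the constant 1/2 y^T y
x = np.abs(rng.normal(size=3))
gap = least_squares_objective(x) - qp_objective(x)

# Q = A^T A is positive semidefinite, so its smallest eigenvalue is >= 0
min_eig = np.linalg.eigvalsh(Q).min()
```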

#### 1.1.1.2. Ordinary Least Squares Complexity

The least squares solution is computed using the singular value decomposition of $X$. If $X$ is a matrix of shape `(n_samples, n_features)`, this method has a cost of $O(n_{samples}n^2_{features})$, assuming that $n_{samples} \geq n_{features}$.
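To make the role of the singular value decomposition concrete, here is a minimal sketch that solves the least squares problem by hand from the thin SVD $X = U \Sigma V^T$, giving $w = V \Sigma^{-1} U^T y$, and compares it with `LinearRegression`. It assumes synthetic full-rank data and no intercept; it is an illustration of the idea, not a description of the library's internal code path.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Least squares solution w = V @ diag(1/s) @ U^T @ y (valid when X has full column rank)
w_svd = Vt.T @ ((U.T @ y) / s)

reg = LinearRegression(fit_intercept=False).fit(X, y)
```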

### 1.1.2. Ridge Regression and Classification

#### 1.1.2.1. Regression

[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) regression addresses some of the problems of [Ordinary Least Squares](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$$\min_w \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
<center><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_ridge_path_001.png" /></center>
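The shrinkage effect can be sketched with the standard closed-form ridge solution $w = (X^TX + \alpha I)^{-1}X^Ty$, which minimizes the residual sum of squares plus an $\alpha \lVert w \rVert_2^2$ penalty when no intercept is fitted. The data, the list of `alphas`, and the helper `ridge_closed_form` below are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.1, 1.0, 10.0, 100.0]
closed_form_coefs = []
sklearn_coefs = []
for alpha in alphas:
    closed_form_coefs.append(ridge_closed_form(X, y, alpha))
    sklearn_coefs.append(Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_)

# Coefficient norm for each alpha; it shrinks as alpha grows
norms = [np.linalg.norm(w) for w in sklearn_coefs]
```

The entries of `norms` should decrease as `alpha` increases, matching the statement that larger values of $\alpha$ give more shrinkage.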

As with other linear models, [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) will take in its `fit` method arrays `X`, `y` and will store the coefficients $w$ of the linear model in its `coef_` member:

In [1]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
reg.coef_
reg.intercept_



0.1363636363636364

Note that the [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) class lets the user have the solver chosen automatically by setting `solver="auto"`. When this option is specified, [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) chooses between the `"lbfgs"`, `"cholesky"`, and `"sparse_cg"` solvers. It checks the conditions in the following table from top to bottom and selects the solver for the first condition that is true.

| Solver      | Condition                                  |
| ----------- | -------------------------------------------|
| 'lbfgs'     | The `positive=True` option is specified.   |
| 'cholesky'  | The input array X is not sparse.           |
| 'sparse_cg' | None of the above conditions are fulfilled.|

#### 1.1.2.2. Classification

The [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) regressor has a classifier variant: [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier). This classifier first converts binary targets to `{-1, 1}` and then treats the problem as a regression task, optimizing the same objective as above. The predicted class corresponds to the sign of the regressor’s prediction. For multiclass classification, the problem is treated as multi-output regression, and the predicted class corresponds to the output with the highest value.
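The binary case can be sketched by hand: recode the labels to `{-1, 1}`, fit a plain `Ridge` regressor with the same `alpha`, and take the sign of its prediction. The snippet below does this on synthetic data (an assumption for illustration) and compares the result with `RidgeClassifier`.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
# Binary labels in {0, 1} derived from a noisy linear rule
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=60) > 0).astype(int)

clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Manual version: map labels to {-1, 1}, regress, then threshold at zero
y_signed = np.where(y == 1, 1.0, -1.0)
reg = Ridge(alpha=1.0).fit(X, y_signed)
manual_pred = (reg.predict(X) > 0).astype(int)
```

If the description above holds, `manual_pred` matches `clf.predict(X)` and the regressor's coefficients match `clf.coef_` up to its extra leading dimension.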

It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge losses. However, in practice, all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier) allows for a very different choice of the numerical solvers with distinct computational performance profiles.

The [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier) can be significantly faster than e.g. [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) with a high number of classes because it can compute the projection matrix $(X^TX)^{-1}X^T$ only once.

This classifier is sometimes referred to as a [Least Squares Support Vector Machine](https://en.wikipedia.org/wiki/Least-squares_support_vector_machine) with a linear kernel.

**Least-squares Support Vector Machines (LS-SVM)**

In statistics and statistical modeling, least-squares support-vector machines (LS-SVM) are least-squares versions of support-vector machines (SVM), a set of related supervised learning methods that analyze data and recognize patterns and that are used for classification and regression analysis.

In this version, one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem solved by classical SVMs. LS-SVMs are a class of kernel-based learning methods.

Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle.

#### 1.1.2.3. Ridge Complexity

This method has the same order of complexity as [Ordinary Least Squares](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

#### 1.1.2.4. Setting the regularization parameter: leave-one-out Cross-Validation

`RidgeCV` and `RidgeClassifierCV` implement ridge regression/classification with built-in cross-validation of the `alpha` parameter. They work in the same way as `GridSearchCV` except that they default to efficient Leave-One-Out cross-validation. When using the default cross-validation, `alpha` cannot be 0 due to the formulation used to calculate the Leave-One-Out error. See [RL2007] for details.
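A minimal usage sketch, assuming synthetic data and an arbitrary logarithmic grid of candidate values: `RidgeCV` is fitted with the default cross-validation, and the selected regularization strength is read from its `alpha_` attribute.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.5, 0.0]) + 0.5 * rng.normal(size=50)

# Candidate values for alpha; all strictly positive, as required by the
# default Leave-One-Out scheme
alphas = np.logspace(-3, 3, 13)

# With cv left at its default (None), efficient Leave-One-Out is used
reg = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = reg.alpha_
```

The fitted estimator can then be used like any other ridge model, for example through `reg.predict` or `reg.coef_`.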