## 1.1. Linear Models

The following is a set of methods intended for regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value:

$$\hat{y}(w, x) = w_0 + w_1x_1 + \dots + w_px_p$$

Across the module, we designate the vector $w = (w_1, \dots, w_p)$ as `coef_` and $w_0$ as `intercept_`.
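As a minimal sketch of this notation, the prediction can be reproduced by hand from the two fitted attributes. The toy data below is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): 4 samples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.0, 4.0, 7.0])

reg = LinearRegression().fit(X, y)

# y_hat = w_0 + w_1 * x_1 + ... + w_p * x_p, written with the fitted attributes.
y_hat_manual = reg.intercept_ + X @ reg.coef_

# The manual formula should agree with reg.predict up to floating-point error.
print(np.allclose(y_hat_manual, reg.predict(X)))
```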

To perform classification with generalized linear models, see **Logistic Regression**.

### 1.1.1. Ordinary Least Squares

[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) fits a linear model with coefficients $w = (w_1, \dots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Mathematically it solves a problem of the form:

$$\min_w \lVert Xw - y \rVert_2^2$$
<center><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_ols_001.png"/></center>

[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) will take in its `fit` method arrays `X`, `y` and will store the coefficients $w$ of the linear model in its `coef_` member:

```python
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
```

```
array([0.5, 0.5])
```
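The two features in this example are identical, so infinitely many coefficient vectors give the same fit. The sketch below uses plain NumPy rather than scikit-learn, and assumes the intercept is handled by centering the data; under that assumption, the minimum-norm least-squares solution reproduces the coefficients shown.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])

# Center the data so that the intercept can be recovered afterwards.
X_mean, y_mean = X.mean(axis=0), y.mean()

# np.linalg.lstsq returns the minimum-norm minimizer of ||Xw - y||_2^2
# when X is rank deficient, as it is here.
w, _, rank, _ = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)
w0 = y_mean - X_mean @ w

print(w, w0, rank)
```

The reported rank is lower than the number of features, which is a direct sign that the columns of $X$ are linearly dependent.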

The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix $X$ have an approximately linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of _**multicollinearity**_ can arise, for example, when data are collected without an experimental design.
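The effect of multicollinearity can be illustrated with a small simulation. This is a sketch on synthetic data: the sample size, noise level, and the `coef_spread` helper are all choices made for this example, not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

x1 = rng.normal(size=n_samples)
x_independent = rng.normal(size=n_samples)
x_near_copy = x1 + 1e-4 * rng.normal(size=n_samples)  # almost equal to x1

X_indep = np.column_stack([x1, x_independent])
X_collinear = np.column_stack([x1, x_near_copy])

# A large condition number signals a design matrix that is close to singular.
cond_indep = np.linalg.cond(X_indep)
cond_collinear = np.linalg.cond(X_collinear)


def coef_spread(X, n_repeats=200, noise=0.1):
    """Standard deviation of OLS coefficients over repeated noisy targets."""
    coefs = []
    for _ in range(n_repeats):
        y = X @ np.array([1.0, 1.0]) + noise * rng.normal(size=X.shape[0])
        coefs.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(coefs, axis=0)


spread_indep = coef_spread(X_indep)
spread_collinear = coef_spread(X_collinear)

print(cond_indep, cond_collinear)
print(spread_indep, spread_collinear)
```

With the same noise level in the target, the coefficients fitted on the nearly collinear design vary far more from one draw to the next than those fitted on the independent design.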

#### 1.1.1.1. Non-Negative Least Squares

It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods).
[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) accepts a boolean `positive` parameter: when set to `True`, [Non-Negative Least Squares](https://en.wikipedia.org/wiki/Non-negative_least_squares) is applied.
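A short sketch of the `positive` parameter on synthetic data follows. The ground-truth coefficients are invented for this example and deliberately include a negative one, so that the constraint has a visible effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Illustrative ground truth with one negative coefficient.
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

reg_ols = LinearRegression().fit(X, y)
reg_nn = LinearRegression(positive=True).fit(X, y)

print(reg_ols.coef_)  # unconstrained: the second coefficient is free to be negative
print(reg_nn.coef_)   # constrained: every coefficient is >= 0
```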

In mathematical optimization, the problem of **non-negative least squares (NNLS)** is a type of constrained least squares problem where the coefficients are not allowed to become negative. That is, given a matrix $A$ and a (column) vector of response variables $y$, the goal is to find:

$$\text{arg}\min_x \lVert Ax - y\rVert_2^2 \quad \text{subject to } x \geq 0$$

Here $x \geq 0$ means that each component of the vector $x$ should be non-negative, and $\lVert \cdot \rVert_2$ denotes the Euclidean norm.
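The problem can be solved directly with `scipy.optimize.nnls`, which returns the solution together with the residual norm $\lVert Ax - y \rVert_2$. The small matrix below is an illustrative choice whose unconstrained least-squares solution has a negative component.

```python
import numpy as np
from scipy.optimize import nnls

A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([2.0, 1.0, -1.0])

# Unconstrained least squares, for comparison: one component is negative.
x_unconstrained = np.linalg.lstsq(A, y, rcond=None)[0]

# NNLS: the same objective, restricted to x >= 0.
x_nn, residual_norm = nnls(A, y)

print(x_unconstrained)
print(x_nn, residual_norm)
```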

Non-negative least squares problems turn up as subproblems in matrix decomposition, e.g. in algorithms for PARAFAC and non-negative matrix/tensor factorization. The latter can be considered a generalization of NNLS.

Another generalization of NNLS is **bounded-variable least squares** (BVLS), with simultaneous upper and lower bounds $\alpha_i \leq x_i \leq \beta_i$.
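As a sketch of the bounded-variable case, SciPy's `scipy.optimize.lsq_linear` accepts per-coefficient bounds and offers a `"bvls"` method. The matrix and the bounds below are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.optimize import lsq_linear

A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([2.0, 1.0, -1.0])

# Illustrative bounds: alpha_i <= x_i <= beta_i for each coefficient.
lower = np.array([0.0, -1.0])
upper = np.array([1.0, 1.0])

result = lsq_linear(A, y, bounds=(lower, upper), method="bvls")
x_bvls = result.x

print(x_bvls, result.success)
```

Setting the lower bounds to zero and the upper bounds to infinity recovers the NNLS problem as a special case.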

**Quadratic programming version**

The NNLS problem is equivalent to a quadratic programming problem:

$$\text{arg}\min_{x \geq 0} (\frac{1}{2}x^TQx+c^Tx)$$

where $Q = A^TA$ and $c = -A^Ty$. This problem is convex, as $Q$ is positive semidefinite and the non-negativity constraints form a convex feasible set.
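The equivalence can be checked numerically. Expanding $\frac{1}{2}\lVert Ax - y\rVert_2^2$ gives the quadratic objective plus the constant $\frac{1}{2}y^Ty$, so both problems share the same minimizer. The sketch below verifies this identity at a random non-negative point, using random data generated only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
y = rng.normal(size=6)
x = rng.uniform(0.0, 1.0, size=3)  # any non-negative point

Q = A.T @ A
c = -A.T @ y

qp_objective = 0.5 * x @ Q @ x + c @ x
ls_objective = np.sum((A @ x - y) ** 2)

# The QP objective equals half the least-squares objective minus y^T y / 2.
gap = qp_objective - (0.5 * ls_objective - 0.5 * y @ y)

# Q = A^T A is positive semidefinite: its eigenvalues are non-negative
# (up to rounding error).
min_eigenvalue = np.linalg.eigvalsh(Q).min()

print(gap, min_eigenvalue)
```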

#### 1.1.1.2. Ordinary Least Squares Complexity

The least squares solution is computed using the singular value decomposition of $X$. If $X$ is a matrix of shape `(n_samples, n_features)`, this method has a cost of $O(n_{\text{samples}} n_{\text{features}}^2)$, assuming that $n_{\text{samples}} \geq n_{\text{features}}$.
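To make the role of the SVD concrete, the following sketch solves a least-squares problem from the thin SVD of $X$ using NumPy. It illustrates the technique only, on random data, and is not meant to mirror the library's internal solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Thin SVD: X = U @ diag(s) @ Vt, with U of shape (n_samples, n_features).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# w = V @ diag(1/s) @ U^T @ y. X has full column rank here, so no
# singular value is close to zero and the division is safe.
w_svd = Vt.T @ ((U.T @ y) / s)

# Reference solution from NumPy's least-squares routine.
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_svd, w_ref))
```

Computing the thin SVD of a tall matrix dominates the cost, which is where the quadratic dependence on the number of features comes from.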

### 1.1.2. Ridge Regression and Classification