# Linear Models for Regression

Let $\textbf{X} \in \mathbb{R}^{N \times p}$ be a set of input data with targets $\textbf{y} \in \mathbb{R}^{N}$, where $N$ is the number of datapoints and $p$ their dimensionality. The data therefore comes as a matrix whose rows are datapoints and whose columns are features.

Let $\textbf{x}$ be a generic variable representing any row (or datapoint) of this matrix, and $x_j$ be the $j^{th}$ column (or feature) of this datapoint.

Regression models can be used to find the coefficients $\beta_{0,..,p}$ such that the prediction

$$ \beta_0 + \sum_{j=1}^{p}\beta_{j} x_j $$

minimises the Residual Sum of Squares (RSS) over all the datapoints, possibly subject to some constraint.

In this notebook, only univariate regression is discussed.

**Note on transformations:** the input datapoints $\textbf{x}$ are often taken directly from the dataset, but this is not a requirement. It can be extremely useful to apply some function to them before passing them to the algorithms, which is one simple way of using a linear model to solve non-linear problems.
Some simple transformations are:
-  the polynomial expansion $\textbf{x} = (x_1, x_1^2, x_1^3, .., x_1^p)$
-  cross combinations $\textbf{x} = (x_1, x_2, x_3, ..., x_1 x_2, x_1 x_3, ... )$
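
As a rough illustration of the two transformations listed above, the sketch below builds the expanded feature matrices by hand. The helper names `polynomial_features` and `cross_features` are hypothetical (they do not come from any of the libraries mentioned in this notebook), and only pairwise products are generated for the cross combinations.

```julia
# Hypothetical helper: expand a vector of scalar inputs into polynomial features.
# Row i of the result is (x_i, x_i^2, ..., x_i^degree).
function polynomial_features(x::AbstractVector, degree::Integer)
    return hcat([x .^ d for d in 1:degree]...)
end

# Hypothetical helper: append all pairwise products x_j * x_k (j < k)
# to the columns of X.
function cross_features(X::AbstractMatrix)
    p = size(X, 2)
    products = [X[:, j] .* X[:, k] for j in 1:p for k in (j + 1):p]
    return isempty(products) ? X : hcat(X, products...)
end

x = [1.0, 2.0, 3.0]
P = polynomial_features(x, 3)   # 3×3 matrix with columns x, x.^2, x.^3

X = [1.0 2.0 3.0;
     4.0 5.0 6.0]
C = cross_features(X)           # 2×6 matrix: x1, x2, x3, x1*x2, x1*x3, x2*x3
```

The expanded matrix can then be passed to any of the linear models below in place of the raw inputs.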

---

## Ordinary Least Squares (OLS)
_also known as Linear Least Squares (LLS/LLSq)_


OLS is the simplest of these methods. We try to find the linear function $f(\textbf{x};\beta)$ which minimises:

$$ RSS(\beta) = \sum_{i=1}^{N} (y_i - f(\textbf{x}_i;\beta))^2 $$

where $f(\textbf{x};\beta) = \beta_0 + \sum_{j=1}^{p}\beta_{j} x_j$ and $\textbf{x}_i$ is the $i^{th}$ datapoint (row of $\textbf{X}$).

Note that this can be written in matrix form by prepending a constant $1$ to the $\textbf{x}$ vector (equivalently, adding a column of $1$s to $\textbf{X}$), such that $\textbf{x} = (1, x_1, x_2, .., x_p)$ and $\beta = (\beta_0, \beta_1, .., \beta_p)^T$, then

$$ f(\textbf{x};\beta) = \textbf{x} \beta $$

and

$$ RSS(\beta) =  \sum_{i=1}^{N} (y_i - \textbf{x}_i \beta)^2 $$
$$ \quad = (\textbf{y} - \textbf{X} \beta )^T (\textbf{y} - \textbf{X} \beta ) $$
in matrix form.

Provided $\textbf{X}^T \textbf{X}$ is invertible, the vector $\beta$ which minimises the RSS is then

$$ \hat{\beta} = ( \textbf{X}^T \textbf{X})^{-1} \textbf{X}^T\textbf{y} $$
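
For completeness, this closed form can be recovered by setting the gradient of the RSS with respect to $\beta$ to zero. The short derivation below is a sketch that assumes $\textbf{X}^T \textbf{X}$ is invertible (i.e. $\textbf{X}$ has full column rank):

$$ \frac{\partial RSS}{\partial \beta} = -2 \textbf{X}^T (\textbf{y} - \textbf{X} \beta) = 0 $$

$$ \Rightarrow \quad \textbf{X}^T \textbf{X} \beta = \textbf{X}^T \textbf{y} $$

$$ \Rightarrow \quad \hat{\beta} = ( \textbf{X}^T \textbf{X})^{-1} \textbf{X}^T\textbf{y} $$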


### Julia implementations

**Libraries:**
-  SparseRegression.jl [_git_](https://github.com/joshday/SparseRegression.jl) [_notebook_](Sparse%20Regression.ipynb)
-  MultivariateStats.jl [_git_](https://github.com/juliaStats/MultivariateStats.jl) [_notebook_](MultivariateStats.ipynb) 
-  GLM.jl

_Note: MultivariateStats.jl allows for multi-variate regression_

**Simple code:**
```julia
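# Simple (but suboptimal) implementation: explicitly inverting x'x is slower
# and less numerically stable than solving the system with x \ y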
function OLSEstimator(x,y)
    β = inv(x'*x)*(x'*y)
    return β
end
```

<img src="resources/LR_comparison_ols.png" width=600></img>
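
As an alternative to the explicit inverse used in the simple code above, Julia's backslash operator solves the least-squares problem directly. The sketch below is only illustrative: `add_intercept` and `ols_backslash` are hypothetical helper names, and the toy data is generated without noise so that the true coefficients are known.

```julia
# Hypothetical helper: prepend a column of ones so that β[1] plays the role of β_0.
add_intercept(X::AbstractMatrix) = hcat(ones(size(X, 1)), X)

# Least-squares fit via the backslash operator, which uses a QR factorisation
# for rectangular matrices instead of forming and inverting X'X explicitly.
ols_backslash(X::AbstractMatrix, y::AbstractVector) = X \ y

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
X_raw = [0.0 1.0;
         1.0 0.0;
         2.0 1.0;
         3.0 5.0;
         4.0 2.0]
y = 1.0 .+ 2.0 .* X_raw[:, 1] .- 3.0 .* X_raw[:, 2]

X = add_intercept(X_raw)
β_qr = ols_backslash(X, y)
β_normal = inv(X' * X) * (X' * y)   # same estimate via the normal equations
```

Both estimates should agree with the generating coefficients `(1, 2, -3)` up to floating-point error.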


### References:

[1] The Elements of Statistical Learning (_Ch 4.1_)                                                  
[2] https://juliaeconomics.com/2014/06/15/introductory-example-ordinary-least-squares/

---

## Ridge Regression


### Introduction
OLS estimates often suffer from high variance, which hurts prediction accuracy. Prediction accuracy can sometimes be improved through subset selection, on the assumption that the features are not all equally relevant to the estimate and therefore should not all be treated alike, as they are in _OLS_.

_Ridge regression_ tackles this by adding a penalty on the size of the coefficients, which shrinks them towards zero and trades a small increase in bias for a reduction in variance.

This also helps with interpretation: when the features are on comparable scales, the magnitude of a coefficient shows how important it is compared to the others.


### Mathematical Background

The _ridge regression_ formulation is

$$ \hat{\beta}_{ridge} = \mathop{\mathrm{arg\,min}}_\beta \big \{ \sum_{i=1}^{N}(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^{p}\beta_{j}^2 \big \} $$

where $\lambda \ge 0$ is the regularization parameter. A low $\lambda$ gives weak regularization, with $\lambda=0$ giving _OLS_ back, while a large $\lambda$ means strong regularization, shrinking the coefficients closer to 0.

The problem can be reformulated as a constrained problem

$$ \hat{\beta}_{ridge} = \mathop{\mathrm{arg\,min}}_\beta RSS(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t $$

For centred inputs, where the unpenalised intercept $\beta_0$ is estimated separately as the mean of $\textbf{y}$, the vector $\beta$ which minimises the penalised RSS is then

$$ \hat{\beta}_{ridge} = ( \textbf{X}^T \textbf{X} +\lambda \textbf{I} )^{-1} \textbf{X}^T\textbf{y} $$

**Note:** $\lambda$ is often also called $\alpha$, such as in scikit-learn models  


### Julia implementations

**Libraries:**
-  SparseRegression.jl [_git_](https://github.com/joshday/SparseRegression.jl) [_notebook_](Sparse%20Regression.ipynb)
-  MultivariateStats.jl [_git_](https://github.com/juliaStats/MultivariateStats.jl) [_notebook_](MultivariateStats.ipynb)
-  GLM.jl

_Note: MultivariateStats.jl allows for multi-variate regression_


**Simple code:**
```julia
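using LinearAlgebra  # provides I; eye was removed in Julia 1.0

# Simple (but suboptimal) implementation
# n is the number of columns of x
function RidgeEstimator(x,y,λ,n)
    β = inv(x'*x + λ*Matrix(I,n,n))*(x'*y)
    return β
end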
```

<img src="resources/LR_comparison_ridge.png" width=600></img>
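
The effect of $\lambda$ described above can be checked numerically. The following sketch fits the closed-form ridge estimate for a few values of $\lambda$ on a small, noise-free toy dataset and records the norm of the coefficient vector. The function name `ridge_fit` and the data are assumptions made for illustration only.

```julia
using LinearAlgebra  # provides the identity operator I and norm

# Ridge estimate from the closed form, solving the linear system instead of
# calling inv. Every coefficient in β is penalised here (no separate intercept).
ridge_fit(X, y, λ) = (X' * X + λ * I) \ (X' * y)

X = [1.0 2.0;
     2.0 1.0;
     3.0 4.0;
     4.0 3.0;
     5.0 6.0]
y = [5.0, 4.0, 11.0, 10.0, 17.0]   # exactly X * [1.0, 2.0]

λs = [0.0, 1.0, 10.0, 100.0]
βs = [ridge_fit(X, y, λ) for λ in λs]
norms = [norm(β) for β in βs]      # decreases as λ grows
```

With $\lambda = 0$ the OLS solution is recovered, and the coefficient norm shrinks as $\lambda$ increases.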


### References:

[1] The Elements of Statistical Learning  _(Ch 4.4)_                                            

---

## Lasso Regression


### Introduction

Lasso uses the $L_1$ norm (sum of absolute values) of the coefficients for the regularization, instead of the $L_2$ norm used in ridge. This penalty tends to produce sparse solutions, setting some coefficients to exactly 0.

It can therefore be used as a continuous form of subset selection: continuous in the sense that $\lambda$ is a continuous variable, and subset selection since coefficients are actually set to 0, so the associated features are not considered at all.

### Mathematical Background

The _lasso regression_ formulation is 

\begin{equation}
\beta_{lasso} = \mathop{\mathrm{arg\,min}}_\beta \big \{ \sum_{i=1}^{N}(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^{p} | \: \beta_{j} \:  | \big \} 
\end{equation}

where $\lambda \ge 0$ is the regularization parameter. A low $\lambda$ gives weak regularization, with $\lambda=0$ giving _OLS_ back, while a large $\lambda$ means strong regularization.

The problem can be reformulated as a constrained problem

\begin{equation}
\beta_{lasso} = \mathop{\mathrm{arg\,min}}_\beta RSS(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} | \: \beta_j \: | \le t 
\end{equation}


There is no closed-form expression for the optimal values of $\beta$, but efficient algorithms are available that make finding the lasso solution about as fast as ridge, for example a lasso-modified _Least Angle Regression_ (LAR) algorithm.

**Note:** $\lambda$ is often also called $\alpha$, such as in scikit-learn models  

**Practical note:** for lasso to work properly, the features should be normalised, since the penalty depends on the scale of each coefficient.
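
LAR itself is fairly involved, so the sketch below uses a different and simpler technique, cyclic coordinate descent with soft-thresholding, to show how the $L_1$ penalty sets coefficients to exactly 0. It is a minimal illustration under stated assumptions: the columns of `X` are centred and on the same scale, there is no intercept, a fixed number of sweeps is run without a convergence check, and the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```julia
using LinearAlgebra  # for dot

# Soft-thresholding operator: shrinks z towards 0 by γ and clips at 0.
soft_threshold(z, γ) = sign(z) * max(abs(z) - γ, 0.0)

# Hypothetical helper: cyclic coordinate descent for
#   minimise  sum((y - X*β).^2) + λ * sum(abs.(β))
# Assumes centred columns of comparable scale and no intercept.
function lasso_coordinate_descent(X::AbstractMatrix, y::AbstractVector, λ::Real;
                                  sweeps::Integer=200)
    p = size(X, 2)
    β = zeros(p)
    residual = float.(y) .- X * β     # residual for the current β
    for _ in 1:sweeps
        for j in 1:p
            xj = view(X, :, j)
            residual .+= xj .* β[j]   # remove feature j's contribution
            ρ = dot(xj, residual)
            β[j] = soft_threshold(ρ, λ / 2) / dot(xj, xj)
            residual .-= xj .* β[j]   # add it back with the updated β[j]
        end
    end
    return β
end

# Toy data with orthogonal columns: y depends strongly on feature 1
# and only weakly on feature 2.
X = [1.0  1.0;
     1.0 -1.0;
    -1.0  1.0;
    -1.0 -1.0]
y = X * [3.0, 0.1]

β_ols   = lasso_coordinate_descent(X, y, 0.0)   # λ = 0 recovers OLS
β_lasso = lasso_coordinate_descent(X, y, 2.0)   # weak coefficient is set to 0
```

On this toy problem the weak second coefficient is removed entirely, while the strong first coefficient is only shrunk.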

#### Example of the evolution of the $\beta$ coefficients

![Lasso coefficients](resources/lasso_coefficients.png)

### Julia implementations

**Libraries:**
-  SparseRegression.jl [_git_](https://github.com/joshday/SparseRegression.jl) [_notebook_](Sparse%20Regression.ipynb)
-  GLM.jl


### References:

[1] The Elements of Statistical Learning  _(Ch 4.4)_                                            


___
# Linear Models for Classification

The methods discussed above can be used for classification problems through multivariate regression.
Let $n$ be the number of classes of a classification problem, $\textbf{X} \in \mathbb{R}^{N \times p}$ be the datapoints and $\textbf{y}=(y_1, y_2, ..., y_N)$ the target classes, where $y_i \in \{1,..,n\}$. These targets can be transformed into a binary $N \times n$ matrix, where each row has exactly one non-zero element, in the column corresponding to its class.

e.g.:

$$
\begin{bmatrix}
    0       & 1 & 0 & \dots & 0 \\
    1       & 0 & 0 & \dots & 0 \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    0       & 0 & 0 & \dots & 1
\end{bmatrix}
$$

would mean the target class of the first observation is 2, that of the second is 1, and that of the $N^{th}$ is $n$.

Then it is possible to train a multivariate regression model to "regress" these targets. If the model is linear, such as _OLS_, the regression gives a continuous value for every target class, which can be interpreted as a _certainty_ (and further as "probabilities" if transformed through a function such as _softmax_).
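
The procedure described above can be sketched in a few lines of plain Julia. This is an illustrative example only, not the approach used in the MultivariateStats notebook: the helpers `indicator_matrix` and `softmax` are hypothetical names written here by hand, and the toy dataset is made up.

```julia
# Hypothetical helper: turn class labels in 1..n into an N×n indicator matrix.
function indicator_matrix(labels::AbstractVector{<:Integer}, n::Integer)
    Y = zeros(length(labels), n)
    for (i, c) in enumerate(labels)
        Y[i, c] = 1.0
    end
    return Y
end

# Softmax of a vector of scores, shifted by the maximum for numerical stability.
function softmax(scores::AbstractVector)
    e = exp.(scores .- maximum(scores))
    return e ./ sum(e)
end

# Toy 1-D dataset with two well-separated classes.
x = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
labels = [1, 1, 1, 2, 2, 2]

X = hcat(ones(length(x)), x)        # design matrix with a column of ones
Y = indicator_matrix(labels, 2)     # N×n binary targets
B = X \ Y                           # one OLS fit per class column

scores = X * B                      # continuous value per class
predicted = [argmax(scores[i, :]) for i in 1:size(scores, 1)]
probabilities = softmax(scores[1, :])   # softmax of the first row of scores
```

Each datapoint is assigned to the class whose column has the largest regressed value.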

### Julia implementations

There are no explicit implementations that directly allow for classification, but it is quite simple to set up using regression methods.
An example can be found in the [MultivariateStats notebook](MultivariateStats.ipynb).