## 1.1. Linear Models

### 1.1.10. Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.

This can be done by introducing [uninformative priors](https://en.wikipedia.org/wiki/Non-informative_prior#Uninformative_priors) over the hyperparameters of the model. The $\ell_2$ regularization used in [Ridge regression and classification](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression) is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the coefficients $w$ with precision $\lambda$ (that is, variance $\lambda^{-1}$). Instead of setting $\lambda$ manually, it is possible to treat it as a random variable to be estimated from the data.

To obtain a fully probabilistic model, the output $y$ is assumed to be Gaussian distributed around $Xw$:

$$p(y|X, w, \alpha) = \mathcal{N}(y|Xw, \alpha^{-1})$$

where $\alpha$ is again treated as a random variable that is to be estimated from the data.
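The link with Ridge can be made explicit with a short derivation sketch. It assumes a zero-mean isotropic Gaussian prior $\mathcal{N}(w|0, \lambda^{-1}I_p)$ over the coefficients and treats $\alpha$ and $\lambda$ as fixed. Combining this prior with the Gaussian likelihood through Bayes' rule gives the negative log posterior, up to an additive constant that does not depend on $w$:

$$-\log p(w|X, y, \alpha, \lambda) = \frac{\alpha}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \text{const}$$

Dividing by $\alpha/2$ shows that the maximum a posteriori estimate solves a ridge problem whose penalty is the ratio $\lambda/\alpha$:

$$\hat{w}_{\text{MAP}} = \arg\min_w \|y - Xw\|_2^2 + \frac{\lambda}{\alpha}\|w\|_2^2$$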

The **advantages** of Bayesian Regression are:

- It adapts to the data at hand.
- It can be used to include regularization parameters in the estimation procedure.

The **disadvantages** of Bayesian regression include:

- Inference of the model can be time consuming.

#### References-1.1.10.

- A good introduction to Bayesian methods is given in C. Bishop: *Pattern Recognition and Machine Learning*.
- The original algorithm is detailed in the book *Bayesian Learning for Neural Networks* by Radford M. Neal.

#### 1.1.10.1. Bayesian Ridge Regression
<a id="bayesian-ridge-regression"></a>

[BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge) estimates a probabilistic model of the regression problem as described above. The prior for the coefficient $w$ is given by a spherical Gaussian:

$$p(w|\lambda) = \mathcal{N}(w|0, \lambda^{-1}I_p)$$

The priors over $\alpha$ and $\lambda$ are chosen to be [gamma distributions](https://en.wikipedia.org/wiki/Gamma_distribution), the conjugate prior for the precision of the Gaussian. The resulting model is called *Bayesian Ridge Regression*, and is similar to the classical [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge).

The parameters $w$, $\alpha$ and $\lambda$ are estimated jointly during the fit of the model; the regularization parameters $\alpha$ and $\lambda$ are estimated by maximizing the *log marginal likelihood*. The scikit-learn implementation is based on the algorithm described in Appendix A of (Tipping, 2001), where the update of the parameters $\alpha$ and $\lambda$ is done as suggested in (MacKay, 1992). The initial values of the maximization procedure can be set with the hyperparameters `alpha_init` and `lambda_init`.
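The relationship between the estimated precisions and classical Ridge can be checked numerically. The sketch below is illustrative only: the synthetic data, the random seed and the variable names (`true_w`, `ridge_penalty`, `max_gap`) are assumptions made for this example. It fits `BayesianRidge`, reads the estimated `alpha_` and `lambda_`, and refits a plain `Ridge` whose penalty is set to their ratio.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Ridge

# Synthetic regression problem (assumed for illustration only).
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
true_w = np.array([1.5, -2.0, 0.0])
y = X @ true_w + 0.5 * rng.randn(50)

bayes = BayesianRidge().fit(X, y)

# alpha_ is the estimated noise precision,
# lambda_ is the estimated precision of the coefficients.
ridge_penalty = float(bayes.lambda_ / bayes.alpha_)

# A Ridge model with penalty lambda_ / alpha_ solves the same
# penalized least-squares problem as the posterior mean.
ridge = Ridge(alpha=ridge_penalty).fit(X, y)

max_gap = np.max(np.abs(ridge.coef_ - bayes.coef_))
print("estimated alpha_: ", bayes.alpha_)
print("estimated lambda_:", bayes.lambda_)
print("largest coefficient difference:", max_gap)
```

The two sets of coefficients should agree up to numerical tolerance. The difference from using `Ridge` directly is that the penalty was inferred from the data rather than chosen by hand or by cross-validation.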

There are four more hyperparameters, $\alpha_1$, $\alpha_2$, $\lambda_1$ and $\lambda_2$ of the gamma prior distributions over $\alpha$ and $\lambda$. These are usually chosen to be *non-informative*. By default, $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 =10^{-6}$.
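As a minimal sketch of how these hyperparameters map to the estimator's constructor, the example below passes the four gamma-prior hyperparameters at their default values together with illustrative starting points for the optimization. The synthetic data and the chosen `alpha_init` and `lambda_init` values are assumptions made for this example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data (assumed for illustration only).
rng = np.random.RandomState(42)
X = rng.randn(100, 4)
y = X @ np.array([1.0, 0.0, -3.0, 0.5]) + 0.3 * rng.randn(100)

reg = BayesianRidge(
    alpha_1=1e-6, alpha_2=1e-6,    # gamma prior over alpha (default values)
    lambda_1=1e-6, lambda_2=1e-6,  # gamma prior over lambda (default values)
    alpha_init=1.0,                # illustrative starting value for alpha
    lambda_init=1e-3,              # illustrative starting value for lambda
    compute_score=True,            # record the log marginal likelihood
)
reg.fit(X, y)

print("estimated alpha_: ", reg.alpha_)
print("estimated lambda_:", reg.lambda_)
print("final log marginal likelihood:", reg.scores_[-1])
```

With `compute_score=True`, the `scores_` attribute holds the value of the log marginal likelihood along the optimization, which can help when comparing different initial values.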

Bayesian Ridge Regression is used for regression:

```python
from sklearn import linear_model

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = linear_model.BayesianRidge()
reg.fit(X, Y)
```

After being fitted, the model can then be used to predict new values:

```python
reg.predict([[1, 0.]])
```

```
array([0.50000013])
```

The coefficients $w$ of the model can be accessed:

```python
reg.coef_
```

```
array([0.49999993, 0.49999993])
```

Due to the Bayesian framework, the weights found are slightly different from the ones found by [Ordinary Least Squares](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares). However, Bayesian Ridge Regression is more robust to ill-posed problems.
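The comparison with Ordinary Least Squares can be reproduced on the same toy data, in which the two features are perfectly collinear. The sketch below is a hedged illustration: the variable names (`ols`, `bayes`, `y_mean`, `y_std`) are chosen for this example. It also uses the `return_std=True` option of `BayesianRidge.predict`, which returns the standard deviation of the predictive distribution alongside its mean.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression

# Same toy data as above: both columns of X are identical.
X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0., 1., 2., 3.])

ols = LinearRegression().fit(X, y)
bayes = BayesianRidge().fit(X, y)

print("OLS coefficients:          ", ols.coef_)
print("BayesianRidge coefficients:", bayes.coef_)

# return_std=True also returns the standard deviation of the
# predictive distribution at each query point.
y_mean, y_std = bayes.predict([[1., 0.]], return_std=True)
print("predictive mean:", y_mean)
print("predictive std: ", y_std)
```

The predictive standard deviation is what distinguishes the probabilistic model from a point estimator: it quantifies how uncertain the model is about each prediction.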

#### Example-1.1.10.1.
- Curve Fitting with Bayesian Ridge Regression - [scikit-learn link](https://scikit-learn.org/stable/auto_examples/linear_model/plot_bayesian_ridge_curvefit.html#sphx-glr-auto-examples-linear-model-plot-bayesian-ridge-curvefit-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/15_example_plot_bayesian_ridge_curvefit.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/15_example_plot_bayesian_ridge_curvefit.ipynb)

#### References-1.1.10.1.

- Section 3.3 in Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006
- David J. C. MacKay, [Bayesian Interpolation](https://citeseerx.ist.psu.edu/doc_view/pid/b14c7cc3686e82ba40653c6dff178356a33e5e2c), 1992.
- Michael E. Tipping, [Sparse Bayesian Learning and the Relevance Vector Machine](https://www.jmlr.org/papers/volume1/tipping01a/tipping01a.pdf), 2001.

#### 1.1.10.2. Automatic Relevance Determination - ARD

Automatic Relevance Determination (as implemented in [ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)) is a kind of linear model that is very similar to [Bayesian Ridge Regression](#1.1.10.1.-Bayesian-Ridge-Regression) but leads to sparser coefficients $w$ [[1,2]](#References-1.1.10.2.).

[ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression) poses a different prior over $w$: it drops the spherical Gaussian distribution for a centered elliptic Gaussian distribution. This means each coefficient $w_i$ is drawn from its own Gaussian distribution, centered on $0$ and with precision $\lambda_i$:

$$p(w|\lambda) = \mathcal{N}(w|0, A^{-1})$$

with $A$ being a positive definite diagonal matrix and $\text{diag}(A) = \lambda = \{\lambda_1, \dots, \lambda_p\}$.
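Because $A$ is diagonal, this prior factorizes over the individual coefficients, which can be written out as:

$$p(w|\lambda) = \prod_{i=1}^{p} \mathcal{N}(w_i|0, \lambda_i^{-1})$$

A large precision $\lambda_i$ concentrates the prior of $w_i$ tightly around zero, which is how the model can effectively switch off an irrelevant feature.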

In contrast to [Bayesian Ridge Regression](#1.1.10.1.-Bayesian-Ridge-Regression), each coefficient $w_i$ has its own variance $\lambda_i^{-1}$ (that is, standard deviation $\lambda_i^{-1/2}$). The prior over all $\lambda_i$ is chosen to be the same gamma distribution, given by the hyperparameters $\lambda_1$ and $\lambda_2$ (the `lambda_1` and `lambda_2` constructor arguments, not the precisions of the first two coefficients).
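The effect of per-coefficient precisions can be illustrated with a small experiment. The sketch below is not taken from the scikit-learn example gallery: the synthetic data (20 features, of which only the first 3 are relevant), the seed and the `1e-3` threshold used to count near-zero coefficients are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

# Synthetic sparse problem (assumed for illustration only):
# 20 features, only the first 3 have non-zero true weights.
rng = np.random.RandomState(0)
n_samples, n_features = 100, 20
X = rng.randn(n_samples, n_features)
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.randn(n_samples)

ard = ARDRegression().fit(X, y)
br = BayesianRidge().fit(X, y)

# ARD estimates one precision per coefficient,
# Bayesian Ridge a single precision shared by all coefficients.
print("ARD lambda_ shape:          ", ard.lambda_.shape)
print("BayesianRidge lambda_ ndim: ", np.ndim(br.lambda_))

# Count coefficients that are (almost) zero in each model.
print("near-zero coefficients, ARD:          ",
      int(np.sum(np.abs(ard.coef_) < 1e-3)))
print("near-zero coefficients, BayesianRidge:",
      int(np.sum(np.abs(br.coef_) < 1e-3)))
```

On data like this, one would expect ARD to drive more of the irrelevant coefficients to (near) zero than Bayesian Ridge, since each irrelevant feature can receive its own large precision.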

ARD is also known in the literature as *Sparse Bayesian Learning* and *Relevance Vector Machine* [[3,4]](#References-1.1.10.2.). For a worked-out comparison between ARD and [Bayesian Ridge Regression](#1.1.10.1.-Bayesian-Ridge-Regression), see the example below.

#### Example-1.1.10.2.
- Comparing Linear Bayesian Regressors - [scikit-learn link](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ard.html#sphx-glr-auto-examples-linear-model-plot-ard-py) | [Python code](https://github.com/baksho/ml-handson/blob/main/Examples/16_example_plot_ard.py) | [Jupyter Notebook](https://github.com/baksho/ml-handson/blob/main/Examples/16_example_plot_ard.ipynb)

#### References-1.1.10.2.

1. Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1
2. David Wipf and Srikantan Nagarajan: [A New View of Automatic Relevance Determination](https://papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination.pdf)
3. Michael E. Tipping: [Sparse Bayesian Learning and the Relevance Vector Machine](https://www.jmlr.org/papers/volume1/tipping01a/tipping01a.pdf)
4. Tristan Fletcher: [Relevance Vector Machines Explained](https://citeseerx.ist.psu.edu/doc_view/pid/3dc9d625404fdfef6eaccc3babddefe4c176abd4)