<h1>Linear Models for Regression</h1>

Reference:

[1] PRML.
[2] MLAPP.

# 0. Linear Basis Function Models

## Linear Regression
> $y(x, w) = w_0 + w_1 \phi_1(x) + \dots + w_M \phi_M(x)$

> $= \sum_{j=0}^M w_j \phi_j(x) = w^{T}\phi(x)$, where $\phi_0(x) = 1$.

## Maximum Likelihood

Assume the target variable $t$ is the sum of a <b>deterministic</b> function $y(x, w)$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$.

> $t = y(x, w) + \epsilon$

Thus, $\mathbb{E}[t|x, w] = y(x, w)$, $Var[t|x, w] = \beta^{-1}$,

> $p(t|x, w) = \mathcal{N}(t|y(x, w), \beta^{-1})$

For $N$ independent and identically distributed samples $\{X_1, X_2, ..., X_N\}$ with targets $\{t_1, t_2, ..., t_N\}$, the likelihood function is,

> $L(X, w) = \prod_{i=1}^N p(t_i|X_i, w)$

Take the logarithm (a sum of logs avoids the numerical underflow of a product of many small probabilities),

> $l(X, w) = \log L(X, w) = \sum_{i=1}^N \log p(t_i|X_i, w)$

The log-likelihood function $l(X, w)$ is <b>concave</b> in $w$, 
so setting its derivative with respect to each $w_j$ to zero 
yields the optimal value of the parameters $w$.
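
Under the Gaussian noise model, maximizing the log-likelihood with respect to $w$ is the same as minimizing the sum of squared errors. The following is a minimal sketch of that zero-gradient solution, assuming the simplest basis $\phi(x) = (1, x)$. The function name `fit_line_ml`, the toy data, and the noise-variance estimate (the mean squared residual) are illustrative assumptions, not part of the notes above.

```python
def fit_line_ml(xs, ts):
    """Maximum-likelihood fit of t = w0 + w1 * x under Gaussian noise.

    Setting the gradient of the log-likelihood to zero gives the
    ordinary least-squares solution.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    t_mean = sum(ts) / n
    # Centered form of the 2x2 normal equations.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    w0 = t_mean - w1 * x_mean
    # Assumed ML estimate of the noise variance 1/beta: mean squared residual.
    noise_var = sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts)) / n
    return w0, w1, noise_var


xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]  # generated exactly by t = 1 + 2x
w0, w1, noise_var = fit_line_ml(xs, ts)
print(w0, w1, noise_var)
```

Because the toy targets lie exactly on a line, the fit recovers the intercept and slope and the residual variance vanishes.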

# 1. Bias Variance Trade-off

Let $y$ be the noisy target and $\hat{y}$ the prediction, and assume the two are independent, so that $\mathbb{E}[y\hat{y}] = \mathbb{E}[y]\mathbb{E}[\hat{y}]$.

> $\mathbb{E}[(y - \hat{y})^2] = \mathbb{E}[y^2 + {\hat{y}}^2 - 2y\hat{y}]$

> $= \mathbb{E}[y^2] + \mathbb{E}[{\hat{y}}^2] - 2\mathbb{E}[y]\mathbb{E}[\hat{y}]$

> $= Var[y] + \mathbb{E}^2[y] + Var[\hat{y}] + \mathbb{E}^2[\hat{y}] - 2 \mathbb{E}[y] \mathbb{E}[\hat{y}]$

> $= Var[y] + Var[\hat{y}] + \{\mathbb{E}^2[\hat{y}] - 2 \mathbb{E}[y] \mathbb{E}[\hat{y}]+ \mathbb{E}^2[y] \}$

> $= 
\underbrace{Var[y]}_{noise} + 
\underbrace{Var[\hat{y}]}_{variance} + 
\underbrace{\{ \mathbb{E}[y] - \mathbb{E}[\hat{y}] \}^2}_{bias^2}$
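
The decomposition can be checked numerically. The sketch below is only an illustration under assumed settings: the target is $y = \mu + \epsilon$, and the prediction $\hat{y}$ is a deliberately biased (shrunken) sample mean computed from a fresh training set on every trial. The function names and all parameter values are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def decompose(trials=20000, n=4, mu=2.0, sigma=1.0, shrink=0.5, seed=0):
    """Estimate both sides of the decomposition by simulation."""
    rng = random.Random(seed)
    ys, y_hats = [], []
    for _ in range(trials):
        # A fresh training set gives a fresh prediction y_hat ...
        train = [rng.gauss(mu, sigma) for _ in range(n)]
        y_hats.append(shrink * mean(train))  # deliberately biased estimator
        # ... and the test target y is drawn independently of it.
        ys.append(rng.gauss(mu, sigma))
    mse = mean([(y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)])
    noise = variance(ys)
    var = variance(y_hats)
    bias_sq = (mean(ys) - mean(y_hats)) ** 2
    return mse, noise, var, bias_sq


mse, noise, var, bias_sq = decompose()
print(f"mse={mse:.3f}  noise+variance+bias^2={noise + var + bias_sq:.3f}")
```

With these settings the theoretical values are noise $= 1$, variance $= 0.25/4$, and bias$^2 = 1$, so both printed numbers should be close to $2.06$.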



# 2. Bayesian Linear Regression

The likelihood of a linear regression problem is, 

> $L(X, w) = \prod_{i=1}^N p(t_i|X_i, w)$

where $p(t|x, w) = \mathcal{N}(t|y(x, w), \beta^{-1})$, 

thus the conjugate prior of the likelihood is also a Gaussian distribution.

> $p(w) = \mathcal{N}(w|\mu_0, S_0)$

The posterior distribution of $w$ is, 

> $p(w|t) = \mathcal{N}(w|\mu_1, S_1)$

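> $\mu_1 = S_1(S_0^{-1}\mu_0 + \beta \Phi^T t)$

> $S_1^{-1} = S_0^{-1} + \beta \Phi^T \Phi$

where $\Phi$ is the design matrix whose $i$-th row is $\phi(X_i)^T$, and $t = (t_1, ..., t_N)^T$.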
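
As a minimal sketch of the posterior update, consider the scalar special case $y(x, w) = w x$ with a single weight, so that every matrix in the formulas above reduces to a number. The function name `posterior_1d`, the toy data, and the chosen prior and noise precision are assumptions made for illustration.

```python
def posterior_1d(xs, ts, mu0, s0, beta):
    """Posterior N(w | mu1, s1) for the scalar model t = w * x + noise.

    mu0, s0: prior mean and variance of w; beta: noise precision.
    """
    # Posterior precision = prior precision + beta * sum of squared features.
    precision = 1.0 / s0 + beta * sum(x * x for x in xs)
    s1 = 1.0 / precision
    # Posterior mean = s1 * (prior precision * prior mean + beta * sum of x * t).
    mu1 = s1 * (mu0 / s0 + beta * sum(x * t for x, t in zip(xs, ts)))
    return mu1, s1


xs = [1.0, 2.0, 3.0]
ts = [2.1, 3.9, 6.2]  # roughly t = 2x
mu1, s1 = posterior_1d(xs, ts, mu0=0.0, s0=1.0, beta=25.0)
print(mu1, s1)
```

The posterior variance is smaller than the prior variance, and the posterior mean moves from the prior mean toward the least-squares slope as more data arrive.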

The predictive distribution for a new input $x$ is, 

> $p(t^{test}|x, t^{train}) = \int p(t^{test}|x, w)p(w|t^{train})dw$

Because both factors are Gaussian, this integral can be evaluated analytically. In general, it is an expectation under the posterior of $w$,

> $p(t^{test}|x, t^{train}) = \mathbb{E}_{w \sim p(w|t^{train})}[p(t^{test}|x, w)]$ 

When this expectation of the likelihood function is not analytically tractable, we resort to approximation methods.

# 3. Sampling Methods

## 3.0 Law of Large Numbers (Core)

According to the law of large numbers, 

if the samples $x_1, ..., x_N$ are drawn independently from the distribution $p$, then

> $\mathbb{E}_{p}[f(x)] = \int f(x) p(x) dx \thickapprox \frac{1}{N} \sum_{i=1}^N f(x_i)$
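
The approximation can be sketched in a few lines. The example assumes $p = \mathcal{N}(0, 1)$ and $f(x) = x^2$, for which the exact expectation is $1$; the helper name `mc_expectation` is hypothetical.

```python
import random


def mc_expectation(f, draw, n):
    """Approximate E_p[f(x)] by averaging f over n samples drawn from p."""
    return sum(f(draw()) for _ in range(n)) / n


rng = random.Random(0)
estimate = mc_expectation(
    f=lambda x: x * x,
    draw=lambda: rng.gauss(0.0, 1.0),  # x_i ~ N(0, 1)
    n=100000,
)
print(estimate)  # close to the exact value E[x^2] = 1
```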

## 3.1 Standard Distribution

> <b>Theorem</b>: Let $z \sim U(0, 1)$ come from a uniform random number generator, and let $F$ be a cdf with inverse $F^{-1}$. Then 
> $F^{-1}(z) \sim F$.

Proof:

> $p(F^{-1}(z) \le x)$

> $= p(z \le F(x))$, (Apply $F$ to both sides)

> $= F(x)$, (Because $p(z \le y) = \int_{0}^{y}dz = y$)

If we want to draw samples from a distribution $p(x)$ that is hard to sample directly, we can 

> + compute the cdf (cumulative distribution function) $F(x)$;
> + compute the inverse of the cdf, $F^{-1}(z)$;
> + sample from the uniform distribution $z \sim U(0, 1)$;
> + obtain the sample value from $p(x)$, which is $F^{-1}(z)$.

Example: 

> $p(x=1) = 0.2, p(x=2) = 0.3, p(x=3) = 0.1, p(x=4) = 0.4$

> $F(x=1) = p(x \le 1) = 0.2, F(x=2) = p(x \le 2) = 0.5, F(x=3) = p(x \le 3) = 0.6, F(x=4)  = p(x \le 4) = 1.0$

Let $y = F(x)$ and $f = F^{-1}$, so that $f(y) = x$,

> $f(y=0.2) = 1, f(y=0.5) = 2, f(y=0.6) = 3, f(y=1.0) = 4$

Draw $z \sim U(0, 1)$ and substitute it into $F^{-1}(z)$ to obtain samples drawn from $p(x)$,

> $x = 1, \text{ if } 0 < z \le 0.2$

> $x = 2, \text{ if } 0.2 < z \le 0.5$

> $x = 3, \text{ if } 0.5 < z \le 0.6$

> $x = 4, \text{ if } 0.6 < z < 1$.
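
The discrete example above can be sketched as follows. The cdf values are copied from the example; the names `VALUES`, `CDF`, and `inverse_cdf` are illustrative, and using a binary search to invert the cdf is one possible choice, not the only one.

```python
import random
from bisect import bisect_left

VALUES = [1, 2, 3, 4]
CDF = [0.2, 0.5, 0.6, 1.0]  # F(x) for x = 1, 2, 3, 4


def inverse_cdf(z):
    """Return the smallest x with z <= F(x)."""
    index = bisect_left(CDF, z)
    # The clamp only guards against rounding at the upper end.
    return VALUES[min(index, len(VALUES) - 1)]


rng = random.Random(0)
n = 100000
counts = {x: 0 for x in VALUES}
for _ in range(n):
    counts[inverse_cdf(rng.random())] += 1  # z ~ U(0, 1)
frequencies = {x: c / n for x, c in counts.items()}
print(frequencies)  # close to the target probabilities 0.2, 0.3, 0.1, 0.4
```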

## 3.2 Rejection Methods

Usually the cdf of $p(x)$ is hard to compute.

Use a proposal distribution $q(x)$ which is easy to draw from, and a constant $M$ such that, for all $x$, 

> $Mq(x) \ge p(x)$

<b>Sampling process</b>

+ Sample $x$ from $q(x)$

+ Sample from the uniform distribution $u \sim U(0, 1)$

+ If $u \le \frac{p(x)}{Mq(x)}$, keep $x$; otherwise reject the current sample.

Sometimes an unnormalized version $\widetilde{p}(x)$ of $p(x)$ is used instead, with $M$ chosen so that $Mq(x) \ge \widetilde{p}(x)$.

<b>Cons</b>

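The sampling efficiency can be very low, because every rejected sample is wasted computation. For normalized $p$ and $q$, a proposal is accepted with probability $1/M$ on average.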
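
The sampling process of this section can be sketched as below. The target $p(x) = 2x$ on $[0, 1]$, the uniform proposal, and $M = 2$ are assumptions chosen so that the envelope condition $Mq(x) \ge p(x)$ is easy to verify by hand; the function name `rejection_sample` is hypothetical.

```python
import random


def rejection_sample(p, q_pdf, q_draw, M, n, rng):
    """Draw n samples from p using proposal q with envelope M*q(x) >= p(x)."""
    samples = []
    proposed = 0
    while len(samples) < n:
        x = q_draw(rng)    # sample x from q
        u = rng.random()   # sample u ~ U(0, 1)
        proposed += 1
        if u <= p(x) / (M * q_pdf(x)):
            samples.append(x)  # keep x, otherwise it is discarded
    return samples, n / proposed


rng = random.Random(0)
samples, acceptance_rate = rejection_sample(
    p=lambda x: 2.0 * x,          # target density on [0, 1]
    q_pdf=lambda x: 1.0,          # uniform proposal density on [0, 1]
    q_draw=lambda r: r.random(),
    M=2.0,                        # 2 * q(x) >= p(x) on [0, 1]
    n=20000,
    rng=rng,
)
sample_mean = sum(samples) / len(samples)
print(sample_mean, acceptance_rate)
```

The sample mean should be close to the exact mean $2/3$ of the target, and roughly half of the proposals are discarded, which illustrates the inefficiency.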

## 3.3 Importance Sampling

To overcome the inefficiency of rejection sampling methods, we keep all of the samples drawn, but with different weights. 

In effect, the samples that would be discarded by rejection sampling receive a low weight.

As before, use a proposal distribution $q(x)$ which is easy to draw from.

To estimate the expectation of $f(x)$ under the distribution $p(x)$, 

> $\mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx$

The accuracy depends not only on samples of large probability $p(x)$, but also on samples with <b>large absolute value</b> $|f(x)|$, which contribute a lot to the integral.

> $\mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) dx$

> $= \int f(x) \frac{p(x)}{q(x)} q(x) dx$

> $\thickapprox \frac{1}{N} \sum_{i=1}^N f(x_i)w_i$

where $x_i \sim q(x)$ and $w_i = \frac{p(x_i)}{q(x_i)}$.

<b>Sampling process</b>

+ Sample $x \sim q(x)$

+ Compute the weight $\frac{p(x)}{q(x)}$
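
A minimal sketch of the weighted estimate follows. It assumes a standard normal target $p$, a wider normal proposal $q$ with standard deviation $2$, and $f(x) = x^2$, so the exact answer is $1$; the helper names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(f, p_pdf, q_pdf, q_draw, n):
    """Estimate E_p[f(x)] from samples of q, weighting each by p(x)/q(x)."""
    total = 0.0
    for _ in range(n):
        x = q_draw()               # sample x ~ q(x)
        w = p_pdf(x) / q_pdf(x)    # importance weight
        total += f(x) * w
    return total / n


rng = random.Random(0)
estimate = importance_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_draw=lambda: rng.gauss(0.0, 2.0),
    n=50000,
)
print(estimate)  # close to E_p[x^2] = 1
```

No sample is thrown away here; draws that land where $p$ is small simply contribute with a small weight.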

## 3.4 Markov Chain Monte Carlo

Rejection sampling and importance sampling suffer from severe limitations, especially in <b>high dimensions</b>.

MCMC methods are introduced to address this. They require:

+ A proposal distribution $q(x)$ which is easy to draw from and has the <b>Markov property</b>: the proposed next state depends only on the current state

+ An unnormalized version $\widetilde{p}(x)$ of the target distribution $p(x) = \frac{\widetilde{p}(x)}{Z_p}$, where $\widetilde{p}(x)$ is easy to evaluate

Borrowing the idea from rejection sampling, we need a criterion for keeping or rejecting the samples drawn from $q(x)$.

Instead of finding a constant $M$ such that $Mq(x)$ envelopes $p(x)$, here we concentrate on the proposal distribution.


### 3.4.0 Markov Chain

#### Markov Property

> $p(z^{(t+1)} | z^{(1)}, z^{(2)}, ..., z^{(t)}) = p(z^{(t+1)}|z^{(t)})$

#### Transition

Define the transition probability from state $a$ to state $b$ as

> $T(a, b) = p(b|a)$

#### Invariant Distribution

A distribution is said to be <b>invariant</b> or <b>stationary</b> with respect to a Markov Chain if each step in the chain leaves that distribution invariant.

For a homogeneous Markov Chain with transition probability $T(a, b)$, the distribution $p(x)$ is invariant if, 

> $p(b) = \sum_{a} T(a, b)p(a)$

<b>Note:</b> A Markov Chain may have more than one invariant distribution (e.g. under the identity transition, every distribution is invariant).

#### Detailed Balance

> $p(a)T(a, b) = p(b)T(b, a)$

<b>Theorem: </b> If a transition probability satisfies the detailed balance property with respect to $p$, then $p$ is invariant under that transition.

<b>Proof:</b>

> $\sum_{a} p(a)T(a, b) = \sum_{a} p(b)T(b, a) = p(b) \sum_{a} T(b, a) = p(b) \sum_{a} p(a|b) = p(b)$
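
The theorem can be checked numerically on a small chain. The three-state distribution `p` and transition matrix `T` below are made-up values chosen to satisfy detailed balance; the function names are hypothetical.

```python
def satisfies_detailed_balance(p, T, tol=1e-12):
    """Check p(a) T(a, b) == p(b) T(b, a) for every pair of states."""
    n = len(p)
    return all(
        abs(p[a] * T[a][b] - p[b] * T[b][a]) <= tol
        for a in range(n)
        for b in range(n)
    )


def step(p, T):
    """One step of the chain: new_p(b) = sum_a T(a, b) p(a)."""
    n = len(p)
    return [sum(p[a] * T[a][b] for a in range(n)) for b in range(n)]


p = [0.25, 0.5, 0.25]
T = [
    [0.5, 0.5, 0.0],    # T[a][b] = probability of moving from a to b
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
print(satisfies_detailed_balance(p, T))
print(step(p, T))  # equals p, so p is invariant
```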

### 3.4.1 Metropolis Hastings Algorithms

#### 3.4.1.0 Symmetric Proposal, Metropolis Algorithms

Assume that the proposal distribution $q$ is <b>symmetric</b>, 

> $q(z_a | z_b) = q(z_b | z_a)$

<b>Sampling Process</b>

+ Draw $u \sim U(0, 1)$

+ Draw a candidate $z \sim q(z|z^{(t)})$

+ Compute the acceptance probability $A(z|z^{(t)}) = \min \left(1, \frac{\widetilde{p}(z)}{\widetilde{p}(z^{(t)})} \right)$

+ If $u \le A(z|z^{(t)})$, accept the candidate, $z^{(t+1)} = z$; otherwise keep the previous sample, $z^{(t+1)} = z^{(t)}$

<b>Proof:</b> We show that the resulting Markov Chain satisfies the detailed balance condition, so the desired distribution $p$ is invariant.

> $p(z^{(t)}) A(z|z^{(t)}) = p(z^{(t)}) \min \left(1, \frac{\widetilde{p}(z)}{\widetilde{p}(z^{(t)})} \right) $

> $= p(z^{(t)}) \min \left(1, \frac{p(z)}{p(z^{(t)})} \right)$

> $= \min(p(z^{(t)}), p(z))$

> $= p(z) \min(\frac{p(z^{(t)})}{p(z)}, 1)$

> $= p(z) \min(\frac{ \widetilde{p}(z^{(t)}) }{\widetilde{p}(z)}, 1)$

> $= p(z)A(z^{(t)}|z)$

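Multiplying both sides by the symmetric proposal, $q(z|z^{(t)}) = q(z^{(t)}|z)$, shows that the transition $T(z^{(t)}, z) = q(z|z^{(t)})A(z|z^{(t)})$ satisfies detailed balance.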
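
A minimal sketch of the Metropolis algorithm, where a candidate is accepted when $u \le A(z|z^{(t)})$. It assumes an unnormalized standard normal target $\widetilde{p}(z) = \exp(-z^2/2)$ and a Gaussian random-walk proposal, which is symmetric; the function name, step size, and burn-in length are illustrative choices.

```python
import math
import random


def metropolis(p_tilde, z0, n_steps, step_size, rng):
    """Metropolis sampler with a symmetric Gaussian random-walk proposal."""
    z = z0
    chain = []
    for _ in range(n_steps):
        u = rng.random()                              # u ~ U(0, 1)
        candidate = z + rng.gauss(0.0, step_size)     # symmetric proposal
        accept_prob = min(1.0, p_tilde(candidate) / p_tilde(z))
        if u <= accept_prob:
            z = candidate                             # accept the candidate
        chain.append(z)                               # else repeat the old state
    return chain


rng = random.Random(0)
chain = metropolis(
    p_tilde=lambda z: math.exp(-0.5 * z * z),  # normalizer is never needed
    z0=0.0,
    n_steps=50000,
    step_size=1.0,
    rng=rng,
)
kept = chain[1000:]  # discard an assumed burn-in period
chain_mean = sum(kept) / len(kept)
chain_var = sum((z - chain_mean) ** 2 for z in kept) / len(kept)
print(chain_mean, chain_var)  # close to 0 and 1
```

Only the ratio of unnormalized densities appears, which is why the normalizing constant $Z_p$ can stay unknown.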

#### 3.4.1.1 Asymmetric

For an asymmetric proposal, the acceptance probability becomes, 

> $A(z|z^{(t)}) = \min \left(1, 
\frac{\widetilde{p}(z) q(z^{(t)}|z) }
{\widetilde{p}(z^{(t)}) q(z|z^{(t)})} \right)$

<b>Proof:</b>

> $p(z_a) q(z_b|z_a) A(z_b|z_a) = p(z_a) q(z_b|z_a) 
\min \left(1, \frac{ \widetilde{p}(z_b)  q(z_a|z_b)}{ \widetilde{p}(z_a) q(z_b|z_a)}
\right)$

> $= p(z_a) q(z_b|z_a) 
\min \left(1, \frac{ p(z_b)  q(z_a|z_b)}{ p(z_a) q(z_b|z_a)}
\right)$

> $= \min \left( p(z_a)q(z_b|z_a), p(z_b)q(z_a|z_b) \right)$

> $= p(z_b)q(z_a|z_b) \min \left(1, \frac{p(z_a)q(z_b|z_a)}{p(z_b)q(z_a|z_b)} \right)$

> $= p(z_b)q(z_a|z_b) A(z_a|z_b)$

Given a proposal distribution $q$ and an acceptance probability $A(z_b|z_a)$, view their product as the <b>transition</b> probability function.
We obtain a Markov Chain that satisfies the <b>detailed balance</b> property, which means $p(z)$ is its invariant distribution, as desired.
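
A minimal sketch of Metropolis Hastings with an asymmetric proposal, using the acceptance ratio $\frac{\widetilde{p}(z) q(z^{(t)}|z)}{\widetilde{p}(z^{(t)}) q(z|z^{(t)})}$. The target $\widetilde{p}(z) = z e^{-z}$ on $z > 0$ (an unnormalized Gamma density with mean $2$) and the multiplicative log-normal random-walk proposal are assumptions for illustration; all names and parameter values are hypothetical.

```python
import math
import random


def lognormal_pdf(z_new, z_old, s):
    """q(z_new | z_old): log-normal density centred on log(z_old)."""
    d = math.log(z_new) - math.log(z_old)
    return math.exp(-0.5 * (d / s) ** 2) / (z_new * s * math.sqrt(2.0 * math.pi))


def metropolis_hastings(p_tilde, z0, n_steps, s, rng):
    """Metropolis Hastings sampler with an asymmetric proposal."""
    z = z0
    chain = []
    for _ in range(n_steps):
        candidate = z * math.exp(rng.gauss(0.0, s))  # draw from q(. | z)
        # The proposal densities do not cancel because q is asymmetric.
        numerator = p_tilde(candidate) * lognormal_pdf(z, candidate, s)
        denominator = p_tilde(z) * lognormal_pdf(candidate, z, s)
        if rng.random() <= min(1.0, numerator / denominator):
            z = candidate
        chain.append(z)
    return chain


rng = random.Random(0)
chain = metropolis_hastings(
    p_tilde=lambda z: z * math.exp(-z),  # unnormalized target on z > 0
    z0=1.0,
    n_steps=50000,
    s=0.8,
    rng=rng,
)
kept = chain[1000:]  # discard an assumed burn-in period
chain_mean = sum(kept) / len(kept)
print(chain_mean)  # close to the target mean 2
```

Dropping the two `lognormal_pdf` factors would silently sample from a different distribution, which is the practical reason the correction term matters.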