<h1>Linear Models for Regression</h1>

Reference:

[1] PRML.
[2] MLAPP.

# 0. Linear Basis Function Models

## Linear Regression
> $y(x, w) = w_0 + w_1 \phi_1(x) + ... + w_M \phi_M(x)$

> $= \sum_{j=0}^M w_j \phi_j(x) = w^{T}\phi(x)$, where $\phi_0(x) = 1$
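
As a minimal sketch of a linear basis function model, the snippet below assumes a polynomial basis $\phi_j(x) = x^j$ (these notes do not fix a particular basis). The helper names `design_matrix` and `predict` are hypothetical.

```python
import numpy as np

def design_matrix(x, M):
    """Return the N x (M + 1) matrix with entries phi_j(x_i) = x_i ** j.

    Column 0 is all ones, i.e. phi_0(x) = 1, which absorbs the bias w_0.
    The polynomial basis is an assumption made for illustration.
    """
    x = np.asarray(x, dtype=float)
    return np.vander(x, M + 1, increasing=True)

def predict(x, w):
    """Evaluate y(x, w) = w^T phi(x) for every input in x."""
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w

# y(x, w) = 1 + 2x + 3x^2 evaluated at x = 0, 1, 2
y = predict([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The model stays linear in $w$ even though it is nonlinear in $x$, which is why the prediction reduces to a single matrix-vector product.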

## Maximum Likelihood

Assume the target variable $t$ is the sum of a <b>deterministic</b> function $y(x, w)$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$.

> $t = y(x, w) + \epsilon$

Thus, $\mathbb{E}[t|x, w] = y(x, w)$, $Var[t|x, w] = \beta^{-1}$,

> $p(t|x, w) = \mathcal{N}(t|y(x, w), \beta^{-1})$

For $N$ independent and identically distributed samples $\{X_1, X_2, ..., X_N\}$ with targets $\{t_1, t_2, ..., t_N\}$, the likelihood function is,

> $L(X, w) = \prod_{i=1}^N p(t_i|X_i, w)$

Compute the log-likelihood instead, which avoids numerical underflow when many small probabilities are multiplied,

> $l(X, w) = \log L(X, w) = \sum_{i=1}^N \log p(t_i|X_i, w)$

The log-likelihood function $l(X, w)$ is <b>concave</b> in $w$, 
so setting its derivative with respect to each $w_j$ to zero 
yields the optimal (maximum likelihood) value of the parameters $w$.
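
Setting the gradient to zero gives the normal equations $\Phi^T \Phi w = \Phi^T t$, so $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$, and $1/\beta_{ML}$ is the mean squared residual. The sketch below illustrates this on synthetic data. The data-generating line, the noise level, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): t = 1 + 2x + Gaussian noise
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
true_w = np.array([1.0, 2.0])
beta = 25.0                                   # noise precision, so std = 0.2
t = true_w[0] + true_w[1] * x + rng.normal(0.0, beta ** -0.5, size=N)

# Design matrix with phi_0(x) = 1 and phi_1(x) = x
Phi = np.column_stack([np.ones(N), x])

# Maximum-likelihood weights: least-squares solution of Phi w = t
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Maximum-likelihood noise variance 1 / beta_ML: mean squared residual
noise_var_ml = np.mean((t - Phi @ w_ml) ** 2)
```

Using `np.linalg.lstsq` rather than explicitly inverting $\Phi^T \Phi$ is a numerical-stability choice; both solve the same normal equations.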

# 1. Bias Variance Trade-off

> $\mathbb{E}[(y - \hat{y})^2] = \mathbb{E}[y^2 + {\hat{y}}^2 - 2y\hat{y}]$

> $= \mathbb{E}[y^2] + \mathbb{E}[{\hat{y}}^2] - 2\mathbb{E}[y]\mathbb{E}[\hat{y}]$, (Assume $y$ and $\hat{y}$ are independent)

> $= Var[y] + \mathbb{E}^2[y] + Var[\hat{y}] + \mathbb{E}^2[\hat{y}] - 2 \mathbb{E}[y] \mathbb{E}[\hat{y}]$

> $= Var[y] + Var[\hat{y}] + \{\mathbb{E}^2[\hat{y}] - 2 \mathbb{E}[y] \mathbb{E}[\hat{y}]+ \mathbb{E}^2[y] \}$

> $= 
\underbrace{Var[y]}_{noise} + 
\underbrace{Var[\hat{y}]}_{variance} + 
\underbrace{\{ \mathbb{E}[y] - \mathbb{E}[\hat{y}] \}^2}_{bias^2}$
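
The decomposition can be checked numerically. The sketch below assumes a toy setup in which the target $y$ and the prediction $\hat{y}$ are independent Gaussians with made-up means and standard deviations. It compares the expected squared error with the sum of noise, variance, and squared bias.

```python
import random
import statistics

rng = random.Random(0)
N = 50_000

# Assumed toy setup: target y and prediction y_hat are independent Gaussians
y = [rng.gauss(1.0, 0.5) for _ in range(N)]       # E[y] = 1.0, Var[y] = 0.25
y_hat = [rng.gauss(0.7, 0.3) for _ in range(N)]   # E[y_hat] = 0.7, Var[y_hat] = 0.09

# Left-hand side: expected squared error
mse = statistics.fmean((a - b) ** 2 for a, b in zip(y, y_hat))

# Right-hand side: noise + variance + bias^2, with bias^2 = (E[y] - E[y_hat])^2
noise = statistics.pvariance(y)
variance = statistics.pvariance(y_hat)
bias_sq = (statistics.fmean(y) - statistics.fmean(y_hat)) ** 2
decomposition = noise + variance + bias_sq
```

Both quantities should be close to $0.25 + 0.09 + 0.3^2 = 0.43$, up to sampling error.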



# 2. Bayesian Linear Regression

The likelihood of a linear regression problem is, 

> $L(X, w) = \prod_{i=1}^N p(t_i|X_i, w)$

where $p(t|x, w) = \mathcal{N}(t|y(x, w), \beta^{-1})$, 

so, treating the noise precision $\beta$ as known, the conjugate prior over $w$ is also a Gaussian distribution.

> $p(w) = \mathcal{N}(w|\mu_0, S_0)$

The posterior distribution of $w$ is, 

> $p(w|t) = \mathcal{N}(w|\mu_1, S_1)$

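> $\mu_1 = S_1(S_0^{-1}\mu_0 + \beta \Phi^T t)$

> $S_1^{-1} = S_0^{-1} + \beta \Phi^T \Phi$

where $\Phi$ is the design matrix whose $i$-th row is $\phi(X_i)^T$ and $t$ is the vector of training targets.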
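
The posterior update can be sketched directly from these two formulas. The snippet assumes `Phi` is the $N \times M$ design matrix whose rows are $\phi(X_i)^T$, that $\beta$ is known, and that the toy data come from a straight line. The function name `posterior` and all numeric values are hypothetical.

```python
import numpy as np

def posterior(Phi, t, mu0, S0, beta):
    """Return (mu1, S1) of the Gaussian posterior p(w | t).

    Phi  : N x M design matrix, row i is phi(x_i)^T
    t    : length-N target vector
    mu0  : prior mean, S0 : prior covariance
    beta : known noise precision
    """
    S0_inv = np.linalg.inv(S0)
    S1_inv = S0_inv + beta * Phi.T @ Phi              # S1^{-1} = S0^{-1} + beta Phi^T Phi
    S1 = np.linalg.inv(S1_inv)
    mu1 = S1 @ (S0_inv @ mu0 + beta * Phi.T @ t)      # mu1 = S1 (S0^{-1} mu0 + beta Phi^T t)
    return mu1, S1

# Toy data (assumed for illustration): t = 1 + 2x + noise with precision beta = 25
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
Phi = np.column_stack([np.ones_like(x), x])
beta = 25.0
t = 1.0 + 2.0 * x + rng.normal(0.0, beta ** -0.5, size=x.size)

# Broad zero-mean prior: mu0 = 0, S0 = 10 I
mu1, S1 = posterior(Phi, t, np.zeros(2), 10.0 * np.eye(2), beta)
```

With a broad prior and 100 observations, `mu1` should land near the generating weights $(1, 2)$, and with no observations the posterior reduces to the prior.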

The predictive distribution for a new input $x$ is obtained by integrating out $w$, 

> $p(t^{test}|x, t^{train}) = \int p(t^{test}|x, w)p(w|t^{train})dw$

Because both factors are Gaussian, this integral can be evaluated analytically. Equivalently, it is the expectation of the likelihood under the posterior,

> $p(t^{test}|x, t^{train}) = \mathbb{E}_{w \sim p(w|t^{train})}[p(t^{test}|x, w)]$ 

When this expectation of the likelihood function is not analytical, we resort to approximation methods such as sampling.
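
One such approximation is a Monte Carlo average of the likelihood over posterior samples of $w$. The sketch below assumes a made-up Gaussian posterior and test point (all numbers are hypothetical). It compares the estimate with the closed form that exists in the Gaussian case, $\mathcal{N}(t \mid \mu_1^T\phi(x), \beta^{-1} + \phi(x)^T S_1 \phi(x))$.

```python
import numpy as np

def gaussian_pdf(t, mean, var):
    """Density of N(t | mean, var); works elementwise on arrays."""
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical posterior over w = (w_0, w_1) and noise precision, chosen for illustration
mu1 = np.array([1.0, 2.0])
S1 = np.array([[0.04, 0.01],
               [0.01, 0.09]])
beta = 25.0

phi = np.array([1.0, 0.5])     # phi(x) at the test input x = 0.5
t_test = 2.1                   # target value whose predictive density we want

# Monte Carlo: average the likelihood over samples w_s ~ p(w | t_train)
rng = np.random.default_rng(0)
w_samples = rng.multivariate_normal(mu1, S1, size=100_000)
mc_density = gaussian_pdf(t_test, w_samples @ phi, 1.0 / beta).mean()

# Closed form available in the Gaussian case, used here only as a check
exact_density = gaussian_pdf(t_test, mu1 @ phi, 1.0 / beta + phi @ S1 @ phi)
```

The closed form is only used to validate the estimate. The sampling route is the one that carries over to models where the integral cannot be done by hand.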

# 3. Sampling Methods

## 3.0 Law of Large Numbers (Core)

According to the law of large numbers, 

if the samples $x_1, ..., x_N$ are drawn independently from the distribution $p$, then

> $\mathbb{E}_{p}[f(x)] = \int f(x) p(x) dx \thickapprox \frac{1}{N}\sum_{i=1}^N f(x_i)$
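
A minimal sketch of this estimator, which averages $f$ over $N$ samples drawn from $p$. The choice of $p = \mathcal{N}(0, 1)$ and $f(x) = x^2$ is an assumption for illustration, and `monte_carlo_expectation` is a hypothetical helper name.

```python
import random

def monte_carlo_expectation(f, sampler, n):
    """Approximate E_p[f(x)] by the average of f over n samples from p."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)

# Example (assumed for illustration): x ~ N(0, 1) and f(x) = x^2, so E[f(x)] = 1
estimate = monte_carlo_expectation(lambda x: x ** 2, lambda: rng.gauss(0.0, 1.0), 100_000)
```

The estimate should be close to 1, with an error that shrinks roughly as $1/\sqrt{N}$.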

## 3.1 Standard Distribution

> <b>Theorem</b>: If $z \sim U(0, 1)$ is drawn from a uniform random number generator and $F$ is a cdf with inverse $F^{-1}$, 
> then $F^{-1}(z) \sim F$.

Proof:

> $p(F^{-1}(z) \le x)$

> $= p(z \le F(x))$, (Apply $F$, which is non-decreasing, to both sides)

> $= F(x)$, (Because $p(z \le y) = \int_{0}^{y}dz = y$)

If we want to draw samples from $p(x)$, which is hard to sample from directly, we can 

> + compute the cdf (cumulative distribution function) $F(x)$;
> + compute the inverse of the cdf, $F^{-1}(z)$;
> + sample from the uniform distribution $z \sim U(0, 1)$;
> + obtain the sample value from $p(x)$, which is $F^{-1}(z)$.
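
The steps above can be sketched for a continuous case. The exponential distribution is an assumed example, chosen because its cdf $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(z) = -\ln(1 - z)/\lambda$. The function name `sample_exponential` is hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one sample from Exp(rate) by inverse transform sampling."""
    z = rng.random()                     # z ~ U(0, 1), in [0, 1)
    return -math.log(1.0 - z) / rate     # F^{-1}(z) for F(x) = 1 - exp(-rate * x)

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)    # should be close to 1 / rate = 0.5
```

Using `1.0 - z` keeps the argument of the logarithm strictly positive, since `rng.random()` can return 0 but never 1.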

Example: 

> $p(x=1) = 0.2, p(x=2) = 0.3, p(x=3) = 0.1, p(x=4) = 0.4$

> $F(x=1) = p(x \le 1) = 0.2, F(x=2) = p(x \le 2) = 0.5, F(x=3) = p(x \le 3) = 0.6, F(x=4)  = p(x \le 4) = 1.0$

The inverse cdf $F^{-1}$ maps each cumulative probability back to the corresponding value of $x$,

> $F^{-1}(0.2) = 1, F^{-1}(0.5) = 2, F^{-1}(0.6) = 3, F^{-1}(1.0) = 4$

Draw $z \sim U(0, 1)$ and evaluate $F^{-1}(z)$, taken as the smallest $x$ with $F(x) \ge z$, to obtain samples drawn from $p(x)$,

> $x = 1, \text{ if } 0 < z \le 0.2$

> $x = 2, \text{ if } 0.2 < z \le 0.5$

> $x = 3, \text{ if } 0.5 < z \le 0.6$

> $x = 4, \text{ if } 0.6 < z \le 1$.
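
The discrete example can be sketched as follows, using a binary search over the cumulative probabilities to find the smallest $x$ with $F(x) \ge z$. The function name `sample_discrete` and the sample count are assumptions for illustration.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

values = [1, 2, 3, 4]
probs = [0.2, 0.3, 0.1, 0.4]
cdf = list(accumulate(probs))            # approximately [0.2, 0.5, 0.6, 1.0]

def sample_discrete(rng):
    """Return the smallest value x with F(x) >= z, for z ~ U(0, 1)."""
    z = rng.random()
    # Clamp the index in case floating-point rounding leaves cdf[-1] slightly below 1
    index = min(bisect.bisect_left(cdf, z), len(values) - 1)
    return values[index]

rng = random.Random(0)
counts = Counter(sample_discrete(rng) for _ in range(100_000))
frequencies = {v: counts[v] / 100_000 for v in values}
```

The empirical `frequencies` should approach the probabilities $0.2, 0.3, 0.1, 0.4$ as the number of draws grows.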

## 3.2 Rejection Methods

Usually the cdf of $p(x)$ is hard to compute or invert.

Instead, use a proposal distribution $q(x)$ that is easy to sample from, and a constant $M$ such that, for all $x$,

> $Mq(x) \ge p(x)$

Sampling process:

+ Sample $x$ from $q(x)$.

+ Sample $u$ from the uniform distribution, $u \sim U(0, 1)$.

+ If $u \le \frac{p(x)}{Mq(x)}$, keep $x$; otherwise reject it and start again.

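When only an unnormalized density $\widetilde{p}(x) \propto p(x)$ can be evaluated, the same procedure applies with $\widetilde{p}(x)$ in place of $p(x)$, provided $Mq(x) \ge \widetilde{p}(x)$.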
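
A minimal sketch of the rejection sampling process with an unnormalized target. The target $\widetilde{p}(x) = x(1 - x)$ on $[0, 1]$ (proportional to a Beta(2, 2) density), the uniform proposal, and the names `p_tilde`, `q`, and `rejection_sample` are all assumptions chosen for illustration.

```python
import random

def p_tilde(x):
    """Unnormalized target density on [0, 1]; proportional to Beta(2, 2)."""
    return x * (1.0 - x)

def q(x):
    """Proposal density: uniform on [0, 1]."""
    return 1.0

M = 0.25  # max of p_tilde on [0, 1], so M * q(x) >= p_tilde(x) everywhere

def rejection_sample(n, rng):
    """Draw n samples from the target; also return the number of proposals used."""
    samples, proposals = [], 0
    while len(samples) < n:
        x = rng.random()              # sample x from the proposal q(x)
        u = rng.random()              # sample u ~ U(0, 1)
        proposals += 1
        if u <= p_tilde(x) / (M * q(x)):
            samples.append(x)         # accept; otherwise x is discarded
    return samples, proposals

rng = random.Random(0)
samples, proposals = rejection_sample(50_000, rng)
acceptance_rate = len(samples) / proposals
sample_mean = sum(samples) / len(samples)
```

For this target the acceptance rate is $\int \widetilde{p}(x)dx / M = (1/6)/0.25 = 2/3$, which illustrates why a tight bound $M$ matters: a looser bound wastes more proposals.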

## 3.3 Importance Sampling