# Linear Regression

## Model Specification
$$ p ( y | \mathbf { x } , \boldsymbol { \theta } ) = \mathcal { N } ( y | \mathbf { w } ^ { T } \mathbf { x } , \sigma ^ { 2 } )$$

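Linear regression can model non-linear relationships by replacing $\mathbf{x}$ with a non-linear function $\boldsymbol{\phi}(\mathbf{x})$ of the inputs (basis function expansion), for example the polynomial basis $\phi(x) = \left[ 1 , x , x ^ { 2 } , \ldots , x ^ { d } \right]$. The model remains linear in the parameters $\mathbf{w}$: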
$$ p ( y | \mathbf { x } , \boldsymbol { \theta } ) = \mathcal { N } ( y | \mathbf { w } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) , \sigma ^ { 2 } )$$
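
As an illustration of the polynomial basis expansion, here is a minimal sketch for scalar inputs. The helper names `poly_features` and `predict_mean` and the weight values are hypothetical, chosen only for this example.

```python
def poly_features(x, d):
    """Map a scalar x to phi(x) = [1, x, x^2, ..., x^d]."""
    return [x ** k for k in range(d + 1)]


def predict_mean(w, x):
    """Mean of the Gaussian, w^T phi(x); the degree d is inferred from len(w)."""
    phi = poly_features(x, len(w) - 1)
    return sum(w_k * phi_k for w_k, phi_k in zip(w, phi))


# Hypothetical weights for illustration: mean(x) = 1 + 2x + 3x^2
w = [1.0, 2.0, 3.0]
phi_2 = poly_features(2.0, 2)   # [1.0, 2.0, 4.0]
mean_2 = predict_mean(w, 2.0)   # 1 + 4 + 12 = 17.0
```

The prediction is a non-linear function of `x` but still a plain dot product in `w`, which is why the fitting machinery for linear regression carries over unchanged.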

## Maximum Likelihood Estimation (MLE)
$$\hat { \boldsymbol { \theta } } \triangleq \arg \max _ { \boldsymbol { \theta } } \log p ( \mathcal { D } | \boldsymbol { \theta } )$$

Assume training examples are independent and identically distributed (iid):
$$\ell ( \boldsymbol { \theta } ) \triangleq \log p ( \mathcal { D } | \boldsymbol { \theta } ) = \sum _ { i = 1 } ^ { N } \log p \left( y _ { i } | \mathbf { x } _ { i } , \boldsymbol { \theta } \right)$$

Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL):
$$\operatorname { NLL } ( \boldsymbol { \theta } ) \triangleq - \sum _ { i = 1 } ^ { N } \log p \left( y _ { i } | \mathbf { x } _ { i } , \boldsymbol { \theta } \right)$$

Substituting the Gaussian likelihood gives:
$$\begin{aligned} \ell ( \boldsymbol { \theta } ) & = \sum _ { i = 1 } ^ { N } \log \left[ \left( \frac { 1 } { 2 \pi \sigma ^ { 2 } } \right) ^ { \frac { 1 } { 2 } } \exp \left( - \frac { 1 } { 2 \sigma ^ { 2 } } \left( y _ { i } - \mathbf { w } ^ { T } \mathbf { x } _ { i } \right) ^ { 2 } \right) \right] \\ & = \frac { - 1 } { 2 \sigma ^ { 2 } } \operatorname { RSS } ( \mathbf { w } ) - \frac { N } { 2 } \log \left( 2 \pi \sigma ^ { 2 } \right) \end{aligned}$$

RSS is the residual sum of squares, also called the sum of squared errors (SSE). Dividing by $N$ gives the mean squared error: $\text{MSE} = \text{SSE}/N$.

$$\operatorname { RSS } ( \mathbf { w } ) \triangleq \sum _ { i = 1 } ^ { N } \left( y _ { i } - \mathbf { w } ^ { T } \mathbf { x } _ { i } \right) ^ { 2 }$$
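
To check the Gaussian log-likelihood identity numerically, the sketch below compares the closed form $-\frac{1}{2\sigma^2}\operatorname{RSS}(\mathbf{w}) - \frac{N}{2}\log(2\pi\sigma^2)$ with a direct sum of per-example log densities. The toy data, weights, and function names are made up for illustration.

```python
import math


def dot(w, x):
    """Inner product w^T x for plain Python lists."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def rss(w, X, y):
    """Residual sum of squares: sum_i (y_i - w^T x_i)^2."""
    return sum((y_i - dot(w, x_i)) ** 2 for x_i, y_i in zip(X, y))


def log_lik_closed_form(w, sigma2, X, y):
    """-RSS / (2 sigma^2) - (N / 2) log(2 pi sigma^2)."""
    n = len(y)
    return -rss(w, X, y) / (2 * sigma2) - 0.5 * n * math.log(2 * math.pi * sigma2)


def log_lik_direct(w, sigma2, X, y):
    """Sum of log N(y_i | w^T x_i, sigma^2) over all examples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mu = dot(w, x_i)
        total += -0.5 * math.log(2 * math.pi * sigma2) - (y_i - mu) ** 2 / (2 * sigma2)
    return total


# Made-up toy data; the first column of X is a constant 1 for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0.5, 1.5, 3.5]
w = [0.0, 1.5]
sigma2 = 0.25

rss_value = rss(w, X, y)       # residuals are 0.5, 0.0, 0.5, so RSS = 0.5
mse_value = rss_value / len(y)
ll_closed = log_lik_closed_form(w, sigma2, X, y)
ll_direct = log_lik_direct(w, sigma2, X, y)
```

The two log-likelihood values agree up to floating-point rounding. For fixed $\sigma^2$, only the RSS term depends on `w`.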

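Only the RSS term depends on $\mathbf{w}$, so the MLE of $\mathbf{w}$ is the minimizer of the RSS. Setting the gradient to zero gives the ordinary least squares (OLS) solution, assuming $\mathbf{X}^T\mathbf{X}$ is invertible: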
$$\hat { \mathbf { w } } _ { O L S } = \left( \mathbf { X } ^ { T } \mathbf { X } \right) ^ { - 1 } \mathbf { X } ^ { T } \mathbf { y }$$
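
A minimal NumPy sketch of the OLS solution on synthetic data. The data-generating weights `w_true`, the noise level, and the sample size are assumptions made for this example. The normal equations are solved with `np.linalg.solve` rather than an explicit matrix inverse, and the result is compared against `np.linalg.lstsq`, which is usually the more numerically robust choice.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 0.5 - 2.0 * x + Gaussian noise
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])   # prepend a column of ones for the intercept
w_true = np.array([0.5, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(N)

# Normal equations: solve (X^T X) w = X^T y without forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver on the same problem
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimizer the residual is orthogonal to the columns of X
residual = y - X @ w_normal
```

Both routes give the same weights here. When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, `lstsq` tends to be the safer option.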

## Ridge Regression, Weight Decay, L2-Regularization, MAP estimation
The MLE can overfit the training data, so we place a zero-mean Gaussian prior on the weights:
$$p ( \mathbf { w } ) = \prod _ { j } \mathcal { N } \left( w _ { j } | 0 , \tau ^ { 2 } \right)$$

The precision $1/\tau^2$ controls the strength of the prior. The corresponding MAP estimate is:
$$\underset { \mathbf { w } } { \operatorname { argmax } } \sum _ { i = 1 } ^ { N } \log \mathcal { N } \left( y _ { i } | w _ { 0 } + \mathbf { w } ^ { T } \mathbf { x } _ { i } , \sigma ^ { 2 } \right) + \sum _ { j = 1 } ^ { D } \log \mathcal { N } \left( w _ { j } | 0 , \tau ^ { 2 } \right)$$

This is equivalent to minimizing the following objective, where $\lambda \geq 0$ is proportional to $\sigma^2/\tau^2$:
$$J ( \mathbf { w } ) = \frac { 1 } { N } \sum _ { i = 1 } ^ { N } \left( y _ { i } - \left( w _ { 0 } + \mathbf { w } ^ { T } \mathbf { x } _ { i } \right) \right) ^ { 2 } + \lambda \| \mathbf { w } \| _ { 2 } ^ { 2 }$$
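
A brief worked step, added as a sketch, showing where the $\ell_2$ penalty comes from. Using $\log \mathcal{N}(w_j | 0, \tau^2) = -\frac{w_j^2}{2\tau^2} - \frac{1}{2}\log(2\pi\tau^2)$ and dropping terms that do not depend on $\mathbf{w}$, the negative log posterior is:

$$\frac { 1 } { 2 \sigma ^ { 2 } } \sum _ { i = 1 } ^ { N } \left( y _ { i } - \left( w _ { 0 } + \mathbf { w } ^ { T } \mathbf { x } _ { i } \right) \right) ^ { 2 } + \frac { 1 } { 2 \tau ^ { 2 } } \| \mathbf { w } \| _ { 2 } ^ { 2 }$$

Multiplying by the positive constant $2\sigma^2/N$ does not change the minimizer and yields the $\frac{1}{N}$-scaled objective with $\lambda = \sigma^2/(N\tau^2)$. Multiplying by $2\sigma^2$ instead gives the unscaled form $\operatorname{RSS}(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2$ with $\lambda = \sigma^2/\tau^2$.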

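Solution (with the intercept $w_0$ handled by centering the data and the factor $1/N$ absorbed into $\lambda$):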
$$\hat { \mathbf { w } } _ { r i d g e } = \left( \lambda \mathbf { I } _ { D } + \mathbf { X } ^ { T } \mathbf { X } \right) ^ { - 1 } \mathbf { X } ^ { T } \mathbf { y }$$
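
A minimal NumPy sketch of the ridge solution. It assumes centered inputs and targets so that the unpenalized intercept $w_0$ can be dropped. The function name `ridge_fit`, the synthetic data, and the value of `lam` are assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Solve (lam * I_D + X^T X) w = X^T y for w (no explicit inverse)."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)


# Synthetic data, assumed for illustration
rng = np.random.default_rng(0)
N, D = 30, 5
X = rng.standard_normal((N, D))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.5 * rng.standard_normal(N)

# Center inputs and targets so the intercept does not need to be modeled
X = X - X.mean(axis=0)
y = y - y.mean()

lam = 10.0
w_ols = ridge_fit(X, y, 0.0)     # lam = 0 recovers the OLS solution
w_ridge = ridge_fit(X, y, lam)   # lam > 0 shrinks the weights toward zero

norm_ols = np.linalg.norm(w_ols)
norm_ridge = np.linalg.norm(w_ridge)
```

For any `lam > 0` the ridge weights have a smaller Euclidean norm than the OLS weights, which is the shrinkage effect of the zero-mean prior.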

![L2 Regularization](../images/7.Regularization.png)