# <center> IMA205 - LAB3 Supervised Learning
### <center> Antoine Andurao ###

# OLS #

We assume the fixed design model, i.e. that:
$$
\begin{aligned}
Y &= X\beta + \epsilon \\
\epsilon &\sim \mathcal{N}(0, \sigma^2 I_n)
\end{aligned}
$$

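Let $\tilde\beta = CY$ be a linear unbiased estimator, written as $C = H + D$ with $H = (X^TX)^{-1}X^T$, so that $\beta^* = HY$ is the OLS estimator. Let's compute $\mathbb{E}[\tilde\beta]$.

$$ 
\mathbb{E}[\tilde\beta] = \mathbb{E}[CY] = CX\beta + C\mathbb{E}[\epsilon] = HX\beta + DX\beta = \beta + DX\beta
$$

where the last step uses $HX = I_d$. Since $\tilde\beta$ is unbiased, we have $DX\beta = 0$, $\forall\beta$.

Thus, $DX=0$.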

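Let's compute $\text{Var}(\tilde\beta)$ :

$$
\begin{aligned}
\text{Var}(\tilde\beta) &= \text{Var}(CY) = \text{Var}(CX\beta + C\epsilon) = C\,\text{Var}(\epsilon)\,C^T = \sigma^2CC^T \\
&= \sigma^2(HH^T + HD^T + DH^T + DD^T) \\
&= \sigma^2((X^TX)^{-1} + DD^T) = \text{Var}(\beta^*) + \sigma^2DD^T
\end{aligned}
$$

The cross terms vanish because $DX = 0$ gives $DH^T = DX(X^TX)^{-1} = 0$ and $HD^T = (DH^T)^T = 0$.

The matrix $DD^T$ is positive semidefinite: for any vector $v$,
$$
v^T(\sigma^2DD^T)v = \sigma^2||D^Tv||_2^2 \geq 0
$$

Hence $\text{Var}(\tilde\beta) \succeq \text{Var}(\beta^*)$ in the sense of positive semidefinite matrices, and the two variances differ as soon as $D\neq0$.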
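
The argument above can be checked numerically. The sketch below is only an illustration under assumed values (the sizes `n`, `d`, the noise level `sigma` and the random seed are arbitrary choices, not taken from the lab): it builds a matrix `D` with $DX = 0$ by projecting a random matrix onto the orthogonal complement of the column space of $X$, then verifies that the variance gap equals $\sigma^2DD^T$ and has no negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5            # illustrative values (assumptions)

X = rng.normal(size=(n, d))
H = np.linalg.solve(X.T @ X, X.T)   # H = (X^T X)^{-1} X^T, shape (d, n)

# Build D such that D X = 0: project a random matrix onto the
# orthogonal complement of the column space of X.
P = X @ H                           # orthogonal projector onto col(X)
D = rng.normal(size=(d, n)) @ (np.eye(n) - P)
C = H + D

var_ols = sigma**2 * np.linalg.inv(X.T @ X)
var_tilde = sigma**2 * C @ C.T
gap = var_tilde - var_ols           # expected to equal sigma^2 D D^T

# Symmetrize before eigvalsh to remove rounding asymmetry.
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
print(np.allclose(D @ X, 0),
      np.allclose(gap, sigma**2 * D @ D.T),
      min_eig >= -1e-10)
```

Any other `D` satisfying $DX = 0$ would do; the projection is just a convenient way to generate one.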

# Ridge Regression #

Under the fixed design model, we have that $\beta^*_{ridge} = (\lambda I_d + X^TX)^{-1}X^TY$. In this section the true parameter is denoted $\theta$, i.e. $Y = X\theta + \epsilon$.

**Bias**

$$
\begin{aligned}
\mathbb{E}[\beta^*_{ridge}] &= \mathbb{E}[(\lambda I_d + X^TX)^{-1}X^TY] \\
&= (\lambda I_d + X^TX)^{-1}X^TX\theta + (\lambda I_d + X^TX)^{-1}X^T\mathbb{E}[\epsilon] \\
&= (\lambda I_d + X^TX)^{-1}X^TX\theta \\
&= (\lambda I_d + X^TX)^{-1}(\lambda I_d + X^TX)\theta - \lambda I_d(\lambda I_d + X^TX)^{-1}\theta \\
&= \theta - \lambda I_d(\lambda I_d + X^TX)^{-1}\theta
\end{aligned}
$$

The ridge estimator has a bias of $- \lambda I_d(\lambda I_d + X^TX)^{-1}\theta$.
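
As a sanity check on this bias, here is a minimal NumPy sketch. The design matrix, the true parameter `theta` and the value of `lam` are arbitrary assumptions made for the illustration; the code compares $\mathbb{E}[\beta^*_{ridge}] - \theta$ with the closed-form bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 40, 4, 2.0              # illustrative values (assumptions)
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)          # hypothetical true parameter

G = X.T @ X
A = np.linalg.inv(lam * np.eye(d) + G)

expected_ridge = A @ G @ theta      # E[beta_ridge], using E[eps] = 0
bias_formula = -lam * A @ theta     # closed-form bias from the derivation
print(np.allclose(expected_ridge - theta, bias_formula))
```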

**SVD Decomposition**

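Let $X = UDV^T$ be the (thin) SVD of $X$, with $U^TU = I_d$, $V^TV = VV^T = I_d$ and $D$ diagonal (not to be confused with the matrix $D$ of the OLS section). Below, $D^2 + \lambda$ is shorthand for $D^2 + \lambda I_d$.

$$
\begin{aligned}
\beta^*_{ridge} &= (\lambda I_d + X^TX)^{-1}X^TY = (\lambda VV^T + VD^2V^T)^{-1}VDU^TY \\
&= (V(D^2 + \lambda)V^T)^{-1}VDU^TY = V(D^2 + \lambda)^{-1}V^TVDU^TY \\
&= V(D^2 + \lambda)^{-1}DU^TY
\end{aligned}
$$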
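
The SVD form of the ridge estimator can be compared with a direct linear solve. This is a sketch on random data (sizes, `lam` and seed are assumptions); it relies on `np.linalg.svd` with `full_matrices=False`, which returns the thin SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 30, 5, 0.7              # illustrative values (assumptions)
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Direct form: (lam I + X^T X)^{-1} X^T Y
beta_direct = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Y)

# SVD form: V (D^2 + lam)^{-1} D U^T Y, with X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ Y))

print(np.allclose(beta_direct, beta_svd))
```

The SVD route is convenient when several values of $\lambda$ are tried, since the decomposition is computed once and only the diagonal factor changes.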

**Variance**

Let's compute $\text{Var}(\beta^*_{ridge})$ :

$$
\begin{aligned}
\text{Var}(\beta^*_{ridge}) &= \mathbb{E}[(\beta^*_{ridge} - \mathbb{E}[\beta^*_{ridge}])(\beta^*_{ridge} - \mathbb{E}[\beta^*_{ridge}])^T] \\
&= \mathbb{E}[(V(D^2 + \lambda)^{-1}DU^T\epsilon)(V(D^2 + \lambda)^{-1}DU^T\epsilon)^T] \\
&= \mathbb{E}[V(D^2 + \lambda)^{-1}DU^T\epsilon\epsilon^TUD(D^2 + \lambda)^{-1}V^T] \\
&= \sigma^2V(D^2 + \lambda)^{-1}D^2(D^2 + \lambda)^{-1}V^T
\end{aligned}
$$

The last step uses $\mathbb{E}[\epsilon\epsilon^T] = \sigma^2 I_n$ and $U^TU = I_d$.

Writing $D = \text{diag}(d_i)$, we have $ \text{Var}(\beta^*_{ridge}) = \sigma^2V\text{diag}(\frac{d_i^2}{(d_i^2 + \lambda)^2})V^T$

Thus : $ \text{Var}(\beta^*_{ridge}) -  \text{Var}(\beta^*_{OLS}) = \sigma^2V\text{diag}(\frac{d_i^2}{(d_i^2 + \lambda)^2} - \frac{1}{d_i^2})V^T = \sigma^2V\text{diag}(\frac{1}{d_i^2}( \frac{1}{(1+ \frac{\lambda}{d_i^2})^2} - 1))V^T$.

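$\forall i$, $\frac{1}{d_i^2}\left( \frac{1}{(1+ \frac{\lambda}{d_i^2})^2} - 1\right) \leq 0 $ as long as $\lambda \geq 0$.

Hence, if $\lambda \geq 0$, the matrix $\text{Var}(\beta^*_{OLS}) - \text{Var}(\beta^*_{ridge})$ is positive semidefinite, i.e. $\text{Var}(\beta^*_{ridge}) \preceq  \text{Var}(\beta^*_{OLS})$.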
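
The variance comparison can be illustrated numerically. In the sketch below (random design, arbitrary `lam` and `sigma`, all assumptions), the ridge covariance is computed both from the operator $(\lambda I_d + X^TX)^{-1}X^T$ and from the SVD formula, and the eigenvalues of $\text{Var}(\beta^*_{ridge}) - \text{Var}(\beta^*_{OLS})$ are checked to be non-positive.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam, sigma = 60, 4, 1.5, 0.8  # illustrative values (assumptions)
X = rng.normal(size=(n, d))

# Ridge operator A such that beta_ridge = A Y, so Var = sigma^2 A A^T
A = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T)
var_ridge = sigma**2 * A @ A.T

# Same covariance from the SVD formula: sigma^2 V diag(d_i^2/(d_i^2+lam)^2) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_ridge_svd = sigma**2 * (Vt.T * (s**2 / (s**2 + lam)**2)) @ Vt

var_ols = sigma**2 * np.linalg.inv(X.T @ X)
diff_eigs = np.linalg.eigvalsh(var_ridge - var_ols)

print(np.allclose(var_ridge, var_ridge_svd), diff_eigs.max() <= 1e-12)
```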

**Influence of parameter $\lambda$**

For the variance, we have :

$$
\begin{aligned}
\lim_{\lambda \to +\infty} \text{Var}(\beta^*_{ridge}) &= 0 \\
\lim_{\lambda \to 0} \text{Var}(\beta^*_{ridge}) &= \text{Var}(\beta^*_{OLS}) = \sigma^2(X^TX)^{-1}
\end{aligned}
$$

For the expected value, we have :

$\mathbb{E}[\beta^*_{ridge}] = \mathbb{E}[(\lambda I_d + X^TX)^{-1}X^TX\theta] = V\text{diag}(\frac{d_i^2}{(d_i^2 + \lambda)})V^T\theta$

Thus : 

$$
\begin{aligned}
\lim_{\lambda \to +\infty} \mathbb{E}[\beta^*_{ridge}] &= 0 \\
\lim_{\lambda \to 0} \mathbb{E}[\beta^*_{ridge}] &= \mathbb{E}[\beta^*_{OLS}] = \theta
\end{aligned}
$$
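
To make the effect of $\lambda$ concrete, the following sketch evaluates the SVD formulas for the expected value and the variance over a few values of $\lambda$. The design, the vector `theta`, the grid of values and the helper name `ridge_moments` are all assumptions chosen for the illustration; the trace is used as a scalar summary of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 50, 3, 1.0            # illustrative values (assumptions)
X = rng.normal(size=(n, d))
theta = np.array([1.0, -2.0, 0.5])  # hypothetical true parameter

_, s, Vt = np.linalg.svd(X, full_matrices=False)

def ridge_moments(lam):
    """Return (E[beta_ridge], trace of Var(beta_ridge)) from the SVD formulas."""
    mean = Vt.T @ ((s**2 / (s**2 + lam)) * (Vt @ theta))
    var_trace = sigma**2 * np.sum(s**2 / (s**2 + lam)**2)
    return mean, var_trace

for lam in [0.0, 1.0, 100.0, 1e6]:
    mean, var_trace = ridge_moments(lam)
    print(f"lambda={lam:g}  ||E[beta]||={np.linalg.norm(mean):.4f}  "
          f"tr(Var)={var_trace:.6f}")
```

Both quantities shrink as $\lambda$ grows: the variance decreases, but the expected value moves away from $\theta$, which is the usual bias-variance trade-off.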

**Relation with OLS**

Let's assume that $X^TX = I_d$, and recall that $\beta^*_{ridge} = (\lambda I_d + X^TX)^{-1}X^TY = \frac{1}{\lambda + 1}X^TY$ and $\beta^*_{OLS} = (X^TX)^{-1}X^TY = X^TY$

Hence $\beta^*_{ridge} = \frac{\beta^*_{OLS}}{\lambda + 1}$.
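
This scaling relation is easy to verify on an orthonormal design. In the sketch below the orthonormal `X` is obtained from a reduced QR factorization of a random matrix, which is just one convenient way to satisfy $X^TX = I_d$; sizes and `lam` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 20, 4, 3.0              # illustrative values (assumptions)

# Reduced QR gives a matrix with orthonormal columns: X^T X = I_d
X, _ = np.linalg.qr(rng.normal(size=(n, d)))
Y = rng.normal(size=n)

beta_ols = X.T @ Y
beta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Y)

print(np.allclose(X.T @ X, np.eye(d)),
      np.allclose(beta_ridge, beta_ols / (1 + lam)))
```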

# Elastic Net #

Let's assume that $X^TX = I_d$, and recall that $\beta^*_{OLS} = (X^TX)^{-1}X^TY = X^TY$

Let $f$ be our objective function : $f(\beta)=||Y - X\beta||^2 + \lambda_2||\beta||^2_2 + \lambda_1||\beta||_1$

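Note that $ ||\beta||_1 = \sum_{i=1}^d |\beta_i|$ is not differentiable at $\beta_i = 0$. For $\beta_i \neq 0$, $\frac{\partial}{\partial \beta_i}(\lambda_1||\beta||_1) = \lambda_1 \text{sign}(\beta_i)$; at $\beta_i = 0$ the subdifferential is the interval $[-\lambda_1, \lambda_1]$.

Using $X^TX = I_d$, wherever $\beta$ has no zero coordinate :

$$
\begin{aligned}
\nabla_\beta f (\beta) &= -2X^T(Y - X\beta) + 2\lambda_2 \beta + \lambda_1 \text{sign}(\beta) \\
\nabla_\beta f (\beta) = 0 &\iff 2X^TY - \lambda_1 \text{sign}(\beta) = 2(1+\lambda_2)\beta
\end{aligned}
$$

A nonzero solution $\beta_i$ has the same sign as $\beta^*_{OLS,i}$, which requires $|\beta^*_{OLS,i}| > \frac{\lambda_1}{2}$; otherwise the optimality condition $0 \in \partial f(\beta)$ is satisfied with $\beta_i = 0$.

Hence, coordinate-wise : $\beta^*_{ElNet,i} = \frac{\text{sign}(\beta^*_{OLS,i})\max(|\beta^*_{OLS,i}| - \frac{\lambda_1}{2}, 0)}{1+\lambda_2}$, i.e. $\beta^*_{ElNet} = \frac{\beta^*_{OLS} \pm \frac{\lambda_1}{2}}{(1+\lambda_2)}$ on the nonzero coordinates, with the sign chosen opposite to that of $\beta^*_{OLS,i}$.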
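
Under $X^TX = I_d$ the objective separates coordinate by coordinate, and its minimizer can be written as a soft-thresholding of $\beta^*_{OLS}$ at level $\frac{\lambda_1}{2}$ followed by a division by $1+\lambda_2$. The sketch below implements that closed form and checks on random data that no random perturbation achieves a lower objective value. The function names, sizes and penalty values are assumptions made for the illustration, and the perturbation test is a numerical sanity check, not a proof.

```python
import numpy as np

def elastic_net_orthonormal(beta_ols, lam1, lam2):
    """Closed-form minimizer of ||Y - X b||^2 + lam2 ||b||_2^2 + lam1 ||b||_1
    when X^T X = I (soft-thresholding at lam1 / 2, then shrinkage)."""
    shrunk = np.maximum(np.abs(beta_ols) - lam1 / 2, 0.0)
    return np.sign(beta_ols) * shrunk / (1 + lam2)

def objective(beta, X, Y, lam1, lam2):
    """Elastic net objective f(beta) as defined above."""
    return (np.sum((Y - X @ beta)**2)
            + lam2 * np.sum(beta**2)
            + lam1 * np.sum(np.abs(beta)))

rng = np.random.default_rng(6)
n, d, lam1, lam2 = 25, 5, 1.0, 0.5             # illustrative values (assumptions)
X, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns
Y = rng.normal(size=n)

beta_hat = elastic_net_orthonormal(X.T @ Y, lam1, lam2)
f_hat = objective(beta_hat, X, Y, lam1, lam2)

# If beta_hat is the global minimizer of this convex function,
# random perturbations should never give a lower objective value.
trials = beta_hat + 0.1 * rng.normal(size=(1000, d))
f_trials = np.array([objective(b, X, Y, lam1, lam2) for b in trials])
print(beta_hat, bool(np.all(f_trials >= f_hat - 1e-12)))
```

Coordinates with $|\beta^*_{OLS,i}| \leq \frac{\lambda_1}{2}$ are set exactly to zero, which is the sparsity effect of the $\ell_1$ term; setting `lam1 = 0` recovers the ridge scaling $\frac{\beta^*_{OLS}}{1+\lambda_2}$.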