# DATA6811 - Assignment 3 - Model Selection and Averaging

In [2]:
import pandas as pd
import numpy as np

### (a) Find the best value of $\lambda$ for given Lasso Regression

Given the Lasso regression results below, we use the BIC to identify the best value of $\lambda$ among the candidate values.

In [7]:
model_selection_df = pd.DataFrame(
    {
        'λ': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'df' : [8, 5, 3, 3, 2, 1, 1, 1, 1, 0],
        'σ̂^2': [0.9033, 0.8116, 1.8865, 4.1080, 4.6396, 5.1962, 5.7991, 6.4485, 7.1444, 7.4611]
    }
)
model_selection_df

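| | λ | df | σ̂^2 |
|---|---|---|---|
| 0 | 0.0 | 8 | 0.9033 |
| 1 | 0.1 | 5 | 0.8116 |
| 2 | 0.2 | 3 | 1.8865 |
| 3 | 0.3 | 3 | 4.108 |
| 4 | 0.4 | 2 | 4.6396 |
| 5 | 0.5 | 1 | 5.1962 |
| 6 | 0.6 | 1 | 5.7991 |
| 7 | 0.7 | 1 | 6.4485 |
| 8 | 0.8 | 1 | 7.1444 |
| 9 | 0.9 | 0 | 7.4611 |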


### Bayesian Information Criterion (BIC)

BIC for lasso regression can be computed as:

$$BIC(\lambda) = \log\bigg(\frac{||\mathbf{y} - \mathbf{X}\hat\beta^{lasso}_\lambda ||^2}{n}\bigg) + {df}_\lambda \frac{\log(n)}{n}$$

We are given that

$$\hat\sigma_{\lambda}^2 = \frac{||\mathbf{y} - \mathbf{X}\hat\beta^{lasso}_\lambda ||^2}{n}$$

so the criterion can be written as

$$BIC(\lambda) = \log(\hat\sigma_{\lambda}^2) + {df}_\lambda \frac{\log(n)}{n}$$

In [10]:
n = 97
model_selection_df.loc[:, 'BIC'] = np.log(model_selection_df.loc[:, 'σ̂^2']) \
                                    + model_selection_df.loc[:, 'df'] \
                                    * np.log(n)/n
model_selection_df

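| | λ | df | σ̂^2 | BIC |
|---|---|---|---|---|
| 0 | 0.0 | 8 | 0.9033 | 0.275595 |
| 1 | 0.1 | 5 | 0.8116 | 0.027062 |
| 2 | 0.2 | 3 | 1.8865 | 0.776209 |
| 3 | 0.3 | 3 | 4.108 | 1.554422 |
| 4 | 0.4 | 2 | 4.6396 | 1.628952 |
| 5 | 0.5 | 1 | 5.1962 | 1.69509 |
| 6 | 0.6 | 1 | 5.7991 | 1.804865 |
| 7 | 0.7 | 1 | 6.4485 | 1.91101 |
| 8 | 0.8 | 1 | 7.1444 | 2.013491 |
| 9 | 0.9 | 0 | 7.4611 | 2.009703 |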
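
As a quick check on the table, the row for $\lambda = 0.1$ (with $df = 5$, $\hat\sigma^2 = 0.8116$ and $n = 97$, using the natural logarithm as in the code) can be reproduced by hand:

$$\log(0.8116) + 5 \cdot \frac{\log(97)}{97} \approx -0.2087 + 0.2358 = 0.0271$$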


The best value of $\lambda$ is the one that minimises the BIC.

In [17]:
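# idxmin returns the index label of the smallest BIC, which is what .loc expects
# (np.argmin returns a position and only works here because of the default index)
λ_best, σ̂_sq_best, BIC_best = model_selection_df.loc[model_selection_df['BIC'].idxmin(), ['λ', 'σ̂^2', 'BIC']]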
print(f"The best value of λ is {λ_best} with the estimated variance of error as {σ̂_sq_best} and BIC value as {BIC_best:.4f}")

The best value of λ is 0.1 with the estimated variance of error as 0.8116 and BIC value as 0.0271


### (b) Why is variable selection often necessary in regression and classification?

**Ans:** Variable selection can be viewed as a special case of model selection. When building a statistical model, we often start with a large number of covariates to capture as much of the variability in the data as possible. However, as the number of covariates (the size of the model) grows, the predictions become more variable and the model starts to overfit.

This can be seen from the following decomposition. The expected prediction error of a linear regression model $M$ at a point $\mathbf{x}$ consists of three parts:

$$
\begin{eqnarray}
Error(M) &=& Variance + Bias^2 + \sigma^2 \\
&=& \mathbf{x}'cov(\hat\beta)\mathbf{x} + (\mathbb{E}(\mathbf{x}'\hat\beta) - f(\mathbf{x}))^2 + \sigma^2 \\
&=& \sigma^2 \sum_{i=1}^{p} x_i^2 + (\mathbb{E}(\mathbf{x}'\hat\beta) - f(\mathbf{x}))^2 + \sigma^2
\end{eqnarray}
$$

where $f(\mathbf{x}) = \mathbb{E}(Y\mid X=\mathbf{x})$ is the true regression function and $\sigma^2$ is the noise variance, a constant that does not depend on the model $M$. The last line holds for the least-squares estimator when the design is orthonormal ($\mathbf{X}'\mathbf{X} = \mathbf{I}$), so that $cov(\hat\beta) = \sigma^2\mathbf{I}$; in general, $cov(\hat\beta) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$.

As the number of covariates $p$ increases, the variance term grows while the bias shrinks; removing covariates has the opposite effect. Choosing the right set of covariates therefore balances the variance and the bias of the model's predictions. A model with high variance may fit the training data well but generalise poorly to unseen data, whereas a model with high bias tends to underfit and perform poorly on both training and unseen data.
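
The trade-off can be illustrated numerically. The sketch below is not part of the assignment data: it assumes a hypothetical fixed design with 8 standard-normal covariates, of which only the first three have non-zero coefficients, and a test point of all ones. For nested least-squares models using the first $p$ covariates, it evaluates the variance term $\sigma^2\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}$ and the squared bias in closed form. The function name `bias_variance_profile` and all parameter values are illustrative assumptions.

```python
import numpy as np

def bias_variance_profile(n=100, p_max=8, sigma=1.0, seed=0):
    """Variance and squared bias of the OLS prediction at one test point,
    for nested models that use the first p covariates (p = 1..p_max)."""
    rng = np.random.default_rng(seed)

    # Hypothetical truth: only the first three covariates matter.
    beta = np.zeros(p_max)
    beta[:3] = [2.0, -1.0, 0.5]

    X = rng.normal(size=(n, p_max))  # fixed design matrix (assumed)
    x0 = np.ones(p_max)              # test point (assumed)
    f0 = x0 @ beta                   # true mean response at x0
    mean_y = X @ beta                # E[y] under the true model

    rows = []
    for p in range(1, p_max + 1):
        Xp, x0p = X[:, :p], x0[:p]
        gram_inv = np.linalg.inv(Xp.T @ Xp)
        # Variance term: sigma^2 * x0' (X'X)^{-1} x0
        variance = sigma**2 * x0p @ gram_inv @ x0p
        # E[beta_hat] for the (possibly misspecified) submodel
        expected_beta = gram_inv @ Xp.T @ mean_y
        bias_sq = (x0p @ expected_beta - f0) ** 2
        rows.append((p, variance, bias_sq))
    return rows

for p, variance, bias_sq in bias_variance_profile():
    print(f"p={p}: variance={variance:.4f}, bias^2={bias_sq:.4f}")
```

Under these assumptions the variance term never decreases as covariates are added, while the squared bias drops to zero (up to rounding) once all three relevant covariates are in the model. Covariates added beyond that point only contribute variance, which is the case variable selection is meant to avoid.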