# Linear Model Selection And Regularisation

In [1]:
import pandas as pd, numpy as np

**1. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain $p + 1$ models, containing $0, 1, 2, \ldots, p$ predictors. Explain your answers.**

*(a) Which of the three models with k predictors has the smallest training RSS?*

Best subset selection has the smallest training RSS for any given $k$. It fits every one of the $\binom{p}{k}$ models with $k$ predictors and keeps the one with the lowest training RSS, so no other $k$-predictor model can beat it on the training data. Forward and backward stepwise selection are constrained by the predictors added or removed in earlier steps, so they search only some of these models. Their training RSS for a given $k$ is therefore at least as large as that of best subset, with equality when they happen to select the same model.
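The comparison can be checked numerically. The following is a minimal sketch on synthetic data, not part of the original exercise: the helper names `rss`, `best_subset` and `forward_stepwise`, the random seed and the simulated design are all assumptions made for illustration. It records the training RSS of the model chosen at each size $k$ by best subset and by forward stepwise selection.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Training RSS of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """Smallest training RSS over all models of each size k = 0, ..., p."""
    p = X.shape[1]
    return [min(rss(X, y, c) for c in combinations(range(p), k)) for k in range(p + 1)]


def forward_stepwise(X, y):
    """Training RSS along the forward stepwise path for k = 0, ..., p."""
    p = X.shape[1]
    chosen, path = [], [rss(X, y, [])]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in chosen]
        # Greedy step: add the single predictor that lowers the RSS the most.
        best_j = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best_j)
        path.append(rss(X, y, chosen))
    return path


# Synthetic data (assumed for illustration): two predictors are highly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)
y = X[:, 0] - X[:, 1] + X[:, 2] + rng.normal(size=50)

best_rss = best_subset(X, y)
forward_rss = forward_stepwise(X, y)
for k, (b, f) in enumerate(zip(best_rss, forward_rss)):
    print(f"k={k}: best subset RSS={b:.3f}, forward stepwise RSS={f:.3f}")
```

For every $k$ the best subset RSS is no larger than the forward stepwise RSS, and the two agree at $k = 0$ and $k = p$, where only one model exists.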

*(b) Which of the three models with k predictors has the smallest test RSS?*

It is not possible to say in general which model will have the smallest test RSS. Best subset searches far more models, so it is more likely to overfit the training data, which can lead to a poor test RSS. Forward and backward stepwise selection are constrained by their choices at previous steps, which may strike a better balance between bias and variance and lead to a good test RSS. Which approach does best depends on the data.

**2. Comparing Regression Methods**

For parts (a) through (c), indicate which of the following is correct. Justify your answer.

- More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
- Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
- Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

*(a) The lasso relative to least squares*

Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The penalty term in the lasso objective shrinks the coefficient estimates and sets some of them exactly to zero when the tuning parameter is sufficiently large. This leads to a simpler, less flexible model than least squares.
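The shrink-to-zero behaviour is easiest to see in a special case. The sketch below assumes $n = p$, an identity design matrix and no intercept, so that the lasso minimises $\sum_{j}(y_{j} - \beta_{j})^{2} + \lambda\sum_{j}|\beta_{j}|$; in that case the solution is soft-thresholding of each $y_{j}$ at $\lambda/2$. The function name `lasso_identity_design` and the example values are illustrative assumptions.

```python
import numpy as np


def lasso_identity_design(y, lam):
    """Lasso estimates when the design matrix is the identity (soft-thresholding at lam / 2)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)


# With the identity design, the least squares estimates are simply y itself.
y = np.array([3.0, -1.5, 0.4, -0.2])
for lam in [0.0, 1.0, 4.0]:
    print(f"lambda={lam}: {lasso_identity_design(y, lam)}")
```

As $\lambda$ grows, more coefficients become exactly zero, which is the sense in which the lasso is less flexible than least squares.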

*(b) Ridge regression relative to least squares*

Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The penalty term in the ridge regression objective shrinks the coefficient estimates towards zero, although it does not set them exactly to zero. This leads to a simpler, less flexible model than least squares.
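A small numerical sketch of this shrinkage follows. It assumes centred synthetic data and no intercept, so that the ridge estimate has the closed form $(X^{T}X + \lambda I)^{-1}X^{T}y$; the function name `ridge_coefficients`, the seed and the simulated coefficients are illustrative assumptions.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimate (no intercept): solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Synthetic, centred data (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=40)
y -= y.mean()

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam}: ||beta||_2 = {norm:.4f}")
```

The norm of the coefficient vector falls as $\lambda$ increases, but the individual coefficients stay non-zero.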

*(c) Non-linear methods relative to least squares*

More flexible, and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods can fit a wider range of relationships between the predictors and the response. Hence they are more flexible than least squares, which can only fit linear models.


**3. Suppose we estimate the regression coefficients in a linear regression model by minimizing**

$$
\sum^{n}_{i=1} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^{2} \text{ subject to } \sum_{j=1}^{p}|\beta_{j}|\le s
$$

**for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.**

**4. Suppose we estimate the regression coefficients in a linear regression model by minimizing**

$$
\sum^{n}_{i=1} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^{2} \text{ subject to } \sum_{j=1}^{p}\beta_{j}^{2}\le s
$$

**5. It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.**

Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Furthermore, suppose that $y_{1}+y_{2} = 0$ and $x_{11}+x_{21} = 0$ and $x_{12}+x_{22} = 0$, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: $\hat{\beta}_{0} = 0$.

Let's create a Python representation of this dataset to aid our understanding.

Here we have $x_{11} = x_{12} = 2$ and $x_{21} = x_{22} = -2$. Furthermore, $y_{1} + y_{2} = -1 + 1 = 0$ and $x_{11} + x_{21} = 2 - 2 = 0$ and $x_{12} + x_{22} = 2 - 2 = 0$.

In [10]:
dataset = pd.DataFrame(data = {
    'x1':[2, -2],
    'x2':[2, -2],
    'y':[-1, 1]
})

dataset

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 2 | 2 | -1 |
| 1 | -2 | -2 | 1 |


Note that our predictors are perfectly positively correlated with each other.

In [11]:
dataset.corr()

| | x1 | x2 | y |
|---|---|---|---|
| x1 | 1.0 | 1.0 | -1.0 |
| x2 | 1.0 | 1.0 | -1.0 |
| y | -1.0 | -1.0 | 1.0 |


*(a) Write out the ridge regression optimization problem in this setting.*

In this setting ridge regression aims to minimise the following expression,

$$
\sum_{i=1}^{2}(y_{i} - \sum_{j=1}^{2}\beta_{j}x_{ij})^{2} + \lambda\sum_{j=1}^{2}\beta^{2}_{j}
$$

Because $x_{i1} = x_{i2}$, we can write $x_{i}$ for their common value, which gives

$$
\sum_{j=1}^{2}\beta_{j}x_{ij} = x_{i}\sum_{j=1}^{2}\beta_{j}
$$

Therefore,

$$
(y_{i} - \sum_{j=1}^{2}\beta_{j}x_{ij})^{2} = (y_{i} - x_{i}\sum_{j=1}^{2}\beta_{j})^{2}
$$

Which means the aim is to minimise

$$
\sum_{i=1}^{2}(y_{i} - x_{i}\sum_{j=1}^{2}\beta_{j})^{2} + \lambda\sum_{j=1}^{2}\beta^{2}_{j}
$$

Which can be written as

$$
(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda\beta^{2}_{1} + \lambda\beta^{2}_{2}
$$


*(b) Argue that in this setting, the ridge coefficient estimates satisfy $\hat{\beta}_{1} = \hat{\beta}_{2}$*

To find the solution to the above, we take the partial derivatives of the above expression with respect to $\beta_{1}$ and $\beta_{2}$,

$$
\frac{\partial}{\partial \beta_{1}} \left[ (y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda\beta^{2}_{1} + \lambda\beta^{2}_{2} \right] = -2x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) - 2x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2}) + 2\lambda\beta_{1}
$$

$$
\frac{\partial}{\partial \beta_{2}} \left[ (y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda\beta^{2}_{1} + \lambda\beta^{2}_{2} \right] = -2x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) - 2x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2}) + 2\lambda\beta_{2}
$$

Then setting them to zero and rearranging, we find

$$
\beta_{1} = \frac{x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) + x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})}{\lambda}
$$
$$
\beta_{2} = \frac{x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) + x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})}{\lambda}
$$

The right-hand sides of these two equations are identical, so for any $\lambda > 0$ the ridge coefficient estimates satisfy $\hat{\beta}_{1} = \hat{\beta}_{2}$.
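As a numerical check on the example dataset defined earlier, the sketch below solves the ridge problem in closed form, $(X^{T}X + \lambda I)\beta = X^{T}y$, with no intercept since $\hat{\beta}_{0} = 0$. The function name `ridge_coefficients` and the particular values of $\lambda$ are illustrative assumptions; $\lambda$ must be positive because $X^{T}X$ is singular here.

```python
import numpy as np

# Same values as the `dataset` DataFrame above: x1 = x2 = (2, -2), y = (-1, 1).
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([-1.0, 1.0])


def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimate (no intercept): solve (X'X + lam * I) beta = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


for lam in [0.1, 1.0, 10.0]:
    b1, b2 = ridge_coefficients(X, y, lam)
    print(f"lambda={lam}: beta1={b1:.4f}, beta2={b2:.4f}")
```

For this dataset $X^{T}X$ has every entry equal to 8 and $X^{T}y = (-4, -4)$, so both coefficients equal $-4 / (16 + \lambda)$ for every $\lambda > 0$.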

*(c) Write out the lasso optimization problem in this setting.*

In this setting, the lasso seeks to minimise

$$
\sum_{i=1}^{2}(y_{i} - \sum_{j=1}^{2}\beta_{j}x_{ij})^{2} + \lambda\sum_{j=1}^{2}|\beta_{j}|
$$

This can be simplified, using the same logic as above, to

$$
(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda|\beta_{1}| + \lambda|\beta_{2}|
$$
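To get a feel for this objective, it can be evaluated on the example dataset defined earlier. The sketch below is illustrative only: the function name `lasso_objective`, the value $\lambda = 1$ and the coefficient pairs are assumptions chosen for the demonstration, and the pairs are not claimed to be minimisers. Because $x_{i1} = x_{i2}$, the squared-error part depends on the coefficients only through $\beta_{1} + \beta_{2}$.

```python
import numpy as np

# Same values as the `dataset` DataFrame above: x1 = x2 = (2, -2), y = (-1, 1).
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([-1.0, 1.0])


def lasso_objective(beta, lam):
    """Residual sum of squares plus lam times the L1 norm of beta (no intercept)."""
    beta = np.asarray(beta, dtype=float)
    resid = y - X @ beta
    return float(resid @ resid + lam * np.abs(beta).sum())


lam = 1.0
# Three pairs with the same sum (-0.2) and no opposing signs.
same_sign = [(-0.2, 0.0), (-0.1, -0.1), (0.0, -0.2)]
# A pair with the same sum but opposite signs.
opposite_sign = (-0.3, 0.1)

values = [lasso_objective(b, lam) for b in same_sign]
for b, v in zip(same_sign, values):
    print(f"beta={b}: objective={v:.4f}")
print(f"beta={opposite_sign}: objective={lasso_objective(opposite_sign, lam):.4f}")
```

The three same-sign pairs give the same objective value, while the opposite-sign pair has the same squared error but a larger penalty.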


*(d) Argue that in this setting, the lasso coefficients $\hat{\beta}_{1}, \hat{\beta}_{2}$ are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.*

Again we find the partial derivatives. Since $|\beta_{j}|$ is not differentiable at zero, these hold wherever $\beta_{1} \neq 0$ and $\beta_{2} \neq 0$,

$$
\frac{\partial}{\partial \beta_{1}} \left[ (y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda|\beta_{1}| + \lambda|\beta_{2}| \right] = -2x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) - 2x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2}) + \lambda\operatorname{sign}(\beta_{1})
$$

$$
\frac{\partial}{\partial \beta_{2}} \left[ (y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2})^{2} + (y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2})^{2} + \lambda|\beta_{1}| + \lambda|\beta_{2}| \right] = -2x_{1}(y_{1} - x_{1}\beta_{1} - x_{1}\beta_{2}) - 2x_{2}(y_{2} - x_{2}\beta_{1} - x_{2}\beta_{2}) + \lambda\operatorname{sign}(\beta_{2})
$$

Then setting the derivatives equal to zero and solving

