# Model Selection via Regularization

In this exercise, we will introduce regularization terms in our regression models to prevent overfitting. Let the covariates be $x \in \mathbb{R}^{n \times p}$ and the outcomes be $Y \in \mathbb{R}^n$. $x_j^{\left(i\right)}$ denotes the $j$th feature in the $i$th sample, while $Y^{\left(i\right)}$ denotes the outcome for the $i$th sample. The linear regression model, with noise term $Z^{\left(i\right)}$, is

\begin{equation*}
    Y^{\left(i\right)} = \beta_0 + \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} + Z^{\left(i\right)}
\end{equation*}

Let $\lambda$ be the regularization constant. Ridge regression (L2 regularization) penalizes the squared magnitude of the coefficients:

\begin{equation*}
    \hat{\beta}^{L2} = {\rm arg\,min}_{\beta} \left(\sum_{i=1}^n \left(Y^{\left(i\right)} - \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} - \beta_0\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\right)
\end{equation*}

LASSO regression (L1 regularization) penalizes the absolute value of the magnitude of the coefficients:

\begin{equation*}
    \hat{\beta}^{L1} = {\rm arg\,min}_{\beta} \left(\sum_{i=1}^n \left(Y^{\left(i\right)} - \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} - \beta_0\right)^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert\right)
\end{equation*}

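Note that neither Ridge nor LASSO penalizes the intercept. If the intercept were regularized, the fit would depend on the origin chosen for $Y$: adding a constant to every outcome would no longer simply shift all predictions by that constant, and other familiar properties of linear regression would fail in the same way.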

We will compare the effects of L2 (Ridge) and L1 (LASSO) regularization on the coefficients. To understand how each value of the regularization constant constrains the model, we will compute the effective degrees of freedom. We will also use LOOCV (Leave-One-Out-Cross-Validation) to choose the right regularization parameter and hence the best model.

For this exercise, we will use the cars dataset. It is provided as <b>cars.csv</b> in the data directory. For information on what the features are, refer to https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars. Note that this is a different automobile dataset from the OLS part. In this part, we will again build regression models to predict miles per gallon, this time from 10 features.

In [None]:
#Import the necessary libraries
%matplotlib inline

from matplotlib import pyplot as plt
import utils
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from sklearn import linear_model
import warnings

# some settings
warnings.filterwarnings('ignore')
plt.rc('font', size = 14)

Read the data as a pandas dataframe. For reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [None]:
# Read the data
filename = 'data/cars.csv'
df = pd.read_csv(filename)

# (0). Visualize the data

**(0a).** [1 pt] Print the number of observations in the dataset. 

**(0b).** [1 pt] Print all the columns in the dataset.

**(0c).** [1 pt] Produce a scatter plot of all variables against each other.

Feel free to use the <b>scatter_plot_dataframe()</b> function in utils.py. Note: this function call may take a while.

**(0d).** [1 pt] Produce a plot of correlations between all variables.

Feel free to use the <b>correlation_plot()</b> function in utils.py

**(0e).** [1 pt] Using the plots above, which variables have a (roughly) linear relationship with 'mpg'? 

**A:** (Type your answer here)

# (1). Ridge Regression
We will run Ridge regression by introducing an L2 penalty on the regression coefficients.

In [None]:
feat_names = ['cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

**(1a).** [3 pts] Create an OLS model using <b>statsmodels.OLS</b>. You do not have to fit the model in this step.

The dependent variable is 'mpg', and the independent variables are 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'.

Center and normalize each independent variable. Regularization penalizes coefficients based on their magnitude. Because features on different scales have coefficients of different magnitudes, standardizing features is important to ensure regularization is applied evenly.

Include the intercept using the add_constant() function in statsmodels. Store your multivariate model in a variable called <b>mv_model</b>.

Hint: http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
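The centering and normalization step described above can be sketched as follows. This is only an illustration on made-up data, not the cars dataset; the variable names are hypothetical, and dividing by the population standard deviation (`ddof=0`) is an assumption, since the text does not say which convention to use.

```python
import numpy as np

# Hypothetical raw features on very different scales (not the cars data).
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(200.0, 50.0, size=30),   # a feature measured in the hundreds
    rng.normal(3.0, 0.5, size=30),      # a feature measured in single digits
])

# Center each column, then divide by its standard deviation.
# ddof=0 is an assumption; ddof=1 is equally defensible.
col_means = X_raw.mean(axis=0)
col_stds = X_raw.std(axis=0, ddof=0)
X_std = (X_raw - col_means) / col_stds

# Add the constant column last, so the intercept column is never rescaled.
X_design = np.column_stack([np.ones(len(X_std)), X_std])
```

After this step every feature has mean 0 and standard deviation 1, so a penalty of a given size means the same thing for each coefficient.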

**(1b).** [2 pts] Write a function <b>get_mv_ridge()</b> which fits a model with an L2 regularization constant $\lambda$. Your function should take as arguments the statsmodels OLS model and $\lambda$ and return the fit.

Use statsmodels' <b>fit_regularized()</b> function: https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html. Note that the objective function in their documentation uses a different regularization constant $\alpha$:

\begin{equation*}
    \hat{\beta}^{L2} = {\rm arg\,min}_{\beta} \left(\frac{1}{2n} \sum_{i=1}^n \left(Y^{\left(i\right)} - \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} - \beta_0\right)^2 + \frac{\alpha}{2} \sum_{j=1}^p \beta_j^2\right)
\end{equation*}

Be sure to account for this change from $\lambda$ to $\alpha$ and for no regularization on the intercept in your function.
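One way to relate the two constants, sketched under the assumption that both objectives are meant to share the same minimizer: multiplying an objective by a positive constant does not change its arg min, so the $\lambda$ objective can be scaled by $\frac{1}{2n}$ to give

\begin{equation*}
    \frac{1}{2n} \sum_{i=1}^n \left(Y^{\left(i\right)} - \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} - \beta_0\right)^2 + \frac{\lambda}{2n} \sum_{j=1}^p \beta_j^2
\end{equation*}

Comparing the penalty term with $\frac{\alpha}{2} \sum_{j=1}^p \beta_j^2$ suggests $\alpha = \lambda / n$. For the intercept, the linked documentation describes the penalty weight as a scalar or an array with one entry per parameter; setting the entry for the constant column to zero would be one way to leave the intercept unpenalized. Please verify both points against the documentation before relying on them.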

In [None]:
def get_mv_ridge(model,
                 ridge_lambda):
    '''
    Compute a Ridge regression
    @param model: statsmodels OLS
    @param ridge_lambda: float, L2 regularization constant
    @return: fitted OLS model (RegularizedResults)
    '''
    pass

**(1c).** [3 pts] The number of degrees of freedom is the number of values that are free to vary. When regularization is applied, the number of effective degrees of freedom is reduced. The formula for Ridge effective degrees of freedom is:

$$
\mathrm{df}(\lambda) = \mathrm{tr}(X(X^TX + \lambda I)^{-1}X^T) + 1 = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda} + 1,
$$
where $X$ is the $n\times p$ column-centered matrix (without constant) holding the independent variables, and $d_j$ are the diagonal entries of $D$ (the singular values) in the singular value decomposition (SVD) of $X$:
$$
X = UDV^T
$$

Since the intercept contributes 1 degree of freedom, a 1 is added.

Define a function <b>degrees_of_freedom_ridge</b> that takes $X$ and the L2 regularization constant $\lambda$ as input and computes the effective degrees of freedom.
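As a sanity check on the formula above, the sketch below computes the effective degrees of freedom in both ways (trace of the hat matrix and singular values) on made-up centered data and prints them side by side. The helper names `ridge_df_trace` and `ridge_df_svd` are hypothetical and are not part of the assignment's required interface.

```python
import numpy as np

def ridge_df_trace(X, lam):
    """Effective df via the trace of the ridge hat matrix, plus 1 for the intercept."""
    p = X.shape[1]
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(hat) + 1.0

def ridge_df_svd(X, lam):
    """The same quantity computed from the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam)) + 1.0

# Made-up data: 32 samples, 10 features, columns centered, no constant column.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 10))
X = X - X.mean(axis=0)

for lam in [0.0, 1.0, 100.0]:
    print(lam, ridge_df_trace(X, lam), ridge_df_svd(X, lam))
```

With $\lambda = 0$ both versions give $p + 1$ (here 11), and the value shrinks toward 1 as $\lambda$ grows. The SVD version avoids forming and inverting $X^TX + \lambda I$.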

In [None]:
def degrees_of_freedom_ridge(X,
                             ridge_lambda):
    '''
    Compute the number of effective degrees of freedom for a ridge regression
    @param X: pandas DataFrame, column-centered data, may contain 'const' column
    @param ridge_lambda: float, L2 regularization constant
    @return: float, effective degrees of freedom
    '''
    pass

**(1d).** [3 pts] Use <b>get_mv_ridge()</b> to plot the Ridge regression coefficients vs. effective degrees of freedom $\mathrm{df}(\lambda)$ for each independent variable for $\lambda$ in the range [0,3200] in increments of 3.2. Use smaller increments of 0.32 from 0 to 3.2, since the fit is more sensitive to small values of $\lambda$ and that region needs a finer grid.<br>

Ensure that your legend clearly labels all lines (one per variable) and the axes are appropriately labeled (x-axis should have $\mathrm{df}(\lambda)$ and the y-axis should have the parameter values).
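The two-resolution grid of $\lambda$ values described in (1d) could be built as below. This is a sketch under one assumption: the value 3.2 is assigned to the coarse grid rather than the fine one, which the text leaves open. Multiplying integer ranges by the step size avoids the floating-point endpoint surprises that `np.arange` can produce with non-integer steps. The name `ridge_lambdas` is hypothetical.

```python
import numpy as np

fine_step, coarse_step = 0.32, 3.2

fine = fine_step * np.arange(0, 10)          # 0, 0.32, ..., 2.88
coarse = coarse_step * np.arange(1, 1001)    # 3.2, 6.4, ..., 3200
ridge_lambdas = np.concatenate([fine, coarse])
```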

**(1e).** [1 pt] What do you notice about the coefficients as the degrees of freedom decrease? 

**A:** (Type your answer here)

**(1f).** [3 pts] Define a function <b>LOOCV_ridge(lambda)</b> which implements Leave-One-Out-Cross-Validation on the Ridge regression model as a function of the penalty parameter $\lambda$. In particular, for every $1 \leq i \leq n$, where $n$ is the sample size, train a ridge regression model on the sub-sample with the $i^{th}$ sample removed. Evaluate this model on the held-out $i^{th}$ sample using squared error loss. Then return the mean and the standard deviation of this loss over all $n$ iterations.

Note: There is also an efficient way to implement this function without training a new ridge regression model for each LOOCV iteration. While this solution is not required, thinking about it is a nice exercise!
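For readers curious about the efficient route mentioned in the note, here is one possible sketch; it is not presented as the intended solution. It assumes the ridge fit is written as a linear smoother $\hat{Y} = HY$ with $H = Z(Z^TZ + P)^{-1}Z^T$, where $Z$ is the design matrix including the constant column and $P$ is a diagonal penalty matrix with a zero for the intercept. Under that assumption the leave-one-out residual equals the full-sample residual divided by $1 - H_{ii}$. The code checks this against brute-force refitting on made-up data; all names are hypothetical.

```python
import numpy as np

def ridge_fit(Z, y, penalty):
    """Solve (Z'Z + P) beta = Z'y, where P = diag(penalty)."""
    return np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)

rng = np.random.default_rng(2)
n, p, lam = 25, 4, 5.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
Z = np.column_stack([np.ones(n), X])       # constant column first
penalty = np.r_[0.0, np.full(p, lam)]      # no penalty on the intercept

# Brute force: refit n times, each time predicting the held-out row.
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = ridge_fit(Z[keep], y[keep], penalty)
    brute[i] = (y[i] - Z[i] @ beta_i) ** 2

# Shortcut: one fit, then rescale each residual by 1 / (1 - H_ii).
H = Z @ np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T)
resid = y - H @ y
shortcut = (resid / (1.0 - np.diag(H))) ** 2
```

The two arrays agree up to floating-point error, but the shortcut needs a single linear solve instead of $n$ of them. Note that it keeps $\lambda$ fixed across folds, matching the $\lambda$ objective at the top of the exercise.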

In [None]:
def LOOCV_ridge(X,
                Y,
                ridge_lambda):
    '''
    Perform leave one out cross validation on a ridge regression model
    Compute the average and standard deviation of the loss on the held-out sample across all iterations
    @param X: pandas DataFrame, column-centered data, contains 'const' column
    @param Y: np array, sample labels
    @param ridge_lambda: float, L2 regularization constant
    @return: 1. float, mean of LOOCV scores
             2. float, standard deviation of LOOCV scores
    '''
    pass

**(1g).** [2 pts] Use <b>LOOCV_ridge(lambda)</b> and <b>degrees_of_freedom_ridge(lambda)</b> to generate three arrays: <b>ridge_x_vals</b>, <b>ridge_y_vals</b>, <b>ridge_error_bars</b>, holding the degrees of freedom, LOOCV scores, and standard errors, respectively. Recall that you need to normalize the standard deviation to get valid error bars (lecture-notes #8, sections 3.5-3.6). <br>The computation is expensive and may take a couple of minutes.
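The normalization mentioned in (1g) is sketched below under the assumption that it refers to the usual standard error of a mean, that is, the standard deviation of the per-sample losses divided by $\sqrt{n}$. The lecture notes are the authoritative reference; the loss values and the use of `ddof=1` here are made up for illustration.

```python
import numpy as np

# Hypothetical per-sample squared errors from one LOOCV run.
losses = np.array([1.2, 0.4, 2.5, 0.9, 1.6, 0.3, 1.1, 0.7])

loocv_mean = losses.mean()
loocv_std = losses.std(ddof=1)                 # spread of the individual losses
loocv_se = loocv_std / np.sqrt(len(losses))    # uncertainty of the mean loss
```

The standard deviation describes how much single held-out losses vary, while the standard error describes how precisely their mean is estimated, which is what an error bar on the mean LOOCV score should show.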

**(1h).** [2 pts] Use the arrays you generated in part (1g) to plot the Leave-One-Out-Cross-Validation score (with error bars) vs. effective degrees of freedom $\mathrm{df}(\lambda)$. Ensure that the axes are appropriately labeled (x-axis should have df(lambda) and the y-axis should have the LOOCV score).

<b>Hint:</b> You might want to plot the y-axis on a log scale for better readability.

**(1i).** [1 pt] We can pick the best lambda based on which value minimizes the mean LOOCV score. What is the best lambda? What are the corresponding degrees of freedom and mean LOOCV score?

**(1j).** [1 pt] Build and fit a ridge regression model with this $\lambda$ and no held-out data.

**(1k).** [1 pt] Display the coefficients in a pandas DataFrame.

Hint: A RegularizedResults object does not have **summary()** implemented, so use the **params** property instead

**(1l).**  [1 pt] How would you interpret the coefficient on cylinders?

Hint: You will need to apply the normalization (but not the centering) introduced in 1a to the coefficient before interpreting it.
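The unit conversion behind this hint can be illustrated on a toy problem. The sketch uses plain least squares on a single made-up feature (an assumption chosen to keep the example short; it is not the ridge fit from the exercise) and checks that the coefficient on the standardized feature, divided by that feature's standard deviation, equals the coefficient on the raw feature.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x_raw = rng.normal(6.0, 1.8, size=n)                    # hypothetical raw feature
y = 30.0 - 2.0 * x_raw + rng.normal(0.0, 1.0, size=n)   # hypothetical outcome

s = x_raw.std()
x_std = (x_raw - x_raw.mean()) / s

def ols_slope(x, y):
    """Slope of a simple least-squares line with an intercept."""
    Z = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

slope_std = ols_slope(x_std, y)   # change in y per one standard deviation of x
slope_raw = ols_slope(x_raw, y)   # change in y per one raw unit of x
```

Here `slope_std / s` matches `slope_raw`. A coefficient fitted on a normalized feature is therefore a change in the outcome per standard deviation of that feature, and dividing by the standard deviation expresses it per original unit.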

**A:** (Type your answer here)

# (2). LASSO (Least Absolute Shrinkage and Selection Operator)

We will now run LASSO by introducing an L1 penalty on the regression coefficients. Normalizing features is also important for applying L1 regularization evenly. We will use the normalized features created in (1a).

**(2a).** [2 pts] A LASSO coefficient path shows how the coefficients change as the L1 penalty is varied. Write a function **compute_lasso_path(X, Y, alphas)** that takes in the scaled features X (without intercept) and the outcomes Y. Your function should output a numpy array where each column contains the intercept (at index 0) and the coefficients for that value of alpha.

We recommend you take one of these approaches:

1. Use **sklearn.linear_model.Lasso** to fit a model for each $\alpha$: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
2. Use **sklearn.linear_model.lasso_path**: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lasso_path.html To avoid regularizing the intercept, center Y before calling this function since the intercept will be 0 on the centered Y. You can manually add the intercept after. Also, note that the order of the alphas will be reversed by this function, so you will have to handle the outputted lasso paths accordingly.

Note that $\alpha$ is also different from $\lambda$ in this objective:
\begin{equation*}
    \hat{\beta}^{L1} = {\rm arg\,min}_{\beta} \left(\frac{1}{2n}\sum_{i=1}^n \left(Y^{\left(i\right)} - \sum_{j = 1}^p \beta_j x_j^{\left(i\right)} - \beta_0\right)^2 + \alpha \sum_{j=1}^p \lvert \beta_j \rvert\right)
\end{equation*}

We will use $\alpha$ as defined in the sklearn documentation for this problem, so you do not need to convert from $\lambda$.
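A minimal sketch of the first approach (one `sklearn.linear_model.Lasso` fit per $\alpha$) on made-up data follows. The data, the `alphas` list, the `path` array name, and the `max_iter` value are all illustrative assumptions. With `fit_intercept=True`, sklearn estimates the intercept separately from the penalized coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized, no constant column
Y = 20.0 + X @ np.array([3.0, 0.0, -1.5]) + rng.normal(0.0, 0.5, size=n)

alphas = [0.01, 0.1, 1.0, 100.0]
path = np.empty((p + 1, len(alphas)))        # row 0 holds the intercepts
for k, alpha in enumerate(alphas):
    # fit_intercept=True keeps the intercept out of the L1 penalty.
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000).fit(X, Y)
    path[0, k] = model.intercept_
    path[1:, k] = model.coef_
```

For a sufficiently large $\alpha$ (the last column here) every coefficient is driven to zero and the intercept reduces to the mean of `Y`, which is a convenient check on any implementation.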

In [None]:
def compute_lasso_path(X,
                       Y,
                       alphas):
    '''
    Compute the LASSO coefficient path as alpha is varied
    @param X: np array, # samples x # features, does not contain a constant column
    @param Y: np array, # samples, contains outcomes
    @param alphas: list of floats, alpha values on LASSO path
    @return: np array, (1 + # features) x # alphas, first row contains intercepts
    '''
    pass

**(2b).** [1 pt] Compute the LASSO path for each value of $\alpha$ in the range [0.001,100] in 1000 small increments **equally spaced on the log scale**. Hold your results in **main_lasso_path**.

Then plot the LASSO coefficients vs. $\log_{10}\left(\alpha\right)$. The intercept does not need to be plotted. Ensure that your legend clearly labels all lines (one per variable) and the axes are appropriately labeled (x-axis should have log alpha and the y-axis should have the parameter values).
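A grid that is equally spaced on the log scale can be generated with `np.logspace`. The sketch assumes that "1000 small increments" means 1000 grid points with both endpoints included; if 1000 intervals are intended instead, `num` would be 1001.

```python
import numpy as np

alphas = np.logspace(-3, 2, num=1000)   # from 10**-3 to 10**2, endpoints included
log_alphas = np.log10(alphas)           # evenly spaced values for the x-axis
```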

**(2c).** [1 pt] What do you notice about the coefficients as the regularization penalty is increased? Contrast the behavior with what you observed for Ridge regression. Which might you prefer for choosing a sparse subset of features (columns) in a regression model?

**A:** (Type your answer here)

**(2d).** [1 pt] To implement LOOCV for LASSO regression, we need to fit the LASSO model on every sub-sample where the $i^{th}$ sample is removed, for $1\leq i\leq n$. In this part, we'll do some pre-computation. Pre-compute $n$ LASSO paths, one per sub-sample, using **compute_lasso_path**. Save your results in **sub_lasso_paths**, a numpy array of size $n \times (p + 1) \times$ # alphas (the extra row in each path holds the intercept).

**(2e).** [2 pts] Define a function <b>LOOCV_lasso</b> which implements Leave-One-Out-Cross-Validation on the LASSO regression model. Use the coefficients in the pre-computed LASSO paths instead of refitting a model for each sub-sample. The rest of the procedure is the same as the one for Ridge regression. This function will take your pre-computed **sub_lasso_paths**, **alpha_index**, and the samples as input. It will return the mean and standard deviation of the LOOCV scores.
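The indexing involved in reusing pre-computed paths can be sketched with random stand-in arrays. Everything here is hypothetical: the array names, the tiny sizes, and the assumption that the constant column comes first in the design matrix so that it lines up with the intercept at index 0 of each path.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, n_alphas = 6, 2, 3

# Stand-ins: a design matrix with a leading constant column, and one
# (p + 1) x n_alphas coefficient path per left-out sample.
X_design = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = rng.normal(size=n)
paths = rng.normal(size=(n, p + 1, n_alphas))

alpha_index = 1
coefs = paths[:, :, alpha_index]                  # row i: model trained without sample i
preds = np.einsum('ij,ij->i', X_design, coefs)    # row-wise dot products
losses = (Y - preds) ** 2
loocv_mean, loocv_std = losses.mean(), losses.std()
```

Each held-out sample is scored only with the coefficients that were fitted without it, which is what makes the result a valid leave-one-out estimate.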

In [None]:
def LOOCV_lasso(sub_lasso_paths,
                alpha_index,
                X,
                Y):
    '''
    Perform leave one out cross validation on a Lasso regression model
    This model is represented by the coefficients at alpha_index in sub_lasso_paths
    Compute the mean and standard deviation of loss on the held-out sample
    @param sub_lasso_paths: np array, # samples x # coefficients x # alphas
    @param alpha_index: int, alpha index
    @param X: pandas DataFrame, column-centered data, contains 'const' column
    @param Y: np array, sample labels
    @return: 1. float, mean of LOOCV scores
             2. float, standard deviation of LOOCV scores
    '''
    pass

**(2f).** [2 pts] Use <b>LOOCV_lasso</b> to generate three arrays: <b>lasso_x_vals</b>, <b>lasso_y_vals</b>, <b>lasso_error_bars</b> for $\log_{10}\left(\alpha\right)$, LOOCV score, and standard errors, respectively.

**(2g).** [2 pts] Use the arrays you generated in part (2f) to plot the Leave-One-Out-Cross-Validation score (with error bars) vs. $\log_{10}\left(\alpha\right)$.

Ensure that the axes are appropriately labeled (x-axis should have log alpha and the y-axis should have the LOOCV score).

Hint: You might want to plot the y-axis on a log scale for better readability.

**(2h).** [2 pts] We can pick the value of alpha associated with the lowest mean LOOCV score. What is the best alpha and the corresponding mean LOOCV score?

**(2i).**  [1 pt] Once we have selected the best alpha from LOOCV, we can select the model fitted with that alpha on all data (i.e. no held-out sample) as the best model. Display the coefficients in a pandas DataFrame.

**(2j).** [1 pt] The formula for LASSO effective degrees of freedom is:

\begin{equation*}
    \sum_{j=0}^p \mathbb{1}\left\{\hat{\beta}_j(\alpha)\neq 0\right\}
\end{equation*}

where $\hat{\beta}_j(\alpha)$ is the LASSO estimate of coefficient $\beta_j$ for a penalty parameter $\alpha$. This is the number of non-zero coefficients (including the intercept) in the LASSO regression. What is the number of degrees of freedom in the selected LASSO model? How does it compare with Ridge regression?
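Counting non-zero coefficients is a one-liner in numpy. The coefficient vector below is made up, with the intercept placed first as an assumption about the layout.

```python
import numpy as np

# Hypothetical fitted coefficients: intercept first, then the features.
beta_hat = np.array([20.1, -1.3, 0.0, 0.0, -2.7, 0.0])

lasso_df = int(np.count_nonzero(beta_hat))   # the intercept is counted too
```

If a solver returns tiny non-zero values instead of exact zeros, comparing `np.abs(beta_hat)` against a small tolerance may be the safer test; whether that is needed depends on the solver used.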

**A:** (Type your answer here)

**(2k).** [1 pt] How do you interpret the coefficient on cylinders? How does the interpretation compare with that for Ridge regression in (1l)?

Hint: You will need to apply the normalization (but not the centering) introduced in 1a to the coefficient before interpreting it.

**A:** (Type your answer here)

**(2l).** [1 pt] How do the mean LOOCV scores for the selected ridge and LASSO models compare? Why does this occur?

**A:** (Type your answer here)

<b>Important Note:</b> If you look carefully at the implementation, you will notice that we center and scale the data only once, across the whole sample, and do not standardize again when doing the LOOCV splits. Technically, this introduces a small leakage between the training and hold-out sets (in this case the sub-sample of size $n-1$ and the held-out $i^{th}$ sample). To solve this problem, we should have re-centered and re-scaled every sub-sample of size $n-1$ individually and used those scaling parameters to transform the held-out $i^{th}$ sample. However, in the case of LOOCV the leakage is so small (the mean/std of $n$ numbers instead of $n-1$) that there is virtually no difference between the LOOCV curves. In the general case, especially in cross-validation settings where the number of folds is small (5, for example) compared to the sample size, you should be very careful about leakage.
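The leakage-free alternative described in this note might look as follows for a single fold. It is a sketch on made-up data with hypothetical names, and it uses the population standard deviation as an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
X_raw = rng.normal(10.0, 3.0, size=(12, 2))   # hypothetical unscaled features

i = 0                                          # index of the held-out sample
train = np.delete(X_raw, i, axis=0)

# Scaling parameters come from the n - 1 training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
held_out_scaled = (X_raw[i] - mu) / sigma      # reuse the training parameters
```

The held-out row never contributes to `mu` or `sigma`, so nothing about it reaches the model before evaluation. Repeating this for every fold is what a fully leakage-free cross-validation would do.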