### Group Members: Weston Lu, Kevin Wang, Aamir Mohsin
# Data Science Lab: Lab 3

Submit:
1. A PDF of your notebook with solutions. Make sure that the solutions are present and visible in the PDF.
2. A link to your Colab notebook, or your uploaded .ipynb file if you are not working on Colab.

# Goals of this Lab

1. More experience with regression and ridge regression (regularization)
2. Start playing with Kaggle
3. More experience with Lasso.
4. An initial shot at ensembling and stacking.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr


%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline


## Problem 1 (Optional)

**Part 1.** Make sure you can run through and understand the Jupyter notebook on Ridge Regression and Collinearity we saw in class: https://colab.research.google.com/drive/1R7xTNHxAwhL1tANiGT2KRO-OT0D8KV2Z

**Part 2.** What is the test error of the “zero-variance” solution, namely, the all-zeros solution?

**Part 3.** The least-squares solution does not seem to do too well, because it has so much variance. Still, it is unbiased. Show this empirically: generate many copies of the data, and for each one, obtain the least-squares solution. Average these to show that while each run produces a $\hat{\beta}$ that is very different, their average looks more and more like the true $\beta$.
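One way such an experiment could be set up is sketched below. The data generator here (30 samples, 5 nearly collinear features, unit noise) is an assumption made for illustration and is not the one from the class notebook; `draw_dataset`, `beta_true`, and the trial count are hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (not the class notebook's): few samples and nearly collinear
# features, so each individual least-squares fit has high variance.
n, p, n_trials = 30, 5, 2000
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def draw_dataset(rng):
    """Draw one dataset whose columns share a strong common factor."""
    base = rng.normal(size=(n, 1))
    X = base + 0.1 * rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    return X, y

# One least-squares solution per independent copy of the data.
beta_hats = np.empty((n_trials, p))
for t in range(n_trials):
    X, y = draw_dataset(rng)
    beta_hats[t] = np.linalg.lstsq(X, y, rcond=None)[0]

# Typical error of a single fit versus error of the running average.
single_err = np.linalg.norm(beta_hats - beta_true, axis=1).mean()
running_avg = np.cumsum(beta_hats, axis=0) / np.arange(1, n_trials + 1)[:, None]
avg_err = np.linalg.norm(running_avg - beta_true, axis=1)

print(f"mean error of a single fit: {single_err:.3f}")
for k in (10, 100, 1000, n_trials):
    print(f"error of the average over {k:4d} fits: {avg_err[k - 1]:.3f}")
```

Plotting `avg_err` against the number of averaged fits is one way to visualize the convergence.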

**Part 4.** Alternatively, if one had access to lots of data, instead of computing the least-squares solution over smaller batches and then averaging these solutions as in the previous part, one could run a single least-squares regression over all the data. Which approach do you think is better? Can you support your conclusion with experiments?
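A possible starting point for such a comparison is sketched here. The batch size, number of batches, and collinear data generator are assumptions chosen for illustration, and `compare_once` is a hypothetical helper; other settings may behave differently, so the experiment is worth repeating with the class notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_batch, n_batches, p = 10, 50, 5          # assumed sizes
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def compare_once(rng):
    """Return (error of averaged batch fits, error of one pooled fit)."""
    n_total = n_batch * n_batches
    base = rng.normal(size=(n_total, 1))
    X = base + 0.1 * rng.normal(size=(n_total, p))   # nearly collinear columns
    y = X @ beta_true + rng.normal(size=n_total)

    # Approach 1: least squares on each batch, then average the solutions.
    fits = [np.linalg.lstsq(Xb, yb, rcond=None)[0]
            for Xb, yb in zip(np.split(X, n_batches), np.split(y, n_batches))]
    beta_avg = np.mean(fits, axis=0)

    # Approach 2: a single least-squares fit on all the data.
    beta_pooled = np.linalg.lstsq(X, y, rcond=None)[0]

    return (np.linalg.norm(beta_avg - beta_true),
            np.linalg.norm(beta_pooled - beta_true))

errors = np.array([compare_once(rng) for _ in range(200)])
mean_avg_err, mean_pooled_err = errors.mean(axis=0)
print(f"averaged batch fits: {mean_avg_err:.3f}")
print(f"single pooled fit:   {mean_pooled_err:.3f}")
```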


## Problem 2: Starting in Kaggle
Later this semester, we are opening a Kaggle competition made for this class. In that one, you will be participating on your own. This is an intro to get us started, and also an excuse to work with regularization and regression which we have been discussing.

**Part 1.** Let’s start with our first Kaggle submission in a playground regression competition. Create a Kaggle account and find https://www.kaggle.com/c/house-prices-advanced-regression-techniques/

**Part 2.** Follow the data preprocessing steps from (new link!) https://www.kaggle.com/code/apapiu/regularized-linear-models. Then run a ridge regression using $\lambda = 0.1$. Make a submission of this prediction. What is the RMSE you get? (Hint: remember to exponentiate your predictions using np.expm1(ypred).)
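The fit-then-exponentiate pattern from the hint might look like the sketch below. It uses synthetic arrays as a stand-in, since the real `X_train`, `X_test`, and `y` come from the preprocessing notebook; all sizes and the price model are assumptions. Note that scikit-learn names the regularization strength `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed Kaggle matrices (an assumption).
X_train = rng.normal(size=(200, 10))
X_test = rng.normal(size=(50, 10))
price = np.exp(12 + 0.1 * X_train @ rng.normal(size=10)
               + 0.05 * rng.normal(size=200))

y = np.log1p(price)                 # the model is fit on log(1 + price)

model = Ridge(alpha=0.1)            # scikit-learn calls lambda "alpha"
model.fit(X_train, y)

log_pred = model.predict(X_test)    # predictions on the log1p scale
pred = np.expm1(log_pred)           # undo log1p before submitting
```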



**Part 3.** Compare a ridge regression and a lasso regression model. Optimize the alphas using cross validation. What is the best score you can get from a single ridge regression model and from a single lasso model?
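A minimal sketch of tuning both models by cross validation follows. The helper `rmse_cv`, the alpha grids, and the synthetic sparse data are assumptions for illustration; with the Kaggle data, `X` and `y` would be the preprocessed training matrix and the log-transformed prices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y, cv=5):
    """Mean cross-validated RMSE of `model` on (X, y)."""
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(mse).mean()

# Synthetic sparse regression problem (an assumption, not the Kaggle data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))
beta = np.zeros(40)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.normal(size=150)

ridge_alphas = [0.1, 1, 10, 100]
lasso_alphas = [0.001, 0.01, 0.1, 1]

ridge_scores = {a: rmse_cv(Ridge(alpha=a), X, y) for a in ridge_alphas}
lasso_scores = {a: rmse_cv(Lasso(alpha=a, max_iter=10_000), X, y)
                for a in lasso_alphas}

best_ridge = min(ridge_scores, key=ridge_scores.get)
best_lasso = min(lasso_scores, key=lasso_scores.get)
print("best ridge alpha:", best_ridge, "RMSE:", ridge_scores[best_ridge])
print("best lasso alpha:", best_lasso, "RMSE:", lasso_scores[best_lasso])
```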

**Part 4.** The $\ell_0$ (or $L_0$) norm is the number of nonzeros of a vector. Plot the $L_0$ norm of the coefficients that lasso produces as you vary the regularization strength $\lambda$.
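Counting nonzero coefficients across a grid of regularization strengths could be done roughly as follows. The synthetic data and the alpha grid are assumptions; the same loop would apply to the preprocessed Kaggle features.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic sparse regression problem (an assumption, not the Kaggle data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))
beta = np.zeros(40)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.normal(size=150)

alphas = np.logspace(-3, 1, 30)

# L0 norm = number of nonzero entries in the fitted coefficient vector.
l0_norms = np.array([
    np.count_nonzero(Lasso(alpha=a, max_iter=50_000).fit(X, y).coef_)
    for a in alphas
])

for a, k in zip(alphas[::6], l0_norms[::6]):
    print(f"alpha={a:8.4f}  nonzero coefficients={k}")
```

A log-scaled x-axis, for example `plt.semilogx(alphas, l0_norms)`, usually makes the resulting curve easier to read.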

**Part 5.** Add the outputs of your models as features and train a ridge regression on all the features plus the model outputs (This is called Ensembling and Stacking). Be careful not to overfit. What score can you get? (We will be discussing ensembling more, later in the class, but you can start playing with it now).
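One common way to limit overfitting when stacking is to build the extra features from out-of-fold predictions, so that no training row is predicted by a model that saw its label. The sketch below illustrates that idea under assumed base models, alphas, and synthetic data; it is not necessarily the approach intended by the lab.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the train/test matrices (an assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
X_test = rng.normal(size=(50, 20))
beta = rng.normal(size=20)
y_train = X_train @ beta + rng.normal(size=200)

base_models = [Ridge(alpha=10.0), Lasso(alpha=0.05, max_iter=10_000)]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each training row is predicted by a model
# fit on the other folds only.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=cv) for m in base_models
])

# For the test set, refit each base model on all of the training data.
test_preds = np.column_stack([
    m.fit(X_train, y_train).predict(X_test) for m in base_models
])

# Original features plus the model outputs as extra columns.
stacked_train = np.hstack([X_train, oof])
stacked_test = np.hstack([X_test, test_preds])

meta = Ridge(alpha=1.0).fit(stacked_train, y_train)
final_pred = meta.predict(stacked_test)
```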

## Problem 3 (Nothing to turn in)

Run this simple example from scikit-learn, and understand what each command is doing: https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

## Problem 4

Generate data using the data-generation procedure from the notebook where we first introduced Lasso.

You can find that again here: https://colab.research.google.com/drive/1_NGlKLpXpcobUIlan5DY5nA-5aT39Hxc

**Part 1.** Manually implement forward selection. Report the order in which you add features.
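A minimal sketch of greedy forward selection is given below, assuming the selection criterion is the training residual sum of squares. The data generator (100 samples, 50 features, a support of size 5) only mimics the general shape of the Lasso notebook and is an assumption; `forward_selection` and `true_support` are hypothetical names.

```python
import numpy as np

def forward_selection(X, y, n_features):
    """Greedy forward selection: at each step, add the feature whose
    inclusion gives the smallest training residual sum of squares."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_features):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Assumed sparse data generator (not copied from the class notebook).
rng = np.random.default_rng(0)
n, p = 100, 50
true_support = [3, 11, 20, 34, 47]
beta = np.zeros(p)
beta[true_support] = [4.0, -3.0, 2.5, -2.0, 1.5]
X = rng.normal(size=(n, p))
y = X @ beta + 0.5 * rng.normal(size=n)

order = forward_selection(X, y, n_features=5)
print("order in which features were added:", order)
```

Running the same function with a larger `n_features` and scoring each prefix of `order` on held-out data is one way to approach the support-size question.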

**Part 2.** In this example, we know the true support size is 5. But what if we did not know this? Plot test error as a function of the size of the support. Use this to recover the true support size. Justify your answer.

**Part 3.** Use Lasso with a manually implemented cross validation, using the metric of your choice. What is the value of the hyperparameter? (Manually implemented means that you can either do it entirely on your own, or you can use GridSearchCV, but I’m asking you not to use LassoCV, which you will use in the next problem).
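A hand-rolled K-fold loop over a grid of alphas might look like the sketch below, using mean squared error as the metric. The helper name `lasso_cv_mse`, the grid, the number of folds, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def lasso_cv_mse(X, y, alphas, n_splits=5, seed=0):
    """Return the mean validation MSE for each alpha using K-fold CV."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mse = np.zeros((len(alphas), n_splits))
    for f, (train_idx, val_idx) in enumerate(kf.split(X)):
        for i, alpha in enumerate(alphas):
            model = Lasso(alpha=alpha, max_iter=10_000)
            model.fit(X[train_idx], y[train_idx])
            resid = y[val_idx] - model.predict(X[val_idx])
            mse[i, f] = np.mean(resid ** 2)
    return mse.mean(axis=1)

# Assumed sparse data generator (not copied from the class notebook).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]
y = X @ beta + 0.5 * rng.normal(size=100)

alphas = np.logspace(-3, 1, 20)
cv_mse = lasso_cv_mse(X, y, alphas)
best_alpha = alphas[np.argmin(cv_mse)]
print(f"best alpha: {best_alpha:.4f}")
```

Changing `n_splits` in the call is enough to repeat the search with a different number of folds.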

**Part 4.** (Optional) Change the number of folds in your CV and repeat the previous step. How does the optimal value of the hyperparameter change? Try to explain any trends that you find.

**Part 5.** (Optional) Read about and use LassoCV from sklearn.linear_model. How does this compare with what you did in the previous step? If they agree, then explain why they agree, and if they disagree, explain why. This will require you to make sure you understand what LassoCV is doing.

## Problem 5 (Optional): Higher vs Lower K in K-Fold CV.

Using either Ridge regression (e.g., with the setting in the Ridge Regression colab notebook) or Lasso (e.g., the setting of the Lasso colab notebook, also linked to above), or with any other data sets you wish to construct, design and execute an experiment to investigate the claim that when we do $k$-fold cross validation, as $k$ decreases, we have more bias but less variance. Note that this is an open-ended exercise. It is asking you to use simulation and investigate what is going on with increasing or decreasing the number of folds in cross validation.


## Problem 6 (Optional): Elastic Net

There may be settings where we want to combine ideas from Ridge and Lasso. There is a model that does this, by adding an L1 penalty (as in Lasso) and also an L2 penalty (as in Ridge). Read about this in sklearn and in [ISL](https://www.statlearning.com/) (or anywhere else). Try to construct an example where ElasticNetCV does better than LassoCV. Explain how you came up with this.
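For reference, the combined penalty can be written out explicitly. In the parameterization used by scikit-learn's ElasticNet, with $n$ samples, overall strength $\alpha$, and mixing weight $\rho$ (the `l1_ratio` parameter), the objective is:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term, which recovers Lasso, and $\rho = 0$ leaves only the L2 term. Other references, including ISL, may scale the loss and penalties differently, so values of $\alpha$ are not directly comparable across parameterizations.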
