Before you turn this problem in, make sure everything runs as expected. In the menubar, select **Kernel** $\rightarrow$ **Restart Kernel and Run All Cells...**. If you do not run a specific cell, you will not receive credit for that question. 

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

## Lab 8: Model selection and regularization

Welcome to Lab 8! In this lab, we'll cover model selection and regularization (ISLR 6.1-6.2). 

### Setup

In [None]:
# Run this block.
import numpy as np
import pandas as pd
import sklearn

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set_context("talk")
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

### Section 1: EDA and data filtering

We will be working with the Novotny et al. land-use regression dataset used in lecture during Week 6.  Here's a refresher about the data:

* The data includes GIS land-use characteristics, in situ NO2 measurements from EPA monitors, and NO2 estimates derived from satellite sensors.
* The goal of land-use regression (LUR) is to estimate outdoor air pollution geospatially across the contiguous United States.

In [None]:
#run to load the dataset we'll be working with
df = pd.read_csv('data/BechleLUR_2006_allmodelbuildingdata.csv')

In [None]:
df.head()

**Question 1.1 (1pt)** We create a dataframe `df_model` that contains only the response and predictor variables (i.e. you should drop Monitor_ID, State, Latitude, Longitude, and Predicted_NO2_ppb). The response variable is the observed NO2 concentration. 

Reminder: the target variable will allow us to estimate surface-level $NO_2$ levels at different points in space.

Why do we drop the columns [Monitor_ID, State, Latitude, Longitude, and Predicted_NO2_ppb]?

In [None]:
df_model = df.drop(columns = ['Monitor_ID', 'State', 'Latitude', 'Longitude', 'Predicted_NO2_ppb'])
df_model.head()

YOUR ANSWER HERE

**Question 1.2 (3pts)** We have a lot of potential features in our dataset, and it's hard to visualize all of them in relation to our response variable. To gain some familiarity with the data, however, let's take a close look at a few. We have created a plot with 4 subplots below, and generated 4 scatterplots, each showing a different feature on the x-axis and the response variable on the y-axis. Do you observe any trends or relationships? Visually, would you expect a model selection algorithm to prioritize or minimize any of these features? Why?

In [None]:
y = df['Observed_NO2_ppb'] # y axis variable
ylab = "Observed atmospheric $NO_2$ (ppb)" # y axis label

msize = 70 # marker size
afsize = 15 # axis font size
tfsize = 20 # title font size

plt.figure(figsize = (20,10))

plt.subplot(221)
plt.scatter(df['WRF+DOMINO'], y, s = msize)
plt.xlabel("Ground-level $NO_2$ measurement (ppb)", fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(222)
plt.scatter(df['Elevation_truncated_km'], y, s = msize)
plt.xlabel("Elevation (km)", fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(223)
plt.scatter(df['Impervious_100'], y, s = msize)
plt.xlabel("% impervious surface within 100 m buffer", fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(224)
plt.scatter(df['Population_100'], y, s = msize)
plt.xlabel("Population within 100 m buffer", fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.suptitle("Atmospheric $NO_2$ measurement in the United States", fontsize = tfsize)

plt.subplots_adjust(top=0.5)
plt.tight_layout()

*Your answer here*

### Standardizing your features
As you may recall from lecture, it's a good idea to standardize your independent variables when using regularization methods, especially when the domains of the independent variables span dramatically different ranges.

**Question 1.3 (3pts)** Use the `.describe()` method on the original `df_model` dataframe and examine the results. Provide at least two examples of features whose values span dramatically different ranges (i.e., at least one order of magnitude). Why is it important to standardize your variables? 

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

Luckily for us, `sklearn` has a built-in class, `StandardScaler`, that makes it easy to standardize our variables before we fit our regression models. You can read more about it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). In essence, `StandardScaler` takes in our array of features. For each feature (i.e., column), it subtracts the feature's mean from each element and divides the result by the feature's standard deviation, according to the following equation:

$$
X_{stnd} = \frac{X - \mu}{\sigma}
$$

where $\mu$ and $\sigma$ are the mean and standard deviation, respectively, of the feature $X$. 
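To make the equation concrete, here is a minimal NumPy sketch that standardizes a small made-up array by hand. The array `X_toy` and its values are invented for illustration and are not part of the lab data. The sketch uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` documents.

```python
import numpy as np

# Toy feature matrix (made-up values): two columns on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

mu = X_toy.mean(axis=0)     # per-column mean
sigma = X_toy.std(axis=0)   # per-column population standard deviation (ddof=0)
X_toy_stnd = (X_toy - mu) / sigma

print(X_toy_stnd)
print(X_toy_stnd.mean(axis=0))  # approximately 0 for each column
print(X_toy_stnd.std(axis=0))   # approximately 1 for each column
```

After standardization both columns are on the same scale, even though the raw values differed by three orders of magnitude.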

[Side note: `StandardScaler` is best suited to features that are roughly normally distributed. If that's not the case, `sklearn` offers other scaling methods. For further reading, [see this article](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02).]

The following code block shows an example of how to use `StandardScaler` to standardize the first three features from the Novotny et al. data.

In [None]:
from sklearn.preprocessing import StandardScaler

novotny_3 = df_model.iloc[:,1:4] # We'll standardize the WRF+DOMINO, Distance_to_coast_km, and Elevation_truncated_km features

scaler = StandardScaler() # Initialize the StandardScaler
scaler.fit(novotny_3) # Fit the standard scaler for each feature
print(scaler.mean_) # Print the mean value of each feature

novotny_stnd = scaler.transform(novotny_3) # Standardize each feature
novotny_stnd

In the next section, we'll use `StandardScaler` to transform our features before creating models.

---

Now that we've loaded the data and done some visual exploration, we can think about how to choose which features to include in the model. Features can provide important information and predictive power. However, as we add more features to the model, we risk increasing its variance (meaning our model performs poorly with test data relative to training data). It also can be harder to make sense of a model with lots of features. Rather than throwing out features entirely, we can turn to a technique called regularization to reduce the variance of our model while still incorporating as much information about the data as possible.

More generally, we can adopt the framework of regularized loss minimization.

$$ \large \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(y_i, \hat{y_i}\right) + \lambda \textbf{R}(\theta) $$

The regularization term $\textbf{R}(\theta)$ is essentially a penalty for complex and therefore higher variance models. The regularization parameter $\lambda$ determines the degree of regularization to apply and is typically determined through cross validation.

The two regularization methods that we're exploring in this lab (Ridge regression and Lasso regression) use different regularization terms $\textbf{R}(\theta)$.
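As a rough illustration of the regularized loss framework, the sketch below evaluates the objective for a linear model with squared-error loss. The helper `regularized_loss` and the toy arrays are hypothetical names introduced only for this example. It assumes the L2 penalty is the sum of squared coefficients and the L1 penalty is the sum of their absolute values.

```python
import numpy as np

def regularized_loss(theta, X, y, lam, penalty="l2"):
    """Mean squared error plus lam * R(theta) (hypothetical helper)."""
    residuals = y - X @ theta
    mse = np.mean(residuals ** 2)
    if penalty == "l2":
        r = np.sum(theta ** 2)        # sum of squared coefficients
    elif penalty == "l1":
        r = np.sum(np.abs(theta))     # sum of absolute coefficients
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return mse + lam * r

# Toy data (made up): theta_demo fits y_demo exactly, so the loss term is 0
# and only the penalty contributes to the objective.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
theta_demo = np.array([1.0, 2.0])

print(regularized_loss(theta_demo, X_demo, y_demo, lam=0.0))
print(regularized_loss(theta_demo, X_demo, y_demo, lam=0.1, penalty="l2"))
print(regularized_loss(theta_demo, X_demo, y_demo, lam=0.1, penalty="l1"))
```

With `lam=0` the objective reduces to the plain loss. As `lam` grows, the same coefficients cost more, so the minimizer is pushed toward smaller coefficients.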

### Section 2: L2 Regularization with Ridge Regression


Ridge regression (L2 regularization) uses the *penalty* term $\large R_{L^2}(\theta) = \sum_{k=1}^p (\theta_k)^2$, where $p$ is the number of model features.

Note that $\lambda$ is a tunable parameter - as the person creating the model, you can choose to increase or decrease $\lambda$ based on how strongly you want to penalize large coefficients. The higher the value of $\lambda$, the more the coefficients are shrunk toward zero. A large $\lambda$ decreases the model's variance at the cost of increasing its bias.
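The effect of $\lambda$ can be seen in a small sketch on synthetic data (not the lab dataset). It assumes centered data with no intercept and an objective of the sum of squared errors plus $\lambda$ times the sum of squared coefficients, which has a closed-form solution. The function `ridge_closed_form` and all variable names are hypothetical, and this is not a substitute for the `Ridge` model you will fit below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the lab dataset): 50 samples, 5 standardized features.
X_syn = rng.normal(size=(50, 5))
X_syn = (X_syn - X_syn.mean(axis=0)) / X_syn.std(axis=0)
true_theta = np.array([3.0, -2.0, 0.0, 0.5, 0.0])
y_syn = X_syn @ true_theta + rng.normal(scale=0.5, size=50)
y_syn = y_syn - y_syn.mean()  # center y so that no intercept is needed

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X theta||^2 + lam * ||theta||^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_closed_form(X_syn, y_syn, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda = {lam:>6}: ||theta|| = {norm:.3f}")
```

The length of the coefficient vector shrinks as $\lambda$ grows, but it never reaches exactly zero for any finite $\lambda$.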

**Important:** In scikit-learn, the value of $\lambda$ is passed in through the argument `alpha`.

**Question 2.1 (1pt)** Separate the `df_model` dataframe into train and test sets, with 25% of the data in the test set. Set a `random_state` of 1. Begin by setting `X` to the matrix of predictor variables (all quantitative columns in the dataframe except the response variable) and set `y` equal to the response variable `Observed_NO2_ppb`. Standardize your `X` variables using `StandardScaler`. Then apply `train_test_split` to `X` and `y` to split the data.

In [None]:
# YOUR CODE IN THIS CELL
from sklearn.model_selection import train_test_split

X = df_model.drop(["Observed_NO2_ppb"], axis=1)

# Standardize the features
scaler = StandardScaler()
scaler.fit(...)
X_stnd = scaler.transform(...)

y = ...

X_train, X_test, y_train, y_test = (...)

In [None]:
#run this to make sure you split the data correctly
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

**Question 2.2 (1pt)** Fit a `LinearRegression` model (without regularization) and print the resulting list of coefficients.

In [None]:
# YOUR CODE IN THIS CELL
from sklearn.linear_model import LinearRegression

lm = ...
lm.fit(...)
lm_coefficients = ...

print(lm_coefficients)

**Question 2.3 (1pt)** Import and create a Ridge regression model with the `alpha` value set to 1. Fit the model to the training data, then return a list of the coefficients that the model associates with each feature in the training data. The [scikit-learn documentation for Ridge()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) is helpful here.

In [None]:
# YOUR CODE IN THIS CELL
from sklearn.linear_model import Ridge

ridge = ...
ridge.fit(...)
ridge_coefficients = ...

print(ridge_coefficients)

**Question 2.4 (1pt)** Run the code below to generate a bar chart that shows the coefficient values from simple linear regression in blue, and from ridge regression in red. Then, in the markdown cell below, comment on the results. Can you explain your observations based on your understanding of L2 ridge regression?

*Hint* If you're having a hard time making out some of the coefficients, try re-creating the plot using a log scale. 

In [None]:
# run this cell
ind = np.arange(len(lm_coefficients))
width = 0.5

plt.figure(figsize = (15,7))

plt.bar(ind-(width/2), width = width, height = lm_coefficients, label = "simple linear regression")
plt.bar(ind+(width/2), width = width, height = ridge_coefficients, label = r"ridge regression, $\alpha$ = 1")
plt.xlabel("feature number")
plt.ylabel("coefficient")
plt.title("Coefficient values with simple linear regression and ridge regression")
plt.legend()
plt.show()

*Your answer here*

**Question 2.5 (1pt)**: We just observed how the Ridge Regression model generates coefficients when `alpha` is set to one. Complete the following code which generalizes the fitting and predicting process we just did in Question 2.3 for various values of `alpha`.

Then, calculate the mean squared error (MSE) between our predictions and the test dataset. The MSE in this case is a measure of the accuracy of our predictions.
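As a reminder of what the MSE measures, here is a tiny worked example with made-up numbers (the arrays `y_true_toy` and `y_pred_toy` are invented for illustration). It computes the mean of the squared differences by hand, which is the quantity `mean_squared_error` reports.

```python
import numpy as np

y_true_toy = np.array([3.0, 5.0, 2.0])   # made-up observed values
y_pred_toy = np.array([2.0, 5.0, 4.0])   # made-up predictions

squared_errors = (y_true_toy - y_pred_toy) ** 2   # errors of 1, 0, -2 squared
mse_toy = squared_errors.mean()                   # (1 + 0 + 4) / 3

print(squared_errors)
print(mse_toy)
```

A lower MSE means the predictions are, on average, closer to the observed values.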

In [None]:
# Replace ellipses with your code
from sklearn.metrics import mean_squared_error
alphas = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9]
mses = []

for a in alphas:
    model = ...
    model.fit(...)
    y_pred = model.predict(...)
    mses.append(mean_squared_error(...))

print(mses)


# Create a plot
a_log = np.log10(alphas)

plt.figure(figsize = (15,5))
sns.barplot(x = a_log, y = mses, color = 'cadetblue') # pass x and y by keyword; recent seaborn versions reject positional data
plt.xlabel(r'$log_{10}(\alpha)$')
plt.ylabel('MSE')
plt.title('Ridge regression MSE for each value of alpha');

**Question 2.6 (2 pts)** What is the lowest MSE observed and which value of alpha did it come from? What value of $\lambda$ does that correspond to? Does the value of $\lambda$ that minimizes MSE more heavily or less heavily penalize additional coefficients than our initial value of $\lambda$ that we used to produce the plot in Question 2.4?

In [None]:
print("lowest MSE:", np.min(mses), "; alpha:", alphas[np.argmin(mses)])

YOUR ANSWER HERE

### Section 3: L1 Regularization with Lasso Regression

While ridge regression shrinks coefficients, it incorporates *all* the features into your model. It won't actually drive any coefficients to exactly 0 (unless $\lambda$ = $\infty$!). This can make your model less *interpretable* - for instance, in the case of the model we created in Section 2, we have over 120 non-zero coefficients and thus over 120 features.

Lasso regression (also called L1 Regularization) avoids the issue of including too many unimportant variables by using a model formulation that can drive some coefficients to 0.

Lasso regression uses the *penalty* term $\large R_{L^1}(\theta) = \sum_{k=1}^p \Big|\theta_k\Big|$, where $p$ is the number of model features.
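One way to see why the L1 penalty can produce exact zeros is a simplified special case. Assume the features are orthonormal ($X^T X = I$). Minimizing $\frac{1}{2}\|y - X\theta\|^2 + \lambda \sum_k |\theta_k|$ then reduces to "soft-thresholding" each unregularized coefficient. The analogous ridge problem with penalty $\frac{\lambda}{2} \sum_k \theta_k^2$ only rescales each coefficient by $1/(1+\lambda)$. The sketch below applies both updates to made-up coefficients; `soft_threshold` and `theta_ols` are hypothetical names, and real lasso solvers handle correlated features iteratively.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward 0 by t; entries with |z| <= t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

theta_ols = np.array([4.0, -1.5, 0.3, -0.2])  # made-up unregularized coefficients
lam = 0.5

theta_l1 = soft_threshold(theta_ols, lam)  # lasso-style update (orthonormal case)
theta_l2 = theta_ols / (1.0 + lam)         # ridge-style update (orthonormal case)

print("L1:", theta_l1, "non-zero:", np.count_nonzero(theta_l1))
print("L2:", theta_l2, "non-zero:", np.count_nonzero(theta_l2))
```

The two small coefficients are set to exactly zero under the L1 update, while the L2 update keeps all four non-zero.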

**Question 3.1 (1pt)** Let's repeat the steps we did above for Ridge Regression, this time for Lasso Regression. Create a Lasso model with an `alpha` of 1 and fit on the X_train and y_train dataset.

In [None]:
# YOUR CODE IN THIS CELL
from sklearn.linear_model import Lasso

lasso = ...
lasso.fit(...)
lasso_coefficients = ...
print(lasso_coefficients)

Run the cell below to output a plot that shows the coefficients from the simple linear regression in Section 2, the ridge regression in Section 2, and the lasso regression above side by side. The coefficients are plotted on a logarithmic scale.

In [None]:
# run this cell
ind = np.arange(len(lm_coefficients))
width = 0.3

plt.figure(figsize = (15,7))

# Plot magnitudes: np.log of a negative or zero coefficient is undefined (NaN or -inf)
plt.bar(ind-width, width = width, height = np.abs(lm_coefficients), label = "simple linear regression")
plt.bar(ind, width = width, height = np.abs(ridge_coefficients), label = r"ridge regression, $\alpha$ = 1")
plt.bar(ind+width, width = width, height = np.abs(lasso_coefficients), label = r"lasso regression, $\alpha$ = 1")
plt.yscale("log") # coefficients that are exactly 0 do not appear on a log axis
plt.xlabel("feature number")
plt.ylabel("|coefficient|")
plt.title("Coefficient values with simple linear regression, ridge regression, and lasso regression (log scale)")
plt.legend()
plt.show()

**Question 3.2 (1pt):** Comment on the results in the graph above. Can you explain your observations based on your understanding of L1 lasso regression?

*Your answer here*

**Question 3.3 (2pts)** What proportion of the dataset's features are "ignored" by this lasso model? What are the column names of the features that are **not** ignored by this lasso model?

In [None]:
prop_ignored = 1 - (np.count_nonzero(lasso_coefficients) / len(lasso_coefficients))
print("proportion of features ignored:", prop_ignored)

cols_included = X.columns[lasso_coefficients != 0]
print("columns included:", cols_included)

YOUR ANSWER HERE

**Question 3.4 (1pt)**: Look back to the features you plotted in Question 1.2. Were any of those features ignored or included by the lasso model?

YOUR ANSWER HERE

**Question 3.5 (2pts)**: Remember how we calculated the test MSE for different values of $\alpha$ in question 2.5? Now, we're going to write a function that automates that process, taking as input a list of alphas `alphas` and a model (`Ridge` or `Lasso`). Complete the function below, and then define a list of alphas and call the function using the `Lasso` model to return a list of MSEs.

In [None]:
def calculate_mses(alphas, Model):
    """
    Input:
        alphas (array): contains floats of various alpha values
        Model (sklearn model): the type of sklearn model on which to fit the data
    Output:
        an array of floats containing the mean-squared-errors from the predictions
    """
    mses = []

    for a in alphas:
        model = ... # Your code here
        model.fit(...)
        y_pred = ...
        mses.append(...)
    return mses

In [None]:
alphas = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9]
lasso_mses = calculate_mses(alphas, Lasso)
lasso_mses

**Question 3.6 (1pt)** How do the best Ridge regression model (identified in Question 2.6) and the best Lasso regression model (based on your answer to Question 3.5) perform relative to simple linear regression with respect to the mean squared error? Which model has the lowest error?

In [None]:
print("linear model MSE:", mean_squared_error(y_test, lm.predict(X_test)))
print("ridge regression minimum MSE:", np.min(mses))
print("lasso regression minimum MSE:", np.min(lasso_mses))

*YOUR ANSWER HERE*

# Hooray, you're done! 

Please remember to submit your lab work, after clicking Kernel -> Restart & Run All, in .pdf and .ipynb format on bCourses.

Further Reading:

Regularization - https://www.textbook.ds100.org/ch/16/reg_intro.html
    
Notebook developed by Alex McMurry, Kevin Marroquin, and Melissa Ly

Data Science Modules: http://data.berkeley.edu/education/modules