Before you turn this problem in, make sure everything runs as expected. In the menubar, select **Kernel** $\rightarrow$ **Restart Kernel and Run All Cells...**. If you do not run a specific cell, you will not receive credit for that question. 

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 7: Resampling

Welcome to the seventh lab of the semester!

In this notebook, we'll explore resampling (which is relevant to Homework 7). Resampling is just what it sounds like - it involves repeatedly taking different samples of the data to train and test your model. We'll focus on cross-validation. You can learn more about cross-validation in ISLP, or in the [scikit-learn user guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation). 

### Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore', FutureWarning) # Seaborn triggers warnings in scipy
%matplotlib inline

In [None]:
# Configure nice plotting defaults - (this must be done in a cell separate from %matplotlib call)
plt.style.use('seaborn-v0_8')
sns.set_context('talk', font_scale=1)
plt.rcParams['figure.figsize'] = (10, 6)

----

## Creating Testing and Training Splits

To gain a little more intuition about cross-validation, we're going to work with the Boston Housing dataset, which concerns the housing values in the suburbs of Boston. The dataset was originally assembled by [Harrison and Rubinfeld (1978)](https://doi.org/10.1016/0095-0696(78)90006-2), and is now a classic dataset in environmental economics and data science. We'll be using two features from the dataset, `NOX` (the nitrogen oxides concentrations, in ppm) and `LSTAT` (the percent of the population classified as "lower status" based on education and occupation), to predict the `target` column (the median value of owner-occupied homes, in thousands of dollars). Run the following cells to load the data and plot the two features we'll be working with. 

In [None]:
cv_data = pd.read_csv('data/boston_housing.csv')
cv_data

In [None]:
plt.scatter(cv_data.NOX, cv_data.target)
plt.xlabel('Nitrogen oxide concentration (ppm)')
plt.ylabel('Median value of owner-occupied homes (thousand $)');

In [None]:
plt.scatter(cv_data.LSTAT, cv_data.target)
plt.xlabel('Percent of population classified as lower status')
plt.ylabel('Median value of owner-occupied homes (thousand $)');

Before we attempt cross validation, we're going to split our data into training and testing sets and fit a model to which we can compare our cross validation results. 

**Question 1 (1pt)**  

Split the dataset, using `NOX` and `LSTAT` as our features and `target` as the dependent variable we're predicting. Set `test_size` to 25% of the sample and `random_state` to 2. Here, we'll use the [train_test_split() function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
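
As a reminder of the general shape of the call, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy`, and the values `test_size=0.2` and `random_state=0`, are hypothetical and chosen only for illustration; they are not the values this question asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 observations, 2 features, 1 response.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Illustrative argument values only, not the ones requested in Question 1.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

The features and the response are shuffled together, so each row of `X_tr` still lines up with the matching entry of `y_tr`.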

In [None]:
from sklearn.model_selection import train_test_split

X = ... #specify the features
y = ... #specify the target variable

X_train, X_test, y_train, y_test = train_test_split(...)

In [None]:
assert np.isclose(len(X_train), 0.75*len(cv_data), atol=1)
assert np.isclose(len(y_test), 0.25*len(cv_data), atol=1)

Before we continue, let's review the arguments for `train_test_split`.

**Question 2 (1pt)** How do the parameters `test_size` and `random_state` affect the data we work with? 

*Your answer here*

Now, let's fit the model with our training data.

**Question 3 (2pts)** Instantiate a `LinearRegression` model and fit it to the training data. Then, predict the `target` variable using the testing data. Lastly, print the MSE of the fitted model on the testing data. We'll use the sklearn function [mean_squared_error()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lm = ...
lm.fit(...)
y_pred_test = ...

print('MSE: ', mean_squared_error(...))

With this, we have a baseline we can compare with the cross-validated error. 

## Resampling and Cross-Validation

To recap, in the previous section we did a random train/test split on the data, trained the model on the training data, and evaluated how the model performed on the testing data. But when we're trying to figure out how well our model will perform with new data, it's often not enough to get the MSE from just one random split into training and testing data. This is particularly true if we're doing any kind of model selection or hyperparameter tuning. Suppose we want to use $k$-Nearest Neighbors for a classification task. We want to select the value of $k$ that minimizes out-of-sample error. But to avoid overfitting, we shouldn't use the test data to evaluate our model until we've chosen a value for $k$. To resolve this conundrum, we can use cross-validation! 
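
To see why a single split can be misleading, the following sketch refits the same linear model on ten different random splits and reports the spread of the test MSE. The data are synthetic and every name here (`X_sim`, `y_sim`, `split_mse`) is hypothetical, so the numbers say nothing about the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): a noisy linear relationship.
rng = np.random.default_rng(0)
X_sim = rng.uniform(0, 10, size=(100, 1))
y_sim = 3 * X_sim[:, 0] + rng.normal(0, 5, size=100)

# Refit the same model on ten different random splits and record the test MSE.
split_mse = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sim, y_sim, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    split_mse.append(mean_squared_error(y_te, model.predict(X_te)))

print("Min MSE:", min(split_mse))
print("Max MSE:", max(split_mse))
```

The gap between the smallest and largest value comes purely from which observations happened to land in the test set.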

In $k$-fold cross validation, we divide the training set into $k$ non-overlapping subsets. (This $k$, the number of folds, is unrelated to the $k$ in $k$-Nearest Neighbors.) Then, for each of the $k$ folds, we train the model on $k-1$ of the subsets and test it on the one subset that wasn't used in training, which provides us with an estimate of the out-of-sample error. We can take the average of the model's performance across all $k$ folds to get an even better estimate of out-of-sample error. Finally, once we've done this for all of the parameter values we want to try (a process known as grid search), we can choose the best model and evaluate it on the test data. This process is illustrated in the figure below, for $k=5$:

<img src='cv.png' width="50%" height="50%"></img>
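
The procedure described above can be sketched in code roughly as follows. This is only an illustration under stated assumptions: the data come from `make_classification`, the grid of neighbor counts is arbitrary, and names such as `X_dev`, `X_hold`, and `cv_error` are hypothetical. To avoid confusion with the number of folds, the number of neighbors is called `n_neighbors` here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (hypothetical), with a test set held out up front.
X_all, y_all = make_classification(n_samples=200, n_features=4, random_state=0)
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
candidate_neighbors = [1, 3, 5, 7, 9]  # arbitrary grid of values to try
cv_error = {}

for n_neighbors in candidate_neighbors:
    fold_errors = []
    for train_idx, val_idx in folds.split(X_dev):
        knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        knn.fit(X_dev[train_idx], y_dev[train_idx])
        # Misclassification rate on the fold that was left out of training
        fold_errors.append(1 - knn.score(X_dev[val_idx], y_dev[val_idx]))
    cv_error[n_neighbors] = np.mean(fold_errors)

# Pick the value with the lowest average validation error.
best = min(cv_error, key=cv_error.get)

# Only now touch the test set, after refitting on all of the training data.
final_knn = KNeighborsClassifier(n_neighbors=best).fit(X_dev, y_dev)
print("Chosen n_neighbors:", best)
print("Test error:", 1 - final_knn.score(X_hold, y_hold))
```

Because `folds` is created with a fixed `random_state`, every candidate value is scored on the same five splits, which keeps the comparison fair.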

### Leave-One-Out Cross Validation

Let's begin by implementing Leave-One-Out Cross Validation (LOOCV), which is essentially $n$-fold cross validation, where $n$ is the number of observations in the dataset. For each fold, LOOCV splits the dataset into two parts: a single observation $(x_i, y_i)$ is used as the validation set, and the rest are used as the training set. 
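
Written as a formula, the LOOCV estimate of out-of-sample error is the average of the $n$ single-observation squared errors. The notation below follows the usual textbook convention and is an assumption here, since this lab does not define it:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $\hat{y}_i$ is the prediction for $x_i$ from the model fitted on every observation except $(x_i, y_i)$.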

**Question 4 (1pt)**

What is a drawback of using only one observation for the validation set? Would LOOCV have much utility when splitting large datasets? Explain.

*YOUR ANSWER HERE*

We'll use scikit-learn's `LeaveOneOut` class to split our dataset. Run the following cell to perform LOOCV and check what happens to the MSE.

In [None]:
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
loo.get_n_splits(X)
y_tests = []
y_predictions = []
for train, test in loo.split(X): #a
    Xr_train, Xr_test = np.array(X)[train], np.array(X)[test] #b
    yr_train, yr_test = np.array(y)[train], np.array(y)[test] #c
    
    lr = LinearRegression()
    lr.fit(Xr_train, yr_train) 
    yr_pred = lr.predict(Xr_test) #d
    
    y_tests.append(yr_test) #e
    y_predictions.append(yr_pred) #e

MSE_loo = mean_squared_error(y_tests, y_predictions)

print("MSE (LOOCV): ", MSE_loo)

**Question 5 (1pt)** Several of the lines in the code are associated with a commented letter (e.g., *#a*). Answer the question associated with each letter:

a. For each iteration of the `for` loop, what is `test`, and what is `train`? How do `test` and `train` relate to each other, and how do they change over each iteration?<br>
b. What do the values of `Xr_train` represent? <br>
c. What do the values of `yr_test` represent?<br>
d. How many values are in the `yr_pred` array?<br>
e. How many values are in the `y_tests` and `y_predictions` lists?<br>

In [None]:
# scratch work

*YOUR ANSWER HERE*

**Question 6 (1pt)** Relative to Question 3, how does the MSE change when we do LOOCV?

*Your answer here*

### K-Fold Cross Validation with scikit-learn

We'll close by introducing how to implement k-fold cross validation using `scikit-learn`. Unlike LOOCV, k-fold cross validation uses a smaller number of folds than the number of observations in the dataset. 
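
For reference, scikit-learn also offers `cross_val_score`, which runs the whole split, fit, and score loop in one call. The sketch below uses synthetic data, and the names `X_demo`, `y_demo`, and `demo_fold_mse` are hypothetical; it is meant as a cross-check, not a replacement for writing the loop yourself in the questions that follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical): two features, noisy linear response.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=80)

cv = KFold(n_splits=4, shuffle=True, random_state=1)

# scikit-learn scorers follow a "higher is better" convention,
# so the MSE comes back negated and we flip the sign.
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
demo_fold_mse = -neg_mse

print("MSE per fold:", demo_fold_mse)
print("Mean MSE:", demo_fold_mse.mean())
```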

**Question 7 (2pts)** Name one advantage and one disadvantage of k-fold cross validation relative to LOOCV.

*YOUR ANSWER HERE*

Let's practice splitting training data into $k$ folds for validation purposes. First, we'll import the `KFold` class from `scikit-learn`. 

In [None]:
from sklearn.model_selection import KFold

Now we'll split the array `X` from Question 1 into 4 folds, shuffling the data before it is split into folds, with a random state of 1. For each fold, we'll print the indices of the training and validation sets.

In [None]:
kf = KFold(n_splits = 4, shuffle = True, random_state = 1)

fold = 1
for t_index, v_index in kf.split(X):
    print("Fold", fold)
    print("Train:", t_index, "Validation:", v_index, '\n')
    fold+=1

**Question 8 (3 pts):** Using the code above as a starting point: 
1. Separate the `X` and `y` data into training and validation subsets for each fold. 
2. Train a linear regression model using the training subset for each fold. 
3. Save the validation MSE associated with each fold to a list called `fold_mse`. 
4. Print the mean MSE from across all four folds. 

In [None]:
kf = KFold(n_splits = 4, shuffle = True, random_state = 1)

fold_mse = ... # initialize a list to hold the MSE associated with each fold

for t_index, v_index in kf.split(X):
    # Subset X and y into training and validation subsets
    X_fold_train = ...
    y_fold_train = ...
    X_fold_val = ...
    y_fold_val = ...
    
    # Instantiate and fit a linear regression model using the training data
    lm = ...
    
    # Predict the Y-values associated with the validation data
    y_pred = ...
    
    # Find the validation MSE and append it to fold_mse
    ...
  
    
print(fold_mse)

# Find the mean MSE across all four folds
np.mean(fold_mse)

---
Notebook developed by: Joshua Asuncion, Rebekah Tang; 
Revised by: Dawson Verley

Data Science Modules: http://data.berkeley.edu/education/modules
