
# 5.6 Model Selection and Hyperparameter Tuning

In this lesson, we will use the tools developed in the previous lesson to answer two important questions:

- Model Selection: Is $k$-nearest neighbors better or is linear regression better? Which features should we include in the model?
- Hyperparameter Tuning: How do we choose hyperparameters, such as $k$ in $k$-nearest neighbors?

In the previous lesson, we saw how to use cross-validation to estimate how well a model will perform on test data. A natural way to decide between competing models or hyperparameters is to choose the one that minimizes the validation error.
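
As a minimal sketch of this idea, the code below compares a $k$-nearest neighbors pipeline with a linear regression pipeline by their cross-validated MSE. The data are synthetic (generated with NumPy purely for illustration, not the Bordeaux data), and the names `candidates`, `cv_mse`, and `best_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y is linear in X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Candidate models to compare.
candidates = {
    "4-nearest neighbors": make_pipeline(StandardScaler(),
                                         KNeighborsRegressor(n_neighbors=4)),
    "linear regression": make_pipeline(StandardScaler(), LinearRegression()),
}

# Estimate the test MSE of each candidate with 10-fold cross-validation.
cv_mse = {
    name: -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=10).mean()
    for name, model in candidates.items()
}

# Choose the candidate with the smallest estimated test error.
best_model = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print(best_model)
```

Because the synthetic response is linear in the features, linear regression should come out ahead here. On real data, either model can win, which is why we compare them by cross-validation.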

In [0]:
import pandas as pd
import numpy as np

# Extract the training data.
data_dir = "https://dlsun.github.io/pods/data/"
bordeaux_df = pd.read_csv(data_dir + "bordeaux.csv",
                          index_col="year")
bordeaux_train = bordeaux_df.loc[:1980].copy()
bordeaux_train["log(price)"] = np.log(bordeaux_train["price"])

## Model Selection

Suppose we wish to fit a $4$-nearest neighbors model but are not sure which features to include in the model. In the code below, we consider five possible sets of features.

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# calculate estimate of test error for a given feature set
def get_cv_error(features):
  # define pipeline
  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor(n_neighbors=4)
  )
  # calculate errors from cross-validation
  cv_errs = -cross_val_score(pipeline, X=bordeaux_train[features], 
                             y=bordeaux_train["log(price)"],
                             scoring="neg_mean_squared_error", cv=10)
  # calculate average of the cross-validation errors
  return cv_errs.mean()

# calculate and store errors for different feature sets
errs = pd.Series(dtype=float)
for features in [["win", "summer"],
                 ["win", "summer", "age"],
                 ["win", "summer", "age", "sep"],
                 ["win", "summer", "age", "har"],
                 ["win", "summer", "age", "har", "sep"]]:
  errs[str(features)] = get_cv_error(features)

errs

Notice that more is not necessarily better. The model with the lowest mean-squared error is actually the one with four features:

- winter rainfall (**win**)
- average summer temperature (**summer**)
- age of the wine (**age**)
- harvest rainfall (**har**)

The mean-squared error actually increases when we add the average September temperature (**sep**) to the model.
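
The five feature sets above were chosen by hand. When the number of candidate features is small, one option is to try every non-empty subset. The sketch below does this with `itertools.combinations` on synthetic data. The column names `x1` through `x4` and the helper `cv_mse` are hypothetical, and the data are an assumption for illustration. Note that $p$ features give $2^p - 1$ subsets, so this approach does not scale to many features.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): only x1 and x2 affect y.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(80, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=80)

def cv_mse(features):
    """Estimate the test MSE of a 4-nearest neighbors model on `features`."""
    pipeline = make_pipeline(StandardScaler(),
                             KNeighborsRegressor(n_neighbors=4))
    scores = cross_val_score(pipeline, df[list(features)], y,
                             scoring="neg_mean_squared_error", cv=10)
    return -scores.mean()

# Try every non-empty subset of the candidate features.
subset_errs = pd.Series({
    ", ".join(subset): cv_mse(subset)
    for size in range(1, len(df.columns) + 1)
    for subset in combinations(df.columns, size)
})

# Show the subsets with the smallest estimated test error.
print(subset_errs.sort_values().head())
```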

## Hyperparameter Tuning

Now that we have a good working model, how do we determine the optimal value of the hyperparameter $k$? We can use cross-validation to estimate the test error for different values of $k$ and choose the value with the smallest estimated test MSE. This is not hard to do manually.

In [0]:
X_train = bordeaux_train[["win", "summer", "age", "har"]]
y_train = bordeaux_train["log(price)"]

# calculate estimate of test error for a value of k
def get_cv_error(k):
  # define pipeline
  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor(n_neighbors=k)
  ) 
  # calculate errors from cross-validation
  cv_errs = -cross_val_score(pipeline, X=X_train, y=y_train,
                             scoring="neg_mean_squared_error", cv=10)
  # calculate average of the cross-validation errors
  return cv_errs.mean()
    
# index the candidate values of k by k itself, so the errors are labeled by k
ks = pd.Series(range(1, 20), index=range(1, 20))
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()

The MSE is minimized at $k = 3$, which suggests that we should use a $3$-nearest neighbors model to predict the (log) price of wine.

Scikit-learn provides a utility, `GridSearchCV`, that automates most of the drudgery of trying different hyperparameters. We specify `param_grid=`, which is a dictionary that maps the name of each parameter (e.g., `n_neighbors`) to a list of values to try. The model with the highest score (here, the lowest MSE, since the score is the negated MSE) is refit on all of the training data and stored in `.best_estimator_`. (Note the trailing underscore, which indicates that this attribute is only available after `.fit()` has been called.)

For simplicity, let's start by fitting $k$-nearest neighbors to the raw data directly, without standardization.

In [0]:
from sklearn.model_selection import GridSearchCV

model = KNeighborsRegressor(n_neighbors=5)

# GridSearchCV will replace n_neighbors by values in param_grid.
grid_search = GridSearchCV(model,
                           param_grid={"n_neighbors": range(1, 20)},
                           scoring="neg_mean_squared_error",
                           cv=10)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_

More realistically, we will want to standardize the feature variables before passing them into `KNeighborsRegressor`, so we set up a pipeline. Notice that each step in the pipeline is automatically given a name. We will need to refer to this name when using `GridSearchCV`.

In [0]:
# define pipeline
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5)
) 
pipeline

The parameter that we want to tune is `n_neighbors`, which is embedded inside the `kneighborsregressor` step of the pipeline. In `GridSearchCV`, the convention is to refer to a parameter inside a step of the pipeline as `<step>__<parameter>`. So the parameter we are tuning in this case is `kneighborsregressor__n_neighbors`.
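
If you are unsure of the step names or of which parameters can be tuned, you can ask the pipeline itself. The sketch below rebuilds the same pipeline and lists its step names and its `<step>__<parameter>` names. The variables `step_names` and `param_names` are hypothetical, introduced only for this example.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# make_pipeline names each step after its class, in lowercase.
step_names = list(pipeline.named_steps)
print(step_names)

# get_params() lists every parameter of the pipeline; the ones that belong
# to a step follow the <step>__<parameter> naming convention.
param_names = [name for name in pipeline.get_params() if "__" in name]
print(param_names)
```

Any name in `param_names` can be used as a key in `param_grid`.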

In [0]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipeline,
                           param_grid={
                               "kneighborsregressor__n_neighbors": range(1, 20)
                           },
                           scoring="neg_mean_squared_error",
                           cv=10)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_

Inspecting the scikit-learn model that was deemed best by `GridSearchCV`, we see that a $3$-nearest neighbors model has the lowest MSE, which agrees with what we obtained earlier. A complete summary of everything `GridSearchCV` tried is available in the attribute `.cv_results_`.

In [0]:
grid_search.cv_results_
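
The raw `.cv_results_` dictionary is hard to read. One way to make it more digestible is to convert it to a `DataFrame` and flip the sign of the scores to recover the MSE. The sketch below does this on synthetic data (an assumption for illustration, not the Bordeaux data). The column name `mean_cv_mse` and the variables `results` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=60)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
grid_search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": range(1, 20)},
    scoring="neg_mean_squared_error",
    cv=10)
grid_search.fit(X, y)

# One row per hyperparameter value that was tried.
results = pd.DataFrame(grid_search.cv_results_)

# The scores are negated MSEs, so flip the sign to recover the MSE.
results["mean_cv_mse"] = -results["mean_test_score"]
summary = results[["param_kneighborsregressor__n_neighbors",
                   "mean_cv_mse", "rank_test_score"]]
print(summary.sort_values("rank_test_score").head())

# The best hyperparameter value and its cross-validated MSE.
print(grid_search.best_params_)
print(-grid_search.best_score_)
```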

## Exercises

1\. Train a linear regression model on the Ames data (http://dlsun.github.io/pods/data/AmesHousing.txt ) that predicts the sale price using the square footage (**Gr Liv Area**), number of bedrooms (**Bedrooms AbvGr**), and neighborhood (**Neighborhood**). Decide whether it would be valuable to add the number of full bathrooms (**Full Bath**) and/or the number of half bathrooms (**Half Bath**) to this model.

2\. Train a $k$-nearest neighbors model on the tips data (http://dlsun.github.io/pods/data/tips.csv ). Use cross-validation to determine the optimal value of $k$.