# Model Selection

[Resource](https://harvard-iacs.github.io/2018-CS109A/sections/section-4/demo/)

In [3]:
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)

import itertools

from IPython.display import display
pd.set_option("display.max_columns", 999)
pd.set_option("display.width", 500)
sns.set_style("whitegrid")
sns.set_context("talk")

# NYC Car Hire Dataset

Welp... here we are again. This data requires a request for access to the professor's Google Drive, so it looks like our only current option is to read through the example for now. Who knows! Maybe they'll accept my Google Drive access request!

Worst comes to worst, we can just skip right to [the lab](https://harvard-iacs.github.io/2018-CS109A/labs/lab-4/solutions/) with the expectation that it will show some model selection techniques.

Yeah... sounds like a plan! Read through the [model selection](https://harvard-iacs.github.io/2018-CS109A/sections/section-4/demo/) code to get an idea and know what to expect, then move on to [lab 4](https://harvard-iacs.github.io/2018-CS109A/labs/lab-4/solutions/).

While I'm going over the lecture, I'm gonna write down some notes.

## Python Set .difference()

The first noteworthy takeaway is the `.difference()` method, which can be used to find elements that exist in one set but not in another. It returns a new set containing elements from the first set that are not present in the second set.

This is the same operation as set subtraction (`A - B`): only the elements unique to the first set remain. For example:

In [8]:
A = {10, 20, 30, 40, 80}
B = {100, 30, 80, 40, 60}

print(A.difference(B)) # Elements in A but not in B
print(B.difference(A)) # Elements in B but not in A

{10, 20}
{100, 60}


I remember using this once, and it could definitely come in handy later for checking which rows of data have already been processed within a function.
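As a rough sketch of that "already processed" idea: the row IDs and the `remaining_rows` helper below are hypothetical, made up purely for illustration.

```python
# Hypothetical row IDs; in practice these might come from a DataFrame index.
all_row_ids = {101, 102, 103, 104, 105}
processed_ids = {102, 104}


def remaining_rows(all_ids, done_ids):
    """Return the IDs that still need processing, in sorted order."""
    # .difference() accepts any iterable, so done_ids can be a list or a set.
    return sorted(set(all_ids).difference(done_ids))


todo = remaining_rows(all_row_ids, processed_ids)
print(todo)  # [101, 103, 105]
```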

# Scaling and Normalization

Quick exercise: for which of the following do the units of the predictors matter (e.g., trip length in minutes vs. seconds; temperature in F or C)?

* **kNN: Yes.** Scaling changes the distance metric, which determines what "neighbor" means.
* **Linear regression: No.** Multiplying a predictor by *c* just divides its coefficient by *c*; the predictions are unchanged.
* **Lasso: Yes.** Dividing a coefficient by *c* also divides its penalty term by *c*, so the rescaled predictor is penalized differently and the fit changes.
* **Ridge: Yes.** Same as Lasso, except the penalty term is divided by *c*^2.
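A quick way to check these answers is to rescale one predictor and compare the fits. This is a minimal sketch on synthetic data (the data, the factor `c = 60`, and the choices `n_neighbors=5` and `alpha=10.0` are all assumptions for illustration, not anything from the course):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data with two predictors.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

c = 60.0  # e.g. converting minutes to seconds
X_scaled = X.copy()
X_scaled[:, 0] *= c


def same_predictions(make_model):
    """Fit on original and rescaled predictors; report whether predictions match."""
    original = make_model().fit(X, y)
    rescaled = make_model().fit(X_scaled, y)
    return np.allclose(original.predict(X), rescaled.predict(X_scaled))


# Linear regression: the coefficient is divided by c, predictions are unchanged.
lr = LinearRegression().fit(X, y)
lr_scaled = LinearRegression().fit(X_scaled, y)
coef_ratio = lr.coef_[0] / lr_scaled.coef_[0]

lr_same = same_predictions(LinearRegression)
knn_same = same_predictions(lambda: KNeighborsRegressor(n_neighbors=5))
ridge_same = same_predictions(lambda: Ridge(alpha=10.0))

print(f"coefficient ratio (should be about c): {coef_ratio:.1f}")
print(f"linear regression predictions unchanged: {lr_same}")
print(f"kNN predictions unchanged: {knn_same}")
print(f"ridge predictions unchanged: {ridge_same}")
```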

**Remember that the mean and variance used to scale data are parameters that need to be learned from our training data.**

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
scaler = StandardScaler().fit(train[all_predictors])

pd.DataFrame({
    "mean": scaler.mean_,
    "variance": scaler.var_
}, index=all_predictors).T

NameError: name 'all_predictors' is not defined

In [None]:
# Scaling in place (not the professor's favorite approach)
# Probably better to separately assign each scaled data set
for df in [train, valid, test]:
    df[all_predictors] = scaler.transform(df[all_predictors])

NameError: name 'valid' is not defined
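Since the real data isn't available here, this is a minimal sketch of the "separately assign each scaled data set" alternative on synthetic frames. The column names, the `make_frame` and `scale` helpers, and the frame sizes are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
predictors = ["trip_minutes", "distance_km"]  # hypothetical column names


def make_frame(n):
    """Build a synthetic stand-in for one split of the data."""
    return pd.DataFrame({
        "trip_minutes": rng.normal(20, 5, size=n),
        "distance_km": rng.normal(8, 3, size=n),
    })


train_df, valid_df, test_df = make_frame(300), make_frame(100), make_frame(100)

# Learn the mean and variance from the training split only.
scaler = StandardScaler().fit(train_df[predictors])


def scale(df):
    """Return a scaled copy, leaving the original frame untouched."""
    out = df.copy()
    out[predictors] = scaler.transform(df[predictors])
    return out


train_s, valid_s, test_s = scale(train_df), scale(valid_df), scale(test_df)

# Training means are ~0 by construction; validation means are only close to 0.
print(train_s[predictors].mean().round(3).to_dict())
print(valid_s[predictors].mean().round(3).to_dict())
```

Keeping the unscaled frames around makes it easy to re-run the cell without accidentally scaling the data twice.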

# kNN Regression: How Many Neighbors Should we Use?

In [19]:
k_vals = [1, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
knns = {
    k: KNeighborsRegressor(n_neighbors=k).fit(
        train[all_predictors], train.Fare_amount)
    for k in k_vals}

train_r2s = [
    metrics.r2_score(train.Fare_amount, model.predict(train[all_predictors]))
    for k, model in knns.items()]

NameError: name 'all_predictors' is not defined

In [20]:
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.xlabel('n_neighbors')  
plt.ylabel("$R^2$")
plt.legend();

NameError: name 'train_r2s' is not defined

# Validation set is currently unseen

So let's see how well our models do on it:

In [21]:
val_r2s = [
    metrics.r2_score(valid.Fare_amount, model.predict(valid[all_predictors]))
    for k, model in knns.items()]

NameError: name 'knns' is not defined

In [22]:
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.plot(k_vals, val_r2s, '-*', label="Validation")
plt.xlabel('n_neighbors')
plt.ylabel("$R^2$")
plt.legend();

NameError: name 'train_r2s' is not defined

Now which n_neighbors should we use?

In [23]:
best_r2_idx = np.argmax(val_r2s)
best_r2 = val_r2s[best_r2_idx]
best_n_neighbors = k_vals[best_r2_idx]
print(f"Best n_neighbors is {best_n_neighbors}, which gives a validation R^2 of {best_r2:.3f}")

NameError: name 'val_r2s' is not defined
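The cells above can't run without the NYC data, so here is a self-contained sketch of the same selection loop on synthetic data. The sine-shaped data, the 70/30 split, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data: a noisy sine curve with one predictor.
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=600)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

k_vals = [1, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
train_r2s, val_r2s = [], []
for k in k_vals:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_r2s.append(r2_score(y_train, model.predict(X_train)))
    val_r2s.append(r2_score(y_valid, model.predict(X_valid)))

# Pick the k with the highest validation R^2, not the highest training R^2.
best_idx = int(np.argmax(val_r2s))
print(f"k=1: train R^2 = {train_r2s[0]:.3f}, validation R^2 = {val_r2s[0]:.3f}")
print(f"Best n_neighbors on validation: {k_vals[best_idx]} "
      f"(R^2 = {val_r2s[best_idx]:.3f})")
```

With k=1 every training point is its own nearest neighbor, so the training R^2 is perfect there; that is why the training curve alone can't be used to choose k.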

# Cross-Validation

(From [this resource](https://stats.stackexchange.com/questions/193959/does-cross-validation-on-simple-or-multiple-linear-regression-make-sense))

Cross-validation, and validation techniques in general, are used not only to guard against overfitting (rarely a concern for a plain linear model with few predictors) but also to compare different models.

Cross-validation or a train-test split does not improve a plain least-squares regression (which has no hyperparameters); the best fit still comes from training on all the available data.

Things are different if the model is linear but has hyperparameters to choose, as in Ridge or Lasso regression. In that case, cross-validation is a good way to choose the best hyperparameter value, that is, the one whose linear model scores best on the held-out folds.
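A minimal sketch of that hyperparameter case, using 5-fold cross-validation to pick a Ridge penalty. The synthetic data, the `alphas` grid, and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 10 predictors, only the first 3 matter.
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=1.0, size=200)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate grid
cv_scores = {}
for alpha in alphas:
    # The pipeline refits the scaler on the training part of each fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[alpha] = scores.mean()

# Choose the alpha with the best mean score on the held-out folds.
best_alpha = max(cv_scores, key=cv_scores.get)
for alpha, score in cv_scores.items():
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")
print(f"Best alpha: {best_alpha}")
```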

Essentially, when you're comparing models, you want to do a train/test split and score every model on the held-out data because... you're comparing models! As for scaling, fit the scaler on the training set only and apply those same parameters (mean and variance) to the validation and test sets. If each set were scaled with its own mean and variance, the same raw value would map to different scaled values in different sets, and the model's predictions wouldn't be comparable!