# Lab 10 - Model Selection

Throughout the course we have encountered many hypothesis classes that are in fact sets of hypothesis classes characterized by some hyper-parameter. We have seen that this hyper-parameter can often be viewed as a tuning parameter along the bias-variance trade-off curve:
1. When choosing the number of neighbors $k$ in the $k$-NN classifier, we are controlling how complex the hypotheses of this class are.
2. When choosing the max depth $d$ of decision trees, we are controlling how complex the hypotheses of this class are.
3. When choosing $\lambda$, the regularization parameter of the Lasso or Ridge regressions, we are controlling how complex the hypotheses of this class are.

Therefore, a key question is how to correctly choose these parameters, or in other words, how to select our preferred model in each set of hypothesis classes. To answer this question we will investigate three ways of selecting a model, based on:
1. the train set
2. a validation set
3. cross validation

In [1]:
import sys 
sys.path.append("../")
from utils import *

from scipy.stats import norm
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

np.random.seed(7)

To this end we will use the South Africa heart disease dataset, which comprises 462 patient records, each labeled by whether the patient has (`chd=1`) or does not have (`chd=0`) the disease.

In [34]:
df = pd.read_csv("../data/SAheart.data", header=0, index_col=0).sort_values('chd')
df.famhist = df.famhist == "Present"

train, test = train_test_split(df, test_size=0.2)
X_train, y_train, X_test, y_test = train.loc[:, train.columns != 'chd'].values, train["chd"].values, test.loc[:, test.columns != 'chd'].values, test["chd"].values

df

row.names,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
170,162,5.30,7.95,33.58,True,58,36.06,8.23,48,0
226,143,5.04,4.86,23.59,False,58,24.69,18.72,42,0
225,132,7.28,3.52,12.33,False,60,19.48,2.06,56,0
223,144,2.40,8.13,35.61,False,46,27.38,13.37,60,0
385,128,2.60,4.94,21.36,False,61,21.30,0.00,31,0
...,...,...,...,...,...,...,...,...,...,...
244,144,0.76,10.53,35.66,False,63,34.35,0.00,55,1
245,126,4.60,7.40,31.99,True,57,28.67,0.37,60,1
247,136,0.40,3.91,21.10,True,63,22.30,0.00,56,1
162,208,27.40,3.12,26.63,False,66,27.45,33.07,62,1


## Model Selection Based On ERM

We begin with the simplest approach for selecting a model out of a set of possible models. We fit a $k$-NN classifier for odd values of $k$ between $1$ and $39$, and select the classifier that achieves the lowest training error.

As seen in Figure 1, the selected classifier is the one with $k=1$, where the prediction for each point is based on its single closest training point. Since we are evaluating the results on the training set, each evaluated point is closest to itself, and is therefore assigned its own response.

Though this approach yields a zero training error, we can see that for this dataset the test error is $0.4$. Thus, our classifier is heavily overfitted and does not generalize well to new data.
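
To make this argument concrete, here is a minimal from-scratch sketch (plain Python, not the scikit-learn implementation used in this lab) on hypothetical toy data whose labels are pure noise. The helper names `knn_predict` and `error_rate` are illustrative assumptions. Assuming all training points are distinct, each point is its own nearest neighbor, so the training error for $k=1$ is zero even though there is nothing to learn:

```python
import random

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def error_rate(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) misclassified by the k-NN rule."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != label for x, label in zip(X, y))
    return wrong / len(y)

rng = random.Random(0)
# Hypothetical toy data: random 2-D points with labels that are pure noise
X = [(rng.random(), rng.random()) for _ in range(50)]
y = [rng.randint(0, 1) for _ in range(50)]

# Evaluate on the very same points that were used as the training set
train_error_k1 = error_rate(X, y, X, y, k=1)
train_error_k15 = error_rate(X, y, X, y, k=15)
print(train_error_k1, train_error_k15)
```

For larger $k$ the vote also includes other points, so the training error is generally no longer zero.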

In [27]:
k_range = list(range(1, 40, 2))

# Train and evaluate models for all values of k
train_errors, test_errors = [], []
for k in k_range:
    model = KNeighborsClassifier(k).fit(X_train, y_train)
    train_errors.append(1 - model.score(X_train, y_train))
    test_errors.append(1 - model.score(X_test, y_test))


# Select model with lowest training error
min_ind = np.argmin(np.array(train_errors))
selected_k = np.array(k_range)[min_ind]
selected_error = train_errors[min_ind]


# Plot train- and test errors as well as which model (value of k) was selected
go.Figure([go.Scatter(name='Train Error', x=k_range, y=train_errors, mode='markers+lines', marker_color='rgb(152,171,150)'), 
           go.Scatter(name='Test Error', x=k_range, y=test_errors, mode='markers+lines', marker_color='rgb(25,115,132)'),
           go.Scatter(name='Selected Model', x=[selected_k], y=[selected_error], mode='markers', marker=dict(color='darkred', symbol="x", size=10))])\
    .update_layout(title=r"$\text{(1) }k\text{-NN Errors - Selection By ERM}$", 
                   xaxis_title=r"$k\text{ - Number of Neighbors}$", 
                   yaxis_title=r"$\text{Error Value}$").show()

## Model Selection Based On A Validation Set

For the next approach we use the following scheme:
1. Split the training set into a training portion and a validation portion.
2. Train the models over the training portion.
3. Evaluate the models over the validation portion and choose the one with the lowest validation error.

Since the validation set is not used for fitting, the error over it is an unbiased estimator of each model's generalization error (see proof in course book). This approach therefore approximates the unknown generalization error and aims to select the model expected to perform best by that error.
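
As a brief reminder of why this holds, here is a sketch of the argument under assumed notation (not necessarily that of the course book): $h$ is a model fitted on the training portion only, $V=\{(x_i,y_i)\}_{i=1}^m$ is a validation set drawn i.i.d. from the data distribution $\mathcal{D}$ independently of $h$, $L_V$ is the misclassification error over $V$ and $L_{\mathcal{D}}$ is the generalization error. By linearity of expectation:

$$\mathbb{E}_V\left[L_V(h)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}\left[\mathbb{1}\left[h(x_i)\neq y_i\right]\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[h(x)\neq y\right] = L_{\mathcal{D}}(h)$$

The equality holds for each fixed model. The minimum validation error over several candidate models is, in expectation, no larger than the smallest generalization error among them, so the error reported for the selected model tends to be slightly optimistic.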

As evident in Figure 2, we no longer select the model where $k=1$ and instead choose the model where $k=25$. We can see that for all values of $k$ the validation and test errors are similar, showing empirically how these independent sets can provide an unbiased estimate of the generalization error.

In [28]:
# Split training set into training- and validation sets. train_test_split already
# shuffled the rows, so slicing by position yields a random split
n = int(X_train.shape[0]*0.5)
X_train_smaller, y_train_smaller = X_train[:n], y_train[:n]
X_val, y_val = X_train[n:], y_train[n:]


# Train and evaluate models for all values of k
train_errors, val_errors, test_errors = [], [], []
for k in k_range:
    model = KNeighborsClassifier(k).fit(X_train_smaller, y_train_smaller)
    train_errors.append(1 - model.score(X_train_smaller, y_train_smaller))
    val_errors.append(1 - model.score(X_val, y_val))
    test_errors.append(1-model.score(X_test, y_test))


# Select model with lowest validation error
min_ind = np.argmin(np.array(val_errors))
selected_k = np.array(k_range)[min_ind]
selected_error = val_errors[min_ind]


# Plot train-, validation- and test errors as well as which model (value of k) was selected
fig = go.Figure([ 
    go.Scatter(name='Train Error', x=k_range, y=train_errors, mode='markers+lines', marker_color='rgb(152,171,150)'),
    go.Scatter(name='Validation Error', x=k_range, y=val_errors, mode='markers+lines', marker_color='rgb(220,179,144)'),
    go.Scatter(name='Test Error', x=k_range, y=test_errors, mode='markers+lines', marker_color='rgb(25,115,132)'), 
    go.Scatter(name='Selected Model', x=[selected_k], y=[selected_error], mode='markers', marker=dict(color='darkred', symbol="x", size=10))
]).update_layout(title=r"$\text{(2) }k\text{-NN Errors - Selection By Minimal Error Over Validation Set}$", 
                 xaxis_title=r"$k\text{ - Number of Neighbors}$", 
                 yaxis_title=r"$\text{Error Value}$").show()

## $k$-Fold Cross Validation

In preparation for the next approach, consider the following. Instead of using a single validation set, we can extend the above approach to use multiple validation sets. We fit each model over the training set but evaluate its average performance over the different validation sets. We then select the model that achieves the lowest average error.

The following code splits the training set into 4 portions: a training set and 3 validation sets.

In [82]:
# Split training set into training and validation portions, and then 
# split validation portion into 3 validation sets
msk = np.random.binomial(1, .7, X_train.shape[0]).astype(bool)
X_train_smaller, y_train_smaller = X_train[msk], y_train[msk]

validations = np.array_split(np.argwhere(~msk), 3)
validations = [(X_train[v.ravel()], y_train[v.ravel()]) for v in validations]


# Train and evaluate models for all values of k
train_errors, test_errors, val_errors = [], [], [[] for _ in range(len(validations))]
for k in k_range:
    model = KNeighborsClassifier(k).fit(X_train_smaller, y_train_smaller)
    train_errors.append(1-model.score(X_train_smaller, y_train_smaller))
    test_errors.append(1-model.score(X_test, y_test))

    for i in range(len(validations)): 
        val_errors[i].append(1 - model.score(*validations[i]))
val_errors = np.array(val_errors)


# Select model with lowest mean validation error
min_ind = np.argmin(val_errors.mean(axis=0))
selected_k = np.array(k_range)[min_ind]
selected_error = val_errors.mean(axis=0)[min_ind]
mean, std = np.mean(val_errors, axis=0), np.std(val_errors, axis=0)


# Plot train-, mean validation- and test errors, with a band of +/- 2 standard deviations
go.Figure([
    go.Scatter(name='Lower validation error', x=k_range, y=mean - 2*std, mode='lines', line=dict(color="lightgrey"), showlegend=False, fill=None),
    go.Scatter(name='Upper validation error', x=k_range, y=mean + 2*std, mode='lines', line=dict(color="lightgrey"), showlegend=False, fill="tonexty"), 

    go.Scatter(name='Train Error', x=k_range, y=train_errors, mode='markers+lines', marker_color='rgb(152,171,150)'),
    go.Scatter(name='Mean Validation Error', x=k_range, y=mean, mode='markers+lines', marker_color='rgb(220,179,144)'),
    go.Scatter(name='Test Error', x=k_range, y=test_errors, mode='markers+lines', marker_color='rgb(25,115,132)'), 
    go.Scatter(name='Selected Model', x=[selected_k], y=[selected_error], mode='markers', marker=dict(color='darkred', symbol="x", size=10))
]).update_layout(title=r"$\text{(3) }k\text{-NN Errors - Selection By Minimal Mean Error Over Validation Sets}$", 
                 xaxis_title=r"$k\text{ - Number of Neighbors}$", 
                 yaxis_title=r"$\text{Error Value}$").show()

In Figure 3, we can see the train and test errors, as well as the results over the validation sets. These results are shown in two ways. The first is the average validation error achieved for each value of $k$, shown in the graph as the "Mean Validation Error" line.

The second is the grey area in the plot: the mean validation error plus and minus two standard deviations, computed across the validation sets. It serves as a rough confidence interval for the estimated error (recall that the mean captures the first moment and the variance captures the second), and indicates how much confidence we can place in the estimate.
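
The band can be sketched with the standard library alone. The error values below are hypothetical and only illustrate the computation, one row per validation set and one column per value of $k$. With as few as three validation sets the band is a rough indication of spread rather than a formal confidence interval:

```python
from statistics import mean, pstdev

# Hypothetical validation errors: one row per validation set, one column per value of k
k_values = [1, 3, 5]
val_errors = [
    [0.40, 0.34, 0.30],
    [0.44, 0.30, 0.34],
    [0.36, 0.32, 0.35],
]

# Transpose so that each entry holds the errors of a single k across the validation sets
per_k = list(zip(*val_errors))
means = [mean(col) for col in per_k]
stds = [pstdev(col) for col in per_k]  # population std, the same convention as np.std's default

# Grey band: mean error minus/plus two standard deviations, per value of k
band = [(m - 2 * s, m + 2 * s) for m, s in zip(means, stds)]

# The selected model is the one with the lowest mean validation error
best_k = k_values[means.index(min(means))]
print(best_k, band)
```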

The main problem with the approach above is that we have to set aside a lot of data that we cannot train on and use only for these independent validations. To address this problem we instead use the Cross Validation approach.
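
The idea behind $k$-fold splitting can be sketched as follows. This is a minimal illustration and the function name `k_fold_indices` is an assumption. Library splitters may additionally shuffle or stratify the samples, so the folds used by `GridSearchCV` in the next cell need not match this sketch. Each sample is used for validation exactly once and for training in all the other rounds, so no data is permanently set aside:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    # The first (n_samples % n_folds) folds receive one extra sample
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

For each candidate model, one would fit on `train_idx`, evaluate on `val_idx`, and average the errors over the folds.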

In [86]:
train_errors, test_errors = [], []
for k in k_range:
    model = KNeighborsClassifier(k).fit(X_train, y_train)
    train_errors.append(1-model.score(X_train, y_train))
    test_errors.append(1-model.score(X_test, y_test))

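# 3-fold cross-validation over the entire training set for each value of k
param_grid = {'n_neighbors':k_range}
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3).fit(X_train, y_train)
# mean_test_score is the mean accuracy over the held-out folds, so 1 - score is the CV error
cv_errors = 1 - knn_cv.cv_results_["mean_test_score"]
std = knn_cv.cv_results_["std_test_score"]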
    
min_ind = np.argmin(np.array(cv_errors))
selected_k = np.array(k_range)[min_ind]
selected_error = cv_errors[min_ind]


go.Figure([
    go.Scatter(name='Lower CV Error CI', x=k_range, y=cv_errors - 2*std, mode='lines', line=dict(color="lightgrey"), showlegend=False, fill=None),
    go.Scatter(name='Upper CV Error CI', x=k_range, y=cv_errors + 2*std, mode='lines', line=dict(color="lightgrey"), showlegend=False, fill="tonexty"), 

    go.Scatter(name="Train Error", x=k_range, y=train_errors, mode='markers+lines', marker_color='rgb(152,171,150)'), 
    go.Scatter(name="CV Error", x=k_range, y=cv_errors, mode='markers+lines', marker_color='rgb(220,179,144)'),
    go.Scatter(name="Test Error", x=k_range, y=test_errors, mode='markers+lines', marker_color='rgb(25,115,132)'), 
    go.Scatter(name='Selected Model', x=[selected_k], y=[selected_error], mode='markers', marker=dict(color='darkred', symbol="x", size=10))])\
.update_layout(title=r"$\text{(4) }k\text{-NN Errors - Selection By Cross-Validation}$", 
                 xaxis_title=r"$k\text{ - Number of Neighbors}$", 
                 yaxis_title=r"$\text{Error Value}$").show()
