*This notebook is part of the course materials for CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions were created by Asa Ben-Hur and updated by Ross Beveridge.
The content is available [on GitHub](https://github.com/asabenhur/CS345).*

*The text is released under the [CC BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/), and code is released under the [MIT license](https://opensource.org/licenses/MIT).*

<a href="https://colab.research.google.com/github//asabenhur/CS345/blob/master/fall23/notebooks/module05_01_hyperparameters_validation_set.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Hyper-parameter selection using a validation set

### Classifier parameters vs hyper-parameters

Training of a classifier consists of using the training data to find good values for the *parameters* of the model (e.g. the weight vector of the perceptron algorithm).
All the classifiers we have seen thus far have additional parameters that control the classifier's training and need to be set by the user.  These are called *hyper-parameters*. For example:

* KNN:  number of nearest neighbors
* Perceptron:  learning rate
* Ridge regression:  regularization parameter
* Non-linear SVM:  soft-margin constant and kernel parameter (degree of polynomial kernel or width of Gaussian kernel).
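
The distinction can be seen directly in scikit-learn, where hyper-parameters are passed to the constructor and parameters are learned by `fit`. The following is a minimal sketch using ridge regression; the synthetic data and the names `X_demo` and `y_demo` are assumptions made for illustration and are not part of the course data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# alpha (the regularization parameter) is a hyper-parameter:
# the user chooses it before training
model = Ridge(alpha=1.0)
print("hyper-parameters:", model.get_params())

# coef_ and intercept_ are parameters: fit() learns them from the data
model.fit(X_demo, y_demo)
print("learned weights:", model.coef_)
print("learned intercept:", model.intercept_)
```

Changing `alpha` and refitting gives a different weight vector, which is why the value of the hyper-parameter has to be chosen with care.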


### The wrong way to select hyper-parameters

Our temptation is to do the following:

* Divide the data into train and test sets
* Loop over a set of potential values for the hyper-parameter(s)
* For each value of the hyper-parameter(s), train a classifier on the training set and evaluate it on the test set
* Report to the user the test-set accuracy of the best-performing classifier

Why is this wrong?  The choice of which value to report to the user uses information about the test set, and the end result is an accuracy estimate that is over-optimistic.  Think of a hyperparameter as yet another parameter that needs to be set algorithmically, which we never do using the test set!  As you know, the test set should be used only to estimate the performance of the final classifier, never to make choices while building it!
So what can we do?
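
To see how large this optimism can be, here is a small simulation, offered as a sketch under stated assumptions rather than a result from the course. The labels are random and independent of the features, so no classifier can truly do better than 50%. Each random linear classifier stands in for one hyper-parameter setting; all names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_test, n_fresh, n_candidates = 20, 50, 10000, 200

# Features and labels are independent: the true accuracy of any classifier is 50%
X_test_demo = rng.normal(size=(n_test, n_features))
y_test_demo = rng.integers(0, 2, size=n_test)
X_fresh = rng.normal(size=(n_fresh, n_features))
y_fresh = rng.integers(0, 2, size=n_fresh)

# Each candidate is a random linear classifier, standing in for
# one hyper-parameter setting
weights = rng.normal(size=(n_candidates, n_features))

def accuracy(w, X, y):
    """Accuracy of the linear classifier that predicts 1 when X @ w > 0."""
    return np.mean((X @ w > 0).astype(int) == y)

# The wrong way: pick the candidate that looks best on the test set
test_accuracies = np.array(
    [accuracy(w, X_test_demo, y_test_demo) for w in weights])
best = np.argmax(test_accuracies)
best_accuracy = test_accuracies[best]
fresh_accuracy = accuracy(weights[best], X_fresh, y_fresh)

print(f"best accuracy on the small test set: {best_accuracy:0.3f}")
print(f"same classifier on fresh data:       {fresh_accuracy:0.3f}")
```

The selected classifier looks better than chance on the test set only because it was selected using that test set; on fresh data it falls back to roughly 50%.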

### Use a validation set!

Here's a variation of the above procedure that introduces the idea of a **validation set**, a subset of the data used for choosing hyperparameters.

* Divide the data into **training, validation, and test** sets
* Loop over a set of potential values for the hyper-parameter(s)
* For each value of the hyper-parameter(s), train a classifier on the training set and evaluate it on the **validation** set
* Choose the best performing classifier based on its performance on the validation set, and evaluate its performance on the test set.
* Report to the user the test-set performance of the classifier chosen based on its performance on the validation set.


Let's apply this to the breast cancer data:

In [2]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
X.shape, y.shape

((569, 30), (569,))

Our next step is to split the data into training, validation, and test sets.
scikit-learn does not have a function for a three-way split of a dataset, so we'll use `train_test_split` twice:

In [3]:
size_test = 0.3
size_validation = 0.2
size_train = 0.5

from sklearn.model_selection import train_test_split

# first split into training / test, where the training set
# will be further split into training / validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=size_test, random_state=5)

# test_size in the second split is a fraction of the remaining
# (training + validation) data, so we rescale the validation fraction:
size_validation_rescaled = size_validation/(size_validation + size_train)
# now split the initial training set into the final training
# and validation sets:

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=size_validation_rescaled, random_state=5)
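
As a sanity check on the rescaling, the sketch below repeats the two calls on a plain array of 569 indices and reports the resulting set sizes. The helper name `three_way_split_sizes` is hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split_sizes(n_samples, size_test, size_validation, size_train):
    """Return (n_train, n_valid, n_test) from two train_test_split calls."""
    indices = np.arange(n_samples)
    # first split: hold out the test set
    rest, test = train_test_split(
        indices, test_size=size_test, random_state=5)
    # second split: the validation fraction is relative to what remains
    rescaled = size_validation / (size_validation + size_train)
    train, valid = train_test_split(
        rest, test_size=rescaled, random_state=5)
    return len(train), len(valid), len(test)

n_train, n_valid, n_test = three_way_split_sizes(569, 0.3, 0.2, 0.5)
print("sizes:", n_train, n_valid, n_test)
print("fractions:", n_train / 569, n_valid / 569, n_test / 569)
```

The fractions should come out close to the requested 0.5, 0.2, and 0.3. They are not exact because the set sizes must be whole numbers.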


Our next step is to loop over a set of hyper-parameter values, and select the best performing value based on its performance on the validation set. We will do this for an SVM with a Gaussian kernel and, for simplicity, we will focus on choosing the value of a single hyper-parameter, the Gaussian width parameter, $\gamma$.

In the next notebook we will explore scikit-learn functionality that will make this task much easier.
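
For reference, the Gaussian (RBF) kernel used by scikit-learn is $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, so $\gamma$ acts as an inverse width: the larger it is, the faster the similarity between two points decays with distance. The sketch below compares a manual computation with `sklearn.metrics.pairwise.rbf_kernel`; the two points are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points at squared distance 2
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])
squared_distance = np.sum((x - z) ** 2)

for gamma in [0.001, 0.1, 10.0]:
    manual = np.exp(-gamma * squared_distance)   # formula above
    library = rbf_kernel(x, z, gamma=gamma)[0, 0]  # scikit-learn's value
    print(f"gamma: {gamma}\t k(x, z): {library:0.4f}\t manual: {manual:0.4f}")
```

With a very large $\gamma$ every point looks dissimilar to every other point, which tends to produce overfitting; with a very small $\gamma$ all points look alike, which tends to produce underfitting.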

In [4]:
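# for simplicity we will only consider the gamma hyper-parameter:

def svm_select_gamma(X_train, X_valid, y_train, y_valid, gammas):
    """Return the value in gammas with the highest validation accuracy."""
    accuracies = []
    for gamma in gammas:
        # train on the training set only
        classifier = svm.SVC(kernel="rbf", gamma=gamma, C=10)
        classifier.fit(X_train, y_train)
        # evaluate on the validation set, never on the test set
        y_pred = classifier.predict(X_valid)
        accuracy = np.mean(y_valid == y_pred)
        accuracies.append(accuracy)
        print(f"gamma: {gamma}\t accuracy: {accuracy:0.3f}")
    # np.argmax returns the first maximum, so ties go to the earliest value
    return gammas[np.argmax(accuracies)]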

The final step is to run this function and evaluate the resulting classifier over the test set:

In [5]:
gammas = np.logspace(-5, 2, num=8, endpoint=True, base=10.0)
gamma = svm_select_gamma(X_train, X_valid, y_train, y_valid, gammas)
print(f"chosen value of gamma: {gamma}")
classifier = svm.SVC(kernel="rbf", gamma=gamma, C=10)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(f'accuracy on test set:  {np.mean(y_test == y_pred):0.3f}')

gamma: 1e-05	 accuracy: 0.684
gamma: 0.0001	 accuracy: 0.956
gamma: 0.001	 accuracy: 0.982
gamma: 0.01	 accuracy: 0.965
gamma: 0.1	 accuracy: 0.921
gamma: 1.0	 accuracy: 0.623
gamma: 10.0	 accuracy: 0.623
gamma: 100.0	 accuracy: 0.623
chosen value of gamma: 0.001
accuracy on test set:  0.988
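
The validation accuracy is identical (0.623) for all values of gamma of 1.0 and above, which suggests, but does not prove, that for those values the classifier predicts a single class for every validation example. One way to check this is to compare that number against the class fractions in the validation set. The sketch below rebuilds the same split from the labels alone, relying on the assumption that `train_test_split` produces the same partition for the same number of samples and `random_state`; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

_, y_all = load_breast_cancer(return_X_y=True)

# same fractions and random_state as in the cells above
y_rest, y_test_check = train_test_split(
    y_all, test_size=0.3, random_state=5)
y_train_check, y_valid_check = train_test_split(
    y_rest, test_size=0.2 / (0.2 + 0.5), random_state=5)

# fraction of each class in the validation set
class_fractions = np.bincount(y_valid_check) / len(y_valid_check)
print("validation class fractions:", class_fractions)
print(f"single-class baseline accuracy: {class_fractions.max():0.3f}")
```

If one of the class fractions matches the plateau, then those settings are doing no better than a classifier that ignores its input.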


### Final comments

* When a classifier has multiple hyperparameters (e.g. non-linear SVM with soft-margin constant and kernel parameter) you would ideally run a process called grid-search, i.e. consider all combinations of hyperparameters (or a subset of those combinations).
* When you have a small amount of data, dividing it into training, validation, and test sets will leave you with datasets that are too small for effective training and evaluation.  In the next notebooks we will explore how to address this problem using the technique of cross-validation.
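
The grid search mentioned in the first comment can be sketched by looping over all combinations of the two SVM hyperparameters and scoring each one on the validation set. This is a minimal illustration, not the scikit-learn functionality covered in the next notebook. The grids `Cs` and `gammas_grid` are hypothetical choices. Unlike the cells above, the scaler here is fit on the training portion only, an assumption made so that the validation and test sets play no role in preprocessing.

```python
from itertools import product

from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=5)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2 / (0.2 + 0.5), random_state=5)

# Assumption: standardize using statistics of the training portion only
scaler = StandardScaler().fit(X_tr)
X_tr, X_va, X_te = (scaler.transform(A) for A in (X_tr, X_va, X_te))

# Hypothetical grids, chosen for illustration
Cs = [0.1, 1, 10, 100]
gammas_grid = [1e-4, 1e-3, 1e-2, 1e-1]

best_accuracy, best_params = -1.0, None
for C, gamma in product(Cs, gammas_grid):
    classifier = svm.SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    accuracy = classifier.score(X_va, y_va)  # validation accuracy
    if accuracy > best_accuracy:
        best_accuracy, best_params = accuracy, (C, gamma)

# The test set is touched once, after the choice has been made
best_C, best_gamma = best_params
final = svm.SVC(kernel="rbf", C=best_C, gamma=best_gamma).fit(X_tr, y_tr)
test_accuracy = final.score(X_te, y_te)
print(f"chosen C: {best_C}, gamma: {best_gamma}, "
      f"validation accuracy: {best_accuracy:0.3f}")
print(f"accuracy on test set: {test_accuracy:0.3f}")
```

The number of classifiers to train grows as the product of the grid sizes, which is why in practice one often considers only a subset of the combinations.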