# Hyperparameter optimization

*Fraida Fund*

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd 
import scipy

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from scipy.stats import loguniform  # sklearn.utils.fixes.loguniform is not available in recent sklearn releases

## Grid search

For models with a single hyperparameter controlling bias-variance (for example: $k$ in $k$ nearest neighbors), we used sklearn's `KFold` or `validation_curve` to test a range of values for the hyperparameter, and to select the best one.

When we have *multiple* hyperparameters to tune, we can use `GridSearchCV` to select the best *combination* of them.

For example, we saw three ways to tune the bias-variance of an SVM classifier:

-   Changing the kernel
-   Changing $C$
-   For an RBF kernel, changing $\gamma$

To get the best performance from an SVM classifier, we need to find the best *combination* of these hyperparameters. This notebook shows how to use `GridSearchCV` to tune an SVM classifier.

We will work with a subset of the MNIST handwritten digits data. First, we will get the data, and assign a small subset of samples to training and test sets.

In [None]:
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, test_size=300)

Let’s try this initial parameter “grid”:

In [None]:
param_grid = [
  {'C': [0.1, 1000], 'kernel': ['linear']},
  {'C': [0.1, 1000], 'gamma': [0.01, 0.0001], 'kernel': ['rbf']},
 ]
param_grid
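
As a quick check of what this grid contains, here is a minimal sketch that expands it with sklearn's `ParameterGrid`. Each dictionary in the list is expanded separately into every combination of its values, so the linear kernel is never paired with a `gamma` it would ignore. (The variable name `candidates` is my own, just for illustration.)

```python
from sklearn.model_selection import ParameterGrid

# A list of dicts defines several sub-grids; each one is expanded
# into the Cartesian product of its value lists.
param_grid = [
    {'C': [0.1, 1000], 'kernel': ['linear']},
    {'C': [0.1, 1000], 'gamma': [0.01, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

print(len(candidates), "candidate settings")
```

With 3-fold cross validation, each of these candidates is fitted three times, plus one final re-fit of the best candidate when `refit=True`.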

Now we’ll set up the grid search. We can use `fit` on it, just like any other `sklearn` model.

I added `return_train_score=True` to my `GridSearchCV` so that it will show me training scores as well:

In [None]:
clf = GridSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True)
%time clf.fit(X_train, y_train)

Here are the results:

In [None]:
pd.DataFrame(clf.cv_results_)
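
The full results table has many columns (per-fold scores, timings, and so on). Here is a small sketch of how we might keep only the columns we need to reason about bias and variance. To keep it quick and self-contained, I am assuming sklearn's small bundled digits dataset as a stand-in for MNIST, and the names `search` and `summary` are just for this example.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: the bundled 8x8 digits data stands in for MNIST,
# only so that this example runs in a second or two.
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:300], y[:300]

search = GridSearchCV(
    SVC(),
    {'C': [0.1, 1000], 'kernel': ['linear']},
    cv=3,
    return_train_score=True,
)
search.fit(X_small, y_small)

# Keep the parameter setting, the mean scores, and the rank by validation score.
cols = ['params', 'mean_train_score', 'mean_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')
print(summary)
```

Note that `mean_test_score` here is the mean score on the held-out *validation* folds, not on our test set.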

To inform our search, we will use our understanding of how SVMs work, and especially how the $C$ and $\gamma$ parameters control the bias and variance of the SVM.

### Linear kernel

Let’s tackle the linear SVM first, since it’s faster to fit. In our first search, we didn’t see any change in the accuracy when we varied $C$. So, we should extend the range of $C$ over which we search.

I’ll try higher and lower values of $C$, to see what happens.

In [None]:
param_grid = [
  {'C': [1e-6, 1e-4, 1e-2, 1e2, 1e4, 1e6], 'kernel': ['linear']},
 ]
param_grid

In [None]:
clf = GridSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True)
%time clf.fit(X_train, y_train)

In [None]:
pd.DataFrame(clf.cv_results_)

In [None]:
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_C', y='mean_train_score', label="Training score");
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_C', y='mean_test_score', label="Validation score");
plt.xscale('log');

It looks like we get a slightly better validation score near the smaller values for $C$! What does this mean?

Let’s search more finely among the small values of $C$:

In [None]:
param_grid = [
  {'C': np.linspace(1e-5, 1e-7, num=10), 'kernel': ['linear']},
 ]
param_grid
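
One thing to keep in mind: `np.linspace` spaces the candidates evenly on a *linear* scale, so when we plot against a log axis most of them bunch up near the larger end of the range. A possible alternative (not what this notebook uses) is `np.geomspace`, which spaces them evenly on a log scale. The sketch below just counts how many candidates each approach places in the lower decade of the range.

```python
import numpy as np

# Ten candidates between 1e-7 and 1e-5, spaced two different ways
linear_grid = np.linspace(1e-7, 1e-5, num=10)
log_grid = np.geomspace(1e-7, 1e-5, num=10)

# How many candidates land in the lower decade, [1e-7, 1e-6]?
n_low_linear = int((linear_grid <= 1e-6).sum())
n_low_log = int((log_grid <= 1e-6).sum())

print("linspace :", n_low_linear, "of 10 candidates are <= 1e-6")
print("geomspace:", n_low_log, "of 10 candidates are <= 1e-6")
```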

In [None]:
clf = GridSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True)
%time clf.fit(X_train, y_train)

In [None]:
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_C', y='mean_train_score', label="Training score");
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_C', y='mean_test_score', label="Validation score");
plt.xscale('log');

We can be satisfied that we have found a good hyperparameter only when we see both the high bias side AND the high variance side of the validation curve!
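
One rough way to check this programmatically is to ask whether the best validation score sits strictly inside the range we searched, rather than at either end. This is only a sketch: `best_is_interior` is a hypothetical helper of my own (not part of sklearn), and the scores below are made up for illustration.

```python
import numpy as np

def best_is_interior(param_values, val_scores):
    """Return True if the best validation score is not at either end of the searched range."""
    order = np.argsort(param_values)          # sort candidates by parameter value
    scores = np.asarray(val_scores)[order]
    best = int(np.argmax(scores))             # position of the best score
    return 0 < best < len(scores) - 1

# Made-up validation scores, for illustration only
C_values = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
peak_inside = best_is_interior(C_values, [0.20, 0.85, 0.88, 0.86, 0.86])
peak_at_edge = best_is_interior(C_values, [0.88, 0.87, 0.86, 0.86, 0.86])

print(peak_inside)   # best score is in the middle of the range
print(peak_at_edge)  # best score is at the smallest C, so we should search lower
```

A check like this is no substitute for looking at the curve, but it can flag a search range that should be extended.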

### RBF kernel

Now, let’s look at the RBF kernel.

In our first search, the accuracy of the RBF kernel is very poor. We may have high bias, high variance, or both.

When $C=0.1$ in our first search, both training and validation scores were low. This suggests high bias.

When $C=1000$ in our first search, training scores were high and validation scores were low. This suggests high variance.

What next? We know from our discussion of bias and variance of SVMs that to combat overfitting, we can decrease $\gamma$ and/or decrease $C$.

For now, let’s keep the higher value of $C$, and try to reduce the overfitting by decreasing $\gamma$.

In [None]:
param_grid = [
  {'C': [1000], 'gamma': [1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10, 1e-11], 'kernel': ['rbf']},
 ]
param_grid

In [None]:
clf = GridSearchCV(SVC(), param_grid, cv=2, refit=True, verbose=100, n_jobs=-1, return_train_score=True)
%time clf.fit(X_train, y_train)

In [None]:
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_gamma', y='mean_train_score', label="Training score")
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x='param_gamma', y='mean_test_score', label="Validation score")
plt.xscale('log');

Here, we see that (at least for $C=1000$), values of $\gamma$ greater than `1e-5` seem to overfit, while decreasing $\gamma$ lower than `1e-10` may underfit.
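
To build some intuition for why such tiny values of $\gamma$ are in play, recall that the RBF kernel is $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. The sketch below evaluates it for a few values of $\gamma$. The squared distance of `1e7` is an assumption on my part (a made-up order of magnitude for unscaled 784-pixel images), not something measured from our data.

```python
import numpy as np

# Assumption: squared Euclidean distance between two images is about 1e7.
# This is a made-up, illustrative order of magnitude, not a measured value.
squared_distance = 1e7

kernel_values = {}
for gamma in [1e-4, 1e-7, 1e-11]:
    # RBF kernel: similarity decays exponentially with gamma * squared distance
    kernel_values[gamma] = float(np.exp(-gamma * squared_distance))
    print(f"gamma={gamma:g}  kernel value={kernel_values[gamma]:.6f}")
```

Under that assumption, a large $\gamma$ makes every pair of samples look completely dissimilar (kernel value near 0), so the model can only memorize; a very small $\gamma$ makes every pair look nearly identical (kernel value near 1), so the model cannot tell samples apart.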

But we know that changing $C$ also affects the bias-variance tradeoff! For different values of $C$, the best value of $\gamma$ will be different, and there may be a better *combination* of $C$ and $\gamma$ than any we have seen so far. We can try to increase and decrease $C$ to see if that improves the validation score.

Now that we have a better idea of where to search, we can set up our “final” search grid.

We know that to find the best validation accuracy for the linear kernel, we should make sure our search space includes `1e-6` and `1e-7`. I chose to vary $C$ from `1e-8` to `1e-4`. (I want to make sure the best value is not at the edge of the search space, so that we can be sure there isn’t a better value if we go lower/higher.)

We know that to find the best validation accuracy for the RBF kernel, we should make sure our search space includes $\gamma$ values around `1e-6` and `1e-7` when $C=1000$. For larger values of $C$, we expect that we’ll get better results with smaller values of $\gamma$. For smaller values of $C$, we expect that we’ll get better results with larger values of $\gamma$. I chose to vary $C$ from `1` to `1e6` and $\gamma$ from `1e-4` to `1e-11`.

That’s a big search grid, so this takes a long time to fit! (Try this at home with a larger training set to get an idea...)

In [None]:
param_grid = [
  {'C': [1e-8, 1e-7, 1e-6, 1e-5, 1e-4], 'kernel': ['linear']},
  {'C': [1, 1e2, 1e3, 1e4, 1e5, 1e6], 'gamma': [1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10, 1e-11], 'kernel': ['rbf']},
 ]
param_grid

In [None]:
clf = GridSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True)
%time clf.fit(X_train, y_train)

For the linear kernel, here's what we found:

In [None]:
df_cv = pd.DataFrame(clf.cv_results_)
df_cv = df_cv[df_cv['param_kernel']=='linear']

In [None]:
sns.lineplot(data=df_cv, x='param_C', y='mean_train_score', label="Training score")
sns.lineplot(data=df_cv, x='param_C', y='mean_test_score', label="Validation score")
plt.xscale('log');

For the RBF kernel, here's what we found:

In [None]:
df_cv = pd.DataFrame(clf.cv_results_)
df_cv = df_cv[df_cv['param_kernel']=='rbf']

plt.figure(figsize=(12,5))

ax1=plt.subplot(1,2,1)
pvt = pd.pivot_table(df_cv, values='mean_test_score', index='param_C', columns='param_gamma')
sns.heatmap(pvt, annot=True, cbar=False, vmin=0, vmax=1, cmap='PiYG');
plt.title("Validation scores");

ax2=plt.subplot(1,2,2, sharey=ax1)
plt.setp(ax2.get_yticklabels(), visible=False)
pvt = pd.pivot_table(df_cv, values='mean_train_score', index='param_C', columns='param_gamma')
sns.heatmap(pvt, annot=True, cbar=False, vmin=0, vmax=1, cmap='PiYG');
plt.title("Training scores");

We see that $\gamma$ and $C$ control the bias-variance tradeoff of the SVM model as follows.

-   In the top left region, $C$ is small (the margin is wider) and $\gamma$ is small (the kernel bandwidth is large). In this region, the model has more bias (is prone to underfit). The validation scores and training scores are both low.
-   On the right side (and we'd expect to see this on the bottom right if we extend the range of $C$ even higher), $C$ is large (the margin is narrower) and $\gamma$ is large (the kernel bandwidth is small). In this region, the model has more variance (is likely to overfit). The validation scores are low, but the training scores are high.

In the middle, we have a region of good combinations of $C$ and $\gamma$.

Since the parameter grid above shows us the validation accuracy decreasing both as we increase each parameter\* and also as we decrease each parameter, we can be a bit more confident that we captured the point in the bias-variance surface where the error is smallest.

\* $C$ is different because increasing $C$ even more may not actually change the margin.
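
Besides reading the heatmap, we could also pull out the best $\gamma$ for each value of $C$ directly from the results table. Here is a minimal sketch using a small made-up table with the same column names as `cv_results_`; the scores are invented for illustration and are not our actual results.

```python
import pandas as pd

# Made-up validation scores, for illustration only. In the notebook these
# rows would come from pd.DataFrame(clf.cv_results_); the param_ columns
# there may need .astype(float) before grouping.
df = pd.DataFrame({
    'param_C':         [1, 1, 1, 1000, 1000, 1000],
    'param_gamma':     [1e-8, 1e-6, 1e-4, 1e-8, 1e-6, 1e-4],
    'mean_test_score': [0.30, 0.80, 0.20, 0.85, 0.70, 0.15],
})

# For each value of C, keep the row with the highest validation score.
best_rows = df.loc[df.groupby('param_C')['mean_test_score'].idxmax()]
print(best_rows)
```

In this invented example, the larger $C$ does best with the smaller $\gamma$, which is the kind of pattern we expected to see.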

We can see the “best” parameters, with which the model was re-fitted:

In [None]:
print(clf.best_params_)

And we can evaluate the re-fitted model on the test set. (Note that the `GridSearchCV` only used the training set; we have not used the test set at all for model fitting.)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)
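
Because we used `refit=True`, the search object behaves like the re-fitted best model: `predict` and `score` are passed through to `best_estimator_`. The sketch below checks that on a small example. Again I am assuming the bundled digits data as a fast stand-in for MNIST, and the variable names are only for this example.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: the small bundled digits data stands in for MNIST here.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=300, test_size=100, random_state=0
)

search = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 1]}, cv=3, refit=True)
search.fit(X_tr, y_tr)

# Accuracy computed by hand from predictions...
manual_acc = accuracy_score(y_te, search.predict(X_te))
# ...and the same quantity via the search object's own score method.
builtin_acc = search.score(X_te, y_te)

print(search.best_params_, manual_acc, builtin_acc)
```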

## Random search

Our grid search found a pretty good set of hyperparameters, but it took a long time - about 100 seconds.

With a random search, we may be able to find hyperparameters that are still pretty good, in much less time.

We will search a similar range of parameters, although focusing only on the RBF kernel. But instead of specifying points on a grid like

    param_grid = [
      {'C': [1, 1e2, 1e3, 1e4, 1e5, 1e6], 'gamma': [1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10, 1e-11], 'kernel': ['rbf']},
     ]

we will specify distributions from which to sample:

In [None]:
param_grid = [
  {'C': loguniform(1, 1e6), 'gamma': loguniform(1e-11, 1e-4), 'kernel': ['rbf']},
 ]
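
Why `loguniform` rather than a plain uniform distribution? Our range for $\gamma$ spans seven orders of magnitude, and we want each order of magnitude to be equally likely. Here is a small sketch (using `scipy.stats.loguniform`, with a fixed seed that I chose arbitrarily) that draws samples and counts the fraction falling in each decade.

```python
import numpy as np
from scipy.stats import loguniform

# Same range we use for gamma in the random search
dist = loguniform(1e-11, 1e-4)
samples = dist.rvs(size=10000, random_state=0)

# Count the fraction of samples in each of the 7 decades from 1e-11 to 1e-4
counts, edges = np.histogram(np.log10(samples), bins=7, range=(-11, -4))
fractions = counts / counts.sum()
print(fractions)
```

Each decade should receive roughly one seventh of the samples. With a plain uniform distribution over the same range, almost all samples would fall in the top decade.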

and then we will specify the total number of points to sample - 10, in this example:

In [None]:
clf = RandomizedSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True,  n_iter = 10)
%time clf.fit(X_train, y_train)

In [None]:
pd.DataFrame(clf.cv_results_)

In [None]:
print(clf.best_params_)

In [None]:
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Our random search can find a good solution in only about 20 seconds. However, depending on the random samples it chooses, it may be a better solution or a worse solution than the one we found via grid search.
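
Most of the time difference comes down to how many models each search has to fit. As a rough sketch, we can count the fits for the "final" grid and for a random search with `n_iter=10`, both with 3-fold cross validation. (This ignores the one extra re-fit at the end, and the fact that some settings take much longer to fit than others.)

```python
from sklearn.model_selection import ParameterGrid

# The "final" search grid from the grid search section
final_grid = [
    {'C': [1e-8, 1e-7, 1e-6, 1e-5, 1e-4], 'kernel': ['linear']},
    {'C': [1, 1e2, 1e3, 1e4, 1e5, 1e6],
     'gamma': [1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10, 1e-11],
     'kernel': ['rbf']},
]

n_folds = 3
n_iter = 10

grid_fits = len(ParameterGrid(final_grid)) * n_folds   # every combination, every fold
random_fits = n_iter * n_folds                         # only the sampled points

print("grid search fits  :", grid_fits)
print("random search fits:", random_fits)
```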

## Adaptive Search (Bayes Search)

Finally, we’ll consider one other type of hyperparameter optimization: we will look at an adaptive search that uses information about the models it has seen so far in order to decide which part of the hyperparameter space to sample from next.
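
To make the idea of "using what we have seen so far" concrete, here is a deliberately simplified toy. It is **not** Bayesian optimization and not what `BayesSearchCV` does internally; it is a plain adaptive random search that I am swapping in because it fits in a few lines. It tries a few random points first, then keeps sampling near the best point found so far. The score function is hypothetical, standing in for a validation score as a function of $\log_{10} \gamma$.

```python
import numpy as np

def toy_validation_score(log10_gamma):
    # Hypothetical score surface (an assumption), peaking at log10(gamma) = -6.5
    return 1.0 - 0.05 * (log10_gamma + 6.5) ** 2

rng = np.random.default_rng(0)
low, high = -11.0, -4.0

# Phase 1: a few purely random samples across the whole range
tried = [float(g) for g in rng.uniform(low, high, size=5)]
scores = [toy_validation_score(g) for g in tried]
best_after_random = max(scores)

# Phase 2: use what we have seen so far, sampling near the current best point
for step in range(15):
    best_point = tried[int(np.argmax(scores))]
    candidate = float(np.clip(rng.normal(best_point, 1.0), low, high))
    tried.append(candidate)
    scores.append(toy_validation_score(candidate))

best_overall = max(scores)
print("best after random phase  :", best_after_random)
print("best after adaptive phase:", best_overall)
```

A real adaptive search has to balance exploring new regions against refining around good ones, which this toy handles only crudely through the fixed sampling width.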

We will install the `scikit-optimize` package, which provides `BayesSearchCV`.

In [None]:
!pip install scikit-optimize

In [None]:
from skopt import BayesSearchCV
from skopt.plots import plot_evaluations

We will define the search space:

In [None]:
param_grid = [
  {'C': (1, 1e6, 'log-uniform'), 'gamma': (1e-11, 1e-4, 'log-uniform'), 'kernel': ['rbf']},
 ]

As before, we will specify the total number of points to sample - 5, in this example:

In [None]:
clf = BayesSearchCV(SVC(), param_grid, cv=3, refit=True, verbose=100, n_jobs=-1, return_train_score=True,  n_iter = 5)
%time clf.fit(X_train, y_train)

In [None]:
pd.DataFrame(clf.cv_results_)

In [None]:
print(clf.best_params_)

In [None]:
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

To see how this works, we will re-run the Bayes search with more iterations than we really need, just so that we can visualize how it searches the hyperparameter space.

In [None]:
clf = BayesSearchCV(SVC(), param_grid, cv=3, refit=False, verbose=100, n_jobs=-1, n_iter = 50)
clf.fit(X_train, y_train)

In [None]:
plot_evaluations(clf.optimizer_results_[0])

This creates a grid of plots as follows:

-   the diagonal plots are histograms that show the distribution of samples for each hyperparameter.
-   the scatter plot shows the samples in the hyperparameter space that were “visited”, and the order in which they were “visited” is encoded in the point’s color. A red star shows the best hyperparameters that we found.