# Comparing randomized search and grid search for hyperparameter estimation

Adapted from http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html

This example compares randomized search and grid search for optimizing the hyperparameters of a random forest. All parameters that influence the learning are searched simultaneously, except for the number of estimators, which poses a time/quality tradeoff.

The randomized search and the grid search cover the same parameter ranges: the grid evaluates a few fixed values per parameter, while the randomized search samples from those ranges. The resulting parameter settings are quite similar, while the run time for randomized search is drastically lower.

The performance is slightly worse for the randomized search, though this is most likely a noise effect and would not carry over to a held-out test set.

Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.
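To make the cost difference concrete, the sketch below counts the settings a full grid produces for the `param_grid` used later in this notebook and compares that with the 20 settings drawn for the randomized search. The helper names `grid_candidates` and `sample_candidates` are hypothetical, and the sampler uses Python's `random` module (with replacement) as a simplified stand-in for what `RandomizedSearchCV` does; it is not scikit-learn's implementation.

```python
import itertools
import random

# Same grid as the param_grid used for GridSearchCV in this notebook.
param_grid = {
    "max_depth": [3, None],
    "max_features": [1, 3, 10],
    "min_samples_split": [1, 3, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}


def grid_candidates(grid):
    """Return every combination of values in the grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[name] for name in names))]


def sample_candidates(grid, n_iter, seed=0):
    """Draw n_iter settings by picking each value independently at random.

    Simplified stand-in for randomized search (assumption): it samples from
    the grid's value lists, with replacement, instead of scipy distributions.
    """
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_iter)]


all_candidates = grid_candidates(param_grid)
sampled = sample_candidates(param_grid, n_iter=20)

print("grid search candidates:      ", len(all_candidates))
print("randomized search candidates:", len(sampled))
```

The grid yields 216 candidates against the 20 drawn for the randomized search, and every candidate is fitted once per cross-validation fold, which accounts for most of the run-time gap. Adding one more three-valued parameter to the grid would triple the count, which is why a grid is usually restricted to the few parameters that matter most.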

In [1]:
Pkg.update()

INFO: Updating METADATA...
INFO: Updating cache of StatsBase...
INFO: Updating cache of URIParser...
INFO: Updating FixedPointNumbers...
INFO: Updating Autoreload...
INFO: Updating PyCall...
INFO: Updating FastAnonymous...
INFO: Updating Images...
INFO: Updating MultivariateStats...
INFO: Updating Plots...
INFO: Updating Benchmarks...
INFO: Updating QuartzImageIO...
INFO: Updating Regression...
INFO: Computing changes...
INFO: Upgrading StatsBase: v0.7.4 => v0.8.0
INFO: Upgrading URIParser: v0.1.2 => v0.1.3


In [6]:
Pkg.status("QuartzImageIO")

 - QuartzImageIO                 0.1.2              master


In [7]:
?Pkg.status("FastAnonymous")

```
status()
```

Prints out a summary of what packages are installed and what version and state they're in.


In [15]:
Pkg.status("FixedPointNumbers")

 - FixedPointNumbers

In [16]:
Pkg.status("Images")

 - Images                        0.5.3+             master


In [18]:
Pkg.status("Images")

 - Images                        0.5.3


In [19]:
Pkg.status("Autoreload")

 - Autoreload                    0.2.0+             master


In [21]:
Pkg.status("Autoreload")

 - Autoreload                    0.2.0


In [22]:
Pkg.status("QuartzImageIO")

 - QuartzImageIO                 0.1.2              master


In [25]:
Pkg.status("QuartzImageIO")

 - QuartzImageIO                 0.1.2


In [24]:
Pkg.status("FixedPointNumbers")

 - FixedPointNumbers

In [29]:
Pkg.free("Plots")

INFO: Freeing Plots
INFO: No packages to install, update or remove


In [30]:
Pkg.free("Regression")

INFO: Freeing Regression
INFO: No packages to install, update or remove


In [31]:
Pkg.free("MultivariateStats")

INFO: Freeing MultivariateStats
INFO: No packages to install, update or remove


In [32]:
Pkg.update()

INFO: Updating METADATA...
INFO: Updating Benchmarks...
INFO: Updating PyCall...
INFO: Computing changes...
INFO: No packages to install, update or remove


In [None]:
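from time import time
from operator import itemgetter

import numpy as np  # needed by report() below for np.std
from scipy.stats import randint as sp_randint

# sklearn.grid_search is the pre-0.18 module; it was removed in scikit-learn 0.20,
# where these classes live in sklearn.model_selection instead.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV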
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

In [None]:
# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.grid_scores_)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)
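
The `sklearn.grid_search` module and the `grid_scores_` attribute used in the cells above were removed in scikit-learn 0.20. On newer versions the search classes are imported from `sklearn.model_selection` and expose their scores as a `cv_results_` dictionary. As a rough sketch of how `report` might be adapted to that layout, the version below reads the `rank_test_score`, `mean_test_score`, `std_test_score` and `params` keys. The dictionary `toy_results` is a made-up stand-in used only to exercise the function; its numbers are not results from this notebook.

```python
def report(results, n_top=3):
    """Print the n_top best-ranked candidates from a cv_results_-style mapping.

    Expects the keys 'rank_test_score', 'mean_test_score', 'std_test_score'
    and 'params', which is how scikit-learn >= 0.18 lays out cv_results_.
    Returns the indices of the reported candidates, best first.
    """
    ranks = list(results["rank_test_score"])
    # Order candidate indices by rank (1 is best); ties keep their original order.
    order = sorted(range(len(ranks)), key=lambda i: ranks[i])[:n_top]
    for i in order:
        print("Model with rank: {0}".format(ranks[i]))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            results["mean_test_score"][i], results["std_test_score"][i]))
        print("Parameters: {0}".format(results["params"][i]))
        print("")
    return order


# Toy stand-in for search.cv_results_ (assumption): the values are invented
# for illustration and are not results from running this notebook.
toy_results = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.90, 0.93, 0.85],
    "std_test_score": [0.02, 0.01, 0.03],
    "params": [{"max_depth": 3}, {"max_depth": None}, {"max_depth": 3}],
}
top = report(toy_results, n_top=2)
```

With a fitted search on a recent scikit-learn, the call would be `report(random_search.cv_results_)`. Recent versions also require an integer `min_samples_split` to be at least 2, so the lower bounds in `param_dist` and `param_grid` would need adjusting before the searches run there.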