# Grid Search

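Grid search is a powerful tool for tuning hyperparameters. Parameters that are not learned inside the estimator itself can be tuned by searching over a predefined set of candidate values, for example `C` and `kernel`.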

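A grid search for hyperparameter tuning consists of the following key parts:

1. An estimator, which may be a classifier or a regressor
2. A parameter space
3. A method for searching or sampling candidates
4. A cross-validation scheme
5. A scoring function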
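
The five parts listed above can be mapped onto code roughly as follows. This is a minimal sketch written against the current `sklearn.model_selection` API rather than the older modules used in the notebook cells; the grid values, the `StratifiedKFold` settings, and the `"accuracy"` scorer are illustrative assumptions, not choices made by the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. The estimator (a classifier here)
estimator = SVC()

# 2. The parameter space (illustrative values)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

# 4. The cross-validation scheme (assumed: shuffled, stratified 5-fold)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 3. The search method: GridSearchCV tries every combination in the grid
# 5. The scoring function: plain accuracy
search = GridSearchCV(estimator, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)          # 2 kernels x 2 values of C
print(search.best_params_)
```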

In [1]:
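import numpy as np
# Note: sklearn.cross_validation and sklearn.grid_search are the module names
# used by the scikit-learn version this notebook was run with; later releases
# moved their contents into sklearn.model_selection.
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
from sklearn import metrics
import pandas as pd
from pandas import DataFrame, Series
from sklearn.grid_search import GridSearchCV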

In [2]:
iris = datasets.load_iris()

In [3]:
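# Hold out 40% of the samples as a test set; the search below only sees the rest
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data,
    iris.target,
    test_size=0.4)

# Peek at the first three training samples and their labels
print X_train[:3]
print y_train[:3]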

[[ 7.7  2.8  6.7  2. ]
 [ 6.4  3.2  4.5  1.5]
 [ 6.3  3.3  6.   2.5]]
[2 1 2]


In [4]:
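# Two sub-grids: gamma is only searched for the RBF kernel
param1 = {'kernel': ['rbf'],
          'gamma': [1e-3, 1e-4],
          'C': [0.9, 10, 100, 1000]}

param2 = {'kernel': ['linear'],
          'C': [0.9, 10, 100, 1000]}

tuned_parameters = [param1, param2]

# Exhaustive search over both sub-grids with 5-fold cross-validation
clf = GridSearchCV(svm.SVC(),
                   tuned_parameters,
                   cv=5)

clf.fit(X_train, y_train)

print clf.best_estimator_   # estimator refit with the best parameters
print '+' * 40
print clf.best_params_      # the winning parameter combination
print '-' * 40
print clf.grid_scores_      # mean and std of the CV score for every combination
print '+' * 40
print clf.best_score_       # mean CV score of the best combination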

{'kernel': 'linear', 'C': 0.9}
----------------------------------------
[mean: 0.51111, std: 0.12247, params: {'kernel': 'rbf', 'C': 0.9, 'gamma': 0.001}, mean: 0.36667, std: 0.01324, params: {'kernel': 'rbf', 'C': 0.9, 'gamma': 0.0001}, mean: 0.90000, std: 0.06920, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}, mean: 0.65556, std: 0.01824, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}, mean: 0.96667, std: 0.02790, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}, mean: 0.90000, std: 0.06920, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}, mean: 0.97778, std: 0.04706, params: {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}, mean: 0.96667, std: 0.02790, params: {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}, mean: 0.98889, std: 0.02353, params: {'kernel': 'linear', 'C': 0.9}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 10}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 100}, mean: 0.94444, std: 0.05263, params: {'kernel': 'linear', 'C'
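
A long list of scores like the one above is easier to read as a table. In later scikit-learn releases `grid_scores_` was replaced by the `cv_results_` dictionary, which loads directly into a pandas `DataFrame`. The sketch below assumes that newer API and, for brevity, fits on the full iris data instead of the 60% training split used above, so its numbers will not match the recorded output.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same two sub-grids as in the notebook cell
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [0.9, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [0.9, 10, 100, 1000]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
table = results[columns].sort_values("rank_test_score")
print(table.to_string(index=False))
```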

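As the code above shows, GridSearchCV already performs cross-validation internally. If the grid is reduced to a single parameter combination, the search is equivalent to ordinary cross-validation.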
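
The equivalence can be checked directly. The following sketch, which assumes the current `sklearn.model_selection` API and an arbitrary single combination (`kernel='linear'`, `C=10`), compares the score of a one-point grid search with the mean of `cross_val_score` for the same estimator and the same `cv=5` setting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A "grid" containing exactly one parameter combination
single_point = {"kernel": ["linear"], "C": [10]}
search = GridSearchCV(SVC(), single_point, cv=5).fit(X, y)

# Ordinary 5-fold cross-validation of the same configured estimator
plain_cv = cross_val_score(SVC(kernel="linear", C=10), X, y, cv=5)

print(search.best_score_, plain_cv.mean())
same = np.isclose(search.best_score_, plain_cv.mean())
print(same)
```

Both calls use the same default stratified folds for an integer `cv` on a classifier, which is why the two numbers agree.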

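GridSearch tunes the hyperparameters of one given classifier. But which classifier is actually better, and how should its parameters be set? A Pipeline can be used to combine the estimators and the parameter space.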
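
One way to do this, sketched here under the current scikit-learn API, is to make the classifier itself a searchable pipeline step and to address its parameters with the `step__parameter` naming convention. The step names `scale` and `clf`, the `StandardScaler` step, the choice of `LogisticRegression` as the competing classifier, and the grid values are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The "clf" step is a placeholder that the grid swaps out
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Each dict pairs one candidate classifier with its own parameter grid
param_grid = [
    {"clf": [SVC()], "clf__kernel": ["linear", "rbf"], "clf__C": [1, 10]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [1, 10]},
]

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

best_clf_name = type(search.best_params_["clf"]).__name__
print(best_clf_name, search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on the training folds of every split, so no information from the validation fold leaks into preprocessing.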

## Randomized Grid Search

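Randomized search draws parameter values at random instead of enumerating every combination. The values are generated from probability distributions: for example, `scipy.stats.randint` from SciPy is a discrete uniform distribution over integers and can be used to produce random integer values. Randomized search also takes a parameter `n_iter`, which sets how many random parameter settings are sampled.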
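
To see what the sampling step produces on its own, the sketch below draws from `scipy.stats.randint` and then uses `ParameterSampler` from the current `sklearn.model_selection` module to generate `n_iter` candidate settings. The bounds `1` and `100` and the fixed `random_state` are assumptions chosen for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Discrete uniform distribution over the integers 1..99 (upper bound exclusive)
c_dist = randint(1, 100)
draws = c_dist.rvs(size=5, random_state=0)
print(draws)

# n_iter controls how many parameter settings are sampled from the space
space = {"kernel": ["linear"], "C": c_dist}
candidates = list(ParameterSampler(space, n_iter=20, random_state=0))
print(len(candidates))
print(candidates[:3])
```

Because `C` is sampled with replacement, the same value can appear more than once among the candidates, as it does in the recorded notebook output.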

In [5]:
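from scipy.stats import randint as sp_randint
from sklearn.grid_search import RandomizedSearchCV

# C is drawn from a discrete uniform distribution over the integers 1..99
# (randint takes integer bounds; the upper bound is exclusive)
random_param = {'kernel': ['linear'],
                'C': sp_randint(1, 100)}
# n_iter=20: evaluate 20 randomly sampled settings, each with 5-fold CV
random_search = RandomizedSearchCV(svm.SVC(), random_param, n_iter=20, cv=5)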

In [6]:
random_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'kernel': ['linear'], 'C': <scipy.stats.distributions.rv_frozen object at 0x10cfa2850>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)

In [7]:
print random_search.best_params_
print '-' * 40
print random_search.grid_scores_
print '+' * 40
print random_search.best_score_

{'kernel': 'linear', 'C': 18}
----------------------------------------
[mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 18}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 10}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 24}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 17}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 93}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 16}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 72}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 44}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 37}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 10}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 32}, mean: 0.96667, std: 0.04706, params: {'kernel': 'linear', 'C': 14}, mean: 0.95556, std: 0.04386, params: {'kernel': 'linear', 'C': 38}, mean: 0.95556, std: 0.04386, params: {'kerne

Grid search is well suited to parallelization: set the parameter `n_jobs=-1` to use all available CPU cores.
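
A minimal sketch of the parallel setting, assuming the current `sklearn.model_selection` API and an illustrative grid: the only change from a serial search is the `n_jobs` argument, and the selected parameters should be the same either way because each fit is independent of the others.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"kernel": ["linear", "rbf"], "C": [0.9, 10, 100, 1000]}

# n_jobs=1 runs the fits one after another
serial = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=1).fit(X, y)

# n_jobs=-1 spreads the (combination, fold) fits over all available cores
parallel = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1).fit(X, y)

same_choice = serial.best_params_ == parallel.best_params_
print(parallel.best_params_, same_choice)
```

On a dataset as small as iris the process start-up cost can outweigh the gain; the benefit shows up with larger grids or slower estimators.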