## Model (Hyperparameter) Selection
---
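Because the hyperparameter space is unbounded, any hyperparameter configuration we find is only a **"better" solution**; there is no guarantee of an optimal one.
In practice we usually rely on **grid search (GridSearch)**, a brute-force search over a hyperparameter space discretized at fixed steps: each combination of hyperparameters is plugged into the learning algorithm and treated as a new model.
To compare these models, each one is evaluated on **the same training and development data**, usually via **cross-validation**. This makes the process very **time-consuming**, but once a good set of parameters is found it can be reused for some time, so the cost is largely one-off. Fortunately, the cross-validation runs of different candidate models are independent of one another, so **multiple cores** or even **distributed computing resources** can be used to **search in parallel (Parallel Grid Search)**.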
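
The idea can be sketched with the standard library alone. The snippet below is a minimal illustration, not scikit-learn's implementation: it enumerates a grid of 4 `gamma` values and 3 `C` values (the same shape as the grid used later in this notebook), counts the fits needed for 3-fold cross-validation, and evaluates the candidates concurrently. `toy_score` is a hypothetical stand-in for a cross-validated score, and its peak is chosen arbitrarily.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Log-spaced grid: 4 gamma values and 3 C values.
gammas = [10.0 ** e for e in range(-2, 2)]  # 0.01, 0.1, 1.0, 10.0
cs = [10.0 ** e for e in range(-1, 2)]      # 0.1, 1.0, 10.0
n_folds = 3

# Every combination of values is one candidate "model".
candidates = [{"gamma": g, "C": c} for g, c in product(gammas, cs)]
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 12 36


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (no model is trained).

    By construction it peaks at gamma=1.0, C=10.0.
    """
    return -abs(math.log10(params["gamma"])) - abs(math.log10(params["C"]) - 1)


# Candidates do not depend on each other, so they can be scored concurrently.
# Threads keep this sketch simple; real CPU-bound fits would normally be
# spread over processes instead.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(toy_score, candidates))

best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
print(best_params)
```

Because each candidate is scored independently, the order of evaluation does not affect which configuration wins; only the elapsed time changes.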

Articles on hyperparameters:
* https://www.jianshu.com/p/e46ed8049b5f
* https://blog.csdn.net/zhangbaoanhadoop/article/details/79559025


In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
news = fetch_20newsgroups(subset='all')

In [11]:
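# First, the single-core version of grid search.
# train_test_split and GridSearchCV live in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.cross_validation and sklearn.grid_search modules were removed in 0.20.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV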

X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

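# The two hyperparameters under test take 4 and 3 values respectively:
# svc__gamma takes 10^-2, 10^-1, 10^0, 10^1 and svc__C takes 10^-1, 10^0, 10^1.
# That gives 12 hyperparameter combinations, i.e. 12 differently parameterized models.
# numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None) returns numbers spaced evenly on a log scale.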
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}

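# Each model is cross-validated 3 times, so 36 fits are needed in total (plus one final refit on the full training set, since refit=True). According to the log below, each fit takes about 5 seconds when run in a single process.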
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3)

%time _=gs.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   4.9s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.9s remaining:    0.0s


[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   5.0s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   5.0s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   4.9s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   5.2s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   5.1s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   5.0s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   5.3s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] .

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  3.1min finished


Wall time: 3min 16s
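
As a rough sanity check on the log above, the elapsed time can be estimated from the number of fits. The 5-second figure is an approximation read off the `[CV]` lines, not a measured average.

```python
n_candidates = 4 * 3    # 4 gamma values x 3 C values
n_folds = 3
seconds_per_fit = 5.0   # approximate, read off the [CV] log lines

estimated_seconds = n_candidates * n_folds * seconds_per_fit
print(f"{estimated_seconds / 60:.1f} min")  # 3.0 min
```

This is close to the 3.1 minutes reported for the 36 fits. The reported wall time of 3min 16s is somewhat longer, which is consistent with the extra refit on the full training set that `refit=True` triggers.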


In [13]:
gs.best_params_, gs.best_score_

print(gs.score(X_test, y_test))

0.822666666667


In [15]:
# Next, run the search in parallel across all CPU cores (n_jobs=-1) and observe the improvement in running time

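# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV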
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)


from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}


gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)

%time _=gs.fit(X_train, y_train)
gs.best_params_, gs.best_score_
print(gs.score(X_test, y_test))
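# The parallel search reaches the same best configuration (the test score is identical), while training time falls as more CPU cores are used: here wall time drops from 3min 16s to 1min 24s.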

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  1.3min finished


Wall time: 1min 24s
0.822666666667
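
The speedup implied by the two recorded wall times can be worked out directly. The number of cores on the machine is not stated in the output, so parallel efficiency cannot be derived from these figures alone.

```python
serial_seconds = 3 * 60 + 16    # "Wall time: 3min 16s" (n_jobs=1)
parallel_seconds = 1 * 60 + 24  # "Wall time: 1min 24s" (n_jobs=-1)

speedup = serial_seconds / parallel_seconds
print(f"speedup: {speedup:.2f}x")  # speedup: 2.33x
```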
