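# Model Validation

## Hold-out Validation

Hold-out validation is the simplest approach: randomly sample a fixed proportion of the data supplied for the task as the training set, and hold out the rest for validation. A common split is 70% for training and 30% for validation. The performance estimate obtained this way is unstable, because it depends on which samples happen to be drawn into the validation set. Cross-validation is the upgraded version of this method.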
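As a minimal sketch of the 70/30 split described above, the snippet below uses `train_test_split` on a toy array. The data and the variable names are made up for illustration and are not part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 70% for training, 30% held out for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_val))  # 70 30

# A different seed holds out a different subset, which is why a single
# hold-out estimate can vary from run to run.
_, X_val_other, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
same_rows = np.intersect1d(X_val[:, 0], X_val_other[:, 0])
print(len(same_rows))  # number of samples held out under both seeds
```

Only part of the validation set is shared between the two seeds, so a model scored on one split may look better or worse than the same model scored on the other.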

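## Cross-Validation

Cross-validation can be understood as running hold-out validation several times. The key points are that the validation sets used in the different rounds are mutually exclusive, and that together they cover every available sample. The benefit is that every sample is used for both training and validation, which makes the estimated model performance as trustworthy as possible.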
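The two properties named above (disjoint validation sets that jointly cover all the data) can be checked directly. This is a small sketch using `KFold` on ten toy samples; the data and the choice of five folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, invented for illustration

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_folds = [val_idx for _, val_idx in kf.split(X)]

for i, val_idx in enumerate(val_folds):
    print(f"round {i}: validate on samples {sorted(val_idx.tolist())}")

# Concatenating the validation folds yields every sample index exactly once:
# the folds are mutually exclusive and cover the whole data set.
all_val = np.concatenate(val_folds)
print(sorted(all_val.tolist()) == list(range(10)))  # True
```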

# Hyperparameter Search

## Grid Search

In [2]:
# Grid-search the hyperparameter combinations of a support vector classifier for text classification, using a single process
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

news = fetch_20newsgroups(subset='all')
# Split the first 3000 news posts, holding out 25% of them for final testing
x_train, x_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

In [3]:
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline  # chains the text feature extraction and the classifier into one estimator
clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])
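The step names given to the pipeline (`'vect'` and `'svc'`) matter, because scikit-learn exposes each step's parameters as `<step name>__<parameter name>`. The sketch below rebuilds the same pipeline and inspects those keys; it is an illustration of the naming convention only and is not one of the notebook's cells.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

# Parameters of a step are exposed as '<step name>__<parameter name>'.
params = clf.get_params()
print('svc__gamma' in params, 'vect__stop_words' in params)  # True True

# set_params with such a key reaches into the named step.
clf.set_params(svc__C=10.0)
print(clf.named_steps['svc'].C)  # 10.0
```

This is why a search grid for the pipeline uses keys such as `svc__gamma` and `svc__C` rather than the bare `gamma` and `C`.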

In [4]:
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
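To see what this grid contains, it can be enumerated with `ParameterGrid`. The sketch below assumes only the `parameters` dictionary defined above; the `n_folds` variable is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}

# logspace(-2, 1, 4) gives 4 gamma values from 0.01 to 10;
# logspace(-1, 1, 3) gives 3 C values from 0.1 to 10.
print(parameters['svc__gamma'])
print(parameters['svc__C'])

# The grid is the Cartesian product of the two lists.
candidates = list(ParameterGrid(parameters))
print(len(candidates))  # 12

# With 3-fold cross-validation each candidate is fitted 3 times.
n_folds = 3
print(len(candidates) * n_folds)  # 36
```

The counts match the first line of the search log, which reports 12 candidates and 36 fits.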

In [6]:
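from sklearn.model_selection import GridSearchCV  # import the grid search class
# 3-fold cross-validation for every parameter combination; refit=True retrains the best one on all of x_train
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3)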

In [8]:
# Run the grid search in a single process
%time _ = gs.fit(x_train, y_train)
gs.best_params_, gs.best_score_

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   5.2s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.2s remaining:    0.0s


[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   5.2s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   5.3s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   5.1s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   5.2s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   5.3s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   5.2s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   5.3s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] .

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  3.3min finished


Wall time: 3min 24s


({'svc__C': 10.0, 'svc__gamma': 0.10000000000000001}, 0.7906666666666666)

In [11]:
print(gs.score(x_test, y_test))

0.822666666667
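The cross-validated `best_score_` (about 0.791) and the test accuracy (about 0.823) are computed differently, so they need not agree. The sketch below shows the relationship on the small iris data set, which is swapped in here only so the example runs in seconds; the names `grid` and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

grid = {'gamma': np.logspace(-2, 1, 4), 'C': np.logspace(-1, 1, 3)}
search = GridSearchCV(SVC(), grid, cv=3, refit=True)
search.fit(X_train, y_train)

# best_score_ is the mean validation accuracy of the winning candidate over the 3 folds.
mean_scores = search.cv_results_['mean_test_score']
print(np.isclose(search.best_score_, mean_scores[search.best_index_]))  # True

# score() evaluates the refit model (trained on all of X_train) on separate data.
print(search.score(X_test, y_test))
```

During cross-validation each model sees only two thirds of the training data, whereas the refit model sees all of it. That is one plausible reason for a test score that is somewhat higher than `best_score_`.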


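## Parallel Search

Parallel search saves wall-clock time. The 36 fits are independent of one another, so setting `n_jobs=-1` distributes them across all available CPU cores. In the run below, the same search finishes in 1 min 24 s instead of 3 min 24 s.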

In [12]:
gs_ = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)

In [13]:
%time _ = gs_.fit(x_train, y_train)
gs_.best_params_, gs_.best_score_

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  1.2min finished


Wall time: 1min 24s


({'svc__C': 10.0, 'svc__gamma': 0.10000000000000001}, 0.7906666666666666)
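Getting the same best parameters and score as in the single-process run is expected: `n_jobs` changes only how the fits are scheduled, not what is fitted. The sketch below checks this on the iris data set, which stands in for the news data only to keep the run short; `n_jobs=2` and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {'gamma': np.logspace(-2, 1, 4), 'C': np.logspace(-1, 1, 3)}

# Same estimator, grid and folds; only the number of worker processes differs.
serial = GridSearchCV(SVC(), grid, cv=3, n_jobs=1).fit(X, y)
parallel = GridSearchCV(SVC(), grid, cv=3, n_jobs=2).fit(X, y)

print(serial.best_params_ == parallel.best_params_)
print(np.isclose(serial.best_score_, parallel.best_score_))
```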

In [14]:
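# This scores the refit model on the training set, so the result is not comparable with the test accuracy reported earlier
print(gs_.score(x_train, y_train))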

0.999555555556
