## Automated Hyperparameter Tuning with Grid Search (GridSearchCV)

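### 1. Manual Enumeration to Find the Best Parameter Combination

+ The example tunes a support vector machine (SVM) classifier
+ Define a list of candidate values for each of the two parameters, gamma and C
+ Loop over every parameter combination, build a model with it, and score the model
+ Keep the parameter combination with the best score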

In [1]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()  # load the data
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

best_score = 0  # best score found so far

# Train one model per parameter combination, score it, and keep the best combination
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_test, y_test)  # evaluate the SVC on the test set
        # If the score improved, store it together with its parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

print("Best score: {:.2f}".format(best_score))  # print the best score
print("Best parameters: {}".format(best_parameters))  # print the best parameter combination

Best score: 0.97
Best parameters: {'C': 100, 'gamma': 0.001}
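
The nested loops above can also be written as a single loop over all combinations. The following is a minimal sketch, assuming scikit-learn's `ParameterGrid` helper is used to expand the two value lists; the names `values`, `grid` and `results` are illustrative and not part of the original notebook. Like the cell above, it still scores on the test set, so it only illustrates the enumeration step. When several combinations tie, the one reported may differ from the nested loops because the iteration order is different.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

values = [0.001, 0.01, 0.1, 1, 10, 100]  # candidate values (same as above)
grid = ParameterGrid({'C': values, 'gamma': values})

# Each element of the grid is a dict such as {'C': 0.001, 'gamma': 0.001}
results = [(SVC(**params).fit(X_train, y_train).score(X_test, y_test), params)
           for params in grid]

# max() returns the first entry with the highest score
best_score, best_parameters = max(results, key=lambda r: r[0])
print("Combinations tried: {}".format(len(grid)))
print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
```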


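### 2. Splitting Off a Validation Set for Tuning (Avoid Tuning on the Test Set)

+ If the parameters are tuned on the test set, the test score is no longer an objective estimate (it is like using the exam questions as practice questions)
+ Tuning therefore needs a separate validation set, and the parameters are chosen by their score on that set
+ The test set is reserved for the final evaluation (it takes no part in training or tuning)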

In [2]:
# First, split the data into a (training + validation) set and a test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Then, split the (training + validation) set into a training set and a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, random_state=1)

In [3]:
best_score = 0

# Train one model per parameter combination, score it, and keep the best combination
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_valid, y_valid)  # note: the model is now scored on the validation set
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

print("Training_set size: {} , Validation_set size: {} , Test_set size: {}"
      .format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

# Rebuild a model on the (training + validation) set with the best parameters, then evaluate it on the test set
svm = SVC(**best_parameters)
svm.fit(X_train_valid, y_train_valid)
test_score = svm.score(X_test, y_test)  # the test set is used only for the final evaluation, never for tuning

print("Best score on validation_set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test_set score with best parameters: {:.2f}".format(test_score))  # print the test score
# The result depends heavily on how the data happened to be split;
# cross-validation can be used to reduce this element of chance.

Training_set size: 84 , Validation_set size: 28 , Test_set size: 38
Best score on validation_set: 0.96
Best parameters:  {'C': 10, 'gamma': 0.001}
Test_set score with best parameters: 0.92
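
To see how much the chosen parameters depend on the split, the validation-set search can be repeated with different `random_state` values for the training/validation split. This is only a sketch: the helper `search_on_split` and the choice of five seeds are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

values = [0.001, 0.01, 0.1, 1, 10, 100]


def search_on_split(seed):
    """Hypothetical helper: rerun the validation-set search for one split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train_valid, y_train_valid, random_state=seed)
    best_score, best_parameters = -1.0, None
    for gamma in values:
        for C in values:
            svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
            score = svm.score(X_valid, y_valid)
            if score > best_score:
                best_score, best_parameters = score, {'C': C, 'gamma': gamma}
    return best_score, best_parameters


# Only the training/validation split changes; the test set stays untouched
outcomes = {seed: search_on_split(seed) for seed in range(5)}
for seed, (score, params) in outcomes.items():
    print("seed={}: validation score {:.2f}, parameters {}".format(seed, score, params))
```

If the printed parameter combinations differ from seed to seed, that is the split dependence described above.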


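### 3. Automated Tuning with Grid Search (GridSearchCV)

+ GridSearchCV combines the parameter search with cross-validation
+ The .best_params_ and .best_score_ attributes hold the best parameter combination and its mean cross-validation score
+ Compared with manual enumeration, grid search needs less code and is less error-prone; it still trains one model per parameter combination per fold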

In [4]:
from sklearn.model_selection import GridSearchCV  # import the grid search class

# Define the parameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# Create the GridSearchCV object; note its arguments
# Argument 1: the model
# Argument 2: the parameter grid
# Argument 3: the number of cross-validation folds (grid search cross-validates every combination)
grid_search = GridSearchCV(SVC(), param_grid, cv=10)

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
grid_search.fit(X_train, y_train)  # pass the training data to the GridSearchCV object

# Print the best parameter combination and the best score (via .best_params_ and .best_score_)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))


Best parameters: {'C': 10, 'gamma': 0.1}
Best cross-validation score: 0.98
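
The cross-validation score above is computed on the training data only, so the model still has to be scored on the test set. The following sketch assumes the default `refit=True`, under which GridSearchCV retrains the best combination on all of `X_train` and exposes it as `best_estimator_`; the variable names `best_model` and `mean_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

values = [0.001, 0.01, 0.1, 1, 10, 100]
grid_search = GridSearchCV(SVC(), {'C': values, 'gamma': values}, cv=10)
grid_search.fit(X_train, y_train)

# With refit=True (the default), the best model is retrained on all of X_train
best_model = grid_search.best_estimator_

# grid_search.score delegates to the refitted best model
test_score = grid_search.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

# cv_results_ holds one mean cross-validation score per parameter combination
mean_scores = grid_search.cv_results_['mean_test_score']
print("Combinations evaluated: {}".format(len(mean_scores)))
```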


### 4. Reducing the Effect of Chance with Cross-Validation

In [7]:
from sklearn.model_selection import cross_val_score

# X_train, X_test, y_train and y_test come from the split in the previous cell
best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_train, y_train, cv=3)  # 3-fold cross-validation
        score = scores.mean()  # average over the folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}

# Retrain on the full training set with the best parameters, then score on the test set
svm = SVC(**best_parameters)
svm.fit(X_train, y_train)
test_score = svm.score(X_test, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))

Best score on validation set:0.96
Best parameters:{'gamma': 0.01, 'C': 100}
Score on testing set:0.97
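
The manual cross-validation loop above does the same work as GridSearchCV with `cv=3`. The sketch below compares the two best mean scores; it assumes both use the default stratified 3-fold split without shuffling, and the name `manual_best` is illustrative. Only the scores are compared, because when several combinations tie the two approaches may report different parameters.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

values = [0.001, 0.01, 0.1, 1, 10, 100]

# Manual loop: best mean 3-fold cross-validation score over all combinations
manual_best = max(
    cross_val_score(SVC(gamma=gamma, C=C), X_train, y_train, cv=3).mean()
    for gamma in values
    for C in values)

# The same search expressed with GridSearchCV
grid_search = GridSearchCV(SVC(), {'C': values, 'gamma': values}, cv=3)
grid_search.fit(X_train, y_train)

print("Manual loop best score: {:.4f}".format(manual_best))
print("GridSearchCV best score: {:.4f}".format(grid_search.best_score_))
```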
