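## Why a Test Set Matters
> **Problem 1: overfitting to one particular test set**

> **Fix: add a validation set. The validation set takes part in building the model (hyperparameter tuning); the test set does not.**

> **Problem 2: randomness (a single validation split may not be representative)**

> **Fix: tune with cross validation, splitting the training data into k folds**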

<img src='./picture/6-1.png'>
<img src='./picture/6-2.png'>
<img src='./picture/6-3.png'>
<img src='./picture/6-4.png'>
<img src='./picture/6-5.png'>
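
The train / validation / test idea above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the split ratios, the `X_rest` / `X_val` names, and the choice to tune only `n_neighbors` are assumptions made for the example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hold out the test set first; it is not used again until the final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)
# Split the remainder into a training set and a validation set (assumed ratio).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=666)

# Tune the hyperparameter against the validation set only.
best_score, best_k = 0.0, 0
for k in range(2, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_val, y_val)
    if score > best_score:
        best_score, best_k = score, k

# Report the final score on data that played no part in tuning.
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_k, best_score, test_score)
```

Because the test set never influences the choice of `best_k`, `test_score` is a fairer estimate of how the model behaves on unseen data than `best_score`.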

## Cross Validation
> **A score obtained by cross validation is more trustworthy than a score from a single split**

In [1]:
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target

---
Tuning directly against the test set

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

In [3]:
from sklearn.neighbors import KNeighborsClassifier

best_score, best_p, best_k = 0,0,0

for k in range(2, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=k, p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score, best_p, best_k = score, p, k
            
print(best_score, best_k, best_p)

0.9866666666666667 5 2


---
Using cross validation to find the best kNN classifier

In [4]:
from sklearn.model_selection import cross_val_score

In [5]:
knn_clf = KNeighborsClassifier()
cross_val_score(knn_clf, X_train, y_train)  # splits into 3 folds by default here (scikit-learn 0.22+ defaults to 5)

array([0.98896247, 0.98210291, 0.98210291])
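
To make clearer what `cross_val_score` does, the sketch below reproduces it by hand. It assumes that, for a classifier and an integer `cv`, scikit-learn uses unshuffled stratified folds, so `StratifiedKFold` is used here; the names `fit_idx`, `val_idx`, and `fold_clf` are illustrative. The number of folds is passed explicitly because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=666)

knn_clf = KNeighborsClassifier()
skf = StratifiedKFold(n_splits=3)

manual_scores = []
for fit_idx, val_idx in skf.split(X_train, y_train):
    # Train a fresh copy on k-1 folds, then score it on the held-out fold.
    fold_clf = clone(knn_clf)
    fold_clf.fit(X_train[fit_idx], y_train[fit_idx])
    manual_scores.append(fold_clf.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same three per-fold accuracies.
library_scores = cross_val_score(knn_clf, X_train, y_train, cv=3)
print(manual_scores)
print(library_scores)
```

Each sample in the training data is used for validation exactly once, which is why the averaged score depends less on one lucky or unlucky split.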

In [6]:
best_score, best_p, best_k = 0,0,0

for k in range(2, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=k, p=p)
        scores = cross_val_score(knn_clf, X_train, y_train)  # 3 folds by default here (scikit-learn 0.22+ defaults to 5)
        score = np.mean(scores)
        if score > best_score:
            best_score, best_p, best_k = score, p, k
            
print(best_score, best_k, best_p)

0.9866166891368011 5 3
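
The cross-validated score above is only used to choose the hyperparameters. A possible final step, sketched here under the assumption that `k=5, p=3` from the printed result is the chosen setting, is to refit on the whole training set and score once on the untouched test set. The variable name `best_knn_clf` is illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=666)

# Hyperparameters taken from the cross-validation search result above.
best_knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=5, p=3)
best_knn_clf.fit(X_train, y_train)

# The test set was never seen during the search, so this is the number to report.
test_score = best_knn_clf.score(X_test, y_test)
print(test_score)
```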


In [None]:
#cross_val_score(knn_clf, X_train, y_train, cv=5)  # pass cv to change the number of folds
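
A runnable version of the commented-out call might look like the following. It is a small sketch with a default `KNeighborsClassifier`; reporting the mean and standard deviation is an addition for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=666)

knn_clf = KNeighborsClassifier()
# cv=5 requests five folds, so five accuracies come back.
scores = cross_val_score(knn_clf, X_train, y_train, cv=5)
print(scores)
print(scores.mean(), scores.std())
```

More folds means each model trains on a larger share of the data, at the cost of fitting more models.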

## Revisiting Grid Search
> **Pass the cv parameter to control how many folds the data is split into**

In [7]:
param_grid = [
    {
        'weights':['uniform'],
        'n_neighbors':[i for i in range(1, 11)]
    },
    {
        'weights':['distance'],
        'n_neighbors':[i for i in range(1, 11)],
        'p': [i for i in range(1,6)]
    }    
]

In [8]:
from sklearn.model_selection import GridSearchCV  # scores every parameter combination with cross validation

In [11]:
#knn_clf = KNeighborsClassifier()

# Default number of folds:
#grid_search = GridSearchCV(knn_clf, param_grid)
#grid_search.fit(X_train, y_train)

# Explicitly request 5 folds:
#grid_search = GridSearchCV(knn_clf, param_grid, cv=5)
#grid_search.fit(X_train, y_train)

Wall time: 0 ns
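
Since the cell above is commented out, here is a small runnable sketch of the same idea. The grid is deliberately reduced (an assumption made so the example finishes quickly), and `small_param_grid` is a name introduced only for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=666)

# Reduced grid for speed; the full param_grid above covers more values.
small_param_grid = {
    'weights': ['distance'],
    'n_neighbors': [2, 3, 4, 5],
    'p': [1, 2],
}

grid_search = GridSearchCV(KNeighborsClassifier(), small_param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)  # mean cross-validated accuracy of the best combination
# best_estimator_ is refit on all of X_train, so it can be scored on the test set.
print(grid_search.best_estimator_.score(X_test, y_test))
```

With 8 parameter combinations and 5 folds, this fits 40 models plus one final refit, which shows how the cost grows with both the grid size and `cv`.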


<img src='./picture/6-6.png'>
<img src='./picture/6-7.png'>