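Introducing cross-validation keeps our model from overfitting to the test data.

The test data is used only for the final evaluation; it takes no part in building the model.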
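A minimal sketch of the idea, assuming scikit-learn's `KFold`: the training set is split into folds, each fold serves once as the validation set, and the test set is never touched during the loop. The names `fit_idx`, `val_idx`, and `fold_scores`, along with the choices `n_splits=3` and `n_neighbors=3`, are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
# Hold the test set out; it is not used anywhere below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=666
)

# Split the training data into 3 folds (illustrative choice).
kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
for fit_idx, val_idx in kf.split(X_train):
    model = KNeighborsClassifier(n_neighbors=3)
    # Fit on two folds, validate on the remaining one.
    model.fit(X_train[fit_idx], y_train[fit_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

# The cross-validation score is the mean over the validation folds.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```

Because every score in `fold_scores` comes from training data only, comparing hyperparameters by `cv_score` leaves the test set as an untouched final check.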

### Finding the best hyperparameters with a handwritten digit recognition example

In [18]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV

#### Loading the data

In [2]:
from sklearn.datasets import load_digits

In [4]:
data = load_digits()
X = data.data
y = data.target

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=666)

In [13]:
# Naive search: k and p are chosen by their score on the test set,
# so the test data leaks into model selection.
best_score = 0
best_p = 0
best_k = 0
for k in range(1, 6):
    for p in range(1, 4):
        knn_cls = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        knn_cls.fit(X_train, y_train)
        score = knn_cls.score(X_test, y_test)
        if best_score < score:
            best_score = score
            best_p = p
            best_k = k

In [14]:
print(best_score,best_p,best_k)

0.986091794159 3 4


In [15]:
knn_cls = KNeighborsClassifier(weights="distance",n_neighbors=4,p=3)
knn_cls.fit(X_train,y_train)
knn_cls.score(X_test,y_test)

0.98609179415855353

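Using cross-validation to find the best hyperparameters. The search above picked k and p by their score on the test set, so that score is optimistically biased; here the selection relies on the training data only.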

In [19]:
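# Same search, but each candidate is scored by its mean cross-validation
# accuracy on the training set; the test set is not used.
best_score = 0
best_p = 0
best_k = 0
for k in range(1, 6):
    for p in range(1, 4):
        knn_cls = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        score = np.average(cross_val_score(knn_cls, X_train, y_train))
        if best_score < score:
            best_score = score
            best_p = p
            best_k = k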

In [20]:
print(best_score,best_p,best_k)

0.982359987401 2 1


In [22]:
knn_cls = KNeighborsClassifier(weights="distance",n_neighbors=1,p=2)
knn_cls.fit(X_train,y_train)
knn_cls.score(X_test,y_test)

0.98052851182197498
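`cross_val_score` returns one score per fold, which the loop above averages. The sketch below makes the fold count explicit with `cv=3`, which matches the 3-fold default of the scikit-learn version used in this notebook (newer releases, from 0.22 on, default to 5 folds). The variable name `fold_scores` is an assumption for illustration, and the exact numbers may differ between library versions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=666
)

# The hyperparameters selected by the cross-validation search.
knn_cls = KNeighborsClassifier(weights="distance", n_neighbors=1, p=2)

# cv=3 is passed explicitly so the result does not depend on the library default.
fold_scores = cross_val_score(knn_cls, X_train, y_train, cv=3)
print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # the value compared across candidates
```

Pinning `cv` is a small design choice that keeps results comparable when the notebook is rerun on a different scikit-learn version.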

Using grid search (which applies cross-validation internally) to find the best hyperparameters

In [27]:
para_grid = [
    {
        "weights": ["distance"],
        "n_neighbors": list(range(1, 11)),
        "p": list(range(1, 5)),
    }
]

In [32]:
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid=para_grid, n_jobs=-1, verbose=1)

In [33]:
grid_search.fit(X_train,y_train)

Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   21.9s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [34]:
grid_search.best_params_

{'n_neighbors': 1, 'p': 2, 'weights': 'distance'}

In [35]:
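# With refit=True (see the repr above), best_estimator_ has already been
# retrained on all of X_train, so no further fit call is needed.
knn_cls = grid_search.best_estimator_
knn_cls.score(X_test, y_test)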

0.98052851182197498
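Besides `best_params_`, the fitted search object also exposes the cross-validation score of the winning candidate as `best_score_`, and it can score the test set directly. The following is a sketch under stated assumptions: the grid is deliberately smaller than the one above so it runs quickly, and `cv=3` and the name `test_score` are choices made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=666
)

# Reduced grid (assumption) to keep the example fast.
param_grid = {"weights": ["distance"], "n_neighbors": [1, 3, 5], "p": [1, 2]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=3)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)  # winning combination
print(grid_search.best_score_)   # its mean cross-validation accuracy

# With refit=True (the default) the search object delegates to the
# refitted best estimator, so the test set is used only once, here.
test_score = grid_search.score(X_test, y_test)
print(test_score)
```

Reporting `best_score_` next to `test_score` shows how closely the cross-validation estimate tracks the held-out result.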