# Grid Search

In [1]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

digits = datasets.load_digits()

In [2]:
print(digits.keys())


In [3]:
print(digits.DESCR)

Optical Recognition of Handwritten Digits Data Set

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is a

In [4]:
X = digits.data
y = digits.target

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 666)

In [6]:
sk_knn_classifier = KNeighborsClassifier(n_neighbors = 4, weights = "uniform")
sk_knn_classifier.fit(X_train,y_train)
sk_knn_classifier.score(X_test,y_test)

0.9916666666666667

### Grid Search

In [7]:
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 11))
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 11)),
        'p': list(range(1, 6))
    }
]
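
To get a feel for the size of this search, the candidates in a grid of this shape can be counted by expanding each dictionary into the Cartesian product of its value lists. The sketch below uses only the standard library; `expand_grid` is a hypothetical helper written for illustration and is not part of sklearn.

```python
from itertools import product

param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

def expand_grid(grid):
    """Return every parameter combination described by a list of dicts."""
    candidates = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        # Cartesian product of the value lists of this sub-grid
        for values in product(*(sub_grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

candidates = expand_grid(param_grid)
print(len(candidates))  # 10 uniform + 10 * 5 distance = 60
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count multiplied by the number of folds. This matches the "Fitting 3 folds for each of 60 candidates, totalling 180 fits" log further down.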

In [8]:
knn_clf = KNeighborsClassifier()

Grid search in sklearn uses cross-validation as the criterion for choosing the best hyperparameters.
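
As a rough illustration of what cross-validation does, the sketch below splits sample indices into k folds and uses each fold once as the validation set. `k_fold_splits` is a hypothetical helper, and it is a simplification: for classifiers, sklearn uses a stratified splitter that also keeps class proportions similar across folds.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # The first (n_samples % n_folds) folds get one extra sample
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(10, 3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

A candidate's score is then the mean of its validation scores over all folds, so every training sample is used for validation exactly once.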

In [9]:
# Import the grid search class from sklearn
from sklearn.model_selection import GridSearchCV

In [10]:
# GridSearchCV takes two arguments: the classifier whose hyperparameters
# we want to tune, and the hyperparameter grid to search over
grid_search = GridSearchCV(knn_clf, param_grid)

In [11]:
%%time
# Call fit to run the hyperparameter search
grid_search.fit(X_train,y_train)

Wall time: 3min 9s


GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Inspect the best classifier found by the grid search

In [12]:
grid_search.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=3,
           weights='distance')
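
The selected classifier uses `weights='distance'`, which means closer neighbors count for more in the vote: each neighbor is weighted by the inverse of its distance. The sketch below shows the idea on made-up data; `weighted_vote` is a hypothetical helper, and it assumes all distances are positive (an exact match at distance zero would need special handling).

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """Pick a label from (distance, label) pairs of the k nearest points."""
    votes = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 / distance  # assumes distance > 0
    return max(votes, key=votes.get)

# Made-up neighbors: one close "red" point and two farther "blue" points
neighbors = [(1.0, "red"), (3.0, "blue"), (4.0, "blue")]
print(weighted_vote(neighbors))
```

With uniform weights the two "blue" points would win 2 to 1; with inverse-distance weights the single close "red" point outweighs them (1.0 against 1/3 + 1/4).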

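The mean cross-validated accuracy of that classifier (measured on held-out folds of the training set, not on the test set)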

In [13]:
grid_search.best_score_

0.9853862212943633

The parameter values that were selected

In [14]:
grid_search.best_params_

{'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
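
The `p` entry is the power parameter of the Minkowski metric that the classifier uses (`metric='minkowski'` in the output above): `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. A minimal sketch of that formula, with `minkowski_distance` as a hypothetical helper name:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length vectors, for p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, 1))  # Manhattan: |3| + |4|
print(minkowski_distance(a, b, 2))  # Euclidean: sqrt(9 + 16)
print(minkowski_distance(a, b, 3))  # the value of p selected above
```

Larger values of `p` let the largest coordinate difference dominate the distance more strongly.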

To use this new classifier, simply assign it to a variable

In [15]:
knn_clf = grid_search.best_estimator_

In [16]:
y_predict = knn_clf.predict(X_test)

In [17]:
from playML.metrics import accuracy_score

In [18]:
knn_clf_score = accuracy_score(y_test,y_predict)

In [19]:
print(knn_clf_score)

0.9833333333333333
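
`playML.metrics` is not part of sklearn and its source is not shown here. Assuming its `accuracy_score` computes the usual fraction of correctly predicted labels, a minimal equivalent could look like the sketch below (sklearn also ships `sklearn.metrics.accuracy_score`, which takes the same `(y_true, y_pred)` arguments).

```python
def accuracy_score(y_true, y_predict):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_predict):
        raise ValueError("y_true and y_predict must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_predict) if t == p)
    return correct / len(y_true)

# Made-up labels: three of the four predictions are correct
print(accuracy_score([1, 2, 3, 4], [1, 2, 0, 4]))
```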


Other GridSearchCV parameters

In [22]:
%%time
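# n_jobs sets the number of CPU cores used for the search (-1 uses all of them);
# verbose controls how much progress output is printed during the search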
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs = -1, verbose = 2)
grid_search.fit(X_train,y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   24.7s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.4min


Wall time: 1min 40s


[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  1.7min finished


### More distance definitions

* Cosine similarity in vector space
* Adjusted cosine similarity
* Pearson correlation coefficient
* Jaccard coefficient
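
Three of the measures listed above can be sketched in a few lines of plain Python. The function names are hypothetical, and the code assumes non-zero vectors (non-constant ones for Pearson) and a non-empty union for Jaccard. Adjusted cosine similarity is left out because it depends on a ratings-matrix setup that this section does not describe.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity of the two vectors after centering each on its mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(cosine_similarity([1, 0], [0, 1]))
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(jaccard_coefficient({1, 2, 3}, {2, 3, 4}))
```

These are similarities rather than distances: larger values mean the inputs are more alike. One common way to turn them into a distance is to use one minus the similarity.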