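# Hyperparameters
## What is a hyperparameter?
### A hyperparameter is a parameter that must be chosen before the algorithm runs
In this example the hyperparameter is n_neighbors. Different values change the accuracy of the algorithm, so we need a systematic way to choose the best n_neighbors.
### A model parameter is a parameter learned while the algorithm runs
The kNN algorithm has no model parameters, but k is a typical hyperparameter.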
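
To make the distinction concrete, here is a minimal pure-Python sketch of kNN prediction. It is not the scikit-learn implementation; the function name `knn_predict` and the toy data are invented for illustration. Nothing is learned in advance: the function only consults the stored training points, while `k` has to be supplied by the caller.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # No model parameters are fitted: prediction only looks up the stored data.
    distances = [math.dist(point, query) for point in train_points]
    # Indexes of the k training points closest to the query.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for this illustration: three "a" points near the origin,
# four "b" points far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
query = (1.0, 1.0)

pred_k3 = knn_predict(points, labels, query, k=3)  # the 3 nearest points are all "a"
pred_k7 = knn_predict(points, labels, query, k=7)  # all 7 points vote: 3 "a" vs 4 "b"
print(pred_k3, pred_k7)
```

The same query gets a different label depending on k, which is why k has to be tuned rather than guessed.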

In [33]:
import numpy as np
from sklearn import datasets

In [34]:
digits=datasets.load_digits()
X=digits.data
y=digits.target

In [35]:
# %load KNN/model_selection.py
import numpy as np

def train_test_split(X,y,ratio=0.2,random_seed=None):
    assert X.shape[0]== y.shape[0], "the size of X must be equal to the size of y"
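    assert 0.0<=ratio<=1.0,"the test ratio must be between 0 and 1"
    # compare against None so that a seed of 0 is honoured as well
    if random_seed is not None:
        np.random.seed(random_seed)
    # shuffle the row indexes, then use the first test_size of them as the test set
    shuffled_indexes=np.random.permutation(len(X))
    test_size=int(len(X)*ratio)
    test_indexes=shuffled_indexes[:test_size]
    train_indexes=shuffled_indexes[test_size:]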
    X_train=X[train_indexes]
    y_train=y[train_indexes]

    X_test=X[test_indexes]
    y_test=y[test_indexes]

    return X_train,y_train,X_test,y_test

### Split the dataset with train_test_split from the model_selection.py we wrote earlier

In [41]:
from KNN.model_selection import train_test_split
# a fixed random seed keeps the split, and therefore the results, reproducible
X_train,y_train,X_test,y_test = train_test_split(X,y,ratio=0.2,random_seed=666)

In [44]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train,y_train)
knn_clf.score(X_test,y_test)

0.9888579387186629

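# Finding the best k
Tune the hyperparameter k with an exhaustive grid search: try every k from 1 to 14 and keep the one with the highest score on the test set.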

In [47]:
best_score=0.0
best_k=-1
for k in range(1,15):
    knn_clf=KNeighborsClassifier(n_neighbors=k)  # build a new classifier for each k
    knn_clf.fit(X_train,y_train)
    score=knn_clf.score(X_test,y_test)
    if score > best_score:
        best_score=score
        best_k=k
print("best k:",best_k)
print("best score:",best_score)

best k: 3
best score: 0.9916434540389972


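## Should votes be weighted by distance or not?
Distance weighting resolves ties in the kNN vote. For example, with k=3 the three nearest neighbors may belong to three different classes with one vote each, so a plain majority vote cannot pick a winner. Weighting each vote by the inverse of its distance (weights="distance") lets closer neighbors count more; weights="uniform" gives every neighbor the same vote.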
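
The tie described above can be worked through by hand. The sketch below is a pure-Python illustration, not scikit-learn's internal code; the helper `vote` and the three neighbors with their distances are made up for the example, and it assumes every distance is strictly positive.

```python
from collections import defaultdict

def vote(neighbors, weighted):
    """Sum the votes per label for a list of (distance, label) pairs.

    With weighted=False every neighbor counts 1; with weighted=True each
    neighbor counts 1/distance (assumes distance > 0).
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 / distance if weighted else 1.0
    return dict(totals)

# Three nearest neighbors, one per class (hypothetical distances).
neighbors = [(1.0, "red"), (3.0, "blue"), (4.0, "green")]

uniform_votes = vote(neighbors, weighted=False)   # every class gets exactly one vote: a tie
distance_votes = vote(neighbors, weighted=True)   # red: 1, blue: 1/3, green: 1/4
winner = max(distance_votes, key=distance_votes.get)
print(uniform_votes)
print(distance_votes, winner)
```

With uniform weights the three classes are tied, while inverse-distance weights make the closest neighbor's class the clear winner.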

In [53]:
best_method = ""
best_score=0.0
best_k=-1
for method in ["uniform","distance"]:
    for k in range(1,11):
        knn_clf=KNeighborsClassifier(n_neighbors=k,weights=method)
        knn_clf.fit(X_train,y_train)
        score=knn_clf.score(X_test,y_test)
        if score > best_score:
            best_score=score
            best_k=k
            best_method=method
print("best k:",best_k)
print("best score:",best_score)
print("best method:",best_method)

best k: 3
best score: 0.9916434540389972
best method: uniform


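# Searching for p in the Minkowski distance
p is also a hyperparameter. It selects the distance metric: p=1 gives the Manhattan distance, p=2 gives the Euclidean distance, and in general the Minkowski distance is the p-th root of the sum of the p-th powers of the absolute coordinate differences, $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$.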
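
As a quick check of how p changes the metric, here is a small pure-Python sketch of the Minkowski distance. The function name `minkowski_distance` and the two sample points are chosen for illustration only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    # Sum the p-th powers of the absolute differences, then take the p-th root.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a = (0.0, 0.0)
b = (3.0, 4.0)

d1 = minkowski_distance(a, b, 1)  # Manhattan: 3 + 4 = 7
d2 = minkowski_distance(a, b, 2)  # Euclidean: sqrt(9 + 16) = 5
d3 = minkowski_distance(a, b, 3)  # cube root of 27 + 64 = 91
print(d1, d2, d3)
```

As p grows, the largest coordinate difference dominates more and more, so the distance between these two points shrinks from 7 towards 4.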

In [58]:
%%time
best_p=-1
best_score=0.0
best_k=-1
for k in range(1,11):
    for p in range(1,6):
        # p=1 is the Manhattan distance, p=2 the Euclidean distance, p=3,4,5 higher-order Minkowski distances
        knn_clf=KNeighborsClassifier(n_neighbors=k,weights="distance",p=p)
        knn_clf.fit(X_train,y_train)
        score=knn_clf.score(X_test,y_test)
        if score > best_score:
            best_score=score
            best_k=k
            best_p=p
print("best k:",best_k)
print("best score:",best_score)
print("best p:",best_p)

best k: 3
best score: 0.9916434540389972
best p: 2
Wall time: 34.3 s
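
The nested loops used in the searches above all follow the same pattern, so they can be folded into one helper. The following is a rough sketch under stated assumptions, not a replacement for a library routine: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "fit a classifier and return its score" so that the example runs without any data.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() enumerates the full Cartesian grid, like the nested for-loops above.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        # Strict '>' keeps the first combination on ties, as the loops above do.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(n_neighbors, p):
    """Hypothetical score surface that peaks at n_neighbors=3, p=2."""
    return 1.0 - 0.01 * abs(n_neighbors - 3) - 0.02 * abs(p - 2)

best_params, best_score = grid_search(
    {"n_neighbors": range(1, 11), "p": range(1, 6)}, toy_score
)
print(best_params, best_score)
```

In this notebook, `score_fn` would build a KNeighborsClassifier from the given parameters, fit it on X_train and y_train, and return its score on X_test and y_test. scikit-learn also ships GridSearchCV in sklearn.model_selection, which runs this kind of search with cross-validation instead of a single test split.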
