## 05 Hyperparameters

![IMAGE](https://farm2.staticflickr.com/1741/28850048178_0d492ef0a6_o.png)

> What a "parameter tuner" actually tunes is the hyperparameters.

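Where good hyperparameter values come from:
* Domain knowledge
* Empirical defaults, e.g. sklearn's KNN uses k = 5 by default
* Experimental search: grid search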

In [2]:
import numpy as np
from sklearn import datasets

In [3]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

In [5]:
from sklearn.neighbors import KNeighborsClassifier
# n_neighbors is a hyperparameter; sklearn's default is 5, here we set it to 3
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

0.98888888888888893

### 1. Finding the best k

In [8]:
best_score = 0.0
best_k = -1
for k in range(1, 11):  # k = 1..10
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
        
print("best_k =", best_k)
print("best_score =", best_score)

best_k = 4
best_score = 0.991666666667


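If the best k turns out to be 10, the upper edge of the search range, we should search again over values above 10, because the true optimum may lie beyond the range we tried.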
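
The boundary check can be automated. The following is a minimal sketch, not part of the original notebook: the helper `search_k` and its parameters `start`, `width`, and `max_k` are hypothetical names introduced for illustration. It keeps extending the search window as long as the best k sits on the window's upper edge.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def search_k(score_fn, start=1, width=10, max_k=100):
    """Search k window by window (hypothetical helper).

    The window is extended while the best k found so far lies on the
    upper edge of the window that was just searched.
    """
    best_k, best_score = -1, float("-inf")
    lo = start
    while lo <= max_k:
        hi = min(lo + width - 1, max_k)
        for k in range(lo, hi + 1):
            score = score_fn(k)
            if score > best_score:
                best_k, best_score = k, score
        if best_k != hi:  # the optimum is not on the edge, so stop
            break
        lo = hi + 1  # the optimum may lie further out, so keep going
    return best_k, best_score


digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666
)


def knn_score(k):
    # Same scoring as the loop above: fit on the training set, score on the test set
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


best_k, best_score = search_k(knn_score)
print("best_k =", best_k)
print("best_score =", best_score)
```

If the best k falls strictly inside the first window, as it does in the run above, the helper stops after that window and behaves exactly like the plain loop.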

### 2. Weighting by distance
KNN has more hyperparameters than just k (the number of neighbors).
![IMAGE](https://farm2.staticflickr.com/1735/41936018234_d6037dcb02_o.png)
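#### Case 1:
Although there are more blue points, the red point is **closer**!
> So the red point should carry more weight than the blue ones. The smaller the distance, the larger the weight, so the reciprocal of the distance is used as the weight.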

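#### Case 2:
Counting votes alone can end in a tie. Once distance is taken into account, a tie becomes very unlikely, because the weighted sums would have to match exactly.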

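According to the sklearn documentation, passing `weights="distance"` to the constructor makes the classifier weight each neighbor by the inverse of its distance.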
![IMAGE](https://farm2.staticflickr.com/1747/40845927920_1eebfceb4b_o.png)
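
To make the reciprocal weighting concrete, here is a small hand-rolled sketch of a distance-weighted vote. The function `weighted_vote` and the distances are made up for illustration; they are not taken from the figure or from sklearn's internals.

```python
from collections import defaultdict


def weighted_vote(distances, labels):
    """Return the label whose neighbors have the largest total 1/distance weight."""
    totals = defaultdict(float)
    for d, label in zip(distances, labels):
        totals[label] += 1.0 / d  # assumes every distance is > 0
    return max(totals, key=totals.get)


# Hypothetical neighbors: one red point at distance 1, two blue points at 3 and 4
distances = [1.0, 3.0, 4.0]
labels = ["red", "blue", "blue"]

# red: 1/1 = 1.0, blue: 1/3 + 1/4 ≈ 0.583, so red wins despite being outnumbered
print(weighted_vote(distances, labels))  # red
```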

In [8]:
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

best_method = uniform
best_k = 4
best_score = 0.991666666667


In [9]:
sk_knn_clf = KNeighborsClassifier(n_neighbors=4, weights="distance", p=1)
sk_knn_clf.fit(X_train, y_train)
sk_knn_clf.score(X_test, y_test)

0.98333333333333328

### 3. More distance metrics
#### 1 Euclidean distance
#### 2 Manhattan distance
The sum of the absolute differences along each dimension.
![IMAGE](https://farm2.staticflickr.com/1753/42603819682_832019bec0_o.png)
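The green line is the Euclidean distance; the other lines are Manhattan distances.
#### 3 Minkowski distance
* p = 1 gives the Manhattan distance
* p = 2 gives the Euclidean distance
* p = 3 and above give further distance metrics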

> So p is yet another hyperparameter.

![IMAGE](https://farm2.staticflickr.com/1727/27785218777_ca8ab0571e_o.png)
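
As a quick check of how p unifies the metrics, the sketch below computes the Minkowski distance directly from its standard definition, the p-th root of the sum of absolute differences raised to the power p. The function name `minkowski_distance` and the sample points are chosen here for illustration.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between two points: (sum |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski_distance(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```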



### Searching for the Minkowski distance parameter p
![IMAGE](https://farm2.staticflickr.com/1731/42603921362_b1e563c219_o.png)

In [9]:
best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
        
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)

best_k = 3
best_p = 2
best_score = 0.988888888889
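
The nested loops above are a hand-written grid search. As a sketch of the same idea with sklearn's `GridSearchCV`, the block below searches the same ranges of `weights`, `n_neighbors`, and `p`. One difference to keep in mind: `GridSearchCV` picks the best combination by cross-validation on the training set rather than by the test-set score, so its winner may differ from the loops' result. The choice of `cv=3` is an assumption made here, not something from the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666
)

# p only changes the metric, so it is searched once with distance weighting,
# mirroring the loops above
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {
        "weights": ["distance"],
        "n_neighbors": list(range(1, 11)),
        "p": list(range(1, 6)),
    },
]

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid_search.fit(X_train, y_train)

print("best_params =", grid_search.best_params_)
print("best_cv_score =", grid_search.best_score_)
# best_estimator_ is refit on the whole training set with the winning parameters
print("test_score =", grid_search.best_estimator_.score(X_test, y_test))
```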
