### Hyperparameters

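Hyperparameters: parameters that must be chosen before the algorithm runs.

Model parameters: parameters that the algorithm learns from the data during training.

The kNN algorithm has no model parameters; its k value is a typical hyperparameter.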
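To make the distinction concrete, here is a minimal sketch that contrasts the two kinds of parameter. The LinearRegression model and the toy arrays X_demo and y_demo are assumptions added for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter: chosen by us before training and passed to the constructor.
knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_hyperparams = knn_demo.get_params()
print('n_neighbors =', knn_hyperparams['n_neighbors'])

# Model parameters: learned from the data inside fit().
# Toy data (assumed for illustration) generated from y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
lin_demo = LinearRegression().fit(X_demo, y_demo)
print('coef_ =', lin_demo.coef_, 'intercept_ =', lin_demo.intercept_)
```

The linear model ends up with a learned slope and intercept, whereas the kNN classifier only stores the training data; the single value we had to pick ourselves is n_neighbors.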

In [11]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Build the data set and the model
digits = datasets.load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100) # fix the random seed at 100

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)  # predicted labels, one per test sample
y_predict.shape


(360,)

In [12]:
# Compute the model's accuracy (score)
knn_score = knn_clf.score(X_test, y_test)
knn_score

0.9861111111111112
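For a classifier, score returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true one. The sketch below rebuilds the same split and checks this three ways; the variable names manual_accuracy, library_accuracy and model_score are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=100)

knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)

# Fraction of correct predictions, computed by hand.
manual_accuracy = (y_predict == y_test).mean()
# The same quantity from sklearn.metrics.
library_accuracy = accuracy_score(y_test, y_predict)
# score() predicts internally and then computes the accuracy.
model_score = knn_clf.score(X_test, y_test)

print(manual_accuracy, library_accuracy, model_score)
```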

How can we get a higher score? Which value of k is appropriate?

In [13]:
best_score = 0.0
best_k = -1

for k in range(1,11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    knn_score = knn_clf.score(X_test, y_test)
    
    if knn_score > best_score:
        best_score = knn_score
        best_k = k

print('best_score =', best_score)
print('best_k =', best_k)

best_score = 0.9972222222222222
best_k = 3


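A simple sweep over k from 1 to 10 shows that the model scores best at k = 3.

However, KNeighborsClassifier takes many more parameters. Below is its initializer, excerpted from the source code: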
  
    def __init__(self, n_neighbors=5,
                 weights='uniform', algorithm='auto', leaf_size=30,
                 p=2, metric='minkowski', metric_params=None, n_jobs=None,
                 **kwargs):
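  - n_neighbors: the k value
  - weights: how the neighbors' votes are weighted. With 'uniform', distance is ignored and every neighbor counts equally; with 'distance', closer neighbors carry more weight. (Which distance is used depends on the parameters below, since several distance formulas are available.)
  - p: selects the distance. p=1 is the Manhattan distance, p=2 is the Euclidean distance, and other values (p > 2) give the general Minkowski distance
  - metric: the distance metric; the default 'minkowski' means the Minkowski distance is used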
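
The Minkowski distance of order p between two points x and y is (sum over i of |x_i - y_i|^p)^(1/p), which is why p=1 and p=2 reduce to the Manhattan and Euclidean distances. A minimal pure-Python sketch follows; the function name minkowski_distance and the two sample points are hypothetical and are not sklearn's implementation.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError('a and b must have the same length')
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, 1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, 2)  # sqrt(9 + 16)
order_3 = minkowski_distance(point_a, point_b, 3)    # (27 + 64) ** (1/3)

print(manhattan, euclidean, order_3)
```

For the same pair of points the distance shrinks as p grows, so changing p can change which training samples count as the nearest neighbors.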

### First, consider the weights parameter

In [18]:
weight_options = ['uniform', 'distance']  # candidate values to try
best_weight = None
best_score = 0.0
best_k = -1

for weight in weight_options:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=weight)
        knn_clf.fit(X_train, y_train)
        knn_score = knn_clf.score(X_test, y_test)
        if knn_score > best_score:
            best_weight = weight
            best_score = knn_score
            best_k = k
            
print('best_score =', best_score)
print('best_k =', best_k)
print('best_weight =', best_weight)

best_score = 0.9972222222222222
best_k = 3
best_weight = uniform
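
To see what the weights option actually changes, here is a minimal pure-Python sketch of the voting step. The helper knn_vote and the three sample neighbors are hypothetical. sklearn's 'distance' option weights each neighbor by the inverse of its distance, which the sketch imitates while ignoring the zero-distance case.

```python
from collections import defaultdict


def knn_vote(neighbors, weights='uniform'):
    """Return the winning label among (label, distance) pairs."""
    totals = defaultdict(float)
    for label, distance in neighbors:
        if weights == 'uniform':
            totals[label] += 1.0             # every neighbor counts once
        elif weights == 'distance':
            totals[label] += 1.0 / distance  # assumes distance > 0
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(totals, key=totals.get)


# One very close 'A' neighbor and two farther 'B' neighbors.
neighbors = [('A', 0.5), ('B', 2.0), ('B', 2.5)]

uniform_winner = knn_vote(neighbors, 'uniform')
distance_winner = knn_vote(neighbors, 'distance')
print(uniform_winner, distance_winner)
```

With uniform weights the two 'B' neighbors outvote the single 'A'; with distance weights the close 'A' neighbor wins (1/0.5 = 2.0 against 1/2.0 + 1/2.5 = 0.9).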


Note that in this test p keeps its default value of 2 (Euclidean distance). How does the score change if p changes?

First, a quick test:

In [21]:
# Quick test with k=3, weights='distance', p=1
knn_clf = KNeighborsClassifier(n_neighbors=3, weights='distance', p=1)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

0.9888888888888889

The score is about 0.9889, lower than the 0.9972 obtained with weights='uniform' and p=2.

Next, search over the Minkowski distance parameter p:

In [24]:
%%time
best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        knn_score = knn_clf.score(X_test, y_test)
        if knn_score > best_score:
            best_p = p
            best_score = knn_score
            best_k = k
            
print('best_score =', best_score)
print('best_k =', best_k)
print('best_p =', best_p)    
            

best_score = 0.9944444444444445
best_k = 1
best_p = 2
Wall time: 22.4 s


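Searching over p takes noticeably longer (22.4 seconds).

sklearn wraps this kind of parameter search (GridSearchCV in sklearn.model_selection), which trains the model repeatedly to find the best parameters.

A plain hand-written for loop is too inefficient.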
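
As a rough sketch of what that looks like, the nested loops above can be expressed with GridSearchCV. The grid below is deliberately smaller than the one searched by hand and cv=3 is an arbitrary choice, both assumptions made to keep the run short. Note that GridSearchCV scores each candidate by cross-validation on the training set, so its best_score_ is not directly comparable to the test-set scores printed earlier.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=100)

# Each dict is one sub-grid; p is only searched together with 'distance',
# mirroring the loops above. Small value lists are an assumption for speed.
param_grid = [
    {'weights': ['uniform'], 'n_neighbors': [1, 3, 5]},
    {'weights': ['distance'], 'n_neighbors': [1, 3, 5], 'p': [1, 2]},
]

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid_search.fit(X_train, y_train)

print('best_params =', grid_search.best_params_)
print('best CV score =', grid_search.best_score_)
# best_estimator_ is refit on the whole training set with the best parameters.
print('test score =', grid_search.best_estimator_.score(X_test, y_test))
```

Passing n_jobs=-1 to GridSearchCV asks it to evaluate candidates in parallel on all available cores, which is where most of the speed-up over a serial for loop would come from.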