<a href="https://colab.research.google.com/github/brew-brew-com/ML-Prep/blob/main/22_Classification_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>K-Nearest Neighbors (KNN) vs. Logistic Regression</h1>

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
breast_cancer = load_breast_cancer()

X = breast_cancer.data
y = breast_cancer.target

# No random_state is passed, so the split (and the scores printed below) vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

pipe_knn_5 = Pipeline([('scl', StandardScaler()), ('est', KNeighborsClassifier())])  # default n_neighbors=5
pipe_knn_50 = Pipeline([('scl', StandardScaler()), ('est', KNeighborsClassifier(n_neighbors=50))])
pipe_logistic = Pipeline([('scl', StandardScaler()), ('est', LogisticRegression(random_state=1))])

pipe_knn_5.fit(X_train, y_train)
pipe_knn_50.fit(X_train, y_train)
pipe_logistic.fit(X_train, y_train)

# accuracy_score takes the true labels as its first argument and the predicted labels (not probabilities) as its second.

print('KNN(5)_Train:%.3f'% accuracy_score(y_train, pipe_knn_5.predict(X_train)))
print('KNN(5)_Test:%.3f' % accuracy_score(y_test, pipe_knn_5.predict(X_test)))
print('KNN(50)_Train:%.3f'% accuracy_score(y_train, pipe_knn_50.predict(X_train)))
print('KNN(50)_Test:%.3f' % accuracy_score(y_test, pipe_knn_50.predict(X_test)))
print('Logistic_Train:%.3f'% accuracy_score(y_train, pipe_logistic.predict(X_train)))
print('Logistic_Test:%.3f' % accuracy_score(y_test, pipe_logistic.predict(X_test)))

# The calls below return the same values as accuracy_score above
# print('%.3f'% pipe_knn_5.score(X_train, y_train)) 
# print('%.3f'% pipe_knn_5.score(X_test, y_test)) 
# print('%.3f'% pipe_knn_50.score(X_train, y_train)) 
# print('%.3f'% pipe_knn_50.score(X_test, y_test)) 
# print('%.3f'% pipe_logistic.score(X_train, y_train)) 
# print('%.3f'% pipe_logistic.score(X_test, y_test)) 

KNN(5)_Train:0.980
KNN(5)_Test:0.939
KNN(50)_Train:0.956
KNN(50)_Test:0.921
Logistic_Train:0.993
Logistic_Test:0.947


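Increasing k for KNN from 5 to 50 lowered generalization performance in this run (test accuracy fell from 0.939 to 0.921). With that many neighbors voting, each prediction is pulled toward the average of a large region of the data. Because the predictions are this conservative, the train and test scores for k=50 also sit close together (0.956 vs. 0.921).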
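
The effect of k can be checked over a wider range than just 5 and 50. The following is a minimal sketch, not part of the original notebook: the list of k values, the `results` dictionary, and `random_state=0` (added only so the run is repeatable) are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# random_state=0 is an assumption for repeatability; the notebook itself uses a random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

results = {}  # k -> (train accuracy, test accuracy)
for k in (1, 5, 15, 50, 150):
    pipe = Pipeline([
        ('scl', StandardScaler()),
        ('est', KNeighborsClassifier(n_neighbors=k)),
    ])
    pipe.fit(X_train, y_train)
    # Pipeline.score returns accuracy for classifiers.
    results[k] = (pipe.score(X_train, y_train), pipe.score(X_test, y_test))
    print('k=%3d  Train:%.3f  Test:%.3f' % (k, results[k][0], results[k][1]))
```

With k=1 every training point is its own nearest neighbor, so the train score is typically perfect. Larger k smooths the decision boundary, and whether that helps or hurts the test score depends on the split.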

Bonus: predict vs. predict_proba

In [None]:
# predict returns the predicted labels (0 or 1)

print(pipe_knn_5.predict(X_train)[:30])
print(pipe_knn_50.predict(X_train)[:30])
print(pipe_logistic.predict(X_train)[:30])

[1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0]
[1 0 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0]
[1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0]


In [None]:
# predict_proba returns the predicted probability of each class (column 0: class 0, column 1: class 1)

print(pipe_knn_5.predict_proba(X_train)[:10])
print(pipe_knn_50.predict_proba(X_train)[:10])
print(pipe_logistic.predict_proba(X_train)[:10])

[[0.  1. ]
 [1.  0. ]
 [0.  1. ]
 [0.  1. ]
 [1.  0. ]
 [0.4 0.6]
 [1.  0. ]
 [1.  0. ]
 [0.2 0.8]
 [0.6 0.4]]
[[0.1  0.9 ]
 [1.   0.  ]
 [0.04 0.96]
 [0.16 0.84]
 [0.76 0.24]
 [0.2  0.8 ]
 [1.   0.  ]
 [1.   0.  ]
 [0.32 0.68]
 [0.26 0.74]]
[[5.27350520e-02 9.47264948e-01]
 [9.99999927e-01 7.30821532e-08]
 [6.82513609e-02 9.31748639e-01]
 [2.46642036e-01 7.53357964e-01]
 [9.90513865e-01 9.48613535e-03]
 [3.59207075e-01 6.40792925e-01]
 [9.99999996e-01 4.17722838e-09]
 [1.00000000e+00 1.44452033e-14]
 [3.53011543e-03 9.96469885e-01]
 [9.34423580e-01 6.55764199e-02]]
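
The outputs above suggest two relationships that can be checked directly. First, with the default uniform weights, KNN probabilities are just the fraction of the k neighbors in each class, which is why k=5 only produces multiples of 0.2. Second, the label from `predict` is the class with the highest `predict_proba`. The sketch below is illustrative only: `random_state=0` and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an assumption so that the check is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

knn = Pipeline([('scl', StandardScaler()),
                ('est', KNeighborsClassifier(n_neighbors=5))])
logistic = Pipeline([('scl', StandardScaler()),
                     ('est', LogisticRegression(random_state=1))])
knn.fit(X_train, y_train)
logistic.fit(X_train, y_train)

proba_knn = knn.predict_proba(X_test)
proba_log = logistic.predict_proba(X_test)

# KNN (uniform weights): probability * k should be a whole number of neighbor votes.
votes = proba_knn * 5
knn_is_vote_fraction = np.allclose(votes, np.round(votes))

# Each row of predict_proba sums to 1 across the two classes.
rows_sum_to_one = np.allclose(proba_log.sum(axis=1), 1.0)

# predict corresponds to the class with the largest predicted probability.
classes = logistic.named_steps['est'].classes_
labels_from_proba = classes[np.argmax(proba_log, axis=1)]
matches_predict = np.array_equal(labels_from_proba, logistic.predict(X_test))

print(knn_is_vote_fraction, rows_sum_to_one, matches_predict)
```

Logistic regression, by contrast, outputs a continuous probability from the sigmoid of a linear score, which is why its values above are not restricted to a fixed grid.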
