In [2]:
%run ./global.ipynb

# KNN Classification

[Reference](https://realpython.com/knn-python/#use-knn-to-predict-the-age-of-sea-slugs)

## K = 3

In [3]:
%%time

from sklearn.neighbors import KNeighborsClassifier
k = 3
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)

CPU times: user 539 µs, sys: 886 µs, total: 1.43 ms
Wall time: 1.32 ms
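
To make the `n_neighbors=3` setting concrete, here is a minimal sketch of the majority vote the classifier performs at prediction time. The toy arrays `X_toy` and `y_toy` are hypothetical and are not the notebook's `X_train` / `y_train` (those come from `global.ipynb`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: two well-separated groups
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Look up the 3 nearest training points for a query and vote by hand
query = np.array([[3.0]])
distances, indices = toy_model.kneighbors(query)
neighbor_labels = y_toy[indices[0]]
manual_vote = int(np.bincount(neighbor_labels).argmax())

# The library prediction should agree with the manual vote
prediction = int(toy_model.predict(query)[0])
```

Fitting is nearly instantaneous because the model only stores (and possibly indexes) the training data; the distance computations happen when `predict` is called.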


## GridSearchCV

With `n_neighbors` as the only searched parameter.

In [5]:
%%time

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
parameters = {
    "n_neighbors": range(1, 50),
}
model = GridSearchCV(KNeighborsClassifier(), parameters)
_ = model.fit(X_train, y_train)
model.best_params_

CPU times: user 676 ms, sys: 1.3 ms, total: 677 ms
Wall time: 688 ms


{'n_neighbors': 9}
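
Beyond `best_params_`, the fitted search object keeps the mean cross-validated score of every candidate in `cv_results_`. The sketch below shows how the best `k` can be recovered from those scores. It runs on synthetic data from `make_classification`, which is an assumption standing in for the notebook's training set, and uses a smaller range of `k` to stay quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the notebook's X_train / y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 20)})
search.fit(X_demo, y_demo)

# One mean validation score per candidate, in the same order as "params"
mean_scores = search.cv_results_["mean_test_score"]
ks = [params["n_neighbors"] for params in search.cv_results_["params"]]

# The candidate with the highest mean score is the one reported in best_params_
best_k = ks[int(np.argmax(mean_scores))]
```

Plotting `mean_scores` against `ks` is a common way to check whether the chosen `k` sits on a broad plateau or a narrow peak.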

## GridSearchCV

With `n_neighbors` and `weights` as the searched parameters.

In [8]:
%%time

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
parameters = {
    "n_neighbors": range(1, 50),
    "weights": ["uniform", "distance"],
}
model = GridSearchCV(KNeighborsClassifier(), parameters)
_ = model.fit(X_train, y_train)
model.best_params_

CPU times: user 1 s, sys: 5.37 ms, total: 1.01 s
Wall time: 1.01 s


{'n_neighbors': 9, 'weights': 'uniform'}
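
The two `weights` options differ in how neighbors vote: `uniform` counts every neighbor equally, while `distance` weights each neighbor by the inverse of its distance. A small sketch with hypothetical toy data (not the notebook's data) where the two settings disagree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one very close point of class 1,
# two slightly farther points of class 0, and one far point of class 1
X_toy = np.array([[0.1], [1.0], [1.1], [5.0]])
y_toy = np.array([1, 0, 0, 1])
query = np.array([[0.0]])

uniform_model = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_model = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform: two votes for class 0 beat one vote for class 1
uniform_pred = int(uniform_model.predict(query)[0])

# Distance: weight 1/0.1 for class 1 beats 1/1.0 + 1/1.1 for class 0
distance_pred = int(distance_model.predict(query)[0])
```

When a search settles on `uniform`, as it did here, it suggests that closeness within the neighborhood adds little information for this dataset.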

## Bagging with GridSearchCV

GridSearchCV first searches over `n_neighbors` and `weights`; the best parameters are then used for the base estimator of a `BaggingClassifier`.

In [12]:
%%time

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
parameters = {
    "n_neighbors": range(1, 50),
    "weights": ["uniform", "distance"],
}
gscv = GridSearchCV(KNeighborsClassifier(), parameters)
_ = gscv.fit(X_train, y_train)
best_params = gscv.best_params_
best_params

bagged_model = KNeighborsClassifier(**best_params)

from sklearn.ensemble import BaggingClassifier
model = BaggingClassifier(bagged_model, n_estimators=100)
model.fit(X_train, y_train)

CPU times: user 1.07 s, sys: 7.16 ms, total: 1.08 s
Wall time: 1.1 s
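
Bagging trains each estimator on a random bootstrap sample, so results can vary from run to run unless a seed is fixed. The sketch below illustrates this with `random_state`; the synthetic data, the helper name `fit_bagged_knn`, and the smaller `n_estimators=10` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the notebook's X_train / y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_bagged_knn(seed):
    """Fit a bagged KNN ensemble (hypothetical helper) with a fixed seed."""
    base = KNeighborsClassifier(n_neighbors=9, weights="uniform")
    bagging = BaggingClassifier(base, n_estimators=10, random_state=seed)
    return bagging.fit(X_demo, y_demo)

# Two fits with the same seed draw the same bootstrap samples
bag_a = fit_bagged_knn(seed=0)
bag_b = fit_bagged_knn(seed=0)
preds_a = bag_a.predict(X_demo)
preds_b = bag_b.predict(X_demo)
same_predictions = bool((preds_a == preds_b).all())
```

Without a fixed seed, small accuracy differences between a bagged model and a single KNN model may partly reflect sampling noise.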


In [13]:
%%time

# Predict on both splits and measure accuracy as the share of matching labels
import numpy as np

train_preds = model.predict(X_train)
train_matches = train_preds == y_train
# count_nonzero on a boolean array counts the True entries directly
train_match_cnt = np.count_nonzero(train_matches)
train_cnt = len(train_matches)
train_accuracy = train_match_cnt / train_cnt

test_preds = model.predict(X_test)
test_matches = test_preds == y_test
test_match_cnt = np.count_nonzero(test_matches)
test_cnt = len(test_matches)
test_accuracy = test_match_cnt / test_cnt

# Only the last expression of a cell is displayed, so the train string is built but not shown
f"Train accuracy: {train_match_cnt}/{train_cnt} ({round(train_accuracy, 4) * 100} %)"
f"Test accuracy: {test_match_cnt}/{test_cnt} ({round(test_accuracy, 4) * 100} %)"

CPU times: user 190 ms, sys: 1.46 ms, total: 191 ms
Wall time: 190 ms


'Test accuracy: 107/114 (93.86 %)'
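
The manual match counting above is equivalent to `sklearn.metrics.accuracy_score`. A minimal sketch with hypothetical label arrays (not the notebook's predictions) showing that both give the same value:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Manual accuracy: fraction of positions where prediction equals label
matches = y_pred == y_true
manual_accuracy = np.count_nonzero(matches) / len(matches)

# Library accuracy: same quantity computed by scikit-learn
library_accuracy = accuracy_score(y_true, y_pred)

# The ".2%" format spec multiplies by 100 and rounds in one step
summary = f"{np.count_nonzero(matches)}/{len(matches)} ({library_accuracy:.2%})"
```

For a fitted classifier, `model.score(X, y)` returns this same mean accuracy without a separate `predict` call.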

## Comparison

GS = grid search.

GridSearchCV over both `n_neighbors` and `weights` selected the default `weights` value (`uniform`) as best, so the two grid-search columns report the same accuracy.

|ACCURACY (%)|k=3  |GS - k|GS - k, w|bagging|
|:---:       |:---:|:---: |:---:    |:---:  |
|**train**   |94.29|93.85 |93.85    |94.07  |
|**test**    |93.98|93.86 |93.86    |94.74  |

|CPU TIMES (ms)|k=3  |GS - k|GS - k, w|bagging|
|:---:         |:---:|:---: |:---:    |:---:  |
|**train**     |1.43 |677   |1010     |1080   |
|**test**      |17.3 |16.1  |12.6     |191    |

|WALL TIMES (ms)|k=3  |GS - k|GS - k, w|bagging|
|:---:          |:---:|:---: |:---:    |:---:  |
|**train**      |1.32 |688   |1010     |1100   |
|**test**       |16   |14.8  |11.2     |190    |