# KNN: K-nearest neighbors

In this case we work with ***irregular data***: the dataset was not prepared beforehand as a clean CSV file with no missing or invalid values, so it needs some preprocessing before use.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # needed for sklearn.model_selection.train_test_split below
from sklearn import linear_model, preprocessing
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import os

## Read the data

In [2]:
dir_here = os.path.abspath("")
dir_base = os.path.dirname(dir_here)
dir_data = os.path.join(dir_base, "data")
path_data_csv = os.path.join(dir_data, "car.data")
print(path_data_csv)

/home/dancab/git/TWT-ML-Tutorial/data/car.data


In [3]:
names = ["buying", "maint", "door", "persons", "lug_boot", "safety", "class"]
data = pd.read_csv(path_data_csv, header=None, names=names)
print(data.head())

  buying  maint door persons lug_boot safety  class
0  vhigh  vhigh    2       2    small    low  unacc
1  vhigh  vhigh    2       2    small    med  unacc
2  vhigh  vhigh    2       2    small   high  unacc
3  vhigh  vhigh    2       2      med    low  unacc
4  vhigh  vhigh    2       2      med    med  unacc


## Encode the data

The dataset contains non-numerical fields.  
For example, 'safety' can be 'low', 'med' or 'high'.

KNN works with distances between points, so these values must first be converted (*preprocessed*) to numbers.  
For instance, 'low'=0, 'med'=1, 'high'=2.

For that, sklearn offers the `preprocessing` module, which we already imported.  
Alternatively, we could write the mapping ourselves, which would not be hard either.
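
As a minimal sketch of the "do it ourselves" route, the mapping can be written as plain dictionaries and applied with pandas. The toy DataFrame and the names `safety_order` and `lug_boot_order` are made up for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for two columns of car.data (rows are illustrative).
toy = pd.DataFrame(
    {
        "safety": ["low", "med", "high", "med"],
        "lug_boot": ["small", "big", "med", "small"],
    }
)

# Hand-written mappings that keep the natural order of each field.
safety_order = {"low": 0, "med": 1, "high": 2}
lug_boot_order = {"small": 0, "med": 1, "big": 2}

encoded = pd.DataFrame(
    {
        "safety": toy["safety"].map(safety_order),
        "lug_boot": toy["lug_boot"].map(lug_boot_order),
    }
)
print(encoded)
```

Writing the mapping by hand lets us choose the numeric order ourselves, at the cost of maintaining one dictionary per column.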

In [4]:
# Create a LabelEncoder object
le = preprocessing.LabelEncoder()
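# fit_transform accepts any 1-D array-like (list, pd.Series, np.ndarray),
# so the list() conversion below is optional.
# The same encoder is refitted for each column; after the last call it holds the 'class' labels.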
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
car_class = le.fit_transform(list(data["class"]))
# Create a new DataFrame to inspect the result of this transformation
df_le = pd.DataFrame(
    {
        'buying': buying,
        'maint': maint,
        'door': door,
        'persons': persons,
        'lug_boot': lug_boot,
        'safety': safety,
        'class': car_class
    }
)
print(df_le.head())

   buying  maint  door  persons  lug_boot  safety  class
0       3      3     0        0         2       1      2
1       3      3     0        0         2       2      2
2       3      3     0        0         2       0      2
3       3      3     0        0         1       1      2
4       3      3     0        0         1       2      2


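The values were converted to integer codes.  
`LabelEncoder` sorts the labels before numbering them, so the order is alphabetical.  
That's why, in the 'safety' field, 'high'=0, 'low'=1 and 'med'=2.  
Note that this does not keep the natural order low < med < high, which can matter for a distance-based model like KNN.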
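
To check the ordering directly, a fitted encoder exposes its sorted labels in the `classes_` attribute. A small sketch on a made-up list of values (`le_demo` is a hypothetical name, separate from the notebook's `le`):

```python
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()
codes = le_demo.fit_transform(["low", "med", "high", "low"])

# classes_ holds the sorted unique labels; a label's code is its position in it.
print(list(le_demo.classes_))
print(codes.tolist())
```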

### Split data between train and test sets

In [5]:
# Prepare lists of x and y
x = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(car_class)
# Split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
# Check it out: print up to the first 30 training samples
n = min(30, len(x_train))
for k in range(n):
    x_str = " ; ".join(f"{val:2d}" for val in x_train[k])
    print(x_str + "  |  " + f"{y_train[k]:2d}")


 1 ;  0 ;  1 ;  0 ;  2 ;  1  |   2
 0 ;  1 ;  3 ;  2 ;  1 ;  0  |   0
 3 ;  1 ;  3 ;  1 ;  0 ;  2  |   0
 0 ;  3 ;  2 ;  2 ;  2 ;  1  |   2
 1 ;  0 ;  0 ;  2 ;  1 ;  2  |   0
 3 ;  2 ;  3 ;  2 ;  2 ;  2  |   2
 3 ;  1 ;  1 ;  2 ;  0 ;  0  |   0
 0 ;  0 ;  0 ;  1 ;  2 ;  0  |   0
 1 ;  3 ;  1 ;  2 ;  1 ;  2  |   0
 0 ;  0 ;  2 ;  1 ;  2 ;  2  |   2
 1 ;  3 ;  3 ;  0 ;  0 ;  1  |   2
 1 ;  3 ;  1 ;  2 ;  2 ;  2  |   2
 0 ;  1 ;  3 ;  2 ;  0 ;  2  |   0
 0 ;  2 ;  3 ;  2 ;  2 ;  1  |   2
 3 ;  3 ;  1 ;  0 ;  2 ;  2  |   2
 3 ;  0 ;  3 ;  1 ;  1 ;  0  |   2
 1 ;  1 ;  0 ;  0 ;  0 ;  1  |   2
 0 ;  3 ;  1 ;  2 ;  1 ;  0  |   2
 1 ;  1 ;  2 ;  1 ;  0 ;  2  |   1
 1 ;  0 ;  2 ;  0 ;  2 ;  0  |   2
 2 ;  1 ;  0 ;  1 ;  0 ;  2  |   1
 1 ;  0 ;  1 ;  1 ;  0 ;  2  |   0
 0 ;  3 ;  0 ;  1 ;  2 ;  0  |   2
 1 ;  0 ;  0 ;  2 ;  1 ;  1  |   2
 1 ;  1 ;  2 ;  1 ;  1 ;  2  |   1
 2 ;  3 ;  1 ;  2 ;  2 ;  2  |   2
 0 ;  0 ;  1 ;  0 ;  1 ;  0  |   2
 3 ;  3 ;  0 ;  0 ;  0 ;  2  |   2
 1 ;  1 ;  2 ;  0 ; 
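
The split above is random, so every run gives different samples and slightly different accuracies. As a hedged sketch on synthetic stand-in data (not car.data), `train_test_split` also accepts `random_state` to make the split reproducible and `stratify` to keep the class proportions in both sets:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with imbalanced labels (80 zeros, 20 ones).
x_demo = [(i % 4, i % 3) for i in range(100)]
y_demo = [0] * 80 + [1] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo,
    y_demo,
    test_size=0.1,
    random_state=0,   # fixed seed: the same split on every run
    stratify=y_demo,  # keep the 80/20 label ratio in train and test
)
print(len(x_tr), len(x_te))
print(Counter(y_te))
```

Whether stratifying is worthwhile here depends on how unbalanced the car classes are, so treat it as an option to try rather than a required step.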

## Create the KNN Model

In [6]:
model = KNeighborsClassifier(n_neighbors=5)

`n_neighbors` is k, the number of nearest neighbors to look at.

It is a ***hyperparameter***:

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.  
By contrast, the values of other parameters (typically node weights) are derived via training.  
*Source: [Wikipedia's entry for hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))*

Here we set it to 5, but it is worth trying other values and comparing the results.
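
Trying several values can be done in a loop instead of one cell per value. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the list of candidate values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the sketch runs without car.data.
x_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=0
)

# Fit one model per candidate k and record its accuracy on the test split.
scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_tr, y_tr)
    scores[k] = knn.score(x_te, y_te)

for k, acc in scores.items():
    print(f"n_neighbors={k:2d}  accuracy={acc:.3f}")
```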

### Fit the model

In [7]:
# Let's do a single fit and check the accuracy
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.8959537572254336


In [8]:
# Let's see with a changed n_neighbors
model = KNeighborsClassifier(n_neighbors=7)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9132947976878613


In [9]:
# Let's see with a changed n_neighbors
model = KNeighborsClassifier(n_neighbors=9)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9075144508670521


On this split, 7 gave the best accuracy of the three values tried, although the differences are small. Let's run it again:

In [10]:
# Refit with n_neighbors=7, the best of the values tried above
model = KNeighborsClassifier(n_neighbors=7)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9132947976878613
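
A single 10% test split gives a noisy estimate, so a small gap between two values of k may not mean much. One common way to get a steadier comparison is cross-validation. The sketch below uses `cross_val_score` on synthetic data; the choice of 5 folds and of the candidate values is an assumption, not something taken from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the sketch runs without car.data.
x_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Average the accuracy over 5 folds for each candidate k.
cv_means = {}
for k in (5, 7, 9):
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), x_demo, y_demo, cv=5
    )
    cv_means[k] = fold_scores.mean()
    print(f"n_neighbors={k}: mean accuracy {cv_means[k]:.3f} over {len(fold_scores)} folds")

best_k = max(cv_means, key=cv_means.get)
print("Best k by cross-validation:", best_k)
```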


### Predict

In [11]:
predicted = model.predict(x_test)
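# LabelEncoder numbers the labels alphabetically: 0='acc', 1='good', 2='unacc', 3='vgood'.
# Tim's tutorial lists them as ["unacc", "acc", "good", "vgood"], which does not match these codes.
class_names = ["acc", "good", "unacc", "vgood"]  # not used below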
for k in range(len(predicted)):
    print("Predicted: ", predicted[k], "Data: ", x_test[k], "Actual: ", y_test[k])

Predicted:  2 Data:  (2, 2, 0, 2, 2, 1) Actual:  2
Predicted:  2 Data:  (1, 0, 0, 0, 1, 0) Actual:  2
Predicted:  2 Data:  (3, 0, 3, 2, 2, 2) Actual:  2
Predicted:  2 Data:  (3, 0, 2, 1, 2, 2) Actual:  2
Predicted:  2 Data:  (1, 0, 1, 0, 2, 2) Actual:  2
Predicted:  2 Data:  (2, 2, 0, 1, 2, 1) Actual:  2
Predicted:  0 Data:  (1, 3, 2, 2, 2, 0) Actual:  0
Predicted:  2 Data:  (0, 1, 2, 2, 1, 1) Actual:  2
Predicted:  0 Data:  (2, 3, 1, 2, 1, 0) Actual:  0
Predicted:  1 Data:  (1, 1, 3, 2, 2, 0) Actual:  1
Predicted:  0 Data:  (2, 1, 1, 1, 2, 2) Actual:  0
Predicted:  2 Data:  (2, 3, 0, 1, 2, 0) Actual:  0
Predicted:  2 Data:  (1, 3, 2, 0, 2, 1) Actual:  2
Predicted:  2 Data:  (2, 1, 1, 0, 2, 2) Actual:  2
Predicted:  2 Data:  (2, 3, 0, 0, 0, 2) Actual:  2
Predicted:  2 Data:  (2, 2, 1, 0, 1, 1) Actual:  2
Predicted:  2 Data:  (3, 3, 0, 0, 1, 0) Actual:  2
Predicted:  2 Data:  (2, 1, 1, 2, 2, 1) Actual:  2
Predicted:  2 Data:  (2, 0, 3, 0, 2, 1) Actual:  2
Predicted:  0 Data:  (3, 2, 2, 
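
The integer codes are hard to read. One way to get the original labels back is to keep a dedicated encoder for the target column and call `inverse_transform` on the predictions. A minimal sketch, where `le_class` is a hypothetical name and the predicted codes are made up rather than taken from the model:

```python
from sklearn import preprocessing

# A dedicated encoder for the target column, fitted on the four class labels.
le_class = preprocessing.LabelEncoder()
y_codes = le_class.fit_transform(["unacc", "acc", "vgood", "good", "unacc"])

# Pretend these integer codes came out of model.predict(...).
predicted_codes = [2, 0, 2, 1, 3]
predicted_labels = le_class.inverse_transform(predicted_codes)

print(y_codes.tolist())
print(list(predicted_labels))
```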

### Print k-nearest neighbors of each point

In [12]:
for k in range(len(predicted)):
    # Distances to, and x_train indexes of, the 7 nearest training points
    n = model.kneighbors([x_test[k]], n_neighbors=7, return_distance=True)
    print("---")
    print(f"{k:03d}")
    print("Distances: ", n[0].tolist())
    print("Indexes: ", n[1].tolist())

---
000
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[141, 207, 1142, 916, 528, 1279, 857]]
---
001
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[488, 216, 1315, 894, 597, 714, 320]]
---
002
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.4142135623730951, 1.4142135623730951]]
Indexes:  [[1087, 1052, 297, 1222, 39, 928, 135]]
---
003
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[39, 630, 1187, 1186, 108, 1087, 72]]
---
004
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[1521, 1102, 0, 694, 198, 1139, 678]]
---
005
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[1538, 1277, 945, 293, 1381, 1096, 1051]]
---
006
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.4142135623730951, 1.4142135623730951]]
Indexes:  [[842, 229, 1133, 1165, 1280, 1074, 690]]
---
007
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[758, 577, 1232, 830, 326, 741, 109]]
---
008
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[924, 
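
The indexes returned by `kneighbors` refer to positions in the training data, so they can be used to look up the neighbors' labels. With the default uniform weights, the prediction is the majority label among them. The sketch below shows this on a tiny hand-made dataset (not car.data):

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny hand-made training set: two clusters with labels 0 and 1.
x_tr = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_tr = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_tr, y_tr)

query = [[1, 1]]
distances, indexes = knn.kneighbors(query, n_neighbors=3, return_distance=True)

# Look up the labels of the neighbors and take the most common one.
neighbor_labels = y_tr[indexes[0]]
vote = Counter(neighbor_labels.tolist()).most_common(1)[0][0]

print("Distances:", distances[0].tolist())
print("Neighbor labels:", neighbor_labels.tolist())
print("Majority vote:", vote, "| model.predict:", knn.predict(query)[0])
```

This also explains the many distances of exactly 1.0 in the output above: with integer-coded features, two samples that differ by one step in a single field are at Euclidean distance 1.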