# KNN: K-nearest neighbors

In this case we work with ***irregular data***: the dataset was not prepared in advance as a clean csv file free of missing or wrong values.

In [10]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # makes sklearn.model_selection.train_test_split available below
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
import os

## Read the data

In [11]:
dir_here = os.path.abspath("")
dir_base = os.path.dirname(dir_here)
dir_data = os.path.join(dir_base, "data")
path_data_csv = os.path.join(dir_data, "car.data")
print(path_data_csv)

/home/dancab/git/TWT-ML-Tutorial/data/car.data


In [12]:
names = ["buying", "maint", "door", "persons", "lug_boot", "safety", "class"]
data = pd.read_csv(path_data_csv, header=None, names=names)
print(data.head())

  buying  maint door persons lug_boot safety  class
0  vhigh  vhigh    2       2    small    low  unacc
1  vhigh  vhigh    2       2    small    med  unacc
2  vhigh  vhigh    2       2    small   high  unacc
3  vhigh  vhigh    2       2      med    low  unacc
4  vhigh  vhigh    2       2      med    med  unacc


## Encode the data

The dataset contains non-numerical fields.  
For example, 'safety' can be 'low', 'med' or 'high'.

KNN works with distances between points, so we first convert (*preprocess*) these fields to numerical values.  
For instance, 'low'=0, 'med'=1, 'high'=2.

To do that, sklearn offers the `preprocessing` module that we already imported.  
Alternatively, we could write the mapping ourselves, which would not be hard either.
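
As a minimal sketch of the do-it-yourself alternative, the mapping can be written as a plain dictionary and applied with pandas. The names `safety_order`, `sample` and `encoded` are made up for this illustration, and `sample` is a tiny stand-in for the real `data` DataFrame.

```python
import pandas as pd

# Explicit mapping, using the codes suggested in the text above
safety_order = {"low": 0, "med": 1, "high": 2}

# Tiny stand-in for the real `data` DataFrame (hypothetical rows)
sample = pd.DataFrame({"safety": ["low", "med", "high", "low"]})

# Series.map replaces each label with its code from the dictionary
encoded = sample["safety"].map(safety_order)
print(encoded.tolist())
```

An explicit mapping lets us keep the natural order 'low' < 'med' < 'high', which is useful for a distance-based model such as KNN.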

In [14]:
# Create a LabelEncoder object
le = preprocessing.LabelEncoder()
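# fit_transform accepts any 1-D array-like (lists, pd.Series, np.arrays), so list() is optional.
# Each call refits the encoder, so `le` only remembers the last column it saw.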
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
car_class = le.fit_transform(list(data["class"]))
# Create a new DataFrame to visualize this transformation better
df_le = pd.DataFrame(
    {
        'buying': buying,
        'maint': maint,
        'door': door,
        'persons': persons,
        'lug_boot': lug_boot,
        'safety': safety,
        'class': car_class
    }
)
print(df_le.head())

   buying  maint  door  persons  lug_boot  safety  class
0       3      3     0        0         2       1      2
1       3      3     0        0         2       2      2
2       3      3     0        0         2       0      2
3       3      3     0        0         1       1      2
4       3      3     0        0         1       2      2


The values were converted to numeric categories.  
`LabelEncoder` assigns the codes in sorted (alphabetical) order of the labels.  
That's why in the 'safety' field 'high'=0, 'low'=1 and 'med'=2.
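
The ordering can be checked directly on a fitted encoder. The sketch below uses a separate encoder for one column and a short hypothetical list of values, not the full dataset. `classes_` and `inverse_transform` are part of the `LabelEncoder` API.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'safety' column
safety_values = ["low", "med", "high", "low"]

safety_encoder = LabelEncoder()
codes = safety_encoder.fit_transform(safety_values)

# classes_ holds the sorted labels: the label at position i is encoded as i
print(list(safety_encoder.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels
print(list(safety_encoder.inverse_transform([0, 1, 2])))
```

Keeping one encoder per column, instead of reusing a single `le`, makes it possible to decode any column later.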

### Split the data into train and test sets

In [22]:
# Prepare lists of x and y
x = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(car_class)
# Split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
# Print the first rows (at most 30) of the training set
n = min(30, len(x_train))
for k in range(n):
    x_str = " ; ".join(f"{val:2d}" for val in x_train[k])
    print(x_str + "  |  " + f"{y_train[k]:2d}")


 3 ;  3 ;  1 ;  1 ;  1 ;  0  |   2
 2 ;  3 ;  3 ;  2 ;  0 ;  1  |   2
 0 ;  2 ;  2 ;  2 ;  0 ;  1  |   2
 2 ;  0 ;  0 ;  0 ;  0 ;  0  |   2
 1 ;  0 ;  0 ;  1 ;  1 ;  1  |   2
 3 ;  2 ;  2 ;  0 ;  1 ;  0  |   2
 1 ;  3 ;  2 ;  0 ;  1 ;  0  |   2
 2 ;  3 ;  2 ;  2 ;  1 ;  2  |   0
 1 ;  0 ;  3 ;  0 ;  0 ;  0  |   2
 3 ;  0 ;  2 ;  1 ;  1 ;  2  |   2
 2 ;  0 ;  0 ;  0 ;  0 ;  1  |   2
 0 ;  0 ;  1 ;  0 ;  0 ;  2  |   2
 2 ;  0 ;  1 ;  2 ;  0 ;  1  |   2
 0 ;  1 ;  3 ;  2 ;  2 ;  2  |   2
 1 ;  3 ;  3 ;  2 ;  2 ;  2  |   2
 1 ;  0 ;  2 ;  2 ;  0 ;  2  |   0
 0 ;  1 ;  0 ;  2 ;  0 ;  0  |   0
 0 ;  0 ;  1 ;  1 ;  0 ;  0  |   0
 1 ;  2 ;  3 ;  0 ;  0 ;  0  |   2
 1 ;  3 ;  0 ;  1 ;  1 ;  2  |   2
 1 ;  3 ;  2 ;  1 ;  0 ;  1  |   2
 3 ;  1 ;  0 ;  2 ;  0 ;  0  |   0
 0 ;  3 ;  3 ;  0 ;  2 ;  2  |   2
 3 ;  1 ;  2 ;  0 ;  1 ;  2  |   2
 3 ;  1 ;  2 ;  2 ;  2 ;  1  |   2
 0 ;  0 ;  3 ;  2 ;  2 ;  1  |   2
 3 ;  3 ;  3 ;  0 ;  0 ;  2  |   2
 2 ;  0 ;  0 ;  2 ;  0 ;  1  |   2
 2 ;  2 ;  3 ;  1 ; 
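
Because `train_test_split` shuffles randomly, every run of the cell above produces a different split, and any accuracy measured later shifts with it. The following is a hedged sketch of a reproducible, class-balanced split. `x_demo` and `y_demo` are toy stand-ins for the encoded features and labels, and `random_state=42` is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the encoded features and labels (hypothetical data)
x_demo = [(i % 4, i % 3) for i in range(100)]
y_demo = [0] * 70 + [1] * 30

# random_state fixes the shuffle; stratify keeps the class proportions
# roughly equal in the train and test sets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)
print(len(x_tr), len(x_te))
```

With a fixed `random_state`, rerunning the notebook gives the same rows in each set, which makes results comparable between runs.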

## Create the KNN Model

In [23]:
model = KNeighborsClassifier(n_neighbors=5)

`n_neighbors` is k, the number of nearest neighbors used to classify each point.

It is a ***hyperparameter***:

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.  
By contrast, the values of other parameters (typically node weights) are derived via training.  
*Source: [Wikipedia's entry for hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))*

Here we set it to 5, but we should try other values and compare the results.
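
To make concrete what k controls, here is a minimal pure-Python sketch of the KNN voting rule. It is an illustration under simple assumptions (Euclidean distance, plain majority vote), not scikit-learn's implementation. The function `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Order the training indexes by Euclidean distance to the query point
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(nearest_labels).most_common(1)[0][0]


# Two small clusters of toy points, labelled "a" and "b"
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1, 1), k=3))
print(knn_predict(points, labels, (4, 4), k=3))
```

A small k follows individual neighbors closely, while a large k averages over more of the training set. That trade-off is why the value is worth tuning.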

### Fit the model

In [24]:
# Let's do a single fit and check the accuracy
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.930635838150289


In [25]:
# Let's see with a changed n_neighbors
model = KNeighborsClassifier(n_neighbors=7)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9653179190751445


In [26]:
# Let's see with a changed n_neighbors
model = KNeighborsClassifier(n_neighbors=9)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.930635838150289


With this split, 7 gave the best accuracy of the three values, so let's refit with it and use that model from here on.

In [27]:
# Refit with the best n_neighbors found above
model = KNeighborsClassifier(n_neighbors=7)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9653179190751445
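
The comparison above relies on a single random 10% test split, so a difference of a few percent between values of k may be noise. One common alternative is cross-validation, sketched below with `cross_val_score`. The data comes from `make_classification` as a synthetic stand-in (an assumption for this example, not the car dataset), and the candidate values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the encoded car features and labels
x_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

scores_by_k = {}
for k in (3, 5, 7, 9):
    model_k = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy over five train/validation splits
    scores_by_k[k] = cross_val_score(model_k, x_demo, y_demo, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"k={k}: mean accuracy {score:.3f}")
```

Averaging over several folds gives a steadier basis for choosing k than one score on one split.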


### Predict

In [29]:
predicted = model.predict(x_test)
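# Tim lists the classes as unacc, acc, good, vgood, but LabelEncoder sorts labels,
# so the codes are actually 0=acc, 1=good, 2=unacc, 3=vgood.
# Kept for reference only; the loop below prints the numeric codes.
class_names = ["acc", "good", "unacc", "vgood"]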
for k in range(len(predicted)):
    print("Predicted: ", predicted[k], "Data: ", x_test[k], "Actual: ", y_test[k])

Predicted:  0 Data:  (2, 1, 1, 1, 2, 2) Actual:  0
Predicted:  0 Data:  (1, 0, 3, 1, 2, 2) Actual:  0
Predicted:  2 Data:  (1, 2, 2, 0, 1, 1) Actual:  2
Predicted:  2 Data:  (3, 0, 0, 2, 0, 1) Actual:  2
Predicted:  0 Data:  (0, 1, 3, 1, 1, 2) Actual:  0
Predicted:  2 Data:  (0, 2, 3, 0, 2, 2) Actual:  2
Predicted:  3 Data:  (1, 2, 3, 2, 0, 0) Actual:  3
Predicted:  2 Data:  (0, 2, 0, 0, 2, 2) Actual:  2
Predicted:  2 Data:  (0, 3, 1, 2, 1, 2) Actual:  2
Predicted:  0 Data:  (3, 1, 0, 2, 0, 2) Actual:  0
Predicted:  2 Data:  (2, 3, 3, 0, 1, 1) Actual:  2
Predicted:  1 Data:  (2, 1, 2, 2, 1, 2) Actual:  1
Predicted:  2 Data:  (0, 2, 3, 2, 2, 2) Actual:  2
Predicted:  2 Data:  (2, 2, 0, 0, 1, 1) Actual:  2
Predicted:  2 Data:  (1, 3, 2, 1, 1, 2) Actual:  0
Predicted:  0 Data:  (1, 3, 2, 1, 1, 0) Actual:  0
Predicted:  2 Data:  (3, 0, 2, 2, 1, 1) Actual:  2
Predicted:  3 Data:  (2, 2, 3, 1, 1, 0) Actual:  3
Predicted:  2 Data:  (1, 2, 3, 0, 0, 2) Actual:  2
Predicted:  2 Data:  (0, 3, 2, 
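
The numeric codes are easier to read once they are mapped back to class names. Since `LabelEncoder` assigns codes in sorted label order, sorting the four class names reproduces the mapping. The sketch below applies it to a few (predicted, actual) pairs copied from the output above. `class_labels` and `pairs` are names chosen for this example.

```python
# Sorted order reproduces LabelEncoder's code -> label mapping
class_labels = sorted(["unacc", "acc", "good", "vgood"])

# A few (predicted, actual) pairs copied from the output above
pairs = [(0, 0), (2, 2), (3, 3), (1, 1), (2, 0)]

for predicted_code, actual_code in pairs:
    flag = "" if predicted_code == actual_code else "  <-- mismatch"
    print(
        f"Predicted: {class_labels[predicted_code]:5s} "
        f"Actual: {class_labels[actual_code]:5s}{flag}"
    )
```

In the notebook itself, an encoder fitted only on the 'class' column could do the same through `inverse_transform`.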

### Print k-nearest neighbors of each point

In [32]:
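for k in range(len(predicted)):
    # Distances to the 7 nearest neighbors and their indexes in the training set
    distances, indexes = model.kneighbors([x_test[k]], n_neighbors=7, return_distance=True)
    print("---")
    print(f"{k:03d}")
    print("Distances: ", distances.tolist())
    print("Indexes: ", indexes.tolist())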

---
000
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[723, 1496, 1200, 118, 335, 482, 629]]
---
001
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[151, 1411, 522, 1131, 330, 517, 1471]]
---
002
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[600, 1062, 1545, 631, 741, 733, 1236]]
---
003
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.4142135623730951]]
Indexes:  [[27, 728, 1083, 1104, 1194, 579, 644]]
---
004
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[1134, 802, 1542, 1540, 396, 767, 329]]
---
005
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[860, 1504, 989, 22, 1330, 135, 108]]
---
006
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[1157, 178, 911, 1309, 1035, 876, 163]]
---
007
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.4142135623730951, 1.4142135623730951]]
Indexes:  [[1407, 760, 945, 698, 495, 1026, 456]]
---
008
Distances:  [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
Indexes:  [[1162, 783, 588, 64
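
The distances of 1.0 and 1.4142135623730951 follow from the default metric of `KNeighborsClassifier`, which is Euclidean distance on the encoded tuples. The sketch below reproduces both values for test row 003. The two neighbor tuples are hypothetical examples, not rows looked up from the training set.

```python
import math

query = (3, 0, 0, 2, 0, 1)      # test row 003, taken from the Predict output
one_step = (3, 0, 0, 2, 0, 2)   # hypothetical neighbor: one feature differs by 1
two_steps = (3, 1, 0, 2, 0, 2)  # hypothetical neighbor: two features differ by 1

# Euclidean distance: square root of the sum of squared differences
print(math.dist(query, one_step))   # sqrt(1)
print(math.dist(query, two_steps))  # sqrt(1 + 1)
```

Because the codes are treated as numbers, the distance between two categories depends on how they were encoded. With the alphabetical codes, 'high'=0 and 'med'=2 are two steps apart, while 'low'=1 sits between them.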