KNN ALGORITHM

We will use the K-Nearest Neighbors (KNN) algorithm to classify cars into four categories (unacc, acc, good, vgood) based on six features: buying, maint, door, persons, lug_boot and safety.

In [2]:
from sklearn import preprocessing

In [3]:
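import sklearn.model_selection  # import the submodule explicitly so sklearn.model_selection.train_test_split resolves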
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing

In [4]:
data = pd.read_csv("car2.data")
print(data.head())  # To check if our data is loaded correctly

  buying  maint door persons lug_boot safety  class
0  vhigh  vhigh    2       2    small    low  unacc
1  vhigh  vhigh    2       2    small    med  unacc
2  vhigh  vhigh    2       2    small   high  unacc
3  vhigh  vhigh    2       2      med    low  unacc
4  vhigh  vhigh    2       2      med    med  unacc


As you may have noticed, much of our data is not numeric. To train the K-Nearest Neighbors classifier, we must convert the string data into numbers. Luckily, sklearn provides a LabelEncoder class in its preprocessing module that can do this for us.

We will start by creating a label encoder object and then use that to encode each column of our data into integers.

In [5]:
le = preprocessing.LabelEncoder()

The fit_transform() method takes a list (here, one column of our data) and returns an array of integer codes, one per value.

In [6]:
buying = le.fit_transform(list(data["buying"]))

In [7]:
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
clss = le.fit_transform(list(data["class"]))
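To make the encoding step concrete, here is a minimal sketch on a made-up column (the values in `safety_levels` are illustrative, not read from car2.data). It shows that LabelEncoder stores the distinct values in sorted order and uses their positions as codes, so the integers do not follow the natural low < med < high ordering.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only; not taken from car2.data
safety_levels = ["low", "med", "high", "low"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(safety_levels)

# classes_ holds the distinct values in sorted (alphabetical) order;
# each value is replaced by its position in that array
print(encoder.classes_.tolist())  # ['high', 'low', 'med']
print(encoded.tolist())           # [1, 2, 0, 1]
```

Because KNN measures distances between these codes, it may be worth keeping in mind that an alphabetical code is only a convenient stand-in for the real category order.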

In [8]:
predict = "class"

In [9]:
# Combine the encoded columns into a feature list with zip(); the encoded class column becomes the label list
X = list(zip(buying, maint, door, persons, lug_boot, safety))  # features
y = list(clss) # labels

Finally we will split our data into training and testing data using the same process seen previously.

In [10]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)  # hold out 10% of the dataset for testing
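As a quick illustration of what test_size does, the sketch below splits a made-up dataset of 20 rows (`X_toy` and `y_toy` are hypothetical names, and `random_state=0` is added here only to make the shuffle repeatable).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 feature tuples and 20 labels
X_toy = [(i, i % 3) for i in range(20)]
y_toy = [i % 2 for i in range(20)]

# test_size=0.1 holds out 10% of the rows; the rest is used for training
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

print(len(X_tr), len(X_te))  # 18 2
```

Without random_state the rows are shuffled differently on every run, which is why the accuracy of the model can change slightly each time the notebook is executed.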

Training a KNN Classifier
Creating a KNN classifier is almost identical to how we created the linear regression model. The only difference is that we can specify how many neighbors the model consults through the n_neighbors argument.

In [11]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=9)
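To show what n_neighbors controls, here is a simplified from-scratch sketch of the idea behind KNN: measure the distance from a query point to every training point, take the k closest, and let them vote. The function `knn_predict` and the toy data are hypothetical and written for illustration; this is not how sklearn implements the classifier internally, and tie-breaking may differ.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, (1, 1)))  # 0
print(knn_predict(toy_X, toy_y, (5, 4)))  # 1
```

A small k follows individual points closely, while a larger k smooths the decision over more neighbors.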

To train our model we follow precisely the same steps as outlined earlier.

In [12]:
model.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=9)

And once again, to score our model we do the following:

In [13]:
acc = model.score(x_test, y_test)
print(acc)

0.9421965317919075
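For a classifier, score() reports accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand on made-up data (`toy_model`, `held_out_X` and `held_out_y` are hypothetical names, and the last label is deliberately chosen so that one prediction is wrong).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

held_out_X = [(1, 1), (5, 4), (6, 6), (0, 2)]
held_out_y = [0, 1, 1, 1]  # the last label disagrees with its neighborhood on purpose

# Accuracy computed by hand: mean of element-wise matches
manual_acc = np.mean(toy_model.predict(held_out_X) == np.array(held_out_y))
builtin_acc = toy_model.score(held_out_X, held_out_y)

print(manual_acc, builtin_acc)  # 0.75 0.75
```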


Testing Our Model:

If we'd like to see how our model performs on the individual elements of our test data, we can do the following.

In [14]:
predicted = model.predict(x_test)
names = ["acc", "good", "unacc", "vgood"]  # LabelEncoder assigns codes alphabetically: acc=0, good=1, unacc=2, vgood=3

for x in range(len(predicted)):
    print("Predicted: ", names[predicted[x]], "Data: ", x_test[x], "Actual: ", names[y_test[x]])


Predicted:  unacc Data:  (2, 1, 1, 1, 1, 2) Actual:  acc
Predicted:  acc Data:  (1, 3, 2, 2, 0, 0) Actual:  acc
Predicted:  unacc Data:  (1, 3, 0, 1, 1, 2) Actual:  unacc
Predicted:  acc Data:  (3, 2, 1, 1, 1, 0) Actual:  acc
Predicted:  vgood Data:  (1, 1, 3, 1, 0, 0) Actual:  vgood
Predicted:  acc Data:  (1, 2, 0, 1, 0, 2) Actual:  good
Predicted:  unacc Data:  (0, 3, 0, 0, 0, 1) Actual:  unacc
Predicted:  acc Data:  (1, 2, 1, 1, 0, 2) Actual:  good
Predicted:  acc Data:  (0, 1, 1, 2, 0, 0) Actual:  acc
Predicted:  acc Data:  (1, 1, 0, 2, 1, 2) Actual:  acc
Predicted:  unacc Data:  (0, 0, 2, 0, 2, 2) Actual:  unacc
Predicted:  unacc Data:  (3, 3, 0, 1, 1, 0) Actual:  unacc
Predicted:  unacc Data:  (3, 0, 2, 2, 0, 2) Actual:  unacc
Predicted:  unacc Data:  (2, 1, 2, 2, 0, 1) Actual:  unacc
Predicted:  unacc Data:  (3, 3, 1, 2, 1, 1) Actual:  unacc
Predicted:  unacc Data:  (0, 2, 3, 1, 1, 1) Actual:  unacc
Predicted:  unacc Data:  (1, 2, 3, 2, 1, 1) Actual:  unacc
Predicted:  unacc Dat

This displays the predicted class, the encoded feature values and the actual class for each test sample.
We create a names list so that we can convert our integer predictions back into their string labels. Its order must match the alphabetical order in which LabelEncoder numbers the classes.
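One alternative to a hand-written names list, sketched here under the assumption that a dedicated encoder is kept for the class column, is to let the encoder translate the integers back with inverse_transform(). The name `class_encoder` and the toy labels are hypothetical and not part of the notebook above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy class labels for illustration; in the notebook these would come from data["class"]
toy_classes = ["unacc", "acc", "vgood", "good", "unacc"]

# A separate encoder kept only for the class column, so it can decode predictions later
class_encoder = LabelEncoder()
codes = class_encoder.fit_transform(toy_classes)

# The mapping from integer code to label is the sorted list of classes
print(class_encoder.classes_.tolist())  # ['acc', 'good', 'unacc', 'vgood']

# inverse_transform turns integer codes back into their string labels
decoded = class_encoder.inverse_transform([2, 0, 3, 1]).tolist()
print(decoded)  # ['unacc', 'acc', 'vgood', 'good']
```

This keeps the code-to-label mapping in one place instead of relying on a list typed by hand.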

------------------------------------------------//---------------------------------------------------------

Looking at Neighbors
The KNN model has a method that lets us see the neighbors of a given data point. We can use this information to plot our data and get a better idea of where our model may lack accuracy. The method is model.kneighbors.

Note: the .kneighbors method takes 2D input, so to pass a single data point we need to surround it with [] to put it in the right shape.
Parameters: The parameters for .kneighbors are as follows: X (2D array), n_neighbors (int), return_distance (True or False)
Return: An array with the index of each neighbor in the training data. If return_distance=True, it returns a tuple (distances, indices), with the distance from our data point to each neighbor first.
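The following minimal sketch shows the call and its return value on made-up data (`toy_model` and the toy points are hypothetical). The query point is also one of the training points, so its nearest neighbor is itself at distance 0.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

# A single query point must be wrapped in a list to form a 2D input
distances, indices = toy_model.kneighbors([(0, 0)], n_neighbors=2, return_distance=True)

# One row per query point, one column per neighbor
print(distances.shape, indices.shape)  # (1, 2) (1, 2)

# The indices refer to rows of the data passed to fit(); the closest row here is row 0
print(indices[0][0], distances[0][0])
```

The indices point into the training data given to fit(), not into the test set, so a neighbor's features can be looked up as toy_X[index].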

In [15]:
predicted = model.predict(x_test)
names = ["acc", "good", "unacc", "vgood"]  # LabelEncoder assigns codes alphabetically: acc=0, good=1, unacc=2, vgood=3

for x in range(len(predicted)):
    print("Predicted: ", names[predicted[x]], "Data: ", x_test[x], "Actual: ", names[y_test[x]])
    # Now we look at the neighbors of each point in our testing data
    n = model.kneighbors([x_test[x]], n_neighbors=9, return_distance=True)
    print("N: ", n)

Predicted:  unacc Data:  (2, 1, 1, 1, 1, 2) Actual:  acc
N:  (array([[1., 1., 1., 1., 1., 1., 1., 1., 1.]]), array([[1537,  154, 1516,  304, 1098,  409,  663,  892,  295]],
      dtype=int64))
Predicted:  acc Data:  (1, 3, 2, 2, 0, 0) Actual:  acc
N:  (array([[1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.41421356]]), array([[1389,  597,  416,   78,  406, 1502, 1154,  101,  750]],
      dtype=int64))
Predicted:  unacc Data:  (1, 3, 0, 1, 1, 2) Actual:  unacc
N:  (array([[1., 1., 1., 1., 1., 1., 1., 1., 1.]]), array([[ 474, 1353, 1356,  229,  297,   92, 1457, 1469,  613]],
      dtype=int64))
Predicted:  acc Data:  (3, 2, 1, 1, 1, 0) Actual:  acc
N:  (array([[1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.41421356, 1.41421356]]), array([[1430,  221, 1275,  551, 1529,  246,  489,    8,  476]],
      dtype=int64))
Predicted:  vgood Data:  (1, 1, 3, 1, 0, 0) Actual:  vgood
N:  