# kNN Tracing Exercise

In this first machine learning exercise, kNN tracing is demonstrated using the `family_car.csv` dataset. The five nearest neighbors of a test instance are found, and their labels are used to predict whether the test instance is a family car.

The test instance for this kNN tracing:
* **Price:** 21000
* **Engine Power:** 190
* **Family Car:** ?

## Normalize the Data

First, the dataset is loaded into a pandas DataFrame. The features are then normalized so that both contribute on the same scale to the distance calculation. This is done with min-max normalization, which maps every training value into the range 0 to 1:

$$x' = \frac{x - \min(xs)}{\max(xs) - \min(xs)}$$

In [50]:
import pandas as pd

car_df = pd.read_csv("family_car.csv")

X_train = car_df.drop(["Family Car"], axis=1)
y_train = car_df["Family Car"]

# Min-max scale each feature column: (x - min) / (max - min)
price_min, price_max = car_df["Price"].min(), car_df["Price"].max()
power_min, power_max = car_df["Engine Power"].min(), car_df["Engine Power"].max()

price_scaled = ((car_df["Price"] - price_min) / (price_max - price_min)).tolist()
power_scaled = ((car_df["Engine Power"] - power_min) / (power_max - power_min)).tolist()

print("Scaled Price:", price_scaled, "\n")
print("Scaled Engine Power:", power_scaled, "\n")

# Scale the test instance with the training set's min and max
test_scale_price = (21000 - car_df["Price"].min()) / (car_df["Price"].max() - car_df["Price"].min())
test_scale_power = (190 - car_df["Engine Power"].min()) / (car_df["Engine Power"].max() - car_df["Engine Power"].min())

print("Test Instance Normalization: ", test_scale_price, ", ", test_scale_power, sep="")

price_scaled_ser = pd.Series(price_scaled)
power_scaled_ser = pd.Series(power_scaled)

Scaled Price: [0.0, 0.029411764705882353, 0.20588235294117646, 0.23529411764705882, 0.38235294117647056, 0.38235294117647056, 0.4411764705882353, 0.5294117647058824, 0.5882352941176471, 0.6470588235294118, 0.6764705882352942, 0.9411764705882353, 0.9705882352941176, 1.0] 

Scaled Engine Power: [0.8333333333333334, 0.1111111111111111, 0.2222222222222222, 0.6666666666666666, 0.5, 1.0, 0.7777777777777778, 0.5555555555555556, 0.6944444444444444, 1.0, 0.2777777777777778, 0.0, 0.4722222222222222, 0.6944444444444444] 

Test Instance Normalization: 0.4117647058823529, 0.16666666666666666
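The scaling step above can also be written as a small reusable helper. The following is a minimal sketch using only the Python standard library; the function name `min_max_scale`, the handling of a constant column, and the toy prices are assumptions made for illustration and are not part of the original notebook or of `family_car.csv`.

```python
def min_max_scale(values):
    """Rescale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map every value to 0.0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical toy prices, not taken from family_car.csv
toy_prices = [10000, 20000, 40000]
scaled_prices = min_max_scale(toy_prices)
print(scaled_prices)
```

A new instance should be scaled with the minimum and maximum of the training data, as the notebook does for the test instance, so that it lands on the same scale as the training features.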


## Compute Distances Between Training Set and Test Instance

The next step is to compute how far the test instance is from each instance in the training set. This is done with the Euclidean distance, where $(x_0, y_0)$ is the scaled test instance and $(x_1, y_1)$ is a scaled training instance:

$$\sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2}$$
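
As a worked check, plugging in the scaled test instance $(x_0, y_0) \approx (0.4118, 0.1667)$ and the scaled training instance at index 2, $(x_1, y_1) \approx (0.2059, 0.2222)$, both rounded from the normalization output above, gives:

$$\sqrt{(0.2059 - 0.4118)^2 + (0.2222 - 0.1667)^2} \approx \sqrt{0.0424 + 0.0031} \approx 0.213$$

This agrees with the distance reported for index 2 in the output below.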

In [51]:
import numpy as np

X_train["Price"] = price_scaled_ser
X_train["Engine Power"] = power_scaled_ser

distance_set = []

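for i in range(len(X_train)):
    # Squared differences along each scaled feature
    x_diff_sq = (test_scale_price - X_train.loc[i, "Price"]) ** 2
    y_diff_sq = (test_scale_power - X_train.loc[i, "Engine Power"]) ** 2
    # Euclidean distance between the test instance and training instance i
    dist = np.sqrt(x_diff_sq + y_diff_sq)
    distance_set.append(dist)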

distance_ser = pd.Series(distance_set)
print(distance_ser)

0     0.783578
1     0.386368
2     0.213246
3     0.530228
4     0.334628
5     0.833852
6     0.611818
7     0.406295
8     0.556499
9     0.865914
10    0.287080
11    0.555027
12    0.636905
13    0.790298
dtype: float64
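
The distance computation can be reproduced without pandas as a cross-check. The sketch below hard-codes the scaled values from the normalization output, rounded to four decimals (so results match the output above only to about three decimals), and uses `math.hypot` from the standard library. The variable names are chosen for illustration.

```python
import math

# Scaled training features copied from the normalization output, rounded to 4 decimals
price_scaled = [0.0, 0.0294, 0.2059, 0.2353, 0.3824, 0.3824, 0.4412,
                0.5294, 0.5882, 0.6471, 0.6765, 0.9412, 0.9706, 1.0]
power_scaled = [0.8333, 0.1111, 0.2222, 0.6667, 0.5, 1.0, 0.7778,
                0.5556, 0.6944, 1.0, 0.2778, 0.0, 0.4722, 0.6944]

# Scaled test instance (Price 21000, Engine Power 190), also rounded
test_price, test_power = 0.4118, 0.1667

# math.hypot(dx, dy) returns sqrt(dx**2 + dy**2), the Euclidean distance
distances = [
    math.hypot(p - test_price, e - test_power)
    for p, e in zip(price_scaled, power_scaled)
]

for idx, dist in enumerate(distances):
    print(idx, round(dist, 3))
```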


## Find Nearest Neighbors and Majority Vote the Class Label

Now that the distance from each scaled training instance to the scaled test instance is known, the nearest neighbors can be identified. For this exercise, k = 5. The code below collects the indices of the five nearest neighbors and then takes the most frequent Family Car label (yes or no) among them:

In [52]:
k = 5
kNN_idx_list = []
remaining = distance_ser.copy()  # work on a copy so distance_ser stays intact

# Repeatedly take the smallest remaining distance until k neighbors are found
for _ in range(k):
    idx_min = remaining.idxmin()
    kNN_idx_list.append(idx_min)
    remaining = remaining.drop(idx_min)

print("Nearest Neighbors Indices:", kNN_idx_list)

# Collect the class label of each neighbor
family_car_list = [y_train.loc[idx] for idx in kNN_idx_list]

# Majority vote: the most frequent label among the k neighbors
family_car_ser = pd.Series(family_car_list)
y_predicted = family_car_ser.mode()

print("y_predicted:", y_predicted)

Nearest Neighbors Indices: [2, 10, 4, 1, 7]
y_predicted: 0    no
dtype: object


After finding the five nearest neighbors and taking a majority vote with the mode, kNN predicts that the test instance is not a family car.
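
The neighbor selection and majority vote can also be sketched with only the standard library. The distances below are copied from the distance output earlier in this exercise. The helper `majority_vote` and the `example_labels` list are hypothetical: the labels are invented for illustration and are not the actual Family Car labels in `family_car.csv`.

```python
from collections import Counter

# Distances copied from the output of the distance step
distances = [0.783578, 0.386368, 0.213246, 0.530228, 0.334628, 0.833852,
             0.611818, 0.406295, 0.556499, 0.865914, 0.287080, 0.555027,
             0.636905, 0.790298]
k = 5

# Sort the indices by distance and keep the first k (nearest first)
nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
print(nearest)  # [2, 10, 4, 1, 7]


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels for illustration only, NOT the labels in family_car.csv
example_labels = ["no", "yes", "no", "no", "yes"]
print(majority_vote(example_labels))  # no
```

Choosing an odd k, as done here with k = 5, avoids tied votes when there are only two classes.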