# K-nearest Neighbors From Scratch

In this notebook, we will code the k-nearest neighbors model (regressor and classifier) from scratch to understand the theory behind this commonly used machine learning model.

At the end, we will compare the predictions of the k-nearest neighbors models we build from scratch with those of the scikit-learn models.

YouTube: https://youtu.be/P-mM9396Dn8

## Implementation

For the first main function, the k-nearest neighbors regressor, we perform the steps as follows:
- Concatenate the "X_train" and "X_test" pandas data frames into a single "X", so the test rows come last.
- Convert "X" and "y_train" into NumPy arrays ("X" and "y").
- Initialize the "predict_output" list to store our final outputs.
- For each data point ("point") in "X", use the `euclidean_distance` function to calculate its distance to every training data point ("train_point"). Because `zip(X, y)` stops at the shorter input, only the training rows of "X" are paired with labels.
- Sort the distances in ascending order.
- Keep the "k" smallest distances, which are the "k" nearest neighbors.
- For the regressor model, we get the label (prediction) by taking the mean of the labels of all k neighbors.
- For the comparison, the function returns only the predictions for the test dataset ("X_test"), which are the last entries of "predict_output".

In [55]:
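# main function: knn regressor
def knn_reg(X_train, y_train, X_test, k):
    # stack train and test rows; the test rows end up last
    X = pd.concat([X_train, X_test]).to_numpy()
    y = y_train.to_numpy()

    predict_output = []

    for point in X:
        # zip stops at len(y), so only the training rows of X get a label
        distance_label = [
            (euclidean_distance(point, train_point), train_label)
            for train_point, train_label in zip(X, y)
        ]
        # k nearest training points, closest first
        neighbors = sorted(distance_label)[:k]
        # regression: average the neighbors' labels
        predict_output.append(sum(label for _, label in neighbors) / k)
    # keep only the predictions for the test rows
    return predict_output[-len(X_test):]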
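
The function above computes a prediction for every row of the concatenated array and then keeps only the last `len(X_test)` entries. It also depends on `zip` stopping at the shorter of its inputs, so that only training rows are paired with labels. As a minimal sketch of an equivalent variant that skips the concatenation and loops over the test points directly, consider the following. The name `knn_reg_test_only` and the toy values are assumptions made for illustration; they are not part of the original notebook, and the inputs are plain Python lists rather than pandas objects.

```python
def euclidean_distance(p1, p2):
    # same helper as in the notebook: square root of summed squared differences
    a = 0
    for n in range(len(p1)):
        a += (p1[n] - p2[n]) ** 2
    return a ** 0.5


def knn_reg_test_only(X_train, y_train, X_test, k):
    """Hypothetical variant of knn_reg that predicts only for rows of X_test."""
    predictions = []
    for point in X_test:
        # pair the distance to each training row with that row's label
        distance_label = [
            (euclidean_distance(point, train_point), train_label)
            for train_point, train_label in zip(X_train, y_train)
        ]
        # keep the k closest training rows and average their labels
        neighbors = sorted(distance_label)[:k]
        predictions.append(sum(label for _, label in neighbors) / k)
    return predictions


# toy data, made up for illustration only
X_train = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]]
y_train = [10.0, 20.0, 30.0, 100.0]
print(knn_reg_test_only(X_train, y_train, [[2.1, 2.1]], k=3))  # [20.0]
```

With pandas inputs, the same idea would need a `.to_numpy()` call on each argument first, as in the original function.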

For our second main function, the k-nearest neighbors classifier, we perform the steps as follows:
- Concatenate the "X_train" and "X_test" pandas data frames into a single "X", so the test rows come last.
- Convert "X" and "y_train" into NumPy arrays ("X" and "y").
- Initialize the "predict_output" list to store our final outputs.
- For each data point ("point") in "X", use the `euclidean_distance` function to calculate its distance to every training data point ("train_point"). Because `zip(X, y)` stops at the shorter input, only the training rows of "X" are paired with labels.
- Sort the distances in ascending order.
- Keep the "k" smallest distances, which are the "k" nearest neighbors.
- For the classifier model, we get the label (prediction) by picking the label that occurs most often among the k neighbors.
- For the comparison, the function returns only the predictions for the test dataset ("X_test"), which are the last entries of "predict_output".

In [56]:
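# main function: knn classifier
def knn_class(X_train, y_train, X_test, k):
    # stack train and test rows; the test rows end up last
    X = pd.concat([X_train, X_test]).to_numpy()
    y = y_train.to_numpy()

    predict_output = []

    for point in X:
        # zip stops at len(y), so only the training rows of X get a label
        distance_label = [
            (euclidean_distance(point, train_point), train_label)
            for train_point, train_label in zip(X, y)
        ]
        # k nearest training points, closest first
        neighbors = sorted(distance_label)[:k]
        neighbor_labels = [label for _, label in neighbors]
        # classification: pick the label with the highest count
        predict_output.append(max(set(neighbor_labels), key=neighbor_labels.count))
    # keep only the predictions for the test rows
    return predict_output[-len(X_test):]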
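
The majority vote in `knn_class` uses `max(set(neighbor_labels), key=neighbor_labels.count)`. When two labels have the same count, the winner depends on the iteration order of the set, which for string labels can differ between interpreter runs. As a hedged alternative, the sketch below uses `collections.Counter`, where a tie goes to the label seen first, that is, the label of the nearer neighbor when the list is ordered by distance. The helper name `majority_label`, the tie rule, and the example labels are assumptions for illustration.

```python
from collections import Counter


def majority_label(neighbor_labels):
    """Hypothetical helper: most common label; ties go to the label seen first."""
    return Counter(neighbor_labels).most_common(1)[0][0]


# labels listed from nearest to farthest neighbor (made-up examples)
print(majority_label(["Iris-virginica", "Iris-versicolor", "Iris-virginica"]))
# Iris-virginica
print(majority_label(["Iris-setosa", "Iris-versicolor"]))
# Iris-setosa (tie, so the nearer neighbor's label wins)
```

This tie rule is only one possible choice; it is not claimed to match how scikit-learn resolves ties.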

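The helper function `euclidean_distance`, which both main functions call, computes the Euclidean distance between two data points: it sums the squared differences over every feature (indexed by "n") and takes the square root of the total.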

In [48]:
# helper function: euclidean distance
def euclidean_distance(p1, p2):
    a = 0  # running sum of squared differences
    for n in range(len(p1)):
        a += (p1[n] - p2[n]) ** 2
    return a ** 0.5
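
As a quick sanity check of the helper, the sketch below evaluates it on the 3-4-5 right triangle and compares the result with the standard library's `math.dist` (available from Python 3.8). The sample points are made up for illustration.

```python
import math


def euclidean_distance(p1, p2):
    # same helper as above, repeated so this cell runs on its own
    a = 0
    for n in range(len(p1)):
        a += (p1[n] - p2[n]) ** 2
    return a ** 0.5


p1, p2 = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(p1, p2))  # 5.0
# the standard library computes the same quantity
print(math.isclose(euclidean_distance(p1, p2), math.dist(p1, p2)))  # True
```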

## Model Comparison

In this section, we use two datasets (a house price dataset for regression and the Iris dataset for classification) to compare our k-nearest neighbors models with the scikit-learn `KNeighborsRegressor` and `KNeighborsClassifier` models. The comparison is simple: we check whether both implementations output the same prediction for the test data.

### Library & Data Preparation

In [1]:
# load libraries
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# load regression dataset
df_reg = pd.read_csv('data/House_Price_Dataset.csv')
df_reg.head()

| Unnamed: 0 | HouseSize | Rooms | Price |
|---|---|---|---|
| 0 | 2104 | 3 | 399900 |
| 1 | 1600 | 3 | 329900 |
| 2 | 2400 | 3 | 369000 |
| 3 | 1416 | 2 | 232000 |
| 4 | 3000 | 4 | 539900 |


In [44]:
# load classification dataset
df_class = pd.read_csv('data/Iris_Dataset.csv')
df_class.head()

| Unnamed: 0 | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


### Data Transformation

In [3]:
# training data for regression: every row except the last
X_train_reg = df_reg[["HouseSize", "Rooms"]][:-1] # independent variables
y_train_reg = df_reg["Price"][:-1] # dependent variable

# test data for regression: the last row only
X_test_reg = df_reg[["HouseSize", "Rooms"]][-1:]

In [50]:
# training data for classification: every row except the last
X_train_class = df_class[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]][:-1] # independent variables
y_train_class = df_class["Species"][:-1] # dependent variable

# test data for classification: the last row only
X_test_class = df_class[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]][-1:]

### Regression Models Building

In [59]:
reg_model_sklearn = KNeighborsRegressor(n_neighbors = 5).fit(X_train_reg, y_train_reg)
print("KNeighborsRegressor prediction: ", reg_model_sklearn.predict(X_test_reg))

KNeighborsRegressor prediction:  [247720.]


In [60]:
print("knn_reg prediction: ", knn_reg(X_train_reg, y_train_reg, X_test_reg, k = 5))

knn_reg prediction:  [247720.0]


### Classification Models Building

In [61]:
class_model_sklearn = KNeighborsClassifier(n_neighbors = 5).fit(X_train_class, y_train_class)
print("KNeighborsClassifier prediction: ", class_model_sklearn.predict(X_test_class))

KNeighborsClassifier prediction:  ['Iris-virginica']


In [62]:
print("knn_class prediction: ", knn_class(X_train_class, y_train_class, X_test_class, k = 5))

knn_class prediction:  ['Iris-virginica']


As the results above show, the two regression models return the same prediction, and so do the two classification models. Hence, we have successfully built a k-nearest neighbors regressor and classifier from scratch!
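
The comparison above is done by eye. As a minimal sketch, the same check could be scripted with a small helper that compares two prediction sequences element by element, using a tolerance for numbers and exact equality for string labels. The helper name `predictions_match` is hypothetical; the values passed to it are the predictions printed above.

```python
import math


def predictions_match(preds_a, preds_b):
    """Hypothetical helper: True if two prediction sequences agree element-wise."""
    if len(preds_a) != len(preds_b):
        return False
    for a, b in zip(preds_a, preds_b):
        if isinstance(a, str) or isinstance(b, str):
            # class labels must be identical
            if a != b:
                return False
        elif not math.isclose(a, b):
            # numeric predictions may differ by floating-point rounding
            return False
    return True


# predictions copied from the outputs above
print(predictions_match([247720.0], [247720.0]))  # True
print(predictions_match(["Iris-virginica"], ["Iris-virginica"]))  # True
```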