The K-Nearest Neighbors (KNN) algorithm is simple to implement and often effective in practice. In this chapter
you will discover exactly how to implement it from scratch, step by step. After reading this
chapter you will know:

- How to calculate the Euclidean distance between real valued vectors.
- How to use Euclidean distance and the training dataset to make predictions for new data.



In [12]:
from linear_algebra import distance
from io import StringIO

import pandas as pd
import numpy as np

In [13]:
def clean_cols(cols): return cols.lower().strip() 

The problem is a binary (two-class) classification problem that was contrived for this
tutorial. The dataset contains two input variables (X1 and X2) and a class output variable (Y)
with the values 0 and 1. The dataset contains 10 records, 5 belonging to each class.

In [14]:
dataset = StringIO("""X1 X2 Y
3.393533211 2.331273381 0
3.110073483 1.781539638 0
1.343808831 3.368360954 0
3.582294042 4.67917911 0
2.280362439 2.866990263 0
7.423436942 4.696522875 1
5.745051997 3.533989803 1
9.172168622 2.511101045 1
7.792783481 3.424088941 1
7.939820817 0.791637231 1
""")

knn = pd.read_csv(dataset, sep=' ').rename(columns=clean_cols)

In [15]:
knn.sample(5)

         x1        x2  y
3  3.582294  4.679179  0
5  7.423437  4.696523  1
0  3.393533  2.331273  0
8  7.792783  3.424089  1
2  1.343809  3.368361  0


The Euclidean distance between two vectors is the square root of the sum of the squared differences between their corresponding components.


In [16]:
# take rows 0 and 1 of knn as feature vectors (drop the trailing y column)
a = knn.loc[0].iloc[:-1]
b = knn.loc[1].iloc[:-1]
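
As a quick check of the definition, the sketch below computes the distance between the first two records of the dataset with plain NumPy. The helper name `euclidean_distance` is an assumption made for illustration, and the two vectors are typed in by hand so the block runs on its own.

```python
import numpy as np


def euclidean_distance(u, v):
    """Square root of the sum of squared component-wise differences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum((u - v) ** 2)))


# Feature values (x1, x2) of the first two records in the dataset above.
a = [3.393533211, 2.331273381]
b = [3.110073483, 1.781539638]

print(euclidean_distance(a, b))
```

Both records belong to class 0, so the printed distance should be small compared with the distances between points of different classes.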

The first step is to calculate the **Euclidean distance between the new input instance and
all instances in the training dataset**. The table below lists the distance between each training
instance and the new data.

In [17]:
from typing import Tuple
new_point = (8.093607318, 3.365731514, 1)
point = Tuple[float, float, int]

def distance(df, new_point: point):
    """Calculates the Euclidean distance between a new point
    and every row of df.
    Returns a Series."""
    # use the df argument (not the global knn) and drop the class label from new_point
    return np.sqrt(
        np.sum(
            np.square(df.loc[:, ['x1', 'x2']].subtract(new_point[:-1])),
            axis='columns')
    )

knn = knn.assign(distance = distance(knn,new_point))
knn

         x1        x2  y  distance
0  3.393533  2.331273  0  4.812567
1  3.110073  1.781540  0  5.229271
2  1.343809  3.368361  0  6.749799
3  3.582294  4.679179  0  4.698627
4  2.280362  2.866990  0  5.834600
5  7.423437  4.696523  1  1.490011
6  5.745052  3.533990  1  2.354575
7  9.172169  2.511101  1  1.376113
8  7.792783  3.424089  1  0.306432
9  7.939821  0.791637  1  2.578684


In [18]:
# ties are avoided here by choosing k = number of classes + 1
# (k = 3 for two classes, so the vote cannot tie)
# see Joel Grus's implementation of KNN for a different way to handle ties

def _smallest_dist(df, k: int):
    """Finds the k smallest values in the distance column
    and returns their index labels."""
    return df['distance'].nsmallest(k).index


def prediction(df, k: int):
    """Returns the most common class among the k nearest neighbors."""
    index = _smallest_dist(df, k)
    # use the df argument rather than the global knn
    return df.loc[index, 'y'].mode()


In [19]:
prediction(knn, 3)

0    1
dtype: int64
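
The steps above can also be gathered into a single function. The following is a minimal sketch, not the chapter's own implementation: the name `knn_predict`, the NumPy arrays `X_train` and `y_train`, and the tie-break rule (drop the farthest neighbor until one class wins) are all assumptions made for illustration. It expects `k` to be at least 1.

```python
from collections import Counter

import numpy as np

# Training data copied from the dataset above: columns are x1 and x2.
X_train = np.array([
    [3.393533211, 2.331273381],
    [3.110073483, 1.781539638],
    [1.343808831, 3.368360954],
    [3.582294042, 4.67917911],
    [2.280362439, 2.866990263],
    [7.423436942, 4.696522875],
    [5.745051997, 3.533989803],
    [9.172168622, 2.511101045],
    [7.792783481, 3.424088941],
    [7.939820817, 0.791637231],
])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def knn_predict(X, y, query, k=3):
    """Return the majority class among the k training rows nearest to query."""
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(np.sum((X - query) ** 2, axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances)[:k]
    labels = [int(label) for label in y[nearest]]
    # Assumed tie-break: drop the farthest neighbor until one class wins.
    while True:
        counts = Counter(labels).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        labels = labels[:-1]


# Same new point as above, without its class label.
print(knn_predict(X_train, y_train, (8.093607318, 3.365731514), k=3))  # prints 1
```

Keeping the features and labels in separate arrays means the query point no longer needs to carry a class label, and the tie-break makes the function usable with even values of `k`.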