## k-nearest neighbors

Recall that k-nearest neighbors (k-NN) is a non-parametric classification model. Given a data point $x$, the idea is to find the $k$ training points closest to $x$ and then assign $x$ to the class that holds the majority among these $k$ points.
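The rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the `knn` module used later in this notebook: the function name `knn_predict` and the toy data are hypothetical, and the distance is the Euclidean one (via the standard-library `math.dist`), which is defined just below.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k pairs with the smallest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.15, 0.15), k=3))  # 0
print(knn_predict(points, labels, (0.95, 1.0), k=3))   # 1
```

Sorting all distances costs $O(N \log N)$ per query for $N$ training points, which is fine for a small dataset; larger ones usually call for a spatial index or a partial sort.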

Any distance can be used, but this implementation uses the Euclidean (or $L_2$) distance. The Euclidean distance between two points $x = (x_1, ..., x_n)$ and $y = (y_1, ..., y_n)$ in $\mathbb{R}^n$ is defined by:

$$ d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i-y_i)^2}. $$
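As a quick check of the formula, here is a minimal sketch that evaluates it term by term. The helper name `euclidean_distance` is an assumption made for this example and is not taken from the `knn` module.

```python
import math

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The classic 3-4-5 right triangle: sqrt(3^2 + 4^2) = 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

With NumPy arrays the same quantity is given by `np.linalg.norm(x - y)`.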




In [27]:
from sklearn import datasets
import knn
import random
import numpy as np


iris = datasets.load_iris()
X = iris.data  
y = iris.target
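# hold out 40% of the samples for testing
ntest = 40*len(y)//100

# shuffle the sample indices so that the split is random
idx = list(range(len(y)))
random.shuffle(idx)
# the first ntest shuffled indices form the test set, the rest the training set
xtrain = X[idx[ntest:]]
xtest = X[idx[:ntest]]
ytrain = y[idx[ntest:]]
ytest = y[idx[:ntest]]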

In [28]:
model = knn.KNN(k = 8)

In [29]:
model.fit(xtrain, ytrain)

Info: 
 k-NN with  k = 8 and 90 datapoints and 3 labels


In [30]:
p = model.predict(xtest)

In [31]:
print(f'Misclassified on test : {((p!=ytest).mean()*100).round(2)}%')

Misclassified on test : 1.67%
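To read this figure: the test set holds 40% of the 150 Iris samples, i.e. 60 points, so 1.67% corresponds to a single wrong prediction (1/60). The sketch below reproduces that arithmetic in plain Python; the helper `misclassification_rate` and the label lists are hypothetical and only serve as an illustration.

```python
def misclassification_rate(predicted, actual):
    """Percentage of positions where the prediction differs from the true label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(p != a for p, a in zip(predicted, actual))
    return 100 * errors / len(actual)

# Hypothetical labels: 60 test points with exactly one wrong prediction.
actual = [0] * 20 + [1] * 20 + [2] * 20
predicted = list(actual)
predicted[25] = 2  # a class-1 point predicted as class 2
print(round(misclassification_rate(predicted, actual), 2))  # 1.67
```

Because the split is shuffled without a fixed seed, the exact rate can change from one run to the next.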
