## k-Nearest Neighbors

The concept is simple:
- We compute the distance between each test point and every training point, using all features
- When predicting, we pick the k nearest training points by that distance (Euclidean or another metric); an odd k is a common choice because it avoids tied votes when there are two classes
- The predicted class is the majority class among those k neighbors
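
The three steps above can be sketched in a few lines of NumPy. This is only an illustration on made-up 2D points, not the implementation used later in this notebook; the function name `knn_predict` and the toy data are assumptions introduced for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Step 2: positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: most frequent label among those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters labelled "a" and "b"
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(train_X, train_y, np.array([1.1, 1.0])))
print(knn_predict(train_X, train_y, np.array([5.1, 5.0])))
```

A query close to the first cluster gets label "a" and one close to the second gets "b", since all three nearest neighbors come from the same cluster in each case.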

## Imports

In [1]:
import pandas as pd
import numpy as np
from random import sample

## Dataset "Iris"

- Source: https://archive.ics.uci.edu/ml/datasets/iris

> The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

> Attribute Information:
> 1. sepal length in cm
> 2. sepal width in cm
> 3. petal length in cm
> 4. petal width in cm
> 5. class:
> -- Iris Setosa
> -- Iris Versicolour
> -- Iris Virginica

## Load Data and split Train/Test

In [2]:
# Load Data and split Train/Test

df = pd.read_csv('data/iris.data', names=['sepal-len', 'sepal-width', 'petal-len', 'petal-width', 'class'])

train_ratio = 0.6

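# Randomly pick 60% of the row positions for training; the remaining rows form the test set
train_rows_indices = sample(range(len(df)), int(train_ratio*len(df)))
train_rows_set = set(train_rows_indices)  # set membership checks are O(1)
test_rows_indices = [i for i in range(len(df)) if i not in train_rows_set]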

train_df = df.iloc[train_rows_indices]
test_df = df.iloc[test_rows_indices]

In [3]:
print(train_df.head())
print("______________")
print(test_df.head())

     sepal-len  sepal-width  petal-len  petal-width            class
89         5.5          2.5        4.0          1.3  Iris-versicolor
68         6.2          2.2        4.5          1.5  Iris-versicolor
28         5.2          3.4        1.4          0.2      Iris-setosa
94         5.6          2.7        4.2          1.3  Iris-versicolor
125        7.2          3.2        6.0          1.8   Iris-virginica
______________
    sepal-len  sepal-width  petal-len  petal-width        class
3         4.6          3.1        1.5          0.2  Iris-setosa
4         5.0          3.6        1.4          0.2  Iris-setosa
5         5.4          3.9        1.7          0.4  Iris-setosa
6         4.6          3.4        1.4          0.3  Iris-setosa
14        5.8          4.0        1.2          0.2  Iris-setosa
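
Because the split is drawn at random, the rows shown above (and the final score) change from run to run. One possible way to make the split repeatable is pandas' `DataFrame.sample` with a fixed `random_state`. The sketch below uses a small made-up DataFrame as a stand-in for the Iris data, and the seed value `0` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for the Iris DataFrame: 10 rows, one feature, one label
df = pd.DataFrame({"x": range(10), "class": list("aabbccaabb")})

train_ratio = 0.6
# random_state fixes the random draw so the same rows are picked on every run
train_df = df.sample(frac=train_ratio, random_state=0)
# Every row that was not drawn for training goes to the test set
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```

Dropping the training index from the full frame guarantees that the two sets never share a row.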


## Calculate the Euclidean Distance

In [4]:
# Calculate the Euclidean Distance

def euclidean_distance(vector1, vector2):
    ret = ((vector1 - vector2)**2).sum()
    return np.sqrt(ret)
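
For two vectors x and y, this is d(x, y) = sqrt(sum_i (x_i - y_i)^2). As a quick sanity check, and as a sketch of how the same formula could be applied to many points at once, the example below uses NumPy broadcasting. The helper name `euclidean_distances_to_all` and the sample points are assumptions made for illustration.

```python
import numpy as np

def euclidean_distances_to_all(points, query):
    """Return the Euclidean distance from `query` to each row of `points`."""
    # Broadcasting subtracts `query` from every row; axis=1 sums the squares per row
    return np.sqrt(((points - query) ** 2).sum(axis=1))

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([0.0, 0.0])

# The 3-4-5 right triangle gives distances 0, 5 and 10
print(euclidean_distances_to_all(points, query))
```

Computing all distances in one call avoids a Python-level loop over the training rows, which matters once the training set grows.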


## Predict and Score

In [5]:
# Predict
k = 3
nbr_errors = 0

for i, test_row in test_df.iterrows():
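    distances = {} # dict training row_index => distance to the current test row
    for j, train_row in train_df.iterrows():
        # train and test rows are disjoint, so no self-comparison check is needed;
        # [:-1] drops the 'class' column so that only the four features are compared
        d = euclidean_distance(test_row[:-1], train_row[:-1])
        distances[j] = d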
    # get the k nearest points based on sorted distances
    k_nearest = sorted(distances.items(), key=lambda x: x[1])[:k] # returns a list of k tuples (index, distance)
    count_classes = {} # dict class => nbr_neighbours
    for idx, _ in k_nearest:
        c = train_df.loc[idx, 'class']
        count_classes[c] = count_classes.get(c, 0) + 1
    # majority vote; on a tie, the class encountered first (the one with the closest neighbour) wins
    predicted_class = max(count_classes, key=count_classes.get)
    correct_class = test_row['class']
    if predicted_class != correct_class:
        nbr_errors += 1

print(f"Score = {1-(nbr_errors/len(test_df))}")

Score = 0.9666666666666667
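
The voting step can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the code that produced the score above; the function name `majority_vote` and the label lists are made up for the example.

```python
from collections import Counter

def majority_vote(neighbor_classes):
    """Return the most frequent class; `neighbor_classes` is ordered nearest first."""
    # most_common keeps first-encountered order among equal counts,
    # so a tie is resolved in favour of the class that appears earliest in the list
    return Counter(neighbor_classes).most_common(1)[0][0]

# Clear majority: two of the three neighbors are versicolor
print(majority_vote(["Iris-setosa", "Iris-versicolor", "Iris-versicolor"]))

# Three-way tie with k = 3: the class of the nearest neighbor is returned
print(majority_vote(["Iris-virginica", "Iris-setosa", "Iris-versicolor"]))
```

With three classes, an odd k does not rule out ties (each of the 3 neighbors can belong to a different class), so it helps to know how the vote breaks them.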
