# Develop k-Nearest Neighbors Classifier in Python From Scratch

<font color='green'> 
I implemented the k-Nearest Neighbors classification algorithm in Python from scratch, using the [iris.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv) dataset.
    
</font>

#### References:

- [Develop k-Nearest Neighbors in Python From Scratch](https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)

## Step 1: Calculate Euclidean Distance

### <font color='blue'>Theory</font>

**Euclidean distance between the points A(x1,y1) and B(x2,y2):**

```
d(A,B)^2 = (x2-x1)^2 + (y2-y1)^2 
d(A,B) = sqrt((x2-x1)^2 + (y2-y1)^2 )
```

**Euclidean Distance between two rows:** 

```
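d(row1,row2)^2 = (row1[1]-row2[1])^2 + (row1[2]-row2[2])^2 + ... + (row1[n]-row2[n])^2
d(row1,row2) = sqrt((row1[1]-row2[1])^2 + (row1[2]-row2[2])^2 + ... + (row1[n]-row2[n])^2)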
```

**We apply this to every feature column of the two rows:**
```
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
```
```
d(row1,row2)^2 = (4.9-4.7)^2 + (3-3.2)^2 + (1.4-1.3)^2 + (0.2-0.2)^2
d(row1,row2) = sqrt((4.9-4.7)^2 + (3-3.2)^2 + (1.4-1.3)^2 + (0.2-0.2)^2)
d(row1,row2) = 0.3
```
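
The worked example above can be checked in a few lines. This is a minimal sketch that hard-codes the two rows without their class label; the variable names are illustrative, and `math.dist` requires Python 3.8 or newer.

```python
import math

# Feature values of the two rows from the worked example (class label left out).
row1 = [4.9, 3.0, 1.4, 0.2]
row2 = [4.7, 3.2, 1.3, 0.2]

# Sum the squared differences column by column, then take the square root.
squared_sum = sum((a - b) ** 2 for a, b in zip(row1, row2))
manual = math.sqrt(squared_sum)

# math.dist computes the same Euclidean distance directly.
builtin = math.dist(row1, row2)

print(round(manual, 6), round(builtin, 6))
```

Both values round to 0.3. The unrounded results can differ from 0.3 in the last digits because of floating-point arithmetic.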

### <font color='blue'>Loading Dataset</font>

In [1]:
import pandas as pd
import math 

In [2]:
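# Note: iris.csv has no header row, so pandas uses the first record
# (5.1, 3.5, 1.4, 0.2, Iris-setosa) as column names; pass header=None to keep it as data.
df = pd.read_csv("iris.csv")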

In [3]:
df.head()

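|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
|---|-----|-----|-----|-----|-------------|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |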


### <font color='blue'>Calculate Euclidean Distance</font>

In [4]:
df = df.values # to obtain a numpy array

In [5]:
df[0]

array([4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], dtype=object)

In [6]:
len(df[0])

5

In [7]:
def euclidean_distance(row1, row2):
    distance = 0
    for i in range(len(row1)-1): # skip the last column (the iris class label)
        distance = distance + ((row1[i]-row2[i])**2)
        
    return math.sqrt(distance)

In [8]:
euclidean_distance(df[0], df[1])

0.30000000000000016

### <font color='blue'>Calculate the distance between the first row and the other rows</font>

In [9]:
row0 = df[0]

for row in df:
    distance = euclidean_distance(row0, row) 
    print(distance)

0.0
0.30000000000000016
0.3316624790355407
0.608276253029822
1.0908712114635715
0.5099019513592788
0.42426406871192834
0.5099019513592785
0.17320508075688784
0.8660254037844388
0.4582575694955841
0.1414213562373099
0.6782329983125273
1.360147050873544
1.6278820596099708
1.0535653752852738
0.5477225575051659
1.1747340124470729
0.8366600265340752
0.7071067811865475
0.7615773105863909
0.7810249675906658
0.5567764362830019
0.6480740698407861
0.22360679774997896
0.4999999999999999
0.5916079783099616
0.49999999999999983
0.3464101615137758
0.24494897427831822
0.6782329983125268
1.1489125293076055
1.3416407864998738
0.17320508075688784
0.3
0.7874007874011809
0.17320508075688784
0.5099019513592784
0.4582575694955836
0.529150262212918
0.8185352771872454
0.5477225575051662
0.6782329983125268
0.9848857801796101
0.14142135623730986
0.8485281374238567
0.3605551275463996
0.812403840463596
0.31622776601683766
4.096339829652808
3.6864617182333523
4.236744032862973
2.9698484809834995
3.811823710509183
3

In [9]:
# collect the distances from the first row to every row in a list
def distance(train):
    row0 = train[0]
    distances = []
    for row in train:
        distances.append(euclidean_distance(row0, row))

    return distances

In [10]:
distance(df)

[0.0,
 0.30000000000000016,
 0.3316624790355407,
 0.608276253029822,
 1.0908712114635715,
 0.5099019513592788,
 0.42426406871192834,
 0.5099019513592785,
 0.17320508075688784,
 0.8660254037844388,
 0.4582575694955841,
 0.1414213562373099,
 0.6782329983125273,
 1.360147050873544,
 1.6278820596099708,
 1.0535653752852738,
 0.5477225575051659,
 1.1747340124470729,
 0.8366600265340752,
 0.7071067811865475,
 0.7615773105863909,
 0.7810249675906658,
 0.5567764362830019,
 0.6480740698407861,
 0.22360679774997896,
 0.4999999999999999,
 0.5916079783099616,
 0.49999999999999983,
 0.3464101615137758,
 0.24494897427831822,
 0.6782329983125268,
 1.1489125293076055,
 1.3416407864998738,
 0.17320508075688784,
 0.3,
 0.7874007874011809,
 0.17320508075688784,
 0.5099019513592784,
 0.4582575694955836,
 0.529150262212918,
 0.8185352771872454,
 0.5477225575051662,
 0.6782329983125268,
 0.9848857801796101,
 0.14142135623730986,
 0.8485281374238567,
 0.3605551275463996,
 0.812403840463596,
 0.31622776601683

## Step 2: Get Nearest Neighbors

In [23]:
b = [2,25, 32, 7, 11, 3]

In [24]:
sorted(b)[:3]

[2, 3, 7]

In [25]:
def nearest(x,k):
    return sorted(x)[:k]

In [26]:
nearest(b,3)

[2, 3, 7]

In [27]:
def get_neighbors(train, K):
    l = distance(train)
    return sorted(l)[:K]

In [28]:
get_neighbors(df, 5)

[0.0,
 0.14142135623730986,
 0.1414213562373099,
 0.17320508075688784,
 0.17320508075688784]

### <font color='blue'> Specifying the row in the function </font>

In [30]:
def get_neighbors(train, test_row, num_neighbors):
    distances_list = list()
    for row in train:
        distances_list.append(euclidean_distance(test_row, row)) 
    
    return sorted(distances_list)[:num_neighbors]

In [31]:
get_neighbors(df, df[0], 3)

[0.0, 0.14142135623730986, 0.1414213562373099]

### <font color='blue'> Storing the training rows in distances_list as well </font>

Each training row is stored together with its distance as a tuple: (train_row, distance).

In [32]:
distances_list = list()
for row in df:
    distance = euclidean_distance(df[0], row)
    distances_list.append((row, distance)) 

In [33]:
distances_list.sort(key=lambda x: x[1]) # sorting the list of tuples according to second element

In [34]:
def get_neighbors(train, test_row, num_neighbors):
    distances_list = list()
    for row in train:
        distance = euclidean_distance(test_row, row)
        distances_list.append((row, distance)) 
    
    distances_list.sort(key=lambda x: x[1]) 
    return distances_list

In [35]:
get_neighbors(df, df[0], 3) # num_neighbors is not used yet, so the whole sorted list is returned

[(array([4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], dtype=object), 0.0),
 (array([4.8, 3.0, 1.4, 0.3, 'Iris-setosa'], dtype=object),
  0.14142135623730986),
 (array([4.8, 3.0, 1.4, 0.1, 'Iris-setosa'], dtype=object),
  0.1414213562373099),
 (array([4.9, 3.1, 1.5, 0.1, 'Iris-setosa'], dtype=object),
  0.17320508075688784),
 (array([4.9, 3.1, 1.5, 0.1, 'Iris-setosa'], dtype=object),
  0.17320508075688784),
 (array([4.9, 3.1, 1.5, 0.1, 'Iris-setosa'], dtype=object),
  0.17320508075688784),
 (array([5.0, 3.0, 1.6, 0.2, 'Iris-setosa'], dtype=object),
  0.22360679774997896),
 (array([4.8, 3.1, 1.6, 0.2, 'Iris-setosa'], dtype=object),
  0.24494897427831822),
 (array([5.0, 3.2, 1.2, 0.2, 'Iris-setosa'], dtype=object), 0.3),
 (array([4.7, 3.2, 1.3, 0.2, 'Iris-setosa'], dtype=object),
  0.30000000000000016),
 (array([5.0, 3.3, 1.4, 0.2, 'Iris-setosa'], dtype=object),
  0.31622776601683766),
 (array([4.6, 3.1, 1.5, 0.2, 'Iris-setosa'], dtype=object),
  0.3316624790355407),
 (array([4.7, 3.2, 1.6, 0.2, '

In [36]:
# Use num_neighbors in a second loop to keep only the nearest rows.

def get_neighbors(train, test_row, num_neighbors):
    distances_list = list()
    for row in train:
        distance = euclidean_distance(test_row, row)
        distances_list.append((row, distance)) 
    
    distances_list.sort(key=lambda x: x[1]) # sorting the list of tuples according to second element
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances_list[i][0]) # take only neighbors not distances
    
    return neighbors

In [37]:
get_neighbors(df, df[0], 3) 

[array([4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], dtype=object),
 array([4.8, 3.0, 1.4, 0.3, 'Iris-setosa'], dtype=object),
 array([4.8, 3.0, 1.4, 0.1, 'Iris-setosa'], dtype=object)]

In [38]:
neighbors = get_neighbors(df, df[0], 3)
for neighbor in neighbors:
    print(neighbor)

[4.9 3.0 1.4 0.2 'Iris-setosa']
[4.8 3.0 1.4 0.3 'Iris-setosa']
[4.8 3.0 1.4 0.1 'Iris-setosa']


## Step 3: Make Predictions

The prediction is the most frequent class among the neighbors.

In [43]:
neighbors = get_neighbors(df, df[0], 3)
for neighbor in neighbors:
    print(neighbor[-1])

Iris-setosa
Iris-setosa
Iris-setosa


In [44]:
neighbors = get_neighbors(df, df[0], 3)

In [45]:
neighbors

[array([4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], dtype=object),
 array([4.8, 3.0, 1.4, 0.3, 'Iris-setosa'], dtype=object),
 array([4.8, 3.0, 1.4, 0.1, 'Iris-setosa'], dtype=object)]

In [48]:
neighbors[0][-1]

'Iris-setosa'

In [55]:
neighbors = get_neighbors(df, df[0], 3)
output_values = list()
for neighbor in neighbors:
    output_values.append(neighbor[-1])

In [56]:
output_values 

['Iris-setosa', 'Iris-setosa', 'Iris-setosa']

In [70]:
neighbors = get_neighbors(df, df[100], 100)
output_values = [neighbor[-1] for neighbor in neighbors]

In [71]:
output_values

['Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-v

In [77]:
max(set(output_values), key=output_values.count) # pick the most frequent class label

'Iris-versicolor'
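
The `max(set(...), key=...count)` idiom rescans the list once per distinct label, and when two labels are tied the winner depends on the iteration order of the set. A possible alternative, sketched here with the hypothetical helper `majority_class`, uses `collections.Counter`, which counts in a single pass and returns tied labels in the order they were first seen.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

# Illustrative labels, as they might come back from the nearest neighbors.
labels = [
    "Iris-virginica",
    "Iris-versicolor",
    "Iris-versicolor",
    "Iris-virginica",
    "Iris-versicolor",
]

print(majority_class(labels))
```

Because the neighbors are ordered by distance, a tie resolved this way favors the class of the closer neighbor.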

In [80]:
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [neighbor[-1] for neighbor in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

In [81]:
predict_classification(df, df[100], 100)

'Iris-versicolor'

## Step 4: Evaluate Predictions

<font color='green'> **Let's check the actual class of df[100] and what KNN predicts for it with 100 nearest neighbors.**</font>

In [85]:
prediction = predict_classification(df, df[100], 100) 

In [89]:
print(f"actual: {df[100][-1]}, KNN prediction: {prediction}")

actual: Iris-virginica, KNN prediction: Iris-versicolor


<font color='green'>**Let's try to predict df[100] with 10 nearest neighbors.** </font>

In [90]:
prediction = predict_classification(df, df[100], 10)

In [91]:
print(f"actual: {df[100][-1]}, KNN prediction: {prediction}")

actual: Iris-virginica, KNN prediction: Iris-virginica


<font color='green'>**For this row, reducing the number of neighbors from 100 to 10 gives the correct prediction.** </font>
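
The checks above predict a row that is itself part of the training data, so its nearest neighbor is always the row itself, and a single row says little about overall accuracy. One common way to get a fairer estimate is leave-one-out evaluation: predict each row from all the other rows and count how often the prediction is right. The sketch below is a hedged illustration on a tiny made-up dataset held in a plain Python list; `leave_one_out_accuracy` and the toy data are assumptions, and the helper functions are compact rewrites rather than the notebook's exact code.

```python
import math
from collections import Counter

def euclidean_distance(row1, row2):
    """Euclidean distance over all columns except the last one (the class label)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))

def predict_classification(train, test_row, num_neighbors):
    """Majority class among the num_neighbors rows of train closest to test_row."""
    nearest = sorted(train, key=lambda row: euclidean_distance(test_row, row))
    labels = [row[-1] for row in nearest[:num_neighbors]]
    return Counter(labels).most_common(1)[0][0]

def leave_one_out_accuracy(dataset, num_neighbors):
    """Predict every row from the remaining rows and return the share of correct predictions."""
    correct = 0
    for i, test_row in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]  # every row except the one being tested
        if predict_classification(train, test_row, num_neighbors) == test_row[-1]:
            correct += 1
    return correct / len(dataset)

# Made-up data: two well-separated groups of three rows each (two features, then the label).
dataset = [
    [1.0, 1.1, "a"],
    [1.2, 0.9, "a"],
    [0.9, 1.0, "a"],
    [5.0, 5.1, "b"],
    [5.2, 4.9, "b"],
    [4.9, 5.0, "b"],
]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(dataset, k))
```

On this toy data the accuracy is 1.0 for k = 1 and k = 3 but drops to 0.0 for k = 5, because five neighbors always include all three rows of the other class. This mirrors the effect seen with 100 neighbors on the iris rows.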