### K Nearest Neighbor
K-nearest neighbors (KNN) can be used for both classification and regression problems, although in industry it is used more often for classification. To evaluate any technique, we generally look at three important aspects:
<ol>
    <li>Ease of interpreting the output</li>
    <li>Calculation time</li>
    <li>Predictive power</li>
</ol>

#### How does it work?
The following is a spread of red circles (RC) and green squares (GS):
<img src="https://www.analyticsvidhya.com/wp-content/uploads/2014/10/scenario1.png">
Suppose you want to find the class of the blue star (BS), which can only be RC or GS. The “K” in the KNN algorithm is the number of nearest neighbors we take a vote from. Let’s say K = 3. We then draw a circle centered on BS that is just big enough to enclose exactly three data points on the plane. Refer to the following diagram for more details:
<img src="https://www.analyticsvidhya.com/wp-content/uploads/2014/10/scenario2.png">
The three closest points to BS are all RC. Hence, with a good level of confidence, we can say that BS belongs to the class RC.
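The voting step described above can be sketched in a few lines. This is only an illustration: the coordinates, the `points` list, and the `knn_vote` helper are hypothetical and are not taken from the figures.

```python
import math
from collections import Counter

# Hypothetical toy data: (x, y, label); RC = red circle, GS = green square
points = [
    (1.0, 1.0, "RC"), (1.5, 2.0, "RC"), (2.0, 1.5, "RC"),
    (5.0, 5.0, "GS"), (5.5, 4.5, "GS"), (6.0, 5.5, "GS"),
]
blue_star = (2.0, 2.0)  # the point whose class we want to predict

def knn_vote(query, labelled_points, k):
    """Return the majority label among the k points closest to query."""
    # Sort all points by their Euclidean distance to the query
    by_distance = sorted(labelled_points, key=lambda p: math.dist(query, p[:2]))
    # Keep the labels of the k closest points and count the votes
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(knn_vote(blue_star, points, k=3))  # RC
```

With K = 3, the three closest toy points are all red circles, so the vote is unanimous.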
#### How do we choose K?
The boundary separating the two classes changes with the value of K. If you look carefully, you can see that the boundary becomes smoother as K increases. As K grows toward the size of the training set, the boundary disappears and every point is assigned to a single class (all blue or all red), whichever holds the overall majority.
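The limiting case can be checked on a small example. The sketch below uses hypothetical toy data with four red circles and two green squares, and a query that sits inside the green cluster; `knn_vote` is an illustrative helper, not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy data: four red circles (RC) and two green squares (GS)
points = [
    (1.0, 1.0, "RC"), (1.5, 2.0, "RC"), (2.0, 1.5, "RC"), (2.5, 2.5, "RC"),
    (5.0, 5.0, "GS"), (5.5, 4.5, "GS"),
]
query = (5.2, 4.8)  # sits inside the GS cluster

def knn_vote(query, labelled_points, k):
    """Return the majority label among the k points closest to query."""
    by_distance = sorted(labelled_points, key=lambda p: math.dist(query, p[:2]))
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Small K follows the local neighborhood; K = all points follows the global majority
for k in (1, 3, len(points)):
    print(k, knn_vote(query, points, k))
# 1 GS
# 3 GS
# 6 RC
```

Once K equals the number of training points, the query's position no longer matters: the prediction is simply the most common class in the data.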
To find a good value of K, you can split the initial dataset into a training set and a validation set, then choose the K that gives the best predictions on the validation set.
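One way to carry out this split is sketched below. It is a minimal illustration under stated assumptions: the data is synthetic (two overlapping 2-D clusters), the 70/30 split and the candidate list are arbitrary choices, and `knn_vote` and `validation_accuracy` are hypothetical helper names.

```python
import math
import random
from collections import Counter

def knn_vote(query, labelled_points, k):
    """Return the majority label among the k points closest to query."""
    by_distance = sorted(labelled_points, key=lambda p: math.dist(query, p[0]))
    return Counter(label for _, label in by_distance[:k]).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    """Fraction of validation points whose label is predicted correctly."""
    hits = sum(knn_vote(features, train, k) == label for features, label in validation)
    return hits / len(validation)

# Hypothetical synthetic data: two overlapping 2-D Gaussian clusters
rng = random.Random(0)
dataset = [((rng.gauss(0, 1), rng.gauss(0, 1)), "RC") for _ in range(60)]
dataset += [((rng.gauss(2, 1), rng.gauss(2, 1)), "GS") for _ in range(60)]

# Shuffle, then hold out 30% of the rows for validation
rng.shuffle(dataset)
split = len(dataset) * 7 // 10
train, validation = dataset[:split], dataset[split:]

# Score a few odd values of K and keep the one with the best validation accuracy
candidates = [1, 3, 5, 7, 9, 11]
scores = {k: validation_accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

Odd values of K are used here so that a two-class vote cannot end in a tie. With a single split the chosen K depends on which rows land in the validation set, so repeating the split (or using cross-validation) gives a more stable estimate.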

In [3]:
!pip3 install pandas


In [2]:
!pip3 install numpy




#### Loading Data

In [7]:
import pandas as pd
import numpy as np
import math
import operator


# Importing data 
data = pd.read_csv('iris.csv')

print(data.head(5)) 

   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa


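#### Define the method to calculate the Euclidean distance
It returns the distance between the input data point and one training data point: the square root of the sum of the squared differences between their features. KNN computes this distance for every training row and keeps the K rows with the smallest distances.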
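As a worked check of the distance calculation, here is a minimal sketch using only the standard library. `euclidean_distance` is a hypothetical helper written for this example, and the query point is chosen for illustration; the second point is row 0 of the `head()` output above.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

query = [7.2, 3.6, 5.1, 2.5]      # illustrative query point
first_row = [5.1, 3.5, 1.4, 0.2]  # row 0 of the dataset (features only)

# (2.1)^2 + (0.1)^2 + (3.7)^2 + (2.3)^2 = 4.41 + 0.01 + 13.69 + 5.29 = 23.40
print(round(euclidean_distance(query, first_row), 3))  # 4.837
```

The class label is left out of the calculation: only the four numeric features contribute to the distance.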

In [10]:
# Euclidean distance between the first `length` features of two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        # .iloc is positional, so it works whatever the index labels are
        distance += np.square(data1.iloc[x] - data2.iloc[x])
    return np.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):

    distances = {}

    # Number of features, and the single test row as a Series
    length = testInstance.shape[1]
    testRow = testInstance.iloc[0]

    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        distances[x] = euclideanDistance(testRow, trainingSet.iloc[x], length)

    # Sorting them on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))

    # Extracting the row positions of the top k neighbors
    neighbors = [sorted_d[x][0] for x in range(k)]

    # Counting how often each class occurs among the neighbors
    classVotes = {}
    for x in range(len(neighbors)):
        # The class label is stored in the last column
        response = trainingSet.iloc[neighbors[x]].iloc[-1]

        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1

    # The class with the most votes wins
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return (sortedVotes[0][0], neighbors)

In [19]:
testSet = [[7.2, 3.6, 5.1, 2.5]]

# Convert the list to a DataFrame (tabular form)
test = pd.DataFrame(testSet)
print(test)

     0    1    2    3
0  7.2  3.6  5.1  2.5


##### If K = 1

In [15]:
print('\n\nWith 1 Nearest Neighbour \n\n')
k = 1

# Running KNN model
result,neigh = knn(data, test, k)

# Predicted class
print('\nPredicted Class of the datapoint = ', result)

# Nearest neighbor
print('\nNearest Neighbour of the datapoints = ',neigh)



With 1 Nearest Neighbour 



Predicted Class of the datapoint =  Iris-virginica

Nearest Neighbour of the datapoints =  [141]


##### If K = 3

In [16]:
print('\n\nWith 3 Nearest Neighbours\n\n')
# Setting number of neighbors = 3 
k = 3 
# Running KNN model 
result,neigh = knn(data, test, k) 

# Predicted class 
print('\nPredicted class of the datapoint = ',result)

# Nearest neighbor
print('\nNearest Neighbours of the datapoints = ',neigh)




With 3 Nearest Neighbours



Predicted class of the datapoint =  Iris-virginica

Nearest Neighbours of the datapoints =  [141, 139, 120]


##### If K = 5

In [17]:
print('\n\nWith 5 Nearest Neighbours\n\n')
# Setting number of neighbors = 5
k = 5
# Running KNN model 
result,neigh = knn(data, test, k) 

# Predicted class 
print('\nPredicted class of the datapoint = ',result)

# Nearest neighbor
print('\nNearest Neighbours of the datapoints = ',neigh)



With 5 Nearest Neighbours



Predicted class of the datapoint =  Iris-virginica

Nearest Neighbours of the datapoints =  [141, 139, 120, 145, 144]
