# KNN Algorithm to Classify Iris Species

### Dataset Name : Iris Flower Dataset
### Number of Attributes : 5
### Number of Instances : 150

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import operator

### Function to calculate Euclidean Distance
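The Euclidean distance between two instances is the square root of the sum of the squared differences between their feature values. It is calculated between the test instance and every point in the training set.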
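
As a reference for the calculation, and assuming each instance has $n$ numeric feature values (the symbols $p$, $q$ and $n$ are introduced here only for illustration), the distance between a test instance $p$ and a training point $q$ can be written as:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

The class label is not a feature, so it does not appear in the sum.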

In [2]:
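def euclideanDistance(instance1, instance2):
    # The last element of each instance is the class label, so it is
    # excluded from the distance calculation.
    distance = 0
    length = len(instance1) - 1
    for x in range(length):
        # Accumulate the squared difference for each feature
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)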
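
A small worked example may make the calculation concrete. The sketch below is not the notebook's function: `euclidean_distance`, `row_a` and `row_b` are hypothetical names, and the rows are made-up values that assume four measurements followed by a numeric class label in the last position.

```python
import math

def euclidean_distance(a, b):
    """Distance over feature values only; the last element is assumed to be the label."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:-1], b[:-1])))

# Hypothetical rows: four measurements followed by a class label.
row_a = [5.0, 3.0, 1.0, 0.0, 1.0]
row_b = [2.0, 7.0, 1.0, 0.0, 2.0]

d = euclidean_distance(row_a, row_b)
print(d)  # 5.0, because sqrt(3**2 + 4**2) = 5

# Cross-check against the standard library (math.dist, Python 3.8+).
print(math.dist(row_a[:-1], row_b[:-1]))  # 5.0
```

Note that the labels differ (1.0 and 2.0) but do not affect the result, because only the feature values enter the sum.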

### Function to find K nearest training set points to each test instance
The distance between the test instance and each training example is calculated using the euclideanDistance() function. Each training example is stored in a list together with its distance. The list is sorted by distance in ascending order, and the first K items are returned as the K nearest neighbours of the test instance.

In [3]:
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x])
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
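
The same selection can be sketched without sorting the full list. The example below is an alternative under stated assumptions, not the notebook's implementation: `k_nearest`, `train` and `query` are hypothetical names, the data is made up, and the last element of each row is assumed to be the class label. It uses `heapq.nsmallest` from the standard library, which returns the k smallest items in ascending order.

```python
import heapq
import math

def k_nearest(training_rows, query, k):
    """Return the k training rows closest to query, nearest first."""
    def dist(row):
        # Compare feature values only; the last element is the label.
        return math.dist(row[:-1], query[:-1])
    return heapq.nsmallest(k, training_rows, key=dist)

# Hypothetical training rows: two features followed by a class label.
train = [
    [1.0, 1.0, 1],
    [1.2, 0.9, 1],
    [5.0, 5.0, 2],
    [5.1, 4.8, 2],
]
query = [1.1, 1.0, None]  # the label of the query is unknown and ignored

nearest = k_nearest(train, query, k=2)
print(nearest)  # [[1.0, 1.0, 1], [1.2, 0.9, 1]]
```

For a dataset as small as Iris, a full sort is perfectly adequate; selecting only the k smallest distances mainly matters when the training set is large.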

### Function to predict the class of the test instance
Count the number of occurrences of each class among the K nearest neighbours. Sort the classes by their counts in decreasing order. The class with the highest count is the predicted class.

In [22]:
def getResponse(neighbors):
    # Count how many neighbours belong to each class
    classVotes = {}
    for x in range(len(neighbors)):
        # The class label is the last element of each neighbour
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1

    # Sort the classes by vote count, highest first
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
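
The vote can also be illustrated with `collections.Counter`. This is a minimal sketch with hypothetical names (`majority_label`, `neighbors`) and made-up rows, assuming the class label is the last element of each neighbour.

```python
from collections import Counter

def majority_label(neighbors):
    """Return the most frequent class label among the neighbours."""
    labels = [row[-1] for row in neighbors]
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical neighbours: four measurements followed by a class label.
neighbors = [
    [5.1, 3.5, 1.4, 0.2, 1.0],
    [6.0, 2.9, 4.5, 1.5, 2.0],
    [4.9, 3.0, 1.4, 0.2, 1.0],
]

predicted = majority_label(neighbors)
print(predicted)  # 1.0 (two votes for class 1.0, one vote for class 2.0)
```

When two classes receive the same number of votes, `most_common` keeps the label that was encountered first, so choosing an odd K is a common way to reduce ties in a two-class vote.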

### Function to calculate the accuracy of the KNN Model
For every instance in the test set, the actual target value is compared with the predicted target value. The accuracy is the number of correct predictions divided by the total number of test instances, expressed as a percentage.

In [23]:
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
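
As a quick arithmetic check, the sketch below computes the accuracy for four made-up labels. `accuracy_percent`, `actual` and `predicted` are hypothetical names used only for this illustration.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted label equals the actual label."""
    if not actual:
        raise ValueError("actual must contain at least one label")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Hypothetical labels: the second prediction is wrong.
actual = [1.0, 1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 2.0, 3.0]

score = accuracy_percent(actual, predicted)
print(score)  # 75.0, because 3 of the 4 predictions are correct
```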

### Main Function for the KNN Model
Load the training and test sets using Pandas and convert them to NumPy arrays.

For K=3, we find the 3 nearest neighbours of every instance in the test set using the getNeighbors() function.

Based on these neighbours, we determine the most frequently occurring class among the 3 nearest neighbours of the test instance using the getResponse() function.

We then compare the predicted value for each test instance with the actual value of the target variable and calculate the accuracy using the getAccuracy() function.

In [24]:
def main():
    train_set = pd.read_csv('train.csv')
    train_set=np.array(train_set)
    test_set = pd.read_csv('test.csv')
    test_set=np.array(test_set)
    predictions=[]
    k = 3
    for x in range(len(test_set)):
        neighbors = getNeighbors(train_set, test_set[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('Predicted=' + repr(result) + ', Actual=' + repr(test_set[x][-1]) + '\n')
    accuracy = getAccuracy(test_set, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
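
To try the whole flow without the CSV files, the steps can be sketched on a few in-memory rows. Everything here is an assumption made for illustration: `predict`, `train_rows` and `test_rows` are hypothetical names, the values are made up rather than taken from the Iris data, and the class label is assumed to be the last element of each row.

```python
import math
from collections import Counter

def predict(train_rows, query_row, k):
    """Predict the label of query_row from its k nearest training rows."""
    # Sort by distance over feature values only, then keep the first k rows.
    nearest = sorted(
        train_rows,
        key=lambda row: math.dist(row[:-1], query_row[:-1]),
    )[:k]
    # Majority vote over the labels of the nearest rows.
    return Counter(row[-1] for row in nearest).most_common(1)[0][0]

# Hypothetical data: two features followed by a numeric class label.
train_rows = [
    [1.0, 1.1, 1.0],
    [0.9, 1.0, 1.0],
    [1.2, 0.8, 1.0],
    [4.0, 4.2, 2.0],
    [4.1, 3.9, 2.0],
    [3.8, 4.0, 2.0],
]
test_rows = [
    [1.1, 1.0, 1.0],
    [4.0, 4.0, 2.0],
]

predictions = [predict(train_rows, row, k=3) for row in test_rows]
correct = sum(1 for row, p in zip(test_rows, predictions) if row[-1] == p)
accuracy = 100.0 * correct / len(test_rows)
print(predictions, accuracy)  # [1.0, 2.0] 100.0
```

The two clusters in this toy data are far apart, so a perfect score here says nothing about how the model behaves on real measurements.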

### Calling the main() function to run the KNN Model

In [25]:
main()

Predicted=1.0, Actual=1.0

Predicted=1.0, Actual=1.0

Predicted=2.0, Actual=2.0

Predicted=2.0, Actual=2.0

Predicted=2.0, Actual=2.0

Predicted=3.0, Actual=3.0

Predicted=3.0, Actual=3.0

Predicted=3.0, Actual=3.0

Accuracy: 100.0%
