# K-Nearest Neighbors Algorithm

### Implementing the KNN Algorithm Without Using External Libraries

First, let's add the Numpy and Pandas libraries to our workspace.

In [577]:
import numpy as np
import pandas as pd

Next, let's add our Iris dataset to our work using the Pandas library, and simplify and shuffle the created dataframe.

In [578]:
df = pd.read_csv("Iris.csv")
df.drop("Id",inplace=True,axis=1)
df = df.sample(frac=1)
df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
88,5.6,3.0,4.1,1.3,Iris-versicolor
1,4.9,3.0,1.4,0.2,Iris-setosa
135,7.7,3.0,6.1,2.3,Iris-virginica
7,5.0,3.4,1.5,0.2,Iris-setosa
87,6.3,2.3,4.4,1.3,Iris-versicolor
...,...,...,...,...,...
68,6.2,2.2,4.5,1.5,Iris-versicolor
107,7.3,2.9,6.3,1.8,Iris-virginica
48,5.3,3.7,1.5,0.2,Iris-setosa
70,5.9,3.2,4.8,1.8,Iris-versicolor


Let's create a function called 'split_data' to separate the data into 'train' and 'test' sets. This function takes parameters: a numpy array 'X' containing the features of flowers, a numpy array 'y' containing the class labels of the flowers, and a percentage value 'train_size' indicating the proportion in which the data will be split into 'train' and 'test' sets.

In [579]:
def split_data(X, y, train_size):

    start = int(len(X)*train_size)

    X_train = X[0:start, :]
    X_test = X[start:-1, :]
    y_train = y[0:start]
    y_test = y[start:-1]

    return X_train, X_test, y_train, y_test

Now it's time to create the K-Nearest Neighbors Algorithm. Let's take a look at the functions we have created within the KNN class in order:

First, let's take a look at the 'euclidean_distance' and 'manhattan_distance' functions. These functions calculate the distances between all corresponding features of a given test and train data, provided as parameters. They sum up these distances and return the result.

Another function we have is the "predict" function. This function calculates the distances between each test data and each training data, storing them in a list called "distance" along with the indices of the compared training data. Then, it sorts the distances in ascending order from closest to farthest. After that, it keeps only the first "k" distances in the list and removes the others. Now, the list contains only the closest "k" distances and their corresponding indices in the training arrays. By adding the classes of these indices in the "y_train" to a list called "neighbors," we attempt to predict the class of the flower with the closest features among the top "k" flowers. Subsequently, based on the class with the highest frequency in the "neighbors" list, we predict the class of the flower we are trying to predict. Finally, we gather these predictions in a list called "predicted_classes" and return that list.

In [581]:
class KNN:

    def __init__(self, k, distance_type):
        self.distance_type = distance_type
        self.k = k

    def fit(self, X, y):

        assert len(X) == len(y)
        self.X_train = X
        self.y_train = y

    def euclidean_distance(self, X1, X2):

        distance = 0
        for i in range(len(X1)):
            distance += (X1[i] - X2[i]) ** 2
        return np.sqrt(distance)

    def manhattan_distance(self, X1, X2):
        distance = np.abs(np.array(X1) - np.array(X2)).sum()
        return distance

    def predict(self, X_test):

        predicted_classes = []
        
        for test_data_index in range(len(X_test)):
            
            distances = []
            for train_data_index in range(len(self.X_train)):
                if (self.distance_type == "euclidean"):
                    distance = self.euclidean_distance(self.X_train[train_data_index], X_test[test_data_index])
                elif (self.distance_type == "manhattan"):
                    distance = self.manhattan_distance(self.X_train[train_data_index], X_test[test_data_index])
                distances.append([distance, train_data_index])
            distances.sort()
            distances = distances[0:self.k]
            neighbors = []
            for distance, train_data_index in distances:
                neighbors.append(self.y_train[train_data_index])

            predicted_class = max(set(neighbors), key=neighbors.count)
            predicted_classes.append(predicted_class)
            
        return predicted_classes


We will use the Accuracy metric to measure the accuracy of our predictions. Accuracy is calculated by dividing the number of correctly predicted instances for all classes by the total number of test instances.

In [582]:
def print_acuracy(actual, predictions):

    assert len(actual) == len(predictions)
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predictions[i]:
            correct += 1
    print("Acuracy of kNN model : ",correct / float(len(actual)))

Let's assign the columns containing the characteristics of flowers in the DataFrame to a variable named "X," and the columns containing the classes of flowers to a variable named "y" as numpy arrays. Then, let's pass these numpy arrays as parameters to the function called "split_data" that we created above.

In [589]:
X = np.array(df.iloc[:,0:-1])
y = np.array(df.iloc[:,-1:]).reshape((-1,))
X_train, X_test, y_train, y_test = split_data(X, y, train_size=0.7)

In [583]:
k=5
distance="manhattan"
knn = KNN(k,distance)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
print("Manhattan")
print_acuracy(y_test,y_pred)


Manhattan
Acuracy of kNN model :  0.9545454545454546


In [584]:
k=5
distance="euclidean"
knn = KNN(k,distance)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
print("Euclidean")
print_acuracy(y_test,y_pred)

Euclidean
Acuracy of kNN model :  0.9772727272727273


### Implementing the K-Nearest Neighbors Algorithm Using the Scikit-Learn Library

In [585]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [586]:
neigh = KNeighborsClassifier(n_neighbors=5,metric="manhattan")
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Manhattan")
accuracy_score(y_test, y_pred)

Manhattan


0.9555555555555556

In [587]:
neigh = KNeighborsClassifier(n_neighbors=5,metric="euclidean")
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Euclidean")
accuracy_score(y_test, y_pred)

Euclidean


0.9777777777777777