## K-Nearest Neighbors on PIMA Diabetes Dataset

### Theory

Given a point such as $(x^i_1, x^i_2)$ whose label $y^i$ is unknown, the algorithm finds the $k$ closest labeled data points and predicts the mode (the most common value) of their labels.
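Written out, and assuming Euclidean distance (the metric used by the `distance` method in the implementation), the rule is roughly the following. The symbols $n$ (number of features), $p$ (the query point) and $N_k(p)$ (the set of the $k$ stored points closest to $p$) are notation introduced here for illustration.

$$
\operatorname{dist}(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2},
\qquad
\hat y = \operatorname{mode}\left\{\, y^{i} : x^{i} \in N_k(p) \,\right\}
$$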

Advantages:

* Simple

* No optimization of parameters

    - No need to iteratively update the model

* Works for both classification and regression

Disadvantages:

* Slow at prediction time

    - Computationally intense: every prediction measures the distance to every stored point

* Sensitive to high-dimensional feature vectors

    - Can be combined with Principal Component Analysis (PCA) to reduce the number of dimensions
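
The PCA step mentioned in the last bullet could look roughly like this. It is a minimal NumPy sketch and not part of the original notebook: `pca_project` and `n_components` are hypothetical names, and the random matrix only stands in for the eight feature columns.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto the top principal components."""
    features = np.asarray(features, dtype=float)
    # Center each column so the components describe variance around the mean.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 8))  # synthetic stand-in for 8 feature columns
reduced = pca_project(demo, n_components=3)
print(reduced.shape)  # (20, 3)
```

The reduced matrix would then replace the original features when computing neighbor distances.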




### Algorithm

1. Find the distance from the chosen point to every other point in the dataset.

1. Sort the distances in ascending order and store the first $k$ entries in an array.

1. Take the array of closest points and find the mode of their labels $y$.

1. Return that mode as the predicted label $\hat y$ of the chosen point.
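
The four steps above can be sketched in vectorized NumPy as follows. This is an illustrative version under stated assumptions, not the class used in this notebook: `knn_predict` and the toy arrays are hypothetical, Euclidean distance is assumed, and a tied vote goes to the smallest label value.

```python
import numpy as np

def knn_predict(point, features, labels, k=5):
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    point = np.asarray(point, dtype=float)
    # Step 1: Euclidean distance from `point` to every row of `features`.
    distances = np.sqrt(((features - point) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: mode of the neighbors' labels.
    values, counts = np.unique(labels[nearest], return_counts=True)
    # Step 4: return the most common label (smallest label wins a tie).
    return values[np.argmax(counts)]

# Two well-separated toy clusters, made up for this example.
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict([0.3, 0.3], toy_x, toy_y, k=3))  # 0
print(knn_predict([4.8, 5.0], toy_x, toy_y, k=3))  # 1
```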



In [1]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt


In [2]:
data = pd.read_csv("data/diabetes.csv")
x_data = np.asarray(data.iloc[:,0:8])
y_data = np.asarray(data.loc[:,"Outcome"])

data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
class KNN_Classifier:
    def __init__(self, x_data, y_data):
        self.x_data = x_data
        self.y_data = y_data
    
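    def distance(self, p1, p2):
        # Euclidean distance between two equal-length feature vectors.
        total = 0  # named `total` to avoid shadowing the built-in sum()
        for i in range(len(p1)):
            total += (p1[i] - p2[i])**2

        return total**.5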
    
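    def KNN(self, p, features, labels, k = 5):
        # Pair each point's distance to p with that point's label.
        dist_arr = []
        for i in range(len(features)):
            dist_arr.append((self.distance(p, features[i]), labels[i]))
        # Sort once, after all distances are collected, then keep the k nearest.
        dist_arr.sort()
        return dist_arr[0:k]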

    def prediction(self, p, features, labels, k = 5, return_str = True):
        neighbors = np.asarray(self.KNN(p, features, labels, k), dtype = object)
        d = 0   # votes for label 1 (diabetic)
        nd = 0  # votes for label 0 (non-diabetic)

        for i in range(len(neighbors)):
            if neighbors[i][1] == 1:
                d += 1
            elif neighbors[i][1] == 0:
                nd += 1
        if return_str:
            temp = {"Diabetic":d, "Non-Diabetic":nd}
        else:
            temp = {1:d, 0:nd}

        # Majority vote; a tie goes to the first key (the diabetic class).
        return max(temp, key = temp.get)
    
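    def success_rate(self, k = 5):
        # Percentage of stored points whose predicted label matches the true label.
        # Each point is classified against the full dataset, which includes
        # the point itself as a zero-distance neighbor.
        knnvec = []
        for i in range(len(self.y_data)):
            x = self.prediction(self.x_data[i], self.x_data, self.y_data, k, return_str = False)
            knnvec.append(x)
        return (sum(np.asarray(knnvec) == self.y_data)/len(self.y_data))*100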

In [4]:
knn = KNN_Classifier(x_data, y_data)

In [8]:
knn.success_rate()

80.33854166666666
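
The figure above is measured on the same points the classifier stores, so every query finds itself as a zero-distance neighbor. One possible check, sketched here under assumptions, is a leave-one-out estimate that excludes each point from its own neighbor list. The function `leave_one_out_accuracy` and the toy data are hypothetical, binary 0/1 labels and Euclidean distance are assumed, and a tied vote predicts 0. No result on the PIMA data is claimed.

```python
import numpy as np

def leave_one_out_accuracy(features, labels, k=5):
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = features[:, None, :] - features[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(sq_dist, np.inf)
    nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    neighbor_labels = labels[nearest]  # shape (n, k)
    # Predict 1 only when strictly more than half of the k neighbors are 1.
    predictions = (neighbor_labels.sum(axis=1) * 2 > k).astype(int)
    return (predictions == labels).mean() * 100

# Two well-separated toy clusters, made up for this example.
toy_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(leave_one_out_accuracy(toy_x, toy_y, k=3))  # 100.0
```

With the notebook's arrays, the call would be `leave_one_out_accuracy(x_data, y_data, k=5)`.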

Citations:

https://www.kaggle.com/uciml/pima-indians-diabetes-database