# K-NEAREST NEIGHBORS
One of the simplest supervised ML algorithms, mostly used for classification.
It classifies a new data point based on how its nearest neighbors are classified.

**k**: parameter that refers to the number of nearest neighbors to include in the majority voting process.

## How to choose K
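Choosing the right value of k, a process called parameter tuning, is important for better accuracy.
- Too low: predictions are noisy and sensitive to outliers
- Too big: takes longer to process, and points from farther away (often from another class) start to dominate the vote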
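
One way to tune k is to score a few candidate values on held-out points and keep the best one. The sketch below uses leave-one-out evaluation on a made-up toy dataset with one deliberately mislabeled point; the helper names (`knn_predict`, `leave_one_out_accuracy`) and the data are assumptions for illustration, not part of the diabetes notebook.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    # rank the labeled points by Euclidean distance to the query
    ranked = sorted(zip(points, labels), key=lambda pair: math.dist(pair[0], query))
    # majority vote among the k closest labels
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    # predict each point from all the other points and count the hits
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)

# two clusters; (1.5, 1.5) sits in the class-0 cluster but carries label 1 (noise)
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (6, 6), (6, 7), (7, 6), (7, 7), (6.5, 6.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first k with the highest accuracy
print(scores, best_k)
```

With k=1 the single noisy point flips the prediction for all of its neighbors, while k=3 and k=5 outvote it.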

Common rules of thumb:
1. k = sqrt(n), where n is the total number of data points
2. Pick an odd value of k to avoid ties when voting between two classes
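
The two rules of thumb can be combined into a small helper. This is only a sketch: `suggest_k` is a hypothetical name, and stepping down to the next odd number (rather than up) is an arbitrary choice.

```python
import math

def suggest_k(n_points: int) -> int:
    """Return an odd k close to sqrt(n_points)."""
    k = max(1, round(math.sqrt(n_points)))
    # an even k can tie between two classes, so step down to the odd value below it
    return k if k % 2 == 1 else k - 1

print(suggest_k(154))  # sqrt(154) is about 12.4, rounds to 12, odd value below is 11
print(suggest_k(768))  # sqrt(768) is about 27.7, rounds to 28, odd value below is 27
```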


## When do we use KNN?
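- Data is labeled
- Data is noise free
- Dataset is small: KNN is a lazy learner, so it builds no model during training and compares every new point against the stored data at prediction time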

## How does the algorithm work?
Normal classification based on labeled data.
- Calculate the Euclidean distance between our point and the surrounding labeled points
- Then pick the k nearest points and assign the class that wins the majority vote among them
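
The steps above can be sketched from scratch in plain Python. The function name `classify` and the toy points are assumptions for illustration; scikit-learn's `KNeighborsClassifier` does the same job far more efficiently.

```python
import math
from collections import Counter

def classify(train_points, train_labels, new_point, k):
    # step 1: Euclidean distance from the new point to every labeled point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # step 2: keep the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # step 3: majority vote among their labels
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(classify(train_points, train_labels, (2.0, 2.0), k=3))  # all 3 neighbors are class 0
print(classify(train_points, train_labels, (6.0, 7.0), k=3))  # all 3 neighbors are class 1
```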

# Use Case: Diabetes

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [12]:
dataset = pd.read_csv('/Users/Hussein/Documents/Anaconda/K-Nearest Neighbors/diabetes.csv')

In [13]:
dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [14]:
len(dataset)

768

### Replace zeroes with the mean value of the column

In [17]:
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
for column in zero_not_accepted:
    # a zero is not a valid reading in these columns, so mark it as missing
    dataset[column] = dataset[column].replace(0, np.nan)
    mean = int(dataset[column].mean(skipna=True))  # skipna=True ignores NaN; int() truncates the mean
    dataset[column] = dataset[column].replace(np.nan, mean)


In [32]:
# split data set
X = dataset.iloc[:, 0:8]
y = dataset.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

> Rule of thumb: scale your features for any algorithm that computes distances or assumes normality.

In [34]:
# feature scaling
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)
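
To see why scaling matters for a distance-based method, the sketch below finds the nearest neighbor of one row before and after standardizing. The four rows are made-up values (a small-range feature next to a large-range one), and `standardize` is a hypothetical helper that mimics what `StandardScaler` does (subtract the column mean, divide by the population standard deviation).

```python
import math
import statistics

def standardize(rows):
    # scale each column to zero mean and unit (population) standard deviation
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(rows, query_index):
    # index of the row closest (Euclidean) to rows[query_index], excluding itself
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(rows[i], query))

# made-up rows: (small-range feature, large-range feature)
rows = [(0.2, 100.0), (2.0, 105.0), (0.25, 120.0), (1.9, 140.0)]

print(nearest_index(rows, 0))               # raw data: the large-range feature decides
print(nearest_index(standardize(rows), 0))  # scaled data: both features contribute
```

On the raw values the neighbor is chosen almost entirely by the second column; after scaling, the row that is nearly identical in the first column becomes the closest one.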

In [35]:
import math
math.sqrt(len(y_test))

12.409673645990857

In [40]:
# define the classifier: sqrt(n) is about 12.4 above, so use the nearby odd value k=11
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [41]:
# evaluate the model: rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[94 13]
 [15 32]]


In [42]:
print(f1_score(y_test, y_pred))  # balances precision and recall, so both false positives and false negatives lower it

0.6956521739130436


In [43]:
print(accuracy_score(y_test, y_pred))  # overall fraction of correct predictions

0.8181818181818182
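
Both scores can be re-derived by hand from the confusion matrix printed earlier. The sketch assumes scikit-learn's layout for binary labels 0 and 1, which is `[[TN, FP], [FN, TP]]`.

```python
# values copied from the confusion matrix output
tn, fp = 94, 13  # true class 0: predicted 0, predicted 1
fn, tp = 15, 32  # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4))
print(round(f1, 4), round(accuracy, 4))  # matches the f1_score and accuracy_score outputs
```

Accuracy looks better than F1 here because the majority class 0 is predicted well, while about a third of the actual positives are missed.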
