# K-Nearest Neighbor
K-Nearest Neighbor (KNN) is one of the simplest supervised machine learning algorithms and is mostly used for classification. It is based on feature similarity: a data point is classified according to how its nearest neighbors are classified.

* KNN stores all available cases and classifies a new case based on a similarity measure.
* The k in KNN is a parameter that sets the number of nearest neighbors included in the majority vote.
* KNN can also be used for regression problems. The only difference is that the prediction is the average of the nearest neighbors' values rather than a majority vote among them.
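
The difference between classification and regression comes down to how the neighbors' targets are aggregated. A minimal sketch, assuming the k nearest neighbors have already been found (the labels and values below are made up for illustration):

```python
from collections import Counter
from statistics import mean

# Hypothetical targets of the k = 5 nearest neighbors (illustrative values only).
neighbor_labels = ["red", "red", "blue", "red", "red"]  # classification targets
neighbor_values = [2.0, 3.0, 4.0, 3.0, 3.0]             # regression targets

# Classification: majority vote over the neighbors' labels.
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]

# Regression: average of the neighbors' numeric targets.
predicted_value = mean(neighbor_values)

print(predicted_class)  # red
print(predicted_value)  # 3.0
```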

### KNN Algorithm
We can implement a KNN model with the following steps:
1. Load the data
2. Initialize the value of k
3. To get the predicted class, iterate over all training data points:
   1. Calculate the distance between the test point and each row of the training dataset. Here we use Euclidean distance, since it is the most popular metric. Other options include Manhattan, Minkowski, Chebyshev, and cosine distance. For categorical variables, Hamming distance can be used.
   2. Sort the calculated distances in ascending order based on distance values
   3. Get top k rows from the sorted array
   4. Get the most frequent class of these rows
   5. Return the predicted class
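
The steps above can be sketched in plain Python. This is an illustrative version, not the scikit-learn implementation used later; the helper names `euclidean` and `knn_predict` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Predict the class of `query` by majority vote among its k nearest training rows."""
    # Step 3.1: distance from the query to every training row.
    distances = [(euclidean(row, query), label) for row, label in zip(train_X, train_y)]
    # Step 3.2: sort ascending by distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the k closest rows.
    nearest = distances[:k]
    # Steps 3.4 and 3.5: return the most frequent class among them.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data, made up for illustration.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(train_X, train_y, query=(2.0, 2.0), k=3))  # red
print(knn_predict(train_X, train_y, query=(6.0, 6.5), k=3))  # blue
```

Sorting every distance costs O(n log n) per query, which is one reason this brute-force approach suits small datasets.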
      
In the figure below, k = 5 and 4 of the 5 nearest neighbors are red, so the new point is predicted to be red.

<img src="Image/KNN.JPG"  width="600" height="300">

### How do we choose 'k'?
Because KNN is based on feature similarity, the prediction depends on which neighbors are counted. Choosing the right value of k is a process called parameter tuning, and it is important for accuracy.

At k = 3, we classify '?' as a square.  
At k = 7, we classify '?' as a triangle.

<img src="Image/KNN2.JPG"  width="600" height="300">

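When k is too small, the prediction is sensitive to noise and outliers (low bias, high variance).  
When k is too large, the decision boundary is over-smoothed (high bias), and each prediction takes somewhat more computation.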
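
The way a prediction can flip as k grows can be reproduced on a toy dataset. The sketch below uses scikit-learn's `KNeighborsClassifier`; the one-dimensional points are made up so that the closest neighbors are squares while triangles are the overall majority.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data, made up for illustration: the two closest points to the query
# are squares, but triangles are the majority once all seven points are counted.
X = [[1], [2], [3], [4], [5], [6], [7]]
y = ["square", "square", "triangle", "triangle", "triangle", "triangle", "square"]
query = [[0]]

predictions = {}
for k in (3, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions[k] = clf.predict(query)[0]

print(predictions[3])  # square
print(predictions[7])  # triangle
```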

A common rule of thumb:
* k ≈ sqrt(n), where n is the number of data points (the use case below takes n from the test set; the training-set size is the more usual choice)
* Choose an odd k to avoid tied votes between two classes
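
This rule of thumb is easy to express as a small helper. The function name `suggest_k` is hypothetical, and the result is only a starting point for tuning, not a guaranteed optimum.

```python
import math

def suggest_k(n_samples):
    """Heuristic starting value for k: floor(sqrt(n)), lowered by one if even."""
    k = max(1, int(math.sqrt(n_samples)))
    # An odd k cannot produce a tied vote between two classes.
    if k % 2 == 0:
        k -= 1
    return k

print(suggest_k(154))  # 11
print(suggest_k(768))  # 27
```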

### When do we use KNN Algorithm?
* Data is labeled
* Data is noise free
* Dataset is small

### Use Case - Predict Diabetes
Problem statement: Predict whether a person will be diagnosed with diabetes or not

Dataset: we have a dataset of 768 people who were or were not diagnosed with diabetes

### 1. Importing the Libraries

In [32]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, classification_report

### 2. Load the dataset and have a look

In [36]:
dataset = pd.read_csv('data/diabetes.csv')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [37]:
dataset.head()

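| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |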


### 3. Data Preprocessing
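A value of zero is not physiologically plausible in columns such as 'Glucose' and 'BloodPressure'; it stands for a missing measurement, and leaving it in would distort the distances KNN computes.  
We replace such values with the mean of the non-zero entries in the respective column.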

In [55]:
# Columns in which a zero stands for a missing measurement
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
# Replace 0 with NaN, compute the mean without the zeros, then fill NaN with that mean
for column in zero_not_accepted:
    # np.nan (lowercase) is used because the np.NaN alias was removed in NumPy 2.0
    dataset[column] = dataset[column].replace(0, np.nan)
    # int() truncates the mean to a whole number
    mean = int(dataset[column].mean(skipna=True))
    dataset[column] = dataset[column].fillna(mean)

print(dataset[dataset['Glucose']==0])

Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
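
To see why the zeros are converted to NaN before the mean is taken, consider a toy column (values made up for illustration). Computing the mean with the zeros still present would pull it well below the real measurements.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration: 0 stands for a missing measurement.
glucose = pd.Series([0, 100, 120, 0, 110])

naive_mean = glucose.mean()                     # zeros drag the mean down
valid_mean = glucose.replace(0, np.nan).mean()  # NaN is skipped by default

# Fill the missing entries with the mean of the valid measurements.
filled = glucose.replace(0, np.nan).fillna(valid_mean)

print(naive_mean)       # 66.0
print(valid_mean)       # 110.0
print(filled.tolist())  # [110.0, 100.0, 120.0, 110.0, 110.0]
```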


### 4. Split Dataset

In [58]:
X = dataset.iloc[:,:8]
y = dataset.iloc[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
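
As a quick check of what `test_size=0.2` means for 768 rows, the sketch below splits placeholder arrays of the same shape. The data is synthetic, and `stratify` is an optional extra that the walkthrough does not use; it keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the diabetes data (768 rows, 8 features).
# The values are synthetic; only the sizes matter here.
X_demo = np.zeros((768, 8))
y_demo = np.array([0, 1] * 384)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (614, 8) (154, 8)
print(y_te.mean())             # 0.5
```

The 154 test rows match the support total shown in the evaluation report.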

### 5. Feature Scaling
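Rule of thumb: scale your features for any algorithm that computes distances or assumes normality. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.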

In [59]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
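
What `StandardScaler` does can be checked by hand: it subtracts the training mean and divides by the training standard deviation, feature by feature. A small sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (illustrative values only).
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 700.0]])

# Fit on the training rows only, then reuse the same statistics for the test row.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# Equivalent manual computation using the training mean and standard deviation.
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_test)
print(np.allclose(scaled_test, manual_test))  # True
```

Without scaling, the second feature would dominate every Euclidean distance simply because its numbers are larger.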

### 6. Fitting the training data

In [62]:
import math
math.sqrt(len(y_test))

12.409673645990857

In [64]:
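# Define the model: initialize K-NN
# sqrt(len(y_test)) is about 12.4; take 12 and subtract 1 to get an odd k
k = 12 - 1
# p=2 is the Minkowski power parameter for Euclidean distance;
# it is redundant here because metric='euclidean' is given explicitly
clf = KNeighborsClassifier(n_neighbors=k, p=2, metric='euclidean')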

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)
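
The square-root heuristic is only a starting point for parameter tuning. One common alternative, sketched here under stated assumptions, is to compare several odd values of k with cross-validation. The data is a synthetic stand-in from `make_classification`, not the diabetes dataset, and the range of candidates is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

candidate_ks = list(range(1, 26, 2))  # odd values only
search = GridSearchCV(pipe, {"knn__n_neighbors": candidate_ks}, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```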

### 7. Evaluate Model

In [71]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f1_score(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[94 13]
 [15 32]]
0.6956521739130436
0.8181818181818182
              precision    recall  f1-score   support

           0       0.86      0.88      0.87       107
           1       0.71      0.68      0.70        47

    accuracy                           0.82       154
   macro avg       0.79      0.78      0.78       154
weighted avg       0.82      0.82      0.82       154
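
The reported scores can be reproduced by hand from the confusion matrix, which helps when reading the report. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]` for class 1 (diabetic) as the positive class.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 94, 13
fn, tp = 15, 32

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.8182
print(round(precision, 4))  # 0.7111
print(round(recall, 4))     # 0.6809
print(round(f1, 4))         # 0.6957
```

The recall of about 0.68 for the diabetic class means 15 of the 47 true cases were missed, which the overall accuracy of 0.82 does not show on its own.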

