# KNN (K-Nearest Neighbors) Classifier

## Introduction

KNN is a supervised classification algorithm.

The KNN algorithm is:

For a new point $x$:
1. Choose $k$, the number of closest neighbors on which the prediction is based.
2. Calculate the distance from $x$ to every point in the training data.
3. Sort the points by distance from $x$ in ascending order.
4. Predict the majority label of the $k$ closest points.
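
The procedure above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the function name `knn_predict` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point.
    distances = [math.dist(p, x) for p in train_points]
    # Indices of the training points, ordered from closest to farthest.
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # Majority label among the k closest points.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labelled "A" and "B".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.5)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(points, labels, (2.0, 2.0), k=3)
print(prediction)
```

Note that there is no training step in this sketch: all the work happens at prediction time, when the distances to the stored points are computed.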

![alt text](./knn.png "title")


As we can see:

> If we choose $k=3$, then for the new point $?$ the predicted class is B.

> If we choose $k=7$, then for the new point $?$ the predicted class is A.
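
The dependence of the prediction on $k$ can be reproduced with a small sketch. The coordinates below are hypothetical (they are not read from the figure) and were chosen only so that the majority vote flips between $k=3$ and $k=7$; the helper `neighbor_votes` is an illustrative name.

```python
import math
from collections import Counter

def neighbor_votes(points, labels, x, k):
    """Count the labels of the k points closest to x (Euclidean distance)."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], x))[:k]
    return Counter(label for _, label in nearest)

# Hypothetical coordinates, chosen so that the vote flips between k=3 and k=7.
points = [(1.0, 0.0), (0.0, 1.2), (3.0, 3.0),      # class B
          (-1.1, 0.0), (0.0, -2.0), (2.0, 1.0),    # class A
          (-2.0, -1.0), (-2.5, 0.0), (0.0, 3.0)]   # class A
labels = ["B", "B", "B", "A", "A", "A", "A", "A", "A"]
new_point = (0.0, 0.0)

votes_k3 = neighbor_votes(points, labels, new_point, k=3)
votes_k7 = neighbor_votes(points, labels, new_point, k=7)
print(votes_k3.most_common(1)[0][0])  # majority label among the 3 nearest points
print(votes_k7.most_common(1)[0][0])  # majority label among the 7 nearest points
```

With these made-up points, two of the three nearest neighbors are B, while five of the seven nearest are A, so a small $k$ follows the immediate neighborhood and a larger $k$ follows the broader one.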

## Predictions and Evaluations

Let's start by exploring the data from the file `KNN.csv`.

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Get the data

In [2]:
df = pd.read_csv('KNN.csv')

In [3]:
df.head()

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC,TARGET CLASS
0,1636.670614,817.988525,2565.995189,358.347163,550.417491,1618.870897,2147.641254,330.727893,1494.878631,845.136088,0
1,1013.40276,577.587332,2644.141273,280.428203,1161.873391,2084.107872,853.404981,447.157619,1193.032521,861.081809,1
2,1300.035501,820.518697,2025.854469,525.562292,922.206261,2552.355407,818.676686,845.491492,1968.367513,1647.186291,1
3,1059.347542,1066.866418,612.000041,480.827789,419.467495,685.666983,852.86781,341.664784,1154.391368,1450.935357,0
4,1018.340526,1313.679056,950.622661,724.742174,843.065903,1370.554164,905.469453,658.118202,539.45935,1899.850792,0


### EDA (Exploratory data analysis)

In [4]:
df

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC,TARGET CLASS
0,1636.670614,817.988525,2565.995189,358.347163,550.417491,1618.870897,2147.641254,330.727893,1494.878631,845.136088,0
1,1013.402760,577.587332,2644.141273,280.428203,1161.873391,2084.107872,853.404981,447.157619,1193.032521,861.081809,1
2,1300.035501,820.518697,2025.854469,525.562292,922.206261,2552.355407,818.676686,845.491492,1968.367513,1647.186291,1
3,1059.347542,1066.866418,612.000041,480.827789,419.467495,685.666983,852.867810,341.664784,1154.391368,1450.935357,0
4,1018.340526,1313.679056,950.622661,724.742174,843.065903,1370.554164,905.469453,658.118202,539.459350,1899.850792,0
5,1587.993461,667.420121,2072.048544,364.624958,905.011385,2229.178514,880.442566,568.698408,645.719217,2156.949990,1
6,1497.293624,822.435853,908.059366,671.797517,1280.270442,1910.066313,1052.812386,713.757499,703.951632,1088.493788,0
7,1172.820769,1166.958461,455.656588,445.572745,1109.894585,1045.551452,1301.821786,264.940137,945.023932,2307.922229,0
8,839.494647,1089.747059,653.699894,659.334590,1529.725972,1521.320489,1401.676227,433.025043,1289.063583,1690.149835,0
9,1028.689140,202.089774,1030.841860,429.109102,943.104815,1978.506855,653.178512,753.655291,1336.375960,1057.864848,1


In [5]:
sns.set(style="ticks", color_codes=True)

In [None]:
sns.pairplot(df)

### Standardize the Variables

> Because KNN predicts from the distances to the $k$ nearest points, features with large numeric ranges dominate the distance. It is therefore better to scale the data, here with a standard scaler.
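
The effect of scaling on the nearest neighbor can be seen in a small sketch. The data and the helper names `standardize` and `nearest_index` are hypothetical; the standardization subtracts the column mean and divides by the population standard deviation, which is assumed here to mirror what `StandardScaler` does.

```python
import math
from statistics import mean, pstdev

def standardize(rows, reference):
    """Scale each column of rows to zero mean and unit variance, using statistics from reference."""
    columns = list(zip(*reference))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]  # population standard deviation
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(points, x):
    """Index of the point closest to x (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], x))

# Hypothetical data: the first feature spans hundreds, the second spans single digits.
train = [(1000.0, 1.0), (1010.0, 9.0), (1200.0, 1.2), (1190.0, 8.8)]
query = (1090.0, 1.1)

raw_nearest = nearest_index(train, query)
scaled_train = standardize(train, train)
scaled_query = standardize([query], train)[0]
scaled_nearest = nearest_index(scaled_train, scaled_query)
print(raw_nearest, scaled_nearest)
```

On the raw values the first feature decides the neighbor almost alone, so the query is matched to a point whose second feature is very different. After standardization both features contribute comparably and the nearest neighbor changes.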

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
features = df.drop('TARGET CLASS', axis=1)

scaler.fit(features)

In [None]:
scaled_features = scaler.transform(features)

In [None]:
X = pd.DataFrame(scaled_features, columns=df.columns[:-1])

In [None]:
y = df['TARGET CLASS']

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

### KNN prediction

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train, y_train)

In [None]:
predictions = knn.predict(X_test)

### Evaluation

In [None]:
from sklearn.metrics import classification_report

In [None]:
# classification_report expects the true labels first, then the predictions.
print(classification_report(y_test, predictions))

We can see that the accuracy is $0.74$.
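
As a reminder of what the accuracy figure measures, the sketch below computes it by hand as the fraction of predictions that match the true labels, together with the error rate, which is its complement. The label lists are hypothetical and are not taken from `KNN.csv`.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Accuracy: share of positions where the prediction equals the true label.
matches = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = matches / len(y_true)

# Error rate: share of positions where they differ.
error_rate = 1 - accuracy
print(accuracy, error_rate)
```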

### Choosing the $k$ value

In [None]:
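# Record the test-set error rate for every k from 1 to 39.
error_rates = []
for i in range(1,40):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    pred_i = model.predict(X_test)
    # Error rate = fraction of test points whose prediction differs from the true label.
    error_rates.append(np.mean(pred_i != y_test))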

In [None]:
sns.lineplot(x=range(1,40), y=error_rates)

In [None]:
# argmin returns a 0-based position; the loop starts at k=1, so add 1 to get k.
np.argmin(error_rates) + 1

We can see that the $k$ with the minimum error rate is 25.
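
One detail is easy to miss when reading the minimum off the list: positions in the list start at 0, while the loop over $k$ starts at 1, so the position of the minimum and the corresponding $k$ differ by one. The sketch below shows the mapping with made-up error rates, using plain Python in place of `np.argmin`, which returns the same 0-based position.

```python
# Hypothetical error rates for k = 1, 2, 3, 4, 5 (made-up numbers).
k_values = list(range(1, 6))
error_rates = [0.30, 0.25, 0.18, 0.21, 0.24]

# 0-based position of the smallest error rate.
best_index = min(range(len(error_rates)), key=error_rates.__getitem__)

# Translate the position back to the k value that produced it.
best_k = k_values[best_index]
print(best_index, best_k)
```

Keeping the list of $k$ values next to the error rates, as `k_values` does here, avoids having to remember the offset.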

### Repeat prediction with $k=25$

In [None]:
knn = KNeighborsClassifier(n_neighbors=25)

In [None]:
knn.fit(X_train, y_train)

In [None]:
predictions = knn.predict(X_test)

In [None]:
# True labels first, then the predictions.
print(classification_report(y_test, predictions))

With $k=25$, the accuracy is $0.82$.