## K-Nearest-Neighbors (KNN)
The K-Nearest Neighbors algorithm is a lazy, instance-based learner: it has no separate training phase and simply stores the training data, which it consults at prediction time. A held-out test set is still needed to evaluate the model.

When a prediction is required for a new data instance, the KNN algorithm searches the stored data for the k instances nearest to (most similar to) the new instance. It then outputs the mean of their outcomes for a regression problem, or the mode (most frequent class) for a classification problem. The value of k is user-specified.
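The prediction step described above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, its parameters, and the toy data are all assumptions made for this example, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3, regression=False):
    """Predict the outcome for `query` from its k nearest training points."""
    # Distance from the query to every stored instance (Euclidean here).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest instances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    outcomes = [label for _, label in nearest]
    if regression:
        # Regression: mean of the neighbors' outcomes.
        return sum(outcomes) / len(outcomes)
    # Classification: most frequent class among the neighbors.
    return Counter(outcomes).most_common(1)[0][0]


# Toy data, made up for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

predicted_class = knn_predict(points, labels, (1.1, 1.0), k=3)
predicted_value = knn_predict(points, values, (1.1, 1.0), k=3, regression=True)
print(predicted_class, predicted_value)
```

Every prediction scans all stored points, which is why KNN is cheap to "train" but can be slow to query on large data sets.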

The similarity between instances is calculated using measures such as Euclidean distance and Hamming distance.
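As a rough sketch of the two measures named above: Euclidean distance suits numeric features, while Hamming distance counts mismatched positions and suits categorical or binary features. The function names below are chosen for this example only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


d_euclid = euclidean_distance((0.0, 0.0), (3.0, 4.0))
d_hamming = hamming_distance("karolin", "kathrin")
print(d_euclid, d_hamming)
```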

![title](images/K-Nearest-Neighbors.png)

## Prepare Data

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# get data
iris = load_iris()
x = iris.data
y = iris.target

# split the data into train (80%) and test (20%) sets, since no separate test set is provided
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

## KNN Classifier

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# initialize classifier
k = 3 # number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# train/fit the model
knn.fit(x_train, y_train)

# use the model to predict the classes of the test data
y_predict = knn.predict(x_test)

# get the accuracy score
metrics.accuracy_score(y_test, y_predict)

1.0

## GridSearchCV (Grid Search with Cross-Validation) - Trying Different Parameter Values 
Grid search is a hyperparameter-tuning procedure: it evaluates every combination of values in a user-supplied grid and keeps the combination that scores best for the given model.
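The idea can be sketched without any library. In the hypothetical example below, `grid_search` and `toy_score` are names invented for illustration; `toy_score` stands in for "fit a model with these parameters and measure it" and is not a real evaluation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in `param_grid` and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function: by construction it peaks at
# n_neighbors=5 with uniform weights.
def toy_score(n_neighbors, weights):
    penalty = 0.0 if weights == "uniform" else 0.05
    return 1.0 - abs(n_neighbors - 5) * 0.01 - penalty


grid = {"n_neighbors": range(1, 10), "weights": ["uniform", "distance"]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the grid sizes (here 9 × 2 = 18), so the cost grows quickly as parameters are added.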

The goal of cross-validation is to test the model's ability to predict new data that was not used to fit it, in order to flag problems such as overfitting or selection bias and to give insight into how the model will generalize to an independent data set.
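To make the mechanics concrete, here is a minimal sketch of how k-fold cross-validation partitions sample indices. The helper `k_fold_indices` is an assumption made for this example; it does no shuffling or stratification, which library implementations may add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of the folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in folds:
    print(f"validate on {validation}, train on {len(train)} samples")
```

Each sample lands in the validation fold exactly once, so averaging the per-fold scores uses every row for evaluation without ever scoring a model on rows it was fitted on.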

In [3]:
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {
    'n_neighbors': np.arange(1, 25)
}

# use grid search to test a range of values for n_neighbors
# cv=10 is a common choice for the number of folds
# cv=10 means the data passed to fit() is split into 10 folds: 9 for training and 1 for validation
# the folds rotate so that each one serves as the validation fold exactly once
knn = KNeighborsClassifier()
knn_gscv = GridSearchCV(knn, param_grid, cv=10)

# fit model to data
# NOTE: fitting on the full data set (x, y) includes the rows in x_test, so the
# accuracy score computed below is optimistic; fit on (x_train, y_train) to keep
# the test set unseen
knn_gscv.fit(x, y)

# predict
y_predict_gscv = knn_gscv.predict(x_test)

# check the best n-neighbors value
print(f'Best K-Neighbors ==> {knn_gscv.best_params_}')

# mean cross-validated score of the best parameter setting
print(f'Mean Score = {knn_gscv.best_score_}')

# Accuracy score
print(f'Accuracy Score = {metrics.accuracy_score(y_test, y_predict_gscv)}')

Best K-Neighbors ==> {'n_neighbors': 13}
Mean Score = 0.9800000000000001
Accuracy Score = 0.9666666666666667
