#### How does k-Nearest Neighbors Work

    The kNN algorithm belongs to the family of instance-based, competitive learning, and lazy learning algorithms.
    
    Instance-based: models the problem using data instances (rows) in order to make predictive decisions. kNN is an extreme form of instance-based method because all training observations are retained as part of the model.
    Competitive learning: model elements (data instances) compete with each other in order to make a predictive decision.
    Lazy learning: there is no explicit training phase. The model is not built until a prediction is requested, so all the work happens at prediction time.
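
The "lazy" and "instance-based" properties can be illustrated with a minimal sketch. The class name `LazyKNN` and the toy points are hypothetical and are not part of this notebook; the point is only that `fit` stores the rows and `predict_one` does all the computation.

```python
import math
from collections import Counter

class LazyKNN:
    """Toy kNN classifier (hypothetical name) illustrating lazy learning."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, rows, labels):
        # "Training" only stores the instances; nothing is estimated here.
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # All the work happens at prediction time: measure, rank, vote.
        ranked = sorted(
            zip(self.rows, self.labels),
            key=lambda pair: math.dist(pair[0], query),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

# Made-up 2-D points in two well-separated groups.
model = LazyKNN(k=3).fit(
    [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)],
    ["a", "a", "a", "b", "b"],
)
print(model.predict_one((1.1, 1.0)))
print(model.predict_one((5.1, 5.0)))
```

Because every stored row takes part in every prediction, memory use and prediction time grow with the size of the training set.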
    
    

#### Steps
    Handle Data: Load the dataset from CSV and split it into test/train datasets.
    Similarity: Calculate the distance between two data instances.
    Neighbors: Locate the k most similar data instances.
    Response: Generate a response (prediction) from the k nearest data instances.
    Main: Tie it all together.
    Accuracy: Summarize the accuracy of predictions.
    

##### Step 1: Load data and split into training and test sets

In [13]:
import pandas as pd
# the file has no header row; columns 0-3 are the features, column 4 is the class label
iris_data = pd.read_csv('iris.data.txt', header=None)
X = iris_data.drop(4, axis=1)
y = iris_data[[4]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

##### Step 2: Define the similarity metric

In [16]:
from sklearn.metrics.pairwise import euclidean_distances
# usage: euclidean_distances(X_train, X_test.iloc[[i]]) gives one distance per training row
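
To make the metric concrete, here is a small sketch of the same Euclidean distance, sqrt(sum((x_i - y_i)^2)), written directly in NumPy. The helper name `euclidean_to_all` and the toy rows are assumptions for illustration. Unlike `euclidean_distances`, which returns a column of shape (n, 1), this helper returns a flat array.

```python
import numpy as np

def euclidean_to_all(train, query):
    """Distance from one query row to every row of train (hypothetical helper)."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Broadcasting subtracts the query from each training row.
    diff = train - query
    return np.sqrt((diff ** 2).sum(axis=1))

# Made-up rows: distances from the origin are 0, 5 (a 3-4-5 triangle) and sqrt(2).
toy_train = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
toy_dists = euclidean_to_all(toy_train, [0.0, 0.0])
print(toy_dists)
```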

In [57]:
euclidean_distances(X_train.values, X_test.values[0].reshape(1,-1)) # distance of a test point from each training point

array([[ 0.65574385],
       [ 2.59036677],
       [ 0.64807407],
       [ 3.65786823],
       [ 2.68328157],
       [ 0.70710678],
       [ 3.6138622 ],
       [ 3.75099987],
       [ 3.23728281],
       [ 1.82482876],
       [ 0.67082039],
       [ 3.3       ],
       [ 3.44383507],
       [ 3.27719392],
       [ 0.72111026],
       [ 3.54964787],
       [ 0.67082039],
       [ 2.92574777],
       [ 3.78153408],
       [ 0.43588989],
       [ 0.98994949],
       [ 3.76961536],
       [ 0.64807407],
       [ 0.8660254 ],
       [ 0.9486833 ],
       [ 0.81240384],
       [ 1.43874946],
       [ 1.05356538],
       [ 3.36749165],
       [ 1.86010752],
       [ 0.97467943],
       [ 3.65376518],
       [ 3.82361086],
       [ 1.15758369],
       [ 0.86023253],
       [ 3.60277671],
       [ 1.3       ],
       [ 3.57910603],
       [ 3.41613817],
       [ 0.75498344],
       [ 1.99749844],
       [ 1.73493516],
       [ 0.58309519],
       [ 0.94339811],
       [ 2.40208243],
       [ 0

##### Step 3: Get the k nearest neighbours for each test instance

In [119]:
def getNeighbors(X_train, y_train, test_instance, k):
    # distance from the test instance to every training row, shape (n_train, 1)
    distances = euclidean_distances(X_train.values, test_instance.values.reshape(1,-1))
    # pair each distance with its training label, then sort ascending by distance
    distance_response_tuple = list(zip(distances.flatten(), y_train.values.flatten()))
    distance_response_tuple.sort()
    return distance_response_tuple[:k]

In [191]:
k_nearest_responses = getNeighbors(X_train, y_train, X_test.iloc[0], 30)
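
Sorting (distance, label) tuples breaks distance ties by label. An index-based lookup is one possible alternative: it keeps ties in training order and also tells you which training rows were selected. The sketch below uses the hypothetical helper `nearest_indices` on made-up distances, not the iris data.

```python
import numpy as np

def nearest_indices(distances, k):
    """Indices of the k smallest distances, closest first (hypothetical helper)."""
    distances = np.asarray(distances, dtype=float).ravel()
    # kind="stable" keeps the original order among tied distances.
    return np.argsort(distances, kind="stable")[:k]

toy_distances = np.array([0.9, 0.1, 0.5, 0.1, 2.0])
toy_labels = np.array(["b", "a", "b", "a", "c"])
idx = nearest_indices(toy_distances, 3)
print(idx, toy_labels[idx])
```

The returned indices can be used to look up both the labels and the distances of the neighbours, which is the information the voting step needs.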

##### Step 4: Find response

    One way is to weight all responses equally and take the majority vote.
    The other way is to weight each response by distance, so that the closest neighbour has the highest weight.

In [183]:
# way 1. equal response
from collections import Counter
def get_response(k_nearest_responses):
    response = [i[1] for i in k_nearest_responses]
    most_common = Counter(response).most_common(1)[0][0] # most frequent label among the k neighbours
    return most_common

In [184]:
response = get_response(k_nearest_responses)

In [195]:
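# way 2. weighted response
def get_response_weighted(k_nearest_responses):
    from collections import defaultdict
    response = defaultdict(float)
    for d, r in k_nearest_responses:
        # inverse-distance weight: closer neighbours count more; the small constant avoids division by zero
        response[r] += 1.0 / (d + 1e-9)
    return max(response, key=response.get)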
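
The two voting schemes can disagree on the same set of neighbours. The toy sketch below assumes inverse-distance weights, 1 / (d + eps), which is one common choice. The function names and the neighbour list are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """Every neighbour counts once; the most frequent label wins."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def inverse_distance_vote(neighbors, eps=1e-9):
    """Each neighbour votes with weight 1 / (distance + eps)."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# One very close neighbour of one class, two farther neighbours of another.
toy_neighbors = [(0.1, "setosa"), (0.9, "versicolor"), (1.0, "versicolor")]
print(majority_vote(toy_neighbors))
print(inverse_distance_vote(toy_neighbors))
```

Here the majority vote picks the class with two neighbours, while the weighted vote follows the single very close neighbour.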

In [196]:
# tie it all together
def knn(X_train, y_train, X_test, k):
    # y_train is passed in explicitly instead of being read from the notebook's global scope
    y_pred = []
    for i in range(X_test.shape[0]):
        k_nearest_responses = getNeighbors(X_train, y_train, X_test.iloc[i], k)
        y_pred.append(get_response_weighted(k_nearest_responses)) # get_response or get_response_weighted
    return y_pred

In [197]:
y_pred = knn(X_train, y_train, X_test, 3)

##### Step 5: Accuracy

In [198]:
import numpy as np
def accuracy(y_pred, y_test):
    # convert the list of predictions to an array so the comparison is element-wise
    correct = np.asarray(y_pred) == y_test.values.flatten()
    return correct.sum()*100.0 / correct.size

In [199]:
accuracy(y_pred, y_test)

98.0
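
As a rough sanity check, the hand-written pipeline can be compared with scikit-learn's `KNeighborsClassifier`. This sketch assumes that the iris copy bundled with scikit-learn (`load_iris`) is an acceptable stand-in for `iris.data.txt`. The two copies may differ slightly, so the score is not guaranteed to match the one above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy stands in for 'iris.data.txt'.
iris = load_iris()
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)

# weights="distance" weights each neighbour by the inverse of its distance;
# weights="uniform" would correspond to the plain majority vote.
reference = KNeighborsClassifier(n_neighbors=3, weights="distance")
reference.fit(Xr_train, yr_train)

# score() returns the fraction of correct predictions; scale it to a percentage.
reference_accuracy = reference.score(Xr_test, yr_test) * 100.0
print(reference_accuracy)
```

If the two numbers differ by a lot, the neighbour search or the voting step is the first place to look.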