# kNN

## Introduction
k-Nearest Neighbours (kNN) is one of the simplest classification techniques and is widely used in machine learning. kNN is a non-parametric, lazy learning technique.

Non-parametric means that it makes no underlying assumptions about the distribution of the data; the model structure is determined from the data itself.

Lazy learning means that kNN has no explicit training phase, unlike most algorithms. Prediction uses most (often all) of the stored training data, so no training has to be done before predicting a sample point.

Because of these two properties, kNN has many real-world applications, namely anomaly detection, spam detection, credit rating, OCR, etc.

In this notebook we'll look at how to apply kNN to the famous digit recognition dataset, MNIST.
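
Before moving to MNIST, the lazy, non-parametric idea described above can be sketched on a toy problem. This is only an illustrative sketch: the helper name ```knn_predict_one``` and the six 2-D points are made up for this example and are not part of the notebook's dataset. Note that nothing is "trained"; the stored points are used directly at prediction time.

```python
from collections import Counter

import numpy as np

# Toy 2-D dataset (hypothetical values, chosen only for illustration):
# three points near the origin labelled 0, three points near (5, 5) labelled 1.
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.7],
                  [5.0, 5.0], [5.3, 4.8], [4.7, 5.4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])


def knn_predict_one(X_stored, y_stored, query, k=3):
    """Classify one query point by majority vote among its k nearest stored points."""
    # "Lazy": all of the work happens here, at prediction time.
    distances = np.linalg.norm(X_stored - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_stored[nearest].tolist())
    return votes.most_common(1)[0][0]


print(knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3])))  # query near the first cluster
print(knn_predict_one(X_toy, y_toy, np.array([4.9, 5.1])))  # query near the second cluster
```

The first query lands among the points labelled 0 and the second among those labelled 1, so the votes are unanimous in both cases.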

## Preparing the dataset
1. Download and place the ```mnist_train.csv``` and ```mnist_test.csv``` files into the current directory
2. Read ```mnist_train.csv``` and ```mnist_test.csv```

Link: https://pjreddie.com/projects/mnist-in-csv/

In [1]:
import pandas as pd # for reading and manipulating csv files
import numpy as np # for mathematical manipulation

In [2]:
# read the csv file
# note the csv file doesn't include a header row, hence we'll use the auto generated headers
df_train = pd.read_csv('./mnist_train.csv', header=None)
df_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Here we can see that column 0 holds the label of each row, so we'll extract it and keep the remaining 784 columns as the pixel data.

In [3]:
data = df_train.drop([0], axis=1)
labels = df_train[0]

In [4]:
# convert the data and labels into numpy arrays
X_train = np.array(data)
y_train = np.array(labels)

In [5]:
X_train.shape, y_train.shape

((60000, 784), (60000,))

In [6]:
df_test = pd.read_csv('./mnist_test.csv', header=None)
df_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
data_test = df_test.drop([0], axis=1)
labels_test = df_test[0]

In [8]:
# convert the data and labels into numpy arrays
X_test = np.array(data_test)
y_test = np.array(labels_test)
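
The train and test files are prepared with the same steps (drop column 0, keep it as the label, convert to NumPy). One possible way to avoid repeating them is a small helper. This is a sketch only: ```split_labels_and_pixels``` is a hypothetical name, and the three-row CSV below is a made-up stand-in with 4 "pixel" columns instead of 784.

```python
import io

import numpy as np
import pandas as pd


def split_labels_and_pixels(df):
    """Return (X, y) where y is column 0 and X holds the remaining columns."""
    y = df[0].to_numpy()
    X = df.drop(columns=[0]).to_numpy()
    return X, y


# Tiny stand-in for an MNIST-style CSV: label first, then the pixel values.
fake_csv = io.StringIO("5,0,0,255,0\n0,0,128,0,0\n4,9,0,0,7\n")
df_demo = pd.read_csv(fake_csv, header=None)

X_demo, y_demo = split_labels_and_pixels(df_demo)
print(X_demo.shape, y_demo.shape)
```

With the real files, the same helper would be called once on ```df_train``` and once on ```df_test```.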

## kNN algorithm from the scikit-learn library
Before implementing our own kNN algorithm, let's see how well sklearn's kNN works on our dataset.

In [9]:
# import the kNN classifier
from sklearn.neighbors import KNeighborsClassifier

In [10]:
# create an instance of kNN classifier with k=5
knn = KNeighborsClassifier(n_neighbors=5)

In [11]:
# fit the model using X_train as training data and y_train as target values
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
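
The output above shows ```metric='minkowski'``` with ```p=2```. The Minkowski distance of order 2 is the ordinary Euclidean distance, and order 1 is the Manhattan distance. A minimal sketch of that relationship, using a hand-written ```minkowski_distance``` helper (a hypothetical name, not sklearn's internal implementation):

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a = np.array([0.0, 3.0, 0.0])
b = np.array([4.0, 0.0, 0.0])

print(minkowski_distance(a, b, p=2))  # Euclidean distance
print(minkowski_distance(a, b, p=1))  # Manhattan distance
print(np.linalg.norm(a - b))          # NumPy's Euclidean norm, for comparison
```

For these two vectors the differences are 4 and 3, so the order-2 distance agrees with ```np.linalg.norm```.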

Let's check how the kNN model does on a single test data point

In [12]:
# returns the distances to the k closest neighbours and their indices
dist, indices = knn.kneighbors(X_test[54].reshape(1,-1))

In [13]:
for i in range(len(indices)):
    print(y_train[indices[i]])

[6 6 6 6 6]


In [14]:
# get the correct label
y_test[54]

6

All five nearest neighbours are labelled 6, so the model predicts that the digit is 6, and it indeed is. Awesome! Now, let's calculate the accuracy over all the test data points

In [15]:
# returns the mean accuracy on the given test data and labels
knn.score(X_test, y_test)

0.9688

Accuracy = 96.88%
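
The mean accuracy is simply the fraction of test points whose predicted label equals the true label. A minimal sketch of that calculation, with a hypothetical ```accuracy``` helper and made-up labels (not results from the model above):

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels and predictions, for illustration only.
y_true_demo = [6, 2, 1, 0, 4]
y_pred_demo = [6, 2, 7, 0, 4]
print(accuracy(y_true_demo, y_pred_demo))  # 4 of the 5 predictions match
```

Read the same way, a score of 0.9688 on the standard 10,000-image MNIST test set corresponds to 9,688 correctly classified digits.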

We've taken a look at sklearn's implementation of the kNN algorithm. Now it's time to implement our own.

## kNN Algorithm from scratch
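The class below mirrors the small part of sklearn's interface used above (```fit```, ```predict```, ```score```). ```fit``` simply stores the training data. ```predict``` computes the Euclidean distance from each query point to every training point, keeps the ```n_neighbors``` closest ones, and returns the most common label among them.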

In [None]:
from collections import Counter

In [None]:
class KNeighborsClassifier:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.labels = []
        self.X_train = []
        self.y_train = []
        self.distances = []
        
    def fit(self, X_train, y_train):
        self.X_train = np.array(X_train)
        self.y_train = np.array(y_train)
        
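    def euclidean_distance(self, X):
        # distance from the query point X to every stored training point,
        # paired with that training point's label
        x = np.array(X)
        distances = []
        for index, data_point in enumerate(self.X_train):
            distance = np.linalg.norm(data_point - x)
            # use the labels stored by fit(), not the global y_train
            distances.append([distance, self.y_train[index]])

        return distances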
            
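    def get_class_votes(self, distances):
        # keep the n_neighbors closest [distance, label] pairs
        distances_sorted = sorted(distances, key=lambda pair: pair[0])[:self.n_neighbors]
        # majority vote over the labels of those neighbours
        label = Counter([pair[1] for pair in distances_sorted]).most_common(1)[0][0]

        # wrapped in a list so that predict() returns an array of shape (n_samples, 1)
        return [label]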
    
    def predict(self, X):
        predicted = []
        for element in X:
            distances = self.euclidean_distance(element)
            predicted.append(self.get_class_votes(distances))
        return np.array(predicted)
    
    def score(self, X, y):
        predicted = self.predict(X)
        _y = y.reshape(-1,1)
        checked = np.equal(predicted, _y)
        
        return np.mean(checked)

In [None]:
# create an instance of kNN classifier with k=5
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
# fit the model using X_train as training data and y_train as target values
knn.fit(X_train, y_train)

In [None]:
# returns the mean accuracy on the given test data and labels
knn.score(X_test, y_test)
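
The pure-Python loop above compares every test point against all 60,000 training points one at a time, which can take a very long time on the full dataset. One possible speed-up is to compute all distances with matrix operations. The sketch below is an assumption-laden alternative, not the notebook's implementation: ```knn_predict_batch``` is a hypothetical helper, it assumes non-negative integer labels (true for digits 0 to 9), and it breaks voting ties in favour of the smallest label, which may differ from ```Counter.most_common```.

```python
import numpy as np


def knn_predict_batch(X_train, y_train, X_query, k=5):
    """Predict labels for every row of X_query using brute-force kNN."""
    X_train = np.asarray(X_train, dtype=np.float64)
    X_query = np.asarray(X_query, dtype=np.float64)
    y_train = np.asarray(y_train)

    # Squared Euclidean distances via ||q - t||^2 = ||q||^2 - 2 q.t + ||t||^2.
    # The result has shape (n_query, n_train).
    sq_dists = (
        np.sum(X_query ** 2, axis=1)[:, None]
        - 2.0 * X_query @ X_train.T
        + np.sum(X_train ** 2, axis=1)[None, :]
    )

    # Indices of the k smallest distances in each row
    # (their order within the k does not matter for voting).
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]
    neighbour_labels = y_train[nearest]

    # Majority vote per row; np.bincount needs non-negative integer labels.
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])


# Synthetic two-cluster data, for illustration only.
rng = np.random.default_rng(0)
X_small = np.vstack([rng.normal(0.0, 0.5, size=(20, 4)),
                     rng.normal(5.0, 0.5, size=(20, 4))])
y_small = np.array([0] * 20 + [1] * 20)
queries = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])

print(knn_predict_batch(X_small, y_small, queries, k=5))
```

On full MNIST, a 10,000 x 60,000 matrix of float64 distances needs roughly 4.8 GB, so in practice the test set would be passed through this function in smaller chunks (for example a few hundred rows at a time).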