## kNN introduction:

K Nearest Neighbor (KNN from now on) is one of those algorithms that is very simple to understand yet works remarkably well in practice. It is also surprisingly versatile: its applications range from vision to proteins to computational geometry to graphs.

**KNN is a non-parametric, lazy learning algorithm.** That is a pretty concise statement, so let us unpack it. A technique is **non-parametric** when it makes no assumptions about the underlying data distribution. This is useful because most real-world data does not obey the typical theoretical assumptions (e.g. Gaussian mixtures, linear separability). Non-parametric algorithms like KNN come to the rescue here.

It is also a **lazy algorithm**: it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal, so training is fast. Lack of generalization means that KNN keeps all the training data; more exactly, all the training data is needed during the testing phase. (This is an exaggeration, but not far from the truth.) This is in contrast to techniques like SVM, where you can discard all non-support vectors without any problem. Most lazy algorithms, and KNN in particular, make decisions based on the entire training data set (or, in the best case, a subset of it).

The trade-off is pretty obvious here: **there is a non-existent or minimal training phase but a costly testing phase.** The cost is in terms of both time and memory. More time might be needed because, in the worst case, all data points might take part in the decision. More memory is needed because we need to store all the training data.
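The trade-off can be made concrete with a minimal sketch. The class name `LazyNearestNeighbor` and the counter `distance_evaluations` are hypothetical names chosen for illustration; the sketch assumes k = 1 and Euclidean distance, and is not meant as a reference implementation.

```python
import math


class LazyNearestNeighbor:
    """1-nearest-neighbor classifier that makes the lazy-learning cost visible."""

    def fit(self, points, labels):
        # "Training" only stores the data; nothing is learned or summarized.
        self.points = list(points)
        self.labels = list(labels)
        self.distance_evaluations = 0
        return self

    def predict(self, query):
        # All the work happens at query time: every stored point is visited.
        best_label, best_dist = None, math.inf
        for point, label in zip(self.points, self.labels):
            self.distance_evaluations += 1
            dist = math.dist(point, query)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label


model = LazyNearestNeighbor().fit(
    [(1.0, 1.0), (1.5, 2.0), (6.0, 5.0), (7.0, 7.0)],
    ["-", "-", "+", "+"],
)
print(model.predict((6.0, 5.5)))   # label of the closest stored point
print(model.distance_evaluations)  # one distance per stored point
```

Here `fit` costs almost nothing, while each call to `predict` scans the whole stored data set, which is the memory and time cost described above.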

### Assumptions in KNN
Before using KNN, let us revisit some of its assumptions.

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or multidimensional vectors. Since the points are in a feature space, they have a notion of distance. This need not be Euclidean distance, although it is the one most commonly used.
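As an illustration of what "a notion of distance" can mean, here is a small sketch of three common metrics. The text only names Euclidean distance; Manhattan and Chebyshev are added here as examples of alternatives, and the function names are our own.

```python
import math


def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Largest absolute difference along any single coordinate.
    return max(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))  # 5.0 7.0 4.0
```

The same pair of points is 5, 7, or 4 units apart depending on the metric, so the choice of metric can change which points count as nearest.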

The training data consists of a set of vectors and a class label associated with each vector. In the simplest case, the label is either + or – (for positive or negative classes), but KNN works equally well with an arbitrary number of classes.

We are also given a single number "k". This number decides how many neighbors (where neighbors are defined by the distance metric) influence the classification. It is usually an odd number when there are 2 classes, so that the vote cannot tie. If k=1, the algorithm is simply called the nearest neighbor algorithm.
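Putting the assumptions together, classification can be sketched in a few lines of plain Python. The function name `knn_classify` and the toy data are made up for illustration, and the sketch assumes Euclidean distance with an unweighted majority vote.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    # Rank every training point by its Euclidean distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Keep the labels of the k closest points and take a majority vote.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
labels = ["-", "-", "-", "+", "+", "+"]

print(knn_classify(points, labels, (2.0, 2.0), k=3))  # -
print(knn_classify(points, labels, (6.0, 6.0), k=1))  # +
```

With an even k and two classes the vote can tie. This sketch then returns whichever tied label was counted first, which is one reason to prefer an odd k.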


### KNN for Density Estimation

Although classification remains the primary application of KNN, we can also use it for density estimation. Since KNN is non-parametric, it can estimate arbitrary distributions. The idea is very similar to the use of a [Parzen window](http://en.wikipedia.org/wiki/Parzen_window). Instead of using a fixed-size hypercube and kernel functions, we do the estimation as follows: to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. Now estimate the density using the formula,

<p>
        <img src = "assets/1.png">
</p>


Where n is the total number of data points and V is the volume of the hypercube. Notice that the numerator is essentially a constant, so the density is driven by the volume. The intuition is this: suppose the density at x is very high. Then we can find k points near x very quickly, and these points are also very close to x (by definition of high density). This means the volume of the hypercube is small and the resulting density is high. Now suppose the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low.
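The procedure can be sketched as follows, assuming the formula in the figure is the standard kNN estimate p(x) ≈ k / (n·V). The function name `knn_density` is hypothetical. The growing hypercube is implemented by noting that a cube with half-side r captures exactly the points whose largest per-coordinate difference from x is at most r.

```python
def knn_density(samples, x, k):
    """Estimate the density at x as k / (n * V), with V the hypercube volume."""
    n = len(samples)
    dim = len(x)
    # Half-side each sample would need in order to fall inside a cube centered at x.
    half_sides = sorted(
        max(abs(s_i - x_i) for s_i, x_i in zip(s, x)) for s in samples
    )
    r = half_sides[k - 1]  # smallest half-side that captures k samples
    if r == 0:
        raise ValueError("k samples coincide with x; the volume would be zero")
    volume = (2 * r) ** dim
    return k / (n * volume)


samples = [(1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
print(knn_density(samples, (2.5,), k=2))   # dense region, small hypercube
print(knn_density(samples, (10.0,), k=2))  # sparse region, large hypercube
```

The first query sits among closely spaced samples, so the cube stays small and the estimate is high. The second sits next to an isolated sample, so the cube has to grow and the estimate drops.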

The volume plays a role similar to the bandwidth parameter in kernel density estimation. In fact, KNN is one of the common methods for estimating the bandwidth (e.g. in adaptive mean shift).
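One simple way to derive a per-point bandwidth from nearest neighbors is to use the distance from each sample to its k-th nearest other sample. The sketch below illustrates that idea only. The function name `knn_bandwidths` is hypothetical, and this is not claimed to be the exact scheme used by any particular adaptive mean shift implementation.

```python
import math


def knn_bandwidths(samples, k):
    """For each sample, return the distance to its k-th nearest other sample."""
    bandwidths = []
    for i, center in enumerate(samples):
        # Distances from this sample to every other sample, closest first.
        dists = sorted(
            math.dist(center, other)
            for j, other in enumerate(samples)
            if j != i
        )
        bandwidths.append(dists[k - 1])
    return bandwidths


samples = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(knn_bandwidths(samples, k=1))  # [1.0, 1.0, 1.0, 8.0]
```

Samples in the dense cluster get a narrow bandwidth, while the isolated sample at 10 gets a wide one, mirroring how the hypercube volume adapts to local density.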

<p>
        <img src = "assets/2.png">
</p>

<p>
        <img src = "assets/3.png">
</p>

<p>
        <img src = "assets/4.png" width = 300px height = 300px>
        <img src = "assets/5.png" width = 300px height = 300px>
</p>

## Code sample 1: kNN fr

In [1]:
from sklearn import datasets
import numpy as np
import pandas as pd

In [5]:
# define column names
names = [
  'sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'class',
]

# load training data
df = pd.read_csv('iris.data', header=None, names=names)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [11]:
from sklearn.model_selection import train_test_split

# create design matrix X and target vector y
X = np.array(df.iloc[:, 0:4])  # end index is exclusive
y = np.array(df['class'])    # another way of indexing a pandas df

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors=3)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
pred = knn.predict(X_test)

# evaluate accuracy
print("accuracy: {}".format(accuracy_score(y_test, pred)))

accuracy: 0.98


## REFERENCES:

- [knn scratch](https://kraj3.com.np/blog/2019/06/implementation-of-knn-from-scratch-in-python/)
- [knn from scratch](https://dataaspirant.com/2016/12/27/k-nearest-neighbor-algorithm-implementaion-python-scratch/)
- [Mail spam or not - kNN](https://anujkatiyal.com/blog/2017/10/01/ml-knn/#.XrZCVnUzZuQ)
- [kNN in depth analysis](https://tomaszgolan.github.io/introduction_to_machine_learning/markdown/introduction_to_machine_learning_01_knn/introduction_to_machine_learning_01_knn/)
- [kNN numpy NYC](https://nycdatascience.com/blog/student-works/machine-learning/knn-classifier-from-scratch-numpy-only/)
- [Maths kNN](http://www.datascribble.com/blog/machine-learning/understanding-math-behind-knn-codes-python/)
- [A Complete Guide to K-Nearest-Neighbors with Applications in Python and R](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
- [kNN tds1](https://towardsdatascience.com/knn-k-nearest-neighbors-1-a4707b24bd1d)
- [KNN algorithm and implementation from scratch](https://medium.com/datadriveninvestor/knn-algorithm-and-implementation-from-scratch-b9f9b739c28f)
- [kNN from scratch](https://towardsdatascience.com/lets-make-a-knn-classifier-from-scratch-e73c43da346d)
- [kNN numpy scratch](https://towardsdatascience.com/k-nearest-neighbors-classification-from-scratch-with-numpy-cb222ecfeac1)
- [ML basics with kNN](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)
- [A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm](https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/)