# kNN

## algorithm

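kNN is non-parametric: it fits no model parameters and predicts directly from the stored training data. To predict for a query point x:
1. calculate the distance from x to every point in the dataset
2. sort the points by increasing distance
3. predict the majority label (classification) or the average value (regression) of the k closest points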
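
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, its `task` argument, and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict the target of one query point x from its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # step 1: Euclidean distance from x to every training point
    distances = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # step 2: indices of the training points, sorted by increasing distance
    order = np.argsort(distances, kind="stable")
    # step 3: aggregate the targets of the k closest points
    nearest = y_train[order[:k]]
    if task == "classification":
        # majority vote; on a tie, the label seen first among the neighbors wins
        return Counter(nearest.tolist()).most_common(1)[0][0]
    return float(nearest.mean())


# toy data invented for this illustration: two well-separated clusters
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(X, labels, [0.5, 0.5], k=3))                     # a
print(knn_predict(X, values, [5.5, 5.5], k=3, task="regression"))  # 11.0
```

Sorting all distances costs O(n log n) per query. Library implementations typically use partial sorts or tree structures instead.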

 ![test](https://qph.fs.quoracdn.net/main-qimg-32a7eff79169c81b19f548248f0962ae)

## preprocessing

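* categorical variables: use one-hot encoding (no need to omit one dummy, since kNN has no coefficients for collinearity to destabilize)
* standardizing/normalizing is essential: kNN is distance-based, so features on larger scales would otherwise dominate the distance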
* text data:
* null values:
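
One way to wire up the first two points is a scikit-learn pipeline, sketched below. The column names and the toy data are invented for illustration; `ColumnTransformer`, `StandardScaler` and `OneHotEncoder` are standard scikit-learn classes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data invented for this illustration
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "income": [20000.0, 25000.0, 30000.0, 80000.0, 90000.0, 100000.0],
    "color": ["red", "red", "blue", "blue", "green", "green"],
})
y = [0, 0, 0, 1, 1, 1]

preprocess = ColumnTransformer([
    # put numeric features on a common scale so neither dominates the distance
    ("num", StandardScaler(), ["height_cm", "income"]),
    # one dummy column per category, none dropped
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
model.fit(df, y)

query = pd.DataFrame({"height_cm": [155.0], "income": [22000.0], "color": ["red"]})
print(model.predict(query))
```

Keeping the scaler inside the pipeline means it is fit on the training data only, which matters once the model is cross-validated.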

## tuning parameters

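k (number of neighbors) <br>
distance:
* Euclidean: $\displaystyle\sqrt{\sum_i{(a_i-b_i)^2}}$ <br>
* Manhattan: $\displaystyle\sum_i{|a_i-b_i|}$ <br>
* cosine: 1 - cosine similarity $\displaystyle1-\frac{a \cdot b}{\|a\|\|b\|}$ <br>
* Mahalanobis: $\displaystyle\sqrt{(a-b)^\top S^{-1}(a-b)}$, where $S$ is the covariance matrix of the data <br>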
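
A common way to choose k and the distance is cross-validated grid search. The sketch below assumes the iris data and an arbitrary candidate grid. The grid values are illustrative, not recommendations, and Mahalanobis is left out because it needs a covariance matrix passed through `metric_params`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# scale inside the pipeline so each CV fold is standardized on its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# candidate values chosen arbitrarily for this illustration
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan", "cosine"],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Small k gives a flexible, noisy decision boundary and large k a smoother one, so the best value depends on the dataset.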

## pluses / minuses

* works well in low dimensions (roughly d<5); distances become less informative as d grows
* few hyperparameters
* super simple
* can weight neighbors by distance (closer neighbors count more)
* can be used for classification / regression
* sensitive to outliers
* can outperform linear regression models when the data does not exhibit a linear relationship
* doesn't work well when there are few observations per predictor
* can be used to fill in missing data (by finding k nearest neighbors based on present features and then filling in the missing value based on the average of the nearest neighbors)
* can be used as a first classifier in a staged process (adding the result as a new feature to the record)
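
The missing-data use can be tried with scikit-learn's `KNNImputer`, which finds neighbors using the features that are present and averages their values for the missing one. A minimal sketch on a made-up matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix invented for this illustration; row 1 is missing its last feature
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])

# neighbors are found with a NaN-aware Euclidean distance over the present features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# rows 0 and 2 are the two nearest neighbors of row 1,
# so the gap is filled with the mean of 3.0 and 2.8
print(X_filled[1])
```

As with prediction, the features should be on comparable scales before imputing, since the neighbor search is distance-based.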

In [None]:
# k-Nearest Neighbor
from sklearn import datasets
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# load the iris dataset
dataset = datasets.load_iris()

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions (on the training data itself, so the scores below are optimistic)
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))