## Phase 3.27
# K-Nearest Neighbors (KNN)
## Objectives
- Get a high-level view of the <a href='#overview'>K-Nearest Neighbors</a> algorithm.
- Look at different <a href='#distance_metrics'>Distance Metrics</a> used with different ML models.
- <a href='#coding'>Code</a> through an example of KNN with a toy dataset.
- <a href='#recap'>Recap</a> KNN by talking through the Pros and Cons.

<a id='overview'></a>
# KNN - Overview

> ***Classifier implementing the k-nearest neighbors vote.***

---

K-Nearest Neighbors is a **non-parametric**, **lazy** learning algorithm. 
- **Non-parametric**: the model makes no *underlying assumptions* about the distribution of data.
- **Lazy learners** (or **instance-based** learning methods) simply store the training examples and postpone generalization until a new instance must be classified or a prediction made.
    - In other words, there is no real training step: fitting just stores the data. This makes training very fast, but prediction is slower and more costly, because the distance calculations happen at prediction time.
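
To make the "lazy" idea concrete, here is a minimal from-scratch sketch. The class name `TinyKNN`, the method `predict_one`, and the toy points are all hypothetical, made up for illustration; this is not how scikit-learn implements KNN, only one simple way to express the idea of storing the data and voting at prediction time.

```python
from collections import Counter
import math


class TinyKNN:
    """Hypothetical minimal KNN classifier, for illustration only."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only memorizes the data: no parameters are estimated.
        self.X_train = list(X)
        self.y_train = list(y)
        return self

    def predict_one(self, point):
        # All of the work happens here: measure the (Euclidean) distance
        # from the new point to every stored training example.
        distances = [
            (math.dist(point, x), label)
            for x, label in zip(self.X_train, self.y_train)
        ]
        # Keep the k closest examples and take a majority vote on their labels.
        nearest = sorted(distances, key=lambda pair: pair[0])[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Toy data (assumed for this sketch): two well-separated clusters.
X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y = ["blue", "blue", "blue", "red", "red", "red"]

model = TinyKNN(k=3).fit(X, y)
prediction = model.predict_one((2, 2))
```

With this toy data, the three stored points closest to `(2, 2)` are all labeled `"blue"`, so that is the prediction. Note that `fit` does almost nothing, while `predict_one` scans every stored example.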
    
---
    
***QUESTION!***

> What color should the **gray point** be?

<img src='./images/knn_intro.png' width=40%>

---

KNN is one of the easier model types to visualize.
<img src='./images/knn-process.png'>
    

<a id='distance_metrics'></a>
# Distance Metrics

## Manhattan Distance
The first (and easiest) distance metric you'll cover is Manhattan distance. Manhattan distance is aptly named, because it measures the distance from one point to another traveling along the axes of a grid.

<img src='./images/manhattan_fs.png' width='300'>

$$ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $$

- For each dimension, you subtract one point's value from the other's, and add the absolute value to the running total.
- The final running total is the Manhattan Distance.

In [1]:
# Manhattan Distance
a = (0, 0)
b = (6, 6)

# Short version:
# sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))

# Long version (separate loop names keep the points a and b from being overwritten):
distance = 0
for a_i, b_i in zip(a, b):
    distance += abs(a_i - b_i)
distance

12

## Euclidean Distance
The equation at the heart of this one is probably familiar: $a^2 + b^2 = c^2$!

<img src='./images/euclidean_fs.png' width='300'>

$$ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $$

- For each dimension, you subtract one point's value from the other's (to get the length of that "side" of the triangle in that dimension), square it, and add it to the running total.
- The square root of that running total is the Euclidean distance.

In [2]:
import numpy as np

# Euclidean Distance
a = (0, 0)
b = (6, 6)

# Short version:
# np.sqrt(sum((a_i - b_i)**2 for a_i, b_i in zip(a, b)))

# Long version (separate loop names keep the points a and b from being overwritten):
distance = 0
for a_i, b_i in zip(a, b):
    distance += (a_i - b_i)**2
np.sqrt(distance)

8.48528137423857

## Generalized Distance Function: Minkowski Distance

The Minkowski Distance is the **generalized version of both of the above distance functions**, controlled by an adjustable exponent: $c$.

$$\large d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$$

For example:

> If: $c = 1$
> 
> $ \large d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^1\right)^\frac{1}{1}$
> 
> $=$
>
> $ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $ 

***(Manhattan Distance)***

> If: $c = 2$
> 
> $ \large d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^2\right)^\frac{1}{2}$
> 
> $=$
>
> $ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $

***(Euclidean Distance)***

... and so on: *a Minkowski Distance with $c = 3$*, *a Minkowski Distance with $c = 4$*...
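
The formula above translates almost directly into code. The sketch below uses a hypothetical helper name, `minkowski_distance`, and reuses the points `(0, 0)` and `(6, 6)` from the earlier cells to check that $c = 1$ and $c = 2$ reproduce the Manhattan and Euclidean results.

```python
def minkowski_distance(x, y, c):
    """Minkowski distance between two equal-length points with exponent c."""
    # Sum |x_i - y_i|^c over every dimension, then take the c-th root.
    return sum(abs(x_i - y_i) ** c for x_i, y_i in zip(x, y)) ** (1 / c)


a = (0, 0)
b = (6, 6)

manhattan = minkowski_distance(a, b, c=1)  # same value as the Manhattan cell: 12
euclidean = minkowski_distance(a, b, c=2)  # same value as the Euclidean cell: sqrt(72)
cubic = minkowski_distance(a, b, c=3)      # a higher exponent gives a smaller distance here
```

In scikit-learn, `KNeighborsClassifier` exposes this exponent as its `p` parameter.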

<a id='coding'></a>
# Coding! KNN

1. Load `diabetes` dataset.
2. Preprocess the data as usual, including a `train_test_split`.
3. Create several models, experimenting with different K-values.
4. Track model performance on the training and test sets.
    - *What Metrics could we use?*

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('./data/diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
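
Steps 2 to 4 could be sketched roughly as follows. To keep the example runnable without `./data/diabetes.csv`, it uses synthetic stand-in data from `make_classification`; that data, the particular K-values, the split size, and the choice of accuracy as the metric are all assumptions for illustration, not part of the original walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in with 8 features and a binary target.
# With the real data, X would be the feature columns of df and y would be
# df['Outcome'].
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Step 2: hold out a test set, then scale the features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Steps 3 and 4: fit one model per K and record (train, test) accuracy.
results = {}
for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    results[k] = (
        knn.score(X_train_scaled, y_train),  # mean accuracy on training data
        knn.score(X_test_scaled, y_test),    # mean accuracy on held-out data
    )

for k, (train_acc, test_acc) in results.items():
    print(f"k={k:>2}  train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

Comparing the two columns for each K is one way to spot overfitting: a large gap between training and test accuracy suggests K is too small. Accuracy is only one possible metric; for an imbalanced medical outcome, recall, precision, or F1 may be more informative.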


<a id='recap'></a>
# Recap: Pros and Cons of KNNs 

**Pros:**
- No assumptions about data. 
    - Useful for nonlinear data.
- Simple algorithm to explain and understand/interpret.
- Reasonably high accuracy.
    - Often good enough for a baseline, but usually not competitive with stronger supervised learning models.
- Versatile.
    - Useful for classification or regression.

**Cons:**
- Computationally expensive.
    - The algorithm stores all of the training data.
    - High memory requirement.
- The prediction stage can be slow when the number of training examples $n$ is large.
- Sensitive to irrelevant features and the scale of the data.
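
The scale sensitivity is easy to demonstrate. The sketch below takes just two columns (`Glucose` and `DiabetesPedigreeFunction`) from the five rows shown in the `df.head()` output and finds the nearest neighbor of row 0 before and after standardizing. The helper names `standardize` and `nearest_to` are hypothetical, and restricting to two columns is a simplification for illustration.

```python
import math

# (Glucose, DiabetesPedigreeFunction) for rows 0-4 of the df.head() output.
rows = [(148, 0.627), (85, 0.351), (183, 0.672), (89, 0.167), (137, 2.288)]


def euclidean(p, q):
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))


def standardize(data):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*data))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in data]


def nearest_to(index, data):
    """Index of the row closest to data[index], excluding the row itself."""
    others = [i for i in range(len(data)) if i != index]
    return min(others, key=lambda i: euclidean(data[index], data[i]))


raw_neighbor = nearest_to(0, rows)                   # distances dominated by Glucose
scaled_neighbor = nearest_to(0, standardize(rows))   # both features contribute
```

On the raw values, row 0's nearest neighbor is row 4, because the `Glucose` differences (tens of units) swamp the `DiabetesPedigreeFunction` differences (fractions of a unit). After standardizing, the nearest neighbor becomes row 2, so the choice of scaling can change which points vote.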

# Resources

- KNN: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- Distance Metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html