# k-Nearest Neighbors
__MATH 3480__ - Dr. Michael Olson

Reading:

[Machine Learning Landscape](https://raw.githubusercontent.com/drolsonmi/math3480/main/Notes/Images/3480_05_ML_Landscape.png)

## The Concept behind k-Nearest Neighbors
The k-Nearest Neighbors (KNN) algorithm takes a point and gives it a classification based on the characteristics of points near it.
* Example: Cats have sharper claws and shorter ears, dogs have less sharp claws and longer ears
    * Graph this on the board, then add another unknown point 

The value of $k$ tells KNN to look at the $k$ nearest points.
* Count the number of neighbors with each category
* The category with the highest count becomes the classification of our point in question
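
The counting procedure above can be sketched directly in NumPy. This is only an illustrative sketch, not how scikit-learn implements it; the function name `knn_classify` and the toy cat/dog measurements are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: columns are (claw sharpness, ear length)
X_train = np.array([[9.0, 3.0], [8.0, 2.5], [8.5, 3.5],   # cats
                    [4.0, 7.0], [3.5, 8.0], [5.0, 7.5]])  # dogs
y_train = np.array(['cat', 'cat', 'cat', 'dog', 'dog', 'dog'])

label = knn_classify(X_train, y_train, np.array([8.2, 3.1]), k=3)
print(label)
```

The unknown point sits among the three cat points, so all three votes go to the same category.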

The result can change as $k$ changes, so how do we choose $k$? There is no single correct value, but the following rules of thumb are common:
* Start near $k = \sqrt{n}$, where $n$ is the number of training points
* If $k$ is even, the vote could be tied between two categories
* If $k$ is a multiple of the number of categories, the vote could be tied across all categories
* Put it all together $\to$ choose a prime $k$ near $\sqrt{n}$
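
These rules of thumb can be turned into a small helper. The function `suggest_k` below is a hypothetical illustration (it is not part of scikit-learn) and assumes at least two categories; it only encodes the heuristic and does not guarantee the best $k$ for a given dataset.

```python
import math

def is_prime(m):
    """Return True if m is a prime number."""
    if m < 2:
        return False
    for d in range(2, math.isqrt(m) + 1):
        if m % d == 0:
            return False
    return True

def suggest_k(n, n_classes):
    """Suggest an odd prime near sqrt(n) that is not a multiple of n_classes."""
    target = math.sqrt(n)
    # Search a range comfortably wider than sqrt(n)
    candidates = [m for m in range(2, 2 * math.isqrt(n) + 3)
                  if is_prime(m) and m % 2 == 1 and m % n_classes != 0]
    # Closest to sqrt(n); ties go to the smaller value
    return min(candidates, key=lambda m: (abs(m - target), m))

k_small = suggest_k(100, 3)   # sqrt(100) = 10
k_iris = suggest_k(150, 3)    # sqrt(150) is about 12.2
print(k_small, k_iris)
```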

How do we determine the distance? We have many different distance measures:
* Manhattan distance (L1-norm)
* Euclidean distance (L2-norm)
* Chebyshev distance (L- $\infty$ norm)
* Cosine distance
* Jaccard distance (if we are dealing with categorical variables)
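
To make the list concrete, here is a sketch that computes each distance by hand with NumPy. The vectors `a` and `b` and the sets `s` and `t` are arbitrary values chosen for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Manhattan (L1): sum of absolute differences
manhattan = np.abs(a - b).sum()
# Euclidean (L2): square root of the sum of squared differences
euclidean = np.sqrt(((a - b) ** 2).sum())
# Chebyshev (L-infinity): largest absolute difference
chebyshev = np.abs(a - b).max()
# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard distance on categorical data, treated as sets:
# 1 minus (size of intersection / size of union)
s = {'red', 'small', 'round'}
t = {'red', 'large', 'round'}
jaccard = 1 - len(s & t) / len(s | t)

print(manhattan, euclidean, chebyshev, cosine, jaccard)
```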

The standard (default) option is generally the Euclidean distance between the point in question $x$ and a training point $x_i$:
* For 2 variables: $d = \sqrt{(x_0-x_{i0})^2 + (x_1-x_{i1})^2}$
* For 3 variables: $d = \sqrt{(x_0-x_{i0})^2 + (x_1-x_{i1})^2 + (x_2-x_{i2})^2}$
* For 4 variables: $d = \sqrt{(x_0-x_{i0})^2 + (x_1-x_{i1})^2 + (x_2-x_{i2})^2 + (x_3-x_{i3})^2}$
* etc.
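
The pattern in these formulas extends to any number of variables. Writing $p$ for the number of variables (a symbol introduced here only for illustration), the general form is

$$d = \sqrt{\sum_{j=0}^{p-1} (x_j - x_{ij})^2}$$

Setting $p = 2$, $3$, or $4$ recovers the three cases listed above.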

-----
## Random Dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_features=2, n_classes=3, random_state=0)

for grp,color in zip([0,1,2],['orange','green','blue']):
    X_grp = X[y==grp]
    plt.scatter(X_grp[:, 0], X_grp[:, 1], c=color, label=grp)

plt.title("Gaussian divided into three quantiles")
plt.legend()
plt.show()

In [None]:
print(X[:10])
print(y[:10])

In [None]:
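# Create the model

from sklearn.neighbors import KNeighborsClassifier

# n = 100 samples (the make_gaussian_quantiles default), so sqrt(n) = 10;
# 11 is the nearest prime and is not a multiple of the 3 classes
k = 11

knn_class = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
knn_class.fit(X, y)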

In [None]:
# Four new points to classify
x_test = np.array([[0.7,-0.5],
                   [0.5,0.5],
                   [-1.3,0.7],
                   [-1,-1.4]])

# X and y were generated above; plot them again with the test points on top

for grp,color in zip([0,1,2],['orange','green','blue']):
    X_grp = X[y==grp]
    plt.scatter(X_grp[:, 0], X_grp[:, 1], c=color, label=grp)

plt.scatter(x_test[:,0], x_test[:,1], c='red', marker='*')

# Label each test point with its row index in x_test
for i in range(len(x_test)):
    plt.annotate(str(i), (x_test[i,0], x_test[i,1]))

plt.title("Gaussian divided into three quantiles")
plt.legend()
plt.show()

In [None]:
y_test = knn_class.predict(x_test)

print(y_test)

-----
KNN can also be used as a regressor to predict a numeric value. For a point $x$, it predicts $\hat{y}$ as the average of the $y$ values of the $k$ nearest points:
$$\hat{y} = \frac{1}{k}\sum_{j=1}^k y_j$$
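
The averaging formula can be checked by hand with a few lines of NumPy. This is a minimal sketch for a single feature; the function name `knn_regress` and the small dataset are made up for illustration and are separate from the scikit-learn example that follows.

```python
import numpy as np

def knn_regress(x_train, y_train, x_new, k):
    """Predict y at x_new as the mean y of the k nearest training points."""
    # With one feature, the distance is the absolute difference
    distances = np.abs(x_train - x_new)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical data following y = x^2
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# The two nearest points to 2.9 are x = 3 and x = 2, so y_hat = (9 + 4) / 2
y_hat = knn_regress(x_train, y_train, 2.9, k=2)
print(y_hat)
```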

In [None]:
import numpy as np
import matplotlib.pyplot as plt

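# 100 random points: x is uniform on [0, 10), y is x^2 plus uniform noise
X = np.random.rand(100,2)*10
x = X[:,0]
y = X[:,1] + x**2

plt.scatter(x,y)

# scikit-learn expects features with shape (n_samples, n_features)
print(x[:10].reshape(-1,1))  # column: 10 samples, 1 feature (the shape we need)
print(x[:10].reshape(1,-1))  # row: 1 sample, 10 features (not what we want)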

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=2)
knr.fit(x.reshape(-1, 1),y)

# Predict at the training points themselves
y_pred = knr.predict(x.reshape(-1, 1))

plt.scatter(x,y, c='blue')       # observed values
plt.scatter(x,y_pred, c='red')   # KNN predictions

In [None]:
# Predict at 10 new random x values
x_test = np.random.rand(10)*10
y_test = knr.predict(x_test.reshape(-1,1))

plt.scatter(x,y, c='blue')            # training data
plt.scatter(x_test,y_test, c='red')   # predictions at the new points

-----
## Iris Flower Dataset

In [14]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
list(iris)

In [None]:
print(iris['feature_names'])
print(iris['data'][:10,:])

In [None]:
print(iris['target_names'])
print(iris['target'])

In [None]:
import seaborn as sns
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])

def species_name(x):
    return iris['target_names'][x]

iris_df['Species'] = pd.Series(iris['target']).apply(species_name)
display(iris_df)

sns.pairplot(iris_df, hue='Species')

In [None]:
sns.scatterplot(data=iris_df,
                x='petal length (cm)',
                y='petal width (cm)',
                hue='Species')

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(iris['data'][:,0],   # sepal length
                     iris['data'][:,2],   # petal length
                     iris['data'][:,3],   # petal width
                     c=iris['target'])

ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Petal Length (cm)')
ax.set_zlabel('Petal Width (cm)')

## Preprocessing
1. Missing Data - No missing values in this example
2. Encode Categorical Variables - Using original data, no categorical variables
3. Split the data
4. Feature Scaling
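
Steps 3 and 4 might look like the following sketch. The 80/20 split, `random_state=0`, and the choice of `StandardScaler` are assumptions made for illustration, not settings prescribed by these notes. Scaling matters for KNN because distances are otherwise dominated by the features with the largest ranges.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris['data'], iris['target']

# Step 3: hold out 20% of the rows for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 4: learn the scaling from the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```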

## The KNN model

## Evaluate the model
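
One possible way to fit and evaluate a KNN classifier on the iris data is sketched below. The split settings, the use of `StandardScaler`, and the choice of accuracy plus a confusion matrix as evaluation tools are assumptions for illustration. The value $k = 11$ follows the rule of thumb from earlier: the training set has 120 rows, $\sqrt{120} \approx 11$, and 11 is prime.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'],
    test_size=0.2, random_state=0, stratify=iris['target'])

# Scale using statistics from the training set only
scaler = StandardScaler().fit(X_train)

# Fit the model on the scaled training data
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(scaler.transform(X_train), y_train)

# Predict on the held-out test data and compare with the true labels
y_pred = knn.predict(scaler.transform(X_test))
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class

print(acc)
print(cm)
```

The diagonal of the confusion matrix counts the correctly classified test points, so accuracy equals the diagonal sum divided by the number of test points.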