<h2 style="font-size:30px;font-family:Calibri">
    K-Nearest Neighbors
</h2>
KNN is often considered a lazy learner: it does not technically train a model to
make predictions. Instead, an observation is predicted to belong to the class
that makes up the largest proportion of its k nearest observations.
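
To make the voting rule concrete, here is a minimal sketch of it in plain NumPy. The helper `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements its classifier, only the idea behind it.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the most common class among the k training points nearest to query."""
    # Euclidean distance from the query to every training observation
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small, well-separated clusters
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_X, train_y, np.array([0.15, 0.15])))
print(knn_predict(train_X, train_y, np.array([0.95, 1.0])))
```

There is no fitting step here: all of the work happens at prediction time, which is why the method is called lazy.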

In [43]:
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier, NearestNeighbors, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

import warnings
warnings.filterwarnings("ignore")

<h2 style="font-size:30px;font-family:Calibri">
    Finding an Observation's Nearest Neighbors
</h2>

In [44]:
iris = datasets.load_iris()
features = iris.data
target = iris.target

print(f"Features: {features[:5]}")
print(f"Target: {target[:5]}")

Features: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Target: [0 0 0 0 0]


In [45]:
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)
print(f"Standardized Features: {features_standardized[:5]}")

Standardized Features: [[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
 [-1.38535265  0.32841405 -1.39706395 -1.3154443 ]
 [-1.50652052  0.09821729 -1.2833891  -1.3154443 ]
 [-1.02184904  1.24920112 -1.34022653 -1.3154443 ]]


In [46]:
nearest_neighbor = NearestNeighbors(n_neighbors = 2).fit(features_standardized)
print(f"{nearest_neighbor}")

NearestNeighbors(n_neighbors=2)


In [47]:
new_obs = [[1, 1, 1, 1]]
distance, indices = nearest_neighbor.kneighbors(new_obs)
print(f"Distance: {distance}\nIndices: {indices}")

Distance: [[0.49140089 0.74294782]]
Indices: [[124 110]]


In [48]:
features_standardized[indices]

array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
        [0.79566902, 0.32841405, 0.76275827, 1.05393502]]])

In [49]:
euclidean_nearest_neighbor = NearestNeighbors(n_neighbors = 3, metric = 'euclidean').fit(features_standardized)
print(f"{euclidean_nearest_neighbor}")

NearestNeighbors(metric='euclidean', n_neighbors=3)


In [50]:
nearest_neighbor_with_self = euclidean_nearest_neighbor.kneighbors_graph(
    features_standardized
).toarray()

In [51]:
# Each observation is counted as its own nearest neighbor; zero the diagonal to drop it
for i, x in enumerate(nearest_neighbor_with_self):
    x[i] = 0

nearest_neighbor_with_self[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

When we are finding nearest neighbors or using any learning algorithm based on
distance, it is important to transform features so that they are on the same
scale. Distance metrics treat all features as if they were on the same scale, so
if one feature is in millions of dollars and a second feature is in percentages,
the calculated distance will be dominated by the former.
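
The effect can be illustrated with a small sketch. The four rows below are hypothetical (a dollar amount and a percentage), and `nearest_to_first` is a helper written only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is a dollar amount, column 1 is a percentage
X = np.array([[1_000_000.0, 10.0],
              [1_000_500.0, 90.0],
              [2_000_000.0, 12.0],
              [3_000_000.0, 50.0]])

def nearest_to_first(data):
    """Index of the row closest to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(X)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X))

print(f"Nearest to row 0 on raw features: row {raw_nearest}")
print(f"Nearest to row 0 on standardized features: row {scaled_nearest}")
```

On the raw features, row 1 is nearest to row 0 even though its percentage is very different, because a gap of 500 dollars is tiny next to the other dollar gaps. After standardizing, the percentage carries comparable weight and row 2 becomes the nearest.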

In [52]:
knn = KNeighborsClassifier(n_neighbors = 5, n_jobs = -1).fit(features_standardized, target)
new_obs = [[0.75, 0.75, 0.75, 0.75], [1, 1, 1, 1]]
knn.predict(new_obs)

array([1, 2])

In [53]:
knn.predict_proba(new_obs)

array([[0. , 0.6, 0.4],
       [0. , 0. , 1. ]])
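
With the default uniform weights, these probabilities are simply the share of the five nearest neighbors that belong to each class. The sketch below rebuilds them from `kneighbors` as a check; it refits its own classifier so that it can run on its own, and the name `vote_fractions` is introduced only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
new_obs = [[0.75, 0.75, 0.75, 0.75], [1, 1, 1, 1]]

# Indices of the 5 nearest training observations for each new observation
_, neighbor_idx = knn.kneighbors(new_obs)
neighbor_labels = y[neighbor_idx]  # shape (2, 5)

# Fraction of neighbors in each of the 3 classes
vote_fractions = np.array(
    [np.bincount(row, minlength=3) / row.size for row in neighbor_labels]
)

print(vote_fractions)
print(np.allclose(vote_fractions, knn.predict_proba(new_obs)))
```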

<h2 style="font-size:30px;font-family:Calibri">
    Identifying the Best Neighborhood Size
</h2>

In [54]:
knn = KNeighborsClassifier(n_neighbors = 5, n_jobs = -1)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
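# The pipeline standardizes inside each fold, so raw `features` could be passed
# here instead; the output below was produced with the pre-standardized array.
classifier = GridSearchCV(
    pipe, search_space, cv = 5, verbose = 0
).fit(features_standardized, target)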

In [55]:
classifier.best_estimator_.get_params()

{'memory': None,
 'steps': [('standardizer', StandardScaler()),
  ('knn', KNeighborsClassifier(n_jobs=-1, n_neighbors=6))],
 'verbose': False,
 'standardizer': StandardScaler(),
 'knn': KNeighborsClassifier(n_jobs=-1, n_neighbors=6),
 'standardizer__copy': True,
 'standardizer__with_mean': True,
 'standardizer__with_std': True,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': -1,
 'knn__n_neighbors': 6,
 'knn__p': 2,
 'knn__weights': 'uniform'}

<h2 style="font-size:30px;font-family:Calibri">
    Creating a Radius-Based Nearest Neighbor Classifier
</h2>

In [56]:
rnn = RadiusNeighborsClassifier(radius = .5, n_jobs = -1).fit(features_standardized, target)

In [57]:
new_obs = [[1, 1, 1, 1]]
rnn.predict(new_obs)

array([2])
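
One practical difference from the k-based classifier is that an observation may have no training points at all within the radius. As far as I know, scikit-learn raises an error in that case unless `outlier_label` is set. The sketch below assumes a label of `-1` for such observations; that value and the far-away point `[10, 10, 10, 10]` are choices made for this example, and scikit-learn may warn that `-1` is not one of the training classes.

```python
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)

# outlier_label=-1 is an illustrative choice for observations with no neighbors in the radius
rnn_with_outliers = RadiusNeighborsClassifier(
    radius=0.5, outlier_label=-1
).fit(features_standardized, iris.target)

# The second observation is far from every training point
predictions = rnn_with_outliers.predict([[1, 1, 1, 1], [10, 10, 10, 10]])
print(predictions)
```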