## Nearest Neighbors

Nearest Neighbors finds, for each query point, the k nearest neighbors among a set of input samples.

The model accepts array-like input: NumPy arrays on the host, device arrays (Numba arrays or any object that implements `__cuda_array_interface__`), and cuDF DataFrames.

For information on converting your dataset to cuDF format, refer to the cuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/

For additional information on cuML's Nearest Neighbors implementation: https://rapidsai.github.io/projects/cuml/en/latest/api.html#nearest-neighbors
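
To make the query concrete, here is a minimal CPU-only sketch of what a k-nearest-neighbors lookup computes, assuming brute-force search and squared Euclidean distance (the metric passed to scikit-learn later in this notebook). The helper names `squared_euclidean` and `brute_force_kneighbors` are hypothetical and exist only for this illustration; this is not how cuML implements the search.

```python
import heapq


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_kneighbors(samples, queries, n_neighbors):
    """Return (distances, indices), one row per query, nearest first."""
    distances, indices = [], []
    for q in queries:
        # Score every sample against the query, then keep the k smallest.
        scored = ((squared_euclidean(q, s), i) for i, s in enumerate(samples))
        nearest = heapq.nsmallest(n_neighbors, scored)
        distances.append([d for d, _ in nearest])
        indices.append([i for _, i in nearest])
    return distances, indices


# Toy data: four 2-D samples, queried with the first two of them.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
D, I = brute_force_kneighbors(samples, samples[:2], n_neighbors=2)
print(D)  # [[0.0, 1.0], [0.0, 1.0]]
print(I)  # [[0, 1], [1, 0]]
```

As in the cells below, the queries are a slice of the fitted samples, so each query's nearest neighbor is itself at distance 0.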

In [None]:
import numpy as np
import cudf as gd

from cuml.datasets import make_blobs

from sklearn.neighbors import NearestNeighbors as skNN
from cuml.neighbors import NearestNeighbors as cumlNN

## Define Parameters

In [None]:
n_samples = 2**17
n_features = 40

n_query = 2**13

n_neighbors = 4

## Generate Data

### GPU

In [None]:
%%time
device_data, _ = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=5, random_state=0)

device_data = gd.DataFrame.from_gpu_matrix(device_data)

### Host

In [None]:
host_data = np.asarray(device_data.as_gpu_matrix())

## Scikit-learn Model

In [None]:
%%time
knn_sk = skNN(metric="sqeuclidean", algorithm="brute", n_jobs=-1)
knn_sk.fit(host_data)

D_sk, I_sk = knn_sk.kneighbors(host_data[:n_query], n_neighbors)

## cuML Model

In [None]:
%%time
knn_cuml = cumlNN()
knn_cuml.fit(device_data)

D_cuml, I_cuml = knn_cuml.kneighbors(device_data[:n_query], n_neighbors)

## Compare Results

### Distances

In [None]:
passed = np.allclose(D_sk, D_cuml.as_gpu_matrix(), atol=1e-3)
print('compare knn: cuml vs sklearn distances %s' % ('equal' if passed else 'NOT equal'))
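
`np.allclose(a, b)` treats two values as equal when `|a - b| <= atol + rtol * |b|`, with `rtol` defaulting to `1e-5` and `atol` to `1e-8`. The sketch below re-implements that rule for nested lists to show why the distance check above raises `atol` to `1e-3`. The function `all_close` and the toy distance values are hypothetical and chosen only for illustration.

```python
def all_close(a_rows, b_rows, rtol=1e-5, atol=1e-8):
    """Element-wise closeness test using the same rule as np.allclose."""
    return all(
        abs(a - b) <= atol + rtol * abs(b)
        for row_a, row_b in zip(a_rows, b_rows)
        for a, b in zip(row_a, row_b)
    )


# Made-up distances that differ only by small floating-point noise.
D_ref = [[0.0, 1.0]]
D_test = [[0.0004, 1.0005]]

print(all_close(D_ref, D_test, atol=1e-3))  # True
print(all_close(D_ref, D_test))             # False
```

Near-zero distances, such as a point's distance to itself, are the reason the absolute tolerance matters: the relative term `rtol * |b|` shrinks toward zero with them.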

### Indices

In [None]:
sk_sorted = np.sort(I_sk, axis=1)
cuml_sorted = np.sort(I_cuml.as_gpu_matrix(), axis=1)

diff = sk_sorted - cuml_sorted

# Normalize by the number of compared entries (n_query * n_neighbors), not n_samples.
passed = (len(diff[diff != 0]) / diff.size) < 1e-9
print('compare knn: cuml vs sklearn indices %s' % ('equal' if passed else 'NOT equal'))
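
The index check sorts each row before comparing, so two results count as equal when they contain the same neighbor set even if the neighbors appear in a different order (as can happen when distances tie). A minimal sketch of that comparison on plain lists follows; `mismatch_fraction` and the toy index rows are hypothetical and not part of either library.

```python
def mismatch_fraction(indices_a, indices_b):
    """Fraction of entries that differ after sorting each row."""
    total = 0
    mismatched = 0
    for row_a, row_b in zip(indices_a, indices_b):
        for a, b in zip(sorted(row_a), sorted(row_b)):
            total += 1
            mismatched += a != b
    return mismatched / total


# Same neighbor sets, but the second row is ordered differently.
I_a = [[0, 1, 2], [4, 7, 9]]
I_b = [[0, 1, 2], [7, 4, 9]]
print(mismatch_fraction(I_a, I_b))  # 0.0

# One neighbor in the second row really differs (9 vs 8).
I_c = [[0, 1, 2], [4, 7, 8]]
print(mismatch_fraction(I_a, I_c))
```

Dividing by the number of compared entries keeps the fraction between 0 and 1 regardless of how many queries or neighbors are requested.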