# Bug 001b. Jan 24, 2020

This notebook illustrates a bug in the cuML v0.11 kNN classifier. cuML offers three ways to predict with kNN: `predict()`, `predict_proba()`, and `kneighbors()`. All three should produce the same predictions, and they should agree with scikit-learn's three corresponding predictions. The outputs below show that they do not: cuML's `predict()` disagrees with scikit-learn for both cuDF and NumPy input, and cuML's `predict_proba()` with NumPy input yields class 0 for each of the first five query rows.

This notebook is based on the RAPIDS example notebook [here][1], which works correctly. Setting the variable `use_gaussian` to `True` below switches the data to `make_blobs` clusters, and the notebook then works correctly.

[1]: https://github.com/rapidsai/notebooks/blob/branch-0.13/cuml/kneighbors_classifier_demo.ipynb
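
The reason the three methods should agree can be sketched with NumPy alone. The helper below is a hypothetical illustration (`neighbor_vote` is not part of cuML or scikit-learn) and assumes uniform weights with ties resolved toward the smallest label. Given the training labels of each query's nearest neighbors, the class probabilities are just vote fractions, and the predicted class is their argmax. The neighbor labels used here are the ones printed by the `kneighbors()` cells later in this notebook.

```python
import numpy as np

def neighbor_vote(neighbor_labels, n_classes):
    """Hypothetical helper: turn neighbor labels into probabilities and predictions.

    neighbor_labels has shape (n_queries, n_neighbors) and holds the training
    labels of each query's nearest neighbors.
    """
    neighbor_labels = np.asarray(neighbor_labels)
    # counts[i, c] = how many neighbors of query i carry label c
    counts = np.stack(
        [np.bincount(row, minlength=n_classes) for row in neighbor_labels]
    )
    # Uniform-weight class probabilities: the fraction of votes per class
    proba = counts / neighbor_labels.shape[1]
    # argmax returns the first maximum, so ties go to the smallest label
    pred = proba.argmax(axis=1)
    return proba, pred

# Neighbor labels of the first five query rows, as printed later in this notebook
labels = [[0, 1, 2, 1], [1, 0, 0, 0], [1, 1, 1, 3], [2, 4, 4, 4], [0, 4, 1, 0]]
proba, pred = neighbor_vote(labels, n_classes=5)
print(pred)
```

Under these assumptions the result for the five rows is `[1 0 1 4 0]`, which is what scikit-learn's `predict()` returns below.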

In [1]:
import os

import numpy as np

from sklearn.datasets import make_blobs

import pandas as pd
import cudf as gd

from sklearn.neighbors import KNeighborsClassifier as skKNC
from cuml.neighbors import KNeighborsClassifier as cumlKNC

from scipy import stats
from cuml.neighbors import NearestNeighbors as cuNearestNeighbors
from sklearn.neighbors import NearestNeighbors as skKNearestNeighbors

## Define Parameters

In [2]:
n_samples = 2**17
n_features = 40

n_query = 5000

n_neighbors = 4

## Generate Data

### Host

In [3]:
use_gaussian = False

In [4]:
if use_gaussian:
    X_host_train, y_host_train = make_blobs(
       n_samples=n_samples, n_features=n_features, centers=5, random_state=0)

    X_host_train = pd.DataFrame(X_host_train)
    y_host_train = pd.DataFrame(y_host_train)

    X_host_test, y_host_test = make_blobs(
       n_samples=n_query, n_features=n_features, centers=5, random_state=0)

    X_host_test = pd.DataFrame(X_host_test)
    y_host_test = pd.DataFrame(y_host_test)

In [5]:
if not use_gaussian:
    # Uniform random features in [0, 1) with labels 0..4 drawn independently
    # of the features. No seed is set, so values differ between runs.
    X_host_train = pd.DataFrame( np.random.uniform(0,1,(n_samples,n_features)) )
    y_host_train = pd.DataFrame( np.random.randint(0,5,(n_samples,1)) )
    X_host_test = pd.DataFrame( np.random.uniform(0,1,(n_query,n_features)) )
    y_host_test = pd.DataFrame( np.random.randint(0,5,(n_query,1)) )
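
Because the uniform data above is drawn without a seed, a rerun will not reproduce the exact labels shown in the outputs. A seeded variant might look like the sketch below. The helper name `make_uniform_data` and the seed values are hypothetical and not part of the original notebook.

```python
import numpy as np

def make_uniform_data(n_rows, n_features, n_classes=5, seed=0):
    """Hypothetical helper: reproducible uniform features with random labels."""
    rng = np.random.default_rng(seed)
    # Features uniform in [0, 1), same distribution as the cell above
    X = rng.uniform(0.0, 1.0, size=(n_rows, n_features))
    # Labels 0..n_classes-1, drawn independently of the features
    y = rng.integers(0, n_classes, size=(n_rows, 1))
    return X, y

# Small demonstration sizes; the notebook itself uses n_samples and n_query
X_small, y_small = make_uniform_data(8, 40, seed=42)
```

Train and test sets should be generated with different seeds so the query rows are not copies of training rows.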

### Device

In [6]:
X_device_train = gd.DataFrame.from_pandas(X_host_train)
y_device_train = gd.DataFrame.from_pandas(y_host_train)

In [7]:
X_device_test = gd.DataFrame.from_pandas(X_host_test)
y_device_test = gd.DataFrame.from_pandas(y_host_test)

## Scikit-learn Model

In [8]:
%%time
knn_sk = skKNC(algorithm="brute", n_neighbors=n_neighbors, n_jobs=6)
knn_sk.fit(X_host_train, y_host_train)

sk_result = knn_sk.predict(X_host_test)
sk_result_p = knn_sk.predict_proba(X_host_test)

  


CPU times: user 1min 19s, sys: 17 s, total: 1min 36s
Wall time: 33.9 s


In [9]:
sk_result[:5]

array([1, 0, 1, 4, 0])

In [10]:
sk_result_p.argmax(axis=1)[:5]

array([1, 0, 1, 4, 0])

In [11]:
model = skKNearestNeighbors(n_neighbors=n_neighbors,n_jobs=6)
model.fit(X_host_train)
distances, indices = model.kneighbors(X_host_test)

In [12]:
for i in range(5):
    d = y_host_train.values.flatten()[ indices[i,:].astype(int) ]
    print(i, d, stats.mode(d)[0] )

0 [0 1 2 1] [1]
1 [1 0 0 0] [0]
2 [1 1 1 3] [1]
3 [2 4 4 4] [4]
4 [0 4 1 0] [0]


## cuML Model with cuDF input

In [13]:
%%time
knn_cuml = cumlKNC(n_neighbors=n_neighbors)
knn_cuml.fit(X_device_train, y_device_train)

cuml_result = knn_cuml.predict(X_device_test)
cuml_result_p = knn_cuml.predict_proba(X_device_test)

CPU times: user 1.68 s, sys: 300 ms, total: 1.98 s
Wall time: 1.98 s


In [14]:
cuml_result.iloc[:5,0].to_array()

array([2, 1, 3, 4, 4], dtype=int32)

In [15]:
cuml_result_p.to_pandas().values.argmax(axis=1)[:5]

array([1, 0, 1, 4, 0])

In [16]:
model = cuNearestNeighbors(n_neighbors=n_neighbors)
model.fit(X_device_train)
distances, indices = model.kneighbors(X_device_test)
for i in range(5):
    d = y_device_train.iloc[:,0][ indices.iloc[i,:] ].to_array()
    print(i, d, stats.mode(d)[0] )

0 [0 1 2 1] [1]
1 [1 0 0 0] [0]
2 [1 1 1 3] [1]
3 [2 4 4 4] [4]
4 [0 4 1 0] [0]


## cuML Model with NumPy input

In [17]:
%%time
knn_cuml = cumlKNC(n_neighbors=n_neighbors)
knn_cuml.fit(X_host_train.values, y_host_train.values)

cuml_result2 = knn_cuml.predict(X_host_test.values)
cuml_result2_p = knn_cuml.predict_proba(X_host_test.values)



CPU times: user 216 ms, sys: 156 ms, total: 372 ms
Wall time: 370 ms


In [18]:
cuml_result2[:5]

array([2, 1, 3, 4, 4], dtype=int32)

In [19]:
cuml_result2_p.argmax(axis=1)[:5]

array([0, 0, 0, 0, 0])

In [20]:
model = cuNearestNeighbors(n_neighbors=n_neighbors)
model.fit(X_host_train.values)
distances, indices = model.kneighbors(X_host_test.values)
for i in range(5):
    d = y_host_train.values.flatten()[ indices[i,:].astype(int) ]
    print(i, d, stats.mode(d)[0] )

0 [0 1 2 1] [1]
1 [1 0 0 0] [0]
2 [1 1 1 3] [1]
3 [2 4 4 4] [4]
4 [0 4 1 0] [0]


## Compare Results

In [21]:
# Compare cuML predict() (cuDF input) against scikit-learn predict()
passed = np.array_equal(np.asarray(cuml_result.as_gpu_matrix())[:,0], sk_result)
print('compare knn: cuml vs sklearn classes %s' % ('equal' if passed else 'NOT equal'))

compare knn: cuml vs sklearn classes NOT equal
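
The check above compares only one pair of outputs. To see which prediction paths disagree, every pair can be compared. The following is a minimal sketch with a hypothetical helper `agreement_report`. It is fed the first five predictions copied from the cell outputs above, not freshly computed results.

```python
import numpy as np

def agreement_report(predictions):
    """Hypothetical helper: fraction of matching labels for every pair of methods.

    predictions maps a method name to a 1-D sequence of predicted labels.
    """
    names = list(predictions)
    report = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa = np.asarray(predictions[a]).ravel()
            pb = np.asarray(predictions[b]).ravel()
            report[(a, b)] = float(np.mean(pa == pb))
    return report

# First five predictions, copied from the outputs shown in this notebook
first_five = {
    "sklearn predict": [1, 0, 1, 4, 0],
    "cuML predict (cuDF)": [2, 1, 3, 4, 4],
    "cuML predict_proba (cuDF)": [1, 0, 1, 4, 0],
    "cuML predict_proba (NumPy)": [0, 0, 0, 0, 0],
}
report = agreement_report(first_five)
for pair, fraction in report.items():
    print(pair, fraction)
```

On these five rows, only scikit-learn's `predict()` and cuML's `predict_proba()` with cuDF input agree on every row. Passing the full-length arrays instead would quantify the disagreement over all queries.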
