# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based clustering algorithm that works well on datasets whose samples congregate in dense groups; samples in sparse regions are labeled as noise. cuML's DBSCAN accepts a cuDF DataFrame and constructs an adjacency graph to compute the distances between close neighbors. The DBSCAN model implemented in the cuML library accepts the following parameters:
- `eps`: maximum distance between two sample points for them to be considered part of the same neighborhood
- `min_samples`: minimum number of samples that must be present in a neighborhood for a point to be considered a core point
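
To make the two parameters concrete, here is a minimal brute-force sketch in NumPy of how `eps` and `min_samples` together decide which samples are core points. The helper name `core_point_mask` is hypothetical, and the sketch assumes the sklearn convention that a point counts as its own neighbor; it is an illustration, not cuML's implementation.

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """Return a boolean mask that is True for the core points of X."""
    # Brute-force pairwise Euclidean distances (O(n^2) memory, small inputs only)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count the samples within eps of each point; each point counts itself
    neighbor_counts = (dist <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Three points close together and one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mask = core_point_mask(X, eps=0.2, min_samples=3)
print(mask)
```

The three nearby points each have three samples within `eps` (including themselves) and are core points; the isolated point is not.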

The methods that can be used with DBSCAN are: 
- `fit`: Performs DBSCAN clustering from features
- `fit_predict`: Performs clustering on `input_gdf` and returns cluster labels
- `get_params`: Sklearn-style method that returns the parameter state
- `set_params`: Sklearn-style method that sets the parameter state from a dictionary of params
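
Since these methods are described as sklearn-style, the calling pattern can be sketched with scikit-learn's own `DBSCAN` as a CPU stand-in. The toy data and parameter values below are made up for illustration, and the sketch assumes cuML's estimator follows the same pattern; check the cuML documentation before relying on that.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated blobs of 50 points each (illustrative data only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b])

model = DBSCAN(eps=0.5, min_samples=5)

# fit_predict clusters the data and returns one label per sample (-1 means noise)
labels = model.fit_predict(X)

# get_params returns the current parameters as a dictionary
params = model.get_params()

# set_params updates parameters in place; the model must be re-fit afterwards
model.set_params(eps=0.3)
```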

 
The model accepts only NumPy arrays or cuDF DataFrames as input.
  - To convert your dataset to cuDF format, please read the cuDF [documentation](https://rapidsai.github.io/projects/cudf/en/latest/)
  - For additional information on the DBSCAN model, please refer to the cuML [documentation](https://rapidsai.github.io/projects/cuml/en/latest/index.html)
<hr>

# Setup

1. Ensure that you have selected Python 3 as your runtime type and GPU as your hardware accelerator from the menu: Runtime > Change Runtime Type.
2. Use pynvml to confirm that Colab allocated you a Tesla T4 GPU.
3. Install the most recent Miniconda release compatible with Google Colab's Python install (3.6.7).
4. Install the RAPIDS libraries.
5. Copy the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions.
6. Update environment variables so that Python can find and use the RAPIDS artifacts.

All of the above steps are automated in the next cell. Re-run this cell any time your instance restarts. Note that the cell:
- may take a few minutes to run
- produces long output (output display removed)


In [0]:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh

import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

## Imports

In [None]:
import gzip
import numpy as np
import pandas as pd
# rapids
import cudf
from cuml import DBSCAN as cumlDBSCAN
from sklearn.cluster import DBSCAN as skDBSCAN
# dask
import dask
import dask.dataframe as dd

## Data

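Either of the two data-loading functions can be used here:
1. `load_data` loads the data from the gzipped NumPy file into a regular pandas DataFrame, falling back to random data if the file is not found.
2. `load_data_alternate` loads the data from the gzipped CSV file into a CUDA (cuDF) DataFrame.
    - NOTE: `load_data` returns a pandas DataFrame, while `load_data_alternate` returns a cuDF DataFrame; convert the latter with `.to_pandas()` before passing it to sklearn.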

In [0]:
def load_data(nrows, ncols, cached='data/mortgage.npy.gz'):
    """Return a pandas DataFrame with nrows rows and at most ncols feature columns."""
    if os.path.exists(cached):
        print('Using Mortgage Data')
        with gzip.open(cached) as f:
            X = np.load(f)
        # sample row indices with replacement; randint's upper bound is exclusive
        X = X[np.random.randint(0, X.shape[0], nrows), :ncols]
    else:
        # create a random dataset
        print('Using Random Data')
        X = np.random.rand(nrows, ncols)
    df = pd.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})
    return df

def load_data_alternate(nrows, ncols, cached='data/mortgage.csv.gz'):
    """Return a cuDF DataFrame read from the gzipped CSV file."""
    if not os.path.exists(cached):
        # fail loudly instead of silently returning None
        raise FileNotFoundError(cached)
    with gzip.open(cached) as f:
        print('Using Mortgage Data')
        X = cudf.read_csv(f, usecols=list(range(ncols)), nrows=nrows, header=None)
    return X
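
`load_data` picks row indices with `np.random.randint`, which samples with replacement, so the subset may contain duplicate rows. If duplicates are unwanted, one alternative is to sample without replacement. The sketch below uses a hypothetical helper named `sample_rows` and a small stand-in array rather than the mortgage data.

```python
import numpy as np

def sample_rows(X, nrows, seed=None):
    """Return up to nrows distinct rows of X, chosen at random."""
    rng = np.random.default_rng(seed)
    # Cannot draw more distinct rows than the array contains
    n_take = min(nrows, X.shape[0])
    idx = rng.choice(X.shape[0], size=n_take, replace=False)
    return X[idx]

# Stand-in data: 10 rows, each with a unique first column
X = np.arange(20).reshape(10, 2)
subset = sample_rows(X, nrows=4, seed=0)
print(subset.shape)
```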
  

In [0]:
# Set the number of rows and columns to load.
# Try other values as well, e.g. nrows of 500 or 5000, and re-run the tests.
nrows = 10000
ncols = 20
df = load_data(nrows, ncols)
#df = load_data_alternate(nrows, ncols)


Using Random Data


## Performing Clustering

Set the two DBSCAN parameters: the maximum distance between two sample points (`eps`) and the minimum number of samples (`min_samples`).

In [0]:
# eps = maximum distance between 2 sample points for them to be in the same neighborhood
# min_samples = number of samples that should be present in a neighborhood for it to be considered as a core point

eps = 3
min_samples = 2
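
Whether a given `eps` is large or small depends on the scale of the data. For the random fallback data, every feature lies in [0, 1), so with 20 columns no two points can be farther apart than sqrt(20) ≈ 4.47. The sketch below, which uses a smaller sample of 500 rows to keep memory low, estimates the fraction of point pairs that fall within `eps = 3`. It is a rough sanity check under those assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 20))  # uniform data like the random fallback, fewer rows

# Pairwise Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
sq_norms = (X ** 2).sum(axis=1)
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
dist = np.sqrt(np.clip(sq_dist, 0.0, None))  # clip tiny negatives from rounding

eps = 3
upper = np.triu_indices_from(dist, k=1)  # each unordered pair once, no self-pairs
frac_within_eps = (dist[upper] <= eps).mean()
max_possible = np.sqrt(X.shape[1])       # diagonal of the unit hypercube

print(f"fraction of pairs within eps: {frac_within_eps:.4f}")
print(f"largest possible distance:    {max_possible:.2f}")
```

If nearly all pairs fall within `eps`, DBSCAN will likely place almost every sample in a single cluster, so a smaller `eps` may give a more informative comparison on random data.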

**At this point, we can compare the performance of the traditional sklearn DBSCAN model with the CUDA-based cuML implementation.**


In [0]:
%%time
# use the sklearn DBSCAN model to fit the dataset 
clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(df)

CPU times: user 5.72 s, sys: 1.29 s, total: 7.02 s
Wall time: 7.04 s


In [0]:
# convert dataframe to cudf from pandas 
df = cudf.from_pandas(df)

In [0]:
%%time
# run the cuml DBSCAN model to fit the dataset 
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(df)

CPU times: user 1.1 s, sys: 124 ms, total: 1.22 s
Wall time: 1.22 s
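
The wall times reported above (7.04 s for sklearn, 1.22 s for cuML) correspond to a speedup of roughly 5.8x on this run. Outside a notebook, where `%%time` is unavailable, a similar measurement can be sketched with `time.perf_counter`. The helper name `timed` is hypothetical, and the workload is a stand-in for a model's `fit` call. A single run can also include one-time startup costs, so repeated runs give a fairer picture.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be a call such as model.fit(df)
result, seconds = timed(sum, range(1_000_000))
print(f"elapsed: {seconds:.4f} s")

# Speedup computed from the wall times reported in the cells above
sk_wall, cuml_wall = 7.04, 1.22
speedup = sk_wall / cuml_wall
print(f"speedup: {speedup:.1f}x")
```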


**The following two functions determine whether the results from cuML and sklearn are equivalent.**

In [0]:
from sklearn.metrics import mean_squared_error

# convert an ndarray, scalar, or pandas/cuDF object to a numpy array
def to_nparray(x):
    if isinstance(x, (np.ndarray, pd.DataFrame)):
        return np.array(x)
    elif isinstance(x, np.float64):
        return np.array([x])
    elif isinstance(x, (cudf.DataFrame, cudf.Series)):
        return x.to_pandas().values
    return x

# return True if the mean squared error between a and b is below threshold
def array_equal(a, b, threshold=5e-3, with_sign=True):
    a = to_nparray(a)
    b = to_nparray(b)
    if not with_sign:
        a, b = np.abs(a), np.abs(b)
    error = mean_squared_error(a, b)
    return error < threshold

**Check that both methods give the same output.**



In [0]:
equals = array_equal(clustering_cuml.labels_,clustering_sk.labels_)
if equals:
  print("Results are equal.")
else:
  print("Results are not equal.")

Results are equal.
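
Comparing labels with mean squared error assumes that both implementations number their clusters identically. Cluster IDs are arbitrary, so two identical clusterings can still differ in numbering. A permutation-invariant check compares which samples share a cluster instead. The sketch below uses a hypothetical helper named `same_clustering` and assumes the usual DBSCAN convention that `-1` marks noise. It needs O(n²) memory, so for large inputs `sklearn.metrics.adjusted_rand_score` is a more practical alternative.

```python
import numpy as np

def same_clustering(labels_a, labels_b):
    """Return True if both label arrays describe the same grouping of samples."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        return False
    # Noise is a fixed label (-1), so it must match exactly
    if not np.array_equal(a == -1, b == -1):
        return False
    # Co-membership matrices: entry (i, j) is True if i and j share a label
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    return bool(np.array_equal(same_a, same_b))

labels_1 = [0, 0, 1, 1, -1]
labels_2 = [1, 1, 0, 0, -1]   # same grouping, cluster IDs swapped
labels_3 = [0, 1, 1, 1, -1]   # different grouping

print(same_clustering(labels_1, labels_2))
print(same_clustering(labels_1, labels_3))
```

Here `labels_1` and `labels_2` describe the same clusters, yet their mean squared error is 0.8, well above the `5e-3` threshold used by `array_equal`. Note also that border points can legitimately be assigned to different clusters by different implementations, so an exact match is a strict criterion.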
