# DBSCAN Clustering Hello World

An example application of Density-Based Spatial Clustering of Applications with Noise.

DBSCAN is an **unsupervised** algorithm: you only need to define the neighborhood distance **(ε, eps)** and the minimum number of samples within that distance **(minSample)**, and the algorithm then tries to find clusters automatically.

The basic working principles are:

1. Pick an unvisited point and count how many points lie within a radius of eps around it.
2. If the count is at least minSample, the point is a core point: it and its neighbors are assigned to the same cluster, and the cluster keeps growing through any neighbors that are also core points.
3. If the count is smaller than minSample, the point is a non-core point and no cluster is grown from it; it is labelled as noise unless it lies within eps of a core point.
4. Repeat steps 1-3 for each point until all points have been visited.
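
The steps above can be sketched in plain Python. This is only an illustrative, brute-force version written for this walkthrough, not scikit-learn's implementation; the names `region_query` and `dbscan_sketch` and the toy points are made up for the example, and a point is assumed to count as its own neighbor.

```python
import math

NOISE = -1
UNVISITED = None

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan_sketch(points, eps, min_samples):
    """Brute-force DBSCAN sketch: returns one label per point, -1 meaning noise."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            # Non-core point: noise for now, may become a border point later
            labels[i] = NOISE
            continue
        # Core point: start a new cluster and grow it through its neighbors
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Previously marked noise, now reachable from a core point
                labels[j] = cluster_id
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                # j is itself a core point, so the cluster keeps expanding
                queue.extend(j_neighbors)
    return labels

# Two tight squares of points plus one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan_sketch(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares become clusters 0 and 1, and the isolated point is labelled -1, the same noise convention that scikit-learn's `DBSCAN` uses in `labels_`.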

**Reference:**
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- https://cat.chriz.hk/2020/11/knndbscan-lof.html


In [1]:
# Load example dataset from scikit-learn dataset library
from sklearn import datasets
wine = datasets.load_wine()


In [2]:
# print the wine feature names
print(wine.feature_names)

# print the wine class names
print(wine.target_names)

# print the first 5 rows of the wine data
print(wine.data[0:5])

# print the first 5 wine labels / results (0:class_0, 1:class_1, 2:class_2)
print(wine.target[0:5])

# Extra: print data(feature) shape
print(wine.data.shape)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
['class_0' 'class_1' 'class_2']
[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
  3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
  2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
  3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]
[0 0 0 0 0]
(178, 13)
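
The rows printed above show features on very different scales: proline is in the hundreds to thousands, while hue is close to 1. Because eps is a single distance threshold, large-valued features dominate the distance. A common preprocessing step, which this notebook does not apply, is to standardize each feature first. The sketch below assumes plain z-score scaling; the `standardize` helper is hypothetical, and the sample uses only three columns (alcohol, hue, proline) taken from the five rows printed above.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std

# alcohol, hue, proline from the first five rows printed above
sample = np.array([
    [14.23, 1.04, 1065.0],
    [13.20, 1.05, 1050.0],
    [13.16, 1.03, 1185.0],
    [14.37, 0.86, 1480.0],
    [13.24, 1.04, 735.0],
])

scaled = standardize(sample)
print(sample.max(axis=0) - sample.min(axis=0))  # raw ranges differ by orders of magnitude
print(scaled.max(axis=0) - scaled.min(axis=0))  # scaled ranges are comparable
```

If the data were standardized like this, eps would have to be chosen again, since distances would then be measured in units of standard deviations rather than raw feature values.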


In [3]:
# Import DBSCAN from the sklearn.cluster module
from sklearn.cluster import DBSCAN

# Create DBSCAN Clustering
dbscan = DBSCAN(eps=10, min_samples=6) # epsilon and minSample

# Fit the model on the full dataset (clustering uses no separate training set)
dbscan.fit(wine.data)

# Print the cluster label assigned to each sample (-1 marks noise)
cluster_result = dbscan.labels_
print(cluster_result)


[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1  1 -1  0 -1 -1 -1 -1 -1  1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1  1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1  0 -1  0 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
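
In the output above, -1 marks a noise point and most samples received that label. A raw label array is hard to read, so a small summary can help. The helper below is a hypothetical sketch (`summarize_labels` is not part of scikit-learn) and is demonstrated on a short made-up array rather than the wine result.

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, number of noise points, size of each label)."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    n_clusters = int(np.sum(values != -1))  # -1 is noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    sizes = dict(zip(values.tolist(), counts.tolist()))
    return n_clusters, n_noise, sizes

# Made-up labels for illustration only
example = [-1, -1, 0, 0, 1, -1, 1, 0]
n_clusters, n_noise, sizes = summarize_labels(example)
print(n_clusters, n_noise, sizes)  # 2 3 {-1: 3, 0: 3, 1: 2}
```

Calling `summarize_labels(dbscan.labels_)` on the fitted model would give the same kind of summary for the wine data.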


In [4]:
# DBSCAN is unsupervised: its goal is to find hidden clusters, not to predict known labels,
# so there is no direct way to calculate accuracy.

# In this example, the generated labels (-1 for noise, 0, 1) need not line up with the original wine labels [0,1,2],
# so we try every possible mapping between the two label sets and look for the highest score

import itertools
import numpy as np
from sklearn import metrics

# move the generated labels (-1, 0, 1) to placeholders (10, 11, 12) so the remapping below cannot collide with them
abc = np.where(cluster_result==-1, 10, cluster_result)
abc = np.where(abc==0, 11, abc)
abc = np.where(abc==1, 12, abc)

for i in list(itertools.permutations([0, 1, 2])):
    print(i)
    # construct array of changed labels
    _abc = np.where(abc==10, i[0], abc)
    _abc = np.where(_abc==11, i[1], _abc)
    _abc = np.where(_abc==12, i[2], _abc)
    # Evaluate accuracy with the scikit-learn metrics module
    print("Accuracy: ", metrics.accuracy_score(wine.target, _abc))


(0, 1, 2)
Accuracy:  0.34831460674157305
(0, 2, 1)
Accuracy:  0.3707865168539326
(1, 0, 2)
Accuracy:  0.3651685393258427
(1, 2, 0)
Accuracy:  0.3707865168539326
(2, 0, 1)
Accuracy:  0.2808988764044944
(2, 1, 0)
Accuracy:  0.2640449438202247
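
Trying every label permutation works for three labels but grows quickly with more clusters. As an alternative that this notebook does not use, `sklearn.metrics.adjusted_rand_score` compares two groupings without caring how the groups are named, so no relabelling is needed. The sketch below uses small made-up label lists; note that the noise label -1 is treated as just another group by this metric.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up labels for illustration only
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as truth, but with the group names shuffled
renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# Mostly noise (-1) with one small cluster, loosely like the result above
mostly_noise = [-1, -1, -1, 0, 0, -1, -1, -1, -1]

score_renamed = adjusted_rand_score(truth, renamed)
score_noise = adjusted_rand_score(truth, mostly_noise)

print("Renamed grouping:", score_renamed)   # a perfect match despite different names
print("Mostly noise:    ", score_noise)     # lower, since the grouping differs
```

For the wine example, the equivalent call would be `adjusted_rand_score(wine.target, cluster_result)`.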
