## DBSCAN: Density-Based Clustering

In [1]:
from sklearn import cluster, datasets
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
X = datasets.load_iris().data

In [3]:
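# Specify parameters for clustering
# eps: maximum distance between two samples for one to count as a neighbor of the other.
# min_samples: minimum number of samples (including the point itself) within eps of a point for it to be a core point.
# Non-core points within eps of a core point are border points of that cluster; all remaining points are noise.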
db = cluster.DBSCAN(eps=0.5, min_samples=5)

In [4]:
# Fit DBSCAN to the data
db.fit(X)

DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
    metric_params=None, min_samples=5, n_jobs=1, p=None)
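
The core, border and noise roles described in the parameter comments can be read back from the fitted estimator. The following is a minimal sketch, assuming the same `eps=0.5` and `min_samples=5` as above; the names `core_mask`, `border_mask` and `noise_mask` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import cluster, datasets

X = datasets.load_iris().data
db = cluster.DBSCAN(eps=0.5, min_samples=5).fit(X)

# core_sample_indices_ holds the indices of the core points found during fitting
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points carry the label -1; border points are clustered but not core
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print('core:', core_mask.sum(),
      'border:', border_mask.sum(),
      'noise:', noise_mask.sum())
```

Every point falls into exactly one of the three masks, so the three counts add up to the number of samples.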

In [5]:
# db.labels_: array giving the cluster index assigned to each point in X. The label -1 marks noise.
labels = db.labels_
labels

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      dtype=int64)
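
A raw label array is hard to read at a glance, so it can help to summarize it. The sketch below counts the clusters and the noise points; the variable names (`n_clusters`, `n_noise`, `cluster_sizes`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import cluster, datasets

X = datasets.load_iris().data
labels = cluster.DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise is labeled -1 and is not counted as a cluster
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})

# Number of points in each real cluster
values, counts = np.unique(labels[labels != -1], return_counts=True)
cluster_sizes = dict(zip(values.tolist(), counts.tolist()))

print('clusters:', n_clusters)
print('noise points:', n_noise)
print('cluster sizes:', cluster_sizes)
```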

In [6]:
# Print each point alongside the cluster label assigned to it
for i in range(len(X)):
    print('VALUE: {0}\t CLUSTER: {1}'.format(X[i], labels[i]))

VALUE: [5.1 3.5 1.4 0.2]	 CLUSTER: 0
VALUE: [4.9 3.  1.4 0.2]	 CLUSTER: 0
VALUE: [4.7 3.2 1.3 0.2]	 CLUSTER: 0
VALUE: [4.6 3.1 1.5 0.2]	 CLUSTER: 0
VALUE: [5.  3.6 1.4 0.2]	 CLUSTER: 0
VALUE: [5.4 3.9 1.7 0.4]	 CLUSTER: 0
VALUE: [4.6 3.4 1.4 0.3]	 CLUSTER: 0
VALUE: [5.  3.4 1.5 0.2]	 CLUSTER: 0
VALUE: [4.4 2.9 1.4 0.2]	 CLUSTER: 0
VALUE: [4.9 3.1 1.5 0.1]	 CLUSTER: 0
VALUE: [5.4 3.7 1.5 0.2]	 CLUSTER: 0
VALUE: [4.8 3.4 1.6 0.2]	 CLUSTER: 0
VALUE: [4.8 3.  1.4 0.1]	 CLUSTER: 0
VALUE: [4.3 3.  1.1 0.1]	 CLUSTER: 0
VALUE: [5.8 4.  1.2 0.2]	 CLUSTER: 0
VALUE: [5.7 4.4 1.5 0.4]	 CLUSTER: 0
VALUE: [5.4 3.9 1.3 0.4]	 CLUSTER: 0
VALUE: [5.1 3.5 1.4 0.3]	 CLUSTER: 0
VALUE: [5.7 3.8 1.7 0.3]	 CLUSTER: 0
VALUE: [5.1 3.8 1.5 0.3]	 CLUSTER: 0
VALUE: [5.4 3.4 1.7 0.2]	 CLUSTER: 0
VALUE: [5.1 3.7 1.5 0.4]	 CLUSTER: 0
VALUE: [4.6 3.6 1.  0.2]	 CLUSTER: 0
VALUE: [5.1 3.3 1.7 0.5]	 CLUSTER: 0
VALUE: [4.8 3.4 1.9 0.2]	 CLUSTER: 0
VALUE: [5.  3.  1.6 0.2]	 CLUSTER: 0
VALUE: [5.  3.4 1.6 0.4]	 CLUSTER: 0
...

### Advantages:

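**1.** The number of clusters does not need to be specified in advance; it follows from `eps` and `min_samples`.

**2.** Clusters can take arbitrary shapes and sizes, because they grow through chains of neighboring core points rather than around a centroid.

**3.** Able to deal with noise: points in low-density regions are labeled `-1` instead of being forced into a cluster.

**4.** Able to deal with outliers: isolated points end up as noise, so they do not distort the clusters.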
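
The flexibility in cluster shape can be illustrated on the two-moons toy dataset, whose clusters are curved and interleaved. This is a sketch only: the dataset, `eps=0.2` and `random_state=0` are assumptions chosen for the illustration and do not come from the notebook above.

```python
from sklearn import cluster, datasets

# Two interleaving half-circles with a little Gaussian noise
X_moons, _ = datasets.make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is smaller than the gap between the moons, so they are not merged
moon_labels = cluster.DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

n_moon_clusters = len(set(moon_labels.tolist()) - {-1})
print('clusters found:', n_moon_clusters)
```

The number of clusters is never passed in; it results from the density parameters alone.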

### Disadvantages:

**1.** A border point that is reachable from two clusters is assigned to whichever cluster reaches it first, so its label can change with the order of the input data. Core points and noise points are always assigned the same way.

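**2.** Difficulty finding clusters of varying densities, because a single global `eps` and `min_samples` cannot suit dense and sparse regions at the same time [Solution: Use HDBSCAN]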
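
The varying-density problem can be reproduced with synthetic data. The sketch below builds one tight blob and one diffuse blob (both hypothetical, generated only for this illustration) and fits DBSCAN with an `eps` tuned to the tight one.

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(100, 2))   # tight blob
sparse = rng.normal(loc=5.0, scale=1.5, size=(100, 2))  # diffuse blob
X_mixed = np.vstack([dense, sparse])

# eps suits the dense blob but is too small for the sparse one
mixed_labels = cluster.DBSCAN(eps=0.3, min_samples=5).fit_predict(X_mixed)

dense_noise = int(np.sum(mixed_labels[:100] == -1))
sparse_noise = int(np.sum(mixed_labels[100:] == -1))
print('noise in dense blob:', dense_noise)
print('noise in sparse blob:', sparse_noise)
```

Raising `eps` to capture the sparse blob would in turn risk merging nearby dense clusters, which is the trade-off that hierarchical variants such as HDBSCAN are meant to avoid.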