Skip to content

How It Works

alitouka edited this page Jun 12, 2014 · 4 revisions

Spark DBSCAN provides a distributed implementation of the DBSCAN algorithm, which means that it splits input data into parts and processes each part separately. It also performs some pre- and post-processing of the data to enable parallel computation. This page describes each stage of the algorithm, and also explains why it doesn't support all distance measures.

Density-based partitioning

First of all, input data should be split into parts of approximately equal size. You can specify this size and the algorithm will do its best to partition your data set into box-shaped regions each of which will contain approximately the same number of points. Each partition will be processed independently.

Density-based partitions may look like this:

Density-based partitioning of a dataset

More details about the partitioning algorithm are available here

Counting neighbors of each point

Neighbors of each point are counted before clustering algorithm starts. It allows the algorithm to process each point correctly, even if its neighbors reside in different partitions.

This stage consists of 3 steps:

  1. For each point, count its neighbors which reside in the same partition;
  2. Find points which lie close to the bounds of their partition (to be more precise - points whose distance to any bound is less then the epsilon parameter of the DBSCAN algorithm);
  3. For each point identified in the previous step, count its neighbors which reside in different partitions.

Why not all distance measures are supported

This limitation is caused by the step 2 described above. It determines whether a point lies close to the bounds by simply comparing its coordinates with coordinates of the bounding box and checking whether difference is less than epsilon . It works for Euclidean and Manhattan distance measures but doesn't work for others. I will try to fix it later.

Clustering

After the data set is partitioned and neighbors of each point are counted, it is possible to run clustering algorithm in each partition independently. Each partition is processed with the algorithm very similar to the original DBSCAN algorithm with the exception that number of neighbors of each point is already known.

Upon completion of this stage, clusters in each partition are identified. The data set may look like this at this moment:

Partially clustered data set

Merging output

It is very likely that clusters are spread across multiple partitions of the data set, so we need to identify such clusters and stitch their parts together. To do that, we need to find pairs of points which:

  • reside in different partitions
  • close enough to each other (distance between them is less than or equal to epsilon )
  • have enough neighbors (not less than minPts )

When we find a pair of such points, we will merge clusters they belong to.

As a result, clusters whose parts reside in different partitions are merged and the whole clustering process is complete:

Clusters identified by the DBSCAN algorithm