Skip to content

Choosing parameters of DBSCAN algorithm

alitouka edited this page Jun 12, 2014 · 4 revisions

DBSCAN algorithm requires 2 parameters - epsilon , which specifies how close points should be to each other to be considered a part of a cluster; and minPts , which specifies how many neighbors a point should have to be included into a cluster. However, you may not know these values in advance.

Spark DBSCAN comes with two simple tools which may help you choose these parameters. The first tool estimates distance from each point to its nearest neighbor, while the second one counts each point's neighbors within a specified distance. Both tools produce a histogram which you can visualize with this R script (it assumes that the histogram data is stored in a file named "histogram.csv")

Estimating distance to the nearest neighbor

This feature is implemented in a class named DistanceToNearestNeighborDriver. This is a driver program which you can submit to a Spark cluster ( Learn how to run it ) . It calculates distance from each point to its nearest neighbor within the same partition, so, for a small fraction of points this distance will not be accurate. However, this inaccuracy doesn't affect the overall distribution of distances too much.

DistanceToNearestNeighborDriver produces a histogram which may look like this:

Distribution of distances to the nearest neighbor of each point

It indicates that the vast majority of points lie within 21.7027 units from their nearest neighbor. So, 22 may be a reasonable guess for the epsilon parameter.

Counting point's neighbors

After you have chosen the value for epsilon , you may wonder how many points lie within each point's epsilon-neighborhood. That's where another tool named NumberOfPointsWithinDistanceDriver comes into play ( Learn how to run it ) . It is also a driver program which counts each point's neighbors and builds a histogram which may look like this:

Distribution of distances to the number of neighbors

This histogram was obtained on a data set of 400,000 points, with epsilon = 22. It indicates that some points (about 25,000, which is 6.25% of all points) have too few neighbors. Probably they are noise points. A smaller fraction (about 15,000, which is 3.75% of all points) have 65 to 129 neighbors, and starting from 129, the number of neighbors begins to grow.

Based on the histograms above, I would try clustering my data set with the following parameters: epsilon = 22, minPts = 129.