Using Spark DBSCAN as a standalone application

alitouka edited this page Mar 15, 2015 · 5 revisions

This page describes how to submit Spark DBSCAN to a Spark cluster. To do that, you need an assembly JAR that contains Spark DBSCAN and all its dependencies. You can download it here. Please make sure that you are familiar with the application submission process (spark-submit) described in the Spark documentation.

Clustering data

The class that runs the clustering algorithm is org.alitouka.spark.dbscan.DbscanDriver. Specify this name when you submit the application to Spark. The application creates its own Spark context, so you also have to pass it a master URL and a path to the assembly JAR (these parameters appear twice in the command line below because both the submission program and the driver program require them). The following parameters are also required:

  • --ds-input - path to the input data
  • --ds-output - path where clustering results will be stored
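  • --eps - value of the epsilon parameter (the neighborhood radius used by DBSCAN)
  • --numPts - value of the minPts parameter (the minimum number of points required to form a dense region)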

The resulting command line may look like this:

./bin/spark-submit \
  --class org.alitouka.spark.dbscan.DbscanDriver \
  --master spark://your.spark.master:7077 \
  --deploy-mode cluster \
  hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-master spark://your.spark.master:7077 \
  --ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
  --ds-output hdfs://your.hdfs:9000/path/to/output/folder \
  --eps 25 \
  --numPts 30
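
Because the master URL and the JAR path each appear twice, it can help to define them once and reuse them. The following is a minimal sketch, not part of Spark DBSCAN: the `run` and `submit_dbscan` helpers and the `DRY_RUN` variable are hypothetical names introduced for illustration, and the URLs and paths are the same placeholders as in the command above. By default the script only prints the command, so it can be checked before anything is submitted.

```shell
#!/bin/sh
# Placeholder values copied from the example above; replace them with your own.
MASTER="spark://your.spark.master:7077"
JAR="hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar"

# Hypothetical helper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Hypothetical wrapper. Arguments: input path, output path, eps, minPts.
submit_dbscan() {
  run ./bin/spark-submit \
    --class org.alitouka.spark.dbscan.DbscanDriver \
    --master "$MASTER" \
    --deploy-mode cluster \
    "$JAR" \
    --ds-master "$MASTER" \
    --ds-jar "$JAR" \
    --ds-input "$1" \
    --ds-output "$2" \
    --eps "$3" \
    --numPts "$4"
}

submit_dbscan \
  "hdfs://your.hdfs:9000/path/to/your/data.csv" \
  "hdfs://your.hdfs:9000/path/to/output/folder" \
  25 30
```

Running the script with DRY_RUN=0 in the environment submits the job instead of printing the command.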

The following parameters are optional:

  • --npp - an approximate number of points in each partition of the data set. This value is used by the density-based partitioning algorithm, which splits your data set into parts of the specified size to speed up further processing. The default value is 50,000;
  • --distanceMeasure - the fully qualified name of a class that implements the org.apache.commons.math3.ml.distance.DistanceMeasure interface. Currently, only Euclidean and Manhattan distances are supported.
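
The optional parameters are appended after the required ones. The sketch below shows one way to add them only when a value is given. The `dbscan_with_options` function is a hypothetical helper, the value 100000 is arbitrary, and it is an assumption that the Manhattan distance is selected with the Apache Commons Math class org.apache.commons.math3.ml.distance.ManhattanDistance. The `echo` keeps this a dry run that prints the command instead of submitting it.

```shell
#!/bin/sh
# Placeholder values; replace them with your own.
MASTER="spark://your.spark.master:7077"
JAR="hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar"

# Hypothetical wrapper. $1 = points per partition, $2 = distance measure class;
# pass "" for either one to keep the driver's default.
dbscan_with_options() {
  npp="$1"
  distance="$2"
  set --   # reset positional parameters; they now collect the optional flags
  if [ -n "$npp" ]; then set -- "$@" --npp "$npp"; fi
  if [ -n "$distance" ]; then set -- "$@" --distanceMeasure "$distance"; fi
  # "echo" makes this a dry run; remove it to actually submit the job.
  echo ./bin/spark-submit \
    --class org.alitouka.spark.dbscan.DbscanDriver \
    --master "$MASTER" \
    --deploy-mode cluster \
    "$JAR" \
    --ds-master "$MASTER" \
    --ds-jar "$JAR" \
    --ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
    --ds-output hdfs://your.hdfs:9000/path/to/output/folder \
    --eps 25 \
    --numPts 30 \
    "$@"
}

# Both optional parameters set (the class name is an assumption, see above).
dbscan_with_options 100000 org.apache.commons.math3.ml.distance.ManhattanDistance
# No optional parameters: the command is the same as the required-only form.
dbscan_with_options "" ""
```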

Calculating distance to the nearest neighbor

The class responsible for this task is org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver. You have to pass it a master URL, a path to the assembly JAR, an input path and an output path. The resulting command line may look like this:

./bin/spark-submit \
  --class org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver \
  --master spark://your.spark.master:7077 \
  --deploy-mode cluster \
  hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-master spark://your.spark.master:7077 \
  --ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
  --ds-output hdfs://your.hdfs:9000/path/to/output/folder

This program produces a histogram. You can set the number of buckets in the histogram with the optional --numBuckets parameter; the default value is 16. You can also specify the --npp and --distanceMeasure parameters described above.

Counting neighbors of each point

This task is performed by the org.alitouka.spark.dbscan.exploratoryAnalysis.NumberOfPointsWithinDistanceDriver class. You have to pass it a master URL, a path to the assembly JAR, an input path, an output path and the distance within which it should count the neighbors of each point (the --eps parameter in the example below). The resulting command line may look like this:

./bin/spark-submit \
  --class org.alitouka.spark.dbscan.exploratoryAnalysis.NumberOfPointsWithinDistanceDriver \
  --master spark://your.spark.master:7077 \
  --deploy-mode cluster \
  hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-master spark://your.spark.master:7077 \
  --ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
  --ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
  --ds-output hdfs://your.hdfs:9000/path/to/output/folder \
  --eps 25

This program also accepts the optional parameters --numBuckets, --npp and --distanceMeasure described above.
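
If you want to compare neighbor counts for several candidate distances, the same command can be run in a loop. This is only a sketch under assumptions: the `count_neighbors` helper, the `OUTPUT_ROOT` folder, the per-run folder names and the distances 10, 25 and 50 are all hypothetical, and --numBuckets 32 is an arbitrary choice. Each run gets its own output folder, and `echo` keeps the script a dry run.

```shell
#!/bin/sh
# Placeholder values; replace them with your own.
MASTER="spark://your.spark.master:7077"
JAR="hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar"
OUTPUT_ROOT="hdfs://your.hdfs:9000/path/to/output"   # hypothetical parent folder

# Hypothetical wrapper. $1 = distance within which neighbors are counted.
count_neighbors() {
  # "echo" makes this a dry run; remove it to actually submit the job.
  echo ./bin/spark-submit \
    --class org.alitouka.spark.dbscan.exploratoryAnalysis.NumberOfPointsWithinDistanceDriver \
    --master "$MASTER" \
    --deploy-mode cluster \
    "$JAR" \
    --ds-master "$MASTER" \
    --ds-jar "$JAR" \
    --ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
    --ds-output "$OUTPUT_ROOT/neighbors_eps_$1" \
    --eps "$1" \
    --numBuckets 32
}

# One submission per candidate distance, each with a separate output folder.
for eps in 10 25 50; do
  count_neighbors "$eps"
done
```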