alitouka/spark_dbscan

Spark DBSCAN is an implementation of the DBSCAN clustering algorithm on top of Apache Spark. It also includes two simple tools that help you choose the parameters of the DBSCAN algorithm.
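For readers new to DBSCAN, here is a minimal single-machine sketch of the textbook algorithm in plain Python. It is not the distributed implementation in this repository and does not use its API; the names `dbscan`, `region_query`, `eps` and `min_points` are chosen for illustration only. It is meant to show what the two parameters mean: `eps` is the neighborhood radius, and `min_points` is the number of points (the point itself included) that must fall within that radius for a point to start or extend a cluster.

```python
import math

NOISE = -1  # label given to points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Textbook DBSCAN with a brute-force neighbor search (O(n^2))."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)  # two clusters of three points each, plus one noise point
```

The brute-force neighbor search keeps the sketch short, but it compares every pair of points, so it only suits small inputs.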

Clusters identified by the DBSCAN algorithm

This software is EXPERIMENTAL: it supports only the Euclidean and Manhattan distance measures (why?) and is not well optimized yet. I have tested it only on small datasets (millions of records with 2 features in each record).
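As a quick reminder of how the two supported distance measures differ, here is a small plain-Python sketch. The function names `euclidean` and `manhattan` are illustrative and are not taken from this project's API.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, so an epsilon value tuned for one measure will generally select a different neighborhood under the other (a circle for Euclidean, a diamond for Manhattan in two dimensions).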

You can use Spark DBSCAN as a standalone application that you submit to a Spark cluster (Learn how). Alternatively, you can include it in your own app; its API is documented and easy to use (Learn how).

Learn more about:

Performance

Performance chart

Credits

I was glad to receive contributions from other people and I'd like to say thank you:

  • Mark Geraty - for fixing a bug with Java RDDs;
  • @agrinh - for adding compatibility with Spark 1.1.0.