C4E, a Scala or Spark library for local and distributed Clustering.
7 authors Pure scala UMAP version available
Design improvements are still required (parallel and/or distributed NN-Ascent, and distance homogenization). Huge thanks, in no particular order, to engineering students Beugnet Vincent, Hurvois Guillaume, Ladjal Adlane, Merien Grégoire, and Serfas Florent.

Co-authored-by: Beugnet Vincent <beugnetv@gmail.com>
Co-authored-by: Hurvois Guillaume <guillaume.hurvois@gmail.com>
Co-authored-by: Ladjal Adlane <ladjal.adlane@gmail.com>
Co-authored-by: Merien Grégoire <gregoiremerien@gmail.com>
Co-authored-by: Serfas Florent <florentserfas@gmail.com>
Co-authored-by: Beck Gaël <beck.gael@gmail.com>
Co-authored-by: Forest Florent <florent.forest9@gmail.com>
Latest commit ffc55f3 Jun 11, 2019
clustering Pure scala UMAP version available Jun 11, 2019
core
project new algorithms are now available on 0.9.3, fix class is not a member … Feb 14, 2019
.gitignore Pure scala UMAP version available Jun 11, 2019
LICENSE simplify design Jan 30, 2019
README.md Pure scala UMAP version available Jun 11, 2019
_config.yml enhance external and internal indices Feb 23, 2019
build.sbt

README.md

Clustering 4️⃣ Ever Download Maven Central

Welcome to Clustering4️⃣Ever, a Big Data Clustering Library gathering clustering and other unsupervised algorithms, along with quality indices. Don't hesitate to check our Wiki, ask questions, or make recommendations in our Gitter.

API documentation

Include it in your project

Add the following to the libraryDependencies in your build.sbt:

  • "org.clustering4ever" % "clustering4ever_2.11" % "0.9.6"

If needed, add one of these resolvers:

  • resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")
  • resolvers += "mvnrepository" at "http://mvnrepository.com/artifact/"
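Put together, a build.sbt could look like the sketch below. The dependency and resolver lines are the ones listed above; the `scalaVersion` value is an assumption, chosen only because the artifact name carries the `_2.11` suffix.

```scala
// build.sbt (minimal sketch)

// Assumption: any Scala 2.11.x release, to match the "_2.11" artifact suffix.
scalaVersion := "2.11.12"

// Optional resolver, only needed if the artifact is not found by default.
resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")

// Plain "%" is used because the Scala version suffix is written explicitly.
libraryDependencies += "org.clustering4ever" % "clustering4ever_2.11" % "0.9.6"
```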

You can also take specific parts (Core, ScalaClustering, ...) from Bintray or Maven.

Available algorithms

  • emphasized algorithms are in Scala.
  • bold algorithms are implemented in Spark.
  • Some algorithms are available in both versions.

Clustering algorithms

  • Jenks Natural Breaks
  • Epsilon Proximity*
    • Scalar Epsilon Proximity*, Binary Epsilon Proximity*, Mixed Epsilon Proximity*, Any Object Epsilon Proximity*
  • K-Centers*
    • K-Means*, K-Modes*, K-Prototypes*, Any Object K-Centers*
  • Self Organizing Maps (Original project)
  • G-Stream (Original project)
  • PatchWork (Original project)
  • Random Local Area *
  • Clusterwize
  • Tensor Biclustering algorithms (Original project)
    • Folding-Spectral, Unfolding-Spectral, Thresholding Sum Of Squared Trajectory Length, Thresholding Individuals Trajectory Length, Recursive Biclustering, Multiple Biclustering
  • Ant-Tree
    • Continuous Ant-Tree, Binary Ant-Tree, Mixed Ant-Tree

Algorithms followed by a * can be executed by the benchmarking classes.
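For readers new to the K-Centers family, here is a minimal plain-Scala sketch of K-Means (Lloyd's algorithm) on dense vectors. It is not the library's implementation and does not use its API: `KMeansSketch`, `fit`, and `predict` are hypothetical names, and the centers are simply initialized with the first k points to keep the example deterministic.

```scala
object KMeansSketch {
  type Point = Array[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Index of the center closest to p (first one wins on ties). */
  def predict(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  /** Component-wise mean of a non-empty group of points. */
  private def mean(points: Seq[Point]): Point = {
    val sum = points.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    sum.map(_ / points.size)
  }

  /** Lloyd iterations: assign each point to its closest center, then recompute centers. */
  def fit(data: Seq[Point], k: Int, maxIterations: Int = 100): IndexedSeq[Point] = {
    require(k > 0 && data.size >= k, "need at least k points")
    // Assumption: naive initialization with the first k points.
    var centers: IndexedSeq[Point] = data.take(k).toIndexedSeq
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = data.groupBy(p => predict(centers, p))
      // An empty cluster keeps its previous center.
      val updated = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
      changed = updated.zip(centers).exists { case (u, c) => squaredDistance(u, c) > 1e-12 }
      centers = updated
      iteration += 1
    }
    centers
  }
}
```

Calling `KMeansSketch.fit(data, k = 2)` returns the final centers, and `KMeansSketch.predict(centers, point)` gives the cluster index of a new point.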

Preprocessing

  • UMAP
  • Gradient Ascent (Mean-Shift related)
    • Scalar Gradient Ascent, Binary Gradient Ascent, Mixed Gradient Ascent, Any Object Gradient Ascent
  • Rough Set Features Selection

Quality Indices

You can compute your quality measures manually with the dedicated classes for local or distributed collections. The helpers ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed let you test indices on multiple clusterings at once.

  • Internal Indices
    • Davies Bouldin
    • Ball Hall
  • External Indices
    • Multiple Classification
      • Mutual Information, Normalized Mutual Information
      • Purity
      • Accuracy, Precision, Recall, fBeta, f1, RAND, ARAND, Matthews correlation coefficient, CzekanowskiDice, RogersTanimoto, FolkesMallows, Jaccard, Kulcztnski, McNemar, RusselRao, SokalSneath1, SokalSneath2
    • Binary Classification
      • Accuracy, Precision, Recall, fBeta, f1
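To make a few of the external indices concrete, here is a plain-Scala sketch of Purity, Mutual Information, and Normalized Mutual Information on local collections. It does not use the library's classes: `ExternalIndicesSketch` and its methods are hypothetical names, labels are assumed to be `Int`, and NMI is normalized by the geometric mean of the two entropies, which is one common convention among several.

```scala
object ExternalIndicesSketch {

  /** Purity: share of points that fall in the majority true class of their cluster. */
  def purity(truth: Seq[Int], pred: Seq[Int]): Double = {
    require(truth.size == pred.size && truth.nonEmpty, "labels must be non-empty and aligned")
    val majorityCounts = truth.zip(pred)
      .groupBy { case (_, cluster) => cluster }
      .values
      .map(pairs => pairs.groupBy { case (cls, _) => cls }.values.map(_.size).max)
    majorityCounts.sum.toDouble / truth.size
  }

  /** Shannon entropy (natural logarithm) of a labelling. */
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p)
    }.sum
  }

  /** Mutual information (natural logarithm) between two labellings of the same points. */
  def mutualInformation(truth: Seq[Int], pred: Seq[Int]): Double = {
    require(truth.size == pred.size && truth.nonEmpty, "labels must be non-empty and aligned")
    val n = truth.size.toDouble
    val truthCounts = truth.groupBy(identity).map { case (k, v) => k -> v.size }
    val predCounts = pred.groupBy(identity).map { case (k, v) => k -> v.size }
    truth.zip(pred).groupBy(identity).iterator.map { case ((t, p), pairs) =>
      val joint = pairs.size / n
      // joint / (p(t) * p(p)) rewritten with raw counts
      joint * math.log(joint * n * n / (truthCounts(t).toDouble * predCounts(p)))
    }.sum
  }

  /** NMI normalized by sqrt(H(truth) * H(pred)); defined as 0 when either entropy is 0. */
  def normalizedMutualInformation(truth: Seq[Int], pred: Seq[Int]): Double = {
    val denominator = math.sqrt(entropy(truth) * entropy(pred))
    if (denominator == 0.0) 0.0 else mutualInformation(truth, pred) / denominator
  }
}
```

These indices compare cluster labels with ground-truth labels, so they are invariant to how cluster identifiers are numbered: swapping cluster ids 0 and 1 leaves all three values unchanged.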

Clustering benchmarking and analysis

Using the classes ClusteringChainingLocal, BigDataClusteringChaining, DistributedClusteringChaining, and the ChainingOneAlgorithm descendants, you can run multiple clustering algorithms, respectively: locally and in parallel; sequentially on a distributed system; in parallel on a distributed system; and locally and in parallel. You can also generate many different vectorizations of the data while keeping track of information on each clustering, including the vectorization used, the clustering model, the clustering number, and the clustering arguments.
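The general idea of chaining can be illustrated without the library: run several algorithms in parallel on the same local data and keep, for each run, the clustering number, the algorithm name, and its arguments. The sketch below uses only the Scala standard library; `ChainingSketch`, `Algo`, `ClusteringRun`, and `thresholdSplit` are hypothetical names and are not the API of the chaining classes.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ChainingSketch {

  /** Hypothetical description of an algorithm: a name, its arguments, and a labelling function. */
  final case class Algo(name: String, args: Map[String, Double], run: Seq[Double] => Seq[Int])

  /** Hypothetical record kept for each clustering that was run. */
  final case class ClusteringRun(
    clusteringNumber: Int,
    algorithm: String,
    args: Map[String, Double],
    labels: Seq[Int]
  )

  /** Toy one-dimensional "algorithm": cluster 0 below the threshold, cluster 1 otherwise. */
  def thresholdSplit(threshold: Double): Algo =
    Algo(
      "thresholdSplit",
      Map("threshold" -> threshold),
      data => data.map(x => if (x < threshold) 0 else 1)
    )

  /** Runs every algorithm in its own Future and returns the results in submission order. */
  def runAll(data: Seq[Double], algos: Seq[Algo]): Seq[ClusteringRun] = {
    val futures = algos.zipWithIndex.map { case (algo, i) =>
      Future(ClusteringRun(i, algo.name, algo.args, algo.run(data)))
    }
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

For example, `ChainingSketch.runAll(Seq(1.0, 2.0, 8.0, 9.0), Seq(ChainingSketch.thresholdSplit(5.0), ChainingSketch.thresholdSplit(1.5)))` yields two records whose labels can then be compared with quality indices.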

The classes ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed are devoted to clustering indices analysis.

The classes ClustersAnalysisLocal and ClustersAnalysisDistributed are used to describe the obtained clusterings in terms of distributions, proportions of categorical features, and so on.

Coming soon

  • UMAP
  • Gaussian Mixture Models
  • DBScan
  • Time Series K-Means

Citation

If you publish material based on information obtained from this repository, please mention in your acknowledgements the assistance you received from this community work. This will help others obtain the same information and replicate your experiments: having results is cool, but being able to compare them with others is better.

Citation: @misc{C4E, url = "https://github.com/Clustering4Ever/Clustering4Ever", institution = "Paris 13 University, LIPN UMR CNRS 7030"}

C4E-Notebook examples

Basic usage of the implemented algorithms is demonstrated with SparkNotebooks in the Spark-Clustering-Notebook organization.

Miscellaneous

Helper functions to generate Clusterizable collections

You can easily generate collections of basic Clusterizable objects using the helpers in org.clustering4ever.util.{ArrayAndSeqTowardGVectorImplicit, ScalaCollectionImplicits, SparkImplicits}, or explore Clusterizable and EasyClusterizable for more advanced usage.

References

Which data structures are recommended for best performance

  • ArrayBuffer or ParArray as vector containers are recommended for local applications