Java implementation of SparseHC

A memory-efficient online hierarchical clustering algorithm

Currently, this packages contains two implementations of a sparse hierarchical clustering

SparseHierarchicalClusterer is based on the source code found here and here
SparseHierarchicalClustererV2 is based on the paper by T.D. Nguyen, B. Schmidt, and C.K. Kwoh "SparseHC: a memory-efficient online hierarchical clustering algorithm".

I highly recommend using SparseHierarchicalClustererV2 because it has been better tested, while SparseHierarchicalClusterer is known to have some issues.

According to the paper, the empirical runtime complexity for complete-linkage is O(n^2), and the empirical memory complexity is O(n). Thus, this algorithm can be suitable for hierarchical clustering of many objects and when the available memory is scarce.

References

This implementation is based on the paper by T.D. Nguyen, B. Schmidt, and C.K. Kwoh "SparseHC: a memory-efficient online hierarchical clustering algorithm" and on its C++ implementation here.

Remark

So far, only the complete-linkage and single-linkage hierarchical clustering are implemented. Feel free to implement the rest.

Install

You can use this package in your Maven project by adding the following lines to your pom.xml file:

<dependency>
    <groupId>org.du-lab.jsparsehc</groupId>
    <artifactId>jsparsehc</artifactId>
    <version>0.3.0</version>
</dependency>

Example

To use the algorithm, you should create an implementation of the interface Matrix that provides distance values in the ascending order. Then, you can perform clustering as follows:

Matrix matrix = ...;

SparseHierarchicalClustererV2 clusterer = new SparseHierarchicalClustererV2(matrix, new CompleteLinkage());
clusterer.cluster(scoreTolerance);
Map<Integer, Integer> labelMap = clusterer.getLabels();

The map labelMap is a collection of pairs (index, label), where index is a row (column) index of an object in the distance matrix, label is a label of the cluster containing that object.

Command Line

It is possible to run the algorithm from the command line. First, create a "fat" jar file (i.e. with all dependencies included) by executing

mvn package -PshadeProfile

This will create a jar file in folder target/shade/. Then, you can run this file as follows:

java -jar target/shade/jsparsehc.jar -t THRESHOLD -i INPUTFILE -o OUTPUTFILE

Examples of input file and output file are located in the root directory of this repository. The input file contains lines with row ID, column ID, and distance values of the (sparse) distance matrix. E.g. the distance between point 1 and point 2 is equal 2, distance between point 3 and point 5 is equal 1.41, etc. The output file contains the clustering labels. I.g., cluster with label 6 contains points 0 and 1, cluster with label 8 contains points 3, 4, and 5, etc.

Versions

See CHANGELOG

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
src		src
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
distance-table.csv		distance-table.csv
labels.csv		labels.csv
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

.gitignore

.gitignore

CHANGELOG.md

CHANGELOG.md

LICENSE

LICENSE

README.md

README.md

distance-table.csv

distance-table.csv

labels.csv

labels.csv

pom.xml

pom.xml

Repository files navigation

Java implementation of SparseHC

A memory-efficient online hierarchical clustering algorithm

References

Remark

Install

Example

Command Line

Versions

About

Releases

Packages

Contributors 3

Languages

License

du-lab/sparseHC_Java

Folders and files

Latest commit

History

Repository files navigation

Java implementation of SparseHC

A memory-efficient online hierarchical clustering algorithm

References

Remark

Install

Example

Command Line

Versions

About

Resources

License

Stars

Watchers

Forks

Languages