Hierarchical Clustering Questions #100
What data size are you interested in, what is the current runtime at 2^16 instances, and how much runtime are you willing to invest?
Hi, thank you for your very fast answer.
I thought DoubleBuffer would be long-indexed, but apparently I am wrong. Maybe I had the Unsafe class in mind, which has
ELKI already uses fastutil in a number of places, which includes a DoubleBigArrayBigList (when building a bundle jar, remove the exclude lines from addons/bundle/build.gradle; we currently exclude these classes to reduce the jar size). These classes are usually very efficient and can be recommended. This is likely the easiest one to use instead.
151 seconds isn't too bad. Which algorithm and which linkage did you use?
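To illustrate what long-indexed storage means here, the following is a minimal sketch of a double array addressed by a long index and backed by several ordinary Java arrays. It is not fastutil's or ELKI's code; the class name ChunkedDoubleArray and the shift parameter are made up for this example. It only shows the general idea behind "big array" containers such as the DoubleBigArrayBigList mentioned above.

```java
/** Hypothetical sketch: a double array addressed by a long index, stored in chunks. */
final class ChunkedDoubleArray {
  private final int shift;       // log2 of the chunk size
  private final int mask;        // chunkSize - 1, selects the position inside a chunk
  private final double[][] chunks;
  private final long length;

  /** @param length total number of entries; @param shift log2 of the chunk size (1..30) */
  ChunkedDoubleArray(long length, int shift) {
    if (length < 0 || shift < 1 || shift > 30) {
      throw new IllegalArgumentException("invalid length or shift");
    }
    this.length = length;
    this.shift = shift;
    final int chunkSize = 1 << shift;
    this.mask = chunkSize - 1;
    // Number of chunks, rounding up.
    final int n = (int) ((length + chunkSize - 1) >>> shift);
    this.chunks = new double[n][];
    for (int i = 0; i < n; i++) {
      // The last chunk may be shorter than the others.
      final long remaining = length - ((long) i << shift);
      chunks[i] = new double[(int) Math.min(chunkSize, remaining)];
    }
  }

  double get(long i) {
    return chunks[(int) (i >>> shift)][(int) (i & mask)];
  }

  void set(long i, double v) {
    chunks[(int) (i >>> shift)][(int) (i & mask)] = v;
  }

  long length() {
    return length;
  }
}
```

With a realistic shift such as 27, each chunk holds 2^27 doubles (1 GiB), so a few dozen chunks are enough for a distance triangle well beyond 2^16 objects, provided the memory is available.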
I checked the implementation of
Most of the algorithms directly modify the
For the simple experiments I used
The question is how to proceed here. Maybe I'm able to provide an implementation of AGNES (in the beginning) with this new kind of implementation, but I'm not sure if I can also change the other implementations at all (I would be interested in the
Anderberg is certainly a better starting point than AGNES; it is surprisingly simple (that is part of why it's fast). It will not be sufficient to substitute MatrixParadigm, because the implementation writes directly into the underlying matrix (to improve performance, it exploits the triangular layout and does not go through a get(x,y) getter that repeats certain calculations unnecessarily; so we eventually removed this abstraction).
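As a rough illustration of the triangular layout and of why the size limit sits near 2^16 objects, here is a small sketch of the index arithmetic for a linearized strict lower triangle. The class and method names are hypothetical, and the exact layout ELKI uses internally may differ; the arithmetic itself is just n(n-1)/2 entries for n objects.

```java
/** Hypothetical sketch of index arithmetic for a linearized lower-triangular distance matrix. */
final class TriangleIndex {
  /** Number of stored entries for n objects (strict lower triangle): n(n-1)/2. */
  static long triangleSize(long n) {
    return n * (n - 1) / 2;
  }

  /** Offset of the pair (x, y), x != y, when rows are stored one after another. */
  static long offset(int x, int y) {
    if (x == y) {
      throw new IllegalArgumentException("the diagonal is not stored");
    }
    final long hi = Math.max(x, y), lo = Math.min(x, y);
    // Row hi starts after all shorter rows, i.e. after triangleSize(hi) entries.
    return triangleSize(hi) + lo;
  }

  /** Whether every entry for n objects can be addressed with an int index. */
  static boolean fitsIntIndex(long n) {
    return triangleSize(n) <= Integer.MAX_VALUE;
  }
}
```

For 65,536 objects the triangle has 2,147,450,880 entries, which still fits into an int index; for 65,537 objects it has 2,147,516,416 entries, which does not. Going beyond that size therefore needs long offsets and storage that is not a single Java array.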
I would stick to the pointer hierarchy to fulfill the requirements of your library (and just provide a transformation to the table style). Maybe the two versions can exist in ELKI side by side? Still, Anderberg looks more complicated (maybe also due to the triangular layout exploits) than simple AGNES.
Our AGNES does the same optimizations with respect to iterating the linearized diagonal form (which is also used in scipy, btw), so there is no difference there. P.S. single-link is a special case, in which certain effects cannot occur, hence the scalability of other linkages could be worse (but probably not by much). Single-link can be implemented with O(n) memory, too, without such a matrix in the first place.
Branch https://github.com/elki-project/elki/tree/feature/newhac has a rewrite to a merge history representation that is likely a bit easier to use. It also uses more integer indexes rather than DBIDs (representing cluster numbers 0..2n-2), as we no longer identify clusters with the last object as in the SLINK pointer hierarchy.
That sounds cool. Thank you for the information.
I have merged the branch into main. It was 20% faster (for AGNES, Anderberg) in a brief test.
Great, thanks a lot.
I do want to make a new release because of the many new features in ELKI, but usually I want these releases to come along with a supporting publication to make them easier to cite; this will need some time and preparation.
A new release (without the logging change) has been submitted as a demo paper, and I hope to release 0.8.0 at the end of the summer.
OK, great. Thank you for the information.
The demo paper has been accepted; it will appear at SISAP 2022 in Bologna, October 5-7.
Congrats.
ELKI 0.8.0 is on Maven, but with a new artifact group id, "io.github.elki-project".
Hi,
thank you for the really nice and helpful library,
I have a few questions about hierarchical clustering:
I would like to transform the PointerHierarchy(Representation)Result to the data structure used by e.g. the scipy linkage function, so that the information about the merges is easily accessible. Am I right that I first need to use the parent pointers and, in case multiple DBIDs point to the same parent, then use the parent distance? In the topologicalSort function the merge order is also used to further make the distinction in case it has the same distance. Correct? The only possibility to access the mergeOrder is to place a class in the same namespace. Is this the way to go?
I'm further interested in the merge hierarchy of datasets larger than 65,535 instances. I also have functions to calculate the distance matrix in parallel (and more efficiently for special distances such as Euclidean by using the BLAS library). As far as I can see, this would require a whole rewrite of the algorithms instead of just changing the MatrixParadigm class. Would such a rewrite be useful for the library, or what other options do I have?
Thanks a lot
Best regards
Sven
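For reference, the transformation asked about above could look roughly like the following sketch. It does not use the ELKI API: the parent pointers and parent distances are assumed to have been copied into plain arrays indexed 0..n-1, with a single root that points to itself, and the class name PointerToLinkage is made up. Ties in distance are broken by object index here, which is an assumption; as noted in the question, ELKI's topologicalSort also uses the merge order for this.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: pointer hierarchy (parent, distance) to a scipy-style linkage table. */
final class PointerToLinkage {
  /**
   * @param parent parent[i] is the parent object of i; the root has parent[i] == i
   * @param dist   dist[i] is the distance at which i is merged with its parent
   * @return n-1 rows {clusterA, clusterB, distance, size}; leaves are 0..n-1, merge k creates n+k
   */
  static double[][] toLinkage(int[] parent, double[] dist) {
    final int n = parent.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    // Process objects by increasing merge distance (stable sort: ties keep index order).
    Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));

    int[] uf = new int[n];     // union-find over objects
    int[] label = new int[n];  // current cluster number of each union-find root
    int[] size = new int[n];   // current cluster size of each union-find root
    for (int i = 0; i < n; i++) {
      uf[i] = i;
      label[i] = i;
      size[i] = 1;
    }
    double[][] z = new double[Math.max(0, n - 1)][];
    int k = 0;
    for (int i : order) {
      if (parent[i] == i) {
        continue; // the root is not merged into anything
      }
      final int a = find(uf, i), b = find(uf, parent[i]);
      final int total = size[a] + size[b];
      z[k] = new double[] { Math.min(label[a], label[b]), Math.max(label[a], label[b]), dist[i], total };
      uf[a] = b;            // join the two clusters
      size[b] = total;
      label[b] = n + k;     // the merged cluster gets the next cluster number
      k++;
    }
    if (n > 0 && k != n - 1) {
      throw new IllegalArgumentException("expected exactly one root");
    }
    return z;
  }

  private static int find(int[] uf, int x) {
    while (uf[x] != x) {
      uf[x] = uf[uf[x]]; // path halving
      x = uf[x];
    }
    return x;
  }
}
```

For example, with parents {1, 3, 3, 3} and distances {1.0, 3.0, 2.0, infinity}, the rows come out as (0, 1, 1.0, 2), (2, 3, 2.0, 2) and (4, 5, 3.0, 4), which uses the same cluster numbering 0..2n-2 as a merge history.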