
providing initial cluster counts #78

Open
tritolol opened this issue Mar 21, 2023 · 2 comments
Labels: discussion (A frequently asked question/interesting behaviour/etc.)

@tritolol
Hi, thanks for this awesome package.

Is there a way to provide initial count values before starting to fit?
I know that in agglomerative clustering each observation normally constitutes its own cluster at the start, so each count value begins at 1.
However, I would like to experiment with different initial distributions.
Is this possible with your current implementation?

@gagolews (Owner)

Yeah-nah, I have kind of done it in my recent paper Clustering with minimum spanning trees: How good can it be? (https://arxiv.org/abs/2303.05679), using some code from genieclust, but with custom modifications.

There is also now an implementation of the GIc algorithm available (https://genieclust.gagolewski.com/genieclust_gic.html) – it starts from the intersection of the partitions returned by Genie with different thresholds. It is a similar idea, but it still does not allow for an arbitrary starting partition.
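
A minimal usage sketch (assuming GIc exposes the same scikit-learn-like interface as genieclust.Genie; the class is marked as experimental, so the available constructor arguments may change between versions):

import numpy as np
import genieclust

X = np.random.default_rng(123).normal(size=(100, 2))  # toy data

gic = genieclust.GIc(n_clusters=3)   # GIc class as documented at the link above
labels = gic.fit_predict(X)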

Anyway, there is no public API in the package that would allow that; the only way is to modify src/c_genie.h, but good luck with that.

For your own experiments, I think it would be easier to implement Genie from scratch in Python (the algorithm is quite simple if you do not care about its time complexity) and then modify it accordingly. All the necessary bits are available: genieclust.inequity.gini_index, genieclust.internal.DisjointSets or genieclust.internal.GiniDisjointSets, genieclust.internal.mst_from_distance.
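
For instance, a rough from-scratch sketch along those lines (a toy illustration, not the optimized implementation from the package): it uses scipy for the MST and a plain union-find in place of genieclust.internal.mst_from_distance and genieclust.internal.DisjointSets, and calls genieclust.inequity.gini_index for the inequality measure. The genie_from_scratch helper, its arguments, and the exact gini_index call are assumptions for illustration only, and the quadratic-time merge loop deliberately ignores efficiency:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
import genieclust

def genie_from_scratch(X, n_clusters, gini_threshold=0.3, initial_counts=None):
    # Hypothetical helper, not part of genieclust.
    n = X.shape[0]

    # Minimum spanning tree of the pairwise distances (scipy stands in for
    # genieclust.internal.mst_from_distance); assumes strictly positive distances.
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).tocoo()
    remaining = sorted(zip(mst.data.tolist(), mst.row.tolist(), mst.col.tolist()))

    # Plain union-find with path compression (stands in for genieclust.internal.DisjointSets).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Cluster counts -- this is where an arbitrary initial distribution goes in.
    if initial_counts is None:
        counts = {i: 1 for i in range(n)}
    else:
        counts = {i: int(c) for i, c in enumerate(initial_counts)}

    k = n
    while k > n_clusters and remaining:
        sizes = np.array(list(counts.values()), dtype=float)
        if genieclust.inequity.gini_index(sizes) > gini_threshold:
            # Genie's correction: take the cheapest edge incident to a smallest cluster.
            smallest = sizes.min()
            idx = next(t for t, (_, i, j) in enumerate(remaining)
                       if counts[find(i)] == smallest or counts[find(j)] == smallest)
        else:
            idx = 0  # standard single-linkage step: cheapest remaining MST edge
        _, i, j = remaining.pop(idx)
        ri, rj = find(i), find(j)
        parent[rj] = ri
        counts[ri] += counts.pop(rj)
        k -= 1

    return np.array([find(i) for i in range(n)])  # labels = cluster representatives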

Let me know how motivated/confident/adventurous you feel about the above; maybe I can help with something.

@gagolews gagolews added the discussion A frequently asked question/interesting behaviour/etc. label Mar 21, 2023
@tritolol (Author) commented Mar 22, 2023

Thanks for this helpful response.

I think I have what I need now by using this (dumb) workaround:
I simply repeat the rows of my feature matrix according to the counts in a distribution vector, then run Genie on this expanded matrix instead and ignore the first merging level in the linkage tree.

Here is my code using PyTorch:

import torch
import genieclust

embeddings = ...    # feature matrix (torch tensor of shape (n, d))
distribution = ...  # 1-D tensor of positive ints, length embeddings.shape[0]

# repeat each row according to its desired initial count
emb_repeated = torch.repeat_interleave(embeddings, distribution, dim=0)
g = genieclust.Genie(compute_full_tree=True).fit(emb_repeated)

Let me know if this makes sense.
