
providing initial cluster counts #78

Open
tritolol opened this issue Mar 21, 2023 · 2 comments
Labels: discussion (A frequently asked question/interesting behaviour/etc.)

@tritolol
Hi, thanks for this awesome package.

Is there a way to provide initial count values before starting to fit?
I know that in agglomerative clustering each observation normally constitutes its own cluster at the start, so each count value begins at 1.
However, I would like to experiment with different initial distributions.
Is this possible with your current implementation?

@gagolews (Owner)

Yeah-nah, I have kind of done it in my recent paper Clustering with minimum spanning trees: How good can it be? (https://arxiv.org/abs/2303.05679), using some code from genieclust, but with custom modifications.

There is also now an implementation of the GIc algorithm available (https://genieclust.gagolewski.com/genieclust_gic.html) – it starts from the intersection of the partitions returned by Genie with different thresholds. It is a similar idea, but it still does not allow for an arbitrary starting partition.
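
A minimal usage sketch (assuming GIc exposes the same scikit-learn-like interface as genieclust.Genie; the class is marked as experimental, so the available constructor arguments may change between versions):

import numpy as np
import genieclust

X = np.random.default_rng(123).normal(size=(100, 2))  # toy data

gic = genieclust.GIc(n_clusters=3)   # GIc class as documented at the link above
labels = gic.fit_predict(X)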

Anyway, there is no public API in the package that would allow that; the only way is to modify src/c_genie.h, but good luck with that.

For your own experiments, I think it would be easier to implement Genie from scratch in Python (the algorithm is quite simple if you do not care about its time complexity) and then modify it accordingly. All the necessary bits are available: genieclust.inequity.gini_index, genieclust.internal.DisjointSets or genieclust.internal.GiniDisjointSets, genieclust.internal.mst_from_distance.
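
For instance, a rough from-scratch sketch along those lines (a toy illustration, not the optimized implementation from the package): it uses scipy for the MST and a plain union-find in place of genieclust.internal.mst_from_distance and genieclust.internal.DisjointSets, and calls genieclust.inequity.gini_index for the inequality measure. The genie_from_scratch helper, its arguments, and the exact gini_index call are assumptions for illustration only, and the quadratic-time merge loop deliberately ignores efficiency:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
import genieclust

def genie_from_scratch(X, n_clusters, gini_threshold=0.3, initial_counts=None):
    # Hypothetical helper, not part of genieclust.
    n = X.shape[0]

    # Minimum spanning tree of the pairwise distances (scipy stands in for
    # genieclust.internal.mst_from_distance); assumes strictly positive distances.
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).tocoo()
    remaining = sorted(zip(mst.data.tolist(), mst.row.tolist(), mst.col.tolist()))

    # Plain union-find with path compression (stands in for genieclust.internal.DisjointSets).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Cluster counts -- this is where an arbitrary initial distribution goes in.
    if initial_counts is None:
        counts = {i: 1 for i in range(n)}
    else:
        counts = {i: int(c) for i, c in enumerate(initial_counts)}

    k = n
    while k > n_clusters and remaining:
        sizes = np.array(list(counts.values()), dtype=float)
        if genieclust.inequity.gini_index(sizes) > gini_threshold:
            # Genie's correction: take the cheapest edge incident to a smallest cluster.
            smallest = sizes.min()
            idx = next(t for t, (_, i, j) in enumerate(remaining)
                       if counts[find(i)] == smallest or counts[find(j)] == smallest)
        else:
            idx = 0  # standard single-linkage step: cheapest remaining MST edge
        _, i, j = remaining.pop(idx)
        ri, rj = find(i), find(j)
        parent[rj] = ri
        counts[ri] += counts.pop(rj)
        k -= 1

    return np.array([find(i) for i in range(n)])  # labels = cluster representatives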

Let me know how motivated/confident/adventurous you feel about the above; maybe I can help with something.

@gagolews gagolews added the discussion A frequently asked question/interesting behaviour/etc. label Mar 21, 2023
@tritolol (Author) commented Mar 22, 2023

Thanks for this helpful response.

I think I have what I need now by using this (dumb) workaround:
I simply repeat the rows of my feature matrix according to the counts in a distribution vector, then run Genie on this expanded matrix instead and ignore the first merging level in the linkage tree.

Here is my code using PyTorch:

import torch
import genieclust

embeddings = ...    # feature matrix (torch tensor of shape (n, d))
distribution = ...  # 1-D tensor of positive ints, length embeddings.shape[0]

# repeat each row according to its desired initial count
emb_repeated = torch.repeat_interleave(embeddings, distribution, dim=0)
g = genieclust.Genie(compute_full_tree=True).fit(emb_repeated)

Let me know if this makes sense.
