distance_threshold #77
Comments
Yes, you can achieve that by passing compute_full_tree=True. This way, the whole "linkage matrix" can be generated, together with the corresponding distances. See Dendrograms in https://genieclust.gagolewski.com/weave/basics.html. Some code as a starting point for your experiments:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import clustbench
# Load example data
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
benchmark = clustbench.load_dataset("wut", "x2", url=data_url)
X = benchmark.data
# Fit with compute_full_tree=True
import genieclust
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)
# Plot the dendrogram
import scipy.cluster.hierarchy
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()
# Index of the first merge step whose distance exceeds a prescribed threshold (0.3 here):
w = np.argmax(g.distances_ > 0.3)
w
# How many clusters?
k = X.shape[0]-w
k
# Determine the k-partition:
g.set_params(n_clusters=k)
c = g.fit_predict(X)
# Plot it:
genieclust.plots.plot_scatter(X, labels=c)
plt.show()
Let me know if I can give you any more details/hints about the above.
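One edge case in the thresholding step may be worth guarding against: np.argmax returns 0 when no merge distance exceeds the threshold, which would give k equal to the number of points instead of a single cluster. A minimal sketch of a guarded version follows. The helper name n_clusters_at_threshold and the toy distances are hypothetical, and the code assumes the distances are nondecreasing, with one entry per merge step.

```python
import numpy as np

def n_clusters_at_threshold(distances, n_samples, threshold):
    """Number of clusters left if merging stops before the first merge
    whose distance exceeds `threshold`.

    Assumes `distances` is nondecreasing with one entry per merge step
    (n_samples - 1 entries), like the distance column of a linkage matrix.
    """
    distances = np.asarray(distances, dtype=float)
    exceeds = distances > threshold
    if not exceeds.any():
        # np.argmax would return 0 here, wrongly implying that no merge happened
        return 1
    w = int(np.argmax(exceeds))  # index of the first merge above the threshold
    return n_samples - w

# Toy merge distances for 5 points (hypothetical values)
toy_distances = [0.1, 0.2, 0.5, 0.9]
print(n_clusters_at_threshold(toy_distances, 5, 0.3))  # 3
print(n_clusters_at_threshold(toy_distances, 5, 1.0))  # 1
```

With a fitted model, the call would presumably look like n_clusters_at_threshold(g.distances_, X.shape[0], 0.3), and the result can be passed to g.set_params(n_clusters=k) as in the snippet above.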
@gagolews Thanks for your feedback. The resulting g.distances_ are all relatively low. I assume that g.distances_ refers to the distances between clusters on a scale of 0 to 1. I'll post here if I find something that works.
Quick update from my side, for documentation. The ticket can be closed afterwards. I'm using this function: https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection and then I use the resulting cluster count to run Genie.
This still scales better than AgglomerativeClustering (exact=False) and provides better clusters than community_detection itself. community_detection has a very good performance/quality ratio and is very useful for getting a preview or an estimate of the clusters.
Hi,
I've been testing genieclust as a replacement for sklearn.cluster.AgglomerativeClustering.
I love that the API is so similar, almost a drop-in replacement for sklearn.cluster.*
I currently use AgglomerativeClustering because
Is there a way to achieve clustering by distance_threshold with genieclust?
Ideally, I don't know the number of clusters ahead of time, but I know that the members of each cluster should be similar to a certain degree.
Thanks in advance for any help.
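For reference, the scikit-learn behaviour the question refers to can be sketched as follows. This is only an illustration on synthetic data: the blob layout, the threshold of 1.0, and the single linkage are arbitrary choices made for the example, not values taken from this thread.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Two tight, well-separated synthetic blobs of 20 points each
X = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.05, size=(20, 2)),
])

# n_clusters must be None when distance_threshold is given;
# merging stops once the linkage distance reaches the threshold
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, linkage="single"
)
labels = model.fit_predict(X)

# The number of clusters is discovered from the data, not set in advance
print(model.n_clusters_)
```

Here the number of clusters is read from model.n_clusters_ after fitting, which is the quantity the question would like to obtain from genieclust without fixing it up front.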