distance_threshold #77

Closed
leonardlin opened this issue Mar 19, 2023 · 4 comments
Labels
discussion A frequently asked question/interesting behaviour/etc.

Comments

@leonardlin

Hi,
I've been testing genieclust as a replacement for sklearn.cluster.AgglomerativeClustering.
I love that the API is so similar; it is almost a drop-in replacement for sklearn.cluster.*

I currently use AgglomerativeClustering because

  • I like the outcome
  • I can cluster with distance_threshold instead of n_clusters

Is there a way to achieve clustering by distance_threshold with genieclust?
Ideally, I don't need to know the number of clusters ahead of time; I only know that the members of each cluster should be similar to a certain degree.

Thanks in advance for any help.

@gagolews
Owner

Yes, you can achieve that by passing compute_full_tree=True to the constructor.

This way, the whole "linkage matrix" can be generated, together with the corresponding distances.

See Dendrograms in https://genieclust.gagolewski.com/weave/basics.html and the description of children_ and distances_ in https://genieclust.gagolewski.com/genieclust_genie.html#genieclust.Genie

Some code as a starting point for your experiments:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import clustbench

# Load example data 
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
benchmark = clustbench.load_dataset("wut", "x2", url=data_url)
X = benchmark.data


# Fit with compute_full_tree=True
import genieclust
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)


# Plot the dendrogram
import scipy.cluster
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()

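# Iteration (merge step) with a prescribed distance:
# index of the first merge whose distance exceeds the threshold
# (note: np.argmax returns 0 if no distance exceeds it)
w = np.argmax(g.distances_ > 0.3)
print(w)

# How many clusters? (w merges have been performed before step w)
k = X.shape[0]-w
print(k)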

# Determine the k-partition:
g.set_params(n_clusters=k)
c = g.fit_predict(X)

# Plot it:
genieclust.plots.plot_scatter(X, labels=c)
plt.show()
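
The threshold lookup above can be wrapped in a small helper. This is only a sketch: the name `n_clusters_for_threshold` is hypothetical (it is not part of genieclust), and it assumes the merge distances are sorted in nondecreasing order, which is worth checking against the description of distances_. Unlike the `np.argmax` lookup, it also covers the case where no merge distance exceeds the threshold.

```python
from bisect import bisect_right


def n_clusters_for_threshold(distances, threshold):
    """Number of clusters left after performing every merge whose
    distance is <= threshold.

    `distances` is assumed to be a nondecreasing sequence of merge
    distances of length n_samples - 1 (e.g. g.distances_).
    """
    n_samples = len(distances) + 1
    # Count the merges with distance <= threshold
    # (binary search; relies on the sortedness assumption above).
    merges = bisect_right(distances, threshold)
    return n_samples - merges


# Toy merge distances for 6 points (made-up values, for illustration only)
toy_distances = [0.1, 0.2, 0.25, 0.4, 0.9]
print(n_clusters_for_threshold(toy_distances, 0.3))  # 3
```

With a fitted model, `k = n_clusters_for_threshold(g.distances_, 0.3)` could then be passed to `g.set_params(n_clusters=k)`. Swapping `bisect_right` for `bisect_left` would stop at merges whose distance equals the threshold instead of performing them.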

@gagolews
Owner

Let me know if I can give you any more details/hints about the above.

@gagolews gagolews added the discussion A frequently asked question/interesting behaviour/etc. label Mar 21, 2023
@leonardlin
Author

@gagolews thanks for your feedback.
I tried your approach, but the resulting clusters didn't look sensible.

The resulting g.distances_ are all relatively low.
I'm using affinity='cosinesimil'.

I assume that g.distances_ refers to the distances between clusters on a scale of 0 to 1.
It's not exactly the same as the threshold in AgglomerativeClustering, but I guess it's similar.

I'll post here if I find something that works.
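
A note on the scale assumed above: if the cosine-based affinity is computed as 1 minus the cosine similarity (an assumption here, not confirmed in this thread), the values range from 0 to 2 rather than 0 to 1. A minimal sketch in plain Python, with a hypothetical helper `cosine_distance`:

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

Under that definition, a threshold tuned for another linkage or metric would not necessarily carry over unchanged.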

@leonardlin
Author

leonardlin commented Mar 26, 2023

A quick update from my side, for documentation. The ticket can be closed afterwards.

I'm using this function: https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection
to estimate the number of clusters first.

Then I use the cluster count to run Genie.

import genieclust
import sentence_transformers.util

# determine cluster count
# (corpus_embeddings and threshold are defined elsewhere)
clusters = sentence_transformers.util.community_detection(corpus_embeddings, min_community_size=1, threshold=threshold)
cluster_count = len(clusters)

# clustering
clustering_model = genieclust.Genie(n_clusters=cluster_count, affinity='cosinesimil', exact=False)
cluster_assignment = clustering_model.fit_predict(corpus_embeddings)

With exact=False, this still scales better than AgglomerativeClustering and provides better clusters than community_detection itself.

community_detection has a very good performance/quality ratio. It is very useful for getting a preview or an estimate of the clusters.
