distance_threshold #77

leonardlin · 2023-03-19T17:33:30Z

Hi,
I've been testing genieclust as a replacement for sklearn.cluster.AgglomerativeClustering.
I love that the API is so similar, making it almost a drop-in replacement for sklearn.cluster.*

I currently use AgglomerativeClustering because:

I like the outcome
I can cluster with distance_threshold instead of n_clusters

Is there a way to achieve clustering by distance_threshold with genieclust?
Ideally, I don't need to know the number of clusters ahead of time; I only know that the members of each cluster should be similar to a certain degree.

Thanks in advance for any help.

gagolews · 2023-03-21T07:58:11Z

Yes, you can achieve that by passing compute_full_tree=True to the constructor.

This way, the whole "linkage matrix" can be generated, together with the corresponding distances.

See Dendrograms in https://genieclust.gagolewski.com/weave/basics.html and the description of children_ and distances_ in https://genieclust.gagolewski.com/genieclust_genie.html#genieclust.Genie
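
To illustrate what cutting such a hierarchy means, here is a minimal pure-Python sketch. It assumes that children_ follows the scipy.cluster.hierarchy.linkage convention (ids below n_samples are original points; id n_samples+i is the cluster created by merge i). The helper name cut_after_merges is hypothetical and not part of genieclust:

```python
def cut_after_merges(children, n_samples, n_merges):
    """Label points after applying only the first n_merges rows of children.

    children[i] = (a, b) means that clusters a and b are merged in step i,
    creating a new cluster with id n_samples + i (scipy linkage convention).
    """
    parent = list(range(n_samples))  # union-find forest over original points

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # representative[c] is an original point that belongs to cluster id c
    representative = list(range(n_samples))
    for a, b in list(children)[:n_merges]:
        root_a = find(representative[int(a)])
        root_b = find(representative[int(b)])
        parent[root_b] = root_a
        representative.append(root_a)  # cluster id n_samples + i

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    labels = []
    for i in range(n_samples):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Toy hierarchy on 4 points: merge (0, 1), then (2, 3), then both clusters.
toy_children = [(0, 1), (2, 3), (4, 5)]
labels_after_two = cut_after_merges(toy_children, n_samples=4, n_merges=2)
```

Applying fewer merges leaves more clusters: zero merges gives every point its own label, and all n_samples-1 merges give a single cluster.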

Some code as a starting point for your experiments:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import clustbench

# Load example data 
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
benchmark = clustbench.load_dataset("wut", "x2", url=data_url)
X = benchmark.data


# Fit with compute_full_tree=True
import genieclust
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)


# Plot the dendrogram
import scipy.cluster
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()

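# Iteration (merge step) at which the merge distance first exceeds
# the prescribed threshold (here 0.3); note that np.argmax returns 0
# if no distance exceeds the threshold:
w = np.argmax(g.distances_ > 0.3)
w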

# How many clusters?
k = X.shape[0]-w
k

# Determine the k-partition:
g.set_params(n_clusters=k)
c = g.fit_predict(X)

# Plot it:
genieclust.plots.plot_scatter(X, labels=c)
plt.show()
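
The step from a distance threshold to a cluster count can also be written as a small helper that covers the case where no merge distance exceeds the threshold. This is only a sketch: it assumes the merge distances are sorted in nondecreasing order, and n_clusters_for_threshold is a hypothetical name, not a genieclust function:

```python
from bisect import bisect_right


def n_clusters_for_threshold(distances, n_samples, threshold):
    """Number of clusters left after all merges with distance <= threshold.

    distances must be sorted in nondecreasing order, one entry per merge
    (n_samples - 1 entries for a full tree).
    """
    # Count the merges whose distance does not exceed the threshold.
    n_merges = bisect_right(distances, threshold)
    # Each merge reduces the number of clusters by one.
    return n_samples - n_merges


# Toy example: 5 points, hence 4 merges.
toy_distances = [0.1, 0.2, 0.5, 0.9]
k = n_clusters_for_threshold(toy_distances, n_samples=5, threshold=0.3)
```

With the toy distances above, two merges fall below 0.3, so three clusters remain; a threshold above the largest distance yields a single cluster.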

gagolews · 2023-03-21T07:58:36Z

Let me know if I can give you any more details/hints about the above.

leonardlin · 2023-03-24T23:28:05Z

@gagolews thanks for your feedback.
I tried your approach but the resulting clusters didn't look sensible.

The resulting g.distances_ are all relatively low.
I'm using affinity='cosinesimil'.

I assume that g.distances_ refers to the distances between clusters on a scale between 0 and 1.
It's not exactly the same as the threshold in AgglomerativeClustering, but I guess it's similar.

I'll post here if I find something that works.
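
On the question of scale, a small sketch may help. It assumes the usual definition of cosine distance as 1 minus cosine similarity; whether genieclust's 'cosinesimil' affinity follows exactly this convention is an assumption here. Under that definition the values lie between 0 and 2, and stay within 0 and 1 only when the vectors have no negative components:

```python
import math


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way are at distance 0, orthogonal vectors at 1, and opposite vectors at 2, so a threshold tuned for one tool only carries over to another if both use the same distance definition and linkage.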

leonardlin · 2023-03-26T07:32:19Z

A quick update from my side, for documentation. The ticket can be closed afterwards.

I'm using this function: https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection
to estimate the number of clusters first.

Then I use the cluster count to run Genie.

import genieclust
import sentence_transformers.util

# corpus_embeddings and threshold are defined elsewhere

# determine cluster count
clusters = sentence_transformers.util.community_detection(corpus_embeddings, min_community_size=1, threshold=threshold)
cluster_count = len(clusters)

# clustering
clustering_model = genieclust.Genie(n_clusters=cluster_count, affinity='cosinesimil', exact=False)
cluster_assignment = clustering_model.fit_predict(corpus_embeddings)

With exact=False, this still scales better than AgglomerativeClustering and provides better clusters than community_detection itself.

community_detection has a very good performance/quality ratio. It is very useful for getting a preview or an estimate of the clusters.

gagolews added the "discussion" label (A frequently asked question/interesting behaviour/etc.) Mar 21, 2023

leonardlin closed this as completed Mar 24, 2023

leonardlin reopened this Mar 26, 2023

leonardlin closed this as completed Mar 31, 2023
