G-Means: Setting maximum number of clusters like for X-Means #602

Closed
tschechlovdev opened this issue May 27, 2020 · 8 comments
@tschechlovdev

Hi,

I wanted to ask whether it would be possible to add a k_max parameter to the gmeans call, similar to xmeans, which already supports this parameter. The reason is that gmeans returns a really large number of clusters for some datasets (sometimes even as many clusters as there are data points, which is the worst case).
I do not know the reason behind this, but it would be nice if I could limit the number of clusters the way I can for xmeans.

@annoviko annoviko self-assigned this May 27, 2020
@annoviko annoviko added the Proposal Tasks that have been proposed by users label May 27, 2020
@annoviko annoviko added this to To do in 0.10.0 Feature Development via automation May 27, 2020
@annoviko annoviko moved this from To do to In progress in 0.10.0 Feature Development May 27, 2020
@annoviko annoviko added the Enhancement Tasks related to enhancement and development label May 27, 2020
@annoviko
Owner

Hi @tschechlovdev ,

The changes are introduced and available on the master branch. Be aware that G-Means uses the C++ implementation by default, so you have to build the C++ pyclustering library yourself if you are going to use the version from the master branch.

0.10.0 Feature Development automation moved this from In progress to Done May 27, 2020
@tschechlovdev
Author

Hi @annoviko,

thank you for your fast support! That helps me a lot!

However, I think it does not work 100% correctly. For some datasets I am specifying k_max=200 but getting results such as:
Number of clusters: 226
Number of clusters: 223
....

I am running the following code:

from pyclustering.cluster.gmeans import gmeans
from sklearn.datasets import make_blobs
import numpy as np

# data is a synthetically generated dataset built with make_blobs from sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
# (the make_blobs arguments here are illustrative)
data, _ = make_blobs(n_samples=5000, centers=25, n_features=30, random_state=0)

gmeans_instance = gmeans(data, k_max=200)

labels = gmeans_instance.process().predict(data)

predicted_k = len(np.unique(labels))
print("Number of clusters: {}".format(predicted_k))

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated with the make_blobs function from sklearn. I thought gmeans should be well suited to such datasets, yet it always runs into k_max (or an even higher value) for me. Do you have any idea why this happens?

@annoviko annoviko reopened this May 28, 2020
0.10.0 Feature Development automation moved this from Done to In progress May 28, 2020
@annoviko
Owner

Hi, @tschechlovdev
Yes, that is possible: I allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose the statistical optimization says we have to split 76 of them. As a result we get 226 clusters and then stop processing due to the limit k_max=200. Otherwise that last pass would only be a partial statistical optimization.

But I agree this is confusing behavior; I will make the limit interrupt the statistical optimization as well.
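A hypothetical sketch of that behavior (the function names here are illustrative, not pyclustering's internals): the split decision is applied to every cluster in a pass, and k_max is consulted only afterwards.

def split_pass(clusters, should_split, split_in_two, k_max):
    # apply the statistical split decision to every cluster in this pass
    result = []
    for cluster in clusters:
        if should_split(cluster):
            result.extend(split_in_two(cluster))
        else:
            result.append(cluster)
    # k_max is checked only after the full pass, so the count can overshoot
    return result, len(result) >= k_max

# toy run: 150 clusters, the statistical test asks to split 76 of them
clusters = list(range(150))
result, stop = split_pass(clusters,
                          should_split=lambda c: c < 76,
                          split_in_two=lambda c: [c, c],
                          k_max=200)
print(len(result), stop)   # 226 True: already past k_max=200 when processing stops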

@annoviko
Owner

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated with the make_blobs function from sklearn. I thought gmeans should be well suited to such datasets, yet it always runs into k_max (or an even higher value) for me. Do you have any idea why this happens?

That is possible due to k-means++: this algorithm uses a randomized mechanism to find optimal centers for a particular number of clusters, so there is a probability that the resulting number of clusters will differ between runs. Here is an example with the simple data set SAMPLE03:
[two attached images: G-Means clustering results for SAMPLE03 from two separate runs]
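A minimal sketch of that run-to-run variability, using the bundled SIMPLE_SAMPLES.SAMPLE_SIMPLE3 as a stand-in for SAMPLE03:

from pyclustering.cluster.gmeans import gmeans
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Without a fixed seed, k-means++ chooses initial centers at random, so the
# number of clusters that G-Means settles on can differ between runs.
for trial in range(3):
    instance = gmeans(data).process()
    print("run {}: {} clusters".format(trial, len(instance.get_clusters())))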

@annoviko
Owner

annoviko commented May 28, 2020

@tschechlovdev ,
As I can see, you want labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that:

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

gmeans_instance = gmeans(data).process()
clusters = gmeans_instance.get_clusters()   # CLUSTER_INDEX_LIST_SEPARATION format

# re-encode the clusters so that each point gets a single label
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print(labels)

Output:

[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

I suggest this because with your current code you perform cluster analysis of your data twice: once in process() and again in predict(data).

@tschechlovdev
Author

Hi, @tschechlovdev
Yes, that is possible: I allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose the statistical optimization says we have to split 76 of them. As a result we get 226 clusters and then stop processing due to the limit k_max=200. Otherwise that last pass would only be a partial statistical optimization.

But I agree this is confusing behavior; I will make the limit interrupt the statistical optimization as well.

Ah, I see that makes sense.

That is possible due to k-means++: this algorithm uses a randomized mechanism to find optimal centers for a particular number of clusters, so there is a probability that the resulting number of clusters will differ between runs. Here is an example with the simple data set SAMPLE03:

I see, that is why I get different results across multiple runs. However, I tried different datasets and it seems that G-Means has a problem with high-dimensional (d > 25) datasets; for these, it always runs into k_max (or higher). For datasets with d < 20 it seems to work fine, but I suspect that is a limitation of the algorithm in general rather than of this implementation.

@tschechlovdev ,
As I can see, you want labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that:
...

Thanks, didn't know about that way :)

@annoviko
Owner

annoviko commented May 28, 2020

@tschechlovdev ,
I have already implemented an additional parameter for G-Means in the scope of #578: you can set your own random state as a random seed. In other words, it gives you the ability to control functions like random.seed(random_state) and numpy.random.seed(random_state) inside algorithms that use random functionality, such as G-Means, X-Means, SyncNet, etc. Here is an example:

gmeans_instance = gmeans(data, random_state=1, k_max=200).process()

I think tomorrow I will be able to deliver these changes to the master branch, including the update for k_max, and you will be able to try these parameters.

@annoviko
Owner

@tschechlovdev ,

The changes regarding gmeans are on the master branch. You can use random_state to specify the seed for the random engine and k_max to control the number of clusters. The C++ pyclustering library needs to be built in order to use these changes.
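For reference, a minimal end-to-end sketch combining the pieces from this thread: a fixed seed, the cluster cap, and the encoder-based label conversion. The parameter names (random_state, k_max) follow this thread; verify them against the released API.

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

# fixed seed for reproducibility, k_max to cap the number of clusters
instance = gmeans(data, random_state=1, k_max=200).process()
clusters = instance.get_clusters()

# convert CLUSTER_INDEX_LIST_SEPARATION into one label per point
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print("clusters found: {}".format(len(set(labels))))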

0.10.0 Feature Development automation moved this from In progress to Done May 29, 2020