G-Means: Setting maximum number of clusters like for X-Means #602

Closed
tschechlovdev opened this issue May 27, 2020 · 8 comments
@tschechlovdev

Hi,

I wanted to ask whether it would be possible to add a k_max parameter to the gmeans call, similar to xmeans, which already supports this parameter. The reason is that gmeans returns a really large number of clusters for some datasets (sometimes even as many clusters as there are data points, which is the worst case).
I do not know the reason behind this, but it would be nice if I could limit the number of clusters the way I can for xmeans.

@annoviko annoviko self-assigned this May 27, 2020
@annoviko annoviko added the Proposal Tasks that have been proposed by users label May 27, 2020
@annoviko annoviko added this to To do in 0.10.0 Feature Development via automation May 27, 2020
@annoviko annoviko moved this from To do to In progress in 0.10.0 Feature Development May 27, 2020
@annoviko annoviko added the Enhancement Tasks related to enhancement and development label May 27, 2020
@annoviko
Owner

Hi @tschechlovdev ,

The changes are introduced and available on the master branch. Be aware that G-Means uses the C++ implementation by default, so you have to build the C++ pyclustering library yourself if you are going to use the version from the master branch.

0.10.0 Feature Development automation moved this from In progress to Done May 27, 2020
@tschechlovdev
Author

Hi @annoviko,

thank you for your fast support! That helps me a lot!

However, I think it does not work 100% correctly. For some datasets I am specifying k_max=200 but getting results such as:
Number of clusters: 226
Number of clusters: 223
....

I am running the following code:

from pyclustering.cluster.gmeans import gmeans
from sklearn.datasets import make_blobs
import numpy as np

# data is a synthetically generated dataset built with make_blobs from sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
# (the make_blobs arguments here are illustrative)
data, _ = make_blobs(n_samples=5000, centers=25, n_features=30, random_state=0)

gmeans_instance = gmeans(data, k_max=200)

labels = gmeans_instance.process().predict(data)

predicted_k = len(np.unique(labels))
print("Number of clusters: {}".format(predicted_k))

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated with the make_blobs function from sklearn. I thought gmeans should be well suited to such datasets, yet it always runs into k_max (or an even higher value) for me. Do you have any idea why this happens?

@annoviko annoviko reopened this May 28, 2020
0.10.0 Feature Development automation moved this from Done to In progress May 28, 2020
@annoviko
Owner

Hi, @tschechlovdev
Yes, that is possible: I allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose the statistical optimization says we have to split 76 of them. As a result we get 226 clusters and then stop processing due to the limit k_max=200. Otherwise that last pass would only be a partial statistical optimization.

But I agree this is confusing behavior; I will make the limit interrupt the statistical optimization as well.
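A hypothetical sketch of that behavior (the function names here are illustrative, not pyclustering's internals): the split decision is applied to every cluster in a pass, and k_max is consulted only afterwards.

def split_pass(clusters, should_split, split_in_two, k_max):
    # apply the statistical split decision to every cluster in this pass
    result = []
    for cluster in clusters:
        if should_split(cluster):
            result.extend(split_in_two(cluster))
        else:
            result.append(cluster)
    # k_max is checked only after the full pass, so the count can overshoot
    return result, len(result) >= k_max

# toy run: 150 clusters, the statistical test asks to split 76 of them
clusters = list(range(150))
result, stop = split_pass(clusters,
                          should_split=lambda c: c < 76,
                          split_in_two=lambda c: [c, c],
                          k_max=200)
print(len(result), stop)   # 226 True: already past k_max=200 when processing stops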

@annoviko
Owner

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated with the make_blobs function from sklearn. I thought gmeans should be well suited to such datasets, yet it always runs into k_max (or an even higher value) for me. Do you have any idea why this happens?

That is possible due to k-means++: this algorithm uses a randomized mechanism to find optimal centers for a particular number of clusters, so there is a probability that the resulting number of clusters will differ between runs. Here is an example with the simple data set SAMPLE03:
[two attached images: G-Means clustering results for SAMPLE03 from two separate runs]
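A minimal sketch of that run-to-run variability, using the bundled SIMPLE_SAMPLES.SAMPLE_SIMPLE3 as a stand-in for SAMPLE03:

from pyclustering.cluster.gmeans import gmeans
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Without a fixed seed, k-means++ chooses initial centers at random, so the
# number of clusters that G-Means settles on can differ between runs.
for trial in range(3):
    instance = gmeans(data).process()
    print("run {}: {} clusters".format(trial, len(instance.get_clusters())))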

@annoviko
Owner

annoviko commented May 28, 2020

@tschechlovdev ,
As I can see, you want labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that:

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

gmeans_instance = gmeans(data).process()
clusters = gmeans_instance.get_clusters()   # CLUSTER_INDEX_LIST_SEPARATION format

# re-encode the clusters so that each point gets a single label
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print(labels)

Output:

[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

I suggest this because with your current code you perform cluster analysis of your data twice: once in process() and again in predict(data).

@tschechlovdev
Author

Hi, @tschechlovdev
Yes, that is possible: I allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose the statistical optimization says we have to split 76 of them. As a result we get 226 clusters and then stop processing due to the limit k_max=200. Otherwise that last pass would only be a partial statistical optimization.

But I agree this is confusing behavior; I will make the limit interrupt the statistical optimization as well.

Ah, I see that makes sense.

That is possible due to k-means++: this algorithm uses a randomized mechanism to find optimal centers for a particular number of clusters, so there is a probability that the resulting number of clusters will differ between runs. Here is an example with the simple data set SAMPLE03:

I see, that is why I get different results across multiple runs. However, I tried different datasets and it seems that G-Means has a problem with high-dimensional (d > 25) datasets; for these, it always runs into k_max (or higher). For datasets with d < 20 it seems to work fine, but I suspect that is a limitation of the algorithm in general rather than of this implementation.

@tschechlovdev ,
As I can see, you want labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that:
...

Thanks, didn't know about that way :)

@annoviko
Owner

annoviko commented May 28, 2020

@tschechlovdev ,
I have already implemented an additional parameter for G-Means in the scope of #578: you can set your own random state as a random seed. In other words, it gives you the ability to control functions like random.seed(random_state) and numpy.random.seed(random_state) inside algorithms that use random functionality, such as G-Means, X-Means, SyncNet, etc. Here is an example:

gmeans_instance = gmeans(data, random_state=1, k_max=200).process()

I think tomorrow I will be able to deliver these changes to the master branch, including the update for k_max, and you will be able to try these parameters.

@annoviko
Owner

@tschechlovdev ,

The changes regarding gmeans are on the master branch. You can use random_state to specify the seed for the random engine and k_max to control the number of clusters. The C++ pyclustering library needs to be built in order to use these changes.
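For reference, a minimal end-to-end sketch combining the pieces from this thread: a fixed seed, the cluster cap, and the encoder-based label conversion. The parameter names (random_state, k_max) follow this thread; verify them against the released API.

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

# fixed seed for reproducibility, k_max to cap the number of clusters
instance = gmeans(data, random_state=1, k_max=200).process()
clusters = instance.get_clusters()

# convert CLUSTER_INDEX_LIST_SEPARATION into one label per point
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print("clusters found: {}".format(len(set(labels))))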

0.10.0 Feature Development automation moved this from In progress to Done May 29, 2020