[ccore.xmeans][pyclustering.cluster.xmeans] Amount of centers and amount of clusters not matched #389

Closed
annoviko opened this issue Nov 9, 2017 · 5 comments

@annoviko
Owner

annoviko commented Nov 9, 2017

Introduction
The number of allocated centers does not match the number of allocated clusters. This bug was not observed in the Python part because the centers were calculated by the Python implementation.

======================================================================
FAIL: testMndlWrongStartClusterAllocationSampleSimple2ByCore (__main__.XmeansIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\integration\it_xmeans.py", line 74, in testMndlWrongStartClusterAllocationSampleSimple2ByCore
    XmeansTestTemplates.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7]], [10, 5, 8], splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH, 20, True);
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\xmeans_templates.py", line 49, in templateLengthProcessData
    assert len(clusters) == len(centers);
AssertionError

For some tests, a similar problem is observed in the Python implementation:

======================================================================
FAIL: testBicClusterAllocationMaxLessRealSampleSimple4 (__main__.XmeansUnitTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\unit\ut_xmeans.py", line 92, in testBicClusterAllocationMaxLessRealSampleSimple4
    XmeansTestTemplates.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 4.0]], None, splitting_type.BAYESIAN_INFORMATION_CRITERION, 2, False);
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\xmeans_templates.py", line 49, in templateLengthProcessData
    assert len(clusters) == len(centers);
AssertionError
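
Both tracebacks fail on the same invariant: one center per cluster. The sketch below is not the library's code; it only illustrates that check on made-up toy data and shows one way to restore the invariant, by recomputing each center as the mean of its cluster's points. The helper names `recompute_centers` and `check_lengths` are hypothetical, and the sketch assumes every cluster is non-empty.

```python
def recompute_centers(sample, clusters):
    """Return one center per cluster: the coordinate-wise mean of its points.

    Assumes every cluster is a non-empty list of indices into `sample`.
    """
    centers = []
    for cluster in clusters:
        points = [sample[index] for index in cluster]
        dimension = len(points[0])
        centers.append(
            [sum(point[d] for point in points) / len(points) for d in range(dimension)]
        )
    return centers


def check_lengths(clusters, centers):
    """Mirror of the failing assertion: there must be one center per cluster."""
    return len(clusters) == len(centers)


# Toy data, made up for illustration: two clusters given as index lists.
sample = [[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]]
clusters = [[0, 1], [2, 3]]
stale_centers = [[1.0, 2.0]]  # simulates the mismatch: one center, two clusters

print(check_lengths(clusters, stale_centers))  # False
centers = recompute_centers(sample, clusters)
print(check_lengths(clusters, centers))        # True
print(centers)                                 # [[1.0, 2.0], [6.0, 5.0]]
```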
@annoviko annoviko added the Bug Tasks related to found bugs label Nov 9, 2017
@annoviko annoviko added this to the 0.7 (release point) milestone Nov 9, 2017
@annoviko annoviko self-assigned this Nov 9, 2017
@annoviko annoviko changed the title [ccore.xmeans] Amount of centers and amount of clusters not matched [ccore.xmeans][pyclustering.cluster.xmeans] Amount of centers and amount of clusters not matched Nov 9, 2017
@himanshu94

The number of clusters I get from pyclustering.cluster.xmeans.xmeans.get_centers is always equal to the value of kmax. I checked this by getting the clusters while iterating kmax over a range of values.

Thanks

@annoviko
Owner Author

annoviko commented Nov 10, 2017

@himanshu94,

Formally, it means that the cluster splitting process was stopped when kmax was reached.

Have you tried to increase kmax?
Could you please provide the data sample and code example that you are using to reproduce the issue?

Here is an example of xmeans usage where the number of allocated clusters is less than kmax:

from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

from pyclustering.utils import read_sample

from pyclustering.samples.definitions import SIMPLE_SAMPLES

# Read dataset 'SAMPLE_SIMPLE2'
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2)
initial_centers = kmeans_plusplus_initializer(sample, 3).initialize()

# Use Python implementation
xmeans_instance = xmeans(sample, initial_centers)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()

# Display allocated clusters
print(clusters)

# Use C/C++ implementation
xmeans_instance = xmeans(sample, initial_centers, ccore=True)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()

# Display allocated clusters
print(clusters)

Output:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]

Here is an example of clustering that started from 2 clusters and finished at 4 clusters, with kmax = 20:
[image: elongate_four_clusters]

@himanshu94

Actually, I can't share the data. However, I have run K-means and evaluated its clusters by silhouette value for different iterations, so I can say that at some point the number of clusters formed should be less than kmax. To verify this, I ran X-Means while iterating kmax over a range of values, but the number of clusters produced is always the same as the kmax value.
When I ran X-Means from the previous version (before updating to 0.7), the number of clusters was not equal to kmax.
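
The check described here (sweep kmax and compare the resulting cluster count against it) can be sketched roughly as follows. This is only an illustration under stated assumptions: `run_clustering` is a hypothetical callable that takes kmax and returns a list of clusters, and `fake_run` is a made-up stand-in that pretends the data has three natural clusters. No real data or pyclustering call is involved.

```python
def cluster_counts_by_kmax(run_clustering, kmax_values):
    """Map each kmax to the number of clusters that run produced."""
    return {kmax: len(run_clustering(kmax)) for kmax in kmax_values}


def always_hits_kmax(counts):
    """True when every run produced exactly kmax clusters (the reported symptom)."""
    return all(count == kmax for kmax, count in counts.items())


def fake_run(kmax):
    """Hypothetical stand-in for a clustering call.

    Pretends the data has 3 natural clusters, so the result never
    contains more than 3 clusters regardless of kmax.
    """
    natural_clusters = [[0, 1], [2, 3], [4, 5]]
    return natural_clusters[:kmax]


counts = cluster_counts_by_kmax(fake_run, range(1, 6))
print(counts)                    # {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}
print(always_hits_kmax(counts))  # False: the count stops growing at 3
```

With pyclustering, `run_clustering` could wrap the xmeans construction, `process()` and `get_clusters()` calls shown earlier in this thread, passing kmax through (assumed here to be accepted as a `kmax` argument). If `always_hits_kmax` returns True over a wide range, that matches the reported behaviour.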

Thanks

@himanshu94

I verified it using the old version of pyclustering. With everything else held constant, the number of clusters we get is not always equal to kmax. I used the same dataset.

Thanks @annoviko

@annoviko
Owner Author

@himanshu94, versions before 0.7 had two bugs (#326, #328) that have been fixed in 0.7:

- Bug in the calculation of the BIC splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
  See: https://github.com/annoviko/pyclustering/issues/326

- Bug in the calculation of the MNDL splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
  See: https://github.com/annoviko/pyclustering/issues/328

I will try to verify the implementation and add more tests to find out what might be wrong, but without the data it is not a trivial problem.
