Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pyclustering.cluster.xmeans][ccore.xmeans] Performance issue #372

Closed
annoviko opened this issue Sep 29, 2017 · 4 comments
Closed

[pyclustering.cluster.xmeans][ccore.xmeans] Performance issue #372

annoviko opened this issue Sep 29, 2017 · 4 comments
Assignees
Labels
Optimization Tasks related to code optimization

Comments

@annoviko
Copy link
Owner

annoviko commented Sep 29, 2017

Introduction
There is complaint related to performance of X-Means algorithm when 4-5 millions points are used:

Hello, I'm testing your implementation of X-means clustering algorithm in my research work, and the performance with some datasets that contains 4 or 5 million points is not exactly what i've expected. The future multicore version of this package will improve the performance of all the algorithms in the package (including X-means) or only some of them?

Description

  1. Investigation of bottlenecks in python and C++ implementations.
  2. Investigate multi-threading implementation for the C++ X-Means.
  3. Implement solutions that improves performance of the algorithm.
@annoviko annoviko self-assigned this Sep 29, 2017
@annoviko annoviko added the Optimization Tasks related to code optimization label Sep 29, 2017
@annoviko annoviko added this to the 0.7 (release point) milestone Oct 26, 2017
@annoviko annoviko moved this from To Do to In Progress in 0.7.x Feature Development Oct 27, 2017
@annoviko
Copy link
Owner Author

annoviko commented Oct 27, 2017

CCORE clustering results before optimization (instant values):

Sample:  Simple01.data ', Execution time: ' 0.004082958090449453 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple02.data ', Execution time: ' 0.0006012819365098735 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple03.data ', Execution time: ' 0.0008245174776607654 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple04.data ', Execution time: ' 0.0008906613417054739 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple05.data ', Execution time: ' 0.0006839617665657582 ', BAYESIAN INFORMATION CRITERION
Sample:  Elongate.data ', Execution time: ' 0.001444901305942171 ', BAYESIAN INFORMATION CRITERION
Sample:  Lsun.data ', Execution time: ' 0.003622231865034584 ', BAYESIAN INFORMATION CRITERION
Sample:  Target.data ', Execution time: ' 0.007986871583398576 ', BAYESIAN INFORMATION CRITERION
Sample:  TwoDiamonds.data ', Execution time: ' 0.00683762194562176 ', BAYESIAN INFORMATION CRITERION
Sample:  WingNut.data ', Execution time: ' 0.007406402155833804 ', BAYESIAN INFORMATION CRITERION
Sample:  Chainlink.data ', Execution time: ' 0.007445746350825913 ', BAYESIAN INFORMATION CRITERION
Sample:  Hepta.data ', Execution time: ' 0.002023375013471107 ', BAYESIAN INFORMATION CRITERION
Sample:  Tetra.data ', Execution time: ' 0.003347107602952072 ', BAYESIAN INFORMATION CRITERION
Sample:  Atom.data ', Execution time: ' 0.5167791587526864 ', BAYESIAN INFORMATION CRITERION

Random sample: (size:6000) ', Execution time: ' 0.04646036243416263
Random sample: (size:12000) ', Execution time: ' 0.09399926899626121
Random sample: (size:24000) ', Execution time: ' 0.26107211220736326
Random sample: (size:36000) ', Execution time: ' 0.3993355962897853
Random sample: (size:48000) ', Execution time: ' 0.4588097639747444
Random sample: (size:60000) ', Execution time: ' 0.5967379670911468
Random sample: (size:90000) ', Execution time: ' 0.9474227556560129
Random sample: (size:180000) ', Execution time: ' 2.0353881076753684
Random sample: (size:270000) ', Execution time: ' 3.0532292745899783
Random sample: (size:600000) ', Execution time: ' 6.735516632045675
Random sample: (size:1200000) ', Execution time: ' 13.662943983559828
Random sample: (size:1800000) ', Execution time: ' 19.687950925814526

@annoviko
Copy link
Owner Author

annoviko commented Oct 30, 2017

Results (instant values) after optimization: std::future was used to parallel improve_structure, update_clusters):

Sample:  Simple01.data ', Execution time: ' 0.0046933633185517005 ', BAYESIAN INFORMATION CRITERION 
Sample:  Simple02.data ', Execution time: ' 0.0007749095796272341 ', BAYESIAN INFORMATION CRITERION 
Sample:  Simple03.data ', Execution time: ' 0.0010417658587041616 ', BAYESIAN INFORMATION CRITERION 
Sample:  Simple04.data ', Execution time: ' 0.0010554507960927233 ', BAYESIAN INFORMATION CRITERION 
Sample:  Simple05.data ', Execution time: ' 0.0007831775626328218 ', BAYESIAN INFORMATION CRITERION 
Sample:  Elongate.data ', Execution time: ' 0.001718314950851117 ', BAYESIAN INFORMATION CRITERION 
Sample:  Lsun.data ', Execution time: ' 0.0049291433856421055 ', BAYESIAN INFORMATION CRITERION 
Sample:  Target.data ', Execution time: ' 0.010533125246257595 ', BAYESIAN INFORMATION CRITERION 
Sample:  TwoDiamonds.data ', Execution time: ' 0.0073254329429514875 ', BAYESIAN INFORMATION CRITERION 
Sample:  WingNut.data ', Execution time: ' 0.0075834510332983 ', BAYESIAN INFORMATION CRITERION 
Sample:  Chainlink.data ', Execution time: ' 0.010281094316018272 ', BAYESIAN INFORMATION CRITERION 
Sample:  Hepta.data ', Execution time: ' 0.00237718766553785 ', BAYESIAN INFORMATION CRITERION 
Sample:  Tetra.data ', Execution time: ' 0.006640330764936669 ', BAYESIAN INFORMATION CRITERION 
Sample:  Atom.data ', Execution time: ' 0.5612238439506589 ', BAYESIAN INFORMATION CRITERION 

Random sample: (size:6000) ', Execution time: ' 0.05573361813208588
Random sample: (size:12000) ', Execution time: ' 0.12438439164466153
Random sample: (size:24000) ', Execution time: ' 0.25501025514995557
Random sample: (size:36000) ', Execution time: ' 0.4005387303685295
Random sample: (size:48000) ', Execution time: ' 0.5117336933992416
Random sample: (size:60000) ', Execution time: ' 0.6301391929194131
Random sample: (size:90000) ', Execution time: ' 0.9582275839300061
Random sample: (size:180000) ', Execution time: ' 2.0943733239515483
Random sample: (size:270000) ', Execution time: ' 3.1364299875752177
Random sample: (size:600000) ', Execution time: ' 6.841610534664802
Random sample: (size:1200000) ', Execution time: ' 14.27894663615686
Random sample: (size:1800000) ', Execution time: ' 20.161835218528495

@annoviko
Copy link
Owner Author

annoviko commented Oct 30, 2017

Average values:

   WITHOUT OPTIMIZATION
Random sample: (size:6000) ', Execution time: ' 0.06487278995388744
Random sample: (size:12000) ', Execution time: ' 0.13240456324237218
Random sample: (size:24000) ', Execution time: ' 0.26943573296809764
Random sample: (size:36000) ', Execution time: ' 0.4183669678683374
Random sample: (size:48000) ', Execution time: ' 0.5665725589635487
Random sample: (size:60000) ', Execution time: ' 0.7230097111736941
Random sample: (size:90000) ', Execution time: ' 1.1084235208720965
Random sample: (size:180000) ', Execution time: ' 2.2789553175390154
Random sample: (size:270000) ', Execution time: ' 3.3854017783576125
Random sample: (size:600000) ', Execution time: ' 7.692325415836788


   WITH PARALLEL OPTIMIZATION
Random sample: (size:6000) ', Execution time: ' 0.06306853074520756
Random sample: (size:12000) ', Execution time: ' 0.1293804629272196
Random sample: (size:24000) ', Execution time: ' 0.2691053415161656
Random sample: (size:36000) ', Execution time: ' 0.4137031697166019
Random sample: (size:48000) ', Execution time: ' 0.5805145593478667
Random sample: (size:60000) ', Execution time: ' 0.7367947196669091
Random sample: (size:90000) ', Execution time: ' 1.1261870257670261
Random sample: (size:180000) ', Execution time: ' 2.3183102894193213
Random sample: (size:270000) ', Execution time: ' 3.421031895597614
Random sample: (size:600000) ', Execution time: ' 7.725590775896026

Looks like usage C++11 parallel mechanisms, such as std::future, haven't got benefits for this particular task and also they are a little bit worse than results without optimization.

The results became better in case of 6 millions points:

   WITH PARALLEL OPTIMIZATION:
Random sample: (size:6000000) ', Execution time: ' 67.3682303038113


   WITHOUT OPTIMIZATION
Random sample: (size:6000000) ', Execution time: ' 68.0722179805252

Probably there is create, destroy, join thread issue and it leads to overhead. Pool thread should be introduced to check this statement.

@annoviko
Copy link
Owner Author

annoviko commented Nov 2, 2017

Pool threads introduced:

   POOL THREADS
Sample:  Simple01.data ', Execution time: ' 0.027195676928278952 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple02.data ', Execution time: ' 0.002401991614554618 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple03.data ', Execution time: ' 0.002349817790760725 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple04.data ', Execution time: ' 0.0022853845438895876 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple05.data ', Execution time: ' 0.0019127551029135825 ', BAYESIAN INFORMATION CRITERION
Sample:  Elongate.data ', Execution time: ' 0.0027084771914859113 ', BAYESIAN INFORMATION CRITERION
Sample:  Lsun.data ', Execution time: ' 0.005898493117331809 ', BAYESIAN INFORMATION CRITERION
Sample:  Target.data ', Execution time: ' 0.010470117513697766 ', BAYESIAN INFORMATION CRITERION
Sample:  TwoDiamonds.data ', Execution time: ' 0.008957076623675057 ', BAYESIAN INFORMATION CRITERION
Sample:  WingNut.data ', Execution time: ' 0.009430632477891693 ', BAYESIAN INFORMATION CRITERION
Sample:  Chainlink.data ', Execution time: ' 0.009701765299902548 ', BAYESIAN INFORMATION CRITERION
Sample:  Hepta.data ', Execution time: ' 0.004180178166480683 ', BAYESIAN INFORMATION CRITERION
Sample:  Tetra.data ', Execution time: ' 0.005393290845404103 ', BAYESIAN INFORMATION CRITERION
Sample:  Atom.data ', Execution time: ' 0.4922671550668765 ', BAYESIAN INFORMATION CRITERION
Random sample: (size:6000) ', Execution time: ' 0.049572260175748784
Random sample: (size:12000) ', Execution time: ' 0.09769135106254989
Random sample: (size:24000) ', Execution time: ' 0.2657324035938927
Random sample: (size:36000) ', Execution time: ' 0.3849949223180231
Random sample: (size:48000) ', Execution time: ' 0.4437375160584185
Random sample: (size:60000) ', Execution time: ' 0.6082609694751766
Random sample: (size:90000) ', Execution time: ' 0.9253837341974611
Random sample: (size:180000) ', Execution time: ' 1.970817726327481
Random sample: (size:270000) ', Execution time: ' 2.8773772589412543
Random sample: (size:600000) ', Execution time: ' 6.37764183439743
Random sample: (size:1200000) ', Execution time: ' 13.580303782801792
Random sample: (size:1800000) ', Execution time: ' 19.804547732776513
Random sample: (size:6000000) ', Execution time: ' 72.58958530648273


   WITHOUT POOL THREADS
Sample:  Simple01.data ', Execution time: ' 0.008286229588773333 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple02.data ', Execution time: ' 0.0024165318605299593 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple03.data ', Execution time: ' 0.0025950062523057687 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple04.data ', Execution time: ' 0.002580466006330424 ', BAYESIAN INFORMATION CRITERION
Sample:  Simple05.data ', Execution time: ' 0.0022631465206331795 ', BAYESIAN INFORMATION CRITERION
Sample:  Elongate.data ', Execution time: ' 0.0035903003444612766 ', BAYESIAN INFORMATION CRITERION
Sample:  Lsun.data ', Execution time: ' 0.0063711936629616585 ', BAYESIAN INFORMATION CRITERION
Sample:  Target.data ', Execution time: ' 0.011038897723909806 ', BAYESIAN INFORMATION CRITERION
Sample:  TwoDiamonds.data ', Execution time: ' 0.010556503680963052 ', BAYESIAN INFORMATION CRITERION
Sample:  WingNut.data ', Execution time: ' 0.010261707321384483 ', BAYESIAN INFORMATION CRITERION
Sample:  Chainlink.data ', Execution time: ' 0.010449875210477189 ', BAYESIAN INFORMATION CRITERION
Sample:  Hepta.data ', Execution time: ' 0.004214675612814342 ', BAYESIAN INFORMATION CRITERION
Sample:  Tetra.data ', Execution time: ' 0.005353946650412 ', BAYESIAN INFORMATION CRITERION
Sample:  Atom.data ', Execution time: ' 0.6581391450221326 ', BAYESIAN INFORMATION CRITERION
Random sample: (size:6000) ', Execution time: ' 0.0652075862450413
Random sample: (size:12000) ', Execution time: ' 0.13661672994999874
Random sample: (size:24000) ', Execution time: ' 0.33727896687846015
Random sample: (size:36000) ', Execution time: ' 0.48826003433778875
Random sample: (size:48000) ', Execution time: ' 0.603746080548401
Random sample: (size:60000) ', Execution time: ' 0.7843134124360729
Random sample: (size:90000) ', Execution time: ' 1.1528909715336195
Random sample: (size:180000) ', Execution time: ' 2.602659268437459
Random sample: (size:270000) ', Execution time: ' 3.4887010884657084
Random sample: (size:600000) ', Execution time: ' 7.998171350241455
Random sample: (size:1200000) ', Execution time: ' 15.768811229403454
Random sample: (size:1800000) ', Execution time: ' 19.55469014165621
Random sample: (size:6000000) ', Execution time: ' 73.35187309729478

Average values:

    POOL THREADS (AVG)
Random sample: (size:6000) ', Execution time: ' 0.06658213841972047
Random sample: (size:12000) ', Execution time: ' 0.13204861231883855
Random sample: (size:24000) ', Execution time: ' 0.2728966393785214
Random sample: (size:36000) ', Execution time: ' 0.4111670514604964
Random sample: (size:48000) ', Execution time: ' 0.5679987005011539
Random sample: (size:60000) ', Execution time: ' 0.7177979310655493
Random sample: (size:90000) ', Execution time: ' 1.1048938904167227
Random sample: (size:180000) ', Execution time: ' 2.303458696863241
Random sample: (size:270000) ', Execution time: ' 3.3325965372546733
Random sample: (size:600000) ', Execution time: ' 8.36458668923161
Random sample: (size:1200000) ', Execution time: ' 20.403341334269985
Random sample: (size:1800000) ', Execution time: ' 31.716519472810585


    WITHOUT POOL THREADS (AVG)
Random sample: (size:6000) ', Execution time: ' 0.06564386490001729
Random sample: (size:12000) ', Execution time: ' 0.13045497284965452
Random sample: (size:24000) ', Execution time: ' 0.2666950961737485
Random sample: (size:36000) ', Execution time: ' 0.4151558685858269
Random sample: (size:48000) ', Execution time: ' 0.5708520242018118
Random sample: (size:60000) ', Execution time: ' 0.727910657886258
Random sample: (size:90000) ', Execution time: ' 1.1134643247937412
Random sample: (size:180000) ', Execution time: ' 2.326188707873912
Random sample: (size:270000) ', Execution time: ' 3.3904631667059157
Random sample: (size:600000) ', Execution time: ' 7.8770766749935675
Random sample: (size:1200000) ', Execution time: ' 19.057157179488783
Random sample: (size:1800000) ', Execution time: ' 32.79242347696625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Optimization Tasks related to code optimization
Projects
No open projects
Development

No branches or pull requests

1 participant