Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pyclustering.cluster.dbscan][ccore.dbscan] KD-Tree optimization #369

Closed
annoviko opened this issue Sep 27, 2017 · 3 comments
Closed

[pyclustering.cluster.dbscan][ccore.dbscan] KD-Tree optimization #369

annoviko opened this issue Sep 27, 2017 · 3 comments
Assignees
Labels
Enhancement Tasks related to enhancement and development Optimization Tasks related to code optimization

Comments

@annoviko
Copy link
Owner

annoviko commented Sep 27, 2017

Introduction
KD-Tree is already implemented in the pyclustering library. DBSCAN and OPTICS algorithms always calculate distance between points, it leads to O(n^2) complexity, therefore optimization is required. KD-tree helps to reduce this complexity and increase performance of these algorithm. There are several complains that performance of the algorith is significantly degradate in case of big data.

Resources

  1. Article 'A Fast Approach to Clustering Datasets using DBSCAN and Pruning Algorithms' - S.Vijayalaksmi, M Punithavalli - Link to PDF.

Description
KD-Tree optimization should be performed for DBSCAN algorithm for Python code and for C++ implementation.
KD-Tree implementation for Python: 'pyclustering/cluster/kdtree.py'
KD-Tree implementation for C++: 'ccore/container/kdtree.cpp'

@annoviko annoviko added Enhancement Tasks related to enhancement and development Optimization Tasks related to code optimization labels Sep 27, 2017
@annoviko annoviko self-assigned this Oct 19, 2017
@annoviko annoviko added this to the 0.7 (release point) milestone Oct 19, 2017
@annoviko annoviko added this to In Progress in 0.7.x Feature Development Oct 19, 2017
@annoviko annoviko moved this from In Progress to To Do in 0.7.x Feature Development Oct 23, 2017
@annoviko annoviko moved this from To Do to In Progress in 0.7.x Feature Development Oct 24, 2017
@annoviko
Copy link
Owner Author

annoviko commented Oct 25, 2017

Optimization results for python code:

     SAMPLE                                     WITH KD-TREE               WITHOUT KD-TREE

simple\Simple01.data        Execution time:  0.00022409097751563504   0.00020270825065345615 
simple\Simple02.data        Execution time:  0.0007729143003116778    0.0009830352296105183
simple\Simple03.data        Execution time:  0.003677543917269821     0.005556372850893165
simple\Simple04.data        Execution time:  0.007079108106505139     0.008534273945232762
simple\Simple05.data        Execution time:  0.0033622199718088552    0.005683813902992441
simple\Elongate.data        Execution time:  0.021121572491435536     0.036880642188861223
fcps\Lsun.data              Execution time:  0.0780252852170662       0.2669670512136273
fcps\Target.data            Execution time:  0.6633489227667306       1.169823327357573
fcps\TwoDiamonds.data       Execution time:  0.14622620530154862      0.9310871776625209
fcps\WingNut.data           Execution time:  0.23848611434227962      1.5618684968000043
fcps\Chainlink.data         Execution time:  0.7548290750345554       1.9553132369913193
fcps\Hepta.data             Execution time:  0.04232953119939609      0.08354830101405497
fcps\Tetra.data             Execution time:  0.04836060058662639      0.26458615585328005
fcps\Atom.data              Execution time:  1.0972711649081504       1.6633235485975213



RANDOM (0, 1] POINTS IN ONE CLUSTER:

   SIZE    PARAMS             WITH KD-TREE         WITHOUT KD-TREE
   1000    (0.2)           1.2415367167079496    2.1030196971977793
   2000    (0.2)           8.083576526213513     11.88685030722702
   3000    (0.2)           26.68899907170455     32.634481101090685

   1000    (0.1)           0.38808166719125625   1.5746536568454377
   2000    (0.1)           2.4408297182269787    7.391549375282253
   3000    (0.1)           7.557209058065222     18.784553631300206
   4000    (0.1)           16.789081352437517    37.501737702936325
   5000    (0.1)           33.33403611228954     63.62421311565146

@annoviko
Copy link
Owner Author

Result of optimization for CCORE:

RANDOM (0, 1] POINTS IN ONE CLUSTER:

   SIZE    PARAMS             WITH KD-TREE               WITHOUT KD-TREE
   1000    (0.1)          0.020409100032387703         0.019771609668869945
   2000    (0.1)          0.0661692759295499           0.06492137998987314
   3000    (0.1)          0.16569731638224788          0.17175746627801422
   4000    (0.1)          0.34311407770311875          0.36248768354932737
   5000    (0.1)          0.6570313248395441           0.6817283743653606
   10000   (0.1)          4.431236571647531            4.74091091556845
   20000   (0.1)          34.35042269374461            35.44703356004725

   

 WITHOUT KD-TREE (eps = 0.05)
Execution time (1000 2D-points): 0.021901614367367792
Execution time (2000 2D-points): 0.032420490468435675
Execution time (3000 2D-points): 0.09332334332334333
Execution time (4000 2D-points): 0.17029773879088944
Execution time (5000 2D-points): 0.3196760431349473
Execution time (10000 2D-points): 1.6479125896934115
Execution time (20000 2D-points): 10.745420390112171


 WITH KD-TREE (eps = 0.05)
Execution time (1000 2D-points): 0.015029205953863489
Execution time (2000 2D-points): 0.024186715111372645
Execution time (3000 2D-points): 0.0596119063584817
Execution time (4000 2D-points): 0.1180246693945324
Execution time (5000 2D-points): 0.23007529000679683
Execution time (10000 2D-points): 1.261819516100338
Execution time (20000 2D-points): 9.261596850637947

Here some debug log that display how many nodes are visited (euclidean distance calculation) for method 'get_neighbors()':

[DEBUG] Amount of touched nodes: 164
[DEBUG] Amount of touched nodes: 185
[DEBUG] Amount of touched nodes: 170
[DEBUG] Amount of touched nodes: 198
[DEBUG] Amount of touched nodes: 198
[DEBUG] Amount of touched nodes: 205
[DEBUG] Amount of touched nodes: 151
[DEBUG] Amount of touched nodes: 149
[DEBUG] Amount of touched nodes: 182
[DEBUG] Amount of touched nodes: 177
[DEBUG] Amount of touched nodes: 132
[DEBUG] Amount of touched nodes: 149

@annoviko
Copy link
Owner Author

Looks like amount of euclidean distance calculation can be reduced - investigation is required - #379.

@annoviko annoviko moved this from In Progress to Done in 0.7.x Feature Development Oct 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Tasks related to enhancement and development Optimization Tasks related to code optimization
Projects
No open projects
Development

No branches or pull requests

1 participant