
Implement DBSCAN #331

Merged
merged 19 commits into from
Oct 23, 2020

Conversation

@aprokop (Contributor) commented Jun 3, 2020

Fix #330.

This was pretty straightforward. The only thing that I am not fond of is that it does two passes, but I can't think of how to fix that yet.

Checked that the timings and output of the original halo finder example on HACC data are unchanged.

Also, not 100% sure whether the verifyClusters() function catches all the corner cases.

@aprokop added the enhancement (New feature or request) label on Jun 3, 2020
@aprokop (Contributor, Author) commented Jun 6, 2020

I realized that there is a bug in the current implementation. Specifically, we do propagate(i,j) when i is a boundary point and j is a core point. But what propagate really does is set the larger representative to the smaller one. So the representative of the core point could end up being set to the representative of the boundary point, and that could lead to issues. If the boundary point is then connected to a different cluster and updates its representative to that one, this would effectively bridge the two clusters, which is a mistake.

There are a couple of ways to avoid the issue. First, stop processing any edges for a boundary point once it has a connection to some core point. Second, instead of choosing the smaller representative, simply assign the representative of the boundary point to be that of the core point. This way, core point representatives will never be boundary points, and thus no bridge will ever be formed.

Not sure which version is better. There is also a third variant: a boundary point is only processed if its representative is itself. The first time it connects to a core point, it sets its representative to that core point, and thus will never be processed again.

In any case, I need to fix the verify routine that did not catch this bug. I had a feeling that it was not complete, and this certainly proves it.
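A minimal sketch of the second variant, with hypothetical names (this is not the actual ArborX code): on a boundary-core edge the boundary point always adopts the core point's representative, so a core representative can never become a boundary point and no bridge between clusters can form.

```cpp
#include <vector>

// Hypothetical sketch (not the ArborX implementation) of the second fix:
// a boundary point always adopts the core point's representative.
struct UnionFind
{
  std::vector<int> rep;
  explicit UnionFind(int n) : rep(n)
  {
    for (int i = 0; i < n; ++i)
      rep[i] = i;
  }
  int find(int i)
  {
    while (rep[i] != i)
      i = rep[i] = rep[rep[i]]; // path halving
    return i;
  }
};

void propagate(UnionFind &uf, std::vector<bool> const &is_core, int i, int j)
{
  if (is_core[i] && is_core[j])
  {
    // core-core: usual rule, the larger representative points to the smaller
    int ri = uf.find(i), rj = uf.find(j);
    if (ri == rj)
      return;
    if (ri < rj)
      uf.rep[rj] = ri;
    else
      uf.rep[ri] = rj;
  }
  else if (!is_core[i] && is_core[j])
    uf.rep[i] = uf.find(j); // boundary adopts the core's representative
  else if (is_core[i] && !is_core[j])
    uf.rep[j] = uf.find(i);
  // boundary-boundary edges are ignored in DBSCAN
}
```

The invariant is that no point ever points at a boundary point, so overwriting a boundary point's representative can only move that single point between clusters (which is acceptable, since boundary assignment in DBSCAN is ambiguous), never merge two clusters.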

@aprokop (Contributor, Author) commented Jun 6, 2020

I think I fixed the bug. But I'm not sure how to write a simple verification function without implementing a full-blown sequential DBSCAN. The question that needs to be answered: if a point is connected to multiple core points, how would one know whether those core points should belong to the same cluster or not?

@aprokop (Contributor, Author) commented Jun 8, 2020

OK, I fixed the verification routine. The new routine successfully detected that the version prior to the fix failed verification, while the latest version passed.
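One way to answer the question from the previous comment: two core points belong to the same cluster exactly when a chain of eps-close core points connects them. A hedged sequential sketch of such a check, with assumed names (this is not the actual verifyClusters()), computes connected components over core points only via BFS and compares them against the labels under verification:

```cpp
#include <queue>
#include <vector>

// Hypothetical sequential verification (assumed names, not ArborX code).
struct Point { float x, y; };

bool verifyLabels(std::vector<Point> const &pts, std::vector<int> const &labels,
                  float eps, int min_pts)
{
  int const n = static_cast<int>(pts.size());
  auto within = [&](int i, int j) {
    float dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
    return dx * dx + dy * dy <= eps * eps;
  };

  // Classify points: core = at least min_pts neighbors (self included)
  std::vector<bool> is_core(n, false);
  for (int i = 0; i < n; ++i)
  {
    int count = 0;
    for (int j = 0; j < n; ++j)
      if (within(i, j))
        ++count;
    is_core[i] = (count >= min_pts);
  }

  // BFS connected components over core points only
  std::vector<int> comp(n, -1);
  for (int s = 0; s < n; ++s)
  {
    if (!is_core[s] || comp[s] != -1)
      continue;
    std::queue<int> q;
    q.push(s);
    comp[s] = s;
    while (!q.empty())
    {
      int i = q.front();
      q.pop();
      for (int j = 0; j < n; ++j)
        if (is_core[j] && comp[j] == -1 && within(i, j))
        {
          comp[j] = s;
          q.push(j);
        }
    }
  }

  for (int i = 0; i < n; ++i)
  {
    if (is_core[i])
    {
      // core points in one component must share a single valid label
      if (labels[i] < 0 || labels[i] != labels[comp[i]])
        return false;
    }
    else
    {
      // boundary: label must match some eps-close core; otherwise noise (-1)
      bool has_core = false, matches = false;
      for (int j = 0; j < n; ++j)
        if (j != i && is_core[j] && within(i, j))
        {
          has_core = true;
          matches |= (labels[i] == labels[j]);
        }
      if (has_core ? !matches : labels[i] != -1)
        return false;
    }
  }

  // distinct components must never share a label (catches the bridge bug)
  for (int i = 0; i < n; ++i)
    for (int j = i + 1; j < n; ++j)
      if (is_core[i] && is_core[j] && comp[i] != comp[j] && labels[i] == labels[j])
        return false;
  return true;
}
```

This is O(n^2) and only suitable as a testing oracle, not for HACC-scale inputs.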

@aprokop marked this pull request as ready for review on June 8, 2020 15:37
@aprokop (Contributor, Author) commented Aug 7, 2020

Rebased on top of master and changed some calls to direct callbacks.

@aprokop (Contributor, Author) commented Oct 2, 2020

I have thought of a (potentially) better way to do this. Right now, we do two full passes over all points. The first pass counts the number of neighbors of each point, and the second produces cluster labels.

The new trick I am thinking about is to notice that we don't have to go through all points in the second loop. Points that are not near any other points will not be used, nor will points that are not near any core point. It should be possible to transform the second loop to go over only the core points, potentially significantly reducing the size of the loop.

I did a quick run on the HACC 36M problem, calculating the number of core points depending on the min_pts value:

| min_pts | # core points | % of total |
|--------:|--------------:|-----------:|
| 2       | 18900454      | 51%        |
| 5       | 14032705      | 38%        |
| 10      | 11056272      | 30%        |
| 25      | 7656005       | 21%        |
| 50      | 5397584       | 15%        |
| 100     | 3499114       | 9%         |

So, at least half of the points are not core points. As the time taken by the second loop is ~2x that of the first, any reduction in it should be significant.

I think it should be implemented as follows. In the first parallel loop, besides counting neighbors, a mapping "core point index" -> "point index" should be created. After the first loop, a wrapper around queries should be created using the constructed mapping. In addition, the main DBSCAN loop should be updated to only use core points. Core-core processing should remain untouched, while core-boundary processing should now be allowed, and the boundary point should be checked for whether it has been processed before (so as to not be processed by multiple clusters in a boundary-bridge situation).

I think this should be pretty straightforward to implement. But maybe the first priority should still be to introduce a proper verification routine, merge this PR, and then do this work. That would allow us to demonstrate the improvement more clearly.
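The scheme above might be sketched sequentially as follows, with hypothetical names (this is not the ArborX implementation): the first pass counts neighbors and records the "core point index" -> "point index" mapping, and the second pass loops over core points only, with each boundary point claimed by its first core connection.

```cpp
#include <vector>

// Hypothetical sequential sketch (assumed names) of the proposed two-pass
// restructuring with a core-points-only second loop.
struct Point { float x, y; };

std::vector<int> dbscanLabels(std::vector<Point> const &pts, float eps, int min_pts)
{
  int const n = static_cast<int>(pts.size());
  auto within = [&](int i, int j) {
    float dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
    return dx * dx + dy * dy <= eps * eps;
  };

  // Pass 1: count neighbors (self included) and build the core -> point mapping
  std::vector<int> num_neigh(n, 0), core_to_point;
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (within(i, j))
        ++num_neigh[i];
  for (int i = 0; i < n; ++i)
    if (num_neigh[i] >= min_pts)
      core_to_point.push_back(i);

  // Union-find representatives
  std::vector<int> rep(n);
  for (int i = 0; i < n; ++i)
    rep[i] = i;
  auto find = [&](int i) {
    while (rep[i] != i)
      i = rep[i];
    return i;
  };

  // Pass 2: loop over core points only
  for (int c : core_to_point)
    for (int j = 0; j < n; ++j)
      if (j != c && within(c, j))
      {
        if (num_neigh[j] >= min_pts) // core-core: smaller root wins
        {
          int rc = find(c), rj = find(j);
          if (rc < rj)
            rep[rj] = rc;
          else if (rj < rc)
            rep[rc] = rj;
        }
        else if (rep[j] == j) // boundary: only its first core connection counts
          rep[j] = c;
      }

  // Points that are neither core nor attached to a core are noise (-1)
  std::vector<int> labels(n, -1);
  for (int i = 0; i < n; ++i)
    if (num_neigh[i] >= min_pts || rep[i] != i)
      labels[i] = find(i);
  return labels;
}
```

Per the table above, core_to_point would be roughly 2-10x smaller than the full point set on the HACC problem, which is where the second-loop savings would come from.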

@aprokop (Contributor, Author) commented Oct 23, 2020

Merging; the only CI failure is the standard HIP one.

@aprokop merged commit 3a3906a into arborx:master on Oct 23, 2020
@aprokop deleted the dbscan branch on October 23, 2020 16:30
Labels: application (Anything to support specific applications), enhancement (New feature or request)