
Implement DBSCAN #331

Merged
merged 19 commits into from
Oct 23, 2020

Conversation

@aprokop (Contributor) commented Jun 3, 2020

Fix #330.

This was pretty straightforward. The only thing that I am not fond of is that it does two passes, but I can't think of how to fix that yet.

Checked that the timings and output of the original halo finder example on HACC data are unchanged.

Also, not 100% sure whether the verifyClusters() function catches all the corner cases.

@aprokop added the enhancement (New feature or request) label on Jun 3, 2020
@aprokop (Contributor, Author) commented Jun 6, 2020

I realized that there is a bug in the current implementation. Specifically, we do propagate(i,j) when i is a boundary point and j is a core point. But what propagate really does is set the larger representative to the smaller one. So the representative of the core point could end up being set to the representative of the boundary point, and that could lead to issues. If the boundary point is then connected to a different cluster and updates its representative to that one, this would effectively bridge the two clusters, which is a mistake.

There are a couple of ways to avoid the issue. First, stop processing any edges for a boundary point once it has a connection to some core point. Second, instead of choosing the smaller representative, simply assign the representative of the boundary point to be that of the core point. This way, core point representatives will never be boundary points, and thus no bridge will ever be formed.

Not sure which version is better. There is also a third variant: a boundary point is only processed if its representative is itself. The first time it connects to a core point, it sets its representative to that core point, and thus will never be processed again.

In any case, I need to fix the verify routine that did not catch this bug. I had a feeling that it was not complete, and this certainly proves it.
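A minimal sketch of the second variant, with hypothetical names (this is not the actual ArborX code): on a boundary-core edge the boundary point always adopts the core point's representative, so a core representative can never become a boundary point and no bridge between clusters can form.

```cpp
#include <vector>

// Hypothetical sketch (not the ArborX implementation) of the second fix:
// a boundary point always adopts the core point's representative.
struct UnionFind
{
  std::vector<int> rep;
  explicit UnionFind(int n) : rep(n)
  {
    for (int i = 0; i < n; ++i)
      rep[i] = i;
  }
  int find(int i)
  {
    while (rep[i] != i)
      i = rep[i] = rep[rep[i]]; // path halving
    return i;
  }
};

void propagate(UnionFind &uf, std::vector<bool> const &is_core, int i, int j)
{
  if (is_core[i] && is_core[j])
  {
    // core-core: usual rule, the larger representative points to the smaller
    int ri = uf.find(i), rj = uf.find(j);
    if (ri == rj)
      return;
    if (ri < rj)
      uf.rep[rj] = ri;
    else
      uf.rep[ri] = rj;
  }
  else if (!is_core[i] && is_core[j])
    uf.rep[i] = uf.find(j); // boundary adopts the core's representative
  else if (is_core[i] && !is_core[j])
    uf.rep[j] = uf.find(i);
  // boundary-boundary edges are ignored in DBSCAN
}
```

The invariant is that no point ever points at a boundary point, so overwriting a boundary point's representative can only move that single point between clusters (which is acceptable, since boundary assignment in DBSCAN is ambiguous), never merge two clusters.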

@aprokop (Contributor, Author) commented Jun 6, 2020

I think I fixed the bug. But I'm not sure how to write a simple verification function without implementing a full-blown sequential DBSCAN. The question that needs to be answered: if a point is connected to multiple core points, how would one know whether those core points should belong to the same cluster or not?

@aprokop (Contributor, Author) commented Jun 8, 2020

OK, I fixed the verification routine. The new routine successfully detected that the version prior to the fix failed verification, while the latest version passed.
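One way to answer the question from the previous comment: two core points belong to the same cluster exactly when a chain of eps-close core points connects them. A hedged sequential sketch of such a check, with assumed names (this is not the actual verifyClusters()), computes connected components over core points only via BFS and compares them against the labels under verification:

```cpp
#include <queue>
#include <vector>

// Hypothetical sequential verification (assumed names, not ArborX code).
struct Point { float x, y; };

bool verifyLabels(std::vector<Point> const &pts, std::vector<int> const &labels,
                  float eps, int min_pts)
{
  int const n = static_cast<int>(pts.size());
  auto within = [&](int i, int j) {
    float dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
    return dx * dx + dy * dy <= eps * eps;
  };

  // Classify points: core = at least min_pts neighbors (self included)
  std::vector<bool> is_core(n, false);
  for (int i = 0; i < n; ++i)
  {
    int count = 0;
    for (int j = 0; j < n; ++j)
      if (within(i, j))
        ++count;
    is_core[i] = (count >= min_pts);
  }

  // BFS connected components over core points only
  std::vector<int> comp(n, -1);
  for (int s = 0; s < n; ++s)
  {
    if (!is_core[s] || comp[s] != -1)
      continue;
    std::queue<int> q;
    q.push(s);
    comp[s] = s;
    while (!q.empty())
    {
      int i = q.front();
      q.pop();
      for (int j = 0; j < n; ++j)
        if (is_core[j] && comp[j] == -1 && within(i, j))
        {
          comp[j] = s;
          q.push(j);
        }
    }
  }

  for (int i = 0; i < n; ++i)
  {
    if (is_core[i])
    {
      // core points in one component must share a single valid label
      if (labels[i] < 0 || labels[i] != labels[comp[i]])
        return false;
    }
    else
    {
      // boundary: label must match some eps-close core; otherwise noise (-1)
      bool has_core = false, matches = false;
      for (int j = 0; j < n; ++j)
        if (j != i && is_core[j] && within(i, j))
        {
          has_core = true;
          matches |= (labels[i] == labels[j]);
        }
      if (has_core ? !matches : labels[i] != -1)
        return false;
    }
  }

  // distinct components must never share a label (catches the bridge bug)
  for (int i = 0; i < n; ++i)
    for (int j = i + 1; j < n; ++j)
      if (is_core[i] && is_core[j] && comp[i] != comp[j] && labels[i] == labels[j])
        return false;
  return true;
}
```

This is O(n^2) and only suitable as a testing oracle, not for HACC-scale inputs.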

@aprokop marked this pull request as ready for review on June 8, 2020 15:37
@aprokop (Contributor, Author) commented Aug 7, 2020

Rebased on top of master and changed some calls to direct callbacks.

@aprokop (Contributor, Author) commented Oct 2, 2020

I have thought of a (potentially) better way to do this. Right now, we do two full passes over all points. The first pass counts the number of neighbors of each point, and the second produces cluster labels.

The new trick I am thinking about is to notice that we don't have to go through all points in the second loop. Points that are not near any other points will not be used, nor will points that are not near any core point. It should be possible to transform the second loop to go over only the core points, potentially significantly reducing the size of the loop.

I did a quick run on the HACC 36M problem, calculating the number of core points depending on the min_pts value:

| min_pts | # core points | % of total |
|--------:|--------------:|-----------:|
| 2       | 18900454      | 51%        |
| 5       | 14032705      | 38%        |
| 10      | 11056272      | 30%        |
| 25      | 7656005       | 21%        |
| 50      | 5397584       | 15%        |
| 100     | 3499114       | 9%         |

So, at least half of the points are not core points. As the time taken by the second loop is ~2x that of the first, any reduction in it should be significant.

I think it should be implemented as follows. In the first parallel loop, besides counting neighbors, a mapping "core point index" -> "point index" should be created. After the first loop, a wrapper around queries should be created using the constructed mapping. In addition, the main DBSCAN loop should be updated to only use core points. Core-core processing should remain untouched, while core-boundary processing should now be allowed, and the boundary point should be checked for whether it has been processed before (so as to not be processed by multiple clusters in a boundary-bridge situation).

I think this should be pretty straightforward to implement. But maybe the first priority should still be to introduce a proper verification routine, merge this PR, and then do this work. That would allow us to demonstrate the improvement more clearly.
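The scheme above might be sketched sequentially as follows, with hypothetical names (this is not the ArborX implementation): the first pass counts neighbors and records the "core point index" -> "point index" mapping, and the second pass loops over core points only, with each boundary point claimed by its first core connection.

```cpp
#include <vector>

// Hypothetical sequential sketch (assumed names) of the proposed two-pass
// restructuring with a core-points-only second loop.
struct Point { float x, y; };

std::vector<int> dbscanLabels(std::vector<Point> const &pts, float eps, int min_pts)
{
  int const n = static_cast<int>(pts.size());
  auto within = [&](int i, int j) {
    float dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
    return dx * dx + dy * dy <= eps * eps;
  };

  // Pass 1: count neighbors (self included) and build the core -> point mapping
  std::vector<int> num_neigh(n, 0), core_to_point;
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (within(i, j))
        ++num_neigh[i];
  for (int i = 0; i < n; ++i)
    if (num_neigh[i] >= min_pts)
      core_to_point.push_back(i);

  // Union-find representatives
  std::vector<int> rep(n);
  for (int i = 0; i < n; ++i)
    rep[i] = i;
  auto find = [&](int i) {
    while (rep[i] != i)
      i = rep[i];
    return i;
  };

  // Pass 2: loop over core points only
  for (int c : core_to_point)
    for (int j = 0; j < n; ++j)
      if (j != c && within(c, j))
      {
        if (num_neigh[j] >= min_pts) // core-core: smaller root wins
        {
          int rc = find(c), rj = find(j);
          if (rc < rj)
            rep[rj] = rc;
          else if (rj < rc)
            rep[rc] = rj;
        }
        else if (rep[j] == j) // boundary: only its first core connection counts
          rep[j] = c;
      }

  // Points that are neither core nor attached to a core are noise (-1)
  std::vector<int> labels(n, -1);
  for (int i = 0; i < n; ++i)
    if (num_neigh[i] >= min_pts || rep[i] != i)
      labels[i] = find(i);
  return labels;
}
```

Per the table above, core_to_point would be roughly 2-10x smaller than the full point set on the HACC problem, which is where the second-loop savings would come from.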

@aprokop (Contributor, Author) commented Oct 23, 2020

Merging; the only CI failure is the standard HIP one.

@aprokop merged commit 3a3906a into arborx:master on Oct 23, 2020
@aprokop deleted the dbscan branch on October 23, 2020 16:30
Labels: application (Anything to support specific applications), enhancement (New feature or request)