Implement DBSCAN #331
Conversation
I realized that there is a bug in the current implementation. Specifically, we do `propagate(i, j)` when `i` is a boundary point and `j` is a core point, which can end up bridging two clusters (see the detailed description below). There are a couple of ways to avoid the issue. First, stop processing any edges for a boundary point once it has a connection to some core point. Second, instead of choosing the smaller representative, simply assign the representative of the boundary point to be that of the core point. This way, core point representatives will never be a boundary point, and thus no bridge will ever be formed. Not sure which version is better. A third variant is also possible: a boundary point is only processed if its representative is itself; the first time it connects to a core point, it sets its representative to it, and thus will never be processed again. In any case, this needs to be fixed.
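Here is a minimal sequential sketch of the second variant, assuming a union-find-style `rep` array (the names `rep`, `isCore`, and `processEdge` are hypothetical, and the concurrency of the actual GPU implementation is ignored):

```cpp
#include <vector>

// Hypothetical sequential sketch of the second variant. rep is a
// union-find parent array, isCore[i] marks core points.

int representative(std::vector<int> &rep, int i)
{
  // Find the root, halving paths along the way.
  while (rep[i] != i)
  {
    rep[i] = rep[rep[i]];
    i = rep[i];
  }
  return i;
}

void processEdge(std::vector<int> &rep, std::vector<bool> const &isCore,
                 int i, int j)
{
  if (isCore[i] && isCore[j])
  {
    // Core-core: the usual symmetric union, smaller representative wins.
    int const ri = representative(rep, i);
    int const rj = representative(rep, j);
    if (ri < rj)
      rep[rj] = ri;
    else if (rj < ri)
      rep[ri] = rj;
  }
  else if (!isCore[i] && isCore[j])
  {
    // Boundary-core: always point the boundary point at the core's
    // representative, never the other way around.
    rep[i] = representative(rep, j);
  }
  // Boundary-boundary edges are ignored.
}
```

Because a boundary point only ever becomes a leaf of the representative forest and is never chosen as anyone's representative, no core point's chain can run through it, which is what eliminates the bridging.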
I think I fixed the bug. But I'm not sure how to write a simple verification function without implementing full-blown sequential DBSCAN. The question that needs to be answered: if a point is connected to multiple core points, how would one know whether those core points should belong to the same cluster or not?
OK, I fixed the verification routine. The new routine successfully detected that the version prior to the fix failed verification, while the latest version passed it.
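For reference, one way to verify the output without running full-blown sequential DBSCAN is to check the DBSCAN invariants directly against the neighbor lists: core points belong to the same cluster exactly when they lie in the same connected component of the core-core eps graph, border points must take the label of one of their core neighbors, and points with no core neighbor must be noise. A hypothetical sketch (not necessarily how `verifyClusters()` works):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Hypothetical verification sketch. Inputs: neighbors[i] lists all points
// within eps of point i, labels[i] is the cluster label produced by the
// parallel algorithm (negative meaning noise), and a point is core if it
// has at least minPts neighbors.
bool verifyDBSCAN(std::vector<std::vector<int>> const &neighbors,
                  std::vector<int> const &labels, int minPts)
{
  int const n = neighbors.size();
  std::vector<bool> isCore(n);
  for (int i = 0; i < n; ++i)
    isCore[i] = (int)neighbors[i].size() >= minPts;

  // Connected components of the core-core eps graph via BFS.
  int numComponents = 0;
  std::vector<int> component(n, -1);
  for (int i = 0; i < n; ++i)
  {
    if (!isCore[i] || component[i] >= 0)
      continue;
    component[i] = numComponents;
    std::queue<int> queue;
    queue.push(i);
    while (!queue.empty())
    {
      int const v = queue.front();
      queue.pop();
      for (int w : neighbors[v])
        if (isCore[w] && component[w] < 0)
        {
          component[w] = numComponents;
          queue.push(w);
        }
    }
    ++numComponents;
  }

  // Core points in the same component must share a label (no split), and
  // no two components may share a label (no bridge; the bug above).
  std::vector<int> componentLabel(numComponents, -1);
  for (int i = 0; i < n; ++i)
  {
    if (!isCore[i])
      continue;
    if (labels[i] < 0)
      return false; // a core point cannot be noise
    int const c = component[i];
    if (componentLabel[c] == -1)
      componentLabel[c] = labels[i];
    else if (componentLabel[c] != labels[i])
      return false; // one component carries two labels
  }
  std::sort(componentLabel.begin(), componentLabel.end());
  if (std::adjacent_find(componentLabel.begin(), componentLabel.end()) !=
      componentLabel.end())
    return false; // two components share a label (bridged clusters)

  // Border points must take the label of one of their core neighbors;
  // points with no core neighbor must be noise.
  for (int i = 0; i < n; ++i)
  {
    if (isCore[i])
      continue;
    bool hasCoreNeighbor = false, labelMatches = false;
    for (int w : neighbors[i])
      if (isCore[w])
      {
        hasCoreNeighbor = true;
        labelMatches |= (labels[i] == labels[w]);
      }
    if (hasCoreNeighbor ? !labelMatches : labels[i] >= 0)
      return false;
  }
  return true;
}
```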
Rebased on top of |
Force-pushed from 4f7182f to 816eea4.
I have thought of a (potentially) better way to do this. Right now, we do two full passes over all points. The first pass counts the number of neighbors of each point, and the second produces cluster labels.

The new trick I am thinking about is to notice that we don't have to go through all points in the second loop. The points that are not near any other points are not going to be used, nor are the points that are not near any of the core points. It should be possible to transform the second loop to go over only the core points, potentially significantly reducing the size of the loop.

I did a quick run on the HACC 36M problem calculating the number of core points depending on the
So, at least half of the points are not core points. As the time taken by the second loop is ~2x that of the first, any reduction in it should be significant.

I think it should be implemented as follows (a sketch of the idea is below). In the first parallel loop, besides counting neighbors, a mapping "core point index" -> "point index" should be created. After the first loop, a wrapper around queries should be created using the constructed mapping. In addition, the main DBSCAN loop should be updated to only use core points. Core-core processing should remain untouched, while core-boundary processing should now be allowed, with the boundary point checked for whether it has been processed before (so as to not be processed by multiple clusters in a bridge-boundary situation).

I think this should be pretty straightforward to implement. But maybe the first priority should still be to introduce a proper verification routine, merge this PR, and then do this work. That would allow us to demonstrate the improvement more clearly.
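A minimal sequential sketch of the proposed restructuring (the names are hypothetical, and the real implementation would use parallel Kokkos loops):

```cpp
#include <vector>

// Hypothetical sketch of the proposed restructuring; the real code would
// run these as parallel loops. This only shows the data flow.
void dbscanCoreOnlySketch(std::vector<std::vector<int>> const &neighbors,
                          int minPts)
{
  int const n = neighbors.size();

  // Pass 1: while counting neighbors, also build the mapping
  // "core point index" -> "point index".
  std::vector<int> corePoints;
  for (int i = 0; i < n; ++i)
    if ((int)neighbors[i].size() >= minPts)
      corePoints.push_back(i);

  // Pass 2: loop over core points only; corePoints[k] is the wrapper
  // that translates back to the original point index for the queries.
  std::vector<bool> boundaryProcessed(n, false);
  for (std::size_t k = 0; k < corePoints.size(); ++k)
  {
    int const i = corePoints[k];
    for (int j : neighbors[i])
    {
      if ((int)neighbors[j].size() >= minPts)
      {
        // Core-core: the existing merge logic stays untouched.
      }
      else if (!boundaryProcessed[j])
      {
        // Core-boundary: claim the boundary point once, so a second
        // cluster reaching it later skips it (no bridge-boundary issue).
        boundaryProcessed[j] = true;
        // ...assign j to i's cluster here...
      }
    }
  }
}
```

In the parallel version, the `boundaryProcessed` check-and-set would have to be an atomic compare-and-swap, so that exactly one cluster claims a given boundary point.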
Need to update the verify function to check the results of the procedure.
Specifically, we do `propagate(i, j)` when `i` is a boundary point and `j` is a core point. But what it really does is set the larger representative to the smaller one. So the representative of the core point could end up being set to the representative of the boundary point, and that could lead to an issue: if the boundary point is then connected to a different cluster and updates its representative to that one, this would effectively bridge the two clusters, which is a mistake.
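To make the failure mode concrete, here is a minimal illustration with three points (the indices are hypothetical, and `propagate` stands in for the symmetric smaller-wins update described above):

```cpp
#include <cassert>
#include <vector>

// Symmetric "smaller representative wins" update, as in the version
// before the fix. In this three-point example every representative is
// already a root, so no find() traversal is needed.
void propagate(std::vector<int> &rep, int i, int j)
{
  int const ri = rep[i];
  int const rj = rep[j];
  if (ri < rj)
    rep[j] = ri;
  else if (rj < ri)
    rep[i] = rj;
}

int main()
{
  // Hypothetical setup: point 0 is a core point of cluster A, point 2 is
  // a core point of cluster B, and point 1 is a boundary point within eps
  // of both. Initially, every point is its own representative.
  std::vector<int> rep{0, 1, 2};

  // Boundary 1 meets core 0: rep[1] becomes 0. Fine so far.
  propagate(rep, 1, 0);
  // Boundary 1 meets core 2: smaller-wins propagates rep[1] == 0 into
  // rep[2], merging cluster B into cluster A through the boundary point
  // even though no core-core edge exists.
  propagate(rep, 1, 2);

  assert(rep[2] == 0); // the two clusters are now (incorrectly) bridged
}
```

With the directional fix, the second call would instead set `rep[1] = 2`, reassigning only the boundary point; the core representatives of the two clusters are never touched, so no bridge can form.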
Merging; the only CI failure is the standard HIP one.
Fix #330.
This was pretty straightforward. The only thing that I am not fond of is that it does two passes, but I can't think of how to fix this yet.
Checked that the timings and output of the original halo finder example on HACC data are unchanged.
Also, I'm not 100% sure whether the `verifyClusters()` function catches all the corner cases.