-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Halo finder #237
Halo finder #237
Conversation
@masterleinad I do not want to touch the original ECL-CC without a need. Would your change have performance implications, or are they absolutely equivalent? I would rather not touch performance-sensitive code. |
That's the proper replacement for it according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Otherwise, the code does not even compile for the |
I'm testing using 10.1. Are you saying that's too old? |
I am not sure if that's the issue. Honestly, part of me pushing the two changes was just to see if that fixes running the testers. |
4c4fa04
to
a54b3de
Compare
I forced-push a fix that uses |
Does this pull request pass the tests for you locally? |
Technically yes, but it probably does not really. I have not put the code to copy the Overall, I would say to not worry about passing and things. I just made a draft PR to share thoughts. It's not final yet, and more work needs to be done. |
afe552b
to
6906cd4
Compare
Cleaned up and rebased on master. |
f3f8a66
to
59f37ef
Compare
Implemented v2 of the halo finder. Changesv1 did the following steps:
v2 changed it to
I was able to combine
ResultsTimings (ran on Summit login node):
Speedup: 2.4x Memory footprintMemory is split as follows (for the HACC data with 37M points):
Memory reduction: 3.6x Misc commentsI see very few downsides to the new implementation. There are couple areas in callback that could be examined in a bit more details (such as initialization), as initialization. There is also the fact that original ECL-CC implementation separated all nodes into three groups depending on the degree, and processed them differently (one thread per node/warp per node/block per node). However, here we can't possibly do that as node degrees are not available at any given moment. From the maintenance point of view, the new code is more fragile. The reason for that is that we have no way to verify whether connected components were computed correctly, as we don't have access to a graph anymore. I guess I may need to restore the verify code just for the driver to have something to test against. |
{ | ||
// initialize to the first neighbor that's smaller | ||
if (Kokkos::atomic_compare_exchange(&stat_(i), i, j) == i) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the init stage. Need to think more carefully whether this return
is warranted, or if it makes sense to continue here. It's all about races between threads.
The v2 algorithm is based on the [1] paper. The general gist of the algorithm can be found on the second page and reads:
Union-find algorithm is sped up through path compression:
ECL-CC makes use of an intermediate pointer jumping:
This is encoded in [1] Jaiganesh, Jayadharini, and Martin Burtscher. "A high-performance connected components implementation for GPUs." In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 92-104. 2018. |
The comments contain snippets of text from the original paper.
OpenMP seems to produce correct results even without atomics.
It is certainly advisable to look at the postprocessing in more details, as there is something really not digestible for OpenMP there. Edit: quick examination showed that it is all inside It is surprising, however, that BinSort is so bad. It is done on |
For me, it's OK to keep this is |
@masterleinad What kind of documentation do you envision? |
Something like
|
@masterleinad I put some documentation in. It should be sufficient for now. |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor comments. Looks good other than that.
@dalg24 Thanks for the review. I'll address your comments, but the main thing is that I still don't understand why PGI build fails:
Update |
Co-authored-by: Damien L-G <dalg24@gmail.com>
Should be ready. |
Work related to #161.
Remaining items:
verify()
in the driverBoth in
representative()
andflatten()
Properties...
compilation issuebenchmarks
toexamples