Halo finder #237

Merged
merged 18 commits into from
May 12, 2020

aprokop
Contributor

@aprokop aprokop commented Mar 5, 2020

Work related to #161.

Remaining items:

  • Re-add verify() in the driver
  • Resolve whether atomics are needed for OpenMP
    Both in representative() and flatten()
  • Fix Properties... compilation issue
  • Add brief comments in the code about the algorithm
  • Cleanup code
  • Migrate from benchmarks to examples

@aprokop aprokop added the application Anything to support specific applications label Mar 5, 2020
Resolved review threads: benchmarks/halo_finder/ECL-CC_10.cu
@aprokop
Contributor Author

aprokop commented Mar 5, 2020

@masterleinad I do not want to touch the original ECL-CC without a need. Would your change have performance implications, or are they absolutely equivalent? I would rather not touch performance-sensitive code.

@masterleinad
Collaborator

That's the proper replacement for it according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Otherwise, the code does not even compile for the Cuda versions we have on the CI tester.

@aprokop
Contributor Author

aprokop commented Mar 5, 2020

> the code does not even compile for the Cuda versions we have on the CI tester.

I'm testing using 10.1. Are you saying that's too old?

@masterleinad
Collaborator

> I'm testing using 10.1. Are you saying that's too old?

I am not sure if that's the issue. Honestly, part of me pushing the two changes was just to see if that fixes running the testers.

@masterleinad
Collaborator

I force-pushed a fix that uses __shfl_sync in the (hopefully) correct way, see https://stackoverflow.com/questions/46345811/cuda-9-shfl-vs-shfl-sync.

Resolved review threads: benchmarks/halo_finder/CMakeLists.txt, benchmarks/halo_finder/halo_finder.cpp, CMakeLists.txt
@masterleinad
Collaborator

Does this pull request pass the tests for you locally?

@aprokop
Contributor Author

aprokop commented Mar 6, 2020

Technically yes, but in practice probably not. For example, I have not yet added the code that copies the input.txt file to the build directory.

Overall, I would not worry about passing tests yet. I just made a draft PR to share thoughts. It's not final, and more work needs to be done.

Resolved review threads: benchmarks/halo_finder/halo_finder.cpp, benchmarks/halo_finder/ArborX_HaloFinder.hpp
@aprokop
Contributor Author

aprokop commented Apr 20, 2020

Cleaned up and rebased on master.

@aprokop aprokop marked this pull request as ready for review April 20, 2020 23:40
@aprokop aprokop marked this pull request as draft April 20, 2020 23:43
@aprokop aprokop force-pushed the halo_finder branch 2 times, most recently from f3f8a66 to 59f37ef Compare April 21, 2020 03:30
@aprokop
Contributor Author

aprokop commented May 3, 2020

Implemented v2 of the halo finder.

Changes

v1 did the following steps:

  1. Construct tree
  2. Do queries producing (indices, offset)
  3. Call ECL-CC on (indices, offset) producing ccs
  4. Postprocess ccs

v2 changed it to

  1. Construct tree
  2. Do queries with callback producing ccs
  3. Postprocess ccs

I was able to combine the init and compute1 stages of the ECL-CC algorithm and use them inside inline callbacks. The benefits are tremendous:

  • Single pass in radius search
    We never call insert(), instead updating a single array
  • Drastic reduction in used memory
    We don't store results
  • Code can be used with any backend
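
To make the v2 flow concrete, here is a minimal serial sketch in plain C++ (no Kokkos, no ArborX). It is not the code in this PR: `Point`, `MergeCallback`, `for_each_neighbor_pair`, and `find_components` are hypothetical names, and the brute-force pair loop only stands in for the tree query. The point it illustrates is that the callback updates a single labels array for every neighbor pair, so no (indices, offset) result is ever stored.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

struct Point
{
  double x, y, z;
};

// Walk parent links until reaching an element that points to itself.
inline int find_root(const std::vector<int> &labels, int i)
{
  while (labels[i] != i)
    i = labels[i];
  return i;
}

// Invoked once per neighbor pair (i, j). It merges the two components in
// place, so the pair itself never has to be stored.
struct MergeCallback
{
  std::vector<int> &labels;

  void operator()(int i, int j) const
  {
    int const ri = find_root(labels, i);
    int const rj = find_root(labels, j);
    // Hook the larger root under the smaller one, so the smallest index
    // of a component ends up as its label.
    if (ri < rj)
      labels[rj] = ri;
    else if (rj < ri)
      labels[ri] = rj;
  }
};

// Hypothetical stand-in for the tree query: brute-force O(n^2) radius search.
template <typename Callback>
void for_each_neighbor_pair(const std::vector<Point> &points, double radius,
                            Callback callback)
{
  for (std::size_t i = 0; i < points.size(); ++i)
    for (std::size_t j = 0; j < i; ++j)
    {
      double const dx = points[i].x - points[j].x;
      double const dy = points[i].y - points[j].y;
      double const dz = points[i].z - points[j].z;
      if (dx * dx + dy * dy + dz * dz <= radius * radius)
        callback(static_cast<int>(i), static_cast<int>(j));
    }
}

std::vector<int> find_components(const std::vector<Point> &points,
                                 double radius)
{
  // Every point starts as its own component.
  std::vector<int> labels(points.size());
  std::iota(labels.begin(), labels.end(), 0);

  // Single pass over neighbor pairs; nothing but `labels` is written.
  for_each_neighbor_pair(points, radius, MergeCallback{labels});

  // Flatten: make every entry point directly at its root.
  for (std::size_t i = 0; i < labels.size(); ++i)
    labels[i] = find_root(labels, static_cast<int>(i));
  return labels;
}
```

A parallel version would need atomic updates of `labels` (or an argument for why the races are benign), which is the open question listed in the remaining items.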

Results

Timings (ran on Summit login node):

| Step | v1 | v2 |
|---|---|---|
| construction | 0.11 | 0.11 |
| query | 8.21 | 3.35 |
| ccs + halos | 0.15 | 0.05 |
| total | 8.47 | 3.50 |

Speedup: 2.4x

Memory footprint

Memory is split as follows (for the HACC data with 37M points):

| Type | v1 size (MB) | v2 size (MB) |
|---|---|---|
| tree | 2252.36 | 2252.36 |
| tree query results (csr) | 6517.58 | 0 |
| ccs | 140.77 | 140.77 |
| halos (csr) | 52.03 | 52.03 |
| total | 8962.74 | 2445.16 |

Memory reduction: 3.6x

Misc comments

I see very few downsides to the new implementation. There are a couple of areas in the callback that could be examined in more detail, such as initialization. There is also the fact that the original ECL-CC implementation separated all nodes into three groups depending on their degree and processed them differently (one thread per node, one warp per node, or one block per node). Here we cannot do that, as node degrees are not known while the search is in progress.

From a maintenance point of view, the new code is more fragile: we have no way to verify that connected components were computed correctly, as we no longer have access to a graph. I may need to restore the verify code just so the driver has something to test against.
I did verify that this code produces the same results as v1 for the HACC data.
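
For reference, this is roughly what a graph-based check could look like. It is a hypothetical sketch, not the verify() from ECL-CC or from this PR, and it assumes an explicit edge list is available, which is exactly what v2 no longer keeps. `verify_components` is an invented name. It accepts the labels only if they describe the same partition as a traversal of the graph: every component carries one label, and no two components share a label.

```cpp
#include <set>
#include <utility>
#include <vector>

// Returns true if `labels` assigns one distinct label per connected
// component of the undirected graph given by `edges` on vertices 0..n-1.
bool verify_components(int n, const std::vector<std::pair<int, int>> &edges,
                       const std::vector<int> &labels)
{
  if (static_cast<int>(labels.size()) != n)
    return false;

  // Build adjacency lists from the edge list.
  std::vector<std::vector<int>> adjacency(n);
  for (const auto &edge : edges)
  {
    adjacency[edge.first].push_back(edge.second);
    adjacency[edge.second].push_back(edge.first);
  }

  std::vector<bool> visited(n, false);
  std::set<int> used_labels;
  for (int start = 0; start < n; ++start)
  {
    if (visited[start])
      continue;
    // A label already claimed by an earlier component means two
    // components were merged incorrectly.
    if (!used_labels.insert(labels[start]).second)
      return false;

    // Depth-first traversal of the component containing `start`.
    std::vector<int> stack{start};
    visited[start] = true;
    while (!stack.empty())
    {
      int const u = stack.back();
      stack.pop_back();
      // A different label inside one component means it was split.
      if (labels[u] != labels[start])
        return false;
      for (int v : adjacency[u])
        if (!visited[v])
        {
          visited[v] = true;
          stack.push_back(v);
        }
    }
  }
  return true;
}
```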

Review threads: benchmarks/halo_finder/ArborX_HaloFinder.hpp
{
// initialize to the first neighbor that's smaller
if (Kokkos::atomic_compare_exchange(&stat_(i), i, j) == i)
return;
Contributor Author


This is the init stage. Need to think more carefully whether this return is warranted, or if it makes sense to continue here. It's all about races between threads.

Resolved review threads: benchmarks/halo_finder/halo_finder.cpp, .clang-format-ignore
@aprokop
Contributor Author

aprokop commented May 5, 2020

The v2 algorithm is based on the paper [1]. The general gist of the algorithm can be found on the second page and reads:

> ...the majority of the parallel CC algorithms are based on the following
> “label propagation” approach. Each vertex has a label to hold the component ID
> to which it belongs. Initially, this label is set to the vertex ID, that is,
> each vertex is considered a separate component, which can trivially be done in
> parallel. Then the vertices are iteratively processed in parallel to determine
> the connected components. For instance, each vertex’s label can be updated with
> the smallest label among its neighbors. This process repeats until there are no
> more updates, at which point all vertices in a connected component have the
> same label. In this example, the ID of the minimum vertex in each component
> serves as the component ID, which guarantees uniqueness. To speed up the
> computation, the labels are often replaced by “parent” pointers that form a
> union-find data structure.
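
As an illustration of the label propagation idea in the quote, here is a small serial sketch. It assumes an explicit edge list and sweeps over the edges until no label changes; the function name `label_propagation` is made up for this example and does not appear in the PR.

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// Label propagation on an undirected graph with vertices 0..n-1.
// Each vertex starts with its own ID as label; labels are repeatedly
// lowered to the smallest label seen across an edge until nothing changes.
std::vector<int> label_propagation(int n,
                                   const std::vector<std::pair<int, int>> &edges)
{
  std::vector<int> label(n);
  std::iota(label.begin(), label.end(), 0);

  bool changed = true;
  while (changed)
  {
    changed = false;
    for (const auto &edge : edges)
    {
      int const smallest = std::min(label[edge.first], label[edge.second]);
      if (label[edge.first] != smallest)
      {
        label[edge.first] = smallest;
        changed = true;
      }
      if (label[edge.second] != smallest)
      {
        label[edge.second] = smallest;
        changed = true;
      }
    }
  }
  return label;
}
```

On a long chain this can take many sweeps, which is the motivation for the parent-pointer (union-find) formulation mentioned at the end of the quote.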

The union-find algorithm is sped up through path compression:

> Starting from any vertex, we can follow the parent pointers until we reach a
> vertex that points to itself. This final vertex is the “representative”. Every
> vertex’s implicit label is the ID of its representative and, as before, all
> vertices with the same label (aka representative) belong to the same component.
> Thus, making the representative u point to v indirectly changes the labels of
> all vertices whose representative used to be u. This makes union operations
> very fast. A chain of parent pointers leading to a representative is called a
> “path”. To also make the traversals of these paths, i.e., the find operations,
> very fast, the paths are sporadically shortened by making earlier elements skip
> over some elements and point directly to later elements. This process is
> referred to as “path compression”.
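
The union operation described in the quote can be sketched as follows. This is a serial illustration with hypothetical names (`find_representative`, `unite`), not the code from the PR. It shows that a single write to the parent of one representative relabels every vertex that used to belong to it. Hooking the larger representative under the smaller one is a choice made here so that the minimum vertex ID stays the component ID.

```cpp
#include <utility>
#include <vector>

// Follow parent pointers until reaching a vertex that points to itself.
int find_representative(const std::vector<int> &parent, int v)
{
  while (parent[v] != v)
    v = parent[v];
  return v;
}

// Merge the components containing a and b with a single pointer update.
void unite(std::vector<int> &parent, int a, int b)
{
  int u = find_representative(parent, a);
  int v = find_representative(parent, b);
  if (u == v)
    return; // already in the same component
  if (u < v)
    std::swap(u, v); // make u the larger representative
  parent[u] = v;     // every vertex that led to u now leads to v
}
```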

ECL-CC uses intermediate pointer jumping:

> Intermediate pointer jumping only requires a single traversal while still
> compressing the path of all elements encountered along the way, albeit not by
> as much as multiple pointer jumping. It accomplishes this by making every
> element skip over the next element, thus halving the path length in each
> traversal.

This is implemented in the representative() function.
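
A serial sketch of intermediate pointer jumping, written from the description in the quote rather than copied from the PR or from ECL-CC. The function reuses the name `representative()` for readability, but the signature (a `std::vector<int>` of parent pointers) is an assumption made for this example, and it ignores the thread-safety questions discussed in this PR.

```cpp
#include <vector>

// Find the representative of i while making every element on the path
// skip over its successor (intermediate pointer jumping).
int representative(std::vector<int> &parent, int i)
{
  int curr = parent[i];
  if (curr != i)
  {
    int prev = i;
    int next = parent[curr];
    while (curr != next) // curr is not yet the representative
    {
      parent[prev] = next; // prev now skips over curr
      prev = curr;
      curr = next;
      next = parent[curr];
    }
  }
  return curr;
}
```

For the chain 4 -> 3 -> 2 -> 1 -> 0, one call with `i = 4` returns 0 and leaves 4 -> 2 -> 0, so the path from 4 shrinks from four hops to two.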

[1] Jaiganesh, Jayadharini, and Martin Burtscher. "A high-performance connected components implementation for GPUs." In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 92-104. 2018.

@aprokop
Contributor Author

aprokop commented May 5, 2020

OpenMP seems to produce correct results even without atomics.
However, timings are interesting:

| Step | Cuda | OpenMP |
|---|---|---|
| construction | 0.11 | 1.60 |
| query + ccs | 3.39 | 16.14 |
| halos | 0.05 | 70.44 |

The postprocessing deserves a closer look, as something there performs very poorly with OpenMP.

Edit: a quick examination showed that the time is all spent inside Kokkos::BinSort. The other two kernels, compute_halos_starts_and_sizes and populate_halos, take no time at all.

It is surprising, however, that BinSort is so bad. It is done on an array of ccs size, which is the same size as predicates and primitives. However, sorting Morton codes for those does not exhibit this behavior. The properties of those arrays must differ from those of ccs. The obvious difference is that ccs indices are not unique, sometimes by a large margin (for example, the largest halo size encountered is ~100K, which means the array we sort has that many identical indices).
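
For readers unfamiliar with this postprocessing step, the following serial sketch shows the general shape of turning per-point labels into a CSR-style list of halos: sort point indices by label, then mark where each run of equal labels starts. It uses `std::stable_sort` in place of Kokkos::BinSort, and the names `Halos` and `group_by_label` are hypothetical; the actual kernels in the PR may be organized differently.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// CSR-style grouping: the points of halo h are
// indices[offsets[h]] .. indices[offsets[h + 1] - 1].
struct Halos
{
  std::vector<int> offsets;
  std::vector<int> indices;
};

Halos group_by_label(const std::vector<int> &ccs)
{
  int const n = static_cast<int>(ccs.size());

  // Sort point indices by their component label. Many keys are equal,
  // which is the property singled out above as a likely problem for BinSort.
  std::vector<int> permutation(n);
  std::iota(permutation.begin(), permutation.end(), 0);
  std::stable_sort(permutation.begin(), permutation.end(),
                   [&ccs](int a, int b) { return ccs[a] < ccs[b]; });

  Halos halos;
  halos.indices = permutation;
  // A new halo starts wherever the label differs from the previous one.
  for (int k = 0; k < n; ++k)
    if (k == 0 || ccs[permutation[k]] != ccs[permutation[k - 1]])
      halos.offsets.push_back(k);
  halos.offsets.push_back(n);
  return halos;
}
```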

@masterleinad
Collaborator

For me, it's OK to keep this in examples, but we should provide some documentation, maybe in the form of a README.md.

@aprokop
Contributor Author

aprokop commented May 5, 2020

> For me, it's OK to keep this in examples, but we should provide some documentation, maybe in the form of a README.md.

@masterleinad What kind of documentation do you envision?

@masterleinad
Collaborator

> @masterleinad What kind of documentation do you envision?

Something like

This example shows how ArborX can be used for ... The implementation follows the paper ...
The program expects an input file in the following format: ...
An example is given in "examples/halo_finder/input.txt". This case describes ...
Hence, we expect to see ... and this matches the output when running the executable with that file as input.

@aprokop
Contributor Author

aprokop commented May 5, 2020

@masterleinad I put some documentation in. It should be sufficient for now.

@aprokop
Contributor Author

aprokop commented May 9, 2020

retest this please

Contributor

@dalg24 dalg24 left a comment


Only minor comments. Looks good other than that.

Resolved review threads: examples/halo_finder/ArborX_HaloFinder.hpp, examples/halo_finder/halo_finder.cpp
@aprokop
Contributor Author

aprokop commented May 11, 2020

@dalg24 Thanks for the review. I'll address your comments, but the main thing is that I still don't understand why the PGI build fails:

"/var/jenkins/workspace/ArborX_PR-237/test_install/examples/halo_finder/ArborX_HaloFinder.hpp", line 15: catastrophic error: cannot open source file "ArborX_DetailsSortUtils.hpp"
  #include <ArborX_DetailsSortUtils.hpp>

Update
Explanation: when the halo finder was copied from benchmarks/ to examples/, the line with set(ArborX_Target ArborX) was not removed. However, this target must be different depending on whether the example is compiled internally or externally.

aprokop and others added 2 commits May 11, 2020 20:27
Co-authored-by: Damien L-G <dalg24@gmail.com>
@aprokop
Contributor Author

aprokop commented May 12, 2020

Should be ready.

@aprokop aprokop merged commit 281b77d into arborx:master May 12, 2020
@aprokop aprokop deleted the halo_finder branch May 12, 2020 01:13
@aprokop aprokop mentioned this pull request Oct 1, 2020