Halo finder #237

aprokop · 2020-03-05T19:42:35Z

Work related to #161.

Remaining items:

- Re-add verify() in the driver
- Resolve whether atomics are needed for OpenMP
  - Both in representative() and flatten()
- Fix Properties... compilation issue
- Add brief comments in the code about the algorithm
- Cleanup code
- Migrate from benchmarks to examples

benchmarks/halo_finder/ECL-CC_10.cu

aprokop · 2020-03-05T22:17:25Z

@masterleinad I do not want to touch the original ECL-CC without a need. Would your change have performance implications, or are they absolutely equivalent? I would rather not touch performance-sensitive code.

masterleinad · 2020-03-05T22:21:08Z

That's the proper replacement for it according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Otherwise, the code does not even compile for the Cuda versions we have on the CI tester.

aprokop · 2020-03-05T23:56:45Z

> the code does not even compile for the Cuda versions we have on the CI tester.

I'm testing using 10.1. Are you saying that's too old?

masterleinad · 2020-03-06T04:39:56Z

> I'm testing using 10.1. Are you saying that's too old?

I am not sure if that's the issue. Honestly, part of me pushing the two changes was just to see if that fixes running the testers.

masterleinad · 2020-03-06T05:38:15Z

I force-pushed a fix that uses __shfl_sync the (hopefully) correct way; see https://stackoverflow.com/questions/46345811/cuda-9-shfl-vs-shfl-sync.

CMakeLists.txt

masterleinad · 2020-03-06T16:52:39Z

Does this pull request pass the tests for you locally?

aprokop · 2020-03-06T16:53:57Z

Technically yes, but it probably does not really. For example, I have not added the code that copies the input.txt file to the build directory.

Overall, I would not worry about the tests passing yet. I opened a draft PR just to share thoughts. It's not final, and more work needs to be done.

aprokop · 2020-04-20T23:33:54Z

Cleaned up and rebased on master.

aprokop · 2020-05-03T17:44:00Z

Implemented v2 of the halo finder.

Changes

v1 did the following steps:

1. Construct tree
2. Do queries producing (indices, offset)
3. Call ECL-CC on (indices, offset) producing ccs
4. Postprocess ccs

v2 changed it to:

1. Construct tree
2. Do queries with callback producing ccs
3. Postprocess ccs

I was able to combine the init and compute1 stages of the ECL-CC algorithm and use them inside inline callbacks. The benefits are tremendous:

- Single pass in radius search
  - We never call insert(), instead updating a single array
- Drastic reduction in used memory
  - We don't store results
- Code can be used with any backend
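To make the idea of merging components inside the query callback concrete, here is a minimal serial sketch. It is not the code from this PR: the brute-force loop in `forEachNeighborPair` is a hypothetical stand-in for the tree traversal, and `Point`, `MergeCallback`, `findRoot` and `connectedComponents` are illustrative names. What it shows is that each neighbor pair is consumed as soon as it is found and only a single labels array is updated, so no (indices, offset) structure is ever stored.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Point
{
  float x, y, z;
};

// Follow parent pointers until reaching a vertex that points to itself.
inline int findRoot(std::vector<int> const &labels, int i)
{
  while (labels[i] != i)
    i = labels[i];
  return i;
}

// Hypothetical callback, invoked once per neighbor pair (i, j).
// It merges the two components in place; the smaller root always wins.
struct MergeCallback
{
  std::vector<int> &labels;

  void operator()(int i, int j) const
  {
    int const ri = findRoot(labels, i);
    int const rj = findRoot(labels, j);
    if (ri < rj)
      labels[rj] = ri;
    else if (rj < ri)
      labels[ri] = rj;
  }
};

// Brute-force stand-in for the radius search: calls the callback for every
// pair of points closer than the radius, without storing the pairs.
template <typename Callback>
void forEachNeighborPair(std::vector<Point> const &points, float radius,
                         Callback const &callback)
{
  float const r2 = radius * radius;
  for (std::size_t i = 0; i < points.size(); ++i)
    for (std::size_t j = 0; j < i; ++j)
    {
      float const dx = points[i].x - points[j].x;
      float const dy = points[i].y - points[j].y;
      float const dz = points[i].z - points[j].z;
      if (dx * dx + dy * dy + dz * dz <= r2)
        callback(static_cast<int>(i), static_cast<int>(j));
    }
}

std::vector<int> connectedComponents(std::vector<Point> const &points,
                                     float radius)
{
  std::vector<int> labels(points.size());
  std::iota(labels.begin(), labels.end(), 0); // every point is its own component
  forEachNeighborPair(points, radius, MergeCallback{labels});
  // Final pass: make every point refer directly to its representative.
  for (std::size_t i = 0; i < labels.size(); ++i)
    labels[i] = findRoot(labels, static_cast<int>(i));
  return labels;
}
```

A parallel version would need atomic updates of the labels array, which this serial sketch deliberately leaves out.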

Results

Timings (run on a Summit login node):

| Step | v1 | v2 |
| --- | --- | --- |
| construction | 0.11 | 0.11 |
| query | 8.21 | 3.35 |
| ccs + halos | 0.15 | 0.05 |
| total | 8.47 | 3.50 |

Speedup: 2.4x

Memory footprint

Memory is split as follows (for the HACC data with 37M points):

| Type | v1 size (MB) | v2 size (MB) |
| --- | --- | --- |
| tree | 2252.36 | 2252.36 |
| tree query results (csr) | 6517.58 | 0 |
| ccs | 140.77 | 140.77 |
| halos (csr) | 52.03 | 52.03 |
| total | 8962.74 | 2445.16 |

Memory reduction: 3.6x

Misc comments

I see very few downsides to the new implementation. There are a couple of areas in the callback that could be examined in more detail, such as initialization. There is also the fact that the original ECL-CC implementation separated all nodes into three groups depending on their degree and processed them differently (one thread per node, one warp per node, or one block per node). Here we cannot do that, as node degrees are not available at any point during the search.

From the maintenance point of view, the new code is more fragile. The reason for that is that we have no way to verify whether connected components were computed correctly, as we don't have access to a graph anymore. I guess I may need to restore the verify code just for the driver to have something to test against.
I did verify that this code produces the same results as v1 for the HACC data.

benchmarks/halo_finder/ArborX_HaloFinder.hpp

aprokop · 2020-05-03T17:46:01Z

benchmarks/halo_finder/ArborX_HaloFinder.hpp

+    {
+      // initialize to the first neighbor that's smaller
+      if (Kokkos::atomic_compare_exchange(&stat_(i), i, j) == i)
+        return;


This is the init stage. I need to think more carefully about whether this return is warranted, or whether it makes sense to continue here. It's all about races between threads.
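The semantics of the compare-and-exchange in that snippet can be illustrated with a small sketch. This is an assumption-laden stand-in, not the PR code: `std::atomic` replaces the Kokkos atomics, and `compareExchange` and `tryInit` are hypothetical helper names. It shows that the exchange succeeds only while the label of `i` is still `i`, so only the first smaller neighbor performs the initialization and later neighbors see a different old value.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Stand-in for Kokkos::atomic_compare_exchange: stores `desired` only if the
// current value equals `expected`, and returns the value seen before the call.
inline int compareExchange(std::atomic<int> &target, int expected, int desired)
{
  // On failure, compare_exchange_strong overwrites `expected` with the current
  // value; on success, `expected` already equals the old value.
  target.compare_exchange_strong(expected, desired);
  return expected;
}

// Init step for a pair (i, j) with j < i. Returns true if the label of i was
// still i (untouched) and has now been set to j.
inline bool tryInit(std::vector<std::atomic<int>> &labels, int i, int j)
{
  return compareExchange(labels[i], i, j) == i;
}
```

Whether a thread should return after a successful exchange or continue into the merge step is exactly the open question raised above; the sketch does not settle it.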

aprokop · 2020-05-05T11:56:59Z

The v2 algorithm is based on the paper [1]. The gist of the algorithm can be found on the second page and reads:

> ...the majority of the parallel CC algorithms are based on the following
> “label propagation” approach. Each vertex has a label to hold the component ID
> to which it belongs. Initially, this label is set to the vertex ID, that is,
> each vertex is considered a separate component, which can trivially be done in
> parallel. Then the vertices are iteratively processed in parallel to determine
> the connected components. For instance, each vertex’s label can be updated with
> the smallest label among its neighbors. This process repeats until there are no
> more updates, at which point all vertices in a connected component have the
> same label. In this example, the ID of the minimum vertex in each component
> serves as the component ID, which guarantees uniqueness. To speed up the
> computation, the labels are often replaced by “parent” pointers that form a
> union-find data structure.
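The label propagation approach described in the quote can be sketched serially as follows. This is only an illustration of the generic technique, not ECL-CC and not the code in this PR; the edge-list input and the name `labelPropagation` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Serial label propagation over an undirected edge list.
// Each vertex starts as its own component; labels are lowered to the smallest
// label seen across an edge until a full sweep makes no change.
std::vector<int> labelPropagation(int numVertices,
                                  std::vector<std::pair<int, int>> const &edges)
{
  std::vector<int> labels(numVertices);
  std::iota(labels.begin(), labels.end(), 0);

  bool updated = true;
  while (updated)
  {
    updated = false;
    for (auto const &edge : edges)
    {
      int const smallest = std::min(labels[edge.first], labels[edge.second]);
      if (labels[edge.first] != smallest)
      {
        labels[edge.first] = smallest;
        updated = true;
      }
      if (labels[edge.second] != smallest)
      {
        labels[edge.second] = smallest;
        updated = true;
      }
    }
  }
  return labels;
}
```

At termination every vertex carries the smallest vertex ID of its component, which is the uniqueness property the quote mentions.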

The union-find algorithm is sped up through path compression:

> Starting from any vertex, we can follow the parent pointers until we reach a
> vertex that points to itself. This final vertex is the “representative”. Every
> vertex’s implicit label is the ID of its representative and, as before, all
> vertices with the same label (aka representative) belong to the same component.
> Thus, making the representative u point to v indirectly changes the labels of
> all vertices whose representative used to be u. This makes union operations
> very fast. A chain of parent pointers leading to a representative is called a
> “path”. To also make the traversals of these paths, i.e., the find operations,
> very fast, the paths are sporadically shortened by making earlier elements skip
> over some elements and point directly to later elements. This process is
> referred to as “path compression”.

ECL-CC uses intermediate pointer jumping:

> Intermediate pointer jumping only requires a single traversal while still
> compressing the path of all elements encountered along the way, albeit not by
> as much as multiple pointer jumping. It accomplishes this by making every
> element skip over the next element, thus halving the path length in each
> traversal.

This is implemented in the representative() function.
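A serial sketch of a find operation with intermediate pointer jumping is given below. It follows the description in the quote rather than reproducing the function in this PR; it assumes a plain `std::vector<int>` of parent pointers in which every parent is less than or equal to its child, and it omits the atomics and device annotations a parallel version would need.

```cpp
#include <cassert>
#include <vector>

// Find the representative of vertex i. While walking the path, every visited
// element is redirected to skip over the next element, which roughly halves
// the path length in a single traversal.
// Assumes labels[v] <= v for every vertex v.
inline int representative(int i, std::vector<int> &labels)
{
  int curr = labels[i];
  if (curr != i)
  {
    int prev = i;
    int next;
    while (curr > (next = labels[curr]))
    {
      labels[prev] = next; // prev now skips over curr
      prev = curr;
      curr = next;
    }
  }
  return curr;
}
```

For a chain 4 -> 3 -> 2 -> 1 -> 0, one call on vertex 4 returns 0 and leaves the chain 4 -> 2 -> 0, so a second traversal from 4 takes two steps instead of four.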

[1] Jaiganesh, Jayadharini, and Martin Burtscher. "A high-performance connected components implementation for GPUs." In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 92-104. 2018.

The comments contain snippets of text from the original paper.

aprokop · 2020-05-05T17:27:02Z

OpenMP seems to produce correct results even without atomics.
However, timings are interesting:

| Step | Cuda | OpenMP |
| --- | --- | --- |
| construction | 0.11 | 1.60 |
| query + ccs | 3.39 | 16.14 |
| halos | 0.05 | 70.44 |

It is certainly advisable to look at the postprocessing in more detail, as something there performs very poorly with OpenMP.

Edit: quick examination showed that it is all inside Kokkos::BinSort. The other two kernels, compute_halos_starts_and_sizes and populate_halos, take no time at all.

It is surprising, however, that BinSort is so bad. It is done on an array of the size of ccs, which is the same size as predicates and primitives. However, sorting Morton codes for those does not exhibit this behavior. The properties of the arrays being sorted there must differ from those of ccs. The obvious difference is that ccs values are not unique, sometimes by a large margin (for example, the largest halo encountered has ~100K points, which means the array we sort contains that many identical values).
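For orientation, the postprocessing discussed here (sort points by component label, then find where each group starts) can be sketched serially as below. This is an illustration under assumptions, not the PR code: `std::stable_sort` stands in for Kokkos::BinSort, the scan over sorted labels stands in for the two kernels named above, and `Halos` and `groupByComponent` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// CSR-like grouping: points of group g are
// indices[offsets[g]] .. indices[offsets[g + 1] - 1].
struct Halos
{
  std::vector<int> offsets;
  std::vector<int> indices;
};

Halos groupByComponent(std::vector<int> const &ccs)
{
  int const n = static_cast<int>(ccs.size());
  Halos halos;

  // Permutation that orders points by their component label. Labels repeat
  // heavily (one value per member of each component).
  halos.indices.resize(n);
  std::iota(halos.indices.begin(), halos.indices.end(), 0);
  std::stable_sort(halos.indices.begin(), halos.indices.end(),
                   [&ccs](int a, int b) { return ccs[a] < ccs[b]; });

  // A new group starts wherever the sorted label changes.
  for (int k = 0; k < n; ++k)
    if (k == 0 || ccs[halos.indices[k]] != ccs[halos.indices[k - 1]])
      halos.offsets.push_back(k);
  halos.offsets.push_back(n);

  return halos;
}
```

The size of group g is `offsets[g + 1] - offsets[g]`; any filtering of small groups is left out of the sketch.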

masterleinad · 2020-05-05T17:59:30Z

For me, it's OK to keep this in examples, but we should provide some documentation, maybe in the form of a README.md.

aprokop · 2020-05-05T18:59:11Z

> For me, it's OK to keep this in examples, but we should provide some documentation, maybe in the form of a README.md.

@masterleinad What kind of documentation do you envision?

masterleinad · 2020-05-05T19:07:01Z

> @masterleinad What kind of documentation do you envision?

Something like

This example shows how ArborX can be used for ... The implementation follows the paper ...
The program expects an input file in the following format: ...
An example is given in "examples/halo_finder/input.txt". This case describes ...
Hence, we expect to see ... and this matches with the output when running the executable with that file as input.

aprokop · 2020-05-05T22:26:23Z

@masterleinad I put some documentation in. It should be sufficient for now.

aprokop · 2020-05-09T19:28:33Z

retest this please

dalg24

Only minor comments. Looks good other than that.

aprokop · 2020-05-11T18:53:11Z

@dalg24 Thanks for the review. I'll address your comments, but the main thing is that I still don't understand why the PGI build fails:

"/var/jenkins/workspace/ArborX_PR-237/test_install/examples/halo_finder/ArborX_HaloFinder.hpp", line 15: catastrophic error: cannot open source file "ArborX_DetailsSortUtils.hpp"
  #include <ArborX_DetailsSortUtils.hpp>

Update

Explanation: when the halo finder was copied from benchmarks/ to examples/, the line with set(ArborX_Target ArborX) was not removed. However, this target must be different when the example is compiled internally versus externally.

Co-authored-by: Damien L-G <dalg24@gmail.com>

aprokop · 2020-05-12T00:49:28Z

Should be ready.

aprokop added the application Anything to support specific applications label Mar 5, 2020

masterleinad reviewed Mar 5, 2020

masterleinad force-pushed the halo_finder branch from 4c4fa04 to a54b3de Compare March 6, 2020 05:36

dalg24 reviewed Mar 6, 2020

aprokop force-pushed the halo_finder branch from 19a10b0 to f0da7f2 Compare March 9, 2020 20:01

aprokop force-pushed the halo_finder branch 2 times, most recently from afe552b to 6906cd4 Compare April 14, 2020 20:41

dalg24 reviewed Apr 16, 2020

aprokop force-pushed the halo_finder branch from a3f28a5 to 7be558a Compare April 20, 2020 23:31

aprokop marked this pull request as ready for review April 20, 2020 23:40

aprokop marked this pull request as draft April 20, 2020 23:43

aprokop force-pushed the halo_finder branch 2 times, most recently from f3f8a66 to 59f37ef Compare April 21, 2020 03:30

dalg24 reviewed May 5, 2020

aprokop added 9 commits May 5, 2020 13:13

Implement second version of halo exchange to contruct CC on the fly

722afd0

Move all halo related code from the driver to dedicated file

9d282ab

Explicitly mark code blocks under ECL license

15cc096

Migrate halo finder from benchmarks to examples

f0aa73e

Added comments about CC algorithm

22ea86f

The comments contain snippets of text from the original paper.

Decide on the backend to run at compile time

9804de8

Compute halos centers on device

5dcca1c

Add cmd line switches for printing info

42a2bf2

Re-add verify

c8b1dd3

aprokop force-pushed the halo_finder branch from 7d2ac96 to c8b1dd3 Compare May 5, 2020 17:13

Add README file for halo finder

bc856c8

Fix clang-tidy warnings

06666b0

dalg24 reviewed May 11, 2020

Do not explicitly set ArborX_Target in halo finder

6495634

aprokop force-pushed the halo_finder branch from af39daa to 6495634 Compare May 11, 2020 22:01

aprokop and others added 2 commits May 11, 2020 20:27

Address comments

b27a466

Co-authored-by: Damien L-G <dalg24@gmail.com>

Enable --verify on halo example

4000f0d

dalg24 approved these changes May 12, 2020

aprokop merged commit 281b77d into arborx:master May 12, 2020

aprokop deleted the halo_finder branch May 12, 2020 01:13

aprokop mentioned this pull request Oct 1, 2020

Friends-of-Friends Query #161

Closed
