-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement stackless tree traversal using escape index (ropes) #364
Conversation
f8f63c2
to
9b9f577
Compare
Summit results (master 8f1cc64 vs branch 9b9f577): Serial
Everything looks good except unsorted spatial traverses (which are about 5-15% slower). This is acceptable. OpenMP
Radius traverses 5-10% slower. Acceptable. Cuda
Construction time ~15% slower for larger sizes, this is expected. Spatial traverses are much faster, and the larger size the better the improvement. Some reach 2x speedup. |
f3c4c9b
to
bd3eff0
Compare
@dalg24 Should have addressed most (all?) of your comments. |
retest this please |
Summit results (master 4834bff vs branch bd3eff0) Serial
OpenMP
Cuda
|
So, with the latest patch, Serial and Cuda behave well. Serial is on par with master, Cuda is improving. But something is completely off with OpenMP. Not only did it not get back when using stack-based traversal, it became much worse with it. No clue why. |
OK. Something I learned. If I revert 2754cf1, the results are reasonable. I'm not sure what in that patch causes the issue. Summit results (master 4834bff vs branch bd3eff0 with 2754cf1 reverted): Serial
OpenMP
Cuda
|
I confirmed that the resulting hierarchy does not change with the patch. The only thing that changes slightly (independent of patch) is that some child nodes may be swapped with their neighbor. This happens when two primitives have the same Morton codes. As Another fun thing that I discovered. During one of the interactive sessions on Summit, running one executable produced different OpenMP timings depending on whether only OpenMP was run, or OpenMP together with other tests (i.e., after Serial). Seems to also depend on the node, as I was not always able to reproduce it. |
I'm stuck and not sure how to move forward here. Would appreciate any help. |
…construction and traversal
bb0524e
to
6499e4b
Compare
I rebased on master. I also implemented slightly different approach. Now, we allow two different node types: one with two children, and one with a left child and rope (this became possible due to #403). We choose node type depending on whether we are running Cuda, or host, as previous results indicated that rope seem to slow things down on the host. Each node now carries a tag, which we use to construct and traverse hierarchies differently. But the differences are very minute (except for the introduction of ropes based traversal), allowing most of the code to be shared. This is a bit unfinished at the moment, as there are at least couple things that require fixing:
There are couple reasons why I pushed this is a bit early. First, I wanted to run and post results from Summit. Second, I wanted @dalg24 to take a look and see if there is anything he completely disagrees with. My plan is to post the Summit results once that's completed, and work on finishing the tests I mentioned earlier. The core of this PR (assuming satisfactory Summit runs) is complete, from my perspective. |
Summit results: master (1d7ed60) vs stackless branch (6499e4b) Serial (complete parity)
OpenMP (complete parity)
Cuda (complete parity in construction and knn, yay! in spatial)
The only slowdown on Cuda is +7% for the smallest 10K filled mesh for non-sorted queries. Not an issue. Reran the HACC problem and confirmed the gain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
6499e4b
to
c5367b0
Compare
@dalg24 Three things:
The number of |
The previous version broke the abstraction layer for leaf nodes by explicitly setting permutation index to the right child inside tree construction.
c5367b0
to
81e198a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diff --git a/test/tstDetailsTreeConstruction.cpp b/test/tstDetailsTreeConstruction.cpp
index 670e02d..cfc6be0 100644
--- a/test/tstDetailsTreeConstruction.cpp
+++ b/test/tstDetailsTreeConstruction.cpp
@@ -162,10 +162,10 @@ BOOST_AUTO_TEST_CASE_TEMPLATE(example_tree_construction, DeviceType,
<< "L3"
<< "I4"
<< "L4"
+ << "L6"
<< "I5"
<< "I6"
<< "L5"
- << "L6"
<< "L7";
std::cout << "ref=" << ref.str() << "\n";
@@ -217,4 +217,6 @@ BOOST_AUTO_TEST_CASE_TEMPLATE(example_tree_construction, DeviceType,
std::cout << "sol=" << sol.str() << "\n";
BOOST_TEST(sol.str().compare(ref.str()) == 0);
+ BOOST_TEST(sol.str() == ref.str());
+ BOOST_TEST(sol.str() == ref.str(), tt::per_element());
}
<ArborX>/test/tstDetailsTreeConstruction.cpp:219: error: in "example_tree_construction<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check sol.str().compare(ref.str()) == 0 has failed [-3 != 0]
<ArborX>/test/tstDetailsTreeConstruction.cpp:220: error: in "example_tree_construction<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check sol.str() == ref.str() has failed [I0I3I1L0L1I2L2L3I4L4I5I6L5L6L7 != I0I3I1L0L1I2L2L3I4L4L6I5I6L5L7]
<ArborX>/test/tstDetailsTreeConstruction.cpp:221: error: in "example_tree_construction<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check sol.str() == ref.str() has failed
- mismatch at position 20: ['I' == 'L'] is false
- mismatch at position 21: ['5' == '6'] is false
- mismatch at position 23: ['6' == '5'] is false
- mismatch at position 24: ['L' == 'I'] is false
- mismatch at position 25: ['5' == '6'] is false
- mismatch at position 27: ['6' == '5'] is false
*** 3 failures are detected in the test module "Master Test Suite"
test/tstDetailsTreeConstruction.cpp
Outdated
std::ostringstream sol; | ||
traverseRecursive(root, sol); | ||
std::cout << "sol=" << sol.str() << "\n"; | ||
BOOST_TEST(sol.str().compare(ref.str()) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOOST_TEST(sol.str().compare(ref.str()) == 0); | |
BOOST_TEST(sol.str() == ref.str()); |
test/tstDetailsTreeConstruction.cpp
Outdated
traverse(leaf_nodes, internal_nodes, root, sol); | ||
std::cout << "sol(node_with_left_child_and_rope) = " << sol.str() << "\n"; | ||
|
||
BOOST_TEST(sol.str().compare(ref.str()) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOOST_TEST(sol.str().compare(ref.str()) == 0); | |
BOOST_TEST(sol.str() == ref.str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can merge after you fixed
Co-authored-by: Damien L-G <dalg24@gmail.com>
6f2bdd8
to
61d3e06
Compare
Merging with the standard HIP failure. |
Here's a figure to help understanding what ropes are: