Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use lower bounds to avoid traversals in MST #631

Merged
merged 8 commits into from
Feb 17, 2022

Conversation

aprokop
Copy link
Contributor

@aprokop aprokop commented Feb 15, 2022

No description provided.

@aprokop aprokop added the performance Something is slower than it should be label Feb 15, 2022
if (radius < _lower_bounds(i - n + 1))
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to the version with the scan you showed me? I thought you said it was faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too. But I repeated experiments on Summit yesterday, and could not reproduce. Maybe I ran so many experiments that different results got confused in my head. So, the current patch does not actually seem to help on GPU. On CPU it's better than what I showed you for two reasons: a) we don't do parallel_scan, and b) more importantly, radius may have already been updated by another thread that converged, so it even more threads drop out.

@aprokop
Copy link
Contributor Author

aprokop commented Feb 16, 2022

I changed to using lower bound only for Serial. I investigated using it on Nvidia A100 (I used OACISS Saturn, where I get consistent timings; on Perlmutter, the timings from run to run are very inconsistent). The summary is:

  • If I use it inside operator() (same as the current version for Serial), it runs ~6+% slower
  • If I used pre-filtering, it runs ~2% faster

In short, there is no benefit on GPU, so I disabled it there.

On CPU (AMD EPYC 7763) in Serial it speeds up the MST construction by 25%.

@@ -482,6 +513,16 @@ struct MinimumSpanningTree
Kokkos::view_alloc(Kokkos::WithoutInitializing, "ArborX::MST::radii"),
n);

bool const use_lower_bounds =
(std::is_same<ExecutionSpace, Kokkos::Serial>{});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kokkos::Serial is undefined when the serial backend is not enabled. It is worrisome that the CI passed...

@aprokop
Copy link
Contributor Author

aprokop commented Feb 16, 2022

@dalg24 Did you mean to push seemingly unrelated stuff in example_callback.cpp?

@dalg24
Copy link
Contributor

dalg24 commented Feb 16, 2022

@dalg24 Did you mean to push seemingly unrelated stuff in example_callback.cpp?

No. Will fix.

@aprokop
Copy link
Contributor Author

aprokop commented Feb 16, 2022

This does not compile:

arborx/src/details/ArborX_MinimumSpanningTree.hpp(559): error:
The enclosing parent function ("doBoruvka") for an extended __host__ __device__
lambda cannot have private or protected access within its class

@aprokop
Copy link
Contributor Author

aprokop commented Feb 17, 2022

I was sure that having if (radius < lower_bound) inside traverse was fine for GPU with noop functor. But it's not. I may be hallucinating, but I rerun all commits in this PR and they all introduced 12% perf regression on A100.

So I did the final salute by just explicitly stripping the check completely out the functor by introducing tags, and dispatching based on the execution space.

I did not guard recomputing lower bounds for GPU, as it takes < 0.1%.

src/details/ArborX_MinimumSpanningTree.hpp Outdated Show resolved Hide resolved
src/details/ArborX_MinimumSpanningTree.hpp Outdated Show resolved Hide resolved
@aprokop aprokop merged commit dc9d59d into arborx:master Feb 17, 2022
@aprokop aprokop deleted the use_lower_bound_in_mst branch February 17, 2022 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Something is slower than it should be
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants