[WIP] Externalize buffer in BVH callback benchmark #416

Closed
wants to merge 7 commits into from

Conversation

masterleinad
Collaborator

This pull request is meant to explore how much externalizing the buffer logic really costs (somewhat based on #412). Initial tests on my laptop look at least reasonable:

----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------
BM_radius_search<ArborX::BVH<Serial>>/50000/20000/10/1/0/0/0/manual_time               58428 us        58430 us           11
BM_radius_callback_search<ArborX::BVH<Serial>>/50000/20000/10/1/0/0/0/manual_time      60663 us        60706 us           11
BM_radius_search<ArborX::BVH<OpenMP>>/50000/20000/10/1/0/0/0/manual_time               13582 us        13371 us           42
BM_radius_callback_search<ArborX::BVH<OpenMP>>/50000/20000/10/1/0/0/0/manual_time      14341 us        14196 us           49

The callbacks simply count the number of found neighbors.
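A minimal sketch of what such a counting callback might look like, written in plain C++ rather than Kokkos. The names `CountingCallback`, `radius_search_1d` and `count_neighbors`, as well as the `(query_index, primitive_index)` call signature, are assumptions made for illustration and are not ArborX's actual callback interface; a brute-force 1D search stands in for the BVH traversal.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical counting callback: one counter per query, bumped per match.
// The call signature is an assumption, not ArborX's real callback interface.
struct CountingCallback
{
  std::vector<int> &counts;

  void operator()(std::size_t query_index,
                  std::size_t /*primitive_index*/) const
  {
    ++counts[query_index];
  }
};

// Brute-force 1D stand-in for a radius search that reports every match
// through the callback instead of storing it.
template <typename Callback>
void radius_search_1d(std::vector<double> const &points,
                      std::vector<double> const &queries, double radius,
                      Callback const &callback)
{
  for (std::size_t q = 0; q < queries.size(); ++q)
    for (std::size_t p = 0; p < points.size(); ++p)
    {
      double const d = points[p] - queries[q];
      if (d * d <= radius * radius)
        callback(q, p);
    }
}

// Convenience wrapper returning the neighbor count for every query.
std::vector<int> count_neighbors(std::vector<double> const &points,
                                 std::vector<double> const &queries,
                                 double radius)
{
  std::vector<int> counts(queries.size(), 0);
  radius_search_1d(points, queries, radius, CountingCallback{counts});
  return counts;
}
```

Because the callback only increments a counter, no output buffer for the matches themselves is needed in this sketch.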
@aprokop aprokop added the refactoring Code reorganization label Oct 22, 2020
@masterleinad
Collaborator Author

Current benchmark results:

BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean                      5675 us       5676 us        123
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean             5695 us       5707 us        123
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean                   64294 us      64289 us         11
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean          64503 us      64564 us         11
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean                672144 us     672068 us          1
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean       673945 us     674476 us          1
BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean                      2634 us       2635 us        266
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean             2721 us       2733 us        257
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean                   16323 us      16323 us         43
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean          17222 us      17289 us         41
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean                119133 us     119121 us          6
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean       129884 us     130492 us          5
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean                       450 us        451 us       1563
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean              607 us        659 us       1155
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean                    2090 us       2091 us        334
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean           2292 us       2348 us        305
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean                 18719 us      18046 us         37
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean        19560 us      18977 us         36
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean                       360 us        362 us       1933
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean              500 us        552 us       1000
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean                    1253 us        936 us        557
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean           1410 us       1238 us        496
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean                  7850 us       3255 us         89
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean         8534 us       4694 us         82
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean                       923 us       1005 us        756
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean             1012 us       1287 us        691
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean                    3013 us       3448 us        235
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean           3199 us       4283 us        221
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean                 17018 us      17703 us         42
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean        24656 us      27757 us         29
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean                       855 us        937 us        818
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean              932 us       1209 us        750
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean                    2469 us       2865 us        292
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean           2947 us       3844 us        240
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean                  7791 us       8445 us         95
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean        10670 us      13931 us         67

I will look into where the discrepancy for large sizes in CUDA comes from.

@masterleinad
Collaborator Author

For BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3, we basically have

===================
|-> 3.95e-01 sec 12.7% 34.1% 0.0% 0.2% 5.55e+02 73 second_pass [region]
|   |-> 3.94e-01 sec 12.6% 34.2% 0.0% 8.1% 5.56e+02 73 ArborX::BVH::query [region]
|       |-> 2.35e-01 sec 7.5% 3.2% 0.0% 96.8% 6.22e+02 73 ArborX::BVH::query::compute_permutation [region]
|       |   |-> 4.08e-03 sec 0.1% 100.0% 0.0% ------ 73 ArborX::BatchedQueries::assign_morton_codes_to_queries [for]
|       |   |-> 3.39e-03 sec 0.1% 100.0% 0.0% ------ 73 ArborX::Algorithms::iota [for]
|       |-> 1.27e-01 sec 4.1% 100.0% 0.0% ------ 73 ArborX::TreeTraversal::spatial [for]
|-> 3.34e-01 sec 10.7% 22.4% 0.0% 0.3% 6.55e+02 73 first_pass [region]
|   |-> 3.33e-01 sec 10.7% 22.5% 0.0% 9.5% 6.57e+02 73 ArborX::BVH::query [region]
|       |-> 2.34e-01 sec 7.5% 3.1% 0.0% 96.9% 6.25e+02 73 ArborX::BVH::query::compute_permutation [region]
|       |   |-> 3.90e-03 sec 0.1% 100.0% 0.0% ------ 73 ArborX::BatchedQueries::assign_morton_codes_to_queries [for]
|       |   |-> 3.40e-03 sec 0.1% 100.0% 0.0% ------ 73 ArborX::Algorithms::iota [for]
|       |-> 6.77e-02 sec 2.2% 100.0% 0.0% ------ 73 ArborX::TreeTraversal::spatial [for]
|-> 8.30e-02 sec 2.7% 32.4% 0.0% 71.4% 4.40e+03 73 intermediate [region]
|   |-> 8.75e-03 sec 0.3% 100.0% 0.0% ------ 73 scan_counts [scan]
|   |-> 7.24e-03 sec 0.2% 100.0% 0.0% ------ 73 Kokkos::View::initialization [for]
|   |-> 4.68e-03 sec 0.2% 168.8% 0.0% ------ 73 "counts"="Scalar" [copy]
|   |   |-> 3.22e-03 sec 0.1% 100.0% 0.0% ------ 73 Kokkos::ViewFill-1D [for]
|-> 1.96e-02 sec 0.6% 100.0% 0.0% ------ 225 Kokkos::View::initialization [for]
|-> 1.45e-02 sec 0.5% 21.9% 0.0% 0.7% 1.24e+03 3 ArborX::BVH::BVH [region]
|   |-> 6.47e-03 sec 0.2% 2.2% 0.0% 97.8% 4.63e+02 3 ArborX::BVH::BVH::sort_morton_codes [region]
|   |-> 5.52e-03 sec 0.2% 45.8% 0.0% 56.8% 1.63e+03 3 ArborX::BVH::BVH::generate_hierarchy [region]
|-> 4.71e-03 sec 0.2% 100.0% 0.0% ------ 6 "random_points"="random_points_mirror" [copy]
|-> 3.89e-03 sec 0.1% 100.0% 0.0% ------ 18 Kokkos::View::destruction [for]
|-> 3.26e-03 sec 0.1% 100.0% 0.0% ------ 73 initialize_offsets [for]

for the callback version (with external buffers) and

===================
|-> 7.87e-01 sec 26.8% 25.9% 0.0% 9.6% 1.31e+03 86 ArborX::BVH::query::spatial [region]
|   |-> 3.70e-01 sec 12.6% 50.0% 0.0% 15.9% 1.86e+03 86 ArborX::BufferOptimization::two_pass [region]
|   |   |-> 1.83e-01 sec 6.2% 45.5% 0.0% 54.5% 9.37e+02 86 ArborX::BufferOptimization::two_pass:second_pass [region]
|   |   |   |-> 7.38e-02 sec 2.5% 100.0% 0.0% ------ 86 ArborX::TreeTraversal::spatial [for]
|   |   |   |-> 9.60e-03 sec 0.3% 100.0% 0.0% ------ 86 ArborX::BufferOptimization::copy_offsets_to_counts [for]
|   |   |-> 6.67e-02 sec 2.3% 62.4% 0.0% 37.6% 5.16e+03 86 ArborX::BufferOptimization::first_pass_postprocess [region]
|   |   |   |-> 2.00e-02 sec 0.7% 100.0% 0.0% ------ 86 "offset_mirror"="offset" [copy]
|   |   |   |-> 9.66e-03 sec 0.3% 100.0% 0.0% ------ 86 ArborX::Algorithms::exclusive_scan [scan]
|   |   |   |-> 6.12e-03 sec 0.2% 100.0% 0.0% ------ 86 Kokkos::View::initialization [for]
|   |   |   |-> 5.85e-03 sec 0.2% 100.0% 0.0% ------ 86 ArborX::BufferOptimization::copy_counts_to_offsets [for]
|   |   |-> 5.67e-02 sec 1.9% 97.8% 0.0% 2.2% 1.52e+03 86 ArborX::BufferOptimization::two_pass::first_pass [region]
|   |   |   |-> 5.54e-02 sec 1.9% 100.0% 0.0% ------ 86 ArborX::TreeTraversal::spatial [for]
|   |   |-> 4.67e-03 sec 0.2% 100.0% 0.0% ------ 86 Kokkos::View::initialization [for]
|   |-> 2.77e-01 sec 9.4% 3.1% 0.0% 96.9% 6.21e+02 86 ArborX::BVH::query::spatial::compute_permutation [region]
|   |   |-> 4.64e-03 sec 0.2% 100.0% 0.0% ------ 86 ArborX::BatchedQueries::assign_morton_codes_to_queries [for]
|   |   |-> 4.08e-03 sec 0.1% 100.0% 0.0% ------ 86 ArborX::Algorithms::iota [for]
|   |-> 6.37e-02 sec 2.2% 15.3% 0.0% 91.0% 2.70e+03 86 ArborX::BVH::query::spatial::init_and_alloc [region]
|       |-> 5.75e-03 sec 0.2% 170.1% 0.0% ------ 86 "offset"="Scalar" [copy]
|           |-> 4.03e-03 sec 0.1% 100.0% 0.0% ------ 86 Kokkos::ViewFill-1D [for]
|-> 1.58e-02 sec 0.5% 20.8% 0.0% 1.0% 1.14e+03 3 ArborX::BVH::BVH [region]
|   |-> 6.55e-03 sec 0.2% 2.2% 0.0% 97.8% 4.58e+02 3 ArborX::BVH::BVH::sort_morton_codes [region]
|   |-> 5.51e-03 sec 0.2% 45.7% 0.0% 56.8% 1.63e+03 3 ArborX::BVH::BVH::generate_hierarchy [region]
|   |-> 3.02e-03 sec 0.1% 5.7% 0.0% 94.3% 9.95e+02 3 ArborX::BVH::BVH::assign_morton_codes [region]
|-> 9.19e-03 sec 0.3% 100.0% 0.0% ------ 6 Kokkos::View::initialization [for]
|-> 4.68e-03 sec 0.2% 100.0% 0.0% ------ 6 "random_points"="random_points_mirror" [copy]
|-> 4.39e-03 sec 0.1% 100.0% 0.0% ------ 18 Kokkos::View::destruction [for]

for the regular version (with internal buffers). So for this problem ArborX::BVH::query::compute_permutation is pretty expensive such that it makes a difference that we call it twice for the callback version.
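The callback profile shows two full calls to ArborX::BVH::query (first_pass and second_pass) with an intermediate scan between them. That structure can be sketched in plain C++ as follows. This is an illustration under assumptions, not the code in this pull request: `traverse` and `two_pass_search` are hypothetical names, and a brute-force 1D loop stands in for the tree traversal. In the benchmark, each of the two query calls also recomputes the permutation, which the sketch does not model.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Stand-in for one tree traversal: calls visit(q, p) for every point p
// within `radius` of query q (brute force, 1D, for illustration only).
template <typename Visit>
void traverse(std::vector<double> const &points,
              std::vector<double> const &queries, double radius, Visit visit)
{
  for (std::size_t q = 0; q < queries.size(); ++q)
    for (std::size_t p = 0; p < points.size(); ++p)
    {
      double const d = points[p] - queries[q];
      if (d * d <= radius * radius)
        visit(q, p);
    }
}

// Two-pass search with the buffer handled outside the traversal.
// Returns (offsets, indices) in compressed-row form.
std::pair<std::vector<int>, std::vector<int>>
two_pass_search(std::vector<double> const &points,
                std::vector<double> const &queries, double radius)
{
  std::size_t const n = queries.size();

  // first_pass: only count the matches per query.
  std::vector<int> counts(n, 0);
  traverse(points, queries, radius,
           [&](std::size_t q, std::size_t) { ++counts[q]; });

  // intermediate: an exclusive scan of the counts gives the offsets,
  // with the total number of matches in the last entry.
  std::vector<int> offsets(n + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);

  // second_pass: traverse again and store the matches at their offsets.
  std::vector<int> indices(offsets[n]);
  std::vector<int> cursor(offsets.begin(), offsets.end() - 1);
  traverse(points, queries, radius, [&](std::size_t q, std::size_t p) {
    indices[cursor[q]++] = static_cast<int>(p);
  });

  return {offsets, indices};
}
```

Any setup work performed inside each traversal call is paid twice in this layout, which is the effect visible in the profile above.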

@aprokop
Contributor

aprokop commented Oct 23, 2020

So for this problem ArborX::BVH::query::compute_permutation is pretty expensive such that it makes a difference that we call it twice for the callback version.

I did not think of that. It is certainly clear to me that the permutation computation should be exposed to the user or, at the least, to the outer part of ArborX where this buffer optimization would reside. We have several use cases where it would also make sense for a user to have that information, e.g., self-collision problems. We could talk about the best way to achieve this.
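One way to picture exposing the permutation is to compute it once outside the traversal and hand it to both passes. The sketch below is plain C++ under stated assumptions: `compute_permutation`, `traverse_permuted`, `SearchResult` and `search_with_shared_permutation` are hypothetical names, a 1D sort stands in for the Morton-code ordering, and none of this is a proposed ArborX interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for the query ordering step: sort query indices by coordinate.
// (The profile shows Morton codes being assigned to queries; a 1D sort is
// used here only to keep the sketch short.)
std::vector<std::size_t> compute_permutation(std::vector<double> const &queries)
{
  std::vector<std::size_t> permutation(queries.size());
  std::iota(permutation.begin(), permutation.end(), std::size_t{0});
  std::sort(permutation.begin(), permutation.end(),
            [&](std::size_t a, std::size_t b) { return queries[a] < queries[b]; });
  return permutation;
}

// Hypothetical traversal that takes a precomputed permutation instead of
// recomputing it; visits queries in permuted order.
template <typename Visit>
void traverse_permuted(std::vector<double> const &points,
                       std::vector<double> const &queries,
                       std::vector<std::size_t> const &permutation,
                       double radius, Visit visit)
{
  for (std::size_t q : permutation)
    for (std::size_t p = 0; p < points.size(); ++p)
    {
      double const d = points[p] - queries[q];
      if (d * d <= radius * radius)
        visit(q, p);
    }
}

struct SearchResult
{
  std::vector<int> offsets;
  std::vector<int> indices;
};

// Count, scan, fill: both passes share one permutation, so the ordering
// cost is paid once instead of twice.
SearchResult search_with_shared_permutation(std::vector<double> const &points,
                                            std::vector<double> const &queries,
                                            double radius)
{
  std::vector<std::size_t> const permutation = compute_permutation(queries);

  std::vector<int> counts(queries.size(), 0);
  traverse_permuted(points, queries, permutation, radius,
                    [&](std::size_t q, std::size_t) { ++counts[q]; });

  SearchResult result;
  result.offsets.assign(queries.size() + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), result.offsets.begin() + 1);

  result.indices.resize(result.offsets.back());
  std::vector<int> cursor(result.offsets.begin(), result.offsets.end() - 1);
  traverse_permuted(points, queries, permutation, radius,
                    [&](std::size_t q, std::size_t p) {
                      result.indices[cursor[q]++] = static_cast<int>(p);
                    });
  return result;
}
```

A caller that already has the permutation, for instance in a self-collision setting, could pass it to `traverse_permuted` directly and skip `compute_permutation` altogether.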

@masterleinad
Collaborator Author

Results for the case where we are not sorting:

------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                    Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/0/0/1/3/manual_time       3625 us         7354 us          186

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 4.96647 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 3.98e-01 sec 8.0% 98.0% 0.0% 1.0% 7.46e+02 297 second_pass [region]
|   |-> 3.94e-01 sec 7.9% 98.9% 0.0% 1.1% 7.54e+02 297 ArborX::BVH::query [region]
|       |-> 3.90e-01 sec 7.9% 100.0% 0.0% ------ 297 ArborX::TreeTraversal::spatial [for]
|-> 3.69e-01 sec 7.4% 97.8% 0.0% 1.1% 8.05e+02 297 first_pass [region]
|   |-> 3.65e-01 sec 7.4% 98.8% 0.0% 1.2% 8.14e+02 297 ArborX::BVH::query [region]
|       |-> 3.61e-01 sec 7.3% 100.0% 0.0% ------ 297 ArborX::TreeTraversal::spatial [for]
|-> 3.24e-01 sec 6.5% 30.6% 0.0% 73.1% 4.59e+03 297 intermediate [region]
|   |-> 3.17e-02 sec 0.6% 100.0% 0.0% ------ 297 scan_counts [scan]
|   |-> 2.66e-02 sec 0.5% 100.0% 0.0% ------ 297 Kokkos::View::initialization [for]
|   |-> 1.78e-02 sec 0.4% 167.0% 0.0% ------ 297 "counts"="Scalar" [copy]
|   |   |-> 1.20e-02 sec 0.2% 100.0% 0.0% ------ 297 Kokkos::ViewFill-1D [for]
|   |-> 1.10e-02 sec 0.2% 100.0% 0.0% ------ 297 reduce_counts [reduce]
|-> 5.62e-02 sec 1.1% 100.0% 0.0% ------ 899 Kokkos::View::initialization [for]
|-> 1.91e-02 sec 0.4% 21.4% 0.0% 0.7% 1.26e+03 4 ArborX::BVH::BVH [region]
|   |-> 8.57e-03 sec 0.2% 2.2% 0.0% 97.8% 4.66e+02 4 ArborX::BVH::BVH::sort_morton_codes [region]
|   |-> 7.21e-03 sec 0.1% 45.1% 0.0% 57.5% 1.66e+03 4 ArborX::BVH::BVH::generate_hierarchy [region]
|-> 1.21e-02 sec 0.2% 100.0% 0.0% ------ 297 initialize_offsets [for]
|-> 6.15e-03 sec 0.1% 100.0% 0.0% ------ 8 "random_points"="random_points_mirror" [copy]
|-> 5.45e-03 sec 0.1% 100.0% 0.0% ------ 24 Kokkos::View::destruction [for]

and

---------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/0/0/1/3/manual_time       5783 us         6607 us          106

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 2.84956 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 6.84e-01 sec 24.0% 48.7% 0.0% 8.1% 1.71e+03 117 ArborX::BVH::query::spatial [region]
|   |-> 5.42e-01 sec 19.0% 59.2% 0.0% 14.7% 1.73e+03 117 ArborX::BufferOptimization::two_pass [region]
|   |   |-> 2.83e-01 sec 9.9% 56.2% 0.0% 43.8% 8.27e+02 117 ArborX::BufferOptimization::two_pass:second_pass [region]
|   |   |   |-> 1.53e-01 sec 5.4% 100.0% 0.0% ------ 117 ArborX::TreeTraversal::spatial [for]
|   |   |   |-> 6.37e-03 sec 0.2% 100.0% 0.0% ------ 117 ArborX::BufferOptimization::copy_offsets_to_counts [for]
|   |   |-> 1.32e-01 sec 4.6% 98.7% 0.0% 1.3% 8.88e+02 117 ArborX::BufferOptimization::two_pass::first_pass [region]
|   |   |   |-> 1.30e-01 sec 4.6% 100.0% 0.0% ------ 117 ArborX::TreeTraversal::spatial [for]
|   |   |-> 4.12e-02 sec 1.4% 61.7% 0.0% 38.3% 1.14e+04 117 ArborX::BufferOptimization::first_pass_postprocess [region]
|   |   |   |-> 1.21e-02 sec 0.4% 100.0% 0.0% ------ 117 ArborX::Algorithms::exclusive_scan [scan]
|   |   |   |-> 5.84e-03 sec 0.2% 100.0% 0.0% ------ 117 "offset_mirror"="offset" [copy]
|   |   |   |-> 4.91e-03 sec 0.2% 100.0% 0.0% ------ 117 ArborX::BufferOptimization::copy_counts_to_offsets [for]
|   |   |-> 6.01e-03 sec 0.2% 100.0% 0.0% ------ 117 Kokkos::View::initialization [for]
|   |-> 8.69e-02 sec 3.0% 14.4% 0.0% 91.5% 2.69e+03 117 ArborX::BVH::query::spatial::init_and_alloc [region]
|       |-> 7.41e-03 sec 0.3% 168.8% 0.0% ------ 117 "offset"="Scalar" [copy]
|           |-> 5.10e-03 sec 0.2% 100.0% 0.0% ------ 117 Kokkos::ViewFill-1D [for]
|-> 1.45e-02 sec 0.5% 21.8% 0.0% 0.7% 1.24e+03 3 ArborX::BVH::BVH [region]
|   |-> 6.52e-03 sec 0.2% 2.2% 0.0% 97.8% 4.60e+02 3 ArborX::BVH::BVH::sort_morton_codes [region]
|   |-> 5.50e-03 sec 0.2% 45.8% 0.0% 56.7% 1.64e+03 3 ArborX::BVH::BVH::generate_hierarchy [region]
|-> 7.90e-03 sec 0.3% 100.0% 0.0% ------ 6 Kokkos::View::initialization [for]
|-> 4.59e-03 sec 0.2% 100.0% 0.0% ------ 6 "random_points"="random_points_mirror" [copy]
|-> 3.90e-03 sec 0.1% 100.0% 0.0% ------ 18 Kokkos::View::destruction [for]

I am not quite sure that I understand the huge discrepancy between Time and CPU time for the benchmark case, though.

@aprokop
Contributor

aprokop commented Oct 26, 2020

Results for the case where we are not sorting:

It's really hard to compare these two, as they run a different number of times (186 vs 106). The times will have to be normalized first for comparison.

I am not quite sure that I understand the huge discrepancy between Time and CPU time for the benchmark case, though.

Not sure, never saw this large of a discrepancy.

@masterleinad
Collaborator Author

It's really hard to compare these two, as they run a different number of times (186 vs 106). The times will have to be normalized first for comparison.

My understanding was always that these times are normalized already.

@aprokop
Contributor

aprokop commented Oct 26, 2020

My understanding was always that these times are normalized already.

In benchmark timers, sure. But not in Kokkos profiling, right? I think the latter is simply cumulative.

@masterleinad
Collaborator Author

I still need to understand this some more but the results with presorting queries are:

BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean                      5500 us       5670 us        127
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean             5418 us       5605 us        129
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean                   62850 us      64294 us         11
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean          62329 us      63882 us         11
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean                655820 us     671243 us          1
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean       652267 us     668696 us          1
BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean                      2450 us       2618 us        286
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean             2404 us       2589 us        291
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean                   14640 us      16047 us         48
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean          14004 us      15521 us         50
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean                 99394 us     114201 us          7
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean        91963 us     107691 us          8
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean                       457 us       1017 us       1532
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean              407 us       1190 us       1716
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean                    1930 us       2522 us        363
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean           1839 us       2669 us        380
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean                 17078 us      17439 us         41
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean        16568 us      17203 us         42
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean                       376 us        935 us       1867
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean              327 us       1110 us       2133
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean                    1077 us       1463 us        648
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean            965 us       1656 us        726
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean                  6255 us       2693 us        112
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean         5528 us       2910 us        126
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean                       599 us       1075 us       1170
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean              232 us       1014 us       3021
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean                    1844 us       4506 us        404
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean           1198 us       4556 us        591
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean                 11384 us      16728 us         62
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean         5036 us      17244 us        140
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean                       530 us       1008 us       1319
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean              196 us        939 us       3567
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean                    1226 us       3874 us        594
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean            579 us       3457 us       1226
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean                  4088 us       9536 us        178
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean         1630 us      10722 us        432

So the Time (first column) is always better with the callback approach and the CPU time is comparable.

@masterleinad
Collaborator Author

Results with atomic_fetch_add for the callbacks:

BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean                      5510 us       5680 us        127
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/0/2/manual_time_mean             5511 us       5697 us        127
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean                   62768 us      64203 us         11
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/0/2/manual_time_mean          63162 us      64735 us         11
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean                654835 us     670292 us          1
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/0/2/manual_time_mean       656469 us     673018 us          1
BM_radius_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean                      2448 us       2617 us        286
BM_radius_callback_search<ArborX::BVH<Serial>>/1000/1000/10/1/0/1/3/manual_time_mean             2456 us       2641 us        285
BM_radius_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean                   14589 us      15996 us         48
BM_radius_callback_search<ArborX::BVH<Serial>>/10000/10000/10/1/0/1/3/manual_time_mean          14755 us      16292 us         48
BM_radius_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean                 99452 us     114232 us          7
BM_radius_callback_search<ArborX::BVH<Serial>>/100000/100000/10/1/0/1/3/manual_time_mean        99565 us     115430 us          7
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean                       457 us       1015 us       1528
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/0/2/manual_time_mean              412 us       1195 us       1695
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean                    1944 us       2535 us        361
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/0/2/manual_time_mean           1894 us       2722 us        369
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean                 17168 us      17509 us         41
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/0/2/manual_time_mean        16966 us      17604 us         41
BM_radius_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean                       376 us        936 us       1858
BM_radius_callback_search<ArborX::BVH<OpenMP>>/1000/1000/10/1/0/1/3/manual_time_mean              331 us       1115 us       2121
BM_radius_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean                    1084 us       1467 us        641
BM_radius_callback_search<ArborX::BVH<OpenMP>>/10000/10000/10/1/0/1/3/manual_time_mean           1049 us       1656 us        674
BM_radius_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean                  6307 us       2696 us        111
BM_radius_callback_search<ArborX::BVH<OpenMP>>/100000/100000/10/1/0/1/3/manual_time_mean         6244 us       2903 us        112
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean                       600 us       1079 us       1169
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/0/2/manual_time_mean              253 us       1049 us       2763
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean                    1838 us       4514 us        403
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/0/2/manual_time_mean           1207 us       4596 us        585
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean                 11363 us      16761 us         62
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/0/2/manual_time_mean         5099 us      18761 us        139
BM_radius_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean                       534 us       1013 us       1310
BM_radius_callback_search<ArborX::BVH<Cuda>>/10000/10000/10/1/0/1/3/manual_time_mean              215 us        974 us       3263
BM_radius_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean                    1241 us       3884 us        552
BM_radius_callback_search<ArborX::BVH<Cuda>>/100000/100000/10/1/0/1/3/manual_time_mean            602 us       3493 us       1174
BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean                  4075 us       9554 us        147
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time_mean         1659 us      10816 us        427

not much of a difference.
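For reference, an atomic variant of a counting callback could look like the sketch below. It uses `std::atomic<int>::fetch_add` from the C++ standard library as a stand-in for the `atomic_fetch_add` used in the benchmark; the names `AtomicCountingCallback` and `count_neighbors_atomic` and the callback signature are assumptions for illustration, and the driver loop is serial and brute force.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical counting callback with an atomic increment, so that
// concurrent calls hitting the same counter would not lose updates.
struct AtomicCountingCallback
{
  std::vector<std::atomic<int>> &counts;

  void operator()(std::size_t query_index,
                  std::size_t /*primitive_index*/) const
  {
    counts[query_index].fetch_add(1, std::memory_order_relaxed);
  }
};

// Serial brute-force 1D driver, for illustration only.
std::vector<int> count_neighbors_atomic(std::vector<double> const &points,
                                        std::vector<double> const &queries,
                                        double radius)
{
  std::vector<std::atomic<int>> counts(queries.size());
  for (auto &c : counts)
    c.store(0);

  AtomicCountingCallback const callback{counts};
  for (std::size_t q = 0; q < queries.size(); ++q)
    for (std::size_t p = 0; p < points.size(); ++p)
    {
      double const d = points[p] - queries[q];
      if (d * d <= radius * radius)
        callback(q, p);
    }

  // Copy the atomic counters into a plain vector for the caller.
  std::vector<int> result;
  result.reserve(counts.size());
  for (auto const &c : counts)
    result.push_back(c.load());
  return result;
}
```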

@masterleinad
Collaborator Author

Looking at the Kokkos profiling output for the non-callback version

BM_radius_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time       4734 us        10570 u
  
  BEGIN KOKKOS PROFILING REPORT:
  TOTAL TIME: 314.708 seconds
  TOP-DOWN TIME TREE:
  <average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <ke
  ===================
  |-> 1.36e+02 sec 43.3% 37.3% 0.0% 10.1% 2.11e+03 28842 ArborX::BVH::query::spatial [region]
  |   |-> 1.01e+02 sec 32.0% 47.4% 0.0% 18.7% 2.29e+03 28842 ArborX::BufferOptimization::two_pass [region
  |   |   |-> 5.32e+01 sec 16.9% 44.1% 0.0% 55.9% 1.08e+03 28842 ArborX::BufferOptimization::two_pass:sec
  |   |   |   |-> 2.18e+01 sec 6.9% 100.0% 0.0% ------ 28842 ArborX::TreeTraversal::spatial [for]
  |   |   |   |-> 1.65e+00 sec 0.5% 100.0% 0.0% ------ 28842 ArborX::BufferOptimization::copy_offsets_to_
  |   |   |-> 1.62e+01 sec 5.2% 97.4% 0.0% 2.6% 1.77e+03 28842 ArborX::BufferOptimization::two_pass::firs
  |   |   |   |-> 1.58e+01 sec 5.0% 100.0% 0.0% ------ 28842 ArborX::TreeTraversal::spatial [for]
  |   |   |-> 1.07e+01 sec 3.4% 63.4% 0.0% 36.6% 1.08e+04 28842 ArborX::BufferOptimization::first_pass_po
  |   |   |   |-> 3.27e+00 sec 1.0% 100.0% 0.0% ------ 28842 ArborX::Algorithms::exclusive_scan [scan]
  |   |   |   |-> 1.46e+00 sec 0.5% 100.0% 0.0% ------ 28842 "offset_mirror"="offset" [copy]
  |   |   |   |-> 1.28e+00 sec 0.4% 100.0% 0.0% ------ 28842 ArborX::BufferOptimization::copy_counts_to_o
  |   |   |   |-> 7.91e-01 sec 0.3% 100.0% 0.0% ------ 28842 Kokkos::View::initialization [for]
  |   |   |-> 1.58e+00 sec 0.5% 100.0% 0.0% ------ 28842 Kokkos::View::initialization [for]
  |   |-> 2.20e+01 sec 7.0% 14.8% 0.0% 91.3% 2.62e+03 28842 ArborX::BVH::query::spatial::init_and_alloc [
  |       |-> 1.92e+00 sec 0.6% 169.7% 0.0% ------ 28842 "offset"="Scalar" [copy]
  |           |-> 1.34e+00 sec 0.4% 100.0% 0.0% ------ 28842 Kokkos::ViewFill-1D [for]
  |-> 2.45e+00 sec 0.8% 100.0% 0.0% ------ 28842 ArborX::BatchedQueries::permute_entries [for]
  |-> 1.61e+00 sec 0.5% 100.0% 0.0% ------ 28842 ArborX::BatchedQueries::assign_morton_codes_to_queries [
  |-> 1.37e+00 sec 0.4% 100.0% 0.0% ------ 28842 ArborX::Algorithms::iota [for]
  |-> 3.91e-01 sec 0.1% 100.0% 0.0% ------ 28878 Kokkos::View::destruction [for]

and the callback version

Benchmark                                                                                    Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------
BM_radius_callback_search<ArborX::BVH<Cuda>>/1000000/1000000/10/1/0/1/3/manual_time       2608 us        11543 us        32336

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 514.789 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 4.97e+01 sec 9.7% 31.7% 0.0% 72.1% 4.37e+03 43447 intermediate [region]
|   |-> 5.18e+00 sec 1.0% 100.0% 0.0% ------ 43447 scan_counts [scan]
|   |-> 4.27e+00 sec 0.8% 100.0% 0.0% ------ 43447 Kokkos::View::initialization [for]
|   |-> 2.76e+00 sec 0.5% 168.9% 0.0% ------ 43447 "counts"="Scalar" [copy]
|   |   |-> 1.90e+00 sec 0.4% 100.0% 0.0% ------ 43447 Kokkos::ViewFill-1D [for]
|   |-> 1.66e+00 sec 0.3% 100.0% 0.0% ------ 43447 reduce_counts [reduce]
|-> 3.43e+01 sec 6.7% 96.5% 0.0% 1.7% 1.27e+03 43447 second_pass [region]
|   |-> 3.37e+01 sec 6.5% 98.2% 0.0% 1.8% 1.29e+03 43447 ArborX::BVH::query [region]
|       |-> 3.31e+01 sec 6.4% 100.0% 0.0% ------ 43447 ArborX::TreeTraversal::spatial [for]
|-> 2.83e+01 sec 5.5% 95.8% 0.0% 2.0% 1.54e+03 43447 first_pass [region]
|   |-> 2.77e+01 sec 5.4% 97.8% 0.0% 2.2% 1.57e+03 43447 ArborX::BVH::query [region]
|       |-> 2.71e+01 sec 5.3% 100.0% 0.0% ------ 43447 ArborX::TreeTraversal::spatial [for]
|-> 6.88e+00 sec 1.3% 100.0% 0.0% ------ 130353 Kokkos::View::initialization [for]
|-> 4.98e+00 sec 1.0% 100.0% 0.0% ------ 43447 ArborX::BatchedQueries::permute_entries [for]
|-> 3.43e+00 sec 0.7% 100.0% 0.0% ------ 43447 assign_sorted_queries [for]
|-> 2.25e+00 sec 0.4% 100.0% 0.0% ------ 43447 ArborX::BatchedQueries::assign_morton_codes_to_queries [for]
|-> 1.98e+00 sec 0.4% 100.0% 0.0% ------ 43447 ArborX::Algorithms::iota [for]
|-> 1.94e+00 sec 0.4% 100.0% 0.0% ------ 43447 initialize_offsets [for]
|-> 5.85e-01 sec 0.1% 100.0% 0.0% ------ 43483 Kokkos::View::destruction [for]

The average runtime reported by Kokkos (total time divided by the number of calls, i.e. 314.708 s / 28842 and 514.789 s / 43447) is 0.0109 s for the non-callback version and 0.0118 s for the callback version.

@aprokop aprokop mentioned this pull request Dec 3, 2020
@aprokop
Contributor

aprokop commented Dec 8, 2020

Closing this in favor of #425.

@aprokop aprokop closed this Dec 8, 2020