Improve compress for cuda-aware mpi #7707
Conversation
This looks great from my point of view. I also like that the import indices for the GPU got a separate function.
/rebuild
(It is probably a moot point to let the CI run, but let's check the non-cuda scenario for successful compilation anyway.)
I have a few comments below.
for (const auto &import_range : import_indices_data)
const unsigned int import_indices_plain_dev_size =
  import_indices_plain_dev.size();
for (unsigned int i = 0; i < import_indices_plain_dev_size; ++i)
It seems that you can still use a range-based for loop here and I expect clang-tidy
to complain if it sees this loop.
{
  const unsigned int import_indices_plain_dev_size =
    import_indices_plain_dev.size();
  for (unsigned int i = 0; i < import_indices_plain_dev_size; ++i)
see above
{
  const unsigned int import_indices_plain_dev_size =
    import_indices_plain_dev.size();
  for (unsigned int i = 0; i < import_indices_plain_dev_size; ++i)
see above
{
  const unsigned int import_indices_plain_dev_size =
    import_indices_plain_dev.size();
  for (unsigned int i = 0; i < import_indices_plain_dev_size; ++i)
see above
@@ -573,9 +591,9 @@ namespace LinearAlgebra

    template <typename Number>
    __global__ void
    add_permutated(Number * val,
    add_permutated(const size_type *indices,
Why doesn't this function have an IndexType
template parameter? Would it make sense for conformity?
Because it doesn't need to be templated. I only template the functions that have to be templated.
OK
I can push the range-based loop changes here or create a PR afterward if you prefer that.
That's fine, I'll do it.
Force-pushed 2607958 to eccbb86
@masterleinad done
Thanks!
Passes all
include/deal.II/base/partitioner.h (Outdated)
@@ -571,6 +571,13 @@ namespace Utilities
      << " elements for this partitioner.");

    private:
      /**
       * Initialize import_indices_plain_dev from import_indices_data. This
       * function is only used when CUDA-aware MPI.
Suggested change:
* function is only used when CUDA-aware MPI.
* function is only used when using CUDA-aware MPI.
Reduce the number of kernel launches in a way similar to what is done for update_ghost.
Force-pushed eccbb86 to 55026d6
This PR does two things:
- `cuda_kernel` for consistency (these functions are a couple of months old so we are free to change the API)
- reduce the number of kernel launches (similar to `update_ghost`) to speed up `compress`

cc: @dsambit