Replace MPI_Allgather by MPI_Exscan #8298
Conversation
I will write a changelog (one in
source/dofs/dof_renumbering.cc
Outdated
```diff
  // calculate shifts
- types::global_dof_index cumulated = 0;
+ unsigned int cumulated = 0;
```
why do you change this to an int? I assume it could overflow?
Sorry, my mistake from moving around the code. I will fix it.
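For context on the overflow concern: in deal.II builds with 64-bit indices enabled, `types::global_dof_index` is a 64-bit type, so accumulating per-rank counts in an `unsigned int` can overflow beyond roughly 4.3 billion DoFs. A minimal sketch of the shift computation under discussion; the names `compute_shifts` and `n_dofs_on_rank` are illustrative and not in the actual code:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for dealii::types::global_dof_index in 64-bit-index builds;
// this sketch is illustrative, not the deal.II implementation.
using global_dof_index = unsigned long long;

std::vector<global_dof_index>
compute_shifts(const std::vector<global_dof_index> &n_dofs_on_rank)
{
  std::vector<global_dof_index> shifts(n_dofs_on_rank.size());
  global_dof_index cumulated = 0; // an unsigned int here could overflow
  for (std::size_t r = 0; r < n_dofs_on_rank.size(); ++r)
    {
      shifts[r] = cumulated; // first global index owned by rank r
      cumulated += n_dofs_on_rank[r];
    }
  return shifts;
}
```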
Making a function call collective, especially one that doesn't look like a collective call (if it was called
What version of the standard is needed for MPI_Exscan?
I looked through the rest of the changes and things look good. 👍
I agree, introducing new functions that make the intent of the global communication clear and deprecating the old ones is a better idea. What about calling those new functions
Looks like
Our minimum requirement is MPI 2.0, right? So we probably need to do
Wait, @masterleinad can you please confirm?
@tjhei how exactly should we move on with deprecating the old functions? We essentially have three alternatives:
The reason why I mention the second and third option separately is that I would actually prefer the third over the second: we would silently stop filling the fields, so it is better to issue a compile error rather than a run-time error. I'd personally go for the first option.
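To make the first alternative concrete, a hedged sketch of what the declarations could look like, assuming deal.II's `DEAL_II_DEPRECATED` macro and `IndexSet` class; the exact signatures are illustrative, not the committed interface:

```cpp
// Option 1 sketch: the old accessor keeps working but warns at compile time.
DEAL_II_DEPRECATED
const std::vector<IndexSet> &
locally_owned_dofs_per_processor() const;

// The new name makes the global MPI communication explicit, and the result
// is returned by value because it is no longer stored in the NumberCache.
std::vector<IndexSet>
compute_locally_owned_dofs_per_processor() const;
```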
```diff
  // calculate shifts
- unsigned int cumulated = 0;
+ types::global_dof_index cumulated = 0;
```
@tjhei It appears that you implicitly spotted a bug in the index shift here! 👍
This is part of #8293.
I now switched to using the new functions for the computations, populating the old deprecated interfaces on demand. I did not use a mutex yet because I think the use case (global communication over MPI) necessitates care from user code anyway, and it will likely break some of them either way.
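For reference, a minimal sketch of the mutex-guarded lazy fill that was considered but not adopted here; `NumberCacheSketch` and its members are made up for illustration:

```cpp
#include <mutex>
#include <vector>

struct IndexSet {}; // empty stand-in for dealii::IndexSet

class NumberCacheSketch
{
public:
  // Fill the cache on first use; the lock ensures only one thread fills it.
  // The computation is an MPI collective, so even with this caching the
  // first call must still be made uniformly on all ranks.
  const std::vector<IndexSet> &
  locally_owned_dofs_per_processor() const
  {
    std::lock_guard<std::mutex> lock(fill_mutex);
    if (cache.empty())
      cache = compute_locally_owned_dofs_per_processor(); // collective
    return cache;
  }

private:
  std::vector<IndexSet>
  compute_locally_owned_dofs_per_processor() const; // defined elsewhere

  mutable std::mutex            fill_mutex;
  mutable std::vector<IndexSet> cache;
};
```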
/rebuild |
Sure, that looks right. I was just searching in the pdf and couldn't find it, but it is in https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf.
Force-pushed from c6e7db3 to cb584ce
I leave this up to you. I agree that "deprecate but doesn't work" is not a good solution.
I wouldn't worry about thread safety for a deprecated function like this, but this is just me. ;-)
Good - now that I've written the most conservative variant, I'm fine to stick with that. It will give us an opportunity to clean up some code for the 9.3 release when we remove this 😉
Thanks! I only have some stylistic questions and minor remarks.
include/deal.II/dofs/dof_handler.h
Outdated
```
 * function, so it must be called on all processors participating in the MPI
 * communicator underlying the triangulation.
 *
 * If you are only interested in the number of elements each processor owns
 * then n_locally_owned_dofs_per_processor() is a better choice.
```
```diff
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.
```
?
include/deal.II/dofs/dof_handler.h
Outdated
```
 * This function involves global communication via the @p MPI_Allgather
 * function, so it must be called on all processors participating in the MPI
 * communicator underlying the triangulation.
 *
 * Each element of the vector returned by this function equals the number of
 * elements of the corresponding sets returned by
 * locally_owned_dofs_per_processor().
```
```diff
- * locally_owned_dofs_per_processor().
+ * compute_locally_owned_dofs_per_processor().
```
?
include/deal.II/hp/dof_handler.h
Outdated
```
 * communicator underlying the triangulation.
 *
 * If you are only interested in the number of elements each processor owns
 * then n_locally_owned_dofs_per_processor() is a better choice.
```
```diff
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.
```
?
include/deal.II/hp/dof_handler.h
Outdated
```
 * possibly large memory footprint on many processors. As a consequence,
 * this function needs to call compute_n_locally_owned_dofs_per_processor()
 * upon the first invocation, including global communication. Use
 * compute_n_locally_owned_dofs_per_processor() instead if on up to a few
```
```diff
- * compute_n_locally_owned_dofs_per_processor() instead if on up to a few
+ * compute_n_locally_owned_dofs_per_processor() instead if using up to a few
```
and in the other places?
Yes, definitely.
include/deal.II/dofs/dof_handler.h
Outdated
```
  const unsigned int level) const
{
  Assert(level < this->get_triangulation().n_global_levels(),
         ExcMessage("invalid level in locally_owned_mg_dofs_per_processor"));
```
We mostly write full sentences in error messages. Would you mind doing that?
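For instance, one possible full-sentence wording (only a suggestion):

```cpp
Assert(level < this->get_triangulation().n_global_levels(),
       ExcMessage("The given level is not a valid level of the "
                  "triangulation in locally_owned_mg_dofs_per_processor()."));
```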
tests/sharedtria/dof_04.cc
Outdated
```
@@ -106,17 +106,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
```
Here as well...
tests/sharedtria/hp_dof_01.cc
Outdated
```
@@ -111,17 +111,20 @@ test()
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
```
... and here
tests/sharedtria/hp_dof_02.cc
Outdated
```
@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
```
... and here
tests/sharedtria/hp_dof_03.cc
Outdated
```
@@ -109,17 +109,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
```
... and here
tests/sharedtria/hp_dof_04.cc
Outdated
```
@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
```
... and here
Do not store the index sets of all processors on each processor. Fill them on demand.
@masterleinad thanks for the review - I've fixed all remarks.
Thanks!
This PR, prepared together with @peterrum, replaces the use of `MPI_Allgather` by the much more efficient `MPI_Exscan` in those cases where we wanted to get a parallel prefix sum (`MPI_Exscan` is MPI's name for the variant of the prefix sum that excludes the local element).

The second component of this PR is to no longer store `locally_owned_dofs_per_processor` and `n_locally_owned_dofs_per_processor` in the `NumberCache` of `DoFHandler`, following the discussion in #8067. A clean solution necessitates an incompatible change, namely that those vectors are now returned by value rather than by const reference. I think the use cases are limited. We use those vectors in a few dozen tests: on the one hand for `distribute_sparsity_pattern`, which we should replace by a better implementation in a subsequent pull request, and on the other hand for some sanity checks. One thing that no longer works is calling `dof_handler.locally_owned_dofs_per_processor()` inside loops whose number of iterations differs between ranks, because each call now implies global communication. The alternative would have been to populate the variable on the first call to those functions and then return it, using a lock so that only one thread fills it. That would have avoided touching all those tests, but I really think we should discourage these functions because they intrinsically do not scale.

Closes #8067.