
Replace MPI_Allgather by MPI_Exscan #8298

Merged
7 commits merged on Jun 7, 2019

Conversation

kronbichler
Member

This PR, prepared together with @peterrum, replaces the use of MPI_Allgather by the much more efficient MPI_Exscan in those cases where we wanted to get a parallel prefix sum (MPI calls the version of the prefix sum where the local element is excluded MPI_Exscan).
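To make the change concrete, here is a serial sketch (illustrative only, not deal.II code) of the exclusive prefix sum that MPI_Exscan computes in one collective call: rank r receives the sum of the local counts of ranks 0..r-1, which is exactly the offset of its first locally owned DoF. The Allgather-based variant has to collect all counts on every rank first and then sum a prefix of them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Serial model of MPI_Exscan(&n_local, &offset, 1, ..., MPI_SUM, comm):
// offsets[r] is what rank r would receive, i.e. the sum of n_local over
// ranks 0..r-1 (rank 0 receives 0).
std::vector<std::uint64_t>
exclusive_scan(const std::vector<std::uint64_t> &n_local)
{
  std::vector<std::uint64_t> offsets(n_local.size(), 0);
  std::uint64_t              running = 0;
  for (std::size_t r = 0; r < n_local.size(); ++r)
    {
      offsets[r] = running; // what MPI_Exscan would deliver to rank r
      running += n_local[r];
    }
  return offsets;
}
```

For example, local counts {4, 2, 5} yield offsets {0, 4, 6}.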

The second component of this PR is to no longer store locally_owned_dofs_per_processor and n_locally_owned_dofs_per_processor in the NumberCache of DoFHandler, following the discussion in #8067. A clean solution necessitates an incompatible change: we have to return those vectors by value rather than by const reference. I think the use cases are limited. We use those vectors in a few dozen tests: on the one hand for distribute_sparsity_pattern, which we should replace by a better implementation in a subsequent pull request, and on the other hand for some sanity checks. One thing that no longer works is calling dof_handler.locally_owned_dofs_per_processor() inside loops whose number of iterations differs between ranks, because each call now implies global communication. The alternative would have been to populate the variable on the first call to those functions and then return it, using a lock so that only one thread fills it. That would have avoided touching all those tests, but I really think we should discourage these functions because they intrinsically do not scale.

Closes #8067.

@kronbichler
Member Author

I will write a changelog (one in Incompatibilities and one in Minor changes) once we have agreed on the way forward regarding the implementation of n_locally_owned_dofs_per_processor and friends in terms of pass by value / pass by const reference.


// calculate shifts
- types::global_dof_index cumulated = 0;
+ unsigned int cumulated = 0;
Member

why do you change this to an int? I assume it could overflow?

Member Author

Sorry, my mistake from moving around the code. I will fix it.

@tjhei
Member

tjhei commented Jun 2, 2019

One thing that does not work any more is that we cannot call dof_handler.locally_owned_dofs_per_processor() inside loops whose number of execution is different on different ranks because each call now implies global communication

Making a function call collective, especially one that doesn't look like a collective call (if it was called compute_* or something it would be understandable) is not ideal.
I would deprecate the current functions and give the new functions a different name that makes this change more obvious. This way, users will get a warning when updating. What do you think?
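The deprecation pattern suggested here can be sketched with the standard C++14 attribute (names are hypothetical stand-ins, not the actual deal.II interface): the old accessor keeps working but triggers a compiler warning, while the new name makes the collective operation explicit.

```cpp
#include <cassert>
#include <vector>

// Illustrative-only sketch: the new function name makes the (here simulated)
// collective communication explicit; the old name is kept for compatibility
// but flagged so users get a warning when updating.
struct NumberCacheSketch
{
  // New, explicitly named function: performs the communication on every call.
  std::vector<unsigned int> compute_n_locally_owned_dofs_per_processor() const
  {
    return {4, 2, 5}; // stand-in for the result of a global gather
  }

  // Deprecated old name, forwarding to the new implementation.
  [[deprecated("use compute_n_locally_owned_dofs_per_processor(), which makes "
               "the global communication explicit")]]
  std::vector<unsigned int> n_locally_owned_dofs_per_processor() const
  {
    return compute_n_locally_owned_dofs_per_processor();
  }
};
```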

@tjhei
Member

tjhei commented Jun 2, 2019

What version of the standard is needed for MPI_Exscan?

@tjhei
Member

tjhei commented Jun 2, 2019

I looked through the rest of the changes and things look good. 👍

@kronbichler
Member Author

Making a function call collective, especially one that doesn't look like a collective call (if it was called compute_* or something it would be understandable) is not ideal.
I would deprecate the current functions and give the new functions a different name that makes this change more obvious. This way, users will get a warning when updating. What do you think?

I agree, introducing new functions that make the intent of the global communication clear and deprecating the old ones is a better idea. What about calling the new functions compute_n_locally_owned_dofs_per_processor() and compute_locally_owned_dofs_per_processor(), i.e., prepending the three functions with compute_? Alternatively we could use retrieve_* to make the operation that happens clear.

@masterleinad
Member

masterleinad commented Jun 3, 2019

What version of the standard is needed for MPI_Exscan?

Looks like MPI 2.1.

@kronbichler
Member Author

What version of the standard is needed for MPI_Exscan?

Looks like MPI 2.1.

Our minimal requirement is MPI 2.0, right? So we probably need to do MPI_Scan and subtract the local result. Not as pretty but tolerable. Alternatively we can add Utilities::MPI::prefix_sum and do it in one place.
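The fallback mentioned here, an inclusive MPI_Scan followed by subtracting the local contribution, yields exactly the exclusive prefix sum. A serial sketch of that equivalence (illustrative only, no actual MPI calls):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Serial model of the MPI 2.0-safe fallback: MPI_Scan gives rank r the
// inclusive sum over ranks 0..r; subtracting the rank's own contribution
// recovers the exclusive prefix sum that MPI_Exscan would compute directly.
std::uint64_t
exscan_via_inclusive_scan(const std::vector<std::uint64_t> &n_local,
                          std::size_t                       my_rank)
{
  std::uint64_t inclusive = 0; // result MPI_Scan would give my_rank
  for (std::size_t r = 0; r <= my_rank; ++r)
    inclusive += n_local[r];
  return inclusive - n_local[my_rank]; // subtract local value -> exclusive
}
```

With counts {4, 2, 5}, ranks 0, 1, 2 obtain offsets 0, 4, 6, matching MPI_Exscan.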

@kronbichler
Member Author

Wait, MPI_Exscan is listed here in the 2.0 standard definition:
https://www.mpi-forum.org/docs/mpi-2.0/mpi-20-html/node153.htm#Node153
and see also here the "exclusive scan" listed as new operation:
https://www.mpi-forum.org/docs/mpi-2.0/mpi-20-html/mpi2-report.html
so we should be able to use the code as is.

@masterleinad can you please confirm?

@kronbichler
Member Author

@tjhei how exactly should we move on with deprecating the old functions? We essentially have three alternatives:

  • We keep the functions {n_}locally_owned_{mg_}dofs_per_processor around and populate them upon the first call. To this end, we need to insert a lock to NumberCache because we might call those functions in a multithreaded environment.
  • We mark the functions as deprecated but do never populate them.
  • We remove the functions right away.

The reason I mention the second and third options separately is that I would actually prefer the third over the second: if we silently stop filling the fields, it is better to issue a compile-time error than a run-time error. Personally, I'd go for the first option.


// calculate shifts
- unsigned int cumulated = 0;
+ types::global_dof_index cumulated = 0;
Member Author

@tjhei It appears that you implicitly spotted a bug in the index shift here! 👍

@kronbichler
Member Author

This is part of #8293.

@kronbichler
Member Author

I now switched to using new functions for the computations, and to populating the old deprecated interfaces on demand. I did not use a mutex yet because I think the use case (global communication over MPI) necessitates care in user code anyway, and it will likely break some codes either way.
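The "populate on demand" pattern adopted here can be sketched as follows (a hypothetical illustration, not the deal.II implementation): the deprecated accessor fills a cached vector on first use. The std::once_flag shown is what the mutex-guarded variant discussed above would add; the PR deliberately omits such a guard.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative lazy-fill cache: the expensive (here simulated) global
// communication runs only on the first call; subsequent calls return the
// cached result. std::call_once makes the fill thread-safe.
class LazyCacheSketch
{
public:
  const std::vector<unsigned int> &n_locally_owned_dofs_per_processor() const
  {
    std::call_once(filled, [this] {
      // stand-in for the global MPI communication gathering the counts
      cache = {4, 2, 5};
    });
    return cache;
  }

private:
  mutable std::once_flag            filled;
  mutable std::vector<unsigned int> cache;
};
```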

@kronbichler
Member Author

/rebuild

@masterleinad
Member

@masterleinad can you please confirm?

Sure, that looks right. I was just searching in the pdf and couldn't find it, but it is in https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf.

@kronbichler force-pushed the mpi_exscan branch 2 times, most recently from c6e7db3 to cb584ce on June 3, 2019 at 12:13
@tjhei
Member

tjhei commented Jun 3, 2019

@tjhei how exactly should we move on with deprecating the old functions?

I leave this up to you. I agree that "deprecate but doesn't work" is not a good solution.

@tjhei
Member

tjhei commented Jun 3, 2019

I did not use a mutex yet because I think the use case

I wouldn't worry about thread safety for a deprecated function like this, but this is just me. ;-)

@kronbichler
Member Author

Good - now that I've written the most conservative variant, I'm fine to stick with that. It will give us an opportunity to clean up some code for the 9.3 release when we remove this 😉

Member

@masterleinad left a comment

Thanks! I only have some stylistic questions and minor remarks.

* function, so it must be called on all processors participating in the MPI
* communicator underlying the triangulation.
*
* If you are only interested in the number of elements each processor owns
* then n_locally_owned_dofs_per_processor() is a better choice.
Member

Suggested change:
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.

?

* This function involves global communication via the @p MPI_Allgather
* function, so it must be called on all processors participating in the MPI
* communicator underlying the triangulation.
*
* Each element of the vector returned by this function equals the number of
* elements of the corresponding sets returned by
* locally_owned_dofs_per_processor().
Member

Suggested change:
- * locally_owned_dofs_per_processor().
+ * compute_locally_owned_dofs_per_processor().

?

* communicator underlying the triangulation.
*
* If you are only interested in the number of elements each processor owns
* then n_locally_owned_dofs_per_processor() is a better choice.
Member

Suggested change:
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.

?

* possibly large memory footprint on many processors. As a consequence,
* this function needs to call compute_n_locally_owned_dofs_per_processor()
* upon the first invocation, including global communication. Use
* compute_n_locally_owned_dofs_per_processor() instead if on up to a few
Member

Suggested change:
- * compute_n_locally_owned_dofs_per_processor() instead if on up to a few
+ * compute_n_locally_owned_dofs_per_processor() instead if using up to a few

and in the other places?

Member Author

Yes, definitely.

const unsigned int level) const
{
Assert(level < this->get_triangulation().n_global_levels(),
ExcMessage("invalid level in locally_owned_mg_dofs_per_processor"));
Member

We mostly write full sentences in error messages. Would you mind doing that?

@@ -106,17 +106,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

Here as well...

@@ -111,17 +111,20 @@ test()

// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -109,17 +109,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@kronbichler
Member Author

@masterleinad thanks for the review - I've fixed all remarks.

Member

@masterleinad left a comment

Thanks!

Successfully merging this pull request may close these issues.

Do not store IndexSet of all ranks in DoFHandler's NumberCache
3 participants