
Replace MPI_Allgather by MPI_Exscan #8298

Merged
7 commits merged on Jun 7, 2019

Conversation

kronbichler
Member

This PR, prepared together with @peterrum, replaces the use of MPI_Allgather by the much more efficient MPI_Exscan in those cases where we wanted to get a parallel prefix sum (MPI calls the version of the prefix sum where the local element is excluded MPI_Exscan).
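To make the change concrete, here is a serial sketch (illustrative only, not deal.II code) of the exclusive prefix sum that MPI_Exscan computes in one collective call: rank r receives the sum of the local counts of ranks 0..r-1, which is exactly the offset of its first locally owned DoF. The Allgather-based variant has to collect all counts on every rank first and then sum a prefix of them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Serial model of MPI_Exscan(&n_local, &offset, 1, ..., MPI_SUM, comm):
// offsets[r] is what rank r would receive, i.e. the sum of n_local over
// ranks 0..r-1 (rank 0 receives 0).
std::vector<std::uint64_t>
exclusive_scan(const std::vector<std::uint64_t> &n_local)
{
  std::vector<std::uint64_t> offsets(n_local.size(), 0);
  std::uint64_t              running = 0;
  for (std::size_t r = 0; r < n_local.size(); ++r)
    {
      offsets[r] = running; // what MPI_Exscan would deliver to rank r
      running += n_local[r];
    }
  return offsets;
}
```

For example, local counts {4, 2, 5} yield offsets {0, 4, 6}.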

The second component of this PR is to no longer store locally_owned_dofs_per_processor and n_locally_owned_dofs_per_processor in the NumberCache of DoFHandler, following the discussion in #8067. A clean solution necessitates an incompatible change: we have to return those vectors by value rather than by const reference. I think the use cases are limited. We use those vectors in a few dozen tests: on the one hand for distribute_sparsity_pattern, which we should replace by a better implementation in a subsequent pull request, and on the other hand for some sanity checks. One thing that no longer works is calling dof_handler.locally_owned_dofs_per_processor() inside loops whose number of iterations differs between ranks, because each call now implies global communication. The alternative would have been to populate the variable on the first call to those functions and then return it, using a lock so that only one thread fills it. That would have avoided touching all those tests, but I really think we should discourage these functions because they intrinsically do not scale.

Closes #8067.

@kronbichler
Member Author

I will write a changelog (one in Incompatibilities and one in Minor changes) once we have agreed on the way forward regarding the implementation of n_locally_owned_dofs_per_processor and friends in terms of pass by value / pass by const reference.


// calculate shifts
- types::global_dof_index cumulated = 0;
+ unsigned int cumulated = 0;
Member

why do you change this to an int? I assume it could overflow?

Member Author

Sorry, my mistake from moving around the code. I will fix it.

@tjhei
Member

tjhei commented Jun 2, 2019

One thing that does not work any more is that we cannot call dof_handler.locally_owned_dofs_per_processor() inside loops whose number of execution is different on different ranks because each call now implies global communication

Making a function call collective, especially one that doesn't look like a collective call (if it was called compute_* or something it would be understandable) is not ideal.
I would deprecate the current functions and give the new functions a different name that makes this change more obvious. This way, users will get a warning when updating. What do you think?
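The deprecation pattern suggested here can be sketched with the standard C++14 attribute (names are hypothetical stand-ins, not the actual deal.II interface): the old accessor keeps working but triggers a compiler warning, while the new name makes the collective operation explicit.

```cpp
#include <cassert>
#include <vector>

// Illustrative-only sketch: the new function name makes the (here simulated)
// collective communication explicit; the old name is kept for compatibility
// but flagged so users get a warning when updating.
struct NumberCacheSketch
{
  // New, explicitly named function: performs the communication on every call.
  std::vector<unsigned int> compute_n_locally_owned_dofs_per_processor() const
  {
    return {4, 2, 5}; // stand-in for the result of a global gather
  }

  // Deprecated old name, forwarding to the new implementation.
  [[deprecated("use compute_n_locally_owned_dofs_per_processor(), which makes "
               "the global communication explicit")]]
  std::vector<unsigned int> n_locally_owned_dofs_per_processor() const
  {
    return compute_n_locally_owned_dofs_per_processor();
  }
};
```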

@tjhei
Member

tjhei commented Jun 2, 2019

What version of the standard is needed for MPI_Exscan?

@tjhei
Member

tjhei commented Jun 2, 2019

I looked through the rest of the changes and things look good. 👍

@kronbichler
Member Author

Making a function call collective, especially one that doesn't look like a collective call (if it was called compute_* or something it would be understandable) is not ideal.
I would deprecate the current functions and give the new functions a different name that makes this change more obvious. This way, users will get a warning when updating. What do you think?

I agree, introducing new functions that make the intent of the global communication clear and deprecating the old ones is a better idea. What about calling the new functions compute_n_locally_owned_dofs_per_processor() and compute_locally_owned_dofs_per_processor(), i.e., prepending the three functions with compute_? Alternatively we could use retrieve_* to make the operation that happens clear.

@masterleinad
Member

masterleinad commented Jun 3, 2019

What version of the standard is needed for MPI_Exscan?

Looks like MPI 2.1.

@kronbichler
Member Author

What version of the standard is needed for MPI_Exscan?

Looks like MPI 2.1.

Our minimal requirement is MPI 2.0, right? So we probably need to do MPI_Scan and subtract the local result. Not as pretty but tolerable. Alternatively we can add Utilities::MPI::prefix_sum and do it in one place.
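The fallback mentioned here, an inclusive MPI_Scan followed by subtracting the local contribution, yields exactly the exclusive prefix sum. A serial sketch of that equivalence (illustrative only, no actual MPI calls):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Serial model of the MPI 2.0-safe fallback: MPI_Scan gives rank r the
// inclusive sum over ranks 0..r; subtracting the rank's own contribution
// recovers the exclusive prefix sum that MPI_Exscan would compute directly.
std::uint64_t
exscan_via_inclusive_scan(const std::vector<std::uint64_t> &n_local,
                          std::size_t                       my_rank)
{
  std::uint64_t inclusive = 0; // result MPI_Scan would give my_rank
  for (std::size_t r = 0; r <= my_rank; ++r)
    inclusive += n_local[r];
  return inclusive - n_local[my_rank]; // subtract local value -> exclusive
}
```

With counts {4, 2, 5}, ranks 0, 1, 2 obtain offsets 0, 4, 6, matching MPI_Exscan.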

@kronbichler
Member Author

Wait, MPI_Exscan is listed here in the 2.0 standard definition:
https://www.mpi-forum.org/docs/mpi-2.0/mpi-20-html/node153.htm#Node153
and see also here the "exclusive scan" listed as new operation:
https://www.mpi-forum.org/docs/mpi-2.0/mpi-20-html/mpi2-report.html
so we should be able to use the code as is.

@masterleinad can you please confirm?

@kronbichler
Member Author

@tjhei how exactly should we move on with deprecating the old functions? We essentially have three alternatives:

  • We keep the functions {n_}locally_owned_{mg_}dofs_per_processor around and populate them upon the first call. To this end, we need to insert a lock to NumberCache because we might call those functions in a multithreaded environment.
  • We mark the functions as deprecated but do never populate them.
  • We remove the functions right away.

The reason I mention the second and third options separately is that I would actually prefer the third over the second: if we silently stop filling the fields, it is better to issue a compile-time error than a run-time error. Personally, I'd go for the first option.


// calculate shifts
- unsigned int cumulated = 0;
+ types::global_dof_index cumulated = 0;
Member Author

@tjhei It appears that you implicitly spotted a bug in the index shift here! 👍

@kronbichler
Member Author

This is part of #8293.

@kronbichler
Member Author

I now switched to using new functions for the computations, and to populating the old deprecated interfaces on demand. I did not use a mutex yet because I think the use case (global communication over MPI) necessitates care in user code anyway, and it will likely break some codes either way.
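The "populate on demand" pattern adopted here can be sketched as follows (a hypothetical illustration, not the deal.II implementation): the deprecated accessor fills a cached vector on first use. The std::once_flag shown is what the mutex-guarded variant discussed above would add; the PR deliberately omits such a guard.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative lazy-fill cache: the expensive (here simulated) global
// communication runs only on the first call; subsequent calls return the
// cached result. std::call_once makes the fill thread-safe.
class LazyCacheSketch
{
public:
  const std::vector<unsigned int> &n_locally_owned_dofs_per_processor() const
  {
    std::call_once(filled, [this] {
      // stand-in for the global MPI communication gathering the counts
      cache = {4, 2, 5};
    });
    return cache;
  }

private:
  mutable std::once_flag            filled;
  mutable std::vector<unsigned int> cache;
};
```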

@kronbichler
Member Author

/rebuild

@masterleinad
Member

@masterleinad can you please confirm?

Sure, that looks right. I was just searching in the pdf and couldn't find it, but it is in https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf.

@kronbichler force-pushed the mpi_exscan branch 2 times, most recently from c6e7db3 to cb584ce on June 3, 2019 at 12:13
@tjhei
Member

tjhei commented Jun 3, 2019

@tjhei how exactly should we move on with deprecating the old functions?

I leave this up to you. I agree that "deprecate but doesn't work" is not a good solution.

@tjhei
Member

tjhei commented Jun 3, 2019

I did not use a mutex yet because I think the use case

I wouldn't worry about thread safety for a deprecated function like this, but this is just me. ;-)

@kronbichler
Member Author

Good - now that I've written the most conservative variant, I'm fine to stick with that. It will give us an opportunity to clean up some code for the 9.3 release when we remove this 😉

Member

@masterleinad left a comment

Thanks! I only have some stylistic questions and minor remarks.

* function, so it must be called on all processors participating in the MPI
* communicator underlying the triangulation.
*
* If you are only interested in the number of elements each processor owns
* then n_locally_owned_dofs_per_processor() is a better choice.
Member

Suggested change:
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.

?

* This function involves global communication via the @p MPI_Allgather
* function, so it must be called on all processors participating in the MPI
* communicator underlying the triangulation.
*
* Each element of the vector returned by this function equals the number of
* elements of the corresponding sets returned by
* locally_owned_dofs_per_processor().
Member

Suggested change:
- * locally_owned_dofs_per_processor().
+ * compute_locally_owned_dofs_per_processor().

?

* communicator underlying the triangulation.
*
* If you are only interested in the number of elements each processor owns
* then n_locally_owned_dofs_per_processor() is a better choice.
Member

Suggested change:
- * then n_locally_owned_dofs_per_processor() is a better choice.
+ * then compute_n_locally_owned_dofs_per_processor() is a better choice.

?

* possibly large memory footprint on many processors. As a consequence,
* this function needs to call compute_n_locally_owned_dofs_per_processor()
* upon the first invocation, including global communication. Use
* compute_n_locally_owned_dofs_per_processor() instead if on up to a few
Member

Suggested change:
- * compute_n_locally_owned_dofs_per_processor() instead if on up to a few
+ * compute_n_locally_owned_dofs_per_processor() instead if using up to a few

and in the other places?

Member Author

Yes, definitely.

const unsigned int level) const
{
Assert(level < this->get_triangulation().n_global_levels(),
ExcMessage("invalid level in locally_owned_mg_dofs_per_processor"));
Member

We mostly write full sentences in error messages. Would you mind doing that?

@@ -106,17 +106,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

Here as well...

@@ -111,17 +111,20 @@ test()

// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -109,17 +109,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@@ -112,17 +112,20 @@ test()
//
// deallog << "n_locally_owned_dofs_per_processor: ";
// std::vector<types::global_dof_index> v =
// dof_handler.n_locally_owned_dofs_per_processor(); unsigned int sum
// = 0; for (unsigned int i=0; i<v.size(); ++i)
// dof_handler.compute_n_locally_owned_dofs_per_processor(); unsigned
Member

... and here

@kronbichler
Member Author

@masterleinad thanks for the review - I've fixed all remarks.

Member

@masterleinad left a comment

Thanks!

Successfully merging this pull request may close these issues.

Do not store IndexSet of all ranks in DoFHandler's NumberCache
3 participants