Speed up ComputeIndexOwner::Dictionary #8785

Closed
kronbichler opened this issue Sep 18, 2019 · 4 comments · Fixed by #8813
Comments

kronbichler commented Sep 18, 2019

While working on #8772, I noticed two things we should improve for the dictionary class of ComputeIndexOwner:

  1. reinit() is slower than it needs to be (and slow enough to show up in profiles) because it expands the indices of the locally owned range one by one and computes the owning rank one by one. We should do this in terms of intervals instead. This unfortunately involves a bit of interval intersection logic, but it is still manageable; a possible sketch follows this list.
  2. As already noted in Eliminate MPI all-to-all communication in MPI::Partitioner #8772, we should introduce a minimal grain size in dofs_per_process.
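
For the first point, here is a minimal sketch of what an interval-based variant could look like; the names local_range, fill_intervals, and rank_intervals are hypothetical and not part of the actual Dictionary class. The idea is to walk over the dictionary ranks whose ranges intersect the locally owned range and record whole sub-intervals at once, rather than visiting every index.

    // Hypothetical sketch: assign whole sub-intervals of the locally owned
    // range [local_range.first, local_range.second) to dictionary ranks,
    // instead of computing the owning rank index by index.
    #include <algorithm>
    #include <utility>
    #include <vector>

    using global_dof_index = unsigned long long; // stand-in for types::global_dof_index

    void
    fill_intervals(const std::pair<global_dof_index, global_dof_index> &local_range,
                   const global_dof_index dofs_per_process,
                   std::vector<std::pair<unsigned int,
                                         std::pair<global_dof_index, global_dof_index>>>
                     &rank_intervals)
    {
      global_dof_index index = local_range.first;
      while (index < local_range.second)
        {
          // dictionary rank owning this index and the end of its range
          const unsigned int     rank      = index / dofs_per_process;
          const global_dof_index range_end =
            std::min<global_dof_index>((rank + 1) * dofs_per_process,
                                       local_range.second);
          // record the whole sub-interval [index, range_end) in one go
          rank_intervals.emplace_back(rank, std::make_pair(index, range_end));
          index = range_end;
        }
    }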

Related to #8293.

@peterrum

The second point is easily addressed by replacing:

dofs_per_process  = (size + n_procs - 1) / n_procs;

by:

dofs_per_process  = std::max((size + n_procs - 1) / n_procs, min_dofs_per_process);
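
To illustrate the effect of such a grain size (min_dofs_per_process and the concrete numbers below are only assumptions for this example, not existing library constants): for a small problem on many ranks, only a fraction of the ranks ends up holding dictionary entries.

    // Hypothetical illustration of the effect of a minimal grain size
    // (min_dofs_per_process is an assumed constant, not part of the library)
    #include <algorithm>
    #include <iostream>

    int main()
    {
      const unsigned long long size                 = 1000; // global number of DoFs
      const unsigned int       n_procs              = 4096; // MPI ranks
      const unsigned long long min_dofs_per_process = 64;   // grain size

      const unsigned long long dofs_per_process =
        std::max((size + n_procs - 1) / n_procs, min_dofs_per_process);

      // only this many ranks end up holding dictionary entries
      const unsigned int n_active_ranks =
        (size + dofs_per_process - 1) / dofs_per_process;

      std::cout << "dofs_per_process = " << dofs_per_process        // prints 64
                << ", active dictionary ranks = " << n_active_ranks // prints 16
                << std::endl;
    }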

But in a second step, we should distribute the load among the nodes/islands in a smarter way...

@kronbichler

But in a second step, we should distribute the load among the nodes/islands in a smarter way...

The question is how expensive that would be in terms of implementation and in terms of communication. My rationale is that it probably will not matter much even if communication is heavily skewed towards the lower ranks: this is a setup routine and we do not need an optimal decision here; point-to-point communication is cheap even when 90% of it goes to rank 0, because in that case the global problem is small and does not use too many ranks either. But if you know something that is less than half an hour of work, feel free to think about it.

peterrum commented Sep 19, 2019

But if you know something that is less than half an hour of work, feel free to think about it.

Essentially, we would need to introduce dofs_per_group < group_size * dofs_per_process. Then it would look like this:

          unsigned int
          dof_to_dict_rank(const types::global_dof_index i)
          {
            // index of the group (e.g. compute node) the dof belongs to
            const unsigned int i_group = i / dofs_per_group;
            // rank within that group, based on the per-rank grain size
            const unsigned int i_local = (i % dofs_per_group) / dofs_per_process;
            return i_group * group_size + i_local;
          }

Note: This took me 5 minutes (and I did not try it out)...
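
To make the sketch a bit more concrete, the group parameters could be set up along the following lines; all names (GroupedDictionaryLayout, ranks_per_node, etc.) are assumptions for illustration, not the actual Dictionary implementation.

    // Hypothetical sketch of how the group parameters could be chosen; none of
    // these names exist in the library, they only illustrate the mapping above.
    using global_dof_index = unsigned long long; // stand-in for types::global_dof_index

    struct GroupedDictionaryLayout
    {
      global_dof_index dofs_per_process; // grain size per dictionary rank
      global_dof_index dofs_per_group;   // contiguous index chunk per group (node/island)
      unsigned int     group_size;       // number of ranks per group

      void
      reinit(const global_dof_index size,
             const unsigned int     n_procs,
             const unsigned int     ranks_per_node)
      {
        group_size = ranks_per_node;
        const unsigned int n_groups = (n_procs + group_size - 1) / group_size;

        // split the index space evenly among the groups first ...
        dofs_per_group = (size + n_groups - 1) / n_groups;
        // ... and then among the ranks within a group, which guarantees
        // dofs_per_group <= group_size * dofs_per_process
        dofs_per_process = (dofs_per_group + group_size - 1) / group_size;
      }

      unsigned int
      dof_to_dict_rank(const global_dof_index i) const
      {
        const unsigned int i_group = i / dofs_per_group;
        const unsigned int i_local = (i % dofs_per_group) / dofs_per_process;
        return i_group * group_size + i_local;
      }
    };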

Note: In one of my next PRs I will introduce a function to determine the size of a compute node.

@kronbichler

Note: In one of my next PRs I will introduce a function to determine the size of a compute node.

There is some infrastructure here:
https://github.com/dealii/dealii/blob/master/source/base/mpi.cc#L643-L682

You might want to look for something similar and/or adapt it so we can keep functionality in one place.
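
For reference, and independently of the deal.II code linked above, the number of ranks on the local compute node can be queried with plain MPI via MPI_Comm_split_type; a minimal sketch:

    // Minimal sketch (plain MPI, independent of the deal.II infrastructure linked
    // above): determine how many ranks share the local compute node.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);

      // split MPI_COMM_WORLD into sub-communicators of ranks that can share
      // memory, i.e. ranks located on the same node
      MPI_Comm node_comm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                          MPI_INFO_NULL, &node_comm);

      int ranks_on_node = 0;
      MPI_Comm_size(node_comm, &ranks_on_node);
      std::printf("ranks on this node: %d\n", ranks_on_node);

      MPI_Comm_free(&node_comm);
      MPI_Finalize();
    }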
