Speed up ComputeIndexOwner::Dictionary #8785

Closed
kronbichler opened this issue Sep 18, 2019 · 4 comments · Fixed by #8813
Comments

kronbichler commented Sep 18, 2019

While working on #8772, I noticed two things we should improve for the dictionary class of ComputeIndexOwner:

  1. reinit() is slower than it needs to be (and slow enough to show up in profiles) because it expands the indices of the locally owned range one by one and computes the owning rank one by one. We should do this in terms of intervals instead. This unfortunately involves a bit of interval intersection logic, but it is still manageable; a possible sketch follows this list.
  2. As already noted in Eliminate MPI all-to-all communication in MPI::Partitioner #8772, we should introduce a minimal grain size in dofs_per_process.
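
For the first point, here is a minimal sketch of what an interval-based variant could look like; the names local_range, fill_intervals, and rank_intervals are hypothetical and not part of the actual Dictionary class. The idea is to walk over the dictionary ranks whose ranges intersect the locally owned range and record whole sub-intervals at once, rather than visiting every index.

    // Hypothetical sketch: assign whole sub-intervals of the locally owned
    // range [local_range.first, local_range.second) to dictionary ranks,
    // instead of computing the owning rank index by index.
    #include <algorithm>
    #include <utility>
    #include <vector>

    using global_dof_index = unsigned long long; // stand-in for types::global_dof_index

    void
    fill_intervals(const std::pair<global_dof_index, global_dof_index> &local_range,
                   const global_dof_index dofs_per_process,
                   std::vector<std::pair<unsigned int,
                                         std::pair<global_dof_index, global_dof_index>>>
                     &rank_intervals)
    {
      global_dof_index index = local_range.first;
      while (index < local_range.second)
        {
          // dictionary rank owning this index and the end of its range
          const unsigned int     rank      = index / dofs_per_process;
          const global_dof_index range_end =
            std::min<global_dof_index>((rank + 1) * dofs_per_process,
                                       local_range.second);
          // record the whole sub-interval [index, range_end) in one go
          rank_intervals.emplace_back(rank, std::make_pair(index, range_end));
          index = range_end;
        }
    }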

Related to #8293.

@peterrum

The second point is easily addressed by replacing:

dofs_per_process  = (size + n_procs - 1) / n_procs;

by:

dofs_per_process  = std::max((size + n_procs - 1) / n_procs, min_dofs_per_process);
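
To illustrate the effect of such a grain size (min_dofs_per_process and the concrete numbers below are only assumptions for this example, not existing library constants): for a small problem on many ranks, only a fraction of the ranks ends up holding dictionary entries.

    // Hypothetical illustration of the effect of a minimal grain size
    // (min_dofs_per_process is an assumed constant, not part of the library)
    #include <algorithm>
    #include <iostream>

    int main()
    {
      const unsigned long long size                 = 1000; // global number of DoFs
      const unsigned int       n_procs              = 4096; // MPI ranks
      const unsigned long long min_dofs_per_process = 64;   // grain size

      const unsigned long long dofs_per_process =
        std::max((size + n_procs - 1) / n_procs, min_dofs_per_process);

      // only this many ranks end up holding dictionary entries
      const unsigned int n_active_ranks =
        (size + dofs_per_process - 1) / dofs_per_process;

      std::cout << "dofs_per_process = " << dofs_per_process        // prints 64
                << ", active dictionary ranks = " << n_active_ranks // prints 16
                << std::endl;
    }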

But in a second step, we should distribute the load among the nodes/islands in a smarter way...

@kronbichler

But in a second step, we should distribute the load among the nodes/islands in a smarter way...

The question is how expensive that would be in terms of implementation and in terms of communication. My rationale is that it probably will not matter much even if communication is heavily skewed towards the lower ranks: this is a setup routine and we do not need an optimal decision here; point-to-point communication is cheap even when 90% of it goes to rank 0, because in that case the global problem is small and does not use too many ranks either. But if you know something that is less than half an hour of work, feel free to think about it.

peterrum commented Sep 19, 2019

But if you know something that is less than half an hour of work, feel free to think about it.

Essentially, we would need to introduce dofs_per_group < group_size * dofs_per_process. Then it would look like this:

          unsigned int
          dof_to_dict_rank(const types::global_dof_index i)
          {
            // index of the group (e.g. compute node) the dof belongs to
            const unsigned int i_group = i / dofs_per_group;
            // rank within that group, based on the per-rank grain size
            const unsigned int i_local = (i % dofs_per_group) / dofs_per_process;
            return i_group * group_size + i_local;
          }

Note: This took me 5 minutes (and I did not try it out)...
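
To make the sketch a bit more concrete, the group parameters could be set up along the following lines; all names (GroupedDictionaryLayout, ranks_per_node, etc.) are assumptions for illustration, not the actual Dictionary implementation.

    // Hypothetical sketch of how the group parameters could be chosen; none of
    // these names exist in the library, they only illustrate the mapping above.
    using global_dof_index = unsigned long long; // stand-in for types::global_dof_index

    struct GroupedDictionaryLayout
    {
      global_dof_index dofs_per_process; // grain size per dictionary rank
      global_dof_index dofs_per_group;   // contiguous index chunk per group (node/island)
      unsigned int     group_size;       // number of ranks per group

      void
      reinit(const global_dof_index size,
             const unsigned int     n_procs,
             const unsigned int     ranks_per_node)
      {
        group_size = ranks_per_node;
        const unsigned int n_groups = (n_procs + group_size - 1) / group_size;

        // split the index space evenly among the groups first ...
        dofs_per_group = (size + n_groups - 1) / n_groups;
        // ... and then among the ranks within a group, which guarantees
        // dofs_per_group <= group_size * dofs_per_process
        dofs_per_process = (dofs_per_group + group_size - 1) / group_size;
      }

      unsigned int
      dof_to_dict_rank(const global_dof_index i) const
      {
        const unsigned int i_group = i / dofs_per_group;
        const unsigned int i_local = (i % dofs_per_group) / dofs_per_process;
        return i_group * group_size + i_local;
      }
    };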

Note: In one of my next PRs I will introduce a function to determine the size of a compute node.

@kronbichler

Note: In one of my next PRs I will introduce a function to determine the size of a compute node.

There is some infrastructure here:
https://github.com/dealii/dealii/blob/master/source/base/mpi.cc#L643-L682

You might want to look for something similar and/or adapt it so we can keep functionality in one place.
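
For reference, and independently of the deal.II code linked above, the number of ranks on the local compute node can be queried with plain MPI via MPI_Comm_split_type; a minimal sketch:

    // Minimal sketch (plain MPI, independent of the deal.II infrastructure linked
    // above): determine how many ranks share the local compute node.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);

      // split MPI_COMM_WORLD into sub-communicators of ranks that can share
      // memory, i.e. ranks located on the same node
      MPI_Comm node_comm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                          MPI_INFO_NULL, &node_comm);

      int ranks_on_node = 0;
      MPI_Comm_size(node_comm, &ranks_on_node);
      std::printf("ranks on this node: %d\n", ranks_on_node);

      MPI_Comm_free(&node_comm);
      MPI_Finalize();
    }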
