MG transfer global coarsening: Avoid sending empty messages #16633
Conversation
I tried your patch with the one test case that caused all the issues related to #16615 and #16631. It worked fine with various counts of MPI processes (while oversubscribing).
I will try some even bigger scenarios on the Expanse supercomputer in the next few days, but we don't have to wait until then to merge this PR.
This does not look right. A zero-size request means that there was no request (or that there was a request that failed?)!? This should not happen. The question is whose side effect this is. I guess the dictionary sends these empty requesters here:
which are received by the index owners here: dealii/source/base/mpi_compute_index_owner_internal.cc Lines 695 to 699 in da635b4
This indicates that dealii/source/base/mpi_compute_index_owner_internal.cc Lines 756 to 761 in da635b4
is accidentally adding an empty entry, potentially at the wrong rank!? I think it is worth investigating this further to make sure there is no bug. I cannot guarantee that I have time to debug this in the next few days.
// there are no indices to send (this can still happen in the run
// of the consensus algorithms above if the index spaces are
// sparse).
if (i.first == my_rank || i.second.begin() == i.second.end())
i.second corresponds to an object of type IndexSet. Checking for begin() == end() is probably more easily understood if you said i.second.is_empty().
Separately, I don't think this is sending an empty message. It perhaps sends an empty collection packed up into a string, but that is a non-zero-length string.
Oh. I'm not sure right now whether we agreed that this patch does the right thing.
No, we did not, and we should investigate the reason for the issue!
To make a test case of @marcfehling run on my MPI implementation, I had to ensure that there were no empty messages in addition to #16631; otherwise an MPI process would get stuck in the BlackBoxFineDoFHandlerView::reinit() function here:
dealii/include/deal.II/multigrid/mg_transfer_global_coarsening.templates.h
Lines 972 to 974 in d292264
The empty index sets created in the MPI::internal::ComputeIndexOwner functions can happen when we have sparse index ranges, as we might touch an std::map entry without filling it somewhere. I did not investigate that more closely, but I believe none of our existing tests rely on empty messages, so I expect this change makes things more robust anyway. @peterrum, there might be a place where we could filter this out earlier, but I can't say for sure where the empty index sets come from, so I would appreciate some help.
Furthermore, I can't provide a test case because the one used by @marcfehling is rather large, and I could not work out what geometric configuration would be needed: whenever I tried to reduce the complexity, the MPI deadlock would go away, so it seems my MPI implementation might simply have run out of buffers or something. @marcfehling, would you mind testing your full case with this patch as well, to ensure that nothing else happens here?