Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MG transfer global coarsening: Avoid sending empty messages #16633

Merged
merged 1 commit into from Feb 16, 2024

Conversation

kronbichler
Copy link
Member

@kronbichler kronbichler commented Feb 12, 2024

To make a test case of @marcfehling run on my MPI implementation, I had to ensure that there were no empty messages in addition to #16631, otherwise an MPI process would get stuck in the BlackBoxFineDoFHandlerView::reinit() function here:

const int ierr_1 = MPI_Waitall(requests.size(),
requests.data(),
MPI_STATUSES_IGNORE);
I believe the empty index sets in the MPI::internal::ComputeIndexOwner functions can happen when we have sparse index ranges as we might touch an std::map entry without filling it somewhere. I did not investigate that closer, but I think that all tests we have do not rely on empty messages, so I think we should expect that this is more robust anyway. @peterrum there might a place where we could filter out that earlier, but I can't say for sure where the empty index sets come from, so I would appreciate some help.

Furthermore, I can't give a test case because the one used by @marcfehling is rather large, and I could not understand what geometric configuration would be needed: Whenever I tried to reduce the complexity, the MPI deadlock would go away, so it seems my MPI implementation might simply have run out of buffers or something. @marcfehling would you mind testing your full case also with this patch, to ensure that nothing else happens here?

Copy link
Member

@marcfehling marcfehling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your patch with the one testcase that caused all the issues related to #16615 and #16631. It worked fine with various counts of MPI processes (while oversubscribing). Everything went fine.

I will try some even bigger scenarios on the expanse supercomputer in the next few days. But we don't have to wait until then to merge this PR.

@peterrum
Copy link
Member

@peterrum there might a place where we could filter out that earlier, but I can't say for sure where the empty index sets come from, so I would appreciate some help.

This does not look right. A zero-size request means that there was no request (or there was a request that failed?)!?

This should not happen. The question is whose side effect this is?

I guess the dictionary send these empty requesters here:

send_data[i].push_back(interval.second);

Which are received by the index owners here:

for (unsigned int i = offset + 1;
i < offset + buffer[offset].second + 1;
++i)
my_index_set.add_range(index_offset + buffer[i].first,
index_offset + buffer[i].second);

This indicates that

auto &request = requesters[owner_index_guess];
if (request.empty() || request.back().first != rank_of_request)
request.emplace_back(
rank_of_request,
std::vector<
std::pair<types::global_dof_index, types::global_dof_index>>());

is accidentally is adding an empty entry potentially at the wrong rank!?

I think it is worth to investigate this more to rule out that there is no bug. I cannot guarantee that I have time to debug this in the next days.

// there are no indices to send (this can still happen in the run
// of the consensus algorithms above if the index spaces are
// sparse).
if (i.first == my_rank || i.second.begin() == i.second.end())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.second corresponds to an object of type IndexSet. Checking for begin() == end() is probably more easily understood if you said i.second.is_empty().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, I don't think this is sending an empty message. It perhaps sends an empty collection packed up into a string, but that's a non-zero length string.

@tamiko tamiko merged commit cbed3c6 into dealii:master Feb 16, 2024
15 checks passed
@marcfehling
Copy link
Member

Oh. I'm not sure right now whether we agreed that this patch does the right thing.

@peterrum
Copy link
Member

Oh. I'm not sure right now whether we agreed that this patch does the right thing.

No, we did not and should investigate the reason for the issue!

@kronbichler kronbichler deleted the fix_empty_comm branch March 18, 2024 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants