New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catch exceptions in consensus algorithm #14364
Catch exceptions in consensus algorithm #14364
Conversation
I have marked this WIP, because I am waiting for some test results. The code should be more or less ready (pending review). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. What is the motivation for this? I guess an exception is throw within one of the lambdas passed to CA?
// 2) Until all posted receive operations are known to have | ||
// completed, | ||
// answer requests and keep checking whether all requests of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments are now not well formatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you still fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me if you update the comments (and remove the WIP marker in the subject line of the PR).
I wonder, out of curiosity, where you encountered the exception? None of the user-side callbacks should be throwing exceptions, of course, but we're not enforcing this. I'm curious what kind of problem you ran into that had these functions throw exceptions...
// 2) Until all posted receive operations are known to have | ||
// completed, | ||
// answer requests and keep checking whether all requests of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you still fix this?
// 3) Signal to all other processes that all requests of this | ||
// process | ||
// have been answered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here and below
I have updated the comments. We are still hunting for the source of the bug (there are at least two different exceptions being thrown). I have added all we know so far in geodynamics/aspect#4984. In short the exception is thrown by |
But indepent of the solution to the actual bug, this PR has already been helpful for us. In my opinion we do not need to wait with this, if you agree it can be useful. |
Please squash the commits (at least the last two) and I'll merge. Thank you! |
/rebuild |
handle_exception(std::exception_ptr exception, const MPI_Comm &comm) | ||
{ | ||
# ifdef DEAL_II_WITH_MPI | ||
// an exception within a ConsensusAlgorithm likely causes an | ||
// MPI deadlock. Abort with a reasonable error message instead. | ||
try | ||
{ | ||
std::rethrow_exception(exception); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the tests also uncovered this issue:
/jenkins/workspace/dealii-tidy_PR-14364/include/deal.II/base/mpi_consensus_algorithms.h:1492:38: error: parameter 'exception' is passed by value and only copied once; consider moving it to avoid unnecessary copies [performance-
You're using std::current_exception()
for the exception
argument in the calling places. I think this might best be written in the following way, with an rvalue reference as argument:
handle_exception(std::exception_ptr exception, const MPI_Comm &comm) | |
{ | |
# ifdef DEAL_II_WITH_MPI | |
// an exception within a ConsensusAlgorithm likely causes an | |
// MPI deadlock. Abort with a reasonable error message instead. | |
try | |
{ | |
std::rethrow_exception(exception); | |
handle_exception(std::exception_ptr &&exception, const MPI_Comm &comm) | |
{ | |
# ifdef DEAL_II_WITH_MPI | |
// an exception within a ConsensusAlgorithm likely causes an | |
// MPI deadlock. Abort with a reasonable error message instead. | |
try | |
{ | |
std::rethrow_exception(std::move(exception)); |
I added #14388 to document the issue. |
5db6154
to
f670be0
Compare
I included all the changes and squashed the commits. |
Thanks! |
This is an extension on #14356 that is very helpful for debugging exceptions inside the ConsensusAlgorithm classes. #14356 will already guarantee that we crash on exceptions (in all places that use a ScopedLock), but if we add exception handling inside the ConsensusAlgorithm we can additionally print some useful information about what went wrong. I know we often let exceptions propagate to the user code to be dealt with, but in this case we know we will not make it there, because we abort in the destructor of ScopedLock. So does this make sense?