Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch exceptions in consensus algorithm #14364

Merged

Conversation

gassmoeller
Copy link
Member

This is an extension on #14356 that is very helpful for debugging exceptions inside the ConsensusAlgorithm classes. #14356 will already guarantee that we crash on exceptions (in all places that use a ScopedLock), but if we add exception handling inside the ConsensusAlgorithm we can additionally print some useful information about what went wrong. I know we often let exceptions propagate to the user code to be dealt with, but in this case we know we will not make it there, because we abort in the destructor of ScopedLock. So does this make sense?

@gassmoeller
Copy link
Member Author

I have marked this WIP, because I am waiting for some test results. The code should be more or less ready (pending review).

Copy link
Member

@peterrum peterrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. What is the motivation for this? I guess an exception is throw within one of the lambdas passed to CA?

Comment on lines 1686 to 1688
// 2) Until all posted receive operations are known to have
// completed,
// answer requests and keep checking whether all requests of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments are now not well formatted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you still fix this?

Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me if you update the comments (and remove the WIP marker in the subject line of the PR).

I wonder, out of curiosity, where you encountered the exception? None of the user-side callbacks should be throwing exceptions, of course, but we're not enforcing this. I'm curious what kind of problem you ran into that had these functions throw exceptions...

Comment on lines 1686 to 1688
// 2) Until all posted receive operations are known to have
// completed,
// answer requests and keep checking whether all requests of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you still fix this?

Comment on lines 1708 to 1710
// 3) Signal to all other processes that all requests of this
// process
// have been answered
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here and below

@gassmoeller gassmoeller changed the title [WIP] Catch exceptions in consensus algorithm Catch exceptions in consensus algorithm Nov 1, 2022
@gassmoeller
Copy link
Member Author

I have updated the comments.

We are still hunting for the source of the bug (there are at least two different exceptions being thrown). I have added all we know so far in geodynamics/aspect#4984. In short the exception is thrown by boost::archive::binary_iarchive within Utilities::unpack.

@gassmoeller
Copy link
Member Author

But indepent of the solution to the actual bug, this PR has already been helpful for us. In my opinion we do not need to wait with this, if you agree it can be useful.

@bangerth
Copy link
Member

bangerth commented Nov 2, 2022

Please squash the commits (at least the last two) and I'll merge. Thank you!

@bangerth
Copy link
Member

bangerth commented Nov 2, 2022

/rebuild

Comment on lines 1485 to 1492
handle_exception(std::exception_ptr exception, const MPI_Comm &comm)
{
# ifdef DEAL_II_WITH_MPI
// an exception within a ConsensusAlgorithm likely causes an
// MPI deadlock. Abort with a reasonable error message instead.
try
{
std::rethrow_exception(exception);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the tests also uncovered this issue:

/jenkins/workspace/dealii-tidy_PR-14364/include/deal.II/base/mpi_consensus_algorithms.h:1492:38: error: parameter 'exception' is passed by value and only copied once; consider moving it to avoid unnecessary copies [performance-

You're using std::current_exception() for the exception argument in the calling places. I think this might best be written in the following way, with an rvalue reference as argument:

Suggested change
handle_exception(std::exception_ptr exception, const MPI_Comm &comm)
{
# ifdef DEAL_II_WITH_MPI
// an exception within a ConsensusAlgorithm likely causes an
// MPI deadlock. Abort with a reasonable error message instead.
try
{
std::rethrow_exception(exception);
handle_exception(std::exception_ptr &&exception, const MPI_Comm &comm)
{
# ifdef DEAL_II_WITH_MPI
// an exception within a ConsensusAlgorithm likely causes an
// MPI deadlock. Abort with a reasonable error message instead.
try
{
std::rethrow_exception(std::move(exception));

@bangerth
Copy link
Member

bangerth commented Nov 2, 2022

I added #14388 to document the issue.

@gassmoeller gassmoeller force-pushed the catch_exceptions_in_ConsensusAlgorithm branch from 5db6154 to f670be0 Compare November 3, 2022 13:18
@gassmoeller
Copy link
Member Author

I included all the changes and squashed the commits.

@bangerth bangerth merged commit 1c182b6 into dealii:master Nov 4, 2022
@bangerth
Copy link
Member

bangerth commented Nov 4, 2022

Thanks!

@gassmoeller gassmoeller deleted the catch_exceptions_in_ConsensusAlgorithm branch November 4, 2022 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants