Report error and abort in CollectiveMutex during stack unwinding #14356

gassmoeller · 2022-10-17T19:32:23Z

When an exception is thrown inside the scope of a CollectiveMutex::ScopedLock, the destructor of ScopedLock tries to unlock the CollectiveMutex. This can lead to hard-to-debug MPI deadlocks: the other MPI ranks are likely trying to communicate with the ranks that threw the exception, while the throwing ranks wait in the MPI barrier that unlocks the mutex. We should not try to communicate while an exception is uncaught, because we do not know whether all processes threw or just a subset.

The current behavior made it very hard to debug geodynamics/aspect#4984.
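To illustrate the idea of not communicating during stack unwinding, here is a minimal sketch without MPI. It is not the code in source/base/mpi.cc: `FakeCollectiveMutex`, `UnwindAwareLock`, and `unlock_calls_after` are hypothetical names, and `unlock()` only counts calls where a real implementation would enter a barrier. The sketch assumes C++17 for `std::uncaught_exceptions()`.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a collective mutex. A real unlock() would
// communicate (for example through a barrier); here it only counts calls.
struct FakeCollectiveMutex
{
  int unlock_calls = 0;

  void unlock()
  {
    ++unlock_calls;
  }
};

// Hypothetical scoped lock that skips the collective unlock when its
// destructor runs because an exception is propagating.
class UnwindAwareLock
{
public:
  explicit UnwindAwareLock(FakeCollectiveMutex &m)
    : mutex(m)
    , exceptions_at_entry(std::uncaught_exceptions())
  {}

  ~UnwindAwareLock()
  {
    // More uncaught exceptions than at construction means this destructor
    // is running as part of stack unwinding: do not communicate.
    if (std::uncaught_exceptions() > exceptions_at_entry)
      return;

    mutex.unlock();
  }

  UnwindAwareLock(const UnwindAwareLock &) = delete;
  UnwindAwareLock &operator=(const UnwindAwareLock &) = delete;

private:
  FakeCollectiveMutex &mutex;
  const int            exceptions_at_entry;
};

// Returns how often unlock() ran for one guarded scope.
int unlock_calls_after(const bool throw_inside)
{
  FakeCollectiveMutex mutex;
  try
    {
      UnwindAwareLock lock(mutex);
      if (throw_inside)
        throw std::runtime_error("failure on this rank");
    }
  catch (const std::runtime_error &)
    {
      // Swallowed only so the demo can inspect the counter.
    }
  return mutex.unlock_calls;
}
```

Comparing the counter against its value at construction, rather than testing for a nonzero value, keeps the guard usable inside code that itself runs during unwinding.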

gassmoeller · 2022-10-18T12:34:28Z

Actually, let me mark this as WIP for now. Even with this change we run into a deadlock. I confirmed that the deadlock does not occur if I call MPI_Abort when an exception is uncaught, but I did not include that in this PR (because I want the exception to propagate up).

I first want to investigate where we get stuck. Afterwards: would it be fair to say that if an exception is raised within a CollectiveMutex, we cannot recover anyway, because other ranks are likely waiting for communication? Should we make sure we do not run into a deadlock in this case by calling MPI_Abort (possibly after printing an error and the exception)? We already do this in the HDF5 output. Or would this be the responsibility of the exception handler that catches the exception further up the call stack?

tjhei · 2022-10-18T14:42:36Z

Yes, I would do the same as what we do for HDF5 and for TimerOutput. I don't think communication in a destructor during stack unwinding is a recoverable situation.

bangerth · 2022-10-19T22:57:14Z

I agree.

source/base/mpi.cc

gassmoeller · 2022-10-20T14:25:03Z

Ok, I think I made the requested changes. We now abort whenever a CollectiveMutex is destroyed or unlocked (from a ScopedLock) while an exception is uncaught. Let me know if you want me to change the error message.
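The "report and abort" behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `abort_hook`, `abort_if_unwinding`, `AbortingLock`, and `count_aborts` are hypothetical names, and the error text is made up. In an MPI program the hook would typically call `MPI_Abort`; here it defaults to `std::abort()` and can be replaced so that the sketch runs without MPI.

```cpp
#include <cassert>
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical abort hook. An MPI code would end the whole job here, for
// example with MPI_Abort; the default below only ends this process.
std::function<void()> abort_hook = [] { std::abort(); };

// Print an error and abort if an exception is currently uncaught.
void abort_if_unwinding()
{
  if (std::uncaught_exceptions() > 0)
    {
      std::cerr << "ERROR: collective lock released while an exception is "
                   "uncaught; aborting to avoid a deadlock.\n";
      abort_hook();
    }
}

// Hypothetical scoped lock: the check runs before any communication.
struct AbortingLock
{
  ~AbortingLock()
  {
    abort_if_unwinding();
    // A real implementation would perform the collective unlock here.
  }
};

// Counts how often the hook fires for one guarded scope. The hook is
// replaced by a counter so the demo itself keeps running.
int count_aborts(const bool throw_inside)
{
  int        aborts        = 0;
  const auto previous_hook = abort_hook;
  abort_hook               = [&aborts] { ++aborts; };

  try
    {
      AbortingLock lock;
      if (throw_inside)
        throw std::runtime_error("failure on this rank");
    }
  catch (const std::runtime_error &)
    {
      // Reached only because the counting hook returns instead of aborting.
    }

  abort_hook = previous_hook;
  return aborts;
}
```

With the default hook the process terminates inside the destructor, so the exception never reaches a handler further up the call stack. That is the trade-off discussed in this thread: the job fails loudly instead of hanging.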

masterleinad

Looks OK to me.

drwells · 2022-10-20T15:23:13Z

/rebuild

Remove communication during stack unwinding.

8695e98

gassmoeller mentioned this pull request Oct 17, 2022

Possible MPI Deadlock in GMG preconditioner geodynamics/aspect#4984

Closed

gassmoeller changed the title ~~Remove communication in CollectiveMutex during stack unwinding~~ [WIP] Remove communication in CollectiveMutex during stack unwinding Oct 18, 2022

bangerth reviewed Oct 19, 2022


gassmoeller force-pushed the avoid_deadlock_in_mutex branch from c2656a1 to 68c5b83 Compare October 20, 2022 13:57

Abort when exception is uncaught.

57c6048

gassmoeller force-pushed the avoid_deadlock_in_mutex branch from 68c5b83 to 57c6048 Compare October 20, 2022 13:57

gassmoeller changed the title ~~[WIP] Remove communication in CollectiveMutex during stack unwinding~~ Report error and abort in CollectiveMutex during stack unwinding Oct 20, 2022

masterleinad approved these changes Oct 20, 2022

drwells added the ready to test label Oct 20, 2022

gassmoeller mentioned this pull request Oct 20, 2022

Catch exceptions in consensus algorithm #14364

Merged

drwells merged commit d89583e into dealii:master Nov 1, 2022

gassmoeller deleted the avoid_deadlock_in_mutex branch May 11, 2023 14:13

gassmoeller restored the avoid_deadlock_in_mutex branch April 19, 2024 14:32
