Report error and abort in CollectiveMutex during stack unwinding #14356
Actually, let me mark this WIP for now. Even with this change we run into a deadlock. I confirmed that we do not get into it if I call [...]. I first want to investigate where we get stuck, but afterwards: would it be fair to say that if an exception is raised within a CollectiveMutex, we cannot recover anyway, because other ranks are likely waiting for communication? Should we make sure we do not run into a deadlock in this case by calling [...]?
Yes, I would do the same as what we do for HDF5 and for TimerOutput. Communication in a destructor is not going to be a recoverable situation, I think.
I agree.
OK, I think I made the requested changes. We now abort whenever a CollectiveMutex is destroyed or unlocked (from a ScopedLock) while an exception is uncaught. Let me know if you want me to change the error message.
Looks OK to me.
/rebuild
When we throw an exception inside a CollectiveMutex::ScopedLock, the destructor of ScopedLock tries to unlock the CollectiveMutex. This can lead to hard-to-debug MPI deadlocks, because other MPI ranks are likely trying to communicate with the ranks that threw the exception, while those ranks wait in the MPI barrier to unlock the mutex. We should not try to communicate while an exception is uncaught, because we do not know whether all processes threw or just a subset.

The current behavior made it very hard to debug geodynamics/aspect#4984.
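The idea described above can be sketched without MPI. The following is only an illustration under stated assumptions, not the code in this PR: `ToyCollectiveMutex`, its `FatalHandler`, and the `barriers_entered` counter are hypothetical names, the "barrier" is a plain counter standing in for the collective call, and the check uses `std::uncaught_exceptions()`, which requires C++17. The destructor of the scoped lock compares the number of uncaught exceptions with the number seen at construction and, if it has grown, reports an error instead of entering the collective unlock.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a collective mutex; NOT the deal.II class.
class ToyCollectiveMutex
{
public:
  using FatalHandler = std::function<void(const std::string &)>;

  // By default a fatal error prints a message and aborts. A real MPI
  // program would more likely call MPI_Abort here (assumption).
  explicit ToyCollectiveMutex(
    FatalHandler handler = [](const std::string &msg) {
      std::cerr << msg << std::endl;
      std::abort();
    })
    : on_fatal(std::move(handler))
  {}

  void lock() { locked = true; }

  void unlock()
  {
    // Placeholder for the collective step (e.g. a barrier) that every
    // rank would have to reach.
    ++barriers_entered;
    locked = false;
  }

  class ScopedLock
  {
  public:
    explicit ScopedLock(ToyCollectiveMutex &m)
      : mutex(m)
      , exceptions_at_entry(std::uncaught_exceptions())
    {
      mutex.lock();
    }

    ~ScopedLock()
    {
      // More uncaught exceptions than at construction means this
      // destructor is running as part of stack unwinding.
      if (std::uncaught_exceptions() > exceptions_at_entry)
        mutex.on_fatal("Exception thrown while holding a collective mutex; "
                       "refusing to communicate during stack unwinding.");
      else
        mutex.unlock();
    }

  private:
    ToyCollectiveMutex &mutex;
    int                 exceptions_at_entry;
  };

  int  barriers_entered = 0;
  bool locked           = false;

private:
  FatalHandler on_fatal;
};
```

The fatal handler is injectable only so that the sketch can be exercised without terminating the process; with the default handler, an exception escaping a locked scope ends the program immediately rather than leaving the remaining ranks waiting.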