live module cascading failures #638
Comments
I suggest that for the 0.3.0 release we avoid loading the live module, for two reasons. First, its efforts to keep the session wired up minus failed nodes will be in vain without some higher-level mechanism to shrink the job. Second, its design assumed a separate event "bus", but we've disabled epgm by default and now distribute events via the TBON.
That seems fine; however, it might be useful to have a fallback for
Yeah, we can assume that the entire session is up if there's nothing monitoring liveness to tell us otherwise, so that's easy.
As discussed in flux-framework#638, the live module
- cannot function without pgm event distribution (default: off)
- cannot improve reliability without higher level "shrink" operation

Therefore, let's disable it by default.
Live module is no longer loaded by default. It should probably just be removed. Somebody please euthanize it.
If a broker slows down for a while then returns to normal responsiveness, the following may happen (test by sending SIGSTOP to a broker, then SIGCONT):
The broker's TBON parent may mark it 'failed', and its children may reparent. This is normal.
When the broker becomes responsive, it may decide its TBON children have failed. This is wrong.
This triggers their children to reparent:
Further, the second set of failed nodes remains marked failed.
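The SIGSTOP/SIGCONT test described above can be scripted. This is a minimal sketch for a POSIX system, not part of the project's test suite: the helper name `stall_process` is hypothetical, and you would need to look up the broker's PID yourself and pick a stall long enough for its TBON parent to mark it failed.

```python
import os
import signal
import time


def stall_process(pid: int, seconds: float) -> None:
    """Suspend process `pid` with SIGSTOP, wait, then resume it with SIGCONT.

    Hypothetical helper: simulates a broker that becomes unresponsive
    for `seconds` and then returns to normal.
    """
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

For example, `stall_process(broker_pid, 30.0)` would pause one broker for 30 seconds, where `broker_pid` is the PID of the broker under test.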