live module cascading failures #638

Closed
garlick opened this issue Apr 8, 2016 · 4 comments
@garlick
Member

garlick commented Apr 8, 2016

If a broker slows down for a while then returns to normal responsiveness, the following may happen (test by sending SIGSTOP to a broker, then SIGCONT):

$ ./flux start -s64 
$ kill -STOP 37817
$ flux-start: 1 (pid 37817) Stopped (signal)

The broker's TBON parent may mark it 'failed', and its children may reparent. This is normal.

[1460150810.190432] live.crit[0]: transitioning 1 from OK to SLOW
[1460150814.190372] live.crit[0]: transitioning 1 from SLOW to FAIL
[1460150814.193420] broker.crit[3]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)
[1460150814.196075] broker.crit[4]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)

When the broker becomes responsive again, it may decide that its own TBON children have failed. This is wrong.

$ kill -CONT 37817
flux-start: 1 (pid 37817) Continued
jimbo /home/garlick/proj/flux-core/src/cmd >
[1460150826.819672] live.crit[1]: transitioning 3 from OK to SLOW
[1460150826.819703] live.crit[1]: transitioning 4 from OK to SLOW
[1460150826.820033] live.crit[1]: transitioning 4 from SLOW to FAIL
[1460150826.819672] live.crit[1]: transitioning 3 from OK to SLOW
[1460150826.820007] live.crit[1]: transitioning 3 from SLOW to FAIL

This triggers their children to reparent:

[1460150826.826241] broker.crit[8]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)
[1460150826.827488] broker.crit[9]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)
[1460150826.828682] broker.crit[10]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)
[1460150826.829373] broker.crit[7]: reparent ipc:///tmp/flux-37789-FbuCX8/0/req (new)
[1460150828.189251] live.crit[0]: transitioning 1 from FAIL to OK

Further, the second set of failed nodes never recovers; they stay in the fail state:

$ flux up
ok:     [0-2,5-63]
slow:   
fail:   [3-4]
unknown:
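
For anyone tracing this by hand, here is a minimal sketch (not part of flux-core) that replays the `live.crit` transition lines quoted above and derives a `flux up`-style summary. The helper names `final_states`, `idset`, and `summarize` are hypothetical. It assumes that the last logged transition for a rank wins and that any rank with no logged transition is OK.

```python
import re

# live.crit transition lines copied from the logs in this report.
LOG = """\
[1460150810.190432] live.crit[0]: transitioning 1 from OK to SLOW
[1460150814.190372] live.crit[0]: transitioning 1 from SLOW to FAIL
[1460150826.819672] live.crit[1]: transitioning 3 from OK to SLOW
[1460150826.819703] live.crit[1]: transitioning 4 from OK to SLOW
[1460150826.820033] live.crit[1]: transitioning 4 from SLOW to FAIL
[1460150826.819672] live.crit[1]: transitioning 3 from OK to SLOW
[1460150826.820007] live.crit[1]: transitioning 3 from SLOW to FAIL
[1460150828.189251] live.crit[0]: transitioning 1 from FAIL to OK
"""

# search() rather than match(), so a shell prompt fused onto the
# front of a log line does not prevent it from being parsed.
TRANSITION = re.compile(
    r"live\.crit\[\d+\]: transitioning (\d+) from (\w+) to (\w+)"
)


def final_states(log_text):
    """Map rank -> last state seen, processing lines in log order."""
    states = {}
    for line in log_text.splitlines():
        m = TRANSITION.search(line)
        if m:
            rank, _old, new = m.groups()
            states[int(rank)] = new
    return states


def idset(ranks):
    """Compress ranks into a bracketed range list, e.g. [0-2,5-63]."""
    ranks = sorted(ranks)
    if not ranks:
        return ""
    spans = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r != prev + 1:
            spans.append((start, prev))
            start = r
        prev = r
    spans.append((start, prev))
    body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)
    return f"[{body}]"


def summarize(states, size):
    """Group ranks 0..size-1 by state. ASSUMPTION: unlogged ranks are OK."""
    groups = {"ok": [], "slow": [], "fail": [], "unknown": []}
    for rank in range(size):
        groups[states.get(rank, "OK").lower()].append(rank)
    return {name: idset(ranks) for name, ranks in groups.items()}


if __name__ == "__main__":
    for name, ranks in summarize(final_states(LOG), 64).items():
        print(f"{name + ':':<8}{ranks}")
```

Replaying the log this way puts rank 1 back in ok and leaves ranks 3 and 4 in fail, which agrees with the `flux up` output above: nothing ever logs a FAIL to OK transition for 3 or 4.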
@garlick
Member Author

garlick commented Apr 12, 2016

I'm going to suggest that for the 0.3.0 release we avoid loading the live module. Its efforts to keep the session wired up minus failed nodes will be in vain without some higher-level mechanism to shrink the job. Its design also assumed a separate event "bus", but we've disabled epgm by default and are distributing events via the TBON.

@grondo
Contributor

grondo commented Apr 12, 2016

That seems fine. However, to avoid confusion it might be useful to have a fallback for flux up, or to have the command check for the existence of the live module, or in some other way indicate that the "up" service isn't functional.

@garlick
Member Author

garlick commented Apr 12, 2016

Yeah, we can assume that the entire session is up if there's nothing monitoring liveness to tell us otherwise, so that's easy.
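
A rough sketch of that fallback, in Python for brevity rather than as a proposed flux-core patch. The function name `up_report` and the shape of the `liveness` argument are hypothetical: `None` stands for "no liveness monitor is loaded", and otherwise it is assumed to be a dict mapping rank to one of "ok", "slow", or "fail". Treating ranks missing from that dict as "unknown" is also an assumption.

```python
def up_report(size, liveness=None):
    """Group ranks 0..size-1 by state.

    liveness is None      -> nothing is monitoring liveness, so the
                             whole session is assumed to be up.
    liveness is a mapping -> use the reported state per rank; ranks
                             the monitor says nothing about are
                             "unknown" (ASSUMPTION, see lead-in).
    """
    report = {"ok": [], "slow": [], "fail": [], "unknown": []}
    for rank in range(size):
        if liveness is None:
            state = "ok"
        else:
            state = liveness.get(rank, "unknown")
        report[state].append(rank)
    return report


if __name__ == "__main__":
    # No monitor loaded: every rank is reported ok.
    print(up_report(4))
    # Monitor loaded: use what it reports.
    print(up_report(4, {0: "ok", 1: "ok", 2: "ok", 3: "fail"}))
```

The point of the `None` case is that the caller does not have to distinguish "monitor says all ok" from "no monitor": both produce a report with every rank under ok.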

garlick added a commit to garlick/flux-core that referenced this issue Apr 12, 2016
As discussed in flux-framework#638, the live module
- cannot function without pgm event distribution (default: off)
- cannot improve reliability without higher level "shrink" operation
Therefore, let's disable it by default.
garlick added a commit to garlick/flux-core that referenced this issue Apr 13, 2016
As discussed in flux-framework#638, the live module
- cannot function without pgm event distribution (default: off)
- cannot improve reliability without higher level "shrink" operation
Therefore, let's disable it by default.
@garlick
Member Author

garlick commented Dec 28, 2016

Live module is no longer loaded by default. It should probably just be removed. Somebody please euthanize it.

@chu11 chu11 added this to the release 0.7.0 milestone Mar 14, 2017
@chu11 chu11 self-assigned this Mar 14, 2017
@chu11 chu11 removed this from the release 0.7.0 milestone Mar 14, 2017
chu11 added a commit to chu11/flux-core that referenced this issue Mar 14, 2017