Exceptions don't trigger aborts in parallel mode #649

Closed
bitprophet opened this Issue May 22, 2012 · 4 comments

Projects

None yet

2 participants

@bitprophet
Member

We have code in place to abort() when any child processes in parallel mode encounter exceptions / return nonzero return codes. However, this doesn't appear to be working, a user noticed EOFErrors in parallel children under Jenkins wasn't failing the build. I verified this on my end.

Writing a test to hit this now, then will see what the actual bug is.

@bitprophet
Member

Looks like this was triggered by fixing #572 -- actual SystemExit events, including abort, inside a parallel task, does raise and cause the task to fail. Or rather, #572 failed to fully fix the problem -- it fixed it only for aborts, other exceptions don't get raised.

Captured and returned, yes, but the actual subprocess still exits cleanly, and then since we test for unclean exiting and not "is the return value an exception" in the parent process, we don't then abort ourselves.

I think the solution here is to ensure that the exception handling signals the end of the child process so it actually reraises the exception to terminate. As long as it's successfully handed the exception to the shared queue object, this should be fine.

EDIT: And actually, this is kind of funny -- the exception being handed back now becomes pointless as any exception would have triggered a nonzero exit code, which then triggers execute()'s internal abort() call. So we really did lock it in to "only abort in the parent if somebody called abort() explicitly in the child".

I don't think that's entirely correct at this point -- it could be for library use, but not for CLI use. Really, it needs a warn_only style toggle.

However, in the meantime, I can at least ensure the data passed from child processes to the parent is accurate re: return code. What we do with that info is still up in the air.

@bitprophet bitprophet added a commit that referenced this issue May 23, 2012
@bitprophet bitprophet Add failing test re #649 fdc27c8
@bitprophet bitprophet added a commit that closed this issue May 23, 2012
@bitprophet bitprophet Don't swallow exceptions in child processes.
Also makes parallel execute() honor warn_only.

Fixes #649
82d94d7
@bitprophet bitprophet added a commit that referenced this issue May 23, 2012
@bitprophet bitprophet Changelog re #649 514c1b1
@bitprophet
Member

Ended up using utils.error() which automatically honors warn_only for us, so users that did need the exception-returning behavior may still do so (tho they'll need to explicitly turn warn_only on -- given that this is a pretty glaring bug, I think they can deal with it).

The only lossy bit left is that the exit code is not preserved post-execute, but we can't easily add it into execute's return value w/o breaking backwards compat, and it'd almost always just be 1 anyways. The exception itself is much more useful.

@jberryman

I can't follow the above, but have been noticing errors and aborts in one "thread" not propogating to the other threads in parallel mode. I then have to do a CTRL-C, at which point an atexit handler fails to run for some reason. That function tells me what resources I provisioned on that run, so it would be great if fabric only flaked out in one of those two areas.

EDIT I may be confused by this as well as some vagaries of how atexit works with ctrl-c

@bitprophet
Member

@jberryman - yes, the parallel subprocesses (not threads - fwiw :)) do not communicate at all so you're seeing expected behavior. And atexit is one of those things that's difficult to fully account for, unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment