Failures in parallel tasks do not stop execution #457

Closed
sdcooke opened this Issue Oct 24, 2011 · 16 comments

5 participants

@sdcooke
sdcooke commented Oct 24, 2011

The following test case runs test.sh, which returns an error exit code. Without the @parallel decorator, Fabric aborts and exits as expected after the first failure on the first host (there are two hosts in the 'web_workers' role). With the @parallel decorator, the test_thing task exits before running date, but test_thing2 is then run.

from fabric.api import execute, parallel, roles, run

def test():
    execute(test_thing)
    execute(test_thing2)

@roles('web_workers')
@parallel
def test_thing():
    run('./test.sh')
    run('date')

@roles('web_workers')
@parallel
def test_thing2():
    run('date')

I'm assuming this isn't the intended workflow; if it is, or if I'm making a mistake, I'd be interested to know.

@bitprophet
Member

No, that's not the intended workflow, unless I am mis-remembering the intent (paging @goosemo to aisle 457 -- he might remember more).

What should happen is that test_thing runs to completion on all hosts, and then if any of them died/aborted, overall execution stops and thus test_thing2 would never run at all.

I'll check it out -- thanks for the report.
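The intended behavior described above can be sketched in plain Python. This is an illustrative sketch of the logic, not Fabric's actual implementation — run the task to completion on every host in parallel, then abort before any subsequent task if any host failed:

```python
# Sketch only: simulate "run on all hosts, then abort if any failed".
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host):
    # Stand-in for a per-host task; hosts named "bad*" return a nonzero code.
    return 1 if host.startswith("bad") else 0

def execute_parallel(task, hosts):
    # Let every host run to completion before checking results.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        codes = dict(zip(hosts, pool.map(task, hosts)))
    failed = [h for h, code in codes.items() if code != 0]
    if failed:
        # The equivalent of Fabric's abort(): execution stops here, so a
        # following execute() call (like test_thing2) never runs.
        raise SystemExit("Hosts failed: %s" % ", ".join(failed))
    return codes
```

With this shape, `execute_parallel(run_on_host, ["web1", "bad1"])` raises after both hosts finish, so nothing scheduled afterwards runs.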

@bitprophet bitprophet was assigned Oct 24, 2011
@goosemo
Contributor
goosemo commented Oct 24, 2011

As it's written, the behavior you're seeing is how it works. Return codes are essentially ignored, so that if one host's execution dies, it doesn't knock over the whole task. When tasks have dependencies, I've always either kept them in a single task, or had the second, subordinate task check whether the requisite task did what was needed.
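The "subordinate task checks the requisite task" pattern goosemo mentions can look like the following sketch. The helpers and the sentinel-file convention are hypothetical, not Fabric API — the first task records success, and the second refuses to run without that record:

```python
# Hypothetical dependency-check pattern; the sentinel file is an assumption,
# not something Fabric provides.
import os
import tempfile

SENTINEL = os.path.join(tempfile.gettempdir(), "deploy_step1.done")

def step1():
    # ... do the real work here, then record that it completed.
    open(SENTINEL, "w").close()

def step2():
    # Refuse to run if the requisite task never completed.
    if not os.path.exists(SENTINEL):
        raise RuntimeError("step1 did not complete; refusing to run step2")
    # ... dependent work goes here.
```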

@bitprophet
Member

Just chatted with @goosemo on IRC about this, @sdcooke. When he wrote the original implementation, his use case was of the "I'm running on a ton of hosts and don't want one or two failures to sink the entire run" variety.

However, to be consistent with Fabric's original "fail fast" philosophy, the default behavior needs to be what you and I discussed above -- a parallelized task should abort() at the end if any of its subprocesses failed/aborted. This would prevent any subsequent tasks from executing.

We may need an additional setting to allow overriding this (i.e. at a higher level than warn_only operates), but because it dovetails with #8 and that ticket's related issues (connection failures and other things that can't be wrapped in warn_only), I think we can probably leave that for 1.4 or maybe 1.5. But if there's a strong enough need, adding one env var that changes this behavior in a bugfix release is probably kosher too :)

@sdcooke
sdcooke commented Oct 24, 2011

Yes, I suppose there are two different use cases. In our case, if one host fails we'll end up with inconsistent code running across our servers, which we want to avoid.

In the case I posted above test_thing2 actually ran on both hosts even though the previous task had failed. That makes it possible to end up in situations where steps 1 and 3 have run in a deploy even though step 2 didn't run.

Thanks guys!

@goosemo
Contributor
goosemo commented Oct 24, 2011

I've patched this and if all looks good it'll look like this:
https://gist.github.com/1309985

also here is the branch:
https://github.com/goosemo/fabric/tree/457-failure-checking-for-parallel

@bitprophet
Member

This just got released as 1.3.1. Please let us know if it works, @sdcooke !

@sdcooke
sdcooke commented Oct 25, 2011

That's great - works exactly as I'd expect it to. Thanks guys.

@paulRbr
paulRbr commented May 10, 2016 edited

Hi guys,

I would like to talk about this issue again.

Here is my use case of the parallel paradigm of Fabric:

from fabric.api import execute, parallel, runs_once

@runs_once
def deploy():
    execute(upload)
    execute(update)

@parallel
def upload():
    # upload app artifact
    pass

@parallel(pool_size=amount_servers / 4)
def update():
    # stop and start app
    pass

called like the following:

> fab -R servers deploy

I would expect Fabric to stop the parallel execution of the update task whenever a host exits with a non-zero code, or at least to stop after the end of the current pool. But it does not: Fabric executes the update task on every host regardless of the exit code.

That's quite problematic if the app could not start in the update task, as we end up stopping all instances of the app.

Do you have an idea on how to solve this?

@paulRbr
paulRbr commented May 19, 2016

ping @bitprophet @goosemo -- should I open a new issue with my comment above?

@jsalva
jsalva commented Jul 29, 2016

Having a similar issue. +1 on a solution

@goosemo
Contributor
goosemo commented Jul 29, 2016

@paulRbr just now looking at this, sorry for the delay. But have you seen this PR? #1471

It potentially solves your issue, if I'm understanding correctly.

@jsalva
jsalva commented Jul 29, 2016 edited

It turns out this PR fixes one, but not all, of my problems.

My issue is that the instances are attached to a load balancer on AWS and can be terminated at random. In testing, it seems that a terminated instance does not cause a failure; Fabric just hangs in perpetuity, waiting for the now non-existent host to finish.

@goosemo
Contributor
goosemo commented Jul 29, 2016 edited

@jsalva interesting, are they being killed in the middle of a workload? Or is it more that the host list is stale at the time of execution?

@paulRbr
paulRbr commented Jul 29, 2016

Hi @goosemo, oh that's a nice PR. It does exactly what we need @captaintrain! Thanks!

Here's the output without the PR:

> fab -H local1,local2 deploy
[local1] Executing task 'deploy'
[local1] Executing task 'upload'
[local2] Executing task 'upload'
[localhost] local: # hello upload
[localhost] local: # hello upload
[local1] Executing task 'update'
[local2] Executing task 'update'
[localhost] local: # hello update
[localhost] local: # hello update    < I DON'T want this second one to run as the first update fails

Fatal error: One or more hosts failed while executing task 'update'

Aborting.

And now the output with the PR and the --parallel-exit-on-errors flag:

> fab --parallel-exit-on-errors -H local1,local2 deploy
[local1] Executing task 'deploy'
[local1] Executing task 'upload'
[local2] Executing task 'upload'
[localhost] local: # hello upload
[localhost] local: # hello upload
[local1] Executing task 'update'
[local2] Executing task 'update'
[localhost] local: # hello update     < \o/ It ran only once!

Fatal error: One or more hosts failed while executing task 'update'

Aborting.
@jsalva
jsalva commented Jul 31, 2016

@goosemo I'm intentionally terminating the hosts mid run()/sudo(), so mid-workload rather than stale hosts

@jsalva
jsalva commented Jul 31, 2016

It looks like we could, in principle, detect the state where transport.send_ignore() raises a ProxyCommandFailure and use that to signal that the connection is closed, terminating the queued process with a failed status code of some sort.
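Short of hooking paramiko's transport, one generic way to avoid waiting forever on a vanished host is a join-with-timeout watchdog around each per-host worker. This is a sketch of that pattern only, not Fabric's implementation, and the timeout value is an assumption:

```python
# Sketch: treat a per-host worker that exceeds a deadline as failed,
# instead of blocking forever on a terminated host.
import threading

def run_with_timeout(task, host, timeout):
    result = {}

    def worker():
        result["code"] = task(host)

    t = threading.Thread(target=worker)
    t.daemon = True  # don't keep the whole process alive for a dead host
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The host never came back within the deadline; mark it failed
        # so the overall run can abort rather than hang.
        return (host, "timed out")
    return (host, result["code"])
```

A real fix would still want to tear down the underlying SSH connection, but the deadline alone is enough to turn a hang into a reportable failure.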
