Failures in parallel tasks do not stop execution #457

Closed
sdcooke opened this Issue Oct 24, 2011 · 16 comments


sdcooke commented Oct 24, 2011

The following test case runs test.sh, which returns an error exit code. Without the @parallel decorator, Fabric aborts as expected after the first failure on the first host (there are two hosts in the 'web_workers' role). With the @parallel decorator, the test_thing task exits before running date, but test_thing2 is then run anyway.

from fabric.api import execute, parallel, roles, run

def test():
    execute(test_thing)
    execute(test_thing2)

@roles('web_workers')
@parallel
def test_thing():
    run('./test.sh')  # exits with an error code
    run('date')

@roles('web_workers')
@parallel
def test_thing2():
    run('date')

I'm assuming this isn't the intended workflow; if it is, or if I'm making a mistake, I'd be interested to know what it is.

bitprophet (Member) commented Oct 24, 2011

No, that's not the intended workflow, unless I am mis-remembering the intent (paging @goosemo to aisle 457 -- he might remember more).

What should happen is that test_thing runs to completion on all hosts, and then if any of them died/aborted, overall execution stops and thus test_thing2 would never run at all.

I'll check it out -- thanks for the report.
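The intended semantics described above can be sketched in plain Python. This is an illustration only, not Fabric's actual implementation (which uses multiprocessing under the hood); run_on_host and the host names are made up:

```python
# Illustrative sketch of the intended semantics: run the task on all
# hosts in parallel, let every host finish, then abort before any
# subsequent task if one of them failed.
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host):
    # Stand-in for running the task on one host; returns its exit code.
    return 1 if host == "web2" else 0

def execute_parallel(task, hosts):
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        codes = list(pool.map(task, hosts))  # all hosts run to completion
    if any(code != 0 for code in codes):
        raise SystemExit("One or more hosts failed while executing task")
    return codes
```

With hosts ['web1', 'web2'] this raises SystemExit only after both hosts have finished, so a following execute() call is never reached.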


ghost assigned bitprophet Oct 24, 2011

goosemo commented Oct 24, 2011

As it's written, the behavior you're seeing is expected. Return codes are essentially ignored, so that if one host's execution dies, it doesn't knock over the whole task. When tasks have dependencies, I've always kept them in a single task, or had the second, subordinate task check that the requisite task did what was needed.
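The second workaround mentioned above can be sketched in plain Python. These are stand-in functions, not a real fabfile; on a real host the subordinate task would probe for the artifact (e.g. with a run() check) rather than read a local flag:

```python
# Sketch of the "check the prerequisite" workaround: the subordinate
# task verifies that the requisite task did what was needed before it
# acts, so step 3 cannot run when step 2 silently failed.
state = {"uploaded": False}

def upload():
    # Requisite task: pretend to push the artifact and record success.
    state["uploaded"] = True

def update():
    # Subordinate task: refuse to act if the prerequisite never ran.
    if not state["uploaded"]:
        raise RuntimeError("upload did not complete; refusing to update")
    return "updated"
```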


bitprophet (Member) commented Oct 24, 2011

Just chatted with @goosemo on IRC about this, @sdcooke. When he wrote the original implementation, his use case was of the "I'm running on a ton of hosts and don't want one or two failures to sink the entire run" variety.

However, to be consistent with Fabric's original "fail fast" philosophy, the default behavior needs to be what you and I discussed above -- a parallelized task should abort() at the end if any of its subprocesses failed/aborted. This would prevent any subsequent tasks from executing.

We may need an additional setting to allow overriding this (i.e. at a higher level than warn_only operates), but because it dovetails with #8 (connection failures and other things that can't be wrapped in warn_only) and that ticket's related issues, I think we can probably leave that for 1.4 or maybe 1.5. If there's a strong enough need, though, adding one env var in a bugfix release that changes this behavior is probably kosher too :)


sdcooke commented Oct 24, 2011

Yes, I suppose it's two different use cases. In our case, if one host fails we'll end up with inconsistent code running across our servers, which we want to avoid.

In the case I posted above test_thing2 actually ran on both hosts even though the previous task had failed. That makes it possible to end up in situations where steps 1 and 3 have run in a deploy even though step 2 didn't run.

Thanks guys!


goosemo commented Oct 24, 2011

I've patched this; if all looks good, it'll look like this:
https://gist.github.com/1309985

also here is the branch:
https://github.com/goosemo/fabric/tree/457-failure-checking-for-parallel


bitprophet (Member) commented Oct 24, 2011

This just got released as 1.3.1. Please let us know if it works, @sdcooke !


sdcooke commented Oct 25, 2011

That's great - works exactly as I'd expect it to. Thanks guys.


paulRbr commented May 10, 2016

Hi guys,

I'd like to revisit this issue.

Here is my use case for Fabric's parallel paradigm:

@runs_once
def deploy():
    execute(upload)
    execute(update)

@parallel
def upload():
    # upload app artifact
    pass

@parallel(pool_size=amount_servers / 4)
def update():
    # stop and start app
    pass

called like the following:

> fab -R servers deploy

I would expect Fabric to stop the parallel execution of the update task whenever a host exits with a non-zero code, or at least to stop after the end of the current pool. But it does not: Fabric executes the update task on every host regardless of the exit code.

That's quite problematic: if the app fails to start during the update task, we end up stopping every instance of the app.

Do you have an idea on how to solve this?
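The "at least stop after the end of the current pool" behaviour described above can be sketched like this. The names and failure condition are hypothetical stand-ins, not Fabric internals:

```python
# Sketch of pool-chunked execution that stops scheduling new chunks
# once a host in the current chunk fails. Hosts in the failing chunk
# still finish, but later chunks never start.
from concurrent.futures import ThreadPoolExecutor

def update(host):
    return 1 if host == "app3" else 0  # pretend app3 fails to restart

def execute_chunked(task, hosts, pool_size):
    results = []
    for i in range(0, len(hosts), pool_size):
        chunk = hosts[i:i + pool_size]
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            codes = list(pool.map(task, chunk))
        results.extend(zip(chunk, codes))
        if any(code != 0 for code in codes):
            break  # finish the current chunk, never start the next one
    return results
```

With hosts app1 through app5 and pool_size=2, app3's failure means app5 is never touched, so at least some instances of the app stay up.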


paulRbr commented May 19, 2016

ping @bitprophet @goosemo: should I open a new issue with my comment above?


jsalva commented Jul 29, 2016

Having a similar issue. +1 on a solution


goosemo commented Jul 29, 2016

@paulRbr just now looking at this, sorry for the delay. Have you seen PR #1471?

It potentially solves your issue, if I'm understanding correctly.


jsalva commented Jul 29, 2016

It turns out this PR fixes one, but not all, of my problems.

My issue is that the instances are attached to an AWS load balancer and can be terminated at random. In testing, it seems the terminated instance does not cause a failure; Fabric just hangs in perpetuity, waiting for the now non-existent host to finish.


goosemo commented Jul 29, 2016

@jsalva interesting, are they being killed in the middle of a workload? Or is it more that the host list is stale at the time of execution?


paulRbr commented Jul 29, 2016

Hi @goosemo, oh that's a nice PR. It does exactly what we need @captaintrain! Thanks!

Here's the output without the PR:

> fab -H local1,local2 deploy
[local1] Executing task 'deploy'
[local1] Executing task 'upload'
[local2] Executing task 'upload'
[localhost] local: # hello upload
[localhost] local: # hello upload
[local1] Executing task 'update'
[local2] Executing task 'update'
[localhost] local: # hello update
[localhost] local: # hello update    < I DON'T want this second one to run as the first update fails

Fatal error: One or more hosts failed while executing task 'update'

Aborting.

And now the output with the PR and the --parallel-exit-on-errors flag:

> fab --parallel-exit-on-errors -H local1,local2 deploy
[local1] Executing task 'deploy'
[local1] Executing task 'upload'
[local2] Executing task 'upload'
[localhost] local: # hello upload
[localhost] local: # hello upload
[local1] Executing task 'update'
[local2] Executing task 'update'
[localhost] local: # hello update     < \o/ It ran only once!

Fatal error: One or more hosts failed while executing task 'update'

Aborting.

jsalva commented Jul 31, 2016

@goosemo I'm intentionally terminating the hosts mid run()/sudo(), so mid-workload rather than stale hosts


jsalva commented Jul 31, 2016

It looks like we could, in principle, detect the state where transport.send_ignore() raises a ProxyCommandFailure and use that to signal that the connection is closed, terminating the queued process with a failed status code of some sort.
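That detection idea could look roughly like this. FakeTransport and ConnectionClosed are stand-ins for paramiko's Transport and ProxyCommandFailure, since the real probe needs a live SSH connection:

```python
# Sketch of the proposed liveness probe: send a keep-alive-style packet
# and treat an exception as "connection closed", so the parallel runner
# can mark the host failed instead of waiting forever.
class ConnectionClosed(Exception):
    pass

class FakeTransport:
    def __init__(self, alive=True):
        self.alive = alive

    def send_ignore(self):
        # paramiko's Transport.send_ignore() sends an SSH_MSG_IGNORE
        # packet; here a dead transport simply raises instead.
        if not self.alive:
            raise ConnectionClosed("host went away mid-run")

FAILED = 1  # sentinel exit code for "host died mid-run"

def poll_host(transport):
    """Return None while the host is alive, else a failure exit code."""
    try:
        transport.send_ignore()
        return None
    except ConnectionClosed:
        return FAILED
```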

