Parallel-mode hanging instead of running tasks #600

Closed
bitprophet opened this Issue Mar 30, 2012 · 15 comments


@bitprophet
Member

Original comment by @lruslan from #568 follows. I've seen related reports on IRC with no useful debugging; didn't see other recent tickets but some might be out there.


Hi, I have a weird problem with Fabric 1.4.0 that seems related to the current thread. I need to execute different types of subtasks inside a single Fabric task: some need to run in serial mode, others in parallel. When I execute a parallel task after a serial one, Fabric gets stuck in a limbo state. Can someone tell me whether this is already a known issue, or whether I'm doing something wrong here? Otherwise I'll open an issue. Here is an example of the code I use http://pastie.org/private/4p6n4301c1hscwgjm2zva and here is the debug output I get when I call fab --show debug --fabfile=test.py main http://pastie.org/private/xth5kr5cm3jpx3yzc2waa

_HOSTS = [ "friendslowslave4", "friendslowslave5" ] 

@task
def showHostnameSerial():
    cmd = "hostname"
    if run(cmd).failed:
        return True
    return False

@task
@parallel
def showHostnameParallel():
    cmd = "hostname"
    if sudo(cmd).failed:
        return True
    return False

@task
def main(): 
    results = execute(showHostnameSerial, hosts=_HOSTS)
    print results
    results = execute(showHostnameParallel, hosts=_HOSTS)
    print results
@bitprophet
Member

@lruslan - what happens when you run the above fabfile with the first execute line commented out? Does the parallel task still hang, or does it work? I will try to replicate this myself when I have time.

@lruslan
lruslan commented Mar 30, 2012

It works fine with parallel only, and it also works when I execute the parallel task first and then the serial one. Only serial -> parallel makes Fabric get stuck.

@bitprophet
Member

Great, thanks for the details, that's very helpful.

@lruslan
lruslan commented Apr 10, 2012

I think this issue is related to the multiprocessing library we use.

We have the problem only on systems with Python 2.5.2 and the multiprocessing module 2.6.2.1 installed from http://code.google.com/p/python-multiprocessing/

On Python 2.7 (with the built-in multiprocessing) this issue cannot be reproduced.
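
(For anyone hitting this, a quick, hedged way to check which multiprocessing you are actually importing - the stdlib one or the Google Code backport - is something like the following; the backport may or may not declare a version string, hence the getattr:)

import multiprocessing

# Where the module was loaded from (a site-packages path suggests the backport):
print multiprocessing.__file__
# Version string, if the module declares one:
print getattr(multiprocessing, '__version__', 'unknown')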

@sitaktif

Hi, I've got the same issue - except on a recent version. In certain conditions, Fabric hangs when running a task serially and a task in parallel one after the other.

This happens 100% of the time and even with only one host in parallel. See the code below.

Versions:

$ python --version
Python 2.7.7
$ fab --version
Fabric 1.8.2
Paramiko 1.12.2

And the fabfile:

#!/usr/bin/env python
# encoding: utf-8

# Minimal example to break fabric and Paramiko.
# Paramiko will display the following message and block:
#
#   Success for unrequested channel! [??]
#

import os
import logging
from fabric.api import task, parallel, env, run
logging.getLogger('paramiko.transport').setLevel(logging.DEBUG)
logging.getLogger('paramiko.transport').addHandler(logging.StreamHandler())

env.hosts = ['localhost']

ORIG_USER = os.getlogin()
OTHER_USER = 'my_other_user' # User needs to exist and to be able to ssh to localhost

@task
def serial():
    env.user = OTHER_USER
    run('echo serial')
    env.user = ORIG_USER # The bug does not happen without that line

@task
@parallel
def parallel():
    env.user = OTHER_USER # The bug does not happen without that line, either
    run('echo parallel')

You need to replace my_other_user with a real other user on the system.

Here is the output https://gist.github.com/sitaktif/aab296f619bcec5b45f8

There is an error from Paramiko saying "Success for unrequested channel! [??]". I've dug in a little and it seems that, basically, Fabric is asked by Paramiko to create a channel in a separate process (say P2), but P2 does not have the channel listed in its channels weakref dict (P1 does, though). Because of that, the function in P2 returns without finishing the channel setup (in paramiko/transport.py:_parse_channel_open_success), which makes P1 hang forever (in paramiko/transport.py:open_channel()).

I could not really see why the channel weakref dict does not contain the channel in P2 in that case; any ideas? What is really weird is that this bug only happens when changing users around.
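
(As a toy illustration, outside Fabric and with hypothetical names: after a fork, the child process starts from a copy of the parent's in-memory state, so any cached client object - and the OS socket behind it - ends up effectively shared between the two processes:)

import multiprocessing

# Stand-in for Fabric's connection cache; the value represents a live
# SSHClient whose underlying socket the forked child shares with the parent.
cache = {'user1@localhost:22': object()}

def worker():
    # The child sees a snapshot of the parent's cache, stale entry included.
    print 'child sees:', cache.keys()

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()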

@moeffju
moeffju commented Jul 15, 2014

I’m having the same issue with Fabric 1.7.0, Paramiko 1.11.0. I’m deploying an app on multiple servers in parallel, but only want to run migrations on one server (or failing that, serially with the latter migrations being no-ops). After the (serial) migrate task runs, other (parallel) tasks hang. I’m not changing users.

@sitaktif

I have found what triggers the issue on my side and I have a fix. The problem with the user change is that the host string that is used is not the one that is popped (see below). Indeed, I use new_user@host:port, but it is old_user@host:port that is popped.
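
(A toy version of that mismatch with a plain dict - hypothetical key strings, not Fabric's actual normalization code:)

# Entry created during the serial run, keyed by the user that actually connected:
connections = {'user1@localhost:22': '<cached SSHClient>'}

# env.user has changed by the time the subprocess is prepared, so the key
# that gets computed and popped does not match the cached one:
connections.pop('user2@localhost:22', "")  # silently a no-op
print connections                          # the stale client survives the fork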

diff --git a/fabric/tasks.py b/fabric/tasks.py
index 879d97d..6729b6b 100644
--- a/fabric/tasks.py
+++ b/fabric/tasks.py
@@ -231,8 +231,7 @@ def _execute(task, host, my_env, args, kwargs, jobs, queue, multiprocessing):
             def submit(result):
                 queue.put({'name': name, 'result': result})
             try:
-                key = normalize_to_string(state.env.host_string)
-                state.connections.pop(key, "")
+                state.connections.clear()
                 submit(task.run(*args, **kwargs))
             except BaseException, e: # We really do want to capture everything
                 # SystemExit implies use of abort(), which prints its own

@moeffju If you do not change users, do you maybe change ports or the host_string?

Does it work if you modify the source as above?

@moeffju
moeffju commented Jul 15, 2014

@sitaktif I’m not changing users or anything about the host string. I’ll try your fix.

@moeffju
moeffju commented Jul 16, 2014

(The same thing occurs when using the @runs_once decorator, btw.)
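
(An untested sketch of that variant, modeled on the fabfiles above - the once-only task runs serially and populates the connection cache, then the parallel task hangs; localhost and the echo commands are placeholders:)

from fabric.api import task, parallel, runs_once, execute, run

@runs_once
def migrate():
    run('echo migrate')   # serial, runs once; populates the connection cache

@parallel
def deploy():
    run('echo deploy')    # parallel workers inherit the cached client and hang

@task
def release():
    execute(migrate, hosts=['localhost'])
    execute(deploy, hosts=['localhost'])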

@moeffju moeffju added a commit to moeffju/fabric that referenced this issue Jul 16, 2014
@moeffju moeffju Attempt to fix #600 (by @sitaktif) 29d7bf3
@moeffju
moeffju commented Jul 16, 2014

@sitaktif Your patch fixes the problem for both @runs_once and @serial. Thanks!

@sitaktif

No probs - the only thing is that I'm not 100% sure of the implications of the fix above, so it would be nice if someone who knows the code a bit better (@bitprophet maybe?) could review it.

@jontayesp

I had the same issue, but the fix by @sitaktif worked great. Maybe it should be a PR?

@bitprophet
Member

@sitaktif's patch looks sensible and problems with mismatched-but-arguably-equal host strings are a common class of bug in this codebase. I'll try to replicate the error & then apply the patch in the next update pass (soon). Thanks!

@bitprophet bitprophet added the Wart label Aug 5, 2014
@bitprophet
Member

Confirmed reproduction; the minimal test case is as above re: switching around env.user. Specifically:

  • The outer process (which is what serial execution uses) starts with blank state.
  • The host is set to, say, localhost, and the user to, say, user1.
  • The serial task runs and creates a connection cache entry for user1@localhost when it connects.
  • At the end of the serial task, or in some intermediate code between the two executions, the user is set to user2.
  • Parallel execution begins, sees that the current user is user2, and attempts to remove user2@localhost from the cache. Since the cache only contains user1@localhost, this is a no-op, and the cache handed to the subprocess still contains the client object for user1@localhost.
  • Inside the task being executed in parallel, the subprocess's local copy of the user state var gets set back to user1.
  • It then calls e.g. run(), which triggers a lookup in the connection cache for user1@localhost.
  • It finds the never-deleted cached client object and tries reusing it.
  • That triggers the hang, because both the parent and child processes are now attempting to talk to the same client object and its OS network socket (which is a known issue and is what the cache cleanup was intended to avert).

This is why @sitaktif's change fixes the issue - it cleans out all cached client objects and prevents any possibility of this sort of accidental reuse. It also feels like what the code should originally have done - we never want a non-empty cache in parallel subprocesses, because cached entries can never do anything but lead to this problem.
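
(Until a release ships the patch, a hedged workaround in the same spirit is to drop the cache yourself between the serial and parallel execute calls; this sketch reuses the tasks from the first fabfile above and Fabric's disconnect_all helper:)

from fabric.api import execute
from fabric.network import disconnect_all

results = execute(showHostnameSerial, hosts=_HOSTS)
# Close and forget every cached connection so the parallel subprocesses
# start from an empty cache, as the patch enforces inside _execute:
disconnect_all()
results = execute(showHostnameParallel, hosts=_HOSTS)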

@bitprophet
Member

Code example of the above:

from fabric.api import parallel, execute, run, env, task

def _whatever():
    env.user = 'user1'
    run("whoami")
    env.user = 'user2'

@task
def oh_dear():
    env.hosts = ['localhost']
    execute(_whatever)             # serial: caches a client for user1@localhost
    execute(parallel(_whatever))   # parallel: hangs on the stale cache entry
@bitprophet bitprophet closed this in e397a40 Aug 6, 2014
@cijohnson cijohnson added a commit to Juniper/contrail-distro-third-party that referenced this issue May 6, 2015
@cijohnson cijohnson Some times parallel tasks hangs, Fix for the issue fabric/fabric#600
is in Fabric=1.7.5 version, so bringing in 1.7.5 version.
f1e387e
@opencontrail-ci-admin opencontrail-ci-admin pushed a commit to Juniper/contrail-packaging that referenced this issue May 11, 2015
@cijohnson cijohnson Packaging Fabric-1.7.5 to solve
fabric/fabric#600

Change-Id: I613b8a5d8a870de759346a51eb8c13e662c320ac
a76d47b
@opencontrail-ci-admin opencontrail-ci-admin added a commit to Juniper/contrail-packaging that referenced this issue May 11, 2015
@opencontrail-ci-admin Zuul + opencontrail-ci-admin Merge "Packaging Fabric-1.7.5 to solve fabric/fabric#600" 9ed8b69
@opencontrail-ci-admin opencontrail-ci-admin pushed a commit to Juniper/contrail-provisioning that referenced this issue May 17, 2015
@cijohnson cijohnson Changing the Fabric requirement to 1.7.5 to resolve
bug fabric/fabric#600

Change-Id: Ic6e2296d157518befe67b0d16c0506fa7a3531ec
40feadc
@opencontrail-ci-admin opencontrail-ci-admin pushed a commit to Juniper/contrail-provisioning that referenced this issue May 21, 2015
@cijohnson cijohnson Changing the Fabric requirement to >=1.7.5 to resolve
bug fabric/fabric#600

Change-Id: Ibed0534bc6831bf5450a6d86eb3e2621af602ab6
7cb25f5