Race condition in execute() #654

Closed
PaulMcMillan opened this Issue May 26, 2012 · 16 comments

@PaulMcMillan
Contributor

I've encountered what I believe to be a race condition in execute() which results in a job queue that never terminates. This error occurs reliably when running fabric over a VPN (especially from far away from the target machines) but never occurs when running fabric on the same local network as the target machines.

I'm using the remote command timeout wrapper described in my comment on #249 with timeout = 30

Relevant settings:

socket.socket().timeout = 30.0
env.use_exceptions_for['network'] = True
env.warn_only=False
env.parallel = True

My code is of the form:

from fabric.api import execute, sudo, task

@task
def foo():
    sudo('do stuff')

@task
def run():
    try:
        execute(foo)
    except KeyboardInterrupt:
        print('some closing info')
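The timeout wrapper from #249 isn't quoted in this thread, but for context, a generic sketch of the idea (a hypothetical `with_timeout` decorator built on SIGALRM, not the actual code from that comment) might look like:

```python
import signal


class CommandTimeout(Exception):
    """Raised when a wrapped command exceeds its time budget."""


def with_timeout(seconds):
    """Decorator: abort the wrapped call after `seconds` seconds.

    Relies on SIGALRM, so it works only on Unix and only in the
    main thread of the calling process.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise CommandTimeout('timed out after %ds' % seconds)

            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator
```

A task body would then be decorated with `@with_timeout(30)`; each parallel fabric subprocess runs the task in its own main thread, so the signal-based approach still applies there.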

When I run this against 4 machines with debug output enabled, I get something like this:

Popping '10.0.0.16' off the queue and starting it
Popping '10.0.0.1' off the queue and starting it
Popping '10.0.0.3' off the queue and starting it
Popping '10.0.0.9' off the queue and starting it
[10.0.0.16] <run output>
Job queue found finished proc: 10.0.0.16.
Job queue has 3 running.
[10.0.0.9] <run output>
[10.0.0.3] <run output>
Job queue found finished proc: 10.0.0.9.
Job queue has 2 running.
[10.0.0.1] <run output>

At this point, the code hangs indefinitely. When I hit CTRL-C, I get this:

^CKeyboardInterrupt()
<My finalizing printed output from the catch>

Done.
Disconnecting from 10.0.0.16... done.
Disconnecting from 10.0.0.1... done.
Disconnecting from 10.0.0.3... done.

Again, it hangs. So I hit CTRL-C again, and get this:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 295, in _exit_function
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 148, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 133, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 295, in _exit_function
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 148, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 133, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

After printing this traceback, it finally closes out.

My best guess is that something is causing the execute queue not to realize it is empty, or not to completely clean up a job after it is finished.

@bitprophet
Member

Possibly related (though not strongly): #600, #247

You also linked a multiprocessing-and-KeyboardInterrupt-related blog post in IRC; I haven't digested it yet, and I'm not sure whether you've tried applying its lessons to any debugging, but it's possibly a place to start.

@pkittenis

I'm taking a stab at fixing this issue, since in our usage (fabric as an API, with mostly everything run in parallel) we come across it very often.

My fork, linked below, fixes the blocking issue, but I'm currently struggling with fabric's global env, which appears to reset the host, causing the same task to be executed on the same host at the same time even though fabric's output shows everything as fine. If you have some time to look at it, @bitprophet, I'd appreciate it; otherwise I'll let you know when I have it working.

https://github.com/pkittenis/fabric

@rmongia
rmongia commented Dec 10, 2012

Any workarounds for this issue?

@pkittenis

Nope, no workarounds other than 'don't use parallel' :)

I haven't got much further in fixing this issue, though I have seen that the problem I'm having is the global env variable being overwritten on every execution of a task, causing each task to SSH into the same host. If you want to take a shot at this, feel free to have a look at my fork, which at least fixes the initial blocking issue.

From what I've seen, it would take a lot of changes to clean all that up and fix it; it's a pretty fundamental issue in the way fabric works.

@pkittenis

@rmongia In case you're still struggling with this, personally I found it much easier to make a simple wrapper over paramiko that can run parallel commands from a Python module than to try to fix the many issues with connections, parallelisation, global env and so on that fabric has.

You can find my effort here in case you are in the same boat.

@rmongia
rmongia commented Dec 25, 2012

Thanks @pkittenis for the parallel-ssh suggestion, might come in handy for future. I didn't get the time to look into the code myself. However, things worked for me when I stopped returning large dictionary objects from the parallel tasks.

@bitprophet
Member

I can confirm I've run into this myself a lot recently; testing it out now to try and get to the bottom of things.

Using a pool size of 10, the "problem barrier" seems to be around 50 hosts. With an arbitrary collection of host groups totaling 47 targets, the job queue empties out fine every time; adding another group that bumps the count to 56 reliably/repeatedly causes the hang at the end, where the queue last reports 2 hosts left (those two hosts appear to run their tasks OK; then comes the hang, with no further emissions from the queue debug output).

Arbitrarily used -x to exclude 5 hosts (so the total run is now 51) and got no hang. Worked up from there; things did not hang until I got back to 55 hosts, at which point it hung once but was fine for another 2-3 runs. 56 hosts still reliably hangs every time. (Though... now it's reporting down to '1 host left' before it hangs. Meh.)

N.B. I am also (like Paul was) doing this test over a VPN to a remote network.

@bitprophet
Member

Inspecting the queue's data structures bears out the debug output: when it is "hung", it's simply looping with 1-2 (usually 2) jobs permanently in the 'running' list.

Still going over how the queue works in detail (been a while since I reviewed the PR it came from) but it looks like the jobs themselves are what's hanging; there's a test in the loop designed to pull completed jobs off the running list and into the completed list, which tests aliveness (in the is_alive sense). In this hung terminal condition, that test has ceased firing -- both jobs must be registering as alive.

Put simply, the problem doesn't appear to be in the queueing mechanism itself, unless it's to do with the multiprocessing primitives used to communicate with the subprocess or something. Will poke more.
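The aliveness check described above is roughly this shape (a simplified sketch for illustration, not fabric's actual JobQueue code):

```python
import multiprocessing
import time


def drain_finished(running, completed):
    """Move jobs whose process has exited from `running` to `completed`.

    A job whose is_alive() never goes False stays in `running`
    forever -- which is exactly the hang described above.
    """
    for job in running[:]:
        if not job.is_alive():
            job.join()  # reap the exited process
            running.remove(job)
            completed.append(job)


def wait_all(jobs, poll=0.1):
    """Poll until every job has been moved to the completed list."""
    running, completed = list(jobs), []
    while running:
        drain_finished(running, completed)
        time.sleep(poll)
    return completed
```

If even one subprocess keeps reporting `is_alive() == True` after its task body has finished, `wait_all` here (like fabric's real loop) spins forever.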

@bitprophet
Member

Also: when one encounters these hung subprocesses and has to Ctrl-C twice, it unsurprisingly leaves the children lying around. Testing this out can leave quite a few Python processes about (though they do respond fine to a normal kill; they're not zombies.)

@bitprophet
Member

The inner function defined within fabric.tasks._execute, which is the target of the Process objects, is running to completion in the "hung" processes, at least going by print statements. So it's unclear what's happening in these subprocesses such that their body runs all the way through yet they fail to exit.

Poking a bit at returncode/pids/etc, and also at the Queue object used for communication (will try removing it temporarily and see if that makes any difference). Failing that, will try attaching a debugger to one of the "dead" processes to see what it thinks it is actually doing, and also scan multiprocessing's bug list again =/

@bitprophet
Member

This SO question perfectly describes what we're seeing; the queue size (in terms of number of items or amount of data, it sounds like) could very well be the reason why e.g. Paul saw this with 4 hosts while I am seeing it with 50+: he's likely returning lots of useful data, vs. my tasks, which return only None.

Also plays into @rmongia's note about the problem going away after he stopped returning data from the task. (If you're still following, @rmongia, I'd be curious about your specific situation re: # hosts and amt of data returned, etc).

@bitprophet
Member

Yes indeedy, it appears to be the Queue's fault, as explained in Python #8246/Python #8237 (linked in the above SO thread). As soon as I comment out the line actually dropping return values onto the queue, no more hangs.

With that line back in place & executing tasks that return larger values (e.g. returning stdout of a command that does a large head on /usr/share/dict/words) the hangs occur earlier in the runlist, so it's definitely this issue.

Should be possible to replace the current naive "wait for everything to wrap up, then store results and return them" algorithm with something that eagerly takes things off the queue whenever it's non-empty... working on this now.
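That eager-consumption approach looks roughly like this (a hypothetical sketch of the shape, with a made-up `sample_task`; not the actual commit):

```python
import multiprocessing
import queue  # Python 3 name; this was the Queue module on Python 2
import time


def sample_task(q):
    # Stand-in for a task returning a sizable result (200 KB).
    q.put('x' * 200000)


def run_all(targets, result_q):
    """Start one process per target, draining result_q on every poll
    pass so its underlying pipe never fills up enough to block a
    child's exit."""
    procs = [multiprocessing.Process(target=t, args=(result_q,))
             for t in targets]
    for p in procs:
        p.start()

    results = []
    while procs:
        # Eagerly pull anything waiting on the queue first...
        try:
            while True:
                results.append(result_q.get_nowait())
        except queue.Empty:
            pass
        # ...then reap processes that have finished.
        still_running = []
        for p in procs:
            if p.is_alive():
                still_running.append(p)
            else:
                p.join()
        procs = still_running
        time.sleep(0.05)

    # Final drain for results that arrived after the last poll pass.
    try:
        while True:
            results.append(result_q.get_nowait())
    except queue.Empty:
        pass
    return results
```

Because the parent keeps the pipe drained, no child ever blocks on its final put, so `is_alive()` reliably flips to False and the loop terminates.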

@bitprophet
Member

Rejiggered how results are kept & consumed and it definitely appears to fix the problem (and, as far as I can tell, doesn't break the actual "give me all my results" functionality in the process. woo.)

@bitprophet bitprophet added a commit that referenced this issue Jan 27, 2013
@bitprophet bitprophet Eagerly consume parallel job results.
Should prevent problems where Queue's background pipe fills up
and blocks/hangs. Re #654
c157644
@bitprophet
Member

@PaulMcMillan @rmongia (and @pkittenis if willing :P) just an FYI on the above -- hopefully fixed now. Please let me know if you can continue reproducing after upgrading to 1.5.3 (which I will put out this weekend). Thanks!

@rmongia
rmongia commented Jan 27, 2013

@bitprophet, #hosts were ~30-40 and the return data was actually ~1k (big json output). Thanks for the fix. Will patch my fab. :)

@PaulMcMillan
Contributor

whoo! Thank you so much for finally tracking this one down! I'll definitely let you know if I run into any more issues with this.
