Race condition in execute() #654
Comments
Possibly related (though not strongly): #600, #247. You also linked this multiprocessing-and-KeyboardInterrupt-related blog post in IRC; I haven't digested it yet and am not sure whether you've tried applying its lessons to any debugging, but it's possibly a place to start. |
I'm taking a stab at fixing this issue, as in our usage (Fabric as an API, with mostly everything run in parallel) we come across it very often. My fork, below, fixes the blocking issue, but I'm currently struggling a bit with Fabric's global env, which appears to reset the host, causing the same task to be executed on the same host at the same time even though Fabric's output appears to show everything is fine. If you have some time to look at it @bitprophet I'd appreciate it; otherwise I'll let you know when I have it working. |
Any workarounds for this issue? |
Nope, no workarounds other than 'don't use parallel' :) I haven't got much further in fixing this issue, though I have seen that the problem I'm having is the global env variable being overwritten on every execution of a task, causing each task to ssh into the same host. If you want to take a shot at this, feel free to have a look at my fork, which at least fixes the initial blocking issue. From what I've seen, it would take a lot of changes to clean all that up and fix it; it's a pretty fundamental issue in the way Fabric works. |
@rmongia In case you're still struggling with this: personally, I found it much easier to make a simple wrapper over paramiko that can run parallel commands through a Python module than to try to fix the many issues with connections, parallelisation, global env and so on that Fabric has. You can find my effort here in case you are in the same boat. |
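A rough sketch of the idea described in that comment -- a thin paramiko wrapper fanned out with multiprocessing -- not parallel-ssh's actual API; the host names, command, and key-based auth here are assumptions:

```python
from multiprocessing import Pool

import paramiko


def run_on_host(host, command="uname -a"):
    """Open a fresh SSH connection inside the worker process and run one command."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host)  # assumes key-based auth / a running ssh-agent
    try:
        stdin, stdout, stderr = client.exec_command(command)
        return host, stdout.read()
    finally:
        client.close()


if __name__ == "__main__":
    hosts = ["web1.example.com", "web2.example.com"]  # illustrative host names
    pool = Pool(processes=len(hosts))
    for host, output in pool.map(run_on_host, hosts):
        print("%s: %s" % (host, output))
```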
Thanks @pkittenis for the parallel-ssh suggestion, it might come in handy in the future. I didn't get the time to look into the code myself; however, things worked for me once I stopped returning large dictionary objects from the parallel tasks. |
I can confirm I've run into this myself a lot recently; testing it out now to try and get to the bottom of things. Using a pool size of 10, the "problem barrier" seems to be around 50 hosts: with an arbitrary collection of host groups totaling 47 targets, the job queue empties out fine every time; adding another group that bumps the count to 56 reliably and repeatedly causes the hang at the end, where the queue last reports 2 hosts left (those two hosts appear to run their tasks OK, then the hang, with no further output from the queue debug logging). Arbitrarily used … N.B. I am also (like Paul was) doing this test over a VPN to a remote network. |
When inspecting the queue's data structures, it bears out the debug output: when it is "hung", it's simply looping with 1-2 (usually 2) jobs permanently in the 'running' list. Still going over how the queue works in detail (it's been a while since I reviewed the PR it came from), but it looks like the jobs themselves are what's hanging; there's a test in the loop designed to pull completed jobs off the running list and into the completed list, which tests aliveness (in the is_alive sense). In this hung terminal condition, that test has ceased firing -- both jobs must be registering as alive. Put simply, the problem doesn't appear to be in the queueing mechanism itself, unless it's to do with the multiprocessing primitives used to communicate with the subprocess or something. Will poke more. |
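A minimal illustration of the polling pattern described above (not Fabric's actual JobQueue code): jobs only move off the 'running' list once is_alive() returns False, so a child that never exits stalls the loop indefinitely. The Process targets here are placeholders:

```python
import time
from multiprocessing import Process


def drain_running(running, completed):
    """Move jobs from 'running' to 'completed' as soon as is_alive() goes False."""
    while running:
        for job in running[:]:
            if not job.is_alive():   # the aliveness test described above
                running.remove(job)
                completed.append(job)
        time.sleep(0.1)              # if a child never dies, this loops forever


if __name__ == "__main__":
    running = [Process(target=time.sleep, args=(1,)) for _ in range(4)]
    completed = []
    for proc in running:
        proc.start()
    drain_running(running, completed)
    print("completed %d jobs" % len(completed))
```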
Also: when one encounters these hung subprocesses and has to Ctrl-C twice, it unsurprisingly leaves the children lying around. Testing this out can leave quite a few Python processes about (though they do respond fine to a normal |
The Poking a bit at returncode/pids/etc, and also at the Queue object used for communication (will try removing it temporarily and see if that makes any difference). Failing that, will try attaching a debugger to one of the "dead" processes to see what it thinks it is actually doing, and also scan multiprocessing's bug list again =/ |
This SO question perfectly describes what we're seeing; the queue size (in terms of number of items or amount of data, it sounds like) could very well be the reason why e.g. Paul saw this with 4 hosts while I am seeing it with 50+. E.g. he's likely returning lots of useful data, vs. my tasks, which only return None. It also plays into @rmongia's note about the problem going away after he stopped returning data from the task. (If you're still following, @rmongia, I'd be curious about your specific situation re: number of hosts, amount of data returned, etc.) |
Yes indeedy, it appears to be the Queue's fault, as explained in Python #8246/Python #8237 (linked in the above SO thread). As soon as I comment out the line actually dropping return values onto the queue, no more hangs. With that line back in place and executing tasks that return larger values (e.g. returning the stdout of a command that does a large …), the hang returns. It should be possible to update the current naive "wait for everything to wrap up, then store results and return them" algorithm with something that preferentially takes things off the queue whenever it's non-empty... working on this now. |
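A standalone illustration of the Queue behaviour from those bug reports, independent of Fabric: if the parent join()s a child before draining the queue, a sufficiently large put() leaves the child's feeder thread blocked on the pipe and join() never returns. The payload size is illustrative:

```python
from multiprocessing import Process, Queue


def task(queue):
    # A "large return value": bigger than the pipe's buffer, so the child's
    # feeder thread blocks until someone reads from the other end.
    queue.put("x" * (1024 * 1024))


if __name__ == "__main__":
    queue = Queue()
    worker = Process(target=task, args=(queue,))
    worker.start()
    worker.join()       # hangs: the child cannot finish until its queued data is consumed
    print(queue.get())  # never reached; get()ting before join() avoids the hang
```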
Rejiggered how results are kept & consumed, and it definitely appears to fix the problem (and, as far as I can tell, it doesn't break the actual "give me all my results" functionality in the process -- woo). |
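A sketch of that "consume results whenever the queue is non-empty" approach -- not the actual Fabric change, just the general shape of it, with placeholder hosts and payloads:

```python
try:
    from queue import Empty   # Python 3
except ImportError:
    from Queue import Empty   # Python 2, which Fabric 1.x runs on

from multiprocessing import Process, Queue


def worker(host, results_queue):
    # Stand-in for a task; the large string is an illustrative return value.
    results_queue.put((host, "x" * (1024 * 1024)))


def run_all(hosts):
    results_queue = Queue()
    results = {}
    procs = [Process(target=worker, args=(host, results_queue)) for host in hosts]
    for proc in procs:
        proc.start()
    # Consume results whenever the queue is non-empty, so no child's feeder
    # thread ever blocks on a full pipe.
    while any(proc.is_alive() for proc in procs):
        try:
            host, value = results_queue.get(timeout=0.1)
            results[host] = value
        except Empty:
            pass
    for proc in procs:
        proc.join()            # safe now: the children have flushed their results
    while not results_queue.empty():
        host, value = results_queue.get()
        results[host] = value
    return results


if __name__ == "__main__":
    print(sorted(run_all(["host1", "host2", "host3"])))
```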
Should prevent problems where Queue's background pipe fills up and blocks/hangs. Re #654
@PaulMcMillan @rmongia (and @pkittenis if willing :P) just an FYI on the above -- hopefully fixed now. Please let me know if you can continue reproducing after upgrading to 1.5.3 (which I will put out this weekend). Thanks! |
@bitprophet, #hosts were ~30-40 and the return data was actually ~1k (big json output). Thanks for the fix. Will patch my fab. :) |
whoo! Thank you so much for finally tracking this one down! I'll definitely let you know if I run into any more issues with this. |
I've encountered what I believe to be a race condition in execute() which results in a job queue that never terminates. The error occurs reliably when running Fabric over a VPN (especially from far away from the target machines) but never occurs when running Fabric on the same local network as the target machines.
I'm using the remote command timeout wrapper described in my comment on #249 with timeout = 30
Relevant settings:
socket.socket().timeout = 30.0
env.use_exceptions_for['network'] = True
env.warn_only=False
env.parallel = True
My code is of the form:
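A hedged sketch of what a fabfile of this form might look like, using Fabric 1.x's env/execute/parallel API; the task body, host names, and the plain run() standing in for the #249 timeout wrapper are illustrative assumptions, not the poster's actual code:

```python
from fabric.api import env, execute, parallel, run, task

env.use_exceptions_for['network'] = True
env.warn_only = False
env.parallel = True


@parallel
def check_host():
    # The real code wraps run() in the remote-command timeout helper from #249
    # (timeout = 30); a plain run() stands in for that wrapper here.
    return run('uptime')


@task
def check_all():
    results = execute(check_host, hosts=['host1', 'host2', 'host3', 'host4'])
    for host, value in results.items():
        print('%s: %s' % (host, value))
```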
When I run this against 4 machines with debug output enabled, I get something like this:
At this point, the code hangs indefinitely. When I hit CTRL-C, I get this:
Again, it hangs. So I hit CTRL-C again, and get this:
After printing this traceback, it finally closes out.
My best guess is that something is causing the execute queue to not realize it is empty, or to not completely clean up a job after it has finished.