Implement parallelism/thread-safety #19
Comments
Jeff Forcier (bitprophet) posted: For some discussion concerning users' needs along these lines, see this mid-July mailing list thread. on 2009-08-09 at 11:31am EDT
Wes Winham (winhamwr) posted: My 2 cents: Reworking the execution model and implementing thread-safe state sharing is probably going to be really hard. It'll touch a lot of code, be subject to a lot of hard-to-track bugs, and might require backwards-incompatible changes. It would, however, be very useful for some people (probably a minority, though). To me, this makes it a candidate either for looking at after some of the higher-use issues are implemented, or for work by someone else (some person or group of people who need it) in a fork. Things like more example documentation, easier output control, improved prompt handling, and #23 generic persistence handlers (so I can do everything in my virtualenv easily) would all benefit everyone substantially while not being so big on time commitment. From a cost/benefit standpoint, I'm thinking those are much bigger wins. I'm not at all saying that thread-safety wouldn't be awesome, just that you (Jeff) have limited time. I might be stating the obvious, though, and as someone who could benefit from thread-safety (I manage 6 machines in my case) but doesn't see it as a huge personal priority, of course I might be biased. on 2009-08-09 at 12:50pm EDT
Morgan Goose (goosemo) posted: Don't know if this is off topic for this thread, but I've made a small change in main.py, with a command flag, that'll leverage the multiprocessing module to fork out processes for each host. It is a bit messy with stdout, but seems to work fine for my test fabfile. It'll use the normal execution model by default, but if someone wants to run all of their hosts in parallel, -P/--parallel will switch to using a Process() for each host. One thing to note is that it won't run tasks simultaneously; those will still be processed sequentially, while every host in a task will be run in parallel. on 2010-03-11 at 04:27pm EST
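The shape of that approach, reduced to a minimal sketch (task and host names here are hypothetical; this is not Morgan's actual patch):

```python
from multiprocessing import Process

def run_task_on_host(task, host):
    # Each child process runs one task against one host.
    print("[%s] %s" % (host, task(host)))

def deploy(host):
    return "deployed"  # stand-in for the real remote work

if __name__ == "__main__":
    hosts = ["web1", "web2", "web3"]
    # Tasks remain sequential; within each task, hosts run in parallel.
    for task in [deploy]:
        procs = [Process(target=run_task_on_host, args=(task, h))
                 for h in hosts]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```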
Morgan Goose (goosemo) posted: As that patch stands, it expects one to have the multiprocessing module. This is a non-issue for Python 2.6, but for 2.5/2.4 the backported module needs to be grabbed from http://code.google.com/p/python-multiprocessing/ on 2010-03-17 at 10:25am EDT
Morgan Goose (goosemo) posted: Here is an updated patch that will check whether the multiprocessing module can be imported, allowing Python versions without said module (2.4/2.5) to avoid erroring out when attempting to run. on 2010-03-17 at 04:52pm EDT
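Presumably something along these lines (a sketch of the guarded import, not the actual patch):

```python
try:
    # Stdlib in Python 2.6+; on 2.4/2.5 the backport from
    # http://code.google.com/p/python-multiprocessing/ provides it.
    import multiprocessing
except ImportError:
    multiprocessing = None

def supports_parallel():
    # Callers can check this before honoring -P/--parallel.
    return multiprocessing is not None
```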
Morgan Goose (goosemo) posted: Just as an update, this is the tip commit for the multiprocessing branch I made for this issue. It adds in a CLI flag, -P, for forcing parallel. It also adds two decorators, @runs_parallel and @runs_sequential, that will force those options on a fab task regardless of the parallel CLI flag. If anyone has issues or requests, let me know. on 2010-04-01 at 10:09am EDT
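Usage of those decorators would presumably look like this (hypothetical fabfile; the import location is an assumption about the branch's layout):

```python
from fabric.api import run
from fabric.decorators import runs_parallel, runs_sequential  # assumed path

@runs_parallel
def restart_workers():
    run("service worker restart")  # always parallel, even without -P

@runs_sequential
def migrate_db():
    run("python manage.py migrate")  # always one host at a time
```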
Jeremy Orem (oremj) posted: It would also be beneficial to be able to specify how many parallel commands to run at once; e.g., I have 100 systems and I only want to run a limited number at a time. on 2010-05-13 at 05:12pm EDT
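One way to cap concurrency under the multiprocessing approach would be a fixed-size worker pool (a sketch only; Fabric had no such option at this point in the thread):

```python
from multiprocessing import Pool

def run_on_host(host):
    return "%s: done" % host  # stand-in for the real per-host task

if __name__ == "__main__":
    hosts = ["host%d" % i for i in range(100)]
    pool = Pool(processes=10)  # at most 10 hosts in flight at once
    try:
        for result in pool.imap_unordered(run_on_host, hosts):
            print(result)
    finally:
        pool.close()
        pool.join()
```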
Joshua Peek (josh) posted: This is a feature that I look forward to having in 1.0. Thanks for your work. on 2010-05-20 at 07:56pm EDT
Jeff Forcier (bitprophet) posted: Just a note, the work done over in #7 now means that remote stdout/stderr are printed on a byte-by-byte basis instead of line-by-line, which also changes how fabfile-local print statements and interactions behave. Whether we solve that for this ticket by adding back in some sort of "legacy" line-buffered, no-stdin-available mode (gross, lots of potential for code duplication, still lots of potential for confusing output) or by taking a harder look at what sorts of output people expect when running jobs in parallel (I haven't seen whether Morgan's code handles this at all -- doesn't it get pretty confusing no matter what, unless you simply hide all but your own print statements? would we want to just be autologging each server to its own log file and expect folks to read those?) remains to be decided. Run-on sentences, yaaaaay. on 2010-06-18 at 03:13pm EDT
Jeff Forcier (bitprophet) posted: Note that came up via IRC while Morgan was testing out a merge of his branch and the #7 branch -- because the MP solution runs a bunch of Fabric "workers" headless without actual TTYs involved, this impacts a lot of assumptions Fabric makes, though primarily the ones made by #7 (so far, the code that forces stdin to be bytewise is what errors out). So basically, the previous comment but writ large. Keep the overall ramifications of this in mind when integrating this work. Will be trying to update #7 with a fix for this soon. on 2010-07-21 at 12:39pm EDT
Morgan Goose (goosemo) posted: So just today I got what I feel is an acceptable version of my multiprocessing branch. I changed the readme some to explain the options I've added. Feel free to check it out. on 2010-07-21 at 10:13pm EDT
Morgan Goose (goosemo) posted: With the newest updates I am able to merge in the multiprocessing branch I have, and have no issues when running my personal scripts that utilize parallel execution, and a simple test script of some run|get|put|cd()s. People can grab commit 2f56f612846c31c94c77 and test it out some more. I'll be working on trying to make unit tests to verify parallel execution, and since it's not in the main branch people can use http://github.com/goosemo/fabric/issues for tickets. (Unless Jeff wants/prefers issues be placed in this ticket.) on 2010-10-07 at 11:56am EDT
Jeff Forcier (bitprophet) posted: As mentioned on IRC, I'm fine with people using fork-specific GH issues pages, as long as any outstanding issues at merge time get ported over here. (Basically, code in the core Fab repo should have its issues documented here.) Excited that it seems to "just work" with the latest IO changes in master. As discussed elsewhere I think we'll want to update things so that we have more bulletproof IO settings for this mode (I'm 99% sure that with nontrivial commands, users would run into garbled output between processes in the bytewise mode we use now) but if that's not resolved when I have time to look at this myself, I'll take care of it. on 2010-10-07 at 12:04pm EDT
Morgan Goose (goosemo) posted: Can you think of a good test command that'd give some output that you'd expect to be garbled? Also, what do you mean by that: an interweaving of output lines, or an output line itself having the output of another command break it up? on 2010-10-08 at 03:58pm EDT
Jeff Forcier (bitprophet) posted: Well, theoretically, because everything is byte-buffered now instead of line-buffered, you could get things mixed up on a character-by-character level if two of the processes are both writing to stdout or stderr at the same time. It's quite possible, though, that things may run fast enough, realistically, that each process writes its lines fast enough for this not to be a big deal. For a stress test, I was thinking of anything that generates a lot of output, or output quickly, such as wget/curl downloads, or aptitude update/upgrade, that sort of thing. An excellent test would be to see what happens with interactive programs like python or top/htop (try these beforehand without using parallelism, tho, just so you know how they behave currently in Fab's normal mode -- prefixes and crap still make them look funky sometimes). (I don't even know what would happen to stdin at that point, which is the other half of this, but it might be fun to find out. I presume it'll multiplex to all processes as they'd theoretically all inherit the file descriptor, but not sure.) on 2010-10-08 at 05:40pm EDT
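The kind of mid-line garbling in question is easy to reproduce outside of Fabric: with byte-at-a-time writes, two processes interleave at arbitrary points (illustrative sketch only, no SSH involved):

```python
import os
import sys
from multiprocessing import Process

def chatty(tag):
    for _ in range(20):
        for ch in "%s: the quick brown fox\n" % tag:
            # One byte per write, mimicking a bytewise IO mode.
            os.write(sys.stdout.fileno(), ch.encode())

if __name__ == "__main__":
    procs = [Process(target=chatty, args=(t,)) for t in ("A", "B")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Lines from A and B will frequently be mixed character-by-character.
```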
Jeff Forcier (bitprophet) posted: For now, putting this in as the lynchpin for 1.2. on 2011-03-10 at 12:12pm EST
Joshua Peek (josh) posted: Hey y'all, just coming in again to say this would be great to have. on 2011-07-01 at 02:14am EDT
Finally took goosemo's branch out for a spin. Works as advertised (woo!), but as I expected, the output is often quite garbled (and there's an issue open on his branch from another user with the same complaint). Best solution, short of "no stdout/stderr output, only logging" or using tmux/screen, is probably to re-implement a line-based output scheme.
I'm sure you guys have tried out Capistrano before (runs commands in parallel). Just wanted to chime in and say that I like their output scheme...
That's how Fabric operates already, actually, re: printing line prefixes (since even when running in serial you still want to know which server's output you're currently on.) The problem with parallel right now isn't related to "which line is which server" but simply that even on an individual line, you'll get some bytes from one server and some bytes from another one. (This is because unlike Cap, we moved to a bytewise IO scheme in 1.0 which lets us handle true interactivity with the remote end. Works great in serial, not so great in parallel, but interactivity doesn't mesh with parallel execution anyways, so not a problem.)
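A line-based scheme essentially means buffering each host's bytes until a newline arrives, then emitting whole prefixed lines (a sketch of the general idea, not Fabric's eventual implementation):

```python
import sys

class LinePrefixer(object):
    """Buffer bytewise writes and emit only whole, prefixed lines."""
    def __init__(self, host, stream):
        self.host = host
        self.stream = stream
        self.buf = ""

    def write(self, data):
        self.buf += data
        while "\n" in self.buf:
            line, self.buf = self.buf.split("\n", 1)
            # Sibling processes can now only interleave at line
            # boundaries, never in the middle of a line.
            self.stream.write("[%s] %s\n" % (self.host, line))

out = LinePrefixer("web1", sys.stdout)
out.write("partial ")
out.write("line\nanother line\n")
# -> [web1] partial line
# -> [web1] another line
```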
(TODO moved to issue description)
On a whim, did a slightly more complex real-world test before diving into linewise output. It results in this error every time. Narrowing down, the error does not occur on some invocations but reliably does on others. Note that in all the above, this involves two invocations of the exact same task, with the host list set up the same way each time. It also occurs with two different parallel tasks, so it seems to be a straightforward "parallel, serial, parallel" issue and not some problem with task-name-based process IDs or anything.
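The failing pattern, with hypothetical task names (the actual invocations aren't preserved in this copy of the comment), would look something like:

```python
# fabfile.py -- hypothetical reproduction of the "parallel, serial,
# parallel" hang; decorator names are those from Morgan's branch.
from fabric.api import env, run
from fabric.decorators import runs_parallel, runs_sequential

env.hosts = ["host1", "host2"]

@runs_parallel
def par():
    run("uptime")

@runs_sequential
def ser():
    run("uptime")

# OK:    fab par par
# Hangs: fab par ser par
```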
Will see if we need to re-initialize connections between task invocations.
Wonderful. EDIT: activating logging shows one "Success for unrequested channel!" message. EDIT 2: Also, there is an existing issue, #214, with the same basic symptom apparently popping up sometimes in what I assume is regular, non-parallel usage. It claims to be fixed with the patched Paramiko that this branch here should be using, but I might need to double check that it is installing the right version. That said, it's quite possible this is another unrelated problem that just triggers the same error msg.
When exploring the PyCrypto issue, I noticed that Paramiko 1.7.7.1 includes the @Sugarc0de fix (which has sadly since been nuked from his repo?). That fix does not fix this bug (expected, as this bug isn't something Morgan seems to have run into in his branch/testing) but should obviate the need to depend on a non-mainline Paramiko... for now. Something to consider when we get to that part of my above TODO. It does up our install deps to need PyCrypto >=2.1 instead of 1.9, so I should check to see if that fixes or clashes with the odd issue we had to fix-via-docs re: PyCrypto being stupid on Python 2.5.
I think the problem here is one of connections -- Morgan's implementation of parallelism doesn't account for the connection cache, and I also think our additions of lazy connection caching interact badly with it. Will experiment with cache nuking and then try to figure out if that's actually necessary, or if we can juggle things such that we don't have to reconnect on every task invoke (though that is not a huge evil, since running many tasks in a row in one session is not a common use case). EDIT: Unfortunately, testing 2 consecutive parallel tasks didn't settle the question by itself. EDIT 2: However, another test I didn't think to try before turned out to be revealing -- see the next comment.
Interesting -- throwing more debugging at it helps narrow things down. EDIT: Tossing more debugging into my local copy of stock Paramiko 1.7.7.1, I find that yea, the hang is happening down at the Paramiko level. EDIT 2: AHA! So simple. When a parallelized task is the first task to connect, because of our lazy-connection behavior it is the first one to add items to the connection cache. And since it's been stuck into its own process, it's running with its own copy of the cache -- the parent process's copy never gets those items. This is why >1 parallel task in a row never has any problems, and why putting a serial task in the middle does cause them. This doesn't immediately point to the hanging version of the problem when using the decorators, but perhaps it has to do with calling things in a different order there. It also doesn't explain why using the decorator fails to solve the problem -- if the issue were solely that we aren't clearing the cache, the decorator alone wouldn't address that anyway.
In order to know when/where to call any cache-clearing, I need to understand the hang itself first. The hang itself is within Paramiko, inside the transport layer. Combined with the log message about "Success for unrequested channel!", that gives me somewhere to dig. EDIT: Nope, just the source code. Popping more debug stuff in the area generating the message, and comparing that to the rest of the log output, I'm seeing Paramiko doing things on channels 1 and 2, and then getting messages back on channel 2 and going "whu?! what's channel 2?!". Which seems suspect? EDIT 2: The channels datastruct, at the time of the phantom success, is empty, which explains the error, but presents its own mystery. Running just one parallel task results in what feels like multiple writes to the log from the two processes, both using channel 1. Makes me wonder if something is going on where the higher-level Paramiko object, when set up via the serial process, is fleshing something out that is then "wrong" when run in a subprocess later on. (Re: why the "bad" parallel run ends up with both channels 1 and 2.) Empirical testing:
So, pretty obvious "each new session request creates a new channel at N+1", which is backed up by the relevant Paramiko code. EDIT 3: More debugging: within the forked process' transport obj, its channels datastruct is empty when the success message arrives. I now suspect that this is because the previously-instantiated client object in the parent process is somehow retaining the network socket (or something) and is thus "grabbing" the success message (and probably anything else?) intended for the subprocess. This makes sense re: what little I remember about how forked processes inherit and share open file descriptors. That still fits with the "nuke cached connections fixes everything" behavior as well, because then each process is starting with its own fresh objects, which would then create and retain their own socket handles. Bit tired to ponder the ramifications of this, but I suspect that the "real" solution might require more patching to Paramiko so it's fully multiprocessing-friendly -- but OTOH if it's as simple as a shared-socket problem then the onus is on us (heh) to ensure we don't end up in this situation -- necessitating bypassing the connection cache whenever parallelism is used.
A summary of that monstrosity (see the standalone sketch below):

- Fabric's lazily-populated connection cache lives in the parent process, and forked subprocesses inherit it, complete with live socket handles.
- Parent and children can then read from the same sockets, so a reply intended for a subprocess can be consumed by another process, leaving the subprocess hanging and Paramiko logging "Success for unrequested channel!".
- Starting each process with fresh, uncached connections avoids the shared-socket situation entirely.
Moving forward, we can either nuke the cache entirely in the parent thread when starting any parallel threads, or try to nuke it inside child forked processes. The former is easier but marginally less efficient (in that any serial tasks will need to "unnecessarily" reconnect at the start; but since the caching is mostly useful within a task, and many-task sessions are not terribly common, it's not a big deal); the latter is worth it if it's easy, but not if it's difficult or error-prone.
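The suspected mechanism, stripped to its essentials with no Paramiko involved (Unix-only sketch; the timing is rigged so the parent deterministically wins the race):

```python
import os
import socket
import time

parent_sock, service_sock = socket.socketpair()

pid = os.fork()
if pid == 0:
    # The child inherits the very same socket the parent holds.
    parent_sock.settimeout(2)
    time.sleep(0.5)  # let the parent read first
    try:
        print("child got: %r" % parent_sock.recv(1024))
    except socket.timeout:
        print("child: reply was consumed by the parent; would hang forever")
    os._exit(0)

service_sock.send(b"reply meant for the child")
# The parent reads first and "steals" the child's reply -- analogous to
# the cached Paramiko connection grabbing the channel-open success.
print("parent got: %r" % parent_sock.recv(1024))
os.waitpid(pid, 0)
```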
Went with the latter option re: nuking cache inside child procs. Still need to test more thoroughly re: different combos of parallel & serial tasks mixed up, but it appears to solve the reproducible test case from before just fine. Also updated the branch accordingly.
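In spirit, the fix amounts to each forked worker starting with an empty cache (a sketch of the technique; the names here are illustrative stand-ins, not Fabric's actual internals):

```python
from multiprocessing import Process

# Hypothetical stand-in for a host-string -> client connection cache
# that a forked child would otherwise inherit from its parent.
connections = {"web1": "<client cached by the parent>"}

def worker(host):
    # First thing in the child: drop inherited entries so this process
    # opens, and exclusively owns, fresh sockets.
    connections.clear()
    connections[host] = "<fresh client for %s>" % host
    print("%s -> %s" % (host, connections[host]))

if __name__ == "__main__":
    p = Process(target=worker, args=("web1",))
    p.start()
    p.join()
```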
There is no absolutely clear bug/bugfix in the CPython Mercurial repo/bug tracker which would cause the hanging in the way that we are using multiprocessing. I did comb through the entire history for the multiprocessing module; there were a few potential candidates, but none clearly responsible.
Dealing with the Paramiko fork fun in #275 now. Last note here re: the RNG errors: I ran my tests on another Lion box (a 15" 2011 MBP) under its stock 2.6.6, and as with the MBA, I didn't even get the RNG errors to show up, or any hanging. Not sure why I got them with 2.6.6 on Linux in a VM, but whatever :)
I've created a monkeypatch script to make Fabric thread-safe using threading.local().
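The linked script isn't reproduced above, but the general technique is to back otherwise-shared state with threading.local (a sketch of the idea, not bancek's actual code):

```python
import threading

class ThreadLocalEnv(object):
    """Dict-like object whose contents are private to each thread."""
    def __init__(self):
        self._local = threading.local()

    def _data(self):
        if not hasattr(self._local, "data"):
            self._local.data = {}
        return self._local.data

    def __setitem__(self, key, value):
        self._data()[key] = value

    def __getitem__(self, key):
        return self._data()[key]

# Monkeypatch idea: swap a module-global shared-state object (e.g.
# Fabric's env) for a thread-local equivalent before spawning threads.
env = ThreadLocalEnv()
```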
@bancek -- did you note the ticket history? :( we've already got a pretty solid multiprocessing implementation, so at this point it's probably not a good use of time to flip things around to being threadsafe and then rewrite the feature using some other concurrency model. That said -- I'm definitely going to keep a link to that somewhere so I can refer to it when writing Fabric 2.0, so thanks! :D
This wasn't meant to get into Fabric core. It's just a temporary fix, and I posted it here in case somebody else needs a quick patch.
Fair enough! Thanks again :) |
This was merged into master about a week ago. |
Reopening so I remember to double check the impact of a newer Paramiko's dependencies (only un-crossed-out item in description's list.) This will need doing regardless of the outcome of #275. |
The Paramiko/PyCrypto 2.1+ issue was simply the three-parter of python==2.5.x, pip<=0.8.0 and pycrypto >=2.1 being uninstallable. Now that Fabric requires pycrypto 2.1, that changes to: not installable via pip 0.8.0 or older on Python 2.5.x. (Using PyCrypto 2.0.1 is no longer an option.) Will update docs appropriately. (Also need to update all our docs re: Paramiko now being 'ssh'.) |
Except in one or two spots where it still makes sense. Re #19
Totally forgot to do this re: #19, whoops.
I am running ssh 1.7.13 + python 2.7.1 + OS X Lion + pycrypto 2.5 and am still running into the RNG issue. I've also tried on a redhat box but am getting the same results. The issue is present regardless of the number of parallel runs. I seem to be running into the problem the minute I spawn a separate process. Is this issue resolved or still being worked on? Thank you in advance.
@euforia I believe you're the first person to say they've got the problem still, and you're also the only one I think who had the RNG problem for any sized run -- myself and others only had it become problematic with more than, say, two dozen simultaneous processes. Have you always been running on PyCrypto 2.5? Would you be able to test some other versions of PyCrypto (like PyCrypto 2.3) and maybe Python minor version (eg 2.6.x) to see if it's persistent everywhere?
Thank you for your quick response, Jeff. I remember it working prior to an upgrade I performed, but I am not sure what the combination (pycrypto + python) was. So far I've tried the following combinations with no success:
Is there a combination you can provide that works for you? I can test it out to see if the issue is with the environment. Here's the code for reference:

```python
from multiprocessing import Process
import ssh  # the 'ssh' library (renamed Paramiko fork)

def get_output_from_channel_obj(chnl):
    ...  # body not preserved in this copy of the comment

def run_test_command(tobj, cmd):
    ...  # body not preserved in this copy of the comment

# Main
conn = ssh.SSHClient()
# (connection setup not preserved)
transport_obj = conn.get_transport()
p1 = Process(target=run_test_command, args=(transport_obj, "virsh list",))
conn.close()
```

Once again, thank you for your help. Please let me know.
Running with parallel on a multinode setup we bumped into a fabric/paramiko bug that freezes deployment. Specifically we got: paramiko.transport:Success for unrequested channel! [??] This seems to be related with issue: fabric/fabric#19 So until someone fixes it, we remove parallelism from setup_cluster and setup_vmc. This will make a multinode deployment obviously slower but will not affect the basic use case of --autoconf. Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>
Description
Fabric currently uses the simplest approach to the most common use case, but as a result is quite naive and not threadsafe, and cannot easily be run in parallel even by an outside agent.
Rework the execution model and state sharing to be thread-safe (whether by using threadlocals or something else), and if possible go further and actually implement a parallel execution mode users can choose to activate, with threading or multiprocessing or similar.
Morgan Goose has been the lead on this feature and has an in-working-shape branch in his GitHub fork (link goes to the multiprocessing branch, which is the one you want to use). We hope to merge this into core Fabric for 1.1.

Current TODO:

- ~~Anal-retentive renaming, e.g. s/runs_parallel/parallel/~~ (see the sketch after this list)
- ~~Code formatting cleanup/rearranging~~
- ~~Mechanics/behavior/implementation double-check~~
- ~~Linewise output added back in (may make sub-ticket)~~
- Paramiko situation examined re: dependency on 1.7.7.1+ and thus PyCrypto 2.1+
  - Including documenting the change in the install docs if necessary
- ~~Pull in anything useful that Morgan hadn't pushed at time of my merge~~
- ~~(if not included in previous) Examine logging support and decide if it's worth bumping to next release~~
- ~~Test, test, test~~

Originally submitted by Jeff Forcier (bitprophet) on 2009-07-21 at 02:52pm EDT
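After the renaming item above, task-level usage would presumably read (hypothetical fabfile showing the post-rename spelling):

```python
from fabric.api import env, parallel, run, serial

env.hosts = ["web1", "web2", "web3"]

@parallel
def restart():
    run("service app restart")

@serial
def announce():
    run("wall 'deploy finished'")
```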