Implement parallelism/thread-safety #19

Closed
bitprophet opened this Issue Aug 19, 2011 · 53 comments


@bitprophet
Member

Description

Fabric currently uses the simplest approach to the most common use case, but as a result is quite naive and not threadsafe, and cannot easily be run in parallel even by an outside agent.

Rework the execution model and state sharing to be thread-safe (whether by using threadlocals or something else), and if possible go further and actually implement a parallel execution mode users can choose to activate, with threading or multiprocessing or similar.


Morgan Goose has been the lead on this feature and has an in-working-shape branch in his GitHub fork (link goes to the multiprocessing branch, which is the one you want to use). We hope to merge this into core Fabric for 1.1.


Current TODO:

  • Anal retentive renaming, e.g. s/runs_parallel/parallel/
  • Code formatting cleanup/rearranging
  • Mechanics/behavior/implementation double-check
  • Linewise output added back in (may make sub-ticket)
  • Paramiko situation examined re: dependency on 1.7.7.1+ and thus PyCrypto 2.1+
    • Including documenting the change in the install docs if necessary
  • Pull in anything useful that Morgan hadn't pushed at time of my merge
  • (if not included in previous) Examine logging support and decide if it's worth bumping to next release
  • Test, test, test

Originally submitted by Jeff Forcier (bitprophet) on 2009-07-21 at 02:52pm EDT

Attachments

Relations

  • Related to #20: Rework output mechanisms
  • Related to #21: Make execution model more robust/flexible
  • Related to #197: Handle running without any controlling tty
@bitprophet bitprophet was assigned Aug 19, 2011
@bitprophet
Member

Jeff Forcier (bitprophet) posted:


For some discussion concerning users' needs along these lines, see this mid-July mailing list thread.


on 2009-08-09 at 11:31am EDT

@bitprophet
Member

Wes Winham (winhamwr) posted:


My 2 cents:

Reworking the execution model and implementing thread-safe state sharing is probably going to be really hard. It'll touch a lot of code, be subject to a lot of hard-to-track bugs, and might require backwards-incompatible changes. It would, however, be very useful for some people (probably a minority, though).

To me, this makes it a candidate either for looking at once some of the higher-use issues are implemented, or for work by someone else (some person or group of people who need it) in a fork. Things like more example documentation, easier output control, improved prompt handling, and #23 generic persistence handlers (so I can do everything in my virtualenv easily) would all benefit everyone substantially and also aren't so big on time commitment. From a cost/benefit standpoint, I'm thinking those are much bigger wins. Not at all saying that thread-safety wouldn't be awesome, just saying that you (Jeff) have limited time.

I might be stating the obvious, though, and as someone who could benefit from thread-safety (I manage 6 machines) but doesn't see it as a huge personal priority, of course I might be biased.


on 2009-08-09 at 12:50pm EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


Don't know if this is off topic for this thread, but I've made a small change in main.py, adding a command-line flag that leverages the multiprocessing module to fork out a process for each host. It is a bit messy with stdout, but seems to work fine for my test fabfile.

It'll use the normal execution model by default, but if someone wants to run all of their hosts in parallel, -P/--parallel will switch to using a Process() for each host.

One thing to note is that it won't run tasks simultaneously. Those will still be processed sequentially, while every host on a task will be run in parallel.
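(For illustration only -- not Morgan's actual patch: a minimal sketch of that per-host fan-out, where run_task_on_host and execute are hypothetical stand-ins for "run one task against one host" and the main loop.)

    # Sketch of the -P behavior described above: tasks stay sequential,
    # hosts within a task each get their own process.
    from multiprocessing import Process

    def run_task_on_host(task, host):
        # Hypothetical helper: run a single task against a single host.
        print "[%s] running %s" % (host, task.__name__)
        task(host)

    def execute(tasks, hosts, parallel=False):
        for task in tasks:                    # tasks are still processed in order
            if parallel:
                procs = [Process(target=run_task_on_host, args=(task, host))
                         for host in hosts]
                for p in procs:
                    p.start()
                for p in procs:               # wait for every host before the next task
                    p.join()
            else:
                for host in hosts:
                    run_task_on_host(task, host)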


on 2010-03-11 at 04:27pm EST

@bitprophet
Member

Morgan Goose (goosemo) posted:


As that patch stands, it expects one to have the multiprocessing module. This is a non-issue for Python 2.6, but for 2.5/2.4 the backported module needs to be grabbed from http://code.google.com/p/python-multiprocessing/


on 2010-03-17 at 10:25am EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


Here is an updated patch that checks whether the multiprocessing module should be imported, and allows Python versions without said module (2.4/2.5) to run without erroring out.
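(A rough sketch of such a guarded import, not the patch itself -- require_multiprocessing is an illustrative name: serial runs keep working on 2.4/2.5 without the module, and only an actual parallel run errors out.)

    # On Python 2.6+ multiprocessing is in the stdlib; on 2.4/2.5 it must come
    # from the PyPI backport linked above. Fall back to None if it's missing.
    try:
        import multiprocessing
    except ImportError:
        multiprocessing = None

    def require_multiprocessing():
        # Only called when parallel execution is actually requested.
        if multiprocessing is None:
            raise SystemExit(
                "Parallel execution requires the 'multiprocessing' module; "
                "on Python 2.4/2.5, install the backport from PyPI.")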


on 2010-03-17 at 04:52pm EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


Just as an update, this is the tip commit for the multiprocessing branch I made for this issue.
http://github.com/goosemo/fabric/commit/3406723d2f2f9fc41de2a5f064da46b460ff35a3

It adds a CLI flag, -P, for forcing parallel execution. It also adds two decorators, @runs_parallel and @runs_sequential, that will force those options on a fab task regardless of the parallel CLI flag.

If anyone has issues, or requests, let me know.
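(For illustration, a hypothetical fabfile using those decorators might look like the following -- this assumes they are exposed alongside Fabric's existing decorators, which may differ from the branch's actual import paths.)

    from fabric.api import run
    from fabric.decorators import runs_parallel, runs_sequential  # assumed location

    @runs_parallel
    def update_code():
        # Runs across all hosts in parallel, even without -P on the command line.
        run("git pull")

    @runs_sequential
    def restart_app():
        # Runs host-by-host even when -P is given.
        run("service myapp restart")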


on 2010-04-01 at 10:09am EDT

@bitprophet
Member

Jeremy Orem (oremj) posted:


It would also be beneficial to be able to specify how many parallel commands to run at once; e.g., I have 100 systems and only want to run service httpd reload on 20 at a time.
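(A sketch of the idea -- capping how many hosts run at once by working through them in bounded batches; run_on_host and the batch size are illustrative. A queue that refills as jobs finish would be more efficient than strict batches, but the effect on the remote side is the same.)

    from multiprocessing import Process

    def run_in_batches(run_on_host, hosts, pool_size=20):
        # Never more than pool_size child processes alive at a time.
        for start in range(0, len(hosts), pool_size):
            batch = hosts[start:start + pool_size]
            procs = [Process(target=run_on_host, args=(host,)) for host in batch]
            for p in procs:
                p.start()
            for p in procs:
                p.join()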


on 2010-05-13 at 05:12pm EDT

@bitprophet
Member

Joshua Peek (josh) posted:


This is a feature that I look forward to having in 1.0. Thanks for your work.


on 2010-05-20 at 07:56pm EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


Just a note, the work done over in #7 now means that remote stdout/stderr are printed on a byte-by-byte basis instead of line-by-line (fabfile-local print statements/interactions with sys.stdout remain line-buffered as usual). This may impact any parallel solution more negatively than the previous line-buffered behavior.

Whether we solve that for this ticket by adding back in some sort of "legacy" line-buffered, no-stdin-available mode (gross, lots of potential for code duplication, still lots of potential for confusing output) or by taking a harder look at what sorts of output people expect when running jobs in parallel, it isn't something I will be looking at right now, but it will need to be tackled before this ticket can be considered done. (I haven't seen whether Morgan's code handles this at all -- doesn't it get pretty confusing no matter what, unless you simply hide all but your own print statements? Would we want to just be autologging each server to its own log file and expect folks to multitail or something?)

run-on sentences yaaaaay


on 2010-06-18 at 03:13pm EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


Noting something that came up via IRC while Morgan was testing out a merge of his branch and the #7 branch -- because the MP solution runs a bunch of Fabric "workers" headless, without actual TTYs involved, this impacts a lot of assumptions Fabric makes, though primarily the ones made by #7 (so far, the code that forces stdin to be bytewise is what errors out). So basically, the previous comment, but writ large.

Keep the overall ramifications of this in mind when integrating this work. Will be trying to update #7 with a fix for this soon.


on 2010-07-21 at 12:39pm EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


So just today I got what I feel is an acceptable version of my multiprocessing branch. To test it I suggest you make use of virtualenv (and virtualenvwrapper) to keep your mainline Fabric install separate. But the code is up to date with the HEAD of the fabric repo, and can be grabbed on GitHub here: http://github.com/goosemo/fabric

I changed the readme some to explain the options I've added. Feel free to peruse the code, and let me know if you find anything wanting in this implementation.


on 2010-07-21 at 10:13pm EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


With the newest updates I am able to merge in the multiprocessing branch I have, and have no issues when running my personal scripts that utilize parallel execution, and a simple test script of some run|get|put|cd()s. People can grab commit 2f56f612846c31c94c77 and test it out some more. I'll be working on trying to make unit tests to verify parallel execution, and since it's not in the main branch people can use http://github.com/goosemo/fabric/issues for tickets. (Unless Jeff wants/prefers issues be placed in this ticket.)


on 2010-10-07 at 11:56am EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


As mentioned on IRC, I'm fine with people using fork-specific GH issues pages, as long as any outstanding issues at merge time get ported over here. (Basically, code in the core Fab repo should have its issues documented here.)

Excited that it seems to "just work" with the latest IO changes in master. As discussed elsewhere I think we'll want to update things so that we have more bulletproof IO settings for this mode (I'm 99% sure that with nontrivial commands, users would run into garbled output between processes in the bytewise mode we use now) but if that's not resolved when I have time to look at this myself, I'll take care of it.


on 2010-10-07 at 12:04pm EDT

@bitprophet
Member

Morgan Goose (goosemo) posted:


Can you think of a good test command that'd give some output that you'd expect to be garbled? Also what do you mean by that, an interweaving of output lines, or an issue with an output line itself having the output of another command break it up?


on 2010-10-08 at 03:58pm EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


Well, theoretically because everything is byte-buffered now instead of line-buffered, you could get things mixed up on a character by character level if two of the processes are both writing to stdout or stderr at the same time. It's quite possible though that things may run fast enough, realistically, that each process writes its lines fast enough for this not to be a big deal.

For a stress test, I was thinking of anything that generates a lot of output, or output quickly, such as wget/curl downloads, or aptitude update/upgrade, that sort of thing. An excellent test would be to see what happens with interactive programs like python or top/htop (try these beforehand without using parallelism, tho, just so you know how they behave currently in Fab's normal mode -- prefixes and crap still make them look funky sometimes).

(I don't even know what would happen to stdin at that point, which is the other half of this, but it might be fun to find out. I presume it'll multiplex to all processes as they'd theoretically all inherit the file descriptor, but not sure.)


on 2010-10-08 at 05:40pm EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


For now, putting this in as the lynchpin for 1.2.


on 2011-03-10 at 12:12pm EST

@bitprophet
Member

Joshua Peek (josh) posted:


Hey y'all, just coming in again to say this would be great to have.


on 2011-07-01 at 02:14am EDT

@bitprophet
Member

Finally took goosemo's branch out for a spin. Works as advertised (woo!), but as I expected, the output is often quite garbled (and there's an issue open on his branch from another user with the same complaint.)

Best solution short of "no stdout/stderr output, only logging" or using tmux/screen, is probably to re-implement a line-based output scheme which (along with abort_on_prompts) is flipped on whenever parallelism is active.

@daharon
daharon commented Sep 16, 2011

I'm sure you guys have tried out Capistrano before (runs commands in parallel). Just wanted to chime in and say that I like their output scheme...

  * executing "ln -nfs /mnt/fileserver/example.com/group_pic /srv/example.com/releases/20110916145923/group_pic"
    servers: ["10.0.16.29", "10.0.16.202", "10.0.16.112", "10.0.16.111", "10.0.18.28", "10.0.16.196", "10.0.16.82", "10.0.16.75", "10.0.16.149", "10.0.16.203", "10.0.16.148", "10.0.16.87", "10.0.16.70", "10.0.16.170"]
    [10.0.16.29] executing command
    [10.0.16.202] executing command
    [10.0.16.111] executing command
    [10.0.16.112] executing command
    [10.0.18.28] executing command
    [10.0.16.82] executing command
    [10.0.16.70] executing command
    [10.0.16.170] executing command
    [10.0.16.75] executing command
    [10.0.16.149] executing command
    [10.0.16.203] executing command
    [10.0.16.148] executing command
    [10.0.16.87] executing command
    [10.0.16.196] executing command
    command finished in 46ms
@bitprophet
Member

That's how Fabric operates already, actually, re: printing line prefixes (since even when running in serial you still want to know which server's output you're currently on.) The problem with parallel right now isn't related to "which line is which server" but simply that even on an individual line, you'll get some bytes from one server and some bytes from another one.

(This is because unlike Cap, we moved to a bytewise IO scheme in 1.0 which lets us handle true interactivity with the remote end. Works great in serial, not so great in parallel, but interactivity doesn't mesh with parallel execution anyways, so not a problem.)

@bitprophet
Member

(TODO moved to issue description)

@bitprophet
Member

On a whim, did a slightly more complex real-world test before diving into linewise output:

$ fab parallel_task runs_once_task serial_task parallel_task

It results in this error every time.

Narrowing down, the error does not occur on invocations like this:

$ fab parallel_task parallel_task

Nor this:

$ fab parallel_task runs_once_locally_task parallel_task

But it does on this:

$ fab parallel_task serial_task parallel_task

Note that in all the above, this involves two invocations of the exact same task, a host list set up via -H, and parallelization forced via -P -- no @parallel.

It also occurs with two different parallel tasks, so it seems to be a straightforward "parallel, serial, parallel" issue and not some problem with task-name-based process IDs or anything. I.e.:

$ fab parallel_task serial_task other_parallel_task

Will see if we need to re-call Random.atfork() somewhere -- we're already calling it inside the @parallel decorator and JobQueue._advance_the_queue (after job.start() though...hm.)
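(The parenthetical above is the crux: Random.atfork() only helps if it executes inside the forked child. A hedged sketch of the distinction, with do_work as a hypothetical placeholder:)

    from multiprocessing import Process
    from Crypto import Random

    def do_work(host):
        # Placeholder for "open an SSH session and run the task".
        print "working on %s" % host

    def child_target(host):
        # Correct placement: this line runs inside the forked child, so the
        # child's copy of PyCrypto's RNG state is re-seeded before any SSH work.
        Random.atfork()
        do_work(host)

    def spawn(host):
        p = Process(target=child_target, args=(host,))
        p.start()
        # By contrast, a call here executes in the *parent* after start() and
        # does nothing for the child's copy of the RNG state.
        Random.atfork()
        return p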

@bitprophet
Member

Wonderful, switching to @parallel and invoking fab parallel serial parallel2 (or parallel serial parallel) doesn't blow up but does just hang there and spits out a few No handlers could be found for logger "paramiko.transport" errors too (but...not one per host, just two on a 5-host list.)

EDIT: activating logging shows one WARNING:paramiko.transport:Success for unrequested channel! [??] per host in my host list in this situation. Huh. I do not see this warning in regular, one-task parallel usage.

EDIT 2: Also, there is an existing issue, #214, with the same basic symptom apparently popping up sometimes in what I assume is regular, non-parallel usage. It claims to be fixed with the patched Paramiko that this branch here should be using, but I might need to double check that it is installing the right version. That said, it's quite possible this is another unrelated problem that just triggers the same error msg.

@bitprophet
Member

When exploring the PyCrypto issue, I noticed that paramiko 1.7.7.1 includes the @sugarc0de fix (which has sadly since been nuked from his repo?). That fix does not fix this bug (expected, as this bug isn't something Morgan seems to have run into in his branch/testing) but should obviate the need to depend on a non-mainline Paramiko...for now.

Something to consider when we get to that part of my above TODO. It does up our install deps to need PyCrypto >=2.1 instead of 1.9, so I should check whether that fixes or clashes with the odd issue we had to fix-via-docs re: PyCrypto being stupid on Python 2.5.

@bitprophet
Member

I think the problem here is one of connections -- Morgan's implementation of JobQueue explicitly disconnects all at the end, which is probably why two subsequent invokes of parallel tasks work, but shoving a serial task in between breaks things: entering the 2nd parallel task has extant connections in the cache, and thus Paramiko isn't rerunning Random.atfork() (which it added into Transport in either the patched 1.7.6 or in 1.7.7).

I also think our additions of Random.atfork() may not be helping as much since some/all of them may be running in the parent process and not the forks. Still doesn't explain the hanging in the @parallel using scenario though.

Will experiment with cache nuking and then try to figure out if that's actually necessary or if we can juggle things such that we don't have to reconnect on every task invoke. (though that is not a huge evil, since running many tasks in a row in one session is not a common use case.)

EDIT: Unfortunately, I tested out running 2 consecutive parallel tasks with the disconnect_all call commented out, and no boom, implying that it was not actually needed/related. I also didn't see anything obvious in the new code that would have really cared about the connection cache.

EDIT 2: However, another test I didn't think to try before, just doing fab serial parallel (no beginning parallel), also results in the error. So clearly some state tickled by running serially is fucking up parallel runs. Will re-examine what the main loop does.

@bitprophet
Member

Interesting -- throwing disconnect_all() at the top of JobQueue.start actually makes things work well in the fab serial parallel situation. Offhand it's not clear why that makes a difference, yet fab parallel parallel did not have problems when there was no disconnect_all() anywhere. Need to dig deeper.

EDIT: Tossing more debugging into my local copy of Paramiko 1.7.7.1 stock, I find that yea, Transport.start_client (and thus Paramiko's own Random.atfork()) is definitely only running when we set up connections to the server, which is why disconnecting was implicated in this particular issue originally (though not why it still only happens in the serial-parallel mix, and not a double-parallel setup.)

EDIT 2: AHA! So simple. When a parallelized task is the first task to connect, because of our lazy-connection behavior it is the first one to add items to the connection cache. And since it's been stuck into its own process, it's running with its own copy of fabric.state, so its caching never makes it back to the main thread.

This is why >1 parallel in a row never has any problems and why disconnect_all() only sometimes helps -- they each spawn with a new, fresh, empty copy of the connection cache. Unless a serial task has run beforehand, which fills it out, and then new parallel tasks inherit a copy of that filled-out cache, don't ever run Paramiko's own call to Random.atfork, and blow up.

This doesn't immediately point to the hanging version of the problem when using the decorators, but perhaps it has to do with calling atfork twice in a row or something.

It also doesn't explain why using the decorator fails to solve the problem -- if the issue is solely that we aren't calling atfork inside the new process, the decorator should be doing so, and should then mean that fab serial parallel_using_decorator would not have the problem. However, it still seems to. Will double check.

@bitprophet
Member

In order to know when/where to call atfork ourselves (i.e. if we want to avoid the "solution" of nuking connections before every parallel task start) I must figure out why we hang if using the decorator 2x in a row.

The hang itself is within Paramiko, here inside open_channel (which is called when Fabric attempts to actually open a session on the connection/client object.) Unfortunately I don't know exactly what it's doing when in that loop, re: why event.isSet() never occurs.

Combined with the log message about Success for unrequested channel it makes me think something odd is going on with the process forking, or a bug in Paramiko perhaps re: channel numbers. Will google a bit to see if this is a known problem/symptom.

EDIT: Nope, just the source code. Popping more debug stuff in the area generating the message, and comparing that to the rest of the log output, I'm seeing Paramiko doing things on channels 1 and 2, and then getting messages back on channel 2 and going "whu?! what's channel 2?!". Which seems suspect?

EDIT 2: The channels datastruct, at the time of the phantom success, is empty, which explains the error, but presents its own mystery.

Running just one parallel task results in what feels like multiple writes to the log from the two processes, both using channel 1. Makes me wonder if something is going on where the higher-level Paramiko object, when set up via the serial process, is fleshing something out that is then "wrong" when run in a subprocess later on. (Re: why the "bad" parallel run ends up with both channels 1 and 2.)

Empirical testing:

  • Running 1 serial task on 2 hosts sets up channel 1 on each connection object.
  • Running 2 serial tasks on 2 hosts sets up channel 1, then channel 2
  • Running one parallel task on 2 hosts also sets up channel 1 on each obj
  • Running serial, then parallel, sets up channel 1 in the serial task, then sets up channel 2 in the parallel task

So, pretty obvious "each new session request creates a new channel at N+1", which is backed up by how the code in open_transport works. If a parallel task is running on a cached cxn obj, it'll still increment as expected.

EDIT 3: More debugging, within the forked process' transport obj, its self._channels is being filled out correctly, so the spot accepting the success message is obviously on a different wavelength.

I now suspect that this is because the previously-instantiated client object in the parent process is somehow retaining the network socket (or something) and is thus "grabbing" the success message (and probably anything else?) intended for the subprocess. This makes sense re: what little I remember about how fork() treats file (and thus socket) access.

That still fits with the "nuke cached connections fixes everything" behavior as well, because then each process is starting with its own fresh objects which would then create and retain their own socket handles.

Bit tired to ponder the ramifications of this but I suspect that the "real" solution might require more patching to Paramiko so it's fully multiprocessing-friendly -- but OTOH if it's as simple as a shared socket problem then the onus is on us (heh) to ensure we don't end up in this situation -- necessitating bypassing the connection cache whenever parallelism is used.

@bitprophet
Member

A summary of that monstrosity:

  • The using-@parallel hanging problem is probably what happens when we correctly call atfork ourselves in a new process (which we must do because a prior serial task already triggered Paramiko's own call to atfork).
    • Put another way, it is not encountered in the -P scenario because we can't even get that far due to crashing out from not calling atfork.
  • I suspect, but have not yet verified, that the hang is due to the parent process' cached Paramiko connection object receiving socket messages intended for the child/forked processes instead. It discards them with an unrequested channel error (because it doesn't have the N+1 channel number), and the children loop forever, never getting that other side of the connection handshake, because the remote end obviously has no idea it spoke to the "wrong" process.
  • This can be routed around by ensuring that forked child processes always start with an empty connection cache, thus forcing them to create their own connections and thus their own sockets -- at the cost of creating an entirely new TCP/SSH connection.
  • I'm not sure there is any other solution, if the problem truly is at the socket level. "Make sure the child has its own socket" is tantamount to "set up a new TCP/SSH connection", after all.

Moving forward, we can either nuke the cache entirely in the parent thread when starting any parallel threads, or try to nuke it inside the forked child processes. The former is easier but marginally less efficient (any serial tasks will need to "unnecessarily" reconnect at the start; but since the caching is mostly useful within a task, and many-task sessions are not terribly common, it's not a big deal). The latter is worth it if it's easy, but not if it's difficult or error-prone.
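(A minimal sketch of the latter option, with illustrative names rather than the branch's actual code -- the real cache lives in fabric.state: each forked worker drops whatever connections it inherited as its very first act, forcing it to build its own connections and thus its own sockets.)

    from multiprocessing import Process

    connections = {}   # stand-in for the parent's host -> client connection cache

    def worker(task, host):
        # The child inherits a copy of the parent's cache (and, worse, the
        # underlying sockets). Clearing it forces a fresh connection per child.
        connections.clear()
        task(host)

    def run_parallel(task, hosts):
        procs = [Process(target=worker, args=(task, host)) for host in hosts]
        for p in procs:
            p.start()
        for p in procs:
            p.join()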

@bitprophet
Member

Went with the latter option re: nuking cache inside child procs. Still need to test more thoroughly re: different combos of parallel & serial tasks mixed up, but it appears to solve the reproducible test case from before just fine.


Also updated setup.py to point just at Paramiko 1.7.7.1, as per an above comment, but we still need to examine how that affects the edge case mentioned, and update docs accordingly.

@bitprophet
Member

Next up is linewise output. Had a thought to skip this and go straight to logging -- since even linewise printing is still going to be less than useful in many situations, re: debugging. Then remembered that logging is typically line-oriented anyway -- we can't exactly toss individual bytes into a logging module and expect it to handle breaking things into individual lines or chunks. (I think.)

Re: this specific feature, I think it makes sense to have linewise output regardless.

Re: the general issue of logging, I'm not sure we can entirely escape from print in the non-parallel, bytewise situation, because unless the logging module can neatly print individual bytes without doing anything else, that will screw up the interactivity feature. But that's something we can handle later -- for this ticket, optional extra logging is what is most useful.
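(The gist of linewise mode, sketched with a hypothetical wrapper rather than Fabric's actual IO classes: buffer each worker's bytes and emit only whole, host-prefixed lines. Lines from different hosts can still interleave, but individual lines stay intact, which is the trade-off described above.)

    import sys

    class LineBuffer(object):
        """Accumulate bytes; flush only whole lines, prefixed with the host."""
        def __init__(self, host, stream=sys.stdout):
            self.host = host
            self.stream = stream
            self.buf = ""

        def write(self, data):
            self.buf += data
            while "\n" in self.buf:
                line, self.buf = self.buf.split("\n", 1)
                self.stream.write("[%s] %s\n" % (self.host, line))

        def flush(self):
            if self.buf:
                self.stream.write("[%s] %s\n" % (self.host, self.buf))
                self.buf = ""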

@bitprophet
Member

Linewise output is now basically working right. Figured now was a good time to supply specific examples of why it's at least somewhat better than leaving everything bytewise:

(The doubling-up of some line prefixes in linewise mode is something that has already been at least partly addressed. Too lazy to re-capture :P)

@bitprophet
Member

Turns out that after fixing the line prefix stuff, we get extra blank lines instead, due to the problems mentioned in #182. This looks sufficiently annoying in any nontrivial parallel execution that I am trying to find a solution for it right now.

@bitprophet
Member

Fixed #182 but I think something went awry in merging that into the branch for this ticket, as now things seriously blow up in linewise mode, mostly surrounding ANSI color escape codes. Ironically bytewise doesn't suffer that problem, only its earlier occasional-garbling.

@bitprophet
Member

OK, fixed that, everything looks reasonably good at this point, other than the mentioned-in-#182 deal where trailing newlines don't have a line prefix, which also sometimes doubles up (i.e. two commands with trailing whitespace running concurrently will sometimes result in both blank lines printing next to each other, forming 2-3 blank lines in a row.)

There's no great way to handle that, and things are still about as readable as they'll get in this situation. Time to move on.

@bitprophet
Member

Have been doing some sanity testing on Python 2.5 and 2.6 with a nontrivial (~50 hosts) host list, and 2.6 seems to be pretty unstable, which is frustrating. Specifically, it pretty routinely runs into the sort of RNG atfork errors that I thought we had ironed out, as well as what looks like the halting/freezing problem I ran into earlier.

Specifics:

  • Both are installing Fabric from the same Git checkout using pip install -e .
  • Both are fresh virtualenvs created for the sole purpose of testing this, with --no-site-packages
  • Python 2.5 is OS X 10.6 Snow Leopard's default build of 2.5.4
    • Using multiprocessing 2.6.2.1 from PyPI
  • Python 2.6 is also the Snow Leopard default 2.6 build, 2.6.1
    • Using the stdlib multiprocessing

I'll try testing on a Linux box with 2.5, 2.6 and 2.7 (and, hopefully, a newer 2.6 at that) to see if I can replicate.


On Ubuntu 11.04 with stock Python 2.7 (2.7.1+ [lol]), I am getting at least some instances of the atfork error; running with just -P almost always results in 1 or more errors. None of the lockup/freezing issue so far.

Re: smaller pool sizes: about ~30-50% of the time with -z45 through -z50 (again on a host list of 50 hosts) I get the errors; try as I might I can't get it to occur with -z20, and pretty rarely but not "never" (maybe 5% of the time) with -z35. Still no freezing at all; just a short, ~1s pause at the end, presumably to wait for everything to disconnect.


On the same Ubuntu box, with Python 2.6 (2.6.6), it's a similar story: almost all the time with just -P, much of the time with e.g. -z45, some of the time with -z35 (seems more frequent than on 2.7 with the same z-value), never with -z20. Also no freezing here, which is at least one departure from 2.6 on the Mac, which consistently freezes even with no other errors appearing.

@bitprophet
Member

While I really hope this isn't due to some sort of change in multiprocessing, that seems to be one difference between 2.5+pypi-multiprocessing and 2.6/2.7. I am looking into this now.


  • Python 2.6.1 (the OS X 2.6)'s stdlib multiprocessing goes up to hg commit 49908:ed8adfef26c6, dated Nov 30th 2008.
  • Python 2.6.6 (Ubuntu Natty's 2.6, ignoring any Ubuntu changes) is 64004:5f33653e3837, dated August 14th 2010
  • Python 2.7.1 (Natty's 2.7, again sans Ubuntu tweaks) is 66115:0aa8af79359d, dated Nov 9th 2010

For reference, PyPI multiprocessing version 2.6.2.1 was uploaded to PyPI on July 30th 2009, making it ostensibly newer than the 2.6.1 stdlib version, but older than either Natty version.

Sadly, that somewhat rules out multiprocessing as the culprit; if it was some bug fixed in MP after the PyPI version, it should have also been fixed on my Ubuntu tests, barring some sort of fix-then-regression that caused only the PyPI version to have the fix. Seems highly unlikely to me. The opposite scenario (breakage introduced only after the PyPI version) also fails the test as it would mean the OS X 2.6 would work fine.


Will try doing some actual debugging now.

EDIT: Well now. For the first time, I've gotten the errors to show up with a non-limited -P under 2.5. So...at least that much is consistent. Must just be heavily exacerbated under 2.6/2.7 for some reason.

@bitprophet
Member

Verified that the Random.atfork call is definitely executing within the same PIDs that encounter the exceptions, which is very worrisome. Found at least one other report (on StackOverflow) which looks like the same "upgraded to paramiko 1.7.7.1 but it's still a problem" issue.

Will keep poking and/or throwing extraneous Random.atfork calls into our own code, for lack of any better approach.

EDIT: There is an issue filed at paramiko itself also asserting the same thing, with no response. It also implies that downgrading PyCrypto doesn't help, though I may try that myself anyways.

(Going to laugh if Paramiko-1.7.6 + atfork + Crypto 2.0.1 ends up working fine, since that is what Morgan originally tested on, after all...)


The actual exception (which Paramiko masks/hides in its own traceback, but then logs at level ERROR) turns out to be this one.

Crypto's test is for its stored PID to not match the current PID; and sure enough when I expand the exception msg to be more useful, it's always got the parent PID in its stored ID, with the new process' PID coming out of os.getpid() as normal.

Will dig some more to see if this is a PyCrypto bug or what.

@bitprophet
Member

Interesting, using Morgan's patched Paramiko 1.7.6.1 does seem to behave much better.

As far as I can tell the salient difference between that and 1.7.7.1 is that Morgan's version has two calls to Random.atfork, one in the same place as 1.7.7.1 has it, and another one at the top of Transport.run. (Which, if accurate, I guess validates my "lol just add moar atforks" idea.)

Note that this is still using PyCrypto 2.3, not 2.0.1 -- implying that it really is an issue about the use of atfork and not a bug in Crypto itself.


Using stock Paramiko 1.7.7.1 and toggling that second Random.atfork call at the top of Transport.run certainly appears to do the trick with my full-50-hosts run. When commented out: errors; when in place: no errors. Did this many times in a row, same results each time. Still on 2.5 at the moment, about to test the same one-line addition on 2.6 and 2.7.
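(Until something like that lands upstream, the same one-liner can be applied as a monkeypatch from user code instead of editing the installed Paramiko -- a hedged sketch, assuming stock Paramiko 1.7.7.x as above:)

    # Wrap Transport.run so the first thing it does is re-seed PyCrypto's RNG,
    # mirroring the one-line "second atfork" described above.
    import paramiko
    from Crypto import Random

    _orig_run = paramiko.Transport.run

    def _run_with_atfork(self):
        Random.atfork()
        return _orig_run(self)

    paramiko.Transport.run = _run_with_atfork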

EDIT: Fixes it on OS X's 2.6, but seems to always result in the hang-at-end problem. Linux 2.6 also fixed, and no hanging. Ditto for 2.7.

EDIT 2: Tried getting 2.6.1 on Linux but that was a no-go, so next up is to try directly troubleshooting the hangs on Snow Leopard.

EDIT 3: Upon reflection, the fact that it seems to work on the PyPI MP and Linux 2.6/2.7, implies it may be a bug in MP, since PyPI and the Linux installs are all newer than the 2.6 on Snow Leopard, which is the only one that seems to have the issue. Can test this by using Lion which ought to have a newer 2.6 (apparently it's 2.6.6, same as on Natty), as well as 2.7.

@bitprophet
Member

Hm, on a Lion machine under 2.7 I can't even recreate the atfork problems with the stock Paramiko 1.7.7.1. Which makes it hard to test this potential "fix".

Noting here that the system I was testing on all day today was a mid/late 2010 MBP 15" i5, and the system I am testing Lion on right now is a 2011 MBA 11" i5. The 11" feels a bit snappier when running the full 50-server flight, in fact. Makes me wonder if the issue is related to computing power.

EDIT: I can't get the error to occur on Python 2.6 either, same machine, stock Apple Python 2.6.6.


I also added the 2nd atfork on this system's 2.6 to see if it would somehow cause problems, and it does not seem to. So it's likely a safe "no-op" when it's not fixing the RNG errors.

Given all this:

  • Add the 2nd atfork globally
  • Note that Python 2.6.1 on OS X appears to still have issues with large flights of parallel processes, namely hanging at the end
  • Release as-is
  • Wait for more users to report success/failure on various combos of OS/Python, release bugfixes if required.

Though I would still like to try and see if I can replicate the hanging problem on my Linux VM on Python 2.6.1, or to debug it on Snow Leopard, to find out if it is a general "older Python 2.6 multiprocessing" issue that was fixed at some specific point.

@bitprophet
Member

Got 2.6.1 building on my Natty VM. It does exhibit the RNG problem, and amusingly, it also exhibits the freezing problem, even without the RNG fix applied.

With the RNG fix applied, I get some No handlers could be found for logger "paramiko.transport" errors, and the hanging. So this seems to rule out any Mac-specific problem, and is likely to be an "older multiprocessing" problem. I'll try to verify this by crawling the MP changelog and/or building increasingly newer Pythons on the VM to test with.

Also quickly tried to see what Paramiko was trying to log, but with logging enabled I see no warnings/errors in the log at level DEBUG. However I removed logging and ran a few more times, and it looks like the Paramiko logging errors do not happen every time. Not worth chasing down just yet.


2.6.2, Natty, pretty much the same story: exhibits RNG problem, and the occasional/intermittent logging errors, and (most of the time, but not always) the hanging. With RNG-fix applied, RNG problem goes away, hanging remains, just like on 2.6.1.

This was expected: if my theory is accurate, there was some fix to MP between Python 2.6.1 and the PyPI MP release, which came somewhere between Python 2.6.2 and 2.6.3. So I would expect 2.6.3 to work well with the RNG-fix. We'll see.


2.6.3 exhibits RNG problem (ok; expected, since even 2.6.6 and 2.7.1 exhibit it on Linux if not Lion), and the logging errors, but I can't get the hanging to occur, which is good -- it fits my theory.

Applying the RNG fix clears that problem up, and I still can't get the hanging to occur (i.e. everything seems to work just fine with 50 hosts all in one go.)

I might take another stab at recreating the RNG problem on a different Lion box to figure out what the deal is there, but this seems conclusive re: the hanging problem: Python >2.5,<2.6.3 has some sort of multiprocessing bug causing large runs to sometimes hang when they terminate.

@bitprophet
Member

There is no absolutely clear bug/bugfix in the CPython Mercurial repo/bug tracker that would explain the hanging, given the way we are using multiprocessing.

I did comb through the entire history for Lib/multiprocessing from 2.6.1 through the PyPI release's upload date (July 30th 2009). However, the above tests imply the fix actually landed after 2.6.2's release (April 14th 2009), so that should be the actual window to search.

Potential candidates:

  • Issue 5177 - SO_REUSEADDR problems. I recall running into this type of problem myself earlier in Fabric's history. The implication is that a hung parallel run would eventually terminate, but I've let one sit around for a very long time with no result.
    • This was fixed in March, so probably not the one.
  • Changeset 2c00bdaad7a9 - no apparent issue related to this, and its fix sounds like something that should cause constant, big errors, but noting it anyways. Fixed June 1st -- fits the dates at least.
  • Issue 5331 - Morgan's implementation uses the custom JobQueue class and not multiprocessing.Pool, so I don't think this fits. Fixed on June 30th.
  • Issue 6433 - also Pool-related, so not a great fit. Fixed on July 16th.
@bitprophet
Member

Left a comment on the Paramiko tracker's issue re: the atfork crap. It comes off looking huffy re: forking (given I only reported the need for the 2nd atfork just now) but at least I linked back to #275 to give some historical context.

@bitprophet
Member

Dealing with the Paramiko fork fun in #275 now.

Last note here re: the RNG errors: I ran my tests on another Lion box (a 15" 2011 MBP) under its stock 2.6.6, and as with the MBA, I didn't even get the RNG errors to show up, or any hanging. Not sure why I got them with 2.6.6 on Linux in a VM, but whatever :)

@bancek
bancek commented Sep 30, 2011

I've created a monkeypatch script to make Fabric thread-safe using threading.local()

https://github.com/bancek/fabric-threadsafe
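(The core idea, sketched with illustrative names rather than the script's actual contents: replace shared module-level state with threading.local() so each worker thread sees its own values.)

    import threading

    class ThreadLocalEnv(threading.local):
        # Each thread that touches this object gets its own attribute values;
        # __init__ re-runs once per thread on first use.
        def __init__(self):
            self.host_string = None
            self.cwd = ""

    env = ThreadLocalEnv()

    def worker(host):
        env.host_string = host   # visible only to this thread
        # ... run commands against env.host_string ...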

@bitprophet
Member

@bancek -- did you note the ticket history? :( we've already got a pretty solid multiprocessing implementation, so at this point it's probably not a good use of time to flip things around to being threadsafe and then rewrite the feature using some other concurrency model.

That said -- I'm definitely going to keep a link to that somewhere so I can refer to it when writing Fabric 2.0, so thanks! :D

@bancek
bancek commented Sep 30, 2011

This wasn't meant to get into Fabric core. It's just a temporary fix, and I posted it here in case somebody else needs a quick patch.

@bitprophet
Member

Fair enough! Thanks again :)

@bitprophet
Member

This was merged into master about a week ago.

@bitprophet bitprophet closed this Oct 7, 2011
@bitprophet bitprophet reopened this Oct 23, 2011
@bitprophet
Member

Reopening so I remember to double-check the impact of a newer Paramiko's dependencies (the only un-crossed-out item in the description's list). This will need doing regardless of the outcome of #275.

@bitprophet bitprophet added a commit that referenced this issue Oct 23, 2011
@bitprophet bitprophet s/paramiko/ssh/g
Re #275, re #19
b7788f5
@bitprophet
Member

The Paramiko/PyCrypto 2.1+ issue was simply the three-parter of python==2.5.x, pip<=0.8.0 and pycrypto >=2.1 being uninstallable.

Now that Fabric requires pycrypto 2.1, that changes to: not installable via pip 0.8.0 or older on Python 2.5.x. (Using PyCrypto 2.0.1 is no longer an option.)

Will update docs appropriately. (Also need to update all our docs re: Paramiko now being 'ssh'.)

@bitprophet bitprophet added a commit that referenced this issue Oct 23, 2011
@bitprophet bitprophet Replace all mentions of Paramiko in docs[trings].
Except in one or two spots where it still makes sense.

Re #19
253b905
@bitprophet bitprophet closed this Oct 23, 2011
@bitprophet bitprophet added a commit that referenced this issue Oct 24, 2011
@bitprophet bitprophet Add --linewise and its rationale to the docs.
Totally forgot to do this re: #19, whoops.
fbc5622
@euforia
euforia commented Mar 22, 2012

I am running ssh 1.7.13 + python 2.7.1 + OS X Lion + pycrypto 2.5 and am still running into the RNG issue. I've also tried on a redhat box but am getting the same results. The issue is present regardless of the number of parallel runs. I seem to be running into the problem the minute I spawn a separate process. Is this issue resolved or still being worked on? Thank you in advance.

@bitprophet
Member

@euforia I believe you're the first person to say they've got the problem still, and you're also the only one I think who had the RNG problem for any sized run -- myself and others only had it become problematic with more than, say, two dozen simultaneous processes.

Have you always been running on PyCrypto 2.5? Would you be able to test some other versions of PyCrypto (like Crypto 2.3) and maybe Python minor version (eg 2.6.x) to see if it's persistent everywhere?

@euforia
euforia commented Mar 23, 2012

Thank you for your quick response, Jeff. I remember it working prior to an upgrade I performed, but I am not sure what the combination (pycrypto + python) was. So far I've tried the following combinations with no success:

  • python 2.6.7 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5
  • python 2.7.1 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5

Is there a combination you can provide that works for you? I can test it out to see if the issue is with the environment.

Here's the code for reference:

    import ssh                            # the 'ssh' Paramiko fork mentioned above
    from multiprocessing import Process   # imports restored for completeness

    def get_output_from_channel_obj(chnl):
        data = chnl.recv(1024)
        while data:
            print "run:"
            print "%s" % data
            data = chnl.recv(1024)

    def run_test_command(tobj, cmd):
        chann = tobj.open_session()
        chann.exec_command(cmd)
        print "command:", cmd
        print "output"
        get_output_from_channel_obj(chann)
        chann.close()

    # Main (ipaddr, user, passwd defined elsewhere)

    conn = ssh.SSHClient()
    conn.set_missing_host_key_policy(ssh.AutoAddPolicy())
    conn.connect(ipaddr, username=user, password=passwd)

    transport_obj = conn.get_transport()
    channelSession = transport_obj.open_session()

    p1 = Process(target=run_test_command, args=(transport_obj, "virsh list",))
    p1.start()
    p1.join()

    conn.close()

    # end code

Once again, thank you for your help. Please let me know.

@dimara dimara added a commit to dimara/synnefo that referenced this issue May 29, 2014
@dimara dimara deploy: Remove parallel from fabfile
Running with parallel on a multinode setup we bumped into a
fabric/paramiko bug that freezes deployment. Specifically
we got:

paramiko.transport:Success for unrequested channel! [??]

This seems to be related with issue:

fabric/fabric#19

So until someone fixes it, we remove parallelism from setup_cluster
and setup_vmc. This will make a multinode deployment obviously slower
but will not affect the basic use case of --autoconf.

Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>
30736fe
@dimara dimara added a commit to dimara/synnefo that referenced this issue May 29, 2014
@dimara dimara deploy: Remove parallel from fabfile (same commit message as above)
cb97e62
@dimara dimara added a commit to dimara/synnefo that referenced this issue May 30, 2014
@dimara dimara deploy: Remove parallel from fabfile (same commit message as above)
abfd363