New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement parallelism/thread-safety #19

Closed
bitprophet opened this Issue Aug 19, 2011 · 53 comments

Comments

Projects
None yet
4 participants
@bitprophet
Member

bitprophet commented Aug 19, 2011

Description

Fabric currently uses the simplest approach to the most common use case, but as a result is quite naive and not threadsafe, and cannot easily be run in parallel even by an outside agent.

Rework the execution model and state sharing to be thread-safe (whether by using threadlocals or something else), and if possible go further and actually implement a parallel execution mode users can choose to activate, with threading or multiprocessing or similar.


Morgan Goose has been the lead on this feature and has a in-working-shape branch in his Github fork (link goes to the multiprocessing branch, which is the one you want to use). We hope to merge this into core Fabric for 1.1.


Current TODO:

  • Anal retentive renaming, e.g. s/runs_parallel/parallel/
  • Code formatting cleanup/rearranging
  • Mechanics/behavior/implementation double-check
  • Linewise output added back in (may make sub-ticket)
  • Paramiko situation examined re: dependency on 1.7.7.1+ and thus PyCrypto 2.1+
    • Including documenting the change in the install docs if necessary
  • Pull in anything useful that Morgan hadn't pushed at time of my merge
  • (if not included in previous) Examine logging support and decide if it's worth bumping to next release
  • Test, test, test

Originally submitted by Jeff Forcier (bitprophet) on 2009-07-21 at 02:52pm EDT

Attachments

Relations

  • Related to #20: Rework output mechanisms
  • Related to #21: Make execution model more robust/flexible
  • Related to #197: Handle running without any controlling tty

@ghost ghost assigned bitprophet Aug 19, 2011

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


For some discussion concerning users' needs along these lines, see this mid-July mailing list thread.


on 2009-08-09 at 11:31am EDT

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


For some discussion concerning users' needs along these lines, see this mid-July mailing list thread.


on 2009-08-09 at 11:31am EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Wes Winham (winhamwr) posted:


My 2 cents:

Reworking the execution model and implementing thread-safe state sharing is probably going to be really hard. It'll touch a lot of code, be subject to a lot of hard to track bugs, and might require backwards-incompatible changes. It would, however, be very useful for some people (probably a minority though).

To me, this makes it a candidate for either looking at when some of the more high-use issues are implemented or for work by someone else (some person or group of people who need it) in a fork. Things like more example documentation, easier output control, improved prompt handling, and #23 generic persistence handlers (so I can do everything in my virtualenv easily) are all things that would benefit everyone substantially and also not be so big on time commitment. From a cost/benefit, I'm thinking that those are much bigger wins. Not at all saying that thread-safety wouldn't be awesome, just saying that you (Jeff) have limited time.

I might be stating the obvious though and as someone who could benefit from thread-safety (I manage 6 machines in my case), but doesn't see it as a huge personal priority, of course I might be biased.


on 2009-08-09 at 12:50pm EDT

Member

bitprophet commented Aug 19, 2011

Wes Winham (winhamwr) posted:


My 2 cents:

Reworking the execution model and implementing thread-safe state sharing is probably going to be really hard. It'll touch a lot of code, be subject to a lot of hard to track bugs, and might require backwards-incompatible changes. It would, however, be very useful for some people (probably a minority though).

To me, this makes it a candidate for either looking at when some of the more high-use issues are implemented or for work by someone else (some person or group of people who need it) in a fork. Things like more example documentation, easier output control, improved prompt handling, and #23 generic persistence handlers (so I can do everything in my virtualenv easily) are all things that would benefit everyone substantially and also not be so big on time commitment. From a cost/benefit, I'm thinking that those are much bigger wins. Not at all saying that thread-safety wouldn't be awesome, just saying that you (Jeff) have limited time.

I might be stating the obvious though and as someone who could benefit from thread-safety (I manage 6 machines in my case), but doesn't see it as a huge personal priority, of course I might be biased.


on 2009-08-09 at 12:50pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


Don't know if this is off topic for this thread, but I've made a small change in the main.py, with a command flag, that'll leverage the multiprocessing module to fork out processes for each host. It is a bit messy with stdout, but seems to work fine for my test fabfile.

It'll use the normal execution model by default, but if someone wants to run all of their hosts in parallel -P/--parallel will switch into using the Process() for each host.

One thing to note is that it won't run tasks simultaneously. Those will still be processed sequentially, while every host on a task will be run in parallel.


on 2010-03-11 at 04:27pm EST

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


Don't know if this is off topic for this thread, but I've made a small change in the main.py, with a command flag, that'll leverage the multiprocessing module to fork out processes for each host. It is a bit messy with stdout, but seems to work fine for my test fabfile.

It'll use the normal execution model by default, but if someone wants to run all of their hosts in parallel -P/--parallel will switch into using the Process() for each host.

One thing to note is that it won't run tasks simultaneously. Those will still be processed sequentially, while every host on a task will be run in parallel.


on 2010-03-11 at 04:27pm EST

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


As that patch stands, it expects one to have the multiprocessing module. This is a non issue for python 2.6, but for 2.5/2,4 the backported module needs to be grabbed from http://code.google.com/p/python-multiprocessing/


on 2010-03-17 at 10:25am EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


As that patch stands, it expects one to have the multiprocessing module. This is a non issue for python 2.6, but for 2.5/2,4 the backported module needs to be grabbed from http://code.google.com/p/python-multiprocessing/


on 2010-03-17 at 10:25am EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


Here is an updated patch that will check if the multiprocessing module should be imported, and will allow for python versions (2.4/2.5) w/o said module to not error out when attempting to run.


on 2010-03-17 at 04:52pm EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


Here is an updated patch that will check if the multiprocessing module should be imported, and will allow for python versions (2.4/2.5) w/o said module to not error out when attempting to run.


on 2010-03-17 at 04:52pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


Just as an update, this is the tip commit for the multiprocessing branch I made for this issue.
http://github.com/goosemo/fabric/commit/3406723d2f2f9fc41de2a5f064da46b460ff35a3

It adds in a cli flag -P for forcing parallel. It also adds two decorators, @runs_parallel and @runs_sequential, that will force those options on a fab task regardless of the parallel cli flag.

If anyone has issues, or requests, let me know.


on 2010-04-01 at 10:09am EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


Just as an update, this is the tip commit for the multiprocessing branch I made for this issue.
http://github.com/goosemo/fabric/commit/3406723d2f2f9fc41de2a5f064da46b460ff35a3

It adds in a cli flag -P for forcing parallel. It also adds two decorators, @runs_parallel and @runs_sequential, that will force those options on a fab task regardless of the parallel cli flag.

If anyone has issues, or requests, let me know.


on 2010-04-01 at 10:09am EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeremy Orem (oremj) posted:


It would also be beneficial to be able to specify how many parallel commands to run at once e.g., I have 100 systems and I only want to run service httpd reload on 20 at a time.


on 2010-05-13 at 05:12pm EDT

Member

bitprophet commented Aug 19, 2011

Jeremy Orem (oremj) posted:


It would also be beneficial to be able to specify how many parallel commands to run at once e.g., I have 100 systems and I only want to run service httpd reload on 20 at a time.


on 2010-05-13 at 05:12pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Joshua Peek (josh) posted:


This is a feature that I look forward to having in 1.0. Thanks for your work.


on 2010-05-20 at 07:56pm EDT

Member

bitprophet commented Aug 19, 2011

Joshua Peek (josh) posted:


This is a feature that I look forward to having in 1.0. Thanks for your work.


on 2010-05-20 at 07:56pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


Just a note, the work done over in #7 now means that remote stdout/stderr are printed on a byte-by-byte basis instead of line-by-line (fabfile-local print statements/interactions with sys.stdout remain line-buffered as usual). This may impact any parallel solution more negatively than the previous line-buffered behavior.

Whether we solve that for this ticket by adding back in some sort of "legacy" line buffered, no-stdin-available mode (gross, lots of potential for code duplication, still lots of potential for confusing output) or taking a harder look at what sorts of output people expect when running jobs in parallel (I haven't seen whether Morgan's code handles this at all -- doesn't it get pretty confusing no matter what, unless you simply hide all but your own print statements? would we want to just be autologging each server to its own log file and expect folks to multitail or something?) isn't something I will be looking at right now, but it will need to be tackled before this ticket here can be considered done.

run-on sentences yaaaaay


on 2010-06-18 at 03:13pm EDT

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Just a note, the work done over in #7 now means that remote stdout/stderr are printed on a byte-by-byte basis instead of line-by-line (fabfile-local print statements/interactions with sys.stdout remain line-buffered as usual). This may impact any parallel solution more negatively than the previous line-buffered behavior.

Whether we solve that for this ticket by adding back in some sort of "legacy" line buffered, no-stdin-available mode (gross, lots of potential for code duplication, still lots of potential for confusing output) or taking a harder look at what sorts of output people expect when running jobs in parallel (I haven't seen whether Morgan's code handles this at all -- doesn't it get pretty confusing no matter what, unless you simply hide all but your own print statements? would we want to just be autologging each server to its own log file and expect folks to multitail or something?) isn't something I will be looking at right now, but it will need to be tackled before this ticket here can be considered done.

run-on sentences yaaaaay


on 2010-06-18 at 03:13pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


Note that came up via IRC while Morgan was testing out a merge of his branch and the #7 branch -- because the MP solution runs a bunch of Fabric "workers" headless without actual TTYs involved, this impacts a lot of assumptions Fabric makes , though primarily the ones made by #7 (so far, the code that forces stdin to be bytewise is what errors out). So basically, the previous comment but writ large.

Keep the overall ramifications of this in mind when integrating this work. Will be trying to update #7 with a fix for this soon.


on 2010-07-21 at 12:39pm EDT

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Note that came up via IRC while Morgan was testing out a merge of his branch and the #7 branch -- because the MP solution runs a bunch of Fabric "workers" headless without actual TTYs involved, this impacts a lot of assumptions Fabric makes , though primarily the ones made by #7 (so far, the code that forces stdin to be bytewise is what errors out). So basically, the previous comment but writ large.

Keep the overall ramifications of this in mind when integrating this work. Will be trying to update #7 with a fix for this soon.


on 2010-07-21 at 12:39pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


So just today I got what I feel is an acceptable version of my multiprocessing
branch. To test it I suggest you make use of virtualenv (and virtualenvwrapper)
to keep your mainline Fabric install separate. But the code is up to date with
the HEAD of the fabric repo, and is able to be grabbed on github here: [[http://github.com/goosemo/fabric]]

I changed the readme some to explain the options I've added. Feel free to
peruse the code, and let me know if you find anything wanting in this
implementation.


on 2010-07-21 at 10:13pm EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


So just today I got what I feel is an acceptable version of my multiprocessing
branch. To test it I suggest you make use of virtualenv (and virtualenvwrapper)
to keep your mainline Fabric install separate. But the code is up to date with
the HEAD of the fabric repo, and is able to be grabbed on github here: [[http://github.com/goosemo/fabric]]

I changed the readme some to explain the options I've added. Feel free to
peruse the code, and let me know if you find anything wanting in this
implementation.


on 2010-07-21 at 10:13pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


With the newest updates I am able to merge in the multiprocessing branch I have, and have no issues when running my personal scripts that utilize parallel execution, and a simple test script of some run|get|put|cd()s. People can grab commit 2f56f612846c31c94c77 and test it out some more. I'll be working on trying to make unit tests to verify parallel execution, and since it's not in the main branch people can use http://github.com/goosemo/fabric/issues for tickets. (Unless Jeff wants/prefers issues be placed in this ticket.)


on 2010-10-07 at 11:56am EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


With the newest updates I am able to merge in the multiprocessing branch I have, and have no issues when running my personal scripts that utilize parallel execution, and a simple test script of some run|get|put|cd()s. People can grab commit 2f56f612846c31c94c77 and test it out some more. I'll be working on trying to make unit tests to verify parallel execution, and since it's not in the main branch people can use http://github.com/goosemo/fabric/issues for tickets. (Unless Jeff wants/prefers issues be placed in this ticket.)


on 2010-10-07 at 11:56am EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


As mentioned on IRC, I'm fine with people using fork-specific GH issues pages, as long as any outstanding issues at merge time get ported over here. (Basically, code in the core Fab repo should have its issues documented here.)

Excited that it seems to "just work" with the latest IO changes in master. As discussed elsewhere I think we'll want to update things so that we have more bulletproof IO settings for this mode (I'm 99% sure that with nontrivial commands, users would run into garbled output between processes in the bytewise mode we use now) but if that's not resolved when I have time to look at this myself, I'll take care of it.


on 2010-10-07 at 12:04pm EDT

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


As mentioned on IRC, I'm fine with people using fork-specific GH issues pages, as long as any outstanding issues at merge time get ported over here. (Basically, code in the core Fab repo should have its issues documented here.)

Excited that it seems to "just work" with the latest IO changes in master. As discussed elsewhere I think we'll want to update things so that we have more bulletproof IO settings for this mode (I'm 99% sure that with nontrivial commands, users would run into garbled output between processes in the bytewise mode we use now) but if that's not resolved when I have time to look at this myself, I'll take care of it.


on 2010-10-07 at 12:04pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Morgan Goose (goosemo) posted:


Can you think of a good test command that'd give some output that you'd expect to be garbled? Also what do you mean by that, an interweaving of output lines, or an issue with an output line itself having the output of another command break it up?


on 2010-10-08 at 03:58pm EDT

Member

bitprophet commented Aug 19, 2011

Morgan Goose (goosemo) posted:


Can you think of a good test command that'd give some output that you'd expect to be garbled? Also what do you mean by that, an interweaving of output lines, or an issue with an output line itself having the output of another command break it up?


on 2010-10-08 at 03:58pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


Well, theoretically because everything is byte-buffered now instead of line-buffered, you could get things mixed up on a character by character level if two of the processes are both writing to stdout or stderr at the same time. It's quite possible though that things may run fast enough, realistically, that each process writes its lines fast enough for this not to be a big deal.

For a stress test, I was thinking of anything that generates a lot of output, or output quickly, such as wget/curl downloads, or aptitude update/upgrade, that sort of thing. An excellent test would be to see what happens with interactive programs like python or top/htop (try these beforehand without using parallelism, tho, just so you know how they behave currently in Fab's normal mode -- prefixes and crap still make them look funky sometimes).

(I don't even know what would happen to stdin at that point, which is the other half of this, but it might be fun to find out. I presume it'll multiplex to all processes as they'd theoretically all inherit the file descriptor, but not sure.)


on 2010-10-08 at 05:40pm EDT

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Well, theoretically because everything is byte-buffered now instead of line-buffered, you could get things mixed up on a character by character level if two of the processes are both writing to stdout or stderr at the same time. It's quite possible though that things may run fast enough, realistically, that each process writes its lines fast enough for this not to be a big deal.

For a stress test, I was thinking of anything that generates a lot of output, or output quickly, such as wget/curl downloads, or aptitude update/upgrade, that sort of thing. An excellent test would be to see what happens with interactive programs like python or top/htop (try these beforehand without using parallelism, tho, just so you know how they behave currently in Fab's normal mode -- prefixes and crap still make them look funky sometimes).

(I don't even know what would happen to stdin at that point, which is the other half of this, but it might be fun to find out. I presume it'll multiplex to all processes as they'd theoretically all inherit the file descriptor, but not sure.)


on 2010-10-08 at 05:40pm EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Jeff Forcier (bitprophet) posted:


For now, putting this in as the lynchpin for 1.2.


on 2011-03-10 at 12:12pm EST

Member

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


For now, putting this in as the lynchpin for 1.2.


on 2011-03-10 at 12:12pm EST

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Aug 19, 2011

Member

Joshua Peek (josh) posted:


Hey Y'all just coming in again to say this would be great to have .


on 2011-07-01 at 02:14am EDT

Member

bitprophet commented Aug 19, 2011

Joshua Peek (josh) posted:


Hey Y'all just coming in again to say this would be great to have .


on 2011-07-01 at 02:14am EDT

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 16, 2011

Member

Finally took goosemo's branch out for a spin. Works as advertised (woo!), but as I expected, the output is often quite garbled (and there's an issue open on his branch from another user with the same complaint.)

Best solution short of "no stdout/stderr output, only logging" or using tmux/screen, is probably to re-implement a line-based output scheme which (along with abort_on_prompts) is flipped on whenever parallelism is active.

Member

bitprophet commented Sep 16, 2011

Finally took goosemo's branch out for a spin. Works as advertised (woo!), but as I expected, the output is often quite garbled (and there's an issue open on his branch from another user with the same complaint.)

Best solution short of "no stdout/stderr output, only logging" or using tmux/screen, is probably to re-implement a line-based output scheme which (along with abort_on_prompts) is flipped on whenever parallelism is active.

@daharon

This comment has been minimized.

Show comment
Hide comment
@daharon

daharon Sep 16, 2011

I'm sure you guys have tried out Capistrano before (runs commands in parallel). Just wanted to chime in and say that I like their output scheme...

  * executing "ln -nfs /mnt/fileserver/example.com/group_pic /srv/example.com/releases/20110916145923/group_pic"
    servers: ["10.0.16.29", "10.0.16.202", "10.0.16.112", "10.0.16.111", "10.0.18.28", "10.0.16.196", "10.0.16.82", "10.0.16.75", "10.0.16.149", "10.0.16.203", "10.0.16.148", "10.0.16.87", "10.0.16.70", "10.0.16.170"]
    [10.0.16.29] executing command
    [10.0.16.202] executing command
    [10.0.16.111] executing command
    [10.0.16.112] executing command
    [10.0.18.28] executing command
    [10.0.16.82] executing command
    [10.0.16.70] executing command
    [10.0.16.170] executing command
    [10.0.16.75] executing command
    [10.0.16.149] executing command
    [10.0.16.203] executing command
    [10.0.16.148] executing command
    [10.0.16.87] executing command
    [10.0.16.196] executing command
    command finished in 46ms

daharon commented Sep 16, 2011

I'm sure you guys have tried out Capistrano before (runs commands in parallel). Just wanted to chime in and say that I like their output scheme...

  * executing "ln -nfs /mnt/fileserver/example.com/group_pic /srv/example.com/releases/20110916145923/group_pic"
    servers: ["10.0.16.29", "10.0.16.202", "10.0.16.112", "10.0.16.111", "10.0.18.28", "10.0.16.196", "10.0.16.82", "10.0.16.75", "10.0.16.149", "10.0.16.203", "10.0.16.148", "10.0.16.87", "10.0.16.70", "10.0.16.170"]
    [10.0.16.29] executing command
    [10.0.16.202] executing command
    [10.0.16.111] executing command
    [10.0.16.112] executing command
    [10.0.18.28] executing command
    [10.0.16.82] executing command
    [10.0.16.70] executing command
    [10.0.16.170] executing command
    [10.0.16.75] executing command
    [10.0.16.149] executing command
    [10.0.16.203] executing command
    [10.0.16.148] executing command
    [10.0.16.87] executing command
    [10.0.16.196] executing command
    command finished in 46ms
@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 16, 2011

Member

That's how Fabric operates already, actually, re: printing line prefixes (since even when running in serial you still want to know which server's output you're currently on.) The problem with parallel right now isn't related to "which line is which server" but simply that even on an individual line, you'll get some bytes from one server and some bytes from another one.

(This is because unlike Cap, we moved to a bytewise IO scheme in 1.0 which lets us handle true interactivity with the remote end. Works great in serial, not so great in parallel, but interactivity doesn't mesh with parallel execution anyways, so not a problem.)

Member

bitprophet commented Sep 16, 2011

That's how Fabric operates already, actually, re: printing line prefixes (since even when running in serial you still want to know which server's output you're currently on.) The problem with parallel right now isn't related to "which line is which server" but simply that even on an individual line, you'll get some bytes from one server and some bytes from another one.

(This is because unlike Cap, we moved to a bytewise IO scheme in 1.0 which lets us handle true interactivity with the remote end. Works great in serial, not so great in parallel, but interactivity doesn't mesh with parallel execution anyways, so not a problem.)

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 17, 2011

Member

(TODO moved to issue description)

Member

bitprophet commented Sep 17, 2011

(TODO moved to issue description)

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 20, 2011

Member

On a whim, did a slightly more complex real-world test before diving into linewise output:

$ fab parallel_task runs_once_task serial_task parallel_task

It results in this error every time.

Narrowing down, the error does not occur on invocations like this:

$ fab parallel_task parallel_task

Nor this:

$ fab parallel_task runs_once_locally_task parallel_task

But it does on this:

$ fab parallel_task serial_task parallel_task

Note that in all the above, this involves two invocations of the exact same task, a host list set up via -H, and parallelization forced via -P -- no @parallel.

It also occurs with two different parallel tasks, so it seems to be a straightforward "parallel, serial, parallel" issue and not some problem with task-name-based process IDs or anything. I.e.:

$ fab parallel_task serial_task other_parallel_task

Will see if we need to re-call Random.atfork() somewhere -- we're already calling it inside the @parallel decorator and JobQueue._advance_the_queue (after job.start() though...hm.)

Member

bitprophet commented Sep 20, 2011

On a whim, did a slightly more complex real-world test before diving into linewise output:

$ fab parallel_task runs_once_task serial_task parallel_task

It results in this error every time.

Narrowing down, the error does not occur on invocations like this:

$ fab parallel_task parallel_task

Nor this:

$ fab parallel_task runs_once_locally_task parallel_task

But it does on this:

$ fab parallel_task serial_task parallel_task

Note that in all the above, this involves two invocations of the exact same task, a host list set up via -H, and parallelization forced via -P -- no @parallel.

It also occurs with two different parallel tasks, so it seems to be a straightforward "parallel, serial, parallel" issue and not some problem with task-name-based process IDs or anything. I.e.:

$ fab parallel_task serial_task other_parallel_task

Will see if we need to re-call Random.atfork() somewhere -- we're already calling it inside the @parallel decorator and JobQueue._advance_the_queue (after job.start() though...hm.)

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 20, 2011

Member

Wonderful, switching to @parallel and invoking fab parallel serial parallel2 (or parallel serial parallel) doesn't blow up but does just hang there and spits out a few No handlers could be found for logger "paramiko.transport" errors too (but...not one per host, just two on a 5-host list.)

EDIT: activating logging shows one WARNING:paramiko.transport:Success for unrequested channel! [??] per host in my host list in this situation. Huh. I do not see this warning in regular, one-task parallel usage.

EDIT 2: Also, there is an existing issue, #214, with the same basic symptom apparently popping up sometimes in what I assume is regular, non-parallel usage. It claims to be fixed with the patched Paramiko that this branch here should be using, but I might need to double check that it is installing the right version. That said, it's quite possible this is another unrelated problem that just triggers the same error msg.

Member

bitprophet commented Sep 20, 2011

Wonderful, switching to @parallel and invoking fab parallel serial parallel2 (or parallel serial parallel) doesn't blow up but does just hang there and spits out a few No handlers could be found for logger "paramiko.transport" errors too (but...not one per host, just two on a 5-host list.)

EDIT: activating logging shows one WARNING:paramiko.transport:Success for unrequested channel! [??] per host in my host list in this situation. Huh. I do not see this warning in regular, one-task parallel usage.

EDIT 2: Also, there is an existing issue, #214, with the same basic symptom apparently popping up sometimes in what I assume is regular, non-parallel usage. It claims to be fixed with the patched Paramiko that this branch here should be using, but I might need to double check that it is installing the right version. That said, it's quite possible this is another unrelated problem that just triggers the same error msg.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 20, 2011

Member

When exploring the PyCrypto issue, I noticed that paramiko 1.7.7.1 includes the @Sugarc0de fix (which has sadly since been nuked from his repo?). That fix does not fix this bug (expected, as this bug isn't something Morgan seems to have run into in his branch/testing) but should obviate the need to depend on a non-mainline Paramiko...for now.

Something to consider when we get to that part of my above TODO. It does up our install deps to need PyCrypto >=2.1 instead of 1.9, so I should check to see if that fixes or clashes with the odd issue we had to fix-via-docs re: pyCrypto being stupid on Python 2.5.

Member

bitprophet commented Sep 20, 2011

When exploring the PyCrypto issue, I noticed that paramiko 1.7.7.1 includes the @Sugarc0de fix (which has sadly since been nuked from his repo?). That fix does not fix this bug (expected, as this bug isn't something Morgan seems to have run into in his branch/testing) but should obviate the need to depend on a non-mainline Paramiko...for now.

Something to consider when we get to that part of my above TODO. It does up our install deps to need PyCrypto >=2.1 instead of 1.9, so I should check to see if that fixes or clashes with the odd issue we had to fix-via-docs re: pyCrypto being stupid on Python 2.5.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 20, 2011

Member

I think the problem here is one of connections -- Morgan's implementation of JobQueue explicitly disconnects all at the end, which is probably why two subsequent invokes of parallel tasks works, but shoving a serial task in between breaks things: entering the 2nd parallel task has extant connections in the cache, and thus Paramiko isn't rerunning Random.atfork() (which it added into Transport in either the patched 1.7.6 or in 1.7.7).

I also think our additions of Random.atfork() may not be helping as much since some/all of them may be running in the parent process and not the forks. Still doesn't explain the hanging in the @parallel using scenario though.

Will experiment with cache nuking and then try to figure out if that's actually necessary or if we can juggle things such that we don't have to reconnect on every task invoke. (though that is not a huge evil, since running many tasks in a row in one session is not a common use case.)

EDIT: Unfortunately, I tested out running 2 consecutive parallel tasks with the disconnect_all call commented out, and no boom, implying that it was not actually needed/related. I also didn't see anything obvious in the new code that would have really cared about the connection cache.

EDIT 2: However, another test I didn't think to try before, just doing fab serial parallel (no beginning parallel), also results in the error. So clearly some state tickled by running serially is fucking up parallel runs. Will re-examine what the main loop does.

Member

bitprophet commented Sep 20, 2011

I think the problem here is one of connections -- Morgan's implementation of JobQueue explicitly disconnects all at the end, which is probably why two subsequent invokes of parallel tasks works, but shoving a serial task in between breaks things: entering the 2nd parallel task has extant connections in the cache, and thus Paramiko isn't rerunning Random.atfork() (which it added into Transport in either the patched 1.7.6 or in 1.7.7).

I also think our additions of Random.atfork() may not be helping as much since some/all of them may be running in the parent process and not the forks. Still doesn't explain the hanging in the @parallel using scenario though.

Will experiment with cache nuking and then try to figure out if that's actually necessary or if we can juggle things such that we don't have to reconnect on every task invoke. (though that is not a huge evil, since running many tasks in a row in one session is not a common use case.)

EDIT: Unfortunately, I tested out running 2 consecutive parallel tasks with the disconnect_all call commented out, and no boom, implying that it was not actually needed/related. I also didn't see anything obvious in the new code that would have really cared about the connection cache.

EDIT 2: However, another test I didn't think to try before, just doing fab serial parallel (no beginning parallel), also results in the error. So clearly some state tickled by running serially is fucking up parallel runs. Will re-examine what the main loop does.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 20, 2011

Member

Interesting -- throwing disconnect_all() at the top of JobQueue.start actually makes things work well in the fab serial parallel situation. Offhand it's not clear why that makes a difference, yet fab parallel parallel did not have problems when there was no disconnect_all() anywhere. Need to dig deeper.

EDIT: Tossing more debugging into my local copy of Paramiko 1.7.7.1 stock, I find that yea, Transport.start_client (and thus Paramiko's own Random.atfork()) is definitely only running when we set up connections to the server, which is why disconnecting was implicated in this particular issue originally (though not why it still only happens in the serial-parallel mix, and not a double-parallel setup.)

EDIT 2: AHA! So simple. When a parallelized task is the first task to connect, because of our lazy-connection behavior it is the first one to add items to the connection cache. And since it's been stuck into its own process, it's running with its own copy of fabric.state, so its caching never makes it back to the main thread.

This is why >1 parallel in a row never has any problems and why disconnect_all() only sometimes helps -- they each spawn with a new, fresh, empty copy of the connection cache. Unless a serial task has run beforehand, which fills it out, and then new parallel tasks inherit a copy of that filled-out cache, don't ever run Paramiko's own call to Random.atfork, and blow up.

This doesn't immediately point to the hanging version of the problem when using the decorators, but perhaps it has to do with calling atfork twice in a row or something.

It also doesn't explain why using the decorator fails to solve the problem -- if the issue is solely that we aren't calling atfork inside the new process, the decorator should be doing so, and should then mean that fab serial parallel_using_decorator would not have the problem. However, it still seems to. Will double check.

Member

bitprophet commented Sep 20, 2011

Interesting -- throwing disconnect_all() at the top of JobQueue.start actually makes things work well in the fab serial parallel situation. Offhand it's not clear why that makes a difference, yet fab parallel parallel did not have problems when there was no disconnect_all() anywhere. Need to dig deeper.

EDIT: Tossing more debugging into my local copy of Paramiko 1.7.7.1 stock, I find that yea, Transport.start_client (and thus Paramiko's own Random.atfork()) is definitely only running when we set up connections to the server, which is why disconnecting was implicated in this particular issue originally (though not why it still only happens in the serial-parallel mix, and not a double-parallel setup.)

EDIT 2: AHA! So simple. When a parallelized task is the first task to connect, because of our lazy-connection behavior it is the first one to add items to the connection cache. And since it's been stuck into its own process, it's running with its own copy of fabric.state, so its caching never makes it back to the main thread.

This is why >1 parallel in a row never has any problems and why disconnect_all() only sometimes helps -- they each spawn with a new, fresh, empty copy of the connection cache. Unless a serial task has run beforehand, which fills it out, and then new parallel tasks inherit a copy of that filled-out cache, don't ever run Paramiko's own call to Random.atfork, and blow up.

This doesn't immediately point to the hanging version of the problem when using the decorators, but perhaps it has to do with calling atfork twice in a row or something.

It also doesn't explain why using the decorator fails to solve the problem -- if the issue is solely that we aren't calling atfork inside the new process, the decorator should be doing so, and should then mean that fab serial parallel_using_decorator would not have the problem. However, it still seems to. Will double check.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 21, 2011

Member

In order to know when/where to call atfork ourselves (i.e. if we want to avoid the "solution" of nuking connections before every parallel task start) I must figure out why we hang if using the decorator 2x in a row.

The hang itself is within Paramiko, here inside open_channel (which is called when Fabric attempts to actually open a session on the connection/client object.) Unfortunately I don't know exactly what it's doing when in that loop, re: why event.isSet() never occurs.

Combined with the log message about Success for unrequested channel it makes me think something odd is going on with the process forking, or a bug in Paramiko perhaps re: channel numbers. Will google a bit to see if this is a known problem/symptom.

EDIT: Nope, just the source code. Popping more debug stuff in the area generating the message, and comparing that to the rest of the log output, I'm seeing Paramiko doing things on channels 1 and 2, and then getting messages back on channel 2 and going "whu?! what's channel 2?!". Which seems suspect?

EDIT 2: The channels datastruct, at the time of the phantom success, is empty, which explains the error, but presents its own mystery.

Running just one parallel task results in what feels like multiple writes to the log from the two processes, both using channel 1. Makes me wonder if something is going on where the higher level Paramiko object, when set up via the serial process, is fleshing something out that is then "wrong" run in a subprocess later on. (Re: why the "bad" parallel run ends up with both channels 1 and 2.)

Empirical testing:

  • Running 1 serial task on 2 hosts sets up channel 1 on each connection object.
  • Running 2 serial tasks on 2 hosts sets up channel 1, then channel 2
  • Running one parallel task on 2 hosts also sets up channel 1 on each obj
  • Running serial, then parallel, sets up channel 1 in the serial task, then sets up channel 2 in the parallel task

So, pretty obvious "each new session request creates a new channel at N+1", which is backed up by how the code in open_transport works. If a parallel task is running on a cached cxn obj, it'll still increment as expected.

EDIT 3: More debugging, within the forked process' transport obj, its self._channels is being filled out correctly, so the spot accepting the success message is obviously on a different wavelength.

I now suspect that this is because the previously-instantiated client object in the parent process is somehow retaining the network socket (or something) and is thus "grabbing" the success message (and probably anything else?) intended for the subprocess. This makes sense re: what little I remember about how fork() treats file (and thus socket) access.

That still fits with the "nuke cached connections fixes everything" behavior as well, because then each process is starting with its own fresh objects which would then create and retain their own socket handles.

Bit tired to ponder the ramifications of this but I suspect that the "real" solution might require more patching to Paramiko so it's fully multiprocessing-friendly -- but OTOH if it's as simple as a shared socket problem then the onus is on us (heh) to ensure we don't end up in this situation -- necessitating bypassing the connection cache whenever parallelism is used.

Member

bitprophet commented Sep 21, 2011

In order to know when/where to call atfork ourselves (i.e. if we want to avoid the "solution" of nuking connections before every parallel task start) I must figure out why we hang if using the decorator 2x in a row.

The hang itself is within Paramiko, here inside open_channel (which is called when Fabric attempts to actually open a session on the connection/client object.) Unfortunately I don't know exactly what it's doing when in that loop, re: why event.isSet() never occurs.

Combined with the log message about Success for unrequested channel it makes me think something odd is going on with the process forking, or a bug in Paramiko perhaps re: channel numbers. Will google a bit to see if this is a known problem/symptom.

EDIT: Nope, just the source code. Popping more debug stuff in the area generating the message, and comparing that to the rest of the log output, I'm seeing Paramiko doing things on channels 1 and 2, and then getting messages back on channel 2 and going "whu?! what's channel 2?!". Which seems suspect?

EDIT 2: The channels datastruct, at the time of the phantom success, is empty, which explains the error, but presents its own mystery.

Running just one parallel task results in what feels like multiple writes to the log from the two processes, both using channel 1. Makes me wonder if something is going on where the higher level Paramiko object, when set up via the serial process, is fleshing something out that is then "wrong" run in a subprocess later on. (Re: why the "bad" parallel run ends up with both channels 1 and 2.)

Empirical testing:

  • Running 1 serial task on 2 hosts sets up channel 1 on each connection object.
  • Running 2 serial tasks on 2 hosts sets up channel 1, then channel 2
  • Running one parallel task on 2 hosts also sets up channel 1 on each obj
  • Running serial, then parallel, sets up channel 1 in the serial task, then sets up channel 2 in the parallel task

So, pretty obvious "each new session request creates a new channel at N+1", which is backed up by how the code in open_transport works. If a parallel task is running on a cached cxn obj, it'll still increment as expected.

EDIT 3: More debugging, within the forked process' transport obj, its self._channels is being filled out correctly, so the spot accepting the success message is obviously on a different wavelength.

I now suspect that this is because the previously-instantiated client object in the parent process is somehow retaining the network socket (or something) and is thus "grabbing" the success message (and probably anything else?) intended for the subprocess. This makes sense re: what little I remember about how fork() treats file (and thus socket) access.

That still fits with the "nuke cached connections fixes everything" behavior as well, because then each process is starting with its own fresh objects which would then create and retain their own socket handles.

Bit tired to ponder the ramifications of this but I suspect that the "real" solution might require more patching to Paramiko so it's fully multiprocessing-friendly -- but OTOH if it's as simple as a shared socket problem then the onus is on us (heh) to ensure we don't end up in this situation -- necessitating bypassing the connection cache whenever parallelism is used.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 21, 2011

Member

A summary of that monstrosity:

  • The using-@parallel hanging problem is probably what happens when we correctly call atfork ourselves in a new process (which we must do because a prior serial task already triggered Paramiko's own call to atfork).
    • Put another way, it is not encountered in the -P scenario because we can't even get that far due to crashing out from not calling atfork.
  • I suspect, but have not yet verified, that the hang is due to the parent process' cached Paramiko connection object receiving socket messages intended for the child/forked processes instead. It discards them with an unrequested channel error (because it doesn't have the N+1 channel number), and the children loop forever, never getting that other side of the connection handshake, because the remote end obviously has no idea it spoke to the "wrong" process.
  • This can be routed around by ensuring that forked child processes always start with an empty connection cache, thus forcing them to create their own connections and thus their own sockets -- at the cost of creating an entirely new TCP/SSH connection.
  • I'm not sure there is any other solution, if the problem truly is at the socket level. "Make sure the child has its own socket" is tantamount to "set up a new TCP/SSH connection", after all.

Moving forward, we can either nuke the cache entirely in the parent thread when starting any parallel threads, or try to nuke it inside child forked processes. The former is easier but marginally less efficient (in that any serial tasks will need to "unnecessarily" reconnect at the start; but since the caching is mostly useful within a task, and many-task sessions are not terribly common, it's not a big deal), the latter is worth it if it's easy but not if it's difficult or error prone.

Member

bitprophet commented Sep 21, 2011

A summary of that monstrosity:

  • The using-@parallel hanging problem is probably what happens when we correctly call atfork ourselves in a new process (which we must do because a prior serial task already triggered Paramiko's own call to atfork).
    • Put another way, it is not encountered in the -P scenario because we can't even get that far due to crashing out from not calling atfork.
  • I suspect, but have not yet verified, that the hang is due to the parent process' cached Paramiko connection object receiving socket messages intended for the child/forked processes instead. It discards them with an unrequested channel error (because it doesn't have the N+1 channel number), and the children loop forever, never getting that other side of the connection handshake, because the remote end obviously has no idea it spoke to the "wrong" process.
  • This can be routed around by ensuring that forked child processes always start with an empty connection cache, thus forcing them to create their own connections and thus their own sockets -- at the cost of creating an entirely new TCP/SSH connection.
  • I'm not sure there is any other solution, if the problem truly is at the socket level. "Make sure the child has its own socket" is tantamount to "set up a new TCP/SSH connection", after all.

Moving forward, we can either nuke the cache entirely in the parent thread when starting any parallel threads, or try to nuke it inside child forked processes. The former is easier but marginally less efficient (in that any serial tasks will need to "unnecessarily" reconnect at the start; but since the caching is mostly useful within a task, and many-task sessions are not terribly common, it's not a big deal), the latter is worth it if it's easy but not if it's difficult or error prone.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 22, 2011

Member

Went with the latter option re: nuking cache inside child procs. Still need to test more thoroughly re: different combos of parallel & serial tasks mixed up, but it appears to solve the reproducible test case from before just fine.


Also updated setup.py to point just at Paramiko 1.7.7.1, as per an above comment, but we still need to examine how that affects the edge case mentioned, and update docs accordingly.

Member

bitprophet commented Sep 22, 2011

Went with the latter option re: nuking cache inside child procs. Still need to test more thoroughly re: different combos of parallel & serial tasks mixed up, but it appears to solve the reproducible test case from before just fine.


Also updated setup.py to point just at Paramiko 1.7.7.1, as per an above comment, but we still need to examine how that affects the edge case mentioned, and update docs accordingly.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 28, 2011

Member

There is no absolutely clear bug/bugfix in the CPython Mercurial repo/bug tracker, which would cause the hanging in the way that we are using multiprocessing.

I did comb through the entire history for Lib/multiprocessing from 2.6.1 through the PyPI release's release date (July 30th 2009). However, the above tests imply the fix was actually after 2.6.2's release, namely April 14th 2009, so that should be the actual window we find it in.

Potential candidates:

  • Issue 5177 - SO_REUSEADDR problems. I recall running into this type of problem myself earlier in Fabric's history. The implication is that a hung parallel run would eventually terminate, but I've let one sit around for a very long time with no result.
    • This was fixed in March, so probably not the one.
  • Changeset 2c00bdaad7a9 - no apparent issue related to this, and its fix sounds like something that should cause constant, big errors, but noting it anyways. Fixed June 1st -- fits the dates at least.
  • Issue 5331 - Morgan's implementation uses the custom JobQueue class and not multiprocessing.Pool, so I don't think this fits. Fixed on June 30th.
  • Issue 6433 - also Pool-related, so not a great fit. Fixed on July 16th.
Member

bitprophet commented Sep 28, 2011

There is no absolutely clear bug/bugfix in the CPython Mercurial repo/bug tracker, which would cause the hanging in the way that we are using multiprocessing.

I did comb through the entire history for Lib/multiprocessing from 2.6.1 through the PyPI release's release date (July 30th 2009). However, the above tests imply the fix was actually after 2.6.2's release, namely April 14th 2009, so that should be the actual window we find it in.

Potential candidates:

  • Issue 5177 - SO_REUSEADDR problems. I recall running into this type of problem myself earlier in Fabric's history. The implication is that a hung parallel run would eventually terminate, but I've let one sit around for a very long time with no result.
    • This was fixed in March, so probably not the one.
  • Changeset 2c00bdaad7a9 - no apparent issue related to this, and its fix sounds like something that should cause constant, big errors, but noting it anyways. Fixed June 1st -- fits the dates at least.
  • Issue 5331 - Morgan's implementation uses the custom JobQueue class and not multiprocessing.Pool, so I don't think this fits. Fixed on June 30th.
  • Issue 6433 - also Pool-related, so not a great fit. Fixed on July 16th.
@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 28, 2011

Member

Left a comment on the Paramiko tracker's issue re: the atfork crap. It comes off looking huffy re: forking (given I only reported the need for the 2nd atfork just now) but at least I linked back to #275 to give some historical context.

Member

bitprophet commented Sep 28, 2011

Left a comment on the Paramiko tracker's issue re: the atfork crap. It comes off looking huffy re: forking (given I only reported the need for the 2nd atfork just now) but at least I linked back to #275 to give some historical context.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 28, 2011

Member

Dealing with the Paramiko fork fun in #275 now.

Last note here re: the RNG errors: I ran my tests on another Lion box (a 15" 2011 MBP) under its stock 2.6.6, and as with the MBA, I didn't even get the RNG errors to show up, or any hanging. Not sure why I got them with 2.6.6 on Linux in a VM, but whatever :)

Member

bitprophet commented Sep 28, 2011

Dealing with the Paramiko fork fun in #275 now.

Last note here re: the RNG errors: I ran my tests on another Lion box (a 15" 2011 MBP) under its stock 2.6.6, and as with the MBA, I didn't even get the RNG errors to show up, or any hanging. Not sure why I got them with 2.6.6 on Linux in a VM, but whatever :)

@bancek

This comment has been minimized.

Show comment
Hide comment
@bancek

bancek Sep 30, 2011

I've created monkeypatch script to enable fabric to be thread-safe using threading.local()

https://github.com/bancek/fabric-threadsafe

bancek commented Sep 30, 2011

I've created monkeypatch script to enable fabric to be thread-safe using threading.local()

https://github.com/bancek/fabric-threadsafe

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 30, 2011

Member

@bancek -- did you note the ticket history? :( we've already got a pretty solid multiprocessing implementation, so at this point it's probably not a good use of time to flip things around to being threadsafe and then rewrite the feature using some other concurrency model.

That said -- I'm definitely going to keep a link to that somewhere so I can refer to it when writing Fabric 2.0, so thanks! :D

Member

bitprophet commented Sep 30, 2011

@bancek -- did you note the ticket history? :( we've already got a pretty solid multiprocessing implementation, so at this point it's probably not a good use of time to flip things around to being threadsafe and then rewrite the feature using some other concurrency model.

That said -- I'm definitely going to keep a link to that somewhere so I can refer to it when writing Fabric 2.0, so thanks! :D

@bancek

This comment has been minimized.

Show comment
Hide comment
@bancek

bancek Sep 30, 2011

This wasn't meant to get into Fabric core. This is just temporary fix and I posted it here if somebody else needed a quick patch.

bancek commented Sep 30, 2011

This wasn't meant to get into Fabric core. This is just temporary fix and I posted it here if somebody else needed a quick patch.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Sep 30, 2011

Member

Fair enough! Thanks again :)

Member

bitprophet commented Sep 30, 2011

Fair enough! Thanks again :)

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Oct 7, 2011

Member

This was merged into master about a week ago.

Member

bitprophet commented Oct 7, 2011

This was merged into master about a week ago.

@bitprophet bitprophet closed this Oct 7, 2011

@bitprophet bitprophet reopened this Oct 23, 2011

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Oct 23, 2011

Member

Reopening so I remember to double check the impact of a newer Paramiko's dependencies (only un-crossed-out item in description's list.) This will need doing regardless of the outcome of #275.

Member

bitprophet commented Oct 23, 2011

Reopening so I remember to double check the impact of a newer Paramiko's dependencies (only un-crossed-out item in description's list.) This will need doing regardless of the outcome of #275.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Oct 23, 2011

Member

The Paramiko/PyCrypto 2.1+ issue was simply the three-parter of python==2.5.x, pip<=0.8.0 and pycrypto >=2.1 being uninstallable.

Now that Fabric requires pycrypto 2.1, that changes to: not installable via pip 0.8.0 or older on Python 2.5.x. (Using PyCrypto 2.0.1 is no longer an option.)

Will update docs appropriately. (Also need to update all our docs re: Paramiko now being 'ssh'.)

Member

bitprophet commented Oct 23, 2011

The Paramiko/PyCrypto 2.1+ issue was simply the three-parter of python==2.5.x, pip<=0.8.0 and pycrypto >=2.1 being uninstallable.

Now that Fabric requires pycrypto 2.1, that changes to: not installable via pip 0.8.0 or older on Python 2.5.x. (Using PyCrypto 2.0.1 is no longer an option.)

Will update docs appropriately. (Also need to update all our docs re: Paramiko now being 'ssh'.)

bitprophet added a commit that referenced this issue Oct 23, 2011

Replace all mentions of Paramiko in docs[trings].
Except in one or two spots where it still makes sense.

Re #19

@bitprophet bitprophet closed this Oct 23, 2011

bitprophet added a commit that referenced this issue Oct 24, 2011

Add --linewise and its rationale to the docs.
Totally forgot to do this re: #19, whoops.
@euforia

This comment has been minimized.

Show comment
Hide comment
@euforia

euforia Mar 22, 2012

I am running ssh 1.7.13 + python 2.7.1 + OS X Lion + pycrypto 2.5 and am still running into the RNG issue. I've also tried on a redhat box but am getting the same results. The issue is present regardless of the number of parallel runs. I seem to be running into the problem the minute I spawn a separate process. Is this issue resolved or still being worked on? Thank you in advance.

euforia commented Mar 22, 2012

I am running ssh 1.7.13 + python 2.7.1 + OS X Lion + pycrypto 2.5 and am still running into the RNG issue. I've also tried on a redhat box but am getting the same results. The issue is present regardless of the number of parallel runs. I seem to be running into the problem the minute I spawn a separate process. Is this issue resolved or still being worked on? Thank you in advance.

@bitprophet

This comment has been minimized.

Show comment
Hide comment
@bitprophet

bitprophet Mar 22, 2012

Member

@euforia I believe you're the first person to say they've got the problem still, and you're also the only one I think who had the RNG problem for any sized run -- myself and others only had it become problematic with more than, say, two dozen simultaneous processes.

Have you always been running on PyCrypto 2.5? Would you be able to test some other versions of PyCrypto (like Crypto 2.3) and maybe Python minor version (eg 2.6.x) to see if it's persistent everywhere?

Member

bitprophet commented Mar 22, 2012

@euforia I believe you're the first person to say they've got the problem still, and you're also the only one I think who had the RNG problem for any sized run -- myself and others only had it become problematic with more than, say, two dozen simultaneous processes.

Have you always been running on PyCrypto 2.5? Would you be able to test some other versions of PyCrypto (like Crypto 2.3) and maybe Python minor version (eg 2.6.x) to see if it's persistent everywhere?

@euforia

This comment has been minimized.

Show comment
Hide comment
@euforia

euforia Mar 23, 2012

Thank you for your quick response Jeff. I remember it working prior to an upgrade I performed but I am not sure what the combination (pycryto + python) was. So far I've tried the following combination with no success:

  • python 2.6.7 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5
  • python 2.7.1 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5

Is there a combination you can provide that works for you? I can test it out to see if the issue is with the environment.

Here's the code for reference:

def get_output_from_channel_obj(chnl):
data = chnl.recv(1024)
while data:
print "run:"
print "%s" % data
data = chnl.recv(1024)

def run_test_command(tobj, cmd):
chann = tobj.open_session()
chann.exec_command(cmd)
print "command:",cmd
print "output"
get_output_from_channel_obj(chann)
chann.close()

Main

conn = ssh.SSHClient()
conn.set_missing_host_key_policy(ssh.AutoAddPolicy())
conn.connect(ipaddr, username=user, password=passwd)

transport_obj = conn.get_transport()
channelSession = transport_obj.open_session()

p1 = Process(target=run_test_command, args=(transport_obj, "virsh list",))
p1.start()
p1.join()

conn.close()

end code

Once again, thank you for your help. Please let me know.

euforia commented Mar 23, 2012

Thank you for your quick response Jeff. I remember it working prior to an upgrade I performed but I am not sure what the combination (pycryto + python) was. So far I've tried the following combination with no success:

  • python 2.6.7 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5
  • python 2.7.1 + ssh 1.7.13 + pycrypto 2.1/2.2/2.3/2.4/2.5

Is there a combination you can provide that works for you? I can test it out to see if the issue is with the environment.

Here's the code for reference:

def get_output_from_channel_obj(chnl):
data = chnl.recv(1024)
while data:
print "run:"
print "%s" % data
data = chnl.recv(1024)

def run_test_command(tobj, cmd):
chann = tobj.open_session()
chann.exec_command(cmd)
print "command:",cmd
print "output"
get_output_from_channel_obj(chann)
chann.close()

Main

conn = ssh.SSHClient()
conn.set_missing_host_key_policy(ssh.AutoAddPolicy())
conn.connect(ipaddr, username=user, password=passwd)

transport_obj = conn.get_transport()
channelSession = transport_obj.open_session()

p1 = Process(target=run_test_command, args=(transport_obj, "virsh list",))
p1.start()
p1.join()

conn.close()

end code

Once again, thank you for your help. Please let me know.

dimara added a commit to dimara/synnefo that referenced this issue May 29, 2014

deploy: Remove parallel from fabfile
Running with parallel on a multinode setup we bumped into a
fabric/paramiko bug that freezes deployment. Specifically
we got:

paramiko.transport:Success for unrequested channel! [??]

This seems to be related with issue:

fabric/fabric#19

So until someone fixes it, we remove parallelism from setup_cluster
and setup_vmc. This will make a multinode deployment obviously slower
but will not affect the basic use case of --autoconf.

Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>

dimara added a commit to dimara/synnefo that referenced this issue May 29, 2014

deploy: Remove parallel from fabfile
Running with parallel on a multinode setup we bumped into a
fabric/paramiko bug that freezes deployment. Specifically
we got:

paramiko.transport:Success for unrequested channel! [??]

This seems to be related with issue:

fabric/fabric#19

So until someone fixes it, we remove parallelism from setup_cluster
and setup_vmc. This will make a multinode deployment obviously slower
but will not affect the basic use case of --autoconf.

Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>

dimara added a commit to dimara/synnefo that referenced this issue May 30, 2014

deploy: Remove parallel from fabfile
Running with parallel on a multinode setup we bumped into a
fabric/paramiko bug that freezes deployment. Specifically
we got:

paramiko.transport:Success for unrequested channel! [??]

This seems to be related with issue:

fabric/fabric#19

So until someone fixes it, we remove parallelism from setup_cluster
and setup_vmc. This will make a multinode deployment obviously slower
but will not affect the basic use case of --autoconf.

Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>

ploxiln pushed a commit to ploxiln/fab-classic that referenced this issue Jun 5, 2018

Merge pull request fabric#19 from rodrigc/travis
Enable Python 3 travis build
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment