
Heavy CPU usage due to new interactivity behavior #312

Closed
bitprophet opened this Issue Aug 19, 2011 · 14 comments


bitprophet commented Aug 19, 2011

Description

Probably due to our looping/selecting on stdin, Fabric now uses a ton of CPU on some (all?) systems when waiting for remote commands to execute.

It'd be nice if we could find a different solution for the stdin/stdout/stderr polling which avoids this, but given the difficulty of simply getting that feature implemented at all (including experiments with async libraries) it might not be doable.

  1. Investigate whether switching to an async library would help any and would be possible.
  2. Possibly allow control over the interactivity (which would also be useful for parallelism in terms of preventing bytewise interweaving of output) so folks at least have the option of saying "I'm not using that feature and want to not be polling constantly".

Originally submitted by Jeff Forcier (bitprophet) on 2011-03-10 at 05:10pm EST
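The busy-wait described above can be demonstrated with a minimal, hypothetical sketch (not Fabric's actual source): a select() call given a zero timeout returns immediately, so a loop around it spins flat out even when there is nothing to read, burning a full core while nominally idle.

```python
import select
import time

def busy_poll(duration=0.2):
    """Spin on select() with a zero timeout, as a busy-wait I/O loop does.

    Real code would pass stdin or a channel in the read list; polling an
    empty list is enough to show the cost. With nothing ready, select()
    returns immediately every iteration, so the loop runs at full speed.
    Returns (iterations, cpu_seconds) for the given wall-clock window.
    """
    iterations = 0
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    while time.monotonic() - wall_start < duration:
        select.select([], [], [], 0)  # timeout 0: never blocks
        iterations += 1
    return iterations, time.process_time() - cpu_start
```

Running this shows CPU time tracking wall time almost one-to-one, which is the "ton of CPU while waiting" symptom the issue reports.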


Relations

  • Related to #320: Issue with buffering input with sudo

Closed as Done on 2011-03-22 at 05:53pm EDT

@ghost ghost assigned bitprophet Aug 19, 2011

bitprophet commented Aug 19, 2011

Eric Hanchrow (offby1) posted:


The patch I put in for local() will end up exhibiting this too, regrettably; the same select() approach is used there.


on 2011-03-12 at 11:01pm EST


bitprophet commented Aug 19, 2011

Kirill Pinchuk (cybergrind) posted:


This patch fixes the CPU consumption. Maybe other sleep values would be better.


on 2011-03-16 at 05:59am EDT


bitprophet commented Aug 19, 2011

Joel Eidsath (joel@counsyl.com) posted:


Thanks, cbr. The patch worked for me.


on 2011-03-17 at 05:30pm EDT


bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Bumping this down to 1.0.1, at this point it definitely feels like a bugfix.


on 2011-03-22 at 02:34am EDT


bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Testing with run("sleep 10") confirms the issue: on an i5 dual-core w/ hyperthreading, python runs at 120% CPU for the 10s of the run.

The two loops cbr's patch touches are the waiting-on-stdin loop in fabric/io.py and the waiting-for-the-output-threads-to-finish loop in fabric/network.py. Adding a sleep to the first one makes a minor difference in CPU usage; the second one is where the real savings show up: with both sleeps in place, CPU usage drops to negligible after an initial, still minor, spike.

Furthermore, adding these sleep calls makes truly interactive things like run("python") also behave much better -- previously in some situations they would be extremely slow to "draw", making them nigh unusable.

Still have to see whether it fixes #320 but that's a story for that ticket.


on 2011-03-22 at 02:56pm EDT
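The shape of the fix described above can be sketched as follows (a hypothetical illustration, not Fabric's actual io.py/network.py code): the select() timeout stays at zero, but whenever nothing is ready the loop sleeps briefly instead of spinning, so an idle wait costs almost no CPU while data is still picked up promptly.

```python
import select
import socket
import time

IO_SLEEP = 0.01  # the interval the ticket eventually settled on

def drain(sock, sleep=IO_SLEEP):
    """Read a socket to EOF, sleeping whenever nothing is ready.

    When select() reports the socket readable we consume data at full
    speed; otherwise we yield the CPU for `sleep` seconds rather than
    polling again immediately.
    """
    chunks = []
    while True:
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            data = sock.recv(4096)
            if not data:  # EOF: the remote end closed the stream
                break
            chunks.append(data)
        else:
            time.sleep(sleep)  # nothing ready; yield instead of spinning
    return b"".join(chunks)
```

The sleep only fires on empty polls, which is why (as the comment notes) interactivity actually improves: ready data is still read back-to-back with no delay.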


bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


cbr grind, do you have a real name you'd like to be credited with in the changelog? I'm modifying your patch a bit but you'll still get credit for the submission/basic idea :)


on 2011-03-22 at 03:23pm EDT


bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Bit of testing re: the sleep value:

  • Anything over 0.1s results in noticeable slowdown again. No CPU usage problems, but now impacts ability to work with the streams -- i.e. an interactive python session is noticeably slow and/or user stdin is not picked up reliably on keypress.
  • Decreasing further from 0.1 seems to be good:
    • Halving it to 0.05s makes interactive sessions even more responsive, while a sleep 10 still doesn't use any noticeable CPU power.
    • Halving that to 0.025s results in similar dividends -- it's hard to notice any more increase in interactivity speed (because it feels just as fast as a native shell), and still no CPU usage to speak of.
    • Down to 0.001s and we are now seeing about 8-10% Python CPU usage on my system.
    • Back up to 0.01s, and no CPU usage again.

So I think I will leave it at 0.01s and see how that works for people -- I expect it'll be a large benefit for almost everybody, and if some folks with slower machines are still seeing some CPU usage, we can always expose this for personal tuning or something along those lines.


on 2011-03-22 at 03:50pm EDT
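The tuning sweep above can be reproduced with a small, hypothetical benchmark: measure the CPU time an idle polling loop consumes over a fixed wall-clock window at different sleep intervals (0 models the original busy loop).

```python
import time

def idle_cpu_cost(sleep_interval, duration=0.25):
    """CPU seconds burned by an idle polling loop over `duration` wall seconds.

    sleep_interval=0 models the original busy loop; nonzero values model
    the patched loop, which should cost almost nothing while idle.
    """
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    while time.monotonic() - wall_start < duration:
        if sleep_interval:
            time.sleep(sleep_interval)  # the value under test
    return time.process_time() - cpu_start
```

On a typical machine the busy variant's CPU time approaches the wall time, while intervals of 0.01s and up register near zero, matching the numbers reported in the comment.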


bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Applied in changeset commit:5ef8083f024097f802fa741fd20e2faa7e84d20b.


on 2011-03-22 at 05:12pm EDT


bitprophet commented Aug 19, 2011

Kirill Pinchuk (cybergrind) posted:


Jeff Forcier wrote:

cbr grind, do you have a real name you'd like to be credited with in the changelog? I'm modifying your patch a bit but you'll still get credit for the submission/basic idea :)

My real name is Kirill Pinchuk =)


on 2011-03-22 at 05:53pm EDT


sodul commented Mar 5, 2012

I'm seeing the issue with Fabric 1.4.0 and ssh (library) 1.7.13 on Ubuntu Server 10.04 LTS. Fab commands that take less than 5% CPU and run in under 5s on OS X take 30s and 100% CPU on Ubuntu. We have a complex set of tasks that goes from 5m to 20m to run. These are pretty beefy machines (8 cores), so it is not a slow-CPU issue.
Can the sleep time be exposed to allow a workaround for affected users? These tasks are headless, so we do not care at all about interactive support.


bitprophet commented Mar 6, 2012

As of those versions I believe the sleep value is now stored in the ssh namespace: https://github.com/bitprophet/ssh/blob/master/ssh/__init__.py#L101

So while I haven't tested this myself, you should be able to, at the top of your fabfile/fabric-using code, do something like:

import ssh
ssh.io_sleep = <new value>

And it should take effect for any Fabric or SSH code that module imports & runs.

Please let me know if that does the trick.


sodul commented Mar 6, 2012


Some more background about what we are doing and how to reproduce the issue consistently.

We are running the fab rules as part of our CI system (Jenkins), and it seems that controlling the fab rules from a remote host is a main cause of the problem.

I've tried using an OS X Jenkins slave to control the fab rules and experienced similar slowness. I will try patching the ssh library directly to set io_sleep to various values and let you know if I get any good results.

Regards,
Stephane Odul


sodul commented Mar 13, 2012

I've tried many options, including directly patching the ssh module to sleep for 0.1s, without success.

I can reproduce the issue consistently on Ubuntu 10.04 LTS and OS X 10.6. If the fab rules run on a 'middle' machine, that fab task will consistently use 100% CPU. We have a CI system that calls fab tasks that take 20m to run vs 5m when I run them directly by hand on my laptop.

The way we are using Fabric has no need for byte-level flushing. It would be great to have an option to reuse the line-by-line capturing that pre-1.0 Fabric used. The no-pty and linebyline options don't seem to help.



sodul commented Mar 29, 2012

Hi Jeff,

We are still getting the same issues with Fabric, so I have moved most of our code out of Fabric and into remote scripts, calling them through subprocess.call('ssh ...') in order to bypass Fabric's custom output buffering. It is working great so far: jobs that used to take 20+ minutes now run in under 7 minutes.

Another issue we have is with stdout.flush(), which on some tasks just gets stuck for 15m. I don't know if you have experienced it.

We will keep Fabric for a little while to orchestrate our deployments, and probably for a while longer as a sysadmin wrapper for our AWS/boto needs, but these issues are a complete deal breaker for us.

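The workaround described in the comment above (shelling out to the system ssh client instead of going through Fabric's I/O loop) can be sketched roughly like this; the function name and the `ssh_argv` parameter are hypothetical, the latter included purely so the wrapper is testable without a real remote host, and key-based auth is assumed to be set up:

```python
import subprocess

def run_remote(host, command, ssh_argv=("ssh", "-o", "BatchMode=yes")):
    """Run `command` on `host` via the system ssh client.

    This sidesteps any Python-level byte-wise polling entirely: the
    kernel pipe buffers the output and the subprocess module collects
    it, with no per-byte select/sleep loop in between.
    """
    result = subprocess.run(
        [*ssh_argv, host, command],
        capture_output=True,
        text=True,
    )
    result.check_returncode()  # raise CalledProcessError on failure
    return result.stdout
```

The trade-off is losing Fabric's interactivity and env handling, which matches the comment's situation: headless CI tasks that never need a terminal.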


richid pushed a commit to richid/fabric that referenced this issue Aug 1, 2012

Add sleep to IO loops to prevent being CPU bound.
Fixes fabric#312.

Thanks to Redmine user "cbr grind" for the patch.
