
Add support for `command_timeout` in `run()` #1989

Closed

Conversation

@fruch commented Jul 1, 2019

depends on:
pyinvoke/invoke#645

@fruch fruch force-pushed the fruch:add_timeout_to_run branch 2 times, most recently from 79a5198 to 4f988d4 Jul 1, 2019

@fruch fruch force-pushed the fruch:add_timeout_to_run branch from 4f988d4 to 8e32d90 Jul 1, 2019

@fruch fruch referenced this pull request Jul 2, 2019

Closed

Fixing race in setting timeout of the Remoter object [#1097] #1098

@bitprophet (Member) commented Jul 19, 2019

My TODO:

  • Needs to actually raise CommandTimedOut or whatever Invoke settles on; as-is I think this will just raise socket.timeout? (See the sketch after this list.)
  • manual test
  • unit tests
  • maybe an integration test?
  • changelog entry
  • update Upgrading docs page to note that this is ported to v2 now
  • This probably requires bumping Invoke dependency
    • But also the thorny "invoke has to actually be released first" problem. Wonder if I can easily signal to Travis that it should get Invoke-master when testing...or just make that part and parcel of the Fabric 2 tests...ugh
  • Also update changelog for previously merged stuff that mentions "works with invoke<1.3 but requires >=1.3 for xxx" since that'll be moot now... - I must be thinking of the support for pyinvoke/invoke#522 or some other outstanding branch - it's not in this one for now
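
For reference, a minimal sketch of the end-user contract the first item above aims at, assuming the exception lands as `invoke.exceptions.CommandTimedOut` (the name was still being settled in pyinvoke/invoke#645 at this point; the host is a placeholder):

```python
# Hypothetical end state: run() raises a typed exception instead of letting
# a raw socket.timeout escape from the IO threads.
from fabric import Connection
from invoke.exceptions import CommandTimedOut  # name assumed per the TODO above

try:
    Connection("web1").run("sleep 30", timeout=5)  # placeholder host
except CommandTimedOut as e:
    # Callers can now distinguish "command ran too long" from network errors.
    print("command exceeded the configured timeout:", e)
```
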
@bitprophet (Member) commented Aug 5, 2019

This may not be a simple merge either; the socket timeouts bubble up into the thread-exception handlers (as they occur on attempts to recv/recv_stderr in the IO workers).

Fabric 1 did something kinda similar but it had all the timeout logic meshed in the middle of the IO workers, so it was able to say "is this a socket timeout and are we past the configured timeout (to prevent muting connection-time timeouts, I think)? turn it into a timeout exception and raise that".

Looking at Invoke, I notice we do something vaguely similar with watcher errors - which suggests the obvious/naive option for us here: "examine thread errors to see if they need special handling". This might be a good excuse to cut Runner up some more so we can extend this pattern for Fabric.
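
To make that option concrete, here is a standalone toy version of the pattern - the worker thread stashes its exception, and the main thread inspects it and translates socket timeouts into a typed error. All names here are invented for illustration; this is not Invoke's actual Runner API:

```python
import socket
import threading

class WorkerTimedOut(Exception):
    """Stand-in for whatever typed timeout exception Invoke settles on."""

def io_worker(sock, errors):
    try:
        sock.recv(1024)  # blocks, like Paramiko's recv/recv_stderr calls
    except Exception as e:
        errors.append(e)  # stash for the main thread instead of dying loudly

reader, writer = socket.socketpair()
reader.settimeout(0.1)  # the channel-level timeout this PR sets
errors = []
worker = threading.Thread(target=io_worker, args=(reader, errors))
worker.start()
worker.join()  # the writer never sends, so the recv above times out

# Main thread: "examine thread errors to see if they need special handling".
for exc in errors:
    if isinstance(exc, socket.timeout):
        raise WorkerTimedOut("IO worker hit the channel read timeout") from exc
    raise exc
```
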


The only other route I see offhand is to change the contract for timeouts and say "actually, the abstract Runner is what handles the timeout itself, via the threading Timer, and subclasses only need to ensure we have a way of forcibly stopping the subprocess from the main thread" (if we don't just make that always "send an interrupt like KeyboardInterrupt"?)

This way, we're not relying on "action at a distance" via the socket timeouts, but we just extend the Runner's main-thread wait loop - the same spot that handles KeyboardInterrupt (aka "the human got tired of waiting" style timeouts...:D) could look for timer expiry and call self.send_interrupt. It would also have to set some bookkeeping such that shutdown triggers the CommandTimedOut instead of just pseudo-gracefully stopping.

This is tempting partly because the whole "subclasses determine what timeout means" angle felt wrong to me - why would the execution mechanism change the fact that we want control returned to us after N seconds? I guess there are some situations, like accounting for transmission time of network exec (timeout defined as time spent running remotely only), or, e.g., allowing a subclass to do something very different like wrapping the command in the timeout program, etc.
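
As a proof of concept for that alternative contract, a self-contained local-subprocess version (POSIX only, illustrative names throughout): the generic wait loop owns a threading.Timer, and the only subclass-specific piece is the "forcibly stop" callable - kill() here; closing the channel would be the remote analogue.

```python
import subprocess
import threading

proc = subprocess.Popen(["sleep", "30"])
timed_out = threading.Event()

def force_stop():
    timed_out.set()  # bookkeeping so shutdown knows this was a timeout
    proc.kill()      # the one subclass-specific hook: forcibly stop the child

timer = threading.Timer(2.0, force_stop)
timer.start()
proc.wait()      # the Runner's main-thread wait loop
timer.cancel()   # no-op if the timer already fired

if timed_out.is_set():
    # Stand-in for raising CommandTimedOut during shutdown.
    raise TimeoutError("command timed out after 2 seconds")
```
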

@bitprophet (Member) commented Aug 5, 2019

Think I'll aim for the latter. I started blocking out the former and it requires some mild contortions within Runner and its state (there's a reason _run_body is such a big single function right now...for better and mostly worse); I suspect the latter approach will be simpler to implement as well as "feeling more correct".

EDIT: well, there's still the issue of how to 'close' the connection when the Timer fires. The Fabric 1 approach (which is this PR's approach) of setting the channel timeout before starting is "easy" because it handles the timing for you automatically. Do we just set a very short timeout after the timer expires, or something else?


This prompted me to dig into Paramiko, and it reminded me: the v1 timeout is not the same as what we do in Invoke (so no matter how I go with the mechanics, this needs attention). It's used:

  • as the socket read timeout for every recv and recv_stderr
  • while waiting for the send window to open (IIRC this has to do with multiple channels or TCP transmission window stuff)

What this means is that the timeout setting is the longest Paramiko will ever wait through a single period of inactivity - so depending on what is happening, the command could run much longer than the specified timeout. E.g. imagine a remote program that behaves normally for a number of seconds and then appears to hang; the timeout only captures the hang time and does not account for the earlier period of activity.

A contrived example, but consider:

```python
Connection.run("for x in `seq 10`; do sleep 1; done; sleep 10", timeout=5)
```

This will time out after 15 seconds instead of 5, because the ~10s spent in the for loop won't count as timing out by Paramiko's metrics.

Or imagine a program which is nothing but constant activity - it will run to completion without ever timing out!
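
The same semantics are easy to demonstrate with a plain socket pair - settimeout() bounds each individual blocking read, not total runtime, so a steady trickle of data never trips it:

```python
import socket
import threading
import time

reader, writer = socket.socketpair()
reader.settimeout(5.0)  # analogous to the channel timeout above

def trickle():
    for _ in range(10):
        time.sleep(1)          # stay well under the 5s inactivity limit...
        writer.sendall(b"x")   # ...but keep the connection "active"
    writer.close()

threading.Thread(target=trickle).start()
start = time.time()
while reader.recv(1):  # recv returns b"" once the writer closes
    pass
print(f"ran ~{time.time() - start:.0f}s with a 5s timeout, never timed out")
```
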


This makes me wonder if the right approach is instead to say "timeouts are actually just us sending an interrupt on your behalf automatically after the timer expires", though this has the problem of not being resistant to badly hung processes.

Or we could try harder for an "I don't care about the subprocess' health or status, I just want control returned to me" setup and (as above) just set a very short channel timeout, or (probably better) just call Channel.shutdown(2) to shut off both reads and writes, which should rapidly result in things shutting down. (In Invoke/Local, I guess we can keep using SIGKILL, which is about the same...)

Could also do both, though I wonder if that ever makes us subject to race conditions. (Plus it invites the idea of a timeout for your timeout, which pleases XZibit but makes me just want to cry.)

I need to experiment briefly to make sure shutdown(2) does what I expect here. And then get this done. It's taken up way too much time.

@bitprophet (Member) commented Aug 5, 2019

Channel.shutdown(2) doesn't suffice, I believe because it doesn't actually close the connection, and that means Channel.exit_status_ready blocks until the natural end of the process anyway. We may just want to .close instead - unhappy that I don't know this off the top of my head. EDIT: seems like that does the trick better.
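
A hedged sketch of that conclusion in raw Paramiko terms - arm a Timer that calls Channel.close(), which unblocks the exit-status wait even if the remote process is hung. The host and credentials are placeholders; this needs a real server to actually run:

```python
import threading
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("example.com", username="user")  # placeholder target

channel = client.get_transport().open_session()
channel.exec_command("sleep 60")  # pretend this is a badly hung process

# After 5s, close the channel outright; shutdown(2) alone left
# exit_status_ready blocking until the process's natural end.
timer = threading.Timer(5.0, channel.close)
timer.start()

status = channel.recv_exit_status()  # returns once close() fires (-1 if no status)
timer.cancel()
client.close()
```
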

bitprophet added a commit that referenced this pull request Aug 6, 2019

@bitprophet (Member) commented Aug 6, 2019

Almost forgot to check up on CLI args - glad I did. Now invoke and fabric 2 retain the fabric 1 -T and -t flags. This is just about ready to merge.

@bitprophet (Member) commented Aug 6, 2019

All done? Finally? Have to close since GH doesn't see cherry-picks.

@bitprophet bitprophet closed this Aug 6, 2019

fruch added a commit to fruch/scylla-cluster-tests that referenced this pull request Aug 7, 2019

fruch added a commit to fruch/scylla-cluster-tests that referenced this pull request Aug 7, 2019

fruch added a commit to fruch/scylla-cluster-tests that referenced this pull request Aug 7, 2019

fruch added a commit to fruch/scylla-cluster-tests that referenced this pull request Aug 7, 2019

bentsi added a commit to scylladb/scylla-cluster-tests that referenced this pull request Aug 8, 2019

bentsi added a commit to scylladb/scylla-cluster-tests that referenced this pull request Aug 8, 2019

Switch back to formal versions of `invoke` and `fabric`
since those two PRs were merged by Jeff:
* pyinvoke/invoke#645
* fabric/fabric#1989

(cherry picked from commit 58eb4ee)

bentsi added a commit to scylladb/scylla-cluster-tests that referenced this pull request Aug 8, 2019
