Add support for `command_timeout` in `run()` #1989
referenced this pull request
Jul 2, 2019
This may not be a simple merge either; the socket timeouts bubble up into the thread-exception handlers (as they occur on attempts to recv/recv_stderr in the IO workers).
Fabric 1 did something kinda similar but it had all the timeout logic meshed in the middle of the IO workers, so it was able to say "is this a socket timeout and are we past the configured timeout (to prevent muting connection-time timeouts, I think)? turn it into a timeout exception and raise that".
Looking at Invoke, I notice we do something vaguely similar with watcher errors - it definitely follows the obvious/naive option for us here, which is "examine thread errors to see if they need special handling". This might be a good excuse to cut up Runner some more so we can extend this pattern for Fabric.
The only other route I see offhand is to change the contract for timeouts and say "actually, the abstract Runner is what handles the timeout itself, via the threading Timer, and subclasses only need to ensure we have a way of forcibly stopping the subprocess from the main thread" (if we don't just make that always "send an interrupt like KeyboardInterrupt"?)
This way, we're not relying on "action at a distance" via the socket timeouts, but we just extend the Runner's main-thread wait loop - the same spot that handles KeyboardInterrupt (aka "the human got tired of waiting" style timeouts...:D) could look for timer expiry and call
This is tempting partly because the whole "subclasses determine what timeout means" felt wrong to me - why would the execution mechanism change the fact that we want control returned to us after N seconds? I guess there are some situations, like accounting for transmission time of network exec (timeout defined as time spent running remotely only) or eg allowing a subclass to do something very different like wrapping the command in the
Think I'll aim for the latter. I started blocking out the former, and it requires some mild contortions within Runner and its state (there's a reason _run_body is such a big single function right now...for better and mostly worse). I suspect the latter approach is going to be simpler to implement as well as "feeling more correct".
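To make the "Runner owns the timer" idea concrete, here's a rough standalone sketch, not Invoke's or Fabric's actual code. `run_with_timeout` and its parameters are hypothetical names made up for illustration, and a plain `subprocess.Popen` stands in for whatever a real Runner subclass would manage. The only thing it tries to show is a main-thread wait loop that checks a `threading.Timer`-driven flag and forcibly stops the subprocess once the timer expires.

```python
import subprocess
import sys
import threading
import time


def run_with_timeout(argv, timeout, poll_interval=0.05):
    """Hypothetical helper: run argv, returning (returncode, timed_out).

    The timer only sets a flag; the main-thread loop decides what to do,
    much like the spot where a KeyboardInterrupt would be handled.
    """
    expired = threading.Event()
    timer = threading.Timer(timeout, expired.set)
    timer.daemon = True
    proc = subprocess.Popen(argv)
    timer.start()
    try:
        # Main-thread wait loop: poll the subprocess and the timer flag.
        while proc.poll() is None:
            if expired.is_set():
                # Assumption: "forcibly stop" here means kill(); a real
                # Runner might send an interrupt instead.
                proc.kill()
                proc.wait()
                return proc.returncode, True
            time.sleep(poll_interval)
        return proc.returncode, False
    finally:
        # Don't leave the timer thread hanging around after a normal exit.
        timer.cancel()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, timeout=0.5))
```

Note that the timeout here is wall-clock time from the moment the subprocess starts, regardless of what the execution mechanism is doing; only the "how do I stop it" part would need to vary per subclass.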
EDIT: well, there's still the issue of how to 'close' the connection when the Timer exits. The Fabric 1 approach (which is this PR's approach) of setting the channel timeout before starting is "easy" because it handles the timing for you automatically. Do we just set a very short timeout after the timer expires, or something else?
This prompted me to dig into Paramiko, and it reminded me: the v1 timeout is not the same as what we do in Invoke (and thus, no matter how I go with the mechanics, this needs attention). It's used:
What this means is that the timeout setting is the longest Paramiko will ever wait for a period of inactivity - so depending on what is happening it could wait much longer than the specified timeout. E.g. imagine a remote program that behaves normally for a number of seconds and then appears to hang; the timeout only captures that hang time and does not account for the earlier period of activity.
A contrived example, but consider:
Connection.run("for x in `seq 10`; do sleep 1; done; sleep 10", timeout=5)
This will time out after 15 seconds instead of 5, because the ~10s spent in the for loop won't count as timing out by Paramiko's metrics.
Or imagine a program which is nothing but constant activity - it will run to completion without ever timing out!
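The arithmetic behind those two cases can be sketched with a tiny model. This is not Paramiko code, just a pure-Python simulation; the function names are hypothetical, and it assumes the remote program produces some channel activity once per second while it is "behaving normally". It compares when an inactivity-style timeout would fire against a plain wall-clock timeout.

```python
def inactivity_timeout_at(activity_times, timeout, finish_time):
    """Time at which an inactivity timeout fires, or None if it never does.

    activity_times: seconds (from start) at which channel activity is seen.
    finish_time: when the program would exit on its own.
    """
    last_activity = 0.0
    for t in sorted(activity_times):
        if t > finish_time:
            break
        if t - last_activity > timeout:
            # A quiet gap longer than the timeout occurred before this event.
            return last_activity + timeout
        last_activity = t
    # Check the final quiet stretch between the last activity and exit.
    if finish_time - last_activity > timeout:
        return last_activity + timeout
    return None


def wall_clock_timeout_at(timeout, finish_time):
    """Time at which a wall-clock timeout fires, or None if the program exits first."""
    return timeout if finish_time > timeout else None


if __name__ == "__main__":
    # Assumed shape: activity at t=1..10, then 10s of silence before exit.
    active_then_hang = range(1, 11)
    print(inactivity_timeout_at(active_then_hang, timeout=5, finish_time=20))
    print(wall_clock_timeout_at(timeout=5, finish_time=20))
    # Constant activity for 100s: the inactivity timeout never fires.
    print(inactivity_timeout_at(range(1, 101), timeout=5, finish_time=100))
```

Under those assumptions the first case fires at 15s with the inactivity definition versus 5s with the wall-clock one, and the constant-activity case never fires at all.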
This makes me wonder if the right approach is instead to say "timeouts are actually just us sending an interrupt on your behalf automatically after the timer expires", though this has the problem of not being resistant to badly hung processes.
Or we could try harder for a "I don't care about the subprocess' health or status, I just want control returned to me" setup and (as above) just set a very short channel timeout, or (probably better) just call
Could also do both, though I wonder if that ever makes us subject to race conditions. (Plus it invites the idea of a timeout for your timeout, which pleases Xzibit but makes me just want to cry.)
I need to experiment briefly to make sure