Skip to content

Add remote-program timeout support #249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bitprophet opened this issue Aug 19, 2011 · 7 comments
Closed

Add remote-program timeout support #249

bitprophet opened this issue Aug 19, 2011 · 7 comments

Comments

@bitprophet
Copy link
Member

EDIT: as per this comment below this issue is now solely about detecting and responding to remote program halt/freeze, no network specific issues.


Description

(Note: this ticket is a copy of the original content from #8, which has been repurposed.)

I.e. either have our own timeout, i.e. no data from remote end in N seconds == abort, or make sure that existing network-level timeouts work correctly (e.g. when connecting -- I seem to recall someone mentioning that Fabric was just sitting around waiting forever for a remote system that had valid DNS but was unreachable.)

Self-generated timeouts should probably be off by default so as not to screw up unsuspecting users running remote tasks that are legitimately silent for long periods of time, and should of course be configurable in length either way.

The current timeout behavior is controlled in source:fabric/network.py on line ~219, except socket.timeout:.


Originally submitted by Jeff Forcier (bitprophet) on 2010-10-29 at 12:14pm EDT

Relations

@ghost ghost assigned bitprophet Aug 19, 2011
@bitprophet
Copy link
Member Author

Jeff Forcier (bitprophet) posted:


While it is likely to be only useful for local, I was linked to the following Python module today: timed_command.


on 2010-10-29 at 12:16pm EDT

@bitprophet
Copy link
Member Author

This should also be considered related to #12:

  • Actual "DNS good, SSH server bad" situations currently appear to correctly bomb after 10 seconds, so the abovementioned report is probably out of date or includes more factors we don't know about.
  • Situations where the connection is valid and the remote process has hung, feels like the primary use case for this ticket, which requires detecting an absence of I/O data.
  • A sub-usecase could be where a long-term connection is established but then breaks partway through execution (e.g. run("sleep 120") where the remote sshd dies 30 seconds in.)
    • I don't know what currently happens in this situation but suspect it would result in a near immediate socket or SSH exception of some kind.

Basically, I think this ticket should be refined to the point where we consider just the "everything is working fine but the remote program appears to be frozen" use case. Other issues belong more in #12.

As such I'm dropping this from 1.4, it's not really network related anymore.

@bitprophet
Copy link
Member Author

When looking for a quick solution for IRC user insert36 I found that we probably need to be using Channel.settimeout for the main usecase here (timing out the remote program execution(s)). The channel used to run commands could have that method run on it in either:

  • state.default_channel() -- remind myself why this is used, IIRC it's so that all operations + contrib functions have access to the "right" channel object, with tweaks. If so this is probably the right place to apply a timeout option.
  • operations._execute() is the base for run/sudo so if we needed to limit the option to just those two, that's the spot.

@PaulMcMillan
Copy link

My quick-and-dirty solution was to write a wrapper for the fab command:

#!/usr/bin/env python
import fabric.state
fabric.state.env.command_timeout = None # add a default value
old_default_channel = fabric.state.default_channel
def new_default_channel():
    chan = old_default_channel()
    chan.settimeout(fabric.state.env.command_timeout)
    return chan
fabric.state.default_channel = new_default_channel

import fabric.main
fabric.main.main()

I tried monkey patching later on, but after fabric.main.main() was too late. This seems to work like a charm. Now I can use the standard settings context manager as appropriate. I'll have a patch for fabric in a few minutes here.

@bitprophet
Copy link
Member Author

We can actually handle this at the Paramiko layer, where it's more useful to more folks; then I can adapt Paul's patch to simply passthrough to the Paramiko level option.

Patch from a Paramiko user:

From 007fedc94b4cbb390e7413bec0a55cdf74ab8c24 Mon Sep 17 00:00:00 2001
From: cernov <cernov@[REDACTED].com>
Date: Fri, 16 Nov 2012 11:58:00 +0200
Subject: [PATCH] Added an extra parameter (timeout) to exec_command

---
 paramiko/client.py |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/paramiko/client.py b/paramiko/client.py
index 07560a3..0e53b41 100644
--- a/paramiko/client.py
+++ b/paramiko/client.py
@@ -350,7 +350,7 @@ class SSHClient (object):
             self._agent.close()
             self._agent = None

-    def exec_command(self, command, bufsize=-1):
+    def exec_command(self, command, bufsize=-1, timeout=None):
         """
         Execute a command on the SSH server.  A new L{Channel} is opened and
         the requested command is executed.  The command's input and output
@@ -361,12 +361,15 @@ class SSHClient (object):
         @type command: str
         @param bufsize: interpreted the same way as by the built-in C{file()} function in python
         @type bufsize: int
+        @param timeout: maximum time to wait for command to execute
+        @type timeout: float
         @return: the stdin, stdout, and stderr of the executing command
         @rtype: tuple(L{ChannelFile}, L{ChannelFile}, L{ChannelFile})

         @raise SSHException: if the server fails to execute the command
         """
         chan = self._transport.open_session()
+        chan.settimeout(timeout)
         chan.exec_command(command)
         stdin = chan.makefile('wb', bufsize)
         stdout = chan.makefile('rb', bufsize)
-- 
1.7.9.5

@bitprophet
Copy link
Member Author

Well, fuck me running, that won't work because of #733's changes recently: it uses a 0.1s channel (née network) timeout to ensure faster reads. But this means we can no longer rely on the network-level timeout to detect if the remote end has shut up (re: timing out the remote program.)

I think this can be solved by doing a "manual" timeout: keep track of time elapsed within the IO loop and raise a custom exception when that exceeds the configured command_timeout, in the spot where we currently just except socket.timeout: continue (again, which was introduced during #733).

However it's annoying that we have to dance around it like this, and also means the Paramiko level change is actually not useful here. (I was hoping to faux-test the Paramiko change via Fabric/its test suite, lazy me.)

@bitprophet
Copy link
Member Author

Fabric-level changes for this are in and test out OK, can use both the CLI/env setting that Paul laid out, or a kwarg to run/sudo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants