Fix for #715 #733

Merged
merged 11 commits into from Oct 1, 2012

Conversation

Projects
None yet
2 participants

Hi,

I've completed refactoring the read loop completely. Here is a summary of the changes I've made:

  • Created class for the SSH Channel Reader
  • In the main reading loop, I've clearly separated the printing part and the interactive/capture part because printing require very careful carriage return accounting, while interactivity/capture only requires conservation of the network stream,
  • The printing part is a very small state machine that handles prefix and carriages returns correctly across network read,
  • The network is now read in chunk of 4096 bytes (arbitrary value). Every chunk is then parsed on a part-of-line by line basis. Part of line is added to current line (if needed) and processed,
  • SSH Channel is now read with a 0.1s timeout.

As a result, the performance of the network read loop is much better (on my sample 14000 lines of log run, the performance goes from 1m30s to 34s (only 3s slower than raw SSH).

I think the overall refactoring is of acceptable quality, although some expert Pythonist could probably improve some parts of the code without a doubt.

All tests do pass and there is nothing to configure to adapt to the speed/interactiveness tradeoff. We have the best of two worlds !

Pierre Baillet and others added some commits Aug 24, 2012

Pierre Baillet Small refactor in io.py
- Optimize read when there is no capture made
9bdc773
Pierre Baillet Added interactive flag to run
- Setting interactive to False propagate the flag down to the io and ensure reads are fast
- This use a codepath that was never used before in the read loop and in which no prefix is displayed
19002e9
Pierre Baillet Merge remote-tracking branch 'upstream/master'
Conflicts:
	fabric/operations.py
629d9c3
Pierre Baillet Remove interactive behavior a20bcdb
root First refactor 6d398fe
Pierre Baillet Tests do pass ! a29d96b
Pierre Baillet Now read by 4 bytes chunks 8ccd5ea
Pierre Baillet Now honors \r\n in network blocks bc6c43a
Pierre Baillet Useless test dacc719
Pierre Baillet Increase read size aa3ec34
Pierre Baillet Use two sets of bytes to print/handle interactive because we modify p…
…rintable bytes
4c6641b
Owner

bitprophet commented Sep 23, 2012

Sorry for the delay, been on a bidness trip and haven't had much OSS time. Just pulled down your fork and will try to give this a gander real soon. Definitely like what you wrote in your summary, for what that's worth :) Thanks!

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
@@ -12,10 +14,11 @@
if win32:
import msvcrt
+import logging
@bitprophet

bitprophet Sep 23, 2012

Owner

Assuming you left this chunk in by accident :)

@octplane

octplane Sep 23, 2012

Lot of logging for test has been left around (cargo culted from other code in fabric)...

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
@@ -24,8 +27,8 @@ def _endswith(char_list, substring):
return tail == substring
-def _is_newline(byte):
- return byte in ('\n', '\r')
+def _has_newline(bytes):
@bitprophet

bitprophet Sep 23, 2012

Owner

Stupid nitpick: bytes is a builtin; shadowing builtins is usually frowned upon. I'd say it's worth changing here (e.g. to bytelist) since what we're doing in this module is relatively byte/string related. Could bite us later (pun unintended.)

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
@@ -24,8 +27,8 @@ def _endswith(char_list, substring):
return tail == substring
-def _is_newline(byte):
- return byte in ('\n', '\r')
+def _has_newline(bytes):
+ return '\r' in bytes or '\n' in bytes
def _was_newline(capture, byte):
@bitprophet

bitprophet Sep 23, 2012

Owner

The change in this function is wrong, but it looks like the function itself is (rightfully, esp given it was duplicating the newline check, heh) not used anywhere, so we can just nuke it.

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
- set_password(password)
- # Reset reprompt flag
- reprompt = False
- # Send current password down the pipe
- chan.sendall(password + '\n')
- elif try_again:
- # Remove text from capture buffer
- capture = capture[:len(env.again_prompt)]
- # Set state so we re-prompt the user at the next prompt.
- reprompt = True
+ ol = OutputLooper(chan, attr, stream, capture)
+ ol.loop()
+
+class OutputLooper(object):
+ def __init__(self, chan, attr, stream, capture):
+ self._chan = chan
@bitprophet

bitprophet Sep 23, 2012

Owner

I'd argue these leading underscores are unnecessary; sure, technically these are "private" variables, but given there are no public variables (this class is not part of Fabric's published API and is unlikely to be in future) there's no point and it looks a little fugly ;)

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
+ self._flush(env.sudo_prompt)
+ # Prompt for, and store, password. Give empty prompt so the
+ # initial display "hides" just after the actually-displayed
+ # prompt from the remote end.
+ self._chan.input_enabled = False
+ password = fabric.network.prompt_for_password(
+ prompt=" ", no_colon=True, stream=self._stream
+ )
+ self._chan.input_enabled = True
+ # Update env.password, env.passwords if necessary
+ set_password(password)
+ # Reset reprompt flag
+ self._reprompt = False
+ # Send current password down the pipe
+ self._chan.sendall(password + '\n')
+
@bitprophet

bitprophet Sep 23, 2012

Owner

Leading whitespace here. (Wish GH's diffs showed that like Git's did.)

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
+
+ # Initialize loop variables
+ initial_prefix_printed = False
+ seen_cr = False
+ line = []
+
+ # Allow prefix to be turned off.
+ if not env.output_prefix:
+ self._prefix = ""
+
+ while True:
+ # Handle actual read
+ try:
+ bytes = self._read_func(self._read_size)
+ except socket.timeout:
+ logger.info("Nothing read")
@bitprophet

bitprophet Sep 23, 2012

Owner

More logging to nuke.

Owner

bitprophet commented Sep 23, 2012

Sorry for leaving all these really annoying teacher-esque red marks :) Chances are I will actually update all of them myself in the next few hours, but noting here in case I don't.

@bitprophet bitprophet commented on the diff Sep 23, 2012

fabric/io.py
+ # Initialize loop variables
+ initial_prefix_printed = False
+ seen_cr = False
+ line = []
+
+ # Allow prefix to be turned off.
+ if not env.output_prefix:
+ self._prefix = ""
+
+ while True:
+ # Handle actual read
+ try:
+ bytes = self._read_func(self._read_size)
+ except socket.timeout:
+ logger.info("Nothing read")
+ continue
@bitprophet

bitprophet Sep 23, 2012

Owner

🚨ACTUAL QUESTION/COMMENT🚨 What happens here if the socket is continually timing out, i.e. the remote end has gone belly-up? If the ssh/paramiko layer doesn't itself raise a non-socket.timeout exception (e.g. an SSHException) I feel like this could leave a process in limbo. Will have to figure out how to test this if you haven't already.

@octplane

octplane Sep 23, 2012

Well, if I'm not mistaken, the continual timeout means there is nothing to read on the SSH stream for now. So either network is very slow, or the command is not outputting anything...

I don't think there is anything abnormal here. If the connection shuts down, we'll got '' from the pipe and handle "usual" disconnection.

All in all, I'm not sure how we could handle/test that. If you have any idea, I won't mind looking at that and try to fix the 'issue' :)

While I'm at it, I found out the test cases were not covering all the real life issues I encountered (especially because the read is now done by chunk and \r\n can happen at a chunk frontier) while finishing the patch. I'm not pythonist enough to be able to develop that kind of test case... should I file an Issue and we discuss that there ?

@bitprophet

bitprophet Sep 26, 2012

Owner

Well, as I said, I could see it going either way -- if the remote end disappears during this loop I'm not 100% sure we'd get an empty string back, instead I'd expect either a timeout or a network error, and so my worry is if it it is a timeout we'd be in trouble.

I tried testing this via a run("sleep 60") to localhost & killing my sshd, but that doesn't work -- if I gracefully kill (OS X's) ssh server, it finishes handles the command and then terminates. If I ungracefully kill it, it kills the child process which results in a 'valid' nonzero return value.

Basically I need to test an actual network partition :) which I will do shortly!

@bitprophet

bitprophet Sep 26, 2012

Owner

Boo, I was right. To reproduce:

  • Fab task hitting some remote server running e.g. run("sleep 15")
  • While that is running, pull your Ethernet/disable your wifi
  • Wait
  • Keep waiting
  • Put your laptop down for 10 minutes
  • Come back, it's still waiting
  • Make a sandwich
  • OK, it must have noticed by now!
  • Nope, still waiting.
  • 😓

So I guess I'd like you to clarify exactly what the explicit short timeout + ignored socket.timeout stuff is for.

If there's nothing coming down the pipe, us attempting to force a short timeout, ignore it, then try reading again -- doesn't seem like it would help any vs: normal length timeout, which will naturally show up if we get nothing within X time, implying something is wrong.

@octplane

octplane Sep 26, 2012

Well, this is just a standard behavior for network oriented interfaces. I've merely implemented this the way the SSH library is meant to be used (see https://github.com/bitprophet/ssh/blob/master/ssh/channel.py#L490).

As a matter of fact, this indicates the resilience of TCP connection and I'm pretty sure you could do the same test without fab and just with ssh and get the same result (ie ssh hanging out and waiting a looooonng time).

If this is not behaving the same way, then:

  • There must be a way to probe the SSH library for its internal state and find out it's now disconnected from the network (after we get a timeout) : This is the most likely situation...
  • Or we're screwed :(.

In any case, getting disconnected should raise an exception from the SSH library that we should be able to catch, while the read(n), catch timeout is the usual way to do the stuff with network libs (this is how people usually do in all langages, in fact)

I'm not at the office this morning, but I hopefully will have some time to investigate a bit...

@octplane

octplane Sep 26, 2012

Oh, and, how does the original fabric behave in this case ?

@octplane

octplane Sep 26, 2012

I'm throwing my notes here, so that you can see how far I've gone.

  • Read some SSH code and there is definitely things lying around the Transport object that could help us knowing the connection is up or not, although Google does not know.
  • I've decided (because I have no network cable easily available), to try to simulate the disconnection when using a simple SSH tunnel that I'd abruptly cut.
  • Running demo_simple.py (from the ssh package) and shutting down the tunnel sends en EOF
  • Running fab -H localhost sleep and doing the same does this:
fab -H localhost sleep
[localhost] Executing task 'sleep'
[localhost] run: sleep 30

Fatal error: run() received nonzero return code -1 while executing!

Requested: sleep 30
Executed: /bin/bash -l -c "sleep 30"

Aborting.
Disconnecting from localhost:1212... done.

This looks good. Now I will try to emulate the "unplug network connector" and run the same tests (demo_simple is cool because it's small and its behaviour will reveal how to fix the issue in fabric). Not before an hour or two though because I'm not having enough machines around to do the test properly...

(as a side node, I've read a few things about TCP retransmission and the usual kernel timeout seems to be around 15 minutes http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux ) before the kernel decides the connection is actually cut. This value I'll have to check against the unplug-cable test...

@octplane

octplane Sep 26, 2012

Jeff,

I'm sorry. This is the morning and I haven't got my coffee, so I jumped into the issue of detecting dead connection without answering your question regarding the "short" timeout.

The fact that this is a short timeout doesn't change anything to your test. Its goal is merely to provide a good reactivity when streaming the output of a command: As i'm requesting a lot of bytes in one chunk, slow outputting commands could take a while to display the first 4096 bytes. Setting a short timeout allows to dump the first bytes as soon as the first timeout without having to wait for 4096 bytes to be ready.

A diagram because what I'm writing might be not clear.
We assume the command run is outputting bytes 1 by 1 bytes every second (so this is a very not-verbose, very slow command)

setTimeout(0.1)

Time
0                           1s                 100s         4096s
Number or read() completed
                             10                  around 1000 around 40960
Number of bytes printed by the command so far
                              1                    100          4096
Number of bytes read (and printed) by fabric
                             1                      100         4096

with setTimeout(None) blocking mode

Time
0                           1s                 100s         4096s
Number or read() completed
                              0                                       1
           (waiting for 4095 more bytes)
Number of bytes printed by the command so far
                              1                    100          4096
Number of bytes read (and printed) by fabric
                             0                      0              4096

Note also, that there is no "normal" timeout as you said. We usually go blocking (None) or non-blocking (any other value that's chosen by the lib user). The main goal of my patch was to read by large chunk rather than byte-by-byte while maintaining the reactivity of the tool: this is esp. important to detect interactive prompting and behave accordingly. If we set the timeout to None and the 'sudo prompt' arrives in the middle of a 4096 bytes read (which is likely to be the case), fabric will wait indefinitely for the remaining of the 4096 bytes before prompting the user: a dead-lock.

So we need to have a timeout. If you find 0.1 too small, we can put 1, or even 10 but this will not change the unplug-cable behavior, I think... Moreover, this will change the time fabric "lags" behind an interactive sudo prompting (0.1 s is cool, 1s is acceptable, 10s... will feel slooow).

@octplane

octplane Sep 26, 2012

Ohai,

After some investigations, I managed to reproduce some kind of lockup using "ifdown eth0" on my VM (I have no wire, did I tell you that ?). However :

  • simple_demo.py (from ssh) exhibits the same issue (hangs and never ever breaks)
  • vanilla fab exhibits the same issue too
  • branch fab has the issue (obviously)

All these where run along with a simple ssh user@host which has some kind of protection against that and end up aborting Write failed: Broken pipe.

So now, I'm left with very few things:

  • can you check the issue does exists in simple_demo.py and vanilla fab when you actually unplug the wire ?
  • if yes, then:
    • we likely have a more serious issue in ssh (I suspect something wrong happens but the ssh code does not detect clearly the network is cut)
    • we can however implement a solution in fab by adding someking of "idle" timeout what will decide that when a client hasn't say anything for 'n' seconds, then something's wrong (I know the whirr - hadoop like framework - has some kind of hardcoded idle timeout of 30s which did posed us problems because we has something that was idle for > 30s). This is dirty but it might be acceptable.
  • If no, I need to find a way to reproduce the issue in some way.

Argl.

Owner

bitprophet commented Sep 23, 2012

Finished reviewing, sanity tested some (albeit small) personal fab tasks on latest master, then tested on this branch. Does feel a bit snappier! 👍 and no brokenness noticed, as expected.

Will start doing the minor edits listed above.

@bitprophet bitprophet added a commit that referenced this pull request Sep 23, 2012

@bitprophet bitprophet Remove debug logging re #733 c1d076f

@bitprophet bitprophet added a commit that referenced this pull request Sep 23, 2012

@bitprophet bitprophet Whitespace tweaks re #733 9c5d6ce

@bitprophet bitprophet added a commit that referenced this pull request Sep 23, 2012

@bitprophet bitprophet s/bytes/bytelist/g re #733 206607e
Owner

bitprophet commented Sep 23, 2012

OK, all set -- @octplane A) thanks so much for this! and B) please answer the siren'd question above when you can :)

Hey, thanks for all Jeff, you were very welcoming and helped me getting this through !

Owner

bitprophet commented Sep 30, 2012

Moving the timeout discussion out of line comments...:)

So, read your notes -- thanks a lot! I A) totally spaced on testing the disconnect/timeout behavior on the state of the code prior to your change (a serious lapse for me!) and B) now understand your point re: the short timeout being useful.

I'll test on my end to see if I can recreate the "hanging" behavior with how Fabric used to read from the network -- if the behavior is the same then we can log "fix this?" as a new ticket and consider this one done. If the behavior is different we'll have to figure out why.

I will do that ASAP. Thanks!

Owner

bitprophet commented Oct 1, 2012

Yup, same behavior with Fabric 1.4 release branch + latest ssh module. So I'm glad I brought it up, but it won't hold back merging this awesome update.

@bitprophet bitprophet merged commit 4c6641b into fabric:master Oct 1, 2012

1 check passed

default The Travis build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment