Gateway fails when target prompts for password #957

Closed
dgonza opened this Issue Aug 14, 2013 · 16 comments

5 participants

@dgonza
dgonza commented Aug 14, 2013

Running this command:
$ fab -H usr-a@target-host -g usr-b@intermediate-host -- whoami
fails if the local user's key is not present on the target host.
Fabric prompts for the password, but stops responding after it is entered.
If the local key is present on the target host, it logs in normally.

@Bengrunt

I'm seeing the exact same problem with Fabric 1.8, using the roles decorator and env.gateway.

@adrianbn

That's an issue for me too: I'm using env.gateway and the target asks for a password. It seems to hang/stall.

@bitprophet
Member

@adrianbn gave more details in this mailing list thread (check both of his messages for whole picture): http://lists.nongnu.org/archive/html/fab-user/2013-10/msg00004.html

@darthmdh

Some more debugging:
(pid 12345 is the sshd running on the gateway from the remote fabric connection)
I left strace running for a minute or so... it's stuck in this syscall:

gateway$ strace -p 12345
Process 12345 attached - interrupt to quit
select(9, [3 6], [], NULL, NULL <unfinished ...>
Process 12345 detached

gateway$ lsof -p 12345
[...]
sshd 12345 joe 0u CHR 1,3 1491 /dev/null
sshd 12345 joe 1u CHR 1,3 1491 /dev/null
sshd 12345 joe 2u CHR 1,3 1491 /dev/null
sshd 12345 joe 3u IPv4 53582836462 TCP gateway:ssh->fabrichost:44208 (ESTABLISHED)
sshd 12345 joe 4u unix 0xaf737482 535393847 socket
sshd 12345 joe 5u unix 0xf6892376 535393852 socket
sshd 12345 joe 6r FIFO 0,6 535393856 pipe
sshd 12345 joe 7w FIFO 0,6 535393856 pipe
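For readers less familiar with strace output: `select(9, [3 6], [], NULL, NULL` means sshd is blocked, with no timeout, until fd 3 (the TCP connection from the Fabric host) or fd 6 (the read end of the pipe) becomes readable. A minimal Python sketch of the same kind of wait, using a local socket pair and pipe as stand-ins for sshd's descriptors (an illustration only, POSIX assumed) and a short timeout so it returns instead of hanging:

```python
import os
import select
import socket

# Stand-ins for sshd's descriptors: a connected socket (like fd 3)
# and the read end of a pipe (like fd 6).
conn, peer = socket.socketpair()
pipe_r, pipe_w = os.pipe()


def wait_readable(timeout):
    """Return whatever select() reports as readable within `timeout` seconds."""
    readable, _, _ = select.select([conn, pipe_r], [], [], timeout)
    return readable


# Nothing has been written yet, so select() times out with an empty list.
# With timeout=None (the trailing NULL in the strace line) it would block forever.
idle = wait_readable(0.1)

# Once the peer sends a byte, the socket becomes readable and select() returns it.
peer.sendall(b"x")
ready = wait_readable(0.1)

# Clean up the descriptors used for the demonstration.
conn.close()
peer.close()
os.close(pipe_r)
os.close(pipe_w)
```

In other words, the gateway's sshd is simply idle: it is waiting for traffic that never arrives.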

@bitprophet
Member

Going to see if I can replicate this; most of my personal testing of the gateway feature was using SSH keys, not passwords, so it's possible password prompt interaction is straight up broken with the gateway feature enabled.

Hopefully I will figure this out this afternoon, but if I don't, it would be super helpful to get this info from @darthmdh @adrianbn @Bengrunt and @dgonza:

  • Which Fabric version(s) are you encountering this in? (I know @Bengrunt is seeing it in 1.8, just curious if it's appearing earlier - I will be going back to 1.5 and working forward myself.)
  • What authentication are you trying to use with both the gateway and the final target? E.g. password on both; key on gateway, password on target; password on gateway, key on target; etc.

My guess is that the main brokenness is when the gateway requires a password, since the gateway's own IO streams are suppressed in this setup. Will poke.

@bitprophet
Member

Interestingly when I attempt to perform password auth to my localhost, via my localhost, I get a traceback, not a hang (this is under the 1.5 branch, same result under 1.8 and master):

» fab -a -H localhost -g localhost -- whoami
[localhost] Executing task '<remainder>'
[localhost] run: whoami
[localhost] Login password for 'jforcier': 
Traceback (most recent call last):
  File "/Users/jforcier/Code/fabric/fabric/main.py", line 736, in main
    *args, **kwargs
  File "/Users/jforcier/Code/fabric/fabric/tasks.py", line 316, in execute
    multiprocessing
  File "/Users/jforcier/Code/fabric/fabric/tasks.py", line 213, in _execute
    return task.run(*args, **kwargs)
  File "/Users/jforcier/Code/fabric/fabric/tasks.py", line 123, in run
    return self.wrapped(*args, **kwargs)
  File "/Users/jforcier/Code/fabric/fabric/main.py", line 713, in <lambda>
    state.commands[r] = lambda: api.run(remainder_command)
  File "/Users/jforcier/Code/fabric/fabric/network.py", line 532, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/Users/jforcier/Code/fabric/fabric/operations.py", line 1008, in run
    warn_only=warn_only, stdout=stdout, stderr=stderr)
  File "/Users/jforcier/Code/fabric/fabric/operations.py", line 891, in _run_command
    channel=default_channel(), command=wrapped_command, pty=pty,
  File "/Users/jforcier/Code/fabric/fabric/state.py", line 352, in default_channel
    chan = connections[env.host_string].get_transport().open_session()
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 662, in open_session
    return self.open_channel('session')
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 731, in open_channel
    raise SSHException('SSH session not active')
paramiko.SSHException: SSH session not active
Disconnecting from localhost... done.

Will poke more with the above combinations of key/pass on gateway/target.

@bitprophet
Member

Using my localhost as a gateway, with SSH keys enabled, and targeting a virtual machine with only password auth enabled, I get the hang behavior and these tracebacks (on 1.5):

» fab -H vm -g localhost -- whoami
[vm] Executing task '<remainder>'
[vm] run: whoami
[vm] Passphrase for private key: 
^C
Stopped.
Disconnecting from localhost... done.
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 65, in _join_lingering_threads
    thr.stop_thread()
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 1395, in stop_thread
    self.join(10)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 605, in join
    self.__block.wait(delay)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 235, in wait
    _sleep(delay)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 65, in _join_lingering_threads
    thr.stop_thread()
  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 1395, in stop_thread
    self.join(10)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 605, in join
    self.__block.wait(delay)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 235, in wait
    _sleep(delay)
KeyboardInterrupt

Which is the same as @adrianbn's email description. So this implies that folks getting the hang behavior are using keys on their gateway but not their target. (Though as above, using password auth on both remote points is also problematic.)

Confirmed that on the 1.5 branch, using SSH keys on both gateway + target still works fine. Will work off of this version of the code & port any fixes forwards all the way to 1.8.

@bitprophet
Member

OK, it's actually hanging on the attempted 2nd connect to the target (via the gateway) after dealing w/ password prompt entry. So it's something at the Paramiko level. Will scan open Paramiko tickets real quick, then dig.

@Bengrunt

To answer your question: I was using SSH pubkey auth on both the gateway and the final host.
More precisely, my personal pubkey was used for the gateway, while a second pubkey (let's say the gateway's root account pubkey) was used to connect to the final target.
So far I've worked around the problem by copying my personal key to every target host.
However, I'm wondering whether that's just normal SSH tunnel behavior...

@bitprophet
Member

It looks like the gateway socket is being closed during the exception/prompt/try again cycle (going by a repr() print of the socket during debugging). Sounds like a reasonable thing to cause this issue. Hopefully it's our doing & we can not do it.

@Bengrunt If the above is right, your situation may be similar, since SSH auth (at least in paramiko/fab) is kinda-sorta a chain of exceptions + trying subsequent keys. (Especially given that paramiko's multi-key handling is kinda wonky even without gateway stuff involved.)

If I can find a solution to this that is basically "don't close the gateway socket prematurely / recreate it as needed", I'd definitely recommend you retry once it's published :) watch this space...

@bitprophet
Member

Yup, derp. We have an explicit finally that closes the socket if it's not None. That would kinda do it.

Problem here is the flow looks like this, for gateways:

  • Set up gateway, call connect() for it, cache.
  • Call connect() for "real" connection to final target
  • Inside that connect() call, a while loop runs, within which is the try/excepts/finally

So we need a way to either:

  • NOT close the socket so it persists within the while loop
    • Requires that we detect whether the failure is being handled by the while loop, or not - in some situations, exceptions get bubbled up, but in others, we are handling + falling back into the loop body
    • Feel like this means finally is not the right semantics for this, but not positive. I is rusty.
  • Rip the gateway socket setup to a standalone function & reconnect each time we retry the final target connection (so both connections are up or down in tandem.)
    • Feels inefficient and maybe problematic in edge cases, though can't think of specific ones
    • Probably easier to implement?
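The flow described above can be reproduced in miniature. This is a sketch of the control flow only: `FakeSocket`, `AuthFailed`, the password list standing in for the prompt loop, and the `connected` flag are all hypothetical stand-ins, not Fabric or Paramiko code.

```python
class FakeSocket:
    """Stand-in for the gateway socket; not a real paramiko object."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class AuthFailed(Exception):
    """Stand-in for an authentication failure that triggers a re-prompt."""


def connect_inner_finally(sock, passwords):
    """Retry loop whose `finally` sits inside the loop body (the buggy shape)."""
    connected = False
    for password in passwords:  # stands in for the prompt-and-retry `while` loop
        try:
            if sock.closed:
                raise RuntimeError("gateway socket already closed")
            if password != "secret":
                raise AuthFailed("bad password")
            connected = True
            return "connected"
        except AuthFailed:
            continue  # handled: fall back into the loop and try again
        finally:
            # Runs on every iteration, including failures the loop handles itself.
            if not connected and sock is not None:
                sock.close()
    raise AuthFailed("out of passwords")


sock = FakeSocket()
try:
    outcome = connect_inner_finally(sock, ["wrong", "secret"])
except RuntimeError as exc:
    outcome = str(exc)
```

The first (wrong) password is handled by the loop, but the `finally` has already closed the socket, so the second attempt never gets a usable gateway even though the password is correct.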
@bitprophet
Member

Yea, I think I'm being dumb about exceptions - if we move the socket closure outside the while loop, in a simple "try/finally", it should do what we want (maximally lazy socket closure within connect()).
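A sketch of that restructuring, with the cleanup hoisted into a single `try/finally` around the whole loop. All names here (`FakeSocket`, `AuthFailed`, `attempt`) are hypothetical stand-ins, and the sketch only models control flow; it does not model whether a real channel can be reused after a failed handshake.

```python
class FakeSocket:
    """Stand-in for the gateway socket; not a real paramiko object."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class AuthFailed(Exception):
    """Stand-in for an authentication failure that triggers a re-prompt."""


def attempt(sock, password):
    """Pretend to authenticate over `sock`; the only valid password is 'secret'."""
    if sock.closed:
        raise RuntimeError("gateway socket already closed")
    if password != "secret":
        raise AuthFailed("bad password")


def connect_outer_finally(sock, passwords):
    """Retry loop with socket cleanup outside the loop (maximally lazy closure)."""
    connected = False
    try:
        for password in passwords:  # stands in for the prompt-and-retry `while` loop
            try:
                attempt(sock, password)
                connected = True
                return "connected"
            except AuthFailed:
                continue  # the socket stays open for the next attempt
        raise AuthFailed("out of passwords")
    finally:
        # Runs once, when connect() is really finished, however it exits.
        if not connected and sock is not None:
            sock.close()
```

With `["wrong", "secret"]` the second attempt now sees an open socket and succeeds; the socket is closed only when every attempt has failed or an unhandled exception escapes.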

@bitprophet
Member

With that "fixed" I now get our old friend Error reading SSH protocol banner. I suspect that the gateway socket, even when not closed, still can't be reused in the way I was imagining, which lends support to the 2nd option above.

[vm] Passphrase for private key: <entered key here>
DEBUG:paramiko.transport:starting thread (client mode): 0x8aae90L
DEBUG:paramiko.transport:[chan 1] EOF received (1)
ERROR:paramiko.transport:Exception: Error reading SSH protocol banner
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
ERROR:paramiko.transport:Traceback (most recent call last):
ERROR:paramiko.transport:  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 1557, in run
ERROR:paramiko.transport:    self._check_banner()
ERROR:paramiko.transport:  File "/Users/jforcier/Code/paramiko/paramiko/transport.py", line 1683, in _check_banner
ERROR:paramiko.transport:    raise SSHException('Error reading SSH protocol banner' + str(x))
ERROR:paramiko.transport:SSHException: Error reading SSH protocol banner
ERROR:paramiko.transport:
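A sketch of the 2nd option (rebuild the gateway channel on every retry). The assumption that a channel is unusable after one handshake attempt is modeled on the suspicion above, not confirmed behavior; `FakeChannel`, `handshake`, and the `make_channel` factory are hypothetical names.

```python
class AuthFailed(Exception):
    """Stand-in for an authentication failure that triggers a re-prompt."""


class BannerError(Exception):
    """Stand-in for 'Error reading SSH protocol banner'."""


class FakeChannel:
    """Stand-in gateway channel that only survives one handshake attempt."""

    def __init__(self):
        self.used = False


def handshake(channel, password):
    """Pretend SSH handshake: a channel that was already used cannot be reused."""
    if channel.used:
        raise BannerError("Error reading SSH protocol banner")
    channel.used = True
    if password != "secret":
        raise AuthFailed("bad password")


def connect(make_channel, passwords):
    """Open a fresh gateway channel for every attempt instead of reusing one."""
    created = 0
    for password in passwords:  # stands in for the prompt-and-retry loop
        channel = make_channel()  # new channel each time around
        created += 1
        try:
            handshake(channel, password)
            return channel, created
        except AuthFailed:
            continue
    raise AuthFailed("out of passwords")
```

Calling `connect(FakeChannel, ["wrong", "secret"])` creates two channels and succeeds on the second, whereas retrying `handshake` on the first channel would raise the banner error.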
@bitprophet
Member

Main issue now is that the top level fabric.network.connect takes a sock object and once that object becomes unusable it's difficult to know how it was generated. (Also need to ensure any recreated object becomes re-cached in the connection cache!)

Basically HostConnectionCache.connect and the top level connect need to be a bit more intertwined :(

@bitprophet
Member

While I'm talking to myself:

  • Could have HCC.connect's call of top level connect() pass itself in, allowing connect() to update the gateway cache if needed.
    • This can be made backwards compat by making that new kwarg default to None & simply don't try doing said update if it is None.
  • Alternately, could have top level connect() return both the connected SSHClient object and the socket it's riding on top of, so that HCC.connect can perform the update itself.
    • There is no way to make this backwards compatible.
  • Given this is all internal, backwards compat is not paramount, but I'd still rather not break it: given how old Fabric 1 is at this point, there are bound to be some people reusing internals this important.
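A rough sketch of the first idea (the cache passes itself in via a kwarg that defaults to None). Only the names `HostConnectionCache` and `connect` come from the discussion; the `gateway_socks` attribute, `open_gateway_channel`, and the dict standing in for an SSHClient are invented for illustration and do not reflect Fabric's real internals.

```python
import itertools

_channel_ids = itertools.count(1)


def open_gateway_channel(gateway):
    """Hypothetical helper: pretend to open a new channel through `gateway`."""
    return "channel-%d via %s" % (next(_channel_ids), gateway)


def connect(host, gateway=None, cache=None):
    """Toy top-level connect().

    `cache` is the new kwarg. It defaults to None, so existing callers that
    never pass it keep working and simply skip the cache update.
    """
    sock = open_gateway_channel(gateway) if gateway else None
    if cache is not None and sock is not None:
        cache.gateway_socks[gateway] = sock  # keep the cache pointing at the live channel
    return {"host": host, "sock": sock}  # stand-in for a connected SSHClient


class HostConnectionCache(dict):
    """Toy connection cache keyed by host string."""

    def __init__(self):
        super().__init__()
        self.gateway_socks = {}  # hypothetical attribute tracking gateway channels

    def connect(self, host, gateway=None):
        # Pass the cache itself in so connect() can re-cache a recreated channel.
        self[host] = connect(host, gateway=gateway, cache=self)
        return self[host]
```

A call such as `connect("vm", gateway="localhost")` without `cache` behaves as before, which is the backwards-compatibility property the bullet describes.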
@bitprophet
Member

Ended up tweaking connect's signature a bit: having it take a socket arg that it no longer does anything with (since it now makes the gateway cxn internally) felt misleading, and keeping BC wasn't worth it for an internal method call.

Also kinda seem to have fixed it: I can now gateway to my VM via localhost with the VM requiring password auth.

@bitprophet added a commit that referenced this issue Dec 18, 2013
Changelog re #957 92852e0