SSH agent forwarding error #562
Not sure what you mean by "code shipment" exactly. Can you please provide links to a Gist of your fabfile & the full output of when you run it and get this error? Might help. Haven't seen this error before myself, but the feature is brand new and has only been tested in a few situations.
Thanks for looking into this. When I say "code shipment", I mean pulling a repo, archiving the previous release, and creating a new directory with new code. Please refer to my fabfile and shipper library here: https://gist.github.com/1849832. The full output from one of the servers is as follows (I put xxx for the server name and module for privacy's sake): [xxx] out: Loading staging settings ...
Sorry, I meant literally all the output from the fab task, not just the lines displaying the remote server's stdout :) If you're explicitly hiding the rest (e.g. via output hiding), please include it anyway. I can say that the error there appears to be not from Fabric itself but from whatever is being invoked remotely, but without more info (see above) I can't tell which run/sudo call actually generated it :)
I've seen this as well. I have a git checkout on a remote server, and if I do 'git submodule update' and there are multiple submodules to update, the 2nd one always fails.
I have a similar issue using mercurial:
Thanks a lot for the feedback, guys. That's really interesting, especially how it affects hg too -- @gosusnp, is hg also using the SSH protocol in this instance (from your target to the repository host)? I assume 'yes' but want to make sure. So, it sounds like there may be some problem with initiating multiple agent-using requests on the remote box -- I'll see if even the base case for this (two simple agent-using calls) triggers it. EDIT: Well, that didn't replicate the problem -- two subsequent calls completed without error.
@bitprophet, your assumption is correct. If it helps, my current workaround is to execute the whole command using local("ssh -A %(host)s .... buildout") instead of run("buildout").
Just as I initially reported ;)
I have a similar problem to gosusnp, though in my case with svn (svn+ssh protocol). I'm not clear that it's exactly the same issue as the original report, though. I'm running Python 2.7.1.

This hangs after the output from updating 'foo'; a plain "svn up" fails the same way -- one line of output, then it hangs:

```python
def doesnotwork():
    with cd('$PATH'):
        run('svn update foo bar')
```

This works as expected:

```python
def works():
    with cd('$PATH'):
        run('svn update foo')
```

As another commenter mentioned, this workaround seems to work:

```python
def workaround():
    local('ssh %(host)s cd $PATH && svn up foo bar' % env)
```

Running with --show=debug doesn't seem to provide any help.
Did a little more digging, and I think the issue I'm seeing is the same as #613 rather than this one. It does seem to be related to agent forwarding. FWIW, the minimal test case I've found that shows the problem is:

```python
def test():
    run('ssh server ls && ssh server ls')
```

Running the 2 ssh operations in separate run() calls doesn't have any problems.
Yes, this and #613 are almost definitely the same root cause, namely some sort of deficiency in the agent forwarding feature. (The perils of pull requests -- I don't know how it actually works, only that it passed tests & worked in a few basic real-world test situations.) I had tried triggering this issue with two consecutive agent-using calls, without luck.
I don't know a lot about how ssh works its magic, and my Python is pretty rusty, but I thought I'd take a stab at debugging the issue. One thing I noticed is that in ssh.transport._parse_channel_open(), the two lines that test kind == 'auth-agent@openssh.com' (2011 and 2090) are indented with tabs while the rest of the file is indented with spaces. Correcting this has no effect on the bug, though :) Looking at the logging, I see that when auth is requested, it makes a connection to the AgentClientProxy. Then when the second ssh call makes an auth request, it fires up another AgentClientProxy to handle that, though the first AgentClientProxy is never closed. Again, I don't know a lot about how this is supposed to work, but it seems like the ideal case would be to reuse the first AgentClientProxy -- or, if it does need to be shut down, maybe the fact that it's not being closed before another instance tries to connect is what's causing the problem.
Actually, in my explorations I've done some digging, but I'm not done yet. Mostly focusing on "what might be different in the 2nd agent request that's causing it to upset the remote sshd". Found two spots already where state is created in a non-idempotent manner. Still digging around; also thinking, based on the above, that the agent handler thread might be a place to dig deeper if no other spots in the code seem likely.
I don't believe this is the cause of the problem, but I realized we're never closing the agent forwarding channel; I've committed something to tie that off. A minor issue with this (though it would still be a problem before my change) is that this combo of local socket + extra SSH channel is being created (and now at least partly torn down) for every command run. So again, in large fabfiles this could potentially be inefficient. However, from what little I know of how forwarding works, we probably can't hold open the local->remote channel persistently, and the local socket connection is a file/Unix socket, so while we could cache that OK, it's not likely a big deal. Finally, we'd want this granular behavior in any case where somebody only wanted to enable agent forwarding for a subset of their execution run, though I'd expect that to be a pretty uncommon use case.
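The per-command setup/teardown just described can be sketched as a context manager. This is illustrative only, not Fabric's actual internals -- `open_channel` and `open_local_socket` are hypothetical stand-ins for whatever creates the forwarding channel and the local agent socket:

```python
import contextlib

@contextlib.contextmanager
def forwarded_agent(open_channel, open_local_socket):
    # Set up the per-command forwarding resources: one remote SSH
    # channel plus one connection to the local agent socket.
    chan = open_channel()
    sock = open_local_socket()
    try:
        yield chan, sock
    finally:
        # Tear both down when the command finishes, so nothing
        # leaks across subsequent commands in a large fabfile.
        sock.close()
        chan.close()
```

Every pass through the with-block pays the setup cost again, which matches the "one forwarding channel per command" behavior discussed in this thread.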
Sanity check against normal SSH client behavior with the same invocation (two agent-forwarded connections in a single remote command execution): looks similar to what we're doing at first blush, re: a new agent forwarding channel per remote 'ssh' invoke.
I notice it's apparently freeing up the auth connection each time (running with verbose logging). I'll have to go find out where closing occurs and make sure it's logging; I don't see any obvious channel close events from Paramiko at debug level, as I do from OpenSSH.
I just looked through my saved logs again... and you're right: I'm only seeing AgentClientProxy init'd once, but it's connecting twice. Probably not especially helpful, but here's the relevant portion of my logs (with some logging added to agent.py):
(That final EOF log message is from me control-C'ing it after the hang.) Good to know the "one agent connection per request" is the same thing OpenSSH is doing. I'm suspicious about it logging EOF messages on chan 2 (the first agent channel) and nothing on chan 3 (the second agent connection), but that may just essentially be a symptom.
I found it! AgentClientProxy.connect() makes a connection to the ssh-agent and stores it in self._conn. AgentProxyThread (the superclass of AgentRemoteProxy) selects on and reads from this connection in the AgentClientProxy instance. So when the second AgentProxyThread starts up, it causes the AgentClientProxy to overwrite self._conn with the second connection to the ssh-agent. The first AgentProxyThread is still running and selecting on self._agent._conn, so it actually receives the message before the second thread can get it. In other words, we have 2 AgentProxyThread instances listening on the same socket, when really they should be listening on separate sockets.

So, in AgentProxyThread's run() method, if you have it save agent._conn in self._conn, and likewise change the _communicate() method to use self._conn instead of self._agent._conn, it fixes the problem. Here's my diff:

```diff
--- /Library/Python/2.7/site-packages/ssh/agent.py	2012-04-30 11:34:04.000000000 -0700
+++ agent.py	2012-05-04 16:03:30.000000000 -0700
@@ -105,6 +105,7 @@
         threading.Thread.__init__(self, target=self.run)
         self._agent = agent
         self._exit = False
+        self._conn = None
 
     def run(self):
         try:
@@ -112,6 +113,7 @@
             self.__inr = r
             self.__addr = addr
             self._agent.connect()
+            self._conn = self._agent._conn
             self._communicate()
         except:
             #XXX Not sure what to do here ... raise or pass ?
@@ -122,10 +124,10 @@
         oldflags = fcntl.fcntl(self.__inr, fcntl.F_GETFL)
         fcntl.fcntl(self.__inr, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)
         while not self._exit:
-            events = select([self._agent._conn, self.__inr], [], [], 0.5)
+            events = select([self._conn, self.__inr], [], [], 0.5)
             for fd in events[0]:
                 if self._agent._conn == fd:
-                    data = self._agent._conn.recv(512)
+                    data = self._conn.recv(512)
                     if len(data) != 0:
                         self.__inr.send(data)
                     else:
@@ -133,7 +135,7 @@
                 elif self.__inr == fd:
                     data = self.__inr.recv(512)
                     if len(data) != 0:
-                        self._agent._conn.send(data)
+                        self._conn.send(data)
                     else:
                         break
             time.sleep(io_sleep)
```

I'm not sure that's really the "right" way to fix it, since AgentClientProxy is still having the value of self._conn overwritten with each new connection... but at least it demonstrates the general idea of how to fix it.
So if I read what you're saying right, I was moving in the right direction re: the poor state handling of the local socket connection to the local SSH agent? And I think what you're saying is that this works well, and the only problem is the value of `_conn` being overwritten. I'll try to verify your changes above and then think about whether/how to try and memoize/cache this socket. Thanks a lot for the 2nd look; no idea if I would've run into that specific angle myself :)
The state handling of the local socket seems to be OK; it's the overwriting problem that causes the issue (unless I'm misunderstanding what you mean). Take a look at the loop in AgentProxyThread._communicate(). It's selecting on self._agent._conn. So when the second AgentRemoteProxy instance is created, it overwrites agent._conn with a new socket, and now we have two instances of AgentRemoteProxy (in separate threads) trying to read data from the socket that should be exclusive to the second instance. I found the problem by putting logging in Connection.send() and recv(). The first instance, where things worked, would send a message and receive a response. On the second run, the new instance would send a message, but the first instance would receive the response. This is kind of messy, but here are my logs. First, the one that works:
Then when the second ssh happens:
The request comes in via chan 3 on thread 4, but it's thread 3 that gets the response and then tries to pass it along to chan 2. From there it was just a matter of digging through the code doing the send and recv to spot the problem. You can see that thread 3 starts out working with socket 0x10868f280, but after thread 4 starts up, thread 3 is now working with socket 0x10868f210, the one created for thread 4. I hope that all makes sense :) FWIW, I tried running a script that makes a bunch of ssh connections (via svn, i.e. a non-trivial test) and it ran to completion.
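The misrouting described above boils down to two reader threads sharing one mutable attribute. A minimal, self-contained sketch of the bug pattern and the snapshot fix -- class and method names here are illustrative stand-ins, not the actual ssh library API:

```python
import socket

class SharedAgent:
    """Stands in for AgentClientProxy: a single _conn attribute,
    overwritten on every new connection."""
    def __init__(self):
        self._conn = None

    def connect(self, conn):
        # Clobbers whatever connection the previous thread was using.
        self._conn = conn

class ProxyWorker:
    """Stands in for AgentProxyThread. The fix: snapshot the connection
    at startup instead of re-reading agent._conn in the read loop."""
    def __init__(self, agent, conn):
        agent.connect(conn)
        self._conn = agent._conn  # private copy, immune to later overwrites

agent = SharedAgent()
s1, _ = socket.socketpair()
s2, _ = socket.socketpair()

w1 = ProxyWorker(agent, s1)
w2 = ProxyWorker(agent, s2)  # second connection overwrites agent._conn
```

With the snapshot, w1 keeps reading its own socket even though agent._conn now points at w2's; without it, both workers would select on the same (newest) socket, which is exactly the misdelivery seen in the logs.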
Yea, by "state handling" I meant "incorrectly overwrote `_conn`". I think I get what you mean, but again I'll try to verify on my end soon. Thanks again!
Ah yes, then that is the problem :) My patch simply copies the conn object from the AgentClientProxy into the AgentProxyThread instance... which smooths over this problem, but doesn't leave the code in any better shape to correctly handle state.
I've reproduced your findings -- I am also seeing the older thread receive the response meant for the newer one. Trying to think of the best way to "really" solve this, though I'd honestly be fine with your solution since AFAICT it's no worse than the original patch re: how it handles the socket objects. I was earlier thinking about caching the socket fd, but that's not the real problem now; the problem is the multiple read threads. Need to either convince the older one to terminate when the new one shows up, or (if possible) have the first thread "switch" to newly created channels. (This might leave stale remote channels lying around too, so really what we want is for all three items -- fd, thread, channel -- to be reused whenever possible. Not sure if that meshes with the SSH spec, though.)
I've created a new class which creates a new AgentClientProxy every time a new ssh-agent channel arrives. It works for me.
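A rough sketch of the shape described here -- one fresh handler, with its own agent connection, per incoming agent channel. All names are hypothetical (the real patch lives in the ssh library); `open_agent_conn` stands in for whatever opens a connection to the local ssh-agent:

```python
import threading

class PerChannelAgentHandler(threading.Thread):
    """One handler per incoming 'auth-agent@openssh.com' channel,
    each owning a private connection to the local ssh-agent."""
    def __init__(self, channel, open_agent_conn):
        threading.Thread.__init__(self)
        self.daemon = True
        self.channel = channel
        self.conn = open_agent_conn()  # never shared between handlers

    def run(self):
        pass  # pump bytes between self.channel and self.conn (elided)

def on_agent_channel(channel, open_agent_conn, handlers):
    # Called once per channel-open request: each request gets its own
    # handler, so there is no shared _conn attribute left to clobber.
    handler = PerChannelAgentHandler(channel, open_agent_conn)
    handlers.append(handler)
    return handler
```

Because every handler owns its connection, a second channel opening can no longer redirect the first handler's reads, which is the failure mode diagnosed above.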
Thanks for the patch, @lynxis. This looks like exactly what I was talking about above re: ensuring a single local-agent-socket => remote-fake-agent connection per Fabric-level statement. I'll test it out in a bit! EDIT: Confirmed the patch works. Want to double check whether I can re-apply an earlier tweak of mine on top of it.
I think it's a good idea to close all channels; I've added a commit to my fork. One question about these agent channels: do we close them cleanly after we receive EOF?
@lynxis, re: EOFs triggering closures -- I don't know, actually; I'd have to read the spec or the OpenSSH source to figure that out. As long as we close the channels automatically at the end of the Fabric-level command, we should be fine. Will try to test that new change out today sometime if I can.
The stuff with the connection drop during operation is pretty repeatable here.
@techtonik I'm pretty sure that error condition is related to the changes under discussion, yea, though I never experienced it myself (I only got the hang).
This is up as ssh 1.7.14 -- @techtonik, if you upgrade your Python ssh package, let us know whether it fixes things for you.
I lied -- they're on PyPI now, not tomorrow :) Still curious if this fixed @techtonik's problem. Also really enjoy spamming his github inbox, apparently :D
Well, I've tried to use Fabric 1.4.1 with ssh from checkout, unsuccessfully. I need to log out and try once again with the latest checkout of Fabric, because it seems the Gnome agent crashed behind the scenes -- I started to be prompted for my private key password. The most evident bug was the messed-up output, before vs. after the change.
That fixed it for me! Thanks a lot @bitprophet, @lynxis, @stevenmcdonald for your help troubleshooting and solving this one.
I confirm. Fixed with Fabric 1.4.2 / ssh 1.7.14. ;)
Upstream changes:

## ssh 1.7.14 (2012-05-07)

* #15: Implemented parameter substitution in SSHConfig, matching the implementation of `ssh_config(5)`. Thanks to Olle Lundberg for the patch.
* #24: Switch some internal type checking to use `isinstance` to help prevent problems with client libraries using subclasses of builtin types. Thanks to Alex Morega for the patch.
* [Fabric #562](fabric/fabric#562): Agent forwarding would error out (with `Authentication response too long`) or freeze, when more than one remote connection to the local agent was active at the same time. This has been fixed. Thanks to Steven McDonald for assisting in troubleshooting/patching, and to GitHub user `@lynxis` for providing the final version of the patch.
Fixes [Fabric #562](fabric/fabric#562). (cherry picked from commit 58aa520)
We are getting "paramiko.ssh_exception.AuthenticationException: Unable to connect to SSH agent" when we try to run an SSH command inside an SSH session, e.g. RemoteCommandExecutor.run_remote_command("ssh {0}".format(mpi_module, remote_host)) Possibly related to fabric/fabric#562 Signed-off-by: Luca Carrogu <carrogu@amazon.com>
I enabled ssh agent forwarding with fab 1.4 with env.forward_agent = 'True' and all seems fine, except I receive an odd error during code shipment:

```
out: Authentication response too long: 3577407571
out: fatal: The remote end hung up unexpectedly
```

Fab does properly forward the agent when pulling code from github, so I'm not sure what's going on.
Before fab 1.4, I used this function to get around the key forwarding issue:

```python
def sshagent_run(cmd):
    local('ssh -A %s@%s "%s"' % (env.user, env.host, cmd))
```

Any idea what's going on?