New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle connection failures due to server load #948

Merged
merged 2 commits into from Dec 15, 2013

Conversation

Projects
None yet
3 participants
@akitada

akitada commented Aug 3, 2013

SSHException with "Error reading SSH protocol banner" is raised when
sshd is unresponsive but Fabric treats it as an auth failure. This patch
fixes this by treating it as NetworkError.

akitada added some commits Aug 3, 2013

Handle connection failures due to server load
SSHException with "Error reading SSH protocol banner" is raised when
sshd is unresponsive but Fabric treats it as an auth failure. This patch
fixes this by treating it as NetworkError.
@joekiller

This comment has been minimized.

joekiller commented Aug 15, 2013

I encounter this problem when I launch a ton of quick SSH commands via a gateway proxy. Below is the error that bubbles up. Looks like for me the error results from self[gateway] = connect(*normalize(gateway)) which would be caught by this patch.

The output usually looks like

No handlers could be found for logger "paramiko.transport"

Then you get this error:

!!! Parallel execution exception under host u'ip-blah-blah-blah.ec2.internal':
Process ip-blah-blah-blah.ec2.internal:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File " /lib/python2.7/site-packages/fabric/tasks.py", line 233, in inner
    submit(task.run(*args, **kwargs))
  File " /lib/python2.7/site-packages/fabric/tasks.py", line 171, in run
    return self.wrapped(*args, **kwargs)
  File " /lib/python2.7/site-packages/fabric/decorators.py", line 181, in inner
    return func(*args, **kwargs)
  File "/fabfile/util.py", line 476, in _tail_logs
    connections.connect(env.host_string)
  File " /lib/python2.7/site-packages/fabric/network.py", line 103, in connect
    self[gateway] = connect(*normalize(gateway))
  File " /lib/python2.7/site-packages/fabric/network.py", line 429, in connect
    raise NetworkError(msg, e)
NetworkError: Error reading SSH protocol banner[Errno 54] Connection reset by peer

If I catch the error via BaseException when I print out the connections for the parallel task with fabric.state.connections I can see that none exist [].

This patch, along with the command with settings(connection_attempts=N) where N is greater than 1 should work around the issue.

Below is the code I used to work around the problem for the time being for anyone else.

    try:
        with settings(connection_attempts=5):
            sudo(command)
    except KeyboardInterrupt:
        disconnect_all()
    except NetworkError as e:
        msg = str(e)
        if 'Error reading SSH protocol banner' in msg:
            while True:
                sleep(1)
                try:
                    connections.connect(env.host_string)
                except NetworkError as e:
                    msg = str(e)
                    if 'Error reading SSH protocol banner' in msg:
                        continue
                except KeyboardInterrupt:
                    disconnect_all()
                    break
                break
            sudo(command)

@bitprophet

This comment has been minimized.

Member

bitprophet commented Nov 20, 2013

Said this in IRC but will repeat here for posterity:

  • Seems like an OK approach to the problem
  • If somebody can come up with a reliable way to reproduce the issue (it's not just as simple as "no sshd running", right? but a dead or still-starting sshd?) that'd make me feel better about merging this :)
@bitprophet

This comment has been minimized.

Member

bitprophet commented Nov 20, 2013

@akitada said that he can reproduce the issue by running this kind of script with a large number of 'hosts': #965 (comment) - aka "hit the same sshd with many connections". I will try that sometime.

@bitprophet

This comment has been minimized.

Member

bitprophet commented Dec 15, 2013

I can't seem to reproduce this with e.g. 10x hostnames resolving to single target box + 100x parallel runs across all 10 of those "hosts" (doing ls /). Was using a variant of @akitada's linked example above in my comment.

That said, willing to merge since I can't see many possibilities for this causing extra issues & can take it on trust that it does fix it for users experiencing the problem. Thanks!

@bitprophet bitprophet merged commit 2d38a01 into fabric:master Dec 15, 2013

1 check passed

default The Travis CI build passed
Details
@bitprophet

This comment has been minimized.

Member

bitprophet commented Dec 15, 2013

This will get released in the next 1.8.x release, btw, despite Github being all "lol master".

@akitada akitada deleted the akitada:banner-error branch Dec 15, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment