Handle connection failures due to server load #948

Merged
merged 2 commits into from Dec 15, 2013

Projects

None yet

3 participants

@akitada
Contributor
akitada commented Aug 3, 2013

SSHException with "Error reading SSH protocol banner" is raised when
sshd is unresponsive but Fabric treats it as an auth failure. This patch
fixes this by treating it as NetworkError.

akitada added some commits Aug 3, 2013
@akitada akitada Handle connection failures due to server load
SSHException with "Error reading SSH protocol banner" is raised when
sshd is unresponsive but Fabric treats it as an auth failure. This patch
fixes this by treating it as NetworkError.
f9a41a7
@akitada akitada Add a changelog entry for #948 2d38a01
@joekiller
Contributor

I encounter this problem when I launch a ton of quick SSH commands via a gateway proxy. Below is the error that bubbles up. Looks like for me the error results from self[gateway] = connect(*normalize(gateway)) which would be caught by this patch.

The output usually looks like

No handlers could be found for logger "paramiko.transport"

Then you get this error:

!!! Parallel execution exception under host u'ip-blah-blah-blah.ec2.internal':
Process ip-blah-blah-blah.ec2.internal:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File " /lib/python2.7/site-packages/fabric/tasks.py", line 233, in inner
    submit(task.run(*args, **kwargs))
  File " /lib/python2.7/site-packages/fabric/tasks.py", line 171, in run
    return self.wrapped(*args, **kwargs)
  File " /lib/python2.7/site-packages/fabric/decorators.py", line 181, in inner
    return func(*args, **kwargs)
  File "/fabfile/util.py", line 476, in _tail_logs
    connections.connect(env.host_string)
  File " /lib/python2.7/site-packages/fabric/network.py", line 103, in connect
    self[gateway] = connect(*normalize(gateway))
  File " /lib/python2.7/site-packages/fabric/network.py", line 429, in connect
    raise NetworkError(msg, e)
NetworkError: Error reading SSH protocol banner[Errno 54] Connection reset by peer

If I catch the error via BaseException when I print out the connections for the parallel task with fabric.state.connections I can see that none exist [].

This patch, along with the command with settings(connection_attempts=N) where N is greater than 1 should work around the issue.

Below is the code I used to work around the problem for the time being for anyone else.

    try:
        with settings(connection_attempts=5):
            sudo(command)
    except KeyboardInterrupt:
        disconnect_all()
    except NetworkError as e:
        msg = str(e)
        if 'Error reading SSH protocol banner' in msg:
            while True:
                sleep(1)
                try:
                    connections.connect(env.host_string)
                except NetworkError as e:
                    msg = str(e)
                    if 'Error reading SSH protocol banner' in msg:
                        continue
                except KeyboardInterrupt:
                    disconnect_all()
                    break
                break
            sudo(command)

@bitprophet
Member

Said this in IRC but will repeat here for posterity:

  • Seems like an OK approach to the problem
  • If somebody can come up with a reliable way to reproduce the issue (it's not just as simple as "no sshd running", right? but a dead or still-starting sshd?) that'd make me feel better about merging this :)
@bitprophet
Member

@akitada said that he can reproduce the issue by running this kind of script with a large number of 'hosts': #965 (comment) - aka "hit the same sshd with many connections". I will try that sometime.

@bitprophet
Member

I can't seem to reproduce this with e.g. 10x hostnames resolving to single target box + 100x parallel runs across all 10 of those "hosts" (doing ls /). Was using a variant of @akitada's linked example above in my comment.

That said, willing to merge since I can't see many possibilities for this causing extra issues & can take it on trust that it does fix it for users experiencing the problem. Thanks!

@bitprophet bitprophet merged commit 2d38a01 into fabric:master Dec 15, 2013

1 check passed

default The Travis CI build passed
Details
@bitprophet
Member

This will get released in the next 1.8.x release, btw, despite Github being all "lol master".

@akitada akitada deleted the akitada:banner-error branch Dec 15, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment