Automatically reconnect timed-out SSH connections #402

Closed
bitprophet opened this Issue Aug 19, 2011 · 16 comments

@bitprophet
Member

Description

During long-running sessions with Amazon AWS instances, the "SSH session not active" exception is raised from time to time.

I'd like to suggest the workaround in webengineer@aab9fc0.


Originally submitted by Max Chervonec (webengineer.in.ua) on 2011-08-01 at 03:02am EDT

@bitprophet bitprophet was assigned Aug 19, 2011
@bitprophet
Member

Jeff Forcier (bitprophet) posted:


Thanks for the submission! The patch looks good. One of the upcoming releases will probably be network related so this fix will fit well there.


on 2011-08-01 at 11:46am EDT

@bitprophet
Member

Max Chervonec (webengineer.in.ua) posted:


Glad to help!


on 2011-08-02 at 09:55am EDT

@tow
tow commented May 22, 2012

Any progress on this ticket? It doesn't seem to have made it into the main fabric repo yet.

@webengineer
Contributor

Toby, in case you are interested in a workaround: try the --keepalive=15 option.

@bitprophet
Member

@webengineer is right that keepalive may help with this.

However, yea, I think I did intend for this to go in with the other stuff like timeout & reconnection attempts.

@tow, what would really help is if you could evaluate how those three options -- keepalive, timeout & connection attempts -- intersect with the problem you are/were having, and with your proposed change.

I think keepalive will prevent your exception from ever actually occurring, but not 100% sure. It's also possible that the way we implemented reconnections would similarly obviate the part of the code you modified.

But that's just conjecture (I haven't time to repro on my end and test right now) and I'd still be happy to merge your patch if it looks like it fills a gap these other options don't cover.

Thanks & sorry again :)
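For reference, the three options mentioned above can also be set in a fabfile rather than on the command line. This is a minimal config sketch for Fabric 1.x; the values are illustrative only, and whether any of them prevents the exception in this ticket is exactly what is still unconfirmed here.

```python
# fabfile.py (Fabric 1.x). Values are illustrative, not recommendations.
from fabric.api import env

env.keepalive = 15           # send an SSH keepalive every 15 seconds (default 0, disabled)
env.timeout = 10             # seconds to wait when establishing a connection
env.connection_attempts = 3  # how many times to try the initial connection
```

The same settings are available as the --keepalive, --timeout and --connection-attempts command-line options.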

@webengineer
Contributor

@bitprophet, I am 95% sure that --keepalive=15 worked for us during some period (before we switched to Celery-governed tasks).

@kkolev
kkolev commented Oct 23, 2012

I'm also affected by the issue. Strangely enough, it doesn't seem to get resolved with --keepalive=15. The command in particular is a sudo('pip install <something>'), which often takes ages or outright times out due to PyPI.

Do you guys know if there's another workaround to this? I'm not sure if connection attempts is a good direction, since the initial SSH connection is established successfully - the problem is that it times out after a while.

And I'm kind of suspicious of timeout, since most of the pip install commands take a lot longer than the default 10s to execute, but I will give it a try with something ridiculous like 600.

@bitprophet
Member

@kkolev - your issue falls under remote program timeout; the other stuff here is all initial connection timeout. Please see #249 -- there is a pull request there I will be merging pretty soon.

@jberryman

I'm hitting this issue on fabric 1.4.3, but setting env.keepalive = 15 hasn't fixed it. This is my pattern of usage:

env.keepalive = 15
env.user = 'ubuntu'
#... lots of crap as ubuntu
with settings(user='occasional'):
    run('something')
#... lots more crap as 'ubuntu'
with settings(user='occasional'):
    run('something')

where when we hit that second block as user occasional, I get:

  File "xxx/local/lib/python2.7/site-packages/ssh/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "xxx/local/lib/python2.7/site-packages/ssh/transport.py", line 729, in open_channel
    raise SSHException('SSH session not active')

Not really following who's who in this thread, but I don't think timeout or connection attempts are relevant to the referenced error.

EDIT: I'm having trouble producing a small example that reproduces the problem, but can try harder if it would help.

@jberryman

Ah, okay I figured out how to reproduce it; a reboot() is what causes the error even with keepalive set. Oddly the default env.user connection is fine after the reboot, but not the other.

Here is the task, run with fab -H xxx reproduce:

from fabric.api import env, reboot, run, settings, task

env.connection_attempts = 30  # enough time for reboot
@task
def reproduce():
    run('echo "on normal user"')
    with settings(user='occasional'):
        run('echo "on occasional user"')
    reboot()
    run('echo back up')
    with settings(user='occasional'):
        run('echo "fail"')

And here is the output with error:

$ fab -H xxx reproduce                      
[xxx] Executing task 'reproduce'
[xxx] run: echo "on normal user"
[xxx] out: on normal user

[xxx] run: echo "on occasional user"
[xxx] out: on occasional user

[xxx] out:
[xxx] out: Broadcast message from ubuntu@xxx
[xxx] out:  (/dev/pts/0) at 15:18 ...

[xxx] out: The system is going down for reboot NOW!
[xxx] run: echo back up
[xxx] out: back up

[xxx] run: echo "fail"
Traceback (most recent call last):
  File "foo/local/lib/python2.7/site-packages/fabric/main.py", line 717, in main
    *args, **kwargs
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 299, in execute
    multiprocessing
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 198, in _execute
    return task.run(*args, **kwargs)
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 112, in run
    return self.wrapped(*args, **kwargs)
  File "foo/fabfile.py", line 1438, in reproduce
    run('echo "fail"')
  File "foo/local/lib/python2.7/site-packages/fabric/network.py", line 463, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "foo/local/lib/python2.7/site-packages/fabric/operations.py", line 909, in run
    return _run_command(command, shell, pty, combine_stderr)
  File "foo/local/lib/python2.7/site-packages/fabric/operations.py", line 819, in _run_command
    stdout, stderr, status = _execute(default_channel(), wrapped_command, pty,
  File "foo/local/lib/python2.7/site-packages/fabric/state.py", line 340, in default_channel
    chan = connections[env.host_string].get_transport().open_session()
  File "foo/local/lib/python2.7/site-packages/ssh/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "foo/local/lib/python2.7/site-packages/ssh/transport.py", line 729, in open_channel
    raise SSHException('SSH session not active')
ssh.SSHException: SSH session not active
Disconnecting from xxx... done.
Disconnecting from occasional@xxx... done.

Let me know if I should file a separate bug report.

EDIT: the workaround I'm using is to call disconnect_all() after reboot()

@bitprophet
Member

Hey @jberryman, I think the issue with that specific use case is that reboot calls a reconnect function, but only for the current value of env.host_string, so your 2nd/alternate connection w/ the different user is unaffected (and then becomes invalid post-reboot, causing your error, AFAICT).

disconnect_all, as the name implies, forcibly disconnects all open connections, including your 2nd connection, which is why it works OK.

I reckon we could update reboot to scan the connection cache for connections sharing the same hostname as env.host_string (e.g. by using network.to_dict or network.normalize; I hate that module so much, Fab 2 can't come fast enough :)) and reconnect all of them. If you want to take a stab at this & can confirm it works for your setup, I'd merge that pull request. Thanks!
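To make the idea concrete, here is a rough, standalone sketch of the "scan the cache for the same hostname" step. It is not Fabric code: hostname_of, evict_connections_for_host and FakeConnection are hypothetical names, the cache is assumed to be a plain dict keyed by 'user@host:port' style strings, and IPv6 literals are ignored. It only drops the stale entries, on the assumption that the next use would reconnect.

```python
def hostname_of(host_string):
    """Return the bare hostname from a 'user@host:port' style string."""
    # Drop an optional 'user@' prefix.
    host = host_string.rsplit('@', 1)[-1]
    # Drop an optional ':port' suffix (IPv6 literals are not handled here).
    return host.split(':', 1)[0]


def evict_connections_for_host(cache, host_string):
    """Close and remove every cached connection to the same hostname.

    `cache` maps host strings to objects with a close() method.
    Returns the sorted list of evicted keys.
    """
    target = hostname_of(host_string)
    stale = sorted(key for key in cache if hostname_of(key) == target)
    for key in stale:
        cache.pop(key).close()
    return stale


class FakeConnection(object):
    """Hypothetical stand-in for an SSH client; records close() calls."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


cache = {
    'ubuntu@xxx': FakeConnection(),
    'occasional@xxx': FakeConnection(),
    'ubuntu@other': FakeConnection(),
}
evicted = evict_connections_for_host(cache, 'ubuntu@xxx')
print(evicted)        # ['occasional@xxx', 'ubuntu@xxx']
print(sorted(cache))  # ['ubuntu@other']
```

In the reboot case above, this would clear both the ubuntu and the occasional connection to xxx while leaving connections to other hosts alone.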


I'm also going to poke at the original patch in the description so I can finally close this poor ticket.

@bitprophet bitprophet added a commit that referenced this issue Jan 26, 2013
@bitprophet bitprophet Changelog re #402, fixes #402 08d4942
@bitprophet
Member

@jberryman So, now that I have merged the original patch, it will theoretically also fix your problem, no reboot tweaking needed. Please let me know if you try it out and it fails to work; will be released in Fab 1.6 which will be out very soon.
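The patch itself is not shown in this thread, so the following is only a sketch of one way automatic reconnection could work and why it would also cover the reboot case: catch the "not active" error when opening a channel, throw away the cached connection, and retry once with a fresh one. SessionNotActive, FakeConnection, open_channel and connect are all hypothetical stand-ins, not Fabric or ssh library APIs.

```python
class SessionNotActive(Exception):
    """Stand-in for the 'SSH session not active' error in the traceback."""


class FakeConnection(object):
    """Hypothetical connection whose transport may have gone stale."""
    def __init__(self, active=True):
        self.active = active

    def open_session(self):
        if not self.active:
            raise SessionNotActive('SSH session not active')
        return 'channel'

    def close(self):
        self.active = False


def open_channel(cache, host_string, connect):
    """Open a channel, reconnecting once if the cached connection is stale.

    `connect` is a callable that builds a fresh connection for host_string.
    """
    if host_string not in cache:
        cache[host_string] = connect(host_string)
    try:
        return cache[host_string].open_session()
    except SessionNotActive:
        # Evict the dead connection, then retry exactly once with a new one.
        cache.pop(host_string).close()
        cache[host_string] = connect(host_string)
        return cache[host_string].open_session()


attempts = []


def connect(host_string):
    """Record each (re)connection and hand back a live fake connection."""
    attempts.append(host_string)
    return FakeConnection(active=True)


# Simulate the second user's connection having died during a reboot.
cache = {'occasional@xxx': FakeConnection(active=False)}
channel = open_channel(cache, 'occasional@xxx', connect)
print(channel, attempts)
```

Because the retry happens per host string at the point of use, a stale occasional@xxx entry would be replaced the next time it is needed, without reboot having to know about it.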

@jberryman

Great! Won't be able to upgrade right away, but look forward to checking out the updates.


@MajorTal

Sadly, I'm still getting this error.

@bitprophet
Member

@MajorTal A) it's usually apropos to give details when you report such things ;) B) this hasn't actually been released yet, Fab 1.6.0 is still a few days away. Are you running the latest master branch? (If not, can you install from it & see if it fixes your specific issue?) Thanks!

@jberryman

@bitprophet sorry, to clarify: have you tested against the reproduction task I posted above, or are you asking me to do that?
