Automatically reconnect timed-out SSH connections #402

Closed
bitprophet opened this issue Aug 19, 2011 · 16 comments
Comments

@bitprophet
Member

Description

During long-running sessions with Amazon AWS instances, the "SSH session not active" exception is raised from time to time.

I'd like to suggest the workaround in webengineer@aab9fc0.


Originally submitted by Max Chervonec (webengineer.in.ua) on 2011-08-01 at 03:02am EDT

@ghost ghost assigned bitprophet Aug 19, 2011
@bitprophet
Member Author

Jeff Forcier (bitprophet) posted:


Thanks for the submission! The patch looks good. One of the upcoming releases will probably be network related so this fix will fit well there.


on 2011-08-01 at 11:46am EDT

@bitprophet
Member Author

Max Chervonec (webengineer.in.ua) posted:


Glad to help!


on 2011-08-02 at 09:55am EDT

@tow

tow commented May 22, 2012

Any progress on this ticket? It doesn't seem to have made it into the main fabric repo yet.

@webengineer

Toby, in case you are interested in a workaround: try the --keepalive=15 option.

@bitprophet
Member Author

@webengineer is right that keepalive may help with this.

However, yea, I think I did intend for this to go in with the other stuff like timeout & reconnection attempts.

@tow, what would really help is if you could evaluate how those three options -- keepalive, timeout & connection attempts -- intersect with the problem you are/were having, and with your proposed change.

I think keepalive will prevent your exception from ever actually occurring, but not 100% sure. It's also possible that the way we implemented reconnections would similarly obviate the part of the code you modified.

But that's just conjecture (I don't have time to repro on my end and test right now) and I'd still be happy to merge your patch if it looks like it solves a problem these other options don't cover.

Thanks & sorry again :)
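To make the distinction between the options concrete: connection attempts only govern how often the initial connection is retried, so they cannot help once an established session goes stale. Below is a minimal stdlib-only sketch of that retry behaviour. It is not Fabric's implementation; `connect_with_retries`, its `connect` callable, and the `delay` parameter are hypothetical names used for illustration.

```python
import time


def connect_with_retries(connect, attempts=3, delay=0.0):
    """Call connect() up to `attempts` times and return its first successful result.

    Only connection establishment is retried here. A connection that dies
    later, after it was returned, is outside the scope of this helper.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:  # socket timeouts and refused connections are OSError subclasses
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    raise last_error
```

A keepalive, by contrast, acts on a connection that already exists, which is why the two options address different failure modes.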

@webengineer

@bitprophet, I am 95% sure that --keepalive=15 worked for us for a while (before we switched to Celery-governed tasks).

@ghost

ghost commented Oct 23, 2012

I'm also affected by the issue. Strangely enough, it doesn't seem to get resolved with --keepalive=15. The command in particular is a sudo('pip install <something>'), which often takes ages or outright times out due to PyPI.

Do you guys know if there's another workaround to this? I'm not sure if connection attempts is a good direction, since the initial SSH connection is established successfully - the problem is that it times out after a while.

And I'm kind of suspicious of timeout, since most of the pip install commands take a lot longer than the default 10s to execute, but I will give it a try with something ridiculous like 600.

@bitprophet
Member Author

@kkolev - your issue falls under remote program timeout; the other stuff here is all initial connection timeout. Please see #249 -- there is a pull request there I will be merging pretty soon.

@jberryman

I'm hitting this issue on fabric 1.4.3, but setting env.keepalive = 15 hasn't fixed it. This is my pattern of usage:

from fabric.api import env, run, settings

env.keepalive = 15
env.user = 'ubuntu'
#... lots of crap as ubuntu
with settings(user='occasional'):
    run('something')
#... lots more crap as 'ubuntu'
with settings(user='occasional'):
    run('something')

where when we hit that second block as user occasional, I get:

  File "xxx/local/lib/python2.7/site-packages/ssh/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "xxx/local/lib/python2.7/site-packages/ssh/transport.py", line 729, in open_channel
    raise SSHException('SSH session not active')

Not really following who's who in this thread, but I don't think timeout or connection attempts are relevant to the referenced error.

EDIT: I'm having trouble producing a small example that reproduces the problem, but can try harder if it would help.

@jberryman

Ah, okay I figured out how to reproduce it; a reboot() is what causes the error even with keepalive set. Oddly the default env.user connection is fine after the reboot, but not the other.

Here is the task, run with fab -H xxx reproduce:

from fabric.api import env, reboot, run, settings, task

env.connection_attempts = 30  # enough time for reboot

@task
def reproduce():
    run('echo "on normal user"')
    with settings(user='occasional'):
        run('echo "on occasional user"')
    reboot()
    run('echo back up')
    with settings(user='occasional'):
        run('echo "fail"')

And here is the output with error:

$ fab -H xxx reproduce
[xxx] Executing task 'reproduce'
[xxx] run: echo "on normal user"
[xxx] out: on normal user

[xxx] run: echo "on occasional user"
[xxx] out: on occasional user

[xxx] out:
[xxx] out: Broadcast message from ubuntu@xxx
[xxx] out:  (/dev/pts/0) at 15:18 ...

[xxx] out: The system is going down for reboot NOW!
[xxx] run: echo back up
[xxx] out: back up

[xxx] run: echo "fail"
Traceback (most recent call last):
  File "foo/local/lib/python2.7/site-packages/fabric/main.py", line 717, in main
    *args, **kwargs
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 299, in execute
    multiprocessing
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 198, in _execute
    return task.run(*args, **kwargs)
  File "foo/local/lib/python2.7/site-packages/fabric/tasks.py", line 112, in run
    return self.wrapped(*args, **kwargs)
  File "foo/fabfile.py", line 1438, in reproduce
    run('echo "fail"')
  File "foo/local/lib/python2.7/site-packages/fabric/network.py", line 463, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "foo/local/lib/python2.7/site-packages/fabric/operations.py", line 909, in run
    return _run_command(command, shell, pty, combine_stderr)
  File "foo/local/lib/python2.7/site-packages/fabric/operations.py", line 819, in _run_command
    stdout, stderr, status = _execute(default_channel(), wrapped_command, pty,
  File "foo/local/lib/python2.7/site-packages/fabric/state.py", line 340, in default_channel
    chan = connections[env.host_string].get_transport().open_session()
  File "foo/local/lib/python2.7/site-packages/ssh/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "foo/local/lib/python2.7/site-packages/ssh/transport.py", line 729, in open_channel
    raise SSHException('SSH session not active')
ssh.SSHException: SSH session not active
Disconnecting from xxx... done.
Disconnecting from occasional@xxx... done.

Let me know if I should file a separate bug report.

EDIT: the workaround I'm using is to call disconnect_all() after reboot()

@bitprophet
Member Author

Hey @jberryman, I think the issue with that specific use case is that reboot calls a reconnect function, but only for the current value of env.host_string, so your 2nd/alternate connection w/ the different user is unaffected (and then becomes invalid post-reboot, causing your error, AFAICT).

disconnect_all, as the name implies, forcibly disconnects all open connections, including your 2nd connection, which is why it works OK.

I reckon we could update reboot to scan the connection cache for connections sharing the same hostname as env.host_string (e.g. by using network.to_dict or network.normalize) and reconnect all of them. (I hate that module so much. Fab 2 can't come fast enough :)) If you want to take a stab at this & can confirm it works for your setup, I'd merge that pull request. Thanks!
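The cache-scanning idea above could look roughly like the following. This is a sketch under stated assumptions, not Fabric code: it treats the cache as a plain dict keyed by 'user@host:port' strings, and `hostname_of` and `drop_connections_for_host` are hypothetical helpers standing in for whatever network.to_dict or network.normalize would provide. It does not handle IPv6 literals.

```python
def hostname_of(host_string):
    """Extract the hostname from a 'user@host:port' string (user and port optional)."""
    host = host_string.rsplit("@", 1)[-1]  # drop an optional 'user@' prefix
    return host.split(":", 1)[0]           # drop an optional ':port' suffix


def drop_connections_for_host(cache, host_string):
    """Close and remove every cached connection sharing host_string's hostname.

    Returns the sorted list of cache keys that were dropped, so the caller
    can reconnect each of them afterwards.
    """
    target = hostname_of(host_string)
    stale = sorted(key for key in cache if hostname_of(key) == target)
    for key in stale:
        cache.pop(key).close()
    return stale
```

Applied to the reproduction task, both the ubuntu and the occasional connection to the rebooted host would be dropped, while connections to other hosts stay open.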


I'm also going to poke at the original patch in the description so I can finally close this poor ticket.

@bitprophet
Member Author

@jberryman So, now that I have merged the original patch, it will theoretically also fix your problem, no reboot tweaking needed. Please let me know if you try it out and it fails to work; will be released in Fab 1.6 which will be out very soon.
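For readers wondering why reconnecting on demand would also cover the reboot case: if the cache checks whether a connection is still alive each time it is looked up, a stale second connection gets replaced no matter what killed it. The sketch below shows that general idea only. It is not the merged patch; `ReconnectingCache` and the `is_active()` method on the connection object are assumptions made for illustration.

```python
class ReconnectingCache:
    """Hypothetical connection cache that replaces dead connections on lookup."""

    def __init__(self, connect):
        self._connect = connect   # callable: host_string -> connection object
        self._connections = {}

    def get(self, host_string):
        """Return a live connection for host_string, reconnecting if needed."""
        conn = self._connections.get(host_string)
        if conn is None or not conn.is_active():
            # Missing or stale: open a fresh connection and cache it.
            conn = self._connect(host_string)
            self._connections[host_string] = conn
        return conn
```

With this shape, a lookup after reboot() finds the old connection inactive and opens a new one instead of raising on the dead session.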

@jberryman

Great! Won't be able to upgrade right away, but look forward to checking out the updates.


@MajorTal

Sadly, I'm still getting this error.

@bitprophet
Member Author

@MajorTal A) it's usually apropos to give details when you report such things ;) B) this hasn't actually been released yet, Fab 1.6.0 is still a few days away. Are you running the latest master branch? (If not, can you install from it & see if it fixes your specific issue?) Thanks!

@jberryman

@bitprophet sorry, to clarify: have you tested against the task I posted above that reproduces the issue, or are you asking me to do that?
