Make reconnection more robust (was re: reboot() specifically) #12

Closed
bitprophet opened this Issue Aug 19, 2011 · 30 comments


@bitprophet
Member

Description

Use a more robust reconnection/sleep mechanism than "guess how long a reboot takes and sleep that long". Possibilities:

  • Try reconnecting after, say, 30 seconds, with a short timeout value, then loop every, say, 10 seconds until we reconnect (sketched in code below)
  • Just give user a prompt, within a loop, so they can manually whack Enter to try reconnecting
  • Stick with the manual sleep timer entry, and just ensure it is explicitly documented, i.e. "we highly recommend figuring out how long your system takes to reboot before using this function"
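
To illustrate the first option above, here is a minimal sketch of that kind of wait-then-poll loop; the helper name, the 30s/10s values, and the give-up budget are purely illustrative, not an actual Fabric API:

import socket
import time

def wait_for_reboot(try_connect, initial_wait=30, poll_every=10, give_up_after=600):
    # Give the reboot time to actually start, then poll with a short timeout.
    time.sleep(initial_wait)
    deadline = time.time() + give_up_after
    while True:
        try:
            return try_connect()  # caller supplies a connect callable with a short timeout
        except (socket.timeout, socket.error):
            if time.time() > deadline:
                raise
            time.sleep(poll_every)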

Originally submitted by Jeff Forcier (bitprophet) on 2009-07-21 at 11:38am EDT

Relations

  • Related to #201: Ambiguous sudo call in reboot function
@bitprophet bitprophet was assigned Aug 19, 2011
@bitprophet
Member

**** (meinwald) posted:


Is there any reason not to make reconnection more robust in general? I suggest this for a few reasons:

  • If reboot is the last command executed, there is no reason to wait at all.
  • It may be helpful to be able to reconnect without failing the current task anyway. (This is how I understand the current behavior; correct me if I am wrong.)
  • If the above behavior is implemented, adding something similar to reboot would be redundant.

on 2010-07-31 at 09:28pm EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


  • The wait call is to wait for the actual reboot process, so unless your system is using a superfast boot setup (OS-on-motherboard or an awesome SSD or something) I don't see how we can avoid it :)
  • Current task should not fail when this is called (the intent is for it to be used in the middle of a larger task, e.g. update kernel => reboot => do more stuff); if it is failing on your system I'd be interested to know what the OS/version is. I've tested it against Ubuntu 8.04 and possibly CentOS 5.x but that's it.
    • I can definitely see it failing if the literal command reboot doesn't exist, and if that is the case it should be easy enough to upgrade reboot() to try and detect whether e.g. shutdown -r now or similar is possible.

on 2010-08-01 at 08:27am EDT
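
A hedged sketch of the fallback detection mentioned in that last sub-bullet, using Fabric 1.x's warn_only setting (the function name is illustrative; this is not Fabric's actual reboot()):

from fabric.api import settings, sudo

def reboot_with_fallback():
    # Try the literal `reboot` first without aborting the task on failure...
    with settings(warn_only=True):
        result = sudo('reboot')
    # ...and fall back to `shutdown -r now` on systems where `reboot` is missing or fails.
    if result.failed:
        sudo('shutdown -r now')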

@bitprophet
Member

**** (meinwald) posted:


I think I was a little unclear as I was trying to talk about two things at once.

While the intent of reboot is to use it in the middle of a task, it is possible that it could be the last line. In that specific, though perhaps unlikely, event there is no reason to wait at all, unless reboot should be changed into a "reboot, then wait and check whether the host is still alive, failing otherwise" function. That might not be a bad idea, since someone could just call sudo('reboot') (or equivalent) manually if that check is not desired.

The middle bullet has nothing to do with reboot itself. If something else causes a random disconnect, I was wondering whether reconnection is robust enough, and if not, whether solving that would make this reboot issue less relevant.


on 2010-08-01 at 11:51am EDT

@bitprophet
Member

Jeff Forcier (bitprophet) posted:


OK, gotcha. Yea, I hadn't considered the "last item in a task" scenario, and there's also the generic reconnection angle; so you're right that this indicates the function should be further refactored or possibly done away with.

The generic reconnection issue is hopefully something that can be baked into the connection cache; it depends on exactly when/where a "oh god, channel closed, what do I do" exception would get raised.


on 2010-08-01 at 10:11pm EDT

@bitprophet
Member

Moving reconnection out of reboot (which then becomes deprecated -- first time that's happened in fab 1.x I think!) and into the connection cache is clearly a good idea. So the question becomes "how?", and the answer is probably a straight-up loop that tries reconnecting every N seconds, timing out after M tries (sketched in code after the brain dump below).

Another question is what the default values for N/M should be, and whether they should differ by situation (initial connect vs. reconnect) -- especially considering the sibling feature skip_bad_hosts, where one probably doesn't want very high N/M values if it's been activated (or you'll wait a very long time before moving on to the next host.)

Brain dump:

  • Initial connection: fast-fail says we want to default to aborting right away, but in some not-uncommon situations like connecting to newly provisioned systems, it's very useful to have at least a short retry period.
    • Whether that's useful enough to merit being the default (vs forcing this not-uncommon use case to toggle a setting on) is hard to say.
      • The backwards compatibility demon says to force a settings toggle so folks who were (for whatever reason) used to the current fast-fail behavior aren't surprised.
      • If we go this route, great time to start a specific "what settings defaults to flip for Fab 2.0" list
  • Reconnection: this feels like a definite no-brainer re: trying to get the connection back, whether due to an explicit reboot, or an unwanted-but-probably-temporary connection drop.
  • Default N value: should be long enough that Fabric isn't eating CPU/network/whatever trying to reconnect, but short/granular enough to not result in unnecessarily long waits.
    • Also consider what the current connection timeout is, as that functions as a de-facto N just by itself (i.e. a "fast as possible" network.connect() loop will still be padded by ssh's default connect timeout.)
    • However, in situations where the ssh timeout does not apply (such as lower level No route to host or DNS lookup failures) we'd end up with a too-fast loop.
    • So possibly override the ssh level timeout to be very short/zero, and enforce our own N for consistency.
      • Counter: lower level connection problems usually indicate an unrecoverable problem, versus waiting-for-boot-or-reboot where it's only a matter of time.
  • Default M value: could be # of tries or overall time elapsed...
    • Offhand, # of tries feels more user-friendly.
    • Having both feels too over-engineered.
    • Assuming # of tries, it depends on what we select for N; I'd say that the total default time period should end up being perhaps one to two minutes. Any longer than that and something really stupid is going on, and as long as we make these configurable, the user can override.
  • skip_bad_hosts: should probably tickle the values of N/M to be very low.
    • Though I could see some use cases wanting to retain control of those even in combo with this behavior.
    • Only way to have cake and eat it too would be to:
      • double-up on the settings: have N/M for normal, and skip_N/skip_M for skip behavior's use, where the latter default to the lower values, but could be updated by the user
      • or to detect whether users have modified the settings and leave them alone if so; which could be tricky / require a 2nd "layer" of settings, which implies a reworking of env.
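
To make the N/M brain dump concrete, here is a minimal sketch of the kind of retry loop being proposed. The function and argument names are hypothetical (not Fabric API); N is the delay between attempts and M is the maximum number of tries, counting attempts rather than elapsed time.

import socket
import time

def connect_with_retries(make_connection, n_seconds=10, m_tries=6):
    # Try every N seconds, giving up after M attempts.
    for attempt in range(1, m_tries + 1):
        try:
            return make_connection()
        except (socket.timeout, socket.error):
            if attempt == m_tries:
                raise
            time.sleep(n_seconds)
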
@KayEss
KayEss commented Jan 18, 2012

I've been wondering how to change my scripts that build on fabric to make reconnects after reboot more reliable without just putting in a stupidly long wait (on EC2, and I guess other virtual environments, the reboot time can vary a lot as it also depends on a number of factors outside our control).

My only comment on the above is that if you shorten SSH's timeout, aren't you likely to have more connection failures over poor networks? For example, if SSH times out after 1s and fabric then waits another 5s before trying again, you can get into a situation where that never connects, whereas an SSH timeout of 5s would connect.

Maybe I misunderstood your proposal though.

@bitprophet
Member

N.B. the timeout for "can't connect" appears to be 10 seconds, though I could not find anything in ssh that explicitly sets it to this. "Real" SSH timeouts (via the ssh CLI app) by comparison are over a minute.

There is of course no timeout for immediate-fail situations like DNS lookup failures or remote-end-is-working host-key mismatches (I expect a situation where the remote end has no SSH server would time out like any other "can't connect" situation.)

This brings up another angle: the reconnect attempts should only fire for:

  • Socket timeouts (socket.timeout) (i.e. system exists but SSH doesn't appear to be reachable/running)
  • Probably low-level socket errors (socket.error) since those are kind of random and may well be recoverable from.

DNS lookups, host mismatches, etc aren't likely to be recoverable so retrying feels kind of pointless/time-wasting.
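
Roughly, that classification might look like this (illustrative helper, not Fabric code; host-key mismatches and similar raise library-level exceptions rather than socket errors, so they never reach a check like this):

import socket

def should_retry(exc):
    # DNS lookup failures are a subclass of socket.error but aren't
    # recoverable, so check for them first and give up immediately.
    if isinstance(exc, socket.gaierror):
        return False
    # socket.timeout: host exists but SSH isn't reachable/running yet.
    # socket.error: low-level hiccups that may well be temporary.
    return isinstance(exc, (socket.timeout, socket.error))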

@KayEss
KayEss commented Jan 18, 2012

Probably better to err on the side of trying the reconnect than to abort, though, isn't it? Hopefully the errors are specific enough that early aborts can be correctly identified across all of the platforms this runs on.

@bitprophet
Member

Hey @KayEss -- sorry, Github didn't pick up your comments while I was writing my second one :)

If it wasn't clear, N and M will be fully configurable so that if people end up in unforeseen situations, they can hopefully work around the defaults.

You're right though, that I hadn't factored in slow but functional situations, focusing solely on a binary "it's working or it's down" dichotomy. That does make the "force ssh timeout to be 0 and use the Fabric level N value as the 'real' timeout" angle a bit less appealing. Thanks for the input!

Combined with my own counterpoint above that DNS/etc issues are typically unrecoverable anyway, it sounds like the best approach is to just use N as the ssh socket timeout (client.connect(..., timeout=N)), and to loop M times. (This is also just plain simpler to implement.)
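
A minimal sketch of that approach, using a paramiko-style SSHClient as a stand-in for the ssh library Fabric wraps (function name and defaults are illustrative):

import socket

import paramiko  # stand-in for the `ssh` fork Fabric 1.x uses

def connect(host, user, n_timeout=10, m_attempts=6):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    for attempt in range(1, m_attempts + 1):
        try:
            # N doubles as the per-attempt socket timeout, so no extra sleep is needed.
            client.connect(host, username=user, timeout=n_timeout)
            return client
        except socket.timeout:
            if attempt == m_attempts:
                raise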

@bitprophet
Member

One issue that came up during implementation is the scenario where we experience a "retry-friendly" error (right now, just timeout or the catchall socket.error) which doesn't have its own timeout (so, mostly socket.error.)

For example, when testing the retry behavior locally, I disabled sshd, which results in an instantaneous socket error of Invalid argument. So having any number of retries here still ends up executing near instantly.

I'm honestly not sure how many socket.error situations would be recoverable, or how many would have timeout behavior without being socket.timeout. So our options seem to be:

  • Only retry for timeouts and nothing else
  • Manually sleep for env.timeout seconds in the socket.error case so it has the intended "try, wait a bit, try again" behavior

For now, going with the former -- we can always expand it to account for socket.error if anybody finds a good use case for it.

@bitprophet
Member

This works reasonably well now, and should be backwards compatible with previous behavior by default.

@bitprophet bitprophet closed this Jan 19, 2012
@KayEss
KayEss commented Jan 20, 2012

My only idea for the timeout would be that instead of trying to work out what the problem was and then deciding whether to wait, you could time how long the connect attempt took, and if it hasn't been N seconds yet, sleep however long is needed to make it up to N seconds.

I'll try this out though once I get a chance.
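
In code, that idea might look roughly like this hypothetical helper (not what was merged):

import time

def attempt_with_floor(try_connect, n_seconds):
    # Make one attempt; if it fails faster than N seconds, sleep the remainder
    # so every failed attempt always "costs" at least N seconds.
    start = time.time()
    try:
        return try_connect()
    except Exception:
        elapsed = time.time() - start
        if elapsed < n_seconds:
            time.sleep(n_seconds - elapsed)
        raise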

@bitprophet
Member

That might work, but I'm not sure if it's worth the effort or extra code complexity -- with a configurable timeout, users who routinely run into very slow but functional systems can be expected to adjust that setting to "long enough".

Unless I am again missing some use case not covered by the now-merged implementation :)

@KayEss
KayEss commented Jan 24, 2012

The scenario I was thinking of (and it may not be a problem for the code as you have it) is something like when a machine is rebooting and the IP stack has started, but sshd hasn't. At this point the machine will be actively refusing connections to port 22. So without some other timeout, ssh will just return straight away saying it can't connect, which means it's possible to burn through your M retries in less than a second (especially on a fast connection). This might not give sshd enough time to start up before the connection attempt gives up.

As I say, might not be a problem, but it's what I was thinking of.

@bitprophet
Member

The timeout I'm speaking of is simply the option for the SSH library (I forget if the ssh CLI tool has a similar option). In my testing it seems to work just fine (meaning it will sit there and wait, presumably retrying, until the configured time has elapsed) even when there is no sshd running, period. So I assume that would be identical to your situation of "machine would respond to ping but not to TCP port 22".

tl;dr nothing in the current implementation is actually doing any out-of-band "is it up" detection, so I am not super worried about that use case :)

@KayEss
KayEss commented Jan 24, 2012

You have it all under control then. Great! :)

@bitprophet bitprophet reopened this Jan 26, 2012
@bitprophet
Member

Still need to add a changelog entry and possibly some notes in the execution usage docs.

And #536 reminded me I never did test out the "reboot or reconnect midway" use case, only initial connections. At best, we can deprecate reboot, and at worst, we can at least rejigger it to work with this ticket's features instead of a naive sleep.

@npinto
npinto commented Jan 27, 2012

@bitprophet, is there a specific option I need to use to get the following to work?

% fab --version
Fabric 1.4a
ssh (library) 1.8.0a0
% cat fabfile.py
from fabric.operations import run
import time


def test():
    run('uname -a')
    run('reboot')

    while True:
        time.sleep(1)
        run('date')
% fab -n 100 -u root -H squid2 test
[squid2] Executing task 'test'
[squid2] run: uname -a
[squid2] out: Linux squid2 3.1.6-gentoo #1 SMP Fri Jan 13 13:22:01 EST 2012 x86_64 Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz GenuineIntel GNU/Linux

[squid2] run: reboot
[squid2] out: 
[squid2] out: Broadcast message from root@squid2 (pts/0) (Fri Jan 27 15:01:38 2012):

[squid2] out: The system is going down for reboot NOW!

[squid2] run: date
[squid2] out: Fri Jan 27 15:01:39 EST 2012

[squid2] run: date
[squid2] out: Fri Jan 27 15:01:40 EST 2012

[squid2] run: date
Traceback (most recent call last):
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/main.py", line 710, in main
    *args, **kwargs
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/tasks.py", line 243, in execute
    _execute(task, host, my_env, args, new_kwargs)
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/tasks.py", line 180, in _execute
    task.run(*args, **kwargs)
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/tasks.py", line 106, in run
    return self.wrapped(*args, **kwargs)
  File "/home/npinto/tmp/fabric_test/fabfile.py", line 11, in test
    run('date')
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/network.py", line 365, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/operations.py", line 904, in run
    return _run_command(command, shell, pty, combine_stderr)
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/operations.py", line 814, in _run_command
    stdout, stderr, status = _execute(default_channel(), wrapped_command, pty,
  File "/home/npinto/.local/lib64/python2.7/site-packages/Fabric-1.4a._2256b2b_-py2.7.egg/fabric/state.py", line 324, in default_channel
    chan = connections[env.host_string].get_transport().open_session()
  File "/home/npinto/.local/lib64/python2.7/site-packages/ssh-dev-py2.7.egg/ssh/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "/home/npinto/.local/lib64/python2.7/site-packages/ssh-dev-py2.7.egg/ssh/transport.py", line 729, in open_channel
    raise SSHException('SSH session not active')
ssh.SSHException: SSH session not active
Disconnecting from squid2... done.

I'm probably missing something from the discussion.

N

@bitprophet
Member

No, but that is telling me that we do need to either keep reboot around or add some error handling to the connection cache.

What's probably happening is that when the real SSH connection drops or dies (because the server rebooted), our cached connection object is still sitting around in the connection cache.

reboot explicitly clears out the cached object when it runs, but nothing in the global setup does that right now. Now that reconnecting is a global thing, we should probably add code to the connection cache that looks for this error and reactively clears the cached connection object.


Put another way, all I tested initially was that initial connections could retry, but that all takes place before the cache comes into play. This "we're already underway" reconnecting is the use case I had not tested and wanted you to test for me. Surprise, it is broken!

I think that with the above change in place, it would work better. What I'd expect to happen is:

  • Server is up
  • Initial connection works fine
  • Commands are run, which all use the cached connection
  • Reboot command is executed
  • Next command is run (in this case, your run('date') -- and protip, the idea here is that you should not need the while loop! Fabric should take care of the looping/waiting already.)
  • It will error out in the way you just saw, but the cache should try/except that error, and use it as a sign to nuke its (now stale) connection object. (Instead of barfing like it did for you.)
  • It should then try re-connecting
  • Which should enter back into the "initial" connection code, which includes my changes to try a number of times with a timeout
  • Resulting in a bit of a wait until the server comes back online
  • At which point run('date') should execute and then we go from there.

@npinto If you're not comfortable trying to add that try/except I mentioned, just let me know and I will look at this myself soon. No pressure :)
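
For the record, a rough sketch of that try/except, modeled loosely on the connection-cache lookup in fabric/state.py -- illustrative only, not the actual patch:

import ssh  # the paramiko fork Fabric 1.x uses
from fabric.state import connections, env

def default_channel():
    try:
        chan = connections[env.host_string].get_transport().open_session()
    except ssh.SSHException:
        # e.g. "SSH session not active" after a reboot: evict the stale object...
        del connections[env.host_string]
        # ...and look it up again; the cache reconnects on demand, which now
        # goes through the retrying "initial connection" code path.
        chan = connections[env.host_string].get_transport().open_session()
    return chan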

@npinto
npinto commented Jan 28, 2012

@bitprophet, should I try/except around the following line?
https://github.com/fabric/fabric/blob/master/fabric/state.py#L324

If so, how should I deal with "Connection refused"-like errors that I will get as the ssh daemon is shutting down?

Thanks again for your prompt answers.

@bitprophet
Member

Looking at this myself now... I didn't notice earlier, but your while loop was kinda-sorta needed, insofar as my fear of executing a command or two before the reboot kicks in is correct: you got off at least one 'date' call pre-reboot, and on my local VM I am also able to execute a command right away (though it sometimes errors out with -1, which I assume happens when the connection works but the shell is already in "lockdown" mode and refuses to run.)

So regardless of what else we do, any rebooting is going to need at least a few seconds' sleep to ensure that the flow of one's Fabric-using code works as expected (where the next statement after the reboot executes post-reboot and not pre-reboot.)

This isn't technically a problem on Fabric's end (i.e. even somebody just scripting a bunch of shell calls to ssh would have to deal with this) but will probably need documenting, and at any rate fabric.operations.reboot should stick around in a modified form, even if deprecated.


I will work on the stuff mentioned above re: handling the reconnection aspect of things, as well as patching up reboot. (It should end up not requiring the wait call -- probably just no-op'ing it entirely -- but still accepting it for backwards compatibility.)
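
A hedged sketch of what such a reworked reboot() might look like under those constraints (not the eventual patch; the few seconds of sleep is an arbitrary illustrative value):

import time
from fabric.api import sudo
from fabric.state import connections, env

def reboot(wait=None):
    # `wait` is accepted but ignored, purely for backwards compatibility.
    sudo('reboot')
    # A few seconds' sleep so the *next* statement really runs post-reboot
    # rather than sneaking in before the shutdown kicks in.
    time.sleep(5)
    # Drop the now-dead cached connection; the next operation reconnects via
    # the retrying connection code (env.timeout / env.connection_attempts).
    if env.host_string in connections:
        del connections[env.host_string]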

@bitprophet
Member

Having 2nd thoughts about only covering timeouts -- when trying to test again I kept forgetting that simply disabling sshd on a box will often result in "low level" errors such as Connection refused, or at times ssh-lib-level Error reading SSH protocol banner (such as when you're talking to a tunnel whose remote end is down.)

Wish I'd written down more details about my earlier tests because I'm definitely seeing a wider field of error types today re: situations that cause inability to connect, which aren't socket.timeout.

I'll see about extending it again to cover more types of errors. This will necessitate implementing the "if it's not a timeout, sleep for the timeout instead" tweak as well.

@bitprophet
Member

Can't escape these rabbit holes / yak shaving ceremonies today.

When testing against a VM, which uses a network tunnel to connect the VM's SSH to my localhost (e.g. on port 2203), a downed SSHD doesn't show up as a connection failure, but as an SSH protocol failure (e.g. the one above re: protocol banner, or another about SSH session not active.)

Sadly this trips up another unfixed "bug" (really an issue with a lower level library), #85 -- so IOW we cannot properly reconnect to servers exhibiting this kind of SSH protocol error until that is fixed. I do not have the bandwidth to fix it right now.


Thankfully I was able to test against my localhost, albeit with reboot tweaked to not actually reboot, replacing the real command with a simple echo+sleep. I then manually turned off the local sshd when I saw it fire, re-enabling it afterwards, and used debug=True to see when it was trying to reconnect.

@bitprophet bitprophet added a commit that referenced this issue Feb 2, 2012
@bitprophet bitprophet reboot() overhaul re #12 407c1e4
@bitprophet bitprophet added a commit that referenced this issue Feb 2, 2012
@bitprophet bitprophet Changelog and docs re #12 a594444
@bitprophet
Member

I think this is wrapped up now, also changelog'd and documented (which I had totally forgotten, sigh).

@npinto @KayEss if you have the time, please do grab latest master and try these new changes out. Docs are in the usage docs => Execution => near the bottom under "Reconnecting and skipping bad hosts".

@bitprophet bitprophet closed this Feb 2, 2012
@KayEss
KayEss commented Feb 2, 2012

I'll probably not get to it today, but I should over the weekend.

I'd love to keep being able to use reboot() so that I don't have to think about getting all of that logic right every time a reboot is needed, so I'd be -1 on even deprecating it.

@bitprophet
Member

Well, I neglected to mention that if/when deprecated from Fab core, it would probably fit in well in #461 (Patchwork) in the sense that it is a useful convenience, but does not need to be in the core software because it only uses the public API.

@KayEss
KayEss commented Feb 8, 2012

Here is the test that I've done:

  1. Using Fabric 1.3.4:
    • I've booted a new EC2 instance with a delay of just 2s before trying to connect to it. As expected this wasn't long enough for it to complete booting so the connect failed.
    • I then ran another script which rebooted the server with a reboot delay of 5s. As expected this didn't work either.
  2. After pip installing the version of Fabric in master (pip freeze shows I am now running fabric 1.4a.-c5284f1-):
    • Booting the new instance still doesn't connect. Error trace is below.
    • After connecting I tried the script that does the reboot and again, it fails to connect. This is the second traceback below.

I assume that both of these will be solved by passing in the right retry parameters on the commands that are failing here, and that the retry count of 1 is to preserve backward compatibility. Looking at run and sudo, I didn't see any retry parameters; presumably this means they're in env somewhere?

[ec2-79-125-71-213.eu-west-1.compute.amazonaws.com] run: cat ~/.ssh/authorized_keys
Traceback (most recent call last):
File "./devenv/../bin/pf-server-start", line 17, in <module>
    main(*sys.argv[1:])
File "./devenv/../bin/pf-server-start", line 12, in main
    Server.start(client, *process_arguments(*args))
File "/home/kirit/Projects/profab/profab/server.py", line 117, in start
    server.dist_upgrade()
File "/home/kirit/Projects/profab/profab/server.py", line 25, in wrapper
    function(server, *args, **kwargs)
File "/home/kirit/Projects/profab/profab/server.py", line 203, in dist_upgrade
    authorized_keys = run('cat %s' % key_file)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 457, in host_prompting_wrapper
    return func(*args, **kwargs)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/operations.py", line 904, in run
    return _run_command(command, shell, pty, combine_stderr)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/operations.py", line 814, in _run_command
    stdout, stderr, status = _execute(default_channel(), wrapped_command, pty,
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/state.py", line 340, in default_channel
    chan = connections[env.host_string].get_transport().open_session()
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 84, in __getitem__
    self.connect(key)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 76, in connect
    self[key] = connect(user, host, port)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 393, in connect
    raise NetworkError(msg, e)
fabric.exceptions.NetworkError: Low level socket error connecting to host ec2-79-125-71-213.eu-west-1.compute.amazonaws.com: Connection refused (tried 1 time)




INFO:ssh.transport:Secsh channel 13 opened.
Disconnecting from ubuntu@ec2-79-125-71-213.eu-west-1.compute.amazonaws.com... done.
Traceback (most recent call last):
File "./devenv/../bin/pf-server-upgrade", line 17, in <module>
    main(*sys.argv[1:])
File "./devenv/../bin/pf-server-upgrade", line 12, in main
    server.dist_upgrade()
File "/home/kirit/Projects/profab/profab/server.py", line 25, in wrapper
    function(server, *args, **kwargs)
File "/home/kirit/Projects/profab/profab/server.py", line 210, in dist_upgrade
    self.reboot()
File "/home/kirit/Projects/profab/profab/server.py", line 25, in wrapper
    function(server, *args, **kwargs)
File "/home/kirit/Projects/profab/profab/server.py", line 166, in reboot
    reboot(5)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 457, in host_prompting_wrapper
    return func(*args, **kwargs)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/operations.py", line 1059, in reboot
    connections.connect(env.host_string)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 76, in connect
    self[key] = connect(user, host, port)
File "/home/kirit/tmp/virtualenvironments/profabdev/local/lib/python2.7/site-packages/fabric/network.py", line 393, in connect
    raise NetworkError(msg, e)
fabric.exceptions.NetworkError: Low level socket error connecting to host ec2-79-125-71-213.eu-west-1.compute.amazonaws.com: Connection refused (tried 1 time)
@bitprophet
Member

  • Yes, backwards compatibility is what makes it give up after 1 try.
  • To change that setting, check out the dev docs at http://docs.fabfile.org/en/latest/ -- specifically the changelog or the env usage docs. You want env.timeout and/or env.connection_attempts IIRC.
  • I'm a little worried you're still getting an actual traceback -- if you're running via fab or running library code via execute() it should be capturing NetworkError and doing a nice looking abort instead. How exactly are you invoking Fabric here?
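
For reference, setting those knobs from a fabfile might look like this (values are illustrative; together they bound the total retry window per host):

from fabric.api import env

env.timeout = 10              # seconds per connection attempt
env.connection_attempts = 12  # so up to roughly two minutes of retrying per host
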
@KayEss
KayEss commented Feb 8, 2012

So, I've added connection_timeout to env where I set the host name and removed all trace of the boot wait sleep. I now get a working connection to the newly booted instance. I've also removed the 5s I was passing to reboot, and it does indeed wait less than the 120s default before resuming the next stage of the script. So as far as I can see it all works perfectly well.

I'm using the fabric commands from a python file. The relevant code is in Server.start and Server.dist_upgrade in this file: https://github.com/Proteus-tech/profab/blob/master/profab/server.py

@bitprophet
Member

Glad it's working.

The traceback is because you're using it in library mode -- the code to capture NetworkErrors (which honors env.use_exceptions_for['network'] to decide whether to raise or nicely abort) is only defined in execute().

So, if you wanted "normal" behavior there, you might want to look at using execute to call chunks of run/sudo-using code (if it fits your use case; it may not). Otherwise, just keep in mind that Fabric now occasionally raises real exceptions instead of SystemExit :)
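
In library mode, handling that might look roughly like this sketch (host name and task function are illustrative):

from fabric.api import env, run
from fabric.exceptions import NetworkError
from fabric.tasks import execute

def check_uptime():
    run('uptime')

# Option 1: route calls through execute(), which captures NetworkError and
# honors env.use_exceptions_for['network'] for you.
execute(check_uptime, hosts=['example-host'])

# Option 2: stay in plain library mode and catch the exception yourself.
env.host_string = 'example-host'
try:
    check_uptime()
except NetworkError as e:
    print('Connection problem: %s' % e)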
