Make reconnection more robust (was re: reboot() specifically) #12
Comments
**** (meinwald) posted: Is there any reason not to make reconnection more robust in general? I suggest this for a few reasons:
on 2010-07-31 at 09:28pm EDT
Jeff Forcier (bitprophet) posted:
on 2010-08-01 at 08:27am EDT
**** (meinwald) posted: I think I was a little unclear, as I was trying to talk about two things at once. The middle bullet has nothing to do with reboot itself: if something else causes a random disconnect, I was wondering whether reconnection is robust enough, and if not, whether solving that would make this reboot issue less relevant. on 2010-08-01 at 11:51am EDT
Jeff Forcier (bitprophet) posted: OK, gotcha. Yea, I hadn't considered the "last item in a task" scenario, and there's also the generic reconnection angle; so you're right that this indicates the function should be further refactored or possibly done away with. The generic reconnection issue is hopefully something that can be baked into the connection cache; it depends on exactly when/where an "oh god, channel closed, what do I do" exception would get raised. on 2010-08-01 at 10:11pm EDT
Moving reconnection out of reboot and into the general connection logic raises another question: what the default values for N/M should be, and/or in which situations they apply (initial connect vs reconnect) -- especially considering the sibling feature for skipping bad hosts. Brain dump:
I've been wondering how to change my scripts that build on fabric to make reconnects after reboot more reliable without just putting in a stupidly long sleep (on EC2, and I guess other virtual environments, the reboot time can vary a lot as it also depends on a number of factors outside our control). My only comment on the above is that if you shorten SSH's timeout then aren't you likely to have more connection failures over poor networks? For example, if SSH times out after 1s and then fabric waits another 5s before trying again, you can get into a situation where that can't connect, but an SSH timeout of 5s would connect. Maybe I misunderstood your proposal though.
N.B. the timeout for "can't connect" appears to be 10 seconds, though I could not find anything documenting where that value comes from. There is of course no timeout for immediate-fail situations like DNS lookup failures or remote-end-is-working host-key mismatches (I expect a situation where the remote end has no SSH server would time out like any other "can't connect" situation.) This brings up another angle: the reconnect attempts should only fire for:
DNS lookups, host mismatches, etc. aren't likely to be recoverable, so retrying feels kind of pointless/time-wasting.
Probably better to err on the side of trying the reconnect than to abort, though, isn't it? Hopefully the errors are specific enough that early aborts can be correctly identified across all of the platforms this runs on.
Hey @KayEss -- sorry, Github didn't pick up your comments while I was writing my second one :) If it wasn't clear, N and M will be fully configurable so that if people end up in unforeseen situations, they can hopefully work around the defaults. You're right though, that I hadn't factored in slow but functional situations, focusing solely on a binary "it's working or it's down" dichotomy. That does make the "force ssh timeout to be 0 and use the Fabric level N value as the 'real' timeout" angle a bit less appealing. Thanks for the input! Combined with my own counterpoint above that DNS/etc issues are typically unrecoverable anyway, it sounds like the best approach is to just use N as the SSH-level connection timeout.
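A rough sketch of that approach -- passing N straight through as the SSH layer's connect timeout and looping up to M attempts. The function name, defaults, and direct use of paramiko are illustrative only, not Fabric's actual internals:

```python
# Illustrative only: N becomes the SSH client's TCP connect timeout, and
# only timeouts are treated as retryable (matching the "DNS/etc errors are
# unrecoverable anyway" point above).
import socket
import paramiko

def connect_with_retries(host, user, port=22, timeout=10, attempts=3):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    for attempt in range(1, attempts + 1):
        try:
            client.connect(host, port=port, username=user, timeout=timeout)
            return client
        except socket.timeout:
            if attempt == attempts:
                raise  # out of attempts; surface the timeout
    return client
```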
One issue that came up during implementation is the scenario where we experience a "retry-friendly" error (right now, just a timeout or the catch-all SSH exception type). For example, when testing the retry behavior locally, I disabled sshd, which results in an instantaneous "connection refused" socket error rather than a timeout. I'm honestly not sure how many such error types we need to worry about.
For now, going with the former -- we can always expand it to account for other error types later.
This works reasonably well now, and should be backwards compatible with previous behavior by default.
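For anyone wanting to try it, usage would look roughly like this -- assuming the merged knobs surface as env.timeout (N, seconds per attempt) and env.connection_attempts (M); check the usage docs for the real names and defaults:

```python
# Hedged example: the setting names here are assumptions based on this
# thread, not gospel.
from fabric.api import env, run

env.timeout = 10             # N: per-attempt connection timeout, in seconds
env.connection_attempts = 3  # M: how many attempts before giving up

def uptime():
    run("uptime")
```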
My only idea for the timeout would be that instead of trying to work out what the problem was and then deciding whether to wait or not, you could time how long the connect attempt took and, if it wasn't N seconds yet, sleep however long is needed to make it up to N seconds. I'll try this out once I get a chance.
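A minimal sketch of that idea, where connect_once is just a stand-in for whatever performs a single connection attempt:

```python
# Sketch: time a single attempt and, if it failed faster than the N-second
# budget, sleep the remainder so instant failures don't burn retries.
import time

def attempt_with_floor(connect_once, timeout):
    start = time.time()
    try:
        return connect_once()
    except Exception:
        elapsed = time.time() - start
        if elapsed < timeout:
            time.sleep(timeout - elapsed)
        raise
```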
That might work, but I'm not sure if it's worth the effort or extra code complexity -- with a configurable timeout users who routinely run into very slow but functional systems can be expected to adjust that setting to "long enough". Unless I am again missing some use case not covered by the now-merged implementation :)
The scenario I was thinking of (and it may not be a problem for the code as you have it), is something like when a machine is rebooting and the IP stack has started, but sshd hasn't. At this point the machine will be actively refusing connections to port 22. So without some other timeout, ssh will just return straight away saying it can't connect which means it's possible to burn through your M retries in less than a second (especially on a fast connection). This might not give sshd enough time to start up before the connection gives up. As I say, might not be a problem, but it's what I was thinking of.
The timeout I'm speaking of is simply the connect-timeout option for the SSH library (I forget if the ssh CLI tool has a similar option). In my testing it seems to work just fine (meaning it will sit there and wait, presumably retrying, until the configured time has elapsed) even when there is no sshd running, period. So I assume that would be identical to your situation of "machine would respond to ping but not to TCP port 22". tl;dr nothing in the current implementation is actually doing any out-of-band "is it up" detection, so I am not super worried about that use case :)
You have it all under control then. Great! :)
Still need to add a changelog entry and possibly some notes in the execution usage docs. And #536 reminded me I never did test out the "reboot or reconnect midway" use case, only initial connections. At best, we can deprecate reboot.
@bitprophet, is there a specific option I need to use to get the following to work?
I'm probably missing something from the discussion.
No, but that is telling me that we do need to either keep reboot around or add some error handling to the connection cache. What's probably happening is that as the real SSH connection drops or dies (because the server rebooted), our cached connection object is still sitting around in the connection cache.
Put another way, all I tested initially was that initial connections could retry, but that all takes place before the cache comes into play. This "we're already underway" reconnecting is the use case I had not tested and wanted you to test for me. Surprise, it is broken! I think that with the above change I mentioned in place, it would work better: what I'd expect to happen is:
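A hedged sketch of what that cache-level error handling could look like -- assuming the cache maps host strings to paramiko-style client objects (Fabric's real cache lives in fabric.state.connections and may differ in detail):

```python
# Illustrative only: evict dead cached connections so the normal
# retry/connect path runs again after a reboot.
def get_live_connection(cache, host, reconnect):
    client = cache.get(host)
    if client is not None:
        transport = client.get_transport()
        if transport is None or not transport.is_active():
            # The server went away (e.g. it rebooted); drop the stale entry.
            del cache[host]
            client = None
    if client is None:
        client = reconnect(host)  # runs the usual timeout/attempts logic
        cache[host] = client
    return client
```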
@npinto If you're not comfortable trying to add that
@bitprophet, should I try that change myself? If so, how should I deal with the "Connection refused"-like errors that I will get as the ssh daemon is shutting down? Thanks again for your prompt answers.
Looking at this myself now... didn't notice earlier but your while loop was kinda-sorta needed, insofar as my fear of executing a command or two before the reboot kicks in is correct: you got off at least one 'date' call pre-reboot, and on my local VM I am also able to execute a command right away (though it sometimes errors out with -1, which I assume is when the connection works but the shell is already in "lockdown" mode and refuses to run.) So regardless of what else we do, any rebooting is going to need at least a few seconds' sleep to ensure that the flow of one's Fabric-using code works as expected (where the next statement after the reboot executes post-reboot and not pre-reboot.) This isn't technically a problem on Fabric's end (i.e. even somebody just scripting a bunch of shell calls to ssh would hit the same thing). I will work on the stuff mentioned above re: handling the reconnection aspect of things, as well as patching up reboot.
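As a stopgap in user code, a helper along these lines keeps the next statement from running pre-reboot; the 20-second default and the cache-clearing step are assumptions for illustration, not what Fabric's reboot() actually does:

```python
# Sketch of a user-level "reboot and wait" built only on public-ish pieces.
import time
from fabric.api import sudo
from fabric.state import connections

def reboot_and_wait(wait=20):
    sudo("reboot")
    # Give the box a few seconds to actually go down, so the next
    # statement really runs post-reboot rather than pre-reboot.
    time.sleep(wait)
    # Drop the now-dead cached connection; the next operation will
    # reconnect and benefit from the new timeout/attempts behavior.
    connections.clear()
```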
Having 2nd thoughts about only covering timeouts -- when trying to test again I kept forgetting that simply disabling sshd on a box will often result in "low level" errors such as connection-refused socket errors. Wish I'd written down more details about my earlier tests, because I'm definitely seeing a wider field of error types today re: situations that cause inability to connect, which aren't timeouts. I'll see about extending it again to cover more types of errors. This will necessitate implementing the "if it's not a timeout, sleep for the timeout instead" tweak as well.
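Roughly what that broader handling plus the sleep tweak could look like (illustrative only; connect_once stands in for a single connection attempt):

```python
# Sketch: treat low-level socket errors as retryable too, and pace
# instant failures (e.g. connection refused) by sleeping the timeout.
import socket
import time

def retrying_connect(connect_once, timeout, attempts):
    for attempt in range(1, attempts + 1):
        try:
            return connect_once()
        except socket.timeout:
            if attempt == attempts:
                raise
            # The attempt already waited ~timeout seconds; retry right away.
        except socket.error:
            if attempt == attempts:
                raise
            # Instant failure: sleep the timeout so we don't burn through
            # all M attempts in under a second.
            time.sleep(timeout)
```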
Can't escape these rabbit holes / yak shaving ceremonies today. When testing against a VM, which uses a network tunnel to connect the VM's SSH to my localhost (e.g. on port 2203), a downed SSHD doesn't show up as a connection failure, but as an SSH protocol failure (e.g. the one above re: the protocol banner, or another protocol-level error). Sadly this trips up another unfixed "bug" (née issue with a lower level library), #85 -- so IOW we cannot properly reconnect to servers exhibiting this kind of SSH protocol error, until that is fixed. I do not have the bandwidth to fix it right now. Thankfully I was able to test against my localhost, albeit with some tweaking.
I think this is wrapped up now, also changelog'd and documented (which I had totally forgotten, sigh). @npinto @KayEss if you have the time, please do grab latest master and try these new changes out. Docs are in the usage docs => Execution => near the bottom under "Reconnecting and skipping bad hosts". |
I'll probably not get to it today, but I should over the weekend. I'd love to keep being able to use reboot though.
Well, I neglected to mention that if/when reboot is deprecated from Fab core, it would probably fit in well with #461 (Patchwork), in the sense that it is a useful convenience but does not need to be in the core software because it only uses the public API.
Here is the test that I've done:
I assume that both of these will be solved by passing in the right retry parameters on the commands that are failing here, and that the retry count of 1 is to preserve backward compatibility.
So, I've added the new settings; I'm using the fabric commands from a python file.
Glad it's working. The traceback is because you're using it in library mode -- the code to capture NetworkErrors (which honors the skip-bad-hosts behavior) only runs in the fab command-line frontend, not when you call the API directly from Python. So, if you wanted "normal" behavior there, you might want to look at catching NetworkError yourself, or routing the call through the main execution machinery.
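In library mode that might look something like the following; the module path for NetworkError and the setting names are assumptions based on this thread, so double-check them against your Fabric version:

```python
# Hedged sketch: catch NetworkError yourself when calling the API directly,
# since the fab tool's skip-bad-hosts handling doesn't run for you.
from fabric.api import run, settings
from fabric.exceptions import NetworkError

def careful_uptime(host):
    with settings(host_string=host, timeout=10, connection_attempts=3):
        try:
            return run("uptime")
        except NetworkError as e:
            # Decide what "skip this bad host" means for your own script.
            print("Skipping %s: %s" % (host, e))
            return None
```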
Description
Use a more robust reconnection/sleep mechanism than "guess how long a reboot takes and sleep that long". Possibilities:
Originally submitted by Jeff Forcier (bitprophet) on 2009-07-21 at 11:38am EDT