-
Notifications
You must be signed in to change notification settings - Fork 799
fix(builder): ssh timeout to prevent intermediary firewalls/loadbalancers from dropping connection #1258
Conversation
Thanks for the contribution! Please ensure your commits follow our style guide. This code will be tested once a Deis maintainer reviews it. |
@developerinlondon Can you amend this so it just contains commit 77fab34? The other stuff should warrant a separate discussion (and I think you already have a PR for that). |
@carmstrong didn't find a clear way to take out specific commits. So just overwrote them with your master copy. |
@developerinlondon you can create a new branch off master and just cherry pick the commit in with |
no problem @bacongobbler . I think I got it right this time :-) |
just tested on our AWS Loadbalancers and can confirm they are no longer dropping connections. Would still be important to add specific test scenarios for these however. |
Thank you very much for this fix! This is awesome. No need for git protocol hacks, which makes me a happy camper (pun intended) 😄 |
Code LGTM! Thanks for the fix. |
Testing this PR manually since the component tests won't catch if this PR fixes an app deployment on EC2 with a loadbalancer. Code LGTM though :) |
Thanks @bacongobbler |
Manually tested with https://github.com/deis/django-polls, but still hit the sideband issue:
|
hmm.. interesting. I wonder how its working on our ELB. Lets see if we can dig up the ssh log somewhere. Can you check first the files have been modified correctly and the ssh daemon is loaded after the change? |
otherwise see if this makes a diff: |
The files were modified correctly. I'm trying with |
well we shouldn't be needing the ClientAliveCountMax, as the connection would still be alive. can you just try setting your local ServerAliveInterval to 5 seconds maybe? |
ah wait, if you had ELB set to 60 seconds it shouldnt have worked. but ClientAliveInterval to 15 should have worked. (make sure the sshd daemon restarts properly) |
yeah, I modified the ClientAliveInterval to 15 seconds, rebuilt the builder, restarted the builder and it all worked after that. Still straining the server a bit to test any edge cases but it seems to work so far 👍 |
@developerinlondon could you please tune the ClientAliveInterval down to 15 seconds and add a comment about why this is necessary (explain ELB KeepAlive timeouts)? The ServerAliveInterval is not necessary since the builder isn't starting any SSH connections, so that can be removed as well. |
@bacongobbler basically the ssh ClientAliveInterval will send a packet through the network at the interval we specify to the other side (the ssh client connecting to builder), this will reset the timeout counter for any intermediaries (LoadBalancer or Firewall), So even if you have it set to 55 seconds, it will be enough to keep the connection live before the 60 seconds timeout on Loadbalancer. So we can probably get away with 45 seconds for this as its unlikely people would have their firewalls/loadbalancers set to a timeout smaller then that. The catch with using this is the connection WILL break if there is a physical disruption on the network for some reason, which makes no difference in our case as we have a timeout on our intermediaries anyways. |
I'm not so sure that's actually what's happening, as the push command in docker's source code performs a bunch of streaming HTTP requests to the registry, nothing more. Please correct me if I'm wrong, but I don't believe there is any dependency on SSH being used anywhere in docker core. Therefore, there's no need for
That would be ideal, but I think in this case it shouldn't be necessary. If people need to run their own custom components, there is documentation on that. 😄
Putting the keep-alive timeout at 45 seconds for TCP connections sounds good to me. |
|
||
Most Loadbalancers/Firewalls have a minimum timeout of 60 seconds. (eg AWS | ||
ELB has it at 60 seconds). | ||
The Deis-Builder is configured to have a SSH Keelalive every 45 seconds so the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo on Keelalive
@bacongobbler, I think it happens when builder pushes to Docker Registry. Even if we dont explicitly make ssh calls I think somewhere underlying (maybe docker?) its using ssh layer. See my build-fail below after turning off the keep-alive:
|
It's not pushing to the registry any more at that point, though. It's just a long-running POST request: https://github.com/deis/deis/blob/master/builder/templates/builder#L140-L144 There's nothing on the backend performing any SSH connections whatsoever, which is why I find it odd that this fix solves this issue for you. I guess if it works though, I'm not complaining. The ServerAliveInterval can be kept there. |
ok seems it didnt totally fix it, you might be right in the ssh timeout not having any effect when builder tries to connect. I am getting the error now timing out during the POST (#1275). might have to configure the TCP keepalive in that case. I am still trying to work out where in controller (#1275) is failing. |
I think altering the proc psuedo file will probably fix it, but I can only see it in the host fileSystem, not something at container level (eg More info here: |
added the tcp-timeout change: #1277 |
wonder why Travis is failing.. |
|
Code LGTM. Needs manual testing, since the smoke tests won't pick this up and I don't see a way we can mock this one up in a test suite. :( |
probably something like this can be an automated test for this, though someone needs to find some time to dig up on the details for the steps (on how to setup a firewall with a 60 seconds timeout on vagrant and setup a new coreOS cluster to go behind that firewall) Edit: Sample Spec Skeleton Added (https://github.com/deis/deis/pull/1258/files#diff-a32b9d0f32446271d217ac7497fc43deR1). Needs more work so it actually creates a vagrant firewall to test this on. |
…cers from dropping connection Currently the Deis-builder loses connection to cilents sometimes when tbere are intermediaries that monitors the connection and drops it after a certain interval (eg AWS Load Balancers have 1 min timeouts) while the Builder is busy with some activities such as tarring a file etc. This will ensure a ssh-keepalive is sent every minute to keep the connection open. TESTING: The added go-spec needs to be filled in to do a real automated test. Otherwise similar steps can be followed to do a manual test. closes deis#1046, deis#1195
Closing until we can get to the root issue in #1330. |
Currently the Deis-builder loses connection to cilents sometimes when there
are intermediaries that monitors the connection and drops it after a certain
interval (eg AWS Load Balancers have a 5 min timeout) while the Builder is
busy with some activities such as tarring a file etc.
This will ensure a ssh-keepalive is sent every minute to keep the connection
open.
TODO: Tests needs to be created with mock Firewalls in the middle that are
configured to drop connection as specific intervals, and we need to demonstrate
that the builder does not lose connection past those period when connected
to by a client.
Fixes #1046, #1195