Skip to content
This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

fix(builder): ssh timeout to prevent intermediary firewalls/loadbalancers from dropping connection #1258

Conversation

developerinlondon
Copy link
Contributor

Currently the Deis-builder loses connection to cilents sometimes when there
are intermediaries that monitors the connection and drops it after a certain
interval (eg AWS Load Balancers have a 5 min timeout) while the Builder is
busy with some activities such as tarring a file etc.

This will ensure a ssh-keepalive is sent every minute to keep the connection
open.

TODO: Tests needs to be created with mock Firewalls in the middle that are
configured to drop connection as specific intervals, and we need to demonstrate
that the builder does not lose connection past those period when connected
to by a client.

Fixes #1046, #1195

@deis-bot
Copy link

deis-bot commented Jul 4, 2014

Thanks for the contribution! Please ensure your commits follow our style guide. This code will be tested once a Deis maintainer reviews it.

@carmstrong
Copy link
Contributor

@developerinlondon Can you amend this so it just contains commit 77fab34? The other stuff should warrant a separate discussion (and I think you already have a PR for that).

@developerinlondon
Copy link
Contributor Author

@carmstrong didn't find a clear way to take out specific commits. So just overwrote them with your master copy.

@bacongobbler
Copy link
Member

@developerinlondon you can create a new branch off master and just cherry pick the commit in with git cherry-pick 77fab34. IIRC you can then force-push the branch in with git push -f origin <new_branch_name>:1046-deis-builder-ssh-timeout. That should force-push your new_branch_name to this PR. I'd do this myself (opening a new PR with your commit cherry-picked in) but I'm camping this weekend :)

@developerinlondon
Copy link
Contributor Author

no problem @bacongobbler . I think I got it right this time :-)

@developerinlondon
Copy link
Contributor Author

just tested on our AWS Loadbalancers and can confirm they are no longer dropping connections.

Would still be important to add specific test scenarios for these however.

@bacongobbler
Copy link
Member

Thank you very much for this fix! This is awesome. No need for git protocol hacks, which makes me a happy camper (pun intended) 😄

@carmstrong
Copy link
Contributor

Code LGTM! Thanks for the fix.

@bacongobbler
Copy link
Member

Testing this PR manually since the component tests won't catch if this PR fixes an app deployment on EC2 with a loadbalancer. Code LGTM though :)

@carmstrong
Copy link
Contributor

Thanks @bacongobbler

@bacongobbler
Copy link
Member

Manually tested with https://github.com/deis/django-polls, but still hit the sideband issue:

$ make -C builder build restart
Step 13 : RUN echo ClientAliveInterval 60 >> /etc/ssh/sshd_config
 ---> Running in 3eadc8e9af4d
 ---> a31dca211a4e
Removing intermediate container 3eadc8e9af4d
Step 14 : RUN echo ServerAliveInterval 5 >> /etc/ssh/ssh_config
 ---> Running in 798612bab5ea
 ---> 5c7d5828b4ad
[...]
$ deis create
$ deis config:set BUILDPACK_URL=https://github.com/heroku/heroku-buildpack-python
$ git push deis master
remote:        Launching... Connection to deis.bacongobbler.com closed by remote host.
fatal: The remote end hung up unexpectedly
error: error in sideband demultiplexer

@developerinlondon
Copy link
Contributor Author

hmm.. interesting. I wonder how its working on our ELB.

Lets see if we can dig up the ssh log somewhere. Can you check first the files have been modified correctly and the ssh daemon is loaded after the change?

@developerinlondon
Copy link
Contributor Author

otherwise see if this makes a diff:
echo ServerAliveInterval 60 >> ~/ssh/config
which will mean the ClientAliveInterval isn't working as it should.

@bacongobbler
Copy link
Member

The files were modified correctly. I'm trying with ClientAliveInterval set to 15 and a ClientAliveCountMax set to 4 as EC2 load balancers have a default KeepAlive set to 1 minute, so I think the timeout is beating the null packet to the finish line.

@developerinlondon
Copy link
Contributor Author

well we shouldn't be needing the ClientAliveCountMax, as the connection would still be alive. can you just try setting your local ServerAliveInterval to 5 seconds maybe?

@developerinlondon
Copy link
Contributor Author

ah wait, if you had ELB set to 60 seconds it shouldnt have worked. but ClientAliveInterval to 15 should have worked. (make sure the sshd daemon restarts properly)

@bacongobbler
Copy link
Member

yeah, I modified the ClientAliveInterval to 15 seconds, rebuilt the builder, restarted the builder and it all worked after that. Still straining the server a bit to test any edge cases but it seems to work so far 👍

@bacongobbler
Copy link
Member

@developerinlondon could you please tune the ClientAliveInterval down to 15 seconds and add a comment about why this is necessary (explain ELB KeepAlive timeouts)? The ServerAliveInterval is not necessary since the builder isn't starting any SSH connections, so that can be removed as well.

@developerinlondon
Copy link
Contributor Author

@bacongobbler basically the ssh ClientAliveInterval will send a packet through the network at the interval we specify to the other side (the ssh client connecting to builder), this will reset the timeout counter for any intermediaries (LoadBalancer or Firewall), So even if you have it set to 55 seconds, it will be enough to keep the connection live before the 60 seconds timeout on Loadbalancer. So we can probably get away with 45 seconds for this as its unlikely people would have their firewalls/loadbalancers set to a timeout smaller then that. The catch with using this is the connection WILL break if there is a physical disruption on the network for some reason, which makes no difference in our case as we have a timeout on our intermediaries anyways.
We need the ServerAliveInterval also as the deis-builder connects to the deis-registry as a client by ssh to put images up. I haven't tested the interval, but we can probably get away with 45 seconds on this also.
If you know where in documentation to add this, I can add some text explaining it.
Its not something users should generally be worried about as 45 seconds should be small enough but probably these should go as configuration environment variables somewhere to give people the flexibility of choosing what they want this to be. Might require a python script that updates these. wdyt?

@bacongobbler
Copy link
Member

We need the ServerAliveInterval also as the deis-builder connects to the deis-registry as a client by ssh to put images up.

I'm not so sure that's actually what's happening, as the push command in docker's source code performs a bunch of streaming HTTP requests to the registry, nothing more. Please correct me if I'm wrong, but I don't believe there is any dependency on SSH being used anywhere in docker core. Therefore, there's no need for ServerAliveInterval.

Its not something users should generally be worried about as 45 seconds should be small enough but probably these should go as configuration environment variables somewhere to give people the flexibility of choosing what they want this to be.

That would be ideal, but I think in this case it shouldn't be necessary. If people need to run their own custom components, there is documentation on that. 😄

wdyt?

Putting the keep-alive timeout at 45 seconds for TCP connections sounds good to me.


Most Loadbalancers/Firewalls have a minimum timeout of 60 seconds. (eg AWS
ELB has it at 60 seconds).
The Deis-Builder is configured to have a SSH Keelalive every 45 seconds so the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on Keelalive

@developerinlondon
Copy link
Contributor Author

@bacongobbler, I think it happens when builder pushes to Docker Registry. Even if we dont explicitly make ssh calls I think somewhere underlying (maybe docker?) its using ssh layer. See my build-fail below after turning off the keep-alive:


remote: Sending build context to Docker daemon         
remote: 
remote: Step 0 : FROM deis/slugrunner        
remote:  ---> 683f264537a0        
remote: Step 1 : RUN mkdir -p /app        
remote:  ---> Using cache        
remote:  ---> 51a6b6418e5d        
remote: Step 2 : ADD slug.tgz /app        
remote:  ---> 6ce466ca9a60        
remote: Removing intermediate container 1666766b2175        
remote: Step 3 : ENTRYPOINT ["/runner/init"]        
remote:  ---> Running in 79ca119b27d8        
remote:  ---> 2244296144a4        
remote: Removing intermediate container 79ca119b27d8        
remote: Successfully built 2244296144a4        
remote: -----> Pushing image to private registry   
remote:        Launching... Connection to deis.stats-production.cronycle.net closed by remote host.
fatal: The remote end hung up unexpectedly
error: error in sideband demultiplexer
To ssh://git@deis.stats-production.cronycle.net:2222/stats.git
 + e80aa01...d16dfcd HEAD -> master (forced update)
error: failed to push some refs to 'ssh://git@deis.stats-production.cronycle.net:2222/stats.git'
Build step 'Execute shell' marked build as failure
Finished: FAILURE

@bacongobbler
Copy link
Member

I think it happens when builder pushes to Docker Registry

It's not pushing to the registry any more at that point, though. It's just a long-running POST request: https://github.com/deis/deis/blob/master/builder/templates/builder#L140-L144

There's nothing on the backend performing any SSH connections whatsoever, which is why I find it odd that this fix solves this issue for you. I guess if it works though, I'm not complaining. The ServerAliveInterval can be kept there.

@developerinlondon
Copy link
Contributor Author

ok seems it didnt totally fix it, you might be right in the ssh timeout not having any effect when builder tries to connect. I am getting the error now timing out during the POST (#1275). might have to configure the TCP keepalive in that case.

I am still trying to work out where in controller (#1275) is failing.

@developerinlondon
Copy link
Contributor Author

I think altering the proc psuedo file will probably fix it, but I can only see it in the host fileSystem, not something at container level (eg echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time will set it to 10 mins instead of the default 2 hours).

More info here:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

@developerinlondon
Copy link
Contributor Author

added the tcp-timeout change: #1277

@developerinlondon
Copy link
Contributor Author

wonder why Travis is failing..

@bacongobbler
Copy link
Member

wonder why Travis is failing..

travis-ci/travis-ci#2389 :(

@bacongobbler
Copy link
Member

Code LGTM. Needs manual testing, since the smoke tests won't pick this up and I don't see a way we can mock this one up in a test suite. :(

@developerinlondon
Copy link
Contributor Author

probably something like this can be an automated test for this, though someone needs to find some time to dig up on the details for the steps (on how to setup a firewall with a 60 seconds timeout on vagrant and setup a new coreOS cluster to go behind that firewall)

Edit: Sample Spec Skeleton Added (https://github.com/deis/deis/pull/1258/files#diff-a32b9d0f32446271d217ac7497fc43deR1). Needs more work so it actually creates a vagrant firewall to test this on.

…cers from dropping connection

Currently the Deis-builder loses connection to cilents sometimes when tbere
are intermediaries that monitors the connection and drops it after a certain
interval (eg AWS Load Balancers have 1 min timeouts) while the Builder is
busy with some activities such as tarring a file etc.

This will ensure a ssh-keepalive is sent every minute to keep the connection
open.

TESTING: The added go-spec needs to be filled in to do a real automated test.
Otherwise similar steps can be followed to do a manual test.

closes deis#1046, deis#1195
@carmstrong
Copy link
Contributor

Closing until we can get to the root issue in #1330.

@carmstrong carmstrong closed this Jul 17, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error from git during Docker build process
4 participants