
Non-zero-downtime deploys with healthcheck enabled #4600

Closed
jgmize opened this issue Oct 12, 2015 · 3 comments

Comments

@jgmize (Contributor) commented Oct 12, 2015

We enabled healthchecks for a staging instance of our bedrock application, and are still seeing downtime during deploys.

I wrote a script to observe the cluster while making a configuration change to trigger a deploy, and captured the output. What I observed was the following:

  1. The previous version (v128) of the app (bedrock-stage) stops being published to the /deis/services/bedrock-stage etcd directory before the new version (v129) has entered the running state: https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L411-L452
  2. When the routers refresh their config, the app drops out of the generated nginx.conf and all requests begin to return a 404: https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L473-L637
  3. The publisher begins publishing v128 to etcd under /deis/services/bedrock-stage/ again (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L638-L1009), and shortly afterward begins publishing v129 as well (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L1010-L2137).
  4. v128 continues to be published even after its units have stopped running according to fleet (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L2138-L2196). This should be less of an issue than the original problem thanks to the proxy_next_upstream directive in nginx.conf.
  5. After a few more seconds, the nginx.conf is updated to reflect the values in etcd, and the routers begin sending traffic to the new instances exclusively (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L2247-L2266).

Note that this does not happen every time, which makes the race condition difficult to track down, but we are seeing intermittent downtime during deploys in both of our clusters.
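
For anyone who wants to reproduce this, a rough sketch of that kind of watcher is below. It is not the actual script from the gist; the etcd path matches the one above, but the app URL and healthcheck endpoint are placeholders, and it assumes etcdctl is on the PATH on the host it runs from.

```python
#!/usr/bin/env python3
"""Hypothetical observation loop (not the script from the gist above): poll the etcd
service directory and the app URL once per second to see what the publisher and the
routers are exposing during a deploy."""
import subprocess
import time
import urllib.error
import urllib.request

SERVICE_DIR = "/deis/services/bedrock-stage"            # etcd directory from the report
APP_URL = "https://bedrock-stage.example.com/healthz/"  # placeholder; substitute your own

def etcd_backends():
    """List the host:port entries currently published under the service directory."""
    try:
        out = subprocess.check_output(["etcdctl", "ls", "--recursive", SERVICE_DIR])
        return out.decode().split()
    except subprocess.CalledProcessError:
        return []  # directory missing, i.e. nothing is being published at all

def app_status():
    """Return the HTTP status the router currently answers with."""
    try:
        return urllib.request.urlopen(APP_URL, timeout=5).getcode()
    except urllib.error.HTTPError as err:
        return err.code  # e.g. the 404s seen while the app is unpublished
    except Exception:
        return "unreachable"

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), app_status(), etcd_backends())
        time.sleep(1)
```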

@carmstrong (Contributor)

Thanks for the detailed report, jgmize! We'll certainly need to fix this.

/cc @bacongobbler @deis/production-team

@jgmize (Contributor, Author) commented Oct 12, 2015

A few more details: both of our clusters are running 1.10.0. We're tentatively planning a migration upgrade to 1.11.1 this afternoon, which will change all of the internal IP addresses; that's why I didn't mind sharing this level of information in a public issue. If you or anyone on your team would like to take a look at our system logs for more details, let me know.

While going through the code looking for the source of this issue, I noticed that calling the save method on an api.models.Build instance destroys all of the containers from the previous build, with no check to validate that the new build is ready. I wasn't able to find where in the deployment process this method is called, though, so I'm not sure whether it is the source of the issue. The "Issues to Note" section of #4045 describes mitigating the risk of this situation with unit tests within Publisher, but based on what we're seeing, that unfortunately does not seem to be sufficient.
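
To illustrate the ordering being suggested, here is a rough sketch. It is not the actual controller code; start_containers, is_healthy and destroy are stand-ins for whatever the real primitives are.

```python
# Hypothetical sketch of the suggested ordering, NOT the actual Deis controller code.
# The idea: bring the new build's containers up and confirm they pass their healthcheck
# before tearing down the previous build's containers.
import time

def deploy(app, new_build, old_containers, timeout=120):
    new_containers = app.start_containers(new_build)           # stand-in API

    # Gate the teardown on the new containers actually answering their healthcheck.
    deadline = time.time() + timeout
    while not all(c.is_healthy() for c in new_containers):     # stand-in healthcheck call
        if time.time() > deadline:
            # Roll back rather than leaving the app unpublished.
            for c in new_containers:
                c.destroy()
            raise RuntimeError("new build never became healthy; old containers left running")
        time.sleep(2)

    # Only now is it safe to retire the previous build.
    for c in old_containers:
        c.destroy()
```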

@bacongobbler (Member)

This is not something we'll be able to get to for the LTS release (#4776), and the architecture for health checks in Deis v2 has changed so dramatically that this issue is no longer relevant (we're now using Kubernetes liveness and readiness probes), so I'm closing this.
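
For anyone landing here later, a minimal sketch of those probes using the official Kubernetes Python client follows; the /healthz/ path and port 8000 are placeholders, not Deis v2 defaults.

```python
# Minimal sketch of the Kubernetes probes mentioned above, using the official Python client.
# The /healthz/ path, port 8000, and image are assumptions for a bedrock-like app.
from kubernetes import client

probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz/", port=8000),
    initial_delay_seconds=5,
    period_seconds=10,
    failure_threshold=3,
)

container = client.V1Container(
    name="bedrock-stage",
    image="example/bedrock:latest",
    ports=[client.V1ContainerPort(container_port=8000)],
    readiness_probe=probe,   # gate traffic until the pod answers its healthcheck
    liveness_probe=probe,    # restart the container if it stops answering
)
```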
