
Non-zero-downtime deploys with healthcheck enabled #4600

Closed
jgmize opened this issue Oct 12, 2015 · 3 comments

Comments

@jgmize (Contributor) commented Oct 12, 2015

We enabled healthchecks for a staging instance of our bedrock application, and are still seeing downtime during deploys.

I wrote a script to observe the cluster while making a configuration change to trigger a deploy, and captured the output. What I observed was the following:

  1. The previous version (v128) of the app (bedrock-stage) stops being published to the /deis/services/bedrock-stage etcd directory before the new version (v129) has entered the running state: https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L411-L452
  2. When the routers refresh their config, the app drops out of the generated nginx.conf and all requests begin to return a 404: https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L473-L637
  3. The publisher begins publishing v128 to etcd under /deis/services/bedrock-stage/ again (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L638-L1009), and shortly afterward begins publishing v129 as well (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L1010-L2137).
  4. v128 continues to be published even after its units have stopped running according to fleet (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L2138-L2196). This should be less of an issue than the original problem thanks to the proxy_next_upstream directive in nginx.conf.
  5. After a few more seconds, the nginx.conf is updated to reflect the values in etcd, and the routers begin sending traffic to the new instances exclusively (https://gist.github.com/jgmize/674d0c2ea03cbb96971f#file-watch-bedrock-stage-L2247-L2266).

Note that this does not happen every time, which makes the race condition difficult to track down, but we are seeing intermittent downtime during deploys in both of our clusters.
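
For anyone who wants to reproduce this, a rough sketch of that kind of watcher is below. It is not the actual script from the gist; the etcd path matches the one above, but the app URL and healthcheck endpoint are placeholders, and it assumes etcdctl is on the PATH on the host it runs from.

```python
#!/usr/bin/env python3
"""Hypothetical observation loop (not the script from the gist above): poll the etcd
service directory and the app URL once per second to see what the publisher and the
routers are exposing during a deploy."""
import subprocess
import time
import urllib.error
import urllib.request

SERVICE_DIR = "/deis/services/bedrock-stage"            # etcd directory from the report
APP_URL = "https://bedrock-stage.example.com/healthz/"  # placeholder; substitute your own

def etcd_backends():
    """List the host:port entries currently published under the service directory."""
    try:
        out = subprocess.check_output(["etcdctl", "ls", "--recursive", SERVICE_DIR])
        return out.decode().split()
    except subprocess.CalledProcessError:
        return []  # directory missing, i.e. nothing is being published at all

def app_status():
    """Return the HTTP status the router currently answers with."""
    try:
        return urllib.request.urlopen(APP_URL, timeout=5).getcode()
    except urllib.error.HTTPError as err:
        return err.code  # e.g. the 404s seen while the app is unpublished
    except Exception:
        return "unreachable"

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), app_status(), etcd_backends())
        time.sleep(1)
```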

@carmstrong (Contributor)

Thanks for the detailed report, jgmize! We'll certainly need to fix this.

/cc @bacongobbler @deis/production-team

@jgmize (Contributor, Author) commented Oct 12, 2015

A few more details: both of our clusters are running 1.10.0. We're tentatively planning a migration upgrade to 1.11.1 this afternoon, which will change all of the internal IP addresses; that's why I didn't mind sharing this level of information in a public issue. If you or anyone on your team would like to take a look at our system logs for more details, let me know.

While going through the code looking for the source of this issue, I noticed that calling the save method on an api.models.Build instance destroys all of the containers from the previous build, with no check to validate that the new build is ready. I wasn't able to find where in the deployment process this method is called, though, so I'm not sure whether it is the source of the issue. The "Issues to Note" section of #4045 describes mitigating the risk of this situation with unit tests within Publisher, but based on what we're seeing, that unfortunately does not seem to be sufficient.
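
To illustrate the ordering being suggested, here is a rough sketch. It is not the actual controller code; start_containers, is_healthy and destroy are stand-ins for whatever the real primitives are.

```python
# Hypothetical sketch of the suggested ordering, NOT the actual Deis controller code.
# The idea: bring the new build's containers up and confirm they pass their healthcheck
# before tearing down the previous build's containers.
import time

def deploy(app, new_build, old_containers, timeout=120):
    new_containers = app.start_containers(new_build)           # stand-in API

    # Gate the teardown on the new containers actually answering their healthcheck.
    deadline = time.time() + timeout
    while not all(c.is_healthy() for c in new_containers):     # stand-in healthcheck call
        if time.time() > deadline:
            # Roll back rather than leaving the app unpublished.
            for c in new_containers:
                c.destroy()
            raise RuntimeError("new build never became healthy; old containers left running")
        time.sleep(2)

    # Only now is it safe to retire the previous build.
    for c in old_containers:
        c.destroy()
```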

@bacongobbler (Member)

This is not something we'll be able to get to for the LTS release (#4776), and the architecture for health checks in Deis v2 has changed so dramatically that this issue is no longer relevant (we're now using Kubernetes liveness and readiness probes), so I'm closing this.
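
For anyone landing here later, a minimal sketch of those probes using the official Kubernetes Python client follows; the /healthz/ path and port 8000 are placeholders, not Deis v2 defaults.

```python
# Minimal sketch of the Kubernetes probes mentioned above, using the official Python client.
# The /healthz/ path, port 8000, and image are assumptions for a bedrock-like app.
from kubernetes import client

probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz/", port=8000),
    initial_delay_seconds=5,
    period_seconds=10,
    failure_threshold=3,
)

container = client.V1Container(
    name="bedrock-stage",
    image="example/bedrock:latest",
    ports=[client.V1ContainerPort(container_port=8000)],
    readiness_probe=probe,   # gate traffic until the pod answers its healthcheck
    liveness_probe=probe,    # restart the container if it stops answering
)
```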
