This repository has been archived by the owner on Nov 30, 2021. It is now read-only.
We enabled healthchecks for a staging instance of our bedrock application, and are still seeing downtime during deploys.
I wrote a script to observe the cluster while making a configuration change to trigger a deploy, and captured the output. What I observed was the following:
Note that this does not happen every time, which makes the race condition difficult to track down, but we are seeing intermittent downtime during deploys in both of our clusters.
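The observation script isn't shown above, but the approach it describes can be sketched as a simple poller that hits the app once a second during a deploy and logs any failures. This is a minimal stdlib-only sketch, not the script from the report; the `APP_URL` endpoint is a hypothetical placeholder.

```python
import time
import urllib.error
import urllib.request

APP_URL = "http://staging.example.com/healthz"  # hypothetical endpoint


def probe(url, timeout=2.0):
    """Return the HTTP status for one probe, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused / timed out -- observed downtime


def watch(url, interval=1.0, duration=300):
    """Poll `url` every `interval` seconds and log any non-200 responses."""
    deadline = time.time() + duration
    while time.time() < deadline:
        status = probe(url)
        if status != 200:
            print(f"{time.strftime('%H:%M:%S')} DOWN status={status}")
        time.sleep(interval)


# Run `watch(APP_URL)` in one terminal while triggering a deploy in another;
# any DOWN lines it prints correspond to the intermittent downtime described.
```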
A few more details: both of our clusters are running 1.10.0. We're tentatively planning a migration upgrade to 1.11.1 this afternoon, which will change all of the internal IP addresses — which is why I didn't mind sharing this level of information in a public issue. If you or anyone on your team would like to look at our system logs for more details, let me know. While going through the code looking for the source of this issue, I noticed that calling the save method on an api.models.Build instance destroys all the containers from the previous build, with no check that the new build is ready. However, I wasn't able to find where in the deployment process this method is called, so I can't be sure this is the source of the issue. The "Issues to Note" section of #4045 describes mitigating the risk of this situation with unit tests within Publisher, but based on what we're seeing, that unfortunately does not seem to be sufficient.
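The suspected race boils down to an ordering problem: old containers are destroyed as a side effect of saving the new build, before anything has confirmed the new containers are serving traffic. The sketch below is a hypothetical illustration of the safe ordering, not deis's actual code; `start`, `destroy`, and `is_healthy` are assumed callbacks, not real deis API.

```python
import time


def deploy(new_build, old_containers, start, destroy, is_healthy, timeout=60):
    """Start the new build, then destroy the old containers only after the
    new ones pass a health check -- the validation the report suggests is
    missing from Build.save()."""
    new_containers = start(new_build)
    deadline = time.time() + timeout
    while not all(is_healthy(c) for c in new_containers):
        if time.time() > deadline:
            # New build never became healthy: tear it down and keep
            # serving from the old containers instead of going dark.
            for c in new_containers:
                destroy(c)
            raise RuntimeError("new build failed health checks; old build kept")
        time.sleep(1)
    # Only now is it safe to remove the previous build's containers.
    for c in old_containers:
        destroy(c)
    return new_containers
```

With this ordering, a deploy of a broken build fails loudly and leaves the old containers serving, rather than producing the observed window of downtime.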
This is not something we'll be able to get to for the LTS release (#4776), and the architecture for health checks in Deis v2 has changed dramatically such that this is no longer relevant (we're now using Kubernetes liveness and readiness probes), so I'm closing this.
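For reference, the Kubernetes probes mentioned in the close separate "restart me" from "route traffic to me", which addresses exactly this race: a container receives traffic only once its readiness probe passes. A minimal sketch of a pod spec fragment follows; the port, image, and `/healthz` path are assumptions, not Deis v2's actual configuration.

```yaml
containers:
  - name: web
    image: example/bedrock:latest   # hypothetical image
    livenessProbe:                  # restart the container if this starts failing
      httpGet:
        path: /healthz
        port: 8000
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:                 # only route traffic once this succeeds
      httpGet:
        path: /healthz
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 5
```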