Description
When deploying a new version of a service behind a load balancer, it's useful to "gracefully" terminate the instances of the old version while directing all new traffic to the instances of the new version. In Kubernetes-land, this is usually done by signaling to the soon-to-be-terminated instances to stop responding successfully to health checks. When the load balancer notices they're failing health checks, it will stop sending new requests to them and direct all traffic to the new instances that are passing health checks. Then, after some reasonable grace period, such as 30 seconds, the old instances can shut down entirely.
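The drain pattern described above can be sketched as a health check with a flag that flips to failing when shutdown begins. This is illustrative only, not Misk's actual API; the class and method names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a health check that starts failing at the beginning
// of shutdown so the load balancer stops routing traffic to this instance
// before the process actually exits.
public class DrainableHealthCheck {
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // What the health endpoint reports: true while the instance
    // should keep receiving traffic.
    public boolean isHealthy() {
        return !draining.get();
    }

    // Called when shutdown begins: flip to unhealthy, then wait out the
    // grace period (e.g. 30s) so the load balancer can re-route traffic.
    public void beginDrain(long gracePeriodMillis) {
        draining.set(true);
        try {
            Thread.sleep(gracePeriodMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Only after `beginDrain` returns would the rest of the shutdown sequence run.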
From what I understand, Misk's current shutdown behavior is simply to race to shutdown (while preserving CoordinatedService ordering), with no provision for a grace period during which it 1) intentionally fails health checks, 2) refuses new requests, or 3) both, giving the upstream load balancer an opportunity to re-route traffic.
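In other words, the desired behavior is a drain phase inserted in front of the existing ordered shutdown. A minimal sketch of that sequencing, where `orderedStopActions` stands in for Misk's CoordinatedService shutdown order (none of these names are real Misk APIs):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a shutdown sequence that drains before stopping.
public class GracefulShutdown {
    static final AtomicBoolean healthy = new AtomicBoolean(true);

    // What a health endpoint would report to the load balancer.
    static boolean healthCheck() { return healthy.get(); }

    static void shutdown(List<Runnable> orderedStopActions, long graceMillis)
            throws InterruptedException {
        healthy.set(false);          // 1) start failing health checks
        Thread.sleep(graceMillis);   // 2) let the load balancer re-route
        for (Runnable stop : orderedStopActions) {
            stop.run();              // 3) then stop services, in order
        }
    }
}
```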
This manifests as spurious errors in services that depend on Misk services: sometimes they make calls during the brief shutdown window, and those calls fail in unexpected ways because of the rushed shutdown.