Search before asking
Motivation
During Kubernetes StatefulSet rolling upgrades, the next TabletServer pod should not restart until all replicas from the previously restarted pod have fully recovered (leaders re-elected, ISR restored). Without this, cascading restarts can cause data unavailability or prolonged under-replication.
Currently there is no server-side API to determine whether the cluster has finished recovery. Operators rely on TCP-only readiness probes, which pass as soon as the process binds its port — long before replica recovery completes.
Solution
Add a GetClusterHealth RPC to the Coordinator that computes cluster health from in-memory state (CoordinatorContext). The API returns replica statistics and an overall health status:
- GREEN — all replicas are in-sync and all leaders are active.
- YELLOW — all leaders are active, but some replicas have not yet rejoined ISR.
- RED — one or more leaders have not been confirmed active (election or KV recovery in progress).
- UNKNOWN — health could not be determined.
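The mapping from replica statistics to a status could be sketched roughly as follows. This is only an illustration of the rules listed above, not the proposed Coordinator code: the `ReplicaStats` type, its field names, and the `classify` function are hypothetical, and treating inconsistent counters as UNKNOWN is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HealthStatus(Enum):
    GREEN = "GREEN"
    YELLOW = "YELLOW"
    RED = "RED"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class ReplicaStats:
    # Hypothetical counters; the real API's field names are not specified here.
    expected_leaders: int   # leaders the cluster should have
    active_leaders: int     # leaders confirmed active
    total_replicas: int     # all assigned replicas
    in_sync_replicas: int   # replicas currently in ISR


def classify(stats: Optional[ReplicaStats]) -> HealthStatus:
    """Derive an overall status from replica statistics."""
    if stats is None:
        return HealthStatus.UNKNOWN
    counters = (stats.expected_leaders, stats.active_leaders,
                stats.total_replicas, stats.in_sync_replicas)
    # Assumption: negative or contradictory counters mean health is unknowable.
    if any(c < 0 for c in counters):
        return HealthStatus.UNKNOWN
    if (stats.active_leaders > stats.expected_leaders
            or stats.in_sync_replicas > stats.total_replicas):
        return HealthStatus.UNKNOWN
    # RED takes priority: at least one leader is not confirmed active.
    if stats.active_leaders < stats.expected_leaders:
        return HealthStatus.RED
    # All leaders are active, but some replicas are still outside ISR.
    if stats.in_sync_replicas < stats.total_replicas:
        return HealthStatus.YELLOW
    return HealthStatus.GREEN
```

Checking leaders before ISR keeps the ordering of the list above: a missing leader always yields RED, even if ISR is also incomplete.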
A readiness-probe shell script (readiness-check.sh) performs a two-step check:
- TCP port check (local liveness)
- Cluster Health API query (passes only on GREEN)
This gates StatefulSet rolling upgrades: the next pod only restarts when the cluster is fully healthy.
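The two-step logic of the probe can be illustrated with a minimal sketch. The proposal describes a shell script; Python is used here only to show the control flow. The names `tcp_port_open` and `is_ready` are hypothetical, and `fetch_status` is a placeholder for whatever client call ends up querying the Cluster Health API, since that interface is not defined in this issue.

```python
import socket
from typing import Callable


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Step 1: local liveness, i.e. the process accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_ready(host: str, port: int, fetch_status: Callable[[], str]) -> bool:
    """Return True only if the port is open and cluster health is GREEN.

    fetch_status is a hypothetical callable that returns the status string
    reported by the Cluster Health API.
    """
    if not tcp_port_open(host, port):
        return False
    try:
        status = fetch_status()
    except Exception:
        # Assumption: a failed health query is treated as not ready.
        return False
    # Step 2: YELLOW, RED, and UNKNOWN all keep the pod unready.
    return status == "GREEN"
```

A probe wrapper would exit with status 0 when `is_ready(...)` returns True and non-zero otherwise, so the pod is marked ready only on GREEN.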
Anything else?
No response
Willingness to contribute