[server] Support Cluster Health API for safe rolling upgrades #3399

@swuferhong

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

During Kubernetes StatefulSet rolling upgrades, the next TabletServer pod should not restart until all replicas from the previously restarted pod have fully recovered (leaders re-elected, ISR restored). Without this, cascading restarts can cause data unavailability or prolonged under-replication.

Currently there is no server-side API to determine whether the cluster has finished recovery. Operators rely on TCP-only readiness probes, which pass as soon as the process binds its port — long before replica recovery completes.

Solution

Add a GetClusterHealth RPC to the Coordinator that computes cluster health from in-memory state (CoordinatorContext). The API returns replica statistics and an overall health status:

  • GREEN — all replicas are in-sync and all leaders are active.
  • YELLOW — all leaders are active, but some replicas have not yet rejoined ISR.
  • RED — one or more leaders have not been confirmed active (election or KV recovery in progress).
  • UNKNOWN — health could not be determined.
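
The status rules above could be sketched roughly as follows. This is only an illustration of the classification logic, not the proposed implementation: `ClusterHealthSketch`, `ReplicaSetState`, and `classify` are hypothetical names, and the real RPC would read its inputs from CoordinatorContext. The sketch also assumes that a cluster with no replica sets counts as GREEN and that missing state maps to UNKNOWN.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the actual codebase.
class ClusterHealthSketch {

    enum HealthStatus { GREEN, YELLOW, RED, UNKNOWN }

    /** Hypothetical per-replica-set snapshot of the Coordinator's in-memory state. */
    static final class ReplicaSetState {
        final boolean leaderActive;
        final int assignedReplicas;
        final int inSyncReplicas;

        ReplicaSetState(boolean leaderActive, int assignedReplicas, int inSyncReplicas) {
            this.leaderActive = leaderActive;
            this.assignedReplicas = assignedReplicas;
            this.inSyncReplicas = inSyncReplicas;
        }
    }

    static HealthStatus classify(List<ReplicaSetState> states) {
        if (states == null) {
            return HealthStatus.UNKNOWN; // state unavailable, health cannot be determined
        }
        boolean underReplicated = false;
        for (ReplicaSetState s : states) {
            if (!s.leaderActive) {
                return HealthStatus.RED; // any unconfirmed leader dominates
            }
            if (s.inSyncReplicas < s.assignedReplicas) {
                underReplicated = true; // leader is fine, ISR not yet restored
            }
        }
        return underReplicated ? HealthStatus.YELLOW : HealthStatus.GREEN;
    }

    public static void main(String[] args) {
        List<ReplicaSetState> recovering = List.of(
                new ReplicaSetState(true, 3, 3),
                new ReplicaSetState(true, 3, 2));
        System.out.println(classify(recovering));
    }
}
```

RED is checked first so that a missing leader is never masked by an ISR shortfall elsewhere in the cluster.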

A readiness-probe shell script (readiness-check.sh) performs a two-step check:

  1. TCP port check (local liveness)
  2. Cluster Health API query (passes only on GREEN)

This gates StatefulSet rolling upgrades: the next pod restarts only once the cluster is fully healthy.
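
The decision logic of the two-step check could look something like the sketch below. It is written in Java only to illustrate the ordering and the pass condition; the actual proposal is a shell script. `ReadinessGateSketch`, `portOpen`, and `ready` are hypothetical names, and the health query is passed in as a `Supplier<String>` because the client-side call for GetClusterHealth is not specified in this issue.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.function.Supplier;

// Hypothetical sketch of the probe's decision logic, not the readiness-check.sh script itself.
class ReadinessGateSketch {

    /** Step 1: local liveness. True if something accepts TCP connections on host:port. */
    static boolean portOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Step 2 runs only if step 1 passes; any status other than GREEN fails the probe. */
    static boolean ready(String host, int port, int timeoutMillis, Supplier<String> healthQuery) {
        if (!portOpen(host, port, timeoutMillis)) {
            return false;
        }
        try {
            return "GREEN".equals(healthQuery.get());
        } catch (RuntimeException e) {
            return false; // assumption: a failed query is treated like UNKNOWN
        }
    }

    public static void main(String[] args) throws IOException {
        // A throwaway local listener stands in for the TabletServer port.
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            System.out.println(ready("127.0.0.1", port, 500, () -> "GREEN"));
            System.out.println(ready("127.0.0.1", port, 500, () -> "YELLOW"));
        }
    }
}
```

Failing the probe on YELLOW, RED, UNKNOWN, and query errors alike keeps the gate conservative: the StatefulSet controller waits rather than proceeding on incomplete information.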

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
