[server] Add Cluster Health API implementation#3400

Open
swuferhong wants to merge 1 commit into apache:main from swuferhong:fluss-server-recovery

@swuferhong
Contributor

Purpose

Linked issue: close #3399

  • Add GetClusterHealth RPC to Coordinator that computes cluster health from in-memory state
  • Track inactive leaders in CoordinatorContext (marked inactive on NotifyLeaderAndIsr send,
    marked active on successful response when responding server is still the leader)
  • Handle send failures in CoordinatorRequestBatch by synthesizing error responses to clear
    pending inactive state
  • Add client API Admin.getClusterHealth() with ClusterHealth / ClusterHealthStatus types
  • Add ClusterHealthReadinessCheck CLI tool in fluss-dist (exit 0=GREEN, 1=not ready, 2=API unsupported)
  • Add readiness-check.sh, a two-step readiness probe script (TCP check, then Cluster Health API check)
    with first-boot detection and a grace period for API-unsupported responses (mixed-version rolling upgrade)
  • Wire tablet-server readiness probe to readiness-check.sh in Helm chart
  • Add documentation for Helm deployment and upgrade guide
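
The exit-code contract of the readiness check CLI listed above (0=GREEN, 1=not ready, 2=API unsupported) can be sketched as follows. This is only an illustrative sketch, not the code in this PR: the class `ReadinessExitCodeSketch`, the `Status` enum, and the use of an empty `Optional` to model "API unsupported" are all hypothetical. Only GREEN is named in the description, so every other status is collapsed into a single placeholder value.

```java
import java.util.Optional;

// Hypothetical sketch of the exit-code mapping described in the PR text.
// None of these names are taken from the actual Fluss code base.
class ReadinessExitCodeSketch {

    // Stand-in for the PR's ClusterHealthStatus. Only GREEN is named in the
    // description; NOT_GREEN is a placeholder for any other status.
    public enum Status { GREEN, NOT_GREEN }

    public static final int EXIT_READY = 0;            // cluster health is GREEN
    public static final int EXIT_NOT_READY = 1;        // API answered, but not GREEN
    public static final int EXIT_API_UNSUPPORTED = 2;  // server lacks the health API

    // An empty Optional models a server that does not support the health API
    // (for example an older coordinator during a rolling upgrade).
    public static int exitCodeFor(Optional<Status> status) {
        if (!status.isPresent()) {
            return EXIT_API_UNSUPPORTED;
        }
        return status.get() == Status.GREEN ? EXIT_READY : EXIT_NOT_READY;
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFor(Optional.of(Status.GREEN)));
        System.out.println(exitCodeFor(Optional.of(Status.NOT_GREEN)));
        System.out.println(exitCodeFor(Optional.empty()));
    }
}
```

Keeping "API unsupported" distinct from "not ready" is what lets a probe script apply a grace period to the former during a mixed-version rolling upgrade while still failing on the latter.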

Brief change log

Tests

API and Format

Documentation

@swuferhong force-pushed the fluss-server-recovery branch from fe0c5c1 to 042fc7a on May 29, 2026 03:01
Development

Successfully merging this pull request may close these issues.

[server] Support Cluster Health API for safe rolling upgrades