[server] Add timeout mechanism for rebalance tasks to prevent indefinite blocking on missing completion events #3096

@swuferhong

Description

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Currently, the rebalance implementation relies on receiving a completion event from the previous task before proceeding to the next one. If the completion event is lost (e.g., due to request loss, network issues, or server crashes), the entire rebalance process gets stuck indefinitely with no way to recover.

Solution

Introduce a configurable timeout mechanism for rebalance tasks:

  1. When a rebalance task is dispatched, start a timeout timer (e.g., configurable via rebalance.task.timeout)
  2. If the completion event is not received within the timeout period, automatically mark the task as timed out (failed), then either retry it or move on to the next task
  3. Add appropriate logging to indicate when a timeout occurs vs. a normal completion
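
To make the three steps above concrete, here is a minimal sketch in Python. It is not the server's actual code: `RebalanceTask`, `run_rebalance`, `task_timeout_s`, and `max_retries` are hypothetical names made up for illustration, and the completion event is modeled with a plain `threading.Event`. The idea is only to show a bounded wait per task, with different log lines for timeout and normal completion.

```python
import logging
import threading
from enum import Enum

log = logging.getLogger("rebalance")


class Outcome(Enum):
    COMPLETED = "completed"
    TIMED_OUT = "timed_out"


class RebalanceTask:
    """Hypothetical task. `dispatch` sends the request; `done` is set
    when the completion event for this task arrives."""

    def __init__(self, name, dispatch):
        self.name = name
        self.dispatch = dispatch
        self.done = threading.Event()


def run_rebalance(tasks, task_timeout_s, max_retries=0):
    """Run tasks in order, waiting at most task_timeout_s per attempt.

    A task whose completion event never arrives is marked TIMED_OUT
    and the loop moves on, so one lost event cannot block the rest.
    """
    results = {}
    for task in tasks:
        outcome = Outcome.TIMED_OUT
        for attempt in range(1, max_retries + 2):
            task.done.clear()
            task.dispatch(task)
            # Event.wait returns True if the event was set before the timeout.
            if task.done.wait(timeout=task_timeout_s):
                log.info("task %s completed (attempt %d)", task.name, attempt)
                outcome = Outcome.COMPLETED
                break
            log.warning("task %s timed out after %.2fs (attempt %d)",
                        task.name, task_timeout_s, attempt)
        results[task.name] = outcome
    return results


def lost(task):
    """Simulates a lost completion event: nothing ever sets task.done."""


def acked(task):
    """Simulates a completion event that arrives right away."""
    task.done.set()


tasks = [RebalanceTask("task-a", lost), RebalanceTask("task-b", acked)]
results = run_rebalance(tasks, task_timeout_s=0.05)
```

In this run `task-a` never gets its completion event, so it is recorded as timed out and `task-b` still runs. A real implementation would also need to decide what to do with a completion event that shows up after the timeout has already fired; the sketch does not handle that case.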

This ensures the rebalance process can make forward progress even when individual completion events are lost, improving the overall robustness of the rebalance mechanism.

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
