Search before asking
Motivation
Currently, the rebalance implementation relies on receiving a completion event from the previous task before proceeding to the next one. If the completion event is lost (e.g., due to request loss, network issues, or server crashes), the entire rebalance process gets stuck indefinitely with no way to recover.
Solution
Introduce a configurable timeout mechanism for rebalance tasks:
- When a rebalance task is dispatched, start a timer whose duration is configurable (e.g., via rebalance.task.timeout)
- If the completion event is not received within the timeout period, automatically treat the task as failed/timed-out and trigger the next task or retry
- Add appropriate logging to indicate when a timeout occurs vs. a normal completion
This ensures the rebalance process can make forward progress even when individual completion events are lost,
improving the overall robustness of the rebalance mechanism.
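To make the proposal concrete, here is a minimal sketch of the idea in Python. It is not the project's actual implementation. `RebalanceCoordinator`, `task_timeout_s` (standing in for the proposed rebalance.task.timeout option), and the helpers `completes` / `never_completes` are hypothetical names used only for illustration. Each task is dispatched, the coordinator waits for its completion event for at most the configured timeout, logs which of the two outcomes occurred, and then moves on to the next task. A retry policy could be layered on top of the timed-out branch.

```python
import asyncio
import logging
from dataclasses import dataclass, field

log = logging.getLogger("rebalance")


@dataclass
class RebalanceCoordinator:
    # Hypothetical stand-in for the proposed rebalance.task.timeout option.
    task_timeout_s: float
    # task_id -> "completed" or "timed_out"
    outcomes: dict = field(default_factory=dict)

    async def run(self, tasks):
        """Run tasks one by one.

        `tasks` is a list of (task_id, dispatch) pairs, where dispatch()
        returns an awaitable that resolves when the completion event arrives.
        """
        for task_id, dispatch in tasks:
            try:
                # Wait for the completion event, but never longer than the timeout.
                await asyncio.wait_for(dispatch(), timeout=self.task_timeout_s)
                self.outcomes[task_id] = "completed"
                log.info("rebalance task %s completed normally", task_id)
            except asyncio.TimeoutError:
                # Completion event lost or late: mark the task and keep going.
                self.outcomes[task_id] = "timed_out"
                log.warning(
                    "rebalance task %s timed out after %.3fs, moving on",
                    task_id,
                    self.task_timeout_s,
                )


async def completes():
    # Simulates a task whose completion event arrives immediately.
    return None


async def never_completes():
    # Simulates a lost completion event: this event is never set.
    await asyncio.Event().wait()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    coordinator = RebalanceCoordinator(task_timeout_s=0.1)
    asyncio.run(
        coordinator.run([("t1", completes), ("t2", never_completes), ("t3", completes)])
    )
    print(coordinator.outcomes)
```

In this sketch the task with the lost event no longer blocks the queue: the tasks after it still run, and the log shows a warning for the timeout and an info line for each normal completion.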
Anything else?
No response
Willingness to contribute