Skip to content

fix(amber): make region kill synchronous before scheduling next region#4557

Merged
aglinxinyuan merged 21 commits into
mainfrom
xiaozhen-sync-region-kill
Apr 30, 2026
Merged

fix(amber): make region kill synchronous before scheduling next region#4557
aglinxinyuan merged 21 commits into
mainfrom
xiaozhen-sync-region-kill

Conversation

@Xiao-zhen-Liu
Copy link
Copy Markdown
Contributor

@Xiao-zhen-Liu Xiao-zhen-Liu commented Apr 29, 2026

What changes were proposed in this PR?

This PR makes region termination synchronous with respect to region scheduling. Previously, the workflow coordinator could schedule the next region before the previous region's workers were fully terminated.

The main changes are:

  • Keep RegionExecutionCoordinator in a non-completed phase while worker termination is still in progress.
  • Store and reuse a single termination future per region so repeated coordination calls do not start duplicate kill attempts.
  • Wait for region termination futures in WorkflowExecutionCoordinator before scheduling the next region.
  • Send gracefulStop only after all workers successfully reply to endWorker.
  • Retry region termination every 200ms if endWorker fails because a worker still has queued messages.
  • Keep EndHandler strict: endWorker fails whenever the worker still has any unprocessed message.
  • Prevent premature workflow completion while there are pending scheduled regions or unfinished region coordinators.
  • Emit final workflow completion from the coordinator terminal point when no next region exists.

Any related issues, documentation, discussions?

Closes #4556

How was this PR tested?

Manually tested; Added test cases for RegionExecutionCoordinator, WorkflowExecutionCoordinator , and EndHandler.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex

@Xiao-zhen-Liu Xiao-zhen-Liu self-assigned this Apr 29, 2026
@Xiao-zhen-Liu Xiao-zhen-Liu added engine fix backend Anything related to backend services labels Apr 29, 2026
@chenlica
Copy link
Copy Markdown
Contributor

The design looks right to me. Can we add a visual diagram to illustrate it?

@Yicong-Huang
Copy link
Copy Markdown
Contributor

I see there are a few recent changes with scheduler. Given its importance, I hope we can be careful on those changes. Two questions:

  1. Can we add tests cases to verify this works as expected? also anti-test the bugging situation will not happen.
  2. Do we need this before release 1.1.0, for can this wait? This Workers in a previous region are not fully terminated before the next region is started #4556 its a bug issue but described like a task. If we don't fix it, will it block users?

@Xiao-zhen-Liu
Copy link
Copy Markdown
Contributor Author

I see there are a few recent changes with scheduler. Given its importance, I hope we can be careful on those changes. Two questions:

  1. Can we add tests cases to verify this works as expected? also anti-test the bugging situation will not happen.
  2. Do we need this before release 1.1.0, for can this wait? This Make region termination synchronous before scheduling the next region #4556 its a bug issue but described like a task. If we don't fix it, will it block users?
  1. Yes test cases will be added.
  2. Not needed for the release. There should be no effect on user experience before and after this PR. The previous asynchronous termination design was okay, since it does not matter (aside from the resource management part) if the next region starts before the previous region's workers are fully terminated (only the materialized results are required anyway). Synchronous termination is needed however for @aglinxinyuan 's control block work, as we need the ability to restart / reinitialize a region to be able to run an operator multiple times. It is also a cleaner design, so it is beneficial to merge it.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Thanks. It will be good to have those discussions on the bug issue, before this PR.

If you want to include this in the release, we need more test cases and runs of tests. so in general I suggest we merge it after the release if it only blocks control block, which is an ongoing work.

@Xiao-zhen-Liu
Copy link
Copy Markdown
Contributor Author

Thanks. It will be good to have those discussions on the bug issue, before this PR.

If you want to include this in the release, we need more test cases and runs of tests. so in general I suggest we merge it after the release if it only blocks control block, which is an ongoing work.

Test cases added. We can merge it after the release. It does not affect current user experience.

Copy link
Copy Markdown
Contributor

@aglinxinyuan aglinxinyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The changes in this PR were discussed offline, and both the issue and PR description accurately reflect that discussion. Let’s hold off on merging until the release.

@aglinxinyuan aglinxinyuan enabled auto-merge (squash) April 30, 2026 23:04
@aglinxinyuan aglinxinyuan merged commit 9eb33ce into main Apr 30, 2026
17 checks passed
@aglinxinyuan aglinxinyuan deleted the xiaozhen-sync-region-kill branch April 30, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Anything related to backend services engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workers in a previous region are not fully terminated before the next region is started

4 participants