
[Chore](pipeline) make wake up do not return error #62190

Merged
BiteTheDDDDt merged 3 commits into apache:master from BiteTheDDDDt:dev_0408_2
Apr 9, 2026

Conversation

@BiteTheDDDDt
Contributor

@BiteTheDDDDt BiteTheDDDDt commented Apr 8, 2026

This pull request refactors the PipelineTask::wake_up method to improve error handling and simplify its interface. The method now returns void instead of Status, and errors are handled internally by canceling the associated fragment if necessary. Corresponding updates are made throughout the codebase and tests to accommodate this change.

Core Refactoring and Error Handling:

  • Changed the signature of PipelineTask::wake_up from returning Status to void, and updated its implementation to handle errors by canceling the fragment context internally, rather than propagating Status up the call stack.
  • Updated the Dependency::set_ready method to call the new wake_up signature without error checking, since errors are now handled inside wake_up.
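
As a rough illustration of the pattern described in this list, the sketch below shows a void-returning wake-up that reacts to a failed submit by cancelling its fragment. All types here (`Status`, `FragmentContext`, `Scheduler`, `Task`) are hypothetical stand-ins invented for the example and do not mirror the real Doris classes or signatures.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; none of these mirror the real Doris classes.
struct Status {
    bool ok_flag;
    std::string msg;
    bool ok() const { return ok_flag; }
};

struct FragmentContext {
    bool cancelled = false;
    std::string reason;
    void cancel(const Status& st) {
        cancelled = true;
        reason = st.msg;
    }
};

struct Scheduler {
    bool fail_submit = false; // toggled to simulate a scheduler failure
    Status submit() {
        return fail_submit ? Status{false, "submit failed"} : Status{true, ""};
    }
};

struct Task {
    FragmentContext* fragment;
    Scheduler* scheduler;

    // Returns void: a failed submit cancels the fragment instead of
    // handing a Status back to the dependency that triggered the wake-up.
    void wake_up() {
        Status st = scheduler->submit();
        if (!st.ok()) {
            fragment->cancel(st);
        }
    }
};

// Small driver: reports whether the fragment ended up cancelled.
bool wake_up_cancels_on_failure(bool fail_submit) {
    FragmentContext frag;
    Scheduler sched;
    sched.fail_submit = fail_submit;
    Task task{&frag, &sched};
    task.wake_up();
    return frag.cancelled;
}
```

With this shape the caller has nothing to check after the call; the failure is visible only through the fragment's cancelled state.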

Test Adjustments:

  • Modified unit tests in pipeline_task_test.cpp to remove checks on the return value of wake_up and instead directly call the new void method.

Copilot AI review requested due to automatic review settings April 8, 2026 02:02
@Thearas
Contributor

Thearas commented Apr 8, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Contributor Author

run buildall

@BiteTheDDDDt
Contributor Author

/review

Contributor

Copilot AI left a comment

Pull request overview

This PR hardens the pipeline dependency wake-up path by making Dependency::set_ready() non-throwing and adding exception handling around task wake-up to avoid leaving tasks stuck in BLOCKED state.

Changes:

  • Mark Dependency::set_ready() as noexcept.
  • Add try/catch around PipelineTask::wake_up() to cancel the query if wake-up fails for a non-finalized task.
  • Handle both Doris exceptions and standard/unexpected exceptions during wake-up.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| be/src/exec/pipeline/dependency.h | Marks Dependency::set_ready() as noexcept. |
| be/src/exec/pipeline/dependency.cpp | Wraps task wake-up logic with exception handling and best-effort recovery via query cancellation. |

Comment thread be/src/exec/pipeline/dependency.cpp Outdated
Comment thread be/src/exec/pipeline/dependency.cpp
Contributor

@github-actions github-actions bot left a comment

Found 1 issue.

  1. be/src/exec/pipeline/dependency.cpp: the new recovery path treats only FINALIZED tasks as benign, but PipelineTask::close() moves tasks to FINISHED before finalize() runs. A delayed dependency wake-up in that window can return "Task state transition from FINISHED to RUNNABLE is not allowed", and this patch now escalates that benign late notification into frag->cancel(...), turning task completion races into query failures.

Critical checkpoint conclusions:

  • Goal / proof: The goal is to make Dependency::set_ready() non-throwing and avoid hangs. The current code does not fully accomplish that safely because it can incorrectly cancel a completing query. No new test covers the delayed wake-up / completion race.
  • Small / focused change: Yes, the patch is small and localized.
  • Concurrency: Applicable. The change is in a concurrent wake-up path; lock scope and ordering remain small, but the completion-state classification in the race window is incorrect.
  • Lifecycle / static init: No static-init concern. Task lifecycle is central here, and the review finding is in the FINISHED -> FINALIZED window.
  • Configuration: Not applicable.
  • Compatibility / storage format: Not applicable.
  • Parallel code paths: No additional analogous path requiring the same change was identified.
  • Special conditional checks: The new t->is_finalized() check is too narrow for the intended “already completed” case.
  • Test coverage: Missing targeted coverage for delayed wake-up and cancellation/completion races.
  • Observability: The added warning logs include the dependency name and status; observability is otherwise adequate.
  • Transaction / persistence: Not applicable.
  • Data writes / atomicity: Not applicable.
  • FE-BE variable passing: Not applicable.
  • Performance: No major performance issue identified in this patch.
  • Other issues: None beyond the false-positive cancellation above.

        local_block_task.swap(_blocked_task);
    }
    for (auto task : local_block_task) {
        if (auto t = task.lock()) {

t->is_finalized() is too narrow for the new "benign late wake-up" check. PipelineTask::close() sets the state to FINISHED before finalize() flips it to FINALIZED (pipeline_task.cpp), so a delayed dependency notification can legitimately hit a task that is already completed but not yet finalized. In that window wake_up() returns "Task state transition from FINISHED to RUNNABLE is not allowed", and this code now cancels the whole fragment even though the task has already finished.

Please treat the pre-finalize FINISHED state as completed too, or key off the specific state-transition failure instead of only is_finalized().
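
The first suggestion can be sketched roughly as follows. The `State` enum and the helper names are hypothetical and only model the two post-completion states named in this comment; the real task state machine has more states and its own accessors.

```cpp
#include <cassert>

// Hypothetical subset of task states, named after the ones in the review.
enum class State { BLOCKED, RUNNABLE, FINISHED, FINALIZED };

// Too narrow: misses the window between close() and finalize().
bool is_finalized(State s) {
    return s == State::FINALIZED;
}

// Treats both post-completion states as "already done".
bool is_completed(State s) {
    return s == State::FINISHED || s == State::FINALIZED;
}

// A failed late wake-up is escalated to cancellation only when the
// task has not already completed.
bool should_cancel_on_failed_wake_up(State s) {
    return !is_completed(s);
}
```

Under these assumptions a task in FINISHED is no longer escalated to a fragment cancel, while a task still in BLOCKED would be.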

@BiteTheDDDDt
Contributor Author

/review

@BiteTheDDDDt
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

Found 1 issue that should be fixed before merge.

Critical checkpoints:

  • Goal / correctness: Not satisfied. The new noexcept path still allows blocked tasks to be stranded if an exception aborts the wake-up loop before fragment cancellation.
  • Minimality / focus: Yes. The change is small and localized to Dependency::set_ready().
  • Concurrency: Risk remains. _ready is published and _blocked_task is cleared before every waiter is safely handled, so an exception can break the task lifecycle.
  • Lifecycle / initialization: No new lifecycle or static initialization issues seen.
  • Configuration: N/A.
  • Compatibility: N/A.
  • Parallel paths: No additional matching path updates appear required.
  • Special checks: The new catch blocks do not guarantee cancellation or recovery on the exception path.
  • Test coverage: No test covers the wake_up exception path.
  • Observability: Non-OK status logging is useful, but the std::exception log says "cancelling query" even though this path does not cancel.
  • Transaction / persistence: N/A.
  • Data writes / modifications: N/A.
  • FE-BE variable passing: N/A.
  • Performance: No primary hot-path regression identified.
  • Other issues: None beyond the exception-handling hang risk above.

Overall opinion: not ready to merge as-is.

Comment thread be/src/exec/pipeline/dependency.cpp Outdated
} catch (const std::exception& e) {
    // Non-Doris exceptions (e.g. std::bad_alloc from scheduler submit path).
    LOG(WARNING) << "Dependency::set_ready(): unexpected exception during wake_up, "
                    "cancelling query. dep="

Once _ready is set and _blocked_task is swapped out, any exception that escapes one iteration aborts the rest of local_block_task. The outer catch then only logs and returns. A concrete case is std::bad_alloc thrown from t->wake_up() or from the logging/cancel code in this loop: the current task is never resubmitted, the dependency is already marked ready, and no later set_ready() call will revisit the dropped waiters. That leaves the fragment hanging instead of cancelling it. The exception handling needs to be per task, with fragment cancellation or equivalent unblock logic before continuing to the next waiter.
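
A minimal sketch of the per-task handling requested here, under simplifying assumptions: `Waiter` and `wake_all` are hypothetical names, the recovery step is reduced to setting a flag, and the sketch does not model the case where the recovery code itself throws.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical waiter: wake_up may throw; the flags record the outcome.
struct Waiter {
    std::function<void()> wake_up;
    bool woken = false;
    bool fragment_cancelled = false;
};

// Each waiter gets its own try/catch so that one failure cannot
// abort the loop and leave the remaining waiters unvisited.
void wake_all(std::vector<Waiter>& waiters) noexcept {
    for (auto& w : waiters) {
        try {
            w.wake_up();
            w.woken = true;
        } catch (const std::exception&) {
            w.fragment_cancelled = true; // stand-in for cancelling the fragment
        } catch (...) {
            w.fragment_cancelled = true;
        }
    }
}

// Driver: the middle waiter throws; returns how many waiters were handled.
int count_handled() {
    std::vector<Waiter> waiters(3);
    waiters[0].wake_up = [] {};
    waiters[1].wake_up = [] { throw std::runtime_error("submit failed"); };
    waiters[2].wake_up = [] {};
    wake_all(waiters);
    int handled = 0;
    for (const auto& w : waiters) {
        if (w.woken || w.fragment_cancelled) {
            ++handled;
        }
    }
    return handled;
}
```

The point of the design is that the third waiter is still visited even though the second one threw, which a single try/catch around the whole loop would not guarantee.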

@BiteTheDDDDt
Contributor Author

/review

@BiteTheDDDDt
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

Findings:

  1. be/src/exec/pipeline/dependency.cpp: the new wake_up() error path leaves the task in an inconsistent lifecycle state. PipelineTask::wake_up() clears _blocked_dep and transitions the task to RUNNABLE before calling submit(). If submit() returns non-OK here, set_ready() now only cancels the fragment and returns. At that point the task is no longer in _blocked_task, is no longer blocked on the dependency, and was never enqueued, so no worker thread will ever run close_task()/decrement_running_task() for it. The fragment cancel path only calls unblock_all_dependencies(); it does not schedule or close this stranded task. This is a real hang / leaked-running-task regression on the exact scheduler-failure path the old THROW_IF_ERROR propagated to the caller.

  2. be/src/exec/pipeline/dependency.cpp: the new std::exception catch still aborts the BE process via CHECK(false). That contradicts the PR goal of handling std::bad_alloc and other standard exceptions gracefully. On the OOM/logging/cancel failure path described in the PR body, this change still turns the query-local failure into a process crash.

Critical checkpoint conclusions:

  • Goal of current task: make Dependency::set_ready() non-throwing and avoid tasks hanging when wake-up/recovery fails. Conclusion: not achieved; the new submit() failure handling can strand a task permanently, and the std::exception branch still aborts the process. No test in this PR demonstrates the new behavior.
  • Minimal / focused change: mostly focused to Dependency::set_ready(), but the behavioral change is not safely integrated with the surrounding task lifecycle.
  • Concurrency: applicable. Dependency::_blocked_task, PipelineTask::_blocked_dep, and task scheduler queues are part of the wake-up handshake. The new code keeps the lock scope lightweight, but it breaks the state-machine contract when wake_up() fails after mutating task state.
  • Lifecycle management: applicable. Task lifecycle is BLOCKED -> RUNNABLE -> scheduled/closed; the new path can stop after RUNNABLE without ever reaching scheduler ownership or close/finalize.
  • Configuration items: not applicable.
  • Compatibility changes: not applicable.
  • Parallel code paths: applicable. Other wake-up/unblock paths (PipelineFragmentContext::cancel(), Pipeline::make_all_runnable(), PipelineTask::unblock_all_dependencies()) still assume dependency wake-up either succeeds or propagates failure; they do not repair the stranded-task case introduced here.
  • Special conditional checks: applicable. The new if (!st.ok()) recovery branch needs stronger guarantees or compensation logic, which are currently missing.
  • Test coverage: insufficient. I did not find a regression test or unit test covering failed wake_up() / failed scheduler submission / exception during recovery.
  • Observability: partial. The warning log includes dependency/task/status, which is good, but observability does not compensate for the stranded-task bug or the process-abort path.
  • Transaction / persistence: not applicable.
  • Data writes / modifications: not applicable.
  • FE/BE variable passing: not applicable.
  • Performance: no material concern in the patch itself; correctness/lifecycle issues dominate.
  • Other issues: none beyond the two findings above.

Overall opinion: the PR should not be merged as-is because it introduces a task-lifecycle regression on wake-up failure and does not actually provide graceful handling for std::exception / OOM paths.

Comment thread be/src/exec/pipeline/dependency.cpp Outdated
if (auto t = task.lock()) {
    std::unique_lock<std::mutex> lc(_task_lock);
    auto st = t->wake_up(this, lc);
    if (!st.ok()) {

PipelineTask::wake_up() is not an atomic "enqueue me" operation. It first clears _blocked_dep and transitions the task to RUNNABLE, and only then calls submit() (pipeline_task.cpp:1047-1055). If submit() returns non-OK here, this new branch only cancels the fragment and returns. The task has already been removed from _blocked_task, is no longer blocked on this dependency, and was never inserted into the scheduler queue, so no worker thread will ever call close_task() / decrement_running_task() for it. PipelineFragmentContext::cancel() only calls unblock_all_dependencies(), which does not repair that stranded-task state. This turns a scheduler failure into a potential permanent hang/leaked running-task count.
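
The ordering problem described above can be modelled with a toy state machine. Everything in this sketch is hypothetical (the `Task` fields, the `submit_succeeds` parameter, the `stranded()` predicate); it only reproduces the sequence "mutate state first, submit last" to show which state a failed submit leaves behind.

```cpp
#include <cassert>

// Hypothetical model of the handshake described in the comment.
enum class State { BLOCKED, RUNNABLE };

struct Task {
    State state = State::BLOCKED;
    bool blocked_on_dep = true;
    bool enqueued = false;

    // Same ordering as in the review: state is mutated first, submit happens last.
    bool wake_up(bool submit_succeeds) {
        blocked_on_dep = false;
        state = State::RUNNABLE;
        if (!submit_succeeds) {
            return false; // state already changed, but no queue owns the task
        }
        enqueued = true;
        return true;
    }

    // "Stranded": runnable, not waiting on any dependency, and in no queue.
    bool stranded() const {
        return state == State::RUNNABLE && !blocked_on_dep && !enqueued;
    }
};

// Driver: reports whether a task is stranded after one wake-up attempt.
bool is_stranded_after(bool submit_succeeds) {
    Task t;
    t.wake_up(submit_succeeds);
    return t.stranded();
}
```

In this model a failed submit leaves the task RUNNABLE with no owner, which is the state that cancellation alone would not repair.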

Comment thread be/src/exec/pipeline/dependency.cpp Outdated
@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 57.14% (16/28) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.98% (20105/37947)
Line Coverage 36.53% (188935/517218)
Region Coverage 32.80% (146674/447143)
Branch Coverage 33.93% (64215/189235)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 64.29% (18/28) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.52% (27323/37162)
Line Coverage 57.14% (294623/515636)
Region Coverage 54.23% (244736/451276)
Branch Coverage 56.00% (106294/189817)

@BiteTheDDDDt BiteTheDDDDt changed the title from "[Chore](pipeline) make Dependency::set_ready() noexcept" to "[Chore](pipeline) make wake up do not return error" Apr 8, 2026
update

update

fmt

update
@BiteTheDDDDt
Contributor Author

run buildall

@BiteTheDDDDt
Contributor Author

/review

Contributor

@github-actions github-actions bot left a comment

Found 1 issue.

  1. be/src/exec/pipeline/pipeline_task.h: the wake_up() signature change was not propagated to every call site. be/test/exec/pipeline/pipeline_task_test.cpp:557 still calls task->wake_up(dep.get(), lc).ok(), so this PR does not compile as-is.

Critical checkpoint conclusions:

  • Goal of the task: The goal is to make Dependency::set_ready() stop propagating Status from wake_up(). The implementation is close, but the remaining stale test call means the change set does not currently build, so the goal is not fully accomplished.
  • Modification size/focus: Yes. The patch is small and focused on the wake-up/error-handling path.
  • Concurrency: I reviewed the Dependency::set_ready() -> PipelineTask::wake_up() path. It remains concurrency-sensitive, but I did not confirm a separate lock-order regression from this diff beyond the build break.
  • Special lifecycle/static initialization: No special lifecycle or static initialization issue is introduced in the reviewed code.
  • Configuration changes: None.
  • Compatibility/storage format changes: None.
  • Parallel code paths: The production caller in Dependency::set_ready() was updated, but the unit test still uses the old return contract, so not all relevant paths were updated.
  • Special conditional checks: The FINISHED/FINALIZED guard remains justified by the existing comment and matches the intended no-resubmit behavior.
  • Test coverage: Incomplete. The updated signature currently breaks an existing unit test call site, so there is no evidence of passing coverage for this refactor.
  • Observability: No additional logging or metrics appear necessary for this narrow refactor.
  • Transaction/persistence/data-write concerns: Not applicable.
  • FE-BE variable passing: Not applicable.
  • Performance: No obvious steady-state performance regression is visible in the reviewed diff.
  • Other issues: None beyond the compile break above.

}

- Status wake_up(Dependency* dep, std::unique_lock<std::mutex>& /* dep_lock */);
+ void wake_up(Dependency* dep, std::unique_lock<std::mutex>& /* dep_lock */);

This signature change is not fully propagated yet. be/test/exec/pipeline/pipeline_task_test.cpp:557 still does EXPECT_TRUE(task->wake_up(dep.get(), lc).ok());, so the PR no longer compiles because wake_up() now returns void. Please update that remaining test call site together with the two lines already changed above.
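
One way such a call site could be adapted, shown with a hypothetical `Task` type rather than the real test fixture: once the method returns void there is no result to call `.ok()` on, so the check moves to state that the call is expected to change.

```cpp
#include <cassert>

// Hypothetical task with a void-returning wake_up; not the real PipelineTask.
struct Task {
    bool runnable = false;
    void wake_up() { runnable = true; }
};

// Old style, which no longer compiles once wake_up() returns void:
//   EXPECT_TRUE(task.wake_up().ok());
// New style: call the method, then assert on observable state.
bool wake_up_makes_runnable() {
    Task task;
    task.wake_up();
    return task.runnable;
}
```

Which state the real test should inspect after the call is not stated in this thread, so the `runnable` flag is only an assumption for illustration.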

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 60.61% (40/66) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (27401/37166)
Line Coverage 57.34% (295690/515677)
Region Coverage 54.49% (245857/451230)
Branch Coverage 56.22% (106714/189828)

@BiteTheDDDDt
Contributor Author

run buildall

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Apr 9, 2026
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 66.67% (8/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.99% (20122/37971)
Line Coverage 36.55% (189162/517611)
Region Coverage 32.80% (146829/447607)
Branch Coverage 33.92% (64253/189415)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.57% (62/70) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27400/37186)
Line Coverage 57.29% (295633/516029)
Region Coverage 54.44% (245939/451740)
Branch Coverage 56.15% (106691/189997)

@BiteTheDDDDt BiteTheDDDDt merged commit 443ae22 into apache:master Apr 9, 2026
29 of 31 checks passed

Labels

approved (Indicates a PR has been approved by one committer.), reviewed

6 participants