[Bug](sink) fix cancel_at_time not work at parallel_sink/parallel_outfile #61739
BiteTheDDDDt wants to merge 3 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
Pull request overview
Fixes deferred result-buffer cancellation for parallel result sink / parallel outfile by using the same identifier for cancellation as was used when registering the buffer in ResultBufferMgr.
Changes:
- Update `ResultSinkLocalState::close()` to cancel by `query_id` when `enable_parallel_result_sink` is enabled; otherwise cancel by `fragment_instance_id`.
- Update `ResultFileSinkLocalState::close()` to cancel by `query_id` when `enable_parallel_outfile` is enabled; otherwise cancel by `fragment_instance_id`.
- Add clarifying comments explaining the key selection.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| be/src/exec/operator/result_sink_operator.cpp | Aligns deferred buffer cancellation key with parallel/non-parallel sender registration key for result sink. |
| be/src/exec/operator/result_file_sink_operator.cpp | Aligns deferred buffer cancellation key with parallel/non-parallel sender registration key for outfile sink. |
```cpp
// In parallel result sink mode, the buffer is registered under query_id; otherwise
// it is registered under fragment_instance_id. Pass the matching key so the
// deferred cancel actually finds and removes the buffer entry.
state->exec_env()->result_mgr()->cancel_at_time(
        time(nullptr) + config::result_buffer_cancelled_interval_time,
        state->query_options().enable_parallel_result_sink ? state->query_id()
                                                           : state->fragment_instance_id());
```
In parallel result sink mode, multiple fragment instances share the same ResultBlockBuffer keyed by query_id, and close() is invoked per fragment instance (removing one dependency each time). Scheduling cancel_at_time(..., query_id) on every instance close can cancel/erase the shared buffer after the interval even if other instances are still producing results (i.e., before the buffer is actually fully closed), which can break result fetching and/or cancel in-flight execution. Consider scheduling the deferred cancel only when the shared buffer transitions to fully closed (after the last dependency is removed), e.g., by moving this scheduling into ResultBlockBuffer::close when _result_sink_dependencies becomes empty, or by extending the ResultBlockBufferBase::close contract to indicate “final close” so callers can safely schedule cancel once.
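The reviewer's suggestion can be sketched as a minimal, self-contained model (the class and method names here are illustrative only, not the actual Doris types): a shared buffer counts its registered fragment instances and reports "fully closed" only when the last one calls `close()`, so the caller can schedule the deferred cancel exactly once.

```cpp
#include <mutex>
#include <set>
#include <string>

// Minimal sketch, assuming a shared buffer keyed by query_id that tracks its
// registered fragment instances. Only when the last instance closes does
// close() report is_fully_closed == true, which is the one safe point to
// schedule cancel_at_time on the shared key.
class SharedResultBuffer {
public:
    void register_instance(const std::string& instance_id) {
        std::lock_guard<std::mutex> guard(_lock);
        _open_instances.insert(instance_id);
    }

    // Sets |is_fully_closed| to true only when this call removed the last
    // registered instance; callers schedule deferred cleanup only then.
    void close(const std::string& instance_id, bool& is_fully_closed) {
        std::lock_guard<std::mutex> guard(_lock);
        _open_instances.erase(instance_id);
        is_fully_closed = _open_instances.empty();
    }

private:
    std::mutex _lock;
    std::set<std::string> _open_instances;
};
```

Because the flag is computed and returned under the same lock that guards the instance set, each closing instance sees a consistent answer, and exactly one of them observes the final close.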
```cpp
// In parallel outfile mode, the buffer is registered under query_id; otherwise
// it is registered under fragment_instance_id. Pass the matching key so the
// deferred cancel actually finds and removes the buffer entry.
state->exec_env()->result_mgr()->cancel_at_time(
        time(nullptr) + config::result_buffer_cancelled_interval_time,
        state->query_options().enable_parallel_outfile ? state->query_id()
                                                       : state->fragment_instance_id());
```
Same concern as result sink: in parallel outfile mode the buffer is shared under query_id and ResultBlockBuffer::close only fully closes once all fragment-instance dependencies have been removed. Calling cancel_at_time(..., query_id) from every fragment instance close can cancel the shared buffer after the interval even if some instances are still running, potentially breaking client fetch and/or canceling remaining producers. Consider deferring scheduling until the buffer is fully closed (e.g., schedule inside ResultBlockBuffer::close when the last dependency is removed, or change close() to return/indicate when it performed the final close so this can be scheduled once).
run buildall
/review
Code Review Summary
Overall Assessment
The fix correctly addresses the bug where cancel_at_time was not working in parallel result-sink / parallel outfile mode. The root cause was that in parallel mode, the buffer is registered under query_id but cancel_at_time was unconditionally using fragment_instance_id, so it would try to cancel a non-existent entry. The approach of using is_fully_closed + buffer_id() is clean and correct.
Critical Checkpoint Conclusions
- **Goal / correctness:** The goal is to fix deferred buffer cleanup in parallel sink modes. The production code accomplishes this correctly by (a) only scheduling cleanup when the last instance closes the shared buffer, and (b) using `buffer_id()`, which returns the correct registration key regardless of mode. However, existing unit tests will not compile due to the signature change (see inline comment).
- **Minimal / focused:** Yes, the change is focused and minimal.
- **Concurrency:** The `is_fully_closed` output parameter is set under `_lock` inside `ResultBlockBuffer::close()`, and the caller reads it after `close()` returns, so there is no concurrent access issue. The `_result_sink_dependencies` map is correctly used as the reference-counting mechanism under the same lock.
- **Lifecycle:** `buffer_id()` returns a const reference to `_fragment_id`, a member of the buffer object. The buffer remains alive (held by the `_sender` shared_ptr) when `buffer_id()` is called, so there is no dangling reference.
- **Parallel code paths:** Both `result_file_sink_operator.cpp` and `result_sink_operator.cpp` are updated symmetrically. Good.
- **Error handling:** If `_sender->close()` returns an error, `RETURN_IF_ERROR` skips `cancel_at_time`. This is acceptable because `create_sender()` already registers a safety-net timeout (execution_timeout + 5s) that will eventually clean up the buffer.
- **Test coverage:** Build-breaking issue: 6 call sites in `be/test/exec/sink/result_block_buffer_test.cpp` (lines 163, 292, 311) and `be/test/exec/sink/arrow_result_block_buffer_test.cpp` (lines 176, 308, 327) still call `close()` with the old 3-argument signature. These must be updated to pass the new `bool& is_fully_closed` parameter. Additionally, the tests should be extended to verify `is_fully_closed` is correctly set (false when dependencies remain, true when the last one closes).
- **Observability:** No new logging needed; existing logging in `get_batch` already covers the buffer lifecycle.
- **Configuration:** No new config items.
- **FE-BE compatibility:** No protocol changes.
- **Performance:** No concerns; the change adds a bool output parameter and a conditional branch, both trivial.
- **Other issues:** None found.
```cpp
// Close one fragment instance's contribution to this buffer. When the last
// registered instance calls close(), |is_fully_closed| is set to true,
// indicating that no more producers will write to this buffer and callers may
// safely schedule deferred cleanup. The buffer is keyed in ResultBufferMgr
```
[Build Break] The `close()` signature changed from 3 to 4 parameters, but the following test call sites were not updated and will fail to compile:
- `be/test/exec/sink/result_block_buffer_test.cpp` lines 163, 292, 311
- `be/test/exec/sink/arrow_result_block_buffer_test.cpp` lines 176, 308, 327
All 6 sites call `buffer.close(ins_id, Status::..., 0)` with only 3 arguments. They need to be updated to pass `bool& is_fully_closed`, e.g.:

```cpp
bool is_fully_closed = false;
EXPECT_TRUE(buffer.close(ins_id, Status::OK(), 0, is_fully_closed).ok());
```

Also consider adding test assertions on `is_fully_closed` itself, e.g., verify it returns false when other dependencies remain, and true when the last instance closes.
run buildall
BE Regression && UT Coverage Report: increment line coverage / increment coverage report
This pull request refines the lifecycle management of result buffers in the execution engine. It introduces a mechanism to determine when a result buffer is fully closed (i.e., all fragment instances have finished writing), enabling deferred cleanup to be scheduled only at the appropriate time. The changes propagate this logic through the buffer interface, operator implementations, and associated tests.
Lifecycle management improvements:
- The `close` method in `ResultBlockBufferBase` and its implementations now takes a `bool& is_fully_closed` parameter, which is set to `true` only when the last fragment instance has closed the buffer. This allows callers to safely schedule deferred cleanup only when the buffer is no longer needed. (`be/src/runtime/result_block_buffer.h`, `be/src/runtime/result_block_buffer.cpp`)
- In both `ResultFileSinkLocalState::close` and `ResultSinkLocalState::close`, deferred cleanup via `cancel_at_time` is now scheduled only when `is_fully_closed` is true, and the correct buffer ID is used as the registration key, ensuring proper cleanup in both parallel and non-parallel modes. (`be/src/exec/operator/result_file_sink_operator.cpp`, `be/src/exec/operator/result_sink_operator.cpp`)

Interface and API changes:
- The `ResultBlockBufferBase` interface now includes a `buffer_id()` method, which returns the correct ID for use in cleanup, abstracting away the difference between parallel and non-parallel modes. (`be/src/runtime/result_block_buffer.h`)

Test updates:
- Tests are updated to the new `close` signature and verify the value of `is_fully_closed`, ensuring correct behavior in both normal and error scenarios. (`be/test/exec/sink/arrow_result_block_buffer_test.cpp`, `be/test/exec/sink/result_block_buffer_test.cpp`)
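The caller-side pattern described in the summary can be modeled end to end in a small self-contained sketch. All names below (`Buffer`, `BufferMgr`, `close_instance`) are illustrative stand-ins, not the Doris API; the point is that cleanup is scheduled exactly once, on the final close, and keyed by `buffer_id()` so the manager entry is actually found regardless of parallel or non-parallel mode.

```cpp
#include <map>
#include <set>
#include <string>

// Simplified model: the buffer exposes buffer_id() (its registration key:
// query_id in parallel mode, fragment_instance_id otherwise) and reports via
// an out-parameter whether a close() call was the final one.
struct Buffer {
    std::string id;                        // registration key in the manager
    std::set<std::string> open_instances;  // fragment instances still producing

    const std::string& buffer_id() const { return id; }

    void close(const std::string& instance, bool& is_fully_closed) {
        open_instances.erase(instance);
        is_fully_closed = open_instances.empty();
    }
};

struct BufferMgr {
    std::map<std::string, Buffer> buffers;
    std::set<std::string> cancel_scheduled;  // stand-in for cancel_at_time's queue

    void cancel_at_time(const std::string& key) { cancel_scheduled.insert(key); }
};

// Caller side: schedule deferred cleanup only when the last instance closes,
// using buffer_id() as the key so the entry is found in either mode.
void close_instance(BufferMgr& mgr, const std::string& key,
                    const std::string& instance) {
    Buffer& buf = mgr.buffers[key];
    bool is_fully_closed = false;
    buf.close(instance, is_fully_closed);
    if (is_fully_closed) {
        mgr.cancel_at_time(buf.buffer_id());
    }
}
```

With two instances sharing one buffer, the first `close_instance` call schedules nothing; only the second, final close enqueues the cancel, which is the behavior the PR's reviewers converged on.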