
[Bug](sink) fix cancel_at_time not work at parallel_sink/parallel_outfile #61739

Open

BiteTheDDDDt wants to merge 3 commits into apache:master from BiteTheDDDDt:fix_0326

Conversation

@BiteTheDDDDt
Contributor

@BiteTheDDDDt BiteTheDDDDt commented Mar 26, 2026

This pull request refines the lifecycle management of result buffers in the execution engine. It introduces a mechanism to determine when a result buffer is fully closed (i.e., all fragment instances have finished writing), enabling deferred cleanup to be scheduled only at the appropriate time. The changes propagate this logic through the buffer interface, operator implementations, and associated tests.

Lifecycle management improvements:

  • The close method in ResultBlockBufferBase and its implementations now takes a bool& is_fully_closed parameter, which is set to true only when the last fragment instance has closed the buffer. This allows callers to safely schedule deferred cleanup only when the buffer is no longer needed. (be/src/runtime/result_block_buffer.h, be/src/runtime/result_block_buffer.cpp) [1] [2] [3] [4]

  • In both ResultFileSinkLocalState::close and ResultSinkLocalState::close, deferred cleanup via cancel_at_time is now scheduled only when is_fully_closed is true, and the correct buffer ID is used as the registration key, ensuring proper cleanup in both parallel and non-parallel modes. (be/src/exec/operator/result_file_sink_operator.cpp, be/src/exec/operator/result_sink_operator.cpp) [1] [2]
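The close contract described above can be sketched with a toy buffer. This is a hypothetical mock (MockResultBuffer, register_instance are illustrative names, not the real Doris classes); only the shape of the new `bool& is_fully_closed` out-parameter mirrors the PR:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the new close() contract. Each fragment instance
// registers a dependency; close() removes one and reports whether the buffer
// is now fully closed, so the caller schedules deferred cleanup exactly once.
class MockResultBuffer {
public:
    void register_instance(const std::string& ins_id) { _deps[ins_id] = true; }

    // is_fully_closed is set to true only when the last registered
    // instance closes the buffer.
    bool close(const std::string& ins_id, bool& is_fully_closed) {
        is_fully_closed = false;
        if (_deps.erase(ins_id) == 0) {
            return false; // unknown instance, nothing closed
        }
        is_fully_closed = _deps.empty();
        return true;
    }

private:
    std::unordered_map<std::string, bool> _deps;
};
```

With two registered instances, the first close leaves `is_fully_closed` false and the second sets it true, which is exactly the signal the sink operators use before calling cancel_at_time.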

Interface and API changes:

  • The ResultBlockBufferBase interface now includes a buffer_id() method, which returns the correct ID for use in cleanup, abstracting away the difference between parallel and non-parallel modes. (be/src/runtime/result_block_buffer.h) [1] [2]
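The idea behind buffer_id() can be shown with a minimal stand-in (MockBuffer is hypothetical; buffer_id() and the parallel/non-parallel keying follow the description above): the buffer remembers the key it was registered under, so callers no longer branch on the mode.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch: in parallel mode the buffer is keyed by query_id,
// otherwise by fragment_instance_id. buffer_id() always returns the key
// that was used at registration time, hiding the mode from callers.
struct MockBuffer {
    MockBuffer(bool parallel, std::string query_id, std::string instance_id)
            : _id(parallel ? std::move(query_id) : std::move(instance_id)) {}

    const std::string& buffer_id() const { return _id; }

private:
    std::string _id;
};
```

A caller scheduling cleanup can then pass `buf.buffer_id()` unconditionally instead of re-deriving the key from query options.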

Test updates:

  • All unit tests for result block buffers are updated to use the new close signature and verify the value of is_fully_closed, ensuring correct behavior in both normal and error scenarios. (be/test/exec/sink/arrow_result_block_buffer_test.cpp, be/test/exec/sink/result_block_buffer_test.cpp) [1] [2] [3] [4] [5] [6]

Copilot AI review requested due to automatic review settings March 26, 2026 02:48
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message), and how it was fixed.
  2. Which behaviors were modified: what was the previous behavior, what is it now, why was it modified, and what are the possible impacts?
  3. What features were added, and why?
  4. Which code was refactored, and why?
  5. Which functions were optimized, and what is the difference before and after the optimization?

@BiteTheDDDDt
Contributor Author

run buildall


Copilot AI left a comment


Pull request overview

Fixes deferred result-buffer cancellation for parallel result sink / parallel outfile by using the same identifier for cancellation as was used when registering the buffer in ResultBufferMgr.

Changes:

  • Update ResultSinkLocalState::close() to cancel by query_id when enable_parallel_result_sink is enabled; otherwise cancel by fragment_instance_id.
  • Update ResultFileSinkLocalState::close() to cancel by query_id when enable_parallel_outfile is enabled; otherwise cancel by fragment_instance_id.
  • Add clarifying comments explaining the key selection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
be/src/exec/operator/result_sink_operator.cpp Aligns deferred buffer cancellation key with parallel/non-parallel sender registration key for result sink.
be/src/exec/operator/result_file_sink_operator.cpp Aligns deferred buffer cancellation key with parallel/non-parallel sender registration key for outfile sink.


Comment on lines +202 to +208
      // In parallel result sink mode, the buffer is registered under query_id; otherwise
      // it is registered under fragment_instance_id. Pass the matching key so the
      // deferred cancel actually finds and removes the buffer entry.
      state->exec_env()->result_mgr()->cancel_at_time(
              time(nullptr) + config::result_buffer_cancelled_interval_time,
-             state->fragment_instance_id());
+             state->query_options().enable_parallel_result_sink ? state->query_id()
+                                                                : state->fragment_instance_id());

Copilot AI Mar 26, 2026


In parallel result sink mode, multiple fragment instances share the same ResultBlockBuffer keyed by query_id, and close() is invoked per fragment instance (removing one dependency each time). Scheduling cancel_at_time(..., query_id) on every instance close can cancel/erase the shared buffer after the interval even if other instances are still producing results (i.e., before the buffer is actually fully closed), which can break result fetching and/or cancel in-flight execution. Consider scheduling the deferred cancel only when the shared buffer transitions to fully closed (after the last dependency is removed), e.g., by moving this scheduling into ResultBlockBuffer::close when _result_sink_dependencies becomes empty, or by extending the ResultBlockBufferBase::close contract to indicate “final close” so callers can safely schedule cancel once.

Comment on lines +138 to +144
      // In parallel outfile mode, the buffer is registered under query_id; otherwise
      // it is registered under fragment_instance_id. Pass the matching key so the
      // deferred cancel actually finds and removes the buffer entry.
      state->exec_env()->result_mgr()->cancel_at_time(
              time(nullptr) + config::result_buffer_cancelled_interval_time,
-             state->fragment_instance_id());
+             state->query_options().enable_parallel_outfile ? state->query_id()
+                                                            : state->fragment_instance_id());

Copilot AI Mar 26, 2026


Same concern as result sink: in parallel outfile mode the buffer is shared under query_id and ResultBlockBuffer::close only fully closes once all fragment-instance dependencies have been removed. Calling cancel_at_time(..., query_id) from every fragment instance close can cancel the shared buffer after the interval even if some instances are still running, potentially breaking client fetch and/or canceling remaining producers. Consider deferring scheduling until the buffer is fully closed (e.g., schedule inside ResultBlockBuffer::close when the last dependency is removed, or change close() to return/indicate when it performed the final close so this can be scheduled once).

@BiteTheDDDDt
Contributor Author

run buildall

@BiteTheDDDDt
Contributor Author

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

Overall Assessment

The fix correctly addresses the bug where cancel_at_time was not working in parallel result-sink / parallel outfile mode. The root cause was that in parallel mode, the buffer is registered under query_id but cancel_at_time was unconditionally using fragment_instance_id, so it would try to cancel a non-existent entry. The approach of using is_fully_closed + buffer_id() is clean and correct.
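The keying mismatch described in this assessment can be illustrated with a toy stand-in for ResultBufferMgr (MockBufferMgr and fire_cancels are invented for the sketch; cancel_at_time is the only name taken from the PR): a deferred cancel keyed by fragment_instance_id never finds a buffer that was registered under query_id.

```cpp
#include <cassert>
#include <ctime>
#include <map>
#include <string>

// Toy model of the root cause. Buffers and pending cancels are keyed by
// string IDs; a cancel registered under the wrong key silently does nothing.
struct MockBufferMgr {
    std::map<std::string, int> buffers;            // registration key -> buffer
    std::map<std::string, time_t> pending_cancels; // key -> scheduled cancel time

    void cancel_at_time(time_t t, const std::string& key) { pending_cancels[key] = t; }

    // Fires all pending cancels (time ignored in this toy); returns true
    // if any cancel actually removed a buffer.
    bool fire_cancels() {
        bool removed = false;
        for (const auto& [key, when] : pending_cancels) {
            removed |= buffers.erase(key) > 0;
        }
        pending_cancels.clear();
        return removed;
    }
};
```

Registering a buffer under "query_1" (parallel mode) and scheduling the cancel under "instance_1" reproduces the bug: the cancel fires but removes nothing, leaving the buffer behind until the safety-net timeout.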

Critical Checkpoint Conclusions

  1. Goal / correctness: The goal is to fix deferred buffer cleanup in parallel sink modes. The production code accomplishes this correctly by (a) only scheduling cleanup when the last instance closes the shared buffer, and (b) using buffer_id() which returns the correct registration key regardless of mode. However, existing unit tests will not compile due to the signature change (see inline comment).

  2. Minimal / focused: Yes, the change is focused and minimal.

  3. Concurrency: The is_fully_closed output parameter is set under _lock inside ResultBlockBuffer::close(), and the caller reads it after close() returns — no concurrent access issue. The _result_sink_dependencies map is correctly used as the reference-counting mechanism under the same lock.

  4. Lifecycle: buffer_id() returns a const reference to _fragment_id, a member of the buffer object. The buffer remains alive (held by _sender shared_ptr) when buffer_id() is called, so no dangling reference.

  5. Parallel code paths: Both result_file_sink_operator.cpp and result_sink_operator.cpp are updated symmetrically. Good.

  6. Error handling: If _sender->close() returns error, RETURN_IF_ERROR skips cancel_at_time. This is acceptable because create_sender() already registers a safety-net timeout (execution_timeout + 5s) that will eventually clean up the buffer.

  7. Test coverage: Build-breaking issue — 6 call sites in be/test/exec/sink/result_block_buffer_test.cpp (lines 163, 292, 311) and be/test/exec/sink/arrow_result_block_buffer_test.cpp (lines 176, 308, 327) still call close() with the old 3-argument signature. These must be updated to pass the new bool& is_fully_closed parameter. Additionally, the tests should be extended to verify is_fully_closed is correctly set (false when dependencies remain, true when the last one closes).

  8. Observability: No new logging needed; existing logging in get_batch already covers the buffer lifecycle.

  9. Configuration: No new config items.

  10. FE-BE compatibility: No protocol changes.

  11. Performance: No concerns — the change adds a bool output parameter and a conditional branch, both trivial.

  12. Other issues: None found.

// Close one fragment instance's contribution to this buffer. When the last
// registered instance calls close(), |is_fully_closed| is set to true,
// indicating that no more producers will write to this buffer and callers may
// safely schedule deferred cleanup. The buffer is keyed in ResultBufferMgr
Contributor


[Build Break] The close() signature changed from 3 to 4 parameters, but the following test call sites were not updated and will fail to compile:

  • be/test/exec/sink/result_block_buffer_test.cpp lines 163, 292, 311
  • be/test/exec/sink/arrow_result_block_buffer_test.cpp lines 176, 308, 327

All 6 sites call buffer.close(ins_id, Status::..., 0) with only 3 arguments. They need to be updated to pass bool& is_fully_closed, e.g.:

bool is_fully_closed = false;
EXPECT_TRUE(buffer.close(ins_id, Status::OK(), 0, is_fully_closed).ok());

Also consider adding test assertions on is_fully_closed itself — e.g., verify it returns false when other dependencies remain, and true when the last instance closes.

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.39% (26986/36773)
Line Coverage 56.84% (289402/509145)
Region Coverage 54.26% (241630/445342)
Branch Coverage 55.92% (104500/186865)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.39% (26989/36773)
Line Coverage 56.86% (289475/509145)
Region Coverage 54.27% (241692/445342)
Branch Coverage 55.95% (104550/186865)
