[fix](set) fix coredump of set op if total data size exceeds 4G#61471
jacktengg wants to merge 2 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
/review
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Code Review Summary
Overall Assessment
The bug analysis is correct and the fix is logically sound. The old code had a critical flaw: when total _mutable_block size exceeded 4GB, build_block was overwritten via to_block() while the hash table still held row_num indices from the previous block. During the source phase, _add_result_columns() uses these stale indices to access the last build_block, causing out-of-bounds reads and SIGSEGV.
The fix correctly removes the intermediate flush and only materializes build_block once at eos, ensuring all hash table row_num values reference valid rows in a single contiguous block.
Critical Checkpoint Conclusions
1. Does the code accomplish its goal? Is there a test that proves it?
The fix is correct — it eliminates the dangling-reference crash. However, no regression test is added to cover the >4GB scenario or even a multi-flush scenario. While reproducing 4GB+ in CI is impractical, a functional test verifying correctness with moderate data sizes would strengthen confidence. At minimum, a test exercising INTERSECT/EXCEPT with enough data to validate the single-build-block invariant should be added.
2. Is this modification as small, clear, and focused as possible?
Yes. The change is minimal: remove the BUILD_BLOCK_MAX_SIZE constant, change if (eos || ...) to if (eos), and un-nest the eos-only finalization code. The diff is clean and focused.
3. Concurrency concerns?
No new concurrency issues. The sink operator processes data single-threaded per pipeline task. The probe_finished_children_dependency signaling remains unchanged.
4. Memory/lifecycle implications?
This is the primary concern with the fix approach. Removing the 4GB flush means _mutable_block can grow unbounded in memory until eos. The set operators have no spill support (_spillable=false, revocable_mem_size()=0). The existing get_reserve_mem_size() reports current (retrospective) allocation, not prospective growth. If the build side has many GB of data with unique rows, this will consume proportional memory with no relief mechanism.
That said, this is a pre-existing limitation — even before this PR, the hash table itself grows unbounded. The BUILD_BLOCK_MAX_SIZE flush didn't reduce peak memory since the hash table still holds all entries. So the memory concern is not worsened by this fix; it just changes where the memory is held (mutable block vs. the final build_block — both coexist at flush time in the old code anyway).
5. Are there functionally parallel code paths?
No. The set sink operator is the only path that builds the set hash table.
6. Test coverage?
All existing set operation tests use trivially small data (3-100 rows). No test covers the >4GB or even multi-batch accumulation scenario. A regression test should be added.
7. Performance considerations?
No performance regression. The old code did to_block() + _process_build_block() potentially twice (once at 4GB, once at eos). The new code does it exactly once at eos. If anything, this is slightly more efficient for large datasets.
8. Observability?
No changes needed. The existing _hash_table_size and _valid_element_in_hash_table counters remain.
9. Compatibility?
No incompatible changes. This is a purely internal execution-engine fix.
10. Other concerns?
The ColumnStr<UInt32> has a hard 4GB char-data limit (check_chars_length in column_string.h). If a single string column accumulates >4GB of character data in the build block, it will throw STRING_OVERFLOW_IN_VEC_ENGINE. This is a pre-existing limitation unrelated to this PR, but worth noting since the old 4GB flush might have incidentally avoided this for some workloads.
Verdict
The fix is correct and necessary. The only actionable concern is the missing test coverage.
@@ -84,26 +83,24 @@ Status SetSinkOperatorX<is_intersect>::sink(RuntimeState* state, Block* in_block
[Test Coverage] The fix correctly eliminates the dangling-reference crash, but no regression test is added. Consider adding a regression test that exercises INTERSECT/EXCEPT with enough data volume (or enough batches) to validate the single-build-block invariant holds. Even a moderate-sized test with multiple sink() calls before eos would help prevent regression.
Additionally, the uint32_t::max row-count guard at line 80 is never tested. A negative test triggering the NotSupported error path would be valuable.
TPC-H: Total hot run time: 27207 ms
TPC-DS: Total hot run time: 169484 ms
Force-pushed from aa401fb to 1f08c50.
run buildall
/review
run buildall
Code Review Summary
Overview
This PR fixes a coredump (SIGSEGV) in set operations (INTERSECT/EXCEPT) when the total string data size exceeds 4GB. The root cause was that the old code flushed the build block in chunks via a 4GB BUILD_BLOCK_MAX_SIZE threshold, causing build_block to be overwritten and invalidating hash table row indices that pointed into the previous block.
Fix Approach
The fix adopts the same pattern used by hash join:
- merge_ignore_overflow replaces merge, allowing ColumnString (UInt32 offsets) to silently overflow during accumulation.
- convert_column_if_overflow converts an overflowed ColumnString to ColumnString64 (UInt64 offsets) after materialization.
- insert_indices_from replaces append_data_by_selector in the source operator to support reading from ColumnString64.
- The intermediate 4GB flush is removed -- all data is accumulated into a single MutableBlock until eos.
Critical Checkpoint Conclusions
1. Goal & Correctness: The fix correctly addresses the root cause (dangling row indices from overwritten build_block). The code now uses a single build_block that persists for the entire lifecycle. Both unit tests and regression tests prove correctness for the >4GB case. Pass.
2. Modification scope: The change is focused and minimal -- 4 files changed in production code with clear purpose. Pass.
3. Concurrency: No new concurrency concerns. The set sink operator is single-threaded for the build phase. N/A.
4. Lifecycle management: No new lifecycle issues. The _mutable_block and build_block lifecycle is simplified (single accumulation + single materialization). Pass.
5. Configuration items: No new config items added. N/A.
6. Incompatible changes: No format/protocol changes. N/A.
7. Parallel code paths: Both is_intersect=true and is_intersect=false template instantiations are covered. The pattern mirrors the hash join approach. Pass.
8. Special conditional checks: The MutableBlock::empty() initialization guard follows the established hash join pattern. Pass.
9. Test coverage: Good. Both INTERSECT and EXCEPT unit tests exercise the >4GB path. Regression test covers EXCEPT/INTERSECT with subset and self-join scenarios. Pass (minor issue noted below).
10. Observability: No new critical paths requiring extra observability. Existing timers and counters are preserved. Pass.
11. Transaction/persistence: N/A.
12. Data writes: N/A.
13. FE-BE variable passing: N/A.
14. Performance: Removing the intermediate flush is a net simplification. The merge_ignore_overflow approach avoids the overhead of checking check_chars_length() on every merge, which is a minor improvement. The convert_column_if_overflow has a one-time O(n) cost at eos. Hash table build and probe paths correctly handle ColumnString64 via virtual dispatch and explicit is_column_string64() checks. No performance concerns.
15. Other issues:
- The regression test drops the table at the end (line 101), which contradicts the test standard: "After completing tests, do not drop tables; instead drop tables before using them in tests, to preserve the environment for debugging." The table is already dropped before creation (line 22), which is correct. The final DROP TABLE at line 101 should be removed.
- The std::move(*in_block) in merge_ignore_overflow(std::move(*in_block)) is misleading -- merge_impl_ignore_overflow does not actually move data from the source (it copies via insert_range_from_ignore_overflow, which takes const IColumn&). This is harmless but could confuse future readers. However, the hash join uses the same pattern (hashjoin_build_sink.cpp:827), so this is consistent with existing code.
Verdict
The fix is correct and well-tested. The approach follows the established hash join pattern for handling >4GB string data. One minor test standards issue flagged inline.
| """ | ||
|
|
||
| sql """ DROP TABLE IF EXISTS test_set_op_large_string """ | ||
| } |
Per Doris test standards: "After completing tests, do not drop tables; instead drop tables before using them in tests, to preserve the environment for debugging." The table is already correctly dropped-before-create at line 22. This final DROP TABLE should be removed to preserve the table for post-failure debugging.
TPC-H: Total hot run time: 26989 ms
TPC-H: Total hot run time: 26650 ms
TPC-DS: Total hot run time: 169871 ms
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Root Cause Analysis
Core cause: in SetSinkOperatorX::sink(), build_block is overwritten multiple times, turning the older entries in the hash table into dangling references.
Failure chain
In set_sink_operator.cpp:52-56:
if (eos || local_state._mutable_block.allocated_bytes() >= BUILD_BLOCK_MAX_SIZE) { // 4GB
    build_block = local_state._mutable_block.to_block(); // overwrites build_block!
    RETURN_IF_ERROR(_process_build_block(local_state, build_block, state));
    local_state._mutable_block.clear();
}
When the total data size exceeds BUILD_BLOCK_MAX_SIZE (4GB), this flush triggers more than once:
- Batch 1 (rows 0..N1) is flushed, and the hash table stores row_num = 0, 1, ..., N1.
- On the next flush, batch 1's data is destroyed; the hash table adds row_num = 0, 1, ..., N2 for batch 2.
RowRefListWithFlags inherits from RowRef and stores only a uint32_t row_num (join_op.h:46) -- no block pointer or offset.
In hash_table_set_build.h:39, what gets stored at build time is Mapped {k}, i.e. the row number k.
In set_source_operator.cpp:161-162:
auto& column = *build_block.get_by_position(idx->second).column;
local_state._mutable_cols[idx->first]->insert_from(column, it->row_num);
At this point build_block is batch 2 from the last flush, but the row_num of hash table entries that came from batch 1 may exceed batch 2's row count.
When a batch 1 row_num = X (with X greater than batch 2's row count) is used in insert_from(column, X):
// column_string.h:180-197
const size_t size_to_append = src.offsets[X] - src.offsets[X - 1]; // out-of-bounds read -> garbage value
const size_t offset = src.offsets[X - 1]; // garbage value
// ...
memcpy(..., &src.chars[offset], size_to_append); // garbage offset -> access to unmapped memory -> SIGSEGV
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)