[fix](iterator) Use explicit output schema in new_merge_iterator and new_union_iterator#60772
Open
uchenily wants to merge 5 commits intoapache:masterfrom
Open
[fix](iterator) Use explicit output schema in new_merge_iterator and new_union_iterator#60772uchenily wants to merge 5 commits intoapache:masterfrom
uchenily wants to merge 5 commits intoapache:masterfrom
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
TPC-H: Total hot run time: 28891 ms |
TPC-DS: Total hot run time: 184014 ms |
Contributor
Author
|
run beut |
yiguolei
reviewed
Feb 15, 2026
yiguolei
reviewed
Feb 15, 2026
| #include "absl/strings/substitute.h" | ||
| #include "common/config.h" | ||
| #include "common/env_config.h" | ||
| // #include "common/env_config.h" |
yiguolei
reviewed
Feb 15, 2026
| VMergeIteratorContext(RowwiseIteratorUPtr&& iter, int sequence_id_idx, bool is_unique, | ||
| bool is_reverse, std::vector<uint32_t>* read_orderby_key_columns) | ||
| bool is_reverse, std::vector<uint32_t>* read_orderby_key_columns, | ||
| const Schema* output_schema) |
Contributor
There was a problem hiding this comment.
为什么要传递裸指针,不是直接使用shared ptr?
yiguolei
reviewed
Feb 15, 2026
| // src block may contain extra columns (e.g. delete-predicate columns) because | ||
| // SegmentIterator reads with input_schema = return_columns + delete columns | ||
| // to evaluate row filtering. We should only copy the requested output columns. | ||
| DCHECK_GE(src.columns(), output_cols); |
yiguolei
reviewed
Feb 15, 2026
| return Status::OK(); | ||
| } | ||
| size_t output_cols = _output_schema->num_column_ids(); | ||
| // src block may contain extra columns (e.g. delete-predicate columns) because |
Contributor
There was a problem hiding this comment.
SRC block 里不包括delete-predicate columns的,如果包括的话,你之前那个PR 为什么可以工作??
Contributor
Author
There was a problem hiding this comment.
我之前的pr是根据dst block来遍历的
Status VMergeIteratorContext::copy_rows(Block* block, bool advanced) {
...
for (size_t i = 0; i < block->columns(); ++i) {
...
}
yiguolei
reviewed
Feb 15, 2026
| // delete_hanlder is always set, but it maybe not init, so that it will return empty conditions | ||
| // or predicates when it is not inited. | ||
| if (_read_context->delete_handler != nullptr) { | ||
| _read_context->delete_handler->get_delete_conditions_after_version( |
Contributor
There was a problem hiding this comment.
把你github pr description 那个例子,弄成regression test 加入到PR 里
Contributor
Author
There was a problem hiding this comment.
有什么更直接的办法创建overlapping状态的rowset吗, 我复现的步骤里面需要改 be.conf 添加 write_buffer_size = 8 这个没法放到回归测试
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR ensures merge/union iterators use an explicit output schema projection and copy only the requested columns, preventing column count mismatches when delete-predicate columns are read in addition to return columns.
BetaRowsetReader now builds an
output_schemafrom return_columns and passes it to merge/union iterators, VMergeIteratorContext copies using the output schema (not the incorrect _iter->schema())Consider the following table:
And a delete predicate applied to a non-key column:
When executing ORDER BY k LIMIT n, Doris has a Top-N optimization. Even though the query is SELECT *, the engine initially avoids scanning all columns. It constructs a minimal intermediate schema containing only the sort keys (k) and the internal
__DORIS_ROWID_COL__to perform the merge and sorting efficiently. (_col_ids = {0, 3}, ==> _num_columns = 2). However, because a delete predicate exists on column v1, the BetaRowsetReader add v1 to this intermediate schema to evaluate and filter out deleted rows during the scan. (_col_ids = {0, 3, 1}, note that column v1 (index=1) is appended to this schema ==> _num_columns = 3)The previous implementation of VMergeIteratorContext::copy_rows used the incorrect _num_columns value, resulting in an array out-of-bounds access and causing BE coredumped.
Detailed reproduction steps are follows:
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)