[fix](inverted index) Split bound multi-segment readers#63138
airborne12 wants to merge 1 commit into
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: query_v2 search collection could pick the segment loop from readers.front() while leaf scorers resolved CLucene readers from reader_bindings or field_reader_bindings. SegmentPostings requires segment-level readers because multi-segment TermDocs::readBlock() is unsupported.
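The mismatch described in the summary can be illustrated with a minimal sketch. `FakeReader`, `FakeContext`, `resolve_leaf_reader`, and `leaf_reader_is_segment_level` are hypothetical stand-ins, not the Doris or CLucene API, and the fallback from a missing binding to `readers.front()` is an assumption made for illustration only.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a CLucene reader; not the real API.
struct FakeReader {
    int segment_count = 1;
    // Models the constraint from the summary: block reads need a segment-level reader.
    void read_block() const {
        if (segment_count > 1) {
            throw std::logic_error("readBlock() unsupported on a multi-segment reader");
        }
    }
};

// Hypothetical, heavily simplified execution context.
struct FakeContext {
    std::vector<std::shared_ptr<FakeReader>> readers;
    std::map<std::string, std::shared_ptr<FakeReader>> reader_bindings;
};

// Assumption: a leaf scorer prefers the bound reader and only falls back to
// readers.front() when no binding exists for the key.
std::shared_ptr<FakeReader> resolve_leaf_reader(const FakeContext& ctx, const std::string& key) {
    auto it = ctx.reader_bindings.find(key);
    if (it != ctx.reader_bindings.end()) {
        return it->second;
    }
    if (ctx.readers.empty()) {
        return nullptr;
    }
    return ctx.readers.front();
}

// True when the reader a leaf scorer would actually use is safe for block reads.
// Checking readers.front() alone would miss a multi-segment reader hidden in a binding.
bool leaf_reader_is_segment_level(const FakeContext& ctx, const std::string& key) {
    auto reader = resolve_leaf_reader(ctx, key);
    return reader != nullptr && reader->segment_count == 1;
}
```

In this model a context whose `readers.front()` is single-segment can still hand a multi-segment reader to a leaf scorer through a binding, which is the situation the summary describes.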
### Release note
None
### Check List (For Author)
- Test: Unit Test
- env CCACHE_DIR=/tmp/doris25499-master-ccache CCACHE_TEMPDIR=/tmp/doris25499-master-ccache-tmp ./run-be-ut.sh --run --filter=MultiSegmentCollectorTest.* -j 32
- Behavior changed: No
- Does this need documentation: No
7d22250 to 81370a7
run buildall
TPC-H: Total hot run time: 31274 ms
TPC-DS: Total hot run time: 170112 ms
BE Regression && UT Coverage Report
Increment line coverage
Increment coverage report
/review
Requesting changes due to a remaining correctness/crash path in query_v2 multi-segment handling.
Critical checkpoint conclusions:
- Goal/test: The PR addresses multi-segment reader splitting for collection and adds BE unit coverage for doc-set collection, but it does not cover or fix null-bitmap extraction after collection.
- Scope/focus: The change is focused, but the fix is incomplete because one downstream scorer-building path bypasses the new segment iterator.
- Concurrency/lifecycle: No new concurrency or non-obvious static lifecycle issue found. The non-owning subreader wrappers are used within the parent reader lifetime during collection.
- Compatibility/config/persistence: No config, protocol, storage-format, or persistence compatibility concern found.
- Parallel paths: A parallel query_v2 path remains: FunctionSearch extracts null bitmaps by building a scorer on the original context after collection.
- Tests: Added tests cover doc-set collection for MultiReader and bindings, but not top-k/WAND or null-bearing queries; the missing null case is directly related to the remaining issue.
- Observability/performance: No additional observability issue found; no clear performance regression beyond small per-segment context creation.
User focus: No additional user-provided review focus was supplied.
    template <typename SegmentCallback>
    void for_each_index_segment(const QueryExecutionContext& context, const std::string& binding_key,
                                SegmentCallback&& callback) {
        auto* reader = context.readers.empty() ? nullptr : context.readers.front().get();
This only fixes callers that go through for_each_index_segment, but FunctionSearch::evaluate_inverted_index_with_search_param still extracts NULLs after collection with weight->scorer(exec_ctx, root_binding_key) on the original exec_ctx (see the exec_ctx.null_resolver block after the collector call). Term/boolean scorers construct SegmentPostings immediately via create_term_posting(), so a query over a nullable field whose reader is MultiReader/MultiSegmentReader can still reach the same unsupported multi-segment TermDocs::readBlock() path after collection succeeds. Please route null-bitmap extraction through segment contexts as well, or otherwise avoid building SegmentPostings on the unsplit multi-segment reader, and add a nullable multi-segment test for this path.
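One way to picture the requested change is to gather NULL doc ids per segment and rebase them, instead of building a scorer on the unsplit reader. The sketch below is illustrative only: `FakeSegment` and `collect_null_docs` are hypothetical names, a `std::set` stands in for whatever bitmap type the real code uses, and doc bases are assumed to be the running sum of segment sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical segment view: its size and the segment-local doc ids that are NULL.
struct FakeSegment {
    uint32_t max_doc;
    std::vector<uint32_t> local_null_docs;
};

// Walks the segments one at a time and rebases each local NULL doc id by the
// segment's doc base, so every lookup happens against a segment-level view.
std::set<uint32_t> collect_null_docs(const std::vector<FakeSegment>& segments) {
    std::set<uint32_t> nulls;
    uint32_t doc_base = 0;
    for (const auto& seg : segments) {
        for (uint32_t local : seg.local_null_docs) {
            assert(local < seg.max_doc);  // local ids must stay inside their segment
            nulls.insert(doc_base + local);
        }
        doc_base += seg.max_doc;  // the next segment starts after this one
    }
    return nulls;
}
```

With two segments of sizes 3 and 2, a NULL at local id 0 of the second segment maps to global doc id 3 in this model.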
PR approved by at least one committer and no changes requested.
### What problem does this PR solve?
Issue Number: DORIS-25499
Related PR: None
Problem Summary:
query_v2 search collection can drive segment iteration from readers.front() while leaf scorers resolve CLucene readers from reader_bindings or field_reader_bindings. If the actual leaf reader remains a multi-segment reader, SegmentPostings calls TermDocs::readBlock(), which CLucene does not support for multi-segment readers and can abort BE.

This PR selects the actual segmented reader from the execution context, rewrites all readers and reader bindings to segment-level readers for each callback, validates segmented reader topology, and keeps MultiSegmentReader/MultiReader doc base handling explicit. It also adds regression coverage for MultiReader, segmented field bindings, and explicit single-reader binding keys.
### Release note
None
### Check List (For Author)
- Test
    - MultiSegmentCollectorTest.CollectDocSetWithMultiReader
    - MultiSegmentCollectorTest.CollectDocSetWithSegmentedFieldBinding
    - MultiSegmentCollectorTest.CollectDocSetWithSingleReaderBinding
    - env CCACHE_DIR=/tmp/doris25499-master-ccache CCACHE_TEMPDIR=/tmp/doris25499-master-ccache-tmp ./run-be-ut.sh --run --filter=MultiSegmentCollectorTest.* -j 32
- Behavior changed:
- Does this need documentation?
### Check List (For Reviewer who merge this PR)
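The per-segment rewriting and topology validation mentioned in the updated summary could look roughly like the sketch below. It is not the PR's implementation: `FakeReader`, `Bindings`, `make_segment`, `make_multi`, `leaves`, and `for_each_segment` are hypothetical names, and the validation rule (same segment count and same per-segment `max_doc` across all bindings) is an assumed reading of "validates segmented reader topology".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader: an empty `subs` list means it is already segment-level.
struct FakeReader {
    uint32_t max_doc = 0;
    std::vector<std::shared_ptr<FakeReader>> subs;
};
using ReaderPtr = std::shared_ptr<FakeReader>;
using Bindings = std::map<std::string, ReaderPtr>;

ReaderPtr make_segment(uint32_t max_doc) {
    auto reader = std::make_shared<FakeReader>();
    reader->max_doc = max_doc;
    return reader;
}

ReaderPtr make_multi(std::vector<ReaderPtr> subs) {
    auto reader = std::make_shared<FakeReader>();
    for (const auto& sub : subs) reader->max_doc += sub->max_doc;
    reader->subs = std::move(subs);
    return reader;
}

// A segment-level reader is treated as a composite with exactly one leaf.
std::vector<ReaderPtr> leaves(const ReaderPtr& reader) {
    return reader->subs.empty() ? std::vector<ReaderPtr>{reader} : reader->subs;
}

// Calls `callback` once per segment with bindings rewritten to segment-level
// readers and the explicit doc base of that segment.
void for_each_segment(const Bindings& bindings,
                      const std::function<void(const Bindings&, uint32_t)>& callback) {
    if (bindings.empty()) return;
    const std::vector<ReaderPtr> reference = leaves(bindings.begin()->second);
    // Assumed topology check: every binding must split into the same segments.
    for (const auto& entry : bindings) {
        const std::vector<ReaderPtr> own = leaves(entry.second);
        if (own.size() != reference.size()) {
            throw std::invalid_argument("segment count mismatch for " + entry.first);
        }
        for (std::size_t i = 0; i < own.size(); ++i) {
            if (own[i]->max_doc != reference[i]->max_doc) {
                throw std::invalid_argument("segment size mismatch for " + entry.first);
            }
        }
    }
    uint32_t doc_base = 0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        Bindings segment_bindings;
        for (const auto& entry : bindings) {
            segment_bindings[entry.first] = leaves(entry.second)[i];
        }
        callback(segment_bindings, doc_base);
        doc_base += reference[i]->max_doc;
    }
}
```

A callback in this model never sees a composite reader, which is the property the summary relies on to keep SegmentPostings away from multi-segment readers.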