[refactor](be) Refactor SegmentPrefetcher to CacheBlockAwarePrefetchRemoteReader#63056
Open
bobhan1 wants to merge 6 commits into apache:master
bobhan1 (Contributor, Author) commented:
run buildall
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Refactor segment file-cache prefetching around a cache-block-aware cached remote reader. This removes the old SegmentPrefetcher and moves prefetch ownership to CacheBlockAwarePrefetchRemoteReader, which inherits from CachedRemoteFileReader and expands higher-level file access ranges to all the file cache blocks they cover. The new reader registers one move-only RAII ReadPatternHandle per caller, so multiple column iterators sharing the same underlying segment file reader keep independent prefetch progress, and each pattern is unregistered automatically when its owner is destroyed or reset.
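The move-only RAII registration described in this commit can be illustrated with a minimal sketch. It is not the PR's code: PatternRegistry and PatternHandle are hypothetical stand-ins for the shared reader and for ReadPatternHandle, and the registry is assumed to be a plain single-threaded set of ids.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>

// Hypothetical registry standing in for the shared reader (assumption, not the PR's API).
class PatternRegistry {
public:
    uint64_t add() {
        uint64_t id = _next_id++;
        _active.insert(id);
        return id;
    }
    void remove(uint64_t id) { _active.erase(id); }
    size_t active_count() const { return _active.size(); }

private:
    uint64_t _next_id = 1;
    std::unordered_set<uint64_t> _active;
};

// Move-only handle: registers on construction, unregisters on destruction or reset().
class PatternHandle {
public:
    PatternHandle() = default;
    explicit PatternHandle(PatternRegistry* registry)
            : _registry(registry), _id(registry->add()) {}
    PatternHandle(const PatternHandle&) = delete;
    PatternHandle& operator=(const PatternHandle&) = delete;
    PatternHandle(PatternHandle&& other) noexcept
            : _registry(std::exchange(other._registry, nullptr)),
              _id(std::exchange(other._id, 0)) {}
    PatternHandle& operator=(PatternHandle&& other) noexcept {
        if (this != &other) {
            reset(); // drop our own registration before taking over the other one
            _registry = std::exchange(other._registry, nullptr);
            _id = std::exchange(other._id, 0);
        }
        return *this;
    }
    ~PatternHandle() { reset(); }

    void reset() {
        if (_registry != nullptr) {
            _registry->remove(_id);
            _registry = nullptr;
            _id = 0;
        }
    }
    bool valid() const { return _registry != nullptr; }

private:
    PatternRegistry* _registry = nullptr;
    uint64_t _id = 0;
};
```

Deleting the copy operations is what guarantees that exactly one owner is responsible for each registration; moving a handle transfers that responsibility without touching the registry.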
The segment layer now builds file access ranges through SegmentFileAccessRangeBuilder. Rowids are converted through ordinal indexes only while building the access ranges; actual prefetch progress is triggered later by the current data page file offset. SegmentIterator initializes these patterns after segment open, index pruning, and iterator initialization, when the row bitmap is known and subsequent data page reads are monotonic in scan order. The implementation also handles data pages that cross file-cache-block boundaries or are larger than one cache block by prefetching every covered cache block instead of assuming page size is smaller than the cache block size.
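The handling of pages that cross or exceed a cache block comes down to alignment arithmetic. The sketch below assumes fixed-size cache blocks aligned to multiples of block_size; covered_cache_blocks is a hypothetical helper name, not a function from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: start offsets of every cache block touched by the byte
// range [offset, offset + size). Assumes blocks are aligned to block_size.
std::vector<uint64_t> covered_cache_blocks(uint64_t offset, uint64_t size,
                                           uint64_t block_size) {
    std::vector<uint64_t> blocks;
    if (size == 0 || block_size == 0) {
        return blocks;
    }
    // Block that holds the first byte of the range.
    uint64_t first = (offset / block_size) * block_size;
    // Block that holds the last byte of the range.
    uint64_t last = ((offset + size - 1) / block_size) * block_size;
    for (uint64_t b = first; b <= last; b += block_size) {
        blocks.push_back(b);
    }
    return blocks;
}
```

With a 1024-byte block, a 100-byte page at offset 1000 yields two blocks (0 and 1024), and a 3000-byte page at offset 512 yields four, so neither case relies on a page fitting inside one block.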
A new FileReaderOptions::enable_cache_block_prefetch option controls whether FILE_BLOCK_CACHE readers are created as CacheBlockAwarePrefetchRemoteReader. Cloud segment data-file readers enable it when segment file-cache prefetch is enabled for query or compaction. Existing complex and variant column iterators forward cache-block prefetch setup to their nested file iterators.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_registers_independent_column_patterns:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader_prefetches_cache_blocks -j100
- Format Check: build-support/check-format.sh
- Static Check: git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: CacheBlockAwarePrefetchRemoteReader previously supported multiple read patterns because segment column iterators shared the same underlying file reader and had to manually trigger prefetch. This refactors cache-block prefetch so each physical FileColumnIterator owns an independent reader when prefetch is enabled, the reader holds at most one read pattern, and read_at() automatically advances prefetch by file offset. The code comments also document how Segment, ColumnReaderCache, ColumnReader, and FileColumnIterator cooperate to keep the cache-aware reader iterator-local.
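As a rough illustration of read_at() advancing prefetch by file offset, the sketch below models an iterator-local reader with a single pattern. SinglePatternReader, its members, and the window policy are assumptions made for the example; real I/O is replaced by recording the block offsets that would be requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative stand-in for an iterator-local reader holding one read pattern.
class SinglePatternReader {
public:
    SinglePatternReader(std::vector<uint64_t> block_offsets, uint64_t block_size,
                        size_t window)
            : _blocks(std::move(block_offsets)), _block_size(block_size), _window(window) {}

    // Foreground read entry point. Only the prefetch bookkeeping is modeled here.
    void read_at(uint64_t offset) {
        // Skip planned blocks that end at or before the read offset.
        while (_cursor < _blocks.size() && _blocks[_cursor] + _block_size <= offset) {
            ++_cursor;
        }
        // Blocks that were skipped without being requested are not requested later.
        _next_to_issue = std::max(_next_to_issue, _cursor);
        size_t end = std::min(_blocks.size(), _cursor + _window);
        for (; _next_to_issue < end; ++_next_to_issue) {
            issued.push_back(_blocks[_next_to_issue]); // stands in for an async download
        }
    }

    std::vector<uint64_t> issued; // block offsets requested so far, in order

private:
    std::vector<uint64_t> _blocks; // sorted cache block start offsets
    uint64_t _block_size;
    size_t _window;                // how many blocks to keep requested ahead
    size_t _cursor = 0;            // first block not yet passed by foreground reads
    size_t _next_to_issue = 0;     // first block not yet requested
};
```

Because the caller only ever invokes read_at(), the upper layers need no separate prefetch trigger, which is the behavior change this commit describes.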
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Unit Test: ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_read_at_automatically_prefetches_single_pattern:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader_prefetches_cache_blocks -j100
- Manual test: build-support/clang-format.sh; git diff --check
- Behavior changed: Yes. Cache-block prefetch is now iterator-local and is triggered automatically from read_at() when enabled.
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Refine the cache-block-aware prefetch reader after extracting it from segment prefetching. The previous reader state kept block planning and cursor progress tightly coupled, and tests did not fully cover component interaction paths. This commit splits immutable prefetch planning from mutable cursor state, keeps file access ranges as the trigger source so reads inside a page range still advance prefetch, renames the dry-run cache warming API to async_touch_local_cache, and expands unit tests for builder, plan, cursor, async cache touch, and reader integration behavior.
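One way to picture the split between immutable planning and mutable cursor state is sketched below. PrefetchPlan, PrefetchCursor, and FileRange are hypothetical names, the input ranges are assumed to be sorted and non-overlapping, and the window is measured from the first block of the range that contains the read offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FileRange {
    uint64_t offset;
    uint64_t size;
};

// Immutable once built: the deduplicated cache blocks covered by the ranges.
class PrefetchPlan {
public:
    PrefetchPlan(const std::vector<FileRange>& sorted_ranges, uint64_t block_size) {
        for (const FileRange& r : sorted_ranges) {
            if (r.size == 0) continue;
            uint64_t first = r.offset / block_size * block_size;
            uint64_t last = (r.offset + r.size - 1) / block_size * block_size;
            for (uint64_t b = first; b <= last; b += block_size) {
                if (_blocks.empty() || _blocks.back() < b) _blocks.push_back(b);
            }
            _range_ends.push_back(r.offset + r.size);
            auto it = std::lower_bound(_blocks.begin(), _blocks.end(), first);
            _range_first_block.push_back(static_cast<size_t>(it - _blocks.begin()));
        }
    }
    const std::vector<uint64_t>& blocks() const { return _blocks; }
    // First planned block of the range containing (or following) `offset`;
    // returns blocks().size() when `offset` lies past every range.
    size_t block_index_for(uint64_t offset) const {
        auto it = std::upper_bound(_range_ends.begin(), _range_ends.end(), offset);
        if (it == _range_ends.end()) return _blocks.size();
        return _range_first_block[static_cast<size_t>(it - _range_ends.begin())];
    }

private:
    std::vector<uint64_t> _blocks;
    std::vector<uint64_t> _range_ends;
    std::vector<size_t> _range_first_block;
};

// Mutable progress only; the plan itself is never modified.
class PrefetchCursor {
public:
    std::vector<uint64_t> advance(const PrefetchPlan& plan, uint64_t read_offset,
                                  size_t window) {
        size_t start = plan.block_index_for(read_offset);
        _next = std::max(_next, start);
        size_t end = std::min(plan.blocks().size(), start + window);
        std::vector<uint64_t> to_touch;
        for (; _next < end; ++_next) to_touch.push_back(plan.blocks()[_next]);
        return to_touch;
    }

private:
    size_t _next = 0; // first planned block not yet requested
};
```

Looking up the range by offset rather than by exact page start is what lets a read that lands in the middle of a page range still be recognized, and keeping the plan const makes each piece testable on its own.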
### Release note
None
### Check List (For Author)
- Test: Unit Test
- ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_read_at_automatically_prefetches_single_pattern:BlockFileCacheTest.cached_remote_file_reader_async_touch_local_cache_downloads_range:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader_prefetches_cache_blocks -j100
- build-support/clang-format.sh
- git diff --check
- Behavior changed: No (query results are unchanged; only file-cache prefetch scheduling internals are refactored)
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Add an explicit initial-window touch API for cache-block-aware prefetch readers. Segment predicate columns install patterns from row ranges that are guaranteed to be read, so SegmentIterator now touches their first prefetch window immediately after installing the pattern, while non-predicate and common-expression columns keep read_at-triggered prefetch because their exact rowids are batch-local after predicate evaluation.
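The difference between eager and read-triggered columns can be sketched as follows. ColumnKind, PatternState, and install_pattern are illustrative names, and touch_initial_window only mirrors the idea of the initial-window touch; none of these signatures come from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: which columns get an eager first-window touch.
enum class ColumnKind { kPredicate, kNonPredicate, kCommonExpr };

struct PatternState {
    std::vector<uint64_t> blocks;  // planned cache block offsets, in scan order
    size_t next_to_touch = 0;      // first block not yet requested
    std::vector<uint64_t> touched; // stands in for async downloads that were issued
};

// Requests the first `window` planned blocks that have not been requested yet.
void touch_initial_window(PatternState& state, size_t window) {
    size_t end = std::min(state.blocks.size(), window);
    for (; state.next_to_touch < end; ++state.next_to_touch) {
        state.touched.push_back(state.blocks[state.next_to_touch]);
    }
}

// Predicate columns are touched immediately because their ranges are certain to
// be read; other columns wait for the first foreground read.
PatternState install_pattern(std::vector<uint64_t> blocks, ColumnKind kind,
                             size_t window) {
    PatternState state;
    state.blocks = std::move(blocks);
    if (kind == ColumnKind::kPredicate) {
        touch_initial_window(state, window);
    }
    return state;
}
```

Calling touch_initial_window a second time is a no-op in this sketch, so an eager touch followed by a normal read path does not request the same blocks twice.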
### Release note
None
### Check List (For Author)
- Test: Unit Test
- build-support/clang-format.sh
- git diff --check
- ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_read_at_automatically_prefetches_single_pattern:BlockFileCacheTest.cached_remote_file_reader_async_touch_local_cache_downloads_range:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader* -j100
- Behavior changed: No (query results unchanged; only file-cache prefetch timing changes for predicate columns)
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Add BE unit test coverage for cache-block-aware segment prefetch with many sparse data pages, many cache blocks, large file offset spans, overlapping cache blocks, and large ranges that cross more cache blocks than the configured prefetch window.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- build-support/clang-format.sh
- git diff --check
- ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_read_at_automatically_prefetches_single_pattern:BlockFileCacheTest.cached_remote_file_reader_async_touch_local_cache_downloads_range:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader* -j100
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: N/A
Related PR: apache#63056
Problem Summary: Update ReaderOwnedColumnIterator to forward the cache-block prefetch interfaces introduced by the segment prefetcher refactor, removing stale SegmentPrefetcher API usage left after rebasing onto master.
### Release note
None
### Check List (For Author)
- Test: Manual test
- ./build.sh --be -j100
- Behavior changed: No
- Does this need documentation: No
9fcacba to 61189b0
bobhan1 added a commit to bobhan1/doris that referenced this pull request on May 8, 2026
### What problem does this PR solve?
Issue Number: None
Related PR: #59482
Problem Summary:
This PR refactors segment data-file prefetch for file cache reads and adds a cache-block-aware remote reader.
Previously, segment prefetch was managed outside the file reader and depended on a shared underlying reader across columns. That made the prefetch state harder to isolate, required upper layers to manually trigger prefetch, and coupled rowid/page knowledge directly to the prefetcher. For cold object-storage scans, this also limited the ability to issue parallel cache-block-sized reads early enough to trade object-store IOPS for bandwidth.
This PR changes the design as follows:
- Add CacheBlockAwarePrefetchRemoteReader.
- CachedRemoteFileReader.
- read_at() calls automatically advance the prefetch window by file offset.
- async_touch_initial_window() lets callers touch the first window before the first foreground read when the first ranges are guaranteed to be read, such as predicate columns after segment index pruning.
- CachedRemoteFileReader::async_touch_local_cache().
- Split rowid-to-file-range conversion into SegmentFileAccessRangeBuilder.
- Remove the old segment prefetcher path.
- CacheBlockAwarePrefetchRemoteReader instances.
- read_at()-triggered behavior because their final rowids are produced batch by batch after predicate evaluation.
- Rename the lower-level cache warming API.
- CachedRemoteFileReader::prefetch_range() is renamed to async_touch_local_cache() to describe the actual behavior more clearly: asynchronously download the requested range into the local file cache.
- Add focused BE unit tests.
- read_at()-triggered prefetch.

The optimization intentionally spends more object-storage IOPS to expose more parallelism and improve aggregate bandwidth for cold segment reads. Query results are unchanged.
### Release note
None
### Check List (For Author)
- Test
- build-support/clang-format.sh
- git diff --check
- ./run-be-ut.sh --run --filter=CacheBlockAwarePrefetchRemoteReaderTest.*:BlockFileCacheTest.usage_example_read_at_automatically_prefetches_single_pattern:BlockFileCacheTest.cached_remote_file_reader_async_touch_local_cache_downloads_range:BlockFileCacheTest.cache_block_aware_prefetch_remote_reader* -j100
- Behavior changed:
- Does this need documentation?
### Check List (For Reviewer who merge this PR)