[fix](variant) Skip full footer scan when constructing VariantStatsCaculator #62819
…culator

SegmentWriter::init() can run multiple times against the same writer (vertical compaction's key columns + per value-column-group calls), and the footer accumulates entries across calls. The calculator was scanning the whole footer on every construction, so each additional init() walked an ever-larger footer that included entries it cannot address via the init's own `column_ids`.

Snapshot the footer size before _create_writers appends new entries and pass it to VariantStatsCaculator as `footer_column_offset`, so the constructor only scans its own slice.

Per-init() construction cost goes from O(footer accumulated size) to O(this init's column_ids size); the total cost across N vertical-compaction init() calls drops from O(N^2) to O(N).

All existing behavior is preserved (including the defensive Status::NotFound on missing footer entries). One new unit test covers the offset case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
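The slice-scoped scan described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Doris code: `ColumnMeta` and `index_current_slice` are hypothetical names, the footer is modeled as a plain `std::vector`, and `std::nullopt` stands in for the defensive `Status::NotFound`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one footer column entry (not the real Doris type).
struct ColumnMeta {
    int32_t unique_id;
};

// Maps each wanted unique_id to its footer index, looking only at the entries
// appended by the current init(): the half-open range
// [footer_column_offset, footer.size()).
// Returns std::nullopt when a wanted id is absent from that slice, which
// stands in for the defensive Status::NotFound in this sketch.
std::optional<std::unordered_map<int32_t, std::size_t>> index_current_slice(
        const std::vector<ColumnMeta>& footer, std::size_t footer_column_offset,
        const std::vector<int32_t>& wanted_unique_ids) {
    // Index only the current slice; entries from prior init() calls are skipped.
    std::unordered_map<int32_t, std::size_t> slice_index;
    for (std::size_t i = footer_column_offset; i < footer.size(); ++i) {
        slice_index.emplace(footer[i].unique_id, i);
    }

    std::unordered_map<int32_t, std::size_t> result;
    for (int32_t uid : wanted_unique_ids) {
        auto it = slice_index.find(uid);
        if (it == slice_index.end()) {
            return std::nullopt;  // missing footer entry for this column
        }
        result.emplace(uid, it->second);
    }
    return result;
}
```

In this sketch the caller would record `footer.size()` before appending the new group's entries and pass that value as `footer_column_offset`.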
run buildall |
/review
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
I did not find blocking issues in this PR.

Critical checkpoints:
- Goal / correctness: The change is narrowly targeted at the repeated `SegmentWriter::init()` reuse path in vertical compaction / segcompaction. Scoping `VariantStatsCaculator` to the footer slice added by the current `init()` matches how `SegmentWriter` accumulates footer entries across groups, so the fix addresses the reported stale-scan behavior without changing the downstream stats update logic.
- Scope: The modification is small and focused: one caller-side offset snapshot, one constructor parameter, and one unit test.
- Concurrency: No new concurrency risk is introduced here. The touched logic runs on the per-writer compaction path and the `SegmentWriter` reuse is serialized by the surrounding compaction flow.
- Lifecycle: The review-sensitive lifecycle detail is the persistent `_footer` across repeated `clear()`/`init()` cycles; this patch handles that explicitly by snapshotting the footer size before the current group appends its own metas.
- Config / compatibility: No config, protocol, or storage-format compatibility changes.
- Parallel paths: The fix covers both known repeated-init paths that reuse `SegmentWriter` (`VerticalBetaRowsetWriter` and segcompaction), because both go through `SegmentWriter::init(col_ids, is_key)`.
- Special conditions: The new `footer_column_offset` condition is justified and documented in code.
- Test coverage: The added unit test captures the intended footer-offset behavior and checks that stats land on the current slice instead of stale entries from a prior init. I could not run the BE UT in this runner because `thirdparty/installed/bin/protoc` is missing, so my validation here is code inspection only.
- Observability: No additional observability appears necessary for this localized fix.
- Data / persistence: No transaction, persistence, or metadata replay semantics are changed beyond which in-memory footer entries are indexed during compaction.
- Performance: The optimization claim is consistent with the implementation; the constructor scan is reduced from accumulated footer size to the current init slice, which removes the repeated full-footer walk.
- Other issues: None found in the reviewed diff.

User focus points: No additional user-provided focus points were supplied for this review.
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Problem
In vertical compaction, `SegmentWriter::init()` is called multiple times against the same writer (key columns first, then each value-column group). The footer accumulates across calls, so every additional `init()` re-scans an ever-larger footer, including entries from prior `init()`s that the current `column_ids` cannot address.
Fix
Snapshot `_footer.columns_size()` before `_create_writers` appends new entries and pass it to `VariantStatsCaculator` as `footer_column_offset`. The constructor then walks only `[offset, end)`, its own slice.
When compacting a variant with 10,000 columns, this method accounted for around 8% of CPU usage. After the change to O(N), it no longer appears in the flame graph.
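To make the O(N^2) versus O(N) claim concrete, here is a small cost model that counts how many footer entries get visited across repeated `init()` calls under each strategy. It is only a sketch: `ScanCost` and `count_scanned_entries` are hypothetical names, and the footer is reduced to a running size counter rather than real column metas.

```cpp
#include <cstddef>
#include <vector>

// Total footer entries visited across all init() calls under each strategy.
struct ScanCost {
    std::size_t full_scan;   // scan the whole accumulated footer every time
    std::size_t slice_scan;  // scan only [offset, end) for the current init()
};

// group_sizes[i] is the number of columns the i-th init() call appends.
ScanCost count_scanned_entries(const std::vector<std::size_t>& group_sizes) {
    ScanCost cost{0, 0};
    std::size_t footer_size = 0;
    for (std::size_t group : group_sizes) {
        // Snapshot the footer size before this init() appends its entries.
        std::size_t offset = footer_size;
        footer_size += group;
        cost.full_scan += footer_size;            // walks all prior entries too
        cost.slice_scan += footer_size - offset;  // walks only the new slice
    }
    return cost;
}
```

For four groups of five columns each, the full scan visits 5 + 10 + 15 + 20 = 50 entries in total, while the slice scan visits 20, one visit per column.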
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)