[fix](zone-map) Avoid per-row Field temporaries in TypedZoneMapIndexWriter #62544
yiguolei merged 1 commit into apache:master from
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

/review

run buildall
Findings
- `be/src/storage/index/zone_map/zone_map_index.cpp`: the new string min/max path now copies and compares the full `StringRef` in the hot path. Because `_materialize_page_minmax()` immediately goes through `Field::create_field_from_olap_value()`, bytes after `MAX_ZONE_MAP_INDEX_SIZE` are still truncated before any page/segment zone map is persisted. That means long `STRING`/`VARCHAR` values now pay O(full_value_size) copy/compare cost and keep those full copies alive until page flush without improving the serialized zone map, so this can regress the long-string / variant workloads the patch is trying to speed up.
- `be/benchmark/benchmark_zone_map_index.hpp`: `make_writer()` leaks the `StorageField` returned by `StorageFieldFactory::create(*col)` on every benchmark iteration. Since each benchmark loop constructs a fresh writer, the benchmark process accumulates leaked fields, which can skew timings or OOM long runs.
Critical Checkpoint Conclusions
- Goal of the task: reduce `TypedZoneMapIndexWriter::add_values()` overhead. The numeric path improves that, but the new string path still has a blocking hot-path regression for long values, so the current patch does not fully accomplish the goal yet.
- Scope/minimality: the production change is focused, and the benchmark addition is focused, but the benchmark currently has a resource-management bug.
- Concurrency: no new cross-thread behavior is introduced here; writer state stays local to a single writer instance.
- Lifecycle/static init: page-lifetime state (`_page_min`, `_page_max`, `_arena`) is self-contained inside the writer. I did not find a new lifecycle or static-initialization bug.
- Config/compatibility: no new config, protocol, or storage-format compatibility concerns found.
- Parallel code paths: I did not find another equivalent zone-map write path that also needed the same change.
- Special conditions/comments: the new string-specific fast path is documented, but the current implementation keeps more data than the persisted zone map can ever use.
- Tests: existing `be/test/storage/segment/zone_map_index_test.cpp` covers basic zone-map serialization, truncation, and float flag behavior, but no new test guards the long-string deferred-materialization path added here.
- Observability: not applicable for this refactor.
- Transactions/persistence/data correctness: I did not find a data-correctness regression in the currently covered paths.
- Performance: blocked by the long-string full-copy/full-compare behavior above.
- Other issues: the benchmark leak makes the evidence in the PR body unreliable until it is fixed.
Force-pushed 0b1cd15 to 4378af6
/review
No blocking findings in the modified production path.
Review checkpoints:
- Goal/task: The change achieves its goal of moving page min/max tracking to raw `ValType` and materializing `Field` only at `flush()` time. Existing `be/test/storage/segment/zone_map_index_test.cpp` already exercises repeated `add_values()`/`flush()` behavior across int, string, float special values, decimal, and date/time types, although there is no new targeted test for this exact hot path.
- Scope/focus: The production change is small and localized to `TypedZoneMapIndexWriter`, with the benchmark kept separate.
- Concurrency: No new concurrent path is introduced. This writer state is still page-local and used through the existing single-writer column-write call chain, so there are no new lock ordering or thread-safety concerns here.
- Lifecycle: The new `_page_min`, `_page_max`, `_page_has_minmax`, and `_arena` state is reset on every `flush()`. For string types, bytes are copied into `_arena`, then materialized into `Field` before `_arena.clear()`, so I do not see a dangling-reference issue escaping the writer.
- Config/compatibility: No new config, FE/BE protocol, or on-disk format change is introduced. `ZoneMapPB` serialization remains on the existing path.
- Parallel code paths: The optimization is applied in the shared zone-map writer path used by the nullable-run split workload described in the PR, so I did not find a sibling production path in this abstraction that also needed the same change.
- Special checks: String truncation behavior and float `NaN`/`+Inf`/`-Inf` flag handling remain aligned with the pre-change logic.
- Observability/persistence/transactions/data writes: This change only affects in-memory min/max accumulation before the existing serialization path; I do not see new observability, persistence, transactionality, or data-visibility risks.
- Performance: The direction is sound. The hot path removes repeated `Field` construction while keeping materialization at page flush.
Residual risk: I reviewed the code and existing unit coverage, but I did not run a BE build or tests in this runner, so the remaining risk is limited to unexecuted build/benchmark compilation rather than a production-path correctness concern.
run buildall

Force-pushed 4378af6 to f1cf838

run buildall

/review
No blocking issues found in this PR.
Critical checkpoints:
- Goal / correctness: The change achieves its stated goal of removing per-call `Field` temporary construction from `TypedZoneMapIndexWriter` while preserving the existing zone-map serialization path. The existing unit tests in `be/test/storage/segment/zone_map_index_test.cpp` already exercise repeated `add_values()` calls, string truncation, pass-all pages, and multiple primitive/date/time types against the same writer logic.
- Change scope: The modification is small and focused. It only adjusts the zone-map writer hot path and adds a benchmark wired into the benchmark binary.
- Concurrency: No new concurrency is introduced. `TypedZoneMapIndexWriter` remains object-local writer state, and the PR does not change locking, thread interaction, or lifecycle ordering.
- Lifecycle / initialization: No special lifecycle or static-initialization risk was introduced. The new `_page_min`, `_page_max`, and `_page_has_minmax` state is reset during `flush()`.
- Configuration: No new config item was added.
- Compatibility: No FE/BE protocol change or on-disk format change was introduced. `ZoneMapPB` serialization/deserialization remains unchanged.
- Parallel paths: I did not find another equivalent zone-map writer path that also needed the same change.
- Special conditions: The string truncation behavior and float/double special-value handling follow the existing semantics of this module.
- Test coverage: Existing unit coverage is relevant for this change, but no new automated test was added specifically for the deferred materialization path, and I did not run BE tests in this runner.
- Observability: No additional logs or metrics are needed for this localized hot-path optimization.
- Transaction / persistence / data writes: This touches the storage write path, but it does not alter transaction semantics or persisted metadata shape, so I do not see a replay/compatibility concern from this diff.
- Performance: The optimization removes hot-path `Field` materialization and defers it to `flush()`. The added benchmark is aligned with the intended compaction/write workload, and I did not identify an obvious performance regression in the changed logic.
- Other issues: None blocking found.
Residual risk:
- I relied on static review plus the existing zone-map unit coverage; I did not build or run tests locally in this runner.
BE UT Coverage Report: increment line coverage, increment coverage report

BE Regression && UT Coverage Report: increment line coverage, increment coverage report

Force-pushed f1cf838 to b0e94cf

run buildall

/review
No blocking findings after reviewing be/src/storage/index/zone_map/zone_map_index.{h,cpp} and the new benchmark wiring.
Critical checkpoint conclusions:
- Goal / correctness: The PR goal is to remove per-call `Field` materialization from `TypedZoneMapIndexWriter::add_values()`. The new `_page_min`/`_page_max` cache does that while keeping page and segment zone-map serialization behavior unchanged. Existing `zone_map_index_test` coverage for string truncation, CHAR padding, `pass_all`, and float/double special values still exercises the touched semantics.
- Scope / minimality: The functional change is tightly scoped to the zone-map writer, and the benchmark is isolated under `be/benchmark/`.
- Concurrency: No new concurrency is introduced. `TypedZoneMapIndexWriter` remains page-local mutable state reset at `flush()`.
- Lifecycle: The new raw min/max state is page-scoped and cleared together with `_page_zone_map` after each flush. For string types, the cached `StringRef` storage is backed by `_arena` and cleared only after materialization/serialization completes.
- Config / compatibility / FE-BE protocol: No new config, no storage-format change, no FE-BE contract change, and no parallel code path appears to be missed because all supported primitive writers still flow through `ZoneMapIndexWriter::create()`.
- Special conditions: `pass_all`, truncation to `MAX_ZONE_MAP_INDEX_SIZE`, and the max-value byte bump remain in the same logical order as before, so pruning semantics are preserved.
- Testing: No new unit/regression test was added, but the existing BE unit tests already cover the semantics touched here. Residual risk is limited to the performance-oriented split-run call pattern that the new benchmark models rather than an observed correctness gap.
- Observability / transactions / persistence: No new observability requirements, and the zone-map protobuf format is unchanged.
- Performance: The hot path removes repeated `Field` construction/destruction with no obvious correctness regression from code inspection.
Overall opinion: the change is small, focused, and looks correct.
run buildall

/review
I reviewed the full PR, including the existing inline review threads and the surrounding write/read paths. I did not repeat the two already-raised issues (the benchmark leak/unchecked create status, and the earlier full-string-copy regression), but they remain part of the open review context.
Additional findings from this pass:
- `be/src/storage/index/zone_map/zone_map_index.h`: the new arena-backed string min/max path adds retained per-page allocations that are invisible to `ZoneMapIndexWriter::size()`, so `ScalarColumnWriter::estimate_buffer_size()` / `SegmentWriter::estimate_segment_size()` under-account memory.
- `be/benchmark/benchmark_zone_map_index.hpp`: the added string benchmark only uses 16-byte strings, so it does not measure the `> MAX_ZONE_MAP_INDEX_SIZE` truncation/copy path that this PR changes.
Critical checkpoints:
- Goal / proof: the PR aims to remove per-call `Field` temporaries from `TypedZoneMapIndexWriter::add_values()`. The numeric fast path does that, but there is no new test covering the new raw-`ValType` / arena lifecycle.
- Scope: focused to the zone-map writer and its benchmark.
- Concurrency: no new concurrent path; the writer is still used from the column-writer thread without new locking.
- Lifecycle: page/segment field materialization is safe across `flush()`, but the new arena-backed string state is not fully tracked in memory accounting.
- Configs: none.
- Compatibility / storage format: unchanged; no FE/BE protocol change.
- Parallel paths: I did not find another functional zone-map write path that needed the same refactor.
- Special conditional checks: existing `pass_all` and string max-byte bump logic remain unchanged.
- Test coverage: existing BE zone-map tests cover serialized truncation semantics, but no new test covers this refactor's memory/accounting behavior or the claimed long-string benchmark scenario.
- Test result updates: none.
- Observability: no new logging/metrics needed.
- Transactions / persistence / data writes: no new transaction or persistence risks found; the on-disk zone-map format remains the same.
- FE/BE variable passing: none involved.
- Performance: the numeric hot path should improve, but the string benchmark evidence is incomplete and the new string path introduces extra resident memory that current accounting misses.
- Other issues: see inline comments.
User focus points:
- No additional user-provided review focus.
Force-pushed b95df21 to 71a11ef

run buildall

/review
I completed a full review of the PR, read the existing inline threads first, and did not find any additional distinct blocking issues beyond the concerns already raised there. Those existing findings are still sufficient to block merge:
- The new benchmark still ignores `Status` returns from `ZoneMapIndexWriter::create()` / `flush()`, which violates Doris error-handling rules and can silently produce invalid benchmark results.
- The benchmark does not exercise >512-byte strings, so the PR body's string-performance claims are not supported by the added benchmark coverage.
- The new string min/max path retains arena-backed memory that is not reported by `ZoneMapIndexWriter::size()`, so `ScalarColumnWriter::estimate_buffer_size()` / `SegmentWriter::estimate_segment_size()` under-account memory. Because `Arena::clear()` keeps the head chunk for reuse, a single large page can also pin that chunk for the rest of the segment-writer lifetime while the estimate remains flat.
Critical checkpoints:
- Goal / correctness: The goal is to remove per-call `Field` temporaries in `TypedZoneMapIndexWriter::add_values()`. The deferred-materialization approach appears functionally correct for page/segment min/max semantics, and existing unit tests already cover truncation / `pass_all` behavior. However, the PR is not ready because the remaining benchmark and memory-accounting issues undermine the claimed benefit and violate Doris runtime accounting expectations.
- Scope / focus: The code change is locally focused in the zone-map writer plus a benchmark, but the benchmark and accounting follow-through are still incomplete.
- Concurrency: I did not find new concurrency or lock-order issues in this path; the writer remains single-owner / single-thread in the existing call chain.
- Lifecycle: The new string path introduces arena-backed lifetime for page min/max. That lifecycle is internally consistent, but its memory is not surfaced through the existing size-accounting interface.
- Config / compatibility: No new config or storage-format compatibility issue found.
- Parallel paths: I did not find another equivalent writer path that also needed the same optimization.
- Special conditions: The truncation / max-byte increment logic remains aligned with the prior zone-map semantics.
- Test coverage: Existing direct unit tests cover zone-map serialization semantics, including long-string truncation and `pass_all`, but the PR adds no test for the new deferred-materialization hot path itself and no caller-level coverage for the nullable-run / `ScalarColumnWriter` scenario described in the PR body.
- Observability: No new observability requirement found.
- Transactions / persistence / writes: No FE transaction or metadata-log impact found; this is a BE in-memory write-path optimization only.
- Performance: The optimization should help the intended hot path, but the current benchmark evidence is incomplete for long strings and still contains ignored-error handling.
User focus: review_focus.txt contained no additional user-provided focus points.
Please address the existing inline findings before merge.
run buildall

run buildall
…e segment-merge bug
Optimization: in the per-row hot path, compare and store the running
page min/max as raw ValType instead of constructing a Field per batch.
The Field is only materialized once per page (in flush()), so
create_field_from_olap_value() goes from N (per-row) to 2 (per-page)
calls. For string types, the truncated bytes are kept in two std::string
members (_page_min_storage / _page_max_storage); std::string::assign
reuses its internal buffer, replacing the previous 4 KB-min Arena.
Bug fix: the previous flush() unconditionally merged
_page_zone_map.{min,max}_value into _segment_zone_map even when the
current page had no values. For varchar/string max values of exactly
MAX_ZONE_MAP_INDEX_SIZE bytes, modify_index_before_flush() mutates the
in-memory max in place (last byte +1), and _reset_zone_map() does not
clear min_value/max_value, so an all-null page following a
max-length-string page would re-merge the already-incremented value into
segment.max. finish() would then increment again, leaving segment.max
double-incremented and observably wrong via SELECT max(varchar)
push-down. Gating the segment merge on _page_zone_map.has_not_null fixes
this and matches the design contract documented on _reset_zone_map.
Also clean up create_field_from_olap_value: callers (now only
_materialize_page_minmax) are responsible for truncation, so the
redundant truncation inside the helper is removed.
Add a microbenchmark (benchmark_zone_map_index.hpp) and two unit tests
that pin the bug-fix and the int regression behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run buildall

run buildall

/review

PR approved by at least one committer and no changes requested.
Code review completed for PR 62544. I reviewed the existing inline threads first and did not duplicate those already-known concerns.
Summary opinion: no additional non-duplicative blocking issue found in the current diff. The implementation is focused on avoiding per-call Field materialization in TypedZoneMapIndexWriter, with the string truncation moved into the zone-map hot path and a segment merge guard for all-null pages.
Critical checkpoint conclusions:
- Goal/test coverage: The PR goal is achieved by tracking page min/max as raw ValType and materializing Field at flush; new unit tests cover the all-null-page segment max regression and existing zone-map tests cover string truncation/materialization behavior.
- Scope/focus: The change is small and localized to zone-map writer behavior, its Field helper dependency, benchmark coverage, and BE unit tests.
- Concurrency/lifecycle: No new shared concurrency or lock path is introduced. The page min/max lifetime is local to the writer; string refs borrow from per-writer std::string storage and are materialized before serialization.
- Configuration/compatibility: No new config or persistent format change is introduced; serialized ZoneMapPB semantics remain compatible.
- Parallel paths: The touched path is the ScalarColumnWriter zone-map path; no separate BE/Cloud path appeared to require the same logic.
- Data correctness: The new guard avoids merging all-null page stale min/max into the segment zone map. Existing review threads already discuss the earlier min/max/truncation concerns, so I did not repeat them.
- Memory/performance: The hot path removes Field temporaries and replaces arena-backed retained copies with bounded std::string storage for current page min/max. Existing benchmark-status-check/long-string coverage comments are already known from prior threads.
- Observability: No new logging/metrics seem necessary for this local write-path optimization.
- User focus: No additional user-provided review focus was specified.
Verification: attempted ./run-be-ut.sh --run --filter=ColumnZoneMapTest.*, but it timed out during third-party submodule initialization before the focused tests could execute.
skip check_coverage
Problem
When writing nullable columns, `append_nullable` splits each block at null boundaries into small non-null runs. For example, `[1, 3, NULL, NULL, 4, 5, 6, NULL]` becomes `add_values([1,3], count=2)` + `add_values([4,5,6], count=3)`. Each `add_values` call built two heap-allocated `Field` temporaries just to compare min/max — allocation, memcpy, variant dispatch, destructor — per call.
Fix
Track page min/max as raw `ValType` (the same native type `std::minmax_element` already uses). Materialize `Field` only once, at `flush()` time.

Benchmark (Release, 65536 rows/iter); `block_rows` = post-null-split run length. Small batches (1–16): 2×–11× speedup. Large batches: no regression.
When variant columns are sparse, the frequent add calls on the zone-map index during compaction make field creation and field comparison account for around 12% of CPU usage. After this optimization, that drops to about 3%.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)