[improve](compaction) Use segment footer raw_data_bytes for first-time batch size estimation #62263
…e batch size estimation

When vertical compaction runs for the first time on a tablet (no historical sampling data), estimate_batch_size() previously returned a hardcoded value of 992, which could cause OOM for wide tables or be too conservative for narrow tables.

This change uses ColumnMetaPB.raw_data_bytes from segment footer to compute a per-row size estimate for the first compaction. raw_data_bytes records the original data size before encoding, which closely approximates runtime Block::bytes(). Subsequent compactions continue to use the existing historical sampling mechanism unchanged.

Key design decisions:
- Footer collection only runs when needed (no manual override, and at least one column group lacks historical sampling data)
- Variant columns (raw_data_bytes=0 TODO) trigger fallback to 992
- Structural overhead (+1 null map, +8 offset) only added for scalar columns with actual footer data
- Complex types (ARRAY/MAP/STRUCT) use raw_data_bytes directly without structural compensation as it already includes recursive sub-writer data
- Historical sampling now uses Block::allocated_bytes() instead of bytes() for more accurate memory estimation
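The per-column rules in this commit message can be sketched as follows. This is a minimal illustration, not the code in `merger.cpp`: the `FooterColumnStats` struct, the function name, and the rounding are hypothetical, and it assumes the +1 applies to nullable columns and the +8 to variable-length columns, which the message does not spell out. An empty result means "no estimate, keep the default".

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified view of the per-column footer data (not a Doris type).
struct FooterColumnStats {
    uint64_t raw_data_bytes;  // pre-encoding payload bytes summed over source segments
    uint64_t rows_with_data;  // rows covered by the segments that reported data
    bool is_variant;          // VARIANT writers report raw_data_bytes = 0 (TODO)
    bool is_complex;          // ARRAY / MAP / STRUCT
    bool is_nullable;         // assumption: +1 byte per row for the null map
    bool is_var_length;       // assumption: +8 bytes per row for the offset
};

// Returns the estimated bytes per row for one column group, or an empty
// optional when the caller should keep the default behaviour.
std::optional<uint64_t> estimate_group_row_bytes(
        const std::vector<FooterColumnStats>& cols) {
    uint64_t per_row = 0;
    bool measured = false;
    for (const auto& c : cols) {
        if (c.is_variant) {
            return std::nullopt;  // variant columns force the fallback
        }
        if (c.raw_data_bytes == 0 || c.rows_with_data == 0) {
            continue;  // no footer data: contributes nothing, no overhead added
        }
        // Average payload per row, rounded up.
        uint64_t bytes = (c.raw_data_bytes + c.rows_with_data - 1) / c.rows_with_data;
        if (!c.is_complex) {
            // Structural compensation only for scalar columns.
            if (c.is_nullable) bytes += 1;
            if (c.is_var_length) bytes += 8;
        }
        per_row += bytes;
        measured = true;
    }
    if (!measured) {
        return std::nullopt;
    }
    return per_row;
}
```

For a group made of a non-null INT (4 bytes/row), a nullable VARCHAR averaging 10 bytes/row, and an ARRAY averaging 32 bytes/row, the sketch yields 4 + (10 + 1 + 8) + 32 = 55 bytes per row.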
…ion init

Log per_row, sample_bytes, sample_rows immediately after all merge inputs finish loading their first block, before the actual merge starts. This helps diagnose memory issues by showing the actual per-row memory size at init time.
The log was added to help diagnose vertical compaction memory issues. Investigation is complete; the existing 'estimate batch size' log in merger.cpp already provides per-group batch_size and per_row info for daily monitoring.
Findings:
- `be/src/storage/merger.cpp`: the new footer estimator still underestimates nullable fixed-width columns when nulls are present, so a first vertical compaction can still choose an OOM-sized batch.
- `be/src/storage/merger.cpp`: the footer path silently uses partial metadata when some source segments do not carry `raw_data_bytes` yet, so after a rolling upgrade the first compaction can be sized from an incomplete subset of rowsets.
Critical checkpoint conclusions:
- Goal / proof: Partially met. The PR improves first-time vertical compaction estimation, but the footer path is still inaccurate for nullable fixed-width columns, and the new tests only cover non-null INT/VARCHAR cases.
- Scope / minimality: Yes. The change stays focused on compaction memory estimation and related unit tests.
- Concurrency: No new concurrency issue found. `sample_info_lock` scope remains small and consistent.
- Lifecycle / static initialization: No special lifecycle or static-init risk introduced.
- Configuration: No new config items.
- Compatibility: Not fully handled. Older rowsets can legitimately lack `raw_data_bytes`, but this path still uses the partial footer data instead of falling back.
- Parallel / equivalent paths: The historical sampling path and first-compaction footer path are both updated, but the footer path diverges from the real runtime memory layout in the nullable case above.
- Special conditions: The `VARIANT` fallback behavior is explicit and acceptable.
- Test coverage: Not sufficient for nullable fixed-width and mixed old/new segment cases on first compaction.
- Observability: Existing INFO/WARNING logs are sufficient for this change.
- Transaction / persistence / data write / FE-BE variable passing: Not applicable here.
- Performance: Switching historical sampling to `allocated_bytes()` is directionally correct, but the footer estimate is still unsafe in the cases above.
- Other issues: None beyond the findings above.
- User focus: No additional user-provided focus points were supplied, so there was nothing extra to verify there.
…ize estimation

- Fall back to default per-row when any column in the group lacks footer raw_data_bytes (e.g. legacy segments after rolling upgrade), instead of silently summing only the columns we measured.
- For fixed-width scalar columns, lower-bound the per-row estimate by the fixed type size. raw_data_bytes only counts non-null payload, but the reader still allocates the full nested column slot for null rows via ColumnNullable::insert_many_defaults(), so highly nullable INT/BIGINT/etc. columns were under-estimated and could still pick an OOM-sized batch on first compaction.
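The two fixes above can be illustrated with a small sketch. The struct, field names, and function name are hypothetical and the null-map and offset compensation is left out to keep the focus on the lower bound and the all-or-nothing fallback; the real logic lives in `be/src/storage/merger.cpp` and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-column input for the estimator (not a Doris type).
struct ColumnFooterInfo {
    uint64_t raw_data_bytes;   // non-null payload bytes reported by the footer
    uint64_t num_rows;         // total rows, including null rows
    uint32_t fixed_type_size;  // e.g. 4 for INT, 8 for BIGINT, 0 if variable-length
    bool has_footer_stats;     // false for legacy segments without raw_data_bytes
};

// Returns bytes per row for the group, or an empty optional when any column
// lacks footer data, so the caller falls back to the default per-row value.
std::optional<uint64_t> per_row_with_lower_bound(
        const std::vector<ColumnFooterInfo>& group) {
    if (group.empty()) {
        return std::nullopt;
    }
    uint64_t per_row = 0;
    for (const auto& col : group) {
        if (!col.has_footer_stats || col.num_rows == 0) {
            return std::nullopt;  // never size a batch from partial metadata
        }
        // Average payload per row, rounded up.
        uint64_t avg = (col.raw_data_bytes + col.num_rows - 1) / col.num_rows;
        // Null rows still occupy a full slot in the nested column, so a
        // fixed-width column can never cost less than its type size per row.
        per_row += std::max<uint64_t>(avg, col.fixed_type_size);
    }
    return per_row;
}
```

With a nullable INT column that is 10% non-null over 1000 rows, the footer reports about 400 payload bytes, so the raw average rounds up to 1 byte per row; the lower bound lifts it to 4. Adding a legacy column without footer data makes the whole group fall back.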
…ytes

Asserts raw_data_bytes only counts non-null payload for a nullable fixed-width column (10% non-null INT in this case), which is the premise behind the type_size lower bound added to the footer-based per-row estimation.
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…e batch size estimation (#62263)

## Summary

- When vertical compaction runs for the first time on a tablet (no historical sampling data), `estimate_batch_size()` previously returned a hardcoded value of 992, which could cause OOM for wide tables or be too conservative for narrow tables
- This change uses `ColumnMetaPB.raw_data_bytes` from segment footer to compute a per-row size estimate for the first compaction. `raw_data_bytes` records the original data size before encoding, which closely approximates runtime `Block::bytes()`
- Historical sampling now uses `Block::allocated_bytes()` instead of `bytes()` for more accurate memory estimation (`size()` vs `capacity()`)
- Subsequent compactions with historical sampling data are completely unchanged

### Key design decisions

| Column type | Estimation strategy |
|------------|-------------------|
| Scalar (INT/VARCHAR etc.) | `raw_data_bytes / rows_with_data` + structural compensation (+1 null map, +8 offset) |
| Complex (ARRAY/MAP/STRUCT) | `raw_data_bytes / rows_with_data`, no compensation (already includes recursive sub-writer data) |
| VARIANT (root/subcolumn) | Fallback to 992 (`raw_data_bytes=0 // TODO` in writer) |

### Performance safeguards

- Footer collection only runs on first compaction (no historical sampling data)
- Skipped entirely when `compaction_batch_size` is manually set
- OOM backoff and sparse optimization paths are untouched

## Test plan

- [ ] Wide table (200+ columns) first compaction does not OOM
- [ ] Narrow table first compaction batch_size is close to upper limit
- [ ] Multi-round compaction: first round uses footer, subsequent rounds use historical sampling
- [ ] Variant columns fallback to 992
- [ ] Sparse optimization is not affected
- [ ] `TestFirstCompactionUsesFooterEstimation` unit test passes
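The `size()` vs `capacity()` distinction behind the switch to `allocated_bytes()` can be shown with a toy column. This is only an analogy built on `std::vector`; `ToyColumn` and `make_column` are hypothetical names, and Doris columns use their own buffer types rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a column buffer, used only to contrast the two measures.
struct ToyColumn {
    std::vector<int32_t> data;

    // Bytes of live values: driven by size().
    std::size_t bytes() const { return data.size() * sizeof(int32_t); }

    // Bytes actually held by the buffer: driven by capacity().
    std::size_t allocated_bytes() const { return data.capacity() * sizeof(int32_t); }
};

// Builds a column that reserved room for `reserved` rows but holds only `rows`.
ToyColumn make_column(std::size_t rows, std::size_t reserved) {
    ToyColumn col;
    col.data.reserve(reserved);  // capacity() is now at least `reserved`
    col.data.resize(rows);       // size() becomes `rows`, value-initialised to 0
    return col;
}
```

A column built with `make_column(100, 4096)` reports 400 bytes from `bytes()` but at least 16384 bytes from `allocated_bytes()`, which is closer to the memory the process actually holds and is why sampling the allocated size gives a safer per-row estimate.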