fix(parquet): bound data page byte size for large variable-width values #9972
adriangb wants to merge 8 commits
Conversation
`short_string_non_null` writes 1M 8-byte strings — exercises the BYTE_ARRAY write path where per-value bookkeeping cost is largest.

`large_string_non_null` writes 1024 rows of 256 KiB strings — the case where individual values exceed the default data-page byte limit, so a default `write_batch_size`-row chunk would otherwise buffer hundreds of MiB before any page-size check fires.

Both fill gaps in the existing arrow_writer benches, which only cover random-length strings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
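For orientation, a minimal sketch (assumed names, not the PR's actual criterion bench code) of the shape of the `large_string_non_null` case: 1024 rows of 256 KiB strings written through `ArrowWriter` with default properties.

```rust
// Hypothetical sketch of the large_string_non_null case's shape; the real
// bench lives in parquet's criterion arrow_writer suite and will differ.
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use parquet::arrow::ArrowWriter;

fn write_large_strings() {
    let value = "x".repeat(256 * 1024); // 256 KiB per row
    let array: ArrayRef = Arc::new(StringArray::from_iter_values(
        std::iter::repeat(value.as_str()).take(1024), // 1024 rows
    ));
    let batch = RecordBatch::try_from_iter([("col", array)]).unwrap();

    let mut buf = Vec::new();
    // Default properties: write_batch_size = 1024, so without the fix the
    // whole ~256 MiB column lands in one mini-batch before any size check.
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
```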
run benchmark arrow_writer

Force-pushed 393ead0 to 4823429

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (4823429) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)

run benchmark arrow_writer

Force-pushed d0e3d97 to 0fd6dcb
The parquet column writer only checks the data page byte limit AFTER each mini-batch finishes writing, and mini-batches are sized by row count (`write_batch_size`, default 1024). For BYTE_ARRAY columns with large values — e.g. a 5 MiB image blob per row — a single mini-batch can buffer multiple GiB into one data page before the configured byte limit is even consulted. Pages can exceed the limit by orders of magnitude.

Make the mini-batch size byte-budget aware:

- For each chunk, ask the encoder how many of the next values fit in one page byte budget. If everything fits, stay on the existing batched fast path (zero behavior change for small values).
- If not, sub-batch — for flat columns, one mini-batch per `k` values where `k` is the fit count; for repeated columns, one mini-batch per record (since a record cannot span data pages).
- Skip the check while dictionary encoding is active: the byte estimate is plain-encoded size, but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. Dictionary fallback bounds dict-encoded pages independently.

The encoder hook is `count_values_within_byte_budget(values, offset, len, byte_budget) -> Option<usize>` plus a `_gather` variant for the arrow path, mirroring the existing `write`/`write_gather` split. Returning `None` means "no cheap estimate available; stay batched."

Implementation details:

- `ParquetValueType::byte_size(&self)` returns the per-value plain-encoded byte size. Defaults to `size_of::<Self>()`; overridden for `ByteArray` (`len + 4`) and `FixedLenByteArray` (`len`).
- Standard `ColumnValueEncoderImpl<T>::count_values_within_byte_budget` short-circuits to `(byte_budget / size_of::<T::T>()).max(1).min(n)` for fixed-size physical types — one division, no walk. For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY it scans values cumulatively and exits at the first one to push the sum past the budget, which also catches skewed distributions (a single oversized value among many small ones is detected wherever it lands). See the sketch after this message.
- Arrow `ByteArrayEncoder::count_values_within_byte_budget_gather` uses a two-stage walk on `GenericByteArray<O>` types: stage 1 computes the total in O(1) via one subtraction on the offsets buffer when indices are contiguous (the case for every non-null column), returning immediately if the chunk fits. Stage 2 walks per-index lengths from the offsets buffer (still no slice/UTF-8 construction) when stage 1 doesn't conclude. View/dict/fixed-size-binary arrays fall through to a per-value walk via `ArrayAccessor::value`.
- `LevelDataRef::value_count(total, max_def)` reports how many levels in the chunk correspond to actual non-null values. Used to bridge the encoder's value-count answer back into level-count subdivision for nullable columns.

Tests in `column::writer::tests`:

- `test_column_writer_caps_page_size_for_large_byte_array_values` — flat regression: 64 × 64 KiB BYTE_ARRAY values vs a 16 KiB page limit produces one page per value rather than a single ~4 MiB page.
- `test_column_writer_caps_page_size_for_large_values_in_list` — Materialized-rep branch of `write_granular_chunk`: list of 3 large blobs × 3 records, asserts one page per record (no record splits).
- `test_column_writer_caps_page_size_with_nullable_large_values` — `LevelDataRef::value_count` on Materialized def levels with mixed nulls.
- `test_column_writer_dict_enabled_large_values_post_spill` — `has_dictionary()` short-circuit while dict is active, then byte-budget sub-batching after dict spill.
- `test_column_writer_caps_page_size_for_fixed_len_byte_array` — `FixedLenByteArray::byte_size` override.

Tests in `arrow::arrow_writer::tests`:

- `test_arrow_writer_caps_page_size_for_large_strings` — end-to-end through `ArrowWriter` exercising the offsets-buffer fast path.
- `test_arrow_writer_caps_page_size_for_large_string_view` — view-array fallback (Utf8View has no contiguous offsets buffer).
- `test_arrow_writer_all_null_string_column` — `value_count` Uniform branch under arrow's level optimization; asserts null_count and page coverage rather than just non-empty output.
- `test_arrow_writer_granular_mode_roundtrip` — value-fidelity round-trip: mix small + large strings so the byte-budget cutoff lands mid-chunk, write through `ArrowWriter`, read back with `ParquetRecordBatchReader`, assert each string matches.

Bench results vs `main` (5-run medians on a noisy laptop, run-to-run variance ~±2%):

- `primitive/default` (i32 25% null): −0.4% to +1.3%
- `primitive_non_null/default`: −2.3% to +0.4%
- `bool_non_null/default`: +1.8% to +15.9% (highly noisy on this machine)
- `string/default`: +3.3% to +4.7%
- `short_string_non_null/default` (new, 1M × 8 B): +1.0% to +6.4%
- `large_string_non_null/default` (new, 1024 × 256 KiB): +0.5% to +2.7% — the case the fix targets
- `string_dictionary/default`: +3.3% to +6.4%
- `string_non_null/default`: −1.6% to +2.3%

All within laptop variance for the fast-path (small-value) cases. The fix's intended case — large variable-width values — now correctly bounds page sizes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
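As a rough illustration of the two estimation strategies described above, a minimal sketch over plain slices; function names and signatures here are illustrative, not the PR's exact `ColumnValueEncoderImpl` code.

```rust
/// Fixed-size physical types: one division, no walk over the values.
fn fit_count_fixed<T>(n: usize, byte_budget: usize) -> usize {
    (byte_budget / std::mem::size_of::<T>()).max(1).min(n)
}

/// BYTE_ARRAY: cumulative scan that exits at the first value pushing the
/// running plain-encoded size (len + 4-byte length prefix) past the budget.
/// A single oversized value among many small ones is caught wherever it lands.
fn fit_count_byte_array(values: &[&[u8]], byte_budget: usize) -> usize {
    let mut total = 0usize;
    for (i, v) in values.iter().enumerate() {
        total += v.len() + 4;
        if total > byte_budget {
            // `i` values fit before the overflow; always make progress.
            return i.max(1);
        }
    }
    values.len() // everything fits: caller stays on the batched fast path
}
```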
Force-pushed 0fd6dcb to 24b83c7

run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (24b83c7) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)

run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (24b83c7) to 48fa8a7 (merge-base)
…w arrays

The byte-budget check on `Utf8View` / `BinaryView` columns previously fell through to a per-value walk via `ArrayAccessor::value`, which constructs a `&str`/`&[u8]` slice for each index — chasing the buffer pointer through the view's u128 word, then slicing `data_buffers[i]`. At ~1 µs per chunk over ~1000 chunks on the 1M-row `string_and_binary_view` bench, that was a consistent ~+3–5% regression vs `main` in both GKE benchmark runs.

View arrays store each value's length in the low 32 bits of its u128 view, so we can scan lengths with no data-buffer dereferences:

```
let len = (views[idx] as u32) as usize;
```

Add a dedicated fast path for `Utf8View` and `BinaryView` that walks the views buffer directly. Falls through to the per-value walk only for `FixedSizeBinary` and `Dictionary` — the latter still needs the dictionary-keys indirection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
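A minimal sketch of what such a views-buffer walk can look like, assuming access to `StringViewArray::views()`; the helper name and exact integration are illustrative.

```rust
use arrow_array::StringViewArray;

/// Sum value lengths for `indices` without touching the data buffers:
/// each u128 view word stores the value's byte length in its low 32 bits.
fn view_lengths_total(array: &StringViewArray, indices: &[usize]) -> usize {
    let views = array.views(); // &[u128], one view word per row
    indices
        .iter()
        .map(|&idx| (views[idx] as u32) as usize) // low 32 bits = length
        .sum()
}
```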
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (70dc497) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)
Two targeted regressions surfaced in the GKE benchmark sweep:

1. `string_dictionary/*` regressed +30–89% vs `main` after writer-dict spill. The arrow Dictionary input falls through to the per-value walk via `ArrayAccessor::value`, which dereferences the dict (keys[idx] → values[key] → slice construction) for every index in every chunk. The whole point of the byte-budget check is to bound pages of large BYTE_ARRAY values, but an arrow column that's already Dictionary-encoded at the arrow layer implies its values are small enough that dedup is worthwhile — the opposite shape. Treat Dictionary input as "everything fits" and skip the check.

2. `list_primitive_sparse_99pct_null` regressed ~+8% across props. The cost was `LevelDataRef::value_count`'s O(N) def-level scan on the 20,000-row compact-levels chunks the list path uses. The arrow path already has the answer cheaper: `value_indices` is the sorted list of non-null positions in the batch, so the count of indices falling in the current chunk's level range is a binary search (one `partition_point`), as sketched below. Use that when `value_indices` is `Some` and fall back to the def-level scan only on the non-arrow path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
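A minimal sketch of the binary-search idea, shown with two `partition_point` calls for clarity; a sequential writer that tracks its position can get by with one, as the commit notes. The function name is illustrative.

```rust
/// Count the non-null values inside the chunk's level range [start, end).
/// `value_indices` is the sorted list of non-null positions in the batch,
/// so the answer is the width of the matching index window.
fn values_in_chunk(value_indices: &[usize], start: usize, end: usize) -> usize {
    let lo = value_indices.partition_point(|&i| i < start);
    let hi = value_indices.partition_point(|&i| i < end);
    hi - lo
}
```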
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (bbe2b7e) to 48fa8a7 (merge-base)
Have you considered making the batch size configurable per column?

🤖 Arrow criterion benchmark completed (GKE)
Yes, that may be a simpler approach. But I'm hoping we can get to a place where users don't have to think about or configure this. Given they gave us a page size limit, it'd be nice if we can always adhere to it...
```rust
/// push a page far past the configured page byte limit before the
/// post-write size check fires.
#[inline]
fn byte_size(&self) -> usize {
```
This seems to duplicate `dict_encoding_size`. Also, #9700 wants to rename `dict_encoding_size` and instead implement it pretty much the same way as is done here.
Another thought... maybe add another chunker like the CDC work added. If we compute batches up front, when we know the shape of the data, that might be faster 🤷
🤖 Arrow criterion benchmark completed (GKE)
…dgetChunker

Two small structural cleanups in response to PR review:

- Remove `ParquetValueType::byte_size`. It overlapped with `dict_encoding_size`, which @etseidl pointed out is being renamed and generalized in apache#9700. Instead, compute the per-value plain-encoded byte cost inline in `ColumnValueEncoderImpl::count_values_within_byte_budget` from `dict_encoding_size`'s components, dispatched on the physical type (same dispatch shape as `DictEncoder::push` in `encodings/encoding/dict_encoder.rs:52`). No new trait method.

- Lift the byte-budget mini-batch sizing decision out of `write_batch_internal` into a new `ByteBudgetChunker` struct (`column/writer/byte_budget_chunker.rs`). The chunker captures the column-open-time facts (page byte limit, static-fits flag, max_def_level) once and exposes one `pick_sub_batch_size` method, sketched below. `write_batch_internal`'s inner loop is now ~25 lines shorter and reads as: compute chunk boundary → ask chunker for sub_batch_size → write_mini_batch or write_granular_chunk.

This is the lightweight version of the "make it a chunker like CDC" suggestion. A full CDC-style pre-compute would emit all chunk boundaries upfront, but the byte-budget decision depends on the encoder's live `has_dictionary()` state, which changes mid-batch when the writer's dictionary spills. Querying that per chunk (as this refactor does) preserves the existing dict-active short-circuit; a precomputed plan would force a choice between losing that short-circuit or losing correctness when dict spills mid-batch on large-value columns.

No behavior change. Tests still pass and `cargo bench` shows the same deltas as before the refactor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
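A minimal sketch of the chunker shape described above; field and method names follow the commit message, but the bodies are illustrative, not the PR's exact code.

```rust
// Hypothetical sketch of ByteBudgetChunker: capture column-open-time facts
// once, answer one question per chunk.
pub struct ByteBudgetChunker {
    page_byte_limit: usize,
    // True when the physical type is fixed-size and small enough that a
    // default mini-batch can never blow the page byte budget.
    static_always_fits: bool,
}

impl ByteBudgetChunker {
    #[inline]
    pub fn pick_sub_batch_size(
        &self,
        chunk_size: usize,
        dict_active: bool,
        // Encoder's budget answer: how many of the next values fit in the
        // given byte budget; None means "no cheap estimate, stay batched".
        fit_count: impl FnOnce(usize) -> Option<usize>,
    ) -> usize {
        // Fast path: dict-encoded pages store small RLE indices, and
        // statically-small types cannot exceed the budget in one chunk.
        if self.static_always_fits || dict_active || chunk_size == 0 {
            return chunk_size;
        }
        // Slow path: sub-batch to the encoder's fit count, or stay batched.
        fit_count(self.page_byte_limit)
            .unwrap_or(chunk_size)
            .clamp(1, chunk_size)
    }
}
```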
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (145ea5d) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)
GKE bench shows string_dictionary regresses ~+90% on the branch even though `pick_sub_batch_size` should short-circuit instantly when the encoder's dictionary is still active (single struct-field load + virtual call into `has_dictionary()`). Local laptop benches don't reproduce the regression, suggesting it's an architecture-specific inlining/code-layout effect on the GKE aarch64 runner.

Mark `new` and `pick_sub_batch_size` `#[inline]` to give the compiler a clear hint that these should fold into `write_batch_internal`'s hot loop.

Local laptop bench is unchanged (~+3% on string_dictionary, ~+5% on string_and_binary_view, both within noise); pushing to see whether GKE moves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (403af94) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)
…es are non-null

The chunker's per-chunk `partition_point` (arrow path) or `LevelDataRef::value_count` (non-arrow path) returns `chunk_size` by construction whenever the column has no nulls. The GKE bench showed ~+12–27% regressions on `list_primitive_non_null/*` and `string_non_null/*` consistent with that walk dominating: ~50K chunks × a binary search through a 50M-entry `non_null_indices` buffer means cold cache reads on every chunk.

Compute a `ValueCountStrategy` once at `write_batch_internal` entry (see the sketch below):

- `AllPresent` — set when the arrow caller passed `non_null_indices.len() == num_levels`, or when the column has `max_def_level == 0`. The chunker uses `chunk_size` directly with no per-chunk work.
- `Sorted(&[usize])` — arrow nullable path; binary-search the indices.
- `DefLevelScan(max_def)` — non-arrow nullable path; def-level scan.

For the bench's `list_primitive_non_null` (all-non-null lists with a 50M-entry leaf), this drops the per-chunk binary search entirely; expected to bring those rows back near noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
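A minimal sketch of the strategy enum and its dispatch; variant names follow the commit message, the bodies are illustrative.

```rust
// Hypothetical sketch of ValueCountStrategy: decided once per batch,
// consulted per chunk.
enum ValueCountStrategy<'a> {
    /// No nulls possible in this batch: every level is a value.
    AllPresent,
    /// Arrow nullable path: sorted non-null positions, binary-searched.
    Sorted(&'a [usize]),
    /// Non-arrow nullable path: scan def levels for `level == max_def`.
    DefLevelScan(i16),
}

impl ValueCountStrategy<'_> {
    /// Values inside the chunk's level range [start, end).
    fn value_count(&self, def_levels: &[i16], start: usize, end: usize) -> usize {
        match self {
            Self::AllPresent => end - start, // no per-chunk work
            Self::Sorted(indices) => {
                indices.partition_point(|&i| i < end)
                    - indices.partition_point(|&i| i < start)
            }
            Self::DefLevelScan(max_def) => def_levels[start..end]
                .iter()
                .filter(|&&d| d == *max_def)
                .count(),
        }
    }
}
```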
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (bb19d3e) to 48fa8a7 (merge-base)

🤖 Arrow criterion benchmark completed (GKE)
…th out

The previous `#[inline]` hint was no longer enough once `pick_sub_batch_size` grew the `ValueCountStrategy` match — LLVM silently stopped inlining and the most recent GKE bench bounced `string_dictionary/*` back to +46–86% (`default` +81%, `parquet_2` +86%, `bloom_filter` +46%).

Fix (sketched below):

1. Mark `pick_sub_batch_size` `#[inline(always)]`. The hot path is just `if static_always_fits || has_dictionary || chunk_size == 0 { return chunk_size; }` — one struct-field load + one virtual call — so unconditional inlining is the right call, not a heuristic suggestion.

2. Pull the byte-budget computation out into a separate `byte_budget_sub_batch_size` method marked `#[inline(never)]`. This keeps the inlined fast path small even as the slow path grows; the slow path is paid for explicitly when the bypasses don't fire, not smuggled into every chunk's inline body.

Same behavior, just compiler-friendlier code layout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
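A minimal sketch of the layout split, with free functions standing in for the chunker's methods; the estimation logic itself is elided.

```rust
// Hypothetical sketch of the inline-layout split; the real methods live on
// ByteBudgetChunker and do real estimation work.
#[inline(always)]
fn pick_sub_batch_size(
    static_always_fits: bool,
    dict_active: bool,
    chunk_size: usize,
    page_byte_limit: usize,
) -> usize {
    // Hot path: a couple of branches that fold into the caller's loop.
    if static_always_fits || dict_active || chunk_size == 0 {
        return chunk_size;
    }
    byte_budget_sub_batch_size(chunk_size, page_byte_limit)
}

// Kept out-of-line so the inlined fast path stays small as this grows;
// paid for explicitly only when none of the bypasses fire.
#[inline(never)]
fn byte_budget_sub_batch_size(chunk_size: usize, page_byte_limit: usize) -> usize {
    // Real byte-budget estimation elided in this sketch.
    let _ = page_byte_limit;
    chunk_size
}
```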
run benchmark arrow_writer

🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (9d647dc) to 48fa8a7 (merge-base)
We write large values into our parquet files (e.g. a 5 MB LLM prompt). A naive write will cause massive pages (we've seen up to 2 GB) at default write settings. The main knob to control this is `write_batch_size`, which defaults to 1024. But if each row is 5 MB, that's 5 GB. On the other hand, setting this to something small like 32 kills write performance and is completely unnecessary for other fixed-width columns. The writer even documents this (see `parquet/src/column/writer/mod.rs`).

This PR makes the mini-batch size byte-budget aware:

- Estimate `bytes_per_value` from the values about to be written and pick `sub_batch_size = page_byte_limit / bytes_per_value` (clamped ≥ 1).
- For small values `sub_batch_size` ≥ chunk size, so we stay on the existing batched fast path with zero behavior change.

Implementation notes

- Skip the byte-size check while parquet dictionary encoding is active: `estimated_value_bytes` returns plain-encoded size but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. Dict fallback bounds dict-encoded pages independently.
- For repeated/nested columns the sub-batch steps record-by-record (rep == 0 boundaries) so a record never spans data pages, matching the parquet format rule.
Regression test

`test_column_writer_caps_page_size_for_large_byte_array_values` writes 64 × 64 KiB BYTE_ARRAY values with a 16 KiB page byte limit. Before this fix that produced a single ~4 MiB page; after, it's one page per value (~64 pages, all within ~2× the value size).
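A minimal sketch of an equivalent end-to-end check through `ArrowWriter`; the PR's actual test uses the low-level column writer, so the setup and names here are illustrative.

```rust
// Hypothetical end-to-end variant of the regression scenario: 64 values of
// 64 KiB against a 16 KiB data page limit.
use std::sync::Arc;

use arrow_array::{ArrayRef, BinaryArray, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn large_values_bounded_pages() {
    let blob = vec![0xAB_u8; 64 * 1024]; // 64 KiB per value
    let array: ArrayRef = Arc::new(BinaryArray::from_iter_values(
        std::iter::repeat(blob.as_slice()).take(64),
    ));
    let batch = RecordBatch::try_from_iter([("col", array)]).unwrap();

    let props = WriterProperties::builder()
        .set_data_page_size_limit(16 * 1024) // 16 KiB page byte limit
        .build();
    let mut buf = Vec::new();
    let mut writer =
        ArrowWriter::try_new(&mut buf, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    // The real test then walks the written column chunk's page headers and
    // asserts roughly one data page per value, not a single ~4 MiB page.
}
```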
Bench results

5-run medians, criterion `arrow_writer` bench, default writer properties, on a noisy laptop (run-to-run variance ~±1.6%), across: `primitive/default` (i32 25% null), `primitive_non_null/default`, `bool_non_null/default`, `string/default`, `short_string_non_null/default` (new, 1M × 8 B), `large_string_non_null/default` (new, 1024 × 256 KiB), `string_non_null/default`, `string_dictionary/default`, `list_primitive/default`, `list_primitive_non_null/default`.

🤖 Generated with Claude Code