Summary
Follow-up to #22416. In split_vec_min_alloc, the residual vector keeps its original capacity across a long split_off chain. Under DataFusion's capacity-based memory accounting, that residual is charged for the full backing allocation until it drops, even after most of the data has been emitted.
Background
split_vec_min_alloc lives in datafusion_common::utils and is used by the multi_group_by group-value builders (bytes.rs, primitive.rs) to carve fixed-size chunks off a growing vector.
In the split_off branch, the emitted prefix keeps the original capacity, and the residual likewise keeps its original capacity (unchanged by split_off).
Concrete pathological case (raised by ariel-miculas in #22416 (comment)): a vector of one million elements is drained 1024 at a time. Each emitted slice has capacity 1024 except the final one, which inherits capacity one million via split_off + mem::replace. Symmetrically, on every iteration before the final one the residual still owns the full backing buffer, so accounting charges for it for the entire chain.
Why this was not fixed in #22416
#22416 considered calling shrink_to_fit on the emitted prefix in the split_off branch and backed it out. A caller in datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes.rs pushes onto the emitted prefix immediately after the call (first_n_offsets.push(...)); shrinking the prefix forces a realloc on the next push. The test emitted_prefix_does_not_realloc_on_push in datafusion/common/src/utils/mod.rs pins that constraint.
The push-after-emit constraint applies to the prefix, not to the residual. The residual is the side that lingers; the prefix is the side that gets pushed onto.
Proposed shapes
Two candidates, in order of decreasing utility-side change:
-
Have split_vec_min_alloc shrink the residual when its length is much smaller than its capacity (some ratio threshold). One realloc near the end of a long chain replaces carrying the full allocation indefinitely. Adds a heuristic to the shared utility.
-
Push the shrink to the caller's finalize path. Callers driving long split chains (the multi_group_by builders) call shrink_to_fit (or shrink_to) on the residual once they know no more pushes are coming. Keeps the utility allocation-policy-free; each caller picks based on its own access pattern.
Whichever shape lands should come with a benchmark on the 1M to 1024 case plus a smaller fixture to keep CI cheap, so the residual shrink can be shown not to regress the common short-chain path.
Related
Summary
Follow-up to #22416. In
split_vec_min_alloc, the residual vector keeps its original capacity across a longsplit_offchain. Under DataFusion's capacity-based memory accounting, that residual is charged for the full backing allocation until it drops, even after most of the data has been emitted.Background
split_vec_min_alloclives indatafusion_common::utilsand is used by themulti_group_bygroup-value builders (bytes.rs,primitive.rs) to carve fixed-size chunks off a growing vector.In the
split_offbranch, the emitted prefix keeps the original capacity, and the residual likewise keeps its original capacity (unchanged bysplit_off).Concrete pathological case (raised by ariel-miculas in #22416 (comment)): a vector of one million elements is drained 1024 at a time. Each emitted slice has capacity 1024 except the final one, which inherits capacity one million via
split_off+mem::replace. Symmetrically, on every iteration before the final one the residual still owns the full backing buffer, so accounting charges for it for the entire chain.Why this was not fixed in #22416
#22416 considered calling
shrink_to_fiton the emitted prefix in thesplit_offbranch and backed it out. A caller indatafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes.rspushes onto the emitted prefix immediately after the call (first_n_offsets.push(...)); shrinking the prefix forces a realloc on the next push. The testemitted_prefix_does_not_realloc_on_pushindatafusion/common/src/utils/mod.rspins that constraint.The push-after-emit constraint applies to the prefix, not to the residual. The residual is the side that lingers; the prefix is the side that gets pushed onto.
Proposed shapes
Two candidates, in order of decreasing utility-side change:
Have
split_vec_min_allocshrink the residual when its length is much smaller than its capacity (some ratio threshold). One realloc near the end of a long chain replaces carrying the full allocation indefinitely. Adds a heuristic to the shared utility.Push the shrink to the caller's finalize path. Callers driving long split chains (the
multi_group_bybuilders) callshrink_to_fit(orshrink_to) on the residual once they know no more pushes are coming. Keeps the utility allocation-policy-free; each caller picks based on its own access pattern.Whichever shape lands should come with a benchmark on the 1M to 1024 case plus a smaller fixture to keep CI cheap, so the residual shrink can be shown not to regress the common short-chain path.
Related