[VL] Add Velox batch resizer copyRanges fast path #12101
Merged
yaooqinn merged 4 commits on May 20, 2026
Conversation
yaooqinn
reviewed
May 19, 2026
constexpr int32_t kInputBatches = 64;
constexpr int32_t kRowsPerBatch = 64;
constexpr int32_t kTotalRows = kInputBatches * kRowsPerBatch;
constexpr int64_t kPreferredBatchBytes = std::numeric_limits<int64_t>::max();
Member
The 7 benchmark scenarios are all dense-flat, but enableCopyRanges=true (default) silently routes dictionary / constant / topLevelNull inputs through RowVector::copy (VeloxBatchResizer.cc:148). Suggest adding 1 dict-heavy + 1 constant-heavy scenario to validate the fallback path is at least perf-neutral before flipping the default on.
Contributor
Author
I added two fallback-focused benchmark scenarios for the copyRanges default-on path:
- DictionaryHeavy_64x64
- ConstantHeavy_64x64
Each scenario compares:
- BM_VeloxBatchResizerFallbackAppendOptOutBaseline: enableCopyRanges=false, the old incremental append path.
- BM_VeloxBatchResizerDefaultCopyRangesFallback: default enableCopyRanges=true; dictionary/constant inputs are unsupported by child copyRanges, so they fall back to RowVector::copy inside the new collect-and-copy path.
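The routing rule described above can be modeled in a few lines. This is a minimal sketch, not the code in VeloxBatchResizer.cc: `Encoding`, `BatchModel`, `CopyPath`, and `choosePath` are hypothetical names introduced for illustration, and the sketch assumes the decision is made per input batch from the child encodings and the presence of top-level nulls, as the review comment describes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a vector encoding; not the real Velox enum.
enum class Encoding { kFlat, kDictionary, kConstant };

// Hypothetical model of one input batch: one encoding per child column,
// plus whether the row vector itself carries top-level nulls.
struct BatchModel {
  std::vector<Encoding> childEncodings;
  bool hasTopLevelNulls;
};

// The two copy strategies available once the output has been pre-sized.
enum class CopyPath { kChildCopyRanges, kRowVectorCopy };

// Assumed rule: a batch is eligible for child copyRanges only when every
// child is flat and there are no top-level nulls. Anything else falls back
// to a whole-row copy into the pre-sized output.
inline CopyPath choosePath(const BatchModel& batch) {
  const bool allFlat = std::all_of(
      batch.childEncodings.begin(),
      batch.childEncodings.end(),
      [](Encoding e) { return e == Encoding::kFlat; });
  return (allFlat && !batch.hasTopLevelNulls) ? CopyPath::kChildCopyRanges
                                              : CopyPath::kRowVectorCopy;
}
```

Under this model, a single dictionary or constant child is enough to send the whole batch through the fallback, which is the case the two new scenarios are meant to exercise.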
Benchmark command:
/tmp/gluten-cpp-ut-build/velox/benchmarks/velox_batch_resizer_benchmark \
--benchmark_min_time=0.05s \
--benchmark_repetitions=3 \
--benchmark_format=json \
> /tmp/velox_batch_resizer_perf.json
Environment:
- CPU: 32 logical CPUs @ ~2995 MHz
- Caches: L1D 48 KiB x16, L1I 32 KiB x16, L2 2 MiB x16, L3 36 MiB x1
- Note: ASLR was enabled, so the numbers may contain normal benchmark noise.
Dense flat scenarios, CPU time averaged over 3 repetitions:
┌──────────────────┬─────────────────────────┬────────────────────┬─────────┬─────────────────────────┬─────────────────────────────┐
│ Scenario │ Append opt-out baseline │ Default copyRanges │ Speedup │ Direct child copyRanges │ Raw payload bulk-copy model │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_64x64 │ 94.3 us │ 11.3 us │ 8.35x │ 14.4 us │ 34.8 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_16x256 │ 31.7 us │ 6.3 us │ 5.01x │ 5.3 us │ 11.2 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_256x16 │ 234.2 us │ 48.7 us │ 4.81x │ 30.6 us │ 119.2 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Fixed2_64x64 │ 29.3 us │ 5.4 us │ 5.43x │ 2.1 us │ 13.4 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Fixed16_64x64 │ 123.5 us │ 26.7 us │ 4.62x │ 18.6 us │ 95.7 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ LongString_64x64 │ 33.3 us │ 7.2 us │ 4.61x │ 4.1 us │ 15.7 us │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ BoolHeavy_64x64 │ 73.4 us │ 10.8 us │ 6.78x │ 5.4 us │ 38.1 us │
└──────────────────┴─────────────────────────┴────────────────────┴─────────┴─────────────────────────┴─────────────────────────────┘
Fallback scenarios requested by the review comment, CPU time averaged over 3 repetitions:
┌───────────────────────┬─────────────────────────┬─────────────────────────────┬──────────────┐
│ Scenario │ Append opt-out baseline │ Default copyRanges fallback │ Result │
├───────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────┤
│ DictionaryHeavy_64x64 │ 102.7 us │ 77.7 us │ 1.32x faster │
├───────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────┤
│ ConstantHeavy_64x64 │ 67.8 us │ 42.0 us │ 1.61x faster │
└───────────────────────┴─────────────────────────┴─────────────────────────────┴──────────────┘
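For readers who want to check the ratio columns, the arithmetic is baseline time divided by candidate time, rounded to two decimals. The helper names below (`speedup`, `round2`) are hypothetical and exist only for this sketch. Recomputing from the rounded table entries may differ from a published ratio in the last digit, presumably because the published ratios were derived from unrounded timings.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the candidate over the baseline: baseline time / candidate time.
// Both arguments are CPU times in the same unit (microseconds in the tables).
inline double speedup(double baselineUs, double candidateUs) {
  assert(candidateUs > 0.0);
  return baselineUs / candidateUs;
}

// Round to two decimals, matching the precision used in the tables.
inline double round2(double x) {
  return std::round(x * 100.0) / 100.0;
}
```

For example, `round2(speedup(102.7, 77.7))` reproduces the DictionaryHeavy_64x64 ratio and `round2(speedup(67.8, 42.0))` reproduces the ConstantHeavy_64x64 ratio.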
The fallback numbers are not regressions, because the enableCopyRanges=true fallback is not the same code path as the old enableCopyRanges=false append path.
With enableCopyRanges=false, VeloxBatchResizer creates an empty output RowVector and incrementally appends each small input batch via buffer->append(input). That path grows and updates the output vector batch-by-batch.
With enableCopyRanges=true, even when a batch is not eligible for child-vector copyRanges because it contains dictionary or constant children, the resizer first collects the small input batches, computes the final totalRows, allocates/resizes the output RowVector once, and then copies each unsupported input into the pre-sized output using:
buffer->copy(input.get(), offset, 0, input->size());
So these fallback benchmarks still benefit from the new collect-and-copy structure:
- one output allocation / resize for the final row count;
- fixed-offset RowVector::copy into a pre-sized output;
- less incremental append bookkeeping;
- Velox copy still has encoded-vector handling for dictionary and constant inputs, even though child copyRanges is not used.
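The structural difference between the two paths can be sketched with plain standard-library containers. This is an illustrative model only: `Batch`, `appendIncrementally`, and `collectAndCopy` are hypothetical names, and a `std::vector<int>` stands in for a RowVector, so none of the Velox encoding handling is represented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one small input batch of rows.
using Batch = std::vector<int>;

// Models enableCopyRanges=false: start from an empty output and grow it
// one input batch at a time, which may reallocate repeatedly.
inline Batch appendIncrementally(const std::vector<Batch>& inputs) {
  Batch out;
  for (const Batch& in : inputs) {
    out.insert(out.end(), in.begin(), in.end());
  }
  return out;
}

// Models the collect-and-copy structure: compute the final row count,
// size the output once, then copy each input at its fixed offset.
inline Batch collectAndCopy(const std::vector<Batch>& inputs) {
  std::size_t totalRows = 0;
  for (const Batch& in : inputs) {
    totalRows += in.size();
  }
  Batch out(totalRows);  // single allocation for the final row count
  std::size_t offset = 0;
  for (const Batch& in : inputs) {
    std::copy(in.begin(), in.end(),
              out.begin() + static_cast<std::ptrdiff_t>(offset));
    offset += in.size();
  }
  assert(offset == totalRows);
  return out;
}
```

Both functions produce the same rows; only the allocation and bookkeeping pattern differs, which is the part of the cost the fallback path still saves.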
This explains why the default copyRanges-enabled fallback path is faster than the append opt-out baseline in these two scenarios. The important validation point for the review is that dictionary-heavy and constant-heavy inputs do not show a perf regression after enabling copyRanges by default.
Add dictionary-heavy and constant-heavy VeloxBatchResizer benchmark scenarios so the default copyRanges-enabled path is compared against the append opt-out baseline when inputs fall back to RowVector::copy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
49744b4 to 6080dae
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yaooqinn
approved these changes
May 20, 2026
Member
yaooqinn
left a comment
LGTM, thank you @zhli1142015 for the updates
Add a default-enabled VeloxBatchResizer fast path that collects small dense batches, allocates the output RowVector once, and bulk-copies child vector ranges with copyRanges. The config remains available as an opt-out switch.
Wire the flag through Scala, Java, and JNI, add C++ coverage for fast-path and fallback behavior, add config default coverage, and add dense-vector benchmark scenarios comparing the append opt-out baseline, default copyRanges path, direct child copyRanges, reader-side raw payload bulk-copy model, and pre-merged flush model.
Benchmark results from velox_batch_resizer_benchmark (CPU time; ASLR enabled, so numbers may have noise):
Mixed_64x64: append opt-out baseline 95.1us, default copyRanges 19.7us, direct child copyRanges 17.4us, raw bulk-copy model 33.3us.
Mixed_16x256: append opt-out baseline 33.7us, default copyRanges 6.4us, direct child copyRanges 5.0us, raw bulk-copy model 10.5us.
Mixed_256x16: append opt-out baseline 217.7us, default copyRanges 50.4us, direct child copyRanges 28.6us, raw bulk-copy model 112.6us.
Fixed2_64x64: append opt-out baseline 26.6us, default copyRanges 5.5us, direct child copyRanges 2.0us, raw bulk-copy model 13.7us.
Fixed16_64x64: append opt-out baseline 121.6us, default copyRanges 27.0us, direct child copyRanges 17.4us, raw bulk-copy model 92.9us.
LongString_64x64: append opt-out baseline 31.7us, default copyRanges 7.1us, direct child copyRanges 4.5us, raw bulk-copy model 15.3us.
BoolHeavy_64x64: append opt-out baseline 68.7us, default copyRanges 10.9us, direct child copyRanges 5.4us, raw bulk-copy model 37.7us.