[VL] Add Velox batch resizer copyRanges fast path by zhli1142015 · Pull Request #12101 · apache/gluten

zhli1142015 · 2026-05-18T01:09:47Z

Add a default-enabled VeloxBatchResizer fast path that collects small dense batches, allocates the output RowVector once, and bulk-copies child vector ranges with copyRanges. The config remains available as an opt-out switch.
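As a rough illustration of the collect, allocate-once, bulk-copy idea described above, here is a minimal sketch that does not use Velox at all. `DenseBatch` and `mergeDense` are hypothetical names invented for this example; plain `std::vector` columns stand in for flat child vectors, and `std::copy` stands in for a per-child `copyRanges` call. The real VeloxBatchResizer code may be structured quite differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a small dense batch with two flat columns.
struct DenseBatch {
  std::vector<int64_t> a;
  std::vector<int32_t> b;
  std::size_t size() const { return a.size(); }
};

// Collect-and-copy model: compute the total row count first, size every
// output column exactly once, then bulk-copy each input column range to a
// fixed offset in the output.
DenseBatch mergeDense(const std::vector<DenseBatch>& inputs) {
  std::size_t totalRows = 0;
  for (const auto& in : inputs) {
    totalRows += in.size();
  }

  DenseBatch out;
  out.a.resize(totalRows);  // single allocation per column
  out.b.resize(totalRows);

  std::size_t offset = 0;
  for (const auto& in : inputs) {
    // One contiguous range copy per column per input batch.
    std::copy(in.a.begin(), in.a.end(), out.a.begin() + offset);
    std::copy(in.b.begin(), in.b.end(), out.b.begin() + offset);
    offset += in.size();
  }
  return out;
}
```

In this model the output columns never grow after the initial resize, which is the property the fast path relies on.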

Wire the flag through Scala, Java, and JNI, add C++ coverage for fast-path and fallback behavior, add config default coverage, and add dense-vector benchmark scenarios comparing the append opt-out baseline, default copyRanges path, direct child copyRanges, reader-side raw payload bulk-copy model, and pre-merged flush model.

Benchmark results from velox_batch_resizer_benchmark (CPU time; ASLR enabled, so numbers may have noise):

Mixed_64x64: append opt-out baseline 95.1us, default copyRanges 19.7us, direct child copyRanges 17.4us, raw bulk-copy model 33.3us.
Mixed_16x256: append opt-out baseline 33.7us, default copyRanges 6.4us, direct child copyRanges 5.0us, raw bulk-copy model 10.5us.
Mixed_256x16: append opt-out baseline 217.7us, default copyRanges 50.4us, direct child copyRanges 28.6us, raw bulk-copy model 112.6us.
Fixed2_64x64: append opt-out baseline 26.6us, default copyRanges 5.5us, direct child copyRanges 2.0us, raw bulk-copy model 13.7us.
Fixed16_64x64: append opt-out baseline 121.6us, default copyRanges 27.0us, direct child copyRanges 17.4us, raw bulk-copy model 92.9us.
LongString_64x64: append opt-out baseline 31.7us, default copyRanges 7.1us, direct child copyRanges 4.5us, raw bulk-copy model 15.3us.
BoolHeavy_64x64: append opt-out baseline 68.7us, default copyRanges 10.9us, direct child copyRanges 5.4us, raw bulk-copy model 37.7us.

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

yaooqinn · 2026-05-19T09:23:21Z

+constexpr int32_t kInputBatches = 64;
+constexpr int32_t kRowsPerBatch = 64;
+constexpr int32_t kTotalRows = kInputBatches * kRowsPerBatch;
+constexpr int64_t kPreferredBatchBytes = std::numeric_limits<int64_t>::max();


The 7 benchmark scenarios are all dense-flat, but enableCopyRanges=true (default) silently routes dictionary / constant / topLevelNull inputs through RowVector::copy (VeloxBatchResizer.cc:148). Suggest adding 1 dict-heavy + 1 constant-heavy scenario to validate the fallback path is at least perf-neutral before flipping the default on.

I added two fallback-focused benchmark scenarios for the copyRanges default-on path:

DictionaryHeavy_64x64

ConstantHeavy_64x64

Each scenario compares:

BM_VeloxBatchResizerFallbackAppendOptOutBaseline: enableCopyRanges=false, the old incremental append path.

BM_VeloxBatchResizerDefaultCopyRangesFallback: default enableCopyRanges=true; dictionary/constant inputs are unsupported by child copyRanges, so they fall back to RowVector::copy inside the new collect-and-copy path.
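The routing between the two copy strategies can be pictured with a small model. This is a sketch under assumptions: `Encoding`, `ChildInfo`, `BatchInfo`, `CopyPath`, and `choosePath` are hypothetical names, and the conditions only mirror what this thread states (dictionary, constant, or top-level-null inputs take the RowVector::copy fallback). The actual checks in VeloxBatchResizer.cc may include more cases.

```cpp
#include <vector>

// Hypothetical model of a child vector's encoding.
enum class Encoding { kFlat, kDictionary, kConstant };

struct ChildInfo {
  Encoding encoding;
};

// Hypothetical summary of one input batch.
struct BatchInfo {
  bool hasTopLevelNulls;
  std::vector<ChildInfo> children;
};

enum class CopyPath { kChildCopyRanges, kRowVectorCopy };

// A batch is eligible for per-child range copies only when it has no
// top-level nulls and every child is flat; anything else falls back.
CopyPath choosePath(const BatchInfo& batch) {
  if (batch.hasTopLevelNulls) {
    return CopyPath::kRowVectorCopy;
  }
  for (const auto& child : batch.children) {
    if (child.encoding != Encoding::kFlat) {
      return CopyPath::kRowVectorCopy;
    }
  }
  return CopyPath::kChildCopyRanges;
}
```

Under this model, a single dictionary or constant child is enough to send the whole batch down the fallback path, which is why the two new scenarios exercise it.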

Benchmark command:

    /tmp/gluten-cpp-ut-build/velox/benchmarks/velox_batch_resizer_benchmark \
      --benchmark_min_time=0.05s \
      --benchmark_repetitions=3 \
      --benchmark_format=json \
      > /tmp/velox_batch_resizer_perf.json

Environment:

- CPU: 32 logical CPUs @ ~2995 MHz
- Caches: L1D 48 KiB x16, L1I 32 KiB x16, L2 2 MiB x16, L3 36 MiB x1
- Note: ASLR was enabled, so the numbers may contain normal benchmark noise.

Dense flat scenarios, CPU time averaged over 3 repetitions:

┌──────────────────┬─────────────────────────┬────────────────────┬─────────┬─────────────────────────┬─────────────────────────────┐
│ Scenario         │ Append opt-out baseline │ Default copyRanges │ Speedup │ Direct child copyRanges │ Raw payload bulk-copy model │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_64x64      │ 94.3 us                 │ 11.3 us            │ 8.35x   │ 14.4 us                 │ 34.8 us                     │
│ Mixed_16x256     │ 31.7 us                 │ 6.3 us             │ 5.01x   │ 5.3 us                  │ 11.2 us                     │
│ Mixed_256x16     │ 234.2 us                │ 48.7 us            │ 4.81x   │ 30.6 us                 │ 119.2 us                    │
│ Fixed2_64x64     │ 29.3 us                 │ 5.4 us             │ 5.43x   │ 2.1 us                  │ 13.4 us                     │
│ Fixed16_64x64    │ 123.5 us                │ 26.7 us            │ 4.62x   │ 18.6 us                 │ 95.7 us                     │
│ LongString_64x64 │ 33.3 us                 │ 7.2 us             │ 4.61x   │ 4.1 us                  │ 15.7 us                     │
│ BoolHeavy_64x64  │ 73.4 us                 │ 10.8 us            │ 6.78x   │ 5.4 us                  │ 38.1 us                     │
└──────────────────┴─────────────────────────┴────────────────────┴─────────┴─────────────────────────┴─────────────────────────────┘

Fallback scenarios requested by the review comment, CPU time averaged over 3 repetitions:

┌───────────────────────┬─────────────────────────┬─────────────────────────────┬──────────────┐
│ Scenario              │ Append opt-out baseline │ Default copyRanges fallback │ Result       │
├───────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────┤
│ DictionaryHeavy_64x64 │ 102.7 us                │ 77.7 us                     │ 1.32x faster │
│ ConstantHeavy_64x64   │ 67.8 us                 │ 42.0 us                     │ 1.61x faster │
└───────────────────────┴─────────────────────────┴─────────────────────────────┴──────────────┘

The fallback numbers are not regressions. The reason is that the enableCopyRanges=true fallback is not the same as the old enableCopyRanges=false append path.

With enableCopyRanges=false, VeloxBatchResizer creates an empty output RowVector and incrementally appends each small input batch via buffer->append(input). That path grows and updates the output vector batch-by-batch.

With enableCopyRanges=true, even when a batch is not eligible for child-vector copyRanges because it contains dictionary or constant children, the resizer first collects the small input batches, computes the final totalRows, allocates/resizes the output RowVector once, and then copies each unsupported input into the pre-sized output using:

    buffer->copy(input.get(), offset, 0, input->size());

So these fallback benchmarks still benefit from the new collect-and-copy structure:

- one output allocation / resize for the final row count;
- fixed-offset RowVector::copy into a pre-sized output;
- less incremental append bookkeeping;
- Velox copy still has encoded-vector handling for dictionary and constant inputs, even though child copyRanges is not used.

This explains why the default copyRanges-enabled fallback path is faster than the append opt-out baseline in these two scenarios. The important validation point for the review is that dictionary-heavy and constant-heavy inputs do not show a perf regression after enabling copyRanges by default.
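To make the structural difference between the two paths concrete, the following toy model contrasts incremental append with a pre-sized, fixed-offset copy. It is only an analogy: `Batch`, `appendEach`, and `copyIntoPresized` are hypothetical names, `std::vector` growth is not how Velox buffers grow, and the element-wise loop merely stands in for a generic per-row copy such as the RowVector::copy fallback.

```cpp
#include <cstddef>
#include <vector>

using Batch = std::vector<int>;  // hypothetical one-column batch

// Append model: the output grows batch by batch, so its storage may be
// reallocated several times. capacityChanges counts those events.
std::vector<int> appendEach(const std::vector<Batch>& inputs,
                            std::size_t& capacityChanges) {
  std::vector<int> out;
  capacityChanges = 0;
  for (const auto& in : inputs) {
    const std::size_t before = out.capacity();
    out.insert(out.end(), in.begin(), in.end());
    if (out.capacity() != before) {
      ++capacityChanges;
    }
  }
  return out;
}

// Collect-and-copy model: size the output once for the final row count,
// then write each input at a fixed, precomputed offset.
std::vector<int> copyIntoPresized(const std::vector<Batch>& inputs,
                                  std::size_t& capacityChanges) {
  std::size_t totalRows = 0;
  for (const auto& in : inputs) {
    totalRows += in.size();
  }
  std::vector<int> out(totalRows);
  capacityChanges = totalRows > 0 ? 1 : 0;  // the single up-front allocation

  std::size_t offset = 0;
  for (const auto& in : inputs) {
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[offset + i] = in[i];  // generic per-row copy, no range fast path
    }
    offset += in.size();
  }
  return out;
}
```

Both functions produce the same rows; the second simply avoids repeated growth of the output, which is consistent with the explanation given for the fallback results.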


Add dictionary-heavy and constant-heavy VeloxBatchResizer benchmark scenarios so the default copyRanges-enabled path is compared against the append opt-out baseline when inputs fall back to RowVector::copy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

yaooqinn

LGTM, thank you @zhli1142015 for the updates

github-actions Bot added VELOX DOCS labels May 18, 2026

zhli1142015 changed the title ~~[VL] Enable Velox batch resizer copyRanges fast path~~ [VL] Add Velox batch resizer copyRanges fast path May 18, 2026

yaooqinn reviewed May 19, 2026

zhli1142015 and others added 3 commits May 19, 2026 17:34

add test

8a62075

zhli1142015 force-pushed the feature/velox-resize-copy-ranges-fastpath branch from 49744b4 to 6080dae Compare May 19, 2026 09:59

[VL] Fix batch resizer CI regressions

b3976a1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

zhli1142015 requested a review from yaooqinn May 20, 2026 03:01

yaooqinn approved these changes May 20, 2026

yaooqinn merged commit 71302c6 into apache:main May 20, 2026
61 checks passed

yaooqinn merged 4 commits into apache:main from zhli1142015:feature/velox-resize-copy-ranges-fastpath
