
[VL] Add Velox batch resizer copyRanges fast path #12101

Merged
yaooqinn merged 4 commits into apache:main from zhli1142015:feature/velox-resize-copy-ranges-fastpath on May 20, 2026

@zhli1142015
Contributor

Add a default-enabled VeloxBatchResizer fast path that collects small dense batches, allocates the output RowVector once, and bulk-copies child vector ranges with copyRanges. The config remains available as an opt-out switch.

Wire the flag through Scala, Java, and JNI, add C++ coverage for fast-path and fallback behavior, add config default coverage, and add dense-vector benchmark scenarios comparing the append opt-out baseline, default copyRanges path, direct child copyRanges, reader-side raw payload bulk-copy model, and pre-merged flush model.

Benchmark results from velox_batch_resizer_benchmark (CPU time; ASLR enabled, so numbers may have noise):

  • Mixed_64x64: append opt-out baseline 95.1us, default copyRanges 19.7us, direct child copyRanges 17.4us, raw bulk-copy model 33.3us.

  • Mixed_16x256: append opt-out baseline 33.7us, default copyRanges 6.4us, direct child copyRanges 5.0us, raw bulk-copy model 10.5us.

  • Mixed_256x16: append opt-out baseline 217.7us, default copyRanges 50.4us, direct child copyRanges 28.6us, raw bulk-copy model 112.6us.

  • Fixed2_64x64: append opt-out baseline 26.6us, default copyRanges 5.5us, direct child copyRanges 2.0us, raw bulk-copy model 13.7us.

  • Fixed16_64x64: append opt-out baseline 121.6us, default copyRanges 27.0us, direct child copyRanges 17.4us, raw bulk-copy model 92.9us.

  • LongString_64x64: append opt-out baseline 31.7us, default copyRanges 7.1us, direct child copyRanges 4.5us, raw bulk-copy model 15.3us.

  • BoolHeavy_64x64: append opt-out baseline 68.7us, default copyRanges 10.9us, direct child copyRanges 5.4us, raw bulk-copy model 37.7us.

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@zhli1142015 changed the title from "[VL] Enable Velox batch resizer copyRanges fast path" to "[VL] Add Velox batch resizer copyRanges fast path" on May 18, 2026
constexpr int32_t kInputBatches = 64;
constexpr int32_t kRowsPerBatch = 64;
constexpr int32_t kTotalRows = kInputBatches * kRowsPerBatch;
constexpr int64_t kPreferredBatchBytes = std::numeric_limits<int64_t>::max();
Member

The 7 benchmark scenarios are all dense-flat, but enableCopyRanges=true (default) silently routes dictionary / constant / topLevelNull inputs through RowVector::copy (VeloxBatchResizer.cc:148). Suggest adding 1 dict-heavy + 1 constant-heavy scenario to validate the fallback path is at least perf-neutral before flipping the default on.

Contributor Author

I added two fallback-focused benchmark scenarios for the copyRanges default-on path:

  • DictionaryHeavy_64x64
  • ConstantHeavy_64x64

Each scenario compares:

  1. BM_VeloxBatchResizerFallbackAppendOptOutBaseline: enableCopyRanges=false, the old incremental append path.
  2. BM_VeloxBatchResizerDefaultCopyRangesFallback: default enableCopyRanges=true; dictionary/constant inputs are unsupported by child copyRanges, so they fall back to RowVector::copy inside the new collect-and-copy path.
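
The routing described in the two cases above can be modeled roughly as follows. This is a minimal sketch, not the code in VeloxBatchResizer.cc: `Encoding`, `BatchShape`, `CopyPath`, and `choosePath` are hypothetical names invented for illustration, and the real eligibility check may consider more conditions than the three named in the review comment (dictionary children, constant children, top-level nulls).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a child vector's encoding; not the Velox enum.
enum class Encoding { kFlat, kDictionary, kConstant };

// Hypothetical description of one small input batch: the encoding of each
// child column plus whether the row container itself carries nulls.
struct BatchShape {
  std::vector<Encoding> children;
  bool hasTopLevelNulls;
};

// The two per-batch copy strategies inside the collect-and-copy path.
enum class CopyPath { kChildCopyRanges, kRowVectorCopy };

// Picks the strategy for one batch: only dense, flat batches without
// top-level nulls take the child copyRanges fast path; everything else
// falls back to a whole-row copy.
inline CopyPath choosePath(const BatchShape& batch) {
  if (batch.hasTopLevelNulls) {
    return CopyPath::kRowVectorCopy;
  }
  for (Encoding encoding : batch.children) {
    if (encoding != Encoding::kFlat) {
      return CopyPath::kRowVectorCopy;
    }
  }
  return CopyPath::kChildCopyRanges;
}
```

Under this model, a batch with any dictionary or constant child is handled by the fallback, which is the case the DictionaryHeavy and ConstantHeavy scenarios are meant to exercise.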

Benchmark command:

/tmp/gluten-cpp-ut-build/velox/benchmarks/velox_batch_resizer_benchmark \
  --benchmark_min_time=0.05s \
  --benchmark_repetitions=3 \
  --benchmark_format=json \
  > /tmp/velox_batch_resizer_perf.json

Environment:

- CPU: 32 logical CPUs @ ~2995 MHz
- Caches: L1D 48 KiB x16, L1I 32 KiB x16, L2 2 MiB x16, L3 36 MiB x1
- Note: ASLR was enabled, so the numbers may contain normal benchmark noise.

Dense flat scenarios, CPU time averaged over 3 repetitions:

┌──────────────────┬─────────────────────────┬────────────────────┬─────────┬─────────────────────────┬─────────────────────────────┐
│ Scenario         │ Append opt-out baseline │ Default copyRanges │ Speedup │ Direct child copyRanges │ Raw payload bulk-copy model │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_64x64      │ 94.3 us                 │ 11.3 us            │ 8.35x   │ 14.4 us                 │ 34.8 us                     │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_16x256     │ 31.7 us                 │ 6.3 us             │ 5.01x   │ 5.3 us                  │ 11.2 us                     │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Mixed_256x16     │ 234.2 us                │ 48.7 us            │ 4.81x   │ 30.6 us                 │ 119.2 us                    │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Fixed2_64x64     │ 29.3 us                 │ 5.4 us             │ 5.43x   │ 2.1 us                  │ 13.4 us                     │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ Fixed16_64x64    │ 123.5 us                │ 26.7 us            │ 4.62x   │ 18.6 us                 │ 95.7 us                     │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ LongString_64x64 │ 33.3 us                 │ 7.2 us             │ 4.61x   │ 4.1 us                  │ 15.7 us                     │
├──────────────────┼─────────────────────────┼────────────────────┼─────────┼─────────────────────────┼─────────────────────────────┤
│ BoolHeavy_64x64  │ 73.4 us                 │ 10.8 us            │ 6.78x   │ 5.4 us                  │ 38.1 us                     │
└──────────────────┴─────────────────────────┴────────────────────┴─────────┴─────────────────────────┴─────────────────────────────┘

Fallback scenarios requested by the review comment, CPU time averaged over 3 repetitions:

┌───────────────────────┬─────────────────────────┬─────────────────────────────┬──────────────┐
│ Scenario              │ Append opt-out baseline │ Default copyRanges fallback │ Result       │
├───────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────┤
│ DictionaryHeavy_64x64 │ 102.7 us                │ 77.7 us                     │ 1.32x faster │
├───────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────┤
│ ConstantHeavy_64x64   │ 67.8 us                 │ 42.0 us                     │ 1.61x faster │
└───────────────────────┴─────────────────────────┴─────────────────────────────┴──────────────┘
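
For reference, the ratios in the Speedup and Result columns are baseline CPU time divided by candidate CPU time. The helper names below are hypothetical. Recomputing from the rounded table values can differ from the printed figure in the last digit for some dense-flat rows, presumably because the printed ratios were derived from unrounded timings.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the candidate over the baseline: baseline time / candidate time.
// Both arguments must use the same unit (microseconds in the tables above).
inline double speedup(double baselineUs, double candidateUs) {
  assert(candidateUs > 0.0);
  return baselineUs / candidateUs;
}

// Rounds to two decimals, the precision used in the tables.
inline double round2(double value) {
  return std::round(value * 100.0) / 100.0;
}
```

For example, `round2(speedup(102.7, 77.7))` gives 1.32 for DictionaryHeavy_64x64 and `round2(speedup(67.8, 42.0))` gives 1.61 for ConstantHeavy_64x64, matching the fallback table.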

The fallback numbers are not regressions. The reason is that enableCopyRanges=true fallback is not the same as the old enableCopyRanges=false append path.

With enableCopyRanges=false, VeloxBatchResizer creates an empty output RowVector and incrementally appends each small input batch via buffer->append(input). That path grows and updates the output vector batch-by-batch.

With enableCopyRanges=true, even when a batch is not eligible for child-vector copyRanges because it contains dictionary or constant children, the resizer first collects the small input batches, computes the final totalRows, allocates/resizes the output RowVector once, and then copies each unsupported input into the pre-sized output using:

buffer->copy(input.get(), offset, 0, input->size());

So these fallback benchmarks still benefit from the new collect-and-copy structure:

- one output allocation / resize for the final row count;
- fixed-offset RowVector::copy into a pre-sized output;
- less incremental append bookkeeping;
- Velox copy still has encoded-vector handling for dictionary and constant inputs, even though child copyRanges is not used.
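
The structural difference between the two paths can be illustrated with a small self-contained model. This is a sketch under loud assumptions: plain `std::vector<int>` stands in for a RowVector, `Batch`, `appendIncrementally`, and `collectAndCopy` are hypothetical names, and none of it calls Velox APIs. It only shows the shape of "grow per batch" versus "size once, then copy at fixed offsets".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one small input batch (one int per row).
using Batch = std::vector<int>;

// Models the opt-out path: the output starts empty and grows on every
// append, so it may be reallocated several times.
inline Batch appendIncrementally(const std::vector<Batch>& inputs) {
  Batch out;
  for (const Batch& in : inputs) {
    out.insert(out.end(), in.begin(), in.end());
  }
  return out;
}

// Models collect-and-copy: sum the row counts first, size the output once,
// then copy each input to its precomputed offset.
inline Batch collectAndCopy(const std::vector<Batch>& inputs) {
  std::size_t totalRows = 0;
  for (const Batch& in : inputs) {
    totalRows += in.size();
  }
  Batch out(totalRows);  // Single allocation for the final row count.
  std::size_t offset = 0;
  for (const Batch& in : inputs) {
    std::copy(in.begin(), in.end(),
              out.begin() + static_cast<std::ptrdiff_t>(offset));
    offset += in.size();
  }
  assert(offset == totalRows);
  return out;
}
```

Both functions produce the same rows in the same order; the second does its bookkeeping once up front, which is the property the bullets above credit for the fallback speedup.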

This explains why the default copyRanges-enabled fallback path is faster than the append opt-out baseline in these two scenarios. The important validation point for the review is that dictionary-heavy and constant-heavy inputs do not show a perf regression after enabling copyRanges by default.

zhli1142015 and others added 3 commits May 19, 2026 17:34
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add dictionary-heavy and constant-heavy VeloxBatchResizer benchmark scenarios so the default copyRanges-enabled path is compared against the append opt-out baseline when inputs fall back to RowVector::copy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhli1142015 force-pushed the feature/velox-resize-copy-ranges-fastpath branch from 49744b4 to 6080dae on May 19, 2026 09:59
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhli1142015 zhli1142015 requested a review from yaooqinn May 20, 2026 03:01
Member

@yaooqinn yaooqinn left a comment

LGTM, thank you @zhli1142015 for the updates

@yaooqinn yaooqinn merged commit 71302c6 into apache:main May 20, 2026
61 checks passed