[VL] Add per-batch input-encoding counter to VeloxHashShuffleWriter by luis4a0 · Pull Request #12107 · apache/gluten

luis4a0 · 2026-05-18T13:40:45Z

What changes are proposed in this pull request?

This PR is companion to #12083 (SCOPED_TIMER fix). With the per-stage timers in cpuWallTimingList_ now recording real numbers, the natural complement is a counter that tells the reader what shape the writer actually sees; i.e. for each write(cb, ...) call, how many of the input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the values of facebook::velox::VectorEncoding::Simple).

A new inputEncodingCounts_ member (std::array<int64_t, 5>) is maintained on VeloxHashShuffleWriter. The counts are taken before any getFlattenedRowVector() invocation in write(cb, ...), so they reflect the encoding the writer actually receives -- not the post-flatten shape (which is always FLAT by construction). Output is gated through the existing VELOX_SHUFFLE_WRITER_LOG_FLAG log channel and uses the same line shape as the existing cpuWallTimingList_ lines,v so behavior is unchanged when the flag is off (default).

This information is useful for at least three things:

Verifying whether a workload would benefit from writer-side dictionary/constant pass-through optimizations before writing the pass-through code (does the writer ever actually see those encodings?). Recent OSS PR [GLUTEN-8855][VL] Support dictionary in hash based shuffle #9727 (writer-side dictionary rebuild) and several proposed constant pass-through patches all share the same assumption that the writer sees DICT/CONST inputs; this counter makes that assumption directly checkable on any workload.
Spotting unexpected upstream operator changes that collapse/preserve encodings differently across Velox versions (e.g., a change in HashAggregate that starts emitting CONSTANT children when it didn't before).
Correlating a wall-time spike in a particular stage with the encoding mix that drove it.

The counter is always-on (one branch+one increment per input child, swamped by the surrounding shuffle work). Scope is deliberately narrow: only the 5-bucket encoding mix at the hash-shuffle writer entry, surfaced via the existing log channel. A per-type breakdown (BIGINT/DECIMAL/VARCHAR/...) and exposure through ShuffleWriterMetrics to the JVM SparkListener are both plausible follow-ups but kept out of this PR.

Diff: +277 / -0 over 4 files (3 in cpp/velox/, 1 new test).

How was this patch tested?

This PR adds cpp/velox/tests/VeloxHashShuffleWriterInputEncodingTest.cc (second commit) with three gtest cases:

allFlat: two batches with 3 flat children each; asserts FLAT bucket accumulates to 6 and the other 4 buckets stay 0.
mixedFlatDictConst: one batch with FLAT + FLAT + DICTIONARY + CONSTANT children; asserts FLAT=2, DICTIONARY=1, CONSTANT=1, and the remaining 2 buckets stay 0.
lazy: wraps a child in a LazyVector; asserts the LAZY bucket increments (the encoding is captured BEFORE any loadedVector() call inside the writer).

The new test is a standalone add_velox_test(...) entry rather than a new TEST_P case in VeloxShuffleWriterTest.cc, so it isolates the counter behavior from the parameterised round-trip matrix, which keeps the failure mode obvious if a future change quietly breaks one of the buckets.

Companion to apache#12083 (`SCOPED_TIMER` fix). With the per-stage timers in `cpuWallTimingList_` now recording real numbers, the natural complement is a counter that tells the reader what shape the writer actually sees — i.e. for each `write(cb, ...)` call, how many of the input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the values of `facebook::velox::VectorEncoding::Simple`). This information is useful for at least three things: - Verifying whether a workload would benefit from writer-side dictionary / constant pass-through optimizations *before* writing the pass-through code (does the writer ever actually see those encodings?). - Spotting unexpected upstream operator changes that collapse / preserve encodings differently across Velox versions (e.g., a change in HashAggregate that starts emitting CONSTANT children when it didn't before). - Correlating a wall-time spike in a particular stage with the encoding mix that drove it. The counter is always-on (one branch + one increment per input child, swamped by the surrounding shuffle work). Output is gated through the existing `VELOX_SHUFFLE_WRITER_LOG_FLAG` log channel — same line shape as the existing `cpuWallTimingList_` lines — so behavior is unchanged when the flag is off (default). Scope is deliberately narrow: only the 5-bucket encoding mix at the hash-shuffle writer entry, surfaced via the existing log channel. A per-type breakdown (BIGINT / DECIMAL / VARCHAR / ...) and exposure through `ShuffleWriterMetrics` to the JVM SparkListener are both plausible follow-ups but kept out of this PR. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three test cases covering the typical encoding mixes a hash shuffle writer may see at `write()` entry: - `allFlat` — every child arrives FLAT (the common case at HEAD on TPC-H / TPC-DS, where upstream operators flatten dict/const inputs before shuffle). Asserts `FLAT` bucket accumulates across two batches and the other 4 buckets stay 0. - `mixedFlatDictConst` — one batch with FLAT + DICTIONARY + CONSTANT children, asserts the three respective buckets each increment. - `lazy` — wraps a child in a `LazyVector` and asserts the `LAZY` bucket increments (the encoding is captured BEFORE any `loadedVector()` call inside the writer). The test is a standalone `add_velox_test(...)` entry rather than a new `TEST_P` case in `VeloxShuffleWriterTest.cc`, so it isolates the counter behavior from the parameterised round-trip matrix — which keeps the failure mode obvious if a future change quietly breaks one of the buckets. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@yaooqinn

…ount Two review comments from @yaooqinn on apache#12107: (1) apache#12107 (comment) — `VectorEncoding::Simple` has 11 values; the prior switch handled 4 and funnelled the remaining 7 (BIASED, SEQUENCE, ROW, MAP, FLAT_MAP, ARRAY, FUNCTION) into "Other". ROW / MAP / ARRAY are first-class struct / map / array column types exercised by the sibling `VeloxShuffleWriterTest` via `makeArrayVector` / `makeMapVector`, so a reader interpreting "Other=80%" in the log would assume rare/unknown encodings rather than "I have a struct column". Split into a new `kInputEncodingComplex` bucket for ROW / MAP / FLAT_MAP / ARRAY; `kInputEncodingOther` now only catches BIASED / SEQUENCE / FUNCTION (and future additions to `VectorEncoding::Simple`). (2) apache#12107 (comment) — `accumulateInputEncodingCounts` early-returns on non-velox batches, so the InputEncoding log line's denominator (sum of buckets) is over a different population than the `count` field on the `cpuWallTimingList_` lines emitted above. Added an `inputEncodingSkippedBatches_` counter that is incremented on every early-return path and printed at the end of the InputEncoding line as `SkippedNonVeloxBatches=N`, so the two log blocks have comparable denominators. Test file updated: new `complex` test case exercises ARRAY + MAP children landing in `kInputEncodingComplex` (not `Other`); all existing cases also assert the new `Complex` bucket stays 0 and `inputEncodingSkippedBatches() == 0`. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The existing `CpuWallTimingSplitRV` bucket wraps the entire `splitRowVector()` call, so when it dominates wall time in a `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split. This change subdivides `SplitRV` into four new buckets, one per helper inside `splitRowVector()`: - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer` - `CpuWallTimingSplitValidity` — `splitValidityBuffer` - `CpuWallTimingSplitBinary` — `splitBinaryArray` - `CpuWallTimingSplitComplex` — `splitComplexType` The outer `SplitRV` bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner is). The outer minus the sum-of-inners is therefore the time spent in `splitRowVector()` outside the four sub-paths (mostly the post-split `partitionBufferBase_` update loop). To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future `ShuffleWriterMetrics` exposure, profilers), the `CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a new single-value accessor `cpuWallTiming(CpuWallTimingType)` are hoisted from `protected` to `public`. The underlying `cpuWallTimingList_` storage stays `protected`. This is the natural follow-up to apache#12083 (which made the per-stage timers actually record real numbers) and to the input-encoding counter in the preceding commit on this branch (which is itself PR apache#12107). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG` run. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

yaooqinn

LGTM, thank you @luis4a0!

The existing `CpuWallTimingSplitRV` bucket wraps the entire `splitRowVector()` call, so when it dominates wall time in a `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split. This change subdivides `SplitRV` into four new buckets, one per helper inside `splitRowVector()`: - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer` - `CpuWallTimingSplitValidity` — `splitValidityBuffer` - `CpuWallTimingSplitBinary` — `splitBinaryArray` - `CpuWallTimingSplitComplex` — `splitComplexType` The outer `SplitRV` bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner is). The outer minus the sum-of-inners is therefore the time spent in `splitRowVector()` outside the four sub-paths (mostly the post-split `partitionBufferBase_` update loop). To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future `ShuffleWriterMetrics` exposure, profilers), the `CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a new single-value accessor `cpuWallTiming(CpuWallTimingType)` are hoisted from `protected` to `public`. The underlying `cpuWallTimingList_` storage stays `protected`. This is the natural follow-up to apache#12083 (which made the per-stage timers actually record real numbers) and to the input-encoding counter in the preceding commit on this branch (which is itself PR apache#12107). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG` run. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

luis4a0 and others added 2 commits May 18, 2026 13:39

luis4a0 mentioned this pull request May 18, 2026

[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages #12108

Open

github-actions Bot added the VELOX label May 18, 2026

luis4a0 marked this pull request as ready for review May 18, 2026 14:06

yaooqinn reviewed May 18, 2026

View reviewed changes

Comment thread cpp/velox/shuffle/VeloxHashShuffleWriter.cc

yaooqinn reviewed May 18, 2026

View reviewed changes

Comment thread cpp/velox/shuffle/VeloxHashShuffleWriter.cc

yaooqinn approved these changes May 18, 2026

View reviewed changes

marin-ma approved these changes May 18, 2026

View reviewed changes

marin-ma merged commit d2c6f38 into apache:main May 18, 2026
60 checks passed

luis4a0 deleted the lpenaranda/pr-encoding-survey branch May 19, 2026 05:52

luis4a0 mentioned this pull request May 19, 2026

[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats #12114

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[VL] Add per-batch input-encoding counter to VeloxHashShuffleWriter#12107

[VL] Add per-batch input-encoding counter to VeloxHashShuffleWriter#12107
marin-ma merged 3 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-encoding-survey

luis4a0 commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

yaooqinn left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

luis4a0 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

How was this patch tested?

Uh oh!

Uh oh!

Uh oh!

yaooqinn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

luis4a0 commented May 18, 2026 •

edited

Loading