[VL] Add per-batch input-encoding counter to VeloxHashShuffleWriter#12107
Merged
Conversation
Companion to apache#12083 (`SCOPED_TIMER` fix). With the per-stage timers in `cpuWallTimingList_` now recording real numbers, the natural complement is a counter that tells the reader what shape the writer actually sees — i.e. for each `write(cb, ...)` call, how many of the input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the values of `facebook::velox::VectorEncoding::Simple`). This information is useful for at least three things: - Verifying whether a workload would benefit from writer-side dictionary / constant pass-through optimizations *before* writing the pass-through code (does the writer ever actually see those encodings?). - Spotting unexpected upstream operator changes that collapse / preserve encodings differently across Velox versions (e.g., a change in HashAggregate that starts emitting CONSTANT children when it didn't before). - Correlating a wall-time spike in a particular stage with the encoding mix that drove it. The counter is always-on (one branch + one increment per input child, swamped by the surrounding shuffle work). Output is gated through the existing `VELOX_SHUFFLE_WRITER_LOG_FLAG` log channel — same line shape as the existing `cpuWallTimingList_` lines — so behavior is unchanged when the flag is off (default). Scope is deliberately narrow: only the 5-bucket encoding mix at the hash-shuffle writer entry, surfaced via the existing log channel. A per-type breakdown (BIGINT / DECIMAL / VARCHAR / ...) and exposure through `ShuffleWriterMetrics` to the JVM SparkListener are both plausible follow-ups but kept out of this PR. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three test cases covering the typical encoding mixes a hash shuffle writer may see at `write()` entry: - `allFlat` — every child arrives FLAT (the common case at HEAD on TPC-H / TPC-DS, where upstream operators flatten dict/const inputs before shuffle). Asserts `FLAT` bucket accumulates across two batches and the other 4 buckets stay 0. - `mixedFlatDictConst` — one batch with FLAT + DICTIONARY + CONSTANT children, asserts the three respective buckets each increment. - `lazy` — wraps a child in a `LazyVector` and asserts the `LAZY` bucket increments (the encoding is captured BEFORE any `loadedVector()` call inside the writer). The test is a standalone `add_velox_test(...)` entry rather than a new `TEST_P` case in `VeloxShuffleWriterTest.cc`, so it isolates the counter behavior from the parameterised round-trip matrix — which keeps the failure mode obvious if a future change quietly breaks one of the buckets. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yaooqinn
reviewed
May 18, 2026
yaooqinn
reviewed
May 18, 2026
…ount Two review comments from @yaooqinn on apache#12107: (1) apache#12107 (comment) — `VectorEncoding::Simple` has 11 values; the prior switch handled 4 and funnelled the remaining 7 (BIASED, SEQUENCE, ROW, MAP, FLAT_MAP, ARRAY, FUNCTION) into "Other". ROW / MAP / ARRAY are first-class struct / map / array column types exercised by the sibling `VeloxShuffleWriterTest` via `makeArrayVector` / `makeMapVector`, so a reader interpreting "Other=80%" in the log would assume rare/unknown encodings rather than "I have a struct column". Split into a new `kInputEncodingComplex` bucket for ROW / MAP / FLAT_MAP / ARRAY; `kInputEncodingOther` now only catches BIASED / SEQUENCE / FUNCTION (and future additions to `VectorEncoding::Simple`). (2) apache#12107 (comment) — `accumulateInputEncodingCounts` early-returns on non-velox batches, so the InputEncoding log line's denominator (sum of buckets) is over a different population than the `count` field on the `cpuWallTimingList_` lines emitted above. Added an `inputEncodingSkippedBatches_` counter that is incremented on every early-return path and printed at the end of the InputEncoding line as `SkippedNonVeloxBatches=N`, so the two log blocks have comparable denominators. Test file updated: new `complex` test case exercises ARRAY + MAP children landing in `kInputEncodingComplex` (not `Other`); all existing cases also assert the new `Complex` bucket stays 0 and `inputEncodingSkippedBatches() == 0`. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
luis4a0
added a commit
to luis4a0/gluten
that referenced
this pull request
May 18, 2026
The existing `CpuWallTimingSplitRV` bucket wraps the entire `splitRowVector()` call, so when it dominates wall time in a `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split. This change subdivides `SplitRV` into four new buckets, one per helper inside `splitRowVector()`: - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer` - `CpuWallTimingSplitValidity` — `splitValidityBuffer` - `CpuWallTimingSplitBinary` — `splitBinaryArray` - `CpuWallTimingSplitComplex` — `splitComplexType` The outer `SplitRV` bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner is). The outer minus the sum-of-inners is therefore the time spent in `splitRowVector()` outside the four sub-paths (mostly the post-split `partitionBufferBase_` update loop). To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future `ShuffleWriterMetrics` exposure, profilers), the `CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a new single-value accessor `cpuWallTiming(CpuWallTimingType)` are hoisted from `protected` to `public`. The underlying `cpuWallTimingList_` storage stays `protected`. This is the natural follow-up to apache#12083 (which made the per-stage timers actually record real numbers) and to the input-encoding counter in the preceding commit on this branch (which is itself PR apache#12107). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG` run. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yaooqinn
approved these changes
May 18, 2026
marin-ma
approved these changes
May 18, 2026
luis4a0
added a commit
to luis4a0/gluten
that referenced
this pull request
May 19, 2026
The existing `CpuWallTimingSplitRV` bucket wraps the entire `splitRowVector()` call, so when it dominates wall time in a `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split. This change subdivides `SplitRV` into four new buckets, one per helper inside `splitRowVector()`: - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer` - `CpuWallTimingSplitValidity` — `splitValidityBuffer` - `CpuWallTimingSplitBinary` — `splitBinaryArray` - `CpuWallTimingSplitComplex` — `splitComplexType` The outer `SplitRV` bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner is). The outer minus the sum-of-inners is therefore the time spent in `splitRowVector()` outside the four sub-paths (mostly the post-split `partitionBufferBase_` update loop). To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future `ShuffleWriterMetrics` exposure, profilers), the `CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a new single-value accessor `cpuWallTiming(CpuWallTimingType)` are hoisted from `protected` to `public`. The underlying `cpuWallTimingList_` storage stays `protected`. This is the natural follow-up to apache#12083 (which made the per-stage timers actually record real numbers) and to the input-encoding counter in the preceding commit on this branch (which is itself PR apache#12107). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG` run. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
This PR is companion to #12083 (
SCOPED_TIMERfix). With the per-stage timers incpuWallTimingList_now recording real numbers, the natural complement is a counter that tells the reader what shape the writer actually sees; i.e. for eachwrite(cb, ...)call, how many of the input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the values offacebook::velox::VectorEncoding::Simple).A new
inputEncodingCounts_member (std::array<int64_t, 5>) is maintained onVeloxHashShuffleWriter. The counts are taken before anygetFlattenedRowVector()invocation inwrite(cb, ...), so they reflect the encoding the writer actually receives -- not the post-flatten shape (which is always FLAT by construction). Output is gated through the existingVELOX_SHUFFLE_WRITER_LOG_FLAGlog channel and uses the same line shape as the existingcpuWallTimingList_lines,v so behavior is unchanged when the flag is off (default).This information is useful for at least three things:
The counter is always-on (one branch+one increment per input child, swamped by the surrounding shuffle work). Scope is deliberately narrow: only the 5-bucket encoding mix at the hash-shuffle writer entry, surfaced via the existing log channel. A per-type breakdown (BIGINT/DECIMAL/VARCHAR/...) and exposure through
ShuffleWriterMetricsto the JVM SparkListener are both plausible follow-ups but kept out of this PR.Diff: +277 / -0 over 4 files (3 in
cpp/velox/, 1 new test).How was this patch tested?
This PR adds
cpp/velox/tests/VeloxHashShuffleWriterInputEncodingTest.cc(second commit) with three gtest cases:allFlat: two batches with 3 flat children each; assertsFLATbucket accumulates to 6 and the other 4 buckets stay 0.mixedFlatDictConst: one batch with FLAT + FLAT + DICTIONARY + CONSTANT children; asserts FLAT=2, DICTIONARY=1, CONSTANT=1, and the remaining 2 buckets stay 0.lazy: wraps a child in aLazyVector; asserts theLAZYbucket increments (the encoding is captured BEFORE anyloadedVector()call inside the writer).The new test is a standalone
add_velox_test(...)entry rather than a newTEST_Pcase inVeloxShuffleWriterTest.cc, so it isolates the counter behavior from the parameterised round-trip matrix, which keeps the failure mode obvious if a future change quietly breaks one of the buckets.