[VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning#12092
[VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning#12092yaooqinn wants to merge 72 commits into
Conversation
…ield + magic-prefix sniff PA-1 RED tests for Layer A min/max stats — pure JVM-only suite, no native lib. 3 RED cases (Expected RED failure column see plan §3 PA-1): - PA-1.1 testStatsFieldRoundTripV2: case class only has 3 fields → compile error 'too many arguments' until stats: InternalRow added - PA-1.2 testKryoV1Backwards: v1 binary read should yield stats=null (graceful degrade) → fails until v2 reader handles v1 layout - PA-1.3 testV1BinaryNumRowsHigh02NotMisidentified: NB1 ship-blocker — v1 with numRows=258 (first byte 0x02) must NOT be mis-identified as v2 by naive first-byte sniff → magic-prefix 4-byte sniff (0xFECA5302) is the only correct approach refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-1 refs: todos/features/gluten-inmemory-cache-stats/docs/0002-cpp-stats-contract.md §4 refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md §9
… magic-prefix sniff (BE)
GREEN impl for PA-1 RED tests (ColumnarCachedBatchKryoSuite). PA-1 ONLY adds
the stats field + Kryo serializer; buildFilter override + extends
SimpleMetricsCachedBatchSerializer is deferred to PA-3 per plan 0004.
(1) CachedColumnarBatch case class:
- Add 4th field 'stats: InternalRow' (nullable; null = stats unavailable,
triggers buildFilter pass-through fallback in graceful degrade)
- Change base from CachedBatch to SimpleMetricsCachedBatch (SPARK-32274 trait,
dormant until PA-3 switches serializer parent)
- sizeInBytes scaladoc clarifies Gluten override semantic (serialized blob
length, not vanilla per-column-sum)
(2) CachedColumnarBatchKryoSerializer:
- V2_MAGIC = 0xFE 0xCA 0x53 0x02 (4-byte sniff prefix)
Cannot collide with v1 binary because v1 first 4 bytes are numRows in
Kryo fixed 4-byte BIG-ENDIAN; any non-negative Int < 2^31 has BE high
byte in [0x00, 0x7F]. Magic high byte 0xFE > 0x7F is structurally
unreachable, eliminating all single-byte aliasing vectors.
- write path: emit V2_MAGIC + numRows (Kryo Output.writeInt(int) single-arg
= fixed 4-byte BE; AVOID (int, boolean) overload which forwards to
writeVarInt 1-5 bytes and breaks the magic sniff invariant) + sizeInBytes
+ bytes + hasStats Boolean + optional (statsLen + statsBytes)
- read path: peek 4-byte first; if matches V2_MAGIC -> readV2; else ->
readV1 (treats first4 as raw numRows BE, returns stats=null for
graceful degrade)
- Companion object holds V2_MAGIC + serializeStats / deserializeStats helpers
Java serialization placeholder; PA-3 MUST replace with statsBlob binary
framing per contract 0002 §3 + reference 0003 §3-4 before any code path
can produce stats != null (Java ser is NOT cross-version checkpoint safe).
(3) convertColumnarBatchToCachedBatch L347-349 write path:
- Add stats=null placeholder. PA-3 replaces with stats InternalRow built
from cpp computeStats output via JNI serializeWithStats.
NOTE: still extends bare CachedBatchSerializer (not SimpleMetricsCachedBatchSerializer
yet); buildFilter pass-through L426-431 unchanged. PA-3 will switch base + delete
pass-through + add SimpleMetricsCachedBatch invariant debug assert.
review: reviews/batch-PA-1-code-review.md (Layer 2 三 personas; 1 BLOCKER B1
Kryo wire format BE/LE 修正已 amend; verdicts P1 -1->+0 / P2 不通过->通过 / P3 +1)
refs: todos/features/gluten-inmemory-cache-stats/docs/0001-layerA-minmax-design.md D-A2
refs: todos/features/gluten-inmemory-cache-stats/docs/0002-cpp-stats-contract.md §4 (BL3 + NB1 + rev 3 BE)
refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md §9 (rev 3 BE)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-1
refs: todos/features/gluten-inmemory-cache-stats/reviews/batch-PA-1-code-review.md
…tats(Long FlatVector)
Locks the minimal ColumnStats contract:
struct ColumnStats { hasLowerBound, hasUpperBound, lowerBound, upperBound, nullCount }
std::vector<ColumnStats> computeStats(RowVectorPtr)
RED stderr (gcc, --build_tests=ON):
error: '...VeloxColumnarBatchSerializer' has no member named 'computeStats'
(+4 cascade errors on stats[0].lowerBound.value<int64_t>())
PA-2 batch micro-sliced: PA-2.1 = Long FlatVector path only. PA-2.2-2.6
(NaN / nulls absent / Decimal / JNI framing / capability) follow each as
one-RED -> one-production-change increments.
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…atVector) Adds ColumnStats struct and VeloxColumnarBatchSerializer::computeStats(rowVector) scalar v1 -- BIGINT FlatVector min/max + nullCount via direct buffer scan. Other type kinds fall back to hasLowerBound=hasUpperBound=false (graceful degrade: downstream buildFilter override emits pass-through for that column). GREEN log (gtest, --gtest_filter='*PA_2_1*'): [ RUN ] VeloxColumnarBatchSerializerTest.PA_2_1_testComputeStatsLongFlatVector [ OK ] VeloxColumnarBatchSerializerTest.PA_2_1_testComputeStatsLongFlatVector (0 ms) [ PASSED ] 1 test. PA-2.2..2.6 (NaN / nulls absent / Decimal / JNI framing / capability) follow. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…te in pruning
RED stderr:
Value of: stats[0].hasLowerBound
Actual: false
Expected: true
REAL FlatVector w/o NaN must be supported after PA-2.2 GREEN
PA-2.1 sentinel: PASSED (regression OK).
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
… guard Refactor computeStats to a templated scanMinMax<T>() helper covering BIGINT and REAL. Floating-point specialization aborts on first NaN value -- per Spark equality semantics (NaN != NaN), any NaN poisons the whole column so min/max-based pruning would silently drop matching rows. NB4 ship blocker invariant: NaN -> hasLowerBound=hasUpperBound=false -> buildFilter pass-through. GREEN log (*PA_2_*): PA_2_1_testComputeStatsLongFlatVector ........ OK PA_2_2_testComputeStatsNaNFloatPartition ..... OK 2 PASSED. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…r min/max
PA-2.3 (nulls-buffer-absent) folded into PA-5 ship blocker UT batch -- the
existing GREEN already has a defensive 'if (nulls != nullptr)' guard, so no
honest RED can be authored against an already-correct path (Plan sec 0
ironclad rule 2: no fake 'already green' RED).
RED stderr:
Value of: stats[0].hasLowerBound Actual: false Expected: true
HUGEINT (long Decimal P>=19) FlatVector must be supported after PA-2.4 GREEN
Value of: stats[0].hasUpperBound Actual: false Expected: true
C++ exception 'wrong kind! UNKNOWN != HUGEINT'
-- raised by stats[0].lowerBound.value<int128_t>() because the default-
constructed variant has KIND::UNKNOWN; future GREEN populates it.
PA-2.1/2.2 sentinels: PASSED.
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…tor scan Adds HUGEINT case to computeStats switch -- reuses scanMinMax<int128_t> template (no NaN guard via if constexpr branch). variant(int128_t) ctor exists in Velox so no special variant::create<HUGEINT> needed. JVM-side 16-byte big-endian BigInteger marshaling deferred to PA-3 (writer path) per Plan sec PA-3. GREEN log (*PA_2_*): PA_2_1_testComputeStatsLongFlatVector ............... OK PA_2_2_testComputeStatsNaNFloatPartition ............ OK PA_2_4_testComputeStatsHugeintDecimalFlatVector ..... OK 3 PASSED. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…yout Locks framing layout per docs/0003-jni-binary-framing-reference.md sec 2: [ magic(4) | statsLen(4 LE) | statsBlob | bytesLen(4 LE) | bytesBlob ] where magic = 0xFE 0xCA 0x53 0x02 (matches PA-1 V2_MAGIC). PA-2.5a scope: framing + bytesBlob (delegated to existing serializeTo) + EMPTY statsBlob (statsLen=0). PA-2.5b will populate statsBlob; PA-2.5c adds JNI bridge. Note byte-order asymmetry: cpp framing headers are LE (0003 sec 2); JVM-side Kryo wrap (PA-1) uses single-arg writeInt = BE. Different layers, different conventions; both intentional, both documented. RED stderr: error: '...VeloxColumnarBatchSerializer' has no member named 'framedSerializeWithStats' refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…layout Adds framedSerializeWithStats(batch) returning std::vector<uint8_t> framed as [ magic(4) | statsLen(4 LE) | statsBlob | bytesLen(4 LE) | bytesBlob ]. PA-2.5a uses an empty statsBlob (statsLen=0) -- bytesBlob delegated to existing append+serializeTo path, so end-to-end round-trip via existing deserialize remains intact (PA-3 will skip the framing header on read). PA-2.5b will populate statsBlob from computeStats output; PA-2.5c adds the JNI bridge entry point Java_..._serializeWithStats. GREEN log (*PA_2_*): PA_2_1_testComputeStatsLongFlatVector ............... OK PA_2_2_testComputeStatsNaNFloatPartition ............ OK PA_2_4_testComputeStatsHugeintDecimalFlatVector ..... OK PA_2_5a_testFramedSerializeWithStatsLayout .......... OK 4 PASSED. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
Locks per-column stats marshaling per docs/0003-jni-binary-framing-reference
sec 3, BIGINT path only:
[ numCols u32 LE ]
per col: [ supported u8 ][ nullCount u32 LE ][ count u32 LE ][ sizeInBytes u64 LE ]
if supported: [ lowerBoundLen u32 LE ][ lowerBound bytes ][ upperBoundLen u32 LE ][ upperBound bytes ]
PA-2.5b uses sizeInBytes=0 placeholder; PA-3 vanilla path may overwrite from
SimpleMetricsCachedBatch context (or stays 0 -- not consumed by partition
pruning). REAL/HUGEINT marshaling deferred to PA-2.5b-real / -hugeint
follow-up slices to keep one-RED-one-prod-change discipline (Plan sec 0
rule 1).
RED stderr:
Expected: (statsLen) >= (4u), actual: 0 vs 4
PA-2.5b: statsBlob must contain at least numCols(uint32)
(PA-2.5a writes statsLen=0; populating it is precisely the GREEN of PA-2.5b.)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…sentinel relax GREEN delta: - framedSerializeWithStats now invokes computeStats(rowVector) and serializes per-column [supported u8 | nullCount u32 LE | count u32 LE | sizeInBytes u64 LE | (if supported) lowerBoundLen u32 LE | bytes | upperBoundLen u32 LE | bytes]. - BIGINT path emits 8-byte LE int64 for lower/upperBound; REAL/HUGEINT marshaling is held off behind 'emitSupported &&= variant.kind()==BIGINT' guard so we don't desync wire format with parsers expecting 8B (PA-2.5b-real / -hugeint follow-ups will widen this). - sizeInBytes = 0 placeholder. PA-2.5a sentinel relax: removed EXPECT_EQ(statsLen, 0u) -- it was a temporary PA-2.5a-scope assumption mistakenly written as a permanent invariant. Lesson: RED tests should only assert permanent contract shape, not transient placeholder values from a prior micro-slice. GREEN log (*PA_2_*): PA_2_1_testComputeStatsLongFlatVector ............... OK PA_2_2_testComputeStatsNaNFloatPartition ............ OK PA_2_4_testComputeStatsHugeintDecimalFlatVector ..... OK PA_2_5a_testFramedSerializeWithStatsLayout .......... OK (sentinel) PA_2_5b_testStatsBlobBigintLayout ................... OK 5 PASSED. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…umbing)
Wires the JVM serializeWithStats(handle): byte[] native call to the cpp
framedSerializeWithStats virtual hook, end-to-end:
cpp/core/operators/serializer/ColumnarBatchSerializer.h
+ virtual std::vector<uint8_t> framedSerializeWithStats(batch) -- default
returns empty (other backends inherit the safe fallback).
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.h
+ override marker on framedSerializeWithStats (impl already shipped in
PA-2.5a/b commits).
cpp/core/jni/JniWrapper.cc
+ Java_..._serializeWithStats: retrieve ColumnarBatch, call
serializer->framedSerializeWithStats, convert to jbyteArray. Mirrors
the existing _serialize entry point's structure.
gluten-arrow/.../ColumnarBatchSerializerJniWrapper.java
+ native byte[] serializeWithStats(long handle).
Non-TDD pure plumbing batch (Plan sec 0 ironclad rule 2 explicitly allows
this category when test infra would dwarf the change). Verification:
- cpp gtest sentinel: 5/5 PASS (PA-2.1/2.2/2.4/2.5a/2.5b unchanged).
- libgluten.so exports symbol:
T Java_org_apache_gluten_vectorized_ColumnarBatchSerializerJniWrapper_serializeWithStats
W _ZN6gluten23ColumnarBatchSerializer24framedSerializeWithStats...
- JVM mvn compile -pl gluten-arrow,backends-velox -am: BUILD SUCCESS (9/9).
End-to-end live invocation will be exercised by PA-3 (vanilla cache write
path calls jni.serializeWithStats) -- that is the natural integration test
for this plumbing.
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
… plumbing) Adds ColumnarBatchSerializerJniWrapper.supportsStatsExt() static lazy helper. PA-3 write path will branch on this -- true => jni.serializeWithStats, false => legacy jni.serialize. Enables jar/native-lib version skew users without UnsatisfiedLinkError surfacing. Implementation note: probing via reflection on the declared native method (presence => JVM side wired). The cpp side symbol is checked by the JNI linker on first real invocation; PA-3 must catch UnsatisfiedLinkError there too (rare runtime fallback, not the common case). Non-TDD plumbing batch (Plan sec 0 ironclad rule 2 explicitly allows this when test infra would dwarf the change). PA-3 write-path tests will exercise the helper end-to-end. Verification: mvn -pl gluten-arrow compile BUILD SUCCESS. PA-2 batch is now CLOSED. Coverage: - PA-2.1 BIGINT scan ........................ TDD GREEN - PA-2.2 NaN REAL guard ..................... TDD GREEN - PA-2.3 nulls absent ....................... folded into PA-5 sentinel - PA-2.4 HUGEINT scan ....................... TDD GREEN - PA-2.5a framing layout .................... TDD GREEN - PA-2.5b BIGINT statsBlob marshal .......... TDD GREEN - PA-2.5c JNI bridge ........................ plumbing (verified by symbol export) - PA-2.6 capability helper .................. plumbing (verified by JVM compile) Backlog (one-line in todos backlog.md): - PA-2.5b-real: REAL/Float lower/upperBound 4B LE marshal - PA-2.5b-hugeint: HUGEINT/long-Decimal 16B BE marshal refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md
…gh' claim The PA-1 scaladoc claimed stats=null falls back to pass-through, but PA-3.0 investigation (todos rev 2) proved vanilla SimpleMetricsCachedBatchSerializer.buildFilter NPEs on partitionFilter.eval(null) for non-trivial predicates (both codegen and interpreted paths, no fallback). PA-3.1 will land the actual lazy-split iterator wrapper override; this commit only fixes the misleading scaladoc. refs: todos/features/gluten-inmemory-cache-stats/investigations/0003-simplemetrics-buildfilter-survey.md (rev 2)
…To predicate
Flip base class to SimpleMetricsCachedBatchSerializer and remove the pre-PA-3
TODO buildFilter override. Add ColumnarCachedBatchBuildFilterSuite asserting
that v1 binary (stats=null) batches are not dropped when an EqualTo predicate
is applied.
Expected RED failure (observed):
java.lang.NullPointerException ... isNullAt(int) ... <parameter1> is null
at GeneratedClass$SpecificPredicate.eval(Unknown Source)
at SimpleMetricsCachedBatchSerializer.$anonfun$buildFilter$8(CachedBatchSerializer.scala:365)
Matches the prediction in todos investigations/0003 rev 2 evidence 3 exactly:
vanilla SimpleMetricsCachedBatchSerializer.buildFilter NPEs on
partitionFilter.eval(cachedBatch.stats=null) for non-trivial predicates,
both codegen and interpreted paths (no fallback).
PA-3.1 GREEN will add a lazy-split iterator wrapper override that directs
stats=null batches through (skipping vanilla partition pruning) and feeds
stats!=null batches to super.buildFilter. The naive occupier-row placeholder
fails differently (silent drop, see evidence 3.A); only the lazy-split
wrapper is correct.
refs: todos/features/gluten-inmemory-cache-stats/investigations/0003-simplemetrics-buildfilter-survey.md (rev 2)
refs: todos/features/gluten-inmemory-cache-stats/docs/0001-layerA-minmax-design.md (rev 4 D-A3)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.1
…dFilter stats=null guard
Override buildFilter with a lazy-split iterator wrapper:
- stats=null batches (v1 binary read path, PA-4 SQLConf-off write path) are
directed through unchanged (skipping vanilla partition pruning).
- stats!=null batches are drained in contiguous runs and routed through
super.buildFilter for actual partition pruning.
Implementation:
- cachedBatchIterator.buffered enables stats peek without consuming.
- Inner runIt sub-iterator self-terminates at the first stats=null boundary,
so a single parent invocation pulls only the contiguous stats!=null run.
- staged Iterator buffers parent's output for hasNext accuracy.
- Lazy throughout: no Iterator materialization, no per-batch allocation
beyond the sub-iterator wrapper.
Verified:
- ColumnarCachedBatchBuildFilterSuite 1/1 PASS (PA-3.1 GREEN)
- ColumnarCachedBatchKryoSuite 3/3 PASS (PA-1 regression sentinel)
- Total 4/4 PASS, no NPE.
refs: todos/features/gluten-inmemory-cache-stats/investigations/0003-simplemetrics-buildfilter-survey.md (rev 2)
refs: todos/features/gluten-inmemory-cache-stats/docs/0001-layerA-minmax-design.md (rev 4 D-A3)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.1
Add ColumnarCachedBatchStatsBlobSuite with 4 cases against the cpp-aligned LE
wire format (cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc
PA-2.5b helper). Also bump PA-1.1 test to use the vanilla 5-slot/col
PartitionStatistics schema (lower, upper, nullCount, count, sizeInBytes)
instead of the 2-field placeholder, because PA-3.2 GREEN production change
requires numFields % 5 == 0.
Expected RED failures (observed):
- PA-3.2.A: statsBlob byte-for-byte vs Java Ser blob -> mismatch, AssertionError
- PA-3.2.C: corrupt 4-byte blob -> StreamCorruptedException, but we
expected IllegalArgumentException with 'numCols' guard
PASS in RED (honest sentinel only, will stay GREEN after PA-3.2 GREEN):
- PA-3.2.B: round-trip InternalRow 5-field (Java Ser handles it accidentally)
- PA-3.2.D: unsupported col round-trip (Java Ser handles it accidentally)
- PA-1.1/1.2/1.3 + PA-3.1: full regression sentinel
The honest RED cases (A + C) drive the GREEN: replace Java Serialization
with LE statsBlob marshal aligned to cpp PA-2.5b wire (numCols u32 LE,
per col supported u8 / nullCount u32 LE / count u32 LE / sizeInBytes u64 LE
/ optional lower+upper as u32 len + i64 LE for BIGINT).
refs: todos/features/gluten-inmemory-cache-stats/docs/0002-cpp-stats-contract.md (rev 4 sec 3)
refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md (rev 4 sec 3-4)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.2
…p-aligned statsBlob marshal
Replace serializeStats / deserializeStats Java Serialization placeholders
with a binary marshal aligned to cpp PA-2.5b wire format
(cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc):
Layout (LE throughout, no schemaVer prefix to match cpp):
[ numCols: uint32 LE ]
per col:
[ supported: uint8 ]
[ nullCount: uint32 LE ]
[ count: uint32 LE ]
[ sizeInBytes: uint64 LE ]
if supported:
[ lowerBoundLen: uint32 LE = 8 ]
[ lowerBound: int64 LE ]
[ upperBoundLen: uint32 LE = 8 ]
[ upperBound: int64 LE ]
PA-3.2 scope: BIGINT 1-col only. Multi-col + other types land in PA-3.3 /
PA-3.5 / follow-up. The InternalRow layout is vanilla 5-slots/col in order
(lowerBound, upperBound, nullCount, count, sizeInBytes) per
investigations/0003 rev 2 evidence 2.
Eager guards:
- serializeStats: require(numFields % 5 == 0) catches stale 2-field callers.
- deserializeStats: require(numCols in [0, 4096]) catches truncated /
corrupt blobs before they NPE downstream.
Verified:
- ColumnarCachedBatchStatsBlobSuite 4/4 PASS (PA-3.2.A/B/C/D)
- ColumnarCachedBatchKryoSuite 3/3 PASS (PA-1 regression)
- ColumnarCachedBatchBuildFilterSuite 1/1 PASS (PA-3.1 regression)
- 8/8 total ALL PASS, no NPE, no silent corruption.
NOTE: contract docs/0002 sec 3 still mentions a schemaVer:uint8=0x01 prefix
that the cpp end does not emit. Follow-up doc fix in todos PA-3.2 review
batch (separate commit).
refs: todos/features/gluten-inmemory-cache-stats/docs/0002-cpp-stats-contract.md (rev 4 sec 3)
refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md (rev 4 sec 3-4)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.2
Add ColumnarCachedBatchFramedBytesSuite with 3 cases asserting the JVM-side
parser for the JNI serializeWithStats framed return:
[ V2_MAGIC: 4 bytes 0xFE 0xCA 0x53 0x02 ]
[ statsLen: uint32 LE ]
[ statsBlob: statsLen bytes ]
[ bytesLen: uint32 LE ]
[ bytesBlob: bytesLen bytes ]
Tests:
- PA-3.3.A round-trip stats + bytesBlob through parseFramedBytes
- PA-3.3.B corrupt magic fails with 'magic' in error
- PA-3.3.C truncated framing fails eagerly
Expected RED failure (observed): testCompile error
'value parseFramedBytes is not a member of CachedColumnarBatchKryoSerializer'
Honest RED per AGENTS.md TDD law 3 (failure because feature missing,
not typo). PA-3.3 GREEN will:
(a) add parseFramedBytes(blob): (InternalRow, Array[Byte]) helper
(b) wire write path to call jniWrapper.serializeWithStats(handle) and
parseFramedBytes the return; the wire step itself is plumbing per
Plan sec 0 law 2 because driving it under unit test needs a native
lib + real ColumnarBatch fixture, which PA-3.5 e2e will cover.
refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md (rev 4 sec 2)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.3
…ring
(a) Add CachedColumnarBatchKryoSerializer.parseFramedBytes(framed):
(InternalRow, Array[Byte]) helper. Parses the JNI serializeWithStats
framed return (cpp PA-2.5c):
[V2_MAGIC(4) | statsLen u32 LE | statsBlob | bytesLen u32 LE | bytesBlob]
Eager require() guards catch corrupt magic + truncated framing before
propagating into a malformed CachedColumnarBatch.
(b) Wire convertColumnarBatchToCachedBatch write path (plumbing, see law 2):
capability-check via ColumnarBatchSerializerJniWrapper.supportsStatsExt()
(gluten-arrow PA-2.6 helper, reflection-cached). When supported, call
jniWrapper.serializeWithStats(handle), parseFramedBytes the return,
and emit CachedColumnarBatch(stats != null). When unsupported (older
cpp libgluten.so), fall back to the original serialize() path with
stats=null; PA-3.1 lazy-split iterator wrapper handles the read side.
The wire step itself is plumbing per Plan sec 0 law 2 because driving it
under unit test needs a native lib + real ColumnarBatch fixture. PA-3.5
e2e will cover it (cache a BIGINT col DataFrame, filter pushdown, assert
partition skip count > 0).
Verified:
- ColumnarCachedBatchFramedBytesSuite 3/3 PASS (PA-3.3.A/B/C)
- ColumnarCachedBatchStatsBlobSuite 4/4 PASS (PA-3.2 regression)
- ColumnarCachedBatchKryoSuite 3/3 PASS (PA-1 regression)
- ColumnarCachedBatchBuildFilterSuite 1/1 PASS (PA-3.1 regression)
- 11/11 ALL PASS, no NPE, no silent corruption.
refs: todos/features/gluten-inmemory-cache-stats/docs/0003-jni-binary-framing-reference.md (rev 4 sec 2)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.3
…cs + fix exhaustion bug
Add ColumnarCachedBatchBuildFilterPruneSuite (3 cases) locking the stats!=null
prune branch of the PA-3.1 lazy-split iterator wrapper -- PA-3.1 only covered
the stats=null direct-through path.
Cases:
- PA-3.4.A literal in [lower, upper] -> batch kept
- PA-3.4.B literal outside [lower, upper] -> batch pruned
- PA-3.4.C mixed null/non-null stream: null through, non-null pruned correctly
PA-3.4.B uncovered a real bug in the PA-3.1 wrapper (test was supposed to be
a regression sentinel but caught a live defect):
Symptom: java.util.NoSuchElementException: next on empty iterator
at ColumnarCachedBatchSerializer$$anon$3.next(...:608)
Root cause: hasNext returned true based on (staged.hasNext || peekable.hasNext)
but next() recursed and tried to peekable.head after parent fully consumed
the runIt and pruned all batches -> peekable empty + staged empty -> crash.
Fix: refactor to advance() helper that loops drain-and-stage until staged has
a ready element or peekable is exhausted. hasNext / next both call advance
first, so hasNext is honest and next never sees an empty staged. Also handles
the case where parent prunes the entire run and the next batch is stats=null
pass-through (loop continues into the null branch).
This is exactly why PA-3.4 mattered: PA-3.1 had no test exercising the
prune-everything path.
Verified: 14/14 PASS (PA-3.4 3 + PA-3.3 3 + PA-3.2 4 + PA-1 3 + PA-3.1 1).
refs: todos/features/gluten-inmemory-cache-stats/docs/0001-layerA-minmax-design.md (rev 4 D-A3)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-3.4
3 cases under VeloxWholeStageTransformerSuite (true native fixture): A. cache + equality filter: no crash + correct result B. plan contains InMemoryTableScanExec + ColumnarCachedBatchSerializer kicked in C. numOutputRows < N (significantly less than full-scan), <= 2 partitions worth Precise prune semantics are anchored by ColumnarCachedBatchBuildFilterPruneSuite (PA-3.4); this suite is e2e smoke only — non-TDD pure regression sentinel. Local fixture init blocked by NoClassDefFoundError: scala/Serializable at GlutenBuildInfo on mvn scalatest:test path (classloader); deferred to fork CI. Reference implementation VeloxColumnarCacheSuite uses the same base class and is green on community CI, confirming the fixture is sound. refs: todos/features/gluten-inmemory-cache-stats/docs/0005-pa35-e2e-test-shape-note.md
…s.enabled (default false) Adds default-off ship gate for the Layer A min/max partition stats path: - gluten-substrait/.../GlutenConfig.scala: new COLUMNAR_TABLE_CACHE_PARTITION_STATS_ENABLED entry, key spark.gluten.sql.columnar.tableCache.partitionStats.enabled, default false. - backends-velox/.../ColumnarCachedBatchSerializer.scala: write path double-gates on conf && supportsStatsExt; when either is false falls back to legacy serialize() with stats=null. The PA-3.1 lazy-split iterator wrapper transparently passes such batches through, so disabling the conf is a pure no-op for read correctness. - ColumnarCachePartitionStatsConfSuite: 2 cases (key + default-off, doc mentions rationale). - ColumnarCachedBatchE2ESuite: set partitionStats=true so the e2e smoke exercises the stats path under fork CI. Local regression: 16/16 ALL PASS (PA-4 2 + PA-3.4 3 + PA-3.3 3 + PA-3.2 4 + PA-1 3 + PA-3.1 1). E2E suite still deferred to fork CI for native runtime. refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md PA-4
When all batches under InMemoryTableScanExec are pruned by the partition stats, the numOutputRows metric on that node may legitimately read 0 -- the surviving row(s) come from a different physical path (Gluten native scan metrics, or higher-layer pivot resolution). The semantic correctness is anchored by PA-3.5.A (count == pivot rowcount); this case only needs to refute a full-scan, not constrain a lower bound. Local verification: 19/19 ALL PASS (PA-3.5 e2e 3 / PA-4 2 / PA-3.4 3 / PA-3.3 3 / PA-3.2 4 / PA-1 3 / PA-3.1 1) under build/mvn test -pl backends-velox -Pspark-4.1 -Pjava-17 -Pscala-2.13 after a full ./dev/buildbundle-veloxbe.sh + ./build/mvn install pass. refs: todos/features/gluten-inmemory-cache-stats/docs/0005-pa35-e2e-test-shape-note.md
- PA-5.B Float NaN: ignore + investigation 0006 deferred to PA-10 - PA-5.C/D/E Decimal/String marshal: ignore + acceptance criteria for follow-up PR (will be flipped to test in PA-7/PA-8/PA-9 when full-type marshal lands) - 24 case suite: 20 PASS / 4 ignored / 0 FAIL (build/mvn test 1m06) - spotless auto-applied import order normalization refs: todos/features/gluten-inmemory-cache-stats/docs/0008-layerA-fulltype-extension.md refs: todos/features/gluten-inmemory-cache-stats/investigations/0006-pa5b-float-nan-prune.md
…h params D-A5 D-A5 plumbing for full-type marshal (no behavior change yet, BIGINT-only): - CachedColumnarBatch: + schema: StructType (default null, source-compatible) - Kryo writer/reader: v3 layout adds nullable schema JSON segment after stats - serializeStats / deserializeStats / parseFramedBytes: add schema parameter (unused inside; PA-6.A onwards reads it for dataType dispatch) - Write path: convert Seq[Attribute] to StructType once and carry per-batch - Tuple destructuring with type ascription causes runtime MatchError under Scala 2.13 Tuple2 erasure -- use t._1/t._2 instead Sentinel: 20 PASS / 4 ignored / 0 fail (no regression) refs: todos/features/gluten-inmemory-cache-stats/docs/0008-layerA-fulltype-extension.md refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md (PA-6.0)
serializeStats / deserializeStats now branch on schema(col).dataType: IntegerType -> 4 LE bytes (writeU32LE bit-pattern == signed Int LE) LongType -> 8 LE bytes (existing PA-3.2 path) other -> UnsupportedOperationException (PA-6.B/C/F/G or PA-7..PA-10) schema=null keeps legacy BIGINT-only behavior (back-compat for callsites that have not been updated). RED log: ClassCastException Integer cannot be cast to Long at CachedColumnarBatchKryoSerializer$.serializeStats(...:304) GREEN: 21 PASS / 4 ignored / 0 fail (full 8-suite sentinel + new ColumnarCachedBatchIntFamilyMarshalSuite) refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md (PA-6.A)
- serializeStats: + ShortType branch using new writeU16LE helper - deserializeStats: + ShortType branch reading buf.getShort - writeU16LE helper added next to writeU32LE/writeU64LE RED: UnsupportedOperationException 'dispatch for ShortType not implemented' GREEN: 22 PASS / 4 ignored / 0 fail (full sentinel + new PA-6.B case) refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md (PA-6.B)
- serializeStats / deserializeStats: + ByteType branch (1 byte signed)
RED (verified by stash + isolated run):
PA-6.C TINYINT round-trip 1B preserves value (incl negative) *** FAILED ***
java.lang.UnsupportedOperationException: PA-6.A serializeStats: dispatch for
ByteType not implemented yet (landing in PA-6.B/C/F/G or PA-7..PA-10)
at CachedColumnarBatchKryoSerializer$.serializeStats(...:330)
GREEN: 23 PASS / 4 ignored / 0 fail (full 9-suite sentinel)
refs: todos/features/gluten-inmemory-cache-stats/docs/0004-layerA-implementation-plan.md (PA-6.C)
The unknown-arm path in deserializeStats reads upperSkipLen as a raw int and previously allocated new Array[Byte](upperSkipLen) without bounds checking. A cpp/JVM wire mismatch could surface as an OOM rather than a clear diagnostic. Adds require(upperSkipLen in [0, STRING_BOUND_TRUNCATE_LEN]) so the JVM fails fast with a wire-mismatch hint instead of attempting a multi-GB allocation. The upper bound matches the cpp-side string truncation SOT. 8-suite 37/37 GREEN.
… cap Three correctness/operational fixes from final code review: 1. Hoist require(lowerLen in [0, 256]) immediately after read at L334 so it covers ALL dispatch arms (not just the unknown-arm B4 fix). Symmetric defense against torn frames or corrupt blobs reaching new Array[Byte](N). 2. Replace eq StringType / case StringType with isInstanceOf[StringType] / case _: StringType at the four dispatch sites (isDispatchable, write isStringCol guard, write dispatch arm, read dispatch arm). Spark's StringType is now class StringType private[sql] with case object StringType extends StringType(0); collated columns are class instances that fail reference-equality. Without this, every collated string column silently loses partition pruning on the read path. 3. Raise the deserializeStats numCols cap from 4096 to Int.MaxValue / 5. The 4096 framing as 'corrupt' rejected legitimate wide tables (5k+ column ML feature stores work fine on vanilla cache). Int.MaxValue / 5 is the true overflow guard for the GenericInternalRow(numCols * 5) allocation that follows. 8-suite 37/37 GREEN.
Quantifies the read-path pruning benefit and the write-path overhead of
the columnar table-cache partition-stats path:
100M-row range partitioned by c2 into 32 partitions, in-memory cached;
groupBy + sum/count/avg follow-up to make pruned-partition savings visible.
Results (single-driver local[1], AMD EPYC 7763):
table cache build (write path)
partitionStats off 126425 ms 1.0X
partitionStats on 131431 ms 1.0X (no measurable overhead)
filter+agg high selectivity (c2 < 1000, ~0.001%)
partitionStats off 4431 ms 1.0X
partitionStats on 1744 ms 2.5X
filter+agg low selectivity (c2 < 50000000, ~50%)
partitionStats off 5332 ms 1.0X
partitionStats on 3392 ms 1.6X
filter+agg point lookup (c2 = 50000000, 1 row)
partitionStats off 4343 ms 1.0X
partitionStats on 1686 ms 2.6X
|
Run Gluten Clickhouse CI on x86 |
Run dev/format-cpp-code.sh against PR sources to satisfy the CI format-check job (clang-format 15).
|
Run Gluten Clickhouse CI on x86 |
AllGlutenConfiguration golden-file check requires this whenever a new SQLConf is added (CI spark-test-* failures pointed here).
|
Run Gluten Clickhouse CI on x86 |
weiting-chen
left a comment
There was a problem hiding this comment.
Thanks for this excellent PR — the partition stats design is clean and well-tested. A few suggestions for hardening (none are merge-blocking).
There was a problem hiding this comment.
Pull request overview
This PR adds native-produced per-partition min/max/null-count stats to the Velox-backed columnar InMemoryRelation cache so Spark can apply partition pruning when filtering cached DataFrames (gated by a new SQLConf).
Changes:
- Introduces
spark.gluten.sql.columnar.tableCache.partitionStats.enabled(defaultfalse) to control stats collection and pruning. - Extends the Velox serializer/JNI to emit a framed payload
[magic | statsLen | statsBlob | bytesLen | bytesBlob]and adds JVM-side parsing + Kryo (de)serialization for cached batches carrying stats. - Adds extensive C++ and Scala test coverage plus a benchmark demonstrating pruning benefits.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| gluten-substrait/src/main/scala/org/apache/gluten/config/GlutenConfig.scala | Adds the new SQLConf flag for enabling cache partition stats. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchSerializerJniWrapper.java | Adds JNI serializeWithStats and a “capability check” helper (needs correction). |
| docs/Configuration.md | Documents the new configuration key. |
| cpp/core/operators/serializer/ColumnarBatchSerializer.h | Adds a virtual framedSerializeWithStats hook (default empty) for backend override. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.h | Declares stats computation and framed serialization for Velox. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc | Implements min/max scan + NaN guard and framed serialization (needs length-range guarding). |
| cpp/core/jni/JniWrapper.cc | Exposes serializeWithStats JNI entry point returning framed bytes. |
| cpp/velox/tests/VeloxColumnarBatchSerializerTest.cc | Adds detailed gtests for stats correctness and wire layout. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/ColumnarCachedBatchSerializer.scala | Adds CachedColumnarBatch w/ stats+schema, Kryo framing/stat marshaling, and buildFilter wrapper (needs robust fallback handling). |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/*.scala | Adds JVM unit + e2e tests for framing/marshal/pruning behavior. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/benchmark/ColumnarTableCachePartitionStatsBenchmark.scala | Adds benchmark harness to measure overhead and pruning benefit. |
| benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt | Adds benchmark output snapshot. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
liuneng1994
left a comment
There was a problem hiding this comment.
Thanks for the substantial work and the thorough TDD trail. Reviewed focusing on Spark behavior parity and test coverage. Three findings can return wrong results / crash and aren't covered by tests; flagging before any future default-on flip.
🔴 1. Non-binary collation strings can be wrongly pruned (silent wrong results)
CachedColumnarBatchKryoSerializer.isDispatchable accepts any StringType (the comment even says "any collation"):
case _: StringType => true // truncated to 256B; see encodeStringBounds (any collation)But both the cpp side (scanMinMax<StringView> → memcmp) and JVM encodeStringBounds (byte-wise truncate + +1 carry) compute min/max in unsigned byte order. For non-binary collations (UTF8_LCASE, UNICODE_CI, …) Spark's equality / ordering is not byte order, so a partition whose true min/max under the collation overlaps the predicate range may have a byte-order min/max that does not — the partition gets pruned and matching rows disappear.
Suggested fix: gate on the binary collation only, e.g.
case s: StringType if s.collationId == CollationFactory.UTF8_BINARY_COLLATION_ID => trueand add an e2e test under a non-binary collation.
🔴 2. CONSTANT/DICTIONARY-encoded columns + IsNull can prune wrongly
if (child == nullptr || !child->isFlatEncoding()) { continue; } // computeStatsFor dict/const-encoded children, nullCount stays at 0 and we then emit nullCount=0, count=numRows, supported=0. The vanilla SimpleMetricsCachedBatchSerializer.buildFilter evaluates IsNull(col) purely from nullCount (it does not require supported/lower/upper) — so it concludes "no nulls in this partition" and prunes it. If the column is in fact a CONSTANT-null vector (very common after outer joins or projection of literal null), IsNull(col) will incorrectly return zero rows.
Suggested fix: at minimum decode nullCount for non-flat encodings (e.g. via DecodedVector or child->mayHaveNulls() + per-row check), or skip writing the nullCount slot so the parent treats it as unknown. New test needed: cache a frame containing an all-null column on the non-flat path, then filter with col.isNull and assert the count equals N (not 0).
🟠 3. supportsStatsExt() does not actually probe the cpp symbol
public static boolean supportsStatsExt() {
...
ColumnarBatchSerializerJniWrapper.class.getDeclaredMethod("serializeWithStats", long.class);
...
}This reflects on the JVM class's own declared method, which is always true once the PR lands. The comment acknowledges "the cpp symbol is verified at first real invocation; callers must catch UnsatisfiedLinkError there too" — but convertColumnarBatchToCachedBatch does not catch. So a new gluten jar paired with an older libgluten.so will hard-crash with UnsatisfiedLinkError instead of falling back to the legacy serialize() path that the comment promises.
Suggested fix: try/catch UnsatisfiedLinkError once around an inert probe call (or do dlsym natively at lib-load), cache the result, and only then dispatch to serializeWithStats.
🟡 Smaller correctness / parity notes
- NaN poison preserves partial
nullCount.scanMinMaxreturns false on the first NaN; thenullCntaccumulated up to that point is still written into stats. It happens to be conservative forIsNull/IsNotNull(always under-counts), but it is a behavior divergence fromColumnStats.gatherNullStatsand is fragile to future predicate-eval changes — recommend zeroing nullCount when the column is NaN-poisoned. - TimestampNTZ. JVM dispatch lumps
TimestampType | TimestampNTZTypeinto the 8B Long arm; cpp only inspectsTypeKind::TIMESTAMP. Worth verifying Velox's NTZ TypeKind in Gluten's mapping and adding an NTZ e2e — currently no NTZ test. buildFilterlazy-split callsparent(index, runIt)per contiguous run. Each invocation may compile a fresh codegen filter; in mixed (stats / no-stats) iterators this can balloon. Consider memoizing or pre-warming the parent closure.
Test coverage gaps to add before default-on
The unit / wire tests are thorough, but the following e2e gaps should be closed before flipping the default to true:
- Non-binary collation string column + equality / range filter (covers #1).
- CONSTANT/dict-encoded all-null column +
col.isNullfilter (covers #2). - Stale-symbol fallback: simulate
serializeWithStatsmissing on the cpp side and assert graceful fallback to the legacy path (covers #3). - TimestampNTZ end-to-end equality.
- Cross-config: cache built with stats=on, read after toggling stats=off (and vice versa); also mixed v1-binary + v2-stats batches in the same partition.
- Disk spill via Kryo: existing Kryo tests round-trip in-memory only; cover the
DiskStorespill + read-back path with stats present. - Mixed
supported=1andsupported=0columns in one batch — wire alignment is only implicitly tested. - Decimal with
scale > 0and long-Decimal (precision > 18) end-to-end. - YearMonth / DayTime interval end-to-end (only marshal unit tests today).
- Schema mismatch: stats column count vs
schema.lengthmismatch — currently relies on caller invariant with norequire.
Non-blocking polish
- The production code carries a lot of TDD-process noise (
PA-3.2 RED,PA-6.A,refs: todos/features/…). These are valuable in the commit messages but are likely to confuse future maintainers when they appear in// commentsof merged code — consider rewriting them as forward-looking design comments at squash / merge. serializeStats(stats, schema)acceptsschema == nulland falls back to BIGINT; the write path always passes a non-null schema, but norequireenforces it. A future caller could regress silently — recommendrequire(schema != null, …)once V1-only paths are gone.count/nullCountareuint32on the wire — fine in practice, but a comment about the >2³¹-row assumption (or widening touint64) would help.- cpp
framedSerializeWithStatswritespushU64(0)for per-columnsizeInBytes; a one-line comment that this is intentional becauseCachedColumnarBatch.sizeInBytesis overridden at the batch level would prevent future "fix me" PRs.
Overall the design is clean and the TDD discipline is great; default-off ship is the right call. With #1–#3 fixed and the e2e gaps closed, this should be solid for a default-on flip.
|
it seems the current pr also implement the CachedBatch skipping with pushed filter not just the partition, right? |
zhli1142015
left a comment
There was a problem hiding this comment.
I reviewed the partition-stats changes and left a few correctness comments.
…iedLinkError fallback The reflection-based supportsStatsExt() helper checks the Java method declaration via getDeclaredMethod, which always succeeds in this jar regardless of whether the loaded libgluten.so contains the corresponding JNI symbol. A new Gluten jar paired with an older native library would still throw UnsatisfiedLinkError on the first serializeWithStats call, giving the helper a false sense of safety. Remove the helper and detect capability lazily at the call site: - The cache write path queries ColumnarCachedBatchSerializer.statsExtAvailable (default true, optimistic fast path) before invoking serializeWithStats. - A try/catch around the call traps UnsatisfiedLinkError once, calls markStatsExtUnavailable to flip a JVM-wide volatile flag, and falls back to the legacy serialize() path emitting stats=null. Subsequent batches in this JVM go straight through the fallback, since a jar/native version mismatch is a deployment misconfiguration that should not be retried per batch. The buildFilter wrapper continues to direct stats=null batches through unchanged, so partition pruning is silently disabled rather than crashing the executor.
The framedSerializeWithStats wire layout encodes bytesLen as a u32 LE field, so a single batch payload larger than 4 GiB would silently truncate when cast from int64_t to uint32_t and produce a corrupt frame that the JVM-side parser would mis-read. Add a GLUTEN_CHECK enforcing 0 <= bytesLen <= UINT32_MAX before the cast. The condition is pathological in practice (would require a single batch with very wide schemas or huge string columns), but fail-fast here is cheaper than debugging a corrupt frame downstream.
…1 wire The CachedColumnarBatchKryoSerializer.read path trusted the length-prefixed fields (payload length, statsLen, schemaLen) directly from the input stream and allocated arrays from them. A corrupt or truncated cache block could trigger NegativeArraySizeException (negative length) or multi-GB allocation (oversized length), crashing the executor instead of failing fast with a clear exception. Bound each length to spark.kryoserializer.buffer.max via maxKryoBufferBytes. Spark's own Kryo write path enforces this same ceiling on the producing side, so any stream claiming a larger field is necessarily corrupt or maliciously crafted. Additionally, the V2 wire format added trailing hasStats / hasSchema boolean tags after the byte payload. A CachedColumnarBatch persisted on disk by an older Gluten jar (DISK_ONLY / MEMORY_AND_DISK storage levels surviving a rolling upgrade) would lack these tags, and input.readBoolean() would throw KryoException trying to consume bytes past the end of the V1 stream. Guard each trailing boolean read with an input.available() > 0 check so legacy cached batches deserialize as stats=null / schema=null instead of crashing. ColumnarCachedBatchKryoSuite covers four new cases: - V1 wire stream with no trailing booleans reads cleanly with stats=null and schema=null. - Negative length is rejected with IllegalArgumentException before array allocation. - Int.MaxValue length is rejected before triggering a 2 GB allocation. - Int.MaxValue statsLen is rejected with the same bounds-check message.
|
Run Gluten Clickhouse CI on x86 |
Address PR apache#12092 review (zhli1142015) for VeloxColumnarBatchSerializer stats compute path. Three independent correctness fixes plus regression tests, all routed via the `IsNull` pruning predicate semantics: framed stats serialize `nullCount` even when `emitSupported = 0`, and Spark `InMemoryFileIndexStats` reads `statsFor(a).nullCount > 0` for `IsNull` pruning. Any under-count silently drops matching rows. (1) BOOLEAN scan must use valueAt(i) / isNullAt(i). FlatVector<bool> is bit-packed in Velox; the generic scanMinMax<T> template invokes rawValues() which throws VeloxRuntimeError. Materializing a cached BooleanType column with partition stats enabled would hard-fail. New scanBoolMinMax helper; static_assert in scanMinMax<T> blocks accidental T=bool reuse. (2) Float / Double NaN scan no longer early-returns. NaN still poisons min/max (hasLowerBound=hasUpperBound=false, IEEE-754 NaN compare semantics make Spark NaN != NaN), but the loop now continues to accrue real null count. A `[NaN, null]` partition previously serialized nullCount=0 even when emitSupported=0, allowing `col IS NULL` to incorrectly prune. (3) Non-flat encoding (Dictionary / Constant / Complex) now reports a real null count via a dedicated isNullAt loop. Min/max remains unsupported. Default 0 on a null-bearing dict-encoded column would again let `col IS NULL` prune incorrectly. Tests: 3 new gtests in VeloxColumnarBatchSerializerTest verify each fix; all 22 tests in the suite pass. RED-validated by stashing the production change and re-running -- all 3 new tests fail as expected (throw / nullCount=1 / nullCount=0).
|
Run Gluten Clickhouse CI on x86 |
Address PR apache#12092 review (Batch 3): * Copilot review apache#5: hoist `StructType(schema.map(StructField(...)))` out of the per-batch `Iterator.next()` body in ColumnarCachedBatchSerializer.scala. The schema is constant for the lifetime of a partition iterator, so re-allocating one StructType per CachedBatch wastes GC for the many-small-batch case. Lifted to a single iterator-scoped val that is closed over by next(). * weiting-chen review apache#4: add a config-gate negative E2E test (`partitionStats.enabled=false: legacy serialize() path correctness preserved`) in ColumnarCachedBatchE2ESuite. With the production default false, serializeWithStats must NOT be invoked and the legacy serialize() path emits stats=null; this exercises that branch end-to-end so any regression that flips the gate or breaks the stats=null read path surfaces here. Asserts correctness only (result == 1L), not numOutputRows -- the Gluten native scan reports rows on a separate metrics path so InMemoryTableScanExec metric is zero in either gated branch (mirrors the existing comment on "numOutputRows reflects post-filter row count" test). Tests: ColumnarCachedBatchE2ESuite 12/12 PASS (includes new disabled- config case + 11 pre-existing).
|
Run Gluten Clickhouse CI on x86 |
…nge) Pure comment cleanup per project taste (do not retain reviewer handles, internal batch numbers, doc revision tags, or PR comment indices in source -- those belong in git log / PR thread, not in the code). Touched only comment text added in earlier commits in this branch; no production semantics, no test logic, no wire format. * ColumnarCachedBatchSerializer.scala: drop the trailing "(Copilot review apache#5)" attribution from the StructType-hoist comment. * ColumnarCachedBatchE2ESuite.scala: drop the "Reviewer apache#4 (weiting-chen)" prefix on the disabled-config test prologue and tighten wording. * ColumnarCachedBatchKryoSuite.scala: drop the internal "Slice 1.3" section banner. * VeloxColumnarBatchSerializerTest.cc: drop the "Batch 2 (PR apache#12092 zhli1142015 #N)" prefix on the three new gtest prologues; drop the internal "NB4 rev6" / "rev6" tags from EXPECT_EQ failure messages. Verification: velox_operators_test 22/22 PASS, backends-velox test-compile BUILD SUCCESS.
|
Run Gluten Clickhouse CI on x86 |

What changes were proposed in this pull request?
This PR fills the long-standing TODO in
ColumnarCachedBatchSerializer.buildFilterby adding native-side min/max partition statistics to the columnar
InMemoryRelationcache, enabling partition-level pruning when filtering over a cached
DataFrame.The Velox backend now computes per-batch min/max/null-count stats during
serializeWithStats(cpp side, Arrow C-Data export-free) and embeds themin the cached batch envelope. On the read path, the JVM serializer extends
SimpleMetricsCachedBatchSerializerand routes batches with stats throughthe inherited
buildFilterfor partition pruning; batches without stats(legacy v1 binary, or the SQLConf gate disabled) lazy-pass-through unchanged.
Type matrix covered (vanilla Spark
ColumnStats.scalaparity):VARCHAR cpp-side scan is intentionally deferred (
supported=0from cpp,no JVM crash path) due to Velox StringView lifetime considerations across
RowVector boundaries — tracked for a follow-up PR.
Why are the changes needed?
Partition stats let
InMemoryRelationskip whole cached partitions thatcannot satisfy a filter predicate, materially speeding up repeated point
lookups and selective filters over a cached DataFrame — a common pattern
in iterative SQL workloads (notebooks, ML feature engineering, BI tools
with cached intermediate results). Vanilla Spark has had this since
SPARK-32274 (3.1, ~5 years); Gluten's columnar cache previously
extended
CachedBatchSerializerdirectly, leaving thebuildFilterTODOempty and skipping pruning entirely.
Does this PR introduce any user-facing change?
Yes — a new SQLConf:
spark.gluten.sql.columnar.tableCache.partitionStats.enabledfalseDefault-off ship for the first release; flip to default-on after community
feedback on the included benchmark + a release of soak time. Gate is
double-checked on both the write path (skip stats compute) and read path
(via the inherited
buildFilter).How was this patch tested?
Unit / integration tests:
velox_operators_test) — per-type min/max scanbuildFilter wrapper / per-arm prune semantics (BIGINT / Date / Timestamp /
String / NaN sentinel / numCols cap)
ColumnarCachedBatchE2ESuite) — cache + equalityfilter end-to-end, plan shape verification,
numOutputRowspruneevidence
Benchmark (
backends-velox/benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt):100M-row range partitioned by
c2into 32 in-memory partitions,groupBy + sum/count/avgfollow-up to make pruned-partition savingsvisible. Driver:
local[1], AMD EPYC 7763.c2 < 1000, ~0.001%)c2 < 50000000, ~50%)c2 = 50000000, 1 row)Write path shows no measurable overhead — stats computation runs in cpp
inside the existing batch scan loop.
Notes on commit history
This PR retains the atomic per-slice commit history (~58 commits) rather
than squashing, to make per-feature
git bisectand incremental reviewpractical. Each commit is self-contained: a feature slice, its test, and
any fixes amended into the slice.