[VL] Catch corrupt stats frames per-batch in ColumnarCachedBatchSerializer#12183
Merged
yaooqinn merged 2 commits intoMay 29, 2026
Conversation
…chSerializer
Refactor only — no behavior change. Move the per-batch serializeWithStats fast
path (with the UnsatisfiedLinkError → markStatsExtUnavailable + legacy fallback
catch arm) out of the anonymous Iterator lambda in convertColumnarBatchToCachedBatch
into a companion-object package-private method:
ColumnarCachedBatchSerializer.serializeOneBatchWithStats(
jni, handle, numRows, structSchema, fallbackToLegacy)
The legacy-only branch is extracted into a local def legacySerializeInline() so
both the else branch and the fallback closure can call it without duplication.
Sentinel suites (all green, no regression):
ColumnarCachedBatchE2ESuite 15/0
ColumnarCachedBatchStatsBlobSuite 4/0
ColumnarCachedBatchFramedBytesSuite 3/0
ColumnarCachedBatchBuildFilterSuite 1/0
ColumnarCachedBatchBuildFilterPruneSuite 3/0
ColumnarCachedBatchKryoSuite 6/0
ColumnarCachedBatchIntFamilyMarshalSuite 8/0
ColumnarCachedBatchKryoBoundaryProbeBugSuite 1/0
This extraction creates a unit-test hook point for an upcoming change that
widens the catch to handle corrupt/undecodable stats frames per-batch via
NonFatal without tripping the JVM-lifetime capability latch.
Generated-by: Claude claude-opus-4.7
…lizer
The V2 serialize-with-stats fast path used to catch only UnsatisfiedLinkError.
Any IllegalArgumentException from parseFramedBytes (corrupt magic, truncated
frame, statsLen/bytesLen mismatch) or KryoException from deserializeStats would
bubble out and fail the entire Spark task. Since V2 is an opt-in optimization,
a malformed frame for one cached batch should not be user-visible — the
existing legacy serialize() fallback should absorb it just like an UnsatisfiedLinkError
does.
This change layers the catch:
case e: UnsatisfiedLinkError =>
// capability gone for the JVM lifetime; trip the latch
markStatsExtUnavailable(e); fallbackToLegacy()
case NonFatal(e) =>
// per-batch corruption; do NOT trip the latch
warnCorruptStatsFrame(e); fallbackToLegacy()
The capability latch (statsExtAvailableFlag) is deliberately one-way for the
JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame events,
in contrast, are per-batch and shouldn't poison the latch: the next batch
retries the fast path.
To keep log output bounded when a native regression produces malformed frames
batch after batch, warnCorruptStatsFrame caps warnings at 100 per JVM with a
final summary line.
Test coverage (new ColumnarCachedBatchSerializerHelperSuite, 4 cases):
- corrupt magic -> fallback CachedBatch with stats=null
- truncated framed bytes -> fallback CachedBatch with stats=null
- Kryo-corrupt statsBlob -> fallback CachedBatch with stats=null
- UnsatisfiedLinkError -> still trips capability latch (regression)
Also drops resetStatsExtAvailableForTesting() — it had no remaining callers
after the new helper suite started resetting the flag via reflection directly.
Sentinel suites (all green, no regression):
ColumnarCachedBatchE2ESuite 15/0
ColumnarCachedBatchStatsBlobSuite 4/0
ColumnarCachedBatchFramedBytesSuite 3/0
ColumnarCachedBatchBuildFilterSuite 1/0
ColumnarCachedBatchBuildFilterPruneSuite 3/0
ColumnarCachedBatchKryoSuite 6/0
ColumnarCachedBatchIntFamilyMarshalSuite 8/0
ColumnarCachedBatchKryoBoundaryProbeBugSuite 1/0
ColumnarCachedBatchSerializerHelperSuite 4/0 (new)
Generated-by: Claude claude-opus-4.7
zhli1142015
approved these changes
May 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The V2
serializeWithStatsfast path (opt-in viaspark.gluten.sql.columnar.tableCache.partitionStats.enabled) used to catch onlyUnsatisfiedLinkError. AnyIllegalArgumentExceptionfromparseFramedBytes(corrupt magic, truncated frame, statsLen/bytesLen mismatch) orKryoExceptionfromdeserializeStatswould bubble out and fail the entire Spark task. Since V2 is an opt-in optimization, a malformed frame for one cached batch should not be user-visible — the existing legacyserialize()fallback should absorb it just like anUnsatisfiedLinkErrordoes.Change
Layered catch in the per-batch fast path:
The capability latch (
statsExtAvailableFlag) is deliberately one-way for the JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame events, in contrast, are per-batch and shouldn't poison the latch: the next batch retries the fast path.To keep log output bounded when a native regression produces malformed frames batch after batch,
warnCorruptStatsFramecaps warnings at 100 per JVM with a final summary line.Refactor
To make the catch site unit-testable (previously nested inside an anonymous
Iteratorlambda insidemapPartitions), the fast path was extracted into a companion-object helperserializeOneBatchWithStats(jni, handle, numRows, structSchema, fallbackToLegacy). The first commit is the pure refactor; the second commit adds the new arm + tests.resetStatsExtAvailableForTesting()is dropped — it had no callers after the new helper suite switched to resetting the flag via reflection directly.Test coverage
New
ColumnarCachedBatchSerializerHelperSuite(4 cases, Mockito-stubbed JNI wrapper, no native runtime needed):CachedBatchwithstats=nullUnsatisfiedLinkError→ still trips capability latch (regression)Local sentinel — all 9
ColumnarCachedBatch*Suitegreen, 46/0 succeeded, 0 regression vs apache/main.Risks
SQLConfif reviewers prefer.NonFataldeliberately does not absorbOutOfMemoryError/StackOverflowError/ control-flow exceptions — those should still propagate.Part of the V2 hardening series following #12176.
Generated-by: Claude claude-opus-4.7