Skip to content

[VL] Catch corrupt stats frames per-batch in ColumnarCachedBatchSerializer#12183

Merged
yaooqinn merged 2 commits into
apache:mainfrom
yaooqinn:users/kentyao/gluten-ccbs-stats-fallback-catch-layering
May 29, 2026
Merged

[VL] Catch corrupt stats frames per-batch in ColumnarCachedBatchSerializer#12183
yaooqinn merged 2 commits into
apache:mainfrom
yaooqinn:users/kentyao/gluten-ccbs-stats-fallback-catch-layering

Conversation

@yaooqinn
Copy link
Copy Markdown
Member

Motivation

The V2 serializeWithStats fast path (opt-in via spark.gluten.sql.columnar.tableCache.partitionStats.enabled) used to catch only UnsatisfiedLinkError. Any IllegalArgumentException from parseFramedBytes (corrupt magic, truncated frame, statsLen/bytesLen mismatch) or KryoException from deserializeStats would bubble out and fail the entire Spark task. Since V2 is an opt-in optimization, a malformed frame for one cached batch should not be user-visible — the existing legacy serialize() fallback should absorb it just like an UnsatisfiedLinkError does.

Change

Layered catch in the per-batch fast path:

case e: UnsatisfiedLinkError =>
  // capability gone for the JVM lifetime; trip the latch
  markStatsExtUnavailable(e); fallbackToLegacy()
case NonFatal(e) =>
  // per-batch corruption; do NOT trip the latch
  warnCorruptStatsFrame(e); fallbackToLegacy()

The capability latch (statsExtAvailableFlag) is deliberately one-way for the JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame events, in contrast, are per-batch and shouldn't poison the latch: the next batch retries the fast path.

To keep log output bounded when a native regression produces malformed frames batch after batch, warnCorruptStatsFrame caps warnings at 100 per JVM with a final summary line.

Refactor

To make the catch site unit-testable (previously nested inside an anonymous Iterator lambda inside mapPartitions), the fast path was extracted into a companion-object helper serializeOneBatchWithStats(jni, handle, numRows, structSchema, fallbackToLegacy). The first commit is the pure refactor; the second commit adds the new arm + tests.

resetStatsExtAvailableForTesting() is dropped — it had no callers after the new helper suite switched to resetting the flag via reflection directly.

Test coverage

New ColumnarCachedBatchSerializerHelperSuite (4 cases, Mockito-stubbed JNI wrapper, no native runtime needed):

  • corrupt magic → fallback CachedBatch with stats=null
  • truncated framed bytes → fallback
  • Kryo-corrupt statsBlob → fallback
  • UnsatisfiedLinkError → still trips capability latch (regression)

Local sentinel — all 9 ColumnarCachedBatch*Suite green, 46/0 succeeded, 0 regression vs apache/main.

Risks

  • Log cap is hard-coded at 100. Happy to move to a SQLConf if reviewers prefer.
  • NonFatal deliberately does not absorb OutOfMemoryError / StackOverflowError / control-flow exceptions — those should still propagate.

Part of the V2 hardening series following #12176.

Generated-by: Claude claude-opus-4.7

yaooqinn added 2 commits May 29, 2026 12:08
…chSerializer

Refactor only — no behavior change. Move the per-batch serializeWithStats fast
path (with the UnsatisfiedLinkError → markStatsExtUnavailable + legacy fallback
catch arm) out of the anonymous Iterator lambda in convertColumnarBatchToCachedBatch
into a companion-object package-private method:

  ColumnarCachedBatchSerializer.serializeOneBatchWithStats(
      jni, handle, numRows, structSchema, fallbackToLegacy)

The legacy-only branch is extracted into a local def legacySerializeInline() so
both the else branch and the fallback closure can call it without duplication.

Sentinel suites (all green, no regression):
  ColumnarCachedBatchE2ESuite                  15/0
  ColumnarCachedBatchStatsBlobSuite             4/0
  ColumnarCachedBatchFramedBytesSuite           3/0
  ColumnarCachedBatchBuildFilterSuite           1/0
  ColumnarCachedBatchBuildFilterPruneSuite      3/0
  ColumnarCachedBatchKryoSuite                  6/0
  ColumnarCachedBatchIntFamilyMarshalSuite      8/0
  ColumnarCachedBatchKryoBoundaryProbeBugSuite  1/0

This extraction creates a unit-test hook point for an upcoming change that
widens the catch to handle corrupt/undecodable stats frames per-batch via
NonFatal without tripping the JVM-lifetime capability latch.

Generated-by: Claude claude-opus-4.7
…lizer

The V2 serialize-with-stats fast path used to catch only UnsatisfiedLinkError.
Any IllegalArgumentException from parseFramedBytes (corrupt magic, truncated
frame, statsLen/bytesLen mismatch) or KryoException from deserializeStats would
bubble out and fail the entire Spark task. Since V2 is an opt-in optimization,
a malformed frame for one cached batch should not be user-visible — the
existing legacy serialize() fallback should absorb it just like an UnsatisfiedLinkError
does.

This change layers the catch:

  case e: UnsatisfiedLinkError =>
    // capability gone for the JVM lifetime; trip the latch
    markStatsExtUnavailable(e); fallbackToLegacy()
  case NonFatal(e) =>
    // per-batch corruption; do NOT trip the latch
    warnCorruptStatsFrame(e); fallbackToLegacy()

The capability latch (statsExtAvailableFlag) is deliberately one-way for the
JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame events,
in contrast, are per-batch and shouldn't poison the latch: the next batch
retries the fast path.

To keep log output bounded when a native regression produces malformed frames
batch after batch, warnCorruptStatsFrame caps warnings at 100 per JVM with a
final summary line.

Test coverage (new ColumnarCachedBatchSerializerHelperSuite, 4 cases):
  - corrupt magic           -> fallback CachedBatch with stats=null
  - truncated framed bytes  -> fallback CachedBatch with stats=null
  - Kryo-corrupt statsBlob  -> fallback CachedBatch with stats=null
  - UnsatisfiedLinkError    -> still trips capability latch (regression)

Also drops resetStatsExtAvailableForTesting() — it had no remaining callers
after the new helper suite started resetting the flag via reflection directly.

Sentinel suites (all green, no regression):
  ColumnarCachedBatchE2ESuite                  15/0
  ColumnarCachedBatchStatsBlobSuite             4/0
  ColumnarCachedBatchFramedBytesSuite           3/0
  ColumnarCachedBatchBuildFilterSuite           1/0
  ColumnarCachedBatchBuildFilterPruneSuite      3/0
  ColumnarCachedBatchKryoSuite                  6/0
  ColumnarCachedBatchIntFamilyMarshalSuite      8/0
  ColumnarCachedBatchKryoBoundaryProbeBugSuite  1/0
  ColumnarCachedBatchSerializerHelperSuite      4/0 (new)

Generated-by: Claude claude-opus-4.7
@github-actions github-actions Bot added the VELOX label May 29, 2026
@yaooqinn yaooqinn merged commit a897401 into apache:main May 29, 2026
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants