[VL] Add lazy per-column deserialization for Columnar Table Cache #12211

Open

jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:table-cache-lazy-deserialization

Contributor

jackylee-ch commented Jun 1, 2026

What changes are proposed in this pull request?

Add lazy per-column deserialization for the Velox columnar table cache.

Key points:

- Introduce V3 cache bytes (`0xFECA5303`) with independently serialized column payloads.
- Read V3 cache data through projected Velox LazyVectors, so unreferenced columns are not deserialized.
- Keep V2 compatibility and route reads by frame magic, independent of the current lazy-deserialization config.
- Add `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled` (default: `false`).

Performance

The latest checked-in benchmark result comes from a local validation run:
OpenJDK 17.0.16, macOS 26.5, Apple M5 Pro, 10,000 rows, 4 partitions, 1 iteration.

| Scenario | Legacy | partitionStats only | Lazy V3 |
| --- | --- | --- | --- |
| Cache build | 167 ms (1.0X) | 111 ms (1.5X) | 90 ms (1.9X) |
| Read 1/16 cols | 13 ms (1.0X) | 14 ms (0.9X) | 11 ms (1.1X) |
| Read 4/16 cols | 21 ms (1.0X) | 19 ms (1.1X) | 19 ms (1.1X) |
| Read all 16 cols | 12 ms (1.0X) | 12 ms (1.0X) | 11 ms (1.1X) |
| Filter + 2/16 cols | 9 ms (1.0X) | 6 ms (1.4X) | 5 ms (1.6X) |

The benchmark defaults remain production-scale (100M rows, 32 partitions, 3 iterations) and can be overridden by Spark conf.

How was this patch tested?

- ColumnarCachedBatchFramedBytesSuite
- ColumnarCachedBatchSerializerHelperSuite
- ColumnarCachedBatchLazySerdeTest
- ColumnarCachedBatchE2ESuite
- ColumnarTableCacheLazyDeserBenchmark
- Native rebuild for libgluten.dylib and libvelox.dylib

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7; Codex GPT-5

github-actions Bot added the CORE, VELOX, and DOCS labels

jackylee-ch marked this pull request as draft

June 1, 2026 04:58

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare

June 1, 2026 08:59

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare

June 1, 2026 09:05

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare

June 1, 2026 09:08

[VL][TableCache] Add lazy per-column deserialization with V3 framed format

8b09d6b

Introduces a new V3 wire format for the columnar table cache that enables
per-column lazy deserialization via Velox LazyVector, reducing CPU cost for
wide-table scans by decoding only the referenced columns.

The current table cache always deserializes all N columns even when a
query needs only M of them (M << N). For a 16-column table with a 1-column
query, roughly 15/16 of the deserialization work is wasted, assuming similar
per-column cost. This change adds a new V3 per-column format and lazy loading
to eliminate that overhead.

V3: [magic=0xFECA5303(4B)][statsLen(4B)][statsBlob][numRows(4B)][numCols(4B)]
    [per-col: colLen(4B) + serializeSingleColumn bytes]
V2: unchanged [magic=0xFECA5302(4B)][statsLen(4B)][statsBlob][bytesLen(4B)][bytesBlob]

V3 frames are NOT readable by V2-only readers. V3-aware code still reads V2 data through the V2 path.
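
The V3 layout above can be illustrated with a small JVM-side sketch. This is not the Gluten implementation: the class and method names are hypothetical, column payloads are treated as opaque bytes, and the 4-byte fields are assumed to be big-endian (which would be consistent with the version living in magic byte[3], but is not stated in the layout itself).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the V2/V3 frame layout; not the Gluten implementation. */
final class FrameSketch {
  static final int MAGIC_V2 = 0xFECA5302;
  static final int MAGIC_V3 = 0xFECA5303;

  /** Builds a V3 frame: magic, statsLen, stats, numRows, numCols, then (colLen, bytes) per column. */
  static byte[] encodeV3(byte[] stats, int numRows, List<byte[]> columns) {
    int size = 16 + stats.length; // four 4-byte header fields plus the stats blob
    for (byte[] col : columns) {
      size += 4 + col.length;
    }
    ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default (assumption)
    buf.putInt(MAGIC_V3).putInt(stats.length).put(stats);
    buf.putInt(numRows).putInt(columns.size());
    for (byte[] col : columns) {
      buf.putInt(col.length).put(col);
    }
    return buf.array();
  }

  /** Returns 2 or 3 depending on the frame magic, and rejects anything else. */
  static int version(byte[] frame) {
    if (frame.length < 4) {
      throw new IllegalArgumentException("short frame");
    }
    int magic = ByteBuffer.wrap(frame).getInt();
    if (magic == MAGIC_V2) return 2;
    if (magic == MAGIC_V3) return 3;
    throw new IllegalArgumentException("unknown magic");
  }

  /** Returns {offset, length} for every column payload without decoding any of them. */
  static List<int[]> columnSlices(byte[] frame) {
    ByteBuffer buf = ByteBuffer.wrap(frame);
    if (buf.getInt() != MAGIC_V3) {
      throw new IllegalArgumentException("not a V3 frame");
    }
    int statsLen = buf.getInt();
    buf.position(buf.position() + statsLen); // skip the stats blob
    buf.getInt();                            // numRows, unused in this sketch
    int numCols = buf.getInt();
    List<int[]> slices = new ArrayList<>();
    for (int i = 0; i < numCols; i++) {
      int len = buf.getInt();
      slices.add(new int[] {buf.position(), len});
      buf.position(buf.position() + len);    // jump over the payload
    }
    return slices;
  }
}
```

Because each column carries its own length prefix, a reader can locate any column's bytes by walking the prefixes alone, which is what makes skipping unreferenced columns possible.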

- `ColumnarBatchSerializer.h`: Add virtual `framedSerializeWithStatsV3()`
  and `deserializeV3()` to base class for symmetric write/read V3 APIs
  (no Velox headers needed in core JNI wrapper).
- `VeloxColumnarBatchSerializer.h/.cc`:
  - `framedSerializeWithStatsV3()`: Calls `getFlattenedRowVector()` first
    (force-loads any lazy/dict children) then uses `serializeSingleColumn`
    per column. Each column's bytes are self-contained.
  - `CachedColumnLoader`: VectorLoader backed by per-column byte slice.
    Decodes via `deserializeSingleColumn` on first access; frees raw bytes
    post-load to prevent double-buffer memory waste.
  - `deserializeV3()`: Returns M-column RowVector with LazyVector children
    (only requested columns). Schema matches selectedAttributes exactly.
    Distinguishes numRows==0 (returns a null constant) from colLen==0 with numRows>0
    (hard error: the frame is treated as malformed rather than risking silent data corruption).
  - `buildStatsBlob()`: Extracted private helper shared by V3 write path.
  - `options_`: Explicitly set compressionKind=NONE and nullsFirst=false
    as required by serializeSingleColumn / deserializeSingleColumn.
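
The load-once behavior described for `CachedColumnLoader` can be sketched on the JVM side as follows. This is only an analogue, not the C++ Velox code: `LazyColumn` and its `decoder` function are hypothetical stand-ins for the loader and `deserializeSingleColumn`, and the sketch is not thread-safe.

```java
import java.util.function.Function;

/** Hypothetical analogue of a lazy column loader: decode on first access, then drop raw bytes. */
final class LazyColumn<T> {
  private byte[] raw;                         // serialized payload, released after decoding
  private final Function<byte[], T> decoder;  // stand-in for the real per-column decoder
  private T value;
  private boolean loaded;

  LazyColumn(byte[] raw, Function<byte[], T> decoder) {
    this.raw = raw;
    this.decoder = decoder;
  }

  /** Decodes the payload on the first call and returns the cached value afterwards. */
  T get() {
    if (!loaded) {
      value = decoder.apply(raw);
      raw = null;      // free the serialized bytes so both copies are not kept alive
      loaded = true;
    }
    return value;
  }

  boolean isLoaded() {
    return loaded;
  }

  boolean holdsRawBytes() {
    return raw != null;
  }
}
```

A column that is never read never pays the decode cost, and a column that is read holds either its raw bytes or its decoded value, not both.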

- `JniWrapper.cc`: Add `serializeWithStatsV3` and `deserializeWithProjection`
  JNI methods via base-class virtual dispatch (no Velox headers in core).
- `ColumnarBatchSerializerJniWrapper.java`: Add corresponding native methods.
  - `serializeWithStatsV3(long handle)`: Returns null for non-Velox backends.
  - `deserializeWithProjection(long serializerHandle, byte[] data, int[] cols)`:
    null cols=all, int[0]=zero cols, int[m]=M specific cols.
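
The `cols` convention above (null, empty, or M indices) could be resolved roughly like this. `ProjectionSketch.resolve` is a hypothetical helper for illustration, and the bounds check is an assumption about how out-of-range indices might be handled rather than documented behavior.

```java
/** Hypothetical helper mirroring the cols argument convention; not the actual JNI code. */
final class ProjectionSketch {
  /** null selects all columns, an empty array selects none, otherwise the listed indices are used. */
  static int[] resolve(int[] cols, int numCols) {
    if (cols == null) {
      int[] all = new int[numCols];
      for (int i = 0; i < numCols; i++) {
        all[i] = i;
      }
      return all;
    }
    for (int c : cols) {
      // Assumed validation: reject indices outside the cached schema.
      if (c < 0 || c >= numCols) {
        throw new IndexOutOfBoundsException("column " + c);
      }
    }
    return cols.clone(); // defensive copy; order of the request is preserved
  }
}
```

Keeping null and `int[0]` distinct matters for queries such as `count(*)`, which need the row count but no column data.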

- `GlutenConfig.scala`: Add `COLUMNAR_TABLE_CACHE_LAZY_DESERIALIZATION_ENABLED`
  (key: `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled`,
  default: false) adjacent to existing tableCache configs.
- `ColumnarCachedBatchSerializer.scala`:
  - `parseFramedBytes()`: Routes on magic byte[3]: 0x02->V2, 0x03->V3.
    `parseV2Frame` fully validates V2 magic; `parseV3Frame` extracts stats
    and returns full frame for C++ to re-validate.
  - Write path: 3-branch gating at partition level (configs hoisted outside
    Iterator). V3->serializeWithStatsV3, V2-stats->serializeWithStats, else->legacy.
  - Read path: V3 bytes ALWAYS route to `deserializeWithProjection` (independent
    of lazyEnabled config), preventing V3 bytes from hitting V2 Presto deserializer.
    When lazyEnabled=false + V3 bytes: passes null (loadAll) so all columns are
    force-loaded via ensureFlattened() with no data loss.
  - `serializeOneBatchWithStatsV3`: Companion object method with two-arm catch
    and independent `statsExtV3AvailableFlag` latch (separate from V2 latch).
- `docs/Configuration.md`: Add new config entry to prevent AllGlutenConfiguration
  CI failure.
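
The write-path gating and the V3 read-path fallback described above can be summarized in a short sketch. The names are hypothetical, and the precedence shown (lazy flag checked before the stats flag) is an assumption read off the order of the three branches, not confirmed by the patch text.

```java
/** Hypothetical summary of the routing rules; not the Scala serializer itself. */
final class CacheRoutingSketch {
  enum WriteMode { V3, V2_STATS, LEGACY }

  /** Picks the write format once per partition from the two config flags. */
  static WriteMode writeMode(boolean lazyEnabled, boolean statsEnabled) {
    if (lazyEnabled) return WriteMode.V3;        // assumed to take precedence
    if (statsEnabled) return WriteMode.V2_STATS;
    return WriteMode.LEGACY;
  }

  /**
   * Projection handed to the V3 reader. V3 bytes always go to the V3 reader;
   * with lazy deserialization disabled, null is passed so every column is loaded.
   */
  static int[] v3Projection(boolean lazyEnabled, int[] requested) {
    return lazyEnabled ? requested : null;
  }
}
```

The read side depends only on what was written (the frame magic), so toggling the config between caching and reading does not send V3 bytes to the V2 deserializer.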

- `ColumnarCachedBatchFramedBytesSuite`: V3/V2 routing, magic validation,
  V3 stats extraction, short-frame rejection, per-column framing boundary
  documentation (+5 new tests, 8 total).
- `ColumnarCachedBatchLazySerdeTest`: 7 E2E integration tests covering V3
  write+read correctness, projected read, count(*), all-types coverage,
  lazyEnabled=false config toggle, cross-config V3->lazy=false read.
- `ColumnarCachedBatchE2ESuite`: 2 V3 smoke tests.

- `ColumnarTableCacheLazyDeserBenchmark`: 5 benchmark scenarios comparing
  legacy / partitionStats-only / lazy-V3 modes:
  1. Cache build overhead (write-path cost of V3)
  2. Read 1/16 columns (maximum skip benefit)
  3. Read 4/16 columns (moderate skip benefit)
  4. Read all 16 columns (LazyVector overhead case)
  5. Filter + 2/16 columns (batch-skip + column-skip combined)

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare

June 1, 2026 11:21

jackylee-ch marked this pull request as ready for review

June 1, 2026 14:20

