[VL] Add lazy per-column deserialization for Columnar Table Cache #12211

Open
jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:table-cache-lazy-deserialization

@jackylee-ch jackylee-ch commented Jun 1, 2026

What changes are proposed in this pull request?

Add lazy per-column deserialization for the Velox columnar table cache.

Key points:

  • Introduce V3 cache bytes (0xFECA5303) with independently serialized column payloads.
  • Read V3 cache data through projected Velox LazyVectors, so unreferenced columns are not deserialized.
  • Keep V2 compatibility and route reads by frame magic, independent of the current lazy-deserialization config.
  • Add spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled (default: false).

Performance

The latest checked-in benchmark result comes from a local validation run (OpenJDK 17.0.16, macOS 26.5, Apple M5 Pro; 10,000 rows, 4 partitions, 1 iteration):

| Scenario | Legacy | partitionStats only | Lazy V3 |
|---|---|---|---|
| Cache build | 167 ms (1.0X) | 111 ms (1.5X) | 90 ms (1.9X) |
| Read 1/16 cols | 13 ms (1.0X) | 14 ms (0.9X) | 11 ms (1.1X) |
| Read 4/16 cols | 21 ms (1.0X) | 19 ms (1.1X) | 19 ms (1.1X) |
| Read all 16 cols | 12 ms (1.0X) | 12 ms (1.0X) | 11 ms (1.1X) |
| Filter + 2/16 cols | 9 ms (1.0X) | 6 ms (1.4X) | 5 ms (1.6X) |

The benchmark defaults remain production-scale (100M rows, 32 partitions, 3 iterations) and can be overridden by Spark conf.

How was this patch tested?

  • ColumnarCachedBatchFramedBytesSuite
  • ColumnarCachedBatchSerializerHelperSuite
  • ColumnarCachedBatchLazySerdeTest
  • ColumnarCachedBatchE2ESuite
  • ColumnarTableCacheLazyDeserBenchmark
  • Native rebuild for libgluten.dylib and libvelox.dylib

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7; Codex GPT-5

github-actions Bot added the CORE (works for Gluten Core), VELOX, and DOCS labels Jun 1, 2026
@jackylee-ch jackylee-ch marked this pull request as draft June 1, 2026 04:58

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare June 1, 2026 08:59

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare June 1, 2026 09:05

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare June 1, 2026 09:08

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

…ormat

Introduces a new V3 wire format for columnar table cache that enables
per-column lazy deserialization via Velox LazyVector, reducing CPU for
wide-table scans by only decoding referenced columns.

The current table cache always deserializes all N columns even when a
query only needs M columns (M << N). For a 16-column table with a 1-column
query, this wastes 15/16 of the deserialization work. This change adds a
new V3 per-column format and lazy loading to eliminate that overhead.

V3: [magic=0xFECA5303(4B)][statsLen(4B)][statsBlob][numRows(4B)][numCols(4B)]
    [per-col: colLen(4B) + serializeSingleColumn bytes]
V2: unchanged [magic=0xFECA5302(4B)][statsLen(4B)][statsBlob][bytesLen(4B)][bytesBlob]

V3 is NOT backward compatible: V2-only readers cannot parse V3 frames. V3-aware code still reads V2 data through the V2 path.
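
The V3 layout above can be illustrated with a small encoder and parser. This is a minimal sketch, not the code in this PR: `FrameLayoutSketch`, `V3Frame`, `encodeV3`, and `parseV3` are hypothetical names, and big-endian integers are an assumption (the description only implies that the fourth magic byte carries the version). The check for an empty column payload with `numRows > 0` mirrors the hard error described for `deserializeV3()`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class FrameLayoutSketch {
    static final int MAGIC_V3 = 0xFECA5303;

    /** Parsed view of a V3 frame: stats blob, row count, one byte slice per column. */
    static final class V3Frame {
        final byte[] stats;
        final int numRows;
        final List<byte[]> columns;

        V3Frame(byte[] stats, int numRows, List<byte[]> columns) {
            this.stats = stats;
            this.numRows = numRows;
            this.columns = columns;
        }
    }

    /** Writes [magic][statsLen][stats][numRows][numCols] followed by [colLen][colBytes] per column. */
    static byte[] encodeV3(byte[] stats, int numRows, List<byte[]> columns) {
        int size = 16 + stats.length; // magic + statsLen + numRows + numCols
        for (byte[] col : columns) {
            size += 4 + col.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default (assumption)
        buf.putInt(MAGIC_V3).putInt(stats.length).put(stats);
        buf.putInt(numRows).putInt(columns.size());
        for (byte[] col : columns) {
            buf.putInt(col.length).put(col);
        }
        return buf.array();
    }

    /** Splits a V3 frame into per-column slices without decoding any column. */
    static V3Frame parseV3(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        if (readInt(buf) != MAGIC_V3) {
            throw new IllegalArgumentException("not a V3 frame");
        }
        byte[] stats = readBlob(buf);
        int numRows = readInt(buf);
        int numCols = readInt(buf);
        if (numRows < 0 || numCols < 0) {
            throw new IllegalArgumentException("negative row or column count");
        }
        List<byte[]> columns = new ArrayList<>();
        for (int i = 0; i < numCols; i++) {
            byte[] col = readBlob(buf);
            if (col.length == 0 && numRows > 0) {
                // Malformed frame: fail loudly rather than return silently wrong data.
                throw new IllegalArgumentException("empty column payload with numRows > 0");
            }
            columns.add(col);
        }
        return new V3Frame(stats, numRows, columns);
    }

    private static int readInt(ByteBuffer buf) {
        if (buf.remaining() < 4) {
            throw new IllegalArgumentException("short frame");
        }
        return buf.getInt();
    }

    private static byte[] readBlob(ByteBuffer buf) {
        int len = readInt(buf);
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("bad length: " + len);
        }
        byte[] out = new byte[len];
        buf.get(out);
        return out;
    }
}
```

Because every column carries its own length prefix, a reader can slice out column k without touching the bytes of any other column, which is what makes per-column lazy decoding possible.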

- `ColumnarBatchSerializer.h`: Add virtual `framedSerializeWithStatsV3()`
  and `deserializeV3()` to base class for symmetric write/read V3 APIs
  (no Velox headers needed in core JNI wrapper).
- `VeloxColumnarBatchSerializer.h/.cc`:
  - `framedSerializeWithStatsV3()`: Calls `getFlattenedRowVector()` first
    (force-loads any lazy/dict children) then uses `serializeSingleColumn`
    per column. Each column's bytes are self-contained.
  - `CachedColumnLoader`: VectorLoader backed by per-column byte slice.
    Decodes via `deserializeSingleColumn` on first access; frees raw bytes
    post-load to prevent double-buffer memory waste.
  - `deserializeV3()`: Returns M-column RowVector with LazyVector children
    (only requested columns). Schema matches selectedAttributes exactly.
    Correctly handles numRows==0 (null constant) vs colLen==0 with numRows>0
    (hard error: malformed frame rather than silent data corruption).
  - `buildStatsBlob()`: Extracted private helper shared by V3 write path.
  - `options_`: Explicitly set compressionKind=NONE and nullsFirst=false
    as required by serializeSingleColumn / deserializeSingleColumn.
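
The load-once behaviour described for `CachedColumnLoader` can be illustrated with a Java analogy. This is only a sketch of the idea: the real loader is a C++ Velox `VectorLoader`, and `LazyColumnSketch`, its `decoder` parameter, and its accessor methods are hypothetical names introduced here.

```java
import java.util.function.Function;

/** Holds one column's serialized bytes and decodes them on first access only. */
final class LazyColumnSketch<T> {
    private byte[] rawBytes;                  // released after the first decode
    private T decoded;
    private boolean loaded;
    private final Function<byte[], T> decoder; // stand-in for deserializeSingleColumn

    LazyColumnSketch(byte[] rawBytes, Function<byte[], T> decoder) {
        this.rawBytes = rawBytes;
        this.decoder = decoder;
    }

    /** Decodes on the first call; later calls return the cached value. */
    synchronized T get() {
        if (!loaded) {
            decoded = decoder.apply(rawBytes);
            rawBytes = null; // drop the raw copy so both forms are not kept in memory
            loaded = true;
        }
        return decoded;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }

    synchronized boolean holdsRawBytes() {
        return rawBytes != null;
    }
}
```

A column that is never read through `get()` never pays the decode cost, which is the saving targeted for queries that reference only a few columns.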

- `JniWrapper.cc`: Add `serializeWithStatsV3` and `deserializeWithProjection`
  JNI methods via base-class virtual dispatch (no Velox headers in core).
- `ColumnarBatchSerializerJniWrapper.java`: Add corresponding native methods.
  - `serializeWithStatsV3(long handle)`: Returns null for non-Velox backends.
  - `deserializeWithProjection(long serializerHandle, byte[] data, int[] cols)`:
    null cols=all, int[0]=zero cols, int[m]=M specific cols.
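
The three meanings of the `cols` argument can be sketched as follows. `ProjectionSketch.resolve` is a hypothetical helper, not part of the JNI wrapper, and the range check is an assumption about how invalid indices might be handled.

```java
import java.util.stream.IntStream;

final class ProjectionSketch {
    /** Resolves a `cols` argument into the concrete column indices to expose. */
    static int[] resolve(int[] cols, int numCols) {
        if (cols == null) {
            // null means "all columns"
            return IntStream.range(0, numCols).toArray();
        }
        for (int c : cols) {
            if (c < 0 || c >= numCols) {
                // Assumed behaviour: reject indices outside the cached schema.
                throw new IllegalArgumentException("column index out of range: " + c);
            }
        }
        // int[0] selects zero columns; int[m] selects exactly those m columns, in order.
        return cols.clone();
    }
}
```

Keeping `null` distinct from an empty array matters here: the former requests every column, while the latter requests none.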

- `GlutenConfig.scala`: Add `COLUMNAR_TABLE_CACHE_LAZY_DESERIALIZATION_ENABLED`
  (key: `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled`,
  default: false) adjacent to existing tableCache configs.
- `ColumnarCachedBatchSerializer.scala`:
  - `parseFramedBytes()`: Routes on magic byte[3]: 0x02->V2, 0x03->V3.
    `parseV2Frame` fully validates V2 magic; `parseV3Frame` extracts stats
    and returns full frame for C++ to re-validate.
  - Write path: 3-branch gating at partition level (configs hoisted outside
    Iterator). V3->serializeWithStatsV3, V2-stats->serializeWithStats, else->legacy.
  - Read path: V3 bytes ALWAYS route to `deserializeWithProjection` (independent
    of lazyEnabled config), preventing V3 bytes from hitting V2 Presto deserializer.
    When lazyEnabled=false + V3 bytes: passes null (loadAll) so all columns are
    force-loaded via ensureFlattened() with no data loss.
  - `serializeOneBatchWithStatsV3`: Companion object method with two-arm catch
    and independent `statsExtV3AvailableFlag` latch (separate from V2 latch).
- `docs/Configuration.md`: Add new config entry to prevent AllGlutenConfiguration
  CI failure.
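
The routing rules above can be summarized in a short sketch. It is illustrative only: `CacheRoutingSketch` and its members are hypothetical, the flag names are placeholders, the precedence of the lazy flag over the stats flag is an assumption, and legacy (unframed) bytes are out of scope.

```java
final class CacheRoutingSketch {
    enum Format { V2, V3 }

    enum WritePath { V3_PER_COLUMN, V2_WITH_STATS, LEGACY }

    /** Picks the reader from the frame itself, never from the current config. */
    static Format routeByMagic(byte[] frame) {
        if (frame == null || frame.length < 4) {
            throw new IllegalArgumentException("short frame");
        }
        // Both magics share the prefix FE CA 53; byte[3] carries the version.
        if ((frame[0] & 0xFF) != 0xFE || (frame[1] & 0xFF) != 0xCA || (frame[2] & 0xFF) != 0x53) {
            throw new IllegalArgumentException("unknown magic prefix");
        }
        switch (frame[3]) {
            case 0x02:
                return Format.V2;
            case 0x03:
                return Format.V3;
            default:
                throw new IllegalArgumentException("unknown frame version: " + frame[3]);
        }
    }

    /** Three-branch write gating, decided once per partition (assumed precedence). */
    static WritePath chooseWritePath(boolean lazyEnabled, boolean statsEnabled) {
        if (lazyEnabled) {
            return WritePath.V3_PER_COLUMN;
        }
        return statsEnabled ? WritePath.V2_WITH_STATS : WritePath.LEGACY;
    }

    /** Projection passed to the V3 reader; null asks it to load every column. */
    static int[] projectionForV3(boolean lazyEnabled, int[] requestedCols) {
        return lazyEnabled ? requestedCols : null;
    }
}
```

Routing reads on the frame magic means data cached while the config was enabled stays readable after the config is turned off.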

- `ColumnarCachedBatchFramedBytesSuite`: V3/V2 routing, magic validation,
  V3 stats extraction, short-frame rejection, per-column framing boundary
  documentation (+5 new tests, 8 total).
- `ColumnarCachedBatchLazySerdeTest`: 7 E2E integration tests covering V3
  write+read correctness, projected read, count(*), all-types coverage,
  lazyEnabled=false config toggle, cross-config V3->lazy=false read.
- `ColumnarCachedBatchE2ESuite`: 2 V3 smoke tests.

- `ColumnarTableCacheLazyDeserBenchmark`: 5 benchmark scenarios comparing
  legacy / partitionStats-only / lazy-V3 modes:
  1. Cache build overhead (write-path cost of V3)
  2. Read 1/16 columns (maximum skip benefit)
  3. Read 4/16 columns (moderate skip benefit)
  4. Read all 16 columns (LazyVector overhead case)
  5. Filter + 2/16 columns (batch-skip + column-skip combined)

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare June 1, 2026 11:21

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch marked this pull request as ready for review June 1, 2026 14:20
jackylee-ch (Contributor, Author) commented

@yaooqinn PTAL
