[VL] Add lazy per-column deserialization for Columnar Table Cache #12211
Open
jackylee-ch wants to merge 1 commit into
…ormat
Introduces a new V3 wire format for the columnar table cache that enables
per-column lazy deserialization via Velox LazyVector, reducing CPU time for
wide-table scans by decoding only the referenced columns.
The current table cache always deserializes all N columns even when a
query only needs M columns (M << N). For a 16-column table with a 1-column
query, this wastes 15/16 of the deserialization work. This change adds a
new V3 per-column format and lazy loading to eliminate that overhead.
V3: [magic=0xFECA5303(4B)][statsLen(4B)][statsBlob][numRows(4B)][numCols(4B)]
[per-col: colLen(4B) + serializeSingleColumn bytes]
V2: unchanged [magic=0xFECA5302(4B)][statsLen(4B)][statsBlob][bytesLen(4B)][bytesBlob]
V3 is NOT backward compatible with V2 readers. V3-aware code still reads V2 data via the V2 path.
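The layout above can be sketched in Java as follows. This is a minimal illustration, not the Gluten implementation: `V3FrameSketch` and its methods are hypothetical names, and big-endian 4-byte fields are an assumption (the magic bytes FE CA 53 03 agree with routing on magic byte[3], but the byte order of the length fields is not stated here). The column payloads are treated as opaque bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Illustrative V3 frame codec (hypothetical names, not the Gluten API). */
final class V3FrameSketch {
    static final int MAGIC_V3 = 0xFECA5303;

    /** Writes [magic][statsLen][statsBlob][numRows][numCols], then colLen + bytes per column. */
    static byte[] encode(byte[] stats, int numRows, List<byte[]> columns) {
        int size = 16 + stats.length; // four 4-byte header fields plus the stats blob
        for (byte[] col : columns) {
            size += 4 + col.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default (assumption)
        buf.putInt(MAGIC_V3).putInt(stats.length).put(stats);
        buf.putInt(numRows).putInt(columns.size());
        for (byte[] col : columns) {
            buf.putInt(col.length).put(col);
        }
        return buf.array();
    }

    /** Returns 2 or 3 based on the fourth magic byte. */
    static int version(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("short frame");
        }
        if ((frame[0] & 0xFF) != 0xFE || (frame[1] & 0xFF) != 0xCA || (frame[2] & 0xFF) != 0x53) {
            throw new IllegalArgumentException("bad magic prefix");
        }
        switch (frame[3]) {
            case 0x02: return 2;
            case 0x03: return 3;
            default: throw new IllegalArgumentException("unknown frame version");
        }
    }

    /** Locates each column's byte slice without decoding any column payload. */
    static List<ByteBuffer> columnSlices(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        if (buf.getInt() != MAGIC_V3) {
            throw new IllegalArgumentException("not a V3 frame");
        }
        int statsLen = buf.getInt();
        buf.position(buf.position() + statsLen); // skip the stats blob
        int numRows = buf.getInt();
        int numCols = buf.getInt();
        List<ByteBuffer> slices = new ArrayList<>(numCols);
        for (int i = 0; i < numCols; i++) {
            int colLen = buf.getInt();
            if (colLen == 0 && numRows > 0) {
                throw new IllegalStateException("malformed frame: empty column with rows");
            }
            ByteBuffer slice = buf.slice();
            slice.limit(colLen); // view over this column's bytes only
            slices.add(slice);
            buf.position(buf.position() + colLen);
        }
        return slices;
    }
}
```

Because every column carries its own length prefix, a reader can skip to any column by walking the prefixes, which is what makes per-column lazy decoding possible.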
- `ColumnarBatchSerializer.h`: Add virtual `framedSerializeWithStatsV3()`
  and `deserializeV3()` to base class for symmetric write/read V3 APIs
  (no Velox headers needed in core JNI wrapper).
- `VeloxColumnarBatchSerializer.h/.cc`:
  - `framedSerializeWithStatsV3()`: Calls `getFlattenedRowVector()` first
    (force-loads any lazy/dict children) then uses `serializeSingleColumn`
    per column. Each column's bytes are self-contained.
  - `CachedColumnLoader`: VectorLoader backed by per-column byte slice.
    Decodes via `deserializeSingleColumn` on first access; frees raw bytes
    post-load to prevent double-buffer memory waste.
  - `deserializeV3()`: Returns M-column RowVector with LazyVector children
    (only requested columns). Schema matches selectedAttributes exactly.
    Correctly handles numRows==0 (null constant) vs colLen==0 with numRows>0
    (hard error: malformed frame rather than silent data corruption).
  - `buildStatsBlob()`: Extracted private helper shared by V3 write path.
  - `options_`: Explicitly set compressionKind=NONE and nullsFirst=false
    as required by serializeSingleColumn / deserializeSingleColumn.
- `JniWrapper.cc`: Add `serializeWithStatsV3` and `deserializeWithProjection`
  JNI methods via base-class virtual dispatch (no Velox headers in core).
- `ColumnarBatchSerializerJniWrapper.java`: Add corresponding native methods.
  - `serializeWithStatsV3(long handle)`: Returns null for non-Velox backends.
  - `deserializeWithProjection(long serializerHandle, byte[] data, int[] cols)`:
    null cols=all, int[0]=zero cols, int[m]=M specific cols.
- `GlutenConfig.scala`: Add `COLUMNAR_TABLE_CACHE_LAZY_DESERIALIZATION_ENABLED`
  (key: `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled`,
  default: false) adjacent to existing tableCache configs.
- `ColumnarCachedBatchSerializer.scala`:
  - `parseFramedBytes()`: Routes on magic byte[3]: 0x02->V2, 0x03->V3.
    `parseV2Frame` fully validates V2 magic; `parseV3Frame` extracts stats
    and returns full frame for C++ to re-validate.
  - Write path: 3-branch gating at partition level (configs hoisted outside
    Iterator). V3->serializeWithStatsV3, V2-stats->serializeWithStats, else->legacy.
  - Read path: V3 bytes ALWAYS route to `deserializeWithProjection` (independent
    of lazyEnabled config), preventing V3 bytes from hitting V2 Presto deserializer.
    When lazyEnabled=false + V3 bytes: passes null (loadAll) so all columns are
    force-loaded via ensureFlattened() with no data loss.
  - `serializeOneBatchWithStatsV3`: Companion object method with two-arm catch
    and independent `statsExtV3AvailableFlag` latch (separate from V2 latch).
- `docs/Configuration.md`: Add new config entry to prevent AllGlutenConfiguration
  CI failure.
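The loader and projection behaviour described in this list can be sketched as below. It is a rough Java analogue under stated assumptions, not the C++ `CachedColumnLoader`: `LazyColumnSketch`, `ProjectionSketch`, and the `decoder` function are hypothetical stand-ins for the Velox LazyVector, the projected read, and `deserializeSingleColumn`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative lazy column: decodes on first access, then releases the raw bytes. */
final class LazyColumnSketch<T> {
    private byte[] raw;                        // encoded bytes, dropped after the first load
    private final Function<byte[], T> decoder; // hypothetical per-column decoder
    private T value;
    private boolean loaded;

    LazyColumnSketch(byte[] raw, Function<byte[], T> decoder) {
        this.raw = raw;
        this.decoder = decoder;
    }

    boolean holdsRawBytes() {
        return raw != null;
    }

    T get() {
        if (!loaded) {
            value = decoder.apply(raw);
            raw = null; // avoid keeping both the encoded and decoded copies
            loaded = true;
        }
        return value;
    }
}

/** Illustrative projection: null selects all columns, an empty array selects none. */
final class ProjectionSketch {
    static <T> List<LazyColumnSketch<T>> project(
            List<byte[]> encodedColumns, int[] cols, Function<byte[], T> decoder) {
        List<LazyColumnSketch<T>> out = new ArrayList<>();
        if (cols == null) {
            for (byte[] bytes : encodedColumns) {
                out.add(new LazyColumnSketch<>(bytes, decoder));
            }
            return out;
        }
        for (int c : cols) {
            // get() throws IndexOutOfBoundsException for an invalid column index
            out.add(new LazyColumnSketch<>(encodedColumns.get(c), decoder));
        }
        return out;
    }
}
```

In this sketch, a column that is never read is never passed to the decoder, and a column that is read once holds only its decoded value afterwards.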
- `ColumnarCachedBatchFramedBytesSuite`: V3/V2 routing, magic validation,
V3 stats extraction, short-frame rejection, per-column framing boundary
documentation (+5 new tests, 8 total).
- `ColumnarCachedBatchLazySerdeTest`: 7 E2E integration tests covering V3
write+read correctness, projected read, count(*), all-types coverage,
lazyEnabled=false config toggle, cross-config V3->lazy=false read.
- `ColumnarCachedBatchE2ESuite`: 2 V3 smoke tests.
- `ColumnarTableCacheLazyDeserBenchmark`: 5 benchmark scenarios comparing
  legacy / partitionStats-only / lazy-V3 modes:
  1. Cache build overhead (write-path cost of V3)
  2. Read 1/16 columns (maximum skip benefit)
  3. Read 4/16 columns (moderate skip benefit)
  4. Read all 16 columns (LazyVector overhead case)
  5. Filter + 2/16 columns (batch-skip + column-skip combined)
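The write-path and read-path gating that these tests and benchmarks exercise can be sketched as follows. The names are hypothetical, and the exact conditions are an assumption: the commit message names the three write branches but does not spell out their predicates, so this sketch assumes the lazy flag alone selects V3 and a separate stats flag selects V2.

```java
/** Illustrative gating for the cache write and read paths (hypothetical names). */
final class CachePathGatingSketch {
    enum WritePath { V3, V2_STATS, LEGACY }

    /** Three-branch write gating; the predicates are assumptions, not taken from the patch. */
    static WritePath chooseWritePath(boolean lazyEnabled, boolean statsEnabled) {
        if (lazyEnabled) {
            return WritePath.V3;
        }
        if (statsEnabled) {
            return WritePath.V2_STATS;
        }
        return WritePath.LEGACY;
    }

    /**
     * Column indices to pass to a projected read of V3 bytes.
     * V3 bytes always take the projected read; with lazy loading disabled
     * the projection is null, which means "load all columns".
     */
    static int[] v3ReadProjection(boolean lazyEnabled, int[] requestedCols) {
        return lazyEnabled ? requestedCols : null;
    }
}
```

Keeping the read decision keyed on the frame version rather than on the config means bytes cached as V3 remain readable after the config is switched off.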
Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
@yaooqinn PTAL
What changes are proposed in this pull request?
Add lazy per-column deserialization for the Velox columnar table cache.
Key points:
- Add a V3 cache frame format (magic `0xFECA5303`) with independently serialized column payloads.
- Expose projected columns as Velox `LazyVector`s, so unreferenced columns are not deserialized.
- Gate the feature behind `spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled` (default: `false`).
Performance
Latest checked-in benchmark result is a local validation run:
OpenJDK 17.0.16, macOS 26.5, Apple M5 Pro, `10,000` rows, `4` partitions, `1` iteration.
The benchmark defaults remain production-scale (`100M` rows, `32` partitions, `3` iterations) and can be overridden by Spark conf.
How was this patch tested?
- `ColumnarCachedBatchFramedBytesSuite`
- `ColumnarCachedBatchSerializerHelperSuite`
- `ColumnarCachedBatchLazySerdeTest`
- `ColumnarCachedBatchE2ESuite`
- `ColumnarTableCacheLazyDeserBenchmark`
- `libgluten.dylib` and `libvelox.dylib`
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7; Codex GPT-5