[WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists #18497
rahil-c wants to merge 3 commits into apache:master from …
Conversation
Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`)
into the lance-spark metadata key `arrow.fixed-size-list.size` before calling
`LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow
FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead
of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`;
the encoding is driven by the Arrow schema itself.
- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter`
reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR
fields and attaches the Lance metadata key; non-vector fields pass through
unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-
Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
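The enrichment step reduces to parsing the `VECTOR(dim[,elem])` descriptor and validating the element type before the dimension is attached as the fixed-size-list size. A minimal, stdlib-only sketch of that parsing/validation logic (the `VectorDescriptor` class and `parse` helper are illustrative names, not the actual Hudi API; the real helper then attaches the dimension under `arrow.fixed-size-list.size` via Spark's `MetadataBuilder`):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the descriptor parsing behind the enrichment helper.
// Assumes "VECTOR(dim)" defaults to FLOAT elements and "VECTOR(dim,DOUBLE)"
// selects DOUBLE, per the VECTOR(dim[,elem]) form in this PR's description.
final class VectorDescriptor {
  private static final Pattern VECTOR =
      Pattern.compile("VECTOR\\((\\d+)(?:\\s*,\\s*(FLOAT|DOUBLE))?\\)");

  final int dimension;
  final String elementType; // "FLOAT" or "DOUBLE"

  private VectorDescriptor(int dimension, String elementType) {
    this.dimension = dimension;
    this.elementType = elementType;
  }

  /** Parses a hudi_type value like "VECTOR(128)" or "VECTOR(4,DOUBLE)". */
  static VectorDescriptor parse(String hudiType) {
    Matcher m = VECTOR.matcher(hudiType.trim());
    if (!m.matches()) {
      // Stands in for HoodieNotSupportedException in the real code path.
      throw new UnsupportedOperationException("Not a VECTOR descriptor: " + hudiType);
    }
    String elem = m.group(2) == null ? "FLOAT" : m.group(2);
    return new VectorDescriptor(Integer.parseInt(m.group(1)), elem);
  }
}
```

Rejecting anything that does not parse to a Float/Double element type mirrors the fail-fast behavior described above.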
- Tests in `TestLanceDataSource` (COW + MOR):
- `testFloatVectorRoundTrip`
- `testDoubleVectorRoundTrip`
- `testMultipleVectorColumns`
Each opens the written `.lance` file via `LanceFileReader` and asserts the
field is `ArrowType.FixedSizeList` with the expected `listSize` — the direct
regression guard that fails pre-fix and passes post-fix.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Companion to the Lance writer's native FixedSizeList encoding: on read, rehydrate the Hudi
`hudi_type = VECTOR(...)` Spark metadata that `LanceArrowUtils.fromArrowSchema` drops, so the
read schema matches the Parquet path. Gate the Parquet-only ArrayType→BinaryType vector rewrite
in `HoodieFileGroupReaderBasedFileFormat` on format == PARQUET; Lance returns vectors natively
as ArrayType, so the rewrite would trigger a spurious cast and break the read.
- `VectorConversionUtils.restoreVectorMetadataFromArrowSchema` walks the Arrow schema and
  re-attaches `VECTOR(dim[,DOUBLE])` for `FixedSizeList<Float32|Float64, dim>` fields.
- `HoodieSparkLanceReader.getSchema` and `SparkLanceReaderBase.read` now call it so downstream
  VECTOR-aware code sees the same schema as on Parquet.
- `TestLanceDataSource`: assert `hudi_type` metadata is restored on read for float, double, and
  multi-vector round-trips.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
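The read-side rehydration is the inverse mapping: from a FixedSizeList's (dimension, element type) pair back to a `hudi_type` descriptor string. A hedged, stdlib-only sketch (the class and method names here are illustrative; the real code derives the descriptor through `HoodieSchema.createVector(dim, elementType).toTypeDescriptor()`):

```java
// Illustrative inverse of the write-side mapping: given the shape of an Arrow
// FixedSizeList<Float32|Float64, dim> field, rebuild the hudi_type descriptor.
// Assumes FLOAT is the default element type and is omitted, matching the
// VECTOR(dim[,DOUBLE]) form used in this PR's description.
final class VectorMetadataRestore {
  static String descriptorFor(int listSize, boolean isDouble) {
    if (listSize <= 0) {
      throw new IllegalArgumentException("FixedSizeList size must be positive: " + listSize);
    }
    return isDouble ? "VECTOR(" + listSize + ",DOUBLE)" : "VECTOR(" + listSize + ")";
  }
}
```

Because only the (size, element type) pair survives in the Arrow type, this mapping is lossless for exactly the descriptor fields the Arrow type can express, which is why the footer metadata in the next commit exists as insurance for anything beyond them.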
Mirrors the Parquet writer: emit the comma-separated `colName:VECTOR(dim[,elemType])` descriptor
list under the existing `hoodie.vector.columns` key in the Lance file-footer key-value metadata.
The reader still derives VECTOR identity from the Arrow FixedSizeList type today; this footer
entry is insurance for future descriptor fields the Arrow type cannot express (quantization tags,
distance metrics, etc.) and keeps Lance files symmetric with Parquet files.
- `HoodieBaseLanceWriter`: new protected `additionalSchemaMetadata()` hook invoked during
  `close()`, so subclasses can contribute footer KV entries alongside bloom-filter metadata.
- `HoodieSparkLanceWriter`: override `additionalSchemaMetadata()` to emit `hoodie.vector.columns`
  when the Spark schema has any VECTOR column.
- `VectorConversionUtils`: add `buildVectorColumnsMetadataValue(StructType)` matching the
  Parquet-path helper's output format.
- `TestLanceDataSource`: assert the footer carries the expected descriptor list for float,
  double, and multi-vector round-trips.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
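The footer value itself is just the per-column descriptors joined into one string. A minimal sketch of the output format the commit describes, assuming schema-order iteration (the `Map`-based signature is a stand-in for the real `buildVectorColumnsMetadataValue(StructType)` parameter):

```java
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of building the hoodie.vector.columns footer value: a comma-separated
// list of colName:VECTOR(dim[,elemType]) entries, per this PR's description.
final class VectorColumnsFooter {
  /** columns maps column name -> its VECTOR descriptor, in schema order. */
  static String buildValue(Map<String, String> columns) {
    return columns.entrySet().stream()
        .map(e -> e.getKey() + ":" + e.getValue())
        .collect(Collectors.joining(","));
  }
}
```

A schema with `embedding VECTOR(128)` and `scores VECTOR(4,DOUBLE)` would thus yield `embedding:VECTOR(128),scores:VECTOR(4,DOUBLE)` in the footer.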
Codecov Report
❌ Patch coverage is …

@@             Coverage Diff              @@
##             master    #18497      +/-  ##
=============================================
+ Coverage      68.52%    68.81%   +0.29%
- Complexity     27968     28161     +193
=============================================
  Files           2440      2459      +19
  Lines         134456    135186     +730
  Branches       16226     16399     +173
=============================================
+ Hits           92138     93031     +893
+ Misses         35054     34761     -293
- Partials        7264      7394     +130
Review comments:
- Writing VECTOR columns as native Lance …
- Lance artifact coordinates are out of date: …
- The three added tests (…
- Minor code-quality nits
Summary
- Translate Hudi's VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`) into lance-spark's `arrow.fixed-size-list.size` metadata key before calling `LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow `FixedSizeList<Float32|Float64, dim>` (Lance's vector column encoding) instead of a plain variable-length list.
- Rehydrate the `hudi_type = VECTOR(...)` Spark metadata on read so the Lance path surfaces the same schema as Parquet. `LanceArrowUtils.fromArrowSchema` drops field metadata; a new `VectorConversionUtils.restoreVectorMetadataFromArrowSchema` walks the Arrow schema and rebuilds the descriptor from `FixedSizeList<Float32|Float64, dim>`.
- Gate `HoodieFileGroupReaderBasedFileFormat`'s `ArrayType → BinaryType` vector rewrite on `hoodieFileFormat == PARQUET`. Parquet needs the rewrite because VECTOR is stored as `FIXED_LEN_BYTE_ARRAY`; Lance returns vectors natively as `ArrayType`, so the rewrite triggered a spurious cast (`scala.MatchError: ArrayType(DoubleType,false)` at `Cast.castToBinaryCode`) once reader-side metadata was restored.
- Fail fast with `HoodieNotSupportedException` for non-ArrayType or non-Float/Double element types (matches lance-spark's `VectorUtils.shouldBeFixedSizeList`).

Why
Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the `hoodie.vector.columns` footer metadata + `FIXED_LEN_BYTE_ARRAY` storage. On the newly-added Lance base-file path, VECTOR columns silently degraded to plain `List<Float>`/`List<Double>` Arrow fields on write and lost their `hudi_type` descriptor on read — breaking parity with Parquet and defeating Lance's vector column encoding (tight packing, future vector search, etc.).

Lance-Spark's DDL-level `TBLPROPERTIES ('<col>.arrow.fixed-size-list.size' = '128')` knob ultimately just attaches that same `arrow.fixed-size-list.size` key to the column's Spark metadata. Since Hudi writes at the file level (bypassing Spark DDL), we attach the metadata directly from Hudi's existing VECTOR descriptor.

Implementation
Writer — `HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors`
- Uses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find fields tagged with `hudi_type = VECTOR(...)`.
- Attaches `LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY()` (`"arrow.fixed-size-list.size"`) as a Long with the dimension, preserving any pre-existing metadata (including `hudi_type`) via `MetadataBuilder.withMetadata(...)`.
- Requires an ArrayType of `FloatType` or `DoubleType`; throws `HoodieNotSupportedException` otherwise.

Downstream: `LanceArrowUtils.toArrowSchema(...)` then emits `FixedSizeList<elem, dim>`, and `LanceArrowWriter.createFieldWriter` automatically selects its `FixedSizeListWriter` branch when it sees the matching Arrow vector — no other code changes required.

Reader — `VectorConversionUtils#restoreVectorMetadataFromArrowSchema`
- Runs on the Spark schema produced by `LanceArrowUtils.fromArrowSchema`.
- For each `FixedSizeList` with a Float32/Float64 child, derives `HoodieSchema.createVector(dim, elementType).toTypeDescriptor()` and re-attaches it as `hudi_type` on the corresponding Spark `StructField`.
- Wired into `HoodieSparkLanceReader#getSchema` and `SparkLanceReaderBase#read`.

FileFormat fix — `HoodieFileGroupReaderBasedFileFormat#withVectorRewrite`
Early-return `(schema, Map.empty)` when `hoodieFileFormat != PARQUET`. The `ArrayType → BinaryType` rewrite is a Parquet-specific workaround; applying it on the Lance path produced an unsupported `Cast(ArrayType, BinaryType)` once reader-side VECTOR metadata was present.

Test plan
Added to `TestLanceDataSource` (parameterized across COW + MOR):
- `testFloatVectorRoundTrip` — 4-dim FLOAT VECTOR
- `testDoubleVectorRoundTrip` — 4-dim DOUBLE VECTOR
- `testMultipleVectorColumns` — two vector columns of different element types / dims on the same row

Each test:
- Opens the written `.lance` file directly via `LanceFileReader` and asserts `field.getType()` is `ArrowType.FixedSizeList` with the expected `listSize` (writer guard).
- Uses `assertHudiTypeMetadata` to assert `hudi_type = VECTOR(...)` is restored on the read schema (reader guard).

→ `Tests run: 24, Failures: 0, Errors: 0, Skipped: 0` (6 new + 18 existing).

Out of scope
- Non-Float/Double element types (lance-spark's `shouldBeFixedSizeList` rejects non-Float/Double; would require upstream Lance work or a separate encoding).

🤖 Generated with Claude Code