
[WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists #18497

Draft

rahil-c wants to merge 3 commits into apache:master from rahil-c:rahil/lance-vector-write

Conversation

@rahil-c
Collaborator

@rahil-c commented Apr 13, 2026

Summary

  • Writer: Translate Hudi's VECTOR logical-type metadata (hudi_type = "VECTOR(dim[,elem])") into lance-spark's arrow.fixed-size-list.size metadata key before calling LanceArrowUtils.toArrowSchema, so the Lance writer emits a native Arrow FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead of a plain variable-length list.
  • Reader: Re-attach the hudi_type = VECTOR(...) Spark metadata on read so the Lance path surfaces the same schema as Parquet. LanceArrowUtils.fromArrowSchema drops field metadata; a new VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the Arrow schema and rebuilds the descriptor from FixedSizeList<Float32|Float64, dim>.
  • Bug fix: Gate HoodieFileGroupReaderBasedFileFormat's ArrayType → BinaryType vector rewrite on hoodieFileFormat == PARQUET. Parquet needs the rewrite because VECTOR is stored as FIXED_LEN_BYTE_ARRAY; Lance returns vectors natively as ArrayType so the rewrite triggered a spurious cast (scala.MatchError: ArrayType(DoubleType,false) at Cast.castToBinaryCode) once reader-side metadata was restored.
  • Fails fast with HoodieNotSupportedException for non-ArrayType or non-Float/Double element types (matches lance-spark's VectorUtils.shouldBeFixedSizeList).
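The `hudi_type = "VECTOR(dim[,elem])"` descriptor grammar referenced throughout this summary can be sketched as a small parser. This is a dependency-free illustration, not Hudi's actual implementation (the real descriptor handling lives in HoodieSchema / VectorConversionUtils), and the FLOAT default for an omitted element type is an assumption inferred from the `VECTOR(dim[,DOUBLE])` shorthand used in the commit messages below:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the "VECTOR(dim[,elem])" hudi_type descriptor. */
public class VectorDescriptor {
    // dim is a positive integer; element type is optional and restricted to FLOAT/DOUBLE
    private static final Pattern VECTOR =
        Pattern.compile("VECTOR\\((\\d+)(?:\\s*,\\s*(FLOAT|DOUBLE))?\\)");

    final int dimension;
    final String elementType; // assumed to default to FLOAT when omitted

    VectorDescriptor(int dimension, String elementType) {
        this.dimension = dimension;
        this.elementType = elementType;
    }

    static VectorDescriptor parse(String hudiType) {
        Matcher m = VECTOR.matcher(hudiType.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a VECTOR descriptor: " + hudiType);
        }
        String elem = (m.group(2) == null) ? "FLOAT" : m.group(2);
        return new VectorDescriptor(Integer.parseInt(m.group(1)), elem);
    }
}
```

With this, `parse("VECTOR(128,DOUBLE)")` yields dimension 128 with DOUBLE elements, and `parse("VECTOR(4)")` falls back to FLOAT.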

Why

Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the hoodie.vector.columns footer metadata + FIXED_LEN_BYTE_ARRAY storage. On the newly added Lance base-file path, VECTOR columns silently degraded to plain List<Float> / List<Double> Arrow fields on write and lost their hudi_type descriptor on read — breaking parity with Parquet and defeating Lance's vector column encoding (tight packing, future vector search, etc.).

Lance-Spark's DDL-level TBLPROPERTIES ('<col>.arrow.fixed-size-list.size' = '128') knob ultimately just attaches that same arrow.fixed-size-list.size key to the column's Spark metadata. Since Hudi writes at the file level (bypassing Spark DDL), we attach the metadata directly from Hudi's existing VECTOR descriptor.

Implementation

Writer — HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors

  1. Reuses existing VectorConversionUtils.detectVectorColumnsFromMetadata to find fields tagged with hudi_type = VECTOR(...).
  2. For each such field, attaches LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY() ("arrow.fixed-size-list.size") as a Long with the dimension, preserving any pre-existing metadata (including hudi_type) via MetadataBuilder.withMetadata(...).
  3. Validates element type is FloatType or DoubleType; throws HoodieNotSupportedException otherwise.

Downstream: LanceArrowUtils.toArrowSchema(...) then emits FixedSizeList<elem, dim>, and LanceArrowWriter.createFieldWriter automatically selects its FixedSizeListWriter branch when it sees the matching Arrow vector — no other code changes required.
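The enrichment steps above can be modeled without Spark on the classpath; in this sketch a plain Map stands in for Spark's Metadata/MetadataBuilder, and everything except the `arrow.fixed-size-list.size` key name is illustrative rather than Hudi's actual code:

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of enrichSparkSchemaForLanceVectors: maps stand in for Spark Metadata. */
public class LanceVectorEnricher {
    static final String FIXED_SIZE_LIST_KEY = "arrow.fixed-size-list.size";

    /**
     * Attach the lance-spark fixed-size-list key for a field tagged as a VECTOR,
     * preserving all pre-existing metadata entries (including hudi_type).
     * Element type must be FLOAT or DOUBLE, mirroring the fail-fast validation.
     */
    static Map<String, Object> enrich(Map<String, Object> fieldMetadata,
                                      long dimension,
                                      String elementType) {
        if (!elementType.equals("FLOAT") && !elementType.equals("DOUBLE")) {
            throw new UnsupportedOperationException(
                "VECTOR element type not supported on Lance: " + elementType);
        }
        Map<String, Object> enriched = new HashMap<>(fieldMetadata); // keep hudi_type etc.
        enriched.put(FIXED_SIZE_LIST_KEY, dimension);                // Long, per lance-spark
        return enriched;
    }
}
```

The real writer does the equivalent with MetadataBuilder.withMetadata(...) on each StructField before handing the schema to LanceArrowUtils.toArrowSchema.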

Reader — VectorConversionUtils#restoreVectorMetadataFromArrowSchema

  1. Walks the Arrow schema in parallel with the Spark schema produced by LanceArrowUtils.fromArrowSchema.
  2. For each Arrow field that is a FixedSizeList with a Float32/Float64 child, derives HoodieSchema.createVector(dim, elementType).toTypeDescriptor() and re-attaches it as hudi_type on the corresponding Spark StructField.
  3. Wired into HoodieSparkLanceReader#getSchema and SparkLanceReaderBase#read.
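The reverse mapping the reader performs — from FixedSizeList<Float32|Float64, dim> back to a VECTOR descriptor — reduces to something like the following sketch. Plain parameters stand in for Arrow Field objects (the real code walks org.apache.arrow.vector.types.pojo.Schema), and the FLOAT-is-default descriptor shorthand is an assumption consistent with the VECTOR(dim[,DOUBLE]) notation in the commit messages:

```java
import java.util.Optional;

/** Sketch of deriving a hudi_type VECTOR descriptor from an Arrow FixedSizeList shape. */
public class VectorMetadataRestorer {
    /**
     * @param isFixedSizeList whether the Arrow field's type is ArrowType.FixedSizeList
     * @param listSize        the fixed list size (vector dimension)
     * @param childTypeName   Arrow child type name, e.g. "Float32" or "Float64"
     * @return the descriptor to re-attach as hudi_type, or empty if not a vector shape
     */
    static Optional<String> toVectorDescriptor(boolean isFixedSizeList,
                                               int listSize,
                                               String childTypeName) {
        if (!isFixedSizeList) {
            return Optional.empty();
        }
        switch (childTypeName) {
            case "Float32": return Optional.of("VECTOR(" + listSize + ")");
            case "Float64": return Optional.of("VECTOR(" + listSize + ",DOUBLE)");
            default:        return Optional.empty(); // non-float child: not a Hudi VECTOR
        }
    }
}
```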

FileFormat fix — HoodieFileGroupReaderBasedFileFormat#withVectorRewrite

Early-return (schema, Map.empty) when hoodieFileFormat != PARQUET. The ArrayType → BinaryType rewrite is a Parquet-specific workaround; applying it on the Lance path produced an unsupported Cast(ArrayType, BinaryType) once reader-side VECTOR metadata was present.

Test plan

Added to TestLanceDataSource (parameterized across COW + MOR):

  • testFloatVectorRoundTrip — 4-dim FLOAT VECTOR
  • testDoubleVectorRoundTrip — 4-dim DOUBLE VECTOR
  • testMultipleVectorColumns — two vector columns of different element types / dims on the same row

Each test:

  1. Opens the written .lance file directly via LanceFileReader and asserts field.getType() is ArrowType.FixedSizeList with the expected listSize (writer guard).
  2. Calls assertHudiTypeMetadata to assert hudi_type = VECTOR(...) is restored on the read schema (reader guard).
mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5,scala-2.12 \
    -Dtest=TestLanceDataSource -DfailIfNoTests=false surefire:test

Tests run: 24, Failures: 0, Errors: 0, Skipped: 0 (6 new + 18 existing).

Out of scope

  • INT8 VECTOR support on Lance (lance-spark's shouldBeFixedSizeList rejects non-Float/Double; would require upstream Lance work or a separate encoding).

🤖 Generated with Claude Code

Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`)
into the lance-spark metadata key `arrow.fixed-size-list.size` before calling
`LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow
FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead
of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`;
the encoding is driven by the Arrow schema itself.

- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter`
  reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR
  fields and attaches the Lance metadata key; non-vector fields pass through
  unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-
  Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
- Tests in `TestLanceDataSource` (COW + MOR):
    - `testFloatVectorRoundTrip`
    - `testDoubleVectorRoundTrip`
    - `testMultipleVectorColumns`
  Each opens the written `.lance` file via `LanceFileReader` and asserts the
  field is `ArrowType.FixedSizeList` with the expected `listSize` — the direct
  regression guard that fails pre-fix and passes post-fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 13, 2026
Companion to the Lance writer's native FixedSizeList encoding: on read,
rehydrate the Hudi `hudi_type = VECTOR(...)` Spark metadata that
`LanceArrowUtils.fromArrowSchema` drops, so the read schema matches the
Parquet path. Gate the Parquet-only ArrayType→BinaryType vector rewrite
in HoodieFileGroupReaderBasedFileFormat on format == PARQUET; Lance
returns vectors natively as ArrayType so the rewrite would trigger a
spurious cast and break the read.

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the
  Arrow schema and re-attaches VECTOR(dim[,DOUBLE]) for
  FixedSizeList<Float32|Float64, dim> fields.
- HoodieSparkLanceReader.getSchema and SparkLanceReaderBase.read now
  call it so downstream VECTOR-aware code sees the same schema as on
  Parquet.
- TestLanceDataSource: assert hudi_type metadata is restored on read
  for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 13, 2026
@rahil-c rahil-c changed the title [WIP] feat(lance): write Hudi VECTOR columns as native Lance fixed-size lists [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 13, 2026
@rahil-c rahil-c requested a review from wombatu-kun April 13, 2026 17:12
Mirrors the Parquet writer: emit the comma-separated
`colName:VECTOR(dim[,elemType])` descriptor list under the existing
`hoodie.vector.columns` key in the Lance file-footer key-value metadata.
Reader still derives VECTOR identity from the Arrow FixedSizeList type
today; this footer entry is insurance for future descriptor fields the
Arrow type cannot express (quantization tags, distance metrics, etc.)
and keeps Lance files symmetric with Parquet files.

- HoodieBaseLanceWriter: new protected `additionalSchemaMetadata()` hook
  invoked during close(), so subclasses can contribute footer KV
  entries alongside bloom-filter metadata.
- HoodieSparkLanceWriter: override `additionalSchemaMetadata()` to emit
  `hoodie.vector.columns` when the Spark schema has any VECTOR column.
- VectorConversionUtils: add `buildVectorColumnsMetadataValue(StructType)`
  matching the Parquet-path helper's output format.
- TestLanceDataSource: assert footer carries the expected descriptor
  list for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
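The hoodie.vector.columns footer value this commit describes is a joined list of per-column descriptors. A dependency-free sketch follows; the column names are illustrative, and the real helper is VectorConversionUtils.buildVectorColumnsMetadataValue operating on a Spark StructType rather than a Map:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch of the hoodie.vector.columns footer value: "col:VECTOR(dim[,elem]),...". */
public class VectorColumnsFooter {
    /** @param vectorColumns ordered map of column name -> VECTOR descriptor */
    static String buildMetadataValue(Map<String, String> vectorColumns) {
        return vectorColumns.entrySet().stream()
            .map(e -> e.getKey() + ":" + e.getValue())
            .collect(Collectors.joining(",")); // "" when there are no vector columns
    }
}
```

The empty-string result for vector-free schemas is what lets the writer-side hook skip emitting a footer entry entirely, as noted in the review comments below.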
@codecov-commenter

Codecov Report

❌ Patch coverage is 76.28866% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (35e2bbf) to head (099aadf).
⚠️ Report is 30 commits behind head on master.

Files with missing lines Patch % Lines
.../apache/hudi/io/storage/VectorConversionUtils.java 76.92% 6 Missing and 6 partials ⚠️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java 78.78% 4 Missing and 3 partials ⚠️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java 40.00% 1 Missing and 2 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18497      +/-   ##
============================================
+ Coverage     68.52%   68.81%   +0.29%     
- Complexity    27968    28161     +193     
============================================
  Files          2440     2459      +19     
  Lines        134456   135186     +730     
  Branches      16226    16399     +173     
============================================
+ Hits          92138    93031     +893     
+ Misses        35054    34761     -293     
- Partials       7264     7394     +130     
Flag Coverage Δ
common-and-other-modules 44.51% <1.03%> (+0.16%) ⬆️
hadoop-mr-java-client 44.78% <0.00%> (-0.21%) ⬇️
spark-client-hadoop-common 48.37% <0.00%> (-0.01%) ⬇️
spark-java-tests 48.88% <76.28%> (+0.11%) ⬆️
spark-scala-tests 45.46% <1.03%> (-0.19%) ⬇️
utilities 38.17% <1.03%> (-0.19%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...apache/hudi/io/storage/HoodieSparkLanceReader.java 74.02% <100.00%> (+2.97%) ⬆️
...ution/datasources/lance/SparkLanceReaderBase.scala 86.00% <100.00%> (ø)
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 85.46% <75.00%> (-0.32%) ⬇️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java 67.85% <40.00%> (-1.77%) ⬇️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java 89.15% <78.78%> (-7.00%) ⬇️
.../apache/hudi/io/storage/VectorConversionUtils.java 81.81% <76.92%> (-4.39%) ⬇️

... and 143 files with indirect coverage changes


@wombatu-kun
Contributor

Writing VECTOR columns as native Lance FixedSizeList is the right direction, and the writer/reader symmetry plus the forward-compat footer are nice design choices. I've been working on the same problem on a parallel branch (different strategy: plain List<Float> + BinaryType rewrite reversal), and while comparing the two approaches I spotted a handful of things worth addressing before this lands.

@wombatu-kun
Contributor

Lance artifact coordinates are out of date: com.lancedb.lance, 0.0.15 → 0.4.0

@wombatu-kun
Contributor

The hoodieFileFormat != PARQUET early-return in withVectorRewrite is clean. Please double-check that every call site that previously may have hit the rewrite now correctly skips it for Lance:

  • buildReaderWithPartitionValues (all three invocations: requiredSchema, outputSchema, requestedSchema).
  • Any other places in the file where detectVectorColumns / replaceVectorFieldsWithBinary are called independently of the helper.

@wombatu-kun
Contributor

The three added tests (testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testMultipleVectorColumns) are good writer/reader guards but cover only the trivial insert + full-table-read path. Recommended additions:

  • Nullable vector column (null row values, null whole struct).
  • Partitioned table.
  • MOR log merging (write → update → read).
  • Schema evolution: add VECTOR column to an existing Lance table and read old + new rows.
  • Clustering: verify clustered output Lance files also carry native FixedSizeList + footer.
  • Projection: read only the vector column; read vector column alongside metadata columns.
  • Time travel / incremental query.

@wombatu-kun
Contributor

Minor code-quality nits

  • HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors: the local DataType dt = field.dataType() is assigned before the ArrayType check, then unused once the cast is done. Can collapse to if (!(field.dataType() instanceof ArrayType)).
  • VectorConversionUtils#buildVectorColumnsMetadataValue duplicates the format produced by HoodieSchema#buildVectorColumnsMetadataValue for Avro schemas. Consider adding a Javadoc cross-reference and asserting the two produce identical strings for equivalent schemas (could even delegate).
  • VectorConversionUtils#restoreVectorMetadataFromArrowSchema walks only top-level fields. If a nested struct ever carries a VECTOR child (unlikely today but possible), it would be missed. Worth a Javadoc note: "Top-level VECTORs only; nested struct children are not recursed into."
  • HoodieBaseLanceWriter#close: the new additionalSchemaMetadata hook is called after bloom filter metadata — fine — but the nested if (writer != null) check is redundant (already inside the outer writer != null for bloom filter); can collapse.
  • buildVectorColumnsMetadataValue returns "" for schemas without vectors, and the override early-returns Collections.emptyMap() in that case — so no footer entry is emitted. Correct, just worth a comment explaining that the hook is called unconditionally but is a no-op when there are no vectors (otherwise a future reader maintaining this code might wonder why a non-VECTOR Lance file has no hoodie.vector.columns).
