
[WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists #18497

Draft

rahil-c wants to merge 3 commits into apache:master from rahil-c:rahil/lance-vector-write

Conversation

@rahil-c
Collaborator

@rahil-c commented Apr 13, 2026

Summary

  • Writer: Translate Hudi's VECTOR logical-type metadata (hudi_type = "VECTOR(dim[,elem])") into lance-spark's arrow.fixed-size-list.size metadata key before calling LanceArrowUtils.toArrowSchema, so the Lance writer emits a native Arrow FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead of a plain variable-length list.
  • Reader: Re-attach the hudi_type = VECTOR(...) Spark metadata on read so the Lance path surfaces the same schema as Parquet. LanceArrowUtils.fromArrowSchema drops field metadata; a new VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the Arrow schema and rebuilds the descriptor from FixedSizeList<Float32|Float64, dim>.
  • Bug fix: Gate HoodieFileGroupReaderBasedFileFormat's ArrayType → BinaryType vector rewrite on hoodieFileFormat == PARQUET. Parquet needs the rewrite because VECTOR is stored as FIXED_LEN_BYTE_ARRAY; Lance returns vectors natively as ArrayType so the rewrite triggered a spurious cast (scala.MatchError: ArrayType(DoubleType,false) at Cast.castToBinaryCode) once reader-side metadata was restored.
  • Fails fast with HoodieNotSupportedException for non-ArrayType or non-Float/Double element types (matches lance-spark's VectorUtils.shouldBeFixedSizeList).
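The `hudi_type = "VECTOR(dim[,elem])"` descriptor grammar referenced throughout this summary can be sketched as a small parser. This is a dependency-free illustration, not Hudi's actual implementation (the real descriptor handling lives in HoodieSchema / VectorConversionUtils), and the FLOAT default for an omitted element type is an assumption inferred from the `VECTOR(dim[,DOUBLE])` shorthand used in the commit messages below:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the "VECTOR(dim[,elem])" hudi_type descriptor. */
public class VectorDescriptor {
    // dim is a positive integer; element type is optional and restricted to FLOAT/DOUBLE
    private static final Pattern VECTOR =
        Pattern.compile("VECTOR\\((\\d+)(?:\\s*,\\s*(FLOAT|DOUBLE))?\\)");

    final int dimension;
    final String elementType; // assumed to default to FLOAT when omitted

    VectorDescriptor(int dimension, String elementType) {
        this.dimension = dimension;
        this.elementType = elementType;
    }

    static VectorDescriptor parse(String hudiType) {
        Matcher m = VECTOR.matcher(hudiType.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a VECTOR descriptor: " + hudiType);
        }
        String elem = (m.group(2) == null) ? "FLOAT" : m.group(2);
        return new VectorDescriptor(Integer.parseInt(m.group(1)), elem);
    }
}
```

With this, `parse("VECTOR(128,DOUBLE)")` yields dimension 128 with DOUBLE elements, and `parse("VECTOR(4)")` falls back to FLOAT.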

Why

Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the hoodie.vector.columns footer metadata + FIXED_LEN_BYTE_ARRAY storage. On the newly added Lance base-file path, VECTOR columns silently degraded to plain List<Float> / List<Double> Arrow fields on write and lost their hudi_type descriptor on read — breaking parity with Parquet and defeating Lance's vector column encoding (tight packing, future vector search, etc.).

Lance-Spark's DDL-level TBLPROPERTIES ('<col>.arrow.fixed-size-list.size' = '128') knob ultimately just attaches that same arrow.fixed-size-list.size key to the column's Spark metadata. Since Hudi writes at the file level (bypassing Spark DDL), we attach the metadata directly from Hudi's existing VECTOR descriptor.

Implementation

Writer — HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors

  1. Reuses existing VectorConversionUtils.detectVectorColumnsFromMetadata to find fields tagged with hudi_type = VECTOR(...).
  2. For each such field, attaches LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY() ("arrow.fixed-size-list.size") as a Long with the dimension, preserving any pre-existing metadata (including hudi_type) via MetadataBuilder.withMetadata(...).
  3. Validates element type is FloatType or DoubleType; throws HoodieNotSupportedException otherwise.

Downstream: LanceArrowUtils.toArrowSchema(...) then emits FixedSizeList<elem, dim>, and LanceArrowWriter.createFieldWriter automatically selects its FixedSizeListWriter branch when it sees the matching Arrow vector — no other code changes required.
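The enrichment steps above can be modeled without Spark on the classpath; in this sketch a plain Map stands in for Spark's Metadata/MetadataBuilder, and everything except the `arrow.fixed-size-list.size` key name is illustrative rather than Hudi's actual code:

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of enrichSparkSchemaForLanceVectors: maps stand in for Spark Metadata. */
public class LanceVectorEnricher {
    static final String FIXED_SIZE_LIST_KEY = "arrow.fixed-size-list.size";

    /**
     * Attach the lance-spark fixed-size-list key for a field tagged as a VECTOR,
     * preserving all pre-existing metadata entries (including hudi_type).
     * Element type must be FLOAT or DOUBLE, mirroring the fail-fast validation.
     */
    static Map<String, Object> enrich(Map<String, Object> fieldMetadata,
                                      long dimension,
                                      String elementType) {
        if (!elementType.equals("FLOAT") && !elementType.equals("DOUBLE")) {
            throw new UnsupportedOperationException(
                "VECTOR element type not supported on Lance: " + elementType);
        }
        Map<String, Object> enriched = new HashMap<>(fieldMetadata); // keep hudi_type etc.
        enriched.put(FIXED_SIZE_LIST_KEY, dimension);                // Long, per lance-spark
        return enriched;
    }
}
```

The real writer does the equivalent with MetadataBuilder.withMetadata(...) on each StructField before handing the schema to LanceArrowUtils.toArrowSchema.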

Reader — VectorConversionUtils#restoreVectorMetadataFromArrowSchema

  1. Walks the Arrow schema in parallel with the Spark schema produced by LanceArrowUtils.fromArrowSchema.
  2. For each Arrow field that is a FixedSizeList with a Float32/Float64 child, derives HoodieSchema.createVector(dim, elementType).toTypeDescriptor() and re-attaches it as hudi_type on the corresponding Spark StructField.
  3. Wired into HoodieSparkLanceReader#getSchema and SparkLanceReaderBase#read.
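The reverse mapping the reader performs — from FixedSizeList<Float32|Float64, dim> back to a VECTOR descriptor — reduces to something like the following sketch. Plain parameters stand in for Arrow Field objects (the real code walks org.apache.arrow.vector.types.pojo.Schema), and the FLOAT-is-default descriptor shorthand is an assumption consistent with the VECTOR(dim[,DOUBLE]) notation in the commit messages:

```java
import java.util.Optional;

/** Sketch of deriving a hudi_type VECTOR descriptor from an Arrow FixedSizeList shape. */
public class VectorMetadataRestorer {
    /**
     * @param isFixedSizeList whether the Arrow field's type is ArrowType.FixedSizeList
     * @param listSize        the fixed list size (vector dimension)
     * @param childTypeName   Arrow child type name, e.g. "Float32" or "Float64"
     * @return the descriptor to re-attach as hudi_type, or empty if not a vector shape
     */
    static Optional<String> toVectorDescriptor(boolean isFixedSizeList,
                                               int listSize,
                                               String childTypeName) {
        if (!isFixedSizeList) {
            return Optional.empty();
        }
        switch (childTypeName) {
            case "Float32": return Optional.of("VECTOR(" + listSize + ")");
            case "Float64": return Optional.of("VECTOR(" + listSize + ",DOUBLE)");
            default:        return Optional.empty(); // non-float child: not a Hudi VECTOR
        }
    }
}
```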

FileFormat fix — HoodieFileGroupReaderBasedFileFormat#withVectorRewrite

Early-return (schema, Map.empty) when hoodieFileFormat != PARQUET. The ArrayType → BinaryType rewrite is a Parquet-specific workaround; applying it on the Lance path produced an unsupported Cast(ArrayType, BinaryType) once reader-side VECTOR metadata was present.

Test plan

Added to TestLanceDataSource (parameterized across COW + MOR):

  • testFloatVectorRoundTrip — 4-dim FLOAT VECTOR
  • testDoubleVectorRoundTrip — 4-dim DOUBLE VECTOR
  • testMultipleVectorColumns — two vector columns of different element types / dims on the same row

Each test:

  1. Opens the written .lance file directly via LanceFileReader and asserts field.getType() is ArrowType.FixedSizeList with the expected listSize (writer guard).
  2. Calls assertHudiTypeMetadata to assert hudi_type = VECTOR(...) is restored on the read schema (reader guard).
mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5,scala-2.12 \
    -Dtest=TestLanceDataSource -DfailIfNoTests=false surefire:test

Tests run: 24, Failures: 0, Errors: 0, Skipped: 0 (6 new + 18 existing).

Out of scope

  • INT8 VECTOR support on Lance (lance-spark's shouldBeFixedSizeList rejects non-Float/Double; would require upstream Lance work or a separate encoding).

🤖 Generated with Claude Code

Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`)
into the lance-spark metadata key `arrow.fixed-size-list.size` before calling
`LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow
FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead
of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`;
the encoding is driven by the Arrow schema itself.

- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter`
  reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR
  fields and attaches the Lance metadata key; non-vector fields pass through
  unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-
  Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
- Tests in `TestLanceDataSource` (COW + MOR):
    - `testFloatVectorRoundTrip`
    - `testDoubleVectorRoundTrip`
    - `testMultipleVectorColumns`
  Each opens the written `.lance` file via `LanceFileReader` and asserts the
  field is `ArrowType.FixedSizeList` with the expected `listSize` — the direct
  regression guard that fails pre-fix and passes post-fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 13, 2026
Companion to the Lance writer's native FixedSizeList encoding: on read,
rehydrate the Hudi `hudi_type = VECTOR(...)` Spark metadata that
`LanceArrowUtils.fromArrowSchema` drops, so the read schema matches the
Parquet path. Gate the Parquet-only ArrayType→BinaryType vector rewrite
in HoodieFileGroupReaderBasedFileFormat on format == PARQUET; Lance
returns vectors natively as ArrayType so the rewrite would trigger a
spurious cast and break the read.

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the
  Arrow schema and re-attaches VECTOR(dim[,DOUBLE]) for
  FixedSizeList<Float32|Float64, dim> fields.
- HoodieSparkLanceReader.getSchema and SparkLanceReaderBase.read now
  call it so downstream VECTOR-aware code sees the same schema as on
  Parquet.
- TestLanceDataSource: assert hudi_type metadata is restored on read
  for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 13, 2026
@rahil-c rahil-c changed the title [WIP] feat(lance): write Hudi VECTOR columns as native Lance fixed-size lists [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 13, 2026
@rahil-c rahil-c requested a review from wombatu-kun April 13, 2026 17:12
Mirrors the Parquet writer: emit the comma-separated
`colName:VECTOR(dim[,elemType])` descriptor list under the existing
`hoodie.vector.columns` key in the Lance file-footer key-value metadata.
Reader still derives VECTOR identity from the Arrow FixedSizeList type
today; this footer entry is insurance for future descriptor fields the
Arrow type cannot express (quantization tags, distance metrics, etc.)
and keeps Lance files symmetric with Parquet files.

- HoodieBaseLanceWriter: new protected `additionalSchemaMetadata()` hook
  invoked during close(), so subclasses can contribute footer KV
  entries alongside bloom-filter metadata.
- HoodieSparkLanceWriter: override `additionalSchemaMetadata()` to emit
  `hoodie.vector.columns` when the Spark schema has any VECTOR column.
- VectorConversionUtils: add `buildVectorColumnsMetadataValue(StructType)`
  matching the Parquet-path helper's output format.
- TestLanceDataSource: assert footer carries the expected descriptor
  list for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
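The hoodie.vector.columns footer value this commit describes is a joined list of per-column descriptors. A dependency-free sketch follows; the column names are illustrative, and the real helper is VectorConversionUtils.buildVectorColumnsMetadataValue operating on a Spark StructType rather than a Map:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch of the hoodie.vector.columns footer value: "col:VECTOR(dim[,elem]),...". */
public class VectorColumnsFooter {
    /** @param vectorColumns ordered map of column name -> VECTOR descriptor */
    static String buildMetadataValue(Map<String, String> vectorColumns) {
        return vectorColumns.entrySet().stream()
            .map(e -> e.getKey() + ":" + e.getValue())
            .collect(Collectors.joining(",")); // "" when there are no vector columns
    }
}
```

The empty-string result for vector-free schemas is what lets the writer-side hook skip emitting a footer entry entirely, as noted in the review comments below.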
@codecov-commenter

Codecov Report

❌ Patch coverage is 76.28866% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (35e2bbf) to head (099aadf).
⚠️ Report is 30 commits behind head on master.

Files with missing lines Patch % Lines
.../apache/hudi/io/storage/VectorConversionUtils.java 76.92% 6 Missing and 6 partials ⚠️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java 78.78% 4 Missing and 3 partials ⚠️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java 40.00% 1 Missing and 2 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18497      +/-   ##
============================================
+ Coverage     68.52%   68.81%   +0.29%     
- Complexity    27968    28161     +193     
============================================
  Files          2440     2459      +19     
  Lines        134456   135186     +730     
  Branches      16226    16399     +173     
============================================
+ Hits          92138    93031     +893     
+ Misses        35054    34761     -293     
- Partials       7264     7394     +130     
Flag Coverage Δ
common-and-other-modules 44.51% <1.03%> (+0.16%) ⬆️
hadoop-mr-java-client 44.78% <0.00%> (-0.21%) ⬇️
spark-client-hadoop-common 48.37% <0.00%> (-0.01%) ⬇️
spark-java-tests 48.88% <76.28%> (+0.11%) ⬆️
spark-scala-tests 45.46% <1.03%> (-0.19%) ⬇️
utilities 38.17% <1.03%> (-0.19%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...apache/hudi/io/storage/HoodieSparkLanceReader.java 74.02% <100.00%> (+2.97%) ⬆️
...ution/datasources/lance/SparkLanceReaderBase.scala 86.00% <100.00%> (ø)
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 85.46% <75.00%> (-0.32%) ⬇️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java 67.85% <40.00%> (-1.77%) ⬇️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java 89.15% <78.78%> (-7.00%) ⬇️
.../apache/hudi/io/storage/VectorConversionUtils.java 81.81% <76.92%> (-4.39%) ⬇️

... and 143 files with indirect coverage changes


@wombatu-kun
Contributor

Writing VECTOR columns as native Lance FixedSizeList is the right direction, and the writer/reader symmetry plus the forward-compat footer are nice design choices. I've been working on the same problem on a parallel branch (different strategy: plain List<Float> + BinaryType rewrite reversal), and while comparing the two approaches I spotted a handful of things worth addressing before this lands.

@wombatu-kun
Contributor

Lance artifact coordinates are out of date: com.lancedb.lance, 0.0.15 → 0.4.0

@wombatu-kun
Contributor

The hoodieFileFormat != PARQUET early-return in withVectorRewrite is clean. Please double-check that every call site that previously may have hit the rewrite now correctly skips it for Lance:

  • buildReaderWithPartitionValues (all three invocations: requiredSchema, outputSchema, requestedSchema).
  • Any other places in the file where detectVectorColumns / replaceVectorFieldsWithBinary are called independently of the helper.

@wombatu-kun
Contributor

The three added tests (testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testMultipleVectorColumns) are good writer/reader guards but cover only the trivial insert + full-table-read path. Recommended additions:

  • Nullable vector column (null row values, null whole struct).
  • Partitioned table.
  • MOR log merging (write → update → read).
  • Schema evolution: add VECTOR column to an existing Lance table and read old + new rows.
  • Clustering: verify clustered output Lance files also carry native FixedSizeList + footer.
  • Projection: read only the vector column; read vector column alongside metadata columns.
  • Time travel / incremental query.

@wombatu-kun
Contributor

Minor code-quality nits

  • HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors: the local DataType dt = field.dataType() is assigned before the ArrayType check, then unused once the cast is done. Can collapse to if (!(field.dataType() instanceof ArrayType)).
  • VectorConversionUtils#buildVectorColumnsMetadataValue duplicates the format produced by HoodieSchema#buildVectorColumnsMetadataValue for Avro schemas. Consider adding a Javadoc cross-reference and asserting the two produce identical strings for equivalent schemas (could even delegate).
  • VectorConversionUtils#restoreVectorMetadataFromArrowSchema walks only top-level fields. If a nested struct ever carries a VECTOR child (unlikely today but possible), it would be missed. Worth a Javadoc note: "Top-level VECTORs only; nested struct children are not recursed into."
  • HoodieBaseLanceWriter#close: the new additionalSchemaMetadata hook is called after bloom filter metadata — fine — but the nested if (writer != null) check is redundant (already inside the outer writer != null for bloom filter); can collapse.
  • buildVectorColumnsMetadataValue returns "" for schemas without vectors, and the override early-returns Collections.emptyMap() in that case — so no footer entry is emitted. Correct, just worth a comment explaining that the hook is called unconditionally but is a no-op when there are no vectors (otherwise a future reader maintaining this code might wonder why a non-VECTOR Lance file has no hoodie.vector.columns).
