feat(blob): add support for lance blob inline descriptor reading by rahil-c · Pull Request #18586 · apache/hudi

rahil-c · 2026-04-24T17:24:14Z

Describe the issue this Pull Request addresses

Core problem: Hudi's BLOB model and Lance's blob encoding don't line up out of the box, and without a bridge Hudi BLOB tables can't use hoodie.base.file.format=lance.

Hudi BLOB is a per-row-tagged struct where each row independently picks INLINE (data in-place) or OUT_OF_LINE (reference to external file). Lance's blob encoding is schema-level: a column is either blob-encoded or it isn't.

Before this PR there was no translation layer, causing three failures on Lance-backed BLOB tables:

- Writer didn't activate Lance's blob encoding: INLINE bytes went through the default column path.
- Lance rejected valid Hudi BLOB rows: null data (OUT_OF_LINE) or null reference (INLINE) tripped Lance's strict child-nullability check.
- No read-side lazy mode: callers wanting a deferred reference instead of eager bytes had no option.

Summary and Changelog

This PR adds the Hudi-BLOB ↔ Lance translation layer and a read-mode config so BLOB works end-to-end on Lance.

Config

hoodie.read.blob.inline.mode (advanced, default CONTENT):

| Value | Behavior | Use case |
| --- | --- | --- |
| CONTENT | Lance materializes INLINE bytes in `data` | Simple, eager reads; also used internally by `read_blob()` and compaction/merge |
| DESCRIPTOR | Iterator preserves `type=INLINE` but sets `data=null` and populates `reference` with the Lance blob-stream position/size for user-level lazy resolution | Large tables where most rows are never materialized; users resolve via `read_blob()` which reads with CONTENT mode |

The internal compaction/merge reader (HoodieSparkLanceReader) and read_blob() stay pinned to CONTENT regardless of this config.
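The resolution rules above can be sketched in plain Java. This is an illustrative model only: `BlobReadModeSketch`, `resolve`, `resolveForInternalReader`, and `needsDescriptorTransform` are hypothetical names, and the case-insensitive parsing is an assumption rather than confirmed Hudi behavior. Only the config key, the two mode names, the CONTENT default, and the CONTENT pinning come from the description.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of read-mode resolution; not the actual Hudi classes. */
final class BlobReadModeSketch {
  static final String KEY = "hoodie.read.blob.inline.mode";

  enum BlobReadMode { CONTENT, DESCRIPTOR }

  /** User-facing readers: a missing or blank value falls back to CONTENT. */
  static BlobReadMode resolve(Map<String, String> options) {
    String raw = options.get(KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return BlobReadMode.CONTENT;
    }
    // Assumption: values are matched case-insensitively; unknown values throw.
    return BlobReadMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  /** Compaction/merge and read_blob() stay on CONTENT whatever the option says. */
  static BlobReadMode resolveForInternalReader(Map<String, String> options) {
    return BlobReadMode.CONTENT;
  }

  /** The per-row rewrite is only needed in DESCRIPTOR mode on tables with BLOB columns. */
  static boolean needsDescriptorTransform(BlobReadMode mode, List<String> blobFieldNames) {
    return mode == BlobReadMode.DESCRIPTOR && !blobFieldNames.isEmpty();
  }
}
```

Keeping the internal path as a separate method makes the pinning explicit at the call site instead of relying on callers to strip the option.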

Write path

HoodieSparkLanceWriter enriches the Arrow schema: each BLOB data child becomes LargeBinary + lance-encoding:blob=true, and nullability is widened inside the BLOB subtree so Lance accepts null children on valid rows.
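A minimal sketch of that schema rewrite, using a simplified `Field` stand-in rather than the Arrow Java API. The class and method names (`BlobSchemaSketch`, `enrichBlob`, `widen`) are hypothetical, and leaving the BLOB struct's own nullability untouched is an assumption; the `data` child name, the `LargeBinary` type, and the `lance-encoding:blob=true` metadata are taken from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the writer-side schema rewrite; Field is not the Arrow API. */
final class BlobSchemaSketch {
  static final String BLOB_ENCODING_KEY = "lance-encoding:blob";

  /** Immutable stand-in for a schema field. */
  static final class Field {
    final String name;
    final String type;
    final boolean nullable;
    final Map<String, String> metadata;
    final List<Field> children;

    Field(String name, String type, boolean nullable,
          Map<String, String> metadata, List<Field> children) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
      this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
      this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }
  }

  /** Rewrites one BLOB struct: tags the data child, widens every other descendant. */
  static Field enrichBlob(Field blob) {
    List<Field> children = new ArrayList<>();
    for (Field child : blob.children) {
      if ("data".equals(child.name)) {
        Map<String, String> metadata = new HashMap<>(child.metadata);
        metadata.put(BLOB_ENCODING_KEY, "true");
        children.add(new Field(child.name, "LargeBinary", true, metadata, child.children));
      } else {
        children.add(widen(child));
      }
    }
    // Assumption: the struct's own nullability is left as declared.
    return new Field(blob.name, blob.type, blob.nullable, blob.metadata, children);
  }

  /** Marks a field and all of its descendants nullable. */
  static Field widen(Field field) {
    List<Field> children = new ArrayList<>();
    for (Field child : field.children) {
      children.add(widen(child));
    }
    return new Field(field.name, field.type, true, field.metadata, children);
  }
}
```

The rewrite returns new fields instead of mutating the input, so the caller's original schema stays valid for non-Lance paths.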

Read path

SparkLanceReaderBase resolves the config, widens nullability inside BLOB subtrees, and composes a BlobDescriptorTransform into LanceRecordIterator when in DESCRIPTOR mode.
LanceRecordIterator is a single final iterator for all Lance reads. An optional BlobDescriptorTransform (composition pattern) handles per-row BLOB rewriting in DESCRIPTOR mode.
BlobDescriptorTransform reads type/nulls from the InternalRow (always accurate per-row); the Lance-specific BlobStructAccessor for reading {position, size} on INLINE rows is obtained fresh per row from the column vectors to avoid stale references across batches. The original type is preserved (INLINE stays INLINE) so users see the storage mode they wrote.
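The per-row rewrite described above might look roughly like the following. It is a sketch under assumptions, not the code in this PR: `BlobValue`, `Reference`, and `DescriptorLookup` are hypothetical stand-ins for the InternalRow fields and the Lance-specific accessor, and the reference layout (path, position, size) is inferred from the description.

```java
/** Hypothetical model of the DESCRIPTOR-mode rewrite for a single BLOB value. */
final class DescriptorTransformSketch {
  static final String INLINE = "INLINE";
  static final String OUT_OF_LINE = "OUT_OF_LINE";

  /** Where the bytes live: a file plus a byte range. */
  static final class Reference {
    final String path;
    final long position;
    final long size;

    Reference(String path, long position, long size) {
      this.path = path;
      this.position = position;
      this.size = size;
    }
  }

  /** One BLOB value as seen by the reader. */
  static final class BlobValue {
    final String type;
    final byte[] data;
    final Reference reference;

    BlobValue(String type, byte[] data, Reference reference) {
      this.type = type;
      this.data = data;
      this.reference = reference;
    }
  }

  /** Stand-in for the per-row {position, size} lookup, resolved against the current batch. */
  interface DescriptorLookup {
    long position(int rowId);
    long size(int rowId);
  }

  static BlobValue toDescriptor(BlobValue value, int rowId, String lanceFilePath,
                                DescriptorLookup lookup) {
    if (value == null) {
      return null; // a null BLOB stays null
    }
    if (OUT_OF_LINE.equals(value.type)) {
      return value; // external references pass through unchanged
    }
    if (INLINE.equals(value.type)) {
      // Keep type=INLINE, drop the bytes, and point at the blob's range in the Lance file.
      Reference ref = new Reference(lanceFilePath, lookup.position(rowId), lookup.size(rowId));
      return new BlobValue(INLINE, null, ref);
    }
    throw new IllegalStateException("Unknown BLOB type: " + value.type);
  }
}
```

Passing the lookup in per call, rather than caching it in the transform, mirrors the "obtained fresh per row" choice: nothing in the transform holds a reference that can outlive its batch.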

Per-file changelog

BlobDescriptorTransform.java (new) — Transform composed into LanceRecordIterator. Static-final UTF8 constants, Set<Integer> for blob column indices. Type/null checks use InternalRow; BlobStructAccessor obtained fresh per INLINE row from column vectors. Explicit INLINE/OUT_OF_LINE type check (throws on unknown), defensive null-data handling.
LanceRecordIterator.java — Single final iterator for all Lance reads. Accepts optional BlobDescriptorTransform via constructor (composition, not inheritance). Uses rowIterator for batch iteration.
LanceResourceCloser.java (renamed from LanceCloseables.java) — Name clarified. Attaches suppressed exception when both Arrow and Lance close fail.
SparkLanceReaderBase.scala — Creates BlobDescriptorTransform when DESCRIPTOR + BLOB columns present. Idiomatic Scala collect for blob field names.
HoodieSparkLanceWriter.java — Routes Arrow schema through blob annotation at writer-open time.
HoodieSparkLanceReader.java — Pinned to CONTENT mode for compaction/merge.
HoodieReaderConfig.java — Adds hoodie.read.blob.inline.mode.
TestLanceDataSource.scala — Parameterized tests (COW + MOR):
- testBlobInlineRoundTrip — CONTENT mode byte round-trip.
- testBlobOutOfLine — Parameterized on read mode (default/CONTENT/DESCRIPTOR); OUT_OF_LINE references survive unchanged; read_blob() resolves correct bytes via CONTENT mode.
- testBlobInlineDescriptorMode — DESCRIPTOR on INLINE rows: type stays INLINE, reference points at .lance file, read_blob() reads original bytes via CONTENT mode.

Impact

New functionality. Hudi BLOB columns work with hoodie.base.file.format=lance end-to-end.
One new config. hoodie.read.blob.inline.mode — advanced, default CONTENT. No action required for existing users.
Forward-only on-disk change for BLOB-on-Lance files. Files carry lance-encoding:blob=true on the BLOB data child. No impact on non-BLOB Lance files or Parquet tables.
Performance: Non-BLOB reads unchanged. BLOB CONTENT mode adds a schema nullability-widening copy at open time (amortized). DESCRIPTOR mode does ~2 small Object[] allocations per BLOB column per row; BlobStructAccessor reads are direct Arrow buffer accesses.

Risk Level

Medium. Integration between two relatively young subsystems (Hudi BLOB + Lance). Mitigated by:

End-to-end coverage for both storage modes and all read modes (default, CONTENT, DESCRIPTOR).
All tests parameterized across COW and MOR.
DESCRIPTOR path gated on BlobReadMode.DESCRIPTOR && blobFieldNames.nonEmpty; non-BLOB tables traverse the pre-PR code path exactly.
No changes to Hudi BLOB semantics, Parquet path, or read_blob() SQL resolution.

Documentation Update

None beyond the in-config documentation on hoodie.read.blob.inline.mode.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

Covers steps 2, 3, and 4 of VC's 1.2 BLOB release plan:

2. Wire INLINE writes - testEndToEndInline drives SparkRDDWriteClient insert+upsert with INLINE Avro records (data=bytes, reference=null) against a Parquet-backed Hudi table.
3. read_blob() reads inline values - testInlineBlobRoundTrip runs SELECT read_blob(col) over an in-memory INLINE DataFrame and verifies each payload round-trips byte-for-byte.
4. Mixed datasets - testMixedInlineAndOutOfLine builds 10 rows alternating INLINE and OUT_OF_LINE, pointing the range rows at one shared file and asserts the returned sequence matches input order (stronger than TestBatchedBlobReader.testMixedBlobTypes, which orders by record_id before asserting).

testInlineOnHudiBackedTable mirrors the cherry-picked testReadBlobOnHudiBackedTable (OUT_OF_LINE) but writes INLINE rows via spark.write.format("hudi") + bulk_insert, reads back through spark.read.format("hudi"), and resolves via read_blob() - exercises the full write -> HoodieFileIndex-backed read -> SQL path that the cherry-picked BatchedBlobReadExec serialization fix unblocks.

No production code changes. BatchedBlobReader already dispatches INLINE rows into its inline branch (collectBatch field-0 check) and preserves row order via sortBy(index) in processNextBatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR wires the DESCRIPTOR blob read mode through the Lance reader path with a dedicated rewrite iterator, and shares the close logic across the existing and new iterators. A couple of edge cases around the close path and INLINE descriptor handling worth double-checking in the inline comments. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming and style suggestions below.

yihua

LGTM

yihua · 2026-04-26T04:40:55Z

-        // Build ColumnVector[] in Spark-schema order by looking each field up by name;
-        // lance-spark 0.4.0's VectorSchemaRoot may return the file's on-disk order, which
-        // would misalign the UnsafeProjection. Cached on the first batch and reused thereafter.
-        if (columnVectors == null) {


Something to revisit separately: columnVectors is assigned only once and reused for all batches, and I see that BlobDescriptorTransform uses it across different batches. Have we tested multiple batches from the Arrow reader to confirm that this works?

It looks like the test cases are loose here. So it would be good to add stronger tests in a follow-up.

Follow up sounds like a good idea
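For the follow-up, one way to frame the concern is to keep the by-name ordering but resolve the columns against each batch instead of caching them from the first one. The sketch below models that in plain Java; `BatchColumnOrderSketch` and `inSchemaOrder` are hypothetical names, and a `Map` keyed by field name stands in for a batch's vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model: order a batch's columns by the requested schema, once per batch. */
final class BatchColumnOrderSketch {
  /**
   * Returns the batch's columns in schema order, looked up by field name.
   * The batch map may iterate in on-disk order; only the names are trusted.
   */
  static <V> List<V> inSchemaOrder(List<String> schemaFieldNames, Map<String, V> batchColumns) {
    List<V> ordered = new ArrayList<>(schemaFieldNames.size());
    for (String name : schemaFieldNames) {
      V column = batchColumns.get(name);
      if (column == null) {
        throw new IllegalArgumentException("Missing column: " + name);
      }
      ordered.add(column);
    }
    return ordered;
  }
}
```

A multi-batch test along these lines would call the lookup for two batches with different contents and assert that the second result reflects the second batch, which is the case a cached array could get wrong.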

rahil-c · 2026-04-26T04:44:38Z

-          + "CONTENT returns the raw inline bytes (default). "
-          + "A future DESCRIPTOR mode will return a {position, size} pointer instead of materializing "
-          + "the bytes, so callers can defer the byte read.");
+          + "CONTENT (default) returns the raw inline bytes. "


@yihua just to confirm: do we want the default to always read the raw bytes for both Lance and Parquet?

I think ideally DESCRIPTOR mode should be the default. I kept it as is in this PR.

rahil-c · 2026-04-26T05:02:39Z

+    Exception lanceException = null;
+
+    if (currentBatch != null) {
+      currentBatch.close();


This is not a regression from you, more a general miss in this code path. Reviewing this again, I'm wondering whether this currentBatch close needs to be in a try/catch as well. Claude pointed this out to me as a potential issue, and it seems accurate for this case.

2.1 Resource leak when currentBatch.close() or allocator.close() throws (LanceResourceCloser.java:48-67)

    if (currentBatch != null) {
      currentBatch.close();   // not in try/catch
    }
    if (arrowReader != null) { try { ... } catch (...) }
    if (lanceReader != null) { try { ... } catch (...) }
    if (allocator != null) {
      allocator.close();      // not in try/catch
    }

If currentBatch.close() throws (a ColumnarBatch close cascades to its child column vectors and can throw), the arrowReader, lanceReader, and allocator are never closed → buffer / file-handle leak. The same applies if allocator.close() throws (BufferAllocator.close() throws IllegalStateException when buffers are still outstanding): the throw escapes without rethrowing the prior captured Arrow/Lance exceptions.

The whole point of consolidating this helper was to make sure "a failed reader close never leaks the allocator" (per the class Javadoc). The current code only protects against arrowReader/lanceReader failures, not the other two.

Fix: wrap all four closes in try/catch and aggregate via addSuppressed.

I checked: the default implementation of currentBatch.close() should not throw an exception, as it is in-memory processing, so I didn't add the try/catch. It's good to be defensive here and add the additional try/catch. Let's do a separate fix, as this PR does not change this part.

We can follow up on this in a separate PR so it doesn't block CI.
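For that follow-up, the "wrap all four closes and aggregate via addSuppressed" idea could be sketched as below. `CloseAllSketch` and `closeAll` are hypothetical names, and this is a generic pattern over `AutoCloseable`, not the actual LanceResourceCloser code; the real helper may need a different exception type or close order.

```java
/** Hypothetical sketch of a close helper that never skips a resource. */
final class CloseAllSketch {
  /**
   * Closes every non-null resource in the given order. A failure never
   * prevents later closes; the first failure is rethrown at the end with
   * any later failures attached as suppressed exceptions.
   */
  static void closeAll(AutoCloseable... resources) throws Exception {
    Exception first = null;
    for (AutoCloseable resource : resources) {
      if (resource == null) {
        continue;
      }
      try {
        resource.close();
      } catch (Exception e) {
        if (first == null) {
          first = e;
        } else if (e != first) {
          first.addSuppressed(e);
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }
}
```

Under this shape a call such as `closeAll(currentBatch, arrowReader, lanceReader, allocator)` would still reach the allocator when an earlier close throws, assuming each of those types is `AutoCloseable`.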

rahil-c · 2026-04-26T05:04:39Z

@@ -888,14 +888,103 @@ class TestLanceDataSource extends HoodieSparkClientTestBase {
    }
  }



I am thinking of a follow-up, which I or someone else can pick up: a test for mixed INLINE and OUT_OF_LINE BLOBs within one table?

Makes sense

rahil-c · 2026-04-26T05:07:55Z

@yihua thanks for the help on this, for the most part this LGTM

hudi-bot · 2026-04-26T05:37:44Z

CI report:

ef14bde UNKNOWN
9d3361b UNKNOWN
39fef73 UNKNOWN
9204609 UNKNOWN
b5b06f1 Azure: SUCCESS
591ac80 UNKNOWN
7aae317 UNKNOWN
0e56c6e UNKNOWN
32b1762 Azure: PENDING

codecov-commenter · 2026-04-26T06:02:38Z

Codecov Report

❌ Patch coverage is 65.74074% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.90%. Comparing base (8530644) to head (32b1762).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rg/apache/hudi/io/storage/LanceResourceCloser.java | 33.33% | 10 Missing and 6 partials ⚠️ |
| ...pache/hudi/io/storage/BlobDescriptorTransform.java | 76.92% | 3 Missing and 6 partials ⚠️ |
| ...rg/apache/hudi/io/storage/LanceRecordIterator.java | 78.78% | 4 Missing and 3 partials ⚠️ |
| ...ution/datasources/lance/SparkLanceReaderBase.scala | 54.54% | 2 Missing and 3 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18586   +/-   ##
=========================================
  Coverage     68.89%   68.90%           
- Complexity    28576    28581    +5     
=========================================
  Files          2480     2482    +2     
  Lines        136995   137053   +58     
  Branches      16694    16713   +19     
=========================================
+ Hits          94389    94436   +47     
- Misses        35007    35009    +2     
- Partials       7599     7608    +9

| Flag | Coverage Δ | |
| --- | --- | --- |
| common-and-other-modules | `44.38% <0.92%> (-0.02%)` | ⬇️ |
| hadoop-mr-java-client | `44.82% <100.00%> (+0.05%)` | ⬆️ |
| spark-client-hadoop-common | `48.46% <1.03%> (-0.04%)` | ⬇️ |
| spark-java-tests | `49.50% <65.74%> (+0.02%)` | ⬆️ |
| spark-scala-tests | `45.22% <0.92%> (-0.03%)` | ⬇️ |
| utilities | `37.93% <0.92%> (-0.02%)` | ⬇️ |

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| .../apache/hudi/common/config/HoodieReaderConfig.java | `100.00% <100.00%> (ø)` | |
| ...ution/datasources/lance/SparkLanceReaderBase.scala | `75.58% <54.54%> (-3.00%)` | ⬇️ |
| ...rg/apache/hudi/io/storage/LanceRecordIterator.java | `77.27% <78.78%> (+16.99%)` | ⬆️ |
| ...pache/hudi/io/storage/BlobDescriptorTransform.java | `76.92% <76.92%> (ø)` | |
| ...rg/apache/hudi/io/storage/LanceResourceCloser.java | `33.33% <33.33%> (ø)` | |

... and 9 files with indirect coverage changes


rahil-c requested a review from yihua April 24, 2026 17:24

rahil-c changed the title ~~Rahil/blob lance transformed~~ add support for lance blob inline descriptor reading Apr 24, 2026

github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 24, 2026

rahil-c and others added 8 commits April 25, 2026 16:00

fix compilation

c30d679

intial critical/medium pass

2f97d27

comment test cleanup

3c52970

Support simplified path for lance blob inline reading

dd92c0f

simplify further

896b9a6

add support for lance blob inline descriptor reading

9207b4f

further simplify

c2be9ff

yihua force-pushed the rahil/blob-lance-transformed branch from fb60d32 to c2be9ff Compare April 25, 2026 23:04

yihua changed the title ~~add support for lance blob inline descriptor reading~~ feat(blob): add support for lance blob inline descriptor reading Apr 25, 2026

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Apr 25, 2026

Fix rebase

ef14bde

yihua marked this pull request as ready for review April 25, 2026 23:13

Fix build

5bf1614

hudi-agent reviewed Apr 25, 2026

hudi-agent mentioned this pull request Apr 25, 2026

[OSS PR #18586] feat(blob): add support for lance blob inline descriptor reading hudi-agent/hudi#22

Open

yihua reviewed Apr 25, 2026

yihua reviewed Apr 26, 2026

yihua added 2 commits April 25, 2026 18:05

Address comments

7a9e219

Add LanceResourceCloser

c38a3a4

yihua reviewed Apr 26, 2026

Comment thread ...spark-client/src/main/java/org/apache/hudi/io/storage/BlobDescriptorLanceRecordIterator.java Outdated

Address more comments for maintenability

9d3361b

yihua reviewed Apr 26, 2026

Comment thread hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java Outdated

Fix build

801c92b

yihua added 3 commits April 25, 2026 20:46

Optimize schema parsing

39fef73

Improve code

9204609

Revise type checks

b5b06f1

yihua force-pushed the rahil/blob-lance-transformed branch from 591ac80 to b5b06f1 Compare April 26, 2026 04:15

yihua added 2 commits April 25, 2026 21:24

Fix tests

0a61ff2

revert BatchedBlobReader

7aae317

yihua approved these changes Apr 26, 2026

yihua added 2 commits April 25, 2026 21:31

Fix position and length fetching

0e56c6e

Restore logic

32b1762

yihua reviewed Apr 26, 2026

rahil-c commented Apr 26, 2026

yihua merged commit 787953f into apache:master Apr 26, 2026
57 of 58 checks passed

dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request May 21, 2026

feat(blob): add support for lance blob inline descriptor reading (apa…

5290eff

…che#18586) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
