feat(blob): add support for lance blob inline descriptor reading #18586

Merged
yihua merged 21 commits into apache:master from rahil-c:rahil/blob-lance-transformed on Apr 26, 2026
@rahil-c rahil-c commented Apr 24, 2026

Describe the issue this Pull Request addresses

Core problem: Hudi's BLOB model and Lance's blob encoding don't line up out of the box, and without a bridge Hudi BLOB tables can't use hoodie.base.file.format=lance.

Hudi BLOB is a per-row-tagged struct where each row independently picks INLINE (data in-place) or OUT_OF_LINE (reference to external file). Lance's blob encoding is schema-level: a column is either blob-encoded or it isn't.

Before this PR there was no translation layer, causing three failures on Lance-backed BLOB tables:

  1. Writer didn't activate Lance's blob encoding — INLINE bytes went through the default column path.
  2. Lance rejected valid Hudi BLOB rows — null data (OUT_OF_LINE) or null reference (INLINE) tripped Lance's strict child-nullability check.
  3. No read-side lazy mode — callers wanting a deferred reference instead of eager bytes had no option.
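The mismatch behind failure 2 can be sketched with a small model of the per-row tag. This is only an illustration: `BlobValueSketch`, its field names, and the simplification of `reference` to a plain string are hypothetical and not Hudi's actual classes.

```java
/** Hypothetical stand-in for one Hudi BLOB value; not the real Hudi class. */
final class BlobValueSketch {
    enum StorageType { INLINE, OUT_OF_LINE }

    final StorageType type;
    final byte[] data;       // populated only on INLINE rows
    final String reference;  // populated only on OUT_OF_LINE rows (simplified to a path)

    private BlobValueSketch(StorageType type, byte[] data, String reference) {
        this.type = type;
        this.data = data;
        this.reference = reference;
    }

    static BlobValueSketch inline(byte[] data) {
        if (data == null) {
            throw new IllegalArgumentException("INLINE rows must carry data");
        }
        return new BlobValueSketch(StorageType.INLINE, data.clone(), null);
    }

    static BlobValueSketch outOfLine(String reference) {
        if (reference == null) {
            throw new IllegalArgumentException("OUT_OF_LINE rows must carry a reference");
        }
        return new BlobValueSketch(StorageType.OUT_OF_LINE, null, reference);
    }

    /** A row is valid when exactly the child matching its tag is populated. */
    boolean isValid() {
        return type == StorageType.INLINE
            ? data != null && reference == null
            : data == null && reference != null;
    }

    /** True when at least one child is null, which holds for every valid row. */
    boolean hasNullChild() {
        return data == null || reference == null;
    }
}
```

Because `hasNullChild()` is true for every valid row in this model, a file schema that declares both children non-nullable cannot accept valid BLOB rows of either type.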

Summary and Changelog

This PR adds the Hudi-BLOB ↔ Lance translation layer and a read-mode config so BLOB works end-to-end on Lance.

Config

hoodie.read.blob.inline.mode (advanced, default CONTENT):

| Value | Behavior | Use case |
| --- | --- | --- |
| CONTENT | Lance materializes INLINE bytes in data | Simple, eager reads; also used internally by read_blob() and compaction/merge |
| DESCRIPTOR | Iterator preserves type=INLINE but sets data=null and populates reference with the Lance blob-stream position/size for user-level lazy resolution | Large tables where most rows are never materialized; users resolve via read_blob(), which reads with CONTENT mode |

The internal compaction/merge reader (HoodieSparkLanceReader) and read_blob() stay pinned to CONTENT regardless of this config.
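A minimal sketch of how this resolution could look, assuming a plain options map. The class and method names are hypothetical, and the case-insensitive parsing is an assumption rather than documented behavior. Only the config key, the two values, the CONTENT default, and the `DESCRIPTOR && blobFieldNames.nonEmpty` gate come from this PR.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BlobReadModeSketch {
    enum BlobReadMode { CONTENT, DESCRIPTOR }

    static final String INLINE_MODE_KEY = "hoodie.read.blob.inline.mode";

    /** User-facing resolution: an unset or blank value falls back to CONTENT. */
    static BlobReadMode resolve(Map<String, String> options) {
        String raw = options.get(INLINE_MODE_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return BlobReadMode.CONTENT;
        }
        // Assumption: values are matched case-insensitively.
        return BlobReadMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Compaction/merge and read_blob() stay on CONTENT whatever the user set. */
    static BlobReadMode resolveForInternalReader(Map<String, String> options) {
        return BlobReadMode.CONTENT;
    }

    /** The per-row rewrite is only needed in DESCRIPTOR mode on tables with BLOB columns. */
    static boolean needsDescriptorTransform(BlobReadMode mode, List<String> blobFieldNames) {
        return mode == BlobReadMode.DESCRIPTOR && !blobFieldNames.isEmpty();
    }
}
```

Keeping the internal readers on a separate resolution path means a user-level DESCRIPTOR setting cannot make compaction or merge see null data.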

Write path

  • HoodieSparkLanceWriter enriches the Arrow schema: each BLOB data child becomes LargeBinary + lance-encoding:blob=true, and nullability is widened inside the BLOB subtree so Lance accepts null children on valid rows.
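The schema rewrite described here can be illustrated without the Arrow dependency. In the sketch below, `FieldSketch` is a hypothetical, simplified stand-in for an Arrow field, and type names are plain strings. It assumes the BLOB struct's own nullability is left alone and only its descendants are widened, which is one reading of "inside the BLOB subtree". Only the `LargeBinary` type and the `lance-encoding:blob=true` metadata entry are taken from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for an Arrow field; not the Arrow API. */
final class FieldSketch {
    final String name;
    final String type;       // e.g. "Struct", "Binary", "LargeBinary", "Utf8"
    final boolean nullable;
    final Map<String, String> metadata;
    final List<FieldSketch> children;

    FieldSketch(String name, String type, boolean nullable,
                Map<String, String> metadata, List<FieldSketch> children) {
        this.name = name;
        this.type = type;
        this.nullable = nullable;
        this.metadata = Collections.unmodifiableMap(new HashMap<String, String>(metadata));
        this.children = Collections.unmodifiableList(new ArrayList<FieldSketch>(children));
    }

    /** Rewrites one BLOB struct: descendants become nullable, "data" becomes a tagged LargeBinary. */
    static FieldSketch enrichBlob(FieldSketch blob) {
        List<FieldSketch> kids = new ArrayList<FieldSketch>();
        for (FieldSketch child : blob.children) {
            kids.add("data".equals(child.name) ? asBlobData(child) : widen(child));
        }
        // Assumption: the struct's own nullability is unchanged.
        return new FieldSketch(blob.name, blob.type, blob.nullable, blob.metadata, kids);
    }

    private static FieldSketch asBlobData(FieldSketch data) {
        Map<String, String> meta = new HashMap<String, String>(data.metadata);
        meta.put("lance-encoding:blob", "true");
        return new FieldSketch(data.name, "LargeBinary", true, meta, data.children);
    }

    /** Recursively marks a field and all of its descendants nullable. */
    private static FieldSketch widen(FieldSketch f) {
        List<FieldSketch> kids = new ArrayList<FieldSketch>();
        for (FieldSketch c : f.children) {
            kids.add(widen(c));
        }
        return new FieldSketch(f.name, f.type, true, f.metadata, kids);
    }
}
```

The rewrite returns new immutable fields instead of mutating the input, so the schema Spark sees stays untouched while the writer uses the enriched copy.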

Read path

  • SparkLanceReaderBase resolves the config, widens nullability inside BLOB subtrees, and composes a BlobDescriptorTransform into LanceRecordIterator when in DESCRIPTOR mode.
  • LanceRecordIterator is a single final iterator for all Lance reads. An optional BlobDescriptorTransform (composition pattern) handles per-row BLOB rewriting in DESCRIPTOR mode.
  • BlobDescriptorTransform reads type/nulls from the InternalRow (always accurate per-row); the Lance-specific BlobStructAccessor for reading {position, size} on INLINE rows is obtained fresh per row from the column vectors to avoid stale references across batches. The original type is preserved (INLINE stays INLINE) so users see the storage mode they wrote.
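The per-row rewrite in DESCRIPTOR mode can be sketched as a pure function. This is an illustration under assumptions, not the code in `BlobDescriptorTransform`: `BlobRow` and its flattened reference fields (`refPath`, `refPosition`, `refSize`) are hypothetical, and the position and size are passed in rather than read from Arrow vectors.

```java
final class DescriptorRewriteSketch {
    /** Hypothetical flattened view of one BLOB value. */
    static final class BlobRow {
        final String type;       // "INLINE" or "OUT_OF_LINE"
        final byte[] data;
        final String refPath;
        final long refPosition;
        final long refSize;

        BlobRow(String type, byte[] data, String refPath, long refPosition, long refSize) {
            this.type = type;
            this.data = data;
            this.refPath = refPath;
            this.refPosition = refPosition;
            this.refSize = refSize;
        }
    }

    /**
     * INLINE rows keep their type but trade bytes for a {path, position, size} pointer.
     * OUT_OF_LINE rows pass through untouched; anything else is rejected.
     */
    static BlobRow toDescriptor(BlobRow row, String lanceFilePath, long position, long size) {
        if ("OUT_OF_LINE".equals(row.type)) {
            return row;
        }
        if (!"INLINE".equals(row.type)) {
            throw new IllegalStateException("Unknown BLOB type: " + row.type);
        }
        return new BlobRow("INLINE", null, lanceFilePath, position, size);
    }
}
```

Preserving `INLINE` on the rewritten row matches the behavior described above: readers still see the storage mode that was written, and only the payload is deferred.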

Per-file changelog

  • BlobDescriptorTransform.java (new) — Transform composed into LanceRecordIterator. Static-final UTF8 constants, Set<Integer> for blob column indices. Type/null checks use InternalRow; BlobStructAccessor obtained fresh per INLINE row from column vectors. Explicit INLINE/OUT_OF_LINE type check (throws on unknown), defensive null-data handling.
  • LanceRecordIterator.java — Single final iterator for all Lance reads. Accepts optional BlobDescriptorTransform via constructor (composition, not inheritance). Uses rowIterator for batch iteration.
  • LanceResourceCloser.java (renamed from LanceCloseables.java) — Name clarified. Attaches suppressed exception when both Arrow and Lance close fail.
  • SparkLanceReaderBase.scala — Creates BlobDescriptorTransform when DESCRIPTOR + BLOB columns present. Idiomatic Scala collect for blob field names.
  • HoodieSparkLanceWriter.java — Routes Arrow schema through blob annotation at writer-open time.
  • HoodieSparkLanceReader.java — Pinned to CONTENT mode for compaction/merge.
  • HoodieReaderConfig.java — Adds hoodie.read.blob.inline.mode.
  • TestLanceDataSource.scala — Parameterized tests (COW + MOR):
    • testBlobInlineRoundTrip — CONTENT mode byte round-trip.
    • testBlobOutOfLine — Parameterized on read mode (default/CONTENT/DESCRIPTOR); OUT_OF_LINE references survive unchanged; read_blob() resolves correct bytes via CONTENT mode.
    • testBlobInlineDescriptorMode — DESCRIPTOR on INLINE rows: type stays INLINE, reference points at .lance file, read_blob() reads original bytes via CONTENT mode.

Impact

  • New functionality. Hudi BLOB columns work with hoodie.base.file.format=lance end-to-end.
  • One new config. hoodie.read.blob.inline.mode — advanced, default CONTENT. No action required for existing users.
  • Forward-only on-disk change for BLOB-on-Lance files. Files carry lance-encoding:blob=true on the BLOB data child. No impact on non-BLOB Lance files or Parquet tables.
  • Performance: Non-BLOB reads unchanged. BLOB CONTENT mode adds a schema nullability-widening copy at open time (amortized). DESCRIPTOR mode does ~2 small Object[] allocations per BLOB column per row; BlobStructAccessor reads are direct Arrow buffer accesses.

Risk Level

Medium. Integration between two relatively young subsystems (Hudi BLOB + Lance). Mitigated by:

  • End-to-end coverage for both storage modes and all read modes (default, CONTENT, DESCRIPTOR).
  • All tests parameterized across COW and MOR.
  • DESCRIPTOR path gated on BlobReadMode.DESCRIPTOR && blobFieldNames.nonEmpty; non-BLOB tables traverse the pre-PR code path exactly.
  • No changes to Hudi BLOB semantics, Parquet path, or read_blob() SQL resolution.

Documentation Update

None beyond the in-config documentation on hoodie.read.blob.inline.mode.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@rahil-c rahil-c requested a review from yihua April 24, 2026 17:24
@rahil-c rahil-c changed the title from "Rahil/blob lance transformed" to "add support for lance blob inline descriptor reading" Apr 24, 2026
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 24, 2026
rahil-c and others added 8 commits April 25, 2026 16:00
Covers steps 2, 3, and 4 of VC's 1.2 BLOB release plan:

  2. Wire INLINE writes  - testEndToEndInline drives SparkRDDWriteClient
     insert+upsert with INLINE Avro records (data=bytes, reference=null)
     against a Parquet-backed Hudi table.
  3. read_blob() reads inline values  - testInlineBlobRoundTrip runs
     SELECT read_blob(col) over an in-memory INLINE DataFrame and
     verifies each payload round-trips byte-for-byte.
  4. Mixed datasets  - testMixedInlineAndOutOfLine builds 10 rows
     alternating INLINE and OUT_OF_LINE, pointing the range rows at
     one shared file and asserts the returned sequence matches input
     order (stronger than TestBatchedBlobReader.testMixedBlobTypes,
     which orders by record_id before asserting).

testInlineOnHudiBackedTable mirrors the cherry-picked
testReadBlobOnHudiBackedTable (OUT_OF_LINE) but writes INLINE rows
via spark.write.format("hudi") + bulk_insert, reads back through
spark.read.format("hudi"), and resolves via read_blob() - exercises
the full write -> HoodieFileIndex-backed read -> SQL path that the
cherry-picked BatchedBlobReadExec serialization fix unblocks.

No production code changes. BatchedBlobReader already dispatches
INLINE rows into its inline branch (collectBatch field-0 check)
and preserves row order via sortBy(index) in processNextBatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yihua yihua force-pushed the rahil/blob-lance-transformed branch from fb60d32 to c2be9ff April 25, 2026 23:04
@yihua yihua changed the title from "add support for lance blob inline descriptor reading" to "feat(blob): add support for lance blob inline descriptor reading" Apr 25, 2026
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Apr 25, 2026
@yihua yihua marked this pull request as ready for review April 25, 2026 23:13
@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR wires the DESCRIPTOR blob read mode through the Lance reader path with a dedicated rewrite iterator, and shares the close logic across the existing and new iterators. A couple of edge cases around the close path and INLINE descriptor handling are worth double-checking; please see the inline comments. After that, this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming and style suggestions are below.

@yihua yihua force-pushed the rahil/blob-lance-transformed branch from 591ac80 to b5b06f1 April 26, 2026 04:15
@yihua yihua left a comment

LGTM

// Build ColumnVector[] in Spark-schema order by looking each field up by name;
// lance-spark 0.4.0's VectorSchemaRoot may return the file's on-disk order, which
// would misalign the UnsafeProjection. Cached on the first batch and reused thereafter.
if (columnVectors == null) {
Contributor

Something to revisit separately: columnVectors is assigned only once and then reused for all batches, and I see that BlobDescriptorTransform uses it across different batches. Have we tested multiple batches from the Arrow reader to confirm that this works?

Contributor

It looks like the test cases are loose here. So it would be good to add stronger tests in a follow-up.

Collaborator Author

A follow-up sounds like a good idea.

+ "CONTENT returns the raw inline bytes (default). "
+ "A future DESCRIPTOR mode will return a {position, size} pointer instead of materializing "
+ "the bytes, so callers can defer the byte read.");
+ "CONTENT (default) returns the raw inline bytes. "
Collaborator Author

@yihua just to confirm: do we want the default to always read the raw bytes for both Lance and Parquet?

@yihua yihua Apr 26, 2026

I think ideally DESCRIPTOR mode should be the default. I kept it as is in this PR.

Exception lanceException = null;

if (currentBatch != null) {
currentBatch.close();
@rahil-c rahil-c Apr 26, 2026

This is not a regression from you; it is more likely a general miss in this code path. I was reviewing this again and I'm wondering whether this currentBatch close needs to be in a try/catch as well. Claude pointed this out to me as a potential issue, and I think it is accurate for this case.

2.1 Resource leak when currentBatch.close() or allocator.close() throws (LanceResourceCloser.java:48-67)

  if (currentBatch != null) {
    currentBatch.close();          // not in try/catch
  }
  if (arrowReader != null) { try { ... } catch (...) }
  if (lanceReader != null) { try { ... } catch (...) }
  if (allocator != null) {
    allocator.close();             // not in try/catch
  }

  If currentBatch.close() throws (a ColumnarBatch close cascades to its child column vectors and can throw), the arrowReader, lanceReader, and allocator are never closed → buffer / file-handle leak. The same applies if allocator.close() throws (BufferAllocator.close() throws IllegalStateException when buffers are still outstanding): the throw escapes without rethrowing the prior captured Arrow/Lance exceptions.

  The whole point of consolidating this helper was to make sure "a failed reader close never leaks the allocator" (per the class Javadoc). The current code only protects against arrowReader/lanceReader failures, not the other two.

  Fix: wrap all four closes in try/catch and aggregate via addSuppressed.
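The suggested fix can be sketched generically, assuming each of the four resources can be treated as an `AutoCloseable`. `CloseAllSketch` and `closeAll` are hypothetical names, and this is not the code in `LanceResourceCloser`.

```java
final class CloseAllSketch {
    /**
     * Closes every non-null resource in the given order. A failure never stops the
     * remaining closes: the first exception is rethrown at the end and later ones
     * are attached to it via addSuppressed.
     */
    static void closeAll(AutoCloseable... resources) throws Exception {
        Exception first = null;
        for (AutoCloseable resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

A call such as `closeAll(currentBatch, arrowReader, lanceReader, allocator)` would keep the allocator last, so it is still closed even when an earlier close throws.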

Contributor

I checked: the default implementation of currentBatch.close() should not throw, since it only does in-memory processing, so I didn't add the try/catch. It's still good to be defensive here and add it. Let's do that in a separate fix, as this PR is not changing this part.

Collaborator Author

We can follow up on this in a separate PR so it doesn't block CI.

@@ -888,14 +888,103 @@ class TestLanceDataSource extends HoodieSparkClientTestBase {
}
}

Collaborator Author

I am thinking of a follow-up, which I or someone else can pick up: a test that mixes INLINE and OUT_OF_LINE BLOBs within one table.

Contributor

Makes sense

rahil-c commented Apr 26, 2026

@yihua thanks for the help on this, for the most part this LGTM

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 65.74074% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.90%. Comparing base (8530644) to head (32b1762).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rg/apache/hudi/io/storage/LanceResourceCloser.java | 33.33% | 10 Missing and 6 partials ⚠️ |
| ...pache/hudi/io/storage/BlobDescriptorTransform.java | 76.92% | 3 Missing and 6 partials ⚠️ |
| ...rg/apache/hudi/io/storage/LanceRecordIterator.java | 78.78% | 4 Missing and 3 partials ⚠️ |
| ...ution/datasources/lance/SparkLanceReaderBase.scala | 54.54% | 2 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18586   +/-   ##
=========================================
  Coverage     68.89%   68.90%           
- Complexity    28576    28581    +5     
=========================================
  Files          2480     2482    +2     
  Lines        136995   137053   +58     
  Branches      16694    16713   +19     
=========================================
+ Hits          94389    94436   +47     
- Misses        35007    35009    +2     
- Partials       7599     7608    +9     
| Flag | Coverage Δ |
| --- | --- |
| common-and-other-modules | 44.38% <0.92%> (-0.02%) ⬇️ |
| hadoop-mr-java-client | 44.82% <100.00%> (+0.05%) ⬆️ |
| spark-client-hadoop-common | 48.46% <1.03%> (-0.04%) ⬇️ |
| spark-java-tests | 49.50% <65.74%> (+0.02%) ⬆️ |
| spark-scala-tests | 45.22% <0.92%> (-0.03%) ⬇️ |
| utilities | 37.93% <0.92%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| .../apache/hudi/common/config/HoodieReaderConfig.java | 100.00% <100.00%> (ø) |
| ...ution/datasources/lance/SparkLanceReaderBase.scala | 75.58% <54.54%> (-3.00%) ⬇️ |
| ...rg/apache/hudi/io/storage/LanceRecordIterator.java | 77.27% <78.78%> (+16.99%) ⬆️ |
| ...pache/hudi/io/storage/BlobDescriptorTransform.java | 76.92% <76.92%> (ø) |
| ...rg/apache/hudi/io/storage/LanceResourceCloser.java | 33.33% <33.33%> (ø) |

... and 9 files with indirect coverage changes

@yihua yihua merged commit 787953f into apache:master Apr 26, 2026
57 of 58 checks passed
dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request May 21, 2026
…che#18586)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Labels

size:L PR with lines of changes in (300, 1000]

5 participants