feat(blob): default blob.inline.mode to DESCRIPTOR for Lance #18744

Merged
yihua merged 13 commits into apache:master from voonhous:address-#18742 on May 20, 2026

@voonhous voonhous commented May 15, 2026

Describe the issue this Pull Request addresses

Closes: #18742

Summary and Changelog

Flip default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR, and make DESCRIPTOR a true metadata-only mode for INLINE rows.

  • Plain reads on Lance tables no longer materialize inline blob payloads per row.
  • read_blob(col) on INLINE rows under DESCRIPTOR now throws with an actionable message naming both INLINE and DESCRIPTOR, and the config to flip.
  • CONTENT remains available as opt-in for byte-heavy workloads.

Changes:

  • HoodieReaderConfig: default flipped to DESCRIPTOR; doc updated.
  • BatchedBlobReader: INLINE branch throws IllegalStateException when inline_data=NULL (the Lance + DESCRIPTOR shape). CONTENT 1-hop passthrough preserved.
  • ScalarFunctions: read_blob DESCRIBE FUNCTION gains a one-line caveat about the INLINE+DESCRIPTOR throw.
  • TestLanceDataSource:
    • testBlobInlineRoundTrip opts into CONTENT explicitly.
    • testBlobInlineDescriptorMode asserts descriptor-shape on plain reads and that read_blob() throws with INLINE + DESCRIPTOR in the message chain.
    • testBlobInlineCompactionRoundTrip (MOR): inserts INLINE blobs, upserts a subset to force compaction, verifies touched rows carry new bytes and untouched rows retain originals. Asserts presence of log files and a compaction commit so a CoW-like fallthrough cannot silently pass.
    • testBlobMixedInlineAndOutOfLineContentMode: one column holding both INLINE and OUT_OF_LINE rows; CONTENT-mode read_blob() resolves each shape via its own path.
    • testBlobInlineMultipleColumnsPlainSelect: two INLINE blob columns with distinct byte patterns. CONTENT keeps per-column bytes independent; DESCRIPTOR synthesis fires per column.
    • testBlobInlineMultipleColumnsReadBlobAll: SELECT read_blob(a), read_blob(b) under CONTENT, distinct bytes catch cross-column aliasing.
    • testBlobInlineMultipleColumnsMixedSelect: SELECT read_blob(a), b (mixed materialize + struct projection). CONTENT returns bytes for a and content-shape struct for b. DESCRIPTOR still throws on read_blob(a) even when b is only projected as a struct.
    • Shared writeMultiBlobInlineTable helper for the multi-column tests.
  • rfc/rfc-100/rfc-100.md: behavior matrix, prose, and mermaid diagrams refreshed.

Compaction-side CONTENT pin in HoodieSparkLanceReader is unchanged and exercised end-to-end by the MOR compaction test.

Impact

User-facing behavior change for Lance INLINE blobs:

  • SELECT * now returns data=NULL plus a synthesized reference by default (was data=bytes, reference=NULL).
  • read_blob(col) on INLINE rows throws under the default. Callers either:
    • set hoodie.read.blob.inline.mode=CONTENT and use read_blob(col) (1-hop), or
    • read col.data directly under CONTENT.

Error message names both INLINE and DESCRIPTOR and the config to flip, so failures are actionable.

Parquet unaffected. The Parquet reader does not honor hoodie.read.blob.inline.mode; INLINE rows always come back in CONTENT shape there, so the new throw is unreachable.

OUT_OF_LINE unaffected under any mode. The setting governs INLINE reads only; OUT_OF_LINE descriptors are real user metadata and read_blob() always materializes via external pread.

Why error rather than the 2-hop fallback: the synthesized reference for INLINE+DESCRIPTOR is a pointer into .lance storage layout, not user-facing metadata. Silent 2-hop materialization let users project col.reference.offset next to read_blob(col) and treat the Lance-internal offset as durable. The error aligns the mode's behavior with its name: DESCRIPTOR is metadata-only.
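
For illustration, the per-row behavior described in this section can be modeled in a few lines of plain Java. This is a minimal sketch under stated assumptions, not the actual BatchedBlobReader code: `BlobDispatchSketch`, `BlobRow`, and the `externalBytes` field are hypothetical names, the external range read for OUT_OF_LINE rows is replaced by an in-memory byte array, and the exception text paraphrases the PR's message rather than copying it.

```java
// Hypothetical, simplified model of the per-row read_blob() dispatch.
// None of these types are Hudi classes; all names are illustrative.
final class BlobDispatchSketch {

    enum StorageType { INLINE, OUT_OF_LINE }

    /** Minimal stand-in for one blob struct as it reaches read_blob(). */
    static final class BlobRow {
        final StorageType storageType;
        final byte[] inlineData;     // null for INLINE rows read under Lance + DESCRIPTOR
        final byte[] externalBytes;  // stands in for bytes fetched by an external range read

        BlobRow(StorageType storageType, byte[] inlineData, byte[] externalBytes) {
            this.storageType = storageType;
            this.inlineData = inlineData;
            this.externalBytes = externalBytes;
        }
    }

    static byte[] readBlob(BlobRow row, int rowIndex) {
        if (row.storageType == StorageType.OUT_OF_LINE) {
            // OUT_OF_LINE descriptors are user metadata: always materialized,
            // whatever hoodie.read.blob.inline.mode is set to.
            return row.externalBytes;
        }
        if (row.inlineData == null) {
            // INLINE + DESCRIPTOR shape: metadata only, so fail loudly instead
            // of silently reading through the synthesized reference.
            throw new IllegalStateException(
                "read_blob() cannot materialize bytes for an INLINE blob under DESCRIPTOR mode (row "
                    + rowIndex + "). Set hoodie.read.blob.inline.mode=CONTENT to read bytes.");
        }
        // INLINE + CONTENT shape: 1-hop passthrough of the bytes already in the row.
        return row.inlineData;
    }

    public static void main(String[] args) {
        BlobRow contentRow = new BlobRow(StorageType.INLINE, new byte[] {1, 2, 3}, null);
        System.out.println("CONTENT-shaped INLINE row: " + readBlob(contentRow, 0).length + " bytes");

        BlobRow descriptorRow = new BlobRow(StorageType.INLINE, null, null);
        try {
            readBlob(descriptorRow, 1);
        } catch (IllegalStateException e) {
            System.out.println("DESCRIPTOR-shaped INLINE row: " + e.getMessage());
        }
    }
}
```

Because the check is per row, a column that mixes INLINE and OUT_OF_LINE rows resolves each row through its own branch, which is the case the mixed-storage test pins.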

Risk Level

Medium. Default flip plus the new throw is a behavior change for any Lance caller using read_blob() on INLINE blobs. Mitigations:

  • Restore prior behavior with hoodie.read.blob.inline.mode=CONTENT. Also the right setting for byte-heavy workloads (image processing, ML feature extraction, full-table exports).
  • Error message is actionable and names the config to flip.
  • Compaction-side reader is hard-pinned to CONTENT; MOR compaction test exercises the round-trip end-to-end.
  • Mixed-storage test pins INLINE + OUT_OF_LINE in a single column.
  • Multi-column tests pin per-column independence and rule out cross-column aliasing.

Documentation Update

  • HoodieReaderConfig.BLOB_INLINE_READ_MODE description updated.
  • ScalarFunctions.read_blob DESCRIBE FUNCTION text gains an INLINE+DESCRIPTOR caveat block.
  • RFC-100 behavior matrix, prose, and mermaid diagrams refreshed.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous added this to the release-1.2.0 milestone May 15, 2026
@voonhous voonhous requested review from rahil-c and yihua May 15, 2026 09:13

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC update! This flips the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR so Lance INLINE blobs are no longer materialized on every plain read, with a BatchedBlobReader fallback so read_blob() still works under either shape. A few design-level points could use more attention — see inline comments. Once those are addressed, a Hudi committer or PMC member can take it from here for a deeper review.

Comment thread rfc/rfc-100/rfc-100.md
- For DESCRIPTOR + Lance, hop 2 is a raw filesystem `pread` against the `.lance` file at the descriptor's `(offset, length)`. It bypasses Lance's decoder entirely — the blob encoding (`lance-encoding:blob=true` on a `LargeBinary` column) stores blob bytes contiguously at the position Lance reports, so direct byte access is safe.
- Plain `SELECT col` (no `read_blob`) is always 1 hop. DESCRIPTOR's win is that hop 1 skips blob decoding when bytes aren't needed.

### 3. Writer

🤖 The RFC notes the DESCRIPTOR + Lance hop-2 read "bypasses Lance's decoder entirely" via raw pread at the recorded offset/length. Is there a stability contract from Lance that the (offset, length) reported on a blob-encoded LargeBinary column is guaranteed to be the on-disk byte range across Lance versions? If Lance ever switches blob storage to a chunked/compressed layout, raw pread would silently return garbage. Worth either citing the Lance contract here or noting the version pin / detection strategy.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Comment thread rfc/rfc-100/rfc-100.md
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that already exposes a `(file

**Visual**

🤖 It would help to add a short "Alternatives considered" note for the default flip — e.g., gating DESCRIPTOR per-table/per-query vs flipping the global default, or making DESCRIPTOR the default only when the engine signals projection without read_blob(). The current text presents the flip as the only option; documenting why a global flip is preferable would strengthen the RFC.

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 15, 2026
voonhous added 3 commits May 15, 2026 19:34
Flip the default of hoodie.read.blob.inline.mode from CONTENT to
DESCRIPTOR so plain reads (e.g. SELECT *) on Lance tables stop paying
the I/O cost of materializing inline blob payloads. CONTENT remains
available as an opt-in for callers that need the raw bytes in `data`.

The compactor's CONTENT pin is already in place in HoodieSparkLanceReader,
so compaction continues to read bytes regardless of the user-facing
default and round-trips them through the base-file rewrite.

Also fix BatchedBlobReader: under DESCRIPTOR mode, INLINE rows arrive
with null `inline_data` and a synthesized reference. The previous
dispatch short-circuited on storage_type=INLINE and read the null
field, breaking read_blob(). The INLINE branch now falls back to the
descriptor's positional range read when inline_data is absent, so
read_blob() works in both modes.

Tests:
- testBlobInlineRoundTrip now opts into CONTENT explicitly (it exists
  specifically to validate CONTENT semantics).
- New testBlobInlineCompactionRoundTrip verifies INLINE bytes survive
  MOR compaction read/rewrite under the new default, asserted via the
  realistic user paths (plain read + read_blob()).

Replace the ASCII visual with two mermaid diagrams: a storage-rooted
dispatch tree showing what the user gets back for SELECT col vs
SELECT read_blob(col) across modes and file formats, and a hop-count
view of read_blob() byte resolution that makes the CONTENT vs
DESCRIPTOR + Lance trade-off explicit.

…o-op

- Swap CONTENT/DESCRIPTOR default labels in the mermaid diagrams and
  surrounding prose so the doc matches HoodieReaderConfig's new
  default.
- Make the matrix's Parquet + DESCRIPTOR row reflect current behavior
  (config is ignored on Parquet today; rows come back in CONTENT
  shape) with a footnote, and split the eventual spec'd behavior into
  its own paragraph marked not-yet-implemented. Mermaid Parquet
  branches updated to match.

@hudi-agent hudi-agent left a comment

Thanks for the RFC update! The default flip plus the BatchedBlobReader fallback and new compaction round-trip test together make a coherent story, and the mermaid diagrams are a nice readability win. A few architectural questions remain around lifecycle of the synthetic descriptor and an undocumented invariant in the compaction reader path that I think are worth surfacing in the RFC itself before a Hudi committer or PMC member takes it from here.

Comment thread rfc/rfc-100/rfc-100.md
F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = synthetic managed<br/>path, offset, length, is_managed=true"]
F -->|"Parquet (today: mode no-op)"| PD["Parquet reader does not<br/>implement DESCRIPTOR yet —<br/>returns CONTENT shape:<br/>inline_data = bytes, reference = NULL"]

QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}

🤖 The Lance branch shows reference = synthetic managed path, offset, length, is_managed=true. Could you spell out what is_managed=true means semantically and how it differs from a user-supplied OUT_OF_LINE reference downstream — e.g., does it gate cleaning, does read_blob() route differently, can a writer round-trip a row carrying this flag back into the table? This flag appears to be the only signal distinguishing "safe to clean — Hudi owns the bytes" from "do not clean — user owns the bytes", which is load-bearing.

The INLINE-descriptor fallback in BatchedBlobReader (the branch that
preads bytes via the synthesized reference when inline_data is null) is
load-bearing for mixed-usage queries. If read_blob() force-flipped the
scan back to CONTENT shape, reference.* would all be null in the same
projection. Fold a single co-projection SELECT into the existing
testBlobInlineDescriptorMode fixture to lock in this contract without
adding a new test method.

@hudi-agent hudi-agent left a comment

Thanks for the iteration! The descriptor-fallback path in BatchedBlobReader, the MOR compaction round-trip test, and the clarified Parquet no-op documentation address the substantive concerns. Prior rounds have already raised the key design points (compaction CONTENT pin, mid-batch require behavior, synthetic descriptor lifecycle, Parquet semantics). A Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

DESCRIPTOR is a metadata-only mode for INLINE rows: bytes are not
materialized during the scan and the synthesized reference is an
internal pointer into the .lance file's storage layout, not user-facing
metadata. The prior 2-hop pread fallback in BatchedBlobReader silently
materialized bytes anyway, conflating the implementation detail with
durable storage info — e.g. SELECT read_blob(col), col.reference.offset
would return a Lance-internal offset that users could mistake for a
real descriptor.

Replace the fallback with a runtime IllegalStateException naming both
INLINE and DESCRIPTOR so the failure is actionable: set CONTENT mode or
read col.data directly. The check is per-row in BatchedBlobReader, so
OUT_OF_LINE under any mode and Parquet (where DESCRIPTOR is a no-op
today) are unaffected.

Tests updated accordingly: testBlobInlineDescriptorMode now asserts the
throw and exercises a CONTENT-mode roundtrip alongside, and a new
testBlobMixedInlineAndOutOfLineContentMode pins the mixed-storage
behavior on Lance.
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 16, 2026

@hudi-agent hudi-agent left a comment

Thanks for the iteration! The RFC and code changes now consistently describe DESCRIPTOR as a metadata-only mode for INLINE rows, with read_blob() throwing an actionable error on Lance+INLINE+DESCRIPTOR, and the diagrams + behavior matrix line up with the implementation. No new critical correctness concerns from this round beyond what prior reviews already flagged — a Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

…n the default

testBlobInlineRoundTrip and testBlobInlineCompactionRoundTrip each had a
second `spark.read.format("hudi").load(...)` that did not set the inline
mode and relied on the prior default returning bytes. With the default
flipped to DESCRIPTOR and the 2-hop fallback retired, read_blob() on
INLINE rows now throws there. Set hoodie.read.blob.inline.mode=CONTENT
on those loads and refresh the surrounding comments.

@hudi-agent hudi-agent left a comment

Thanks for the iteration! The mid-batch throw is well-contained, the new MOR compaction round-trip test directly exercises the HoodieSparkLanceReader CONTENT pin against the data-corruption scenario, and the CONTENT-mode roundtrip plus mixed-storage tests pin write→read integrity. Remaining open threads from prior rounds are design-level rather than critical-correctness; a Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

s"hoodie.read.blob.inline.mode=DESCRIPTOR, INLINE blobs are returned " +
s"as metadata-only (inline_data=NULL, synthesized reference). To read " +
s"bytes, set hoodie.read.blob.inline.mode=CONTENT, or project " +
s"col.data / col.reference.* directly instead of calling read_blob().")

Based on today's discussion, I think we want users to avoid reading col.data directly and to use read_blob() instead.

I can fix the wording

|
|Caveat:
| Throws on INLINE rows under hoodie.read.blob.inline.mode=DESCRIPTOR.
| Set CONTENT mode or read col.data directly to materialize INLINE bytes.

Same as the comment above.

Thanks for the comprehensive testing, @voonhous. Can you add tests for the cases @yihua showed today involving a select * from a table that has multiple blob columns, such as blob_A and blob_B?

select read_blob(blob_col1), blob_col2 from table
mode = CONTENT: return content for both columns
mode = DESCRIPTOR (default): throw an error

select blob_col1, blob_col2 from table
mode = CONTENT: return content for both columns
mode = DESCRIPTOR (default): return descriptors

select read_blob(blob_col1), read_blob(blob_col2) from table
return content for both columns

Essentially, we want to ensure that when a user runs select read_blob(blob_A), blob_B from table under the default DESCRIPTOR mode, the query throws an error, because it mixes bytes and pointers.
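
The requested cases can be written down as a small decision function. The sketch below is a hypothetical model for Lance INLINE blobs only: `MultiColumnMatrixSketch`, its enums, and `resolve` are illustrative names, not Hudi APIs. It assumes, following the PR summary, that read_blob() on an INLINE column under DESCRIPTOR throws regardless of how the other columns are projected.

```java
// Hypothetical model of the expected outcomes for multi-column projections on
// Lance INLINE blobs. The enums and resolve() are illustrative, not Hudi APIs.
final class MultiColumnMatrixSketch {

    enum Mode { CONTENT, DESCRIPTOR }

    /** How a blob column appears in the SELECT list: plain `col` or `read_blob(col)`. */
    enum Projection { STRUCT, READ_BLOB }

    /** What the caller gets back for that column. */
    enum Shape { BYTES, CONTENT_STRUCT, DESCRIPTOR_STRUCT }

    static Shape[] resolve(Mode mode, Projection... columns) {
        Shape[] shapes = new Shape[columns.length];
        for (int i = 0; i < columns.length; i++) {
            if (columns[i] == Projection.READ_BLOB) {
                if (mode == Mode.DESCRIPTOR) {
                    // Assumed rule: any read_blob() on an INLINE column fails the
                    // query under DESCRIPTOR, even if other columns are plain structs.
                    throw new IllegalStateException(
                        "read_blob() on an INLINE blob under DESCRIPTOR mode; "
                            + "set hoodie.read.blob.inline.mode=CONTENT");
                }
                shapes[i] = Shape.BYTES;
            } else {
                // Plain projection: the struct shape follows the configured mode.
                shapes[i] = (mode == Mode.CONTENT) ? Shape.CONTENT_STRUCT : Shape.DESCRIPTOR_STRUCT;
            }
        }
        return shapes;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(
            resolve(Mode.CONTENT, Projection.READ_BLOB, Projection.STRUCT)));
        System.out.println(java.util.Arrays.toString(
            resolve(Mode.DESCRIPTOR, Projection.STRUCT, Projection.STRUCT)));
    }
}
```

Each column is resolved independently, so the mixed case fails only because one of its columns goes through read_blob(), not because of any interaction between the two columns.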

I added them.

* originals.
*/
@Test
def testBlobInlineCompactionRoundTrip(): Unit = {

Ensure that log files are created

added

s"read_blob() cannot materialize bytes for an INLINE blob under " +
s"DESCRIPTOR mode (row $rowIndex). Under " +
s"hoodie.read.blob.inline.mode=DESCRIPTOR, INLINE blobs will not return bytes via the 'data' subfield" +
" and instead returns a 'reference' subfield containing metadata only." +

Suggested change
- " and instead returns a 'reference' subfield containing metadata only." +
+ " and instead returns a 'reference' subfield containing metadata only. " +

Comment on lines +230 to +231
s"read_blob() cannot materialize bytes for an INLINE blob under " +
s"DESCRIPTOR mode (row $rowIndex). Under " +

Based on our discussion, read_blob() should disregard the inline blob read mode, correct?

meaning that read_blob() should always read the blob content regardless of the config.

read_blob() is no longer allowed under DESCRIPTOR mode, as discussed.

yihua and others added 4 commits May 19, 2026 18:34
Drop the CONTENT-mode read_blob roundtrip from testBlobInlineDescriptorMode
(already covered by testBlobInlineRoundTrip across CoW and MoR).

Trim testBlobInlineMultipleColumnsPlainSelect to per-column-independence
assertions only; the per-column INLINE shape under CONTENT and DESCRIPTOR is
already pinned by testBlobInlineRoundTrip and testBlobInlineDescriptorMode.

Trim testBlobInlineMultipleColumnsReadBlobAll to a CONTENT-only smoke; the
DESCRIPTOR throw path is already pinned by testBlobInlineDescriptorMode and
testBlobInlineMultipleColumnsMixedSelect.

Net: 28 added, 82 removed.
…read_blob, and directly selecting on blob column
…sage of read_blob, and directly selecting on blob column"

This reverts commit a1b25f6.

@yihua yihua left a comment

LGTM. @voonhous @rahil-c Thanks for persisting through the discussion and the implementation to make the user experience of blob reading straightforward!

@yihua yihua merged commit 9565926 into apache:master May 20, 2026
62 of 63 checks passed
yihua added a commit that referenced this pull request May 20, 2026
Co-authored-by: Rahil Chertara <rchertara@gmail.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.25%. Comparing base (071b3f1) to head (7c0bd80).
⚠️ Report is 37 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18744      +/-   ##
============================================
+ Coverage     68.14%   68.25%   +0.11%     
- Complexity    29094    29327     +233     
============================================
  Files          2517     2527      +10     
  Lines        141113   141811     +698     
  Branches      17508    17623     +115     
============================================
+ Hits          96160    96799     +639     
- Misses        37046    37050       +4     
- Partials       7907     7962      +55     
| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | 44.40% <6.25%> (+<0.01%) | ⬆️ |
| hadoop-mr-java-client | 44.88% <100.00%> (-0.13%) | ⬇️ |
| spark-client-hadoop-common | 48.26% <100.00%> (-0.07%) | ⬇️ |
| spark-java-tests | 48.87% <100.00%> (-0.11%) | ⬇️ |
| spark-scala-tests | 44.94% <62.50%> (+0.02%) | ⬆️ |
| utilities | 37.46% <6.25%> (-0.17%) | ⬇️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| .../apache/hudi/common/config/HoodieReaderConfig.java | 100.00% <100.00%> (ø) | |
| ...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 85.21% <100.00%> (+0.73%) | ⬆️ |
| ...g/apache/spark/sql/hudi/blob/ScalarFunctions.scala | 92.30% <ø> (ø) | |

... and 67 files with indirect coverage changes

dwshmilyss pushed a commit to dwshmilyss/hudi that referenced this pull request May 21, 2026
…18744)

Co-authored-by: Rahil Chertara <rchertara@gmail.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
@voonhous voonhous deleted the address-#18742 branch May 21, 2026 08:04

Development

Successfully merging this pull request may close these issues.

Default hoodie.read.blob.inline.mode to DESCRIPTOR for Lance (compaction pinned to CONTENT)

6 participants