feat: Add Parquet DESCRIPTOR mode for blob inline reading #18683

Open

rahil-c wants to merge 2 commits into apache:master from rahil-c:feat/parquet-blob-descriptor-mode

Conversation

@rahil-c rahil-c (Collaborator) commented May 3, 2026

Describe the issue this Pull Request addresses

When hoodie.read.blob.inline.mode=DESCRIPTOR is set with Parquet base files, leverage Parquet's nested column projection to skip reading the blob data sub-column entirely (genuine I/O savings). Previously the config only affected Lance reads; Parquet still materialized the bytes.

Approach mirrors the existing VECTOR column rewrite pattern in HoodieFileGroupReaderBasedFileFormat:

  1. Detect blob columns via schema metadata.
  2. Strip the data sub-field from blob structs in the read schema.
  3. Post-read null-pad the data field back into output rows.

Both COW (HoodieFileGroupReaderBasedFileFormat.readBaseFile) and MOR (SparkFileFormatInternalRowReaderContext.getFileRecordIterator) paths are covered. Also adds a defensive null check in BatchedBlobReader.
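
For illustration, a minimal Scala sketch of what steps 2 and 3 amount to. The real implementations are the VectorConversionUtils helpers listed in the changelog below; the method shapes and the hard-coded "data" field name here are assumptions, not the actual signatures.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._

// Step 2 (sketch): drop the 'data' sub-field from a blob struct so the
// Parquet reader never materializes the payload bytes.
def stripBlobDataField(blobStruct: StructType): StructType =
  StructType(blobStruct.fields.filterNot(_.name == "data"))

// Step 3 (sketch): rebuild the full-width struct row, re-inserting a null
// at the original ordinal of 'data'; fullStruct is the pre-strip schema.
def nullPadBlobStruct(read: InternalRow, fullStruct: StructType): InternalRow = {
  val dataOrdinal = fullStruct.fieldIndex("data")
  val padded = new GenericInternalRow(fullStruct.length)
  var readIdx = 0
  fullStruct.fields.zipWithIndex.foreach { case (field, i) =>
    if (i == dataOrdinal) {
      padded.setNullAt(i) // the payload column was never read
    } else {
      padded.update(i, read.get(readIdx, field.dataType))
      readIdx += 1
    }
  }
  padded
}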

A naive DESCRIPTOR + read_blob() would silently return null on Parquet (no byte-range channel like Lance). To keep the API consistent, ReadBlobRule now downgrades any Parquet scan to CONTENT for queries that contain read_blob(), while sibling queries on the same FileFormat instance keep DESCRIPTOR's I/O savings.

Summary and Changelog

User-visible behavior

  • hoodie.read.blob.inline.mode=DESCRIPTOR now works on Parquet base files (COW + MOR), not just Lance, and skips the blob data Parquet column for real I/O savings.
  • read_blob() keeps working under DESCRIPTOR on Parquet — the engine automatically downgrades the affected scan to CONTENT so bytes are materialized; sibling queries that don't use read_blob() still benefit from DESCRIPTOR.
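
As a usage illustration, a minimal sketch: the table name blob_tbl, the payload column, and the read_blob() argument shape are hypothetical, and an active SparkSession named spark is assumed.

// Opt in for the session; the default mode remains CONTENT.
spark.sql("SET hoodie.read.blob.inline.mode=DESCRIPTOR")

// DESCRIPTOR: the blob 'data' Parquet column is skipped; per the tests below,
// payload.data (and reference) read back null while the INLINE type is preserved.
spark.sql("SELECT id, payload.type, payload.data FROM blob_tbl").show()

// read_blob() in the query: ReadBlobRule downgrades this scan to CONTENT, so
// bytes are materialized; sibling queries on the same view keep DESCRIPTOR.
spark.sql("SELECT id, read_blob(payload) FROM blob_tbl").show()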

Detailed changelog

DESCRIPTOR-on-Parquet:

  • VectorConversionUtils: new helpers detectBlobColumnsFromMetadata, stripBlobDataField, buildBlobNullPadRowMapper.
  • HoodieFileGroupReaderBasedFileFormat:
    • supportBatch returns false when DESCRIPTOR is active and blob columns are present (row-level access required for null-padding); see the sketch after this list.
    • readBaseFile strips the data sub-field from the read schema and wraps the iterator with wrapWithBlobNullPadding.
  • SparkFileFormatInternalRowReaderContext.getFileRecordIterator: same rewrite/pad on the MOR base-file path, driven by the Hadoop conf entry.
  • HoodieReaderConfig.BLOB_INLINE_READ_MODE: docstring updated to describe Parquet semantics.
  • BatchedBlobReader: defensive null check on the data row.
  • HoodieHadoopFsRelationFactory: pass through the configured DESCRIPTOR flag.
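
A hedged sketch of that supportBatch gating, assumed to sit inside HoodieFileGroupReaderBasedFileFormat; the helper's return shape, the parent class's supportBatch, and the exact wiring are assumptions.

// Sketch only: columnar (vectorized) reads cannot null-pad a struct sub-field,
// so fall back to row-level reads whenever the DESCRIPTOR rewrite will apply.
override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
  val hasBlobColumns = !VectorConversionUtils.detectBlobColumnsFromMetadata(schema).isEmpty
  !(isBlobDescriptorMode && hasBlobColumns) && super.supportBatch(sparkSession, schema)
}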

read_blob() override:

  • HoodieFileGroupReaderBasedFileFormat: constructor flag renamed isBlobDescriptorMode → initialBlobDescriptorMode; new mutable _isBlobDescriptorMode with setBlobDescriptorMode / restoreBlobDescriptorMode. buildReaderWithPartitionValues syncs the Hadoop conf entry from the mutable flag so the MOR path agrees with the COW path after a flip.
  • ReadBlobRule: walks each plan it sees; if read_blob() (or an already-injected BatchedBlobRead) is present, flips DESCRIPTOR→CONTENT on every Hudi Parquet LogicalRelation's FileFormat; otherwise restores the construction-time value (handles shared FileFormat instances across queries against the same temp view). Lance scans are skipped. See the sketch after this list.
  • HoodieReaderConfig.BLOB_INLINE_READ_MODE: docstring updated to note the automatic downgrade.
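
A hedged sketch of the flip/restore walk; the pattern match, helper names, and setter signatures are inferred from the changelog above, and the real rule is more involved.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Sketch: flip every Hudi Parquet scan's mode in place. needsContentMode is
// computed once per plan, from containsReadBlobAnywhere(plan) in the real rule.
def syncDescriptorMode(plan: LogicalPlan, needsContentMode: Boolean): Unit =
  plan.foreach {
    case LogicalRelation(rel: HadoopFsRelation, _, _, _) =>
      rel.fileFormat match {
        case ff: HoodieFileGroupReaderBasedFileFormat => // Hudi Parquet scan
          if (needsContentMode) ff.setBlobDescriptorMode(false) // DESCRIPTOR→CONTENT
          else ff.restoreBlobDescriptorMode() // back to the construction-time value
        case _ => // Lance and non-Hudi formats are skipped
      }
    case _ =>
  }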

Tests added

  • TestReadBlobSQL.testParquetDescriptorSkipsDataColumn — @ParameterizedTest over HoodieTableType (COW + MOR): asserts INLINE type preserved, data null, reference null.
  • TestReadBlobSQL.testReadBlobSupersedesDescriptorOnParquet — read_blob() materializes bytes despite DESCRIPTOR, then a follow-up query on the same view restores DESCRIPTOR's null-pad.
  • TestReadBlobSQL.testReadBlobInWhereClauseUnderDescriptor — override engages when read_blob() is in WHERE.
  • TestReadBlobSQL.testMultiBlobColumnsDescriptorWholeScanDowngrade — read_blob() on one blob column also materializes bytes for unrelated blob columns in the same scan.
  • TestReadBlobSQL.testDescriptorOnTableWithoutBlobColumns — DESCRIPTOR on a non-blob table is a no-op.
  • New TestVectorConversionUtilsBlob — unit tests for detectBlobColumnsFromMetadata, stripBlobDataField, and buildBlobNullPadRowMapper.

Impact

  • User-facing: Parquet readers that opt into hoodie.read.blob.inline.mode=DESCRIPTOR now skip the blob data Parquet column on reads, reducing I/O for tables with large inline blobs whose payload bytes aren't needed. read_blob() continues to work under DESCRIPTOR on Parquet (auto-downgraded per scan).
  • Public API: No public API changes. HoodieReaderConfig.BLOB_INLINE_READ_MODE keeps its key, default, and valid values; only the docstring is updated.
  • Performance: For DESCRIPTOR queries on Parquet without read_blob(), the blob bytes column is no longer read or decoded. For DESCRIPTOR queries that do use read_blob(), behavior is the same as CONTENT mode (no regression).
  • Compatibility: Default remains CONTENT; existing CONTENT-mode workloads are unchanged.

Risk Level

low

BLOB_INLINE_READ_MODE defaults to CONTENT, so existing reads are unaffected. The DESCRIPTOR rewrite is gated on a metadata marker plus the user's explicit opt-in. The read_blob() override mutates a per-FileFormat flag in place, which is single-JVM-safe; concurrent queries against the same temp view are sequenced through the optimizer rule. Coverage includes COW and MOR base-file paths plus WHERE-clause and multi-column scenarios.

Documentation Update

  • HoodieReaderConfig.BLOB_INLINE_READ_MODE docstring rewritten to describe Parquet semantics and the automatic read_blob() downgrade. No website doc changes are required since this is a refinement of an existing config; if the Hudi reader-config reference page is regenerated from the source javadoc, it picks up the new text automatically.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

When hoodie.read.blob.inline.mode=DESCRIPTOR is set with Parquet base
files, leverage Parquet's nested column projection to skip reading the
blob 'data' sub-column entirely (genuine I/O savings). Previously the
config only affected Lance reads; Parquet still materialized the bytes.

Approach mirrors the existing VECTOR column rewrite pattern in
HoodieFileGroupReaderBasedFileFormat:
1. Detect blob columns via schema metadata
2. Strip the 'data' sub-field from blob structs in the read schema
3. Post-read null-pad the 'data' field back into output rows

Both COW (HoodieFileGroupReaderBasedFileFormat.readBaseFile) and MOR
(SparkFileFormatInternalRowReaderContext.getFileRecordIterator) paths
are covered. Also adds a defensive null check in BatchedBlobReader.

read_blob() on Parquet DESCRIPTOR rows returns null since Parquet has
no byte-range blob access like Lance — documented as a known limitation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 3, 2026
Parquet DESCRIPTOR strips the blob data column for I/O savings, but
Parquet has no byte-range channel like Lance, so read_blob() under
DESCRIPTOR would silently return null. ReadBlobRule now flips the
FileFormat's mode to CONTENT for any scan whose query uses read_blob(),
and restores DESCRIPTOR for sibling queries on the same shared
FileFormat instance. The flag is mutated in place because Spark's
planner/AQE retains a reference to the original instance.

Also adds tests previously missing for the underlying PR: COW+MOR
parameterized DESCRIPTOR happy-path, read_blob in WHERE under
DESCRIPTOR, multi-blob whole-scan downgrade, no-op on non-blob tables,
and unit tests for VectorConversionUtils blob helpers.
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 8, 2026
@voonhous voonhous marked this pull request as ready for review May 8, 2026 17:24
@hudi-bot hudi-bot (Collaborator) commented May 8, 2026

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 76.76056% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.15%. Comparing base (4d0e9cd) to head (da8bdba).
⚠️ Report is 9 commits behind head on master.

Files with missing lines | Patch % | Lines
...hudi/SparkFileFormatInternalRowReaderContext.scala | 33.33% | 12 Missing and 6 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 84.00% | 1 Missing and 7 partials ⚠️
.../apache/hudi/io/storage/VectorConversionUtils.java | 90.47% | 0 Missing and 4 partials ⚠️
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala | 87.50% | 0 Missing and 2 partials ⚠️
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 0.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18683      +/-   ##
============================================
+ Coverage     68.08%   68.15%   +0.06%     
- Complexity    28940    29120     +180     
============================================
  Files          2519     2522       +3     
  Lines        140646   141307     +661     
  Branches      17427    17549     +122     
============================================
+ Hits          95757    96305     +548     
- Misses        37030    37077      +47     
- Partials       7859     7925      +66     
Flag | Coverage Δ
common-and-other-modules | 44.39% <19.71%> (+0.04%) ⬆️
hadoop-mr-java-client | 45.01% <ø> (+0.04%) ⬆️
spark-client-hadoop-common | 48.41% <55.07%> (-0.03%) ⬇️
spark-java-tests | 49.02% <73.94%> (+0.37%) ⬆️
spark-scala-tests | 44.88% <28.16%> (+0.12%) ⬆️
utilities | 37.62% <25.35%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
.../apache/hudi/common/config/HoodieReaderConfig.java | 100.00% <ø> (ø)
...rg/apache/hudi/HoodieHadoopFsRelationFactory.scala | 83.26% <100.00%> (+0.34%) ⬆️
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 84.11% <0.00%> (-0.37%) ⬇️
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala | 54.44% <87.50%> (+7.77%) ⬆️
.../apache/hudi/io/storage/VectorConversionUtils.java | 83.33% <90.47%> (+2.94%) ⬆️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 84.67% <84.00%> (-0.61%) ⬇️
...hudi/SparkFileFormatInternalRowReaderContext.scala | 70.17% <33.33%> (-6.54%) ⬇️

... and 35 files with indirect coverage changes


@hudi-agent hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds Parquet DESCRIPTOR mode for blob inline reading by stripping the data sub-field from blob structs and null-padding it back post-read, with a ReadBlobRule mechanism to downgrade to CONTENT when read_blob() is in scope. The mechanism for sharing the flip across queries against the same temp view raises a couple of correctness questions worth a closer look — please see the inline comments. Once those are addressed, this should be ready for a Hudi committer or PMC member to take it from here.

A couple of naming nits in ReadBlobRule — containsReadBlobAnywhere looks too similar to the existing containsReadBlobInAnyExpression (different scopes, look-alike names), and syncDescriptorMode reads like a thread-sync operation rather than a policy-apply step.

// Mutable so ReadBlobRule can flip DESCRIPTOR→CONTENT when read_blob() appears in a query.
// We mutate in place because Spark's planner/AQE retains a reference to this FileFormat
// instance even after the optimizer rewrites the LogicalRelation, so swapping wouldn't stick.
@volatile private var _isBlobDescriptorMode: Boolean = initialBlobDescriptorMode

🤖 I'm worried this mutable flag races across concurrent queries that share a FileFormat instance via a temp view. If query A (uses read_blob()) sets the flag to false during its optimization, and query B (no read_blob()) runs concurrently and calls restoreBlobDescriptorMode() before A's tasks are serialized, A's executors will deserialize the FileFormat with flag=true — the executor-side readBaseFile (line 565) reads isBlobDescriptorMode and would strip the data column, causing read_blob() to silently return null. Same hazard in the opposite direction for B. Async query submission (e.g. Futures, Airflow, multi-cell notebooks) makes this realistic. Have you considered passing the per-query mode through the LogicalRelation's options map / Hadoop conf at rule time instead of mutating shared state? @nsivabalan @yihua thoughts on whether this concurrency surface is acceptable here?

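For illustration, one shape the suggested per-query alternative could take. It is hypothetical, not what this PR implements; the option-key reuse and the copy mechanics are assumptions.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Hypothetical alternative: stamp the per-query mode into the relation's
// options so each plan carries an immutable copy; the reader would then
// consult these options instead of a mutable FileFormat flag.
def downgradeToContent(plan: LogicalPlan): LogicalPlan = plan.transform {
  case lr @ LogicalRelation(rel: HadoopFsRelation, _, _, _) =>
    val perQueryOptions = rel.options + ("hoodie.read.blob.inline.mode" -> "CONTENT")
    lr.copy(relation = rel.copy(options = perQueryOptions)(rel.sparkSession))
}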

  SparkFileFormatInternalRowReaderContext.wrapWithVectorConversion(blobPaddedIterator, vectorColumnInfo, if (blobColumnIndices.nonEmpty) parquetReadStructType else readSchema)
} else {
  blobPaddedIterator
}

🤖 For MOR tables with log files, the log-file branch (line 123-126) reads with the full requiredSchema, so log records keep their populated data field, while base-file records under DESCRIPTOR get data=null. After merge the user sees a mix: records updated via log have bytes, records still in base have null. Is this the intended semantics, or should DESCRIPTOR also null the data on log-file rows for consistency?


// uses read_blob(), and restore DESCRIPTOR for queries that don't (the FileFormat instance
// is shared across queries against the same temp view).
val needsContentMode = containsReadBlobAnywhere(plan) || plan.exists(_.isInstanceOf[BatchedBlobRead])
syncDescriptorMode(plan, forceContentMode = needsContentMode)

🤖 syncDescriptorMode is a side effect inside an optimizer rule's apply. Spark's optimizer batches typically run rules to a fixed point, so this can fire multiple times per query. It's idempotent within a single query (good), but the side effect on a shared FileFormat ordering between concurrent queries' optimizer passes is what creates the race I flagged in HoodieFileGroupReaderBasedFileFormat. Have you considered mapChildren / structural rewrites to encode the mode in the plan (e.g. as a hint or option) rather than mutating the FileFormat?


plan.expressions.exists(containsReadBlobInExpression)
}

private def containsReadBlobAnywhere(plan: LogicalPlan): Boolean = {

🤖 nit: containsReadBlobAnywhere is dangerously close to the existing containsReadBlobInAnyExpression just above it, but does something meaningfully different — it walks the entire plan tree rather than checking only the current node's expressions. A name like planTreeContainsReadBlob or treeContainsReadBlob would make the scope distinction immediately obvious to the next person who touches this rule.



// Lance scans are skipped because their hoodieFileFormat != PARQUET and Lance already
// supports read_blob() under DESCRIPTOR via byte-range references.
private def syncDescriptorMode(plan: LogicalPlan, forceContentMode: Boolean): Unit = {

🤖 nit: syncDescriptorMode reads like a concurrency-sync operation rather than a policy-apply step. Something like applyDescriptorModePolicy or updateDescriptorModeForPlan would make it clearer that this is deciding whether to downgrade DESCRIPTOR→CONTENT based on query content, not synchronizing threads.

