feat: Add Parquet DESCRIPTOR mode for blob inline reading #18683

Open

rahil-c wants to merge 2 commits into apache:master from rahil-c:feat/parquet-blob-descriptor-mode

Conversation

@rahil-c rahil-c (Collaborator) commented May 3, 2026

Describe the issue this Pull Request addresses

When hoodie.read.blob.inline.mode=DESCRIPTOR is set with Parquet base files, leverage Parquet's nested column projection to skip reading the blob data sub-column entirely (genuine I/O savings). Previously the config only affected Lance reads; Parquet still materialized the bytes.

Approach mirrors the existing VECTOR column rewrite pattern in HoodieFileGroupReaderBasedFileFormat:

  1. Detect blob columns via schema metadata.
  2. Strip the data sub-field from blob structs in the read schema.
  3. Post-read null-pad the data field back into output rows.

Both COW (HoodieFileGroupReaderBasedFileFormat.readBaseFile) and MOR (SparkFileFormatInternalRowReaderContext.getFileRecordIterator) paths are covered. Also adds a defensive null check in BatchedBlobReader.
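
For illustration, a minimal Scala sketch of what steps 2 and 3 amount to. The real implementations are the VectorConversionUtils helpers listed in the changelog below; the method shapes and the hard-coded "data" field name here are assumptions, not the actual signatures.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._

// Step 2 (sketch): drop the 'data' sub-field from a blob struct so the
// Parquet reader never materializes the payload bytes.
def stripBlobDataField(blobStruct: StructType): StructType =
  StructType(blobStruct.fields.filterNot(_.name == "data"))

// Step 3 (sketch): rebuild the full-width struct row, re-inserting a null
// at the original ordinal of 'data'; fullStruct is the pre-strip schema.
def nullPadBlobStruct(read: InternalRow, fullStruct: StructType): InternalRow = {
  val dataOrdinal = fullStruct.fieldIndex("data")
  val padded = new GenericInternalRow(fullStruct.length)
  var readIdx = 0
  fullStruct.fields.zipWithIndex.foreach { case (field, i) =>
    if (i == dataOrdinal) {
      padded.setNullAt(i) // the payload column was never read
    } else {
      padded.update(i, read.get(readIdx, field.dataType))
      readIdx += 1
    }
  }
  padded
}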

A naive DESCRIPTOR + read_blob() would silently return null on Parquet (no byte-range channel like Lance). To keep the API consistent, ReadBlobRule now downgrades any Parquet scan to CONTENT for queries that contain read_blob(), while sibling queries on the same FileFormat instance keep DESCRIPTOR's I/O savings.

Summary and Changelog

User-visible behavior

  • hoodie.read.blob.inline.mode=DESCRIPTOR now works on Parquet base files (COW + MOR), not just Lance, and skips the blob data Parquet column for real I/O savings.
  • read_blob() keeps working under DESCRIPTOR on Parquet — the engine automatically downgrades the affected scan to CONTENT so bytes are materialized; sibling queries that don't use read_blob() still benefit from DESCRIPTOR.
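
As a usage illustration, a minimal sketch: the table name blob_tbl, the payload column, and the read_blob() argument shape are hypothetical, and an active SparkSession named spark is assumed.

// Opt in for the session; the default mode remains CONTENT.
spark.sql("SET hoodie.read.blob.inline.mode=DESCRIPTOR")

// DESCRIPTOR: the blob 'data' Parquet column is skipped; per the tests below,
// payload.data (and reference) read back null while the INLINE type is preserved.
spark.sql("SELECT id, payload.type, payload.data FROM blob_tbl").show()

// read_blob() in the query: ReadBlobRule downgrades this scan to CONTENT, so
// bytes are materialized; sibling queries on the same view keep DESCRIPTOR.
spark.sql("SELECT id, read_blob(payload) FROM blob_tbl").show()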

Detailed changelog

DESCRIPTOR-on-Parquet:

  • VectorConversionUtils: new helpers detectBlobColumnsFromMetadata, stripBlobDataField, buildBlobNullPadRowMapper.
  • HoodieFileGroupReaderBasedFileFormat:
    • supportBatch returns false when DESCRIPTOR is active and blob columns are present (row-level access required for null-padding); see the sketch after this list.
    • readBaseFile strips the data sub-field from the read schema and wraps the iterator with wrapWithBlobNullPadding.
  • SparkFileFormatInternalRowReaderContext.getFileRecordIterator: same rewrite/pad on the MOR base-file path, driven by the Hadoop conf entry.
  • HoodieReaderConfig.BLOB_INLINE_READ_MODE: docstring updated to describe Parquet semantics.
  • BatchedBlobReader: defensive null check on the data row.
  • HoodieHadoopFsRelationFactory: pass through the configured DESCRIPTOR flag.
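
A hedged sketch of that supportBatch gating, assumed to sit inside HoodieFileGroupReaderBasedFileFormat; the helper's return shape, the parent class's supportBatch, and the exact wiring are assumptions.

// Sketch only: columnar (vectorized) reads cannot null-pad a struct sub-field,
// so fall back to row-level reads whenever the DESCRIPTOR rewrite will apply.
override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
  val hasBlobColumns = !VectorConversionUtils.detectBlobColumnsFromMetadata(schema).isEmpty
  !(isBlobDescriptorMode && hasBlobColumns) && super.supportBatch(sparkSession, schema)
}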

read_blob() override:

  • HoodieFileGroupReaderBasedFileFormat: constructor flag renamed isBlobDescriptorMode → initialBlobDescriptorMode; new mutable _isBlobDescriptorMode with setBlobDescriptorMode / restoreBlobDescriptorMode. buildReaderWithPartitionValues syncs the Hadoop conf entry from the mutable flag so the MOR path agrees with the COW path after a flip.
  • ReadBlobRule: walks each plan it sees; if read_blob() (or an already-injected BatchedBlobRead) is present, flips DESCRIPTOR→CONTENT on every Hudi Parquet LogicalRelation's FileFormat; otherwise restores the construction-time value (handles shared FileFormat instances across queries against the same temp view). Lance scans are skipped. See the sketch after this list.
  • HoodieReaderConfig.BLOB_INLINE_READ_MODE: docstring updated to note the automatic downgrade.
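
A hedged sketch of the flip/restore walk; the pattern match, helper names, and setter signatures are inferred from the changelog above, and the real rule is more involved.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Sketch: flip every Hudi Parquet scan's mode in place. needsContentMode is
// computed once per plan, from containsReadBlobAnywhere(plan) in the real rule.
def syncDescriptorMode(plan: LogicalPlan, needsContentMode: Boolean): Unit =
  plan.foreach {
    case LogicalRelation(rel: HadoopFsRelation, _, _, _) =>
      rel.fileFormat match {
        case ff: HoodieFileGroupReaderBasedFileFormat => // Hudi Parquet scan
          if (needsContentMode) ff.setBlobDescriptorMode(false) // DESCRIPTOR→CONTENT
          else ff.restoreBlobDescriptorMode() // back to the construction-time value
        case _ => // Lance and non-Hudi formats are skipped
      }
    case _ =>
  }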

Tests added

  • TestReadBlobSQL.testParquetDescriptorSkipsDataColumn — @ParameterizedTest over HoodieTableType (COW + MOR): asserts INLINE type preserved, data null, reference null.
  • TestReadBlobSQL.testReadBlobSupersedesDescriptorOnParquet — read_blob() materializes bytes despite DESCRIPTOR, then a follow-up query on the same view restores DESCRIPTOR's null-pad.
  • TestReadBlobSQL.testReadBlobInWhereClauseUnderDescriptor — override engages when read_blob() is in WHERE.
  • TestReadBlobSQL.testMultiBlobColumnsDescriptorWholeScanDowngrade — read_blob() on one blob column also materializes bytes for unrelated blob columns in the same scan.
  • TestReadBlobSQL.testDescriptorOnTableWithoutBlobColumns — DESCRIPTOR on a non-blob table is a no-op.
  • New TestVectorConversionUtilsBlob — unit tests for detectBlobColumnsFromMetadata, stripBlobDataField, and buildBlobNullPadRowMapper.

Impact

  • User-facing: Parquet readers that opt into hoodie.read.blob.inline.mode=DESCRIPTOR now skip the blob data Parquet column on reads, reducing I/O for tables with large inline blobs whose payload bytes aren't needed. read_blob() continues to work under DESCRIPTOR on Parquet (auto-downgraded per scan).
  • Public API: No public API changes. HoodieReaderConfig.BLOB_INLINE_READ_MODE keeps its key, default, and valid values; only the docstring is updated.
  • Performance: For DESCRIPTOR queries on Parquet without read_blob(), the blob bytes column is no longer read or decoded. For DESCRIPTOR queries that do use read_blob(), behavior is the same as CONTENT mode (no regression).
  • Compatibility: Default remains CONTENT; existing CONTENT-mode workloads are unchanged.

Risk Level

low

BLOB_INLINE_READ_MODE defaults to CONTENT, so existing reads are unaffected. The DESCRIPTOR rewrite is gated on a metadata marker plus the user's explicit opt-in. The read_blob() override mutates a per-FileFormat flag in place, which is single-JVM-safe; concurrent queries against the same temp view are sequenced through the optimizer rule. Coverage includes COW and MOR base-file paths plus WHERE-clause and multi-column scenarios.

Documentation Update

  • HoodieReaderConfig.BLOB_INLINE_READ_MODE docstring rewritten to describe Parquet semantics and the automatic read_blob() downgrade. No website doc changes are required since this is a refinement of an existing config; if the Hudi reader-config reference page is regenerated from the source javadoc, it picks up the new text automatically.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

When hoodie.read.blob.inline.mode=DESCRIPTOR is set with Parquet base
files, leverage Parquet's nested column projection to skip reading the
blob 'data' sub-column entirely (genuine I/O savings). Previously the
config only affected Lance reads; Parquet still materialized the bytes.

Approach mirrors the existing VECTOR column rewrite pattern in
HoodieFileGroupReaderBasedFileFormat:
1. Detect blob columns via schema metadata
2. Strip the 'data' sub-field from blob structs in the read schema
3. Post-read null-pad the 'data' field back into output rows

Both COW (HoodieFileGroupReaderBasedFileFormat.readBaseFile) and MOR
(SparkFileFormatInternalRowReaderContext.getFileRecordIterator) paths
are covered. Also adds a defensive null check in BatchedBlobReader.

read_blob() on Parquet DESCRIPTOR rows returns null since Parquet has
no byte-range blob access like Lance — documented as a known limitation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 3, 2026
Parquet DESCRIPTOR strips the blob data column for I/O savings, but
Parquet has no byte-range channel like Lance, so read_blob() under
DESCRIPTOR would silently return null. ReadBlobRule now flips the
FileFormat's mode to CONTENT for any scan whose query uses read_blob(),
and restores DESCRIPTOR for sibling queries on the same shared
FileFormat instance. The flag is mutated in place because Spark's
planner/AQE retains a reference to the original instance.

Also adds tests previously missing for the underlying PR: COW+MOR
parameterized DESCRIPTOR happy-path, read_blob in WHERE under
DESCRIPTOR, multi-blob whole-scan downgrade, no-op on non-blob tables,
and unit tests for VectorConversionUtils blob helpers.
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 8, 2026
@voonhous voonhous marked this pull request as ready for review May 8, 2026 17:24
@hudi-bot hudi-bot (Collaborator) commented May 8, 2026

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 76.76056% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.15%. Comparing base (4d0e9cd) to head (da8bdba).
⚠️ Report is 9 commits behind head on master.

Files with missing lines | Patch % | Lines
...hudi/SparkFileFormatInternalRowReaderContext.scala | 33.33% | 12 Missing and 6 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 84.00% | 1 Missing and 7 partials ⚠️
.../apache/hudi/io/storage/VectorConversionUtils.java | 90.47% | 0 Missing and 4 partials ⚠️
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala | 87.50% | 0 Missing and 2 partials ⚠️
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 0.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18683      +/-   ##
============================================
+ Coverage     68.08%   68.15%   +0.06%     
- Complexity    28940    29120     +180     
============================================
  Files          2519     2522       +3     
  Lines        140646   141307     +661     
  Branches      17427    17549     +122     
============================================
+ Hits          95757    96305     +548     
- Misses        37030    37077      +47     
- Partials       7859     7925      +66     
Flag | Coverage Δ
common-and-other-modules | 44.39% <19.71%> (+0.04%) ⬆️
hadoop-mr-java-client | 45.01% <ø> (+0.04%) ⬆️
spark-client-hadoop-common | 48.41% <55.07%> (-0.03%) ⬇️
spark-java-tests | 49.02% <73.94%> (+0.37%) ⬆️
spark-scala-tests | 44.88% <28.16%> (+0.12%) ⬆️
utilities | 37.62% <25.35%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
.../apache/hudi/common/config/HoodieReaderConfig.java | 100.00% <ø> (ø)
...rg/apache/hudi/HoodieHadoopFsRelationFactory.scala | 83.26% <100.00%> (+0.34%) ⬆️
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 84.11% <0.00%> (-0.37%) ⬇️
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala | 54.44% <87.50%> (+7.77%) ⬆️
.../apache/hudi/io/storage/VectorConversionUtils.java | 83.33% <90.47%> (+2.94%) ⬆️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 84.67% <84.00%> (-0.61%) ⬇️
...hudi/SparkFileFormatInternalRowReaderContext.scala | 70.17% <33.33%> (-6.54%) ⬇️

... and 35 files with indirect coverage changes


@hudi-agent hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds Parquet DESCRIPTOR mode for blob inline reading by stripping the data sub-field from blob structs and null-padding it back post-read, with a ReadBlobRule mechanism to downgrade to CONTENT when read_blob() is in scope. The mechanism for sharing the flip across queries against the same temp view raises a couple of correctness questions worth a closer look — please see the inline comments. Once those are addressed, this should be ready for a Hudi committer or PMC member to take it from here.

A couple of naming nits in ReadBlobRule — containsReadBlobAnywhere looks too similar to the existing containsReadBlobInAnyExpression (different scopes, look-alike names), and syncDescriptorMode reads like a thread-sync operation rather than a policy-apply step.

// Mutable so ReadBlobRule can flip DESCRIPTOR→CONTENT when read_blob() appears in a query.
// We mutate in place because Spark's planner/AQE retains a reference to this FileFormat
// instance even after the optimizer rewrites the LogicalRelation, so swapping wouldn't stick.
@volatile private var _isBlobDescriptorMode: Boolean = initialBlobDescriptorMode

🤖 I'm worried this mutable flag races across concurrent queries that share a FileFormat instance via a temp view. If query A (uses read_blob()) sets the flag to false during its optimization, and query B (no read_blob()) runs concurrently and calls restoreBlobDescriptorMode() before A's tasks are serialized, A's executors will deserialize the FileFormat with flag=true — the executor-side readBaseFile (line 565) reads isBlobDescriptorMode and would strip the data column, causing read_blob() to silently return null. Same hazard in the opposite direction for B. Async query submission (e.g. Futures, Airflow, multi-cell notebooks) makes this realistic. Have you considered passing the per-query mode through the LogicalRelation's options map / Hadoop conf at rule time instead of mutating shared state? @nsivabalan @yihua thoughts on whether this concurrency surface is acceptable here?

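For illustration, one shape the suggested per-query alternative could take. It is hypothetical, not what this PR implements; the option-key reuse and the copy mechanics are assumptions.

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Hypothetical alternative: stamp the per-query mode into the relation's
// options so each plan carries an immutable copy; the reader would then
// consult these options instead of a mutable FileFormat flag.
def downgradeToContent(plan: LogicalPlan): LogicalPlan = plan.transform {
  case lr @ LogicalRelation(rel: HadoopFsRelation, _, _, _) =>
    val perQueryOptions = rel.options + ("hoodie.read.blob.inline.mode" -> "CONTENT")
    lr.copy(relation = rel.copy(options = perQueryOptions)(rel.sparkSession))
}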

  SparkFileFormatInternalRowReaderContext.wrapWithVectorConversion(blobPaddedIterator, vectorColumnInfo, if (blobColumnIndices.nonEmpty) parquetReadStructType else readSchema)
} else {
  blobPaddedIterator
}

🤖 For MOR tables with log files, the log-file branch (line 123-126) reads with the full requiredSchema, so log records keep their populated data field, while base-file records under DESCRIPTOR get data=null. After merge the user sees a mix: records updated via log have bytes, records still in base have null. Is this the intended semantics, or should DESCRIPTOR also null the data on log-file rows for consistency?


// uses read_blob(), and restore DESCRIPTOR for queries that don't (the FileFormat instance
// is shared across queries against the same temp view).
val needsContentMode = containsReadBlobAnywhere(plan) || plan.exists(_.isInstanceOf[BatchedBlobRead])
syncDescriptorMode(plan, forceContentMode = needsContentMode)

🤖 syncDescriptorMode is a side effect inside an optimizer rule's apply. Spark's optimizer batches typically run rules to a fixed point, so this can fire multiple times per query. It's idempotent within a single query (good), but the side effect on a shared FileFormat ordering between concurrent queries' optimizer passes is what creates the race I flagged in HoodieFileGroupReaderBasedFileFormat. Have you considered mapChildren / structural rewrites to encode the mode in the plan (e.g. as a hint or option) rather than mutating the FileFormat?


plan.expressions.exists(containsReadBlobInExpression)
}

private def containsReadBlobAnywhere(plan: LogicalPlan): Boolean = {

🤖 nit: containsReadBlobAnywhere is dangerously close to the existing containsReadBlobInAnyExpression just above it, but does something meaningfully different — it walks the entire plan tree rather than checking only the current node's expressions. A name like planTreeContainsReadBlob or treeContainsReadBlob would make the scope distinction immediately obvious to the next person who touches this rule.



// Lance scans are skipped because their hoodieFileFormat != PARQUET and Lance already
// supports read_blob() under DESCRIPTOR via byte-range references.
private def syncDescriptorMode(plan: LogicalPlan, forceContentMode: Boolean): Unit = {

🤖 nit: syncDescriptorMode reads like a concurrency-sync operation rather than a policy-apply step. Something like applyDescriptorModePolicy or updateDescriptorModeForPlan would make it clearer that this is deciding whether to downgrade DESCRIPTOR→CONTENT based on query content, not synchronizing threads.

