feat(blob): Accept partial {type,data} or {type,reference} structs on write #18665

voonhous wants to merge 3 commits into
Conversation
Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor and payload shape of a custom-typed column survive inline compaction of a log-only MOR table into a base file.

- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata (5 commits via SQL + MERGE INTO to trigger default inline compaction).
- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+, asserting native VariantType round-trips through compaction.
- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline uses named_struct with hex byte literals; out-of-line creates real files via BlobTestHelpers.createTestFile and verifies bytes via read_blob().
Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and BLOB onto the Lance base file format, and extend all variants with a 6th deltacommit so the cleaner has a chance to retire the post-compaction log-only slice and write a .clean instant.

- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table config and that a .lance base file exists under the table path after compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter -> LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the LANCE format config sticks to hoodie.properties immediately after CREATE TABLE, so the table-level invariant is checked even when the writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after the compaction-triggering 5th commit. The 5th commit's auto-clean runs before inline compaction, so the prior log slice is not yet superseded; the 6th commit's postCommit clean retires it and writes the .clean instant. The cleaner-timeline assertion uses reloadActiveTimeline() to avoid a stale cached view.
… write
INLINE writes now accept the natural `{type, data}` shape and OUT_OF_LINE
writes accept `{type, reference}`; the missing sibling field is auto-padded
with null at the writer ingest boundary so the canonical 3-field BLOB layout
is preserved on disk. Padding recurses through StructType, ArrayType, and
MapType (via Spark's transform / transform_values) so nested partial blobs
are handled too. Already-canonical inputs are a no-op.
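The per-value padding rule described above can be sketched as a small, self-contained Scala model. This is illustrative only: the real implementation operates on Spark `Column`s via `when`/`struct`, not plain maps, and `padBlobValue` is a hypothetical name.

```scala
// Illustrative model of the writer-side padding: a null struct stays null,
// a partial {type,data} or {type,reference} struct gains the missing sibling
// as null, and a canonical 3-field struct passes through unchanged.
def padBlobValue(row: Option[Map[String, Any]]): Option[Map[String, Any]] =
  row.map { fields =>
    Map(
      "type"      -> fields.getOrElse("type", null),
      "data"      -> fields.getOrElse("data", null),
      "reference" -> fields.getOrElse("reference", null))
  }
```

For example, `padBlobValue(Some(Map("type" -> "INLINE", "data" -> "ab")))` yields a 3-field map whose `reference` is null, while `padBlobValue(None)` stays `None`, mirroring the null-struct-preserving behavior of the writer.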
dcb21ee to 9947842
hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds a quality-of-life improvement that lets users write BLOB columns with partial {type,data} (INLINE) or {type,reference} (OUT_OF_LINE) structs, padding the missing sibling at the writer entry. The recursion through struct/array/map and the SQL-side canonicalizer for INSERT INTO are nicely scoped, and the canonical-input short-circuit keeps it cheap.

A couple of observations around metadata preservation in nested rebuilds and a few smaller edge cases are worth double-checking; see the inline comments. Once addressed, this should be ready for a Hudi committer or PMC member to take it from here.

Two naming nits: rebuiltType breaks the surrounding pad* convention, and dfPreBlobPad is hard to parse at a glance; suggestions below.
private def padBlobStructValue(blobCol: Column, st: StructType, caseSensitive: Boolean): Column = {
  val key: String => String =
    if (caseSensitive) identity else (_: String).toLowerCase(Locale.ROOT)
  val present = st.fields.map(f => key(f.name)).toSet
🤖 When rebuilding a non-blob struct that contains a nested partial blob, the inner padField(...).as(f.name) doesn't propagate f.metadata, so the rebuilt struct's nested BLOB-tagged field loses its HoodieSchema.TYPE_METADATA_FIELD on the resulting DataFrame schema. The null branch via rebuiltType does preserve metadata, which makes the two branches asymmetric. Was this intentional (relying on the catalog/table schema downstream) or should it be .as(f.name, f.metadata) to keep validators / consumers that introspect the DF schema happy?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
    Binding(attr, attr, None)
  case other =>
    // Deterministic alias name from the canonicalized form so identical nested
    // sub-trees (e.g. across SELECT and WHERE) get the same alias name.
🤖 Integer.toHexString(other.canonicalized.hashCode()) can collide between two genuinely different source expressions, in which case we'd inject two Aliases with the same name into the synthetic Project. Different ExprIds make plan addressing safe, but Spark name resolution within that Project would be ambiguous. Probably benign in practice — flagging in case a longer or UUID-suffixed alias name is preferable.
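The collision concern above is real in principle: `Integer.toHexString(hashCode)` maps two distinct inputs to the same alias name whenever their hash codes coincide. A minimal, self-contained illustration using Java `String.hashCode` (Catalyst's expression hash differs, but the pigeonhole argument is identical):

```scala
// "Aa" and "BB" are the classic Java String.hashCode collision pair:
// 'A'*31 + 'a' == 'B'*31 + 'B' == 2112, so the hex-derived names coincide
// even though the inputs differ.
val nameA = Integer.toHexString("Aa".hashCode)
val nameB = Integer.toHexString("BB".hashCode)
```

Suffixing the alias name with something already unique within the plan (for example the alias's own ExprId) would remove the ambiguity while staying deterministic per plan; this is a suggestion, not something the PR currently does.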
// RFC-100 BLOB QoL: accept partial INLINE-only `{type,data}` or OUT_OF_LINE-only
// `{type,reference}` user inputs by padding the missing sibling field with null at the
// ingest boundary. No-op for already-canonical 3-field structs.
val df = HoodieSparkSchemaConverters.padPartialBlobColumns(dfPreBlobPad)
🤖 Padding now runs on the prepped/streaming branches too (where the early-return previously kept sourceDf untouched). The canonical-input short-circuit should make this a no-op, but is there any prepped path where the input dataframe has already been encoded such that running this projection through it could change semantics (e.g. lazy plan invalidation, breaking pre-bound encoders)? If not, all good — just calling it out.
/**
 * Returns the post-padding DataType corresponding to `dataType`: every accepted partial
 * blob struct is replaced by `expectedBlobStructType`; nested struct/array/map containers
 * are rebuilt with their inner types similarly transformed. Used to provide the
🤖 nit: rebuiltType sits in the middle of a pad*-named family (padPartialBlobColumns, padDataType, padBlobStructValue, padField) but doesn't follow that convention. Could you rename it to paddedType (or canonicalType) so readers immediately understand its role without having to read the Javadoc?
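Whatever it ends up being called, the helper's recursion can be modeled on a toy type algebra. This is a sketch only: the real code walks Spark `DataType`s, and `DT`, `paddedType`, and `canonicalBlob` here are stand-in names, not the PR's identifiers.

```scala
// Toy type algebra standing in for Spark's DataType hierarchy.
sealed trait DT
case class StructT(fields: Map[String, DT]) extends DT
case class ArrayT(elem: DT) extends DT
case class MapT(key: DT, value: DT) extends DT
case object StringT extends DT
case object BinaryT extends DT

// Stand-in for the canonical 3-field BLOB layout (expectedBlobStructType).
val canonicalBlob: DT =
  StructT(Map("type" -> StringT, "data" -> BinaryT, "reference" -> StringT))

// A struct is an accepted partial blob iff it carries exactly {type,data}
// or exactly {type,reference}.
def isPartialBlob(dt: DT): Boolean = dt match {
  case StructT(f) => f.keySet == Set("type", "data") || f.keySet == Set("type", "reference")
  case _          => false
}

// Post-padding type: replace partial blobs with the canonical struct and
// rebuild struct/array/map containers around recursively transformed children.
def paddedType(dt: DT): DT = dt match {
  case s if isPartialBlob(s) => canonicalBlob
  case StructT(f)            => StructT(f.map { case (n, t) => n -> paddedType(t) })
  case ArrayT(e)             => ArrayT(paddedType(e))
  case MapT(k, v)            => MapT(paddedType(k), paddedType(v))
  case other                 => other
}
```

For instance, an array of partial `{type,data}` structs pads to an array of the canonical struct, while an already-canonical type maps to itself (the no-op case).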
@@ -356,11 +357,15 @@ class HoodieSparkSqlWriterInternal {

val shouldReconcileSchema = parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean
val latestTableSchemaOpt = getLatestTableSchema(tableMetaClient, schemaFromCatalog)
🤖 nit: dfPreBlobPad is a bit hard to parse at a glance. Could you call it unpaddedDf (or rawDf) to make the before/after relationship with val df = …padPartialBlobColumns(…) on line 367 immediately clear?
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@             Coverage Diff              @@
## master #18665 +/- ##
============================================
+ Coverage 68.06% 68.10% +0.04%
- Complexity 28922 28947 +25
============================================
Files 2518 2519 +1
Lines 140574 140712 +138
Branches 17419 17469 +50
============================================
+ Hits 95682 95835 +153
+ Misses 37036 37010 -26
- Partials 7856 7867 +11
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
BLOB writes require the full 3-field `{type, data, reference}` struct on every row, even when only one sibling is used (`reference` is unused for INLINE, `data` is unused for OUT_OF_LINE). The boilerplate is the first thing people hit when writing a blob.

Note: Merge this after:
Summary and Changelog

- INLINE writes accept `{type, data}`.
- OUT_OF_LINE writes accept `{type, reference}`.
- Padding recurses through `StructType`, `ArrayType`, `MapType` (via Spark `transform`/`transform_values`), so partial blobs nested inside complex types work too.

Changes:
- `HoodieSparkSchemaConverters`: new public `padPartialBlobColumns(df)` plus recursive helpers (`padField`, `padDataType`, `padBlobStructValue`, `rebuiltType`).
- `HoodieSparkSqlWriter.writeInternal`: pads the source DataFrame just before the schema-conversion / validation call.
- `BlobTestHelpers`: added `inlineBlobStructColMinimal` and `outOfLineBlobStructColMinimal`.
- `TestReadBlobSQL`: minimal-struct tests for INLINE and OUT_OF_LINE plus a nested struct/array/map case.
- `TestBlobDataType`: SQL `named_struct` minimal-literal tests for both INLINE and OUT_OF_LINE.

Impact
User-facing: BLOB writes accept fewer fields. On-disk layout: unchanged (still canonical 3-field).
Read path: untouched.
Performance: padding short-circuits on canonical inputs (single schema walk, no projection emitted).
Risk Level
low
Padding only fires when a partial blob field is detected by a quick schema scan. Canonical inputs hit an early return. Null-struct semantics are preserved with `when(col.isNull, lit(null))`.

Documentation Update
none
Contributor's checklist