
feat(blob): Accept partial {type,data} or {type,reference} structs on write #18665

Open
voonhous wants to merge 3 commits into apache:master from voonhous:fix-qol-blob-issue

Conversation

@voonhous
Member

@voonhous voonhous commented Apr 30, 2026

Describe the issue this Pull Request addresses

BLOB writes require the full 3-field {type, data, reference} struct on every row, even when only one sibling is used (reference is unused for INLINE, data is unused for OUT_OF_LINE). The boilerplate is the first thing people hit when writing a blob.

Note: Merge this after:

  1. test(schema): Add MOR log-only compaction tests for custom types #18583
  2. test(schema): Add lance fileformat test for custom types on MOR #18597

Summary and Changelog

  • INLINE writes now accept {type, data}.
  • OUT_OF_LINE writes now accept {type, reference}.
  • The missing sibling is padded with null at the writer entry point; a canonical 3-field input is a no-op.
  • Padding recurses through StructType, ArrayType, MapType (via Spark transform / transform_values), so partial blobs nested inside complex types work too.
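
For illustration, the accepted shapes above could be exercised from SQL like this. This is a minimal, hedged sketch: the blobs table, the id / payload columns, and the struct field types (string type tag, binary data, string reference) are assumptions for illustration, not taken from this PR; only the INLINE / OUT_OF_LINE naming follows the description above.

// Hedged sketch only: table and column names are hypothetical.

// INLINE: only {type, data} is supplied; the writer pads `reference` with null.
spark.sql("INSERT INTO blobs SELECT 1, named_struct('type', 'INLINE', 'data', X'DEADBEEF')")

// OUT_OF_LINE: only {type, reference} is supplied; the writer pads `data` with null.
spark.sql("INSERT INTO blobs SELECT 2, named_struct('type', 'OUT_OF_LINE', 'reference', '/tmp/blob-2.bin')")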

Changes:

  • HoodieSparkSchemaConverters: new public padPartialBlobColumns(df) plus recursive helpers (padField, padDataType, padBlobStructValue, rebuiltType).
  • HoodieSparkSqlWriter.writeInternal: pads the source DataFrame just before the schema-conversion / validation call.
  • BlobTestHelpers: added inlineBlobStructColMinimal and outOfLineBlobStructColMinimal.
  • TestReadBlobSQL: minimal-struct tests for INLINE and OUT_OF_LINE plus a nested struct/array/map case.
  • TestBlobDataType: SQL named_struct minimal-literal tests for both INLINE and OUT_OF_LINE.
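
As a rough sketch of how null-padding can recurse through containers: the code below is an illustration of the approach under stated assumptions, not this PR's actual helpers. isPartialBlobStruct is a placeholder predicate (the real code presumably keys off the BLOB type metadata rather than field names), and the data/reference field types are assumed to be binary/string.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Placeholder detection: a stub that only looks at field names.
def isPartialBlobStruct(st: StructType): Boolean = {
  val names = st.fieldNames.map(_.toLowerCase).toSet
  names.subsetOf(Set("type", "data", "reference")) && names.contains("type") && st.fields.length < 3
}

// Pad a partial blob struct to the canonical {type, data, reference} shape,
// recursing into arrays and maps via transform / transform_values.
def padColumn(col: Column, dt: DataType): Column = dt match {
  case st: StructType if isPartialBlobStruct(st) =>
    when(col.isNull, lit(null)).otherwise(struct(          // keep null structs null
      col.getField("type").as("type"),
      (if (st.fieldNames.contains("data")) col.getField("data") else lit(null).cast(BinaryType)).as("data"),
      (if (st.fieldNames.contains("reference")) col.getField("reference") else lit(null).cast(StringType)).as("reference")))
  case st: StructType =>
    // rebuild non-blob structs field by field (field metadata propagation is omitted in this sketch)
    struct(st.fields.map(f => padColumn(col.getField(f.name), f.dataType).as(f.name)): _*)
  case ArrayType(elementType, _) =>
    transform(col, x => padColumn(x, elementType))
  case MapType(_, valueType, _) =>
    transform_values(col, (k, v) => padColumn(v, valueType))
  case _ => col
}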

Impact

User-facing: BLOB writes accept fewer fields. On-disk layout: unchanged (still canonical 3-field).
Read path: untouched.
Performance: padding short-circuits on canonical inputs (single schema walk, no projection emitted).

Risk Level

low

Padding only fires when a partial blob field is detected by a quick schema scan. Canonical inputs hit an early return. Null-struct semantics are preserved with when(col.isNull, lit(null)).
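
Building on the sketch above, the short-circuit could look roughly like this (again illustrative; hasPartialBlob, isPartialBlobStruct, and padColumn are the placeholder helpers from the earlier sketch, not this PR's code):

import org.apache.spark.sql.DataFrame

// Single schema walk: if nothing needs padding, return the input untouched so
// canonical writes never pay for an extra projection.
def hasPartialBlob(dt: DataType): Boolean = dt match {
  case st: StructType    => isPartialBlobStruct(st) || st.fields.exists(f => hasPartialBlob(f.dataType))
  case ArrayType(et, _)  => hasPartialBlob(et)
  case MapType(_, vt, _) => hasPartialBlob(vt)
  case _                 => false
}

def padPartialBlobColumnsSketch(df: DataFrame): DataFrame =
  if (!df.schema.fields.exists(f => hasPartialBlob(f.dataType))) df
  else df.select(df.schema.fields.map(f => padColumn(df(f.name), f.dataType).as(f.name)): _*)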

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

voonhous added 3 commits May 1, 2026 01:52
Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor
and payload shape of a custom-typed column survive inline compaction of
a log-only MOR table into a base file.

- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata
  (5 commits via SQL + MERGE INTO to trigger default inline compaction).
- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+,
  asserting native VariantType round-trips through compaction.
- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline
  uses named_struct with hex byte literals; out-of-line creates real files
  via BlobTestHelpers.createTestFile and verifies bytes via read_blob().
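
For context, the INLINE case in such a test might be exercised roughly as follows. This is a hedged sketch: the table and column names, the NULL-reference cast, and the exact read_blob() signature are assumptions based on the description above, not copied from the test.

// Canonical 3-field INLINE literal: hex byte literal for `data`, explicit null `reference`.
spark.sql("INSERT INTO blob_tbl SELECT 1, named_struct('type', 'INLINE', 'data', X'CAFEBABE', 'reference', CAST(NULL AS STRING))")

// Verify the stored bytes round-trip through read_blob().
spark.sql("SELECT read_blob(payload) FROM blob_tbl WHERE id = 1").show(false)
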
Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and
BLOB onto the Lance base file format, and extend all variants with a
6th deltacommit so the cleaner has a chance to retire the post-compaction
log-only slice and write a .clean instant.

- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table
  config and that a .lance base file exists under the table path after
  compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by
  -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter ->
  LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the
  LANCE format config sticks to hoodie.properties immediately after
  CREATE TABLE so the table-level invariant is checked even when the
  writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after
  the compaction-triggering 5th commit. The 5th commit's auto-clean
  runs before inline compaction, so the prior log slice is not yet
  superseded; the 6th commit's postCommit clean retires it and writes
  the .clean instant. The cleaner-timeline assertion uses
  reloadActiveTimeline() to avoid a stale cached view.
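
A rough sketch of that cleaner-timeline assertion (illustrative; metaClient stands for the test's HoodieTableMetaClient, and the exact assertion in the tests may differ):

// Re-read the timeline from storage so a stale cached view cannot hide the
// .clean instant written by the 6th commit's post-commit clean.
val timeline = metaClient.reloadActiveTimeline()
val completedCleans = timeline.getCleanerTimeline.filterCompletedInstants()
assert(completedCleans.countInstants() > 0, "expected a completed .clean instant after the 6th commit")
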
… write

INLINE writes now accept the natural `{type, data}` shape and OUT_OF_LINE
writes accept `{type, reference}`; the missing sibling field is auto-padded
with null at the writer ingest boundary so the canonical 3-field BLOB layout
is preserved on disk. Padding recurses through StructType, ArrayType, and
MapType (via Spark's transform / transform_values) so nested partial blobs
are handled too. Already-canonical inputs are a no-op.
voonhous force-pushed the fix-qol-blob-issue branch from dcb21ee to 9947842 on April 30, 2026 at 18:07
github-actions bot added the size:XL (PR with lines of changes > 1000) label on Apr 30, 2026
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds a quality-of-life improvement that lets users write BLOB columns with partial {type,data} (INLINE) or {type,reference} (OUT_OF_LINE) structs, padding the missing sibling at the writer entry. The recursion through struct/array/map and the SQL-side canonicalizer for INSERT INTO are nicely scoped, and the canonical-input short-circuit keeps it cheap.

A couple of observations are worth double-checking in the inline comments, around metadata preservation in nested rebuilds and a few smaller edge cases. Two naming nits: rebuiltType breaks the surrounding pad* convention, and dfPreBlobPad is hard to parse at a glance; suggestions below. Once those are addressed, this should be ready for a Hudi committer or PMC member to take it from here.

private def padBlobStructValue(blobCol: Column, st: StructType, caseSensitive: Boolean): Column = {
val key: String => String =
if (caseSensitive) identity else (_: String).toLowerCase(Locale.ROOT)
val present = st.fields.map(f => key(f.name)).toSet
Contributor


🤖 When rebuilding a non-blob struct that contains a nested partial blob, the inner padField(...).as(f.name) doesn't propagate f.metadata, so the rebuilt struct's nested BLOB-tagged field loses its HoodieSchema.TYPE_METADATA_FIELD on the resulting DataFrame schema. The null branch via rebuiltType does preserve metadata, which makes the two branches asymmetric. Was this intentional (relying on the catalog/table schema downstream) or should it be .as(f.name, f.metadata) to keep validators / consumers that introspect the DF schema happy?


Binding(attr, attr, None)
case other =>
// Deterministic alias name from the canonicalized form so identical nested
// sub-trees (e.g. across SELECT and WHERE) get the same alias name.
Contributor


🤖 Integer.toHexString(other.canonicalized.hashCode()) can collide between two genuinely different source expressions, in which case we'd inject two Aliases with the same name into the synthetic Project. Different ExprIds make plan addressing safe, but Spark name resolution within that Project would be ambiguous. Probably benign in practice — flagging in case a longer or UUID-suffixed alias name is preferable.


// RFC-100 BLOB QoL: accept partial INLINE-only `{type,data}` or OUT_OF_LINE-only
// `{type,reference}` user inputs by padding the missing sibling field with null at the
// ingest boundary. No-op for already-canonical 3-field structs.
val df = HoodieSparkSchemaConverters.padPartialBlobColumns(dfPreBlobPad)
Contributor


🤖 Padding now runs on the prepped/streaming branches too (where the early-return previously kept sourceDf untouched). The canonical-input short-circuit should make this a no-op, but is there any prepped path where the input dataframe has already been encoded such that running this projection through it could change semantics (e.g. lazy plan invalidation, breaking pre-bound encoders)? If not, all good — just calling it out.


/**
* Returns the post-padding DataType corresponding to `dataType`: every accepted partial
* blob struct is replaced by `expectedBlobStructType`; nested struct/array/map containers
* are rebuilt with their inner types similarly transformed. Used to provide the
Contributor


🤖 nit: rebuiltType sits in the middle of a pad*-named family (padPartialBlobColumns, padDataType, padBlobStructValue, padField) but doesn't follow that convention. Could you rename it to paddedType (or canonicalType) so readers immediately understand its role without having to read the Javadoc?


@@ -356,11 +357,15 @@ class HoodieSparkSqlWriterInternal {

val shouldReconcileSchema = parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean
val latestTableSchemaOpt = getLatestTableSchema(tableMetaClient, schemaFromCatalog)
Contributor


🤖 nit: dfPreBlobPad is a bit hard to parse at a glance. Could you call it unpaddedDf (or rawDf) to make the before/after relationship with val df = …padPartialBlobColumns(…) on line 367 immediately clear?


@hudi-bot
Collaborator

CI report:

Bot commands. @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 75.13514% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.10%. Comparing base (38db5ed) to head (9947842).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala 68.33% 13 Missing and 6 partials ⚠️
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala 79.41% 3 Missing and 11 partials ⚠️
...ark/sql/hudi/command/InsertBlobCanonicalizer.scala 78.18% 0 Missing and 12 partials ⚠️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18665      +/-   ##
============================================
+ Coverage     68.06%   68.10%   +0.04%     
- Complexity    28922    28947      +25     
============================================
  Files          2518     2519       +1     
  Lines        140574   140712     +138     
  Branches      17419    17469      +50     
============================================
+ Hits          95682    95835     +153     
+ Misses        37036    37010      -26     
- Partials       7856     7867      +11     
Flag Coverage Δ
common-and-other-modules 44.32% <0.54%> (-0.05%) ⬇️
hadoop-mr-java-client 44.95% <ø> (+0.06%) ⬆️
spark-client-hadoop-common 48.39% <0.00%> (-0.05%) ⬇️
spark-java-tests 49.09% <58.37%> (+0.45%) ⬆️
spark-scala-tests 45.18% <43.24%> (+0.48%) ⬆️
utilities 37.67% <4.32%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 78.50% <50.00%> (+0.03%) ⬆️
...ark/sql/hudi/command/InsertBlobCanonicalizer.scala 78.18% <78.18%> (ø)
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala 77.59% <79.41%> (+0.51%) ⬆️
.../org/apache/spark/sql/hudi/blob/ReadBlobRule.scala 57.30% <68.33%> (+10.63%) ⬆️

... and 10 files with indirect coverage changes


Labels

size:XL PR with lines of changes > 1000
