
Check if allowing nullable for all datatypes in blob is too permissive. #18601

@voonhous

Task Description

What needs to be done:
In https://github.com/apache/hudi/pull/18540/changes#diff-3e2a24e519a0cf4b097131aa5ad08a41a9d16bebc85aa39cf168bc344e9a2e0dR285-R297, we made all fields in the blob type nullable to get tests to pass on all Spark engines (3.3, 3.4, 3.5 and 4.0). This may be too permissive.

This workaround was to address: #18547.

Address @rahil-c's comment and ensure that we are not being so permissive that we allow unintended user behaviours.

Background of the fix

RFC-100 declares the BLOB Avro record with three strictly non-null fields: type, reference.external_path, reference.managed.

The contract is actually conditional ("required when its parent is present", data matters only when type='INLINE', reference.* only when type='OUT_OF_LINE'), but Avro can't say that, so it just says "non-null" everywhere.
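To make the conditional contract concrete, here is a minimal sketch in plain Python (not Hudi code). The field names come from the description above; the function name `blob_contract_violation`, the reading of "data matters only when type='INLINE'" as "data is required", and the rejection of unknown type values are assumptions made for illustration.

```python
from typing import Any, Mapping, Optional


def blob_contract_violation(blob: Mapping[str, Any]) -> Optional[str]:
    """Return a message for the first contract violation, or None if valid.

    Hypothetical helper: models the conditional rules as plain checks,
    which is exactly what a flat "non-null" Avro declaration cannot do.
    """
    blob_type = blob.get("type")
    if blob_type is None:
        return "type must not be null"

    if blob_type == "INLINE":
        # Assumption: inline blobs must carry their bytes in `data`;
        # `reference` is irrelevant and may be null.
        if blob.get("data") is None:
            return "data is required when type='INLINE'"
        return None

    if blob_type == "OUT_OF_LINE":
        reference = blob.get("reference")
        if reference is None:
            return "reference is required when type='OUT_OF_LINE'"
        # "Required when its parent is present": only checked once we
        # know the parent struct exists.
        for field in ("external_path", "managed"):
            if reference.get(field) is None:
                return f"reference.{field} is required when reference is present"
        return None

    # Assumption: any other type value is rejected.
    return f"unknown blob type: {blob_type!r}"
```

An inline blob with a null `reference` passes this check, even though a schema that marks `reference.external_path` as unconditionally non-null would suggest otherwise.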

When that strict declaration is reflected back into Spark via toSqlType, the Spark catalog type for payload ends up with non-null type, reference.external_path, reference.managed.

Then on every write path we hit a chain of analyzer/resolver checks against user-supplied source structs:

  1. validateBlobStructure: the old version compared dataType and nullable field-by-field.

    The user's INSERT INTO ... values (1, named_struct(... 'data', cast(X'010203' as binary), 'reference', cast(null as struct<...>))) produces a source whose per-field nullability differs from the catalog's strict declaration, causing it to be rejected.

  2. TableOutputResolver (Spark 3.4+): even if we skip the validator, resolveOutputColumns walks nested struct assignments and rejects nullable-source -> non-null-target narrowing.

    User-supplied named_struct fields are nullable by default, so any assignment into the strict BLOB struct fails at analyzer time, before Hudi sees the write.

  3. castIfNeeded -> Cast (used by UpdateHoodieTableCommand and MergeIntoHoodieTableCommand): Cast (a) strips custom Metadata (the hudi_type tag we use to recognize BLOB), and (b) on some Spark versions performs its own nullability-narrowing check via Cast.canCast on nested structs.
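The nullable-source to non-null-target rejection described in the checks above can be illustrated with a small model. This is a sketch in plain Python, not Spark's actual resolver: `Field`, `Struct`, `blob_struct` and `narrowing_paths` are hypothetical names, the BLOB struct is reduced to the fields mentioned in this issue, and the source is modelled as fully nullable, following the description of named_struct above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Field:
    name: str
    dtype: Union[str, "Struct"]  # primitive type name or nested struct
    nullable: bool


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


def blob_struct(strict: bool) -> Struct:
    """Simplified BLOB struct; strict=True mirrors the non-null declaration."""
    relaxed = not strict
    reference = Struct((
        Field("external_path", "string", nullable=relaxed),
        Field("managed", "boolean", nullable=relaxed),
    ))
    return Struct((
        Field("type", "string", nullable=relaxed),
        Field("data", "binary", nullable=True),
        Field("reference", reference, nullable=True),
    ))


def narrowing_paths(source: Struct, target: Struct, prefix: str = "") -> List[str]:
    """List field paths where a nullable source feeds a non-null target."""
    paths: List[str] = []
    for src, tgt in zip(source.fields, target.fields):
        path = prefix + tgt.name
        if src.nullable and not tgt.nullable:
            paths.append(path)
        # Walk nested structs the same way, as the resolver is described to do.
        if isinstance(src.dtype, Struct) and isinstance(tgt.dtype, Struct):
            paths.extend(narrowing_paths(src.dtype, tgt.dtype, path + "."))
    return paths


user_source = blob_struct(strict=False)  # everything nullable
print(narrowing_paths(user_source, blob_struct(strict=True)))
print(narrowing_paths(user_source, blob_struct(strict=False)))
```

Against the strict target, the first call reports `type`, `reference.external_path` and `reference.managed`; against the permissive target there is nothing to reject.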

So the strict catalog-side BLOB type collides with every Spark write-path rewrite, on every DML verb, on multiple Spark versions, for a contract that was already a partial lie (because the conditional non-null can't be expressed).

As of now, the fix in #18540 poses no risk to stored data, because the write path still strictly adheres to the on-disk RFC-100 contract.

The physical Avro schema is not derived from the Spark type: the write path goes through HoodieSchema.Blob.createBlob() (called from toHoodieTypeNested), which builds the canonical RFC-100 record fresh from RFC-100 definitions.

So data on disk is still type STRING NOT NULL, reference.external_path STRING NOT NULL, reference.managed BOOLEAN NOT NULL.

TLDR generated by Claude:
The strict declaration was already conditional, and Spark's type system can't model "conditional non-null." Trying to keep the strict declaration on the Spark side made every write path fight Spark's nullability machinery for a guarantee that wasn't really enforceable there anyway.

Pushing the non-null enforcement to the BLOB-aware physical writer (createBlob()) and presenting Spark with a uniformly-permissive type lets every generic write path (INSERT/UPDATE/MERGE on 3.3 / 3.4 / 3.5 / 4.0) pass through unchanged.

The validateBlobStructure change to ignore nullability (matchesStructure) and the per-field nullable = true projection are two parts of the same idea: structural shape is the contract on the Spark side; nullability is enforced at the physical-write boundary.
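The two parts can be sketched together in plain Python. This is an illustrative model, not the Hudi implementation: `Field`, `Struct`, `as_nullable`, `matches_structure` and `STRICT_BLOB` are hypothetical names, and the BLOB struct is reduced to the fields mentioned in this issue.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    name: str
    dtype: Union[str, "Struct"]  # primitive type name or nested struct
    nullable: bool


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


def as_nullable(struct: Struct) -> Struct:
    """Project every field, at every nesting level, to nullable=True."""
    return Struct(tuple(
        replace(
            f,
            dtype=as_nullable(f.dtype) if isinstance(f.dtype, Struct) else f.dtype,
            nullable=True,
        )
        for f in struct.fields
    ))


def matches_structure(a: Struct, b: Struct) -> bool:
    """Compare names and data types field by field, ignoring nullability."""
    if len(a.fields) != len(b.fields):
        return False
    for fa, fb in zip(a.fields, b.fields):
        if fa.name != fb.name:
            return False
        a_nested = isinstance(fa.dtype, Struct)
        if a_nested != isinstance(fb.dtype, Struct):
            return False
        if a_nested:
            if not matches_structure(fa.dtype, fb.dtype):
                return False
        elif fa.dtype != fb.dtype:
            return False
    return True


# Simplified strict declaration (assumed subset of the RFC-100 fields).
STRICT_BLOB = Struct((
    Field("type", "string", nullable=False),
    Field("data", "binary", nullable=True),
    Field("reference", Struct((
        Field("external_path", "string", nullable=False),
        Field("managed", "boolean", nullable=False),
    )), nullable=True),
))

user_source = as_nullable(STRICT_BLOB)
print(user_source == STRICT_BLOB)                   # exact comparison
print(matches_structure(user_source, STRICT_BLOB))  # shape-only comparison
```

The exact comparison fails because the nullability flags differ, while the shape-only comparison succeeds. Under this split, rejecting a genuinely invalid record is left to the physical writer.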

Why this task is needed:

Task Type

Code improvement/refactoring

Related Issues

Parent feature issue: (if applicable )
Related issues:
NOTE: Use Relationships button to add parent/blocking issues after issue is created.
