Task Description
What needs to be done:
In https://github.com/apache/hudi/pull/18540/changes#diff-3e2a24e519a0cf4b097131aa5ad08a41a9d16bebc85aa39cf168bc344e9a2e0dR285-R297, we made the fields in the blob type nullable so that tests pass on all Spark engines (3.3, 3.4, 3.5 and 4.0). This may be too permissive.
This workaround was to address: #18547.
Address @rahil-c's comment and ensure that we are not being so permissive that we allow unintended user behaviours.
Background of the fix
RFC-100 declares the BLOB Avro record with three strictly non-null fields: type, reference.external_path, reference.managed.
The contract is actually conditional ("required when its parent is present", data matters only when type='INLINE', reference.* only when type='OUT_OF_LINE'), but Avro can't say that, so it just says "non-null" everywhere.
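The conditional contract described above can be sketched as a plain validation function. This is a minimal illustration, not Hudi code: the dict layout and the helper name `check_blob_contract` are hypothetical, and reading "matters" as "required" is an assumption. Only the field names and the INLINE / OUT_OF_LINE rule come from the description above.

```python
from typing import Any, Mapping, Optional


def check_blob_contract(blob: Mapping[str, Any]) -> Optional[str]:
    """Return None if the blob satisfies the conditional contract, else a reason.

    Hypothetical helper for illustration only; not part of Hudi.
    """
    blob_type = blob.get("type")
    if blob_type is None:
        return "type must not be null"

    if blob_type == "INLINE":
        # Inline blobs carry their bytes in `data`; `reference` may be null.
        if blob.get("data") is None:
            return "data is required when type='INLINE'"
        return None

    if blob_type == "OUT_OF_LINE":
        # Out-of-line blobs point elsewhere; `data` may be null.
        reference = blob.get("reference")
        if reference is None:
            return "reference is required when type='OUT_OF_LINE'"
        # "Required when its parent is present": checked only once we know
        # that `reference` itself is not null.
        for field in ("external_path", "managed"):
            if reference.get(field) is None:
                return f"reference.{field} is required when reference is present"
        return None

    # Assumption: any other type value is treated as invalid in this sketch.
    return f"unknown blob type: {blob_type!r}"
```

A flat "non-null everywhere" schema cannot express either branch: it would reject the INLINE row whose `reference` is null even though that row satisfies the contract.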
When that strict declaration is reflected back into Spark via toSqlType, the Spark catalog type for payload ends up with non-null type, reference.external_path, reference.managed.
Then on every write path we hit a chain of analyzer/resolver checks against user-supplied source structs:
- validateBlobStructure: the old version compared dataType and nullable field-by-field. The user's INSERT INTO ... values (1, named_struct(... 'data', cast(X'010203' as binary), 'reference', cast(null as struct<...>))) produces a source whose per-field nullability differs from the catalog's strict declaration, causing it to be rejected.
- TableOutputResolver (Spark 3.4+): even if we skip the validator, resolveOutputColumns walks nested struct assignments and rejects nullable-source -> non-null-target narrowing. User-supplied named_struct fields are nullable by default, so any assignment into the strict BLOB struct fails at analyzer time, before Hudi sees the write.
- castIfNeeded -> Cast (used by UpdateHoodieTableCommand and MergeIntoHoodieTableCommand): Cast (a) strips custom Metadata (the hudi_type tag we use to recognize BLOB), and (b) on some Spark versions performs its own nullability-narrowing check via Cast.canCast on nested structs.
So the strict catalog-side BLOB type collides with every Spark write-path rewrite, on every DML verb, on multiple Spark versions, for a contract that was already a partial lie (because the conditional non-null can't be expressed).
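The nullability-narrowing rejection in the list above can be modelled in a few lines. This is a simplified sketch and not Spark's actual resolver logic: the dict-based schema encoding (field name mapped to a `(type, nullable)` pair), the helper name `find_narrowing`, and the reduced field set are all assumptions made for illustration.

```python
def find_narrowing(source, target, path=""):
    """Return paths where a nullable source field feeds a non-null target field.

    Schemas are dicts of name -> (dtype, nullable); a nested struct's dtype
    is itself such a dict. Hypothetical model, not Spark code.
    """
    problems = []
    for name, (t_type, t_nullable) in target.items():
        s_type, s_nullable = source[name]
        full = f"{path}{name}"
        if s_nullable and not t_nullable:
            problems.append(full)
        if isinstance(t_type, dict) and isinstance(s_type, dict):
            problems.extend(find_narrowing(s_type, t_type, full + "."))
    return problems


# Strict catalog-side type, reduced to the fields named in this issue.
STRICT_TARGET = {
    "type": ("string", False),
    "data": ("binary", True),
    "reference": (
        {"external_path": ("string", False), "managed": ("boolean", False)},
        True,
    ),
}

# A user-supplied named_struct: every field is nullable by default.
USER_SOURCE = {
    "type": ("string", True),
    "data": ("binary", True),
    "reference": (
        {"external_path": ("string", True), "managed": ("boolean", True)},
        True,
    ),
}

narrowed = find_narrowing(USER_SOURCE, STRICT_TARGET)
# narrowed lists exactly the three strictly non-null RFC-100 fields.
```

Under this model, every field the RFC declares non-null becomes a rejection point for an ordinary user-written struct, which is the collision described above.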
As of now, the fix in #18540 poses no risk, as it strictly adheres to the on-disk RFC-100 contract.
The physical Avro schema is not derived from the Spark type; the write path goes through HoodieSchema.Blob.createBlob() (called from toHoodieTypeNested), which builds the canonical RFC-100 record fresh from RFC-100 definitions.
So data on disk is still type STRING NOT NULL, reference.external_path STRING NOT NULL, reference.managed BOOLEAN NOT NULL.
TLDR generated by Claude:
The strict declaration was already conditional, and Spark's type system can't model "conditional non-null." Trying to keep the strict declaration on the Spark side made every write path fight Spark's nullability machinery for a guarantee that wasn't really enforceable there anyway.
Pushing the non-null enforcement to the BLOB-aware physical writer (createBlob()) and presenting Spark with a uniformly-permissive type lets every generic write path (INSERT/UPDATE/MERGE on 3.3 / 3.4 / 3.5 / 4.0) pass through unchanged.
The validateBlobStructure change to ignore nullability (matchesStructure) and the per-field nullable = true projection are two parts of the same idea: structural shape is the contract on the Spark side; nullability is enforced at the physical-write boundary.
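The two parts of that idea can be sketched together. This is a minimal model under stated assumptions, not the Hudi implementation: schemas are encoded as dicts of field name to `(type, nullable)`, the helpers `relax` and `matches_structure` are hypothetical stand-ins for the `nullable = true` projection and the nullability-blind comparison, and the field set is reduced to the fields named in this issue.

```python
def relax(struct):
    """Return a copy of the struct with every field, at every depth, nullable."""
    return {
        name: (relax(dtype) if isinstance(dtype, dict) else dtype, True)
        for name, (dtype, _nullable) in struct.items()
    }


def matches_structure(a, b):
    """Compare field names, order and types recursively, ignoring nullability."""
    if isinstance(a, dict) and isinstance(b, dict):
        return list(a) == list(b) and all(
            matches_structure(a[name][0], b[name][0]) for name in a
        )
    return a == b


# Strict declaration, as the on-disk contract states it.
STRICT = {
    "type": ("string", False),
    "data": ("binary", True),
    "reference": (
        {"external_path": ("string", False), "managed": ("boolean", False)},
        True,
    ),
}

# What Spark is shown: same shape, uniformly nullable.
PERMISSIVE = relax(STRICT)

same_shape = matches_structure(STRICT, PERMISSIVE)   # shape is preserved
identical = STRICT == PERMISSIVE                     # nullability differs
```

The point for this task is the gap the sketch makes visible: `matches_structure` accepts any nullability, so whatever must be rejected has to be caught at the physical-write boundary rather than on the Spark side.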
Why this task is needed:
Task Type
Code improvement/refactoring
Related Issues
Parent feature issue: (if applicable )
Related issues:
NOTE: Use Relationships button to add parent/blocking issues after issue is created.