feat(schema) Ensure spark SQL insert into preserves hudi_type for VEC… #18608
Draft
rahil-c wants to merge 2 commits into
… to logical-type fields

Forcing the catalog's nullability across every field (including nested structs of ordinary columns) shifted the Avro union layout for unrelated columns and broke ~90+ MergeInto / partial-update / CDC / secondary-index tests with 'Malformed data. Length is negative' in Avro deserialization.

Narrow the helper: re-stamp nullability + recurse only when the catalog field carries the 'hudi_type' metadata key (BLOB / VECTOR / VARIANT). For ordinary fields, copy metadata only and leave Spark's resolved nullability untouched.
Codecov Report
@@             Coverage Diff              @@
##             master   #18608      +/-   ##
============================================
- Coverage     68.90%   68.63%   -0.28%
+ Complexity    28581    28483      -98
============================================
  Files          2482     2492      +10
  Lines        137053   137282     +229
  Branches      16713    16745      +32
============================================
- Hits          94436    94217     -219
- Misses        35009    35550     +541
+ Partials       7608     7515      -93
Describe the issue this Pull Request addresses
VECTOR and BLOB columns lose their `hudi_type` `StructField` metadata during Spark's INSERT INTO column resolution. By the time `HoodieSparkSqlWriter` converts `df.schema` to a HoodieSchema, the metadata is gone and the converter falls through to the plain Avro record/array branches. The schema-compat check then sees a logical-typed catalog schema versus a plain incoming schema and fails with `SchemaBackwardsCompatibilityException: MISSING_UNION_BRANCH`.

A nested projection on a BLOB column (`SELECT payload.reference.external_path FROM t`) hits the same metadata-loss pattern on the read path inside `pruneDataSchemaInternal`, throwing `Data schema is not a record`.

Summary and Changelog
A scoped fix focused on the SQL INSERT INTO and read paths. Smaller than #18540: it does not cover UPDATE/MERGE call-site rewiring, the BLOB Spark-shape flip in `toSqlType`, or `ArrayType.containsNull` recursion.

The read-side change is the same short-circuit as #18566.
- `HoodieSchemaConversionUtils.alignSchemaMetadataFromCatalog` (new): re-stamps `StructField.metadata` and `nullable` from the catalog StructType onto the source StructType, recursing into nested structs.
- `HoodieSparkSqlWriter.write`: calls the helper before `convertStructTypeToHoodieSchema(df.schema, ...)` when `schemaFromCatalog` is provided. No-op for direct DataFrame writers that don't pass it.
- `HoodieSparkSchemaConverters.validateBlobStructure`: replaces strict `.equals` with a recursive structural matcher that ignores nullability differences. Spark's INSERT can tighten a field's nullability from the source's expression; the on-disk Avro side enforces non-null invariants separately.
- `HoodieSchemaUtils.pruneDataSchemaInternal`: short-circuits the `case RECORD` branch when the data schema is BLOB or VARIANT. Both carry `LogicalType.validate()` contracts requiring the full canonical inner layout, so partial pruning is not legal anyway. (Same patch as fix(schema): Allow nested projection on BLOB and VARIANT columns in p… #18566.)
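For readers who want the alignment rule in executable form, here is a minimal Python sketch. It is not the PR's Scala code: `Field`, `Struct`, and `align_from_catalog` are hypothetical stand-ins for Spark's `StructField` / `StructType` and the new helper, and the `"VECTOR(3)"` metadata value is made up for illustration. The sketch follows the narrowed rule from the follow-up commit message (nullability is re-stamped only for fields whose catalog entry carries `hudi_type`). Two details are assumptions rather than facts from the PR: nested fields inside a logical-type field are all re-stamped, and an empty catalog metadata map leaves the source metadata in place.

```python
from dataclasses import dataclass, field, replace
from typing import Union

HUDI_TYPE_KEY = "hudi_type"  # metadata key named in this PR


@dataclass
class Field:
    """Hypothetical stand-in for Spark's StructField."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool = True
    metadata: dict = field(default_factory=dict)


@dataclass
class Struct:
    """Hypothetical stand-in for Spark's StructType."""
    fields: list


def align_from_catalog(source, catalog, inside_logical=False):
    """Return a copy of `source` with catalog metadata copied onto matching fields."""
    catalog_by_name = {f.name: f for f in catalog.fields}
    aligned = []
    for src in source.fields:
        cat = catalog_by_name.get(src.name)
        if cat is None:
            # Field not present in the catalog: pass it through untouched.
            aligned.append(src)
            continue
        # Assumption: empty catalog metadata does not wipe the source metadata.
        metadata = dict(cat.metadata) if cat.metadata else dict(src.metadata)
        if not (inside_logical or HUDI_TYPE_KEY in cat.metadata):
            # Ordinary field: copy metadata only, keep the resolved nullability.
            aligned.append(replace(src, metadata=metadata))
            continue
        # Logical-type field: re-stamp nullability and recurse into nested structs.
        dtype = src.dtype
        if isinstance(src.dtype, Struct) and isinstance(cat.dtype, Struct):
            dtype = align_from_catalog(src.dtype, cat.dtype, inside_logical=True)
        aligned.append(
            replace(src, dtype=dtype, nullable=cat.nullable, metadata=metadata)
        )
    return Struct(aligned)


catalog = Struct([
    Field("id", "int", nullable=True),
    Field("emb", "array<float>", nullable=True,
          metadata={HUDI_TYPE_KEY: "VECTOR(3)"}),  # illustrative value
])
# What the resolved INSERT source might look like: metadata gone, nullability tightened.
resolved = Struct([
    Field("id", "int", nullable=False),
    Field("emb", "array<float>", nullable=False),
    Field("extra", "string"),
])
result = align_from_catalog(resolved, catalog)
```

In this sketch `id` keeps its resolved `nullable=False`, `emb` gets the catalog's metadata and nullability back, and `extra` passes through unchanged.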
Tests:
- `TestHoodieSchemaConversionUtils`: 7 unit cases for the helper (BLOB/VECTOR metadata copy, nullability widen + narrow, nested-struct recursion, field-not-in-catalog passthrough, empty-catalog-metadata preservation).
- `TestCreateTable`: end-to-end regression test: `CREATE TABLE t (id INT, payload BLOB, emb VECTOR(3)) USING hudi` → `INSERT INTO t SELECT … FROM tempview` → round-trip read.

Example
Before — fails:
After — succeeds:
Impact
`INSERT INTO ... SELECT` and nested projection now work on tables with BLOB / VECTOR columns. No public API change. No config change.
`StructType` traversal per write command (bounded by catalog schema size). Negligible.
Risk Level
Low. The helper runs only when `schemaFromCatalog` is provided (`InsertIntoHoodieTableCommand` path); direct DataFrame writers fall back to existing behavior. The validator change relaxes a strict-equality check to structural matching, which is strictly more permissive on the Spark side; on-disk Avro invariants are unchanged. The pruner short-circuit returns the unmodified data schema, matching what a non-projected read already does.
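The two relaxations discussed above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the Scala in this PR: `Field`, `Struct`, `same_shape`, and `prune` are hypothetical names, the `logical` tag stands in for the HoodieSchema logical type, and the BLOB inner layout shown here (apart from `reference.external_path`, which appears in the description) is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool = True


@dataclass
class Struct:
    """Hypothetical stand-in for a record type; `logical` tags BLOB / VARIANT."""
    fields: list
    logical: Optional[str] = None


def same_shape(a, b) -> bool:
    """Structural match on field names, order, and types; nullability is ignored."""
    if isinstance(a, Struct) and isinstance(b, Struct):
        return len(a.fields) == len(b.fields) and all(
            x.name == y.name and same_shape(x.dtype, y.dtype)
            for x, y in zip(a.fields, b.fields)
        )
    if isinstance(a, Struct) or isinstance(b, Struct):
        return False
    return a == b


def prune(data, required):
    """Keep only the requested fields, but return BLOB / VARIANT values whole."""
    if not isinstance(data, Struct) or not isinstance(required, Struct):
        return data
    if data.logical in ("BLOB", "VARIANT"):
        # Short-circuit: the logical type needs its full inner layout.
        return data
    wanted = {f.name: f for f in required.fields}
    kept = [
        Field(f.name, prune(f.dtype, wanted[f.name].dtype), f.nullable)
        for f in data.fields
        if f.name in wanted
    ]
    return Struct(kept, data.logical)


# Strict equality rejects a nullability-only difference; the matcher accepts it.
loose = Struct([Field("x", "string", nullable=True)])
tight = Struct([Field("x", "string", nullable=False)])
strict_equal = loose == tight
structurally_equal = same_shape(loose, tight)

# Illustrative BLOB layout; only reference.external_path comes from the description.
blob = Struct(
    [
        Field("inline_bytes", "binary"),  # hypothetical field
        Field("reference", Struct([Field("external_path", "string")])),
    ],
    logical="BLOB",
)
table = Struct([Field("id", "int"), Field("payload", blob)])
projection = Struct(
    [Field("payload", Struct([Field("reference", Struct([Field("external_path", "string")]))]))]
)
pruned = prune(table, projection)
```

Here `pruned` drops `id` but keeps `payload` as the same unmodified BLOB struct, which mirrors the claim that the short-circuit returns what a non-projected read would see for that column.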
Documentation Update
None.
Contributor's checklist