feat(schema) Ensure spark SQL insert into preserves hudi_type for VEC… #18608

Draft
rahil-c wants to merge 2 commits into apache:master from rahil-c:rahil/fix-vec-blob-insert

@rahil-c (Collaborator) commented Apr 27, 2026

Describe the issue this Pull Request addresses

VECTOR and BLOB columns lose their hudi_type StructField metadata
during Spark's INSERT INTO column resolution. By the time
HoodieSparkSqlWriter converts df.schema to a HoodieSchema, the
metadata is gone and the converter falls through to the plain Avro
record/array branches. The schema-compat check then sees a logical-typed
catalog schema versus a plain incoming schema and fails with
SchemaBackwardsCompatibilityException: MISSING_UNION_BRANCH.

A nested projection on a BLOB column (SELECT payload.reference.external_path FROM t)
hits the same metadata-loss pattern on the read path inside
pruneDataSchemaInternal, throwing Data schema is not a record.

Summary and Changelog

A scoped fix focused on the SQL INSERT INTO and read paths. Smaller than
#18540 — does not cover UPDATE/MERGE call-site rewiring, the BLOB
Spark-shape flip in toSqlType, or ArrayType.containsNull recursion.
The read-side change is the same short-circuit as #18566.

  • HoodieSchemaConversionUtils.alignSchemaMetadataFromCatalog (new) —
    re-stamps StructField.metadata and nullable from the catalog
    StructType onto the source StructType, recursing into nested structs.
  • HoodieSparkSqlWriter.write — calls the helper before
    convertStructTypeToHoodieSchema(df.schema, ...) when
    schemaFromCatalog is provided. No-op for direct DataFrame writers
    that don't pass it.
  • HoodieSparkSchemaConverters.validateBlobStructure — replaces strict
    .equals with a recursive structural matcher that ignores nullability
    differences. Spark's INSERT can tighten a field's nullability from the
    source's expression; the on-disk Avro side enforces non-null invariants
    separately.
  • HoodieSchemaUtils.pruneDataSchemaInternal — short-circuits the
    case RECORD branch when the data schema is BLOB or VARIANT. Both
    carry LogicalType.validate() contracts requiring the full canonical
    inner layout, so partial pruning is not legal anyway. (Same patch as
    fix(schema): Allow nested projection on BLOB and VARIANT columns in p… #18566.)
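
The nullability-insensitive comparison described for validateBlobStructure can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: it uses Python dataclasses (`Field`, `Struct`) as stand-ins for Spark's StructField/StructType, the function name `same_shape` is hypothetical, and the sample BLOB layout contains only the field names that appear in this PR's example (`type`, `data`, `reference.external_path`) with assumed types; any other inner fields are omitted.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    """Stand-in for a Spark StructField (hypothetical, simplified)."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool = True


@dataclass(frozen=True)
class Struct:
    """Stand-in for a Spark StructType (hypothetical, simplified)."""
    fields: Tuple[Field, ...]


def same_shape(expected, actual) -> bool:
    """Compare field names and types in order, recursing into structs,
    while ignoring the nullable flag on every field."""
    if isinstance(expected, Struct) and isinstance(actual, Struct):
        if len(expected.fields) != len(actual.fields):
            return False
        return all(
            e.name == a.name and same_shape(e.dtype, a.dtype)
            for e, a in zip(expected.fields, actual.fields)
        )
    # Leaf types (plain strings here) must match exactly.
    return expected == actual


# Assumed, abbreviated layout; the real canonical BLOB struct may differ.
reference = Struct((Field("external_path", "string"),))
canonical = Struct((
    Field("type", "string", nullable=False),
    Field("data", "binary", nullable=True),
    Field("reference", reference, nullable=True),
))
# Same layout, but with nullability changed as INSERT resolution might do.
tightened = Struct((
    Field("type", "string", nullable=True),
    Field("data", "binary", nullable=False),
    Field("reference", reference, nullable=True),
))
# A genuinely different layout: the first field has another name.
renamed = Struct((Field("kind", "string"),) + canonical.fields[1:])

print(canonical == tightened)            # False: strict equality sees nullable
print(same_shape(canonical, tightened))  # True: structure matches
print(same_shape(canonical, renamed))    # False: field name differs
```

The point of the design is that only nullability is relaxed; a renamed, reordered, or retyped field is still rejected.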

Tests:

  • TestHoodieSchemaConversionUtils — 7 unit cases for the helper
    (BLOB/VECTOR metadata copy, nullability widen + narrow, nested-struct
    recursion, field-not-in-catalog passthrough, empty-catalog-metadata
    preservation).
  • TestCreateTable — end-to-end regression test:
    CREATE TABLE t (id INT, payload BLOB, emb VECTOR(3)) USING hudi
    INSERT INTO t SELECT … FROM tempview → round-trip read.

Example

Before — fails:

CREATE TABLE pets (id INT, payload BLOB, emb VECTOR(3)) USING hudi
TBLPROPERTIES (primaryKey = 'id', ...);

CREATE TEMPORARY VIEW src AS
SELECT 1 AS id, cast(X'010203' as binary) AS bytes, array(0.1f, 0.2f, 0.3f) AS emb;

INSERT INTO pets SELECT id,
  named_struct('type','INLINE','data', bytes, 'reference', cast(null as struct<...>)) AS payload,
  emb FROM src;
-- SchemaBackwardsCompatibilityException: MISSING_UNION_BRANCH
--   reader union lacking writer type: BLOB / VECTOR

After — succeeds:

INSERT INTO pets SELECT ...; -- ok
SELECT id, length(payload.data), size(emb) FROM pets;
-- 1 | 3 | 3

Impact

  • User-facing: SQL INSERT INTO ... SELECT and nested projection now
    work on tables with BLOB / VECTOR columns. No public API change. No
    config change.
  • Performance: one extra StructType traversal per write command
    (bounded by catalog schema size). Negligible.

Risk Level

Low. The helper runs only when schemaFromCatalog is provided
(InsertIntoHoodieTableCommand path); direct DataFrame writers fall back
to existing behavior. The validator change relaxes a strict-equality
check to structural matching, which is strictly more permissive on the
Spark side; on-disk Avro invariants are unchanged. The pruner short-circuit
returns the unmodified data schema, matching what a non-projected read
already does.
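
The pruner short-circuit can be pictured with a toy model. The sketch below is not the Java code in HoodieSchemaUtils: `Schema`, `prune`, and the `short_circuit` flag are hypothetical names, and the model assumes that a BLOB or VARIANT schema reports a kind other than RECORD, which is what the "Data schema is not a record" error suggests.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Schema:
    """Toy schema node: kind is "RECORD", "BLOB", "VARIANT" or a primitive name."""
    kind: str
    fields: Dict[str, "Schema"] = field(default_factory=dict)


# Logical types whose inner layout must stay complete (per the PR text).
WHOLE_ONLY = {"BLOB", "VARIANT"}


def prune(data: Schema, required: Schema, short_circuit: bool = True) -> Schema:
    """Keep only the parts of `data` that `required` asks for."""
    if required.kind != "RECORD":
        return data  # a leaf was requested; nothing to prune
    if short_circuit and data.kind in WHOLE_ONLY:
        return data  # return the unmodified logical-type schema
    if data.kind != "RECORD":
        raise ValueError("Data schema is not a record")
    kept = {
        name: prune(data.fields[name], sub, short_circuit)
        for name, sub in required.fields.items()
        if name in data.fields
    }
    return Schema("RECORD", kept)


path_only = Schema("RECORD", {"external_path": Schema("string")})
table = Schema("RECORD", {
    "id": Schema("int"),
    "payload": Schema("BLOB", {
        "type": Schema("string"),
        "data": Schema("binary"),
        "reference": path_only,
    }),
})
# Models: SELECT payload.reference.external_path FROM t
required = Schema("RECORD", {
    "payload": Schema("RECORD", {"reference": path_only}),
})

pruned = prune(table, required)
print(pruned.fields["payload"] is table.fields["payload"])  # True

try:
    prune(table, required, short_circuit=False)
except ValueError as err:
    print(err)  # Data schema is not a record
```

With the short-circuit, the nested projection receives the full BLOB schema for `payload`, the same schema a read without column projection would see, while unrelated top-level columns such as `id` are still pruned.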

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@rahil-c requested review from voonhous and yihua on April 27, 2026 03:48
@github-actions (bot) added the size:L label (PR with lines of changes in (300, 1000]) on Apr 27, 2026

… to logical-type fields

Forcing the catalog's nullability across every field (including nested
structs of ordinary columns) shifted the Avro union layout for unrelated
columns and broke 90+ MergeInto / partial-update / CDC / secondary-index
tests with 'Malformed data. Length is negative' in Avro deserialization.

Narrow the helper: re-stamp nullability + recurse only when the catalog
field carries the 'hudi_type' metadata key (BLOB / VECTOR / VARIANT). For
ordinary fields, copy metadata only and leave Spark's resolved nullability
untouched.
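
The narrowed rule can be sketched as follows. This is a simplified Python model, not the Scala helper in this PR: `Field`, `Struct`, and `align` are hypothetical stand-ins for Spark's StructField/StructType and for alignSchemaMetadataFromCatalog, and the metadata value "VECTOR(3)" is a placeholder. Two behaviors are assumptions on my part: the recursive call applies the same rule to nested fields, and source metadata is kept when the catalog's metadata is empty (my reading of the "empty-catalog-metadata preservation" test case).

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple, Union

HUDI_TYPE = "hudi_type"  # metadata key named in the commit message


@dataclass
class Field:
    """Stand-in for a Spark StructField (hypothetical, simplified)."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Struct:
    """Stand-in for a Spark StructType (hypothetical, simplified)."""
    fields: Tuple[Field, ...]


def align(source: Struct, catalog: Struct) -> Struct:
    """Copy catalog metadata onto matching source fields; copy nullability
    and recurse only for fields the catalog marks with hudi_type."""
    by_name = {f.name: f for f in catalog.fields}
    out = []
    for src in source.fields:
        cat = by_name.get(src.name)
        if cat is None:
            out.append(src)  # field not in catalog: pass through unchanged
            continue
        # Assumption: an empty catalog metadata map does not wipe the source's.
        metadata = dict(cat.metadata) if cat.metadata else src.metadata
        if HUDI_TYPE in cat.metadata:
            dtype = src.dtype
            if isinstance(src.dtype, Struct) and isinstance(cat.dtype, Struct):
                dtype = align(src.dtype, cat.dtype)  # assumption: same rule inside
            out.append(replace(src, dtype=dtype, nullable=cat.nullable,
                               metadata=metadata))
        else:
            # Ordinary column: metadata only, Spark's nullability is kept.
            out.append(replace(src, metadata=metadata))
    return Struct(tuple(out))


catalog = Struct((
    Field("id", "int", nullable=True),
    Field("emb", "array<float>", nullable=True,
          metadata={HUDI_TYPE: "VECTOR(3)"}),  # placeholder value
))
source = Struct((
    Field("id", "int", nullable=False),
    Field("emb", "array<float>", nullable=False),
    Field("extra", "string", nullable=False),
))

aligned = align(source, catalog)
for f in aligned.fields:
    print(f.name, f.nullable, f.metadata)
```

In this model `id` keeps the non-null flag Spark resolved for it, `emb` takes both the hudi_type metadata and the catalog's nullability, and `extra` passes through untouched.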

@codecov-commenter

Codecov Report

❌ Patch coverage is 67.64706% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.63%. Comparing base (29f9c40) to head (5949164).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/spark/sql/avro/HoodieSparkSchemaConverters.scala | 42.85% | 0 Missing and 4 partials ⚠️ |
| .../org/apache/hudi/HoodieSchemaConversionUtils.scala | 80.00% | 0 Missing and 3 partials ⚠️ |
| ...g/apache/hudi/common/schema/HoodieSchemaUtils.java | 0.00% | 1 Missing and 2 partials ⚠️ |
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 88.88% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18608      +/-   ##
============================================
- Coverage     68.90%   68.63%   -0.28%     
+ Complexity    28581    28483      -98     
============================================
  Files          2482     2492      +10     
  Lines        137053   137282     +229     
  Branches      16713    16745      +32     
============================================
- Hits          94436    94217     -219     
- Misses        35009    35550     +541     
+ Partials       7608     7515      -93     

| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.31% <0.00%> (-0.07%) ⬇️ |
| hadoop-mr-java-client | 44.83% <0.00%> (+<0.01%) ⬆️ |
| spark-client-hadoop-common | 48.44% <0.00%> (-0.02%) ⬇️ |
| spark-java-tests | 49.51% <52.94%> (+0.01%) ⬆️ |
| spark-scala-tests | 44.94% <61.76%> (-0.28%) ⬇️ |
| utilities | 37.91% <5.88%> (-0.04%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 78.54% <88.88%> (+0.11%) ⬆️ |
| .../org/apache/hudi/HoodieSchemaConversionUtils.scala | 72.72% <80.00%> (+1.29%) ⬆️ |
| ...g/apache/hudi/common/schema/HoodieSchemaUtils.java | 82.87% <0.00%> (-0.87%) ⬇️ |
| ...e/spark/sql/avro/HoodieSparkSchemaConverters.scala | 77.99% <42.85%> (-1.32%) ⬇️ |

... and 42 files with indirect coverage changes
