feat(schema) Ensure spark SQL insert into preserves hudi_type for VEC… by rahil-c · Pull Request #18608 · apache/hudi

rahil-c · 2026-04-27T03:42:09Z

Describe the issue this Pull Request addresses

VECTOR and BLOB columns lose their hudi_type StructField metadata
during Spark's INSERT INTO column resolution. By the time
HoodieSparkSqlWriter converts df.schema to a HoodieSchema, the
metadata is gone and the converter falls through to the plain Avro
record/array branches. The schema-compat check then sees a logical-typed
catalog schema versus a plain incoming schema and fails with
SchemaBackwardsCompatibilityException: MISSING_UNION_BRANCH.

A nested projection on a BLOB column (SELECT payload.reference.external_path FROM t)
hits the same metadata-loss pattern on the read path inside
pruneDataSchemaInternal, which throws "Data schema is not a record".

Summary and Changelog

A scoped fix focused on the SQL INSERT INTO and read paths. It is smaller
than #18540: it does not cover UPDATE/MERGE call-site rewiring, the BLOB
Spark-shape flip in toSqlType, or ArrayType.containsNull recursion.
The read-side change is the same short-circuit as #18566.

- HoodieSchemaConversionUtils.alignSchemaMetadataFromCatalog (new):
  re-stamps StructField.metadata and nullable from the catalog
  StructType onto the source StructType, recursing into nested structs.
- HoodieSparkSqlWriter.write: calls the helper before
  convertStructTypeToHoodieSchema(df.schema, ...) when
  schemaFromCatalog is provided. No-op for direct DataFrame writers
  that don't pass it.
- HoodieSparkSchemaConverters.validateBlobStructure: replaces strict
  .equals with a recursive structural matcher that ignores nullability
  differences. Spark's INSERT can tighten a field's nullability from the
  source's expression; the on-disk Avro side enforces non-null invariants
  separately.
- HoodieSchemaUtils.pruneDataSchemaInternal: short-circuits the
  case RECORD branch when the data schema is BLOB or VARIANT. Both
  carry LogicalType.validate() contracts requiring the full canonical
  inner layout, so partial pruning is not legal anyway. (Same patch as
  #18566, "fix(schema): Allow nested projection on BLOB and VARIANT columns in p…".)
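
To illustrate the nullability-insensitive matching described for validateBlobStructure, here is a minimal sketch in plain Java. It is not the PR's code: `ToyField` and `StructuralMatcher` are hypothetical stand-ins for Spark's StructField and the new matcher, and the field names used in `main` (`type`, `data`) are taken from the example INSERT rather than from the canonical BLOB layout, which this page does not spell out.

```java
import java.util.List;

/** Toy stand-in for a Spark StructField; not the real Spark or Hudi type. */
final class ToyField {
    final String name;
    final String type;              // e.g. "string", "binary", or "struct"
    final boolean nullable;
    final List<ToyField> children;  // non-empty only for "struct"

    ToyField(String name, String type, boolean nullable, List<ToyField> children) {
        this.name = name;
        this.type = type;
        this.nullable = nullable;
        this.children = children;
    }

    static ToyField leaf(String name, String type, boolean nullable) {
        return new ToyField(name, type, nullable, List.of());
    }

    static ToyField struct(String name, boolean nullable, ToyField... children) {
        return new ToyField(name, "struct", nullable, List.of(children));
    }
}

final class StructuralMatcher {
    /** Compares name, type, and child layout recursively; nullability is never consulted. */
    static boolean matchesIgnoringNullability(ToyField expected, ToyField actual) {
        if (!expected.name.equals(actual.name) || !expected.type.equals(actual.type)) {
            return false;
        }
        if (expected.children.size() != actual.children.size()) {
            return false;
        }
        for (int i = 0; i < expected.children.size(); i++) {
            if (!matchesIgnoringNullability(expected.children.get(i), actual.children.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same layout, but the incoming side has tightened nullability everywhere.
        ToyField canonical = ToyField.struct("payload", true,
                ToyField.leaf("type", "string", true),
                ToyField.leaf("data", "binary", true));
        ToyField incoming = ToyField.struct("payload", false,
                ToyField.leaf("type", "string", false),
                ToyField.leaf("data", "binary", false));
        System.out.println(matchesIgnoringNullability(canonical, incoming));
    }
}
```

A strict equals on these two toy structs would differ only in the nullable flags, which is the case the relaxed check is meant to accept; a mismatch in name, type, or field count is still rejected.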

Tests:

- TestHoodieSchemaConversionUtils: 7 unit cases for the helper
  (BLOB/VECTOR metadata copy, nullability widen + narrow, nested-struct
  recursion, field-not-in-catalog passthrough, empty-catalog-metadata
  preservation).
- TestCreateTable: end-to-end regression test:
  CREATE TABLE t (id INT, payload BLOB, emb VECTOR(3)) USING hudi →
  INSERT INTO t SELECT … FROM tempview → round-trip read.

Example

Before — fails:

CREATE TABLE pets (id INT, payload BLOB, emb VECTOR(3)) USING hudi
TBLPROPERTIES (primaryKey = 'id', ...);

CREATE TEMPORARY VIEW src AS
SELECT 1 AS id, cast(X'010203' as binary) AS bytes, array(0.1f, 0.2f, 0.3f) AS emb;

INSERT INTO pets SELECT id,
  named_struct('type','INLINE','data', bytes, 'reference', cast(null as struct<...>)) AS payload,
  emb FROM src;
-- SchemaBackwardsCompatibilityException: MISSING_UNION_BRANCH
--   reader union lacking writer type: BLOB / VECTOR

After — succeeds:

INSERT INTO pets SELECT ...; -- ok
SELECT id, length(payload.data), size(emb) FROM pets;
-- 1 | 3 | 3

Impact

User-facing: SQL INSERT INTO ... SELECT and nested projection now
work on tables with BLOB / VECTOR columns. No public API change. No
config change.
Performance: one extra StructType traversal per write command
(bounded by catalog schema size). Negligible.

Risk Level

Low. The helper runs only when schemaFromCatalog is provided
(InsertIntoHoodieTableCommand path); direct DataFrame writers fall back
to existing behavior. The validator change relaxes a strict-equality
check to structural matching, which is strictly more permissive on the
Spark side; on-disk Avro invariants are unchanged. The pruner short-circuit
returns the unmodified data schema, matching what a non-projected read
already does.
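
The pruner short-circuit mentioned above can be pictured with a small self-contained Java sketch. It is an illustration under assumptions, not the body of pruneDataSchemaInternal: `ToySchema`, its `Kind` enum, and `ToyPruner` are hypothetical, and modelling BLOB and VARIANT as their own kinds is a simplification of however HoodieSchema actually represents them.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy schema node; the Kind values are illustrative, not Hudi's real schema types. */
final class ToySchema {
    enum Kind { RECORD, BLOB, VARIANT, PRIMITIVE }

    final String name;
    final Kind kind;
    final List<ToySchema> fields;

    ToySchema(String name, Kind kind, List<ToySchema> fields) {
        this.name = name;
        this.kind = kind;
        this.fields = fields;
    }

    static ToySchema primitive(String name) {
        return new ToySchema(name, Kind.PRIMITIVE, List.of());
    }
}

final class ToyPruner {
    /** Keeps only the fields named in `required`, except inside BLOB/VARIANT nodes. */
    static ToySchema prune(ToySchema data, ToySchema required) {
        switch (data.kind) {
            case BLOB:
            case VARIANT:
                // Short-circuit: logical types keep their full inner layout.
                return data;
            case RECORD: {
                List<ToySchema> kept = new ArrayList<>();
                for (ToySchema req : required.fields) {
                    for (ToySchema field : data.fields) {
                        if (field.name.equals(req.name)) {
                            kept.add(prune(field, req));
                        }
                    }
                }
                return new ToySchema(data.name, ToySchema.Kind.RECORD, kept);
            }
            default:
                return data;
        }
    }
}
```

With a required schema that asks only for payload.reference.external_path, the sketch still returns the whole payload node, which is the same shape a non-projected read would see; ordinary record columns are pruned as before.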

Documentation Update

None.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

… to logical-type fields

Forcing the catalog's nullability across every field (including nested structs of ordinary columns) shifted the Avro union layout for unrelated columns and broke ~90+ MergeInto / partial-update / CDC / secondary-index tests with 'Malformed data. Length is negative' in Avro deserialization.

Narrow the helper: re-stamp nullability + recurse only when the catalog field carries the 'hudi_type' metadata key (BLOB / VECTOR / VARIANT). For ordinary fields, copy metadata only and leave Spark's resolved nullability untouched.
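
The narrowed rule in this commit message can be sketched in plain Java as follows. This is one reading of the description, not the actual alignSchemaMetadataFromCatalog: `MetaField` and `CatalogAligner` are hypothetical stand-ins for Spark's StructField and the helper, the metadata values are placeholders, and keeping the source metadata when the catalog's is empty is an assumption inferred from the "empty-catalog-metadata preservation" test name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a Spark StructField: name, nullability, metadata, nested fields. */
final class MetaField {
    final String name;
    final boolean nullable;
    final Map<String, String> metadata;
    final List<MetaField> children;

    MetaField(String name, boolean nullable, Map<String, String> metadata, List<MetaField> children) {
        this.name = name;
        this.nullable = nullable;
        this.metadata = metadata;
        this.children = children;
    }
}

final class CatalogAligner {
    static final String HUDI_TYPE = "hudi_type";

    /** Aligns source fields to the catalog, treating logical-type fields specially. */
    static List<MetaField> align(List<MetaField> source, List<MetaField> catalog) {
        List<MetaField> out = new ArrayList<>();
        for (MetaField s : source) {
            MetaField c = find(catalog, s.name);
            if (c == null) {
                out.add(s); // not in the catalog: pass through unchanged
            } else if (c.metadata.containsKey(HUDI_TYPE)) {
                out.add(restamp(s, c)); // logical type: metadata + nullability + recursion
            } else {
                // Ordinary field: copy metadata only, keep the source's nullability.
                // Assumption: an empty catalog metadata map does not wipe the source's.
                Map<String, String> md = c.metadata.isEmpty() ? s.metadata : c.metadata;
                out.add(new MetaField(s.name, s.nullable, md, s.children));
            }
        }
        return out;
    }

    /** Copies metadata and nullability from the catalog field, recursing into children. */
    private static MetaField restamp(MetaField s, MetaField c) {
        List<MetaField> kids = new ArrayList<>();
        for (MetaField child : s.children) {
            MetaField match = find(c.children, child.name);
            kids.add(match == null ? child : restamp(child, match));
        }
        return new MetaField(s.name, c.nullable, c.metadata, kids);
    }

    private static MetaField find(List<MetaField> fields, String name) {
        for (MetaField f : fields) {
            if (f.name.equals(name)) {
                return f;
            }
        }
        return null;
    }
}
```

The point of the split is that only fields tagged with hudi_type have their nullability overwritten, so the union layout of unrelated columns stays exactly as Spark resolved it.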

hudi-bot · 2026-04-27T05:58:42Z

CI report:

5949164 Azure: FAILURE


codecov-commenter · 2026-04-27T06:30:34Z

Codecov Report

❌ Patch coverage is 67.64706% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.63%. Comparing base (29f9c40) to head (5949164).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/spark/sql/avro/HoodieSparkSchemaConverters.scala | 42.85% | 0 Missing and 4 partials ⚠️ |
| .../org/apache/hudi/HoodieSchemaConversionUtils.scala | 80.00% | 0 Missing and 3 partials ⚠️ |
| ...g/apache/hudi/common/schema/HoodieSchemaUtils.java | 0.00% | 1 Missing and 2 partials ⚠️ |
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 88.88% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18608      +/-   ##
============================================
- Coverage     68.90%   68.63%   -0.28%     
+ Complexity    28581    28483      -98     
============================================
  Files          2482     2492      +10     
  Lines        137053   137282     +229     
  Branches      16713    16745      +32     
============================================
- Hits          94436    94217     -219     
- Misses        35009    35550     +541     
+ Partials       7608     7515      -93

| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | `44.31% <0.00%> (-0.07%)` | ⬇️ |
| hadoop-mr-java-client | `44.83% <0.00%> (+<0.01%)` | ⬆️ |
| spark-client-hadoop-common | `48.44% <0.00%> (-0.02%)` | ⬇️ |
| spark-java-tests | `49.51% <52.94%> (+0.01%)` | ⬆️ |
| spark-scala-tests | `44.94% <61.76%> (-0.28%)` | ⬇️ |
| utilities | `37.91% <5.88%> (-0.04%)` | ⬇️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | `78.54% <88.88%> (+0.11%)` | ⬆️ |
| .../org/apache/hudi/HoodieSchemaConversionUtils.scala | `72.72% <80.00%> (+1.29%)` | ⬆️ |
| ...g/apache/hudi/common/schema/HoodieSchemaUtils.java | `82.87% <0.00%> (-0.87%)` | ⬇️ |
| ...e/spark/sql/avro/HoodieSparkSchemaConverters.scala | `77.99% <42.85%> (-1.32%)` | ⬇️ |

... and 42 files with indirect coverage changes

feat(schema) Ensure spark SQL insert into preserves hudi_type for VEC…

5b5f957

…TOR/BLOB

rahil-c requested review from voonhous and yihua April 27, 2026 03:48

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Apr 27, 2026

rahil-c wants to merge 2 commits into apache:master from rahil-c:rahil/fix-vec-blob-insert
