fix(spark): Ignore duplicate fields when merging schema in IncrementalRelation #17776

Open

prashantwason wants to merge 1 commit into apache:master from prashantwason:pw_oss_commit_porting_17

Conversation

@prashantwason (Member) commented Jan 4, 2026

Describe the issue this Pull Request addresses

When using HUDIIncrSource in HUDIStreamer, the data is read from an existing HUDI table using IncrementalRelation. The skeleton schema (containing Hudi meta fields like _hoodie_record_key, _hoodie_partition_path, etc.) is merged with the data schema to create the schema used to read from parquet files.

If the data schema already contains any of these meta fields, Spark fails with duplicate field errors. This commonly occurs when:

  • Creating derived datasets with HUDIIncrSource, where an auto-derived schema (RowBasedSchemaProvider) is used
  • The source table's data already contains the _hoodie_partition_path field, which is not removed from the read data

Summary and Changelog

Summary: Filter out duplicate fields from skeleton schema when merging with data schema in IncrementalRelation to prevent Spark duplicate field errors.

Changelog:

  • Modified IncrementalRelationV1.scala to filter skeleton schema fields that already exist in data schema before merging
  • Modified IncrementalRelationV2.scala with the same fix
  • No code was copied from external sources
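The merge rule behind the fix can be sketched in plain Scala. This is a minimal, dependency-free illustration: the `Field` case class and the `mergeSchemas` name are hypothetical stand-ins for this sketch, while the actual change operates on Spark's `StructType` inside `IncrementalRelationV1.scala` and `IncrementalRelationV2.scala`.

```scala
// Minimal stand-in for Spark's StructField, so this sketch runs without a
// Spark dependency; only the field name matters for the merge rule.
case class Field(name: String, dataType: String)

// Sketch of the merge rule from the fix: keep every skeleton (Hudi meta)
// field first, then append only those data-schema fields whose names are
// not already present in the skeleton.
def mergeSchemas(skeleton: Seq[Field], data: Seq[Field]): Seq[Field] = {
  val skeletonNames = skeleton.map(_.name).toSet
  skeleton ++ data.filterNot(f => skeletonNames.contains(f.name))
}

val skeleton = Seq(
  Field("_hoodie_record_key", "string"),
  Field("_hoodie_partition_path", "string"))
val data = Seq(
  Field("_hoodie_partition_path", "string"), // duplicate of a meta field
  Field("value", "long"))

val merged = mergeSchemas(skeleton, data)
// _hoodie_partition_path survives exactly once, with meta fields first
```

Keeping the skeleton fields first (rather than dropping duplicates from the data side arbitrarily) matters because, per the commit message, the meta fields must be read from their expected positions in the parquet files.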

Impact

No public API changes. This is a bug fix that prevents runtime errors when reading from Hudi tables where the data schema contains Hudi meta fields.

Risk Level

Low - The change only filters out duplicate fields during schema merging. The fix is defensive and only activates when duplicates would otherwise cause an error. The logic is simple and well-understood.

Documentation Update

None - This is a bug fix with no new features, configs, or user-facing changes.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Jan 4, 2026
@nsivabalan (Contributor) left a comment:


can we add some tests please

@prashantwason prashantwason force-pushed the pw_oss_commit_porting_17 branch from 7590609 to 0028dc0 Compare January 28, 2026 19:06
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:XS PR with lines of changes in <= 10 labels Jan 28, 2026
@nsivabalan (Contributor) left a comment:


and can you confirm that the tests added fail w/o the fix you have added.

.option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.mode(SaveMode.Overwrite)
.save(basePath)

A Contributor commented on the snippet above:

can we add one more commit so that we have log files in the table when we do the incremental query.

@prashantwason (Member, Author) replied:

Done. Added an upsert operation after the initial insert to create log files in the MOR table. The test now includes:

  • Initial insert (creates base files)
  • Upsert (creates log files with an update to row1 and a new row4)
  • Incremental read that verifies both base files and log files are correctly read

@prashantwason (Member, Author):

Confirmed - the test fails without the fix with the following error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`

The fix in IncrementalRelationV1.scala and IncrementalRelationV2.scala filters out fields from the skeleton schema that already exist in the data schema before merging, preventing this duplicate field error.

Also added an upsert operation to the test to ensure log files are created in the MOR table before the incremental query.

@prashantwason (Member, Author): @hudi-bot run azure

@prashantwason (Member, Author): @hudi-bot run github

@prashantwason (Member, Author): @hudi-bot run azure

(Two more "@hudi-bot run azure" comments followed.)

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:M PR with lines of changes in (100, 300] labels Feb 23, 2026
…lRelation

When using HUDIIncrSource in HUDIStreamer, the skeleton schema (Hudi meta
fields) is merged with the data schema. If the data schema already contains
meta fields like _hoodie_partition_path, Spark fails with duplicate field
errors.

The fix ensures:
- Keep ALL skeleton schema fields (Hudi meta fields) first
- Only append data schema fields that don't exist in skeleton schema

This prevents duplicate field errors while ensuring meta fields are read
from the correct positions in parquet files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prashantwason prashantwason force-pushed the pw_oss_commit_porting_17 branch from 8e6b725 to d3f638e Compare February 24, 2026 08:50
@prashantwason (Member, Author):

Cleaned up the branch - rebased onto latest master with only the relevant changes (4 files instead of 265). Re-triggering CI.

@hudi-bot run azure

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:XL PR with lines of changes > 1000 labels Feb 24, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.01%. Comparing base (894b817) to head (d3f638e).

Files with missing lines                                Patch %   Lines
.../scala/org/apache/hudi/IncrementalRelationV1.scala   0.00%     3 Missing ⚠️
.../scala/org/apache/hudi/IncrementalRelationV2.scala   0.00%     3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17776      +/-   ##
============================================
- Coverage     57.28%   56.01%   -1.27%     
+ Complexity    18535    18055     -480     
============================================
  Files          1944     1944              
  Lines        106137   106141       +4     
  Branches      13119    13119              
============================================
- Hits          60798    59455    -1343     
- Misses        39617    40908    +1291     
- Partials       5722     5778      +56     
Flag                    Coverage Δ
hadoop-mr-java-client   45.42% <ø>      (+<0.01%) ⬆️
spark-java-tests        44.18% <0.00%>  (-3.24%)  ⬇️
spark-scala-tests       45.51% <0.00%>  (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                Coverage Δ
.../scala/org/apache/hudi/IncrementalRelationV1.scala   0.00% <0.00%> (ø)
.../scala/org/apache/hudi/IncrementalRelationV2.scala   0.00% <0.00%> (ø)

... and 193 files with indirect coverage changes


@hudi-bot (Collaborator):

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-runs the last Azure build


Labels

size:M PR with lines of changes in (100, 300]


5 participants