fix(spark): Ignore duplicate fields when merging schema in IncrementalRelation #17776

Open

prashantwason wants to merge 1 commit into apache:master from prashantwason:pw_oss_commit_porting_17

Conversation

@prashantwason (Member) commented Jan 4, 2026

Describe the issue this Pull Request addresses

When using HUDIIncrSource in HUDIStreamer, the data is read from an existing HUDI table using IncrementalRelation. The skeleton schema (containing Hudi meta fields like _hoodie_record_key, _hoodie_partition_path, etc.) is merged with the data schema to create the schema used to read from parquet files.

If the data schema already contains any of these meta fields, Spark fails with duplicate field errors. This commonly occurs when:

  • Creating derived datasets with HUDIIncrSource, where an auto-derived schema (RowBasedSchemaProvider) is used
  • The source table's data already contains the _hoodie_partition_path field, which is not removed from the read data

Summary and Changelog

Summary: Filter out duplicate fields from skeleton schema when merging with data schema in IncrementalRelation to prevent Spark duplicate field errors.

Changelog:

  • Modified IncrementalRelationV1.scala to filter skeleton schema fields that already exist in data schema before merging
  • Modified IncrementalRelationV2.scala with the same fix
  • No code was copied from external sources
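The merge rule behind the fix can be sketched in plain Scala. This is a minimal, dependency-free illustration: the `Field` case class and the `mergeSchemas` name are hypothetical stand-ins for this sketch, while the actual change operates on Spark's `StructType` inside `IncrementalRelationV1.scala` and `IncrementalRelationV2.scala`.

```scala
// Minimal stand-in for Spark's StructField, so this sketch runs without a
// Spark dependency; only the field name matters for the merge rule.
case class Field(name: String, dataType: String)

// Sketch of the merge rule from the fix: keep every skeleton (Hudi meta)
// field first, then append only those data-schema fields whose names are
// not already present in the skeleton.
def mergeSchemas(skeleton: Seq[Field], data: Seq[Field]): Seq[Field] = {
  val skeletonNames = skeleton.map(_.name).toSet
  skeleton ++ data.filterNot(f => skeletonNames.contains(f.name))
}

val skeleton = Seq(
  Field("_hoodie_record_key", "string"),
  Field("_hoodie_partition_path", "string"))
val data = Seq(
  Field("_hoodie_partition_path", "string"), // duplicate of a meta field
  Field("value", "long"))

val merged = mergeSchemas(skeleton, data)
// _hoodie_partition_path survives exactly once, with meta fields first
```

Keeping the skeleton fields first (rather than dropping duplicates from the data side arbitrarily) matters because, per the commit message, the meta fields must be read from their expected positions in the parquet files.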

Impact

No public API changes. This is a bug fix that prevents runtime errors when reading from Hudi tables where the data schema contains Hudi meta fields.

Risk Level

Low - The change only filters out duplicate fields during schema merging. The fix is defensive and only activates when duplicates would otherwise cause an error. The logic is simple and well-understood.

Documentation Update

None - This is a bug fix with no new features, configs, or user-facing changes.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Jan 4, 2026
@nsivabalan (Contributor) left a comment:


can we add some tests please

@prashantwason prashantwason force-pushed the pw_oss_commit_porting_17 branch from 7590609 to 0028dc0 Compare January 28, 2026 19:06
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:XS PR with lines of changes in <= 10 labels Jan 28, 2026
@nsivabalan (Contributor) left a comment:


and can you confirm that the tests added fail w/o the fix you have added.

.option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.mode(SaveMode.Overwrite)
.save(basePath)

A Contributor commented on the snippet above:

can we add one more commit so that we have log files in the table when we do the incremental query.

@prashantwason (Member, Author) replied:

Done. Added an upsert operation after the initial insert to create log files in the MOR table. The test now includes:

  • Initial insert (creates base files)
  • Upsert (creates log files with an update to row1 and a new row4)
  • Incremental read that verifies both base files and log files are correctly read

@prashantwason (Member, Author):

Confirmed - the test fails without the fix with the following error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`

The fix in IncrementalRelationV1.scala and IncrementalRelationV2.scala filters out fields from the skeleton schema that already exist in the data schema before merging, preventing this duplicate field error.

Also added an upsert operation to the test to ensure log files are created in the MOR table before the incremental query.

@prashantwason (Member, Author): @hudi-bot run azure

@prashantwason (Member, Author): @hudi-bot run github

@prashantwason (Member, Author): @hudi-bot run azure

(Two more "@hudi-bot run azure" comments followed.)

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:M PR with lines of changes in (100, 300] labels Feb 23, 2026
…lRelation

When using HUDIIncrSource in HUDIStreamer, the skeleton schema (Hudi meta
fields) is merged with the data schema. If the data schema already contains
meta fields like _hoodie_partition_path, Spark fails with duplicate field
errors.

The fix ensures:
- Keep ALL skeleton schema fields (Hudi meta fields) first
- Only append data schema fields that don't exist in skeleton schema

This prevents duplicate field errors while ensuring meta fields are read
from the correct positions in parquet files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prashantwason prashantwason force-pushed the pw_oss_commit_porting_17 branch from 8e6b725 to d3f638e Compare February 24, 2026 08:50
@prashantwason (Member, Author):

Cleaned up the branch - rebased onto latest master with only the relevant changes (4 files instead of 265). Re-triggering CI.

@hudi-bot run azure

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:XL PR with lines of changes > 1000 labels Feb 24, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.01%. Comparing base (894b817) to head (d3f638e).

Files with missing lines                                Patch %   Lines
.../scala/org/apache/hudi/IncrementalRelationV1.scala   0.00%     3 Missing ⚠️
.../scala/org/apache/hudi/IncrementalRelationV2.scala   0.00%     3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17776      +/-   ##
============================================
- Coverage     57.28%   56.01%   -1.27%     
+ Complexity    18535    18055     -480     
============================================
  Files          1944     1944              
  Lines        106137   106141       +4     
  Branches      13119    13119              
============================================
- Hits          60798    59455    -1343     
- Misses        39617    40908    +1291     
- Partials       5722     5778      +56     
Flag                    Coverage Δ
hadoop-mr-java-client   45.42% <ø>      (+<0.01%) ⬆️
spark-java-tests        44.18% <0.00%>  (-3.24%)  ⬇️
spark-scala-tests       45.51% <0.00%>  (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                Coverage Δ
.../scala/org/apache/hudi/IncrementalRelationV1.scala   0.00% <0.00%> (ø)
.../scala/org/apache/hudi/IncrementalRelationV2.scala   0.00% <0.00%> (ø)

... and 193 files with indirect coverage changes


@hudi-bot (Collaborator):

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-runs the last Azure build


Labels

size:M PR with lines of changes in (100, 300]


5 participants