fix(spark): Ignore duplicate fields when merging schema in IncrementalRelation #17776
prashantwason wants to merge 1 commit into apache:master from
Conversation
nsivabalan
left a comment
Can we add some tests please?
Force-pushed 7590609 to 0028dc0
nsivabalan
left a comment
And can you confirm that the tests added fail without the fix you have added?
.option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.mode(SaveMode.Overwrite)
.save(basePath)
Can we add one more commit so that we have log files in the table when we do the incremental query?
Done. Added an upsert operation after the initial insert to create log files in the MOR table. The test now includes:
- Initial insert (creates base files)
- Upsert (creates log files with an update to row1 and a new row4)
- Incremental read that verifies both base files and log files are correctly read
Confirmed: the test fails without the fix with a duplicate field error. The fix in IncrementalRelationV1.scala and IncrementalRelationV2.scala filters out fields from the skeleton schema that already exist in the data schema before merging, which prevents that error. Also added an upsert operation to the test to ensure log files are created in the MOR table before the incremental query.
@hudi-bot run azure

@hudi-bot run github

@hudi-bot run azure
…lRelation

When using HUDIIncrSource in HUDIStreamer, the skeleton schema (Hudi meta fields) is merged with the data schema. If the data schema already contains meta fields like _hoodie_partition_path, Spark fails with duplicate field errors. The fix ensures:
- Keep ALL skeleton schema fields (Hudi meta fields) first
- Only append data schema fields that don't exist in the skeleton schema

This prevents duplicate field errors while ensuring meta fields are read from the correct positions in parquet files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
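The merge rule in this commit message can be sketched as a standalone model (plain Python with hypothetical field lists; the actual fix operates on Spark StructType fields inside IncrementalRelation):

```python
def merge_read_schema(skeleton_fields, data_fields):
    """Keep all skeleton (Hudi meta) fields first, then append only the
    data fields whose names are not already present in the skeleton."""
    skeleton_names = set(skeleton_fields)
    return skeleton_fields + [f for f in data_fields if f not in skeleton_names]

# Hypothetical field lists for illustration only.
meta = ["_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path"]
data = ["_hoodie_partition_path", "id", "name"]  # data schema already has a meta field

merged = merge_read_schema(meta, data)
print(merged)
# ['_hoodie_commit_time', '_hoodie_record_key', '_hoodie_partition_path', 'id', 'name']
```

Keeping the skeleton fields first preserves the positions expected for the meta columns, while the filter guarantees no field name appears twice in the merged read schema.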
Force-pushed 8e6b725 to d3f638e
Cleaned up the branch: rebased onto latest master with only the relevant changes (4 files instead of 265). Re-triggering CI.

@hudi-bot run azure
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
## master #17776 +/- ##
============================================
- Coverage 57.28% 56.01% -1.27%
+ Complexity 18535 18055 -480
============================================
Files 1944 1944
Lines 106137 106141 +4
Branches 13119 13119
============================================
- Hits 60798 59455 -1343
- Misses 39617 40908 +1291
- Partials 5722 5778 +56
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
When using HUDIIncrSource in HUDIStreamer, the data is read from an existing HUDI table using IncrementalRelation. The skeleton schema (containing Hudi meta fields like _hoodie_record_key, _hoodie_partition_path, etc.) is merged with the data schema to create the schema used to read from parquet files. If the data schema already contains any of these meta fields, Spark fails with duplicate field errors. This commonly occurs when:
- the _hoodie_partition_path field is not removed from the read data

Summary and Changelog
Summary: Filter out duplicate fields from skeleton schema when merging with data schema in IncrementalRelation to prevent Spark duplicate field errors.
Changelog:
- IncrementalRelationV1.scala: filter skeleton schema fields that already exist in the data schema before merging
- IncrementalRelationV2.scala: the same fix

Impact
No public API changes. This is a bug fix that prevents runtime errors when reading from Hudi tables where the data schema contains Hudi meta fields.
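For intuition on the runtime error this fix prevents: naively concatenating the skeleton and data schemas repeats any meta field the data already carries, and Spark rejects a read schema with duplicate column names. A minimal standalone model (plain Python with hypothetical field lists; the real code merges Spark StructTypes):

```python
# Hypothetical field lists for illustration only.
meta_fields = ["_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path"]
data_fields = ["_hoodie_partition_path", "id", "name"]

# Effectively the pre-fix behavior: merge without any duplicate check.
naive_merge = meta_fields + data_fields
duplicates = sorted(f for f in set(naive_merge) if naive_merge.count(f) > 1)
print(duplicates)
# ['_hoodie_partition_path']
```

The fix removes exactly this overlap before merging, so the duplicate check above would come back empty.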
Risk Level
Low - The change only filters out duplicate fields during schema merging. The fix is defensive and only activates when duplicates would otherwise cause an error. The logic is simple and well-understood.
Documentation Update
None - This is a bug fix with no new features, configs, or user-facing changes.
Contributor's checklist