[HUDI-9120] Fix delete ordering comparison issue #12979
linliu-code wants to merge 2 commits into apache:master from
Conversation
    orderingField: StructField): Seq[StructField] = {
  val fields = ArrayBuffer[StructField]()
  fields ++= requiredSchema.fields
  fields ++= partitionSchema.fields.filter(f => mandatoryFields.contains(f.name))
@jonvex , do you remember why we need to do this filtering? Can we directly add all fields from mandatoryFields?
  fields ++= requiredSchema.fields
  fields ++= partitionSchema.fields.filter(f => mandatoryFields.contains(f.name))
  if (orderingField != null && !fields.contains(orderingField)) {
    fields.append(orderingField)
So the ordering field for the delete record is always in the last position.
requestedSchema is the internal schema used by the file group (fg) reader during merging. The output schema will be requiredSchema, so the position of orderingField should not affect the output.
Do we introduce an additional projection for the rows if the user's output schema does not match it?
I think Spark itself will do one projection anyway, but I need to confirm. This is the same as for the "row index" column we use for position-based merging.
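The append-last behavior discussed in this thread can be sketched as follows. This is a simplified, self-contained illustration, not Hudi's actual code: `StructField` here is a minimal stand-in for Spark's type, and `buildRequestedSchema` is a hypothetical name mirroring the diff above.

```scala
// Stand-in for Spark's StructField so the sketch runs on its own.
case class StructField(name: String, dataType: String)

object RequestedSchemaSketch {
  // Mirrors the PR's schema assembly: required fields first, then mandatory
  // partition fields, then the ordering field appended last if it is absent.
  def buildRequestedSchema(requiredFields: Seq[StructField],
                           partitionFields: Seq[StructField],
                           mandatoryFields: Set[String],
                           orderingField: StructField): Seq[StructField] = {
    val fields = scala.collection.mutable.ArrayBuffer[StructField]()
    fields ++= requiredFields
    fields ++= partitionFields.filter(f => mandatoryFields.contains(f.name))
    if (orderingField != null && !fields.contains(orderingField)) {
      fields.append(orderingField)
    }
    fields.toSeq
  }
}
```

Because the ordering field lands at the end (when it was not already requested), the fg reader can use it for merging while the user-facing requiredSchema remains a prefix of this schema, recovered by a final projection.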
  // schema that we want fg reader to output to us
- val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
+ val requestedSchema = getRequestedSchema(options, dataSchema, partitionSchema, requiredSchema, mandatoryFields)
HoodieFileGroupReaderSchemaHandler#generateRequiredSchema has the logic to append all the fields necessary for merging in MOR, which is engine-independent. Could you check why that is not applied for the bug you discovered?
Also, could we avoid appending the fields necessary for merging at the Spark layer if possible, to reduce the complexity in this class?
Close this one since it is not a correct fix.
This is not needed since we have found a better fix: #12991
Change Logs
Root cause:
For queries using the event-time based merge mode, the requestedSchema may not contain the ordering field if the requiredSchema (output schema) does not contain it. In this case, when we merge base file records and log file records, the base file records have to use DEFAULT_ORDERING_VALUE (integer 0), while the ordering field from log file records keeps its original data type, such as long or float, so the comparison fails due to a type conflict.
Fix:
The fix is to add the ordering field to the requestedSchema when possible.
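The type conflict in the root cause can be reproduced with a minimal sketch (illustrative names, not Hudi's actual classes): comparing the boxed-integer default ordering value against a log record's ordering value of a different boxed type fails at runtime.

```scala
object OrderingConflictDemo {
  // Base file records missing the ordering field fall back to an integer 0
  // (mirroring the DEFAULT_ORDERING_VALUE described above).
  val baseOrdering: Comparable[_] = Integer.valueOf(0)
  // Log file records keep the field's original type, e.g. Long.
  val logOrdering: Comparable[_] = java.lang.Long.valueOf(5L)

  // A raw Comparable comparison, as happens during record merging.
  def compareOrdering(a: Comparable[_], b: Comparable[_]): Int =
    a.asInstanceOf[Comparable[Any]].compareTo(b)

  def main(args: Array[String]): Unit = {
    // Integer.compareTo expects an Integer; passing a Long throws
    // ClassCastException, so the merge comparison fails.
    try {
      compareOrdering(baseOrdering, logOrdering)
      println("comparison succeeded")
    } catch {
      case _: ClassCastException => println("type conflict: ClassCastException")
    }
  }
}
```

With the fix, the ordering field is read from the base file in its real type, so both sides of the comparison have the same type and no fallback to the integer default is needed.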
Follow up:
Why do we add the field for commit-time based queries as well?
Ideally we should not. We add it because: 1. these are corner cases (delete records are present but the ordering field is not requested); 2. it saves the driver a trip to storage to read the table config for the merge mode; 3. simplicity.
Impact
Fix a bug for event time based delete.
Risk level (write none, low, medium or high below)
Low.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Contributor's checklist