Skip to content

Comments

[HUDI-9120] Fix delete ordering comparison issue#12979

Closed
linliu-code wants to merge 2 commits intoapache:masterfrom
linliu-code:HUDI-9120-b
Closed

[HUDI-9120] Fix delete ordering comparison issue#12979
linliu-code wants to merge 2 commits intoapache:masterfrom
linliu-code:HUDI-9120-b

Conversation

@linliu-code
Copy link
Collaborator

@linliu-code linliu-code commented Mar 14, 2025

Change Logs

Root cause:
For queries using event time based merge mode, the requestedSchema may not contain the ordering field if the requiredSchema (output schema) does not contain it. In this case, when we merge base file records and log file records, base file has to use DEFAULT_ORDERING_VALUE (integer 0); However, the ordering field from log files is of their original data type, like long, or float, so the comparison fails due to type conflicts.

Fix:
The fix is to add the ordering field into the requestedSchema if possible.

Follow up:
Why do we add the field for commit time based queries as well?
Ideally we should not. We add it since 1. These are corner cases: with delete records, without the field requested. 2. it saves driver a trip to the storage for reading the table config for merge mode; 3. simplicity.

Impact

Fix a bug for event time based delete.

Risk level (write none, low medium or high below)

Low.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@linliu-code linliu-code force-pushed the HUDI-9120-b branch 2 times, most recently from c9eaf6e to e451e74 Compare March 14, 2025 08:47
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 14, 2025
@linliu-code linliu-code force-pushed the HUDI-9120-b branch 2 times, most recently from 1fabe8e to 8d77025 Compare March 14, 2025 08:55
orderingField: StructField): Seq[StructField] = {
val fields = ArrayBuffer[StructField]()
fields ++= requiredSchema.fields
fields ++= partitionSchema.fields.filter(f => mandatoryFields.contains(f.name))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex , do you remember why we need to do this filtering? Can we directly add all fields from mandatoryFields?

@linliu-code linliu-code marked this pull request as ready for review March 14, 2025 08:57
fields ++= requiredSchema.fields
fields ++= partitionSchema.fields.filter(f => mandatoryFields.contains(f.name))
if (orderingField != null && !fields.contains(orderingField)) {
fields.append(orderingField)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the delete record is always in the last pos.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestedSchema is the internal schema used by fg reader during merging. The output one will be requiredSchema. So the order of orderingField should not affect the output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we introduce additial projection for the rows if the user output schema does not sync with it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Spark itself will do one projection anyways. But I need to confirm. This is the same for "row index" column we use for position based merging.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


// schema that we want fg reader to output to us
val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedSchema = getRequestedSchema(options, dataSchema, partitionSchema, requiredSchema, mandatoryFields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HoodieFileGroupReaderSchemaHandler#generateRequiredSchema has the logic to append all necessary fields for merging in MOR which is independent of engines. Could you check why that's not applied for the bug you discovered?

Also, could we avoid such logic of appending all necessary fields for merging at the Spark layer if possible, to reduce the complexity in this class?

@linliu-code
Copy link
Collaborator Author

Close this one since it is not a correct fix.

@linliu-code
Copy link
Collaborator Author

This is not needed since we have found a better fix: #12991

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-1.0.2 size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants