[HUDI-7240] Clean delete logic #10398

Closed
linliu-code wants to merge 1 commit into apache:master from linliu-code:HUDI-7240-clean-delete-logic

linliu-code (Collaborator) commented Dec 22, 2023

Change Logs

  1. When we create HoodieRecord for a delete, we store the necessary information into the metadata field.
  2. When we need to merge delete records, we extract orderingVal from metadata field of HoodieRecord.
  3. Removed HoodieRecordTestPayload.
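
The first two items can be illustrated with a minimal sketch. It is not the PR's code: `SketchRecord`, the `ORDERING_KEY` name, and the `merge` helper are hypothetical stand-ins, and the metadata field is assumed to be a simple string map. It only shows the idea of storing a delete's ordering value in metadata at creation time and reading it back at merge time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a record; this is not Hudi's HoodieRecord.
final class SketchRecord {
    // Hypothetical metadata key; the key used by the PR is not shown in this thread.
    static final String ORDERING_KEY = "sketch.delete.ordering.val";

    private final String key;
    private final String payload;          // null models a delete
    private final long payloadOrderingVal; // only meaningful when payload != null
    private final Map<String, String> metadata;

    private SketchRecord(String key, String payload, long payloadOrderingVal,
                         Map<String, String> metadata) {
        this.key = key;
        this.payload = payload;
        this.payloadOrderingVal = payloadOrderingVal;
        this.metadata = metadata;
    }

    static SketchRecord upsert(String key, String payload, long orderingVal) {
        Objects.requireNonNull(payload, "an upsert needs a payload");
        return new SketchRecord(key, payload, orderingVal,
                Collections.<String, String>emptyMap());
    }

    // Item 1: a delete carries no payload, so its ordering value goes into metadata.
    static SketchRecord delete(String key, long orderingVal) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(ORDERING_KEY, Long.toString(orderingVal));
        return new SketchRecord(key, null, 0L, metadata);
    }

    boolean isDelete() {
        return payload == null;
    }

    // Item 2: read the ordering value from the payload, or from metadata for a delete.
    long orderingVal() {
        if (payload != null) {
            return payloadOrderingVal;
        }
        String stored = metadata.get(ORDERING_KEY);
        if (stored == null) {
            throw new IllegalStateException("Delete record has no ordering value in metadata.");
        }
        return Long.parseLong(stored);
    }

    // Keeps the record with the larger ordering value; ties go to the newer arrival.
    static SketchRecord merge(SketchRecord older, SketchRecord newer) {
        return newer.orderingVal() >= older.orderingVal() ? newer : older;
    }
}
```

With this shape, a delete that arrives late but carries an older ordering value does not override a newer upsert, because both kinds of record expose an ordering value through the same accessor.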

Impact

Simplifies the logic for handling delete records.

Risk level (write none, low, medium or high below)

Low.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

linliu-code force-pushed the HUDI-7240-clean-delete-logic branch from 8dc8152 to eda1992 on December 22, 2023 01:44
linliu-code (Collaborator, Author) commented:

@yihua @codope @danny0405

linliu-code force-pushed the HUDI-7240-clean-delete-logic branch 2 times, most recently from cb8a503 to 57327e3 on December 22, 2023 02:15
public Comparable<?> getOrderingValue(Schema recordSchema, Properties props) {
return this.getData().getOrderingValue();
// For non-delete record.
if (data != null) {
Review comment (Contributor):

The data is designed to be non-null; we even have validation logic for it in HoodieRecord:

  public T getData() {
    if (data == null) {
      throw new IllegalStateException("Payload already deflated for record.");
    }
    return data;
  }

Reply (Collaborator, Author):

We can check the metadata first in the validation.
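
One possible reading of this suggestion, sketched under stated assumptions: the null check in `getData()` is kept, but a record whose metadata carries a delete ordering value is treated as a legitimate delete rather than a deflated payload. `ValidatedRecord`, `DELETE_ORDERING_KEY`, and `isMetadataOnlyDelete` are hypothetical names, and returning `null` for a delete is an illustrative choice, not something the thread confirms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record type used only to illustrate the validation order.
final class ValidatedRecord<T> {
    // Hypothetical metadata key marking a delete that has an ordering value.
    static final String DELETE_ORDERING_KEY = "sketch.delete.ordering.val";

    private final T data;
    private final Map<String, String> metadata;

    ValidatedRecord(T data, Map<String, String> metadata) {
        this.data = data;
        this.metadata = new HashMap<>(metadata); // defensive copy
    }

    // True when the record has no payload but metadata identifies it as a delete.
    boolean isMetadataOnlyDelete() {
        return data == null && metadata.containsKey(DELETE_ORDERING_KEY);
    }

    T getData() {
        // Check the metadata first: a delete legitimately has no payload.
        if (isMetadataOnlyDelete()) {
            return null;
        }
        // Otherwise keep the existing validation for deflated payloads.
        if (data == null) {
            throw new IllegalStateException("Payload already deflated for record.");
        }
        return data;
    }
}
```

The trade-off in this sketch is that callers of `getData()` must now handle a `null` result for deletes, which is the kind of contract change the review comment above is cautious about.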

linliu-code force-pushed the HUDI-7240-clean-delete-logic branch from 57327e3 to d311264 on December 22, 2023 02:30
linliu-code (Collaborator, Author) commented:

Will clean up the failures.

linliu-code reopened this Dec 22, 2023
linliu-code force-pushed the HUDI-7240-clean-delete-logic branch 3 times, most recently from 36f0142 to c3779c9 on December 22, 2023 21:21
linliu-code force-pushed the HUDI-7240-clean-delete-logic branch from c3779c9 to 3d71d1c on January 9, 2024 00:42
danny0405 (Contributor) commented:

Thanks for raising this fix. I think this is a good chance to fix the event-time sequence comparison of delete records with payloads. I can see 2 mistakes in our code, which uses processing-time sequence for deletes:

  1. OverwriteWithLatestAvroPayload#preCombine:
  public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
    if (oldValue.recordBytes.length == 0) {
      // use natural order for delete record
      return this;
    }
    ...
  }
  2. DefaultHoodieRecordPayload#combineAndGetUpdateValue:
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties) throws IOException {
    if (recordBytes.length == 0) {
      return Option.empty();
    }

    ...
  }

In any case, the orderingVal should be set up correctly and we should utilize it as much as possible.
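
To make the distinction concrete, here is a minimal sketch contrasting the two behaviours for a `preCombine`-style merge. `SketchPayload` and both method names are hypothetical. The first method mirrors only the quoted shortcut; the comparison that follows the shortcut is an assumption, since the quoted snippet elides it. The second method is one way the ordering value could be honoured for deletes as well.

```java
// Hypothetical payload type; not Hudi's OverwriteWithLatestAvroPayload.
final class SketchPayload {
    final byte[] recordBytes; // an empty array models a delete
    final long orderingVal;   // event-time ordering value

    SketchPayload(byte[] recordBytes, long orderingVal) {
        this.recordBytes = recordBytes;
        this.orderingVal = orderingVal;
    }

    // Processing-time behaviour: an older delete always loses to the incoming
    // payload, regardless of ordering values (mirrors the quoted shortcut).
    SketchPayload preCombineArrivalOrder(SketchPayload oldValue) {
        if (oldValue.recordBytes.length == 0) {
            return this;
        }
        // Assumed fallback: pick the payload with the greater ordering value.
        return oldValue.orderingVal > orderingVal ? oldValue : this;
    }

    // Event-time behaviour: deletes take part in the ordering comparison too.
    SketchPayload preCombineEventTime(SketchPayload oldValue) {
        return oldValue.orderingVal > orderingVal ? oldValue : this;
    }
}
```

For example, given an existing delete with ordering value 20 and an incoming upsert with ordering value 10, the arrival-order variant keeps the upsert, while the event-time variant keeps the delete.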

yihua (Contributor) left a review comment:

Closing this PR as we should revisit the delete ordering overall.

yihua closed this Sep 19, 2024
Labels

size:L PR with lines of changes in (300, 1000]

Projects

Status: ✅ Done

4 participants