[HUDI-7518] Fix HoodieMetadataPayload merging logic around repeated deletes #10913

Merged
nsivabalan merged 1 commit into apache:master from yihua:HUDI-7518-fix-metadata-payload-around-deletion
Mar 27, 2024

yihua commented Mar 22, 2024

Change Logs

When the same file is deleted more than once from a partition's file list in the files partition of the metadata table (MDT), the current HoodieMetadataPayload merging logic drops the deletion. As a result, a file that has been deleted from the file system, and should have been removed from the MDT file listing, remains in the MDT. This happens because the merging logic for file system metadata does not account for repeated deletes. Other MDT partitions already handle repeated deletes when merging payloads.

This PR fixes the logic. New tests are added around the repeated deletes (the tests fail before the fix and succeed after the fix).

The impact of this bug is that deleted files may still be listed in the MDT if there are repeated delete operations. It does not cause any data loss.

Here's a concrete example of how this bug causes the ingestion to fail:

(1) A data file and file group are replaced by clustering. The data file is still on the file system and in MDT file listing.

(2) A cleaner plan is generated to delete the data file.

(3) The cleaner plan is executed the first time, and fails before commit due to Spark job shutdown.

(4) The ingestion continues and succeeds, and another cleaner plan is generated containing the same data file/file group to delete.

(5) The first cleaner plan is then successfully executed, producing a metadata payload that deletes the data file from the file list. This payload is added to one log file in the MDT, e.g.,

HoodieMetadataPayload {key=partition, type=2, Files: {creations=[], deletions=[7f6b146e-cd43-4fd3-9ce0-118232562569-0_63-29223-5579389_20240303214408245.parquet], }}

(6) The second cleaner plan is also successfully executed, producing a metadata payload that deletes the same data file from the file list. This payload is added to a subsequent log file in the same file slice in the MDT, e.g.,

HoodieMetadataPayload {key=partition, type=2, Files: {creations=[], deletions=[7f6b146e-cd43-4fd3-9ce0-118232562569-0_63-29223-5579389_20240303214408245.parquet], }} 

(7) The replacecommit corresponding to the clustering is archived, as the cleaner has deleted the replaced file groups.

(8) When the MDT is read or MDT compaction happens, merging these two metadata payloads with identical deletes produces an empty deletion list, so the data file is not deleted from the partition file list in the MDT. The expected behavior is to keep the data file in the "deletions" field.

HoodieMetadataPayload {key=partition, type=2, Files: {creations=[], deletions=[], }}

(9) On the next upsert and indexing, the file system view based on the MDT still serves the deleted data file (e.g., 7f6b146e-cd43-4fd3-9ce0-118232562569-0_63-29223-5579389_20240303214408245.parquet), but the data file cannot be found on the file system, causing the ingestion to fail.
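
The merge behavior in steps (5) through (8) can be illustrated with a minimal sketch. This is a simplified model, not the actual HoodieMetadataPayload code: the class name, the method names, and the representation of a files-partition record as a map from file name to an "is deleted" flag are all assumptions made for illustration. It contrasts a merge where a newer delete always cancels the older entry with one that keeps the deletion when the older entry is also a delete.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a "files" partition record:
// file name -> true if the entry is a deletion, false if it is a creation.
// This is an illustration only, not the Hudi implementation.
class RepeatedDeleteMergeSketch {

    // Models the behavior described in step (8): a newer delete removes the
    // older entry, even when the older entry is itself a delete.
    static Map<String, Boolean> mergeDroppingRepeatedDeletes(
            Map<String, Boolean> older, Map<String, Boolean> newer) {
        Map<String, Boolean> merged = new HashMap<>(older);
        newer.forEach((file, deleted) ->
                // Returning null from the remapping function removes the key.
                merged.merge(file, deleted,
                        (oldDeleted, newDeleted) -> newDeleted ? null : Boolean.FALSE));
        return merged;
    }

    // Models the expected behavior: a delete following a delete stays a
    // delete, while a delete following a creation still cancels out.
    static Map<String, Boolean> mergeKeepingRepeatedDeletes(
            Map<String, Boolean> older, Map<String, Boolean> newer) {
        Map<String, Boolean> merged = new HashMap<>(older);
        newer.forEach((file, deleted) ->
                merged.merge(file, deleted, (oldDeleted, newDeleted) -> {
                    if (newDeleted) {
                        return oldDeleted ? Boolean.TRUE : null;
                    }
                    return Boolean.FALSE;
                }));
        return merged;
    }

    public static void main(String[] args) {
        String file =
                "7f6b146e-cd43-4fd3-9ce0-118232562569-0_63-29223-5579389_20240303214408245.parquet";

        // Payloads written by the first and second cleaner plans.
        Map<String, Boolean> firstClean = new HashMap<>();
        firstClean.put(file, true);
        Map<String, Boolean> secondClean = new HashMap<>();
        secondClean.put(file, true);

        System.out.println("dropping repeated deletes: "
                + mergeDroppingRepeatedDeletes(firstClean, secondClean));
        System.out.println("keeping repeated deletes:  "
                + mergeKeepingRepeatedDeletes(firstClean, secondClean));
    }
}
```

In this model, the first merge returns an empty map, which corresponds to the empty "deletions" field shown in step (8), while the second merge retains the file as a deletion so that it would still be removed from the listing.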

Impact

MDT bug fix.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 22, 2024
@yihua yihua force-pushed the HUDI-7518-fix-metadata-payload-around-deletion branch from 101b295 to e162c91 Compare March 22, 2024 20:45
@nsivabalan nsivabalan merged commit 8a13763 into apache:master Mar 27, 2024

Labels

release-0.15.0 release-1.0.0 size:M PR with lines of changes in (100, 300]
