Flink: Refresh table in ListMetadataFiles to prevent incorrect orphan file deletion #16324

Merged: pvary merged 1 commit into apache:main from yadavay-amzn:fix/15487-list-metadata-refresh on May 14, 2026
@yadavay-amzn yadavay-amzn commented May 13, 2026

Problem

Fixes #15487.

When Flink TableMaintenance runs both ExpireSnapshots and DeleteOrphanFiles, manifest list files of live snapshots are incorrectly deleted as orphans, causing NotFoundException in subsequent ExpireSnapshots runs.

Root cause

ListMetadataFiles loads the table once at operator startup (open()) and never calls table.refresh() in processElement(). It only emits manifest list and manifest file paths for snapshots that existed when the Flink job started.

Any snapshot added after job start has its metadata files missing from the "referenced" set that DeleteOrphanFiles uses. When those manifest lists are older than minAge, OrphanFilesDetector classifies them as orphans and DeleteFilesProcessor deletes them.

On the next maintenance cycle, ExpireSnapshots tries to read those manifest lists in IncrementalFileCleanup.cleanFiles() and fails with NotFoundException.

This explains why the bug:

  • only occurs with DeleteOrphanFiles enabled (it is the one incorrectly deleting the files)
  • never occurs with ExpireSnapshots alone (it only deletes manifest lists of snapshots it has already expired and read)
  • becomes more likely over time (more snapshots added after job start = more unprotected manifest lists)
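
The failure mode above can be sketched with a small, self-contained model. This is not Iceberg's OrphanFilesDetector: the class name `OrphanCheckSketch`, the method `findOrphans`, the file names, and the three-day `minAge` are all hypothetical, and the rule (older than minAge and not in the referenced set) is only the simplified version described in this PR.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the orphan check; not Iceberg's OrphanFilesDetector.
class OrphanCheckSketch {

  // A file is treated as an orphan when it is older than minAge
  // and absent from the referenced set.
  static Set<String> findOrphans(
      Map<String, Instant> filesOnStorage,
      Set<String> referenced,
      Instant now,
      Duration minAge) {
    Set<String> orphans = new TreeSet<>();
    Instant cutoff = now.minus(minAge);
    for (Map.Entry<String, Instant> file : filesOnStorage.entrySet()) {
      boolean oldEnough = file.getValue().isBefore(cutoff);
      if (oldEnough && !referenced.contains(file.getKey())) {
        orphans.add(file.getKey());
      }
    }
    return orphans;
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2026-05-14T00:00:00Z");
    Duration minAge = Duration.ofDays(3); // assumed value, for illustration only

    // Manifest list files on storage, keyed by (made-up) path, with creation time.
    Map<String, Instant> storage = new LinkedHashMap<>();
    storage.put("snap-1.avro", now.minus(Duration.ofDays(10))); // existed at job start
    storage.put("snap-2.avro", now.minus(Duration.ofDays(5)));  // committed after job start
    storage.put("snap-3.avro", now.minus(Duration.ofDays(1)));  // committed after job start

    // Referenced set built from a table that was never refreshed vs. a refreshed one.
    Set<String> staleReferenced = Set.of("snap-1.avro");
    Set<String> freshReferenced = Set.of("snap-1.avro", "snap-2.avro", "snap-3.avro");

    System.out.println("stale view:     " + findOrphans(storage, staleReferenced, now, minAge));
    System.out.println("refreshed view: " + findOrphans(storage, freshReferenced, now, minAge));
  }
}
```

In this model the stale view flags snap-2.avro, a live manifest list, because it is past minAge and missing from the referenced set. snap-3.avro is equally unprotected but survives only because it is still younger than minAge, which is consistent with the bug becoming more likely over time.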

Fix

Add table.refresh() at the top of ListMetadataFiles.processElement(), matching what MetadataTablePlanner already does. This ensures the "referenced" set always reflects the current table state.
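
A minimal sketch of the effect of the fix, using in-memory stand-ins rather than the real Iceberg or Flink classes. `FakeCatalog`, `FakeTable`, `ListMetadataFilesSketch`, and the `refreshPerElement` flag are hypothetical names introduced only for this illustration; the real operator's signature and emitted records differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-ins; none of these types are Iceberg or Flink classes.
class RefreshSketch {

  // Shared catalog state: manifest list paths of all committed snapshots.
  static final class FakeCatalog {
    final List<String> manifestLists = new ArrayList<>();

    void commit(String manifestList) {
      manifestLists.add(manifestList);
    }
  }

  // A table handle that caches catalog state until refresh() is called.
  static final class FakeTable {
    private final FakeCatalog catalog;
    private List<String> cached;

    FakeTable(FakeCatalog catalog) {
      this.catalog = catalog;
      this.cached = List.copyOf(catalog.manifestLists);
    }

    void refresh() {
      cached = List.copyOf(catalog.manifestLists);
    }

    List<String> manifestLists() {
      return cached;
    }
  }

  // Mimics the operator lifecycle: the table is loaded once, then
  // processElement() runs for every maintenance trigger.
  static final class ListMetadataFilesSketch {
    private final FakeTable table;
    private final boolean refreshPerElement;

    ListMetadataFilesSketch(FakeCatalog catalog, boolean refreshPerElement) {
      this.table = new FakeTable(catalog); // corresponds to loading the table in open()
      this.refreshPerElement = refreshPerElement;
    }

    List<String> processElement() {
      if (refreshPerElement) {
        table.refresh(); // the one-line fix: pick up snapshots committed since open()
      }
      return new ArrayList<>(table.manifestLists());
    }
  }

  public static void main(String[] args) {
    FakeCatalog catalog = new FakeCatalog();
    catalog.commit("snap-1.avro");

    ListMetadataFilesSketch before = new ListMetadataFilesSketch(catalog, false);
    ListMetadataFilesSketch after = new ListMetadataFilesSketch(catalog, true);

    // Two more snapshots are committed after both operators were opened.
    catalog.commit("snap-2.avro");
    catalog.commit("snap-3.avro");

    System.out.println("without refresh: " + before.processElement());
    System.out.println("with refresh:    " + after.processElement());
  }
}
```

Without the refresh, only the manifest list that existed when the operator was opened is emitted. With it, all three are emitted, so the downstream "referenced" set covers every live snapshot.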


@Guosmilesmile Guosmilesmile left a comment

Thanks for the PR.

Please remove the 2.0, 1.20 changes for now - we usually cherry pick the changes to the other versions after the original PR for the main versions has been merged.

// Verify that manifest lists from ALL 3 snapshots are present, not just the first one.
// Without table.refresh() in processElement, only snapshot 1's files would be emitted.
table.refresh();
for (org.apache.iceberg.Snapshot snapshot : table.snapshots()) {

nit: import

Comment on lines +109 to +110
// Verify that manifest lists from ALL 3 snapshots are present, not just the first one.
// Without table.refresh() in processElement, only snapshot 1's files would be emitted.

Can we verify this based on numbers rather than comments?

… file deletion

ListMetadataFiles loads the table once at job start and never refreshes,
so it only emits manifest list and manifest paths for snapshots that existed
when the job started. Any snapshot added after job start has its metadata
files unprotected. DeleteOrphanFiles then incorrectly classifies those files
as orphans and deletes them, causing NotFoundException in subsequent
ExpireSnapshots runs when IncrementalFileCleanup tries to read them.

The fix adds table.refresh() in processElement(), matching what
MetadataTablePlanner already does.

Closes apache#15487
@yadavay-amzn yadavay-amzn force-pushed the fix/15487-list-metadata-refresh branch from aaab19b to 266f0f6 Compare May 14, 2026 07:30

yadavay-amzn commented May 14, 2026

Thanks for the feedback!
Removed v2.0/v1.20 changes, fixed imports, and strengthened the test assertion to verify count (hasSize(24)) matching the existing test pattern.

@Guosmilesmile

Looks good to me. @pvary If you have a minute, could you help take a look?

@pvary pvary merged commit 6b8b57e into apache:main May 14, 2026
19 checks passed

pvary commented May 14, 2026

Merged to main.
Thanks @yadavay-amzn for the PR and @Guosmilesmile for the review.

@yadavay-amzn: Please create the backport PR and let us know whether you needed to change anything manually, or whether the following commands did everything without issue:

git diff 6b8b57e2e454ec64e0a6113c80d9cdfe09c6783a^ 6b8b57e2e454ec64e0a6113c80d9cdfe09c6783a flink/v2.1 | sed "s/v2.1/v2.0/g" > /tmp/patch; git apply -3 -p1 /tmp/patch
git diff 6b8b57e2e454ec64e0a6113c80d9cdfe09c6783a^ 6b8b57e2e454ec64e0a6113c80d9cdfe09c6783a flink/v2.1 | sed "s/v2.1/v1.20/g" > /tmp/patch; git apply -3 -p1 /tmp/patch

Copilot AI added a commit to kevinjqliu/iceberg that referenced this pull request May 14, 2026
Agent-Logs-Url: https://github.com/kevinjqliu/iceberg/sessions/682ce8b4-890f-41a9-a89a-b1f2873be44c

Co-authored-by: kevinjqliu <9057843+kevinjqliu@users.noreply.github.com>
kevinjqliu added a commit that referenced this pull request May 14, 2026
Agent-Logs-Url: https://github.com/kevinjqliu/iceberg/sessions/682ce8b4-890f-41a9-a89a-b1f2873be44c

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kevinjqliu <9057843+kevinjqliu@users.noreply.github.com>
@yadavay-amzn

Thanks @Guosmilesmile for the review and @pvary for merging!

I was just about to create the backport PRs, but I see that @kevinjqliu has already created and merged them! I'm assuming no more action is needed from me. Thanks again, folks!

Development

Successfully merging this pull request may close these issues.

Flink tableMaintenance incorrectly delete manifest files

3 participants