[core] Avoid deleting directories in orphan files cleanup #7920
JingsongLi merged 11 commits on May 21, 2026
…ransient empty result

When listFileDirs encounters a transient empty response (network jitter, throttling) for a partition directory, the partition path is mistakenly passed to pathProcessor as if it were a bucket-level path. If the second listing succeeds, bucket sub-directories are collected as orphan file candidates. Since bucket directory names never appear in snapshot manifests, they pass the orphan diff and cleanFile recursively deletes the entire bucket directory, including valid data files.

Fix:
- Add isDataStructureDirectory() filter in candidate collection (Local/Flink/Spark) to skip bucket-* and partition=value directories. Other directories (e.g. UNKNOWN-* temp dirs) remain eligible for orphan cleanup to preserve existing behavior.
- Add defensive guard in cleanFile to refuse deletion of structural data directories even if they somehow reach the deletion path.
- Fix Flink empty-dir cleanup missing dryRun guard (Local and Spark already had it).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0604a9c to e3ce874
Orphan candidates should only be files: directories have no Paimon metadata reference semantics. Using plain isDir() is cleaner and more robust than pattern-matching on bucket-*/partition=value names.

Also remove the cleanFile defensive guard and the isDataStructureDirectory helper, since directories can no longer reach the candidate deletion path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
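The files-only rule from this commit can be illustrated with a minimal sketch. It assumes plain `java.nio.file` instead of Paimon's own file abstraction, and `CandidateCollector` and `listFileCandidates` are hypothetical names, not code from the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not a Paimon class. */
final class CandidateCollector {

    /**
     * Lists the direct children of dataDir and keeps only entries that are
     * not directories, so a bucket or partition directory can never become
     * an orphan candidate, whatever its name looks like.
     */
    static List<Path> listFileCandidates(Path dataDir) throws IOException {
        List<Path> candidates = new ArrayList<>();
        try (Stream<Path> children = Files.list(dataDir)) {
            children.filter(child -> !Files.isDirectory(child)).forEach(candidates::add);
        }
        return candidates;
    }
}
```

Filtering on the entry type rather than on the name means a directory with an unexpected name is excluded just like `bucket-0`.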
cleanFile semantics: delete orphan FILE candidates only. A directory reaching this method means a bug in candidate collection. Log ERROR and refuse instead of recursively deleting. Empty directory cleanup is handled separately via tryDeleteEmptyDirectory (non-recursive delete).
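A rough sketch of the refuse-and-log behavior described in this commit, under the assumption that a local `java.nio.file` path stands in for Paimon's file abstraction. `OrphanFileDeleter` is a hypothetical class, and the stderr line stands in for the real ERROR log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the deletion step; not a Paimon class. */
final class OrphanFileDeleter {

    /**
     * Deletes a single orphan file candidate. A directory is never deleted
     * here: it is reported and left untouched. Returns true only when a file
     * was actually removed.
     */
    static boolean cleanFile(Path candidate) throws IOException {
        if (Files.isDirectory(candidate)) {
            // Assumption: stderr stands in for an ERROR-level log statement.
            System.err.println("ERROR: refusing to delete directory candidate " + candidate);
            return false;
        }
        return Files.deleteIfExists(candidate);
    }
}
```

With this shape, a bug in candidate collection costs at most a log line; the data files under the directory stay in place.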
- addNonUsedFiles now only creates files (not directories)
- New test verifies subdirectories are filtered by !isDir()
listPathWithFilter now skips directories so that non-standard dirs under snapshot/ or changelog/ are never counted as orphan candidates. Adds test coverage for this scenario.
- testDirectoriesNotTreatedAsOrphanCandidates now discovers the actual bucket path via listSubDirs instead of hardcoding bucket-0
- filterDirs adds status.isDir() check so only actual directories pass through the partition/bucket traversal filter
Prevents the parent-walk loop from deleting table.location() itself, matching the existing guard in LocalOrphanFilesClean.
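The parent walk with its guards could look roughly like the sketch below. It assumes `java.nio.file` in place of Paimon's file abstraction; `EmptyDirCleaner`, `deleteEmptyParents`, and the way dry-run treats an already-reported child as removed are all illustrative choices, not the PR's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch of empty-directory cleanup; not a Paimon class. */
final class EmptyDirCleaner {

    /**
     * Walks from start towards tableRoot and removes each directory that is
     * empty. The walk stops before tableRoot itself, stops at the first
     * non-empty directory, and in dryRun mode only reports what it would remove.
     */
    static List<Path> deleteEmptyParents(Path start, Path tableRoot, boolean dryRun)
            throws IOException {
        Path root = tableRoot.toAbsolutePath().normalize();
        Path current = start.toAbsolutePath().normalize();
        Path handledChild = null;
        List<Path> handled = new ArrayList<>();
        while (current != null && current.startsWith(root) && !current.equals(root)) {
            if (!Files.isDirectory(current) || !isEmptyIgnoring(current, handledChild)) {
                break;
            }
            if (!dryRun) {
                // Non-recursive: Files.delete throws if the directory is not empty.
                Files.delete(current);
            }
            handled.add(current);
            handledChild = current;
            current = current.getParent();
        }
        return handled;
    }

    /** True if dir has no children other than the one already handled. */
    private static boolean isEmptyIgnoring(Path dir, Path ignored) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.allMatch(child -> child.equals(ignored));
        }
    }
}
```

Because the delete call is non-recursive, a directory that gains a file between the emptiness check and the delete fails loudly instead of losing data.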
+1
XiaoHongbo-Hope added a commit that referenced this pull request on May 21, 2026
This PR fixes a potential data loss risk in orphan files cleanup. The issue was introduced by #7295, which added support for cleaning empty partition directories without bucket subdirectories. In that change, `listFileDirs` may add a partition directory to the directories to be scanned when its child listing is empty. This is safe when the partition is truly empty. However, if a transient `listStatus` failure returns an empty result, a partition directory that still contains data may be treated as empty and later scanned again. When the later listing succeeds, bucket or partition directories could be collected as orphan file candidates. Before this fix, `cleanFile` recursively deleted directory candidates via `deleteDirectoryQuietly`, so a directory incorrectly collected as an orphan candidate could delete still-referenced data files under that directory. (cherry picked from commit d7b0825)
Purpose
This PR fixes a potential data loss risk in orphan files cleanup.
The issue was introduced by #7295, which added support for cleaning empty partition directories without bucket subdirectories. In that change, `listFileDirs` may add a partition directory to the directories to be scanned when its child listing is empty.
This is safe when the partition is truly empty. However, if a transient `listStatus` failure returns an empty result, a partition directory that still contains data may be treated as empty and later scanned again. When the later listing succeeds, bucket or partition directories could be collected as orphan file candidates.
Before this fix, `cleanFile` recursively deleted directory candidates via `deleteDirectoryQuietly`, so a directory incorrectly collected as an orphan candidate could delete still-referenced data files under that directory.
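The failure sequence can be reproduced with a small in-memory model. This is only a sketch of the scenario described above: `TransientListingDemo`, its map-based file system, and the convention that names ending in `/` are directories are all assumptions for illustration, and the method bodies are simplified stand-ins rather than Paimon's `listFileDirs`. It needs Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of the scenario; not Paimon code. */
final class TransientListingDemo {

    /** Directory path -> children; names ending in "/" denote directories. */
    private final Map<String, List<String>> tree;
    /** Directories whose next listing spuriously returns nothing. */
    private final Set<String> flakyOnce;

    TransientListingDemo(Map<String, List<String>> tree, Set<String> flakyOnce) {
        this.tree = tree;
        this.flakyOnce = new HashSet<>(flakyOnce);
    }

    /** Lists a directory; the first call on a flaky path returns an empty result. */
    List<String> list(String dir) {
        if (flakyOnce.remove(dir)) {
            return List.of();
        }
        return tree.getOrDefault(dir, List.of());
    }

    /** A partition with no listed children is itself added to the dirs to scan. */
    List<String> listFileDirs(String tableDir) {
        List<String> fileDirs = new ArrayList<>();
        for (String partition : list(tableDir)) {
            List<String> buckets = list(partition);
            if (buckets.isEmpty()) {
                fileDirs.add(partition);
            } else {
                fileDirs.addAll(buckets);
            }
        }
        return fileDirs;
    }

    /** Scans each dir again; filesOnly models the directory filter. */
    List<String> collectCandidates(List<String> fileDirs, boolean filesOnly) {
        List<String> candidates = new ArrayList<>();
        for (String dir : fileDirs) {
            for (String child : list(dir)) {
                if (filesOnly && child.endsWith("/")) {
                    continue;
                }
                candidates.add(child);
            }
        }
        return candidates;
    }
}
```

With a tree of `t/` containing `t/dt=1/`, which contains `t/dt=1/bucket-0/`, and `t/dt=1/` marked flaky, `collectCandidates(listFileDirs("t/"), false)` yields the bucket directory as a candidate, while passing `true` yields no candidates. In this model the directory filter is what keeps the bucket out of the deletion path even when the first listing lies.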
Changes
This PR adds multiple safeguards:
- … snapshot/changelog special cleanup.
- … `cleanFile` refuse directories by removing recursive directory deletion from orphan file cleanup.
- … `dryRun`, stop at the table location, and keep deletion non-recursive via `delete(path, false)`.
Tests
- `testDirectoriesNotTreatedAsOrphanCandidates`
- `testDirectoryInSnapshotDirNotTreatedAsCandidate`