Skip to content

[core] Avoid deleting directories in orphan files cleanup#7920

Merged
JingsongLi merged 11 commits into
apache:masterfrom
XiaoHongbo-Hope:fix/orphan-clean-skip-directory-candidates
May 21, 2026
Merged

[core] Avoid deleting directories in orphan files cleanup#7920
JingsongLi merged 11 commits into
apache:masterfrom
XiaoHongbo-Hope:fix/orphan-clean-skip-directory-candidates

Conversation

@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor

@XiaoHongbo-Hope XiaoHongbo-Hope commented May 21, 2026

Purpose

This PR fixes a potential data loss risk in orphan files cleanup.

The issue was introduced by #7295, which added support for cleaning empty partition directories without
bucket subdirectories. In that change, listFileDirs may add a partition directory to the directories to
be scanned when its child listing is empty.

This is safe when the partition is truly empty. However, if a transient listStatus failure returns an
empty result, a partition directory that still contains data may be treated as empty and later scanned
again. When the later listing succeeds, bucket or partition directories could be collected as orphan file
candidates.

Before this fix, cleanFile recursively deleted directory candidates via deleteDirectoryQuietly, so a
directory incorrectly collected as an orphan candidate could delete still-referenced data files under
that directory.

Changes

This PR adds multiple safeguards:

  1. Filter directories from all orphan file candidate collection paths, including Local, Flink, Spark, and
    snapshot/changelog special cleanup.
  2. Make cleanFile refuse directories by removing recursive directory deletion from orphan file cleanup.
  3. Harden Flink empty directory cleanup: respect dryRun, stop at the table location, and keep deletion
    non-recursive via delete(path, false).

Tests

  • testDirectoriesNotTreatedAsOrphanCandidates
  • testDirectoryInSnapshotDirNotTreatedAsCandidate

@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [core] Fix dir loops delete candicates n orphan clean when [core] Fix dir loops delete candicates in orphan files clean May 21, 2026
…ransient empty result

When listFileDirs encounters a transient empty response (network jitter,
throttling) for a partition directory, the partition path is mistakenly
passed to pathProcessor as if it were a bucket-level path. If the second
listing succeeds, bucket sub-directories are collected as orphan file
candidates. Since bucket directory names never appear in snapshot
manifests, they pass the orphan diff and cleanFile recursively deletes
the entire bucket directory including valid data files.

Fix:
- Add isDataStructureDirectory() filter in candidate collection
  (Local/Flink/Spark) to skip bucket-* and partition=value directories.
  Other directories (e.g. UNKNOWN-* temp dirs) remain eligible for
  orphan cleanup to preserve existing behavior.
- Add defensive guard in cleanFile to refuse deletion of structural
  data directories even if they somehow reach the deletion path.
- Fix Flink empty-dir cleanup missing dryRun guard (Local and Spark
  already had it).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the fix/orphan-clean-skip-directory-candidates branch from 0604a9c to e3ce874 Compare May 21, 2026 03:59
XiaoHongbo-Hope and others added 3 commits May 21, 2026 12:03
Orphan candidate should only be files — directories have no Paimon
metadata reference semantics. Using plain isDir() is cleaner and
more robust than pattern-matching on bucket-*/partition=value names.

Also remove the cleanFile defensive guard and isDataStructureDirectory
helper since directories can no longer reach the candidate deletion
path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cleanFile semantics: delete orphan FILE candidates only. A directory
reaching this method means a bug in candidate collection. Log ERROR
and refuse instead of recursively deleting.

Empty directory cleanup is handled separately via
tryDeleteEmptyDirectory (non-recursive delete).
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [core] Fix dir loops delete candicates in orphan files clean [core] Avoid deleting directories in orphan files cleanup May 21, 2026
- addNonUsedFiles now only creates files (not directories)
- New test verifies subdirectories are filtered by !isDir()
listPathWithFilter now skips directories so that non-standard
dirs under snapshot/ or changelog/ are never counted as orphan
candidates. Adds test coverage for this scenario.
- testDirectoriesNotTreatedAsOrphanCandidates now discovers the
  actual bucket path via listSubDirs instead of hardcoding bucket-0
- filterDirs adds status.isDir() check so only actual directories
  pass through the partition/bucket traversal filter
Prevents the parent-walk loop from deleting table.location()
itself, matching the existing guard in LocalOrphanFilesClean.
@JingsongLi
Copy link
Copy Markdown
Contributor

+1

@JingsongLi JingsongLi merged commit d7b0825 into apache:master May 21, 2026
11 of 12 checks passed
XiaoHongbo-Hope added a commit that referenced this pull request May 21, 2026
This PR fixes a potential data loss risk in orphan files cleanup.

The issue was introduced by #7295, which added support for cleaning
empty partition directories without
bucket subdirectories. In that change, `listFileDirs` may add a
partition directory to the directories to
  be scanned when its child listing is empty.

This is safe when the partition is truly empty. However, if a transient
`listStatus` failure returns an
empty result, a partition directory that still contains data may be
treated as empty and later scanned
again. When the later listing succeeds, bucket or partition directories
could be collected as orphan file
  candidates.

Before this fix, `cleanFile` recursively deleted directory candidates
via `deleteDirectoryQuietly`, so a
directory incorrectly collected as an orphan candidate could delete
still-referenced data files under
  that directory.

(cherry picked from commit d7b0825)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants