Spark: Fix file_list_view prefix matching to exclude sibling paths in DeleteOrphanFiles #17094

Closed
maoli67660 wants to merge 1 commit into apache:main from maoli67660:fix/issue-16493-orphan-files-prefix

@maoli67660

Problem

filteredCompareToFileList() in DeleteOrphanFilesSparkAction filters the caller-provided file_list_view dataset using:

files = files.filter(files.col(FILE_PATH).startsWith(location));

Because location has no trailing /, a sibling path that shares the table location as a raw string prefix is incorrectly included. For example, when the table is at s3://bucket/my_table, files under s3://bucket/my_table_backup/ also satisfy startsWith("s3://bucket/my_table") and get pulled into orphan detection for the wrong table.
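
The false match can be reproduced with plain Java strings, independent of Spark. This is a minimal illustrative sketch: the class name, method name, and file names are made up for the example and do not appear in the patch.

```java
class RawPrefixDemo {
  // Mirrors the unfixed check: a raw string prefix test with no path boundary.
  public static boolean rawPrefixMatch(String filePath, String location) {
    return filePath.startsWith(location);
  }

  public static void main(String[] args) {
    String location = "s3://bucket/my_table";
    // A file that really lives under the table location.
    System.out.println(rawPrefixMatch("s3://bucket/my_table/data/a.parquet", location));
    // A file under a sibling directory that only shares the string prefix.
    System.out.println(rawPrefixMatch("s3://bucket/my_table_backup/data/b.parquet", location));
  }
}
```

Both calls return true, which is why the sibling file ends up in the candidate set.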

Fixes #16493.

Solution

Append / to location before the prefix match:

String locationPrefix = location.endsWith("/") ? location : location + "/";
files = files.filter(files.col(FILE_PATH).startsWith(locationPrefix));

Applied to Spark 3.5, 4.0, and 4.1.
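
For illustration, the same normalization can be exercised on an in-memory list instead of a Spark Dataset. This is a sketch under stated assumptions: LocationPrefixFilter, toLocationPrefix, and filterByLocation are hypothetical names, and the stream filter only stands in for the Dataset filter on the file path column.

```java
import java.util.List;
import java.util.stream.Collectors;

class LocationPrefixFilter {
  // Same conditional as the patch: make the prefix end at a path boundary.
  public static String toLocationPrefix(String location) {
    return location.endsWith("/") ? location : location + "/";
  }

  // Plain-Java stand-in for the Dataset filter (an assumption, not Spark code).
  public static List<String> filterByLocation(List<String> paths, String location) {
    String prefix = toLocationPrefix(location);
    return paths.stream()
        .filter(path -> path.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> paths = List.of(
        "s3://bucket/my_table/data/a.parquet",
        "s3://bucket/my_table_backup/data/b.parquet");
    // Only the path under s3://bucket/my_table/ survives the filter.
    System.out.println(filterByLocation(paths, "s3://bucket/my_table"));
  }
}
```

With the trailing separator in place, s3://bucket/my_table_backup/... no longer satisfies the prefix test, while a location that already ends in / is left unchanged.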

Testing

Added testRemoveOrphanFilesFileListViewDoesNotMatchSiblingPaths to TestRemoveOrphanFilesProcedure in all three Spark versions. The test:

  1. Creates an empty Iceberg table at a known location
  2. Builds a file_list_view that includes an orphan file inside the table directory and a file under a sibling path (table-location + "-sibling")
  3. Runs remove_orphan_files with file_list_view and dry_run => true
  4. Asserts that the in-table orphan is identified as an orphan and the sibling file is not

… DeleteOrphanFiles

`filteredCompareToFileList()` filtered the caller-provided file list using
`startsWith(location)` without a trailing slash, causing sibling paths that
share the table location as a string prefix (e.g. `s3://bucket/table-backup/`
for a table at `s3://bucket/table`) to be incorrectly scoped into orphan
detection.

Fix by appending `/` to the location before the prefix match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added the spark label Jul 4, 2026
@maoli67660

Author

Closing this as a duplicate of #16498, which has been open since May and takes the same approach (and already uses LocationUtil.stripTrailingSlash(location) + PATH_SEPARATOR, which is cleaner than my conditional check). I'll help review #16498 instead to get the fix landed.
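
The code from #16498 is not shown in this thread, so the following is only a rough sketch of the strip-then-append idea the comment describes. stripTrailingSlashes below is a hypothetical local stand-in written for this example, not Iceberg's LocationUtil, and the PATH_SEPARATOR value is an assumption.

```java
class StripThenAppend {
  // Assumed separator value for the example.
  static final String PATH_SEPARATOR = "/";

  // Hypothetical stand-in: drop every trailing slash, but keep a bare
  // scheme such as "s3://" intact.
  public static String stripTrailingSlashes(String location) {
    String result = location;
    while (result.endsWith("/") && !result.endsWith("://")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }

  // Normalize first, then append exactly one separator.
  public static String toLocationPrefix(String location) {
    return stripTrailingSlashes(location) + PATH_SEPARATOR;
  }

  public static void main(String[] args) {
    System.out.println(toLocationPrefix("s3://bucket/my_table"));
    System.out.println(toLocationPrefix("s3://bucket/my_table/"));
    System.out.println(toLocationPrefix("s3://bucket/my_table//"));
  }
}
```

In this sketch all three inputs yield s3://bucket/my_table/, whereas a conditional endsWith("/") check would leave the doubled slash of the third input in place.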

maoli67660 closed this Jul 4, 2026

Development

Successfully merging this pull request may close these issues.

remove_orphan_files scopes file_list_view with raw string prefix matching