Spark: Fix file_list_view prefix matching to exclude sibling paths in DeleteOrphanFiles #17094
Closed
maoli67660 wants to merge 1 commit into
… DeleteOrphanFiles `filteredCompareToFileList()` filtered the caller-provided file list using `startsWith(location)` without a trailing slash, causing sibling paths that share the table location as a string prefix (e.g. `s3://bucket/table-backup/` for a table at `s3://bucket/table`) to be incorrectly scoped into orphan detection. Fix by appending `/` to the location before the prefix match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
`filteredCompareToFileList()` in `DeleteOrphanFilesSparkAction` filters the caller-provided `file_list_view` dataset using a `startsWith(location)` prefix match.

Because `location` has no trailing `/`, a sibling path that shares the table location as a raw string prefix is incorrectly included. For example, when the table is at `s3://bucket/my_table`, files under `s3://bucket/my_table_backup/` also satisfy `startsWith("s3://bucket/my_table")` and get pulled into orphan detection for the wrong table.

Fixes #16493.
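The failure mode can be reproduced outside Spark with plain `String.startsWith`. The following is a minimal sketch for illustration only: the class `NaivePrefixDemo` and the helper `naiveFilter` are hypothetical names, not part of Iceberg, and the file names are made up. The real action applies the match to a Spark dataset rather than a Java list.

```java
import java.util.List;
import java.util.stream.Collectors;

public class NaivePrefixDemo {
  // Hypothetical helper: keeps every path that starts with the raw location string,
  // mirroring a prefix match on a location that has no trailing slash.
  static List<String> naiveFilter(List<String> paths, String location) {
    return paths.stream()
        .filter(path -> path.startsWith(location))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String location = "s3://bucket/my_table";
    List<String> listed = List.of(
        "s3://bucket/my_table/data/a.parquet",          // inside the table directory
        "s3://bucket/my_table_backup/data/b.parquet");  // inside a sibling directory

    // Both paths pass the filter, because "s3://bucket/my_table_backup/..."
    // also begins with the characters "s3://bucket/my_table".
    System.out.println(naiveFilter(listed, location));
  }
}
```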
Solution
Append `/` to `location` before the prefix match, so that only paths inside the table directory match.

Applied to Spark 3.5, 4.0, and 4.1.
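The effect of the trailing slash can be sketched in plain Java. This is an illustration under stated assumptions, not the code from the PR: `TableScopedPrefix`, `withTrailingSlash`, and `filter` are hypothetical names, and the check that skips appending when the location already ends in `/` is an extra guard assumed here, not something the PR description states.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TableScopedPrefix {
  // Hypothetical helper: appends "/" unless the location already ends with one
  // (the guard is an assumption made for this sketch).
  static String withTrailingSlash(String location) {
    return location.endsWith("/") ? location : location + "/";
  }

  // Keeps only paths that live under the table directory itself.
  static List<String> filter(List<String> paths, String location) {
    String prefix = withTrailingSlash(location);
    return paths.stream()
        .filter(path -> path.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> listed = List.of(
        "s3://bucket/my_table/data/a.parquet",
        "s3://bucket/my_table_backup/data/b.parquet");

    // Only the path under "s3://bucket/my_table/" is kept; the sibling
    // "s3://bucket/my_table_backup/..." no longer matches the prefix.
    System.out.println(filter(listed, "s3://bucket/my_table"));
  }
}
```

With the delimiter included in the prefix, a match implies that the path is a descendant of the table location rather than merely sharing its leading characters.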
Testing
Added `testRemoveOrphanFilesFileListViewDoesNotMatchSiblingPaths` to `TestRemoveOrphanFilesProcedure` in all three Spark versions. The test:

- Provides a `file_list_view` that includes an orphan file inside the table directory and a file under a sibling path (table-location + "-sibling")
- Calls `remove_orphan_files` with `file_list_view` and `dry_run => true`