Skip to content

Spark: Fix DeleteOrphanFilesSparkAction sibling-prefix scope in compareToFileList#16498

Open
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16493-orphan-files-sibling-prefix
Open

Spark: Fix DeleteOrphanFilesSparkAction sibling-prefix scope in compareToFileList#16498
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16493-orphan-files-sibling-prefix

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

Summary

Closes #16493.

DeleteOrphanFilesSparkAction.filteredCompareToFileList() previously scoped a user-supplied compareToFileList to the action's location field using a raw files.col(FILE_PATH).startsWith(location) filter. When location lacks a trailing path separator — the production-typical shape for storage URIs like s3://bucket/table returned by Table.location() — that filter also accepts sibling paths such as s3://bucket/table-backup/.... Files in those sibling directories then entered the orphan candidate set and could be deleted.

This PR normalizes the prefix to directory form via LocationUtil.stripTrailingSlash(location) + "/" before the startsWith filter. The same + "/" shape is already used in SnapshotTableSparkAction (lines 131-132) to prevent identical sibling-prefix collisions, so this aligns the orphan-files action with that existing precedent. The fix is applied symmetrically to all three currently supported Spark version trees (v3.5, v4.0, v4.1) — their source files were byte-identical for this method, so the patch is mechanical.

The directory-listing path (listedFileDS()) is unaffected: it uses Hadoop's FileSystem.listStatus from a single root, which is inherently bounded to that directory.

Copy link
Copy Markdown
Contributor

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wombatu-kun - the fix looks good to me. I added a nit regarding test coverage.

Separately, do you think it would be worth clarifying the javadoc in the DeleteOrphanFiles API? the location where to look for orphan files feels a bit loose given how much behavior is implied by that parameter. If we make the matching semantics more explicit there, I think it would be clearer for both users and maintainers.

public interface DeleteOrphanFiles extends Action<DeleteOrphanFiles, DeleteOrphanFiles.Result> {
/**
* Passes a location which should be scanned for orphan files.
*
* <p>If not set, the root table location will be scanned potentially removing both orphan data
* and metadata files.
*
* @param location the location where to look for orphan files
* @return this for method chaining
*/
DeleteOrphanFiles location(String location);

}

@TestTemplate
public void testCompareToFileListWithSiblingDirectory() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in addition to checking the sibling location, could we include a case where file_path == location? Since the new filtering behavior excludes exact match, I think it would be helpful to add a test to call out the behavior change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea — I folded the exact-match case into the same scope test (renamed to testCompareToFileListExcludesPathsOutsideLocationScope, applied across v3.5/v4.0/v4.1). Alongside the sibling directory it now adds a compareToFileList entry whose path equals location exactly and asserts it isn't reported as orphan, documenting that the trailing-separator prefix excludes exact matches as deliberate behavior. Done in 341d038.

@github-actions github-actions Bot added the API label May 23, 2026
@wombatu-kun
Copy link
Copy Markdown
Contributor Author

Thanks for the review! Good call on the javadoc — I clarified DeleteOrphanFiles.location(String) to make the matching semantics explicit: the location is treated as a directory, so only files contained within its subtree are in scope, and paths that merely share the location's prefix (a sibling directory such as .../table-backup relative to a .../table location) or equal the location itself are excluded. I also reworded the loose @param line you quoted. Done in 341d038.

@wombatu-kun wombatu-kun requested a review from sungwy May 23, 2026 03:43
Vova Kolmakov and others added 2 commits May 23, 2026 14:15
…reToFileList

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…location javadoc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the issue/16493-orphan-files-sibling-prefix branch from 341d038 to 2a967a9 Compare May 23, 2026 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

remove_orphan_files scopes file_list_view with raw string prefix matching

2 participants