Spark: Fix DeleteOrphanFilesSparkAction sibling-prefix scope in compareToFileList #16498
wombatu-kun wants to merge 2 commits into
sungwy left a comment
Thanks @wombatu-kun - the fix looks good to me. I added a nit regarding test coverage.
Separately, do you think it would be worth clarifying the javadoc in the DeleteOrphanFiles API? The description "the location where to look for orphan files" feels a bit loose given how much behavior is implied by that parameter. If we make the matching semantics more explicit there, I think it would be clearer for both users and maintainers.
    }

    @TestTemplate
    public void testCompareToFileListWithSiblingDirectory() throws IOException {
nit: in addition to checking the sibling location, could we include a case where file_path == location? Since the new filtering behavior excludes exact matches, I think it would be helpful to add a test that calls out the behavior change.
Good idea: I folded the exact-match case into the same scope test (renamed to testCompareToFileListExcludesPathsOutsideLocationScope, applied across v3.5/v4.0/v4.1). Alongside the sibling directory, it now adds a compareToFileList entry whose path equals location exactly and asserts that it isn't reported as an orphan, documenting that the trailing-separator prefix excludes exact matches as deliberate behavior. Done in 341d038.
Thanks for the review! Good call on the javadoc: I clarified
…reToFileList Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…location javadoc Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
341d038 to 2a967a9
Summary
Closes #16493.
DeleteOrphanFilesSparkAction.filteredCompareToFileList() previously scoped a user-supplied compareToFileList to the action's location field using a raw files.col(FILE_PATH).startsWith(location) filter. When location lacks a trailing path separator — the production-typical shape for storage URIs like s3://bucket/table returned by Table.location() — that filter also accepts sibling paths such as s3://bucket/table-backup/.... Files in those sibling directories then entered the orphan candidate set and could be deleted.
This PR normalizes the prefix to directory form via LocationUtil.stripTrailingSlash(location) + "/" before the startsWith filter. The same + "/" shape is already used in SnapshotTableSparkAction (lines 131-132) to prevent identical sibling-prefix collisions, so this aligns the orphan-files action with that existing precedent. The fix is applied symmetrically to all three currently supported Spark version trees (v3.5, v4.0, v4.1); their source files were byte-identical for this method, so the patch is mechanical.
The directory-listing path (listedFileDS()) is unaffected: it uses Hadoop's FileSystem.listStatus from a single root, which is inherently bounded to that directory.
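
To illustrate the sibling-prefix collision and the normalized-prefix fix described in this summary, here is a minimal plain-Java sketch that uses String.startsWith instead of the Spark Column filter. It is not the code from the PR: the class PrefixScopeSketch, its methods, and the sample file names are hypothetical, and stripTrailingSlash below is a simplified stand-in for LocationUtil.stripTrailingSlash, which may handle additional cases.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixScopeSketch {

  // Simplified stand-in (assumption) for LocationUtil.stripTrailingSlash:
  // drops every trailing "/" from the location string.
  public static String stripTrailingSlash(String location) {
    String result = location;
    while (result.endsWith("/")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }

  // Old behavior: raw prefix match against the location as given.
  public static List<String> rawScope(String location, List<String> paths) {
    return paths.stream()
        .filter(path -> path.startsWith(location))
        .collect(Collectors.toList());
  }

  // New behavior: the prefix is normalized to directory form (exactly one trailing "/").
  public static List<String> directoryScope(String location, List<String> paths) {
    String prefix = stripTrailingSlash(location) + "/";
    return paths.stream()
        .filter(path -> path.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String location = "s3://bucket/table"; // no trailing separator
    List<String> paths = List.of(
        "s3://bucket/table/data/a.parquet",        // inside the table location
        "s3://bucket/table-backup/data/b.parquet", // sibling directory
        "s3://bucket/table");                      // equals the location exactly

    System.out.println("raw:       " + rawScope(location, paths));
    System.out.println("directory: " + directoryScope(location, paths));
  }
}
```

With these inputs the raw filter keeps all three paths, while the directory-form filter keeps only the path under s3://bucket/table/. Both the sibling directory and the exact match fall outside the scope, which is the behavior the review thread asks the test to call out.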