Skip to content

Spark: Fix NPE in RemoveOrphanFiles with prefix_listing for root table location(#16350)#16351

Open
liuliquan-marshal wants to merge 2 commits into
apache:mainfrom
liuliquan-marshal:main
Open

Spark: Fix NPE in RemoveOrphanFiles with prefix_listing for root table location(#16350)#16351
liuliquan-marshal wants to merge 2 commits into
apache:mainfrom
liuliquan-marshal:main

Conversation

@liuliquan-marshal
Copy link
Copy Markdown

summary

fix #16350

Root table location(e.g. location is s3://bucket/ in metadata.json) when running orphan files removing with prefix_listing=true and S3FlieIO causes NPE. Orphan files removing will always fail in this situation.
This change adds null check when walk reaches the storage root. Besides this adds some tests.

@github-actions github-actions Bot added the core label May 15, 2026
@liuliquan-marshal liuliquan-marshal changed the title Core: Fix NPE in RemoveOrphanFiles with prefix_listing for root table location(#16350) Spark: Fix NPE in RemoveOrphanFiles with prefix_listing for root table location(#16350) May 15, 2026
new FileInfo("s3://bucket/data/a.parquet", 1L, 0L),
new FileInfo("s3://bucket/_temporary/attempt_x/b.parquet", 1L, 0L),
new FileInfo("s3://bucket/.staging/c.parquet", 1L, 0L));
SupportsPrefixOperations mockIO = new StaticPrefixFileIO(entries);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much of the code is duplicated. Can we refactor that out?
Maybe something like:

  private static List<String> walkBucket(String baseDir, FileInfo... entries) {
    try (SupportsPrefixOperations fileIO = new StaticPrefixFileIO(ImmutableList.copyOf(entries))) {
      List<String> foundFiles = Lists.newArrayList();
      Predicate<FileInfo> fileFilter = fileInfo -> fileInfo.location().endsWith(".parquet");

      assertThatCode(
              () ->
                  FileSystemWalker.listDirRecursivelyWithFileIO(
                      fileIO, baseDir, null, fileFilter, foundFiles::add))
          .doesNotThrowAnyException();

      return foundFiles;
    }
  }

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing, addressed this comment in the latest commit. PTAL.

assertThat(foundFiles).containsExactly("s3://bucket/data/a.parquet");
}

private static class StaticPrefixFileIO implements SupportsPrefixOperations {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on IntelliJ, this could be a java record

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this alert as well, but I don't think we should do a one-off java record usage here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with nssalian. I think it's not quite appropriate to use java record. WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NPE in RemoveOrphanFiles with S3FileIO when metadata.json location is at filesystem root

3 participants