[C++] Listing files with S3FileSystem is slow #25019

asfimport · 2020-05-21T18:10:16Z

Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result to include all objects but no "grouping" (directories). This dramatically lowers the number of HTTP calls.

diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
     if (!prefix.empty()) {
       req.SetPrefix(ToAwsString(prefix) + kSep);
     }
-    req.SetDelimiter(Aws::String() + kSep);
+    // req.SetDelimiter(Aws::String() + kSep);
     req.SetMaxKeys(kListObjectsMaxKeys);
 
     while (true) {

The suggested change is to add an option to Selector, e.g. no_directory_result or something similar.
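One consequence of dropping the delimiter is that the result contains only object keys and no directory entries. If a caller still needs directories, they can be reconstructed client-side from the keys. Below is a minimal sketch of that idea using only the standard library. `ImpliedDirectories` is a hypothetical helper for illustration, not an existing Arrow function, and it assumes `/` as the separator.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): given the keys returned by a flat,
// delimiter-less listing, return every directory implied by those keys,
// sorted and without a trailing separator.
std::vector<std::string> ImpliedDirectories(const std::vector<std::string>& keys,
                                            char sep = '/') {
  std::set<std::string> dirs;
  for (const auto& key : keys) {
    // Every separator inside a key marks the end of one parent directory.
    for (std::size_t pos = key.find(sep); pos != std::string::npos;
         pos = key.find(sep, pos + 1)) {
      if (pos > 0) {
        dirs.insert(key.substr(0, pos));
      }
    }
  }
  return std::vector<std::string>(dirs.begin(), dirs.end());
}
```

For keys such as `a/b/c.parquet`, `a/d.parquet` and `e.parquet`, this yields the directories `a` and `a/b`. Empty directories that exist only as marker objects would not be recovered this way unless their marker keys are part of the listing.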

Reporter: Francois Saint-Jacques / @fsaintjacques

Related issues:

[C++] Make S3 recursive walks parallel (relates to)

Note: This issue was originally created as ARROW-8884. Please see the migration documentation for further details.


asfimport · 2020-12-02T12:41:59Z

Antoine Pitrou / @pitrou:
Related: ARROW-10788

westonpace · 2023-03-21T15:25:18Z

Mentioned in #34213

I have no idea what the implications are.

@pitrou

I attempted some investigation. In the AWS CLI it appears that the delimiter is only used when the listing is non-recursive.

In S3FS the delimiter is skipped when looking for a file recursively.

The S3 docs state:

If you issue a list request with a delimiter, you can browse your hierarchy at only one level, skipping over and summarizing the (possibly millions of) keys nested at deeper levels.

My conclusion is that the delimiter's purpose is to reduce the number of files returned when you do not need to retrieve all the files. If we are doing a recursive listing then I think it is consistent with other projects and S3's intentions that we do not specify the delimiter.
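To make the difference in request counts concrete, here is a rough, self-contained simulation. It does not use the AWS SDK: `ListOnce` is a hypothetical stand-in for a single list request over an in-memory set of keys, and pagination (the max-keys limit) is ignored, so a real flat listing would need one request per page rather than exactly one.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for the response to one list request.
struct ListResult {
  std::vector<std::string> keys;             // objects returned directly
  std::vector<std::string> common_prefixes;  // "directories" rolled up by the delimiter
};

// Simulates one list request against an in-memory bucket (assumption: no
// pagination). An empty delimiter returns every key under the prefix.
ListResult ListOnce(const std::set<std::string>& bucket, const std::string& prefix,
                    const std::string& delimiter) {
  ListResult out;
  std::set<std::string> prefixes;
  for (const auto& key : bucket) {
    if (key.compare(0, prefix.size(), prefix) != 0) continue;  // not under prefix
    if (!delimiter.empty()) {
      const std::size_t pos = key.find(delimiter, prefix.size());
      if (pos != std::string::npos) {
        // Nested deeper than one level: summarize as a common prefix.
        prefixes.insert(key.substr(0, pos + delimiter.size()));
        continue;
      }
    }
    out.keys.push_back(key);
  }
  out.common_prefixes.assign(prefixes.begin(), prefixes.end());
  return out;
}

// Recursive walk with a delimiter: one request per directory visited.
// Returns the number of simulated requests issued.
std::size_t WalkWithDelimiter(const std::set<std::string>& bucket,
                              const std::string& prefix,
                              std::vector<std::string>* files) {
  std::size_t requests = 1;
  const ListResult res = ListOnce(bucket, prefix, "/");
  files->insert(files->end(), res.keys.begin(), res.keys.end());
  for (const auto& dir : res.common_prefixes) {
    requests += WalkWithDelimiter(bucket, dir, files);
  }
  return requests;
}

// Flat walk without a delimiter: a single request returns all keys.
std::size_t WalkFlat(const std::set<std::string>& bucket, const std::string& prefix,
                     std::vector<std::string>* files) {
  const ListResult res = ListOnce(bucket, prefix, "");
  files->insert(files->end(), res.keys.begin(), res.keys.end());
  return 1;
}
```

With a bucket holding `5`, `a/1`, `a/b/2`, `a/b/3` and `c/4`, the delimited walk issues four simulated requests (root, `a/`, `a/b/`, `c/`) while the flat walk issues one, and both return the same five keys. The gap grows with the number of directories, which matches the behavior described in this issue.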

asfimport mentioned this issue Jan 11, 2023

[C++] Make S3 recursive walks parallel #26728

Closed

westonpace mentioned this issue Mar 21, 2023

[C++] Performance issue listing files over S3 #34213

Closed
