[C++] Listing files with S3FileSystem is slow #25019

Open
asfimport opened this issue May 21, 2020 · 2 comments

asfimport commented May 21, 2020

Listing files on S3 is slow due to the recursive nature of the algorithm.

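The following change modifies the behavior of the S3Result to include all objects but no "grouping" (directories). This dramatically lowers the number of HTTP calls, since one paginated listing over the whole prefix replaces a separate request for every directory.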

diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
     if (!prefix.empty()) {
       req.SetPrefix(ToAwsString(prefix) + kSep);
     }
-    req.SetDelimiter(Aws::String() + kSep);
+    // req.SetDelimiter(Aws::String() + kSep);
     req.SetMaxKeys(kListObjectsMaxKeys);
 
     while (true) {

The suggested change is to add an option to Selector, e.g. no_directory_result or something similar, that enables this behavior.
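To make the difference in request counts concrete, here is a minimal, self-contained sketch that simulates both strategies against an in-memory set of keys. It does not use the AWS SDK or Arrow: `Bucket`, `Listing`, `List`, `RecursiveRequests` and `MakeExampleBucket` are hypothetical names invented for illustration, and the page size of 1000 is assumed to match S3's per-request limit.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a bucket: a sorted set of object keys.
using Bucket = std::set<std::string>;

// Assumed page size (S3 returns at most 1000 entries per list request).
constexpr std::size_t kMaxKeys = 1000;

struct Listing {
  std::vector<std::string> objects;       // keys returned as objects
  std::set<std::string> common_prefixes;  // "directories", only with a delimiter
  std::size_t requests = 0;               // simulated number of HTTP calls
};

// Simulates listing everything under `prefix`. With a delimiter, keys that
// contain another '/' after the prefix are rolled up into a common prefix.
Listing List(const Bucket& bucket, const std::string& prefix, bool use_delimiter) {
  Listing out;
  std::size_t entries = 0;
  for (const auto& key : bucket) {
    if (key.compare(0, prefix.size(), prefix) != 0) continue;
    if (use_delimiter) {
      const auto pos = key.find('/', prefix.size());
      if (pos != std::string::npos) {
        // Each distinct common prefix counts as one returned entry.
        if (out.common_prefixes.insert(key.substr(0, pos + 1)).second) ++entries;
        continue;
      }
    }
    out.objects.push_back(key);
    ++entries;
  }
  // One request per page; an empty result still costs one request.
  out.requests = entries == 0 ? 1 : (entries + kMaxKeys - 1) / kMaxKeys;
  return out;
}

// Recursive strategy: list one level with a delimiter, then descend into
// every common prefix. Returns the total number of simulated requests.
std::size_t RecursiveRequests(const Bucket& bucket, const std::string& prefix,
                              std::vector<std::string>* files) {
  const Listing level = List(bucket, prefix, /*use_delimiter=*/true);
  std::size_t requests = level.requests;
  files->insert(files->end(), level.objects.begin(), level.objects.end());
  for (const auto& dir : level.common_prefixes) {
    requests += RecursiveRequests(bucket, dir, files);
  }
  return requests;
}

// Builds keys of the form data/part=<d>/file-<f>.parquet (made-up layout).
Bucket MakeExampleBucket(int dirs, int files_per_dir) {
  Bucket bucket;
  for (int d = 0; d < dirs; ++d) {
    for (int f = 0; f < files_per_dir; ++f) {
      bucket.insert("data/part=" + std::to_string(d) + "/file-" +
                    std::to_string(f) + ".parquet");
    }
  }
  return bucket;
}
```

In this model, calling `RecursiveRequests(bucket, "", &files)` costs at least one request per directory, while `List(bucket, "", false)` costs one request per 1000 keys regardless of how the keys are nested. Real latencies and limits will differ; the sketch only illustrates why dropping the delimiter reduces the number of calls.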

Reporter: Francois Saint-Jacques / @fsaintjacques

Note: This issue was originally created as ARROW-8884. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Related: ARROW-10788

@westonpace

Mentioned in #34213

I have no idea what the implications are.

@pitrou

I attempted some investigation. In the AWS CLI it appears that the delimiter is only used when the listing is non-recursive.

In S3FS the delimiter is skipped when looking for a file recursively.

The S3 docs state:

If you issue a list request with a delimiter, you can browse your hierarchy at only one level, skipping over and summarizing the (possibly millions of) keys nested at deeper levels.

My conclusion is that the delimiter's purpose is to reduce the number of files returned when you do not need to retrieve all the files. If we are doing a recursive listing then I think it is consistent with other projects and S3's intentions that we do not specify the delimiter.
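One consequence of dropping the delimiter is that the response no longer contains common prefixes, so a caller that still wants directory entries would have to infer them from the object keys. A minimal sketch of that inference follows; `ImpliedDirectories` is a hypothetical helper written for illustration, not part of Arrow or the AWS SDK, and it assumes '/' is the separator.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns every parent "directory" implied by the given object keys.
// For example, the key "a/b/c.parquet" implies the directories "a" and "a/b".
std::set<std::string> ImpliedDirectories(const std::vector<std::string>& keys) {
  std::set<std::string> dirs;
  for (const auto& key : keys) {
    for (auto pos = key.find('/'); pos != std::string::npos;
         pos = key.find('/', pos + 1)) {
      // Skip a leading '/' so that no empty directory name is recorded.
      if (pos > 0) dirs.insert(key.substr(0, pos));
    }
  }
  return dirs;
}
```

This runs entirely on the client after the flat listing completes, so it adds no HTTP calls. Whether directories should be reported at all in this mode is exactly what an option such as the proposed no_directory_result would decide.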
