
[C++] Performance issue listing files over S3 #34213

Closed
westonpace opened this issue Feb 15, 2023 · 4 comments · Fixed by #35440

Comments

@westonpace
Member

westonpace commented Feb 15, 2023

Describe the enhancement requested

The current GetFileInfo implementation (ignoring paging/continuation) in S3 is roughly:

def list_dir(path, results):
  rsp = s3_list_objects(prefix=path, delimiter="/")
  # The real implementation issues these recursive calls in parallel
  for common_prefix in rsp.common_prefixes:
    list_dir(common_prefix, results)
  for file in rsp.files:
    results.append(file)

The prefix and delimiter parameters are S3 constructs described in more detail in the S3 docs.

Since we use `/` as the delimiter, this yields an experience very similar to "directory walking": we issue one HTTP request for every single directory.
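
For illustration, this is roughly what one delimiter-based listing looks like with boto3 (a sketch; the bucket and prefix names are assumptions, and the actual Arrow implementation uses the AWS SDK for C++, not boto3):

```
import boto3

s3 = boto3.client("s3")

# One request per "directory": the delimiter makes S3 group keys that share
# the next path segment into CommonPrefixes instead of returning them inline.
rsp = s3.list_objects_v2(Bucket="my-bucket", Prefix="dataset/", Delimiter="/")

for prefix in rsp.get("CommonPrefixes", []):
    print("subdirectory:", prefix["Prefix"])  # e.g. dataset/year=2023/
for obj in rsp.get("Contents", []):
    print("file:", obj["Key"], obj["Size"])   # only objects directly under dataset/
```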

Alternatively, one could simply do:

def list_dir(path, results):
  rsp = s3_list_objects(prefix=path)
  for file in rsp.files:
    results.append(file)

This would guarantee one HTTP request (ignoring paging) regardless of how many directories there are.
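
A minimal sketch of this flat approach with boto3 (bucket and prefix names are assumptions); with paging included, it costs one request per 1000 keys, no matter how the keys are nested:

```
import boto3

s3 = boto3.client("s3")

# No Delimiter: every key under the prefix comes back in one listing,
# paged at (by default) 1000 keys per request.
paginator = s3.get_paginator("list_objects_v2")
results = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="dataset/"):
    results.extend(obj["Key"] for obj in page.get("Contents", []))

print(len(results), "objects found")
```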

On the face of it, it would seem like this delimiter feature is always a bad idea (why would you want more HTTP requests?). However, from reading some documentation, it seems that the point of the delimiter feature is to allow concurrent list-objects calls. That assumes a situation where there are very many files (examples seem to consider millions) and they are more or less evenly distributed across prefixes (e.g. hundreds of thousands of files per prefix). Furthermore, this appears to be very much geared toward the "container in the same datacenter as the S3 bucket" situation, where the per-request latency is very small. A sketch of that concurrent pattern follows.
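
Roughly, the concurrency that the delimiter enables looks like this (a boto3/thread-pool sketch; names are assumptions, and it only pays off when each prefix holds a large share of the keys and per-request latency is low):

```
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def list_prefix(prefix):
    # Each worker fully lists one partition of the keyspace.
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket="my-bucket", Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# First request discovers the top-level "directories" ...
top = s3.list_objects_v2(Bucket="my-bucket", Prefix="dataset/", Delimiter="/")
prefixes = [p["Prefix"] for p in top.get("CommonPrefixes", [])]

# ... then each one is listed concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    all_keys = [k for keys in pool.map(list_prefix, prefixes) for k in keys]
# Files directly under the top-level prefix were already in the first response.
all_keys += [obj["Key"] for obj in top.get("Contents", [])]
```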

Some users are using S3-compatible alternatives to AWS (e.g. MinIO), or they are downloading data from outside EC2, or they simply don't have very many Parquet files per partition folder.

This leads to very slow performance (20-25x slower in #34145) when discovering datasets in S3.

I believe we should make the delimiter a property of the S3 filesystem, with the default being "no delimiter". This ought to speed up the common case and still make it possible to optimize for the case where a user has structured their dataset to benefit from delimiters.

Component(s)

C++

@pitrou
Member

pitrou commented Mar 21, 2023

Potentially the same as https://issues.apache.org/jira/browse/ARROW-8884 . I have no idea what the implications are.

@westonpace
Member Author

duplicate of #25019

@westonpace
Member Author

Duplicate of #25019

westonpace marked this as a duplicate of #25019 Mar 21, 2023
@westonpace
Member Author

Strange that the command is case-sensitive

pitrou added a commit that referenced this issue Jul 25, 2023
…s doing a recursive GetFileInfo (#35440)

### Rationale for this change

The old model of "walking" the directory could lead to a large number of calls: fully listing a bucket required one S3 API call for every single directory in the bucket. With this approach, only one call is made per 1000 files, regardless of how they are spread across directories.

The only potential regression would be if max_recursion was set to something > 1.  For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

Then if they make a request for `bucket` with `max_recursion=2` the new approach will list all 10,000 files and then eliminate the files that don't match.
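
A sketch of that filtering step in plain Python (the depth convention here is an assumption for illustration, not the exact Arrow implementation): every key under the requested prefix is listed, then keys deeper than `max_recursion` are dropped.

```
def within_max_recursion(key, base_prefix, max_recursion):
    # Depth = number of "/" separators below the requested prefix.
    relative = key[len(base_prefix):].strip("/")
    return relative.count("/") < max_recursion

keys = [f"foo/bar/file{i}.parquet" for i in range(10000)]
# With max_recursion=2 on the bucket root, all 10,000 keys are listed
# but then filtered out, because they sit under foo/bar/.
kept = [k for k in keys if within_max_recursion(k, "", 2)]
assert kept == []
```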

However, I believe these cases (using max_recursion) are much rarer than the typical case of listing all files (which dataset discovery does).

### What changes are included in this PR?

The algorithm behind GetFileInfo and DeleteDirContents in S3FileSystem has changed.

### Are these changes tested?

Yes, there should be no behavior change.  All of the existing filesystem tests will test this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.
* Closes: #34213

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 14.0.0 milestone Jul 25, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…user is doing a recursive GetFileInfo (apache#35440)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…user is doing a recursive GetFileInfo (apache#35440)
