
[C++] Performance issue listing files over S3 #34213

Closed
westonpace opened this issue Feb 15, 2023 · 4 comments · Fixed by #35440

Comments

@westonpace
Member

westonpace commented Feb 15, 2023

Describe the enhancement requested

The current GetFileInfo implementation (ignoring paging/continuation) in S3 is roughly:

def list_dir(path, results):
  rsp = s3_list_objects(prefix=path, delimiter="/")
  # The real implementation issues these recursive calls in parallel
  for common_prefix in rsp.common_prefixes:
    list_dir(common_prefix, results)
  for file in rsp.files:
    results.append(file)

The prefix and delimiter parameters are S3 constructs described in more detail in the S3 docs.

Since we use `/` as the delimiter, this yields an experience very similar to "directory walking": we issue one HTTP request for every single directory.
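
For illustration, this is roughly what one delimiter-based listing looks like with boto3 (a sketch; the bucket and prefix names are assumptions, and the actual Arrow implementation uses the AWS SDK for C++, not boto3):

```
import boto3

s3 = boto3.client("s3")

# One request per "directory": the delimiter makes S3 group keys that share
# the next path segment into CommonPrefixes instead of returning them inline.
rsp = s3.list_objects_v2(Bucket="my-bucket", Prefix="dataset/", Delimiter="/")

for prefix in rsp.get("CommonPrefixes", []):
    print("subdirectory:", prefix["Prefix"])  # e.g. dataset/year=2023/
for obj in rsp.get("Contents", []):
    print("file:", obj["Key"], obj["Size"])   # only objects directly under dataset/
```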

Alternatively, one could simply do:

def list_dir(path, results):
  rsp = s3_list_objects(prefix=path)
  for file in rsp.files:
    results.append(file)

This would guarantee one HTTP request (ignoring paging) regardless of how many directories there are.
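
A minimal sketch of this flat approach with boto3 (bucket and prefix names are assumptions); with paging included, it costs one request per 1000 keys, no matter how the keys are nested:

```
import boto3

s3 = boto3.client("s3")

# No Delimiter: every key under the prefix comes back in one listing,
# paged at (by default) 1000 keys per request.
paginator = s3.get_paginator("list_objects_v2")
results = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="dataset/"):
    results.extend(obj["Key"] for obj in page.get("Contents", []))

print(len(results), "objects found")
```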

On the face of it, it would seem like this delimiter feature is always a bad idea (why would you want more HTTP requests?). However, from reading some documentation, it seems that the point of the delimiter feature is to allow concurrent list-objects calls. That assumes a situation where there are very many files (examples seem to consider millions) and they are more or less evenly distributed across prefixes (e.g. hundreds of thousands of files per prefix). Furthermore, this appears to be very much geared toward the "container in the same datacenter as the S3 bucket" situation, where the per-request latency is very small. A sketch of that concurrent pattern follows.
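
Roughly, the concurrency that the delimiter enables looks like this (a boto3/thread-pool sketch; names are assumptions, and it only pays off when each prefix holds a large share of the keys and per-request latency is low):

```
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def list_prefix(prefix):
    # Each worker fully lists one partition of the keyspace.
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket="my-bucket", Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# First request discovers the top-level "directories" ...
top = s3.list_objects_v2(Bucket="my-bucket", Prefix="dataset/", Delimiter="/")
prefixes = [p["Prefix"] for p in top.get("CommonPrefixes", [])]

# ... then each one is listed concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    all_keys = [k for keys in pool.map(list_prefix, prefixes) for k in keys]
# Files directly under the top-level prefix were already in the first response.
all_keys += [obj["Key"] for obj in top.get("Contents", [])]
```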

Some users are using S3-compatible alternatives to AWS (e.g. MinIO), or they are downloading data from outside EC2, or they simply don't have very many Parquet files per partition folder.

This leads to very slow performance (20-25x slower in #34145) when discovering datasets in S3.

I believe we should make the delimiter a property of the S3 filesystem, with the default being "no delimiter". This ought to speed up the common case and still make it possible to optimize for the case where a user has structured their dataset to benefit from delimiters.

Component(s)

C++

@pitrou
Member

pitrou commented Mar 21, 2023

Potentially the same as https://issues.apache.org/jira/browse/ARROW-8884 . I have no idea what the implications are.

@westonpace
Member Author

duplicate of #25019

@westonpace
Member Author

Duplicate of #25019

westonpace marked this as a duplicate of #25019 Mar 21, 2023
@westonpace
Member Author

Strange that the command is case-sensitive

pitrou added a commit that referenced this issue Jul 25, 2023
…s doing a recursive GetFileInfo (#35440)

### Rationale for this change

The old model of "walking" the directory could lead to a large number of calls: fully listing a bucket required one S3 API call for every single directory in the bucket. With this approach, only one call is made per 1000 files, regardless of how they are spread across directories.

The only potential regression would be if max_recursion was set to something > 1.  For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

Then if they make a request for `bucket` with `max_recursion=2` the new approach will list all 10,000 files and then eliminate the files that don't match.
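
A sketch of that filtering step in plain Python (the depth convention here is an assumption for illustration, not the exact Arrow implementation): every key under the requested prefix is listed, then keys deeper than `max_recursion` are dropped.

```
def within_max_recursion(key, base_prefix, max_recursion):
    # Depth = number of "/" separators below the requested prefix.
    relative = key[len(base_prefix):].strip("/")
    return relative.count("/") < max_recursion

keys = [f"foo/bar/file{i}.parquet" for i in range(10000)]
# With max_recursion=2 on the bucket root, all 10,000 keys are listed
# but then filtered out, because they sit under foo/bar/.
kept = [k for k in keys if within_max_recursion(k, "", 2)]
assert kept == []
```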

However, I believe these cases (using max_recursion) are much rarer than the typical case of listing all files (which dataset discovery does).

### What changes are included in this PR?

The algorithm behind GetFileInfo and DeleteDirContents in S3FileSystem has changed.

### Are these changes tested?

Yes, there should be no behavior change.  All of the existing filesystem tests will test this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.
* Closes: #34213

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 14.0.0 milestone Jul 25, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…user is doing a recursive GetFileInfo (apache#35440)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…user is doing a recursive GetFileInfo (apache#35440)
