[C++] Performance issue listing files over S3 #34213
Comments
Potentially the same as https://issues.apache.org/jira/browse/ARROW-8884. I have no idea what the implications are.
Duplicate of #25019
Strange that the command is case-sensitive |
pitrou added a commit that referenced this issue on Jul 25, 2023:
…user is doing a recursive GetFileInfo (#35440)

### Rationale for this change

The old model of "walk"ing the directory could lead to a large number of calls. If someone is fully listing a bucket they will need to make one S3 API call for every single directory in the bucket. With this approach there is only 1 call made for every 1000 files, regardless of how they are spread across directories.

The only potential regression would be if max_recursion was set to something > 1. For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

Then if they make a request for `bucket` with `max_recursion=2` the new approach will list all 10,000 files and then eliminate the files that don't match. However, I believe these cases (using max_recursion) to be rarer and less common than the typical case of listing all files (which dataset discovery does).

### What changes are included in this PR?

The algorithm behind GetFileInfo and DeleteDirContents in S3FileSystem has changed.

### Are these changes tested?

Yes, there should be no behavior change. All of the existing filesystem tests will test this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.

* Closes: #34213

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Describe the enhancement requested
The current GetFileInfo implementation (ignoring paging/continuation) in S3 is roughly:
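A sketch of that walk-based pattern, using boto3 here for concreteness (the actual implementation is C++ over the AWS SDK, but the call pattern is the same), and ignoring pagination as noted above:

```python
import boto3

s3 = boto3.client("s3")

def get_file_info(bucket, prefix=""):
    infos = []
    # Delimiter='/' makes S3 group keys below the first '/' after `prefix`
    # into CommonPrefixes, so each call sees only one "directory level".
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    infos.extend(obj["Key"] for obj in resp.get("Contents", []))
    for sub in resp.get("CommonPrefixes", []):
        # One more HTTP request for every single "directory".
        infos.extend(get_file_info(bucket, sub["Prefix"]))
    return infos
```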
The `prefix` and `delimiter` constructs are S3 constructs, described in more detail in the S3 docs. Since we use `/` as the delimiter, this yields an experience very similar to "directory walking": we issue one HTTP request for every single directory. Alternatively, one could simply do:
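A sketch of the flat alternative, again with boto3 for concreteness and ignoring pagination:

```python
import boto3

s3 = boto3.client("s3")

def get_file_info_flat(bucket, prefix=""):
    # No Delimiter: S3 returns every key under `prefix` in one flat
    # listing (up to 1000 keys per page), regardless of how the keys
    # are spread across "directories".
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```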
This would guarantee one HTTP request (ignoring paging) regardless of how many directories there are.
On the face of it, the delimiter feature would seem like a bad idea in all cases (why would you want more HTTP requests?). From reading some documentation, however, it seems the point of the delimiter feature is to allow concurrent list-objects calls. That expects a situation where there are many, many files (examples seem to consider millions) distributed more or less evenly across prefixes (e.g. hundreds of thousands of files per prefix). Furthermore, it appears to be very much geared toward the "container in the same datacenter as the S3 bucket" situation, where per-request latency is very small.
Some users are using non-S3 technologies (e.g. MinIO), or downloading data from outside EC2, or simply don't have very many parquet files per partition folder.
This leads to very slow (20-25x slower in #34145) performance when discovering datasets in S3.
I believe we should make the delimiter a property of the S3 filesystem, with the default being "no delimiter". This ought to speed up the common case while still making it possible to optimize for the case where a user has structured their dataset to benefit from delimiters.
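A minimal sketch of how that option might look from Python. The `delimiter` parameter shown here is hypothetical (it does not exist in pyarrow's `S3FileSystem` today); the name and default are illustrative only:

```python
from pyarrow.fs import S3FileSystem, FileSelector

# Hypothetical: with no delimiter configured (the proposed default), a
# recursive listing becomes a single flat ListObjectsV2 call (plus
# continuation pages) instead of one call per directory.
fs = S3FileSystem(region="us-east-1")

# Hypothetical opt-in: users whose buckets benefit from concurrent
# per-prefix listing could explicitly request '/' as the delimiter.
# fs = S3FileSystem(region="us-east-1", delimiter="/")

infos = fs.get_file_info(FileSelector("my-bucket", recursive=True))
```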
Component(s)
C++