-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40279: [C++] Reduce S3Client initialization time #40299
Conversation
@github-actions crossbow submit -g cpp -g python -g wheel |
This comment was marked as outdated.
This comment was marked as outdated.
35c040f
to
3ba9dca
Compare
@github-actions crossbow submit -g cpp -g wheel |
This comment was marked as outdated.
This comment was marked as outdated.
Oh, it looks like the RTools 40 build is using a very old AWS SDK version (1.7.365):
Do we know why that is @paleolimbot @jonkeane @assignUser ? |
@github-actions crossbow submit -g cpp -g wheel |
This comment was marked as outdated.
This comment was marked as outdated.
790f39b
to
2be5194
Compare
@github-actions crossbow submit -g cpp -g python -g wheel |
This comment was marked as outdated.
This comment was marked as outdated.
It seems that RTools 40 still uses old MSYS2. |
2be5194
to
09dc998
Compare
@github-actions crossbow submit -g cpp -g wheel |
This comment was marked as outdated.
This comment was marked as outdated.
09dc998
to
d51037a
Compare
d51037a
to
6b770c0
Compare
@github-actions crossbow submit -g cpp -g wheel |
Revision: 6b770c0 Submitted crossbow builds: ursacomputing/crossbow @ actions-8946df4af2 |
|
The R failures are resolved by #40710 |
6b770c0
to
1ca1d81
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
options_.endpoint_override.empty() || options_.force_virtual_addressing; | ||
|
||
#ifdef ARROW_S3_HAS_S3CLIENT_CONFIGURATION | ||
client_config_.useVirtualAddressing = use_virtual_addressing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need client_config_.payloadSigningPolicy = Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We just had to pass the argument on older verisons:
https://github.com/aws/aws-sdk-cpp/blob/ffd81252bec92f3e0587e144b07c05f8aed28eb1/aws-cpp-sdk-s3/include/aws/s3/S3Client.h#L460-L465
1ca1d81
to
d862a6a
Compare
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 3095344. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
By default, S3Client instantiation is extremely slow (around 1ms for every instance). Investigation led to the conclusion that most of this time was spent inside the AWS SDK, parsing a hardcoded piece of JSON data when instantiating a AWS rule engine.
Python benchmarks show this repeated initiatlization cost:
What changes are included in this PR?
Instead of letting the AWS SDK create a new S3EndpointProvider for each S3Client, arrange to only create a single S3EndpointProvider per set of endpoint configuration options. This lets the 1ms instantiation cost be paid only when a new set of endpoint configuration options is given.
Python benchmarks show the initialization cost has become a one-time cost:
Are these changes tested?
By existing tests.
Are there any user-facing changes?
No.
S3FileSystem
slow to deserialize due to AWS rule engine JSON parsing #40279