GH-40279: [C++] Reduce S3Client initialization time #40299

Merged (2 commits) on Mar 25, 2024

@pitrou pitrou commented Feb 29, 2024

Rationale for this change

By default, S3Client instantiation is extremely slow (around 1 ms for every instance). Investigation led to the conclusion that most of this time was spent inside the AWS SDK, parsing a hardcoded piece of JSON data when instantiating an AWS rule engine.

Python benchmarks show this repeated initialization cost:

>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 21.1 ms, sys: 0 ns, total: 21.1 ms
Wall time: 20.9 ms
>>> %time s = S3FileSystem()
CPU times: user 2.37 ms, sys: 0 ns, total: 2.37 ms
Wall time: 2.18 ms
>>> %time s = S3FileSystem()
CPU times: user 2.42 ms, sys: 0 ns, total: 2.42 ms
Wall time: 2.23 ms

>>> %timeit s = S3FileSystem()
1.28 ms ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem()
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.26 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

What changes are included in this PR?

Instead of letting the AWS SDK create a new S3EndpointProvider for each S3Client, arrange to only create a single S3EndpointProvider per set of endpoint configuration options. This lets the 1ms instantiation cost be paid only when a new set of endpoint configuration options is given.
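
To illustrate the idea only, here is a minimal Python sketch of a cache keyed by endpoint configuration. It is not the C++ code in this PR: `EndpointConfig`, `EndpointProvider` and `EndpointProviderCache` are hypothetical names, and the fields chosen for the key are assumptions rather than the exact set of options the PR uses.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """Hypothetical set of options that determine endpoint resolution.

    frozen=True makes instances hashable, so they can serve as dict keys.
    """
    region: str
    scheme: str = "https"
    endpoint_override: str = ""
    use_virtual_addressing: bool = True


class EndpointProvider:
    """Stand-in for an object that is expensive to construct."""
    instances_created = 0

    def __init__(self, config):
        # In the real SDK this is where the costly setup would happen.
        EndpointProvider.instances_created += 1
        self.config = config


class EndpointProviderCache:
    """Hand out one shared provider per distinct configuration."""

    def __init__(self):
        self._lock = threading.Lock()
        self._providers = {}

    def get(self, config):
        # The lock keeps concurrent callers from building duplicates.
        with self._lock:
            provider = self._providers.get(config)
            if provider is None:
                provider = EndpointProvider(config)
                self._providers[config] = provider
            return provider
```

With such a cache, two clients built from equal options share one provider, and only a previously unseen set of options pays the construction cost.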

Python benchmarks show the initialization cost has become a one-time cost:

>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 19.8 ms
>>> %time s = S3FileSystem()
CPU times: user 404 µs, sys: 49 µs, total: 453 µs
Wall time: 266 µs
>>> %time s = S3FileSystem()
CPU times: user 361 µs, sys: 42 µs, total: 403 µs
Wall time: 249 µs

>>> %timeit s = S3FileSystem()
50.4 µs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
33.5 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
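
The `%time` and `%timeit` commands above are IPython magics. Outside IPython, a rough equivalent could look like the sketch below, which separates the first call from the steady-state cost. The helper name `first_and_steady` is hypothetical, and its numbers will not match `%timeit` exactly, since `%timeit` repeats the measurement over several runs.

```python
import time


def first_and_steady(factory, repeats=1000):
    """Return (first_call_seconds, mean_later_call_seconds) for factory()."""
    # Time the very first call, which includes any one-time setup.
    start = time.perf_counter()
    factory()
    first = time.perf_counter() - start

    # Time the following calls and average them.
    start = time.perf_counter()
    for _ in range(repeats):
        factory()
    steady = (time.perf_counter() - start) / repeats
    return first, steady
```

Passing `S3FileSystem` (imported from `pyarrow.fs`) as `factory` in a fresh interpreter would be one way to check that the first instantiation is much slower than later ones.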

Are these changes tested?

By existing tests.

Are there any user-facing changes?

No.

pitrou commented Feb 29, 2024

@github-actions crossbow submit -g cpp -g python -g wheel

pitrou commented Feb 29, 2024

@github-actions crossbow submit -g cpp -g wheel

pitrou commented Feb 29, 2024

Oh, it looks like the RTools 40 build is using a very old AWS SDK version (1.7.365):

2024-02-29T19:29:45.9968248Z ==> Making package: mingw-w64-arrow 15.0.0.9000-8000 (Thu, Feb 29, 2024  7:29:45 PM)
2024-02-29T19:29:46.0104530Z ==> Checking runtime dependencies...
2024-02-29T19:29:46.0933579Z ==> Installing missing dependencies...
2024-02-29T19:29:46.1681983Z resolving dependencies...
2024-02-29T19:29:46.1816658Z looking for conflicting packages...
2024-02-29T19:29:46.1842088Z 
2024-02-29T19:29:46.1846702Z Packages (12) mingw-w64-ucrt-x86_64-boost-1.67.0-9002  mingw-w64-ucrt-x86_64-libssh2-1.11.0-9801  mingw-w64-ucrt-x86_64-nghttp2-1.51.0-1  mingw-w64-ucrt-x86_64-openssl-3.1.1-9800  mingw-w64-ucrt-x86_64-aws-sdk-cpp-1.7.365-1  mingw-w64-ucrt-x86_64-brotli-1.0.9-4  mingw-w64-ucrt-x86_64-curl-8.1.2-9000  mingw-w64-ucrt-x86_64-libutf8proc-2.4.0-2  mingw-w64-ucrt-x86_64-lz4-1.8.2-1  mingw-w64-ucrt-x86_64-re2-20200801-1  mingw-w64-ucrt-x86_64-snappy-1.1.7-2  mingw-w64-ucrt-x86_64-thrift-0.13.0-1

Do we know why that is, @paleolimbot @jonkeane @assignUser?

pitrou commented Feb 29, 2024

@github-actions crossbow submit -g cpp -g wheel

pitrou commented Feb 29, 2024

@github-actions crossbow submit -g cpp -g python -g wheel

kou commented Mar 1, 2024

> Oh, it looks like the RTools 40 build is using a very old AWS SDK version (1.7.365):

It seems that RTools 40 still uses an old MSYS2.

pitrou commented Mar 14, 2024

@github-actions crossbow submit -g cpp -g wheel

pitrou commented Mar 21, 2024

@github-actions crossbow submit -g cpp -g wheel

Revision: 6b770c0

Submitted crossbow builds: ursacomputing/crossbow @ actions-8946df4af2

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@pitrou pitrou changed the title from "EXPERIMENT: [C++] Use S3ClientConfiguration" to "GH-40279: [C++] Reduce S3Client initialization time" Mar 21, 2024
⚠️ GitHub issue #40279 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou marked this pull request as ready for review March 21, 2024 18:30
@pitrou pitrou requested review from kou and felipecrv March 21, 2024 18:32
@jonkeane

The R failures are resolved by #40710

@kou kou left a comment

+1

options_.endpoint_override.empty() || options_.force_virtual_addressing;

#ifdef ARROW_S3_HAS_S3CLIENT_CONFIGURATION
client_config_.useVirtualAddressing = use_virtual_addressing;

Do we need client_config_.payloadSigningPolicy = Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never?

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting review" label Mar 21, 2024
@github-actions github-actions bot removed the "awaiting merge" label Mar 22, 2024
@github-actions github-actions bot added the "awaiting changes" and "awaiting merge" labels and removed the "awaiting changes" label Mar 22, 2024
@pitrou pitrou merged commit 3095344 into apache:main Mar 25, 2024
35 of 36 checks passed
@pitrou pitrou removed the "awaiting merge" label Mar 25, 2024
@pitrou pitrou deleted the s3clientconfiguration branch March 25, 2024 15:49
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 3095344.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.
