
[Python/C++] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing #40279

Closed
fjetter opened this issue Feb 28, 2024 · 11 comments
@fjetter
Contributor

fjetter commented Feb 28, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Deserializing a pickled S3FileSystem instance is surprisingly slow

import pickle

import boto3
from pyarrow.fs import S3FileSystem

# Going via boto3 is not strictly necessary, but setting the keys and token up front avoids one HTTP request during init
session = boto3.session.Session()
credentials = session.get_credentials()

fs = S3FileSystem(
    secret_key=credentials.secret_key,
    access_key=credentials.access_key,
    region="us-east-2",
    session_token=credentials.token,
)
# Note: the same slowdown shows with a bare S3FileSystem(), but that issues an HTTP request; passing credentials explicitly emphasizes the slow JSON parser (see profile below)
%timeit pickle.loads(pickle.dumps(fs))

This takes 1.01 ms ± 153 µs per loop on my machine.

Looking at a py-spy profile shows that most of the time is spent in some internal JSON parsing. Is there a way to avoid this?

[py-spy profile screenshot: most of the deserialization time is spent in AWS SDK internal JSON parsing]

Component(s)

Python

@fjetter fjetter changed the title S3FileSystem slow to deserialize due to AWS rule engine JSON parsing [Python] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing Feb 28, 2024
@fjetter fjetter changed the title [Python] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing [Python/C++] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing Feb 28, 2024
@jorisvandenbossche
Member

cc @pitrou

@pitrou
Member

pitrou commented Feb 29, 2024

Looking at a py-spy profile shows that most of the time is spent in some internal JSON parsing. Is there a way to avoid this?

I have no idea. S3 configuration mechanisms are still a mystery to me.

@pitrou
Member

pitrou commented Feb 29, 2024

Similar observations here:

>>> %timeit s = S3FileSystem()
1.29 ms ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.29 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
1.29 ms ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@pitrou
Member

pitrou commented Feb 29, 2024

@fjetter Is it the repeated cost that bothers you? As in, if only the first S3FileSystem instantiation in a Python process was slow, would you mind?

@fjetter
Copy link
Contributor Author

fjetter commented Feb 29, 2024

If the first instantiation is slow but repeated ones reuse cached state, that would be fine. I'm running into this because these filesystems are de/serialized as part of dask tasks, i.e. paying the cost once per interpreter is fine, but once per instance is not great.
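Until this is addressed upstream, one possible per-interpreter workaround on the user side (a sketch only, not how dask itself handles this) is to memoize deserialization on the pickled bytes, so every task carrying the same payload gets the same instance. This assumes the deserialized object is safe to share across tasks in one process:

```python
import pickle
from functools import lru_cache


@lru_cache(maxsize=None)
def loads_cached(payload: bytes):
    # Deserialize once per unique payload per interpreter;
    # later calls with the same pickled bytes reuse the instance.
    return pickle.loads(payload)


# Usage sketch: instead of pickle.loads(blob) inside each task,
# call loads_cached(blob).
```

The cache key is the raw pickle payload, so two filesystems pickled with identical configuration collapse to one shared instance.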

@pitrou
Member

pitrou commented Feb 29, 2024

Note to self: this is the construction path we go through currently:

S3Client::S3Client(const std::shared_ptr<AWSCredentialsProvider>& credentialsProvider,
                   const Client::ClientConfiguration& clientConfiguration,
                   Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy signPayloads /*= Never*/,
                   bool useVirtualAddressing /*= true*/,
                   Aws::S3::US_EAST_1_REGIONAL_ENDPOINT_OPTION USEast1RegionalEndPointOption) :
  BASECLASS(clientConfiguration,
            Aws::MakeShared<Aws::Auth::S3ExpressSignerProvider>(ALLOCATION_TAG,
                                                                credentialsProvider,
                                                                Aws::MakeShared<DefaultS3ExpressIdentityProvider>(ALLOCATION_TAG, *this),
                                                                SERVICE_NAME,
                                                                Aws::Region::ComputeSignerRegion(clientConfiguration.region),
                                                                signPayloads,
                                                                /*doubleEncodeValue*/ false),
            Aws::MakeShared<S3ErrorMarshaller>(ALLOCATION_TAG)),
    m_clientConfiguration(clientConfiguration, signPayloads, useVirtualAddressing, USEast1RegionalEndPointOption),
    m_executor(clientConfiguration.executor),
    m_endpointProvider(Aws::MakeShared<S3EndpointProvider>(ALLOCATION_TAG))
{
  init(m_clientConfiguration);
}

void S3Client::init(const S3::S3ClientConfiguration& config)
{
  AWSClient::SetServiceClientName("S3");
  AWS_CHECK_PTR(SERVICE_NAME, m_endpointProvider);
  m_endpointProvider->InitBuiltInParameters(config);
}

It's the Aws::MakeShared<S3EndpointProvider>(ALLOCATION_TAG) that takes so much time in practice.
We might cache the endpoint providers based on the config options that are used by InitBuiltInParameters, that is:

  • in S3BuiltInParameters::SetFromClientConfiguration: config.useUSEast1RegionalEndPointOption, config.useArnRegion, config.disableMultiRegionAccessPoints, config.useVirtualAddressing
  • in BuiltInParameters::SetFromClientConfiguration: config.region, config.useFIPS, config.useDualStack, config.endpointOverride, config.scheme

We would need AWS SDK >= 1.10 to have S3ClientConfiguration and its additional fields.
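The caching idea above can be sketched in Python pseudocode (the class and function names here are hypothetical stand-ins for the C++ endpoint-provider machinery): key a process-wide cache on exactly the configuration fields that InitBuiltInParameters consumes, and hand out one shared provider per distinct key.

```python
import threading

class EndpointProvider:
    # Hypothetical stand-in for Aws::MakeShared<S3EndpointProvider>:
    # construction is where the expensive JSON rule-engine parsing happens.
    def __init__(self, key):
        self.key = key

_cache = {}
_lock = threading.Lock()

def get_endpoint_provider(region, use_fips, use_dual_stack,
                          endpoint_override, scheme, use_virtual_addressing):
    # Key on exactly the options read by InitBuiltInParameters.
    key = (region, use_fips, use_dual_stack,
           endpoint_override, scheme, use_virtual_addressing)
    with _lock:
        provider = _cache.get(key)
        if provider is None:
            provider = _cache[key] = EndpointProvider(key)
        return provider
```

With this shape, the expensive construction runs once per distinct option set; every client created with the same options reuses the cached provider.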

@pitrou
Member

pitrou commented Feb 29, 2024

What is our minimum required AWS SDK version? @kou would you know the answer?

Edit: trying to find out in #40299

@pitrou
Member

pitrou commented Feb 29, 2024

With the caching PoC in #40299 I get the following:

>>> %timeit s = S3FileSystem(anonymous=True)
34.2 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
32.8 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem()
52.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@kou
Member

kou commented Mar 1, 2024

What is our minimum required AWS SDK version?

I don't think we define one. Older AWS SDK versions can be used, but there are some known problems. (I think we recommend at least the currently bundled version, 1.10.55, to avoid them.)

If we want to define it, we can do it like the following:

diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake
index 951028b699..cf9232f331 100644
--- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -5031,7 +5031,7 @@ macro(build_awssdk)
 endmacro()
 
 if(ARROW_S3)
-  resolve_dependency(AWSSDK HAVE_ALT TRUE)
+  resolve_dependency(AWSSDK HAVE_ALT TRUE REQUIRED_VERSION "X.Y.Z")
 
   message(STATUS "Found AWS SDK headers: ${AWSSDK_INCLUDE_DIR}")
   message(STATUS "Found AWS SDK libraries: ${AWSSDK_LINK_LIBRARIES}")

@pitrou
Member

pitrou commented Mar 21, 2024

I've marked #40299 ready for review.

pitrou added a commit that referenced this issue Mar 25, 2024
### Rationale for this change

By default, S3Client instantiation is extremely slow (around 1 ms per instance). Investigation showed that most of this time is spent inside the AWS SDK, parsing a hardcoded piece of JSON data while instantiating an AWS rule engine.

Python benchmarks show this repeated initialization cost:
```python
>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 21.1 ms, sys: 0 ns, total: 21.1 ms
Wall time: 20.9 ms
>>> %time s = S3FileSystem()
CPU times: user 2.37 ms, sys: 0 ns, total: 2.37 ms
Wall time: 2.18 ms
>>> %time s = S3FileSystem()
CPU times: user 2.42 ms, sys: 0 ns, total: 2.42 ms
Wall time: 2.23 ms

>>> %timeit s = S3FileSystem()
1.28 ms ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem()
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.26 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

### What changes are included in this PR?

Instead of letting the AWS SDK create a new S3EndpointProvider for each S3Client, arrange to create only a single S3EndpointProvider per set of endpoint configuration options. The ~1 ms instantiation cost is then paid only when a new set of endpoint configuration options is first seen.

Python benchmarks show the initialization cost has become a one-time cost:
```python
>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 19.8 ms
>>> %time s = S3FileSystem()
CPU times: user 404 µs, sys: 49 µs, total: 453 µs
Wall time: 266 µs
>>> %time s = S3FileSystem()
CPU times: user 361 µs, sys: 42 µs, total: 403 µs
Wall time: 249 µs

>>> %timeit s = S3FileSystem()
50.4 µs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
33.5 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #40279

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Mar 25, 2024
@pitrou
Member

pitrou commented Mar 25, 2024

Issue resolved by pull request #40299.

@pitrou pitrou closed this as completed Mar 25, 2024