
[Python/C++] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing #40279

Closed
fjetter opened this issue Feb 28, 2024 · 11 comments
@fjetter
Contributor

fjetter commented Feb 28, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Deserializing a pickled S3FileSystem instance is surprisingly slow

import pickle

import boto3
from pyarrow.fs import S3FileSystem

# Going via boto3 is not strictly necessary, but setting the keys and token up front avoids one HTTP request during init
session = boto3.session.Session()
credentials = session.get_credentials()

fs = S3FileSystem(
    secret_key=credentials.secret_key,
    access_key=credentials.access_key,
    region="us-east-2",
    session_token=credentials.token,
)
# Note: the same slowdown shows with a bare S3FileSystem(), but that issues an HTTP request; passing credentials explicitly emphasizes the slow JSON parser (see profile below)
%timeit pickle.loads(pickle.dumps(fs))

This takes 1.01 ms ± 153 µs per loop on my machine.

Looking at a py-spy profile shows that most of the time is spent in some internal JSON parsing. Is there a way to avoid this?

[py-spy profile screenshot: most of the deserialization time is spent in AWS SDK internal JSON parsing]

Component(s)

Python

@fjetter fjetter changed the title S3FileSystem slow to deserialize due to AWS rule engine JSON parsing [Python] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing Feb 28, 2024
@fjetter fjetter changed the title [Python] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing [Python/C++] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing Feb 28, 2024
@jorisvandenbossche
Member

cc @pitrou

@pitrou
Member

pitrou commented Feb 29, 2024

Looking at a py-spy profile shows that most of the time is spent in some internal JSON parsing. Is there a way to avoid this?

I have no idea. S3 configuration mechanisms are still a mystery to me.

@pitrou
Member

pitrou commented Feb 29, 2024

Similar observations here:

>>> %timeit s = S3FileSystem()
1.29 ms ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.29 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
1.29 ms ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@pitrou
Member

pitrou commented Feb 29, 2024

@fjetter Is it the repeated cost that bothers you? As in, if only the first S3FileSystem instantiation in a Python process was slow, would you mind?

@fjetter
Copy link
Contributor Author

fjetter commented Feb 29, 2024

If the first instantiation is slow but repeated ones reuse cached state, that would be fine. I'm running into this because these filesystems are de/serialized as part of dask tasks, i.e. paying the cost once per interpreter is fine, but once per instance is not great.
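Until this is addressed upstream, one possible per-interpreter workaround on the user side (a sketch only, not how dask itself handles this) is to memoize deserialization on the pickled bytes, so every task carrying the same payload gets the same instance. This assumes the deserialized object is safe to share across tasks in one process:

```python
import pickle
from functools import lru_cache


@lru_cache(maxsize=None)
def loads_cached(payload: bytes):
    # Deserialize once per unique payload per interpreter;
    # later calls with the same pickled bytes reuse the instance.
    return pickle.loads(payload)


# Usage sketch: instead of pickle.loads(blob) inside each task,
# call loads_cached(blob).
```

The cache key is the raw pickle payload, so two filesystems pickled with identical configuration collapse to one shared instance.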

@pitrou
Member

pitrou commented Feb 29, 2024

Note to self: this is the construction path we go through currently:

S3Client::S3Client(const std::shared_ptr<AWSCredentialsProvider>& credentialsProvider,
                   const Client::ClientConfiguration& clientConfiguration,
                   Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy signPayloads /*= Never*/,
                   bool useVirtualAddressing /*= true*/,
                   Aws::S3::US_EAST_1_REGIONAL_ENDPOINT_OPTION USEast1RegionalEndPointOption) :
  BASECLASS(clientConfiguration,
            Aws::MakeShared<Aws::Auth::S3ExpressSignerProvider>(ALLOCATION_TAG,
                                                                credentialsProvider,
                                                                Aws::MakeShared<DefaultS3ExpressIdentityProvider>(ALLOCATION_TAG, *this),
                                                                SERVICE_NAME,
                                                                Aws::Region::ComputeSignerRegion(clientConfiguration.region),
                                                                signPayloads,
                                                                /*doubleEncodeValue*/ false),
            Aws::MakeShared<S3ErrorMarshaller>(ALLOCATION_TAG)),
    m_clientConfiguration(clientConfiguration, signPayloads, useVirtualAddressing, USEast1RegionalEndPointOption),
    m_executor(clientConfiguration.executor),
    m_endpointProvider(Aws::MakeShared<S3EndpointProvider>(ALLOCATION_TAG))
{
  init(m_clientConfiguration);
}

void S3Client::init(const S3::S3ClientConfiguration& config)
{
  AWSClient::SetServiceClientName("S3");
  AWS_CHECK_PTR(SERVICE_NAME, m_endpointProvider);
  m_endpointProvider->InitBuiltInParameters(config);
}

It's the Aws::MakeShared<S3EndpointProvider>(ALLOCATION_TAG) that takes so much time in practice.
We might cache the endpoint providers based on the config options that are used by InitBuiltInParameters, that is:

  • in S3BuiltInParameters::SetFromClientConfiguration: config.useUSEast1RegionalEndPointOption, config.useArnRegion, config.disableMultiRegionAccessPoints, config.useVirtualAddressing
  • in BuiltInParameters::SetFromClientConfiguration: config.region, config.useFIPS, config.useDualStack, config.endpointOverride, config.scheme

We would need AWS SDK >= 1.10 to have S3ClientConfiguration and its additional fields.
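The caching idea above can be sketched in Python pseudocode (the class and function names here are hypothetical stand-ins for the C++ endpoint-provider machinery): key a process-wide cache on exactly the configuration fields that InitBuiltInParameters consumes, and hand out one shared provider per distinct key.

```python
import threading

class EndpointProvider:
    # Hypothetical stand-in for Aws::MakeShared<S3EndpointProvider>:
    # construction is where the expensive JSON rule-engine parsing happens.
    def __init__(self, key):
        self.key = key

_cache = {}
_lock = threading.Lock()

def get_endpoint_provider(region, use_fips, use_dual_stack,
                          endpoint_override, scheme, use_virtual_addressing):
    # Key on exactly the options read by InitBuiltInParameters.
    key = (region, use_fips, use_dual_stack,
           endpoint_override, scheme, use_virtual_addressing)
    with _lock:
        provider = _cache.get(key)
        if provider is None:
            provider = _cache[key] = EndpointProvider(key)
        return provider
```

With this shape, the expensive construction runs once per distinct option set; every client created with the same options reuses the cached provider.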

@pitrou
Member

pitrou commented Feb 29, 2024

What is our minimum required AWS SDK version? @kou would you know the answer?

Edit: trying to find out in #40299

@pitrou
Member

pitrou commented Feb 29, 2024

With the caching PoC in #40299 I get the following:

>>> %timeit s = S3FileSystem(anonymous=True)
34.2 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
32.8 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem()
52.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@kou
Member

kou commented Mar 1, 2024

What is our minimum required AWS SDK version?

I don't think we define one. Older AWS SDK versions can be used, but there are some known problems. (I think we recommend at least the currently bundled version, 1.10.55, to avoid them.)

If we want to define it, we can do it like the following:

diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake
index 951028b699..cf9232f331 100644
--- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -5031,7 +5031,7 @@ macro(build_awssdk)
 endmacro()
 
 if(ARROW_S3)
-  resolve_dependency(AWSSDK HAVE_ALT TRUE)
+  resolve_dependency(AWSSDK HAVE_ALT TRUE REQUIRED_VERSION "X.Y.Z")
 
   message(STATUS "Found AWS SDK headers: ${AWSSDK_INCLUDE_DIR}")
   message(STATUS "Found AWS SDK libraries: ${AWSSDK_LINK_LIBRARIES}")

@pitrou
Member

pitrou commented Mar 21, 2024

I've marked #40299 ready for review.

pitrou added a commit that referenced this issue Mar 25, 2024
### Rationale for this change

By default, S3Client instantiation is extremely slow (around 1 ms per instance). Investigation showed that most of this time is spent inside the AWS SDK, parsing a hardcoded piece of JSON data while instantiating an AWS rule engine.

Python benchmarks show this repeated initialization cost:
```python
>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 21.1 ms, sys: 0 ns, total: 21.1 ms
Wall time: 20.9 ms
>>> %time s = S3FileSystem()
CPU times: user 2.37 ms, sys: 0 ns, total: 2.37 ms
Wall time: 2.18 ms
>>> %time s = S3FileSystem()
CPU times: user 2.42 ms, sys: 0 ns, total: 2.42 ms
Wall time: 2.23 ms

>>> %timeit s = S3FileSystem()
1.28 ms ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem()
1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.26 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

### What changes are included in this PR?

Instead of letting the AWS SDK create a new S3EndpointProvider for each S3Client, arrange to create only a single S3EndpointProvider per set of endpoint configuration options. The ~1 ms instantiation cost is then paid only when a new set of endpoint configuration options is first seen.

Python benchmarks show the initialization cost has become a one-time cost:
```python
>>> from pyarrow.fs import S3FileSystem

>>> %time s = S3FileSystem()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 19.8 ms
>>> %time s = S3FileSystem()
CPU times: user 404 µs, sys: 49 µs, total: 453 µs
Wall time: 266 µs
>>> %time s = S3FileSystem()
CPU times: user 361 µs, sys: 42 µs, total: 403 µs
Wall time: 249 µs

>>> %timeit s = S3FileSystem()
50.4 µs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
33.5 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #40279

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Mar 25, 2024
@pitrou
Member

pitrou commented Mar 25, 2024

Issue resolved by pull request #40299.

@pitrou pitrou closed this as completed Mar 25, 2024