-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python/C++] S3FileSystem
slow to deserialize due to AWS rule engine JSON parsing
#40279
Comments
S3FileSystem
slow to deserialize due to AWS rule engine JSON parsingS3FileSystem
slow to deserialize due to AWS rule engine JSON parsing
S3FileSystem
slow to deserialize due to AWS rule engine JSON parsingS3FileSystem
slow to deserialize due to AWS rule engine JSON parsing
cc @pitrou |
I have no idea. S3 configuration mechanisms are still a mystery to me. |
Similar observations here: >>> %timeit s = S3FileSystem()
1.29 ms ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.29 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
1.29 ms ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) |
@fjetter Is it the repeated cost that bothers you? As in, if only the first |
If the first instantiation is slow but repeated ones would reuse, etc. that would be fine. I'm running into this issue because those things are de/serialized as part of dask tasks, i.e. once per interpreter is fine, once per instance is not great |
Note to self: this is the construction path we go through currently: S3Client::S3Client(const std::shared_ptr<AWSCredentialsProvider>& credentialsProvider,
const Client::ClientConfiguration& clientConfiguration,
Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy signPayloads /*= Never*/,
bool useVirtualAddressing /*= true*/,
Aws::S3::US_EAST_1_REGIONAL_ENDPOINT_OPTION USEast1RegionalEndPointOption) :
BASECLASS(clientConfiguration,
Aws::MakeShared<Aws::Auth::S3ExpressSignerProvider>(ALLOCATION_TAG,
credentialsProvider,
Aws::MakeShared<DefaultS3ExpressIdentityProvider>(ALLOCATION_TAG, *this),
SERVICE_NAME,
Aws::Region::ComputeSignerRegion(clientConfiguration.region),
signPayloads,
/*doubleEncodeValue*/ false),
Aws::MakeShared<S3ErrorMarshaller>(ALLOCATION_TAG)),
m_clientConfiguration(clientConfiguration, signPayloads, useVirtualAddressing, USEast1RegionalEndPointOption),
m_executor(clientConfiguration.executor),
m_endpointProvider(Aws::MakeShared<S3EndpointProvider>(ALLOCATION_TAG))
{
init(m_clientConfiguration);
}
void S3Client::init(const S3::S3ClientConfiguration& config)
{
AWSClient::SetServiceClientName("S3");
AWS_CHECK_PTR(SERVICE_NAME, m_endpointProvider);
m_endpointProvider->InitBuiltInParameters(config);
} It's the
We would need at least the AWS SDK >= 1.10 to have |
With the caching PoC in #40299 I get the following: >>> %timeit s = S3FileSystem(anonymous=True)
34.2 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
32.8 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s = S3FileSystem()
52.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) |
I think that we don't define it. Old AWS SDK can be used but there are some known problems. (I think that we recommend at least the current bundled version (1.10.55) to avoid known problems.) If we want to define it, we can do it like the following: diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake
index 951028b699..cf9232f331 100644
--- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -5031,7 +5031,7 @@ macro(build_awssdk)
endmacro()
if(ARROW_S3)
- resolve_dependency(AWSSDK HAVE_ALT TRUE)
+ resolve_dependency(AWSSDK HAVE_ALT TRUE REQUIRED_VERSION "X.Y.Z")
message(STATUS "Found AWS SDK headers: ${AWSSDK_INCLUDE_DIR}")
message(STATUS "Found AWS SDK libraries: ${AWSSDK_LINK_LIBRARIES}") |
I've marked #40299 ready for review. |
### Rationale for this change By default, S3Client instantiation is extremely slow (around 1ms for every instance). Investigation led to the conclusion that most of this time was spent inside the AWS SDK, parsing a hardcoded piece of JSON data when instantiating a AWS rule engine. Python benchmarks show this repeated initiatlization cost: ```python >>> from pyarrow.fs import S3FileSystem >>> %time s = S3FileSystem() CPU times: user 21.1 ms, sys: 0 ns, total: 21.1 ms Wall time: 20.9 ms >>> %time s = S3FileSystem() CPU times: user 2.37 ms, sys: 0 ns, total: 2.37 ms Wall time: 2.18 ms >>> %time s = S3FileSystem() CPU times: user 2.42 ms, sys: 0 ns, total: 2.42 ms Wall time: 2.23 ms >>> %timeit s = S3FileSystem() 1.28 ms ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) >>> %timeit s = S3FileSystem() 1.28 ms ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) >>> %timeit s = S3FileSystem(anonymous=True) 1.26 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) ``` ### What changes are included in this PR? Instead of letting the AWS SDK create a new S3EndpointProvider for each S3Client, arrange to only create a single S3EndpointProvider per set of endpoint configuration options. This lets the 1ms instantiation cost be paid only when a new set of endpoint configuration options is given. Python benchmarks show the initialization cost has become a one-time cost: ```python >>> from pyarrow.fs import S3FileSystem >>> %time s = S3FileSystem() CPU times: user 20 ms, sys: 0 ns, total: 20 ms Wall time: 19.8 ms >>> %time s = S3FileSystem() CPU times: user 404 µs, sys: 49 µs, total: 453 µs Wall time: 266 µs >>> %time s = S3FileSystem() CPU times: user 361 µs, sys: 42 µs, total: 403 µs Wall time: 249 µs >>> %timeit s = S3FileSystem() 50.4 µs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) >>> %timeit s = S3FileSystem(anonymous=True) 33.5 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) ``` ### Are these changes tested? By existing tests. ### Are there any user-facing changes? No. * GitHub Issue: #40279 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Issue resolved by pull request 40299 |
Describe the bug, including details regarding any error messages, version, and platform.
Deserializing a pickled S3FileSystem instance is surprisingly slow
takes
1.01 ms ± 153 µs per loop
on my machineLooking at a py-spy profile shows that most of the time is spent in some internal JSON parsing. Is there a way to avoid this?
Component(s)
Python
The text was updated successfully, but these errors were encountered: