[C++] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs #39897
Comments
Could you initialize S3 on the main thread explicitly?
#include <iostream>
#include <string>
#include <thread>
#include <vector>
#include <arrow/api.h>
#include <arrow/filesystem/api.h>
void thread_main()
{
// Call FileSystemFromURI() and check the returned status code.
std::string path;
auto result = arrow::fs::FileSystemFromUri("s3://bucket/key", &path);
if (result.ok()){
std::cout << std::this_thread::get_id() << ": FileSystemFromUri() succeeded." << std::endl;
} else {
std::cout << std::this_thread::get_id() << ": FileSystemFromUri() failed." << std::endl;
}
}
int main()
{
(void)arrow::fs::EnsureS3Initialized();
const int num_threads = 2;
auto threads = std::vector<std::thread>();
for (int i = 0; i < num_threads; i++)
{
threads.push_back(std::thread(thread_main));
}
for (auto &thread : threads)
{
thread.join();
}
// This line would prevent a segfault if the above code were to work.
(void)arrow::fs::FinalizeS3();
}
Yes, making a call to … It looks like I can call that function from inside the worker threads while holding a mutex and get the same effect:
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <arrow/api.h>
#include <arrow/filesystem/api.h>
std::mutex s3_init_mutex;
arrow::Status thread_main()
{
{
std::lock_guard<std::mutex> guard(s3_init_mutex);
ARROW_RETURN_NOT_OK(arrow::fs::EnsureS3Initialized());
}
std::string path;
auto result = arrow::fs::FileSystemFromUri("s3://bucket/key", &path);
if (result.ok()){
std::cout << std::this_thread::get_id() << ": FileSystemFromUri() succeeded." << std::endl;
} else {
std::cout << std::this_thread::get_id() << ": FileSystemFromUri() failed." << std::endl;
}
return result.status();
}
int main()
{
const int num_threads = 2;
auto threads = std::vector<std::thread>();
for (int i = 0; i < num_threads; i++)
{
threads.push_back(std::thread(thread_main));
}
for (auto &thread : threads)
{
thread.join();
}
// This line would prevent a segfault if the above code were to work.
(void)arrow::fs::FinalizeS3();
}
Perhaps …
I think that it's the application's responsibility by design: arrow/cpp/src/arrow/filesystem/s3fs.h Lines 374 to 384 in 22f2cfd
@pitrou Do you have any opinion on this?
We should certainly make …
I would like to fix it, please assign it to me.
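If initialization is treated as the application's responsibility, one way to keep the init/finalize pair balanced is a small RAII guard owned by the main thread. The sketch below is only an illustration under stated assumptions: `FakeInitialize`, `FakeFinalize`, `ScopedInit` and `RunWorkers` are hypothetical names, and the two stub functions stand in for `arrow::fs::EnsureS3Initialized()` and `arrow::fs::FinalizeS3()` so that the example builds without Arrow.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins for a library's global init/finalize pair.
// With Arrow these would be arrow::fs::EnsureS3Initialized() and
// arrow::fs::FinalizeS3(); they are stubbed so the sketch is self-contained.
std::atomic<int> g_init_calls{0};
std::atomic<int> g_finalize_calls{0};
std::atomic<bool> g_initialized{false};

void FakeInitialize() {
  g_init_calls.fetch_add(1);
  g_initialized.store(true);
}

void FakeFinalize() {
  // Finalizing without a prior initialization would be a usage error.
  assert(g_initialized.load());
  g_finalize_calls.fetch_add(1);
  g_initialized.store(false);
}

// RAII guard: initializes on construction, finalizes on destruction.
class ScopedInit {
 public:
  ScopedInit() { FakeInitialize(); }
  ~ScopedInit() { FakeFinalize(); }
  ScopedInit(const ScopedInit&) = delete;
  ScopedInit& operator=(const ScopedInit&) = delete;
};

// Runs num_threads workers inside a single init/finalize scope and returns
// how many of them observed the initialized state.
int RunWorkers(int num_threads) {
  ScopedInit guard;  // Initialization happens before any worker starts.
  std::atomic<int> seen_initialized{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&seen_initialized] {
      if (g_initialized.load()) seen_initialized.fetch_add(1);
    });
  }
  for (auto& t : threads) t.join();
  return seen_initialized.load();
  // guard is destroyed here, after every worker has been joined.
}
```

Because the guard is constructed before the workers and destroyed after they are joined, no worker ever races on initialization, and finalization runs exactly once per scope.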
… initialization exactly once under concurrency (#40110)

### Rationale for this change

`FileSystemFromUri` can be called concurrently, and its implicit call to `AwsInstance::EnsureInitialized` can cause a segmentation fault due to a data race. Therefore, make the initialization stage of `AwsInstance::EnsureInitialized` thread-safe.

### What changes are included in this PR?

Serialize calls to S3 initialization to ensure that initialization is done exactly once.

### Are these changes tested?

Yes, a test is added for the PyArrow bindings.

### Are there any user-facing changes?

No.

* Closes: #39897
* GitHub Issue: #39897

Lead-authored-by: tsy <tangsiyang2001@foxmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
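As a rough illustration of "initialization done exactly once under concurrency", the sketch below uses `std::call_once` from the C++ standard library. This is one common technique and not necessarily how the PR implements it; `ExpensiveSetup`, `EnsureInitializedOnce` and `CallConcurrently` are hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical global state standing in for a library's one-time setup.
std::once_flag g_setup_flag;
std::atomic<int> g_setup_runs{0};

// The work that must happen exactly once, however many threads ask for it.
void ExpensiveSetup() { g_setup_runs.fetch_add(1); }

// Safe to call from any thread: std::call_once runs ExpensiveSetup in one
// caller and blocks concurrent callers until that run has completed.
void EnsureInitializedOnce() { std::call_once(g_setup_flag, ExpensiveSetup); }

// Calls EnsureInitializedOnce from num_threads threads and returns the
// total number of times the setup has actually run so far.
int CallConcurrently(int num_threads) {
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back(EnsureInitializedOnce);
  }
  for (auto& t : threads) t.join();
  return g_setup_runs.load();
}
```

Later callers return immediately once the flag is set, so repeated calls from new threads do not run the setup again.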
Describe the bug, including details regarding any error messages, version, and platform.
On macOS, with Arrow version 15.0.0 and AWS SDK for C++ version 1.11.250, calling the C++ function `arrow::fs::FileSystemFromUri()` simultaneously from multiple threads results in a segmentation fault.
Here is a program that reproduces the problem on my machine:
This program should produce output like:
Instead, it experiences a segmentation fault before printing any output.
Component(s)
C++