Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs #39897

Closed
frreiss opened this issue Feb 1, 2024 · 5 comments · Fixed by #40110
Closed

[C++] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs #39897

frreiss opened this issue Feb 1, 2024 · 5 comments · Fixed by #40110

Comments

@frreiss
Copy link

frreiss commented Feb 1, 2024

Describe the bug, including details regarding any error messages, version, and platform.

On MacOS, with Arrow version 15.0.0 and AWS SDK for C++ version 1.11.250, calling the C++ function arrow::fs::FileSystemFromUri() simultaneously from multiple threads results in a segmentation fault.

Here is a program that reproduces the problem on my machine:

#include <iostream>
#include <thread>
#include <arrow/api.h>
#include <arrow/filesystem/api.h>

void thread_main()
{
    // Call FileSystemFromURI() and check the returned status code.
    std::string path;
    auto result = arrow::fs::FileSystemFromUri("s3://bucket/key", &path);
    if (result.ok()){
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() succeeded." << std::endl;
    } else {
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() failed." << std::endl;
    }
}

int main()
{
    const int num_threads = 2;
    auto threads = std::vector<std::thread>();
    for (int i = 0; i < num_threads; i++)
    {
        threads.push_back(std::thread(thread_main));
    }

    for (auto &thread : threads)
    {
        thread.join();
    }

    // This line would prevent a segfault if the above code were to work.
    (void)arrow::fs::FinalizeS3();
}

This program should produce output like:

0x16eeeb000: FileSystemFromUri() succeeded.
0x16b5a7000: FileSystemFromUri() succeeded.

Instead, it experiences a segmentation fault before printing any output.

Component(s)

C++

@kou kou changed the title arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs [C+=] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs Feb 2, 2024
@kou
Copy link
Member

kou commented Feb 2, 2024

Could you initialize S3 on the main thread explicitly?

#include <iostream>
#include <thread>
#include <arrow/api.h>
#include <arrow/filesystem/api.h>

void thread_main()
{
    // Call FileSystemFromURI() and check the returned status code.
    std::string path;
    auto result = arrow::fs::FileSystemFromUri("s3://bucket/key", &path);
    if (result.ok()){
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() succeeded." << std::endl;
    } else {
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() failed." << std::endl;
    }
}

int main()
{
   (void)arrow::fs::EnsureS3Initialized();
   const int num_threads = 2;
    auto threads = std::vector<std::thread>();
    for (int i = 0; i < num_threads; i++)
    {
        threads.push_back(std::thread(thread_main));
    }

    for (auto &thread : threads)
    {
        thread.join();
    }

    // This line would prevent a segfault if the above code were to work.
    (void)arrow::fs::FinalizeS3();
}

@frreiss
Copy link
Author

frreiss commented Feb 2, 2024

Yes, making a call to EnsureS3Initialized() prior to launching any background threads makes the example work.

It looks like I can call that function from inside the worker threads while holding a mutex and get the same effect:

#include <iostream>
#include <thread>
#include <arrow/api.h>
#include <arrow/filesystem/api.h>

std::mutex s3_init_mutex;

arrow::Status thread_main()
{
    {
       std::lock_guard<std::mutex> guard(s3_init_mutex); 
       ARROW_RETURN_NOT_OK(arrow::fs::EnsureS3Initialized());
    }

    std::string path;
    auto result = arrow::fs::FileSystemFromUri("s3://bucket/key", &path);
    if (result.ok()){
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() succeeded." << std::endl;
    } else {
        std::cout << std::this_thread::get_id() << ": FileSystemFromUri() failed." << std::endl;
    }
    return result.status();
}

int main()
{
    const int num_threads = 2;
    auto threads = std::vector<std::thread>();
    for (int i = 0; i < num_threads; i++)
    {
        threads.push_back(std::thread(thread_main));
    }

    for (auto &thread : threads)
    {
        thread.join();
    }

    // This line would prevent a segfault if the above code were to work.
    (void)arrow::fs::FinalizeS3();
}

Perhaps FileSystemFromUriReal() should acquire a mutex while calling EnsureS3Initialized()?

@kou kou changed the title [C+=] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs [C++] arrow::fs::FileSystemFromUri() not thread-safe with s3 URIs Feb 4, 2024
@kou
Copy link
Member

kou commented Feb 4, 2024

I think that it's application's responsibility by design:

/// \brief Shutdown the S3 APIs.
///
/// This can wait for some S3 concurrent calls to finish so as to avoid
/// race conditions.
/// After this function has been called, all S3 calls will fail with an error.
///
/// Calls to InitializeS3() and FinalizeS3() should be serialized by the
/// application (this also applies to EnsureS3Initialized() and
/// EnsureS3Finalized()).
ARROW_EXPORT
Status FinalizeS3();

@pitrou Do you have any opinion on this?

@pitrou
Copy link
Member

pitrou commented Feb 5, 2024

We should certainly make EnsureS3Initialized thread-safe (implicitly called by FileSystemFromUri), given that it can be called lazily.

@TangSiyang2001
Copy link
Contributor

I would like to fix it, plz assign it to me.

@pitrou pitrou added this to the 15.0.1 milestone Feb 26, 2024
pitrou added a commit that referenced this issue Feb 26, 2024
… initialization exactly once under concurrency (#40110)

### Rationale for this change

`FileSystemFromUri` could be called concurrently, and its implicit call to `AwsInstance::EnsureInitialized` will cause segment fault due to data race.

Therefore, make init stage thread-safe for `AwsInstance::EnsureInitialized`.

### What changes are included in this PR?

Serialize calls to S3 initialization to ensure initialization is done exactly once.

### Are these changes tested?

Yes, a test is added for the PyArrow bindings.

### Are there any user-facing changes?

No.

* Closes: #39897
* GitHub Issue: #39897

Lead-authored-by: tsy <tangsiyang2001@foxmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou modified the milestones: 15.0.1, 16.0.0 Feb 26, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
… to do initialization exactly once under concurrency (apache#40110)

### Rationale for this change

`FileSystemFromUri` could be called concurrently, and its implicit call to `AwsInstance::EnsureInitialized` will cause segment fault due to data race.

Therefore, make init stage thread-safe for `AwsInstance::EnsureInitialized`.

### What changes are included in this PR?

Serialize calls to S3 initialization to ensure initialization is done exactly once.

### Are these changes tested?

Yes, a test is added for the PyArrow bindings.

### Are there any user-facing changes?

No.

* Closes: apache#39897
* GitHub Issue: apache#39897

Lead-authored-by: tsy <tangsiyang2001@foxmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
… to do initialization exactly once under concurrency (apache#40110)

### Rationale for this change

`FileSystemFromUri` could be called concurrently, and its implicit call to `AwsInstance::EnsureInitialized` will cause segment fault due to data race.

Therefore, make init stage thread-safe for `AwsInstance::EnsureInitialized`.

### What changes are included in this PR?

Serialize calls to S3 initialization to ensure initialization is done exactly once.

### Are these changes tested?

Yes, a test is added for the PyArrow bindings.

### Are there any user-facing changes?

No.

* Closes: apache#39897
* GitHub Issue: apache#39897

Lead-authored-by: tsy <tangsiyang2001@foxmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment