[C++] TSAN error in s3 (maybe false positive?) #36523
Comments
I'm not able to reproduce. I think, by the time the …
CC @pitrou for a second opinion, but I will look at this further tomorrow too.
Hmm, so what seems to happen is that there are three different threads. The complex part is T21: like you, I'm skeptical there is a genuine risk, as M1 is taken shared.
Sorry, I just re-did the analysis you already did :-)
A reasonable fix is to lock the client holder only as long as necessary. I should have done that from the start:

```diff
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index f05d89963c..79b8716017 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -1527,8 +1527,6 @@ class ObjectOutputStream final : public io::OutputStream {
   Status UploadPart(const void* data, int64_t nbytes,
                     std::shared_ptr<Buffer> owned_buffer = nullptr) {
-    ARROW_ASSIGN_OR_RAISE(auto client_lock, holder_->Lock());
-
     S3Model::UploadPartRequest req;
     req.SetBucket(ToAwsString(path_.bucket));
     req.SetKey(ToAwsString(path_.key));
@@ -1538,7 +1536,8 @@ class ObjectOutputStream final : public io::OutputStream {
     if (!background_writes_) {
       req.SetBody(std::make_shared<StringViewStream>(data, nbytes));
-      auto outcome = client_lock->UploadPart(req);
+      ARROW_ASSIGN_OR_RAISE(auto client_lock, holder_->Lock());
+      auto outcome = std::move(client_lock)->UploadPart(req);
       if (!outcome.IsSuccess()) {
         return UploadPartError(req, outcome);
       } else {
@@ -1562,12 +1561,13 @@ class ObjectOutputStream final : public io::OutputStream {
         upload_state_->pending_parts_completed = Future<>::Make();
       }
     }
+    ARROW_ASSIGN_OR_RAISE(auto client_lock, holder_->Lock());
     // XXX This callback returns Aws::Utils::Outcome, it cannot easily call
     // `holder->Lock()` which returns arrow::Result.
     ARROW_ASSIGN_OR_RAISE(
         auto fut,
         SubmitIO(io_context_, [client_lock = std::move(client_lock), req]() mutable {
-          return client_lock->UploadPart(req);
+          return std::move(client_lock)->UploadPart(req);
         }));
     // The closure keeps the buffer and the upload state alive
     auto state = upload_state_;
```
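As a side note on the pattern: the idea is simply to narrow the region where the client lock is held. Here is a minimal sketch with hypothetical names, using a plain scoped block to the same effect as the diff's take-the-lock-right-before-the-call approach:

```cpp
#include <mutex>

std::mutex client_mutex;  // hypothetical stand-in for the holder lock (M0)

void UploadPartSketch() {
  // Build the request without holding any lock.
  // ...
  {
    // Take the client lock immediately before the call that needs it...
    std::lock_guard<std::mutex> lock(client_mutex);
    // ... perform the guarded S3 call here ...
  }  // ...and release it here, before any other lock can be taken.
  // Handle the outcome without the client lock held.
}
```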
By the way, it seems S3 is disabled in our TSAN CI job. Should we enable it?
I'm getting other potential false positives from TSAN when run under Ubuntu 22.04 with Clang 14.0.0, so I opened an issue there: |
It would be nice. We do a fair amount of thread management here. However, it's mostly controlled by …

Note that I did not actually end up finding any real bugs with TSAN once I quieted down the false positives.
That seems fine. I think it would still be good to unlock the `upload_state` before marking the flush complete too. It's not a bug, but it will make it easier to reason about the code in the future.
It is counter-intuitive, but lock ordering issues can happen even with a shared mutex locked in shared mode. The reason is that locking again in shared mode can block while there are threads waiting to take the lock in exclusive mode. For this reason, we must avoid keeping the S3ClientLock held at a point where it may be taken again.

* Closes: #36523

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
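A minimal sketch of that hazard, with hypothetical names; whether it actually deadlocks depends on the `std::shared_mutex` implementation (writer-preferring ones block new shared lockers while an exclusive locker is queued):

```cpp
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

int main() {
  std::shared_mutex mu;

  std::thread reader([&] {
    std::shared_lock<std::shared_mutex> first(mu);  // 1. shared lock held
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    // 3. With a writer now queued, a writer-preferring implementation
    //    blocks this second shared lock until the writer is done; but the
    //    writer cannot run until `first` is released: deadlock.
    std::shared_lock<std::shared_mutex> second(mu);
  });

  std::thread writer([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::unique_lock<std::shared_mutex> w(mu);  // 2. exclusive lock requested
  });

  reader.join();
  writer.join();
  return 0;
}
```

This is timing-dependent and meant to illustrate the interleaving, not to fail deterministically.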
Describe the bug, including details regarding any error messages, version, and platform.
I encountered this while stress testing #35440 with TSAN enabled but I believe it's not related to #35440 and, indeed, not a bug at all. I'm attaching the log from TSAN here but I'll give a summary.
tsan_out.txt
There are three locks involved and so it is quite confusing.
- M0: the holder lock (`std::mutex S3ClientHolder::mutex_`)
- M1: the finalizer lock (`std::shared_mutex S3ClientFinalizer::mutex_`)
- M2: the upload state lock (`std::mutex UploadState::mutex`)

TSAN reports:

```
Cycle in lock order graph: M0 => M1 => M2 => M0
```
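For reference, a hypothetical two-lock reduction of what such a cycle means: one thread acquires the locks in one order, another in the reverse order. TSAN builds a lock-order graph from the acquisitions it observes and reports the cycle even if the particular run never deadlocks:

```cpp
#include <mutex>
#include <thread>

int main() {
  std::mutex a, b;

  std::thread t1([&] {
    std::lock_guard<std::mutex> la(a);
    std::lock_guard<std::mutex> lb(b);  // observed order: a => b
  });
  std::thread t2([&] {
    std::lock_guard<std::mutex> lb(b);
    std::lock_guard<std::mutex> la(a);  // observed order: b => a, closing the cycle
  });

  // Timing-dependent: this usually completes, but TSAN flags the cycle regardless.
  t1.join();
  t2.join();
  return 0;
}
```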
The task is "creating a bunch of files in parallel".
- In the main thread we are calling create dir. This all happens in `arrow::fs::S3FileSystem::Impl::CreateBucket`.
- In a CPU thread we are closing a file, which causes us to upload the final part. This all happens as part of `arrow::fs::(anonymous namespace)::ObjectOutputStream::UploadPart`.
- In an I/O thread we are handling the upload outcome, which causes us to call close on the file. This all happens as part of `arrow::fs::(anonymous namespace)::ObjectOutputStream::HandleUploadOutcome`.

However, given that M1 is a shared lock, I'm not sure the deadlock that TSAN is reporting is possible.
Unfortunately, I think there is actually a potential deadlock here, just not the one that TSAN is reporting. As part of `HandleUploadOutcome` it seems we are first locking the upload state and then locking the client holder (M2 then M0). As part of `UploadPart` we are first locking the client holder and then locking the upload state (M0 then M2). TSAN is reporting this latter operation as M1 then M2, which is also true, but not as interesting.

A good fix would probably be to release the upload state lock in `HandleUploadOutcome` before marking `pending_parts_completed` finished (in general, it's not a good idea to mark futures finished while holding locks); a sketch of that shape follows. I want to reproduce this to confirm it is actually a problem and that the fix actually works.
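A sketch of what that fix shape could look like, using `std::promise` as a stand-in for `arrow::Future` and hypothetical field names: decide what to signal while the lock is held, but signal only after releasing it, so continuations never run under the upload-state mutex:

```cpp
#include <future>
#include <mutex>
#include <optional>

struct UploadState {
  std::mutex mutex;
  int parts_in_progress = 0;
  std::promise<void> pending_parts_completed;
};

void OnPartFinished(UploadState& state) {
  std::optional<std::promise<void>> to_complete;
  {
    std::lock_guard<std::mutex> lock(state.mutex);
    if (--state.parts_in_progress == 0) {
      // Take the promise out while locked...
      to_complete = std::move(state.pending_parts_completed);
    }
  }
  // ...but fulfil it outside the lock, so whatever runs in response cannot
  // re-enter and try to take state.mutex (or the client holder lock) while
  // we still hold it.
  if (to_complete) {
    to_complete->set_value();
  }
}
```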
Component(s)

C++