ARROW-15265: [C++] Fix hang in dataset writer with kDeleteMatchingPartitions and #partitions >= 8 #12099
Conversation
Hmm, this sometimes hangs on >8 partitions, taking another look.
Thanks, looks good. Just one small change in how you are invoking it at the top level.
```diff
-init_future_ =
-    DeferNotOk(write_options_.filesystem->io_context().executor()->Submit([this] {
+init_future_ = DeferNotOk(
+    write_options_.filesystem->io_context().executor()->Submit([this]() -> Future<> {
```
It would probably be a good addition to allow executor submission tasks to return futures but I don't think this actually works today. The future you are returning from the task will be discarded without being awaited.
I think you'll need to change it to...

```cpp
init_future_ = SubmitTask(CreateDir).Then(DeleteDirContentsAsync);
```
Ah, there's another deadlock: finishing a FileWriter closes the underlying file. This is done as a continuation that runs on the I/O thread pool (I think). (On a side note, would anyone complain if I #ifdef'd in the pthread calls to name threads on Linux to make debugging easier?) On S3, closing a file blocks on a condition variable until all background writes are finished…and of course, those background writes are on the I/O thread pool, like everything else, so we hang.
And there's definitely a race condition somewhere… (using the reproducer from JIRA)
I think finishing a file actually happens on the CPU thread pool at the moment, although it's at the mercy of the writer.
Please do.
Ah, but the background Close/Wait also blocks on the CPU thread pool. So I think you're filling up the CPU thread pool in this case.
Moving the Close/Wait to the I/O thread pool will probably be an easy fix. Then the rules we are building are...

...I think that will prevent any sort of nested deadlock.
Surprisingly, it is the I/O thread pool:
Thanks for the help Weston, this last commit should fix the deadlock…though I still occasionally see that
It's the same hang with 16 partitions so I think we will need a CloseAsync().
Ah, the fundamental issue is that S3FS implements writes asynchronously (unless background_writes=False), but our file interfaces are still mostly synchronous, and the dataset writer is asynchronous, using the thread pool to manage parallelism…so we have nested parallelism. Setting background_writes=False fixes it by breaking this loop; the other way is to have at least CloseAsync().
Alright, this now also adds CloseAsync() and updates some APIs and tests to match. |
I know how tricky it is tracking down these race conditions and finding these bugs, so a huge thank you for figuring this all out. I have a few observations, but overall I think everything checks out.
cpp/src/arrow/dataset/file_csv.cc
Outdated
```cpp
Future<> CsvFileWriter::FinishInternal() {
  return DeferNotOk(destination_locator_.filesystem->io_context().executor()->Submit(
      [this]() { return batch_writer_->Close(); }));
}
```
The CSV writer's `Close` method is a no-op, so I think it would be safe to just return a finished future here.
```diff
 Future<> CsvFileWriter::FinishInternal() {
-  return DeferNotOk(destination_locator_.filesystem->io_context().executor()->Submit(
-      [this]() { return batch_writer_->Close(); }));
+  batch_writer_->Close();
+  return Future<>::MakeFinished();
 }
```
```diff
@@ -1243,6 +1243,48 @@ class ObjectOutputStream final : public io::OutputStream {
     return Status::OK();
   }

+  Future<> CloseAsync() override {
```
There's a lot of overlap between this method and `Close`. Can we create some helper methods so we have something like (pseudo-code)...
```cpp
Future<> CloseAsync() override {
  PrepareClose();
  return FlushAsync().Then(DoClose);
}

Status Close() override {
  PrepareClose();
  Flush();
  return DoClose();
}
```
Or you could just change `Close` to be `return CloseAsync().status()`.
Ah whoops, I had meant to make Close() just call CloseAsync(), thanks for catching this.
cpp/src/arrow/filesystem/s3fs.cc
Outdated
```cpp
if (closed_) {
  return Status::Invalid("Operation on closed stream");
}
```
Is this check needed? If `closed_` is true then `FlushAsync` is just going to immediately return a failed future anyways.
cpp/src/arrow/filesystem/s3fs.cc
Outdated
```cpp
Aws::Vector<S3Model::CompletedPart> completed_parts;
int64_t parts_in_progress = 0;
Status status;
Future<> completed = Future<>::MakeFinished(Status::OK());
```
Nit: `completed` is a little bit misleading here. A user could write, flush, and continue writing. I'm not sure what a better name would be, but maybe something like `writes_in_progress`?
I renamed it `pending_parts_completed`, if that makes more sense.
@westonpace I think I've addressed all feedback, any final comments here?
@lidavidm Those fixes look good, I think this is ready.
Benchmark runs are scheduled for baseline = 0bce440 and contender = bafaa76. bafaa76 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
When the dataset writer is configured to delete existing data before writing, the target directory is on S3, the dataset is partitioned, and there are at least as many partitions as threads in the I/O thread pool, the writer would hang. The writer spawns a task on the I/O thread pool for each partition to delete existing data. However, S3FS implemented the relevant filesystem call by asynchronously listing the objects on the I/O thread pool, then deleting them, blocking until this is done. Hence, nested asynchrony would cause the program to hang.
The fix is to perform this deletion fully asynchronously, so that nothing blocks. It's sufficient to use the default implementation of the async filesystem methods; that just spawns another task on the I/O thread pool, but it lets the writer avoid blocking. However, this PR also refactors the S3FS internals to implement the call truly asynchronously.
This PR also implements FileInterface::CloseAsync. This is needed because by default on S3, files perform writes asynchronously in the background, and Close() blocks until those complete. Both the blocking call and the background writes consume the I/O thread pool, so an async version is needed to avoid the deadlock.