Fix race in ExitAsBatchGroupLeader with pipelined writes #9944
Conversation
Hi @mpoeter! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with … If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mpoeter has updated the pull request. You must reimport the pull request before landing.
Unless somebody volunteers, I will take a look next week.
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Not sure why the format check keeps complaining. I ran …
@mpoeter has updated the pull request. You must reimport the pull request before landing.
Seems like there are some issues with my clang-format. I reran …
@mpoeter has updated the pull request. You must reimport the pull request before landing.
@mpoeter has updated the pull request. You must reimport the pull request before landing.
@riversand963 any updates on this?
@mpoeter sorry haven't been able to get to this. Will take a look.
@riversand963: any chance this fix can be incorporated into upstream anytime soon? Thanks!
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jsteemann on it
@mpoeter before taking closer look, I checked out the first commit of this PR, but my local build cannot reproduce the hang. Even tried gtest-parallel:

    <username>@<hostname>:~/rocksdb (8f40e2d2)$ ~/gtest-parallel/gtest-parallel -r 100 ./db_write_test --gtest_filter=*/DBWriteTest.PipelinedWriteRace/*
    [300/300] DBWriteTestInstance/DBWriteTest.PipelinedWriteRace/2 (1169 ms)
    <username>@<hostname>:~/rocksdb (8f40e2d2)$

Is there anything that I have missed?
@riversand963 I just tried it again - I checked out commit 8f40e2d, built and ran the test, and for me the test blocks infinitely. No need to use gtest-parallel or the like. The test case itself is set up to deterministically trigger the race condition.
@mpoeter tried on a different non-VM host, and the test hangs as expected. In order to reproduce the expected result, it requires that the address of …
Probably not that important. If it's not possible or requires a non-trivial change to the code, then I would not worry about it. We can just add a note about when this test will be meaningful. UPDATE: …
Overall LGTM with a few minor points.
Can you mention this bug fix in HISTORY.md?
One more thing, can you also run the stress test as follows:
make -j16 db_stress
CRASH_TEST_EXT_ARGS='--enable_pipelined_write=1' make crash_test
db/db_write_test.cc (outdated)

    @@ -4,6 +4,7 @@
     // (found in the LICENSE.Apache file in the root directory).

    +#include <atomic>
     #include <chrono>
Can be removed.
So can #include <fstream> in line 8.
db/db_write_test.cc (outdated)

      options.enable_pipelined_write = true;
      std::vector<port::Thread> threads;

      std::atomic<int> write_counter(0);
Nit: be consistent about variable initialization, thus use {} for all of these atomic variables.
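For instance (illustrative only, using the variables quoted from the test above):

    std::atomic<int> write_counter{0};
    std::atomic<WriteThread::Writer*> leader{nullptr};
    std::atomic<bool> finished_WAL_write{false};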
db/db_write_test.cc (outdated)

      std::atomic<WriteThread::Writer*> leader{nullptr};
      std::atomic<bool> finished_WAL_write{false};

      Reopen(options);
I would recommend using DestroyAndReopen(options).
db/write_thread.cc (outdated)

    @@ -4,8 +4,11 @@
     // (found in the LICENSE.Apache file in the root directory).

     #include "db/write_thread.h"

    +#include <atomic>
Can be removed.
db/write_thread.cc (outdated)

    @@ -639,6 +631,8 @@ void WriteThread::ExitAsBatchGroupFollower(Writer* w) {
     static WriteThread::AdaptationContext eabgl_ctx("ExitAsBatchGroupLeader");
     void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
                                              Status& status) {
    +  TEST_SYNC_POINT_CALLBACK("WriteThread::ExitAsBatchGroupLeader", &write_group);
Nit: let's rename it to WriteThread::ExitAsBatchGroupLeader:Start.
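i.e., the callback registration quoted above would then read (sketch of the suggested rename):

    TEST_SYNC_POINT_CALLBACK("WriteThread::ExitAsBatchGroupLeader:Start", &write_group);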
db/db_write_test.cc (outdated)

      Options options = GetOptions();
      options.two_write_queues = false;
      options.enable_pipelined_write = true;
We are interested in pipelined write only, and to reduce the number of actual tests run we can do

    if (GetParam() != kPipelinedWrite) {
      ROCKSDB_GTEST_BYPASS("This test requires pipelined write");
      return;
    }
    Options options = GetOptions();
    // The following lines setting two_write_queues and enable_pipelined_write can be removed

This is a parameterized test, and the original code just repeats the same test three times.
Better to make the whole test unparameterized (e.g. class DBWriteTestUnparameterized : public DBTestBase) than to include a single dimension of behavioral variance only to pick a single point along that dimension.
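A minimal sketch of that shape (constructor arguments, option setup, and the test body here are placeholders, not the code that ultimately landed):

    class DBWriteTestUnparameterized : public DBTestBase {
     public:
      DBWriteTestUnparameterized()
          : DBTestBase("pipelined_write_race", /*env_do_fsync=*/false) {}
    };

    TEST_F(DBWriteTestUnparameterized, PipelinedWriteRace) {
      Options options;
      options.two_write_queues = false;
      options.enable_pipelined_write = true;
      // ... the repro body from the parameterized version goes here ...
    }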
db/db_write_test.cc (outdated)

      // owning that writer before the writer has been removed. This resulted in a
      // race - if the leader thread was fast enough, then everything was fine.
      // However, if the woken up thread finished the current write operation and
      // then performed yet another write, then the new writer instance was added
I think it's important to emphasize that there is going to be a problem if the local variables' addresses are the same across two invocations of the same function PipelinedWriteImpl(), which is platform (OS + compiler) dependent.
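A standalone snippet (not RocksDB code) that illustrates the point; whether the two calls report the same address depends entirely on the platform's stack layout:

    #include <cstdint>
    #include <cstdio>

    // Report the address a local variable happened to get in this call.
    std::uintptr_t AddressOfLocal() {
      int local = 0;
      return reinterpret_cast<std::uintptr_t>(&local);
    }

    int main() {
      std::uintptr_t first = AddressOfLocal();
      std::uintptr_t second = AddressOfLocal();
      // Many, but not all, platforms reuse the same stack slot here.
      std::printf("same address across calls: %s\n",
                  first == second ? "yes" : "no");
      return 0;
    }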
db/write_thread.cc (outdated)

      // Notify writers don't write to memtable to exit.
      // We insert a dummy Writer right before our current write_group. This
      // allows us to unlink our write_group without the risk that a subsequent
      // writer becomes a new leader and might overtake use before we can add our
Nit: a minor suggestion on the wording: I think the point is that the new leader overtakes us before we remove our group from newest_writers_. Where the group is moved to is not so important. Therefore, I suggest that we can say: "writer becomes a new leader and might overtake us before we can remove our group from the newest_writers_ list".
No, as long as our writer group is still in the list (i.e., reachable via newest_writer) there cannot be a new leader and therefore we cannot be overtaken. But suppose we have unlinked our group (let's call it A) and the writer list is now empty. Then a new writer can be added (let's call it B), which will become leader and therefore immediately perform its write. This thread can then continue to unlink its writer, which results in a race to add the writers to the memtable writer queue. I.e., in the WAL writer queue we had [A, B] (in that order), while in the memtable writer queue we can end up with [B, A]. Not sure if this can be a problem, but the old code made sure this cannot happen, so I did the same. I changed the wording to make this clearer.
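To make the handover idea concrete, here is a much-simplified sketch (illustrative types, no memory-ordering details, not the actual RocksDB code): a dummy Writer is spliced in where the finished group used to hang, so the list never looks empty while the group is moved to the memtable writer queue.

    #include <atomic>

    struct Writer {
      Writer* link_older = nullptr;  // next older writer in the list
    };

    std::atomic<Writer*> newest_writer{nullptr};

    // Unlink the finished group (whose newest element is `last_in_group`) but
    // leave `dummy` in its place, so a concurrently joining writer never sees
    // an empty list and cannot become leader during the handover.
    void UnlinkGroupKeepingPlaceholder(Writer* last_in_group, Writer* dummy) {
      Writer* expected = last_in_group;
      if (!newest_writer.compare_exchange_strong(expected, dummy)) {
        // Newer writers joined behind our group: walk from the head to the
        // writer that links to our group's newest element and splice the dummy
        // in there instead. In this simplified model only the leader edits
        // this part of the list, so the walk is safe.
        Writer* w = newest_writer.load();
        while (w->link_older != last_in_group) {
          w = w->link_older;
        }
        w->link_older = dummy;
      }
    }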
I do not have much experience in lock-free programming, just another idea:

    if (link_newer) {
      link_newer->link_older = nullptr;
    }

The above requires that we do not reset the link_newer of the to-be-destructed Writer obj. My observation is that the owning thread of the new leader (pointed to by …

UPDATE: …

UPDATE 2: …
@mpoeter has updated the pull request. You must reimport the pull request before landing.
@riversand963 Thanks for the review, I think I have addressed all of your comments.
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I did some benchmarking using db_bench:

    TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_main -enable_pipelined_write=1 -progress_reports=0 -benchmarks=fillseq,overwrite[X3],fillrandom[X3] -threads=16

Before this PR: …
With this PR: …
If we just look at avg, "overwrite" and "fillrandom" show a slight regression.
I am able to fix the unit test added by this PR after changing only the following:

    diff --git a/db/db_impl/db_impl_write.cc b/db/db_impl/db_impl_write.cc
    index 39657d462..eae8bb3a0 100644
    --- a/db/db_impl/db_impl_write.cc
    +++ b/db/db_impl/db_impl_write.cc
    @@ -568,9 +568,22 @@ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
       WriteContext write_context;

    -  WriteThread::Writer w(write_options, my_batch, callback, log_ref,
    -                        disable_memtable, /*_batch_cnt=*/0,
    -                        /*_pre_release_callback=*/nullptr);
    +  char wbuf1[sizeof(WriteThread::Writer)] = {0};
    +  char wbuf2[sizeof(WriteThread::Writer)] = {0};
    +  static thread_local unsigned long long __pipelined_write_call_id = 0ULL;
    +  WriteThread::Writer* writer_ptr = nullptr;
    +  if (__pipelined_write_call_id & 1) {
    +    writer_ptr =
    +        new (wbuf1) WriteThread::Writer(write_options, my_batch, callback,
    +                                        log_ref, disable_memtable, 0, nullptr);
    +  } else {
    +    writer_ptr =
    +        new (wbuf2) WriteThread::Writer(write_options, my_batch, callback,
    +                                        log_ref, disable_memtable, 0, nullptr);
    +  }
    +  ++__pipelined_write_call_id;
    +  assert(writer_ptr);
    +  WriteThread::Writer& w = *writer_ptr;

       write_thread_.JoinBatchGroup(&w);
       TEST_SYNC_POINT("DBImplWrite::PipelinedWriteImpl:AfterJoinBatchGroup");

       if (w.state == WriteThread::STATE_GROUP_LEADER) {

Basically, I deterministically (I hope) make sure two consecutive invocations of …
Ran benchmarks on another host, and no obvious perf regression observed.
@mpoeter has updated the pull request. You must reimport the pull request before landing.
Thanks @mpoeter for fixing this tricky bug!
      // a race where the owning thread of one of these writers can start a new
      // write operation.
      Writer dummy;
      Writer* head = newest_writer_.load(std::memory_order_acquire);
The start of this closely resembles the !enable_pipelined_write_ branch below, which has some useful comments. I would add a comment to refer to it. (Not obvious how to cleanly combine the two pieces of code without introducing an unnecessary allocation of a dummy Writer on the stack.)
I copied some comments here to explain what is being done, hoping this duplication is easier to maintain (because the comment is close to the code) than referring to another comment. If we had a way of identifying comments, I might have used a reference.
@mpoeter has updated the pull request. You must reimport the pull request before landing.
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Resolves #9692
This PR adds a unit test that reproduces the race described in #9692 and a corresponding fix.
The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.
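For reference, the repro can be run on its own with the gtest filter used earlier in the thread (the exact test name may differ once the test is made unparameterized, as suggested in the review):

    ./db_write_test --gtest_filter=*/DBWriteTest.PipelinedWriteRace/*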