Fix race in ExitAsBatchGroupLeader with pipelined writes #9944

Closed
wants to merge 12 commits

Conversation

@mpoeter commented May 4, 2022

Resolves #9692

This PR adds a unit test that reproduces the race described in #9692 and a corresponding fix.

The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. With the old (buggy) code the test simply hangs, while with the fix it passes successfully.

@facebook-github-bot

Hi @mpoeter!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@riversand963 self-requested a review May 5, 2022 18:30
@riversand963

Unless somebody volunteers, I will take a look next week.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mpoeter commented May 5, 2022

Not sure why the format check keeps complaining. I ran make format for the last commit and it came back clean. 😕

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented May 6, 2022

It seems there are some issues with my clang-format. I reran make format and now it also complained about the whitespace in this one line (the same one as in the failing check run). However, when I run clang-format on that file, it reformats a lot more than that, and obviously I don't want to commit completely unrelated formatting changes. Can this be caused by different clang-format versions? I could not find any information about which clang-format version should be used.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented Jun 14, 2022

@riversand963 any updates on this?

@riversand963

@mpoeter sorry haven't been able to get to this. Will take a look.

@jsteemann

@riversand963: any chance this fix can be incorporated into upstream anytime soon? Thanks!

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963

@jsteemann on it

@riversand963

@mpoeter before taking a closer look, I checked out the first commit of this PR, but my local build cannot reproduce the hang. I even tried gtest-parallel.

<username>@<hostname>:~/rocksdb  (8f40e2d2)$ ~/gtest-parallel/gtest-parallel -r 100 ./db_write_test --gtest_filter=*/DBWriteTest.PipelinedWriteRace/*
[300/300] DBWriteTestInstance/DBWriteTest.PipelinedWriteRace/2 (1169 ms)
<username>@<hostname>:~/rocksdb  (8f40e2d2)$

Is there anything that I have missed?

@mpoeter commented Jul 27, 2022

@riversand963 I just tried it again: I checked out commit 8f40e2d, built and ran the test, and for me the test blocks indefinitely. There is no need to use gtest-parallel or the like; the test case itself is set up to deterministically trigger the race condition.

@riversand963

@mpoeter tried on a different non-VM host, and the test hangs as expected. Reproducing the expected result requires that the address of w in PipelinedWriteImpl is the same across two invocations of the method. I printed out the address, which showed that on the first VM host (where the test passes) the address of the local variable is always different, while on the second host I do get the same address on the stack. I wonder whether we can make this deterministic across different platforms.
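
As a side note, here is a minimal standalone C++ sketch (not RocksDB code; the function name is made up) of the condition in question. Whether two successive calls place a local variable at the same stack address depends entirely on the compiler, platform, and frame layout:

#include <cstdio>

// Prints the address of a local variable. On many platforms two back-to-back
// calls from the same caller print the same address, which is the condition
// the reproduction depends on; inlining, different frame layouts, or sanitizer
// instrumentation can make the addresses differ from host to host.
void PrintLocalAddress() {
  int local = 0;
  std::printf("&local = %p\n", static_cast<void*>(&local));
}

int main() {
  PrintLocalAddress();
  PrintLocalAddress();
  return 0;
}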

@riversand963 commented Jul 27, 2022

> I wonder whether we can make this deterministic across different platforms.

Probably not that important. If it's not possible or requires non-trivial change to the code, then I would not worry about it. We can just add a note about when this test will be meaningful.

UPDATE:
I suspected it was ASLR and/or some compiler flags. So far, disabling ASLR does not seem to work...

@riversand963 left a comment

Overall LGTM with a few minor points.

Can you mention this bug fix in HISTORY.md?
One more thing, can you also run the stress test as follows:

make -j16 db_stress
CRASH_TEST_EXT_ARGS='--enable_pipelined_write=1' make crash_test

@@ -4,6 +4,7 @@
// (found in the LICENSE.Apache file in the root directory).

#include <atomic>
#include <chrono>
Contributor

Can be removed.

Contributor

So can #include <fstream> in line 8

options.enable_pipelined_write = true;
std::vector<port::Thread> threads;

std::atomic<int> write_counter(0);
Contributor

Nit: be consistent about variable initialization, thus use {} for all of these atomic variables.

std::atomic<WriteThread::Writer*> leader{nullptr};
std::atomic<bool> finished_WAL_write{false};

Reopen(options);
Contributor

I would recommend using DestroyAndReopen(options).

@@ -4,8 +4,11 @@
// (found in the LICENSE.Apache file in the root directory).

#include "db/write_thread.h"

#include <atomic>
Contributor

Can be removed.

@@ -639,6 +631,8 @@ void WriteThread::ExitAsBatchGroupFollower(Writer* w) {
static WriteThread::AdaptationContext eabgl_ctx("ExitAsBatchGroupLeader");
void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
Status& status) {
TEST_SYNC_POINT_CALLBACK("WriteThread::ExitAsBatchGroupLeader", &write_group);
Contributor

Nit: let's rename it to WriteThread::ExitAsBatchGroupLeader:Start.

Comment on lines 340 to 342
Options options = GetOptions();
options.two_write_queues = false;
options.enable_pipelined_write = true;
Contributor

We are only interested in pipelined writes, so to reduce the number of actual tests run, we can do

if (GetParam() != kPipelinedWrite) {
  ROCKSDB_GTEST_BYPASS("This test requires pipelined write");
  return;
}
Options options = GetOptions();
// The following lines setting two_write_queues and enable_pipelined_write can be removed

This is a parameterized test, and the original code just repeats the same test three times.

Contributor

Better to make the whole test unparameterized (e.g. class DBWriteTestUnparameterized : public DBTestBase) than to include a single dimension of behavioral variance only to pick a single point along that dimension.
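
For illustration, a minimal sketch of what such an unparameterized fixture could look like (the fixture name is taken from the comment above; the path string, the env_do_fsync value, and the use of GetDefaultOptions() are assumptions based on the usual DBTestBase pattern, not the actual patch):

// Hypothetical fixture without the parameterized write-policy dimension.
class DBWriteTestUnparameterized : public DBTestBase {
 public:
  DBWriteTestUnparameterized()
      : DBTestBase("pipelined_write_unparameterized", /*env_do_fsync=*/true) {}
};

TEST_F(DBWriteTestUnparameterized, PipelinedWriteRace) {
  // Pipelined-write-only test, so no WithParamInterface is needed.
  Options options = GetDefaultOptions();
  options.enable_pipelined_write = true;
  // ... rest of the test body as in the parameterized version ...
}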

// owning that writer before the writer has been removed. This resulted in a
// race - if the leader thread was fast enough, then everything was fine.
// However, if the woken up thread finished the current write operation and
// then performed yet another write, then the new writer instance was added
Contributor

I think it's important to emphasize that there is going to be a problem if the local variable's address is the same across two invocations of the same function, PipelinedWriteImpl(), which is platform (OS + compiler) dependent.

// Notify writers don't write to memtable to exit.
// We insert a dummy Writer right before our current write_group. This
// allows us to unlink our write_group without the risk that a subsequent
// writer becomes a new leader and might overtake use before we can add our
Contributor

Nit: a minor suggestion on the wording: I think the point is that the new leader might overtake us before we remove our group from the newest_writers_ list. Where the group is moved to is not so important. Therefore, I suggest we say

writer becomes a new leader and might overtake us before we can remove our group from the newest_writers_ list

Contributor Author

No, as long as our write group is still in the list (i.e., reachable via newest_writer_) there cannot be a new leader, and therefore we cannot be overtaken. But suppose we have unlinked our group (let's call it A) and the writer list is now empty. Then a new writer can be added (let's call it B), which will become leader and therefore immediately perform its write. That thread can then continue to unlink its writer, which results in a race to add the writers to the memtable writer queue. I.e., in the WAL writer queue we had [A, B] (in that order), while in the memtable writer queue we can end up with [B, A]. I am not sure whether this can be a problem, but the old code made sure this cannot happen, so I did the same. I changed the wording to make this clearer.
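
To make the ordering argument above concrete, here is a small, self-contained, single-threaded C++ sketch (simplified types, not the actual RocksDB code) that replays the interleaving sequentially: a writer that finds the list empty becomes leader, so keeping a dummy node in the list while group A is handed over prevents a later writer B from becoming leader and overtaking A on the way to the memtable queue.

#include <atomic>
#include <cassert>
#include <vector>

struct Writer {
  const char* name;
  Writer* link_older = nullptr;  // next-older writer in the list
};

std::atomic<Writer*> newest_writer{nullptr};  // head of the "newest first" list
std::vector<Writer*> memtable_queue;          // stand-in for the memtable writer list

// Links w as the newest writer; returns true if w became leader
// (i.e. the list was empty at the moment it was linked).
bool Link(Writer* w) {
  Writer* head = newest_writer.load(std::memory_order_relaxed);
  do {
    w->link_older = head;
  } while (!newest_writer.compare_exchange_weak(head, w));
  return w->link_older == nullptr;
}

int main() {
  Writer a{"A"};
  Writer dummy{"dummy"};
  bool a_is_leader = Link(&a);
  assert(a_is_leader);
  (void)a_is_leader;

  // Leader A wants to hand its group over to the memtable queue. Before
  // unlinking, it swaps a dummy in so the list never appears empty. (The real
  // code also handles the case where newer writers were already linked.)
  Writer* expected = &a;
  bool swapped = newest_writer.compare_exchange_strong(expected, &dummy);
  assert(swapped);
  (void)swapped;

  // A writer B arriving now does NOT become leader: the dummy keeps the list
  // non-empty, so B has to wait instead of writing immediately.
  Writer b{"B"};
  bool b_is_leader = Link(&b);
  assert(!b_is_leader);
  (void)b_is_leader;

  // A can therefore reach the memtable writer queue first, preserving the
  // WAL order [A, B]; only afterwards is the dummy removed (omitted here).
  memtable_queue.push_back(&a);
  return 0;
}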

@riversand963 commented Jul 27, 2022

I do not have much experience in lock-free programming, just another idea:
can we do the following in ~Writer()?

if (link_newer) {
  link_newer->link_older = nullptr;
}

The above requires that we do not reset the link_newer of the to-be-destructed Writer obj.

My observation is that the owning thread of the new leader (pointed to by link_newer of the to-be-destructed Writer) is waiting and not accessing its Writer object, so link_newer->link_older = nullptr should be safe as long as link_newer is valid.

UPDATE:
nvm, the above does not seem to work, or I have to think about it more.
Anyway, even if it does, I am not a fan of separating the logic of the linked-list manipulation, so we should not worry about it.

UPDATE 2:
the above does not work because when the Writer is destructed, its link_newer may not have been set up yet by a subsequent CreateMissingNewerLinks().

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented Jul 28, 2022

@riversand963 Thanks for the review, I think I have addressed all of your comments.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963

I did some benchmarking using db_bench

TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_main -enable_pipelined_write=1 -progress_reports=0 -benchmarks=fillseq,overwrite[X3],fillrandom[X3] -threads=16

Before this PR

RocksDB:    version 7.6.0
Date:       Thu Jul 28 17:23:14 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Set seed to 1659054194349308 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
fillseq      :      26.047 micros/op 613232 ops/sec 26.091 seconds 16000000 operations;   67.8 MB/s
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
overwrite    :      25.351 micros/op 629141 ops/sec 25.431 seconds 16000000 operations;   69.6 MB/s
overwrite    :      27.422 micros/op 580673 ops/sec 27.554 seconds 16000000 operations;   64.2 MB/s
overwrite [AVG 2 runs] : 604907 (± 47497) ops/sec;   66.9 (± 5.3) MB/sec
overwrite    :      27.690 micros/op 575694 ops/sec 27.793 seconds 16000000 operations;   63.7 MB/s
overwrite [AVG 3 runs] : 595169 (± 33410) ops/sec;   65.8 (± 3.7) MB/sec
overwrite [AVG    3 runs] : 595169 (± 33410) ops/sec;   65.8 (± 3.7) MB/sec
overwrite [MEDIAN 3 runs] : 580673 ops/sec;   64.2 MB/sec
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
fillrandom   :      30.038 micros/op 531252 ops/sec 30.117 seconds 16000000 operations;   58.8 MB/s
fillrandom   :      27.098 micros/op 588992 ops/sec 27.165 seconds 16000000 operations;   65.2 MB/s
fillrandom [AVG 2 runs] : 560122 (± 56584) ops/sec;   62.0 (± 6.3) MB/sec
fillrandom   :      29.705 micros/op 537501 ops/sec 29.767 seconds 16000000 operations;   59.5 MB/s
fillrandom [AVG 3 runs] : 552582 (± 35856) ops/sec;   61.1 (± 4.0) MB/sec
fillrandom [AVG    3 runs] : 552582 (± 35856) ops/sec;   61.1 (± 4.0) MB/sec
fillrandom [MEDIAN 3 runs] : 537501 ops/sec;   59.5 MB/sec

With this PR

RocksDB:    version 7.6.0
Date:       Thu Jul 28 17:18:39 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Set seed to 1659053919797532 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
fillseq      :      21.306 micros/op 748766 ops/sec 21.368 seconds 16000000 operations;   82.8 MB/s
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
overwrite    :      26.874 micros/op 590716 ops/sec 27.086 seconds 16000000 operations;   65.3 MB/s
overwrite    :      27.772 micros/op 574241 ops/sec 27.863 seconds 16000000 operations;   63.5 MB/s
overwrite [AVG 2 runs] : 582479 (± 16144) ops/sec;   64.4 (± 1.8) MB/sec
overwrite    :      30.998 micros/op 514197 ops/sec 31.116 seconds 16000000 operations;   56.9 MB/s
overwrite [AVG 3 runs] : 559718 (± 45573) ops/sec;   61.9 (± 5.0) MB/sec
overwrite [AVG    3 runs] : 559718 (± 45573) ops/sec;   61.9 (± 5.0) MB/sec
overwrite [MEDIAN 3 runs] : 574241 ops/sec;   63.5 MB/sec
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
fillrandom   :      27.326 micros/op 584266 ops/sec 27.385 seconds 16000000 operations;   64.6 MB/s
fillrandom   :      33.816 micros/op 471942 ops/sec 33.902 seconds 16000000 operations;   52.2 MB/s
fillrandom [AVG 2 runs] : 528104 (± 110074) ops/sec;   58.4 (± 12.2) MB/sec
fillrandom   :      27.653 micros/op 576048 ops/sec 27.775 seconds 16000000 operations;   63.7 MB/s
fillrandom [AVG 3 runs] : 544085 (± 70851) ops/sec;   60.2 (± 7.8) MB/sec
fillrandom [AVG    3 runs] : 544085 (± 70851) ops/sec;   60.2 (± 7.8) MB/sec
fillrandom [MEDIAN 3 runs] : 576048 ops/sec;   63.7 MB/sec

If we just look at the averages, "overwrite" and "fillrandom" show a slight regression.
@mpoeter have you been able to do some benchmarking, as well as run the stress test?

@riversand963 commented Jul 29, 2022

I am able to fix the unit test added by this PR after changing only the following

diff --git a/db/db_impl/db_impl_write.cc b/db/db_impl/db_impl_write.cc
index 39657d462..eae8bb3a0 100644
--- a/db/db_impl/db_impl_write.cc
+++ b/db/db_impl/db_impl_write.cc
@@ -568,9 +568,22 @@ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,

   WriteContext write_context;

-  WriteThread::Writer w(write_options, my_batch, callback, log_ref,
-                        disable_memtable, /*_batch_cnt=*/0,
-                        /*_pre_release_callback=*/nullptr);
+  char wbuf1[sizeof(WriteThread::Writer)] = {0};
+  char wbuf2[sizeof(WriteThread::Writer)] = {0};
+  static thread_local unsigned long long __pipelined_write_call_id = 0ULL;
+  WriteThread::Writer* writer_ptr = nullptr;
+  if (__pipelined_write_call_id & 1) {
+    writer_ptr =
+        new (wbuf1) WriteThread::Writer(write_options, my_batch, callback,
+                                        log_ref, disable_memtable, 0, nullptr);
+  } else {
+    writer_ptr =
+        new (wbuf2) WriteThread::Writer(write_options, my_batch, callback,
+                                        log_ref, disable_memtable, 0, nullptr);
+  }
+  ++__pipelined_write_call_id;
+  assert(writer_ptr);
+  WriteThread::Writer& w = *writer_ptr;
   write_thread_.JoinBatchGroup(&w);
   TEST_SYNC_POINT("DBImplWrite::PipelinedWriteImpl:AfterJoinBatchGroup");
   if (w.state == WriteThread::STATE_GROUP_LEADER) {

Basically, I deterministically (I hope) make sure that two consecutive invocations of PipelinedWriteImpl() will not allocate w at the same stack address. I will be evaluating this approach and other ones.

@riversand963

Ran benchmarks on another host, and no obvious perf regression observed.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@riversand963

Thanks @mpoeter for fixing this tricky bug!

// a race where the owning thread of one of these writers can start a new
// write operation.
Writer dummy;
Writer* head = newest_writer_.load(std::memory_order_acquire);
Contributor

The start of this closely resembles the !enable_pipelined_write_ branch below, which has some useful comments. I would add a comment to refer to it. (It is not obvious how to cleanly combine the two pieces of code without introducing an unnecessary allocation of a dummy Writer on the stack.)

Contributor

I copied some comments here to explain what is being done, hoping this duplication is easier to maintain (because the comment is close to the code) than referring to another comment. If we had a way of identifying comments, I might have used a reference.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
