Fix race in ExitAsBatchGroupLeader with pipelined writes #9944

Closed
wants to merge 12 commits

Conversation

@mpoeter commented May 4, 2022

Resolves #9692

This PR adds a unit test that reproduces the race described in #9692 and a corresponding fix.

The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. With the old (buggy) code the test simply hangs, while with the fix it passes successfully.

@facebook-github-bot

Hi @mpoeter!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@riversand963 self-requested a review May 5, 2022 18:30
@riversand963

Unless somebody volunteers, I will take a look next week.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mpoeter commented May 5, 2022

Not sure why the format check keeps complaining. I ran make format for the last commit and it came back clean. 😕

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented May 6, 2022

It seems there are some issues with my clang-format. I reran make format and now it also complained about the whitespace in this one line (the same one as in the failing check run). However, when I run clang-format on that file, it reformats a lot more than that, and obviously I don't want to commit completely unrelated formatting changes. Can this be caused by different clang-format versions? I could not find any information about which clang-format version should be used.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented Jun 14, 2022

@riversand963 any updates on this?

@riversand963

@mpoeter sorry haven't been able to get to this. Will take a look.

@jsteemann

@riversand963: any chance this fix can be incorporated into upstream anytime soon? Thanks!

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963

@jsteemann on it

@riversand963

@mpoeter before taking a closer look, I checked out the first commit of this PR, but my local build cannot reproduce the hang. I even tried gtest-parallel.

<username>@<hostname>:~/rocksdb  (8f40e2d2)$ ~/gtest-parallel/gtest-parallel -r 100 ./db_write_test --gtest_filter=*/DBWriteTest.PipelinedWriteRace/*
[300/300] DBWriteTestInstance/DBWriteTest.PipelinedWriteRace/2 (1169 ms)
<username>@<hostname>:~/rocksdb  (8f40e2d2)$

Is there anything that I have missed?

@mpoeter commented Jul 27, 2022

@riversand963 I just tried it again: I checked out commit 8f40e2d, built and ran the test, and for me the test blocks indefinitely. There is no need to use gtest-parallel or the like; the test case itself is set up to deterministically trigger the race condition.

@riversand963

@mpoeter tried on a different non-VM host, and the test hangs as expected. Reproducing the expected result requires that the address of w in PipelinedWriteImpl is the same across two invocations of the method. I printed out the address, which showed that on the first VM host (where the test passes) the address of the local variable is always different, while on the second host I do get the same address on the stack. I wonder whether we can make this deterministic across different platforms.
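
As a side note, here is a minimal standalone C++ sketch (not RocksDB code; the function name is made up) of the condition in question. Whether two successive calls place a local variable at the same stack address depends entirely on the compiler, platform, and frame layout:

#include <cstdio>

// Prints the address of a local variable. On many platforms two back-to-back
// calls from the same caller print the same address, which is the condition
// the reproduction depends on; inlining, different frame layouts, or sanitizer
// instrumentation can make the addresses differ from host to host.
void PrintLocalAddress() {
  int local = 0;
  std::printf("&local = %p\n", static_cast<void*>(&local));
}

int main() {
  PrintLocalAddress();
  PrintLocalAddress();
  return 0;
}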

@riversand963 commented Jul 27, 2022

> I wonder whether we can make this deterministic across different platforms.

Probably not that important. If it's not possible or requires non-trivial change to the code, then I would not worry about it. We can just add a note about when this test will be meaningful.

UPDATE:
I suspected it was ASLR and/or some compiler flags. So far, disabling ASLR does not seem to work...

@riversand963 left a comment

Overall LGTM with a few minor points.

Can you mention this bug fix in HISTORY.md?
One more thing, can you also run the stress test as follows:

make -j16 db_stress
CRASH_TEST_EXT_ARGS='--enable_pipelined_write=1' make crash_test

@@ -4,6 +4,7 @@
// (found in the LICENSE.Apache file in the root directory).

#include <atomic>
#include <chrono>
Contributor

Can be removed.

Contributor

So can #include <fstream> in line 8

options.enable_pipelined_write = true;
std::vector<port::Thread> threads;

std::atomic<int> write_counter(0);
Contributor

Nit: be consistent about variable initialization, thus use {} for all of these atomic variables.

std::atomic<WriteThread::Writer*> leader{nullptr};
std::atomic<bool> finished_WAL_write{false};

Reopen(options);
Contributor

I would recommend using DestroyAndReopen(options).

@@ -4,8 +4,11 @@
// (found in the LICENSE.Apache file in the root directory).

#include "db/write_thread.h"

#include <atomic>
Contributor

Can be removed.

@@ -639,6 +631,8 @@ void WriteThread::ExitAsBatchGroupFollower(Writer* w) {
static WriteThread::AdaptationContext eabgl_ctx("ExitAsBatchGroupLeader");
void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
Status& status) {
TEST_SYNC_POINT_CALLBACK("WriteThread::ExitAsBatchGroupLeader", &write_group);
Contributor

Nit: let's rename it to WriteThread::ExitAsBatchGroupLeader:Start.

Comment on lines 340 to 342
Options options = GetOptions();
options.two_write_queues = false;
options.enable_pipelined_write = true;
Contributor

We are only interested in pipelined writes, so to reduce the number of actual tests run, we can do

if (GetParam() != kPipelinedWrite) {
  ROCKSDB_GTEST_BYPASS("This test requires pipelined write");
  return;
}
Options options = GetOptions();
// The following lines setting two_write_queues and enable_pipelined_write can be removed

This is a parameterized test, and the original code just repeats the same test three times.

Contributor

Better to make the whole test unparameterized (e.g. class DBWriteTestUnparameterized : public DBTestBase) than to include a single dimension of behavioral variance only to pick a single point along that dimension.
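
For illustration, a minimal sketch of what such an unparameterized fixture could look like (the fixture name is taken from the comment above; the path string, the env_do_fsync value, and the use of GetDefaultOptions() are assumptions based on the usual DBTestBase pattern, not the actual patch):

// Hypothetical fixture without the parameterized write-policy dimension.
class DBWriteTestUnparameterized : public DBTestBase {
 public:
  DBWriteTestUnparameterized()
      : DBTestBase("pipelined_write_unparameterized", /*env_do_fsync=*/true) {}
};

TEST_F(DBWriteTestUnparameterized, PipelinedWriteRace) {
  // Pipelined-write-only test, so no WithParamInterface is needed.
  Options options = GetDefaultOptions();
  options.enable_pipelined_write = true;
  // ... rest of the test body as in the parameterized version ...
}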

// owning that writer before the writer has been removed. This resulted in a
// race - if the leader thread was fast enough, then everything was fine.
// However, if the woken up thread finished the current write operation and
// then performed yet another write, then the new writer instance was added
Contributor

I think it's important to emphasize that there is going to be a problem if the local variable's address is the same across two invocations of the same function, PipelinedWriteImpl(), which is platform (OS + compiler) dependent.

// Notify writers don't write to memtable to exit.
// We insert a dummy Writer right before our current write_group. This
// allows us to unlink our write_group without the risk that a subsequent
// writer becomes a new leader and might overtake use before we can add our
Contributor

Nit: a minor suggestion on the wording: I think the point is that the new leader might overtake us before we remove our group from the newest_writers_ list. Where the group is moved to is not so important. Therefore, I suggest we say

writer becomes a new leader and might overtake us before we can remove our group from the newest_writers_ list

Contributor Author

No, as long as our write group is still in the list (i.e., reachable via newest_writer_) there cannot be a new leader, and therefore we cannot be overtaken. But suppose we have unlinked our group (let's call it A) and the writer list is now empty. Then a new writer can be added (let's call it B), which will become leader and therefore immediately perform its write. That thread can then continue to unlink its writer, which results in a race to add the writers to the memtable writer queue. I.e., in the WAL writer queue we had [A, B] (in that order), while in the memtable writer queue we can end up with [B, A]. I am not sure whether this can be a problem, but the old code made sure this cannot happen, so I did the same. I changed the wording to make this clearer.
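
To make the ordering argument above concrete, here is a small, self-contained, single-threaded C++ sketch (simplified types, not the actual RocksDB code) that replays the interleaving sequentially: a writer that finds the list empty becomes leader, so keeping a dummy node in the list while group A is handed over prevents a later writer B from becoming leader and overtaking A on the way to the memtable queue.

#include <atomic>
#include <cassert>
#include <vector>

struct Writer {
  const char* name;
  Writer* link_older = nullptr;  // next-older writer in the list
};

std::atomic<Writer*> newest_writer{nullptr};  // head of the "newest first" list
std::vector<Writer*> memtable_queue;          // stand-in for the memtable writer list

// Links w as the newest writer; returns true if w became leader
// (i.e. the list was empty at the moment it was linked).
bool Link(Writer* w) {
  Writer* head = newest_writer.load(std::memory_order_relaxed);
  do {
    w->link_older = head;
  } while (!newest_writer.compare_exchange_weak(head, w));
  return w->link_older == nullptr;
}

int main() {
  Writer a{"A"};
  Writer dummy{"dummy"};
  bool a_is_leader = Link(&a);
  assert(a_is_leader);
  (void)a_is_leader;

  // Leader A wants to hand its group over to the memtable queue. Before
  // unlinking, it swaps a dummy in so the list never appears empty. (The real
  // code also handles the case where newer writers were already linked.)
  Writer* expected = &a;
  bool swapped = newest_writer.compare_exchange_strong(expected, &dummy);
  assert(swapped);
  (void)swapped;

  // A writer B arriving now does NOT become leader: the dummy keeps the list
  // non-empty, so B has to wait instead of writing immediately.
  Writer b{"B"};
  bool b_is_leader = Link(&b);
  assert(!b_is_leader);
  (void)b_is_leader;

  // A can therefore reach the memtable writer queue first, preserving the
  // WAL order [A, B]; only afterwards is the dummy removed (omitted here).
  memtable_queue.push_back(&a);
  return 0;
}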

@riversand963 commented Jul 27, 2022

I do not have much experience in lock-free programming, just another idea:
can we do the following in ~Writer()?

if (link_newer) {
  link_newer->link_older = nullptr;
}

The above requires that we do not reset the link_newer of the to-be-destructed Writer obj.

My observation is that the owning thread of the new leader (pointed to by link_newer of the to-be-destructed Writer) is waiting and not accessing its Writer object, so link_newer->link_older = nullptr should be safe as long as link_newer is valid.

UPDATE:
nvm, the above does not seem to work, or I have to think about it more.
Anyway, even if it does, I am not a fan of separating the logic of the linked-list manipulation, so we should not worry about it.

UPDATE 2:
the above does not work because when the Writer is destructed, its link_newer may not have been set up yet by a subsequent CreateMissingNewerLinks().

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@mpoeter commented Jul 28, 2022

@riversand963 Thanks for the review, I think I have addressed all of your comments.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963

I did some benchmarking using db_bench

TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_main -enable_pipelined_write=1 -progress_reports=0 -benchmarks=fillseq,overwrite[X3],fillrandom[X3] -threads=16

Before this PR

RocksDB:    version 7.6.0
Date:       Thu Jul 28 17:23:14 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Set seed to 1659054194349308 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
fillseq      :      26.047 micros/op 613232 ops/sec 26.091 seconds 16000000 operations;   67.8 MB/s
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
overwrite    :      25.351 micros/op 629141 ops/sec 25.431 seconds 16000000 operations;   69.6 MB/s
overwrite    :      27.422 micros/op 580673 ops/sec 27.554 seconds 16000000 operations;   64.2 MB/s
overwrite [AVG 2 runs] : 604907 (± 47497) ops/sec;   66.9 (± 5.3) MB/sec
overwrite    :      27.690 micros/op 575694 ops/sec 27.793 seconds 16000000 operations;   63.7 MB/s
overwrite [AVG 3 runs] : 595169 (± 33410) ops/sec;   65.8 (± 3.7) MB/sec
overwrite [AVG    3 runs] : 595169 (± 33410) ops/sec;   65.8 (± 3.7) MB/sec
overwrite [MEDIAN 3 runs] : 580673 ops/sec;   64.2 MB/sec
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
fillrandom   :      30.038 micros/op 531252 ops/sec 30.117 seconds 16000000 operations;   58.8 MB/s
fillrandom   :      27.098 micros/op 588992 ops/sec 27.165 seconds 16000000 operations;   65.2 MB/s
fillrandom [AVG 2 runs] : 560122 (± 56584) ops/sec;   62.0 (± 6.3) MB/sec
fillrandom   :      29.705 micros/op 537501 ops/sec 29.767 seconds 16000000 operations;   59.5 MB/s
fillrandom [AVG 3 runs] : 552582 (± 35856) ops/sec;   61.1 (± 4.0) MB/sec
fillrandom [AVG    3 runs] : 552582 (± 35856) ops/sec;   61.1 (± 4.0) MB/sec
fillrandom [MEDIAN 3 runs] : 537501 ops/sec;   59.5 MB/sec

With this PR

RocksDB:    version 7.6.0
Date:       Thu Jul 28 17:18:39 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Set seed to 1659053919797532 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
fillseq      :      21.306 micros/op 748766 ops/sec 21.368 seconds 16000000 operations;   82.8 MB/s
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
overwrite    :      26.874 micros/op 590716 ops/sec 27.086 seconds 16000000 operations;   65.3 MB/s
overwrite    :      27.772 micros/op 574241 ops/sec 27.863 seconds 16000000 operations;   63.5 MB/s
overwrite [AVG 2 runs] : 582479 (± 16144) ops/sec;   64.4 (± 1.8) MB/sec
overwrite    :      30.998 micros/op 514197 ops/sec 31.116 seconds 16000000 operations;   56.9 MB/s
overwrite [AVG 3 runs] : 559718 (± 45573) ops/sec;   61.9 (± 5.0) MB/sec
overwrite [AVG    3 runs] : 559718 (± 45573) ops/sec;   61.9 (± 5.0) MB/sec
overwrite [MEDIAN 3 runs] : 574241 ops/sec;   63.5 MB/sec
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/rocksdb/dbbench]
Running benchmark for 3 times
fillrandom   :      27.326 micros/op 584266 ops/sec 27.385 seconds 16000000 operations;   64.6 MB/s
fillrandom   :      33.816 micros/op 471942 ops/sec 33.902 seconds 16000000 operations;   52.2 MB/s
fillrandom [AVG 2 runs] : 528104 (± 110074) ops/sec;   58.4 (± 12.2) MB/sec
fillrandom   :      27.653 micros/op 576048 ops/sec 27.775 seconds 16000000 operations;   63.7 MB/s
fillrandom [AVG 3 runs] : 544085 (± 70851) ops/sec;   60.2 (± 7.8) MB/sec
fillrandom [AVG    3 runs] : 544085 (± 70851) ops/sec;   60.2 (± 7.8) MB/sec
fillrandom [MEDIAN 3 runs] : 576048 ops/sec;   63.7 MB/sec

If we just look at the averages, "overwrite" and "fillrandom" show a slight regression.
@mpoeter have you been able to do some benchmarking, as well as run the stress test?

@riversand963 commented Jul 29, 2022

I am able to fix the unit test added by this PR after changing only the following

diff --git a/db/db_impl/db_impl_write.cc b/db/db_impl/db_impl_write.cc
index 39657d462..eae8bb3a0 100644
--- a/db/db_impl/db_impl_write.cc
+++ b/db/db_impl/db_impl_write.cc
@@ -568,9 +568,22 @@ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,

   WriteContext write_context;

-  WriteThread::Writer w(write_options, my_batch, callback, log_ref,
-                        disable_memtable, /*_batch_cnt=*/0,
-                        /*_pre_release_callback=*/nullptr);
+  char wbuf1[sizeof(WriteThread::Writer)] = {0};
+  char wbuf2[sizeof(WriteThread::Writer)] = {0};
+  static thread_local unsigned long long __pipelined_write_call_id = 0ULL;
+  WriteThread::Writer* writer_ptr = nullptr;
+  if (__pipelined_write_call_id & 1) {
+    writer_ptr =
+        new (wbuf1) WriteThread::Writer(write_options, my_batch, callback,
+                                        log_ref, disable_memtable, 0, nullptr);
+  } else {
+    writer_ptr =
+        new (wbuf2) WriteThread::Writer(write_options, my_batch, callback,
+                                        log_ref, disable_memtable, 0, nullptr);
+  }
+  ++__pipelined_write_call_id;
+  assert(writer_ptr);
+  WriteThread::Writer& w = *writer_ptr;
   write_thread_.JoinBatchGroup(&w);
   TEST_SYNC_POINT("DBImplWrite::PipelinedWriteImpl:AfterJoinBatchGroup");
   if (w.state == WriteThread::STATE_GROUP_LEADER) {

Basically, I deterministically (I hope) make sure that two consecutive invocations of PipelinedWriteImpl() will not allocate w at the same stack address. I will be evaluating this approach and other ones.

@riversand963

Ran benchmarks on another host, and no obvious perf regression observed.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@riversand963

Thanks @mpoeter for fixing this tricky bug!

// a race where the owning thread of one of these writers can start a new
// write operation.
Writer dummy;
Writer* head = newest_writer_.load(std::memory_order_acquire);
Contributor

The start of this closely resembles the !enable_pipelined_write_ branch below, which has some useful comments. I would add a comment to refer to it. (It is not obvious how to cleanly combine the two pieces of code without introducing an unnecessary allocation of a dummy Writer on the stack.)

Contributor

I copied some comments here to explain what is being done, hoping this duplication is easier to maintain (because the comment is close to the code) than referring to another comment. If we had a way of identifying comments, I might have used a reference.

@facebook-github-bot

@mpoeter has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
