
After encountering a disk-full error, some write threads are stuck forever even though the disk has been cleaned up. #9692

Closed
zhuchong0329 opened this issue Mar 14, 2022 · 8 comments


zhuchong0329 commented Mar 14, 2022

Expected behavior

After disk cleanup, the write threads execute normally.

Actual behavior

After disk cleanup, the write threads are stuck forever.
In my environment, three write threads are stuck and their link_older pointers actually form a loop.

thread 1

Thread 89 (Thread 0x7f412b240700 (LWP 17794)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f3e493fc870) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f3e493fc7c0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f3e493fc7c0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f3e493fd380, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f3e493fd380, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

thread 2

Thread 81 (Thread 0x7f41266f8700 (LWP 17802)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f3e3d520850) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f3e3d5207a0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f3e3d5207a0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f3e3d5213f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f3e3d5213f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

thread 3

Thread 70 (Thread 0x7f41205ed700 (LWP 17813)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f2ed6154850) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f2ed61547a0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f2ed61547a0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f2ed61553f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f2ed61553f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

Their link_older pointers form a loop:

[Switching to thread 89 (Thread 0x7f412b240700 (LWP 17794))]
$8 = (rocksdb::WriteThread::Writer *) 0x7f3e493fc7c0
$3 = {batch = 0x7f3e493fd380, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
  made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
    subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
    sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
    __data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, "@\310?I>\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f2ed61547a0, link_newer = 0x0, 
  is_tdstore_region_writer_ = false}
[Switching to thread 81 (Thread 0x7f41266f8700 (LWP 17802))]
$7 = (rocksdb::WriteThread::Writer *) 0x7f3e3d5207a0
$4 = {batch = 0x7f3e3d5213f0, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
__data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, " \bR=>\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f3e493fc7c0, link_newer = 0x0, 
is_tdstore_region_writer_ = false}
[Switching to thread 70 (Thread 0x7f41205ed700 (LWP 17813))]
$6 = (rocksdb::WriteThread::Writer *) 0x7f2ed61547a0
$5 = {batch = 0x7f2ed61553f0, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
__data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, " H\025\326.\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f3e3d5207a0, link_newer = 0x0, 
is_tdstore_region_writer_ = false}

Steps to reproduce the behavior

Why do these three writers form a ring? I think it is caused by an ABA problem in the lock-free handling of the write list when the disk is full.
A conjectured execution path by which the three writers form a ring in the case above:

  1. w1 becomes the leader of write group 1, and w4 becomes the last writer in write group 1.

  2. w5 and w6 are added to the write linked list in succession.

  3. While executing, w1 encounters a disk-full error and needs to notify the other writers in write group 1. w4 is notified first and is removed from the write group.

  4. w1 notifies the other writers in turn.

  5. The thread owning w4 is woken by w1, returns the error, and then issues another write; call that writer w7. Unfortunately, the w7 object has the same address as the w4 object. w7 is added to the write list.

  6. After notifying all writers in the write group, w1 uses the recorded last writer to check whether newest_writer_ has been updated. Unfortunately, the current newest_writer_ equals last_writer (because of the address reuse), so newest_writer_ is set to nullptr.

  7. w5, w6, and w7 form a ring and are forgotten forever.

Where I think the bug is

In WriteThread::ExitAsBatchGroupLeader, when the status is not OK, CompleteFollower is executed first and FindNextLeader is executed afterwards. CompleteFollower wakes up the originally blocked write thread, which may lead to ABA problems.
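
To make the conjectured ring concrete, here is a minimal, self-contained C++ sketch of the ABA hazard. This is toy code, not RocksDB's implementation: a lock-free intrusive list stands in for newest_writer_/link_older, the leader records last_writer by address, a woken follower re-enqueues a new Writer that happens to reuse that address, and the leader's address-only check then wrongly concludes that nothing was linked after its group.

```cpp
// Toy model of the ABA hazard described above (illustrative only, not RocksDB code).
#include <atomic>
#include <cstdio>

struct Writer {
  Writer* link_older = nullptr;  // next-older writer in the intrusive list
};

std::atomic<Writer*> newest_writer{nullptr};

// Push a writer onto the newest end of the lock-free list.
void LinkOne(Writer* w) {
  Writer* head = newest_writer.load(std::memory_order_relaxed);
  do {
    w->link_older = head;
  } while (!newest_writer.compare_exchange_weak(head, w));
}

int main() {
  Writer w1, w4;              // w1 = group leader, w4 = last writer in the group
  LinkOne(&w1);
  LinkOne(&w4);
  Writer* last_writer = &w4;  // the leader remembers the last writer by address

  Writer w5, w6;              // newer writers arrive while the group executes
  LinkOne(&w5);
  LinkOne(&w6);

  // The group hits a disk-full error and the leader wakes w4 early. Its thread
  // returns the error and issues another write whose Writer object ("w7")
  // happens to reuse w4's stack address -- simulated here by re-linking &w4.
  Writer* w7 = &w4;
  LinkOne(w7);

  // The leader now retires the group: "if nothing was linked after last_writer,
  // clear the list." The check compares addresses only, so the reused address
  // makes it succeed and w5/w6/w7 become unreachable.
  Writer* expected = last_writer;
  bool cleared = newest_writer.compare_exchange_strong(expected, nullptr);
  std::printf("CAS succeeded: %s -> w5, w6, w7 are lost\n", cleared ? "yes" : "no");
  return 0;
}
```

Run single-threaded as above, the CAS deterministically succeeds, which mirrors step 6 of the conjecture: newest_writer_ is cleared even though newer writers exist.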

@jay-zhuang

A related issue: #9762, for recovery after disk full.

@ajkr added the "question" and "up-for-grabs" labels on Mar 30, 2022

ajkr commented Apr 21, 2022

Also #9876.


mpoeter commented May 2, 2022

I only just familiarized myself with the code, but I agree with @zhuchong0329's analysis. In ExitAsBatchGroupLeader with enable_pipelined_write_, writers for which ShouldWriteToMemtable() evaluates to false are removed from the write_group via CompleteFollower/CompleteLeader. Writers in the middle of the group are fully unlinked, but if such a writer is the last_writer, then we do not update the predecessor's link_older, i.e., this writer is still reachable via newest_writer_. In fact, we rely on last_writer still being reachable via newest_writer_, because FindNextLeader traverses the list until it finds last_writer.

But the problem is that CompleteFollower already wakes up the thread owning that writer before the writer has been removed. This results in a race: if the leader thread is fast enough, then everything is fine. However, if the woken-up thread finishes the current write operation and then performs yet another write, then the new writer instance will be added to newest_writer_. And if this thread is faster, we have a problem!

So far I only analyzed the situation for pipelined writes, but I will also check the other code paths.
@ajkr you added the "up-for-grabs" tag - should I attempt to create a fix?
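
As an illustration of the ordering question, here is a rough sketch, in the same toy model as the list sketch earlier in this issue, of one way to close the race: detach the whole group from the lock-free list before waking any follower, so a follower that immediately re-enqueues a new Writer at a reused address can no longer be mistaken for the old last_writer. The state values and helper names below are made up for the sketch; this is not the actual RocksDB change.

```cpp
// Conceptual sketch only: "unlink the group first, wake followers second".
#include <atomic>
#include <cstdio>

struct Writer {
  std::atomic<Writer*> link_older{nullptr};
  std::atomic<int> state{0};  // 2 = group leader, 4 = completed (made-up values)
};

std::atomic<Writer*> newest_writer{nullptr};

void SetState(Writer* w, int s) { w->state.store(s); /* plus waking its thread */ }

void ExitGroup(Writer* leader, Writer* last_writer) {
  // Step 1: make [leader .. last_writer] unreachable from newest_writer.
  Writer* expected = last_writer;
  if (!newest_writer.compare_exchange_strong(expected, nullptr)) {
    // Newer writers exist. The one directly newer than last_writer is the
    // oldest remaining writer: cut its link to our group and promote it.
    Writer* w = newest_writer.load();
    while (w->link_older.load() != last_writer) {
      w = w->link_older.load();
    }
    w->link_older.store(nullptr);
    SetState(w, 2);  // next group leader
  }
  // Step 2: only now complete the followers. Any Writer they re-enqueue, even
  // at a reused address, lands in a list this group no longer inspects.
  // (Read link_older before waking, since the woken thread may destroy it.)
  for (Writer* w = last_writer; w != leader;) {
    Writer* older = w->link_older.load();
    SetState(w, 4);  // completed
    w = older;
  }
  // The leader's own thread is the one running this, so it needs no wakeup.
}

int main() {
  // Tiny driver: list is w5 -> w4 -> w1, group is [w1 .. w4], w5 arrived later.
  Writer w1, w4, w5;
  w4.link_older.store(&w1);
  w5.link_older.store(&w4);
  newest_writer.store(&w5);
  ExitGroup(&w1, &w4);
  std::printf("w5 state=%d, w4 state=%d\n", w5.state.load(), w4.state.load());
  return 0;
}
```

The key point is only the ordering: once the group is unreachable from newest_writer, address reuse by a woken follower is harmless because the leader never consults last_writer's address again.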


mpoeter commented May 3, 2022

I have created a unit test that systematically triggers this error.
I will now start working on a fix.


riversand963 commented Jul 26, 2022

> The thread owning w4 is woken by w1, returns the error, and then issues another write; call that writer w7. Unfortunately, the w7 object has the same address as the w4 object. w7 is added to the write list.

Thanks @zhuchong0329 for the analysis.
I have been looking at the implementation of write_thread.[cc|h], but I cannot find where w4/w7 is added back to the WriteThread::newest_writers_ list during pipelined write while w4, w3, w2, w1 are still in the list with w1 being the leader. Maybe I am missing something. Do w1-w4 require memtable writes? If so, they should be added to the WriteThread::newest_memtable_writers_ list. If not, they should be removed from the list before the new leader is woken up.

I will need to check #9944 as well as @mpoeter's comments in this thread to see if my question can be clarified.

UPDATE: nvm, I think the comment above clarifies it.

@riversand963

> then the new writer instance will be added to newest_writer_. And if this thread is faster, we have a problem!

Good finding: a thread that executes the same function twice can allocate a local object of the same type at the same address on the stack.
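
A tiny standalone demo of that reuse (not RocksDB code; DoOneWrite and its Writer struct are made up, and the exact address behavior is implementation-dependent, though reuse across successive calls is the common case):

```cpp
// Two successive calls to the same function typically place their local
// object at the same stack address, which is what lets "w7" alias "w4".
#include <cstdint>
#include <cstdio>

struct Writer { int payload = 0; };  // stand-in for a stack-allocated writer

uintptr_t DoOneWrite() {
  Writer w;  // one writer object per call, on the stack
  return reinterpret_cast<uintptr_t>(&w);
}

int main() {
  uintptr_t first = DoOneWrite();
  uintptr_t second = DoOneWrite();
  // Address equality therefore cannot distinguish "the same writer" from
  // "a new writer that reuses the memory of a finished one".
  std::printf("same address across calls: %s\n", first == second ? "yes" : "no");
  return 0;
}
```

Compiled without optimization, this usually prints "yes", which is exactly why comparing Writer addresses cannot tell a finished writer apart from a brand-new one occupying the same stack slot.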

@riversand963

> So far I only analyzed the situation for pipelined writes

Have you found anything? I looked at the non-pipelined write case, and according to ExitAsBatchGroupLeader, the last writer in the group is completely unlinked, and thus becomes unreachable via newest_writers_, before the newer writers are woken up.

@riversand963

#9876 seems to be another occurrence, because pipelined write is confirmed to be used in that case.
