
After encountering a disk-full error, some write threads are stuck forever even though the disk has been cleaned up. #9692

Closed
zhuchong0329 opened this issue Mar 14, 2022 · 8 comments


zhuchong0329 commented Mar 14, 2022

Expected behavior

After disk cleanup, the write threads execute normally.

Actual behavior

After disk cleanup, the write threads are stuck forever.
In my environment, three write threads are stuck and their link_older pointers actually form a loop.

thread 1

Thread 89 (Thread 0x7f412b240700 (LWP 17794)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f3e493fc870) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f3e493fc7c0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f3e493fc7c0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f3e493fd380, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f3e493fd380, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

thread 2

Thread 81 (Thread 0x7f41266f8700 (LWP 17802)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f3e3d520850) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f3e3d5207a0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f3e3d5207a0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f3e3d5213f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f3e3d5213f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

thread 3

Thread 70 (Thread 0x7f41205ed700 (LWP 17813)):
#0  0x00007f41570b3a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f41560bdaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000d3306c in wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda8> (__p=..., __lock=..., this=0x7f2ed6154850) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f4153c8dfb0, w=w@entry=0x7f2ed61547a0, goal_mask=goal_mask@entry=30 '\036') at /data/landun/workspace/db/write_thread.cc:48
#4  0x0000000000d33181 in rocksdb::WriteThread::AwaitState (this=0x7f4153c8dfb0, w=0x7f2ed61547a0, goal_mask=30 '\036', ctx=0x32efa10 <rocksdb::jbg_ctx>) at /data/landun/workspace/db/write_thread.cc:185
#5  0x0000000000c57865 in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x7f4153c8d800, write_options=..., my_batch=my_batch@entry=0x7f2ed61553f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:758
#6  0x0000000000c5c346 in rocksdb::DBImpl::WriteImpl (this=0x7f4153c8d800, write_options=..., my_batch=0x7f2ed61553f0, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at /data/landun/workspace/db/db_impl_write.cc:431

Their link_older pointers form a loop:

[Switching to thread 89 (Thread 0x7f412b240700 (LWP 17794))]
$8 = (rocksdb::WriteThread::Writer *) 0x7f3e493fc7c0
$3 = {batch = 0x7f3e493fd380, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
  made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
    subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
    sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
    __data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, "@\310?I>\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f2ed61547a0, link_newer = 0x0, 
  is_tdstore_region_writer_ = false}
[Switching to thread 81 (Thread 0x7f41266f8700 (LWP 17802))]
$7 = (rocksdb::WriteThread::Writer *) 0x7f3e3d5207a0
$4 = {batch = 0x7f3e3d5213f0, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
__data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, " \bR=>\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f3e493fc7c0, link_newer = 0x0, 
is_tdstore_region_writer_ = false}
[Switching to thread 70 (Thread 0x7f41205ed700 (LWP 17813))]
$6 = (rocksdb::WriteThread::Writer *) 0x7f2ed61547a0
$5 = {batch = 0x7f2ed61553f0, sync = true, no_slowdown = false, disable_wal = false, disable_memtable = false, batch_cnt = 0, pre_release_callback = 0x0, log_used = 0, log_ref = 0, callback = 0x0, 
made_waitable = true, state = {<std::__atomic_base<unsigned char>> = {_M_i = 32 ' '}, <No data fields>}, write_group = 0x0, sequence = 72057594037927935, status = {code_ = rocksdb::Status::kOk, 
subcode_ = rocksdb::Status::kNone, tdcode_ = 0, sev_ = rocksdb::Status::kNoError, state_ = 0x0}, callback_status = {code_ = rocksdb::Status::kOk, subcode_ = rocksdb::Status::kNone, tdcode_ = 0, 
sev_ = rocksdb::Status::kNoError, state_ = 0x0}, state_mutex_bytes = {__data = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = {<No data fields>}}, state_cv_bytes = {
__data = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, " H\025\326.\177\000\000\002\000\000\000\000\000\000", __align = {<No data fields>}}, link_older = 0x7f3e3d5207a0, link_newer = 0x0, 
is_tdstore_region_writer_ = false}

Steps to reproduce the behavior

Why do these three writers form a ring? I think it is caused by an ABA problem in the lock-free handling of the write list when the disk is full.
A conjectured execution path by which the three writers form a ring in the case above:

  1. w1 becomes the leader of write group 1, and w4 becomes the last writer in write group 1.

  2. w5 and w6 are added to the write linked list in succession.

  3. While executing, w1 encounters a disk-full error and needs to notify the other writers in write group 1. w4 is notified first and is removed from the write group.

  4. w1 notifies the other writers in turn.

  5. The thread owning w4 is woken by w1, returns the error, and then issues another write; call that writer w7. Unfortunately, the w7 object has the same address as the w4 object. w7 is added to the write list.

  6. After notifying all writers in the write group, w1 uses the recorded last writer to check whether newest_writer_ has been updated. Unfortunately, the current newest_writer_ equals last_writer (because of the address reuse), so newest_writer_ is set to nullptr.

  7. w5, w6, and w7 form a ring and are forgotten forever.

Where I think the bug is

In WriteThread::ExitAsBatchGroupLeader, when the status is not OK, CompleteFollower is executed first and FindNextLeader is executed afterwards. CompleteFollower wakes up the originally blocked write thread, which may lead to ABA problems.
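
To make the conjectured ring concrete, here is a minimal, self-contained C++ sketch of the ABA hazard. This is toy code, not RocksDB's implementation: a lock-free intrusive list stands in for newest_writer_/link_older, the leader records last_writer by address, a woken follower re-enqueues a new Writer that happens to reuse that address, and the leader's address-only check then wrongly concludes that nothing was linked after its group.

```cpp
// Toy model of the ABA hazard described above (illustrative only, not RocksDB code).
#include <atomic>
#include <cstdio>

struct Writer {
  Writer* link_older = nullptr;  // next-older writer in the intrusive list
};

std::atomic<Writer*> newest_writer{nullptr};

// Push a writer onto the newest end of the lock-free list.
void LinkOne(Writer* w) {
  Writer* head = newest_writer.load(std::memory_order_relaxed);
  do {
    w->link_older = head;
  } while (!newest_writer.compare_exchange_weak(head, w));
}

int main() {
  Writer w1, w4;              // w1 = group leader, w4 = last writer in the group
  LinkOne(&w1);
  LinkOne(&w4);
  Writer* last_writer = &w4;  // the leader remembers the last writer by address

  Writer w5, w6;              // newer writers arrive while the group executes
  LinkOne(&w5);
  LinkOne(&w6);

  // The group hits a disk-full error and the leader wakes w4 early. Its thread
  // returns the error and issues another write whose Writer object ("w7")
  // happens to reuse w4's stack address -- simulated here by re-linking &w4.
  Writer* w7 = &w4;
  LinkOne(w7);

  // The leader now retires the group: "if nothing was linked after last_writer,
  // clear the list." The check compares addresses only, so the reused address
  // makes it succeed and w5/w6/w7 become unreachable.
  Writer* expected = last_writer;
  bool cleared = newest_writer.compare_exchange_strong(expected, nullptr);
  std::printf("CAS succeeded: %s -> w5, w6, w7 are lost\n", cleared ? "yes" : "no");
  return 0;
}
```

Run single-threaded as above, the CAS deterministically succeeds, which mirrors step 6 of the conjecture: newest_writer_ is cleared even though newer writers exist.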

@jay-zhuang

A related issue: #9762, for recovery after disk full.

@ajkr added the "question" and "up-for-grabs" labels on Mar 30, 2022

ajkr commented Apr 21, 2022

Also #9876.


mpoeter commented May 2, 2022

I only just familiarized myself with the code, but I agree with @zhuchong0329's analysis. In ExitAsBatchGroupLeader with enable_pipelined_write_, writers for which ShouldWriteToMemtable() evaluates to false are removed from the write_group via CompleteFollower/CompleteLeader. Writers in the middle of the group are fully unlinked, but if such a writer is the last_writer, then we do not update the predecessor's link_older, i.e., this writer is still reachable via newest_writer_. In fact, we rely on last_writer still being reachable via newest_writer_, because FindNextLeader traverses the list until it finds last_writer.

But the problem is that CompleteFollower already wakes up the thread owning that writer before the writer has been removed. This results in a race: if the leader thread is fast enough, then everything is fine. However, if the woken-up thread finishes the current write operation and then performs yet another write, then the new writer instance will be added to newest_writer_. And if this thread is faster, we have a problem!

So far I only analyzed the situation for pipelined writes, but I will also check the other code paths.
@ajkr you added the "up-for-grabs" tag - should I attempt to create a fix?
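
As an illustration of the ordering question, here is a rough sketch, in the same toy model as the list sketch earlier in this issue, of one way to close the race: detach the whole group from the lock-free list before waking any follower, so a follower that immediately re-enqueues a new Writer at a reused address can no longer be mistaken for the old last_writer. The state values and helper names below are made up for the sketch; this is not the actual RocksDB change.

```cpp
// Conceptual sketch only: "unlink the group first, wake followers second".
#include <atomic>
#include <cstdio>

struct Writer {
  std::atomic<Writer*> link_older{nullptr};
  std::atomic<int> state{0};  // 2 = group leader, 4 = completed (made-up values)
};

std::atomic<Writer*> newest_writer{nullptr};

void SetState(Writer* w, int s) { w->state.store(s); /* plus waking its thread */ }

void ExitGroup(Writer* leader, Writer* last_writer) {
  // Step 1: make [leader .. last_writer] unreachable from newest_writer.
  Writer* expected = last_writer;
  if (!newest_writer.compare_exchange_strong(expected, nullptr)) {
    // Newer writers exist. The one directly newer than last_writer is the
    // oldest remaining writer: cut its link to our group and promote it.
    Writer* w = newest_writer.load();
    while (w->link_older.load() != last_writer) {
      w = w->link_older.load();
    }
    w->link_older.store(nullptr);
    SetState(w, 2);  // next group leader
  }
  // Step 2: only now complete the followers. Any Writer they re-enqueue, even
  // at a reused address, lands in a list this group no longer inspects.
  // (Read link_older before waking, since the woken thread may destroy it.)
  for (Writer* w = last_writer; w != leader;) {
    Writer* older = w->link_older.load();
    SetState(w, 4);  // completed
    w = older;
  }
  // The leader's own thread is the one running this, so it needs no wakeup.
}

int main() {
  // Tiny driver: list is w5 -> w4 -> w1, group is [w1 .. w4], w5 arrived later.
  Writer w1, w4, w5;
  w4.link_older.store(&w1);
  w5.link_older.store(&w4);
  newest_writer.store(&w5);
  ExitGroup(&w1, &w4);
  std::printf("w5 state=%d, w4 state=%d\n", w5.state.load(), w4.state.load());
  return 0;
}
```

The key point is only the ordering: once the group is unreachable from newest_writer, address reuse by a woken follower is harmless because the leader never consults last_writer's address again.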


mpoeter commented May 3, 2022

I have created a unit test that systematically triggers this error.
I will now start working on a fix.


riversand963 commented Jul 26, 2022

> The thread owning w4 is woken by w1, returns the error, and then issues another write; call that writer w7. Unfortunately, the w7 object has the same address as the w4 object. w7 is added to the write list.

Thanks @zhuchong0329 for the analysis.
I have been looking at the implementation of write_thread.[cc|h], but I cannot find where w4/w7 is added back to the WriteThread::newest_writers_ list during pipelined write while w4, w3, w2, w1 are still in the list with w1 being the leader. Maybe I am missing something. Do w1-w4 require memtable writes? If so, they should be added to the WriteThread::newest_memtable_writers_ list. If not, they should be removed from the list before the new leader is woken up.

I will need to check #9944 as well as @mpoeter's comments in this thread to see if my question can be clarified.

UPDATE: nvm, I think the comment above clarifies it.

@riversand963

> then the new writer instance will be added to newest_writer_. And if this thread is faster, we have a problem!

Good finding: a thread that executes the same function twice can allocate a local object of the same type at the same address on the stack.
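
A tiny standalone demo of that reuse (not RocksDB code; DoOneWrite and its Writer struct are made up, and the exact address behavior is implementation-dependent, though reuse across successive calls is the common case):

```cpp
// Two successive calls to the same function typically place their local
// object at the same stack address, which is what lets "w7" alias "w4".
#include <cstdint>
#include <cstdio>

struct Writer { int payload = 0; };  // stand-in for a stack-allocated writer

uintptr_t DoOneWrite() {
  Writer w;  // one writer object per call, on the stack
  return reinterpret_cast<uintptr_t>(&w);
}

int main() {
  uintptr_t first = DoOneWrite();
  uintptr_t second = DoOneWrite();
  // Address equality therefore cannot distinguish "the same writer" from
  // "a new writer that reuses the memory of a finished one".
  std::printf("same address across calls: %s\n", first == second ? "yes" : "no");
  return 0;
}
```

Compiled without optimization, this usually prints "yes", which is exactly why comparing Writer addresses cannot tell a finished writer apart from a brand-new one occupying the same stack slot.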

@riversand963

> So far I only analyzed the situation for pipelined writes

Have you found anything? I looked at the non-pipelined write case, and according to ExitAsBatchGroupLeader, the last writer in the group is completely unlinked, and thus becomes unreachable via newest_writers_, before the newer writers are woken up.

@riversand963

#9876 seems to be another occurrence, because pipelined write is confirmed to be used in that case.
