Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise LockWAL/UnlockWAL implementation #11020

Closed
wants to merge 11 commits into from

Conversation

riversand963
Copy link
Contributor

@riversand963 riversand963 commented Dec 6, 2022

RocksDB has two public APIs: DB::LockWAL()/DB::UnlockWAL(). The current implementation acquires and
releases the internal DBImpl::log_write_mutex_.

According to the comment on DBImpl::log_write_mutex_: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288

Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the LockWAL() API. After LockWAL() returns ok, then application
should not perform any operation that acquires mutex_. Currently, the use case of LockWAL() is MyRocks implementing
the MySQL storage engine handlerton lock_hton_log interface. The operation that MyRocks performs after LockWAL()
is GetSortedWalFiless() which not only acquires mutex_, but also log_write_mutex_.

There are two issues:

  1. Applications using these two APIs may hang if one thread calls GetSortedWalFiles() after
    calling LockWAL() because log_write_mutex is not recursive.
  2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
log_write_mutex_ held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Test plan:
make check

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor Author

Previous impl can be found at #5146

@@ -924,6 +924,38 @@ Status DBImpl::WriteImplWALOnly(
write_thread->ExitAsBatchGroupLeader(write_group, status);
return status;
}
} else {
// TODO(yanqin): maybe move this block into a refactored version of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can at least put this block into a helper function

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this approach. Was wondering about a few more possible simplifications.

db/db_impl/db_impl.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_write.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

db/db_impl/db_impl_write.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@riversand963
Copy link
Contributor Author

Thanks @ajkr for the review!

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor Author

Looking into flaky tests.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 merged this pull request in c93ba7d.

@riversand963 riversand963 deleted the fix-deadlock-2 branch December 14, 2022 06:03
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 25, 2023
Summary: PR facebook#11020 fixed a case where it was easy to deadlock the DB
with LockWAL() but introduced a bug showing up as a rare assertion
failure in the stress test. Specifically, `assert(w->state ==
STATE_INIT)` in `WriteThread::LinkOne()` called from
`BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't
been about to generate a unit test that reproduces this failure but I
believe the root cause is that DelayWrite() was never meant to be
re-entrant, only called from the DB's write_thread_ leader. facebook#11020
introduced a call to DelayWrite() from the nonmem_write_thread_ group
leader.

This fix is to make DelayWrite() apply to the specific write queue that
it is being called from (inject a dummy write stall entry to the head of
the appropriate write queue). WriteController is re-entrant, based on
polling and state changes signalled with bg_cv_, so can manage stalling
two queues. The only anticipated complication (called out by Andrew in
previous PR) is that we don't want timed write delays being injected in
parallel for the two queues, because that dimishes the intended
throttling effect. Thus, we only allow timed delays for the primary
write queue.

Test Plan: Although I was not able to reproduce the assertion failure,
I was able to reproduce a distinct flaw with what I believe is the same
root cause: a kind of deadlock if both write queues need to wake up from
stopped writes. Only one will be waiting on bg_cv_ (the other waiting
in `LinkOne()` for the write queue to open up), so a single
SignalAll() will only unblock one of the queues, with the other
re-instating the stop until another signal on bg_cv_. A simple unit test
is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for
issues.
facebook-github-bot pushed a commit that referenced this pull request Jan 25, 2023
Summary:
PR #11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. #11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.

This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.

HISTORY not updated because this is intended for the same release where the bug was introduced.

Pull Request resolved: #11130

Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for issues.

Reviewed By: ajkr

Differential Revision: D42749330

Pulled By: pdillinger

fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 25, 2023
Summary:
PR facebook#11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. facebook#11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.

This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.

HISTORY not updated because this is intended for the same release where the bug was introduced.

Pull Request resolved: facebook#11130

Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for issues.

Reviewed By: ajkr

Differential Revision: D42749330

Pulled By: pdillinger

fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants