
WritePrepared: reduce prepared_mutex_ overhead #5420

Conversation

@maysamyabandeh (Contributor) commented Jun 6, 2019:

The patch reduces the contention over prepared_mutex_ using these techniques (a sketch of the resulting locking scheme follows this list):

  1. Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
  2. Use two separate mutexes for PreparedHeap: prepared_mutex_, needed for ::RemovePrepared, and ::push_pop_mutex(), needed for ::AddPrepared(). Given that ::AddPrepared is called only from the first write queue and ::RemovePrepared mostly from the second, the two write queues no longer compete over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
  3. Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
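As a rough, hypothetical sketch of how item 2 splits the work (simplified standalone C++; the class and member layout are assumptions, not the actual RocksDB PreparedHeap):

#include <cstdint>
#include <deque>
#include <mutex>
#include <set>

// Simplified sketch of the two-mutex scheme described above.
class TwoMutexPreparedHeap {
  std::mutex push_pop_mutex_;  // taken by AddPrepared (first write queue)
  std::mutex prepared_mutex_;  // taken by RemovePrepared (second write queue)
  std::deque<uint64_t> heap_;  // prepare seqs, pushed in increasing order
  std::set<uint64_t> erased_;  // entries erased before reaching the top

 public:
  // AddPrepared path: only the first write queue calls this.
  void push(uint64_t seq) {
    std::lock_guard<std::mutex> lock(push_pop_mutex_);
    heap_.push_back(seq);
  }

  // RemovePrepared path: mostly called from the second write queue.
  void erase(uint64_t seq) {
    std::lock_guard<std::mutex> lock(prepared_mutex_);
    if (!heap_.empty() && heap_.front() == seq) {
      // Only this branch competes with push, so only it takes push_pop_mutex_.
      std::lock_guard<std::mutex> pop_lock(push_pop_mutex_);
      heap_.pop_front();
      // Also drop entries that were erased earlier and have reached the top.
      while (!heap_.empty() && erased_.erase(heap_.front()) > 0) {
        heap_.pop_front();
      }
    } else {
      erased_.insert(seq);  // defer removal until the entry reaches the top
    }
  }
};

Under this split, the common case is that the two write queues touch different mutexes; they only meet on push_pop_mutex_ when an erase actually has to pop.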

@@ -55,25 +55,17 @@ TEST(PreparedHeap, BasicsTest) {
heap.push(34l);
// Test that old min is still on top
ASSERT_EQ(14l, heap.top());
heap.push(13l);
Contributor Author:
With ::AddPrepared being called from the main write queue, the push is now called in order. So we should remove tests with unordered calls to push. This will later help to simplify the implementation of PreparedHeap as well.

Contributor:
Is this only true in the case of two write queues, though? If so, maybe we should just decide to only support two write queues.

Contributor Author:
No, it is true regardless. I did not mention the one-write-queue case because with one write queue the inserts into PreparedHeap have always been in order.
But I am in favor of deprecating the one-write-queue case, which would make the implementation simpler. We should keep that direction in mind.
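As a hedged illustration of the ordered-only coverage discussed above (assuming the same WritePreparedTxnDB::PreparedHeap type and the push/top calls that BasicsTest already exercises; the test name and values here are made up):

TEST(PreparedHeap, OrderedPushSketch) {
  WritePreparedTxnDB::PreparedHeap heap;
  // AddPrepared is driven by the main write queue, so pushes arrive in
  // increasing order; no out-of-order push case needs to be exercised.
  heap.push(14l);
  heap.push(34l);
  heap.push(64l);
  ASSERT_EQ(14l, heap.top());  // the smallest prepare seq stays on top
}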

{"AddPrepared::begin:pause", "AddPreparedBeforeMax::read_thread:start"},
{"AdvanceMaxEvictedSeq::update_max:pause", "AddPrepared::begin:resume"},
{"AddPrepared::end", "AdvanceMaxEvictedSeq::update_max:resume"},
{"AddPreparedCallback::AddPrepared::begin:pause", "AddPreparedBeforeMax::read_thread:start"},
Contributor Author:
Since we now acquire the mutex outside AddPrepared, the synchronization points were moved from within ::AddPrepared to outside it (into AddPreparedCallback).

@@ -263,16 +263,19 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
size_t total_count = 0;
size_t valid_batches = 0;
size_t total_byte_size = 0;
size_t pre_release_callback_cnt = 0;
Contributor Author:
The changes to db_impl_write.cc are separately approved in #5381

uint64_t log_number, size_t index,
size_t total) override {
assert(index < total);
// To reduce lock intention with the conccurrent prepare requests, lock on
Contributor:
lock contention with the concurrent

auto to_be_popped = prepared_txns_.top();
delayed_prepared_.insert(to_be_popped);
ROCKS_LOG_WARN(info_log_,
"prepared_mutex_ overhead %" PRIu64 " (prep=%" PRIu64
" new_max=%" PRIu64,
static_cast<uint64_t>(delayed_prepared_.size()),
to_be_popped, new_max);
prepared_txns_.pop();
prepared_txns_.pop(true /*locked*/);
Contributor:
Is there a potential race condition between the prepared_txns_.top() <= new_max check and the pop here (i.e., is it possible to store into delayed_prepared_empty_ something greater than the new max)?

Contributor Author:
delayed_prepared_empty_ is a bool, so it cannot be greater than a uint64_t. But if you mean prepared_txns_.top(), it is supposed to be updated after the pop; otherwise the loop would continue indefinitely.

Contributor:
Ah, I meant delayed_prepared_.insert(to_be_popped);

Contributor Author:
Oh, yes. It is possible. Will submit a fix soon.

@lth self-requested a review June 7, 2019 21:53

@lth (Contributor) commented Jun 7, 2019:
I didn't mean to accept actually.

@lth (Contributor) left a comment:
Looks good to me.

to_be_popped, new_max);
prepared_txns_.pop();
delayed_prepared_empty_.store(false, std::memory_order_release);
// Need to fetch feresh values of ::top after mutex is acquired
Contributor:
"fresh" values

uint64_t log_number, size_t index,
size_t total) override {
assert(index < total);
// To reduce lock intention with the concurrent prepare requests, lock on
Contributor:
Lock contention?

Also, I don't know if I'd call it lock contention because it looks like you actually increase lock contention by holding the lock for longer periods of time (whereas before, it was shorter, but more frequent). I think you're just saving CPU?

Contributor Author:
Perhaps I can say "reduce lock acquisition cost".
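For readers skimming the thread, a small standalone sketch of the pattern being debated (names and structure are illustrative, not the PR's actual AddPreparedCallback): the first callback of a write group takes the mutex and the last one releases it, so the lock is held longer but is acquired once per group instead of once per sub-batch.

#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical, simplified illustration of "lock on the first callback,
// release on the last". The (index, total) pair mirrors the callback
// arguments shown in the diff above; it assumes all callbacks of a group
// run on the same thread, so the same thread that locks also unlocks.
struct GroupScopedLockSketch {
  std::mutex push_pop_mutex_;

  void Callback(size_t index, size_t total) {
    assert(index < total);
    if (index == 0) {
      push_pop_mutex_.lock();  // first sub-batch of the group takes the lock
    }
    // ... AddPrepared-style work would run here with the mutex already held ...
    if (index + 1 == total) {
      push_pop_mutex_.unlock();  // last sub-batch releases it
    }
  }
};

Whether that is best described as less contention or simply fewer lock acquisitions is exactly the wording question raised above.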

@facebook-github-bot left a comment:
@maysamyabandeh is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:
@maysamyabandeh merged this pull request in c292dc8.

vagogte pushed a commit to vagogte/rocksdb that referenced this pull request Jun 18, 2019
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutexes for PreparedHeap: prepared_mutex_, needed for ::RemovePrepared, and ::push_pop_mutex(), needed for ::AddPrepared(). Given that ::AddPrepared is called only from the first write queue and ::RemovePrepared mostly from the second, the two write queues no longer compete over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: facebook#5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735