
Notifying OnFlushCompleted and OnCompactionCompleted in sequence #6342

Conversation

@burtonli (Contributor) commented Jan 29, 2020

Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the OnFlushCompleted and OnCompactionCompleted callbacks of
the Listener interface: upon receiving a flush notification, a link is added
between the newly flushed SST and the corresponding blob file; for
compactions, links are removed for the inputs and added for the outputs.
In certain cases, it is possible for the compaction notification that results
in a link being removed to precede the flush notification that establishes
the link (see #6338).

This change is a general fix that ensures OnFlushCompleted and OnCompactionCompleted notifications are delivered in sequence, which is a more natural contract for external consumers that build logic similar to BlobDB's on top of these callbacks.
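To make the ordering dependency concrete, here is a minimal sketch of a BlobDB-style consumer of these callbacks. It is hypothetical (not BlobDB's actual implementation) and assumes the file_number/oldest_blob_file_number fields exposed via FlushJobInfo and CompactionJobInfo's input_file_infos/output_file_infos in the listener API of this era:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

#include "rocksdb/listener.h"

// Hypothetical listener mirroring the bookkeeping described above: it
// silently assumes the flush notification for an SST arrives before any
// compaction notification that uses that SST as an input.
class LinkTrackingListener : public rocksdb::EventListener {
 public:
  void OnFlushCompleted(rocksdb::DB* /*db*/,
                        const rocksdb::FlushJobInfo& info) override {
    std::lock_guard<std::mutex> lock(mutex_);
    // Link the newly flushed SST to its blob file.
    links_[info.file_number] = info.oldest_blob_file_number;
  }

  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& input : info.input_file_infos) {
      // If OnFlushCompleted for this input has not fired yet, the link
      // is missing here -- the exact race described in #6338.
      links_.erase(input.file_number);
    }
    for (const auto& output : info.output_file_infos) {
      links_[output.file_number] = output.oldest_blob_file_number;
    }
  }

 private:
  std::mutex mutex_;
  std::map<uint64_t, uint64_t> links_;  // SST file number -> blob file number
};
```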

@burtonli burtonli requested a review from ltamasi January 29, 2020 19:18
@siying siying requested review from riversand963 and removed request for siying January 29, 2020 20:32
…duleWork to handle flush and compaction completion notifications.

2. Use a cv and an id queue to handle completion notification ordering.
@ltamasi (Contributor) left a comment

Thanks so much for working on this @burtonli ! I have some comments/questions, please see below.

// Next notification id for completion listeners.
uint64_t next_notification_id_;
// Completion listeners queue.
std::deque<uint64_t> notification_queue_;
@ltamasi (Contributor):

Could you please clarify why this queueing mechanism is necessary?

@burtonli (Contributor, Author):

We need an extra lock to keep the notifications in sequence, since holding mutex_ is not acceptable while notifying listeners.
The original idea behind maintaining a queue was that we could get some parallelism when preparing the notification objects, e.g.:

  1. nlock.lock()
  2. Add the notification id to the queue.
  3. nlock.unlock()
  4. Prepare the notification object. <-- parallel tasks
  5. Wait on the condition variable until the current id is at the front of the queue.
  6. Send the notifications.

But that looks like a bit of an overkill considering the extra code complexity, so I have simplified the logic.
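For reference, a minimal sketch of that discarded design might look as follows; the class and method names are illustrative (they echo the next_notification_id_/notification_queue_ members quoted above), not code from the PR:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

// Sketch of the queue + condition variable scheme from the steps above.
class NotificationSequencer {
 public:
  // Steps 1-3: reserve a slot in FIFO order.
  uint64_t Enqueue() {
    std::lock_guard<std::mutex> lock(nlock_);
    const uint64_t id = next_notification_id_++;
    notification_queue_.push_back(id);
    return id;
  }

  // Step 5: block until this id reaches the front of the queue.
  // (Step 4, preparing the notification object, happens before this call
  // and may run in parallel across background threads.)
  void WaitForTurn(uint64_t id) {
    std::unique_lock<std::mutex> lock(nlock_);
    cv_.wait(lock, [&] {
      return !notification_queue_.empty() && notification_queue_.front() == id;
    });
  }

  // After step 6 (sending the notifications): release the slot and wake
  // the next waiter.
  void Done(uint64_t id) {
    {
      std::lock_guard<std::mutex> lock(nlock_);
      assert(!notification_queue_.empty() &&
             notification_queue_.front() == id);
      (void)id;  // only used by the assert in debug builds
      notification_queue_.pop_front();
    }
    cv_.notify_all();
  }

 private:
  std::mutex nlock_;
  std::condition_variable cv_;
  uint64_t next_notification_id_ = 0;
  std::deque<uint64_t> notification_queue_;
};
```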

@ltamasi (Contributor):

Yeah, I totally get the need for the mutex; I was just wondering about the queue (since it was FIFO, apparently). And I do agree about the complexity (it's better to keep it simple).

@ltamasi (Contributor):

@burtonli Actually, question: isn't causality ensured simply by issuing the notification before scheduling further work?

@ltamasi (Contributor):

Oh, or is it that another background task might also finish around the same time and schedule a compaction?

@burtonli (Contributor, Author):

Correct. In the original logic there is no lock protecting the flush or compaction completion notifications; even if we make sure the current thread doesn't trigger any new compaction before the notification, other background tasks may jump ahead. It's easy to get a repro by configuring aggressive LSM compaction, e.g., a 10KB memtable size and an L0 compaction trigger of 4.
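For illustration, a repro configuration along those lines might look like this; the memtable size and the L0 trigger are the values mentioned above, while the remaining settings are assumptions:

```cpp
#include "rocksdb/options.h"

// Aggressive LSM settings so that flushes and compactions overlap
// constantly, making the notification race easy to hit.
rocksdb::Options MakeReproOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 10 * 1024;           // 10KB memtables
  options.level0_file_num_compaction_trigger = 4;  // compact L0 early
  options.max_background_jobs = 4;                 // concurrent background work
  return options;
}
```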

@ltamasi (Contributor):

Makes sense, thanks for clarifying!

@burtonli burtonli changed the title Always notifying on flush completed ahead of triggering compaction Notifying OnFlushCompleted and OnCompactionCompleted in sequence Jan 30, 2020
@ltamasi (Contributor) left a comment

Thanks @burtonli! I just have a couple of minor comments.

HISTORY.md
@@ -4,6 +4,7 @@
* Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
* Fix a bug that prevents opening a DB after two consecutive crash with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
* Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
* Fix BlobDB crash #6338 for maintaining mapping between SSTs and blob files when enabling garbage collection, by keeping flush and compaction completion notifications in sequence.
@ltamasi (Contributor):

Can we flip this around to clarify the fix/impact? Like "Fixed an issue where listeners could receive out of order OnFlushCompleted/OnCompactionCompleted notifications. This could cause a crash in BlobDB when garbage collection is enabled (see #6338)."

  InstallSuperVersionAndScheduleWork(c->column_family_data(),
                                     &job_context->superversion_contexts[0],
-                                    *c->mutable_cf_options());
+                                    *c->mutable_cf_options(), callback);
@ltamasi (Contributor):

Just a side note that has nothing to do with the PR per se, but we might have a small bug here: InstallSuperVersionAndScheduleWork is called here (and on the trivial move branch) regardless of status.

@burtonli (Contributor, Author):

I have noticed that as well. There is also a difference in behavior between NotifyOnCompactionCompleted and NotifyOnFlushCompleted: NotifyOnFlushCompleted only fires when status.ok(), but NotifyOnCompactionCompleted fires regardless of status. I kept the existing logic as it is; we can have a separate PR to address these.
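In outline (a simplified paraphrase of the control flow, not the actual RocksDB source), the asymmetry is:

```cpp
// Flush path: listeners only hear about successful flushes.
if (status.ok()) {
  NotifyOnFlushCompleted(/* ... */);
}

// Compaction path: listeners are notified regardless of status
// (CompactionJobInfo carries the status so listeners can inspect it).
NotifyOnCompactionCompleted(/* ... */);
```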

@ltamasi (Contributor):

Yeah, this is definitely out of scope here. Also, IIRC FlushJobInfo does not even have a status field.

@ltamasi (Contributor) left a comment

Thanks so much for the fix @burtonli !

@facebook-github-bot left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963 (Contributor):

Kind of similar to #6069; will take a look.

@facebook-github-bot

@burtonli has updated the pull request.

@facebook-github-bot left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@burtonli has updated the pull request.

@riversand963 (Contributor):

Thanks @burtonli for reporting and investigating this issue. It's very helpful.
@ltamasi and I discussed this offline, and we agree we must solve this issue. There is a concern about adding a new mutex here.
First, in https://github.com/facebook/rocksdb/blob/master/include/rocksdb/listener.h#L309:L310, we say "without holding any DB mutex". Different people may have different understandings of the phrase "any DB mutex". Does it refer to all mutexes of the DB class? If so, adding another mutex and locking it while calling listener methods breaks this promise.
Second, do we want to order listener callbacks among different column families in the same db? Probably not; thus, a db-wide mutex seems like overkill for those cases.
With that said, I suggest we take time to think about a long-term solution that ensures in-order execution of listener methods when necessary. In the meantime, however, I would like to know whether you can cherry-pick/back-port this change to your codebase so that you won't be blocked. IIRC, your RocksDB codebase lives in a private, internal repo and is not up to date with RocksDB master.
Please let me know if you have any questions and/or concerns.

@burtonli (Contributor, Author) commented Feb 5, 2020

@riversand963 @ltamasi Thanks for the feedback. I think the design of notifying event listeners within the same thread as the flush/compaction requires callbacks to be lightweight: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/listener.h#L319:L321
Otherwise, that thread is blocked while the DB itself stays fine, since the big mutex_ is not held during the callbacks. Based on this consideration, the fix doesn't break that assumption: it still won't block the entire DB. On the other hand, I do agree it may hurt performance if customers put expensive operations into the callbacks, and it's better not to lock across CFs.
Actually, I had the same worry before adding the extra lock here, so I considered two other alternatives before implementing the fix:

  1. Build an internal notification queue, put notifications into the queue in sequence, and have another background thread send them.
     The problems with this approach are:
  • Not all notifications need a queue to be ordered, e.g., for any specific SST, OnFlushBegin is guaranteed to be called ahead of OnFlushCompleted, as they are called within the same thread.
  • Customers can easily do the same thing to make callbacks cheap, by having a queue on their side and only inserting the notification into the queue from the callback (see the sketch after this comment).
  • The extra code complexity is not trivial.
  2. Have a special callback for BlobDB rather than reusing the general listeners. This might work, but it depends on whether we want BlobDB to be more like a stackable DB or more like a native built-in DB type.

I'm not in a rush for the fix; let's do it properly!
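To illustrate the client-side option mentioned under alternative 1, here is a hypothetical wrapper (AsyncNotificationQueue is an invented name): the listener callback only enqueues a closure, and a worker thread does the expensive handling, so RocksDB's background threads are never blocked:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical client-side queue: callbacks stay cheap by only enqueuing,
// and a dedicated worker thread drains events in FIFO order.
class AsyncNotificationQueue {
 public:
  AsyncNotificationQueue() : worker_([this] { Drain(); }) {}

  ~AsyncNotificationQueue() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // remaining events are drained before shutdown
  }

  // Called from OnFlushCompleted/OnCompactionCompleted: O(1), non-blocking.
  void Post(std::function<void()> event) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(event));
    }
    cv_.notify_one();
  }

 private:
  void Drain() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      cv_.wait(lock, [&] { return done_ || !queue_.empty(); });
      if (queue_.empty()) break;  // done_ was set and everything is drained
      std::function<void()> event = std::move(queue_.front());
      queue_.pop_front();
      lock.unlock();
      event();  // expensive handling runs off the RocksDB thread
      lock.lock();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
```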

@ltamasi (Contributor) commented Feb 6, 2020

@burtonli We discussed this a bit more with @riversand963 and there are a couple more related issues worth mentioning:

  1. The class comment for EventListener states that callbacks are issued by the background thread that actually performed the work, which would not be true with a queue + separate notification thread type of solution. (Side note: this is probably not a big deal and actually, a recent fix already broke this promise; just calling it out.)

  2. Currently, both NotifyOnFlushCompleted and NotifyOnCompactionCompleted use the ColumnFamilyData and the current Version internally. The problem is that queuing/delaying the notification can then affect the semantics of the notification itself (i.e., the current Version might change in the meantime). We could solve this by storing all the necessary information in the function object itself when it is created.
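A hypothetical illustration of that last idea: snapshot whatever the notification needs when the function object is created, so delivering it later cannot observe, say, a newer Version (the parameter names here are invented stand-ins for data otherwise read from ColumnFamilyData / the current Version at delivery time):

```cpp
#include <functional>
#include <string>
#include <utility>

// Illustrative only: capture copies at creation time so the notification
// is self-contained and immune to later state changes.
std::function<void()> MakeFlushNotification(std::string cf_name,
                                            std::string file_path) {
  // The lambda owns copies taken *now*; delivering it later cannot pick up
  // state that changed after the flush finished.
  return [cf_name = std::move(cf_name), file_path = std::move(file_path)] {
    // listener->OnFlushCompleted(...) would be invoked here using only
    // the captured snapshot.
  };
}
```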

Again, just calling these out for the record.

As for the impact on BlobDB, we do have plans to integrate it deeper into the RocksDB core (and migrate off the EventListener interface), so this issue will become moot from that perspective. We could also work around it in various ways in the meantime. However, I still think it would be nice to fix it in one place on the RocksDB side instead of putting the burden of dealing with out of order notifications on the clients.

@burtonli burtonli closed this Feb 26, 2020
@yiwu-arbug (Contributor):

Is there any future plan for this PR, or is there a separate fix to the issue? Thanks. @ltamasi @riversand963
