
Notifying OnFlushCompleted and OnCompactionCompleted in sequence #6342

Conversation

@burtonli (Contributor) commented Jan 29, 2020

Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the OnFlushCompleted and OnCompactionCompleted callbacks of
the Listener interface: upon receiving a flush notification, a link is added
between the newly flushed SST and the corresponding blob file; for
compactions, links are removed for the inputs and added for the outputs.
In certain cases, it is possible for the compaction notification that results
in a link being removed to precede the flush notification that establishes
the link (see #6338).

This change is a general fix that ensures the OnFlushCompleted and OnCompactionCompleted notifications are delivered in sequence, which is a more natural model for other external consumers that build logic similar to BlobDB's.
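For illustration, here is a simplified sketch (not BlobDB's actual implementation) of the kind of listener bookkeeping described above; the class name and the blob-file lookup are placeholders. It shows why delivery order matters: if the compaction notification that erases a link for an input SST runs before the flush notification that would add that link, the erase is a no-op and a stale link is added afterwards.

#include <map>
#include <mutex>
#include <string>

#include "rocksdb/listener.h"

class LinkTrackingListener : public rocksdb::EventListener {
 public:
  void OnFlushCompleted(rocksdb::DB* /*db*/,
                        const rocksdb::FlushJobInfo& info) override {
    std::lock_guard<std::mutex> lock(mu_);
    // Link the newly flushed SST to its blob file (lookup elided; the value
    // here is a hypothetical placeholder).
    links_[info.file_path] = /*blob_file_number=*/0;
  }

  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::lock_guard<std::mutex> lock(mu_);
    // Remove links for the compaction inputs; add links for the outputs.
    for (const auto& input : info.input_files) links_.erase(input);
    for (const auto& output : info.output_files) links_[output] = 0;
  }

 private:
  std::mutex mu_;
  std::map<std::string, uint64_t> links_;  // SST path -> blob file number
};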

@burtonli requested a review from ltamasi on January 29, 2020 19:18
@siying requested review from riversand963 and removed the request for siying on January 29, 2020 20:32
…duleWork to handle flush and compaction completion notifications.

2. Use a cv and an id queue to handle completion notification ordering.
@ltamasi (Contributor) left a comment

Thanks so much for working on this, @burtonli! I have some comments/questions; please see below.

// Next notification id for completion listeners.
uint64_t next_notification_id_;
// Completion listeners queue.
std::deque<uint64_t> notification_queue_;
@ltamasi (Contributor):

Could you please clarify why this queueing mechanism is necessary?

@burtonli (Contributor, Author):

We need an extra lock to keep the notifications in sequence, since holding mutex_ is not acceptable while notifying listeners.
The original idea behind maintaining a queue was to allow some parallelism in preparing the notification objects, e.g.:

  1. nlock.lock()
  2. Add the notification id to the queue.
  3. nlock.unlock()
  4. Prepare the notification object. <-- parallel tasks
  5. Wait on the condition variable until the current id is at the front of the queue.
  6. Send the notifications.

But it looks like a little bit of overkill, considering the extra code complexity. I have simplified the logic.
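For reference, a minimal sketch of the ordering scheme in the steps above, assuming a hypothetical NotificationSequencer class (the merged change simplified this mechanism away):

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

// Assigns each completion a ticket and releases them strictly in FIFO order,
// so the expensive "prepare" step can still run in parallel between Enter()
// and WaitForTurn().
class NotificationSequencer {
 public:
  uint64_t Enter() {  // steps 1-3: lock, enqueue id, unlock
    std::lock_guard<std::mutex> lock(mu_);
    const uint64_t id = next_id_++;
    queue_.push_back(id);
    return id;
  }
  void WaitForTurn(uint64_t id) {  // step 5: block until id is at the front
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return !queue_.empty() && queue_.front() == id; });
  }
  void Leave() {  // after step 6: pop our ticket and wake the next waiter
    std::lock_guard<std::mutex> lock(mu_);
    queue_.pop_front();
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  uint64_t next_id_ = 0;
  std::deque<uint64_t> queue_;
};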

@ltamasi (Contributor):

Yeah, I totally get the need for the mutex; I was just wondering about the queue (since it was apparently FIFO). And I do agree about the complexity (it's better to keep it simple).

@ltamasi (Contributor):

@burtonli Actually, question: isn't causality ensured simply by issuing the notification before scheduling further work?

@ltamasi (Contributor):

Oh, or is it that another background task might also finish around the same time and schedule a compaction?

@burtonli (Contributor, Author):

Correct. In the original logic there is no lock protecting the flush or compaction completion notification: even if we make sure the current thread doesn't trigger any new compaction before the notification, other background tasks may jump ahead. It's easy to get a repro by setting aggressive LSM compaction options, e.g. a 10KB memtable size and an L0 compaction trigger of 4.
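For example, an options configuration along these lines (values from the comment above) makes flushes and compactions frequent enough that the race shows up quickly:

#include "rocksdb/options.h"

rocksdb::Options MakeAggressiveCompactionOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Tiny memtables force frequent flushes...
  options.write_buffer_size = 10 * 1024;  // 10KB
  // ...and a low L0 trigger schedules a compaction almost immediately after,
  // letting other background tasks jump ahead of the flush notification.
  options.level0_file_num_compaction_trigger = 4;
  return options;
}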

@ltamasi (Contributor):

Makes sense, thanks for clarifying!

@burtonli changed the title from “Always notifying on flush completed ahead of triggering compaction” to “Notifying OnFlushCompleted and OnCompactionCompleted in sequence” on Jan 30, 2020
@ltamasi (Contributor) left a comment

Thanks @burtonli! I just have a couple of minor comments.

HISTORY.md
@@ -4,6 +4,7 @@
 * Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
 * Fix a bug that prevents opening a DB after two consecutive crash with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
 * Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
+* Fix BlobDB crash #6338 for maintaining mapping between SSTs and blob files when enabling garbage collection, by keeping flush and compaction completion notifications in sequence.
@ltamasi (Contributor):

Can we flip this around to clarify the fix/impact? Like "Fixed an issue where listeners could receive out of order OnFlushCompleted/OnCompactionCompleted notifications. This could cause a crash in BlobDB when garbage collection is enabled (see #6338)."

   InstallSuperVersionAndScheduleWork(c->column_family_data(),
                                      &job_context->superversion_contexts[0],
-                                     *c->mutable_cf_options());
+                                     *c->mutable_cf_options(), callback);
@ltamasi (Contributor):

Just a side note that has nothing to do with the PR per se, but we might have a small bug here: InstallSuperVersionAndScheduleWork is called here (and on the trivial move branch) regardless of status.

@burtonli (Contributor, Author):

I have noticed that as well. There is also a behavioral difference between NotifyOnCompactionCompleted and NotifyOnFlushCompleted: NotifyOnFlushCompleted only fires when status.ok(), but NotifyOnCompactionCompleted fires regardless of status. I kept the existing logic as it is; we can have a separate PR to address these.

@ltamasi (Contributor):

Yeah, this is definitely out of scope here. Also, IIRC FlushJobInfo does not even have a status field.

@ltamasi (Contributor) left a comment

Thanks so much for the fix, @burtonli!

@facebook-github-bot left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963 (Contributor):

Kind of similar to #6069; will take a look.

@facebook-github-bot:

@burtonli has updated the pull request.

@facebook-github-bot left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:

@burtonli has updated the pull request.

@riversand963 (Contributor):

Thanks @burtonli for reporting and investigating this issue. It's very helpful.
@ltamasi and I discussed this offline, and we agree we must solve this issue. There is a concern about adding a new mutex here.
First, in https://github.com/facebook/rocksdb/blob/master/include/rocksdb/listener.h#L309:L310, we say "without holding any DB mutex". Different people may have different understandings of the phrase "any DB mutex". Does it refer to all mutexes of the DB class? If so, adding another mutex and locking it while calling listener methods breaks this promise.
Second, do we want to order listener callbacks among different column families in the same DB? Probably not; thus a db-wide mutex seems like overkill for those cases.
With the above said, I suggest we take time to think about a long-term solution to ensure in-order execution of listener methods when necessary. In the meantime, however, I would like to know whether you can cherry-pick/back-port this change to your codebase so that you won't be blocked. IIRC, your RocksDB codebase lives in a private, internal repo and is not up to date with RocksDB master.
Please let me know if you have any questions and/or concerns.

@burtonli (Contributor, Author) commented Feb 5, 2020

@riversand963 @ltamasi Thanks for the feedback. I think the design of notifying event listeners on the same thread as the flush/compaction requires the callbacks to be lightweight: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/listener.h#L319:L321
Otherwise that thread will be blocked, though the DB as a whole is fine since the big mutex_ is not held during the callbacks. Based on this consideration, the fix doesn't break the assumption: it still won't block the entire DB. On the other hand, I do agree it may hurt performance if customers put expensive operations into the callbacks, and it would also be better not to lock across CFs.
Actually, I had the same worry before adding the extra lock here, so I considered two other alternatives before implementing the fix:

  1. Build an internal notification queue, put notifications into the queue in sequence, and have another background thread send them (see the sketch after this comment).
    The problems with this approach are:
  • Not all notifications need a queue to be ordered. E.g., for any specific SST, OnFlushBegin is guaranteed to be called ahead of OnFlushCompleted, as they are called within the same thread.
  • Customers can easily do the same thing to keep the callbacks cheap, by having a queue on their side and only inserting the notification into the queue in the callback.
  • The extra code complexity is not trivial.
  2. Have a special callback for BlobDB instead of reusing the general listeners. This might work, but it still depends on whether we want BlobDB to be more like a stackable DB or more like a native built-in DB type.

I'm not in a rush for the fix; let's do it properly!
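A minimal sketch (not part of the PR) of the client-side pattern from point 1 above: the callback only enqueues, and a separate consumer thread does the expensive work. The class name and plumbing are hypothetical.

#include <condition_variable>
#include <deque>
#include <mutex>

#include "rocksdb/listener.h"

class QueueingListener : public rocksdb::EventListener {
 public:
  void OnFlushCompleted(rocksdb::DB* /*db*/,
                        const rocksdb::FlushJobInfo& info) override {
    std::lock_guard<std::mutex> lock(mu_);
    events_.push_back(info);  // copy the info; keep the callback lightweight
    cv_.notify_one();
  }

  // A consumer thread calls this in a loop and performs the heavy lifting.
  rocksdb::FlushJobInfo Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return !events_.empty(); });
    rocksdb::FlushJobInfo info = events_.front();
    events_.pop_front();
    return info;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<rocksdb::FlushJobInfo> events_;
};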

@ltamasi (Contributor) commented Feb 6, 2020

@burtonli We discussed this a bit more with @riversand963 and there are a couple more related issues worth mentioning:

  1. The class comment for EventListener states that callbacks are issued by the background thread that actually performed the work, which would not be true with a queue + separate notification thread type of solution. (Side note: this is probably not a big deal and actually, a recent fix already broke this promise; just calling it out.)

  2. Currently, both NotifyOnFlushCompleted and NotifyOnCompactionCompleted use the ColumnFamilyData and the current Version internally. The problem is that queuing/delaying the notification can then affect the semantics of the notification itself (i.e. the current Version might change in the meantime). We could solve this by storing all necessary information in the function object itself when it is created (see the sketch below).
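A hedged sketch of the idea in point 2: snapshot everything the notification needs at the moment the work finishes, so delaying delivery cannot change what gets reported. FlushNotice and MakeFlushNotification are hypothetical names.

#include <functional>
#include <string>

struct FlushNotice {  // hypothetical snapshot of the state the listener needs
  std::string cf_name;
  std::string file_path;
};

std::function<void()> MakeFlushNotification(FlushNotice snapshot /* by value */) {
  // The closure owns its own copy, so the current Version (or any other DB
  // state) can change later without affecting what the listener sees.
  return [snapshot = std::move(snapshot)] {
    // listener->OnFlushCompleted(...) would consume `snapshot` here.
  };
}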

Again, just calling these out for the record.

As for the impact on BlobDB, we do have plans to integrate it deeper into the RocksDB core (and migrate off the EventListener interface), so this issue will become moot from that perspective. We could also work around it in various ways in the meantime. However, I still think it would be nice to fix it in one place on the RocksDB side instead of putting the burden of dealing with out of order notifications on the clients.

@burtonli closed this on Feb 26, 2020
@yiwu-arbug (Contributor):

Is there any future plan for this PR, or is there a separate fix for the issue? Thanks. @ltamasi @riversand963
