Notifying OnFlushCompleted and OnCompactionCompleted in sequence #6342
Conversation
…duleWork to handle flush and compaction completion notifications. 2. Use a condition variable and an id queue to handle completion notification ordering.
Thanks so much for working on this @burtonli ! I have some comments/questions, please see below.
db/db_impl/db_impl.h
Outdated
// Next notification id for completion listeners.
uint64_t next_notification_id_;
// Completion listeners queue.
std::deque<uint64_t> notification_queue_;
Could you please clarify why this queueing mechanism is necessary?
We need an extra lock to keep the notifications in sequence, since holding mutex_ while notifying listeners is not acceptable.
The original idea behind maintaining a queue was to allow some parallelism when preparing the notification objects, e.g.:
- nlock.lock()
- Add the notification id to the queue.
- nlock.unlock()
- Prepare the notification object. <-- parallel tasks
- Wait on the condition variable until the current id is at the front of the queue.
- Send the notifications.
But it looks like that is a bit of an overkill, considering the extra code complexity, so I have simplified the logic.
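For reference, the queue-plus-condition-variable scheme described in the comment could be sketched roughly like this. This is an illustrative standalone sketch, not RocksDB code; `NotificationSequencer`, `Enter`, `WaitForTurn`, and `Leave` are hypothetical names. The idea is that each background job takes a ticket under a small lock, prepares its notification object in parallel, then waits until its ticket reaches the front of the queue before notifying listeners.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch of the ticket-ordering mechanism described above.
class NotificationSequencer {
 public:
  // Steps 1-3: take a ticket (notification id) under the lock.
  uint64_t Enter() {
    std::lock_guard<std::mutex> lock(mu_);
    uint64_t id = next_id_++;
    queue_.push_back(id);
    return id;
  }

  // Step 5: block until our ticket is at the front of the queue.
  // (Step 4, preparing the notification object, happens outside the lock.)
  void WaitForTurn(uint64_t id) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return !queue_.empty() && queue_.front() == id; });
  }

  // After notifying: pop our ticket and wake the next waiter.
  void Leave(uint64_t id) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      assert(!queue_.empty() && queue_.front() == id);  // WaitForTurn returned
      queue_.pop_front();
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  uint64_t next_id_ = 0;
  std::deque<uint64_t> queue_;
};
```

Even if the jobs finish their preparation work in an arbitrary order, the notifications are delivered strictly in ticket order, which matches the ordering guarantee the PR is after.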
Yeah, I totally get the need for the mutex, was just wondering about the queue (since it was FIFO apparently). And I do agree about the complexity (it's better to keep it simple).
@burtonli Actually, question: isn't causality ensured simply by issuing the notification before scheduling further work?
Oh, or is it that another background task might also finish around the same time and schedule a compaction?
Correct. In the original logic, there is no lock protecting the flush or compaction completion notifications: even if we make sure the current thread doesn't trigger any new compaction before the notification, other background tasks may jump ahead. It's easy to reproduce by setting aggressive LSM compaction options, e.g. a 10KB memtable size and an L0 compaction trigger of 4.
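The aggressive settings mentioned in the comment might look like the following fragment (a hypothetical repro configuration, not from the PR itself): tiny memtables cause frequent flushes, and a low L0 trigger schedules compactions almost immediately, making notification reordering likely.

```cpp
#include <rocksdb/options.h>

// Hypothetical repro options per the comment above.
rocksdb::Options AggressiveCompactionOptions() {
  rocksdb::Options options;
  options.write_buffer_size = 10 * 1024;  // 10KB memtables -> frequent flushes
  options.level0_file_num_compaction_trigger = 4;  // compact once 4 L0 files exist
  return options;
}
```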
Makes sense, thanks for clarifying!
Thanks @burtonli! I just have a couple of minor comments.
HISTORY.md
Outdated
@@ -4,6 +4,7 @@
 * Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
 * Fix a bug that prevents opening a DB after two consecutive crash with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
 * Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
+* Fix BlobDB crash #6338 for maintaining mapping between SSTs and blob files when enabling garbage collection, by keeping flush and compaction completion notifications in sequence.
Can we flip this around to clarify the fix/impact? Like "Fixed an issue where listeners could receive out of order OnFlushCompleted/OnCompactionCompleted notifications. This could cause a crash in BlobDB when garbage collection is enabled (see #6338)."
  InstallSuperVersionAndScheduleWork(c->column_family_data(),
                                     &job_context->superversion_contexts[0],
-                                    *c->mutable_cf_options());
+                                    *c->mutable_cf_options(), callback);
Just a side note that has nothing to do with the PR per se, but we might have a small bug here: InstallSuperVersionAndScheduleWork is called here (and on the trivial move branch) regardless of status.
I have noticed that as well. Also, there is a difference in behavior between NotifyOnCompactionCompleted and NotifyOnFlushCompleted: NotifyOnFlushCompleted only fires when status.ok(), but NotifyOnCompactionCompleted fires regardless of status. I kept the existing logic as-is; we can have a separate PR to address them.
Yeah, this is definitely out of scope here. Also, IIRC FlushJobInfo does not even have a status field.
Thanks so much for the fix @burtonli !
@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Kind of similar to #6069, will take a look.
@burtonli has updated the pull request. Re-import the pull request
@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@burtonli has updated the pull request. Re-import the pull request
Thanks @burtonli for reporting and investigating this issue. It's very helpful.
@riversand963 @ltamasi Thanks for the feedback. I think the design of notifying event listeners within the same thread as the flush/compaction requires the callbacks to be lightweight: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/listener.h#L319:L321
I'm not in a rush for the fix; let's make it proper!
@burtonli We discussed this a bit more with @riversand963 and there are a couple more related issues worth mentioning:
Again, just calling these out for the record. As for the impact on BlobDB, we do have plans to integrate it deeper into the RocksDB core (and migrate off the
Is there any future plan for this PR, or is there a separate fix for the issue? Thanks. @ltamasi @riversand963
Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the OnFlushCompleted and OnCompactionCompleted callbacks of
the Listener interface: upon receiving a flush notification, a link is added
between the newly flushed SST and the corresponding blob file; for
compactions, links are removed for the inputs and added for the outputs.
In certain cases, it is possible for the compaction notification that results
in a link being removed to precede the flush notification that establishes
the link (see #6338 ).
This change is a general fix that ensures OnFlushCompleted and OnCompactionCompleted notifications are delivered in sequence, which also makes it more natural for other external consumers to build logic similar to BlobDB's.
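For illustration, the bookkeeping the summary describes could be sketched as follows. This is a minimal hypothetical model, not BlobDB's actual implementation (`BlobFileLinks` and its methods are invented names): a flush notification links the new SST to its blob file, and a compaction notification unlinks the inputs and links the outputs. When the compaction notification arrives first, the unlink finds nothing to remove and a stale link is left behind once the flush notification lands.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of listener-driven SST <-> blob file bookkeeping.
class BlobFileLinks {
 public:
  // Flush notification: link the newly flushed SST to its blob file.
  void OnFlushCompleted(const std::string& sst, const std::string& blob_file) {
    links_[blob_file].insert(sst);
  }

  // Compaction notification: unlink the inputs, link the outputs.
  void OnCompactionCompleted(const std::vector<std::string>& inputs,
                             const std::vector<std::string>& outputs,
                             const std::string& blob_file) {
    std::set<std::string>& ssts = links_[blob_file];
    for (const auto& sst : inputs) ssts.erase(sst);    // no-op if never linked!
    for (const auto& sst : outputs) ssts.insert(sst);
  }

  size_t LinkCount(const std::string& blob_file) const {
    auto it = links_.find(blob_file);
    return it == links_.end() ? 0 : it->second.size();
  }

 private:
  // blob file -> SSTs that still reference it
  std::map<std::string, std::set<std::string>> links_;
};
```

With in-order delivery, a flush followed by a compaction that consumes the flushed SST leaves exactly one link (the compaction output). With the notifications reversed, the erase of the input SST is a no-op, and the late flush notification then re-adds a link to an SST that no longer exists, which is the kind of inconsistency behind #6338.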