Add option for manual flush/compaction to ignore unexpired UDT #12585
It was pleasantly surprising to see that FlushOptions only had two members up until now. It would be great if the simplicity can last a little bit longer. It might be worth checking if an alternative plan can work, such as:
1. We add DB and CF identification info to the EventListener::OnMemTableSealed() callback.
2. MyRocks changes logic surrounding Flush()/CompactRange():
   2a) Call IncreaseFullHistoryTsLow(/* some ts representing now */)
   2b) Register the DB/CF with an event listener that implements OnMemTableSealed(), which also calls IncreaseFullHistoryTsLow(/* some ts representing now */)
   2c) Flush()/CompactRange()
   2d) Deregister the DB/CF from the event listener
I am not really sure 2a) is needed; I just included it there in case OnMemTableSealed() (2b) is skipped when the memtable is empty.
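To make steps 2a) to 2d) concrete, here is a minimal single-threaded sketch. It is a toy model, not RocksDB code: `ToyDb`, `on_memtable_sealed`, `ManualFlush`, and `FlushIgnoringRetention` are hypothetical names, timestamps are assumed to be plain `uint64_t` values, and the model assumes the seal callback is skipped for an empty memtable (the case the comment is unsure about).

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical toy model; none of these types are RocksDB classes.
struct ToyDb {
  uint64_t cutoff = 0;             // stands in for full_history_ts_low
  std::vector<uint64_t> memtable;  // timestamps of buffered writes
  std::function<void(ToyDb&)> on_memtable_sealed;  // stands in for the listener

  void Write(uint64_t ts) { memtable.push_back(ts); }

  // The cutoff only ever moves forward.
  void IncreaseCutoff(uint64_t ts) {
    if (ts > cutoff) cutoff = ts;
  }

  // Seals the memtable, notifies the listener (assumed to be skipped when the
  // memtable is empty), then flushes only if every sealed timestamp is
  // expired, i.e. strictly below the cutoff.
  bool ManualFlush() {
    std::vector<uint64_t> sealed;
    sealed.swap(memtable);
    if (!sealed.empty() && on_memtable_sealed) on_memtable_sealed(*this);
    for (uint64_t ts : sealed) {
      if (ts >= cutoff) {
        memtable = sealed;  // unexpired data: put it back, do not flush
        return false;
      }
    }
    return true;
  }
};

// Steps 2a) to 2d); `now` is a caller-supplied clock (an assumption).
bool FlushIgnoringRetention(ToyDb& db, const std::function<uint64_t()>& now) {
  db.IncreaseCutoff(now());                                              // 2a
  db.on_memtable_sealed = [now](ToyDb& d) { d.IncreaseCutoff(now()); };  // 2b
  bool flushed = db.ManualFlush();                                       // 2c
  db.on_memtable_sealed = nullptr;                                       // 2d
  return flushed;
}
```

In this model the listener re-reads the clock at seal time, which is what lets the cutoff cover writes that arrived after step 2a) without the caller blocking writes.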
// Set user-defined timestamp low bound, the data with older timestamp than
// low bound maybe GCed by compaction. Default: nullptr
const Slice* full_history_ts_low = nullptr;
Does this support setting a higher value than GetFullHistoryTsLow() without ever increasing it on the CF? If so I was thinking perhaps they could set it here to the max timestamp so just the CompactRange() can drop history.
Yes, this actually only supports setting a higher value than GetFullHistoryTsLow(). It updates the column family's cutoff timestamp, which cannot go back down. So once it's set to the max timestamp, it is effectively set to no retention going forward, for auto flush too.
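The one-way behavior described here can be illustrated with a small model. This is a sketch only: `CutoffModel` is a hypothetical class, not RocksDB's implementation, and it assumes timestamps are `uint64_t` and that data counts as expired when its timestamp is strictly below the cutoff.

```cpp
#include <cstdint>

// Toy stand-in for a column family's cutoff timestamp (not RocksDB code).
class CutoffModel {
 public:
  // The cutoff can only move forward. Returns false and leaves the cutoff
  // unchanged when `ts` is lower than the current value.
  bool Increase(uint64_t ts) {
    if (ts < cutoff_) return false;
    cutoff_ = ts;
    return true;
  }

  // Assumption: a timestamp is expired when it is strictly below the cutoff.
  bool IsExpired(uint64_t ts) const { return ts < cutoff_; }

  uint64_t cutoff() const { return cutoff_; }

 private:
  uint64_t cutoff_ = 0;
};
```

Once `Increase` has been called with the maximum timestamp, every later write below that value is already expired in this model, which is the "no retention going forward" effect mentioned above.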
Looks like it is needed. Probably should also swap 2b) and 2a) for safety from an unlikely edge case.
Another one that might be worth considering: a mode option for specifying the user's intent with respect to history retention. For example, minimal retention could do nothing (i.e., never reschedule), so flush could happen whenever. Loose retention would make a modest effort to preserve history, like the current rescheduling behavior. Maybe a future absolute retention would go even further, like stalling writes to ensure enough history is preserved. If the user wanted loose retention by default and minimal retention only during Flush() and CompactRange(), they could dynamically set it accordingly.
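One way to picture the proposed mode option is as a small decision function. The sketch below is purely illustrative: `UdtRetentionMode`, `FlushAction`, and `DecideFlushAction` are hypothetical names invented here, and no such enum is claimed to exist in RocksDB.

```cpp
// Hypothetical retention modes, following the three levels in the comment.
enum class UdtRetentionMode { kMinimal, kLoose, kAbsolute };

// Hypothetical outcomes for a flush request.
enum class FlushAction { kProceed, kReschedule, kStallWrites };

// Decides what a flush request does when the memtable may still hold
// unexpired user-defined timestamps.
FlushAction DecideFlushAction(UdtRetentionMode mode, bool has_unexpired_udt) {
  if (!has_unexpired_udt) return FlushAction::kProceed;
  switch (mode) {
    case UdtRetentionMode::kMinimal:
      return FlushAction::kProceed;      // never reschedule
    case UdtRetentionMode::kLoose:
      return FlushAction::kReschedule;   // best effort, like the current behavior
    case UdtRetentionMode::kAbsolute:
      return FlushAction::kStallWrites;  // possible future stricter mode
  }
  return FlushAction::kProceed;
}
```

Under this framing, a user who wants loose retention by default would switch the mode to minimal only around Flush() and CompactRange(), then switch it back.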
Thanks for this detailed alternative proposal. My understanding is that this is taking advantage of the fact that RocksDB itself needs to block writes once a manual flush starts before it seals a memtable, so MyRocks wouldn't need to do this blocking on their own. This is a good idea, I think another part that is essential for this to work is that MyRocks needs to be able to get a large enough One question I have though, is that db mutex is locked during manual flush at this time. The |
This is a great idea! It seems to me to be the most future proof/non-hacky/good API design solution. I will also sync with the MyRocks team about this idea. If they don't have any concerns, let's move forward with this alternative then. Thanks a lot for the suggestion!
We currently release the DB mutex while calling any callback, I believe. For rocksdb/db/db_impl/db_impl_write.cc Lines 2142 to 2146 in 6cc7ad1
That makes it difficult for users to synchronize with RocksDB, for example if they want to track the live file set based on callbacks. But it makes it possible to call DB functions.
Unless |
Add options in FlushOptions and CompactRangeOptions to ignore unexpired user-defined timestamps for user-initiated flush.

For the user-defined timestamps in memtable only feature (i.e., when Options.persist_user_defined_timestamps is false), flush has the side effect that UDTs are also removed. A FlushRequest is rescheduled, when applicable, in a best-effort attempt to retain user-defined timestamps that have not expired. Expiration is determined with respect to the cutoff timestamp full_history_ts_low, which users can set via the IncreaseFullHistoryTsLow API.

The current behavior of manual flush and manual compaction is that they won't return until the memtables contain only expired UDTs and the flush proceeds to finish. This was intended to make the user more conscious of the aforementioned side effect of flush. Users can explicitly increase the cutoff timestamp before calling manual flush/compaction to indicate they are aware of this. However, these steps are inconvenient because users also need to block their writes before calling IncreaseFullHistoryTsLow and unblock them after manual flush/compaction finishes. This prevents another write with a higher UDT from being added after the cutoff timestamp is increased, which would make the memtable contain unexpired UDTs again.

To avoid this inconvenience, we added strict_udt_retention options in FlushOptions and CompactRangeOptions so that users can achieve a similar effect without needing to block writes on their side.

Test Plan:
Existing tests
./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredUserAsksToIgnore
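The race that motivates this change can be shown with a toy model. This is a sketch under stated assumptions, not the PR's implementation: `ToyMemtable` and the `ignore_unexpired` parameter are hypothetical, timestamps are assumed to be `uint64_t`, and the parameter only stands in for the general idea of a per-call opt-out rather than the exact semantics of the new options.

```cpp
#include <cstdint>
#include <vector>

// Toy model of one memtable and its column family cutoff (not RocksDB code).
struct ToyMemtable {
  uint64_t cutoff = 0;               // stands in for full_history_ts_low
  std::vector<uint64_t> timestamps;  // UDTs of buffered writes

  void Write(uint64_t ts) { timestamps.push_back(ts); }

  // True if any buffered timestamp is at or above the cutoff.
  bool HasUnexpired() const {
    for (uint64_t ts : timestamps) {
      if (ts >= cutoff) return true;
    }
    return false;
  }

  // `ignore_unexpired` is a hypothetical flag: when set, the flush proceeds
  // even though unexpired timestamps would be dropped.
  bool TryFlush(bool ignore_unexpired) {
    if (HasUnexpired() && !ignore_unexpired) return false;
    timestamps.clear();
    return true;
  }
};
```

In this model, a write with timestamp 12 that lands after the cutoff is raised to 10 makes `TryFlush(false)` fail again, which is why callers previously had to block writes; `TryFlush(true)` completes regardless.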