Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for manual flush/compaction to ignore unexpired UDT #12585

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jowlyzhang
Copy link
Contributor

@jowlyzhang jowlyzhang commented Apr 25, 2024

Add options in FlushOptions and CompactRangeOptions to ignore unexpired user-defined timestamps for user initiated flush.

For user-defined timestamps in Memtable only feature (a.k.a when Options.persist_user_defined_timestamps is false), flush has the side effect that UDTs are also removed. A FlushRequest is recheduled when it applies to try to retain user-defined timestamps that are not expired in a best effort. The expiration is determined w.r.t the cutoff timestamp full_history_ts_low that users can set via the IncreaseFullHistoryTsLow API.

The current behavior of manual flush and manual compaction is that it won't return until the memtables only contain expired UDT and flush proceeds to finish. This was intended to make the user more conscience of aforementioned side effect of flush. And users can explicitly increase the cutoff timestamp before calling manual flush /compaction to indicate they are aware of this. However, these steps are inconvenient since user also need to block their writes before calling IncreaseFullHisotoryTsLow and unblock after manual flush / compaction finishes. This is to avoid another write with higher UDT gets added after increasing cutoff timestamp, making the memtable containing unexpired UDT again.

In order to avoid this inconvenience, we added strict_udt_retention options in FlushOptions and CompactRangeOptions for users to achieve similar effect without the need to block writes on their side.

Test Plan:
Existing tests
./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest, NotAllKeysExpiredUserAsksToIgnore

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was pleasantly surprising to see that FlushOptions only had two members up until now. It would be great if the simplicity can last a little bit longer. It might be worth checking if an alternative plan can work, such as:

  1. We add DB and CF identification info to EventListener::OnMemTableSealed() callback.

  2. MyRocks changes logic surrounding Flush()/CompactRange():
    2a) Call IncreaseFullHistoryTsLow(/* some ts representing now */)
    2b) Register the DB/CF with an event listener that implements OnMemTableSealed(), which also calls IncreaseFullHistoryTsLow(/* some ts representing now */)
    2c) Flush()/CompactRange()
    2d) Deregister the DB/CF from the event listener

I am not really sure 2a) is needed; I just included it there in case OnMemTableSealed() (2b) is skipped when the memtable is empty.

Comment on lines 2027 to 2029
// Set user-defined timestamp low bound, the data with older timestamp than
// low bound maybe GCed by compaction. Default: nullptr
const Slice* full_history_ts_low = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this support setting a higher value than GetFullHistoryTsLow() without ever increasing it on the CF? If so I was thinking perhaps they could set it here to the max timestamp so just the CompactRange() can drop history.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this actually only supports setting a higher value than GetFullHistoryTsLow(), it will updates the column family's cutoff timestamp which does not allow going back down. So once it's set to max timestamp, it kind of is setting to no retention going forward, for auto flush too.

@ajkr
Copy link
Contributor

ajkr commented May 2, 2024

I am not really sure 2a) is needed; I just included it there in case OnMemTableSealed() (2b) is skipped when the memtable is empty.

Looks like it is needed. Probably should also swap 2b) and 2a) for safety from an unlikely edge case.

@ajkr
Copy link
Contributor

ajkr commented May 2, 2024

It might be worth checking if an alternative plan can work, such as:

Another one that might be worth considering: a mode option for specifying the user's intent with respect to history retention. For example, minimal retention could do nothing (i.e., never reschedule) so flush could happen whenever. Loose retention would be to make modest effort to preserve history, like the current rescheduling behavior. Maybe a future absolute retention would go even further, like stalling writes to ensure enough history is preserved.

If the user wanted loose retention by default and minimal retention only during Flush() and CompactRange(), they could dynamically set it accordingly.

@jowlyzhang
Copy link
Contributor Author

2. 2b) Register the DB/CF with an event listener that implements OnMemTableSealed(), which also calls IncreaseFullHistoryTsLow(/* some ts representing now */)

Thanks for this detailed alternative proposal. My understanding is that this is taking advantage of the fact that RocksDB itself needs to block writes once a manual flush starts before it seals a memtable, so MyRocks wouldn't need to do this blocking on their own. This is a good idea, I think another part that is essential for this to work is that MyRocks needs to be able to get a large enough now time that should be higher than any concurrent writes. So they either need to retrieve the most current hlc in the callback, or RocksDB can return the highest timestamp in the sealed memtable in the callback, which we already keep track of.

One question I have though, is that db mutex is locked during manual flush at this time. The IncreaseFullHistoryTsLow API in the callback also need to lock the db mutex . I think that will cause a deadlock, right? Do you have ideas how this can be avoided?

@jowlyzhang
Copy link
Contributor Author

If the user wanted loose retention by default and minimal retention only during Flush() and CompactRange(), they could dynamically set it accordingly.

This is a great idea! It seems to me to be the most future proof/non-hacky/good API design/.... solution, I will also sync with MyRocks team about this idea. If they don't have any concerns, let's move forward with this alternative then. Thanks a lot for the suggestion!

@ajkr
Copy link
Contributor

ajkr commented May 2, 2024

One question I have though, is that db mutex is locked during manual flush at this time. The IncreaseFullHistoryTsLow API in the callback also need to lock the db mutex . I think that will cause a deadlock, right? Do you have ideas how this can be avoided?

We currently release the DB mutex while calling any callback, I believe. For OnMemTableSealed() that happens here:

mutex_.Unlock();
for (const auto& listener : immutable_db_options_.listeners) {
listener->OnMemTableSealed(mem_table_info);
}
mutex_.Lock();

That makes it difficult for users to synchronize with RocksDB, like if they want to track the live file set based on callbacks. But it makes it possible to call DB functions.

@ajkr
Copy link
Contributor

ajkr commented May 2, 2024

But it makes it possible to call DB functions.

Unless OnMemTableSealed() happens in the write thread, which it does. Then there's some restrictions. Not sure if increasing history ts would work there or not. Even if it works, I think this callback is not an ideal place to change history retention. There might be a better place to put a callback like after the manual flush scheduling and before the waiting.

@jowlyzhang jowlyzhang marked this pull request as draft May 8, 2024 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants