
Try to start TTL compaction earlier when kMinOverlappingRatio is used #8749

Closed

Conversation


@siying siying commented Sep 2, 2021

Summary:
Right now, when options.ttl is set, compactions are triggered around the time the TTL is reached. This can cause extra compactions, which are often bursty. This commit tries to mitigate that by picking these files up earlier in the normal compaction-picking process. It is only implemented for kMinOverlappingRatio with leveled compaction, as that is the default value and changing the other styles is more complicated.

Once a file is older than ttl/2, RocksDB starts to boost its compaction priority in the normal compaction-picking process, in the hope that by the time the TTL is reached, very few extra compactions are needed.
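
As an illustration only, here is a minimal sketch of such a boost, assuming a hypothetical BoostedScore() helper rather than the PR's actual FileTtlBooster:

  #include <algorithm>
  #include <cstdint>

  // Hypothetical simplification of the boost described above. Ages and ttl
  // are in seconds; base_score is the file's normal picking score under
  // kMinOverlappingRatio.
  uint64_t BoostedScore(uint64_t base_score, uint64_t file_age, uint64_t ttl) {
    if (ttl < 2 || file_age <= ttl / 2) {
      return base_score;  // Young files keep their normal score.
    }
    // Grow the boost linearly from 1x at age ttl/2 toward 16x at age ttl, so
    // old files win the picking race before a forced TTL compaction is needed.
    const uint64_t kMaxBoost = 16;
    uint64_t boost = 1 + (kMaxBoost - 1) * (file_age - ttl / 2) / (ttl / 2);
    return base_score * std::min(boost, kMaxBoost);
  }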

In order for this to work, another change is made: during a compaction, if an output-level file is older than ttl/2, cut the output files at the original file boundaries (if the output is not on the last level). This makes sure that after an old file is moved to the next level and new data is merged down from the upper level, the new data falling into this range isn't tagged with the old timestamp. Without this change, in many cases most files on a level keep an old timestamp even though they hold newer data, and we stay stuck in that state.

Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@jay-zhuang jay-zhuang left a comment

The change looks good to me; just a few questions, and it would be great to have some benchmark results.

db/compaction/file_pri.h
// starts to boost, the previous level's boosting amount is 16.
class FileTtlBooster {
public:
FileTtlBooster(uint64_t current_time, uint64_t ttl, int num_non_empty_levels,
@jay-zhuang jay-zhuang Sep 8, 2021

What's the benefit of using num_non_empty_levels instead of just level_num? If there are multiple empty levels, the max - 1 level should still need boosting, right?

Contributor Author

Levels beyond num_non_empty_levels will be empty. I assume we would boost files so that they reach the last non-empty level to remove tombstones. They will likely never reach Lmax anyway.

FileTtlBooster(uint64_t current_time, uint64_t ttl, int num_non_empty_levels,
int level)
: current_time_(current_time) {
if (ttl == 0 || level == 0 || level >= num_non_empty_levels) {
Contributor

Should the bottommost level need boosting (when level == num_non_empty_levels - 1)?

Contributor Author

Good point. I guess it doesn't. I should probably remove that level and adjust the other levels' boosting accordingly.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

2 similar comments

@jay-zhuang jay-zhuang left a comment

LGTM.

Looking forward to the benchmark result.

HISTORY.md
@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@siying

siying commented Sep 9, 2021

> LGTM.
>
> Looking forward to the benchmark result.

I'm working on it. Constructing a scenario where it shows a significant gain is extremely hard. I'm putting together a scenario where at least these TTL compactions are scheduled earlier, and making sure there is at least no regression.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@siying

siying commented Sep 9, 2021

> LGTM.
>
> Looking forward to the benchmark result.

I got an interesting observation so far. Without this PR, if writes are fast enough to keep compaction always running, it appears that TTL compaction would never execute and the TTL will be violated. Might be something we should fix... A side effect is that it makes constructing a benchmark for this PR even harder.

@jay-zhuang

> I got an interesting observation so far. Without this PR, if writes are fast enough to keep compaction always running, it appears that TTL compaction would never execute and the TTL will be violated. Might be something we should fix... A side effect is that it makes constructing a benchmark for this PR even harder.

I see. Is TTL a strong guarantee? It could also be violated by having a slow TTL compaction.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

5 similar comments

@jay-zhuang

Thanks for the improvement, but I'm not sure how effective it will be. If the data is randomly distributed, it seems splitting the output files won't improve things much: either we get lots of small files, or normal-size files with the older timestamp. I think it will likely be the latter, as the size of output-level files is limited (to be > target_size/2, which is good). I guess it's going to help for some specific use cases where the data is mostly ordered by timestamp. Should we benchmark it first?

}
} else {
// Look for the key position
while (next_files_to_cut_for_ttl <
Contributor

Is this performance-sensitive code? Would it impact the performance of compaction?

Contributor Author

Yes. I hope that in most cases files_to_cut_for_ttl is empty, so this path is skipped entirely. In the non-empty case, mostly one extra key comparison is made per key compacted, which I hope is acceptable. I can run a benchmark and measure the performance penalty of this path.
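
For illustration, a minimal sketch of this fast path under assumed types and names (plain byte-string comparison stands in for the internal key comparator):

  #include <cstddef>
  #include <string>
  #include <vector>

  // Each entry mirrors an old input file whose boundaries should cut output.
  struct FileRange {
    std::string smallest;  // encoded internal key
    std::string largest;   // encoded internal key
  };

  // Returns true if internal_key falls inside the current old-file range.
  // When files_to_cut_for_ttl is empty (the common case), the cost is a
  // single empty() test; otherwise typically one comparison per compacted
  // key, since next_file only moves forward as keys arrive in sorted order.
  bool KeyInOldFileRange(const std::vector<FileRange>& files_to_cut_for_ttl,
                         size_t& next_file, const std::string& internal_key) {
    if (files_to_cut_for_ttl.empty()) {
      return false;
    }
    while (next_file < files_to_cut_for_ttl.size() &&
           internal_key > files_to_cut_for_ttl[next_file].largest) {
      ++next_file;  // Skip ranges that end before this key.
    }
    return next_file < files_to_cut_for_ttl.size() &&
           internal_key >= files_to_cut_for_ttl[next_file].smallest;
  }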

Comment on lines +266 to +278
if (icmp->Compare(internal_key,
files_to_cut_for_ttl[next_files_to_cut_for_ttl]
->smallest.Encode()) >= 0) {
if (icmp->Compare(internal_key,
files_to_cut_for_ttl[next_files_to_cut_for_ttl]
->largest.Encode()) <= 0) {
// Within the current file
cur_files_to_cut_for_ttl = next_files_to_cut_for_ttl;
return true;
Contributor

The first key within this range will create a new file unnecessarily.
Here is an example test, which basically compacts the following 4 SSTs:

L1: [0-31] [32-63]
L2: [0-31] [32-63]

The output is 4 SSTs:

L2: [0] [1-31] [32] [33-63]

https://github.com/jay-zhuang/rocksdb/blob/46446fce9aff774ddc60d8828e14e71faae7e754/db/db_compaction_test.cc#L4430

Maybe we need a seen_key similar to grandparent checking?
As it's getting more complicated, could it be a function?

Contributor Author

I don't understand the example. Is L2 the grandparent? If the data in L1 is supposed to be the output and L2 is the grandparent, it appears that we would cut to [0-31][32-63].

Btw, these are internal keys, and we are not supposed to have two entries with exactly the same internal key. In case it happens, cutting a small file is the least of our worries.

@jay-zhuang jay-zhuang Oct 28, 2021

Sorry for the bad example. Yeah, that answers part of my question, but there's still a problem. Here is a simpler example:

L1: [0-31]
L2: [2-31]
L3: [...] [...] ... // grandparents that we can ignore

If we compact 1@L1 + 1@L2, then the output would be:

L2: [0-2] [3-31]

Then both output files end up with the older timestamp.

Key 2 is included in the first file because an internal key with a larger seq_num (from the output data) sorts before one with a smaller seq_num (from files_to_cut_for_ttl):

// decreasing sequence number

It only happens when there's an override of the smallest key in files_to_cut_for_ttl.
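
For reference, a minimal illustration of the ordering rule invoked here (a simplified stand-in, not RocksDB's actual InternalKeyComparator): entries with the same user key sort by decreasing sequence number, so the newer key 2 lands in the earlier output file.

  #include <cstdint>
  #include <string>

  struct ParsedInternalKey {
    std::string user_key;
    uint64_t sequence;  // higher means newer
  };

  // Negative if a sorts before b. For the same user key, newer entries
  // (larger sequence numbers) sort first.
  int CompareInternalKey(const ParsedInternalKey& a,
                         const ParsedInternalKey& b) {
    int r = a.user_key.compare(b.user_key);
    if (r != 0) {
      return r;
    }
    if (a.sequence > b.sequence) {
      return -1;  // decreasing sequence number
    }
    return a.sequence < b.sequence ? 1 : 0;
  }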

Contributor Author

Yes, that is on purpose. We want to separate parts with different old ancestor times. [0 2] and [3 31] will be different, so we separate them. Is that the concern here?

Contributor

I think the output should be:

L2: [0-1] [2-31]

So the first file would have the newer timestamp and only the old one would keep the older timestamp.

@siying

siying commented Oct 27, 2021

> Thanks for the improvement, but I'm not sure how effective it will be. If the data is randomly distributed, it seems splitting the output files won't improve things much: either we get lots of small files, or normal-size files with the older timestamp. I think it will likely be the latter, as the size of output-level files is limited (to be > target_size/2, which is good). I guess it's going to help for some specific use cases where the data is mostly ordered by timestamp. Should we benchmark it first?

This is not meant to solve unbalanced user data, but the unbalanced creation-time distribution caused by compaction.

Let's assume the TTL is 10 days, and "[A B]:-3" indicates a file covering range [A, B] whose oldest ancestor time is 3 days ago. Assume we use the original approach, without the boosting mechanism in this PR, and have this tree structure:

[1 40]: -1
[1 20]:-2 [21 40]:-3
[1 10]:-9 [11 15]:-10 [16 22]:-8 [24 30]: -7 ... 
== lowest level ==

Now, since [11 15]:-10 exceeds the TTL limit, it is compacted to the last level. Afterwards it looks like the following:

[1 40]: -1
[1 20]:-2 [21 40]:-3
[1 10]:-9 (.. now empty ..) [16 22]:-8 [24 30]: -7 ... 
== lowest level ==

Now no file qualifies for TTL compaction. Then a compaction naturally happens and [1 20]:-2 is merged down. The data ranges are re-arranged and inherit the oldest ancestor time, e.g.:

[1 40]: -1
(.. now empty ..) [21 40]:-3
[1 7]:-9 [8 18]:-9  [19 22]:-8 [24 30]: -7 ... 
== lowest level ==

This even assumes that we take min(oldest_ancester_time) per range, rather than the min of all inputs.

Now one day passes and more files will qualify:

[1 40]: -2
(.. now empty ..) [21 40]:-4
[1 7]:-10 [8 18]:-10  [19 22]:-9 [24 30]: -8 ... 
== lowest level ==

Now more files can be moved down so L3 looks like:

(now empty)  [19 22]:-9 [24 30]: -8 ... 

Now new upper-level files come down:

......
[1 20]: -3 [21 40]:-4
[19 22]:-9 [24 30]: -8 ... 
== lowest level ==

Assume we are merging [1 20]:-3 down. Since the file is small and covers more ranges, the result might look like this:

......
(now empty) [21 40]:-4
[1 20]:-9 [21 22]:-9 [24 30]: -8 ... 
== lowest level ==

Now all incoming files covering the range from 1 to 22 will take the old ancestor time -9, even though they hold newer data.

This creates a situation where new data keeps coming in but always inherits a very old ancestor time, and so keeps being compacted to the last level by TTL compaction.

The file cutting is meant to solve this problem. I can schedule a VC meeting if that would be helpful.
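
To make the inheritance in this walkthrough explicit, a minimal sketch under assumed semantics (hypothetical types; not the PR's code): an output file takes the minimum ancestor time of its inputs, which is why one old input contaminates every output range it overlaps unless outputs are cut at its boundaries.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct InputFile {
    uint64_t oldest_ancestor_time;  // seconds since epoch; smaller is older
  };

  // The output inherits the oldest (minimum) ancestor time of its inputs, so
  // merging new data with one old file drags the whole output's time down.
  uint64_t OutputOldestAncestorTime(const std::vector<InputFile>& inputs) {
    uint64_t t = UINT64_MAX;
    for (const InputFile& f : inputs) {
      t = std::min(t, f.oldest_ancestor_time);
    }
    return t;
  }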

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@siying

siying commented Oct 30, 2021

@jay-zhuang Testing the performance regression caused by this extra file cutting is hard. Cutting files more often by itself introduces more CPU overhead, and based on my benchmark, this appears to cost more than the boundary-checking path. I hacked the code so that the boundary checking always happens but cutting happens less frequently: https://gist.github.com/siying/c6322f6d9cd1b2febc6c9408f8b8c877

The result is that CPU usage per MB of compaction writes indeed increased by about 5%. However, we still cut files more, so it's hard to determine the actual overhead. I hope that most of the time we don't need the check, so a small percentage of compactions regressing by something like 5% should be acceptable. Without this feature, we are doing almost an extra full compaction, which is expensive too. Result: https://gist.github.com/siying/190114297e9ec63e6fce67c82cdeb654 (I checked the low compaction pool's CompMergeCPU over Write(GB).)

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@jay-zhuang jay-zhuang left a comment

LGTM.

Though I'd still prefer using the user key instead of the internal key for getting the min ancestor time and for cutting the file, if possible. For example, the following output file will be split into 2 files unnecessarily (both output files end up with the same old timestamp):

  // Before TTL boosted compaction (timestamp t2 is newer than t1):
  // L1: [0->31](t2) ...
  // L2: [0->31](t1) ...

  // After:
  // L1: ...
  // L2: [0](t1) [1->31](t1)

Anyway, it's a very minor case; it only happens when one smallest key overrides the other smallest key. So LGTM.
I added a simple unit test to cover the file splitting.

Commits

- (initial commit, with the same summary and test plan as the PR description)
- add
- Fix condition
- Update db/compaction/file_pri.h (Co-authored-by: Jay Zhuang <jay.zhuang@yahoo.com>)
- SKip last level and add a DB-level unit teset
- Fix a divide 0 error and add the option to db_bench
- Fix
- new
- Remove printf
- Update HISTORY.md
- Make format
- Fix
- Fix a bug
- Try to fix
- Small improvement

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

This pull request has been merged in a2b9be4.
