
Try to start TTL compaction earlier when kMinOverlappingRatio is used #8749

Closed

Conversation


@siying siying commented Sep 2, 2021

Summary:
Right now, when options.ttl is set, compactions are triggered around the time the TTL is reached. This can cause extra compactions, which are often bursty. This commit tries to mitigate that by picking these files up earlier in the normal compaction-picking process. It is only implemented for kMinOverlappingRatio with leveled compaction, as that is the default value and changing the other styles is more complicated.

Once a file is older than ttl/2, RocksDB starts to boost its compaction priority in the normal compaction-picking process, in the hope that by the time the TTL is reached, very few extra compactions are needed.
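
As an illustration only, here is a minimal sketch of such a boost, assuming a hypothetical BoostedScore() helper rather than the PR's actual FileTtlBooster:

  #include <algorithm>
  #include <cstdint>

  // Hypothetical simplification of the boost described above. Ages and ttl
  // are in seconds; base_score is the file's normal picking score under
  // kMinOverlappingRatio.
  uint64_t BoostedScore(uint64_t base_score, uint64_t file_age, uint64_t ttl) {
    if (ttl < 2 || file_age <= ttl / 2) {
      return base_score;  // Young files keep their normal score.
    }
    // Grow the boost linearly from 1x at age ttl/2 toward 16x at age ttl, so
    // old files win the picking race before a forced TTL compaction is needed.
    const uint64_t kMaxBoost = 16;
    uint64_t boost = 1 + (kMaxBoost - 1) * (file_age - ttl / 2) / (ttl / 2);
    return base_score * std::min(boost, kMaxBoost);
  }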

In order for this to work, another change is made: during a compaction, if an output-level file is older than ttl/2, cut the output files at the original file boundaries (if the output is not on the last level). This makes sure that after an old file is moved to the next level and new data is merged down from the upper level, the new data falling into this range isn't tagged with the old timestamp. Without this change, in many cases most files on a level keep an old timestamp even though they hold newer data, and we stay stuck in that state.

Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@jay-zhuang jay-zhuang left a comment

The change looks good to me; just a few questions, and it would be great to have some benchmark results.

db/compaction/file_pri.h
// starts to boost, the previous level's boosting amount is 16.
class FileTtlBooster {
public:
FileTtlBooster(uint64_t current_time, uint64_t ttl, int num_non_empty_levels,
@jay-zhuang jay-zhuang Sep 8, 2021

What's the benefit of using num_non_empty_levels instead of just level_num? If there are multiple empty levels, the max - 1 level should still need boosting, right?

Contributor Author

Levels beyond num_non_empty_levels will be empty. I assume we would boost files so that they reach the last non-empty level to remove tombstones. They will likely never reach Lmax anyway.

FileTtlBooster(uint64_t current_time, uint64_t ttl, int num_non_empty_levels,
int level)
: current_time_(current_time) {
if (ttl == 0 || level == 0 || level >= num_non_empty_levels) {
Contributor

Should the bottommost level need boosting (when level == num_non_empty_levels - 1)?

Contributor Author

Good point. I guess it doesn't. I should probably remove that level and adjust the other levels' boosting accordingly.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

2 similar comments

@jay-zhuang jay-zhuang left a comment

LGTM.

Looking forward to the benchmark result.

HISTORY.md
@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@siying

siying commented Sep 9, 2021

> LGTM.
>
> Looking forward to the benchmark result.

I'm working on it. Constructing a scenario where it shows a significant gain is extremely hard. I'm putting together a scenario where at least these TTL compactions are scheduled earlier, and making sure there is at least no regression.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@siying

siying commented Sep 9, 2021

> LGTM.
>
> Looking forward to the benchmark result.

I got an interesting observation so far. Without this PR, if writes are fast enough to keep compaction always running, it appears that TTL compaction would never execute and the TTL will be violated. Might be something we should fix... A side effect is that it makes constructing a benchmark for this PR even harder.

@jay-zhuang

> I got an interesting observation so far. Without this PR, if writes are fast enough to keep compaction always running, it appears that TTL compaction would never execute and the TTL will be violated. Might be something we should fix... A side effect is that it makes constructing a benchmark for this PR even harder.

I see. Is TTL a strong guarantee? It could also be violated by having a slow TTL compaction.

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

5 similar comments

@jay-zhuang

Thanks for the improvement, but I'm not sure how effective it will be. If the data is randomly distributed, it seems splitting the output files won't improve things much: either we get lots of small files, or normal-size files with the older timestamp. I think it will likely be the latter, as the size of output-level files is limited (to be > target_size/2, which is good). I guess it's going to help for some specific use cases where the data is mostly ordered by timestamp. Should we benchmark it first?

}
} else {
// Look for the key position
while (next_files_to_cut_for_ttl <
Contributor

Is this performance-sensitive code? Would it impact the performance of compaction?

Contributor Author

Yes. I hope that in most cases files_to_cut_for_ttl is empty, so this path is skipped entirely. In the non-empty case, mostly one extra key comparison is made per key compacted, which I hope is acceptable. I can run a benchmark and measure the performance penalty of this path.
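
For illustration, a minimal sketch of this fast path under assumed types and names (plain byte-string comparison stands in for the internal key comparator):

  #include <cstddef>
  #include <string>
  #include <vector>

  // Each entry mirrors an old input file whose boundaries should cut output.
  struct FileRange {
    std::string smallest;  // encoded internal key
    std::string largest;   // encoded internal key
  };

  // Returns true if internal_key falls inside the current old-file range.
  // When files_to_cut_for_ttl is empty (the common case), the cost is a
  // single empty() test; otherwise typically one comparison per compacted
  // key, since next_file only moves forward as keys arrive in sorted order.
  bool KeyInOldFileRange(const std::vector<FileRange>& files_to_cut_for_ttl,
                         size_t& next_file, const std::string& internal_key) {
    if (files_to_cut_for_ttl.empty()) {
      return false;
    }
    while (next_file < files_to_cut_for_ttl.size() &&
           internal_key > files_to_cut_for_ttl[next_file].largest) {
      ++next_file;  // Skip ranges that end before this key.
    }
    return next_file < files_to_cut_for_ttl.size() &&
           internal_key >= files_to_cut_for_ttl[next_file].smallest;
  }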

Comment on lines +266 to +278
if (icmp->Compare(internal_key,
files_to_cut_for_ttl[next_files_to_cut_for_ttl]
->smallest.Encode()) >= 0) {
if (icmp->Compare(internal_key,
files_to_cut_for_ttl[next_files_to_cut_for_ttl]
->largest.Encode()) <= 0) {
// Within the current file
cur_files_to_cut_for_ttl = next_files_to_cut_for_ttl;
return true;
Contributor

The first key within this range will create a new file unnecessarily.
Here is an example test, which basically compacts the following 4 SSTs:

L1: [0-31] [32-63]
L2: [0-31] [32-63]

The output is 4 SSTs:

L2: [0] [1-31] [32] [33-63]

https://github.com/jay-zhuang/rocksdb/blob/46446fce9aff774ddc60d8828e14e71faae7e754/db/db_compaction_test.cc#L4430

Maybe we need a seen_key similar to grandparent checking?
As it's getting more complicated, could it be a function?

Contributor Author

I don't understand the example. Is L2 the grandparent? If the data in L1 is supposed to be the output and L2 is the grandparent, it appears that we would cut to [0-31][32-63].

Btw, these are internal keys, and we are not supposed to have two entries with exactly the same internal key. In case it happens, cutting a small file is the least of our worries.

@jay-zhuang jay-zhuang Oct 28, 2021

Sorry for the bad example. Yeah, that answers part of my question, but there's still a problem. Here is a simpler example:

L1: [0-31]
L2: [2-31]
L3: [...] [...] ... // grandparents that we can ignore

If we compact 1@L1 + 1@L2, then the output would be:

L2: [0-2] [3-31]

Then both output files end up with the older timestamp.

Key 2 is included in the first file because an internal key with a larger seq_num (from the output data) sorts before one with a smaller seq_num (from files_to_cut_for_ttl):

// decreasing sequence number

It only happens when there's an override of the smallest key in files_to_cut_for_ttl.
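
For reference, a minimal illustration of the ordering rule invoked here (a simplified stand-in, not RocksDB's actual InternalKeyComparator): entries with the same user key sort by decreasing sequence number, so the newer key 2 lands in the earlier output file.

  #include <cstdint>
  #include <string>

  struct ParsedInternalKey {
    std::string user_key;
    uint64_t sequence;  // higher means newer
  };

  // Negative if a sorts before b. For the same user key, newer entries
  // (larger sequence numbers) sort first.
  int CompareInternalKey(const ParsedInternalKey& a,
                         const ParsedInternalKey& b) {
    int r = a.user_key.compare(b.user_key);
    if (r != 0) {
      return r;
    }
    if (a.sequence > b.sequence) {
      return -1;  // decreasing sequence number
    }
    return a.sequence < b.sequence ? 1 : 0;
  }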

Contributor Author

Yes, that is on purpose. We want to separate parts with different old ancestor times. [0 2] and [3 31] will be different, so we separate them. Is that the concern here?

Contributor

I think the output should be:

L2: [0-1] [2-31]

So the first file would have the newer timestamp and only the old one would keep the older timestamp.

@siying

siying commented Oct 27, 2021

> Thanks for the improvement, but I'm not sure how effective it will be. If the data is randomly distributed, it seems splitting the output files won't improve things much: either we get lots of small files, or normal-size files with the older timestamp. I think it will likely be the latter, as the size of output-level files is limited (to be > target_size/2, which is good). I guess it's going to help for some specific use cases where the data is mostly ordered by timestamp. Should we benchmark it first?

This is not meant to solve unbalanced user data, but the unbalanced creation-time distribution caused by compaction.

Let's assume the TTL is 10 days, and "[A B]:-3" indicates a file covering range [A, B] whose oldest ancestor time is 3 days ago. Assume we use the original approach, without the boosting mechanism in this PR, and have this tree structure:

[1 40]: -1
[1 20]:-2 [21 40]:-3
[1 10]:-9 [11 15]:-10 [16 22]:-8 [24 30]: -7 ... 
== lowest level ==

Now, since [11 15]:-10 exceeds the TTL limit, it is compacted to the last level. Afterwards it looks like the following:

[1 40]: -1
[1 20]:-2 [21 40]:-3
[1 10]:-9 (.. now empty ..) [16 22]:-8 [24 30]: -7 ... 
== lowest level ==

Now no file qualifies for TTL compaction. Then a compaction naturally happens and [1 20]:-2 is merged down. The data ranges are re-arranged and inherit the oldest ancestor time, e.g.:

[1 40]: -1
(.. now empty ..) [21 40]:-3
[1 7]:-9 [8 18]:-9  [19 22]:-8 [24 30]: -7 ... 
== lowest level ==

This even assumes that we take min(oldest_ancester_time) per range, rather than the min of all inputs.

Now one day passes and more files will qualify:

[1 40]: -2
(.. now empty ..) [21 40]:-4
[1 7]:-10 [8 18]:-10  [19 22]:-9 [24 30]: -8 ... 
== lowest level ==

Now more files can be moved down so L3 looks like:

(now empty)  [19 22]:-9 [24 30]: -8 ... 

Now new upper-level files come down:

......
[1 20]: -3 [21 40]:-4
[19 22]:-9 [24 30]: -8 ... 
== lowest level ==

Assume we are merging [1 20]:-3 down. Since the file is small and covers more ranges, the result might look like this:

......
(now empty) [21 40]:-4
[1 20]:-9 [21 22]:-9 [24 30]: -8 ... 
== lowest level ==

Now all incoming files covering the range from 1 to 22 will take the old ancestor time -9, even though they hold newer data.

This creates a situation where new data keeps coming in but always inherits a very old ancestor time, and so keeps being compacted to the last level by TTL compaction.

The file cutting is meant to solve this problem. I can schedule a VC meeting if that would be helpful.
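
To make the inheritance in this walkthrough explicit, a minimal sketch under assumed semantics (hypothetical types; not the PR's code): an output file takes the minimum ancestor time of its inputs, which is why one old input contaminates every output range it overlaps unless outputs are cut at its boundaries.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct InputFile {
    uint64_t oldest_ancestor_time;  // seconds since epoch; smaller is older
  };

  // The output inherits the oldest (minimum) ancestor time of its inputs, so
  // merging new data with one old file drags the whole output's time down.
  uint64_t OutputOldestAncestorTime(const std::vector<InputFile>& inputs) {
    uint64_t t = UINT64_MAX;
    for (const InputFile& f : inputs) {
      t = std::min(t, f.oldest_ancestor_time);
    }
    return t;
  }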

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@siying

siying commented Oct 30, 2021

@jay-zhuang Testing the performance regression caused by this extra file cutting is hard. Cutting files more often by itself introduces more CPU overhead, and based on my benchmark, this appears to cost more than the boundary-checking path. I hacked the code so that the boundary checking always happens but cutting happens less frequently: https://gist.github.com/siying/c6322f6d9cd1b2febc6c9408f8b8c877

The result is that CPU usage per MB of compaction writes indeed increased by about 5%. However, we still cut files more, so it's hard to determine the actual overhead. I hope that most of the time we don't need the check, so a small percentage of compactions regressing by something like 5% should be acceptable. Without this feature, we are doing almost an extra full compaction, which is expensive too. Result: https://gist.github.com/siying/190114297e9ec63e6fce67c82cdeb654 (I checked the low compaction pool's CompMergeCPU over Write(GB).)

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@jay-zhuang jay-zhuang left a comment

LGTM.

Though I'd still prefer using the user key instead of the internal key for getting the min ancestor time and for cutting the file, if possible. For example, the following output file will be split into 2 files unnecessarily (both output files end up with the same old timestamp):

  // Before TTL boosted compaction (timestamp t2 is newer than t1):
  // L1: [0->31](t2) ...
  // L2: [0->31](t1) ...

  // After:
  // L1: ...
  // L2: [0](t1) [1->31](t1)

Anyway, it's a very minor case; it only happens when one smallest key overrides the other smallest key. So LGTM.
I added a simple unit test to cover the file splitting.

Commits

- (initial commit, with the same summary and test plan as the PR description)
- add
- Fix condition
- Update db/compaction/file_pri.h (Co-authored-by: Jay Zhuang <jay.zhuang@yahoo.com>)
- SKip last level and add a DB-level unit teset
- Fix a divide 0 error and add the option to db_bench
- Fix
- new
- Remove printf
- Update HISTORY.md
- Make format
- Fix
- Fix a bug
- Try to fix
- Small improvement

@facebook-github-bot

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

This pull request has been merged in a2b9be4.
