Fix regression issue of too large score #10518
Conversation
Summary: facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on the L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score the same way as L1's, so that L0->L1 is favored when the L0 size is larger than L1's.
Test Plan: run db_bench against data on tmpfs and confirm that the data-stuck-in-L1 behavior goes away.
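For intuition, here is a minimal, self-contained sketch of the scoring idea described above. It is not the actual RocksDB diff; the function and parameter names are made up for illustration.

```cpp
// A minimal sketch of the idea: score L0 by its size relative to the base
// level (the same way Lbase is scored against its target), so L0->Lbase is
// only favored on size once L0 is actually larger than Lbase.
#include <algorithm>
#include <cstdint>

double L0CompactionScore(uint64_t l0_bytes, uint64_t base_level_bytes,
                         uint64_t max_bytes_for_level_base,
                         int l0_file_count,
                         int level0_file_num_compaction_trigger) {
  // Classic file-count trigger component.
  double score = static_cast<double>(l0_file_count) /
                 level0_file_num_compaction_trigger;
  // Size component: compare L0 against the larger of the base level's
  // current size and its configured target, so a small Lbase does not
  // inflate the L0 score.
  uint64_t denom = std::max(base_level_bytes, max_bytes_for_level_base);
  score = std::max(score, static_cast<double>(l0_bytes) / denom);
  return score;
}
```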
LGTM!
// It might be guaranteed by logic below anyway, but we are
// explicit here to make sure we don't stop writes with no
// compaction scheduled.
score = std::max(score, 1.01);
Why not use std::max(score, std::min(level_multiplier_, total_size / max_bytes_for_level_base)), so that an increased Size(L0) would actually cause the score to increase even when it's smaller than Size(Lbase)? In the current PR, for example, if Score(Lbase) = 3, then Size(L0)/Size(Lbase) must be larger than 3 to prioritize L0->Lbase. That seems like too large an L0, and also an hourglass shape. With total_size / max_bytes_for_level_base, we will at least prefer L0 whenever it's larger than Lbase.
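For a rough sense of how the suggested clamp behaves, here is a tiny self-contained example with made-up numbers (level multiplier 8, 64 MB base target, and a hypothetical L0 file-count score); none of these values come from the PR.

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
  const double level_multiplier = 8.0;
  const double max_bytes_for_level_base_mb = 64.0;
  const double file_count_score = 1.2;  // hypothetical L0 file-count score
  for (double l0_mb : {96.0, 192.0, 416.0, 800.0}) {
    double score = std::max(
        file_count_score,
        std::min(level_multiplier, l0_mb / max_bytes_for_level_base_mb));
    std::printf("Size(L0) = %4.0f MB -> score = %.2f\n", l0_mb, score);
  }
  // Prints 1.50, 3.00, 6.50, 8.00: the score keeps growing with L0's size,
  // independent of Size(Lbase), until it is capped at level_multiplier.
}
```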
I tried that. I faced a problem that total_size / max_bytes_for_level_base can be very large. In the current algorithm, L1 actually ends up with a much smaller score than L0 (because we de-prioritize a level when there is a lot of pending data in the upper levels), and this causes data to get stuck in L1. That's why the score is divided by Size(Lbase) when Lbase is relatively large: it makes it possible for the L0 score to be smaller than Lbase's when Lbase is large.
> I faced a problem that total_size / max_bytes_for_level_base can be very large

The std::min(level_multiplier_, total_size / max_bytes_for_level_base) was intended to prevent that case.

> That's why the score is divided by Size(Lbase) when Lbase is relatively large: it makes it possible for the L0 score to be smaller than Lbase's when Lbase is large.

I think Score(L0) < Score(Lbase) when Size(L0) < Size(Lbase) is good enough. The current formula makes Score(L0) < Score(Lbase) when Size(L0) < Score(Lbase) * Size(Lbase), which seems to mean L0 may have to be much bigger than Lbase in order for L0 to be prioritized.

Anyway, I don't mind trying it if you want, since an hourglass LSM might work fine; I haven't measured it. So my accept still stands.
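As a concrete instance of the threshold in the concern above, here is a tiny example with hypothetical numbers that do not come from any benchmark run.

```cpp
#include <cstdio>

int main() {
  const double score_lbase = 3.0;      // hypothetical Score(Lbase)
  const double size_lbase_mb = 192.0;  // hypothetical Size(Lbase)
  // Under "Score(L0) < Score(Lbase) when Size(L0) < Score(Lbase) * Size(Lbase)",
  // L0 only takes priority on size once it grows past this threshold:
  std::printf("Size(L0) must exceed %.0f MB before L0->Lbase wins on size\n",
              score_lbase * size_lbase_mb);  // 576 MB, vs. Lbase at 192 MB
}
```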
> I think Score(L0) < Score(Lbase) when Size(L0) < Size(Lbase) is good enough. The current formula makes Score(L0) < Score(Lbase) when Size(L0) < Score(Lbase) * Size(Lbase), which seems to mean L0 may have to be much bigger than Lbase in order for L0 to be prioritized.

The benchmark I ran (the one Mark used) actually shows that L0->L1 still feels prioritized, perhaps too much, over L1->L2. If L0 is larger than L1, L0->L1 is always prioritized over L1->L2, because the score math makes the L1 score < 0.5 (5 after scaling). Similarly, if the number of L0 files is more than the L0 file trigger, we almost always pick from L0. L1->L2 usually gets picked only when L1 is significantly larger than L0.
Here are some snapshots of the benchmark runs (the level multiplier is 8 and the base level size is 64MB, the exact parameters used by Mark):
L0 12/10 321.94 MB
L4 23/23 368.26 MB
L5 11/0 176.77 MB
L6 112/4 1.50 GB
L7 798/23 11.42 GB

L0 8/6 218.41 MB
L4 24/24 374.77 MB
L5 20/0 317.81 MB
L6 118/5 1.53 GB
L7 814/25 11.69 GB

L0 10/6 278.14 MB
L4 20/20 312.10 MB
L5 11/0 175.13 MB
L6 119/4 1.55 GB
L7 830/18 11.96 GB

L0 9/6 307.93 MB
L4 20/20 321.85 MB
L5 12/0 193.10 MB
L6 113/0 1.52 GB
L7 853/0 12.21 GB

L0 5/0 131.31 MB
L4 26/0 402.99 MB
L5 18/0 282.12 MB
L6 123/3 1.60 GB
L7 871/15 12.42 GB
> If L0 is larger than L1, L0->L1 is always prioritized over L1->L2, because the score math makes the L1 score < 0.5 (5 after scaling).

Got it; I did the math and agree that Score(L0) > Score(Lbase) when Size(L0) > Size(Lbase). I had forgotten to incorporate Size(L0) in the denominator of Score(Lbase) earlier (courtesy of total_downcompact_bytes). This is challenging to reason about, but I am glad it works.

> Similarly, if the number of L0 files is more than the L0 file trigger, we almost always pick from L0.

Understood this part already. The size-based component of Score(L0) is not used in this case.
> the score math makes the L1 score < 0.5 (5 after scaling)

My calculation for Score(Lbase) is 10 * S'(Lbase) / (max_bytes_for_level_base + S(L0)), where S'(Li) is the sum of the compensated sizes of the files not being compacted on level i, and S(Li) is the sum of all actual file sizes on level i.
I don't see why S'(L0) > S'(Lbase) (assuming that's what you mean by "L0 is larger than L1") means Score(Lbase) is < 5. It looks to be typically less than 10, unless the Lbase files have high size compensation.
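For a rough sense of scale, here is that Score(Lbase) expression evaluated with made-up sizes, treating compensated size as equal to actual size (i.e. ignoring deletion compensation); all numbers are assumptions for illustration.

```cpp
#include <cstdio>

int main() {
  const double kScoreScale = 10.0;                  // the factor of 10 above
  const double s_prime_lbase_mb = 300.0;            // S'(Lbase), hypothetical
  const double s_l0_mb = 400.0;                     // S(L0), hypothetical
  const double max_bytes_for_level_base_mb = 64.0;  // the 64 MB setting above
  const double score_lbase =
      kScoreScale * s_prime_lbase_mb / (max_bytes_for_level_base_mb + s_l0_mb);
  std::printf("Score(Lbase) = %.2f\n", score_lbase);
  // ~6.47: below 10, but not necessarily below 5, even though S(L0) > S'(Lbase).
}
```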
For my attempt to prove Score(L0) > Score(Lbase) in this scenario (S'(L0) > S'(Lbase)), it required the following two small tweaks. I think they fix a case where Lbase has a lot of deletions, and I wonder if you think they make sense.

1. base_level_size uses S'(Lbase) instead of S*(Lbase), where S*(Li) is the sum of all compensated file sizes on level i. This change removes files undergoing compaction from base_level_size.
2. total_downcompact_bytes uses S'(Li) instead of S(Li). This changes actual file size to compensated file size in total_downcompact_bytes, and removes files undergoing compaction.
Summary: #10057 caused a regression bug: since the base level size is no longer adjusted based on the L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.
Pull Request resolved: #10518
Test Plan: run db_bench against data on tmpfs and confirm that the data-stuck-in-L1 behavior goes away.
Reviewed By: ajkr
Differential Revision: D38603145
fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Summary: facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on the L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.
Pull Request resolved: facebook#10518
Test Plan: run db_bench against data on tmpfs and confirm that the data-stuck-in-L1 behavior goes away.
Reviewed By: ajkr
Differential Revision: D38603145
fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Signed-off-by: tabokie <xy.tao@outlook.com>