
fix(level): change split key range right key to use ts=0 #1932

Merged 1 commit into dgraph-io:main on May 9, 2023

Conversation

@hugy718 (Contributor) commented Apr 27, 2023

Problem

The old implementation is not incorrect, but it does not follow the logic described in its comments. When a compaction is split into subcompaction jobs, each job builds an iterator over all tables touched by the compaction and iterates over a keyRange defined in addSplits; together, these ranges cover the key range of all tables in the compaction. The right key of each keyRange is intended to be the right key of a bottom-level table (cd.bot) and, as the comments describe, to include all versions of that key. However, using ts=math.MaxUint64 excludes those versions from the keyRange and pushes them into the next split: the timestamp is encoded as math.MaxUint64-ts in y.KeyWithTs, so a key with a larger ts is actually smaller under y.CompareKeys. Using ts=0 corrects this. Note that even with ts=math.MaxUint64 no keys are dropped, contrary to what the comments above suggest, because the excluded versions are covered by the subsequent split, and the last split has an empty right key and iterates to the end of the tables being compacted.
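
To make the ordering concrete, here is a small self-contained Go sketch of the encoding described above (user key followed by big-endian math.MaxUint64-ts, mirroring what y.KeyWithTs does). The keyWithTs helper is a local illustration, not badger's API, and plain bytes.Compare is used since the user keys have equal length:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "math"
)

// keyWithTs mimics the encoding: the version is stored as math.MaxUint64-ts
// in big-endian, so a *higher* ts produces a *smaller* suffix and sorts first
// among versions of the same user key.
func keyWithTs(key []byte, ts uint64) []byte {
    out := make([]byte, len(key)+8)
    copy(out, key)
    binary.BigEndian.PutUint64(out[len(key):], math.MaxUint64-ts)
    return out
}

func main() {
    newest := keyWithTs([]byte("C"), 3)                // version ts=3
    older := keyWithTs([]byte("C"), 2)                 // version ts=2
    rightMax := keyWithTs([]byte("C"), math.MaxUint64) // old right boundary
    rightZero := keyWithTs([]byte("C"), 0)             // right boundary after the fix

    // The old boundary sorts before every real version of "C",
    // so those versions fall outside this split's key range.
    fmt.Println(bytes.Compare(rightMax, newest) < 0) // true
    fmt.Println(bytes.Compare(rightMax, older) < 0)  // true

    // With ts=0 the boundary sorts after every version of "C",
    // so all versions of the key stay inside one split.
    fmt.Println(bytes.Compare(rightZero, newest) > 0) // true
    fmt.Println(bytes.Compare(rightZero, older) > 0)  // true
}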

Solution

Changed the timestamp used for the split key range's right key from math.MaxUint64 to 0, and updated the comments above it.

@mangalaman93 (Contributor)

Hi @hugy718, thank you for the PR. Do you think it is possible to add a test for this?

@hugy718 (Contributor, Author) commented Apr 28, 2023

The TestCompaction/with_split test in level_test.go covers the correctness of using addSplits to run multiple subcompactions.

@mangalaman93 (Contributor)

@hugy718 I am wondering whether it is easy to add a test that shows things could go wrong without this change and that this change fixes the issue. I can also look into it myself if you think it's not easy to come up with a test.

@hugy718 (Contributor, Author) commented May 3, 2023

Actually, nothing would go wrong without this change. It only improves the implementation such that keys in the same bottom table will not span two splits, which is what the old code comments there suggested.

@mangalaman93 (Contributor)

yes, you are right. I will let @harshil-goel review it and merge it. Thanks for the PR again.

@harshil-goel (Contributor) commented May 4, 2023

@hugy718 Thanks a lot for the PR. I went through the logic and the old PR: https://github.com/dgraph-io/badger/pull/1587/files. From what I understand, what's happening is:

  1. We need all the versions of a key in the same split. This ensures that a key's delete is handled properly.
  2. Earlier, we were creating splits by the bottom table's right-most element. That led to creating two splits, one containing C2 and the other C3. This leads to C2 still being there, but C3 being deleted (example from the comments in the earlier code).
  3. With the old fix, we would still have created two splits, [A1 ... B] [C2...C3]. With your change, we would be creating something like [A1...B...C2...C3].

Is my understanding correct?
If my understanding is correct, we should have more parallelism with the older approach, right?
This was actually considered earlier on also, as you can see in the comments, but we chose to do it this way. We can just update the comment to be accurate.

@hugy718 (Contributor, Author) commented May 5, 2023

C3 is placed before C2, since the timestamp is encoded as math.MaxUint64 minus ts.

The old approach is very unlikely to give meaningful extra parallelism, since it only affects the records of the same key. (Actually, there is no extra parallelism, because of the if statement below: the placement of the records of the last user key of a table does not affect the number of splits. For the last table, the condition is true and the whole table is placed in the last split, whose right boundary is empty. For tables in between, those tail records are processed in the next split, which contains the next table.) However, the old approach places a bottom table into two splits. The old example is not so good at explaining the resulting splits, since the if statement below produces just one split for it.

badger/levels.go, lines 1071 to 1074 at 9afd0a0:

if i == len(cd.bot)-1 {
    // Last bottom table: give the final split an empty right key,
    // so it iterates to the end of the tables being compacted.
    addRange([]byte{})
    return
}

If we consider a compaction that truly uses multi-split subcompactions, there are many more bottom tables. Consider the example in the comment at line 1049 of levels.go.

badger/levels.go, lines 1049 to 1054 at 9afd0a0:

// Let's say we have 10 tables in cd.bot and min width = 3. Then, we'll pick
// 0, 1, 2 (pick), 3, 4, 5 (pick), 6, 7, 8 (pick), 9 (pick, because last table).
// This gives us 4 picks for 10 tables.
// In an edge case, 142 tables in bottom led to 48 splits. That's too many splits, because it
// then uses up a lot of memory for table builder.
// We should keep it so we have at max 5 splits.

The old approach does not create splits with clean boundaries that align with bottom-table boundaries, so the splits for the tables would be [t0, t1, t2] [t2, t3, t4, t5] [t5, t6, t7, t8] [t8, t9]. The fix creates the desirable splits [t0, t1, t2] [t3, t4, t5] [t6, t7, t8] [t9].
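
A simplified sketch of the boundary picking discussed above, assuming 10 bottom tables and width 3 as in the quoted comment; pickSplitRights is an illustrative helper, not badger's addSplits, and it only models which bottom table contributes each right key (taken with ts=0, so the whole table stays left of its boundary):

package main

import "fmt"

// pickSplitRights returns one right boundary per split: every width-th bottom
// table contributes its biggest user key (conceptually encoded with ts=0), and
// the last split gets an empty right boundary, meaning "iterate to the end".
// Table contents here are just labels.
func pickSplitRights(botBiggest []string, width int) []string {
    var rights []string
    for i, biggest := range botBiggest {
        if i == len(botBiggest)-1 {
            rights = append(rights, "") // last split: open-ended
            break
        }
        if i%width == width-1 {
            rights = append(rights, biggest) // right key of table i, with ts=0
        }
    }
    return rights
}

func main() {
    // 10 bottom tables, min width 3, as in the levels.go comment:
    // boundaries after t2, t5, t8, plus the open-ended last split for t9,
    // i.e. 4 picks for 10 tables.
    bot := []string{"t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"}
    fmt.Println(pickSplitRights(bot, 3)) // [t2 t5 t8 ]
}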

Again, the change is unlikely to affect performance and will not affect correctness. It is just that placing a bottom table in only one split is what the logic was designed to do.

@harshil-goel (Contributor) commented May 5, 2023

From what I understand, you are saying that a bottom table is only in one subcompaction job. However, with this change, if the key is split into multiple tables (C2 in table 3, C3 in table 4, C4 in the top level), then table 4 would be in 2 different subcompaction jobs, right? If we take the 2nd example, t2 might be present in table 3, so your splits would also look like [t0, t1, t2, t3], [t3, t4, t5], [t6, t7, t8], [t8, t9].

It doesn't make sense to merge the change if it doesn't provide any performance / correctness benefit. However, this diff made us go deeper into some topics. You have done really great work in understanding badger. We would be happy if you just want to change the comment to make it more understandable / correct.

@hugy718 (Contributor, Author) commented May 6, 2023

All the tables mentioned in the second example are bottom tables, because addSplits only considers the bottom tables; the iterator then iterates through all records in each key range. A top table can definitely be involved in the subcompactions of two splits. The case you raise, where different versions of the same key belong to different tables in a bottom level, will never happen, however: these tables are generated by compaction, and the addKeys lambda in subcompact ensures that adjacent tables produced by a compaction never hold different versions of the same key (i.e. it only checks the conditions that end the add-key loop when the iterator sees a different user key). Therefore, with the fix a bottom table will only be found in one split.
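
A toy sketch of that invariant, not badger's subcompact code: cutTables and the size check are illustrative stand-ins, but the rule that an output table may only end at a user-key boundary is the point being made.

package main

import "fmt"

// entry is a (userKey, version) pair in iteration order:
// same user key, newer versions first.
type entry struct {
    userKey string
    version uint64
}

// cutTables models the invariant: a new output table may only start when the
// user key changes, so different versions of one key never straddle two
// adjacent tables produced by the same compaction. maxPerTable stands in for
// the real size/overlap checks.
func cutTables(entries []entry, maxPerTable int) [][]entry {
    var tables [][]entry
    var cur []entry
    lastKey := ""
    for _, e := range entries {
        // Only consider finishing the current table at a user-key boundary.
        if e.userKey != lastKey && len(cur) >= maxPerTable {
            tables = append(tables, cur)
            cur = nil
        }
        cur = append(cur, e)
        lastKey = e.userKey
    }
    if len(cur) > 0 {
        tables = append(tables, cur)
    }
    return tables
}

func main() {
    entries := []entry{
        {"A", 1}, {"B", 7}, {"C", 3}, {"C", 2}, {"C", 1}, {"D", 5},
    }
    // Even with a small size limit, all versions of "C" land in one table.
    for i, t := range cutTables(entries, 2) {
        fmt.Println("table", i, t)
    }
}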

Performance-wise, there is unlikely to be an improvement, because the block cache is used most of the time and the last block of a bottom table referenced by two splits (processed by two separate goroutines) is likely in cache when the slower split needs it. When the block cache is disabled, placing that block into one split avoids both goroutines fetching it from storage, saving one IO, but that is not significant since a compaction accesses hundreds of blocks.

If you still feel that it is easier to keep the old way, I will revert it and only correct the code comments describing the old way.

@harshil-goel (Contributor)

It makes sense that we don't have keys spilling over into new tables. In that case it makes sense to do the change. Thanks a lot for the diff, I have accepted it.

@mangalaman93 merged commit ef0e552 into dgraph-io:main on May 9, 2023
4 checks passed