
fix(level): change split key range right key to use ts=0 #1932

Merged 1 commit into dgraph-io:main on May 9, 2023

Conversation

@hugy718 (Contributor) commented Apr 27, 2023

Problem

The old implementation is not incorrect, but it does not follow the logic described in its comments. When a compaction is split into subcompaction jobs, each job builds an iterator over all tables touched by the compaction and iterates over a keyRange defined in addSplits; together, these ranges cover the key range of all tables in the compaction. The right key of each keyRange is intended to be the right key of a bottom-level table (cd.bot) and, as the comments describe, to include all versions of that key. However, using ts=math.MaxUint64 excludes those versions from the keyRange and pushes them into the next split: the timestamp is encoded as math.MaxUint64-ts in y.KeyWithTs, so a key with a larger ts is actually smaller under y.CompareKeys. Using ts=0 corrects this. Note that even with ts=math.MaxUint64 no keys are dropped, contrary to what the comments above suggest, because the excluded versions are covered by the subsequent split, and the last split has an empty right key and iterates to the end of the tables being compacted.
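
To make the ordering concrete, here is a small self-contained Go sketch of the encoding described above (user key followed by big-endian math.MaxUint64-ts, mirroring what y.KeyWithTs does). The keyWithTs helper is a local illustration, not badger's API, and plain bytes.Compare is used since the user keys have equal length:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "math"
)

// keyWithTs mimics the encoding: the version is stored as math.MaxUint64-ts
// in big-endian, so a *higher* ts produces a *smaller* suffix and sorts first
// among versions of the same user key.
func keyWithTs(key []byte, ts uint64) []byte {
    out := make([]byte, len(key)+8)
    copy(out, key)
    binary.BigEndian.PutUint64(out[len(key):], math.MaxUint64-ts)
    return out
}

func main() {
    newest := keyWithTs([]byte("C"), 3)                // version ts=3
    older := keyWithTs([]byte("C"), 2)                 // version ts=2
    rightMax := keyWithTs([]byte("C"), math.MaxUint64) // old right boundary
    rightZero := keyWithTs([]byte("C"), 0)             // right boundary after the fix

    // The old boundary sorts before every real version of "C",
    // so those versions fall outside this split's key range.
    fmt.Println(bytes.Compare(rightMax, newest) < 0) // true
    fmt.Println(bytes.Compare(rightMax, older) < 0)  // true

    // With ts=0 the boundary sorts after every version of "C",
    // so all versions of the key stay inside one split.
    fmt.Println(bytes.Compare(rightZero, newest) > 0) // true
    fmt.Println(bytes.Compare(rightZero, older) > 0)  // true
}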

Solution

Changed the timestamp used for the split key range's right key from math.MaxUint64 to 0, and updated the comments above it.

@mangalaman93 (Contributor)

Hi @hugy718, thank you for the PR. Do you think it is possible to add a test for this?

@hugy718 (Contributor, Author) commented Apr 28, 2023

The TestCompaction/with_split test in level_test.go covers the correctness of using addSplits to run multiple subcompactions.

@mangalaman93 (Contributor)

@hugy718 I am wondering whether it is easy to add a test that shows things could go wrong without this change and that this change fixes the issue. I can also look into it myself if you think it's not easy to come up with a test.

@hugy718 (Contributor, Author) commented May 3, 2023

Actually, nothing would go wrong without this change. It only improves the implementation such that keys in the same bottom table will not span two splits, which is what the old code comments there suggested.

@mangalaman93 (Contributor)

yes, you are right. I will let @harshil-goel review it and merge it. Thanks for the PR again.

@harshil-goel (Contributor) commented May 4, 2023

@hugy718 Thanks a lot for the PR. I went through the logic and the old PR: https://github.com/dgraph-io/badger/pull/1587/files. From what I understand, what's happening is:

  1. We need all the versions of a key in the same split. This ensures that a key's delete is handled properly.
  2. Earlier, we were creating splits by the bottom table's right-most element. That led to creating two splits, one containing C2 and the other C3. This leads to C2 still being there, but C3 being deleted (example from the comments in the earlier code).
  3. With the old fix, we would still have created two splits, [A1 ... B] [C2...C3]. With your change, we would be creating something like [A1...B...C2...C3].

Is my understanding correct?
If my understanding is correct, we should have more parallelism with the older approach, right?
This was actually considered earlier on also, as you can see in the comments, but we chose to do it this way. We can just update the comment to be accurate.

@hugy718 (Contributor, Author) commented May 5, 2023

C3 is placed before C2, since the timestamp is encoded as math.MaxUint64 minus ts.

The old approach is very unlikely to give meaningful extra parallelism, since it only affects the records of the same key. (Actually, there is no extra parallelism, because of the if statement below: the placement of the records of the last user key of a table does not affect the number of splits. For the last table, the condition is true and the whole table is placed in the last split, whose right boundary is empty. For tables in between, those tail records are processed in the next split, which contains the next table.) However, the old approach places a bottom table into two splits. The old example is not so good at explaining the resulting splits, since the if statement below produces just one split for it.

badger/levels.go, lines 1071 to 1074 at 9afd0a0:

if i == len(cd.bot)-1 {
    // Last bottom table: give the final split an empty right key,
    // so it iterates to the end of the tables being compacted.
    addRange([]byte{})
    return
}

If we consider a compaction that truly uses multi-split subcompactions, there are many more bottom tables. Consider the example in the comment at line 1049 of levels.go.

badger/levels.go, lines 1049 to 1054 at 9afd0a0:

// Let's say we have 10 tables in cd.bot and min width = 3. Then, we'll pick
// 0, 1, 2 (pick), 3, 4, 5 (pick), 6, 7, 8 (pick), 9 (pick, because last table).
// This gives us 4 picks for 10 tables.
// In an edge case, 142 tables in bottom led to 48 splits. That's too many splits, because it
// then uses up a lot of memory for table builder.
// We should keep it so we have at max 5 splits.

The old approach does not create splits with clean boundaries that align with bottom-table boundaries, so the splits for the tables would be [t0, t1, t2] [t2, t3, t4, t5] [t5, t6, t7, t8] [t8, t9]. The fix creates the desirable splits [t0, t1, t2] [t3, t4, t5] [t6, t7, t8] [t9].
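
A simplified sketch of the boundary picking discussed above, assuming 10 bottom tables and width 3 as in the quoted comment; pickSplitRights is an illustrative helper, not badger's addSplits, and it only models which bottom table contributes each right key (taken with ts=0, so the whole table stays left of its boundary):

package main

import "fmt"

// pickSplitRights returns one right boundary per split: every width-th bottom
// table contributes its biggest user key (conceptually encoded with ts=0), and
// the last split gets an empty right boundary, meaning "iterate to the end".
// Table contents here are just labels.
func pickSplitRights(botBiggest []string, width int) []string {
    var rights []string
    for i, biggest := range botBiggest {
        if i == len(botBiggest)-1 {
            rights = append(rights, "") // last split: open-ended
            break
        }
        if i%width == width-1 {
            rights = append(rights, biggest) // right key of table i, with ts=0
        }
    }
    return rights
}

func main() {
    // 10 bottom tables, min width 3, as in the levels.go comment:
    // boundaries after t2, t5, t8, plus the open-ended last split for t9,
    // i.e. 4 picks for 10 tables.
    bot := []string{"t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"}
    fmt.Println(pickSplitRights(bot, 3)) // [t2 t5 t8 ]
}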

Again, the change is unlikely to affect performance and will not affect correctness. It is just that placing a bottom table in only one split is what the logic was designed to do.

@harshil-goel (Contributor) commented May 5, 2023

From what I understand, you are saying that a bottom table is only in one subcompaction job. However, with this change, if the key is split into multiple tables (C2 in table 3, C3 in table 4, C4 in the top level), then table 4 would be in 2 different subcompaction jobs, right? If we take the 2nd example, t2 might be present in table 3, so your splits would also look like [t0, t1, t2, t3], [t3, t4, t5], [t6, t7, t8], [t8, t9].

It doesn't make sense to merge the change if it doesn't provide any performance / correctness benefit. However, this diff made us go deeper into some topics. You have done really great work in understanding badger. We would be happy if you just want to change the comment to make it more understandable / correct.

@hugy718 (Contributor, Author) commented May 6, 2023

All the tables mentioned in the second example are bottom tables, because addSplits only considers the bottom tables; the iterator then iterates through all records in each key range. A top table can definitely be involved in the subcompactions of two splits. The case you raise, where different versions of the same key belong to different tables in a bottom level, will never happen, however: these tables are generated by compaction, and the addKeys lambda in subcompact ensures that adjacent tables produced by a compaction never hold different versions of the same key (i.e. it only checks the conditions that end the add-key loop when the iterator sees a different user key). Therefore, with the fix a bottom table will only be found in one split.
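
A toy sketch of that invariant, not badger's subcompact code: cutTables and the size check are illustrative stand-ins, but the rule that an output table may only end at a user-key boundary is the point being made.

package main

import "fmt"

// entry is a (userKey, version) pair in iteration order:
// same user key, newer versions first.
type entry struct {
    userKey string
    version uint64
}

// cutTables models the invariant: a new output table may only start when the
// user key changes, so different versions of one key never straddle two
// adjacent tables produced by the same compaction. maxPerTable stands in for
// the real size/overlap checks.
func cutTables(entries []entry, maxPerTable int) [][]entry {
    var tables [][]entry
    var cur []entry
    lastKey := ""
    for _, e := range entries {
        // Only consider finishing the current table at a user-key boundary.
        if e.userKey != lastKey && len(cur) >= maxPerTable {
            tables = append(tables, cur)
            cur = nil
        }
        cur = append(cur, e)
        lastKey = e.userKey
    }
    if len(cur) > 0 {
        tables = append(tables, cur)
    }
    return tables
}

func main() {
    entries := []entry{
        {"A", 1}, {"B", 7}, {"C", 3}, {"C", 2}, {"C", 1}, {"D", 5},
    }
    // Even with a small size limit, all versions of "C" land in one table.
    for i, t := range cutTables(entries, 2) {
        fmt.Println("table", i, t)
    }
}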

Performance-wise, there is unlikely to be an improvement, because the block cache is used most of the time and the last block of a bottom table referenced by two splits (processed by two separate goroutines) is likely in cache when the slower split needs it. When the block cache is disabled, placing that block into one split avoids both goroutines fetching it from storage, saving one IO, but that is not significant since a compaction accesses hundreds of blocks.

If you still feel that it is easier to keep the old way, I will revert it and only correct the code comments describing the old way.

@harshil-goel (Contributor)

It makes sense that we don't have keys spilling over into new tables. In that case it makes sense to do the change. Thanks a lot for the diff, I have accepted it.

@mangalaman93 merged commit ef0e552 into dgraph-io:main on May 9, 2023
4 checks passed