Skip to content

Commit 64b58ad

Browse files
committed
manifest: add proof sketch for L0 sublevels
1 parent 64f938f commit 64b58ad

File tree

1 file changed

+264
-0
lines changed

1 file changed

+264
-0
lines changed

internal/manifest/doc.go

Lines changed: 264 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,264 @@
1+
// Copyright 2025 The LevelDB-Go and Pebble Authors. All rights reserved. Use
2+
// of this source code is governed by a BSD-style license that can be found in
3+
// the LICENSE file.
4+
5+
package manifest
6+
7+
/*
8+
9+
Proof Sketch of Correctness of LSM with Sub-Levels (Incomplete):
10+
11+
NB: Ordering of L0 files (from oldest to youngest) is done by (LargestSeqNum,
12+
SmallestSeqNum, FileNum). This is used by sub-level initialization to do a
13+
more sophisticated time and key-span based ordering.
14+
15+
1. History
16+
17+
Consider a scenario where there are no compactions out of L0. Consider the
18+
history of files getting added to L0 and two ways of accumulating that history
19+
into a stack of sstables:
20+
21+
- Simple-stack: Files are simply stacked into levels based on their add
22+
ordering (files added atomically can randomly be ordered in any way).
23+
24+
- Seqnum-stack: Files are stacked based on (LargestSeqNum, SmallestSeqNum,
25+
FileNum).
26+
27+
These two stacks will not be identical after every file add, even though they
28+
contain the same set of files. The trivial divergence is because of the atomic
29+
adds of multiple files. But the real divergence is because we cannot claim
30+
that files are added to L0 in increasing order of LargestSeqNum: ingests that
31+
don't have the same keys as memtables can be added to L0 before the memtables
32+
are flushed.
33+
34+
We make some claims about the invariants at any point of the history, in these
35+
stacks.
36+
37+
Claim 1: If sstable S1 and sstable S2 contain the same user key k and the k
38+
seqnum in S1 is < the k seqnum in S2, then S1 is added to L0 before S2.
39+
40+
Proof by contradiction:
41+
42+
- Flushes only: Consider that S2 is added to L0 before S1.
43+
44+
- S2 and S1 are part of the same flush operation: All instances of k will be
45+
in the same sst (flush splits are always on userkey boundaries).
46+
Contradiction.
47+
48+
- S2 and S1 are in different flushes: Since flushes are sequential, the
49+
flush of S2 must end before the flush of S1 starts. Since we flush
50+
memtables in order of their filling up and all seqnums inside older
51+
memtables are less than all seqnums inside newer memtables, the seqnum of
52+
k in S2 < the seqnum in S1. Contradiction.
53+
54+
- Ingests and flushes: Consider that S2 is added to L0 before S1.
55+
56+
- S2 and S1 are part of the same atomic ingest: Atomic ingests have ssts
57+
with non-overlapping user key bounds, so they cannot contain k in
58+
different ssts. Contradiction.
59+
60+
- S2 and S1 are part of different ingests. They must be assigned different
61+
sst seqnums. Ingests must be added to the LSM in order of their seqnum
62+
(see the discussion here
63+
https://github.com/cockroachdb/pebble/issues/2196#issuecomment-1523461535).
64+
So seqnum of S2 < seqnum of S1. Contradiction.
65+
66+
- S2 is an ingest and S1 is from a flush, and S2 is added first. Cases:
67+
68+
- The seqnum of k in S1 was present in an unflushed memtable when S2
69+
ingestion was requested: So the memtable seqnum for k < S2 seqnum. And k
70+
in the memtable will be flushed first (both with normal ingests and
71+
flushable ingests). Contradiction.
72+
73+
- The seqnum of k in S1 was not present in any memtable when S2 ingestion
74+
was requested: The k in S1 will be assigned a higher seqnum than S2.
75+
Contradiction.
76+
77+
- S2 is a flush and S1 is an ingest, and S2 is added first to L0.
78+
79+
- S2 is from memtable(s) that flushed before the ingest was requested.
80+
Seqnum of k in S2 < Seqnum of k in S1. Contradiction.
81+
82+
- S2 is from memtable(s) that flushed because the ingest was requested.
83+
Seqnum of k in S2 < Seqnum of k in S1. Contradiction.
84+
85+
- The k in S2 was added to the memtable after S1 was assigned a seqnum,
86+
but S2 got added first. This is disallowed by the fix in
87+
https://github.com/cockroachdb/pebble/issues/2196. Contradiction.
88+
89+
Claim 1 is sufficient to prove the level invariant for key k for the
90+
simple-stack, since when S1 has k#t1 and S2 has k#t2, and t1 < t2, S1 is added
91+
first. However it is insufficient to prove the level invariant for
92+
seqnum-stack: say S1 LargestSeqNum LSN1 > S2's LargestSeqNum LSN2. Then the
93+
ordering of levels will be inverted after S2 is added. We address this with
94+
claim 2.
95+
96+
Claim 2: Consider sstable S1 and sstable S2, such that S1 is added to L0
97+
before S2. Then LSN(S1) < LSN(S2) or S1 and S2 have no userkeys in common.
98+
99+
Proof sketch by contradiction:
100+
101+
Consider LSN(S1) >= LSN(S2) and S1 has k#t1 and S2 has k#t2, where t1 < t2. We
102+
will consider the case of LSN(S1) >= t2. If we can contradict this and show
103+
LSN(S1) < t2, then by t2 <= LSN(S2), we have LSN(S1) < LSN(S2).
104+
105+
Cases:
106+
107+
- S1 and S2 are from different flushes: The seqnums are totally ordered across
108+
memtables, so all seqnums in S1 < all seqnums in S2. So LSN(S1) < t2.
109+
Contradiction.
110+
111+
- S1 is a flush and S2 is an ingest:
112+
113+
- S1 was the result of flushing memtables that were immutable when S2
114+
arrived, which has LSN(S2)=t2. Then all seqnums in those immutable
115+
memtables < t2, i.e., LSN(S1) < t2, Contradiction.
116+
117+
- S1 was the result of flushing a mutable memtable when S2 arrived for
118+
ingestion. k#t1 was in this mutable memtable or one of the immutable
119+
memtables that will be flushed together. Consider the sequence of such
120+
memtables M1, M2, Mn, where Mn is the mutable memtable:
121+
122+
- k#t1 is in Mn: S2 waits for the flush of Mn. Can Mn concurrently get a
123+
higher seqnum write (of some other key) > t2 added to it before the
124+
flush. No, because we are holding commitPipeline.mu when assigning t2
125+
and while holding it, we mark Mn as immutable. So the flush of M1 … Mn
126+
has LSN(S1) < t2. Contradiction.
127+
128+
- k#t1 is in M1: Mn becomes full and flushes together with M1. Can happen
129+
but will be fixed in https://github.com/cockroachdb/pebble/issues/2196
130+
by preventing any memtable with seqnum > t2 from flushing.
131+
132+
- S1 is an ingest and S2 is a flush: By definition LSN(S1)=t1. t1 >= t2 is not
133+
possible since by definition t1 < t2.
134+
135+
Claim 1 and claim 2 can together be used to prove the level invariant for the
136+
seqnum-stack: We are given S1 is added before S2 and both have user key k,
137+
with seqnum t1 and t2 respectively. From claim 1, t1 < t2. From claim 2,
138+
LSN(S1) < LSN(s2). So the ordering based on the LSN will not violate the
139+
seqnum invariant.
140+
141+
Since we have the level-invariant for both simple-stack and seqnum-stack,
142+
reads are consistent across both.
143+
144+
A real LSM behaves as a hybrid of these two stacks, since there are
145+
compactions out from L0 at arbitrary points in time. So any reordering of the
146+
stack that is possible in the seqnum-stack may not actually happen, since
147+
those files may no longer be in L0. This hybrid can be shown to be correct
148+
since both the simple-stack and seqnum-stack are correct. This correctness
149+
argument predates the introduction of sub-levels.
150+
151+
TODO(sumeer): proof of level-invariant for the hybrid.
152+
153+
Because of the seqnum-stack being equivalent to the simple-stack, we no longer
154+
worry about future file additions to L0 and only consider what is currently in
155+
L0. We focus on the seqnum-stack and the current L0, and how it is organized
156+
into sub-levels. Sub-levels is a conceptually simple reorganization of the
157+
seqnum-stack in that files that don't overlap in the keyspans, so
158+
pessimistically cannot have conflicting keys, no longer need to stack up on
159+
top of each other. This cannot violate the level invariant since the key span
160+
(in terms of user keys) of S1 which is at a higher level than S2 despite LSN(s1)
161+
< LSN(s2), must be non-overlapping.
162+
163+
2. L0=>Lbase compactions with sub-levels
164+
165+
They cut out a triangle from the bottom of the sub-levels, so will never
166+
compact out a higher seqnum of k while leaving behind a lower seqnum. Once in
167+
Lbase, the seqnums of the files play no part and only the keyspans are used
168+
for future maintenance of the already established level invariant.
169+
TODO(needed): more details.
170+
171+
3. Intra-L0 compactions with sub-levels
172+
173+
They cut out an inverted triangle from the top of the sub-levels. Correctness
174+
here is more complicated because (a) the use of earliest-unflushed-seq-num,
175+
(b) the ordering of output files and untouched existing files is based on
176+
(LargestSeqNum, SmallestSeqNum, FileNum). We consider these in turn.
177+
178+
3.1 Use of earliest-unflushed-seq-num to exclude files
179+
180+
Consider a key span at sub-level i for which all files in the key-span have
181+
LSN >= earliest-unflushed-seq-num (so are excluded). Extend this key span to
182+
include any adjacent files on that sub-level that also have the same property,
183+
then extend it until the end-bounds of the adjacent files that do not satisfy
184+
this property. Consider the rectangle defined by this key-span going all the
185+
way down to sub-level 0. And then start with that key-span in sub-level i-1.
186+
In the following picture -- represents the key span in i and | bars represent
187+
that rectangle defined.
188+
189+
i +++++|----------------|++f2++
190+
i-1 --|--------------++|++f1+++
191+
192+
193+
We claim that the files in this key-span in sub-level i-1 that satisfy this
194+
property cannot extend out of the key-span. This can be proved by
195+
contradiction: if a file f1 at sub-level i-1 extends beyond, there must be a
196+
file at sub-level i, say f2, that did not satisfy this property (otherwise the
197+
maximal keyspan in i would have been wider). Now we know that
198+
earliest-unflushed-seq-num > LSN(f2) and LSN(f1) >=
199+
earliest-unflushed-seq-num. So LSN(f1) > LSN(f2) and they have overlapping
200+
keyspans, which is not possible since f1 is in sub-level i-1 and f2 in
201+
sub-level i. This argument can be continued to claim that the
202+
earliest-unflushed-seq-num property cuts out an inverted triangle from the
203+
sub-levels. Pretending these files are not in the history is ok, since the
204+
final sub-levels will look the same if these were not yet known and we then
205+
added them in the future in LSN order.
206+
207+
3.2 Ordering of output files
208+
209+
The actual files chosen for the intra-L0 compaction also follow the same
210+
inverted triangle pattern. This means we have a contiguous history of the
211+
seqnums for a key participating in the compaction, and anything not
212+
participating has either lower or higher seqnums. The shape can look like the
213+
following picture where . represents the spans chosen for the compaction and -
214+
the spans ignored because of earliest-unflushed-seq-num and x that are older
215+
and not participating.
216+
217+
6 ------------------------
218+
5 --------------------
219+
4 .....----.......
220+
3 ...........
221+
2 ........
222+
1 xxxxxxx....xxxxxxxxxxxxxxxx
223+
0 xxxxxxxxxxxxxxxxxxxxxxxxx
224+
225+
We know the compaction input choice is sound, but the question is whether an
226+
output . produced by the compaction can fall either too low, i.e., lower than
227+
a conflicting x, or end up too high, above a conflicting -. This is because
228+
the choice of sub-level depends on the LSN of the output and not the actual
229+
conflicting key seqnums in the file (the LSN is just a summary). Claim 1 and 2
230+
are insufficient to prove this. Those claims allow for the following sequence
231+
of files (from higher to lower sub-level):
232+
233+
- a#20 f3
234+
. a#10 b#15 f2
235+
x a#5 c#12 f1
236+
237+
If the compaction separates a#10 from b#15 in the output, a#10 can fall below
238+
f1. To prove this cannot happen we need another claim.
239+
240+
Claim 3: if key k is in files S1 and S2, with k#t1, k#t2 with t1 < t2, then
241+
LSN(S1) < t2.
242+
243+
Proof: We have proved this stronger claim when proving claim 2.
244+
245+
Based on claim 3, the above example is impossible, since it would require
246+
LSN(f1) < 10.
247+
248+
Using claim 3 we can prove that even if the intra-L0 compaction writes one
249+
userkey per output sst, the LSN of that output sst will be > the LSN of ssts
250+
with the same userkey that are categorized as x.
251+
252+
Next we need to show that the output won't land higher than - with a
253+
conflicting key, say when we produce a single output file. NB: the narrowest
254+
possible outputs (considered in the previous paragraph, with one key per file)
255+
were the risk in the output sinking too low, and the widest possible output
256+
(considered now) is the risk in staying too high.
257+
258+
The file that was excluded (- file) with the conflicting key has LSN >=
259+
earliest-unflushed-seq-num. By definition there is no point in any of the
260+
participating files that is >= earliest-unflushed-seq-num. So the LSN of this
261+
single output file is < earliest-unflushed-seq-num. Hence the output can't
262+
land higher than the excluded file with the conflicting key.
263+
264+
*/

0 commit comments

Comments
 (0)