|
| 1 | +// Copyright 2025 The LevelDB-Go and Pebble Authors. All rights reserved. Use |
| 2 | +// of this source code is governed by a BSD-style license that can be found in |
| 3 | +// the LICENSE file. |
| 4 | + |
| 5 | +package manifest |
| 6 | + |
| 7 | +/* |
| 8 | +
|
| 9 | +Proof Sketch of Correctness of LSM with Sub-Levels (Incomplete): |
| 10 | +
|
| 11 | +NB: Ordering of L0 files (from oldest to youngest) is done by (LargestSeqNum, |
| 12 | +SmallestSeqNum, FileNum). This is used by sub-level initialization to do a |
| 13 | +more sophisticated time and key-span based ordering. |
| 14 | +
|
| 15 | +1. History |
| 16 | +
|
| 17 | +Consider a scenario where there are no compactions out of L0. Consider the |
| 18 | +history of files getting added to L0 and two ways of accumulating that history |
| 19 | +into a stack of sstables: |
| 20 | +
|
| 21 | +- Simple-stack: Files are simply stacked into levels based on their add |
| 22 | + ordering (files added atomically can randomly be ordered in any way). |
| 23 | +
|
| 24 | +- Seqnum-stack: Files are stacked based on (LargestSeqNum, SmallestSeqNum, |
| 25 | + FileNum). |
| 26 | +
|
| 27 | +These two stacks will not be identical after every file add, even though they |
| 28 | +contain the same set of files. The trivial divergence is because of the atomic |
| 29 | +adds of multiple files. But the real divergence is because we cannot claim |
| 30 | +that files are added to L0 in increasing order of LargestSeqNum: ingests that |
| 31 | +don't have the same keys as memtables can be added to L0 before the memtables |
| 32 | +are flushed. |
| 33 | +
|
| 34 | +We make some claims about the invariants at any point of the history, in these |
| 35 | +stacks. |
| 36 | +
|
| 37 | +Claim 1: If sstable S1 and sstable S2 contain the same user key k and the k |
| 38 | +seqnum in S1 is < the k seqnum in S2, then S1 is added to L0 before S2. |
| 39 | +
|
| 40 | +Proof by contradiction: |
| 41 | +
|
| 42 | +- Flushes only: Consider that S2 is added to L0 before S1. |
| 43 | +
|
| 44 | + - S2 and S1 are part of the same flush operation: All instances of k will be |
| 45 | + in the same sst (flush splits are always on userkey boundaries). |
| 46 | + Contradiction. |
| 47 | +
|
| 48 | + - S2 and S1 are in different flushes: Since flushes are sequential, the |
| 49 | + flush of S2 must end before the flush of S1 starts. Since we flush |
| 50 | + memtables in order of their filling up and all seqnums inside older |
| 51 | + memtables are less than all seqnums inside newer memtables, the seqnum of |
| 52 | + k in S2 < the seqnum in S1. Contradiction. |
| 53 | +
|
| 54 | +- Ingests and flushes: Consider that S2 is added to L0 before S1. |
| 55 | +
|
| 56 | + - S2 and S1 are part of the same atomic ingest: Atomic ingests have ssts |
| 57 | + with non-overlapping user key bounds, so they cannot contain k in |
| 58 | + different ssts. Contradiction. |
| 59 | +
|
| 60 | + - S2 and S1 are part of different ingests. They must be assigned different |
| 61 | + sst seqnums. Ingests must be added to the LSM in order of their seqnum |
| 62 | + (see the discussion here |
| 63 | + https://github.com/cockroachdb/pebble/issues/2196#issuecomment-1523461535). |
| 64 | + So seqnum of S2 < seqnum of S1. Contradiction. |
| 65 | +
|
| 66 | + - S2 is an ingest and S1 is from a flush, and S2 is added first. Cases: |
| 67 | +
|
| 68 | + - The seqnum of k in S1 was present in an unflushed memtable when S2 |
| 69 | + ingestion was requested: So the memtable seqnum for k < S2 seqnum. And k |
| 70 | + in the memtable will be flushed first (both with normal ingests and |
| 71 | + flushable ingests). Contradiction. |
| 72 | +
|
| 73 | + - The seqnum of k in S1 was not present in any memtable when S2 ingestion |
| 74 | + was requested: The k in S1 will be assigned a higher seqnum than S2. |
| 75 | + Contradiction. |
| 76 | +
|
| 77 | + - S2 is a flush and S1 is an ingest, and S2 is added first to L0. |
| 78 | +
|
| 79 | + - S2 is from memtable(s) that flushed before the ingest was requested. |
| 80 | + Seqnum of k in S2 < Seqnum of k in S1. Contradiction. |
| 81 | +
|
| 82 | + - S2 is from memtable(s) that flushed because the ingest was requested. |
| 83 | + Seqnum of k in S2 < Seqnum of k in S1. Contradiction. |
| 84 | +
|
| 85 | + - The k in S2 was added to the memtable after S1 was assigned a seqnum, |
| 86 | + but S2 got added first. This is disallowed by the fix in |
| 87 | + https://github.com/cockroachdb/pebble/issues/2196. Contradiction. |
| 88 | +
|
| 89 | +Claim 1 is sufficient to prove the level invariant for key k for the |
| 90 | +simple-stack, since when S1 has k#t1 and S2 has k#t2, and t1 < t2, S1 is added |
| 91 | +first. However it is insufficient to prove the level invariant for |
| 92 | +seqnum-stack: say S1 LargestSeqNum LSN1 > S2's LargestSeqNum LSN2. Then the |
| 93 | +ordering of levels will be inverted after S2 is added. We address this with |
| 94 | +claim 2. |
| 95 | +
|
| 96 | +Claim 2: Consider sstable S1 and sstable S2, such that S1 is added to L0 |
| 97 | +before S2. Then LSN(S1) < LSN(S2) or S1 and S2 have no userkeys in common. |
| 98 | +
|
| 99 | +Proof sketch by contradiction: |
| 100 | +
|
| 101 | +Consider LSN(S1) >= LSN(S2) and S1 has k#t1 and S2 has k#t2, where t1 < t2. We |
| 102 | +will consider the case of LSN(S1) >= t2. If we can contradict this and show |
| 103 | +LSN(S1) < t2, then by t2 <= LSN(S2), we have LSN(S1) < LSN(S2). |
| 104 | +
|
| 105 | +Cases: |
| 106 | +
|
| 107 | +- S1 and S2 are from different flushes: The seqnums are totally ordered across |
| 108 | + memtables, so all seqnums in S1 < all seqnums in S2. So LSN(S1) < t2. |
| 109 | + Contradiction. |
| 110 | +
|
| 111 | +- S1 is a flush and S2 is an ingest: |
| 112 | +
|
| 113 | + - S1 was the result of flushing memtables that were immutable when S2 |
| 114 | + arrived, which has LSN(S2)=t2. Then all seqnums in those immutable |
| 115 | + memtables < t2, i.e., LSN(S1) < t2, Contradiction. |
| 116 | +
|
| 117 | + - S1 was the result of flushing a mutable memtable when S2 arrived for |
| 118 | + ingestion. k#t1 was in this mutable memtable or one of the immutable |
| 119 | + memtables that will be flushed together. Consider the sequence of such |
| 120 | + memtables M1, M2, Mn, where Mn is the mutable memtable: |
| 121 | +
|
| 122 | + - k#t1 is in Mn: S2 waits for the flush of Mn. Can Mn concurrently get a |
| 123 | + higher seqnum write (of some other key) > t2 added to it before the |
| 124 | + flush. No, because we are holding commitPipeline.mu when assigning t2 |
| 125 | + and while holding it, we mark Mn as immutable. So the flush of M1 … Mn |
| 126 | + has LSN(S1) < t2. Contradiction. |
| 127 | +
|
| 128 | + - k#t1 is in M1: Mn becomes full and flushes together with M1. Can happen |
| 129 | + but will be fixed in https://github.com/cockroachdb/pebble/issues/2196 |
| 130 | + by preventing any memtable with seqnum > t2 from flushing. |
| 131 | +
|
| 132 | +- S1 is an ingest and S2 is a flush: By definition LSN(S1)=t1. t1 >= t2 is not |
| 133 | + possible since by definition t1 < t2. |
| 134 | +
|
| 135 | +Claim 1 and claim 2 can together be used to prove the level invariant for the |
| 136 | +seqnum-stack: We are given S1 is added before S2 and both have user key k, |
| 137 | +with seqnum t1 and t2 respectively. From claim 1, t1 < t2. From claim 2, |
| 138 | +LSN(S1) < LSN(s2). So the ordering based on the LSN will not violate the |
| 139 | +seqnum invariant. |
| 140 | +
|
| 141 | +Since we have the level-invariant for both simple-stack and seqnum-stack, |
| 142 | +reads are consistent across both. |
| 143 | +
|
| 144 | +A real LSM behaves as a hybrid of these two stacks, since there are |
| 145 | +compactions out from L0 at arbitrary points in time. So any reordering of the |
| 146 | +stack that is possible in the seqnum-stack may not actually happen, since |
| 147 | +those files may no longer be in L0. This hybrid can be shown to be correct |
| 148 | +since both the simple-stack and seqnum-stack are correct. This correctness |
| 149 | +argument predates the introduction of sub-levels. |
| 150 | +
|
| 151 | +TODO(sumeer): proof of level-invariant for the hybrid. |
| 152 | +
|
| 153 | +Because of the seqnum-stack being equivalent to the simple-stack, we no longer |
| 154 | +worry about future file additions to L0 and only consider what is currently in |
| 155 | +L0. We focus on the seqnum-stack and the current L0, and how it is organized |
| 156 | +into sub-levels. Sub-levels is a conceptually simple reorganization of the |
| 157 | +seqnum-stack in that files that don't overlap in the keyspans, so |
| 158 | +pessimistically cannot have conflicting keys, no longer need to stack up on |
| 159 | +top of each other. This cannot violate the level invariant since the key span |
| 160 | +(in terms of user keys) of S1 which is at a higher level than S2 despite LSN(s1) |
| 161 | +< LSN(s2), must be non-overlapping. |
| 162 | +
|
| 163 | +2. L0=>Lbase compactions with sub-levels |
| 164 | +
|
| 165 | +They cut out a triangle from the bottom of the sub-levels, so will never |
| 166 | +compact out a higher seqnum of k while leaving behind a lower seqnum. Once in |
| 167 | +Lbase, the seqnums of the files play no part and only the keyspans are used |
| 168 | +for future maintenance of the already established level invariant. |
| 169 | +TODO(needed): more details. |
| 170 | +
|
| 171 | +3. Intra-L0 compactions with sub-levels |
| 172 | +
|
| 173 | +They cut out an inverted triangle from the top of the sub-levels. Correctness |
| 174 | +here is more complicated because (a) the use of earliest-unflushed-seq-num, |
| 175 | +(b) the ordering of output files and untouched existing files is based on |
| 176 | +(LargestSeqNum, SmallestSeqNum, FileNum). We consider these in turn. |
| 177 | +
|
| 178 | +3.1 Use of earliest-unflushed-seq-num to exclude files |
| 179 | +
|
| 180 | +Consider a key span at sub-level i for which all files in the key-span have |
| 181 | +LSN >= earliest-unflushed-seq-num (so are excluded). Extend this key span to |
| 182 | +include any adjacent files on that sub-level that also have the same property, |
| 183 | +then extend it until the end-bounds of the adjacent files that do not satisfy |
| 184 | +this property. Consider the rectangle defined by this key-span going all the |
| 185 | +way down to sub-level 0. And then start with that key-span in sub-level i-1. |
| 186 | +In the following picture -- represents the key span in i and | bars represent |
| 187 | +that rectangle defined. |
| 188 | +
|
| 189 | +i +++++|----------------|++f2++ |
| 190 | +i-1 --|--------------++|++f1+++ |
| 191 | +
|
| 192 | +
|
| 193 | +We claim that the files in this key-span in sub-level i-1 that satisfy this |
| 194 | +property cannot extend out of the key-span. This can be proved by |
| 195 | +contradiction: if a file f1 at sub-level i-1 extends beyond, there must be a |
| 196 | +file at sub-level i, say f2, that did not satisfy this property (otherwise the |
| 197 | +maximal keyspan in i would have been wider). Now we know that |
| 198 | +earliest-unflushed-seq-num > LSN(f2) and LSN(f1) >= |
| 199 | +earliest-unflushed-seq-num. So LSN(f1) > LSN(f2) and they have overlapping |
| 200 | +keyspans, which is not possible since f1 is in sub-level i-1 and f2 in |
| 201 | +sub-level i. This argument can be continued to claim that the |
| 202 | +earliest-unflushed-seq-num property cuts out an inverted triangle from the |
| 203 | +sub-levels. Pretending these files are not in the history is ok, since the |
| 204 | +final sub-levels will look the same if these were not yet known and we then |
| 205 | +added them in the future in LSN order. |
| 206 | +
|
| 207 | +3.2 Ordering of output files |
| 208 | +
|
| 209 | +The actual files chosen for the intra-L0 compaction also follow the same |
| 210 | +inverted triangle pattern. This means we have a contiguous history of the |
| 211 | +seqnums for a key participating in the compaction, and anything not |
| 212 | +participating has either lower or higher seqnums. The shape can look like the |
| 213 | +following picture where . represents the spans chosen for the compaction and - |
| 214 | +the spans ignored because of earliest-unflushed-seq-num and x that are older |
| 215 | +and not participating. |
| 216 | +
|
| 217 | +6 ------------------------ |
| 218 | +5 -------------------- |
| 219 | +4 .....----....... |
| 220 | +3 ........... |
| 221 | +2 ........ |
| 222 | +1 xxxxxxx....xxxxxxxxxxxxxxxx |
| 223 | +0 xxxxxxxxxxxxxxxxxxxxxxxxx |
| 224 | +
|
| 225 | +We know the compaction input choice is sound, but the question is whether an |
| 226 | +output . produced by the compaction can fall either too low, i.e., lower than |
| 227 | +a conflicting x, or end up too high, above a conflicting -. This is because |
| 228 | +the choice of sub-level depends on the LSN of the output and not the actual |
| 229 | +conflicting key seqnums in the file (the LSN is just a summary). Claim 1 and 2 |
| 230 | +are insufficient to prove this. Those claims allow for the following sequence |
| 231 | +of files (from higher to lower sub-level): |
| 232 | +
|
| 233 | +- a#20 f3 |
| 234 | +. a#10 b#15 f2 |
| 235 | +x a#5 c#12 f1 |
| 236 | +
|
| 237 | +If the compaction separates a#10 from b#15 in the output, a#10 can fall below |
| 238 | +f1. To prove this cannot happen we need another claim. |
| 239 | +
|
| 240 | +Claim 3: if key k is in files S1 and S2, with k#t1, k#t2 with t1 < t2, then |
| 241 | +LSN(S1) < t2. |
| 242 | +
|
| 243 | +Proof: We have proved this stronger claim when proving claim 2. |
| 244 | +
|
| 245 | +Based on claim 3, the above example is impossible, since it would require |
| 246 | +LSN(f1) < 10. |
| 247 | +
|
| 248 | +Using claim 3 we can prove that even if the intra-L0 compaction writes one |
| 249 | +userkey per output sst, the LSN of that output sst will be > the LSN of ssts |
| 250 | +with the same userkey that are categorized as x. |
| 251 | +
|
| 252 | +Next we need to show that the output won't land higher than - with a |
| 253 | +conflicting key, say when we produce a single output file. NB: the narrowest |
| 254 | +possible outputs (considered in the previous paragraph, with one key per file) |
| 255 | +were the risk in the output sinking too low, and the widest possible output |
| 256 | +(considered now) is the risk in staying too high. |
| 257 | +
|
| 258 | +The file that was excluded (- file) with the conflicting key has LSN >= |
| 259 | +earliest-unflushed-seq-num. By definition there is no point in any of the |
| 260 | +participating files that is >= earliest-unflushed-seq-num. So the LSN of this |
| 261 | +single output file is < earliest-unflushed-seq-num. Hence the output can't |
| 262 | +land higher than the excluded file with the conflicting key. |
| 263 | +
|
| 264 | +*/ |
0 commit comments