perf: investigate N-level compactions (N > 2) #136

Open
petermattis opened this issue May 13, 2019 · 7 comments

@petermattis petermattis commented May 13, 2019

Compactions are currently limited to inputs from 2 levels, but that is a small limitation in the code and the compaction picking heuristics, not a fundamental limitation of compactions. The problem with limiting compactions to 2 levels is that doing so increases write-amplification and complicates policies for picking which compaction to perform. Consider a write-only workload with default options and all new data (so that compactions simply do not result in any space savings, which is a reasonable assumption for any non-bottom-level compaction). Below I'm showing only non-empty levels:

L0  64 MB
L5  64 MB
L6  64 MB

The compaction picker will see this state and perform an L0->L5 compaction resulting in:

L0  0 MB
L5  128 MB
L6  64 MB

If the write workload is turned off, there will be a subsequent compaction at this point from L5->L6:

L0  0 MB
L5  0 MB
L6  192 MB

By doing 2 compactions we've increased write-amplification. Even worse, if the write workload wasn't turned off, we could "starve" the L5->L6 compaction. Let's back up a step to the original L0->L5 compaction and imagine that while it is taking place, another 64 MB L0 table is being flushed. The result is:

L0  64 MB
L5  128 MB
L6  64 MB

The compaction picking heuristics might pick the L5->L6 compaction at this point, but even if they do, pretty soon we'd get into a situation where L0 increases in size faster than L5->L6 compactions can clean it out. Back-pressure is definitely needed at some point (see #7), but I suspect that allowing multi-level compactions will also help.
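
To put rough numbers on the write-amplification difference, here is a back-of-the-envelope calculation in Go (nothing Pebble-specific, just the 64 MB figures from the example above):

```go
package main

import "fmt"

func main() {
	const mb = 1 << 20
	l0, l5, l6 := 64*mb, 64*mb, 64*mb

	// Two separate 2-level compactions: L0->L5 rewrites l0+l5 bytes, then
	// L5->L6 rewrites the resulting (l0+l5) bytes plus l6.
	twoStep := (l0 + l5) + (l0 + l5 + l6)

	// A single L0+L5+L6 compaction rewrites every byte exactly once.
	oneStep := l0 + l5 + l6

	// 320 MB vs 192 MB of compaction writes for the same 64 MB of new data.
	fmt.Printf("2-level: %d MB, multi-level: %d MB\n", twoStep/mb, oneStep/mb)
}
```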

The heuristic that I propose to explore is to expand a compaction to contain the next level if the current output level would become over-full due to the compaction. We estimate the number of new bytes added to the output level as the number of incoming bytes from other levels (this doesn't take into account updated keys, but I believe that is ok). So back to our original state:

L0  64 MB
L5  64 MB
L6  64 MB

We'd decide that an L0->L5 compaction needs to take place, and estimate the new size of L5 as 128 MB. This would exceed the L5 max-size threshold (64 MB since it is the base level) which would cause us to include L6 in the compaction.

The end result should be a decrease in starvation of Lbase->Lbase+1 compactions because we're automatically including them in the L0->Lbase compaction.
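
For illustration, here is a minimal sketch of that expansion rule. The levelInfo type and expandOutputs function are hypothetical stand-ins, not Pebble's actual compaction-picker structures:

```go
package main

import "fmt"

// levelInfo is a stand-in for whatever per-level state the compaction picker tracks.
type levelInfo struct {
	size    int64 // current bytes in the level
	maxSize int64 // target size for the level (e.g. 64 MB for Lbase)
}

// expandOutputs starts from a startLevel->outputLevel compaction and keeps
// folding in the next level while the estimated post-compaction size of the
// current output level exceeds its target. Incoming bytes are used as the
// estimate of new bytes added to the output level (ignoring updated keys).
func expandOutputs(levels []levelInfo, startLevel, outputLevel int) []int {
	inputs := []int{startLevel, outputLevel}
	incoming := levels[startLevel].size
	for out := outputLevel; out < len(levels)-1; out++ {
		estimated := levels[out].size + incoming
		if estimated <= levels[out].maxSize {
			break
		}
		// The output level would become over-full: include the next level too.
		inputs = append(inputs, out+1)
		incoming = estimated
	}
	return inputs
}

func main() {
	const mb = 1 << 20
	levels := make([]levelInfo, 7) // L0..L6; only L0, L5, L6 are non-empty
	levels[0] = levelInfo{size: 64 * mb}
	levels[5] = levelInfo{size: 64 * mb, maxSize: 64 * mb} // Lbase
	levels[6] = levelInfo{size: 64 * mb, maxSize: 640 * mb}

	fmt.Println(expandOutputs(levels, 0, 5)) // [0 5 6]: L6 is pulled into the compaction
}
```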

There is a major caveat to implementing this with the current Pebble compaction setup and mechanisms: L0 might span the entire key space, so an L0->Lbase compaction might involve compacting all of Lbase which would in turn compact all of Lbase+1. There are a variety of ways in which this could be addressed. For example, L0 tables should be split at boundaries that line up with the sstables in Lbase.
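
As a rough illustration of that last point, splitting flushed output into L0 sstables at the smallest keys of the Lbase tables might look like the following (splitFlushOutput is a made-up helper, not Pebble's flush code):

```go
package main

import (
	"fmt"
	"sort"
)

// splitFlushOutput assigns each flushed key to an L0 sstable, cutting new
// sstables at the smallest keys of the Lbase tables. A later L0->Lbase
// compaction then only needs the Lbase tables that one narrow L0 table
// overlaps, rather than all of Lbase.
func splitFlushOutput(keys, baseBoundaries []string) [][]string {
	buckets := make([][]string, len(baseBoundaries)+1)
	for _, k := range keys {
		// The number of Lbase boundaries <= k picks the bucket; keys smaller
		// than every boundary land in bucket 0.
		i := sort.Search(len(baseBoundaries), func(j int) bool { return baseBoundaries[j] > k })
		buckets[i] = append(buckets[i], k)
	}
	return buckets
}

func main() {
	keys := []string{"b", "e", "h", "m", "q", "t"}
	baseBoundaries := []string{"d", "n"} // smallest keys of two Lbase sstables
	fmt.Println(splitFlushOutput(keys, baseBoundaries)) // [[b] [e h m] [q t]]
}
```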

@petermattis petermattis self-assigned this May 13, 2019

@petermattis petermattis commented May 19, 2019

See also #48.

@ajkr ajkr commented May 19, 2019

This makes sense to me for L0, Lbase, and Lbase+1 assuming we do the cutting up of L0 files to align them with Lbase file boundaries. Beyond that, I'd expect it to create compactions that are way larger than they'd normally be, which may go against the goal of reducing Lbase contention. Consider:

  • An LSM where consecutive levels at Lbase+ differ in target size by a factor of 10.
  • Base level is L1. Bottom level is L4.
  • Size targets: target(L1) = 1GB, target(L2) = 10GB, target(L3) = 100GB, target(L4) = 1000GB
  • Current sizes: size(L1) = 2GB, size(L2) = 20GB, size(L3) = 200GB, size(L4) = 2000GB. So all L1+ levels have compaction score 2.0.
  • Let's say all files are 16MB.
  • Now we pick an L1 file (16 MB) to compact. It overlaps with 160MB in L2. Since L2 requires compaction we move on to L3 where we pick up 1600MB of overlapping data. L3 also requires compaction so we continue on to L4 and find 16000MB. Now our compaction size is 17776MB (see the arithmetic sketch after this list). While this is happening, a lot of files can accumulate in L0.
  • If we were following the normal level-based compaction, we would instead do a number of 176MB compactions (picking a 16MB file from the input level and 160MB of overlapping files in the output level). That gives us quite a bit more flexibility in adapting to changes in the LSM. In particular, if lots of files are added to L0, we will very soon have a compaction thread available to do L0->Lbase.
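
A quick back-of-the-envelope version of the numbers in the list above, assuming the uniform 10x overlap between adjacent levels described there:

```go
package main

import "fmt"

func main() {
	const fileMB = 16
	const fanout = 10 // each file overlaps ~10x its own size in the next level

	// Cascading multi-level compaction seeded by one 16 MB L1 file and
	// expanded through L2, L3, and L4.
	total, overlap := 0, fileMB
	for level := 1; level <= 4; level++ {
		total += overlap
		overlap *= fanout
	}
	fmt.Printf("multi-level compaction: %d MB\n", total) // 16+160+1600+16000 = 17776 MB

	// Normal 2-level compaction: one input file plus its next-level overlap.
	fmt.Printf("normal compaction: %d MB\n", fileMB+fileMB*fanout) // 176 MB
}
```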

@petermattis petermattis commented May 19, 2019

> Beyond that, I'd expect it to create compactions that are way larger than they'd normally be, which may go against the goal of reducing Lbase contention.

This exact problem has been on my mind as well. The balanced rent-or-buy strategy and the multi-level strategy both create excessively large (though rare) compactions in order to reduce write-amplification. I don't have a good answer right now for how to avoid that issue. Perhaps horizontal partitioning will help, but I haven't worked out the details to my satisfaction.

@petermattis petermattis commented Aug 4, 2019

I dreamed of another compaction mechanism which may help when N>2 compactions are being performed: we can compact to multiple output levels. Currently, a compaction always outputs to a single level, which makes sense when there are only 2 input levels to the compaction. Consider the following LSM structure:

L0:        h---------r
L1:   c--------l   p------w
L2: a---------k  n-------v

With a 2-level compaction, an L0->L1 compaction would compact [h,r) with [c,l) and [p,w). The new L1 tables would be split on L2 boundaries, perhaps producing [c,k) and [l,w).

L0:
L1:   c-------kl----------w
L2: a---------k  n-------v

A subsequent L1->L2 compaction could then pick either of the L1 tables as an input. The problem seen in #203 arises, though. While doing the L0->L1 compaction, more L0 tables may have been flushed, which requires another L0->L1 compaction, starving L1->L2 compactions.

The proposal above is to include L2 in the compaction. The obvious way to do this is to include any overlapping L2 table. There is an alternative: only include 1 L2 table, but generate new sstables in both L1 and L2. For example, if we include L2:[a,k) in the original L0->L1 compaction, any keys in the range [a,k) would be placed in an sstable in L2, all other keys would be placed in L1 sstables:

L0:
L1:            l----------w
L2: a---------k  n-------v

The effect is that the L1->L2 compaction was "collapsed" into the L0->L1 compaction. I think this mechanism would lessen the need to cut up L0 files along Lbase boundaries. L0->Lbase compactions could involve all of Lbase, but then we'd include 1 or 2 Lbase+1 files so that Lbase stays the same size (or shrinks).
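
A rough sketch of the key-routing part of this idea, where outputTarget is a hypothetical helper (the real merging iterator and sstable writers are omitted):

```go
package main

import "fmt"

// outputTarget decides which level a merged key is written to when the
// compaction may produce sstables in both L1 and L2: keys covered by the
// single included L2 table go to L2, everything else stays in L1.
func outputTarget(key, l2Start, l2End string) int {
	if key >= l2Start && key < l2End {
		return 2 // inside the included L2 table [a,k): write to an L2 sstable
	}
	return 1 // outside [a,k): write to an L1 sstable
}

func main() {
	// Keys drawn from the merged L0/L1 inputs in the example above.
	for _, k := range []string{"c", "h", "j", "l", "p", "w"} {
		fmt.Printf("%s -> L%d\n", k, outputTarget(k, "a", "k"))
	}
	// c, h, j fall in [a,k) and land in L2; l, p, w stay in L1.
}
```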

@ajkr Do you see any problems with this proposed compaction mechanism?

@ajkr ajkr commented Aug 4, 2019

That's a very interesting idea to include just enough files from Lbase+1 to prevent the situation where Lbase is ever-increasing in size during a write burst. I will think about it more. The main questions I'm not sure about are (1) does solving the ever-increasing Lbase problem solve user-visible problems like slow reads due to too many sorted runs or out-of-space? and (2) can this cause Lbase+1 to be ever-increasing in size? I don't know. Trying it out and seeing would be good.

As a side note, I hope we aren't writing off parallel L0->L0 and L0->Lbase (with an ever-increasing Lbase size). I'm confident that technique helps user-visible problems associated with write bursts, despite the apparent contortions of the LSM it makes.

@petermattis petermattis commented Aug 4, 2019

> As a side note, I hope we aren't writing off parallel L0->L0 and L0->Lbase (with an ever-increasing Lbase size). I'm confident that technique helps user-visible problems associated with write bursts, despite the apparent contortions of the LSM it makes.

Definitely not writing them off. Will they be necessary once there is a solution to the ever-growing Lbase problem? I'm not sure. It may turn out they are unnecessary. Or maybe they will be necessary. I haven't thought about it enough yet.

@petermattis petermattis commented Aug 4, 2019

> The main questions I'm not sure about are (1) does solving the ever-increasing Lbase problem solve user-visible problems like slow reads due to too many sorted runs or out-of-space?

Not sure about "solve", but I think it will have an impact. With the current behavior, we're effectively limited to 2 levels: L0 and Lbase. This is because the Lbase->Lbase+1 compactions are starved. That seems bad, and seems like it would affect write amplification negatively and eventually affect read amplification as well.

> and (2) can this cause Lbase+1 to be ever-increasing in size?

Perhaps. Concurrent compactions might be necessary to avoid that. Or we need to keep on adding sstables from deeper levels.

@petermattis petermattis removed their assignment Sep 11, 2019
@petermattis petermattis added this to Incoming in Storage via automation Oct 1, 2019
@petermattis petermattis moved this from Incoming to To Do (next milestone) in Storage Nov 11, 2019