
db: do not cache compaction block reads #2699

Merged: @jbowens merged 3 commits from the compactcache branch into cockroachdb:master on Jul 16, 2023

Conversation

@jbowens (Collaborator) commented Jun 29, 2023

db: do not cache compaction block reads

During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely that any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (#1997). This contention
has been observed to contribute significantly to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may soon be replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread, inducing excessive TLB shootdowns
(#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
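
A minimal sketch of the pool shape this describes, assuming simplified types; the merged implementation stores refcounted cache values and differs in detail, so everything below is illustrative rather than the actual pebble code:

```go
package main

import "fmt"

// allocedBuffer pairs an underlying allocation (v, always non-nil for every
// pool entry) with an in-use slice (b, nil while the buffer is free).
type allocedBuffer struct {
	v []byte
	b []byte
}

// BufferPool holds all buffers owned by a single compaction, including the
// ones currently in use.
type BufferPool struct {
	pool []allocedBuffer
}

// Buf is a handle to one pooled buffer.
type Buf struct {
	p *BufferPool
	i int
}

// Bytes returns the in-use portion of the buffer.
func (b Buf) Bytes() []byte { return b.p.pool[b.i].b }

// Init sizes the pool for the expected number of concurrently held buffers.
func (p *BufferPool) Init(initialSize int) {
	p.pool = make([]allocedBuffer, 0, initialSize)
}

// Alloc returns a buffer of length n, reusing a free pooled buffer when one
// is large enough and growing the pool otherwise. Undersized buffers are
// skipped, so steady-state reuse avoids most trips to the memory allocator.
func (p *BufferPool) Alloc(n int) Buf {
	for i := range p.pool {
		if p.pool[i].b == nil && len(p.pool[i].v) >= n {
			p.pool[i].b = p.pool[i].v[:n]
			return Buf{p: p, i: i}
		}
	}
	v := make([]byte, n)
	p.pool = append(p.pool, allocedBuffer{v: v, b: v})
	return Buf{p: p, i: len(p.pool) - 1}
}

// Release marks the buffer as free so a future Alloc may reuse it. Clearing
// b.p makes an accidental double Release panic rather than silently freeing
// a buffer that has since been handed to another caller (see the review
// discussion below).
func (b *Buf) Release() {
	b.p.pool[b.i].b = nil
	b.p = nil
}

func main() {
	var p BufferPool
	p.Init(4)
	a := p.Alloc(4096)
	fmt.Println(len(a.Bytes())) // 4096
	a.Release()
	b := p.Alloc(1024) // reuses the freed 4096-byte allocation; no new make
	fmt.Println(len(b.Bytes()), len(p.pool)) // 1024 1
	b.Release()
}
```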

sstable: avoid caching meta blocks

When opening an sstable for the first time, we read two 'meta' blocks: one
describes the layout of the sstable, encoding block handles for the index
block, properties block, etc.; the other contains table properties. These
blocks are decoded, the necessary state is copied onto the heap, and the
blocks are then released. In typical configurations with sufficiently large
table caches, these blocks are never read again, or are read again only much
later, after the table has been evicted from the table cache.

This commit uses a BufferPool to hold these temporary blocks and refrains from
adding them to the block cache. This reduces contention on the block cache
mutexes (#1997).
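
Reusing the BufferPool sketch above, the open path might follow the pattern below; blockHandle, Properties, and readTableProperties are hypothetical stand-ins, not the sstable package's API:

```go
import "io"

// blockHandle locates a meta block within the sstable file (stand-in type).
type blockHandle struct {
	offset, length uint64
}

// Properties is a stand-in for the decoded, heap-resident table properties.
type Properties struct {
	raw []byte
}

// readTableProperties reads the properties block into a pool-owned scratch
// buffer, copies what it needs onto the heap, and releases the buffer. The
// block never enters the block cache, so this one-shot read takes no block
// cache mutexes.
func readTableProperties(f io.ReaderAt, h blockHandle, p *BufferPool) (*Properties, error) {
	buf := p.Alloc(int(h.length))
	defer buf.Release()
	if _, err := f.ReadAt(buf.Bytes(), int64(h.offset)); err != nil {
		return nil, err
	}
	// "Decoding" here is just a heap copy; the real code parses the block.
	return &Properties{raw: append([]byte(nil), buf.Bytes()...)}, nil
}
```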



@jbowens force-pushed the compactcache branch 2 times, most recently from d41de74 to 9017c6d on July 13, 2023 18:19
@jbowens marked this pull request as ready for review July 13, 2023 18:59
@jbowens requested review from a team and bananabrick July 13, 2023 19:34
@RaduBerinde (Member) left a comment

:lgtm_strong:

Reviewed 12 of 15 files at r1, 2 of 4 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @bananabrick and @jbowens)


sstable/buffer_pool.go line 77 at r3 (raw file):

	*p = BufferPool{
		cache: cache,
		pool:  pool,

[nit] should we do pool[:0] here so there's no possibility of misuse?
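
A sketch of the suggested guard, in terms of the simplified pool above (the real Init also takes the block cache and a preallocated slice): truncating with pool[:0] means that even if Init is handed a previously populated slice, no stale allocedBuffer entries from an earlier lifecycle remain reachable.

```go
// Init, revised per the nit: always truncate the reused slice to length
// zero so a caller cannot observe leftover entries from a prior use.
func (p *BufferPool) Init(initialSize int) {
	pool := p.pool
	if cap(pool) < initialSize {
		pool = make([]allocedBuffer, 0, initialSize)
	}
	*p = BufferPool{pool: pool[:0]}
}
```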

@sumeerbhola (Collaborator) left a comment

Reviewed 7 of 15 files at r1, 2 of 4 files at r2, 2 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @jbowens)


compaction.go line 2756 at r2 (raw file):

	// Compactions use a pool of buffers to read blocks, avoiding polluting the
	// block cache with blocks that will not be read again.

Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?


sstable/block.go line 405 at r2 (raw file):

	cached      []blockEntry
	cachedBuf   []byte
	cacheHandle cache.Handle

is cacheHandle unused, since we have the cache.Handle inside bufferHandle?


sstable/buffer_pool.go line 49 at r2 (raw file):

	// pool contains all the buffers held by the pool, including buffers that
	// are in-use.
	pool []allocedBuffer

IIUC, the invariant is that for every pool[i] the pool[i].v value is non-nil. If so, can you add a comment.


sstable/buffer_pool.go line 66 at r2 (raw file):

func (p *BufferPool) Init(cache *cache.Cache, initialSize int) {
	*p = BufferPool{
		cache: cache,

I don't see why cache.Alloc and cache.Free are methods on the Cache struct. They don't seem to reference anything in the cache. Could they be functions? Then we wouldn't need cache as a member.


sstable/buffer_pool.go line 139 at r2 (raw file):

	// is no longer in use and a future call to BufferPool.Alloc may reuse this
	// buffer.
	b.p.pool[b.i].b = nil

This code is not setting b.p to nil, which makes me worry that if Release is accidentally called again, it may free a buffer that no longer belongs to the Buf. I suppose we can't do better because Release is a value receiver. I'm guessing that making it a pointer receiver is not going to cause Buf to be heap-allocated, in which case we could fix this.


sstable/reader.go line 3429 at r2 (raw file):

			decompressed = cacheValueOrBuf{buf: bufferPool.Alloc(decodedLen)}
		} else {
			decompressed = cacheValueOrBuf{

so we need two bufs for each block, which means my previous comment may have undercounted by 2x.


sstable/value_block.go line 753 at r2 (raw file):

) (bufferHandle, error) {
	ctx = objiotracing.WithBlockType(ctx, objiotracing.ValueBlock)
	return bpwc.r.readBlock(ctx, h, nil, nil, stats, nil /* buffer pool */)

This case should be rare enough, but if we needed to use the buffer pool we could, since the blockProviderWhenClosed is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.

Alloc and Free do not actually use or record any state on the cache.Cache type.
This commit lifts them up to the package level.
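
A sketch of the resulting shape, with a simplified Value type; the real cache.Value is refcounted and may be allocated outside the Go heap, so this is illustrative only:

```go
package cache

// Value is a simplified stand-in for the cache's manually managed buffer.
type Value struct {
	buf []byte
}

func (v *Value) Buf() []byte { return v.buf }

// Alloc and Free touch no Cache state, so they live at package level.
// Callers like BufferPool no longer need to retain a *Cache member just
// to allocate and free buffers.
func Alloc(n int) *Value { return &Value{buf: make([]byte, n)} }
func Free(v *Value)      { v.buf = nil }
```
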
@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @bananabrick and @sumeerbhola)


compaction.go line 2756 at r2 (raw file):

Previously, sumeerbhola wrote…

Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?

In the worst case for a two-level compaction, we have up to 2 index blocks, 2 data blocks, 2 value blocks, n rangedel blocks, and n rangekey blocks, plus 1 ephemeral block used to hold the intermediate compressed data block during a block load. I've bumped this up to 12 and written up a justification.
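
Written out as the accounting might appear in code (the constant name is hypothetical; the merged change carries its own comment):

```go
// Worst case for a two-level compaction: 2 index blocks + 2 data blocks
// + 2 value blocks, plus rangedel and rangekey blocks from the inputs,
// plus 1 ephemeral buffer holding the compressed block while it is being
// decompressed (the "+1" discussed below).
const compactionBufferPoolInitialSize = 12
```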


sstable/block.go line 405 at r2 (raw file):

Previously, sumeerbhola wrote…

is cacheHandle unused, since we have the cache.Handle inside bufferHandle?

Good catch


sstable/buffer_pool.go line 49 at r2 (raw file):

Previously, sumeerbhola wrote…

IIUC, the invariant is that for every pool[i] the pool[i].v value is non-nil. If so, can you add a comment.

Done.


sstable/buffer_pool.go line 66 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't see why cache.Alloc and cache.Free are methods on the Cache struct. They don't seem to reference anything in the cache. Could they be functions? Then we wouldn't need cache as a member.

Very good point. This also lets us remove cache as a member from cacheValueOrBuf


sstable/buffer_pool.go line 139 at r2 (raw file):

Previously, sumeerbhola wrote…

This code is not setting b.p to nil, which makes me worry that if Release is accidentally called again, it may free a buffer that no longer belongs to the Buf. I suppose we can't do better because Release is a value receiver. I'm guessing that making it a pointer receiver is not going to cause Buf to be heap-allocated, in which case we could fix this.

Done.


sstable/buffer_pool.go line 77 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] should we do pool[:0] here so there's no possibility of misuse?

Done


sstable/reader.go line 3429 at r2 (raw file):

Previously, sumeerbhola wrote…

so we need two bufs for each block, which means my previous comment may have undercounted by 2x.

It should be a constant +1 buf, since only one of the buffers survives to the end.


sstable/value_block.go line 753 at r2 (raw file):

Previously, sumeerbhola wrote…

This case should be rare enough, but if we needed to use the buffer pool we could, since the blockProviderWhenClosed is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.

Left a TODO here

@bananabrick (Contributor) left a comment

> they may be freed by a different thread, inducing excessive TLB shootdowns

I never thought about this.

Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread from the one that allocated the memory.

> A compaction only requires a relatively small working set of buffers during its scan across input sstables.

Do we really need manual memory allocation for these buffers?

Reviewed 6 of 15 files at r1, 1 of 5 files at r3, 6 of 25 files at r4, 14 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

	for i := 0; i < len(p.pool); i++ {
		if p.pool[i].b == nil {
			if len(p.pool[i].v.Buf()) >= n {

nit: I think there's an invariant that v.buf is never resized, so cap(v.Buf()) == len(v.Buf()) when a Value is used by the BufferPool. Could we mention that in allocedBuffer?

@jbowens (Collaborator, Author) left a comment

> Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread from the one that allocated the memory.

I believe that's true, but:

  1. For these blocks read during compactions, the expectation is that only the one thread running the compaction goroutine reads this memory, so a free from another thread is the only source of TLB shootdowns.
  2. jemalloc keeps a thread-local 'tcache' of cached objects. Beyond the TLB shootdown, memory freed from another thread defeats the utility of that thread-local tcache, making it less likely that subsequent allocations can be served from it.

> Do we really need manual memory allocation for these buffers?

Not strictly, but we already need to manually account for their lifecycles, so keeping them off the Go heap and out of the GC's view is just a freebie.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

nit: I think there's an invariant that v.buf is never resized, so cap(v.Buf()) == len(v.Buf()) when a Value is used by the BufferPool. Could we mention that in allocedBuffer?

while true, it's not an invariant we use in any way

@bananabrick (Contributor) left a comment

:lgtm:

> the expectation is that only the one thread running the compaction goroutine reads this memory, so a free from another thread is the only source of TLB shootdowns.

If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

Previously, jbowens (Jackson Owens) wrote…

while true, it's not an invariant we use in any way

I figured we're relying on it here: in len(p.pool[i].v.Buf()) >= n we use len instead of cap, and it's correct to use len instead of cap because len == cap.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 3 of 25 files at r4, 9 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)

@jbowens (Collaborator, Author) left a comment

> If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?

That's right, but even if the compaction thread is done with the block and has moved on to a subsequent block, the block's mapping may still be in its TLB. I believe that even in this scenario, the processor suffers the cost of a 'TLB shootdown.'

TFTRs!

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @sumeerbhola)

@jbowens merged commit 809057a into cockroachdb:master on Jul 16, 2023 (9 checks passed)
@jbowens deleted the compactcache branch July 16, 2023 02:55