
db: do not cache compaction block reads #2699

Merged: @jbowens merged 3 commits from the compactcache branch into cockroachdb:master on Jul 16, 2023

Conversation

@jbowens (Collaborator) commented Jun 29, 2023

db: do not cache compaction block reads

During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely that any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (#1997). This contention
has been observed to contribute significantly to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may soon be replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread, inducing excessive TLB shootdowns
(#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
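
A minimal sketch of the pool shape this describes, assuming simplified types; the merged implementation stores refcounted cache values and differs in detail, so everything below is illustrative rather than the actual pebble code:

```go
package main

import "fmt"

// allocedBuffer pairs an underlying allocation (v, always non-nil for every
// pool entry) with an in-use slice (b, nil while the buffer is free).
type allocedBuffer struct {
	v []byte
	b []byte
}

// BufferPool holds all buffers owned by a single compaction, including the
// ones currently in use.
type BufferPool struct {
	pool []allocedBuffer
}

// Buf is a handle to one pooled buffer.
type Buf struct {
	p *BufferPool
	i int
}

// Bytes returns the in-use portion of the buffer.
func (b Buf) Bytes() []byte { return b.p.pool[b.i].b }

// Init sizes the pool for the expected number of concurrently held buffers.
func (p *BufferPool) Init(initialSize int) {
	p.pool = make([]allocedBuffer, 0, initialSize)
}

// Alloc returns a buffer of length n, reusing a free pooled buffer when one
// is large enough and growing the pool otherwise. Undersized buffers are
// skipped, so steady-state reuse avoids most trips to the memory allocator.
func (p *BufferPool) Alloc(n int) Buf {
	for i := range p.pool {
		if p.pool[i].b == nil && len(p.pool[i].v) >= n {
			p.pool[i].b = p.pool[i].v[:n]
			return Buf{p: p, i: i}
		}
	}
	v := make([]byte, n)
	p.pool = append(p.pool, allocedBuffer{v: v, b: v})
	return Buf{p: p, i: len(p.pool) - 1}
}

// Release marks the buffer as free so a future Alloc may reuse it. Clearing
// b.p makes an accidental double Release panic rather than silently freeing
// a buffer that has since been handed to another caller (see the review
// discussion below).
func (b *Buf) Release() {
	b.p.pool[b.i].b = nil
	b.p = nil
}

func main() {
	var p BufferPool
	p.Init(4)
	a := p.Alloc(4096)
	fmt.Println(len(a.Bytes())) // 4096
	a.Release()
	b := p.Alloc(1024) // reuses the freed 4096-byte allocation; no new make
	fmt.Println(len(b.Bytes()), len(p.pool)) // 1024 1
	b.Release()
}
```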

sstable: avoid caching meta blocks

When opening an sstable for the first time, we read two 'meta' blocks: one
describes the layout of the sstable, encoding block handles for the index
block, properties block, etc.; the other contains table properties. These
blocks are decoded, the necessary state is copied onto the heap, and the
blocks are then released. In typical configurations with sufficiently large
table caches, these blocks are never read again, or are read again only much
later, after the table has been evicted from the table cache.

This commit uses a BufferPool to hold these temporary blocks and refrains from
adding them to the block cache. This reduces contention on the block cache
mutexes (#1997).
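
Reusing the BufferPool sketch above, the open path might follow the pattern below; blockHandle, Properties, and readTableProperties are hypothetical stand-ins, not the sstable package's API:

```go
import "io"

// blockHandle locates a meta block within the sstable file (stand-in type).
type blockHandle struct {
	offset, length uint64
}

// Properties is a stand-in for the decoded, heap-resident table properties.
type Properties struct {
	raw []byte
}

// readTableProperties reads the properties block into a pool-owned scratch
// buffer, copies what it needs onto the heap, and releases the buffer. The
// block never enters the block cache, so this one-shot read takes no block
// cache mutexes.
func readTableProperties(f io.ReaderAt, h blockHandle, p *BufferPool) (*Properties, error) {
	buf := p.Alloc(int(h.length))
	defer buf.Release()
	if _, err := f.ReadAt(buf.Bytes(), int64(h.offset)); err != nil {
		return nil, err
	}
	// "Decoding" here is just a heap copy; the real code parses the block.
	return &Properties{raw: append([]byte(nil), buf.Bytes()...)}, nil
}
```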



@jbowens force-pushed the compactcache branch 2 times, most recently from d41de74 to 9017c6d on July 13, 2023 18:19
@jbowens marked this pull request as ready for review July 13, 2023 18:59
@jbowens requested review from a team and bananabrick July 13, 2023 19:34
@RaduBerinde (Member) left a comment

:lgtm_strong:

Reviewed 12 of 15 files at r1, 2 of 4 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @bananabrick and @jbowens)


sstable/buffer_pool.go line 77 at r3 (raw file):

	*p = BufferPool{
		cache: cache,
		pool:  pool,

[nit] should we do pool[:0] here so there's no possibility of misuse?
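
A sketch of the suggested guard, in terms of the simplified pool above (the real Init also takes the block cache and a preallocated slice): truncating with pool[:0] means that even if Init is handed a previously populated slice, no stale allocedBuffer entries from an earlier lifecycle remain reachable.

```go
// Init, revised per the nit: always truncate the reused slice to length
// zero so a caller cannot observe leftover entries from a prior use.
func (p *BufferPool) Init(initialSize int) {
	pool := p.pool
	if cap(pool) < initialSize {
		pool = make([]allocedBuffer, 0, initialSize)
	}
	*p = BufferPool{pool: pool[:0]}
}
```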

@sumeerbhola (Collaborator) left a comment

Reviewed 7 of 15 files at r1, 2 of 4 files at r2, 2 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @jbowens)


compaction.go line 2756 at r2 (raw file):

	// Compactions use a pool of buffers to read blocks, avoiding polluting the
	// block cache with blocks that will not be read again.

Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?


sstable/block.go line 405 at r2 (raw file):

	cached      []blockEntry
	cachedBuf   []byte
	cacheHandle cache.Handle

is cacheHandle unused, since we have the cache.Handle inside bufferHandle?


sstable/buffer_pool.go line 49 at r2 (raw file):

	// pool contains all the buffers held by the pool, including buffers that
	// are in-use.
	pool []allocedBuffer

IIUC, the invariant is that for every pool[i] the pool[i].v value is non-nil. If so, can you add a comment.


sstable/buffer_pool.go line 66 at r2 (raw file):

func (p *BufferPool) Init(cache *cache.Cache, initialSize int) {
	*p = BufferPool{
		cache: cache,

I don't see why cache.Alloc and cache.Free are methods on the Cache struct. They don't seem to reference anything in the cache. Could they be functions? Then we wouldn't need cache as a member.


sstable/buffer_pool.go line 139 at r2 (raw file):

	// is no longer in use and a future call to BufferPool.Alloc may reuse this
	// buffer.
	b.p.pool[b.i].b = nil

This code is not setting b.p to nil, which makes me worry that if Release is accidentally called again, it may free a buffer that no longer belongs to the Buf. I suppose we can't do better because Release is a value receiver. I'm guessing that making it a pointer receiver is not going to cause Buf to be heap-allocated, in which case we could fix this.


sstable/reader.go line 3429 at r2 (raw file):

			decompressed = cacheValueOrBuf{buf: bufferPool.Alloc(decodedLen)}
		} else {
			decompressed = cacheValueOrBuf{

so we need two bufs for each block, which means my previous comment may have undercounted by 2x.


sstable/value_block.go line 753 at r2 (raw file):

) (bufferHandle, error) {
	ctx = objiotracing.WithBlockType(ctx, objiotracing.ValueBlock)
	return bpwc.r.readBlock(ctx, h, nil, nil, stats, nil /* buffer pool */)

This case should be rare enough, but if we needed to use the buffer pool we could, since the blockProviderWhenClosed is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.

Alloc and Free do not actually use or record any state on the cache.Cache type.
This commit lifts them up to the package level.
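
A sketch of the resulting shape, with a simplified Value type; the real cache.Value is refcounted and may be allocated outside the Go heap, so this is illustrative only:

```go
package cache

// Value is a simplified stand-in for the cache's manually managed buffer.
type Value struct {
	buf []byte
}

func (v *Value) Buf() []byte { return v.buf }

// Alloc and Free touch no Cache state, so they live at package level.
// Callers like BufferPool no longer need to retain a *Cache member just
// to allocate and free buffers.
func Alloc(n int) *Value { return &Value{buf: make([]byte, n)} }
func Free(v *Value)      { v.buf = nil }
```
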
@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @bananabrick and @sumeerbhola)


compaction.go line 2756 at r2 (raw file):

Previously, sumeerbhola wrote…

Can you add a comment on why 5 is a good value for the initial size?
In the worst case with two-level compactions, do we have up to 2 data blocks, 2 value blocks, 2 rangedel blocks, and 2 rangekey blocks allocated?

In the worst case for a two-level compaction, we have up to 2 index blocks, 2 data blocks, 2 value blocks, n rangedel blocks, and n rangekey blocks, plus 1 ephemeral block used to hold the intermediate compressed data block during a block load. I've bumped this up to 12 and written up a justification.
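
Written out as the accounting might appear in code (the constant name is hypothetical; the merged change carries its own comment):

```go
// Worst case for a two-level compaction: 2 index blocks + 2 data blocks
// + 2 value blocks, plus rangedel and rangekey blocks from the inputs,
// plus 1 ephemeral buffer holding the compressed block while it is being
// decompressed (the "+1" discussed below).
const compactionBufferPoolInitialSize = 12
```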


sstable/block.go line 405 at r2 (raw file):

Previously, sumeerbhola wrote…

is cacheHandle unused, since we have the cache.Handle inside bufferHandle?

Good catch


sstable/buffer_pool.go line 49 at r2 (raw file):

Previously, sumeerbhola wrote…

IIUC, the invariant is that for every pool[i] the pool[i].v value is non-nil. If so, can you add a comment.

Done.


sstable/buffer_pool.go line 66 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't see why cache.Alloc and cache.Free are methods on the Cache struct. They don't seem to reference anything in the cache. Could they be functions? Then we wouldn't need cache as a member.

Very good point. This also lets us remove cache as a member from cacheValueOrBuf


sstable/buffer_pool.go line 139 at r2 (raw file):

Previously, sumeerbhola wrote…

This code is not setting b.p to nil, which makes me worry that if Release is accidentally called again, it may free a buffer that no longer belongs to the Buf. I suppose we can't do better because Release is a value receiver. I'm guessing that making it a pointer receiver is not going to cause Buf to be heap-allocated, in which case we could fix this.

Done.


sstable/buffer_pool.go line 77 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] should we do pool[:0] here so there's no possibility of misuse?

Done


sstable/reader.go line 3429 at r2 (raw file):

Previously, sumeerbhola wrote…

so we need two bufs for each block, which means my previous comment may have undercounted by 2x.

It should be a constant +1 buf, since only one of the buffers survives to the end.


sstable/value_block.go line 753 at r2 (raw file):

Previously, sumeerbhola wrote…

This case should be rare enough, but if we needed to use the buffer pool we could, since the blockProviderWhenClosed is not allowed to outlive the iterator tree, which means it cannot outlive the buffer pool.

Left a TODO here

@bananabrick (Contributor) left a comment

> they may be freed by a different thread, inducing excessive TLB shootdowns

I never thought about this.

Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread from the one that allocated the memory.

> A compaction only requires a relatively small working set of buffers during its scan across input sstables.

Do we really need manual memory allocation for these buffers?

Reviewed 6 of 15 files at r1, 1 of 5 files at r3, 6 of 25 files at r4, 14 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

	for i := 0; i < len(p.pool); i++ {
		if p.pool[i].b == nil {
			if len(p.pool[i].v.Buf()) >= n {

nit: I think there's an invariant that v.buf is never resized, so cap(v.Buf()) == len(v.Buf()) when a Value is used by the BufferPool. Could we mention that in allocedBuffer?

@jbowens (Collaborator, Author) left a comment

> Why is it necessary that a free happens on a different thread to observe a TLB shootdown? If more than one thread is working with some memory and any one of those threads calls free, we'd have the same problem, right? It doesn't necessarily have to be a different thread from the one that allocated the memory.

I believe that's true, but:

  1. For these blocks read during compactions, the expectation is that only the one thread running the compaction goroutine reads this memory, so a free from another thread is the only source of TLB shootdowns.
  2. jemalloc keeps a thread-local 'tcache' of cached objects. Beyond the TLB shootdown, memory freed from another thread defeats the utility of that thread-local tcache, making it less likely that subsequent allocations can be served from it.

> Do we really need manual memory allocation for these buffers?

Not strictly, but we already need to manually account for their lifecycles, so keeping them off the Go heap and out of the GC's view is just a freebie.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @bananabrick and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

nit: I think there's an invariant that v.buf is never resized, so cap(v.Buf()) == len(v.Buf()) when a Value is used by the BufferPool. Could we mention that in allocedBuffer?

while true, it's not an invariant we use in any way

@bananabrick (Contributor) left a comment

:lgtm:

> the expectation is that only the one thread running the compaction goroutine reads this memory, so a free from another thread is the only source of TLB shootdowns.

If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/buffer_pool.go line 99 at r6 (raw file):

Previously, jbowens (Jackson Owens) wrote…

while true, it's not an invariant we use in any way

I figured we're relying on it here: in len(p.pool[i].v.Buf()) >= n we use len instead of cap, and it's correct to use len instead of cap because len == cap.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 3 of 25 files at r4, 9 of 17 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)

@jbowens (Collaborator, Author) left a comment

> If the compaction thread is still reading the memory, then another thread wouldn't call free on it, because the refcount of the Value isn't 0, right?

That's right, but even if the compaction thread is done with the block and has moved on to a subsequent block, the block's mapping may still be in its TLB. I believe that even in this scenario, the processor suffers the cost of a 'TLB shootdown.'

TFTRs!

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @sumeerbhola)

@jbowens merged commit 809057a into cockroachdb:master on Jul 16, 2023 (9 checks passed)
@jbowens deleted the compactcache branch July 16, 2023 02:55