
raftentry: rewrite storage.raftEntryCache with dense, per-replica state #32618

Merged
merged 1 commit into cockroachdb:master from ajwerner:ajwerner/raft_entry_cache Nov 29, 2018

Conversation

@ajwerner
Copy link
Collaborator

ajwerner commented Nov 26, 2018

Problem

This blocking profile is filtered to show all Mutex contention (ignoring other
forms like channels and Cond vars). We can see that blocking in the
raftEntryCache is responsible for 90% of all Mutex contention. This seems
absurd, but it adds up given that the cache is responsible for 3% of CPU
utilization on a node and requires mutual exclusion across an entire Store.

We've also seen in changes like #29596 how expensive these cache accesses have
become, especially as the cache grows because most entry accesses take log(n)
time, where n is the number of raftpb.Entrys in the cache across all Ranges
on a Store.

Rewrite

This PR rewrites the raftEntryCache as a new storage/raftentry.Cache type.
The rewrite diverges from the original implementation in two important ways,
both of which exploit and optimize for the unique access patterns that this
cache is subject to.

Per Replica Cache State

The first change is to move from a single global data tree with per-entry LRU
state to per-range caches which evict entire ranges. This makes maintaining the
LRU linked-list significantly cheaper because we only need to update it once per
Replica access instead of once per entry access. It also means that the
linked-list will be significantly smaller, resulting in far fewer memory
allocations and a reduced GC footprint.

The per-replica state also enables more work to be done concurrently,
dramatically reducing the amount of work done while holding the cache mutex.
Because individual ranges are never accessed concurrently, optimistically
computing space requirements allows operations on different ranges to proceed
at the same time.
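
As a rough sketch of the shape this takes (the names below are illustrative stand-ins, not the exact types in this PR):

```
package raftentrysketch

import (
	"container/list"
	"sync"
)

// rangeID and entry stand in for roachpb.RangeID and raftpb.Entry.
type rangeID int64

type entry struct {
	Index uint64
	Data  []byte
}

// cache is the store-wide object: one mutex guards the partition map,
// the LRU ordering of partitions, and the global byte accounting, while
// each partition independently guards its own entries.
type cache struct {
	mu       sync.Mutex
	maxBytes int64
	bytes    int64
	parts    map[rangeID]*partition
	lru      *list.List // of *partition; front = most recently used
}

// partition holds the cached entries for a single range. Eviction and
// LRU bookkeeping operate on whole partitions, so the linked list is
// touched once per Replica access rather than once per entry access.
type partition struct {
	mu    sync.Mutex
	id    rangeID
	elem  *list.Element
	bytes int64
	ents  []entry // dense, contiguous run of raft entries
}

func newCache(maxBytes int64) *cache {
	return &cache{
		maxBytes: maxBytes,
		parts:    make(map[rangeID]*partition),
		lru:      list.New(),
	}
}
```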

In order to maintain a cap on the total amount of data which can be stored in
the cache, the size of a cache addition is computed before looking at the cache,
and space is then freed so that the entire write will fit. The downside of this
approach is that it can lead to overly aggressive evictions in cases where large
amounts of data which are already present in the cache are added again. This
edge case is not particularly concerning as it implies that the cache is poorly
sized for the workload.
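
Continuing the sketch, the add path then looks roughly like this: guess the write's size up front, hold the store-wide lock only long enough to reclaim space and update the accounting, and do the copy under just the partition's lock. Again, this is a simplification of what the PR actually does:

```
// addEntries sketches the optimistic accounting described above.
func (c *cache) addEntries(id rangeID, ents []entry) {
	var guessed int64
	for i := range ents {
		guessed += int64(len(ents[i].Data))
	}

	c.mu.Lock()
	c.evictLocked(guessed) // free whole partitions until the write fits
	p := c.parts[id]
	if p == nil {
		p = &partition{id: id}
		p.elem = c.lru.PushFront(p)
		c.parts[id] = p
	} else {
		c.lru.MoveToFront(p.elem)
	}
	c.bytes += guessed
	c.mu.Unlock()

	// Operations on different ranges proceed concurrently from here on.
	p.mu.Lock()
	p.ents = append(p.ents, ents...) // dense-window maintenance elided
	p.bytes += guessed
	p.mu.Unlock()
}

// evictLocked drops least-recently-used partitions until a pending
// write of size need fits under maxBytes. Caller must hold c.mu.
func (c *cache) evictLocked(need int64) {
	for c.bytes+need > c.maxBytes && c.lru.Len() > 0 {
		victim := c.lru.Remove(c.lru.Back()).(*partition)
		delete(c.parts, victim.id)
		c.bytes -= victim.bytes // elides synchronization with in-flight writes
	}
}
```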

LLRB -> Ring Buffers

The second change is that the rewrite trades in the balanced binary tree
structure for a per-replica ring buffer structure. This structure is preferable
because it improves the time complexity of entry access. It's also more memory
efficient and allocation friendly because it takes advantage of builtin Go
slices and their compile-time specialization. Finally, it should be more GC
friendly because it cuts down on the number of pointers in use.

It turns out that Raft entries are densely ordered with discrete indices.
This means that given a low and a high index, we can simply iterate through the
range efficiently, without the need for an ordered data-structure. This property
was also exploited in 6e4e57f. The problem with a dense data structure is that
it makes dealing with disjoint ranges cumbersome and wasteful without further
specialization. Rather than shouldering the complexity of caching disjoint
ranges, this implementation chooses to ignore writes which precede and are
non-adjacent to the currently cached values. When non-adjacent inserts occur
which follow the currently cached range, the currently cached values are
evicted in favor of the new values. This policy seeks to avoid failing to cache
new entries in the face of long-lived demand for old entries.
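
A stripped-down sketch of the dense addressing and the gap policy described above, reusing the illustrative entry type from the earlier sketch (the real buffer also handles in-place growth, wrap-around reuse, and byte limits):

```
// ringBuf caches a dense window of raft entries for one range. A raft
// index maps directly to a buffer position, so lookups are offset
// arithmetic rather than tree searches.
type ringBuf struct {
	buf   []entry
	head  int    // buffer position of the oldest cached entry
	len   int    // number of cached entries
	first uint64 // raft index of the oldest cached entry
}

// get returns the cached entries in [lo, hi), or nil on a miss.
func (b *ringBuf) get(lo, hi uint64) []entry {
	if b.len == 0 || lo < b.first || lo >= b.first+uint64(b.len) {
		return nil
	}
	if end := b.first + uint64(b.len); hi > end {
		hi = end
	}
	out := make([]entry, 0, hi-lo)
	for idx := lo; idx < hi; idx++ {
		pos := (b.head + int(idx-b.first)) % len(b.buf)
		out = append(out, b.buf[pos])
	}
	return out
}

// add applies the policy from the text: writes that precede and do not
// touch the cached window are ignored, while non-adjacent writes past
// the window evict the old contents in favor of the new ones.
func (b *ringBuf) add(ents []entry) {
	if len(ents) == 0 {
		return
	}
	lo, hi := ents[0].Index, ents[len(ents)-1].Index
	switch {
	case b.len == 0 || lo > b.first+uint64(b.len): // empty, or a gap after the window
		b.buf = append([]entry(nil), ents...)
		b.head, b.len, b.first = 0, len(ents), lo
	case hi+1 < b.first: // gap before the window: ignore
		return
	default:
		// Overlapping or adjacent writes extend/overwrite in place
		// (omitted here; this is the interesting part of the real code).
	}
}
```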

Performance

name                 old time/op    new time/op    delta
EntryCache-8           4.69ms ± 9%    0.18ms ± 3%  -96.27%  (p=0.000 n=8+8)
EntryCacheClearTo-8     456µs ± 3%      58µs ± 0%  -87.29%  (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
EntryCache-8            113kB ± 0%      75kB ± 0%  -33.43%  (p=0.002 n=7+8)
EntryCacheClearTo-8    8.31kB ± 0%    1.15kB ± 0%  -86.14%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
EntryCache-8            2.02k ± 0%     0.00k ± 0%  -99.85%  (p=0.000 n=8+8)
EntryCacheClearTo-8      14.0 ± 0%       1.0 ± 0%  -92.86%  (p=0.000 n=8+8)

on 3x n1-standard-32

$ ./workload run kv '{pgurl:1-3}' --init --splits=512 --duration 180s --read-percent=90 --concurrency=512
name       old ops/s  new ops/s  delta
KV90       100k ± 0%  108k ± 1%  +8.05%  (p=0.004 n=6+5)
KV90       100k ± 0%  107k ± 2%  +6.81%  (p=0.002 n=6+6)
KV90       101k ± 1%  107k ± 0%  +6.25%  (p=0.010 n=6+4)

Release note (performance improvement): Rewrite Raft entry cache to optimize for access patterns, reduce lock contention, and reduce memory footprint.

@ajwerner ajwerner requested a review from nvanbenschoten Nov 26, 2018

@ajwerner ajwerner requested a review from cockroachdb/core-prs as a code owner Nov 26, 2018

@cockroach-teamcity


Copy link
Member

cockroach-teamcity commented Nov 26, 2018

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from ee16a65 to f2659d6 Nov 26, 2018

@nvanbenschoten
Copy link
Member

nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 183 at r1 (raw file):

func (c *Cache) recordUpdate(p *partition, bytesAdded, bytesGuessed, entriesAdded int) {
	c.mu.Lock()

Planning on giving this a more thorough review tomorrow morning, but I have one question already. Do you think it's possible to avoid this second lock? You have a nice optimization to guess at the impact that a set of entries will have on the size of the cache during the initial lock. This allows us to guarantee that we never need to evict in this mutual exclusion zone. I'd hope this would allow us to avoid this second lock entirely!

Could we push some of this logic into the partition and perform the rest (e.g. c.bytes += delta) using atomic operations? The gauges are already thread-safe, so then the only other thing to worry about is deleting empty partitions. This seems like a rare enough case to justify the lock. Maybe we never do this unless the replica is being removed from the Store?
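
Roughly what I have in mind, with illustrative field names rather than the ones in this change, and assuming every other access to these counters also goes through sync/atomic:

```
// Sketch only: charge the guessed size under the first lock, then apply
// the correction lock-free once the real size is known.
func (c *Cache) recordUpdate(p *partition, bytesAdded, bytesGuessed int64) {
	delta := bytesAdded - bytesGuessed
	atomic.AddInt64(&c.bytes, delta) // correct the store-wide accounting
	atomic.AddInt64(&p.bytes, delta) // correct the partition's accounting
	// Deleting now-empty partitions would still need Cache.mu, but that
	// should be rare enough to justify taking the lock only in that case.
}
```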

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from f2659d6 to a4cdc75 Nov 26, 2018

@nvanbenschoten


Copy link
Member

nvanbenschoten commented Nov 26, 2018

Also, what was the command you used to run kv5, and what kind of cluster did you run it on? I'm curious whether you performed any splits and what your load concurrency was.

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Here's the command:
./workload run kv '{pgurl:1-3}' --init --splits=4096 --duration 60s --read-percent=5 --concurrency=4096

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 183 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Planning on giving this a more thorough review tomorrow morning, but I have one question already. Do you think it's possible to avoid this second lock? You have a nice optimization to guess at the impact that a set of entries will have on the size of the cache during the initial lock. This allows us to guarantee that we never need to evict in this mutual exclusion zone. I'd hope this would allow us to avoid this second lock entirely!

Could we push some of this logic into the partition and perform the rest (e.g. c.bytes += delta) using atomic operations? The gauges are already thread-safe, so then the only other thing to worry about is deleting empty partitions. This seems like a rare enough case to justify the lock. Maybe we never do this unless the replica is being removed from the Store?

Seems like it'll take a little bit of thought to make sure the bookkeeping stays correct in the case of evictions but doesn't seem crazy. Let me see what I can come up with. Making the partition deletion explicit seems pretty reasonable and ultimately probably a better solution. In the interim I think we can avoid grabbing the lock in cases where the size doesn't go to zero. Let me take another pass and then we'll see where we are.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from a4cdc75 to 568421f Nov 27, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 183 at r1 (raw file):

Previously, ajwerner wrote…

Seems like it'll take a little bit of thought to make sure the bookkeeping stays correct in the case of evictions but doesn't seem crazy. Let me see what I can come up with. Making the partition deletion explicit seems pretty reasonable and ultimately probably a better solution. In the interim I think we can avoid grabbing the lock in cases where the size doesn't go to zero. Let me take another pass and then we'll see where we are.

Got rid of the second locking. For now we just defer to the LRU to remove empty partitions. I could see giving empty partitions some size in the accounting so that we don't leak memory by growing a huge list of empty partitions, but it seems like it would take a pretty pathological workload, one where we never actually fill the cache, to get to that place.

Master
New cache with single locking

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch 2 times, most recently from ac676ca to 3ff1755 Nov 27, 2018

@nvanbenschoten
Copy link
Member

nvanbenschoten left a comment

I did a first pass looking at the ringBuffer implementation. Mostly just nits, I like the structure.

We can get rid of pkg/storage/entry_cache.go and pkg/storage/entry_cache_test.go, right?

Reviewed 5 of 11 files at r1, 2 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/ring_buffer.go, line 23 at r2 (raw file):

// ringBuf is a ring buffer of raft entries.
type ringBuf struct {

It's probably worth adding a few small unit tests for this type to make sure the resizing is working as expected.


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

	before, after, ok := computeExtension(b, ents[0].Index, ents[len(ents)-1].Index)
	if !ok {
		return

So if there's a gap then we don't add anything? I was expecting that we'd replace what was already there, but maybe that's not right. What's the trade-off here?


pkg/storage/raftentry/ring_buffer.go, line 45 at r2 (raw file):

	extend(b, before, after)
	it := first(b)
	if before == 0 && after != b.len {

Drop a comment like // Ignore unchanged prefix. here. Took me some time to figure out what it was doing.


pkg/storage/raftentry/ring_buffer.go, line 66 at r2 (raw file):

	}
	it, ok := first(b), true
	firstIndex := int(it.index(b))

Super nit: I'd leave firstIndex as a uint64 and cast after the subtraction to keep it clear that this is an operation in the Raft log index domain and not in the buffer index domain.


pkg/storage/raftentry/ring_buffer.go, line 94 at r2 (raw file):

func (b *ringBuf) scan(
	ents []raftpb.Entry, id roachpb.RangeID, lo, hi, maxBytes uint64,

Why do you need id?


pkg/storage/raftentry/ring_buffer.go, line 119 at r2 (raw file):

// realloc copies the data from the current slice into new buf leaving
// before buf untouched

This comment looks like it was written when the function had different params.


pkg/storage/raftentry/ring_buffer.go, line 121 at r2 (raw file):

// before buf untouched
func realloc(b *ringBuf, before, newSize int) {
	newBuf := make([]raftpb.Entry, newSize)

I think you're working on this now, but we probably want a minimum len here. Maybe 16?


pkg/storage/raftentry/ring_buffer.go, line 152 at r2 (raw file):

}

func computeExtension(b *ringBuf, lo, hi uint64) (before, after int, ok bool) {

It was pretty unclear to me that before and after weren't indexes, but were the number of elements that need to be prepended/appended. Add a comment to make that clear.


pkg/storage/raftentry/ring_buffer.go, line 157 at r2 (raw file):

	}
	first, last := first(b).index(b), last(b).index(b)
	if lo > (last+1) || hi < (first-1) {

This is the case where you have a gap, right? Drop a small comment here.


pkg/storage/raftentry/ring_buffer.go, line 169 at r2 (raw file):

}

// iterator is an index into a ringBuf

missing period at the end.

Actually, missing on a lot of these comments. Our style guide adheres to https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences.


pkg/storage/raftentry/ring_buffer.go, line 170 at r2 (raw file):

// iterator is an index into a ringBuf
type iterator int

Cool idea! It might be more natural to have the type be struct { b *ringBuf, idx int } though, so that the methods don't also need to take the same *ringBuf. You'll still be able to chain methods as long as you never use a pointer receiver.

If you do make this change, double check that it doesn't cause anything to escape.


pkg/storage/raftentry/ring_buffer.go, line 180 at r2 (raw file):

}

func first(b *ringBuf) iterator {

Is first inclusive or exclusive? How about last? Let's add a comment making that clear.

If it's exclusive then is it safe to call on an empty buffer?


pkg/storage/raftentry/ring_buffer.go, line 184 at r2 (raw file):

}

// unsafe to call on empty buffer

Feel free to make this panic in this case. We'll be switching to go1.11 soon enough, which permits functions with panics to still be inlined.


pkg/storage/raftentry/ring_buffer.go, line 190 at r2 (raw file):

// unsafe to call on empty buffer
func (it iterator) next(b *ringBuf) (_ iterator, ok bool) {

The usual API for iterators is a next method and a valid method. This would be possible if iterator was more than just an index. You might consider this as an alternative to the "unsafe to call on empty buffer" comments and the return value here.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from 3ff1755 to 6c71eb3 Nov 27, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Dealt with most of your comments as well as the lint failures. I didn't change the policy or the iterator structure yet. I also have yet to add the specific ringBuf unit tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/ring_buffer.go, line 23 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's probably worth adding a few small unit tests for this type to make sure the resizing is working as expected.

I made sure to hit all of the edge cases of ringBuf resizing through the public API but will add a more focused set of tests.


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So if there's a gap then we don't add anything? I was expecting that we'd replace what was already there, but maybe that's not right. What's the trade-off here?

I commented in the commit that it might be worthwhile to overwrite rather than ignore non-overlapping entries that are after the currently cached range. While testing I added a metric for how many entries we drop before and after due to this condition, and without the trimming optimization under kv5 we never dropped any due to this condition. I imagine there are probably some cases where we could pull entries into the cache to catch up another node and then miss new traffic, which seems bad. It feels wrong, however, to overwrite the cached entries with lower-index entries. I'm happy to change the policy; this just seemed like the simplest first step.


pkg/storage/raftentry/ring_buffer.go, line 45 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Drop a comment like // Ignore unchanged prefix. here. Took me some time to figure out what it was doing.

Done.


pkg/storage/raftentry/ring_buffer.go, line 66 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Super nit: I'd leave firstIndex as a uint64 and cast after the subtraction to keep it clear that this is an operation in the Raft log index domain and not in the buffer index domain.

Done.


pkg/storage/raftentry/ring_buffer.go, line 94 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why do you need id?

Don't


pkg/storage/raftentry/ring_buffer.go, line 119 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment looks like it was written when the function had different params.

Done.


pkg/storage/raftentry/ring_buffer.go, line 121 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you're working on this now, but we probably want a minimum len here. Maybe 16?

This is already done in reallocSize


pkg/storage/raftentry/ring_buffer.go, line 169 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

missing period at the end.

Actually, missing on a lot of these comments. Our style guide adheres to https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences.

Done.


pkg/storage/raftentry/ring_buffer.go, line 170 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Cool idea! It might be more natural to have the type be struct { b *ringBuf, idx int } though, so that the methods don't also need to take the same *ringBuf. You'll still be able to chain methods as long as you never use a pointer receiver.

If you do make this change, double check that it doesn't cause anything to escape.

I don't feel strongly here. I can make the change if you want.


pkg/storage/raftentry/ring_buffer.go, line 180 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is first inclusive or exclusive? How about last? Let's add a comment making that clear.

If it's exclusive then is it safe to call on an empty buffer?

Added a comment. It is inclusive. It's always safe to call but will create an invalid iterator if b.len is 0.


pkg/storage/raftentry/ring_buffer.go, line 184 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Feel free to make this panic in this case. We'll be switching to go1.11 soon enough, which permits functions with panics to still be inlined.

It already will panic with a division by zero

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/ring_buffer.go, line 152 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It was pretty unclear to me that before and after weren't indexes, but were the number of elements that need to be prepended/appended. Add a comment to make that clear.

Done.


pkg/storage/raftentry/ring_buffer.go, line 157 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is the case where you have a gap, right? Drop a small comment here.

Done.


pkg/storage/raftentry/ring_buffer.go, line 190 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The usual API for iterators is a next method and a valid method. This would be possible if iterator was more than just an index. You might consider this as an alternative to the "unsafe to call on empty buffer" comments and the return value here.

We can revisit on the next pass if you still want the change.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch 5 times, most recently from 9db2b4c to c6d307d Nov 27, 2018

@nvanbenschoten
Copy link
Member

nvanbenschoten left a comment

This is looking really good!

Reviewed 1 of 11 files at r1, 6 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 67 at r3 (raw file):

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
	"github.com/google/btree"

What happened here?


pkg/storage/raftentry/cache.go, line 46 at r3 (raw file):

must be acquired while holding Cache.mu

If we do this then we're back to a single global lock for the entire Cache whenever we have contention on this mutex. I'm not convinced we need this condition if we still only allow a single concurrent writer per Replica.


pkg/storage/raftentry/cache.go, line 53 at r3 (raw file):

// rangeCache represents the interface that the partition uses.
// It is never explicitly used but a new implementation to replace ringBuf must

👍


pkg/storage/raftentry/cache.go, line 96 at r3 (raw file):

	c.evictLocked(bytesGuessed)
	// get p again in case we just evicted everything
	p := c.getPartLocked(id, true /* create */, false /* recordUse */)

We can wrap this in a if len(c.parts) == 0 { condition, right? Should be faster and easier to understand.


pkg/storage/raftentry/cache.go, line 141 at r3 (raw file):

		c.metrics.Hits.Inc(1)
	}
	return

We almost never like naked returns in the codebase. There were a few cases in the ring buffer that were passable, but this one seems worse because we're setting the values in a nested scope. I'd strongly lean towards avoiding them, especially in a package's external facing API.


pkg/storage/raftentry/cache.go, line 175 at r3 (raw file):

func (c *Cache) getPartLocked(id roachpb.RangeID, create, recordUse bool) *partition {
	part := c.parts[id]
	if create {

if create && part == nil {


pkg/storage/raftentry/cache.go, line 177 at r3 (raw file):

	if create {
		if part == nil {
			newPart := &partition{id: id}

part = &partition{id: id}?


pkg/storage/raftentry/cache.go, line 207 at r3 (raw file):

	p *partition, orig cacheSize, bytesAdded, bytesGuessed, entriesAdded int32,
) {
	// the only way that the stats here could change is if we were evicted so

s/the/The/


pkg/storage/raftentry/cache.go, line 212 at r3 (raw file):

	delta := bytesAdded - bytesGuessed
	new := orig.add(delta, entriesAdded)
	if p.setSize(orig, new) {

nit: notEvicted := p.setSize(orig, new) so this is explicit. Might even be worth asserting that the partition is evicted if this check fails, to guard against concurrent updates.

As is, will this work with concurrent updates?


pkg/storage/raftentry/cache.go, line 231 at r3 (raw file):

func (p *partition) evict() (bytes, entries int32) {
	const evicted = 0

What happens if orig is 0 for an update attempt and then the partition gets evicted? Do we need an out-of-band value here to indicate that the partition is evicted? 1 would work because a partition can't be 1 byte with 0 entries.


pkg/storage/raftentry/cache.go, line 247 at r3 (raw file):

}

// analyzeEntries calculates the size in bytes of ents as well and ensures that

s/as well//


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

}

func BenchmarkEntryCache(b *testing.B) {

Why the increase in allocation size in this benchmark? Resizing of the ring buffer?


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

Previously, ajwerner wrote…

I commented in the commit that it might be worthwhile to overwrite rather than ignore non-overlapping entries that are after the currently cached range. While testing I added a metric for how many entries we drop before and after due to this condition, and without the trimming optimization under kv5 we never dropped any due to this condition. I imagine there are probably some cases where we could pull entries into the cache to catch up another node and then miss new traffic, which seems bad. It feels wrong, however, to overwrite the cached entries with lower-index entries. I'm happy to change the policy; this just seemed like the simplest first step.

I think I'd prefer the possibility of thrashing vs. the possibility of the cache getting stuck and never retaining anything. Those are the two failure modes that fall out of this decision, right?


pkg/storage/raftentry/ring_buffer.go, line 152 at r2 (raw file):

Previously, ajwerner wrote…

Done.

Much better, thanks.


pkg/storage/raftentry/ring_buffer.go, line 169 at r2 (raw file):

Previously, ajwerner wrote…

Done.

Still missing the period here and in a few comments below.


pkg/storage/raftentry/ring_buffer.go, line 170 at r2 (raw file):

Previously, ajwerner wrote…

I don't feel strongly here. I can make the change if you want.

If not then I'm fine with leaving it.


pkg/storage/raftentry/ring_buffer.go, line 180 at r2 (raw file):

Previously, ajwerner wrote…

Added a comment. It is inclusive. It's always safe to call but will create an invalid iterator if b.len is 0.

I think the "not valid" state is the most interesting reason to increase the amount of state held by this iterator. We could also just define -1 as an invalid state, set that wherever necessary, and have a func (it iterator) valid() bool { return it > -1 } method.
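
Something like this, as a sketch (the head/len fields are a guess at the ringBuf internals):

```
const invalid iterator = -1

func (it iterator) valid() bool { return it >= 0 }

func first(b *ringBuf) iterator {
	if b.len == 0 {
		return invalid
	}
	return iterator(b.head) // position of the oldest cached entry
}
```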

@petermattis
Copy link
Contributor

petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 40 at r3 (raw file):

	mu    syncutil.Mutex
	lru   list.List

In the past, we've inlined the implementation of the circularly linked list for these performance sensitive data structures. Doing so avoids an allocation for list.Element. See util/cache.Entry for an example of doing so.
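
Roughly the shape I mean, with placeholder names; util/cache.Entry is the real-world example:

```
// Each partition embeds its own links, so moving it to the front of the
// LRU ring allocates nothing (vs. a list.Element per partition).
type partition struct {
	prev, next *partition
	// ... the rest of the partition's state ...
}

// moveToFront relinks p directly behind the ring's sentinel element.
func (p *partition) moveToFront(sentinel *partition) {
	p.prev.next = p.next
	p.next.prev = p.prev
	p.next = sentinel.next
	p.prev = sentinel
	sentinel.next.prev = p
	sentinel.next = p
}
```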


pkg/storage/raftentry/cache.go, line 46 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

must be acquired while holding Cache.mu

If we do this then we're back to a single global lock for the entire Cache whenever we have contention on this mutex. I'm not convinced we need this condition if we still only allow a single concurrent writer per Replica.

I agree with @nvanbenschoten regarding locking. Something else to take a look at is the commitQueue in pebble which is a lock-free single-producer multiple-consumer queue. I don't quite see how that could be adapted to ringBuf as resizing the slice in a lock-free manner seems difficult. Pointing it out in case one of you sees a possibility.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from c6d307d to 3354029 Nov 28, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Addressed most of the feedback and expanded the testing. Just two minor things left to do:

  1. inlining the linked list for the LRU
  2. giving empty partitions some space in the cache bookkeeping to avoid piling up empty partitions

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 67 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What happened here?

Sorry was mucking around with goimports, must have killed a space and then had them sorted. What's our usual style?

import (
   <stdlib>

   <3rdparty>
   
   <first party>
)

?


pkg/storage/raftentry/cache.go, line 40 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In the past, we've inlined the implementation of the circularly linked list for these performance sensitive data structures. Doing so avoids an allocation for list.Element. See util/cache.Entry for an example of doing so.

Can do.


pkg/storage/raftentry/cache.go, line 46 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

must be acquired while holding Cache.mu

If we do this then we're back to a single global lock for the entire Cache whenever we have contention on this mutex. I'm not convinced we need this condition if we still only allow a single concurrent writer per Replica.

Yes, fair point. We still need some locking on the partition, though, because reads and writes can occur concurrently. Addressed by moving the partition locking out from under the cache lock and adding a comment that concurrent writes are not allowed.


pkg/storage/raftentry/cache.go, line 53 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍

Done.


pkg/storage/raftentry/cache.go, line 96 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We can wrap this in a if len(c.parts) == 0 { condition, right? Should be faster and easier to understand.

Done.


pkg/storage/raftentry/cache.go, line 141 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We almost never like naked returns in the codebase. There were a few cases in the ring buffer that were passable, but this one seems worse because we're setting the values in a nested scope. I'd strongly lean towards avoiding them, especially in a package's external facing API.

Done.


pkg/storage/raftentry/cache.go, line 175 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

if create && part == nil {

Done.


pkg/storage/raftentry/cache.go, line 177 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

part = &partition{id: id}?

Done.


pkg/storage/raftentry/cache.go, line 207 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/the/The/

Done.


pkg/storage/raftentry/cache.go, line 212 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: notEvicted := p.setSize(orig, new) so this is explicit. Might even be worth asserting that the partition is evicted if this check fails, to guard against concurrent updates.

As is, will this work with concurrent updates?

I did the nit. Can you help me understand the concern about concurrent updates? Also renamed the variables to not shadow the new keyword.


pkg/storage/raftentry/cache.go, line 231 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What happens if orig is 0 for an update attempt and then the partition gets evicted? Do we need an out-of-band value here to indicate that the partition is evicted? 1 would work because a partition can't be 1 byte with 0 entries.

I think this case is dealt with already because we add the "guessedBytes" to orig for an add under Cache.mu and evict here is always called under Cache.mu. If orig here is 0, then either no other goroutine is accessing this replica or if a goroutine is currently clearing the replica, it has already atomically taken its space (in recordUpdate) and will update the Cache bookkeeping when it gets the lock. It is impossible for a routine to enter recordUpdate with positive numbers to add with an orig value of 0. This does however highlight a bug in evictLocked whereby we may have evicted all of the partitions and still exceed the cache because the goroutine which will soon free the cache bytes needs the lock to do so. I fixed this by bailing out of the cache loop once all partitions have been evicted.

In summary, it seems like the check for cs != evicted, while probably fine, is causing confusion and so I've removed it.
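
For reference, the rough shape of the bookkeeping, simplified and with illustrative packing (the real cacheSize and partition fields differ in detail):

```
// Sketch only; assumes "sync/atomic" and a partition with a size uint64 field.
type cacheSize uint64

func makeCacheSize(bytes, entries uint32) cacheSize {
	return cacheSize(uint64(bytes)<<32 | uint64(entries))
}

func (cs cacheSize) bytes() uint32   { return uint32(cs >> 32) }
func (cs cacheSize) entries() uint32 { return uint32(cs) }

// 0 is reserved to mean evicted; once a partition reaches this state it
// never transitions out of it.
const evicted cacheSize = 0

// setSize CASes the packed size. It can only fail if the partition was
// concurrently evicted, since concurrent writes to one partition are
// not allowed.
func (p *partition) setSize(orig, next cacheSize) (updated bool) {
	return atomic.CompareAndSwapUint64(&p.size, uint64(orig), uint64(next))
}
```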


pkg/storage/raftentry/cache.go, line 247 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/as well//

Done.


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why the increase in allocation size in this benchmark? Resizing of the ring buffer?

honestly this is straight copied from your diff and from the previous incarnation of the test

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/entry_cache_test.go#L146

func BenchmarkEntryCache(b *testing.B) {

so you tell me?


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think I'd prefer the possibility of thrashing vs. the possibility of the cache getting stuck and never retaining anything. Those are the two failure modes that fall out of this decision, right?

Yes, that sounds right to me. I changed the logic to detect adds after the current range and clear the cache.


pkg/storage/raftentry/ring_buffer.go, line 180 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think the "not valid" state is the most interesting reason to increase the amount of state held by this iterator. We could also just define -1 as an invalid state, set that wherever necessary, and have a func (it iterator) valid () bool { return it > -1 } method.

Done.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from 3354029 to 31fea4a Nov 28, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 67 at r3 (raw file):

Previously, ajwerner wrote…

Sorry was mucking around with goimports, must have killed a space and then had them sorted. What's our usual style?

import (
   <stdlib>

   <3rdparty>
   
   <first party>
)

?

Done.


pkg/storage/raftentry/cache.go, line 40 at r3 (raw file):

Previously, ajwerner wrote…

Can do.

Done.


pkg/storage/raftentry/cache.go, line 46 at r3 (raw file):

Previously, ajwerner wrote…

Yes, fair point. We still need some locking on the partition, though, because reads and writes can occur concurrently. Addressed by moving the partition locking out from under the cache lock and adding a comment that concurrent writes are not allowed.

Done.


pkg/storage/raftentry/ring_buffer.go, line 170 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If not then I'm fine with leaving it.

Done.

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from 31fea4a to d291b6e Nov 28, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

Previously, ajwerner wrote…

Yes, that sounds right to me. I changed the logic to detect adds after the current range and clear the cache.

Done.

@nvanbenschoten
Copy link
Member

nvanbenschoten left a comment

:lgtm: once the final batch of comments is resolved. Really nice job on this! I suspect we could actually find workloads where the effect of this is even more dramatic.

Reviewed 5 of 5 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 212 at r3 (raw file):

Previously, ajwerner wrote…

I did the nit. Can you help me understand the concern about concurrent updates? Also renamed the variables to not shadow the new keyword.

The concern with concurrent updates is that the CAS operation might fail even if the partition wasn't evicted, simply because the second update already modified the partition before grabbing the partition lock. We don't allow concurrent updates though, so this is a non-issue.


pkg/storage/raftentry/cache.go, line 231 at r3 (raw file):

Previously, ajwerner wrote…

I think this case is dealt with already because we add the "guessedBytes" to orig for an add under Cache.mu and evict here is always called under Cache.mu. If orig here is 0, then either no other goroutine is accessing this replica or if a goroutine is currently clearing the replica, it has already atomically taken its space (in recordUpdate) and will update the Cache bookkeeping when it gets the lock. It is impossible for a routine to enter recordUpdate with positive numbers to add with an orig value of 0. This does however highlight a bug in evictLocked whereby we may have evicted all of the partitions and still exceed the cache because the goroutine which will soon free the cache bytes needs the lock to do so. I fixed this by bailing out of the cache loop once all partitions have been evicted.

In summary, it seems like the check for cs != evicted, while probably fine, is causing confusion and so I've removed it.

Ah, I see. Yeah, this is now easier to understand.

Might be worth adding somewhere that this is all easy to reason about because once we move to the evicted state, we can never have a transition away from that state.


pkg/storage/raftentry/cache.go, line 52 at r4 (raw file):

//
// Cache is designed to be a shared store-wide object which incurs low
// contention for operations on different ranges. This is achieved through the

Might want to end this sentence with "while maintaining a global memory policy".


pkg/storage/raftentry/cache.go, line 53 at r4 (raw file):

// Cache is designed to be a shared store-wide object which incurs low
// contention for operations on different ranges. This is achieved through the
// use ofa two-level locking scheme. Cache.mu is acquired to access any data in

"of a"


pkg/storage/raftentry/cache.go, line 66 at r4 (raw file):

date

state?


pkg/storage/raftentry/cache.go, line 74 at r4 (raw file):

mediated the

mediated through the


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

Previously, ajwerner wrote…

honestly this is straight copied from your diff and from the previous incarnation of the test

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/entry_cache_test.go#L146

func BenchmarkEntryCache(b *testing.B) {

so you tell me?

You're right that that attempt also had a similar effect. That adds weight to the suspicion that this is a result of overestimation in partition sizing (i.e. vector resizing in this change and map resizing in the previous one).

Might be worth dropping this benchmark into a memory profile just to verify that everything lines up and matches expectations, but don't let that hold up the change.


pkg/storage/raftentry/cache_test.go, line 307 at r4 (raw file):

c.Metrics().Bytes.Value()

Can we compute what this should be by peeking at the number of partitions left in the cache? That will allow us to test the byte count handling as well.


pkg/storage/raftentry/ring_buffer.go, line 41 at r2 (raw file):

Previously, ajwerner wrote…

Done.

Nice!


pkg/storage/raftentry/ring_buffer.go, line 135 at r4 (raw file):

// reallocLen returns a new length which is greater than neededSize and is
// a power-of-two multiple of minBufSize.
func reallocLen(curSize, neededSize int) (newLen int) {

This seems overly complex. Do we even need curSize? Will something like this do?

func reallocLen(need int) int {
	if need <= minBufSize {
		return minBufSize
	}
	f := bits.UintSize - bits.LeadingZeros(uint(need))
	return 1 << uint(f)
}
@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 231 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Ah, I see. Yeah, this is now easier to understand.

Might be worth adding somewhere that this is all easy to reason about because once we move to the evicted state, we can never have a transition away from that state.

Done.


pkg/storage/raftentry/cache.go, line 52 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Might want to end this sentence with "while maintaining a global memory policy".

Done.


pkg/storage/raftentry/cache.go, line 53 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"of a"

Done.


pkg/storage/raftentry/cache.go, line 66 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
date

state?

cache state, fixed


pkg/storage/raftentry/cache.go, line 74 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
mediated the

mediated through the

Done.


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You're right that that attempt also had a similar effect. That adds weight to the suspicion that this is a result of overestimation in partition sizing (i.e. vector resizing in this change and map resizing in the previous one).

Might be worth dropping this benchmark into a memory profile just to verify that everything lines up and matches expectations, but don't let that hold up the change.

Honestly, looking at the cache size in the ClearTo benchmark revealed a problem: the cache size was smaller than required to store the data, so the entries were being ignored completely. I've updated the micro-benchmarks.


pkg/storage/raftentry/cache_test.go, line 307 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
c.Metrics().Bytes.Value()

Can we compute what this should be by peeking at the number of partitions left in the cache? That will allow us to test the byte count handling as well.

sure, done


pkg/storage/raftentry/ring_buffer.go, line 135 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This seems overly complex. Do we even need curSize? Will something like this do?

func reallocLen(need int) int {
	if need <= minBufSize {
		return minBufSize
	}
	f := bits.UintSize - bits.LeadingZeros(uint(need))
	return 1 << uint(f)
}

Much cleaner!

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from d291b6e to 13cd01c Nov 29, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

I've updated the commit message with new stats. For KV90 on n1-standard-32's I find a pretty big win:

    $ ./workload run kv '{pgurl:1-3}' --init --splits=512 --duration 180s --read-percent=90 --concurrency=512
    name       old ops/s  new ops/s  delta
    KV90       100k ± 0%  108k ± 1%  +8.05%  (p=0.004 n=6+5)
    KV90       100k ± 0%  107k ± 2%  +6.81%  (p=0.002 n=6+6)
    KV90       101k ± 1%  107k ± 0%  +6.25%  (p=0.010 n=6+4)

Also updated the micro-benchmarks after resolving that ClearTo bug and cleaned up some comments.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raftentry/ring_buffer.go, line 135 at r4 (raw file):

Previously, ajwerner wrote…

Much cleaner!

Done.

@nvanbenschoten
Copy link
Member

nvanbenschoten left a comment

Still :lgtm:

Reviewed 2 of 3 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 53 at r4 (raw file):

Previously, ajwerner wrote…

Done.

Now it's "o fa"


pkg/storage/raftentry/cache.go, line 58 at r5 (raw file):

add to the cache

Double space recently added here between "the" and "cache".


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

Previously, ajwerner wrote…

Honestly, looking at the cache size in the ClearTo benchmark revealed a problem: the cache size was smaller than required to store the data, so the entries were being ignored completely. I've updated the micro-benchmarks.

Well that explains it. Glad you tracked it down.


pkg/storage/raftentry/ring_buffer.go, line 135 at r4 (raw file):

Previously, ajwerner wrote…

Done.

Out of curiosity, I took a look to see whether this bits function was implemented with a single instruction. It's not, but in checking I realized that the snippet is a bit redundant. The definition of LeadingZeros is:

func LeadingZeros(x uint) int { return UintSize - Len(x) }

So we can make this:

func reallocLen(need int) int {
	if need <= minBufSize {
		return minBufSize
	}
	return 1 << uint(bits.Len(uint(need)))
}

@ajwerner ajwerner force-pushed the ajwerner:ajwerner/raft_entry_cache branch from 13cd01c to aa5ca4d Nov 29, 2018

@ajwerner
Copy link
Collaborator

ajwerner left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raftentry/cache.go, line 58 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

add to the cache

Double space recently added here between "the" and "cache".

Done.


pkg/storage/raftentry/cache_test.go, line 260 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Well that explains it. Glad you tracked it down.

Done.

@ajwerner


Copy link
Collaborator

ajwerner commented Nov 29, 2018

bors r+

craig bot pushed a commit that referenced this pull request Nov 29, 2018

Merge #32618
32618: raftentry: rewrite storage.raftEntryCache with dense, per-replica state r=ajwerner a=ajwerner


Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig


Copy link

craig bot commented Nov 29, 2018

Build succeeded

@craig craig bot merged commit aa5ca4d into cockroachdb:master Nov 29, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
bors Build succeeded
license/cla Contributor License Agreement is signed.
@tbg
Copy link
Member

tbg left a comment

Very nice!

Dismissed @nvanbenschoten from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raftentry/cache.go, line 288 at r6 (raw file):

	const evicted = 0
	// a partition size is never equal to evicted while a partition exists and
	// Cache.mu is not held because a partition caries the byte size of its struct

carries

@tbg
Copy link
Member

tbg left a comment

Reviewed 5 of 11 files at r1, 1 of 5 files at r2, 2 of 7 files at r3, 2 of 5 files at r4, 1 of 3 files at r5, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

craig bot pushed a commit that referenced this pull request Nov 30, 2018

Merge #32717 #32739 #32745
32717: sql: better checks on table version changes r=vivekmenezes a=vivekmenezes

A big thank you to MutableTableDescriptor!

Release note: None

32739: roachpb: avoid accidental MinTxnPriority due to rounding error r=nvanbenschoten a=nvanbenschoten

Fixes #32676.
Reverts #31965 since this was the (confirmed) underlying cause of #31859.

Before this change, it was possible for a non-minimum `UserPriority`
to result in `MinTxnPriority`. This was due to a rounding error in
`MakePriority`.

I stressed both `TestPushTxnHeartbeatTimeout` and `TestTxnWaitQueueUpdateNotPushedTxn`
for over 2,000,000 iterations each without a failure after this fix.

Release note: None

32745: raftentry: fix typo in comment r=tbg a=ajwerner

The mistake was noticed by @tbg in #32618 after it was merged.

Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>