storage/raftentry: rewrite raftEntryCache with hashmaps and a Replica-level LRU policy

!!! Disclaimer !!!

This approach has a lot of promise (see benchmarks below), but #30151 may have
stolen some of its benefit by making the current raftEntryCache less of an
issue. It's unclear at the moment whether such an aggressive change is still
needed. At a minimum, it's probably too big of a change for 2.1.

### Problem

Profiles are showing that the `raftEntryCache` is a performance bottleneck on
certain workloads. `tpcc1000` on a 3-node cluster with `n1-highcpu-16` machines
presents the following profiles.

<cpu profile>

This CPU profile shows that the `raftEntryCache` is responsible for just over
**3%** of all CPU utilization on the node. Interestingly, the cost isn't centralized
in a single method. Instead, we can see both `addEntries` and `getEntries` in the
profile. Most of what we see is `OrderedCache` manipulation as the cache interacts
with its underlying red-black tree.

<block profile>

This blocking profile is filtered to show all Mutex contention (ignoring other forms
like channels and Cond vars). We can see that blocking in the `raftEntryCache` is
responsible for **99%** of all Mutex contention. This seems absurd, but it adds up
given that the cache is responsible for **3%** of CPU utilization on a node and requires
mutual exclusion across an entire `Store`.

We've also seen in changes like #29596 how expensive these cache accesses have
become, especially as the cache grows: most entry accesses take `log(n)` time,
where `n` is the number of `raftpb.Entry`s in the cache across all Ranges
on a Store.

### Rewrite

This PR rewrites the `raftEntryCache` as a new `storage/raftentry.Cache` type.
The rewrite diverges from the original implementation in two important ways,
both of which exploit and optimize for the unique access patterns that this
cache is subject to.

#### LLRB -> Hashmaps

The first change is that the rewrite trades in the balanced binary tree
structure for a multi-level hashmap structure. This structure is preferable
because it improves the time complexity of entry access. It's also more memory
efficient and allocation friendly because it takes advantage of builtin Go
hashmaps and their compile-time specialization. Finally, it should be more GC
friendly because it cuts down on the number of pointers in use.
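To make the shape concrete, here is a minimal sketch of such a two-level map. The names (`entry`, `cache`, `parts`, and the `int64` stand-in for a RangeID) are hypothetical illustrations, not the actual `raftentry.Cache` implementation:

```go
package main

import "fmt"

// entry stands in for raftpb.Entry; Index is the Raft log index.
type entry struct {
	Index uint64
	Data  []byte
}

// cache maps a replica's RangeID to a second map from log index to
// entry. Both levels are builtin Go maps, so lookups are amortized
// O(1) and the compiler specializes the map code for these key types.
type cache struct {
	parts map[int64]map[uint64]entry
}

// add inserts entries for a replica, creating its map on first use.
func (c *cache) add(rangeID int64, ents []entry) {
	p, ok := c.parts[rangeID]
	if !ok {
		p = make(map[uint64]entry)
		c.parts[rangeID] = p
	}
	for _, e := range ents {
		p[e.Index] = e // later writes to the same index overwrite
	}
}

func main() {
	c := &cache{parts: make(map[int64]map[uint64]entry)}
	c.add(1, []entry{{Index: 5}, {Index: 6}})
	fmt.Println(len(c.parts[1])) // 2
}
```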

Of course, a major benefit of a balanced binary tree is that its elements are
ordered, which allows it to accommodate range-based access patterns on a sparse
set of elements. While the ordering across replicas was wasted with the
`raftEntryCache`, it did appear that the ordering within replicas was useful.
However, it turns out that Raft entries are densely ordered with discrete indices.
This means that given a low and a high index, we can simply iterate through the
range efficiently, without the need for an ordered data-structure. This property
was also exploited in 6e4e57f.
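The dense-index property means a range lookup needs no ordered iterator: probe `lo`, `lo+1`, ... directly and stop at the first gap. A sketch, with `string` standing in for the entry payload and `getEntries` a hypothetical name:

```go
package main

import "fmt"

// getEntries scans a per-replica map over the dense index range
// [lo, hi). Because Raft log indices are contiguous, we probe each
// index with a plain map lookup and stop at the first miss; no
// ordered data structure is required.
func getEntries(p map[uint64]string, lo, hi uint64) []string {
	var out []string
	for i := lo; i < hi; i++ {
		e, ok := p[i]
		if !ok {
			break // gap: caller falls back to storage for the rest
		}
		out = append(out, e)
	}
	return out
}

func main() {
	p := map[uint64]string{3: "a", 4: "b", 5: "c"}
	fmt.Println(getEntries(p, 3, 6)) // [a b c]
}
```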

#### Replica-level LRU policy

The second change is that the rewrite maintains an LRU-policy across Replicas,
instead of across individual Raft entries. This makes maintaining the LRU
linked-list significantly cheaper because we only need to update it once per
Replica access instead of once per entry access. It also means that the
linked-list will be significantly smaller, resulting in far fewer memory
allocations and a reduced GC footprint.

This reduction in the granularity of the LRU policy changes its behavior. Instead
of the cache holding on to the last N Raft entries accessed across all Replicas,
it will hold on to all Raft entries for the last N Replicas accessed. I suspect
that this is actually a better policy for Raft's usage because Replicas access
entries in chunks, and missing even a single entry in the cache results in an
expensive RocksDB seek. Furthermore, I think the LRU policy, as it was
implemented, was somewhat pathological: when a replica looks up its entries, it
almost always starts by requesting its oldest entry and scanning up from there.
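A replica-level recency list can be maintained with one list element per Replica, touched once per access rather than once per entry. A minimal sketch using `container/list`; the `lru`, `touch`, and `evict` names are hypothetical, not the actual `raftentry.Cache` API:

```go
package main

import (
	"container/list"
	"fmt"
)

// lru tracks recency per replica, not per entry: a single list
// element per RangeID, moved to the front on any access. The list
// therefore has one node per Replica rather than one per Raft entry.
type lru struct {
	order *list.List              // front = most recently used
	elems map[int64]*list.Element // RangeID -> its list element
}

func newLRU() *lru {
	return &lru{order: list.New(), elems: make(map[int64]*list.Element)}
}

// touch marks a replica as most recently used, adding it if new.
func (l *lru) touch(rangeID int64) {
	if el, ok := l.elems[rangeID]; ok {
		l.order.MoveToFront(el)
		return
	}
	l.elems[rangeID] = l.order.PushFront(rangeID)
}

// evict removes the least recently used replica and returns its ID.
// A real cache would drop all of that replica's cached entries here.
func (l *lru) evict() int64 {
	el := l.order.Back()
	l.order.Remove(el)
	id := el.Value.(int64)
	delete(l.elems, id)
	return id
}

func main() {
	l := newLRU()
	l.touch(1)
	l.touch(2)
	l.touch(1) // replica 2 is now least recently used
	fmt.Println(l.evict()) // 2
}
```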

### Performance

```
name                 old time/op    new time/op    delta
EntryCache-4           3.00ms ± 9%    0.21ms ± 2%   -92.90%  (p=0.000 n=10+10)
EntryCacheClearTo-4     313µs ± 9%      37µs ± 2%   -88.21%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
EntryCache-4            113kB ± 0%     180kB ± 0%   +60.15%  (p=0.000 n=9+10)
EntryCacheClearTo-4    8.31kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
EntryCache-4            2.02k ± 0%     0.01k ± 0%   -99.75%  (p=0.000 n=10+10)
EntryCacheClearTo-4      14.0 ± 0%       0.0       -100.00%  (p=0.000 n=10+10)
```

Testing with TPC-C is still needed. I'll need to verify that the cache hit
rate does not go down because of this change and that the change translates
to the expected improvements on CPU and blocking profiles.

Release note (performance improvement): Rewrite Raft entry cache to
optimize for access patterns, reduce lock contention, and reduce memory
footprint.
nvanbenschoten committed Sep 28, 2018
1 parent d41daeb commit 6c212b7
Showing 11 changed files with 525 additions and 514 deletions.
pkg/storage/entry_cache.go: deleted (0 additions, 302 deletions)
