storage/raftentry: rewrite raftEntryCache with hashmaps and a Replica-level LRU policy

!!! Disclaimer !!!

This approach has a lot of promise (see benchmarks below), but #30151 may have stolen some of its benefit by making the current raftEntryCache less of an issue. It's unclear at the moment whether such an aggressive change is still needed. At a minimum, it's probably too big of a change for 2.1.

### Problem

Profiles show that the `raftEntryCache` is a performance bottleneck on certain workloads. `tpcc1000` on a 3-node cluster of `n1-highcpu-16` machines produces the following profiles.

<cpu profile>

This CPU profile shows that the `raftEntryCache` is responsible for just over **3%** of all CPU utilization on the node. Interestingly, the cost isn't centralized in a single method: we can see both `addEntries` and `getEntries` in the profile. Most of what we see is `OrderedCache` manipulation as the cache interacts with its underlying red-black tree.

<block profile>

This blocking profile is filtered to show only Mutex contention (ignoring other forms like channels and Cond vars). We can see that blocking in the `raftEntryCache` is responsible for **99%** of all Mutex contention. This seems absurd, but it adds up given that the cache is responsible for **3%** of CPU utilization on a node and requires mutual exclusion across an entire `Store`. We've also seen in changes like #29596 how expensive these cache accesses have become, especially as the cache grows: most entry accesses take `log(n)` time, where `n` is the number of `raftpb.Entry`s in the cache across all Ranges on a Store.

### Rewrite

This PR rewrites the `raftEntryCache` as a new `storage/raftentries.Cache` type. The rewrite diverges from the original implementation in two important ways, both of which exploit and optimize for the unique access patterns that this cache is subject to.

#### LLRB -> Hashmaps

The first change is that the rewrite trades the balanced binary tree structure for a multi-level hashmap structure.
This structure is preferable because it improves the time complexity of entry access. It's also more memory efficient and allocation friendly because it takes advantage of builtin Go hashmaps and their compile-time specialization. Finally, it should be more GC friendly because it cuts down on the number of pointers in use.

Of course, a major benefit of a balanced binary tree is that its elements are ordered, which allows it to accommodate range-based access patterns over a sparse set of elements. While the ordering across replicas was wasted on the `raftEntryCache`, the ordering within replicas did appear useful. However, it turns out that Raft entries are densely ordered with discrete indices. This means that given a low and a high index, we can simply iterate through the range efficiently, without the need for an ordered data structure. This property was also exploited in 6e4e57f.

#### Replica-level LRU policy

The second change is that the rewrite maintains an LRU policy across Replicas instead of across individual Raft entries. This makes maintaining the LRU linked list significantly cheaper because we only need to update it once per Replica access instead of once per entry access. It also means that the linked list will be significantly smaller, resulting in far fewer memory allocations and a reduced GC footprint.

This reduction in the granularity of the LRU policy changes its behavior. Instead of the cache holding on to the last N Raft entries accessed across all Replicas, it will hold on to all Raft entries for the last N Replicas accessed. I suspect that this is actually a better policy for Raft's usage because Replicas access entries in chunks, and missing even a single entry in the cache results in an expensive RocksDB seek. Furthermore, I think the previous LRU policy was actually somewhat pathological: when a replica looks up its entries, it almost always starts by requesting its oldest entry and scanning up from there.
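The replica-granularity eviction can be sketched as follows (again, names and the replica-count limit are hypothetical; the real cache bounds total bytes rather than replica count). The key point is that the linked list holds one element per replica, so an access touches it once, and eviction drops a whole replica's entries at a time.

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache evicts whole replica partitions in LRU order. The linked
// list holds one element per replica, not one per cached entry, so it
// is updated once per replica access rather than once per entry.
type lruCache struct {
	maxReplicas int
	order       *list.List              // front = most recently used
	elems       map[int64]*list.Element // rangeID -> its list element
	parts       map[int64]map[uint64][]byte
}

func newLRUCache(maxReplicas int) *lruCache {
	return &lruCache{
		maxReplicas: maxReplicas,
		order:       list.New(),
		elems:       make(map[int64]*list.Element),
		parts:       make(map[int64]map[uint64][]byte),
	}
}

// touch marks a replica as most recently used, creating its partition
// if needed and evicting the least recently used replica (with all of
// its entries) when over the limit.
func (c *lruCache) touch(rangeID int64) map[uint64][]byte {
	if el, ok := c.elems[rangeID]; ok {
		c.order.MoveToFront(el)
		return c.parts[rangeID]
	}
	c.elems[rangeID] = c.order.PushFront(rangeID)
	c.parts[rangeID] = make(map[uint64][]byte)
	if c.order.Len() > c.maxReplicas {
		evicted := c.order.Remove(c.order.Back()).(int64)
		delete(c.elems, evicted)
		delete(c.parts, evicted)
	}
	return c.parts[rangeID]
}

func main() {
	c := newLRUCache(2)
	c.touch(1)
	c.touch(2)
	c.touch(1) // replica 1 is now most recently used
	c.touch(3) // evicts replica 2, the least recently used
	_, ok := c.parts[2]
	fmt.Println(ok) // prints false: replica 2 was evicted wholesale
}
```
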
### Performance

```
name                 old time/op    new time/op    delta
EntryCache-4           3.00ms ± 9%    0.21ms ± 2%   -92.90%  (p=0.000 n=10+10)
EntryCacheClearTo-4     313µs ± 9%      37µs ± 2%   -88.21%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
EntryCache-4            113kB ± 0%     180kB ± 0%   +60.15%  (p=0.000 n=9+10)
EntryCacheClearTo-4    8.31kB ± 0%     0.00kB      -100.00%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
EntryCache-4            2.02k ± 0%     0.01k ± 0%   -99.75%  (p=0.000 n=10+10)
EntryCacheClearTo-4      14.0 ± 0%      0.0        -100.00%  (p=0.000 n=10+10)
```

Testing with TPC-C is still needed. I'll need to verify that the cache hit rate does not go down because of this change and that the change translates to the expected improvements on the CPU and blocking profiles.

Release note (performance improvement): Rewrite the Raft entry cache to optimize for access patterns, reduce lock contention, and reduce memory footprint.