
kvserver: investigate raft entry cache sizing #98666

Closed
tbg opened this issue Mar 15, 2023 · 13 comments · Fixed by #107424
Labels: C-investigation (Further steps needed to qualify. C-label will change.), T-kv-replication (KV Replication Team)

@tbg (Member) commented Mar 15, 2023

The raft entry cache is sized to 16 MB per store. This seems very small:

```go
// defaultRaftEntryCacheSize is the default size in bytes for a
// store's Raft log entry cache.
defaultRaftEntryCacheSize = 1 << 24 // 16M
```

The raft.entrycache.{accesses,hits} metrics can be used to arrive at a cache hit rate. This is something we should be paying attention to when optimizing write performance.
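
For illustration only, a hypothetical helper showing the arithmetic behind such a hit rate (this is not code from the repo; a real dashboard would use deltas of the counters over a time window):

```go
package main

import "fmt"

// hitRate combines the two cumulative counters into a hit rate. In practice
// you would compute this over a time window (delta of hits divided by delta
// of accesses) rather than from the raw totals.
func hitRate(accesses, hits int64) float64 {
	if accesses == 0 {
		return 1.0 // no log reads yet, so nothing has missed the cache
	}
	return float64(hits) / float64(accesses)
}

func main() {
	// Made-up counter values for illustration.
	fmt.Printf("raft entry cache hit rate: %.1f%%\n", 100*hitRate(1000, 950))
}
```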

x-ref #98576 where this was noticed.

Jira issue: CRDB-25431

tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) and T-kv-replication (KV Replication Team) labels on Mar 15, 2023
@blathers-crl (bot) commented Mar 15, 2023

cc @cockroachdb/replication

tbg added the C-investigation (Further steps needed to qualify. C-label will change.) label and removed the C-bug label on Mar 15, 2023
@tbg (Member, Author) commented Mar 15, 2023

Supplementary screenshot from an 8 TB restore:

[screenshot]

We're reading from raft storage quite a bit. We're probably usually lucky and pebble serves most of it from the block cache, since I don't see anywhere near that read bandwidth on the disks.

@erikgrinaker (Contributor):

I'd also expect this to still be in OS disk caches.

tbg added a commit to tbg/cockroach that referenced this issue Mar 28, 2023
This tracks the cases where we fall back to reading log entries from storage
(i.e. pebble) in `(raft.Storage.Entries)`.

Ideally this is zero, as everything ought to be served from the raft entry cache. We know that this cache is not configured well[^1] and so we can't really expect this to work, but you can't improve what you don't measure.

Additionally, this metric has been useful in investigations related to raft overload[^2].

[^1]: cockroachdb#98666
[^2]: cockroachdb#98576

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2023
This tracks the cases where we fall back to reading log entries from storage
(i.e. pebble) in `(raft.Storage.Entries)`.

Ideally this is zero, as everything ought to be served from the raft entry cache. We know that this cache is not configured well[^1] and so we can't really expect this to work, but you can't improve what you don't measure.

Additionally, this metric has been useful in investigations related to raft overload[^2].

[^1]: cockroachdb#98666
[^2]: cockroachdb#98576

Epic: none
Release note: None
blathers-crl bot pushed a commit that referenced this issue Apr 26, 2023
This tracks the cases where we fall back to reading log entries from storage
(i.e. pebble) in `(raft.Storage.Entries)`.

Ideally this is zero, as everything ought to be served from the raft entry cache. We know that this cache is not configured well[^1] and so we can't really expect this to work, but you can't improve what you don't measure.

Additionally, this metric has been useful in investigations related to raft overload[^2].

[^1]: #98666
[^2]: #98576

Epic: none
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 29, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment
variable which can be used to configure the size of the raft entry cache.
The default value is 16 MiB.

Release note: None
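
As a rough sketch of what this override amounts to (not the actual implementation: CockroachDB parses the variable with its internal env helpers and accepts human-readable sizes, while this version only accepts a plain byte count):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// raftEntryCacheSize reads COCKROACH_RAFT_ENTRY_CACHE_SIZE and falls back to
// the 16 MiB default when the variable is unset or unparseable.
func raftEntryCacheSize() int64 {
	const defaultSize = 16 << 20 // 16 MiB
	s := os.Getenv("COCKROACH_RAFT_ENTRY_CACHE_SIZE")
	if s == "" {
		return defaultSize
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return defaultSize
	}
	return n
}

func main() {
	fmt.Println("raft entry cache size (bytes):", raftEntryCacheSize())
}
```
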
@nvanbenschoten (Member) commented Jun 29, 2023

In a write-heavy workload running kv0 with 2 KB writes (repro steps), we see a gradual drop-off in throughput. This is due in part to an increase in raft scheduler latency, which is a consequence of a reduction in the raft entry cache hit rate as the workload runs. Interestingly, we are able to see this clearly because some of the nodes have higher entry cache hit rates than others, and these perform much better.

This demonstrates that for write-heavy workloads, a 16 MB raft entry cache is insufficient. We should explore dynamically sizing the cache based on available memory capacity in the system.

In the meantime, #105799 makes it configurable through an environment variable.


qps and p99 workload latency
[screenshot]

p99 raft scheduler latency: notice nodes 2 and 3, compared to the other nodes
[screenshot]

raft entry cache accesses and hits
[screenshot]

same data, tabular form

| Node | Accesses | Hits | Hit Rate |
|------|----------|------|----------|
| n1 | 314468 | 308334 | 98.1% |
| n2 | 276748 | 260645 | 94.2% |
| n3 | 271915 | 255306 | 93.9% |
| n4 | 325052 | 320766 | 98.7% |
| n5 | 326403 | 321934 | 98.6% |

cpu profile from n3 and n4: notice n3's time reading the raft log from pebble
[screenshots]

@erikgrinaker (Contributor):

Interesting, thanks for these results.

There are other factors here too -- we've seen previously (#78412) that performance tanks when recent Raft log entries are flushed from the memtable into L0 before application and truncation, and the Pebble block cache and OS caches also come into play.

I suspect the memtable flushing primarily affects performance by increasing compaction disk IO, competing with foreground writes. However, it'd be interesting to see if it would be sufficient/better to try to have the log entries stick around in the memtable for as long as possible, instead of relying on a separate entry cache, since that would solve two problems at once.

@erikgrinaker (Contributor):

That said, let's see if we can do something tactical for 23.2 -- 16 MB is minuscule.

@jbowens (Collaborator) commented Jun 29, 2023

> the log entries stick around in the memtable for as long as possible, instead of relying on a separate entry cache, since that would solve two problems at once.

I'm unsure how much keeping the entries in the memtable will help with these reads, because any read other than a point `DB.Get` lookup still needs to read and merge across all levels of the LSM. Even if the entries themselves are dropped in flushes, the tombstones dilute the keyspace, making it less likely for reads of adjacent raft logs to hit the same sstable blocks.

@erikgrinaker (Contributor):

Good point, I guess that answers that.

craig bot pushed a commit that referenced this issue Jun 29, 2023
104350: sql/schemachanger: support for create sequence inside the declarative schema changer r=fqazi a=fqazi

This patch implements CREATE SEQUENCE in the declarative schema changer. The first three commits can be ignored and are included in #104348 (which should be reviewed and merged first). This patch will:

- Skip validation/backfill for descriptors in an added state
- Fix prefix resolution when creating new objects, where two-part names were not handled properly before
- Add support for CREATE SEQUENCE in the declarative schema changer

Fixes: #104351

105454: roachtest: update `mixedversion` to always use secure clusters r=srosenberg a=renatolabs

Secure clusters are closer to production deployments and also allow
us to test features that we couldn't before, like creating new users
with passwords during a test, and then performing SQL operations with
those users.

In the process of getting this to work, there were a few bugs that needed
to be fixed (first commit), and the `cluster` interface needed a small update
as well (second commit).

Epic: CRDB-19321

Release note: None

105765: ui: fix db page Node/Regions column rendering r=xinhaoz a=xinhaoz

Previously, the db page was not updating its columns if the
`showNodeRegionsColumn` prop changed. The db details page was
also not filtering out the regions column when desired.

Epic: none

Release note (bug fix): the node/regions columns in the db and db details
pages now render properly. This column is hidden for tenants and otherwise
shown for clusters with > 1 node.

105770: persistedsqlstats: skip TestSQLStatsPersistedLimitReached under stress r=zachlite a=zachlite

TestSQLStatsPersistedLimitReached succeeds under 'normal' testing conditions, so it is skipped only under stress; we can and should keep this test coverage where we can.

Informs #97488
Epic: none
Release note: None

105799: kv: expose env var to configure raft entry cache size r=erikgrinaker a=nvanbenschoten

Informs #98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` which can be used to configure the size of the raft entry cache. The default value is 16 MiB.

Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: zachlite <zachlite@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 30, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment
variable which can be used to configure the size of the raft entry cache.
The default value is 16 MiB.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 5, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment
variable which can be used to configure the size of the raft entry cache.
The default value is 16 MiB.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 5, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment
variable which can be used to configure the size of the raft entry cache.
The default value is 16 MiB.

Release note: None
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jul 11, 2023
Informs cockroachdb#98666.

This commit introduces a new `COCKROACH_RAFT_ENTRY_CACHE_SIZE` environment
variable which can be used to configure the size of the raft entry cache.
The default value is 16 MiB.

Release note: None
erikgrinaker self-assigned this on Jul 23, 2023
@erikgrinaker (Contributor):

I sampled CentMon hit rates:

`quantile(x, raft_entrycache_hits / raft_entrycache_accesses)`

- 10th percentile: 100%
- 5th percentile: 99.8%
- 1st percentile: 99.4%
- Minimum: 90.3%

@erikgrinaker (Contributor):

Some quick benchmarks, recording hit rates for all 3 nodes:

- `kv0/enc=false/nodes=3/size=4kb`: 99.3% 99.4% 99.4%
- `kv0/enc=false/nodes=3/cpu=32/size=4kb`: 99.4% 99.4% 99.4%
- `kv0/enc=false/nodes=3/size=64kb`: 86.1% 87.3% 87.3%
- `kv0/enc=false/nodes=3/cpu=32/size=64kb`: 87.1% 87.7% 87.4%

Will run some experiments with the small-node variants.

@erikgrinaker (Contributor):

Average hit rate on single 10-minute runs of `kv0/enc=false/nodes=3/size=64kb` with varying `COCKROACH_RAFT_ENTRY_CACHE_SIZE`:

- 16 MB: 86.9%
- 64 MB: 96.3%
- 128 MB: 98.0%
- 256 MB: 98.5%
- 512 MB: 98.5%
- 1 GB: 98.5%

In the >=256 MB cases, the steady-state hit rate was 100% -- the misses all happened at the start of the workload. The 128 MB case was almost able to keep up, with a 99.6% hit rate if we ignore the first minute.

I don't think we should optimize for the 64 KB case, since it seems like a bit of an outlier. I'll do some 30-minute runs at 128 MB cache size with 1 KB, 4 KB, 16 KB, and 64 KB writes. If 128 MB seems reasonable then I think we can simply change the default without any scaling heuristics, at least as a trivial first step.

Here's the 256 MB chart:

[screenshot]

@erikgrinaker (Contributor):

3 runs of `kv0/enc=false/nodes=3/size=` on GCE for 30 minutes with `COCKROACH_RAFT_ENTRY_CACHE_SIZE=128MB`. Cache metrics discarded for the first 2 minutes. Median of 3 runs.

| Size | Hit Rate | Throughput | Nightly |
|------|----------|------------|---------|
| 1 KB | 100% | 12442 w/s | |
| 4 KB | 100% | 6522 w/s | 6498 w/s |
| 16 KB | 100% | 1692 w/s | |
| 64 KB | 99.88% | 417 w/s | 418 w/s |

While the cache hit rate is significantly improved compared to #98666 (comment), it unfortunately doesn't seem to have a significant impact on throughput for this specific workload.

The results in #98666 (comment) were on AWS though, which I believe tends to have lower IOPS capacity. I'll rerun that experiment with a 128 MB entry cache.

@erikgrinaker (Contributor):

I re-ran the experiments from #98666 (comment), using 16 MB and 128 MB entry cache sizes respectively. I saw similar cluster behavior as the original results, including the throughput degradation and node bimodality.

The 128 MB entry cache basically eliminated cache misses. Furthermore, these are 256 GB RAM nodes, so if we had used the linear 1/256 memory scaling from #107424 the entry cache would have been 1 GB, which almost certainly would have ensured a 100% hit rate.

Per-node hit rates:

- 16 MB: 98.84% 98.84% 98.42% 95.53% 94.92%
- 128 MB: 99.98% 99.98% 99.97% 99.98% 98.85%

Unfortunately, as with other results, this did not translate to a significant throughput improvement. In fact, the 16 MB cluster had higher throughput (40127 w/s vs. 38865 w/s) which is most likely down to VM variation.

Furthermore, the 128 MB cluster also showed Raft scheduler bimodality that did not correlate strongly with entry cache misses:

[screenshot]

The CPU profile from n3 (one with high scheduler latencies) confirms that we're not reading entries from disk:

[screenshots]

n1 with low Raft scheduler latency appears very similar:

[screenshot]

For the purposes of the entry cache, I think we can say that the linear 1/256 scaling proposed in #107424 appears sufficient to avoid cache misses, but the impact on throughput is unclear.
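
A sketch of the scaling described in #107424 (the PR merged further down in this thread): 1/256 of total system/cgroup memory, a 32 MiB floor, shared evenly between the node's stores. The names and the order in which the floor is applied are illustrative, not the actual implementation:

```go
package main

import "fmt"

// raftEntryCacheSizePerStore computes a per-store entry cache budget as
// 1/256 of system memory with a 32 MiB minimum, divided evenly between the
// node's stores.
func raftEntryCacheSizePerStore(systemMemBytes, numStores int64) int64 {
	const minSize = 32 << 20 // 32 MiB
	total := systemMemBytes / 256
	if total < minSize {
		total = minSize
	}
	return total / numStores
}

func main() {
	// 32 GiB node, one store: 128 MiB, matching the example in #107424.
	fmt.Println(raftEntryCacheSizePerStore(32<<30, 1)>>20, "MiB")
	// 256 GiB node, one store: 1 GiB (1024 MiB), matching the estimate above.
	fmt.Println(raftEntryCacheSizePerStore(256<<30, 1)>>20, "MiB")
}
```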

craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case.

This patch ignores the RPC connection state when deciding whether to campaign, using only the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire an epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to campaign, using only the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1].

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists, and while we didn't want to interact
with it, we sometimes do[^1].

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^1]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot closed this as completed in #107424 on Jul 26, 2023