kvserver: 23.2 replication microbenchmark regressions #111561
cc @cockroachdb/replication
Weekly microbenchmark results are available in #perf-ops.
Here are the latest KV microbenchmark results (compared with 23.1): https://docs.google.com/spreadsheets/d/12cla5YkEpAd1gX3nMushoN354yuurO67k7PfPs5wI64/edit#gid=5
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Looked at the CPU profiles.
Before: cpu-before.pprof.gz. After: cpu-after.pprof.gz
We're seeing that allocations doubled to 2 allocs/op because there is a new allocation. @jbowens, something here allocates: cockroach/pkg/clusterversion/setting.go, lines 132 to 144 in a0b6f5c
This potentially impacts other benchmarks too, since creating a Pebble batch is a common operation. More generally, it looks like every version check in CRDB allocates? Opened PR #113043 to address this.
Yeah, that version check is new in 23.2. It's unfortunate that it allocates. I suppose we could subscribe to cluster version progressions (IIRC there's an API to do this) and stash the current cluster version on the Engine for use in these checks.
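For illustration, here's a minimal sketch of the cache-the-decoded-version idea (hypothetical names and shapes, not the actual clusterversion or Engine API): the hot-path check only loads an atomic pointer that a version-change callback keeps up to date, so nothing is unmarshaled or allocated per check.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Version is a stand-in for the decoded cluster version; the real type
// and comparison logic differ.
type Version struct{ Major, Minor int32 }

func (v Version) AtLeast(min Version) bool {
	return v.Major > min.Major || (v.Major == min.Major && v.Minor >= min.Minor)
}

// current caches the decoded version. A hypothetical subscription to
// cluster version progressions would call Store on every bump.
var current atomic.Pointer[Version]

// isActiveCached is the allocation-free check: it only loads a pointer,
// instead of unmarshaling the encoded setting value on each call.
func isActiveCached(min Version) bool {
	v := current.Load()
	return v != nil && v.AtLeast(min)
}

func main() {
	current.Store(&Version{Major: 23, Minor: 2}) // e.g. from an update callback
	fmt.Println(isActiveCached(Version{Major: 23, Minor: 1})) // true
}
```

The trade-off is that readers see the version only as of the last callback, which is the same semantics a subscription would give anyway.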
I wonder if it's worth running some of these benchmarks with different Go versions, since we're not able to find clear explanations for a couple of these regressions. If there is a performance regression in Go itself, then we'd probably want to know about that.
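If someone picks this up, a rough recipe (paths, benchmark names, and Go versions here are illustrative; CockroachDB builds with Bazel, so a plain `go test` may need extra setup, and `benchstat` comes from golang.org/x/perf):

```sh
# Install an alternate Go toolchain alongside the default one.
go install golang.org/dl/go1.20.10@latest && go1.20.10 download

# Run the same benchmark under both toolchains, then compare.
go test -run=^$ -bench=BenchmarkReplicaProposal -count=10 ./pkg/kv/kvserver > new.txt
go1.20.10 test -run=^$ -bench=BenchmarkReplicaProposal -count=10 ./pkg/kv/kvserver > old.txt
benchstat old.txt new.txt
```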
It could also be worth checking whether there is a change in proto. The raft cache tests spend a bit more time in protos.
The remaining ~10% difference is yet to be explained. The Raft stack almost doubled its allocations; one notable addition is the error handling bonanza in `CompareWithLocality`.
@erikgrinaker Given the above, I suspect that the Raft stack cost has increased, so the thing you observed in your benchmark may have the same cause.
Can you post a link to the "after" memory profile file?
Took a quick look at the code. Tracking those down.
These methods will typically return empty maps, so we can avoid the allocations.

Informs cockroachdb#111561
Epic: none
Release note: None
Unlikely to see the errors in actual deployments, as they are created for missing/unexpected locality values. We can drop them, seeing they are used only in this instrumentation.
113150: kvserver: do lazy map allocations in replicaFlowControl r=pavelkalinnikov,aadityasondhi a=sumeerbhola

These methods will typically return empty maps, so we can avoid the allocations.

Informs #111561
Epic: none
Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
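For context, the lazy-allocation pattern the PR applies, in generic form (hypothetical types and names, not the actual replicaFlowControl code):

```go
package main

import "fmt"

type item struct {
	id   int
	slow bool
}

// slowItemsEager allocates a map even when it stays empty.
func slowItemsEager(items []item) map[int]struct{} {
	m := make(map[int]struct{}) // allocates on every call
	for _, it := range items {
		if it.slow {
			m[it.id] = struct{}{}
		}
	}
	return m
}

// slowItemsLazy allocates only on the first insertion; the typical
// empty result is a nil map.
func slowItemsLazy(items []item) map[int]struct{} {
	var m map[int]struct{}
	for _, it := range items {
		if it.slow {
			if m == nil {
				m = make(map[int]struct{})
			}
			m[it.id] = struct{}{}
		}
	}
	return m
}

func main() {
	fmt.Println(len(slowItemsEager(nil)), len(slowItemsLazy(nil))) // 0 0
}
```

A nil map in Go is safe to read from and range over, so callers of the common empty case don't need to change.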
Cross-locality traffic instrumentation was added to raft, snapshots, and batch requests to quantify the amount of cross-region/zone traffic. Errors would be returned from `CompareWithLocality` when the region or zone locality flags were set in an unsupported manner according to our documentation. These error allocations added overhead (CPU/mem) when hit.

Alter `CompareWithLocality` to return booleans in place of an error to reduce overhead.

Resolves: cockroachdb#111148
Resolves: cockroachdb#111142
Informs: cockroachdb#111561
Release note: None
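Roughly, the signature change amounts to this (a sketch with made-up comparison logic; the real `CompareWithLocality` compares locality tiers): returning booleans avoids constructing an error value on every mismatch, which is what showed up as allocations in the hot path.

```go
package main

import (
	"errors"
	"fmt"
)

// Before: every unsupported comparison allocates an error value.
func compareWithLocalityErr(a, b string) (sameRegion, sameZone bool, _ error) {
	if a == "" || b == "" {
		return false, false, errors.New("locality not set") // allocates on each hit
	}
	return a == b, a == b, nil
}

// After: the "unsupported" case is just a third boolean; nothing allocates.
func compareWithLocalityBool(a, b string) (sameRegion, sameZone, ok bool) {
	if a == "" || b == "" {
		return false, false, false
	}
	return a == b, a == b, true
}

func main() {
	_, _, err := compareWithLocalityErr("", "us-east1")
	fmt.Println(err != nil) // true

	_, _, ok := compareWithLocalityBool("", "us-east1")
	fmt.Println(ok) // false
}
```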
113069: kvserver: add BenchmarkNodeLivenessScanStorage to measure liveness scan r=andrewbaptist,jbowens a=sumeerbhola

Node liveness scans, like the one done in MaybeGossipNodeLivenessRaftMuLocked while holding raftMu, are performance sensitive, and slowness has caused production issues (https://github.com/cockroachlabs/support/issues/2665, https://github.com/cockroachlabs/support/issues/2107).

This benchmark measures the scan performance both when DELs (due to GC) have not been compacted away, and when they have. It also sets up a varying number of live versions, since decommissioned nodes will have a single live version.

Results on an M1 MacBook on master with dead-keys=false and compacted=true:
```
NodeLivenessScanStorage/num-live=2/compacted=true-10       26.80µ ± 9%
NodeLivenessScanStorage/num-live=5/compacted=true-10       30.34µ ± 3%
NodeLivenessScanStorage/num-live=10/compacted=true-10      38.88µ ± 8%
NodeLivenessScanStorage/num-live=1000/compacted=true-10    861.5µ ± 3%
```

When compacted=false the scan takes ~10ms, which is > 100x slower, but probably acceptable for this workload.
```
NodeLivenessScanStorage/num-live=2/compacted=false-10      9.430m ± 5%
NodeLivenessScanStorage/num-live=5/compacted=false-10      9.534m ± 4%
NodeLivenessScanStorage/num-live=10/compacted=false-10     9.456m ± 2%
NodeLivenessScanStorage/num-live=1000/compacted=false-10   10.34m ± 7%
```

dead-keys=true (and compacted=false) defeats the NextPrefix optimization, since the next prefix can have all its keys deleted and the iterator has to step through all of them (it can't be sure that all the keys for that next prefix are deleted). This case should not occur in the liveness range, as we don't remove decommissioned entries, but is included for better understanding.
```
NodeLivenessScanStorage/num-live=2/dead-keys=true/compacted=false-10   58.33m
```

Compared to v22.2, the results are sometimes > 10x faster, when the pebbleMVCCScanner seek optimization in v22.2 was defeated.
```
                                                           │    sec/op     │    sec/op     vs base
NodeLivenessScanStorage/num-live=2/compacted=false-10       117.280m ± 2%    9.430m ± 5%  -91.96% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10       117.298m ± 0%    9.534m ± 4%  -91.87% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10       12.009m ± 0%    9.456m ± 2%  -21.26% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10      13.04m ± 0%    10.34m ± 7%  -20.66% (p=0.002 n=6)

                                                           │ block-bytes/op │ block-bytes/op  vs base
NodeLivenessScanStorage/num-live=2/compacted=false-10         14.565Mi ± 0%    8.356Mi ± 0%  -42.63% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10         14.570Mi ± 0%    8.361Mi ± 0%  -42.61% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10        11.094Mi ± 0%    8.368Mi ± 0%  -24.57% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10      12.235Mi ± 0%    8.990Mi ± 0%  -26.53% (p=0.002 n=6)

                                                           │     B/op      │     B/op      vs base
NodeLivenessScanStorage/num-live=2/compacted=false-10         42.83Ki ± 4%    41.87Ki ± 0%   -2.22% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10         43.28Ki ± 3%    41.84Ki ± 0%   -3.32% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10        37.59Ki ± 0%    41.92Ki ± 0%  +11.51% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10      37.67Ki ± 1%    42.66Ki ± 0%  +13.23% (p=0.002 n=6)

                                                           │   allocs/op   │  allocs/op   vs base
NodeLivenessScanStorage/num-live=2/compacted=false-10          105.00 ± 8%    85.00 ± 0%  -19.05% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10          107.00 ± 5%    85.00 ± 0%  -20.56% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10          74.00 ± 1%    85.00 ± 0%  +14.86% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10        79.00 ± 1%    92.00 ± 1%  +16.46% (p=0.002 n=6)
```

Relates to https://github.com/cockroachlabs/support/issues/2665

Epic: none
Release note: None

113229: kv,server,roachpb: avoid error overhead for x-locality comparison r=pavelkalinnikov a=kvoli

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
@pavelkalinnikov We're planning to cut the .0 branch today; can we get the remaining fixes backported before then?
There is only #113043 left. I can probably work around the linter and land it today.
113043: clusterversion: benchmark and rm Unmarshal allocs r=erikgrinaker a=pavelkalinnikov

Currently, each `IsActive` does a memory allocation:
```
==================== Test output for //pkg/clusterversion:clusterversion_test:
goos: darwin
goarch: arm64
BenchmarkClusterVersionSettingIsActive
BenchmarkClusterVersionSettingIsActive-10    28778041    42.03 ns/op    16 B/op    1 allocs/op
PASS
```

Since the cluster version check is in many hot paths, we should eliminate this allocation.

After:
```
==================== Test output for //pkg/clusterversion:clusterversion_test:
goos: darwin
goarch: arm64
BenchmarkClusterVersionSettingIsActive
BenchmarkClusterVersionSettingIsActive-10    45417914    26.43 ns/op    0 B/op    0 allocs/op
PASS
```

Touches #111561
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
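For anyone reproducing such numbers, this is the general shape of an allocation microbenchmark (a generic sketch, not the PR's actual test): `b.ReportAllocs` is what makes `go test -bench` print the B/op and allocs/op columns quoted above.

```go
package bench

import (
	"sync/atomic"
	"testing"
)

// active is a stand-in for cached version-check state.
var active atomic.Bool

// BenchmarkVersionCheck measures a hot-path check. Run with:
//   go test -bench=VersionCheck .
func BenchmarkVersionCheck(b *testing.B) {
	active.Store(true)
	b.ReportAllocs() // print B/op and allocs/op alongside ns/op
	for i := 0; i < b.N; i++ {
		if !active.Load() {
			b.Fatal("expected active")
		}
	}
}
```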
New results from this Monday. These don't yet contain the proto Unmarshal fix, but some or all of the other optimizations should already be in. There is still some room for improvement.
Taking another look (the first one was here) at this benchmark.
The "big" tests are even. The "small" tests allocate a bit more, and are therefore slower. Figuring out the reason now. There is one clear outlier, and the reason (I double-checked before/after) is #107680. It probably added an allocation here (the size of an MVCCStats): cockroach/pkg/kv/kvserver/logstore/logstore.go, lines 386 to 388 in e836704
@erikgrinaker do you think there is a quick fix? One way would be returning the stats delta rather than taking an "output" parameter. But that's not quick, and might have downsides.
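To make the trade-off concrete, a sketch of the two shapes (hypothetical types, not the logstore code): an output pointer that flows into code the compiler can't see through tends to escape to the heap, while returning the delta by value keeps it on the stack.

```go
package main

import "fmt"

// Stats stands in for the MVCC stats delta (hypothetical type).
type Stats struct{ KeyBytes, ValBytes int64 }

// accumulator simulates a consumer reached through an interface, so the
// compiler cannot prove the *Stats it receives stays local.
type accumulator interface{ add(s *Stats) }

type total struct{ Stats }

func (t *total) add(s *Stats) { t.KeyBytes += s.KeyBytes; t.ValBytes += s.ValBytes }

// withOutParam passes the delta through a pointer; because the pointer
// crosses an interface boundary, the Stats typically escapes to the heap.
func withOutParam(acc accumulator, keyBytes int64) {
	s := Stats{KeyBytes: keyBytes}
	acc.add(&s) // &s escapes -> heap allocation per call
}

// returningDelta returns the delta by value; it stays on the stack.
func returningDelta(keyBytes int64) Stats {
	return Stats{KeyBytes: keyBytes}
}

func main() {
	t := &total{}
	withOutParam(t, 10)
	d := returningDelta(5)
	t.KeyBytes += d.KeyBytes
	fmt.Println(t.KeyBytes) // 15
}
```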
I'm not sure we necessarily care about this allocation. The IO cost of log appends is probably high enough that the allocation doesn't make a significant difference, and this cost is constant for all appended entries. This benchmark uses an in-memory engine, which hides the IO cost (and we're only benchmarking the inner logic). I suppose we could pool these to amortize the allocation cost, though.
Yeah, I feel similarly. I guess we can leave it then; we just need to be mindful that it might impact other microbenchmarks that are yet to be investigated.
Can you try to pool them and see if it makes any difference in the benchmark?
113742: logstore: pool MVCCStats to avoid hot path allocs r=erikgrinaker a=pavelkalinnikov

Before this commit, `logstore.logAppend` allocated `MVCCStats` on the heap despite its lifetime being confined to this function's scope. This commit switches to allocating `MVCCStats` on a `sync.Pool`. Since `logAppend` already has another pooled `roachpb.Value` with exactly the same lifetime, coalesce the allocation of both and put them in the same pool. This makes the cost of this change zero.

Microbenchmark results:
```
// before
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24    837254    1270 ns/op    333 B/op    1 allocs/op
// after
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24   1000000    1153 ns/op    157 B/op    0 allocs/op
```

Part of #111561
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
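The coalescing trick from the description, in generic form (hypothetical types; the real code pairs the stats with a pooled `roachpb.Value`): bundling both objects in one struct means a single pool Get/Put covers both, so adding the stats to the pool costs nothing extra.

```go
package main

import (
	"fmt"
	"sync"
)

// Stats and Value stand in for the two objects with identical lifetimes.
type Stats struct{ KeyBytes int64 }
type Value struct{ RawBytes []byte }

// scratch bundles both so one pool Get/Put covers both allocations.
type scratch struct {
	stats Stats
	value Value
}

var scratchPool = sync.Pool{
	New: func() any { return &scratch{} },
}

func logAppend(payload []byte) int64 {
	s := scratchPool.Get().(*scratch)
	defer func() {
		// Reset the scratch but keep the value buffer's capacity.
		*s = scratch{value: Value{RawBytes: s.value.RawBytes[:0]}}
		scratchPool.Put(s)
	}()

	s.value.RawBytes = append(s.value.RawBytes, payload...)
	s.stats.KeyBytes += int64(len(payload))
	return s.stats.KeyBytes
}

func main() {
	fmt.Println(logAppend([]byte("entry"))) // 5
}
```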
Turns out there is a simple fix, so we're just fixing the thing.
The repl-specific part is done. If there is any Go- or proto-induced regression, its impact should be investigated at a broader scope.
Preliminary results were posted in https://docs.google.com/spreadsheets/d/1xYiEZlqL9tukSgZENvZIxgIHxrwo7c_PKFDdXCP6QkA/edit#gid=5. These may be fairly old, but we should look into the replication-related regressions until newer results are ready. These may or may not be false positives. In no particular order:

- `kvserver/ReplicaProposal` @pavelkalinnikov
- `kvserver/BumpSideTransportClosed` @erikgrinaker
- `kvserver/raftentry/EntryCache` @pavelkalinnikov
  - `raftpb.(*Entry).Size()`: 3.62s -> 4.48s
- `kvserver/raftentry/EntryCacheClearTo` @pavelkalinnikov
  - same as `kvserver/raftentry/EntryCache`
- `kvserver/logstore/LogStore_StoreEntries` @pavelkalinnikov
- `kvserver/StoreRangeMerge` @erikgrinaker
- `kvserver/closedts/tracker/HeapTracker` @erikgrinaker

Jira issue: CRDB-31969