
kvserver: 23.2 replication microbenchmark regressions #111561

Closed · 7 tasks done
erikgrinaker opened this issue Oct 2, 2023 · 24 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker T-kv-replication KV Replication Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Oct 2, 2023

Preliminary results were posted in https://docs.google.com/spreadsheets/d/1xYiEZlqL9tukSgZENvZIxgIHxrwo7c_PKFDdXCP6QkA/edit#gid=5. These may be fairly old, but we should look into the replication-related regressions until newer results are ready. These may or may not be false positives. In no particular order:

Jira issue: CRDB-31969

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication KV Replication Team labels Oct 2, 2023
@blathers-crl

blathers-crl bot commented Oct 2, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added branch-master Failures on the master branch. GA-blocker labels Oct 2, 2023
@erikgrinaker
Contributor Author

Weekly microbenchmark results are available in #perf-ops.

@exalate-issue-sync exalate-issue-sync bot removed branch-master Failures on the master branch. GA-blocker labels Oct 3, 2023
@erikgrinaker
Contributor Author

Here's the latest KV microbenchmark results (compared with 23.1): https://docs.google.com/spreadsheets/d/12cla5YkEpAd1gX3nMushoN354yuurO67k7PfPs5wI64/edit#gid=5

@blathers-crl

blathers-crl bot commented Oct 23, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker added the branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 label Oct 23, 2023
@erikgrinaker
Contributor Author

Looked at BumpSideTransportClosed. No significant profile or code changes -- possibly a Go runtime change? ~15% diff, small in absolute terms, but with higher variance.

name                        old time/op  new time/op  delta
BumpSideTransportClosed-24   225ns ± 0%   261ns ±11%  +16.38%  (p=0.000 n=9+10)

Before: cpu-before.pprof.gz


After: cpu-after.pprof.gz


@pav-kv
Collaborator

pav-kv commented Oct 25, 2023

For kvserver/logstore/LogStore_StoreEntries allocations regression, there is a difference between 23.1 and master:

[Allocation profile screenshots: 23.1 vs master]

We're seeing that allocations doubled to 2 allocs/op because there is a new allocation in the clusterversion package. Previously, the logstore.newStoreEntriesBatch call did not allocate.

@jbowens Has storage.newPebbleBatch started using this version check recently?

Something here allocates:

func (cv *clusterVersionSetting) activeVersionOrEmpty(
	ctx context.Context, sv *settings.Values,
) ClusterVersion {
	encoded := cv.GetInternal(sv)
	if encoded == nil {
		return ClusterVersion{}
	}
	var curVer ClusterVersion
	if err := protoutil.Unmarshal(encoded.([]byte), &curVer); err != nil {
		log.Fatalf(ctx, "%v", err)
	}
	return curVer
}

This potentially impacts other benchmarks too, as creating a Pebble batch is a common operation.

More generally, it looks like every version check in CRDB allocates. Opened PR #113043 to address this.
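
For illustration, a minimal sketch of one way to avoid the per-check allocation (hypothetical versionCache type, assuming the usual bytes/sync imports and the ClusterVersion/protoutil identifiers from the snippet above; not necessarily the approach #113043 takes): memoize the decoded ClusterVersion alongside the encoded bytes and only re-run protoutil.Unmarshal when the encoded setting actually changes.

```
// Hypothetical sketch: cache the decoded ClusterVersion so repeated version
// checks don't unmarshal (and allocate) on every call.
type versionCache struct {
	mu      sync.Mutex
	encoded []byte         // last encoded setting value seen
	decoded ClusterVersion // decoded form of encoded
}

func (c *versionCache) get(encoded []byte) (ClusterVersion, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if bytes.Equal(encoded, c.encoded) {
		return c.decoded, nil // fast path: zero allocations
	}
	var cv ClusterVersion
	if err := protoutil.Unmarshal(encoded, &cv); err != nil {
		return ClusterVersion{}, err
	}
	c.encoded = append(c.encoded[:0], encoded...) // keep our own copy
	c.decoded = cv
	return cv, nil
}
```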

@jbowens
Collaborator

jbowens commented Oct 25, 2023

@jbowens Has storage.newPebbleBatch started using this version check recently?

Yeah, that version check is new in 23.2. It's unfortunate that it allocates. I suppose we could subscribe to cluster version progressions (IIRC there's an API to do this) and stash the current cluster version on the Engine for use in newBatch.
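
Roughly something like this (hypothetical names throughout; the callback registration API is an assumption, not confirmed):

```
// Hypothetical sketch: keep the last-seen cluster version in an atomic on the
// engine, updated from a version-change callback, so newBatch can read it
// without decoding anything.
type engineImpl struct {
	clusterVersion atomic.Value // holds a ClusterVersion
}

// onVersionBump would be registered with whatever version-change callback the
// settings machinery exposes (assumption).
func (e *engineImpl) onVersionBump(cv ClusterVersion) {
	e.clusterVersion.Store(cv)
}

func (e *engineImpl) currentVersion() ClusterVersion {
	cv, _ := e.clusterVersion.Load().(ClusterVersion)
	return cv // zero value until the first callback fires
}
```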

@pav-kv
Collaborator

pav-kv commented Oct 25, 2023

@jbowens I think #113043 should fix the thing.

@erikgrinaker
Contributor Author

I wonder if it's worth running some of these benchmarks with different Go versions, since we're not able to find clear explanations for a couple of these regressions. If there is a performance regression in Go itself then we'd probably want to know about that.

@pav-kv
Collaborator

pav-kv commented Oct 26, 2023

I wonder if it's worth running some of these benchmarks with different Go versions, since we're not able to find clear explanations for a couple of these regressions. If there is a performance regression in Go itself then we'd probably want to know about that.

It could also be worth checking whether there is a change in the proto library. The raft cache tests spend a bit more time in protos.

@pav-kv
Collaborator

pav-kv commented Oct 26, 2023

In the ReplicaProposal benchmark (which is more of an end-to-end thing), the IsActive allocations add up to 1% of all allocations:

[Memory profile screenshot]

The remaining ~10% difference is yet to be explained.

The Raft stack almost doubled its allocations:

[Before/after memory profile screenshots]

The notable addition is updateProposalQuotaRaftMuLocked. Added recently with the replication admission control? CC @sumeerbhola

Another one is the error handling bonanza in HandleRaftRequestSent. Added recently in #105122 and #105267. Gut feeling is that we should not see these errors in this test. CC @kvoli as reviewer of these PRs - if you remember the context, are these errors in the profile legit?

@pav-kv
Collaborator

pav-kv commented Oct 26, 2023

@erikgrinaker Given the above, I suspect the Raft stack cost has increased, so the regression you observed in kvserver/StoreRangeMerge is probably impacted too.

@sumeerbhola
Collaborator

The notable addition is updateProposalQuotaRaftMuLocked. Added recently with the replication admission control? CC @sumeerbhola

Can you post a link to the "after" memory profile file?

@aadityasondhi
Collaborator

The notable addition is updateProposalQuotaRaftMuLocked. Added recently with the replication admission control?

Took a quick look at the code. Tracking those allocations down in updateProposalQuotaRaftMuLocked, I see most of the new allocs come from this bit of code with all the get functions. Maybe we can do something about it.

https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/flow_control_replica_integration.go#L298-L302
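
For illustration, a minimal sketch of the lazy-allocation pattern (hypothetical names; callers would have to treat a nil map as empty):

```
// Hypothetical sketch: only allocate the map once there is something to put
// in it, so the common "nothing to report" case costs zero allocations.
func (h *flowHandle) connectedStreams() map[streamID]struct{} {
	var m map[streamID]struct{} // stays nil unless a stream qualifies
	for _, s := range h.streams {
		if !s.connected {
			continue
		}
		if m == nil {
			m = make(map[streamID]struct{}, len(h.streams))
		}
		m[s.id] = struct{}{}
	}
	return m // nil when no streams are connected
}
```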

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 26, 2023
These methods will typically return empty maps, so we can avoid the
allocations.

Informs cockroachdb#111561

Epic: none

Release note: None
@kvoli
Collaborator

kvoli commented Oct 26, 2023

Another one is the error handling bonanza in HandleRaftRequestSent. Added recently in #105122 and #105267. Gut feeling is that we should not see these errors in this test. CC @kvoli as reviewer of these PRs - if you remember the context, are these errors in the profile legit?

We're unlikely to see these errors in actual deployments, as they are only created for a missing or unexpected locality specification.

Since they are only used in VInfof(5, ...), we can either check V(5) before doing the work, or drop them entirely.
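
For illustration, a rough sketch of what that could look like (hypothetical signature returning booleans instead of errors):

```
// Hypothetical sketch: return booleans instead of allocating errors, and gate
// the verbose log on log.V(5) so no formatting work happens when it is off.
regionMatch, zoneMatch := localLocality.CompareWithLocality(remoteLocality)
if log.V(5) {
	log.Infof(ctx, "cross-locality raft send: regionMatch=%t zoneMatch=%t",
		regionMatch, zoneMatch)
}
```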

craig bot pushed a commit that referenced this issue Oct 27, 2023
113150: kvserver: do lazy map allocations in replicaFlowControl r=pavelkalinnikov,aadityasondhi a=sumeerbhola

These methods will typically return empty maps, so we can avoid the allocations.

Informs #111561

Epic: none

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Oct 27, 2023
These methods will typically return empty maps, so we can avoid the
allocations.

Informs #111561

Epic: none

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Oct 27, 2023
Cross locality traffic instrumentation  was added to raft, snapshots and
batch requests to quantify the amount of cross region/zone traffic.
Errors would be returned from `CompareWithLocality` when the region, or
zone locality flags were set in an unsupported manner according to our
documentation. These error allocations added overhead (cpu/mem) when
hit.

Alter `CompareWithLocality` to return booleans in place of an error to
reduce overhead.

Resolves: cockroachdb#111148
Resolves: cockroachdb#111142
Informs: cockroachdb#111561
Release note: None
craig bot pushed a commit that referenced this issue Oct 30, 2023
113069: kvserver: add BenchmarkNodeLivenessScanStorage to measure liveness scan r=andrewbaptist,jbowens a=sumeerbhola

Node liveness scans, like the one done in MaybeGossipNodeLivenessRaftMuLocked, while holding raftMu, are performance sensitive, and slowness has caused production issues (https://github.com/cockroachlabs/support/issues/2665, https://github.com/cockroachlabs/support/issues/2107).

This benchmark measures the scan performance both when DELs (due to GC) have not been compacted away, and when they have. It also sets up a varying number of live versions since decommissioned nodes will have a single live version.

Results on M1 macbook on master with dead-keys=false and compacted=true:
```
 NodeLivenessScanStorage/num-live=2/compacted=true-10   26.80µ ± 9%
 NodeLivenessScanStorage/num-live=5/compacted=true-10   30.34µ ± 3%
 NodeLivenessScanStorage/num-live=10/compacted=true-10   38.88µ ± 8%
 NodeLivenessScanStorage/num-live=1000/compacted=true-10 861.5µ ± 3%
```

When compacted=false the scan takes ~10ms, which is > 100x slower, but probably acceptable for this workload.
```
 NodeLivenessScanStorage/num-live=2/compacted=false-10     9.430m ± 5%
 NodeLivenessScanStorage/num-live=5/compacted=false-10     9.534m ± 4%
 NodeLivenessScanStorage/num-live=10/compacted=false-10    9.456m ± 2%
 NodeLivenessScanStorage/num-live=1000/compacted=false-10 10.34m ± 7%
```

dead-keys=true (and compacted=false) defeats the NextPrefix optimization, since the next prefix can have all its keys deleted and the iterator has to step through all of them (it can't be sure that all the keys for that next prefix are deleted). This case should not occur in the liveness range, as we don't remove decommissioned entries, but is included for better understanding.
```
 NodeLivenessScanStorage/num-live=2/dead-keys=true/compacted=false-10 58.33m
```

Compared to v22.2, the results are sometimes > 10x faster, when the pebbleMVCCScanner seek optimization in v22.2 was defeated.

```
                                                         │                    sec/op                    │          sec/op           vs base               │
NodeLivenessScanStorage/num-live=2/compacted=false-10                                     117.280m ± 2%                9.430m ± 5%  -91.96% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10                                     117.298m ± 0%                9.534m ± 4%  -91.87% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10                                     12.009m ± 0%                9.456m ± 2%  -21.26% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10                                    13.04m ± 0%                10.34m ± 7%  -20.66% (p=0.002 n=6)

                                                         │                block-bytes/op                │      block-bytes/op       vs base               │
NodeLivenessScanStorage/num-live=2/compacted=false-10                                     14.565Mi ± 0%               8.356Mi ± 0%  -42.63% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10                                     14.570Mi ± 0%               8.361Mi ± 0%  -42.61% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10                                    11.094Mi ± 0%               8.368Mi ± 0%  -24.57% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10                                  12.235Mi ± 0%               8.990Mi ± 0%  -26.53% (p=0.002 n=6)

                                                         │                     B/op                     │           B/op            vs base               │
NodeLivenessScanStorage/num-live=2/compacted=false-10                                      42.83Ki ± 4%               41.87Ki ± 0%   -2.22% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10                                      43.28Ki ± 3%               41.84Ki ± 0%   -3.32% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10                                     37.59Ki ± 0%               41.92Ki ± 0%  +11.51% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10                                   37.67Ki ± 1%               42.66Ki ± 0%  +13.23% (p=0.002 n=6)

                                                         │                  allocs/op                   │        allocs/op          vs base               │
NodeLivenessScanStorage/num-live=2/compacted=false-10                                       105.00 ± 8%                 85.00 ± 0%  -19.05% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=5/compacted=false-10                                       107.00 ± 5%                 85.00 ± 0%  -20.56% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=10/compacted=false-10                                       74.00 ± 1%                 85.00 ± 0%  +14.86% (p=0.002 n=6)
NodeLivenessScanStorage/num-live=1000/compacted=false-10                                     79.00 ± 1%                 92.00 ± 1%  +16.46% (p=0.002 n=6)
```

Relates to https://github.com/cockroachlabs/support/issues/2665

Epic: none

Release note: None

113229: kv,server,roachpb: avoid error overhead for x-locality comparison r=pavelkalinnikov a=kvoli

Cross locality traffic instrumentation  was added to raft, snapshots and batch requests to quantify the amount of cross region/zone traffic. Errors would be returned from `CompareWithLocality` when the region, or zone locality flags were set in an unsupported manner according to our documentation. These error allocations added overhead (cpu/mem) when hit.

Alter `CompareWithLocality` to return booleans in place of an error to reduce overhead.

Resolves: #111148
Resolves: #111142
Informs: #111561
Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Oct 30, 2023
Cross locality traffic instrumentation  was added to raft, snapshots and
batch requests to quantify the amount of cross region/zone traffic.
Errors would be returned from `CompareWithLocality` when the region, or
zone locality flags were set in an unsupported manner according to our
documentation. These error allocations added overhead (cpu/mem) when
hit.

Alter `CompareWithLocality` to return booleans in place of an error to
reduce overhead.

Resolves: #111148
Resolves: #111142
Informs: #111561
Release note: None
@erikgrinaker
Contributor Author

@pavelkalinnikov We're planning to cut the .0 branch today; can we get the remaining fixes backported before then?

@pav-kv
Collaborator

pav-kv commented Oct 31, 2023

There is only #113043 left. I can probably work around the linter and land it today.

craig bot pushed a commit that referenced this issue Oct 31, 2023
113043: clusterversion: benchmark and rm Unmarshal allocs r=erikgrinaker a=pavelkalinnikov

Currently, each `IsActive` does a memory allocation:

```
==================== Test output for //pkg/clusterversion:clusterversion_test:
goos: darwin
goarch: arm64
BenchmarkClusterVersionSettingIsActive
BenchmarkClusterVersionSettingIsActive-10       28778041                42.03 ns/op           16 B/op          1 allocs/op
PASS
```

Since the cluster version check is in many hot paths, we should eliminate this allocation.

After:

```
==================== Test output for //pkg/clusterversion:clusterversion_test:
goos: darwin
goarch: arm64
BenchmarkClusterVersionSettingIsActive
BenchmarkClusterVersionSettingIsActive-10       45417914                26.43 ns/op            0 B/op          0 allocs/op
PASS
```

Touches #111561
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv
Collaborator

pav-kv commented Oct 31, 2023

New results from this Monday. These don't yet contain the proto Unmarshal fix, but some/all other optimizations should be in already.

There is still some room for improvement.

@pav-kv
Collaborator

pav-kv commented Nov 2, 2023

Taking another look (the first one was here) at LogStore_StoreEntries after all the fixes.

23.1:

BenchmarkLogStore_StoreEntries/bytes=1.0_KiB
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24                   924716              1267 ns/op             157 B/op          0 allocs/op
BenchmarkLogStore_StoreEntries/bytes=256_KiB
BenchmarkLogStore_StoreEntries/bytes=256_KiB-24                    31063             38715 ns/op             239 B/op          0 allocs/op
BenchmarkLogStore_StoreEntries/bytes=512_KiB
BenchmarkLogStore_StoreEntries/bytes=512_KiB-24                     9567            121584 ns/op             558 B/op          0 allocs/op
BenchmarkLogStore_StoreEntries/bytes=1.0_MiB
BenchmarkLogStore_StoreEntries/bytes=1.0_MiB-24                     1256            855636 ns/op         2207006 B/op          4 allocs/op
BenchmarkLogStore_StoreEntries/bytes=2.0_MiB
BenchmarkLogStore_StoreEntries/bytes=2.0_MiB-24                      572           1848153 ns/op         4655889 B/op          5 allocs/op

master:

BenchmarkLogStore_StoreEntries/bytes=1.0_KiB
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24                   705525              1429 ns/op             333 B/op          1 allocs/op
BenchmarkLogStore_StoreEntries/bytes=256_KiB
BenchmarkLogStore_StoreEntries/bytes=256_KiB-24                    28836             40695 ns/op             402 B/op          1 allocs/op
BenchmarkLogStore_StoreEntries/bytes=512_KiB
BenchmarkLogStore_StoreEntries/bytes=512_KiB-24                     9249            120265 ns/op             576 B/op          1 allocs/op
BenchmarkLogStore_StoreEntries/bytes=1.0_MiB
BenchmarkLogStore_StoreEntries/bytes=1.0_MiB-24                     1346            844536 ns/op         2187245 B/op          4 allocs/op
BenchmarkLogStore_StoreEntries/bytes=2.0_MiB
BenchmarkLogStore_StoreEntries/bytes=2.0_MiB-24                      645           1716122 ns/op         4522320 B/op          5 allocs/op

The "big" tests are even. The "small" tests allocate a bit more, and therefore are slower. Figuring out the reason now.

The clear outlier is logstore.logAppend:

[Before/after memory profile screenshots]

The reason (I double-checked before/after) is #107680. It probably added an allocation here (the size of MVCCStats is ~168 bytes, which matches the diff):

opts := storage.MVCCWriteOptions{
	Stats: &diff,
}

@erikgrinaker do you think there is a quick fix? One way would be returning the stats delta rather than providing an "output" parameter. But that's not quick, and might have cons.

@erikgrinaker
Contributor Author

I'm not sure we necessarily care about this allocation. The IO cost of log appends is probably high enough that the allocation doesn't make a significant difference, and this cost is constant for all appended entries. This benchmark uses an in-memory engine, which hides the IO cost (and we're only benchmarking the inner logic).

I suppose we could pool these to amortize the allocation cost though.
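
For illustration, a rough sketch of the pooling idea (hypothetical names; bundling the stats with the already-pooled value so the extra pool costs nothing):

```
// Hypothetical sketch: reuse one scratch struct per append so the MVCCStats
// (and the roachpb.Value next to it) stop escaping to the heap on every call.
var appendScratchPool = sync.Pool{
	New: func() interface{} { return new(appendScratch) },
}

type appendScratch struct {
	value roachpb.Value
	stats enginepb.MVCCStats
}

func withAppendScratch(fn func(s *appendScratch)) {
	s := appendScratchPool.Get().(*appendScratch)
	defer func() {
		*s = appendScratch{} // reset before returning to the pool
		appendScratchPool.Put(s)
	}()
	fn(s) // e.g. pass &s.stats as MVCCWriteOptions.Stats
}
```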

@pav-kv
Collaborator

pav-kv commented Nov 3, 2023

Yeah, I feel similarly. I guess we can leave it then; we just need to be mindful that it might impact other microbenchmarks that are yet to be investigated.

@erikgrinaker
Contributor Author

Can you try to pool them and see if it makes any difference on BenchmarkKV/Insert in pkg/sql/tests?

craig bot pushed a commit that referenced this issue Nov 3, 2023
113742: logstore: pool MVCCStats to avoid hot path allocs r=erikgrinaker a=pavelkalinnikov

Before this commit, `logstore.logAppend` allocated `MVCCStats` on heap despite its lifetime being confined within this function's scope.

This commit switches to allocating `MVCCStats` on a `sync.Pool`. Since `logAppend` already has another pooled `roachpb.Value` with exactly the same lifetime, coalesce the allocation of both and put them in the same pool. This makes the cost of this change zero.

Microbenchmark results:

```
// before
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24 837254   1270 ns/op  333 B/op  1 allocs/op
// after
BenchmarkLogStore_StoreEntries/bytes=1.0_KiB-24 1000000  1153 ns/op  157 B/op  0 allocs/op
```

Part of #111561
Epic: none
Release note: none

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv
Collaborator

pav-kv commented Nov 3, 2023

Turns out there is a simple fix, so we're just fixing it: #113742

@pav-kv
Collaborator

pav-kv commented Nov 7, 2023

The replication-specific part is done. If there is any Go- or proto-induced regression, its impact should be investigated in a broader scope.

@pav-kv pav-kv closed this as completed Nov 7, 2023