
roachperf: kv95 and kv0 regression around October 28th #72499

Closed
erikgrinaker opened this issue Nov 6, 2021 · 19 comments · Fixed by #72740
Assignees
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@erikgrinaker
Contributor

Some versions of kv95 and kv0 saw a ~10% performance regression around October 28th, e.g. kv95/enc=false/nodes=1/cpu=32:

[Screenshot 2021-11-06 at 14 25 52]

This is even more pronounced with 64KB payloads, e.g. kv95/enc=false/nodes=3/cpu=32/size=64kb:

[Screenshot 2021-11-06 at 14 29 17]

Admission control was enabled by default in #68535, which merged on October 28th, so that seems like a likely culprit. Running a couple of benchmarks to confirm.

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-roachperf-investigation Issue opened as a result of roachperf triage labels Nov 6, 2021
@erikgrinaker erikgrinaker self-assigned this Nov 6, 2021
@erikgrinaker
Contributor Author

Confirmed, comparing 3207354 (admission control disabled) with 7e5741b (admission control enabled) shows a ~20% ops/sec regression on kv95/enc=false/nodes=3/cpu=32/size=64kb:

name                                     old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=3/cpu=32/size=64kb   11.0k ± 1%    8.6k ± 2%  -21.82%  (p=0.000 n=10+10)

name                                     old p50      new p50      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb    2.80 ± 0%    1.50 ± 0%  -46.43%  (p=0.001 n=7+7)

name                                     old p95      new p95      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb    35.7 ± 0%    36.7 ± 8%     ~     (p=0.137 n=8+10)

name                                     old p99      new p99      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb     419 ± 0%     481 ± 2%  +14.67%  (p=0.000 n=8+9)

@sumeerbhola Is this regression expected and accepted with admission control?

@tbg
Member

tbg commented Nov 9, 2021

cc @sumeerbhola

@sumeerbhola
Collaborator

The regression should only show up in cases where the LSM was degrading prior to admission control (admission control is not expected to reduce throughput when the bottleneck is CPU). If one clicks through to the details of a run and sees a fluctuating rate like the following, it is because under LSM overload the write tokens are adjusted every 15s: the LSM is unhealthy for 15s, so fewer tokens are granted, which lets it transition back to healthy, which yields unconstrained tokens again, and then the cycle repeats. Admission control only affects writes in this case, but since we issue a sequence of reads and writes from the same workers, reducing write throughput also reduces read throughput.

[Screen Shot 2021-11-10 at 11 15 43 AM]
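
To make the cycle concrete, here is a minimal Go sketch of the 15-second token adjustment described above, with hypothetical names and thresholds (an illustration only, not the actual admission package code):

```go
// Package admissionsketch illustrates the 15s token-adjustment cycle
// described above. Names and thresholds are hypothetical.
package admissionsketch

import (
	"math"
	"time"
)

// lsmHealth stands in for whatever overload signal is consulted
// (e.g. L0 sub-level count). Hypothetical, for illustration only.
type lsmHealth struct{ l0Sublevels int }

func (h lsmHealth) overloaded() bool { return h.l0Sublevels > 20 }

// adjustTokens recomputes the write-token budget on a 15-second tick:
// while the LSM looks overloaded, tokens are constrained (throttling
// writes); once it recovers, tokens become effectively unconstrained,
// throughput rises, the LSM tips back into overload, and the cycle
// repeats -- producing the sawtooth rate in the screenshot above.
func adjustTokens(health func() lsmHealth, grant func(tokens float64), done <-chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			if health().overloaded() {
				grant(1000) // constrained budget; the number is illustrative
			} else {
				grant(math.Inf(1)) // effectively unconstrained
			}
		}
	}
}
```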

The corresponding no-admission roachtest has a higher mean rate, but note the decline in rate over time as the LSM becomes more and more unhealthy. It is not a sustainable rate, but since we only run for 10min we have overlooked this problem.

[Screen Shot 2021-11-10 at 11 20 05 AM]

You should see the same thing with the kv0 slowdowns.

But this is not the case with kv95/enc=false/nodes=1/cpu=32, and kv95 typically needs larger block sizes to cause write overload, which makes me suspect a different root cause there. We didn't have an "admission" variant of this test before admission control was enabled, but we did have one for kv95/enc=false/nodes=3/cpu=32. See the screenshots below for the runs with and without admission on a randomly picked day (Oct 17) -- they are close:

[Screen Shot 2021-11-10 at 11 27 53 AM]

[Screen Shot 2021-11-10 at 11 29 03 AM]

@erikgrinaker
Contributor Author

The corresponding no-admission roachtest has a higher mean rate, but note the decline in rate over time as the LSM becomes more and more unhealthy. It is not a sustainable rate, but since we only run for 10min we have overlooked this problem.

Interesting find. This seems fine then, since presumably running for another 10 minutes could have shown an overall improvement when admission control is enabled.

But this is not the case with kv95/enc=false/nodes=1/cpu=32, and kv95 typically needs to have larger block sizes to cause write overload, which makes me suspicious that it could be a different root cause.

Interesting, I'll run another bisection and see if something turns up. Thanks!

@tbg
Member

tbg commented Nov 11, 2021

In light of this, should we consider running these perf tests for longer than ten minutes (say thirty minutes)? This will completely shift the graphs we see (since they use total qps or something like that) but it may give us better data in the long run?

@erikgrinaker
Contributor Author

Yeah, 30 minutes makes sense. That probably shouldn't be the default though, since people will run these from the command-line as well, but for the nightlies that seems reasonable.

I'll submit a PR.

@tbg
Member

tbg commented Nov 11, 2021

Yeah, 30 minutes makes sense. That probably shouldn't be the default though, since people will run these from the command-line as well, but for the nightlies that seems reasonable. I'll submit a PR.

Thanks! You don't think that approach will be confusing to people trying to repro the numbers? I would've made 30m the default, with an "easy" way to lower it to 10 minutes when desired, perhaps through an env var (I know you love those!) or a small code change (but a code change sucks for bisecting).

@erikgrinaker
Contributor Author

OK, we can make 30 the default.

@erikgrinaker
Contributor Author

#72646

@erikgrinaker
Contributor Author

Bisected to bba05aa.

@ajwerner Looks like #71936 caused a ~10% hit on kv95/enc=false/nodes=1/cpu=32 and other benchmarks.

@ajwerner
Contributor

wowza, ack, okay

@ajwerner ajwerner self-assigned this Nov 12, 2021
@erikgrinaker erikgrinaker removed their assignment Nov 13, 2021
@ajwerner
Contributor

I've been benchmarking and tuning this. Indeed #71936 caused a good bit of the pain. However, I don't think it was the full 10%; I think it was closer to 7%, and another bit was caused by admission control. I went and tuned the catalog layer more in #72740 such that I think it makes up for both, but I see that the admission control code now shows up prominently in the mutex profile.

[Screen Shot 2021-11-15 at 9 59 33 AM]

Turning off admission control seems to show about a 3-4% improvement in that benchmark. My guess is that we should be running that benchmark at a higher concurrency, at which point I suspect it'd show more. I've run the benchmark both with admission control off and with it on (both with the patch in #72740), and the results are around 103893.2 vs 99933.6 total throughput respectively.
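
For reference, mutex profiles like the one above come from the stock Go runtime mechanism; a minimal, generic sketch (not CockroachDB's own profiling wiring):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Sample roughly 1 in 100 mutex contention events; 0 disables sampling.
	// The profile accumulates from the moment the fraction is set, which is
	// why a profile fetched later covers most of the run unless requested
	// as a delta (see the ?seconds= note further down the thread).
	runtime.SetMutexProfileFraction(100)

	// ... run the workload under test ...

	// Write the accumulated mutex profile in pprof's protobuf format.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```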

craig bot pushed a commit that referenced this issue Nov 15, 2021
72669: bazel: properly generate `.eg.go` code in `pkg/sql/colconv` via bazel r=rail a=rickystewart

Release note: None

72714: sql/catalog/lease: permit gaps in descriptor history r=ajwerner a=ajwerner

In #71239, we added a new mechanism to look up historical descriptors. I
erroneously informed @jameswsj10 that we would never have gaps in the
descriptor history, and, thus, when looking up historical descriptors, we
could always use the earliest descriptor's modification time as the bounds
for the relevant query.

This turns out to not be true. Consider the case where version 3 is a
historical version and then version 4 pops up and gets leased. Version 3 will
get removed if it is not referenced. In the meantime, version 3 existed when we
went to find version 2. At that point, we'll inject version 2 and have
version 4 leased. We need to make sure we can handle the case where we need to
fetch version 3.

In the meantime, this change also removes some logic added to support the
eventual resurrection of #59606 whereby we'll use the export request to fetch
descriptor history to power historical queries even in the face of descriptors
having been deleted.

Fixes #72706.

Release note: None

72740: sql/catalog/descs: fix perf regression  r=ajwerner a=ajwerner

A commit in #71936 had the unfortunate side effect of allocating and forcing reads on the `uncommittedDescriptors` set even when we aren't looking for the system database. This has an outsized impact on the performance of the single-node, high-core-count KV runs. Instead of always initializing the system database, initialize it only when we access it.

```
name             old ops/s   new ops/s   delta
KV95-throughput  88.6k ± 0%  94.8k ± 1%   +7.00%  (p=0.008 n=5+5)

name             old ms/s    new ms/s    delta
KV95-P50          1.60 ± 0%   1.40 ± 0%  -12.50%  (p=0.008 n=5+5)
KV95-Avg          0.60 ± 0%   0.50 ± 0%  -16.67%  (p=0.008 n=5+5)
```
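
The shape of the fix is plain lazy initialization; a minimal sketch with hypothetical type names (the actual change lives in `pkg/sql/catalog/descs`):

```go
// Package descsketch illustrates the lazy-initialization pattern; these
// types are hypothetical, not the actual descs.Collection code.
package descsketch

type databaseDescriptor struct{ name string }

type descriptorCollection struct {
	// systemDB stays nil until the system database is actually requested,
	// so lookups that never touch it pay no allocation or read cost.
	systemDB *databaseDescriptor
}

// systemDatabase initializes the cached descriptor only on first access.
func (c *descriptorCollection) systemDatabase() *databaseDescriptor {
	if c.systemDB == nil {
		c.systemDB = &databaseDescriptor{name: "system"}
	}
	return c.systemDB
}
```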

The second commit is more speculative and came from looking at a profile where 1.6% of the allocated garbage was due to that `NameInfo` even though we'll never, ever hit it.

[Screen Shot 2021-11-15 at 12 57 31 AM: https://user-images.githubusercontent.com/1839234/141729924-d00eebab-b35c-42bd-8d0b-ee39f3ac7d46.png]
 
 Fixes #72499

Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@craig craig bot closed this as completed in 0faba2d Nov 15, 2021
@sumeerbhola
Collaborator

@ajwerner thanks for investigating.
What time interval was the mutex profile taken over? And do you know what the cpu utilization value was? It is very possible that contention is dropping cpu utilization peak by a few percent, given that the KV benchmark has tiny work units being subject to admission, similar to the observation in #65614 (comment)

@ajwerner
Contributor

@ajwerner thanks for investigating. What time interval was the mutex profile taken over? And do you know what the cpu utilization value was? It is very possible that contention is dropping cpu utilization peak by a few percent, given that the KV benchmark has tiny work units being subject to admission, similar to the observation in #65614 (comment)

I think the mutex profile covers time since the process started, though I could be wrong. It was probably around a minute into the benchmark run. I wasn't running it for that long. I'm setting it back up now.


Turning off admission control had only a very modest impact on CPU utilization. There was an initial throughput bump, but interestingly, it didn't last long. Tail latency went down and stayed down, but from a small number. I turned off admission control at 23:29.

[Screen Shot 2021-11-15 at 6 31 03 PM]

[Screen Shot 2021-11-15 at 6 30 53 PM]

[Screen Shot 2021-11-15 at 6 36 07 PM]

@ajwerner
Contributor

The above were run at --concurrency 96 with 1000 splits; 96 is probably too many. The original experiment was at 64, and this one was at 64. There was a regime change I don't understand that happened with admission control after 23:41, and then another when I turned it off at 23:42.
[Screen Shot 2021-11-15 at 6 43 26 PM]

[Screen Shot 2021-11-15 at 6 45 06 PM]

I then turned it back on around 23:47, and saw the throughput fall again. The CPU utilization fell a tad, from ~97.4% to ~96.2%.
[Screen Shot 2021-11-15 at 6 48 43 PM]

@sumeerbhola
Collaborator

With that low a number of runnable goroutines, there should be no queueing in admission control (the admission control page in the DB console should confirm that), so any effect would be contention from acquiring the mutexes WorkQueue.admitMu, WorkQueue.mu, and GrantCoordinator.mu. The first one would be the contended one.

@tbg
Member

tbg commented Nov 16, 2021

I think the mutex profile covers time since the process started, though I could be wrong. It was probably around a minute into the benchmark run. I wasn't running it for that long. I'm setting it back up now.

Just a drive-by to mention that, to the best of my knowledge, on recent Go versions the mutex profile supports the `?seconds=5` delta-profile parameter as well.
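
A minimal sketch of grabbing such a delta profile from a node's pprof endpoint (the address below is an assumption; substitute the node's HTTP port):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch a 5-second delta mutex profile from the debug endpoint.
	// The host/port is an assumption; substitute the node's HTTP address.
	resp, err := http.Get("http://localhost:8080/debug/pprof/mutex?seconds=5")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("mutex_delta.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```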

@sumeerbhola
Collaborator

@ajwerner if admission control is not decreasing cpu utilization, and we are seeing a 3-4% decline in throughput, the admission control code should be showing up as significant in the cpu profile -- do you have one?

@ajwerner
Contributor

profile_without.pb.gz
profile_with.pb.gz

https://share.polarsignals.com/c7ecb30/ -- this is the profile with admission control on.

I have the binary if you want it to get line info. Looks like admission control is ~1% of CPU samples. It does look like some of the samples are in _doSpin/procYield on mutexes.

[Screen Shot 2021-11-16 at 1 27 47 PM]

craig bot pushed a commit that referenced this issue Dec 9, 2021
72646: roachtest: bump KV and YCSB benchmark durations r=stevendanna,srosenberg a=erikgrinaker

roachtest benchmark performance often deteriorates over time, e.g. due
to garbage buildup in the LSM and MVCC. We previously ran benchmarks for
10 minutes, but this is often not long enough for this buildup to
significantly affect the benchmark.

To better take this performance deterioration into account, this patch
changes the default duration for KV and YCSB benchmarks from 10 minutes
to 30 minutes. It also adds the environment variables
`ROACHTEST_KV_FLAGS` and `ROACHTEST_YCSB_FLAGS` to pass additional flags
to the workloads in order to override the duration (e.g.
`ROACHTEST_KV_FLAGS="--duration 5m"`).
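
A minimal sketch of how an override like this can be wired up, with hypothetical helper names (not the actual roachtest code; `strings.Fields` ignores quoting, which a real implementation would need to handle):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kvWorkloadArgs sketches wiring up an env-var override like
// ROACHTEST_KV_FLAGS. The defaults and helper name are hypothetical.
func kvWorkloadArgs() []string {
	args := []string{"./workload", "run", "kv", "--duration", "30m"}
	if extra := os.Getenv("ROACHTEST_KV_FLAGS"); extra != "" {
		// Appended flags such as "--duration 5m" are meant to override
		// the defaults above.
		args = append(args, strings.Fields(extra)...)
	}
	return args
}

func main() {
	fmt.Println(strings.Join(kvWorkloadArgs(), " "))
}
```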

Touches #72499.

Release note: None

A follow-up PR will annotate this change on roachperf graphs.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>