
roachperf: kv95 and kv0 regression around October 28th #72499

Closed
erikgrinaker opened this issue Nov 6, 2021 · 19 comments · Fixed by #72740
Assignees
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@erikgrinaker
Contributor

Some versions of kv95 and kv0 saw a ~10% performance regression around October 28th, e.g. kv95/enc=false/nodes=1/cpu=32:

[Screenshot 2021-11-06 at 14 25 52]

This is even more pronounced with 64KB payloads, e.g. kv95/enc=false/nodes=3/cpu=32/size=64kb:

[Screenshot 2021-11-06 at 14 29 17]

Admission control was enabled by default in #68535, which merged on October 28th, so that seems like a likely culprit. Running a couple of benchmarks to confirm.

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-roachperf-investigation Issue opened as a result of roachperf triage labels Nov 6, 2021
@erikgrinaker erikgrinaker self-assigned this Nov 6, 2021
@erikgrinaker
Contributor Author

Confirmed, comparing 3207354 (admission control disabled) with 7e5741b (admission control enabled) shows a ~20% ops/sec regression on kv95/enc=false/nodes=3/cpu=32/size=64kb:

name                                     old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=3/cpu=32/size=64kb   11.0k ± 1%    8.6k ± 2%  -21.82%  (p=0.000 n=10+10)

name                                     old p50      new p50      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb    2.80 ± 0%    1.50 ± 0%  -46.43%  (p=0.001 n=7+7)

name                                     old p95      new p95      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb    35.7 ± 0%    36.7 ± 8%     ~     (p=0.137 n=8+10)

name                                     old p99      new p99      delta
kv95/enc=false/nodes=3/cpu=32/size=64kb     419 ± 0%     481 ± 2%  +14.67%  (p=0.000 n=8+9)

@sumeerbhola Is this regression expected and accepted with admission control?

@tbg
Member

tbg commented Nov 9, 2021

cc @sumeerbhola

@sumeerbhola
Collaborator

The regression should only show up in cases where the LSM was degrading prior to admission control (admission control is not expected to reduce throughput when the bottleneck is CPU). If one clicks through to the details of a run and sees a fluctuating rate like the following, it is because under LSM overload the write tokens are adjusted every 15s: the LSM is unhealthy for 15s, so fewer tokens are granted, which lets it transition back to healthy, which yields unconstrained tokens again, and then the cycle repeats. Admission control only affects writes in this case, but since we issue a sequence of reads and writes from the same workers, reducing write throughput also reduces read throughput.

[Screen Shot 2021-11-10 at 11 15 43 AM]
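
To make the cycle concrete, here is a minimal Go sketch of the 15-second token adjustment described above, with hypothetical names and thresholds (an illustration only, not the actual admission package code):

```go
// Package admissionsketch illustrates the 15s token-adjustment cycle
// described above. Names and thresholds are hypothetical.
package admissionsketch

import (
	"math"
	"time"
)

// lsmHealth stands in for whatever overload signal is consulted
// (e.g. L0 sub-level count). Hypothetical, for illustration only.
type lsmHealth struct{ l0Sublevels int }

func (h lsmHealth) overloaded() bool { return h.l0Sublevels > 20 }

// adjustTokens recomputes the write-token budget on a 15-second tick:
// while the LSM looks overloaded, tokens are constrained (throttling
// writes); once it recovers, tokens become effectively unconstrained,
// throughput rises, the LSM tips back into overload, and the cycle
// repeats -- producing the sawtooth rate in the screenshot above.
func adjustTokens(health func() lsmHealth, grant func(tokens float64), done <-chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			if health().overloaded() {
				grant(1000) // constrained budget; the number is illustrative
			} else {
				grant(math.Inf(1)) // effectively unconstrained
			}
		}
	}
}
```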

The corresponding no-admission roachtest has a higher mean rate, but note the decline in rate over time as the LSM becomes more and more unhealthy. It is not a sustainable rate, but since we only run for 10min we have overlooked this problem.

[Screen Shot 2021-11-10 at 11 20 05 AM]

You should see the same thing with the kv0 slowdowns.

But this is not the case with kv95/enc=false/nodes=1/cpu=32, and kv95 typically needs larger block sizes to cause write overload, which makes me suspect a different root cause there. We didn't have an "admission" variant of this test before admission control was enabled, but we did have one for kv95/enc=false/nodes=3/cpu=32. See the screenshots below for the runs with and without admission on a randomly picked day (Oct 17) -- they are close:

[Screen Shot 2021-11-10 at 11 27 53 AM]

[Screen Shot 2021-11-10 at 11 29 03 AM]

@erikgrinaker
Contributor Author

The corresponding no-admission roachtest has a higher mean rate, but note the decline in rate over time as the LSM becomes more and more unhealthy. It is not a sustainable rate, but since we only run for 10min we have overlooked this problem.

Interesting find. This seems fine then, since presumably running for another 10 minutes could have shown an overall improvement when admission control is enabled.

But this is not the case with kv95/enc=false/nodes=1/cpu=32, and kv95 typically needs to have larger block sizes to cause write overload, which makes me suspicious that it could be a different root cause.

Interesting, I'll run another bisection and see if something turns up. Thanks!

@tbg
Member

tbg commented Nov 11, 2021

In light of this, should we consider running these perf tests for longer than ten minutes (say thirty minutes)? This will completely shift the graphs we see (since they use total qps or something like that) but it may give us better data in the long run?

@erikgrinaker
Contributor Author

Yeah, 30 minutes makes sense. That probably shouldn't be the default though, since people will run these from the command-line as well, but for the nightlies that seems reasonable.

I'll submit a PR.

@tbg
Member

tbg commented Nov 11, 2021

Yeah, 30 minutes makes sense. That probably shouldn't be the default though, since people will run these from the command-line as well, but for the nightlies that seems reasonable. I'll submit a PR.

Thanks! You don't think that approach will be confusing to people trying to repro the numbers? I would've made 30m the default, with an "easy" way to lower it to 10 minutes when desired, perhaps through an env var (I know you love those!) or a small code change (but a code change sucks for bisecting).

@erikgrinaker
Contributor Author

OK, we can make 30 the default.

@erikgrinaker
Contributor Author

#72646

@erikgrinaker
Contributor Author

Bisected to bba05aa.

@ajwerner Looks like #71936 caused a ~10% hit on kv95/enc=false/nodes=1/cpu=32 and other benchmarks.

@ajwerner
Contributor

wowza, ack, okay

@ajwerner ajwerner self-assigned this Nov 12, 2021
@erikgrinaker erikgrinaker removed their assignment Nov 13, 2021
@ajwerner
Contributor

I've been benchmarking and tuning this. Indeed #71936 caused a good bit of the pain. However, I don't think it was the full 10%; I think it was closer to 7%, and another bit was caused by admission control. I went and tuned the catalog layer more in #72740 such that I think it makes up for both, but I see that the admission control code now shows up prominently in the mutex profile.

[Screen Shot 2021-11-15 at 9 59 33 AM]

Turning off admission control seems to show about a 3-4% improvement in that benchmark. My guess is that we should be running that benchmark at a higher concurrency, at which point I suspect it'd show more. I've run the benchmark both with admission control off and with it on (both with the patch in #72740), and the results are around 103893.2 vs 99933.6 total throughput respectively.
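
For reference, mutex profiles like the one above come from the stock Go runtime mechanism; a minimal, generic sketch (not CockroachDB's own profiling wiring):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Sample roughly 1 in 100 mutex contention events; 0 disables sampling.
	// The profile accumulates from the moment the fraction is set, which is
	// why a profile fetched later covers most of the run unless requested
	// as a delta (see the ?seconds= note further down the thread).
	runtime.SetMutexProfileFraction(100)

	// ... run the workload under test ...

	// Write the accumulated mutex profile in pprof's protobuf format.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```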

craig bot pushed a commit that referenced this issue Nov 15, 2021
72669: bazel: properly generate `.eg.go` code in `pkg/sql/colconv` via bazel r=rail a=rickystewart

Release note: None

72714: sql/catalog/lease: permit gaps in descriptor history r=ajwerner a=ajwerner

In #71239, we added a new mechanism to look up historical descriptors. I
erroneously informed @jameswsj10 that we would never have gaps in the
descriptor history, and, thus, when looking up historical descriptors, we
could always use the earliest descriptor's modification time as the bounds
for the relevant query.

This turns out to not be true. Consider the case where version 3 is a
historical version and then version 4 pops up and gets leased. Version 3 will
get removed if it is not referenced. In the meantime, version 3 existed when we
went to find version 2. At that point, we'll inject version 2 and have
version 4 leased. We need to make sure we can handle the case where we need to
fetch version 3.

In the meantime, this change also removes some logic added to support the
eventual resurrection of #59606 whereby we'll use the export request to fetch
descriptor history to power historical queries even in the face of descriptors
having been deleted.

Fixes #72706.

Release note: None

72740: sql/catalog/descs: fix perf regression  r=ajwerner a=ajwerner

A commit in #71936 had the unfortunate side effect of allocating and forcing reads on the `uncommittedDescriptors` set even when we aren't looking for the system database. This has an outsized impact on the performance of the single-node, high-core-count KV runs. Instead of always initializing the system database, initialize it only when we access it.

```
name             old ops/s   new ops/s   delta
KV95-throughput  88.6k ± 0%  94.8k ± 1%   +7.00%  (p=0.008 n=5+5)

name             old ms/s    new ms/s    delta
KV95-P50          1.60 ± 0%   1.40 ± 0%  -12.50%  (p=0.008 n=5+5)
KV95-Avg          0.60 ± 0%   0.50 ± 0%  -16.67%  (p=0.008 n=5+5)
```
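
The shape of the fix is plain lazy initialization; a minimal sketch with hypothetical type names (the actual change lives in `pkg/sql/catalog/descs`):

```go
// Package descsketch illustrates the lazy-initialization pattern; these
// types are hypothetical, not the actual descs.Collection code.
package descsketch

type databaseDescriptor struct{ name string }

type descriptorCollection struct {
	// systemDB stays nil until the system database is actually requested,
	// so lookups that never touch it pay no allocation or read cost.
	systemDB *databaseDescriptor
}

// systemDatabase initializes the cached descriptor only on first access.
func (c *descriptorCollection) systemDatabase() *databaseDescriptor {
	if c.systemDB == nil {
		c.systemDB = &databaseDescriptor{name: "system"}
	}
	return c.systemDB
}
```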

The second commit is more speculative and came from looking at a profile where 1.6% of the allocated garbage was due to that `NameInfo` even though we'll never, ever hit it.

[Screen Shot 2021-11-15 at 12 57 31 AM: https://user-images.githubusercontent.com/1839234/141729924-d00eebab-b35c-42bd-8d0b-ee39f3ac7d46.png]
 
 Fixes #72499

Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@craig craig bot closed this as completed in 0faba2d Nov 15, 2021
@sumeerbhola
Collaborator

@ajwerner thanks for investigating.
What time interval was the mutex profile taken over? And do you know what the cpu utilization value was? It is very possible that contention is dropping cpu utilization peak by a few percent, given that the KV benchmark has tiny work units being subject to admission, similar to the observation in #65614 (comment)

@ajwerner
Contributor

@ajwerner thanks for investigating. What time interval was the mutex profile taken over? And do you know what the cpu utilization value was? It is very possible that contention is dropping cpu utilization peak by a few percent, given that the KV benchmark has tiny work units being subject to admission, similar to the observation in #65614 (comment)

I think the mutex profile covers time since the process started, though I could be wrong. It was probably around a minute into the benchmark run. I wasn't running it for that long. I'm setting it back up now.


Turning off admission control had only a very modest impact on CPU utilization. There was an initial throughput bump, but interestingly, it didn't last long. Tail latency went down and stayed down, but from a small number. I turned off admission control at 23:29.

[Screen Shot 2021-11-15 at 6 31 03 PM]

[Screen Shot 2021-11-15 at 6 30 53 PM]

[Screen Shot 2021-11-15 at 6 36 07 PM]

@ajwerner
Contributor

The above were run at --concurrency 96 with 1000 splits; 96 is probably too many. The original experiment was at 64, and this one was at 64. There was a regime change I don't understand that happened with admission control after 23:41, and then another when I turned it off at 23:42.
[Screen Shot 2021-11-15 at 6 43 26 PM]

[Screen Shot 2021-11-15 at 6 45 06 PM]

I then turned it back on around 23:47, and saw the throughput fall again. The CPU utilization fell a tad, from ~97.4% to ~96.2%.
[Screen Shot 2021-11-15 at 6 48 43 PM]

@sumeerbhola
Collaborator

With that low a number of runnable goroutines, there should be no queueing in admission control (the admission control page in the DB console should confirm that), so any effect would be contention from acquiring the mutexes WorkQueue.admitMu, WorkQueue.mu, and GrantCoordinator.mu. The first one would be the contended one.

@tbg
Member

tbg commented Nov 16, 2021

I think the mutex profile covers time since the process started, though I could be wrong. It was probably around a minute into the benchmark run. I wasn't running it for that long. I'm setting it back up now.

Just a drive-by to mention that, to the best of my knowledge, on recent Go versions the mutex profile supports the `?seconds=5` delta-profile parameter as well.
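
A minimal sketch of grabbing such a delta profile from a node's pprof endpoint (the address below is an assumption; substitute the node's HTTP port):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch a 5-second delta mutex profile from the debug endpoint.
	// The host/port is an assumption; substitute the node's HTTP address.
	resp, err := http.Get("http://localhost:8080/debug/pprof/mutex?seconds=5")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("mutex_delta.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```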

@sumeerbhola
Collaborator

@ajwerner if admission control is not decreasing cpu utilization, and we are seeing a 3-4% decline in throughput, the admission control code should be showing up as significant in the cpu profile -- do you have one?

@ajwerner
Contributor

profile_without.pb.gz
profile_with.pb.gz

https://share.polarsignals.com/c7ecb30/ -- this is the profile with admission control on.

I have the binary if you want it to get line info. Looks like admission control is ~1% of CPU samples. It does look like some of the samples are in _doSpin/procYield on mutexes.

[Screen Shot 2021-11-16 at 1 27 47 PM]

craig bot pushed a commit that referenced this issue Dec 9, 2021
72646: roachtest: bump KV and YCSB benchmark durations r=stevendanna,srosenberg a=erikgrinaker

roachtest benchmark performance often deteriorates over time, e.g. due
to garbage buildup in the LSM and MVCC. We previously ran benchmarks for
10 minutes, but this is often not long enough for this buildup to
significantly affect the benchmark.

To better take this performance deterioration into account, this patch
changes the default duration for KV and YCSB benchmarks from 10 minutes
to 30 minutes. It also adds the environment variables
`ROACHTEST_KV_FLAGS` and `ROACHTEST_YCSB_FLAGS` to pass additional flags
to the workloads in order to override the duration (e.g.
`ROACHTEST_KV_FLAGS="--duration 5m"`).
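
A minimal sketch of how an override like this can be wired up, with hypothetical helper names (not the actual roachtest code; `strings.Fields` ignores quoting, which a real implementation would need to handle):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kvWorkloadArgs sketches wiring up an env-var override like
// ROACHTEST_KV_FLAGS. The defaults and helper name are hypothetical.
func kvWorkloadArgs() []string {
	args := []string{"./workload", "run", "kv", "--duration", "30m"}
	if extra := os.Getenv("ROACHTEST_KV_FLAGS"); extra != "" {
		// Appended flags such as "--duration 5m" are meant to override
		// the defaults above.
		args = append(args, strings.Fields(extra)...)
	}
	return args
}

func main() {
	fmt.Println(strings.Join(kvWorkloadArgs(), " "))
}
```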

Touches #72499.

Release note: None

A follow-up PR will annotate this change on roachperf graphs.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>