admission: reduce over-admission into the goroutine scheduler for latency isolation #91536

sumeerbhola opened this issue Nov 8, 2022 · 9 comments

sumeerbhola commented Nov 8, 2022

(The issue described here will make things better for storage servers in multi-tenant settings. To improve things where user-facing SQL runs in the same node as KV and storage, we additionally need to address #85471)

The goal of kvSlotAdjuster is to achieve close to 100% CPU utilization (work-conserving) while not forming long queues of runnable goroutines in the goroutine scheduler (since the scheduler is not able to differentiate between work the way the admission control WorkQueue can). As discussed in https://docs.google.com/document/d/16RISoB3zOORf24k0fQPDJ4tTCisBVh-PkCZEit4iwV8/edit#heading=h.t3244p17un83, changing the scheduler to integrate admission and CPU scheduling would be the ideal solution, but that is not an option.

The current heuristic adjusts work concurrency (the slot count) by +1 or -1 based on whether the runnable queue length is below or above a threshold, defined by admission.kv_slot_adjuster.overload_threshold, which defaults to 32. This metric is sampled every 1ms in order to adjust quickly. Despite this high sampling frequency, the heuristic is not performing well: while experimenting with the elastic tokens scheme, we noticed that the runnable=>running wait time histograms of goroutines show significant queuing delay in the goroutine scheduler under this heuristic, which suggests that it over-admits. In the early days of admission control, there were some experiments with lower values of admission.kv_slot_adjuster.overload_threshold (both 16 and 8), which resulted in the CPU being idle despite the WorkQueue having waiting work. We should confirm this behavior.
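
For reference, here is a minimal sketch of this +1/-1 heuristic in Go. The type and names are illustrative, not the actual kvSlotAdjuster code, which has additional details not shown here.

```go
package admissionsketch

// slotAdjuster is a hypothetical, simplified stand-in for the kvSlotAdjuster.
type slotAdjuster struct {
	totalSlots        int // current concurrency limit for KV work
	minSlots          int // floor below which we never shrink
	overloadThreshold int // admission.kv_slot_adjuster.overload_threshold (default 32)
}

// adjust is called roughly every 1ms with the sampled runnable goroutine
// count, normalized per processor.
func (s *slotAdjuster) adjust(runnablePerProc int) {
	if runnablePerProc >= s.overloadThreshold {
		// Long runnable queue: the scheduler looks overloaded, so shrink
		// concurrency by 1 (but never below the floor).
		if s.totalSlots > s.minSlots {
			s.totalSlots--
		}
	} else {
		// Short runnable queue: grow concurrency by 1 to stay work-conserving.
		s.totalSlots++
	}
}
```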

The other avenue we can pursue to reduce over-admission is to use a different metric than mean runnable queue length sampled every 1ms (though whatever we choose will probably need to also be sampled at 1000Hz). Some rough initial ideas:

  • For all BatchRequests that finished in the last 1ms, compute the sum-runnable-time and sum-running-time. If the former is “high”, we may be over-admitting. Note that “high” is hard to define: even a perfect scheduler-doing-admission scheme can have very high sum-runnable-time in order to be work-conserving. For example, consider 1 P where each admitted work item immediately blocks after admission. The scheduler will keep admitting since there is an idle P. Then, after admitting 100 requests, all become runnable, so sum-runnable-time will be high. So this approach does not seem viable.
  • Change the scheduler to count the number of times (or the total time) that the Ps were idle, and do the following at 1000Hz:
    • Under: Compute an indicator of under-admittance, which is the predicate: idle-P-count > 0 AND WorkQueue length > 0.
    • Over: Compute an indicator of over-admittance: idle-P-count == 0 AND sum-runnable-time >= sum-running-time.
    • Right: Compute an indicator of right-admittance and admission backlog: idle-P-count == 0 AND sum-runnable-time < sum-running-time AND WorkQueue length > 0.
    • When Under, increase the slot count by 1. When Over, decrease it by 1. When Right, increase it by 1 with probability 0.5. (A sketch of this scheme follows this list.)
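
For concreteness, here is a rough sketch of how these three indicators could drive the slot count at each 1ms tick. The function signature and the sampled inputs are hypothetical, not existing code.

```go
package admissionsketch

import (
	"math/rand"
	"time"
)

// slotDelta is a hypothetical 1000Hz decision based on the Under/Over/Right
// indicators above. The inputs are assumed to be sampled from the goroutine
// scheduler and the WorkQueue over the last 1ms; the return value is the
// change to apply to the slot count.
func slotDelta(idlePCount int, sumRunnable, sumRunning time.Duration, workQueueLen int) int {
	switch {
	case idlePCount > 0 && workQueueLen > 0:
		return +1 // Under: idle Ps while work is waiting to be admitted.
	case idlePCount == 0 && sumRunnable >= sumRunning:
		return -1 // Over: work waits to run for as long as it runs.
	case idlePCount == 0 && sumRunnable < sumRunning && workQueueLen > 0:
		if rand.Float64() < 0.5 {
			return +1 // Right, with a backlog: probe upward with probability 0.5.
		}
	}
	return 0
}
```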

Jira issue: CRDB-21309

Epic: CRDB-25469

sumeerbhola commented Oct 9, 2023

Here is an approach for fixing both (a) this over-admission problem and (b) the priority inversion problem for SQLKVResponseWork and SQLSQLResponseWork #85471.

We eliminate the notion of slots and, with every sample, decide how many requests to admit until the next sample.

  • Sample the goroutine scheduler (runnable_goroutines, idle_procs) at high frequency, say every 20us.
  • Use a * P - runnable_goroutines + b * (idle_procs - runnable_goroutines) to decide the exact number of new work items to admit (sketched below), where P is the number of procs and a and b are constants we can tune. The idea here is that
    • a * P is our goal for how many runnable goroutines we are willing to tolerate. If a is too low we can have idle Ps, but we are able to tolerate a lower a because we are sampling at high frequency. A low a reduces the goroutine scheduling latency.
    • a * P - runnable_goroutines is the number to admit to get closer to the goal. We clamp this to >= 0.
    • idle_procs - runnable_goroutines represents idleness that can be explained by having too few runnable goroutines. Ideally this should be <= 0, but when it becomes > 0, b is the multiplier to admit a burst.

By sampling at high frequency and deciding exactly how many to admit (until the next sample), the feedback loop is tighter compared to adjusting a slot count. Also, we don't need to know whether some admitted work has completed, since there is no slot count. So an admission request is simply grabbing a "token", and therefore KV, SQLKVResponse, and SQLSQLResponse work can share the same WorkQueue.
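
Here is a minimal Go sketch of the formula above, assuming it runs on every scheduler sample. The function name and the particular values of a and b are illustrative, not what the prototype uses.

```go
package admissionsketch

// admitCount sketches the formula above: it is assumed to run on every
// scheduler sample (~20us) and returns how many waiting requests to admit
// before the next sample. procs is GOMAXPROCS.
func admitCount(runnableGoroutines, idleProcs, procs int) int {
	const a, b = 2, 20 // tunable constants; values here are only examples

	// a*P is the number of runnable goroutines we are willing to tolerate;
	// admit enough to move toward that goal, clamped at zero.
	n := a*procs - runnableGoroutines
	if n < 0 {
		n = 0
	}

	// Idleness that cannot be explained by goroutines already waiting to run:
	// when positive, admit a burst scaled by b.
	if burst := idleProcs - runnableGoroutines; burst > 0 {
		n += b * burst
	}
	return n
}
```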

On Linux, nanosleep combined with chrt --rr 99 (to run with high priority under SCHED_RR) allows us to sleep for 20us fairly accurately.
Here is a run with nanosleep, with a histogram of the observed sleep times.

Total duration: 10.50457843s, Histogram:
[0s,16.383µs): 5
[16.384µs,32.767µs): 398145
[32.768µs,49.151µs): 1140
[49.152µs,65.535µs): 477
[65.536µs,81.919µs): 163
[81.92µs,98.303µs): 45
[98.304µs,114.687µs): 17
[114.688µs,131.071µs): 5
[131.072µs,147.455µs): 2

And here is a run with time.Ticker, which is much worse (causes are discussed in https://github.com/golang/go/issues/44343).

Total duration: 799.794795ms, Histogram:
[0s,16.383µs): 1296
[16.384µs,32.767µs): 1608
[32.768µs,49.151µs): 312
[49.152µs,65.535µs): 65
[65.536µs,81.919µs): 6
[81.92µs,98.303µs): 2
[524.288µs,540.671µs): 1
[901.12µs,917.503µs): 1
[933.888µs,950.271µs): 7
[950.272µs,966.655µs): 17
[966.656µs,983.039µs): 20
[983.04µs,999.423µs): 19
[999.424µs,1.015807ms): 33
[1.015808ms,1.032191ms): 296
[1.032192ms,1.048575ms): 225
[1.048576ms,1.064959ms): 66
[1.06496ms,1.081343ms): 20
[1.081344ms,1.097727ms): 4
[1.114112ms,1.130495ms): 1

A hacked up prototype of this is in https://github.com/sumeerbhola/cockroach/tree/ac_cpu

kv95 achieves 86% cpu utilization on a single-node cluster. kv95 has tiny batches, so this is probably close to a worst case. The sampling callbacks were running at 33 KHz, slower than the configured 50 KHz (NB: if the callback is delayed because no CPU is available, it is relatively harmless, since the work done in the callback is needed only to prevent the CPU from becoming idle). On an unloaded machine they run at 40 KHz.

The 99th percentile of goroutine scheduling delay was ~750us, which is low (very desirable).

[screenshots omitted]

In comparison, the current slot mechanism achieves 95% cpu utilization, but with a goroutine scheduling latency p99 of 17ms (see the plot below, 19:01-19:04). Decreasing admission.kv_slot_adjuster.overload_threshold to 2 (from 19:04 onwards) gives us 81% cpu utilization and 4ms goroutine scheduling latency, which are strictly worse than the 86% cpu utilization and 750us goroutine scheduling latency with this prototype.

[screenshot omitted]

The CPU overhead of this approach is modest: 0.83% of the 8-CPU machine is spent in nanosleep.

@sumeerbhola

A 32-node cluster looks good too, running at ~94% utilization.
[screenshots omitted]

@sumeerbhola

There is instability with 64 CPUs, especially when the concurrency of the workload is very high.
When the WorkQueue queue length is high, the goschedstats frequency drops, sometimes as low as 1500 Hz. I suspect this is because the callback is trying to admit a lot of work. Increasing the b constant from 20 to 100 did not help. However, since the runnable goroutine count was low (0.4 versus the target specified by a=2), I tried increasing a. At a=10, the queue length dropped and the goschedstats callback frequency rose to 30 KHz. A simple feedback loop that adjusts a to get closer to the runnable goroutine goal (as long as there are waiting requests in the WorkQueue) may be sufficient.
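
A minimal sketch of such a feedback loop on a, assuming it runs on the same high-frequency callback and that a per-proc runnable count and the WorkQueue length are available; the goal, step size, and bounds below are illustrative, not measured values.

```go
package admissionsketch

// adjustA sketches the simple feedback loop suggested above: while the
// WorkQueue has waiting requests, nudge the multiplier a up or down so that
// the observed per-proc runnable goroutine count moves toward a fixed goal.
// When nothing is waiting, a low runnable count is not informative, so a is
// left alone.
func adjustA(a, runnablePerProc float64, workQueueLen int) float64 {
	const (
		goalRunnablePerProc = 2.0 // the runnable-goroutine goal (a=2 in the prototype)
		minA, maxA          = 2.0, 16.0
		step                = 0.5
	)
	if workQueueLen == 0 {
		return a
	}
	switch {
	case runnablePerProc < goalRunnablePerProc:
		a += step // under the goal despite a backlog: admit more aggressively
	case runnablePerProc > goalRunnablePerProc:
		a -= step // over the goal: admit less aggressively
	}
	if a < minA {
		a = minA
	} else if a > maxA {
		a = maxA
	}
	return a
}
```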

@sumeerbhola

Running admission-control/tpcc-severe-overload from #121833 with some tweaks (backups turned off; server.max_open_transactions_per_gateway disabled; workload ramp of 1h).

[screenshots omitted]

@sumeerbhola

When throughput started dropping and the CPU was close to saturation, I set admission.sql_kv_response.enabled = false and admission.sql_sql_response.enabled = false to see whether SQL throughput would recover if queueing happened only for KV requests. One can see the AC queueing latency for such requests drop at 18:31:30, but it does not help the throughput.

[screenshot omitted]

There were some liveness blips and lease losses, but they went away.
[screenshot omitted]

@sumeerbhola

Lost one of the nodes to #123146

Pre-severe overload, 20% of the CPU profile was in executeBatchWithConcurrencyRetries.
After, 13% is in executeBatchWithConcurrencyRetries and 10% is in batcheval.HeartbeatTxn, so there isn't much real work happening. And 20% is work related to txn heartbeating.
[screenshot omitted]

Pre-severe overload, this was 6%:
[screenshot omitted]

@sumeerbhola

Attaching a profile from the severe overload: profile_after_drop_node4_with_exhausted.pb.gz

@sumeerbhola

The number of open txns is only ~5000 when the cluster slips into badness. This is a 6-node cluster with 8 vCPUs per node. If we allowed 200 open txns per vCPU (per @mwang1026, a rule of thumb from other systems), that would be 6*8*200 = 9600 open txns, which would not prevent this cluster from reaching the bad state.

[screenshot omitted]

@mwang1026

Ack. 200 was just an example; we can tune it to whatever we think is a reasonable upper bound, balancing not being too aggressive against not being so conservative that it never actually takes effect.
