admission: reduce over-admission into the goroutine scheduler for latency isolation #91536
Here is an approach for fixing both (a) this over-admission problem and (b) the priority inversion problem for SQLKVResponseWork and SQLSQLResponseWork (#85471). We eliminate the notion of slots and, with every sample, decide how many requests to admit until the next sample.
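A minimal sketch of this token idea, assuming a hypothetical `tokenGranter` (illustrative, not the prototype's code): the sampling callback sets a fresh token budget each tick, and admitting a request is just taking a token.

```go
package main

import (
	"sync"
	"time"
)

// tokenGranter replaces slots: each sampling tick decides how many requests
// to admit until the next tick. Since admission is just taking a token,
// there is no need to know when admitted work completes, and KV,
// SQLKVResponse, and SQLSQLResponse work can share one WorkQueue.
type tokenGranter struct {
	mu     sync.Mutex
	cond   *sync.Cond
	tokens int
}

func newTokenGranter() *tokenGranter {
	g := &tokenGranter{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// admit blocks until a token is available.
func (g *tokenGranter) admit() {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.tokens == 0 {
		g.cond.Wait()
	}
	g.tokens--
}

// refill is the sampling callback. Tokens do not carry over between ticks,
// which keeps the feedback loop tight.
func (g *tokenGranter) refill(n int) {
	g.mu.Lock()
	g.tokens = n
	g.mu.Unlock()
	g.cond.Broadcast()
}

func main() {
	g := newTokenGranter()
	go func() {
		// The prototype samples far faster (~50 KHz via nanosleep); a 1ms
		// ticker is a coarse stand-in. A real implementation would compute
		// n from scheduler state (e.g. runnable goroutine counts).
		for range time.Tick(time.Millisecond) {
			g.refill(4)
		}
	}()
	g.admit() // an admission request simply grabs a token
}
```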
By sampling at high frequency and deciding exactly how many to admit until the next sample, the feedback loop is tighter than when adjusting a slot count. Also, we don't need to know whether admitted work has completed, since there is no slot count. An admission request is simply grabbing a "token", so KV, SQLKVResponse, and SQLSQLResponse can share the same WorkQueue. On Linux, nanosleep along with
And a run with
A hacked-up prototype of this is in https://github.com/sumeerbhola/cockroach/tree/ac_cpu. kv95 achieves 86% CPU utilization on a single-node cluster; kv95 has tiny batches, so this is probably close to a worst case. The sampling callbacks were running at 33 KHz, slightly slower than the configured 50 KHz (NB: if a callback is delayed because no CPU is available, that is relatively harmless, since the work done in the callback is needed only to prevent the CPU from becoming idle). On an unloaded machine they run at 40 KHz. The 99th percentile of goroutine scheduling delay was ~750us, which is low (very desirable). In comparison, the current slot mechanism achieves 95% CPU utilization, but with a goroutine scheduling latency p99 of 17ms (see plot below, 19:01-19:04). The CPU overhead of this approach is modest: 0.83% of the 8-CPU machine is spent in the nanosleep.
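(Aside: the goroutine scheduling delay quoted above can be observed in-process via the Go runtime's `runtime/metrics` package; a minimal sketch, with an illustrative `schedLatencyP99` helper that is not from the prototype.)

```go
package main

import (
	"fmt"
	"math"
	"runtime/metrics"
)

// schedLatencyP99 reads the runtime's histogram of goroutine
// runnable=>running latencies (Go 1.17+) and returns an approximate
// 99th percentile in seconds.
func schedLatencyP99() float64 {
	s := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(s)
	h := s[0].Value.Float64Histogram()
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := uint64(math.Ceil(float64(total) * 0.99))
	var cum uint64
	for i, c := range h.Counts {
		cum += c
		if cum >= target {
			// Report the bucket's upper bound as a conservative estimate;
			// the last bucket is unbounded, so fall back to its lower bound.
			if hi := h.Buckets[i+1]; !math.IsInf(hi, +1) {
				return hi
			}
			return h.Buckets[i]
		}
	}
	return 0
}

func main() {
	fmt.Printf("goroutine scheduling delay p99: %.6fs\n", schedLatencyP99())
}
```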
There is instability with 64 CPUs, especially when the concurrency of the workload is very high.
Lost one of the nodes to #123146. Pre-severe overload: 20% of the CPU profile was in executeBatchWithConcurrencyRetries.
Attaching a profile from the severe overload.
The number of open txns is only ~5000 when it slips into badness. This is a 6-node cluster with 8 vCPUs each. If we allowed 200 open txns per vCPU (h/t @mwang1026 for this rule of thumb from other systems), that would be 6*8*200 = 9600 open txns, which would not prevent this cluster from reaching the bad state.
Ack. 200 was just an example; we can tune it to what we think is a reasonable upper bound, one that balances not being too aggressive against being so conservative that it never actually takes effect.
(Fixing the issue described here will make things better for storage servers in multi-tenant settings. To improve things where user-facing SQL runs on the same node as KV and storage, we additionally need to address #85471.)
The goal of `kvSlotAdjuster` is to achieve close to 100% CPU utilization (work-conserving) while not forming long queues of runnable goroutines in the goroutine scheduler (the scheduler cannot differentiate between work the way the admission control WorkQueue can). As discussed in https://docs.google.com/document/d/16RISoB3zOORf24k0fQPDJ4tTCisBVh-PkCZEit4iwV8/edit#heading=h.t3244p17un83, changing the scheduler to integrate admission and CPU scheduling would be the ideal solution, but that is not an option.

The current heuristic (sketched below) adjusts work concurrency (the slot count) by +1 or -1 based on whether the runnable queue length is below or above a threshold, defined by `admission.kv_slot_adjuster.overload_threshold`, which defaults to 32. This metric is sampled every 1ms, in order to adjust quickly. Despite this high sampling rate, the heuristic is not performing well: while experimenting with the elastic tokens scheme, we noticed that the runnable=>running wait-time histograms of goroutines show significant queuing delay in the goroutine scheduler under this heuristic, which suggests that it over-admits. In the early days of admission control, there were some experiments with lower values of `admission.kv_slot_adjuster.overload_threshold`, both 16 and 8, which resulted in the CPU being idle despite the WorkQueue having waiting work. We should confirm this behavior.

The other avenue we can pursue to reduce over-admission is to use a different metric than the mean runnable queue length sampled every 1ms (though whatever we choose will probably also need to be sampled at 1000Hz). Some rough initial ideas:
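For reference, the +1/-1 heuristic described above amounts to roughly the following; a simplified sketch with illustrative names, not the actual kvSlotAdjuster code:

```go
package main

import "fmt"

// Default of admission.kv_slot_adjuster.overload_threshold.
const overloadThreshold = 32

type slotAdjuster struct {
	totalSlots int // concurrency allowed for KV work
}

// adjust is invoked every 1ms with the sampled runnable queue length:
// shrink concurrency when the scheduler looks overloaded, grow it when
// there is headroom.
func (a *slotAdjuster) adjust(runnableQueueLen int) {
	if runnableQueueLen > overloadThreshold {
		if a.totalSlots > 1 {
			a.totalSlots--
		}
	} else {
		a.totalSlots++
	}
}

func main() {
	a := &slotAdjuster{totalSlots: 8}
	for _, runnable := range []int{10, 40, 40, 5} {
		a.adjust(runnable)
		fmt.Println("slots:", a.totalSlots)
	}
}
```

The tension noted above: the default threshold of 32 over-admits, while lower thresholds (16, 8) left the CPU idle despite the WorkQueue having waiting work.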
Jira issue: CRDB-21309
Epic: CRDB-25469