
release-22.2: admission: defend against severe elastic cpu token debt #103648

Merged

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented May 19, 2023

Backport 1/1 commits from #103365.

/cc @cockroachdb/release


Speculative fix for #102817.
Speculative fix for #103359.

In experiments and in production clusters, we observed elastic CPU wait queue durations on the order of minutes, which can be long enough to fail entire backups. Separately, we saw that with the elastic CPU limiter enabled, backups can sometimes be extremely inefficient, likely exporting a single key per ExportRequest (#103365 will improve this observability). We suspect this happened because by the time we actually got to the mvccExportToSST loop, we had already expended the 100ms CPU-time grant we were given on resolving intents or doing conflict resolution. Now we instead start the per-request CPU timer only once we are actually in the mvccExportToSST loop, not counting all the "pre-work" CPU time against our allotted budget. We still deduct CPU tokens for that pre-work to avoid risking over-admission, and export a metric for how much pre-work we're doing. As a defense-in-depth measure, we also reset the elastic CPU token bucket periodically (every 30s), again permitting some amount of over-admission.
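To make the mechanism concrete, here is a minimal Go sketch of the two ideas above: the per-request budget timer starts only after the pre-work, the pre-work is still deducted from the token bucket afterwards, and the bucket is reset periodically so it can never accumulate unbounded debt. This is not the CockroachDB implementation; all names are hypothetical, and wall-clock time stands in for on-CPU time.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// elasticCPUBucket is a hypothetical stand-in for the elastic CPU token bucket.
type elasticCPUBucket struct {
	mu        sync.Mutex
	available time.Duration // can go negative ("debt") due to over-admission
}

// deduct subtracts used CPU time from the bucket, allowing it to go negative.
func (b *elasticCPUBucket) deduct(used time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.available -= used
}

func (b *elasticCPUBucket) availableNanos() time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.available
}

// resetLoop periodically refills the bucket to its capacity, forgiving any
// accumulated debt (the defense-in-depth part of the change).
func (b *elasticCPUBucket) resetLoop(capacity, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			b.mu.Lock()
			b.available = capacity
			b.mu.Unlock()
		case <-stop:
			return
		}
	}
}

// handleExportRequest models a request that does some "pre-work" (intent
// resolution, txn pushes, conflict handling) before its export loop. Only the
// export loop runs against the grant; pre-work time is deducted afterwards so
// it is still paid for.
func handleExportRequest(b *elasticCPUBucket, grant time.Duration) {
	preStart := time.Now()
	doPreWork()
	preWork := time.Since(preStart) // would be exported as pre_work_nanos
	b.deduct(preWork)

	loopStart := time.Now() // the budget timer starts here, after pre-work
	for time.Since(loopStart) < grant {
		exportSomeKeys()
	}
	used := time.Since(loopStart)
	b.deduct(used)
	fmt.Printf("pre-work=%s export=%s available=%s\n", preWork, used, b.availableNanos())
}

func doPreWork()      { time.Sleep(5 * time.Millisecond) } // stand-in for intent resolution etc.
func exportSomeKeys() { time.Sleep(time.Millisecond) }     // stand-in for the export loop body

func main() {
	stop := make(chan struct{})
	defer close(stop)
	bucket := &elasticCPUBucket{available: time.Second}
	go bucket.resetLoop(time.Second, 30*time.Second, stop)
	handleExportRequest(bucket, 100*time.Millisecond)
}
```

In this shape, the worst-case debt is bounded by roughly one reset interval's worth of over-admission, which is why the periodic reset is a defense-in-depth backstop rather than the primary fix.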

While here, we export a few other metrics in the elastic CPU limiter stack to better diagnose issues (an illustrative scraping example follows the list):

  • admission.elastic_cpu.nanos_exhausted_duration, which is the cumulative duration during which elastic CPU tokens were exhausted, in microseconds. We have equivalents for {IO token, CPU slot} exhaustion.
  • admission.elastic_cpu.over_limit_durations, a latency histogram that surfaces exactly how far over the 100ms grant we go, including pre-work. This was quite high before this patch, when ExportRequests could spend their grant resolving intents, pushing txns, and doing conflict resolution; with this patch the export loop itself still gets the full 100ms of on-CPU time.
  • admission.elastic_cpu.pre_work_nanos, the elastic CPU time spent doing pre-work. See the commentary above.
  • admission.elastic_cpu.available_nanos, an instantaneous metric that surfaces exactly how many tokens are available. Useful for understanding how much debt we're in.
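As an illustrative (not part of this PR) way to watch these metrics on a running node, the Go snippet below scrapes the node's Prometheus endpoint and filters for the elastic CPU series. The address and insecure HTTP are assumptions about a local test deployment; adjust for yours.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes an insecure local node exposing its HTTP port on 8080; secure
	// clusters need https and client certs.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The endpoint serves Prometheus text format, where metric names use
	// underscores instead of dots, so match on a substring.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "elastic_cpu") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```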

Release note (bug fix): In earlier patch releases of v23.1, it was possible for backups to be excessively slow, slower than they were in earlier releases. It was also possible for them to fail with errors of the following form: 'operation "Export Request for span ..." timed out after 5m0.001s'. At least one of the reasons for this behavior is now addressed. This problem also affected v22.2 clusters when using the hidden, disabled-by-default cluster setting 'admission.elastic_cpu.enabled'.

Release justification: Bug fix.

@irfansharif irfansharif requested review from sumeerbhola and a team May 19, 2023 03:43
@irfansharif irfansharif requested review from a team as code owners May 19, 2023 03:43
@blathers-crl

blathers-crl bot commented May 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria below are satisfied.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn’t know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif requested a review from a team May 19, 2023 03:57
@irfansharif irfansharif merged commit 1a05abe into cockroachdb:release-22.2 May 19, 2023
1 of 2 checks passed
@irfansharif irfansharif deleted the backport22.2-103365 branch May 19, 2023 18:08