
release-22.2: admission: defend against severe elastic cpu token debt #103648

Merged

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented May 19, 2023

Backport 1/1 commits from #103365.

/cc @cockroachdb/release


Speculative fix for #102817.
Speculative fix for #103359.

In experiments and in production clusters, we observed elastic CPU wait queue durations on the order of minutes, which can be long enough to fail entire backups. Separately, we saw that with the elastic CPU limiter enabled, backups can sometimes be extremely inefficient, likely exporting a single key per ExportRequest (#103365 will improve this observability). We suspect this happened because by the time we actually got to the mvccExportToSST loop, we had already expended the 100ms CPU-time grant we were given on resolving intents or doing conflict resolution. Now we instead start the per-request CPU timer only once we are actually in the mvccExportToSST loop, not counting all the "pre-work" CPU time against our allotted budget. We still deduct CPU tokens for that pre-work to avoid risking over-admission, and export a metric for how much pre-work we're doing. As a defense-in-depth measure, we also reset the elastic CPU token bucket periodically (every 30s), again permitting some amount of over-admission.
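To make the mechanism concrete, here is a minimal Go sketch of the two ideas above: the per-request budget timer starts only after the pre-work, the pre-work is still deducted from the token bucket afterwards, and the bucket is reset periodically so it can never accumulate unbounded debt. This is not the CockroachDB implementation; all names are hypothetical, and wall-clock time stands in for on-CPU time.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// elasticCPUBucket is a hypothetical stand-in for the elastic CPU token bucket.
type elasticCPUBucket struct {
	mu        sync.Mutex
	available time.Duration // can go negative ("debt") due to over-admission
}

// deduct subtracts used CPU time from the bucket, allowing it to go negative.
func (b *elasticCPUBucket) deduct(used time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.available -= used
}

func (b *elasticCPUBucket) availableNanos() time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.available
}

// resetLoop periodically refills the bucket to its capacity, forgiving any
// accumulated debt (the defense-in-depth part of the change).
func (b *elasticCPUBucket) resetLoop(capacity, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			b.mu.Lock()
			b.available = capacity
			b.mu.Unlock()
		case <-stop:
			return
		}
	}
}

// handleExportRequest models a request that does some "pre-work" (intent
// resolution, txn pushes, conflict handling) before its export loop. Only the
// export loop runs against the grant; pre-work time is deducted afterwards so
// it is still paid for.
func handleExportRequest(b *elasticCPUBucket, grant time.Duration) {
	preStart := time.Now()
	doPreWork()
	preWork := time.Since(preStart) // would be exported as pre_work_nanos
	b.deduct(preWork)

	loopStart := time.Now() // the budget timer starts here, after pre-work
	for time.Since(loopStart) < grant {
		exportSomeKeys()
	}
	used := time.Since(loopStart)
	b.deduct(used)
	fmt.Printf("pre-work=%s export=%s available=%s\n", preWork, used, b.availableNanos())
}

func doPreWork()      { time.Sleep(5 * time.Millisecond) } // stand-in for intent resolution etc.
func exportSomeKeys() { time.Sleep(time.Millisecond) }     // stand-in for the export loop body

func main() {
	stop := make(chan struct{})
	defer close(stop)
	bucket := &elasticCPUBucket{available: time.Second}
	go bucket.resetLoop(time.Second, 30*time.Second, stop)
	handleExportRequest(bucket, 100*time.Millisecond)
}
```

In this shape, the worst-case debt is bounded by roughly one reset interval's worth of over-admission, which is why the periodic reset is a defense-in-depth backstop rather than the primary fix.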

While here, we export a few other metrics in the elastic CPU limiter stack to better diagnose issues (an illustrative scraping example follows the list):

  • admission.elastic_cpu.nanos_exhausted_duration, which is the cumulative duration during which elastic CPU tokens were exhausted, in microseconds. We have equivalents for {IO token, CPU slot} exhaustion.
  • admission.elastic_cpu.over_limit_durations, a latency histogram that surfaces exactly how far over the 100ms grant we go, including pre-work. This was quite high before this patch, when ExportRequests could spend their grant resolving intents, pushing txns, and doing conflict resolution; with this patch the export loop itself still gets the full 100ms of on-CPU time.
  • admission.elastic_cpu.pre_work_nanos, the elastic CPU time spent doing pre-work. See the commentary above.
  • admission.elastic_cpu.available_nanos, an instantaneous metric that surfaces exactly how many tokens are available. Useful for understanding how much debt we're in.
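As an illustrative (not part of this PR) way to watch these metrics on a running node, the Go snippet below scrapes the node's Prometheus endpoint and filters for the elastic CPU series. The address and insecure HTTP are assumptions about a local test deployment; adjust for yours.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes an insecure local node exposing its HTTP port on 8080; secure
	// clusters need https and client certs.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The endpoint serves Prometheus text format, where metric names use
	// underscores instead of dots, so match on a substring.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "elastic_cpu") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```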

Release note (bug fix): In earlier patch releases of v23.1, it was possible for backups to be excessively slow, slower than they were in earlier releases. It was also possible for them to fail with errors of the following form: 'operation "Export Request for span ..." timed out after 5m0.001s'. At least one of the reasons for this behavior is now addressed. This problem also affected v22.2 clusters when using the hidden, disabled-by-default cluster setting 'admission.elastic_cpu.enabled'.

Release justification: Bug fix.

@irfansharif irfansharif requested review from sumeerbhola and a team May 19, 2023 03:43
@irfansharif irfansharif requested review from a team as code owners May 19, 2023 03:43
@blathers-crl

blathers-crl bot commented May 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria below are satisfied.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn’t know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif requested a review from a team May 19, 2023 03:57
@irfansharif irfansharif merged commit 1a05abe into cockroachdb:release-22.2 May 19, 2023
1 of 2 checks passed
@irfansharif irfansharif deleted the backport22.2-103365 branch May 19, 2023 18:08