
release-22.2: kv,rangefeed: integrate catchup scans with elastic cpu #92439

Merged
merged 5 commits into release-22.2 on Nov 28, 2022

Conversation

irfansharif
Contributor

Backport 4/4 commits from #89709.

/cc @cockroachdb/release


Part of #65957. First commit is from #89656.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In #89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU-heavy nature can elevate Go
scheduling latencies, which in turn translate into foreground latency
impact. This commit integrates rangefeed catchup scans with the elastic
CPU limiter we introduced in #86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism, but now set a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something #89589 can help with. We leave admission
integration of parallel worker goroutines to future work.

Unlike export requests, rangefeed catchup scans are non-preemptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup, so we don't have resumption tokens to propagate up to the
caller as we did for backups. We still want CPU-bound work going
through admission control to use at most 100ms of CPU time at a time (it
makes for an ineffective controller otherwise), and to that end we
introduce the following component, used within the rangefeed catchup
iterator:

// Pacer is used in tight loops (CPU-bound) for non-preemptible
// elastic work. Callers are expected to invoke Pace() every loop
// iteration and Close() once done. Internally this type integrates
// with the elastic CPU work queue, acquiring tokens for the CPU work
// being done, and blocking if tokens are unavailable. This allows
// for a form of cooperative scheduling with elastic CPU granters.
type Pacer struct
func (p *Pacer) Pace(ctx context.Context) error { ... }
func (p *Pacer) Close() { ... }
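
To make the shape of this cooperative scheduling concrete, here is a minimal, self-contained sketch of the pacing pattern. The pacer below is only a stand-in for the real kvadmission Pacer: instead of acquiring tokens from the elastic CPU work queue it tracks a ~100ms CPU slice and sleeps once the slice is used up, and all names in it are illustrative assumptions rather than the actual implementation.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // pacer is a stand-in for the Pacer described above. Instead of
    // acquiring tokens from the elastic CPU work queue, it tracks how
    // long the current CPU slice has run and "blocks" (sleeps) once the
    // allotment is used up.
    type pacer struct {
        sliceStart time.Time
        slice      time.Duration
    }

    // Pace is invoked once per loop iteration; it returns early on
    // context cancellation and otherwise yields the CPU whenever the
    // current slice is exhausted.
    func (p *pacer) Pace(ctx context.Context) error {
        if err := ctx.Err(); err != nil {
            return err
        }
        if time.Since(p.sliceStart) > p.slice {
            time.Sleep(time.Millisecond) // stand-in for waiting on fresh tokens
            p.sliceStart = time.Now()
        }
        return nil
    }

    // Close would release any outstanding tokens; a no-op in this sketch.
    func (p *pacer) Close() {}

    func main() {
        ctx := context.Background()
        p := &pacer{sliceStart: time.Now(), slice: 100 * time.Millisecond}
        defer p.Close()

        // Tight, CPU-bound loop standing in for a catchup scan: because
        // Pace() runs every iteration, the work never holds the CPU for
        // much longer than a single slice at a time.
        for i := 0; i < 1_000_000; i++ {
            if err := p.Pace(ctx); err != nil {
                return
            }
            _ = fmt.Sprintf("event-%d", i) // stand-in for emitting a rangefeed event
        }
    }

The real Pacer blocks on the elastic CPU work queue rather than sleeping, which is what lets the granters trade catchup-scan CPU against foreground scheduling latency.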

Release note: None
Release justification: Non-production code. In 22.2 this machinery is all switched off by default.

Pure code movement/renaming. We also renamed the cluster setting
kv.store.admission.provisioned_bandwidth to
kvadmission.store.provisioned_bandwidth.

Release note (general note): We renamed the cluster setting
kv.store.admission.provisioned_bandwidth to
kvadmission.store.provisioned_bandwidth.
There's only ever one URL.

Release note: None
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU-heavy nature can elevate Go
scheduling latencies, which in turn translate into foreground latency
impact. This commit integrates rangefeed catchup scans with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism, but now set a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests, rangefeed catchup scans are non-preemptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup, so we don't have resumption tokens to propagate up to the
caller as we did for backups. We still want CPU-bound work going
through admission control to use at most 100ms of CPU time at a time, to
exercise better control over scheduling latencies. To that end, we
introduce the following component, used within the rangefeed catchup
iterator:

    // Pacer is used in tight loops (CPU-bound) for non-preemptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with the elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
.. catch-up scans, introduce a private cluster setting
(kvadmission.rangefeed_catchup_scan_elastic_control.enabled) to
selectively switch off the catch-up scan integration if needed, and plumb
kvadmission.Pacer explicitly into the rangefeed catchup scan loop instead
of opaquely through the surrounding context.

Release note: None
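
For illustration only, the wiring that commit describes could look roughly like the sketch below. The nil-pacer convention, the maybeMakePacer helper, and the enabled flag are assumptions made for this example; the actual code consults the kvadmission.rangefeed_catchup_scan_elastic_control.enabled setting and passes a kvadmission.Pacer explicitly into the catchup scan loop.

    package main

    import "context"

    // Pacer is a stand-in for kvadmission.Pacer. In this sketch a nil
    // *Pacer is valid and turns Pace/Close into no-ops, so call sites
    // don't need to branch on whether elastic control is enabled.
    type Pacer struct{}

    func (p *Pacer) Pace(ctx context.Context) error {
        if p == nil {
            return nil // pacing switched off
        }
        // ... acquire elastic CPU tokens, blocking if unavailable ...
        return ctx.Err()
    }

    func (p *Pacer) Close() {
        if p == nil {
            return
        }
        // ... return any unused tokens ...
    }

    // maybeMakePacer is hypothetical: it returns a pacer only when the
    // cluster-setting-driven flag is enabled.
    func maybeMakePacer(enabled bool) *Pacer {
        if !enabled {
            return nil
        }
        return &Pacer{}
    }

    // catchupScan receives the pacer as an explicit argument instead of
    // pulling it opaquely out of the context, mirroring the plumbing
    // change described above.
    func catchupScan(ctx context.Context, pacer *Pacer) error {
        defer pacer.Close()
        for i := 0; i < 1000; i++ { // stand-in for iterating the catchup scan
            if err := pacer.Pace(ctx); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        _ = catchupScan(context.Background(), maybeMakePacer(false))
    }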
@irfansharif irfansharif requested a review from a team November 24, 2022 01:52
@irfansharif irfansharif requested review from a team as code owners November 24, 2022 01:52
@irfansharif irfansharif requested a review from a team November 24, 2022 01:52
@irfansharif irfansharif requested a review from a team as a code owner November 24, 2022 01:52
@irfansharif irfansharif requested review from herkolategan and removed request for a team November 24, 2022 01:52
@blathers-crl

blathers-crl bot commented Nov 24, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@irfansharif irfansharif requested review from smg260 and miretskiy and removed request for a team November 24, 2022 01:52
@cockroach-teamcity
Member

This change is Reviewable

We want this to be disabled by default in 22.2, used only selectively.

Release note: None
@miretskiy
Contributor

This is a fairly large PR; even though this functionality is turned off, do we still want to merge it?
I think having a particular client in mind who would turn on this setting would be useful. Otherwise, if there is no such client, why bother?

@shermanCRL
Contributor

@miretskiy Well, we agree that it’s a fairly universal problem of CPU spikes and therefore SQL latency on changefeed backfill. Your question is about the value/risk of the backport, I assume?

We could give more bake time on v23.1 and then re-evaluate the backport. Or @irfansharif can make the case on risk/return.

@irfansharif
Contributor Author

  • Large is in the eye of the beholder; I've backported bigger things, and this one is relatively small to me -- it's not adding new machinery, it's adding a straightforward integration. There's a commit in this PR that carves out a separate package (purely code movement), which contributes to the high line count.
  • We were deliberate about backporting changefeed,kvcoord: populate AC headers for backfill work #90093 earlier, into the .0 release, to pave the way for just this PR.
  • It's all switched off, as you can see. KV L2 pains have taught me it's better to have something available that's switched off than to scramble for a subsequent release/custom binary for an already-fixed problem.
  • There is a customer who's asked for it; see this internal document.
  • This has had almost four weeks of baking time on master now with no fallout. We're not going to learn much more at this point.

@miretskiy
Contributor

Cool -- I just want to make sure we have existing pains we are trying to solve.
