kvserver: collect accurate timings for request execution #72092

Closed · wants to merge 4 commits

Conversation

@tbg (Member) commented Oct 28, 2021

This prototype is an exploration of this comment: cockroachdb#71169 (comment).
Our server-side RPC timing metrics are not great, and in particular they don't
allow us to see what fraction of the end-to-end latency is spent in contention
(which is generally not indicative of the health of the KV layer, since users
can easily generate as much contention as they like).

tbg and others added 3 commits March 4, 2022 20:42
@dhartunian (Collaborator)

@tbg, just as a gut check for understanding: is the need for recording these timings because you want to aggregate together non-contiguous chunks of time? I'm just trying to understand why we can't rely on the timings of the tracing spans themselves for the information you're trying to collect. Or perhaps the cost of creating spans per operation here is too high?

@adityamaru (Contributor) commented May 6, 2022

Just a drive-by, but I am also interested in @dhartunian's question. Over in bulk, we're hoping to rely on trace spans to give us the duration for suboperations in a job (including from the kvserver). The hope is that we can build a generic aggregator as described in #80388 that will use accumulated tracing span durations for each operation (#81079) and StructuredEvents (#80460) to answer "what is the job spending its time doing?".

@andreimatei fixed up Batch requests to send back their trace recordings to the request sender, which can then be imported into the sender's context. So the more children we have in this remote recording, the more granular our view of what is going on will be.
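
For illustration, a minimal sketch of the kind of per-operation aggregation described above, using stand-in types rather than CockroachDB's actual tracing structures:

```go
// Illustrative only: sum up the duration of each child operation in a remote
// trace recording, keyed by operation name. The recordedSpan type is a
// stand-in, not CockroachDB's actual recording representation.
package main

import (
	"fmt"
	"time"
)

type recordedSpan struct {
	Operation string
	Duration  time.Duration
}

// aggregateByOperation answers "what is this request/job spending its time
// doing?" by adding up the durations of same-named child spans.
func aggregateByOperation(spans []recordedSpan) map[string]time.Duration {
	agg := make(map[string]time.Duration)
	for _, sp := range spans {
		agg[sp.Operation] += sp.Duration
	}
	return agg
}

func main() {
	rec := []recordedSpan{
		{Operation: "request evaluation", Duration: 3 * time.Millisecond},
		{Operation: "contention wait", Duration: 40 * time.Millisecond},
		{Operation: "request evaluation", Duration: 1 * time.Millisecond},
	}
	for op, d := range aggregateByOperation(rec) {
		fmt.Printf("%-20s %s\n", op, d)
	}
}
```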

@tbg (Member, Author) commented May 6, 2022

@dhartunian these are good questions and the truth is, I need to engage with this issue again/more.

Perfect world goals:

  1. we would like to break down the latency of KV requests (for this issue, once they arrive at the (*Node).Batch boundary, i.e. at a KV server), and do so for every request, not just requests that are being traced.
    1a. observability into hanging requests, i.e. ongoing durations are timestamp-based and can be pulled from, say, the inflight registry, rather than only showing up after the fact.
  2. we would like to (and, thanks to 1), are able to) keep metrics for each of the constituent parts.
  3. we would like to do this in a way that is robust, i.e. not ad-hoc code points across the kvserver code base that are sensitive to each other. (Sort of fuzzily expressed, but you can imagine that some versions of this might be more, and some less, maintainable. I have a sense that a trivial solution will be unmaintainable, but I might be wrong.)
    3a. I dream of having an "unaccounted" measurement that is basically the E2E latency minus everything that is attributed to a constituent part. Because sometimes the things that are slow are not where we expect.
  4. we don't want to reinvent tracing or build our own thing; ideally, whatever we use is used throughout CRDB.
  5. we can measure replication latencies and attribute them to requests.

A lot of compromises could be made. For example, for SRE's specific ask, we don't need to always collect these timings - it's enough to do this for requests for which we've enabled it. It still needs to be cheap, but maybe the bar is lower than for all traffic (maybe not - some workloads are almost exclusively point reads/writes).
As for maintainability/ease of use - maybe it's not so bad to do this with a fairly bare-bones API.
5) is probably half out of the question since we mostly pipeline writes, i.e. the caller goes away before replication is even attempted. So at least on a per-request basis, our current tracing infra will have nowhere for the information to go once we have it. But we can put it into 2).

In the prototype here I was mostly exploring API, i.e. how to set up measuring a "phase" of request execution without having to know about the other phases. I was eyeing doing this in a low-allocation manner with structs that lend themselves to pooling (which is always hard to do once you've stuffed something into a ctx).
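
For illustration, a minimal sketch of that direction - a plain, poolable struct that each phase records into, passed explicitly rather than stuffed into a ctx, with an "unaccounted" remainder as in 3a above. All names (Timings, the Phase constants, etc.) are hypothetical and not the PR's actual Timing struct:

```go
// Hypothetical sketch only; not CockroachDB's actual API.
package kvtiming

import (
	"sync"
	"time"
)

type Phase int

const (
	PhaseSequencing Phase = iota // concurrency manager / contention
	PhaseEvaluation
	PhaseReplication
	numPhases
)

// Timings is plain data so it can live in a sync.Pool and be passed
// explicitly through request execution.
type Timings struct {
	start  time.Time
	phases [numPhases]time.Duration
}

var pool = sync.Pool{New: func() interface{} { return &Timings{} }}

// Get returns a reset Timings with the end-to-end clock started.
func Get() *Timings {
	t := pool.Get().(*Timings)
	*t = Timings{start: time.Now()}
	return t
}

// Record adds the time elapsed since `from` to the given phase and returns
// time.Now() so callers can chain measurements without extra clock reads.
func (t *Timings) Record(p Phase, from time.Time) time.Time {
	now := time.Now()
	t.phases[p] += now.Sub(from)
	return now
}

// Unaccounted is the end-to-end latency minus everything attributed to a phase.
func (t *Timings) Unaccounted(end time.Time) time.Duration {
	d := end.Sub(t.start)
	for _, p := range t.phases {
		d -= p
	}
	return d
}

// Release returns the struct to the pool once the request is done.
func (t *Timings) Release() { pool.Put(t) }

// Usage sketch (within request execution):
//   t, now := Get(), time.Now()
//   ... sequence via the concurrency manager ...
//   now = t.Record(PhaseSequencing, now)
//   ... evaluate ...
//   now = t.Record(PhaseEvaluation, now)
//   unaccounted := t.Unaccounted(time.Now())
//   _ = unaccounted
//   t.Release()
```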

To @adityamaru's question, I think integrating with #80388 is in everybody's interest. The only question is, do we "just" create ad-hoc structured events in kvserver in various places and call it a day (and don't do metrics, etc, because that's probably way too alloc-heavy), or do we have a nimbler way to collect this data which we then translate into structured events on the way out, if needed? (And if so, how do the inflight operations - so say a txn stuck in contention - show up on the inflight trace span if there is one?)

It's possible that structured events alone are a fine solution for the E2E issue. That would leave something to be desired in terms of perf, I think, and wouldn't be always-on. For always-on, we need perf counters.

Happy to chat about this more after InFest with the involved parties.

cc @koorosh

@joshimhoff (Collaborator)

> just as a gut check for understanding, is the need for recording these timings because you want to aggregate together non-contiguous chunks of time?

@dhartunian, the trace-based implementation doesn't seem off the table to me necessarily, but I do think the answer to this Q is yes. I think this is a key point. We want to trace execution time while excluding the time spent in code paths that signal a workload issue more than a CRDB issue. For example, we want to exclude time waiting for the concurrency manager to sequence requests. This way we can page SRE on user traffic in addition to probes. (A rough sketch of this exclusion follows the list below.)

This project is best thought of as having two distinct goals:

  1. Page SRE on user traffic.
  2. Get better observability into E2E latency.
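
A rough sketch of the exclusion idea, assuming the end-to-end latency and the workload-driven phases (e.g. contention wait) are already measured; the function name and threshold are made up:

```go
// Illustrative only: page on the portion of latency attributable to KV
// itself, i.e. end-to-end time minus phases that reflect workload behavior.
package main

import (
	"fmt"
	"time"
)

// kvAttributable returns the end-to-end latency minus the excluded
// (workload-driven) phase durations.
func kvAttributable(e2e time.Duration, excluded ...time.Duration) time.Duration {
	d := e2e
	for _, x := range excluded {
		d -= x
	}
	return d
}

func main() {
	const pageThreshold = 500 * time.Millisecond // hypothetical alerting threshold

	e2e := 2 * time.Second
	contentionWait := 1900 * time.Millisecond

	if lat := kvAttributable(e2e, contentionWait); lat > pageThreshold {
		fmt.Println("page SRE: KV-attributable latency is", lat)
	} else {
		fmt.Println("slowness is workload-driven; KV-attributable latency is", lat)
	}
}
```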

@andreimatei (Contributor) left a comment

I've read through this, but things are still less than clear to me - why exactly are metrics not sufficient (i.e. why can't we exclude contention, and whatever else we don't want, from some metrics)? And what exactly would we do with the events that a trace, or the Timing struct in this PR, produces once we collect them?
Tobi, let's meet and talk about it. With your holiday tomorrow, and the US holiday on Monday, and your packed schedule, it's a bit tough, so maybe you can find a time for it? :) @dhartunian and @abarganier also said they're interested.
Or, if this is no longer burning for you (or for Josh?), I'm also happy to drop it :P.

@joshimhoff (Collaborator)

If not a 1:1, I'd like to join too!

For me, this is still important. Here's a writeup re: what we want for CC alerting: #71169 (comment). I think there are observability wins too. We do measure how much latency certain components introduce (e.g. admission control), but there are gaps, and we don't do it in a very systematic or consistent way today.

@tbg (Member, Author) commented Jun 1, 2022

@andreimatei I wrote up #82200 just now which is just the "record to metrics" portion of this. I think that's the first one to work on since it's a lot less involved. We can discuss in the KV/Repl Eng Slot at 11am ET today! cc @koorosh

Filed #82203 for the follow-up; it's mostly speculative when we will need this. I'm fairly confused about what the plan is with #71169 as a whole. I think just doing this with metrics will not be enough for SREs to set up alerting (for example, ExportRequest may evaluate for a long time, and this may even be intentional, so you need to ignore it), but it will get us to a place where we have a more systematic breakdown of request timings, and that's the right first step.
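
For illustration, a sketch of what the "record to metrics" portion could look like as per-phase latency histograms. This uses the generic Prometheus client and made-up metric/phase names purely for the sketch; CockroachDB has its own metrics framework:

```go
// Hypothetical sketch of per-phase latency histograms; not CRDB's metric code.
package kvmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var phaseDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kv_request_phase_duration_seconds",
		Help:    "Time spent in each phase of KV request execution.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"phase"},
)

func init() {
	prometheus.MustRegister(phaseDuration)
}

// RecordPhases publishes one observation per phase for a finished request.
func RecordPhases(sequencing, evaluation, replication time.Duration) {
	phaseDuration.WithLabelValues("sequencing").Observe(sequencing.Seconds())
	phaseDuration.WithLabelValues("evaluation").Observe(evaluation.Seconds())
	phaseDuration.WithLabelValues("replication").Observe(replication.Seconds())
}
```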

I don't think this PR really has a lot that's going to survive, so I'm going to close it out; let's discuss on the issues above.
