kvserver: collect accurate timings for request execution #72092

Closed · wants to merge 4 commits

Conversation

@tbg (Member) commented Oct 28, 2021

This prototype is an exploration of this comment: cockroachdb#71169 (comment).
Our server-side RPC timing metrics are not great, and in particular they don't
allow us to see what fraction of the end-to-end latency is spent in contention
(which is generally not indicative of the health of the KV layer, since users
can easily generate as much contention as they like).

tbg and others added 3 commits March 4, 2022 20:42
@dhartunian (Collaborator)

@tbg, just as a gut check for understanding: is the need for recording these timings because you want to aggregate together non-contiguous chunks of time? I'm just trying to understand why we can't rely on the timings of the tracing spans themselves for the information you're trying to collect. Or perhaps the cost of creating spans per operation here is too high?

@adityamaru (Contributor) commented May 6, 2022

Just a drive-by, but I am also interested in @dhartunian's question. Over in bulk, we're hoping to rely on trace spans to give us the duration for suboperations in a job (including from the kvserver). The hope is that we can build a generic aggregator as described in #80388 that will use accumulated tracing span durations for each operation (#81079) and StructuredEvents (#80460) to answer "what is the job spending its time doing?".

@andreimatei fixed up Batch requests to send back their trace recordings to the request sender, which can then be imported into the sender's context. So the more children we have in this remote recording, the more granular our view of what is going on will be.
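
For illustration, a minimal sketch of the kind of per-operation aggregation described above, using stand-in types rather than CockroachDB's actual tracing structures:

```go
// Illustrative only: sum up the duration of each child operation in a remote
// trace recording, keyed by operation name. The recordedSpan type is a
// stand-in, not CockroachDB's actual recording representation.
package main

import (
	"fmt"
	"time"
)

type recordedSpan struct {
	Operation string
	Duration  time.Duration
}

// aggregateByOperation answers "what is this request/job spending its time
// doing?" by adding up the durations of same-named child spans.
func aggregateByOperation(spans []recordedSpan) map[string]time.Duration {
	agg := make(map[string]time.Duration)
	for _, sp := range spans {
		agg[sp.Operation] += sp.Duration
	}
	return agg
}

func main() {
	rec := []recordedSpan{
		{Operation: "request evaluation", Duration: 3 * time.Millisecond},
		{Operation: "contention wait", Duration: 40 * time.Millisecond},
		{Operation: "request evaluation", Duration: 1 * time.Millisecond},
	}
	for op, d := range aggregateByOperation(rec) {
		fmt.Printf("%-20s %s\n", op, d)
	}
}
```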

@tbg (Member, Author) commented May 6, 2022

@dhartunian these are good questions and the truth is, I need to engage with this issue again/more.

Perfect world goals:

  1. we would like to break down the latency of KV requests (for this issue, once they arrive at the (*Node).Batch boundary, i.e. at a KV server), and do so for every request, not just requests that are being traced.
    1a. observability into hanging requests, i.e. ongoing durations are timestamp-based and can be pulled from, say, the inflight registry, rather than only showing up after the fact.
  2. we would like to (and, thanks to 1), are able to) keep metrics for each of the constituent parts.
  3. we would like to do this in a way that is robust, i.e. not ad-hoc code points across the kvserver code base that are sensitive to each other. (Sort of fuzzily expressed, but you can imagine that some versions of this might be more, and some less, maintainable. I have a sense that a trivial solution will be unmaintainable, but I might be wrong.)
    3a. I dream of having an "unaccounted" measurement that is basically the E2E latency minus everything that is attributed to a constituent part. Because sometimes the things that are slow are not where we expect.
  4. we don't want to reinvent tracing or build our own thing; ideally, whatever we use is used throughout CRDB.
  5. we can measure replication latencies and attribute them to requests.

A lot of compromises could be made. For example, for SRE's specific ask, we don't need to always collect these timings - it's enough to do this for requests for which we've enabled it. It still needs to be cheap, but maybe the bar is lower than for all traffic (maybe not - some workloads are almost exclusively point reads/writes).
As for maintainability/ease of use - maybe it's not so bad to do this with a fairly bare-bones API.
5) is probably half out of the question since we mostly pipeline writes, i.e. the caller goes away before replication is even attempted. So at least on a per-request basis, our current tracing infra will have nowhere for the information to go once we have it. But we can put it into 2).

In the prototype here I was mostly exploring API, i.e. how to set up measuring a "phase" of request execution without having to know about the other phases. I was eyeing doing this in a low-allocation manner with structs that lend themselves to pooling (which is always hard to do once you've stuffed something into a ctx).
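
For illustration, a minimal sketch of that direction - a plain, poolable struct that each phase records into, passed explicitly rather than stuffed into a ctx, with an "unaccounted" remainder as in 3a above. All names (Timings, the Phase constants, etc.) are hypothetical and not the PR's actual Timing struct:

```go
// Hypothetical sketch only; not CockroachDB's actual API.
package kvtiming

import (
	"sync"
	"time"
)

type Phase int

const (
	PhaseSequencing Phase = iota // concurrency manager / contention
	PhaseEvaluation
	PhaseReplication
	numPhases
)

// Timings is plain data so it can live in a sync.Pool and be passed
// explicitly through request execution.
type Timings struct {
	start  time.Time
	phases [numPhases]time.Duration
}

var pool = sync.Pool{New: func() interface{} { return &Timings{} }}

// Get returns a reset Timings with the end-to-end clock started.
func Get() *Timings {
	t := pool.Get().(*Timings)
	*t = Timings{start: time.Now()}
	return t
}

// Record adds the time elapsed since `from` to the given phase and returns
// time.Now() so callers can chain measurements without extra clock reads.
func (t *Timings) Record(p Phase, from time.Time) time.Time {
	now := time.Now()
	t.phases[p] += now.Sub(from)
	return now
}

// Unaccounted is the end-to-end latency minus everything attributed to a phase.
func (t *Timings) Unaccounted(end time.Time) time.Duration {
	d := end.Sub(t.start)
	for _, p := range t.phases {
		d -= p
	}
	return d
}

// Release returns the struct to the pool once the request is done.
func (t *Timings) Release() { pool.Put(t) }

// Usage sketch (within request execution):
//   t, now := Get(), time.Now()
//   ... sequence via the concurrency manager ...
//   now = t.Record(PhaseSequencing, now)
//   ... evaluate ...
//   now = t.Record(PhaseEvaluation, now)
//   unaccounted := t.Unaccounted(time.Now())
//   _ = unaccounted
//   t.Release()
```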

To @adityamaru's question, I think integrating with #80388 is in everybody's interest. The only question is, do we "just" create ad-hoc structured events in kvserver in various places and call it a day (and don't do metrics, etc, because that's probably way too alloc-heavy), or do we have a nimbler way to collect this data which we then translate into structured events on the way out, if needed? (And if so, how do the inflight operations - so say a txn stuck in contention - show up on the inflight trace span if there is one?)

It's possible that structured events alone are a fine solution for the E2E issue. That would leave something to be desired in terms of perf, I think, and wouldn't be always-on. For always-on, we need perf counters.

Happy to chat about this more after InFest with the involved parties.

cc @koorosh

@joshimhoff (Collaborator)

> just as a gut check for understanding, is the need for recording these timings because you want to aggregate together non-contiguous chunks of time?

@dhartunian, the trace-based implementation doesn't seem off the table to me necessarily, but I do think the answer to this Q is yes. I think this is a key point. We want to trace execution time while excluding the time spent in code paths that signal a workload issue more than a CRDB issue. For example, we want to exclude time waiting for the concurrency manager to sequence requests. This way we can page SRE on user traffic in addition to probes. (A rough sketch of this exclusion follows the list below.)

This project is best thought of as having two distinct goals:

  1. Page SRE on user traffic.
  2. Get better observability into E2E latency.
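
A rough sketch of the exclusion idea, assuming the end-to-end latency and the workload-driven phases (e.g. contention wait) are already measured; the function name and threshold are made up:

```go
// Illustrative only: page on the portion of latency attributable to KV
// itself, i.e. end-to-end time minus phases that reflect workload behavior.
package main

import (
	"fmt"
	"time"
)

// kvAttributable returns the end-to-end latency minus the excluded
// (workload-driven) phase durations.
func kvAttributable(e2e time.Duration, excluded ...time.Duration) time.Duration {
	d := e2e
	for _, x := range excluded {
		d -= x
	}
	return d
}

func main() {
	const pageThreshold = 500 * time.Millisecond // hypothetical alerting threshold

	e2e := 2 * time.Second
	contentionWait := 1900 * time.Millisecond

	if lat := kvAttributable(e2e, contentionWait); lat > pageThreshold {
		fmt.Println("page SRE: KV-attributable latency is", lat)
	} else {
		fmt.Println("slowness is workload-driven; KV-attributable latency is", lat)
	}
}
```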

@andreimatei (Contributor) left a comment

I've read through this, but things are still less than clear to me - why exactly are metrics not sufficient (i.e. why can't we exclude contention, and whatever else we don't want, from some metrics)? And what exactly would we do with the events that a trace, or the Timing struct in this PR, produces once we collect them?
Tobi, let's meet and talk about it. With your holiday tomorrow, and the US holiday on Monday, and your packed schedule, it's a bit tough, so maybe you can find a time for it? :) @dhartunian and @abarganier also said they're interested.
Or, if this is no longer burning for you (or for Josh?), I'm also happy to drop it :P.

@joshimhoff (Collaborator)

If not a 1:1, I'd like to join too!

For me, this is still important. Here's a writeup re: what we want for CC alerting: #71169 (comment). I think there are observability wins too. We do measure how much latency certain components introduce (e.g. admission control), but there are gaps, and we don't do it in a very systematic or consistent way today.

@tbg (Member, Author) commented Jun 1, 2022

@andreimatei I wrote up #82200 just now which is just the "record to metrics" portion of this. I think that's the first one to work on since it's a lot less involved. We can discuss in the KV/Repl Eng Slot at 11am ET today! cc @koorosh

Filed #82203 for the follow-up; it's mostly speculative when we will need this. I'm fairly confused about what the plan is with #71169 as a whole. I think just doing this with metrics will not be enough for SREs to set up alerting (for example, ExportRequest may evaluate for a long time, and this may even be intentional, so you need to ignore it), but it will get us to a place where we have a more systematic breakdown of request timings, and that's the right first step.
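
For illustration, a sketch of what the "record to metrics" portion could look like as per-phase latency histograms. This uses the generic Prometheus client and made-up metric/phase names purely for the sketch; CockroachDB has its own metrics framework:

```go
// Hypothetical sketch of per-phase latency histograms; not CRDB's metric code.
package kvmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var phaseDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kv_request_phase_duration_seconds",
		Help:    "Time spent in each phase of KV request execution.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"phase"},
)

func init() {
	prometheus.MustRegister(phaseDuration)
}

// RecordPhases publishes one observation per phase for a finished request.
func RecordPhases(sequencing, evaluation, replication time.Duration) {
	phaseDuration.WithLabelValues("sequencing").Observe(sequencing.Seconds())
	phaseDuration.WithLabelValues("evaluation").Observe(evaluation.Seconds())
	phaseDuration.WithLabelValues("replication").Observe(replication.Seconds())
}
```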

I don't think this PR really has a lot that's going to survive, so I'm going to close it out; let's discuss on the issues above.
