kvserver: record request stage metrics to request's trace span #82203

Open
tbg opened this issue Jun 1, 2022 · 1 comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

@tbg
Member

tbg commented Jun 1, 2022

Is your feature request related to a problem? Please describe.

#82200 suggests adding metrics that break down the "life of requests" at a Node into its constituent parts. That issue does not include returning these timings on a per-request basis.

A per-phase breakdown at the individual query level ("why is this particular query slow") is helpful for observability, and having it in a structured form enables higher-level tooling (better statement bundle visualizations, etc.).

Currently, we rely on verbose traces to figure out why "a request is slow". But this is hit-or-miss (we may not have the right trace events in place), and not everyone can do it.

Another important application is an idea discussed in #71169. The SQL layer would obtain the per-stage latencies for a subset of requests (those assumed to have amortized "constant cost", for example point reads with response size <10kb) and sum up the latencies for stages that under normal operation shouldn't depend on the workload (i.e. ignoring contention, etc.), to provide a signal for SRE alerting that essentially tells them when "a workload is slow without the workload being at fault". It is possible that store-wide metrics already allow some form of this, but without per-request information being returned it certainly wouldn't be possible to track the impact on SLAs at a per-customer level in multi-tenant environments.

Describe the solution you'd like

Extend the work done for #82200 such that, in the presence of a recording tracing span, measurements are also added to it as structured payloads. (This might not be performant enough to use for #71169, but I think that can be kicked down the road.)

We could add a single trace event containing all measurements (fewer allocs, but delayed information in inflight traces), or a trace event per stage (more allocs, less delay, but still some delay for inflight operations). Alternatively, tracing could learn to hold ad-hoc "interim" state, in which case we could attach the payload(s) once the event starts and "make them immutable" when it ends. This last option is best for introspection into inflight traces that are taking a long time.

Describe alternatives you've considered

It is not 100% clear which of the motivations presented above is going to be the most pressing in the future and we may find that doing nothing is an option; that simply aggregating the stats at the per-range level is enough; or that we want this information returned so frequently that we need to bring down the overhead of the trace payloads further and do something bespoke (I hope not).

Additional context

#71169

Jira issue: CRDB-16250

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-observability labels Jun 1, 2022
@tbg tbg changed the title kvserver: record request stage metrics to request's span kvserver: record request stage metrics to request's trace span Jun 1, 2022
@andreimatei
Contributor

> or tracing could learn to contain ad-hoc "interim" state in which case we could attach the payload(s) once the event starts and "make them immutable" when it ends

FWIW, a version of this exists in the form of "lazy span tags": you can attach an arbitrary, mutable object as a span tag. The object needs to know how to serialize itself into key-value pairs when the recording is collected.

@blathers-crl blathers-crl bot added this to Triage in Cluster Observability Mar 16, 2023
@maryliag maryliag added A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team and removed T-sql-observability labels Apr 24, 2023
@blathers-crl blathers-crl bot added this to Incoming in KV Apr 24, 2023
@maryliag maryliag removed this from Triage in Cluster Observability Apr 24, 2023
Projects
KV
Incoming
Development

No branches or pull requests

3 participants