
Provide Grafana dashboard #15

Closed
emschwartz opened this issue Jan 27, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@emschwartz
Contributor

No description provided.

@emschwartz emschwartz added the enhancement New feature or request label Jan 27, 2023
@emschwartz emschwartz changed the title Generate Grafana dashboard Provide Grafana dashboard Feb 14, 2023
@emschwartz
Contributor Author

emschwartz commented Feb 24, 2023

I made a dashboard, but I'm not sure if it's a good idea to publish it in its current form.

The problem is that if you run complicated enough queries, you can actually bring down your Prometheus instance (which might or might not have happened 🙈).

The main query that's currently causing trouble is the one that tries to show the functions with the highest 99th percentile latencies:

topk(5, histogram_quantile(0.99, sum by (le, function, module) (rate(function_calls_duration_bucket{function=~"${function}", function!~"${exclude_function}", module=~"${module}", module!~"${exclude_module}"}[5m]))))

A few potential options include:

  1. Don't show latency in the dashboard
  2. Require that you first load certain recording rules into Prometheus before the dashboard will work, so it can pre-calculate the results of this query and save them as a dedicated time series (a rough sketch of what that could look like is below this list)
  3. Only show the latency for functions you've added SLOs to (in which case we might want to rename that parameter from alerts to something that indicates its broader scope)
  4. Maybe there's some other chart type that would make sense? For example, show only which functions have been slow very recently, or the average latency over a longer time window? Not sure if this would help though.
  5. ... maybe there's something else?
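
For option 2, a minimal sketch of what such a recording rule could look like (the group and rule names here are made up, and the metric name is taken from the query above):

groups:
  - name: autometrics
    rules:
      # Pre-aggregate the per-function latency histogram so the dashboard only
      # has to run histogram_quantile over the already-recorded series
      - record: function:function_calls_duration_bucket:rate5m
        expr: sum by (le, function, module) (rate(function_calls_duration_bucket[5m]))

The dashboard query would then become something like:

topk(5, histogram_quantile(0.99, function:function_calls_duration_bucket:rate5m{function=~"${function}", function!~"${exclude_function}", module=~"${module}", module!~"${exclude_module}"}))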

@hatchan @IvanMerrill or anyone else have ideas?

@emschwartz emschwartz pinned this issue Feb 24, 2023
@grainednoise

Ouch, that query's a doozy. I've just got one remark and the kernel of an idea. Firstly, I've found that higher quantiles can be very noisy at lower rates, as there's simply not a lot of data to work with in the chosen time window. This could make unproblematic time series dart in and out of your top 5 a lot, making it less useful as a tool.

One idea I have (and I still need to investigate whether it has any real merit at all) is to make use of the cumulative nature of the buckets and calculate the ratio of the le="+Inf" bucket (aka the "count") to some lower 'threshold' bucket. If that ratio is just a solid 1.0, you know there aren't any outliers to worry about, but the further it deviates from that value, the more unwanted values there are in that metric. The advantage is that you now only have to query 2 time series per metric, instead of the, say, 20 for the entire bucket set. The disadvantages are that (1) you need to know the exact threshold bucket in the set, and (2) the calculation leaves you with a single number that has no obvious meaning until you learn to interpret it.
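
In PromQL, that ratio might look roughly like this (the le="0.1" threshold is just a placeholder for whatever bucket boundary is actually configured, and the metric name is taken from the query above):

# Ratio of all calls to calls at or below the threshold bucket;
# 1.0 means nothing was slower than the threshold, and the further it
# climbs above 1.0, the more slow calls there are
sum by (function, module) (rate(function_calls_duration_bucket{le="+Inf"}[5m]))
/
sum by (function, module) (rate(function_calls_duration_bucket{le="0.1"}[5m]))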

@emschwartz
Contributor Author

Thanks for the input @grainednoise!

That's an interesting idea, looking at the ratio of a lower-threshold bucket to the total. It's a little similar to how we ended up making the latency SLO work (see the section "Attempt 3: Label renaming and set intersections" in An adventure with SLOs, generic Prometheus alerting rules, and complex PromQL queries). We use a label called objective.latency_threshold and take the ratio of events whose le label equals the value of that threshold label to the total.
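
A very simplified sketch of the general shape of that comparison (the real recording rules in the blog post are more involved; the label is written here with an underscore, since Prometheus label names can't contain dots, and the count metric name is assumed to follow the bucket name from the query above):

# Fraction of calls faster than each function's own latency threshold:
# keep only the bucket series whose le matches the function's
# objective_latency_threshold label, then divide by the total call rate
sum by (function, module) (
  rate(function_calls_duration_bucket[5m])
    and on (function, module, le)
  label_replace(
    rate(function_calls_duration_bucket{objective_latency_threshold!=""}[5m]),
    "le", "$1", "objective_latency_threshold", "(.+)"
  )
)
/
sum by (function, module) (rate(function_calls_duration_count[5m]))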

My current draft of the dashboard leaves out the chart showing the latencies for all functions. Instead, it has a row for each SLO and shows the latencies of the functions included in that SLO. This reduces the number of functions we're looking at, and the user has already expressed some particular interest in those functions.

@emschwartz emschwartz unpinned this issue Apr 17, 2023