
Provide Grafana dashboard #15

Closed
emschwartz opened this issue Jan 27, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@emschwartz
Contributor

No description provided.

@emschwartz emschwartz added the enhancement New feature or request label Jan 27, 2023
@emschwartz emschwartz changed the title Generate Grafana dashboard Provide Grafana dashboard Feb 14, 2023
@emschwartz
Contributor Author

emschwartz commented Feb 24, 2023

I made a dashboard, but I'm not sure if it's a good idea to publish it in its current form.

The problem is that if you run complicated enough queries, you can actually bring down your Prometheus instance (which might or might not have happened 🙈).

The main query that's currently causing trouble is the one that tries to show the functions with the highest 99th percentile latencies:

topk(5, histogram_quantile(0.99, sum by (le, function, module) (rate(function_calls_duration_bucket{function=~"${function}", function!~"${exclude_function}", module=~"${module}", module!~"${exclude_module}"}[5m]))))

A few potential options include:

  1. Don't show latency in the dashboard
  2. Require that you first load certain recording rules into Prometheus before the dashboard will work, so it can pre-calculate the results of this query and save them as a dedicated time series (a rough sketch of what that could look like is below this list)
  3. Only show the latency for functions you've added SLOs to (in which case we might want to rename that parameter from alerts to something that indicates its broader scope)
  4. Maybe there's some other chart type that would make sense? For example, show only which functions have been slow very recently, or the average latency over a longer time window? Not sure if this would help though.
  5. ... maybe there's something else?
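
For option 2, a minimal sketch of what such a recording rule could look like (the group and rule names here are made up, and the metric name is taken from the query above):

groups:
  - name: autometrics
    rules:
      # Pre-aggregate the per-function latency histogram so the dashboard only
      # has to run histogram_quantile over the already-recorded series
      - record: function:function_calls_duration_bucket:rate5m
        expr: sum by (le, function, module) (rate(function_calls_duration_bucket[5m]))

The dashboard query would then become something like:

topk(5, histogram_quantile(0.99, function:function_calls_duration_bucket:rate5m{function=~"${function}", function!~"${exclude_function}", module=~"${module}", module!~"${exclude_module}"}))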

@hatchan @IvanMerrill or anyone else have ideas?

@emschwartz emschwartz pinned this issue Feb 24, 2023
@grainednoise

Ouch, that query's a doozy. I've just got one remark and the kernel of an idea. Firstly, I've found that higher quantiles can be very noisy at lower rates, as there's simply not a lot of data to work with in the chosen time window. This could make unproblematic time series dart in and out of your top 5 a lot, making it less useful as a tool.

One idea I have (and I still need to investigate whether it has any real merit at all) is to make use of the cumulative nature of the buckets and calculate the ratio of the le="+Inf" bucket (aka the "count") to some lower 'threshold' bucket. If that ratio is just a solid 1.0, you know there aren't any outliers to worry about, but the further it deviates from that value, the more unwanted values there are in that metric. The advantage is that you now only have to query 2 time series per metric, instead of the, say, 20 for the entire bucket set. The disadvantages are that (1) you need to know the exact threshold bucket in the set, and (2) the calculation leaves you with a single number that has no obvious meaning until you learn to interpret it.
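
In PromQL, that ratio might look roughly like this (the le="0.1" threshold is just a placeholder for whatever bucket boundary is actually configured, and the metric name is taken from the query above):

# Ratio of all calls to calls at or below the threshold bucket;
# 1.0 means nothing was slower than the threshold, and the further it
# climbs above 1.0, the more slow calls there are
sum by (function, module) (rate(function_calls_duration_bucket{le="+Inf"}[5m]))
/
sum by (function, module) (rate(function_calls_duration_bucket{le="0.1"}[5m]))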

@emschwartz
Contributor Author

Thanks for the input @grainednoise!

That's an interesting idea, looking at the ratio of a lower-threshold bucket to the total. It's a little similar to how we ended up making the latency SLO work (see the section "Attempt 3: Label renaming and set intersections" in An adventure with SLOs, generic Prometheus alerting rules, and complex PromQL queries). We use a label called objective.latency_threshold and take the ratio of events whose le label equals the value of that threshold label to the total.
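
A very simplified sketch of the general shape of that comparison (the real recording rules in the blog post are more involved; the label is written here with an underscore, since Prometheus label names can't contain dots, and the count metric name is assumed to follow the bucket name from the query above):

# Fraction of calls faster than each function's own latency threshold:
# keep only the bucket series whose le matches the function's
# objective_latency_threshold label, then divide by the total call rate
sum by (function, module) (
  rate(function_calls_duration_bucket[5m])
    and on (function, module, le)
  label_replace(
    rate(function_calls_duration_bucket{objective_latency_threshold!=""}[5m]),
    "le", "$1", "objective_latency_threshold", "(.+)"
  )
)
/
sum by (function, module) (rate(function_calls_duration_count[5m]))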

My current draft of the dashboard leaves out the chart showing the latencies for all functions. Instead, it has a row for each SLO and shows the latencies of the functions included in that SLO. This reduces the number of functions we're looking at, and the user has already expressed some particular interest in those functions.

@emschwartz emschwartz unpinned this issue Apr 17, 2023