Provide Grafana dashboard #15
I made a dashboard, but I'm not sure if it's a good idea to publish it in its current form. The problem is that if you run complicated enough queries, you can actually bring down your Prometheus instance (which might or might not have happened 🙈). The main query that's currently causing trouble is the one that tries to show the functions with the highest 99th percentile latencies:
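For reference, a query of that general shape might look like the following. This is a hedged sketch, not the actual query from the dashboard; the autometrics-style metric name `function_calls_duration_bucket` and the 5-minute rate window are assumptions:

```promql
# Top 5 functions by 99th-percentile latency over the last 5 minutes.
# Expensive because it pulls every bucket series for every function.
topk(5,
  histogram_quantile(0.99,
    sum by (function, le) (
      rate(function_calls_duration_bucket[5m])
    )
  )
)
```

The cost comes from the inner `sum by (function, le)`: Prometheus has to fetch and aggregate every histogram bucket for every instrumented function before `histogram_quantile` can run, so the series count multiplies quickly.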
A couple of potential options include:
Do @hatchan, @IvanMerrill, or anyone else have ideas?
Ouch, that query's a doozy. I've got one remark and the kernel of an idea.

Firstly, I've found that higher quantiles can be very noisy at lower rates, as there's simply not a lot of data to work with in the chosen time window. This could make unproblematic time series dart in and out of your top 5 a lot, making it less useful as a tool.

The idea (which I still need to investigate to see if it has any real merit) is to make use of the cumulative nature of the buckets: take the ratio of some lower "threshold" bucket to the `le="+Inf"` bucket (aka the count). If that ratio is a solid 1.0, you know there aren't any outliers to worry about; the further it deviates from that value, the more unwanted values there are in that metric. The advantage is that you now only have to query 2 time series per metric, instead of the, say, 20 for the entire bucket set. The disadvantages are that (1) you need to know the exact threshold bucket in the set, and (2) the calculation leaves you with a single number that has no obvious meaning until you learn to interpret it.
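The bucket-ratio idea could be sketched roughly like this; the metric name and the choice of `0.1` seconds as the threshold bucket are assumptions for illustration:

```promql
# Fraction of observations at or below the threshold bucket.
# 1.0 = no outliers above 100ms; lower values = more outliers.
sum by (function) (rate(function_calls_duration_bucket{le="0.1"}[5m]))
  /
sum by (function) (rate(function_calls_duration_bucket{le="+Inf"}[5m]))
```

Note that this only works if `0.1` is an exact bucket boundary of the histogram, which is disadvantage (1) above.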
Thanks for the input @grainednoise! That's an interesting idea, looking at the ratio of a lower threshold bucket to the total. It's a little similar to how we ended up making the latency SLO work (see the section "Attempt 3: Label renaming and set intersections" in An adventure with SLOs, generic Prometheus alerting rules, and complex PromQL queries). We use a label called

My current draft of the dashboard leaves out the chart showing the latencies for all functions. Instead, it has a row for each SLO and shows the latencies of the functions included in that SLO. This reduces the number of series we're looking at, and it means the user has expressed particular interest in those functions.
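Restricting the latency chart to SLO members could look something like this sketch, assuming a hypothetical label (here called `objective_name`) marks which functions belong to an SLO; both the label and metric names are assumptions, not the dashboard's actual query:

```promql
# p99 latency, but only for functions that are part of some SLO,
# i.e. series where the (assumed) objective_name label is non-empty.
histogram_quantile(0.99,
  sum by (function, le) (
    rate(function_calls_duration_bucket{objective_name!=""}[5m])
  )
)
```

The label filter prunes the series set before aggregation, which is where the original query's cost came from.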