Show blacklisting in the Grafana Dashboard #8263
What about exposing the metric as a rate and visualizing it with a heat map?
Prioritized as planned for now, under the assumption that we won't work on a better long-term alternative to blacklisting in the next quarter, and that this will already be an improvement in terms of visibility. Before opening a PR for this, please discuss and decide on the visualization/metric that we want.
The question for me is really what we want to achieve with the metric. Do we want to know exactly how many instances are blacklisted? Then Prometheus might not be the best fit. But if we just want an indicator that SOMETHING is blacklisted, then this could work: here we could change the colors and shown values to something like Blacklisted (if x > 0) and nothing (if x <= 0). I think this is similar to what I showed above with example A. This would potentially already help.
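To make the indicator idea concrete, here is a minimal sketch of the threshold logic described above. The function name and the treatment of missing samples are assumptions for illustration, not Zeebe's actual code; a Grafana stat panel would apply the same mapping via value thresholds.

```python
from typing import Optional

def blacklist_indicator(count: Optional[float]) -> str:
    """Map a blacklist-count sample to the panel text.

    A missing sample (None) is treated like zero here, since the metric
    is only exported when something was actually blacklisted.
    """
    if count is not None and count > 0:
        return "Blacklisted"
    return "None"
```

With this mapping the panel shows "Blacklisted" for any positive count and "None" otherwise, which matches the "indicator only" goal rather than the exact-count goal.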
> What about exposing the metric as a rate and visualizing it with a heat map?

This wouldn't help, since the metric is not exported all the time. Plus, the question would be: what should the heatmap tell me?
I would expect to see a change in the blacklisting rate. If it is not exported for some time, I will just see a black column, but I can always make the time scale wider, and then I should see data for pretty much all of the time, because each export would be aggregated into the rate (I hope). Then I could look at a big enough time scale, and if my blacklisting suddenly jumps from 2 per day to 200 per day, that would be my signal. And if the jump correlates with, e.g., the point where there was an update or a pod restart, that would also be interesting. This is all speculation though; one would have to see what it actually looks like.
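The per-window aggregation suggested here can be sketched in a few lines. This is an invented helper (not Zeebe or Prometheus code) that turns cumulative counter samples into per-day increases, roughly what PromQL's `increase()` does, including the handling of counter resets after a restart; the sample data is made up.

```python
def daily_increase(samples):
    """samples: list of (day, cumulative_count) tuples, in order.

    Returns a list of (day, increase) tuples. A drop in the cumulative
    value is treated as a counter reset (e.g. after a pod restart),
    so the new value itself counts as the increase.
    """
    increases = []
    prev = None
    for day, value in samples:
        if prev is None:
            increases.append((day, 0))  # no baseline for the first sample
        elif value >= prev:
            increases.append((day, value - prev))
        else:
            increases.append((day, value))  # counter reset detected
        prev = value
    return increases
```

On a wide enough time scale, a jump in these increases (2 per day to 200 per day) would be the signal described above, even if individual scrapes are missing.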
Thanks @pihme, yeah, so it would look like this. The problem is that we do not store data long enough on our Prometheus server, so data that contained some blacklisting can be deleted, which is why I ask what we want to see: the real count or just an indication. All in all, I think our current metric doesn't work well. Ideally we would report always, or on bootstrap, with the current count of blacklisted instances.
Thanks @Zelldon. Yeah, I regularly forget that we forget data. I think 90 days would be good enough, and yes, the visualization looks like what I had in mind. But if the history is 30 days or less, it becomes less useful. So I see your point.
I think the best would be the alternative described above:
This means we export the metric at least once on restart.
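This "restore on restart" alternative can be sketched as follows. The class and method names here are hypothetical stand-ins (not Zeebe's actual classes): the point is that the metric is a gauge that gets set from persisted state on recovery, instead of a counter that silently restarts at zero.

```python
class BlacklistMetrics:
    """Hypothetical sketch: a gauge restored from state on recovery."""

    def __init__(self):
        self.blacklisted_instances = 0  # gauge semantics, not a counter

    def on_recovered(self, state_count: int) -> None:
        # Called on restart/recovery: set the gauge to the number of
        # blacklisted instances currently stored in the state, so the
        # metric is exported at least once per broker start.
        self.blacklisted_instances = state_count

    def on_blacklisted(self) -> None:
        # Called whenever a new instance is blacklisted.
        self.blacklisted_instances += 1
```

This is essentially the counter-to-gauge change that the later PR #12606 describes: a gauge can be set to a specific value based on the state, so restarts no longer create gaps.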
@remcowesterhoud since you mentioned to me that you had to look at a cluster with zdb to find out whether something was blacklisted, it might be useful for you to just have this in the metrics :)
Right now we don't see value in implementing this; we have other ways to identify blacklisted instances which also allow us to see the corresponding process instance keys, i.e. logs or zdb. Hopefully at one point we can get rid of blacklisting.
ZPA triage:
Attaching this to the "Blacklisting replacement" epic, as having insights into these metrics may be useful for it.
12606: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
On restart/recovery we need to restore the blacklist metric in order to report it always. Otherwise, we will have gaps in the reports and it is not reliable, due to restarts and the resetting of counters. Change the counter to a gauge, such that it can be reset/set to a specific value based on the state. See related comment #8263 (comment)

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Reopening since we still need panels on the dashboard.
12643: [Backport stable/8.2] fix: brokers can list more than 255 backups r=deepthidevaki a=oleschoenburg

Manual backport of #12621 to fix revapi merge conflicts.

12644: [Backport stable/8.2] Restore blacklist metric r=Zelldon a=backport-action

# Description
Backport of #12606 to `stable/8.2`.

relates to #8263

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483

## Related issues
relates to #12033

12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

Merge conflicts because of imports.

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483

## Related issues
closes #12033

12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
12761: Add blacklisting panel r=oleschoenburg a=Zelldon

## Description
Add new panels to show blacklisted process instances, in order to avoid searching the whole log for errors.

One panel was added in the general section, so we can directly see whether there are any blacklisted instances in this namespace.

![blacklisted](https://github.com/camunda/zeebe/assets/2758593/0b1a3587-3353-4687-99bb-16dbd6c7eabf)

If there are any blacklisted instances, the `None` will change to the actual count, shown in red.

If we want to dig deeper, we can take a look at the panel in the processing section, which groups by namespace AND partition and shows the data as a graph, so we can clearly see when it increased for a certain partition.

![blacklisted-proc](https://github.com/camunda/zeebe/assets/2758593/1150278d-5c60-4341-94ff-7d02f00ab8e5)

In order to show a real example, I tried it on INT, which shows us `3953a05c-7c20-41fb-8231-ec283dd2138b-zeebe` with three blacklisted instances. This is, BTW, a Zeebe cluster from our team on INT 😅

![blacklisted-proc-int](https://github.com/camunda/zeebe/assets/2758593/f00c52e2-482f-489c-bafa-ba57cce7b71d)

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Is your feature request related to a problem? Please describe.
It came up in a recent incident review that it would be nice to see whether instances are blacklisted, and maybe also the rate of blacklisting.
I realized that we added the metrics for that a while ago but never added them to the dashboard, see #6715.
Describe the solution you'd like
Either show a graph of blacklisted instance count or just some indication whether instances are blacklisted.
I had a hard time yesterday finding a good visualization, since the count is only exported when there is a new blacklisted instance. If the pod is restarted, the metric is not exported at all. It might make sense to refill the metric on restart, OR we work with what we have, but that would mean we can only show limited data, see below.
Example A - Show an indication that instances are blacklisted
In general I like this, since it shows directly whether there is something wrong.
The problem here is that if the time frame is smaller (covering a period where no instances were blacklisted), then this indication is green. 👎
Example B - Graph
Showing a graph is not that fruitful, since, as described above, the metric is not always exported.
Showing zero for null values
Example C - Rate
The rate of such a metric is also not really useful, since changes are too rare.
Example D - Table
Another alternative would be to show the recent count in a table, but this is also very limited (to the time frame), among other issues.
Describe alternatives you've considered.
Ideally we would always export the metric; this would simplify things enormously. Then we could choose one or more of the possible visualizations from above.
Additional context
Blacklisting always indicates that something problematic happened during process execution (processing): an exception was thrown during execution, which is mostly an indication of a bug.
I would like to find a good way to visualize it, and I hope someone has comments, opinions, or ideas.
\cc @pihme