Show blacklisting in the Grafana Dashboard #8263
What about exposing the metric as a rate and visualizing it with a heat map?
Prioritized as planned for now, under the assumption that we won't work on a better long-term alternative to blacklisting in the next quarter, and that this will already be an improvement in terms of visibility. Before opening a PR for this, please discuss and decide on the visualization/metric that we want.
The question for me is really what we want to achieve with the metric. Do we want to know exactly how many instances are blacklisted? Then Prometheus might not be the best fit. But if we just want an indicator that SOMETHING is blacklisted, then this could work: here we could change the colors and shown values to something like Blacklisted (if x > 0) and nothing (if x <= 0). I think this is similar to what I showed above with example A. This would potentially already help.
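To make the indicator idea concrete, here is a minimal sketch of the threshold logic described above. The function name and the treatment of missing samples are assumptions for illustration, not Zeebe's actual code; a Grafana stat panel would apply the same mapping via value thresholds.

```python
from typing import Optional

def blacklist_indicator(count: Optional[float]) -> str:
    """Map a blacklist-count sample to the panel text.

    A missing sample (None) is treated like zero here, since the metric
    is only exported when something was actually blacklisted.
    """
    if count is not None and count > 0:
        return "Blacklisted"
    return "None"
```

With this mapping the panel shows "Blacklisted" for any positive count and "None" otherwise, which matches the "indicator only" goal rather than the exact-count goal.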
> What about exposing the metric as a rate and visualizing it with a heat map?

This wouldn't help, since the metric is not exported all the time. Plus, the question would be: what should the heatmap tell me?
I would expect to see a change in the blacklisting rate. If it is not exported for some time, I will just see a black column, but I can always make the time scale wider, and then I should see data for pretty much all of the time, because each export would be aggregated into the rate (I hope). Then I could look at a big enough time scale, and if my blacklisting suddenly jumps from 2 per day to 200 per day, that would be my signal. And if the jump correlates with, e.g., the point where there was an update or a pod restart, that would also be interesting. This is all speculation though; one would have to see what it actually looks like.
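The per-window aggregation suggested here can be sketched in a few lines. This is an invented helper (not Zeebe or Prometheus code) that turns cumulative counter samples into per-day increases, roughly what PromQL's `increase()` does, including the handling of counter resets after a restart; the sample data is made up.

```python
def daily_increase(samples):
    """samples: list of (day, cumulative_count) tuples, in order.

    Returns a list of (day, increase) tuples. A drop in the cumulative
    value is treated as a counter reset (e.g. after a pod restart),
    so the new value itself counts as the increase.
    """
    increases = []
    prev = None
    for day, value in samples:
        if prev is None:
            increases.append((day, 0))  # no baseline for the first sample
        elif value >= prev:
            increases.append((day, value - prev))
        else:
            increases.append((day, value))  # counter reset detected
        prev = value
    return increases
```

On a wide enough time scale, a jump in these increases (2 per day to 200 per day) would be the signal described above, even if individual scrapes are missing.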
Thanks @pihme, yeah, so it would look like this. The problem is that we do not store data long enough on our Prometheus server, so data that contained some blacklisting can be deleted, which is why I ask what we want to see: the real count or just an indication. All in all, I think our current metric doesn't work well. Ideally we would report always, or on bootstrap, with the current count of blacklisted instances.
Thanks @Zelldon. Yeah, I regularly forget that we forget data. I think 90 days would be good enough, and yes, the visualization looks like what I had in mind. But if the history is 30 days or less, it becomes less useful. So I see your point.
I think the best would be the alternative described above:
This means we export the metric at least once on restart.
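This "restore on restart" alternative can be sketched as follows. The class and method names here are hypothetical stand-ins (not Zeebe's actual classes): the point is that the metric is a gauge that gets set from persisted state on recovery, instead of a counter that silently restarts at zero.

```python
class BlacklistMetrics:
    """Hypothetical sketch: a gauge restored from state on recovery."""

    def __init__(self):
        self.blacklisted_instances = 0  # gauge semantics, not a counter

    def on_recovered(self, state_count: int) -> None:
        # Called on restart/recovery: set the gauge to the number of
        # blacklisted instances currently stored in the state, so the
        # metric is exported at least once per broker start.
        self.blacklisted_instances = state_count

    def on_blacklisted(self) -> None:
        # Called whenever a new instance is blacklisted.
        self.blacklisted_instances += 1
```

This is essentially the counter-to-gauge change that the later PR #12606 describes: a gauge can be set to a specific value based on the state, so restarts no longer create gaps.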
@remcowesterhoud since you mentioned to me that you had to look at a cluster with zdb to find out whether something was blacklisted, it might be useful for you to just have this in the metrics :)
Right now we don't see value in implementing this; we have other ways to identify blacklisted instances which also allow us to see the corresponding process instance keys, i.e. logs or zdb. Hopefully at one point we can get rid of blacklisting.
ZPA triage:
Attaching this to the "Blacklisting replacement" epic, as having insights into these metrics may be useful for it.
12606: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
On restart/recovery we need to restore the blacklist metric in order to report it always. Otherwise, we will have gaps in the reports and it is not reliable, due to restarts and the resetting of counters. Change the counter to a gauge, such that it can be reset/set to a specific value based on the state. See related comment #8263 (comment)

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Reopening since we still need panels on the dashboard.
12643: [Backport stable/8.2] fix: brokers can list more than 255 backups r=deepthidevaki a=oleschoenburg

Manual backport of #12621 to fix revapi merge conflicts.

12644: [Backport stable/8.2] Restore blacklist metric r=Zelldon a=backport-action

# Description
Backport of #12606 to `stable/8.2`.

relates to #8263

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483

## Related issues
relates to #12033

12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

Merge conflicts because of imports.

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483

## Related issues
closes #12033

12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
12761: Add blacklisting panel r=oleschoenburg a=Zelldon

## Description
Add new panels to show blacklisted process instances, in order to avoid searching the whole log for errors.

One panel was added in the general section, so we can directly see whether there are any blacklisted instances in this namespace.

![blacklisted](https://github.com/camunda/zeebe/assets/2758593/0b1a3587-3353-4687-99bb-16dbd6c7eabf)

If there are any blacklisted instances, the `None` will change to the actual count, shown in red.

If we want to dig deeper, we can take a look at the panel in the processing section, which groups by namespace AND partition and shows the data as a graph, so we can clearly see when it increased for a certain partition.

![blacklisted-proc](https://github.com/camunda/zeebe/assets/2758593/1150278d-5c60-4341-94ff-7d02f00ab8e5)

In order to show a real example, I tried it on INT, which shows us `3953a05c-7c20-41fb-8231-ec283dd2138b-zeebe` with three blacklisted instances. This is, BTW, a Zeebe cluster from our team on INT 😅

![blacklisted-proc-int](https://github.com/camunda/zeebe/assets/2758593/f00c52e2-482f-489c-bafa-ba57cce7b71d)

## Related issues
closes #8263

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Is your feature request related to a problem? Please describe.
It came up in a recent incident review that it would be nice to see whether instances are blacklisted, and maybe also the rate of blacklisting.
I realized that we added the metrics for that a while ago but never added them to the dashboard, see #6715.
Describe the solution you'd like
Either show a graph of blacklisted instance count or just some indication whether instances are blacklisted.
I had a hard time yesterday finding a good visualization, since the count is only exported when there is a new blacklisted instance. If the pod is restarted, the metric is not exported at all. It might make sense to refill the metric on restart, OR we work with what we have, but that would mean we can only show limited data, see below.
Example A - Show an indication that instances are blacklisted
In general I like this, since it shows directly whether there is something wrong.
The problem here is that if the time frame is smaller (covering a period where no instances were blacklisted), then this indication is green. 👎
Example B - Graph
Showing a graph is not that fruitful, since, as described above, the metric is not always exported.
Showing zero for null values
Example C - Rate
The rate of such a metric is also not really useful, since changes are too rare.
Example D - Table
Another alternative would be to show the recent count in a table, but this is also very limited (to the time frame), among other issues.
Describe alternatives you've considered.
Ideally we would always export the metric; this would simplify things enormously. Then we could choose one or more of the possible visualizations from above.
Additional context
Blacklisting always indicates that something problematic happened during process execution (processing): an exception was thrown during execution, which is mostly an indication of a bug.
I would like to find a good way to visualize it, and I hope someone has comments, opinions, or ideas.
\cc @pihme