Show blacklisting in the Grafana Dashboard #8263

Closed
Zelldon opened this issue Nov 24, 2021 · 15 comments · Fixed by #12606 or #12761
Labels
area/observability, component/engine, kind/feature, scope/broker, version:8.2.4, version:8.3.0-alpha2, version:8.3.0

Comments

@Zelldon
Member

Zelldon commented Nov 24, 2021

Is your feature request related to a problem? Please describe.
It came up in a recent incident review that it would be nice to see whether instances are blacklisted, and maybe also the rate of blacklisting.

I realized that we added the metrics for that a while ago but never added them to the dashboard, see #6715.

Describe the solution you'd like

Either show a graph of the blacklisted instance count or just some indication of whether instances are blacklisted.

Yesterday I had a hard time finding a good visualization, since the count is only exported when there is a new blacklisted instance. If the pod is restarted, the metric is not exported. It might make sense to refill the metric on restart, OR we need to work with it as it is, but this would mean we can only show limited data, see below.

Example A - Show an indication that instances are blacklisted

gauge-blacklisted
gauge-blacklisted-general

In general I like this, since it shows directly whether something is wrong.
The problem is that if the selected time frame is so small that no instances were blacklisted within it, this indication is green. 👎

Example B - Graph

Showing a graph is not that fruitful, since as described above the metric is not always exported.
graph-blacklisted

Showing zero for null values
graph-zero-blacklisted

Example C - Rate

The rate of such a metric is also not really useful, since changes are too rare.

graph-rate-blacklisted

Example D - Table
table-blacklisted

Another alternative would be to show the recent count in a table, but this is also very limited (to the time frame) and other

Describe alternatives you've considered.

Ideally we would export the metric always, which would simplify things enormously. Then we could better choose one or more of the possible visualizations from above.

Additional context

Blacklisting always shows that something problematic happened in the process execution (processing). An exception was thrown during the execution, which is mostly an indication of a bug.

I would like to find a good way to visualize it and hope someone has some comments, opinions, or ideas.

/cc @pihme

@Zelldon Zelldon added the kind/feature, scope/broker, Impact: Observability, and area/observability labels Nov 24, 2021
@pihme
Contributor

pihme commented Nov 24, 2021

What about exposing the metric as a rate and visualizing it with a heat map?

@npepinpe npepinpe added this to Planned in Zeebe Dec 9, 2021
@npepinpe
Member

npepinpe commented Dec 9, 2021

Prioritized as planned for now under the assumption we won't work on a better long term alternative to blacklisting in the next quarter, and this will already be an improvement in terms of visibility. Before opening a PR for this, please discuss and decide on the visualization/metric that we want.

@Zelldon
Member Author

Zelldon commented Jan 7, 2022

The question for me is really what we want to achieve with the metric.

Do we really want to know exactly how many are blacklisted? Then Prometheus might not be the best fit, but if we just want an indicator that SOMETHING is blacklisted, then this could work:

blacklist

Here we could change the colors and shown values to something like Blacklisted (if x > 0) and nothing (if x <= 0). I think this is similar to what I have shown above with example A. This would potentially already help.
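A minimal Python sketch of that value mapping, for illustration only. The function name `indicator_text` is hypothetical, and the explicit `No data` branch is an assumption added here to make the missing-export case visible; it is not something the dashboard is known to do.

```python
def indicator_text(blacklisted_count):
    """Map the latest exported sample to the text an indicator panel could show.

    `blacklisted_count` is None when the metric was not exported in the
    selected time frame (assumed representation, for illustration).
    """
    if blacklisted_count is None:
        # Assumption: surface missing data instead of silently showing "all good".
        return "No data"
    if blacklisted_count > 0:
        return "Blacklisted"
    # x <= 0: show nothing, as proposed above.
    return ""


for sample in (None, 0, 3):
    print(repr(sample), "->", repr(indicator_text(sample)))
```

Separating "no data" from "zero" is the design choice worth discussing: without it, a time frame with no export looks the same as a healthy cluster.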

@pihme
Contributor

pihme commented Jan 7, 2022

What about exposing the metric as a rate and visualizing it with a heat map?

@Zelldon
Member Author

Zelldon commented Jan 7, 2022

This wouldn't help, since the metric is not exported all the time. Plus, the question is what the heatmap should tell me.

@pihme
Contributor

pihme commented Jan 7, 2022

I would expect to see a change in the blacklisting rate.

Kinda like here:
image

If it is not exported for some time, I will just see a black column, but I can always make the time scale wider and then I should see data pretty much for all the time, because each export would be aggregated in the rate (I hope).

Then I could look at a big enough time scale, and if my blacklisting suddenly jumps from 2 per day to 200 per day, that would be my signal. And if the jump is correlated to e.g. the point where there was an update or a pod restart, that would also be interesting.

This is all speculation though, one would have to see what it actually looks like.
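To make the idea concrete, here is a rough Python sketch of turning sparse cumulative samples into per-day increases. The sample layout `(day, cumulative_count)` and the function name `daily_increase` are assumptions for illustration. A drop in the count is treated as a restart from zero, which is similar in spirit to how Prometheus handles counter resets.

```python
from collections import defaultdict


def daily_increase(samples):
    """Compute per-day increases from sparse cumulative samples.

    `samples` is a time-ordered list of (day, cumulative_count) tuples.
    Days without an export are simply missing. Each increase is attributed
    to the day of the later sample.
    """
    per_day = defaultdict(int)
    previous = None
    for day, count in samples:
        if previous is not None:
            delta = count - previous
            if delta < 0:
                # The count dropped: assume a restart reset it to zero,
                # so everything counted since then is new.
                delta = count
            per_day[day] += delta
        previous = count
    return dict(per_day)


# Exports on days 0, 1, 5 and 6 only; days 2 to 4 are a gap.
print(daily_increase([(0, 2), (1, 4), (5, 6), (6, 206)]))
```

With a wide enough time scale, the gap between day 1 and day 5 collapses into a single delta, while the jump on day 6 still stands out.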

@Zelldon
Member Author

Zelldon commented Jan 7, 2022

Thanks @pihme

yeah so it would look like this

15 mins
heatmap

90 days
heatmap2

The problem is that we don't store data long enough on our Prometheus server, so data that contained some blacklisting may already have been deleted. That is why I ask what we want to see: the real count or just an indication? All in all, I think our current metric doesn't work well. Ideally we would always report, or at least report on bootstrap, the current value of blacklisted instances.

@pihme
Contributor

pihme commented Jan 7, 2022

Thanks @Zelldon. Yeah I regularly forget we forget data. I think 90 days would be good enough, and yes, the visualization looks like what I had in mind. But if the history is 30 days or less, it becomes less useful. So I see your point.

@KerstinHebel KerstinHebel removed this from Planned in Zeebe Mar 23, 2022
@Zelldon
Member Author

Zelldon commented Jun 2, 2022

I think the best option would be the one described as an alternative above:

> Describe alternatives you've considered.
> Ideally we would export the metric always, which would simplify things enormously. Then we could better choose one or more of the possible visualizations from above.

This means that on restart we export the metric at least once.

@Zelldon
Member Author

Zelldon commented Nov 18, 2022

@remcowesterhoud since you mentioned to me that you had to look at a cluster with zdb to find out whether something was blacklisted, it might be useful for you to just have this in the metrics :)

@menski
Contributor

menski commented Nov 25, 2022

Right now we don't see value in implementing this. We have other ways to identify blacklisted instances, which also allow us to see the corresponding process instance keys, i.e. logs or zdb.

Hopefully at some point we can get rid of blacklisting.

@korthout
Member

korthout commented Apr 5, 2023

ZPA triage:

  • we want to gain more insights into the frequency of blacklisting before choosing to replace the concept

@remcowesterhoud
Contributor

Attaching this to the "Blacklisting replacement" epic, as having insight into these metrics may be useful for it.

zeebe-bors-camunda bot added a commit that referenced this issue May 1, 2023
12606: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
On restart/recovery we need to restore the blacklist metric in order to report it always. Otherwise, we will have gaps in the reports and it is not reliable, due to restarts and resetting of counters.

Change counter to gauge, such that it can be reset/set to a specific value based on the state.
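
The idea can be sketched roughly as follows, in Python for brevity rather than the broker's own language. `Gauge`, `BlacklistMetrics`, and `on_blacklisted` are hypothetical stand-ins, not the actual Zeebe or Prometheus client classes, and `on_recovered` is only an illustrative name for a recovery hook.

```python
class Gauge:
    """Tiny stand-in for a metrics gauge (hypothetical, not a real client class)."""

    def __init__(self):
        self.value = None  # None means "nothing exported yet"

    def set(self, value):
        self.value = value

    def inc(self):
        self.value = (self.value or 0) + 1


class BlacklistMetrics:
    """Keeps the gauge in sync with the persisted blacklist (illustrative only)."""

    def __init__(self, gauge):
        self._gauge = gauge

    def on_recovered(self, persisted_blacklist):
        # On restart/recovery, set the gauge from the persisted state so the
        # metric is exported even if nothing new gets blacklisted afterwards.
        self._gauge.set(len(persisted_blacklist))

    def on_blacklisted(self):
        # A new instance was blacklisted while running.
        self._gauge.inc()


# Simulated restart: the in-memory gauge is fresh, the state survived.
gauge = Gauge()
metrics = BlacklistMetrics(gauge)
metrics.on_recovered({2251799813685249, 2251799813685250})
metrics.on_blacklisted()
print(gauge.value)
```

A counter can only go up from zero after a restart, which is why the sketch uses a gauge that can be set to an absolute value.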

See related comment #8263 (comment)

## Related issues

closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@Zelldon Zelldon reopened this May 2, 2023
@Zelldon
Member Author

Zelldon commented May 2, 2023

Reopening since we still need panels on the dashboard.

zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12643: [Backport stable/8.2] fix: brokers can list more than 255 backups r=deepthidevaki a=oleschoenburg

Manual backport of #12621 to fix revapi merge conflicts.

12644: [Backport stable/8.2] Restore blacklist metric r=Zelldon a=backport-action

# Description
Backport of #12606 to `stable/8.2`.

relates to #8263

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12629: [Backport 8.1]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports #12483 

## Related issues

relates to #12033



12646: [Backport 8.1]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description
Backports #12606

Merge conflicts because of imports.

## Related issues

closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 3, 2023
12630: [Backport 8.0]: Introduce experimental SST partitioning r=remcowesterhoud a=Zelldon

## Description
Backports  #12483

## Related issues

closes #12033



12645: [Backport 8.0]: Restore blacklist metric r=remcowesterhoud a=Zelldon

## Description

Backports #12606

The PR https://github.com/camunda/zeebe/pull/12306/files wasn't backported to 8.0, which caused some conflicts. I had to add the onRecovered method and call it in the ZeebeDbState.

## Related issues

closes #8263



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
@remcowesterhoud remcowesterhoud added the version:8.2.4 label May 3, 2023
@koevskinikola
Member

koevskinikola commented May 12, 2023

ZPA triage:

  • @Zelldon you mentioned HERE that you'll add the Grafana dashboards. Is this still your plan, or should ZPA do something from our side?
    • ZPA also has this issue marked as upcoming so we would do it soon either way.

@Zelldon Zelldon mentioned this issue May 15, 2023
14 tasks
zeebe-bors-camunda bot added a commit that referenced this issue May 15, 2023
12761: Add blacklisting panel r=oleschoenburg a=Zelldon

## Description
Add new panels to show blacklisted process instances, in order to avoid searching the whole log for errors.

One panel was added in the general section, so we can directly see whether there are any blacklisted instances in this namespace.

![blacklisted](https://github.com/camunda/zeebe/assets/2758593/0b1a3587-3353-4687-99bb-16dbd6c7eabf)

If there are any blacklisted instances, the `None` will change to the actual count, shown in red.


If we want to dig deeper, we can take a look at the panel in the processing section, which groups by namespace AND partition and shows it as a graph, so we can clearly see when it increased for a certain partition.

![blacklisted-proc](https://github.com/camunda/zeebe/assets/2758593/1150278d-5c60-4341-94ff-7d02f00ab8e5)
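
As a rough illustration of the two aggregations behind these panels, here is a small Python sketch. The sample layout (dicts with `namespace`, `partition`, and `value`) and both function names are assumptions made for this example, and it assumes exactly one sample per namespace and partition. It is not the actual dashboard query.

```python
from collections import defaultdict


def by_namespace_and_partition(samples):
    """Processing-section view: one series per (namespace, partition)."""
    grouped = defaultdict(int)
    for sample in samples:
        grouped[(sample["namespace"], sample["partition"])] += sample["value"]
    return dict(grouped)


def general_panel_text(samples, namespace):
    """General-section view: `None` if nothing is blacklisted, else the count."""
    total = sum(s["value"] for s in samples if s["namespace"] == namespace)
    return "None" if total == 0 else str(total)


# Made-up sample data, for illustration only.
samples = [
    {"namespace": "ns-a", "partition": 1, "value": 2},
    {"namespace": "ns-a", "partition": 2, "value": 1},
    {"namespace": "ns-b", "partition": 1, "value": 0},
]
print(by_namespace_and_partition(samples))
print(general_panel_text(samples, "ns-a"), general_panel_text(samples, "ns-b"))
```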


In order to show a real example, I tried it on INT, which shows us `3953a05c-7c20-41fb-8231-ec283dd2138b-zeebe` with three blacklisted instances. This is BTW a Zeebe cluster from our team on int 😅

![blacklisted-proc-int](https://github.com/camunda/zeebe/assets/2758593/f00c52e2-482f-489c-bafa-ba57cce7b71d)

## Related issues

closes #8263 



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@lenaschoenburg lenaschoenburg added the version:8.3.0-alpha2 label Jun 7, 2023
@megglos megglos added the version:8.3.0 label Oct 5, 2023

9 participants