Add zeebe overview dashboard #9067

Zelldon · 2022-04-07T06:58:41Z

Description

This feature was already part of several discussions and planning sessions, but we never had time for it. It came up again in the most recent game day (https://confluence.camunda.com/display/ZEEBE/2022-03-23+Game+Day) and in #8998: people like Support Engineers should have a dashboard that gives a good overview without being overwhelming.

Yesterday at the airport I had some time to play around with it and put together a dashboard with the panels I would normally check to detect issues.

It shows metrics like:

Partition topology and health
Current processing, exporting, incoming requests and backpressure
Resource consumption like CPU, memory and disk
Process instance execution latency

Tbh I'm a bit proud of the first table, since I was finally able to combine the partition topology (roles) and healthiness in one table 💪

You can take a look here http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=All&var-pod=All&var-partition=All&from=now-15m&to=now

Related issues

I think this covers #8998, but we can discuss this @deepthidevaki if you think we need more here.

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

The changes are backwards compatible with previous versions
If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; if that fails, you need to create the backports manually.

Testing:

There are unit/integration tests that verify all acceptance criteria of the issue
New tests are written to ensure backwards compatibility with further versions
The behavior is tested manually
The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
New content is added to the release announcement
If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Dashboard shows a general overview of the Zeebe system. It shows metrics like:

* Partition topology and health
* Current processing, exporting, incoming requests and backpressure
* Resource consumption like CPU, memory and disk
* Process instance execution latency

deepthidevaki

Thanks @Zelldon for taking the initiative. It is high time we have something like this. Here are my comments.

Since this is supposed to be a small dashboard, we can remove the collapsible section.

Remove the following panels. All of them can live in the detailed dashboard.

Current Events - I don't think this is interesting for first-level support.
Exporting per partition - Same here.
Number of log segments - PVC disk usage already gives some indication of this value, so this is only needed for further investigation.
Processing Queue Size - I guess a first-level support engineer cannot make sense of this value.
Process Instance Execution Time - Optional. Not sure if it is useful to have it in the overview. The latencies depend on the process, so it is probably not very useful as a first-level health check.

Rename PVC Disk Usage -> (Zeebe) Brokers Disk Usage

For "Request handled by Gateway", I would use a similar query as in grpc requests handled per sec in the grpc panel. I think it is useful to know how many requests are successful or not.
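To illustrate the idea of splitting handled requests by outcome, here is a minimal Python sketch. It is not the query used in the dashboard: the function name `rates_by_outcome`, the status codes in the sample data, and the assumption that the gateway exposes one cumulative request counter per gRPC status code are all hypothetical and only serve to show the arithmetic of a per-second rate grouped into successful and failed requests.

```python
from collections import defaultdict


def rates_by_outcome(earlier, later, interval_seconds):
    """Return the per-second request rate split into 'ok' and 'failed'.

    earlier / later: dicts mapping a gRPC status code (e.g. "OK",
    "UNAVAILABLE") to a cumulative counter value, sampled
    interval_seconds apart. The shape of this input is an assumption
    made for illustration only.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    rates = defaultdict(float)
    for code, end_value in later.items():
        start_value = earlier.get(code, 0)
        # A counter that decreased was reset (e.g. after a restart),
        # so count from zero instead of producing a negative delta.
        if end_value >= start_value:
            delta = end_value - start_value
        else:
            delta = end_value
        outcome = "ok" if code == "OK" else "failed"
        rates[outcome] += delta / interval_seconds
    return dict(rates)


# Hypothetical samples taken 60 seconds apart.
earlier = {"OK": 1000, "UNAVAILABLE": 20, "RESOURCE_EXHAUSTED": 5}
later = {"OK": 1600, "UNAVAILABLE": 50, "RESOURCE_EXHAUSTED": 5}
print(rates_by_outcome(earlier, later, 60))
```

Grouping by outcome rather than by individual status code keeps the panel readable for a first-level check, while the per-code breakdown can stay in the detailed gRPC panel.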

Zelldon · 2022-04-07T09:04:20Z

Thanks @deepthidevaki for the review :)

I was unsure about the row, but I removed it now.

> Remove the following panels. All of them can live in the detailed dashboard.
>
> Current Events - I don't think this is interesting for first-level support.

Yeah, I thought it might be interesting for internal load and to see what is processed, but maybe the processing panel is enough for now. We can also add it again later if we see the need.

> Exporting per partition - Same here.

I will keep it. I think this is important since you can verify whether anything is currently exporting, which is useful for detecting issues such as Operate not showing data.

> Number of log segments - PVC disk usage already gives some indication of this value, so this is only needed for further investigation.

Yeah, that was probably my own curiosity :D Agreed, I removed it.

> Processing Queue Size - I guess a first-level support engineer cannot make sense of this value.

Probably right. I removed it.

> Process Instance Execution Time - Optional. Not sure if it is useful to have it in the overview. The latencies depend on the process, so it is probably not very useful as a first-level health check.

I removed it as well :)

See the new panels:

deepthidevaki

❤️ 🚀

As a follow-up, we may have to update the troubleshooting guide to link to this dashboard. It would also be good to test it during the next game day, to see whether the information in this dashboard is useful and sufficient for an overview.

Zelldon · 2022-04-07T09:30:04Z

> As a follow-up, we may have to update the troubleshooting guide to link to this dashboard. It would also be good to test it during the next game day, to see whether the information in this dashboard is useful and sufficient for an overview.

Yep, totally agree :) First I need to get this into cloud; updating the troubleshooting guide is also on my todo list :D 👍

Zelldon · 2022-04-07T09:30:10Z

bors r+

9067: Add zeebe overview dashboard r=Zelldon a=Zelldon

## Description

This feature was already part of several discussions and planning sessions, but we never had time for it. It came up again in the most recent game day (https://confluence.camunda.com/display/ZEEBE/2022-03-23+Game+Day) and in #8998: people like Support Engineers should have a dashboard that gives a good overview without being overwhelming.

Yesterday at the airport I had some time to play around with it and put together a dashboard with the panels I would normally check to detect issues.

It shows metrics like:

* Partition topology and health
* Current processing, exporting, incoming requests and backpressure
* Resource consumption like CPU, memory and disk
* Process instance execution latency

![first](https://user-images.githubusercontent.com/2758593/162138337-30dfc08d-1492-4ae3-9569-d21daf829adc.png)
![second](https://user-images.githubusercontent.com/2758593/162138341-838060f6-b76b-480c-bf80-df1d9d24bdd8.png)

Tbh I'm a bit proud of the first table, since I was finally able to combine the partition topology (roles) and healthiness in one table 💪

You can take a look here http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=All&var-pod=All&var-partition=All&from=now-15m&to=now

## Related issues

I think this covers #8998, but we can discuss this `@deepthidevaki` if you think we need more here.

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>

npepinpe

🚀

Late but still wanted to comment 😄

zeebe-bors-camunda · 2022-04-07T09:52:36Z

Build failed:

continuous-integration/jenkins/branch

Zelldon · 2022-04-07T09:55:22Z

bors r+

zeebe-bors-camunda · 2022-04-07T10:18:05Z

Build failed:

continuous-integration/jenkins/branch

Zelldon requested review from npepinpe and deepthidevaki April 7, 2022 06:58

refactor: reset vars and collaps row

54d99e8

deepthidevaki requested changes Apr 7, 2022

Zelldon added 2 commits April 7, 2022 10:55

refactor: remove panels

5a78d2b

Apply review hints and remove panels like:

* current events
* number of segments
* processing queue
* process latency

Kept exporters since I think it makes sense to see it. Reordered panels and removed row.

refactor: show responses in gateway panel

212f82f

Zelldon requested a review from deepthidevaki April 7, 2022 09:04

deepthidevaki approved these changes Apr 7, 2022

npepinpe approved these changes Apr 7, 2022

Zelldon merged commit ec6b462 into main Apr 7, 2022

Zelldon deleted the zell-overview-metrics branch April 7, 2022 10:18

Zelldon mentioned this pull request Apr 7, 2022

Make new overview dashboard available in zeebe [+long running] cluster #9075

Closed

deepthidevaki added the version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 label May 3, 2022

This was referenced Jun 3, 2022

In metrics, I can distinguish the partition health of the leader from the followers #8352

Closed

Split grafana dashboard into two #8998

Closed

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
