Add zeebe overview dashboard #9067

Merged
merged 4 commits into from
Apr 7, 2022

@Zelldon Zelldon commented Apr 7, 2022

Description

This feature has already come up in several discussions and planning rounds, but we never had time for it. It came up again in the most recent game day (https://confluence.camunda.com/display/ZEEBE/2022-03-23+Game+Day) and in #8998: people like Support Engineers should have a dashboard that gives a good overview without being overwhelming.

Yesterday at the airport I had some time to play around with it and put together a dashboard with the panels I would normally check to detect issues.

It shows metrics like:

  • Partition topology and health
  • Current processing, exporting, incoming requests and backpressure
  • Resource consumption like CPU, memory and disk
  • Process instance execution latency

[Image: first]
[Image: second]

Tbh I'm a bit proud of the first table, since I was finally able to combine the partition topology (roles) and health in one table 💪

You can take a look here http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=All&var-pod=All&var-partition=All&from=now-15m&to=now

Related issues

I think this covers #8998, but we can discuss this, @deepthidevaki, if you think we need more here.

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

 The dashboard shows a general overview of the Zeebe system. It shows metrics like:

 * Partition topology and health
 * Current processing, exporting, incoming requests and backpressure
 * Resource consumption like CPU, memory and disk
 * Process instance execution latency

@deepthidevaki deepthidevaki left a comment

Thanks @Zelldon for taking the initiative. It is high time we have something like this. Here are my comments.

Since this is supposed to be a small dashboard, we can remove the collapsible section.

Remove the following panels. All of them can be in the detailed panels.

  • Current Events - I don't think this is interesting for first-level support.
  • Exporting per partition - Same here.
  • Number of log segments - PVC disk usage already gives some indication of this value, so this is only needed for further investigation.
  • Processing Queue Size - I guess a first-level support engineer cannot make sense of this value.
  • Process Instance Execution Time - Optional. Not sure if it is useful to have it in the overview. The latencies depend on the process, so it is probably not very useful as a first-level health check.

Rename PVC Disk Usage -> (Zeebe) Brokers Disk Usage

For "Request handled by Gateway", I would use a query similar to the one for grpc requests handled per sec in the grpc panel. I think it is useful to know how many requests succeed and how many fail.

Apply review hints and remove panels like:

 * current events
 * number of segments
 * processing queue
 * process latency

Kept the exporters panel since I think it makes sense to see it. Reordered panels and removed the row.

Zelldon commented Apr 7, 2022

Thanks @deepthidevaki for the review :)

I was unsure about the row, but I removed it now.

> Remove the following panels. All of them can be in the detailed panels.
>
> Current Events - I don't think this is interesting for first-level support.

Yeah, I thought it might be interesting for internal load and to see what is processed, but maybe the processing panel is enough for now. We can add it again later if we see the need.

> Exporting per partition - Same here.

I will keep it. I think this is important since you can verify whether anything is currently exporting, which is useful for detecting issues such as Operate not showing data.

> Number of log segments - PVC disk usage already gives some indication of this value, so this is only needed for further investigation.

Yeah, that was probably my own curiosity :D Agreed, I removed it.

> Processing Queue Size - I guess a first-level support engineer cannot make sense of this value.

Probably right. I removed it.

> Process Instance Execution Time - Optional. Not sure if it is useful to have it in the overview. The latencies depend on the process, so it is probably not very useful as a first-level health check.

I removed it as well :)

See the new panels:

[Image: other]
[Image: other2]


@deepthidevaki deepthidevaki left a comment

❤️ 🚀

As a follow-up, we may have to update the troubleshooting guide to link to this dashboard. It would also be good to test during the next game day whether the information in this dashboard is useful and sufficient for an overview.


Zelldon commented Apr 7, 2022

> As a follow-up, we may have to update the troubleshooting guide to link to this dashboard. It would also be good to test during the next game day whether the information in this dashboard is useful and sufficient for an overview.

Yep, totally agree :) First I need to get this into cloud, and updating the troubleshooting guide is also on my to-do list :D 👍


Zelldon commented Apr 7, 2022

bors r+

zeebe-bors-camunda bot added a commit that referenced this pull request Apr 7, 2022
9067: Add zeebe overview dashboard r=Zelldon a=Zelldon

## Description

This feature has already come up in several discussions and planning rounds, but we never had time for it. It came up again in the most recent game day (https://confluence.camunda.com/display/ZEEBE/2022-03-23+Game+Day) and in #8998: people like Support Engineers should have a dashboard that gives a good overview without being overwhelming.

Yesterday at the airport I had some time to play around with it and put together a dashboard with the panels I would normally check to detect issues.


It shows metrics like:

 * Partition topology and health
 * Current processing, exporting, incoming requests and backpressure
 * Resource consumption like CPU, memory and disk
 * Process instance execution latency

![first](https://user-images.githubusercontent.com/2758593/162138337-30dfc08d-1492-4ae3-9569-d21daf829adc.png)
![second](https://user-images.githubusercontent.com/2758593/162138341-838060f6-b76b-480c-bf80-df1d9d24bdd8.png)


Tbh I'm a bit proud of the first table, since I was finally able to combine the partition topology (roles) and health in one table 💪


You can take a look here http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=All&var-pod=All&var-partition=All&from=now-15m&to=now

## Related issues

I think this covers #8998, but we can discuss this, `@deepthidevaki`, if you think we need more here.



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>

@npepinpe npepinpe left a comment

🚀

Late but still wanted to comment 😄

@zeebe-bors-camunda

Build failed:


Zelldon commented Apr 7, 2022

bors r+

zeebe-bors-camunda bot added a commit that referenced this pull request Apr 7, 2022
9067: Add zeebe overview dashboard r=Zelldon a=Zelldon

@zeebe-bors-camunda

Build failed:

@Zelldon Zelldon merged commit ec6b462 into main Apr 7, 2022
@Zelldon Zelldon deleted the zell-overview-metrics branch April 7, 2022 10:18
@deepthidevaki deepthidevaki added the version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 label May 3, 2022
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022