feat: add backup metrics to grafana dashboard #10521

Merged
merged 3 commits into from
Sep 28, 2022

Conversation

lenaschoenburg
Member

@lenaschoenburg lenaschoenburg commented Sep 27, 2022

[Screenshot, 2022-09-28: Grafana dashboard "Zeebe with Backup Metrics"]

"targets": [
{
"exemplar": false,
"expr": "min(max_over_time(zeebe_backup_operations_in_progress{namespace=~\"$namespace\", partition=~\"$partition\", pod=~\"$pod\", operation=\"take\"}[5m])) by (partition)",
Member Author

❓ This query is a bit weird. I tried to remove time series where a gauge is stuck at a specific value. This happened during testing when a broker transitioned to follower while taking a backup. Because that follower never saw the backup complete or fail, it kept reporting an in-progress backup.

I've tried to explain the query I came up with here, but ultimately I'm not sure why exactly this works or even whether it is correct at all 🙈
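
To make the intent of the query easier to reason about, here is a rough Python emulation of `min(...) by (partition)` over `max_over_time(...[5m])` on toy data. It is only a sketch: the pod names, the scrape interval, and the sample values are made up, and the window handling is a simplification of what Prometheus actually does.

```python
from collections import defaultdict

WINDOW = 5 * 60  # seconds, mirrors the [5m] range in the query


def max_over_time(samples, now, window=WINDOW):
    """Largest sample value in the trailing window, or None if the window is empty."""
    values = [v for t, v in samples if now - window <= t <= now]
    return max(values) if values else None


def min_by_partition(series, now):
    """series maps (pod, partition) -> [(timestamp, value), ...]."""
    per_partition = defaultdict(list)
    for (_pod, partition), samples in series.items():
        value = max_over_time(samples, now)
        if value is not None:
            per_partition[partition].append(value)
    return {partition: min(values) for partition, values in per_partition.items()}


# Hypothetical data: one scrape per minute for ten minutes.
scrapes = range(0, 601, 60)
stuck_follower = [(t, 1) for t in scrapes]  # gauge stuck at 1 after a role change
idle_leader = [(t, 0) for t in scrapes]     # no backup running
busy_leader = [(t, 1) for t in scrapes]     # backup actually running

print(min_by_partition({("zeebe-0", 1): stuck_follower, ("zeebe-1", 1): idle_leader}, now=600))
print(min_by_partition({("zeebe-0", 1): stuck_follower, ("zeebe-1", 1): busy_leader}, now=600))
```

In the first case the healthy pod reports 0, so the stuck 1 is masked; in the second both pods report 1 and the result is 1. The same `min` would also hide a genuine in-progress backup whenever another pod exports 0 for that partition, and it cannot mask a stuck value if the stuck pod is the only one exporting the series.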

Contributor

Maybe we should also update the in-progress metrics when closing the BackupService. If the node restarts, the metrics are reset anyway, I think, as observed with other metrics. Can you verify that? If it works, we probably don't need this complex query.

In this query, it is not clear what the value would be if an in-progress backup takes more than 5m. Would max_over_time ignore it and show the result as 0?
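
As a rough check of the 5m question, the following Python sketch emulates `max_over_time` over a trailing window on a made-up gauge that stays at 1 for a 12-minute backup. The scrape interval, the timestamps, and the inclusive window bounds are assumptions for illustration, not Prometheus's exact semantics.

```python
def max_over_time(samples, now, window=300):
    """Largest sample value in the trailing window, or None if the window is empty."""
    values = [v for t, v in samples if now - window <= t <= now]
    return max(values) if values else None


# Hypothetical gauge scraped once a minute:
# 1 while the backup runs (minutes 0-11), 0 from minute 12 onwards.
samples = [(minute * 60, 1 if minute < 12 else 0) for minute in range(25)]

for minute in (6, 11, 16, 17):
    print(minute, max_over_time(samples, now=minute * 60))
```

Under these assumptions the result stays at 1 for as long as the gauge is 1, so a long backup is not ignored. Instead, the value lingers at 1 for up to the window length after the gauge drops back to 0, and only then falls to 0.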

Member Author

> Maybe we should also update the in-progress metrics when closing the BackupService

Good idea. Alternatively, we could update the metric when marking the in-progress backup as failed during startup of a new BackupService.

Member Author

I fixed it by resetting the in-progress gauges to 0 when the BackupService actor closes. The query is adjusted too.
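
For readers who want the shape of that fix without opening the diff, here is a minimal Python sketch of the idea. It is not the actual Zeebe code: `Gauge` is a tiny in-memory stand-in for a metrics client, and the `BackupService` class with its `start`, `finish`, and `close` methods is hypothetical.

```python
class Gauge:
    """In-memory stand-in for a metrics gauge (hypothetical, not a real client API)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1

    def set(self, value):
        self.value = value


class BackupService:
    """Hypothetical service that tracks in-progress operations per operation name."""

    def __init__(self):
        self.in_progress = {}  # operation name -> Gauge

    def _gauge(self, operation):
        return self.in_progress.setdefault(operation, Gauge())

    def start(self, operation):
        self._gauge(operation).inc()

    def finish(self, operation):
        self._gauge(operation).dec()

    def close(self):
        # The service will never observe the outcome of operations that are
        # still running, so reset every gauge instead of leaving it stuck.
        for gauge in self.in_progress.values():
            gauge.set(0)


service = BackupService()
service.start("take")   # backup begins, gauge goes to 1
service.close()         # e.g. the broker transitions to follower mid-backup
print(service.in_progress["take"].value)
```

With the reset in `close`, a pod that stops taking part in a backup no longer exports a stale non-zero value, which is what removes the need to filter stuck series in the dashboard query.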

@github-actions
Contributor

github-actions bot commented Sep 27, 2022

Test Results

  763 files  -169      763 suites  -169     1h 50m 26s ⏱️  -14m 33s
6 013 tests  -1 477    6 002 ✔️  -1 478    10 💤 ±0    1 ❌ +1
6 191 runs   -1 487    6 180 ✔️  -1 488    10 💤 ±0    1 ❌ +1

For more details on these failures, see this check.

Results for commit 388cf10. ± Comparison against base commit 751ac05.

♻️ This comment has been updated with latest results.

@deepthidevaki
Contributor

@oleschoenburg Could you also share the final dashboard view here? Just for reference.

monitor/grafana/zeebe.json (outdated, resolved)

Contributor

@deepthidevaki deepthidevaki left a comment

🎉 Looks nice. One small thing: the unit of the take-backup latency is not clear from the graph.

@lenaschoenburg
Member Author

I added the correct unit for backup latency, thanks for spotting this 👍

bors r+

@zeebe-bors-camunda
Contributor

Build succeeded:

@backport-action
Collaborator

Successfully created backport PR #10544 for release-8.1.0.

zeebe-bors-camunda bot added a commit that referenced this pull request Sep 28, 2022
10544: [Backport release-8.1.0] feat: add backup metrics to grafana dashboard r=oleschoenburg a=backport-action

# Description
Backport of #10521 to `release-8.1.0`.

relates to 

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022