feat: add backup metrics to grafana dashboard #10521
Conversation
lenaschoenburg commented on Sep 27, 2022 (edited)
monitor/grafana/zeebe.json (Outdated)
"targets": [
  {
    "exemplar": false,
    "expr": "min(max_over_time(zeebe_backup_operations_in_progress{namespace=~\"$namespace\", partition=~\"$partition\", pod=~\"$pod\", operation=\"take\"}[5m])) by (partition)",
❓ This query is a bit weird. I tried to filter out time series where a gauge would be stuck at a specific value. This happened during testing when a broker transitioned to follower while taking a backup: because that follower never saw the backup complete or fail, it would still report an in-progress backup.
I've tried to explain the query I came up with here, but ultimately I'm not sure why exactly this works, or even whether it is correct at all 🙈
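One possible reading of the query (my own interpretation, not verified against the discussion): the inner `max_over_time` keeps short-lived backups visible, and the outer `min ... by (partition)` masks a single replica whose gauge is stuck, as long as another replica of the same partition reports correctly.

```promql
# Interpretation sketch of the query above (assumptions, not confirmed):
# 1. max_over_time(...[5m])  -- per time series (pod), the highest gauge
#    value seen in the last 5m, so even a brief backup registers as 1.
# 2. min(...) by (partition) -- across the replicas of one partition,
#    take the lowest value; a lone follower stuck at 1 is masked once
#    any other replica of that partition correctly reports 0.
min(
  max_over_time(
    zeebe_backup_operations_in_progress{
      namespace=~"$namespace", partition=~"$partition",
      pod=~"$pod", operation="take"
    }[5m]
  )
) by (partition)
```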
Maybe we should also update the in-progress metrics while closing BackupService. If the node restarts, the metrics are reset anyway, I think, as observed with other metrics. Can you verify it? If that works, then we probably don't need this complex query.
In this query, it is not clear what the value would be if an in-progress backup takes more than 5m. Will the max_over_time query ignore it and show the result as 0?
> Maybe we should also update the in-progress metrics while closing BackupService

Good idea. Alternatively, we could update them when marking the in-progress backup as failed during startup of a new BackupService.
I fixed it by resetting the in-progress counters to 0 when the BackupService actor closes. Query is adjusted too.
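A minimal sketch of that fix, in Python rather than Zeebe's actual Java code, with all names (`BackupServiceMetrics`, `reset`, the operation labels) hypothetical: the in-progress gauge is zeroed when the service closes, so a broker that steps down mid-backup stops reporting a stuck value.

```python
# Hypothetical sketch (not the real Zeebe implementation): a per-operation
# "in progress" gauge that is reset to zero when the BackupService closes.

class BackupServiceMetrics:
    """Tracks in-progress backup operations per operation type."""

    def __init__(self) -> None:
        self.in_progress = {"take": 0, "restore": 0, "delete": 0}

    def operation_started(self, operation: str) -> None:
        self.in_progress[operation] += 1

    def operation_finished(self, operation: str) -> None:
        self.in_progress[operation] -= 1

    def reset(self) -> None:
        # Called on close: clear all gauges so no stale "in progress"
        # value survives a leader-to-follower transition.
        for operation in self.in_progress:
            self.in_progress[operation] = 0


class BackupService:
    def __init__(self, metrics: BackupServiceMetrics) -> None:
        self.metrics = metrics

    def take_backup(self) -> None:
        self.metrics.operation_started("take")
        # ... backup runs asynchronously; operation_finished("take")
        # would be called on completion or failure.

    def close(self) -> None:
        # The fix described above: reset counters when the actor closes.
        self.metrics.reset()
```

With this in place, a follower that closed its service mid-backup exports 0 again, and the dashboard no longer needs the `min(max_over_time(...))` workaround.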
Test Results
763 files (-169), 763 suites (-169), 1h 50m 26s ⏱️ (-14m 33s)
For more details on these failures, see this check.
Results for commit 388cf10. ± Comparison against base commit 751ac05.
♻️ This comment has been updated with latest results.
@oleschoenburg Could you also share the final dashboard view here? Just for reference.
Force-pushed from e0db00f to 0edd599.
Force-pushed from 0edd599 to 4a9af3c.
Force-pushed from 4a9af3c to 388cf10.
🎉 Looks nice. One small thing: the unit of take-backup latency is not clear from the graph.
Force-pushed from 388cf10 to ebe5a1b.
I added the correct unit for backup latency, thanks for spotting this 👍
bors r+
Build succeeded:
Successfully created backport PR #10544 for