mgr/dashboard: Compare values of MTU alert by device #45583
jenkins test dashboard
Fixes: https://tracker.ceph.com/issues/55004 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Fixes: https://tracker.ceph.com/issues/55004 Signed-off-by: Aashish Sharma <aasharma@redhat.com>
jenkins test dashboard
Does this PR need to go through Teuthology?
LGTM! Thanks a lot @p-se !
We discussed at today's stand-up whether to improve the MTU alert in a follow-up PR and we agreed on:
- as per this PR, the MTU alert will work as long as all hosts are using the same MTU setting for the same interface name across the cluster (i.e., `eth0` on all hosts has MTU 1500, `eth1` has MTU 9000, etc.).
- to make this work with the network address instead of the interface name, we'd need to provide new metrics via mgr/prometheus (since node-exporter doesn't provide this info yet).
- based on upstream users and downstream customer experience so far, we think that it is fair to assume that Ceph users are generally configuring the same networks (cluster or public) on network interfaces whose names are the same.
- for those users not following that rule, the impact of this alert would be minimal, as it could easily be muted from the Dashboard as soon as it was first triggered misleadingly.
node_network_mtu_bytes * (node_network_up{device!="lo"} > 0) ==
  scalar(
    max by (device) (node_network_mtu_bytes * (node_network_up{device!="lo"} > 0)) !=
      quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!="lo"} > 0))
  )
or
node_network_mtu_bytes * (node_network_up{device!="lo"} > 0) ==
  scalar(
    min by (device) (node_network_mtu_bytes * (node_network_up{device!="lo"} > 0)) !=
      quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!="lo"} > 0))
  )
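For readers less familiar with PromQL, the expression above can be approximated in plain Python. This is only a sketch under stated assumptions, not the alert's actual implementation: the helper names (`up_mtu`, `by_device`, `scalar`, `mtu_mismatch`) and the sample hosts are hypothetical, series are keyed only by `(instance, device)`, and `scalar()` is modeled as returning NaN unless the vector has exactly one element.

```python
import math
from statistics import median

# A "vector" here is a dict: (instance, device) -> value. Hypothetical model.


def up_mtu(mtu, up):
    """node_network_mtu_bytes * (node_network_up{device!="lo"} > 0)"""
    return {
        key: value * up[key]
        for key, value in mtu.items()
        if key[1] != "lo" and up.get(key, 0) > 0
    }


def by_device(samples, agg):
    """Aggregate values per device name, like `<agg> by (device) (...)`."""
    groups = {}
    for (_, device), value in samples.items():
        groups.setdefault(device, []).append(value)
    return {device: agg(values) for device, values in groups.items()}


def scalar(vector):
    """Value of a single-element vector, NaN otherwise."""
    values = list(vector.values())
    return values[0] if len(values) == 1 else math.nan


def mtu_mismatch(samples):
    """Return the series whose MTU equals a per-device extreme that differs
    from that device's median (max branch first, then min branch)."""
    med = by_device(samples, median)
    flagged = {}
    for agg in (max, min):
        extreme = by_device(samples, agg)
        # `vector != vector` keeps left-hand elements whose value differs.
        outliers = {d: v for d, v in extreme.items() if v != med[d]}
        threshold = scalar(outliers)
        # `vector == scalar` acts as a filter; NaN never compares equal.
        for key, value in samples.items():
            if value == threshold:
                flagged.setdefault(key, value)  # `or` keeps the left side
    return flagged


# Hypothetical cluster: host "c" has a different MTU on eth0.
mtu = {("a", "eth0"): 1500, ("b", "eth0"): 1500, ("c", "eth0"): 9000,
       ("a", "lo"): 65536}
up = {key: 1 for key in mtu}
print(mtu_mismatch(up_mtu(mtu, up)))  # {('c', 'eth0'): 9000}
```

In this sketch, a branch only matches when exactly one device name has an extreme that differs from its median; if several device names deviate in the same direction, the modeled `scalar()` yields NaN and that branch flags nothing.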
🤯 this is well beyond my PromQL literacy 🙈 well, I'll trust the unit tests, which seem to make sense!
Took me a while to come up with it as well. As @aaSharma14 contributed the tests, found an issue and suggested the fix, some credit goes to him as well. So, thanks @aaSharma14 for the fruitful cooperation :)
Fixes: https://tracker.ceph.com/issues/55004
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Checklist
Show available Jenkins commands
- jenkins retest this please
- jenkins test classic perf
- jenkins test crimson perf
- jenkins test signed
- jenkins test make check
- jenkins test make check arm64
- jenkins test submodules
- jenkins test dashboard
- jenkins test dashboard cephadm
- jenkins test api
- jenkins test docs
- jenkins render docs
- jenkins test ceph-volume all
- jenkins test ceph-volume tox
- jenkins test windows