Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/dashboard: Compare values of MTU alert by device #45583

Merged
merged 2 commits into from Apr 1, 2022

Conversation

p-se
Copy link
Contributor

@p-se p-se commented Mar 23, 2022

Fixes: https://tracker.ceph.com/issues/55004

Signed-off-by: Patrick Seidensal pseidensal@suse.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@p-se p-se requested a review from aaSharma14 March 23, 2022 13:55
@p-se p-se requested a review from a team as a code owner March 23, 2022 13:55
@p-se p-se requested review from pereman2 and removed request for a team March 23, 2022 13:55
@aaSharma14 aaSharma14 requested review from epuertat and removed request for aaSharma14 March 24, 2022 09:18
@github-actions github-actions bot added this to In progress in Dashboard Mar 24, 2022
@p-se p-se force-pushed the monitoring-alert-mtu-group-by-devices branch from a73f52d to 7599770 Compare March 24, 2022 09:22
@aaSharma14
Copy link
Contributor

jenkins test dashboard

p-se and others added 2 commits March 28, 2022 13:38
Fixes: https://tracker.ceph.com/issues/55004

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Fixes: https://tracker.ceph.com/issues/55004
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
@p-se p-se force-pushed the monitoring-alert-mtu-group-by-devices branch from f06c7d0 to 49d6068 Compare March 28, 2022 11:42
@aaSharma14
Copy link
Contributor

jenkins test dashboard

Dashboard automation moved this from In progress to Reviewer approved Mar 30, 2022
@p-se
Copy link
Contributor Author

p-se commented Mar 31, 2022

Does this PR need to go through Teuthology?

@epuertat epuertat moved this from Reviewer approved to Ready-to-merge in Dashboard Apr 1, 2022
Copy link
Member

@epuertat epuertat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot @p-se !

We discussed at today's stand-up whether to improve the MTU alert in a follow-up PR and we agreed on:

  • as per this PR the MTU alert will work as long as all hosts are using the same MTU setting for the same interface name across the cluster (i.e.: eth0 in all hosts has MTU 1500, eth1 has MTU 9000, etc).
  • to make this work with the network address instead of with the interface name, we'd need to provide new metrics via mgr/prometheus (since node-exporter doesn't provide this info yet).
  • based on upstream users and downstream customer experience so far, we think that it is fair to assume that Ceph users are generally configuring the same networks (cluster or public) on network interfaces whose names are the same.
  • for those users don't following that rule, the impact of this alert would be minimal, as it could be easily muted from the Dashboard as soon as it would be first triggered misleadingly.

Comment on lines +708 to +718
node_network_mtu_bytes * (node_network_up{device!="lo"} > 0) ==
scalar(
max by (device) (node_network_mtu_bytes * (node_network_up{device!="lo"} > 0)) !=
quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!="lo"} > 0))
)
or
node_network_mtu_bytes * (node_network_up{device!="lo"} > 0) ==
scalar(
min by (device) (node_network_mtu_bytes * (node_network_up{device!="lo"} > 0)) !=
quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!="lo"} > 0))
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯 this is well beyond my PromQL literacy 🙈 well, I'll trust the unit tests, which seem to make sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a while to come up with it as well. As @aaSharma14 contributed the tests, found an issue and suggested the fix, some credit goes to him as well. So, thanks @aaSharma14 for the fruitful cooperation :)

Dashboard automation moved this from Ready-to-merge to Reviewer approved Apr 1, 2022
@epuertat epuertat merged commit 2d1c480 into ceph:master Apr 1, 2022
13 checks passed
Dashboard automation moved this from Reviewer approved to Done Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Dashboard
  
Done
4 participants