mgr/prometheus: Update rule format and enhance SNMP support #43783
Here's the tox output

[paul@rhp1gen3 tests]$ tox
py3 installed: attrs==21.2.0,beautifulsoup4==4.10.0,bs4==0.0.1,iniconfig==1.1.1,packaging==21.0,pluggy==1.0.0,py==1.10.0,pyparsing==3.0.3,pytest==6.2.5,PyYAML==6.0,soupsieve==2.2.1,toml==0.10.2
py3 run-test-pre: PYTHONHASHSEED='3007607034'
py3 run-test: commands[0] | pytest -rA test_syntax.py test_unittests.py
============================= test session starts =============================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py3/.pytest_cache
rootdir: /home/paul/git/ceph/monitoring/prometheus/tests
collected 8 items

test_syntax.py .....                                                     [ 62%]
test_unittests.py ...                                                    [100%]

=================================== PASSES ====================================
=========================== short test summary info ===========================
PASSED test_syntax.py::test_alerts_present
PASSED test_syntax.py::test_unittests_present
PASSED test_syntax.py::test_rules_format
PASSED test_syntax.py::test_unittests_format
PASSED test_syntax.py::test_rule_syntax
PASSED test_unittests.py::test_alerts_present
PASSED test_unittests.py::test_unittests_present
PASSED test_unittests.py::test_run_unittests
============================== 8 passed in 5.33s ==============================
py3 run-test: commands[1] | ./validate_rules.py

Checking rule groups
cluster health : ..
mon : .....
osd : ................
mds : .......
mgr : ..
pgs : .........
nodes : .....
pools : ....
healthchecks : .
cephadm : ...
PrometheusServer : .
rados : .

Summary
Rule file : ../alerts/ceph_default_alerts.yml
Unit Test file : test_alerts.yml
Rule groups processed : 12
Rules processed : 56
SNMP OIDs declared : 34
Rule errors : 0
Rule warnings : 0
Rule name duplicates : 0
Unit tests missing : 0

No problems detected in the rule file
No problems detected in unit tests file
___________________________________ summary ___________________________________
py3: commands succeeded
congratulations :)
LGTM! Just a small comment on the readability of the CamelCase alert names.
org cluster (alerts) source Category
1.3.6.1 .4 .1 .50495 .1 .2 .1 .2 (Ceph Health)
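For anyone reading the OID layout quoted above, here is a minimal Python sketch of how such an OID could be split into its parts. The field names are taken from the README header line; the assumption that they map one-to-one onto the arcs after the standard `1.3.6.1.4.1` enterprise prefix is mine, and `split_oid` is a hypothetical helper that is not part of this PR.

```python
# Hypothetical helper: split a Ceph alert OID into named arcs.
# Assumption: the arcs after the enterprise prefix map, in order, onto the
# column names shown in the README header (org, cluster, alerts, source, category).
ENTERPRISE_PREFIX = "1.3.6.1.4.1"
FIELDS = ("org", "cluster", "alerts", "source", "category")


def split_oid(oid: str) -> dict:
    """Return a dict of field name -> integer arc for an OID string."""
    # The README pads the arcs with spaces for alignment, so strip them first.
    compact = oid.replace(" ", "")
    if not compact.startswith(ENTERPRISE_PREFIX + "."):
        raise ValueError(f"OID is not under {ENTERPRISE_PREFIX}: {oid!r}")
    arcs = compact[len(ENTERPRISE_PREFIX) + 1:].split(".")
    if len(arcs) < len(FIELDS):
        raise ValueError(f"expected at least {len(FIELDS)} arcs, got {len(arcs)}")
    # Any arcs beyond the named fields are ignored in this sketch.
    return dict(zip(FIELDS, (int(arc) for arc in arcs)))


if __name__ == "__main__":
    print(split_oid("1.3.6.1 .4 .1 .50495 .1 .2 .1 .2"))
```

Under that mapping, the example row resolves to org 50495 with category 2, which the README labels as Ceph Health.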
@liewegas I checked the IANA registry for the 50495 org and saw your newdream email address. Should we update that address to some generic ceph.io address?
BTW @pcuzner according to this conversation, could you plz add the RECENT_CRASH alert too? Thanks!
Rules now adhere to the format defined by Prometheus.io. This changes alert naming, and each alert now includes a summary description to provide a quick one-liner. In addition to the reformatting, some missing alerts for MDS and cephadm have been added, along with corresponding tests. The MIB has also been refactored so that it now passes standard lint tests, and a README is included for devs to understand the OID schema.

Fixes: https://tracker.ceph.com/issues/53111
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
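As a rough illustration of the two conventions mentioned in this PR (CamelCase alert names and a one-line `summary` annotation on each alert), a check along these lines could be sketched in Python. This is not the logic in `validate_rules.py`; `check_rule` is a hypothetical helper, the name pattern is a deliberately loose assumption, and the sample rules are made up for the example.

```python
import re

# Loose assumption for "CamelCase": starts with an uppercase letter and
# contains only letters and digits (so acronym runs such as "OSD" are allowed).
CAMEL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def check_rule(rule: dict) -> list:
    """Return a list of problems found in one alerting rule (as a parsed dict)."""
    problems = []
    name = rule.get("alert", "")
    if not CAMEL_CASE.match(name):
        problems.append("alert name is not CamelCase")
    if not rule.get("expr"):
        problems.append("missing expr")
    # The summary annotation is the quick one-liner described in the commit message.
    if not rule.get("annotations", {}).get("summary"):
        problems.append("missing annotation: summary")
    return problems


if __name__ == "__main__":
    # Illustrative rules only; they are not copied from ceph_default_alerts.yml.
    good = {
        "alert": "ExampleHealthError",
        "expr": "example_metric == 2",
        "annotations": {"summary": "Example one-liner"},
    }
    bad = {"alert": "example health error", "expr": "example_metric == 2"}
    print(check_rule(good))
    print(check_rule(bad))
```

The function takes an already-parsed rule so the sketch stays independent of any YAML library; in practice the dicts would come from loading the rules file.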
@epuertat Doh. Added, and squashed.
Thanks!
jenkins test make check