mgr/prometheus: Update rule format and enhance SNMP support #43783

Merged
merged 1 commit into from Nov 10, 2021

Contributor

@pcuzner pcuzner commented Nov 3, 2021

Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes a
summary description to provide a quick one-liner.
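
As an illustration of the layout described above, the sketch below checks one rule for a CamelCase-style alert name (leading capital, alphanumeric only) and for a `summary` annotation alongside the longer `description`. The rule `CephExampleAlert`, its expression and labels, and the helper `check_rule` are hypothetical; they are not taken from `ceph_default_alerts.yml` or `validate_rules.py`. Only the field names `alert`, `expr`, `for`, `labels`, and `annotations` come from the Prometheus alerting rule format.

```python
import re

# Hypothetical rule, written as the dict a YAML loader would produce.
# The name, expression, and label values are illustrative assumptions.
EXAMPLE_RULE = {
    "alert": "CephExampleAlert",
    "expr": "ceph_health_status == 2",
    "for": "5m",
    "labels": {"severity": "critical"},
    "annotations": {
        "summary": "Cluster is in an error state",
        "description": "Longer explanation of the condition and what to check.",
    },
}

# Assumed naming convention: leading capital, letters and digits only.
ALERT_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def check_rule(rule):
    """Return a list of problems found in a single alerting rule."""
    problems = []
    name = rule.get("alert", "")
    if not ALERT_NAME.match(name):
        problems.append(f"alert name {name!r} is not CamelCase")
    if "expr" not in rule:
        problems.append("missing expr")
    annotations = rule.get("annotations", {})
    for key in ("summary", "description"):
        if not annotations.get(key):
            problems.append(f"missing annotation: {key}")
    return problems


if __name__ == "__main__":
    print(check_rule(EXAMPLE_RULE))
    print(check_rule({"alert": "health error", "expr": "up == 0"}))
```

The second call passes a rule with a space in its name and no annotations, so it reports one naming problem and two missing annotations.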

In addition to the reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.

The MIB has also been refactored so that it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.

Fixes: https://tracker.ceph.com/issues/53111

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Contributor Author

pcuzner commented Nov 3, 2021

Here's the tox output

[paul@rhp1gen3 tests]$ tox
py3 installed: attrs==21.2.0,beautifulsoup4==4.10.0,bs4==0.0.1,iniconfig==1.1.1,packaging==21.0,pluggy==1.0.0,py==1.10.0,pyparsing==3.0.3,pytest==6.2.5,PyYAML==6.0,soupsieve==2.2.1,toml==0.10.2
py3 run-test-pre: PYTHONHASHSEED='3007607034'
py3 run-test: commands[0] | pytest -rA test_syntax.py test_unittests.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py3/.pytest_cache
rootdir: /home/paul/git/ceph/monitoring/prometheus/tests
collected 8 items                                                                                                                                                                                                 

test_syntax.py .....                                                                                                                                                                                        [ 62%]
test_unittests.py ...                                                                                                                                                                                       [100%]

===================================================================================================== PASSES ======================================================================================================
============================================================================================= short test summary info =============================================================================================
PASSED test_syntax.py::test_alerts_present
PASSED test_syntax.py::test_unittests_present
PASSED test_syntax.py::test_rules_format
PASSED test_syntax.py::test_unittests_format
PASSED test_syntax.py::test_rule_syntax
PASSED test_unittests.py::test_alerts_present
PASSED test_unittests.py::test_unittests_present
PASSED test_unittests.py::test_run_unittests
================================================================================================ 8 passed in 5.33s ================================================================================================
py3 run-test: commands[1] | ./validate_rules.py

Checking rule groups
        cluster health   : ..
        mon              : .....
        osd              : ................
        mds              : .......
        mgr              : ..
        pgs              : .........
        nodes            : .....
        pools            : ....
        healthchecks     : .
        cephadm          : ...
        PrometheusServer : .
        rados            : .

Summary

Rule file             : ../alerts/ceph_default_alerts.yml
Unit Test file        : test_alerts.yml

Rule groups processed :  12
Rules processed       :  56
SNMP OIDs declared    :  34 
Rule errors           :   0
Rule warnings         :   0
Rule name duplicates  :   0
Unit tests missing    :   0

No problems detected in the rule file

No problems detected in unit tests file

_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
  py3: commands succeeded
  congratulations :)

Member

@epuertat epuertat left a comment

LGTM! Just a small comment on the readability of the CamelCase alert names.

monitoring/prometheus/tests/validate_rules.py
Comment on lines +30 to +32
org cluster (alerts) source Category
1.3.6.1 .4 .1 .50495 .1 .2 .1 .2 (Ceph Health)

@liewegas I checked the IANA registry for the 50495 org and saw your newdream email address. Should we update that address to some generic ceph.io address?
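
For reference, the OID layout in the snippet quoted above can be split into named arcs as in this sketch. `1.3.6.1.4.1` is the standard prefix for IANA private enterprise numbers; the pairing of the remaining arcs with the labels org, cluster, alerts, source, and category is an assumption read off the quoted header line, and `split_oid` is a hypothetical helper that is not part of this PR.

```python
# Standard prefix for IANA private enterprise numbers
# (iso.org.dod.internet.private.enterprise).
ENTERPRISE_PREFIX = (1, 3, 6, 1, 4, 1)

# Labels from the quoted header line; mapping them to arc positions
# in this order is an assumption.
ARC_LABELS = ("org", "cluster", "alerts", "source", "category")


def split_oid(oid):
    """Split a dotted OID under the enterprise prefix into labelled arcs."""
    # Tolerate the spaced-out form used in the quoted snippet.
    compact = "".join(oid.split())
    arcs = tuple(int(part) for part in compact.strip(".").split("."))
    if arcs[: len(ENTERPRISE_PREFIX)] != ENTERPRISE_PREFIX:
        raise ValueError("OID is not under the private enterprise prefix")
    rest = arcs[len(ENTERPRISE_PREFIX):]
    if len(rest) < len(ARC_LABELS):
        raise ValueError("OID is too short for the assumed schema")
    named = dict(zip(ARC_LABELS, rest))
    # Anything after the category arc (e.g. an individual alert) is kept as-is.
    named["remainder"] = rest[len(ARC_LABELS):]
    return named


if __name__ == "__main__":
    print(split_oid("1.3.6.1 .4 .1 .50495 .1 .2 .1 .2"))
```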

Member

@epuertat epuertat left a comment

BTW @pcuzner according to this conversation, could you plz add the RECENT_CRASH alert too? Thanks!

@github-actions github-actions bot added this to In progress in Dashboard Nov 4, 2021
Contributor Author

pcuzner commented Nov 4, 2021

@epuertat Doh. Added, and squashed.

Member

epuertat commented Nov 5, 2021

> @epuertat Doh. Added, and squashed.

Thanks!

@sebastian-philipp
Contributor

jenkins test make check
