New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr: Zabbix monitoring module #16019

Merged
merged 3 commits into from Jul 7, 2017

Conversation

Projects
None yet
4 participants
@wido
Member

wido commented Jun 29, 2017

This ceph-mgr module will pull various values from the Ceph cluster and send them to a Zabbix Server using zabbix_sender.

This requires the zabbix_sender executable to be present on the system running ceph-mgr as it will be invoked to send data to Zabbix.

A Zabbix template can be found in this directory which can be used to easily get data from your Ceph cluster into Zabbix.

More information is available in the README file found in the module's directory.

Signed-off-by: Wido den Hollander wido@42on.com

@wido wido added the mgr label Jun 29, 2017

@wido wido requested review from tchaikov and jcsp Jun 29, 2017

self.log.debug('Sleeping for {0} seconds'.format(conf['interval']))
try:
self.q.get(timeout=conf['interval'])

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

The queue is just for the shutdown signal, right? I'd use a threading.Event here

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

Yes, it is. I was looking for a non-blocking way to break out of the while True loop. I'll look into threading.Event

super(Module, self).__init__(*args, **kwargs)
self.q = Queue()
def get_localized_config(self, key):

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

It looks like this method is already copy-pasted in two modules, lets move it up into mgr_module.py in this PR

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

Ok! I'll do that.

::
ceph config-key put mgr/zabbix/zabbix_host zabbix.localdomain
ceph config-key put mgr/zabbix/hostname ceph.eu-ams02.local

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

What's the motivation for asking the user to set the hostname as a string? Does zabbix get confused if we use the hostname of whichever mgr is active?

Maybe this should be called something other than hostname, if it's really something that's meant to be the same no matter which host is sending the updates?

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

It should be the same no matter what mgr sends it. It's the identifier of the Host configured in Zabbix.

Let's say you call the host ceph.ams02 in Zabbix, it wants the key/value pairs to be send with that identifier.

This comment has been minimized.

@jcsp

jcsp Jun 30, 2017

Contributor

OK, I figured it was something like this. It feels weird to be pretending like this, but zabbix would is not the only monitoring system that doesn't have a concept of clusters and wants everything too be a host, so I guess we just go with the flow and make up a hostname.

zabbix_port = self.get_localized_config('zabbix_port') or '10051'
config['zabbix_port'] = int(zabbix_port)
except (ValueError, TypeError):
raise RuntimeError(

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Where the config options need validation, it would be nice to implement commands that set them, so that you can validate them on the way in, rather than having the module fail later (not necessarily obvious to user).

In fairness to you, the recent port/server additions to restful+dashboard also have this problem :-/

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Oh, also, if you add setter commands you can make it reflect settings immediately instead of requiring a restart.

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

Hmm, ok. I don't know exactly what you mean yet. Is there an example out there?

if data['overall_status'] == key:
data['overall_status_int'] = value
break
except (KeyError, ValueError):

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Because this module is in-tree, it shouldn't need any of these KeyError/ValueError exception handlers -- if the format of the data structures changes, the module should be updated at the same time.

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

Ok!

proc.stdin.write('{0} ceph.{1} {2}\n'.format(hostname, key,
str(value)))
stdout, stderr = proc.communicate()

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Does zabbix-sender ever block, like if the network goes away or something? The module serve methods are each in their own thread, so that's usually okay, but if zabbix-sender was stuck during shutdown it would be awkward.

There probably isn't a neat way to make the subprocess killable, just thinking out loud.

This comment has been minimized.

@wido

wido Jun 30, 2017

Member

No, it doesn't. I sends the data over UDP as well, doesn't mean it can't block, but it will fail within a few seconds.

self.port,
hostname)
proc = subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE,

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Avoid using shell=True if possible -- passing in a [] of arguments avoids vulnerability to malicious argument substitution in the string.

raise RuntimeError('%s exited non-zero' % self.sender)
self.log.debug('Zabbix Sender: %s', stdout.rstrip())
except:

This comment has been minimized.

@jcsp

jcsp Jun 29, 2017

Contributor

Redundant except:raise

@jcsp

This comment has been minimized.

Contributor

jcsp commented Jun 29, 2017

Very cool! Comments added inline.

@jcsp

This comment has been minimized.

Contributor

jcsp commented Jun 29, 2017

Like any external-service-consuming mgr module, testing this is awkward -- I wonder if it would be worth it to have a sort of test mode for the module, so that it runs its get_data and spits it to the console or something like that. That way we would be able to pick up any places where data structures had changed in a way that upset the module, without requiring an external server to send data to.

@wido

This comment has been minimized.

Member

wido commented Jun 30, 2017

@jcsp What about a "dry-run" config option? Where it doesn't send the data to Zabbix, but it executes a different method which looks at the data dict and does a few tests on that dict?

If it encounters some KeyErrors or such it can throw a RuntimeException.

@jcsp

This comment has been minimized.

Contributor

jcsp commented Jun 30, 2017

I'm thinking maybe this is going to be a common enough case that we should define a self_test method as part of the mgr_module interface, that would be executed by a magic CLI command (magic in the sense that it wouldn't appear in ceph --help).

The self test method in this case would just call into get_data and let any KeyError/ValueErrors bubble up -- the caller would be a test script, so no need to try and prettify any of it.

@wido

This comment has been minimized.

Member

wido commented Jun 30, 2017

Ok! I'll look into that.

Two things I don't know:

  • Commands to update config keys? Is there an example somewhere? Can't find it in the other modules
  • get('health') sometimes gives back None
2017-06-30 17:02:33.024625 7f6e51a9a700 -1 mgr serve Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/zabbix/module.py", line 173, in serve
    data = self.get_data()
  File "/usr/lib/ceph/mgr/zabbix/module.py", line 96, in get_data
    health = json.loads(self.get('health')['json'])
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The reason is because self.get('health')['json'] is None. That's why I have to check for a ValueError, otherwise the module stops.

The first time the mgr starts 'health' doesn't seem to be available. That's the problem.

@wido

This comment has been minimized.

Member

wido commented Jun 30, 2017

@jcsp I pushed new code. Could you check again?

Think I've fixed most of your comments, except for the configuration. Didn't know how to fix that.

@liewegas

This comment has been minimized.

Member

liewegas commented Jun 30, 2017

Two configuration keys are mandatory for the module to work:
- mgr/zabbix/zabbix_host

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

you are repeating this section in doc/mgr/zabbix.rst, probably, we can consolidate them in doc/mgr/zabbix.rst instead?

for key, value in data.items():
proc.stdin.write('{0} ceph.{1} {2}\n'.format(hostname, key,
str(value)))

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

nit, by default str is used to represent the formatted vars. and wrong indent.

module.py:31:62: E127 continuation line over-indented for visual indent
proc.stdin.write('{0} ceph.{1} {2}\n'.format(hostname, key,
str(value)))
stdout, stderr = proc.communicate()

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

stderr is not captured, if we want to check it when proc.returncode != 0, you might need to pass stderr=PIPE to Popen, like

proc = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
raise RuntimeError('%s exited non-zero' % self.sender)
self.log.debug('Zabbix Sender: %s', stdout.rstrip())

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

nit, pep8

module.py:39:1: E302 expected 2 blank lines, found 1
def get_module_config(self):
config = dict()
config['zabbix_sender'] = self.get_localized_config('zabbix_sender') or '/usr/bin/zabbix_sender'

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

wrap at 79 chars if you please.

@@ -191,6 +191,17 @@ def get_config_prefix(self, key_prefix):
"""
return ceph_state.get_config_prefix(self._handle, key_prefix)
def get_localized_config(self, key):

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

how about adding a default param, like

def get_localized_config(self, key, default_config=None):
if data['overall_status'] not in self.ceph_health_mapping:
raise RuntimeError('No valid overall_status found in data')
return True

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

so if self_test() never returns False, why bother returning True?

This comment has been minimized.

@wido

wido Jul 3, 2017

Member

If it's not True you know the tests failed or at least caught an exception

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

@wido seems you are raising an exception if self_test() does not pass, that why i suggest just return nothing if everything looks good.

This comment has been minimized.

@wido

wido Jul 3, 2017

Member

Ah, yes. I get it. When no Exception is thrown all should be good

This comment has been minimized.

@tchaikov

tchaikov Jul 3, 2017

Contributor

exactly =)

@jcsp

This comment has been minimized.

Contributor

jcsp commented Jul 3, 2017

get('health') sometimes gives back None

This is getting fixed here: #16020

wido added some commits Jun 30, 2017

mgr: set/get_localized_config in MgrModule
get_localized_config was getting redundant as it was copied to various
modules.

This commit also introduces set_localized_config() which set a localized
configuration option for a module.

Signed-off-by: Wido den Hollander <wido@42on.com>
mgr: Implement self_test() method in MgrModule
Other modules are encouraged to override this method and implement
a as good as possible self-test which returns True or False

It should be a simple way of running (automated) testing on a ceph-mgr
module

Signed-off-by: Wido den Hollander <wido@42on.com>
@wido

This comment has been minimized.

Member

wido commented Jul 3, 2017

Ok, thanks @jcsp @liewegas and @tchaikov for the feedback! I think I've fixed all of it.

Global evaluation
-----------------
Your code has been rated at 9.17/10 (previous run: 9.11/10, +0.06)
root@alpha:~# ceph tell mgr zabbix config-set interval 5
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/zabbix/module.py", line 208, in handle_command
    self.set_config_option(key, value)
  File "/usr/lib/ceph/mgr/zabbix/module.py", line 104, in set_config_option
    raise RuntimeError('interval should be set to at least 10 seconds')
RuntimeError: interval should be set to at least 10 seconds

root@alpha:~# ceph tell mgr zabbix config-set zabbix_host 2001:db8::100
Configuration option zabbix_host updated
root@alpha:~# ceph tell mgr zabbix config-set interval 60
Configuration option interval updated
root@alpha:~# ceph tell mgr zabbix config-show
{"zabbix_port": 10051, "zabbix_host": "2001:db8::100", "identifier": "ceph.wido.laptop", "zabbix_sender": "/usr/bin/zabbix_sender", "interval": 60}
root@alpha:~#

Configuration is now dynamic and you can use commands to change it when needed.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 4, 2017

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-dup.sh:45: TEST_filestore_to_bluestore:  rados bench -p foo 10 write -b 4096 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph-builders_11736
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       465       449    1.7538   1.75391   0.0257855   0.0348628
    2      16      1146      1130   2.20646   2.66016   0.0139535   0.0282324
    3      16      1875      1859   2.42005   2.84766   0.0180086   0.0240097
    4      16      2468      2452   2.39407   2.31641   0.0192654   0.0252615
    5      16      3163      3147   2.45814   2.71484   0.0182556   0.0253671
    6      16      3829      3813   2.48194   2.60156   0.0212635   0.0251516
    7      16      4587      4571   2.55031   2.96094    0.014603   0.0244706
    8      16      5046      5030   2.45562   1.79297   0.0310385   0.0253719
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-dup.sh: line 25: 11736 Terminated              rados bench -p foo 10 write -b 4096 --no-cleanup
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-dup.sh:45: TEST_filestore_to_bluestore:  return 1
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-dup.sh:20: run:  return 1

retest this please.

@wido

This comment has been minimized.

Member

wido commented Jul 5, 2017

@tchaikov I would be very surprised if a Python module could case a BlueStore test failure. Or am I missing something here?

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 5, 2017

@wido so would i. i was pasting that error message just for the record, so we understand why we were running the test again, and so we can root cause the recurring "make check" failures with more evidences.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 5, 2017

i am trying to reproduce that failure locally.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 5, 2017

adding "needs-qa" to make sure the packaging works with this change.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 5, 2017

retest this please.

mgr: Zabbix monitoring module
This ceph-mgr module will pull various values from the Ceph cluster
and send them to a Zabbix Server using zabbix_sender.

This requires the zabbix_sender executable to be present on the system
running ceph-mgr as it will be invoked to send data to Zabbix.

A Zabbix template can be found in this directory which can be used
to easily get data from your Ceph cluster into Zabbix.

More information is available in the README file found in the module's
directory.

Signed-off-by: Wido den Hollander <wido@42on.com>

@liewegas liewegas merged commit 3a85938 into ceph:master Jul 7, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment