write_sensu plugin implementation #912

Merged
merged 3 commits into from Apr 8, 2015

fabricemarie

Plugin to send datapoints and notifications to the Sensu client's UDP socket. Inspired by the write_riemann plugin.

@jtyr
Contributor

jtyr commented Mar 2, 2015

Is there a way to test this plugin with collectd-5.1.0 installed from rpmforge (http://apt.sw.be/redhat/el6/en/x86_64/testing/RPMS/)?

@mfournier

@jtyr, yes, something close to this should do the trick:
gcc -DHAVE_CONFIG_H -Wall -Werror -g -O2 -shared -fPIC -I/usr/include/collectd/core/ -ldl -o write_sensu.so write_sensu.c

Keep in mind that write_sensu.so will have to be copied into /usr/lib64/collectd (or wherever the rpmforge RPM installs the plugins), i.e. no symlinks are allowed, for security reasons.

Please give us some feedback about this plugin once you've succeeded in getting it to work! Thanks!

@jtyr
Contributor

jtyr commented Mar 4, 2015

Thanks for the compilation command. I have tested it a bit and it works just fine. Only when I forgot to define handlers (MetricHandler or NotificationHandler) did it bomb on me with this segfault:

segfault at 0 ip 00007f5d581d5cd5 sp 00007f5d563cf810 error 4 in write_sensu.so[7f5d581d4000+5000]

I think it would be nice to catch this case and print a human-readable message, if that's possible to do from the plugin.
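
A fault at address 0, as in the log line above, is what a NULL pointer dereference looks like, which fits a handler string that was never set. A minimal sketch of a guard that turns this into a readable message, assuming a hypothetical helper `format_handlers` and plain `stderr` output in place of collectd's logging macros (neither is taken from write_sensu.c):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not from write_sensu.c: writes a JSON fragment such as
 * "handlers": ["default"] into buf.
 * Returns 0 on success and -1 on error. */
int format_handlers(char *buf, size_t buflen, const char *handler) {
  if (buf == NULL || buflen == 0)
    return -1;

  /* Without this check, formatting or measuring a NULL handler string
   * may crash the whole daemon. */
  if (handler == NULL) {
    fprintf(stderr, "sketch: no handler configured, "
                    "set MetricHandler or NotificationHandler\n");
    return -1;
  }

  int n = snprintf(buf, buflen, "\"handlers\": [\"%s\"]", handler);
  if (n < 0 || (size_t)n >= buflen)
    return -1; /* encoding error or output truncated */
  return 0;
}
```

The caller can then skip the datapoint or refuse the configuration instead of crashing.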

@jtyr
Contributor

jtyr commented Mar 4, 2015

I have one more question. When write_sensu is loaded, the collectd check starts showing in Sensu (Uchiwa). But when I stop the collectd service, the collectd check stays green. I would expect it to become red when write_sensu is no longer connected to the local sensu-client. Or am I doing something wrong? If that is how it works, I would prefer that the write_sensu plugin not be shown as the collectd check in Sensu.

@fabricemarie
Author

@jtyr Fixed the bug with empty MetricHandler and empty NotificationHandler. Rebased the branch as well. I've made MetricHandler mandatory and NotificationHandler optional. Could you please see if that works for you?

@fabricemarie
Author

@jtyr Yes, this is the notification drawback with that toolchain. If collectd stops, sensu-client will simply stop receiving and forwarding notifications to the Sensu server. However, the rest works fine. I use collectd with the threshold plugin to monitor pretty much everything. I've configured it to always send notifications for all monitored thresholds (OKs as well). This way everything shows green in Uchiwa and Flapjack except when the thresholds are reached. You can find more information here: http://www.kibinlabs.com/collectd-sensu-write-plugin/ and here: http://www.kibinlabs.com/collectd-thresholds/

@jtyr
Contributor

jtyr commented Mar 9, 2015

I can confirm that the module does not fail when there is no MetricHandler or NotificationHandler defined. Thanks for fixing it.

We use CollectD in a setup very similar to yours. We collect metrics with CollectD and store them in InfluxDB via the write_graphite plugin. Then we visualize the data in Grafana, either by reading the metrics directly from InfluxDB or via the Graphite API from InfluxDB. With the write_sensu plugin, we are able to use the threshold plugin to report problems to Sensu, which is pretty cool!

We also use Persist true and PersistOK true with every threshold to receive OK and FAILED notifications all the time, because otherwise CollectD would not send the OK notification when it's restarted, as in the following case:

Let's say we have a threshold which starts alerting and we find out that it needs to be changed. We set the threshold to a value that doesn't alert and restart collectd to load the new config. As it doesn't send an OK notification, the metric stays in the failed state in Sensu.

That could be fixed by adding a new option (e.g. OKAtStart true) which would send an OK message when CollectD starts. That would clear the failed state in Sensu and eliminate the need to send so many messages all the time. For example, with 1000 machines that each have 20 thresholds, 20k messages are sent every time CollectD collects metrics (e.g. every 10 seconds). I think that's a bit of overhead. If there were an OKAtStart option, we could set Persist and PersistOK to false and send messages only when something has changed.

What do you think about the OKAtStart option, @fabricemarie @mfournier?

@fabricemarie
Author

@jtyr Thanks for confirming. Nice setup too :) For the case you mention you need only change the hit variable. The write_sensu plugin only forwards whatever it receives (be it performance data or notifications). It does not keep track at all of what check is in what state. To add this would mean a significant re-write of the plugin. I believe the hit parameter of the threshold plugin is what you are looking for.

@jtyr
Contributor

jtyr commented Mar 9, 2015

I believe that the hit parameter doesn't help here. The problem is that the threshold plugin holds the state of each threshold in memory, and that state is lost on restart. This is why you get no OK notification for a previously FAILED metric if the threshold is in a "normal" state after the restart: the threshold plugin doesn't know that the threshold was in FAILED state before the restart. Let me illustrate this with an example:

Let's say we have a metric with a value of 82 and the following threshold applied to it before collectd restarts:

<Plugin "threshold">
  <Plugin "df">
     Instance "root"
     <Type "percent_bytes">
       Instance "used"
       WarningMax 70
       FailureMax 80
     </Type>
  </Plugin>
</Plugin>

When we change the threshold to the following (assuming the metric value did not change):

<Plugin "threshold">
  <Plugin "df">
     Instance "root"
     <Type "percent_bytes">
       Instance "used"
       WarningMax 90
       FailureMax 95
     </Type>
  </Plugin>
</Plugin>

and we restart collectd, the threshold doesn't alert anymore, but no notification was sent out, so the metric in Sensu stays in the failed state (unless you use Persist true and PersistOK true, which generates unnecessary traffic). But if we had the OKAtStart option, CollectD would send an OK notification after the restart, which would clear the failed state in Sensu.

@fabricemarie
Author

@jtyr I see what you mean now. But wouldn't starting with OK be a problem if the state was in WARNING state and persisted as such before restart? Like I said we'd have to re-factor the plugin significantly to query the table of states from collectd main if that's possible, no?

@jtyr
Contributor

jtyr commented Mar 9, 2015

The OKAtStart is meant to be specified in the threshold plugin, not in the write_sensu plugin.

As for the situation where the state of the metric did not change across the restart: if the metric is still in an alerting state, collectd would send the appropriate notification (WARNING, FAILED). Basically it would send the same kind of notification as if Persist true and PersistOK true were set, but only once, right after the restart.

@fabricemarie
Author

@jtyr sorry for the confusion, I understand now. Yup, that would work, and for all plugins too.

@jtyr
Contributor

jtyr commented Mar 9, 2015

Exactly. I think that the implementation should not be that difficult. I think it requires changing:

((th->flags & UT_FLAG_PERSIST_OK) == 0)

to:

((th->flags & UT_FLAG_PERSIST_OK) == 0 || ((th->flags & UT_FLAG_OK_AT_START) == 0 && first_run))

where the first_run indicates if the threshold plugin is called the first time (e.g. state_old is not set or something like that). Any idea how to get this, @mfournier?
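
The proposal can be sketched as a small decision function. This is a standalone illustration, not code from collectd's threshold plugin: `UT_FLAG_OK_AT_START`, `first_run`, the flag values and the `should_notify` helper are all hypothetical, and the sketch assumes that the remembered state falls back to OK after a restart, as described earlier in the thread. It is written in the "should a notification be sent" direction, so the OKAtStart test appears as an extra reason to send rather than as part of a suppress condition.

```c
#include <stdbool.h>

/* Illustrative flag values only; UT_FLAG_OK_AT_START is hypothetical. */
#define UT_FLAG_PERSIST 0x01
#define UT_FLAG_PERSIST_OK 0x02
#define UT_FLAG_OK_AT_START 0x04

enum state { STATE_OKAY, STATE_WARNING, STATE_ERROR };

/* Decides whether one evaluation of a threshold should emit a notification.
 * old_state is the remembered state (assumed to be STATE_OKAY right after a
 * restart), new_state is the freshly computed one, and first_run is true for
 * the first evaluation after startup. */
bool should_notify(unsigned flags, enum state old_state, enum state new_state,
                   bool first_run) {
  /* A state change is always reported. */
  if (new_state != old_state)
    return true;

  /* Proposed behaviour: report once after startup, even when unchanged. */
  if (first_run && (flags & UT_FLAG_OK_AT_START) != 0)
    return true;

  /* Unchanged state: report only when Persist is set ... */
  if ((flags & UT_FLAG_PERSIST) == 0)
    return false;

  /* ... and, for an OK state, only when PersistOK is set as well. */
  if (new_state == STATE_OKAY && (flags & UT_FLAG_PERSIST_OK) == 0)
    return false;

  return true;
}
```

With only the hypothetical OKAtStart flag set, an unchanged OK state is reported on the first run and stays silent afterwards, which is the behaviour asked for above.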

@mfournier

@jtyr from what I understand, you'd need the threshold plugin to unconditionally send "OK" values once at startup? This feels conceptually wrong to me.

Does it actually make this plugin unusable? Or is it one specific setup which has annoying side effects?

@jtyr
Contributor

jtyr commented Mar 11, 2015

@mfournier The write_sensu plugin is perfectly usable as it is. The only problem here is the threshold plugin, which doesn't send "OK" after CollectD is restarted for metrics which were in "WARNING" or "CRITICAL" state before the restart. It only works as expected if Persist and PersistOK are set to true, which creates unnecessary traffic (it sends a notification on every collection instead of only on a threshold change). OKAtStart would take care of this and send "OK" for thresholds which are in "OK" state after the restart. I just don't know how to detect the "after restart" state in the threshold plugin.

@jtyr
Contributor

jtyr commented Mar 13, 2015

@fabricemarie Could you please make sending metrics optional? For my use case I would like to send only notifications but no metrics. If you could add an option like Metrics with a default value of true, that would solve the problem.

…nt TCP socket. Inspired from write_riemann.
@fabricemarie
Author

@jtyr I've made the modifications you requested and rebased it at the same time. Could you please test it and see if it works for you?

@jtyr
Contributor

jtyr commented Mar 16, 2015

@fabricemarie thanks for implementing this. I have just tested it and it works as expected. I noticed only one problem: when I set Metrics to false but left MetricHandler configured and restarted CollectD, the metrics were still sent because the plugin enabled them internally, as indicated by this message in the log:

write_sensu plugin: MetricHandler given so forcing metrics to be enabled

I believe that the MetricHandler should be ignored if Metrics is set to false (the same should apply for notifications). I also think that the plugin should not give up when only Notifications is enabled but no NotificationHandler is specified. This is actually how I would like to use it: send only notifications and let them display in the Uchiwa web UI without any further action done in a handler.

Once more, thank you very much for improving this great plugin. I believe that it will be a very useful tool for plenty of users.

@fabricemarie
Author

@jtyr the current implementation enables neither metrics nor notifications by default. Instead it waits for you to enable one or both of them, and aborts if you did not enable any. Similarly, if you don't want metrics then don't set any metric handler. If you don't want notifications then don't set notification handlers. Setting them means you want them, no? I thought it was logical.
At least it does a bit of checking to avoid bad surprises... :)
Finally, I believe that sensu-client itself applies the default handler if you haven't specified any, so I made it explicit: if you want notifications without any particular handler then set the notification handler to default.
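
The rules described in this exchange can be sketched as a small configuration check. This is an illustration under assumptions, not the plugin's code: `struct sensu_conf` and `sensu_conf_resolve` are hypothetical names, and treating "metrics enabled without a MetricHandler" as an error is an assumption made by symmetry with the notification case reported above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of the plugin options discussed here. */
struct sensu_conf {
  bool metrics;                     /* Metrics true/false */
  bool notifications;               /* Notifications true/false */
  const char *metric_handler;       /* NULL when MetricHandler is unset */
  const char *notification_handler; /* NULL when NotificationHandler is unset */
};

/* Applies the rules from the thread. Returns 0 when the configuration is
 * usable and -1 when the plugin should refuse to start. */
int sensu_conf_resolve(struct sensu_conf *c) {
  if (c == NULL)
    return -1;

  /* A configured handler implies that the matching data is wanted. */
  if (c->metric_handler != NULL && !c->metrics) {
    fprintf(stderr, "MetricHandler given so forcing metrics to be enabled\n");
    c->metrics = true;
  }
  if (c->notification_handler != NULL && !c->notifications) {
    fprintf(stderr, "NotificationHandler given so forcing notifications "
                    "to be enabled\n");
    c->notifications = true;
  }

  /* Nothing to send at all: abort. */
  if (!c->metrics && !c->notifications) {
    fprintf(stderr, "neither metrics nor notifications are enabled\n");
    return -1;
  }

  /* Enabled data needs an explicit handler, "default" included. */
  if (c->notifications && c->notification_handler == NULL) {
    fprintf(stderr, "notifications enabled but no NotificationHandler set\n");
    return -1;
  }
  if (c->metrics && c->metric_handler == NULL) { /* assumption, see above */
    fprintf(stderr, "metrics enabled but no MetricHandler set\n");
    return -1;
  }
  return 0;
}
```

A notifications-only setup then amounts to leaving the metric handler unset and setting the notification handler to `"default"`.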

@jtyr
Contributor

jtyr commented Mar 16, 2015

@fabricemarie This is exactly how I do it now :o) Let's leave it as it is. I'm very happy with how it works now. I hope it will make it into the 5.5 release! Pity that there is still the issue with the threshold plugin not resetting the state after collectd is restarted. I hope @mfournier will help me to resolve this issue :o)

@fabricemarie
Author

@jtyr Thanks for all the patience, testing and feedback :)

char *val = NULL;

if (child->values_num != 2) {
  WARNING("sensu attributes need both a key and a value.");

I ran the build with clang's "scan-build" static analysis tool, and I got this warning:
Potential leak of memory pointed to by 'sensu_tags_arr.strs'

Not sure it's worth caring in this case, what do you think ?

@fabricemarie
Author

@mfournier I originally ran collectd with write_sensu plugin with valgrind with debug symbols enabled and didn't see any memory leaks back then. Let me have a look again when I get a minute.

@mfournier

@fabricemarie, thanks a lot for this new plugin. I'm not too familiar with sensu, but I trust you and @jtyr the implementation makes sense from sensu's point of view :)

One question I have regarding this: no "collectd_host" field is set in the json submitted to sensu. What if someone decides to proxy several collectd nodes (using the network or amqp plugin) behind one central server which submits the values to sensu for all of them ?

One other detail I noticed is that no persistent connection is held open with the sensu daemon. It makes the whole connection handling part much simpler, but I'm a bit worried it might lead to performance issue when collectd submits a ton of data to sensu (which would typically happen in the case I describe above). NB: this is just a remark, I wouldn't consider this as a blocker. If needed, this can be implemented later.

The code looks good from my point of view, but it would be awesome if one of @octo, @tokkee or @pyr could also take a quick look !
Out of curiosity, what is the rationale behind vendoring asprintf.c ?

All in all, +1 from me, especially since collectd integration with other tools is something I care a lot about :-)

mfournier added this to the 5.5 milestone on Mar 17, 2015
@fabricemarie
Author

@mfournier I couldn't find the Potential leak of memory pointed to by 'sensu_tags_arr.strs' that you mentioned. I checked manually and with valgrind, to no avail.

I've fixed the two indentation issues you mentioned, as well as the uninitialized statuses.

I didn't make the plugin send a "collectd_host" key:value pair since I don't think that sensu-client will accept it. I have not tried though. I would expect sensu-client to set its own (overwriting the one you send via write_sensu) but it might be worth a shot. It might be easier to try manually rather than change the plugin :)

For the lack of a persistent connection, I have tested more. When testing manually I noticed that sensu-client accepts only one message per connection. So naturally, I wrote the first version of the write_sensu plugin to send to the sensu-client's localhost UDP socket, as there is no need for a handshake. This somewhat worked, but randomly lost a lot of metrics. Eventually, I was forced to make the plugin the way it is now: connect, send a metric or a notification, disconnect. It is far from ideal, but it works without losing any metric or notification. I haven't found a way to send multiple metrics/notifications to sensu-client over one connection, and frankly nothing is documented on the Sensu side when it comes to the sensu-client socket.
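
The connect, send, disconnect cycle described above can be sketched with plain POSIX sockets. This is a minimal illustration, not the plugin's implementation: `sensu_send_once` is a hypothetical name, error reporting is reduced to a return code, and there is no timeout or SIGPIPE handling.

```c
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: opens a TCP connection to host:port, writes exactly
 * one message and closes the connection again.
 * Returns 0 on success and -1 on any failure. */
int sensu_send_once(const char *host, const char *port, const char *msg) {
  struct addrinfo hints;
  struct addrinfo *res = NULL;
  int fd = -1;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
  hints.ai_socktype = SOCK_STREAM; /* TCP */

  if (getaddrinfo(host, port, &hints, &res) != 0)
    return -1;

  /* Try each resolved address until one accepts the connection. */
  for (struct addrinfo *ai = res; ai != NULL; ai = ai->ai_next) {
    fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    if (fd < 0)
      continue;
    if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
      break;
    close(fd);
    fd = -1;
  }
  freeaddrinfo(res);
  if (fd < 0)
    return -1;

  /* send() may write less than requested, so loop until all is written. */
  size_t left = strlen(msg);
  int status = 0;
  while (left > 0) {
    ssize_t n = send(fd, msg, left, 0);
    if (n <= 0) {
      status = -1;
      break;
    }
    msg += n;
    left -= (size_t)n;
  }

  close(fd); /* one message per connection */
  return status;
}
```

A call might look like `sensu_send_once("127.0.0.1", "3030", "{\"name\": \"example\", \"output\": \"ok\", \"status\": 0}")`; the address, port and JSON fields here are assumptions about a typical sensu-client socket setup and should be checked against the Sensu documentation.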

The rationale behind vendoring asprintf.c is that I was under the impression that asprintf isn't available on all platforms on which collectd compiles.
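
For context, a substitute for asprintf can be quite small. The sketch below is not the vendored asprintf.c from this PR; `fallback_asprintf` is a hypothetical name chosen to avoid clashing with a libc that already provides asprintf, and it assumes a C99-conforming vsnprintf that returns the required length when given a NULL buffer of size 0.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical asprintf substitute: formats into a freshly allocated string.
 * On success stores the string in *out and returns its length.
 * On failure stores NULL in *out and returns -1.
 * The caller owns the result and must free() it. */
int fallback_asprintf(char **out, const char *fmt, ...) {
  va_list ap;
  va_list ap2;

  va_start(ap, fmt);
  va_copy(ap2, ap); /* the argument list is consumed twice */

  /* First pass: measure the required length (C99 behaviour). */
  int len = vsnprintf(NULL, 0, fmt, ap);
  va_end(ap);
  if (len < 0) {
    va_end(ap2);
    *out = NULL;
    return -1;
  }

  char *buf = malloc((size_t)len + 1);
  if (buf == NULL) {
    va_end(ap2);
    *out = NULL;
    return -1;
  }

  /* Second pass: format into the buffer of exactly the right size. */
  vsnprintf(buf, (size_t)len + 1, fmt, ap2);
  va_end(ap2);

  *out = buf;
  return len;
}
```

On platforms whose vsnprintf predates C99 the measuring pass may not work, which is one reason a project might prefer to carry a tested implementation instead.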

Hope this helps. Obviously, feel free to pull and merge, and later modify as you see fit :)

@@ -448,6 +448,9 @@ Features
- write_riemann
Sends data to Riemann, a stream processing and monitoring system.

- write_sensu
Sends data to Sensu a stream processing and monitoring system, via sensu client local TCP socket.
Contributor

Sends data to Sensu, a stream processing and monitoring system, via the Sensu client local TCP socket.

mfournier merged commit c19a8c2 into collectd:master on Apr 8, 2015
@mfournier

Thanks for the detailed explanations @fabricemarie !

I just merged the PR to master, so this plugin will be part of collectd 5.5. Thanks for the great work !

@mfournier

@fabricemarie when attempting to build Debian packages, which compile with -Werror=format-security by default, I get a lot of errors such as:

write_sensu.c: In function 'add_str_to_list':
write_sensu.c:127:3: error: format not a string literal and no format arguments [-Werror=format-security]
   ERROR(alloc_err);
   ^

The simple way around this seems to be to remove *alloc_err on line 119 and duplicate the error message in each ERROR() statement. What do you think ?
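
The warning fires because a non-literal string is used as the format argument of a printf-style call. Besides duplicating the literal in each call, a common alternative is to pass the message through a literal `"%s"` format. The sketch below illustrates that idiom only: `log_error` is a hypothetical stand-in for collectd's ERROR() macro, the message text is invented, and `__attribute__((format))` is a GCC/Clang extension.

```c
#include <stdarg.h>
#include <stdio.h>

/* Holds the most recent message so the effect can be inspected. */
static char last_log[256];

/* Hypothetical printf-style logger standing in for ERROR(). The attribute
 * lets the compiler check format strings at every call site. */
static void log_error(const char *fmt, ...)
    __attribute__((format(printf, 1, 2)));

static void log_error(const char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(last_log, sizeof(last_log), fmt, ap);
  va_end(ap);
}

void report_alloc_failure(void) {
  /* Invented message; the '%' would be misread if used as a format. */
  const char *alloc_err = "sketch: out of memory, pool 100% used";

  /* log_error(alloc_err);  <- the pattern -Werror=format-security rejects */
  log_error("%s", alloc_err); /* message is data, never a format string */
}
```

Either approach keeps the format string a literal, which is what the hardened build flags require.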

I'd really like to fix this to avoid putting the burden on debian/ubuntu package maintainers. It would be a shame to have write_sensu not built by default in future releases of these distros because of this.

@mfournier

#1001 implements the fix I mentioned above. It allows the build to pass when building "hardened" debian packages (which is the case for collectd).

@mfournier

@fabricemarie FYI I just added 7834021, which allows the plugin to build on a freebsd10 default install.

@afrozl

afrozl commented Mar 15, 2016

Is it possible to use this plugin to send data to sensu over udp instead of tcp?

@dago
Contributor

dago commented Jul 18, 2016

In fact the older Solaris implementations are missing asprintf, so the vendor substitute would be good to have.
