processes plugin: Add support for Linux Delay Accounting. #2598

octo · 2017-12-06T21:30:11Z

Linux Delay Accounting reports the time a task was delayed by:

- waiting for a CPU (while being runnable),
- completion of synchronous block I/O initiated by the task,
- swapping in pages,
- memory reclaim.

This patch adds four metrics per configured process, one for each of the bullet points. Metrics are reported in percent rather than, for example, nanoseconds per second.
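
The step from a raw counter to a percentage can be sketched as follows. This is a minimal illustration, not code from the patch. It assumes the delay is available as a cumulative nanosecond counter that is sampled twice; `delay_percent` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch. Takes two samples of a
 * cumulative delay counter (in nanoseconds) that were read interval_s
 * seconds apart and returns the share of wall clock time spent waiting,
 * in percent. A negative return value signals "no valid sample". */
static double delay_percent(uint64_t prev_ns, uint64_t curr_ns,
                            double interval_s) {
  assert(interval_s > 0.0); /* the caller must pass a positive interval */

  if (curr_ns < prev_ns)
    return -1.0; /* the counter went backwards, e.g. the task restarted */

  double delayed_s = (double)(curr_ns - prev_ns) / 1e9;
  return 100.0 * delayed_s / interval_s;
}
```

For example, 250 ms of accumulated delay over a one second interval comes out as 25 percent.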

dothebart · 2017-12-07T10:34:50Z

It seems the STRERROR macro isn't used properly; the build errors out on CentOS 7. Also, I don't see where STRERRNO is defined?

[edit: won't compile on Debian stretch for the same reasons.]

OK, scratch that: it won't compile if the environment is 5.8.0, but it works flawlessly in master.

octo · 2017-12-07T19:49:10Z

Yeah, the STRERROR and STRERRNO macros are relatively new, see #2519.

This allows us to print helpful error messages to the user if something goes wrong.

rpv-tomsk · 2017-12-07T21:17:40Z

src/processes.c

+              "for the \"CollectDelayAccounting\" option.");
+#endif
+    } else {
+      ERROR("processes plugin: Option `%s' not allowed heeere.", c->key);


Please fix a typo here ;-)

We are watching for changes! ;-)

octo replied: Good eye, thanks!

rpv-tomsk · 2017-12-08T06:06:53Z

src/collectd.conf.pod

+Delay Accounting provides the time processes wait for the CPU to become
+available, for I/O operations to finish, for pages to be swapped in and for
+freed pages to be reclaimed. The metrics are reported as a percentage, e.g.
+C<percent-delay-cpu>. Disabled by default.


Hi!
IMHO the documented type instance does not match the implemented one: there is no 'delay' word in it.

Can you please re-check this?

octo replied: Thanks for noticing! This is a regression introduced in 17b81d4, fixed in 023790e.

rpv-tomsk · 2017-12-08T06:25:17Z

That is an amazing feature!

It is much, much more useful than my attempt to add metrics for process/thread states (similar to the 'ps_state' metrics reported for the system-wide process list, but only for selected processes and their threads).
With my changes, most metrics reported the 'sleeping' state for threads and processes, so I found my change less useful than I expected when I started the work.

Example of a chart, related to a MySQL process: [chart image not included]

rpv-tomsk · 2017-12-08T06:45:11Z

Also, I want to note that on my own systems I will report these metrics with a 'ps_delay' type, not 'percent'.

Plugins may report different sets of metrics whose unit is percent but which relate to different datasets ("delays", "cpu usage", other ratios). I dislike 'percent' as a type in such cases.

octo · 2017-12-08T07:47:34Z

> That is an amazing feature!

Agreed. What blows my mind is that <linux/taskstats.h> has a copyright notice from 2006.

Please feel free to merge when you're happy – I don't have any pending changes on this.

Best regards,
—octo

rpv-tomsk · 2017-12-08T08:06:13Z

cc @tokkee

Hi!

I have a minor note for you, related to this change.
To use this update in my Debian package, I also updated debian/rules in the following way:

--- a/debian/rules
+++ b/debian/rules
@@ -363,6 +362,7 @@ binary-arch: build install-arch
                -dDepends debian/collectd-core/usr/lib/collectd/rrdtool.so
        dpkg-shlibdeps -Tdebian/collectd-core.substvars \
                -dDepends debian/collectd-core/usr/sbin/* \
+               -dDepends debian/collectd-core/usr/lib/collectd/processes.so \
                -dSuggests debian/collectd-core/usr/lib/collectd/*.so
        grep shlibs:Suggests debian/collectd-core.substvars \
                | sed -e 's/shlibs:Suggests/shlibs:Recommends/' \

I think the processes plugin is used in 'all' setups, so we need to depend on libmnl rather than just suggest/recommend it.

rpv-tomsk · 2017-12-08T08:09:35Z

> Please feel free to merge when you're happy – I don't have any pending changes on this.

OK, if we leave the reported type as percent, then I have no other remarks.
Thanks for this cool feature!

octo · 2017-12-08T08:09:53Z

@rpv-tomsk Regarding the percent type: it has one serious shortcoming: its maximum value of (a little over) 100. A process with five threads can be blocked for five seconds every second, i.e. 500%.

There are two ways to fix this:

- Remove the upper bound from the percent type.
- Introduce a new type.

I suggest introducing delay_rate and then reporting the metric as "seconds per second". What do you think?

P.S.: A third option would be to re-use another existing type, but there are no good choices. delay exists and is used by the ntpd plugin for round trip times in milliseconds, i.e. something completely different.
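
The 500% case can be reproduced with a small sketch. It only illustrates the arithmetic and is not code from the patch. It assumes per-thread delay deltas in nanoseconds for one interval; `process_delay_rate` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds up the delay that each thread of a process
 * accumulated during one interval and expresses the sum as seconds of
 * delay per second of wall clock time. */
static double process_delay_rate(const uint64_t *thread_delay_ns,
                                 size_t num_threads, double interval_s) {
  assert(interval_s > 0.0);
  assert(thread_delay_ns != NULL || num_threads == 0);

  uint64_t total_ns = 0;
  for (size_t i = 0; i < num_threads; i++)
    total_ns += thread_delay_ns[i];

  /* No upper bound: N threads that are blocked all the time yield N. */
  return ((double)total_ns / 1e9) / interval_s;
}
```

Five threads that are each blocked for the whole of a one second interval give a rate of 5 seconds per second, which is the 500% that the bounded percent type cannot hold.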

rpv-tomsk · 2017-12-08T08:30:47Z

> A process with five threads can be blocked for five seconds every second, i.e. 500%.

Of course, I like the delay_rate proposal. But another thought about this: I think it would be better if the values were normalised. Maybe we can use the cpu_run_real_total or cpu_run_virtual_total fields for this?

octo · 2017-12-08T08:39:00Z

> Maybe we can use cpu_run_real_total or cpu_run_virtual_total fields for this?

But those fields report entirely different metrics …?

rpv-tomsk · 2017-12-08T08:52:19Z

> But those fields report entirely different metrics …?

I'm unsure which one we should use.

My thought was the following:

Suppose the process woke up and ran for 50 units of time before it went to sleep again. In the real world 100 units of time passed, so the process used 50% of the CPU.

During the time it was awake, it might have used all 50 units on the CPU, or it might have been delayed for a while.

For example, it might have spent 5 units of time waiting for I/O. Then we would report 10% as 'io delay'.

Without such normalisation we would report only 5% as 'io delay'.

I hoped that we could get the "50 units" value from the cpu_run_real_total field.
Am I missing something?

I think both variants are acceptable, with the delay time normalized to the process's CPU usage or not...
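
The two variants differ only in the denominator, which a short sketch makes explicit. The function names are hypothetical. The active time is passed in as a plain number, because whether cpu_run_real_total actually provides that value is the open question in this comment.

```c
#include <assert.h>

/* Hypothetical helper: delay as a share of the elapsed wall clock time
 * (the non-normalised variant). */
static double delay_share_of_wall_clock(double delay, double wall_clock) {
  assert(wall_clock > 0.0);
  return delay / wall_clock;
}

/* Hypothetical helper: delay as a share of the time the process was
 * awake (the normalised variant). */
static double delay_share_of_active_time(double delay, double active) {
  assert(active > 0.0);
  return delay / active;
}
```

With the numbers from the example, 5 units of I/O wait are 5% of the 100 units of wall clock time, but 10% of the 50 units the process was awake.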

rpv-tomsk · 2017-12-08T08:53:55Z

Yeah, I also see 140% as an I/O wait value. :)

Also, I missed that fact even in the chart I posted as an example. It is present there, too.

rpv-tomsk · 2017-12-08T09:09:47Z

Each of these variants has its own advantages:
I think values in CPU usage units (non-normalized) are better for charts, and normalized values are somewhat better for thresholds. As a final decision, I think non-normalized is enough. ;-)

octo · 2017-12-08T16:03:24Z

Agreed, let's go with the wall clock time for now. We can fiddle with the metrics some more in the future.

octo added the Feature label Dec 6, 2017

octo added 2 commits December 6, 2017 22:31

src/utils_taskstats.[ch]: Add library for Linux Delay Accounting.

c7c01bf

processes plugin: Implement the "CollectDelayAccounting" option.

4ea7a57

octo force-pushed the ff/delayacct branch from fd7ad47 to 4ea7a57 December 6, 2017 21:31

octo added 3 commits December 7, 2017 21:23

processes plugin: Check for the CAP_NET_ADMIN capability.

22912eb

This allows us to print helpful error messages to the user if something goes wrong.

contrib/systemd.collectd.service: Add the processes plugin.

7c1d95f

processes plugin: Make delay metric reporting less repetitive.

17b81d4

rpv-tomsk reviewed Dec 7, 2017

processes plugin: Fix error message.

8289f37

rpv-tomsk reviewed Dec 8, 2017

processes plugin: Add the "delay-" prefix to type instances.

023790e

This fixes a regression introduced in 17b81d4.

processes plugin: Use the new "delay_rate" type for Delay Accounting.

38709ee

octo merged commit d3bd9c3 into collectd:master Dec 8, 2017

octo deleted the ff/delayacct branch September 26, 2018 06:09
