Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

processes plugin: Add support for Linux Delay Accounting. #2598

Merged
merged 8 commits into from Dec 8, 2017

Conversation

octo
Copy link
Member

@octo octo commented Dec 6, 2017

Linux Delay Accounting reports the time a task was delayed by

  • waiting for a CPU (while being runnable),
  • completion of synchronous block I/O initiated by the task,
  • swapping in pages,
  • memory reclaim.

This patch adds four metrics per configured process, one for each of the bullet points. Metrics are reported in percent rather than, for example, nanoseconds per second.

@octo octo added the Feature label Dec 6, 2017
@dothebart
Copy link
Contributor

dothebart commented Dec 7, 2017

it seems the STRERROR macro isn't used properly, will error out on centos7; and I don't see where STRERRNO is defined?

[edit] won't compile on a debian stretch for the same reasons.]

ok, scratch that, won't compile if the environment is 5.8.0; works flawlessly in master.

@octo
Copy link
Member Author

octo commented Dec 7, 2017

Yeah, the STRERROR and STRERRNO macros are relatively new, see #2519.

src/processes.c Outdated
"for the \"CollectDelayAccounting\" option.");
#endif
} else {
ERROR("processes plugin: Option `%s' not allowed heeere.", c->key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix a typo here ;-)

We are watch for changes! ;-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye, thanks!

Delay Accounting provides the time processes wait for the CPU to become
available, for I/O operations to finish, for pages to be swapped in and for
freed pages to be reclaimed. The metrics are reported as a percentage, e.g.
C<percent-delay-cpu>. Disabled by default.
Copy link
Contributor

@rpv-tomsk rpv-tomsk Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!
IMHO documented type instance does not match to implemented - there will be no 'delay' word.

Can you please re-check this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing! This is a regression introduced in 17b81d4, fixed in 023790e.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing! This is a regression introduced in 17b81d4, fixed in 023790e.

@rpv-tomsk
Copy link
Contributor

rpv-tomsk commented Dec 8, 2017

That is amazing feature!

That is much-much more useful than mine try to add metrics of processes/threads states (similar to 'ps_state' metrics reported for a system-wide process list, but for selected processes only and their threads).
For my changes, most of metrics reported 'sleeping' state for threads and processes, so I find my change not so useful as I expected when work was started.

Example of a chart, related to MySQL process:
dacct

@rpv-tomsk
Copy link
Contributor

Also, want to notice - for myself, in my systems I will report these metrics with a 'ps_delay' type, not a 'percent'.

Plugins may report different sets of metrics which units are percents, but related to different datasets ("delays", "cpu usage", other ratios). I dislike 'percent' as a type in a such cases.

@octo
Copy link
Member Author

octo commented Dec 8, 2017

That is amazing feature!

Agreed. What blows my mind is that <linux/taskstats.h> has a copyright notice from 2006.

Please feel free to merge when you're happy – I don't have any pending changes on this.

Best regards,
—octo

@rpv-tomsk
Copy link
Contributor

cc @tokkee

Hi!

I have a minor note for you, related to this change.
To use this update in my Debian package, I also updated debian/rules in the following way:

--- a/debian/rules
+++ b/debian/rules
@@ -363,6 +362,7 @@ binary-arch: build install-arch
                -dDepends debian/collectd-core/usr/lib/collectd/rrdtool.so
        dpkg-shlibdeps -Tdebian/collectd-core.substvars \
                -dDepends debian/collectd-core/usr/sbin/* \
+               -dDepends debian/collectd-core/usr/lib/collectd/processes.so \
                -dSuggests debian/collectd-core/usr/lib/collectd/*.so
        grep shlibs:Suggests debian/collectd-core.substvars \
                | sed -e 's/shlibs:Suggests/shlibs:Recommends/' \

I think processes plugin is used in 'all' setups, so we need to depend on libmnl rather than just suggest/recommend it.

@rpv-tomsk
Copy link
Contributor

Please feel free to merge when you're happy – I don't have any pending changes on this.

Ok, if we leave reported type as a percent, then I have no other remarks.
Thanks for this cool feature!

@octo
Copy link
Member Author

octo commented Dec 8, 2017

@rpv-tomsk Regarding the percent type: it has one serious shortcoming: its maximum value of (a little over) 100. A process with five threads can be blocked for five seconds every second, i.e. 500%.

There are two ways to fix this:

  • Remove the upper bound from the percent type.
  • Introduce a new type.

I suggest to introduce delay_rate and then report the metric as "seconds per second". What do you think?

P.S.: a third option would be to re-use another existing type, but there are no good choices. delay exists and is used by the ntpd plugin for round trip times in milliseconds, i.e. something complete different.

@rpv-tomsk
Copy link
Contributor

A process with five threads can be blocked for five seconds every second, i.e. 500%.

Of course, I like delay_rate proposal. But another thought about this - I think it will be better if values will be normalised. Maybe we can use cpu_run_real_total or cpu_run_virtual_total fields for this?

@octo
Copy link
Member Author

octo commented Dec 8, 2017

Maybe we can use cpu_run_real_total or cpu_run_virtual_total fields for this?

But those fields report entirely different metrics …?

@rpv-tomsk
Copy link
Contributor

rpv-tomsk commented Dec 8, 2017

But those fields report entirely different metrics …?

I'm unsure which one we should use.

My thought was the following:

suppose the process awakened and it takes 50 units of time before it went to sleep again. In real world that was 100 units of time, so the process uses 50% of CPU.

During that awakened time it might to use 50 units of CPU, or it might to be delayed for a while.

For example, it can spent 5 units of time for IO waiting. So, then we report 10% as 'io delay'.

Without a such normalisation we would report only 5% as 'io delay'.

I hoped what we can get "50 units" value from cpu_run_real_total field.
I'm missing something?


I think both variants might be accepted, with delay time normalized to process CPU usage or not...

@rpv-tomsk
Copy link
Contributor

rpv-tomsk commented Dec 8, 2017

Yeah, I also see 140% as IO wait value . )

Also, I missed that fact even at a chart I posted as example. It presents there too.

@rpv-tomsk
Copy link
Contributor

Each of these variangs has own advantages:
I think what values in CPU usage units (non-normalized) will be better to use at charts, and normalized is somewhat better for thresholds. As a final decision, I think non-normalized is enough. ;-)

@octo
Copy link
Member Author

octo commented Dec 8, 2017

Agreed, let's go with the wall clock time for now. We can fiddle with the metrics some more in the future.

@octo octo merged commit d3bd9c3 into collectd:master Dec 8, 2017
@octo octo deleted the ff/delayacct branch September 26, 2018 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants