[collectd 6] disk plugin: Align metrics with OpenTelemetry recommendations. #4217
In general code changes look OK, but:
- I think it would be better to handle "disk: add /proc/diskstats fields 15 to 20 in KERNEL_LINUX" (#4242) in `main`, and merge the related `main` changes to `disk.c` before this PR.
- Reporting (some) time units as seconds (`s`) is IMHO not good while collectd counters are integers, i.e. do not support subsecond timings. This means one cannot see very small usage spikes from the counters, i.e. whether the disk is completely idle or not.
- Time manipulation for values appended to `fam_ops_time` is confusing, as they can be multiplied or divided by different values, and the variable names do not indicate their time units (like they IMHO should).
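To illustrate the two points about time units, here is a minimal sketch. It assumes the source values are in milliseconds, as the Linux diskstats time fields are. The function names are hypothetical and are not taken from `disk.c`.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of collectd. */

/* Converting an integer millisecond counter to whole seconds truncates:
 * any activity shorter than one second disappears. */
static uint64_t ms_to_s_truncating(uint64_t time_ms) {
  return time_ms / 1000;
}

/* Scaling to a finer unit keeps full integer precision. The unit suffix
 * in each name makes the direction of the conversion obvious. */
static uint64_t ms_to_us(uint64_t time_ms) {
  return time_ms * 1000;
}
```

With these, 999 ms of disk activity becomes 0 when reported as integer seconds, while the microsecond value keeps all of it.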
Ok, fair enough.
Thanks, it's much more readable now! (I.e. only remaining item is odd |
Value wrap-around calculations in `disk.c` are broken, see: `grep UINT_MAX disk.c`
They subtract a 64-bit (signed) `derive_t` value from the 32-bit (unsigned) `UINT_MAX` value.
Maybe collectd should have a `DERIVE_T_MAX` define for this (both in v5 & v6 branches)?
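For reference, a wrap-aware difference needs to know the real maximum of the underlying counter. The sketch below is a hypothetical helper, not collectd's API. It takes that maximum as an explicit parameter so the same code covers 32-bit and 64-bit counters.

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not collectd's implementation.
 * Returns the number of increments between two samples of a counter
 * that wraps to zero after reaching `max`. */
static uint64_t wrap_diff(uint64_t prev, uint64_t curr, uint64_t max) {
  if (curr >= prev)
    return curr - prev;
  /* The counter wrapped: steps from prev up to max, one step from max
   * to zero, then curr steps from zero. This cannot overflow because
   * curr < prev here. */
  return (max - prev) + curr + 1;
}
```

Passing `UINT32_MAX` for a counter that is really 64 bits wide (or the other way round) gives a wrong difference after a wrap, which is the kind of mismatch described above.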
Approved. I'll file separate bug about the UINT_MAX vs derive_t issue, as it's already in the v5 version of disk plugin, and not introduced by these changes.
That's an excellent callout, thank you! We actually have a function that is doing this right, After a lot of fiddling I figured out that we calculated the average time per operation … just for that variable never to be read. I think the existing metrics are much better at conveying useful information about the disk, so I removed the calculation and made the plugin a good deal simpler. We also unnecessarily stored a bunch of stuff in global state, which I have now removed. |
Gah, read Linux diskstats documentation, and counters use native unsigned long (see the new bug). I.e. they are indeed 64-bit, but only on 64-bit host. For 32-bit hosts, UINT_MAX was actually the correct maximum value. |
Nice. The very best changes are ones that remove lots of code and still manage to retain the (relevant) functionality. :-) On a (very) quick check of the changes, they seemed OK, i.e. from my side this still seems fine for merging. |
As to adding tests for I think that could be done for the Linux functionality by extracting that from (CPU plugin test coverage could be similarly extended to parsing of |
I think we need to separate I/O and parsing. Maybe a schema along the lines of:

```c
int parse(char const *data, metric_family_t families[]);

int read_callback(void) {
  char *buffer = read_text_file_contents();
  parse(buffer);
}
```

This will allow us to test
We should also split plugins into multiple files, one for each operating system we support. E.g. |
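To make the idea concrete, here is a minimal sketch of a parser that works on an in-memory string, so a unit test can feed it canned input without touching `/proc`. The `disk_sample_t` type and the `parse_diskstats_line()` function are hypothetical names, not collectd code. The sketch only reads the first few fields of a Linux `/proc/diskstats` line (major, minor, device name, reads completed, reads merged, sectors read, read time in ms).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical types and names for illustration; not collectd's API. */
typedef struct {
  char name[32];
  uint64_t read_ops;     /* reads completed */
  uint64_t read_time_ms; /* time spent reading, in milliseconds */
} disk_sample_t;

/* Parses one diskstats-style line from memory. No file I/O happens
 * here, so tests can pass in string literals.
 * Returns 0 on success, -1 if the line does not match. */
static int parse_diskstats_line(char const *line, disk_sample_t *out) {
  unsigned int major, minor;
  uint64_t reads_merged, sectors_read;

  int n = sscanf(line,
                 "%u %u %31s %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
                 &major, &minor, out->name, &out->read_ops, &reads_merged,
                 &sectors_read, &out->read_time_ms);
  return (n == 7) ? 0 : -1;
}
```

A read callback would then only be responsible for getting the file contents and handing each line to the parser.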
* Rename labels to `system.device` and `disk.io.direction`.
* Rename `system.disk.time` to `system.disk.operation_time`.
* Add descriptions and units to all metric families.
* Add the "utilization" metric to FreeBSD.

Also denote the time scale in the variable names.
The fields do not mean what we thought they meant. "wtime" means "wait time", "rtime" means "run time". Fixes: collectd#3875 (for collectd 6)
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
…`, and `system.disk.pending_operations` for Solaris.
* Use `counter_diff` to calculate counter differences.
* Use `counter_t` as we actually want the counter overflow behavior for these metrics.
* Remove the `has_...` fields from the global data struct.
* Use `value_to_rate()` to calculate disk busyness.
* Use `strtoull()` to parse counter values.
* Rename labels to `system.device` and `disk.io.direction`.
* Rename `system.disk.time` to `system.disk.operation_time`.
.ChangeLog: Disk plugin: The metric schema has been aligned with OpenTelemetry semantic conventions.